Exploring Renewable Energy Datasets with Python and Pandas

We look at how data analytics techniques using Pandas 2.x and graphing tools such as matplotlib can be used to help us understand the growth of renewable energies in the UK and world-wide.

16-08-2023



No one can have missed the turmoil in the world's weather over the summer of 2023. We have had extreme heat waves across the Mediterranean; wildfires as far apart as Canada, Greece and Hawaii; and some of the wettest summer months in the UK. The list goes on and on.

The development of renewable energy that reduces the harm to the world’s atmosphere is more important than ever. 

Let's take a look at how data analytics techniques using Pandas 2.x and graphing tools such as matplotlib can be used to help us understand the growth of renewable energies in the UK and world-wide.

Data Science

Data Science is an interdisciplinary field that uses techniques from a variety of different disciplines and subjects to extract knowledge from data in various forms. The intention is to gain new understanding from that data, either to help decision making or to otherwise benefit an organisation, for example by generating additional value.

A simple example might be to note that when a supermarket store card member suddenly starts buying nappies, they might well also start to buy ready meals etc. Therefore, it might be useful to put adverts for ready meals in the baby aisle.

The following diagram aims to provide an overview of how data science can be considered to add value.

[Diagram: how data science adds value, with insight into data on the left and development of data products on the right]


    On the left-hand side of the diagram we can see that Data Science can be used to provide insight into the data, for example to further the understanding of any latent or hidden patterns within the data which might be of use to the organisation. This might, for example, relate to customer or individual patterns of behaviour that might not be obvious from the raw data. The information (or knowledge) obtained from this might then be used to influence decisions made by the organisation such as what promotions to push or what process or procedural changes to make.

    On the right-hand side of the diagram, we have the ‘Development of Data Product’ which relates to the creation of products (typically software products such as data classifiers or predictors). These tools might be developed using existing data and then used to analyse future data. An example might be a machine learning classifier-based system that is trained to identify fraudulent bank loan applications. Once it has been trained it can be used to analyse new loan applications to help identify future fraud.



    The Dataset

    We are using a data set that is freely available from kaggle.com. Kaggle is a subsidiary of Google that provides an online community for data scientists and machine learning engineers. Kaggle offers many different datasets that can be downloaded and used for personal research, to help analysts learn how to use data analytics techniques and to share their findings and techniques with others. It also runs regular competitions to solve data science challenges.

    The datasets being used in this blog can be accessed from:

    https://www.kaggle.com/datasets/belayethossainds/renewable-energy-world-wide-19652022

    This dataset represents data relating to renewable energy world-wide from 1965 to 2022. It includes information on global hydropower, wind, solar, biofuel and geothermal renewable energy generation and usage.

    There are actually 17 different CSV files in this dataset, and these are listed below:

    1. renewable-share-energy
    2. modern-renewable-energy-consumption
    3. modern-renewable-prod
    4. share-electricity-renewables
    5. hydropower-consumption
    6. hydro-share-energy
    7. share-electricity-hydro
    8. wind-generation
    9. cumulative-installed-wind-energy-capacity-gigawatts
    10. wind-share-energy
    11. share-electricity-wind
    12. solar-energy-consumption
    13. installed-solar-PV-capacity
    14. solar-share-energy
    15. share-electricity-solar
    16. biofuel-production
    17. installed-geothermal-capacity

    We will not be using all these files in this blog, but will focus on a few of the files and the data they hold. We encourage you to follow along and experiment with the data yourself.
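If you have downloaded and unpacked the dataset, a quick way to see which CSV files you have is to list them from Python. The following is a minimal sketch using only the standard library; the folder name "renewable-energy-data" is a hypothetical example, so substitute wherever you saved the files.

```python
from pathlib import Path


def list_csv_files(data_dir):
    """Return the names of the CSV files in data_dir, sorted alphabetically."""
    return sorted(p.name for p in Path(data_dir).glob("*.csv"))


if __name__ == "__main__":
    # "renewable-energy-data" is a hypothetical folder name; use the
    # location where you unpacked the Kaggle download.
    for name in list_csv_files("renewable-energy-data"):
        print(name)
```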



    Getting Started

    Python is a widely used programming language that can be applied to a wide variety of domains and problems. It can be used to create games such as Space Invaders, write DevOps-style scripts, build web services and, of course, carry out data analytics tasks. It is not the only programming language or toolset that can be used for data analytics; R, for example, is another popular choice. However, the advantage of Python is that it has a huge ecosystem of third-party libraries that extend the basic capabilities of the language for whatever task you need.

    We will be using a few third-party data science-oriented libraries in this blog. In particular we will be using the Pandas library that is widely used in industry. We will be using the Pandas 2.x library which is a major revision of the original Pandas implementation that improves, amongst other things, performance.

    The Python runtime environment was originally designed a long time ago, and as such one of the aspects that is still a little old fashioned is how it deals with additional third party libraries and how they are installed. Originally any library was installed into the central Python installation which meant that every Python program running on that computer shared the same library. That is fine if you will always use the same libraries (and the same versions of a library). However, in many cases one program might use one library and another program might use a different version of that library or some library it depends upon. This is therefore not the best approach to adopt.

    In modern Python projects we install third-party libraries into a virtual environment. A virtual environment can be used for a specific program or shared between related programs. There are two common ways of doing this: a virtual environment managed with pip, or a conda environment. It is best to pick one approach for a given project, as pip and conda use their own library repositories and their own dependency handling mechanisms, which can conflict with each other.

    The pip tool (a recursive acronym for "Pip Installs Packages") comes as part of most Python installations and is therefore very easy to use. Conda is a separate third-party tool that must be installed onto your operating system before it can be used.

    In either case a virtual environment can be created for your data analytics project and third-party libraries can be installed in that.

    You can check the version of pip or conda you are using via the following commands:

    pip --version

    or

    conda info

    Whichever approach you use, you can then install the libraries we will be using.

    There is a range of commonly used data science libraries, including:

    • NumPy which provides sophisticated facilities for handling numbers.
    • SciPy which stands for Scientific Python.
    • Pandas is a data manipulation and analysis suite of modules. It provides facilities for reading data from a wide range of different formats, for managing and handling missing data, for reshaping data etc. It uses NumPy and can also use SciPy if required.
    • Matplotlib is a sophisticated graphing library.
    • Seaborn is a statistical data visualisation library that builds on Matplotlib. It provides a higher level interface for creating and presenting statistical graphics.

    We will be using Pandas (which will also install NumPy for us), Matplotlib and Seaborn.

    Using pip to install third party modules

    We can create a pip virtual environment using the command python -m venv <name>, for example:

    python -m venv proj_venv

    This will create a virtual environment called proj_venv.

    We can now activate that environment. This is done using a rather awkward incantation which runs the activate script within the virtual environment directory’s bin directory. On a Mac or Linux this is done using the source command followed by the <venv_name>/bin/activate path, for example:

    source proj_venv/bin/activate

    Once the environment has been activated, its name is shown in round brackets before the prompt.

    On a Windows machine the script is a bat file held in the virtual environment’s Scripts directory, and you can activate the environment by running activate.bat directly, for example:

    > proj_venv\Scripts\activate.bat
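If you are unsure whether activation worked, you can also check from inside Python. The sketch below relies on the fact that, inside an environment created with python -m venv, sys.prefix points at the environment while sys.base_prefix still points at the base installation. The function name is just an illustrative choice, and this check is not intended for conda environments.

```python
import sys


def in_virtual_environment():
    """Return True when the interpreter is running inside a venv."""
    # Inside a venv, sys.prefix is the environment directory while
    # sys.base_prefix remains the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("Running in a virtual environment:", in_virtual_environment())
    print("Interpreter prefix:", sys.prefix)
```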

    We can also deactivate the virtual environment using the deactivate command which on Windows is deactivate.bat, for example on Linux / Macs:

    deactivate

    Or on Windows

    > deactivate.bat

    To install these libraries using pip from the command line we can use:

    pip install pandas
    pip install matplotlib
    pip install seaborn

    This will install into a pip virtual environment the latest versions of each of these libraries. If you need to install a specific version, then you can specify this as shown below:

    pip install pandas==1.5.2

    The above would install pandas version 1.5.2

    If you wish to create a file that can be used to create the same set of modules in another virtual environment then you can freeze the virtual environment configuration and save it to a file, for example:

    pip freeze > johnsenv.txt

    You can then reload this configuration into another virtual environment at a later date, for example:

    pip install -r johnsenv.txt




    Using conda to install third party modules

    You can create a conda virtual environment using the conda create command, for example:

    conda create --name projectname python=3.11

    This will create a new conda environment called projectname using python 3.11.

    The virtual environment can be activated using

    conda activate projectname

    And deactivated using

    conda deactivate

    You can also use conda to install the required modules, for example using

    conda install pandas
    conda install matplotlib
    conda install seaborn

    This will install these modules into a conda virtual environment.

    It is then possible to export these dependencies so that they can be used to configure another conda environment, for example:

    conda list --export > proj-dependencies.txt

    This writes the list of third-party modules used in this environment to the file proj-dependencies.txt.

    These dependencies could be used to create another virtual environment using:

    conda create --name new-proj-env --file proj-dependencies.txt

    We now have the libraries required to allow us to start to process our data.
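Before moving on, it can be worth confirming which versions actually ended up in your environment. The following is a small sketch using the standard library's importlib.metadata; the helper name installed_versions is an illustrative choice, not part of any of the libraries discussed here.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(package_names):
    """Map each package name to its installed version, or None if missing."""
    versions = {}
    for name in package_names:
        try:
            versions[name] = version(name)
        except PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    wanted = ["pandas", "numpy", "matplotlib", "seaborn"]
    for name, ver in installed_versions(wanted).items():
        print(name, ver if ver is not None else "not installed")
```

If pandas is reported with a version starting with 2, you are running the Pandas 2.x series used in this blog.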



    Exploring the Modern Renewable Energy Production data

    One of the data files in the Kaggle dataset is the 03 modern-renewable-prod.csv file.

    This data file contains world-wide information on the generation of energy from wind, hydro, solar and other renewables including bioenergy.

    We can load this data into a Python program and examine the information held.

    To do that we will load the CSV file into a Pandas DataFrame, as shown below:

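    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns

    # Apply Seaborn's default styling to any Matplotlib plots
    sns.set()

    # Load the CSV file (assumed to be in the current working directory)
    df = pd.read_csv('03 modern-renewable-prod.csv')

    # Display the last 10 rows without truncating any columns
    print(df.tail(10).to_string())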

    The line print(df.tail(10).to_string()) generates a print out of the last 10 rows in the data frame. When this program is run the output generated is:

            Entity Code  Year  Electricity from wind (TWh)  Electricity from hydro (TWh)  Electricity from solar (TWh)  Other renewables including bioenergy (TWh)
    8841  Zimbabwe  ZWE  2012                          0.0                          5.34                          0.00                                        0.34
    8842  Zimbabwe  ZWE  2013                          0.0                          4.95                          0.00                                        0.41
    8843  Zimbabwe  ZWE  2014                          0.0                          5.38                          0.00                                        0.41
    8844  Zimbabwe  ZWE  2015                          0.0                          4.94                          0.01                                        0.42
    8845  Zimbabwe  ZWE  2016                          0.0                          2.95                          0.01                                        0.36
    8846  Zimbabwe  ZWE  2017                          0.0                          3.97                          0.01                                        0.32
    8847  Zimbabwe  ZWE  2018                          0.0                          5.05                          0.02                                        0.39
    8848  Zimbabwe  ZWE  2019                          0.0                          4.17                          0.03                                        0.38
    8849  Zimbabwe  ZWE  2020                          0.0                          3.81                          0.03                                        0.35
    8850  Zimbabwe  ZWE  2021                          0.0                          4.00                          0.04                                        0.38

    Pandas Series and DataFrames

    The key concepts within Pandas are the Series and the DataFrame. A Series is a one-dimensional, array-like object that holds data of a specific type (such as an integer, a string or a float). Each value is accessible via an index value, which can be an integer such as 0 or 1, or a label such as a string or a timestamp. The following diagram illustrates two such Series: one is indexed numerically from zero and holds a sequence of floats, the second is indexed by a set of timestamps and holds a sequence of integers.

    [Diagram: two Series, one indexed numerically from zero holding floats, one indexed by timestamps holding integers]

    A DataFrame is a tabular structure a bit like a spreadsheet. It has columns and rows. Each column has a type and a label. The columns are represented by Series. Each row has an index. The index could be numeric or some other value such as a sequence of dates. A DataFrame can be visualised as shown below:

    [Diagram: a DataFrame with labelled, typed columns and an index for each row]


    We will use DataFrames to load, process and analyse the related data sets.
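To make the two concepts concrete, here is a minimal sketch that builds one Series of each kind described above and then a small DataFrame. The hydro values are copied from the output shown earlier; the timestamped readings are made-up numbers used purely for illustration.

```python
import pandas as pd

# A Series indexed numerically from zero, holding floats
# (values taken from the Hydro column shown earlier).
hydro = pd.Series([5.34, 4.95, 5.38])

# A Series indexed by timestamps, holding integers (made-up values).
readings = pd.Series(
    [10, 12, 9],
    index=pd.to_datetime(["2023-01-01", "2023-01-02", "2023-01-03"]),
)

# A DataFrame: each column is a Series sharing the same row index.
frame = pd.DataFrame({"Year": [2012, 2013, 2014], "Hydro": hydro})

print(hydro.dtype, readings.dtype)
print(frame)
print(type(frame["Hydro"]))
```

Selecting a single column from the DataFrame, as in frame["Hydro"], gives you back a Series.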

    Exploring the Data

    The first thing we are going to do is change some of the information in the DataFrame. Whilst the headings in the CSV file are very descriptive, they are rather long; they take up a lot of space on the console and would clutter the labels in any graphs.

    We can change the heading of a DataFrame column using the rename() method indicating that we want to rename a set of columns.

    df.rename(columns={'Electricity from wind (TWh)': 'Wind',
                       'Electricity from hydro (TWh)': 'Hydro',
                       'Electricity from solar (TWh)': 'Solar',
                       'Other renewables including bioenergy (TWh)': 'Other'}, inplace=True)
    
    print(df.tail(10).to_string())

    If we run this code we will see:

            Entity Code  Year  Wind  Hydro  Solar  Other
    8841  Zimbabwe  ZWE  2012   0.0   5.34   0.00   0.34
    8842  Zimbabwe  ZWE  2013   0.0   4.95   0.00   0.41
    8843  Zimbabwe  ZWE  2014   0.0   5.38   0.00   0.41
    8844  Zimbabwe  ZWE  2015   0.0   4.94   0.01   0.42
    8845  Zimbabwe  ZWE  2016   0.0   2.95   0.01   0.36
    8846  Zimbabwe  ZWE  2017   0.0   3.97   0.01   0.32
    8847  Zimbabwe  ZWE  2018   0.0   5.05   0.02   0.39
    8848  Zimbabwe  ZWE  2019   0.0   4.17   0.03   0.38
    8849  Zimbabwe  ZWE  2020   0.0   3.81   0.03   0.35
    8850  Zimbabwe  ZWE  2021   0.0   4.00   0.04   0.38 

    This is a lot easier to read.
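The rename() call above uses inplace=True, which modifies the existing DataFrame. As an alternative sketch, you can leave out inplace and assign the result to a new variable, which keeps the original headings available. The tiny DataFrame here is a stand-in built from two of the original headings, not the full dataset.

```python
import pandas as pd

# A tiny stand-in for the real DataFrame, using two of the original headings.
sample = pd.DataFrame({
    "Year": [2020, 2021],
    "Electricity from wind (TWh)": [0.0, 0.0],
    "Electricity from hydro (TWh)": [3.81, 4.00],
})

short_names = {
    "Electricity from wind (TWh)": "Wind",
    "Electricity from hydro (TWh)": "Hydro",
}

# Without inplace=True, rename() returns a new DataFrame and leaves the
# original untouched.
renamed = sample.rename(columns=short_names)

print(list(sample.columns))
print(list(renamed.columns))
```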

    So far we have been printing out the last 10 rows of the DataFrame, so how many rows are there in total? We can easily find that out using the built-in len() function, for example:

    print('Row count is:', len(df))

    In this case when this is run we can see:

    Row count is: 8851

    So there are 8851 rows in the DataFrame.

    Interestingly, this does not mean that there are 8851 items of data for each column in the DataFrame. We can check this using the count() method, which counts the number of non-NA cells for each column (or row); the default is to count per column. For example, print(df.count()) produces:

    Entity    8851
    Code      7296
    Year      8851
    Wind      8676
    Hydro     8840
    Solar     8683
    Other     8631
    dtype: int64

    This shows that the Entity column has a value in every single row in the DataFrame, whereas the Code column has many missing values, as only 7296 rows have a value in it.
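A natural follow-up is to ask where the gaps are. The sketch below shows one way to count missing values directly and to list the entities that have no code. The rows are made up for illustration (the entity names and code are hypothetical), so run the same calls on the real DataFrame to see which entities are affected.

```python
import pandas as pd

# Made-up rows for illustration: an entity without a code, plus a row
# with a missing Wind value.
sample = pd.DataFrame({
    "Entity": ["Region A", "Region A", "Country B"],
    "Code": [None, None, "BBB"],
    "Year": [2020, 2021, 2021],
    "Wind": [1.5, 2.0, None],
})

# count() reports non-missing cells per column; isna().sum() reports the gaps.
print(sample.count())
print(sample.isna().sum())

# Which entities have no code at all?
no_code = sample.loc[sample["Code"].isna(), "Entity"].unique()
print(no_code)
```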

    It might now be interesting to see the data for the UK held in the dataset. To do this we can select only those rows with a code representing the UK. In this dataset the Code for the United Kingdom is GBR. We can use the DataFrame loc attribute to select the set of rows that match some condition, in this case the condition that the ‘Code’ column value is GBR:

    print('Select all the rows for GBR')
    
    df_gbr = df.loc[df['Code'] == 'GBR']
    
    print(df_gbr.to_string())

    When we run this we obtain a set of rows containing data for 1965 to 2021:

                  Entity Code  Year    Wind   Hydro     Solar  Other
    8261  United Kingdom  GBR  1965   0.000  4.6120   0.00000   0.00
    8262  United Kingdom  GBR  1966   0.000  4.5360   0.00000   0.00
    8263  United Kingdom  GBR  1967   0.000  4.8850   0.00000   0.00
    8264  United Kingdom  GBR  1968   0.000  3.7220   0.00000   0.00
    8265  United Kingdom  GBR  1969   0.000  3.2560   0.00000   0.00
    …
    8315  United Kingdom  GBR  2019  64.330  5.9300  12.92000  37.30
    8316  United Kingdom  GBR  2020  73.640  6.6800  13.32000  38.10
    8317  United Kingdom  GBR  2021  65.020  5.5900  12.47000  39.12

    Some of these rows have been omitted above for brevity.
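The same loc approach extends to more than one condition. As a sketch, the example below combines two conditions with & and uses isin() to match several codes at once. The rows are copied from the outputs shown in this blog rather than loaded from the CSV file.

```python
import pandas as pd

# Rows copied from the outputs shown in this post.
sample = pd.DataFrame({
    "Entity": ["United Kingdom"] * 3 + ["Zimbabwe"],
    "Code": ["GBR", "GBR", "GBR", "ZWE"],
    "Year": [2019, 2020, 2021, 2021],
    "Wind": [64.33, 73.64, 65.02, 0.0],
})

# Each condition goes in its own brackets and they are combined with &.
recent_gbr = sample.loc[(sample["Code"] == "GBR") & (sample["Year"] >= 2020)]
print(recent_gbr)

# isin() selects several codes at once.
both = sample.loc[sample["Code"].isin(["GBR", "ZWE"]) & (sample["Year"] == 2021)]
print(both)
```

The brackets around each condition matter, because & binds more tightly than == and >= in Python.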

    We can now graph this data so that we can see for example, the amount of Hydro power generated between these years.

    To do this we can use Matplotlib via the Pandas API. A DataFrame has a plot() method that creates a plot which can then be configured via the Matplotlib interface (plt), so that tick marks are rotated and the plot is scaled, and then displayed using show(). For example:

    print('Graph hydro power generation against year')
    
    df_gbr.plot(x="Year", y="Hydro", kind='bar', legend=False)
    
    plt.xticks(rotation=90)
    
    plt.autoscale()
    
    plt.show()

    The resulting graph is shown below. Here we can see that the general trend was for increased hydro power generation up until 2020, with a bit of a drop off in 2021.

    [Bar chart: UK hydro power generation (TWh) by year]
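If you are running a script on a machine without a display, or simply want to keep the chart, you can save it to a file instead of calling show(). The following is a sketch of one way to do that, using a line chart for variety. The three data points are copied from the UK output above, and the file name uk_hydro.png is a hypothetical choice.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt
import pandas as pd

# UK hydro figures (TWh) copied from the output shown earlier.
sample = pd.DataFrame({
    "Year": [2019, 2020, 2021],
    "Hydro": [5.93, 6.68, 5.59],
})

# plot() returns a Matplotlib Axes object that can be configured further.
ax = sample.plot(x="Year", y="Hydro", kind="line", marker="o", legend=False)
ax.set_ylabel("Hydro generation (TWh)")
ax.set_title("UK hydro generation")

# "uk_hydro.png" is a hypothetical file name.
ax.figure.savefig("uk_hydro.png")
plt.close(ax.figure)
```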


    If you want to plot two columns side by side, for example to compare hydro and solar generation for the last ten years for the UK, you can pass both columns to the plot. We can do this by adding both “Hydro” and “Solar” into a list for the y axis and selecting only the last 10 rows in the df_gbr DataFrame using the tail(10) method. For example:

    print('Graph two columns')
    
    df_gbr.tail(10).plot(x="Year", y=["Hydro", "Solar"], kind='bar', legend=True)
    
    plt.xticks(rotation=90)
    
    plt.autoscale()
    
    plt.show()


    The graph generated by this code is:

    [Bar chart: UK hydro and solar generation (TWh) for the last ten years]


    This shows quite clearly that solar power generation has overtaken hydro power generation by a significant amount over the last ten years.
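We can put a number on that comparison by dividing one column by the other. The sketch below uses only the three most recent UK rows shown earlier in this blog; the column name "Solar/Hydro" is an illustrative choice.

```python
import pandas as pd

# UK figures (TWh) copied from the output shown earlier.
sample = pd.DataFrame({
    "Year": [2019, 2020, 2021],
    "Hydro": [5.93, 6.68, 5.59],
    "Solar": [12.92, 13.32, 12.47],
})

# Column arithmetic is element-wise, so this gives one ratio per year.
sample["Solar/Hydro"] = (sample["Solar"] / sample["Hydro"]).round(2)
print(sample.to_string(index=False))
```

For these three years solar generation comes out at roughly double the hydro figure.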

    We can compare all the types of renewable energy production data available for the UK for the last ten years by adding all the columns into the y parameter. This shows that wind has become by far the largest generator of renewable power in the UK:

    df_gbr.tail(10).plot(x="Year", y=["Wind", "Hydro", "Solar", "Other"], kind='bar', legend=True)
    
    plt.xticks(rotation=90)
    
    plt.autoscale()
    
    plt.show()


    When this code is run we see the following graph:

    [Bar chart: UK wind, hydro, solar and other renewable generation (TWh) for the last ten years]
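To back up the chart with figures, we can total the four sources for each year and work out wind's share of that total. This is a sketch using only the three most recent UK rows shown earlier; the column names "Total" and "Wind %" are illustrative choices.

```python
import pandas as pd

# UK figures (TWh) copied from the output shown earlier.
sample = pd.DataFrame({
    "Year": [2019, 2020, 2021],
    "Wind": [64.33, 73.64, 65.02],
    "Hydro": [5.93, 6.68, 5.59],
    "Solar": [12.92, 13.32, 12.47],
    "Other": [37.30, 38.10, 39.12],
})

sources = ["Wind", "Hydro", "Solar", "Other"]

# axis=1 sums across the columns of each row, giving a total per year.
sample["Total"] = sample[sources].sum(axis=1)

# Wind as a percentage of that total.
sample["Wind %"] = (100 * sample["Wind"] / sample["Total"]).round(1)
print(sample.to_string(index=False))
```

On these rows wind accounts for a little over half of the renewable electricity recorded for the UK in each year.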


    Summary

    Using Python and Pandas along with Matplotlib we are able to create powerful and effective data analytics applications very quickly. This blog has illustrated, for one dataset, how we can explore and review that data to see how the trends in renewable energy generation have changed over the last 10 years and more.

