Exploring Renewable Energy Datasets with Python and Pandas
We look at how data analytics techniques using Pandas 2.x and graphing tools such as matplotlib can be used to help us understand the growth of renewable energies in the UK and world-wide.
16-08-2023
No one can have missed the turmoil in the world's weather over the summer of 2023: extreme heat waves across the Mediterranean, wildfires as far apart as Canada, Greece, and Hawaii, and some of the wettest summer months in the UK. The list goes on and on.
The development of renewable energy sources that reduce the harm to the world's atmosphere is more important than ever.
Let's take a look at how data analytics techniques using Pandas 2.x and graphing tools such as matplotlib can be used to help us understand the growth of renewable energies in the UK and world-wide.
Data Science
Data Science is an interdisciplinary field that uses techniques from a variety of different disciplines and subjects to extract knowledge from data in various forms. The intention is to gain new understanding from that data, either to support decision making or to otherwise benefit an organisation, for example by generating additional value.
A simple example might be to note that when a supermarket store card member suddenly starts buying nappies, they might well also start to buy ready meals. Therefore, it might be useful to place adverts for ready meals in the baby aisle.
The following diagram aims to provide an overview of how data science can be considered to add value.
On the left-hand side of the diagram we can see that Data Science can be used to provide insight into the data, for example to further the understanding of any latent or hidden patterns within the data which might be of use to the organisation. This might, for example, relate to customer or individual patterns of behaviour that might not be obvious from the raw data. The information (or knowledge) obtained from this might then be used to influence decisions made by the organisation such as what promotions to push or what process or procedural changes to make.
On the right-hand side of the diagram, we have the ‘Development of Data Product’ which relates to the creation of products (typically software products such as data classifiers or predictors). These tools might be developed using existing data and then used to analyse future data. An example might be a machine learning classifier-based system that is trained to identify fraudulent bank loan applications. Once it has been trained it can be used to analyse new loan applications to help identify future fraud.
The Dataset
We are using a dataset that is freely available from kaggle.com. Kaggle is a subsidiary of Google that provides an online community for data scientists and machine learning engineers. Kaggle offers many different datasets that can be downloaded and used for personal research, helping analysts learn data analytics techniques and share their findings with others. It also runs regular competitions to solve data science challenges.
The datasets being used in this blog can be accessed from:
This dataset contains data relating to renewable energy world-wide from 1965 to 2022. It includes information on global hydropower, wind, solar, biofuel and geothermal renewable energy generation and usage.
There are actually 17 different CSV files in this dataset, and these are listed below:
We will not be using all of these files in this blog, but will focus on a few of the files and the data they hold. We encourage you to follow along and experiment with the data yourself.
Getting Started
Python is a widely used programming language that can be applied to a wide variety of different domains and problems. It can be used to create games such as Space Invaders, handle DevOps-style scripts, create web services and, of course, perform data analytics tasks. It is not the only programming language or toolset that can be used for data analytics; R, for example, is another popular choice. However, the advantage of Python is that it has a huge ecosystem of third-party libraries that extend the basic capabilities of the language for whatever task you need.
We will be using a few third-party data science-oriented libraries in this blog. In particular we will be using the Pandas library that is widely used in industry. We will be using the Pandas 2.x library which is a major revision of the original Pandas implementation that improves, amongst other things, performance.
The Python runtime environment was originally designed a long time ago, and one aspect of it that is still a little old fashioned is how it deals with the installation of additional third-party libraries. Originally, any library was installed into the central Python installation, which meant that every Python program running on that computer shared the same libraries. That is fine if you always use the same libraries (and the same versions of those libraries). However, in many cases one program might need one version of a library while another program needs a different version of that library, or of some library it depends upon. Sharing a central installation is therefore not the best approach to adopt.
In modern Python projects we handle the installation of third-party libraries using a virtual environment. A virtual environment can be dedicated to a specific program or shared between related programs. There are two common ways of doing this, using either pip virtual environments or conda virtual environments. These two approaches are not compatible, and one or the other should be used (as they use their own library repositories with their own dependency handling mechanisms, which can conflict with each other).
The pip tool (whose name is a recursive acronym for 'pip installs packages') comes as part of the Python environment and is therefore very easy to use. Conda is a separate third-party tool that must be installed onto your operating system before it can be used.
In either case a virtual environment can be created for your data analytics project and the third-party libraries installed into it.
You can check the version of pip or conda you are using via the following commands:
pip --version
or
conda info
Whichever approach you use, you can then install the libraries we will be using.
There are a range of commonly used data science libraries; in the remainder of this blog we will consider:
NumPy, which provides sophisticated facilities for numerical computing, including efficient multi-dimensional arrays.
Pandas is a data manipulation and analysis suite of modules. It provides facilities for reading data from a wide range of different formats, for managing and handling missing data, for reshaping data and so on. It is built on NumPy and can also use SciPy if required.
Matplotlib is a sophisticated graphing and plotting library.
Seaborn is a statistical data visualisation library that builds on Matplotlib. It provides a higher-level interface for creating and presenting statistical graphics.
We will be using Pandas (which will also install NumPy for us), Matplotlib and Seaborn.
Using pip to install third party modules
We can create a pip virtual environment using the command python -m venv <name>, for example:
python -m venv proj_venv
This will create a virtual environment called proj_venv. We can now activate that environment. This is done using the rather awkward incantation that runs the activate script within the virtual environment directory's bin directory. On a Mac or Linux this is done using the source command followed by the <venv_name>/bin/activate path, for example:
source proj_venv/bin/activate
Note that on a Mac you can see which virtual environment has been activated, as its name is shown in round brackets before the prompt.
On a Windows machine the activate script is a bat file held in the virtual environment's Scripts directory, and you can activate the environment by running the activate.bat file directly, for example:
> proj_venv\Scripts\activate.bat
We can also deactivate the virtual environment using the deactivate command which on Windows is deactivate.bat, for example on Linux / Macs:
deactivate
Or on Windows
> deactivate.bat
To install these libraries using pip from the command line we can use:
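pip install pandas matplotlib seaborn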
This will install the latest versions of each of these libraries into the pip virtual environment. If you need to install a specific version, then you can specify this as shown below:
pip install pandas==1.5.2
The above would install pandas version 1.5.2
If you wish to create a file that can be used to install the same set of modules into another virtual environment, then you can freeze the virtual environment configuration and save it to a file, for example:
pip freeze > johnsenv.txt
You can then reload this configuration into another virtual environment at a later date, for example:
pip install -r johnsenv.txt
Using conda to install third party modules
You can create a conda virtual environment using the conda create command, for example:
conda create --name projectname python=3.11
This will create a new conda environment called projectname using python 3.11.
The virtual environment can be activated using:
conda activate projectname
And deactivated using
conda deactivate
You can also use conda to install the required modules, for example using
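conda install pandas matplotlib seaborn

Pandas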
The key concepts within Pandas are the Series and the DataFrame. A Series is a one-dimensional array-like object that holds data of a specific type (such as integers, strings or floats). Each value is accessible via an index value, which can be a numeric integer such as 0 or 1, or a label such as a string or a timestamp. The following diagram illustrates two such Series: one is indexed numerically from zero and holds a sequence of floats; the second is indexed by a set of timestamps and holds a sequence of integers.
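To make this concrete, here is a minimal sketch (with illustrative values rather than values from the dataset) showing both kinds of Series:

import pandas as pd

# A Series of floats with the default numeric index (0, 1, 2, ...)
floats = pd.Series([4.6, 4.5, 4.9, 3.7])
print(floats)

# A Series of integers indexed by timestamps
counts = pd.Series([10, 12, 9],
                   index=pd.to_datetime(['2021-01-01', '2021-01-02', '2021-01-03']))
print(counts)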
A DataFrame is a tabular structure a bit like a spreadsheet. It has columns and rows. Each column has a type and a label. The columns are represented by Series. Each row has an index. The index could be numeric or some other value such as a sequence of dates. A DataFrame can be visualised as shown below:
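As a concrete illustration (using a handful of values similar to the UK rows we will meet later), a small DataFrame can be built from a dictionary mapping column labels to lists of values:

import pandas as pd

# Each key becomes a column label; all columns share the same numeric index
df_example = pd.DataFrame({'Year': [2019, 2020, 2021],
                           'Wind': [64.33, 73.64, 65.02],
                           'Hydro': [5.93, 6.68, 5.59]})
print(df_example)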
We will use DataFrames to load, process and analyse the related data sets.
Exploring the Data
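Before we can explore anything, we need to read the CSV data into a Pandas DataFrame. The following is a minimal sketch; the file name modern-renewable-prod.csv is an assumption based on the column names used below, so check it against the files in your Kaggle download:

import pandas as pd
import matplotlib.pyplot as plt

# Read the renewable energy production data into a DataFrame
# (the file name is an assumption - use the matching CSV from the dataset)
df = pd.read_csv('modern-renewable-prod.csv')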
With the data loaded, the first thing we are going to do is change some of the headings in the DataFrame. Whilst the headings in the CSV file are very descriptive, they are rather long, take up a lot of space on the console and would clutter any graphs.
We can change the heading of a DataFrame column using the rename() method indicating that we want to rename a set of columns.
df.rename(columns={'Electricity from wind (TWh)': 'Wind',
'Electricity from hydro (TWh)': 'Hydro',
'Electricity from solar (TWh)': 'Solar',
'Other renewables including bioenergy (TWh)': 'Other'}, inplace=True)
print(df.tail(10).to_string())
So far we have been printing out the last 10 rows of the DataFrame, but how many rows are there in total? We can easily find that out using the built-in len() function, for example:
print('Row count is:', len(df))
In this case when this is run we can see:
Row count is: 8851
So there are 8851 rows in the DataFrame.
Interestingly, this does not mean that there are 8851 items of data for every column in the DataFrame. We can check this using the count() method, which counts the number of non-NA cells for each column (or each row; the default is to count per column). For example:
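print(df.count())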
Entity 8851
Code 7296
Year 8851
Wind 8676
Hydro 8840
Solar 8683
Other 8631
dtype: int64
This shows that the Entity column has a value in every single row of the DataFrame; however, the Code column has many missing values, as only 7296 rows have a value in it.
It might now be interesting to see the data for the UK held in the dataset. To do this we can select only those rows with a code representing the UK. In this data set the Code for the United Kingdom is GBR. We can use the dataframe loc attribute to select a set of rows that match some condition – in this case the conditions where the ‘Code’ column value is GBR:
print('Select all the rows for GBR')
df_gbr = df.loc[df['Code'] == 'GBR']
print(df_gbr.to_string())
When we run this we obtain a set of rows containing data for 1965 to 2021:
Entity Code Year Wind Hydro Solar Other
8261 United Kingdom GBR 1965 0.000 4.6120 0.00000 0.00
8262 United Kingdom GBR 1966 0.000 4.5360 0.00000 0.00
8263 United Kingdom GBR 1967 0.000 4.8850 0.00000 0.00
8264 United Kingdom GBR 1968 0.000 3.7220 0.00000 0.00
8265 United Kingdom GBR 1969 0.000 3.2560 0.00000 0.00
…
8315 United Kingdom GBR 2019 64.330 5.9300 12.92000 37.30
8316 United Kingdom GBR 2020 73.640 6.6800 13.32000 38.10
8317 United Kingdom GBR 2021 65.020 5.5900 12.47000 39.12
Some of these rows have been omitted above for brevity.
We can now graph this data so that we can see for example, the amount of Hydro power generated between these years.
To do this we can use Matplotlib via the Pandas API. A DataFrame has a plot() method that can be used to create a plot object, which can then be configured via the Matplotlib interface (plt) so that tick marks are rotated and the plot is scaled, before being displayed using show(). For example:
print('graph hydro power generation against year')
df_gbr.plot(x="Year", y="Hydro", kind='bar', legend=False)
plt.xticks(rotation=90)
plt.autoscale()
plt.show()
The resulting graph is shown below. Here we can see that the general trend was for increased hydro power generation up to 2020, with a slight drop off in 2021.
If you want to plot two columns against each other, for example to plot hydro against solar generation for the last ten years for the UK, we can combine the "Hydro" and "Solar" columns. We do this by adding both "Hydro" and "Solar" into a list for the y axis and selecting only the last 10 rows of the df_gbr DataFrame using the tail(10) method.
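As a sketch, following the same pattern as the previous plot, this might look as follows:

print('graph hydro and solar power generation for the last ten years')
df_gbr.tail(10).plot(x="Year", y=["Hydro", "Solar"], kind='bar')
plt.xticks(rotation=90)
plt.autoscale()
plt.show()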
This shows quite clearly that solar power generation has overtaken hydro power generation by a significant amount in the last ten years.
We can compare all the types of renewable energy production data available for the UK for the last ten years by adding all the columns into the y parameter. The resulting graph shows that wind has become by far the largest generator of renewable power in the UK.
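Again as a sketch, the only change is to pass all four of the renamed columns in the y list:

print('graph all renewable power generation for the last ten years')
df_gbr.tail(10).plot(x="Year", y=["Wind", "Hydro", "Solar", "Other"], kind='bar')
plt.xticks(rotation=90)
plt.autoscale()
plt.show()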
Using Python and Pandas, along with Matplotlib, we are able to create powerful and effective data analytics applications very quickly. This blog has illustrated, for one dataset, how we can explore and review the data to see how trends in renewable energy generation have changed over the last 10 years and more.
Would you like to know more?
We've got lots of great Python training courses to choose from: