pandas is the go-to Python library for Data Analytics. pandas 2.0 was released back on the 3rd of April 2023 and the updates have come fast and furious, with the current release being pandas 2.2.3 (released in September 2024). We take a look at some of the reasons to upgrade, if you haven't already!
07-11-2024
pandas 2.0 was the update we had all been waiting for, promising better performance and memory efficiency. Let's take a look under the bonnet at what's changed since it was released in April 2023 and how it measures up.
pandas and Data Analytics
pandas is the go-to Python library for Data Analytics. In fact, when most people say they are using Python to perform Data Analytics-type tasks, what they really mean is that they are using Python plus several additional libraries - more often than not including pandas.
At its core pandas is a Python library that provides data structures (such as the DataFrame) and operations for accessing, manipulating and analysing numerical and time series data.
pandas was originally developed by Wes McKinney (starting back in 2008) as a flexible tool to perform quantitative analysis. It was open sourced in 2009 and went through a long series of 0.x releases before pandas 1.0 arrived in January 2020. The last 1.x release was 1.5.3 at the start of 2023.
The library's name comes from "panel data" - a common term used for multidimensional data sets (as used in Statistics and Econometrics). And in case you're wondering, they don't capitalise the name "pandas".
Installing pandas 2.2.x
You can install your preferred version of pandas 2 from PyPI using the pip install command, for example:
pip install pandas==2.2.3
or from conda-forge using
conda install -c conda-forge pandas==2.2.3
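Once installed, a quick check confirms which version you are running:

import pandas as pd
print(pd.__version__)   # e.g. 2.2.3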
What is new in pandas 2.x?
Various performance enhancements have been implemented across the pandas 2.x releases, such as improvements to copy-on-write: for example, a lazy copy mechanism that defers the actual copy until the object in question is modified.
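As a minimal sketch of the lazy copy behaviour (copy-on-write is opt-in in pandas 2.x via the mode.copy_on_write option):

import pandas as pd

pd.options.mode.copy_on_write = True   # opt in; planned as the default in pandas 3.0

df = pd.DataFrame({"a": [1, 2, 3]})
subset = df["a"]         # no data is copied at this point
subset.iloc[0] = 100     # the deferred copy happens here, when subset is modified
print(df["a"].tolist())  # df is unchanged: [1, 2, 3]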
One criticism that was levelled at pandas 1.x related to memory usage on very large data sets. Originally, pandas was built on top of NumPy data structures.
To explain why this is important, it is useful to understand how pandas is usually used. In most cases a data set is loaded from some source such as a CSV file, an Excel file, a database etc. This can be done using one of the read functions such as read_csv, read_excel or read_sql. The data is loaded into the Python/pandas program and pandas determines how to represent it. This representation is quite straightforward for integers and floating point numbers, but for strings, dates, times, etc. some processing and decision making is required as to the best way to hold that data. The fundamental types in Python such as list, tuple and dictionary are not designed for holding large amounts of data and can become very slow if they are used. pandas therefore chose another representation for arrays, and that representation was NumPy.
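For example, a small sketch of this inference at work (hypothetical inline data), using the default representation:

import io
import pandas as pd

data = io.StringIO("""id,price,name,when
1,2.5,widget,2023-04-03
2,3.5,gadget,2024-01-19
""")
df = pd.read_csv(data, parse_dates=["when"])
print(df.dtypes)   # id: int64, price: float64, name: object, when: datetime64[ns]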
At the time this was a very good idea, as NumPy is very well established and provided significant out-of-the-box support that could be built upon to provide data processing and data analytics facilities.
However, since then various rival libraries or add-ons to pandas have attempted to improve on the performance of NumPy and NumPy arrays, which were perceived to be too 'slow' for many modern applications and which lack support for strings and missing values. Indeed, Wes McKinney has written an article outlining why NumPy was no longer the best choice for pandas: "Apache Arrow and the '10 Things I Hate About pandas'".
In fact, pandas has been working towards decoupling itself from NumPy for several years (at least since 2018), and with pandas 2.0 PyArrow was adopted as an optional backend supporting all data types. Releases 2.1 and 2.2 doubled down on this integration.
What is PyArrow?
PyArrow is the Apache Arrow library for Python. Which of course raises the question: what is Apache Arrow?
Apache Arrow defines a language-independent format whose goal is to provide a memory-efficient and high-performance way to represent and process large datasets (both in-memory and on disk). Libraries are available for a wide range of languages, from C and C++ to Go, Java, JavaScript, Python, R, Ruby and Rust.
PyArrow can now underpin everything in pandas (although NumPy remains the default and is still used for some activities where appropriate; indexes, for example, can still hold NumPy data types).
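As a small illustration, a Series can be constructed with a PyArrow-backed dtype directly (requires the pyarrow package to be installed):

import pandas as pd

# PyArrow-backed integers support missing values natively (shown as <NA>),
# something NumPy int64 arrays cannot do.
s = pd.Series([1, 2, None], dtype="int64[pyarrow]")
print(s.dtype)   # int64[pyarrow]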
Interestingly, the developers of pandas have taken quite a conservative approach to the API for pandas, ensuring that as little has changed from the developers' point of view as possible. Thus, almost everything that a pandas 1.x developer knew is still relevant for pandas 2.x. The main difference is that the read functions now provide an optional dtype_backend parameter. When this parameter is set to "pyarrow", the read function returns a PyArrow-backed ArrowDtype DataFrame (rather than a traditional-style DataFrame).
From the developer’s perspective the only difference from that point on is that the performance of the system should be better. For example:
import io
import pandas as pd

data = io.StringIO("""a,b,c,d,e,f,g,h,i
1,2.5,True,a,,,,,
3,4.5,False,b,6,7.5,True,a,
""")
df = pd.read_csv(data, dtype_backend="pyarrow")
df.dtypes
a     int64[pyarrow]
b    double[pyarrow]
c      bool[pyarrow]
d    string[pyarrow]
e     int64[pyarrow]
f    double[pyarrow]
g      bool[pyarrow]
h    string[pyarrow]
i      null[pyarrow]
dtype: object
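Note how column e, which contains a missing value, is still held as int64[pyarrow]; with the default NumPy backend, the missing value would have forced the column to be promoted to float64.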
How does pandas 2.x stack up?
The performance improvements in pandas 2.x make a marked difference over the older, purely NumPy-based data frames. Depending upon the tests involved, this can represent just under half the time taken for some operations, down to less than a quarter of the time. As an example, a very simple test loading a very large dataset into a simple program to average a set of integers had the following results:
pandas 1.5 (NumPy) 48.9 ms
pandas 2.0 (PyArrow) 17.9 ms
pandas 2.1 15.8 ms
pandas 2.2 14.9 ms
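The exact numbers will vary from machine to machine, but a minimal sketch of this kind of timing comparison (assuming a hypothetical large.csv file with an integer column x) might look like this:

import timeit
import pandas as pd

# Hypothetical file and column names; any sufficiently large CSV will do.
numpy_time = timeit.timeit(
    lambda: pd.read_csv("large.csv")["x"].mean(), number=10)  # NumPy-backed (default)
arrow_time = timeit.timeit(
    lambda: pd.read_csv("large.csv", dtype_backend="pyarrow")["x"].mean(), number=10)
print(f"NumPy backend:   {numpy_time / 10:.4f}s per run")
print(f"PyArrow backend: {arrow_time / 10:.4f}s per run")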
Some care is needed here: although all that is strictly required is to switch to PyArrow, the pandas 2.x documentation provides numerous further approaches and options for improving performance. A simple, perhaps slightly naïve, like-for-like comparison may therefore underestimate pandas 2.x's performance improvement. See 'Enhancing Performance' in the pandas 2.2 documentation for more information.
Comparing pandas 2.x
If you search on the web you will find many articles comparing pandas with a range of tools, from R and PySpark to tools such as Excel, SAS or SQL databases. Some of these comparisons are relevant (such as PySpark); others, such as SQL, are less appropriate. In this section we will quickly compare and contrast several additional tools.
pandas versus NumPy
This is not an uncommon analysis to find if you go searching. However, as has been indicated above, it is not really an appropriate one, as pandas 1.x was built directly on top of NumPy and pandas 2.x can still use NumPy or provide PyArrow as an alternative representation. Perhaps what is worth saying is that if all you need are the facilities that NumPy provides, then pandas is probably overkill for you!
pandas versus SciPy
Another comparison you will find is with SciPy (or Scientific Python). Interestingly, SciPy is not strictly required by pandas but is listed as an optional dependency. SciPy is another open-source Python library, this time oriented around mathematics, science and engineering tasks; it contains modules for linear algebra, integration, interpolation, FFT, image processing etc. As such, SciPy is complementary to pandas rather than an alternative to it.
pandas versus PySpark
PySpark is a Python library for the Scala (and Java) based Spark ecosystem of tools. Spark is a framework for working with large datasets in a distributed computing environment. In contrast, pandas fundamentally runs operations on a single computer. Thus, if you want to exploit the benefits of a set of networked computers for your data analytics tasks then PySpark may well be a preferable choice. However, PySpark is more complex than pandas and obviously involves a distributed computing environment, and thus requires more set-up than pandas. In this comparison these are markedly different tools aimed at significantly different sized tasks.
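For developers who want pandas-style code on Spark, a minimal sketch (assuming Spark 3.2 or later, and hypothetical file and column names) might be:

# Spark 3.2+ ships a pandas-like API on top of its distributed engine.
import pyspark.pandas as ps

psdf = ps.read_csv("data.csv")   # a pandas-style DataFrame, distributed by Spark
print(psdf.groupby("category")["value"].mean())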
pandas versus Dask
The last section indicated that pandas is primarily a single-computer solution, whereas PySpark is a multi-computer solution. However, Dask changes things a little. Dask is a library that allows pandas (and in fact other libraries such as NumPy itself and scikit-learn) to scale across a distributed computing environment. It is thus possible to use Dask to make pandas operate across multiple computers and take advantage of multiple computer systems for a data analytics task. This may well be a preferable approach to PySpark if you are already familiar with pandas. Thus, Dask is not really a competitor of pandas but an enabler!
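A minimal sketch of the Dask route (hypothetical file pattern and column names); the API deliberately mirrors pandas:

import dask.dataframe as dd

ddf = dd.read_csv("data/*.csv")                    # lazily builds a partitioned DataFrame
result = ddf.groupby("category")["value"].mean()   # pandas-style operations
print(result.compute())                            # compute() triggers the actual work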
pandas versus polars
This is perhaps the most appropriate comparison in this list. Polars is a direct rival to pandas in terms of its aims and objectives, with the added focus of aiming to be faster and more efficient than pandas. Its main claim to performance is that it uses an Arrow-based representation and natively parallelises processing of the data. In actual fact Polars has two APIs: an eager API and a lazy API. The eager API is similar in its execution to pandas. The lazy API only runs code when it needs to, which can in some cases further improve performance, as the sketch below illustrates.
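A minimal sketch contrasting the two APIs (hypothetical file and column names):

import polars as pl

# Eager API: executes immediately, much as pandas does.
df = pl.read_csv("data.csv")
print(df.filter(pl.col("value") > 10).select(pl.col("value").mean()))

# Lazy API: builds a query plan that Polars can optimise as a whole;
# nothing runs until collect() is called.
lazy = (pl.scan_csv("data.csv")
          .filter(pl.col("value") > 10)
          .select(pl.col("value").mean()))
print(lazy.collect())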
Compared to pandas 1.x, Polars has significant performance improvements. However, compared to pandas 2.0 the improvements are either less marked, not really noticeable or indeed not present, as in some cases pandas 2.0 may be faster. Polars is still an alternative to pandas 2.x, but the results are less clear cut, and any developer who wishes to select one over the other should perform their own performance benchmark tests to determine which suits them best.
One point to note is that there is far more documentation and examples available for pandas than there are for Polars.
pandas versus R (rlang)
This is an interesting comparison, as pandas is a data analytics library built on top of the general-purpose programming language Python. In contrast, R is a programming language and environment designed for statistical computing and graphics. As such, there is a fundamental difference in philosophy behind these two approaches. For many developers coming from a traditional programming background, Python is a simple-to-learn programming language which benefits from the huge ecosystem of open source (and commercial) libraries available for it. This of course includes pandas, but also all the GUI, graphing, database, REST etc. libraries available in the market. Whereas R is a slightly esoteric programming language with its own set of libraries and ecosystem of add-ons, which are not as extensive as Python's.
pandas Releases
As mentioned at the start of this blog, there have been three major 2.x releases and several minor releases over the last year and a half or so. pandas 2.0 was released at the start of April 2023. There were several minor releases over the next few months, with pandas 2.1.0 being released on the 30th of August 2023.
So what was new in 2.1? Why was it released so soon after 2.0?
Essentially, 2.1 introduced a number of improvements and optimizations to speed up data processing times, as well as deprecating some old features. In particular, it built heavily on the PyArrow integration introduced in pandas 2.0 - already a significant performance hike - with improved support in 2.1.
pandas 2.2.0 was released on the 19th of January 2024. This release further built on the changes in pandas 2.0 and 2.1 by including several additional features that rely on the Apache Arrow ecosystem. For example, historically pandas relied on SQLAlchemy to read data from a SQL database. This was very effective but not very efficient, because SQLAlchemy reads the data row by row while pandas has a columnar layout; this makes reading and moving data into the pandas DataFrame structure slower than is ideal. In pandas 2.2 the ADBC driver from the Apache Arrow project enables users to read data in a columnar layout, which significantly improves performance.
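A minimal sketch of the ADBC route, assuming the adbc-driver-postgresql package and a hypothetical connection URI and table name:

import pandas as pd
from adbc_driver_postgresql import dbapi   # pip install adbc-driver-postgresql

uri = "postgresql://user:password@localhost:5432/mydb"   # hypothetical database
with dbapi.connect(uri) as conn:
    # pandas 2.2+ recognises ADBC connections and transfers data in a columnar layout
    df = pd.read_sql("SELECT * FROM my_table", conn)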
As you can see from this, pandas has taken quite a conservative approach to incorporating Arrow project features into its code base, ensuring that each release works before adding further updates in the next. This is all part of a pathway leading to pandas 3.0 (which, at the time of writing, is still an unstable development version of pandas).
Summary
pandas 2.x is here and if you are working in the field of Data Analytics then you should certainly take a look at it. There is minimal overhead to getting started and significant gains to be obtained by using it.
Would you like to know more?
If you found this article interesting you might be interested in some of our related training courses.