Pandas 2.0 is here in Black and White!

Pandas 2.0 was recently released for general availability (April 3rd 2023). This is the Pandas update we have all been waiting for, as it aims to make Pandas faster and more memory efficient.

17-04-2023

Pandas is the go-to Python library for Data Analytics. In fact, when most people say they are using Python to perform Data Analytics tasks, what they really mean is that they are using Python plus several additional libraries, at the core of which is Pandas.

At its core, Pandas is a Python library that provides data structures (such as the DataFrame) and operations for accessing data, manipulating that data and analysing numerical and time series data.
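As a purely illustrative sketch (the data here is made up), a DataFrame can be created, filtered and summarised in just a few lines:

import pandas as pd

# build a small DataFrame from a dictionary of columns (hypothetical data)
df = pd.DataFrame({"city": ["Leeds", "York", "Hull"],
                   "population": [812000, 202800, 267100]})

print(df[df["population"] > 250000])  # access/filter rows by a condition
print(df["population"].mean())        # simple numerical analysis of a column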

Pandas was originally developed by Wes McKinney (starting back in 2008) as a flexible tool for performing quantitative analysis. It was open sourced in 2009, with various versions being released over the following 14 years or so. The most recent 1.x release was 1.5.3, released at the start of 2023. However, we now have Pandas 2.0, which is a major release update from the Pandas 1.x series.

Installing Pandas 2.0

You can install Pandas 2.0 from PyPI using the pip install command, for example:

pip install pandas==2.0.0

or from conda-forge using:

conda install -c conda-forge pandas==2.0.0

An example of using a Python virtual environment to set up a new environment and install Pandas 2.0.0 is shown below:

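This is a minimal sketch, assuming a Unix-like shell and Python's built-in venv module (the environment name is illustrative):

python3 -m venv pandas2-env
source pandas2-env/bin/activate
pip install pandas==2.0.0
python -c "import pandas; print(pandas.__version__)"   # should print 2.0.0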


What is new in Pandas 2?

Various performance enhancements have been implemented in Pandas 2.0, such as copy-on-write improvements; for example, a new lazy copy mechanism defers copying until the object in question is actually modified.
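As a minimal sketch of what this lazy copy mechanism means in practice (assuming Pandas 2.0.0 with the copy-on-write option enabled):

import pandas as pd

pd.options.mode.copy_on_write = True   # opt in to copy-on-write behaviour

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
subset = df["a"]        # no copy is made yet; it is deferred
subset.iloc[0] = 100    # modifying the subset triggers the copy
print(df["a"].iloc[0])  # the original DataFrame is untouched: prints 1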

One criticism that was levelled at Pandas 1.x related to memory usage on very large data sets. Originally, Pandas was built on top of NumPy data structures.

To explain why this is important, it is useful to understand how Pandas is usually used. In most cases, a data set is loaded from some source such as a CSV file, an Excel file, a database, etc. This can be done using one of the read functions, such as read_csv, read_excel or read_sql. The data is loaded into the Python/Pandas program and Pandas determines how to represent it. This representation is quite straightforward for integers and floating point numbers, but for strings, dates, times, etc. some processing and decision making is required as to the best way to hold that data. The fundamental types in Python, such as the list, tuple and dictionary, are not designed for holding large amounts of data and can become very slow if they are used. Pandas therefore chose to use another representation for arrays, and that representation was NumPy.
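As a small illustration (with hypothetical data) of the NumPy-based representation that Pandas chooses by default when loading data:

import io
import pandas as pd

data = io.StringIO("x,y,z\n1,2.5,hello\n3,4.5,world\n")
df = pd.read_csv(data)
print(df.dtypes)   # x: int64, y: float64 (NumPy types); z: object (strings)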

At the time, this was a very good idea: NumPy is very well-established and provided significant out-of-the-box support that could be built upon to provide data processing and data analytics facilities.

However, since then various rival libraries, or add-ons to Pandas, have attempted to improve on the performance of NumPy and NumPy arrays, which were perceived to be too 'slow' for many modern applications and which lack native support for strings and missing values. Indeed, Wes McKinney has written an article outlining why NumPy was no longer the best choice for Pandas in "Apache Arrow and the '10 Things I Hate About pandas'".

In fact, Pandas has been working towards decoupling itself from NumPy for several years (at least since 2018), and with Pandas 2.0.0 PyArrow can now be used to back all data types.


What is PyArrow?

PyArrow is the Python library for Apache Arrow, which of course raises the question: what is Apache Arrow?

Apache Arrow defines a language-independent format whose goal is to provide a memory-efficient and high-performance way to represent and process large datasets (both in memory and on disk). Libraries are available for a wide range of languages, from C and C++ to Go, Java, JavaScript, Python, R, Ruby and Rust.
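As a minimal sketch of PyArrow used on its own (this assumes the pyarrow package is installed; the data is made up):

import pyarrow as pa

# an Arrow table is a columnar, language-independent, in-memory structure
table = pa.table({"a": [1, 2, 3], "b": ["x", "y", None]})
print(table.schema)            # a: int64, b: string
print(table["b"].null_count)   # first-class support for missing values: prints 1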

PyArrow can now underpin everything in Pandas (although NumPy is still available, and is still used where appropriate; for example, indexes can now hold any NumPy numeric data type).

Interestingly, the developers of Pandas have taken quite a conservative approach to the Pandas API, ensuring that as little as possible has changed from the developer's point of view. Thus, almost everything that a Pandas 1.x developer knew is still relevant for Pandas 2.0.0. The main difference is that the read functions now take an optional dtype_backend parameter. When this parameter is set to "pyarrow", the read function will return a PyArrow-backed, ArrowDtype DataFrame (rather than a traditional NumPy-backed DataFrame).

From the developer’s perspective the only difference from that point on is that the performance of the system should be better. For example:

import io
import pandas as pd

# hypothetical CSV data, including missing values
data = io.StringIO("""a,b,c,d,e,f,g,h,i
1,2.5,True,a,,,,,
3,4.5,False,b,6,7.5,True,a,
""")
df = pd.read_csv(data, dtype_backend="pyarrow")
df.dtypes
a     int64[pyarrow]
b    double[pyarrow]
c      bool[pyarrow]
d    string[pyarrow]
e     int64[pyarrow]
f    double[pyarrow]
g      bool[pyarrow]
h    string[pyarrow]
i      null[pyarrow]
dtype: object


How does Pandas 2.0.0 stack up? 

The performance improvements in Pandas 2.0.0 make a marked difference compared with the older, purely NumPy-based data frames. Depending upon the tests involved, operations can take from just under half the time down to less than a quarter of the time. As an example, a very simple test loading a very large dataset into a simple program to average a set of integers had the following results (a sketch of this kind of test follows the list):

  • Pandas 1.5 (NumPy) 48.9 ms
  • Pandas 2.0 (PyArrow) 17.9 ms
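The following is a sketch of the kind of like-for-like timing involved; the file name and column are hypothetical, and absolute timings will of course vary from machine to machine:

import timeit
import pandas as pd

df_numpy = pd.read_csv("large_dataset.csv")                           # NumPy-backed
df_arrow = pd.read_csv("large_dataset.csv", dtype_backend="pyarrow")  # PyArrow-backed

# average a column of integers with each backend
print(timeit.timeit(lambda: df_numpy["value"].mean(), number=100))
print(timeit.timeit(lambda: df_arrow["value"].mean(), number=100))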

Care needs to be taken here: although we have said that all that is required is to switch to PyArrow, the Pandas 2.0.0 documentation also provides numerous approaches and options that can be used to improve performance further. Thus, if a simple, perhaps slightly naïve, approach is taken to comparing like with like, then Pandas 2.0's performance improvement may be underestimated. See 'Enhancing Performance' in the Pandas 2.0 documentation for more information.

Comparing Pandas 2.0.0

If you search on the web you will find many articles comparing Pandas with a range of tools, from R and PySpark to tools such as Excel, SAS or SQL databases. Some of these comparisons are relevant (such as PySpark); others, such as SQL, are less appropriate. In this section we will quickly compare and contrast several additional tools.

Pandas v NumPy

This is not an uncommon analysis to find if you go searching. However, as has been indicated above, it is not really an appropriate one, as Pandas 1.x was built directly on top of NumPy, and Pandas 2.0 can still use NumPy or provide PyArrow as an alternative representation. Perhaps what is worth saying is that if all you need are the facilities that NumPy provides, then Pandas is probably overkill for you!

Pandas v SciPy

Another comparison you will find is with SciPy (or Scientific Python). Interestingly, SciPy is not strictly required for Pandas but is listed as an optional dependency. SciPy is another open-source Python library, this time oriented around mathematics, science and engineering tasks, containing modules for linear algebra, integration, interpolation, FFT, image processing, etc. As such, SciPy is complementary to Pandas rather than an alternative to it.

Pandas v PySpark

PySpark is a Python library for the Scala (and Java) based Spark ecosystem of tools. Spark is a framework for working with large datasets in a distributed computing environment. In contrast, Pandas fundamentally runs operations on a single computer. Thus, if you want to exploit the benefits of a set of networked computers for your data analytics tasks, then PySpark may well be a preferable choice. However, PySpark is more complex than Pandas: it involves a distributed computing environment and thus requires more set-up. In this comparison, these are markedly different tools aimed at significantly different sized tasks.

Pandas v Dask

The last section indicated that Pandas is primarily a single-computer solution, whereas PySpark is a multi-computer solution. However, Dask changes things a little. Dask is a library that allows Pandas (and in fact other libraries, such as NumPy itself and scikit-learn) to scale across a distributed computing environment. It is thus possible to use Dask to make Pandas operate across multiple computers and take advantage of multiple computer systems for a data analytics task, as the sketch below illustrates. This may well be a preferable approach to PySpark if you are already familiar with Pandas. Thus, Dask is not really a competitor of Pandas but an enabler!
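A minimal sketch of Dask's Pandas-like API (this assumes the dask package is installed; the file pattern and column name are hypothetical):

import dask.dataframe as dd

# a lazy, partitioned DataFrame that mirrors the Pandas API
ddf = dd.read_csv("data-*.csv")

# work is split across workers (threads, processes or machines) on compute()
print(ddf["value"].mean().compute())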

Pandas v Polars

This is perhaps the most appropriate comparison in this list. Polars is a direct rival to Pandas in terms of its aims and objectives, with the added focus of aiming to be faster and more efficient than Pandas. Its main claim to performance is that it uses an Arrow-based representation and natively parallelizes processing of the data. In actual fact, Polars has two APIs: an eager API and a lazy API. The eager API is similar in its execution to Pandas; the lazy API only runs code when it needs to, which can in some cases further improve performance, as the sketch below illustrates.
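A minimal sketch of the two Polars APIs (this assumes the polars package is installed; the file name and column are hypothetical):

import polars as pl

# eager API: executes immediately, much like Pandas
df = pl.read_csv("large_dataset.csv")
eager_result = df.filter(pl.col("value") > 10).select(pl.col("value").mean())

# lazy API: builds a query plan that only runs on collect(),
# allowing Polars to optimise and parallelise the whole pipeline
lazy_result = (pl.scan_csv("large_dataset.csv")
                 .filter(pl.col("value") > 10)
                 .select(pl.col("value").mean())
                 .collect())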

Compared to Pandas 1.x, Polars offers significant performance improvements. However, compared to Pandas 2.0 the improvements are either less marked, not really noticeable, or indeed not present, as in some cases Pandas 2.0 may be faster. Polars is still an alternative to Pandas 2.0, but the results are less clear-cut, and any developer who wishes to select one over the other should perform their own benchmark tests to determine which suits them best.

One point to note is that there is far more documentation and examples available for Pandas than there are for Polars.

Pandas v R

This is an interesting comparison, as Pandas is a data analytics library built on top of the general-purpose programming language Python, whereas R is a programming language and environment designed for statistical computing and graphics. As such, there is a fundamental difference in philosophy behind these two approaches. For many developers coming from a traditional programming background, Python is a simple-to-learn programming language which benefits from the huge ecosystem of open source (and commercial) libraries available for it. This of course includes Pandas, but also all the GUI, graphing, database, RESTful, etc. libraries available in the market. In contrast, R is a slightly esoteric programming language with its own set of libraries and ecosystem of add-ons, which are not as extensive as Python's.

Summary

Pandas 2.0 is here, and if you are working in the field of Data Analytics then you should certainly take a look at it. There is minimal overhead to getting started and potentially significant gains to be obtained by using it.


Would you like to know more?

If you found this article interesting you might be interested in some of our related training courses.
