Our Python guru Jack Card benchmarked Python 3.14 vs 3.13.3, explored the tail call interpreter, and uncovered compiler bugs affecting the results. Come and see what he found...
Anything that suggests that there might be an improvement in overall Python performance is, of course, worth noting. But claims of 30% performance improvements are worth sitting up and really taking notice of. This is what has recently happened with the next release of Python, that is Python 3.14, due for full release in September 2025. However, alpha and beta versions of this release have been available for a little while now, and some performance benchmarks have shown a significant improvement with a new experimental compiler feature turned on. The latest release at the time of writing was 3.14.0 beta 1, released on Tuesday, 6th of May 2025. It is the first of four planned beta releases.
Python 3.14 – what’s all the fuss?
Python 3.14 is scheduled for release in September 2025 as part of the regular cadence of Python releases. Many releases are relatively straightforward updates to a previous version, often exploring new language features which may or may not make it into a later full release. However, Python 3.14 is slightly different: it is a step up from 3.13 rather than a minor update. In Python terms there are different versions of the language, such as Python 2 and Python 3, which might be referred to as epochs; then major releases such as 3.12 and 3.13; and then minor releases such as 3.13.1 and 3.13.2. Python 3.14 is therefore a major release, and you can read more about the release as a whole at the ‘What’s new in Python 3.14’ web page.
However, as well as numerous changes to the language, such as changes to the except syntax and restrictions on the use of return, break, and continue within a finally block, one of the most interesting things in the next release is a ‘new type of interpreter’.
This interpreter has been added to CPython (the low-level runtime element of the Python environment) specifically to improve the performance of the language runtime. It uses a technique referred to as tail calls between small C functions (which actually implement individual Python opcodes) rather than one very large C case statement. It is claimed that, with certain (newer) compilers, this interpreter provides significantly better performance: python.org reports preliminary findings suggesting improvements of up to 30%.
The interpreter has been tested by python.org using Clang 19 on x86-64 and AArch64 architectures. Currently, this is an opt-in feature, which means that to try it out it is necessary to download the source code and compile it yourself with the appropriate compiler flags set.
Tail Call Optimisation
I must admit that initially I was a little confused, as there is a technology, available in several programming languages but not Python, that allows recursive functions with a particular behaviour to benefit from tail call optimisation.
The idea is that if a recursive function calls itself as the last thing it does, then that call is a tail call. This can be optimised by a compiler or runtime such that the recursive call is replaced by some form of loop. As loops are far less expensive to execute than function calls, which require setting up and then tearing down a stack frame, this can be a significant optimisation.
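To make the idea concrete, here is a small Python illustration of my own (not something from the 3.14 release) of a tail-recursive function and its loop equivalent. Note that CPython performs no tail call optimisation, so the recursive version still uses one stack frame per call and will hit the recursion limit for large inputs; a language that does optimise tail calls would effectively rewrite it into the loop.

def sum_to_recursive(n, total=0):
    # Tail call: the recursive call is the very last thing the function does.
    if n == 0:
        return total
    return sum_to_recursive(n - 1, total + n)

def sum_to_loop(n):
    # What a tail-call-optimising compiler effectively turns the above into.
    total = 0
    while n > 0:
        total += n
        n -= 1
    return total

print(sum_to_recursive(500))   # 125250
print(sum_to_loop(500))        # 125250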
Tail Call Interpreter
The new interpreter does not do tail call optimisation; instead, the reference to tail calls between small C functions relates to an internal implementation detail of the CPython interpreter. In fact, it does not change the behaviour of the Python program being executed at all. Instead, it relates to the way that the underlying interpreter executes the low-level C functions that relate to the Python operations.
To be honest, back when I did my degree in Computer Science, the module I found least exciting was on compiler theory. Although this was about 40 years ago, and technology has come a long way in that time, in general I still don't get very excited about compilers and compiler technology. I use a computer language and expect the compiler/runtime to work for me. I am interested when a new version comes out claiming some benefits in compilation time, garbage collection, or runtime speed, but only from the point of view of how it will benefit the systems I am writing. I am much more interested in what I can use the language for, rather than what happens under the hood. But sometimes, you have to look under the hood to find out what all the fuss is about.
In this case, what is going on is that the interpreter can take advantage of a feature in the underlying C compiler used to compile the interpreter itself. Using tail calls, the call to the handler for the next instruction is implemented as a tail call in the previous one. Newer C compilers, such as Clang 19 onwards, can take advantage of this tail call idea, together with an associated preserve_none calling convention, to generate faster executing C code. For a discussion of this idea in the Python interpreter, see here.
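To give a flavour of the structural difference being described, here is a toy sketch, written in Python purely for illustration (the opcodes and the little program are entirely made up). It contrasts a single large dispatch loop, analogous to CPython's big C case statement, with small per-opcode handlers that each finish by calling the handler for the next opcode. In CPython the handlers are C functions, and it is the C compiler, using tail calls and the preserve_none calling convention, that turns those chained calls into cheap jumps; Python itself will not optimise them.

PROGRAM = [("PUSH", 2), ("PUSH", 3), ("ADD", None), ("PRINT", None), ("HALT", None)]

def run_switch_style(program):
    # One big dispatch loop: fetch an opcode, branch on it, repeat.
    stack, pc = [], 0
    while True:
        op, arg = program[pc]
        pc += 1
        if op == "PUSH":
            stack.append(arg)
        elif op == "ADD":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "PRINT":
            print(stack[-1])
        elif op == "HALT":
            return

def run_chained_style(program):
    # Each handler does its work and then 'tail calls' the handler for the next opcode.
    stack = []

    def dispatch(pc):
        op, arg = program[pc]
        return handlers[op](pc + 1, arg)

    def op_push(pc, arg):
        stack.append(arg)
        return dispatch(pc)

    def op_add(pc, arg):
        b, a = stack.pop(), stack.pop()
        stack.append(a + b)
        return dispatch(pc)

    def op_print(pc, arg):
        print(stack[-1])
        return dispatch(pc)

    def op_halt(pc, arg):
        return None

    handlers = {"PUSH": op_push, "ADD": op_add, "PRINT": op_print, "HALT": op_halt}
    return dispatch(0)

run_switch_style(PROGRAM)   # prints 5
run_chained_style(PROGRAM)  # prints 5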
So, the gist of this is that it's a clever use of some underlying features in the C compiler that allow the CPython interpreter to execute faster! Yeah, that's great!
Some Benchmarks for performance testing
To explore this idea and see what the results are like, I decided to use a well-known set of Python performance benchmarks. These benchmarks are publicly available via GitHub. The benchmark chosen was the pyperformance benchmark.
There is plenty of documentation available on the benchmarks, but critically, they are straightforward to run as they come with a set of tools that allow you to run the benchmarks, save the results to file, generate comparisons between runs, etc.
Installing pyperformance is also straightforward, as pip can be used to install it into a virtual environment using:
python3 -m pip install pyperformance
Using Python 3.13 as a baseline
I decided that it would be useful to have a baseline to compare the new 3.14 versions against. The obvious version therefore was the current 3.13.3 version of Python. Thus, the first benchmarking tests were performed by creating a virtual environment in Python 3.13.3 and installing pyperformance. The pyperformance command was then used to execute all the benchmarks and store the results in a JSON file.
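For anyone wanting to repeat this, the steps look something like the following (the output file name is just an example I have chosen, and this assumes a python3.13 command is on the path):

python3.13 -m venv venv313
source venv313/bin/activate
python3 -m pip install pyperformance
pyperformance run -o py3133.json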
Basic Python 3.14 beta1 performance
Once an initial set of benchmark results were obtained, Python 3.14 was downloaded and the benchmarks were installed. Interestingly, a few libraries that some of the benchmarks needed were not available or would not load into 3.14.
This is not actually a problem for the benchmark program as those tests are merely skipped. Thus, about 20% of the benchmarks could not be executed, but that still left more than enough for a comparison.
Once pyperformance was installed, the available benchmarks were rerun and the results stored in a JSON file.
Comparing 3.13.3 with 3.14 beta 1
Once both sets of benchmarks had been run, the results were compared. Remember, this is the basic 3.14 build without the tail call interpreter's performance enhancements turned on. The results showed a small overall improvement for the newest version of Python against the full 3.13.3 release on some benchmarks, while on others 3.13.3 was faster. In fact, when the results were graphed, there was no significant improvement, and one could argue there was some detriment in performance.
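For reference, the comparison itself can be produced with pyperf, which pyperformance is built on; assuming the two runs were saved as, say, py3133.json and py314b1.json, something like:

python3 -m pyperf compare_to py3133.json py314b1.json --table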
It should be noted, though, that this is a beta release, with the first beta having only just come out. It is likely that what I am seeing here will improve over the coming months, and by September we would expect to see a general improvement in performance.
Python 3.14 with tail call interpreter turned on
So now the big question was, what will happen with the tail call interpreter turned on?
This involves building the CPython interpreter and runtime from scratch using an appropriate C compiler such as GCC or Clang. When the system is built, you must turn on the flag that enables the tail call interpreter (--with-tail-call-interp) and also enable Profile Guided Optimization (PGO) via the --enable-optimizations flag. The C compiler used was Clang 19 on an Apple Mac.
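For reference, a from-source build along these lines looks roughly like this (this is my own summary of the steps rather than anything official; it assumes Clang is selected via the CC variable, and the number of make jobs will vary):

CC=clang ./configure --with-tail-call-interp --enable-optimizations
make -j8

The resulting interpreter can then be used to create a fresh virtual environment for the benchmarks.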
Once the new version was built, pyperformance was once again installed into a virtual environment for the custom build of Python and the benchmarks were rerun. The results were stored in a JSON file and then compared with both the 3.13.3 and the standard 3.14 beta 1 runs.
Performance Comparison
This is where things got a bit muddied. The results did show a slight improvement in some benchmarks over the standard 3.14 beta 1 build, but they were still slower than the 3.13.3 version.
My first assumption was that I had done something wrong and probably hadn’t built the system properly. So, I went back over everything and made sure the settings being used with the C compiler were correct and created a new build. I then reran the tests, and overall, the results obtained were the same.
What is going on here?
So what is going on here? Why weren't the results I was obtaining better? At this point, I did a bit of internet searching and found that I wasn't the first to uncover this issue (see the article ‘Performance of the Python 3.14 tail-call interpreter’ for an in-depth discussion of the issue affecting my results). In fact, in the time it had taken me to download Python and run the tests, python.org had issued a notification that the performance benefits they had reported were affected by a bug in the C compiler, Clang/LLVM 19, which causes the normal interpreter to be slower than it should be.
This meant that the perceived improvement only reached the 30% level when the comparison was made against a build of Python affected by the C compiler bug. Once this issue was resolved, the speed-up was still present, but it represented a much more modest improvement.
So how come I had not been affected by this? Well, first off, the baseline chosen was 3.13.3, a standard release build that was not affected by the bug in the Clang 19 compiler. And when I compiled my own runtime using Clang 19, the build simply ran as it would with the tail call interpreter, which is probably somewhat faster overall than the older approach in some situations.
So why was the beta version I used not out of line with the other two versions? Because a fix for the bug in Clang 19 has since been merged, and my runtime was probably built using a version of Clang containing that fix.
Illuminating Benchmarking
My experiences here, as well as those of others, highlight some interesting things to be careful of when reading performance statistics and developing software, and the benefit of doing a little further research rather than just jumping in at the deep end.
Software is complex and multi-layered
We all know software is complex and multi-layered in terms of its design, construction, and testing. However, we often take it for granted that the tools we are using work and that they are bug-free. Thus, when something doesn't work, we tend to assume that it's our own software that is at fault. Now I should really have remembered a lesson from way back in my undergraduate days. Back then I was using a new Ada compiler on a PC to write some very simple Ada code as part of a module I was taking (note this is 40 years ago now).
I tried to compile my code and there was a syntax error which meant that the build failed. I subsequently fixed the syntax error and tried to build again. However, no matter what I did, the code failed to compile. Being a green undergraduate, I took it to the tutor for the module who sent me to see the ‘Ada specialist’ working on a research grant in the department. He looked at the code and somewhat jeeringly said – it works.
I sent him the code and indeed it compiled and worked. I had to drag him down to my computer to show him that on my machine it didn't work. He ran the compiler and got the error message I had been getting. After a few minutes trying things, he eventually told me to go to a hidden directory, delete some hidden files there, and try again. After doing that, it worked!
It turned out I had stumbled upon two bugs. The first bug had been triggered when my code contained a syntax error, and the compiler had generated some buggy intermediate files. The second occurred because the compiler now failed to handle the buggy intermediate files correctly and so failed again and again. As a result, my code was fine, but the compiler was causing a problem.
Compilers, runtime environments, linkers, debuggers, etc., are all just software themselves, and there is nothing to say one of them doesn't have a bug in it or that a bug can’t easily be introduced into one of these following a new build, etc.
Be careful of your baseline
I chose to compare my results against 3.13.3 partly as this is the version I have been using and partly because it's the version I had on my computer. If I had just chosen to compare a 3.14 build and my own 3.14 build, I might have come away with a different set of conclusions. When you are benchmarking, it is always, and I mean always, very important to choose the appropriate baseline for performance comparisons. Get that wrong, and all your analysis may be for nothing.
Performance is more than just one factor
Even without the tail call interpreter being turned on, 3.14 aims to introduce very worthwhile performance-related improvements, such as the ability to turn off the GIL (the free-threaded build) and the lazy evaluation of annotations. This may well explain some of the performance improvements I did see between 3.13.3 and 3.14.
Summary
Don't immediately believe everything you read, and as the old adage goes, if it looks too good to be true then it probably is. However, I am sure that by the time the full release of 3.14 comes along in September 2025, we will be highlighting the performance gains to be had from this new runtime environment.
Would you like to know more?
If you found this article interesting you might be interested in our instructor-led Python programming courses: