Java v Kotlin v Python for Data Analytics / Machine Learning
Over the last few years Python has been the undisputed leader for Data Analytics and Machine Learning tasks. The only real competitors have been dedicated languages such as R or tools such as SAS or MATLAB. However, given the interest in Data Analytics and Machine Learning, it’s not surprising that more mainstream languages such as Java have also gained some traction.
For example, the Apache Spark library (originally implemented in Scala) is accessible from Java. In addition, the Java library Tribuo positions itself as a Java alternative to Python’s scikit-learn (aka sklearn). Moreover, more modern alternatives to Java such as Kotlin have also started to position themselves as options for Data Analytics and Machine Learning: JetBrains released a preview of its DataFrame library for Kotlin on June 30th 2022.
In this article we will consider the support provided in Java, Kotlin and Python for Data Analytics and Machine Learning and then compare and contrast these languages for such tasks.
Python has a huge ecosystem of libraries and frameworks available to support a range of tasks within the Data Analytics, Data Science and Machine Learning domains. Python is, like the other two languages under the spotlight, a general-purpose programming language. Such languages allow developers to write any type of application, from a simple game to a sophisticated machine learning system. However, for the sort of tasks being considered here, Python's main benefits are:
Huge Developer Community. Python is one of the most popular programming languages in the world. Whichever index you care to look at, you will find Python is one of the top three languages. For example, the TIOBE index for July has Python in the top position (with Java third and Kotlin outside the top 20). Partly because of this, there are many tutorials, guides, and quick reference resources available on the internet. In many cases, if you encounter an issue with Python or one of its popular libraries, there is likely to be a fix, workaround or guide available somewhere (such as Stack Overflow).
Cross-Platform. Python can be used on (almost) any operating system; it is thus available on Mac, Windows and Linux machines. This means that the same analysis code can usually be run wherever it is needed.
Language Simplicity. The Python language is much simpler than languages such as Java and Kotlin. This makes it easier to learn and makes applications quicker to develop. However, some of this simplicity can cause problems for larger, more mission-critical applications.
Dynamic Typing. Python is a dynamically typed language. This has both pros and cons for a traditional software system but is particularly helpful for data analytics type applications, as the expected data and the actual data may be quite different.
Ease of use. There is a wide variety of tools now available to allow developers and data analysts to work with Python. These include PyCharm, Jupyter Notebooks, Visual Studio Code and Spyder. Jupyter Notebooks and Spyder are particularly focussed on data analysis and scientific applications.
Availability of database libraries. Python can easily be used to access both SQL and NoSQL databases. For example, the Python Database API Specification provides a generic API that can be used with a database-specific driver to access SQL databases such as MySQL (see PyMySQL) and Oracle. In addition, there are drivers for NoSQL databases, such as PyMongo for MongoDB.
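To make the database point above concrete, here is a minimal sketch of the Python Database API in use. It assumes the `sqlite3` driver from the standard library with an in-memory database; the `orders` table, its contents and the function names are hypothetical and exist only for illustration.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a small, made-up orders table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (name TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [("alice", 30.0), ("bob", 12.5), ("alice", 20.0), ("carol", 7.5)],
    )
    conn.commit()
    return conn


def top_customers(conn, min_total):
    """Return (name, total) rows whose summed order amount is at least min_total."""
    cur = conn.cursor()
    # The value is passed as a parameter rather than formatted into the SQL string.
    cur.execute(
        "SELECT name, SUM(amount) AS total "
        "FROM orders GROUP BY name "
        "HAVING SUM(amount) >= ? ORDER BY total DESC",
        (min_total,),
    )
    rows = cur.fetchall()
    cur.close()
    return rows


if __name__ == "__main__":
    conn = build_demo_db()
    print(top_customers(conn, 10))
    conn.close()
```

The connect, cursor, execute and fetch pattern is the part shared across drivers that follow the specification. Details such as the placeholder style differ between drivers (`sqlite3` uses `?`, whereas PyMySQL uses `%s`), so queries may need small adjustments when switching database.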
In addition, Python has a wide range of very well established, stable and well documented libraries for both Data Analytics tasks and for Machine Learning applications. The key ones within the Python world and their relationships are illustrated below:
Each is briefly introduced below:
NumPy. Short for Numerical Python, NumPy provides arrays, including array-based mathematical and logical operations. It also provides other numerical and transformation style features including linear algebra, matrices and random number generators.
SciPy. This is a scientific library that can be used to solve scientific and complex mathematical problems. SciPy builds on top of NumPy.
Pandas. The Python Pandas library provides facilities that simplify accessing, processing and visualizing data sets. Pandas builds on other libraries such as the NumPy library.
scikit-learn. This is a general purpose library for performing data mining and data analysis tasks using statistical algorithms and machine learning approaches.
PyTorch. This library provides machine learning, natural language processing (NLP) and computer vision tools.
TensorFlow. This is another widely used Machine Learning library for Python.
Keras. This open-source Python library can be used for deep learning applications, including implementations of neural network algorithms.
Matplotlib. This is a graphing and data visualization library that is very widely used and underpins several other data visualization libraries.
Plotly. Plotly is an open-source graphing library for Python. It includes charts specifically designed for use with AI and Machine Learning applications.
Seaborn. This is a data visualization library built on top of Matplotlib.
PySpark. This is a Python interface to the Apache Spark library used for data engineering, data science, and machine learning on single-node machines or clusters.
Java is a very well-established programming language, used for a huge range of applications. As with Python it is a general-purpose programming language and can be used to develop almost any type of application. Increasingly there are libraries and frameworks that support data analytics and machine learning tasks.
The main benefits of Java from a developer or analyst’s point of view include:
Huge Developer Community. There is also a huge developer community world-wide around Java (although potentially only a small percentage of that community is focussed on Machine Learning and Data Analytics).
Cross-Platform. Java is a cross-platform language, with versions of the JVM (Java Virtual Machine) available for everything from Windows and Linux machines to Macs, as well as support on mobile devices running Android.
Strongly Typed Language. Java is a strongly and statically typed language: types are checked at compile time, which has significant advantages for mission critical systems in terms of reliability and correctness of application code. However, it may not be such an advantage for the sort of tasks being considered here.
Extensive Tool Support. Java has extensive support in tools such as IntelliJ, Visual Studio Code, Eclipse etc. However, none of these tools target the Data Analytics or Machine Learning communities.
Availability of database libraries. Java has a common database API called JDBC (the Java Database Connectivity API). Numerous drivers are available for SQL databases that plug into this generic API. There are also libraries that allow connection to NoSQL databases, such as the MongoDB Java Driver.
Performance. In general, Java's performance is likely to be better than that of a language such as Python but worse than that of a language such as C or C++.
As with Python, there is a range of third-party libraries that can be used from Java for Data Analytics and Machine Learning tasks, although there are probably fewer such libraries and they are less widely used. They include:
Tribuo. This is a Machine Learning library with a wide range of implementations available.
ND4J. This library (whose name stands for N-Dimensional Arrays for Java) provides support for numerical representations and processing. Interestingly it is modelled on NumPy and SciPy.
Deeplearning4J. This library provides deep learning algorithms for Java.
Apache Spark. Apache Spark was originally written in Scala but also provides an interface for Java (Scala, like Java, runs on the JVM).
Weka. Waikato Environment for Knowledge Analysis (or Weka) is another machine learning library for Java.
Tablesaw. This library provides data frames and data visualization in Java.
Java-ML. This library provides a very wide range of machine learning implementations for Java.
Kotlin is another language that can compile to the byte code format used by the JVM. This means that a Kotlin program can be compiled and run on any environment for which there is a JVM implementation. It is widely used for mobile application development on Android devices but can also be used to create standalone or server applications. From the point of view of this blog it has many of the same characteristics as Java including:
Strongly Typed Language.
Extensive Tool Support.
Performance. As with Java, the performance of Kotlin is likely to be better than Python but not as good as C / C++.
However, the design of Kotlin is more modern than that of Java and thus the syntax is simpler in some cases and stricter in others (for example, the type system indicates whether a type is nullable or not).
Kotlin also has access to all data analytics, data visualization and machine learning libraries that run on the JVM, including those originally written for Scala such as Apache Spark. In addition, on June 30th 2022 JetBrains released the Kotlin DataFrame library preview (see https://blog.jetbrains.com/kotlin/2022/06/kotlin-dataframe-library-preview/). This library provides a flexible set of features for data representation, data wrangling and processing.
However, the developer community for Kotlin is smaller than that of Java or Python, and there are far fewer Kotlin-specific libraries than there are libraries for either of the other languages.
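For readers coming from Python, the closest analogue to Kotlin's nullable types is probably an `Optional` type hint. The sketch below is only a comparison aid: the function names are hypothetical, and unlike Kotlin, Python does not enforce these hints itself; a separate type checker (mypy is one example) would be needed to flag a value that might be `None`.

```python
from typing import Optional


def parse_age(raw: Optional[str]) -> Optional[int]:
    """Return the age as an int, or None when the field is missing or not numeric."""
    if raw is None:
        return None
    try:
        return int(raw.strip())
    except ValueError:
        # Values such as "n/a" are treated as missing data.
        return None


def describe(age: Optional[int]) -> str:
    """Format an age that may be missing."""
    # A type checker would complain about arithmetic on `age` before this None check.
    if age is None:
        return "unknown"
    return f"{age} years"
```

In Kotlin the equivalent signatures would use `String?` and `Int?`, and the compiler itself would reject code that uses such a value without first handling the null case.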
Python v Java v Kotlin comparison
So which language is best suited for data analytics and machine learning?
The following table provides a summary of the key aspects of the languages.
From this table it is clear that all three could be used for Data Analytics and Machine Learning tasks but in general Python still comes out on top. Of course, if you are a Java development house then there is also a wide range of libraries available to you.