The Apache Spark community released Spark 3.2 on October 13, 2021, a significant new release of the distributed computing framework. Spark 3.2 contains many new features, notably deeper support for the Python data ecosystem through the addition of the pandas API.
Let us look in detail at the new features Spark 3.2 has to offer.
In the past few years, Python usage has exploded in both the data science and data engineering communities. For many years, the Apache Spark community has been working to make the two environments work together.
In 2019, Databricks released the Koalas project, which implemented the pandas DataFrame API on top of Spark. With the pandas API now included in Spark 3.2, a third-party library is no longer needed for this.
Apache Spark 3.2 additions
Spark 3.2 takes the pandas API a notch higher.
- People working in pandas can now scale their pandas applications
- They can take advantage of multi-node Spark clusters
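As a minimal sketch of what this means in practice, the function below is written once against a pandas-style module and run here with plain pandas. The function name `total_by_category` and the sample data are hypothetical, made up for illustration. Assuming a working Spark 3.2 installation, passing `pyspark.pandas` instead of `pandas` should run the same logic on Spark.

```python
import pandas as pd


def total_by_category(pdlib, min_amount):
    """Sum amounts per category using whichever pandas-style module is passed in."""
    df = pdlib.DataFrame({
        "category": ["a", "b", "a", "c", "b"],
        "amount": [10, 5, 7, 1, 20],
    })
    # Keep only the rows at or above the threshold, then aggregate per category.
    filtered = df[df["amount"] >= min_amount]
    totals = filtered.groupby("category")["amount"].sum().sort_index()
    return totals.to_dict()


# Runs on a single machine with plain pandas.
result = total_by_category(pd, min_amount=5)
print(result)

# Assumption: with Spark 3.2 installed, the Spark-backed module can be
# swapped in without changing the function body:
#   import pyspark.pandas as ps
#   total_by_category(ps, min_amount=5)
```

Passing the module as an argument is only a device to keep the sketch short; in real code the usual change is the import line, from `import pandas as pd` to `import pyspark.pandas as ps`.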
Scalability beyond a single machine
One known limitation of pandas is that it does not scale to large data volumes, because it processes data on a single machine. pandas runs out of memory when it attempts to read a dataset that is larger than the memory available on that machine.
The pandas API on Spark scales well to large clusters: in the latest version of Spark it can scale almost linearly up to 256 nodes. It also improves single-machine performance.
Optimized single-machine performance
The pandas API on Spark outperforms pandas even on a single machine, thanks to optimizations in the Spark engine.
Both multi-threading and Spark SQL contribute to the optimized performance.
Spark has a significant advantage in chained operations. The Catalyst query optimizer can recognize filters and skip data that is not needed, whereas pandas tends to load all the required data into memory at each step.
Interactive data visualization
pandas uses matplotlib by default, which produces static charts. The pandas API on Spark 3.2 instead uses a plotly backend by default, which offers interactive charts that let users zoom in and out.
When generating interactive charts, the pandas API on Spark automatically determines the best way to execute the computation internally.
Leveraging unified analytics functionality in Spark
pandas was designed specifically for Python data science with batch processing. Spark was designed for unified analytics and includes SQL, stream processing and machine learning. With the pandas API on Spark:
- Users can query data directly through SQL
- SQL queries also support string interpolation syntax
- Users can easily call machine learning libraries
Spark 3.2 enables data scientists to move beyond batch analytics, which was the original focus of the pandas library.
Addition of RocksDB
A RocksDB-based state store implementation has been added in Spark 3.2. RocksDB is an embedded key-value database used in various projects, including Apache Kafka.
Databricks previously used RocksDB in its in-house implementation of Spark. It has now contributed the code back to the Apache Spark community, so that everybody can manage stateful data regardless of the size of the streaming application.
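The RocksDB state store is not switched on automatically; it is selected through a Structured Streaming configuration key. The sketch below shows that key and the provider class as documented for Spark 3.2. The helper `apply_conf` is a hypothetical convenience written for this example, not part of Spark; it only assumes a builder object with a `config(key, value)` method, which is what `SparkSession.builder` provides.

```python
# Configuration key and provider class as documented for Spark 3.2
# Structured Streaming.
ROCKSDB_STATE_STORE_CONF = {
    "spark.sql.streaming.stateStore.providerClass":
        "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider",
}


def apply_conf(builder, conf):
    """Hypothetical helper: apply each key/value pair to a builder and return it."""
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder


# Assumed usage on a machine with PySpark installed:
#   from pyspark.sql import SparkSession
#   spark = apply_conf(
#       SparkSession.builder.appName("stateful-stream"),
#       ROCKSDB_STATE_STORE_CONF,
#   ).getOrCreate()
```

The setting has to be in place before the streaming query starts, which is why the sketch applies it while the session is being built.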
Adaptive query execution (AQE)
The biggest improvement in Spark 3.2 is that AQE is enabled by default. AQE helps boost the performance of Spark workloads. It reduces the need to tune the number of shuffle partitions, dynamically switches join strategies, and dynamically optimizes skew joins to help avoid extreme imbalances in work.
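For readers who want to set these behaviours explicitly, for example to compare a job with and without AQE, the sketch below lists the main AQE configuration keys and renders them as `--conf` arguments for `spark-submit`. The keys are standard Spark SQL settings; the helper `to_submit_args` is a hypothetical function written for this example.

```python
# Main AQE switches. In Spark 3.2 these already default to "true";
# setting one to "false" turns that behaviour off.
AQE_CONF = {
    "spark.sql.adaptive.enabled": "true",                     # AQE as a whole
    "spark.sql.adaptive.coalescePartitions.enabled": "true",  # merge small shuffle partitions
    "spark.sql.adaptive.skewJoin.enabled": "true",            # split skewed join partitions
}


def to_submit_args(conf):
    """Hypothetical helper: turn a config dict into spark-submit --conf arguments."""
    args = []
    for key, value in conf.items():
        args.extend(["--conf", f"{key}={value}"])
    return args


print(" ".join(to_submit_args(AQE_CONF)))
```

The same keys can also be set on a running session through `spark.conf.set(key, value)`.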
We can expect the following from the next release of Spark:
- More type hints: The code in the pandas API on Spark is currently only partially typed. In the future, all of the code will be fully typed.
- Improvements in performance: There is still scope for further performance gains through closer interaction with the engine and the SQL optimizer.
- Stabilization: Several places still need to be fixed.
- More API coverage: The pandas API on Spark currently covers 83% of the pandas API, and coverage continues to increase. The target is now 90%.
Getting started with Ksolves Spark services
If you want to try out the pandas API on Spark, you have landed in the right place. Ksolves is one of the best Apache Spark development companies in India and the USA. We have years of experience in building futuristic applications. Our Spark experts are known across the globe for top-notch, timely delivery of projects. Give us a call or write your queries in the comments section.