Spark’s Version 3.2: Closer Hooks to Pandas, SQL

November 11, 2021

The Apache Spark community released Spark 3.2 on October 13, 2021. It is a significant release of the distributed computing framework, with deeper support for the Python data ecosystem, including the addition of the pandas API.

Let us look at the new features Spark 3.2 has to offer.

Introduction

Over the past few years, Python usage has exploded in both the data science and data engineering communities. For many years, the Apache Spark community has been working to make the two environments work together.

In 2019, Databricks released the Koalas project, which implemented the pandas DataFrame API on top of Spark. With the pandas API now part of Spark 3.2, a third-party library is no longer needed for this.

Apache Spark 3.2 additions

Spark 3.2 takes pandas support a notch higher.

  • People working in pandas can now scale their pandas applications
  • They can take advantage of multi-node Spark clusters
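For readers coming from pandas, the switch can be sketched roughly as follows. This is a minimal illustration, assuming PySpark 3.2 or later is installed. The helper names `dataframe_lib` and `total_by_city` and the `city` and `amount` columns are invented for this example. The imports are deferred so the functions can be defined on a machine without Spark.

```python
def dataframe_lib(distributed: bool):
    """Return pyspark.pandas (Spark 3.2+) or plain pandas.

    Both modules expose a pandas-style DataFrame, so the calling code
    below stays the same. The helper name is illustrative only.
    """
    if distributed:
        import pyspark.pandas as xpd  # pandas API on Spark
    else:
        import pandas as xpd  # single-machine pandas
    return xpd


def total_by_city(data: dict, distributed: bool = False):
    """Sum the hypothetical 'amount' column for each 'city'."""
    xpd = dataframe_lib(distributed)
    df = xpd.DataFrame(data)
    # The same groupby/sum/sort_index chain works in both libraries.
    return df.groupby("city")["amount"].sum().sort_index()
```

Calling `total_by_city({"city": ["Pune", "Delhi", "Pune"], "amount": [10, 20, 5]}, distributed=True)` would run the aggregation on Spark, while `distributed=False` keeps it in plain pandas.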

Scalability beyond a single machine

A known limitation of pandas is that it does not scale to large data volumes, because it processes data on a single machine. pandas runs out of memory when it tries to read a dataset larger than the memory available on that machine.

The pandas API in Spark 3.2 scales well to large clusters and is reported to scale almost linearly up to 256 nodes.

Optimized single-machine performance

The pandas API on Spark can outperform pandas even on a single machine. The credit goes to the optimizations in the Spark engine.

Both multi-threading and the Spark SQL engine contribute to this performance.

Spark has a significant advantage in chained operations. The Catalyst query optimizer can recognize filters and skip data that is not needed, whereas pandas tends to load all the data into memory at each step.
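A chained filter and column selection can be sketched like this. It is a rough illustration, assuming PySpark 3.2 or later and a Parquet dataset. The function name, the `path` argument and the `order_id` and `amount` columns are hypothetical.

```python
def large_orders(path: str, min_amount: float):
    """Filter and project a Parquet dataset with the pandas API on Spark.

    'path', 'order_id' and 'amount' are hypothetical names.
    """
    import pyspark.pandas as ps  # requires Spark 3.2+

    orders = ps.read_parquet(path)  # builds a plan; rows are not pulled into memory here
    big = orders[orders["amount"] > min_amount]  # filter step
    result = big[["order_id", "amount"]]  # column selection step
    result.spark.explain()  # print the Spark plan for the whole chain
    return result
```

Because the steps are planned together, the printed plan will often show the filter and the two selected columns applied at the Parquet scan, rather than as separate full passes over the data.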

Interactive data visualization

pandas uses matplotlib by default, which produces static charts. The pandas API on Spark 3.2 uses a plotly backend by default, which produces interactive charts that users can zoom in and out of.

When generating interactive charts, the pandas API on Spark automatically determines the best way to execute the computation internally.
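A small plotting sketch, assuming PySpark 3.2 or later with the plotly package installed. The function name and the `month` and `sales` columns are made up for illustration.

```python
def monthly_sales_chart(data: dict):
    """Build an interactive bar chart from hypothetical 'month'/'sales' data."""
    import pyspark.pandas as ps  # requires Spark 3.2+ and plotly

    # plotly is already the default backend; it is set here only for clarity.
    ps.set_option("plotting.backend", "plotly")
    psdf = ps.DataFrame(data)
    # With the plotly backend this returns a plotly figure object.
    return psdf.plot.bar(x="month", y="sales")
```

In a notebook, calling `.show()` on the returned figure should render the interactive chart.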

Leveraging unified analytics functionality in Spark

Pandas was specifically designed for Python data science with batch processing. Spark was designed for unified analytics and includes SQL, stream processing and machine learning. 

  • Users can query data directly through SQL
  • SQL queries support string interpolation syntax, so they can refer to Python objects
  • Users can easily call Spark's machine learning libraries

Spark 3.2 enables data scientists to move beyond batch analytics, which was the original focus of pandas.
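The SQL and string interpolation points above can be sketched with `ps.sql`. This is an illustrative example, assuming PySpark 3.2 or later. The function name and the `city` and `amount` columns are hypothetical, and passing the objects as keyword arguments is the form assumed here.

```python
def orders_above(orders, min_amount: float):
    """Query a pandas-on-Spark DataFrame with SQL.

    'orders' is expected to be a pyspark.pandas DataFrame with
    hypothetical 'city' and 'amount' columns.
    """
    import pyspark.pandas as ps  # requires Spark 3.2+

    # {orders} and {min_amount} are filled in from the keyword arguments.
    return ps.sql(
        "SELECT city, amount FROM {orders} WHERE amount > {min_amount}",
        orders=orders,
        min_amount=min_amount,
    )
```

The result is again a pandas-on-Spark DataFrame, so it can be passed on to further pandas-style operations.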

Addition of RocksDB

A RocksDB-based state store implementation has been added in Spark 3.2. RocksDB is an embedded key-value database used in various projects, including Apache Kafka.

Databricks previously used RocksDB in its own implementation of Spark. It has now contributed the code back to the Apache Spark community, so that everyone can manage streaming state regardless of the size of the streaming application.
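Switching Structured Streaming to the RocksDB state store is a configuration change. A minimal sketch, assuming an existing `SparkSession` named `spark`. The helper name `use_rocksdb_state_store` is invented for this example.

```python
# Provider class for the RocksDB state store shipped with Spark 3.2.
ROCKSDB_PROVIDER = (
    "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider"
)


def use_rocksdb_state_store(spark):
    """Point Structured Streaming state storage at RocksDB for this session."""
    spark.conf.set(
        "spark.sql.streaming.stateStore.providerClass", ROCKSDB_PROVIDER
    )
    return spark
```

The setting should be applied before a streaming query starts, so that the query's state is created with the RocksDB provider.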

Adaptive query execution (AQE)

The biggest improvement in Spark 3.2 is that AQE is now enabled by default. AQE boosts the performance of Spark workloads: it reduces the need to tune the number of shuffle partitions, dynamically switches join strategies, and dynamically optimizes skew joins to avoid extreme imbalances in work.
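The AQE behaviours described above map to Spark SQL configuration keys. The sketch below sets them explicitly, which is mainly useful on older versions or when a cluster has overridden the defaults. It assumes an existing `SparkSession` named `spark`, and the helper name `apply_aqe_settings` is illustrative.

```python
# Spark SQL settings behind the AQE features described above.
AQE_SETTINGS = {
    "spark.sql.adaptive.enabled": "true",  # master switch, default on in 3.2
    "spark.sql.adaptive.coalescePartitions.enabled": "true",  # merge small shuffle partitions
    "spark.sql.adaptive.skewJoin.enabled": "true",  # split skewed join partitions
}


def apply_aqe_settings(spark):
    """Apply the AQE settings to a session and return the values read back."""
    for key, value in AQE_SETTINGS.items():
        spark.conf.set(key, value)
    return {key: spark.conf.get(key) for key in AQE_SETTINGS}
```

Setting `spark.sql.adaptive.enabled` to `"false"` instead turns AQE off, which can help when comparing query plans with and without it.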

What’s next?

We can expect the following from upcoming releases of Spark:

  • More type hints: The pandas API code is currently only partially typed. In the future, all of it will be fully typed.
  • Improvements in performance: There is still scope for further gains through closer interaction with the engine and the SQL optimizer.
  • Stabilization: Several places still need to be fixed.
  • More API coverage: The pandas API on Spark currently covers 83% of the pandas API, and coverage continues to increase. The target is now 90%.

Getting started with Ksolves Spark services

If you want to try out pandas API on Spark, you have landed in the right place. Ksolves is one of the best Apache Spark development companies in India and the USA. We have years of experience in building futuristic applications. Our Spark experts are known for their top-notch and timely delivery of projects across the globe. Give us a call or write your queries in the comments section. 

Contact Us for any Query

Email : sales@ksolves.com

Call : +91 8130704295

Author: Vaishali Bhatt