Apache Spark was released as an open source project about 10 years ago. So many years after its initial release, it remains one of the core technologies in the world of big data. Today, many applications running in production are built with Spark. The reason is simple: it offers a unified engine for processing large amounts of data in very little time.
Continuing its legacy of making applications faster, easier, and smarter, Spark has released its newest version, 3.0.
This article lists the new features and improvements introduced in Apache Spark 3.0.
What is Apache Spark?
Apache Spark is a popular and widely used platform. It is a lightning-fast, fault-tolerant data processing engine that supports both batch and stream processing. Many organizations, including cloud vendors, have adopted it.
Spark ships with a wide variety of libraries for machine learning and graph algorithms, and also supports real-time streaming and SQL applications. The best thing about Spark is that applications can be written in Java, Scala, or Python. These Spark applications can run up to 10 times faster than equivalent MapReduce apps, depending on the workload.
Spark also features an API for distributed processing of both structured and unstructured data. Apache Spark is essentially a processing hub that can connect to many data sources.
Apache Spark 3.0 extends its scope with more than 3000 JIRA tickets resolved, and many more initiatives are planned for the future. These improvements have made Spark a great tool for ETL pipelines and streaming. Let’s look at some of the new features that make it worth the investment.
Exciting features in Apache Spark 3.0
Spark 3.0 has a number of new and exciting features and performance improvements. Let’s have a look:
- Adaptive Query Execution enhancements
Spark 3.0 has introduced two important improvements in Adaptive Query Execution (AQE) that simplify Spark parameter tuning:
- AQE can now combine small partitions so that users don’t need to worry much about the number of shuffle partitions. These partitions are adjusted dynamically at runtime.
- Once data skew is detected, AQE breaks the skewed partitions down into smaller ones.
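The two AQE behaviours above can be pictured with a small toy model in plain Python. This is only a sketch of the idea, not Spark's actual implementation: the function names, the partition sizes, and the greedy merging strategy are all assumptions made for illustration. In Spark 3.0 itself, AQE is switched on with the `spark.sql.adaptive.enabled` setting, and partition coalescing and skew handling are controlled by `spark.sql.adaptive.coalescePartitions.enabled` and `spark.sql.adaptive.skewJoin.enabled`.

```python
from statistics import median
from typing import List


def coalesce_partitions(sizes: List[int], target: int) -> List[List[int]]:
    """Greedily merge adjacent partitions so each group stays at or under `target`.

    A partition that is already larger than `target` ends up in a group of its own.
    """
    groups: List[List[int]] = []
    current: List[int] = []
    current_size = 0
    for size in sizes:
        # Close the current group if adding this partition would exceed the target.
        if current and current_size + size > target:
            groups.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups


def is_skewed(size: int, sizes: List[int], factor: float, threshold: int) -> bool:
    """Treat a partition as skewed if it is far above the median AND above a floor."""
    return size > factor * median(sizes) and size > threshold


def split_skewed(size: int, target: int) -> List[int]:
    """Split one oversized partition into chunks of at most `target`."""
    full_chunks, remainder = divmod(size, target)
    return [target] * full_chunks + ([remainder] if remainder else [])


if __name__ == "__main__":
    sizes = [10, 20, 30, 100, 5, 5]  # hypothetical shuffle partition sizes in MB
    print(coalesce_partitions(sizes, target=64))
    print(is_skewed(100, sizes, factor=5.0, threshold=64))
    print(split_skewed(100, target=64))
```

In this toy run, the three small leading partitions are merged into one group, the two trailing ones into another, and the 100 MB partition is flagged as skewed and split into two chunks. Spark makes comparable decisions from real shuffle statistics while the query is running.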
- Improvements on Pandas User-Defined Functions API
Spark’s newest release introduces a new interface for Pandas UDFs that is based on Python type hints. The type hints make the input and output of each UDF explicit, which helps eliminate confusion among developers.
With this new interface, the Pandas UDF type is inferred from the Python type hints given in the function definition.
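To give a feel for how a UDF type can be inferred from type hints, here is a minimal sketch in plain Python. It is not PySpark's code: `Series` is a placeholder class standing in for `pandas.Series` so that the example runs without pandas or Spark installed, and `infer_udf_kind` and the three sample functions are hypothetical names. In real PySpark 3.0 code you would decorate a function such as `def plus_one(s: pd.Series) -> pd.Series` with `@pandas_udf("long")` and let Spark do the inference.

```python
import inspect
from typing import Callable, Iterator, get_type_hints


class Series:
    """Placeholder for pandas.Series so the sketch runs without pandas installed."""


def infer_udf_kind(func: Callable) -> str:
    """Classify a function by its type hints (toy version of the idea)."""
    hints = get_type_hints(func)
    return_hint = hints.pop("return", None)
    arg_hints = list(hints.values())
    param_count = len(inspect.signature(func).parameters)
    # Every parameter and the return value must carry a type hint.
    if not arg_hints or return_hint is None or len(arg_hints) != param_count:
        raise ValueError("every argument and the return value need a type hint")
    if all(hint is Series for hint in arg_hints):
        if return_hint is Series:
            return "series to series"
        if return_hint in (int, float, str, bool):
            return "series to scalar"
    if arg_hints == [Iterator[Series]] and return_hint == Iterator[Series]:
        return "iterator of series to iterator of series"
    raise ValueError(f"unsupported type hints on {func.__name__}")


# Only the signatures matter here; the bodies are dummies.
def plus_one(values: Series) -> Series:
    return values


def average(values: Series) -> float:
    return 0.0


def per_batch(batches: Iterator[Series]) -> Iterator[Series]:
    return batches


if __name__ == "__main__":
    for udf in (plus_one, average, per_batch):
        print(udf.__name__, "->", infer_udf_kind(udf))
```

The point of the design is that the function signature alone carries the information that previously had to be passed as a separate UDF type argument.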
- New user interface for Structured Streaming
Apache Spark 3.0’s web UI comes with an extra tab dedicated to Structured Streaming, which simplifies the monitoring of streaming jobs.
This extra tab displays the scheduling delay and processing time for each micro-batch, which can be useful for troubleshooting streaming applications.
- More than 30 new built-in functions
Apache Spark 3.0 comes with more than 30 new built-in functions added to the Scala API. These include bit counts, hyperbolic functions, CSV operations, and many more. Functions specific to MAP have also been added to simplify the processing of MAP data types.
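To show what a few of these function families compute, here are plain-Python equivalents. They are illustrative stand-ins, not Spark API calls: the helper names below are chosen to echo Spark functions such as `bit_count`, `from_csv`, `transform_values`, and `map_filter`, and the inputs are made up for the example.

```python
import csv
import io
import math
from typing import Callable, Dict, List


def bit_count(value: int) -> int:
    """Number of set bits in a non-negative integer."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative integers")
    return bin(value).count("1")


def parse_csv_row(line: str, names: List[str]) -> Dict[str, str]:
    """Parse one CSV line into a record keyed by the given column names."""
    row = next(csv.reader(io.StringIO(line)))
    return dict(zip(names, row))


def transform_values(mapping: Dict, func: Callable) -> Dict:
    """Apply func(key, value) to every value of a map, keeping the keys."""
    return {key: func(key, value) for key, value in mapping.items()}


def map_filter(mapping: Dict, predicate: Callable) -> Dict:
    """Keep only the entries for which predicate(key, value) is true."""
    return {key: value for key, value in mapping.items() if predicate(key, value)}


if __name__ == "__main__":
    print(bit_count(13))                      # 13 is 0b1101
    print(math.asinh(0.0))                    # inverse hyperbolic sine
    print(parse_csv_row("1,spark", ["id", "name"]))
    print(transform_values({"a": 1, "b": 2}, lambda k, v: v * 10))
    print(map_filter({"a": 1, "b": 2}, lambda k, v: v > 1))
```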
- Hydrogen: Deep Learning improvements
We know that AI/ML models perform well with massive amounts of data. This makes compatibility between data processing frameworks and deep learning frameworks important.
Project Hydrogen is a Spark initiative that aims to unify big data processing and machine learning. It is divided into three subprojects:
- Barrier execution
- Optimized data exchange
- Accelerator-aware scheduling
- ANSI SQL compliance
In this release, Spark has introduced the ANSI store assignment policy for table insertion. It has also added runtime overflow checking and switched to the Proleptic Gregorian calendar, the widely used calendar that the SQL standard follows.
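To illustrate what runtime checking of arithmetic overflow means in practice, the sketch below contrasts silent wraparound of a 32-bit integer with a checked addition that raises an error. It is a plain-Python toy, not Spark code: the function names are hypothetical. In Spark 3.0 the checked behaviour belongs to ANSI mode, which is controlled by the `spark.sql.ansi.enabled` setting.

```python
INT_MIN = -2**31
INT_MAX = 2**31 - 1


def add_wrapping(a: int, b: int) -> int:
    """Add two 32-bit integers, wrapping around silently on overflow."""
    return (a + b - INT_MIN) % 2**32 + INT_MIN


def add_checked(a: int, b: int) -> int:
    """Add two 32-bit integers, raising an error instead of wrapping."""
    result = a + b
    if not INT_MIN <= result <= INT_MAX:
        raise ArithmeticError("integer overflow")
    return result


if __name__ == "__main__":
    print(add_wrapping(INT_MAX, 1))   # wraps around to INT_MIN
    try:
        add_checked(INT_MAX, 1)
    except ArithmeticError as error:
        print("checked addition failed:", error)
```

The checked variant turns a silently wrong result into a visible failure, which is the main benefit of overflow checking at runtime.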
This newest version of Spark comes with ample features and performance improvements, and upgrading to it is a wise choice. We have covered only some of the features here; for the full list, you can contact us.
Ksolves is one of the best Apache Spark development companies in India and the USA and has delivered many projects across the globe. Our stalwart team of Spark developers is highly skilled and experienced in delivering the best possible solution customized to your requirements. If you would like more information on Apache Spark, do write to us in the comments or give us a call.
Contact Us for any Query
Email : firstname.lastname@example.org
Call : +91 8130704295