Project Name

Data Ingestion Engine

Industry
Finance
Technology
Apache Spark, Scala, Apache Kafka, Delta Lake

Overview

Our client needed a robust system to handle high-volume streaming data workloads efficiently. The system extracts data from deeply nested structures and transforms it into a simplified format that is quickly usable for a variety of purposes. This efficiency enables seamless handling of large data volumes and empowers the client to make data-driven decisions with ease.


Challenges

  • The client faced a continuous stream of large, deeply nested, intricately structured data.
  • Crucial information had to be extracted from this data, transformed into a more manageable form, and ultimately loaded into a data source.
  • Their existing Java-based microservices system was too slow to process such enormous data volumes.
  • Making the code adaptable enough to process approximately 30 different JSON structures was difficult.
  • Each new JSON type required writing extensive amounts of new code, making development time-consuming and cumbersome.

To overcome these obstacles and meet their evolving needs, the client sought a more robust, highly efficient, scalable, fast, and fault-tolerant system that could handle these diverse data requirements seamlessly.

Our Solution

  • Developed a Spark-based processing engine driven entirely by metadata configuration. The engine can ingest varied JSON structures through configuration changes alone, without any alterations to the underlying code (see the mapping sketch after this list).
  • Leveraged Spark’s built-in parallel processing capabilities to minimize processing times for substantial data sets.
  • Integrated Apache Spark with the existing Apache Kafka data pipeline, providing real-time streaming support and fault tolerance (see the streaming sketch after this list).
  • Used Spark SQL, whose familiar SQL-like syntax simplifies the data transformation process and increases efficiency.
  • Applied Apache Spark’s robust JSON reading methods with schema validation to streamline data transformation from JSON files.
  • Achieved code stability by defining transformations in a mapping file, eliminating repetitive coding.
  • Streamlined the onboarding of new JSON types: each one now requires only a new mapping sheet.
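To illustrate the metadata-driven approach, here is a minimal Scala sketch. The FieldMapping case class, the field expressions, and the file path are hypothetical stand-ins for the client’s actual mapping sheets; the key idea is that each output column is declared as a Spark SQL expression in configuration, so supporting a new JSON type means adding mapping rows rather than writing new code.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.expr

// Hypothetical mapping entry: one row of a mapping sheet.
// `sourceExpr` is a Spark SQL expression over the raw JSON columns,
// `targetColumn` is the flattened output column name.
case class FieldMapping(sourceExpr: String, targetColumn: String)

object MetadataDrivenFlatten {

  // Apply every mapping as a Spark SQL `expr`, so ingesting a new JSON
  // type is a configuration change, not a code change.
  def flatten(raw: DataFrame, mappings: Seq[FieldMapping]): DataFrame =
    raw.select(mappings.map(m => expr(m.sourceExpr).as(m.targetColumn)): _*)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("metadata-driven-flatten")
      .master("local[*]")
      .getOrCreate()

    // Mappings would normally be loaded from a mapping file;
    // hard-coded here to keep the sketch self-contained.
    val orderMappings = Seq(
      FieldMapping("payload.order.id",            "order_id"),
      FieldMapping("payload.order.customer.name", "customer_name"),
      FieldMapping("payload.order.items[0].sku",  "first_item_sku")
    )

    val raw = spark.read.json("path/to/orders/*.json") // hypothetical path
    flatten(raw, orderMappings).show()
  }
}
```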
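And a minimal sketch of how the streaming side might look with Spark Structured Streaming, Kafka, and Delta Lake. The broker address, topic name, schema, and paths are assumptions for illustration; the pattern of from_json with an explicit schema plus a checkpointed Delta sink is what provides the schema validation and fault tolerance described above.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types._

object KafkaJsonIngest {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("kafka-json-ingest")
      .getOrCreate()

    // Explicit schema: from_json yields null fields for records that
    // do not conform, which makes malformed payloads easy to quarantine.
    val orderSchema = new StructType()
      .add("id", StringType)
      .add("amount", DoubleType)
      .add("ts", TimestampType)

    val parsed = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092") // hypothetical broker
      .option("subscribe", "orders")                    // hypothetical topic
      .load()
      .select(from_json(col("value").cast("string"), orderSchema).as("order"))
      .select("order.*")

    // Checkpointing gives fault-tolerant, exactly-once delivery into Delta Lake.
    parsed.writeStream
      .format("delta")
      .option("checkpointLocation", "/chk/orders") // hypothetical path
      .start("/delta/orders")                      // hypothetical table path
      .awaitTermination()
  }
}
```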

Data Flow Diagram


Conclusion

With Apache Spark and Apache Kafka, and Scala as the backend programming language, we crafted a system that tackled all the challenges the client faced in large-scale data operations, covering extraction, transformation, and loading. The solution proved instrumental in efficient data processing and also excelled in resource management, delivering a highly beneficial outcome for our client.

Streamline Your Business with Our Data Streaming Solutions