Migrated Legacy Data Pipelines to PySpark & Apache Airflow for a Telecom Operator
A North African telecom operator was running critical network, billing, and CDR processing workloads on single-threaded Java ETL pipelines that could no longer keep pace with subscriber data volumes. Processing windows were regularly breached, the architecture had no horizontal scaling path, and the accumulated transformation logic in the Java codebase carried enough complexity that any migration attempt risked breaking the billing and reporting outputs that downstream systems depended on. Ksolves was engaged to deliver a structured Java-to-PySpark migration with Apache Airflow orchestration, replicating all existing transformation logic with record-level output parity, replacing manual cron scheduling with dependency-aware DAGs, and giving the operator a distributed platform built to scale with continued subscriber growth.
The client came to Ksolves with four compounding problems that had blocked modernisation and were worsening as data volumes continued to grow:
- Legacy Java Pipelines Were Slow and Could Not Scale Horizontally: The existing ETL pipelines ran as single-threaded Java processes designed for data volumes far below current production levels. As the subscriber base grew and CDR volumes increased, pipelines took longer to complete, billing and network reporting windows were regularly breached, and the architecture had no mechanism to distribute processing across additional compute nodes.
- No In-House Spark Expertise to Execute the Migration Safely: The organisation had no engineers with PySpark or Apache Spark experience. Without deep Spark knowledge, any internal migration attempt risked producing jobs that were syntactically correct but behaviourally wrong, generating incorrect billing outputs or network metrics that would surface only as errors in downstream billing and operational systems.
- Pipeline Scheduling Had No Dependency Modelling or Failure Recovery: All scheduling ran through cron jobs with no dependency awareness, no failure alerting, and no automatic retry capability. When a pipeline failed mid-execution, the failure was typically discovered through billing discrepancies or missing reports rather than proactive monitoring, requiring manual reruns that delayed downstream billing cycles.
- Migration Carried High Business Risk of Breaking Existing Logic: The Java pipelines contained years of accumulated telecom-specific transformation logic covering CDR field mappings, subscriber aggregation rules, billing computation steps, and network metric calculations. Any deviation in the migrated PySpark jobs would produce silent billing errors or incorrect network reports, with potentially significant revenue and regulatory consequences.
Ksolves ensured zero business logic changes during migration by validating every PySpark job against existing Java outputs through side-by-side execution and record-level parity checks. Pipelines moved to Spark only after validation, while Apache Airflow DAGs replaced legacy cron-based scheduling for improved orchestration and reliability.
- PySpark Migration with Validated Business Logic Parity: Each legacy Java ETL pipeline was analysed, decomposed, and rewritten as a PySpark job, replacing verbose single-threaded Java code with concise DataFrame transformations while preserving all existing business logic exactly. Before any migrated job was promoted to production, it ran in parallel with the Java original on identical input datasets, and outputs were compared at the record level to confirm parity. No pipeline was cut over until that confirmation was in hand.
- Horizontal Scalability Through Spark Distributed Execution: Moving from Java to PySpark gave the pipelines native horizontal scaling across a Spark cluster, enabling processing to be distributed across multiple executors in parallel. Jobs that previously ran as single-threaded sequential processes could now use all available cluster compute, removing the throughput ceiling that had caused processing window breaches as data volumes grew.
- Apache Airflow DAG-Based Orchestration: Manual cron scheduling was replaced entirely with Apache Airflow DAGs that modelled each pipeline's dependencies explicitly. Dependency-aware scheduling ensured downstream jobs never ran against incomplete upstream outputs, a failure mode that had previously caused silent data quality issues. Automatic retry policies and SLA alerting meant failures were detected and addressed in near real time, eliminating the manual monitoring and rerun overhead that had consumed engineering capacity under the cron-based approach.
- Staged Cutover with Zero Production Downtime: Pipelines were migrated and validated in batches, moving to PySpark and Airflow one pipeline group at a time, with the Java originals kept on standby until parity was fully confirmed. This staged approach meant that any unexpected discrepancy in a migrated job could be rolled back without affecting other pipelines or causing production downtime, keeping the impact of any single migration step within a manageable scope.
Technology Stack
| CATEGORY | TECHNOLOGY | ROLE IN THIS ENGAGEMENT |
|---|---|---|
| Processing | Apache Spark (PySpark) | Replaced the legacy Java ETL codebase with distributed PySpark DataFrame jobs, enabling horizontal scaling across a Spark cluster while preserving all existing business transformation logic with record-level output parity. |
| Orchestration | Apache Airflow | Replaced manual cron scheduling with DAG-based orchestration, modelling all pipeline dependencies explicitly, enabling automatic task retry on failure, and providing SLA alerting across every migrated job. |
| Methodology | Logic-Parity Migration Framework | Validated every migrated PySpark job against its Java counterpart through side-by-side execution and record-level comparison before cutover, providing a verifiable zero-breakage guarantee across the full migration. |
| Migration | Staged Batch Cutover | Migrated and validated pipelines in batches with Java originals on standby, allowing any discrepancy to be rolled back in isolation without affecting other pipelines or causing production downtime. |
| Monitoring | Airflow SLA Alerting and Retry | Provides real-time pipeline health visibility through structured failure alerting, automatic retry policies, and SLA breach notifications, replacing a reactive cron regime where failures surfaced through billing discrepancies hours after the fact. |
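The orchestration and monitoring rows above map onto standard Airflow constructs. A minimal DAG sketch is shown below as a configuration fragment; the DAG ID, task names, and `spark-submit` scripts are hypothetical, and the retry counts and SLA window are illustrative rather than the values used in the engagement.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

# Illustrative retry and SLA policy applied to every task in the DAG.
default_args = {
    "retries": 3,
    "retry_delay": timedelta(minutes=10),
    "sla": timedelta(hours=2),  # breach triggers Airflow SLA alerting
}

with DAG(
    dag_id="cdr_billing_pipeline",          # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    default_args=default_args,
    catchup=False,
) as dag:
    ingest = BashOperator(
        task_id="ingest_cdrs",
        bash_command="spark-submit ingest_cdrs.py",       # hypothetical job
    )
    aggregate = BashOperator(
        task_id="aggregate_usage",
        bash_command="spark-submit aggregate_usage.py",   # hypothetical job
    )
    billing = BashOperator(
        task_id="compute_billing",
        bash_command="spark-submit compute_billing.py",   # hypothetical job
    )

    # Explicit dependency chain: billing never runs against incomplete
    # upstream outputs, unlike independently fired cron jobs.
    ingest >> aggregate >> billing
```

The `>>` chain is what replaces cron's implicit time-based ordering: a downstream task is scheduled only after its upstream task succeeds, and a failed task is retried automatically under the `default_args` policy before any alert escalates.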
- Pipeline Processing Time Reduced: Horizontal Scaling Enabled: PySpark jobs now distribute across cluster executors, removing the throughput ceiling that caused regular processing window breaches as subscriber and CDR volumes grew. (target: validate against production runtime logs)
- Zero Business Logic Breakage: 100% Output Parity Confirmed: Record-level parity validation confirmed identical outputs between every Java original and its PySpark replacement before cutover, making the zero-breakage guarantee verifiable rather than assumed. (target: confirm validation methodology)
- Pipeline Failures Now Auto-Detected and Auto-Retried: Airflow DAGs with retry policies and SLA alerting detect and recover from failures in near real time, replacing a cron regime where failures surfaced through billing discrepancies hours after the fact.
- Dependency-Aware Scheduling Eliminates Downstream Data Errors: Airflow DAG dependencies ensure downstream jobs run only after upstream jobs complete successfully, eliminating the incorrect aggregations that cron-scheduled jobs produced when run against incomplete inputs. (target)
- Engineering Capacity Freed from Manual Pipeline Management: Automated retry and alerting remove the weekly manual monitoring and rerun workload, redirecting engineering effort from reactive fire-fighting to product development. (target: validate hours saved per week)
“Our billing pipelines were running on Java code that nobody wanted to touch because the risk of breaking something was too high. Ksolves migrated everything to Spark, proved the outputs were identical, and we finally have a platform that can scale with our subscriber growth.”
- VP Technology, North African Telecom Operator
Before this engagement, the operator’s billing, CDR, and network ETL workloads ran on single-threaded Java pipelines with cron scheduling, no horizontal scaling, no failure recovery, and a migration risk high enough that modernisation had been repeatedly deferred. Today, Ksolves has delivered distributed PySpark jobs orchestrated by Apache Airflow DAGs that replicate all existing business transformation logic with record-level output parity, at horizontal scale, with automated retry and dependency-aware scheduling. The parity validation framework made the zero-breakage guarantee verifiable rather than assumed, and Airflow’s dependency modelling eliminated an entire class of data quality errors that cron scheduling could not prevent. For telecom operators and data engineering teams still running critical ETL on legacy batch frameworks, explore Ksolves Big Data Services to see what a structured migration can deliver.
Is Your Team Still Running Critical ETL on Legacy Java or Other Batch Frameworks That Cannot Scale?