Migrated Legacy Data Pipelines to PySpark & Apache Airflow for a Telecom Operator
A North African telecom operator was running critical network, billing, and CDR processing workloads on single-threaded Java ETL pipelines that could no longer keep pace with subscriber data volumes. Processing windows were regularly breached, the architecture had no horizontal scaling path, and the accumulated transformation logic in the Java codebase carried enough complexity that any migration attempt risked breaking the billing and reporting outputs that downstream systems depended on. Ksolves was engaged to deliver a structured Java-to-PySpark migration with Apache Airflow orchestration, replicating all existing transformation logic with record-level output parity, replacing manual cron scheduling with dependency-aware DAGs, and giving the operator a distributed platform built to scale with continued subscriber growth.
The client came to Ksolves with four compounding problems that had blocked modernisation and were worsening as data volumes continued to grow:
- Legacy Java Pipelines Were Slow and Could Not Scale Horizontally: The existing ETL pipelines ran as single-threaded Java processes designed for data volumes far below current production levels. As the subscriber base grew and CDR volumes increased, pipelines took longer to complete, billing and network reporting windows were regularly breached, and the architecture had no mechanism to distribute processing across additional compute nodes.
- No In-House Spark Expertise to Execute the Migration Safely: The organisation had no engineers with PySpark or Apache Spark experience. Without deep Spark knowledge, any internal migration attempt risked producing jobs that were syntactically correct but behaviourally wrong, generating incorrect billing outputs or network metrics that would surface only as errors in downstream billing and operational systems.
- Pipeline Scheduling Had No Dependency Modelling or Failure Recovery: All scheduling ran through cron jobs with no dependency awareness, no failure alerting, and no automatic retry capability. When a pipeline failed mid-execution, the failure was typically discovered through billing discrepancies or missing reports rather than proactive monitoring, requiring manual reruns that delayed downstream billing cycles.
- Migration Carried High Business Risk of Breaking Existing Logic: The Java pipelines contained years of accumulated telecom-specific transformation logic covering CDR field mappings, subscriber aggregation rules, billing computation steps, and network metric calculations. Any deviation in the migrated PySpark jobs would produce silent billing errors or incorrect network reports, with potentially significant revenue and regulatory consequences.
Ksolves ensured zero business logic changes during migration by validating every PySpark job against existing Java outputs through side-by-side execution and record-level parity checks. Pipelines moved to Spark only after validation, while Apache Airflow DAGs replaced legacy cron-based scheduling for improved orchestration and reliability.
- PySpark Migration with Validated Business Logic Parity: Each legacy Java ETL pipeline was analysed, decomposed, and rewritten as a PySpark job, replacing verbose single-threaded Java code with concise DataFrame transformations while preserving all existing business logic exactly. Before any migrated job was promoted to production, it ran in parallel with the Java original on identical input datasets, and outputs were compared at the record level to confirm parity. No pipeline was cut over until that confirmation was in hand.
- Horizontal Scalability Through Spark Distributed Execution: Moving from Java to PySpark gave the pipelines native horizontal scaling across a Spark cluster, enabling processing to be distributed across multiple executors in parallel. Jobs that previously ran as single-threaded sequential processes could now use all available cluster compute, removing the throughput ceiling that had caused processing window breaches as data volumes grew.
- Apache Airflow DAG-Based Orchestration: Manual cron scheduling was replaced entirely with Apache Airflow DAGs that modelled each pipeline's dependencies explicitly. Dependency-aware scheduling ensured downstream jobs never ran against incomplete upstream outputs, a failure mode that had previously caused silent data quality issues. Automatic retry policies and SLA alerting meant failures were detected and addressed in near real time, eliminating the manual monitoring and rerun overhead that had consumed engineering capacity under the cron-based approach.
- Staged Cutover with Zero Production Downtime: Pipelines were migrated and validated in batches, moving to PySpark and Airflow one pipeline group at a time, with the Java originals kept on standby until parity was fully confirmed. This staged approach meant that any unexpected discrepancy in a migrated job could be rolled back without affecting other pipelines or causing production downtime, keeping the impact of any single migration step within a manageable scope.
Technology Stack
| CATEGORY | TECHNOLOGY | ROLE IN THIS ENGAGEMENT |
|---|---|---|
| Processing | Apache Spark (PySpark) | Replaced the legacy Java ETL codebase with distributed PySpark DataFrame jobs, enabling horizontal scaling across a Spark cluster while preserving all existing business transformation logic with record-level output parity. |
| Orchestration | Apache Airflow | Replaced manual cron scheduling with DAG-based orchestration, modelling all pipeline dependencies explicitly, enabling automatic task retry on failure, and providing SLA alerting across every migrated job. |
| Methodology | Logic-Parity Migration Framework | Validated every migrated PySpark job against its Java counterpart through side-by-side execution and record-level comparison before cutover, providing a verifiable zero-breakage guarantee across the full migration. |
| Migration | Staged Batch Cutover | Migrated and validated pipelines in batches with Java originals on standby, allowing any discrepancy to be rolled back in isolation without affecting other pipelines or causing production downtime. |
| Monitoring | Airflow SLA Alerting and Retry | Provides real-time pipeline health visibility through structured failure alerting, automatic retry policies, and SLA breach notifications, replacing a reactive cron regime where failures surfaced through billing discrepancies hours after the fact. |
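The orchestration and monitoring rows above map onto standard Airflow constructs. A minimal DAG sketch is shown below as a configuration fragment; the DAG ID, task names, and `spark-submit` scripts are hypothetical, and the retry counts and SLA window are illustrative rather than the values used in the engagement.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

# Illustrative retry and SLA policy applied to every task in the DAG.
default_args = {
    "retries": 3,
    "retry_delay": timedelta(minutes=10),
    "sla": timedelta(hours=2),  # breach triggers Airflow SLA alerting
}

with DAG(
    dag_id="cdr_billing_pipeline",          # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    default_args=default_args,
    catchup=False,
) as dag:
    ingest = BashOperator(
        task_id="ingest_cdrs",
        bash_command="spark-submit ingest_cdrs.py",       # hypothetical job
    )
    aggregate = BashOperator(
        task_id="aggregate_usage",
        bash_command="spark-submit aggregate_usage.py",   # hypothetical job
    )
    billing = BashOperator(
        task_id="compute_billing",
        bash_command="spark-submit compute_billing.py",   # hypothetical job
    )

    # Explicit dependency chain: billing never runs against incomplete
    # upstream outputs, unlike independently fired cron jobs.
    ingest >> aggregate >> billing
```

The `>>` chain is what replaces cron's implicit time-based ordering: a downstream task is scheduled only after its upstream task succeeds, and a failed task is retried automatically under the `default_args` policy before any alert escalates.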
- Pipeline Processing Time Reduced: Horizontal Scaling Enabled: PySpark jobs now distribute across cluster executors, removing the throughput ceiling that caused regular processing window breaches as subscriber and CDR volumes grew. (target: validate against production runtime logs)
- Zero Business Logic Breakage: 100% Output Parity Confirmed: Record-level parity validation confirmed identical outputs between every Java original and its PySpark replacement before cutover, making the zero-breakage guarantee verifiable rather than assumed. (target: confirm validation methodology)
- Pipeline Failures Now Auto-Detected and Auto-Retried: Airflow DAGs with retry policies and SLA alerting detect and recover from failures in near real time, replacing a cron regime where failures surfaced through billing discrepancies hours after the fact.
- Dependency-Aware Scheduling Eliminates Downstream Data Errors: Airflow DAG dependencies ensure downstream jobs run only after upstream jobs complete successfully, eliminating the incorrect aggregations that cron-scheduled jobs produced when run against incomplete inputs. (target)
- Engineering Capacity Freed from Manual Pipeline Management: Automated retry and alerting remove the weekly manual monitoring and rerun workload, redirecting engineering effort from reactive fire-fighting to product development. (target: validate hours saved per week)
“Our billing pipelines were running on Java code that nobody wanted to touch because the risk of breaking something was too high. Ksolves migrated everything to Spark, proved the outputs were identical, and we finally have a platform that can scale with our subscriber growth.”
- VP Technology, North African Telecom Operator
Before this engagement, the operator’s billing, CDR, and network ETL workloads ran on single-threaded Java pipelines with cron scheduling, no horizontal scaling, no failure recovery, and a migration risk high enough that modernisation had been repeatedly deferred. Today, Ksolves has delivered distributed PySpark jobs orchestrated by Apache Airflow DAGs that replicate all existing business transformation logic with record-level output parity, at horizontal scale, with automated retry and dependency-aware scheduling. The parity validation framework made the zero-breakage guarantee verifiable rather than assumed, and Airflow’s dependency modelling eliminated an entire class of data quality errors that cron scheduling could not prevent. For telecom operators and data engineering teams still running critical ETL on legacy batch frameworks, explore Ksolves Big Data Services to see what a structured migration can deliver.
Is Your Team Still Running Critical ETL on Legacy Java or Other Batch Frameworks That Cannot Scale?