Project Name
How We Built a Secure Edge-to-Hub Data Lakehouse for Multi-Operator Telecom Analytics
A leading telecommunications data services provider approached Ksolves to design and implement a greenfield Edge-to-Hub Data Lakehouse architecture to support distributed call detail record (CDR) analytics and sovereign data management. The goal was to reliably aggregate high-velocity CDR data streams from multiple geographies into a centralized 1.5 PB Data Lakehouse, while sustaining ingestion rates of over 5 TB per day without impacting edge-processing performance or introducing data gaps.
The client asked us to resolve the following key challenges:
- Secure Edge Data Ingestion: Mobile sites required a secure and auditable mechanism for mobile network operators (MNOs) to upload raw CDR files without exposing internal file systems or network directories.
- Configuration Consistency Across Distributed Sites: With four or more geographically remote mobile sites, maintaining identical data flow logic across all locations was a major operational challenge. Any configuration change had to be deployed consistently, quickly, and safely, without introducing drift or risking data processing failures at individual sites.
- High-Volume Central Data Consolidation: The architecture needed to reliably aggregate high-velocity CDR data streams from multiple geographies into a single 1.5 PB+ central repository without impacting ingestion performance.
- Strict Multi-Tenancy & Data Privacy: The Central Site had to guarantee end-to-end data isolation for four or more MNOs while operating through a single, unified management and governance plane.
To address these challenges, Ksolves implemented the following architecture:
- Edge Tier (Mobile Sites): Each mobile site was equipped with SFTPGo to provide a hardened, secure gateway for raw CDR file ingestion. Apache NiFi handled validation, enrichment, and formatting at the edge, ensuring data quality before transmission. NiFi Registry provided centralized version control for the data flows, allowing configuration updates to be pushed simultaneously to all mobile sites with full rollback support.
- Central Ingestion Hub: Data was securely pushed from the mobile sites to a central Kafka cluster, decoupling edge collection from the heavy processing at the core.
- Scalable Storage & Processing: A 1.5 PB+ distributed MinIO Enterprise storage layer served as the central Lakehouse foundation, using a multi-tiered architecture to balance high-performance ingestion with long-term cost efficiency for 12 months of retention.
- Apache Spark on YARN (High Availability): Spark on YARN provided a robust, distributed compute fabric to process high-velocity Kafka streams into Apache Hudi tables. Hudi supplied ACID transactions, incremental processing, and asynchronous compaction, enabling the system to manage billions of records while maintaining consistent query performance for real-time CDR lifecycle management and multi-tenant analytics.
- Unified Security Layer: Keycloak provided centralized identity management and OIDC-based SSO across the entire platform. All web interfaces, including Airflow and Superset, were deployed behind Nginx and HAProxy to ensure high availability and load balancing.
- Analytics & Data Consumption: Trino enabled fast, federated SQL queries over Hudi tables stored in MinIO. Apache Superset delivered multi-tenant dashboards, with role-based and row-level security controls ensuring that each MNO accessed only its own isolated datasets.
The platform delivered the following outcomes:
- Always-On Availability: YARN HA and HAProxy are designed to keep the 1.5 PB+ lakehouse available without downtime
- 99.99% Data Durability: MinIO erasure coding protects critical datasets
- Seamless Failover: Instant recovery with no disruption to processing
- Scalable Architecture: Handles petabyte-scale telecom workloads effortlessly
- Real-Time Access: Continuous availability of CDR streams & historical data
- Compliance Ready: Built for strict telecom regulatory requirements
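The storage and ingestion figures quoted in this case study can be sanity-checked with some simple arithmetic. The sketch below is not part of the delivered platform; it assumes decimal units, a flat 5 TB per day for a full 365 days, and exactly 1.5 PB of usable capacity, none of which the source states precisely.

```python
# Back-of-envelope sizing from the figures quoted in the case study.
# Assumptions (not from the source): decimal units (1 TB = 10**12 bytes),
# a flat 5 TB/day for 365 days, and exactly 1.5 PB of usable capacity.

TB = 10**12
PB = 10**15

daily_ingest_bytes = 5 * TB
retention_days = 365
usable_capacity_bytes = 1.5 * PB

# Average sustained throughput needed to land 5 TB within 24 hours.
avg_throughput_mb_s = daily_ingest_bytes / 86_400 / 10**6

# Uncompressed volume accumulated over the retention window.
raw_retained_pb = daily_ingest_bytes * retention_days / PB

# Minimum compression ratio for the retained data to fit in usable capacity.
min_compression_ratio = (raw_retained_pb * PB) / usable_capacity_bytes

print(f"average ingest rate: {avg_throughput_mb_s:.1f} MB/s")
print(f"raw volume retained: {raw_retained_pb:.3f} PB")
print(f"minimum compression: {min_compression_ratio:.2f}x")
```

Under these assumptions, a year of raw data (about 1.8 PB) would slightly exceed 1.5 PB, so the retained tables would need to be compressed by roughly 1.2x or more on average. Peak ingest rates will also be higher than the daily average, and erasure-coding overhead applies to raw disk on top of usable capacity.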
This project successfully delivers a modern Edge-to-Hub Data Lakehouse architecture purpose-built for the telecommunications sector. By securing data ingestion at mobile sites using SFTPGo and Apache NiFi, and centralizing 1.5 PB+ of scalable storage and high-performance processing with MinIO and Apache Spark, the platform strikes a strong balance between edge-level security and centralized analytical capability.
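To make the edge-side validation and enrichment step more concrete, here is a minimal sketch of the kind of per-record logic involved. It is plain Python for illustration only, not NiFi configuration and not the flow Ksolves deployed. The field names, the `site_id` enrichment, and the function names `validate_cdr` and `process_batch` are all hypothetical.

```python
from datetime import datetime

# Hypothetical CDR schema for illustration; real CDR layouts vary by operator.
REQUIRED_FIELDS = ("mno_id", "caller", "callee", "start_time", "duration_seconds")


def validate_cdr(record):
    """Return a list of validation errors; an empty list means the record is valid."""
    errors = [
        f"missing field: {name}"
        for name in REQUIRED_FIELDS
        if record.get(name) in (None, "")
    ]
    if errors:
        return errors  # skip format checks when fields are absent

    try:
        datetime.fromisoformat(record["start_time"])
    except ValueError:
        errors.append("start_time is not ISO 8601")

    try:
        if int(record["duration_seconds"]) < 0:
            errors.append("duration_seconds is negative")
    except ValueError:
        errors.append("duration_seconds is not an integer")

    return errors


def process_batch(records, site_id):
    """Split a batch into accepted (enriched) and rejected (with reasons) records."""
    accepted, rejected = [], []
    for record in records:
        errors = validate_cdr(record)
        if errors:
            rejected.append({"record": record, "errors": errors})
        else:
            # Enrichment: tag each valid record with the originating site.
            accepted.append({**record, "site_id": site_id})
    return accepted, rejected


good = {"mno_id": "mno-a", "caller": "15550100", "callee": "15550101",
        "start_time": "2024-05-01T10:15:00", "duration_seconds": "42"}
bad = {**good, "start_time": "not-a-date", "duration_seconds": "-5"}

accepted, rejected = process_batch([good, bad], site_id="site-01")
print(len(accepted), "accepted;", len(rejected), "rejected")
```

Keeping rejected records together with their error reasons, rather than dropping them, supports the auditability requirement: every uploaded record is either forwarded or accounted for.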
The integration of Keycloak for unified identity management and HAProxy for high availability ensures the solution is fully production-ready, secure, and resilient from day one. Designed to support four or more MNOs on a shared yet isolated platform, the architecture enables reliable CDR analytics while meeting strict data privacy, sovereignty, and operational requirements.
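The shared-yet-isolated model comes down to one rule: a user sees only rows belonging to the operators their identity is mapped to, and sees nothing when no mapping exists. The sketch below illustrates that fail-closed rule in plain Python. The `groups` claim, the `mno:` prefix, and the `mno_id` column are assumptions made for illustration; on the platform itself this kind of rule would be enforced through Superset's role-based and row-level security controls rather than application code.

```python
# Assumed convention (hypothetical): tenant membership is carried in a
# "groups" claim of the identity token, as entries such as "mno:alpha".
TENANT_PREFIX = "mno:"


def allowed_mnos(claims):
    """Extract the set of operator IDs a user is entitled to see."""
    return {
        group[len(TENANT_PREFIX):]
        for group in claims.get("groups", [])
        if group.startswith(TENANT_PREFIX)
    }


def filter_rows(rows, claims):
    """Return only rows owned by the user's operators; fail closed otherwise."""
    tenants = allowed_mnos(claims)
    return [row for row in rows if row["mno_id"] in tenants]


rows = [
    {"mno_id": "alpha", "calls": 10},
    {"mno_id": "beta", "calls": 7},
]

alpha_user = {"groups": ["mno:alpha", "analysts"]}
print(filter_rows(rows, alpha_user))  # only the "alpha" row
print(filter_rows(rows, {}))          # no tenant mapping, so no rows
```

Defaulting to an empty result when no tenant mapping is present means a misconfigured account exposes nothing, rather than everything.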