Project Name

How Ksolves Built a Real-Time Fraud Detection System for Telco Using Apache Kafka and Spark

How Ksolves Built a Real-Time Fraud Detection System for Telco Using Apache Kafka and Spark
Industry
Telecommunication
Technology
Apache Kafka · Apache Spark Structured Streaming · Apache Flink (Roadmap)

Loading

How Ksolves Built a Real-Time Fraud Detection System for Telco Using Apache Kafka and Spark
CLIENT OVERVIEW

The client is a large-scale Telco operator running one of the most data-intensive real-time marketing platforms in the region. Their marketing infrastructure processes billions of customer interaction events every day, serving personalised offers and campaigns to millions of subscribers across digital and mobile channels.

 

Operating during peak Golden Hours (7:00 PM to 10:00 PM GST), the platform routinely handles 100,000 to 150,000 events per second, making it one of the highest-throughput marketing event pipelines in production. Protecting this pipeline from fraudulent activity is mission-critical for both the integrity of campaign spend and the accuracy of customer engagement signals.

 

As fraudulent burst activity began eroding campaign performance and distorting customer engagement data, the client approached Ksolves to design and deploy a robust real-time fraud detection system capable of operating at their scale without disrupting legitimate marketing operations.

Key Challenges

In high-velocity digital marketing environments, fraudsters and compromised devices flood the system with abnormal volumes of marketing event requests in a very short time, a pattern known as Burst Fraud. Identifying and suppressing this at scale presented several complex engineering challenges:

  • Extreme Throughput at Scale: The system needed to evaluate fraud signals at up to 150,000 events per second during peak Golden Hours, leaving absolutely no room for high-latency processing pipelines or batch-oriented approaches.
  • Late-Arriving and Out-of-Order Data: Network delays and device-side buffering caused events to arrive out of sequence. Without structured handling, fraud detection windows would be miscalculated, leading to both false positives and missed detections.
  • Stateful Real-Time Aggregation: Counting events per Device_ID within a rolling 30-second window required maintaining distributed, low-latency state across a massively parallel environment, a challenge for systems that treat each event independently.
  • Micro-Batch Latency Gap: Spark Structured Streaming processes data in micro-batches every few seconds. Because genuine Burst Fraud can peak and complete within the same interval, there was a real risk of suppression actions missing the fraud window entirely.
  • Operational Scale and Auditability: At 3 to 5 billion daily events, even a 0.1% fraud rate translates to millions of events that must be correctly flagged, suppressed, and audited, all without disrupting legitimate customer traffic.
  • Downstream Impact of Undetected Fraud: Fraudulent events served to bots and spoofed devices resulted in wasted campaign spend and incorrectly profiled real customers, undermining the accuracy of personalisation models.
Our Solution

Ksolves implemented a closed-loop, real-time fraud detection architecture using Apache Kafka as the event streaming backbone. We leveraged Apache Spark Structured Streaming as the processing engine, purpose-built to detect velocity-based anomalies at extreme scale.

  • Event Ingestion via Apache Kafka
    • All marketing trigger events from upstream systems are published in real time to the dedicated Marketing_Triggers Kafka topic.
    • Kafka's distributed, fault-tolerant architecture absorbs peak loads of 150,000 EPS during Golden Hours without message loss, back-pressure, or consumer lag.
    • Kafka's consumer group model allows multiple downstream processors to consume in parallel, enabling horizontal scalability as event volumes grow.
  • Windowed Aggregation in Apache Spark
    • Spark Structured Streaming consumes from the Marketing_Triggers topic and applies a 30-second tumbling event-time window, grouping events by Device_ID.
    • Watermarking is enabled across the pipeline so late-arriving events are still included in their correct windows before finalisation, preventing false positives and missed detections.
    • Event-time windowing anchors detection to when the event occurred at the device, not when it was received by the processing engine, a critical distinction at high throughput.
  • Fraud Trigger and Closed-Loop Suppression
    • When a Device_ID accumulates more than 20 marketing requests within a single 30-second window, Spark classifies the entity as a fraud candidate and immediately triggers a suppression action.
    • Spark writes a Suppress command back to the Action_[Topic. Total] Kafka topic, which downstream marketing execution systems consume to immediately halt offer delivery to the flagged device.
    • This closed-loop architecture ensures suppression happens within the same micro-batch cycle where the threshold is crossed.
  • Future Enhancement: Apache Flink Migration
    • Migrating windowed aggregation to Apache Flink's native event-time streaming engine will reduce suppression latency from seconds to sub-seconds.
    • Flink's incremental window evaluation will emit fraud triggers the instant the 20-event threshold is crossed, eliminating the micro-batch latency gap.

TECHNOLOGY STACK

Technology Role in Solution Key Capability
Apache Kafka Real-time event ingestion and suppression message bus Fault-tolerant, 150K EPS with zero message loss
Apache Spark Structured Streaming Windowed aggregation and fraud classification engine 30-sec tumbling event-time windows with watermark support
Spark Watermarking Late-arriving data handling Prevents false positives from out-of-order event delivery
Kafka Topics (Dual) Marketing_Triggers input + Action output topics Enables closed-loop detection-to-suppression pipeline
Apache Flink (Roadmap) Next-gen real-time stream processing engine Sub-second fraud suppression via incremental window eval
Event-Time Windowing Anchor detection to device-side timestamps Accurate fraud windows independent of network delay
Impact
  • Burst Fraud reliably detected and suppressed at scale: Fraudulent activity is identified and acted upon within the 30-second window across the full volume of 3 to 5 billion daily events, protecting campaign budgets from wasted spend on bots and spoofed devices.
  • Elimination of window miscalculations from out-of-order events: Watermarking resolved the late-arrival problem completely. Detection accuracy is maintained even during peak network congestion, without holding windows open indefinitely.
  • Near-real-time suppression in the same processing cycle: Flagged Device_IDs receive Suppress commands within the same micro-batch cycle where the threshold is crossed, halting fraudulent offer delivery before the next batch begins.
  • Sustained peak throughput without performance degradation: The Kafka-Spark architecture successfully sustains 100,000 to 150,000 EPS during Golden Hours without service interruption, consumer lag, or processing bottlenecks.
  • Restored integrity of marketing personalisation models: With fraudulent engagement signals removed, personalisation algorithms and campaign analytics are now driven entirely by genuine customer interactions.
  • Audit-ready fraud suppression trail: Every flagged Device_ID and Suppress command is captured in Kafka's durable log, providing a complete, reproducible audit trail for compliance and campaign reporting.
  • Scalable foundation for Flink migration: The modular architecture enables a seamless upgrade to Apache Flink, bringing suppression latency down to sub-second for the next generation of real-time fraud prevention.
Conclusion

Fraud in real-time marketing environments does not wait. When a single compromised device can generate hundreds of artificial triggers within seconds, the window for detection and suppression must be measured in the same unit, seconds. Ksolves architected a production-grade solution that meets that requirement at a scale most platforms never encounter.

 

By combining Apache Kafka’s high-throughput event ingestion with Spark Structured Streaming’s event-time windowed aggregation and watermarking, the Telco now operates a fraud detection pipeline that processes billions of events daily, suppresses anomalies in near real time, and preserves the accuracy of its marketing intelligence. The upcoming migration to Apache Flink will take this further, reducing latency from seconds to milliseconds and enabling the next generation of real-time fraud prevention.

Want To Protect Your Marketing Pipeline from Fraud?