Project Name

Apache Spark ML Entity Resolution and Data Deduplication Platform for a B2B Data Services Company

From Broken Record Matching to 94% Precision: How Ksolves Built an Apache Spark Entity Resolution Platform for a B2B Data Company
Industry
Information Technology
Technology
Apache Spark, AWS EMR, AWS Lambda, AWS API Gateway, Amazon S3, AWS RDS (PostgreSQL), AWS EventBridge

Overview

For a North American B2B data services company managing customer and business entity records across six source systems, duplicate entries were not a data quality problem in isolation. They were a revenue and operations problem. With over 50 million records distributed across a legacy CRM, an ERP system, a billing application, and three third-party databases, the company had no consistent, unique identifier linking records from different platforms. Customers appeared under multiple identities, analytics were unreliable, and automated business processes were running on data the team could not trust.


Manual deduplication was impossible at this volume. The company needed a machine-learning-powered entity resolution platform capable of matching records at scale, tolerating spelling inconsistencies and missing fields, and providing data engineers with a configurable interface to tune matching thresholds without writing code for every adjustment. They partnered with Ksolves, an AI-First Company, to build it on Apache Spark and AWS.

The Challenge

The client faced three key challenges:

  • No Reliable Identifier to Match Records Across Six Source Systems: Customer and entity data were distributed across six separate platforms, each using different field naming conventions, address formats, and date schemas. Without a shared unique identifier, the engineering team had no programmatic way to determine whether two records in different systems referred to the same real-world entity. An estimated 12 million duplicate records were being treated as unique, distorting customer counts and corrupting downstream reporting that production and finance teams relied on daily.
  • Data Quality Inconsistencies Blocking Automated Matching: Records representing the same entity frequently differed across sources. Names were misspelled, company addresses were abbreviated inconsistently across platforms, and required fields were missing in up to 23% of records from third-party databases. Standard rule-based exact matching yielded an unacceptable false-negative rate, while overly broad fuzzy matching introduced false positives, leading to secondary data-quality failures. The team needed ML-based matching thresholds configurable per use case, without requiring an engineering ticket for every tuning adjustment (a toy illustration of this precision/recall tension follows this list).
  • Sequential Processing Architecture Could Not Handle Record Volume: The existing deduplication process was sequential and single-threaded. Batch jobs across the full 50 million record dataset took 18 hours to complete and could not scale as new data source integrations brought additional volume. Weekly deduplication runs were the practical limit of the architecture, leaving stale duplicates in the system for days and blocking any move toward daily data quality operations.
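
Why neither extreme works is easy to demonstrate with a toy similarity check. The sketch below uses Python's standard-library difflib as a stand-in for a fuzzy matcher; the company names are invented and the thresholds are illustrative, not taken from the client's data.

```python
from difflib import SequenceMatcher

def sim(a: str, b: str) -> float:
    # difflib's ratio() stands in for the platform's similarity scoring.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Same real-world entity, different spellings: exact matching
# (threshold 1.0) misses the pair entirely.
print(sim("Acme Corp.", "ACME Corporation"))     # ~0.69

# Different entities with similar names: a loose 0.8 threshold
# would wrongly merge them.
print(sim("Atlas Data Ltd", "Atlas Dairy Ltd"))  # ~0.83
```

Raising the threshold suppresses the false-positive pair but reintroduces the false-negative one, which is why per-use-case tunable thresholds mattered.
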
The Solution

Before any configuration work began, Ksolves ran an AI-assisted analysis of the existing data pipelines, matching patterns, and schema inconsistencies across all six source systems. The resulting Apache Spark-based entity resolution platform comprised four components:

  • Apache Spark ML Wrapper for Distributed Entity Resolution: The core of the solution was an ML wrapper built around Apache Spark on AWS EMR, abstracting large-scale record matching into a configurable, API-accessible service. The wrapper implemented multiple record linkage algorithms, including Levenshtein distance scoring, Jaccard similarity for token-based field comparison, and TF-IDF weighting for unstructured name and address fields. Configurable blocking rules reduced the candidate pair space from O(n²) to a tractable subset, enabling distributed computation across a six-node EMR cluster (a minimal PySpark sketch of this blocking-and-scoring stage appears after this list).
  • Configurable Algorithm Interface for Data Engineering Teams: A configuration layer enabled data engineers to select resolution algorithms, set match-probability thresholds, configure blocking rules per entity type, and compare outputs across configurations without code changes. New entity resolution use cases could be onboarded and tuned entirely within the interface, independent of the core engineering team (an illustrative configuration shape follows the sketch below).
  • AWS Deployment with API Layer and Auto-Scaling: The platform was deployed on AWS EMR, with API endpoints built on AWS Lambda and API Gateway, providing both internal teams and external consumers with programmatic access to all entity resolution functions. Amazon S3 handled structured and unstructured data ingestion and output. AWS EventBridge triggered deduplication jobs on schedule and on new data arrival, with the platform configured to auto-scale EMR node count based on job queue depth (a sketch of this trigger path follows the technology stack table).
  • PostgreSQL Results Store and Reporting Integration: Deduplicated entity records were written to AWS RDS (PostgreSQL), providing a single authoritative master data store for downstream analytics, billing operations, and automated business process workflows.
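
The following is a minimal PySpark sketch of the blocking-and-scoring stage described above. It is illustrative only: the column names, the blocking key (name prefix plus postal-code prefix), the S3 path, and the 0.85 threshold are assumptions, not the client's production configuration.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("entity-resolution-sketch").getOrCreate()

# Hypothetical input: one row per source record with id, name, postal_code.
records = spark.read.parquet("s3://example-bucket/records/")

# Blocking: only records sharing a coarse key ever become a candidate pair,
# which is what cuts the O(n^2) comparison space to a tractable subset.
blocked = records.withColumn(
    "block_key",
    F.concat(
        F.substring(F.lower(F.col("name")), 1, 3),
        F.substring(F.col("postal_code"), 1, 3),
    ),
)

# Self-join within blocks; keep each unordered pair exactly once.
a, b = blocked.alias("a"), blocked.alias("b")
pairs = a.join(b, on="block_key").where(F.col("a.id") < F.col("b.id"))

# Levenshtein edit distance normalized to a [0, 1] similarity score.
scored = pairs.withColumn(
    "name_sim",
    1 - F.levenshtein(F.col("a.name"), F.col("b.name"))
      / F.greatest(F.length("a.name"), F.length("b.name")),
)

# The match threshold is exactly the kind of knob the configuration
# layer exposes per entity type; 0.85 here is an invented value.
matches = scored.where(F.col("name_sim") >= 0.85)
```

Because pairs are only generated within blocks, the shuffle stays bounded: records that share no blocking key are never compared, which is what makes a modest cluster sufficient for tens of millions of records.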

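The configuration layer can be pictured as per-entity-type settings that the job runner reads at submission time. The schema below is a hypothetical shape invented for illustration; the actual interface's fields and values are not described in the case study.

```python
# Hypothetical per-entity-type matching configuration. Every field name
# and value here is an assumption chosen for illustration.
ENTITY_CONFIGS = {
    "customer": {
        "algorithm": "levenshtein",
        "match_threshold": 0.92,   # favor precision (e.g., billing)
        "blocking_keys": ["name_prefix", "postal_prefix"],
    },
    "business_entity": {
        "algorithm": "jaccard",
        "match_threshold": 0.80,   # favor recall (broader dedup)
        "blocking_keys": ["name_tokens"],
    },
}

def load_matching_config(entity_type: str) -> dict:
    """Look up the stored settings for an entity type; tuning a threshold
    or switching algorithms is a config edit, not a code change."""
    return ENTITY_CONFIGS[entity_type]
```
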
Technology Stack

Category | Legacy Approach | New Architecture
Record Matching | Rule-based exact match | Apache Spark ML entity resolution engine
Processing | Sequential, single-threaded batch | Distributed Spark on AWS EMR (6-node cluster)
Algorithm Layer | Fixed rules, no configurability | Levenshtein, Jaccard, TF-IDF with configurable thresholds
Cloud Infrastructure | On-premises batch processing | AWS EMR, Lambda, API Gateway, EventBridge
Object Storage | Internal file system | Amazon S3 (structured and unstructured)
Results Store | Multiple siloed databases | AWS RDS PostgreSQL (single entity master)
API Access | None | REST API via AWS API Gateway
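
The trigger path referenced in the solution section can be sketched as an EventBridge rule invoking a Lambda function that submits a Spark step to the EMR cluster via boto3. The cluster ID, script location, and job arguments below are placeholders, not the client's actual values.

```python
import boto3

emr = boto3.client("emr")

def handler(event, context):
    # EventBridge invokes this on schedule or on new-data arrival events.
    response = emr.add_job_flow_steps(
        JobFlowId="j-XXXXXXXXXXXXX",  # hypothetical EMR cluster ID
        Steps=[{
            "Name": "scheduled-dedup",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "spark-submit", "--deploy-mode", "cluster",
                    "s3://example-bucket/jobs/dedup_job.py",  # placeholder
                    "--entity-type", "customer",
                ],
            },
        }],
    )
    return {"step_ids": response["StepIds"]}
```
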
Results / Impact
  • 94% Match Precision Achieved Across 50 Million Records: The Apache Spark ML entity resolution engine achieved 94% match precision and 91% recall across the full 50 million record dataset, compared to 61% precision from the client's previous rule-based matching approach. False positives were reduced by 68%, eliminating the manual review overhead that had previously consumed 14 engineering hours per week.
  • Deduplication Processing Time Reduced from 18 Hours to 47 Minutes: The Spark ML wrapper processed the full 50 million record dataset in 47 minutes on a six-node EMR cluster, against the previous 18-hour sequential run. The 23x speed improvement enabled daily deduplication operations that were previously architecturally impossible.
  • 12 Million Duplicate Records Eliminated from Master Dataset: The initial deduplication run identified and consolidated 12 million duplicate records across all six source systems, reducing the effective entity count and enabling accurate entity-level analytics for the first time across the business.
  • Configurable Interface Deployed Across Four Entity Types: The algorithm configuration layer was adopted for four distinct entity resolution use cases, each with different precision and recall requirements, without any code changes between configurations.
  • AWS Auto-Scaling Validated at 3x Peak Load: The platform auto-scaled to 18 EMR nodes under peak load, processing the full record set in 47 minutes while handling 3x normal concurrent job volume, validating the architecture for projected data growth.
Data Flow Diagram
[Data flow: six source systems → Amazon S3 ingestion → Spark ML entity resolution on AWS EMR → AWS RDS (PostgreSQL) entity master → analytics, billing, and API consumers via Lambda and API Gateway]
Client Testimonial

“Before Ksolves built this platform, we had no reliable way to tell which records across our six systems described the same entity. Our analytics were built on data we did not trust, and our customer counts were meaningless numbers. The Apache Spark entity resolution platform they built gave us our first accurate view of our data at this scale. The precision it achieves with 50 million records and the speed at which it runs are something we could not have achieved with any tool we had evaluated. The configurable interface means our data team can tune it for new use cases without raising an engineering ticket.”


— Head of Data Engineering, B2B Data Services Company (name withheld by request)

Conclusion

By deploying an Apache Spark-based entity resolution platform with configurable ML algorithms, distributed EMR processing, and AWS auto-scaling, Ksolves delivered the client’s first reliable, scalable solution to the data duplication problem that had been corrupting their analytics and blocking accurate business operations. The platform eliminated 12 million duplicate records, achieved 94% match precision across 50 million records, and reduced deduplication processing time from 18 hours to 47 minutes, with the architectural flexibility to scale as data volumes continue to grow.


As an AI-First Company, Ksolves brings AI-driven data analysis and deep engineering expertise to every entity resolution engagement. For data-intensive organizations managing record matching at scale, our Apache Spark Development Services deliver the precision, speed, and configurability that production-grade deduplication demands.

Is Duplicate Data Corrupting Your Analytics and Blocking Accurate Entity-Level Insights?