How Ksolves Built a Cloud-Native Entity Resolution Platform Across AWS and GCP
A US-based data intelligence company was building a SaaS platform to help enterprises consolidate, profile, and govern data assets across AWS and GCP. Their engineering team was focused on product development, not algorithm research, leaving five foundational platform capabilities unbuilt: a unified entity resolution service, a cloud-deployable data profiling layer, a canonical SQL function catalog spanning ten platforms, cross-platform JSON function parity, and Apache Flink SQL support in their query layer. Ksolves was engaged to design and deliver all five workstreams simultaneously, building an API-first architecture across AWS and GCP. Every algorithm, profiling tool, and cloud service was abstracted behind clean interfaces, so the application tier could consume capabilities without being coupled to any specific engine or cloud provider.
The client came to Ksolves with six structural gaps that were blocking core platform capabilities from reaching enterprise customers:
- No Single Algorithm Solves All Entity Resolution Cases: Datasets vary in size, schema complexity, and field quality, each responding best to a different matching method. The client needed a service that could select rule-based, probabilistic, or ML-based matching per dataset rather than locking customers into one approach.
- Data Profiling at Scale Requires Algorithm Diversity Across 38+ Tools: Functional dependency detection, cardinality estimation, and inclusion dependency discovery each require fundamentally different algorithms. Integrating 38+ profiling tools into a single deployable service on GCP Dataproc had no off-the-shelf solution.
- SQL Function Landscape Fragmented Across 10 Platforms: Apache Spark, Flink, BigQuery, Redshift, Snowflake, Azure Synapse, and Databricks each have incompatible SQL function naming, signatures, and capabilities, making cross-platform query portability and consistent UI behaviour impossible without a canonical reference.
- JSON Functions Missing Across Platform Boundaries: JSON manipulation functions available on some platforms were absent on others, causing inconsistent behaviour in customer pipelines. Each missing function required a compatible JavaScript implementation deployable as a platform extension.
- No Native Apache Flink Dialect in SQLGlot: The client's SQL editor used SQLGlot for cross-dialect parsing, but SQLGlot had no Flink SQL dialect, leaving Flink syntax unsupported or incorrectly rendered and blocking an entire enterprise customer segment.
- Cloud Architecture Spanning AWS and GCP with No Unified API: Matching and profiling capabilities needed to run across both AWS and GCP, each with different trigger mechanisms, storage formats, and execution environments. Without a unified API layer, the application tier would have to manage cloud-specific complexity directly.
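The per-dataset selection described in the first gap can be sketched as a simple dispatcher. The profile fields and thresholds below are illustrative assumptions, not the client's actual heuristic:

```python
from dataclasses import dataclass

@dataclass
class DatasetProfile:
    """Summary statistics used to pick a matching method (illustrative fields)."""
    row_count: int
    has_clean_keys: bool   # exact identifiers present and well populated
    labeled_pairs: int     # human-labeled match/non-match examples available

def select_matcher(profile: DatasetProfile) -> str:
    """Hypothetical selection heuristic: rule-based when clean keys exist,
    ML-based when enough training labels exist, probabilistic otherwise."""
    if profile.has_clean_keys:
        return "rule_based"      # e.g. deterministic JedAI workflows
    if profile.labeled_pairs >= 100:
        return "ml_based"        # e.g. Dedupe, which learns from labeled pairs
    return "probabilistic"       # e.g. Splink-style probabilistic linkage

print(select_matcher(DatasetProfile(1_000_000, False, 0)))  # probabilistic
```

A real selection layer would weigh far more signals (field quality, schema complexity, dataset size), but the dispatch shape is the same: profile in, engine choice out.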
Ksolves delivered the platform through five integrated workstreams built on an API-first architecture. Every profiling engine, algorithm, and cloud service was exposed through unified APIs, enabling the client to consume capabilities seamlessly without dependency on specific tools, engines, or cloud providers.
- Multi-Algorithm Entity Resolution API: JedAI was customised for CLI-based API integration, Splink was integrated for probabilistic record linkage, and Dedupe was added for ML-based deduplication. All three engines were unified behind a single AWS-hosted API with an algorithm selection layer that chose the optimal method per dataset and returned deduplicated entity clusters with confidence scores.
- Extensible Data Profiling Platform on GCP: The Metanome tool was customised into a CLI-driven profiling service deployed on GCP Dataproc, integrating 38+ profiling algorithms covering cardinality estimation (HyperLogLog++, LogLog, MinCount), functional dependency detection (HyFD, Tane, FastFDs), and inclusion dependency discovery (BINDER, SPIDER). Each algorithm was accessible via a standardised API Gateway endpoint with profiling results stored in Google Cloud Storage.
- 10-Platform SQL Data Catalog: All predefined functions and operators were collected and catalogued across 10 SQL platforms: Apache Spark 3.4.0, Apache Flink 1.17.0, Google BigQuery, Trino, Amazon Redshift, Snowflake, Azure Synapse, AWS Glue DataBrew, Trifacta, and Databricks. The catalog was stored in PostgreSQL with canonical naming, category classification, and cardinality metadata, providing the single authoritative reference for the client's cross-platform SQL editor.
- JavaScript JSON Function Implementations: A platform-by-platform assessment of JSON functions found 20+ functions missing from one or more platforms; each was implemented in JavaScript, including bool, float64, isjson, json_path_exists, json_object, to_json, json_array, and json_extract_scalar. This gave customers consistent JSON manipulation behaviour across all supported platforms without requiring bespoke workarounds.
- Apache Flink SQL Dialect Extension for SQLGlot: A custom Apache Flink SQL dialect was developed for SQLGlot using the Ibis framework as a structural reference and following SQLGlot's packaging conventions for native dialect integration. A Python application was built to read Flink SQL queries, parse them to Abstract Syntax Trees, validate their accuracy against Flink documentation, and convert them back to SQL, enabling full round-trip Flink SQL parsing and rendering within the client's cross-platform query layer.
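The parse, validate, and render loop described for the Flink dialect work can be illustrated with Python's standard ast module standing in for SQLGlot (which parses SQL rather than Python). The round-trip check below compares the AST of the rendered text against the original, the same shape of validation the dialect test suite performs:

```python
import ast

def round_trip(source: str) -> bool:
    """Parse source to an AST, render it back to text, and re-parse to
    verify the rendered text is semantically identical (same AST dump)."""
    tree = ast.parse(source)
    rendered = ast.unparse(tree)  # Python 3.9+
    return ast.dump(ast.parse(rendered)) == ast.dump(tree)

print(round_trip("x = 1 + 2 * y"))  # True
```

For the actual deliverable, the same loop ran over Flink SQL: parse a query through the custom SQLGlot dialect, render it back to SQL, and check the result against Flink's documented syntax.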
Technology Stack
| CATEGORY | TECHNOLOGY | ROLE IN THIS ENGAGEMENT |
|---|---|---|
| Matching Engine | JedAI, Splink, Dedupe | JedAI, Splink, and Dedupe unified behind a single AWS-hosted API with an algorithm selection layer that picks the optimal matching method per dataset. |
| Data Profiling | Metanome CLI, GCP Dataproc | Custom Metanome CLI on GCP Dataproc integrating 38+ algorithms for cardinality estimation, functional dependency detection, and inclusion dependency discovery, all exposed via API Gateway. |
| Cloud – AWS | AWS Glue, EventBridge, RDS, S3 | Handles raw data ingestion (S3), structured record storage (RDS), ETL execution (Glue), and event-driven job triggering (EventBridge) across the entity resolution workloads. |
| Cloud – GCP | Dataproc, Cloud Functions, GCS, API Gateway, Terraform | Powers distributed profiling (Dataproc), serverless algorithm execution (Cloud Functions), profiling storage (GCS), and unified API exposure (API Gateway), with infrastructure provisioned via Terraform. |
| Data Catalog | PostgreSQL, 10-Platform Coverage | PostgreSQL catalog of all SQL functions across 10 platforms, normalised to canonical names with cardinality and category metadata, serving as the authoritative reference for the cross-platform SQL editor. |
| Cross-Platform | SQLGlot, Python, JavaScript | SQLGlot extended with a native Flink SQL dialect for full parse-and-render support; Python for AST validation; JavaScript for 20+ missing JSON function implementations across platforms. |
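As an illustration of the catalog design in the table above, a minimal sketch of the function table follows. SQLite stands in for PostgreSQL here, and all column names are assumptions rather than the delivered schema:

```python
import sqlite3

# Illustrative schema: one row per (canonical function, platform) pair,
# carrying the platform-native spelling plus category and cardinality metadata.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sql_function (
        canonical_name TEXT NOT NULL,  -- normalised cross-platform name
        platform       TEXT NOT NULL,  -- e.g. 'spark', 'flink', 'bigquery'
        native_name    TEXT NOT NULL,  -- name as spelled on that platform
        category       TEXT,           -- e.g. 'string', 'json', 'aggregate'
        cardinality    TEXT,           -- e.g. 'scalar' or 'aggregate'
        PRIMARY KEY (canonical_name, platform)
    )
""")
conn.execute(
    "INSERT INTO sql_function VALUES (?, ?, ?, ?, ?)",
    ("json_extract_scalar", "trino", "json_extract_scalar", "json", "scalar"),
)
rows = conn.execute(
    "SELECT platform, native_name FROM sql_function WHERE canonical_name = ?",
    ("json_extract_scalar",),
).fetchall()
print(rows)  # [('trino', 'json_extract_scalar')]
```

With this shape, the SQL editor can resolve a canonical name to each platform's native spelling in one indexed lookup.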
Across five delivered workstreams, the platform unlocked five capabilities that were previously absent or required bespoke implementation per customer engagement:
- 3 Matching Engines Unified Behind a Single Entity Resolution API: JedAI, Splink, and Dedupe are now accessible through one AWS-hosted API with automatic algorithm selection, enabling the platform to serve diverse dataset types without manual algorithm configuration by end users.
- 38+ Profiling Algorithms Accessible via a Single API Call: The full Metanome algorithm library is deployed on GCP Dataproc and exposed via API Gateway, reducing profiling setup from weeks of bespoke implementation to a single standardised API call available to all platform customers.
- SQL Function Catalog Covering All 10 Target Platforms: A complete PostgreSQL-backed catalog of all SQL functions across 10 platforms, with canonical naming and category metadata, eliminates manual documentation research and provides the authoritative reference for the client's cross-platform SQL editor.
- 20+ Missing JSON Functions Implemented Across All Platforms: JavaScript implementations of 20+ missing JSON functions deliver consistent cross-platform behaviour for customers writing data transformation pipelines, eliminating a known class of compatibility failures that previously required bespoke workarounds.
- Apache Flink SQL Dialect Live in SQLGlot: A production-grade Flink SQL dialect integrated into SQLGlot with a full AST parse, validation, and render test suite unblocks Flink SQL support across the client's entire query layer, opening the platform to a previously excluded enterprise customer segment.
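To make the JSON parity work concrete, here is a sketch of json_extract_scalar semantics in Python. The delivered implementations were JavaScript, and this path grammar is deliberately minimal (dot-separated keys only), so treat it as an illustration of the behaviour being replicated, not the shipped code:

```python
import json

def json_extract_scalar(doc: str, path: str):
    """Follow a simple '$.a.b' path into a JSON document and return the
    value only if it is a scalar, rendered as a string; otherwise None
    (mirroring the NULL that SQL engines return for non-scalar matches)."""
    value = json.loads(doc)
    for key in path.lstrip("$").strip(".").split("."):
        if key == "":
            continue
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    if isinstance(value, (dict, list)):
        return None                      # non-scalar match
    if isinstance(value, bool):          # bool before str(): True -> "true"
        return "true" if value else "false"
    return None if value is None else str(value)

print(json_extract_scalar('{"a": {"b": 7}}', "$.a.b"))  # 7, rendered as "7"
```

Implementing the same contract once per missing platform is what removed the behavioural drift customers previously had to work around.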
Before this engagement, the client had no production entity resolution service, no cloud-deployable profiling platform, no canonical SQL function reference across their ten target platforms, and no Flink SQL support in their query layer. Ksolves has since delivered an API-first, cloud-native data intelligence platform spanning AWS and GCP: three matching engines behind one entity resolution API, 38+ profiling algorithms on GCP Dataproc, a 10-platform SQL catalog in PostgreSQL, 20+ JavaScript JSON function implementations, and a production Flink SQL dialect in SQLGlot. The architecture is extensible by design: new algorithms, platforms, and cloud providers can be added without rearchitecting the API layer. For data intelligence companies building cross-platform SaaS products, explore Ksolves Big Data Services to see what a production-grade platform foundation can deliver.
Struggling with Duplicate Entities and Fragmented SQL Analytics?