How Ksolves Built a Cloud-Native Entity Resolution Platform Across AWS and GCP
A US-based data intelligence company was building a SaaS platform to help enterprises consolidate, profile, and govern data assets across AWS and GCP. Their engineering team was focused on product development, not algorithm research, leaving five foundational platform capabilities unbuilt: a unified entity resolution service, a cloud-deployable data profiling layer, a canonical SQL function catalog spanning ten platforms, cross-platform JSON function parity, and Apache Flink SQL support in their query layer. Ksolves was engaged to design and deliver all five workstreams simultaneously, building an API-first architecture across AWS and GCP. Every algorithm, profiling tool, and cloud service was abstracted behind clean interfaces, so the application tier could consume capabilities without being coupled to any specific engine or cloud provider.
The client came to Ksolves with six structural gaps that were blocking core platform capabilities from reaching enterprise customers:
- No Single Algorithm Solves All Entity Resolution Cases: Datasets vary in size, schema complexity, and field quality, each responding best to a different matching method. The client needed a service that could select rule-based, probabilistic, or ML-based matching per dataset rather than locking customers into one approach.
- Data Profiling at Scale Requires Algorithm Diversity Across 38+ Tools: Functional dependency detection, cardinality estimation, and inclusion dependency discovery each require fundamentally different algorithms. Integrating 38+ profiling tools into a single deployable service on GCP Dataproc had no off-the-shelf solution.
- SQL Function Landscape Fragmented Across 10 Platforms: Apache Spark, Flink, BigQuery, Redshift, Snowflake, Azure Synapse, and Databricks each have incompatible SQL function naming, signatures, and capabilities, making cross-platform query portability and consistent UI behaviour impossible without a canonical reference.
- JSON Functions Missing Across Platform Boundaries: JSON manipulation functions available on some platforms were absent on others, causing inconsistent behaviour in customer pipelines. Each missing function required a compatible JavaScript implementation deployable as a platform extension.
- No Native Apache Flink Dialect in SQLGlot: The client's SQL editor used SQLGlot for cross-dialect parsing, but SQLGlot had no Flink SQL dialect, leaving Flink syntax unsupported or incorrectly rendered and blocking an entire enterprise customer segment.
- Cloud Architecture Spanning AWS and GCP with No Unified API: Matching and profiling capabilities needed to run across both AWS and GCP, each with different trigger mechanisms, storage formats, and execution environments. Without a unified API layer, the application tier would have to manage cloud-specific complexity directly.
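The per-dataset selection described in the first gap can be sketched as a simple dispatcher. The profile fields and thresholds below are illustrative assumptions, not the client's actual heuristic:

```python
from dataclasses import dataclass

@dataclass
class DatasetProfile:
    """Summary statistics used to pick a matching method (illustrative fields)."""
    row_count: int
    has_clean_keys: bool   # exact identifiers present and well populated
    labeled_pairs: int     # human-labeled match/non-match examples available

def select_matcher(profile: DatasetProfile) -> str:
    """Hypothetical selection heuristic: rule-based when clean keys exist,
    ML-based when enough training labels exist, probabilistic otherwise."""
    if profile.has_clean_keys:
        return "rule_based"      # e.g. deterministic JedAI workflows
    if profile.labeled_pairs >= 100:
        return "ml_based"        # e.g. Dedupe, which learns from labeled pairs
    return "probabilistic"       # e.g. Splink-style probabilistic linkage

print(select_matcher(DatasetProfile(1_000_000, False, 0)))  # probabilistic
```

A real selection layer would weigh far more signals (field quality, schema complexity, dataset size), but the dispatch shape is the same: profile in, engine choice out.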
Ksolves delivered the platform through five integrated workstreams built on an API-first architecture. Every profiling engine, algorithm, and cloud service was exposed through unified APIs, enabling the client to consume capabilities seamlessly without dependency on specific tools, engines, or cloud providers.
- Multi-Algorithm Entity Resolution API: JedAI was customised for CLI-based API integration, Splink was integrated for probabilistic record linkage, and Dedupe was added for ML-based deduplication. All three engines were unified behind a single AWS-hosted API with an algorithm selection layer that chose the optimal method per dataset and returned deduplicated entity clusters with confidence scores.
- Extensible Data Profiling Platform on GCP: The Metanome tool was customised into a CLI-driven profiling service deployed on GCP Dataproc, integrating 38+ profiling algorithms covering cardinality estimation (HyperLogLog++, LogLog, MinCount), functional dependency detection (HyFD, Tane, FastFDs), and inclusion dependency discovery (BINDER, SPIDER). Each algorithm was accessible via a standardised API Gateway endpoint with profiling results stored in Google Cloud Storage.
- 10-Platform SQL Data Catalog: All predefined functions and operators were collected and catalogued across 10 SQL platforms: Apache Spark 3.4.0, Apache Flink 1.17.0, Google BigQuery, Trino, Amazon Redshift, Snowflake, Azure Synapse, AWS Glue DataBrew, Trifacta, and Databricks. The catalog was stored in PostgreSQL with canonical naming, category classification, and cardinality metadata, providing the single authoritative reference for the client's cross-platform SQL editor.
- JavaScript JSON Function Implementations: A platform-by-platform assessment of JSON functions found 20+ functions missing from one or more platforms; each was implemented in JavaScript, including bool, float64, isjson, json_path_exists, json_object, to_json, json_array, and json_extract_scalar. This gave customers consistent JSON manipulation behaviour across all supported platforms without requiring bespoke workarounds.
- Apache Flink SQL Dialect Extension for SQLGlot: A custom Apache Flink SQL dialect was developed for SQLGlot using the Ibis framework as a structural reference and following SQLGlot's packaging conventions for native dialect integration. A Python application was built to read Flink SQL queries, parse them to Abstract Syntax Trees, validate their accuracy against Flink documentation, and convert them back to SQL, enabling full round-trip Flink SQL parsing and rendering within the client's cross-platform query layer.
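The parse, validate, and render loop described for the Flink dialect work can be illustrated with Python's standard ast module standing in for SQLGlot (which parses SQL rather than Python). The round-trip check below compares the AST of the rendered text against the original, the same shape of validation the dialect test suite performs:

```python
import ast

def round_trip(source: str) -> bool:
    """Parse source to an AST, render it back to text, and re-parse to
    verify the rendered text is semantically identical (same AST dump)."""
    tree = ast.parse(source)
    rendered = ast.unparse(tree)  # Python 3.9+
    return ast.dump(ast.parse(rendered)) == ast.dump(tree)

print(round_trip("x = 1 + 2 * y"))  # True
```

For the actual deliverable, the same loop ran over Flink SQL: parse a query through the custom SQLGlot dialect, render it back to SQL, and check the result against Flink's documented syntax.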
Technology Stack
| CATEGORY | TECHNOLOGY | ROLE IN THIS ENGAGEMENT |
|---|---|---|
| Matching Engine | JedAI, Splink, Dedupe | JedAI, Splink, and Dedupe unified behind a single AWS-hosted API with an algorithm selection layer that picks the optimal matching method per dataset. |
| Data Profiling | Metanome CLI, GCP Dataproc | Custom Metanome CLI on GCP Dataproc integrating 38+ algorithms for cardinality estimation, functional dependency detection, and inclusion dependency discovery, all exposed via API Gateway. |
| Cloud – AWS | AWS Glue, EventBridge, RDS, S3 | Handles raw data ingestion (S3), structured record storage (RDS), ETL execution (Glue), and event-driven job triggering (EventBridge) across the entity resolution workloads. |
| Cloud – GCP | Dataproc, Cloud Functions, GCS, API Gateway, Terraform | Powers distributed profiling (Dataproc), serverless algorithm execution (Cloud Functions), profiling storage (GCS), and unified API exposure (API Gateway), with infrastructure provisioned via Terraform. |
| Data Catalog | PostgreSQL, 10-Platform Coverage | PostgreSQL catalog of all SQL functions across 10 platforms, normalised to canonical names with cardinality and category metadata, serving as the authoritative reference for the cross-platform SQL editor. |
| Cross-Platform | SQLGlot, Python, JavaScript | SQLGlot extended with a native Flink SQL dialect for full parse-and-render support; Python for AST validation; JavaScript for 20+ missing JSON function implementations across platforms. |
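As an illustration of the catalog design in the table above, a minimal sketch of the function table follows. SQLite stands in for PostgreSQL here, and all column names are assumptions rather than the delivered schema:

```python
import sqlite3

# Illustrative schema: one row per (canonical function, platform) pair,
# carrying the platform-native spelling plus category and cardinality metadata.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sql_function (
        canonical_name TEXT NOT NULL,  -- normalised cross-platform name
        platform       TEXT NOT NULL,  -- e.g. 'spark', 'flink', 'bigquery'
        native_name    TEXT NOT NULL,  -- name as spelled on that platform
        category       TEXT,           -- e.g. 'string', 'json', 'aggregate'
        cardinality    TEXT,           -- e.g. 'scalar' or 'aggregate'
        PRIMARY KEY (canonical_name, platform)
    )
""")
conn.execute(
    "INSERT INTO sql_function VALUES (?, ?, ?, ?, ?)",
    ("json_extract_scalar", "trino", "json_extract_scalar", "json", "scalar"),
)
rows = conn.execute(
    "SELECT platform, native_name FROM sql_function WHERE canonical_name = ?",
    ("json_extract_scalar",),
).fetchall()
print(rows)  # [('trino', 'json_extract_scalar')]
```

With this shape, the SQL editor can resolve a canonical name to each platform's native spelling in one indexed lookup.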
Across five delivered workstreams, the platform unlocked five capabilities that were previously absent or required bespoke implementation per customer engagement:
- 3 Matching Engines Unified Behind a Single Entity Resolution API: JedAI, Splink, and Dedupe are now accessible through one AWS-hosted API with automatic algorithm selection, enabling the platform to serve diverse dataset types without manual algorithm configuration by end users.
- 38+ Profiling Algorithms Accessible via a Single API Call: The full Metanome algorithm library is deployed on GCP Dataproc and exposed via API Gateway, reducing profiling setup from weeks of bespoke implementation to a single standardised API call available to all platform customers.
- SQL Function Catalog Covering All 10 Target Platforms: A complete PostgreSQL-backed catalog of all SQL functions across 10 platforms, with canonical naming and category metadata, eliminates manual documentation research and provides the authoritative reference for the client's cross-platform SQL editor.
- 20+ Missing JSON Functions Implemented Across All Platforms: JavaScript implementations of 20+ missing JSON functions deliver consistent cross-platform behaviour for customers writing data transformation pipelines, eliminating a known class of compatibility failures that previously required bespoke workarounds.
- Apache Flink SQL Dialect Live in SQLGlot: A production-grade Flink SQL dialect integrated into SQLGlot with a full AST parse, validation, and render test suite unblocks Flink SQL support across the client's entire query layer, opening the platform to a previously excluded enterprise customer segment.
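To make the JSON parity work concrete, here is a sketch of json_extract_scalar semantics in Python. The delivered implementations were JavaScript, and this path grammar is deliberately minimal (dot-separated keys only), so treat it as an illustration of the behaviour being replicated, not the shipped code:

```python
import json

def json_extract_scalar(doc: str, path: str):
    """Follow a simple '$.a.b' path into a JSON document and return the
    value only if it is a scalar, rendered as a string; otherwise None
    (mirroring the NULL that SQL engines return for non-scalar matches)."""
    value = json.loads(doc)
    for key in path.lstrip("$").strip(".").split("."):
        if key == "":
            continue
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    if isinstance(value, (dict, list)):
        return None                      # non-scalar match
    if isinstance(value, bool):          # bool before str(): True -> "true"
        return "true" if value else "false"
    return None if value is None else str(value)

print(json_extract_scalar('{"a": {"b": 7}}', "$.a.b"))  # 7, rendered as "7"
```

Implementing the same contract once per missing platform is what removed the behavioural drift customers previously had to work around.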
Before this engagement, the client had no production entity resolution service, no cloud-deployable profiling platform, no canonical SQL function reference across their ten target platforms, and no Flink SQL support in their query layer. Ksolves has since delivered an API-first, cloud-native data intelligence platform spanning AWS and GCP: three matching engines behind one entity resolution API, 38+ profiling algorithms on GCP Dataproc, a 10-platform SQL catalog in PostgreSQL, 20+ JavaScript JSON function implementations, and a production Flink SQL dialect in SQLGlot. The architecture is extensible by design: new algorithms, platforms, and cloud providers can be added without rearchitecting the API layer. For data intelligence companies building cross-platform SaaS products, explore Ksolves Big Data Services to see what a production-grade platform foundation can deliver.
Struggling with Duplicate Entities and Fragmented SQL Analytics?