Problems faced by them
Customer data exists in multiple systems and applications across the enterprise, which makes it highly challenging to find all records that pertain to one entity (customer). No unique identifier within each record indicates which records from one source correspond to those in other sources. Fields that represent the same entity may also contain different information: for example, one record may have a misspelt address and another may have missing fields.
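To make the matching problem concrete, here is a minimal sketch of scoring two records that describe the same customer but differ through a misspelling and a missing field. It uses Python's standard `difflib.SequenceMatcher` as a stand-in similarity measure. The record layout, the field names, and the helpers `field_similarity` and `match_percent` are assumptions for illustration and are not the algorithm used in the project.

```python
from difflib import SequenceMatcher


def field_similarity(a, b):
    """Return a 0..1 similarity for two field values, or None if either is missing."""
    if not a or not b:
        return None
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def match_percent(rec_a, rec_b, fields):
    """Average the similarity of fields present in both records, as a percentage."""
    scores = []
    for field in fields:
        score = field_similarity(rec_a.get(field), rec_b.get(field))
        if score is not None:  # skip fields missing on either side
            scores.append(score)
    if not scores:
        return 0.0
    return round(100.0 * sum(scores) / len(scores), 1)


# Hypothetical records from two source systems (illustrative data only).
crm = {"name": "John Smith", "address": "12 Baker Street", "phone": "555-0101"}
billing = {"name": "Jon Smith", "address": "12 Baker Stret", "phone": ""}

score = match_percent(crm, billing, ["name", "address", "phone"])
print(f"match: {score}%")
```

Skipping fields that are missing on either side keeps an empty phone number from dragging the score down. Whether that is the right policy depends on the data, and a production algorithm would typically also weight fields differently.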
- Find duplicate data coming from different sources and determine how closely (%) the records match.
- Link records that belong to the same entity but come from different sources, and report how closely (%) they match.
- It should scale on demand and run both on-premises and in the cloud.
- It should run on big data, including both structured and unstructured data.
- It should support more than one entity resolution algorithm so the user can run the same datasets with multiple algorithms and compare the results.
- We used an open-source entity resolution algorithm and wrote a Spark wrapper around it so that it runs on big data, both structured and unstructured.
- The application runs on a Spark cluster and exposes user-friendly API endpoints through which end users submit jobs with their datasets.
- It has the option to choose and run the same datasets with multiple entity resolution algorithms and compare results.
- All supported algorithms are configurable for speed and accuracy.
- It can optionally run data validation before entity resolution to check whether the data is in the required format and to suggest corrections.
- It reads data from CSV files stored on object storage (e.g., S3) and writes its output to object storage as well.
- Each algorithm has been tested with 5M records and can auto-scale when deployed in the cloud.
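The optional validation step mentioned above could look roughly like the following sketch, which checks CSV input before it is handed to entity resolution and returns suggestions for correcting it. The required columns (`id`, `name`, `address`), the helper `validate_csv`, and the sample data are hypothetical. The real application's schema and rules are not described in this text.

```python
import csv
import io

# Hypothetical schema: the columns a dataset must provide.
REQUIRED_COLUMNS = ["id", "name", "address"]


def validate_csv(text, required=REQUIRED_COLUMNS):
    """Return a list of human-readable problems found in the CSV text."""
    problems = []
    reader = csv.DictReader(io.StringIO(text))
    header = reader.fieldnames or []
    missing = [col for col in required if col not in header]
    if missing:
        # Without the required columns, row-level checks are meaningless.
        problems.append(
            "missing columns: " + ", ".join(missing) + "; add them to the header row"
        )
        return problems

    seen_ids = set()
    for row_no, row in enumerate(reader, start=1):
        rec_id = (row["id"] or "").strip()
        if not rec_id:
            problems.append(f"row {row_no}: empty id; assign an identifier")
        elif rec_id in seen_ids:
            problems.append(f"row {row_no}: duplicate id {rec_id!r}; make ids unique")
        else:
            seen_ids.add(rec_id)
        for col in required:
            # Short rows yield None for absent fields, hence the `or ""`.
            if col != "id" and not (row[col] or "").strip():
                problems.append(f"row {row_no}: empty {col}; fill it in or drop the row")
    return problems


sample = (
    "id,name,address\n"
    "1,John Smith,12 Baker Street\n"
    "1,Jon Smith,\n"
    ",Ann Lee,4 Elm Road\n"
)
for problem in validate_csv(sample):
    print(problem)
```

Returning a list of messages rather than raising on the first error lets the caller show every suggested correction in one pass, which matters when datasets are large and resubmitting a job is expensive.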