Eliminated ~900K Duplicate Oil Well Records from 6,200 Excel Files Using Azure Databricks and Spark
When an oil and gas operator cannot answer the question “how many wells do we actually own and operate,” the consequences ripple through every business function. Royalty payments are miscalculated. Regulatory filings contain conflicting well identifiers. Production volumes are double-counted across basins.
For a mid-major US exploration and production company operating across three major basins with nearly 4,800 active wells, decades of acquisitions and Excel-based data management had produced a well registry so fragmented that the same physical wellbore was frequently registered under three or four distinct identifiers across different operational systems.
Ksolves, an AI-first company, was engaged to design and deliver an end-to-end Well Master Data Deduplication program on Azure, ingesting 6,200 Excel files and resolving roughly 900,000 duplicate records using Apache Spark on Azure Databricks. The result was a single authoritative well master published to Azure SQL Server, timed to fit the operator's next regulatory filing cycle.
The client came to Ksolves with six structural problems that were causing royalty errors, regulatory risk, and analytical paralysis:
- No Single Well Identifier Across Eight Operational Systems: The API-14 number, the industry standard unique well identifier, was present in some systems, absent in others, and corrupted into non-standard API-10 or state well number formats in several Excel exports. No programmatic crosswalk existed between identifier schemes, making deterministic matching impossible without a standardization layer.
- 6,200 Excel Files with Inconsistent Schemas: Source data ranged from structured tabular exports to semi-structured worksheets where well names, coordinates, and production figures were embedded in merged cells and multi-tab workbooks. A robust ingestion and preprocessing layer was required before any deduplication logic could be applied.
- Geospatial Duplicate Identification at Basin Scale: Many duplicate records could not be resolved through identifier matching alone because the API number was missing or corrupted. Identifying pairs of records that described the same surface location within a 500-metre radius required distributed spatial indexing across 4.8 million records.
- Royalty Calculation Errors From Duplicate Production Attribution: Duplicate well entries in the production accounting system were attributing production volumes to multiple cost centres for the same physical wellbore. This caused royalty overpayments to mineral rights owners, double-counted gross revenue, and incorrect working interest calculations affecting partner and investor distributions.
- Regulatory Filing Inconsistencies Ahead of BLM Audit: State and Bureau of Land Management filings had been submitted with conflicting API numbers and operator name variants across multiple filing periods, creating an audit risk that needed to be resolved before the upcoming BLM compliance review.
- No Cloud Infrastructure or MDM Capability In-House: The entire Azure infrastructure, including Data Factory, Data Lake Storage Gen2, Databricks, Azure ML, Azure SQL Server, and Azure Purview, had to be architected, deployed, and operationalized by Ksolves within the program timeline.
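The identifier problem above hinges on converting API-10 numbers to the API-14 standard, which appends a two-digit directional sidetrack code and a two-digit event sequence code. The function below is a minimal illustrative sketch of that normalization, not the client's actual standardization layer; the padding rules shown are the common industry defaults for an original wellbore.

```python
import re

def normalize_api14(raw):
    """Normalize a raw API identifier string to API-14.

    A minimal sketch: strips punctuation, then pads an API-10
    (state + county + unique well number) with '0000' for the
    sidetrack and event-sequence suffixes. Returns None for
    formats that need an external lookup (e.g. state well numbers).
    """
    digits = re.sub(r"\D", "", raw or "")
    if len(digits) == 14:
        return digits
    if len(digits) == 12:
        return digits + "00"    # API-12: sidetrack known, event code missing
    if len(digits) == 10:
        return digits + "0000"  # original wellbore, first completion event
    return None                 # unrecognized: requires a crosswalk table

print(normalize_api14("42-501-20130"))  # → 42501201300000
```

A crosswalk table keyed on state well numbers would handle the `None` branch, which is where deterministic matching breaks down and probabilistic matching has to take over.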
Ksolves designed a fully Azure-native Well Master Data Deduplication platform built on three governing principles: API-14 as ground truth where it exists, geospatial awareness throughout all matching logic, and zero disruption to existing regulatory filings and royalty calculations during or after the migration. The platform was delivered across four phases over nine months.
- Azure Data Factory Ingestion Pipeline: Eight custom connectors ingested approximately 6,200 Excel files on a scheduled and event-triggered basis, converted them to Parquet format, and landed them in Azure Data Lake Storage Gen2. Each file was validated against a configurable column-mapping schema at ingestion, with malformed files routed to a dead-letter container. Azure Purview catalogued every file's lineage from source Excel through to the golden well master.
- PySpark Well Record Standardization: A PySpark notebook on Azure Databricks unified all well identifier formats, converting API-10 to API-14 using state and county FIPS codes, cross-walking state well numbers via state oil and gas commission (OGC) lookup tables, and normalizing UWI identifiers. Operator names were standardized using regex cleaning and fuzzy string matching. Coordinates were normalized to WGS84 and encoded as H3 geohash indices for spatial blocking.
- Geospatial Multi-Key Blocking: A five-key blocking strategy reduced the 4.8-million-record comparison space by 99.2% while retaining 99.7% recall of true duplicate pairs across 96 auto-scaled Databricks workers.
- ML Matching on Azure ML with 20 Features: A gradient boosting model trained on 180,000 hand-labelled well pairs achieved a precision of 96.9%, a recall of 97.3%, and an F1 of 0.971.
- Human-in-the-Loop MDM Review: Approximately 85,000 candidate pairs below the high-confidence threshold were routed to a map-view stewardship portal, giving geologists spatial context for every review decision. BLM-pending wells were surfaced for same-day adjudication.
- Golden Well Master Written to Azure SQL Server: Resolved golden records were written to Azure SQL Server with sub-12ms query performance for Power BI and regulatory pipelines, with Delta Lake format providing full time-travel history for audit rollback.
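The blocking stage above works by assigning each record a spatial cell key so that only records in the same or adjacent cells are ever compared, collapsing the O(n²) comparison space. The sketch below illustrates the idea using a simple rounded lat/lon grid as a stand-in for the H3 index the pipeline actually used; the record layout and 500-metre radius are taken from the case study, everything else is illustrative.

```python
from collections import defaultdict
from math import radians, sin, cos, asin, sqrt

CELL_DEG = 0.01  # ~1.1 km of latitude; a coarse stand-in for an H3 cell

def cell_key(lat, lon):
    # Quantize coordinates into a grid cell used as the blocking key.
    return (round(lat / CELL_DEG), round(lon / CELL_DEG))

def haversine_m(lat1, lon1, lat2, lon2):
    # Great-circle distance between two WGS84 points, in metres.
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + \
        cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371000 * asin(sqrt(a))

def candidate_pairs(records, radius_m=500):
    """Return ID pairs of records within radius_m of each other,
    comparing only within a cell and its 8 neighbours."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[cell_key(rec["lat"], rec["lon"])].append(rec)
    pairs = set()
    for (cx, cy), group in blocks.items():
        # Include neighbouring cells so pairs straddling a cell
        # boundary are still compared.
        nearby = []
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                nearby += blocks.get((cx + dx, cy + dy), [])
        for a in group:
            for b in nearby:
                if a["id"] < b["id"] and haversine_m(
                        a["lat"], a["lon"], b["lat"], b["lon"]) <= radius_m:
                    pairs.add((a["id"], b["id"]))
    return pairs
```

At basin scale the same pattern runs distributed: the cell key becomes a Spark partition/join key, so each worker only sees the candidate pairs inside its own spatial blocks.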
Technology Stack
| Category | Technology | Role |
|---|---|---|
| Platform | Azure Databricks | Executes all pipeline stages as PySpark notebooks on 8 to 96 auto-scaling workers |
| Processing | Apache Spark, PySpark | Distributed deduplication across 4.8M well record pairs in parallel |
| Integration | Azure Data Factory | Ingests 6,200 Excel files from eight systems with schema validation |
| Storage | Azure Data Lake Gen2 | Stores well records in Parquet and Delta Lake with AES-256 encryption |
| Database | Azure SQL Server | Golden well master with sub-12ms query performance for Power BI and regulatory systems |
| AI and ML | Azure ML, Spark MLlib | Gradient boosting model trained on 180K well pairs, F1 score 0.971 |
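The standardization stage described above pairs regex cleaning with fuzzy string matching for operator names. The source does not name the matching library used, so the sketch below uses Python's standard-library `difflib.SequenceMatcher` purely to illustrate the technique; the suffix list is a hypothetical example.

```python
import re
from difflib import SequenceMatcher

# Hypothetical list of corporate suffixes to strip before comparison.
SUFFIXES = re.compile(r"\b(inc|llc|lp|ltd|corp|co|company|operating)\b\.?",
                      re.IGNORECASE)

def clean_operator(name):
    # Regex cleaning: lowercase, drop suffixes and punctuation, tokenize.
    name = SUFFIXES.sub(" ", name.lower())
    return re.sub(r"[^a-z0-9 ]+", " ", name).split()

def operator_similarity(a, b):
    # Fuzzy ratio on the cleaned, re-joined token strings (0.0 to 1.0).
    return SequenceMatcher(None, " ".join(clean_operator(a)),
                           " ".join(clean_operator(b))).ratio()

print(operator_similarity("Permian Basin Operating, LLC",
                          "PERMIAN BASIN OPERATING CO."))  # → 1.0
```

In production this similarity score would be one of the twenty features feeding the gradient boosting matcher, alongside identifier, coordinate, and production-attribute comparisons.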
The platform delivered measurable improvements across data quality, processing speed, royalty accuracy, regulatory compliance, and infrastructure cost:
- ~900K Duplicate Well Records Eliminated: The raw inventory of approximately 4.8 million records carried an estimated 18% duplicate rate. Post-resolution, the golden well master contained approximately 3.9 million verified, unique well records with a residual duplicate rate of 1.4%.
- Full-Corpus Deduplication Time Reduced Dramatically: A sequential proof-of-concept projected a 22-day runtime for a full-corpus pass. The Azure Databricks Spark pipeline with multi-key blocking and auto-scaling completed the same pass in a fraction of that time, enabling five full re-runs during the program for model tuning.
- Royalty Calculation Accuracy Restored: Production volumes are now attributed to a single authoritative well record, eliminating the double-counting that had driven royalty calculation discrepancies across multiple accounting periods.
- BLM Regulatory Filing Consistency Achieved: API-14 standardization and operator name normalization resolved all known identifier conflicts across the regulatory filing history, with zero API-number discrepancies in the BLM submission generated from the golden master.
- 42% Azure Infrastructure Cost Reduction: Databricks auto-scaling reduced the cluster to 8 workers during idle periods and scaled to 96 workers only during active resolution runs, bringing total compute spend approximately 42% below the fixed-cluster baseline estimate.
“We had been living with a well inventory that nobody trusted for years. Within nine months, Ksolves gave us a single well master on Azure that our land team, our finance team, and our regulators can all rely on. That is genuinely transformative for how we run this business.”
– VP of Data and Technology, Mid-Major US E&P Operator
Before this engagement, the operator's well inventory was fragmented across 6,200 Excel files and eight source systems, with an 18% duplicate rate, no cloud infrastructure, and an impending BLM audit. Today, the operator runs a production-grade Azure well master platform, delivered through Ksolves' AI-first approach, holding 3.9 million verified well records with full API-14 standardization and geospatial verification, consumed in real time by ten or more operational systems across land, finance, production, and regulatory functions.
Still Managing Critical Well Data Across Disconnected Excel Files and Legacy Systems?
For oil and gas operators managing fragmented well master data across legacy systems and Excel files, explore Ksolves Big Data and Azure Data Engineering Services, and find out what a clean, unified well master on Azure can deliver for your operations. You can also contact us at sales@ksolves.com.