What it was
Pangea was a data matching and management platform I built at SCL Elections / Cambridge Analytica. It ingested millions of records from different sources, ran real-time matching and deduplication, and gave analysts sub-second query performance on the results.
Numbers
- 10M+ records/day ingested and processed
- Sub-second response times on complex matching queries
- 99.9% uptime via Mesos cluster orchestration on AWS
- Real-time matching with custom Scala algorithms
How it worked
The platform was a pipeline. Ingestion services pulled data from multiple sources, a Scala matching engine handled entity resolution and deduplication, and a React frontend let analysts search, review matches, and export results. Everything sat on top of Apache Drill + Hadoop.
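The three-stage shape (ingest, resolve/dedupe, index for fast lookup) can be sketched in a few functions. This is an illustrative Python sketch, not the production code — the real engine was Scala, and the names (`ingest`, `resolve`, `index`) and the name/email match key are invented for the example:

```python
def ingest(sources):
    """Pull raw records from several heterogeneous sources into one stream."""
    for source in sources:
        for record in source:
            yield record

def resolve(records, key=lambda r: (r["name"].lower(), r["email"].lower())):
    """Toy entity resolution: records sharing a normalized key are merged.
    The first record wins; later duplicates just contribute their source."""
    seen = {}
    for r in records:
        k = key(r)
        if k in seen:
            seen[k]["sources"].append(r.get("source"))
        else:
            seen[k] = dict(r, sources=[r.get("source")])
    return list(seen.values())

def index(entities):
    """Index resolved entities by email for sub-second point lookups."""
    return {e["email"].lower(): e for e in entities}

# Two sources describing the same person, with different casing:
crm = [{"name": "Ada Lovelace", "email": "ada@example.com", "source": "crm"}]
web = [{"name": "ADA LOVELACE", "email": "Ada@example.com", "source": "web"}]

entities = resolve(ingest([crm, web]))
idx = index(entities)
```

The production system did the same thing at much larger scale: normalization and key extraction were their own configurable stages, and the "index" was the Drill/Hadoop query layer rather than an in-memory dict.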
The stack:
- Scala matching engine for entity resolution across different data sources, tuned for throughput and accuracy
- Apache Drill + Hadoop as the query layer, letting analysts run SQL over massive datasets without needing predefined schemas
- Apache Mesos for cluster resource management, scheduling workloads across the compute fleet
- React frontend with interactive dashboards for data exploration and match review
- AWS for cloud infrastructure with auto-scaling
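The Drill layer is what made "SQL without predefined schemas" possible: Drill infers structure from the files at read time. Drill exposes a REST endpoint (`POST /query.json`) that clients can submit SQL to; the sketch below shows that shape, with the host, data path, and query all made up for illustration:

```python
import json
from urllib import request

# Assumed Drill default REST port; not a real cluster.
DRILL_URL = "http://localhost:8047/query.json"

def drill_payload(sql: str) -> dict:
    """Build the JSON body Drill's REST API expects for a SQL query."""
    return {"queryType": "SQL", "query": sql}

def run_query(sql: str) -> dict:
    """Submit a query to a running Drill cluster (requires a live cluster)."""
    body = json.dumps(drill_payload(sql)).encode()
    req = request.Request(
        DRILL_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with request.urlopen(req) as resp:
        return json.load(resp)

# Schema-on-read: Drill reads the Parquet/JSON files directly, so analysts
# can aggregate over raw ingested data with no table definitions up front.
SAMPLE_SQL = """
SELECT source, COUNT(*) AS records
FROM dfs.`/data/pangea/ingested`
GROUP BY source
"""

payload = drill_payload(SAMPLE_SQL)
```

In practice analysts mostly ran queries like this through the React frontend rather than hitting the REST API directly.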
Working with the ICO before GDPR existed
We built Pangea before GDPR existed. There was no established compliance framework for platforms handling data at this scale, so we worked directly with the UK Information Commissioner's Office (ICO) for over two years to help figure out what the rules should look like.
One of the big debates was about IP addresses. We argued to the ICO that classifying IP addresses as Personally Identifiable Information would make it impossible to include them in server logs, monitoring, and debugging infrastructure. That would have broken how internet services fundamentally work. It was a real tension between privacy and operational reality, and we helped the ICO work through it as they were forming their guidance.
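One middle ground in debates like this is pseudonymizing or truncating IPs before they reach long-term logs, keeping their debugging value without identifying an individual. A minimal sketch of both techniques (not Pangea's actual approach; the prefix lengths and helper names are illustrative):

```python
import hashlib
import ipaddress

def truncate_ip(ip: str) -> str:
    """Zero the host bits (IPv4 /24, IPv6 /48) so logs retain network-level
    debugging value without pointing at an individual subscriber."""
    addr = ipaddress.ip_address(ip)
    prefix = 24 if addr.version == 4 else 48
    net = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
    return str(net.network_address)

def pseudonymize_ip(ip: str, secret: str) -> str:
    """Keyed hash: a stable per-IP token usable for correlating log lines,
    but not reversible without the secret."""
    return hashlib.sha256((secret + ip).encode()).hexdigest()[:16]
```

The trade-off is exactly the tension described above: truncation preserves less debugging signal but is clearly non-identifying, while keyed hashing preserves per-client correlation at the cost of being personal data if the key leaks.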
Building at the frontier of data regulation, before the rules were written, meant every architecture decision had to balance what we could do with what we thought the rules might eventually require.
Technical details
- Designed the ingestion pipeline to handle schema differences across dozens of data sources
- Built matching algorithms that balanced precision and recall for large-scale entity resolution
- Set up monitoring and alerting that kept 99.9% uptime across the full stack
- Worked with the UK ICO for 2+ years on data classification standards before GDPR landed
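The precision/recall balance in the matching bullet above ultimately comes down to a similarity threshold. A toy Python illustration (the production algorithms were Scala and far more involved; Jaccard over character trigrams is just one common fuzzy-match building block):

```python
def trigrams(s: str) -> set:
    """Character trigrams of a padded, lowercased string."""
    s = f"  {s.lower().strip()}  "
    return {s[i:i + 3] for i in range(len(s) - 2)}

def similarity(a: str, b: str) -> float:
    """Jaccard similarity over character trigrams, in [0, 1]."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)

def is_match(a: str, b: str, threshold: float) -> bool:
    # Raising the threshold favors precision (fewer false merges);
    # lowering it favors recall (fewer missed duplicates).
    return similarity(a, b) >= threshold
```

Tuning that threshold per data source, against labeled match/non-match pairs, is where most of the precision/recall work actually happens.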

