What it was
Pangea was a data matching and management platform I built at SCL Elections / Cambridge Analytica. It ingested millions of records from different sources, ran real-time matching and deduplication, and gave analysts sub-second query performance on the results.
Numbers
- 10M+ records/day ingested and processed
- Sub-second response times on complex matching queries
- 99.9% uptime via Mesos cluster orchestration on AWS
- Real-time matching with custom Scala algorithms
How it worked
The platform was a pipeline. Ingestion services pulled data from multiple sources, a Scala matching engine handled entity resolution and deduplication, and a React frontend let analysts search, review matches, and export results. Everything sat on top of Apache Drill + Hadoop.
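The three-stage shape (ingest, resolve/dedupe, index for fast lookup) can be sketched in a few functions. This is an illustrative Python sketch, not the production code — the real engine was Scala, and the names (`ingest`, `resolve`, `index`) and the name/email match key are invented for the example:

```python
def ingest(sources):
    """Pull raw records from several heterogeneous sources into one stream."""
    for source in sources:
        for record in source:
            yield record

def resolve(records, key=lambda r: (r["name"].lower(), r["email"].lower())):
    """Toy entity resolution: records sharing a normalized key are merged.
    The first record wins; later duplicates just contribute their source."""
    seen = {}
    for r in records:
        k = key(r)
        if k in seen:
            seen[k]["sources"].append(r.get("source"))
        else:
            seen[k] = dict(r, sources=[r.get("source")])
    return list(seen.values())

def index(entities):
    """Index resolved entities by email for sub-second point lookups."""
    return {e["email"].lower(): e for e in entities}

# Two sources describing the same person, with different casing:
crm = [{"name": "Ada Lovelace", "email": "ada@example.com", "source": "crm"}]
web = [{"name": "ADA LOVELACE", "email": "Ada@example.com", "source": "web"}]

entities = resolve(ingest([crm, web]))
idx = index(entities)
```

The production system did the same thing at much larger scale: normalization and key extraction were their own configurable stages, and the "index" was the Drill/Hadoop query layer rather than an in-memory dict.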
The stack:
- Scala matching engine for entity resolution across different data sources, tuned for throughput and accuracy
- Apache Drill + Hadoop as the query layer, letting analysts run SQL over massive datasets without needing predefined schemas
- Apache Mesos for cluster resource management, scheduling workloads across the compute fleet
- React frontend with interactive dashboards for data exploration and match review
- AWS for cloud infrastructure with auto-scaling
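The Drill layer is what made "SQL without predefined schemas" possible: Drill infers structure from the files at read time. Drill exposes a REST endpoint (`POST /query.json`) that clients can submit SQL to; the sketch below shows that shape, with the host, data path, and query all made up for illustration:

```python
import json
from urllib import request

# Assumed Drill default REST port; not a real cluster.
DRILL_URL = "http://localhost:8047/query.json"

def drill_payload(sql: str) -> dict:
    """Build the JSON body Drill's REST API expects for a SQL query."""
    return {"queryType": "SQL", "query": sql}

def run_query(sql: str) -> dict:
    """Submit a query to a running Drill cluster (requires a live cluster)."""
    body = json.dumps(drill_payload(sql)).encode()
    req = request.Request(
        DRILL_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with request.urlopen(req) as resp:
        return json.load(resp)

# Schema-on-read: Drill reads the Parquet/JSON files directly, so analysts
# can aggregate over raw ingested data with no table definitions up front.
SAMPLE_SQL = """
SELECT source, COUNT(*) AS records
FROM dfs.`/data/pangea/ingested`
GROUP BY source
"""

payload = drill_payload(SAMPLE_SQL)
```

In practice analysts mostly ran queries like this through the React frontend rather than hitting the REST API directly.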
Working with the ICO before GDPR existed
We built Pangea before GDPR existed. There was no established compliance framework for platforms handling data at this scale, so we worked directly with the UK Information Commissioner's Office (ICO) for over two years to help figure out what the rules should look like.
One of the big debates was about IP addresses. We argued to the ICO that classifying IP addresses as Personally Identifiable Information would make it impossible to include them in server logs, monitoring, and debugging infrastructure. That would have broken how internet services fundamentally work. It was a real tension between privacy and operational reality, and we helped the ICO work through it as they were forming their guidance.
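One middle ground in debates like this is pseudonymizing or truncating IPs before they reach long-term logs, keeping their debugging value without identifying an individual. A minimal sketch of both techniques (not Pangea's actual approach; the prefix lengths and helper names are illustrative):

```python
import hashlib
import ipaddress

def truncate_ip(ip: str) -> str:
    """Zero the host bits (IPv4 /24, IPv6 /48) so logs retain network-level
    debugging value without pointing at an individual subscriber."""
    addr = ipaddress.ip_address(ip)
    prefix = 24 if addr.version == 4 else 48
    net = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
    return str(net.network_address)

def pseudonymize_ip(ip: str, secret: str) -> str:
    """Keyed hash: a stable per-IP token usable for correlating log lines,
    but not reversible without the secret."""
    return hashlib.sha256((secret + ip).encode()).hexdigest()[:16]
```

The trade-off is exactly the tension described above: truncation preserves less debugging signal but is clearly non-identifying, while keyed hashing preserves per-client correlation at the cost of being personal data if the key leaks.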
Building at the frontier of data regulation, before the rules were written, meant every architecture decision had to balance what we could do with what we thought the rules might eventually require.
Technical details
- Designed the ingestion pipeline to handle schema differences across dozens of data sources
- Built matching algorithms that balanced precision and recall for large-scale entity resolution
- Set up monitoring and alerting that kept 99.9% uptime across the full stack
- Worked with the UK ICO for 2+ years on data classification standards before GDPR landed
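The precision/recall balance in the matching bullet above ultimately comes down to a similarity threshold. A toy Python illustration (the production algorithms were Scala and far more involved; Jaccard over character trigrams is just one common fuzzy-match building block):

```python
def trigrams(s: str) -> set:
    """Character trigrams of a padded, lowercased string."""
    s = f"  {s.lower().strip()}  "
    return {s[i:i + 3] for i in range(len(s) - 2)}

def similarity(a: str, b: str) -> float:
    """Jaccard similarity over character trigrams, in [0, 1]."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)

def is_match(a: str, b: str, threshold: float) -> bool:
    # Raising the threshold favors precision (fewer false merges);
    # lowering it favors recall (fewer missed duplicates).
    return similarity(a, b) >= threshold
```

Tuning that threshold per data source, against labeled match/non-match pairs, is where most of the precision/recall work actually happens.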

