Synthetic Data Generation and Evaluation Tool 


A synthetic data tool that lets users control their own privacy tradeoffs

Role  Project Manager
Year  2025 (ongoing)
Keywords  privacy-enhancing technologies, design
I have led the product definition, system architecture, and end-to-end execution of a privacy-preserving synthetic data platform with an eight-person team at AIOI R&D Lab Oxford. The platform automates the process of generating and evaluating synthetic datasets, letting users compare privacy-utility tradeoffs visually and decide what they're comfortable sharing.


My Role

  • Defining product requirements: Articulated a product vision centered on user agency in data sharing, translated broad user and business needs into a concrete roadmap, balanced innovation with feasibility, and defined north-star outcomes for the MVP.
  • Designing the system architecture: Collaborated closely with ML researchers and engineers to design the AutoML architecture.
  • Leading cross-functional execution: Drove alignment across engineering, UX research, and applied AI teams throughout the development lifecycle, and facilitated stakeholder interviews, prototyping, and the patent filing process.



Background

Organizations want to collaborate on AI development, but the personal data involved is the main barrier to sharing. Data providers must anonymize data or create synthetic versions, a process that is costly, time-consuming, and requires specialized expertise to assess re-identification risk. The fundamental problem: stronger privacy protection typically means less useful data for training models, and finding the right balance requires technical knowledge most data providers don't have.


Persona & Customer Journey 

To understand this problem deeply, I mapped the data sharing journey and interviewed stakeholders across the process—data scientists needing datasets for model training and business teams evaluating partnership opportunities. 





Pain Points

The privacy-utility tradeoff is opaque. Stronger privacy protection typically degrades data quality, but users have no way to see this tradeoff or find the right balance for their needs.

Users lack control. Existing tools impose fixed anonymization settings without showing what users are gaining or losing, forcing them to either trust blindly or avoid sharing altogether.



Solution

We built a system that shifts control to data owners, automating the technical complexity of privacy-utility evaluation while keeping humans in the decision-making loop.




Local-first architecture

The platform runs entirely on the data owner's local infrastructure—real data never leaves their environment. Data owners upload their sensitive data and configure parameters: which synthetic data generator to use (e.g., RAP+, TabSyn), the desired privacy level (ε), optional fairness constraints, sample size, and the prediction target variable for ML evaluation.
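
To make this concrete, here is a rough sketch of what such a run configuration could look like. The field names, the generator identifiers, and all example values are illustrative assumptions, not the platform's actual API.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class GenerationConfig:
    """Illustrative run configuration; every field name here is hypothetical."""
    generator: str                        # e.g. "rap_plus" or "tabsyn"
    epsilon: float                        # privacy budget; smaller = stronger privacy
    target_column: str                    # prediction target for downstream ML evaluation
    n_samples: int = 10_000               # number of synthetic rows to generate
    fairness_attributes: Optional[list[str]] = None  # protected attributes to monitor

config = GenerationConfig(
    generator="tabsyn",
    epsilon=1.0,
    target_column="claim_approved",       # made-up column name for illustration
    fairness_attributes=["gender", "age_band"],
)
```
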
Automated evaluation pipeline

Once configured, the system runs an automated evaluation pipeline: it pre-processes the real data, generates multiple synthetic dataset variants at different privacy levels, trains ML models on both the real and synthetic versions, and scores each synthetic dataset across four dimensions:

  • ML performance: precision, recall, and F-score compared against models trained on the real data
  • Data privacy: re-identification risk
  • ML fairness: bias across protected attributes
  • Data quality: distributional fidelity to the original data
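
In outline, the loop behaves something like the sketch below. It assumes a hypothetical generator object with fit/sample methods and a binary prediction target, and it shows only the ML-performance dimension; the function and variable names are mine, not the platform's.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

def evaluate_variants(real_df, target, make_generator, epsilons):
    """Score synthetic variants generated at several privacy levels.

    `make_generator` is a hypothetical factory returning an object with
    fit(df) / sample(n) methods; real generator libraries differ.
    """
    train, test = train_test_split(real_df, test_size=0.3, random_state=0)
    X_test, y_test = test.drop(columns=[target]), test[target]

    # Baseline: a model trained on the real data.
    baseline = RandomForestClassifier(random_state=0)
    baseline.fit(train.drop(columns=[target]), train[target])
    f1_real = f1_score(y_test, baseline.predict(X_test))

    reports = []
    for eps in epsilons:
        gen = make_generator(epsilon=eps)
        gen.fit(train)                    # learn a privacy-constrained model of the data
        synth = gen.sample(len(train))    # draw one synthetic variant
        model = RandomForestClassifier(random_state=0)
        model.fit(synth.drop(columns=[target]), synth[target])
        reports.append({
            "epsilon": eps,
            "f1_real": f1_real,
            "f1_synthetic": f1_score(y_test, model.predict(X_test)),
            # privacy, fairness, and fidelity scores would be appended here
        })
    return reports
```

Scoring every synthetic-trained model on held-out real data is what keeps the utility comparison honest: each variant is judged by how well it substitutes for the real dataset.
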
Visual comparison and informed decision-making

The data owner receives a visual evaluation report showing the tradeoffs between privacy protection and data utility for each synthetic dataset option. They can compare options side-by-side, see exactly how each privacy setting affects model performance and privacy risk, iteratively refine parameters if needed, and ultimately decide whether and what to share. This approach preserves control and compliance with data governance policies while removing the technical barriers to informed data sharing decisions.
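
A minimal version of such a report could be a single tradeoff chart. The sketch below uses made-up placeholder numbers (not measured results) in the shape produced by the hypothetical evaluate_variants() above; the real report also covers privacy, fairness, and fidelity scores.

```python
import matplotlib.pyplot as plt
import pandas as pd

reports = [  # illustrative placeholder numbers only, not measured results
    {"epsilon": 0.5, "f1_real": 0.82, "f1_synthetic": 0.61},
    {"epsilon": 1.0, "f1_real": 0.82, "f1_synthetic": 0.71},
    {"epsilon": 4.0, "f1_real": 0.82, "f1_synthetic": 0.79},
]

df = pd.DataFrame(reports)
fig, ax = plt.subplots()
ax.plot(df["epsilon"], df["f1_synthetic"], marker="o", label="model trained on synthetic data")
ax.axhline(df["f1_real"].iloc[0], linestyle="--", label="model trained on real data")
ax.set_xlabel("privacy budget ε (smaller = stronger privacy)")
ax.set_ylabel("F-score on held-out real data")
ax.set_title("Privacy-utility tradeoff across synthetic variants")
ax.legend()
plt.show()
```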


Prototype





To validate the concept and gather feedback from potential users, I designed a minimal working prototype focused on usability rather than full functionality. The goal was to get the tool in front of business stakeholders quickly—people who would use it but lacked technical expertise in privacy or synthetic data generation.


Next Steps

From prototype to production-ready platform: With the prototype now being piloted internally, several key development tracks are underway to prepare the system for broader deployment.

Improving usability for non-technical users: The current visualization is too technical. We're redesigning the dashboard to translate privacy metrics into intuitive explanations.

Strengthening privacy guarantees: We're exploring more robust re-identification risk assessment methods to quantify privacy guarantees more rigorously.
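
One common screening heuristic in this space is distance-to-closest-record (DCR), sketched below as an illustration of the kind of method under consideration, not a committed design.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def dcr_risk_share(real_X, synth_X, percentile=5):
    """Share of synthetic rows that sit unusually close to some real row.

    Synthetic records closer to a real record than real records typically
    are to each other may effectively memorize, and so help re-identify,
    a real individual. The percentile threshold is an illustrative choice.
    """
    # Typical spacing among real records (second neighbor skips the self-match).
    real_nn = NearestNeighbors(n_neighbors=2).fit(real_X)
    real_gaps = real_nn.kneighbors(real_X)[0][:, 1]
    threshold = np.percentile(real_gaps, percentile)

    # Distance from each synthetic row to its nearest real row.
    synth_d = NearestNeighbors(n_neighbors=1).fit(real_X).kneighbors(synth_X)[0].ravel()
    return float(np.mean(synth_d < threshold))
```
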

Expanding use cases: Beyond B2B data sharing, we're exploring research collaboration—enabling academics to access synthetic datasets for public interest studies. Looking further, we're considering B2C applications where individuals could generate and share their own synthetic personal data with researchers or policymakers, extending negotiated privacy boundaries to individual data subjects.


Patent

We filed a patent application covering the AutoML architecture.




Credits


Yui Kondo, Project Manager
Huanian Zhu, Data Scientist
Kensuke Onuma, Advisor
Doran Khamis, Machine Learning Scientist
Andreas Pentaliotis, Machine Learning Engineer
Yuki Tachibana, Machine Learning Engineer
Bernardo Perez Orozco, Machine Learning Engineer
Emma Cole, Project Coordinator