Synthetic Data Generation and Evaluation Tool
A synthetic data tool that lets users control their own privacy tradeoffs
Role Project Manager
Year 2025 (ongoing)
Keywords privacy-enhancing technologies, design
My Role
- Defining product requirements: Articulated our product vision centered on user agency in data sharing, translated broad user and business needs into a concrete roadmap, balanced innovation with feasibility, and defined north-star outcomes for the MVP.
- Shaping the technical design: Collaborated closely with ML researchers and engineers to design the AutoML architecture.
- Leading cross-functional execution: Drove alignment across engineering, UX research, and applied AI teams throughout the development lifecycle, and facilitated stakeholder interviews, prototyping, and patent filing processes.
Background
Persona & Customer Journey
Pain Points
Users lack control. Existing tools impose fixed anonymization settings without showing what users are gaining or losing, forcing them to either trust blindly or avoid sharing altogether.
Solution
The platform runs entirely on the data owner's local infrastructure—real data never leaves their environment. Data owners upload their sensitive data and configure parameters: which synthetic data generator to use (e.g., RAP+, TabSyn), the desired privacy level (ε), optional fairness constraints, sample size, and the prediction target variable for ML evaluation.
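To make the configuration step concrete, here is a minimal Python sketch of the parameters a data owner sets. The `GenerationConfig` class, its field names, and defaults are illustrative assumptions, not the product's actual API.

```python
from dataclasses import dataclass, field

@dataclass
class GenerationConfig:
    # Which synthetic data generator to run (e.g. "RAP+", "TabSyn")
    generator: str = "TabSyn"
    # Differential-privacy budget: smaller epsilon means stronger privacy
    epsilon: float = 1.0
    # Optional fairness constraints, keyed by protected attribute
    fairness_constraints: dict = field(default_factory=dict)
    # Number of synthetic rows to generate
    sample_size: int = 10_000
    # Column the downstream ML evaluation will predict
    target: str = "label"

    def validate(self) -> None:
        if self.epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if self.sample_size <= 0:
            raise ValueError("sample_size must be positive")

# Example: a run with the RAP+ generator at a strict privacy budget
cfg = GenerationConfig(generator="RAP+", epsilon=0.5, target="income")
cfg.validate()
```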
Once configured, the system runs an automated evaluation pipeline. It pre-processes the real data, generates multiple synthetic dataset variants at different privacy levels, trains ML models on both the real and synthetic versions, and evaluates each synthetic dataset across four dimensions: ML performance (comparing Precision/Recall/F-score against models trained on real data), data privacy (assessing re-identification risk), ML fairness (measuring bias across protected attributes), and data quality (evaluating distributional fidelity to the original data).
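The pipeline loop above can be sketched as follows. The toy noise-based generator, the trivial threshold classifier, and the single F-score metric are stand-ins for the real generators (RAP+, TabSyn) and the full four-dimension evaluation; the sketch only shows the orchestration of generating variants at several privacy levels and comparing them against a model trained on real data.

```python
import random

random.seed(0)

def make_real_data(n=200):
    # Toy dataset: one numeric feature and a label correlated with it.
    rows = []
    for _ in range(n):
        x = random.gauss(0, 1)
        y = 1 if x + random.gauss(0, 0.3) > 0 else 0
        rows.append({"x": x, "y": y})
    return rows

def generate_synthetic(real, epsilon, n):
    # Stand-in for a DP generator: resample real rows and perturb the
    # feature with noise scaled by 1/epsilon (smaller eps => more noise).
    scale = 1.0 / epsilon
    return [{"x": r["x"] + random.gauss(0, scale), "y": r["y"]}
            for r in (random.choice(real) for _ in range(n))]

def train_threshold(rows):
    # Trivial "model": classify by whether x exceeds the training mean.
    mu = sum(r["x"] for r in rows) / len(rows)
    return lambda r: 1 if r["x"] > mu else 0

def f1(model, test_rows):
    tp = sum(1 for r in test_rows if model(r) == 1 and r["y"] == 1)
    fp = sum(1 for r in test_rows if model(r) == 1 and r["y"] == 0)
    fn = sum(1 for r in test_rows if model(r) == 0 and r["y"] == 1)
    p = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * rec / (p + rec) if p + rec else 0.0

real = make_real_data()
held_out = make_real_data()
baseline = f1(train_threshold(real), held_out)

# Generate variants at several privacy levels and score each one.
report = []
for eps in (0.1, 1.0, 10.0):
    synth = generate_synthetic(real, eps, len(real))
    report.append({"epsilon": eps,
                   "f1_synthetic": round(f1(train_threshold(synth), held_out), 3),
                   "f1_real": round(baseline, 3)})

for row in report:
    print(row)
```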
The data owner receives a visual evaluation report showing the tradeoffs between privacy protection and data utility for each synthetic dataset option. They can compare options side-by-side, see exactly how each privacy setting affects model performance and privacy risk, iteratively refine parameters if needed, and ultimately decide whether and what to share. This approach preserves control and compliance with data governance policies while removing the technical barriers to informed data sharing decisions.
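One way to frame the side-by-side comparison is as a Pareto front over (privacy risk, utility): the report only needs to surface options that are not strictly worse than another option on both axes. The sketch below assumes a hypothetical `pareto_front` helper, and the risk/utility numbers are made up for illustration.

```python
def pareto_front(options):
    # Keep options that no other option dominates, where "dominates"
    # means lower-or-equal risk AND higher-or-equal utility, strictly
    # better on at least one of the two.
    front = []
    for o in options:
        dominated = any(
            p["risk"] <= o["risk"] and p["utility"] >= o["utility"]
            and (p["risk"] < o["risk"] or p["utility"] > o["utility"])
            for p in options)
        if not dominated:
            front.append(o)
    return front

# Illustrative evaluation results for four candidate privacy levels.
options = [
    {"epsilon": 0.1,  "risk": 0.02, "utility": 0.61},
    {"epsilon": 1.0,  "risk": 0.10, "utility": 0.78},
    {"epsilon": 10.0, "risk": 0.35, "utility": 0.80},
    {"epsilon": 5.0,  "risk": 0.40, "utility": 0.79},  # dominated by eps=10
]
front = pareto_front(options)
print([o["epsilon"] for o in front])  # → [0.1, 1.0, 10.0]
```

Presenting only the non-dominated options keeps the decision in the data owner's hands while removing choices that are objectively worse on both axes.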
Prototype
To validate the concept and gather feedback from potential users, I designed a minimal working prototype focused on usability rather than full functionality. The goal was to get the tool in front of business stakeholders quickly—people who would use it but lacked technical expertise in privacy or synthetic data generation.
Next Steps
Improving usability for non-technical users: The current visualization is too technical. We're redesigning the dashboard to translate privacy metrics into intuitive explanations.
Strengthening privacy guarantees: We're exploring more robust re-identification risk assessment methods to quantify privacy guarantees more rigorously.
Expanding use cases: Beyond B2B data sharing, we're exploring research collaboration—enabling academics to access synthetic datasets for public interest studies. Looking further, we're considering B2C applications where individuals could generate and share their own synthetic personal data with researchers or policymakers, extending negotiated privacy boundaries to individual data subjects.
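As background for the re-identification work above, one common starting point is a distance-to-closest-record check: synthetic rows that sit very close to a real row are flagged as potential leaks. The sketch below is a minimal illustration of that idea, not our production method; the threshold and data are invented.

```python
def distance_to_closest_record(synth_row, real_rows):
    # Euclidean distance from one synthetic record to its nearest real record.
    return min(
        sum((synth_row[k] - r[k]) ** 2 for k in synth_row) ** 0.5
        for r in real_rows)

real = [{"age": 30.0, "income": 50.0}, {"age": 45.0, "income": 80.0}]
synth = [{"age": 30.0, "income": 50.0},   # exact copy of a real record: high risk
         {"age": 38.0, "income": 64.0}]   # far from every real record: lower risk

threshold = 1.0  # synthetic records closer than this are flagged
flagged = sum(1 for s in synth
              if distance_to_closest_record(s, real) < threshold)
print(f"{flagged}/{len(synth)} synthetic records within {threshold} of a real record")
# → 1/2 synthetic records within 1.0 of a real record
```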
Patent
Credits
Yui Kondo, Project Manager
Huanian Zhu, Data Scientist
Kensuke Onuma, Advisor
Doran Khamis, Machine Learning Scientist
Andreas Pentaliotis, Machine Learning Engineer
Yuki Tachibana, Machine Learning Engineer
Bernardo Perez Orozco, Machine Learning Engineer
Emma Cole, Project Coordinator