Request for Information: Synthetic Datasets
Share how you validate synthetic datasets or synthetic data methods
TL;DR: If you know something about this, or have written or read papers on it, please comment.
There are broadly two approaches to analyzing data and distributing results when the data itself cannot be shared for legal or privacy reasons:

1. Federated learning
2. Obscuring the data via differential privacy or synthetic data generation
Where is the normative theory for when synthetic data is trustworthy?
There are no universal surrogate endpoints for therapeutic evaluation, because there is no way to produce a universal surrogate for every possible decision. If something is a good proxy, it has been shown to be a good proxy for a particular purpose. The same goes for synthetic datasets used in lieu of the real thing. How can you generate synthetic data in a vacuum, preemptively, before knowing what decisions are going to be made on it?
The only one I really understand is differential privacy, thanks to its theoretical origins, but there is a wide variety of schools of thought and approaches to synthetic data generation today. In practice, many of them rest on intuitive heuristics. But how do we know what they are good for, and when such data might be misleading?
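To make the contrast concrete: the appeal of differential privacy is that its guarantee is a checkable inequality, not a heuristic. Here is a minimal sketch of the Laplace mechanism for a counting query; the records, the predicate, and the epsilon value are made up purely for illustration.

```python
import numpy as np

def laplace_count(data, predicate, epsilon):
    """Release a count with epsilon-differential privacy.

    A counting query has sensitivity 1: adding or removing one record
    changes the true count by at most 1, so Laplace noise with scale
    1/epsilon bounds the privacy loss of the release by epsilon.
    """
    true_count = sum(1 for row in data if predicate(row))
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Illustrative use: count records above a threshold at epsilon = 0.5.
records = [12, 48, 33, 71, 5, 66]
print(laplace_count(records, lambda x: x > 40, epsilon=0.5))
```

Whatever synthetic data generator you pick, you can ask whether it comes with a statement of this kind; most do not.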
Counterfactual imputation of missing data is “synthetic” data.
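A toy sketch of why (the data and the simple regression imputer are both invented here, just to make the point): the filled-in entries are model predictions, not observations.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy table: column y has missing entries we want to fill in.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 * x + rng.normal(scale=0.5, size=100)
missing = rng.random(100) < 0.3          # ~30% of y is unobserved
y_obs = np.where(missing, np.nan, y)

# Fit a model on the observed rows, then predict the missing ones.
model = LinearRegression().fit(x[~missing].reshape(-1, 1), y_obs[~missing])
y_imputed = y_obs.copy()
y_imputed[missing] = model.predict(x[missing].reshape(-1, 1))

# The filled-in values were never observed; they are model output,
# i.e. synthetic data sitting inside an otherwise "real" dataset.
print(f"imputed {missing.sum()} of {len(y)} values")
```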
LLM benchmarks: take a seed template problem and use language models to scale it up into many variations. This makes the problems realistic but “synthetic” in that they are not written up by humans, as the problems in FrontierMath are.
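A stripped-down sketch of the template-scaling idea (the word problem, its parameters, and the answer check are all placeholders; a real pipeline would also use an LLM to vary the wording and a solver or reviewer to verify the answers):

```python
import random

# Seed: one hand-written problem with slots for parameters.
TEMPLATE = (
    "A tank holds {volume} liters of water and drains at {rate} liters "
    "per minute. How many minutes until it is empty?"
)

def make_variants(n, seed=0):
    """Scale one seed problem into n parameterized variants."""
    rng = random.Random(seed)
    variants = []
    for _ in range(n):
        rate = rng.randint(2, 9)
        volume = rate * rng.randint(10, 50)   # keep the answer a whole number
        variants.append({
            "question": TEMPLATE.format(volume=volume, rate=rate),
            "answer": volume // rate,
        })
    return variants

for item in make_variants(3):
    print(item["question"], "->", item["answer"])
```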
Generative models trained on large biological datasets serve as a source of new synthetic observations. We see this right now in both brain imaging research and genomics.
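The pattern, boiled down (with a Gaussian mixture standing in for whatever deep generative model a given study actually uses, and random numbers standing in for the cohort):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Stand-in for a real cohort: 500 "subjects" with 10 correlated features.
rng = np.random.default_rng(42)
real_cohort = rng.normal(size=(500, 10)) @ rng.normal(size=(10, 10))

# Fit a generative model to the real cohort...
model = GaussianMixture(n_components=5, random_state=0).fit(real_cohort)

# ...then sample a brand-new "cohort" that no one ever measured.
synthetic_cohort, _ = model.sample(n_samples=500)
print(synthetic_cohort.shape)  # (500, 10)
```

Nothing in the sampling step tells you which downstream analyses of the synthetic cohort will give the same answers as the real one; that is exactly the validation question I am asking about.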
In short, how do we know that a synthetically generated dataset yields valid, let alone optimal, answers for every kind of scientific question and decision that could be made with it?


