Open Thread: Synthetic Datasets
Share how you validate synthetic datasets or synthetic data methods
TL;DR: If you know something about this, or have written or read papers on it, please comment.
There are broadly two approaches to analyzing data and distributing results when the data itself cannot be shared for legal or privacy reasons:
Federated learning
Obscuring the data via differential privacy or synthetic data generation
Where is the normative theory for when synthetic data is trustworthy?
Just as there are no universal surrogate endpoints for therapeutic evaluation, there are no universal surrogates for any kind of decision-making. If something is a good proxy, it is a good proxy for a particular purpose. The same goes for synthetic datasets. How can you generate synthetic data in a vacuum, pre-emptively, before knowing what decisions are going to be made with it?
The only approach I really understand is differential privacy, because of its theoretical origins, but there is much more diversity in what is done now. Usually we make up an intuitive procedure, but it is hard to know what these procedures are good for and when they might be misleading.
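For reference, here is the standard guarantee that gives differential privacy its theoretical footing: a randomized mechanism $\mathcal{M}$ is $(\varepsilon, \delta)$-differentially private if, for all neighboring datasets $D$ and $D'$ (differing in a single record) and all measurable output sets $S$,

$$\Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon}\,\Pr[\mathcal{M}(D') \in S] + \delta.$$

The guarantee holds regardless of what downstream analysis is run on the output, which is exactly the kind of purpose-agnostic statement that most synthetic-data procedures lack.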
Counterfactual imputation of missing data is "synthetic" data.
LLM benchmarks: take a seed template problem and use language models to scale it up into many variations. This makes the problems realistic but "synthetic" in that they are not individually written by humans, as they are in FrontierMath. (A toy sketch of this templating idea follows the list below.)
Generative models trained on large biological datasets as a source of new synthetic observations. We see this right now in both brain imaging research and genomics.
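As a concrete, if toy, illustration of the seed-template idea above: the sketch below stamps out many variants of one hand-written problem. To keep it self-contained it samples numeric parameters instead of calling a language model to rewrite the text; the template and the generate_variant function are hypothetical, not taken from any existing benchmark.

```python
import random

# Hypothetical seed template: one human-written problem whose numbers are
# sampled, so many "synthetic" variants share a single structure.
SEED_TEMPLATE = (
    "A tank holds {capacity} liters and drains at {rate} liters per minute. "
    "How many minutes until it is empty?"
)

def generate_variant(rng: random.Random) -> dict:
    """Sample one question/answer pair from the seed template."""
    rate = rng.randint(2, 20)
    minutes = rng.randint(3, 60)
    capacity = rate * minutes  # chosen so the answer comes out exact
    return {
        "question": SEED_TEMPLATE.format(capacity=capacity, rate=rate),
        "answer": minutes,
    }

if __name__ == "__main__":
    rng = random.Random(0)
    for variant in (generate_variant(rng) for _ in range(5)):
        print(variant["question"], "->", variant["answer"])
```

The open question is the same as for the other examples: the variants inherit whatever the template bakes in, and it is unclear how to certify that scores on them transfer to the problems we actually care about.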
How do we know that a synthetically generated dataset produces optimal and unbiased answers for every kind of scientific question and decision that could be based on it?
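One purpose-specific check, in the spirit of the surrogate-endpoint analogy above, is to pick the downstream estimate you actually care about and compare it between the real data and the synthetic data. Here is a minimal sketch under toy assumptions: a hypothetical two-column dataset, a naive Gaussian fit standing in for whatever generator is actually used, and an OLS slope as the downstream "decision". It is a heuristic diagnostic for one question at a time, not the normative theory the post is asking for.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "real" data: two correlated columns (x, y).
n = 2_000
x = rng.normal(size=n)
y = 1.5 * x + rng.normal(scale=0.5, size=n)
real = np.column_stack([x, y])

# Naive synthetic generator: fit a multivariate Gaussian to the real data
# and resample from it. A stand-in for whatever generator is actually used.
synthetic = rng.multivariate_normal(real.mean(axis=0),
                                    np.cov(real, rowvar=False),
                                    size=n)

def slope(data: np.ndarray) -> float:
    """The downstream 'decision': the OLS slope of y on x."""
    xs, ys = data[:, 0], data[:, 1]
    return np.cov(xs, ys)[0, 1] / np.var(xs, ddof=1)

# The check is purpose-specific: does the synthetic data reproduce the
# estimate the analyst actually cares about, within sampling noise?
print(f"real slope:      {slope(real):.3f}")
print(f"synthetic slope: {slope(synthetic):.3f}")
```

Passing a check like this for one estimand says nothing about the next question someone asks of the same synthetic dataset, which is exactly the gap between intuitive procedures and a normative theory.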


