Neurostats: Surrogate Dispatches

Request for Information: Synthetic Datasets

Manjari Narayan — Fri, 14 Nov 2025 18:28:43 GMT

TL;DR: If you know something about this, or have written or read papers on it, please comment.

There are broadly two approaches to analyzing data and distributing results when the data cannot be shared for legal or privacy reasons.

Federated learning
Obscure the data via differential privacy or synthetic data generation

Where is the normative theory for when synthetic data is trustworthy?

There are no universal surrogate endpoints for therapeutic evaluation, because there is no way to produce a universal surrogate for every possible decision. If something is a good proxy, it has been shown to be a good proxy for a particular purpose. The same goes for synthetic datasets used in lieu of the real thing. How can you generate synthetic data in a vacuum, or preemptively, before knowing what decisions are going to be made with it?

The only one I really understand is differential privacy, due to its theoretical origins, but there are a wide variety of synthetic data generation schools of thought and approaches today. In practice, many of them are based on intuitive heuristics. But how do we know what they are good for and when such data might be misleading?

Counterfactual imputation of missing data is “synthetic” data
LLM benchmarks: take a seed template problem and use language models to scale it up into many variations. The resulting problems are realistic but “synthetic” in that they are not written by humans, unlike those in FrontierMath.
Generative models trained on large biological datasets as a source of new synthetic observations. We see this right now in both brain imaging research and genomics.

In short, how do we know that a synthetically generated dataset produces optimal and valid answers for every kind of scientific question and decision that could be made on it?
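To make the purpose-dependence concrete, here is a toy sketch in Python. It is not drawn from any of the approaches above: the two-variable "real" dataset, the correlation of 0.7, and the deliberately naive generator `independent_marginals` are all assumptions made up for illustration, and the sketch says nothing about privacy. The generator permutes each column separately, so every marginal summary is preserved while the relationship between the columns is destroyed.

```python
import math
import random


def pearson(xs, ys):
    """Sample Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def make_real(n=5000, rho=0.7, seed=0):
    """Hypothetical 'real' data: two standard-normal variables with correlation rho."""
    rng = random.Random(seed)
    xs = [rng.gauss(0, 1) for _ in range(n)]
    ys = [rho * x + math.sqrt(1 - rho**2) * rng.gauss(0, 1) for x in xs]
    return xs, ys


def independent_marginals(xs, ys, seed=1):
    """Naive generator (assumption): permute each column on its own,
    which keeps the marginals but breaks the row pairing."""
    rng = random.Random(seed)
    sx, sy = xs[:], ys[:]
    rng.shuffle(sx)
    rng.shuffle(sy)
    return sx, sy


real_x, real_y = make_real()
syn_x, syn_y = independent_marginals(real_x, real_y)

# Question A (marginal mean of y): the synthetic data agrees with the real data.
mean_gap = abs(sum(real_y) / len(real_y) - sum(syn_y) / len(syn_y))
# Question B (association between x and y): the synthetic data is misleading.
real_r = pearson(real_x, real_y)
syn_r = pearson(syn_x, syn_y)
print(f"mean gap: {mean_gap:.2e}, real r: {real_r:.2f}, synthetic r: {syn_r:.2f}")
```

The same synthetic dataset is a faithful proxy for one question and a misleading one for another, which is the sense in which a proxy is only ever good for a particular purpose.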

Biomarker Qualification

Manjari Narayan — Mon, 18 Aug 2025 20:00:00 GMT

Definition: Biomarker qualification is about thinking through the full chain of evidence to prove that a biomarker can be used for a particular clinical decision.

HDL serum cholesterol, for instance, is great for evaluating risk of heart disease but not for evaluating effectiveness of treatments to improve cardiovascular health. There is no such thing as a “good biomarker” in a vacuum. Decisions to use biomarkers are always dependent on the intended applications.1 Sadly, this is not something that most biomedical researchers think about when they do biomarker discovery.


Biomarker-guided therapeutic decisions require developing and validating biomarkers. Specifying the validation criteria requires constant meta-scientific innovation. It is easy to conflate the enterprise of biomarker validation with analytical validation2, followed by reproducible clinical studies. But what constitutes clinical validation?

Analytical validation is all about evaluating the measurement process or assay. Many biomarker discovery studies will demonstrate test-retest reliability, which asks whether people can be reliably differentiated based on their biomarker measurements. But analytical validation requires evaluating a far more thorough checklist of measurement issues that goes well beyond test-retest reliability. We also need repeatability of measurements for any individual with good tolerance intervals, comparability of quantitative measurements in a wide variety of circumstances, and many others. Yet analytical validation, with its systematic criteria, is the easier part of the biomarker evaluation process. Qualification, on the other hand, encompasses the full spectrum of validity problems across all the life-medical-health sciences: it includes all the possible “does it mean what you think it means” problems. Biomarker qualification includes assessing the clinical validity3 of the biomarker as well as other validation criteria specific to a therapeutic decision.4

Importantly, there is no easy way to preemptively specify all the threats to validity: construct validity, causal validities including internal and external validity, and all the other modern validities, beyond reliability of the measurement, that link the biomarker to disease and/or therapeutic outcomes.

Altar et al. (2008)

Here is a concrete example of what a comprehensive understanding of biological and clinical validation looks like.

Credit: Altar, C.A. et al. (2008) ‘A prototypical process for creating evidentiary standards for biomarkers and diagnostics’, Clinical pharmacology and therapeutics, 83(2), pp. 368–371. https://doi.org/10.1038/sj.clpt.6100451.

However, this table reflects 20th century understanding. It could use significant updating given how far scientific and statistical methodology has come in 20 years.

Checklist for a biomedical stakeholder

Analytical validation and biomarker qualification are terms of art when biomarkers are proposed for drug development decisions in clinical trials. Unfortunately, these terms are not widely used within mainstream biomedical research. It is easy to think of these as regulatory concerns that don’t matter until one wants to bring biomarkers to the clinic, as opposed to scientific concerns that need to be addressed. Every research community has its own epistemic norms around “validation”. When you visit premier conferences in different niches of life science where biomarker research occurs, these differences become apparent. No one actually owns the problem of understanding the full scope of scientific R&D that needs to occur.

If you read an article that calls for a large-scale validation for new biomarkers, here is what you should ask yourself —

Is it clear that the biologist or scientist’s notion of validation is distinct from analytical validation? Has it at least considered all the problems in Altar 2008, for instance?
Does the roadmap for biological plausibility and clinical validation cover the full spectrum of research designs and grades of evidence that need to be generated?
Does it address all threats to scientific validity known to methodologists for a particular decision or context of use, even outside one disease area?

If not, then the field might need a better specification of the validation roadmap. It is far too late to do the necessary R&D if you wait until someone is ready to initiate conversations with the FDA.


Institute of Medicine. 2010. Evaluation of Biomarkers and Surrogate Endpoints in Chronic Disease. Washington, DC: The National Academies Press. https://doi.org/10.17226/12869.

FDA on analytical validation, ICH on analytical validation

Ransohoff, D. Bias as a threat to the validity of cancer molecular-marker research. Nat Rev Cancer 5, 142–149 (2005). https://doi.org/10.1038/nrc1550

Fleming, T.R. and Powers, J.H. (2012) ‘Biomarkers and surrogate endpoints in clinical trials’, Statistics in medicine, 31(25), pp. 2973–2984. https://doi.org/10.1002/sim.5403.

Predictive (in)validity

Manjari Narayan — Mon, 12 May 2025 17:00:00 GMT

It is common in late-stage drug development for pharmaceutical companies to use rigorous quantitative decision making to plan out the different stages of clinical trials and to evaluate the risk of clinical trial failures. This often goes under the name of assurance, or predictive probability of success.

The original use-case for assurance was to optimize sample size choices in clinical trials!

Assurance is the unconditional probability that the trial will yield a ‘positive outcome’. A positive outcome usually means a statistically significant result, according to some standard frequentist significance test. The assurance is then the prior expectation of the power, averaged over the prior distribution for the unknown true treatment effect.
We argue that assurance is an important measure of the practical utility of a proposed trial, and indeed that it will often be appropriate to choose the size of the sample (and perhaps other aspects of the design) to achieve a desired assurance, rather than to achieve a desired power conditional on an assumed treatment effect.

O'Hagan, A., Stevens, J.W. and Campbell, M.J. (2005), Assurance in clinical trial design. Pharmaceut. Statist., 4: 187-201. https://doi.org/10.1002/pst.175
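As a minimal sketch of this definition, the Python below computes assurance for a simplified case that is assumed here for illustration: a two-arm trial with a normally distributed outcome, known standard deviation `sigma`, a one-sided test at level `alpha`, and a normal prior on the true treatment effect. The function names and all numbers are hypothetical. Under these assumptions the prior expectation of the power has a closed form, and the Monte Carlo version simply averages the power over prior draws.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def power(delta, sigma, n_per_arm, alpha=0.025):
    """Power of a one-sided z-test, conditional on a true effect delta."""
    se = sigma * math.sqrt(2.0 / n_per_arm)  # standard error of the difference in means
    z_crit = STD_NORMAL.inv_cdf(1.0 - alpha)
    return STD_NORMAL.cdf(delta / se - z_crit)


def assurance_closed_form(prior_mean, prior_sd, sigma, n_per_arm, alpha=0.025):
    """Assurance under a Normal(prior_mean, prior_sd^2) prior on delta.

    Marginally the estimated effect is Normal(prior_mean, se^2 + prior_sd^2),
    so the probability that it exceeds z_crit * se is a single normal CDF.
    """
    se = sigma * math.sqrt(2.0 / n_per_arm)
    z_crit = STD_NORMAL.inv_cdf(1.0 - alpha)
    return STD_NORMAL.cdf((prior_mean - z_crit * se) / math.sqrt(se**2 + prior_sd**2))


def assurance_monte_carlo(prior_mean, prior_sd, sigma, n_per_arm,
                          alpha=0.025, draws=20_000, seed=0):
    """Assurance as the average of the power over draws from the prior."""
    rng = random.Random(seed)
    total = sum(power(rng.gauss(prior_mean, prior_sd), sigma, n_per_arm, alpha)
                for _ in range(draws))
    return total / draws


# Hypothetical design: assumed effect 0.5, sigma 1, 64 participants per arm.
cond_power = power(0.5, sigma=1.0, n_per_arm=64)
exact = assurance_closed_form(0.5, 0.2, sigma=1.0, n_per_arm=64)
approx = assurance_monte_carlo(0.5, 0.2, sigma=1.0, n_per_arm=64)
print(f"power at assumed effect: {cond_power:.3f}")
print(f"assurance (closed form): {exact:.3f}, (Monte Carlo): {approx:.3f}")
```

In this example the assurance comes out lower than the power at the assumed effect, because the prior puts weight on smaller effects where the trial is underpowered. That gap is the reason to size a trial on assurance rather than on conditional power.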

Major pharmaceutical companies such as Novartis, GSK, and Roche have teams that design methodology for assurance, sometimes even for clinical pharmacology and dosing decisions, and for trials based on surrogate endpoints. The quality of a biomarker endpoint needs to be substantial for such assurance calculations to hold.

Conditional assurance is a further extension of the original concept.

Conditional assurance is the predicted assurance of a subsequent study, conditional on the success of an initial study and the design prior.
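One way to read this definition is as the probability that the second study succeeds given that the first one did, averaged over the design prior. The sketch below is an illustration under assumptions that are not from the source: both studies are two-arm trials with a known standard deviation and a one-sided test, the prior on the true effect is normal, and the two studies are independent given the true effect. The function names and numbers are hypothetical.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def power(delta, sigma, n_per_arm, alpha=0.025):
    """Power of a one-sided z-test, conditional on a true effect delta."""
    se = sigma * math.sqrt(2.0 / n_per_arm)
    return STD_NORMAL.cdf(delta / se - STD_NORMAL.inv_cdf(1.0 - alpha))


def conditional_assurance(prior_mean, prior_sd, sigma, n1, n2,
                          alpha=0.025, draws=20_000, seed=0):
    """Estimate P(study 2 succeeds | study 1 succeeds) under the design prior.

    Given delta the studies are assumed independent, so
    P(S2 | S1) = E[power1 * power2] / E[power1], with expectations over the prior.
    """
    rng = random.Random(seed)
    sum_p1 = sum_p2 = sum_p1p2 = 0.0
    for _ in range(draws):
        delta = rng.gauss(prior_mean, prior_sd)
        p1 = power(delta, sigma, n1, alpha)
        p2 = power(delta, sigma, n2, alpha)
        sum_p1 += p1
        sum_p2 += p2
        sum_p1p2 += p1 * p2
    return {
        "assurance_study2": sum_p2 / draws,            # unconditional
        "conditional_assurance": sum_p1p2 / sum_p1,    # given study 1 success
    }


# Hypothetical program: a small first study followed by a larger second study.
result = conditional_assurance(prior_mean=0.3, prior_sd=0.2, sigma=1.0, n1=50, n2=200)
print(result)
```

Conditioning on an initial success shifts belief toward larger effects, so the conditional assurance is at least as large as the unconditional assurance of the second study.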

But what about analogous probabilistic forecasting for earlier stages of drug development? Scannell et al. (2022) point out that the lack of predictive validity of high-throughput screening tools and translational models is in part responsible for the inefficiency of biopharma R&D. One might think that simply correlating the outcomes of a translational model with binary outcomes like drug approvals/failures would provide a good assessment of the predictive validity of translational models. But this is mistaken for a few conceptual reasons:

Clinical trial failures have a file-drawer problem, so we don’t have an unbiased estimate of even the historical successes and failures for a given category of molecule-indication combinations. Forget forecasting: we don’t even have an unbiased prediction error for historical drug development. One major, but not the only, component of the file-drawer problem here is right censoring.
Right censoring problem: The sequential nature of drug development implies a sequential drop-off in the number of drugs tested from phase 1 → phase 2 → phase 3 trials. Thus we don’t get to assess the future clinical validity of all assays of therapeutic potency, toxicity, DMPK, disease models, putative surrogate endpoints and so on without severe omission biases.
Underspecification of success: Defining what constitutes success is a huge outcome and metric-hacking problem. There is a reason so much time and resources are dedicated to choosing endpoints in clinical trials, so that evaluations cannot be gamed. Similarly, choosing what constitutes success for pharmaceutical forecasting at every stage of drug development is subject to external pressures.
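The omission bias from sequential drop-off can be illustrated with a toy simulation. Everything in it is an assumption made up for illustration: a hypothetical assay score whose true correlation with clinical benefit is 0.6, and a pipeline in which only compounds scoring above a cutoff advance far enough for their clinical outcome to be observed. Estimating the assay's validity from advanced compounds alone then understates it, a form of range restriction.

```python
import math
import random


def pearson(xs, ys):
    """Sample Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate_pipeline(n=20_000, rho=0.6, advance_cutoff=1.0, seed=0):
    """Simulate assay scores and clinical benefit; observe only advanced compounds."""
    rng = random.Random(seed)
    assay, benefit = [], []
    for _ in range(n):
        a = rng.gauss(0, 1)
        assay.append(a)
        benefit.append(rho * a + math.sqrt(1 - rho**2) * rng.gauss(0, 1))

    # Only compounds with a high assay score move on to the clinic.
    advanced = [(a, b) for a, b in zip(assay, benefit) if a > advance_cutoff]
    adv_assay = [a for a, _ in advanced]
    adv_benefit = [b for _, b in advanced]

    # The first value is unobservable in practice; the second is what history records.
    return pearson(assay, benefit), pearson(adv_assay, adv_benefit), len(advanced)


r_all, r_advanced, n_advanced = simulate_pipeline()
print(f"all compounds: r = {r_all:.2f}")
print(f"advanced only (n = {n_advanced}): r = {r_advanced:.2f}")
```

The direction and size of the distortion depend on how advancement decisions are made; this sketch shows only the simplest case, where selection is on the assay itself.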

One solution to the omission bias problem is to move away from binary measures of success, such as drug approval/failure, and toward actual individual participant data from clinical trials, which has much richer sources of variation for assessing the clinical validity of in-vitro/in-silico/in-vivo models and assays. This was the basis of the program idea I developed last year during my time as a BRAINS Fellow with SpecTech, also featured on the Gap Map database. But there are even more important reasons to use individual patient-level data to evaluate translational forecasts: to reduce the risk of surrogate paradoxes and other clinical trial failures that amount to getting the therapeutic benefit-adverse effect tradeoff wrong.

In principle, a translational analog of clinical assurance for every tool and model used in drug development would be the principled way to keep track of their effectiveness. AI in drug development is not merely about evaluating therapeutic candidates during hit and lead optimization; it is also useful for severely testing the translational and clinical validity of drug development tools, that is, all in-vitro, in-silico, and in-vivo tools, individually and jointly, for their capacity to generate clinically valid forecasts. Even without raw preclinical and clinical trial datasets, however, there are a multitude of clever approaches to mitigating the biases from the sequential nature of clinical development.