Predictive (in)validity
Pharmaceutical probabilities of success and the selection bias problem
It is common in late-stage drug development for pharmaceutical companies to use rigorous quantitative decision making to plan the successive stages of clinical trials and to evaluate the risk of trial failure. This work usually goes under the name of assurance, or predictive probability of success.
The original use-case for assurance was to optimize sample size choices in clinical trials!
Assurance is the unconditional probability that the trial will yield a ‘positive outcome’. A positive outcome usually means a statistically significant result, according to some standard frequentist significance test. The assurance is then the prior expectation of the power, averaged over the prior distribution for the unknown true treatment effect.
We argue that assurance is an important measure of the practical utility of a proposed trial, and indeed that it will often be appropriate to choose the size of the sample (and perhaps other aspects of the design) to achieve a desired assurance, rather than to achieve a desired power conditional on an assumed treatment effect.
O'Hagan, A., Stevens, J.W. and Campbell, M.J. (2005), Assurance in clinical trial design. Pharmaceut. Statist., 4: 187-201. https://doi.org/10.1002/pst.175
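As a concrete sketch of that definition (with assumed numbers, not taken from the paper): for a two-arm trial with a normal design prior on the true treatment effect, known outcome variance, and a one-sided z-test, assurance can be computed by Monte Carlo as the average power over draws from the prior.

```python
import numpy as np
from scipy import stats

def assurance(n_per_arm, prior_mean, prior_sd, sigma=1.0,
              alpha=0.025, n_sims=100_000, seed=0):
    """Monte Carlo assurance: P(significant result), i.e. frequentist
    power averaged over the design prior for the true effect."""
    rng = np.random.default_rng(seed)
    # Draw true treatment effects from the design prior.
    delta = rng.normal(prior_mean, prior_sd, n_sims)
    # Standard error of the estimated treatment difference.
    se = sigma * np.sqrt(2.0 / n_per_arm)
    z_crit = stats.norm.ppf(1 - alpha)
    # Power of a one-sided z-test, conditional on each drawn effect.
    power = stats.norm.sf(z_crit - delta / se)
    return power.mean()

# Assurance plateaus below 1 as n grows: no sample size rescues a
# prior that puts weight on negligible or harmful effects.
for n in (50, 200, 1000):
    print(n, round(assurance(n, prior_mean=0.3, prior_sd=0.3), 3))
```

The plateau is the point of the O'Hagan et al. argument: unlike power at a fixed assumed effect, assurance is capped by the prior probability that the drug works at all, which is why it is a better guide to sample size choices.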
Major pharmaceutical companies such as Novartis, GSK, and Roche have teams that develop assurance methodology, sometimes even for clinical pharmacology and dosing decisions, and for trials based on surrogate endpoints. In the latter case, a biomarker endpoint must track the clinical endpoint closely enough for an assurance calculation built on it to be credible.
Conditional assurance extends the original concept: it is the predicted assurance of a subsequent study, conditional on the success of an initial study, under the design prior.
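A minimal sketch of this idea, under the same assumed normal model as above: simulate Phase 2 and Phase 3 jointly under the design prior, then estimate the probability that Phase 3 succeeds among the simulated runs where Phase 2 succeeded. Conditioning on a Phase 2 win implicitly updates the prior on the true effect.

```python
import numpy as np
from scipy import stats

def conditional_assurance(n2, n3, prior_mean, prior_sd, sigma=1.0,
                          alpha=0.025, n_sims=200_000, seed=1):
    """P(Phase 3 success | Phase 2 success) under the design prior,
    estimated by simulating both trials jointly."""
    rng = np.random.default_rng(seed)
    delta = rng.normal(prior_mean, prior_sd, n_sims)
    z_crit = stats.norm.ppf(1 - alpha)
    se2 = sigma * np.sqrt(2.0 / n2)
    se3 = sigma * np.sqrt(2.0 / n3)
    # Estimated effects in each phase, given the same true delta.
    est2 = rng.normal(delta, se2)
    est3 = rng.normal(delta, se3)
    win2 = est2 / se2 > z_crit
    win3 = est3 / se3 > z_crit
    return win3[win2].mean(), win2.mean()

cond, ph2 = conditional_assurance(n2=100, n3=400,
                                  prior_mean=0.2, prior_sd=0.3)
print(f"P(Ph3 win | Ph2 win) = {cond:.3f}, P(Ph2 win) = {ph2:.3f}")
```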
But what about analogous probabilistic forecasting for the earlier stages of drug development? Scannell et al. (2022) point out that the lack of predictive validity of high-throughput screening tools and translational models is in part responsible for the inefficiency of biopharma R&D. One might think that simply correlating the outcomes of a translational model with binary outcomes like drug approvals/failures would give a good assessment of the model's predictive validity. But this is mistaken for a few conceptual reasons:
File-drawer problem: clinical trial failures are systematically under-reported, so we lack an unbiased estimate of even the historical successes/failures of a given category of molecule-indication combinations. Forget forecasting; we cannot even compute an unbiased prediction error for historical drug development. One major, but not the only, component of the file drawer here is right censoring.
Right-censoring problem: the sequential nature of drug development implies a sequential drop-off in the number of drugs tested from phase 1 → phase 2 → phase 3 trials. We therefore never get to assess the eventual clinical validity of all assays of therapeutic potency, toxicity, DMPK, disease models, putative surrogate endpoints, and so on without severe omission biases (see the simulation sketch after this list).
Underspecification of success: defining what constitutes success is a large outcome- and metric-hacking problem. There is a reason so much time and so many resources go into choosing clinical trial endpoints: so that the evaluation cannot be gamed. Likewise, choosing what counts as success for pharmaceutical forecasting at each stage of drug development is subject to external pressures.
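To see how severe the omission bias can be from attrition alone, here is a minimal simulation with made-up numbers: an assay whose scores correlate at 0.6 with the true clinical effect across all candidates looks far less predictive when its validity is estimated only among the candidates that survived an assay-based selection threshold.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

# Hypothetical setup: assay score and true clinical effect are
# correlated (rho = 0.6) across the full candidate population.
rho = 0.6
assay = rng.standard_normal(n)
clinical = rho * assay + np.sqrt(1 - rho**2) * rng.standard_normal(n)

# Sequential attrition: only candidates above an assay threshold
# advance, mimicking the phase 1 -> phase 2 -> phase 3 drop-off.
survivors = assay > 1.0  # roughly the top 16% advance

full = np.corrcoef(assay, clinical)[0, 1]
observed = np.corrcoef(assay[survivors], clinical[survivors])[0, 1]
print(f"true predictive validity:       {full:.2f}")
print(f"estimated among survivors only: {observed:.2f}")  # about half
```

Nothing about the assay changed between the two estimates; restricting the range of candidates that ever reach clinical testing is enough to make a genuinely informative assay look weak.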
One solution to the omission-bias problem is to move away from binary measures of success such as drug approval/failure and toward actual individual participant data from clinical trials, which offers much richer sources of variation for assessing the clinical validity of in-vitro/in-silico/in-vivo models and assays. This was the basis of the program idea I developed last year during my time as a BRAINS Fellow with SpecTech, also featured on Essential Technology's Gap Map database. But there are even more important reasons to use individual patient-level data to evaluate translational forecasts: reducing the risk of surrogate paradoxes and other clinical trial failures that amount to getting the therapeutic benefit vs. adverse-effect tradeoff wrong.
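To make the surrogate paradox concrete, here is a toy simulation (all effect sizes invented): the treatment improves the surrogate, the surrogate is genuinely associated with the outcome, yet a direct harmful effect of the treatment leaves patients worse off overall. Only patient-level outcome data reveals the reversal.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50_000
treat = rng.integers(0, 2, n)

# Hypothetical mechanism: treatment raises the surrogate, the
# surrogate benefits the outcome, but the treatment also has a
# direct harm that outweighs the surrogate-mediated benefit.
surrogate = 1.0 * treat + rng.standard_normal(n)
outcome = 0.5 * surrogate - 1.0 * treat + rng.standard_normal(n)

eff_s = surrogate[treat == 1].mean() - surrogate[treat == 0].mean()
eff_y = outcome[treat == 1].mean() - outcome[treat == 0].mean()
print(f"effect on surrogate: {eff_s:+.2f}")  # ~ +1.0, looks great
print(f"effect on outcome:   {eff_y:+.2f}")  # ~ -0.5, net harm
```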
In principle, translational analogs of clinical assurance for every tool and model used in drug development would be the principled way to keep track of their effectiveness. AI in drug development is not merely about evaluating therapeutic candidates during hit and lead optimization; it can also be used to severely test the translational and clinical validity of drug development tools: all in-vitro, in-silico, and in-vivo tools, individually and jointly, for their capacity to generate clinically valid forecasts. Even without raw preclinical and clinical trial datasets, however, there are a multitude of clever approaches to mitigating the biases induced by the sequential nature of clinical development.
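One such approach (an illustration, not necessarily what any company does) is inverse-probability-of-selection weighting: if the probability that a candidate advanced can be modeled, reweighting the survivors by the inverse of that probability recovers an unbiased estimate of an assay's predictive validity. A toy sketch, assuming the selection probabilities are known exactly; in practice they would have to be estimated.

```python
import numpy as np
from scipy.special import expit

rng = np.random.default_rng(4)
n = 200_000
rho = 0.6
assay = rng.standard_normal(n)
clinical = rho * assay + np.sqrt(1 - rho**2) * rng.standard_normal(n)

# Stochastic gatekeeping: better assay scores are more likely to
# advance, so clinical outcomes are observed for a biased subset.
p_advance = expit(2.0 * assay - 1.0)
advanced = rng.random(n) < p_advance

def weighted_corr(x, y, w):
    """Correlation with observation weights (Horvitz-Thompson style)."""
    mx, my = np.average(x, weights=w), np.average(y, weights=w)
    cov = np.average((x - mx) * (y - my), weights=w)
    return cov / np.sqrt(np.average((x - mx)**2, weights=w) *
                         np.average((y - my)**2, weights=w))

naive = np.corrcoef(assay[advanced], clinical[advanced])[0, 1]
ipw = weighted_corr(assay[advanced], clinical[advanced],
                    1.0 / p_advance[advanced])
print(f"naive survivor correlation: {naive:.2f}")  # attenuated
print(f"IPW-corrected correlation:  {ipw:.2f}")    # ~ 0.60 recovered
```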


