Updating Established Prediction Models with New Biomarkers
We consider the situation where there is a known, established regression model that can be used to predict an important outcome, Y, from a set of commonly available predictor variables X. There are many examples of this in the medical and epidemiologic literature. A new variable B is thought to be important and likely to enhance the prediction of Y. A modest-sized dataset of size n containing Y, X and B is available, and the challenge is to build a good model for [Y|X,B] that uses both the available dataset and the known model for [Y|X]. Proposals in the literature to achieve this include Bayesian approaches and constrained and empirical likelihood-based methods (Grill et al. 2015 J Clin Epi, Chatterjee et al. 2016 JASA, Cheng et al. 2018 Stat in Med). The constrained approach is to maximize the likelihood for [Y|X,B] subject to the constraints on the parameters implied by the known model for [Y|X]. We compare these approaches and illustrate them on a prostate cancer dataset. We also propose a synthetic data approach. The approach consists of creating m additional synthetic observations and then analyzing the combined dataset of size n+m to estimate the parameters of the model [Y|X,B]. The synthetic data are created by replicating X and then generating a synthetic value of Y from the known [Y|X] distribution. The combined dataset has missing values of B for the m synthetic observations, and is analyzed using methods that can handle missing data. One such analysis approach is multiple imputation; in special cases, exact methods can be used. In the special cases where [Y,X,B] is trivariate normal or where Y, X and B are all binary, we show that the synthetic data approach with very large m gives the same asymptotic variance for the parameters of the [Y|X,B] model as the constrained maximum likelihood estimation approach. This provides some theoretical justification for the synthetic data approach which, given its broad applicability, makes the approach very appealing.
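As a rough illustration of the synthetic data steps described above, the following Python sketch works through the simplest setting: a continuous Y, a single X and a single B, all with linear normal models. It is a minimal sketch under stated assumptions, not the authors' implementation. The function names (ols, synthetic_data_fit), the simulated data, and the use of a simplified multiple imputation that ignores uncertainty in the imputation model parameters are all assumptions made for illustration.

```python
import numpy as np


def ols(design, response):
    """Return least-squares coefficients and the residual standard deviation."""
    coef, *_ = np.linalg.lstsq(design, response, rcond=None)
    resid = response - design @ coef
    dof = max(len(response) - design.shape[1], 1)
    return coef, float(np.sqrt(resid @ resid / dof))


def synthetic_data_fit(x, b, y, known_coef, known_sd, m, n_imputations=20, seed=0):
    """Estimate (intercept, X coefficient, B coefficient) of a linear [Y|X,B] model.

    known_coef = (intercept, slope) and known_sd describe the established
    [Y|X] model, assumed here to be linear with normal errors.
    """
    rng = np.random.default_rng(seed)
    n = len(y)

    # Step 1: replicate the observed X values to create m synthetic rows.
    x_syn = np.resize(x, m)

    # Step 2: generate synthetic Y from the known [Y|X] distribution.
    y_syn = known_coef[0] + known_coef[1] * x_syn + rng.normal(0.0, known_sd, m)

    # Step 3: B is missing for the synthetic rows. Fit an imputation model
    # for B given (X, Y) on the n real observations (an assumed linear model).
    imp_coef, imp_sd = ols(np.column_stack([np.ones(n), x, y]), b)
    syn_design = np.column_stack([np.ones(m), x_syn, y_syn])

    x_all = np.concatenate([x, x_syn])
    y_all = np.concatenate([y, y_syn])

    # Step 4: impute B several times, fit [Y|X,B] on the combined n+m rows,
    # and average the estimates. This simplified version does not draw the
    # imputation parameters from their posterior, so it is not "proper" MI.
    estimates = []
    for _ in range(n_imputations):
        b_syn = syn_design @ imp_coef + rng.normal(0.0, imp_sd, m)
        b_all = np.concatenate([b, b_syn])
        design = np.column_stack([np.ones(n + m), x_all, b_all])
        coef, _ = ols(design, y_all)
        estimates.append(coef)
    return np.mean(estimates, axis=0)


# Simulated example (hypothetical data, not the prostate cancer dataset).
rng = np.random.default_rng(1)
n = 200
x = rng.normal(size=n)
b = 0.5 * x + rng.normal(size=n)
y = 1.0 + 2.0 * x + 1.5 * b + rng.normal(size=n)

# The [Y|X] model implied by the simulation: slope 2 + 1.5 * 0.5 = 2.75,
# residual variance 1.5**2 * 1 + 1 = 3.25.
known_coef = (1.0, 2.75)
known_sd = np.sqrt(1.5**2 + 1.0)

coef_hat = synthetic_data_fit(x, b, y, known_coef, known_sd, m=2000)
print(coef_hat)  # estimates of (intercept, X coefficient, B coefficient)
```

In this sketch the external information enters only through the synthetic Y values, so the same outline could in principle be reused with a different known [Y|X] model or a different imputation method; standard errors would need Rubin's rules or a similar combination step, which is omitted here.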
This is joint work with Wenting Cheng, Bhramar Mukherjee, Jason Estes and Tian Gu.
Jeremy M G Taylor, PhD, is the Pharmacia Professor of Biostatistics at the University of Michigan. He obtained a Bachelor’s degree in Mathematics and a Diploma in Statistics from Cambridge University and a PhD in Statistics from the University of California, Berkeley. He was a faculty member in the Department of Biostatistics and the Department of Radiation Oncology at UCLA from 1983 to 1998. He is currently a faculty member in the Department of Biostatistics, the Department of Radiation Oncology and the Department of Computational Medicine and Bioinformatics, and the Director of the Center for Cancer Biostatistics, at the University of Michigan. He is the winner of the Michael Fry Award from the Radiation Research Society and the Mortimer Spiegelman Award from the American Public Health Association. He is a former Chair of the Biometrics section of the American Statistical Association and a Fellow of the ASA. He is the former chair of the Biostatistical Methods and Research Design grant review committee for the National Institutes of Health. He was one of the coordinating editors of Biometrics from 2012 to 2014. He has over 350 publications and research interests in longitudinal and survival data, cure models, methods for missing data, causal inference, biomarkers, and surrogate and auxiliary variables. He has worked extensively in AIDS research but currently focusses mainly on cancer research. He has served as the dissertation chair for 30 PhD students in Biostatistics at UCLA and the University of Michigan.