A causal approach to handling missing data in multiple variables: Coming to grips with “missing not at random”
In studies with multiple incomplete variables, it is widely understood that if the data are missing at random (MAR) then unbiased estimation is possible with appropriate methods. While the need to assess the plausibility of this assumption has been emphasised, the practical difficulty of this task and the stringency of MAR when several variables are incomplete are rarely acknowledged. Further, while MAR is sufficient, it is not necessary: in a wide range of missing not at random (MNAR) scenarios, unbiased estimation of certain parameters is still possible. Recent developments in the computer science literature suggest that a causal reframing of missing data problems could be more natural for stating assumptions and could provide a more useful guide to the treatment of missing data, beyond the MAR-MNAR dichotomy. We build on that work to develop a causal approach to handling missing data in the context of a typical point-exposure epidemiological study with incomplete exposure, outcome and confounders. We use directed acyclic graphs to depict missingness assumptions, and take a counterfactual approach to determine the conditions required for non-parametric identification, or recoverability, of a target parameter. Using this novel approach, we identify various MNAR settings in which complete case analysis or multiple imputation can provide unbiased estimation and, conversely, settings in which they cannot and an expert-elicited delta-adjustment sensitivity analysis is necessary. In addition to providing a strategy for tackling the complexities of MNAR, this paradigm suggests novel approaches to estimation and sensitivity analysis, which we outline. We use numerical simulations and the Longitudinal Study of Australian Children for illustration.
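The claim that some parameters remain recoverable under MNAR can be made concrete with a small simulation. The sketch below is not the simulation study referred to above; it is a minimal illustration under assumed settings (a single exposure X, a linear outcome model, and a logistic missingness mechanism with arbitrarily chosen coefficients). The function name `simulate_complete_case_slope` and all parameter values are hypothetical. The exposure is missing with a probability that depends on its own value, so the data are MNAR, yet missingness is independent of the outcome given the exposure. In that setting the complete case regression slope should stay close to the truth, while the complete case mean of the exposure should not.

```python
import numpy as np


def simulate_complete_case_slope(n=200_000, true_slope=2.0, seed=0):
    """Simulate an MNAR exposure and return complete case summaries.

    All settings here are illustrative assumptions, not values from the paper.
    """
    rng = np.random.default_rng(seed)

    # Exposure X with true mean 0, and outcome Y from a linear model.
    x = rng.normal(size=n)
    y = 1.0 + true_slope * x + rng.normal(size=n)

    # MNAR mechanism: the chance that X is observed depends on X itself
    # (larger X is more likely to be missing), but not on Y given X.
    p_observed = 1.0 / (1.0 + np.exp(-(0.5 - 1.5 * x)))
    observed = rng.random(n) < p_observed

    # Complete case analysis: least-squares fit of Y on X in observed rows.
    # np.polyfit returns coefficients from highest degree down, so [0] is the slope.
    slope_cc = np.polyfit(x[observed], y[observed], 1)[0]

    # Complete case mean of X, for contrast with the true mean of 0.
    mean_x_cc = x[observed].mean()

    return slope_cc, mean_x_cc, observed.mean()


if __name__ == "__main__":
    slope_cc, mean_x_cc, frac_observed = simulate_complete_case_slope()
    print(f"fraction of complete cases: {frac_observed:.3f}")
    print(f"complete case slope (truth 2.0): {slope_cc:.3f}")
    print(f"complete case mean of X (truth 0.0): {mean_x_cc:.3f}")
```

The contrast between the two summaries is the point of the example: whether complete case analysis is adequate depends on the target parameter and on the structure of the missingness mechanism, not only on the MAR or MNAR label.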