Missing Values Analysis & Data Imputation

Resources used for this presentation: Many of the ideas in this presentation come from David Garson (resource page: http://www.statisticalassociates.com/)
and David Howell. The UCLA site also provided resources: http://www.ats.ucla.edu/stat/spss/modules/missing.htm

Why do we have missing data?

There could be many reasons, but we will focus only on item non-response (i.e., some participants do not respond to all items).

Listwise Deletion (LD)

SPSS & other packages use listwise deletion as default.

When data are MCAR, LD provides an unbiased estimate, but there are still problems (e.g., lower power).

If less than 5% of data are missing from a large dataset, most researchers just use LD -- but there still may be problems.
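
A small illustration of why LD can cost more than the missing-data percentage suggests. This is only a sketch in Python with NumPy on simulated data; the sample size, the number of variables, and the 5% rate are assumptions made for the example. With 4 variables and 5% of cells missing independently, about 1 - 0.95^4, or roughly 19%, of cases are expected to have at least one missing value and be dropped.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 cases, 4 variables, cells set to missing completely at random.
n_cases, n_vars = 200, 4
data = rng.normal(loc=50, scale=10, size=(n_cases, n_vars))
mask = rng.random(size=data.shape) < 0.05   # about 5% of cells become missing
data[mask] = np.nan

# Listwise deletion keeps only the rows with no missing value on any variable.
complete_rows = ~np.isnan(data).any(axis=1)
complete_data = data[complete_rows]

cell_missing_rate = np.isnan(data).mean()
case_loss_rate = 1 - complete_rows.mean()

print(f"cells missing: {cell_missing_rate:.1%}")
print(f"cases dropped by listwise deletion: {case_loss_rate:.1%}")
```

The share of cases lost can never be smaller than the share of cells missing, and it grows with the number of variables in the model.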

Today, most researchers do not recommend LD. Instead we impute, but to do so we have to assume the data are missing at random (MAR).

Modified from the work of David Howell (https://www.uvm.edu/~dhowell/StatPages/Missing_Data/MissingDataSPSS.html)

Missing Completely at Random (MCAR)

The variable names are, in order, SexP (sex parent), DeptP (parent's depression T score), AnxtP (parent's anxiety T score), GSItP (parent's global symptom index T score), DeptS, AnxtS, GSItS (same variables for spouse), SexChild, Totbpt (total behavior problem T score for child). 

Two ways to look at MCAR:

MCAR can be examined by dividing respondents into those with and without missing data, then using t-tests of mean differences on income, age, gender, and other key variables to establish that the two groups do not differ significantly on any variable in the model, including the dependent variable. If missing data are MCAR in a sufficiently large sample, cases with missing values may be dropped listwise from the analysis without biasing the estimates.
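
The group comparison described above can be sketched in Python. The data are simulated and the variable names (age, income) are placeholders; SciPy's ttest_ind is used with equal_var=False (Welch's t-test), which is one reasonable choice rather than the only one. In practice the comparison would be repeated for every key variable.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical data: age is fully observed, income has missing values.
n = 300
age = rng.normal(40, 10, size=n)
income = rng.normal(50_000, 12_000, size=n)
income[rng.random(n) < 0.2] = np.nan   # simulated MCAR: missingness ignores age

# Split respondents by whether income is missing.
is_missing = np.isnan(income)
age_missing = age[is_missing]
age_observed = age[~is_missing]

# Welch's t-test (no equal-variance assumption) on the fully observed variable.
result = stats.ttest_ind(age_missing, age_observed, equal_var=False)
t_stat, p_value = result.statistic, result.pvalue
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```

A non-significant difference on every key variable is consistent with MCAR; it does not prove it.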

Little's MCAR test -- you do not want statistical significance. The SPSS Missing Values Analysis (MVA) option supports Little's MCAR test, a chi-square test of the null hypothesis that data are missing completely at random. In SPSS, select Analyze > Missing Value Analysis and check EM as the estimation method. Little's test is printed below the EM Means, EM Covariances, and EM Correlations tables (and has the same value in each). DO NOT WANT SIGNIFICANCE: a significant result means the data are not MCAR.
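
For reference, the statistic behind Little's test as it is commonly stated. The notation is introduced here for illustration and is not taken from the SPSS output: J is the number of distinct missing-data patterns, n_j the number of cases in pattern j, p_j the number of variables observed in pattern j, and p the total number of variables. The means and covariances with hats are the EM (maximum likelihood) estimates, restricted to the variables observed in pattern j.

```latex
d^2 = \sum_{j=1}^{J} n_j \,
      \left( \bar{y}_{j} - \hat{\mu}_{j} \right)^{\top}
      \hat{\Sigma}_{j}^{-1}
      \left( \bar{y}_{j} - \hat{\mu}_{j} \right),
\qquad
d^2 \;\sim\; \chi^2_{\,\sum_{j=1}^{J} p_j \; - \; p}
\quad \text{(approximately, under MCAR)}
```

The statistic grows when the observed means within missing-data patterns drift away from the overall estimated means, which is why a large (significant) value counts as evidence against MCAR.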

Missing at Random (MAR) - From David Howell

The name is misleading -- under MAR the pattern of missingness is systematic rather than random. Two conditions:

"MAR is a spectrum, depending on how much of missingness can be explained by other observed variables. A pure MAR example would be if there were test scores, test1 and test2, representing scores on two sequential tests. If students scoring 90 or greater on test1 were excused from test2, and if there were no other dropouts, missingness on test2 would be completely determined by the test1 variable. At the other end of the spectrum, in a large dataset it might happen that missingness on a given variable was significantly related to another observed variable (hence not MCAR) but the relation was so trivial in effect size that missingness could not be predicted from that variable. The point on this spectrum where prediction ceases to be useful is the point separating MAR from MNAR."

Whether data are missing at random (MAR) cannot be determined with any simple test. Ultimately, proving conclusively that data are MAR would require showing the values which are missing are distributed randomly but that is impossible as missing values are, of course, unknown. For this reason, Schafer & Graham (2002: 153) state, “In general, there is no way to test whether MAR holds in a data set, except by obtaining follow-up data from non-respondents or by imposing an unverifiable model.” “Testing for MAR” instead refers to exploratory tests to see if data are consistent with what is implied by “missing at random” and with imputing MAR data. In checking the effects of missingness, some exploratory tests require creation of a set of dummy variables for missingness for each variable of interest, coded 0 = not missing on the given variable, 1 = missing. Note that the researcher may wish to explore whether auxiliary variables not in the original analytic model may also predict missingness, and if so, add them prior to imputation.
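
The missingness dummy described above can be sketched as follows. The data are simulated, the variable names (x1, x2, y) are placeholders, and a simple correlation stands in for whatever exploratory test the researcher prefers (for example, a logistic regression of the dummy on the observed variables).

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data: x1 and x2 are fully observed; y is missing more often when x1 is high.
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 2.0 + 0.5 * x1 + rng.normal(size=n)
prob_missing = 1 / (1 + np.exp(-(x1 - 1.0)))   # logistic function of x1 only
y[rng.random(n) < prob_missing] = np.nan

# Missingness dummy: 0 = not missing on y, 1 = missing on y.
y_missing = np.isnan(y).astype(int)

# Correlation of the dummy with each observed variable (point-biserial correlation).
correlations = {
    name: np.corrcoef(y_missing, x)[0, 1] for name, x in {"x1": x1, "x2": x2}.items()
}
for name, r in correlations.items():
    print(f"corr(missing_y, {name}) = {r:+.2f}")
```

Here the dummy should relate clearly to x1 and only trivially to x2, which is the kind of pattern that argues for keeping x1 in the imputation model.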

Example of MNAR (Retrieved from http://core.ecu.edu/psyc/wuenschk/MV/Screening/Screen.docx)

• Some cases are missing scores on our variable of interest, Y.
• Suppose that Y is the salary of faculty members.
• Missingness on Y is related to the actual value of Y.
• Of course, we do not know that, since we do not know the values of Y for cases with missing data.
• For example, faculty with higher salaries may be more reluctant to provide their income.
• If we estimate mean faculty salary with the data we do have on hand it will be a biased estimate.
• There is some mechanism which is causing missingness, but we do not know what it is.
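
The bias in this salary scenario, where missingness depends on the value of Y itself, can be shown with a short simulation. Everything here is assumed for illustration: the salary distribution, the sample size, and the logistic form of the non-response mechanism.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical salaries (in dollars) for 10,000 faculty members.
n = 10_000
salary = rng.normal(loc=80_000, scale=15_000, size=n)

# Assumed mechanism: the higher the salary, the more likely it goes unreported.
z = (salary - salary.mean()) / salary.std()
prob_missing = 1 / (1 + np.exp(-z))
reported = rng.random(n) >= prob_missing

true_mean = salary.mean()
observed_mean = salary[reported].mean()
print(f"true mean:     {true_mean:,.0f}")
print(f"observed mean: {observed_mean:,.0f}")   # lower, because high earners drop out
```

In real data only the observed mean is available, so the size of the bias cannot be checked this way.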

Missing Not at Random (MNAR)

Missing not at random (MNAR), also called non-ignorable missingness, is the most problematic form. It exists when missing values are neither MCAR nor MAR. This happens when missingness depends at least in part on unobserved variables (which is why observed variables fail to predict missingness, making data not MAR). Under MNAR conditions, variables in the dataset are inadequate predictors of missingness because the variable with missing cases is insufficiently correlated with other variables in the dataset, undermining the effectiveness of the usual imputation methods, including multiple imputation (MI).

One approach to non-ignorable missingness is to impute values based on data otherwise external to the research design, for instance estimating race from Census block data associated with the respondent's address. However, although the missingness cannot be ignored, there is no well-accepted method of dealing with it. See Muthén, Asparouhov, Hunter, & Leuchter (2011) on analyzing MNAR data using the Mplus statistical software.

Example of MAR

• Missingness on Y is not related to the true value of Y itself or is related to Y only through its relationship with another variable or set of variables, and we have scores on that other variable or variables for all cases.
• For example, suppose that the higher a professor’s academic rank the less likely he is to provide his salary. Faculty with higher ranks have higher salaries. We know the academic rank of each respondent.
• We shall assume that within each rank whether Y is missing or not is random – of course, this may not be true, that is, within each rank the missingness of Y may be related to the true value of Y.
• Again, if we use these data to estimate mean faculty salary, the estimate will be biased.
• However, within conditional distributions the estimates will be unbiased – that is, we can estimate without bias the mean salary of lecturers, assistant professors, associate professors, and full professors.
• We might get an unbiased estimate of the overall mean by calculating a weighted mean of the conditional means -- GM = sum(p_i * M_i), where GM is the estimated grand mean, p_i is, for each rank, the proportion of faculty at that rank, and M_i is the estimated mean for each rank.
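
The weighted mean of conditional means can be checked on simulated data. The rank proportions, salary levels, and non-response rates below are invented for the example; the only structural assumption is the one stated in the bullets, that within a rank missingness does not depend on salary.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical ranks, coded 0..3; rank is known for every respondent.
rank_names = ["lecturer", "assistant professor", "associate professor", "full professor"]
rank_share = np.array([0.2, 0.3, 0.3, 0.2])
rank_mean = np.array([50_000, 65_000, 80_000, 110_000])
rank_p_missing = np.array([0.1, 0.2, 0.4, 0.6])   # higher rank, less likely to report

n = 20_000
rank = rng.choice(4, size=n, p=rank_share)
salary = rank_mean[rank] + rng.normal(0, 8_000, size=n)
# Missingness depends on rank only, not on salary within a rank.
reported = rng.random(n) >= rank_p_missing[rank]

true_mean = salary.mean()
naive_mean = salary[reported].mean()

# GM = sum(p_i * M_i): p_i uses everyone (rank is always known), M_i uses reporters only.
weighted_mean = sum(
    (rank == i).mean() * salary[reported & (rank == i)].mean()
    for i in range(len(rank_names))
)

print(f"true mean:     {true_mean:,.0f}")
print(f"naive mean:    {naive_mean:,.0f}")
print(f"weighted mean: {weighted_mean:,.0f}")
```

The naive mean of the reported salaries is pulled down because full professors are under-represented among reporters, while the weighted mean lands close to the true value.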


Multiple Imputation

Multiple imputation is the currently prevailing method of estimating missing values. Though it may be implemented by various methods, by default in SPSS, SAS, and Stata, it uses Markov Chain Monte Carlo (MCMC) simulation methods, which are probabilistic in nature. Multiple imputation involves three steps:

1. Creating multiple datasets in which missing values have been imputed.
2. Running the statistical procedure, such as linear or logistic regression, on each imputed dataset.
3. Pooling the estimates from the multiple imputed datasets into a single set of results.
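
A minimal sketch of multiple imputation with pooling by Rubin's rules, in Python with NumPy. This does not reproduce the MCMC routines of SPSS, SAS, or Stata: a bootstrapped stochastic regression imputation is swapped in because it is short enough to read. The data, the variable names, and the choice of 20 imputations are all assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical data: x is fully observed; y is missing more often when x is high.
n = 400
x = rng.normal(size=n)
y = 10 + 2 * x + rng.normal(size=n)
missing = rng.random(n) < 1 / (1 + np.exp(-x))
y_obs = np.where(missing, np.nan, y)

m = 20                                   # number of imputed datasets
estimates, variances = [], []
obs_idx = np.flatnonzero(~missing)

for _ in range(m):
    # Impute: refit the regression on a bootstrap sample of complete cases so the
    # imputations reflect uncertainty in the regression parameters, then add noise.
    boot = rng.choice(obs_idx, size=obs_idx.size, replace=True)
    slope, intercept = np.polyfit(x[boot], y_obs[boot], deg=1)
    resid_sd = np.std(y_obs[boot] - (intercept + slope * x[boot]), ddof=2)
    y_imp = y_obs.copy()
    y_imp[missing] = (intercept + slope * x[missing]
                      + rng.normal(0, resid_sd, size=missing.sum()))

    # Analyze: here the "analysis" is simply the mean of y and its sampling variance.
    estimates.append(y_imp.mean())
    variances.append(y_imp.var(ddof=1) / n)

# Pool with Rubin's rules.
q_bar = np.mean(estimates)               # pooled point estimate
within = np.mean(variances)              # average within-imputation variance
between = np.var(estimates, ddof=1)      # between-imputation variance
total_var = within + (1 + 1 / m) * between
pooled_se = np.sqrt(total_var)

complete_case_mean = np.nanmean(y_obs)
print(f"complete-case mean of y: {complete_case_mean:.2f}")
print(f"pooled MI estimate:      {q_bar:.2f} (SE {pooled_se:.2f})")
```

The pooled standard error is larger than the within-imputation standard error alone because the between-imputation variance carries the extra uncertainty due to the missing values.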

Van Buuren (2012: 27) stated, “Nowadays multiple imputation is almost universally accepted, and in fact acts as the benchmark against which newer methods are being compared.”


MCAR: Data are MCAR if missingness on any variable in the analytic model is unrelated to the values of any variable in the model, including the variable itself. Little’s MCAR test should be non-significant. Listwise deletion is appropriate provided the number of deleted cases is not large.

MAR: Missingness on any variable in the analytic model may be explained solely using observed variables in the model. Unobserved variables do not explain missingness in any variable in the model. Multiple imputation (MI) is appropriate if the number of missing values is not high and if missingness may be predicted from observed variables. (There is no agreed cutoff for how high is too high but at some point the “best guess” reflected by MI ceases to be useful.)

MNAR: Missingness is not MCAR but observed variables in the model cannot well explain missingness. There is no well-accepted remedy for MNAR.