Model Covariates


We have identified country- and time-specific data from various sources to further inform our model predictions. These data, which we call covariates, act as important intelligence to supplement our individual-level dietary intake data, particularly in countries for which these inputs are limited. Below is a summary of the process for identifying and incorporating these data into our estimates.

Covariate Identification and Selection

We considered many types of country-year-specific data as potential covariates for the Bayesian hierarchical prediction model, including food availability, nutrient availability, food sales data, economic indicators, and geographical data.

We consulted experts and conducted comprehensive searches of publicly available databases to identify covariates. We identified over 800 covariates, including:

  • FAO food balance sheets data, 1980-2018
  • Harvard Global Expanded Nutrient Supply (GENuS) data, 1980-2011
  • Principal component analysis of FAO and GENuS data, 1980-2018
  • Industry sales data on fat consumption, 1998-2018
  • World Bank GDP, 1980-2018
  • Precipitation, 1982-2014
  • Unemployment rate, 1991-2015
  • Land area
  • Education years, 1980-2010
  • Latitude
  • Gini coefficient, 1980-2015
  • Coastline ratio
  • Poverty rate, 1991-2015

Covariate Imputation

Missing data for these covariates were imputed.

  • If data were missing for some (but not all) years of a given country, we used linear interpolation to fill in those years.
  • For covariate data sources that ended before 2018, we extended the series through 2018 using a moving average of the three most recent values.
  • For countries missing a covariate entirely, we assigned the region-level mean (per Global Burden of Disease regional assignments).
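The three imputation rules above can be sketched as follows. This is a minimal illustration rather than the GDD implementation. The function names are hypothetical, missing years are represented as None, and the moving average is assumed to roll forward so that each imputed year feeds into the next (the text does not say whether this is the case).

```python
from statistics import mean


def interpolate_gaps(series):
    """Linearly interpolate None values that lie between two observed years."""
    filled = list(series)
    known = [i for i, v in enumerate(filled) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (filled[right] - filled[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = filled[left] + step * (i - left)
    return filled


def extend_with_moving_average(series, window=3):
    """Fill trailing None values with the mean of the `window` preceding values.

    Assumption: imputed values are reused when imputing later years.
    """
    filled = list(series)
    for i, v in enumerate(filled):
        preceding = filled[max(i - window, 0):i]
        if v is None and len(preceding) == window and None not in preceding:
            filled[i] = mean(preceding)
    return filled


def region_mean(country_series):
    """Year-by-year mean across the complete series of countries in one region."""
    return [mean(values) for values in zip(*country_series)]


# One country's covariate: a gap in year 2 and no data for the last two years.
observed = [1.0, None, 3.0, 4.0, None, None]
completed = extend_with_moving_average(interpolate_gaps(observed))
# completed is [1.0, 2.0, 3.0, 4.0, 3.0, 3.33...]
```

A country with no data at all for a covariate would instead receive the output of region_mean computed over the other countries in its region.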

To assess the validity of the imputations, we re-imputed observed (non-missing) values using the same approach and visually compared observed versus imputed values in scatter plots.

Covariate Truncation

The GDD prediction model operates on the natural log scale, so the covariate data are also log-transformed during modeling. Very small values become large negative numbers on the log scale and can exert a disproportionate effect, particularly for covariates that span a broad range of values. To limit this effect, we truncated covariate data on the non-transformed scale using the following rules. Doing so also reduces the likelihood of implausible final estimates.

For covariates with a 95th percentile value:

  1. Greater than 3.5: Truncate values less than 0.5 to 0.5
  2. Between 1 and 3.5: Truncate values less than 0.1 to 0.1
  3. Less than 1: No truncation
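As a rough illustration of these rules, the sketch below applies the floor implied by a covariate's 95th percentile. The function names are hypothetical, the percentile is computed by linear interpolation between ranks (the source does not specify a method), and the boundary values 1 and 3.5 are assumed to fall in the middle category. For intuition, the natural log of 0.001 is about -6.9, whereas the natural log of 0.5 is about -0.69.

```python
def percentile(values, q):
    """Percentile (q in 0-100) by linear interpolation between closest ranks."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def truncate_covariate(values):
    """Raise very small values to a floor chosen from the 95th percentile."""
    p95 = percentile(values, 95)
    if p95 > 3.5:
        floor = 0.5
    elif p95 >= 1:  # assumption: "between 1 and 3.5" includes both endpoints
        floor = 0.1
    else:
        return list(values)  # 95th percentile below 1: no truncation
    return [max(v, floor) for v in values]


truncate_covariate([0.01, 2.0, 5.0, 10.0])  # -> [0.5, 2.0, 5.0, 10.0]
```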

Covariate Testing

We conducted principal component analysis (PCA) using the 'princomp' function in R separately for: 1) 23 grouped FAO food balance sheet (FBS) foods, beverages, and energy; 2) 142 GENuS foods and beverages; and 3) 19 GENuS nutrients and energy. The first four components from each PCA were considered for inclusion.

For each dietary factor, we calculated the correlations between covariates and original survey-level stratified mean dietary intakes, and we selected up to 10 covariates for model inclusion, favoring those with the highest correlations, a mix of food/nutrients and other covariates, and sensible links to the dietary factor.
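The correlation screen can be sketched as below. This is only an approximation of the step described: it ranks covariates by the absolute value of the Pearson correlation (an assumption, since the source does not name the correlation measure), and it does not capture the judgment calls about covariate mix and plausible links to the dietary factor. The function names and toy data are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def rank_covariates(intake, covariates, max_keep=10):
    """Return up to `max_keep` covariate names, strongest correlation first."""
    scores = {name: pearson(values, intake) for name, values in covariates.items()}
    ranked = sorted(scores, key=lambda name: abs(scores[name]), reverse=True)
    return ranked[:max_keep]


# Toy survey-level mean intakes and three candidate covariates.
intake = [1.0, 2.0, 3.0, 4.0]
candidates = {
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [4.0, 3.0, 2.0, 2.0],
    "c": [1.0, 3.0, 2.0, 4.0],
}
top_two = rank_covariates(intake, candidates, max_keep=2)  # -> ["a", "b"]
```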

Each of the covariates identified in the correlation stage (maximum 10 covariates) and the four PCA components was then entered into a stepwise regression (entry point of p<0.299 and exit point of p<0.30) to test for inclusion in GDD models. These stepwise regressions resulted in three nested versions of the GDD model per dietary factor:

  1. Base model: Closest diet factor proxy from FAO or GENuS (1-2 covariates per model)
  2. Restricted model: All covariates with p<0.1 from the results of the stepwise regression plus base model covariate(s).
  3. Inclusive model: All covariates from the results of the stepwise regression plus base model covariate(s).

Five-fold cross-validation of the three versions of the GDD model has been completed for all dietary factors. In five-fold cross-validation, the data are split into five segments: four make up the training dataset, and the remaining one serves as the testing dataset. The process is repeated five times so that each segment serves once as the testing data. The models' expected log predictive densities (ELPD) are then compared to identify the model with the best predictive performance.
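The cross-validation loop can be sketched as follows. This is a simplified stand-in, not the GDD procedure: the helper names are hypothetical, the toy model is an intercept-only normal model rather than the Bayesian hierarchical model, and the log predictive density is evaluated at point estimates instead of being averaged over posterior draws.

```python
import math
import random


def k_fold_indices(n, k=5, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k roughly equal folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def normal_logpdf(y, mu, sigma):
    """Log density of a normal distribution with mean mu and SD sigma."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (y - mu) ** 2 / (2 * sigma ** 2)


def cv_elpd(y, fit, k=5, seed=0):
    """Sum the held-out log predictive densities over all k folds.

    `fit(train_indices)` must return a function mapping an index to (mu, sigma).
    """
    total = 0.0
    for test in k_fold_indices(len(y), k, seed):
        held_out = set(test)
        train = [i for i in range(len(y)) if i not in held_out]
        predict = fit(train)
        total += sum(normal_logpdf(y[i], *predict(i)) for i in test)
    return total


def mean_only_model(y):
    """Toy model: predict every observation with the training mean and SD."""
    def fit(train):
        mu = sum(y[i] for i in train) / len(train)
        var = sum((y[i] - mu) ** 2 for i in train) / (len(train) - 1)
        return lambda i: (mu, math.sqrt(var))
    return fit


y = [float(v) for v in range(1, 11)]
elpd = cv_elpd(y, mean_only_model(y))
```

Computing cv_elpd for each candidate model on the same folds (same seed) gives comparable values; under this sketch, the model with the highest ELPD would be preferred.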

Final List of Covariates

The final list of selected covariates for each dietary factor will be available in the near future.