NHANES Tutorials

This series of tutorials are intended to introduce the NHANES Epiconnector project to data analysts and students working with NHANES data.

NHANES is an ongoing Health and Nutrition survey conducted by the United States CDC. Most individual-level data from the survey are publicly available, making it a rich resource both for epidemiology research and education.

NHANES is a complex survey that uses multi-stage sampling and intentionally oversamples certain population subgroups. This makes analysis of NHANES data somewhat nontrivial: Naive statistical estimates such as the sample mean and regression coefficients will be biased for their population counterparts. Even though this can be resolved by using the selection weights provided by CDC, naive estimators of the standard error will still be biased unless one takes into account clusters and PSUs used in the sampling design.

Another common issue one needs to account for is related to questionnaire administration and data missingness: Question B may be skipped depending on the answer to question A, but this information is not recorded in the response to question B, except to say that the response is missing. To distinguish such missingness from genuine missingness, one needs to understand the flow of the questionnaire.

Another set of challenges arise when we wish to combine data across NHANES cycles. Combining data across cycles gives a larger sample size, which naturally improves inference. However, not all information is collected in all cycles, and sometimes the specific details of what information is collected and where it is available changes.

The main challenge for analysts and students new to NHANES data is navigating how the data and documentation can be accessed. The CDC distributes these from their website, using the SAS transport (XPT) format for coded data, and HTML files for documentation. These are both non-standard formats for most modern data analysis tools, and some preliminary work is required to process these inputs and extract usable data.

The tutorials below try to address these issues systematically, using an R based workflow. They are meant for users comfortable with R but new to NHANES data analysis, as well as those interested in alternative workflows.

Tutorials

  • Introduction — A brief introduction to the NHANES project and how it makes data and documentation available to the public.

  • Accessing NHANES Data Locally — This article describes R based tools that make it possible to access NHANES data locally, reducing the overhead of downloads.

  • Distribution of BMI over time — Although NHANES is not a longitudinal study, it represents data collected over time. This analysis explores whether NHANES data can contain discernible hints about population characteristics that have changed substantially over the study period of continuous NHANES, using BMI as an example. It illustrates the use of appropriate survey tools needed to address the complex design of the NHANES survey, and describes how data from multiple tables can be merged.

  • Skipping of Questions in NHANES — This article discusses an important aspect of the NHANES questionnaires: branching or skipping, where a certain response to one question may lead subsequent questions to be skipped. We describe tools to identify such cases, and discuss why it is important to take skipping information into account. The analysis used in the article illustrates how to efficiently combine data from multiple cycles.