The National Health and Nutrition Examination Survey (NHANES) is a program of the National Center for Health Statistics (NCHS), which is part of the US Centers for Disease Control and Prevention (CDC). It measures the health and nutritional status of adults and children in the United States in a series of surveys that combine interviews and physical examinations.
Although the program began in the early 1960s, its structure was changed in the 1990s. Since 1999, the program has been conducted on an ongoing basis, where a nationally representative sample of about 5,000 persons (across 15 counties) is examined each year, with public-use data released in two-year cycles. This phase of the program is referred to as continuous NHANES.
The NHANES interview includes demographic, socioeconomic, dietary, and health-related questions. The examination component consists of medical, dental, and physiological measurements, as well as laboratory tests administered by highly trained medical personnel. Although the details of the responses recorded vary from cycle to cycle, there is a substantial amount of consistency, making it possible to compare data across cycles. Sampling weights are provided along with demographic details for each participant; see the NHANES analytic guidelines for details. NHANES is a rich resource that has been used extensively in epidemiological research.
Public-use data: web resources
NHANES makes a large volume of data available for download. Rather than a single download, the data for each cycle are split into a number of separate SAS transport files, referred to as “data files” in the NHANES ecosystem. Each such data file or table contains records for several related variables. A comprehensive manifest of downloadable data files is available here, along with subsets broken up into the following “components”: Demographics, Dietary, Examination, Laboratory, and Questionnaire.
For each data table listed in these manifests, a link to a “Doc File” (an HTML webpage describing the data file) and a link to a SAS transport file are provided. An additional list of limited access data files is documented here, but download links for the corresponding data files are not available.
An additional manifest of variables is separately available for each component, and gives more detailed information about both the variables and the data files they are recorded in, although these tables do not provide download links directly: Demographics, Dietary, Examination, Laboratory, Questionnaire.
For reasons not specified, NHANES releases data files as SAS transport files, and provides links to proprietary Windows-only software that can supposedly be used to convert these files to CSV files.
Public-use data: R resources
One of the goals of the Epiconnector project is to provide and document an alternative access path to NHANES data and documentation via the R ecosystem. It builds on the nhanesA R package, along with utilities such as SQL databases and Docker, to enable efficient and reproducible analyses of NHANES data.
The nhanesA package
The nhanesA package provides a user-friendly interface to download and process data and documentation files from the NHANES website. To use the utilities in this package, we first need to know a few more details about how NHANES data and documentation are structured.
Each available data file, which we henceforth call an NHANES table, can be identified uniquely by a name. Generally speaking, each public-use table has a corresponding data file (a SAS transport file, with extension xpt) and a corresponding documentation file (a webpage, with extension htm). The URLs from which these files can be downloaded can usually be predicted from the table name, and the cycle it belongs to. Cycles are typically of 2-year duration, starting from 1999-2000.
Although there are exceptions, a table that is available for one cycle will typically be available for other cycles as well, with a suffix appended to the name of the table indicating the cycle. To make these details concrete, let us use the nhanesManifest() function in the nhanesA package to download the list of available tables and look at the names and URLs for the DEMO data files, which contain demographic information and sampling weights for each study participant.
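The listing that follows could be produced by a call along these lines. This is a sketch rather than the exact code used: it assumes the public manifest is requested with nhanesManifest("public") and that the DEMO tables are picked out by their name prefix. The column names are those visible in the output. Network access to the CDC website is required.

```r
library(nhanesA)

# Download the manifest of public-use tables (one row per table).
manifest <- nhanesManifest("public")

# Keep the tables whose name starts with "DEMO" and order them by table name.
demo <- manifest[startsWith(manifest$Table, "DEMO"), ]
demo <- demo[order(demo$Table), ]
demo[, c("Table", "DocURL", "DataURL", "Years", "Date.Published")]
```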
    Table   DocURL                                              DataURL                                             Years      Date.Published
370 DEMO    /Nchs/Data/Nhanes/Public/1999/DataFiles/DEMO.htm    /Nchs/Data/Nhanes/Public/1999/DataFiles/DEMO.xpt    1999-2000  Updated September 2009
369 DEMO_B  /Nchs/Data/Nhanes/Public/2001/DataFiles/DEMO_B.htm  /Nchs/Data/Nhanes/Public/2001/DataFiles/DEMO_B.xpt  2001-2002  Updated September 2009
368 DEMO_C  /Nchs/Data/Nhanes/Public/2003/DataFiles/DEMO_C.htm  /Nchs/Data/Nhanes/Public/2003/DataFiles/DEMO_C.xpt  2003-2004  Updated September 2009
366 DEMO_D  /Nchs/Data/Nhanes/Public/2005/DataFiles/DEMO_D.htm  /Nchs/Data/Nhanes/Public/2005/DataFiles/DEMO_D.xpt  2005-2006  Updated September 2009
367 DEMO_E  /Nchs/Data/Nhanes/Public/2007/DataFiles/DEMO_E.htm  /Nchs/Data/Nhanes/Public/2007/DataFiles/DEMO_E.xpt  2007-2008  September 2009
371 DEMO_F  /Nchs/Data/Nhanes/Public/2009/DataFiles/DEMO_F.htm  /Nchs/Data/Nhanes/Public/2009/DataFiles/DEMO_F.xpt  2009-2010  September 2011
372 DEMO_G  /Nchs/Data/Nhanes/Public/2011/DataFiles/DEMO_G.htm  /Nchs/Data/Nhanes/Public/2011/DataFiles/DEMO_G.xpt  2011-2012  Updated January 2015
373 DEMO_H  /Nchs/Data/Nhanes/Public/2013/DataFiles/DEMO_H.htm  /Nchs/Data/Nhanes/Public/2013/DataFiles/DEMO_H.xpt  2013-2014  October 2015
374 DEMO_I  /Nchs/Data/Nhanes/Public/2015/DataFiles/DEMO_I.htm  /Nchs/Data/Nhanes/Public/2015/DataFiles/DEMO_I.xpt  2015-2016  September 2017
375 DEMO_J  /Nchs/Data/Nhanes/Public/2017/DataFiles/DEMO_J.htm  /Nchs/Data/Nhanes/Public/2017/DataFiles/DEMO_J.xpt  2017-2018  February 2020
377 DEMO_L  /Nchs/Data/Nhanes/Public/2021/DataFiles/DEMO_L.htm  /Nchs/Data/Nhanes/Public/2021/DataFiles/DEMO_L.xpt  2021-2023  September 2024
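The DEMO listing suggests how a URL follows from the table name and its cycle. The sketch below illustrates that pattern and nothing more. The helpers cycle_start() and guess_url() are hypothetical names introduced here, not part of nhanesA, and the suffix rule (no suffix for 1999, _B for 2001, and so on in two-year steps) is inferred only from the rows shown above, so it will not hold for tables that are exceptions to the naming convention.

```r
# Hypothetical helper: guess the cycle start year from the table-name suffix,
# following the pattern in the DEMO listing (no suffix = 1999, _B = 2001, ...).
cycle_start <- function(table) {
  has_suffix <- grepl("_[A-Z]$", table)
  suffix <- sub("^.*_([A-Z])$", "\\1", table)
  idx <- ifelse(has_suffix, match(suffix, LETTERS), 1L)
  1999L + 2L * (idx - 1L)
}

# Hypothetical helper: build the relative URL of the data (xpt) or
# documentation (htm) file, using the path layout seen in the manifest.
guess_url <- function(table, ext = c("xpt", "htm")) {
  ext <- match.arg(ext)
  sprintf("/Nchs/Data/Nhanes/Public/%d/DataFiles/%s.%s",
          cycle_start(table), table, ext)
}

guess_url(c("DEMO", "DEMO_J", "DEMO_L"))
guess_url("DEMO_L", "htm")
```

In practice the manifest itself is the more reliable source for these URLs, since it records the actual location of every table.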
The nhanesA package allows both data and documentation files to be accessed either by specifying their URL explicitly or simply by using the table name, in which case the relevant URL is constructed from it.
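Both access paths can be sketched as follows, using the nhanes() and nhanesFromURL() functions in nhanesA. The choice of DEMO_J is arbitrary, the URL is the DataURL value from the listing above, and translated = FALSE is used here on the assumption that the raw numeric codes are wanted. Passing a relative URL in the form given by the manifest is also an assumption about the installed version of the package. Network access is required.

```r
library(nhanesA)

# Access by table name: nhanesA works out the URL itself.
demo_by_name <- nhanes("DEMO_J", translated = FALSE)

# Access by explicit URL, copied from the DataURL column of the manifest.
demo_by_url <- nhanesFromURL(
  "/Nchs/Data/Nhanes/Public/2017/DataFiles/DEMO_J.xpt",
  translated = FALSE
)

# Both results are ordinary data frames.
dim(demo_by_name)
dim(demo_by_url)
```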
The data in these files appear as numeric codes, and must be interpreted using the codebooks available in the documentation files, which nhanesA can parse.
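As a minimal sketch, the codebook for a single variable can be retrieved with nhanesCodebook(). The table DEMO_J and the variable RIAGENDR (gender) are chosen only for illustration, and the exact structure of the returned list may differ between versions of nhanesA, so it is inspected with str() rather than assumed. Network access is required.

```r
library(nhanesA)

# Parse the documentation file for DEMO_J and extract the codebook entry
# for one variable.
cb <- nhanesCodebook("DEMO_J", "RIAGENDR")

# Inspect the top-level structure of the result.
str(cb, max.level = 1)
```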
Further analysis can be performed on the resulting datasets, which are regular R data frames. Simple examples of such analyses, and other functionality in the nhanesA package such as search utilities, are described in Ale et al., 2024.
Limitations of this approach
The nhanesA package is designed to access NHANES data on demand from the CDC website. The efficiency of such an approach is naturally limited by available bandwidth. Another limitation that is not obvious at first glance is apparent when we try to combine data across multiple cycles. Not all variables are measured in all cycles, and even when they are, they may not be included in the same tables (and sometimes they are included in multiple tables). Analyzing the availability of variables of interest is difficult with the rudimentary search facilities available on the NHANES website.
Another subtle issue that is important from the perspective of reproducible research is the possibility of data updates (see below). NHANES is an ongoing program, so new datasets are released on a regular basis. More importantly from a reproducibility angle, previously released datasets are sometimes updated. Older versions are not retained on the NHANES website. This means that an analysis performed on a given date may be impossible to recreate on a later date, unless the relevant data sets have been retained.
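Retaining a dated copy of each downloaded table is one simple safeguard against this. The sketch below uses base R only. The helper save_snapshot() and the directory name nhanes-snapshots are hypothetical, introduced here for illustration, and are not part of nhanesA or of the tools described in the next section.

```r
# Hypothetical helper: save a data frame under a file name that records the
# table name and the retrieval date, so that older copies are never overwritten
# by copies retrieved on a later date.
save_snapshot <- function(x, table, dir = "nhanes-snapshots", date = Sys.Date()) {
  dir.create(dir, recursive = TRUE, showWarnings = FALSE)
  path <- file.path(dir, sprintf("%s_%s.rds", table, format(date, "%Y-%m-%d")))
  saveRDS(x, path)
  invisible(path)
}

# Usage, assuming nhanesA is installed and the CDC website is reachable:
# path <- save_snapshot(nhanesA::nhanes("DEMO_J", translated = FALSE), "DEMO_J")
# demo_j <- readRDS(path)
```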
Efficient and reproducible analyses of NHANES data
To address these limitations, we have developed several tools, each building on the previous ones, to create a user-friendly platform for analysts who are comfortable with R as a data analysis platform. Briefly,
The cachehttp package enables local caching of NHANES data and documentation files that are only re-downloaded if they have been updated.
The nhanes-snapshot repository is used to download and periodically update raw data (as compressed CSV files) and documentation (as HTML files) with timestamps, so that they can serve as a snapshot of NHANES data available on specific dates.
The nhanesA package has been modified to recognize a local database (such as the PostgreSQL database in the nhanes-postgres image described below) when it is available, and to use it as an alternative source for both data and documentation, bypassing the NHANES website. Using nhanesA in this mode leads to a speedup of several orders of magnitude while requiring almost no change in user code.
The phonto package provides more advanced analysis tools that take advantage of the local database.
The easiest way to get started with these tools is to run the nhanes-postgres Docker image as described in the README. In addition to the PostgreSQL database, the container includes R and RStudio Server, along with versions of nhanesA and phonto configured to use the database. Once the included instance of RStudio Server is opened in a browser, it can be used as a regular R session, without any need to interact explicitly with the backend database. This is not the only option, however: advanced users may prefer to use only the database from the container, accessing it from outside via port forwarding.
Other articles on this site describe more detailed examples of analyses using these tools, as well as other checks and utilities that help with such analyses.
Frequency of NHANES data releases
We conclude this document with a brief look at how frequently NHANES data files are published or updated, based on the information contained in the table manifest.
Recall from above that the NHANES table manifest includes a Date.Published column. This allows us to tabulate NHANES data release dates. We expect bulk releases of tables to happen together, generally at two-year intervals, while some tables may be released or updated on an as-needed basis.
The release information (available by month of release) can be summarized by tabulating the Date.Published field:
Date.Published          Count
December 2007              13
July 2010                  13
June 2020                  13
Updated October 2014       14
July 2022                  15
August 2021                17
December 2018              17
November 2007              17
November 2021              18
Updated November 2020      19
May 2004                   21
June 2002                  34
October 2015               37
September 2011             38
September 2013             38
September 2017             40
September 2009             41
February 2020              48
September 2024             55
Updated April 2022         59
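A summary like the one above can be obtained with table() and sort() in base R. To keep the sketch self-contained, it uses only the eleven Date.Published values copied from the DEMO listing earlier in this document. With nhanesA installed, the Date.Published column of the manifest returned by nhanesManifest("public") would take their place. Folding the "Updated" entries into the month they refer to is an optional extra step, not something the summary above does.

```r
# Date.Published values copied from the DEMO listing in this document.
published <- c(
  "Updated September 2009", "Updated September 2009", "Updated September 2009",
  "Updated September 2009", "September 2009", "September 2011",
  "Updated January 2015", "October 2015", "September 2017",
  "February 2020", "September 2024"
)

# Number of tables per distinct value, least frequent first.
sort(table(published))

# Optionally drop the "Updated " prefix so that original releases and
# updates in the same month are counted together.
month <- sub("^Updated ", "", published)
counts <- sort(table(month))
counts
```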