Sys.setenv(EPICONDUCTOR_CONTAINER_DB = "postgres")
library(nhanesA)
Accessing NHANES data locally
In its default mode of operation, functions in the nhanesA package scrape data directly from the CDC website each time they are invoked. The advantage is simplicity; users only need to install the nhanesA package without any additional setup. However, the response time is contingent upon internet speed and the size of the requested data.
As briefly described in the introduction, nhanesA has two alternative modes of operation where data can be accessed from a local resource: (a) using a prebuilt SQL database, and (b) using a mirror.
Using SQL database
Work in a Docker container
Functions in the nhanesA package can obtain (most) data from a suitably configured SQL database instead of accessing the CDC website directly. The easiest way to obtain such a database is via a Docker image as described here. This Docker image includes versions of R and RStudio, and is configured so that nhanesA uses the database when it is run inside the container. Once the Docker container is up and running, one can visit http://localhost:8787/ to access an RStudio Server instance.
After logging in using the credentials provided when initiating the docker container, the user gets access to an RStudio session where the nhanesA package can be used to access NHANES resources.
From the user’s perspective, the experience should be largely identical to the default usage mode of nhanesA, except that the data should become available without any significant delay. The output of running nhanesOptions() indicates that nhanesA was able to detect a database when it was loaded.
Access the database via port forwarding
It is also possible to configure nhanesA to use a SQL database when running outside a Docker instance, provided the machine has access to the database. Typically, such a database would be made available by running a Docker image on the same machine, or on another machine in the local network, with the host forwarding the port on which PostgreSQL should be available (typically 5432) to the running Docker instance. This happens automatically if the instructions to start the Docker instance are followed. The advantage of doing this is that a single database instance can be shared by multiple users in a local network, which avoids making copies of the (large) database.
Using nhanesA in this mode requires one additional step. To indicate to the startup code in nhanesA that a database is available, one needs to define certain environment variables that describe the database. Most of these environment variables are optional; to use the PostgreSQL backend, the only mandatory setting is EPICONDUCTOR_CONTAINER_DB=postgres. If the PostgreSQL port (5432) has been mapped to a different port on the host machine, this needs to be indicated using EPICONDUCTOR_DB_PORT=<port>. If the database is available on a different host, its address needs to be specified using EPICONDUCTOR_DB_SERVER=<host>.
For the default invocation of docker run, and an R session running on the same computer, it is sufficient to set this one variable before loading nhanesA, as done in the two lines at the top of this document.
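A minimal sketch of that setup is given below. The required variable and the two optional ones are those named above; the port number and host address in the commented lines are hypothetical placeholders, and the `requireNamespace()` guard is only there so that the snippet also runs where nhanesA is not installed.

```r
## Required: tell the nhanesA startup code to use the PostgreSQL backend.
## This has to happen before the package is loaded.
Sys.setenv(EPICONDUCTOR_CONTAINER_DB = "postgres")

## Optional, only if the defaults do not apply (values are hypothetical):
## Sys.setenv(EPICONDUCTOR_DB_PORT = "5433")           # non-default host port
## Sys.setenv(EPICONDUCTOR_DB_SERVER = "192.168.1.20") # database on another host

if (requireNamespace("nhanesA", quietly = TRUE)) {
    library(nhanesA)
    ## TRUE if a database was detected when the package was loaded
    print(nhanesOptions()$use.db)
}
```

If the flag comes back FALSE, the environment variables were most likely set after the package had already been loaded, or the database is not reachable from this machine.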
Usage
Once a database is successfully configured, the nhanesA package should ideally behave similarly whether or not a database is being used. When a database is successfully found on startup, the package sets a flag called use.db to TRUE.
nhanesOptions()
$use.db
[1] TRUE
With this setting, we get
nhanesOptions(use.db = TRUE)
system.time(demo_g_db <- nhanes("DEMO_G"))
user system elapsed
0.491 0.068 1.445
Even when the database is available, it is possible to pause use of the database and revert to downloading from the CDC website by setting
nhanesOptions(use.db = FALSE, log.access = TRUE)
The log.access option, if set, causes a message to be printed every time a web resource is accessed. With these settings, we get
system.time(demo_g_web <- nhanes("DEMO_G"))
Downloading: https://wwwn.cdc.gov/Nchs/Data/Nhanes/Public/2011/DataFiles/DEMO_G.XPT
user system elapsed
0.621 0.087 7.368
The two versions have minor differences: the order of rows and columns may differ, and categorical variables may be represented either as factors or as character strings. However, as long as the data has not been updated on the NHANES website since it was downloaded for inclusion in the database, the contents should be identical.
str(demo_g_web[1:10])
'data.frame': 9756 obs. of 10 variables:
$ SEQN : num 62161 62162 62163 62164 62165 ...
$ SDDSRVYR: Factor w/ 1 level "NHANES 2011-2012 public release": 1 1 1 1 1 1 1 1 1 1 ...
$ RIDSTATR: Factor w/ 2 levels "Interviewed only",..: 2 2 2 2 2 2 2 2 2 2 ...
$ RIAGENDR: Factor w/ 2 levels "Male","Female": 1 2 1 2 2 1 1 1 1 1 ...
$ RIDAGEYR: num 22 3 14 44 14 9 0 6 21 15 ...
$ RIDAGEMN: num NA NA NA NA NA NA 11 NA NA NA ...
$ RIDRETH1: Factor w/ 5 levels "Mexican American",..: 3 1 5 3 4 3 5 5 5 5 ...
$ RIDRETH3: Factor w/ 6 levels "Mexican American",..: 3 1 5 3 4 3 5 6 5 6 ...
$ RIDEXMON: Factor w/ 2 levels "November 1 through April 30",..: 2 1 2 1 2 2 1 1 1 1 ...
$ RIDEXAGY: num NA 3 14 NA 14 10 NA 6 NA 15 ...
str(demo_g_db[1:10])
tibble [9,756 × 10] (S3: tbl_df/tbl/data.frame)
$ SEQN : int [1:9756] 62161 62162 62163 62164 62165 62166 62167 62168 62169 62170 ...
$ SDDSRVYR: chr [1:9756] "NHANES 2011-2012 public release" "NHANES 2011-2012 public release" "NHANES 2011-2012 public release" "NHANES 2011-2012 public release" ...
$ RIDSTATR: chr [1:9756] "Both interviewed and MEC examined" "Both interviewed and MEC examined" "Both interviewed and MEC examined" "Both interviewed and MEC examined" ...
$ RIAGENDR: chr [1:9756] "Male" "Female" "Male" "Female" ...
$ RIDAGEYR: int [1:9756] 22 3 14 44 14 9 0 6 21 15 ...
$ RIDAGEMN: int [1:9756] NA NA NA NA NA NA 11 NA NA NA ...
$ RIDRETH1: chr [1:9756] "Non-Hispanic White" "Mexican American" "Other Race - Including Multi-Racial" "Non-Hispanic White" ...
$ RIDRETH3: chr [1:9756] "Non-Hispanic White" "Mexican American" "Non-Hispanic Asian" "Non-Hispanic White" ...
$ RIDEXMON: chr [1:9756] "May 1 through October 31" "November 1 through April 30" "May 1 through October 31" "November 1 through April 30" ...
$ RIDEXAGY: int [1:9756] NA 3 14 NA 14 10 NA 6 NA 15 ...
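One way to check that the two versions agree despite these representational differences is to normalize both before comparing. The sketch below uses a hypothetical helper, normalize_nhanes(), which is not part of nhanesA: it converts factors to character strings and integers to doubles, sorts columns by name, and sorts rows by SEQN. It is illustrated on small made-up data frames rather than on the real tables.

```r
## Hypothetical helper (not part of nhanesA): bring a table into a
## canonical form so that two copies can be compared directly.
normalize_nhanes <- function(x, key = "SEQN") {
    x <- as.data.frame(x)                  # drop tibble classes, if any
    x[] <- lapply(x, function(col) {
        if (is.factor(col)) as.character(col)
        else if (is.integer(col)) as.numeric(col)
        else col
    })
    x <- x[order(x[[key]]), sort(names(x)), drop = FALSE]
    rownames(x) <- NULL
    x
}

## Made-up stand-ins for the web and database versions of a table
web <- data.frame(SEQN = c(1, 2, 3),
                  RIAGENDR = factor(c("Male", "Female", "Male")),
                  RIDAGEYR = c(22, 3, 14))
db <- data.frame(RIDAGEYR = c(14L, 22L, 3L),
                 SEQN = c(3L, 1L, 2L),
                 RIAGENDR = c("Male", "Male", "Female"),
                 stringsAsFactors = FALSE)

same <- isTRUE(all.equal(normalize_nhanes(web), normalize_nhanes(db)))
print(same)
```

With the real tables, the same call would be `all.equal(normalize_nhanes(demo_g_web), normalize_nhanes(demo_g_db))`. Any remaining differences it reports (for example in column attributes) would need to be inspected case by case.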
Using a local mirror
A conceptually simple alternative that also avoids repetitive downloads from the CDC website is to maintain a local mirror from which the data and documentation files can be retrieved as needed.
As noted here, data and documentation URLs for a particular table are determined by the table’s name and the cycle it represents. For example, the URLs for table DEMO_C, which is from cycle 3, i.e., 2003-2004, would be
Data: https://wwwn.cdc.gov/Nchs/Data/Nhanes/Public/2003/DataFiles/DEMO_C.xpt
Documentation: https://wwwn.cdc.gov/Nchs/Data/Nhanes/Public/2003/DataFiles/DEMO_C.htm
It is possible to change the “base” of the server from where nhanesA tries to download these files by setting an environment variable called NHANES_TABLE_BASE, which defaults to the value "https://wwwn.cdc.gov".
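To make the role of the base concrete, the sketch below rebuilds the two DEMO_C URLs from a base, a starting year, and a table name. The helper nhanes_table_urls() is hypothetical and only mimics the URL pattern shown above; it is not the code nhanesA uses internally, and not every table necessarily follows this pattern.

```r
## Hypothetical helper: construct data and documentation URLs for a table
## under a given base, following the pattern of the DEMO_C example.
nhanes_table_urls <- function(table, start_year,
                              base = Sys.getenv("NHANES_TABLE_BASE",
                                                unset = "https://wwwn.cdc.gov")) {
    stem <- sprintf("%s/Nchs/Data/Nhanes/Public/%d/DataFiles/%s",
                    base, as.integer(start_year), table)
    c(data = paste0(stem, ".xpt"), doc = paste0(stem, ".htm"))
}

## With the default base, these are the CDC URLs listed above
urls <- nhanes_table_urls("DEMO_C", 2003, base = "https://wwwn.cdc.gov")
print(urls)

## With a different base, the same paths point to a local server instead
## (the address is a placeholder for wherever the mirror is served)
mirror_urls <- nhanes_table_urls("DEMO_C", 2003,
                                 base = "http://127.0.0.1:8080/cdc")
print(mirror_urls)
```

Only the prefix changes, which is why a mirror has to reproduce the directory layout of the CDC server below the base.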
The steps needed to create such a mirror are beyond the scope of this document, but tools such as wget, or even the R function download.file() in conjunction with the list of relevant URLs obtained using nhanesManifest(), may be used to download all files locally. Note that just downloading the files is not sufficient; they must also be made available through an HTTP server running locally.
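As a rough sketch of the download step only, the hypothetical helpers below map each URL to a local path that preserves the server's directory layout and fetch files that are not yet present. The helper names, the mirror directory name, and the manifest column names used in the commented usage lines (DataURL and DocURL) are assumptions; check the output of nhanesManifest() before relying on them.

```r
## Hypothetical helper: local destination for each URL, keeping the
## server-side path so that the mirror has the same layout as the source.
mirror_dest <- function(urls, root) {
    path <- sub("^https?://[^/]+", "", urls)  # drop scheme and host, if present
    file.path(root, sub("^/+", "", path))
}

## Hypothetical helper: download files that are missing from the mirror.
## Relative paths are resolved against 'base'.
mirror_files <- function(urls, root, base = "https://wwwn.cdc.gov") {
    full <- ifelse(grepl("^https?://", urls), urls, paste0(base, urls))
    dest <- mirror_dest(urls, root)
    for (i in seq_along(full)) {
        if (file.exists(dest[i])) next        # already mirrored
        dir.create(dirname(dest[i]), recursive = TRUE, showWarnings = FALSE)
        download.file(full[i], dest[i], mode = "wb", quiet = TRUE)
    }
    invisible(dest)
}

## Where one DEMO_C file would be stored locally
dest <- mirror_dest(
    "https://wwwn.cdc.gov/Nchs/Data/Nhanes/Public/2003/DataFiles/DEMO_C.xpt",
    root = "nhanes-mirror")
print(dest)

## Possible usage (not run; column names are an assumption):
## manifest <- nhanesA::nhanesManifest()
## mirror_files(c(manifest$DataURL, manifest$DocURL), root = "nhanes-mirror")
```

The resulting directory still has to be served over HTTP, with NHANES_TABLE_BASE pointing at that server.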
Dynamic caching using httpuv and BiocFileCache
Both the database and local mirroring options can get outdated when CDC releases new files or updates old ones. The BiocFileCache package can cache downloaded files locally in a persistent manner, updating them automatically when the source file has been updated. The experimental cachehttp package uses the BiocFileCache package in conjunction with the httpuv package to run a local server that downloads files from the CDC website the first time they are requested, but uses the cache for subsequent requests.
To use this package, first install it using
BiocManager::install("BiocFileCache")
remotes::install_github("ccb-hms/cachehttp")
Then, run the following in a separate R session.
require(cachehttp)
add_cache("cdc", "https://wwwn.cdc.gov",
          fun = function(x) {
              x <- tolower(x)
              endsWith(x, ".htm") || endsWith(x, ".xpt")
          })
s <- start_cache(host = "0.0.0.0", port = 8080,
                 static_path = BiocFileCache::bfccache(BiocFileCache::BiocFileCache()))
## stopServer(s) # to stop the httpuv server
This session must be kept active for the server to work. It can even run on a different machine, as long as it is accessible via the specified port. It does not require the nhanesA package to work.
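The function passed as fun to add_cache() is an ordinary R predicate on a single string, so it can be tried out on its own. How cachehttp applies it is not described here; presumably it selects the requests the cache should handle. The sketch below gives the same function a name (is_cacheable, chosen for illustration) and applies it to a few example paths, the last of which is made up.

```r
## Same predicate as above, named for illustration only
is_cacheable <- function(x) {
    x <- tolower(x)   # make the extension check case-insensitive
    endsWith(x, ".htm") || endsWith(x, ".xpt")
}

is_cacheable("/Nchs/Data/Nhanes/Public/2011/DataFiles/DEMO_G.XPT")  # TRUE
is_cacheable("/Nchs/Data/Nhanes/Public/2011/DataFiles/DEMO_G.htm")  # TRUE
is_cacheable("/some/other/page.html")                               # FALSE
```

Because the predicate uses `||`, it expects a single path at a time rather than a vector of paths.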
While the server is running, we can set (in a different R session)
Sys.setenv(NHANES_TABLE_BASE = "http://127.0.0.1:8080/cdc")
(changing host IP and port as necessary) to use this server instead of the primary CDC website to serve XPT and htm files. Although each file is downloaded from the CDC website the first time it is requested, subsequent “downloads” should be faster.
Session information
print(sessionInfo(), locale = FALSE)
R Under development (unstable) (2025-04-06 r88113)
Platform: x86_64-apple-darwin22.2.0
Running under: macOS Ventura 13.1
Matrix products: default
BLAS: /usr/local/Cellar/openblas/0.3.29/lib/libopenblasp-r0.3.29.dylib
LAPACK: /Users/deepayan/local/lib/R/lib/libRlapack.dylib; LAPACK version 3.12.1
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] nhanesA_1.3 kableExtra_1.4.0 lattice_0.22-7 knitr_1.43
loaded via a namespace (and not attached):
[1] bit_4.0.5 jsonlite_1.8.8 selectr_0.4-2 dplyr_1.1.4
[5] compiler_4.6.0 tidyselect_1.2.1 Rcpp_1.0.10 xml2_1.3.2
[9] blob_1.2.4 stringr_1.5.1 systemfonts_1.0.6 scales_1.2.1
[13] yaml_2.2.1 fastmap_1.1.0 R6_2.5.1 plyr_1.8.6
[17] generics_0.1.3 curl_4.3.1 htmlwidgets_1.5.3 tibble_3.2.1
[21] munsell_0.5.0 lubridate_1.9.3 DBI_1.2.2 svglite_2.1.1
[25] pillar_1.10.0 rlang_1.1.3 stringi_1.5.3 xfun_0.49
[29] bit64_4.0.5 timechange_0.2.0 viridisLite_0.4.1 cli_3.6.2
[33] magrittr_2.0.3 RPostgres_1.4.6 digest_0.6.34 rvest_1.0.3
[37] grid_4.6.0 rstudioapi_0.13 dbplyr_2.5.0 hms_1.1.3
[41] lifecycle_1.0.3 vctrs_0.6.5 evaluate_0.21 glue_1.6.2
[45] codetools_0.2-20 colorspace_2.0-0 purrr_1.0.2 foreign_0.8-90
[49] rmarkdown_2.26 httr_1.4.2 tools_4.6.0 pkgconfig_2.0.3
[53] htmltools_0.5.7