Distribution of BMI over time

NHANES is not a longitudinal study; that is, it does not follow participants over time. Rather, the data collected in any NHANES cycle can be viewed as a snapshot of the US population from the period corresponding to that cycle. Analyses that combine data across cycles typically assume that the underlying population characteristics have not changed across the cycles being combined.

However, as NHANES has now collected data over more than two decades, it may also contain evidence of characteristics that have changed over time. In this analysis, we consider a specific question: whether the distribution of BMI, a standard indicator of obesity, has changed across cycles. We ask this question separately for various ethnicities and genders, as BMI is known to vary substantially across population subgroups. As an illustration of best practices, we use methods from the survey package which take into account the complex sample selection design of NHANES.

Relevant variables

To identify variables that contain information about BMI and the NHANES tables they are available in, we can use the nhanesSearch() function in the nhanesA package.

library(nhanesA)
nhanesOptions(log.access = TRUE)
nhanesSearch("body mass index")

# A tibble: 12 × 7
   Variable.Name Variable.Description      Data.File.Name Data.File.Description
   <chr>         <chr>                     <chr>          <chr>                
 1 BMXBMI        Body Mass Index (kg/m**2) BMX            Body Measures        
 2 BMXBMI        Body Mass Index (kg/m**2) BMX_B          Body Measures        
 3 BMXBMI        Body Mass Index (kg/m**2) BMX_C          Body Measures        
 4 BMXBMI        Body Mass Index (kg/m**2) BMX_D          Body Measures        
 5 BMXBMI        Body Mass Index (kg/m**2) BMX_E          Body Measures        
 6 BMXBMI        Body Mass Index (kg/m**2) BMX_F          Body Measures        
 7 BMXBMI        Body Mass Index (kg/m**2) BMX_G          Body Measures        
 8 BMXBMI        Body Mass Index (kg/m**2) BMX_H          Body Measures        
 9 BMXBMI        Body Mass Index (kg/m**2) BMX_I          Body Measures        
10 BMXBMI        Body Mass Index (kg/m**2) BMX_J          Body Measures        
11 BMXBMI        Body Mass Index (kg/m**2) P_BMX          Body Measures        
12 BMXBMI        Body Mass Index (kg/m**2) BMX_L          Body Measures        
# ℹ 3 more variables: Begin.Year <int>, EndYear <int>, Component <chr>

These results tell us that BMI measurements are available as the BMXBMI variable in the BMX tables. For any reasonable analysis, we will need to combine these at least with demographic information available in the DEMO tables.

BMI by age

To compare the distribution of BMI across cycles, we need to first understand the factors that affect its distribution within a cycle. Natural covariates are gender and ethnicity, and possibly age. To understand the dependence on age, we choose a particular demographic subgroup (white females) from a particular cycle, and plot BMI vs age.

library(dplyr)
library(lattice)
library(latticeExtra)
bmdata <-
    left_join(nhanes("DEMO_C", translated = TRUE),
              nhanes("BMX_C", translated = TRUE),
              by = "SEQN") |>
    subset(RIDAGEYR > 20 & RIDAGEYR < 75)
bmi_white_female <-
    subset(bmdata, RIDRETH1 == "Non-Hispanic White" & RIAGENDR == "Female")

xyplot(bmi_white_female, BMXBMI ~ RIDAGEYR, grid = TRUE, smooth = "loess")

The smooth line is a LOESS line giving a nonparametric estimator of the average BMI as a function of age for this population subgroup. Unfortunately, the data shown in this figure are not an i.i.d. sample from the population, and so the estimated smooth may be biased and misleading. To take the complex survey design on NHANES into account, we can use tools in the survey package, which implements variants of many standard statistical analysis tools appropriate for survey data.

The survey package does not implement a survey variant of LOESS, although it does implement local polynomial smoothing (see `?svysmooth). We will instead use a parametric variant that supports “non-linear” mean functions via basis splines. Before doing this, we first need to set up a survey design object with suitable weights, id, and strata information.

library(survey)
library(splines)
design <- svydesign(id = ~ SDMVPSU, strata = ~ SDMVSTRA, weights = ~ WTMEC2YR,
                    data = bmdata,
                    nest = TRUE)
ns_age <- function(x) splines::ns(x, knots = seq(30, 65, length = 2),
                                 Boundary.knots = c(21, 74))
fm <- svyglm(BMXBMI ~ ns_age(RIDAGEYR), design = design,
             subset = RIDRETH1 == "Non-Hispanic White" & RIAGENDR == "Female")
fm

Stratified 1 - level Cluster Sampling design (with replacement)
With (29) clusters.
svydesign(id = ~SDMVPSU, strata = ~SDMVSTRA, weights = ~WTMEC2YR, 
    data = bmdata, nest = TRUE)

Call:  svyglm(formula = BMXBMI ~ ns_age(RIDAGEYR), design = design, 
    subset = RIDRETH1 == "Non-Hispanic White" & RIAGENDR == "Female")

Coefficients:
      (Intercept)  ns_age(RIDAGEYR)1  ns_age(RIDAGEYR)2  ns_age(RIDAGEYR)3  
          24.4626             4.9647             6.9068             0.7908  

Degrees of Freedom: 1003 Total (i.e. Null);  11 Residual
  (74 observations deleted due to missingness)
Null Deviance:      52850 
Residual Deviance: 50990    AIC: 6932

This gives a different smooth that is appropriately adjusted for unequal survey weights.

xyplot(bmdata, BMXBMI ~ RIDAGEYR, grid = TRUE,
       subset = RIDRETH1 == "Non-Hispanic White" & RIAGENDR == "Female") + 
    xyplot(predict(fm, newdata = list(RIDAGEYR = x)) ~ x,
           data = data.frame(x = 20:75), type = "l")

This smooth represents an estimate of the expected BMI as a function of age for white females. What we are really interested in is how this function changes over time, that is across cycles, for this and other population subgroups.

To do this, we first obtain the survey design objects for each cycle, restricting our attention to (non-Hispanic) white and black adults.

shortCycle <- function(SDDSRVYR)
{
    cycle <- substring(SDDSRVYR, 8, 16)
    cycle[grepl("August", cycle)] <- "2021-2023"
    cycle
}
makeDesign <- function(demo_table, bmx_table)
{
    data <-
        left_join(nhanes(demo_table, translated = TRUE),
                  nhanes(bmx_table, translated = TRUE),
                  by = "SEQN")
    data <- within(data, cycle <- shortCycle(SDDSRVYR))
    if (anyNA(data$WTMEC2YR)) warning("dropping rows with missing exam weights (WTMEC2YR) in ", demo_table)
    data <- subset(data, is.finite(WTMEC2YR))
    design <- svydesign(id = ~ SDMVPSU, strata = ~ SDMVSTRA, weights = ~ WTMEC2YR,
                        data = data,
                        nest = TRUE)
    subset(design, is.finite(BMXBMI) &
                   RIDAGEYR > 20 & RIDAGEYR < 75 &
                   RIDRETH1 %in% c("Non-Hispanic Black", "Non-Hispanic White"))
}
bmx_tables <- c("BMX", "BMX_B", "BMX_C", "BMX_D", "BMX_E", "BMX_F", "BMX_G", "BMX_H", "BMX_I", "BMX_J", "BMX_L")
demo_tables <- c("DEMO", "DEMO_B", "DEMO_C", "DEMO_D", "DEMO_E", "DEMO_F", "DEMO_G", "DEMO_H", "DEMO_I", "DEMO_J", "DEMO_L")
all_designs <- mapply(makeDesign, demo_tables, bmx_tables, SIMPLIFY = FALSE)

Warning in (function (demo_table, bmx_table) : dropping rows with missing exam
weights (WTMEC2YR) in DEMO_H

Warning in (function (demo_table, bmx_table) : dropping rows with missing exam
weights (WTMEC2YR) in DEMO_I

Warning in (function (demo_table, bmx_table) : dropping rows with missing exam
weights (WTMEC2YR) in DEMO_J

Warning in (function (demo_table, bmx_table) : dropping rows with missing exam
weights (WTMEC2YR) in DEMO_L

Next, we repeat the smoothing computation for each cycle, for four population subgroups.

g <- expand.grid(cycle = names(all_designs),
                 gender = c("Female", "Male"),
                 ethnicity = c("Non-Hispanic Black", "Non-Hispanic White"),
                 KEEP.OUT.ATTRS = FALSE)
smooth  <- vector(mode = "list", length = nrow(g))
for (i in seq_len(nrow(g))) {
    fm <- svyglm(BMXBMI ~ ns_age(RIDAGEYR),
                 design = all_designs[[ g$cycle[i] ]],
                 subset = RIDRETH1 == g$ethnicity[i] & RIAGENDR == g$gender[i])
    smooth[[i]] <-
        data.frame(cycle = g$cycle[i],
                   ethnicity = g$ethnicity[i],
                   gender = g$gender[i],
                   Age = 20:75,
                   AvgBMI = predict(fm, newdata = list(RIDAGEYR = 20:75),
                                    type = "response") |> as.numeric())
}
smoothDF <- do.call(rbind, smooth)

The resulting smooths can be compared using the following plot.

xyplot(smoothDF, AvgBMI ~ Age | gender + ethnicity, type = "l", groups = cycle,
       grid = TRUE, auto.key = TRUE)

Although it is not straightforward to interpret this plot, it is clear that average BMI tends to initially increase with age, after which it tends to stabilize. For this reason, we choose to consider only a middle age group, namely, adults between ages 40–59.

Average BMI in 40–59 year olds

Once we decide on a specific age group to consider, we can ignore the dependence on age and simply compare the average BMI across cycles for various population subgroups. Of course, the estimated average BMI values are not of much use unless we also calculate standard errors or confidence intervals, which are nontrivial to compute. For this, we again use the survey package, and specifically its svymean() function, for which the associated confidence intervals can be easily obtained.

As the BMI values are somewhat right-skewed, we calculate the average and confidence interval for log-BMI, and transform it back to the BMI scale before plotting.

unrowname <- function(d) { rownames(d) <- NULL; d }
estMean <- function(design, what, groups, ...)
{
    svyby(what, groups,
          design = subset(design, RIDAGEYR >= 40 & RIDAGEYR <= 59),
          svymean, ...) |> unrowname()
}
CI <- 
    lapply(all_designs, 
           estMean,
           ~ log(BMXBMI),
           ~ cycle + RIDRETH1 + RIAGENDR,
           vartype = "ci") |> do.call(what = rbind) |> unrowname()
## transform back to BMI scale
CI <- within(CI,
{
    estBMI <- exp(`log(BMXBMI)`)
    LCL <- exp(ci_l)
    UCL <- exp(ci_u)
})

If we now plot the confidence intervals across cycles, we see that although successive confidence intervals are mostly overlapping, there is a distinct general trend of increasing average BMI values over time.

segplot(factor(cycle) ~ LCL + UCL | interaction(RIAGENDR, RIDRETH1),
        data = CI, # level = estimate,
        draw.bands = FALSE, lwd = 2, horizontal = FALSE,
        ylab = "95% Confidence Intervals for Average (log) BMI \n (for Age Group 40-59 years)",
        scales = list(x = list(rot = 90)),
        centers = estBMI) + layer_(panel.grid(v = -1, h = 0))

Session information

print(sessionInfo(), locale = FALSE)

R Under development (unstable) (2025-04-06 r88113)
Platform: x86_64-apple-darwin22.2.0
Running under: macOS Ventura 13.1

Matrix products: default
BLAS:   /usr/local/Cellar/openblas/0.3.29/lib/libopenblasp-r0.3.29.dylib 
LAPACK: /Users/deepayan/local/lib/R/lib/libRlapack.dylib;  LAPACK version 3.12.1

attached base packages:
[1] splines   grid      stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
[1] survey_4.1-1        survival_3.8-3      Matrix_1.7-3       
[4] latticeExtra_0.6-30 lattice_0.22-7      dplyr_1.1.4        
[7] nhanesA_1.3         kableExtra_1.4.0    knitr_1.43         

loaded via a namespace (and not attached):
 [1] utf8_1.1.4         generics_0.1.3     xml2_1.3.2         jpeg_0.1-10       
 [5] stringi_1.5.3      hms_1.1.3          digest_0.6.34      magrittr_2.0.3    
 [9] evaluate_0.21      timechange_0.2.0   RColorBrewer_1.1-2 fastmap_1.1.0     
[13] blob_1.2.4         plyr_1.8.6         jsonlite_1.8.8     RPostgres_1.4.6   
[17] DBI_1.2.2          httr_1.4.2         rvest_1.0.3        purrr_1.0.2       
[21] viridisLite_0.4.1  scales_1.2.1       codetools_0.2-20   cli_3.6.2         
[25] mitools_2.4        rlang_1.1.3        dbplyr_2.5.0       bit64_4.0.5       
[29] munsell_0.5.0      withr_2.5.0        yaml_2.2.1         tools_4.6.0       
[33] deldir_1.0-6       colorspace_2.0-0   interp_1.1-4       vctrs_0.6.5       
[37] R6_2.5.1           png_0.1-7          lifecycle_1.0.3    lubridate_1.9.3   
[41] stringr_1.5.1      htmlwidgets_1.5.3  bit_4.0.5          foreign_0.8-90    
[45] pkgconfig_2.0.3    pillar_1.10.0      glue_1.6.2         Rcpp_1.0.10       
[49] systemfonts_1.0.6  xfun_0.49          tibble_3.2.1       tidyselect_1.2.1  
[53] rstudioapi_0.13    htmltools_0.5.7    rmarkdown_2.26     svglite_2.1.1     
[57] compiler_4.6.0