RNHANES provides an easy way to download and analyze data from NHANES, the National Health and Nutrition Examination Survey conducted by the Centers for Disease Control.
The included analysis tools focus on the laboratory data, but the package can still be used to search for and download other types of data in NHANES.
library(RNHANES)
NHANES is a national survey that covers demographics, health, nutrition, and environmental chemical exposures. NHANES as a modern program started in 1999, with the survey being administered in two-year cycles.
The released data is split into demographic, dietary, examination, and laboratory data for each survey cycle. RNHANES is designed primarily to work with demographic and laboratory data.
There is one demographic data file for each survey cycle. There is a collection of laboratory data for each cycle, split into files related to different analyte groups.
First, we need to figure out what data we want to analyze and where we can find it in NHANES.
To find the data you’re interested in, you can search either by file or by variable. First, use RNHANES to download a list of NHANES files and the comprehensive variable list. This data isn’t bundled with the package because it is sometimes updated to fix errors or add new data. Downloading the lists lets you get the most recent versions.
files <- nhanes_data_files()
variables <- nhanes_variables()
Use nhanes_search
to search within file and variable lists. You can restrict the searches by specifying conditions on any of the columns in the list.
nhanes_search(files, "environmental phenols")
nhanes_search(files, "pesticides", component == "laboratory", cycle == "2003-2004")
nhanes_search(files, "", cycle == "2003-2004")
nhanes_search(variables, "triclosan")
nhanes_search(variables, "DDT", data_file_name == "LAB28POC")
nhanes_search(variables, "", data_file_name == "EPH_E")
Once you’ve identified the data you need for your analysis, the next step is to download the appropriate data files from NHANES through the nhanes_load_data
function. This function has a lot of options, so let’s start simple and go through them.
The most basic way to download data is to specify the name and cycle year of one data file.
nhanes_load_data("EPH_E", "2007-2008")
You can leave off the trailing suffix (e.g. the “_E" in “EPH_E”) on the file name and it will be filled in for you.
nhanes_load_data("EPH", "2007-2008")
To save time, nhanes_load_data
downloads the files and saves them so they don’t need to be redownloaded every time you run your script. By default, it saves the files to a temporary directory. You can optionally set where you want the files to be downloaded to.
nhanes_load_data("EPH", "2007-2008", cache = "./nhanes_data")
So far, we’ve been downloading the data without its accompanying demographic information, which contains demographic information like age, gender, etc. as well as the survey weights. This information is available in a separate file for each cycle. RNHANES can automatically download the correct demography file and merge it with your data.
nhanes_load_data("EPH", "2007-2008", cache = "./nhanes_data", demographics = TRUE)
Some ordinal fields in NHANES are coded as numeric factors. RNHANES can decode these fields, replacing the factors with their textual description.
nhanes_load_data("EPH", "2007-2008", cache = "./nhanes_data", demographics = TRUE, recode = TRUE)
You can also download multiple files from NHANES at once to simplify your code. You can do this in several ways; first, by specifying a vector of file names and cycle years. The result will be a list containing a data frame for each requested file.
nhanes_load_data(c("PHTHTE", "PFC"), c("2007-2008", "2007-2008"))
You can easily download all the files that were found in a search.
# Search for NHANES files related to environmental phenols and download all of them
results <- nhanes_search(files, "environmental phenols")
nhanes_load_data(results$data_file_name, results$cycle)
# Search for triclosan NHANES variables and download all related files
results <- nhanes_search(variables, "triclosan")
nhanes_load_data(results$data_file_name, results$cycle)
RNHANES includes functions to perform common analyses on NHANES data.
All of the analyses below will be done on the following data.
# Example of data loaded from one file/cycle year
phenols <- nhanes_load_data("EPH", "2007-2008", demographics = TRUE)
# Example of data loaded from multiple files/cycle years
# Download all files that contain a "triclosan" variable
results <- nhanes_search(variables, "triclosan")
triclosan <- nhanes_load_data(results$data_file_name, results$cycle, demographics = TRUE)
RNHANES uses the survey
package to compute quantiles, taking into account survey weights. The nhanes_quantile
abstracts this added complexity away from the user.
You need to specify the column name, comment column name, and weights column name for nhanes_quantile
to work. The function returns a data frame that contains the computed quantiles.
nhanes_quantile(phenols, "URXBPH", "URDBPHLC", "WTSB2YR", c(0.5, 0.95, 0.99))
You can compute quantiles for multiple columns at once. The simplest way is to pass vectors for the column, comment, and weight name inputs.
nhanes_quantile(phenols,
c("URXBPH", "URXTRS"),
c("URDBPHLC", "URDTRSLC"),
c("WTSB2YR", "WTSB2YR"),
c(0.5, 0.95, 0.99))
You can also pass in a data frame that specifies the columns to compute quantiles for.
inputs <- as.data.frame(matrix(c(
# COLUMN COMMENT WEIGHTS
"URXBPH", "URDBPHLC", "WTSB2YR",
"URXTRS", "URDTRSLC", "WTSB2YR"
), ncol = 3, byrow = TRUE), stringsAsFactors = FALSE)
names(inputs) <- c("column", "comment_column", "weights_column")
nhanes_quantile(phenols, inputs, quantiles = c(0.5, 0.95, 0.99))
This looks a little awkward with only two analytes, but becomes more useful if you have a lot of analytes you want to analyze.
nhanes_quantile
transparently handles data that was loaded from multiple files and cycle years. The variable triclosan
is a list of data frames; let’s compute quantiles for triclosan in each one.
In this case, you have to supply a data frame that specifies the columns to look at for each file name and cycle year.
This is a good example because for the 2003-2004 cycle, the triclosan column appears to be misnamed: it is “URDTRS”, when the naming convention in the rest of the file is to have column names start with “URX”.
inputs <- as.data.frame(matrix(c(
# CYCLE COLUMN COMMENT WEIGHTS
"2003-2004", "URDTRS", "URDTRSLC", "WTSC2YR",
"2005-2006", "URXTRS", "URDTRSLC", "WTSB2YR",
"2007-2008", "URXTRS", "URDTRSLC", "WTSB2YR",
"2009-2010", "URXTRS", "URDTRSLC", "WTSB2YR",
"2011-2012", "URXTRS", "URDTRSLC", "WTSA2YR"
), ncol = 4, byrow = TRUE), stringsAsFactors = FALSE)
names(inputs) <- c("cycle", "column", "comment_column", "weights_column")
nhanes_quantile(triclosan, inputs, quantiles = c(0.5, 0.95, 0.99))
Computing weighted detection frequencies works similarly to computing quantiles.
nhanes_detection_frequency(phenols, "URXBPH", "URDBPHLC", "WTSB2YR")
You can also pass in vectors of column names to analyze multiple analytes.
nhanes_detection_frequency(phenols, c("URXBPH", "URXTRS"),
c("URDBPHLC", "URDTRSLC"),
c("WTSB2YR", "WTSB2YR"))
If you want to compute detection frequencies for columns from different files or cycles, you have to pass in a data frame.
inputs <- as.data.frame(matrix(c(
# CYCLE COLUMN COMMENT WEIGHTS
"2003-2004", "URDTRS", "URDTRSLC", "WTSC2YR",
"2005-2006", "URXTRS", "URDTRSLC", "WTSB2YR",
"2007-2008", "URXTRS", "URDTRSLC", "WTSB2YR",
"2009-2010", "URXTRS", "URDTRSLC", "WTSB2YR",
"2011-2012", "URXTRS", "URDTRSLC", "WTSA2YR"
), ncol = 4, byrow = TRUE), stringsAsFactors = FALSE)
names(inputs) <- c("cycle", "column", "comment_column", "weights_column")
nhanes_detection_frequency(triclosan, inputs)
The nhanes_sample_size
function computes sample sizes of one or more analytes:
nhanes_sample_size(phenols, "URXBPH", "URDBPHLC", "WTSB2YR")
You can use any function from the survey
package to analyze your data through the nhanes_survey
function. This provides a generic way to apply a survey function to NHANES data.
For example, you can calculate means using svymean
as follows:
library(survey)
nhanes_survey(svymean, phenols, "URXBPH", "URDBPHLC", "WTSB2YR", na.rm = TRUE)
Use nhanes_hist
to plot the weighted histogram of an NHANEs variable.
nhanes_hist(phenols, "URXBPH", "URDBPHLC", "WTSB2YR")
you can transform the data before plotting the histogram by supplying a transform
function:
nhanes_hist(phenols, "URXBPH", "URDBPHLC", "WTSB2YR", transform="log")