Data: census variables

Census variables and calculation table

To retrieve census data and calculate SVI based on CDC/ATSDR documentation, the package includes a series of lists and tables containing census variable information.

census_variables_(2012-2022): Each list contains the year-specific census variables needed for SVI calculation.
variable_ep_calculation_(2012-2022): Each table contains the SVI variable names, their theme groups, and the corresponding census variable(s) and calculation formulas.

These datasets are documented in ?census_variables and ?variable_calculation.

ZCTA-state relationship file (crosswalk)

Currently, tidycensus::get_acs() does not support requests for state-specific ZCTA-level data starting in 2019 (subject tables) or 2020 (all tables). This is likely due to changes in the Census API, as ZCTAs are not subgeographies of states (some ZCTAs cross state boundaries). To obtain state-specific ZCTA-level data, three datasets of ZCTA-to-state crosswalks are included to help select the ZCTAs in the state(s) of interest after retrieving the ZCTA data at the national level.

These crosswalk files are documented in ?zcta_state_xwalk.

Retrieve census data with `get_census_data()`

get_census_data() uses tidycensus::get_acs() with a pre-defined list of variables to retrieve ACS data for SVI calculation. The list of census variables is built into the function and changes according to the year of interest. Importantly, this function requires a Census API key, which can be obtained online and set up with tidycensus::census_api_key("YOUR KEY GOES HERE"). The arguments are largely the same as those of tidycensus::get_acs(), including year, geography and state.

For example, we can retrieve ZCTA-level data for Rhode Island for 2018:

data <- get_census_data(2018, "zcta", "RI")
data[1:10, 1:10]

#> # A tibble: 10 × 10
#>    GEOID NAME        B17001_002E B17001_002M B19301_001E B19301_001M B06009_002E
#>    <chr> <chr>             <dbl>       <dbl>       <dbl>       <dbl>       <dbl>
#>  1 02802 ZCTA5 02802         154         190       24925       14640          80
#>  2 02804 ZCTA5 02804         130          91       39065        6412          56
#>  3 02806 ZCTA5 02806         520         183       61534        3820         383
#>  4 02807 ZCTA5 02807          73          33       39287        7937          19
#>  5 02808 ZCTA5 02808         162         166       29356        3819         272
#>  6 02809 ZCTA5 02809        1619         368       34252        2269        2077
#>  7 02812 ZCTA5 02812          31          52       41718        5771          72
#>  8 02813 ZCTA5 02813         605         271       42612        4889         411
#>  9 02814 ZCTA5 02814         722         253       37750        3056         381
#> 10 02815 ZCTA5 02815          13          21       71975       22744           0
#> # ℹ 3 more variables: B06009_002M <dbl>, B09001_001E <dbl>, B09001_001M <dbl>

(First 10 rows and columns are shown, with the rest of the columns being other census variables.)

Note that for ZCTA-level data after 2018, retrieval by state is not supported by the Census API/tidycensus. For such requests, get_census_data() first retrieves ZCTA-level data for the whole country, and then uses the ZCTA-to-state relationship file (crosswalk) to select the ZCTAs in the state(s) of interest. This results in a longer running time for these requests.
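
The select-by-crosswalk step described above can be sketched with toy data. This is only an illustration of the idea, not the package's internal code: the data frames `national` and `toy_xwalk`, their column names, and the values are made up for this example and do not reflect the actual structure of the zcta_state_xwalk datasets.

```r
# Toy "national" ZCTA-level data (values are made up for illustration).
national <- data.frame(
  GEOID = c("02802", "02804", "06001", "10001"),
  some_estimate = c(1, 2, 3, 4)
)

# Hypothetical ZCTA-to-state crosswalk; column names are assumptions.
toy_xwalk <- data.frame(
  ZCTA = c("02802", "02804", "06001", "10001"),
  state = c("RI", "RI", "CT", "NY")
)

# Look up the ZCTAs belonging to the state(s) of interest ...
states_of_interest <- c("RI")
keep <- unique(toy_xwalk$ZCTA[toy_xwalk$state %in% states_of_interest])

# ... and keep only those rows of the national data.
ri_data <- national[national$GEOID %in% keep, ]
ri_data
```

Because the national data has to be downloaded before it can be filtered, this approach is slower than a direct state-level request, which matches the longer running time noted above.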

Compute SVI with `get_svi()`

get_svi() takes the year and census data (retrieved by get_census_data()) as arguments, and calculates the SVI based on CDC/ATSDR documentation (https://www.atsdr.cdc.gov/placeandhealth/svi/data_documentation_download.html). This function uses the built-in variable_calculation tables and populates the SVI variables either with census variables directly or with basic summation/percentage calculations of census variables. For each SVI variable, a geographic unit is ranked against the others in the selected region. The rankings for variables within each theme are then summed, and these sums are percentile-ranked again to give the theme-specific SVIs and the overall SVI.
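
To make the rank, sum, and rank-again sequence concrete, here is a minimal sketch on toy data. It assumes the percentile rank is computed as (rank - 1) / (N - 1) with tied values sharing the lowest rank; the helper `pct_rank()` and the toy table are hypothetical and are not taken from the package source, and only two variables stand in for a whole theme.

```r
# Toy percentages for four hypothetical areas (values are made up).
toy <- data.frame(
  GEOID = c("A", "B", "C", "D"),
  EP_POV = c(23.0, 6.5, 3.2, 8.8),
  EP_UNEMP = c(6.4, 1.0, 2.9, 4.6)
)

# Assumed percentile-rank definition: (rank - 1) / (N - 1), ties get the minimum rank.
pct_rank <- function(x) {
  r <- rank(x, ties.method = "min", na.last = "keep")
  (r - 1) / (sum(!is.na(x)) - 1)
}

# Step 1: percentile rank of each variable across areas (EPL_ columns).
toy$EPL_POV <- pct_rank(toy$EP_POV)
toy$EPL_UNEMP <- pct_rank(toy$EP_UNEMP)

# Step 2: sum the percentile ranks within the theme (SPL_ column).
toy$SPL_theme1 <- toy$EPL_POV + toy$EPL_UNEMP

# Step 3: percentile-rank the sums to get the theme-level SVI (RPL_ column).
toy$RPL_theme1 <- pct_rank(toy$SPL_theme1)
toy
```

Since every rank is relative to the other units in the table, the same area can receive a different SVI when it is ranked within a different region.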

For example, to obtain ZCTA-level SVI for Rhode Island for 2018:

result <- get_svi(2018, data)
glimpse(result)

#> Rows: 77
#> Columns: 60
#> $ GEOID      <chr> "02802", "02804", "02806", "02807", "02808", "02809", "0281…
#> $ NAME       <chr> "ZCTA5 02802", "ZCTA5 02804", "ZCTA5 02806", "ZCTA5 02807",…
#> $ E_TOTPOP   <dbl> 671, 2004, 16192, 827, 2565, 22258, 1208, 7780, 7673, 208, …
#> $ E_HU       <dbl> 314, 947, 6393, 1856, 969, 9181, 402, 5173, 3350, 76, 14272…
#> $ E_HH       <dbl> 223, 840, 6111, 429, 889, 8442, 402, 3200, 2903, 76, 13304,…
#> $ E_POV      <dbl> 154, 130, 520, 73, 162, 1619, 31, 605, 722, 13, 2575, 143, …
#> $ E_UNEMP    <dbl> 18, 12, 244, 21, 171, 424, 44, 330, 167, 0, 1016, 123, 459,…
#> $ E_PCI      <dbl> 24925, 39065, 61534, 39287, 29356, 34252, 41718, 42612, 377…
#> $ E_NOHSDP   <dbl> 80, 56, 383, 19, 272, 2077, 72, 411, 381, 0, 2011, 158, 523…
#> $ E_AGE65    <dbl> 15, 351, 2680, 221, 267, 4578, 144, 1733, 1207, 16, 5520, 8…
#> $ E_AGE17    <dbl> 220, 331, 4375, 143, 598, 3201, 323, 1265, 1489, 74, 6322, …
#> $ E_DISABL   <dbl> 194, 200, 1453, 96, 184, 2234, 149, 818, 1172, 53, 5630, 39…
#> $ E_SNGPNT   <dbl> 94, 47, 254, 36, 45, 447, 10, 202, 134, 0, 824, 176, 396, 9…
#> $ E_MINRTY   <dbl> 87, 0, 1426, 49, 264, 1850, 146, 476, 518, 37, 2058, 606, 2…
#> $ E_LIMENG   <dbl> 18, 0, 98, 0, 0, 416, 0, 0, 0, 0, 205, 47, 91, 0, 10, 14, 0…
#> $ E_MUNIT    <dbl> 72, 0, 147, 90, 0, 592, 0, 38, 46, 0, 1119, 158, 1163, 60, …
#> $ E_MOBILE   <dbl> 0, 13, 0, 37, 0, 0, 0, 232, 174, 0, 841, 98, 100, 231, 8, 0…
#> $ E_CROWD    <dbl> 18, 0, 11, 10, 0, 71, 0, 68, 11, 0, 166, 44, 69, 15, 33, 0,…
#> $ E_NOVEH    <dbl> 10, 13, 151, 11, 0, 530, 0, 90, 83, 0, 472, 0, 563, 29, 61,…
#> $ E_GROUPQ   <dbl> 0, 0, 34, 39, 0, 3559, 0, 49, 10, 0, 452, 33, 59, 288, 20, …
#> $ EP_POV     <dbl> 23.0, 6.5, 3.2, 8.8, 6.4, 8.6, 2.6, 7.8, 9.5, 6.3, 8.0, 2.4…
#> $ EP_UNEMP   <dbl> 6.4, 1.0, 2.9, 4.6, 11.4, 3.6, 6.7, 7.4, 3.8, 0.0, 5.5, 3.3…
#> $ EP_PCI     <dbl> 24925, 39065, 61534, 39287, 29356, 34252, 41718, 42612, 377…
#> $ EP_NOHSDP  <dbl> 20.1, 3.9, 3.4, 2.8, 15.7, 14.0, 8.9, 7.0, 6.7, 0.0, 8.4, 3…
#> $ EP_AGE65   <dbl> 2.2, 17.5, 16.6, 26.7, 10.4, 20.6, 11.9, 22.3, 15.7, 7.7, 1…
#> $ EP_AGE17   <dbl> 32.8, 16.5, 27.0, 17.3, 23.3, 14.4, 26.7, 16.3, 19.4, 35.6,…
#> $ EP_DISABL  <dbl> 28.9, 10.0, 9.0, 11.6, 7.2, 10.3, 12.5, 10.5, 15.3, 25.5, 1…
#> $ EP_SNGPNT  <dbl> 42.2, 5.6, 4.2, 8.4, 5.1, 5.3, 2.5, 6.3, 4.6, 0.0, 6.2, 8.1…
#> $ EP_MINRTY  <dbl> 13.0, 0.0, 8.8, 5.9, 10.3, 8.3, 12.1, 6.1, 6.8, 17.8, 6.3, …
#> $ EP_LIMENG  <dbl> 3.1, 0.0, 0.6, 0.0, 0.0, 1.9, 0.0, 0.0, 0.0, 0.0, 0.7, 0.8,…
#> $ EP_MUNIT   <dbl> 22.9, 0.0, 2.3, 4.8, 0.0, 6.4, 0.0, 0.7, 1.4, 0.0, 7.8, 6.6…
#> $ EP_MOBILE  <dbl> 0.0, 1.4, 0.0, 2.0, 0.0, 0.0, 0.0, 4.5, 5.2, 0.0, 5.9, 4.1,…
#> $ EP_CROWD   <dbl> 8.1, 0.0, 0.2, 2.3, 0.0, 0.8, 0.0, 2.1, 0.4, 0.0, 1.2, 2.0,…
#> $ EP_NOVEH   <dbl> 4.5, 1.5, 2.5, 2.6, 0.0, 6.3, 0.0, 2.8, 2.9, 0.0, 3.5, 0.0,…
#> $ EP_GROUPQ  <dbl> 0.0, 0.0, 0.2, 4.7, 0.0, 16.0, 0.0, 0.6, 0.1, 0.0, 1.4, 0.5…
#> $ EPL_POV    <dbl> 0.9054, 0.4054, 0.1486, 0.5405, 0.3919, 0.5135, 0.0946, 0.4…
#> $ EPL_UNEMP  <dbl> 0.6842, 0.1053, 0.1711, 0.4079, 0.9605, 0.2632, 0.7105, 0.8…
#> $ EPL_PCI    <dbl> 0.8684, 0.4605, 0.0263, 0.4211, 0.7763, 0.6711, 0.3158, 0.2…
#> $ EPL_NOHSDP <dbl> 0.9211, 0.2500, 0.1842, 0.1447, 0.8553, 0.8026, 0.5921, 0.4…
#> $ EPL_AGE65  <dbl> 0.0789, 0.5132, 0.4474, 0.9605, 0.1842, 0.7895, 0.2105, 0.8…
#> $ EPL_AGE17  <dbl> 0.9737, 0.2632, 0.9211, 0.3684, 0.8158, 0.1579, 0.9079, 0.2…
#> $ EPL_DISABL <dbl> 1.0000, 0.1867, 0.1467, 0.4000, 0.1067, 0.2267, 0.4667, 0.2…
#> $ EPL_SNGPNT <dbl> 0.9865, 0.4324, 0.2838, 0.7027, 0.3649, 0.3919, 0.1216, 0.5…
#> $ EPL_MINRTY <dbl> 0.6447, 0.0000, 0.4211, 0.2237, 0.5000, 0.4079, 0.5921, 0.2…
#> $ EPL_LIMENG <dbl> 0.8289, 0.0000, 0.4342, 0.0000, 0.0000, 0.7500, 0.0000, 0.0…
#> $ EPL_MUNIT  <dbl> 0.9459, 0.0000, 0.2838, 0.3919, 0.0000, 0.4459, 0.0000, 0.2…
#> $ EPL_MOBILE <dbl> 0.0000, 0.7973, 0.0000, 0.8378, 0.0000, 0.0000, 0.0000, 0.9…
#> $ EPL_CROWD  <dbl> 1.0000, 0.0000, 0.2973, 0.8243, 0.0000, 0.4865, 0.0000, 0.7…
#> $ EPL_NOVEH  <dbl> 0.4054, 0.1757, 0.2162, 0.2297, 0.0000, 0.5946, 0.0000, 0.2…
#> $ EPL_GROUPQ <dbl> 0.0000, 0.0000, 0.2368, 0.8158, 0.0000, 0.9342, 0.0000, 0.4…
#> $ SPL_theme1 <dbl> 3.3791, 1.2212, 0.5302, 1.5142, 2.9840, 2.2504, 1.7130, 1.9…
#> $ SPL_theme2 <dbl> 3.0391, 1.3955, 1.7990, 2.4316, 1.4716, 1.5660, 1.7067, 1.9…
#> $ SPL_theme3 <dbl> 1.4736, 0.0000, 0.8553, 0.2237, 0.5000, 1.1579, 0.5921, 0.2…
#> $ SPL_theme4 <dbl> 2.3513, 0.9730, 1.0341, 3.0995, 0.0000, 2.4612, 0.0000, 2.6…
#> $ RPL_theme1 <dbl> 0.9211, 0.2237, 0.0395, 0.3158, 0.8289, 0.6447, 0.4474, 0.6…
#> $ RPL_theme2 <dbl> 1.0000, 0.1711, 0.3421, 0.7237, 0.2105, 0.2632, 0.3158, 0.3…
#> $ RPL_theme3 <dbl> 0.8026, 0.0000, 0.4868, 0.1447, 0.2763, 0.5921, 0.3158, 0.1…
#> $ RPL_theme4 <dbl> 0.4474, 0.1579, 0.2237, 0.8158, 0.0000, 0.4737, 0.0000, 0.6…
#> $ SPL_themes <dbl> 10.2431, 3.5897, 4.2186, 7.2690, 4.9556, 7.4355, 4.0118, 6.…
#> $ RPL_themes <dbl> 0.8553, 0.1184, 0.1579, 0.5263, 0.2237, 0.5526, 0.1447, 0.4…

Columns include geographic unit information, individual SVI variables (“E_xx” and “EP_xx”), intermediate percentile rankings (“EPL_xx” and “SPL_xx”), and the theme-specific and overall SVIs (“RPL_xx”).

Wrapper and more: `find_svi()`

To retrieve census data and compute SVI in one step, we could use find_svi(). While get_census_data() only accepts a single year for year (and multiple states for state), just like tidycensus::get_acs(), find_svi() accepts paired vectors of year and state for the same geography level. This allows processing multiple year-state combinations in one function call: data retrieval and SVI calculation are performed separately for every year-state entry, and a summarised SVI table is returned for all year-state pairs.

One important difference in data retrieval between find_svi() and get_census_data() is that the year-state combinations will always be evaluated as “one year and one state”: the option to get census data for multiple states at once (for one year) in get_census_data() is disabled in find_svi(). The one exception to this one-to-one rule is that, when a single year is supplied to year, you can leave state = NULL (the default) to perform nation-level data retrieval and SVI calculation.

For SVI table output, find_svi() by default returns a summarised SVI table with only the GEOID, the theme-specific SVIs and the overall SVI across all 4 themes for each year-state combination. Alternatively, setting full.table = TRUE returns a full SVI table with every SVI variable and intermediate ranking value (as returned by get_svi()). For both options, the corresponding year and state information is included as two separate columns in the table.

Single year-state entry

Using the same example as above, to obtain ZCTA-level census data and calculate SVI for Rhode Island for 2018 in one step:

onestep_result <- find_svi(2018, "RI", "zcta")
onestep_result %>% head(10)

#> # A tibble: 10 × 8
#>    GEOID RPL_theme1 RPL_theme2 RPL_theme3 RPL_theme4 RPL_themes  year state
#>    <chr>      <dbl>      <dbl>      <dbl>      <dbl>      <dbl> <dbl> <chr>
#>  1 02802     0.921       1          0.803      0.447     0.855   2018 RI   
#>  2 02804     0.224       0.171      0          0.158     0.118   2018 RI   
#>  3 02806     0.0395      0.342      0.487      0.224     0.158   2018 RI   
#>  4 02807     0.316       0.724      0.145      0.816     0.526   2018 RI   
#>  5 02808     0.829       0.210      0.276      0         0.224   2018 RI   
#>  6 02809     0.645       0.263      0.592      0.474     0.553   2018 RI   
#>  7 02812     0.447       0.316      0.316      0         0.145   2018 RI   
#>  8 02813     0.618       0.382      0.171      0.632     0.460   2018 RI   
#>  9 02814     0.5         0.487      0.224      0.342     0.382   2018 RI   
#> 10 02815     0.0263      0.513      0.342      0         0.0789  2018 RI

This is a glimpse of the first 10 rows of the summarised SVI table, with additional columns indicating the year and state information. By default, the summarised table only keeps the GEOID and SVIs. Set full.table = TRUE for a more complete SVI table with all the individual SVI variables from census data (like the result from get_svi() shown in the previous section).

Multiple year-state entries

For multiple year-state combinations, we could supply two vectors to year and state arguments and they’ll be treated as pairs. For example, to obtain county-level SVI of New Jersey and Pennsylvania for 2017 and 2018, respectively:

summarise_results <- find_svi(
  year = c(2017, 2018),
  state = c("NJ", "PA"),
  geography = "county"
) 

summarise_results %>% 
  group_by(year, state) %>% 
  slice_head(n = 5)

#> # A tibble: 10 × 8
#> # Groups:   year, state [2]
#>    GEOID RPL_theme1 RPL_theme2 RPL_theme3 RPL_theme4 RPL_themes  year state
#>    <chr>      <dbl>      <dbl>      <dbl>      <dbl>      <dbl> <dbl> <chr>
#>  1 34001      0.95      0.8        0.65        1          0.95   2017 NJ   
#>  2 34003      0.2       0.3        0.55        0.45       0.25   2017 NJ   
#>  3 34005      0.3       0.5        0.35        0.4        0.3    2017 NJ   
#>  4 34007      0.7       0.9        0.55        0.6        0.75   2017 NJ   
#>  5 34009      0.65      0.6        0.1         0.55       0.45   2017 NJ   
#>  6 42001      0.212     0.242      0.697       0.227      0.182  2018 PA   
#>  7 42003      0.136     0.0758     0.742       0.576      0.212  2018 PA   
#>  8 42005      0.621     0.530      0.0152      0.167      0.227  2018 PA   
#>  9 42007      0.182     0.409      0.530       0.348      0.197  2018 PA   
#> 10 42009      0.712     0.606      0.0758      0.288      0.394  2018 PA

As a result, we have a table summarising the county-level SVI of New Jersey for 2017 and that of Pennsylvania for 2018, after retrieving census data for these two year-state pairs (first 5 rows of SVI results for each pair are shown above). Again, here data retrieval and SVI calculation (percentile ranking) are performed separately for 2017-NJ and 2018-PA, and the resulting SVIs are combined into a summarised table.
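
The pairing behaviour can be pictured as a loop over the year-state pairs whose results are stacked at the end. The sketch below only illustrates that pattern: `svi_for_pair()` is a hypothetical stand-in that returns placeholder values, and nothing here calls findSVI or reflects its actual implementation.

```r
# Hypothetical stand-in for one "retrieve data, then rank" run for a single pair.
svi_for_pair <- function(year, state) {
  data.frame(
    GEOID = paste0(state, "_", 1:2),  # placeholder IDs
    RPL_themes = c(0, 1),             # placeholder SVI values
    year = year,
    state = state
  )
}

years <- c(2017, 2018)
states <- c("NJ", "PA")

# The vectors are matched element by element, so they must be the same length.
stopifnot(length(years) == length(states))

# One independent run per pair (2017-NJ, 2018-PA), then combine the results.
per_pair <- Map(svi_for_pair, years, states)
combined <- do.call(rbind, per_pair)
combined
```

Note that the two vectors are matched by position rather than crossed: two years and two states give two runs, not four.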

As with other R functions that accept vectors in their arguments, another way to supply year and state pairs is to extract columns from a table. Suppose we have a table called info_table containing the year-state information we’d like to include in the analysis:

#>   year state
#> 1 2017    AZ
#> 2 2018    FL
#> 3 2014    FL
#> 4 2018    PA
#> 5 2013    MA
#> 6 2020    KY
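
For reference, a table like the one printed above could be built as a plain data frame. This is one possible construction, assuming base R's data.frame() is sufficient; any table with a year column and a state column would serve the same purpose.

```r
# Year-state pairs, one row per combination to process.
info_table <- data.frame(
  year = c(2017, 2018, 2014, 2018, 2013, 2020),
  state = c("AZ", "FL", "FL", "PA", "MA", "KY")
)
info_table
```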

We could extract specific columns of interest from info_table for the year and state arguments:

all_results <- find_svi(
  year = info_table$year,
  state = info_table$state,
  geography = "county"
)

all_results %>% 
  group_by(year, state) %>% 
  slice_head(n = 3)

#> # A tibble: 18 × 8
#> # Groups:   year, state [6]
#>    GEOID RPL_theme1 RPL_theme2 RPL_theme3 RPL_theme4 RPL_themes  year state
#>    <chr>      <dbl>      <dbl>      <dbl>      <dbl>      <dbl> <dbl> <chr>
#>  1 25001      0.231     0.462      0.0769     0           0      2013 MA   
#>  2 25003      0.769     0.769      0.308      0.692       0.692  2013 MA   
#>  3 25005      0.923     0.923      0.615      0.538       0.846  2013 MA   
#>  4 12001      0.333     0          0.485      0.727       0.242  2014 FL   
#>  5 12003      0.485     0.803      0.0606     0.424       0.454  2014 FL   
#>  6 12005      0.242     0.652      0.197      0.394       0.288  2014 FL   
#>  7 04001      1         0.929      0.857      0.714       1      2017 AZ   
#>  8 04003      0.214     0.714      0.571      0.429       0.357  2017 AZ   
#>  9 04005      0.357     0          0.214      0.857       0.286  2017 AZ   
#> 10 12001      0.439     0          0.606      0.636       0.242  2018 FL   
#> 11 12003      0.485     0.894      0.0758     0.439       0.439  2018 FL   
#> 12 12005      0.318     0.803      0.318      0.5         0.470  2018 FL   
#> 13 42001      0.212     0.242      0.697      0.227       0.182  2018 PA   
#> 14 42003      0.136     0.0758     0.742      0.576       0.212  2018 PA   
#> 15 42005      0.621     0.530      0.0152     0.167       0.227  2018 PA   
#> 16 21001      0.580     0.109      0.538      0.689       0.445  2020 KY   
#> 17 21003      0.664     0.782      0.277      0.353       0.555  2020 KY   
#> 18 21005      0.235     0.622      0.487      0.0084      0.118  2020 KY

Only the first 3 rows of results for each year-state combination are shown here; the full table contains SVIs for all the counties in the 6 year-state pairs taken from the columns of info_table. This approach is especially convenient when there’s a long list of year-state combinations to process.

Custom Boundaries: `find_svi_x()`

To calculate SVI for custom geographic boundaries, we could use find_svi_x() and supply an additional crosswalk (relationship table) between the custom boundaries and a Census geographic level. The census geographic level should be fully nested in the custom geographic boundaries, so that the census data can be aggregated to the custom level for SVI calculation.
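
The aggregation idea can be sketched with toy data: join the census-level counts to the crosswalk, sum within each custom unit, and recompute percentages from the summed counts. The tables and the column names `GEOID` (census unit) and `GEOID2` (custom unit) below are assumptions for illustration; check the find_svi_x() documentation for the crosswalk format it actually expects. The sketch also covers count variables only, since a measure such as per capita income cannot simply be summed.

```r
# Toy county-level counts (values are made up).
county_counts <- data.frame(
  GEOID = c("c1", "c2", "c3"),
  E_TOTPOP = c(100, 200, 300),
  E_POV = c(10, 50, 30)
)

# Hypothetical crosswalk: each county falls entirely inside one custom zone.
toy_xwalk <- data.frame(
  GEOID = c("c1", "c2", "c3"),
  GEOID2 = c("z1", "z1", "z2")
)

# Attach the zone ID to each county, then sum counts within each zone.
merged <- merge(county_counts, toy_xwalk, by = "GEOID")
zone_counts <- aggregate(cbind(E_TOTPOP, E_POV) ~ GEOID2, data = merged, FUN = sum)

# Percentages are recomputed from the aggregated counts, not averaged.
zone_counts$EP_POV <- round(100 * zone_counts$E_POV / zone_counts$E_TOTPOP, 1)
zone_counts
```

This is why the census level needs to nest fully inside the custom boundaries: a county split across two zones could not be assigned to a single row of the crosswalk.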

As an example and a template, the crosswalk of US counties and commuting zones for 2020 is stored in the package and documented in ?cty_cz_2020_xwalk. Using find_svi_x(), we can retrieve the census data at the county level, aggregate the data to the commuting zone level, and calculate the SVI for commuting zones. The output below shows the overall and theme-specific SVIs for commuting zones 1-10 (GEOID represents the commuting zone IDs).

cz_svi2020 <- find_svi_x(
  year = 2020,
  geography = "county",
  xwalk = cty_cz_2020_xwalk #county-commuting zone crosswalk
)

cz_svi2020 %>%
  select(GEOID, contains("RPL")) %>%
  head(10)

#> # A tibble: 10 × 6
#>    GEOID RPL_theme1 RPL_theme2 RPL_theme3 RPL_theme4 RPL_themes
#>    <int>      <dbl>      <dbl>      <dbl>      <dbl>      <dbl>
#>  1     1      0.778      0.833      0.885     0.730       0.826
#>  2     2      0.734      0.436      0.698     0.388       0.625
#>  3     3      0.871      0.892      0.703     0.570       0.833
#>  4     4      0.881      0.498      0.838     0.947       0.876
#>  5     5      0.560      0.675      0.684     0.333       0.606
#>  6     6      0.799      0.813      0.605     0.302       0.720
#>  7     7      0.821      0.680      0.802     0.875       0.842
#>  8     8      0.694      0.888      0.438     0.0842      0.570
#>  9     9      0.899      0.969      0.838     0.918       0.962
#> 10    10      0.357      0.507      0.589     0.134       0.335

Alternatively, we could also use get_census_data() with exp=TRUE and get_svi_x().

For more details on spatial analysis, validation, and custom boundaries, please see the other vignettes.
