Introduction to censobr

2023-09-05

censobr is an R package to download data from Brazil’s Population Census. The package is built on top of the Arrow platform, which allows users to work with larger-than-memory census data using familiar {dplyr} functions.

Note: The package is still under development. At the moment, censobr only includes microdata from the 2000 and 2010 censuses, but it is being expanded to cover more years and data sets.

Installation

# install from CRAN
install.packages("censobr")

# or use the development version with latest features
utils::remove.packages('censobr')
devtools::install_github("ipeaGIT/censobr")
library(censobr)

Basic usage

The package currently includes 5 main functions to download Census microdata:

  1. read_population()
  2. read_households()
  3. read_mortality()
  4. read_families()
  5. read_emigration()

All censobr functions follow the same syntax, so any data set can be downloaded with a single line of code, like this:

dfh <- read_households(
          year,          # year of reference
          columns,       # select columns to read
          add_labels,    # add labels to categorical variables
          as_data_frame, # return an Arrow Dataset or a data.frame
          showProgress,  # show download progress bar
          cache          # cache data for faster access later
         )

Note: all data sets in censobr are enriched with geography columns that follow the naming standards of the {geobr} package, which makes it easier to manipulate the data and to integrate it with spatial data from {geobr}. The added columns are: c('code_muni', 'code_state', 'abbrev_state', 'name_state', 'code_region', 'name_region', 'code_weighting').

Data cache

The first time the user runs a function, censobr downloads the file and stores it locally, so the data only needs to be downloaded once. When the cache parameter is set to TRUE (the default), later calls read the cached data, which is much faster.
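The download-once behaviour described above can be illustrated with a small, self-contained sketch. This is not censobr's actual implementation: `read_cached()`, `fake_download()` and the temporary cache directory are hypothetical names used only for illustration, and the "download" is simulated by writing a tiny CSV file.

```r
# Hypothetical sketch of a download-once cache; NOT censobr's internal code.
cache_dir <- file.path(tempdir(), "censobr_demo_cache")
dir.create(cache_dir, showWarnings = FALSE, recursive = TRUE)

# Stand-in for a slow download: writes a small made-up CSV to `dest`.
fake_download <- function(dest) {
  write.csv(data.frame(code_state = c("11", "12"), V0010 = c(1.5, 2.0)),
            dest, row.names = FALSE)
}

read_cached <- function(file_name, cache = TRUE) {
  local_file <- file.path(cache_dir, file_name)
  # Download when the file is missing or when the cache is bypassed.
  if (!cache || !file.exists(local_file)) {
    message("Downloading data.")
    fake_download(local_file)
  } else {
    message("Reading data cached locally.")
  }
  read.csv(local_file, colClasses = c(code_state = "character"))
}

first  <- read_cached("demo_households.csv")  # file is created if not yet cached
second <- read_cached("demo_households.csv")  # reads the local copy
identical(first, second)
```

In this sketch, the second call skips the simulated download and returns the same data as the first.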

Users can list and/or delete data files cached locally using the censobr_cache() function:

library(censobr)

# list cached files
censobr_cache(list_files = TRUE)
#> Files currently chached:
#> C:\Users\user\AppData\Local/R/cache/R/censobr_v0.1.0/2000_families.parquet
#> C:\Users\user\AppData\Local/R/cache/R/censobr_v0.1.0/2000_households.parquet
#> C:\Users\user\AppData\Local/R/cache/R/censobr_v0.1.0/2000_population.parquet
#> C:\Users\user\AppData\Local/R/cache/R/censobr_v0.1.0/2010_deaths.parquet
#> C:\Users\user\AppData\Local/R/cache/R/censobr_v0.1.0/2010_households.parquet
#> C:\Users\user\AppData\Local/R/cache/R/censobr_v0.1.0/2010_population.parquet
# delete particular file
censobr_cache(delete_file = "2010_emigration")
#> The file '2010_emigration' is not cached.
#> Files currently chached:
#> C:\Users\user\AppData\Local/R/cache/R/censobr_v0.1.0/2000_families.parquet
#> C:\Users\user\AppData\Local/R/cache/R/censobr_v0.1.0/2000_households.parquet
#> C:\Users\user\AppData\Local/R/cache/R/censobr_v0.1.0/2000_population.parquet
#> C:\Users\user\AppData\Local/R/cache/R/censobr_v0.1.0/2010_deaths.parquet
#> C:\Users\user\AppData\Local/R/cache/R/censobr_v0.1.0/2010_households.parquet
#> C:\Users\user\AppData\Local/R/cache/R/censobr_v0.1.0/2010_population.parquet

Larger-than-memory Data

Microdata from Brazilian censuses are often too big to load into users’ RAM. To avoid this problem, censobr returns an Arrow table by default, which can be analyzed like a regular data.frame using the dplyr package without loading the full data into memory.
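As a minimal sketch of this lazy behaviour, the toy example below builds a small Arrow table by hand (the values are made up and only the column names mirror the census data), defines a dplyr query on it, and brings the result into R only when collect() is called.

```r
library(arrow)
library(dplyr)

# Toy stand-in for census microdata; values are invented for illustration.
toy <- arrow::arrow_table(
  data.frame(
    abbrev_state = c("RJ", "RJ", "SP", "SP"),
    V0010        = c(10, 30, 20, 40)   # person weights
  )
)

# Building the query does not compute anything yet.
query <- toy |>
  filter(abbrev_state == "RJ") |>
  summarize(pop = sum(V0010))

# collect() runs the query and returns a regular tibble.
result <- collect(query)
result$pop
#> [1] 40
```

Only the single summary row reaches R's memory, which is the reason the same pattern scales to the full census files.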

Let’s see how it works in a couple of examples:

Reproducible examples

First, let’s load the libraries we’ll be using in this vignette.

library(censobr)
library(arrow)
#> 
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#> 
#>     timestamp
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(geobr)
#> Loading required namespace: sf
library(ggplot2)

Using Population data:

In this example we’ll be calculating the proportion of people with higher education in different racial groups in the state of Rio de Janeiro. First, we need to use the read_population() function to download the population data set.

Since we don’t need to load all columns of the data into memory, we can pass a vector with the columns we’re going to use. This might be necessary in more constrained computing environments. Note that by setting add_labels = 'pt', the function returns labeled values for categorical variables.

pop <- read_population(year = 2010,
                       columns = c('abbrev_state', 'V0606', 'V0010', 'V6400'),
                       add_labels = 'pt',
                       showProgress = FALSE)
#> Reading data cached locally.

Next, we use the dplyr syntax to (a) filter observations for the state of Rio de Janeiro, (b) group observations by racial group, and (c) summarize the data by calculating the proportion of individuals with higher education. Note that collect() is called right after the filter, so only the rows for Rio de Janeiro are loaded into memory before the summary is computed in R.

df <- pop |>
      filter(abbrev_state == "RJ") |>                                                    # (a)
      collect() |>                                                                       # load only the filtered rows
      group_by(V0606) |>                                                                 # (b)
      summarize(higher_edu = sum(V0010[which(V6400=="Superior completo")]) / sum(V0010), # (c)
                pop = sum(V0010) )

head(df)
#> # A tibble: 6 × 3
#>   V0606    higher_edu      pop
#>   <chr>         <dbl>    <dbl>
#> 1 Amarela      0.0782  122552.
#> 2 Branca       0.151  7579023.
#> 3 Ignorado     0         3397.
#> 4 Indígena     0.109    15258.
#> 5 Parda        0.0443 6332408.
#> 6 Preta        0.0405 1937291.
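To make the weighted proportion in step (c) concrete, here is a small sketch on made-up data. The column names follow the example above, but the rows and the "Other" label are invented for illustration. The numerator sums the weights of people whose education level is "Superior completo", the denominator sums all weights in the group, and which() drops rows where V6400 is missing instead of propagating NA.

```r
library(dplyr)

# Made-up rows; only the column names follow the census example.
toy <- tibble(
  V0606 = c("Branca", "Branca", "Parda", "Parda"),
  V6400 = c("Superior completo", "Other", "Superior completo", NA),
  V0010 = c(10, 30, 5, 20)   # person weights
)

toy_share <- toy |>
  group_by(V0606) |>
  summarize(higher_edu = sum(V0010[which(V6400 == "Superior completo")]) / sum(V0010),
            pop = sum(V0010))

toy_share$higher_edu
#> [1] 0.25 0.20
```

For "Branca" the share is 10 / 40, and for "Parda" it is 5 / 25, with the missing education value still counted in the denominator.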

Now we only need to plot the results.

df <- subset(df, V0606 != 'Ignorado')

ggplot() +
  geom_col(data = df, aes(x=V0606, y=higher_edu), fill = '#5c997e') +
  scale_y_continuous(name = 'Proportion with higher education',
                     labels = scales::percent) +
  labs(x = 'Cor/raça') +
  theme_classic()

Using household data:

In this example, we are going to map how much people spend on rent across different states in Brazil. First, we can easily download the households data set with the read_households() function.

hs <- read_households(year = 2010, 
                      showProgress = FALSE)
#> Reading data cached locally.

Now we’re going to (a) keep only the columns we need and load them into memory, (b) group observations by state, and (c) calculate the average rent (V2011), weighted by V0010.

rent <- hs |> 
        select(code_state, V2011, V0010) |>                                # (a)
        collect() |>
        group_by(code_state) |>                                            # (b)
        summarize(avgrent=weighted.mean(x=V2011, w=V0010, na.rm=TRUE))     # (c)

head(rent)
#> # A tibble: 6 × 2
#>   code_state avgrent
#>   <chr>        <dbl>
#> 1 11            340.
#> 2 12            295.
#> 3 13            346.
#> 4 14            298.
#> 5 15            297.
#> 6 16            322.
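The household example brings the rows into R before calling weighted.mean(). As an alternative sketch, the weighted mean can be rewritten as a ratio of two sums, which lets Arrow do the aggregation so that only one row per state is collected. The toy data below is made up; the assumption to keep in mind is that rows with a missing rent are dropped first, which mirrors what na.rm = TRUE does in weighted.mean().

```r
library(arrow)
library(dplyr)

# Made-up household rows; column names follow the example above.
toy <- data.frame(
  code_state = c("11", "11", "12", "12", "12"),
  V2011      = c(100, 300, 200, NA, 400),   # rent
  V0010      = c(1, 3, 2, 5, 2)             # household weights
)

# Aggregation done by Arrow; only the per-state sums are collected.
rent_arrow <- arrow::arrow_table(toy) |>
  filter(!is.na(V2011)) |>
  mutate(weighted_rent = V2011 * V0010) |>
  group_by(code_state) |>
  summarize(total_rent = sum(weighted_rent), total_weight = sum(V0010)) |>
  collect() |>
  mutate(avgrent = total_rent / total_weight) |>
  arrange(code_state)

# Reference result computed entirely in R.
rent_r <- toy |>
  group_by(code_state) |>
  summarize(avgrent = weighted.mean(x = V2011, w = V0010, na.rm = TRUE))

all.equal(rent_arrow$avgrent, rent_r$avgrent)
#> [1] TRUE
```

Both approaches give 250 for state 11 and 300 for state 12 on this toy input. Whether the Arrow version is worth the extra lines depends on how much memory is available.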

In order to create a map of these values, we are going to use the geobr package to download the geometries of Brazilian states.

uf <- geobr::read_state(year = 2010, 
                        showProgress = FALSE)
#> Using year 2010
head(uf)
#> Simple feature collection with 6 features and 5 fields
#> Geometry type: MULTIPOLYGON
#> Dimension:     XY
#> Bounding box:  xmin: -73.99045 ymin: -13.6937 xmax: -46.06095 ymax: 5.271841
#> Geodetic CRS:  SIRGAS 2000
#>   code_state abbrev_state name_state code_region name_region
#> 1         11           RO   Rondônia           1       Norte
#> 2         12           AC       Acre           1       Norte
#> 3         13           AM   Amazonas           1       Norte
#> 4         14           RR    Roraima           1       Norte
#> 5         15           PA       Pará           1       Norte
#> 6         16           AP      Amapá           1       Norte
#>                             geom
#> 1 MULTIPOLYGON (((-63.32721 -...
#> 2 MULTIPOLYGON (((-73.18253 -...
#> 3 MULTIPOLYGON (((-67.32609 2...
#> 4 MULTIPOLYGON (((-60.20051 5...
#> 5 MULTIPOLYGON (((-54.95431 2...
#> 6 MULTIPOLYGON (((-51.1797 4....

Now we only need to merge the spatial data with our rent estimates and map the results.

uf$code_state <- as.character(uf$code_state)
rent_sf <- left_join(uf, rent, by = 'code_state')
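The type conversion matters because the join key must have the same type on both sides. The sketch below reproduces the pattern on two small plain data frames (made-up stand-ins for the spatial data and the rent estimates) and uses anti_join() to list any states that did not find a match, which is a quick sanity check before mapping.

```r
library(dplyr)

# Made-up stand-ins: numeric state codes on one side, character on the other.
geo <- data.frame(code_state = c(11, 12, 13),
                  abbrev_state = c("RO", "AC", "AM"))
est <- data.frame(code_state = c("11", "12"),
                  avgrent = c(340, 295))

# Align the key types before joining.
geo$code_state <- as.character(geo$code_state)

joined    <- left_join(geo, est, by = "code_state")
unmatched <- anti_join(geo, est, by = "code_state")

joined$avgrent
#> [1] 340 295  NA
unmatched$abbrev_state
#> [1] "AM"
```

States without an estimate keep their row in the joined table with a missing value.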

ggplot() +
  geom_sf(data = rent_sf, aes(fill = avgrent), color=NA) +
  scale_fill_distiller(palette = "Greens", direction = 1, 
                       name='Average\nRent in R$') +
  theme_void()