censobr is an R package to download data from Brazil’s Population Census. The package is built on top of the Arrow platform, which allows users to work with larger-than-memory census data using familiar {dplyr} functions.
Note: the package is still under development. At the moment, censobr only includes microdata from the 2000 and 2010 censuses, but it is being expanded to cover more years and data sets.
The package currently includes 5 main functions to download Census microdata:
read_population()
read_households()
read_mortality()
read_families()
read_emigration()
The syntax of all censobr functions operates on the same logic, so it becomes intuitive to download any data set using a single line of code, like this:
dfh <- read_households(
  year,          # year of reference
  columns,       # select columns to read
  add_labels,    # add labels to categorical variables
  as_data_frame, # return an Arrow Dataset or a data.frame
  showProgress,  # show download progress bar
  cache          # cache data for faster access later
)
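For instance, a minimal call (a sketch using the 2010 data and the defaults for the remaining arguments) would look like this:

library(censobr)

# download the households data set of the 2010 census
dfh <- read_households(year = 2010, showProgress = FALSE)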
Note: all data sets in censobr are enriched with geography columns that follow the naming standards of the {geobr} package, making it easier to manipulate the data and to integrate it with spatial data from {geobr}. The added columns are: c('code_muni', 'code_state', 'abbrev_state', 'name_state', 'code_region', 'name_region', 'code_weighting').
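These columns make it straightforward to aggregate the microdata by geography. As a sketch (assuming the 2010 population data and the default arguments), we could count records by region:

library(censobr)
library(dplyr)

pop <- read_population(year = 2010, showProgress = FALSE)

# count the number of records in each region using the added geography columns
pop |>
  group_by(name_region) |>
  summarize(records = n()) |>
  collect()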
The first time the user runs a function, censobr will download the file and store it locally. This way, the data only needs to be downloaded once. When the cache parameter is set to TRUE (the default), the function will read the cached data, which is much faster.
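In practice, this means that only the first call to a given data set triggers a download. A sketch with the mortality data, assuming the 2010 census:

library(censobr)

# first call: downloads the file and caches it locally
mort <- read_mortality(year = 2010, showProgress = FALSE)

# subsequent calls: read the locally cached file, which is much faster
mort <- read_mortality(year = 2010, showProgress = FALSE)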
Users can list and/or delete data files cached locally using the censobr_cache() function:
library(censobr)
# list cached files
censobr_cache(list_files = TRUE)
#> Files currently cached:
#> C:\Users\user\AppData\Local/R/cache/R/censobr_v0.1.0/2000_families.parquet
#> C:\Users\user\AppData\Local/R/cache/R/censobr_v0.1.0/2000_households.parquet
#> C:\Users\user\AppData\Local/R/cache/R/censobr_v0.1.0/2000_population.parquet
#> C:\Users\user\AppData\Local/R/cache/R/censobr_v0.1.0/2010_deaths.parquet
#> C:\Users\user\AppData\Local/R/cache/R/censobr_v0.1.0/2010_households.parquet
#> C:\Users\user\AppData\Local/R/cache/R/censobr_v0.1.0/2010_population.parquet
# delete a particular file
censobr_cache(delete_file = "2010_emigration")
#> The file '2010_emigration' is not cached.
#> Files currently cached:
#> C:\Users\user\AppData\Local/R/cache/R/censobr_v0.1.0/2000_families.parquet
#> C:\Users\user\AppData\Local/R/cache/R/censobr_v0.1.0/2000_households.parquet
#> C:\Users\user\AppData\Local/R/cache/R/censobr_v0.1.0/2000_population.parquet
#> C:\Users\user\AppData\Local/R/cache/R/censobr_v0.1.0/2010_deaths.parquet
#> C:\Users\user\AppData\Local/R/cache/R/censobr_v0.1.0/2010_households.parquet
#> C:\Users\user\AppData\Local/R/cache/R/censobr_v0.1.0/2010_population.parquet
Microdata from Brazilian censuses are often too big to fit in users’ RAM. To avoid this problem, censobr returns an Arrow table by default, which can be analyzed like a regular data.frame using the dplyr package without loading the full data into memory.
Let’s see how it works in a couple of examples:
First, let’s load the libraries we’ll be using in this vignette.
library(censobr)
library(arrow)
#>
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#>
#> timestamp
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(geobr)
#> Loading required namespace: sf
library(ggplot2)
In this example, we’ll calculate the proportion of people with higher education in different racial groups in the state of Rio de Janeiro. First, we need to use the read_population() function to download the population data set. Since we don’t need to load all columns into memory, we can pass a vector with the columns we’re going to use, which can be necessary in more constrained computing environments. Note that by setting add_labels = 'pt', the function returns labeled values for categorical variables.
pop <- read_population(year = 2010,
                       columns = c('abbrev_state', 'V0606', 'V0010', 'V6400'),
                       add_labels = 'pt',
                       showProgress = FALSE)
#> Reading data cached locally.
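Note that pop is not a regular data.frame loaded in memory; it is an Arrow object pointing to the data on disk. We can check this (the exact class printed may vary across arrow versions):

class(pop)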
Next, we use dplyr syntax to (a) filter observations for the state of Rio de Janeiro, (b) group observations by racial group, and (c) summarize the data, calculating the proportion of individuals with higher education.
df <- pop |>
  filter(abbrev_state == "RJ") |>   # (a)
  # bring the filtered data into memory, so that the base-R
  # subsetting inside summarize() below can be computed
  collect() |>
  group_by(V0606) |>                # (b)
  summarize(higher_edu = sum(V0010[which(V6400 == "Superior completo")]) / sum(V0010), # (c)
            pop = sum(V0010))
head(df)
#> # A tibble: 6 × 3
#> V0606 higher_edu pop
#> <chr> <dbl> <dbl>
#> 1 Amarela 0.0782 122552.
#> 2 Branca 0.151 7579023.
#> 3 Ignorado 0 3397.
#> 4 Indígena 0.109 15258.
#> 5 Parda 0.0443 6332408.
#> 6 Preta 0.0405 1937291.
Now we only need to plot the results.
df <- subset(df, V0606 != 'Ignorado')

ggplot() +
  geom_col(data = df, aes(x = V0606, y = higher_edu), fill = '#5c997e') +
  scale_y_continuous(name = 'Proportion with higher education',
                     labels = scales::percent) +
  labs(x = 'Cor/raça') +
  theme_classic()
In this example, we are going to map how much people spend on rent across different states in Brazil. First, we can easily download the households data set with the read_households() function.
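A minimal call, assuming the 2010 data as in the previous example:

hs <- read_households(year = 2010,
                      showProgress = FALSE)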
Now we’re going to (a) group observations by state, (b) calculate the average rent, and (c) collect the results.
rent <- hs |>
  collect() |>  # pull the data into memory so weighted.mean() can be computed
  group_by(code_state) |>                                                   # (a)
  summarize(avgrent = weighted.mean(x = V2011, w = V0010, na.rm = TRUE)) |> # (b)
  collect()                                                                 # (c)
head(rent)
#> # A tibble: 6 × 2
#> code_state avgrent
#> <chr> <dbl>
#> 1 11 340.
#> 2 12 295.
#> 3 13 346.
#> 4 14 298.
#> 5 15 297.
#> 6 16 322.
In order to create a map of these values, we are going to use the geobr package to download the geometries of the Brazilian states.
uf <- geobr::read_state(year = 2010,
                        showProgress = FALSE)
#> Using year 2010
head(uf)
#> Simple feature collection with 6 features and 5 fields
#> Geometry type: MULTIPOLYGON
#> Dimension: XY
#> Bounding box: xmin: -73.99045 ymin: -13.6937 xmax: -46.06095 ymax: 5.271841
#> Geodetic CRS: SIRGAS 2000
#> code_state abbrev_state name_state code_region name_region
#> 1 11 RO Rondônia 1 Norte
#> 2 12 AC Acre 1 Norte
#> 3 13 AM Amazonas 1 Norte
#> 4 14 RR Roraima 1 Norte
#> 5 15 PA Pará 1 Norte
#> 6 16 AP Amapá 1 Norte
#> geom
#> 1 MULTIPOLYGON (((-63.32721 -...
#> 2 MULTIPOLYGON (((-73.18253 -...
#> 3 MULTIPOLYGON (((-67.32609 2...
#> 4 MULTIPOLYGON (((-60.20051 5...
#> 5 MULTIPOLYGON (((-54.95431 2...
#> 6 MULTIPOLYGON (((-51.1797 4....
Now we only need to merge the spatial data with our rent estimates and map the results.
uf$code_state <- as.character(uf$code_state)

rent_sf <- left_join(uf, rent, by = 'code_state')

ggplot() +
  geom_sf(data = rent_sf, aes(fill = avgrent), color = NA) +
  scale_fill_distiller(palette = "Greens", direction = 1,
                       name = 'Average\nRent in R$') +
  theme_void()