This package is designed for identifying disease cases from administrative data for epidemiological studies. The implementation focuses on code readability and re-usability. Three types of functions are included:
Interactive functions (e.g., identify_row(), exclude(), fetch_var()) based on filter and joins from dplyr, with tweaks that fix SQL translation or add features that are not natively supported by SQL. They also work for local data.frames, and some use the ‘data.table’ package (vignette("datatable-intro", package = "data.table")) to speed up processing of large data. These functions are not as flexible as dplyr::filter(), but they are general enough to be useful even outside health research.
Call-building functions (e.g., build_def(), execute_def()) that facilitate batch execution and re-use of case definitions. In essence, build_def() creates the code of definitions (each a chain of the interactive functions, e.g., define_case()) without running it immediately. execute_def() then runs the built definitions with different input data.
Miscellaneous functions for tasks such as computing age and collapsing records within a time range into one episode, with more being added (an ongoing effort). They include built-in checks that signal when something could go wrong.
In health research and surveillance, identifying diseases or events from administrative databases is often the initial step. However, crafting case-finding algorithms is a complex task. Existing algorithms, often written in SAS by experienced analysts, can be complex and difficult to decipher for the growing number of analysts trained primarily in R.
These algorithms may also affect performance if they depend on Data Step in SAS, due to a lack of translation between Data Step and SQL. This can result in SAS downloading data from a remote database to a local machine, leading to poor performance when handling large, population-based databases.
The ‘healthdb’ R package was created to address these challenges. It minimizes the need to download data and offers an easy-to-use interface for working with healthcare databases. It also includes capabilities not supported by ‘SQL’, such as matching strings by ‘stringr’ style regular expressions, and can compute comorbidity scores directly on a database server. This vignette will present an example of common use cases.
Simply run:
We will need the following packages for this demo.
library(dplyr)
#> Warning: package 'dplyr' was built under R version 4.3.2
library(dbplyr)
#> Warning: package 'dbplyr' was built under R version 4.3.3
library(lubridate)
#> Warning: package 'lubridate' was built under R version 4.3.2
library(glue)
#> Warning: package 'glue' was built under R version 4.3.3
library(purrr)
#> Warning: package 'purrr' was built under R version 4.3.2
library(healthdb)
Consider the case definition of substance use disorder (SUD) from the British Columbia Centre for Disease Control’s Chronic Disease Dashboard:
One or more hospitalization with a substance use disorder diagnostic code, OR Two or more physician visits with a substance use disorder diagnostic code within one year.
We are going to implement this definition. First, let’s make demo data sets for the two sources:
Physician claims with multiple columns of ICD-9 diagnostic codes
# make_test_dat() makes either a toy data.frame or database table in memory with known number of rows that satisfy the query we will show later
claim_db <- make_test_dat(vals_kept = c("303", "304", "305", "291", "292", glue("30{30:59}"), glue("29{10:29}")), noise_val = c("999", "111"), type = "database")
# this is a database table
# note that in-memory SQLite database stores dates as numbers
claim_db %>% head()
#> # Source: SQL [6 x 6]
#> # Database: sqlite 3.45.2 [:memory:]
#> uid clnt_id dates diagx diagx_1 diagx_2
#> <int> <int> <dbl> <chr> <chr> <chr>
#> 1 23 1 16844 3036 2922 <NA>
#> 2 7 1 17867 3047 999 999
#> 3 79 1 18440 999 999 999
#> 4 81 2 16786 999 <NA> 999
#> 5 18 2 18200 3059 3055 999
#> 6 44 4 16677 2919 2925 304
Hospitalization with ICD-10 codes
hosp_df <- make_test_dat(vals_kept = c(glue("F{10:19}"), glue("F{100:199}")), noise_val = "999", type = "data.frame")
# this is a local data.frame/tibble
hosp_df %>% head()
#> uid clnt_id dates diagx diagx_1 diagx_2
#> 1 87 1 2017-10-06 999 <NA> <NA>
#> 2 7 2 2016-02-24 F105 F105 999
#> 3 21 2 2017-05-12 F178 F154 <NA>
#> 4 39 3 2015-07-17 F188 F167 <NA>
#> 5 68 3 2016-11-27 999 999 <NA>
#> 6 33 4 2015-07-15 F124 F123 <NA>
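As noted for the claims table, the in-memory SQLite database stores dates as numbers. A minimal sketch of reading them back, assuming the numbers are day counts since 1970-01-01 (the same origin this vignette uses later when deriving the year):

```r
# Day counts copied from the `dates` column of the claims table printed above
day_counts <- c(16844, 17867, 18440)

# Assumption: the counts are days since the Unix epoch, 1970-01-01
as_dates <- as.Date(day_counts, origin = "1970-01-01")
format(as_dates)
```

The conversion is only needed after collecting the data; comparisons such as "within 365 days" work on the numeric form directly.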
Let’s focus on the physician claims. Extracting clients with at least two records within a year is not difficult and involves only a few steps. With dplyr, the code could look like the following; however, it does not work because (1) SQL does not support multiple patterns in one LIKE operation, and (2) dbplyr currently has an issue translating n_distinct().
## not run
claim_db %>%
# identify the target codes
filter(if_any(starts_with("diagx_"), ~ str_like(., c("291%", "292%", "303%", "304%", "305%")))) %>%
# each clnt has at least 2 records on different dates
group_by(clnt_id) %>%
# the n_distinct step is mainly for reducing computation in the next step
filter(n_distinct(dates) >= 2) %>%
# any two dates within one year?
filter((max(dates) - min(dates)) <= 365)
## end
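To make the first limitation concrete: a single LIKE takes one pattern, so several prefixes have to become several LIKE clauses joined by OR. The following is a rough base R sketch of that idea only; the column name is taken from the demo table, and the string is an illustration rather than the query that dbplyr or healthdb actually generates.

```r
# Illustration only: one LIKE clause per ICD-9 prefix, combined with OR
prefixes <- c("291", "292", "303", "304", "305")

# paste0() is vectorised, so this yields one clause per prefix
like_clauses <- paste0("diagx_1 LIKE '", prefixes, "%'")

# Collapse the clauses into a single WHERE condition
where_sql <- paste(like_clauses, collapse = " OR ")
where_sql
```

Writing this by hand for every column and every definition is error-prone, which is the gap the interactive functions are meant to cover.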
Here’s how you could use healthdb to achieve these steps:
Identify rows that contain the target codes
result1 <- claim_db %>%
identify_row(
vars = starts_with("diagx_"),
match = "start",
vals = c(291:292, 303:305)
)
#> ℹ Identify records with condition(s):
#> • where at least one of the diagx_1, diagx_2 column(s) in each record
#> • contains a value satisfied SQL LIKE pattern: 291% OR 292% OR 303% OR 304% OR 305%
#> ℹ To see the final query generated by 'dbplyr', use dplyr::show_query() on the output.
#> To extract the SQL string, use dbplyr::remote_query().
Bonus: remove clients with exclusion codes
This step is not in the substance use disorder definition, but other disease definitions often require excluding ICD codes that contradict the ones of interest. Let’s say we want to remove clients with code “111” here.
We first identify “111” from the source, then exclude clients in the output from the previous step’s result. exclude() takes either a data set (via the excl argument) or an expression (condition argument) as input. For the former, it performs an anti join matching on the by argument (see dplyr::join_by()). For the latter, it is the opposite of filter, i.e., filter(!(some_expression)).
result2 <- result1 %>%
exclude(
excl = identify_row(claim_db, starts_with("diagx_"), "in", "111"),
by = "clnt_id"
)
#> ℹ Identify records with condition(s):
#> • where at least one of the diagx_1, diagx_2 column(s) in each record
#> • contains a value exactly matched values in set: "111"
#> ℹ Exclude records in `data` through anti_join with `excl` matching on (by argument): "clnt_id"
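The difference between the two modes of exclusion matters: an anti join on clnt_id removes every record of a matching client, whereas a negated filter removes only the offending rows. A small sketch with plain dplyr on a made-up local data frame (the toy data and object names are hypothetical, not healthdb output):

```r
library(dplyr)

# Toy records: client 1 has one target code and one exclusion code
records <- data.frame(
  clnt_id = c(1, 1, 2, 3),
  diagx_1 = c("3036", "111", "3047", "2919")
)

# Clients that carry the exclusion code
to_drop <- records %>%
  filter(diagx_1 == "111") %>%
  distinct(clnt_id)

# Anti join: drops all records of client 1
kept_by_join <- records %>% anti_join(to_drop, by = "clnt_id")

# Negated filter: drops only the row holding "111"
kept_by_filter <- records %>% filter(!(diagx_1 == "111"))

nrow(kept_by_join)
nrow(kept_by_filter)
```

Removing a client entirely, as the example above intends, therefore calls for the data set form.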
Restrict the number of records per client
result3 <- result2 %>% restrict_n(
clnt_id = clnt_id,
n_per_clnt = 2,
count_by = dates,
# here we use filter mode to remove records that failed the restriction
mode = "filter"
)
#> ℹ Apply restriction that each client must have at least 2 records with distinct
#> dates. Clients/groups which did not met the condition were excluded.
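For intuition, the restriction described in the message above corresponds roughly to the following dplyr code on a local data frame. This is a sketch on made-up data under that reading of the message, not the implementation used by restrict_n().

```r
library(dplyr)

# Toy visits: client 1 has two distinct dates, clients 2 and 3 have one each
visits <- data.frame(
  clnt_id = c(1, 1, 1, 2, 2, 3),
  dates = as.Date(c(
    "2020-01-01", "2020-01-01", "2020-03-01",
    "2020-02-01", "2020-02-01", "2020-05-01"
  ))
)

# Keep clients with at least 2 distinct dates; duplicate dates count once
kept <- visits %>%
  group_by(clnt_id) %>%
  filter(n_distinct(dates) >= 2) %>%
  ungroup()

unique(kept$clnt_id)
```

Client 2 has two records but only one distinct date, so it does not pass the restriction.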
Restrict the temporal pattern of diagnoses
restrict_date() also supports more complicated patterns, such as having n diagnoses at least i days apart within j years, but the “apart” feature requires relatively expensive computation and is implemented for local data.frames only. Note that when SQL interprets the order of dates, the result may not be deterministic if there are duplicate dates within a client. Therefore, a unique row id column (uid) has to be supplied to get consistent results.
result4 <- result3 %>% restrict_date(
clnt_id = clnt_id,
date_var = dates,
n = 2,
within = 365,
uid = uid,
# here we use flag mode to flag records that met the restriction instead of removing those
mode = "flag"
)
#> ℹ Apply restriction that each client must have 2 records that were within 365
#> days. Records that met the condition were flagged.
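The "n records within a window" condition is stricter to state than a simple max-minus-min comparison, which asks whether all of a client's dates fit in the window rather than whether any n of them do. The helper below is a hypothetical base R sketch of the "any n" reading; it is not part of healthdb, and it assumes that duplicate dates count as separate records and that a gap equal to the window still qualifies.

```r
# Hypothetical helper: TRUE if any `n` records of one client fall within
# a window of `within` days
has_n_within <- function(dates, n, within) {
  d <- sort(as.numeric(dates)) # Date -> days since 1970-01-01
  if (length(d) < n) {
    return(FALSE)
  }
  # Gap between each record and the record (n - 1) positions later
  start_idx <- seq_len(length(d) - n + 1)
  gaps <- d[start_idx + n - 1] - d[start_idx]
  any(gaps <= within)
}

# Three visits: the first is years before the other two
visits <- as.Date(c("2015-01-01", "2018-01-01", "2018-06-01"))

# All dates within one year? No, the span is over three years
span_ok <- as.numeric(max(visits) - min(visits)) <= 365

# Any two dates within one year? Yes, the last two are 151 days apart
pair_ok <- has_n_within(visits, n = 2, within = 365)
```

Here span_ok is FALSE while pair_ok is TRUE, which is why the case definition needs the windowed check.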
Fetch variables from other tables by matching common keys
Up to this point, the result is only a query and has not been downloaded. Hopefully, it has been shrunk to a manageable size for collection.
# Class of result4
class(result4)
#> [1] "tbl_SQLiteConnection" "tbl_dbi" "tbl_sql"
#> [4] "tbl_lazy" "tbl"
# execute query and download the result
result_df <- result4 %>% collect()
# Number of rows in source
nrow(claim_db %>% collect())
#> [1] 100
# Number of rows in the current result
nrow(result_df)
#> [1] 25
Our data now contains only diagnoses, which are probably not enough for further analyses. Let’s say we want to gather client demographics such as age and sex from other sources. This can certainly be done with multiple dplyr::left_join() calls. Here we provide the fetch_var() function to make the code more concise.
# make two look up tables
age_tab <- data.frame(
clnt_id = 1:50,
age = sample(1:90, 50),
sex = sample(c("F", "M"), 50, replace = TRUE)
)
address_tab <- data.frame(
clnt_id = rep(1:50, 5), year = rep(2016:2020, each = 50),
area_code = sample(0:200, 50, replace = TRUE)
)
# get year from dates for matching
result_df <- result_df %>% mutate(year = lubridate::year(as.Date(dates, origin = "1970-01-01")))
# note that keys must be present in all tables
result_df %>%
fetch_var(
keys = c(clnt_id, year),
linkage = list(
# |clnt_id means matching on clnt_id only
age_tab ~ c(age, sex) | clnt_id,
address_tab ~ area_code
)
) %>%
head()
#> # A tibble: 6 × 12
#> uid clnt_id dates diagx diagx_1 diagx_2 flag_restrict_n flag_restrict_date
#> <int> <int> <dbl> <chr> <chr> <chr> <int> <int>
#> 1 16 6 16766 3036 3036 <NA> 1 1
#> 2 4 6 17010 291 3039 <NA> 1 0
#> 3 26 9 17051 3033 3038 <NA> 1 0
#> 4 1 9 17462 2921 3030 999 1 0
#> 5 21 11 16994 2925 2928 <NA> 1 0
#> 6 39 11 18301 3053 3050 <NA> 1 0
#> # ℹ 4 more variables: year <dbl>, age <int>, sex <chr>, area_code <int>
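For comparison, the same linkage can be written as chained dplyr::left_join() calls. The sketch below uses small made-up tables and assumes the behaviour described in the comments above: age_tab is matched on clnt_id only, and address_tab on both keys.

```r
library(dplyr)

# Toy stand-ins for the collected result and the two look-up tables
res <- data.frame(clnt_id = c(1, 2), year = c(2016, 2017))
age_tab <- data.frame(clnt_id = c(1, 2), age = c(30, 40), sex = c("F", "M"))
address_tab <- data.frame(
  clnt_id = c(1, 1, 2, 2),
  year = c(2016, 2017, 2016, 2017),
  area_code = c(10, 11, 20, 21)
)

joined <- res %>%
  # age and sex do not vary by year, so match on clnt_id only
  left_join(select(age_tab, clnt_id, age, sex), by = "clnt_id") %>%
  # area_code varies by year, so match on both keys
  left_join(select(address_tab, clnt_id, year, area_code), by = c("clnt_id", "year"))

joined
```

Each additional source adds another select() and left_join() pair, which is the repetition the linkage formulas are meant to remove.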
To complete the definition, we need to repeat the process shown above with hospitalization data. Some studies may use more than a handful of data sources to define their sample. To meet those needs, we packed steps 1-4 into one function, define_case(), and provided tools for batch execution with different data and parameters.
# build the full definition of SUD
sud_def <- build_def(
# name of definition
def_lab = "SUD",
# place holder names for sources
src_labs = c("claim", "hosp"),
def_fn = define_case, # you could alter it and supply your own function
# below are arguments of define_case
fn_args = list(
# if length = 1, the single element will be used for every source
vars = list(starts_with("diagx_")),
match = "start", # match ICD starts with vals
vals = list(c(291:292, 303:305), glue("F{10:19}")),
clnt_id = clnt_id,
n_per_clnt = c(2, 1),
date_var = dates,
within = c(365, NULL),
uid = uid,
mode = "flag"
)
)
sud_def
#> # A tibble: 2 × 5
#> def_lab src_labs def_fn fn_args fn_call
#> <chr> <chr> <chr> <list> <list>
#> 1 SUD claim define_case <named list [9]> <language>
#> 2 SUD hosp define_case <named list [9]> <language>
Let’s look inside the fn_call list column. Two calls of define_case() have been made with different parameters. The data arguments are left empty on purpose for re-usability. For example, you may want to repeat the analysis with data from different regions or study periods.
sud_def$fn_call
#> [[1]]
#> define_case(data = , vars = starts_with("diagx_"), match = "start",
#> vals = c(291:292, 303:305), clnt_id = clnt_id, n_per_clnt = 2,
#> date_var = dates, within = 365, uid = uid, mode = "flag")
#>
#> [[2]]
#> define_case(data = , vars = starts_with("diagx_"), match = "start",
#> vals = glue("F{10:19}"), clnt_id = clnt_id, n_per_clnt = 1,
#> date_var = dates, within = NULL, uid = uid, mode = "flag")
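The idea of storing a call and supplying the data later can be illustrated with base R alone. This is a simplified sketch: it uses a named placeholder symbol instead of the empty data argument shown above, and the toy data frames are hypothetical.

```r
# An unevaluated call; `data` is a placeholder that does not exist yet
count_call <- quote(sum(data$value > 1))

# Two hypothetical inputs, e.g. from different regions
region_a <- data.frame(value = c(1, 2, 3))
region_b <- data.frame(value = c(0, 5))

# Evaluate the same stored call with different data bound to `data`
n_a <- eval(count_call, list(data = region_a))
n_b <- eval(count_call, list(data = region_b))

c(n_a, n_b)
```

Because the call is stored rather than run, the definition can be inspected, saved, and re-applied without rewriting it.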
Executing the definition is simple. Unless the verbose option is turned off with options(healthdb.verbose = FALSE), the output message will explain what has been done. You could append multiple build_def() outputs together and execute them all at once. Definition and source labels will be added to the result to identify outputs from different calls.
# execute the definition
result_list <- sud_def %>%
execute_def(with_data = list(
claim = claim_db,
hosp = hosp_df
))
#>
#> Processing source: claim_db
#> → --------------Inclusion step--------------
#> ℹ Identify records with condition(s):
#> • where at least one of the diagx_1, diagx_2 column(s) in each record
#> • contains a value satisfied SQL LIKE pattern: 291% OR 292% OR 303% OR 304% OR 305%
#> → --------------No. rows restriction--------------
#>
#> ℹ Apply restriction that each client must have at least 2 records with distinct dates. Records that met the condition were flagged.
#> → --------------Time span restriction--------------
#>
#> ℹ Apply restriction that each client must have 2 records that were within 365 days. Records that met the condition were flagged.
#> → -------------- Output all records--------------
#>
#> Processing source: hosp_df
#> → --------------Inclusion step--------------
#>
#> ℹ Identify records with condition(s):
#> • where at least one of the diagx_1, diagx_2 column(s) in each record
#> • contains a value satisfied regular expression: ^F10|^F11|^F12|^F13|^F14|^F15|^F16|^F17|^F18|^F19
#>
#> All unique value(s) and frequency in the result (as the conditions require just one of the columns containing target values; irrelevant values may come from other vars columns):
#> 999 F103 F104 F105 F106 F111 F112 F114 F115 F118 F12 F120 F122 F123 F124 F13
#> 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#> F131 F134 F135 F138 F139 F140 F142 F144 F154 F162 F163 F167 F168 F169 F172 F174
#> 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#> F177 F180 F185 F186 F188 F189 F19 F191 F192 F194 F196 F198 NA's
#> 1 1 1 1 1 1 1 1 1 1 1 1 1
#> → -------------- Output all records--------------
Let’s check the results!
# view the results
purrr::walk(result_list, ~ head(.) %>% print())
#> # Source: SQL [6 x 10]
#> # Database: sqlite 3.45.2 [:memory:]
#> # Ordered by: dates, uid
#> def src uid clnt_id dates diagx diagx_1 diagx_2 flag_restrict_n
#> <chr> <chr> <int> <int> <dbl> <chr> <chr> <chr> <int>
#> 1 SUD claim 23 1 16844 3036 2922 <NA> 0
#> 2 SUD claim 18 2 18200 3059 3055 999 0
#> 3 SUD claim 44 4 16677 2919 2925 304 0
#> 4 SUD claim 9 5 17135 111 3034 999 0
#> 5 SUD claim 16 6 16766 3036 3036 <NA> 1
#> 6 SUD claim 4 6 17010 291 3039 <NA> 1
#> # ℹ 1 more variable: flag_restrict_date <int>
#> def src uid clnt_id dates diagx diagx_1 diagx_2
#> 1 SUD hosp 7 2 2016-02-24 F105 F105 999
#> 2 SUD hosp 21 2 2017-05-12 F178 F154 <NA>
#> 3 SUD hosp 39 3 2015-07-17 F188 F167 <NA>
#> 4 SUD hosp 33 4 2015-07-15 F124 F123 <NA>
#> 5 SUD hosp 15 4 2015-09-20 F196 F115 999
#> 6 SUD hosp 47 5 2015-08-25 F173 F186 F196
At this point, the result from the claim database (result_list[[1]]) has not been collected locally. You could collect it manually, do further filtering, and then combine it with the result from the hospitalization data in any way you want. If you just need a simple row bind, we have bind_source() with a convenient naming feature.
bind_source(result_list,
# output_name = c(names in the list elements)
src = "src",
uid = "uid",
clnt_id = "clnt_id",
flag_date = c("flag_restrict_date", NA),
force_proceed = TRUE
)
#> # A tibble: 97 × 5
#> src_No src uid clnt_id flag_date
#> <int> <chr> <int> <int> <int>
#> 1 1 claim 23 1 0
#> 2 1 claim 18 2 0
#> 3 1 claim 44 4 0
#> 4 1 claim 9 5 0
#> 5 1 claim 16 6 1
#> 6 1 claim 4 6 0
#> 7 1 claim 14 8 0
#> 8 1 claim 26 9 0
#> 9 1 claim 1 9 0
#> 10 1 claim 6 10 0
#> # ℹ 87 more rows