Building base cohorts

Concept based cohort creation

One way to define a base cohort is to identify clinical records with codes from a pre-specified concept list. Here, for example, we first find codes for five drug ingredients: acetaminophen, amoxicillin, diclofenac, simvastatin, and warfarin. We use the getDrugIngredientCodes() function from the CodelistGenerator package to obtain the codes for these drugs.

drug_codes <- getDrugIngredientCodes(cdm, 
                                     name = c("acetaminophen",
                                              "amoxicillin", 
                                              "diclofenac", 
                                              "simvastatin",
                                              "warfarin"))

drug_codes
#> 
#> - 11289_warfarin (2 codes)
#> - 161_acetaminophen (7 codes)
#> - 3355_diclofenac (1 codes)
#> - 36567_simvastatin (2 codes)
#> - 723_amoxicillin (4 codes)
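The printed result groups codes under one name per ingredient. As a minimal sketch of that shape, a concept set can be thought of as a named list of concept ID vectors. The names and IDs below are hypothetical placeholders, not real vocabulary lookups, and the sketch uses base R only.

```r
# Hypothetical concept IDs, for illustration only.
my_codes <- list(
  drug_a = c(1001L, 1002L),
  drug_b = c(2001L)
)

# Each list name identifies one group of codes.
code_names <- names(my_codes)

# Number of codes per group, analogous to the "(n codes)" shown in the printout.
code_counts <- lengths(my_codes)
code_counts
```

In the examples in this section, the names of the concept set entries reappear as cohort names in the settings() output.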

Now that we have our codes of interest, we’ll make a cohort for each of them, with cohort exit defined as the event end date (which for these records will be the drug exposure end date).

cdm$drugs <- conceptCohort(cdm, 
                           conceptSet = drug_codes,
                           exit = "event_end_date",
                           name = "drugs")

settings(cdm$drugs)
#> # A tibble: 5 × 4
#>   cohort_definition_id cohort_name       cdm_version vocabulary_version
#>                  <int> <chr>             <chr>       <chr>             
#> 1                    1 11289_warfarin    5.3         v5.0 18-JAN-19    
#> 2                    2 161_acetaminophen 5.3         v5.0 18-JAN-19    
#> 3                    3 3355_diclofenac   5.3         v5.0 18-JAN-19    
#> 4                    4 36567_simvastatin 5.3         v5.0 18-JAN-19    
#> 5                    5 723_amoxicillin   5.3         v5.0 18-JAN-19
cohortCount(cdm$drugs)
#> # A tibble: 5 × 3
#>   cohort_definition_id number_records number_subjects
#>                  <int>          <int>           <int>
#> 1                    1            137             137
#> 2                    2          13908            2679
#> 3                    3            830             830
#> 4                    4            182             182
#> 5                    5           4307            2130
attrition(cdm$drugs)
#> # A tibble: 30 × 7
#>    cohort_definition_id number_records number_subjects reason_id reason         
#>                   <int>          <int>           <int>     <int> <chr>          
#>  1                    1            137             137         1 Initial qualif…
#>  2                    1            137             137         2 Record in obse…
#>  3                    1            137             137         3 Record start <…
#>  4                    1            137             137         4 Non-missing sex
#>  5                    1            137             137         5 Non-missing ye…
#>  6                    1            137             137         6 Merge overlapp…
#>  7                    2          14205            2679         1 Initial qualif…
#>  8                    2          14205            2679         2 Record in obse…
#>  9                    2          14205            2679         3 Record start <…
#> 10                    2          14205            2679         4 Non-missing sex
#> # ℹ 20 more rows
#> # ℹ 2 more variables: excluded_records <int>, excluded_subjects <int>

This creates a cohort where individuals are defined by their exposure to the specified drugs, and their cohort duration is determined by the exposure end date.
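To illustrate what the exit argument changes, here is a small base R sketch. The events table and the cohort_dates() helper are hypothetical and only mimic the idea: the cohort end date is taken either from the event end date or from the event start date. It is not the package's implementation.

```r
# Hypothetical drug exposure records for two people.
events <- data.frame(
  subject_id = c(1L, 2L),
  event_start_date = as.Date(c("2020-01-01", "2020-03-10")),
  event_end_date = as.Date(c("2020-01-30", "2020-03-17"))
)

# Hypothetical helper: choose the cohort end date according to the exit rule.
cohort_dates <- function(events, exit = c("event_end_date", "event_start_date")) {
  exit <- match.arg(exit)
  data.frame(
    subject_id = events$subject_id,
    cohort_start_date = events$event_start_date,
    cohort_end_date = events[[exit]]
  )
}

end_exit <- cohort_dates(events, exit = "event_end_date")
start_exit <- cohort_dates(events, exit = "event_start_date")
```

With exit = "event_start_date" every entry lasts a single day, whereas exit = "event_end_date" keeps the full exposure interval.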

Next, let’s create a cohort of individuals with bronchitis. We define a set of codes representing bronchitis and pass it to the conceptCohort() function. Here, cohort exit is the event start date (exit = "event_start_date"). We set table = "condition_occurrence" so that records for the provided concepts are searched for only in the condition_occurrence table. We then set subsetCohort = "drugs" to restrict cohort creation to individuals already in the drugs cohort table, and subsetCohortId = 1 to include only subjects from cohort 1 (individuals who have been exposed to warfarin).


bronchitis_codes <- list(bronchitis = c(260139, 256451, 4232302))

cdm$bronchitis <- conceptCohort(cdm,
                                conceptSet = bronchitis_codes,
                                exit = "event_start_date",
                                name = "bronchitis",
                                table = "condition_occurrence",
                                subsetCohort = "drugs",
                                subsetCohortId = 1)


cohortCount(cdm$bronchitis)
#> # A tibble: 1 × 3
#>   cohort_definition_id number_records number_subjects
#>                  <int>          <int>           <int>
#> 1                    1            533             130
attrition(cdm$bronchitis)
#> # A tibble: 6 × 7
#>   cohort_definition_id number_records number_subjects reason_id reason          
#>                  <int>          <int>           <int>     <int> <chr>           
#> 1                    1            533             130         1 Initial qualify…
#> 2                    1            533             130         2 Record in obser…
#> 3                    1            533             130         3 Record start <=…
#> 4                    1            533             130         4 Non-missing sex 
#> 5                    1            533             130         5 Non-missing yea…
#> 6                    1            533             130         6 Merge overlappi…
#> # ℹ 2 more variables: excluded_records <int>, excluded_subjects <int>
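The effect of subsetCohort and subsetCohortId can be pictured as a filter on subjects. The base R sketch below uses two hypothetical tables and keeps only condition records for people who appear in cohort 1 of the other table. It is an illustration of the idea, not the package's code.

```r
# Hypothetical condition records.
condition_records <- data.frame(
  subject_id = c(1L, 2L, 2L, 3L, 4L),
  condition_start_date = as.Date(c(
    "2019-05-01", "2019-06-10", "2020-02-02", "2018-11-20", "2021-01-15"
  ))
)

# Hypothetical existing cohort table with two cohort definitions.
drugs_cohort <- data.frame(
  cohort_definition_id = c(1L, 1L, 2L),
  subject_id = c(2L, 4L, 3L)
)

# Subjects in cohort 1 only.
keep_ids <- unique(
  drugs_cohort$subject_id[drugs_cohort$cohort_definition_id == 1L]
)

# Restrict the condition records to those subjects.
subset_records <- condition_records[condition_records$subject_id %in% keep_ids, ]
subset_records
```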

When records in a cohort overlap, they are combined into a single entry whose cohort start date is the earliest start date among them. If we set overlap = "merge", the cohort end date is set to the latest end date of the overlapping records.

cdm$drugs_merge <- conceptCohort(cdm, 
                           conceptSet = drug_codes,
                           overlap = "merge",
                           name = "drugs_merge")

cdm$drugs_merge |>
  attrition()
#> # A tibble: 30 × 7
#>    cohort_definition_id number_records number_subjects reason_id reason         
#>                   <int>          <int>           <int>     <int> <chr>          
#>  1                    1            137             137         1 Initial qualif…
#>  2                    1            137             137         2 Record in obse…
#>  3                    1            137             137         3 Record start <…
#>  4                    1            137             137         4 Non-missing sex
#>  5                    1            137             137         5 Non-missing ye…
#>  6                    1            137             137         6 Merge overlapp…
#>  7                    2          14205            2679         1 Initial qualif…
#>  8                    2          14205            2679         2 Record in obse…
#>  9                    2          14205            2679         3 Record start <…
#> 10                    2          14205            2679         4 Non-missing sex
#> # ℹ 20 more rows
#> # ℹ 2 more variables: excluded_records <int>, excluded_subjects <int>
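As a rough illustration of the merge rule, the base R sketch below collapses overlapping date intervals for a single person. The merge_overlaps() helper is hypothetical, and it assumes that a record overlaps when it starts on or before the current end date; the package may treat boundary cases differently.

```r
# Hypothetical helper: merge overlapping intervals for one person.
merge_overlaps <- function(start, end) {
  ord <- order(start)
  start <- start[ord]
  end <- end[ord]
  out_start <- start[1]
  out_end <- end[1]
  for (i in seq_along(start)[-1]) {
    n <- length(out_start)
    if (start[i] <= out_end[n]) {
      # Overlap: keep the earliest start, take the latest end.
      out_end[n] <- max(out_end[n], end[i])
    } else {
      # No overlap: begin a new cohort entry.
      out_start <- c(out_start, start[i])
      out_end <- c(out_end, end[i])
    }
  }
  data.frame(cohort_start_date = out_start, cohort_end_date = out_end)
}

merged <- merge_overlaps(
  start = as.Date(c("2020-01-01", "2020-01-05", "2020-02-01")),
  end = as.Date(c("2020-01-10", "2020-01-20", "2020-02-05"))
)
merged
```

Here the first two records overlap and become one entry running from 2020-01-01 to 2020-01-20, while the third record stays separate.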

Alternatively, if we set overlap = "extend", the cohort end date will be extended by summing the durations of each overlapping record.

cdm$drugs_extend <- conceptCohort(cdm, 
                           conceptSet = drug_codes,
                           overlap = "extend",
                           name = "drugs_extend")

cdm$drugs_extend |>
  attrition()
#> # A tibble: 50 × 7
#>    cohort_definition_id number_records number_subjects reason_id reason         
#>                   <int>          <int>           <int>     <int> <chr>          
#>  1                    1            137             137         1 Initial qualif…
#>  2                    1            137             137         2 Record in obse…
#>  3                    1            137             137         3 Record start <…
#>  4                    1            137             137         4 Non-missing sex
#>  5                    1            137             137         5 Non-missing ye…
#>  6                    1            137             137         6 Add overlappin…
#>  7                    1            137             137         7 Record in obse…
#>  8                    1            137             137         8 Record start <…
#>  9                    1            137             137         9 Non-missing sex
#> 10                    1            137             137        10 Non-missing ye…
#> # ℹ 40 more rows
#> # ℹ 2 more variables: excluded_records <int>, excluded_subjects <int>
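To see how summing durations differs from taking the latest end date, here is a base R sketch for one group of overlapping records. The extend_overlap() helper is hypothetical, and it assumes each record's duration is counted inclusively (start and end day both count); the package's exact day counting is not stated here.

```r
# Hypothetical helper: extend one group of overlapping records.
extend_overlap <- function(start, end) {
  # Inclusive day count per record (an assumption of this sketch).
  days <- as.integer(end - start) + 1L
  new_start <- min(start)
  # The end date covers the summed days, counted from the earliest start.
  new_end <- new_start + sum(days) - 1L
  data.frame(cohort_start_date = new_start, cohort_end_date = new_end)
}

extended <- extend_overlap(
  start = as.Date(c("2020-01-01", "2020-01-05")),
  end = as.Date(c("2020-01-10", "2020-01-20"))
)
extended
```

The two records last 10 and 16 days, so the extended entry covers 26 days and ends on 2020-01-26, compared with 2020-01-20 if only the latest end date were used. Because extending an entry can make it run into a later record or past the end of observation, it is plausible that this is why the attrition above repeats several checks after the overlap step.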

To create a cohort from a concept set and include records outside of the observation period, we can set inObservation = FALSE. If we also want to search for the given concepts in the source concept_id fields, rather than only the standard concept_id fields, we can set useSourceFields = TRUE.


cdm$celecoxib <- conceptCohort(cdm, 
                           conceptSet = list(celecoxib = 44923712),
                           name = "celecoxib", 
                           inObservation = FALSE, 
                           useSourceFields = TRUE)
cdm$celecoxib |>
  glimpse()
#> Rows: ??
#> Columns: 4
#> Database: DuckDB v1.3.1 [root@Darwin 24.5.0:R 4.5.1//private/var/folders/sw/rd8zn92n2nz45cfcc5dcs_080000gr/T/RtmpwyN0k0/file110b8689aff0d.duckdb]
#> $ cohort_definition_id <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ subject_id           <int> 2411, 4474, 3, 2915, 3652, 4606, 3268, 2243, 1265…
#> $ cohort_start_date    <date> 2001-05-14, 1998-12-15, 1957-12-08, 1981-11-14, …
#> $ cohort_end_date      <date> 2001-05-14, 1998-12-15, 1957-12-08, 1981-11-14, …
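The observation requirement can be sketched as a date check against each person's observation period. The base R example below uses hypothetical tables and assumes one observation period per person; it flags which records would fall inside observation, which is the restriction that inObservation = FALSE lifts.

```r
# Hypothetical clinical records.
records <- data.frame(
  subject_id = c(1L, 1L, 2L),
  record_date = as.Date(c("2010-06-01", "2021-01-01", "2005-03-01"))
)

# Hypothetical observation periods, one per person (an assumption).
observation_period <- data.frame(
  subject_id = c(1L, 2L),
  obs_start = as.Date(c("2000-01-01", "2006-01-01")),
  obs_end = as.Date(c("2015-12-31", "2020-12-31"))
)

joined <- merge(records, observation_period, by = "subject_id")

# TRUE when the record date lies within the person's observation period.
joined$in_observation <- joined$record_date >= joined$obs_start &
  joined$record_date <= joined$obs_end
joined
```

Only one of the three records is in observation, so a cohort restricted to observation would keep that one, while an unrestricted cohort would keep all three.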

Demographic based cohort creation

One base cohort we can create is based on patient demographics. Here, for example, we create a cohort where people enter on their 18th birthday and leave on the day before their 66th birthday.

cdm$working_age_cohort <- demographicsCohort(cdm = cdm, 
                                             ageRange = c(18, 65), 
                                             name = "working_age_cohort")

settings(cdm$working_age_cohort)
#> # A tibble: 1 × 3
#>   cohort_definition_id cohort_name  age_range
#>                  <int> <chr>        <chr>    
#> 1                    1 demographics 18_65
cohortCount(cdm$working_age_cohort)
#> # A tibble: 1 × 3
#>   cohort_definition_id number_records number_subjects
#>                  <int>          <int>           <int>
#> 1                    1           2694            2694
attrition(cdm$working_age_cohort)
#> # A tibble: 2 × 7
#>   cohort_definition_id number_records number_subjects reason_id reason          
#>                  <int>          <int>           <int>     <int> <chr>           
#> 1                    1           2694            2694         1 Initial qualify…
#> 2                    1           2694            2694         2 Age requirement…
#> # ℹ 2 more variables: excluded_records <int>, excluded_subjects <int>
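The entry and exit dates implied by ageRange = c(18, 65) can be worked out by hand. The base R sketch below does this for a hypothetical date of birth using a hypothetical add_years() helper. It ignores the observation period, which in practice would also bound the dates.

```r
# Hypothetical helper: the date a given number of whole years after `date`.
add_years <- function(date, years) {
  seq(date, by = paste(years, "years"), length.out = 2)[2]
}

dob <- as.Date("1980-03-15")  # hypothetical date of birth

# Enter on the 18th birthday.
entry <- add_years(dob, 18)

# Leave on the day before the 66th birthday (the last day aged 65).
exit <- add_years(dob, 66) - 1

c(entry = entry, exit = exit)
```

For this date of birth the person would enter on 1998-03-15 and leave on 2046-03-14.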

We can also add a requirement on sex, here keeping only people of working age whose sex is “Female”.

cdm$female_working_age_cohort <- demographicsCohort(cdm = cdm, 
                                             ageRange = c(18, 65),
                                             sex = "Female",
                                             name = "female_working_age_cohort")

settings(cdm$female_working_age_cohort)
#> # A tibble: 1 × 4
#>   cohort_definition_id cohort_name  age_range sex   
#>                  <int> <chr>        <chr>     <chr> 
#> 1                    1 demographics 18_65     Female
cohortCount(cdm$female_working_age_cohort)
#> # A tibble: 1 × 3
#>   cohort_definition_id number_records number_subjects
#>                  <int>          <int>           <int>
#> 1                    1           1373            1373
attrition(cdm$female_working_age_cohort)
#> # A tibble: 3 × 7
#>   cohort_definition_id number_records number_subjects reason_id reason          
#>                  <int>          <int>           <int>     <int> <chr>           
#> 1                    1           2694            2694         1 Initial qualify…
#> 2                    1           1373            1373         2 Sex requirement…
#> 3                    1           1373            1373         3 Age requirement…
#> # ℹ 2 more variables: excluded_records <int>, excluded_subjects <int>

We can also use this function to create cohorts for different combinations of age groups and sex.

cdm$age_sex_cohorts <- demographicsCohort(cdm = cdm, 
                                             ageRange = list(c(0, 17), c(18, 65), c(66,120)),
                                             sex = c("Female", "Male"),
                                             name = "age_sex_cohorts")

settings(cdm$age_sex_cohorts)
#> # A tibble: 6 × 4
#>   cohort_definition_id cohort_name    age_range sex   
#>                  <int> <chr>          <chr>     <chr> 
#> 1                    1 demographics_1 0_17      Female
#> 2                    2 demographics_2 0_17      Male  
#> 3                    3 demographics_3 18_65     Female
#> 4                    4 demographics_4 18_65     Male  
#> 5                    5 demographics_5 66_120    Female
#> 6                    6 demographics_6 66_120    Male
cohortCount(cdm$age_sex_cohorts)
#> # A tibble: 6 × 3
#>   cohort_definition_id number_records number_subjects
#>                  <int>          <int>           <int>
#> 1                    1           1373            1373
#> 2                    2           1321            1321
#> 3                    3           1373            1373
#> 4                    4           1321            1321
#> 5                    5            393             393
#> 6                    6            378             378
attrition(cdm$age_sex_cohorts)
#> # A tibble: 18 × 7
#>    cohort_definition_id number_records number_subjects reason_id reason         
#>                   <int>          <int>           <int>     <int> <chr>          
#>  1                    1           2694            2694         1 Initial qualif…
#>  2                    1           1373            1373         2 Sex requiremen…
#>  3                    1           1373            1373         3 Age requiremen…
#>  4                    2           2694            2694         1 Initial qualif…
#>  5                    2           1321            1321         2 Sex requiremen…
#>  6                    2           1321            1321         3 Age requiremen…
#>  7                    3           2694            2694         1 Initial qualif…
#>  8                    3           1373            1373         2 Sex requiremen…
#>  9                    3           1373            1373         3 Age requiremen…
#> 10                    4           2694            2694         1 Initial qualif…
#> 11                    4           1321            1321         2 Sex requiremen…
#> 12                    4           1321            1321         3 Age requiremen…
#> 13                    5           2694            2694         1 Initial qualif…
#> 14                    5           1373            1373         2 Sex requiremen…
#> 15                    5            393             393         3 Age requiremen…
#> 16                    6           2694            2694         1 Initial qualif…
#> 17                    6           1321            1321         2 Sex requiremen…
#> 18                    6            378             378         3 Age requiremen…
#> # ℹ 2 more variables: excluded_records <int>, excluded_subjects <int>
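The six cohorts come from crossing three age ranges with two sexes. As a sketch of how such a grid can be enumerated in base R (not how the package builds it internally), expand.grid() reproduces the same ordering as the settings above, with sex varying fastest.

```r
# Cross every age range with every sex; the first argument varies fastest.
combos <- expand.grid(
  sex = c("Female", "Male"),
  age_range = c("0_17", "18_65", "66_120"),
  stringsAsFactors = FALSE
)

# Number the combinations in order, as cohort definition ids would be.
combos$cohort_definition_id <- seq_len(nrow(combos))
combos[, c("cohort_definition_id", "age_range", "sex")]
```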

We can also specify the minimum number of days of prior observation required. Passing more than one value, as here with c(0, 365), creates one cohort per value.

cdm$working_age_cohort_0_365 <- demographicsCohort(cdm = cdm, 
                                             ageRange = c(18, 65), 
                                             name = "working_age_cohort_0_365",
                                             minPriorObservation = c(0,365))

settings(cdm$working_age_cohort_0_365)
#> # A tibble: 2 × 4
#>   cohort_definition_id cohort_name    age_range min_prior_observation
#>                  <int> <chr>          <chr>                     <dbl>
#> 1                    1 demographics_1 18_65                         0
#> 2                    2 demographics_2 18_65                       365
cohortCount(cdm$working_age_cohort_0_365)
#> # A tibble: 2 × 3
#>   cohort_definition_id number_records number_subjects
#>                  <int>          <int>           <int>
#> 1                    1           2694            2694
#> 2                    2           2694            2694
attrition(cdm$working_age_cohort_0_365)
#> # A tibble: 6 × 7
#>   cohort_definition_id number_records number_subjects reason_id reason          
#>                  <int>          <int>           <int>     <int> <chr>           
#> 1                    1           2694            2694         1 Initial qualify…
#> 2                    1           2694            2694         2 Age requirement…
#> 3                    1           2694            2694         3 Prior observati…
#> 4                    2           2694            2694         1 Initial qualify…
#> 5                    2           2694            2694         2 Age requirement…
#> 6                    2           2694            2694         3 Prior observati…
#> # ℹ 2 more variables: excluded_records <int>, excluded_subjects <int>
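One way to think about a prior observation requirement is that a person can only enter once they both satisfy the age criterion and have accumulated enough observation time. The base R sketch below encodes that reading with hypothetical dates and a hypothetical entry_date() helper. Whether the package delays entry in exactly this way, rather than excluding the person, is an assumption of the sketch.

```r
# Hypothetical dates for one person.
birthday_18 <- as.Date("1998-03-15")
observation_start <- as.Date("1998-01-01")

# Hypothetical helper: earliest date meeting both the age and the
# prior observation requirement (assumed behaviour, see the text above).
entry_date <- function(birthday, obs_start, min_prior_observation) {
  max(birthday, obs_start + min_prior_observation)
}

entry_0 <- entry_date(birthday_18, observation_start, 0)
entry_365 <- entry_date(birthday_18, observation_start, 365)

c(entry_0 = entry_0, entry_365 = entry_365)
```

Under this reading the person is counted in both cohorts but enters later, on 1999-01-01 instead of 1998-03-15, when 365 days of prior observation are required.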

Measurement Cohort

Another base cohort we can create is based on patient measurements. Here, for example, we create a cohort of patients who have a normal BMI (BMI between 18 and 25). To do this you must first identify three things: the measurement you want to look at, in this case BMI (concept id = 4245997); the unit of measurement, kilogram per square meter (concept id = 9531); and the ‘normal’ value concept (concept id = 4069590). The value concept is included for cases where the exact BMI measurement is not recorded but the BMI category (i.e. normal, overweight, obese etc.) is. This means that a record is included in the cohort if it matches the value concept OR has a BMI score in the normal range.

cdm$cohort <- measurementCohort(
  cdm = cdm,
  name = "cohort",
  conceptSet = list("bmi_normal" = c(4245997)),
  valueAsConcept = c(4069590),
  valueAsNumber = list("9531" = c(18, 25))
)

attrition(cdm$cohort)
#> # A tibble: 6 × 7
#>   cohort_definition_id number_records number_subjects reason_id reason          
#>                  <int>          <int>           <int>     <int> <chr>           
#> 1                    1              5               3         1 Initial qualify…
#> 2                    1              3               2         2 Record in obser…
#> 3                    1              3               2         3 Not missing rec…
#> 4                    1              3               2         4 Non-missing sex 
#> 5                    1              3               2         5 Non-missing yea…
#> 6                    1              3               2         6 Distinct measur…
#> # ℹ 2 more variables: excluded_records <int>, excluded_subjects <int>
settings(cdm$cohort)
#> # A tibble: 1 × 4
#>   cohort_definition_id cohort_name cdm_version vocabulary_version
#>                  <int> <chr>       <chr>       <chr>             
#> 1                    1 bmi_normal  5.3         mock
cdm$cohort
#> # Source:   table<cohort> [?? x 4]
#> # Database: DuckDB v1.3.1 [root@Darwin 24.5.0:R 4.5.1/:memory:]
#>   cohort_definition_id subject_id cohort_start_date cohort_end_date
#>                  <int>      <dbl> <date>            <date>         
#> 1                    1          1 2015-02-19        2015-02-19     
#> 2                    1          1 2009-07-01        2009-07-01     
#> 3                    1          3 1999-09-08        1999-09-08

As you can see in the above code, conceptSet holds the list of BMI concepts, valueAsConcept holds the ‘normal’ weight concept, and valueAsNumber holds the minimum and maximum BMI scores to be considered, named by the unit concept id they apply to.
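The OR logic between the value concept and the numeric range can be sketched in base R. The measurements table below is hypothetical (the unit id 9999 is a made-up placeholder for some other unit), and the sketch assumes the range bounds are inclusive; it only illustrates the selection rule and is not the package's implementation.

```r
# Hypothetical measurement records for the BMI concept.
measurements <- data.frame(
  measurement_concept_id = c(4245997, 4245997, 4245997, 4245997),
  value_as_number = c(22.5, 31.0, NA, 24.0),
  unit_concept_id = c(9531, 9531, NA, 9999),  # 9999: placeholder for another unit
  value_as_concept_id = c(NA, NA, 4069590, NA)
)

# Numeric value within range for the expected unit (bounds assumed inclusive).
in_range <- !is.na(measurements$value_as_number) &
  !is.na(measurements$unit_concept_id) &
  measurements$unit_concept_id == 9531 &
  measurements$value_as_number >= 18 &
  measurements$value_as_number <= 25

# Categorical result equal to the 'normal' value concept.
concept_match <- !is.na(measurements$value_as_concept_id) &
  measurements$value_as_concept_id == 4069590

# A record qualifies if either condition holds.
keep <- measurements$measurement_concept_id == 4245997 &
  (in_range | concept_match)
measurements[keep, ]
```

The first record qualifies through its numeric value, the third through its value concept, and the other two are dropped because the value is out of range or recorded in a different unit.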

It is also possible to include records outside of observation by setting the inObservation argument to FALSE.

cdm$cohort <- measurementCohort(
  cdm = cdm,
  name = "cohort",
  conceptSet = list("bmi_normal" = c(4245997)),
  valueAsConcept = c(4069590),
  valueAsNumber = list("9531" = c(18, 25)),
  inObservation = FALSE
)

attrition(cdm$cohort)
#> # A tibble: 6 × 7
#>   cohort_definition_id number_records number_subjects reason_id reason          
#>                  <int>          <int>           <int>     <int> <chr>           
#> 1                    1              5               3         1 Initial qualify…
#> 2                    1              4               3         2 Record in obser…
#> 3                    1              4               3         3 Not missing rec…
#> 4                    1              4               3         4 Non-missing sex 
#> 5                    1              4               3         5 Non-missing yea…
#> 6                    1              4               3         6 Distinct measur…
#> # ℹ 2 more variables: excluded_records <int>, excluded_subjects <int>
settings(cdm$cohort)
#> # A tibble: 1 × 4
#>   cohort_definition_id cohort_name cdm_version vocabulary_version
#>                  <int> <chr>       <chr>       <chr>             
#> 1                    1 bmi_normal  5.3         mock
cdm$cohort
#> # Source:   table<cohort> [?? x 4]
#> # Database: DuckDB v1.3.1 [root@Darwin 24.5.0:R 4.5.1/:memory:]
#>   cohort_definition_id subject_id cohort_start_date cohort_end_date
#>                  <int>      <dbl> <date>            <date>         
#> 1                    1          1 2015-02-19        2015-02-19     
#> 2                    1          2 2006-04-01        2006-04-01     
#> 3                    1          3 1999-09-08        1999-09-08     
#> 4                    1          1 2009-07-01        2009-07-01
