Integration with existing packages

Maciej Beręsewicz

1 Setup

library(blocking)
library(data.table)
library(reclin2)

2 Data

In this example we use the same datasets (census and cis) as in the "Blocking records for record linkage" vignette.

data(census)
data(cis)
setDT(census)
setDT(cis)
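# Add row identifiers so that linked records can be traced back to their source rows
census[, x := seq_len(.N)]
cis[, y := seq_len(.N)]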

3 Integration with the reclin2 package

The package provides the pair_ann() function, which is designed for integration with the reclin2 package. It works as follows.

pair_ann(x = census[1:1000], 
         y = cis[1:1000], 
         on = c("pername1", "pername2", "sex", "dob_day", "dob_mon", "dob_year", "enumcap", "enumpc"), 
         deduplication = FALSE) |>
  head()
 .x  .y block
204   1     1
204 176     1
204 375     1
204 391     1
204 405     1
204 424     1

The output provides information on the candidate pairs and their total number. The result can then be included in a reclin2 pipeline (note that we use a different ANN algorithm this time).

pair_ann(x = census[1:1000], 
         y = cis[1:1000], 
         on = c("pername1", "pername2", "sex", "dob_day", "dob_mon", "dob_year", "enumcap", "enumpc"), 
         deduplication = FALSE,
         ann = "hnsw") |>
  compare_pairs(on = c("pername1", "pername2", "sex", "dob_day", "dob_mon", "dob_year", "enumcap", "enumpc"),
                comparators = list(cmp_jarowinkler())) |>
  score_simple("score",
               on = c("pername1", "pername2", "sex", "dob_day", "dob_mon", "dob_year", "enumcap", "enumpc")) |>
  select_threshold("threshold", score = "score", threshold = 6) |>
  link(selection = "threshold") |>
  head()
.y   .x   person_id.x    pername1.x  pername2.x  sex.x  dob_day.x  dob_mon.x  dob_year.x
11   945  DE256NG039003  HARRIET     THOMSON     F      12         1          1995
71   427  DE159QA062001  LEWIS       GREEN       M      23         3          1973
83   720  DE237GG025002  IMOGEN      DARIS       F      6          4          1968
99   136  DE125LU022001  DANIEC      MICCER      M      21         4          1947
154  949  DE256NG040002  CHLOE       WILSON      F      5          7          1978
156  549  DE159QY035002  AVA         KING        F      7          7          1969

hse_num  enumcap.x            enumpc.x  str_nam           cap_add               census_id          x
39       39 SPRINGFIELD ROAD  DE256NG   Springfield Road  39, Springfield Road  CENSDE256NG039003  945
62       62 CHURCH ROAD       DE159QA   Church Road       62, Church Road       CENSDE159QA062001  427
25       25 WOODLANDS ROAD    DE237GG   Woodlands Road    25, Woodlands Road    CENSDE237GG025002  720
22       22 PARK LANE         DE125LU   Park Lane         22, Park Lane         CENSDE125LU022001  136
40       40 SPRINGFIELD ROAD  DE256NG   Springfield Road  40, Springfield Road  CENSDE256NG040002  949
35       35 CHURCH ROAD       DE159QY   Church Road       35, Church Road       CENSDE159QY035002  549

person_id.y    pername1.y  pername2.y  sex.y  dob_day.y  dob_mon.y  dob_year.y
DE256NG039003  HARRIET     THOMSON     F      12         1
DE159QA062001  LEWIS       GREEN       M      23         3
DE237GG025002  IMOGEW      DAVIS       F      6          4
DE125LU022001  DAMIEL      HILLER      M      21         4
DE256NG040002  CHLOE       WILSOM      F      5          7
DE159QY035002  AVA         KING        F      7          7

enumcap.y            enumpc.y  cis_id            y
39 SPRINGFIELD ROAD  DE256NG   CISDE256NG039003  11
62 CHURCH ROAD       DE159QA   CISDE159QA062001  71
25 WOODLANDS ROAD    DE237GG   CISDE237GG025002  83
22 PARK LANE         DE125LU   CISDE125LU022001  99
40 SPRINGFIELD ROAD  DE256NG   CISDE256NG040002  154
35 CHURCH ROAD       DE159QY   CISDE159QY035002  156
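
To make the scoring and selection steps of the pipeline above more concrete, the following base R sketch mimics the idea on made-up numbers. It is not the reclin2 implementation: the similarity values are hypothetical, the score is assumed to be an unweighted sum over the eight fields, and selection is assumed to keep pairs whose score is strictly above the threshold of 6.

```r
# Hypothetical similarity scores (between 0 and 1) for three candidate pairs
# on eight compared fields; these values are invented for illustration only
sim <- matrix(
  c(1.00, 1.00, 1, 1, 1, 1, 1.00, 1,   # near-identical records
    0.93, 0.87, 1, 1, 1, 1, 1.00, 1,   # same person, typos in both names
    0.45, 0.50, 0, 0, 1, 0, 0.40, 0),  # different people
  nrow = 3, byrow = TRUE
)

# Simple score (assumption): unweighted sum of the field similarities
score <- rowSums(sim)

# Keep pairs whose score exceeds the threshold used in the pipeline above
selected <- score > 6

print(data.frame(score, selected))
```

With eight fields the maximum score is 8, so a threshold of 6 tolerates a few typos or one or two disagreeing fields while still rejecting clearly different records.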

4 Usage with the RecordLinkage package

Pass the block column to the blockfld argument of the compare.dedup() or compare.linkage() function. Note that, for the RecordLinkage package, the block column should be stored as a character vector, not a numeric/integer vector.
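
A minimal sketch of this workflow is shown below. The toy data frames a and b, and the blocks table with columns x, y and block, are hypothetical stand-ins for the two datasets and a blocking result; they are not taken from the package output. The call to compare.linkage() is only executed if RecordLinkage is installed, and passing the blocking field by column name is an assumption of this sketch (a column index can be used instead).

```r
# Toy datasets standing in for census and cis (same columns in both)
a <- data.frame(name = c("ANNA", "JOHN", "MARY"))
b <- data.frame(name = c("JON", "ANA", "MARY"))

# Hypothetical blocking result: x indexes rows of a, y indexes rows of b
blocks <- data.frame(x = c(1L, 2L, 3L),
                     y = c(2L, 1L, 3L),
                     block = c(1L, 2L, 3L))

# Attach the block identifier to both datasets, stored as character
a$block <- NA_character_
b$block <- NA_character_
a$block[blocks$x] <- as.character(blocks$block)
b$block[blocks$y] <- as.character(blocks$block)

# Compare only records that share a block (runs only if RecordLinkage is available)
if (requireNamespace("RecordLinkage", quietly = TRUE)) {
  rpairs <- RecordLinkage::compare.linkage(a, b, blockfld = "block")
  print(rpairs$pairs)
}
```

In this toy example each block contains exactly one record from each dataset; with real blocking output a block may hold several records from either side.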
