Analyses can slow to a crawl when models need hours to run. In this
article you will find a few tricks to prevent this bottleneck when using
orsf(). We’ll use the flchain data from survival to demonstrate.
library(aorsf)
data("flchain", package = 'survival')
flc <- flchain
# do this to avoid orsf() throwing an error about time to event = 0
flc <- flc[flc$futime > 0, ]
# modify names
names(flc)[names(flc) == 'futime'] <- 'time'
names(flc)[names(flc) == 'death'] <- 'status'
Our flc data has 7871 rows and 11 columns:
head(flc)
#> age sex sample.yr kappa lambda flc.grp creatinine mgus time status
#> 1 97 F 1997 5.70 4.860 10 1.7 0 85 1
#> 2 92 F 2000 0.87 0.683 1 0.9 0 1281 1
#> 3 94 F 1997 4.36 3.850 10 1.4 0 69 1
#> 4 92 F 1996 2.42 2.220 9 1.0 0 115 1
#> 5 93 F 1996 1.32 1.690 6 1.1 0 1039 1
#> 6 90 F 1997 2.01 1.860 9 1.0 0 1355 1
#> chapter
#> 1 Circulatory
#> 2 Neoplasms
#> 3 Circulatory
#> 4 Circulatory
#> 5 Circulatory
#> 6 Mental
orsf_control_fast()
This is the default control value for orsf(), and its run-time compared
to other approaches can be striking. For example:
time_fast <- system.time(
  expr = orsf(flc, time+status~., na_action = 'na_impute_meanmode',
              control = orsf_control_fast(), n_tree = 10)
)
time_net <- system.time(
  expr = orsf(flc, time+status~., na_action = 'na_impute_meanmode',
              control = orsf_control_net(), n_tree = 10)
)
# orsf_control_fast() is much faster than orsf_control_net()
time_net['elapsed'] / time_fast['elapsed']
#> elapsed
#> 50.28571
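A single system.time() call can be noisy, so a ratio like the one above may shift from run to run. The sketch below is one way to steady the comparison: it repeats each timing a few times and compares medians. The helper time_median() and the two toy functions are hypothetical stand-ins, not part of aorsf; in practice the orsf() calls shown above would take their place.

```r
# Hypothetical helper (not part of aorsf): evaluate `fun` `n_rep` times
# and return the median elapsed time in seconds.
time_median <- function(fun, n_rep = 3) {
  elapsed <- vapply(
    X = seq_len(n_rep),
    FUN = function(i) system.time(fun())[["elapsed"]],
    FUN.VALUE = numeric(1)
  )
  median(elapsed)
}

# Toy stand-ins for a slow and a fast model fit
fit_slow <- function() Sys.sleep(0.2)
fit_fast <- function() Sys.sleep(0.05)

t_slow <- time_median(fit_slow)
t_fast <- time_median(fit_fast)

# values above 1 mean fit_fast() finished sooner
t_slow / t_fast
```

Taking the median rather than the mean keeps one unusually slow run from dominating the result.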
n_thread
The n_thread argument uses multi-threading to run aorsf functions in
parallel when possible. If you know how many threads you want, e.g. you
want exactly 5, just say n_thread = 5. If you aren’t sure how many
threads you have available but want to use as many as you can, say
n_thread = 0 and aorsf will figure out the number for you.
time_1_thread <- system.time(
  expr = orsf(flc, time+status~., na_action = 'na_impute_meanmode',
              n_thread = 1, n_tree = 500)
)
time_5_thread <- system.time(
  expr = orsf(flc, time+status~., na_action = 'na_impute_meanmode',
              n_thread = 5, n_tree = 500)
)
time_auto_thread <- system.time(
  expr = orsf(flc, time+status~., na_action = 'na_impute_meanmode',
              n_thread = 0, n_tree = 500)
)
# 5 threads and auto thread are both between 3 and 4 times faster than one thread
time_1_thread['elapsed'] / time_5_thread['elapsed']
#> elapsed
#> 3.392857
time_1_thread['elapsed'] / time_auto_thread['elapsed']
#> elapsed
#> 3.861789
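If you would rather leave a core free for other work than hand every thread to aorsf, one option is to count cores yourself and pass the result to n_thread. The sketch below uses detectCores() from the parallel package that ships with R. The name n_thread_use and the choice to keep one core in reserve are assumptions made for illustration, not a recommendation from aorsf.

```r
# detectCores() can return NA when the count cannot be determined
n_cores <- parallel::detectCores(logical = TRUE)

# fall back to one thread if unknown; otherwise keep one core in reserve
n_thread_use <- if (is.na(n_cores)) 1L else max(1L, n_cores - 1L)

n_thread_use
```

The result could then be supplied as n_thread = n_thread_use in calls like the ones above.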
Because R is a single-threaded language, multi-threading cannot be
applied when orsf() needs to call R functions from C++. This happens
when a customized R function is used to find linear combinations of
variables or to compute prediction accuracy.
There are some defaults in orsf() that can be adjusted to make it run
faster:
- set n_retry to 0 instead of 3 (the default)
- set oobag_pred_type to ‘none’ instead of ‘surv’ (the default)
- set importance to ‘none’ instead of ‘anova’ (the default)
- increase split_min_events, split_min_obs, leaf_min_events, or
  leaf_min_obs to make trees stop growing sooner
- increase split_min_stat to make trees stop growing sooner
Applying these tips:
time_lightweight <- system.time(
  expr = orsf(flc, time+status~., na_action = 'na_impute_meanmode',
              n_thread = 0, n_tree = 500, n_retry = 0,
              oobag_pred_type = 'none', importance = 'none',
              split_min_events = 20, leaf_min_events = 10,
              split_min_stat = 10)
)
# about two times faster than auto thread with defaults
time_auto_thread['elapsed'] / time_lightweight['elapsed']
#> elapsed
#> 1.921875
While these default values do make orsf() run slower, they also usually
make its predictions more accurate or make the fit easier to interpret.
Setting verbose_progress = TRUE doesn’t make anything run faster, but it
can make it feel like things are running less slowly.
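A minimal sketch of switching the progress display on, rebuilding the flc data from the start of this article so that the block stands alone. The requireNamespace() guard and the smaller n_tree = 50 are choices made here for illustration; the exact form of the progress messages may differ across aorsf versions.

```r
# Guarded so the sketch only runs when both packages are installed
if (requireNamespace("aorsf", quietly = TRUE) &&
    requireNamespace("survival", quietly = TRUE)) {

  data("flchain", package = "survival")
  flc <- flchain[flchain$futime > 0, ]
  names(flc)[names(flc) == "futime"] <- "time"
  names(flc)[names(flc) == "death"] <- "status"

  # progress is reported while the trees are grown
  fit <- aorsf::orsf(flc, time + status ~ .,
                     na_action = "na_impute_meanmode",
                     n_tree = 50,
                     verbose_progress = TRUE)
}
```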