| Type: | Package |
| Title: | Building Augmented Data to Run Multi-State Models with 'msm' Package |
| Version: | 2.2.1 |
| Date: | 2026-06-01 |
| Description: | A fast and general method for restructuring classical longitudinal observational data into augmented transition data suitable for multi-state modeling with the 'msm' package. Works with any longitudinal data where subjects accumulate repeated observations with start and end times and an optional terminal outcome. Methods are described in Grossetti, Ieva and Paganoni (2018) <doi:10.1007/s10729-017-9400-z>. |
| URL: | https://github.com/contefranz/msmtools |
| BugReports: | https://github.com/contefranz/msmtools/issues |
| License: | GPL-3 |
| LazyData: | TRUE |
| Config/roxygen2/version: | 8.0.0 |
| Depends: | R (≥ 4.1) |
| Imports: | data.table (≥ 1.18.4), cli (≥ 3.6.0), msm (≥ 1.8.2), survival (≥ 3.8-6), ggplot2 (≥ 4.0.3) |
| Suggests: | testthat (≥ 3.3.2), knitr (≥ 1.51), rmarkdown (≥ 2.31), roxygen2 (≥ 8.0.0), patchwork (≥ 1.3.2) |
| Config/testthat/edition: | 3 |
| VignetteBuilder: | knitr |
| Encoding: | UTF-8 |
| NeedsCompilation: | no |
| Packaged: | 2026-06-01 20:14:58 UTC; grossetti |
| Author: | Francesco Grossetti |
| Maintainer: | Francesco Grossetti <francesco.grossetti@unibocconi.it> |
| Repository: | CRAN |
| Date/Publication: | 2026-06-08 15:00:14 UTC |
Building augmented data for multi-state models: the msmtools package
Description
msmtools restructures standard longitudinal datasets into augmented transition data for multi-state models fitted with msm. It also provides graphical goodness-of-fit tools for survival curves and state prevalences under the Markov assumption.
Details
The package exposes four public functions: augment(), polish(),
prevplot(), and survplot().
Author(s)
Maintainer: Francesco Grossetti francesco.grossetti@unibocconi.it (ORCID)
Authors:
Francesco Grossetti francesco.grossetti@unibocconi.it (ORCID)
See Also
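Useful links:
- https://github.com/contefranz/msmtools
- Report bugs at https://github.com/contefranz/msmtools/issues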
Build augmented transition data
Description
Reshape standard longitudinal data into augmented transition data suitable for multi-state models fitted with msm.
Usage
augment(
data,
data_key,
n_events,
pattern,
state = c("IN", "OUT", "DEAD"),
t_start,
t_end,
t_cens,
t_death,
t_augmented,
more_status = NULL,
check_NA = FALSE,
copy = FALSE,
verbosity = getOption("msmtools.verbosity", "quiet")
)
Arguments
data |
A |
data_key |
A keying variable used to identify subjects and define a key
for |
n_events |
An integer variable indicating the progressive (monotonic)
event number for each subject. |
pattern |
Either an integer, a factor, or a character variable with 2 or 3 unique values that gives each subject's terminal outcome schema. When 2 values are detected, they must be in the format: 0 = "alive", 1 = "dead". When 3 values are detected, they must be: 0 = "alive", 1 = "dead during a transition", 2 = "dead after a transition has ended" (see Details). |
state |
A character vector of exactly three unique, non-missing,
non-empty labels used as the generated transition-state vocabulary.
Defaults to c("IN", "OUT", "DEAD"). |
t_start |
The starting time of an observation. It can be passed as date, integer, or numeric format. |
t_end |
The ending time of an observation. It can be passed as date, integer, or numeric format. |
t_cens |
The censoring time of the study. This is the date until each ID is observed, if still active in the cohort. |
t_death |
The exact death time of a subject ID. If |
t_augmented |
A variable indicating the name of the new time variable
in the augmented format. If missing, the variable is named augmented (see Value). |
more_status |
A variable that marks further transitions beyond the
default ones given by state (see Details). |
check_NA |
If |
copy |
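If TRUE, augment() works on a copy of data so that the input object remains unchanged (see Details). Defaults to FALSE. |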
verbosity |
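Controls informational output. Use "summary" to request progress output (see Examples). Defaults to getOption("msmtools.verbosity", "quiet"). |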
Details
augment() requires a monotonic event sequence within each subject.
The data are ordered with data.table::setkey() using data_key as the
primary key and t_start as the secondary key. The function then checks the
monotonicity of n_events; if the check fails, it stops and reports the
subjects that violate the condition. If n_events is missing, augment()
first computes a progression number named n_events and then runs the same
check.
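The ordering and monotonicity check described above can be illustrated on a toy table. This is a base R sketch, not the code augment() runs: the column names are hypothetical, and "monotonic" is read here as "strictly increasing within each subject", which is an assumption.

```r
# Toy data; the column names are hypothetical.
visits <- data.frame(
  subj     = c(1L, 1L, 1L, 2L, 2L),
  t_start  = c(20, 10, 30, 5, 15),
  n_events = c(2L, 1L, 3L, 2L, 1L)
)

# Order by subject, then by start time (the role setkey() plays in augment()).
visits <- visits[order(visits$subj, visits$t_start), ]

# A subject passes when its event counter strictly increases over time.
is_monotonic <- tapply(visits$n_events, visits$subj,
                       function(x) all(diff(x) > 0))
violators <- names(is_monotonic)[!is_monotonic]
violators

# Without a supplied counter, a progression number can be derived per subject.
visits$n_events_new <- ave(seq_along(visits$subj), visits$subj, FUN = seq_along)
```

Here subject 2 is reported because its counter decreases once the rows are ordered by t_start.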
Argument pattern describes the terminal outcome schema and must follow the
expected ordering. With two statuses, values must correspond to
0 = "alive" and 1 = "dead". With three statuses, integer values must
correspond to 0 = "alive", 1 = "dead inside a transition", and
2 = "dead outside a transition". Character and factor values must follow
the same order. For example, 0 cannot be used to indicate death.
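One way to make the expected ordering explicit for character labels is to fix the factor levels by hand. The sketch below uses the three labels documented for label_3 in hosp. It only illustrates the ordering rule and is not necessarily how augment() converts the values internally.

```r
# Labels as documented for hosp$label_3.
label_3 <- c("alive", "dead_in", "dead_out", "alive")

# Levels listed in the required order: alive, dead inside, dead outside.
pattern_levels <- c("alive", "dead_in", "dead_out")
pattern_fct <- factor(label_3, levels = pattern_levels)

# Integer coding that matches 0 = alive, 1 = dead inside, 2 = dead outside.
pattern_int <- as.integer(pattern_fct) - 1L
pattern_int
```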
Argument state describes the generated transition-state vocabulary. Its
order also matters. The first element is the state at t_start (for example,
"IN"), the second element is the state at t_end (for example, "OUT"),
and the third element is the absorbing state (for example, "DEAD"). A
two-value pattern still requires three state labels because augment()
infers whether death maps to the absorbing state inside or outside the
transition window.
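The role of each position in state can be illustrated on a toy table. The sketch below is a simplified base R illustration, not the code augment() runs: it only stacks one row per t_start and one row per t_end, and it leaves out the terminal (absorbing state) row that augment() derives from pattern, t_cens, and t_death. All column names are hypothetical.

```r
state <- c("IN", "OUT", "DEAD")

# Two observations for one hypothetical subject.
obs <- data.frame(
  subj    = c(1L, 1L),
  t_start = c(0, 30),
  t_end   = c(10, 45)
)

# Each observation contributes state[1] at t_start and state[2] at t_end.
long <- rbind(
  data.frame(subj = obs$subj, time = obs$t_start, status = state[1]),
  data.frame(subj = obs$subj, time = obs$t_end,   status = state[2])
)
long <- long[order(long$subj, long$time), ]
rownames(long) <- NULL
long
```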
more_status lets augment() represent transitions beyond the defaults in
state. Standard observations that add no extra information should use
"df" for "default" (see Examples, or run ?hosp and inspect rehab_it).
More complex transitions should use concise, self-explanatory labels.
By default, augment() follows data.table by-reference semantics to avoid
unnecessary copies of large longitudinal datasets. This means the input may
have its key changed, and n_events may be added when the argument is
omitted. Set copy = TRUE when the original input object must remain
unchanged.
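The by-reference behaviour comes from data.table itself and can be seen without msmtools. The sketch below uses a small hypothetical table to show that setkey() changes its argument in place, while keying a copy() leaves the original untouched. This is the distinction copy = TRUE is meant to address.

```r
library(data.table)

dt <- data.table(subj = c(2L, 1L, 1L), t_start = c(5, 20, 10))

# Keying a copy leaves the original untouched.
dt_safe <- copy(dt)
setkey(dt_safe, subj, t_start)
key(dt)       # NULL: the original has no key yet
key(dt_safe)  # "subj" "t_start"

# Keying the original sorts its rows in place.
setkey(dt, subj, t_start)
dt$subj       # 1 1 2
```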
The function always returns a data.table. Use as.data.frame() on the
result if a plain data.frame is needed by downstream code.
Value
An augmented dataset of class data.table. Each row represents a
specific transition for a given subject. augment() computes the following
key variables:
- augmented: The transition time variable. If t_augmented is missing, augment() creates augmented by default. The variable is built from t_start and t_end and inherits their class. If t_start is a date, augment() also creates an integer variable named augmented_int. If t_start is a difftime, it creates a numeric variable named augmented_num.
- status: A status flag that contains the states as specified in state. augment() automatically checks whether argument pattern has 2 or 3 unique values and computes the correct structure of a given subject as reported in the vignette. The variable is cast as character.
- status_num: The corresponding integer version of status.
- n_status: A mix of status and n_events cast as character. This is useful when modelling process progression.
If more_status is passed, augment() computes additional variables.
They mirror the meaning of status, status_num, and n_status but they
account for the more complex structure defined. They are: status_exp,
status_exp_num, and n_status_exp.
Author(s)
Francesco Grossetti francesco.grossetti@unibocconi.it.
References
Grossetti, F., Ieva, F., and Paganoni, A.M. (2018). A multi-state approach to patients affected by chronic heart failure. Health Care Management Science, 21, 281-291. doi:10.1007/s10729-017-9400-z.
Jackson, C.H. (2011). Multi-State Models for Panel Data: The msm Package for R. Journal of Statistical Software, 38(8), 1-29. https://www.jstatsoft.org/v38/i08/.
Dowle, M., Srinivasan, A., Short, T., and Lianoglou, S., with contributions from Saporta, R. and Antonyan, E. (2016). data.table: Extension of data.frame. R package version 1.9.6. https://github.com/Rdatatable/data.table/wiki
See Also
data.table::data.table(), data.table::setkey()
Examples
# loading data
data(hosp)
# augmenting hosp
hosp_augmented = augment(data = hosp, data_key = subj, n_events = adm_number,
                         pattern = label_3, t_start = dateIN, t_end = dateOUT,
                         t_cens = dateCENS)
# augmenting hosp by passing more information regarding transitions
# with argument more_status
hosp_augmented_more = augment(data = hosp, data_key = subj, n_events = adm_number,
                              pattern = label_3, t_start = dateIN, t_end = dateOUT,
                              t_cens = dateCENS, more_status = rehab_it)
# requesting progress output
hosp_augmented = augment(data = hosp, data_key = subj, n_events = adm_number,
                         pattern = label_3, t_start = dateIN, t_end = dateOUT,
                         t_cens = dateCENS, verbosity = "summary")
Synthetic Hospital Admissions
Description
A synthetic longitudinal dataset of hospital admissions for 10 subjects. It includes repeated admissions, admission-level clinical flags, demographic variables, and end-of-study status labels.
Usage
data(hosp)
Format
A data.table with 53 rows and 12 variables:
- subj: Subject ID (integer).
- adm_number: Hospital admissions counter (integer).
- gender: Gender of patient (factor with 2 levels: "F" = females, "M" = males).
- age: Age of patient in years at the given observation (integer).
- rehab: Rehabilitation flag. If the admission has been in rehabilitation, then rehab = 1; otherwise rehab = 0 (integer).
- it: Intensive Therapy flag. If the admission has been in intensive therapy, then it = 1; otherwise it = 0 (integer).
- rehab_it: String marking the admission type based on rehab and it. The standard admission is coded as "df" (default). Admissions in rehabilitation or intensive therapy are coded as "rehab" or "it" (character).
- label_2: Subject status at the end of the study. It takes 2 values: "alive" and "dead" (character).
- label_3: Subject status at the end of the study. It takes 3 values: "alive", "dead_in", and "dead_out" (character).
- dateIN: Exact admission date (date).
- dateOUT: Exact discharge date (date).
- dateCENS: Either censoring time or exact death time (date).
Remove observations with different states occurring at the same time
Description
Remove subjects with transitions to different states occurring at the same
exact time in an augmented dataset produced by augment().
Usage
polish(
data,
data_key,
pattern,
time = NULL,
check_NA = FALSE,
copy = FALSE,
verbosity = getOption("msmtools.verbosity", "quiet")
)
Arguments
data |
A |
data_key |
A keying variable used to identify subjects and define a key
for |
pattern |
Either an integer, a factor, or a character variable with 2 or 3 unique values that gives each subject's terminal outcome schema. When 2 values are detected, they must be in the format: 0 = "alive", 1 = "dead". When 3 values are detected, they must be: 0 = "alive", 1 = "dead during a transition", 2 = "dead after a transition has ended" (see Details). |
time |
The time variable used to identify duplicate transition times.
If omitted or set to |
check_NA |
If |
copy |
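If TRUE, polish() works on a copy of data so that the input object remains unchanged (see Details). Defaults to FALSE. |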
verbosity |
Controls informational output. Defaults to getOption("msmtools.verbosity", "quiet"). |
Details
The function searches for cases where two subsequent events for the
same subject land on different states but occur at the same time. When this
happens, the whole subject, as identified by data_key, is removed from the
data. The function reports how many subjects were removed.
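The rule above can be sketched in base R on a toy augmented table. This is an illustration of the idea, not the implementation of polish(): the data are hypothetical, and the columns augmented_int and status simply reuse the names augment() gives to its time and state variables.

```r
# Toy augmented data: subject 2 reaches OUT and DEAD at the same time.
aug <- data.frame(
  subj          = c(1L, 1L, 1L, 2L, 2L, 2L),
  augmented_int = c(0, 5, 9, 0, 4, 4),
  status        = c("IN", "OUT", "IN", "IN", "OUT", "DEAD")
)

# Flag subjects with two consecutive rows at the same time but different states.
flagged <- vapply(split(aug, aug$subj), function(d) {
  d <- d[order(d$augmented_int), ]
  n <- nrow(d)
  if (n < 2L) return(FALSE)
  same_time  <- diff(d$augmented_int) == 0
  diff_state <- d$status[-1] != d$status[-n]
  any(same_time & diff_state)
}, logical(1))

# Remove every row of the flagged subjects, not only the offending rows.
drop_ids <- names(flagged)[flagged]
clean <- aug[!(as.character(aug$subj) %in% drop_ids), ]
length(drop_ids)  # number of subjects removed
```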
By default, polish() follows data.table by-reference semantics to avoid
unnecessary copies of large augmented datasets. This means the input may have
its key changed while duplicate subjects are identified. Set copy = TRUE
when the original input object must remain unchanged.
The function always returns a data.table. Use as.data.frame() on the
result if a plain data.frame is needed by downstream code.
Value
A data.table with the same columns as the input data. Subjects
whose pattern transitions occur at the same time on different states are
removed in full (every row sharing the same data_key); rows from
unaffected subjects are kept as-is. When no duplicated transitions are
found, the input data is returned unchanged.
Author(s)
Francesco Grossetti francesco.grossetti@unibocconi.it.
See Also
Examples
# loading data
data(hosp)
# augmenting longitudinal data
hosp_aug = augment(data = hosp, data_key = subj, n_events = adm_number,
pattern = label_3, t_start = dateIN, t_end = dateOUT,
t_cens = dateCENS)
# cleaning targeted duplicate transitions
hosp_aug_clean = polish(data = hosp_aug, data_key = subj, pattern = label_3)
Plot observed and expected prevalences for a multi-state model
Description
Plot observed and expected state prevalences from a fitted multi-state model. The function can also compute a rough diagnostic for where the data depart from the estimated Markov model.
Usage
prevplot(
x,
prev.obj,
exacttimes = TRUE,
M = FALSE,
ci = FALSE,
print_plot = TRUE,
verbosity = getOption("msmtools.verbosity", "quiet")
)
Arguments
x |
A fitted msm model object. |
prev.obj |
A list computed by msm::prevalence.msm(). |
exacttimes |
If |
M |
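If TRUE, a rough indicator of the deviance from the Markov model is computed and plotted (see Details). Defaults to FALSE. |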
ci |
If |
print_plot |
If TRUE, the plot is printed as a side effect (see Value). Defaults to TRUE. |
verbosity |
Controls informational output. Use |
Details
When M = TRUE, a rough indicator of the deviance from the
Markov model is computed according to Titman and Sharples (2008).
For each state s and each time t_i, the observed count O_is is compared
with the expected count E_is through
M_is = (O_is - E_is)^2 / E_is.
The deviance M plot is returned together with the standard prevalence plot
in the second row. This layout is fixed.
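A worked instance of the formula may help. The counts below are made up for a single time point and three states. The vector names obs and hat mirror the column names of the $prevalence table, but the numbers are purely hypothetical.

```r
# Hypothetical observed and expected counts at one time point.
obs <- c(IN = 40, OUT = 50, DEAD = 10)
hat <- c(IN = 36, OUT = 52, DEAD = 12)

# M_is = (O_is - E_is)^2 / E_is, computed state by state.
M <- (obs - hat)^2 / hat
round(M, 3)
```

Larger values of M mark the states where the observed counts depart most from the fitted model at that time. In this toy case the IN state contributes the most (16 / 36) and OUT the least (4 / 52).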
When M = TRUE, the combined layout is built with patchwork, which is
an optional dependency of msmtools. Install it with
install.packages("patchwork") if it is not already available; prevplot()
raises an informative error otherwise. The default M = FALSE path has no
such requirement.
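A script that should run with or without patchwork can test for it before requesting the deviance plot. The sketch below relies only on base R's requireNamespace(). The prevplot() call is left as a comment because it needs the msm_model and prev objects built in the Examples.

```r
# TRUE only when patchwork is installed; nothing is attached to the search path.
has_patchwork <- requireNamespace("patchwork", quietly = TRUE)

if (!has_patchwork) {
  message("patchwork is not installed: falling back to M = FALSE")
}

# msm_model and prev are assumed to exist (see Examples):
# gof <- prevplot(x = msm_model, prev.obj = prev, M = has_patchwork)
```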
Value
When M = FALSE, a gg/ggplot object with observed and expected
prevalences is returned. When M = TRUE, a patchwork object is returned
with the prevalence plot and the deviance M plot.
The returned object also carries a $prevalence field with the
long-format data.table used to build the plot. It always includes
time, state, obs, and hat; it also includes lwr and upr
when ci = TRUE, and M when M = TRUE. Access it directly:
p <- prevplot(model, prev_obj)
p$prevalence
print_plot only controls whether the plot is printed as a side effect.
Returned objects are unchanged: use print_plot = FALSE to create the plot
silently.
Author(s)
Francesco Grossetti francesco.grossetti@unibocconi.it.
References
Titman, A. and Sharples, L.D. (2010). Model diagnostics for multi-state models, Statistical Methods in Medical Research, 19, 621-651.
Titman, A. and Sharples, L.D. (2008). A general goodness-of-fit test for Markov and hidden Markov models, Statistics in Medicine, 27, 2177-2195.
Gentleman RC, Lawless JF, Lindsey JC, Yan P. (1994). Multi-state Markov models for analysing incomplete disease data with illustrations for HIV disease. Statistics in Medicine, 13:805-821.
Jackson, C.H. (2011). Multi-State Models for Panel Data: The msm Package for R. Journal of Statistical Software, 38(8), 1-29. https://www.jstatsoft.org/v38/i08/.
See Also
msm::plot.prevalence.msm(), msm::msm(),
msm::prevalence.msm()
Examples
data(hosp)
# augmenting the data
hosp_augmented = augment(data = hosp, data_key = subj, n_events = adm_number,
                         pattern = label_3, t_start = dateIN, t_end = dateOUT,
                         t_cens = dateCENS)
# defining the initial transition matrix for the model
Qmat = matrix(data = 0, nrow = 3, ncol = 3, byrow = TRUE)
Qmat[1, 1:3] = 1
Qmat[2, 1:3] = 1
colnames(Qmat) = c('IN', 'OUT', 'DEAD')
rownames(Qmat) = c('IN', 'OUT', 'DEAD')
# fitting the model using gender and age as covariates
library(msm)
msm_model = msm(status_num ~ augmented_int, subject = subj,
                data = hosp_augmented, covariates = ~ gender + age,
                exacttimes = TRUE, gen.inits = TRUE, qmatrix = Qmat,
                method = 'BFGS', control = list(fnscale = 6e+05, trace = 0,
                                                REPORT = 1, maxit = 10000))
# defining the times at which to compute the prevalences
t_min = min(hosp_augmented$augmented_int)
t_max = max(hosp_augmented$augmented_int)
steps = 100L
# computing prevalences
prev = prevalence.msm(msm_model, covariates = 'mean', ci = 'normal',
                      times = seq(t_min, t_max, steps))
# and plotting them using prevplot()
gof = prevplot(x = msm_model, prev.obj = prev, ci = TRUE, M = TRUE)
Plot fitted survival and Kaplan-Meier curves from a multi-state model
Description
Plot fitted survival probabilities from an msm::msm() model and compare
them with Kaplan-Meier estimates. The function can also return the data used
to build each curve.
Usage
survplot(
x,
from = 1,
to = NULL,
range = NULL,
covariates = "mean",
exacttimes = TRUE,
times,
grid = 100L,
km = FALSE,
ci = c("none", "normal", "bootstrap"),
interp = c("start", "midpoint"),
B = 100L,
ci_km = c("none", "plain", "log", "log-log", "logit", "arcsin"),
print_plot = TRUE,
verbosity = getOption("msmtools.verbosity", "quiet"),
...
)
Arguments
x |
A fitted msm model object. |
from |
State from which to compute the estimated survival. Defaults to state 1. |
to |
The absorbing state for which to compute the estimated survival.
Defaults to the highest state found by |
range |
A numeric vector of two elements giving the time range of the plot. |
covariates |
Covariate values for which to evaluate the expected
probabilities. These can be
The unnamed list must follow the order of the covariates in the original model formula. A named list is also accepted:
|
exacttimes |
If |
times |
An optional numeric vector giving the times at which to compute the fitted survival. |
grid |
An integer specifying the grid points at which to compute the
fitted survival curve (see Details). If |
km |
If |
ci |
A character vector with the type of confidence intervals to compute for the fitted
survival curve. Specify either "none", "normal", or "bootstrap". |
interp |
If |
B |
Number of bootstrap or normal replicates for the confidence interval. The default is 100 rather than the usual 1000, since these plots are for rough diagnostic purposes. |
ci_km |
A character vector with the type of confidence intervals to compute for the
Kaplan-Meier curve. Specify either "none", "plain", "log", "log-log", "logit", or "arcsin". |
print_plot |
If TRUE, the plot is printed as a side effect (see Value). Defaults to TRUE. |
verbosity |
Controls informational output. Use |
... |
Reserved for the migration trampoline. Passing the legacy
|
Details
The function wraps msm::plot.survfit.msm() and adds support for
exact-time plots by resetting the time scale to follow-up time. It returns
a gg/ggplot object so the plot composes directly with ggplot2::ggsave(),
ggplot2::theme(), and other ggplot operations.
You can pass custom evaluation times through times, or let survplot()
define them from grid. Larger grid values produce a finer grid and
increase computation time.
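As an illustration of the times route, the sketch below builds an uneven set of evaluation times that is denser early in follow-up. The follow-up window is hypothetical, and sorting and de-duplicating the vector is a precaution rather than a documented requirement. The survplot() call is left as a comment because it needs a fitted model.

```r
# Hypothetical follow-up window, in the same unit as the model's time variable.
t_max <- 1000

# Every 10 units up to 100, then every 100 units up to t_max.
times_custom <- unique(sort(c(seq(0, 100, by = 10), seq(100, t_max, by = 100))))
length(times_custom)

# msm_model is assumed to be a fitted msm object (see Examples):
# p <- survplot(x = msm_model, times = times_custom, km = TRUE)
```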
Value
A gg/ggplot object. The fitted and (when km = TRUE)
Kaplan-Meier data tables are attached to the returned plot as named
fields:
- $fitted: a data.table with columns time, surv, and (when ci is not "none") lwr/upr. Always present.
- $km: a data.table with the Kaplan-Meier curve, exposed only when km = TRUE.
Access the data through the standard $ operator:
p <- survplot(model, km = TRUE)
p         # prints the plot
p$fitted  # fitted survival data
p$km      # Kaplan-Meier data
print_plot only controls whether the plot is printed as a side effect.
Returned objects are unchanged: use print_plot = FALSE to build the plot
and its attached data without printing.
Author(s)
Francesco Grossetti francesco.grossetti@unibocconi.it.
References
Titman, A. and Sharples, L.D. (2010). Model diagnostics for multi-state models, Statistical Methods in Medical Research, 19, 621-651.
Titman, A. and Sharples, L.D. (2008). A general goodness-of-fit test for Markov and hidden Markov models, Statistics in Medicine, 27, 2177-2195.
Jackson, C.H. (2011). Multi-State Models for Panel Data: The msm Package for R. Journal of Statistical Software, 38(8), 1-29. https://www.jstatsoft.org/v38/i08/.
See Also
msm::plot.survfit.msm(), msm::msm(),
msm::pmatrix.msm(), data.table::setDF()
Examples
data(hosp)
# augmenting the data
hosp_augmented = augment(data = hosp, data_key = subj, n_events = adm_number,
                         pattern = label_3, t_start = dateIN, t_end = dateOUT,
                         t_cens = dateCENS)
# defining the initial transition matrix for the model
Qmat = matrix(data = 0, nrow = 3, ncol = 3, byrow = TRUE)
Qmat[1, 1:3] = 1
Qmat[2, 1:3] = 1
colnames(Qmat) = c('IN', 'OUT', 'DEAD')
rownames(Qmat) = c('IN', 'OUT', 'DEAD')
# fitting the model using gender and age as covariates
library(msm)
msm_model = msm(status_num ~ augmented_int, subject = subj,
                data = hosp_augmented, covariates = ~ gender + age,
                exacttimes = TRUE, gen.inits = TRUE, qmatrix = Qmat,
                method = 'BFGS', control = list(fnscale = 6e+05, trace = 0,
                                                REPORT = 1, maxit = 10000))
# plotting the fitted and empirical survival from state = 1
theplot = survplot(x = msm_model, km = TRUE)
# the fitted and Kaplan-Meier data tables are attached to the plot
head(theplot$fitted)
head(theplot$km)