Title: | Building Augmented Data to Run Multi-State Models with 'msm' Package |
---|---|
Description: | A fast and general method for restructuring classical longitudinal data into augmented ones. The reason for this is to facilitate the modeling of longitudinal data under a multi-state framework using the 'msm' package. |
Authors: | Francesco Grossetti [aut, cre] |
Maintainer: | Francesco Grossetti <[email protected]> |
License: | GPL-3 |
Version: | 2.0.1 |
Built: | 2024-10-28 05:21:44 UTC |
Source: | https://github.com/contefranz/msmtools |
A fast and general method for reshaping standard longitudinal data into a new
structure called 'augmented'. This format is suitable for modeling under a
multi-state framework using the msm package.
augment( data, data_key, n_events, pattern, state = list("IN", "OUT", "DEAD"), t_start, t_end, t_cens, t_death, t_augmented, more_status, check_NA = FALSE, convert = FALSE, verbose = TRUE )
data |
A |
data_key |
A keying variable which |
n_events |
An integer variable indicating the progressive (monotonic)
event number of a given ID. |
pattern |
Either an integer, a factor or a character with 2 or 3 unique
values which provides the ID status at the end of the study. |
state |
A list of three and exactly three possible states which a
subject can reach. |
t_start |
The starting time of an observation. It can be passed as date, integer, or numeric format. |
t_end |
The ending time of an observation. It can be passed as date, integer, or numeric format. |
t_cens |
The censoring time of the study. This is the date until each ID is observed, if still active in the cohort. |
t_death |
The exact death time of a subject ID. If |
t_augmented |
A variable indicating the name of the new time variable
of the process in the augmented format. If |
more_status |
A variable which marks further transitions beside the
default ones given by |
check_NA |
If |
convert |
If |
verbose |
If |
To process the data, a monotonically increasing process must be ensured.
First, augment checks this whether or not n_events is missing. The data are
efficiently ordered through the setkey function, with data_key as the primary
key and t_start as the secondary key. Second, it checks the monotonicity of
n_events; if the check fails, it stops with an error and returns the subjects,
as given by data_key, for whom the condition is not met. If n_events is
missing, augment internally computes the progression number under the name
n_events and runs the same procedure.
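The ordering and monotonicity check described above can be sketched in base R. This is only an illustration under stated assumptions: the toy data, the column names, and the use of `order()` in place of `setkey` are hypothetical and do not reproduce the internals of augment.

```r
# Toy longitudinal data: one row per admission (hypothetical values).
toy <- data.frame(
  subj       = c(2L, 1L, 1L, 2L, 1L),
  adm_number = c(1L, 2L, 1L, 2L, 3L),
  t_start    = c(5, 20, 3, 30, 41)
)

# Order by subject (primary key) and start time (secondary key).
toy <- toy[order(toy$subj, toy$t_start), ]

# A subject passes when its event counter strictly increases over time.
is_monotonic <- tapply(toy$adm_number, toy$subj,
                       function(x) all(diff(x) > 0))

# Subjects that would be reported in the error message.
offenders <- names(is_monotonic)[!is_monotonic]

# If the counter were missing, it could be rebuilt as a
# within-subject progression number.
toy$n_events <- ave(toy$t_start, toy$subj, FUN = seq_along)
```

Here every subject passes the check, so `offenders` is empty, and the rebuilt `n_events` matches the original counter.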
Attention needs to be paid to the argument pattern. Integer values can
be 0 and 1 if only two statuses are defined, and they must correspond to the
statuses 'alive' and 'dead'. If three values are defined, they must be 0,
1 and 2 if pattern is an integer, or 'alive', 'dead inside a transition' and
'dead outside a transition' if pattern is either a character or a factor.
The order matters: it is not possible to specify 0 as 'dead', for instance.
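As a rough illustration of the encoding rules for pattern, the helper below checks whether a vector uses an admissible set of values. `check_pattern` is a hypothetical function written for this sketch; it is not part of msmtools, and the package may validate its input differently.

```r
# Hypothetical validator for the 'pattern' encoding (not part of msmtools).
check_pattern <- function(pattern) {
  values <- sort(unique(pattern))
  if (is.numeric(pattern)) {
    # Integers must be exactly 0/1 or 0/1/2, with 0 meaning 'alive'.
    identical(as.integer(values), 0:1) || identical(as.integer(values), 0:2)
  } else {
    # Characters and factors must carry 2 or 3 distinct labels.
    length(values) %in% 2:3
  }
}

check_pattern(c(0L, 1L, 1L, 0L))    # admissible: 0 = alive, 1 = dead
check_pattern(c(1L, 2L))            # not admissible: 0 is missing
check_pattern(c("alive", "dead"))   # admissible: two labels
```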
When passing a list of states, the order is important so that the first element must be the state corresponding to the starting time (i.e. 'IN', inside the hospital), the second element must correspond to the ending time (i.e. 'OUT', outside the hospital), and the third state is the absorbing state (i.e. 'DEAD').
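To make the role of the state order concrete, here is a simplified base R sketch of what an augmented structure could look like for a single subject. It is an assumption-laden illustration, not the algorithm used by augment: the toy data, the column names `augmented` and `status_num`, and the way the absorbing state is appended are chosen only for this example.

```r
states <- list("IN", "OUT", "DEAD")

# One hypothetical subject with two admissions who dies at time 60.
adm <- data.frame(subj = 1L, t_start = c(3, 20), t_end = c(10, 25))
t_death <- 60

# Each admission yields two rows: entering (first state) at t_start
# and leaving (second state) at t_end.
aug <- data.frame(
  subj      = rep(adm$subj, each = 2),
  augmented = as.vector(rbind(adm$t_start, adm$t_end)),
  status    = rep(unlist(states[1:2]), times = nrow(adm)),
  stringsAsFactors = FALSE
)

# The absorbing (third) state closes the sequence.
aug <- rbind(aug,
             data.frame(subj = 1L, augmented = t_death,
                        status = states[[3]], stringsAsFactors = FALSE))

# Integer version of the status, following the order of 'states'.
aug$status_num <- match(aug$status, unlist(states))
```

The resulting rows alternate between 'IN' and 'OUT' and end with 'DEAD', which is why the position of each element in the list matters.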
more_status allows managing multiple transitions besides those already
specified in state. In particular, if the corresponding observation
is a standard admission which adds no information beyond what is inside
state, then more_status must be set to 'df', which stands for
'Default' (see 'Examples' or run ?hosp and look at the variable 'rehab_it').
In general, it is good practice to fully specify each transition
with a short self-explanatory string in order to quickly understand
which is the current transition.
An augmented format dataset of class data.table, or data.frame when convert
is TRUE, where each row represents a specific transition for a given subject.
augment returns it after some important variables have been computed:
augmented |
The new timing variable for the process when looking
at transitions. If |
status |
A status flag which contains the states as specified
in |
status_num |
The corresponding integer version of status. |
n_status |
A mix of |
If more_status is passed, then augment computes some more variables. They
mimic the meaning of status, status_num, and n_status, but they account for
the more complex structure defined. They are: status_exp, status_exp_num,
and n_status_exp.
Francesco Grossetti [email protected].
Jackson, C.H. (2011). Multi-State Models for Panel Data:
The msm Package for R. Journal of Statistical Software, 38(8), 1-29.
URL https://www.jstatsoft.org/v38/i08/.
M. Dowle, A. Srinivasan, T. Short, S. Lianoglou with contributions from
R. Saporta and E. Antonyan (2016):
data.table: Extension of data.frame. R package version 1.9.6
URL https://github.com/Rdatatable/data.table/wiki
# loading data
data( hosp )

# 1.
# augmenting hosp
hosp_augmented = augment( data = hosp, data_key = subj, n_events = adm_number,
                          pattern = label_3, t_start = dateIN, t_end = dateOUT,
                          t_cens = dateCENS )

# 2.
# augmenting hosp by passing more information regarding transitions
# with argument more_status
hosp_augmented_more = augment( data = hosp, data_key = subj, n_events = adm_number,
                               pattern = label_3, t_start = dateIN, t_end = dateOUT,
                               t_cens = dateCENS, more_status = rehab_it )

# 3.
# augmenting hosp and returning a data.frame
hosp_augmented = augment( data = hosp, data_key = subj, n_events = adm_number,
                          pattern = label_3, t_start = dateIN, t_end = dateOUT,
                          t_cens = dateCENS, convert = TRUE )
class( hosp_augmented )
A dataset containing synthetic hospital admissions in the classic longitudinal format. The dataset contains 10 imaginary patients who undergo several hospital (re)admissions. Some demographic and clinical variables are also included.
hosp
A data.table with 53 rows and 12 variables:
subj | Subject ID (integer) |
adm_number | Hospital admissions counter (integer) |
gender | Gender of patient (factor with 2 levels: "F" = females, "M" = males) |
age | Age of patient in years at the given observation (integer) |
rehab | Rehabilitation flag: if the admission has been in rehabilitation, then rehab = 1, else = 0 (integer) |
it | Intensive Therapy flag: if the admission has been in intensive therapy, then it = 1, else = 0 (integer) |
rehab_it | String which marks, in a single variable, the hospital admission type based on rehab and it. The standard admission is coded as "df" (default). If the admission was in rehabilitation or in intensive therapy, rehab_it = "rehab" or "it", respectively (character) |
label_2 | Subject status at the end of the study. It takes 2 values: "alive" and "dead" (character) |
label_3 | Subject status at the end of the study. It takes 3 values: "alive", "dead_in" and "dead_out" (character) |
dateIN | Exact admission date (date) |
dateOUT | Exact discharge date (date) |
dateCENS | Either censoring time or exact death time (date) |
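The relationship between the two flags and the combined admission-type string can be sketched as follows. The toy vectors are hypothetical, and the precedence given to rehabilitation when both flags equal 1 is an assumption of this sketch, since the description above does not cover that case.

```r
# Hypothetical flags for three admissions.
rehab <- c(0L, 1L, 0L)
it    <- c(0L, 0L, 1L)

# "df" marks a standard admission; otherwise the flag name is used.
# Assumption: rehabilitation takes precedence if both flags are 1.
rehab_it <- ifelse(rehab == 1L, "rehab",
                   ifelse(it == 1L, "it", "df"))
```

A variable built this way is the kind of input that can be passed to the more_status argument of augment.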
msmtools package
msmtools introduces a fast and general method for restructuring classical longitudinal datasets
into augmented ones. Augmented data enhance longitudinal datasets and allow each
transition to be modeled under a multi-state framework. msmtools works in symbiosis with the
msm package.
It also provides two graphical goodness-of-fit tools to inspect model performance using
survival curves and prevalences under the Markov assumption.
msmtools comes with 4 functions: augment, polish, prevplot, and survplot.
Fast algorithm to get rid of transitions to different states occurring at
the exact same time in an augmented data structure as computed by augment
(see 'Details').
polish( data, data_key, pattern, time, check_NA = FALSE, convert = FALSE, verbose = TRUE )
data |
A |
data_key |
A keying variable which |
pattern |
Either an integer, a factor or a character with 2 or 3 unique
values which provides the ID status at the end of the study. |
time |
The target time variable to check duplicates. By default it is set to 'augmented_int'. |
check_NA |
If |
convert |
If |
verbose |
If |
The function finds all those cases where two subsequent events for
a given subject land on different states but occur at the same time.
When this happens, the whole subject, as identified by data_key, is
removed from the data. The total number of removed subjects is
printed to keep the user informed.
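The removal rule can be sketched in base R as follows. This is a minimal illustration, assuming hypothetical toy data with the columns `subj`, `augmented_int`, and `status`; it does not reproduce how polish is implemented.

```r
# Toy augmented data: subject 1 reaches 'OUT' and 'DEAD' at the same time.
aug <- data.frame(
  subj          = c(1L, 1L, 1L, 2L, 2L, 2L),
  augmented_int = c(0, 7, 7, 0, 4, 9),
  status        = c("IN", "OUT", "DEAD", "IN", "OUT", "IN"),
  stringsAsFactors = FALSE
)
aug <- aug[order(aug$subj, aug$augmented_int), ]

# Compare each row with the previous one: same subject, same time,
# but a different state.
n <- nrow(aug)
same_subj <- c(FALSE, aug$subj[-1] == aug$subj[-n])
same_time <- c(FALSE, diff(aug$augmented_int) == 0)
new_state <- c(FALSE, aug$status[-1] != aug$status[-n])

bad_subjects <- unique(aug$subj[same_subj & same_time & new_state])
message(length(bad_subjects), " subject(s) removed")

# Drop every row belonging to a flagged subject.
polished <- aug[!aug$subj %in% bad_subjects, ]
```

Only subject 2 survives in `polished`, because subject 1 has two different states recorded at time 7.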
Francesco Grossetti [email protected].
# loading data
data( hosp )

# augmenting longitudinal data
hosp_aug = augment( data = hosp, data_key = subj, n_events = adm_number,
                    pattern = label_3, t_start = dateIN, t_end = dateOUT,
                    t_cens = dateCENS )

# cleaning any targeted occurrence
hosp_aug_clean = polish( data = hosp_aug, data_key = subj, pattern = label_3 )
Provides a graphical indication of goodness of fit of a multi-state model
computed by msm using observed and expected prevalences.
It also computes a rough indicator of where the data depart from the estimated
Markov model.
prevplot(x, prev.obj, exacttimes = TRUE, M = FALSE, ci = FALSE)
x |
A |
prev.obj |
A list computed by |
exacttimes |
If |
M |
If |
ci |
If |
When M = TRUE, a rough indicator of the deviance from the
Markov model is computed according to Titman and Sharples (2008).
A comparison at a given time $t_i$ of a patient $k$ in the state
$s$ between observed counts $O_{is}$ and expected ones $E_{is}$
is built as follows:
$$M_{is} = \frac{(O_{is} - E_{is})^2}{E_{is}}$$
The plot of the deviance M is returned together with the standard prevalence plot in the second row. This is not editable by the user.
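As a rough numerical illustration, the sketch below computes such an indicator from observed and expected counts, assuming a Pearson-type contribution of the form (O - E)^2 / E for each time and state. The counts are invented for this example, and the code is not taken from prevplot.

```r
# Hypothetical observed and expected counts at two times for three states.
states <- c("IN", "OUT", "DEAD")
times  <- c("t1", "t2")

observed <- matrix(c(8, 2, 0,
                     5, 4, 1),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(time = times, state = states))

expected <- matrix(c(7.5, 2.0, 0.5,
                     5.0, 3.0, 2.0),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(time = times, state = states))

# Assumed Pearson-type contribution per time and state.
M <- (observed - expected)^2 / expected
```

Larger entries of `M` point to the times and states where observed counts depart most from the expected ones.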
Francesco Grossetti [email protected].
Titman, A. and Sharples, L.D. (2010). Model diagnostics for
multi-state models, Statistical Methods in Medical Research, 19,
621-651.
Titman, A. and Sharples, L.D. (2008). A general goodness-of-fit test for
Markov and hidden Markov models, Statistics in Medicine, 27,
2177-2195.
Gentleman RC, Lawless JF, Lindsey JC, Yan P. (1994). Multi-state Markov
models for analysing incomplete disease data with illustrations for HIV
disease. Statistics in Medicine, 13:805-821.
Jackson, C.H. (2011). Multi-State Models for Panel Data:
The msm Package for R. Journal of Statistical Software, 38(8), 1-29.
URL https://www.jstatsoft.org/v38/i08/.
plot.prevalence.msm, msm, prevalence.msm
## Not run:
data( hosp )

# augmenting the data
hosp_augmented = augment( data = hosp, data_key = subj, n_events = adm_number,
                          pattern = label_3, t_start = dateIN, t_end = dateOUT,
                          t_cens = dateCENS )

# let's define the initial transition matrix for our model
Qmat = matrix( data = 0, nrow = 3, ncol = 3, byrow = TRUE )
Qmat[ 1, 1:3 ] = 1
Qmat[ 2, 1:3 ] = 1
colnames( Qmat ) = c( 'IN', 'OUT', 'DEAD' )
rownames( Qmat ) = c( 'IN', 'OUT', 'DEAD' )

# attaching the msm package and running the model using
# gender and age as covariates
library( msm )
msm_model = msm( status_num ~ augmented_int, subject = subj,
                 data = hosp_augmented, covariates = ~ gender + age,
                 exacttimes = TRUE, gen.inits = TRUE, qmatrix = Qmat,
                 method = 'BFGS',
                 control = list( fnscale = 6e+05, trace = 0,
                                 REPORT = 1, maxit = 10000 ) )

# defining the times at which to compute the prevalences
t_min = min( hosp_augmented$augmented_int )
t_max = max( hosp_augmented$augmented_int )
steps = 100L

# computing prevalences
prev = prevalence.msm( msm_model, covariates = 'mean', ci = 'normal',
                       times = seq( t_min, t_max, steps ) )

# and plotting them using prevplot()
gof = prevplot( x = msm_model, prev.obj = prev, ci = TRUE, M = TRUE )
## End(Not run)
Plots the fitted survival probability computed from an msm model and
compares it with the Kaplan-Meier estimate. It can also quickly build and
return the underlying data structures.
survplot( x, from = 1, to = NULL, range = NULL, covariates = "mean", exacttimes = TRUE, times, grid = 100L, km = FALSE, out = c("none", "fitted", "km", "all"), ci = c("none", "normal", "bootstrap"), interp = c("start", "midpoint"), B = 100L, ci_km = c("none", "plain", "log", "log-log", "logit", "arcsin") )
x |
A |
from |
State from which to compute the estimated survival. Default to state 1. |
to |
The absorbing state to which compute the estimated survival.
Default to the highest state found by |
range |
A numeric vector of two elements which gives the time range of the plot. |
covariates |
Covariate values for which to evaluate the expected
probabilities. These can either be: the string |
exacttimes |
If |
times |
An optional numeric vector giving the times at which to compute the fitted survival. |
grid |
An integer specifying the grid points at which to compute the fitted
survival (see 'Details').
If |
km |
If |
out |
A character vector specifying what the function has to return. Accepted values are
|
ci |
A character vector with the type of confidence intervals to compute for the fitted
survival curve. Specify either |
interp |
If |
B |
Number of bootstrap or normal replicates for the confidence interval. The default is 100 rather than the usual 1000, since these plots are for rough diagnostic purposes. |
ci_km |
A character vector with the type of confidence intervals to compute for the
Kaplan-Meier curve. Specify either |
The function is a wrapper of plot.survfit.msm and extends it. survplot
correctly manages the plot of a fitted survival in an exact times framework
(when exacttimes = TRUE) by just resetting the time scale and looking at the
follow-up time. It can quickly build and return to the user the data
structures used to compute the Kaplan-Meier and the fitted survival
probability by specifying out = "all".
The user can define custom times (through times) or let survplot choose them
on its own (through grid). In the latter case, survplot looks for the
follow-up time and divides it by grid. The higher grid is, the finer the time
grid will be, so that computing the fitted survival will take longer but will
be more precise.
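One way to picture the effect of grid is an equally spaced sequence over the follow-up time. The sketch below is an assumption about how such a sequence could be built; the follow-up bounds are hypothetical, and survplot may construct its evaluation times differently.

```r
# Hypothetical follow-up window, in days.
follow_up <- c(0, 365)
grid <- 100L

# Split the follow-up into 'grid' equal intervals (grid + 1 time points).
times <- seq(follow_up[1], follow_up[2], length.out = grid + 1L)

# Width of each interval: a larger 'grid' gives a finer resolution.
step <- diff(follow_up) / grid
```

Doubling `grid` halves `step`, which is the trade-off between precision and computing time mentioned above.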
When out = "none", a gg/ggplot object is returned. If out is anything
else, then a named list is returned. The Kaplan-Meier data can be accessed
with $km, while the estimated survival data can be accessed with $fitted.
If out = "all", the plot, the Kaplan-Meier and the estimated curve are
returned.
Francesco Grossetti [email protected].
Titman, A. and Sharples, L.D. (2010). Model diagnostics for
multi-state models, Statistical Methods in Medical Research, 19,
621-651.
Titman, A. and Sharples, L.D. (2008). A general goodness-of-fit test for
Markov and hidden Markov models, Statistics in Medicine, 27,
2177-2195.
Jackson, C.H. (2011). Multi-State Models for Panel Data:
The msm Package for R. Journal of Statistical Software, 38(8), 1-29.
URL https://www.jstatsoft.org/v38/i08/.
plot.survfit.msm, msm, pmatrix.msm, setDF
## Not run:
data( hosp )

# augmenting the data
hosp_augmented = augment( data = hosp, data_key = subj, n_events = adm_number,
                          pattern = label_3, t_start = dateIN, t_end = dateOUT,
                          t_cens = dateCENS )

# let's define the initial transition matrix for our model
Qmat = matrix( data = 0, nrow = 3, ncol = 3, byrow = TRUE )
Qmat[ 1, 1:3 ] = 1
Qmat[ 2, 1:3 ] = 1
colnames( Qmat ) = c( 'IN', 'OUT', 'DEAD' )
rownames( Qmat ) = c( 'IN', 'OUT', 'DEAD' )

# attaching the msm package and running the model using
# gender and age as covariates
library( msm )
msm_model = msm( status_num ~ augmented_int, subject = subj,
                 data = hosp_augmented, covariates = ~ gender + age,
                 exacttimes = TRUE, gen.inits = TRUE, qmatrix = Qmat,
                 method = 'BFGS',
                 control = list( fnscale = 6e+05, trace = 0,
                                 REPORT = 1, maxit = 10000 ) )

# plotting the fitted and empirical survival from state = 1
theplot = survplot( x = msm_model, km = TRUE )

# plotting the fitted and empirical survival from state = 2 and
# returning both the fitted and the empirical curve
out_all = survplot( msm_model, from = 2, km = TRUE, out = "all" )
## End(Not run)