Sometimes it is desirable to simulate correlated data from a
correlation matrix directly. For example, a simulation might require two
random effects (e.g. a random intercept and a random slope). Correlated
data like this could be generated using the defData
functionality, but it may be more natural to do this with
genCorData
or addCorData
. Currently, simstudy
can only generate multivariate normal using these functions.
Correlated data: additional distributions
Two additional functions facilitate the generation of correlated data
from binomial, poisson, gamma, and
uniform distributions: genCorGen
and
addCorGen
.
genCorGen
is an extension of genCorData
.
These functions draw on copula-based methods to generate the data. (This
Wikipedia
page provides a general introduction and copula-based modeling can
be conducted in R
using package copula.) In the
first example, we are generating data from a multivariate Poisson
distribution. We start by specifying the mean of the Poisson
distribution for each new variable, and then we specify the correlation
structure, just as we did with the normal distribution.
l <- c(8, 10, 12) # lambda for each new variable
dx <- genCorGen(1000, nvars = 3, params1 = l, dist = "poisson", rho = .3, corstr = "cs", wide = TRUE)
dx
## Key: <id>
## id V1 V2 V3
## <int> <num> <num> <num>
## 1: 1 5 16 13
## 2: 2 9 9 6
## 3: 3 7 11 18
## 4: 4 11 14 12
## 5: 5 10 8 15
## ---
## 996: 996 3 2 5
## 997: 997 6 14 11
## 998: 998 6 8 12
## 999: 999 10 12 11
## 1000: 1000 9 9 12
round(cor(as.matrix(dx[, .(V1, V2, V3)])), 2)
## V1 V2 V3
## V1 1.0 0.30 0.30
## V2 0.3 1.00 0.24
## V3 0.3 0.24 1.00
We can also generate correlated binary data by specifying the
probabilities:
genCorGen(1000, nvars = 3, params1 = c(.3, .5, .7), dist = "binary", rho = .8, corstr = "cs", wide = TRUE)
## Key: <id>
## id V1 V2 V3
## <int> <num> <num> <num>
## 1: 1 0 1 1
## 2: 2 0 1 1
## 3: 3 0 0 1
## 4: 4 1 1 1
## 5: 5 0 1 1
## ---
## 996: 996 0 0 1
## 997: 997 1 1 1
## 998: 998 0 0 0
## 999: 999 0 0 1
## 1000: 1000 0 0 0
The gamma distribution requires two parameters - the mean and
dispersion. (These are converted into shape and rate parameters more
commonly used.)
dx <- genCorGen(1000, nvars = 3, params1 = l, params2 = c(1,1,1), dist = "gamma", rho = .7, corstr = "cs", wide = TRUE, cnames="a, b, c")
dx
## Key: <id>
## id a b c
## <int> <num> <num> <num>
## 1: 1 4.137889e+00 1.9736693 5.73317661
## 2: 2 6.230611e-04 0.1790216 0.01098133
## 3: 3 9.554613e+00 21.3956071 30.07914569
## 4: 4 1.053229e+01 6.8598915 7.47104860
## 5: 5 2.556925e+01 22.8862611 17.32239223
## ---
## 996: 996 2.635737e+00 0.9269903 1.22333746
## 997: 997 1.638308e+00 12.0692638 13.01943662
## 998: 998 3.492819e+00 4.1504352 2.37403911
## 999: 999 9.336809e+00 21.2184483 25.17933311
## 1000: 1000 2.044966e+01 32.3326247 23.81715119
round(cor(as.matrix(dx[, .(a, b, c)])), 2)
## a b c
## a 1.00 0.63 0.67
## b 0.63 1.00 0.67
## c 0.67 0.67 1.00
These data sets can be generated in either wide or
long form. So far, we have generated wide form data,
where there is one row per unique id. Now, we will generate data using
the long form, where the correlated data are on different rows,
so that there are repeated measurements for each id. An id will have
multiple records (i.e. one id will appear on multiple rows):
dx <- genCorGen(1000, nvars = 3, params1 = l, params2 = c(1,1,1), dist = "gamma", rho = .7, corstr = "cs", wide = FALSE, cnames="NewCol")
dx
## Key: <id>
## id period NewCol
## <int> <num> <num>
## 1: 1 0 0.08868527
## 2: 1 1 0.17558015
## 3: 1 2 0.35553817
## 4: 2 0 2.41522425
## 5: 2 1 0.99489378
## ---
## 2996: 999 1 4.62541703
## 2997: 999 2 0.73199287
## 2998: 1000 0 3.52197152
## 2999: 1000 1 7.43262675
## 3000: 1000 2 8.36619208
addCorGen
allows us to create correlated data from an
existing data set, as one can already do using addCorData
.
In the case of addCorGen
, the parameter(s) used to define
the distribution are created as a field (or fields) in the dataset. The
correlated data are added to the existing data set. In the example
below, we are going to generate three sets (poisson, binary, and gamma)
of correlated data with means that are a function of the variable
xbase
, which varies by id.
First we define the data and generate a data set:
def <- defData(varname = "xbase", formula = 5, variance = .2, dist = "gamma", id = "cid")
def <- defData(def, varname = "lambda", formula = ".5 + .1*xbase", dist="nonrandom", link = "log")
def <- defData(def, varname = "p", formula = "-2 + .3*xbase", dist="nonrandom", link = "logit")
def <- defData(def, varname = "gammaMu", formula = ".5 + .2*xbase", dist="nonrandom", link = "log")
def <- defData(def, varname = "gammaDis", formula = 1, dist="nonrandom")
dt <- genData(10000, def)
dt
## Key: <cid>
## cid xbase lambda p gammaMu gammaDis
## <int> <num> <num> <num> <num> <num>
## 1: 1 1.546326 1.924435 0.1771026 2.246257 1
## 2: 2 5.689908 2.912439 0.4272628 5.144775 1
## 3: 3 5.059867 2.734604 0.3817705 4.535672 1
## 4: 4 4.599528 2.611573 0.3497493 4.136730 1
## 5: 5 2.402442 2.096447 0.2176749 2.665758 1
## ---
## 9996: 9996 3.610769 2.365707 0.2856166 3.394491 1
## 9997: 9997 4.984305 2.714019 0.3764348 4.467643 1
## 9998: 9998 5.122724 2.751847 0.3862310 4.593052 1
## 9999: 9999 3.393940 2.314964 0.2725312 3.250432 1
## 10000: 10000 7.722561 3.568895 0.5785365 7.725390 1
The Poisson distribution has a single parameter, lambda:
dtX1 <- addCorGen(dtOld = dt, idvar = "cid", nvars = 3, rho = .1, corstr = "cs",
dist = "poisson", param1 = "lambda", cnames = "a, b, c")
dtX1
## Key: <cid>
## cid xbase lambda p gammaMu gammaDis a b c
## <int> <num> <num> <num> <num> <num> <num> <num> <num>
## 1: 1 1.546326 1.924435 0.1771026 2.246257 1 2 2 2
## 2: 2 5.689908 2.912439 0.4272628 5.144775 1 1 0 2
## 3: 3 5.059867 2.734604 0.3817705 4.535672 1 4 0 2
## 4: 4 4.599528 2.611573 0.3497493 4.136730 1 3 1 3
## 5: 5 2.402442 2.096447 0.2176749 2.665758 1 5 4 1
## ---
## 9996: 9996 3.610769 2.365707 0.2856166 3.394491 1 1 2 2
## 9997: 9997 4.984305 2.714019 0.3764348 4.467643 1 3 2 4
## 9998: 9998 5.122724 2.751847 0.3862310 4.593052 1 7 2 1
## 9999: 9999 3.393940 2.314964 0.2725312 3.250432 1 6 0 4
## 10000: 10000 7.722561 3.568895 0.5785365 7.725390 1 12 1 3
The Bernoulli (binary) distribution has a single parameter, p:
dtX2 <- addCorGen(dtOld = dt, idvar = "cid", nvars = 4, rho = .4, corstr = "ar1",
dist = "binary", param1 = "p")
dtX2
## Key: <cid>
## cid xbase lambda p gammaMu gammaDis V1 V2 V3
## <int> <num> <num> <num> <num> <num> <num> <num> <num>
## 1: 1 1.546326 1.924435 0.1771026 2.246257 1 0 0 1
## 2: 2 5.689908 2.912439 0.4272628 5.144775 1 0 0 0
## 3: 3 5.059867 2.734604 0.3817705 4.535672 1 1 0 1
## 4: 4 4.599528 2.611573 0.3497493 4.136730 1 0 0 0
## 5: 5 2.402442 2.096447 0.2176749 2.665758 1 0 1 0
## ---
## 9996: 9996 3.610769 2.365707 0.2856166 3.394491 1 0 0 0
## 9997: 9997 4.984305 2.714019 0.3764348 4.467643 1 0 1 1
## 9998: 9998 5.122724 2.751847 0.3862310 4.593052 1 0 0 0
## 9999: 9999 3.393940 2.314964 0.2725312 3.250432 1 0 0 0
## 10000: 10000 7.722561 3.568895 0.5785365 7.725390 1 0 0 1
## V4
## <num>
## 1: 0
## 2: 1
## 3: 1
## 4: 1
## 5: 0
## ---
## 9996: 0
## 9997: 0
## 9998: 1
## 9999: 0
## 10000: 0
The Gamma distribution has two parameters - in simstudy
the mean and dispersion are specified:
dtX3 <- addCorGen(dtOld = dt, idvar = "cid", nvars = 4, rho = .4, corstr = "cs",
dist = "gamma", param1 = "gammaMu", param2 = "gammaDis")
dtX3
## Key: <cid>
## cid xbase lambda p gammaMu gammaDis V1
## <int> <num> <num> <num> <num> <num> <num>
## 1: 1 1.546326 1.924435 0.1771026 2.246257 1 4.231680194
## 2: 2 5.689908 2.912439 0.4272628 5.144775 1 6.787710358
## 3: 3 5.059867 2.734604 0.3817705 4.535672 1 3.707544002
## 4: 4 4.599528 2.611573 0.3497493 4.136730 1 0.348766326
## 5: 5 2.402442 2.096447 0.2176749 2.665758 1 2.493923583
## ---
## 9996: 9996 3.610769 2.365707 0.2856166 3.394491 1 0.002486867
## 9997: 9997 4.984305 2.714019 0.3764348 4.467643 1 3.210241942
## 9998: 9998 5.122724 2.751847 0.3862310 4.593052 1 9.556894110
## 9999: 9999 3.393940 2.314964 0.2725312 3.250432 1 1.349413306
## 10000: 10000 7.722561 3.568895 0.5785365 7.725390 1 2.404109193
## V2 V3 V4
## <num> <num> <num>
## 1: 1.2176380 0.4222348 2.1937488
## 2: 4.6349836 6.5642077 3.1765033
## 3: 1.5053746 1.1395938 1.0282041
## 4: 7.7065647 15.1843651 1.5413112
## 5: 0.6724540 2.0454487 1.5117235
## ---
## 9996: 0.4825269 0.6716621 0.6721581
## 9997: 0.2103485 4.9869686 3.8444291
## 9998: 3.5482244 10.5174231 9.9577729
## 9999: 2.3406277 1.7019004 0.8970987
## 10000: 2.7987448 3.6667012 12.1680528
If we have data in long form (e.g. longitudinal data), the
function will recognize the structure:
def <- defData(varname = "xbase", formula = 5, variance = .4, dist = "gamma", id = "cid")
def <- defData(def, "nperiods", formula = 3, dist = "noZeroPoisson")
def2 <- defDataAdd(varname = "lambda", formula = ".5+.5*period + .1*xbase", dist="nonrandom", link = "log")
dt <- genData(1000, def)
dtLong <- addPeriods(dt, idvars = "cid", nPeriods = 3)
dtLong <- addColumns(def2, dtLong)
dtLong
## Key: <timeID>
## cid period xbase nperiods timeID lambda
## <int> <int> <num> <num> <int> <num>
## 1: 1 0 7.053471 2 1 3.337917
## 2: 1 1 7.053471 2 2 5.503295
## 3: 1 2 7.053471 2 3 9.073400
## 4: 2 0 2.185136 3 4 2.051382
## 5: 2 1 2.185136 3 5 3.382157
## ---
## 2996: 999 1 9.702454 1 2996 7.172436
## 2997: 999 2 9.702454 1 2997 11.825348
## 2998: 1000 0 3.044209 4 2998 2.235402
## 2999: 1000 1 3.044209 4 2999 3.685554
## 3000: 1000 2 3.044209 4 3000 6.076452
### Generate the data
dtX3 <- addCorGen(dtOld = dtLong, idvar = "cid", nvars = 3, rho = .6, corstr = "cs",
dist = "poisson", param1 = "lambda", cnames = "NewPois")
dtX3
## Key: <cid>
## cid period xbase nperiods timeID lambda NewPois
## <int> <int> <num> <num> <int> <num> <num>
## 1: 1 0 7.053471 2 1 3.337917 5
## 2: 1 1 7.053471 2 2 5.503295 6
## 3: 1 2 7.053471 2 3 9.073400 11
## 4: 2 0 2.185136 3 4 2.051382 2
## 5: 2 1 2.185136 3 5 3.382157 1
## ---
## 2996: 999 1 9.702454 1 2996 7.172436 10
## 2997: 999 2 9.702454 1 2997 11.825348 11
## 2998: 1000 0 3.044209 4 2998 2.235402 4
## 2999: 1000 1 3.044209 4 2999 3.685554 7
## 3000: 1000 2 3.044209 4 3000 6.076452 9
We can fit a generalized estimating equation (GEE) model and examine
the coefficients and the working correlation matrix. They match closely
to the data generating parameters:
geefit <- gee(NewPois ~ period + xbase, data = dtX3, id = cid, family = poisson, corstr = "exchangeable")
## Beginning Cgee S-function, @(#) geeformula.q 4.13 98/01/27
## running glm to get initial regression estimate
## (Intercept) period xbase
## 0.4915131 0.4929285 0.1030995
round(summary(geefit)$working.correlation, 2)
## [,1] [,2] [,3]
## [1,] 1.00 0.56 0.56
## [2,] 0.56 1.00 0.56
## [3,] 0.56 0.56 1.00