Simulate Correlated Variables

The rnorm_multi() function makes multiple normally distributed vectors with specified means, SDs, and correlations.

Quick example

The following creates a sample of 100 observations of 3 variables, drawn from a population where A has a mean of 0 and SD of 1, while B and C have means of 20 and SDs of 5. A correlates with B and with C at r = 0.5, and B and C correlate at r = 0.25.


library(faux)  # provides rnorm_multi() and rnorm_pre()

dat <- rnorm_multi(n = 100, 
                  mu = c(0, 20, 20),
                  sd = c(1, 5, 5),
                  r = c(0.5, 0.5, 0.25), 
                  varnames = c("A", "B", "C"),
                  empirical = FALSE)

n	var	A	B	C	mean	sd
100	A	1.00	0.49	0.51	-0.04	1.04
100	B	0.49	1.00	0.19	19.95	4.91
100	C	0.51	0.19	1.00	19.64	4.61

Table: Sample stats
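
To make the population in this example concrete, the sketch below builds the correlation and covariance matrices implied by the mu, sd, and r values above, using base R only. It assumes the three values in r refer to the pairs A-B, A-C, and B-C in that order, as the description above states. The object names (var_names, pop_sd, pop_r, pop_cov) are made up for illustration.

```r
# Population parameters from the quick example
var_names <- c("A", "B", "C")
pop_sd <- c(1, 5, 5)

# Correlation matrix: A-B = 0.5, A-C = 0.5, B-C = 0.25
pop_r <- matrix(c(1,   0.5,  0.5,
                  0.5, 1,    0.25,
                  0.5, 0.25, 1),
                nrow = 3, byrow = TRUE,
                dimnames = list(var_names, var_names))

# Covariance of each pair = SD_i * r_ij * SD_j
pop_cov <- diag(pop_sd) %*% pop_r %*% diag(pop_sd)
dimnames(pop_cov) <- dimnames(pop_r)
pop_cov
```

The sample statistics in the table differ slightly from these population values because empirical = FALSE draws a random sample.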

Specify correlations

You can specify the correlations in one of four ways:

A single r for all pairs
A vars by vars matrix
A vars*vars length vector
A vars*(vars-1)/2 length vector

One Number

If you want all the pairs to have the same correlation, just specify a single number.

# Positional arguments, in order: n = 100, vars = 5, mu = 0, sd = 1, r = .3
bvn <- rnorm_multi(100, 5, 0, 1, .3, varnames = letters[1:5])

n	var	a	b	c	d	e	mean	sd
100	a	1.00	0.18	0.29	0.33	0.31	0.04	1.03
100	b	0.18	1.00	0.18	0.33	0.30	0.13	1.06
100	c	0.29	0.18	1.00	0.14	0.20	0.07	0.99
100	d	0.33	0.33	0.14	1.00	0.28	0.15	1.06
100	e	0.31	0.30	0.20	0.28	1.00	0.03	1.03

Table: Sample stats from a single rho

Matrix

If you already have a correlation matrix, such as the output of cor(), you can specify the simulated data with that.

cmat <- cor(iris[,1:4])
bvn <- rnorm_multi(100, 4, 0, 1, cmat, 
                  varnames = colnames(cmat))

n	var	Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	mean	sd
100	Sepal.Length	1.00	-0.24	0.87	0.82	0.09	0.98
100	Sepal.Width	-0.24	1.00	-0.58	-0.52	0.07	1.08
100	Petal.Length	0.87	-0.58	1.00	0.96	0.04	1.03
100	Petal.Width	0.82	-0.52	0.96	1.00	0.05	1.04

Table: Sample stats from a correlation matrix

Vector (vars*vars)

You can specify your correlation matrix by hand as a vars*vars length vector, including the 1s on the diagonal.

cmat <- c(1, .3, .5,
          .3, 1, 0,
          .5, 0, 1)
bvn <- rnorm_multi(100, 3, 0, 1, cmat, 
                  varnames = c("first", "second", "third"))

n	var	first	second	third	mean	sd
100	first	1.00	0.31	0.48	0.05	1.02
100	second	0.31	1.00	0.01	-0.14	0.86
100	third	0.48	0.01	1.00	0.02	1.12

Table: Sample stats from a vars*vars vector
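
A hand-typed vector is easy to get wrong, so it can help to reshape it into a matrix and check it before simulating. The following base R sketch is one way to do that; the name cmat_check is hypothetical.

```r
cmat <- c(1, .3, .5,
          .3, 1, 0,
          .5, 0, 1)

# Reshape the length-9 vector into a 3 x 3 matrix
cmat_check <- matrix(cmat, nrow = 3, byrow = TRUE)

isSymmetric(cmat_check)      # TRUE when the mirrored entries match
all(diag(cmat_check) == 1)   # TRUE when the diagonal is all 1s
```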

Vector (vars*(vars-1)/2)

You can specify your correlation matrix by hand as a vars*(vars-1)/2 length vector, skipping the diagonal and the duplicate values in the lower-left triangle.

rho1_2 <- .3
rho1_3 <- .5
rho1_4 <- .5
rho2_3 <- .2
rho2_4 <- 0
rho3_4 <- -.3
cmat <- c(rho1_2, rho1_3, rho1_4, rho2_3, rho2_4, rho3_4)
bvn <- rnorm_multi(100, 4, 0, 1, cmat, 
                  varnames = letters[1:4])

n	var	a	b	c	d	mean	sd
100	a	1.00	0.29	0.61	0.41	-0.10	1.06
100	b	0.29	1.00	0.23	-0.03	0.09	1.14
100	c	0.61	0.23	1.00	-0.28	0.08	1.17
100	d	0.41	-0.03	-0.28	1.00	-0.12	0.97

Table: Sample stats from a (vars*(vars-1)/2) vector
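
To see which pair each position in the short vector refers to, the base R sketch below expands the six values from the example into the full 4 x 4 matrix. It assumes the pair order shown in the code above (1-2, 1-3, 1-4, 2-3, 2-4, 3-4). The name full_mat is hypothetical, and this is not necessarily how rnorm_multi() does the conversion internally.

```r
# rho1_2, rho1_3, rho1_4, rho2_3, rho2_4, rho3_4
cmat <- c(.3, .5, .5, .2, 0, -.3)
vars <- 4

full_mat <- matrix(0, nrow = vars, ncol = vars,
                   dimnames = list(letters[1:vars], letters[1:vars]))

# lower.tri() is filled column by column, which matches the pair order above
full_mat[lower.tri(full_mat)] <- cmat

# Mirror into the upper triangle and put 1s on the diagonal
full_mat <- full_mat + t(full_mat)
diag(full_mat) <- 1
full_mat
```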

empirical

If you want your samples to have the exact correlations, means, and SDs you entered, set empirical to TRUE.

bvn <- rnorm_multi(100, 5, 0, 1, .3, 
                  varnames = letters[1:5], 
                  empirical = TRUE)

n	var	a	b	c	d	e	sd
100	a	1.0	0.3	0.3	0.3	0.3	1
100	b	0.3	1.0	0.3	0.3	0.3	1
100	c	0.3	0.3	1.0	0.3	0.3	1
100	d	0.3	0.3	0.3	1.0	0.3	1
100	e	0.3	0.3	0.3	0.3	1.0	1

Table: Sample stats with empirical = TRUE
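
For intuition about what exact sample statistics involve, here is a minimal base R sketch of one standard way to get them. It centres and whitens random normal data so that its sample covariance is exactly the identity, then rescales it with the Cholesky factor of the target matrix. This illustrates the general idea and is not a copy of rnorm_multi()'s internals; the names target, z, and exact are hypothetical.

```r
set.seed(8675309)  # arbitrary seed, only for reproducibility

n <- 100
vars <- 5
target <- matrix(0.3, nrow = vars, ncol = vars)  # r = .3 for every pair
diag(target) <- 1

z <- matrix(rnorm(n * vars), nrow = n)
z <- scale(z, center = TRUE, scale = FALSE)  # sample means become exactly 0
z <- z %*% solve(chol(cov(z)))               # sample covariance becomes the identity
exact <- z %*% chol(target)                  # impose the target correlations

colnames(exact) <- letters[1:vars]
round(cor(exact), 2)
```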

Pre-existing variables

Use rnorm_pre() to create a vector with a specified correlation to one or more pre-existing variables. The following code creates a new column called B with a mean of 10, SD of 2, and a correlation of r = 0.5 to the A column.

library(dplyr)  # provides %>% and mutate()

dat <- rnorm_multi(varnames = "A") %>%
  mutate(B = rnorm_pre(A, mu = 10, sd = 2, r = 0.5))

n	var	A	B	mean	sd
100	A	1.00	0.37	-0.03	1.10
100	B	0.37	1.00	10.02	2.28

Set empirical = TRUE to return a vector with the exact specified parameters.

dat$C <- rnorm_pre(dat$A, mu = 10, sd = 2, r = 0.5, empirical = TRUE)

n	var	A	B	C	mean	sd
100	A	1.00	0.37	0.50	-0.03	1.10
100	B	0.37	1.00	0.15	10.02	2.28
100	C	0.50	0.15	1.00	10.00	2.00
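
As a rough illustration of how a new vector can hit an exact correlation with an existing one, the base R sketch below mixes the standardised existing vector with noise that has been made exactly uncorrelated with it. This is one textbook construction, offered as an assumption about the general idea rather than a description of what rnorm_pre() does internally. The names x, noise, e, zx, ze, and y are hypothetical.

```r
set.seed(90210)  # arbitrary seed, only for reproducibility

x <- rnorm(100)      # the pre-existing variable
r <- 0.5             # target correlation
noise <- rnorm(100)

# Residuals from regressing noise on x have zero sample correlation with x
e <- resid(lm(noise ~ x))

# Standardise both parts, mix them, then rescale to mean 10 and SD 2
zx <- as.vector(scale(x))
ze <- as.vector(scale(e))
y <- 10 + 2 * (r * zx + sqrt(1 - r^2) * ze)

c(r = cor(x, y), mean = mean(y), sd = sd(y))
```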

You can also specify correlations to more than one vector by setting the first argument to a data frame containing only the continuous columns, and r to a vector with one correlation per column.

dat$D <- rnorm_pre(dat, r = c(.1, .2, .3), empirical = TRUE)

n	var	A	B	C	D	mean	sd
100	A	1.00	0.37	0.50	0.1	-0.03	1.10
100	B	0.37	1.00	0.15	0.2	10.02	2.28
100	C	0.50	0.15	1.00	0.3	10.00	2.00
100	D	0.10	0.20	0.30	1.0	0.00	1.00

Not all correlation patterns are possible, so you’ll get a warning like the one below if the correlations you ask for are impossible.

dat$E <- rnorm_pre(dat, r = .9)
#> Warning in rnorm_pre(dat, r = 0.9): Correlations are impossible.
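
One way to see why this request fails is to assemble the full correlation matrix it implies and check its eigenvalues: a valid correlation matrix cannot have a negative eigenvalue. The base R sketch below uses the rounded sample correlations among A to D from the table above and adds E with r = .9 to each. The names existing and requested are hypothetical, and rnorm_pre()'s own check may differ in detail.

```r
# Rounded sample correlations among A, B, C, D (from the table above)
existing <- matrix(c(1.00, 0.37, 0.50, 0.10,
                     0.37, 1.00, 0.15, 0.20,
                     0.50, 0.15, 1.00, 0.30,
                     0.10, 0.20, 0.30, 1.00),
                   nrow = 4, byrow = TRUE)

# Add a fifth variable E that correlates .9 with each existing column
requested <- rbind(cbind(existing, 0.9), c(rep(0.9, 4), 1))

# Smallest eigenvalue of each matrix; a negative value means the
# correlation pattern cannot occur in real data
min(eigen(existing, symmetric = TRUE)$values)
min(eigen(requested, symmetric = TRUE)$values)
```

Intuitively, a variable cannot correlate .9 with four columns that are only weakly related to each other.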