SBMTrees

R-CMD-check License: GPL-2 version R C++

The R package SBMTrees (Sequential imputation with Bayesian Trees Mixed-Effects models) implements a Bayesian non-parametric framework for imputing missing covariates and outcomes in longitudinal data under the Missing at Random (MAR) assumption. Its core model, the Bayesian Trees Mixed-Effects Model (BMTrees), extends Mixed-Effects BART by employing centralized Dirichlet Process (CDP) Normal Mixture priors, allowing it to handle non-normal random effects and errors, address model misspecification, and capture complex relationships. The package also includes two semiparametric variants, BMTrees_R and BMTrees_RE. Built on BMTrees, the longitudinal sequential imputation framework employs a Metropolis-Hastings (M-H) MCMC method to sequentially impute missing values by constructing univariate models in a fixed order, ensuring both simplicity and consistency with a valid joint distribution.

For more details on these models and their applications, please consult the following paper: “Nonparametric Bayesian Additive Regression Trees for Prediction and Missing Data Imputation in Longitudinal Studies”.

Installation

This package depends on Rcpp, RcppArmadillo, and RcppDist. Please make sure these three packages can be installed.
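As a quick pre-flight check, the sketch below reports which of the three dependencies are not yet available in the current R library. It uses only base R; the variable names are illustrative.

```r
# Dependencies named above
deps <- c("Rcpp", "RcppArmadillo", "RcppDist")

# requireNamespace() returns FALSE (without error) when a package is absent
available <- vapply(deps, requireNamespace, logical(1), quietly = TRUE)
missing_deps <- deps[!available]

if (length(missing_deps) > 0) {
  message("Not installed yet: ", paste(missing_deps, collapse = ", "))
} else {
  message("All dependencies are available.")
}
```

Any package listed as missing can then be installed with install.packages() before installing SBMTrees.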

This package can be installed from CRAN:

install.packages("SBMTrees")

or from GitHub:

require("devtools")
install_github("https://github.com/zjg540066169/SBMTrees")
library(SBMTrees)

Models

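This package is based on a mixed-effects model for longitudinal data, in which each outcome is decomposed into a fixed-effects component, subject-level random effects, and a random error.

Different models impose different prior distributions on the random effects and the random errors. We also include the existing model Mixed-Effects BART (mixedBART) in this package.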
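
As a sketch in assumed notation (the symbols here are chosen for illustration and are not taken from the package documentation), a mixed-effects model of this kind for subject i at time point j can be written as:

```latex
Y_{ij} = f(X_{ij}) + Z_{ij}^{\top} b_i + \epsilon_{ij}
```

Here f would be the sum-of-trees (BART) function of the covariates X_ij, b_i the subject-level random effects attached to the random-effects predictors Z_ij, and epsilon_ij the random error. The table below lists the priors each model places on the random effects and the random errors.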

| Model | Prior on random effects | Prior on random errors |
|---|---|---|
| BMTrees | CDP Multivariate Normal Mixture | CDP Normal Mixture |
| BMTrees_R | Multivariate Normal | CDP Normal Mixture |
| BMTrees_RE | CDP Multivariate Normal Mixture | Normal |
| mixedBART | Multivariate Normal | Normal |

Inference is based on posterior samples drawn by Gibbs samplers implemented in C++.

Usage

There are two main functions in this package. BMTrees_prediction is employed to estimate and predict longitudinal outcomes. sequential_imputation is used to multiply-impute longitudinal missing covariates and outcomes.

Prediction

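We first simulate a longitudinal dataset in which each subject has several follow-up time points. As described in the paper, the random effects and random errors can be specified as normal or non-normal. The observations are then randomly split into training and testing sets. The example below simulates 100 subjects with 5 observations each and sets train_prop = 0.5, so half of the observations go to the training set.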

This can be achieved by running the function simulation_prediction_conti. Here is an example:

data = simulation_prediction_conti(train_prop = 0.5, n_subject = 100, n_obs_per_sub = 5, nonlinear = TRUE, residual = "normal", randeff = "skewed_MVN", seed = 123)
X_train = data$X_train # get predictors in training set
Y_train = data$Y_train # get outcomes in training set
Z_train = data$Z_train # get random-effects predictors in training set
subject_id_train = data$subject_id_train # get subject id in training set

X_test = data$X_test # get predictors in testing set
Z_test = data$Z_test # get random-effects predictors in testing set
subject_id_test = data$subject_id_test # get subject id in testing set

Y_test_true = data$Y_test # get ground truth

Once the data are prepared, we can fit the predictive model with the function BMTrees_prediction. The very small nburn and npost values below only keep the example fast; a real analysis needs much longer chains.

model = BMTrees_prediction(
   X_train = X_train,
   Y_train = Y_train,
   Z_train = Z_train,
   subject_id_train = subject_id_train,
   X_test = X_test,
   Z_test = Z_test,
   subject_id_test = subject_id_test,
   model = "BMTrees",
   binary = FALSE,
   nburn = 3L, npost = 4L, skip = 1L, verbose = FALSE, seed = 1234
 )
model$post_predictive_y_test
model$post_sigma

Users can then extract the posterior predictive samples for Y_test (post_predictive_y_test) and the posterior draws of other parameters, such as post_sigma.
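
To illustrate how such posterior predictive samples might be summarized, the sketch below computes posterior means, 95% credible intervals, RMSE, and interval coverage against known outcomes. It runs on a simulated stand-in matrix and assumes the draws are stored with one row per posterior draw and one column per test observation; that layout, and the helper summarize_draws, are assumptions for illustration and are not part of the SBMTrees API.

```r
set.seed(42)
n_draws <- 200  # number of retained posterior draws (stand-in value)
n_test  <- 10   # number of test observations (stand-in value)

# Stand-in for Y_test_true and model$post_predictive_y_test
y_true <- rnorm(n_test)
draws <- matrix(
  rnorm(n_draws * n_test, mean = rep(y_true, each = n_draws), sd = 0.5),
  nrow = n_draws, ncol = n_test
)

# Hypothetical helper: summarize draws column by column
summarize_draws <- function(draws, y_true, level = 0.95) {
  alpha <- 1 - level
  post_mean <- colMeans(draws)
  lower <- apply(draws, 2, quantile, probs = alpha / 2, names = FALSE)
  upper <- apply(draws, 2, quantile, probs = 1 - alpha / 2, names = FALSE)
  list(
    rmse     = sqrt(mean((post_mean - y_true)^2)),
    coverage = mean(y_true >= lower & y_true <= upper),
    summary  = data.frame(mean = post_mean, lower = lower, upper = upper)
  )
}

res <- summarize_draws(draws, y_true)
print(res$rmse)
print(res$coverage)
```

If the real object is laid out the other way round (observations in rows), transposing it with t() before calling the helper would be enough.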

Multiple Imputation

For imputation, we first generate a dataset comprising individuals, each with five follow-up time points. As described in the paper, we can specify whether the random effects and random errors follow normal or non-normal distributions. Different missingness mechanisms are applied to create MAR missing values.

The data with missingness is generated by running the function simulation_imputation. Here is an example:

data <- simulation_imputation(NNY = TRUE, NNX = TRUE, n_subject = 100, seed = 123)
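
Before imputing, it can help to check how much is missing in each column and which covariates are binary. The sketch below does both on a small toy data frame standing in for data$data_M. It assumes that the type argument of sequential_imputation uses 1 for binary and 0 for continuous covariates, which this README does not state explicitly; the helper is_binary is hypothetical.

```r
# Toy stand-in for the covariate columns of data$data_M
toy <- data.frame(
  x_cont = c(0.3, NA, 1.7, -0.4, 2.1),
  x_bin  = c(0, 1, NA, 1, 0),
  x_cnt  = c(1, 2, 3, NA, 5)
)

# Proportion of missing values per column
missing_rate <- colMeans(is.na(toy))

# Hypothetical helper: a column is treated as binary if its observed
# values are all 0 or 1
is_binary <- function(x) {
  vals <- unique(x[!is.na(x)])
  length(vals) <= 2 && all(vals %in% c(0, 1))
}

# Assumed coding: 1 = binary, 0 = continuous
type <- as.integer(vapply(toy, is_binary, logical(1)))

print(missing_rate)
print(type)
```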

Once the data are prepared, we can run the imputation with the function sequential_imputation. Here is an example:

imputed_model <- sequential_imputation(X = data$data_M[,3:14], Y = data$data_M$Y, Z = data$Z,
   subject_id = data$data_M$subject_id, type = c(0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1),
   outcome_model = "BMLM", binary_outcome = FALSE, model = "BMTrees", nburn = 3,
   npost = 4, skip = 2, verbose = FALSE, seed = 123)
imputed_model$imputed_data

The returned imputed_data is a three-dimensional array with dimensions (npost / skip, N, p + 1), where N is the number of observations and p is the number of covariates. With npost = 4 and skip = 2 as above, this gives 2 imputed datasets.
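
Each slice along the first dimension is one completed dataset, so estimates can be computed per slice and then combined. The sketch below pools the mean of a single column with Rubin's rules on a simulated stand-in array of the same shape. The helper pool_column_mean is hypothetical, and its within-imputation variance treats rows as independent, which ignores the within-subject correlation of real longitudinal data; it is a simplification for illustration only.

```r
set.seed(1)
M <- 2; N <- 50; p <- 3  # stand-in sizes: imputations, observations, covariates

# Stand-in for imputed_model$imputed_data with dimensions (M, N, p + 1)
imputed_data <- array(rnorm(M * N * (p + 1)), dim = c(M, N, p + 1))

# Hypothetical helper: pool the mean of one column across imputations
pool_column_mean <- function(arr, col) {
  M <- dim(arr)[1]
  est <- numeric(M)     # per-imputation estimates
  within <- numeric(M)  # per-imputation variances of the estimate
  for (m in seq_len(M)) {
    x <- arr[m, , col]
    est[m] <- mean(x)
    within[m] <- var(x) / length(x)
  }
  q_bar <- mean(est)                  # pooled estimate
  u_bar <- mean(within)               # within-imputation variance
  b <- if (M > 1) var(est) else 0     # between-imputation variance
  total <- u_bar + (1 + 1 / M) * b    # Rubin's total variance
  c(estimate = q_bar, se = sqrt(total))
}

pooled <- pool_column_mean(imputed_data, col = 1)
print(pooled)
```

The same pattern applies to regression coefficients: fit the analysis model on each slice, then pool the estimates and their variances in the same way.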

Attribution

This package includes code derived from the BART3 package, originally developed by Rodney Sparapani.

The original source code, licensed under the GNU General Public License version 2 (GPL-2), has been modified as follows:

- We include part of the C++ code in BART3, primarily the functions related to wbart and cpwart. We also modified some files so that our package compiles successfully.
- Modifications were made by Jungang Zou, 2024.

Licensing

SBMTrees is released under the GPL-2 license.