NumericEnsembles

The goal of NumericEnsembles is to automatically conduct a thorough analysis of numeric data. The user only needs to provide the data and answer a few questions (such as which column to analyze). NumericEnsembles fits 18 individual models to the training data, makes predictions with each one, and checks each model's accuracy. It also builds 14 ensemble models: each is fit to the training portion of the ensemble data, then used to make predictions, and its accuracy is tracked. The package also automatically returns 30 plots (such as train vs holdout for the best model), 6 tables (such as head of the data), and a grand summary table sorted by accuracy, with the best model at the top of the report.

Installation

You can install the development version of NumericEnsembles like so:

devtools::install_github("InfiniteCuriosity/NumericEnsembles")

Example

NumericEnsembles will automatically build 32 models to predict the median value of homes in the Boston area (medv, column 14 of the Boston housing data set).

library(NumericEnsembles)
Numeric(data = MASS::Boston,
        colnum = 14,
        numresamples = 2,
        remove_VIF_above = 5.00,
        remove_ensemble_correlations_greater_than = 1.00,
        scale_all_predictors_in_data = "N",
        data_reduction_method = 0,
        ensemble_reduction_method = 0,
        how_to_handle_strings = 0,
        predict_on_new_data = "N",
        save_all_trained_models = "N",
        set_seed = "N",
        save_all_plots = "N",
        use_parallel = "Y",
        train_amount = 0.60,
        test_amount = 0.20,
        validation_amount = 0.20)
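The remove_VIF_above argument suggests that predictors whose variance inflation factor (VIF) exceeds the threshold are dropped before modeling. As a rough illustration only, the sketch below computes VIF in base R in the conventional way (1 / (1 - R^2) from regressing each predictor on the others) and drops the worst predictor until all remaining values are at or below the threshold. The helper names vif_base and drop_high_vif are hypothetical, the built-in mtcars data stands in for real data, and the drop-one-at-a-time rule is an assumption, not necessarily what the package does internally.

```r
# Hypothetical helper: VIF for each predictor, computed as 1 / (1 - R^2)
# from regressing that predictor on all the other predictors.
vif_base <- function(predictors) {
  sapply(names(predictors), function(v) {
    others <- setdiff(names(predictors), v)
    fit <- lm(reformulate(others, response = v), data = predictors)
    1 / (1 - summary(fit)$r.squared)
  })
}

# Hypothetical helper: repeatedly drop the predictor with the highest VIF
# until every remaining VIF is <= threshold (or only two predictors remain).
drop_high_vif <- function(predictors, threshold = 5) {
  while (ncol(predictors) > 2) {
    v <- vif_base(predictors)
    if (max(v) <= threshold) break
    worst <- names(v)[which.max(v)]
    predictors <- predictors[, names(predictors) != worst, drop = FALSE]
  }
  predictors
}

# Built-in mtcars stands in for the user's data (predictor columns only).
x <- mtcars[, c("disp", "hp", "wt", "qsec", "drat")]
kept <- drop_high_vif(x, threshold = 5)
print(round(vif_base(kept), 2))
```

A lower threshold removes more predictors; a very high threshold effectively disables the filter.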

The 32 models, all of which are built automatically and without error, are:

Bagging
BayesGLM
BayesRNN
Cubist
Earth
Elastic (optimized by cross-validation)
Ensemble Bagging
Ensemble BayesGLM
Ensemble BayesRNN
Ensemble Cubist
Ensemble Earth
Ensemble Elastic (optimized by cross-validation)
Ensemble Gradient Boosted
Ensemble Lasso (optimized by cross-validation)
Ensemble Linear (tuned)
Ensemble Ridge (optimized by cross-validation)
Ensemble RPart
Ensemble SVM (tuned)
Ensemble Trees
Ensemble XGBoost
GAM (Generalized Additive Models, with smoothing splines)
Gradient Boosted (optimized)
Lasso
Linear (tuned)
Neuralnet
PCR (Principal Components Regression)
PLS (Partial Least Squares)
Ridge (optimized by cross-validation)
RPart
SVM (Support Vector Machines, tuned)
Tree
XGBoost
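One plausible reading of the "Ensemble" entries, consistent with the "Head of the ensemble" and "Correlation of the ensemble" tables listed later, is a stacking-style design: predictions from the individual models become the columns of a new data set, and further models are fit to that data set. The base R sketch below illustrates the idea with two simple base models and one linear ensemble model. All names are hypothetical, mtcars stands in for real data, and this is an assumption about the general approach rather than the package's actual code.

```r
set.seed(1)  # fixed seed so the illustration is repeatable

# Built-in mtcars stands in for the data; mpg is the target.
idx   <- sample(nrow(mtcars), size = round(0.6 * nrow(mtcars)))
train <- mtcars[idx, ]
hold  <- mtcars[-idx, ]

# Two simple base models (stand-ins for the 18 individual models).
m_linear <- lm(mpg ~ wt + hp, data = train)
m_quad   <- lm(mpg ~ poly(wt, 2) + qsec, data = train)

# Assumed ensemble layout: one column per base model's predictions,
# plus the true target value.
make_ensemble <- function(data) {
  data.frame(
    linear = predict(m_linear, newdata = data),
    quad   = predict(m_quad, newdata = data),
    y      = data$mpg
  )
}
ens_train <- make_ensemble(train)
ens_hold  <- make_ensemble(hold)

# An "Ensemble Linear" style model: fit on the base-model predictions,
# then score on the held-out rows.
ens_fit  <- lm(y ~ linear + quad, data = ens_train)
ens_pred <- predict(ens_fit, newdata = ens_hold)
ens_rmse <- sqrt(mean((ens_hold$y - ens_pred)^2))

print(head(ens_hold))
print(ens_rmse)
```

Because the base-model predictions tend to be highly correlated with one another, a filter such as remove_ensemble_correlations_greater_than would act on the columns of a data set like ens_train.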

The 30 plots created automatically:

Correlation plot of the numeric data (as numbers and colors)
Correlation plot of the numeric data (as circles with colors)
Cook’s D Bar Plot
Four plots in one for the most accurate model: Predicted vs actual, Residuals, Histogram of residuals, Q-Q plot
Most accurate model: Predicted vs actual
Most accurate model: Residuals
Most accurate model: Histogram of residuals
Most accurate model: Q-Q plot
Accuracy by resample and model, fixed scales
Accuracy by resample and model, free scales
Holdout RMSE/train RMSE, fixed scales
Holdout RMSE/train RMSE, free scales
Histograms of each numeric column
Boxplots of each numeric column
Predictor vs target variable
Model accuracy bar chart (RMSE)
t-test p-value bar chart
Train vs holdout by resample and model, free scales
Train vs holdout by resample and model, fixed scales
Duration bar chart
Holdout RMSE / train RMSE bar chart
Mean bias bar chart
Mean MSE bar chart
Mean MAE bar chart
Mean SSE bar chart
Kolmogorov-Smirnov test bar chart
Bias plot by model and resample
MSE plot by model and resample
MAE plot by model and resample
SSE plot by model and resample
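Several of the plots above report the ratio of holdout RMSE to train RMSE, a common overfitting check. The sketch below shows one way such a ratio could be computed in base R with a 60/20/20 split that mirrors the train_amount, test_amount and validation_amount arguments. Treating the holdout RMSE as the mean of the test and validation RMSE is an assumption made for this illustration, as are the variable names and the use of mtcars as stand-in data.

```r
set.seed(42)  # fixed seed so the illustration is repeatable

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# 60/20/20 split of the built-in mtcars data (mpg is the target).
n        <- nrow(mtcars)
shuffled <- sample(n)
n_train  <- floor(0.60 * n)
n_test   <- floor(0.20 * n)

train      <- mtcars[shuffled[seq_len(n_train)], ]
test       <- mtcars[shuffled[n_train + seq_len(n_test)], ]
validation <- mtcars[shuffled[(n_train + n_test + 1):n], ]

fit <- lm(mpg ~ wt + hp, data = train)

train_rmse      <- rmse(train$mpg, predict(fit, newdata = train))
test_rmse       <- rmse(test$mpg, predict(fit, newdata = test))
validation_rmse <- rmse(validation$mpg, predict(fit, newdata = validation))

# Assumption: holdout RMSE is the mean of the test and validation RMSE.
holdout_rmse <- mean(c(test_rmse, validation_rmse))

# A ratio near 1 means similar error on seen and unseen data;
# a ratio well above 1 suggests the model is overfitting the training data.
overfit_ratio <- holdout_rmse / train_rmse
print(c(train = train_rmse, holdout = holdout_rmse, ratio = overfit_ratio))
```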

The tables created automatically (which are both searchable and sortable) are:

Variance Inflation Factor
Correlation of the ensemble
Head of the ensemble
Data summary
Correlation of the data
Grand summary table includes:
Mean holdout RMSE
Standard deviation of mean holdout RMSE
t-test value
t-test p-value
t-test p-value standard deviation
Kolmogorov-Smirnov stat mean
Kolmogorov-Smirnov stat p-value
Kolmogorov-Smirnov stat standard deviation
Mean bias
Mean bias standard deviation
Mean MAE
Mean MAE standard deviation
Mean MSE
Mean MSE standard deviation
Mean SSE
Mean SSE standard deviation
Mean data (this is the mean of the target column in the original data set)
Standard deviation of mean data (this is the standard deviation of the data in the target column in the original data set)
Mean train RMSE
Mean test RMSE
Mean validation RMSE
Holdout vs train mean
Holdout vs train standard deviation
Duration
Duration standard deviation
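For readers unfamiliar with the error measures in the grand summary table, the base R sketch below shows their conventional definitions on a tiny made-up example. The helper name error_metrics and the numbers are hypothetical, and the sign convention for bias (actual minus predicted) is an assumption; the package may define it the other way round.

```r
# Hypothetical helper: common error metrics for one model on one resample.
error_metrics <- function(actual, predicted) {
  err <- actual - predicted  # sign convention is an assumption
  c(bias = mean(err),        # average signed error
    MAE  = mean(abs(err)),   # mean absolute error
    MSE  = mean(err^2),      # mean squared error
    RMSE = sqrt(mean(err^2)),# root mean squared error
    SSE  = sum(err^2))       # sum of squared errors
}

# Made-up values, chosen so the arithmetic is easy to follow.
actual    <- c(24, 22, 35, 33)
predicted <- c(25, 21, 33, 35)
m <- error_metrics(actual, predicted)
print(round(m, 3))

# Across resamples, the "mean" and "standard deviation" columns would
# summarise per-resample values such as these (again made-up numbers).
rmse_by_resample <- c(4, 5, 6)
resample_summary <- c(mean = mean(rmse_by_resample), sd = sd(rmse_by_resample))
print(resample_summary)
```

Here the errors are -1, 1, 2 and -2, so the bias is 0, the MAE is 1.5, the MSE is 2.5 and the SSE is 10.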

Example using pre-trained models on totally new data in the NumericEnsembles package

The NumericEnsembles package can also apply the models trained during the initial analysis to totally unseen data.

The package contains two example data sets to demonstrate this. Boston_housing is the Boston Housing data set with the first five rows removed; the models are built on this data set. NewBoston is the totally new data: the first five rows of the original Boston Housing data set, which the models never see during training.
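The same kind of split can be made from any data frame. The sketch below is a minimal base R illustration of holding back the first few rows as "new" data; the helper name split_off_head is hypothetical, and whether the bundled data sets were produced exactly this way is an assumption based on the description above.

```r
# Hypothetical helper: hold back the first n rows as "new" data and
# keep the remaining rows for model building.
split_off_head <- function(data, n = 5) {
  stopifnot(n >= 1, n < nrow(data))
  list(
    model_data = data[-seq_len(n), , drop = FALSE],
    new_data   = data[seq_len(n), , drop = FALSE]
  )
}

# MASS::Boston is the data set used in the first example; fall back to
# the built-in mtcars data if the MASS package is not installed.
source_data <- if (requireNamespace("MASS", quietly = TRUE)) MASS::Boston else mtcars

parts <- split_off_head(source_data, n = 5)
print(dim(parts$model_data))
print(parts$new_data)
```

The new_data part could then be written to a CSV file with write.csv() and hosted somewhere reachable by URL.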

library(NumericEnsembles)
Numeric(data = Boston_housing,
        colnum = 14,
        numresamples = 25,
        remove_VIF_above = 5.00,
        remove_ensemble_correlations_greater_than = 1.00,
        scale_all_predictors_in_data = "N",
        data_reduction_method = 0,
        ensemble_reduction_method = 0,
        how_to_handle_strings = 0,
        predict_on_new_data = "Y",
        set_seed = "N",
        save_all_trained_models = "N",
        save_all_plots = "N",
        use_parallel = "Y",
        train_amount = 0.60,
        test_amount = 0.20,
        validation_amount = 0.20)

When asked “What is the URL of the new data?”, use the New_Boston data set. Its URL is: https://raw.githubusercontent.com/InfiniteCuriosity/EnsemblesData/refs/heads/main/NewBoston.csv

External data may be used to accomplish the same result.
