Exploratory data analysis

Head of the data
- Discuss the characteristics of each feature.
Bar chart of target (0 or 1) vs each feature, by percent (%)
- Discussion of target vs each feature
Boxplots of the numeric data (insert plot here)
- Discussion of boxplots of the numeric data
Histograms of each numeric column (insert plot here)
- Discussion of histograms of each numeric column
Data summary (insert table here)
- Discussion of the data summary
Outliers in the data (insert outliers data here)
- Discussion of outliers in the data
Correlation of the data (table)
Correlation plot of the numeric data as circles and colors
Correlation of the ensemble
Variance Inflation Factor
The stories in the exploratory data analysis
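
The correlation table and Variance Inflation Factor items above can be sketched in base R as follows. This is only an illustration under stated assumptions: `vif_by_column` and the `demo` data frame are hypothetical names that do not come from the template, and the VIF is computed from its textbook definition, 1 / (1 - R^2), rather than with any particular package.

```r
# Hypothetical helper (not part of the template): VIF for each numeric column.
# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing column j
# on all of the other columns.
vif_by_column <- function(df) {
  sapply(names(df), function(col) {
    others <- setdiff(names(df), col)
    fit <- stats::lm(stats::reformulate(others, response = col), data = df)
    1 / (1 - summary(fit)$r.squared)
  })
}

# Simulated data, for illustration only: column c is nearly collinear with a.
set.seed(1)
demo <- data.frame(a = rnorm(100), b = rnorm(100))
demo$c <- demo$a + rnorm(100, sd = 0.1)

# Correlation table of the numeric data
print(round(stats::cor(demo), 2))

# Variance Inflation Factor per column
vif_values <- vif_by_column(demo)
print(round(vif_values, 2))
```

Columns with a large VIF (here, `a` and `c`) are candidates for the discussion of correlated features.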

24 logistic models (Individual models then ensembles, in alphabetical order)

One paragraph summary about statistical modeling here

Cubist

cubist_train_fit <- Cubist::cubist(x = train[, names(train) != "y", drop = FALSE], y = train$y)
Flexible Discriminant Analysis

fda_train_fit <- MachineShop::fit(as.factor(y) ~ ., data = train01, model = "FDAModel")
GAM (Generalized Additive Models) (uses smoothing splines)

f2 <- stats::as.formula(paste0("y ~", paste0("gam::s(", names_df, ")", collapse = "+")))

gam_train_fit <- gam(f2, data = train1)
Generalized Linear Models

glm_train_fit <- stats::glm(y ~ ., data = train, family = binomial)
Lasso (uses best model)

best_lasso_lambda <- lasso_cv$lambda.min

best_lasso_model <- glmnet(x, y, alpha = 1, lambda = best_lasso_lambda)
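
The lasso lines above use `lasso_cv`, `x`, and `y` without showing where they come from. A minimal sketch of one way to produce them is given here, assuming the `glmnet` package is installed, that `lasso_cv` is the result of `cv.glmnet`, and that `x` is built with `model.matrix` as in the Ensemble Ridge entry. The simulated data frame `df` and the `family = "binomial"` argument are assumptions for a 0/1 target and are not taken from the template.

```r
library(glmnet)

# Simulated 0/1 data, for illustration only
set.seed(42)
n <- 200
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
df$y <- rbinom(n, size = 1, prob = plogis(1.5 * df$x1 - df$x2))

# Predictor matrix without the intercept column, and the response vector
x <- model.matrix(y ~ ., data = df)[, -1]
y <- df$y

# Cross-validate over the lambda path (alpha = 1 is lasso, alpha = 0 is ridge)
lasso_cv <- cv.glmnet(x, y, alpha = 1, family = "binomial")
best_lasso_lambda <- lasso_cv$lambda.min

# Refit at the lambda with the lowest cross-validated error
best_lasso_model <- glmnet(x, y, alpha = 1, lambda = best_lasso_lambda,
                           family = "binomial")
print(coef(best_lasso_model))
```

The same pattern with `alpha = 0` would give the ridge objects used later in the list.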
Linear (tuned)

linear_train_fit <- e1071::tune.rpart(formula = y ~ ., data = train)
Linear Discriminant Analysis

lda_train_fit <- MASS::lda(as.factor(y) ~ ., data = train01, model = "LMModel")
Penalized Discriminant Analysis

pda_train_fit <- MachineShop::fit(as.factor(y) ~ ., data = train01, model = "PDAModel")
Quadratic Discriminant Analysis

qda_train_fit <- MASS::qda(as.factor(y) ~ ., data = train01)
Random Forest

rf_train_fit <- randomForest(x = train, y = as.factor(y_train), data = df, family = binomial(link = "logit"))
Ridge

best_ridge_lambda <- ridge_cv$lambda.min

best_ridge_model <- glmnet(x, y, alpha = 0, lambda = best_ridge_lambda)
RPart

rpart_train_fit <- rpart::rpart(y ~ ., data = train)
SVM (Support Vector Machines) (tuned)

svm_train_fit <- e1071::tune.svm(y ~ ., data = train)
Tree

tree_train_fit <- tree::tree(y ~ ., data = train)

Ensemble models start here
Ensemble Gradient Boosted

ensemble_gb_train_fit <- gbm::gbm(y_ensemble ~ ., data = ensemble_train, distribution = "gaussian", n.trees = 100, shrinkage = 0.1, interaction.depth = 10)
Ensemble Lasso (uses best model)

ensemble_best_lasso_lambda <- ensemble_lasso_cv$lambda.min

ensemble_best_lasso_model <- glmnet(ensemble_x, ensemble_y, alpha = 1, lambda = ensemble_best_lasso_lambda)
Ensemble Partial Least Squares

ensemble_pls_train_fit <- MachineShop::fit(as.factor(y) ~ ., data = ensemble_train, model = "PLSModel")
Ensemble Penalized Discriminant Analysis

ensemble_pda_train_fit <- MachineShop::fit(as.factor(y) ~ ., data = ensemble_train, model = "PDAModel")
Ensemble Ridge

x = model.matrix(y ~ ., data = ensemble_train)[, -1]

y = ensemble_train$y

ensemble_ridge_train_fit <- glmnet::glmnet(x, y, alpha = 0)
Ensemble RPart

ensemble_rpart_train_fit <- MachineShop::fit(as.factor(y) ~ ., data = ensemble_train, model = "RPartModel")
Ensemble Support Vector Machines (SVM)

ensemble_svm_train_fit <- e1071::svm(as.factor(y) ~ ., data = ensemble_train, kernel = "radial", gamma = 1, cost = 1)
Ensemble Trees

ensemble_tree_train_fit <- tree::tree(y ~ ., data = ensemble_train)
The stories in the models (fill in here)

Ensembles and individual model plots

Negative predictive value (fixed scales)
Negative predictive value (free scales)
Positive predictive value (fixed scales)
Positive predictive value (free scales)
F1 Score (fixed scales)
F1 Score (free scales)
False negative rate (fixed scales)
False negative rate (free scales)
False positive rate (fixed scales)
False positive rate (free scales)
True negative rate (fixed scales)
True negative rate (free scales)
True positive rate (fixed scales)
True positive rate (free scales)
ROC Curves for each of the 24 models
Overfitting or underfitting (closer to 1 is better) bar chart
Duration (mean) by model bar chart
Overfitting by model and resample, fixed scales
Overfitting by model and resample, free scales
Model accuracy bar chart
Accuracy by model and resample, including train and holdout by each resample, fixed scales
Accuracy by model and resample, including train and holdout by each resample, free scales
Summary report
- Accuracy (mean)
- Accuracy (standard deviation)
- True positive rate (also known as sensitivity)
- True negative rate (also known as specificity)
- False positive rate (also known as Type I error)
- False negative rate (also known as Type II error)
- Positive predictive value
- Negative predictive value
- F1 score
- Area under the curve (AUC)
- Overfitting (mean)
- Overfitting (standard deviation)
- Duration (mean)
- Duration (standard deviation)
Function call
Warnings or errors
The stories in the plots
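
The rates listed in the summary report all follow from the four cells of a confusion matrix. A minimal base R sketch is shown here, using the standard textbook definitions. The helper name `classification_rates` and the example vectors are hypothetical and are not taken from the template or from any package.

```r
# Hypothetical helper (not part of the template): summary-report rates
# computed from 0/1 vectors of actual and predicted classes.
classification_rates <- function(actual, predicted) {
  tp <- sum(predicted == 1 & actual == 1)  # true positives
  tn <- sum(predicted == 0 & actual == 0)  # true negatives
  fp <- sum(predicted == 1 & actual == 0)  # false positives (Type I errors)
  fn <- sum(predicted == 0 & actual == 1)  # false negatives (Type II errors)

  ppv <- tp / (tp + fp)
  tpr <- tp / (tp + fn)

  c(
    accuracy                  = (tp + tn) / (tp + tn + fp + fn),
    true_positive_rate        = tpr,             # sensitivity
    true_negative_rate        = tn / (tn + fp),  # specificity
    false_positive_rate       = fp / (fp + tn),
    false_negative_rate       = fn / (fn + tp),
    positive_predictive_value = ppv,
    negative_predictive_value = tn / (tn + fn),
    f1_score                  = 2 * ppv * tpr / (ppv + tpr)
  )
}

# Made-up example: 3 true positives, 1 false negative,
# 1 false positive, 5 true negatives
actual    <- c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
predicted <- c(1, 1, 1, 0, 1, 0, 0, 0, 0, 0)
rates <- classification_rates(actual, predicted)
print(round(rates, 3))
```

Note that the true positive rate and false negative rate sum to 1, as do the true negative rate and false positive rate, which is a quick sanity check on any reported table.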

Logistic report template

Introduction

Statement of the problem from the customer’s perspective

History of the problem, previous results

Exploratory data analysis

24 logistic models (Individual models then ensembles, in alphabetical order)

Ensembles and individual model plots

Strongest evidence-based results

Five strongest evidence-based recommendations

Conclusions

References