
Machine Learning for Tidynauts
tidylearn provides a unified, tidyverse-compatible interface to R's machine learning ecosystem. It wraps proven packages such as glmnet, randomForest, xgboost, e1071, cluster, and dbscan, so you get the reliability of established implementations with the convenience of a consistent, tidy API.
What tidylearn does:

- Provides one consistent entry point (`tl_model()`) to 20+ ML algorithms
- Returns tidy tibbles that drop straight into `%>%` pipelines

What tidylearn is NOT:

- A reimplementation of the algorithms: the fitted object from the underlying package always remains accessible (`model$fit`)

Each ML package in R has its own API, output format, and conventions. tidylearn provides a translation layer so you can:
| Without tidylearn | With tidylearn |
|---|---|
| Learn different APIs for each package | One API for everything |
| Write custom code to extract results | Consistent tibble output |
| Create different plots for each model | Unified visualization |
| Manage package-specific quirks | Focus on your analysis |
The underlying algorithms are unchanged; tidylearn simply makes them easier to use together.
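To make the comparison above concrete, here is the same random-forest fit written against randomForest directly and through tidylearn. This is a sketch: the `tl_model()` call follows the quick-start usage documented in this README, while the direct `randomForest()` call uses that package's standard formula interface.

```r
library(randomForest)
library(tidylearn)

# Without tidylearn: package-specific API and output format
rf <- randomForest(Species ~ ., data = iris)
predict(rf, iris)                # a factor vector, in randomForest's own shape

# With tidylearn: the same model through the unified interface
model <- tl_model(iris, Species ~ ., method = "forest")
predict(model, new_data = iris)  # a tibble, like every other tidylearn method
```

The fitted randomForest object is still there in `model$fit` if you need it, so nothing is lost by going through the wrapper.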
```r
# Install from CRAN
install.packages("tidylearn")

# Or install the development version from GitHub
# devtools::install_github("ces0491/tidylearn")
```

A single `tl_model()` function dispatches to the appropriate underlying package:
```r
library(tidylearn)

# Classification -> uses randomForest::randomForest()
model <- tl_model(iris, Species ~ ., method = "forest")

# Regression -> uses stats::lm()
model <- tl_model(mtcars, mpg ~ wt + hp, method = "linear")

# Regularization -> uses glmnet::glmnet()
model <- tl_model(mtcars, mpg ~ ., method = "lasso")

# Clustering -> uses stats::kmeans()
model <- tl_model(iris[, 1:4], method = "kmeans", k = 3)

# PCA -> uses stats::prcomp()
model <- tl_model(iris[, 1:4], method = "pca")
```

All results come back as tibbles, ready for dplyr and ggplot2:
```r
# Predictions as tibbles
predictions <- predict(model, new_data = test_data)

# Metrics as tibbles
metrics <- tl_evaluate(model, test_data)

# Easy to pipe
model %>%
  predict(test_data) %>%
  bind_cols(test_data) %>%
  ggplot(aes(x = actual, y = prediction)) +
  geom_point()
```

You always have access to the raw model from the underlying package:
```r
model <- tl_model(iris, Species ~ ., method = "forest")

# Access the randomForest object directly
model$fit  # This is the randomForest::randomForest() result

# Use package-specific functions if needed
randomForest::varImpPlot(model$fit)
```

tidylearn provides a unified interface to these established R packages:
Supervised methods:

| Method | Underlying Package | Function Called |
|---|---|---|
| `"linear"` | stats | `lm()` |
| `"polynomial"` | stats | `lm()` with `poly()` |
| `"logistic"` | stats | `glm(..., family = binomial)` |
| `"ridge"`, `"lasso"`, `"elastic_net"` | glmnet | `glmnet()` |
| `"tree"` | rpart | `rpart()` |
| `"forest"` | randomForest | `randomForest()` |
| `"boost"` | gbm | `gbm()` |
| `"xgboost"` | xgboost | `xgb.train()` |
| `"svm"` | e1071 | `svm()` |
| `"nn"` | nnet | `nnet()` |
| `"deep"` | keras | `keras_model_sequential()` |
Unsupervised methods:

| Method | Underlying Package | Function Called |
|---|---|---|
| `"pca"` | stats | `prcomp()` |
| `"mds"` | stats, MASS, smacof | `cmdscale()`, `isoMDS()`, etc. |
| `"kmeans"` | stats | `kmeans()` |
| `"pam"` | cluster | `pam()` |
| `"clara"` | cluster | `clara()` |
| `"hclust"` | stats | `hclust()` |
| `"dbscan"` | dbscan | `dbscan()` |
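Because every supervised method shares the same `tl_model()` / `tl_evaluate()` interface, comparing algorithms reduces to changing the `method` string. A minimal sketch, assuming `train_data` and `test_data` with a `target` column, and that `tl_evaluate()` returns its metrics as a tibble as this README describes:

```r
library(tidylearn)
library(dplyr)

# Fit the same formula with three different methods from the table above
methods <- c("logistic", "tree", "forest")

results <- lapply(methods, function(m) {
  fit <- tl_model(train_data, target ~ ., method = m)
  tl_evaluate(fit, test_data) %>%
    mutate(method = m)   # tag each metrics tibble with its method
})

# Because every tl_evaluate() result is a tibble, they stack cleanly
bind_rows(results)
```

The exact metric columns depend on the model type, but the tibble-in, tibble-out contract is what makes this kind of loop uniform across packages.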
Beyond wrapping individual packages, tidylearn provides orchestration functions that combine multiple techniques:
```r
# Reduce dimensions before classification
reduced <- tl_reduce_dimensions(iris, response = "Species",
                                method = "pca", n_components = 3)
model <- tl_model(reduced$data, Species ~ ., method = "logistic")

# Add cluster membership as a feature
enriched <- tl_add_cluster_features(data, response = "target",
                                    method = "kmeans", k = 3)
model <- tl_model(enriched, target ~ ., method = "forest")

# Use clustering to propagate labels to unlabeled data
model <- tl_semisupervised(data, target ~ .,
                           labeled_indices = labeled_idx,
                           cluster_method = "kmeans")

# Automatically try multiple approaches
result <- tl_auto_ml(data, target ~ ., time_budget = 300)
result$leaderboard
```

Consistent ggplot2-based plotting regardless of model type:
```r
# Generic plot method works for all model types
plot(forest_model)  # Automatic visualization based on model type
plot(linear_model)  # Diagnostic plots for regression
plot(pca_result)    # Variance explained for PCA

# Specialized plotting functions for unsupervised learning
plot_clusters(clustering_result, cluster_col = "cluster")
plot_variance_explained(pca_result$fit$variance_explained)

# Interactive dashboard for detailed exploration
tl_dashboard(model, test_data)
```

tidylearn is built on these principles:
- **Transparency:** The underlying packages do the real work. tidylearn makes them easier to use together without hiding what's happening.
- **Consistency:** One interface, tidy output, unified visualization, across all methods.
- **Accessibility:** Focus on your analysis, not on learning different package APIs.
- **Interoperability:** Results work seamlessly with dplyr, ggplot2, and the broader tidyverse.
```r
# View package help
?tidylearn

# Explore main functions
?tl_model
?tl_evaluate
?tl_auto_ml
```

Contributions are welcome! Please feel free to submit a Pull Request.
MIT License - see LICENSE for details.
Cesaire Tobias (cesaire@sheetsolved.com)
tidylearn is a wrapper that builds upon the excellent work of many R package authors. The actual algorithms are implemented in the underlying packages listed in the tables above.
Thank you to all the package maintainers whose work makes tidylearn possible.