# Install from CRAN
install.packages("corrselect")
# Or install development version from GitHub
# install.packages("pak")
pak::pak("gcol33/corrselect")

Suggested packages (for extended functionality):
lme4, glmmTMB: Mixed-effects models in modelPrune()
WGCNA: Biweight midcorrelation (bicor)
energy: Distance correlation
minerva: Maximal information coefficient
corrselect identifies and removes redundant variables based on pairwise correlation or association. Given a threshold \(\tau\), it finds subsets where all pairwise associations satisfy \(|a_{ij}| < \tau\) (see vignette("theory") for the mathematical formulation).
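As a small illustration of the threshold condition, the following base-R sketch checks one candidate subset of mtcars by hand. It does not call corrselect. The names tau, vars, a, and pairwise are chosen here for illustration, and Pearson correlation is assumed as the association measure.

```r
# Check the pairwise condition |a_ij| < tau for one candidate subset (base R only)
tau  <- 0.7
vars <- c("mpg", "drat", "qsec", "gear", "carb")  # illustrative subset

a <- abs(cor(mtcars[, vars]))   # absolute Pearson correlations
pairwise <- a[upper.tri(a)]     # each unordered pair counted once

# TRUE only when every pair in the subset is below the threshold
subset_is_valid <- all(pairwise < tau)
subset_is_valid
```

A subset passes only if every pair passes, so a single strongly correlated pair is enough to rule a subset out.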
corrselect provides five functions across three levels of interface:
corrPrune() - Removes redundant predictors based on pairwise correlation:
Returns a single pruned dataset
No response variable required
Fast greedy or exact search
modelPrune() - Reduces VIF in regression models:
Returns a single pruned dataset with response
Iteratively removes high-VIF predictors
Works with lm, glm, lme4, glmmTMB
corrSelect() - Returns all maximal subsets (numeric data):
Enumerates all maximal valid subsets satisfying the threshold (see vignette("theory"))
Provides full metadata (size, avg_corr, max_corr, min_corr)
Exact or greedy search
assocSelect() - Returns all maximal subsets (mixed-type data):
Handles numeric, factor, and ordered variables
Uses appropriate association measures per variable pair
Exact or greedy search
MatSelect() - Direct matrix input:
Accepts precomputed correlation/association matrices
No data preprocessing
Useful for repeated analyses
data(mtcars)
# Remove correlated predictors (threshold = 0.7)
pruned <- corrPrune(mtcars, threshold = 0.7)
# Results
cat(sprintf("Reduced from %d to %d variables\n", ncol(mtcars), ncol(pruned)))
#> Reduced from 11 to 5 variables
names(pruned)
#> [1] "mpg" "drat" "qsec" "gear" "carb"

Variables removed: cyl, disp, hp, wt, vs, am
How corrPrune() selects among multiple maximal subsets:
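When multiple maximal subsets exist (which is common), corrPrune() returns the subset with the lowest average absolute correlation. In the mtcars example above, the returned 5-variable subset has an average of 0.416, while some 4-variable maximal subsets have lower averages (see the corrSelect() output below), so the comparison appears to apply among the largest subsets. This selection criterion balances three goals: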
Minimize redundancy: Lower average correlation means more independent variables
Maximize information: Prefers diverse variable combinations over tightly clustered ones
Deterministic behavior: Always returns the same result for the same data
To explore all maximal subsets instead of just the optimal one, use corrSelect() (see below).
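The average absolute correlation used for ranking can be reproduced by hand. The sketch below is base R only, and avg_abs_cor is a hypothetical helper written for this illustration, not a corrselect function. The candidate subsets are copied from the corrSelect() output shown in this guide, and Pearson correlation is assumed.

```r
# Average absolute pairwise correlation of a subset (hypothetical helper)
avg_abs_cor <- function(data, vars) {
  a <- abs(cor(data[, vars]))
  mean(a[upper.tri(a)])   # mean over unordered pairs, diagonal excluded
}

# Candidate subsets copied from the printed corrSelect() table
candidates <- list(
  c("mpg", "drat", "qsec", "gear", "carb"),
  c("cyl", "drat", "qsec", "gear", "carb"),
  c("wt", "qsec", "am", "carb")
)

scores <- vapply(candidates, function(v) avg_abs_cor(mtcars, v), numeric(1))

# Size and average side by side, for comparison with the Avg column
data.frame(size = lengths(candidates), avg = round(scores, 3))
```

Comparing size and average together shows why a smaller subset can score lower on average correlation than a larger one.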
# Prune based on VIF (limit = 5)
model_data <- modelPrune(
  formula = mpg ~ .,
  data = mtcars,
  limit = 5
)
# Results
cat("Variables kept:", paste(attr(model_data, "selected_vars"), collapse = ", "), "\n")
#> Variables kept: drat, qsec, vs, am, gear, carb
cat("Variables removed:", paste(attr(model_data, "removed_vars"), collapse = ", "), "\n")
#> Variables removed: disp, cyl, wt, hp

results <- corrSelect(mtcars, threshold = 0.7)
show(results)
#> CorrCombo object
#> -----------------
#> Method: bron-kerbosch
#> Correlation: pearson
#> Threshold: 0.700
#> Subsets: 15 maximal subsets
#> Data Rows: 32 used in correlation
#> Pivot: TRUE
#>
#> Top combinations:
#> No. Variables Avg Max Size
#> ------------------------------------------------------------
#> [ 1] mpg, drat, qsec, gear, carb 0.416 0.700 5
#> [ 2] cyl, drat, qsec, gear, carb 0.434 0.700 5
#> [ 3] mpg, drat, vs, gear, carb 0.466 0.700 5
#> [ 4] wt, qsec, am, carb 0.373 0.692 4
#> [ 5] wt, qsec, gear, carb 0.388 0.656 4
#> ... (10 more combinations)

Inspect subsets:
as.data.frame(results)[1:5, ] # First 5 subsets
#> VarName01 VarName02 VarName03 VarName04 VarName05
#> Subset01 [avg=0.416] mpg drat qsec gear carb
#> Subset02 [avg=0.434] cyl drat qsec gear carb
#> Subset03 [avg=0.466] mpg drat vs gear carb
#> Subset04 [avg=0.373] wt qsec am carb <NA>
#> Subset05 [avg=0.388] wt qsec gear carb <NA>

Extract a specific subset:
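Independent of the package's own extraction helper (described in the function reference), a subset can also be pulled out with plain column indexing once its variable names are known. This is a minimal base-R sketch. The names subset4_vars and subset4 are illustrative, and the variable list is copied from the fourth row of the table above.

```r
# Variables of the fourth subset, copied from the printed table
subset4_vars <- c("wt", "qsec", "am", "carb")

# Keep only those columns; drop = FALSE guarantees a data frame is returned
subset4 <- mtcars[, subset4_vars, drop = FALSE]
str(subset4)
```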
# Create mixed-type data
df <- data.frame(
  x1 = rnorm(100),
  x2 = rnorm(100),
  cat1 = factor(sample(c("A", "B", "C"), 100, replace = TRUE)),
  ord1 = ordered(sample(1:5, 100, replace = TRUE))
)
# Handle mixed types automatically
results_mixed <- assocSelect(df, threshold = 0.5)
show(results_mixed)
#> CorrCombo object
#> -----------------
#> Method: bron-kerbosch
#> Correlation: mixed
#> AssocMethod: numeric_numeric = pearson, numeric_factor = eta, numeric_ordered
#> = spearman, factor_ordered = cramersv
#> Threshold: 0.500
#> Subsets: 1 maximal subsets
#> Data Rows: 100 used in correlation
#> Pivot: TRUE
#>
#> Top combinations:
#> No. Variables Avg Max Size
#> ------------------------------------------------------------
#> [ 1] x1, x2, cat1, ord1 0.082 0.261 4
# Verify all pairwise associations are below threshold
cat("Max pairwise association:", max(results_mixed@max_corr), "\n")
#> Max pairwise association: 0.2613954

Use force_in to ensure specific variables are always retained.
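A minimal sketch of force_in with corrPrune() follows. The force_in argument appears in the function signature documented in this guide. That it accepts a character vector of column names is an assumption here, and the choice of "wt" is purely illustrative. The call is guarded so the snippet does nothing when corrselect is not installed.

```r
# Assumption: force_in accepts column names as a character vector
if (requireNamespace("corrselect", quietly = TRUE)) {
  library(corrselect)

  # Ask corrPrune() to keep "wt" regardless of its correlations
  pruned_forced <- corrPrune(mtcars, threshold = 0.7, force_in = "wt")

  # The forced variable should appear among the retained columns
  print("wt" %in% names(pruned_forced))
  print(names(pruned_forced))
}
```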
Common thresholds: 0.5 (strict), 0.7 (moderate, recommended default), 0.9 (lenient).
Lower thresholds are stricter because they allow fewer variable pairs to coexist, resulting in smaller subsets. Higher thresholds permit stronger correlations, retaining more variables.
For detailed threshold selection strategies including visualization techniques, VIF guidelines, and sensitivity analysis, see vignette("advanced").
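To get a feel for how strict each common threshold is on a given dataset, one option is to count how many variable pairs would be blocked from coexisting. The sketch below uses base R only, with mtcars and Pearson correlation as assumptions. The names pair_vals, taus, and counts are illustrative.

```r
# Count variable pairs whose absolute correlation reaches each threshold
r <- abs(cor(mtcars))
pair_vals <- r[upper.tri(r)]   # one value per unordered pair

taus <- c(0.5, 0.7, 0.9)
counts <- vapply(taus, function(tau) sum(pair_vals >= tau), numeric(1))

for (i in seq_along(taus)) {
  cat(sprintf("tau = %.1f: %d of %d pairs cannot coexist\n",
              taus[i], counts[i], length(pair_vals)))
}
```

The number of blocked pairs can only shrink as the threshold rises, which is why higher thresholds retain more variables.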
| Scenario | Function | Key Parameters |
|---|---|---|
| Quick dimensionality reduction | corrPrune() | threshold, mode |
| Model-based refinement | modelPrune() | limit (VIF threshold), engine |
| Enumerate all maximal subsets | corrSelect() | threshold |
| Mixed-type data | assocSelect() | threshold |
| Precomputed matrices | MatSelect() | threshold, method |
| Protect key variables | Any function | force_in |
Removes redundant predictors based on pairwise correlation.
corrPrune(data, threshold = 0.7, measure = "auto", mode = "auto",
          force_in = NULL, by = NULL, group_q = 1, max_exact_p = 100)

| Parameter | Description | Default |
|---|---|---|
| data | Data frame or matrix | required |
| threshold | Maximum allowed correlation | 0.7 |
| measure | Correlation type: "auto", "pearson", "spearman", "kendall", "bicor", "distance", "maximal" | "auto" |
| mode | Algorithm: "auto", "exact", "greedy" | "auto" |
| force_in | Variables that must be retained | NULL |
| by | Column name(s) for grouped pruning | NULL |
| group_q | Quantile for aggregating group correlations, in (0, 1] | 1 |
Returns: Data frame with pruned variables.
Attributes: selected_vars, removed_vars.
Iteratively removes predictors with high VIF from a regression model.
modelPrune(formula, data, engine = "lm", criterion = "vif",
           limit = 5, force_in = NULL, max_steps = NULL, ...)

| Parameter | Description | Default |
|---|---|---|
| formula | Model formula (e.g., y ~ .) | required |
| data | Data frame | required |
| engine | "lm", "glm", "lme4", "glmmTMB", or custom | "lm" |
| criterion | "vif" or "condition_number" | "vif" |
| limit | Maximum allowed diagnostic value | 5 |
| force_in | Variables that must be retained | NULL |
Returns: Pruned data frame. Attributes: selected_vars, removed_vars, final_model.
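The VIF criterion can be illustrated with the textbook definition, VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the remaining predictors. The sketch below is base R only. vif_manual is a hypothetical helper, and whether modelPrune() computes VIF in exactly this way internally is not stated in this guide. The kept set is copied from the modelPrune() output shown earlier.

```r
# Textbook VIF: regress each predictor on the others (hypothetical helper)
vif_manual <- function(data, predictors) {
  vapply(predictors, function(v) {
    others <- setdiff(predictors, v)
    fit <- lm(reformulate(others, response = v), data = data)
    1 / (1 - summary(fit)$r.squared)
  }, numeric(1))
}

all_preds <- setdiff(names(mtcars), "mpg")              # every predictor of mpg
kept <- c("drat", "qsec", "vs", "am", "gear", "carb")   # from the modelPrune() output

vif_full <- vif_manual(mtcars, all_preds)
vif_kept <- vif_manual(mtcars, kept)

round(vif_full, 1)
round(vif_kept, 1)
```

Comparing the two vectors shows how removing a few predictors lowers the VIF of those that remain.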
corrSelect() - Enumerates all maximal subsets satisfying the correlation threshold (numeric data).
| Parameter | Description | Default |
|---|---|---|
| df | Data frame (numeric columns only) | required |
| threshold | Maximum allowed correlation | 0.7 |
| method | Algorithm: "bron-kerbosch", "els" | auto |
| cor_method | "pearson", "spearman", "kendall", "bicor", "distance", "maximal" | "pearson" |
| force_in | Variables required in all subsets | NULL |
Returns: CorrCombo object with properties: subset_list, avg_corr, min_corr, max_corr.
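To make "maximal subset" concrete, here is a brute-force sketch in base R. It is not the package's algorithm (the table above lists Bron-Kerbosch and ELS) and it is exponential in the number of variables, so it only suits small examples. maximal_subsets is a hypothetical helper. A subset counts as valid when every pair is below the threshold, and as maximal when no valid subset strictly contains it.

```r
# Brute-force enumeration of maximal subsets (illustration only; needs p <= 31)
maximal_subsets <- function(cmat, threshold = 0.7) {
  p <- ncol(cmat)
  ok <- abs(cmat) < threshold   # TRUE where two variables may coexist
  diag(ok) <- TRUE              # a variable never conflicts with itself

  # Collect every non-empty subset whose pairs all satisfy the threshold
  valid <- list()
  for (mask in seq_len(2^p - 1)) {
    idx <- which(as.integer(intToBits(mask))[seq_len(p)] == 1L)
    if (all(ok[idx, idx])) valid[[length(valid) + 1L]] <- idx
  }

  # Keep subsets that are not strictly contained in a larger valid subset
  is_maximal <- vapply(valid, function(s) {
    !any(vapply(valid, function(other) {
      length(other) > length(s) && all(s %in% other)
    }, logical(1)))
  }, logical(1))

  lapply(valid[is_maximal], function(idx) colnames(cmat)[idx])
}

subsets <- maximal_subsets(cor(mtcars), threshold = 0.7)
length(subsets)
subsets[[which.max(lengths(subsets))]]
```

The count and contents can be compared with the corrSelect() output for mtcars shown earlier in this guide.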
Enumerates all maximal subsets for mixed-type data (numeric, factor, ordered).
assocSelect(df, threshold = 0.7, method = NULL, force_in = NULL,
            method_num_num = "pearson", method_num_ord = "spearman",
            method_ord_ord = "spearman", ...)

| Parameter | Description | Default |
|---|---|---|
| df | Data frame (any column types) | required |
| threshold | Maximum allowed association | 0.7 |
| method_num_num | Numeric-numeric: "pearson", "spearman", etc. | "pearson" |
| method_num_ord | Numeric-ordered: "spearman", "kendall" | "spearman" |
| method_ord_ord | Ordered-ordered: "spearman", "kendall" | "spearman" |
Returns: CorrCombo object.
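The assocSelect() output shown earlier reports eta as the measure for numeric-factor pairs. As a rough illustration of that kind of measure, the sketch below computes the classical correlation ratio in base R. eta_ratio is a hypothetical helper, and whether the package uses exactly this definition (for example eta rather than eta squared) is an assumption.

```r
# Correlation ratio (eta) between a numeric variable and a grouping factor
eta_ratio <- function(x, g) {
  g <- factor(g)
  grand_mean <- mean(x)
  # Between-group sum of squares: group size times squared mean deviation
  ss_between <- sum(tapply(x, g, function(v) length(v) * (mean(v) - grand_mean)^2))
  ss_total <- sum((x - grand_mean)^2)
  sqrt(ss_between / ss_total)
}

# Example: how strongly does the number of cylinders separate mpg values?
eta_mpg_cyl <- eta_ratio(mtcars$mpg, mtcars$cyl)
eta_mpg_cyl
```

The value lies between 0 and 1 and equals the square root of R² from a one-way ANOVA of the numeric variable on the factor.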
MatSelect() - Direct matrix interface for precomputed correlation/association matrices.
| Parameter | Description | Default |
|---|---|---|
| mat | Symmetric correlation/association matrix | required |
| threshold | Maximum allowed value | 0.7 |
| method | Algorithm: "bron-kerbosch", "els" | auto |
| force_in | Variables required in all subsets | NULL |
Returns: CorrCombo object.
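A possible pattern for repeated analyses is to compute the correlation matrix once and reuse it across several thresholds. The sketch below assumes that MatSelect() accepts the output of cor() directly and that the result exposes subset_list as a slot, as suggested by the @max_corr access used earlier. Both are assumptions rather than documented guarantees, and the names cmat and n_subsets are illustrative. The call is guarded so nothing runs when corrselect is not installed.

```r
# Assumptions: MatSelect() takes cor() output; subset_list is an S4 slot
if (requireNamespace("corrselect", quietly = TRUE)) {
  library(corrselect)

  cmat <- cor(mtcars)   # computed once, reused below

  # Number of maximal subsets found at each threshold
  n_subsets <- vapply(c(0.5, 0.7, 0.9), function(tau) {
    res <- MatSelect(cmat, threshold = tau)
    length(res@subset_list)
  }, numeric(1))

  print(n_subsets)
}
```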
corrSubset() - Extracts a specific subset from a CorrCombo result.
| Parameter | Description | Default |
|---|---|---|
| res | CorrCombo object from corrSelect/assocSelect/MatSelect | required |
| df | Original data frame | required |
| which | Subset index or "best" (lowest avg correlation) | "best" |
| keepExtra | Include non-numeric columns in output? | FALSE |
Returns: Data frame containing only the selected variables.
“No valid subsets found” error - Threshold too strict: all variable pairs exceed it. Use force_in to keep at least one variable.
VIF computation fails in modelPrune() - Perfect multicollinearity (R² = 1) present. Run corrPrune(threshold = 0.99) first to remove near-duplicates.
Forced variables conflict - Variables in force_in are too highly correlated with each other. Revise the force_in set.
Slow performance with many variables - Exact mode is exponential for large p. Use mode = "greedy" for p > 25.

For comprehensive troubleshooting with code examples, see vignette("advanced"), Section 5.
vignette("workflows") - Complete real-world workflows (ecological, survey, genomic, mixed models)
vignette("advanced") - Algorithmic control and custom engines
vignette("comparison") - Comparison with caret, Boruta, glmnet
vignette("theory") - Theoretical foundations and formulation
?corrPrune, ?modelPrune, ?corrSelect, ?assocSelect, ?MatSelect
sessionInfo()
#> R version 4.5.2 (2025-10-31 ucrt)
#> Platform: x86_64-w64-mingw32/x64
#> Running under: Windows 11 x64 (build 26200)
#>
#> Matrix products: default
#> LAPACK version 3.12.1
#>
#> locale:
#> [1] LC_COLLATE=en_US.UTF-8 LC_CTYPE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8
#> [4] LC_NUMERIC=C LC_TIME=en_US.UTF-8
#>
#> time zone: Europe/Luxembourg
#> tzcode source: internal
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] microbenchmark_1.5.0 corrselect_3.2.1
#>
#> loaded via a namespace (and not attached):
#> [1] gtable_0.3.6 xfun_0.55 bslib_0.9.0
#> [4] ggplot2_4.0.1 recipes_1.3.1 lattice_0.22-7
#> [7] vctrs_0.7.1 tools_4.5.2 generics_0.1.4
#> [10] stats4_4.5.2 parallel_4.5.2 tibble_3.3.1
#> [13] pkgconfig_2.0.3 ModelMetrics_1.2.2.2 Matrix_1.7-4
#> [16] data.table_1.18.0 RColorBrewer_1.1-3 S7_0.2.1
#> [19] lifecycle_1.0.5 compiler_4.5.2 farver_2.1.2
#> [22] stringr_1.6.0 textshaping_1.0.4 codetools_0.2-20
#> [25] htmltools_0.5.9 class_7.3-23 sass_0.4.10
#> [28] yaml_2.3.12 prodlim_2025.04.28 pillar_1.11.1
#> [31] jquerylib_0.1.4 MASS_7.3-65 cachem_1.1.0
#> [34] gower_1.0.2 iterators_1.0.14 rpart_4.1.24
#> [37] foreach_1.5.2 nlme_3.1-168 parallelly_1.46.1
#> [40] lava_1.8.2 tidyselect_1.2.1 digest_0.6.39
#> [43] stringi_1.8.7 future_1.68.0 dplyr_1.2.0
#> [46] reshape2_1.4.5 purrr_1.2.0 listenv_0.10.0
#> [49] splines_4.5.2 fastmap_1.2.0 grid_4.5.2
#> [52] cli_3.6.5 magrittr_2.0.4 survival_3.8-3
#> [55] future.apply_1.20.1 withr_3.0.2 scales_1.4.0
#> [58] lubridate_1.9.4 timechange_0.3.0 rmarkdown_2.30
#> [61] globals_0.18.0 otel_0.2.0 nnet_7.3-20
#> [64] timeDate_4051.111 evaluate_1.0.5 knitr_1.51
#> [67] hardhat_1.4.2 caret_7.0-1 rlang_1.1.7
#> [70] Rcpp_1.1.1 glue_1.8.0 pROC_1.19.0.1
#> [73] ipred_0.9-15 svglite_2.2.2 jsonlite_2.0.0
#> [76] R6_2.6.1 plyr_1.8.9 systemfonts_1.3.1