In this vignette, we demonstrate FORD, the forward stepwise variable selection algorithm introduced in A New Measure Of Dependence: Integrated R2 and based on the integrated \(R^2\) dependence measure. FORD is designed for variable ranking in both linear and nonlinear multivariate regression settings.
FORD closely follows the structure of FOCI, introduced in A Simple Measure Of Conditional Dependence, but replaces the core dependence measure with irdc, the integrated \(R^2\) estimate written \(\nu_n\) below.
Let \(Y\) be the response variable and \(\mathbf{X} = (X_1, \dots, X_p)\) the predictor variables. Given \(n\) i.i.d. samples of \((Y, \mathbf{X})\), FORD proceeds as follows:
Select \(j_1 = \arg\max_j \nu_n(Y, X_j)\)
If \(\nu_n(Y, X_{j_1}) \leq 0\), return \(\hat{V} = \emptyset\)
Iteratively add the feature that gives the maximum increase in irdc: $$ j_{k+1} = \arg\max_{j \notin \{j_1, \ldots, j_k\}} \nu_n(Y, (X_{j_1}, \ldots, X_{j_k}, X_j)) $$
Stop at the first \(k\) for which the irdc no longer increases, and return \(\hat{V} = \{j_1, \ldots, j_k\}\): $$ \nu_n(Y, (X_{j_1}, \ldots, X_{j_k}, X_{j_{k+1}})) \leq \nu_n(Y, (X_{j_1}, \ldots, X_{j_k})) $$
If no such \(k\) exists, select all variables.
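The steps above can be sketched as a generic loop. This is a minimal illustration, not the package's implementation: `forward_select` and its `measure` argument are hypothetical names, and the demo uses adjusted \(R^2\) from a linear fit as a stand-in score rather than irdc, so that the example runs with base R alone.

```r
# Generic forward stepwise selection following the steps above.
# `measure(Y, Xsub)` is a hypothetical user-supplied function returning a
# dependence score; in FORD this role is played by irdc.
forward_select <- function(Y, X, measure) {
  p <- ncol(X)
  selected <- integer(0)
  best <- -Inf
  scores <- numeric(0)
  while (length(selected) < p) {
    candidates <- setdiff(seq_len(p), selected)
    # Score each candidate when appended to the current selection
    vals <- vapply(
      candidates,
      function(j) measure(Y, X[, c(selected, j), drop = FALSE]),
      numeric(1)
    )
    k <- which.max(vals)
    if (length(selected) == 0) {
      # No variable shows positive dependence: return the empty set
      if (vals[k] <= 0) break
    } else if (vals[k] <= best) {
      # The score no longer increases: stop
      break
    }
    selected <- c(selected, candidates[k])
    best <- vals[k]
    scores <- c(scores, best)
  }
  # If the loop never breaks, all p variables are selected
  list(selected = selected, scores = scores)
}

# Stand-in measure for illustration only (NOT irdc): adjusted R^2 of a linear fit
adj_r2 <- function(Y, Xsub) summary(lm(Y ~ Xsub))$adj.r.squared

set.seed(1)
X_demo <- matrix(rnorm(200 * 6), ncol = 6)
Y_demo <- 2 * X_demo[, 1] - X_demo[, 3] + rnorm(200, sd = 0.5)
sel <- forward_select(Y_demo, X_demo, adj_r2)
sel$selected
```

On this simulated data the first two picks should be columns 1 and 3. Because adjusted \(R^2\) can rise by chance, a few noise columns may follow. Passing the measure as an argument keeps the loop independent of the score, which mirrors how FORD reuses the FOCI structure with a different dependence measure.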
Here, \(Y\) depends only on the first 4 features of \(X\) in a nonlinear way.
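# Assumes foci() is provided by the FOCI package and ford() by the FORD package
library(FOCI)
library(FORD)
set.seed(42)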
n <- 2000
p <- 100
X <- matrix(rnorm(n * p), ncol = p)
colnames(X) <- paste0("X", seq_len(p))
Y <- X[, 1] * X[, 2] + sin(X[, 1] * X[, 3]) + X[, 4]^2
result_foci_1 <- foci(Y, X, numCores = 1)
result_foci_1
#> $selectedVar
#>    index  names
#>    <num> <char>
#> 1:     4     X4
#> 2:     1     X1
#> 3:     2     X2
#> 4:     3     X3
#>
#> $stepT
#> [1] 0.3356423 0.4027284 0.6226254 0.7619649
#>
#> attr(,"class")
#> [1] "foci"
result_ford_1 <- ford(Y, X, numCores = 1)
result_ford_1
#> $selectedVar
#>    index  names
#>    <num> <char>
#> 1:     4     X4
#> 2:     1     X1
#> 3:     2     X2
#> 4:     3     X3
#>
#> $step_nu
#> [1] 0.3198165 0.4026348 0.6324854 0.7668089
#>
#> attr(,"class")
#> [1] "ford"
We can force both FOCI and FORD to select a fixed number of variables by setting num_features and stop = FALSE, which disables the automatic stopping rule.
result_foci_2 <- foci(Y, X, num_features = 5, stop = FALSE, numCores = 1)
result_foci_2
#> $selectedVar
#>    index  names
#>    <num> <char>
#> 1:     4     X4
#> 2:     1     X1
#> 3:     2     X2
#> 4:     3     X3
#> 5:    66    X66
#>
#> $stepT
#> [1] 0.3356423 0.4027284 0.6226254 0.7619649 0.6900384
#>
#> attr(,"class")
#> [1] "foci"
result_ford_2 <- ford(Y, X, num_features = 5, stop = FALSE, numCores = 1)
result_ford_2
#> $selectedVar
#>    index  names
#>    <num> <char>
#> 1:     4     X4
#> 2:     1     X1
#> 3:     2     X2
#> 4:     3     X3
#> 5:    31    X31
#>
#> $step_nu
#> [1] 0.3198165 0.4026348 0.6324854 0.7668089 0.6988827
#>
#> attr(,"class")
#> [1] "ford"
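In both forced runs the fifth step value falls below the fourth, and the fifth variable (X66 for FOCI, X31 for FORD) is a noise column by construction. This is consistent with the automatic rule stopping after four variables. The sketch below applies the stopping rule by hand to the printed FORD values. It assumes that `step_nu` holds the irdc of the selected set after each step, as the output suggests, and it skips the first-step check against zero. The names `first_drop` and `n_keep` are illustrative only.

```r
# Step values copied from the forced five-variable FORD run shown above
step_nu <- c(0.3198165, 0.4026348, 0.6324854, 0.7668089, 0.6988827)

# Index of the first step whose successor fails to exceed it
first_drop <- which(diff(step_nu) <= 0)[1]

# Number of variables the automatic rule would keep
# (all of them if the values never stop increasing)
n_keep <- if (is.na(first_drop)) length(step_nu) else first_drop
n_keep
#> [1] 4
```

The same calculation on the FOCI `stepT` values also gives 4.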
FORD provides an interpretable, irdc-based alternative to FOCI for variable selection in regression tasks. It offers a principled forward selection framework that can detect complex nonlinear relationships and be adapted for fixed-size feature subsets.
For further theoretical details, see our paper:
Azadkia and Roudaki (2025), A New Measure Of Dependence: Integrated R2