Binary Regression

Ramin Mojab

2023-07-07

library(ldt) 

seed <- 123
set.seed(seed)

Introduction

In this vignette, I will introduce you to the ldt package and its main features for dealing with Logistic and Probit Regression models. You will learn how to estimate a binary model, make predictions, and assess model uncertainty. Additionally, we will explore the use of Principal Component Analysis as an alternative approach for handling a large number of potential explanatory variables.

One of the key ideas behind ldt is to minimize user discretion by using a rule-based approach to select data. This approach not only avoids discretion but also automates the process of searching for the best models within a defined model set.

To demonstrate these features, I will create an artificial dataset with a dependent variable and both relevant and irrelevant explanatory variables. The dependent and relevant explanatory variables are sampled from a known binary model. While we can evaluate how well the estimation process finds the true parameters, our main focus will be on how to estimate, search, predict, and report results.

Let’s get started!

A simple experiment

Let’s start by assuming that we know the structure of the model. We can do this by simulating data from two known binary models. The following command generates the required samples:

num_obs <- 100
num_exo <- 4L 
p_positive <- 0.4
max_weight <- 2
sample_l <- sim.bin(num_exo, num_obs, probit = FALSE, pPos = p_positive, maxWeight = max_weight)
sample_p <- sim.bin(sample_l$coef, num_obs, probit = TRUE, pPos = p_positive, maxWeight = max_weight)
 
print(sample_l$coef)
  > [1] -0.56047565 -0.23017749  1.55870831  0.07050839

We know the parameters of the systems because they are included in the output of the sim.bin function. Note that there is a logit and a probit model. Each model has one equation or dependent variable. This equation has an intercept and 3 exogenous variables. The sample size is 100. The coefficient vector of the logit model is generated randomly and is listed in the output, sample_l. It is reused to simulate the probit sample, and therefore the parameters of the two models are the same. In these samples, about 40 percent of observations are labeled as positive (controlled by the pPos argument). Finally, since max_weight is larger than 1, observations are weighted and there is a w element in the returned lists that we should consider in the estimation process.
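To get a feel for the generated data, we can inspect the returned list. The following quick checks are a sketch; they assume sample_l$y holds the 0/1 responses and sample_l$w holds the observation weights, as described above:

# share of positive observations; by design this should be close to pPos = 0.4
mean(sample_l$y == 1)

# first few observation weights; they differ from 1 because max_weight > 1
head(sample_l$w)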

The LaTeX code for the equation of the two models is included in the eqLatex elements of the output. It results in the following representations:

Logit: \[\begin{aligned} P(Y = 1 | X_2, X_3, X_4) = \frac{1}{1 + e^{-(-0.56 - 0.23 X_2 + 1.56 X_3 + 0.07 X_4)}} \end{aligned}\] Probit: \[\begin{aligned} P(Y = 1 | X_2, X_3, X_4) = \Phi(-0.56 - 0.23 X_2 + 1.56 X_3 + 0.07 X_4) \end{aligned}\]

Remember that these are the parameters of the system. We can use the glm function to estimate them. The following code shows how: in the first line we build the formula, and then we fit one model for each of the two samples:

eq_str <- paste0("Y ~ ", paste0(colnames(sample_l$x[,-1]), collapse = " + "))

fit_l <- glm(eq_str, data = data.frame(sample_l$y, sample_l$x), 
             family = binomial(link = "logit"), weights = sample_l$w) 
fit_p <- glm(eq_str, data = data.frame(sample_p$y, sample_p$x), 
             family = binomial(link = "probit"), weights = sample_p$w) 
 
fit_sf_l <- sim.bin(fit_l$coefficients, probit = FALSE )
fit_sf_p <- sim.bin(fit_p$coefficients, probit = TRUE)

The last two lines are used for reporting the LaTeX formula. We use the eqLatex element of the output, and this is the result:

Logit: \[\begin{aligned} P(Y = 1 | X_2, X_3, X_4) = \frac{1}{1 + e^{-(-0.46 - 0.36 X_2 + 1.82 X_3 + 0.26 X_4)}} \end{aligned}\] Probit: \[\begin{aligned} P(Y = 1 | X_2, X_3, X_4) = \Phi(-0.76 - 0.59 X_2 + 2.39 X_3 + 0.37 X_4) \end{aligned}\]

Note that these representations are not very appropriate for an estimated model because they do not report the estimated standard errors, and there are better ways to present such results. However, they suit our purpose here. Also, note that you can get closer agreement between the actual and estimated coefficients by increasing the sample size.

The following code does the same by using the estim.bin function in the ldt package:

fit_l <- estim.bin(sample_l$y, sample_l$x[,-1], sample_l$w) 

fit_p <- estim.bin(sample_p$y, sample_p$x[,-1], sample_p$w, linkFunc = "probit") 

res_table <- coefs.table(list(Logit = fit_l, Probit = fit_p), 
                             regInfo = c("obs"), formatLatex = FALSE)

The last line converts the estimated results into a table for presentation. Additional arguments can be used to control the format and level of information displayed in the table. The contents of the table can then be rendered with the kable function from the knitr package.
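For example, a minimal call might look like this (the actual vignette may pass additional formatting arguments to kable):

knitr::kable(res_table)

This is the result: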

Results of estimation using the ldt::estim.bin function.

            Logit      Probit
Intercept   -0.46**    -0.82***
X1          -0.36      -0.47***
X2           1.82***    2.17***
X3           0.26       0.32**
obs          100        100

Differences between the results of the glm and estim.bin functions may be due to variations in initialization or optimization procedures.

The coefs.plot function was discussed in the vignette on SUR models. Since this vignette focuses on prediction rather than parameter estimation, it is not covered here.

Prediction

A binary regression model can be used for binary classification. The model predicts the likelihood that a new case falls into one of two classes based on the values of the explanatory variables. The class is either positive (\(Y = 1\)) or negative (\(Y = 0\)). The regression model estimates the likelihood given the explanatory variables, i.e., \(P(Y=1 \mid X)\). The other probability follows immediately: \(P(Y=0 \mid X) = 1 - P(Y=1 \mid X)\). Deciding the class of an observation given this probability requires a decision rule.

To continue our experiment, let’s focus on the logit model and generate a sample of size 10 from the true model using the sim.bin function. We can predict using the estim.bin function in ldt by setting the newX argument. This is the code:

sample_l_new <- sim.bin(sample_l$coef, 10, probit = FALSE)

fit_l <- estim.bin(sample_l$y, sample_l$x[,-1], sample_l$w, newX = sample_l_new$x[,-1])

We have three types of information: the actual class of each new observation (sample_l_new$y), the probabilities used in the simulation process (sample_l_new$p1), and the predicted probabilities (fit_l$projection[,2]). Let’s plot them:
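The plot itself is rendered in the original document; a minimal base-R sketch along the following lines produces it (the colors and plotting symbols are chosen here to match the description below: green circles for the actual probabilities and red plus signs for the predictions):

plot(sample_l_new$y, pch = 16, col = "black", ylim = c(0, 1),
     xlab = "Observation", ylab = "Probability / class")
points(sample_l_new$p1, pch = 1, col = "green")       # actual P(Y=1|X): green circles
points(fit_l$projection[, 2], pch = 3, col = "red")   # predicted probabilities: red plus signs
legend("topleft", legend = c("actual class", "actual probability", "predicted probability"),
       pch = c(16, 1, 3), col = c("black", "green", "red"), bty = "n")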

As you can see, it’s possible to have positive observations with low probabilities and negative observations with high probabilities due to the random nature of the binomial distribution.

In practice, we do not see the green circles, i.e., \(P(Y=1 \mid X)\). We only see the actual class of observations. The binary regression model predicts the green circles. This means that even if we have the exact coefficient vector, the best we can do is to predict the exact probability and not the actual class of observation. There must be a decision-making process that converts the predicted probability to a specific class.

When using a binary regression model to make classification decisions, a common approach is to define a threshold value for the predicted probability. If the predicted probability for a case is greater than or equal to the threshold value, the case is classified as belonging to the positive class. If the predicted probability is less than the threshold value, the case is classified as belonging to the negative class.

However, the threshold is not a parameter of the model but just a decision rule. The choice of threshold value affects the balance between sensitivity and specificity but cannot completely eliminate classification error. One might also design a more complex decision-making process than a single fixed threshold.

There are several ways to evaluate the predictive power of a binary regression model. One common approach is to use a confusion matrix to calculate accuracy, sensitivity, specificity, and precision. Other metrics include the AUC-ROC and the Brier score. The AUC-ROC is the probability that a randomly chosen positive case receives a higher predicted probability than a randomly chosen negative case. A lower Brier score indicates better predictive performance.
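As an illustration of the threshold rule, the following sketch classifies the out-of-sample observations with a fixed 0.5 threshold and tabulates the confusion matrix. The 0.5 value is an assumption of this example, not a default of the package, and fit_l$projection[, 2] is used as the predicted probability as in the plot above:

threshold <- 0.5
pred_class <- factor(as.integer(fit_l$projection[, 2] >= threshold), levels = c(0, 1))
actual <- factor(c(sample_l_new$y), levels = c(0, 1))
conf_mat <- table(actual = actual, predicted = pred_class)
conf_mat

# accuracy, sensitivity and specificity derived from the confusion matrix
sum(diag(conf_mat)) / sum(conf_mat)          # accuracy
conf_mat["1", "1"] / sum(conf_mat["1", ])    # sensitivity
conf_mat["0", "0"] / sum(conf_mat["0", ])    # specificity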

As in the previous subsection, we can use the coefs.table function and report the results in a table:

res_table <- coefs.table(list(Logit = fit_l), 
                             regInfo = c("obs", "aic", "sic", "aucIn", "brierIn"), 
                             formatLatex = FALSE)

Results of estimation using the ldt::estim.bin function with prediction.

            Y
Intercept   -0.46**
X1          -0.36
X2           1.82***
X3           0.26
obs          100
aic          161.12
sic          171.55
aucIn        0.83
brierIn      0.16

One metric (specific to our experiment) is not present in the table. Its formula is \(a = \frac{\sum_{i=1}^{n}(p_i-\hat{p}_i)^2}{n}\) where \(p_i\) represents the actual probabilities (the green circles in the plot) and \(\hat{p}_i\) represents the predicted probabilities (the red plus signs in the plot). This metric is similar to the Brier score but uses actual probabilities. A value of zero indicates a perfect fit and larger values indicate lower explanatory power. The value for this experiment is 0.0035284.
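For reference, this quantity can be computed directly from the simulated and predicted probabilities. The sketch below assumes, as before, that sample_l_new$p1 holds the actual probabilities and the second column of fit_l$projection holds the predictions:

# mean squared difference between actual and predicted probabilities
mean((sample_l_new$p1 - fit_l$projection[, 2])^2)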

Model uncertainty

Let’s consider a more realistic situation where model uncertainty exists. That’s where ldt can help. In the previous subsection, we knew all relevant explanatory variables. Here, we consider a situation where there are some irrelevant variables too. We limit the level of uncertainty and other practical issues by restricting the number of these variables. The following code reflects our assumptions:

sample_l$x <- cbind(sample_l$x, matrix(rnorm(num_obs * 50), ncol = 50, 
                                   dimnames = list(NULL,paste0("z",1:50))))

sample_l_new$x <- cbind(sample_l_new$x, matrix(rnorm(nrow(sample_l_new$x) * 50), ncol = 50, 
                                   dimnames = list(NULL,paste0("z",1:50))))

In our experiment, there are 50 irrelevant and 3 relevant variables. The number of irrelevant variables is relatively large, and their names start with the letter z. The second statement appends the same irrelevant columns to the out-of-sample data so that it can still be used for prediction.

The following code uses the search.bin function to find the actual model:

search_res <- search.bin(sample_l$y, sample_l$x[,-1], sample_l$w,
                        xSizes = c(1:4), 
                        metricOptions = get.options.metric(typesIn = c("sic")))

The xSizes = c(1:4) part assumes that we know the number of relevant explanatory variables is at most 4. The metricOptions part shows that we use the SIC metric to evaluate and compare models.

This code is time-consuming and is not evaluated here. However, on my system, the elapsed time is 77 seconds (the number of searched models is 317682). Note that if we relax our previous guess and assume that the maximum number of relevant variables is larger, for example 5, the size of the model set becomes 3187367 (10 times larger) and the search takes 977 seconds (about 12 times longer) on my system. Many factors affect this time, including the optimization options. Also check the parallel option of the get.options.search() function.
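The size of the model set follows directly from the number of candidate regressors. With 3 relevant and 50 irrelevant variables there are 53 candidates, and counting all subsets of the allowed sizes reproduces the figures quoted above:

num_candidates <- ncol(sample_l$x[, -1])   # 3 relevant + 50 irrelevant = 53 candidate regressors
sum(choose(num_candidates, 1:4))           # 317682 models when sizes 1 to 4 are allowed
sum(choose(num_candidates, 1:5))           # 3187367 models when the maximum size is 5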

One might reduce the number of potential explanatory variables using theory or statistical testing. Since ldt dislikes user discretion, it provides a more systematic approach. The idea behind it is simple: estimate smaller models first, select the most promising variables, and then estimate larger models with this reduced set of potential explanatory variables. Here is the code:

x_size_steps <- list(c(1, 2), c(3), c(4), c(5))
count_steps <- c(NA, 20, 10, 9)

search_step_res <-
  search.bin.stepwise(y = sample_l$y, x = sample_l$x, w = sample_l$w,
                      xSizeSteps = x_size_steps, countSteps = count_steps,
                      metricOptions = get.options.metric(typesIn = c("aic","sic", "auc", "brier")),
                      searchItems = get.items.search(bestK = 10))
  > Warning in .SearchDc(y, x, w, xSizes, xPartitions, costMatrices, searchLogit, :
  > Error occurred in the search process. See 'result$counts'.
search_step_res
  > method: bin 
  > expected: 2,763, searched: 2,763 (100%), failed: 54 (2%)
  > elapsed time: 0.06722522 minutes 
  > --------
  > Failures:
  > 1. matrix singularity: 54 (100%)
  > --------
  > 1. aic:
  >  Y (best=131.239)
  > 2. sic:
  >  Y (best=146.87)
  > 3. aucIn:
  >  Y (best=0.896)
  > 4. brierIn:
  >  Y (best=0.122)

The first two lines define the steps. In the first step we use all variables (NA in count_steps means all) to estimate models with the sizes given by the first element of x_size_steps. We then select a number of variables based on the information provided by the best models and estimate models with the sizes given by the second element of x_size_steps, and so on.

The size of the model set and the running time are greatly reduced. However, let’s see how it performs.

To study or report results, we should use the summary function. The output of a search in ldt does not contain estimation results, only the minimum level of information needed to replicate them. The summary function does the job and re-estimates the models. Here is the code:

ssum <- summary(search_step_res, 
                y = sample_l$y, x = sample_l$x, w = sample_l$w, 
                newX = sample_l_new$x, 
                printMsg = FALSE)

Usually, there is more than one model in the summary output. This is because the output is first “target-variable-specific” and second “evaluation-specific”. In this application, there is just one target but we requested four different types of evaluations in the get.options.metric function. We can report the results by creating a list of estimated models and using the coefs.table function:

mod_list <- list("SIC" = ssum$sic$target1$model$bests$best1,
                 "AIC" = ssum$aic$target1$model$bests$best1,
                 "AUC" = ssum$aucIn$target1$model$bests$best1,
                 "Brier" = ssum$brierIn$target1$model$bests$best1)
res_table <- coefs.table(mod_list, 
                             regInfo = c("obs", "aic", "sic", "aucIn", "brierIn"), 
                             formatLatex = FALSE)

Since we set the newX argument in the summary function, we can plot the predictions:
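The plot appears in the rendered document; a sketch similar to the earlier one could be used here, assuming the best model stored in the summary exposes its predicted probabilities in the second column of a projection element, as the estim.bin output did above:

best_model <- ssum$sic$target1$model$bests$best1
plot(sample_l_new$y, pch = 16, col = "black", ylim = c(0, 1),
     xlab = "Observation", ylab = "Probability / class")
points(sample_l_new$p1, pch = 1, col = "green")            # actual probabilities
points(best_model$projection[, 2], pch = 3, col = "red")   # predicted probabilities
legend("topleft", legend = c("actual class", "actual probability", "predicted probability"),
       pch = c(16, 1, 3), col = c("black", "green", "red"), bty = "n")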

With the current seed and other options, all metrics point to a common model. In this regression, a relevant variable (X3) is missing, and three irrelevant variables are present with non-significant coefficients.

Out-of-sample simulation

Let’s see if we can do any better by using out-of-sample evaluations. The following code is similar to the code in the previous section, but we define an out-of-sample evaluation process:

metric_options <- get.options.metric(typesOut = c("auc", "brier"), 
                                       seed = -seed, simFixSize = 5, trainRatio = 0.75)
search_step_res <-
  search.bin.stepwise(y = sample_l$y, x = sample_l$x, w = sample_l$w,
                      xSizeSteps = x_size_steps, countSteps = count_steps,
                      metricOptions = metric_options,
                      searchItems = get.items.search(bestK = 10),
                      searchOptions = get.options.search(printMsg = FALSE))
  > Warning in .SearchDc(y, x, w, xSizes, xPartitions, costMatrices, searchLogit, :
  > Error occurred in the search process. See 'result$counts'.
search_step_res
  > method: bin 
  > expected: 2,381, searched: 2,381 (100%), failed: 54 (2.3%)
  > elapsed time: 0.06695981 minutes 
  > --------
  > Failures:
  > 1. matrix singularity: 54 (100%)
  > --------
  > 1. aucOut:
  >  Y (best=0.931)
  > 2. brierOut:
  >  Y (best=0.094)

We use 75 percent of the observations (determined by trainRatio) for estimation and the rest for testing. We repeat this experiment 5 times (determined by simFixSize). We can report the results similarly to the previous discussion:
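A sketch of that report follows the same summary and coefs.table pattern as before. The element names aucOut and brierOut mirror the metrics listed in the search output above, and the regInfo codes are assumptions of this example rather than documented values:

ssum_out <- summary(search_step_res,
                    y = sample_l$y, x = sample_l$x, w = sample_l$w,
                    newX = sample_l_new$x,
                    printMsg = FALSE)
mod_list_out <- list("AUC (out)" = ssum_out$aucOut$target1$model$bests$best1,
                     "Brier (out)" = ssum_out$brierOut$target1$model$bests$best1)
res_table_out <- coefs.table(mod_list_out,
                             regInfo = c("obs", "aucOut", "brierOut"),
                             formatLatex = FALSE)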

You can get better results by increasing the number of observations. Also, you can change optimization algorithm options or other search options to improve performance.