This vignette explains the matrix-vectorized implementation of the fully nonparametric exponential tilting (EXPTILT) estimator, as described in Appendix 2 of Riddles et al. (2016). This method is designed for fully categorical data, where the outcomes (\(Y\)), the response-model covariates (\(X_1\)), and the instrumental-variable covariates (\(X_2\)) are all discrete.
Unlike the parametric method, which estimates the parameters \(\phi\) of a response probability function \(\pi(x, y; \phi)\), the nonparametric approach directly estimates the nonresponse odds for each stratum. This is achieved using an Expectation-Maximization (EM) algorithm to find the maximum likelihood estimates of these odds.
The implementation (exptilt_nonparam.data.frame) assumes
the input data is an aggregated data
frame, where each row represents a unique stratum \(x^* = (x_1, x_2)\) and contains the
counts of respondents for each outcome \(y^*\) and the total count of
nonrespondents.
The implementation maps directly to the notation in Appendix 2. Let \(x_1\) be the covariates for the response model and \(x_2\) be the nonresponse instrumental variable. A full stratum is \(x^* = (x_1, x_2)\), and a response-model-only stratum is \(x_1\).
The algorithm is built from a set of fixed matrices (computed once) and one matrix that is iteratively updated.
n_y_x_matrix)
data[, outcome_cols]data
is pre-weighted).m_x_vec)
data[, refusal_col]p_hat_matrix)
n_y_x_matrix / rowSums(n_y_x_matrix)n_y_x1_matrix)
aggregate(n_y_x_matrix ~ data$x1_key, ...)odds_matrix)
m_y_x_matrix)
m_y_x1_matrix)
The function exptilt_nonparam.data.frame is a direct
implementation of the EM algorithm in Appendix 2. The goal is to find
the \(\text{argmax} O(x_1, y)\) that
maximizes the observed data likelihood, which is solved by iterating two
steps.
The E-Step computes the expected breakdown of nonrespondents into
outcome categories. It answers: “Given our current
odds_matrix \(O^{(t)}\),
what is the expected count \(M_{y^*x^*}^{(t)}\) for each full stratum
\(x^*\)?”
E-STEP in code): This
is vectorized. The denominator is computed via
rowSums(p_hat_matrix * odds_joined_matrix). The full
calculation is:
m_y_x_matrix <- m_x_vec * (p_hat_matrix * odds_joined_matrix) / denominatorThe M-Step updates the nonresponse odds \(O(x_1, y)\) using the expected counts from the E-Step.
m_y_x1_matrix <- aggregate(m_y_x_matrix ~ data$x1_key, ...)odds_matrix <- m_y_x1_matrix / n_safeSteps 1.1 and 1.2 are repeated until the change in the
odds_matrix (measured by the sum of absolute differences)
falls below the tolerance tol.
The exptilt_nonparam.data.frame function assumes the
input data is already aggregated. If a
survey.design object is provided to
exptilt_nonparam.survey.design, an adapter (not shown here)
must first be used to create this aggregated table from the microdata.
In that case, \(N_{y^*x^*}\) and \(M_{x^*}\) represent weighted sums
of counts (\(\sum d_i
\dots\)), not simple counts. The EM algorithm and all derived
matrices (p_hat_matrix, odds_matrix) are then
correctly computed based on these weighted inputs, following the logic
of the paper.
The final output data_to_return is the object of primary
interest for analysis. It is constructed by: 1. Calculating the final
expected nonrespondent counts \(M_{y^*x^*}^{(\text{final})}\) using the
converged odds_matrix. 2. Adding these expected counts to
the original observed respondent counts \(N_{y^*x^*}\).
data_to_return[y^*, x^*] = N_{y^*x^*} + M_{y^*x^*}^{(\text{final})}data_to_return[, outcome_cols] <- n_y_x_matrix_ordered + m_y_x_matrix_orderedThis final adjusted table represents the completed dataset, where the “Refusal” counts have been redistributed across the outcome columns according to the NMAR model.
The final population proportion for an outcome \(j\), \(\hat{\theta}_j\), is the total (weighted)
adjusted count for outcome \(j\)
divided by the total (weighted) population. This is calculated
from the data_to_return object.
data_to_return object.This differs from the paper’s Step 3 formula, which is an IPW
(Inverse Probability Weighting) estimator. However, both methods are
asymptotically equivalent, and the “add-and-sum” method (your
data_to_return) is a more direct and intuitive application
of the EM algorithm’s goal.