The gofedf package provides computational tools to apply goodness-of-fit tests based on empirical distribution function (EDF) theory.
The package offers functions and routines to test the hypothesis that a
univariate sample follows a distribution based on the empirical
distribution function. The theory is founded on reducing the problem to
a stochastic process and computing its covariance function. An
approximate p-value is computed using the Imhof or Farebrother method, based on the limiting distribution of the statistic (see the Note section for more details about the choice of method). Users can run the test by calculating either the Cramer-von
Mises or Anderson-Darling statistic. The covariance function of the
stochastic process relies on specific characteristics of the assumed
model. Notably, knowledge of the Fisher information matrix and the
partial derivatives of the cumulative distribution function is crucial
for computing the covariance function. However, obtaining these
quantities can be computationally intensive or challenging in general
likelihood models. To overcome this limitation, we propose an
alternative method for estimating the covariance function of the
stochastic process directly from the sample data. The package provides
tools for this estimation and for testing if a sample comes from any
general likelihood model. In summary, the package can be used to apply a
goodness-of-fit test in any of the following settings:
Validate if the assumptions about the response variable in a generalized linear model (with any link function) are satisfied. The current version only checks for the Gamma distribution.
Validate if the normality assumptions in a linear model are satisfied.
Apply a goodness-of-fit test to examine if a set of univariate i.i.d. samples follows a Normal, Gamma, or Exponential distribution.
Formal model evaluation in a general likelihood model. In this case, probability integral transform (PIT) values of the sample and the score function (if any parameter estimation is involved) are required. See the example for more details.
You can install the released version of gofedf
from CRAN with:
install.packages('gofedf')
or the latest version from the GitHub page with:
# install.packages("devtools")
devtools::install_github('pnickchi/gofedf')
In this section, we review the package’s primary functions and provide illustrative examples to demonstrate their usage. The first two examples relate to applying the goodness-of-fit (GOF) test for i.i.d. samples, while the next two focus on linear models and generalized linear models. The final example showcases the most important feature of the package: applying goodness-of-fit tests based on the empirical distribution function (EDF) to a general likelihood model.
The first example illustrates the GOF test for an i.i.d. sample from a Normal distribution. The main function is testNormal. At a minimum, it requires a numeric vector as input and runs with default values for the other arguments. Here we describe the parameters of this function; note that some of them are shared and will show up later in other functions. By default, the value of discretize is set to FALSE, meaning that an estimated integral equation is solved to compute the eigenvalues. If set to TRUE, the integral equation problem is replaced by a matrix equation problem and the eigenvalues of this matrix are computed instead.
A common approach to solve for the eigenvalues (\(\lambda\)) numerically is to discretize the integral over the interval [0,1]. For any given covariance function, such as \(\hat\rho(s,t)\) in our case, the eigenvalues can be approximated by solving the following system of equations:

\[ \sum_{j=1}^{m} w_{j}\, \rho(s_i, s_j) f(s_j) = \lambda_i f(s_i), \quad i = 1, 2, \ldots, m \]

where \(m\) is the number of knots \(s_j\) used to discretize the integral over [0,1] and the \(w_j\) are quadrature weights. The parameters ngrid, gridpit, and hessian, described below, are only relevant when you set discretize = TRUE.
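For intuition, here is a minimal sketch of how such a discretization turns the integral equation into a matrix eigenvalue problem. This is not part of gofedf; the midpoint grid, equal weights, and the function name nystrom_eigen are illustrative assumptions.

# Hypothetical helper: approximate the eigenvalues of a covariance kernel
# rho(s, t) on (0,1) by the eigenvalues of the m x m matrix [w_j * rho(s_i, s_j)].
nystrom_eigen <- function(rho, m = 100) {
  s <- (seq_len(m) - 0.5) / m   # midpoint grid on (0,1)
  w <- rep(1 / m, m)            # equal quadrature weights
  K <- outer(s, s, rho)         # kernel matrix rho(s_i, s_j)
  # Scale by sqrt(w) on both sides so the matrix stays symmetric
  A <- diag(sqrt(w)) %*% K %*% diag(sqrt(w))
  eigen(A, symmetric = TRUE, only.values = TRUE)$values
}

# Check against the Brownian bridge kernel min(s,t) - s*t, whose true
# eigenvalues are known to be 1 / (pi^2 * k^2):
nystrom_eigen(function(s, t) pmin(s, t) - s * t, m = 200)[1:3]
# compare with 1 / (pi^2 * (1:3)^2)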
By default, the package uses probability integral transform (PIT) values to compute the stochastic process and, later, its covariance function. You can change this behavior by setting gridpit = FALSE and assigning a positive value to ngrid. The default value for ngrid is the same as the number of observations, n. This means the (0,1) interval is divided into n equally spaced points at which the stochastic process and the covariance function are computed (the values of \(s_i\) in the integral equation). Additionally, the Fisher information matrix is, by default, estimated by the variance of the score function. To change this, you can set hessian = TRUE to estimate the Fisher information matrix using the Hessian instead. Finally, method is a string that defines the statistic to compute. Possible values are cvm for Cramer-von Mises, ad for Anderson-Darling, and both to compute both statistics.
# Reproducible example
set.seed(123)
# Randomly generate some data from Normal distribution
n <- 50
x <- rnorm(n)
# Test if the data follows a Normal distribution by calculating the Cramer-von Mises statistic and approximate p-value of the test.
testNormal(x = x, method = 'cvm')
## $Statistic
## Cramer-von-Mises Statistic
## 0.03781322
##
## $pvalue
## [1] 0.7068748
# Generate a random sample from a non-Normal distribution.
x <- rexp(n)
testNormal(x = x, method = 'cvm')
## $Statistic
## Cramer-von-Mises Statistic
## 0.2368814
##
## $pvalue
## [1] 0.005456979
The second example illustrates the GOF test for an i.i.d. sample from a Gamma distribution. The main function is testGamma, and the arguments remain the same as in the Normal case.
# Reproducible example
set.seed(123)
# Randomly generate some data
n <- 50
x <- rgamma(n, shape = 1)
# Test if the data follows a Gamma distribution, calculate Cramer-von Mises statistic and approximate p-value
testGamma(x = x, method = 'cvm')
## $Statistic
## Cramer-von-Mises Statistic
## 0.03313641
##
## $pvalue
## [1] 0.7883045
In this example, we illustrate how to apply the GOF test to verify the assumptions of a linear model. The main function is testLMNormal. At a minimum, a numeric vector of the response variable, y, and a vector/matrix of explanatory variables, x, are required. Conveniently, the function can take an object of class “lm” and directly apply the goodness-of-fit test; in this case, there is no need to pass x and y. Note that if you decide to use this feature, you need to explicitly ask the lm function to return the design matrix and response variable by passing x = TRUE and y = TRUE (as shown in the example below). The other arguments of the function are the same as in the previous examples.
# Reproducible example
set.seed(123)
# Create a set of explanatory variables and a response variable according to a linear model
# Sample size
n <- 50

# Number of explanatory variables
p <- 5

# Generate some coefficients
b <- runif(p)

# Simulate random explanatory variables
X <- matrix(runif(n * p), nrow = n, ncol = p)

# Generate some error terms from Normal distribution
e <- rnorm(n)

# Generate response variable according to the linear model
y <- X %*% b + e
# Test if the residuals of the model follow a Normal distribution; calculate the Cramer-von Mises statistic and approximate p-value
testLMNormal(x = X, y)
## $Statistic
## Cramer-von-Mises Statistic
## 0.02285164
##
## $pvalue
## [1] 0.939902
# Or alternatively, pass the 'lm.fit' object returned by the lm() function directly.
# Note that in this case you need to set x = TRUE and y = TRUE.
lm.fit <- lm(y ~ X, x = TRUE, y = TRUE)
testLMNormal(fit = lm.fit)
## $Statistic
## Cramer-von-Mises Statistic
## 0.02285164
##
## $pvalue
## [1] 0.939902
In this example, we illustrate how to apply the GOF test to verify if the response variable in a generalized linear model with any link function follows a Gamma distribution. The main function for this is testGLMGamma. At a minimum, you need a numeric vector of the response variable, y, a vector/matrix of explanatory variables, x, and a link function for the Gamma family. Conveniently, the function can take an object of class glm and directly apply the goodness-of-fit test. You can use the glm function or the glm2 function from the glm2 package. We recommend the glm2 function, as it provides better estimates for the coefficients and avoids convergence issues in the optimization process while calculating the MLE of the coefficients. In either case, there is no need to pass x and y. However, if you decide to use this feature (passing an object from glm or glm2), you must explicitly ask the glm or glm2 function to return the design matrix by passing x = TRUE (as shown in the example below). Additionally, you can pass a starting value, start.value, to be used as the initial value for MLE estimation of the coefficients. The function also offers a list of parameters to control the fitting process in the glm or glm2 functions. The other arguments remain consistent with the previous examples.
# Reproducible example
set.seed(123)
# Create a set of explanatory variables and a response variable according to a generalized linear model.
# Sample size
n <- 50

# Number of explanatory variables
p <- 5

# Simulate random explanatory variables
X <- matrix(rnorm(n * p, mean = 10, sd = 0.1), nrow = n, ncol = p)

# Generate some coefficients
b <- runif(p)

# Generate some error terms from Gamma distribution
e <- rgamma(n, shape = 3)

# Generate response variable according to the generalized linear model (log link function)
y <- exp(X %*% b) * e
# Test if the Gamma assumption on the response variable holds by calculating the Cramer-von Mises statistic and approximate p-value
testGLMGamma(x=X, y, l = 'log', method = 'cvm')
## $Statistic
## Cramer-von-Mises Statistic
## 0.0870493
##
## $pvalue
## [1] 0.1874801
# Or alternatively, pass the 'glm.fit' object directly instead:
glm.fit <- glm2::glm2(y ~ X, family = Gamma(link = 'log'), x = TRUE, y = TRUE)
testGLMGamma(fit = glm.fit, l = 'log')
## $Statistic
## Cramer-von-Mises Statistic
## 0.0870493
##
## $pvalue
## [1] 0.1874801
One of the most important features of the package is to provide computational tools to apply the goodness-of-fit test based on empirical distribution functions for any general likelihood model. We have provided tools to apply the test for the Normal and Gamma distributions and to verify the assumptions in linear and generalized linear models, but this additional feature allows you to test if the sample comes from any general likelihood model. For example, suppose you have a sample of size \(n\), say \(X_1, X_2, \ldots, X_n\), from a model with CDF \(F(x;\theta)\), where \(\theta\) contains \(p\) parameters. Before running the test, at a minimum you need the following:
1. A numeric vector of observations, x.
2. The probability integral transform (PIT) values of the sample, which must be a numeric vector of the same length as x with elements \(F(X_{i};\theta)\).
3. If \(\theta\) is unknown, the score function. This must be a matrix with \(n\) rows and \(p\) columns where each row measures the score of one observation. The values are computed as \(S(X_{i};\theta) = \frac{\partial}{\partial \theta} \log f(X_{i};\theta)\), where \(f(X_{i};\theta)\) is the probability density function.

Of course, if \(\theta\) is not known, you need to compute the MLE of \(\theta\) to obtain item 2 and, if needed, item 3. The main function to apply the GOF test in this case is testYourModel. The precision argument sets the precision used to check that the column sums of the score matrix are close enough to zero (the score, i.e., the derivative of the log-likelihood, is zero at the MLE). The other arguments of the function remain consistent with the previous examples.
In the following example, we demonstrate how to apply the goodness-of-fit test to check if a sample follows an Inverse Gaussian distribution where the shape parameter depends on some weights. First, we generate data from an Inverse Gaussian distribution. For illustrative purposes, we include functions to compute the maximum likelihood estimate (MLE) and the score function for the sample. In the following chunk of code, IGScore is a function that returns the score for each observation, and IGPIT is a function that provides a vector of PIT values. Additionally, IGMLE calculates the MLE of the mean and shape parameters. Second, we calculate the score, PIT values, and MLE of the parameters. Finally, we call the testYourModel function to apply the test.
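The definitions of these helpers are not shown in this section. The following is a hedged sketch of what they might look like for this model, assuming y_i ~ IG(mean = mu, shape = lambda * w_i) in the statmod parametrization; the exact definitions used with the package may differ.

# Sketch of the helpers (illustrative, under the weighted-shape IG model above).
IGMLE <- function(obs, w) {
  # Setting the score for mu to zero gives the weighted mean
  mu <- sum(w * obs) / sum(w)
  # Setting the score for lambda to zero also gives a closed form
  lambda <- length(obs) / sum(w * (obs - mu)^2 / (mu^2 * obs))
  c(mu = mu, lambda = lambda)
}

IGScore <- function(obs, w, mle) {
  mu <- mle[1]; lambda <- mle[2]
  # Per-observation derivatives of the log-density with respect to (mu, lambda)
  d.mu     <- lambda * w * (obs - mu) / mu^3
  d.lambda <- 1 / (2 * lambda) - w * (obs - mu)^2 / (2 * mu^2 * obs)
  cbind(d.mu, d.lambda)
}

IGPIT <- function(obs, w, mle) {
  statmod::pinvgauss(obs, mean = mle[1], shape = mle[2] * w)
}

By construction, the column sums of the score matrix returned by IGScore vanish at the MLE, which is exactly what the precision argument of testYourModel checks.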
# Example: Inverse Gaussian (IG) distribution with weights
# Reproducible example
set.seed(123)
# Set the sample size
n <- 50

# Assign weights
weights <- runif(n, min = 5, max = 6)
weights <- weights / sum(weights)

# Set mean and shape parameters for the IG distribution.
mio <- 2
lambda <- 2

# Generate a random sample from the IG distribution with weighted shape.
y <- statmod::rinvgauss(n, mean = mio, shape = lambda * weights)

# Compute the MLE of the parameters, score matrix, and PIT values.
theta_hat <- IGMLE(obs = y, w = weights)
score.matrix <- IGScore(obs = y, w = weights, mle = theta_hat)
pit.values <- IGPIT(obs = y, w = weights, mle = theta_hat)
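# Optional sanity check (illustrative, not required): at the MLE the column
# sums of the score matrix should be numerically zero, which is what the
# precision argument of testYourModel verifies.
colSums(score.matrix)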
# Apply the goodness-of-fit test.
testYourModel(pit = pit.values, score = score.matrix)
## $Statistic
## Cramer-von-Mises Statistic
## 0.03292151
##
## $pvalue
## [1] 0.8616661
The calculation of the p-value for the goodness-of-fit test based on the empirical distribution function relies on computing the tail probability of a weighted sum of chi-squared random variables. Specifically, after finding the eigenvalues \(\lambda_{1}, \lambda_{2}, \ldots, \lambda_{n}\), we need to compute the p-value as \(\Pr\left(\sum_{i=1}^{n} \lambda_{i} Z_{i}^{2} \geq x\right)\), where each \(Z_{i}^{2}\) is a random variable following a \(\chi^{2}_{(1)}\) distribution and \(x\) represents the observed statistic (cvm or ad). The CompQuadForm package is used for this purpose, as it contains different methods for computing this tail probability. We were particularly interested in the Farebrother and Imhof methods. However, both the imhof and farebrother functions from the package encounter difficulties when computing the p-value if the statistic is in the very tail of the distribution or if some of the \(\lambda_{i}\) values are very small; they may produce negative or inaccurate p-values.
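For illustration, here is a minimal, hedged example of computing this tail probability directly with CompQuadForm. The eigenvalues and statistic below are illustrative choices, not output of the package: 1/(pi^2 k^2) is the well-known limiting eigenvalue sequence for the simple-hypothesis Cramer-von Mises statistic.

library(CompQuadForm)
lambda <- 1 / (pi^2 * (1:50)^2)   # truncated sequence of illustrative eigenvalues
x <- 0.46                         # an assumed observed statistic
imhof(x, lambda)$Qq               # Imhof's numerical-inversion method
farebrother(x, lambda)$Qq         # Farebrother's algorithm AS 204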
Through numerical experimentation in the GLM-Gamma case and comparison between the p-values generated by Imhof and Farebrother, we discovered a way to work around this problem. After computing the eigenvalues, we remove values that are extremely small (e.g., below \(1 \times 10^{-15}\)). Then, we divide the remaining eigenvalues into two sets: one set contains the values greater than \(\lambda_{1}/2000\), and the other set contains the values less than \(\lambda_{1}/2000\). We compute the sum of the eigenvalues in the second set and use this sum to compensate for the deleted eigenvalues, thereby correcting the cvm or ad statistic. The values in the first set are used for the p-value computation.
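A hedged sketch of this trimming step is shown below; the package's internal implementation may differ in details, in particular how exactly the statistic is corrected (here we assume the statistic is shifted by the expected contribution of the discarded terms).

trim_eigenvalues <- function(lambda, stat) {
  lambda <- sort(lambda, decreasing = TRUE)
  lambda <- lambda[lambda > 1e-15]        # drop numerically-zero eigenvalues
  keep   <- lambda > lambda[1] / 2000     # set one: used for the p-value
  # Assumption: correct the statistic by the expected contribution
  # E[lambda_i * Z_i^2] = lambda_i of each discarded eigenvalue.
  list(lambda = lambda[keep], stat = stat - sum(lambda[!keep]))
}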
During the computation of the p-value, we theoretically obtain both a
lower bound (LB) and an upper bound (UB) for the p-value. If the LB is
greater than 1e-7, we compute the p-value using the Imhof
method and ensure that the computed p-value falls within the range
between LB and UB. If it doesn’t, we calculate the p-value using the
Farebrother
method. If the LB falls between 1e-10 and 1e-7,
we compute the p-value using the Farebrother
method.
Finally, if the LB is less than 1e-10, we first attempt to calculate the
p-value using the Farebrother
method. If this attempt
fails, we return both the LB and UB along with a warning that
CompQuadForm
failed to generate a valid p-value.
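The branching above can be summarized in a short sketch. The function name and return convention are illustrative; LB and UB are assumed to be the theoretical bounds mentioned above.

choose_pvalue <- function(stat, lambda, LB, UB) {
  if (LB > 1e-7) {
    # Prefer Imhof, but fall back to Farebrother if the result leaves [LB, UB]
    p <- CompQuadForm::imhof(stat, lambda)$Qq
    if (p < LB || p > UB) p <- CompQuadForm::farebrother(stat, lambda)$Qq
    return(p)
  }
  if (LB > 1e-10) {
    return(CompQuadForm::farebrother(stat, lambda)$Qq)
  }
  # Very extreme tail: attempt Farebrother; on failure return the bounds
  p <- tryCatch(CompQuadForm::farebrother(stat, lambda)$Qq,
                error = function(e) NA_real_)
  if (is.na(p)) {
    warning('CompQuadForm failed to generate a valid p-value.')
    return(c(LB = LB, UB = UB))
  }
  p
}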
[1] Imhof, J.P. (1961). Computing the distribution of quadratic forms in normal variables. Biometrika, Vol. 48, 419-426.
[2] Farebrother, R.W. (1984). Algorithm AS 204: The distribution of a positive linear combination of chi-squared random variables. Journal of the Royal Statistical Society, Series C (Applied Statistics), Vol. 33, No. 3, 332-339.
[3] Giner, G. and Smyth, G.K. (2016). statmod: Probability calculations for the inverse Gaussian distribution. R Journal, Vol. 8, No. 1, 339-351.
[4] Stephens, M.A. (1974). EDF statistics for goodness of fit and some comparisons. Journal of the American Statistical Association, Vol. 69, 730-737.
[5] Stephens, M.A. (1976). Asymptotic results for goodness-of-fit statistics with unknown parameters. Annals of Statistics, Vol. 4, 357-369.
[6] Duchesne, P. and Lafaye De Micheaux, P. (2010). Computing the distribution of quadratic forms: Further comparisons between the Liu-Tang-Zhang approximation and exact methods. Computational Statistics and Data Analysis, Vol. 54, 858-862.