Getting Started with clmstan

Introduction

clmstan is an R package for fitting Cumulative Link Models (CLMs) using Bayesian inference via CmdStan. It provides: - 11 link functions: 5 standard + 6 flexible parametric links - 3 threshold structures: flexible, equidistant, symmetric - Bayesian estimation: Full posterior inference with MCMC - Pre-compiled Stan models: Fast execution via the instantiate package

Comparison with ordinal::clm()

Feature ordinal::clm() clmstan
Inference Maximum likelihood Bayesian (MCMC)
Uncertainty Asymptotic SE Posterior distribution
Link functions 5 standard 11 (5 standard + 6 flexible)
Estimation of link parameters No Yes

Installation

Before using clmstan, you need to install CmdStan:

# Step 1: Install cmdstanr
install.packages("cmdstanr",
                 repos = c("https://stan-dev.r-universe.dev",
                           getOption("repos")))

# Step 2: Install CmdStan (only needed once)
library(cmdstanr)
install_cmdstan()

# Step 3: Install clmstan
# devtools::install_github("t-momozaki/clmstan")

Quick Start

Fit a cumulative link model in just a few lines:

library(clmstan)
library(ordinal)
data(wine)

# Fit a model with logit link
fit <- clm_stan(rating ~ temp + contact, data = wine,
                chains = 4, iter = 2000, warmup = 1000)

# View results
print(fit)
summary(fit)

The wine dataset contains ratings (1-5) of 72 wine samples with two predictors: temperature (cold/warm) and contact (yes/no).

For ordinal::clm() Users

If you’re familiar with the ordinal package, the transition to clmstan is straightforward:

API Correspondence

ordinal::clm() clmstan::clm_stan()
clm(y ~ x, data) clm_stan(y ~ x, data)
link = "logit" link = "logit"
threshold = "flexible" threshold = "flexible"
summary(fit) summary(fit)
coef(fit) coef(fit)

Example Comparison

library(ordinal)
library(clmstan)
data(wine)

# ordinal::clm (frequentist)
fit_freq <- clm(rating ~ temp + contact, data = wine, link = "logit")

# clmstan::clm_stan (Bayesian)
fit_bayes <- clm_stan(rating ~ temp + contact, data = wine, link = "logit",
                      chains = 4, iter = 2000, warmup = 1000)

# Compare coefficients
coef(fit_freq)
coef(fit_bayes)

Key Differences

  1. Posterior distributions: clmstan provides full posterior samples, enabling Bayesian inference (credible intervals, posterior predictive checks)
  2. Reproducibility: Set set.seed() before clm_stan() for reproducible results
  3. Computation time: MCMC sampling is slower than MLE, but provides richer inference

Threshold Structures

clmstan supports three threshold parameterizations:

# View available threshold structures
supported_thresholds()

Flexible (Default)

Each threshold is freely estimated (K-1 parameters for K categories):

fit_flex <- clm_stan(rating ~ temp, data = wine, threshold = "flexible",
                     chains = 2, iter = 1000, warmup = 500)

Equidistant

Thresholds are equally spaced (2 parameters: start + interval):

# Useful for Likert scales with assumed equal intervals
fit_equi <- clm_stan(rating ~ temp, data = wine, threshold = "equidistant",
                     chains = 2, iter = 1000, warmup = 500)

Symmetric

Thresholds are symmetric around zero (useful for scales with a neutral center):

fit_sym <- clm_stan(rating ~ temp, data = wine, threshold = "symmetric",
                    chains = 2, iter = 1000, warmup = 500)

Prior Specification

clmstan uses weakly informative default priors, but you can customize them.

Default Priors

Parameter Default Prior
Coefficients (beta) normal(0, 2.5)
Thresholds (c) normal(0, 10)
t-link df gamma(2, 0.1)
GEV xi normal(0, 2)
SP r gamma(0.5, 0.5)
AEP theta1, theta2 gamma(2, 1)

Available Distributions

normal(mu, sigma)      # Normal distribution
gamma(alpha, beta)     # Gamma distribution (shape, rate)
student_t(df, mu, sigma)  # Student-t distribution
cauchy(mu, sigma)      # Cauchy distribution
flat()                 # Flat (improper) prior

Prior Classes

Class Description Compatible Distributions
"b" Regression coefficients normal, student_t, cauchy, flat
"Intercept" Thresholds (flexible) normal, student_t, cauchy, flat
"c1" First threshold (equidistant) normal, student_t, cauchy, flat
"d" Interval (equidistant) gamma
"cpos" Positive thresholds (symmetric) normal, student_t, cauchy, flat
"df" t-link degrees of freedom gamma
"xi" GEV shape parameter normal, student_t, cauchy
"r" SP skewness parameter gamma
"theta1", "theta2" AEP shape parameters gamma

Legacy API (clm_prior)

For backward compatibility:

fit <- clm_stan(rating ~ temp, data = wine,
                prior = clm_prior(beta_sd = 1, c_sd = 5),
                chains = 2, iter = 1000, warmup = 500)

Interpreting Results

Basic Methods

fit <- clm_stan(rating ~ temp + contact, data = wine,
                chains = 4, iter = 2000, warmup = 1000)

# Quick overview
print(fit)

# Detailed summary with credible intervals
summary(fit)
summary(fit, probs = c(0.025, 0.5, 0.975))

# Extract point estimates
coef(fit)                    # Posterior mean (default)
coef(fit, type = "median")   # Posterior median

Diagnostic Plots

# Trace plots (check mixing)
plot(fit, type = "trace")

# Posterior density
plot(fit, type = "dens")

# Posterior intervals
plot(fit, type = "intervals")

# Autocorrelation (check for high autocorrelation)
plot(fit, type = "acf")

# Select specific parameters
plot(fit, type = "trace", pars = c("beta[1]", "beta[2]"))

Convergence Diagnostics

# Quick summary
diagnostics(fit)

# Detailed output
diagnostics(fit, detail = TRUE)

Key diagnostics to check: - Rhat: Should be < 1.01 (values > 1.01 indicate poor convergence) - ESS (bulk/tail): Should be > 400 (low values suggest inefficient sampling) - Divergences: Should be 0 (divergences indicate sampling problems)

Prediction

Category Prediction

fit <- clm_stan(rating ~ temp + contact, data = wine,
                chains = 4, iter = 2000, warmup = 1000)

# Predict most likely category
pred_class <- predict(fit, type = "class")
head(pred_class)

# Prediction for new data
newdata <- data.frame(
  temp = factor("warm", levels = levels(wine$temp)),
  contact = factor("yes", levels = levels(wine$contact))
)
predict(fit, newdata = newdata, type = "class")

Probability Prediction

# Predicted probabilities for each category
pred_probs <- predict(fit, type = "probs")
head(pred_probs)

# Alternative: fitted values
fitted_vals <- fitted(fit)
head(fitted_vals)

Posterior Predictive Distribution

# Draw from posterior predictive distribution
y_rep <- posterior_predict(fit)
dim(y_rep)  # draws x observations

# Uncertainty in predictions
pred_draws <- predict(fit, type = "class", summary = FALSE)

Model Comparison

Use LOO-CV (Leave-One-Out Cross-Validation) for model comparison:

# Fit competing models
fit1 <- clm_stan(rating ~ temp, data = wine, link = "logit",
                 chains = 4, iter = 2000, warmup = 1000)
fit2 <- clm_stan(rating ~ temp + contact, data = wine, link = "logit",
                 chains = 4, iter = 2000, warmup = 1000)
fit3 <- clm_stan(rating ~ temp + contact, data = wine, link = "probit",
                 chains = 4, iter = 2000, warmup = 1000)

# Compute LOO-CV
loo1 <- loo(fit1)
loo2 <- loo(fit2)
loo3 <- loo(fit3)

# View results
print(loo1)

# Pareto k diagnostic plot
plot(loo1)

# Compare models
loo::loo_compare(loo1, loo2, loo3)

# WAIC is also available
waic(fit1)

Interpreting LOO results: - Lower elpd_loo (expected log pointwise predictive density) indicates better fit - Pareto k values > 0.7 suggest unreliable LOO estimates for those observations

Practical Workflow

Here’s a recommended workflow for ordinal data analysis:

library(clmstan)
library(ordinal)
data(wine)

# Step 1: Fit baseline model
set.seed(123)
fit_base <- clm_stan(rating ~ temp + contact, data = wine,
                     link = "logit", threshold = "flexible",
                     chains = 4, iter = 2000, warmup = 1000)

# Step 2: Check convergence
diagnostics(fit_base)
plot(fit_base, type = "trace")

# Step 3: Review results
summary(fit_base)

# Step 4: Try alternative link functions
fit_probit <- clm_stan(rating ~ temp + contact, data = wine,
                       link = "probit",
                       chains = 4, iter = 2000, warmup = 1000)

fit_gev <- clm_stan(rating ~ temp + contact, data = wine,
                    link = "gev", link_param = list(xi = "estimate"),
                    chains = 4, iter = 2000, warmup = 1000)

# Step 5: Compare models
loo_base <- loo(fit_base)
loo_probit <- loo(fit_probit)
loo_gev <- loo(fit_gev)
loo::loo_compare(loo_base, loo_probit, loo_gev)

# Step 6: Final model interpretation
summary(fit_base)
plot(fit_base, type = "intervals")

Troubleshooting

CmdStan Not Found

# Check CmdStan installation
cmdstanr::cmdstan_path()
cmdstanr::cmdstan_version()

# If not found, install CmdStan:
cmdstanr::install_cmdstan()

Convergence Issues

If you see Rhat > 1.01 or low ESS:

# Increase iterations and warmup
fit <- clm_stan(rating ~ temp, data = wine,
                chains = 4,
                iter = 4000,
                warmup = 2000)

# Increase adapt_delta (helps with divergences)
fit <- clm_stan(rating ~ temp, data = wine,
                chains = 4,
                iter = 2000,
                warmup = 1000,
                adapt_delta = 0.95)

# Increase max_treedepth
fit <- clm_stan(rating ~ temp, data = wine,
                chains = 4,
                iter = 2000,
                warmup = 1000,
                max_treedepth = 12)

Divergent Transitions

Divergent transitions indicate that the sampler had difficulty exploring the posterior:

  1. Increase adapt_delta: Try 0.95, 0.99, or even 0.999
  2. Reparameterize: Try different threshold structures
  3. Simplify priors: Use more informative priors
  4. Check data: Look for separation or outliers
# High adapt_delta
fit <- clm_stan(rating ~ temp, data = wine,
                chains = 4,
                iter = 2000,
                warmup = 1000,
                adapt_delta = 0.99)

Memory Issues

For large datasets:

# Reduce number of chains
fit <- clm_stan(rating ~ temp, data = wine,
                chains = 2,
                iter = 2000,
                warmup = 1000)

# Run chains in parallel (default)
fit <- clm_stan(rating ~ temp, data = wine,
                chains = 4,
                parallel_chains = 4)

References and Resources

Package Resources

Statistical References

mirror server hosted at Truenetwork, Russian Federation.