Understanding Method Comparison Statistics

Marcello Grassi

Introduction

This vignette provides a conceptual overview of the statistical methods implemented in valytics. The goal is to help you understand what the numbers mean and how to think about them, not to prescribe specific acceptance criteria or make decisions for you.

Whether your analysis “passes” or “fails” depends entirely on your specific application, regulatory requirements, and clinical context. This package provides the tools; you and your organization define what constitutes acceptable agreement.

library(valytics)
library(ggplot2)

Statistical Concepts in Bland-Altman Analysis

What is Bias?

The bias (mean difference) quantifies the average systematic offset between two methods. It answers: “On average, how much higher or lower does method Y read compared to method X?”

data("creatinine_serum")
ba <- ba_analysis(
  x = creatinine_serum$enzymatic,
  y = creatinine_serum$jaffe
)
cat("Bias:", round(ba$results$bias, 3), "mg/dL\n")
#> Bias: 0.174 mg/dL
cat("95% CI:", round(ba$results$bias_ci["lower"], 3), "to",
    round(ba$results$bias_ci["upper"], 3), "\n")
#> 95% CI: 0.127 to 0.22

What this tells you:

- The direction and average size of the systematic offset: on average, the Jaffe method reads about 0.17 mg/dL higher than the enzymatic method.
- Whether the offset is statistically distinguishable from zero: here the 95% CI excludes zero.

What this does NOT tell you:

- How much an individual pair of measurements may disagree (that is what the limits of agreement describe).
- Whether the offset is constant across the measurement range (regression analysis addresses that).
- Whether the offset is practically important for your application.

What are Limits of Agreement?

The limits of agreement (LoA) define an interval expected to contain 95% of the differences between methods. They answer: “For a randomly selected sample, how much could the two methods disagree?”

cat("Lower LoA:", round(ba$results$loa_lower, 3), "\n")
#> Lower LoA: -0.236
cat("Upper LoA:", round(ba$results$loa_upper, 3), "\n")
#> Upper LoA: 0.584
cat("Width:", round(ba$results$loa_upper - ba$results$loa_lower, 3), "\n")
#> Width: 0.82
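
The classical limits are computed as bias ± 1.96 × SD of the differences (assuming the package follows the conventional Bland-Altman formula). A quick hand check reproduces the values above:

# Hand check of the conventional formula: LoA = bias ± 1.96 * SD of differences
round(ba$results$bias - 1.96 * ba$results$sd_diff, 3)  # lower LoA, ~ -0.236
round(ba$results$bias + 1.96 * ba$results$sd_diff, 3)  # upper LoA, ~  0.584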

The LoA represent the range of disagreement you can expect in practice. Narrow limits indicate consistent agreement; wide limits mean that individual differences can vary substantially.

Key insight: The LoA are often more informative than the bias alone. Two methods might have negligible average bias but wide limits of agreement, meaning individual measurements could differ substantially.
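
To see why, here is a minimal simulated sketch (hypothetical data, not from the package): two methods with essentially no average bias but large random disagreement.

# Hypothetical illustration: negligible bias, yet wide limits of agreement
set.seed(42)
truth <- runif(50, min = 0.5, max = 6)
method_a <- truth + rnorm(50, sd = 0.05)
method_b <- truth + rnorm(50, sd = 0.50)  # much noisier method
d <- method_b - method_a
mean(d)                           # near zero: almost no average bias
mean(d) + c(-1.96, 1.96) * sd(d)  # yet the limits of agreement are wide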

Visualizing Agreement

The Bland-Altman plot provides a visual assessment:

plot(ba)
Bland-Altman plot showing differences vs. averages.

What to look for:

- Random scatter of the differences around the bias line, with no trend across the range of averages.
- A funnel shape (spread increasing with concentration), which suggests proportional error.
- Points far outside the limits of agreement, which may indicate outliers.

Checking Assumptions

Bland-Altman analysis assumes normally distributed differences. The summary provides a Shapiro-Wilk test:

summ <- summary(ba)
if (!is.null(summ$normality_test)) {
  cat("Shapiro-Wilk p-value:", round(summ$normality_test$p.value, 4), "\n")
}
#> Shapiro-Wilk p-value: 0

A low p-value suggests non-normality. Consider:

- Inspecting the distribution of differences visually, as in the histogram below.
- A log transformation if the differences grow with the measured concentration.
- Percentile-based (nonparametric) limits of agreement; see the sketch after the histogram.

ggplot(data.frame(diff = ba$results$differences), aes(x = diff)) +
  geom_histogram(aes(y = after_stat(density)), bins = 15,
                 fill = "steelblue", alpha = 0.7) +
  geom_density(linewidth = 1) +
  labs(x = "Difference (Jaffe - Enzymatic)", y = "Density") +
  theme_minimal()
Distribution of differences.
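
If the differences remain clearly non-normal, percentile-based limits are one alternative. This is a sketch using the stored differences, not a built-in valytics feature:

# Nonparametric limits: empirical 2.5th and 97.5th percentiles of the
# differences (illustration only; not a valytics function)
quantile(ba$results$differences, probs = c(0.025, 0.975))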

Statistical Concepts in Passing-Bablok Regression

Slope and Intercept

Passing-Bablok regression fits a line: Y = intercept + slope * X

The parameters have direct interpretations:

- The intercept estimates constant (additive) bias: an offset present at all concentrations.
- The slope estimates proportional (multiplicative) bias: a difference that scales with concentration.

A slope of 1 and an intercept of 0 together indicate that the methods are, on average, interchangeable.

pb <- pb_regression(
  x = creatinine_serum$enzymatic,
  y = creatinine_serum$jaffe
)
cat("Slope:", round(pb$results$slope, 4), "\n")
#> Slope: 0.9711
cat("  95% CI:", round(pb$results$slope_ci["lower"], 4), "to",
    round(pb$results$slope_ci["upper"], 4), "\n")
#>   95% CI: 0.9661 to 0.9741
cat("Intercept:", round(pb$results$intercept, 4), "\n")
#> Intercept: 0.2339
cat("  95% CI:", round(pb$results$intercept_ci["lower"], 4), "to",
    round(pb$results$intercept_ci["upper"], 4), "\n")
#>   95% CI: 0.2288 to 0.2387

How to read the confidence intervals:

- If the slope CI contains 1, no proportional bias is detected. Here the CI (0.9661 to 0.9741) excludes 1, indicating a small proportional bias.
- If the intercept CI contains 0, no constant bias is detected. Here the CI (0.2288 to 0.2387) excludes 0, indicating a constant positive offset.
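
These checks can also be expressed programmatically; a small sketch using the CI vectors shown above:

# Does the slope CI contain 1? (no detectable proportional bias if TRUE)
slope_ci <- pb$results$slope_ci
unname(slope_ci["lower"] <= 1 & slope_ci["upper"] >= 1)  # FALSE here

# Does the intercept CI contain 0? (no detectable constant bias if TRUE)
int_ci <- pb$results$intercept_ci
unname(int_ci["lower"] <= 0 & int_ci["upper"] >= 0)      # FALSE here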

Translating to Practical Differences

You can use the regression equation to estimate expected differences at specific concentrations:

# At various concentrations, what's the expected difference?
concentrations <- c(0.8, 1.3, 3.0, 6.0)

for (conc in concentrations) {
  expected_y <- pb$results$intercept + pb$results$slope * conc
  difference <- expected_y - conc
  cat(sprintf("At X = %.1f: expected Y = %.3f, difference = %.3f\n",
              conc, expected_y, difference))
}
#> At X = 0.8: expected Y = 1.011, difference = 0.211
#> At X = 1.3: expected Y = 1.496, difference = 0.196
#> At X = 3.0: expected Y = 3.147, difference = 0.147
#> At X = 6.0: expected Y = 6.060, difference = 0.060

This helps translate abstract regression parameters into concrete, application-specific terms.
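
You can also invert the equation, for example to find the concentration at which the expected difference falls to a given allowable error. The 0.1 mg/dL threshold below is purely illustrative, not a recommendation:

# Solve intercept + (slope - 1) * X = allowable for X
allowable <- 0.1  # hypothetical allowable difference in mg/dL
(allowable - pb$results$intercept) / (pb$results$slope - 1)  # ~4.6 mg/dL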

Linearity Assessment

The CUSUM test evaluates whether a linear model is appropriate:

cat("CUSUM statistic:", round(pb$cusum$statistic, 4), "\n")
#> CUSUM statistic: 0.97
cat("p-value:", round(pb$cusum$p_value, 4), "\n")
#> p-value: 0.3036

A significant result (conventionally p < 0.05) suggests the relationship may not be linear across the measurement range. If non-linearity is detected:

- Examine the CUSUM plot below to locate where the deviations occur.
- Consider restricting the comparison to the sub-range where the relationship is linear.
- Investigate causes such as interferences or calibration issues at the extremes of the range.

plot(pb, type = "cusum")
CUSUM plot for linearity assessment.

Common Analysis Considerations

Correlation is Not Agreement

High correlation between methods is often reported but can be misleading:

r <- cor(creatinine_serum$enzymatic, creatinine_serum$jaffe)
cat("Correlation coefficient:", round(r, 4), "\n")
#> Correlation coefficient: 0.9952

Correlation measures whether methods rank samples similarly, not whether they give the same values. Two methods with r = 1 but different calibrations would show systematic bias that correlation fails to detect.
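
A minimal simulated sketch of this point (hypothetical data):

# Perfectly correlated methods that never agree: y is a recalibrated x
set.seed(1)
x <- runif(40, min = 0.5, max = 6)
y <- 0.8 * x + 0.5   # different calibration, no added noise
cor(x, y)            # 1: correlation is perfect
mean(y - x)          # nonzero: the methods disagree systematically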

Sample Characteristics Matter

Your results depend on:

- The concentration range spanned by your samples.
- The population and sample matrix studied.
- The measurement conditions (operators, instruments, reagent lots, time period).

Be cautious about extrapolating beyond the conditions of your study.

Statistical vs. Practical Significance

A statistically significant bias (CI excludes zero) may or may not be practically important. Consider:

# Example: Is a bias of X clinically meaningful?
# This depends entirely on YOUR application
bias_value <- ba$results$bias

cat("Observed bias:", round(bias_value, 3), "mg/dL\n")
#> Observed bias: 0.174 mg/dL
cat("\nWhether this is 'acceptable' depends on:\n")
#> 
#> Whether this is 'acceptable' depends on:
cat("- Your specific clinical decision thresholds\n")
#> - Your specific clinical decision thresholds
cat("- Regulatory requirements for your application\n")
#> - Regulatory requirements for your application
cat("- Intended use of the measurement\n")
#> - Intended use of the measurement
cat("- Established performance goals (CLIA, biological variation, etc.)\n")
#> - Established performance goals (CLIA, biological variation, etc.)

Creating Analysis Reports

Here’s how to extract key statistics for reporting:

# Bland-Altman summary
cat("=== Bland-Altman Analysis ===\n")
#> === Bland-Altman Analysis ===
cat(sprintf("n = %d\n", ba$input$n))
#> n = 80
cat(sprintf("Bias: %.3f (95%% CI: %.3f to %.3f)\n",
            ba$results$bias,
            ba$results$bias_ci["lower"],
            ba$results$bias_ci["upper"]))
#> Bias: 0.174 (95% CI: 0.127 to 0.220)
cat(sprintf("SD of differences: %.3f\n", ba$results$sd_diff))
#> SD of differences: 0.209
cat(sprintf("LoA: %.3f to %.3f\n\n",
            ba$results$loa_lower,
            ba$results$loa_upper))
#> LoA: -0.236 to 0.584

# Passing-Bablok summary
cat("=== Passing-Bablok Regression ===\n")
#> === Passing-Bablok Regression ===
cat(sprintf("Slope: %.4f (95%% CI: %.4f to %.4f)\n",
            pb$results$slope,
            pb$results$slope_ci["lower"],
            pb$results$slope_ci["upper"]))
#> Slope: 0.9711 (95% CI: 0.9661 to 0.9741)
cat(sprintf("Intercept: %.4f (95%% CI: %.4f to %.4f)\n",
            pb$results$intercept,
            pb$results$intercept_ci["lower"],
            pb$results$intercept_ci["upper"]))
#> Intercept: 0.2339 (95% CI: 0.2288 to 0.2387)
cat(sprintf("CUSUM p-value: %.4f\n", pb$cusum$p_value))
#> CUSUM p-value: 0.3036

Choosing the Right Method

The valytics package provides three complementary approaches for method comparison. Each has strengths suited to different scenarios.

Method Comparison Table

Comparison of method comparison approaches:

| Aspect               | Bland-Altman                  | Passing-Bablok                  | Deming                     |
|----------------------|-------------------------------|---------------------------------|----------------------------|
| Primary question     | How well do methods agree?    | Is there systematic bias?       | Is there systematic bias?  |
| Statistical approach | Descriptive statistics        | Non-parametric regression       | Parametric regression      |
| Error assumption     | Differences ~ Normal          | Distribution-free               | Errors ~ Normal            |
| Outlier handling     | Sensitive                     | Robust                          | Sensitive                  |
| Output focus         | Bias, limits of agreement     | Slope, intercept CIs            | Slope, intercept, SEs      |
| Sample size          | n >= 30 recommended           | n >= 30 for stable CIs          | n >= 10 feasible           |
| Best when            | Defining acceptable agreement | Outliers present, unknown error | Known error ratio, small n |

Decision Flowchart

  1. Do you need to define acceptable limits of agreement?
    • Yes → Use Bland-Altman analysis
    • No → Continue to step 2
  2. Are there potential outliers in your data?
    • Yes → Use Passing-Bablok regression
    • No → Continue to step 3
  3. Do you know the error ratio between methods?
    • Yes → Use Deming regression with specified λ (see the hedged sketch after this list)
    • No → Use Deming regression with λ = 1 (orthogonal) or Passing-Bablok
  4. Is your sample size small (n < 30)?
    • Yes → Deming regression may provide more stable estimates
    • No → Either regression method is appropriate
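
A sketch of step 3, assuming deming_regression() accepts the error-variance ratio via a lambda argument; the actual argument name and the variance-ratio convention are assumptions here and should be checked in the package documentation:

# Hypothetical call: the `lambda` argument and its convention are
# assumptions, not confirmed valytics API; check ?deming_regression
dm_known <- deming_regression(
  x = creatinine_serum$enzymatic,
  y = creatinine_serum$jaffe,
  lambda = 2  # illustrative error-variance ratio
)
dm_known$results$slope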

Using Multiple Methods

In practice, using multiple methods provides a more complete picture:

# Complete method comparison workflow
ba <- ba_analysis(reference ~ test, data = mydata)
pb <- pb_regression(reference ~ test, data = mydata)
dm <- deming_regression(reference ~ test, data = mydata)

# Bland-Altman for agreement assessment
summary(ba)
plot(ba)

# Compare regression methods
cat("Passing-Bablok slope:", pb$results$slope, "\n")
cat("Deming slope:", dm$results$slope, "\n")

If Passing-Bablok and Deming give similar results, you can be more confident in the conclusions. If they differ substantially, investigate why (outliers? non-normality? heteroscedasticity?).

Summary

The valytics package provides statistical tools for method comparison. It calculates:

- Bias and limits of agreement, with confidence intervals (Bland-Altman analysis)
- Slope and intercept with confidence intervals (Passing-Bablok and Deming regression)
- A CUSUM test of linearity (Passing-Bablok)

These statistics describe the relationship between methods. Whether that relationship is “acceptable” for your purpose is a separate question that depends on:

- Your clinical decision thresholds
- Regulatory requirements for your application
- The intended use of the measurement
- Established performance goals (CLIA, biological variation, etc.)

The package reports what the data show. You decide what it means for your application.

References

Bland JM, Altman DG. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet. 1986;1(8476):307-310.

Bland JM, Altman DG. Measuring agreement in method comparison studies. Statistical Methods in Medical Research. 1999;8(2):135-160.

Passing H, Bablok W. A new biometrical procedure for testing the equality of measurements from two different analytical methods. Journal of Clinical Chemistry and Clinical Biochemistry. 1983;21(11):709-720.

Westgard JO, Hunt MR. Use and interpretation of common statistical tests in method-comparison studies. Clinical Chemistry. 1973;19(1):49-57.
