---
title: "Choosing and interpreting multiple comparisons"
format: html
vignette: >
  %\VignetteIndexEntry{Choosing and interpreting multiple comparisons}
  %\VignetteEngine{quarto::html}
  %\VignetteEncoding{UTF-8}
engine: knitr
bibliography: references.bib
nocite: |
  @hothorn2008, @miller1981
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = NA,
  fig.width = 7,
  fig.height = 4
)
```

```{r setup}
library(biometryassist)
```

## Introduction

After fitting a model, a common next step is to compare predicted treatment
means. `biometryassist` provides three complementary functions for this, and
which one you want depends on *what you are comparing*:

- **`multiple_comparisons()`** is **means-centric**. It returns one row per
  treatment level (mean, standard error, confidence interval) and summarises the
  *complete* set of all pairwise comparisons with a compact letter display
  (treatments that share a letter show no evidence of a difference).
- **`pairwise_comparisons()`** is **difference-centric**. It returns a tidy
  table with one row per requested comparison, and is the honest representation
  when you care about only *some* of the comparisons, or about general contrasts.
- **`reference_comparisons()`** is **means-centric**, for the specific question
  "how does each treatment compare to a single control?". It reports each
  level's mean, the reference mean and the adjusted difference, using an exact
  Dunnett test by default.

"Means-centric" and "difference-centric" describe how each function **presents**
its results, not what it tests. All three rest on the same underlying pairwise
differences between predicted means (more generally, linear contrasts):
`multiple_comparisons()` tests *every* pairwise difference and condenses them
into letters on the means (it does not test the means themselves), whereas
`pairwise_comparisons()` and `reference_comparisons()` report a chosen subset of
those differences directly.

The rest of this vignette tours the three functions on shared example data, then
covers the cross-cutting topics: choosing a multiplicity adjustment, the shared
`adjust` and `by` arguments, confidence intervals, and a summary guide to picking
the right function. All examples use built-in data and `aov()` so they run
without any specialised modelling packages, but all three functions work across
the supported model types (`aov`, `lm`, `aovlist`, `asreml`, `lmerMod` and
`lme`).

## The three functions in action

We start with the `iris` data and a one-way model for petal width, and reuse it
across the first three subsections so the same analysis can be seen from each
function's perspective.

```{r}
model <- aov(Petal.Width ~ Species, data = iris)
```

### `multiple_comparisons()`: all pairs, means and letters

`multiple_comparisons()` compares *every* pair of means and summarises the result
as a letter display on the means, using Tukey's HSD by default:

```{r}
multiple_comparisons(model, classify = "Species")
```

Each row is a treatment mean; treatments that share a letter show no evidence of
a difference. `autoplot()` draws the means with their intervals and letters:

```{r, fig.height = 3}
autoplot(multiple_comparisons(model, classify = "Species"), label_height = 0.95)
```

The letter display is a compact summary of *all* pairwise comparisons at once.
Because every pair has been compared, "share a letter" means "no evidence of a
difference" — but that interpretation is only valid for the **complete** set (see
[The same model, several views](#the-same-model-several-views)).

### `pairwise_comparisons()`: a chosen set of differences

`pairwise_comparisons()` reports the comparisons themselves, one per row. By
default it tests **all** pairwise differences and adjusts the p-values with
Holm's method:

```{r}
pairwise_comparisons(model, classify = "Species")
```

Each row is a difference: `estimate` is `mean(level1) - mean(level2)`, with
`level1`/`level2` making the direction explicit and the `level1.mean` /
`level2.mean` columns reporting the two predicted means it is computed from. If
you care about only a specific set of comparisons, pass them via `pairs` — the
multiplicity adjustment is then applied over just that set:

```{r}
pairwise_comparisons(
  model,
  classify = "Species",
  pairs = c("setosa-versicolor", "versicolor-virginica")
)
```

`autoplot()` draws a forest plot of the differences, with a dashed line at zero
and an asterisk marking comparisons that are significant at the (adjusted) level:

```{r, fig.height = 3}
autoplot(pairwise_comparisons(model, classify = "Species"))
```

#### General contrasts

Sometimes the comparison of interest isn't a single pair. The `contrasts`
argument tests any linear contrast — for example, *setosa* versus the average of
the other two species:

```{r}
pairwise_comparisons(
  model,
  classify = "Species",
  contrasts = list(
    "setosa vs rest" = c(setosa = 1, versicolor = -0.5, virginica = -0.5)
  )
)
```

Each contrast is a named vector of coefficients (which should sum to zero) and
the list name becomes the row label. A two-level contrast such as
`c(A = 1, B = -1)` is exactly the pairwise difference `"A-B"`.

### `reference_comparisons()`: each level against a control

When the question is specifically "how does each treatment compare to a single
control?", `reference_comparisons()` is the natural tool. It is means-centric like
`multiple_comparisons()`, but only the comparisons against the chosen reference
are made — and because every comparison shares that reference, significance
attaches cleanly to each treatment (which a letter display cannot do for an
incomplete set; see [The same model, several
views](#the-same-model-several-views)).

Switching to the `chickwts` data (chick weights by feed type), we compare every
feed to the `casein` control:

```{r}
model_cw <- aov(weight ~ feed, data = chickwts)
reference_comparisons(model_cw, classify = "feed", reference = "casein")
```

Each row gives the mean of a feed (`level1.mean`), the mean of the control
(`level2.mean`) and their difference, with the adjustment defaulting to the exact
Dunnett test. The `autoplot()` method draws each feed's mean, a dashed line at
the control mean (marked with a diamond), and an interval for the difference from
the control, so the interval clears the control line exactly when the comparison
is significant; significant feeds are also flagged with an asterisk:

```{r, fig.height = 3.5}
autoplot(reference_comparisons(model_cw, classify = "feed", reference = "casein"))
```

### The same model, several views

The three functions are different *views* of the same fitted model, not different
tests. On the iris model, `multiple_comparisons()` and an all-pairs
`pairwise_comparisons()` rest on identical pairwise differences — one presents
them as means-with-letters, the other as a table of differences. Both are
correct; they answer different questions:

- the **letter display** (`multiple_comparisons()`) is a compact summary of the
  *complete* set of pairwise comparisons, valid only because every pair is
  compared;
- the **difference table** (`pairwise_comparisons()`) is the honest
  representation when the set is incomplete or irregular — forcing letters onto
  such a set would silently assert "not different" for pairs that were never
  tested;
- the **reference table** (`reference_comparisons()`) is the right view when one
  level is a control, where significance attaches to each treatment directly.

So the choice of function follows from the *question*, not from the model.

## Controlling error across comparisons

Testing several comparisons inflates the chance of at least one false positive,
so p-values are adjusted. There are two broad families of error rate, and the
right choice depends on your goal.

**Family-wise error rate (FWER)** is the probability of making *any* false
rejection across the whole family. It is the conservative, "every claim must be
trustworthy" criterion, appropriate for a small set of planned comparisons.

- **Bonferroni** is the simplest FWER method, but quite conservative: Holm
  controls the same error rate and is uniformly more powerful, so there is
  essentially never a reason to prefer plain Bonferroni over Holm.
- **Holm** [@holm1979] is a step-down procedure that controls the FWER under
  *any* dependence structure. It is the default in `pairwise_comparisons()`.
- **Tukey's HSD** [@tukey1953; @hsu1996] uses the studentized range
  distribution to give *exact* simultaneous FWER control for the complete set of
  **all** pairwise comparisons (exact for equal replication; the Tukey–Kramer
  extension covers unequal replication). Because it exploits the exact joint
  distribution of all pairwise differences, it is more powerful than Bonferroni
  or Holm *for that complete set* — which is why it is the default in
  `multiple_comparisons()`.
- **Dunnett's test** [@dunnett1955] is the exact FWER procedure for the
  specific family of comparing every treatment against a *single control*. Like
  Tukey for all pairs, it exploits the known joint distribution of those
  comparisons (the correlation induced by the shared control), so it is more
  powerful than a generic adjustment for that family. It is the default in
  `reference_comparisons()`.

**False discovery rate (FDR)** is the expected *proportion* of false rejections
among the rejected comparisons. It is less stringent than FWER and earns its keep
when comparisons are many and somewhat exploratory, such as in plant breeding.

- **Benjamini–Hochberg (`"BH"`/`"fdr"`)** [@benjamini1995] controls the FDR
  under independence and positive dependence (which pairwise comparisons
  satisfy); **`"BY"`** is valid under arbitrary dependence but more conservative.

### Why Tukey is reserved for the complete (all-pairs) set

A natural question is why `pairwise_comparisons()` and `reference_comparisons()`
do not offer `adjust = "tukey"`. Tukey's critical value is calibrated to the
largest difference among *all* the means; applied to a chosen subset it still
controls the FWER, but it is sized for comparisons you are not making, so it is
overly conservative and conceptually mismatched [@hsu1996; @bretz2010]. For a
selected set, a method calibrated to *that* set (Holm, Bonferroni, BH) is
preferred; for the all-vs-control set, Dunnett is the calibrated choice.
Conversely, for the complete set Tukey is the most powerful of these because it
uses the known structure of the pairwise differences. So Tukey belongs with the
complete-set, means-and-letters analysis (`multiple_comparisons()`), and the
other two functions reject it with an informative error pointing you there.

Because all three functions share the same predicted means and standard errors of
differences, **testing all pairs in `pairwise_comparisons()` with a given
`adjust` reproduces the adjusted p-values of `multiple_comparisons()` with the
same `adjust`** — only the presentation differs (a table of differences versus
means and letters). The one exception is Tukey, for the reasons above.

## The shared `adjust` and `by` arguments

All three functions take `adjust` and `by`, with the same meaning throughout.

### `adjust`: choosing the multiplicity method

`adjust` selects the p-value adjustment from the previous section. Each function
has a sensible default (Tukey for `multiple_comparisons()`, Holm for
`pairwise_comparisons()`, Dunnett for `reference_comparisons()`), but any
`stats::p.adjust()` method can be requested instead — for example Bonferroni:

```{r}
mc <- multiple_comparisons(model, classify = "Species", adjust = "bonferroni")
mc$comparison_method
```

The only restriction is that the calibrated methods are tied to their family:
`"tukey"` is accepted only by `multiple_comparisons()` and `"dunnett"` only by
`reference_comparisons()` (see the previous section).

### `by`: comparisons within subgroups

`by` runs the comparisons **independently within subgroups** — useful for a
factorial where you want to compare one factor *within* each level of another,
rather than across the whole experiment. Each subgroup is a separate family, with
no pooling or adjustment across groups. Using the `warpbreaks` data (number of
breaks by `wool` type and `tension`):

```{r}
wb <- aov(breaks ~ wool * tension, data = warpbreaks)
```

Comparing `tension` levels within each `wool` type, the comparison labels refer
to the remaining (non-`by`) factor:

```{r}
pairwise_comparisons(wb, classify = "wool:tension", by = "wool")
```

`by` behaves the same way in the other two functions. In `multiple_comparisons()`
the letter display restarts within each subgroup and `autoplot()` facets the
means and letters by the `by` variable:

```{r, fig.height = 3.5}
autoplot(multiple_comparisons(wb, classify = "wool:tension", by = "wool"))
```

## A note on confidence intervals

The intervals shown by `pairwise_comparisons()` and `reference_comparisons()`
(and their plots) are *per-comparison* intervals — except for Dunnett, they are
**not** adjusted for simultaneity. Significance, on the other hand, is judged by
the multiplicity-adjusted p-value. These two can disagree: an interval may
exclude zero while the adjusted p-value is no longer below `sig`, and both
functions emit a one-time note when this happens in a result. (Dunnett is the
exception: its intervals *are* the simultaneous intervals, so for
`reference_comparisons()` with its default they agree with the test by
construction.) This is the same tension that the `int.type` options address in
`multiple_comparisons()`. So in a forest plot, read significance from the
asterisk (which tracks the adjusted test), not from whether the drawn interval
happens to cross zero.

## Choosing the right function: a summary

Match the function to the question you are asking:

- **All pairwise comparisons, with a compact summary?** Use
  `multiple_comparisons()` (Tukey's HSD by default, with letter groupings).
- **A specific or planned subset of comparisons, or a general contrast?** Use
  `pairwise_comparisons()` (Holm by default), which represents the chosen set
  honestly as a table of differences.
- **Each treatment against a single control?** Use `reference_comparisons()`
  (exact Dunnett by default), which keeps the analysis means-centric.
- **Many, exploratory comparisons where a controlled proportion of false
  positives is acceptable?** Use an FDR method (`adjust = "BH"`).
- **Comparisons within subgroups of a factorial?** Use `by` in any of the three.

The same advice as a table:

| Question / comparison type | Function | Default adjustment | Not valid |
|---|---|---|---|
| All pairwise comparisons, summarised with letters | `multiple_comparisons()` | Tukey's HSD | `dunnett` |
| A chosen subset of pairs, or general contrasts | `pairwise_comparisons()` | Holm | `tukey`, `dunnett` |
| Every treatment vs a single control | `reference_comparisons()` | Dunnett | `tukey` |

Any other `stats::p.adjust()` method (Holm, Bonferroni, BH, BY, …) can be
requested in all three via `adjust`; only the calibrated methods are restricted
to their family, as shown in the final column.

## References

::: {#refs}
:::
```