Introduction to Archetypal Analysis with yaap

What is Archetypal Analysis?

Archetypal analysis (AA) is an unsupervised learning method that describes every observation as a convex mixture¹ of a small number of extreme, idealized profiles called archetypes (Cutler and Breiman 1994). Unlike PCA, whose components are abstract mathematical directions, or k-means, whose clusters are represented by centroids, archetypes are located at the boundary of the data representing the distinct extremes. AA combines cluster-style interpretability with matrix-factorization flexibility. In practice, archetypes often correspond to “pure” or “extreme” profiles. This representation is useful when observations are naturally viewed as mixtures of biologically, chemically, or behaviorally distinct states.

Formally, given a data matrix $X \in \mathbb{R}^{N \times M}$ (N samples, M features), AA finds:

Archetypes $A \in \mathbb{R}^{K \times M}$ : K extreme profiles, each of which is itself a weighted average of actual data points ($A = BX$, with $B \geq 0$, $B\mathbf{1} = \mathbf{1}$),
Compositions $S \in \mathbb{R}^{N \times K}$ : for each observation, a non-negative weight vector summing to 1 ($S \geq 0$, $S\mathbf{1} = \mathbf{1}$),

such that the reconstruction $\hat{X} = SA$ is as close as possible to $X$ in the standard squared Frobenius norm²:

\[\min_{B,S} \|X - SBX\|_F^2 \quad \text{s.t.} \quad A = BX,\; B,S \geq 0,\; B\mathbf{1} = S\mathbf{1} = \mathbf{1}.\]

The constraint $A = BX$ keeps archetypes inside (or on the boundary of) the convex hull of the data, making them directly interpretable as extreme prototypes. The composition vectors $S$ correspond to points in the archetype simplex: a composition of $(1,0,0)$ means a sample is entirely characterized by archetype 1, while $(1/3,1/3,1/3)$ means it sits equally between all three.

Relationship to other methods

Method	Main Focus	Sample Representation	Component Type	Interpretability	Flexibility
PCA / SVD	Directions of maximal variance	Linear combination	Abstract latent directions	Often moderate to low	High
NMF	Additive parts-based representation	Non-negative combination	Non-negative latent parts	Moderate to high	Moderate to high
k-means	Central cluster prototypes	Hard assignment to one cluster	Cluster centroids	High	Low to moderate
k-medoids	Representative observed prototypes	Hard assignment to one observed prototype	Observed data points	High	Low
Archetypal Analysis	Extreme profiles / corners of the data	Convex combination of archetypes	Observed-data-based extremes	High	Moderate to high

The yaap package

yaap (Yet Another Archetypes Package) provides fast, flexible AA for R. This vignette is organized into four parts:

Core workflow: fit models with run_aa(), inspect archetypes objects, visualize archetypes and compositions, and predict for new data.
Practical choices: choose K, scale features, use multiple restarts, and check convergence.
Alternative formulations: use the delta relaxation, robust fitting, and missing-data fitting.
Algorithm details: compare NNLS and PGD, and review the main tuning arguments.

library(yaap)
library(generics)
#> 
#> Attaching package: 'generics'
#> The following objects are masked from 'package:base':
#> 
#>     as.difftime, as.factor, as.ordered, intersect, is.element, setdiff,
#>     setequal, union

A first example: the toy dataset

The package ships a small two-dimensional toy dataset (originally from the archetypes) that contains 250 points scattered around three vertices of a triangle.

toy <- read.csv(system.file("extdata", "toy.csv", package = "yaap"))
head(toy, 3)
#>           x         y
#> 1  8.749423  9.922297
#> 2  4.965050  7.163594
#> 3 10.092057 13.750449
dim(toy)
#> [1] 250   2

plot(toy, asp = 1., frame.plot = FALSE, axes = FALSE,
     main = "Toy Dataset", xlab = "", ylab = "",
     pch = 19, cex = 0.6, col = "gray50")

The toy dataset: three clusters near the corners of a triangle.

We fit an AA model with $K = 3$ archetypes using run_aa().

fit <- run_aa(toy, K = 3, nrep = 3)
fit
#> 
#> Call:
#> run_aa(x = toy, K = 3, nrep = 3)
#> 
#> PGD Archetypal Analysis:
#> Number of Archetypes: 3 
#> Converged after 78 iterations.
#> Final Loss Metrics:
#>      loss        r2
#>  119.2253 0.9984708

The output reports the final $R^2$ and convergence status.

The `archetypes` result object

run_aa() returns an S3 object of class archetypes. It is a list with the following main components:

Component	Dimensions	Description
`coordinates`	K x M	Archetype profiles in feature space
`coefficients`	K x N	Weights expressing each archetype as a mix of samples
`compositions`	N x K	Barycentric coordinates of each sample
`loss`	(iter) x R	Per-iteration RSS, $R^2$, and method-specific QC
`converged`	scalar	Whether the optimizer converged

coordinates(fit)
#>            x         y
#> A1  2.864114 10.005167
#> A2 13.987939  2.209766
#> A3 18.785381 18.639291

The fundamental identity $\hat{X} = SA$ is:

X_hat <- compositions(fit) %*% coordinates(fit)

Let us verify this recovers the data well:

X <- as.matrix(toy)
X_hat <- fitted(fit)
R <- resid(fit)  # residuals: X - X_hat
(rss <- norm(R, "F")^2)  # residual sum of squares
#> [1] 119.2067
1 - rss / norm(X, "F")^2  # R^2
#> [1] 0.998471

plot(X, type = "n", asp = 1., frame.plot = FALSE, axes = FALSE,
     main = "Toy Dataset Reconstruction", xlab = "", ylab = "")
ix <- rowSums(R > 0.5) > 0  # highlight large residuals
segments(X[, 1], X[, 2], X_hat[, 1], X_hat[, 2], col = "firebrick", lwd = 0.5)
points(X    , pch = 20, cex = 0.8, col = "gray50")
points(X_hat, pch = 1,  cex = 0.7, col = "firebrick")
legend("topleft", legend = c("Data", "Reconstruction", "Residuals"), bty = "n",
       col = c("gray50", "firebrick", "firebrick"),
       pch = c(20, 1, NA),
       lwd = c(NA, NA, 0.5))

Reconstruction of toy dataset. Based on figure from archetypes package vignette.

In the reconstruction plot, data points (gray disks) inside the archetype convex hull have exact reconstructions (red circles). Points outside the hull are projected to its edges (red segments).

The $loss data frame records the per-iteration optimization trajectory and method-specific diagnostics. These columns help assess convergence and model quality.

The exact columns depend on the method, but typical columns include:

loss: working objective (e.g., RSS),
r2: proportion of variance explained (based on loss)
rloss: running objective for methods that do not always improve the objective
k_S: condition number of the composition matrix (high $\to$ near-degenerate compositions) for methods that involve solving linear systems with it.
k_A: condition number of the coordinate (similar logic to k_S)

Note that loss values may differ from the RSS computed directly above because the working objective operates on internally scaled data. See Algorithm details for a full discussion.

Integration with `broom`

yaap also supports broom-style methods for fitted archetypes objects. These methods return ordinary tibbles for downstream inspection, plotting, and reporting.

tidy() returns model matrices in long form. By default it returns the archetype coordinates: one row per archetype-feature pair, with columns archetype, term, and value.

tidy(fit)
#> # A tibble: 6 × 3
#>   archetype term  value
#>   <chr>     <chr> <dbl>
#> 1 A1        x      2.86
#> 2 A2        x     14.0 
#> 3 A3        x     18.8 
#> 4 A1        y     10.0 
#> 5 A2        y      2.21
#> 6 A3        y     18.6

The same verb can return other fitted matrices. For example, matrix = "compositions" returns one row per sample-archetype pair; value is the fitted weight of that archetype in that sample. matrix = "coefficients" similarly returns the weights that express each archetype as a mixture of training samples.

fit |> tidy(matrix = "compositions") |> head()
#> # A tibble: 6 × 3
#>   sample archetype    value
#>   <chr>  <chr>        <dbl>
#> 1 S1     A1        0.559   
#> 2 S2     A1        0.753   
#> 3 S3     A1        0.542   
#> 4 S4     A1        0.000325
#> 5 S5     A1        0.316   
#> 6 S6     A1        0.637

glance() returns one row of model-level diagnostics, including the number of archetypes, convergence status, final loss, final $R^2$, number of iterations, adapted AIC, and model family.

glance(fit)
#> # A tibble: 1 × 7
#>       K converged  loss    r2 n_iter   aic family  
#>   <int> <lgl>     <dbl> <dbl>  <int> <dbl> <chr>   
#> 1     3 TRUE       119. 0.998     78  4.12 gaussian

augment() returns the original data with one dot-prefixed composition column per archetype. The .A1, .A2, and .A3 columns are the fitted barycentric weights for each sample, so they sum to one row by row.

aug <- augment(fit, data = toy)
head(aug)
#> # A tibble: 6 × 5
#>       x     y      .A1           .A2          .A3
#>   <dbl> <dbl>    <dbl>         <dbl>        <dbl>
#> 1  8.75  9.92 0.559    0.237         0.204       
#> 2  4.97  7.16 0.753    0.247         0.0000000100
#> 3 10.1  13.8  0.542    0.0126        0.445       
#> 4 17.1  12.8  0.000325 0.357         0.643       
#> 5 13.6  16.2  0.316    0.00000001000 0.684       
#> 6  8.28 11.9  0.637    0.0737        0.289

comp_cols <- paste0(".", anames(fit))
range(rowSums(as.matrix(aug[, comp_cols])))  # all should be 1
#> [1] 1 1

Visualizing the fit

plot() method for archetypes objects offers several what options.

Archetype positions in feature space

This method plots the original data points, overlays the archetypes from the coordinates matrix, and connects them with lines.

plot(fit, what = "coordinates",
     main = "Archetype positions in feature space", xlab = "", ylab = "",
     asp = 1, frame.plot = FALSE, axes = FALSE,
     pch = 17, cex = 1.5, col = "firebrick",
     args.data.scatter = list(pch = 19, cex = 0.5, col = "gray50"))

Data and estimated archetypes. The triangle connects the three archetypes.

As expected, the archetypes lie close to the corners of the point cloud.

Plot functions invisibly return the data used to construct the figure in a ggplot2 friendly format. Use plot = FALSE to prepare those data without drawing. For example, the following code reproduces the previous figure:

library(ggplot2)

plot(fit, what = "coordinates", plot = FALSE) |>
    ggplot(aes(x, y)) +
    geom_point(aes(color = archetype))

Working with “real” data: Fisher’s iris

Besides tabular data (like data.frame or matrix), run_aa() also accepts formula and data frame inputs. Formula input is useful when the data contains a response column to exclude: the response is silently dropped and all right-hand-side predictors are used as features.

# 'Species' (the response) is discarded; the four measurements are used.
fit_iris <- run_aa(Species ~ ., data = iris, K = 3, scale = TRUE, tol = 1e-4)
fit_iris
#> 
#> Call:
#> run_aa(formula = Species ~ ., data = iris, K = 3, scale = TRUE, 
#>     tol = 1e-04)
#> 
#> PGD Archetypal Analysis:
#> Number of Archetypes: 3 
#> Converged after 62 iterations.
#> Final Loss Metrics:
#>     loss        r2
#>  45.8493 0.9230716

This is equivalent to calling run_aa(iris[, 1:4], ...)

Because the four measurements are on different scales, we set scale = TRUE to apply column-wise z-scoring before fitting.

In the following subsections, we analyze the fitted model in a typical workflow:

We first locate the archetypes in feature space using a PCA projection.
We then identify them from the observations that define them and characterize what they represent through a feature-profile plot.
We explore how the 150 samples distribute across the archetype simplex.
We finally show how the fitted model predicts compositions for new observations.

Archetype positions in feature space

Since the iris dataset has four features, the coordinate scatter requires dimensionality reduction. Pass projection = "pca" to project both data and archetypes onto the first two principal components.

pal_sp <- c(setosa = "#E69F00", versicolor = "#56B4E9", virginica = "#009E73")
iris_col <- pal_sp[as.character(iris$Species)]
plot(fit_iris, what = "coordinates", projection = "pca", frame.plot = FALSE,
     main = "PCA projection of iris data and archetypes",
     col = "firebrick", pch = 17, cex = 1.5,
     args.data.scatter = list(col = iris_col, pch = 19, cex = 0.6))

PCA projection of the iris data and its three archetypes (triangles).

Naming archetypes from their defining observations

Each archetype is a convex mixture of actual data points, described by the coefficients matrix (K x N). To name the archetypes, we summarize which species contribute most to each archetype.

B <- coef(fit_iris)  # K x N

# Sum coefficients within each species (rows = species, cols = archetypes)
coef_by_species <- aggregate(t(B), by = iris["Species"], FUN = sum)
knitr::kable(coef_by_species, digits = 3)

Species	A1	A2	A3
setosa	0.771	1	0
versicolor	0.229	0	0
virginica	0.000	0	1

The table indicates that one archetype combines “setosa” and “versicolor”, while the other two are predominantly “setosa” and “virginica”, respectively. The assigned labels reflect those species contributions.

# Concat all species contributing to an archetype
coef_mat <- as.matrix(coef_by_species[, -1])  # drop species column
rownames(coef_mat) <- coef_by_species$Species
species_contrib <- which(coef_mat > 0.1, arr.ind = TRUE)
species_contrib <- aggregate(
    row ~ col,
    data = species_contrib,
    FUN = function(ix) {
        species <- rownames(coef_mat)[ix]
        species <- substr(species, 1, 3)  # abbreviate species names
        paste(species, collapse = ".")
    }
)

anames(fit_iris) <- species_contrib$row
anames(fit_iris)
#> [1] "set.ver" "set"     "vir"

Archetype feature profiles

what = "profiles" shows a grouped barplot of the archetype coordinates on the original feature scale. All other arguments are passed to barplot().

plot(fit_iris, what = "profiles",
     horiz = TRUE,
     names.arg = sub("\\.", "\n", colnames(iris[, 1:4])),
     las = 1,
     col = pal_aa,
     xlab = "Feature value (original scale)", ylab = "",
     main = "Archetype feature profiles")

Archetype feature profiles. Each group of bars is one feature; colors distinguish archetypes.

Sample compositions

Named archetypes make the composition space easier to interpret.

plot(fit_iris, what = "simplex",
     col = pal_sp[as.character(iris$Species)], pch = 19,
     pca = TRUE, col.pca = "black", robust = FALSE)
legend("topright", legend = names(pal_sp), col = pal_sp, pch = 19, bty = "n")

Ternary plot of iris compositions, colored by species. Each point is a sample placed by its three archetype weights.

The ternary structure shows a clear separation of species in composition space: setosa concentrates near one vertex, virginica near another, and versicolor occupies intermediate mixtures along the lower edge (small setosa contribution). The black curved line is the PCA trajectory projected into the simplex, summarizing the dominant continuous transition from setosa-like to virginica-like compositions through the hybrid archetype of setosa+versicolor.

The ternary plot also shows why the hybrid archetype has a stronger setosa contribution even though more versicolor samples lie near it on average: AA searches for extreme observations near the data hull boundary, where one setosa outlier lies. This illustrates a general property of AA: archetypes anchor at the geometric boundary, so even a single extreme observation can pull a vertex toward it. For analyses where outliers are a concern, see Robust fitting.

When samples are arranged by PC1, the composition barplot shows a transition from virginica-like to setosa-like samples, with the hybrid archetype in between.

plot(fit_iris, what = "compositions",
     main = "Archetype compositions across iris samples",
     cluster_rows = "PC1", cluster_cols = FALSE, col = pal_aa
)

Composition barplot: each bar is one iris sample. Single-color bars are near-pure specimens; blended bars sit between archetypes.

Other potential orderings of the samples include:

the angle each sample forms in the PC1-PC2 plane (cluster_rows = "AOP"), or
the order that hclust() arranges them if cluster_rows = TRUE

See plot_archetypes_compositions() for details.

Predicting reconstructions and compositions for new observations

Once a model is fitted, predict() maps new observations to the archetype convex hull without re-fitting. The resulting projection can be represented in two ways:

in the original feature space as the training data (default) with type = "reconstruction"
as barycentric coordinates in the archetype simplex with type = "compositions"

new_obs <- matrix(
    c(5.1, 3.5, 1.4, 0.2,   # setosa-like
      6.7, 3.0, 5.2, 2.3),  # virginica-like
    nrow = 2, byrow = TRUE,
    dimnames = list(c("obs1", "obs2"), colnames(iris[, 1:4]))
)
print(new_obs)
#>      Sepal.Length Sepal.Width Petal.Length Petal.Width
#> obs1          5.1         3.5          1.4         0.2
#> obs2          6.7         3.0          5.2         2.3

predict(fit_iris, newdata = new_obs)
#>      Sepal.Length Sepal.Width Petal.Length Petal.Width
#> obs1     5.082102    3.507834     1.569315   0.3002319
#> obs2     6.789315    2.983888     5.661577   2.0019571

predict(fit_iris, newdata = new_obs, type = "compositions")
#>        set.ver        set          vir
#> obs1 0.3373917 0.66260826 9.999962e-09
#> obs2 0.1074892 0.07415061 8.183602e-01

Practical considerations

Choosing `K`

The main tuning parameter in archetypal analysis is K: the number of archetypes. A traditional exploratory approach is to fit a sequence of models and inspect a scree plot or elbow plot. The elbow is the smallest K after which reconstruction quality metrics improve only modestly. Typical metrics used for this purpose are:

Reconstruction loss (loss): the sum of squared residuals between the original data and the reconstructed data.
Unexplained variance (1 - r2): the proportion of variance in the original data that is not captured by the archetypes.

Alongside elbow plots, yaap also provides an adapted AIC criterion (AIC) based on (Suleman 2017) that balances model fit and complexity, which can be used to select K more systematically. Unlike the other metrics we discussed, the adapted AIC is not monotonic in K, so we are not looking for an elbow but rather for the minimum value, which indicates the best trade-off between fit and parsimony.

Here we fit a “path” of archetypes models for K = 1, ..., 6 to the toy dataset, using the archetypes_path() function, which fits multiple models in a single call and reuses preprocessing.

fit_path <- archetypes_path(
    toy,
    K = 6,
    init = "random",
    max_iter = 100,
    nrep = 5
)

The resulting fit_path object contains the fitted models for all K, which can be accessed with [[ using their index or K value, for example fit_path[[3]] (third model) or fit_path[["K3"]] (model with K = 3). The path object also supports plotting the scree and adapted AIC curves across K values with screeplot().

op <- par(mfrow = c(1, 3), mar = c(4, 4, 2.5, 0.5), bty = "n")

selected_K <- 3
fit_best <- fit_path[[selected_K]] # here equivalent to fit_path[["K3"]]

screeplot(
    fit_path,
    y = "loss",
    col = "steelblue4",
    frame.plot = FALSE,
    ylab = "Reconstruction loss",
    main = "RSS"
)
abline(v = selected_K, lty = 2, col = "firebrick")

screeplot(
    fit_path,
    y = function(fit) 1 - utils::tail(fit$loss$r2, 1L),
    col = "steelblue4",
    frame.plot = FALSE,
    ylab = expression(1 - R^2),
    main = "Unexplained Variance"
)
abline(v = selected_K, lty = 2, col = "firebrick")

screeplot(
    fit_path,
    col = "steelblue4",
    frame.plot = FALSE,
    ylab = "Adapted AIC",
    main = "Adapted AIC"
)
abline(v = selected_K, lty = 2, col = "firebrick")

RSS, unexplained variance, and adapted AIC across candidate values of K.


par(op)

For this dataset, all three diagnostics agree that the optimal number of archetypes is K = 3, the true number of generating vertices. The RSS and unexplained-variance curves show a clear elbow at K = 3. The adapted AIC has a strong minimum at K = 3. In real analyses, these plots are diagnostics rather than automatic rules. AA optimization is non-convex, so single-restart curves can be noisy and may not decrease monotonically for every larger K. Use more restarts and domain knowledge when the elbow or adapted AIC is ambiguous.

Scaling

Situation	`scale` value	Effect
Default/natural feature metric	`FALSE`	Data used as-is
Mixed-unit Gaussian features	`TRUE`	z-score columns
Custom feature importance	Numeric vector	Divides columns

The standard AA assumes Euclidean geometry and Gaussian error structure, so the optimization is carried out by minimizing the sum of squared residuals (RSS). Scaling can therefore materially affect the results.

By default, run_aa() does not transform the input data. The decision of whether to scale is similar to that in PCA: if the features are on different scales, set scale = TRUE to give them equal weight in the analysis. Otherwise, those features may have little influence on the fit. Custom scaling by passing a vector of scale factors is also supported similar to how scale() works in base R (by dividing the columns).

The difference between manually normalizing the data and setting scale = FALSE versus letting run_aa() handle the scaling internally is that the archetypes are returned on the scale of the input data. For interpretability, pass data on the scale you want the resulting archetypes to be in.

Besides scale, run_aa also has an sd_threshold = 1e-6 argument to remove features with near-zero variance, which can cause numerical instability and convergence issues. During reconstruction, the removed features are filled with the mean value of the original data.

When some samples are more reliable or informative than others, pass sample weights through the weights argument.

Multiple restarts

Because the AA objective is non-convex, the solution is sensitive to the initial archetype coordinates. run_aa() supports multiple independent restarts via the nrep argument. Setting nrep > 1 returns only the best solution across all runs, so the output is always a single archetypes object.

Multiple restarts are most useful when the initializer can produce different starting points across runs. The default initialization method is "furthest_sum". Although its starting seed is random, the procedure tends to have low variance because it greedily seeks a good initial guess. In contrast, "random" produces very different starting points, which lets the search explore the space more broadly. Below we run both methods five times on the iris dataset with K = 3 and compare the initial and final loss across runs.

runs <- do.call(rbind, lapply(1:5, function(k) {
    fit_fs  <- run_aa(iris[, 1:4], K = 3, nrep = 1)
    fit_uni <- run_aa(iris[, 1:4], K = 3, nrep = 1, init = "random")
    data.frame(
        run       = k,
        fs_start  = fit_fs$loss$loss[1],
        fs_end    = tail(fit_fs$loss$loss, 1),
        uni_start = fit_uni$loss$loss[1],
        uni_end   = tail(fit_uni$loss$loss, 1)
    )
}))

colnames(runs) <- c("run",
                    "furthest_sum start", "furthest_sum end",
                    "random start", "random end")
knitr::kable(runs, digits = 1)

run	furthest_sum start	furthest_sum end	random start	random end
1	131.1	26.2	132.9	24.9
2	131.1	26.2	164.1	24.9
3	131.1	26.2	384.6	25.0
4	131.1	26.2	72.0	24.9
5	131.1	26.2	893.0	24.9

furthest_sum starts closer to the final solution, while random starts are more variable. However, both methods converge in similarly good solutions, with the random method often even outperforming furthest_sum in final loss.

See the Initialization vignette for a more detailed discussion and comparison of all available initializers.

Convergence

If a run_aa() call does not converge, it emits a warning. The status of the fit is stored in the converged slot of the output object, and is also printed out when the object or its summary are printed.

The full optimization trajectory is available in the loss data frame and can be plotted with plot(fit, what = "loss").

fit_iris$converged
#> [1] TRUE
tail(fit_iris$loss, 1)
#>       loss        r2
#> 63 45.8493 0.9230716
plot(fit_iris, what = "loss", frame.plot = FALSE, type = "b",
     main = "Convergence of the iris model")

RSS across iterations for the iris model.

Typically, loss decreases quickly at first and then plateaus near a local minimum. If the model has not converged, increase max_iter (default 100) or add more restarts. If the loss decreases very slowly, increasing max_iter is unlikely to help; consider increasing tol or tol_r2 instead. This will not improve the fit, but it can reduce runtime and avoid unnecessary non-convergence warnings.

Variants of Canonical AA and their implementation in yaap

This vignette covers only the Euclidean geometries with Gaussian errors setting. For Non-Gaussian and Metric Variants of Archetypal Analysis see the corresponding vignette.

Relaxing the convex hull constraint with `delta`

By default (delta = 0) archetypes are constrained to the convex hull of the data, i.e., they are strictly weighted averages of actual observations.

When the data is a truncated version of a larger pattern, the genuine extremes are absent and the constrained model cannot recover the true vertices and instead anchors to the most extreme observed points.

Setting delta > 0 relaxes this constraint and allows archetypes to step outside the data hull (Mørup and Hansen 2012). The following example makes this concrete with synthetic data drawn from a truncated equilateral simplex: any composition weight above 0.8 is removed, so no sample is close to any true vertex.

K_d <- 3  # number of archetypes (simplex vertices)
M_d <- 2  # number of features (2D for easy visualization)
N_init <- 1000  # initial number of samples before truncation
thresh <- 0.8    # truncation threshold on composition weights

# True archetypes: equilateral triangle
A_true <- matrix(
    c(cos(seq(0, 2*pi*(K_d-1)/K_d, length.out = K_d)),
      sin(seq(0, 2*pi*(K_d-1)/K_d, length.out = K_d))),
    nrow = K_d, ncol = M_d
)

# Dirichlet-like mixing weights; remove observations near any vertex
S_raw   <- -log(matrix(runif(K_d * N_init), nrow = N_init, ncol = K_d))
S_raw   <- S_raw / rowSums(S_raw)
S_trunc <- S_raw[rowSums(S_raw > thresh) == 0, ]
X_delta <- S_trunc %*% A_true
print(nrow(X_delta))  # observations after truncation
#> [1] 876

# Fit models with different delta values
delta_vals <- c(0, 0.2, 0.35)
res_delta  <- lapply(
    delta_vals,
    function(d) archetypes_pgd(X_delta, K = K_d, delta = d)
)

# Plot all models together with the true archetypes
ix   <- c(1:3, 1)   # close the triangle
cols <- c("steelblue", "darkorange", "firebrick")

plot(A_true[ix, 1], A_true[ix, 2],
     type = "l", col = "black", lwd = 2, lty = 3,
     frame.plot = FALSE, xlab = "", ylab = "",
     xaxt = "n", yaxt = "n", asp = 1,
     main = "Effect of delta on recovered archetypes")
points(X_delta, pch = 19, cex = 0.4, col = "gray50")
for (i in seq_along(res_delta)) {
    A <- coordinates(res_delta[[i]])
    lines(A[ix, 1], A[ix, 2], col = cols[i], lwd = 2)
}
legend("topright",
       legend = c("True simplex", paste0("\u03b4 = ", delta_vals)),
       col = c("black", cols), lty = c(3, 1, 1, 1), lwd = 2, bty = "n")

Truncated simplex: Larger delta lets archetypes escape the data hull and approach the true positions.

With delta = 0 the estimated triangle is visibly smaller than the true one and its vertices all lie within the data cloud. Increasing delta allows archetypes to reach outside the cloud toward the true vertices. Use delta with care: very large values allow archetypes to drift far from the data, losing their interpretability as extreme prototypes.

Robust fitting

Standard AA minimizes squared residuals, making it sensitive to outliers. Setting robust = TRUE applies MASS-style "psi.bisquare" row reweighting: observations with large residuals receive reduced influence at each iteration, so outliers cannot monopolize an archetype.

Below we augment the iris dataset with five outliers and compare the results of standard and robust AA. Since the default initialization method (furthest_sum) is also sensitive to outliers, we set init = "random" to minimize their influence.

# Append five outlier observations to iris
N_out <- 5
displacement <- c(2, 2, 0, 0)  # displace in the first and second features

add_outliers <- function(X, N, displacement) {
  x_mean <- colMeans(X) + displacement # place outliers at a distance from the mean
  x_sd <- apply(X, 2, sd) * .75
  X_out <- rnorm(N * ncol(X), x_mean, x_sd)  # recycle x_mean, x_sd
  rbind(X, matrix(pmax(X_out, .1), nrow = N, byrow = TRUE))
}

X <- iris[, 1:4] |>
    as.matrix() |>
    add_outliers(N = N_out, displacement = displacement)

set.seed(123)  # fix starting positions
fit_std <- run_aa(X, K = 3, robust = FALSE, tol = 1e-4, init = "random", nrep = 5)
set.seed(123)  # same starting positions
fit_rob <- run_aa(X, K = 3, robust = TRUE, tol = 1e-4, init = "random", nrep = 5)

In standard AA, one archetype is pulled toward the outliers. In robust AA, the outliers receive less influence and the archetypes remain near the original data structure.

# Project all archetypes to PC space of data
pca <- prcomp(X, scale. = TRUE)
A_std <- predict(pca, coordinates(fit_std))
A_rob <- predict(pca, coordinates(fit_rob))
A_ori <- predict(pca, coordinates(fit_iris))   # reference: fitted on clean iris

# Plot data points
plot(pca$x,
     main = "PCA projection of iris data with outliers",
     col = "gray60", pch = 19, cex = 0.5, frame.plot = FALSE, asp = 1)
# Add outliers
points(pca$x[nrow(iris) + 1:N_out, ], col = "black", pch = 8)

# Add Archetypes
ix <- c(1:3, 1)
lines(A_ori[ix, ], col = "black", lwd = 1.5, lty = 2)
lines(A_std[ix, ], col = "firebrick", lwd = 2)
points(A_std, col = "firebrick", pch = 17, cex = 1.5)
lines(A_rob[ix, ], col = "steelblue", lwd = 2)
points(A_rob, col = "steelblue", pch = 17, cex = 1.5)

# Add legend
legend("topright",
       legend = c("Data", "Outliers", "Reference AA", "Standard AA", "Robust AA"),
       col    = c("gray60", "black", "black", "firebrick", "steelblue"),
       pch    = c(19, 8, NA, 17, 17),
       lty    = c(NA, NA, 2, 1, 1),
       lwd    =  2,
       bty    = "n")

PCA projection of iris with five added outliers (crosses). Standard AA (red) is distorted toward the outlier cluster; robust AA (blue) closely recovers the reference triangle fitted on the clean data (dashed black).

The robust argument also accepts MASS psi names such as "psi.huber" and "psi.hampel"; pass tuning constants with robust_args. See MASS::rlm() for psi details. The MASS::rlm() method = "MM" estimator is not supported because it is regression-specific and not directly applicable to the AA objective.

Missing data

When X contains NA values, run_aa() automatically switches to the missing-data PGD objective (missing = TRUE): only the observed entries contribute to the gradient, and the loss is computed over those entries only.

X <- as.matrix(iris[, 1:4])

# Introduce ~10 % missing at random
missing_rate <- 0.1
n_elem <- length(X)  # number of elements
X_miss <- X
idx_miss <- sample(n_elem, size = round(missing_rate * n_elem))
X_miss[idx_miss] <- NA

set.seed(123)
fit_miss <- run_aa(X_miss, K = 3, nrep = 10, init = "random")  # missing = TRUE auto-detected
fit_miss
#> 
#> Call:
#> run_aa(x = X_miss, K = 3, init = "random", nrep = 10)
#> 
#> PGD Archetypal Analysis:
#> Number of Archetypes: 3 
#> Converged after 96 iterations.
#> Final Loss Metrics:
#>     loss        r2
#>  20.2496 0.9976202

The fitted model reconstructs missing entries, which can be compared with the original complete data.

X_hat_miss <- fitted(fit_miss)
any(is.na(X_hat_miss))   # FALSE = all positions reconstructed
#> [1] FALSE

# compare reconstructed vs original values at missing positions
cor(X_hat_miss[idx_miss], X[idx_miss])
#> [1] 0.9808647

Algorithm details

Algorithmic approaches: NNLS vs PGD

The CRAN package archetypes implements AA via alternating NNLS (non-negative least squares): at each step a constrained NNLS subproblem is solved via nnls::nnls(). yaap’s method = "nnls" implements this alternating NNLS approach.

The default method = "pgd" uses projected gradient descent instead. PGD skips the inner NNLS solves working directly on the gradient of the reconstruction error.

Unlike NNLS, which solves a constrained subproblem at each step, PGD works entirely with matrix multiplications, avoiding the overhead of an inner solver. The following benchmark on the toy dataset with 10 restarts illustrates this difference:

toy_mat    <- as.matrix(toy)
nrep_bench <- 10
seeds      <- 2046 + seq_len(nrep_bench)

run_single <- function(method, seed, ...) {
    set.seed(seed)
    fit <- suppressWarnings(run_aa(
        toy_mat, K = 3, nrep = 1, method = method, init = "random", ...
    ))
    c(n_iter = nrow(fit$loss), final_loss = norm(resid(fit), "F"))
}

# PGD benchmark
t_pgd  <- system.time(
  res_pgd <- vapply(seeds, run_single, numeric(2), method = "pgd")
)[["elapsed"]]
n_pgd  <- round(median(res_pgd["n_iter", ]))

# NNLS benchmark
t_nnls <- system.time(
    res_nnls <- vapply(seeds, run_single, numeric(2), method = "nnls")
)[["elapsed"]]
n_nnls <- round(median(res_nnls["n_iter", ]))

# Run by run comparison
win_pgd   <- sum(res_pgd["final_loss", ] < res_nnls["final_loss", ])
loss_diff <- res_pgd["final_loss", ] - res_nnls["final_loss", ]
diff_pgd  <- median(loss_diff / res_nnls["final_loss", ])
diff_nnls <- median(loss_diff / res_pgd["final_loss", ])

bench <- data.frame(
  Method           = c("PGD", "NNLS"),
  `Median iters`   = c(n_pgd, n_nnls),
  `ms / iter`      = 1000 * c(t_pgd / n_pgd, t_nnls / n_nnls),
  `loss change (%)`= 100 * c(diff_pgd, -diff_nnls),
  `Wins`           = c(win_pgd, nrep_bench - win_pgd),
  check.names      = FALSE
)
knitr::kable(bench, digits = 2)

Method	Median iters	ms / iter	loss change (%)	Wins
PGD	64	5.06	-0.46	10
NNLS	36	34.94	0.46	0

Because PGD works directly on the gradient, it also supports extensions such as the slack terms used in the delta relaxation and missing residuals in the missing = TRUE objective. Both methods can handle robust psi reweighting, as well as weighting schemes passed via the scale and weights arguments.

PGD Algorithm

The algorithm is described in detail (Mørup and Hansen 2012) and is implemented in the archetypes_pgd() function. The solver computes partial derivatives of the loss with respect to each variable, then iteratively updates them in a descent direction. The main steps are:

Initialization: selects K well-separated data points, giving good initial coverage of the data hull.
S-step: for fixed archetypes $A$, update compositions $S$
B-step: for fixed $S$, update the archetype generators $B$
a-step (optional): in case of delta > 0, update the slacks a, keeping $B$ and $S$ fixed.
Repeat until the relative change in RSS drops below tol, or $R^2$ exceeds tol_r2.

The problems are solved via projected gradient descent with line search and backtracking. The step size starts at step_size and is multiplied by step_shrinkage after each rejected proposal; if no improvement is found after max_iter_optimizer reductions, the step is skipped entirely. When pseudo_pgd = TRUE (default), the gradient is modified before each step to remove the radial direction (that changes the magnitude of the variables), keeping the proposed update close to the simplex surface.

If no update is accepted after max_no_update backtracking steps, the algorithm halts with a warning of non-convergence.

Key arguments

Argument	Default	Description
`max_iter`	100	Maximum outer iterations
`tol`	1e-4	Relative RSS change stopping criterion
`tol_r2`	0.9999	R² early-stop threshold
`nrep`	1	Independent random restarts (best returned)
`scale`	`FALSE`	No column scaling by default
`delta`	0	Convex-hull relaxation (see § delta)
`pseudo_pgd`	`TRUE`	Remove radial component of gradient before step
`step_size`	1.0	Initial step size for the line search
`max_iter_optimizer`	10	Maximum back-tracking steps per outer iteration
`step_shrinkage`	0.5	Shrinkage factor per back-track
`max_no_update`	5	Consecutive no-improvement steps before early halt

Loss tracking

As introduced in the first example, the $loss data frame tracks one row per logged iteration; the column meanings are summarized there. The full sequence for the iris model is:

head(fit_iris$loss, 6)
#>        loss        r2
#> 1 126.99401 0.7869228
#> 2 114.08600 0.8085805
#> 3  74.67707 0.8747029
#> 4  65.43686 0.8902066
#> 5  61.60086 0.8966429
#> 6  59.03305 0.9009513

The loss metrics may differ from the direct computation (for example norm(X - fitted(fit_iris), "F")^2) because the loss is computed on the scaled data or other internal transformations. Thus, the loss metrics should only be used for monitoring convergence and comparing different runs of the same model type, not for comparing across different models or datasets. That is why we use the generic term “loss” instead of “RSS” in the column name.

The condition numbers k_S and k_A are diagnostics of numerical stability and are especially relevant for the case of method = "nnls" because it involves solving linear systems with these matrices. High k_S or k_A values signal numerical issues and are worth investigating with additional restarts or a larger delta. These diagnostics usually require attention only when convergence is unstable.

Conclusion

Limitations and challenges

Some important practical considerations and limitations of AA:

Presence of (Poor) Local Minima: The AA objective is non-convex, so the algorithm is not guaranteed to find the global minimum.
Initialization sensitivity: Because of the presence of local minima, the choice of initial archetypes can influence the final solution.
Number of archetypes: The number of archetypes is the most important tuning parameter, and its choice is subjective and can affect the interpretability and quality of the fit.
Sensitivity to Convex Hull: Standard AA can be sensitive to outliers or misspecification when the true extremes lie outside the observed convex hull.
Euclidean assumption: The standard AA formulation assumes a Euclidean geometry (Frobenius norm), which may not be appropriate for all data types.

Several methods have been proposed to mitigate these issues and this package implements some of them. The most common strategies are:

Multiple restarts and initialization: To combat non-convexity and initialization sensitivity, it is common to run AA with multiple independent restarts (nrep) and select the best solution. Sophisticated initialization strategies (see the “Choosing an Initialization Method for Archetypal Analysis” vignette) can also avoid poor local minima and improve both solution quality and convergence speed.
Choosing the number of archetypes: The number of archetypes K is the most important tuning parameter. The choice can be guided by domain knowledge, cross-validation, or information criteria, but it remains a subjective decision that affects the interpretability and quality of the fit. This package provides tools to help with this choice through archetypes_path() and screeplot().
Robust fitting: Because archetypes are constrained to be convex combinations of the data, an outlier can attract an entire archetype to itself, leading to a potentially misleading result. The option robust = TRUE applies robust psi row reweighting to reduce the influence of such observations (see Robust fitting).
Delta relaxation: If the genuine extremes of the underlying data-generating process are not observed, standard AA will be unable to recover them. To accommodate this situation, the delta relaxation allows archetypes to step outside the data hull and better capture true extremes when the data is truncated.
Non-Gaussian and metric variants: The standard AA formulation assumes a Euclidean geometry (Frobenius norm), which may not be appropriate for all data types. Extensions of AA to other geometries and data types are presented in the “Non-Gaussian and Metric Variants of Archetypal Analysis” vignette.

Summary

Task	Function / accessor
Fit a model (matrix)	`run_aa(X, K, ...)`
Fit a model (formula)	`run_aa(y ~ ., data, K, ...)`
Extract compositions ($S$)	`compositions(fit)`
Extract coordinates ($A$)	`coordinates(fit)`
Extract coefficients ($B$)	`coef(fit)`
Reconstruct data	`fitted(fit)`
Residuals	`residuals(fit)`
Predict new reconstructions	`predict(fit, newdata)`
Predict new compositions	`predict(fit, newdata, type = "compositions")`
Rename archetypes	`anames(fit) <- c(...)`
Loss / convergence	`fit$loss`, `fit$converged`
Feature profile plot	`plot(fit, "profiles")`
Composition barplot	`plot(fit, "compositions")`
Simplex plot	`plot(fit, "simplex")`
Coordinate scatter	`plot(fit, "coordinates")`
Convergence plot	`plot(fit, "loss")`
Robust fitting	`run_aa(..., robust = TRUE)`
Missing-data fitting	`run_aa(...)` (auto-detects)

Session info

sessionInfo()
#> R version 4.6.0 (2026-04-24)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_IE.UTF-8        LC_COLLATE=C              
#>  [5] LC_MONETARY=en_IE.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_IE.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_IE.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Europe/Madrid
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] generics_0.1.4 yaap_1.0.0    
#> 
#> loaded via a namespace (and not attached):
#>  [1] vctrs_0.7.3        cli_3.6.6          knitr_1.51         rlang_1.2.0       
#>  [5] xfun_0.57          otel_0.2.0         DEoptimR_1.1-4     tensorA_0.36.2.1  
#>  [9] jsonlite_2.0.0     glue_1.8.1         htmltools_0.5.9    sass_0.4.10       
#> [13] rmarkdown_2.31     quadprog_1.5-8     grid_4.6.0         bayesm_3.1-7      
#> [17] evaluate_1.0.5     jquerylib_0.1.4    tibble_3.3.1       MASS_7.3-65       
#> [21] fastmap_1.2.0      yaml_2.3.12        lifecycle_1.0.5    compiler_4.6.0    
#> [25] robustbase_0.99-7  Rcpp_1.1.1-1.1     pkgconfig_2.0.3    compositions_2.0-9
#> [29] lattice_0.22-9     digest_0.6.39      R6_2.6.1           utf8_1.2.6        
#> [33] pillar_1.11.1      magrittr_2.0.5     bslib_0.11.0       Matrix_1.7-5      
#> [37] tools_4.6.0        nnls_1.6           matrixStats_1.5.0  cachem_1.1.0

References

Cutler, Adele, and Leo Breiman. 1994. “Archetypal Analysis.” Technometrics 36 (4): 338–47. https://doi.org/10.1080/00401706.1994.10485840.

Mørup, Morten, and Lars Kai Hansen. 2012. “Archetypal Analysis for Machine Learning and Data Mining.” Neurocomputing 80: 54–63. https://doi.org/https://doi.org/10.1016/j.neucom.2011.06.033.

Suleman, Athar. 2017. “Validation of Archetypal Analysis.” In 2017 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), 1–6. Naples, Italy. https://doi.org/10.1109/FUZZ-IEEE.2017.8015385.

Introduction to Archetypal Analysis with yaap

What is Archetypal Analysis?

Relationship to other methods

The yaap package

A first example: the toy dataset

The archetypes result object

Integration with broom

Visualizing the fit

Archetype positions in feature space

Working with “real” data: Fisher’s iris

Archetype positions in feature space

Naming archetypes from their defining observations

Archetype feature profiles

Sample compositions

Predicting reconstructions and compositions for new observations

Practical considerations

Choosing K

Scaling

Multiple restarts

Convergence

Variants of Canonical AA and their implementation in yaap

Relaxing the convex hull constraint with delta

Robust fitting

Missing data

Algorithm details

Algorithmic approaches: NNLS vs PGD

PGD Algorithm

Key arguments

Loss tracking

Conclusion

Limitations and challenges

Summary

Session info

References

The `archetypes` result object

Integration with `broom`

Choosing `K`

Relaxing the convex hull constraint with `delta`