Introduction to Archetypal Analysis with yaap

What is Archetypal Analysis?

Archetypal analysis (AA) is an unsupervised learning method that describes every observation as a convex mixture1 of a small number of extreme, idealized profiles called archetypes (Cutler and Breiman 1994). Unlike PCA, whose components are abstract mathematical directions, or k-means, whose clusters are represented by centroids, archetypes are located at the boundary of the data representing the distinct extremes. AA combines cluster-style interpretability with matrix-factorization flexibility. In practice, archetypes often correspond to “pure” or “extreme” profiles. This representation is useful when observations are naturally viewed as mixtures of biologically, chemically, or behaviorally distinct states.

Formally, given a data matrix \(X \in \mathbb{R}^{N \times M}\) (N samples, M features), AA finds:

such that the reconstruction \(\hat{X} = SA\) is as close as possible to \(X\) in the standard squared Frobenius norm2:

\[\min_{B,S} \|X - SBX\|_F^2 \quad \text{s.t.} \quad A = BX,\; B,S \geq 0,\; B\mathbf{1} = S\mathbf{1} = \mathbf{1}.\]

The constraint \(A = BX\) keeps archetypes inside (or on the boundary of) the convex hull of the data, making them directly interpretable as extreme prototypes. The composition vectors \(S\) correspond to points in the archetype simplex: a composition of \((1,0,0)\) means a sample is entirely characterized by archetype 1, while \((1/3,1/3,1/3)\) means it sits equally between all three.

Relationship to other methods

Method Main Focus Sample Representation Component Type Interpretability Flexibility
PCA / SVD Directions of maximal variance Linear combination Abstract latent directions Often moderate to low High
NMF Additive parts-based representation Non-negative combination Non-negative latent parts Moderate to high Moderate to high
k-means Central cluster prototypes Hard assignment to one cluster Cluster centroids High Low to moderate
k-medoids Representative observed prototypes Hard assignment to one observed prototype Observed data points High Low
Archetypal Analysis Extreme profiles / corners of the data Convex combination of archetypes Observed-data-based extremes High Moderate to high

The yaap package

yaap (Yet Another Archetypes Package) provides fast, flexible AA for R. This vignette is organized into four parts:

  1. Core workflow: fit models with run_aa(), inspect archetypes objects, visualize archetypes and compositions, and predict for new data.
  2. Practical choices: choose K, scale features, use multiple restarts, and check convergence.
  3. Alternative formulations: use the delta relaxation, robust fitting, and missing-data fitting.
  4. Algorithm details: compare NNLS and PGD, and review the main tuning arguments.
library(yaap)
library(generics)
#> 
#> Attaching package: 'generics'
#> The following objects are masked from 'package:base':
#> 
#>     as.difftime, as.factor, as.ordered, intersect, is.element, setdiff,
#>     setequal, union

A first example: the toy dataset

The package ships a small two-dimensional toy dataset (originally from the archetypes) that contains 250 points scattered around three vertices of a triangle.

toy <- read.csv(system.file("extdata", "toy.csv", package = "yaap"))
head(toy, 3)
#>           x         y
#> 1  8.749423  9.922297
#> 2  4.965050  7.163594
#> 3 10.092057 13.750449
dim(toy)
#> [1] 250   2
plot(toy, asp = 1., frame.plot = FALSE, axes = FALSE,
     main = "Toy Dataset", xlab = "", ylab = "",
     pch = 19, cex = 0.6, col = "gray50")
The toy dataset: three clusters near the corners of a triangle.

The toy dataset: three clusters near the corners of a triangle.

We fit an AA model with \(K = 3\) archetypes using run_aa().

fit <- run_aa(toy, K = 3, nrep = 3)
fit
#> 
#> Call:
#> run_aa(x = toy, K = 3, nrep = 3)
#> 
#> PGD Archetypal Analysis:
#> Number of Archetypes: 3 
#> Converged after 78 iterations.
#> Final Loss Metrics:
#>      loss        r2
#>  119.2253 0.9984708

The output reports the final \(R^2\) and convergence status.

The archetypes result object

run_aa() returns an S3 object of class archetypes. It is a list with the following main components:

Component Dimensions Description
coordinates K x M Archetype profiles in feature space
coefficients K x N Weights expressing each archetype as a mix of samples
compositions N x K Barycentric coordinates of each sample
loss (iter) x R Per-iteration RSS, \(R^2\), and method-specific QC
converged scalar Whether the optimizer converged
coordinates(fit)
#>            x         y
#> A1  2.864114 10.005167
#> A2 13.987939  2.209766
#> A3 18.785381 18.639291

The fundamental identity \(\hat{X} = SA\) is:

X_hat <- compositions(fit) %*% coordinates(fit)

Let us verify this recovers the data well:

X <- as.matrix(toy)
X_hat <- fitted(fit)
R <- resid(fit)  # residuals: X - X_hat
(rss <- norm(R, "F")^2)  # residual sum of squares
#> [1] 119.2067
1 - rss / norm(X, "F")^2  # R^2
#> [1] 0.998471

plot(X, type = "n", asp = 1., frame.plot = FALSE, axes = FALSE,
     main = "Toy Dataset Reconstruction", xlab = "", ylab = "")
ix <- rowSums(R > 0.5) > 0  # highlight large residuals
segments(X[, 1], X[, 2], X_hat[, 1], X_hat[, 2], col = "firebrick", lwd = 0.5)
points(X    , pch = 20, cex = 0.8, col = "gray50")
points(X_hat, pch = 1,  cex = 0.7, col = "firebrick")
legend("topleft", legend = c("Data", "Reconstruction", "Residuals"), bty = "n",
       col = c("gray50", "firebrick", "firebrick"),
       pch = c(20, 1, NA),
       lwd = c(NA, NA, 0.5))
Reconstruction of toy dataset. Based on figure from `archetypes` package vignette.

Reconstruction of toy dataset. Based on figure from archetypes package vignette.

In the reconstruction plot, data points (gray disks) inside the archetype convex hull have exact reconstructions (red circles). Points outside the hull are projected to its edges (red segments).

The $loss data frame records the per-iteration optimization trajectory and method-specific diagnostics. These columns help assess convergence and model quality.

The exact columns depend on the method, but typical columns include:

Note that loss values may differ from the RSS computed directly above because the working objective operates on internally scaled data. See Algorithm details for a full discussion.

Integration with broom

yaap also supports broom-style methods for fitted archetypes objects. These methods return ordinary tibbles for downstream inspection, plotting, and reporting.

tidy() returns model matrices in long form. By default it returns the archetype coordinates: one row per archetype-feature pair, with columns archetype, term, and value.

tidy(fit)
#> # A tibble: 6 × 3
#>   archetype term  value
#>   <chr>     <chr> <dbl>
#> 1 A1        x      2.86
#> 2 A2        x     14.0 
#> 3 A3        x     18.8 
#> 4 A1        y     10.0 
#> 5 A2        y      2.21
#> 6 A3        y     18.6

The same verb can return other fitted matrices. For example, matrix = "compositions" returns one row per sample-archetype pair; value is the fitted weight of that archetype in that sample. matrix = "coefficients" similarly returns the weights that express each archetype as a mixture of training samples.

fit |> tidy(matrix = "compositions") |> head()
#> # A tibble: 6 × 3
#>   sample archetype    value
#>   <chr>  <chr>        <dbl>
#> 1 S1     A1        0.559   
#> 2 S2     A1        0.753   
#> 3 S3     A1        0.542   
#> 4 S4     A1        0.000325
#> 5 S5     A1        0.316   
#> 6 S6     A1        0.637

glance() returns one row of model-level diagnostics, including the number of archetypes, convergence status, final loss, final \(R^2\), number of iterations, adapted AIC, and model family.

glance(fit)
#> # A tibble: 1 × 7
#>       K converged  loss    r2 n_iter   aic family  
#>   <int> <lgl>     <dbl> <dbl>  <int> <dbl> <chr>   
#> 1     3 TRUE       119. 0.998     78  4.12 gaussian

augment() returns the original data with one dot-prefixed composition column per archetype. The .A1, .A2, and .A3 columns are the fitted barycentric weights for each sample, so they sum to one row by row.

aug <- augment(fit, data = toy)
head(aug)
#> # A tibble: 6 × 5
#>       x     y      .A1           .A2          .A3
#>   <dbl> <dbl>    <dbl>         <dbl>        <dbl>
#> 1  8.75  9.92 0.559    0.237         0.204       
#> 2  4.97  7.16 0.753    0.247         0.0000000100
#> 3 10.1  13.8  0.542    0.0126        0.445       
#> 4 17.1  12.8  0.000325 0.357         0.643       
#> 5 13.6  16.2  0.316    0.00000001000 0.684       
#> 6  8.28 11.9  0.637    0.0737        0.289

comp_cols <- paste0(".", anames(fit))
range(rowSums(as.matrix(aug[, comp_cols])))  # all should be 1
#> [1] 1 1

Visualizing the fit

plot() method for archetypes objects offers several what options.

Archetype positions in feature space

This method plots the original data points, overlays the archetypes from the coordinates matrix, and connects them with lines.

plot(fit, what = "coordinates",
     main = "Archetype positions in feature space", xlab = "", ylab = "",
     asp = 1, frame.plot = FALSE, axes = FALSE,
     pch = 17, cex = 1.5, col = "firebrick",
     args.data.scatter = list(pch = 19, cex = 0.5, col = "gray50"))
Data and estimated archetypes. The triangle connects the three archetypes.

Data and estimated archetypes. The triangle connects the three archetypes.

As expected, the archetypes lie close to the corners of the point cloud.

Plot functions invisibly return the data used to construct the figure in a ggplot2 friendly format. Use plot = FALSE to prepare those data without drawing. For example, the following code reproduces the previous figure:

library(ggplot2)

plot(fit, what = "coordinates", plot = FALSE) |>
    ggplot(aes(x, y)) +
    geom_point(aes(color = archetype))

Working with “real” data: Fisher’s iris

Besides tabular data (like data.frame or matrix), run_aa() also accepts formula and data frame inputs. Formula input is useful when the data contains a response column to exclude: the response is silently dropped and all right-hand-side predictors are used as features.

# 'Species' (the response) is discarded; the four measurements are used.
fit_iris <- run_aa(Species ~ ., data = iris, K = 3, scale = TRUE, tol = 1e-4)
fit_iris
#> 
#> Call:
#> run_aa(formula = Species ~ ., data = iris, K = 3, scale = TRUE, 
#>     tol = 1e-04)
#> 
#> PGD Archetypal Analysis:
#> Number of Archetypes: 3 
#> Converged after 62 iterations.
#> Final Loss Metrics:
#>     loss        r2
#>  45.8493 0.9230716

This is equivalent to calling run_aa(iris[, 1:4], ...)

Because the four measurements are on different scales, we set scale = TRUE to apply column-wise z-scoring before fitting.

In the following subsections, we analyze the fitted model in a typical workflow:

Archetype positions in feature space

Since the iris dataset has four features, the coordinate scatter requires dimensionality reduction. Pass projection = "pca" to project both data and archetypes onto the first two principal components.

pal_sp <- c(setosa = "#E69F00", versicolor = "#56B4E9", virginica = "#009E73")
iris_col <- pal_sp[as.character(iris$Species)]
plot(fit_iris, what = "coordinates", projection = "pca", frame.plot = FALSE,
     main = "PCA projection of iris data and archetypes",
     col = "firebrick", pch = 17, cex = 1.5,
     args.data.scatter = list(col = iris_col, pch = 19, cex = 0.6))
PCA projection of the iris data and its three archetypes (triangles).

PCA projection of the iris data and its three archetypes (triangles).

Naming archetypes from their defining observations

Each archetype is a convex mixture of actual data points, described by the coefficients matrix (K x N). To name the archetypes, we summarize which species contribute most to each archetype.

B <- coef(fit_iris)  # K x N

# Sum coefficients within each species (rows = species, cols = archetypes)
coef_by_species <- aggregate(t(B), by = iris["Species"], FUN = sum)
knitr::kable(coef_by_species, digits = 3)
Species A1 A2 A3
setosa 0.771 1 0
versicolor 0.229 0 0
virginica 0.000 0 1

The table indicates that one archetype combines “setosa” and “versicolor”, while the other two are predominantly “setosa” and “virginica”, respectively. The assigned labels reflect those species contributions.

# Concat all species contributing to an archetype
coef_mat <- as.matrix(coef_by_species[, -1])  # drop species column
rownames(coef_mat) <- coef_by_species$Species
species_contrib <- which(coef_mat > 0.1, arr.ind = TRUE)
species_contrib <- aggregate(
    row ~ col,
    data = species_contrib,
    FUN = function(ix) {
        species <- rownames(coef_mat)[ix]
        species <- substr(species, 1, 3)  # abbreviate species names
        paste(species, collapse = ".")
    }
)

anames(fit_iris) <- species_contrib$row
anames(fit_iris)
#> [1] "set.ver" "set"     "vir"

Archetype feature profiles

what = "profiles" shows a grouped barplot of the archetype coordinates on the original feature scale. All other arguments are passed to barplot().

plot(fit_iris, what = "profiles",
     horiz = TRUE,
     names.arg = sub("\\.", "\n", colnames(iris[, 1:4])),
     las = 1,
     col = pal_aa,
     xlab = "Feature value (original scale)", ylab = "",
     main = "Archetype feature profiles")
Archetype feature profiles. Each group of bars is one feature; colors distinguish archetypes.

Archetype feature profiles. Each group of bars is one feature; colors distinguish archetypes.

Sample compositions

Named archetypes make the composition space easier to interpret.

plot(fit_iris, what = "simplex",
     col = pal_sp[as.character(iris$Species)], pch = 19,
     pca = TRUE, col.pca = "black", robust = FALSE)
legend("topright", legend = names(pal_sp), col = pal_sp, pch = 19, bty = "n")
Ternary plot of iris compositions, colored by species. Each point is a sample placed by its three archetype weights.

Ternary plot of iris compositions, colored by species. Each point is a sample placed by its three archetype weights.

The ternary structure shows a clear separation of species in composition space: setosa concentrates near one vertex, virginica near another, and versicolor occupies intermediate mixtures along the lower edge (small setosa contribution). The black curved line is the PCA trajectory projected into the simplex, summarizing the dominant continuous transition from setosa-like to virginica-like compositions through the hybrid archetype of setosa+versicolor.

The ternary plot also shows why the hybrid archetype has a stronger setosa contribution even though more versicolor samples lie near it on average: AA searches for extreme observations near the data hull boundary, where one setosa outlier lies. This illustrates a general property of AA: archetypes anchor at the geometric boundary, so even a single extreme observation can pull a vertex toward it. For analyses where outliers are a concern, see Robust fitting.

When samples are arranged by PC1, the composition barplot shows a transition from virginica-like to setosa-like samples, with the hybrid archetype in between.

plot(fit_iris, what = "compositions",
     main = "Archetype compositions across iris samples",
     cluster_rows = "PC1", cluster_cols = FALSE, col = pal_aa
)
Composition barplot: each bar is one iris sample. Single-color bars are near-pure specimens; blended bars sit between archetypes.

Composition barplot: each bar is one iris sample. Single-color bars are near-pure specimens; blended bars sit between archetypes.

Other potential orderings of the samples include:

See plot_archetypes_compositions() for details.

Predicting reconstructions and compositions for new observations

Once a model is fitted, predict() maps new observations to the archetype convex hull without re-fitting. The resulting projection can be represented in two ways:

new_obs <- matrix(
    c(5.1, 3.5, 1.4, 0.2,   # setosa-like
      6.7, 3.0, 5.2, 2.3),  # virginica-like
    nrow = 2, byrow = TRUE,
    dimnames = list(c("obs1", "obs2"), colnames(iris[, 1:4]))
)
print(new_obs)
#>      Sepal.Length Sepal.Width Petal.Length Petal.Width
#> obs1          5.1         3.5          1.4         0.2
#> obs2          6.7         3.0          5.2         2.3

predict(fit_iris, newdata = new_obs)
#>      Sepal.Length Sepal.Width Petal.Length Petal.Width
#> obs1     5.082102    3.507834     1.569315   0.3002319
#> obs2     6.789315    2.983888     5.661577   2.0019571

predict(fit_iris, newdata = new_obs, type = "compositions")
#>        set.ver        set          vir
#> obs1 0.3373917 0.66260826 9.999962e-09
#> obs2 0.1074892 0.07415061 8.183602e-01

Practical considerations

Choosing K

The main tuning parameter in archetypal analysis is K: the number of archetypes. A traditional exploratory approach is to fit a sequence of models and inspect a scree plot or elbow plot. The elbow is the smallest K after which reconstruction quality metrics improve only modestly. Typical metrics used for this purpose are:

Alongside elbow plots, yaap also provides an adapted AIC criterion (AIC) based on (Suleman 2017) that balances model fit and complexity, which can be used to select K more systematically. Unlike the other metrics we discussed, the adapted AIC is not monotonic in K, so we are not looking for an elbow but rather for the minimum value, which indicates the best trade-off between fit and parsimony.

Here we fit a “path” of archetypes models for K = 1, ..., 6 to the toy dataset, using the archetypes_path() function, which fits multiple models in a single call and reuses preprocessing.

fit_path <- archetypes_path(
    toy,
    K = 6,
    init = "random",
    max_iter = 100,
    nrep = 5
)

The resulting fit_path object contains the fitted models for all K, which can be accessed with [[ using their index or K value, for example fit_path[[3]] (third model) or fit_path[["K3"]] (model with K = 3). The path object also supports plotting the scree and adapted AIC curves across K values with screeplot().

op <- par(mfrow = c(1, 3), mar = c(4, 4, 2.5, 0.5), bty = "n")

selected_K <- 3
fit_best <- fit_path[[selected_K]] # here equivalent to fit_path[["K3"]]

screeplot(
    fit_path,
    y = "loss",
    col = "steelblue4",
    frame.plot = FALSE,
    ylab = "Reconstruction loss",
    main = "RSS"
)
abline(v = selected_K, lty = 2, col = "firebrick")

screeplot(
    fit_path,
    y = function(fit) 1 - utils::tail(fit$loss$r2, 1L),
    col = "steelblue4",
    frame.plot = FALSE,
    ylab = expression(1 - R^2),
    main = "Unexplained Variance"
)
abline(v = selected_K, lty = 2, col = "firebrick")

screeplot(
    fit_path,
    col = "steelblue4",
    frame.plot = FALSE,
    ylab = "Adapted AIC",
    main = "Adapted AIC"
)
abline(v = selected_K, lty = 2, col = "firebrick")
RSS, unexplained variance, and adapted AIC across candidate values of K.

RSS, unexplained variance, and adapted AIC across candidate values of K.


par(op)

For this dataset, all three diagnostics agree that the optimal number of archetypes is K = 3, the true number of generating vertices. The RSS and unexplained-variance curves show a clear elbow at K = 3. The adapted AIC has a strong minimum at K = 3. In real analyses, these plots are diagnostics rather than automatic rules. AA optimization is non-convex, so single-restart curves can be noisy and may not decrease monotonically for every larger K. Use more restarts and domain knowledge when the elbow or adapted AIC is ambiguous.

Scaling

Situation scale value Effect
Default/natural feature metric FALSE Data used as-is
Mixed-unit Gaussian features TRUE z-score columns
Custom feature importance Numeric vector Divides columns

The standard AA assumes Euclidean geometry and Gaussian error structure, so the optimization is carried out by minimizing the sum of squared residuals (RSS). Scaling can therefore materially affect the results.

By default, run_aa() does not transform the input data. The decision of whether to scale is similar to that in PCA: if the features are on different scales, set scale = TRUE to give them equal weight in the analysis. Otherwise, those features may have little influence on the fit. Custom scaling by passing a vector of scale factors is also supported similar to how scale() works in base R (by dividing the columns).

The difference between manually normalizing the data and setting scale = FALSE versus letting run_aa() handle the scaling internally is that the archetypes are returned on the scale of the input data. For interpretability, pass data on the scale you want the resulting archetypes to be in.

Besides scale, run_aa also has an sd_threshold = 1e-6 argument to remove features with near-zero variance, which can cause numerical instability and convergence issues. During reconstruction, the removed features are filled with the mean value of the original data.

When some samples are more reliable or informative than others, pass sample weights through the weights argument.

Multiple restarts

Because the AA objective is non-convex, the solution is sensitive to the initial archetype coordinates. run_aa() supports multiple independent restarts via the nrep argument. Setting nrep > 1 returns only the best solution across all runs, so the output is always a single archetypes object.

Multiple restarts are most useful when the initializer can produce different starting points across runs. The default initialization method is "furthest_sum". Although its starting seed is random, the procedure tends to have low variance because it greedily seeks a good initial guess. In contrast, "random" produces very different starting points, which lets the search explore the space more broadly. Below we run both methods five times on the iris dataset with K = 3 and compare the initial and final loss across runs.

runs <- do.call(rbind, lapply(1:5, function(k) {
    fit_fs  <- run_aa(iris[, 1:4], K = 3, nrep = 1)
    fit_uni <- run_aa(iris[, 1:4], K = 3, nrep = 1, init = "random")
    data.frame(
        run       = k,
        fs_start  = fit_fs$loss$loss[1],
        fs_end    = tail(fit_fs$loss$loss, 1),
        uni_start = fit_uni$loss$loss[1],
        uni_end   = tail(fit_uni$loss$loss, 1)
    )
}))

colnames(runs) <- c("run",
                    "furthest_sum start", "furthest_sum end",
                    "random start", "random end")
knitr::kable(runs, digits = 1)
run furthest_sum start furthest_sum end random start random end
1 131.1 26.2 132.9 24.9
2 131.1 26.2 164.1 24.9
3 131.1 26.2 384.6 25.0
4 131.1 26.2 72.0 24.9
5 131.1 26.2 893.0 24.9

furthest_sum starts closer to the final solution, while random starts are more variable. However, both methods converge in similarly good solutions, with the random method often even outperforming furthest_sum in final loss.

See the Initialization vignette for a more detailed discussion and comparison of all available initializers.

Convergence

If a run_aa() call does not converge, it emits a warning. The status of the fit is stored in the converged slot of the output object, and is also printed out when the object or its summary are printed.

The full optimization trajectory is available in the loss data frame and can be plotted with plot(fit, what = "loss").

fit_iris$converged
#> [1] TRUE
tail(fit_iris$loss, 1)
#>       loss        r2
#> 63 45.8493 0.9230716
plot(fit_iris, what = "loss", frame.plot = FALSE, type = "b",
     main = "Convergence of the iris model")
RSS across iterations for the iris model.

RSS across iterations for the iris model.

Typically, loss decreases quickly at first and then plateaus near a local minimum. If the model has not converged, increase max_iter (default 100) or add more restarts. If the loss decreases very slowly, increasing max_iter is unlikely to help; consider increasing tol or tol_r2 instead. This will not improve the fit, but it can reduce runtime and avoid unnecessary non-convergence warnings.


Variants of Canonical AA and their implementation in yaap

This vignette covers only the Euclidean geometries with Gaussian errors setting. For Non-Gaussian and Metric Variants of Archetypal Analysis see the corresponding vignette.

Relaxing the convex hull constraint with delta

By default (delta = 0) archetypes are constrained to the convex hull of the data, i.e., they are strictly weighted averages of actual observations.

When the data is a truncated version of a larger pattern, the genuine extremes are absent and the constrained model cannot recover the true vertices and instead anchors to the most extreme observed points.

Setting delta > 0 relaxes this constraint and allows archetypes to step outside the data hull (Mørup and Hansen 2012). The following example makes this concrete with synthetic data drawn from a truncated equilateral simplex: any composition weight above 0.8 is removed, so no sample is close to any true vertex.

K_d <- 3  # number of archetypes (simplex vertices)
M_d <- 2  # number of features (2D for easy visualization)
N_init <- 1000  # initial number of samples before truncation
thresh <- 0.8    # truncation threshold on composition weights

# True archetypes: equilateral triangle
A_true <- matrix(
    c(cos(seq(0, 2*pi*(K_d-1)/K_d, length.out = K_d)),
      sin(seq(0, 2*pi*(K_d-1)/K_d, length.out = K_d))),
    nrow = K_d, ncol = M_d
)

# Dirichlet-like mixing weights; remove observations near any vertex
S_raw   <- -log(matrix(runif(K_d * N_init), nrow = N_init, ncol = K_d))
S_raw   <- S_raw / rowSums(S_raw)
S_trunc <- S_raw[rowSums(S_raw > thresh) == 0, ]
X_delta <- S_trunc %*% A_true
print(nrow(X_delta))  # observations after truncation
#> [1] 876

# Fit models with different delta values
delta_vals <- c(0, 0.2, 0.35)
res_delta  <- lapply(
    delta_vals,
    function(d) archetypes_pgd(X_delta, K = K_d, delta = d)
)

# Plot all models together with the true archetypes
ix   <- c(1:3, 1)   # close the triangle
cols <- c("steelblue", "darkorange", "firebrick")

plot(A_true[ix, 1], A_true[ix, 2],
     type = "l", col = "black", lwd = 2, lty = 3,
     frame.plot = FALSE, xlab = "", ylab = "",
     xaxt = "n", yaxt = "n", asp = 1,
     main = "Effect of delta on recovered archetypes")
points(X_delta, pch = 19, cex = 0.4, col = "gray50")
for (i in seq_along(res_delta)) {
    A <- coordinates(res_delta[[i]])
    lines(A[ix, 1], A[ix, 2], col = cols[i], lwd = 2)
}
legend("topright",
       legend = c("True simplex", paste0("\u03b4 = ", delta_vals)),
       col = c("black", cols), lty = c(3, 1, 1, 1), lwd = 2, bty = "n")
Truncated simplex: Larger delta lets archetypes escape the data hull and approach the true positions.

Truncated simplex: Larger delta lets archetypes escape the data hull and approach the true positions.

With delta = 0 the estimated triangle is visibly smaller than the true one and its vertices all lie within the data cloud. Increasing delta allows archetypes to reach outside the cloud toward the true vertices. Use delta with care: very large values allow archetypes to drift far from the data, losing their interpretability as extreme prototypes.

Robust fitting

Standard AA minimizes squared residuals, making it sensitive to outliers. Setting robust = TRUE applies MASS-style "psi.bisquare" row reweighting: observations with large residuals receive reduced influence at each iteration, so outliers cannot monopolize an archetype.

Below we augment the iris dataset with five outliers and compare the results of standard and robust AA. Since the default initialization method (furthest_sum) is also sensitive to outliers, we set init = "random" to minimize their influence.

# Append five outlier observations to iris
N_out <- 5
displacement <- c(2, 2, 0, 0)  # displace in the first and second features

add_outliers <- function(X, N, displacement) {
  x_mean <- colMeans(X) + displacement # place outliers at a distance from the mean
  x_sd <- apply(X, 2, sd) * .75
  X_out <- rnorm(N * ncol(X), x_mean, x_sd)  # recycle x_mean, x_sd
  rbind(X, matrix(pmax(X_out, .1), nrow = N, byrow = TRUE))
}

X <- iris[, 1:4] |>
    as.matrix() |>
    add_outliers(N = N_out, displacement = displacement)

set.seed(123)  # fix starting positions
fit_std <- run_aa(X, K = 3, robust = FALSE, tol = 1e-4, init = "random", nrep = 5)
set.seed(123)  # same starting positions
fit_rob <- run_aa(X, K = 3, robust = TRUE, tol = 1e-4, init = "random", nrep = 5)

In standard AA, one archetype is pulled toward the outliers. In robust AA, the outliers receive less influence and the archetypes remain near the original data structure.

# Project all archetypes to PC space of data
pca <- prcomp(X, scale. = TRUE)
A_std <- predict(pca, coordinates(fit_std))
A_rob <- predict(pca, coordinates(fit_rob))
A_ori <- predict(pca, coordinates(fit_iris))   # reference: fitted on clean iris

# Plot data points
plot(pca$x,
     main = "PCA projection of iris data with outliers",
     col = "gray60", pch = 19, cex = 0.5, frame.plot = FALSE, asp = 1)
# Add outliers
points(pca$x[nrow(iris) + 1:N_out, ], col = "black", pch = 8)

# Add Archetypes
ix <- c(1:3, 1)
lines(A_ori[ix, ], col = "black", lwd = 1.5, lty = 2)
lines(A_std[ix, ], col = "firebrick", lwd = 2)
points(A_std, col = "firebrick", pch = 17, cex = 1.5)
lines(A_rob[ix, ], col = "steelblue", lwd = 2)
points(A_rob, col = "steelblue", pch = 17, cex = 1.5)

# Add legend
legend("topright",
       legend = c("Data", "Outliers", "Reference AA", "Standard AA", "Robust AA"),
       col    = c("gray60", "black", "black", "firebrick", "steelblue"),
       pch    = c(19, 8, NA, 17, 17),
       lty    = c(NA, NA, 2, 1, 1),
       lwd    =  2,
       bty    = "n")
PCA projection of iris with five added outliers (crosses). Standard AA (red) is distorted toward the outlier cluster; robust AA (blue) closely recovers the reference triangle fitted on the clean data (dashed black).

PCA projection of iris with five added outliers (crosses). Standard AA (red) is distorted toward the outlier cluster; robust AA (blue) closely recovers the reference triangle fitted on the clean data (dashed black).

The robust argument also accepts MASS psi names such as "psi.huber" and "psi.hampel"; pass tuning constants with robust_args. See MASS::rlm() for psi details. The MASS::rlm() method = "MM" estimator is not supported because it is regression-specific and not directly applicable to the AA objective.

Missing data

When X contains NA values, run_aa() automatically switches to the missing-data PGD objective (missing = TRUE): only the observed entries contribute to the gradient, and the loss is computed over those entries only.

X <- as.matrix(iris[, 1:4])

# Introduce ~10 % missing at random
missing_rate <- 0.1
n_elem <- length(X)  # number of elements
X_miss <- X
idx_miss <- sample(n_elem, size = round(missing_rate * n_elem))
X_miss[idx_miss] <- NA

set.seed(123)
fit_miss <- run_aa(X_miss, K = 3, nrep = 10, init = "random")  # missing = TRUE auto-detected
fit_miss
#> 
#> Call:
#> run_aa(x = X_miss, K = 3, init = "random", nrep = 10)
#> 
#> PGD Archetypal Analysis:
#> Number of Archetypes: 3 
#> Converged after 96 iterations.
#> Final Loss Metrics:
#>     loss        r2
#>  20.2496 0.9976202

The fitted model reconstructs missing entries, which can be compared with the original complete data.

X_hat_miss <- fitted(fit_miss)
any(is.na(X_hat_miss))   # FALSE = all positions reconstructed
#> [1] FALSE

# compare reconstructed vs original values at missing positions
cor(X_hat_miss[idx_miss], X[idx_miss])
#> [1] 0.9808647

Algorithm details

Algorithmic approaches: NNLS vs PGD

The CRAN package archetypes implements AA via alternating NNLS (non-negative least squares): at each step a constrained NNLS subproblem is solved via nnls::nnls(). yaap’s method = "nnls" implements this alternating NNLS approach.

The default method = "pgd" uses projected gradient descent instead. PGD skips the inner NNLS solves working directly on the gradient of the reconstruction error.

Unlike NNLS, which solves a constrained subproblem at each step, PGD works entirely with matrix multiplications, avoiding the overhead of an inner solver. The following benchmark on the toy dataset with 10 restarts illustrates this difference:

toy_mat    <- as.matrix(toy)
nrep_bench <- 10
seeds      <- 2046 + seq_len(nrep_bench)

run_single <- function(method, seed, ...) {
    set.seed(seed)
    fit <- suppressWarnings(run_aa(
        toy_mat, K = 3, nrep = 1, method = method, init = "random", ...
    ))
    c(n_iter = nrow(fit$loss), final_loss = norm(resid(fit), "F"))
}

# PGD benchmark
t_pgd  <- system.time(
  res_pgd <- vapply(seeds, run_single, numeric(2), method = "pgd")
)[["elapsed"]]
n_pgd  <- round(median(res_pgd["n_iter", ]))

# NNLS benchmark
t_nnls <- system.time(
    res_nnls <- vapply(seeds, run_single, numeric(2), method = "nnls")
)[["elapsed"]]
n_nnls <- round(median(res_nnls["n_iter", ]))

# Run by run comparison
win_pgd   <- sum(res_pgd["final_loss", ] < res_nnls["final_loss", ])
loss_diff <- res_pgd["final_loss", ] - res_nnls["final_loss", ]
diff_pgd  <- median(loss_diff / res_nnls["final_loss", ])
diff_nnls <- median(loss_diff / res_pgd["final_loss", ])

bench <- data.frame(
  Method           = c("PGD", "NNLS"),
  `Median iters`   = c(n_pgd, n_nnls),
  `ms / iter`      = 1000 * c(t_pgd / n_pgd, t_nnls / n_nnls),
  `loss change (%)`= 100 * c(diff_pgd, -diff_nnls),
  `Wins`           = c(win_pgd, nrep_bench - win_pgd),
  check.names      = FALSE
)
knitr::kable(bench, digits = 2)
Method Median iters ms / iter loss change (%) Wins
PGD 64 5.06 -0.46 10
NNLS 36 34.94 0.46 0

Because PGD works directly on the gradient, it also supports extensions such as the slack terms used in the delta relaxation and missing residuals in the missing = TRUE objective. Both methods can handle robust psi reweighting, as well as weighting schemes passed via the scale and weights arguments.

PGD Algorithm

The algorithm is described in detail (Mørup and Hansen 2012) and is implemented in the archetypes_pgd() function. The solver computes partial derivatives of the loss with respect to each variable, then iteratively updates them in a descent direction. The main steps are:

  1. Initialization: selects K well-separated data points, giving good initial coverage of the data hull.
  2. S-step: for fixed archetypes \(A\), update compositions \(S\)
  3. B-step: for fixed \(S\), update the archetype generators \(B\)
  4. a-step (optional): in case of delta > 0, update the slacks a, keeping \(B\) and \(S\) fixed.
  5. Repeat until the relative change in RSS drops below tol, or \(R^2\) exceeds tol_r2.

The problems are solved via projected gradient descent with line search and backtracking. The step size starts at step_size and is multiplied by step_shrinkage after each rejected proposal; if no improvement is found after max_iter_optimizer reductions, the step is skipped entirely. When pseudo_pgd = TRUE (default), the gradient is modified before each step to remove the radial direction (that changes the magnitude of the variables), keeping the proposed update close to the simplex surface.

If no update is accepted after max_no_update backtracking steps, the algorithm halts with a warning of non-convergence.

Key arguments

Argument Default Description
max_iter 100 Maximum outer iterations
tol 1e-4 Relative RSS change stopping criterion
tol_r2 0.9999 R² early-stop threshold
nrep 1 Independent random restarts (best returned)
scale FALSE No column scaling by default
delta 0 Convex-hull relaxation (see § delta)
pseudo_pgd TRUE Remove radial component of gradient before step
step_size 1.0 Initial step size for the line search
max_iter_optimizer 10 Maximum back-tracking steps per outer iteration
step_shrinkage 0.5 Shrinkage factor per back-track
max_no_update 5 Consecutive no-improvement steps before early halt

Loss tracking

As introduced in the first example, the $loss data frame tracks one row per logged iteration; the column meanings are summarized there. The full sequence for the iris model is:

head(fit_iris$loss, 6)
#>        loss        r2
#> 1 126.99401 0.7869228
#> 2 114.08600 0.8085805
#> 3  74.67707 0.8747029
#> 4  65.43686 0.8902066
#> 5  61.60086 0.8966429
#> 6  59.03305 0.9009513

The loss metrics may differ from the direct computation (for example norm(X - fitted(fit_iris), "F")^2) because the loss is computed on the scaled data or other internal transformations. Thus, the loss metrics should only be used for monitoring convergence and comparing different runs of the same model type, not for comparing across different models or datasets. That is why we use the generic term “loss” instead of “RSS” in the column name.

The condition numbers k_S and k_A are diagnostics of numerical stability and are especially relevant for the case of method = "nnls" because it involves solving linear systems with these matrices. High k_S or k_A values signal numerical issues and are worth investigating with additional restarts or a larger delta. These diagnostics usually require attention only when convergence is unstable.


Conclusion

Limitations and challenges

Some important practical considerations and limitations of AA:

Several methods have been proposed to mitigate these issues and this package implements some of them. The most common strategies are:

Summary

Task Function / accessor
Fit a model (matrix) run_aa(X, K, ...)
Fit a model (formula) run_aa(y ~ ., data, K, ...)
Extract compositions (\(S\)) compositions(fit)
Extract coordinates (\(A\)) coordinates(fit)
Extract coefficients (\(B\)) coef(fit)
Reconstruct data fitted(fit)
Residuals residuals(fit)
Predict new reconstructions predict(fit, newdata)
Predict new compositions predict(fit, newdata, type = "compositions")
Rename archetypes anames(fit) <- c(...)
Loss / convergence fit$loss, fit$converged
Feature profile plot plot(fit, "profiles")
Composition barplot plot(fit, "compositions")
Simplex plot plot(fit, "simplex")
Coordinate scatter plot(fit, "coordinates")
Convergence plot plot(fit, "loss")
Robust fitting run_aa(..., robust = TRUE)
Missing-data fitting run_aa(...) (auto-detects)

Session info

sessionInfo()
#> R version 4.6.0 (2026-04-24)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_IE.UTF-8        LC_COLLATE=C              
#>  [5] LC_MONETARY=en_IE.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_IE.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_IE.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Europe/Madrid
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] generics_0.1.4 yaap_1.0.0    
#> 
#> loaded via a namespace (and not attached):
#>  [1] vctrs_0.7.3        cli_3.6.6          knitr_1.51         rlang_1.2.0       
#>  [5] xfun_0.57          otel_0.2.0         DEoptimR_1.1-4     tensorA_0.36.2.1  
#>  [9] jsonlite_2.0.0     glue_1.8.1         htmltools_0.5.9    sass_0.4.10       
#> [13] rmarkdown_2.31     quadprog_1.5-8     grid_4.6.0         bayesm_3.1-7      
#> [17] evaluate_1.0.5     jquerylib_0.1.4    tibble_3.3.1       MASS_7.3-65       
#> [21] fastmap_1.2.0      yaml_2.3.12        lifecycle_1.0.5    compiler_4.6.0    
#> [25] robustbase_0.99-7  Rcpp_1.1.1-1.1     pkgconfig_2.0.3    compositions_2.0-9
#> [29] lattice_0.22-9     digest_0.6.39      R6_2.6.1           utf8_1.2.6        
#> [33] pillar_1.11.1      magrittr_2.0.5     bslib_0.11.0       Matrix_1.7-5      
#> [37] tools_4.6.0        nnls_1.6           matrixStats_1.5.0  cachem_1.1.0

References

Cutler, Adele, and Leo Breiman. 1994. “Archetypal Analysis.” Technometrics 36 (4): 338–47. https://doi.org/10.1080/00401706.1994.10485840.
Mørup, Morten, and Lars Kai Hansen. 2012. “Archetypal Analysis for Machine Learning and Data Mining.” Neurocomputing 80: 54–63. https://doi.org/https://doi.org/10.1016/j.neucom.2011.06.033.
Suleman, Athar. 2017. “Validation of Archetypal Analysis.” In 2017 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), 1–6. Naples, Italy. https://doi.org/10.1109/FUZZ-IEEE.2017.8015385.

  1. weighted sums with non-negative weights summing to 1↩︎

  2. sum of squared residuals↩︎

mirror server hosted at Truenetwork, Russian Federation.