prepR4pcm

Lifecycle: stable

Phylogenetic comparative methods (PCMs) need a phylogenetic tree and a trait dataset whose species names line up exactly with the tree’s tip labels. prepR4pcm addresses both halves of that prerequisite:

  1. Reconcile names when the data and the tree disagree on spelling, formatting, or synonymy — so species aren’t silently dropped from the analysis.
  2. Retrieve and date trees from public databases when you don’t already have one, including posterior samples of trees, so that uncertainty in the choice of tree can be propagated downstream.

In a phylogenetic comparative analysis, the species names in the trait dataset must match the tree’s tip labels exactly. A mismatch breaks the link between a species’ trait values (e.g., rows in a table) and its position in the tree, a link that every PCM depends on, whether the study concerns trait evolution, niche conservatism, or correlated trait change. Mismatched species can also be silently excluded from the analysis. Mismatches come in three main types: differences in spelling, in formatting, and in synonymy.

prepR4pcm detects and resolves all three through a multi-stage matching cascade (exact → normalised → synonym → fuzzy), documents every decision so the choices are auditable, and produces aligned data–tree pairs ready for phylogenetic generalised least squares (PGLS), phylogenetic mixed models (PGLMMs), or any other PCM.
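To make the idea of a matching cascade concrete, here is a minimal base-R sketch. It is not prepR4pcm’s implementation: the helper names (normalise, match_cascade), the placeholder species names, the synonym table, and the edit-distance cutoff max_dist are all assumptions made up for illustration. Each stage only looks at names that earlier stages left unresolved, and the stage that produced each match is recorded.

```r
# Lower-case, trim, and treat underscores and runs of spaces as one space.
normalise <- function(x) gsub("[_ ]+", " ", trimws(tolower(x)))

# Hypothetical cascade: exact -> normalised -> synonym -> fuzzy.
match_cascade <- function(data_names, tip_labels,
                          synonyms = character(), max_dist = 2) {
  out <- data.frame(name = data_names, tip = NA_character_,
                    stage = "unresolved", stringsAsFactors = FALSE)
  norm_data <- normalise(data_names)
  norm_tips <- normalise(tip_labels)

  # Record a match only for names that no earlier stage has resolved.
  assign_stage <- function(out, idx, stage) {
    hit <- is.na(out$tip) & !is.na(idx)
    out$tip[hit] <- tip_labels[idx[hit]]
    out$stage[hit] <- stage
    out
  }

  out <- assign_stage(out, match(data_names, tip_labels), "exact")
  out <- assign_stage(out, match(norm_data, norm_tips), "normalised")

  # Synonym stage: translate through the lookup table, then compare.
  accepted <- unname(synonyms[norm_data])   # NA when not in the table
  out <- assign_stage(out, match(normalise(accepted), norm_tips), "synonym")

  # Fuzzy stage: nearest tip by edit distance, accepted only if close enough.
  todo <- which(is.na(out$tip))
  if (length(todo) > 0) {
    d <- adist(norm_data[todo], norm_tips)
    best <- apply(d, 1, which.min)
    close_enough <- d[cbind(seq_along(todo), best)] <= max_dist
    fuzzy_idx <- rep(NA_integer_, length(data_names))
    fuzzy_idx[todo[close_enough]] <- best[close_enough]
    out <- assign_stage(out, fuzzy_idx, "fuzzy")
  }
  out
}

# Placeholder names, not real taxa.
tip_labels <- c("Aus_bus", "Aus_cus", "Bus_dus", "Cus_eus")
data_names <- c("Aus_bus", "aus cus", "Bus oldus", "Cus euss", "Xylo phonus")
synonyms   <- c("bus oldus" = "Bus dus")   # normalised synonym -> accepted name

res <- match_cascade(data_names, tip_labels, synonyms)
print(res)
```

Keeping the stage alongside each match is what makes the result auditable: a fuzzy match can be reviewed more sceptically than an exact one.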

Below you’ll find instructions for package installation, a quick example, the typical workflow, vignettes covering realistic pipelines, citation information, and a list of bundled example datasets.

Installation

Install the CRAN release:

install.packages("prepR4pcm")

Install the development version from GitHub:

# install.packages("pak")
pak::pak("itchyshin/prepR4pcm")

Features

Typical workflow

Starting point: trait data + a phylogenetic tree. If you don’t yet have a tree, fetch one with pr_get_tree() (and optionally date it with pr_date_tree()) and continue from “Trait data + Phylogenetic tree” below; see the posterior-tree pipeline vignette for the full pattern.

The diagram below shows the steps. R objects and data files are in rounded boxes; prepR4pcm functions that act on them are on the arrows.

flowchart TD
  A(["<i>Trait data</i><br>+<br><i>Phylogenetic tree</i>"])
  B(["<i>reconciliation</i>"])
  R["<b>Review</b><br>reconcile_summary()<br>reconcile_plot()<br>reconcile_report()<br><br><b>Fix (if needed)</b><br>reconcile_override()<br>reconcile_suggest()"]
  C(["<i>Aligned data</i><br>+<br><i>Pruned tree</i>"])
  D[/PGLS, PGLMM, or any PCM/]

  A -- "reconcile_tree()" --> B
  B --> R
  R -- "reconcile_apply()" --> C
  C --> D

  classDef obj fill:#e8f4f8,stroke:#2c5e4f,stroke-width:2px
  classDef inspect fill:#fffbe6,stroke:#a67c00,stroke-width:1.5px
  classDef out fill:#fff4e8,stroke:#888,stroke-width:1.5px
  class A,B,C obj
  class R inspect
  class D out

The first reconciliation pass produces a reconciliation object (an audit of every name match). You then review and fix; once you’re happy, reconcile_apply() produces the aligned dataset and pruned tree that have matching species lists — the precondition for any phylogenetic comparative method.

Quick example

This example reconciles avonet_subset (919 species rows from AVONET, a global bird-trait database; Tobias et al. 2022) against tree_jetz (657 tips from the Jetz et al. 2012 bird phylogeny). It produces an aligned data frame and a pruned tree that hold the same species, in matched order, ready for a PGLS or phylogenetic mixed model.

library(prepR4pcm)
library(ape)

# Reconcile a dataset against a phylogenetic tree
rec <- reconcile_tree(
  x         = avonet_subset,
  tree      = tree_jetz,
  x_species = "Species1",
  fuzzy     = TRUE,
  resolve   = "flag"
)
#> ℹ Reconciling 919 data names vs 657 tree tips
#> ℹ Matching 919 x 657 names through 4 stages...
#> ℹ Stage 1/4: Exact matching...
#> ℹ Stage 2/4: Normalised matching (0 matched so far)...
#> ℹ Stage 3/4: Synonym resolution (657 matched so far)...
#> ℹ Stage 4/4: Fuzzy matching (657 matched so far)...
#> ✔ Matched 657/919 data names to tree tips
rec
#> 
#> ── Reconciliation: data vs tree ────────────────────────────────────────────────
#>   Source x: avonet_subset
#>   Source y: phylo (657 tips)
#>   Authority: col
#>   Timestamp: 2026-06-16 10:00:21
#> ℹ Match coverage: [█████████████████████░░░░░░░░░] 71% (657/919)
#> 
#> ── Match summary ──
#> 
#> • Exact: 0 ( 0.0%)
#> • Normalized: 657 (71.5%)
#> • Synonym: 0 ( 0.0%)
#> • Fuzzy: 0 ( 0.0%)
#> • Manual: 0 ( 0.0%)
#> ! Unresolved (x only):262 (28.5%)
#> ! Unresolved (y only):0
#> ! Flagged for review: 0
#> ℹ Use `reconcile_summary()` for details, `reconcile_mapping()` for the full table.

# Apply the reconciliation: aligned data + pruned tree
aligned <- reconcile_apply(rec, data = avonet_subset, tree = tree_jetz,
                           species_col = "Species1", drop_unresolved = TRUE)
#> ! Dropped 262 rows with unresolved species from data
#> ℹ Tree has 657 tips after alignment

# Confirm the two sides hold the SAME species (not just the same count)
data_sp <- aligned$data$Species1
tree_sp <- aligned$tree$tip.label
length(intersect(data_sp, tree_sp))   # how many species are in both
#> [1] 657
length(setdiff(data_sp, tree_sp))     # in data but not tree (should be 0)
#> [1] 0
length(setdiff(tree_sp, data_sp))     # in tree but not data (should be 0)
#> [1] 0

What just happened: reconcile_tree() compared every species name in avonet_subset$Species1 against the tip labels of tree_jetz, trying exact matches first and falling back through normalised, synonym, and fuzzy matches as needed. The printed rec object shows the count in each match category; here all 657 matches came from the normalised stage, and 262 data names stayed unresolved. reconcile_apply() then takes that reconciliation and produces (a) a data frame with rows restricted to species that resolved to a tree tip, and (b) the tree pruned to those tips. The intersect() / setdiff() calls above confirm that the data’s species names and the tree’s tip labels are identical sets, not just equal counts, which is the actual precondition for any downstream PGLS or PGLMM call.
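The set checks above do not look at row order, and some modelling functions expect data rows in the same order as the tree’s tip labels. The sketch below shows a defensive base-R check and reorder. The toy objects stand in for aligned$data and aligned$tree$tip.label, the species names and mass values are placeholders, and the code assumes one row per species.

```r
# Toy stand-ins for aligned$tree$tip.label and aligned$data (placeholder values).
tip_labels <- c("Aus_bus", "Aus_cus", "Bus_dus")
traits <- data.frame(
  Species1 = c("Bus_dus", "Aus_bus", "Aus_cus"),
  mass     = c(3.2, 1.1, 2.5),
  stringsAsFactors = FALSE
)

# Same species on both sides, and no species repeated in the data.
stopifnot(setequal(traits$Species1, tip_labels),
          !anyDuplicated(traits$Species1))

# Reorder the rows so that row i corresponds to tip i.
traits_ordered <- traits[match(tip_labels, traits$Species1), , drop = FALSE]
rownames(traits_ordered) <- traits_ordered$Species1

# TRUE when data order and tip order agree.
identical(traits_ordered$Species1, tip_labels)
```

If the data are already in tip order, the match() call leaves them unchanged, so the check is cheap to keep in a script.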

Quick example — fetching a tree

If you don’t already have a tree, fetch one. The snippet below pulls a 50-tree posterior of fish chronograms from the Fish Tree of Life (Rabosky et al. 2018) and asks pr_cite_tree() to format the citations for your methods section:

trees <- pr_get_tree(
  c("Salmo salar", "Esox lucius", "Oncorhynchus mykiss"),
  source = "fishtree",
  n_tree = 50
)
class(trees$tree)              # "multiPhylo"
length(trees$tree)             # 50

# Citations for the methods section
cat(pr_cite_tree(trees, format = "markdown"))

Each backend has its own coverage and quirks; the comparing tree backends vignette summarises which one to pick for a given taxon and what “n_tree > 1” returns in each case.
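One common way to carry tree uncertainty into a result is to fit the same model once per posterior tree and then pool the per-tree estimates with Rubin’s rules. The sketch below shows only the pooling step in base R. The function name pool_across_trees is hypothetical, the three estimates and standard errors are made-up placeholders rather than results, and this is not necessarily the approach the posterior-tree pipeline vignette takes.

```r
# Pool one coefficient across m trees (Rubin's rules).
# est: per-tree point estimates; se: per-tree standard errors.
pool_across_trees <- function(est, se) {
  stopifnot(length(est) == length(se), length(est) > 1)
  m       <- length(est)
  within  <- mean(se^2)                       # average sampling variance
  between <- var(est)                         # variance due to tree choice
  total   <- within + (1 + 1 / m) * between   # total variance
  c(estimate = mean(est), se = sqrt(total))
}

# Placeholder inputs standing in for three per-tree model fits.
est <- c(0.40, 0.50, 0.60)
se  <- c(0.10, 0.10, 0.10)

pooled <- pool_across_trees(est, se)
print(pooled)
```

The pooled standard error is larger than any single-tree standard error whenever the estimates differ between trees, which is how the tree-choice uncertainty shows up in the final result.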

Vignettes

Citation

If you use prepR4pcm in your research, please cite the package and the original publication for any bundled example dataset you used (see Bundled data sources below).

For the package itself:

Nakagawa S, Ortega S, Mizuno A, Santos E, Lagisz M, Jain B, Celeste J, Poo Hernandez S (2026). prepR4pcm: Prepare Data and Trees for Phylogenetic Comparative Methods. R package version 1.0.0. https://github.com/itchyshin/prepR4pcm

BibTeX:

@Manual{,
  title  = {prepR4pcm: Prepare Data and Trees for Phylogenetic Comparative Methods},
  author = {Shinichi Nakagawa and Santiago Ortega and Ayumi Mizuno and
            Eduardo S.A. Santos and Malgorzata Lagisz and Bhavya Jain and
            Jimuel Jr Celeste and Sergio {Poo Hernandez}},
  year   = {2026},
  note   = {R package version 1.0.0},
  url    = {https://github.com/itchyshin/prepR4pcm},
}

Or run in R to get the same entry programmatically:

citation("prepR4pcm")

If citation("prepR4pcm") warns “no package ‘prepR4pcm’ was found”, the package is either not installed or installed in a library R isn’t searching. Install the CRAN release with install.packages("prepR4pcm"), or install the development version with pak::pak("itchyshin/prepR4pcm"), then re-load (restart R if needed).
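To see which of the two cases applies, you can ask R where it looks for packages and whether it finds this one. The sketch uses only base R; pkg_location is a hypothetical helper name introduced here for illustration.

```r
# Return the install path of a package, or NA if R cannot find it
# in any library on the current search path.
pkg_location <- function(pkg) {
  path <- find.package(pkg, quiet = TRUE)
  if (length(path) == 0) NA_character_ else path[1]
}

# Libraries R is searching, in order.
print(.libPaths())

# NA here means the package is not in any of the libraries listed above.
print(pkg_location("prepR4pcm"))
```

If the result is NA but you know the package was installed, it probably went into a library that is not in .libPaths() for the current session.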

Key dependencies

Bundled data sources

The package contains small sample datasets — each is a subset (a few hundred rows or tips) of a larger published dataset, used only for the package’s examples, vignettes, and tests. They are not full versions: if you want to do science with these data, download the full original dataset from the source listed below. If you use any of these examples in published work, please cite the original provider.

Bird data (used by the bird-workflow vignette):

Mammal data (used by the mammal database-assembly vignette):

License

MIT
