Methods using a large number of parameters risk being overfit. This
usually translates in poor fitting with data and trees other than the
those originally used. With RRphylo methods this risk is usually very
low. However, the user can assess how robust the results got by applying
search.shift
, search.trend
, search.conv
, or
PGLS_fossil
are by running the overfit-
functions. The basic idea of overfit-
is using alternative
tree topologies to test for both phylogenetic and sampling uncertainty
at the same time. Such alternative phylogenies can be provided by the
user as a list
/multiPhylo
object, otherwise
can be generated through resampleTree
. In this
latter case, the original tree and data are subsampled by specifying a
s
parameter, that is the proportion of tips to be removed
from the tree, and species positions are shuffled by using the function
swapONE
. When
generating phylogenies to test the robustness of
search.shift
, search.trend
at individual
clades, and search.conv
, it is advisable to constrain
resampleTree
so that it preserves the identity of
clades/groups under testing. This is achieved by providing as
node
/categories
arguments of
resampleTree
the
node
/nodes
/state
arguments in the
main functions. Similarly, using the phylogenetic tree returned by
RRphylo
is recommended.
overfitRR
tests the robustness of the robustness of
ancestral state and rate estimates. It takes as input an object
generated by RRphylo
, a list of
alternative phylogenetic trees, and all the data used to produce it
(besides necessary phenotypic data, any other argument such as
covariate, predictor, and so on, passed to RRphylo
). The
output is a RRphyloList
object including: the list of
RRphylo
per simulation ($RR.list
), the
phenotypic estimate at the tree root per simulation
($root.est
), the 95% confidence interval around the
original phenotypic value at the tree root ($rootCI
) and
the regression parameters describing the relation between the original
values at internal nodes and the corresponding figure after subsampling
and swapping ($ace.regressions
). A regression slope close
to one indicates a better matching between original and subsampled
values, suggesting the estimation is robust to phylogenetic uncertainty
and subsampling.
Since the ancestral state estimation within RRphylo
is
unbounded, sometimes altering the tree topology leads to implausible
root estimate (n.b. setting the rootV
does not bound the
estimation, it does not solve this problem). In this case it is
advisable to discard such questionable outputs from
$RR.list
before running further analyses.
## overfitRR routine
# load the RRphylo example dataset including Ornithodirans tree and data
library(ape)
data("DataOrnithodirans")
DataOrnithodirans$treedino->treedino
log(DataOrnithodirans$massdino)->massdino
DataOrnithodirans$statedino->statedino
# extract Pterosaurs tree and data
extract.clade(treedino,746)->treeptero
massdino[match(treeptero$tip.label,names(massdino))]->massptero
massptero[match(treeptero$tip.label,names(massptero))]->massptero
# peform RRphylo on body mass
RRphylo(tree=treeptero,y=massptero,clus=cc)->RRptero
# generate a list of subsampled and swapped phylogenies to test
tree.list<-resampleTree(RRptero$tree,s = 0.25,swap.si = 0.1,swap.si2 = 0.1,nsim=10)
# test the robustness of RRphylo
ofRRptero<-overfitRR(RR = RRptero,y=massptero,phylo.list=tree.list,clus=cc)
## overfitRR routine on multiple RRphylo
# load the RRphylo example dataset including Cetaceans tree and data
data("DataCetaceans")
DataCetaceans$treecet->treecet
DataCetaceans$masscet->masscet
DataCetaceans$brainmasscet->brainmasscet
DataCetaceans$aceMyst->aceMyst
# cross-reference the phylogenetic tree and body and brain mass data. Remove from
# both the tree and vector of body sizes the species whose brain size is missing
drop.tip(treecet,treecet$tip.label[-match(names(brainmasscet),treecet$tip.label)])->treecet.multi
masscet[match(treecet.multi$tip.label,names(masscet))]->masscet.multi
# peform RRphylo on the variable (body mass) to be used as additional predictor
RRphylo(tree=treecet.multi,y=masscet.multi,clus=cc)->RRmass.multi
RRmass.multi$aces[,1]->acemass.multi
# create the predictor vector: retrieve the ancestral character estimates
# of body size at internal nodes from the RR object ($aces) and collate them
# to the vector of species' body sizes to create
c(acemass.multi,masscet.multi)->x1.mass
# peform RRphylo on brain mass by using body mass as additional predictor
RRphylo(tree=treecet.multi,y=brainmasscet,x1=x1.mass,clus=cc)->RRmulti
# generate a list of subsampled and swapped phylogenies to test
treecet.list<-resampleTree(RRmulti$tree,s = 0.25,swap.si=0.1,swap.si2=0.1,nsim=10)
# test the robustness of multiple RRphylo
ofRRcet<-overfitRR(RR = RRmulti,y=brainmasscet,phylo.list=treecet.list,clus=cc,x1 =x1.mass)
overfitSS
tests the robustness of
search.shift
. It takes as input an object generated by
RRphylo
(RR
), an object generated by
overfitRR
performed on RR
(oveRR
), and node
and/or state
as
passed to search.shift
(they can be provided at the same
time). The output is a RRphyloList
object including: the
results of search.shift
under clade condition
($SSclade.list
) and sparse condition
($SSsparse.list
) per simulation, and a
$shift.results
summary object. The latter element returns
separate results for clade
and sparse
conditions
($shift.results
). The first (clade)
includes the proportion of simulations producing significant and
positive (p.shift+) or significant and negative
(p.shift-) rate shifts for each single node (either
compared to the rest of the tree -
$single.clades$singles - or to the rest of the tree
after removing other shifting clades -
$single.clades$no.others), and for all the clades taken
as a whole ($all.clades.together - see Testing rate shifts pertaining to
entire clades for further details). Under the
sparse
condition (sparse), the same
figures as before are reported for each state category compared to the
rest of the tree and for all possible pair of categories (see Testing rate shifts pertaining to
phylogenetically unrelated species for further details).
# load the RRphylo example dataset including Ornithodirans tree and data
data("DataOrnithodirans")
DataOrnithodirans$treedino->treedino
log(DataOrnithodirans$massdino)->massdino
DataOrnithodirans$statedino->statedino
# peform RRphylo on Ornithodirans tree and data
RRphylo(tree=treedino,y=massdino,clus=cc)->dinoRates
# perform search.shift under both "clade" and "sparse" condition
search.shift(RR=dinoRates, status.type= "clade")->SSauto
search.shift(RR=dinoRates, status.type= "sparse", state=statedino)->SSstate
## overfitSS routine
# generate a list of subsampled and swapped phylogenies, setting as categories/node
# the state/node under testing
tree.list<-resampleTree(dinoRates$tree,s = 0.25,categories=statedino,
node=rownames(SSauto$single.clades),swap.si = 0.1,swap.si2 = 0.1,nsim=10)
# test the robustness of search.shift
ofRRdino<-overfitRR(RR = dinoRates,y=massdino,phylo.list=tree.list,clus=cc)
ofSS<-overfitSS(RR = dinoRates,oveRR = ofRRdino,state=statedino,
node=rownames(SSauto$single.clades))
overfitST
tests the robustness of
search.trend
. It takes as input an object generated by
RRphylo
(RR
), the phenotypic data
(y
), an object generated by overfitRR
performed on RR
(oveRR
), and the arguments
x1
, x1.residuals
, node
, and
cov
as passed to search.trend
. The output is a
RRphyloList
object including: the results of
search.trend
per simulation ($ST.list
) and a
$trend.results
summary object. Within the latter object,
results for the entire tree (tree) summarize the
proportion of simulations producing positive (slope+)
or negative (slope-) slopes significantly higher
(p.up) or lower (p.down) than BM
simulations for either phenotypes or rescaled rates versus time
regressions. Such evaluations is based on p.random only
(see Temporal trends on the entire
tree,for further details). When specific clades are under
testing, the same set of results as for the whole tree is returned for
each node (node). In this case, for phenotype versus
age regression through nodes, the proportion of significant and
positive/negative slopes
(slope+p.up,slope+p.down,slope-p.up,slope-p.down)
is accompanied by the same figures for the estimated marginal mean
differences (p.emm+ and p.emm-). As
for the temporal trend in absolute rates through node, the proportion of
significant and positive/negative estimated marginal means differences
(p.emm+ and p.emm-) and the same
figure for slope difference (p.slope+ and
p.slope-) are reported (see Temporal trends at clade
level). Finally when more than one node is tested, the
$trend.results
object also includes results for the
pairwise comparison between nodes.
library(ape)
## Case 1
# load the RRphylo example dataset including Ornithodirans tree and data
data("DataOrnithodirans")
DataOrnithodirans$treedino->treedino
log(DataOrnithodirans$massdino)->massdino
DataOrnithodirans$statedino->statedino
# extract Pterosaurs tree and data
extract.clade(treedino,746)->treeptero
massdino[match(treeptero$tip.label,names(massdino))]->massptero
massptero[match(treeptero$tip.label,names(massptero))]->massptero
# perform RRphylo and search.trend on body mass data
RRphylo(tree=treeptero,y=massptero,clus=cc)->RRptero
search.trend(RR=RRptero, y=massptero,node=143,clus=cc)->ST
## overfitST routine
# generate a list of subsampled and swapped phylogenies setting as node
# the clade under testing
treeptero.list<-resampleTree(RRptero$tree,s = 0.25,node=143,
swap.si = 0.1,swap.si2 = 0.1,nsim=10)
# test the robustness of search.trend
ofRRptero<-overfitRR(RR = RRptero,y=massptero,phylo.list=treeptero.list,clus=cc)
ofSTptero<-overfitST(RR=RRptero,y=massptero,oveRR=ofRRptero,node=143,clus=cc)
## Case 2
# load the RRphylo example dataset including Cetaceans tree and data
data("DataCetaceans")
DataCetaceans$treecet->treecet
DataCetaceans$masscet->masscet
DataCetaceans$brainmasscet->brainmasscet
# cross-reference the phylogenetic tree and body and brain mass data. Remove from
# both the tree and vector of body sizes the species whose brain size is missing
drop.tip(treecet,treecet$tip.label[-match(names(brainmasscet),
treecet$tip.label)])->treecet.multi
masscet[match(treecet.multi$tip.label,names(masscet))]->masscet.multi
# peform RRphylo on the variable (body mass) to be used as additional predictor
RRphylo(tree=treecet.multi,y=masscet.multi,clus=cc)->RRmass.multi
RRmass.multi$aces[,1]->acemass.multi
# create the predictor vector: retrieve the ancestral character estimates
# of body size at internal nodes from the RR object ($aces) and collate them
# to the vector of species' body sizes to create
c(acemass.multi,masscet.multi)->x1.mass
# peform RRphylo and search.trend on brain mass by using body mass as additional predictor
RRphylo(tree=treecet.multi,y=brainmasscet,x1=x1.mass,clus=cc)->RRmulti
search.trend(RR=RRmulti, y=brainmasscet,x1=x1.mass,clus=cc)->STcet
## overfitST routine
# generate a list of subsampled and swapped phylogenies to test
treecet.list<-resampleTree(RRmulti$tree,s = 0.25,swap.si=0.1,swap.si2=0.1,nsim=10)
# test the robustness of search.trend with and without x1.residuals
ofRRcet<-overfitRR(RR = RRmulti,y=brainmasscet,phylo.list=treecet.list,clus=cc,x1 =x1.mass)
ofSTcet1<-overfitST(RR=RRmulti,y=brainmasscet,oveRR=ofRRcet,x1 =x1.mass,clus=cc)
ofSTcet2<-overfitST(RR=RRmulti,y=brainmasscet,oveRR=ofRRcet,x1 =x1.mass,x1.residuals = TRUE,clus=cc)
overfitSC
is the last step of a short routine the user
should apply to test the robustness of search.conv
results
(see the examples below).
Generating a number of altered (i.e. subsampled and swapped)
phylogenies by using resampleTree
.
In case the multivariate shape data used as input to
search.conv
were reduced (e.g. by SVD), such data must be
matched with each of the subsampled-swapped generated tree and the
resulting dataset must be reduced as well.
The list of phylogenetic trees and multivariate shape data can be
fed to overfitSC
for testing.
Similarly to overfitRR
, overfitSC
returns
RRphyloList
objects including the outputs of
RRphylo
and, depending on the analysis performed, the
outputs of search.conv
under clade
or
state
conditions applied on the modified trees.
Results for robustness of search.conv
($conv.results
) include separate objects for convergence
between clades
or
between/within states
.
Under the first case (clade), the proportion of
simulations producing significant instance of convergence
(p.ang.bydist) or convergence and parallelism
(p.ang.conv) between selected clades are returned (see
Morphological convergence between
clades for further details). As for convergence between/within
discrete categories (state), overfitRR
returns the proportion of simulations producing significant instance of
convergence either accounting (p.ang.state.time) or not
accounting (p.ang.state) for the time intervening
between the tips in the focal state Morphological convergence
within/between categories for explanations).
library(Morpho)
## Testing search.conv
# load the RRphylo example dataset including Felids tree and data
data("DataFelids")
DataFelids$treefel->treefel
DataFelids$statefel->statefel
DataFelids$landfel->feldata
# perform data reduction via Procrustes superimposition (in this case)
procSym(feldata)->pcafel
pcafel$PCscores->PCscoresfel
# perform RRphylo on Felids tree and data
RRphylo(treefel,PCscoresfel)->RRfel
# search for morphologicl convergence between clades (automatic mode)
# and within the category
search.conv(RR=RRfel, y=PCscoresfel, min.dim=5, min.dist="time38")->sc1
search.conv(tree=treefel, y=PCscoresfel, state=statefel, declust=TRUE)->sc2
# select converging clades returned in sc1
felnods<-c(85,155)
## overfitSC routine
# generate a list of subsampled and swapped phylogenies to test for search.conv
# robustness. Use as reference tree the phylogeny returned by RRphylo.
# Set the nodes and the categories under testing as arguments of
# resampleTree so that it maintains no less than 5 species in each clade/state.
tree.list<-resampleTree(RRfel$tree,s=0.15,nodes=felnods,categories=statefel,
nsim=15,swap.si=0.1,swap.si2=0.1)
# match the original data with each subsampled-swapped phylogeny in tree.list
# and repeat data reduction
y.list<-lapply(tree.list,function(k){
treedataMatch(k,feldata)[[1]]->ynew
procSym(ynew)$PCscores
})
# test for robustness of search.conv results by overfitSC
oSC<-overfitSC(RR=RRfel,phylo.list=tree.list,y.list=y.list,
nodes = felnods,state=statefel)
overfitPGLS
tests the robustness of
PGLS_fossil
. Other than the regression formula
(modform
), it takes as input an object generated by
overfitRR
(if PGLS_fossil
should be performed
on trees rescaled according to RRphylo
rates) and/or a list
of alternative phylogenies (if no tree rescaling should be performed).
These arguments are not exclusive though, they can be provided at the
same time. The output includes separate objects for the analyses
performed on the original tree ($tree
) and on the tree
rescaled according to RRphylo
rates ($RR
).
library(phytools)
library(ape)
# generate fictional data to test the function
rtree(100)->tree
fastBM(tree)->resp
fastBM(tree,nsim=3)->resp.multi
fastBM(tree)->pred1
fastBM(tree)->pred2
data.frame(y1=resp,x2=pred1,x1=pred2)->dato
# perform RRphylo and PGLS_fossil with univariate/multivariate phenotypic data
PGLS_fossil(modform=y1~x1+x2,data=dato,tree=tree)->pgls_noRR
RRphylo(tree,resp,clus=cc)->RR
PGLS_fossil(modform=resp~pred1+pred2,RR=RR)->pgls_RR
PGLS_fossil(modform=y1~x1+x2,data=list(y1=resp.multi,x2=pred1,x1=pred2),tree=tree)->pgls2_noRR
RRphylo(tree,resp.multi,clus=cc)->RR2
PGLS_fossil(modform=resp.multi~pred1+pred2,tree=tree,RR=RR)->pgls2_RR
## overfitPGLS routine
# generate a list of subsampled and swapped phylogenies to test
tree.list<-resampleTree(RR$tree,s = 0.25,swap.si=0.1,swap.si2=0.1,nsim=10)
# test the robustnes of PGLS_fossil with univariate/multivariate phenotypic data
ofRR<-overfitRR(RR = RR,y=resp,phylo.list=tree.list,clus=cc)
ofPGLS<-overfitPGLS(oveRR = ofRR,phylo.list=tree.list,modform = y1~x1+x2,data=dato)
ofRR2<-overfitRR(RR = RR2,y=resp.multi,phylo.list=tree.list,clus=cc)
ofPGLS2<-overfitPGLS(oveRR = ofRR2,phylo.list=tree.list,modform = y1~x1+x2,
data=list(y1=resp.multi,x2=pred1,x1=pred2))