The outlier methods used in the OutliersO3 package identify
outliers in different ways. The common key parameter is usually called
alpha (alpha.cutoff for
adjOutlyingness, tolProb for
DDC), although for some methods (FastPCS,
DDC, covMcd) it is input as
(1-alpha). OutliersO3 uses tol
arguments, short for tolerance level, to put all these on an equivalent
footing. The alphas (or (1-alpha)’s) are not directly comparable and so
they should be set individually when drawing O3 plots to compare
results. It is up to users to decide which value should be used for each
method.
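As a rough illustration of what putting the parameters "on an equivalent footing" could mean, the following base R sketch converts a common tolerance level into either alpha or (1-alpha), depending on the method. The helper tol_to_native and the grouping of methods are assumptions taken from the description above; this is not the package's internal code.

```r
# Hypothetical helper (not part of OutliersO3): translate a common
# tolerance level into the value a method would expect, following the
# description in the text.
tol_to_native <- function(tol, method) {
  stopifnot(is.numeric(tol), tol > 0, tol < 1)
  # Methods that, according to the text, take (1 - alpha) as input
  one_minus <- c("PCS", "DDC", "MCD")
  if (method %in% one_minus) 1 - tol else tol
}

# Illustrative tolerance levels, one per method
tols <- c(PCS = 0.01, BAC = 0.05, adjOut = 0.25, DDC = 0.01, MCD = 0.5)
native <- mapply(tol_to_native, tols, names(tols))
native
```

Even after such a conversion the values are not comparable across methods, which is why each tolerance level still has to be chosen separately.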
Figure 1 shows an example using four variables of German demographic data. When the default tolerance levels from the help pages of the individual methods were applied, some of the methods generated far too many outliers. A little experimentation with the tolerance levels was carried out to reduce these numbers to more manageable ones. (This new set of tolerance levels is the current default in O3prep.)
library(OutliersO3)
library(ggplot2)
data(Election2005)
data <- Election2005[, c(6, 10, 17, 28)]
O3d <- O3prep(data, method=c("PCS", "BAC", "adjOut", "DDC", "MCD"), tolPCS=0.5, tolBAC=0.05, toladj=0.25, tolDDC=0.01, tolMCD=0.5)
O3d1 <- O3plotM(O3d)
O3d1$nOut
## PCS BAC adjOut DDC MCD
## 293 92 46 4 256
## PCS BAC adjOut DDC MCD
## 10 5 1 4 18
Fig 1: An O3 plot displaying outliers found by any of the 5 methods
using individually specified
tol values. No outlier was
identified by all five methods for any combination of variables.
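The kind of experimentation with tolerance levels described above can be sketched with a much simpler rule than any of the five methods: classical Mahalanobis distances compared with a chi-squared cutoff. The simulated data and the grid of tolerance levels below are assumptions chosen only for illustration.

```r
set.seed(1)
# Simulated stand-in data: 200 bivariate normal cases plus 5 shifted cases
x <- rbind(matrix(rnorm(400), ncol = 2),
           matrix(rnorm(10, mean = 6), ncol = 2))

# Squared Mahalanobis distances from the classical mean and covariance
d2 <- mahalanobis(x, colMeans(x), cov(x))

# Count the cases beyond the chi-squared cutoff at each tolerance level
tols <- c(0.1, 0.05, 0.01, 0.001)
n_out <- sapply(tols, function(tol) sum(d2 > qchisq(1 - tol, df = ncol(x))))
names(n_out) <- tols
n_out
```

Because the cutoff grows as the tolerance level shrinks, the counts can only stay the same or fall, which is the behaviour exploited when reducing the number of outliers to a manageable size.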
The constituency labelled X84 is the Berlin-Friedrichshain-etc. constituency
mentioned in the first vignette. There are 5 variable combinations where
it is identified as an outlier by four of the methods. Inspecting the
table of outlier cases by variable combinations and methods,
outsTable, reveals that the DDC method is the
odd man out three times. In fact, it does not really identify any
outliers at all, as the 4 ascribed to it all come from the boxplot
outlier identification rule that is used for methods which cannot
identify outliers for single variables.
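A boxplot-style rule for a single variable might look like the sketch below. The helper boxplot_outliers and its multiplier coef are hypothetical; the limits that OutliersO3 actually applies may differ.

```r
# Hypothetical helper: flag values lying more than `coef` interquartile
# ranges outside the quartiles. `coef` is an illustrative multiplier, not
# necessarily the limit used by OutliersO3.
boxplot_outliers <- function(x, coef = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  which(x < q[1] - coef * iqr | x > q[2] + coef * iqr)
}

x <- c(1:9, 100)
boxplot_outliers(x)             # position of the one extreme value
boxplot_outliers(x, coef = 30)  # a much wider limit flags nothing
```

A rule of this kind depends only on the quartiles of one variable, so any outliers it reports say nothing about how the multivariate method itself behaves.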
It should not be surprising that the methods find different outliers, because they are all different and have different goals, although we might have expected them all to agree on the most extreme outliers. Perhaps the main surprise is that the differences are so great, at least for this dataset.
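Agreement between methods can be summarised by counting, for each case and variable combination, how many distinct methods flag it. The sketch below uses a small made-up table; its column names and entries are assumptions and do not reproduce the layout or contents of outsTable.

```r
# Made-up stand-in for a table of flagged cases; the column names and
# values are assumptions for illustration only.
flags <- data.frame(
  Combination = c("c1", "c1", "c1", "c1", "c2", "c2", "c3"),
  Case        = c("A", "A", "A", "A", "A", "B", "B"),
  Method      = c("PCS", "BAC", "adjOut", "MCD", "PCS", "DDC", "DDC")
)

# Number of distinct methods flagging each case within each combination
agree <- aggregate(Method ~ Combination + Case, data = flags,
                   FUN = function(m) length(unique(m)))
names(agree)[3] <- "nMethods"

# Show the strongest agreement first
agree[order(-agree$nMethods), ]
```

Rows with a count of 1 correspond to cases found by a single method only, and a method that is absent from the high-count rows is the "odd man out" for those combinations.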
The etymology dataset from languageR includes 285 Dutch verbs with 13 attributes. Excluding the four factors and three other attributes, potential outliers were identified using the five methods and the same individual tolerance levels as for the example shown in Figure 1.
data(etymology, package="languageR")
data <- etymology[, c(2, 4, 5, 10, 13, 14)]
O3q <- O3prep(data, method=c("PCS", "BAC", "adjOut", "DDC", "MCD"), tolPCS=0.005, tolBAC=0.005, toladj=0.1, tolDDC=0.01, tolMCD=0.000001)
O3q1 <- O3plotM(O3q)
O3q1$nOut
## PCS BAC adjOut DDC MCD
## 12 5 4 16 5
Many outliers were identified by only one method for only one
variable combination. Tightening the tolerance levels for the two
methods that found more than a few outliers (FastPCS and
DDC) and adding case names for labels led to Figure 2.
O3r <- O3prep(data, method=c("PCS", "BAC", "adjOut", "DDC", "MCD"), k1=2, tolPCS=0.0025, tolBAC=0.005, toladj=0.1, tolDDC=0.005, tolMCD=0.000001)
O3r1 <- O3plotM(O3r, caseNames=etymology$Verb)
O3r1$nOut
## PCS BAC adjOut DDC MCD
## 6 5 5 4 5
library(gridExtra)
grid.arrange(O3r1$gO3 + theme(plot.margin = unit(c(0, 1, 0, 0), "cm")), O3r1$gpcp, ncol=1, heights=c(2,1))
Fig 2: An O3 plot of the etymology dataset for combinations of 2 to 6 variables. There is little agreement amongst the methods and no agreement anywhere for all five methods together. The parallel coordinate plot does suggest two pairs of potential outliers on single variables, but judging by the O3 plot they would not be identified as outliers in six dimensions.
A full analysis would include a discussion of the verbs with an expert in Dutch linguistics. Statistical methods can be used to identify cases of possible interest and O3 plots and parallel coordinate plots can help indicate why cases are outliers, but subject matter knowledge is essential.