---
title: "Outlier Exclusion Vignette"
author: "Olink DS team"
date: "`r Sys.Date()`"
output: 
  html_vignette:
    toc: true
    toc_depth: 3
    includes:
      in_header: ../man/figures/logo.html
vignette: >
  %\VignetteIndexEntry{Outlier Exclusion Vignette}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE, echo = FALSE, eval = TRUE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  tidy = FALSE,
  tidy.opts = list(width.cutoff = 95),
  fig.width = 6,
  fig.height = 3,
  message = FALSE,
  warning = FALSE,
  time_it = TRUE,
  fig.align = "center"
)
knitr::opts_chunk$set(fig.pos = "H")
knitr::opts_chunk$set(fig.align = "center")
knitr::opts_knit$set(eval.after = "fig.cap")
```

## Introduction

```{r Outlier_data, include = FALSE, error = TRUE, message = FALSE, warning = FALSE}
# Use provided example dataset (npx_data1)
npx_data1 <- OlinkAnalyze::npx_data1

# NPX file preprocessing

# Generate check log
check_log_npx_data1 <- OlinkAnalyze::check_npx(
  df = npx_data1
)

# Clean NPX
npx_data1_clean <- OlinkAnalyze::clean_npx(
  df = npx_data1,
  check_log = check_log_npx_data1
)

# Generate check log on cleaned data
check_log_npx_data1_clean <- OlinkAnalyze::check_npx(
  df = npx_data1_clean
)

# Create Dataset with outliers
outlier_data <- npx_data1_clean |>
  dplyr::mutate(
    NPX = dplyr::if_else(
      .data[["SampleID"]] == "A25",
      .data[["NPX"]] + 4,
      .data[["NPX"]]
    )
  ) |>
  dplyr::mutate(
    NPX = dplyr::if_else(
      .data[["SampleID"]] == "A52",
      .data[["NPX"]] - 4,
      .data[["NPX"]]
    )
  )

group_data <- npx_data1_clean |>
  dplyr::mutate(
    NPX = dplyr::if_else(
      .data[["Site"]] == "Site_D",
      .data[["NPX"]] + 3,
      .data[["NPX"]]
    )
  )

# Clean up environment
rm(check_log_npx_data1, npx_data1)
```

This vignette describes how to use Olink® Analyze to evaluate a dataset for the
presence of outliers. When performing statistical analyses, it is important to
establish the presence of any outlier samples in the data prior to statistical
analysis. There are many reasons why a sample might be an outlier, it could be
due to a data entry or measurement error (i.e. labeling a control sample as a
disease sample), sampling problems or unusual conditions (i.e. contamination),
or natural variation. In many parametric statistics tests the mean of each group
is used to determine differences between groups. Since the mean is highly
sensitive to outliers, it is important to examine datasets for outliers prior to
analysis as these outliers could have a large influence on the statistical
results.

In Olink Analyze, there are three visualization functions that can be used to
identify potential outlier samples. In this vignette, you will learn how to use
`olink_pca_plot()`, `olink_dist_plot()` , and `olink_qc_plot()` to identify and
remove outliers from a dataset.

For demonstration purposes two outlier datasets have been generated to
demonstrate large outlier effects by adjusting the NPX values for specific
Samples or groups.

```{r dataset_generation, eval = FALSE, message = FALSE, warning = FALSE}
# Use provided example dataset (npx_data1)
npx_data1 <- OlinkAnalyze::npx_data1

# NPX file preprocessing

# Generate check log
check_log_npx_data1 <- OlinkAnalyze::check_npx(
  df = npx_data1
)

# Clean NPX
npx_data1_clean <- OlinkAnalyze::clean_npx(
  df = npx_data1,
  check_log = check_log_npx_data1
)

# Generate check log on cleaned data
check_log_npx_data1_clean <- OlinkAnalyze::check_npx(
  df = npx_data1_clean
)

# Create Dataset with outliers
outlier_data <- npx_data1_clean |>
  dplyr::mutate(
    NPX = dplyr::if_else(
      .data[["SampleID"]] == "A25",
      .data[["NPX"]] + 4,
      .data[["NPX"]]
    )
  ) |>
  dplyr::mutate(
    NPX = dplyr::if_else(
      .data[["SampleID"]] == "A52",
      .data[["NPX"]] - 4,
      .data[["NPX"]]
    )
  )

group_data <- npx_data1_clean |>
  dplyr::mutate(
    NPX = dplyr::if_else(
      .data[["Site"]] == "Site_D",
      .data[["NPX"]] + 3,
      .data[["NPX"]]
    )
  )
```

## PCA Plot

### Overview of PCA Plot

Principal Component Analysis (PCA) is a dimensional reduction technique. PCA
plots can be particularly useful to visualize variability within high
dimensional data by plotting the data along the axes of greatest variation. The
first two principal components make up the largest axes of variance between
samples. In `olink_pca_plot()` samples are clustered together based on
similarities in overall expression patterns of the measured proteins. These
samples can be colored by any categorical variable to determine which biological
factors may be contributing to global effects across the samples. PCA plots can
be used to:

-   Identify individual outlier samples (**Figure 1A**)

-   Identify groups of outliers (**Figure 1B**)

-   Identify batch effects (see Bridging Vignette)

Regardless of what the PCA is being used for, PCA plots are great tools to get
an overview of the data prior to analysis and pick up on any global trends.
These samples are not necessarily outliers but might be indicative of a global
difference between groups.

```{r Outlier_example_code, eval = FALSE, fig.show='hide'}
p1 <- OlinkAnalyze::olink_pca_plot(
  df = outlier_data,
  label_samples = TRUE,
  quiet = TRUE,
  check_log = check_log_npx_data1_clean
)

p2 <- OlinkAnalyze::olink_pca_plot(
  df = group_data,
  color_g = "Site",
  quiet = TRUE,
  check_log = check_log_npx_data1_clean
)

ggpubr::ggarrange(
  p1[[1L]],
  p2[[1L]],
  nrow = 1L,
  labels = "AUTO"
)
```

```{r Outlier_Example, echo = FALSE, fig.cap = fcap}
knitr::include_graphics(
  path = normalizePath(
    path = "../man/figures/PCA_Outlier_Fig1.png"
  ),
  error = FALSE
)

fcap <- paste("**Figure 1** **A.** PCA can be used to identify individual",
              "outlier samples as shown by samples A25 and A52. **B.** PCA can",
              "be used to identify difference in groups as seen by Site_D",
              "samples. This is not suggesting Site_D is an outlier, but",
              "rather that there may be a global difference between sites.")
```

### How to use PCA plot to identify an outlier

As shown above, PCA plots can be used to identify global differences in specific
samples or sets of samples. Often times these samples can be attributed to
natural variations, but sometimes these samples are outliers due to technical or
sample specific issues. To determine if a sample is a true outlier and should be
excluded, consider the following:

-   Did the sample pass QC? - Sometimes samples that are outliers and have a QC
warnings may indicate sample or technical issues. For example, a buffer sample
will often flag on QC and be plotted as an outlier as there are no proteins in
the sample.

-   How far is the sample from other samples? - Samples variability within a
specific group may be larger or there may be global variables within a group. In
this case it is important to consider the sample within the context of the
project.

-   Does the sample appear as an outlier by other plots? - If a sample is an
outlier in the PCA, NPX distribution, and QC plots, that this might indicate a
true outlier.

-   Is the sample an outlier on all panels? - Some samples may perform better on
specific samples. In the next section we will explain how to view samples by
panel.

PCA plots can be generated using `olink_pca_plot()` and specifying the color for
each sample using the `color_g` argument. By default the samples will be colored
by QC_Warning. Prior to generating the PCA plot, the duplicate SampleIDs (often
Control samples) will need to be renamed or filtered out.

```{r PCA_treatment, eval = FALSE}
OlinkAnalyze::olink_pca_plot(
  df = npx_data1_clean,
  color_g = "Treatment",
  check_log = check_log_npx_data1_clean
)
```

```{r PCA_treatment_fig, echo = FALSE}
knitr::include_graphics(
  path = normalizePath(
    path = "../man/figures/PCA_Treatment.png"
  ),
  error = FALSE
)
```

In this dataset there do not appear to be any outliers in the PCA plot.

### PCA plot by panel

When multiple panels are run, there is a chance a sample may be an outlier on
one panel. To get a global view of the samples per Panel, the `byPanel` argument
can be specified.

```{r PCA_Panel, eval = FALSE}
# Use provided example dataset (npx_data2)

npx_data2 <- OlinkAnalyze::npx_data2

# NPX file preprocessing

# Generate check log
check_log_npx_data2 <- OlinkAnalyze::check_npx(
  df = npx_data2
)

# Clean NPX
npx_data2_clean <- OlinkAnalyze::clean_npx(
  df = npx_data2,
  check_log = check_log_npx_data2
)

# Generate check log on cleaned data
check_log_npx_data2_clean <- OlinkAnalyze::check_npx(
  df = npx_data2_clean
)

# Specify by panel
OlinkAnalyze::olink_pca_plot(
  df = npx_data2_clean,
  byPanel = TRUE,
  check_log = check_log_npx_data2_clean
)
```

```{r PCA_Panel_fig, echo = FALSE}
knitr::include_graphics(
  path = normalizePath(
    path = "../man/figures/PCA_Panel.png"
  ),
  error = FALSE
)
```

The PCA plots will be saved in a list of ggplot objects and each can be viewed
individually using `[[n]]` syntax as shown below.

```{r PCA_object, eval = FALSE}
# Save the PCA plot to a variable
# By panel
# quiet argument suppresses export
pca_plots <- OlinkAnalyze::olink_pca_plot(
  df =  npx_data2_clean,
  byPanel = TRUE,
  quiet = TRUE,
  check_log = check_log_npx_data2_clean
)

pca_plots[[1L]] # Cardiometabolic PCA
pca_plots[[2L]] # Inflammation PCA
```

The plots above show an example where there are not any clear outliers in the
PCA. For the purposes of this vignette, we have also generated data where 2
samples appear as outliers.

```{r Outlier_PCA, echo = FALSE}
knitr::include_graphics(
  path = normalizePath(
    path = "../man/figures/Outlier_PCA.png"
  ),
  error = FALSE
)
```

However from this plot, we can not identify which two samples are outliers. In
the next section we will go over how to label outliers in the plot.

### Labeling outliers

There are two ways to label the samples in the PCA plot. The first way is to use
the `label_samples` argument. This argument will label all samples by replacing
the dot with the SampleID.

```{r, eval = FALSE}
OlinkAnalyze::olink_pca_plot(
  df = outlier_data,
  label_samples = TRUE,
  check_log = check_log_npx_data1_clean
)
```

```{r Label_samples, echo = FALSE}
knitr::include_graphics(
  path = normalizePath(
    path = "../man/figures/label_samples_pca.png"
  ),
  error = FALSE
)
```

Here we can see that samples A25 and A52 appear as outliers. This method can be
useful when there is clear separation in samples. However it can be difficult to
identify additional trends when all samples are labeled. In this case the
samples must also be visually identified as opposed to programmatically
extracted from the plot. For a cleaner plot with only the outliers labeled, we
can use `outlierDefX`, `outlierDefY`, `outlierLines`, and `label_outliers`.
These arguments will plot lines at a number of standard deviations from the mean
of the plotted PC and label the samples outside of these lines. The values of
`outlierDefX` and `outlierDefY` will need to be generated by the users and may
require some manual adjustment to determine the correct number to highlight the
outliers.

```{r, eval = FALSE}
if (requireNamespace(package = "ggrepel", quietly = TRUE)) {
  OlinkAnalyze::olink_pca_plot(
    df = outlier_data,
    outlierDefX = 3L,
    outlierDefY = 3L,
    outlierLines = TRUE,
    label_outliers = TRUE,
    check_log = check_log_npx_data1_clean
  )
}
```

```{r outlier_line_pca, echo = FALSE}
knitr::include_graphics(
  path = normalizePath(
    path = "../man/figures/outlier_line_pca.png"
  ),
  error = FALSE
)
```

To remove the lines and just keep the outliers labelled, `outlierLines` can be
set to False.

```{r, eval = FALSE}
if (requireNamespace(package = "ggrepel", quietly = TRUE)) {
  OlinkAnalyze::olink_pca_plot(
    df = outlier_data,
    outlierDefX = 3L,
    outlierDefY = 3L,
    outlierLines = FALSE,
    label_outliers = TRUE,
    check_log = check_log_npx_data1_clean
  )
}
```

Once the correct outliers have been highlighted in the graph, we can then
programmatically extract the outlier SampleIDs using *dplyr*.

```{r}
if (requireNamespace(package = "ggrepel", quietly = TRUE)) {
  outliers_pca_labeled <-
    OlinkAnalyze::olink_pca_plot(
      df = outlier_data,
      outlierDefX = 3L,
      outlierDefY = 3L,
      outlierLines = FALSE,
      label_outliers = TRUE,
      quiet = TRUE,
      check_log = check_log_npx_data1_clean
    )

  outliers_pca_labeled[[1L]]$data |>
    dplyr::filter(
      .data[["Outlier"]] == TRUE
    ) |>
    dplyr::select(
      dplyr::all_of("SampleID")
    ) |>
    dplyr::distinct()
}
```

## NPX Distribution Plot

### Overview of NPX Distribution Plot

NPX distribution plots generated by `olink_dist_plot()` consist of box and 
whisker plots of NPX distribution for each sample. These boxplots can be used to
determine if a sample has an unusually large or small distribution or if a
samples NPX distribution is shifted compared to other samples in the study. When
used in combination with PCA and QC plots, NPX distribution plots can give an
additional dimension of data to help identify outlier samples or global trends
within groups of samples.

### How to use NPX distribution plot to identify an outlier

In NPX distribution plot an outlier may show one of the following
characteristics:

-   A larger NPX distribution than other samples suggesting that the sample may
contain a global difference in proteins, potentially indicating a different
sample phenotype or technical error. 
-   A smaller NPX distribution than other samples suggesting that the sample may
contain less protein than other samples, often seen when running buffer samples.
-   A shift in NPX distribution in one or more samples suggesting a difference
in overall protein concentration, a difference in sample collection methodology,
or a batch effect.

If we look at a subset of the outlier data from the PCA example, we can see that
A25 and A52 have shifted NPX distributions, which in combination with the PCA
plot, suggest that these samples may be potential outliers. If there are many
samples in a project, it can be helpful to look at a subset of the samples at a
time as shown below.

```{r, eval = FALSE}
outlier_data_dist <- outlier_data |>
  dplyr::filter(
    .data[["SampleID"]] %in% c("A25", "A52", "A1", "A2", "A3", "A5",
                               "A15", "A16", "A18", "A19", "A20")
  )

OlinkAnalyze::olink_dist_plot(
  df = outlier_data_dist,
  check_log = check_log_npx_data1_clean
)
```

```{r, echo = FALSE}
knitr::include_graphics(
  path = normalizePath(
    path = "../man/figures/dist_boxplot.png"
  ),
  error = FALSE
)
```

NPX distribution plots can also be useful in identifying group trends, such as a
difference in site D as shown in the figure below. The color of the bars can be
altered using the `color_g` argument.

```{r, eval = FALSE}
group_data_dist <- group_data |>
  # Only visualizing 2 sites to see all samples
  dplyr::filter(
    .data[["Site"]] %in% c("Site_A", "Site_D")
  )

OlinkAnalyze::olink_dist_plot(
  df = group_data_dist,
  color_g = "Site",
  check_log = check_log_npx_data1_clean
)
```

```{r, echo = FALSE}
knitr::include_graphics(
  path = normalizePath(
    path = "../man/figures/site_boxplot.png"
  ),
  error = FALSE
)
```

In this case, the shift in NPX distribution could be biologically meaningful or
could indicate a sample or technical issue. There are several biological cases
where the total protein concentration in a sample may be larger in one group
than another. However, in some cases the difference in protein concentration can
also lead to skewed results, in which case it may be helpful to normalize the
data for the changes in protein concentration by performing a median adjustment
as shown below.

```{r, eval = FALSE}
# Calculate SampleID Median NPX
median_npx <- group_data |>
  dplyr::group_by(
    .data[["SampleID"]]
  ) |>
  dplyr::summarise(
    Median_NPX = median(.data[["NPX"]])
  ) |>
  dplyr::ungroup()

# Adjust by sample median
adjusted_data <- group_data |>
  dplyr::inner_join(
    median_npx,
    by = "SampleID"
  ) |>
  dplyr::mutate(
    NPX = .data[["NPX"]] - .data[["Median_NPX"]]
  )

adjusted_data_dist <- adjusted_data |>
  # Only visualizing 2 sites to see all samples
  dplyr::filter(
    .data[["Site"]] %in% c("Site_A", "Site_D")
  )

OlinkAnalyze::olink_dist_plot(
  df = adjusted_data_dist,
  color_g = "Site",
  check_log = check_log_npx_data1_clean
)
```

```{r, echo=FALSE}
knitr::include_graphics(
  path = normalizePath(
    path = "../man/figures/sample_med_boxplot.png"
  ),
  error = FALSE
)
```

## QC Plot (IQR vs Sample Median)

### Overview of QC Plot

Often there are too many samples to identify outliers using `olink_dist_plot()`.
In this case, `olink_qc_plot()` offers an alternative way to visualize the NPX
distribution per sample. In this plot, the sample median and sample
interquartile range (IQR) are plotted to visualize where the sample is centered
and how much variability is within the sample.

### How to use QC plot to identify an outlier

`olink_qc_plot()` contains similar features to `olink_pca_plot()` and can be
used in a similar way. These samples can be colored by any categorical variable
to determine which biological factors may be contributing to global effects
across the samples or individual sample outlier. QC plots can be used to
identify individual outlier samples or identify groups of outliers. In the QC
plot an outlier may show one or more of the following characteristics:

-   A larger IQR than other samples suggesting that the sample may contain a
global difference in proteins, potentially indicating a different sample
phenotype or technical error. 
-   A smaller IQR than other samples suggesting that the sample may contain less
protein than other samples, often seen when running buffer samples.
-   A shift in sample median in one or more samples suggesting a difference in
overall protein concentration, a difference in sample collection methodology, or
a batch effect.

Using the outlier data from the previous plots, we can see that sample A52 has a
lower sample median and sample A25 has a higher sample median in both panels.
Sample A48 has the highest IQR of all samples in the Inflammation panel.
However, we can see many other samples approaching the line threshold line
(default of 3 standard deviation from the mean IQR or sample median), suggesting
that this sample may not be a true outlier.

```{r, eval = FALSE}
if (requireNamespace(package = "ggrepel", quietly = TRUE)) {
  OlinkAnalyze::olink_qc_plot(
    df = outlier_data,
    label_outliers = TRUE,
    check_log = check_log_npx_data1_clean
  )
}
```

```{r, echo=FALSE}
knitr::include_graphics(
  path = normalizePath(
    path = "../man/figures/qc_plot.png"
  ),
  error = FALSE
)
```

With a standard deviation of 3 and assuming a normal distribution of samples, it
is expected that over 99% of the data will lie within 3 standard deviations from
the mean, however depending on the groups and samples within the study, there
may be global shifts which will result in one or more samples outside of the 3
SD line. These lines should be used as guidelines and not strict thresholds of
outliers.

Similar to what was shown in the NPX distribution plot example, the QC plot can
also be used to see changes in specific groups. In the example below, the
samples are colored by site and Site D appears to have samples with higher
sample median. 

```{r, eval = FALSE}
if (requireNamespace(package = "ggrepel", quietly = TRUE)) {
  OlinkAnalyze::olink_qc_plot(
    df = group_data,
    color_g = "Site",
    label_outliers = TRUE,
    check_log = check_log_npx_data1_clean
  )
}
```

```{r echo = FALSE}
knitr::include_graphics(
  path = normalizePath(
    path = "../man/figures/qc_site_plot.png"
  ),
  error = FALSE
)
```

### Labeling outliers 

To change the threshold and outliers that are labeled, we can use
`median_outlierDef`, `IQR_outlierDef`, `outlierLines`, and `label_outliers`.
These arguments will plot lines at a number of standard deviations from the mean
of the sample median or IQR and label the samples outside of these lines. The
values of `median_outlierDef` and `IQR_outlierDef` default to 3 and may require
some manual adjustment to determine the correct number to highlight the
outliers.

```{r, eval = FALSE}
if (requireNamespace(package = "ggrepel", quietly = TRUE)) {
  OlinkAnalyze::olink_qc_plot(
    df = outlier_data,
    median_outlierDef = 2L,
    IQR_outlierDef = 4L,
    outlierLines = TRUE,
    label_outliers = TRUE,
    check_log = check_log_npx_data1_clean
  )
}
```

```{r, echo = FALSE}
knitr::include_graphics(
  path = normalizePath(
    path = "../man/figures/qc_label_plot.png"
  ),
  error = FALSE
)
```

To remove the lines and just keep the outliers labelled `outlierLines` can be
set to False.

```{r, eval = FALSE}
if (requireNamespace(package = "ggrepel", quietly = TRUE)) {
  OlinkAnalyze::olink_qc_plot(
    df = outlier_data,
    median_outlierDef = 2L,
    IQR_outlierDef = 4L,
    outlierLines = FALSE,
    label_outliers = TRUE,
    check_log = check_log_npx_data1_clean
  )
}
```

Once the correct outliers have been highlighted in the graph, we can then
programmatically extract the outlier SampleIDs using *dplyr*.

```{r}
if (requireNamespace(package = "ggrepel", quietly = TRUE)) {
  outliers_qc_labeled <- OlinkAnalyze::olink_qc_plot(
    df = outlier_data,
    median_outlierDef = 2L,
    IQR_outlierDef = 4,
    outlierLines = FALSE,
    label_outliers = TRUE,
    check_log = check_log_npx_data1_clean
  )

  outliers_qc_labeled$data |>
    dplyr::filter(
      .data[["Outlier"]] == TRUE
    ) |>
    dplyr::select(
      dplyr::all_of("SampleID")
    ) |>
    dplyr::distinct()
}
```

## Interpreting and Handling Outliers

### When should outliers be removed

After reviewing these plots, a list of potential outliers may be generated.
Whether or not these outliers should be excluded depends on how different from
the other samples the potential outliers appear and if the outliers can be
traced to a biological, sample specific, or technical issue. An outlier could
also be specific to a particular panel, in which case the outlier either be
excluded from just the panel on which it is an outlier or excluded from all
panels.

In the case of a biological issue, it may be useful to keep the potential
outlier within the study as the sample has biological meaning, however
additional normalization may be needed. For example, in a diabetes study,
patients with severe diabetes may have higher protein in their urine as compared
to control subject. In this case it could be useful to normalize based on total
protein concentration or median NPX to set all values within the context of the
study.

Another example is in the case of multiple batches, one batch may appear
to cluster differently from another batch. In this case additional normalization
such as bridging is needed to bridge the studies to the same scale.

## Contact Us

We are always happy to help. Email us with any questions:

-   biostat\@olink.com for statistical services and general stats
    questions

-   support\@olink.com for Olink lab product and technical support

-   info\@olink.com for more information

## Legal Disclaimer

© `r format(Sys.Date(), "%Y")` Olink Proteomics AB, part of Thermo Fisher
Scientific.

Olink products and services are For Research Use Only. Not for use in diagnostic
procedures.

All information in this document is subject to change without notice. This
document is not intended to convey any warranties, representations and/or
recommendations of any kind, unless such warranties, representations and/or
recommendations are explicitly stated.

Olink assumes no liability arising from a prospective reader’s actions based on
this document.

OLINK, NPX, PEA, PROXIMITY EXTENSION, INSIGHT and the Olink logotype are
trademarks registered, or pending registration, by Olink Proteomics AB. All
third-party trademarks are the property of their respective owners.

Olink products and assay methods are covered by several patents and patent
applications [https://www.olink.com/patents/](https://olink.com/patents/).