---
title: "RiskyCNV: A Prostate Cancer Case Study Using TCGA-PRAD Data"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{RiskyCNV: A Prostate Cancer Case Study Using TCGA-PRAD Data}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment  = "#>"
)
```

## Introduction

Copy Number Variation (CNV) is a genetic phenomenon in which the number
of copies of a specific DNA segment varies among individuals. CNVs can
take the form of duplications, deletions, or other structural changes
in DNA sequences and can extend to long genomic segments. CNVs play an
essential role in driving cancer development, and their characteristics
are critical for disease diagnosis, prognosis, and treatment (Freeman
et al., 2006).

CNVs are broadly classified into germline and somatic types. Germline
CNVs contribute to hereditary disorders and inherited predispositions
to cancer. Somatic CNVs occur due to aberrant cell division and often
relate directly to cancer progression. The study of somatic CNVs can
uncover driver genes involved in tumorigenesis or tumour suppression
and serve as biomarkers for predicting prognosis or treatment outcomes
(Xie & Tammi, 2009).

Prostate cancer is one of the most common malignancies in men worldwide,
with marked heterogeneity in progression and treatment outcomes. Indolent
(low-Gleason) cancers show minor genomic alterations, while aggressive
metastatic tumours demonstrate gross CNVs (Hieronymus et al., 2014).
Risk stratification — categorising patients by their likelihood of
harbouring aggressive disease — is central to clinical decision-making
and personalised treatment planning.

The `RiskyCNV` package addresses a critical gap in existing CNV tools.
While packages such as CNVizard, CINmetrics, GISTIC, and CopywriteR
offer visualisation, significant CNV identification, and chromosomal
instability quantification, none provide stratified class analysis,
particularly risk-class analysis (Krause et al., 2024; Oza et al., 2023;
Mermel et al., 2011; Kuilman et al., 2015). `RiskyCNV` was designed
specifically for class-stratified exploration and analysis of CNV omics
data, with the potential to deliver personalised insights regarding
disease course and risk stratification.

This vignette demonstrates the complete `RiskyCNV` workflow applied
to prostate adenocarcinoma (PRAD) data from The Cancer Genome Atlas
(TCGA), illustrating how CNV analysis integrated with clinical grading
and RNA expression data can identify biologically meaningful biomarkers.

```{r setup}
library(RiskyCNV)
```

---

## Dataset

The data used in this case study are derived from the TCGA PRAD dataset
(https://gdac.broadinstitute.org/runs/stddata__2016_01_28/data/PRAD/20160128/).
The dataset comprises:

- **Clinical data**: Sample IDs, Gleason scores, and histopathological
  pattern information for 929 prostate cancer samples
- **Somatic CNV data**: Segment-level copy number profiles including
  chromosome, start/end coordinates, number of probes, and segment mean
- **Gene annotation**: Reference genomic coordinates for gene symbol
  assignment
- **RNA expression data**: Transcriptomic profiles for correlation
  analysis

The grade-wise distribution of TCGA-PRAD samples is as follows:

| Grade | Gleason Score | Number of Samples |
|---|---|---|
| Grade 1 | ≤ 6 | 44 |
| Grade 2 | 3+4=7 | 146 |
| Grade 3 | 4+3=7 | 99 |
| Grade 4 | 8 | 63 |
| Grade 5 | 9-10 | 139 |
| Control | — | 438 |
| **Total** | | **929** |

```{r file_paths}
sample_file     <- system.file("extdata", "sample_data.csv",
                                package = "RiskyCNV")
cnv_file        <- system.file("extdata", "cnv_data.txt",
                                package = "RiskyCNV")
gene_file       <- system.file("extdata", "gene_annotation.csv",
                                package = "RiskyCNV")
annotated_file  <- system.file("extdata", "annotated_cnv.csv",
                                package = "RiskyCNV")
cnv_matrix_file <- system.file("extdata", "cnv_matrix.csv",
                                package = "RiskyCNV")
rna_file        <- system.file("extdata", "rna_data.csv",
                                package = "RiskyCNV")

# Preview the clinical sample data
head(read.csv(sample_file))
```

Each sample contains its total Gleason score, primary histological
pattern (pattern1), secondary histological pattern (pattern2), and
assigned grade label (Balraj & Muthamilselvan, 2024).

---

## Step 1: Classify Samples into Gleason Grade Groups

### Background

Gleason grading is the cornerstone of prostate cancer pathological
assessment. The 2014 ISUP Consensus Conference (Epstein et al., 2016)
formalised a five-tier Grade Group system:

- **Grade Group 1**: Gleason ≤ 6 — well differentiated, very low risk
- **Grade Group 2**: Gleason 3+4=7 — favourable intermediate risk
- **Grade Group 3**: Gleason 4+3=7 — unfavourable intermediate risk
- **Grade Group 4**: Gleason 8 — poorly differentiated, high risk
- **Grade Group 5**: Gleason 9-10 — anaplastic, very high risk

### Using the Prostate Cancer Preset

The `extract_metadata()` function implements the ISUP Grade Group system
directly when `disease_type = "prostate"` is specified:

```{r grade_preset}
grade_groups <- extract_metadata(
  file_path    = sample_file,
  column_name  = "gleason_score",
  disease_type = "prostate",
  output_dir   = tempdir()
)

print(names(grade_groups))
print(sapply(grade_groups, length))
```

### Using the Auto Mode

For datasets where clinical thresholds are unknown, `disease_type = "auto"`
automatically derives a normalised Risk Score from the data using
min-max normalisation and divides samples into the requested number of
groups. The function assesses distribution skewness and selects
equal-width or quantile-based splitting accordingly:

```{r grade_auto}
grade_groups_auto <- extract_metadata(
  file_path    = sample_file,
  column_name  = "gleason_score",
  disease_type = "auto",
  n_groups     = 5,
  group_type   = "grade",
  output_dir   = tempdir()
)

print(names(grade_groups_auto))
```

The Risk Score is computed as:
**Risk Score = (score − min) / (max − min)**

This normalises any numeric scoring system to a 0–1 scale, making
the package applicable to any disease with a numeric grading or
staging score.

---

## Step 2: Stratify Samples by Clinical Risk Level

### Background

Risk stratification translates Gleason grades into actionable clinical
risk categories. The D'Amico classification (D'Amico et al., 1998),
the most widely used system in prostate cancer, defines:

- **Low risk**: Gleason ≤ 6 — active surveillance appropriate
- **Intermediate risk**: Gleason 7 — selective treatment indicated
- **High risk**: Gleason ≥ 8 — aggressive treatment required

### Using the Prostate Cancer Preset

```{r risk_preset}
risk_groups <- classify_risk(
  file_path    = sample_file,
  column_name  = "gleason_score",
  disease_type = "prostate",
  output_dir   = tempdir()
)

print(names(risk_groups))
print(sapply(risk_groups, length))
```

### Flexible Risk Grouping with Auto Mode

The `auto` mode supports any number of risk groups. This is useful for
diseases that use only two risk levels or four levels:

```{r risk_auto}
# Two risk groups
risk_2 <- classify_risk(
  file_path    = sample_file,
  column_name  = "gleason_score",
  disease_type = "auto",
  n_groups     = 2,
  output_dir   = tempdir()
)
print(names(risk_2))

# Four risk groups
risk_4 <- classify_risk(
  file_path    = sample_file,
  column_name  = "gleason_score",
  disease_type = "auto",
  n_groups     = 4,
  output_dir   = tempdir()
)
print(names(risk_4))
```

---

## Step 3: Detect Copy Number Aberrations

### Background

Aberration detection identifies genomic segments showing significant
gains (oncogene amplifications) or losses (tumour suppressor deletions).
Grade-wise analysis of TCGA-PRAD data showed progressive increases in
aberration frequency:

- **Grade 1**: Amplifications in chromosomes 8, 14, 20; losses in
  chromosomes 2-21
- **Grade 5**: Amplifications in chromosomes 1, 3, 7, 9, 10, 12, 14,
  16, 17, 19, 21; losses in chromosomes 1-14 and 16-22

This escalation in genomic instability with Gleason grade underscores
the biological basis of clinical risk stratification.

```{r aberration}
aberrations <- aberration(
  cnv_data_file = cnv_file,
  effect_size   = 0.3
)

# Aberrant regions per chromosome
print(sapply(aberrations, nrow))
```

Segments with segment mean > +0.3 are classified as Gains
(Aberration_Code = 1); those with segment mean < −0.3 are classified
as Losses (Aberration_Code = 0).

---

## Step 4: Identify Recurrent Copy Number Variations

### Background

Recurrent CNVs — those appearing consistently across multiple samples
within a risk group — are more likely to represent functional driver
events rather than stochastic noise. Recurrent gains in oncogenes and
recurrent losses in tumour suppressor genes carry particular clinical
significance.

````{r recurrent, eval = FALSE}
recurrent_file <- recurrent(
  x             = risk_groups,
  risk_level    = "high_risk",
  cnv_data_file = cnv_file,
  threshold     = 2
)

recurrent_data <- read.csv(recurrent_file)
head(recurrent_data)
````
The output CSV contains recurrent CNV regions for the high-risk group:

| Sample | Chromosome | Start | End | Num_Probes | Segment_Mean |
|---|---|---|---|---|---|
| TCGA.2A.A8VL | 1 | 66506701 | 66515081 | 9 | -1.1952 |
| TCGA.2A.A8VL | 1 | 74699099 | 74700802 | 5 | -1.2440 |
| TCGA.2A.A8VL | 2 | 123900537 | 123920905 | 14 | -0.7297 |

The `threshold` parameter controls stringency — higher values yield
more robustly recurrent regions. Here we focus on high-risk samples
where aggressive CNV patterns are most pronounced.

---

## Step 5: Annotate CNV Regions with Gene Symbols

### Background

Annotating CNV regions with gene symbols — by finding genomic overlaps
with a reference gene annotation — translates chromosomal coordinates
into functionally meaningful gene lists. Key genes identified in
TCGA-PRAD CNV analyses include ARHGEF16, CHMP7, ELP3, ASH2L, and
RNF157, whose roles span immune modulation, cell cycle regulation, and
tumour suppression.

```{r annotate, eval = FALSE}
annotated <- annotate(
  genes_file = gene_file,
  risk_file  = recurrent_file,
  output_dir = tempdir()
)

head(annotated)
```

Each recurrent CNV region is linked to the overlapping gene(s), sample
ID, and segment mean value.

---

## Step 6: Build a CNV Expression Matrix

### Background

A sample × gene CNV matrix organises annotated data into a format
suitable for statistical integration with expression data. Positive
values indicate copy number gains; negative values indicate losses.
This matrix structure allows efficient examination of gene variation
patterns across all samples simultaneously.

```{r cnv_matrix}
old_wd <- getwd()
setwd(tempdir())

cnv_matrix <- create_CNVMatrix(input_file = annotated_file)

setwd(old_wd)

print(dim(cnv_matrix))
print(cnv_matrix[, 1:min(5, ncol(cnv_matrix))])
```

---

## Step 7: Correlate CNV with RNA Expression

### Background

The final step integrates CNV data with transcriptomic profiles to
identify CNV-driven gene expression changes. When a gene is amplified,
its expression typically increases; when deleted, expression typically
decreases. Genes showing statistically significant positive Pearson
correlations between CNV segment mean and RNA expression are strong
biomarker candidates.

In the full TCGA-PRAD analysis, grade-specific biomarkers included:

| Grade | Key CNV-correlated Gene | Role |
|---|---|---|
| Grade 1 | PPP1R3B | Glycogen metabolism regulation |
| Grade 2 | ASH2L | Histone methylation, chromatin remodeling |
| Grade 3 | IWS1 | RNA processing |
| Grade 4 | UBE2Q1 | Ubiquitin-mediated proteolysis |
| Grade 5 | GSK3A | Cell cycle, apoptosis regulation |

Cross-grade consensus genes CHMP7 and ELP3 were common across all
five grades, suggesting fundamental roles in prostate cancer biology.

```{r correlations}
old_wd <- getwd()
setwd(tempdir())

corr_results <- correlate_with_expr(
  cnv_file = cnv_matrix_file,
  rna_file = rna_file
)

setwd(old_wd)

cat("All correlations:\n")
print(corr_results$all_correlations)

cat("\nSignificant correlations (p < 0.05):\n")
print(corr_results$significant)

cat("\nHigh-confidence CNV-driven genes (p < 0.05, r > 0.8):\n")
print(corr_results$high_correlation)
```

Three output files are produced — all correlations, significant
(p < 0.05), and high-confidence (p < 0.05 AND r > 0.8). Genes in the
high-confidence output are the strongest candidates for further
investigation as CNV-driven expression biomarkers in prostate cancer.

---

## Biological Summary

The `RiskyCNV` workflow applied to TCGA-PRAD data produced the
following key findings:

**Risk-grade comparison**: Comparing low-risk (Grade 1-2) and high-risk
(Grade 4-5) analyses revealed common genes with distinct biological
roles. Low-risk/Grade 1 shared genes included HMGA1, BRCA1, TP53, MYC,
and EGFR — major oncogenes and tumour suppressors correlated with
baseline cancer susceptibility. High-risk/Grade 4-5 shared genes
included UBE2C and UBE2Z — ubiquitin pathway members associated with
tumour aggressiveness and poor prognosis.

**Consensus biomarkers**: Integrated analysis of recurrent somatic
CNVs, germline CNVs, and RNA-correlated genes identified grade-specific
consensus biomarkers, including ME1, CLU, SERPINB5, NECAB1, CTHRC1, and
DPYS. RNF157 emerged as a multi-consensus candidate biomarker across
multiple analytical approaches, consistent with its known role in M2
macrophage polarisation and tumour immune evasion (Guan et al., 2022).

**Clinical implications**: By integrating risk stratification with
grade-wise CNV analysis, `RiskyCNV` enables a more comprehensive
understanding of prostate cancer. Recognising genes common across both
analyses enhances the ability to distinguish between genes that
primarily drive tumour progression and those that serve as early
indicators of cancer susceptibility, ultimately supporting personalised
treatment planning.

---

## Generalisability

Although demonstrated here with prostate cancer Gleason scores,
`RiskyCNV` is fully generalised:

```{r generalised, eval = FALSE}
# Breast cancer with Nottingham scores
breast_grades <- extract_metadata(
  file_path    = "breast_samples.csv",
  column_name  = "nottingham_score",
  disease_type = "auto",
  n_groups     = 3,
  group_type   = "grade",
  output_dir   = tempdir()
)

# Lymphoma with two risk groups (limited vs. advanced)
lymphoma_risk <- classify_risk(
  file_path    = "lymphoma_samples.csv",
  column_name  = "ann_arbor_stage",
  disease_type = "auto",
  n_groups     = 2,
  output_dir   = tempdir()
)
```

Seven built-in disease presets are available (`"prostate"`,
`"breast"`, `"colorectal"`, `"lung"`, `"cervical"`, `"lymphoma"`,
`"melanoma"`), and fully custom thresholds can be supplied via
`disease_type = "custom"`.

---

## References

- Balraj AS & Muthamilselvan S. (2024). PRADclass: Hybrid Gleason
  grade-informed computational strategy. *Cancer Informatics*.
- D'Amico AV, et al. (1998). Biochemical outcome after radical
  prostatectomy. *JAMA*, 280(11):969-974.
- Epstein JI, et al. (2016). The 2014 ISUP Consensus Conference on
  Gleason Grading. *Am J Surg Pathol*, 40(2):244-252.
- Freeman JL, et al. (2006). Copy number variation: new insights in
  genome diversity. *Genome Research*, 16:949-961.
- Guan H, et al. (2022). Exosomal RNF157 mRNA from prostate cancer
  cells contributes to M2 macrophage polarization. *Front Oncol*,
  12:1021270.
- Hieronymus H, et al. (2014). Copy number alteration burden predicts
  prostate cancer relapse. *PNAS*, 111:11139-11144.
- Krause J, et al. (2024). CNVizard. *BMC Bioinformatics*, 25:376.
- Kuilman T, et al. (2015). CopywriteR. *Genome Biology*, 16:49.
- Mermel CH, et al. (2011). GISTIC2.0. *Genome Biology*, 12:R41.
- Oza VH, et al. (2023). CINmetrics. *PeerJ*, 11:e15244.
- Xie C & Tammi MT. (2009). CNV-seq. *BMC Bioinformatics*, 10:80.
