Title: Longitudinal Integration Site Analysis Toolkit
Version: 0.1.2
Description: A comprehensive toolkit for the analysis of longitudinal integration site data, including data cleaning, quality control, statistical modeling, and visualization. It streamlines the entire integration site analysis workflow, supports simple input formats, and provides user-friendly functions for researchers working on viral integration sites. Ni et al. (2025) <doi:10.64898/2025.12.20.695672>.
License: MIT + file LICENSE
Encoding: UTF-8
RoxygenNote: 7.2.3
VignetteBuilder: knitr
Depends: R (≥ 3.5)
Imports: GenomicRanges (≥ 1.50.0), IRanges (≥ 2.32.0), tidyr (≥ 1.3.0), dplyr (≥ 1.1.4), AnnotationDbi (≥ 1.50.0), S4Vectors (≥ 0.32.0), GenomicFeatures (≥ 1.50.0), magrittr, ggplot2, purrr, broom
Suggests: TxDb.Hsapiens.UCSC.hg38.knownGene, org.Hs.eg.db, knitr, rmarkdown, grid, gt (≥ 0.9.0), gtable (≥ 0.3.6), this.path (≥ 2.0.0), plotrix (≥ 3.8-2), scales (≥ 1.2.0), writexl (≥ 1.4.0), ggrepel (≥ 0.9.4), ggpubr (≥ 0.6.0), viridisLite (≥ 0.4.2), RIdeogram (≥ 0.2.2), patchwork (≥ 1.1.3), RColorBrewer (≥ 1.1-3), colorspace, treemapify, igraph, visNetwork, Cairo (≥ 1.6-1), testthat (≥ 3.0.0)
NeedsCompilation: no
Packaged: 2026-03-24 09:27:27 UTC; nishuai
Author: Shuai Ni [aut, cre]
Maintainer: Shuai Ni <Nishuai@wakerbio.com>
Repository: CRAN
Date/Publication: 2026-03-27 10:40:03 UTC

Visualize and analyze the network of common integration sites (CIS)

Description

Visualize and analyze the network of common integration sites (CIS)

Usage

CIS(IS_raw, connect_distance = 50000)

Arguments

IS_raw

Data frame containing integration site data (must have Locus, Chr, nearest_gene_name columns)

connect_distance

Numeric threshold for connecting IS (default = 50000 bp)

Value

Data frame with top 10 CIS network metrics (Chr, Locus, Gene, Total_dots, etc.)
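One way to read connect_distance: two sites on the same chromosome are linked when their loci are at most connect_distance apart, and chains of linked sites form one cluster. The sketch below illustrates that reading in base R on a hypothetical toy data frame (IS_toy). It is an assumption about the idea, not the package's implementation, and it does not reproduce the network metrics returned by CIS().

```r
# Hypothetical toy input with the three columns named in the Arguments section.
IS_toy <- data.frame(
  Chr = c("chr1", "chr1", "chr1", "chr2", "chr2"),
  Locus = c(100000L, 120000L, 400000L, 50000L, 90000L),
  nearest_gene_name = c("GENE_A", "GENE_A", "GENE_B", "GENE_C", "GENE_C"),
  stringsAsFactors = FALSE
)
connect_distance <- 50000

# Sort so that adjacent rows are adjacent sites on the same chromosome.
IS_toy <- IS_toy[order(IS_toy$Chr, IS_toy$Locus), ]

# A new cluster starts at the first row, when the chromosome changes,
# or when the gap to the previous site exceeds connect_distance.
n <- nrow(IS_toy)
same_chr <- c(FALSE, IS_toy$Chr[-1] == IS_toy$Chr[-n])
gap <- c(Inf, diff(IS_toy$Locus))
new_cluster <- !same_chr | gap > connect_distance
IS_toy$cluster <- cumsum(new_cluster)

# Sites per cluster; clusters with more than one site are CIS candidates.
cluster_size <- table(IS_toy$cluster)
print(cluster_size)
```

Because the sites are sorted within each chromosome, comparing consecutive gaps is enough to find every chain of linked sites.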


Generate a colored gt table of CIS overlap across samples/timepoints

Description

Generate a colored gt table of CIS overlap across samples/timepoints

Usage

CIS_overlap(CIS_data, IS_raw, Timelevels = NULL)

Arguments

CIS_data

Data frame of CIS metrics (must have Chr and Locus columns)

IS_raw

Data frame of raw integration site data (must have Sample, Chr, Locus columns)

Timelevels

Optional vector of sample/timepoint levels for ordered display (default = NULL)

Value

gt table object with colored CIS overlap status (TRUE/FALSE)
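The TRUE/FALSE overlap status can be pictured as a presence matrix with one row per CIS and one column per sample. The base R sketch below builds such a matrix on hypothetical toy data (CIS_toy, IS_toy). Matching a CIS to a sample by exact Chr and Locus is an assumption made for illustration; the package may match differently, and the gt rendering step is not shown.

```r
# Hypothetical toy inputs with the columns named in the Arguments section.
CIS_toy <- data.frame(
  Chr = c("chr1", "chr2"),
  Locus = c(100000L, 50000L),
  stringsAsFactors = FALSE
)
IS_toy <- data.frame(
  Sample = c("S1", "S1", "S2", "S3"),
  Chr = c("chr1", "chr2", "chr1", "chr3"),
  Locus = c(100000L, 50000L, 100000L, 70000L),
  stringsAsFactors = FALSE
)
Timelevels <- c("S1", "S2", "S3")  # column order for display

# Build a "Chr:Locus" key so that sites can be compared as single strings.
cis_key <- paste(CIS_toy$Chr, CIS_toy$Locus, sep = ":")
is_key <- paste(IS_toy$Chr, IS_toy$Locus, sep = ":")

# For each sample, flag the CIS whose key occurs among that sample's sites.
overlap <- sapply(Timelevels, function(s) {
  cis_key %in% is_key[IS_toy$Sample == s]
})
rownames(overlap) <- cis_key
print(overlap)
```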


Calculate regional distribution percentages of integration sites (IS)

Description

Calculate regional distribution percentages of integration sites (IS)

Usage

Count_regions(IS_raw, Patient_timepoint)

Arguments

IS_raw

Data frame of raw integration site data (must have Sample column + regional annotation columns)

Patient_timepoint

Data frame mapping Sample_ID to Time_Point (columns: Sample_ID, Time_Point)

Value

List of data frames (per sample) with regional IS percentages (Exonic/Intronic/Enhancer etc.)
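The per-sample list in the Value section can be illustrated with a small base R sketch. It assumes a single hypothetical Region column holding one label per site. The actual regional annotation columns expected by Count_regions() are not spelled out here, so treat the column name and layout as placeholders.

```r
# Hypothetical toy input: one regional label per integration site.
IS_toy <- data.frame(
  Sample = c("S1", "S1", "S1", "S1", "S2", "S2"),
  Region = c("Exonic", "Intronic", "Intronic", "Enhancer", "Intronic", "Exonic"),
  stringsAsFactors = FALSE
)

# Split by sample, then convert the counts per region into percentages.
region_pct <- lapply(split(IS_toy, IS_toy$Sample), function(d) {
  counts <- table(d$Region)
  data.frame(
    Region = names(counts),
    Percentage = as.numeric(100 * counts / sum(counts)),
    stringsAsFactors = FALSE
  )
})
print(region_pct$S1)
```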


Plot cumulative curve and perform statistical analysis

Description

Plot cumulative curve and perform statistical analysis

Usage

Cumulative_curve(IS_ratio)

Arguments

IS_ratio

A numeric vector of integration site ratios (output of fit_cum_simple)

Value

A list containing the ggplot object, the t-test results, and the Wilcoxon test results.


Check if integration sites (IS) are located in enhancer regions

Description

Check if integration sites (IS) are located in enhancer regions

Usage

Enhancer_check(IS_raw)

Arguments

IS_raw

Data frame containing raw integration site data (must have Chr and Locus columns)

Value

Data frame with an added Enhancer column (TRUE = located in enhancer, FALSE = not located in enhancer)
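Conceptually, a check of this kind asks whether each (Chr, Locus) pair falls inside any annotated interval. The base R sketch below shows that test on hypothetical toy data. The regions table and its Start/End columns are assumptions for illustration, and the enhancer annotation actually used by the package is not shown.

```r
# Hypothetical toy sites and a hypothetical interval annotation.
IS_toy <- data.frame(
  Chr = c("chr1", "chr1", "chr2"),
  Locus = c(1500L, 9000L, 1500L),
  stringsAsFactors = FALSE
)
regions <- data.frame(
  Chr = c("chr1", "chr2"),
  Start = c(1000L, 5000L),
  End = c(2000L, 6000L),
  stringsAsFactors = FALSE
)

# A site is inside a region when the chromosome matches and
# Start <= Locus <= End holds for at least one interval.
IS_toy$Enhancer <- vapply(seq_len(nrow(IS_toy)), function(i) {
  any(regions$Chr == IS_toy$Chr[i] &
        regions$Start <= IS_toy$Locus[i] &
        IS_toy$Locus[i] <= regions$End)
}, logical(1))
print(IS_toy)
```

This row-by-row loop is only meant to make the logic visible; for genome-scale data an interval-overlap query such as GenomicRanges::findOverlaps() is the usual tool.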


Generate treemap of integration site clone contribution

Description

This function creates a treemap visualization of the top 1000 integration site (IS) clone contributions, grouped by patient time points with custom color perturbation.

Usage

IS_treemap(
  IS_raw = IS_raw,
  Patient_timepoint = Patient_timepoint,
  Timelevels = NULL
)

Arguments

IS_raw

Data frame containing IS data (columns: Sample, Locus, Clone_contribution required)

Patient_timepoint

Data frame mapping Sample_ID to Time_Point (columns: Sample_ID, Time_Point required)

Timelevels

Character vector, optional custom order of time points (default: NULL, natural sort)

Value

ggplot object (treemap of IS clone contributions)


Generate Linked Timepoint Sankey + Stacked Bar Chart

Description

Creates a highly customizable combined Sankey-flow + stacked bar chart to visualize changes in clonal proportions across timepoints, with manual control over flow polygon shapes and precise formatting of the top integration sites (top 10 + "Others" category). All core logic and data processing steps remain identical to the original code; only namespace prefixes (::) were added and the lag() call was fixed.

Usage

Linked_timepoints(IS_raw, Patient_timepoint, Timelevels = NULL)

Arguments

IS_raw

Data frame containing integration site data (required columns: Clone_contribution, Sample, nearest_gene_name, Chr, Locus)

Patient_timepoint

Data frame mapping Sample_ID to Time_Point (columns: Sample_ID, Time_Point required)

Timelevels

Character vector (optional). Custom ordered levels for time points (overrides natural sort). Default = NULL.

Value

ggplot object. Combined Sankey-flow + stacked bar chart of top 10 integration site proportions across timepoints.
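The top 10 + "Others" grouping can be sketched in base R as follows. The helper collapse_top() is hypothetical and not part of the package, and ranking by raw contribution within a single sample is an assumption; the toy call uses top_n = 2 only to keep the example short.

```r
# Hypothetical helper: keep the top_n largest contributions and
# pool everything else into a single "Others" entry.
collapse_top <- function(contribution, label, top_n = 10) {
  ord <- order(contribution, decreasing = TRUE)
  top <- head(ord, top_n)
  out <- contribution[top]
  names(out) <- label[top]
  if (length(contribution) > top_n) {
    out <- c(out, Others = sum(contribution[-top]))
  }
  out
}

# Toy example for one sample: five clones, keep the two largest.
props <- collapse_top(
  contribution = c(0.5, 0.1, 0.3, 0.06, 0.04),
  label = c("A", "B", "C", "D", "E"),
  top_n = 2
)
print(props)
```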


Check if integration sites (IS) are located in promoter regions

Description

Check if integration sites (IS) are located in promoter regions

Usage

Promotor_check(IS_raw)

Arguments

IS_raw

Data frame containing raw integration site data (must have Chr and Locus columns)

Value

Data frame with an added Promotor column (TRUE = located in promoter, FALSE = not located in promoter)


Check if integration sites (IS) are located in safe harbor regions

Description

Check if integration sites (IS) are located in safe harbor regions

Usage

Safeharbor_check(IS_raw)

Arguments

IS_raw

Data frame containing raw integration site data (must have Chr and Locus columns)

Value

Data frame with an added Safeharbor column (TRUE = located in safe harbor, FALSE = not located in safe harbor)


Plot chromosome distribution of integration sites (IS)

Description

Plot chromosome distribution of integration sites (IS)

Usage

chr_distribution(IS_raw, ref_version = "random")

Arguments

IS_raw

Data frame containing raw integration site data (must have Chr column)

ref_version

Reference version for simulation (options: 'random' or 'LV', default = 'random')

Value

ggplot object of chromosome distribution (percentage of IS per chromosome)


Calculate normalized cumulative sum for top N elements of a numeric vector

Description

Calculate normalized cumulative sum for top N elements of a numeric vector

Usage

fit_cum_simple(x)

Arguments

x

Non-empty numeric vector (integration site ratio data)

Value

Named vector of cumulative sums for predefined target indices + total sum (all = 1)
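A normalized cumulative sum of this kind can be sketched in base R as below. The function cum_top_sketch(), its targets argument, and the "top" naming scheme are hypothetical; the predefined target indices used by fit_cum_simple() are not listed here, so the values c(1, 2, 3) are placeholders.

```r
# Hypothetical re-implementation of the idea, not the package function.
cum_top_sketch <- function(x, targets = c(1, 2, 3)) {
  stopifnot(is.numeric(x), length(x) > 0)
  x <- sort(x, decreasing = TRUE)      # largest clones first
  cum <- cumsum(x) / sum(x)            # normalize so the total equals 1
  targets <- targets[targets <= length(x)]
  out <- cum[targets]
  names(out) <- paste0("top", targets)
  c(out, all = 1)                      # the full vector always sums to 1
}

res <- cum_top_sketch(c(0.1, 0.4, 0.2, 0.3))
print(res)
```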


Annotate integration site (IS) data with genomic features

Description

This function adds genomic feature annotations (gene/exon/intron overlap, nearest gene info) to raw integration site data, standardizes chromosome naming, and calculates clone contribution.

Usage

get_feature(IS_raw)

Arguments

IS_raw

Data frame containing raw IS data with columns: Sample, Chr, Locus, SCount, Strand

Value

Data frame with annotated genomic features and clone contribution
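The expected input shape and two of the described steps can be sketched on a toy data frame. Both steps below are assumptions made for illustration: chromosome names are standardized by adding a "chr" prefix where it is missing, and clone contribution is taken as each site's SCount divided by the total SCount of its sample. The gene, exon, and intron annotation performed by get_feature() is not reproduced.

```r
# Hypothetical toy input with the five columns named in the Arguments section.
IS_toy <- data.frame(
  Sample = c("S1", "S1", "S2", "S2", "S2"),
  Chr = c("1", "chr2", "X", "chr1", "chr1"),
  Locus = c(100L, 200L, 300L, 400L, 500L),
  SCount = c(30, 70, 10, 40, 50),
  Strand = c("+", "-", "+", "+", "-"),
  stringsAsFactors = FALSE
)

# Assumed naming convention: every chromosome carries a "chr" prefix.
has_prefix <- grepl("^chr", IS_toy$Chr)
IS_toy$Chr <- ifelse(has_prefix, IS_toy$Chr, paste0("chr", IS_toy$Chr))

# Assumed definition: share of the sample's total SCount held by each site.
sample_total <- ave(IS_toy$SCount, IS_toy$Sample, FUN = sum)
IS_toy$Clone_contribution <- IS_toy$SCount / sample_total
print(IS_toy)
```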


Plot chromosome ideogram with integration site annotations

Description

This function generates a chromosome ideogram plot showing the density and position of integration sites (IS) using the RIdeogram package.

Usage

ideogram_plot(IS_raw, output_dir)

Arguments

IS_raw

Data frame containing integration site data (columns: Chr, Locus required)

output_dir

Character, path to output directory for the PDF plot

Value

None (generates a PDF file in output_dir)


Plot AE-associated gene clone contribution

Description

This function filters integration site data for AE-associated genes (within specified distance/threshold) and generates a dot plot of clone contribution percentages for these genes.

Usage

is_in_AE_gene(IS_raw, Distance = 1e+05, threashold = 0.001)

Arguments

IS_raw

Data frame with annotated integration site data (columns: nearest_gene_name, nearest_distance, Clone_contribution, Sample required)

Distance

Numeric, maximum distance to AE gene (default: 100000 bp)

threashold

Numeric, minimum clone contribution threshold (default: 0.001)

Value

ggplot object (dot plot of clone contribution for AE-associated genes)
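The filtering step described above can be sketched in base R. The gene list (genes_of_interest) is a hypothetical placeholder, since the AE gene set used by the package is not listed here, and the inclusive comparisons (<= and >=) are assumptions. The dot plot itself is not reproduced.

```r
# Hypothetical annotated input with the required columns.
IS_toy <- data.frame(
  Sample = c("S1", "S1", "S2", "S2"),
  nearest_gene_name = c("GENE_A", "GENE_B", "GENE_A", "GENE_C"),
  nearest_distance = c(5000, 20000, 250000, 1000),
  Clone_contribution = c(0.02, 0.0005, 0.3, 0.1),
  stringsAsFactors = FALSE
)
genes_of_interest <- c("GENE_A", "GENE_B")  # hypothetical gene set
Distance <- 1e5        # maximum distance to the gene, in bp
threashold <- 0.001    # argument name spelled as in the Usage section

# Keep sites near a listed gene whose clone contribution is large enough.
keep <- IS_toy$nearest_gene_name %in% genes_of_interest &
  abs(IS_toy$nearest_distance) <= Distance &
  IS_toy$Clone_contribution >= threashold
hits <- IS_toy[keep, ]

# Express the contribution as a percentage for plotting.
hits$Percent <- 100 * hits$Clone_contribution
print(hits)
```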


Plot Cancer-associated gene clone contribution

Description

This function filters integration site data for cancer-associated genes (within specified distance/threshold) and generates a dot plot of clone contribution percentages for these genes.

Usage

is_in_CG_gene(IS_raw, Distance = 1e+05, threashold = 0.001)

Arguments

IS_raw

Data frame with annotated integration site data (columns: nearest_gene_name, nearest_distance, Clone_contribution, Sample required)

Distance

Numeric, maximum distance to cancer gene (default: 100000 bp)

threashold

Numeric, minimum clone contribution threshold (default: 0.001)

Value

ggplot object (dot plot of clone contribution for cancer-associated genes)


Plot Immune-associated gene clone contribution

Description

This function filters integration site data for immune-associated genes (within specified distance/threshold) and generates a dot plot of clone contribution percentages for these genes.

Usage

is_in_immune_gene(IS_raw, Distance = 1e+05, threashold = 0.001)

Arguments

IS_raw

Data frame with annotated integration site data (columns: nearest_gene_name, nearest_distance, Clone_contribution, Sample required)

Distance

Numeric, maximum distance to immune gene (default: 100000 bp)

threashold

Numeric, minimum clone contribution threshold (default: 0.001)

Value

ggplot object (dot plot of clone contribution for immune-associated genes)


Plot Region-wise Donut Charts

Description

Plot Region-wise Donut Charts

Usage

plot_regions(Region_data, Timelevels = NULL)

Arguments

Region_data

Named list of data frames with Product/Share/Percentage/Time columns

Timelevels

Character vector to subset time levels (optional)

Value

Arranged ggplot object of donut charts


Plot Richness & Evenness Dual Y-Axis Line Chart

Description

Creates a polished dual Y-axis line chart to visualize clonal richness and evenness over time, with automatic scaling between axes, customizable styling, and optional data labels. All core functionality and parameters remain identical to the original code; only namespace prefixes (::) were added.

Usage

plot_richness_evenness(
  PMD_data,
  time_col = "Time",
  richness_col = "Richness",
  evenness_col = "Eveness",
  plot_title = "Clonal eveness over time",
  subtitle = NULL,
  richness_color = "#3366CC",
  evenness_color = "#CC6677",
  show_labels = TRUE,
  Timelevels = NULL
)

Arguments

PMD_data

Data frame containing time, richness, and evenness data (required columns specified by time_col/richness_col/evenness_col)

time_col

Character (default = "Time"). Name of column containing time points.

richness_col

Character (default = "Richness"). Name of column containing richness values.

evenness_col

Character (default = "Eveness"). Name of column containing evenness values (note: intentional spelling match to original code).

plot_title

Character (default = "Clonal eveness over time"). Main plot title (spelling preserved as original).

subtitle

Character (optional). Plot subtitle (default = NULL).

richness_color

Character (default = "#3366CC"). Hex color code for richness line/points/labels.

evenness_color

Character (default = "#CC6677"). Hex color code for evenness line/points/labels.

show_labels

Logical (default = TRUE). Whether to display numeric labels on data points.

Timelevels

Character vector (optional). Custom ordered levels for time factor (overrides default ordering).

Value

ggplot object. Dual Y-axis line chart of richness (primary) and evenness (secondary) over time.
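Dual-axis charts usually work by rescaling the secondary series onto the primary axis and labeling the secondary axis with the inverse transformation. The base R sketch below shows one such rescaling on hypothetical toy data; a ratio of maxima is an assumption and may differ from the scaling the function applies.

```r
# Hypothetical toy data using the default column names from the Usage section.
PMD_toy <- data.frame(
  Time = c("M1", "M6", "M12"),
  Richness = c(200, 150, 100),
  Eveness = c(0.8, 0.5, 0.4),
  stringsAsFactors = FALSE
)

# Assumed scaling: stretch evenness so its maximum meets the richness maximum.
scale_factor <- max(PMD_toy$Richness) / max(PMD_toy$Eveness)
PMD_toy$Eveness_scaled <- PMD_toy$Eveness * scale_factor

# Dividing by the same factor recovers the original values, which is
# the mapping a secondary axis needs for its tick labels.
back <- PMD_toy$Eveness_scaled / scale_factor
print(PMD_toy)
```

In ggplot2, an inverse mapping of this kind is typically supplied through sec_axis() when the secondary axis is defined.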


Calculate PMD (Proportional Modular Diversity) for integration site data

Description

This function computes UIS count, top clone contribution percentage, and PMD metrics (Richness/Eveness/PMD) for integration site data, and maps samples to patient time points.

Usage

pmd_analysis(IS_raw, Patient_timepoint)

Arguments

IS_raw

Data frame containing integration site data (columns: Sample, Clone_contribution required)

Patient_timepoint

Data frame mapping Sample_ID to Time_Point (columns: Sample_ID, Time_Point required)

Value

Data frame with PMD metrics (UIS, TOP_P, Richness, Eveness, PMD, Sample, Time)
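Per-sample summaries of this kind can be sketched in base R. The formulas used by pmd_analysis() for Richness, Eveness, and PMD are not stated here, so the sketch computes only a unique-site count, the top clone percentage, and Pielou's evenness (Shannon entropy divided by its maximum) as a commonly used stand-in. None of this should be read as the package's definitions.

```r
# Hypothetical toy input with the two required columns.
IS_toy <- data.frame(
  Sample = c("S1", "S1", "S1", "S1", "S2", "S2"),
  Clone_contribution = c(0.25, 0.25, 0.25, 0.25, 0.9, 0.1),
  stringsAsFactors = FALSE
)

per_sample <- lapply(split(IS_toy$Clone_contribution, IS_toy$Sample), function(p) {
  p <- p / sum(p)                     # make sure the shares sum to 1
  shannon <- -sum(p * log(p))         # assumes no zero shares
  data.frame(
    UIS = length(p),                  # number of integration sites in the sample
    TOP_P = 100 * max(p),             # largest clone, in percent
    Pielou = shannon / log(length(p)) # stand-in evenness; needs at least 2 sites
  )
})
summary_toy <- do.call(rbind, per_sample)
summary_toy$Sample <- names(per_sample)
print(summary_toy)
```

A perfectly even sample (S1) reaches an evenness of 1, while a sample dominated by one clone (S2) scores well below that.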


Generate PMD (Proportional Modular Diversity) Scatter Plot with Inset Legend

Description

Creates a scatter plot of Richness vs. Eveness for PMD (Proportional Modular Diversity) analysis results, including reference lines, time point labels, and an inset directional legend for polyclonal/monoclonal classification. All core logic and parameters remain identical to the original code; only namespace prefixes (::) were added.

Usage

pmd_plot(PMD_data, Timelevels = NULL)

Arguments

PMD_data

Data frame output from pmd_analysis() function (required columns: Richness, Eveness, Time)

Timelevels

Character vector (optional). Custom ordered levels for the Time factor. Default = NULL (uses natural sort)

Value

ggplot object. Combined plot (main Richness-Eveness plot + inset legend)


Validate and standardize integration site (IS) raw data frame

Description

Validate and standardize integration site (IS) raw data frame

Usage

validate_IS_raw(IS_raw)

Arguments

IS_raw

Data frame containing IS data (expected columns: Sample, SCount, Chr, Locus)

Value

List with validation results:
