Type: Package
Title: Robust Outlier Detection for Diverse Distributions
Version: 0.1.3
Maintainer: Amanda Mejia <mandy.mejia@gmail.com>
Description: Provides robust outlier detection techniques for identifying anomalies in multivariate data, with a focus on methods that remain effective under non-Gaussian distributions. For more details see Saluja, Parlak, and Mejia (2026+) <doi:10.48550/arXiv.2505.11806>.
License: GPL-3
URL: https://github.com/mandymejia/rrobot
BugReports: https://github.com/mandymejia/rrobot/issues
Depends: R (≥ 3.6.0)
Imports: MASS, stats, cellWise, expm, robustbase, gamlss, imputeTS, isotree, ggplot2, tidyr, reshape2, rlang
Suggests: knitr, rmarkdown, testthat (≥ 3.0.0)
VignetteBuilder: knitr
Encoding: UTF-8
RoxygenNote: 7.3.3
Language: en-US
NeedsCompilation: no
Packaged: 2026-03-04 21:08:12 UTC; ddpham
Author: Amanda Mejia [aut, cre], Damon Pham [ctb], Saranjeet Singh Saluja [ctb], Fatma Parlak [ctb], Zeshawn Zahid [ctb]
Repository: CRAN
Date/Publication: 2026-03-09 16:30:02 UTC

Dots parameter documentation

Description

Dots parameter documentation

Arguments

...

Additional arguments passed to method-specific functions.


B parameter documentation

Description

B parameter documentation

Arguments

B

Integer; number of bootstrap samples per imputed dataset (default = 1000).


M parameter documentation

Description

M parameter documentation

Arguments

M

Integer; number of multiply imputed datasets (default = 5).


Multiple Imputation with Per-Cycle Updates (OLS + MICE-style)

Description

Multiple Imputation with Per-Cycle Updates (OLS + MICE-style)

Usage

MImpute(x, w, outlier_matrix, M = 50, k = 5, ridge_eps = 1e-08, tol = NA_real_)

Arguments

x

(T × p) high-kurtosis ICA matrix to impute.

w

(T × q) predictors (e.g., low-kurtosis components).

outlier_matrix

logical (T × p) mask of entries to impute.

M

number of multiply-imputed datasets (default 50).

k

number of chained-equation cycles per dataset (default 5; values of 5–10 are common).

ridge_eps

tiny ridge added to X'X for stability (default 1e-8).

tol

optional early-stop tolerance on per-cycle max change (NA to disable).

Value

A list with elements imp_datasets and outlier_matrix.
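
As a rough illustration of what ridge_eps is for, the sketch below regresses one column on the predictors w using only the non-masked rows, adds a small ridge term to X'X before solving, and fills the masked entries with the fitted predictions. It is not the code behind MImpute(): the helper name ridge_impute_col is hypothetical, and the chained cycles and multiple datasets described above are left out.

```r
# Hypothetical helper (not part of rrobot): impute masked entries of one
# column by regressing it on w, with a small ridge term added to X'X.
ridge_impute_col <- function(y, w, mask, ridge_eps = 1e-8) {
  X <- cbind(1, w)                        # add an intercept column
  obs <- !mask                            # rows used to fit the regression
  XtX <- crossprod(X[obs, , drop = FALSE])
  Xty <- crossprod(X[obs, , drop = FALSE], y[obs])
  beta <- solve(XtX + ridge_eps * diag(ncol(X)), Xty)
  # overwrite the masked entries only; observed values are left untouched
  y[mask] <- as.vector(X[mask, , drop = FALSE] %*% beta)
  y
}

set.seed(1)
w <- matrix(rnorm(200), ncol = 2)         # T = 100 rows, q = 2 predictors
y <- 1 + 2 * w[, 1] - w[, 2] + rnorm(100, sd = 0.1)
mask <- rep(FALSE, 100)
mask[c(5, 50)] <- TRUE
y_bad <- y
y_bad[mask] <- y_bad[mask] + 20           # plant two gross outliers
y_imp <- ridge_impute_col(y_bad, w, mask)
```

With ridge_eps this small the fit is essentially ordinary least squares; the ridge term only matters when X'X is close to singular.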


Multiple Imputation for High-Kurtosis ICA Components

Description

Performs multiple imputation using perturbed robust regression models.

Usage

MImpute_old(x, w, outlier_matrix, M = 50, k = 100)

Arguments

x

A numeric matrix (n_time × p) of high-kurtosis ICA components to be imputed.

w

A numeric matrix (n_time × L) of low-kurtosis ICA components used as predictors (required for MI).

outlier_matrix

A logical matrix (same dim as x) indicating univariate outliers to be imputed.

M

Integer; number of multiply imputed datasets (default = 50).

k

Integer; number of perturbation cycles per imputation (default = 100).

Value

A list with:

imp_datasets

List of M imputed versions of x

outlier_matrix

Logical matrix of imputed outlier positions


Comprehensive Outlier Detection Using Robust Distance Thresholding

Description

Performs univariate outlier detection and imputation, computes robust distances, and applies the selected thresholding method.

Usage

RD(
  x,
  w = NULL,
  method = c("SI_boot", "MI", "MI_boot", "SI", "F", "SHASH"),
  mode = "auto",
  cov_mcd = NULL,
  ind_incld = NULL,
  dist = TRUE,
  impute_method = "mean",
  cutoff = 4,
  trans = "SHASH",
  M = 50,
  k = 100,
  alpha = 0.01,
  quantile = 0.01,
  verbose = FALSE,
  boot_quant = 0.95,
  B = 1000
)

Arguments

x

A numeric matrix or data frame of dimensions T × p (observations × variables).

w

A numeric matrix (n_time × L) of low-kurtosis ICA components used as predictors (required for MI).

method

Character string; one of "SI_boot" (default), "MI", "MI_boot", "SI", "F", or "SHASH".

mode

Character string; either "auto" (default) to compute MCD internally or "manual" to use user-supplied values.

cov_mcd

Optional covariance matrix (p × p); required in "manual" mode.

ind_incld

Optional vector of row indices used to compute the robust mean; required in "manual" mode.

dist

Logical; if TRUE, compute squared robust Mahalanobis distances for all observations.

impute_method

Character string; imputation method for univariate outliers.

cutoff

A numeric value indicating how many MADs away from the median to flag as outliers. The default value is set to be 4.

trans

Character string; transformation method, one of "SHASH" or "robZ".

M

Integer; number of multiply imputed datasets (default = 50).

k

Integer; number of perturbation cycles per imputation (default = 100).

alpha

Significance level used to compute RD threshold (default = 0.01 for 99th percentile).

quantile

Numeric in (0, 1); upper-tail probability used for thresholding, i.e. the expected false positive rate of the chosen threshold.

verbose

Logical; if TRUE, print progress messages.

boot_quant

Numeric; confidence level for bootstrap confidence intervals (default = 0.95, for 95% CI).

B

Integer; number of bootstrap samples per imputed dataset (default = 1000).

Value

Depends on method:

Single method

Returns the result from the specific threshold method.

RD_obj

The robust distance object from compute_RD().

outliers

Logical vector indicating which observations have RD greater than the threshold.

call

The matched function call.


RD_obj parameter documentation

Description

RD_obj parameter documentation

Arguments

RD_obj

Pre-computed RD_result object from compute_RD.


RD_org_obj parameter documentation

Description

RD_org_obj parameter documentation

Arguments

RD_org_obj

Output list from compute_RD on the original data. Must contain $RD, $S_star, and $ind_incld.


SHASH-based Outlier Detection (Extended)

Description

Detects univariate outliers using an iterative SHASH fitting process with optional pre-flagging strategies. A SHASH (Sinh-Arcsinh) distribution is fitted to the data iteratively, each time excluding candidate outliers from the fit, until the set of flagged observations converges or maxit is reached.

Usage

SHASH_out(
  x,
  thr0 = 2.58,
  thr1 = 2.58,
  thr = 4,
  tail = c("both", "upper", "lower"),
  use_iso = TRUE,
  thr_iso = 0.6,
  maxit = 100,
  weight_init = NULL
)

Arguments

x

Numeric vector. May contain NA values; they are excluded from fitting and propagated as NA in all output vectors.

thr0

Positive numeric scalar. Threshold for initial outlier pre-flagging when use_iso = FALSE (default: 2.58).

thr1

Positive numeric scalar. Threshold used to classify observations as inliers during iterative convergence (default: 2.58).

thr

Positive numeric scalar. Final threshold applied to the converged SHASH-normalised scores to declare outliers in the returned output (default: 4).

tail

Character string specifying which tail(s) to check for outliers. Must be one of "both" (default), "upper", or "lower".

  • "upper": detect upper-tail outliers only.

  • "lower": detect lower-tail outliers only.

  • "both": detect two-sided outliers.

use_iso

Logical. If TRUE (default), uses an isolation forest (via isotree) to pre-screen candidate outliers before the iterative fitting loop begins.

thr_iso

Numeric scalar in [0, 1]. Isolation forest anomaly score threshold above which observations are treated as candidate outliers during pre-screening (default: 0.6). Only used when use_iso = TRUE.

maxit

Positive integer. Maximum number of fitting iterations before the algorithm stops regardless of convergence (default: 100).

weight_init

Optional logical vector of length length(x). If supplied, these weights initialise the iterative fit directly, bypassing both the isolation forest and empirical-rule pre-screening. TRUE means the observation is treated as an inlier in the first iteration.

Value

A list of class "SHASH_out" with the following elements:

out_idx

Integer vector. Indices of observations in x that were flagged as outliers at the final threshold thr.

x_norm

Numeric vector. SHASH-normalised scores for every observation (same length as x; NA where x was NA).

SHASH_coef

Named list with elements mu, sigma, nu, and tau: the fitted SHASH parameter estimates from the final iteration (sigma and tau are on the log scale, as returned by gamlssML).

isotree_scores

Numeric vector of isolation forest anomaly scores (same length as x). NA when use_iso = FALSE or weight_init was supplied.

initial_weights

Logical vector. Inlier weights used for the very first fitting iteration (same length as x).

indx_iters

Integer matrix of dimensions length(x) × last_iter. Each column records which observations were flagged as outliers (value 1) during that iteration.

norm_iters

Numeric matrix of dimensions length(x) × last_iter. Each column records the SHASH-normalised scores from that iteration.

last_iter

Integer. The number of iterations completed before convergence or hitting maxit.

converged

Logical. TRUE if the inlier weight vector stabilised before reaching maxit.

params

List. A record of all input parameters, stored for reproducibility.

Examples

# --- Example 1: Synthetic data with known injected outliers ---------------
# Using rnorm lets us inject outliers at known positions so we can verify
# the function finds exactly what we planted.
set.seed(42)
x <- rnorm(200, mean = 10, sd = 2)

# Shift a handful of observations far into the upper tail
outlier_positions <- c(17, 77, seq(190, 200))
x[outlier_positions] <- x[outlier_positions] + 10

result_sim <- SHASH_out(
  x,
  thr0    = 2.58,
  thr1    = 2.58,
  thr     = 4,
  tail    = "both",
  use_iso = FALSE   # skip isolation forest to keep the example fast
)

result_sim$out_idx    # should recover positions near outlier_positions
result_sim$converged  # did the iterative fit stabilise?

# --- Example 2: Real benchmark data (Hawkins-Bradu-Kass) ------------------
# hbk is a classic outlier detection benchmark shipped with robustbase,
# which this package already imports, so it is always available.
data("hbk", package = "robustbase")

result_hbk <- SHASH_out(
  hbk$X1,
  thr0    = 2.58,
  thr1    = 2.58,
  thr     = 4,
  tail    = "both",
  use_iso = FALSE
)

result_hbk$SHASH_coef  # fitted SHASH parameters; sigma and tau are log-scale

# Which positions in the X1 column were flagged as outliers?
result_hbk$out_idx

# Did the algorithm converge before hitting maxit?
result_hbk$converged

# How many iterations did it take?
result_hbk$last_iter


SHASH Data Transformation

Description

These two functions form a matched pair for transforming data between the SHASH (Sinh-Arcsinh) distribution and the standard normal distribution. SHASH_to_normal() maps SHASH-distributed observations onto an approximately normal scale; normal_to_SHASH() is the inverse.

Usage

SHASH_to_normal(x, mu, sigma, nu, tau)

normal_to_SHASH(x, mu, sigma, nu, tau)

Arguments

x

Numeric vector of values to transform.

mu

Numeric scalar. Location parameter controlling the mean of the SHASH distribution.

sigma

Numeric scalar. Spread parameter on the log scale. The function applies exp(sigma) internally, so pass the raw coefficient as returned by gamlssML(). Pass sigma = 0 to get unit spread since exp(0) = 1.

nu

Numeric scalar. Skewness parameter. A value of 0 gives a symmetric distribution.

tau

Numeric scalar. Tail-weight parameter on the log scale. Pass tau = 0 for normal-like tails since exp(0) = 1.

Value

A numeric vector of transformed values, the same length as x.

Examples

set.seed(42)
x <- rnorm(200)
x[c(17, 77)] <- x[c(17, 77)] + 5

mu <- 0; sigma <- 0; nu <- 0; tau <- 0

z <- SHASH_to_normal(x, mu = mu, sigma = sigma, nu = nu, tau = tau)
x_recovered <- normal_to_SHASH(z, mu = mu, sigma = sigma, nu = nu, tau = tau)
all.equal(x, x_recovered)
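
The round trip above uses identity parameters. For non-trivial parameters, the following sketch writes out one common sinh-arcsinh parameterisation by hand. The function names sas_to_normal and normal_to_sas are hypothetical, and the formula is an assumption about the general form of the transformation rather than a copy of the package source; sigma and tau are taken on the log scale, as described in the arguments above.

```r
# Hypothetical re-implementation (assumed form, not taken from rrobot):
# sigma and tau are on the log scale, so exp() is applied before use.
sas_to_normal <- function(x, mu, sigma, nu, tau) {
  sinh(exp(tau) * asinh((x - mu) / exp(sigma)) - nu)
}

# Algebraic inverse of the map above
normal_to_sas <- function(z, mu, sigma, nu, tau) {
  mu + exp(sigma) * sinh((asinh(z) + nu) / exp(tau))
}

x <- c(-2, -0.5, 0, 1, 3)
z <- sas_to_normal(x, mu = 1, sigma = log(2), nu = 0.3, tau = log(1.5))
x_back <- normal_to_sas(z, mu = 1, sigma = log(2), nu = 0.3, tau = log(1.5))
isTRUE(all.equal(x, x_back))   # round trip recovers x up to floating-point error
```

Under this form, mu = sigma = nu = tau = 0 reduces both maps to the identity, since sinh(asinh(x)) = x.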


Alpha parameter documentation

Description

Alpha parameter documentation

Arguments

alpha

Significance level used to compute RD threshold (default = 0.01 for 99th percentile).


Binwidth parameter documentation

Description

Binwidth parameter documentation

Arguments

binwidth

Histogram bin width (default = 0.1).


Boot_quant parameter documentation

Description

Boot_quant parameter documentation

Arguments

boot_quant

Numeric; confidence level for bootstrap confidence intervals (default = 0.95, for 95% CI).


Compute Squared Robust Distances and Covariance from a Subset

Description

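Calculates the robust mean, covariance matrix, and optionally squared robust distances using either an MCD fit computed internally (mode = "auto") or a user-supplied covariance matrix and index subset (mode = "manual").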

Usage

compute_RD(
  x,
  mode = c("auto", "manual"),
  cov_mcd = NULL,
  ind_incld = NULL,
  dist = TRUE
)

Arguments

x

A numeric matrix or data frame of dimensions T × p (observations × variables).

mode

Character string; either "auto" (default) to compute MCD internally or "manual" to use user-supplied values.

cov_mcd

Optional covariance matrix (p × p); required in "manual" mode.

ind_incld

Optional vector of row indices used to compute the robust mean; required in "manual" mode.

dist

Logical; if TRUE, compute squared robust Mahalanobis distances for all observations.

Value

A list with elements:

ind_incld

Vector of row indices used to compute the robust mean and covariance.

ind_excld

Vector of excluded row indices.

h

Number of included observations.

xbar_star

Robust mean vector (length p).

S_star

Robust covariance matrix (p × p).

invcov_sqrt

Matrix square root of the inverse covariance matrix (p × p).

RD

Squared robust distances for all observations (length T), or NULL if dist = FALSE.

call

The matched function call.
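
To make the returned quantities concrete, here is a base-R sketch of the computation in the spirit of "manual" mode: a mean and covariance are taken from a chosen subset of rows, and squared Mahalanobis distances are then computed for every row. The subset and covariance here are built by hand for illustration (no MCD fit is run), and the variable names simply mirror the Value entries above.

```r
# Illustrative only: subset-based estimates built by hand, not by an MCD fit.
set.seed(1)
x <- matrix(rnorm(300), ncol = 3)          # T = 100 observations, p = 3
x[1:5, ] <- x[1:5, ] + 8                   # contaminate the first five rows
ind_incld <- 6:100                         # assumed clean subset
xbar_star <- colMeans(x[ind_incld, ])      # mean from the subset
S_star <- cov(x[ind_incld, ])              # covariance from the subset

# Squared Mahalanobis distance of every row to (xbar_star, S_star)
RD <- mahalanobis(x, center = xbar_star, cov = S_star)

# Equivalent form via the inverse square root of S_star
e <- eigen(S_star, symmetric = TRUE)
invcov_sqrt <- e$vectors %*% diag(1 / sqrt(e$values)) %*% t(e$vectors)
z <- sweep(x, 2, xbar_star) %*% invcov_sqrt   # centred, whitened rows
RD_alt <- rowSums(z^2)
```

Both routes give the same distances; the whitened form shows why a square root of the inverse covariance is a convenient thing to return.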


Cov_mcd parameter documentation

Description

Cov_mcd parameter documentation

Arguments

cov_mcd

Optional covariance matrix (p × p); required in "manual" mode.


Cutoff parameter documentation

Description

Cutoff parameter documentation

Arguments

cutoff

A numeric value indicating how many MADs away from the median to flag as outliers. The default value is set to be 4.


Dist parameter documentation

Description

Dist parameter documentation

Arguments

dist

Logical; if TRUE, compute squared robust Mahalanobis distances for all observations.


Robust Empirical Rule Outlier Detection

Description

Detects outliers using the median ± thr × MAD rule, where MAD is normalised by 1.4826 to be consistent with the standard deviation under normality.

Usage

emprule_rob(x, thr = 4, tail = c("both", "upper", "lower"))

Arguments

x

Numeric vector.

thr

Positive numeric scalar. Threshold multiplier for the MAD rule (default: 4).

tail

Character string: one of "both" (default), "upper", or "lower", indicating which tail(s) to flag.

Value

A logical vector the same length as x. TRUE indicates an outlier, FALSE indicates an inlier.
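
The rule can be written in a few lines of base R. The sketch below is a stand-in for illustration, with the hypothetical name mad_rule, and is not the package source; it relies on stats::mad(), which applies the 1.4826 constant by default.

```r
# Hypothetical stand-in for the rule described above (not the package source).
mad_rule <- function(x, thr = 4, tail = c("both", "upper", "lower")) {
  tail <- match.arg(tail)
  # stats::mad() applies the 1.4826 consistency constant by default
  z <- (x - median(x)) / mad(x)
  switch(tail,
         both  = abs(z) > thr,
         upper = z > thr,
         lower = z < -thr)
}

x <- c(1:10, 100)
which(mad_rule(x))                   # index 11: only the value 100 is flagged
which(mad_rule(x, tail = "lower"))   # empty: nothing lies far below the median
```

Here the median is 6 and the scaled MAD is about 4.45, so the value 100 sits roughly 21 MADs above the median while every other point is within about 1.1.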


Imp_data parameter documentation

Description

Imp_data parameter documentation

Arguments

imp_data

A numeric matrix (T × p) of single-imputed data.


Imp_datasets parameter documentation

Description

Imp_datasets parameter documentation

Arguments

imp_datasets

A list of M numeric matrices (T × p); multiply imputed datasets.


Impute_method parameter documentation

Description

Impute_method parameter documentation

Arguments

impute_method

Character string; imputation method for univariate outliers.


Temporally impute univariate outliers from external detection

Description

Takes a high-kurtosis data matrix and a precomputed outlier mask, replaces the outliers with NA, and imputes them either with column means (method = "mean") or by temporal interpolation using imputeTS::na_interpolation (method = "interp").

Usage

impute_univOut(x, outlier_mask, method = c("mean", "interp"))

Arguments

x

A numeric matrix or data frame of dimensions T × p (observations × variables).

outlier_mask

A logical matrix (same dimensions as x) with TRUE at outlier positions.

method

One of "mean" or "interp"; "interp" uses imputeTS::na_interpolation, "mean" fills NAs with column means.

Value

A list with elements:

x_df

Original matrix with outliers replaced by NA (as a tibble).

NA_data

Matrix version of x with NAs at outlier positions.

imp_data

Imputed matrix after applying the chosen imputation method.

NA_locs

Row-column indices of outliers (now NA).

call

The matched function call.
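
A minimal base-R sketch of the "mean" option follows, assuming that column means are computed from the non-flagged entries only. The helper name impute_mean is hypothetical, and the returned list covers just two of the elements listed above.

```r
# Illustrative sketch of the "mean" option; impute_mean is a hypothetical name.
impute_mean <- function(x, outlier_mask) {
  x <- as.matrix(x)
  x[outlier_mask] <- NA                       # blank out flagged entries
  col_means <- colMeans(x, na.rm = TRUE)      # means of the remaining values
  na_locs <- which(is.na(x), arr.ind = TRUE)  # row-column indices of the NAs
  x[na_locs] <- col_means[na_locs[, "col"]]   # fill each NA with its column mean
  list(imp_data = x, NA_locs = na_locs)
}

x <- cbind(a = c(1, 2, 3, 100), b = c(10, -50, 30, 40))
mask <- cbind(a = c(FALSE, FALSE, FALSE, TRUE),
              b = c(FALSE, TRUE, FALSE, FALSE))
res <- impute_mean(x, mask)
res$imp_data   # a[4] becomes mean(1, 2, 3) = 2; b[2] becomes mean(10, 30, 40)
```

Mean imputation ignores temporal ordering, which is the main practical difference from the "interp" option.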


Ind_incld parameter documentation

Description

Ind_incld parameter documentation

Arguments

ind_incld

Optional vector of row indices used to compute the robust mean; required in "manual" mode.


K parameter documentation

Description

K parameter documentation

Arguments

k

Integer; number of perturbation cycles per imputation (default = 10).


Threshold_method parameter documentation

Description

Threshold_method parameter documentation

Arguments

method

Character string; one of "all","SI","SI_boot","MI","MI_boot","F", "SHASH".


Method_univOut parameter documentation

Description

Method_univOut parameter documentation

Arguments

method

Character string. One of "SHASH" or "robZ".


Mode parameter documentation

Description

Mode parameter documentation

Arguments

mode

Character string; either "auto" (default) to compute MCD internally or "manual" to use user-supplied values.


Plot Method for RD Analysis Results

Description

Creates diagnostic plots for robust distance analysis results.

Usage

## S3 method for class 'RD'
plot(x, type = c("histogram", "imputations", "univOut"), method = NULL, ...)

Arguments

x

An object of class "RD" from RD() or threshold_RD().

type

Character string specifying plot type: "histogram" (default), "imputations", or "univOut".

method

Character string specifying threshold method. Auto-detected if NULL.

...

Additional arguments passed to plotting functions.

Value

A ggplot object.


Plot F-Distribution Method Results

Description

Creates histogram of robust distances with F-distribution overlay and threshold.

Usage

plot_F_histogram(
  F_result,
  RD_obj,
  alpha = 0.01,
  binwidth = 0.1,
  show_f_density = TRUE,
  ...
)

Arguments

F_result

F_result object from thresh_F().

RD_obj

Pre-computed RD_result object from compute_RD.

alpha

Significance level used to compute RD threshold (default = 0.01 for 99th percentile).

binwidth

Histogram bin width (default = 0.1).

show_f_density

Logical. Show F-distribution curve overlay (default = TRUE).

...

Additional arguments passed to method-specific functions.

Value

A ggplot object.


Plot Robust Distance Histogram with Threshold

Description

Creates a histogram of robust distances with threshold line for outlier detection.

Usage

plot_RD_histogram(thresh_result, RD_obj, alpha = 0.01, binwidth = 0.1, ...)

Arguments

thresh_result

A threshold result object from any threshold method containing threshold information.

RD_obj

Pre-computed RD_result object from compute_RD.

alpha

Significance level used to compute RD threshold (default = 0.01 for 99th percentile).

binwidth

Histogram bin width (default = 0.1).

...

Additional arguments passed to method-specific functions.

Value

A ggplot object with histogram colored by inclusion status and threshold line.


Plot Multiple Threshold Methods on Robust Distance Histogram

Description

Creates a histogram of robust distances with multiple colored threshold lines showing different outlier detection methods simultaneously.

Usage

plot_RD_histogram_multi(
  RD_result,
  RD_obj,
  methods = c("SI", "SI_boot", "MI", "MI_boot", "F"),
  alpha = 0.01,
  binwidth = 0.1,
  ...
)

Arguments

RD_result

An RD result object from threshold_RD() with method="all" containing multiple threshold results in a list.

RD_obj

Pre-computed RD_result object from compute_RD.

methods

Character vector of threshold methods to display (default: c("SI", "SI_boot", "MI", "MI_boot", "F")).

alpha

Significance level used to compute RD threshold (default = 0.01 for 99th percentile).

binwidth

Histogram bin width (default = 0.1).

...

Additional arguments passed to method-specific functions.

Value

A ggplot object with histogram colored by inclusion status and multiple colored threshold lines for comparison of different methods.


Plot Multiple Imputation Results from RD Analysis

Description

Creates time series plots showing original data, temporal imputation, and multiple imputation results with outlier locations highlighted.

Usage

plot_imputations(x)

Arguments

x

An object of class "RD" from RD() or threshold_RD().

Value

Prints ggplot objects for each variable showing imputation results.


Plot Univariate Outliers from RD Analysis

Description

Creates a heatmap visualization of univariate outliers detected in high-kurtosis components.

Usage

plot_univOut(x, cutoff = NULL, method = NULL)

Arguments

x

An object of class "RD" from RD() or threshold_RD().

cutoff

A numeric value indicating how many MADs away from the median to flag as outliers. The default value is set to be 4.

method

Character string. One of "SHASH" or "robZ".

Value

A ggplot object showing a heatmap of outlier locations.


Quantile parameter documentation

Description

Quantile parameter documentation

Arguments

quantile

Numeric in (0, 1); upper-tail probability used for thresholding, i.e. the expected false positive rate of the chosen threshold.


Summary method for Hardin & Rocke F results

Description

Summary method for Hardin & Rocke F results

Usage

## S3 method for class 'F_result'
summary(object, ...)

Arguments

object

An object of class "F_result" or "HR_result"

...

Additional arguments passed to method-specific functions.

Value

NULL, invisibly


Summary method for MI_boot results

Description

Summary method for MI_boot results

Usage

## S3 method for class 'MI_boot_result'
summary(object, ...)

Arguments

object

An object of class "MI_boot_result"

...

Additional arguments passed to method-specific functions.

Value

NULL, invisibly


Summary method for MI results

Description

Summary method for MI results

Usage

## S3 method for class 'MI_result'
summary(object, ...)

Arguments

object

An object of class "MI_result"

...

Additional arguments passed to method-specific functions.

Value

NULL, invisibly


Summary method for SI_boot results

Description

Summary method for SI_boot results

Usage

## S3 method for class 'SI_boot_result'
summary(object, ...)

Arguments

object

An object of class "SI_boot_result"

...

Additional arguments passed to method-specific functions.

Value

NULL, invisibly


Summary method for SI results

Description

Summary method for SI results

Usage

## S3 method for class 'SI_result'
summary(object, ...)

Arguments

object

An object of class "SI_result"

...

Additional arguments passed to method-specific functions.

Value

NULL, invisibly


Thr parameter documentation

Description

Thr parameter documentation

Arguments

thr

Threshold multiplier for outlier detection (default = 4).


Fit F-distribution Parameters for MCD-based Robust Distances

Description

Computes the scaling constant and degrees of freedom for the F-distribution approximation of squared robust Mahalanobis distances based on the Minimum Covariance Determinant (MCD) estimator, following the method of Hardin & Rocke (2005).

Usage

thresh_F(p, n, h, quantile, RD_obj, SHASH = FALSE, verbose = FALSE)

Arguments

p

Integer. The number of variables (dimension of the data).

n

Integer. The total sample size.

h

Integer. The number of observations retained in the MCD subset.

quantile

Numeric in (0, 1); upper-tail probability used for thresholding, i.e. the expected false positive rate of the chosen threshold.

RD_obj

Pre-computed RD_result object from compute_RD.

SHASH

Logical; if TRUE, run the SHASH variant (default = FALSE).

verbose

Logical; if TRUE, print progress messages.

Details

This function is useful for deriving robust outlier detection thresholds in high-dimensional multivariate data contaminated by outliers.

Value

A list with the following elements:

c

Consistency correction factor for robust distances.

m

Estimated degrees of freedom parameter used in the F-distribution.

df

A numeric vector of degrees of freedom: c(df1, df2).

scale

Scale factor for the threshold.

threshold

Threshold for squared robust distances.

flagged_outliers

Integer vector of row indices from original data matrix that exceed the threshold.

call

The matched function call.


Outlier Detection via Multiple Imputation Voting (MI)

Description

Applies robust distance (RD) computation to multiply imputed datasets, derives thresholds, and flags outliers via majority voting. Also computes the lower bound of the 95% confidence interval of the (1 - alpha) quantiles across imputations.

Usage

thresh_MI(
  RD_org_obj,
  imp_datasets,
  alpha = 0.01,
  boot_quant = 0.95,
  verbose = FALSE
)

Arguments

RD_org_obj

Output list from compute_RD on the original data. Must contain $RD, $S_star, and $ind_incld.

imp_datasets

A list of M numeric matrices (T × p); multiply imputed datasets.

alpha

Significance level used to compute RD threshold (default = 0.01 for 99th percentile).

boot_quant

Numeric; confidence level for bootstrap confidence intervals (default = 0.95, for 95% CI).

verbose

Logical; if TRUE, print progress messages.

Value

A list with:

thresholds

Numeric vector of length M; (1 - alpha) quantiles of RD per imputed dataset.

threshold

Lower bound of the confidence interval of thresholds.

call

The matched function call.

flagged_outliers

Integer vector of row indices from original data matrix that exceed the threshold.


Bootstrap-Based Outlier Detection via Multiple Imputation (MI_boot)

Description

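Extends single imputation bootstrapping by using multiple imputation. For each of the M imputed datasets, B bootstrap samples are drawn and the (1 - alpha) quantile of the robust distances is computed for each sample.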

Usage

thresh_MI_boot(
  RD_org_obj,
  imp_datasets,
  B = 1000,
  alpha = 0.01,
  boot_quant = 0.95,
  verbose = FALSE
)

Arguments

RD_org_obj

Output list from compute_RD on the original data. Must contain $RD, $S_star, and $ind_incld.

imp_datasets

A list of M numeric matrices (T × p); multiply imputed datasets.

B

Integer; number of bootstrap samples per imputed dataset (default = 1000).

alpha

Significance level used to compute RD threshold (default = 0.01 for 99th percentile).

boot_quant

Numeric; confidence level for bootstrap confidence intervals (default = 0.95, for 95% CI).

verbose

Logical; if TRUE, print progress messages.

Details

This yields M × B threshold candidates. The lower bound of their confidence interval at level boot_quant (95% by default) is used as the final threshold, which is then applied to the RD of the original data.

Value

A list with:

thresholds

Vector of M×B thresholds from each bootstrap sample.

threshold

Lower bound of CI across thresholds.

flagged_outliers

Integer vector of row indices from original data matrix that exceed the threshold.

call

The matched function call.


Compute SI Threshold for Outlier Detection

Description

Computes a robust distance (RD) threshold based on single imputation (SI), using the robust covariance from the original data (via RD_org_obj) and recomputed mean from the imputed data.

Usage

thresh_SI(RD_org_obj, imp_data, alpha = 0.01, verbose = FALSE)

Arguments

RD_org_obj

Output list from compute_RD on the original data. Must contain $RD, $S_star, and $ind_incld.

imp_data

A numeric matrix (T × p) of single-imputed data.

alpha

Significance level used to compute RD threshold (default = 0.01 for 99th percentile).

verbose

Logical; if TRUE, print progress messages.

Value

A list with:

SI_obj

A list from compute_RD containing robust distances.

threshold

Numeric threshold based on the (1 - alpha) quantile of RD.

flagged_outliers

Integer vector of row indices from original data matrix that exceed the threshold.

call

The matched function call.


Compute SI Boot Thresholds for Outlier Detection

Description

Computes a robust distance (RD)–based threshold using single-imputed data followed by bootstrap resampling over clean (included) indices. Returns the confidence interval bounds of the bootstrapped (1 - alpha) quantiles (99th percentiles by default).

Usage

thresh_SI_boot(
  RD_org_obj,
  imp_data,
  B = 1000,
  alpha = 0.01,
  boot_quant = 0.95,
  verbose = FALSE
)

Arguments

RD_org_obj

Output list from compute_RD on the original data. Must contain $RD, $S_star, and $ind_incld.

imp_data

A numeric matrix (T × p) of single-imputed data.

B

Integer; number of bootstrap samples per imputed dataset (default = 1000).

alpha

Significance level used to compute RD threshold (default = 0.01 for 99th percentile).

boot_quant

Numeric; confidence level for bootstrap confidence intervals (default = 0.95, for 95% CI).

verbose

Logical; if TRUE, print progress messages.

Value

A list with:

thresholds

Vector of (1 - alpha) quantiles of RD (99th percentiles by default), one per bootstrap sample.

threshold

Threshold based on lower bound of the confidence interval.

flagged_outliers

Integer vector of row indices from original data matrix that exceed the threshold.

UB_CI

Upper bound of the confidence interval for the (1 - alpha) quantiles (99th percentiles by default).

call

The matched function call.
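
The bootstrap step can be sketched in base R as follows. This is an illustration under stated assumptions, not the package implementation: a two-sided percentile interval is assumed for the confidence interval, and rd_clean is a synthetic stand-in for the squared robust distances of the included rows.

```r
# Sketch under assumptions: a percentile interval is used for the CI, and
# rd_clean stands in for the squared robust distances of the included rows.
set.seed(1)
rd_clean <- rchisq(500, df = 3)     # synthetic stand-in for clean-row distances
rd_all <- c(rd_clean, 40, 55, 70)   # three planted extreme distances
B <- 1000
alpha <- 0.01
boot_quant <- 0.95

# (1 - alpha) quantile of each bootstrap resample of the clean distances
thresholds <- replicate(B, {
  quantile(sample(rd_clean, replace = TRUE), probs = 1 - alpha, names = FALSE)
})

# Two-sided percentile interval at level boot_quant
tail_prob <- (1 - boot_quant) / 2
ci <- quantile(thresholds, probs = c(tail_prob, 1 - tail_prob), names = FALSE)
threshold <- ci[1]                  # lower bound is used as the cut-off
UB_CI <- ci[2]                      # upper bound, kept for reference
flagged <- which(rd_all > threshold)
```

Using the lower bound makes the cut-off more liberal than the point estimate of the quantile, so more observations are flagged than a single (1 - alpha) quantile would flag.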


Thresh_result parameter documentation

Description

Thresh_result parameter documentation

Arguments

thresh_result

A threshold result object from any threshold method containing threshold information.


Comprehensive Outlier Detection Using Robust Distance Thresholding

Description

Performs univariate outlier detection and imputation, computes robust distances, and applies one or more thresholding methods.

Usage

threshold_RD(
  x,
  w = NULL,
  method = c("SI_boot", "MI", "MI_boot", "SI", "F", "SHASH", "all"),
  RD_obj = NULL,
  impute_method = "mean",
  cutoff = 4,
  trans = "SHASH",
  M = 50,
  k = 100,
  alpha = 0.01,
  quantile = 0.01,
  verbose = FALSE,
  boot_quant = 0.95,
  B = 1000
)

Arguments

x

A numeric matrix or data frame of dimensions T × p (observations × variables).

w

A numeric matrix (n_time × L) of low-kurtosis ICA components used as predictors (required for MI).

method

Character string; one of "SI_boot" (default), "MI", "MI_boot", "SI", "F", "SHASH", or "all".

RD_obj

Pre-computed RD_result object from compute_RD.

impute_method

Character string; imputation method for univariate outliers.

cutoff

A numeric value indicating how many MADs away from the median to flag as outliers. The default value is set to be 4.

trans

Character string; transformation method, one of "SHASH" or "robZ".

M

Integer; number of multiply imputed datasets (default = 50).

k

Integer; number of perturbation cycles per imputation (default = 100).

alpha

Significance level used to compute RD threshold (default = 0.01 for 99th percentile).

quantile

Numeric in (0, 1); upper-tail probability used for thresholding, i.e. the expected false positive rate of the chosen threshold.

verbose

Logical; if TRUE, print progress messages.

boot_quant

Numeric; confidence level for bootstrap confidence intervals (default = 0.95, for 95% CI).

B

Integer; number of bootstrap samples per imputed dataset (default = 1000).

Value

A list with:

thresholds

Result from the specific threshold method, or list of all methods if "all".

RD_obj

The robust distance object from compute_RD().

call

The matched function call.


Trans parameter documentation

Description

Trans parameter documentation

Arguments

trans

Character string; transformation method, one of "SHASH" or "robZ".


Temporal univariate outlier detection using SHASH, robust Yeo-Johnson, or robust MAD.

Description

Detects univariate outliers across time for each variable (column) using the approach selected by method ("SHASH" or "robZ").

Usage

univOut(x, cutoff = 4, method = c("SHASH", "robZ"))

Arguments

x

A numeric matrix or data frame of dimensions T × p (observations × variables).

cutoff

A numeric value indicating how many MADs away from the median to flag as outliers. The default value is set to be 4.

method

Character string. One of "SHASH" or "robZ".

Value

A list with elements:

outliers

Logical matrix of the same dimension as x, indicating detected outlier locations (TRUE = outlier).

method

A character string indicating the transformation method used.

call

The matched function call.


Verbose parameter documentation

Description

Verbose parameter documentation

Arguments

verbose

Logical; if TRUE, print progress messages.


W parameter documentation

Description

W parameter documentation

Arguments

w

A numeric matrix (n_time × L) of low-kurtosis ICA components used as predictors (required for MI).


X parameter documentation

Description

X parameter documentation

Arguments

x

A numeric matrix or data frame of dimensions T × p (observations × variables).
