---
title: "Chapter 12: The nmathopencl R API --- Distribution Functions on the GPU"
author: "Kjell Nygren"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Chapter 12: The nmathopencl R API --- Distribution Functions on the GPU}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
```

## Overview

`nmathopencl` exports a family of R functions that mirror R's
`stats::` distribution functions but dispatch computation to an OpenCL
device when one is available. The calling convention is intentionally close
to the base-R equivalents so that substitution is straightforward.

All exported functions share three common arguments:

| Argument | Type | Default | Meaning |
|----------|------|---------|---------|
| `fallback` | logical | `FALSE` | If `TRUE` while `nmathopencl_has_opencl()` is `TRUE`, fall back to the `stats::` analogue when the OpenCL call fails; ignored when OpenCL is unavailable (CPU is used anyway) |
| `verbose` | logical | `FALSE` | Extra diagnostics (including when picking the CPU path with no OpenCL runtime) |
| `log` / `lower.tail` | logical | distribution-specific | Mirror the `stats::` convention for the same function |

## Checking availability

```{r, eval = FALSE}
library(nmathopencl)

# TRUE if this nmathopencl build was compiled with USE_OPENCL
nmathopencl_has_opencl()

# GPU device names on the host (opencltools --- system inventory)
opencltools::gpu_names()
```

If OpenCL is not available (`nmathopencl_has_opencl()` returns `FALSE`), calls use the CPU
equivalent unconditionally so examples and CI keep working.

By default `fallback = FALSE`: when OpenCL **is** available but a dispatch fails,
the error propagates---useful while debugging kernels. Pass `fallback = TRUE`
only when you deliberately want GPU failures masked with the CPU analogue.

## Normal distribution

```{r, eval = FALSE}
x  <- rnorm(1e6)

# Density
d  <- dnorm_opencl(x, mean = 0, sd = 1, log = FALSE)

# CDF --- same core arguments as stats::pnorm(q, mean, sd, lower.tail, log.p);
#          plus opencl_parallel, fallback, verbose. Long outputs use recycling length.
p <- pnorm_opencl(
  rep(1.96, 1e6),
  mean = 0,
  sd = 1,
  lower.tail = TRUE,
  log.p = FALSE
)

# Quantile (`qnorm_opencl` retains leading `n` + scalar `p`, unlike `stats::qnorm`-style vectors)
q  <- qnorm_opencl(n = 1e6, p = 0.975, mean = 0, sd = 1)

# Random draws
r  <- rnorm_opencl(n = 1e6, mean = 0, sd = 1)
```

Note the signature difference between `dnorm_opencl` (takes a vector `x`)
and `qnorm_opencl` / `rnorm_opencl`, which retain a leading `n` argument for replicate-many quantiles (`qnorm`) or draws (`rnorm`).
For the CDF, `pnorm_opencl` parallels `stats::pnorm(q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)` and adds only `opencl_parallel`, `fallback`, and `verbose`.
Output length is the usual recycled length (`max(length(q), ...)`, as in stats).
On GPU (when used), each output element runs `pnorm_kernel` once in sequence (high overhead until a batched kernel is added).

## Distribution families

### Gamma

```{r, eval = FALSE}
dgamma_opencl(n, x, shape = 2, scale = 1)
pgamma_opencl(q, shape = 2, rate = 1)
rgamma_opencl(n, shape = 2, scale = 1)
# qgamma: use stats::qgamma(); the OpenCL kernel is internal-only
# (device compile failure, see inst/OPENCL_KERNEL_KNOWN_FAILURES.md)
```

### Binomial

```{r, eval = FALSE}
dbinom_opencl(x, size = 10, prob = 0.3)
pbinom_opencl(q, size = 10, prob = 0.3)
qbinom_opencl(p, size = 10, prob = 0.3)
rbinom_opencl(n, size = 10, prob = 0.3)
```

### Poisson

```{r, eval = FALSE}
dpois_opencl(x, lambda = 3)
ppois_opencl(q, lambda = 3)
qpois_opencl(p, lambda = 3)
rpois_opencl(n, lambda = 3)
```

### Beta

```{r, eval = FALSE}
dbeta_opencl(x, shape1 = 2, shape2 = 5)
pbeta_opencl(q, shape1 = 2, shape2 = 5)
rbeta_opencl(n, shape1 = 2, shape2 = 5)
# qbeta: use stats::qbeta(); the OpenCL kernel is internal-only
# (device link failure, see inst/OPENCL_KERNEL_KNOWN_FAILURES.md)
```

### Additional families

All families below follow the same four-function pattern (`d*`, `p*`, `q*`,
`r*`) with the same `fallback` / `verbose` arguments:

| Family | R file |
|--------|--------|
| Cauchy | `cauchy_opencl.R` |
| Chi-squared | `chisq_opencl.R` |
| Exponential | `exponential_opencl.R` |
| F | `f_opencl.R` |
| Geometric | `geometric_opencl.R` |
| Hypergeometric | `hypergeometric_opencl.R` |
| Log-normal | `lnorm_opencl.R` |
| Logistic | `logistic_opencl.R` |
| Negative binomial | `negative_binomial_opencl.R` |
| t | `t_opencl.R` |
| Uniform | `uniform_opencl.R` |
| Weibull | `weibull_opencl.R` |
| Multivariate discrete | `multinomial_opencl.R` |
| Tukey | `tukey_opencl.R` |

The Wilcoxon rank-sum (`rwilcox_opencl.R`), Wilcoxon signed-rank
(`signrank_opencl.R`), and Bessel (`bessel_opencl.R`) wrappers are
internal-only: their OpenCL ports have documented allocation/runtime
failures (see `inst/OPENCL_KERNEL_KNOWN_FAILURES.md` and the notes in
`inst/cl/`). Use the `stats`/`base` equivalents instead.

### Noncentral distributions

Noncentral variants (`dnt_opencl`, `dnchisq_opencl`, `dnf_opencl`,
`dnbeta_opencl`, and their `p*`, `q*` counterparts) are available with the
noncentrality parameter `ncp`.

## Special functions and math support

```{r, eval = FALSE}
# Log-gamma
lgammafn_opencl(n, x)
lgammafn_sign_opencl(n, x)

# Gamma function
gammafn_opencl(n, x)

# Log-gamma at x+1
lgamma1p_opencl(n, x)

# Digamma / polygamma
digamma_opencl(n, x)
trigamma_opencl(n, x)
psigamma_opencl(n, x, deriv)

# Beta and log-beta
beta_special_opencl(n, a, b)
lbeta_special_opencl(n, a, b)

# Binomial coefficients
choose_special_opencl(n_out, n, k)
lchoose_special_opencl(n_out, n, k)

# Logspace arithmetic
logspace_add_opencl(n, logx, logy)
logspace_sub_opencl(n, logx, logy)
logspace_sum_opencl(n, logx, logy)

# Math support
fmax2_opencl(n, x, y)
fmin2_opencl(n, x, y)
```

These are exported via `special_opencl.R` and `math_support_opencl.R`.

## R extension utilities (`R_ext`)

```{r, eval = FALSE}
# R-compatible power
r_pow_opencl(n, x, y)
r_pow_di_opencl(n, x, i)

# Miscellaneous math helpers
log1pmx_opencl(n, x)
log1pexp_opencl(n, x)
log1mexp_opencl(n, x)
pow1p_opencl(n, x, y)
```

These are exported via `rext_utils_opencl.R`.

## RNG core

```{r, eval = FALSE}
# Base RNG draws
norm_rand_opencl(n)
unif_rand_opencl(n)
exp_rand_opencl(n)
r_unif_index_opencl(n, dt)
```

These generate random samples using device-side implementations of the
same Kinderman-Ramage / Ahrens-Dieter algorithms used by R's internal RNG.
They are exported via `rng_core_opencl.R`.

## Fallback behavior in detail

1. **`nmathopencl_has_opencl()` is `FALSE`** (no runtime device / not compiled with OpenCL): the wrapper
   always uses the CPU `stats::`/base analogue. The **`fallback` argument does not gate this**;
   it is ignored for this branch.

2. **`nmathopencl_has_opencl()` is `TRUE`**, but OpenCL enqueue/compile/bind fails inside the `.Call`: the outer
   R shim catches the failure. **`fallback = FALSE` (default):** propagate the error.
   **`fallback = TRUE`:** return the CPU analogue instead (optional masking for fragile stacks).

Runnable examples under `inst/examples/` mostly use **`fallback = FALSE`** so `R CMD check`
surfaces real OpenCL breakage when a device is present.

```{r, eval = FALSE}
# Default --- surface GPU/build errors during development / local check with OpenCL enabled:
dnorm_opencl(x)

# Permit CPU masking while OpenCL is flaky (not the development default):
dnorm_opencl(x, fallback = TRUE, verbose = TRUE)
```

## Performance notes

GPU acceleration is most beneficial for **large vectors** (>= ~10,000
elements) and for functions with non-trivial per-element computation
(e.g., `pgamma`, `pt`, `qnorm`). For very small inputs the kernel
compilation overhead and data transfer latency dominate.

The first call in an R session may be slightly slower because OpenCL JIT-
compiles the kernel source for the specific device. Subsequent calls with
the same kernel reuse the compiled program from the driver's cache.