---
title: "Chapter 03: Structure of nmath Kernel Programs"
author: "Kjell Nygren"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Chapter 03: Structure of nmath Kernel Programs}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
```

## Why "programs" rather than compiled objects

Normal C++ code in an R package is compiled ahead of time by the package build
toolchain and linked into a DLL/`.so`. OpenCL device code works differently:
the GPU source is compiled **at runtime**, on the device, by the GPU driver.
This means there is no offline linker for device code --- all GPU code that will
run together must be collected into a single source string before the driver
sees it.

In OpenCL terms, that string is a **program**. The C++ call sequence is:

```cpp
// kernel_source is one big string containing all device code
program = clCreateProgramWithSource(context, 1, &src_ptr, &src_len, &status);
clBuildProgram(program, 0, nullptr, nullptr, nullptr, nullptr);
kernel  = clCreateKernel(program, "dnorm_kernel", &status);
```

`clBuildProgram` invokes the vendor GPU compiler on `kernel_source` and
produces a device binary.  A build failure here is equivalent to a C++
compilation error, except it happens at the moment of first use, not at `R CMD
INSTALL` time.

## No `#include` on the device

OpenCL C is a restricted dialect of C99.  There is no file system visible to
device code: `#include` is simply not available inside an OpenCL program string.
Every declaration or definition that a kernel function needs must appear
**earlier in the same string**.

This creates the fundamental porting challenge for `nmathopencl`.  R's `nmath`
library is written as standard C99 and relies on a chain of `#include`
directives:

```c
#include "nmath.h"       /* includes R_ext/Arith.h, R_ext/Error.h, ... */
#include "dpq.h"
```

Those headers pull in standard C types (`FILE*`, `size_t`, `int32_t`),
R-specific runtime types (`SEXP`, `Rboolean`), and R callbacks
(`error()`, `warning()`, `R_alloc()`).  None of these concepts exist on a GPU.

The solution is to assemble the program string in layers, prepending
OpenCL-compatible replacements for everything the nmath sources expect to find
via `#include`.

## The four-layer program structure

A complete `nmathopencl` kernel program is a concatenation of four layers:

```
┌─────────────────────────────────────────────────────────────┐
│  Layer 0 – OPENCL.cl                                         │
│  Global configuration: fp64 pragma, HAVE_* macros, ML_NAN   │
├─────────────────────────────────────────────────────────────┤
│  Layer 1 – Upstream shims  (inst/cl/R_ext_types/, R_shims/, │
│                              R_ext_runtime/, System/, ...)   │
│  OpenCL-C replacements for C headers the nmath sources need  │
├─────────────────────────────────────────────────────────────┤
│  Layer 2 – nmath library  (inst/cl/nmath/)                   │
│  The ported Mathlib functions, exactly the subset required   │
│  by this kernel (determined by @all_depends_nmath)           │
├─────────────────────────────────────────────────────────────┤
│  Layer 3 – Kernel function  (inst/cl/src/)                   │
│  The __kernel entry point: GPU-parallel version of a CPU     │
│  loop over the nmath function                                │
└─────────────────────────────────────────────────────────────┘
```

Each layer is a plain text block that is physically concatenated (in order)
to form the final string passed to `clCreateProgramWithSource`.

## Layer 0: `OPENCL.cl` --- the global configuration header

`inst/cl/OPENCL.cl` is always placed at the very top.  It sets the conditions
under which the rest of the program will compile:

```c
#pragma OPENCL EXTENSION cl_khr_fp64 : enable   /* require 64-bit floats */

/* Enable expm1/log1p built-ins on OpenCL >= 1.2 */
#if defined(__OPENCL_C_VERSION__) && (__OPENCL_C_VERSION__ >= 120)
  #define HAVE_EXPM1 1
  #define HAVE_LOG1P 1
#endif
#undef HAVE_LONG_DOUBLE      /* no long double on GPU */

#define ML_NAN   (0.0/0.0)
#define ML_POSINF INFINITY
#define ML_NEGINF -INFINITY
#define INLINE static inline
```

This one file handles the biggest portability concern: double-precision support
(`cl_khr_fp64`) must be declared before any `double` variable appears in device
code, and the `HAVE_*` macros tell the nmath sources which C99 math built-ins
are available.

## Layer 1: The upstream shim layer

R's nmath sources transitively include a long chain of headers.  The shim layer
provides OpenCL C replacements, grouped by what they replace:

### Type stubs

| Files | What they replace |
|-------|-------------------|
| `System/stdint.cl` | `<stdint.h>`: `int32_t`, `uint64_t`, ... using OpenCL built-in integer types |
| `R_ext_types/Boolean.cl`, `Complex.cl`, `Constants.cl`, etc. | Type-only parts of `<R_ext/Boolean.h>`, `<R_ext/Complex.h>`, etc. |
| `R_shims/Rinternals.cl` | Skeletal `SEXP`, `SEXPREC`, `Rboolean` so transitively included headers compile |

None of these types are *used* by the nmath math logic; the stubs exist so the
preprocessor does not abort on an unrecognized identifier.

### Capability macros

`R_shims/Rconfig.cl` reproduces what R's `configure` script writes into
`config.h`:

```c
#define HAVE_EXPM1 1
#define HAVE_LOG1P 1
#define IEEE_754   1
```

Without this, conditional compilation branches inside nmath would take the
wrong path (e.g., falling back to a Taylor series instead of calling the
built-in `expm1`).

### Runtime no-ops

R's `error()` and `warning()` are host-side callbacks that `longjmp` into the R
runtime --- they cannot exist on the GPU.  `R_ext_runtime/Error.cl` replaces
them with silent macros:

```c
#define error(...)   /* no-op */
#define warning(...) /* no-op */
```

Similarly, `R_ext_runtime/Memory.cl` stubs `R_alloc()` and friends (nmath's hot
paths do not actually allocate, so these stubs satisfy the preprocessor without
ever executing), and `R_ext_runtime/Print.cl` stubs `Rprintf`.

### Math declarations

`R_ext_runtime/Arith.cl` is the most actively *used* shim.  It maps R's
numeric predicates to their OpenCL equivalents:

```c
#define R_FINITE(x)  isfinite(x)
#define ISNAN(x)     isnan(x)
```

`nmath/Rmath.cl` and `nmath/nmath.cl` close out the shim layer by providing the
forward declarations, mathematical constants, and error-code macros that the
nmath function files expect to find in `Rmath.h` / `nmath.h`.

### The design principle

Every shim is **minimally intrusive**: it defines only what the nmath math logic
actually needs, and stubs or silences everything else.  The goal is that the
nmath `.cl` files are as close to verbatim copies of the C sources as possible;
the shims carry all the differences.

## Layer 2: The nmath library and its annotation scheme

`inst/cl/nmath/` contains approximately 180 `.cl` files, one per source file
in R's `src/nmath/`.  They implement the full suite of Mathlib functions in
OpenCL C.

### `@provides` and `@depends` annotations

Because the program is assembled at runtime, the loader must know the
correct concatenation order.  Each nmath file declares what it provides
and what it depends on:

```c
// From inst/cl/nmath/dnorm.cl
// @source_origin: dnorm.c
// @depends:  nmath, dpq
// @provides: dnorm, dnorm4
// @all_depends: dpq, Rmath, nmath
// @load_order: 31
```

`opencltools::load_kernel_library()` reads these tags, builds a dependency graph, and
performs a topological sort --- files with no unsatisfied dependencies are
emitted first, then files whose dependencies have been emitted, and so on.
The result is a load order in which every function is declared before it is
called.

### Only the required subset is loaded

A kernel that calls `dnorm` needs only 4 nmath files.  A kernel that calls
`dbinom` needs 18.  Loading all ~180 files for every kernel would be wasteful
and slow to compile.  Instead, each kernel file carries a pre-computed
`@all_depends_nmath` annotation listing the exact transitive closure:

```c
// dnorm_kernel.cl  -- needs 4 files
// @all_depends_nmath: dpq, Rmath, nmath, dnorm

// dbinom_kernel.cl -- needs 18 files
// @all_depends_nmath: dpq, refactored, Rmath, nmath, stirlerr_cycle_free,
//   chebyshev, cospi, fmax2, gammalims, lgammacor, log1p, gamma, lgamma,
//   pgamma_utils, stirlerr_cycle_dependent, bd0, stirlerr, dbinom
```

`load_library_for_kernel()` reads `@all_depends_nmath`, looks up each stem in
a pre-built index (`kernel_dependency_index.tsv`) to get the global load order,
and concatenates exactly those files in the correct sequence.  No topological
sort is computed at install time --- the sort was done once when the index was
built.

## Layer 3: The kernel function

The `__kernel` function is the entry point that the GPU driver exposes.
Conceptually it is a **GPU-parallel version of a CPU `for` loop**:

```c
/* CPU equivalent:
   for (int i = 0; i < n; i++) out[i] = dnorm(x[i], mu[i], sigma[i], gl); */

__kernel void dnorm_kernel(
    __global const double* x,
    __global const double* mu,
    __global const double* sigma,
    __global const int*    give_log,
    __global double*       out,
    const int              len
) {
    int i = get_global_id(0);    /* this work-item handles element i */
    if (i >= len) return;
    int gl = (give_log[i] != 0) ? 1 : 0;
    out[i] = dnorm(x[i], mu[i], sigma[i], gl);   /* same nmath call */
}
```

`get_global_id(0)` returns the index of the current work-item in the first
dimension of the work grid.  The host launches exactly `n` work-items so each
element of the input vector is handled by one work-item in parallel.

The kernel calls `dnorm` --- the same function name as in the C source.  All
the layers below it (shims + nmath files) ensure that `dnorm` is defined and
reachable in the same program string.

## Assembling the program in C++

`build_rmath_opencl_program()` in `src/kernel_loader.cpp` shows precisely how
the four layers are concatenated:

```cpp
std::string build_rmath_opencl_program(
    const std::string& kernel_relative_path,
    const std::string& package,
    const std::string& nmath_depends_annotation)
{
  // Load exactly the nmath subset this kernel needs (via @all_depends_nmath)
  const std::string nmath_src = load_library_for_kernel(
      kernel_relative_path, "nmath", package, nmath_depends_annotation);

  return
    load_kernel_source("OPENCL.cl",      package)        + "\n"  // Layer 0
    + load_kernel_library("libR_shims",  package, false) + "\n"  // Layer 1
    + load_kernel_library("R_ext_types", package, false) + "\n"
    + load_kernel_library("R_shims",     package, false) + "\n"
    + load_kernel_library("R_ext_runtime", package, false) + "\n"
    + load_kernel_library("R_ext_internals", package, false) + "\n"
    + load_kernel_library("System",      package, false) + "\n"
    + nmath_src                                          + "\n"  // Layer 2
    + load_kernel_source(kernel_relative_path, package);         // Layer 3
}
```

The returned string is passed directly to `clCreateProgramWithSource` and then
compiled by the device driver.  At that point the GPU can execute the kernel
function.

Note the dual guard: each loader function itself is compiled to a no-op when
`USE_OPENCL` is absent (see Chapter 02), so `build_rmath_opencl_program` is
always callable but returns an empty string on CPU-only builds.  Callers
check `nmathopencl_has_opencl()` before dispatching to the GPU path.

## What this means for downstream developers

A downstream package that builds on `nmathopencl` does not need to replicate
this assembly machinery.  Kernel text loading uses **opencltools**
(`load_kernel_source`, `load_kernel_library` with `package = "nmathopencl"`).
`load_library_for_kernel` is re-exported from **nmathopencl** (and available via
`LinkingTo: nmathopencl`) and handles all of
layers 0, 1, and 2 automatically.

The downstream developer's responsibilities are:

1. **Write the `__kernel` function** (Layer 3): call the nmath functions your
   computation needs.
2. **Annotate it** with `@depends_nmath` and `@all_depends_nmath` so the loader
   knows which nmath subset to pull in.
3. **Write the C++ runner** that calls `build_rmath_opencl_program()` (or the
   equivalent chain), dispatches work-items, and transfers results.

Chapter 07 covers how to write and annotate kernel functions.  Chapter 10
walks through all three steps end-to-end for the `ex_glmbayes` example.