---
title: "Chapter 10: Case Study --- Building Custom GLM Kernels (ex_glmbayes)"
author: "Kjell Nygren"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Chapter 10: Case Study --- Building Custom GLM Kernels (ex_glmbayes)}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
```

## Overview

Chapters 03-10 describe the infrastructure of `nmathopencl`: the shim layer,
the `nmath` library, the kernel format, and the runner API. This chapter
shows how to put all of it together by walking through a concrete downstream
use-case: the `ex_glmbayes` example built into the package.

`ex_glmbayes` implements GPU-accelerated evaluation of the log-posterior
and gradient (`f2`/`f3`) for Bayesian GLMs (Gaussian, Gamma, binomial with
logit/probit/cloglog links, and Poisson). It is the type of component a
package author would build on top of `nmathopencl` to accelerate their own
likelihood computations. The name `ex_glmbayes` signals that this is an
example, not a core part of the library; it lives entirely in
`ex_glmbayes_*`-prefixed files and within the `ex_glmbayes` C++ namespace.

## Step 1 --- Identify the nmath functions your kernel needs

For each GLM family, determine which nmath functions the GPU kernel must
call:

| Kernel | nmath function(s) called | Link function |
|--------|--------------------------|---------------|
| `f2_f3_binomial_logit.cl` | `dbinom_raw` | logit |
| `f2_f3_binomial_probit.cl` | `dbinom_raw`, `pnorm5`, `dnorm4` | probit |
| `f2_f3_binomial_cloglog.cl` | `dbinom_raw` | cloglog |
| `f2_f3_gamma.cl` | `dgamma` | log |
| `f2_f3_gaussian.cl` | `dnorm4` | identity/log |
| `f2_f3_poisson.cl` | `lgamma` (OpenCL built-in) | log |

For `f2_f3_poisson`, `lgamma` is an OpenCL built-in (part of the OpenCL 1.1
math specification) so no nmath files are needed at all. For the others,
consult the `@all_depends_nmath` tag of each kernel file to get the full
transitive dependency list.

## Step 2 --- Extract the minimal nmath subset

The `@all_depends_nmath` tag in each kernel file gives the complete
transitive closure. Taking the union across the five non-Poisson kernels
yields 20 files (17 nmath functions + `Rmath`, `nmath`, `refactored` as
infrastructure):

```
Rmath, nmath, refactored,
chebyshev, cospi, fmax2, gammalims, lgammacor,
gamma, lgamma,
stirlerr_cycle_free, pgamma_utils, stirlerr_cycle_dependent, stirlerr, bd0,
dpois, dgamma, dnorm, pnorm, dbinom
```

These files are stored in `inst/cl/ex_glmbayes_nmath/`. Rather than
copying them manually, use `extract_library_subset()` to derive this
directory automatically from the kernel annotations:

```{r, eval = FALSE}
nmath_dir <- system.file("cl", "nmath",           package = "nmathopencl")
src_dir   <- system.file("cl", "ex_glmbayes_src", package = "nmathopencl")

# Load the pre-built dependency index (read once, reuse across calls)
idx <- readRDS(file.path(nmath_dir, "kernel_dependency_index.rds"))

result <- extract_library_subset(
    kernel_paths = list.files(src_dir, pattern = "\\.cl$", full.names = TRUE),
    library_dir  = nmath_dir,
    dest_dir     = "inst/cl/ex_glmbayes_nmath",
    depends_tag  = "depends_nmath",
    index        = idx
)
```

`extract_library_subset()` copies the 20 `.cl` files **plus both index
files** (`kernel_dependency_index.rds` and `kernel_dependency_index.tsv`)
into the destination directory. The index files are required for the
indexed loading described in Step 5.

A package developer would create an analogous subdirectory for their own
kernel set using the same function.

## Step 3 --- Write the kernel files

Each kernel file in `inst/cl/ex_glmbayes_src/` starts with the dependency
metadata block followed by the kernel body:

```c
// @library_deps: nmath
// @calls_nmath: dbinom_raw
// @depends_nmath: dbinom
// @all_depends_nmath_count: 17
// @all_depends_nmath: Rmath, nmath, refactored, chebyshev, cospi, fmax2,
//   gammalims, lgammacor, gamma, lgamma, stirlerr_cycle_free, pgamma_utils,
//   stirlerr_cycle_dependent, stirlerr, bd0, dbinom
// @calls_opencl_builtin: (none)

#pragma OPENCL EXTENSION cl_khr_fp64 : enable
#pragma OPENCL EXTENSION cl_khr_printf : enable

#define MAX_L2 64   // upper bound on l2

__kernel void f2_f3_binomial_logit(
    __global const double* X,      // design matrix,  size = l1 x l2, col-major
    __global const double* B,      // grid,            size = m1 x l2
    __global const double* mu,     // prior mean,      length = l2
    __global const double* P,      // prior precision, size = l2 x l2
    __global const double* alpha,  // offset,          length = l1
    __global const double* y,      // response,        length = l1
    __global const double* wt,     // weights,         length = l1
    __global double*       qf,     // out: neg log-posterior, length = m1
    __global double*       grad,   // out: gradient,   size = m1 x l2
    const int l1,
    const int l2,
    const int m1
) {
    int j = get_global_id(0);
    if (j >= m1) return;

    // prior quadratic form: 0.5 * (B_j - mu)' P (B_j - mu)
    double tmp[MAX_L2];
    for (int k = 0; k < l2; ++k) {
        double acc = 0.0;
        for (int l = 0; l < l2; ++l)
            acc += P[k*l2 + l] * (B[j*l2 + l] - mu[l]);
        tmp[k] = acc;
    }
    double res_acc = 0.0;
    for (int k = 0; k < l2; ++k)
        res_acc += 0.5 * (B[j*l2 + k] - mu[k]) * tmp[k];

    // gradient: prior part
    double g_loc[MAX_L2];
    for (int k = 0; k < l2; ++k) g_loc[k] = tmp[k];

    // data likelihood: logistic link + dbinom_raw
    for (int i = 0; i < l1; ++i) {
        double dot = -alpha[i];
        for (int k = 0; k < l2; ++k)
            dot -= X[k*l1 + i] * B[j*l2 + k];
        double e = exp(dot <= 0 ? dot : -dot);
        double p = dot <= 0 ? 1.0/(1.0+e) : e/(1.0+e);
        double q = 1.0 - p;
        res_acc += -dbinom_raw(y[i], wt[i], p, q, 1);
        double resid = p < 0.5 ? p*wt[i] - y[i]*wt[i]
                                : (1-y[i])*wt[i] - q*wt[i];
        for (int k = 0; k < l2; ++k)
            g_loc[k] += X[k*l1 + i] * resid;
    }

    qf[j] = res_acc;
    for (int k = 0; k < l2; ++k)
        grad[k * m1 + j] = g_loc[k];
}
```

Key design choices:

- `MAX_L2 64` --- a compile-time upper bound on the number of predictors,
  used to size local arrays (OpenCL C does not support variable-length
  arrays in all implementations).
- Column-major layout for `X` (matches R's memory layout for matrices
  passed from R to C without transposition).
- A single work-item per grid point `j`; all `l1` observations are
  processed sequentially within that work-item. This is appropriate when
  `m1` (number of grid points) is much larger than `l1` (number of
  observations), as is typical for envelope sampling.

## Step 4 --- Write the kernel runner (C++)

The kernel runner is the low-level C++ function that manages the OpenCL
lifecycle for this specific kernel. It lives in
`ex_glmbayes_kernel_runners.cpp` inside the `ex_glmbayes::opencl` namespace:

```cpp
// ex_glmbayes_kernel_runners.cpp (abbreviated)
namespace ex_glmbayes {
namespace opencl {

void f2_f3_kernel_runner(
    const std::string& kernel_source,
    const char*        kernel_name,
    int l1, int l2, int m1,
    const std::vector<double>& X_flat,
    const std::vector<double>& B_flat,
    const std::vector<double>& mu_flat,
    const std::vector<double>& P_flat,
    const std::vector<double>& alpha_flat,
    const std::vector<double>& y_flat,
    const std::vector<double>& wt_flat,
    std::vector<double>& qf_flat,
    std::vector<double>& grad_flat
) {
    // Select platform and device
    // Create context, command queue, program, kernel
    // Create input buffers (CL_MEM_READ_ONLY) and output buffers (CL_MEM_WRITE_ONLY)
    // Set kernel arguments
    // Enqueue kernel with global_size = m1
    // Read back qf_flat and grad_flat
    // Release all resources
}

}} // namespace ex_glmbayes::opencl
```

The runner uses the error-handling utilities from `openclPort.h`
(`opencl_make_context_error`, `opencl_status_name`) to produce informative
exceptions on failure.

## Step 5 --- Write the kernel wrapper (C++)

The kernel wrapper is the R-facing `[[Rcpp::export]]` function. It
validates and converts R inputs, selects the kernel name based on
family/link, assembles the program string using the dependency index, calls
the runner, and returns results as R objects. It lives in
`ex_glmbayes_kernel_wrappers.cpp`:

```cpp
// ex_glmbayes_kernel_wrappers.cpp (abbreviated)
namespace ex_glmbayes {
namespace opencl {

// [[Rcpp::export]]
Rcpp::List f2_f3_opencl(
    std::string family, std::string link,
    Rcpp::NumericMatrix b,  Rcpp::NumericVector y,
    Rcpp::NumericMatrix x,  Rcpp::NumericMatrix mu,
    Rcpp::NumericMatrix P,  Rcpp::NumericVector alpha,
    Rcpp::NumericVector wt, int progbar)
{
    int l1 = x.nrow(), l2 = x.ncol(), m1 = b.ncol();

    // 1. Flatten R matrices into contiguous vectors
    auto X_flat  = openclPort::flattenMatrix(x);
    auto B_flat  = openclPort::flattenMatrix(b);
    // ... (all inputs)

    // 2. Dispatch: set kernel_name and kernel_file based on family/link
    std::string kernel_name, kernel_file;
    if (family == "binomial" && link == "logit") {
        kernel_name = "f2_f3_binomial_logit";
        kernel_file = "ex_glmbayes_src/f2_f3_binomial_logit.cl";
    } // ... (all families)

    // 3. Assemble program source using the pre-built dependency index.
    //    load_library_for_kernel() reads @depends_nmath from the kernel file,
    //    looks up the transitive closure in kernel_dependency_index.tsv, and
    //    loads only the required subset in the correct order --- no sort at runtime.
    std::string nmath_src = openclPort::load_library_for_kernel(
        kernel_file,          // kernel .cl path (relative to inst/cl/)
        "ex_glmbayes_nmath",  // library subdir containing index + .cl files
        "nmathopencl",        // package
        "depends_nmath"       // annotation tag listing direct nmath entry points
    );
    std::string kernel_src = openclPort::load_kernel_source(
        kernel_file, "nmathopencl");
    std::string program_source = nmath_src + "\n" + kernel_src;

    // 4. Allocate output buffers
    std::vector<double> qf_flat(m1);
    std::vector<double> grad_flat(static_cast<size_t>(m1) * l2);

    // 5. Run the kernel
    f2_f3_kernel_runner(program_source, kernel_name.c_str(),
                        l1, l2, m1,
                        X_flat, B_flat, mu_flat, P_flat,
                        alpha_flat, y_flat, wt_flat,
                        qf_flat, grad_flat);

    // 6. Wrap outputs as R objects and return
    Rcpp::NumericVector qf(qf_flat.begin(), qf_flat.end());
    // ... (reshape grad_flat into a matrix)
    return Rcpp::List::create(Rcpp::Named("qf") = qf, ...);
}

}} // namespace ex_glmbayes::opencl
```

### Why `load_library_for_kernel()` rather than `load_kernel_library()`?

`load_kernel_library()` sorts and concatenates **every** file in the
subdirectory on every call. For a 137-file library like `nmath` this requires
reading all files sequentially. `load_library_for_kernel()` reads the
pre-built `kernel_dependency_index.tsv` once, then reads only the small
subset of files that the specific kernel actually needs --- 3 files for
`f2_f3_gaussian`, 17 for `f2_f3_binomial_logit`, and so on. The savings are
significant for large libraries and become important when the kernel wrapper is
called repeatedly.

The old full-library approach is still available for cases where the exact
dependency set is unknown or all files are genuinely needed:

```cpp
// Old approach --- loads all files in the subdirectory (kept for reference)
// std::string nmath_src = openclPort::load_kernel_library(
//     "ex_glmbayes_nmath", "nmathopencl", false);
```

## Step 6 --- Write the Rcpp export wrappers (C++ and R)

The `[[Rcpp::export]]` annotation on `f2_f3_opencl` causes
`Rcpp::compileAttributes()` to generate:

- A `.Call` entry in `RcppExports.cpp`
- An R wrapper in `RcppExports.R`

The R-side convenience function in `ex_glmbayes.R` adds a `nmathopencl_has_opencl()`
guard and CPU fallback:

```{r, eval = FALSE}
f2_f3_opencl_R <- function(family, link, b, y, x, mu, P, alpha, wt,
                             use_opencl = TRUE) {
  if (use_opencl && nmathopencl_has_opencl()) {
    f2_f3_opencl(family, link, b, y, x, mu, P, alpha, wt, progbar = 0L)
  } else {
    f2_f3_non_opencl(family, link, b, y, x, mu, P, alpha, wt)
  }
}
```

## Step 7 --- Implement the CPU fallback

The CPU fallback uses the same mathematical logic as the kernel but
executes in C++ using standard R math calls. The relevant code lives in:

- `ex_glmbayes_famfuncs_binomial.cpp` --- binomial family functions
- `ex_glmbayes_famfuncs_poisson.cpp` --- Poisson family
- `ex_glmbayes_famfuncs_Gamma.cpp` --- Gamma family
- `ex_glmbayes_famfuncs_gaussian.cpp` --- Gaussian family
- `ex_glmbayes_EnvelopeEval.cpp` --- orchestrates the CPU path
- `ex_glmbayes_EnvelopeSize.cpp` --- memory allocation for the envelope

These files are grouped under the `ex_glmbayes::fam` and `ex_glmbayes::env`
sub-namespaces. They use R's standard `<Rmath.h>` functions (the same
mathematical functions as the GPU kernels, but via the C runtime rather
than OpenCL).

## File inventory

The complete set of files contributing to the `ex_glmbayes` example:

### OpenCL source (`inst/cl/`)

| Directory | Files |
|-----------|-------|
| `ex_glmbayes_nmath/` | 20 files: the minimal nmath subset |
| `ex_glmbayes_src/` | 6 files: one `f2_f3_*.cl` kernel per GLM family |
| `ex_glmbayes_draft_src/` | Draft versions (work in progress) |

### C++ source (`src/`)

| File | Namespace | Role |
|------|-----------|------|
| `ex_glmbayes_kernel_runners.cpp` | `ex_glmbayes::opencl` | Low-level OpenCL runner for the `f2_f3` kernels |
| `ex_glmbayes_kernel_wrappers.cpp` | `ex_glmbayes::opencl` | R-facing wrapper: input validation, program assembly, runner dispatch |
| `ex_glmbayes_export_wrappers.cpp` | (Rcpp export) | `[[Rcpp::export]]` entry points for `EnvelopeSize`, `EnvelopeEval`, `glmb_Standardize_Model` |
| `ex_glmbayes_famfuncs_binomial.cpp` | `ex_glmbayes::fam` | CPU binomial likelihood |
| `ex_glmbayes_famfuncs_poisson.cpp` | `ex_glmbayes::fam` | CPU Poisson likelihood |
| `ex_glmbayes_famfuncs_Gamma.cpp` | `ex_glmbayes::fam` | CPU Gamma likelihood |
| `ex_glmbayes_famfuncs_gaussian.cpp` | `ex_glmbayes::fam` | CPU Gaussian likelihood |
| `ex_glmbayes_EnvelopeEval.cpp` | `ex_glmbayes::env` | CPU envelope evaluation (f2/f3 orchestrator) |
| `ex_glmbayes_EnvelopeSize.cpp` | `ex_glmbayes::env` | Envelope memory sizing |
| `ex_glmbayes_glmb_Standardize_Model.cpp` | `ex_glmbayes::sim` | Design matrix standardization |

### C++ headers (`src/`)

| File | Namespace declared |
|------|--------------------|
| `ex_glmbayes_opencl.h` | `ex_glmbayes::opencl` |
| `ex_glmbayes_famfuncs.h` | `ex_glmbayes::fam` |
| `ex_glmbayes_Envelopefuncs.h` | `ex_glmbayes::env` |
| `ex_glmbayes_simfuncs.h` | `ex_glmbayes::sim` |

### R source (`R/`)

| File | Role |
|------|------|
| `ex_glmbayes.R` | R-facing API: `f2_f3_opencl_R`, `f2_f3_non_opencl` |
| `ex_glmbayes_rcpp_wrappers.R` | Thin Rcpp wrappers for the three exported C++ export functions |

## Adapting this pattern for a new package

To build a similar component in your own package:

1. **Identify your nmath dependencies** --- write the `@depends_nmath`
   annotation in each of your kernel files (listing the direct nmath entry
   points your kernel calls). Then call `extract_library_subset()` to copy
   the exact transitive closure of files, plus the companion index files,
   into your package's `inst/cl/<yourpkg_nmath>/`:

   ```{r, eval = FALSE}
   nmath_dir <- system.file("cl", "nmath", package = "nmathopencl")
   idx <- readRDS(file.path(nmath_dir, "kernel_dependency_index.rds"))

   extract_library_subset(
       kernel_paths = list.files("inst/cl/mypkg_src", "\\.cl$", full.names = TRUE),
       library_dir  = nmath_dir,
       dest_dir     = "inst/cl/mypkg_nmath",  # must exist
       depends_tag  = "depends_nmath",
       index        = idx
   )
   ```

2. **Write your kernel(s)** --- add `@library_deps`, `@calls_nmath`, and
   `@all_depends_nmath` tags at the top of each `.cl` file. Store them in
   `inst/cl/<yourpkg_src>/`.

3. **Write a kernel runner** --- use `openclPort::opencl_dbl_scalar_kernel_runner`
   directly if your kernel fits the double-scalar layout, or write a custom
   runner following the same OpenCL lifecycle pattern as `f2_f3_kernel_runner`.
   Use `openclPort::opencl_make_context_error` and `opencl_status_name` for
   error handling.

4. **Write a kernel wrapper** --- flatten your R inputs with
   `openclPort::flattenMatrix` / `openclPort::copyVector`, assemble the
   program string with `openclPort::load_library_for_kernel()` +
   `openclPort::load_kernel_source()`, call the runner, and return an
   R-friendly result. The `@depends_nmath` annotation in your kernel file
   drives the subset selection automatically.

5. **Write a CPU fallback** --- implement the same computation using standard
   R math (`<Rmath.h>`) or base-R functions. Guard the GPU path with
   `nmathopencl_has_opencl()`.

6. **Add `LinkingTo: nmathopencl`** to your `DESCRIPTION` file so that
   `openclPort.h` is found at compile time.

7. **Add configure scripts for CRAN safety** --- this is the step most
   commonly overlooked.

   A package that references `-lOpenCL` or `CL/cl.h` in a static
   `src/Makevars` will **fail to compile on CRAN's build machines** (which
   have no GPU SDK), and no binary will be produced. The solution is a
   pair of configure scripts (`configure` for Linux/macOS, `configure.win`
   for Windows) that **generate `src/Makevars` dynamically at install time**:
   if the SDK is found they write `-DUSE_OPENCL` and the link flags; if not
   they write a CPU-only Makevars. The package always compiles cleanly.

   Copy the generic templates into your package root with:

   ```{r, eval = FALSE}
   use_opencl_configure()
   ```

   This writes `configure` and `configure.win` to the current directory,
   sets `configure` as executable (on Linux/macOS), and prints a checklist.
   The templates honor `OPENCL_HOME` / `OPENCL_SDK` environment variables
   so developers with non-standard SDK locations can point to them without
   modifying system paths.

   The three-entity relationship that the scripts establish:

   ```
   configure / configure.win
     -> detects CL/cl.h + libOpenCL at install time
     -> writes -DUSE_OPENCL into Makevars   (or omits it for CPU-only)

   #ifdef USE_OPENCL  (in your C++ source)
     -> guards all GPU code; compiles cleanly either way

   nmathopencl_has_opencl()  (in your R code)
     -> calls a compiled-in bool that mirrors the compile-time decision
     -> TRUE only if -DUSE_OPENCL was set when the package was built
   ```

   On Linux the `configure` script also runs a **runtime probe**
   (`clGetPlatformIDs`) before setting `USE_OPENCL`. This extra check
   catches the common case where the ICD loader (`libOpenCL.so`) is
   installed but no vendor platform has been registered in
   `/etc/OpenCL/vendors/`. `configure.win` (Windows) detects the header
   only; the ICD loader (`OpenCL.dll`) is installed with the GPU driver and
   does not require a separate probe.

   Add the generated files to `.gitignore` --- they are build artifacts,
   not source files:

   ```
   src/Makevars
   src/Makevars.win
   ```

   For more detail, including the migration plan for when these templates
   move to `opencltools`, see
   `system.file("configure-templates", "README.md", package = "nmathopencl")`.

The `ex_glmbayes` files are the authoritative reference implementation for
steps 1--6; step 7 applies universally to any downstream package that
exposes GPU acceleration.