Chapter 10: Case Study — Building Custom GLM Kernels (ex_glmbayes)

Kjell Nygren

2026-06-11

Overview

Chapters 03-10 describe the infrastructure of nmathopencl: the shim layer, the nmath library, the kernel format, and the runner API. This chapter shows how to put all of it together by walking through a concrete downstream use-case: the ex_glmbayes example built into the package.

ex_glmbayes implements GPU-accelerated evaluation of the log-posterior and gradient (f2/f3) for Bayesian GLMs (Gaussian, Gamma, binomial with logit/probit/cloglog links, and Poisson). It is the type of component a package author would build on top of nmathopencl to accelerate their own likelihood computations. The name ex_glmbayes signals that this is an example, not a core part of the library; it lives entirely in ex_glmbayes_*-prefixed files and within the ex_glmbayes C++ namespace.

Step 1 — Identify the nmath functions your kernel needs

For each GLM family, determine which nmath functions the GPU kernel must call:

Kernel nmath function(s) called Link function
f2_f3_binomial_logit.cl dbinom_raw logit
f2_f3_binomial_probit.cl dbinom_raw, pnorm5, dnorm4 probit
f2_f3_binomial_cloglog.cl dbinom_raw cloglog
f2_f3_gamma.cl dgamma log
f2_f3_gaussian.cl dnorm4 identity/log
f2_f3_poisson.cl lgamma (OpenCL built-in) log

For f2_f3_poisson, lgamma is an OpenCL built-in (part of the OpenCL 1.1 math specification) so no nmath files are needed at all. For the others, consult the @all_depends_nmath tag of each kernel file to get the full transitive dependency list.

Step 2 — Extract the minimal nmath subset

The @all_depends_nmath tag in each kernel file gives the complete transitive closure. Taking the union across the five non-Poisson kernels yields 20 files (17 nmath functions + Rmath, nmath, refactored as infrastructure):

Rmath, nmath, refactored,
chebyshev, cospi, fmax2, gammalims, lgammacor,
gamma, lgamma,
stirlerr_cycle_free, pgamma_utils, stirlerr_cycle_dependent, stirlerr, bd0,
dpois, dgamma, dnorm, pnorm, dbinom

These files are stored in inst/cl/ex_glmbayes_nmath/. Rather than copying them manually, use extract_library_subset() to derive this directory automatically from the kernel annotations:

nmath_dir <- system.file("cl", "nmath",           package = "nmathopencl")
src_dir   <- system.file("cl", "ex_glmbayes_src", package = "nmathopencl")

# Load the pre-built dependency index (read once, reuse across calls)
idx <- readRDS(file.path(nmath_dir, "kernel_dependency_index.rds"))

result <- extract_library_subset(
    kernel_paths = list.files(src_dir, pattern = "\\.cl$", full.names = TRUE),
    library_dir  = nmath_dir,
    dest_dir     = "inst/cl/ex_glmbayes_nmath",
    depends_tag  = "depends_nmath",
    index        = idx
)

extract_library_subset() copies the 20 .cl files plus both index files (kernel_dependency_index.rds and kernel_dependency_index.tsv) into the destination directory. The index files are required for the indexed loading described in Step 5.

A package developer would create an analogous subdirectory for their own kernel set using the same function.

Step 3 — Write the kernel files

Each kernel file in inst/cl/ex_glmbayes_src/ starts with the dependency metadata block followed by the kernel body:

// @library_deps: nmath
// @calls_nmath: dbinom_raw
// @depends_nmath: dbinom
// @all_depends_nmath_count: 17
// @all_depends_nmath: Rmath, nmath, refactored, chebyshev, cospi, fmax2,
//   gammalims, lgammacor, gamma, lgamma, stirlerr_cycle_free, pgamma_utils,
//   stirlerr_cycle_dependent, stirlerr, bd0, dbinom
// @calls_opencl_builtin: (none)

#pragma OPENCL EXTENSION cl_khr_fp64 : enable
#pragma OPENCL EXTENSION cl_khr_printf : enable

#define MAX_L2 64   // upper bound on l2

__kernel void f2_f3_binomial_logit(
    __global const double* X,      // design matrix,  size = l1 x l2, col-major
    __global const double* B,      // grid,            size = m1 x l2
    __global const double* mu,     // prior mean,      length = l2
    __global const double* P,      // prior precision, size = l2 x l2
    __global const double* alpha,  // offset,          length = l1
    __global const double* y,      // response,        length = l1
    __global const double* wt,     // weights,         length = l1
    __global double*       qf,     // out: neg log-posterior, length = m1
    __global double*       grad,   // out: gradient,   size = m1 x l2
    const int l1,
    const int l2,
    const int m1
) {
    int j = get_global_id(0);
    if (j >= m1) return;

    // prior quadratic form: 0.5 * (B_j - mu)' P (B_j - mu)
    double tmp[MAX_L2];
    for (int k = 0; k < l2; ++k) {
        double acc = 0.0;
        for (int l = 0; l < l2; ++l)
            acc += P[k*l2 + l] * (B[j*l2 + l] - mu[l]);
        tmp[k] = acc;
    }
    double res_acc = 0.0;
    for (int k = 0; k < l2; ++k)
        res_acc += 0.5 * (B[j*l2 + k] - mu[k]) * tmp[k];

    // gradient: prior part
    double g_loc[MAX_L2];
    for (int k = 0; k < l2; ++k) g_loc[k] = tmp[k];

    // data likelihood: logistic link + dbinom_raw
    for (int i = 0; i < l1; ++i) {
        double dot = -alpha[i];
        for (int k = 0; k < l2; ++k)
            dot -= X[k*l1 + i] * B[j*l2 + k];
        double e = exp(dot <= 0 ? dot : -dot);
        double p = dot <= 0 ? 1.0/(1.0+e) : e/(1.0+e);
        double q = 1.0 - p;
        res_acc += -dbinom_raw(y[i], wt[i], p, q, 1);
        double resid = p < 0.5 ? p*wt[i] - y[i]*wt[i]
                                : (1-y[i])*wt[i] - q*wt[i];
        for (int k = 0; k < l2; ++k)
            g_loc[k] += X[k*l1 + i] * resid;
    }

    qf[j] = res_acc;
    for (int k = 0; k < l2; ++k)
        grad[k * m1 + j] = g_loc[k];
}

Key design choices:

Step 4 — Write the kernel runner (C++)

The kernel runner is the low-level C++ function that manages the OpenCL lifecycle for this specific kernel. It lives in ex_glmbayes_kernel_runners.cpp inside the ex_glmbayes::opencl namespace:

// ex_glmbayes_kernel_runners.cpp (abbreviated)
namespace ex_glmbayes {
namespace opencl {

void f2_f3_kernel_runner(
    const std::string& kernel_source,
    const char*        kernel_name,
    int l1, int l2, int m1,
    const std::vector<double>& X_flat,
    const std::vector<double>& B_flat,
    const std::vector<double>& mu_flat,
    const std::vector<double>& P_flat,
    const std::vector<double>& alpha_flat,
    const std::vector<double>& y_flat,
    const std::vector<double>& wt_flat,
    std::vector<double>& qf_flat,
    std::vector<double>& grad_flat
) {
    // Select platform and device
    // Create context, command queue, program, kernel
    // Create input buffers (CL_MEM_READ_ONLY) and output buffers (CL_MEM_WRITE_ONLY)
    // Set kernel arguments
    // Enqueue kernel with global_size = m1
    // Read back qf_flat and grad_flat
    // Release all resources
}

}} // namespace ex_glmbayes::opencl

The runner uses the error-handling utilities from openclPort.h (opencl_make_context_error, opencl_status_name) to produce informative exceptions on failure.

Step 5 — Write the kernel wrapper (C++)

The kernel wrapper is the R-facing [[Rcpp::export]] function. It validates and converts R inputs, selects the kernel name based on family/link, assembles the program string using the dependency index, calls the runner, and returns results as R objects. It lives in ex_glmbayes_kernel_wrappers.cpp:

// ex_glmbayes_kernel_wrappers.cpp (abbreviated)
namespace ex_glmbayes {
namespace opencl {

// [[Rcpp::export]]
Rcpp::List f2_f3_opencl(
    std::string family, std::string link,
    Rcpp::NumericMatrix b,  Rcpp::NumericVector y,
    Rcpp::NumericMatrix x,  Rcpp::NumericMatrix mu,
    Rcpp::NumericMatrix P,  Rcpp::NumericVector alpha,
    Rcpp::NumericVector wt, int progbar)
{
    int l1 = x.nrow(), l2 = x.ncol(), m1 = b.ncol();

    // 1. Flatten R matrices into contiguous vectors
    auto X_flat  = openclPort::flattenMatrix(x);
    auto B_flat  = openclPort::flattenMatrix(b);
    // ... (all inputs)

    // 2. Dispatch: set kernel_name and kernel_file based on family/link
    std::string kernel_name, kernel_file;
    if (family == "binomial" && link == "logit") {
        kernel_name = "f2_f3_binomial_logit";
        kernel_file = "ex_glmbayes_src/f2_f3_binomial_logit.cl";
    } // ... (all families)

    // 3. Assemble program source using the pre-built dependency index.
    //    load_library_for_kernel() reads @depends_nmath from the kernel file,
    //    looks up the transitive closure in kernel_dependency_index.tsv, and
    //    loads only the required subset in the correct order --- no sort at runtime.
    std::string nmath_src = openclPort::load_library_for_kernel(
        kernel_file,          // kernel .cl path (relative to inst/cl/)
        "ex_glmbayes_nmath",  // library subdir containing index + .cl files
        "nmathopencl",        // package
        "depends_nmath"       // annotation tag listing direct nmath entry points
    );
    std::string kernel_src = openclPort::load_kernel_source(
        kernel_file, "nmathopencl");
    std::string program_source = nmath_src + "\n" + kernel_src;

    // 4. Allocate output buffers
    std::vector<double> qf_flat(m1);
    std::vector<double> grad_flat(static_cast<size_t>(m1) * l2);

    // 5. Run the kernel
    f2_f3_kernel_runner(program_source, kernel_name.c_str(),
                        l1, l2, m1,
                        X_flat, B_flat, mu_flat, P_flat,
                        alpha_flat, y_flat, wt_flat,
                        qf_flat, grad_flat);

    // 6. Wrap outputs as R objects and return
    Rcpp::NumericVector qf(qf_flat.begin(), qf_flat.end());
    // ... (reshape grad_flat into a matrix)
    return Rcpp::List::create(Rcpp::Named("qf") = qf, ...);
}

}} // namespace ex_glmbayes::opencl

Why load_library_for_kernel() rather than load_kernel_library()?

load_kernel_library() sorts and concatenates every file in the subdirectory on every call. For a 137-file library like nmath this requires reading all files sequentially. load_library_for_kernel() reads the pre-built kernel_dependency_index.tsv once, then reads only the small subset of files that the specific kernel actually needs — 3 files for f2_f3_gaussian, 17 for f2_f3_binomial_logit, and so on. The savings are significant for large libraries and become important when the kernel wrapper is called repeatedly.

The old full-library approach is still available for cases where the exact dependency set is unknown or all files are genuinely needed:

// Old approach --- loads all files in the subdirectory (kept for reference)
// std::string nmath_src = openclPort::load_kernel_library(
//     "ex_glmbayes_nmath", "nmathopencl", false);

Step 6 — Write the Rcpp export wrappers (C++ and R)

The [[Rcpp::export]] annotation on f2_f3_opencl causes Rcpp::compileAttributes() to generate:

The R-side convenience function in ex_glmbayes.R adds a nmathopencl_has_opencl() guard and CPU fallback:

f2_f3_opencl_R <- function(family, link, b, y, x, mu, P, alpha, wt,
                             use_opencl = TRUE) {
  if (use_opencl && nmathopencl_has_opencl()) {
    f2_f3_opencl(family, link, b, y, x, mu, P, alpha, wt, progbar = 0L)
  } else {
    f2_f3_non_opencl(family, link, b, y, x, mu, P, alpha, wt)
  }
}

Step 7 — Implement the CPU fallback

The CPU fallback uses the same mathematical logic as the kernel but executes in C++ using standard R math calls. The relevant code lives in:

These files are grouped under the ex_glmbayes::fam and ex_glmbayes::env sub-namespaces. They use R’s standard <Rmath.h> functions (the same mathematical functions as the GPU kernels, but via the C runtime rather than OpenCL).

File inventory

The complete set of files contributing to the ex_glmbayes example:

OpenCL source (inst/cl/)

Directory Files
ex_glmbayes_nmath/ 20 files: the minimal nmath subset
ex_glmbayes_src/ 6 files: one f2_f3_*.cl kernel per GLM family
ex_glmbayes_draft_src/ Draft versions (work in progress)

C++ source (src/)

File Namespace Role
ex_glmbayes_kernel_runners.cpp ex_glmbayes::opencl Low-level OpenCL runner for the f2_f3 kernels
ex_glmbayes_kernel_wrappers.cpp ex_glmbayes::opencl R-facing wrapper: input validation, program assembly, runner dispatch
ex_glmbayes_export_wrappers.cpp (Rcpp export) [[Rcpp::export]] entry points for EnvelopeSize, EnvelopeEval, glmb_Standardize_Model
ex_glmbayes_famfuncs_binomial.cpp ex_glmbayes::fam CPU binomial likelihood
ex_glmbayes_famfuncs_poisson.cpp ex_glmbayes::fam CPU Poisson likelihood
ex_glmbayes_famfuncs_Gamma.cpp ex_glmbayes::fam CPU Gamma likelihood
ex_glmbayes_famfuncs_gaussian.cpp ex_glmbayes::fam CPU Gaussian likelihood
ex_glmbayes_EnvelopeEval.cpp ex_glmbayes::env CPU envelope evaluation (f2/f3 orchestrator)
ex_glmbayes_EnvelopeSize.cpp ex_glmbayes::env Envelope memory sizing
ex_glmbayes_glmb_Standardize_Model.cpp ex_glmbayes::sim Design matrix standardization

C++ headers (src/)

File Namespace declared
ex_glmbayes_opencl.h ex_glmbayes::opencl
ex_glmbayes_famfuncs.h ex_glmbayes::fam
ex_glmbayes_Envelopefuncs.h ex_glmbayes::env
ex_glmbayes_simfuncs.h ex_glmbayes::sim

R source (R/)

File Role
ex_glmbayes.R R-facing API: f2_f3_opencl_R, f2_f3_non_opencl
ex_glmbayes_rcpp_wrappers.R Thin Rcpp wrappers for the three exported C++ export functions

Adapting this pattern for a new package

To build a similar component in your own package:

  1. Identify your nmath dependencies — write the @depends_nmath annotation in each of your kernel files (listing the direct nmath entry points your kernel calls). Then call extract_library_subset() to copy the exact transitive closure of files, plus the companion index files, into your package’s inst/cl/<yourpkg_nmath>/:

    nmath_dir <- system.file("cl", "nmath", package = "nmathopencl")
    idx <- readRDS(file.path(nmath_dir, "kernel_dependency_index.rds"))
    
    extract_library_subset(
        kernel_paths = list.files("inst/cl/mypkg_src", "\\.cl$", full.names = TRUE),
        library_dir  = nmath_dir,
        dest_dir     = "inst/cl/mypkg_nmath",  # must exist
        depends_tag  = "depends_nmath",
        index        = idx
    )
  2. Write your kernel(s) — add @library_deps, @calls_nmath, and @all_depends_nmath tags at the top of each .cl file. Store them in inst/cl/<yourpkg_src>/.

  3. Write a kernel runner — use openclPort::opencl_dbl_scalar_kernel_runner directly if your kernel fits the double-scalar layout, or write a custom runner following the same OpenCL lifecycle pattern as f2_f3_kernel_runner. Use openclPort::opencl_make_context_error and opencl_status_name for error handling.

  4. Write a kernel wrapper — flatten your R inputs with openclPort::flattenMatrix / openclPort::copyVector, assemble the program string with openclPort::load_library_for_kernel() + openclPort::load_kernel_source(), call the runner, and return an R-friendly result. The @depends_nmath annotation in your kernel file drives the subset selection automatically.

  5. Write a CPU fallback — implement the same computation using standard R math (<Rmath.h>) or base-R functions. Guard the GPU path with nmathopencl_has_opencl().

  6. Add LinkingTo: nmathopencl to your DESCRIPTION file so that openclPort.h is found at compile time.

  7. Add configure scripts for CRAN safety — this is the step most commonly overlooked.

    A package that references -lOpenCL or CL/cl.h in a static src/Makevars will fail to compile on CRAN’s build machines (which have no GPU SDK), and no binary will be produced. The solution is a pair of configure scripts (configure for Linux/macOS, configure.win for Windows) that generate src/Makevars dynamically at install time: if the SDK is found they write -DUSE_OPENCL and the link flags; if not they write a CPU-only Makevars. The package always compiles cleanly.

    Copy the generic templates into your package root with:

    use_opencl_configure()

    This writes configure and configure.win to the current directory, sets configure as executable (on Linux/macOS), and prints a checklist. The templates honor OPENCL_HOME / OPENCL_SDK environment variables so developers with non-standard SDK locations can point to them without modifying system paths.

    The three-entity relationship that the scripts establish:

    configure / configure.win
      -> detects CL/cl.h + libOpenCL at install time
      -> writes -DUSE_OPENCL into Makevars   (or omits it for CPU-only)
    
    #ifdef USE_OPENCL  (in your C++ source)
      -> guards all GPU code; compiles cleanly either way
    
    nmathopencl_has_opencl()  (in your R code)
      -> calls a compiled-in bool that mirrors the compile-time decision
      -> TRUE only if -DUSE_OPENCL was set when the package was built

    On Linux the configure script also runs a runtime probe (clGetPlatformIDs) before setting USE_OPENCL. This extra check catches the common case where the ICD loader (libOpenCL.so) is installed but no vendor platform has been registered in /etc/OpenCL/vendors/. configure.win (Windows) detects the header only; the ICD loader (OpenCL.dll) is installed with the GPU driver and does not require a separate probe.

    Add the generated files to .gitignore — they are build artifacts, not source files:

    src/Makevars
    src/Makevars.win

    For more detail, including the migration plan for when these templates move to opencltools, see system.file("configure-templates", "README.md", package = "nmathopencl").

The ex_glmbayes files are the authoritative reference implementation for steps 1–6; step 7 applies universally to any downstream package that exposes GPU acceleration.