Chapter 05: Kernels, Kernel Runners, and Kernel Wrappers

Kjell Nygren

2026-06-11

Introduction

When an R package uses nmathopencl to accelerate a computation on the GPU, the work is divided among three distinct kinds of source file. This chapter introduces them in the order that makes the design easiest to understand: starting with the part that looks most like ordinary C++ code and moving toward the part that is furthest from it.

Role File type Compiled by Conditional guard Visible to R
Kernel wrapper *.cpp Host C++ compiler, at install time Internal #ifdef USE_OPENCL block Yes (directly or via a calling function)
Kernel runner *.cpp Host C++ compiler, at install time Entire body inside #ifdef USE_OPENCL No
Kernel *.cl GPU driver, at runtime None needed No

Three design goals shape this structure:

  1. CRAN safety. CRAN’s build servers have no OpenCL SDK. Any package submitted to CRAN must compile cleanly without it. Wrapping runner bodies entirely in #ifdef USE_OPENCL means those functions simply do not exist in a CPU-only build — there are no unresolved OpenCL symbols for the linker to complain about.

  2. Clean type boundaries. R and OpenCL use entirely different type systems and memory models. Explicit conversion at each boundary — from C++ input types (including Rcpp wrapper types) to plain C++ types the runner accepts, to OpenCL device buffers, and back — makes those differences explicit and testable, rather than hiding them behind casts.

  3. Separation of concerns. The wrapper owns the R interface and type conversion. The runner owns the OpenCL bookkeeping. The kernel owns the mathematical computation. None of the three needs to know the internal details of the other two.

How this chapter is organized

Each of the three sections below follows the same pattern:

  1. A general description of what that role does and what varies between implementations.
  2. A concrete example drawn from the ex_glmbayes component of nmathopencl, which evaluates the GLM log-posterior and its gradient over a grid of candidate coefficient vectors. This is labeled clearly as an example throughout.
  3. Commentary on what you would change when writing your own version.

ex_glmbayes is used as the example rather than a simpler function because it involves inputs of several different shapes, a two-dimensional parallel structure, and two different output types. That richness makes the type-conversion and parallelism stories concrete. A simpler computation would follow the same pattern with fewer inputs and a simpler output.


The computation in the example: GLM log-posterior over a grid

A Bayesian GLM sampler such as glmbayes needs to evaluate, at each MCMC step, a function f2_f3(B) for a grid of candidate coefficient vectors B[1], …, B[m1]. For each grid point j, f2_f3 returns:

The grid points are completely independent of each other, which makes the problem ideal for the GPU. Three dimension integers — l1 (observations), l2 (coefficients), m1 (grid points) — appear in all three layers and must be consistent throughout.


Serial vs. parallel execution: the fundamental concept

Before examining the three layers in detail it is worth being precise about what “parallel” means here, because it is the reason the whole architecture exists.

The serial (CPU) mental model

On a CPU, evaluating f2_f3 over m1 grid points looks like an ordinary loop:

// CPU serial execution — one grid point at a time
for (int j = 0; j < m1; ++j) {
    qf[j]      = compute_neg_log_posterior(B[j], X, y, wt, mu, P, alpha);
    grad[:, j] = compute_gradient(B[j], X, y, wt, mu, P, alpha);
}

The iterations execute one after the other. Each call to compute_neg_log_posterior must finish before the next begins. With m1 = 500 and l1 = 300, that is 150,000 observation-level computations done sequentially.

The parallel (GPU) mental model

OpenCL replaces the loop with a single API call that launches all m1 iterations simultaneously:

// GPU parallel execution — all m1 grid points launched at once
size_t global = (size_t) m1;
clEnqueueNDRangeKernel(queue, kernel, 1, nullptr, &global, nullptr,
                       0, nullptr, nullptr);

This one call enqueues m1 independent work-items. Each work-item is an independent execution of the __kernel function, distinguished only by its unique index:

int j = get_global_id(0);   // 0, 1, 2, ..., m1-1 — one per work-item

The GPU driver maps these work-items onto the GPU’s physical compute units. A modern GPU has dozens to hundreds of compute units, each capable of running multiple work-items simultaneously. With m1 = 500, several hundred grid points can genuinely execute at the same time, not just interleaved.

The work-items are organized internally into work-groups for scheduling purposes. When the local work-group size is set to nullptr (as in the runner), the driver chooses a group size that fits the hardware. Work-items within a work-group can share fast local memory, but the ex_glmbayes kernels do not use local memory — each work-item is fully independent.

When OpenCL is not available

If the package was built without USE_OPENCL, or if nmathopencl_has_opencl() returns FALSE at runtime, the wrapper returns zero-filled output vectors. There is no serial CPU fallback that computes the actual values.

This is a deliberate design choice. Implementing a correct CPU fallback that matches the GPU computation would mean maintaining two versions of the same mathematical code. In practice, glmbayes calls f2_f3_opencl only after checking nmathopencl_has_opencl(), and falls back to its own pure-R implementation when OpenCL is not available. The zero-return convention signals clearly that the GPU path was not executed, rather than silently returning potentially stale or incorrect results.

When implementing your own wrapper, you face the same choice: return zeros (safe, unambiguous), throw an error, or call a CPU implementation of the same computation. Whatever you choose, make it explicit in the #else branch.

Summary

Mode Execution Result
GPU available (USE_OPENCL + nmathopencl_has_opencl()) m1 work-items launched simultaneously via clEnqueueNDRangeKernel GPU-computed values
CPU-only build or no GPU at runtime #ifdef/nmathopencl_has_opencl() guard triggers Zero-filled outputs (or your chosen fallback)

The kernel wrapper

General description

A kernel wrapper is an ordinary C++ function — always compiled, always callable — that serves as the boundary between the calling code and the GPU acceleration path. It is the right place to start because it is the most familiar-looking of the three roles.

Every kernel wrapper does the same four things, regardless of the specific computation:

  1. Converts its C++ input types to the plain std::vector types the runner requires and the OpenCL API can handle. The exact conversions depend on the types in the function signature.
  2. Allocates output storage for the runner to write into.
  3. Conditionally dispatches to the runner if USE_OPENCL was defined at build time. If not, the function returns the zero-initialized outputs.
  4. Converts the runner’s output from plain C++ back to the types the caller expects.

What varies between implementations is the function signature (the specific C++ types coming in and going out), the conversion steps (which helpers are needed), and the output restructuring (how flat C++ vectors become the return types the caller needs). The four-step structure, and the #ifdef USE_OPENCL placement, are the same everywhere.

Example: f2_f3_opencl from ex_glmbayes

The following is the kernel wrapper for the binomial-logit GLM gradient computation. Read it as one concrete instance of the general pattern described above.

// ── EXAMPLE: src/ex_glmbayes_kernel_wrappers.cpp ────────────────────────────
// This wrapper is specific to the GLM log-posterior computation.
// Your own wrapper will have a different name, a different signature,
// and different conversion and output steps.
// ────────────────────────────────────────────────────────────────────────────

#include <RcppArmadillo.h>
#include "openclPort.h"
#include "ex_glmbayes_opencl.h"

using namespace openclPort;
using namespace ex_glmbayes::opencl;

namespace ex_glmbayes { namespace opencl {

Rcpp::List f2_f3_opencl(
    std::string          family,   // "binomial", "poisson", etc.
    std::string          link,     // "logit", "probit", etc.
    Rcpp::NumericMatrix  b,        // coefficient grid, l2 × m1
    Rcpp::NumericVector  y,        // response, length l1
    Rcpp::NumericMatrix  x,        // design matrix, l1 × l2
    Rcpp::NumericMatrix  mu,       // prior mean
    Rcpp::NumericMatrix  P,        // prior precision, l2 × l2
    Rcpp::NumericVector  alpha,    // offsets, length l1
    Rcpp::NumericVector  wt,       // weights / trial counts, length l1
    int                  progbar
    // Your wrapper's parameters will reflect your own computation's
    // inputs. A simpler function might take just one or two vectors.
) {
  int l1 = x.nrow(), l2 = x.ncol(), m1 = b.ncol();
  // Your dimension variables will match your own inputs.

Step 1 — Convert C++ input types to flat std::vector<double>

  // ── EXAMPLE: input conversion specific to this GLM computation ──────────
  // The C++ types coming in (Rcpp::NumericMatrix, Rcpp::NumericVector)
  // reflect what this particular function signature requires.
  // Your wrapper will have different input types and will apply whichever
  // combination of flattenMatrix / copyVector / manual conversion matches
  // your own function signature.

  auto X_flat     = flattenMatrix(x);
  auto B_flat     = flattenMatrix(b);
  auto mu_flat    = flattenMatrix(mu);
  auto P_flat     = flattenMatrix(P);
  auto alpha_flat = copyVector(alpha);
  auto y_flat     = copyVector(y);
  auto wt_flat    = copyVector(wt);

This conversion step, and the output vectors allocated next, all appear outside any #ifdef USE_OPENCL block. The reason is that the output vectors (qf_flat, grad_flat) must be in scope after #endif when the wrapper constructs the return value from them. Keeping all the conversions together, before the conditional block, makes the scoping unambiguous.

Why not pass the Rcpp objects directly to the runner? A Rcpp::NumericMatrix is a C++ object whose data lives in R’s garbage-collected heap (as an R SEXP). Passing a raw pointer into that storage to the OpenCL API is unsafe: the driver requires a stable, contiguous memory region for the duration of the clCreateBuffer copy, and R’s GC can move or invalidate heap objects when control returns to R. A std::vector<double> provides the required guarantee: it owns its memory, controls its lifetime through ordinary C++ scope rules, and will not move or be freed until the vector itself goes out of scope.

Why double and not float? The function signature uses double throughout because that is the native precision of Rcpp::NumericMatrix and Rcpp::NumericVector — and of the double* arrays they wrap. Using 32-bit float inside a kernel would require explicit downcast on every input and upcast on every output, and each round-trip introduces rounding error that makes GPU results diverge from the CPU reference computation. Using double throughout keeps the two paths numerically equivalent, at the cost of requiring the cl_khr_fp64 extension on the device (supported on all current discrete GPUs and AMD APUs).

Memory layout. Rcpp::NumericMatrix stores its data in column-major order internally (matching R’s own convention). flattenMatrix preserves that order: it traverses the matrix column-by-column, so the resulting flat vector has the same layout. The kernel indexes the design matrix as X[col * l1 + row] — column-major indexing — so the flat vector and the kernel’s index arithmetic agree. If the flat vector were row-major, every off-diagonal access would silently return the wrong value.

// openclPort helpers used above (defined in OpenCL_helper.cpp):

std::vector<double> flattenMatrix(const Rcpp::NumericMatrix& mat) {
  int nrow = mat.nrow(), ncol = mat.ncol();
  std::vector<double> out;
  out.reserve(static_cast<size_t>(nrow) * ncol);
  for (int j = 0; j < ncol; ++j)       // outer: columns
    for (int i = 0; i < nrow; ++i)     // inner: rows within column
      out.push_back(mat(i, j));
  return out;
}

std::vector<double> copyVector(const Rcpp::NumericVector& vec) {
  return std::vector<double>(vec.begin(), vec.end());
}
// For IntegerVector: std::vector<int>(v.begin(), v.end())

Step 2 — Allocate output buffers

  // ── EXAMPLE: output sizes specific to this computation ──────────────────
  // Your output buffers will match your kernel's output parameters.
  // A simpler kernel writing one output vector would need only one
  // std::vector here.

  std::vector<double> qf_flat(m1);                      // scalar per grid point
  std::vector<double> grad_flat(static_cast<size_t>(m1) * l2);  // gradient matrix
  Rcpp::NumericVector qf(m1);   // zero-initialized; returned as-is on CPU-only build

std::vector<double> zero-initializes its elements by default. If USE_OPENCL was not defined, the wrapper returns these zero-filled objects — a predictable, testable result on any machine without a GPU. The static_cast<size_t> prevents integer overflow in m1 * l2 on 32-bit platforms.

Step 3 — Select the kernel and assemble the program string

#ifdef USE_OPENCL

  // ── EXAMPLE: family/link dispatch specific to GLM ────────────────────────
  // This dispatch is needed here because one wrapper covers multiple
  // GLM families. Your wrapper may call a single fixed kernel, in which
  // case kernel_name and kernel_file are constants, not variables.

  std::string kernel_name, kernel_file;

  if (family == "binomial" || family == "quasibinomial") {
    if (link == "logit") {
      kernel_name = "f2_f3_binomial_logit";
      kernel_file = "ex_glmbayes_src/f2_f3_binomial_logit.cl";
    } else if (link == "probit") {
      kernel_name = "f2_f3_binomial_probit";
      kernel_file = "ex_glmbayes_src/f2_f3_binomial_probit.cl";
    }
    /* ... other link functions ... */
  } else if (family == "poisson" || family == "quasipoisson") {
    kernel_name = "f2_f3_poisson";
    kernel_file = "ex_glmbayes_src/f2_f3_poisson.cl";
  }
  /* ... other families ... */

  // ── Program string assembly (same pattern for all wrappers) ─────────────
  // The shim libraries are always the same. What changes is the nmath
  // subset (driven by your kernel's @all_depends_nmath annotation) and
  // the kernel source file itself.

  std::string all_src =
      load_kernel_source("OPENCL.cl") +
      "\n" + load_kernel_library("libR_shims",      "nmathopencl", false) +
      "\n" + load_kernel_library("R_ext_types",     "nmathopencl", false) +
      "\n" + load_kernel_library("R_shims",         "nmathopencl", false) +
      "\n" + load_kernel_library("R_ext_runtime",   "nmathopencl", false) +
      "\n" + load_kernel_library("R_ext_internals", "nmathopencl", false) +
      "\n" + load_kernel_library("System",          "nmathopencl", false) +
      "\n" + load_library_for_kernel(
                 kernel_file, "ex_glmbayes_nmath", "nmathopencl",
                 "all_depends_nmath") +
      "\n" + load_kernel_source(kernel_file);

OpenCL has no #include directive, so the wrapper must concatenate every source file the kernel depends on into one string before calling the runner. The layers are assembled in dependency order (Chapter 03 explains each layer): platform definitions, R compatibility shims, the nmath subset the kernel needs, and finally the kernel source itself. load_library_for_kernel reads the @all_depends_nmath annotation from the kernel file and loads only those nmath files in topological order, rather than loading all 137.

Step 4 — Dispatch to the runner

  // ── EXAMPLE: runner call specific to this computation ───────────────────
  // Your call site will name your own runner function and pass your
  // own flat vectors. The pattern — pass the assembled string, the
  // kernel name, the dimensions, all inputs, and the output buffers —
  // is the same.

  f2_f3_kernel_runner(
      all_src, kernel_name.c_str(),
      l1, l2, m1,
      X_flat, B_flat, mu_flat, P_flat, alpha_flat, y_flat, wt_flat,
      qf_flat, grad_flat);    // filled in by the runner

  for (int j = 0; j < m1; ++j)
    qf[j] = qf_flat[j];      // copy scalar outputs into the Rcpp output vector

#else
  Rcpp::Rcout << "[INFO] OpenCL not available — returning zeros.\n";
#endif

Step 5 — Convert outputs from plain C++ back to the required return types

  // ── EXAMPLE: two output conversion patterns ──────────────────────────────
  // qf   uses an element-by-element copy (clear, zero overhead for small m1).
  // grad uses a zero-copy Armadillo wrap to avoid copying a large matrix.
  //
  // Your wrapper will use whichever conversion pattern matches its own
  // return type: a copy loop or Rcpp::NumericVector(...) for a vector result,
  // arma::mat for a matrix, or a Rcpp::List assembling multiple outputs.

  arma::mat grad_arma(
      grad_flat.data(),   // pointer to existing flat storage — not copied
      m1,                 // number of rows
      l2,                 // number of columns
      false,              // copy_aux_mem = false: borrow the memory
      false               // strict = false
  );

  return Rcpp::List::create(
      Rcpp::Named("qf")   = qf,
      Rcpp::Named("grad") = grad_arma
  );
}

}} // namespace ex_glmbayes::opencl

Two output conversion strategies are shown here.

Element-by-element copy (qf): the runner writes results into the plain std::vector<double> qf_flat. After the runner returns, each element is copied into the Rcpp::NumericVector qf that was pre-allocated before the #ifdef block. This is the clearest pattern and is effectively free when m1 is small. For larger outputs, Rcpp::NumericVector(v.begin(), v.end()) achieves the same conversion in one line.

Zero-copy Armadillo wrap (grad): the runner writes into std::vector<double> grad_flat. Rather than copying all m1 × l2 elements into a new matrix object, an arma::mat is constructed over grad_flat’s existing memory using Armadillo’s (ptr, nrow, ncol, copy_aux_mem, strict) constructor. Setting copy_aux_mem = false tells Armadillo to use the existing memory directly. The matrix is valid until grad_flat is destroyed, which happens after Rcpp::List::create serializes the result — so the lifetime is safe. For typical MCMC grid sizes the gradient matrix can be several hundred kilobytes; avoiding the copy measurably reduces per-call overhead.

Your own wrapper will use whichever conversion matches its return type: a copy loop or Rcpp::NumericVector(...) for a vector result, an arma::mat wrap for a large matrix, or a Rcpp::List assembling multiple outputs.


The kernel runner

General description

A kernel runner is a C++ function whose entire body is enclosed in #ifdef USE_OPENCL … #endif. It literally does not exist in a CPU-only build: no prototype visible to the linker, no symbols in the object file, no CL/cl.h ever seen by the preprocessor.

Every kernel runner follows the same eight-step lifecycle, regardless of what computation it performs:

  1. Select a device.
  2. Create a context and command queue.
  3. Compile the program on-device.
  4. Allocate device buffers and copy input data to the GPU.
  5. Bind kernel arguments.
  6. Launch the kernel.
  7. Read results back to the host (blocking).
  8. Release all resources.

What varies between implementations is the function signature (the specific std::vector and int parameters matching your computation), the number and arrangement of device buffers (one cl_mem per kernel parameter), and the number of clSetKernelArg calls (one per kernel parameter, in the same order as the kernel’s __kernel signature). The eight steps and their sequence are the same for every runner.

The runner’s interface uses only plain C++ types — std::vector<double>, int, std::string, const char*. It has no Rcpp dependencies and no knowledge of R’s memory model. This makes the runner the easiest of the three layers to test in isolation.

Example: f2_f3_kernel_runner from ex_glmbayes

// ── EXAMPLE: src/ex_glmbayes_kernel_runners.cpp ─────────────────────────────
// This runner is specific to the GLM gradient computation, which has
// seven input matrices/vectors and two output buffers.
// Your runner will have a different name, a different parameter list,
// and a different number of clCreateBuffer / clSetKernelArg calls,
// but the eight lifecycle steps will be the same.
// ────────────────────────────────────────────────────────────────────────────

#ifdef USE_OPENCL

#define CL_TARGET_OPENCL_VERSION 300
#include <CL/cl.h>

#include "openclPort.h"
using namespace openclPort;

namespace ex_glmbayes { namespace opencl {

void f2_f3_kernel_runner(
    const std::string&         kernel_source,  // fully assembled program text
    const char*                kernel_name,    // e.g. "f2_f3_binomial_logit"
    int l1, int l2, int m1,
    // ── EXAMPLE: seven input vectors specific to the GLM computation ────────
    // Your runner will have whatever std::vector parameters your kernel needs.
    // A simpler kernel might take just one or two input vectors.
    const std::vector<double>& X_flat,
    const std::vector<double>& B_flat,
    const std::vector<double>& mu_flat,
    const std::vector<double>& P_flat,
    const std::vector<double>& alpha_flat,
    const std::vector<double>& y_flat,
    const std::vector<double>& wt_flat,
    // ── EXAMPLE: two output vectors specific to this computation ────────────
    std::vector<double>&       qf_flat,        // written back on return
    std::vector<double>&       grad_flat       // written back on return
) {

Step 1 — Select a device

  cl_platform_id platform = nullptr;
  cl_device_id   device   = nullptr;
  opencl_bind_selected_fp64_device_or_throw(platform, device);
  // This call is the same in every runner. It selects the device the
  // user chose via opencltools::set_opencl_device() and verifies that
  // it supports 64-bit floating point.

Step 2 — Create a context and command queue

  cl_context context =
      clCreateContext(nullptr, 1, &device, nullptr, nullptr, &status);

  cl_queue_properties props[] = {0};
  cl_command_queue queue =
      clCreateCommandQueueWithProperties(context, device, props, &status);
  // These two calls are identical in every runner.

An OpenCL context is the environment within which all subsequent objects live — buffers, programs, kernels, and queues all belong to a context and share access to the same device memory. A command queue is the channel through which the host issues instructions to the device. Commands are placed in the queue and the device executes them in order, asynchronously with the host CPU.

Step 3 — Compile the program on-device

  const char* src_ptr = kernel_source.c_str();
  size_t      src_len = kernel_source.size();
  cl_program  program =
      clCreateProgramWithSource(context, 1, &src_ptr, &src_len, &status);

  status = clBuildProgram(program, 0, nullptr, nullptr, nullptr, nullptr);
  // These calls are identical in every runner. If compilation fails,
  // the runner retrieves the build log from the driver and throws a
  // std::runtime_error with the full diagnostic.

  cl_kernel kernel = clCreateKernel(program, kernel_name, &status);
  // kernel_name is the C-string name of the __kernel function in the
  // .cl file — passed in from the wrapper.

clBuildProgram triggers on-device compilation: the GPU driver parses the OpenCL C source, optimizes it, and produces GPU-native machine code specific to the selected device. This is analogous to R CMD SHLIB but happens at runtime inside the GPU driver process. clCreateKernel then extracts the specific __kernel function by name from the compiled program.

Step 4 — Allocate device buffers and transfer input data

  // ── EXAMPLE: nine buffers specific to this computation ──────────────────
  // Your runner will have one cl_mem per kernel parameter (excluding
  // scalar integers, which are passed directly in Step 5).
  // The split between READ_ONLY and WRITE_ONLY reflects the kernel's
  // use of each buffer.

  // Input buffers: allocate and copy from host in one call
  cl_mem bufX = clCreateBuffer(
      context,
      CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
      sizeof(double) * X_flat.size(),
      (void*) X_flat.data(),
      &status);
  // ... similarly for bufB, bufMu, bufP, bufA, bufY, bufW ...

  // Output buffers: allocate uninitialized device memory
  cl_mem bufQF = clCreateBuffer(
      context,
      CL_MEM_WRITE_ONLY,
      sizeof(double) * m1,
      nullptr,     // no initial data — the kernel will write here
      &status);

  cl_mem bufGrad = clCreateBuffer(
      context,
      CL_MEM_WRITE_ONLY,
      sizeof(double) * l2 * m1,
      nullptr,
      &status);

For input buffers, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR allocates device memory and immediately copies the flat vector’s data into it. After this call returns, the device buffer is fully populated and the host-side vector is no longer needed by the driver.

For output buffers, CL_MEM_WRITE_ONLY with a nullptr host pointer allocates uninitialized device memory that the kernel will write into. The results are read back to the host after execution.

Step 5 — Bind kernel arguments

  // ── EXAMPLE: twelve arguments for this kernel ────────────────────────────
  // Your runner will have one clSetKernelArg call per kernel parameter,
  // in the exact same order as the parameters appear in the __kernel
  // function signature. Mismatched order causes a runtime error.

  int arg = 0;
  clSetKernelArg(kernel, arg++, sizeof(cl_mem), &bufX);
  clSetKernelArg(kernel, arg++, sizeof(cl_mem), &bufB);
  clSetKernelArg(kernel, arg++, sizeof(cl_mem), &bufMu);
  clSetKernelArg(kernel, arg++, sizeof(cl_mem), &bufP);
  clSetKernelArg(kernel, arg++, sizeof(cl_mem), &bufA);
  clSetKernelArg(kernel, arg++, sizeof(cl_mem), &bufY);
  clSetKernelArg(kernel, arg++, sizeof(cl_mem), &bufW);
  clSetKernelArg(kernel, arg++, sizeof(cl_mem), &bufQF);
  clSetKernelArg(kernel, arg++, sizeof(cl_mem), &bufGrad);
  // Scalar integers are passed by value, not as cl_mem handles:
  clSetKernelArg(kernel, arg++, sizeof(int),    &l1);
  clSetKernelArg(kernel, arg++, sizeof(int),    &l2);
  clSetKernelArg(kernel, arg++, sizeof(int),    &m1);

clSetKernelArg connects a host-side value to a positional parameter in the kernel’s __kernel signature. The second argument is the zero-based parameter index. Buffer arguments (cl_mem) are passed by address; scalar arguments (int, double) are also passed by address and the driver copies the value. The count and order of calls here must exactly mirror the count and order of parameters in the .cl file.

Step 6 — Launch the kernel

  // The global work size is the total number of work-items to launch.
  // For this computation it is m1: one independent instance of the kernel
  // function per grid point. All m1 instances are launched simultaneously
  // (see "Serial vs. parallel execution" earlier in this chapter).
  // Your kernel may use a different global size reflecting its own
  // parallelism structure — e.g. the number of output elements.

  size_t global = (size_t) m1;
  clEnqueueNDRangeKernel(
      queue, kernel,
      1,          // one-dimensional work space (indexed by j = 0..m1-1)
      nullptr,    // global offset: start from index 0
      &global,    // total work-items: m1
      nullptr,    // local work-group size: let the driver choose
      0, nullptr, nullptr);
  // This call places the launch command in the queue and returns immediately.
  // The GPU executes the m1 work-items asynchronously with the host CPU.
  // The blocking clEnqueueReadBuffer in Step 7 is what forces the host to
  // wait until all work-items have finished before reading the results.

Step 7 — Read results back to the host (blocking)

  // ── EXAMPLE: two read-backs for this computation ─────────────────────────
  // Your runner will have one clEnqueueReadBuffer call per output buffer.

  clEnqueueReadBuffer(
      queue, bufQF,
      CL_TRUE,                              // blocking: wait until done
      0,
      sizeof(double) * qf_flat.size(),
      qf_flat.data(),
      0, nullptr, nullptr);

  clEnqueueReadBuffer(
      queue, bufGrad,
      CL_TRUE,
      0,
      sizeof(double) * grad_flat.size(),
      grad_flat.data(),
      0, nullptr, nullptr);
  // After these calls, qf_flat and grad_flat contain GPU-computed results.

CL_TRUE makes each call blocking: it waits until the data transfer from device to host is complete. Because commands execute in queue order, this also guarantees that the kernel has finished.

Step 8 — Release all resources

  // Release in reverse order of creation.
  // Every runner must release every cl_mem, cl_kernel, cl_program,
  // cl_command_queue, and cl_context it creates. Omissions leak GPU VRAM.

  clReleaseMemObject(bufGrad);
  clReleaseMemObject(bufQF);
  clReleaseMemObject(bufW);
  clReleaseMemObject(bufY);
  clReleaseMemObject(bufA);
  clReleaseMemObject(bufP);
  clReleaseMemObject(bufMu);
  clReleaseMemObject(bufB);
  clReleaseMemObject(bufX);
  clReleaseKernel(kernel);
  clReleaseProgram(program);
  clReleaseCommandQueue(queue);
  clReleaseContext(context);
}

}} // namespace ex_glmbayes::opencl

#endif // USE_OPENCL
// Everything from the opening #ifdef to this #endif is invisible to the
// preprocessor on a machine without OpenCL.

The kernel (*.cl)

General description

A kernel is an OpenCL C source file that contains the actual computation to be run on the GPU. It is the innermost piece: pure math, with no knowledge of R, no knowledge of the runner’s bookkeeping, and no access to host memory. Its only interface with the outside world is the set of __global pointer and scalar arguments the runner hands to it via clSetKernelArg.

Every kernel file has the same basic structure, regardless of the computation:

  1. Annotation header — comment lines with @all_depends_nmath (and optionally @depends, @provides) that the wrapper’s program assembler reads to determine which nmath files to prepend. If your kernel does not call any nmath functions, this section can be omitted.
  2. Extension pragmas — at minimum #pragma OPENCL EXTENSION cl_khr_fp64 : enable if the kernel uses double.
  3. Helper functions — any static inline device-side functions called by the __kernel entry point, including nmath functions that will be prepended by the assembler.
  4. __kernel entry point — the function the runner calls via clCreateKernel.

What varies between kernels is the __kernel signature (one __global parameter per device buffer allocated in the runner), the scalar parameters (dimension integers and other constants), and the computation itself.

A kernel is never compiled when the R package is installed. Instead, the source text is assembled into a string by the wrapper, passed to the runner, and handed to clCreateProgramWithSource. The GPU driver compiles it to native machine code at runtime. This on-device compilation allows a single source file to run on any GPU without vendor-specific binaries.

Example: f2_f3_binomial_logit from ex_glmbayes

// ── EXAMPLE: inst/cl/ex_glmbayes_src/f2_f3_binomial_logit.cl ───────────────
// This kernel is specific to the binomial-logit GLM gradient computation.
// Your kernel will have a different name, a different parameter list
// (matching the device buffers your runner allocates), and a different
// computation. The file structure — annotations, pragma, helpers,
// __kernel entry point — is the same.
// ────────────────────────────────────────────────────────────────────────────

// @all_depends_nmath: dpq, refactored, Rmath, nmath, stirlerr_cycle_free,
//   chebyshev, cospi, fmax2, gammalims, lgammacor, log1p, gamma, lgamma,
//   pgamma_utils, stirlerr_cycle_dependent, bd0, stirlerr, dbinom
// If your kernel calls no nmath functions, omit this annotation entirely.

#pragma OPENCL EXTENSION cl_khr_fp64 : enable
// Required whenever the kernel uses 'double'. Always include this.

#pragma OPENCL EXTENSION cl_khr_printf : enable
// Optional; enables printf() for debugging inside the kernel.

// ── EXAMPLE: compile-time constant specific to this computation ─────────────
// MAX_L2 bounds the size of the per-work-item private array g_loc[].
// GPU private memory is fixed at compile time, so a maximum is needed.
// Your kernel may need a similar constant if it uses fixed-size local arrays,
// or may not need one at all if its private storage is scalar.
#define MAX_L2 64

@all_depends_nmath lists the nmath files whose source must be prepended before this kernel when the program string is assembled. The wrapper’s load_library_for_kernel reads this annotation and loads exactly those files in topological order. Chapter 03 explains the annotation scheme and the dependency graph in detail.

MAX_L2 bounds the size of a per-work-item stack array. Each work-item allocates double g_loc[MAX_L2] in private memory — the GPU equivalent of a register file — whose size must be known at compile time. A maximum coefficient count of 64 is sufficient for the GLM use case. Your kernel may need a similar constant, or may not require any fixed-size private arrays at all.

The kernel signature

// ── EXAMPLE: parameter list matching the nine buffers allocated in the runner
// Your kernel's __global parameters must exactly correspond (in number,
// type, and order) to the cl_mem buffers created by clCreateBuffer in the
// runner and bound by clSetKernelArg.

__kernel void f2_f3_binomial_logit(
    __global const double* X,      // design matrix, l1 × l2, column-major
    __global const double* B,      // coefficient grid, m1 × l2
    __global const double* mu,     // prior mean vector, length l2
    __global const double* P,      // prior precision matrix, l2 × l2
    __global const double* alpha,  // offsets, length l1
    __global const double* y,      // response (success proportions), length l1
    __global const double* wt,     // trial counts / weights, length l1
    __global double*       qf,     // OUTPUT: neg log-posterior, length m1
    __global double*       grad,   // OUTPUT: gradient, m1 × l2
    const int l1,
    const int l2,
    const int m1
    // Scalar integers are passed by value; no __global qualifier.
) {

__kernel marks this function as a GPU entry point that clCreateKernel can look up by name.

__global is an OpenCL address-space qualifier. It marks a pointer as pointing into global device memory — the GPU’s main VRAM — which all work-items can access. Pointers without an address-space qualifier point into the work-item’s private memory, which is invisible to other work-items and cannot be mapped from the host. Every cl_mem buffer created in the runner corresponds to exactly one __global pointer in the kernel signature.

const int l1, l2, m1 are scalar values set via clSetKernelArg with sizeof(int) and passed by value. Scalar kernel arguments do not use __global.

The parallel execution body

    int j = get_global_id(0);   // which grid point is this work-item?
    if (j >= m1) return;        // guard: don't process out-of-range indices
    // Your kernel's first line will also call get_global_id(0) to determine
    // which element or slice this work-item is responsible for.

get_global_id(0) returns the unique index of this work-item within the one-dimensional work space. When the runner enqueues the kernel with a global size of m1, the GPU launches m1 work-items simultaneously, each receiving a different value of j. Each work-item executes the function body independently for its own grid point. This is the GPU’s answer to the CPU’s outer for (j = 0; j < m1; j++) loop.

The if (j >= m1) return guard handles the case where the GPU rounds up the number of work-items to a multiple of the hardware’s work-group size. Without this guard, extra work-items would write past the end of the output buffers.

    // ── EXAMPLE: GLM log-posterior and gradient ───────────────────────────
    // The computation below is specific to the binomial-logit model.
    // Your kernel body will contain whatever computation you are
    // accelerating, reading from __global input pointers and writing
    // to __global output pointers.

    double g_loc[MAX_L2];   // private (per-work-item) working storage

    // 1. Quadratic prior term: 0.5 × (B[j,] − μ)ᵀ P (B[j,] − μ)
    // 2. Binomial likelihood loop over l1 observations
    // 3. Write results

    qf[j] = res_acc;

    for (int k = 0; k < l2; ++k)
        grad[k * m1 + j] = g_loc[k];
    // grad is stored as grad[col * m1 + row] — column-major — so that
    // the wrapper's arma::mat(ptr, m1, l2) interprets it correctly.
    // Your output layout should match how your wrapper will read it back.

The gradient is written in column-major order (grad[col * m1 + row]) so that the arma::mat(grad_flat.data(), m1, l2) construction in the wrapper produces an m1 × l2 matrix in the layout R expects. The decision about output memory layout is made in the kernel and must be consistent with how the wrapper reads the data back.


The complete call chain

R:  f2_f3_opencl("binomial", "logit", b, y, x, mu, P, alpha, wt)
      │
      ▼
┌─────────────────────────────────────────────────────────────────┐
│  Kernel wrapper  (always compiled, always callable)             │
│                                                                 │
│  1. Read dimensions                                             │
│  2. Convert Rcpp inputs → std::vector<double>                   │
│     (flattenMatrix for matrices, copyVector for vectors)        │
│  3. Allocate zero-filled output vectors                         │
│  4. Select kernel name + file (family/link dispatch)            │
│  5. Assemble program string                                     │
│     (OPENCL.cl + shims + nmath subset + kernel source)          │
│                                                                 │
│  ── #ifdef USE_OPENCL ──────────────────────────────────────── │
│  6. Call f2_f3_kernel_runner(all_src, kernel_name, ...)         │
│  7. Copy qf_flat → NumericVector qf                             │
│  ── #endif ─────────────────────────────────────────────────── │
│                                                                 │
│  8. Wrap grad_flat → arma::mat (zero-copy)                      │
│  9. Return List(qf = ..., grad = ...)                           │
└─────────────────────────────────────────────────────────────────┘
      │  std::vector<double>: flat, contiguous, R-free, device-safe
      │  (only reached when USE_OPENCL is defined)
      ▼
┌─────────────────────────────────────────────────────────────────┐
│  Kernel runner  (compiled only when USE_OPENCL is defined)      │
│                                                                 │
│  1. Bind device (fp64-capable GPU)                              │
│  2. Create context + command queue                              │
│  3. clCreateProgramWithSource + clBuildProgram                  │
│     → GPU driver compiles the assembled string on-device        │
│  4. clCreateKernel("f2_f3_binomial_logit")                      │
│  5. clCreateBuffer × 9                                          │
│     inputs:  READ_ONLY | COPY_HOST_PTR → data copied to VRAM   │
│     outputs: WRITE_ONLY, nullptr → uninitialized device memory  │
│  6. clSetKernelArg × 12  (9 buffers + l1, l2, m1)              │
│  7. clEnqueueNDRangeKernel(global_size = m1)                    │
│     → queues the launch; GPU executes asynchronously            │
│  8. clEnqueueReadBuffer × 2 (blocking)                          │
│     → waits for GPU; copies results to qf_flat, grad_flat       │
│  9. clRelease* for all objects                                  │
└─────────────────────────────────────────────────────────────────┘
      │  clEnqueueNDRangeKernel(global = m1):
      │  ALL m1 work-items launched simultaneously — not a loop
      ▼
┌─────────────────────────────────────────────────────────────────┐
│  Kernel  (compiled on-device; m1 instances run in parallel)     │
│                                                                 │
│  Each work-item  j = get_global_id(0)  independently:           │
│    • quadratic prior term  qf[j] += 0.5 × (B[j,]−μ)ᵀP(B[j,]−μ)│
│    • for i in 0..l1:  sigmoid → p                               │
│                        nll_binomial_glmb_ocl(y[i], wt[i], p)    │
│                        gradient accumulation → g_loc[k]          │
│    • qf[j] = res_acc                                            │
│    • grad[k*m1 + j] = g_loc[k]  for k in 0..l2                 │
└─────────────────────────────────────────────────────────────────┘
      │
      └─► qf_flat, grad_flat  ←  runner  ←  wrapper  ←  R

CRAN safety: how the architecture prevents build failures

CRAN’s build servers have no OpenCL SDK installed. A package that includes CL/cl.h or links against -lOpenCL on a CRAN server will fail to build and be rejected.

The runner being entirely absent from CPU-only builds is the critical property:

The wrapper being unconditionally compiled means the full C++ API surface is present on every build. On a CPU-only build, f2_f3_opencl returns a list with qf and grad both zero — the same structure the caller expects, with zero values instead of GPU-computed values — making it straightforward to test the function’s existence and return shape without a GPU.


Applying the pattern in your package

File layout

src/
  mypkg_kernel_wrappers.cpp   # always compiled; internal #ifdef USE_OPENCL
  mypkg_kernel_runners.cpp    # all function bodies inside #ifdef USE_OPENCL
inst/cl/
  mypkg_src/
    myfunction_kernel.cl      # __kernel + @all_depends_nmath annotation

Type conversion reference

For inputs (wrapper converts C++ input types → std::vector before calling the runner):

C++ input type Conversion Runner argument type
Rcpp::NumericVector copyVector(v) const std::vector<double>&
Rcpp::NumericMatrix flattenMatrix(m) const std::vector<double>&
Rcpp::IntegerVector std::vector<int>(v.begin(), v.end()) const std::vector<int>&

For outputs (wrapper converts std::vector → required return types after the runner returns):

Return type needed Conversion from std::vector<double> When to use
Rcpp::NumericVector (small) for (j) out[j] = flat[j] Clarity
Rcpp::NumericVector (large) Rcpp::NumericVector(v.begin(), v.end()) Conciseness
arma::mat / Rcpp::NumericMatrix arma::mat(ptr, nrow, ncol, false, false) Avoid large copy

Headers and DESCRIPTION

Chapter 09 covers the generic openclPort runner infrastructure — device selection, scalar and matrix runner templates, and error handling — in detail. Chapter 10 applies this full three-role pattern end-to-end for all five ex_glmbayes gradient kernels.