Chapters 03-10 describe the infrastructure of
nmathopencl: the shim layer, the nmath
library, the kernel format, and the runner API. This chapter shows how
to put all of it together by walking through a concrete downstream
use-case: the ex_glmbayes example built into the
package.
ex_glmbayes implements GPU-accelerated evaluation of the
log-posterior and gradient (f2/f3) for
Bayesian GLMs (Gaussian, Gamma, binomial with logit/probit/cloglog
links, and Poisson). It is the type of component a package author would
build on top of nmathopencl to accelerate their own
likelihood computations. The name ex_glmbayes signals that
this is an example, not a core part of the library; it lives entirely in
ex_glmbayes_*-prefixed files and within the
ex_glmbayes C++ namespace.
For each GLM family, determine which nmath functions the GPU kernel must call:
| Kernel | nmath function(s) called | Link function |
|---|---|---|
f2_f3_binomial_logit.cl |
dbinom_raw |
logit |
f2_f3_binomial_probit.cl |
dbinom_raw, pnorm5,
dnorm4 |
probit |
f2_f3_binomial_cloglog.cl |
dbinom_raw |
cloglog |
f2_f3_gamma.cl |
dgamma |
log |
f2_f3_gaussian.cl |
dnorm4 |
identity/log |
f2_f3_poisson.cl |
lgamma (OpenCL built-in) |
log |
For f2_f3_poisson, lgamma is an OpenCL
built-in (part of the OpenCL 1.1 math specification) so no nmath files
are needed at all. For the others, consult the
@all_depends_nmath tag of each kernel file to get the full
transitive dependency list.
The @all_depends_nmath tag in each kernel file gives the
complete transitive closure. Taking the union across the five
non-Poisson kernels yields 20 files (17 nmath functions +
Rmath, nmath, refactored as
infrastructure):
Rmath, nmath, refactored,
chebyshev, cospi, fmax2, gammalims, lgammacor,
gamma, lgamma,
stirlerr_cycle_free, pgamma_utils, stirlerr_cycle_dependent, stirlerr, bd0,
dpois, dgamma, dnorm, pnorm, dbinom
These files are stored in inst/cl/ex_glmbayes_nmath/.
Rather than copying them manually, use
extract_library_subset() to derive this directory
automatically from the kernel annotations:
nmath_dir <- system.file("cl", "nmath", package = "nmathopencl")
src_dir <- system.file("cl", "ex_glmbayes_src", package = "nmathopencl")
# Load the pre-built dependency index (read once, reuse across calls)
idx <- readRDS(file.path(nmath_dir, "kernel_dependency_index.rds"))
result <- extract_library_subset(
kernel_paths = list.files(src_dir, pattern = "\\.cl$", full.names = TRUE),
library_dir = nmath_dir,
dest_dir = "inst/cl/ex_glmbayes_nmath",
depends_tag = "depends_nmath",
index = idx
)extract_library_subset() copies the 20 .cl
files plus both index files
(kernel_dependency_index.rds and
kernel_dependency_index.tsv) into the destination
directory. The index files are required for the indexed loading
described in Step 5.
A package developer would create an analogous subdirectory for their own kernel set using the same function.
Each kernel file in inst/cl/ex_glmbayes_src/ starts with
the dependency metadata block followed by the kernel body:
// @library_deps: nmath
// @calls_nmath: dbinom_raw
// @depends_nmath: dbinom
// @all_depends_nmath_count: 17
// @all_depends_nmath: Rmath, nmath, refactored, chebyshev, cospi, fmax2,
// gammalims, lgammacor, gamma, lgamma, stirlerr_cycle_free, pgamma_utils,
// stirlerr_cycle_dependent, stirlerr, bd0, dbinom
// @calls_opencl_builtin: (none)
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
#pragma OPENCL EXTENSION cl_khr_printf : enable
#define MAX_L2 64 // upper bound on l2
__kernel void f2_f3_binomial_logit(
__global const double* X, // design matrix, size = l1 x l2, col-major
__global const double* B, // grid, size = m1 x l2
__global const double* mu, // prior mean, length = l2
__global const double* P, // prior precision, size = l2 x l2
__global const double* alpha, // offset, length = l1
__global const double* y, // response, length = l1
__global const double* wt, // weights, length = l1
__global double* qf, // out: neg log-posterior, length = m1
__global double* grad, // out: gradient, size = m1 x l2
const int l1,
const int l2,
const int m1
) {
int j = get_global_id(0);
if (j >= m1) return;
// prior quadratic form: 0.5 * (B_j - mu)' P (B_j - mu)
double tmp[MAX_L2];
for (int k = 0; k < l2; ++k) {
double acc = 0.0;
for (int l = 0; l < l2; ++l)
acc += P[k*l2 + l] * (B[j*l2 + l] - mu[l]);
tmp[k] = acc;
}
double res_acc = 0.0;
for (int k = 0; k < l2; ++k)
res_acc += 0.5 * (B[j*l2 + k] - mu[k]) * tmp[k];
// gradient: prior part
double g_loc[MAX_L2];
for (int k = 0; k < l2; ++k) g_loc[k] = tmp[k];
// data likelihood: logistic link + dbinom_raw
for (int i = 0; i < l1; ++i) {
double dot = -alpha[i];
for (int k = 0; k < l2; ++k)
dot -= X[k*l1 + i] * B[j*l2 + k];
double e = exp(dot <= 0 ? dot : -dot);
double p = dot <= 0 ? 1.0/(1.0+e) : e/(1.0+e);
double q = 1.0 - p;
res_acc += -dbinom_raw(y[i], wt[i], p, q, 1);
double resid = p < 0.5 ? p*wt[i] - y[i]*wt[i]
: (1-y[i])*wt[i] - q*wt[i];
for (int k = 0; k < l2; ++k)
g_loc[k] += X[k*l1 + i] * resid;
}
qf[j] = res_acc;
for (int k = 0; k < l2; ++k)
grad[k * m1 + j] = g_loc[k];
}Key design choices:
MAX_L2 64 — a compile-time upper bound on the number of
predictors, used to size local arrays (OpenCL C does not support
variable-length arrays in all implementations).X (matches R’s memory layout
for matrices passed from R to C without transposition).j; all
l1 observations are processed sequentially within that
work-item. This is appropriate when m1 (number of grid
points) is much larger than l1 (number of observations), as
is typical for envelope sampling.The kernel runner is the low-level C++ function that manages the
OpenCL lifecycle for this specific kernel. It lives in
ex_glmbayes_kernel_runners.cpp inside the
ex_glmbayes::opencl namespace:
// ex_glmbayes_kernel_runners.cpp (abbreviated)
namespace ex_glmbayes {
namespace opencl {
void f2_f3_kernel_runner(
const std::string& kernel_source,
const char* kernel_name,
int l1, int l2, int m1,
const std::vector<double>& X_flat,
const std::vector<double>& B_flat,
const std::vector<double>& mu_flat,
const std::vector<double>& P_flat,
const std::vector<double>& alpha_flat,
const std::vector<double>& y_flat,
const std::vector<double>& wt_flat,
std::vector<double>& qf_flat,
std::vector<double>& grad_flat
) {
// Select platform and device
// Create context, command queue, program, kernel
// Create input buffers (CL_MEM_READ_ONLY) and output buffers (CL_MEM_WRITE_ONLY)
// Set kernel arguments
// Enqueue kernel with global_size = m1
// Read back qf_flat and grad_flat
// Release all resources
}
}} // namespace ex_glmbayes::openclThe runner uses the error-handling utilities from
openclPort.h (opencl_make_context_error,
opencl_status_name) to produce informative exceptions on
failure.
The kernel wrapper is the R-facing [[Rcpp::export]]
function. It validates and converts R inputs, selects the kernel name
based on family/link, assembles the program string using the dependency
index, calls the runner, and returns results as R objects. It lives in
ex_glmbayes_kernel_wrappers.cpp:
// ex_glmbayes_kernel_wrappers.cpp (abbreviated)
namespace ex_glmbayes {
namespace opencl {
// [[Rcpp::export]]
Rcpp::List f2_f3_opencl(
std::string family, std::string link,
Rcpp::NumericMatrix b, Rcpp::NumericVector y,
Rcpp::NumericMatrix x, Rcpp::NumericMatrix mu,
Rcpp::NumericMatrix P, Rcpp::NumericVector alpha,
Rcpp::NumericVector wt, int progbar)
{
int l1 = x.nrow(), l2 = x.ncol(), m1 = b.ncol();
// 1. Flatten R matrices into contiguous vectors
auto X_flat = openclPort::flattenMatrix(x);
auto B_flat = openclPort::flattenMatrix(b);
// ... (all inputs)
// 2. Dispatch: set kernel_name and kernel_file based on family/link
std::string kernel_name, kernel_file;
if (family == "binomial" && link == "logit") {
kernel_name = "f2_f3_binomial_logit";
kernel_file = "ex_glmbayes_src/f2_f3_binomial_logit.cl";
} // ... (all families)
// 3. Assemble program source using the pre-built dependency index.
// load_library_for_kernel() reads @depends_nmath from the kernel file,
// looks up the transitive closure in kernel_dependency_index.tsv, and
// loads only the required subset in the correct order --- no sort at runtime.
std::string nmath_src = openclPort::load_library_for_kernel(
kernel_file, // kernel .cl path (relative to inst/cl/)
"ex_glmbayes_nmath", // library subdir containing index + .cl files
"nmathopencl", // package
"depends_nmath" // annotation tag listing direct nmath entry points
);
std::string kernel_src = openclPort::load_kernel_source(
kernel_file, "nmathopencl");
std::string program_source = nmath_src + "\n" + kernel_src;
// 4. Allocate output buffers
std::vector<double> qf_flat(m1);
std::vector<double> grad_flat(static_cast<size_t>(m1) * l2);
// 5. Run the kernel
f2_f3_kernel_runner(program_source, kernel_name.c_str(),
l1, l2, m1,
X_flat, B_flat, mu_flat, P_flat,
alpha_flat, y_flat, wt_flat,
qf_flat, grad_flat);
// 6. Wrap outputs as R objects and return
Rcpp::NumericVector qf(qf_flat.begin(), qf_flat.end());
// ... (reshape grad_flat into a matrix)
return Rcpp::List::create(Rcpp::Named("qf") = qf, ...);
}
}} // namespace ex_glmbayes::openclload_library_for_kernel() rather than
load_kernel_library()?load_kernel_library() sorts and concatenates
every file in the subdirectory on every call. For a
137-file library like nmath this requires reading all files
sequentially. load_library_for_kernel() reads the pre-built
kernel_dependency_index.tsv once, then reads only the small
subset of files that the specific kernel actually needs — 3 files for
f2_f3_gaussian, 17 for f2_f3_binomial_logit,
and so on. The savings are significant for large libraries and become
important when the kernel wrapper is called repeatedly.
The old full-library approach is still available for cases where the exact dependency set is unknown or all files are genuinely needed:
The [[Rcpp::export]] annotation on
f2_f3_opencl causes Rcpp::compileAttributes()
to generate:
.Call entry in RcppExports.cppRcppExports.RThe R-side convenience function in ex_glmbayes.R adds a
nmathopencl_has_opencl() guard and CPU fallback:
The CPU fallback uses the same mathematical logic as the kernel but executes in C++ using standard R math calls. The relevant code lives in:
ex_glmbayes_famfuncs_binomial.cpp — binomial family
functionsex_glmbayes_famfuncs_poisson.cpp — Poisson familyex_glmbayes_famfuncs_Gamma.cpp — Gamma familyex_glmbayes_famfuncs_gaussian.cpp — Gaussian
familyex_glmbayes_EnvelopeEval.cpp — orchestrates the CPU
pathex_glmbayes_EnvelopeSize.cpp — memory allocation for
the envelopeThese files are grouped under the ex_glmbayes::fam and
ex_glmbayes::env sub-namespaces. They use R’s standard
<Rmath.h> functions (the same mathematical functions
as the GPU kernels, but via the C runtime rather than OpenCL).
The complete set of files contributing to the
ex_glmbayes example:
inst/cl/)| Directory | Files |
|---|---|
ex_glmbayes_nmath/ |
20 files: the minimal nmath subset |
ex_glmbayes_src/ |
6 files: one f2_f3_*.cl kernel per GLM family |
ex_glmbayes_draft_src/ |
Draft versions (work in progress) |
src/)| File | Namespace | Role |
|---|---|---|
ex_glmbayes_kernel_runners.cpp |
ex_glmbayes::opencl |
Low-level OpenCL runner for the f2_f3 kernels |
ex_glmbayes_kernel_wrappers.cpp |
ex_glmbayes::opencl |
R-facing wrapper: input validation, program assembly, runner dispatch |
ex_glmbayes_export_wrappers.cpp |
(Rcpp export) | [[Rcpp::export]] entry points for
EnvelopeSize, EnvelopeEval,
glmb_Standardize_Model |
ex_glmbayes_famfuncs_binomial.cpp |
ex_glmbayes::fam |
CPU binomial likelihood |
ex_glmbayes_famfuncs_poisson.cpp |
ex_glmbayes::fam |
CPU Poisson likelihood |
ex_glmbayes_famfuncs_Gamma.cpp |
ex_glmbayes::fam |
CPU Gamma likelihood |
ex_glmbayes_famfuncs_gaussian.cpp |
ex_glmbayes::fam |
CPU Gaussian likelihood |
ex_glmbayes_EnvelopeEval.cpp |
ex_glmbayes::env |
CPU envelope evaluation (f2/f3 orchestrator) |
ex_glmbayes_EnvelopeSize.cpp |
ex_glmbayes::env |
Envelope memory sizing |
ex_glmbayes_glmb_Standardize_Model.cpp |
ex_glmbayes::sim |
Design matrix standardization |
src/)| File | Namespace declared |
|---|---|
ex_glmbayes_opencl.h |
ex_glmbayes::opencl |
ex_glmbayes_famfuncs.h |
ex_glmbayes::fam |
ex_glmbayes_Envelopefuncs.h |
ex_glmbayes::env |
ex_glmbayes_simfuncs.h |
ex_glmbayes::sim |
R/)| File | Role |
|---|---|
ex_glmbayes.R |
R-facing API: f2_f3_opencl_R,
f2_f3_non_opencl |
ex_glmbayes_rcpp_wrappers.R |
Thin Rcpp wrappers for the three exported C++ export functions |
To build a similar component in your own package:
Identify your nmath dependencies — write the
@depends_nmath annotation in each of your kernel files
(listing the direct nmath entry points your kernel calls). Then call
extract_library_subset() to copy the exact transitive
closure of files, plus the companion index files, into your package’s
inst/cl/<yourpkg_nmath>/:
nmath_dir <- system.file("cl", "nmath", package = "nmathopencl")
idx <- readRDS(file.path(nmath_dir, "kernel_dependency_index.rds"))
extract_library_subset(
kernel_paths = list.files("inst/cl/mypkg_src", "\\.cl$", full.names = TRUE),
library_dir = nmath_dir,
dest_dir = "inst/cl/mypkg_nmath", # must exist
depends_tag = "depends_nmath",
index = idx
)Write your kernel(s) — add
@library_deps, @calls_nmath, and
@all_depends_nmath tags at the top of each .cl
file. Store them in inst/cl/<yourpkg_src>/.
Write a kernel runner — use
openclPort::opencl_dbl_scalar_kernel_runner directly if
your kernel fits the double-scalar layout, or write a custom runner
following the same OpenCL lifecycle pattern as
f2_f3_kernel_runner. Use
openclPort::opencl_make_context_error and
opencl_status_name for error handling.
Write a kernel wrapper — flatten your R inputs
with openclPort::flattenMatrix /
openclPort::copyVector, assemble the program string with
openclPort::load_library_for_kernel() +
openclPort::load_kernel_source(), call the runner, and
return an R-friendly result. The @depends_nmath annotation
in your kernel file drives the subset selection automatically.
Write a CPU fallback — implement the same
computation using standard R math (<Rmath.h>) or
base-R functions. Guard the GPU path with
nmathopencl_has_opencl().
Add LinkingTo: nmathopencl to your
DESCRIPTION file so that openclPort.h is found
at compile time.
Add configure scripts for CRAN safety — this is the step most commonly overlooked.
A package that references -lOpenCL or
CL/cl.h in a static src/Makevars will
fail to compile on CRAN’s build machines (which have no
GPU SDK), and no binary will be produced. The solution is a pair of
configure scripts (configure for Linux/macOS,
configure.win for Windows) that generate
src/Makevars dynamically at install time: if the
SDK is found they write -DUSE_OPENCL and the link flags; if
not they write a CPU-only Makevars. The package always compiles
cleanly.
Copy the generic templates into your package root with:
This writes configure and configure.win to
the current directory, sets configure as executable (on
Linux/macOS), and prints a checklist. The templates honor
OPENCL_HOME / OPENCL_SDK environment variables
so developers with non-standard SDK locations can point to them without
modifying system paths.
The three-entity relationship that the scripts establish:
configure / configure.win
-> detects CL/cl.h + libOpenCL at install time
-> writes -DUSE_OPENCL into Makevars (or omits it for CPU-only)
#ifdef USE_OPENCL (in your C++ source)
-> guards all GPU code; compiles cleanly either way
nmathopencl_has_opencl() (in your R code)
-> calls a compiled-in bool that mirrors the compile-time decision
-> TRUE only if -DUSE_OPENCL was set when the package was built
On Linux the configure script also runs a
runtime probe (clGetPlatformIDs) before
setting USE_OPENCL. This extra check catches the common
case where the ICD loader (libOpenCL.so) is installed but
no vendor platform has been registered in
/etc/OpenCL/vendors/. configure.win (Windows)
detects the header only; the ICD loader (OpenCL.dll) is
installed with the GPU driver and does not require a separate probe.
Add the generated files to .gitignore — they are build
artifacts, not source files:
src/Makevars
src/Makevars.win
For more detail, including the migration plan for when these
templates move to opencltools, see
system.file("configure-templates", "README.md", package = "nmathopencl").
The ex_glmbayes files are the authoritative reference
implementation for steps 1–6; step 7 applies universally to any
downstream package that exposes GPU acceleration.