When an R package uses nmathopencl to accelerate a
computation on the GPU, the work is divided among three distinct kinds
of source file. This chapter introduces them in the order that makes the
design easiest to understand: starting with the part that looks most
like ordinary C++ code and moving toward the part that is furthest from
it.
| Role | File type | Compiled by | Conditional guard | Visible to R |
|---|---|---|---|---|
| Kernel wrapper | *.cpp |
Host C++ compiler, at install time | Internal #ifdef USE_OPENCL block |
Yes (directly or via a calling function) |
| Kernel runner | *.cpp |
Host C++ compiler, at install time | Entire body inside #ifdef USE_OPENCL |
No |
| Kernel | *.cl |
GPU driver, at runtime | None needed | No |
Three design goals shape this structure:
CRAN safety. CRAN’s build servers have no OpenCL
SDK. Any package submitted to CRAN must compile cleanly without it.
Wrapping runner bodies entirely in #ifdef USE_OPENCL means
those functions simply do not exist in a CPU-only build — there are no
unresolved OpenCL symbols for the linker to complain about.
Clean type boundaries. R and OpenCL use entirely different type systems and memory models. Explicit conversion at each boundary — from C++ input types (including Rcpp wrapper types) to plain C++ types the runner accepts, to OpenCL device buffers, and back — makes those differences explicit and testable, rather than hiding them behind casts.
Separation of concerns. The wrapper owns the R interface and type conversion. The runner owns the OpenCL bookkeeping. The kernel owns the mathematical computation. None of the three needs to know the internal details of the other two.
Each of the three sections below follows the same pattern:
ex_glmbayes component of nmathopencl, which
evaluates the GLM log-posterior and its gradient over a grid of
candidate coefficient vectors. This is labeled clearly as an example
throughout.ex_glmbayes is used as the example rather than a simpler
function because it involves inputs of several different shapes, a
two-dimensional parallel structure, and two different output types. That
richness makes the type-conversion and parallelism stories concrete. A
simpler computation would follow the same pattern with fewer inputs and
a simpler output.
A Bayesian GLM sampler such as glmbayes needs to
evaluate, at each MCMC step, a function f2_f3(B) for a grid
of candidate coefficient vectors B[1], …, B[m1]. For each
grid point j, f2_f3 returns:
qf[j] — a scalar: the negative log-posterior (quadratic
prior term plus the binomial negative log-likelihood summed over all
l1 observations).grad[, j] — a vector of length l2: the
gradient of qf[j] with respect to the coefficients.The grid points are completely independent of each other, which makes
the problem ideal for the GPU. Three dimension integers —
l1 (observations), l2 (coefficients),
m1 (grid points) — appear in all three layers and must be
consistent throughout.
Before examining the three layers in detail it is worth being precise about what “parallel” means here, because it is the reason the whole architecture exists.
On a CPU, evaluating f2_f3 over m1 grid
points looks like an ordinary loop:
// CPU serial execution — one grid point at a time
for (int j = 0; j < m1; ++j) {
qf[j] = compute_neg_log_posterior(B[j], X, y, wt, mu, P, alpha);
grad[:, j] = compute_gradient(B[j], X, y, wt, mu, P, alpha);
}The iterations execute one after the other. Each call to
compute_neg_log_posterior must finish before the next
begins. With m1 = 500 and l1 = 300, that is
150,000 observation-level computations done sequentially.
OpenCL replaces the loop with a single API call that launches all
m1 iterations simultaneously:
// GPU parallel execution — all m1 grid points launched at once
size_t global = (size_t) m1;
clEnqueueNDRangeKernel(queue, kernel, 1, nullptr, &global, nullptr,
0, nullptr, nullptr);This one call enqueues m1 independent
work-items. Each work-item is an independent execution of the
__kernel function, distinguished only by its unique
index:
The GPU driver maps these work-items onto the GPU’s physical compute
units. A modern GPU has dozens to hundreds of compute units, each
capable of running multiple work-items simultaneously. With
m1 = 500, several hundred grid points can genuinely execute
at the same time, not just interleaved.
The work-items are organized internally into work-groups for
scheduling purposes. When the local work-group size is set to
nullptr (as in the runner), the driver chooses a group size
that fits the hardware. Work-items within a work-group can share fast
local memory, but the ex_glmbayes kernels do not use local
memory — each work-item is fully independent.
If the package was built without USE_OPENCL, or if
nmathopencl_has_opencl() returns FALSE at
runtime, the wrapper returns zero-filled output vectors. There is no
serial CPU fallback that computes the actual values.
This is a deliberate design choice. Implementing a correct CPU
fallback that matches the GPU computation would mean maintaining two
versions of the same mathematical code. In practice,
glmbayes calls f2_f3_opencl only after
checking nmathopencl_has_opencl(), and falls back to its
own pure-R implementation when OpenCL is not available. The zero-return
convention signals clearly that the GPU path was not executed, rather
than silently returning potentially stale or incorrect results.
When implementing your own wrapper, you face the same choice: return
zeros (safe, unambiguous), throw an error, or call a CPU implementation
of the same computation. Whatever you choose, make it explicit in the
#else branch.
| Mode | Execution | Result |
|---|---|---|
GPU available (USE_OPENCL +
nmathopencl_has_opencl()) |
m1 work-items launched simultaneously via
clEnqueueNDRangeKernel |
GPU-computed values |
| CPU-only build or no GPU at runtime | #ifdef/nmathopencl_has_opencl() guard
triggers |
Zero-filled outputs (or your chosen fallback) |
A kernel wrapper is an ordinary C++ function — always compiled, always callable — that serves as the boundary between the calling code and the GPU acceleration path. It is the right place to start because it is the most familiar-looking of the three roles.
Every kernel wrapper does the same four things, regardless of the specific computation:
std::vector types the runner requires and the OpenCL API
can handle. The exact conversions depend on the types in the function
signature.USE_OPENCL was defined at build time. If not, the function
returns the zero-initialized outputs.What varies between implementations is the function
signature (the specific C++ types coming in and going out), the
conversion steps (which helpers are needed), and the
output restructuring (how flat C++ vectors become the
return types the caller needs). The four-step structure, and the
#ifdef USE_OPENCL placement, are the same everywhere.
f2_f3_opencl from
ex_glmbayesThe following is the kernel wrapper for the binomial-logit GLM gradient computation. Read it as one concrete instance of the general pattern described above.
// ── EXAMPLE: src/ex_glmbayes_kernel_wrappers.cpp ────────────────────────────
// This wrapper is specific to the GLM log-posterior computation.
// Your own wrapper will have a different name, a different signature,
// and different conversion and output steps.
// ────────────────────────────────────────────────────────────────────────────
#include <RcppArmadillo.h>
#include "openclPort.h"
#include "ex_glmbayes_opencl.h"
using namespace openclPort;
using namespace ex_glmbayes::opencl;
namespace ex_glmbayes { namespace opencl {
Rcpp::List f2_f3_opencl(
std::string family, // "binomial", "poisson", etc.
std::string link, // "logit", "probit", etc.
Rcpp::NumericMatrix b, // coefficient grid, l2 × m1
Rcpp::NumericVector y, // response, length l1
Rcpp::NumericMatrix x, // design matrix, l1 × l2
Rcpp::NumericMatrix mu, // prior mean
Rcpp::NumericMatrix P, // prior precision, l2 × l2
Rcpp::NumericVector alpha, // offsets, length l1
Rcpp::NumericVector wt, // weights / trial counts, length l1
int progbar
// Your wrapper's parameters will reflect your own computation's
// inputs. A simpler function might take just one or two vectors.
) {
int l1 = x.nrow(), l2 = x.ncol(), m1 = b.ncol();
// Your dimension variables will match your own inputs.std::vector<double> // ── EXAMPLE: input conversion specific to this GLM computation ──────────
// The C++ types coming in (Rcpp::NumericMatrix, Rcpp::NumericVector)
// reflect what this particular function signature requires.
// Your wrapper will have different input types and will apply whichever
// combination of flattenMatrix / copyVector / manual conversion matches
// your own function signature.
auto X_flat = flattenMatrix(x);
auto B_flat = flattenMatrix(b);
auto mu_flat = flattenMatrix(mu);
auto P_flat = flattenMatrix(P);
auto alpha_flat = copyVector(alpha);
auto y_flat = copyVector(y);
auto wt_flat = copyVector(wt);This conversion step, and the output vectors allocated next, all
appear outside any #ifdef USE_OPENCL block. The
reason is that the output vectors (qf_flat,
grad_flat) must be in scope after #endif when
the wrapper constructs the return value from them. Keeping all the
conversions together, before the conditional block, makes the scoping
unambiguous.
Why not pass the Rcpp objects directly to the
runner? A Rcpp::NumericMatrix is a C++ object
whose data lives in R’s garbage-collected heap (as an R SEXP). Passing a
raw pointer into that storage to the OpenCL API is unsafe: the driver
requires a stable, contiguous memory region for the duration of the
clCreateBuffer copy, and R’s GC can move or invalidate heap
objects when control returns to R. A
std::vector<double> provides the required guarantee:
it owns its memory, controls its lifetime through ordinary C++ scope
rules, and will not move or be freed until the vector itself goes out of
scope.
Why double and not float?
The function signature uses double throughout because that
is the native precision of Rcpp::NumericMatrix and
Rcpp::NumericVector — and of the double*
arrays they wrap. Using 32-bit float inside a kernel would
require explicit downcast on every input and upcast on every output, and
each round-trip introduces rounding error that makes GPU results diverge
from the CPU reference computation. Using double throughout
keeps the two paths numerically equivalent, at the cost of requiring the
cl_khr_fp64 extension on the device (supported on all
current discrete GPUs and AMD APUs).
Memory layout. Rcpp::NumericMatrix
stores its data in column-major order internally (matching R’s own
convention). flattenMatrix preserves that order: it
traverses the matrix column-by-column, so the resulting flat vector has
the same layout. The kernel indexes the design matrix as
X[col * l1 + row] — column-major indexing — so the flat
vector and the kernel’s index arithmetic agree. If the flat vector were
row-major, every off-diagonal access would silently return the wrong
value.
// openclPort helpers used above (defined in OpenCL_helper.cpp):
std::vector<double> flattenMatrix(const Rcpp::NumericMatrix& mat) {
int nrow = mat.nrow(), ncol = mat.ncol();
std::vector<double> out;
out.reserve(static_cast<size_t>(nrow) * ncol);
for (int j = 0; j < ncol; ++j) // outer: columns
for (int i = 0; i < nrow; ++i) // inner: rows within column
out.push_back(mat(i, j));
return out;
}
std::vector<double> copyVector(const Rcpp::NumericVector& vec) {
return std::vector<double>(vec.begin(), vec.end());
}
// For IntegerVector: std::vector<int>(v.begin(), v.end()) // ── EXAMPLE: output sizes specific to this computation ──────────────────
// Your output buffers will match your kernel's output parameters.
// A simpler kernel writing one output vector would need only one
// std::vector here.
std::vector<double> qf_flat(m1); // scalar per grid point
std::vector<double> grad_flat(static_cast<size_t>(m1) * l2); // gradient matrix
Rcpp::NumericVector qf(m1); // zero-initialized; returned as-is on CPU-only buildstd::vector<double> zero-initializes its elements
by default. If USE_OPENCL was not defined, the wrapper
returns these zero-filled objects — a predictable, testable result on
any machine without a GPU. The static_cast<size_t>
prevents integer overflow in m1 * l2 on 32-bit
platforms.
#ifdef USE_OPENCL
// ── EXAMPLE: family/link dispatch specific to GLM ────────────────────────
// This dispatch is needed here because one wrapper covers multiple
// GLM families. Your wrapper may call a single fixed kernel, in which
// case kernel_name and kernel_file are constants, not variables.
std::string kernel_name, kernel_file;
if (family == "binomial" || family == "quasibinomial") {
if (link == "logit") {
kernel_name = "f2_f3_binomial_logit";
kernel_file = "ex_glmbayes_src/f2_f3_binomial_logit.cl";
} else if (link == "probit") {
kernel_name = "f2_f3_binomial_probit";
kernel_file = "ex_glmbayes_src/f2_f3_binomial_probit.cl";
}
/* ... other link functions ... */
} else if (family == "poisson" || family == "quasipoisson") {
kernel_name = "f2_f3_poisson";
kernel_file = "ex_glmbayes_src/f2_f3_poisson.cl";
}
/* ... other families ... */
// ── Program string assembly (same pattern for all wrappers) ─────────────
// The shim libraries are always the same. What changes is the nmath
// subset (driven by your kernel's @all_depends_nmath annotation) and
// the kernel source file itself.
std::string all_src =
load_kernel_source("OPENCL.cl") +
"\n" + load_kernel_library("libR_shims", "nmathopencl", false) +
"\n" + load_kernel_library("R_ext_types", "nmathopencl", false) +
"\n" + load_kernel_library("R_shims", "nmathopencl", false) +
"\n" + load_kernel_library("R_ext_runtime", "nmathopencl", false) +
"\n" + load_kernel_library("R_ext_internals", "nmathopencl", false) +
"\n" + load_kernel_library("System", "nmathopencl", false) +
"\n" + load_library_for_kernel(
kernel_file, "ex_glmbayes_nmath", "nmathopencl",
"all_depends_nmath") +
"\n" + load_kernel_source(kernel_file);OpenCL has no #include directive, so the wrapper must
concatenate every source file the kernel depends on into one string
before calling the runner. The layers are assembled in dependency order
(Chapter 03 explains each layer): platform definitions, R compatibility
shims, the nmath subset the kernel needs, and finally the kernel source
itself. load_library_for_kernel reads the
@all_depends_nmath annotation from the kernel file and
loads only those nmath files in topological order, rather than loading
all 137.
// ── EXAMPLE: runner call specific to this computation ───────────────────
// Your call site will name your own runner function and pass your
// own flat vectors. The pattern — pass the assembled string, the
// kernel name, the dimensions, all inputs, and the output buffers —
// is the same.
f2_f3_kernel_runner(
all_src, kernel_name.c_str(),
l1, l2, m1,
X_flat, B_flat, mu_flat, P_flat, alpha_flat, y_flat, wt_flat,
qf_flat, grad_flat); // filled in by the runner
for (int j = 0; j < m1; ++j)
qf[j] = qf_flat[j]; // copy scalar outputs into the Rcpp output vector
#else
Rcpp::Rcout << "[INFO] OpenCL not available — returning zeros.\n";
#endif // ── EXAMPLE: two output conversion patterns ──────────────────────────────
// qf uses an element-by-element copy (clear, zero overhead for small m1).
// grad uses a zero-copy Armadillo wrap to avoid copying a large matrix.
//
// Your wrapper will use whichever conversion pattern matches its own
// return type: a copy loop or Rcpp::NumericVector(...) for a vector result,
// arma::mat for a matrix, or a Rcpp::List assembling multiple outputs.
arma::mat grad_arma(
grad_flat.data(), // pointer to existing flat storage — not copied
m1, // number of rows
l2, // number of columns
false, // copy_aux_mem = false: borrow the memory
false // strict = false
);
return Rcpp::List::create(
Rcpp::Named("qf") = qf,
Rcpp::Named("grad") = grad_arma
);
}
}} // namespace ex_glmbayes::openclTwo output conversion strategies are shown here.
Element-by-element copy (qf): the runner writes
results into the plain std::vector<double> qf_flat.
After the runner returns, each element is copied into the
Rcpp::NumericVector qf that was pre-allocated before the
#ifdef block. This is the clearest pattern and is
effectively free when m1 is small. For larger outputs,
Rcpp::NumericVector(v.begin(), v.end()) achieves the same
conversion in one line.
Zero-copy Armadillo wrap (grad): the runner
writes into std::vector<double> grad_flat. Rather
than copying all m1 × l2 elements into a new matrix object,
an arma::mat is constructed over grad_flat’s
existing memory using Armadillo’s
(ptr, nrow, ncol, copy_aux_mem, strict) constructor.
Setting copy_aux_mem = false tells Armadillo to use the
existing memory directly. The matrix is valid until
grad_flat is destroyed, which happens after
Rcpp::List::create serializes the result — so the lifetime
is safe. For typical MCMC grid sizes the gradient matrix can be several
hundred kilobytes; avoiding the copy measurably reduces per-call
overhead.
Your own wrapper will use whichever conversion matches its return
type: a copy loop or Rcpp::NumericVector(...) for a vector
result, an arma::mat wrap for a large matrix, or a
Rcpp::List assembling multiple outputs.
A kernel runner is a C++ function whose entire body is enclosed in
#ifdef USE_OPENCL … #endif. It literally does not exist in
a CPU-only build: no prototype visible to the linker, no symbols in the
object file, no CL/cl.h ever seen by the preprocessor.
Every kernel runner follows the same eight-step lifecycle, regardless of what computation it performs:
What varies between implementations is the function
signature (the specific std::vector and
int parameters matching your computation), the
number and arrangement of device buffers (one
cl_mem per kernel parameter), and the number of
clSetKernelArg calls (one per kernel parameter, in
the same order as the kernel’s __kernel signature). The
eight steps and their sequence are the same for every runner.
The runner’s interface uses only plain C++ types —
std::vector<double>, int,
std::string, const char*. It has no Rcpp
dependencies and no knowledge of R’s memory model. This makes the runner
the easiest of the three layers to test in isolation.
f2_f3_kernel_runner from
ex_glmbayes// ── EXAMPLE: src/ex_glmbayes_kernel_runners.cpp ─────────────────────────────
// This runner is specific to the GLM gradient computation, which has
// seven input matrices/vectors and two output buffers.
// Your runner will have a different name, a different parameter list,
// and a different number of clCreateBuffer / clSetKernelArg calls,
// but the eight lifecycle steps will be the same.
// ────────────────────────────────────────────────────────────────────────────
#ifdef USE_OPENCL
#define CL_TARGET_OPENCL_VERSION 300
#include <CL/cl.h>
#include "openclPort.h"
using namespace openclPort;
namespace ex_glmbayes { namespace opencl {
void f2_f3_kernel_runner(
const std::string& kernel_source, // fully assembled program text
const char* kernel_name, // e.g. "f2_f3_binomial_logit"
int l1, int l2, int m1,
// ── EXAMPLE: seven input vectors specific to the GLM computation ────────
// Your runner will have whatever std::vector parameters your kernel needs.
// A simpler kernel might take just one or two input vectors.
const std::vector<double>& X_flat,
const std::vector<double>& B_flat,
const std::vector<double>& mu_flat,
const std::vector<double>& P_flat,
const std::vector<double>& alpha_flat,
const std::vector<double>& y_flat,
const std::vector<double>& wt_flat,
// ── EXAMPLE: two output vectors specific to this computation ────────────
std::vector<double>& qf_flat, // written back on return
std::vector<double>& grad_flat // written back on return
) { cl_platform_id platform = nullptr;
cl_device_id device = nullptr;
opencl_bind_selected_fp64_device_or_throw(platform, device);
// This call is the same in every runner. It selects the device the
// user chose via opencltools::set_opencl_device() and verifies that
// it supports 64-bit floating point. cl_context context =
clCreateContext(nullptr, 1, &device, nullptr, nullptr, &status);
cl_queue_properties props[] = {0};
cl_command_queue queue =
clCreateCommandQueueWithProperties(context, device, props, &status);
// These two calls are identical in every runner.An OpenCL context is the environment within which all subsequent objects live — buffers, programs, kernels, and queues all belong to a context and share access to the same device memory. A command queue is the channel through which the host issues instructions to the device. Commands are placed in the queue and the device executes them in order, asynchronously with the host CPU.
const char* src_ptr = kernel_source.c_str();
size_t src_len = kernel_source.size();
cl_program program =
clCreateProgramWithSource(context, 1, &src_ptr, &src_len, &status);
status = clBuildProgram(program, 0, nullptr, nullptr, nullptr, nullptr);
// These calls are identical in every runner. If compilation fails,
// the runner retrieves the build log from the driver and throws a
// std::runtime_error with the full diagnostic.
cl_kernel kernel = clCreateKernel(program, kernel_name, &status);
// kernel_name is the C-string name of the __kernel function in the
// .cl file — passed in from the wrapper.clBuildProgram triggers on-device compilation: the GPU
driver parses the OpenCL C source, optimizes it, and produces GPU-native
machine code specific to the selected device. This is analogous to
R CMD SHLIB but happens at runtime inside the GPU driver
process. clCreateKernel then extracts the specific
__kernel function by name from the compiled program.
// ── EXAMPLE: nine buffers specific to this computation ──────────────────
// Your runner will have one cl_mem per kernel parameter (excluding
// scalar integers, which are passed directly in Step 5).
// The split between READ_ONLY and WRITE_ONLY reflects the kernel's
// use of each buffer.
// Input buffers: allocate and copy from host in one call
cl_mem bufX = clCreateBuffer(
context,
CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
sizeof(double) * X_flat.size(),
(void*) X_flat.data(),
&status);
// ... similarly for bufB, bufMu, bufP, bufA, bufY, bufW ...
// Output buffers: allocate uninitialized device memory
cl_mem bufQF = clCreateBuffer(
context,
CL_MEM_WRITE_ONLY,
sizeof(double) * m1,
nullptr, // no initial data — the kernel will write here
&status);
cl_mem bufGrad = clCreateBuffer(
context,
CL_MEM_WRITE_ONLY,
sizeof(double) * l2 * m1,
nullptr,
&status);For input buffers,
CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR allocates device
memory and immediately copies the flat vector’s data into it. After this
call returns, the device buffer is fully populated and the host-side
vector is no longer needed by the driver.
For output buffers, CL_MEM_WRITE_ONLY with a
nullptr host pointer allocates uninitialized device memory
that the kernel will write into. The results are read back to the host
after execution.
// ── EXAMPLE: twelve arguments for this kernel ────────────────────────────
// Your runner will have one clSetKernelArg call per kernel parameter,
// in the exact same order as the parameters appear in the __kernel
// function signature. Mismatched order causes a runtime error.
int arg = 0;
clSetKernelArg(kernel, arg++, sizeof(cl_mem), &bufX);
clSetKernelArg(kernel, arg++, sizeof(cl_mem), &bufB);
clSetKernelArg(kernel, arg++, sizeof(cl_mem), &bufMu);
clSetKernelArg(kernel, arg++, sizeof(cl_mem), &bufP);
clSetKernelArg(kernel, arg++, sizeof(cl_mem), &bufA);
clSetKernelArg(kernel, arg++, sizeof(cl_mem), &bufY);
clSetKernelArg(kernel, arg++, sizeof(cl_mem), &bufW);
clSetKernelArg(kernel, arg++, sizeof(cl_mem), &bufQF);
clSetKernelArg(kernel, arg++, sizeof(cl_mem), &bufGrad);
// Scalar integers are passed by value, not as cl_mem handles:
clSetKernelArg(kernel, arg++, sizeof(int), &l1);
clSetKernelArg(kernel, arg++, sizeof(int), &l2);
clSetKernelArg(kernel, arg++, sizeof(int), &m1);clSetKernelArg connects a host-side value to a
positional parameter in the kernel’s __kernel signature.
The second argument is the zero-based parameter index. Buffer arguments
(cl_mem) are passed by address; scalar arguments
(int, double) are also passed by address and
the driver copies the value. The count and order of calls here must
exactly mirror the count and order of parameters in the .cl
file.
// The global work size is the total number of work-items to launch.
// For this computation it is m1: one independent instance of the kernel
// function per grid point. All m1 instances are launched simultaneously
// (see "Serial vs. parallel execution" earlier in this chapter).
// Your kernel may use a different global size reflecting its own
// parallelism structure — e.g. the number of output elements.
size_t global = (size_t) m1;
clEnqueueNDRangeKernel(
queue, kernel,
1, // one-dimensional work space (indexed by j = 0..m1-1)
nullptr, // global offset: start from index 0
&global, // total work-items: m1
nullptr, // local work-group size: let the driver choose
0, nullptr, nullptr);
// This call places the launch command in the queue and returns immediately.
// The GPU executes the m1 work-items asynchronously with the host CPU.
// The blocking clEnqueueReadBuffer in Step 7 is what forces the host to
// wait until all work-items have finished before reading the results. // ── EXAMPLE: two read-backs for this computation ─────────────────────────
// Your runner will have one clEnqueueReadBuffer call per output buffer.
clEnqueueReadBuffer(
queue, bufQF,
CL_TRUE, // blocking: wait until done
0,
sizeof(double) * qf_flat.size(),
qf_flat.data(),
0, nullptr, nullptr);
clEnqueueReadBuffer(
queue, bufGrad,
CL_TRUE,
0,
sizeof(double) * grad_flat.size(),
grad_flat.data(),
0, nullptr, nullptr);
// After these calls, qf_flat and grad_flat contain GPU-computed results.CL_TRUE makes each call blocking: it waits until the
data transfer from device to host is complete. Because commands execute
in queue order, this also guarantees that the kernel has finished.
// Release in reverse order of creation.
// Every runner must release every cl_mem, cl_kernel, cl_program,
// cl_command_queue, and cl_context it creates. Omissions leak GPU VRAM.
clReleaseMemObject(bufGrad);
clReleaseMemObject(bufQF);
clReleaseMemObject(bufW);
clReleaseMemObject(bufY);
clReleaseMemObject(bufA);
clReleaseMemObject(bufP);
clReleaseMemObject(bufMu);
clReleaseMemObject(bufB);
clReleaseMemObject(bufX);
clReleaseKernel(kernel);
clReleaseProgram(program);
clReleaseCommandQueue(queue);
clReleaseContext(context);
}
}} // namespace ex_glmbayes::opencl
#endif // USE_OPENCL
// Everything from the opening #ifdef to this #endif is invisible to the
// preprocessor on a machine without OpenCL.*.cl)A kernel is an OpenCL C source file that contains the actual
computation to be run on the GPU. It is the innermost piece: pure math,
with no knowledge of R, no knowledge of the runner’s bookkeeping, and no
access to host memory. Its only interface with the outside world is the
set of __global pointer and scalar arguments the runner
hands to it via clSetKernelArg.
Every kernel file has the same basic structure, regardless of the computation:
@all_depends_nmath (and optionally @depends,
@provides) that the wrapper’s program assembler reads to
determine which nmath files to prepend. If your kernel does not call any
nmath functions, this section can be omitted.#pragma OPENCL EXTENSION cl_khr_fp64 : enable if the kernel
uses double.static inline
device-side functions called by the __kernel entry point,
including nmath functions that will be prepended by the assembler.__kernel entry point — the function
the runner calls via clCreateKernel.What varies between kernels is the __kernel
signature (one __global parameter per device
buffer allocated in the runner), the scalar parameters
(dimension integers and other constants), and the computation
itself.
A kernel is never compiled when the R package is installed. Instead,
the source text is assembled into a string by the wrapper, passed to the
runner, and handed to clCreateProgramWithSource. The GPU
driver compiles it to native machine code at runtime. This on-device
compilation allows a single source file to run on any GPU without
vendor-specific binaries.
f2_f3_binomial_logit from
ex_glmbayes// ── EXAMPLE: inst/cl/ex_glmbayes_src/f2_f3_binomial_logit.cl ───────────────
// This kernel is specific to the binomial-logit GLM gradient computation.
// Your kernel will have a different name, a different parameter list
// (matching the device buffers your runner allocates), and a different
// computation. The file structure — annotations, pragma, helpers,
// __kernel entry point — is the same.
// ────────────────────────────────────────────────────────────────────────────
// @all_depends_nmath: dpq, refactored, Rmath, nmath, stirlerr_cycle_free,
// chebyshev, cospi, fmax2, gammalims, lgammacor, log1p, gamma, lgamma,
// pgamma_utils, stirlerr_cycle_dependent, bd0, stirlerr, dbinom
// If your kernel calls no nmath functions, omit this annotation entirely.
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
// Required whenever the kernel uses 'double'. Always include this.
#pragma OPENCL EXTENSION cl_khr_printf : enable
// Optional; enables printf() for debugging inside the kernel.
// ── EXAMPLE: compile-time constant specific to this computation ─────────────
// MAX_L2 bounds the size of the per-work-item private array g_loc[].
// GPU private memory is fixed at compile time, so a maximum is needed.
// Your kernel may need a similar constant if it uses fixed-size local arrays,
// or may not need one at all if its private storage is scalar.
#define MAX_L2 64@all_depends_nmath lists the nmath
files whose source must be prepended before this kernel when the program
string is assembled. The wrapper’s load_library_for_kernel
reads this annotation and loads exactly those files in topological
order. Chapter 03 explains the annotation scheme and the dependency
graph in detail.
MAX_L2 bounds the size of a
per-work-item stack array. Each work-item allocates
double g_loc[MAX_L2] in private memory — the GPU
equivalent of a register file — whose size must be known at compile
time. A maximum coefficient count of 64 is sufficient for the GLM use
case. Your kernel may need a similar constant, or may not require any
fixed-size private arrays at all.
// ── EXAMPLE: parameter list matching the nine buffers allocated in the runner
// Your kernel's __global parameters must exactly correspond (in number,
// type, and order) to the cl_mem buffers created by clCreateBuffer in the
// runner and bound by clSetKernelArg.
__kernel void f2_f3_binomial_logit(
__global const double* X, // design matrix, l1 × l2, column-major
__global const double* B, // coefficient grid, m1 × l2
__global const double* mu, // prior mean vector, length l2
__global const double* P, // prior precision matrix, l2 × l2
__global const double* alpha, // offsets, length l1
__global const double* y, // response (success proportions), length l1
__global const double* wt, // trial counts / weights, length l1
__global double* qf, // OUTPUT: neg log-posterior, length m1
__global double* grad, // OUTPUT: gradient, m1 × l2
const int l1,
const int l2,
const int m1
// Scalar integers are passed by value; no __global qualifier.
) {__kernel marks this function as a GPU
entry point that clCreateKernel can look up by name.
__global is an OpenCL address-space
qualifier. It marks a pointer as pointing into global device
memory — the GPU’s main VRAM — which all work-items can access.
Pointers without an address-space qualifier point into the work-item’s
private memory, which is invisible to other work-items and cannot be
mapped from the host. Every cl_mem buffer created in the
runner corresponds to exactly one __global pointer in the
kernel signature.
const int l1, l2, m1 are scalar values
set via clSetKernelArg with sizeof(int) and
passed by value. Scalar kernel arguments do not use
__global.
int j = get_global_id(0); // which grid point is this work-item?
if (j >= m1) return; // guard: don't process out-of-range indices
// Your kernel's first line will also call get_global_id(0) to determine
// which element or slice this work-item is responsible for.get_global_id(0) returns the unique index of this
work-item within the one-dimensional work space. When the runner
enqueues the kernel with a global size of m1, the GPU
launches m1 work-items simultaneously, each receiving a
different value of j. Each work-item executes the function
body independently for its own grid point. This is the GPU’s answer to
the CPU’s outer for (j = 0; j < m1; j++) loop.
The if (j >= m1) return guard handles the case where
the GPU rounds up the number of work-items to a multiple of the
hardware’s work-group size. Without this guard, extra work-items would
write past the end of the output buffers.
// ── EXAMPLE: GLM log-posterior and gradient ───────────────────────────
// The computation below is specific to the binomial-logit model.
// Your kernel body will contain whatever computation you are
// accelerating, reading from __global input pointers and writing
// to __global output pointers.
double g_loc[MAX_L2]; // private (per-work-item) working storage
// 1. Quadratic prior term: 0.5 × (B[j,] − μ)ᵀ P (B[j,] − μ)
// 2. Binomial likelihood loop over l1 observations
// 3. Write results
qf[j] = res_acc;
for (int k = 0; k < l2; ++k)
grad[k * m1 + j] = g_loc[k];
// grad is stored as grad[col * m1 + row] — column-major — so that
// the wrapper's arma::mat(ptr, m1, l2) interprets it correctly.
// Your output layout should match how your wrapper will read it back.The gradient is written in column-major order
(grad[col * m1 + row]) so that the
arma::mat(grad_flat.data(), m1, l2) construction in the
wrapper produces an m1 × l2 matrix in the layout R expects.
The decision about output memory layout is made in the kernel and must
be consistent with how the wrapper reads the data back.
R: f2_f3_opencl("binomial", "logit", b, y, x, mu, P, alpha, wt)
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Kernel wrapper (always compiled, always callable) │
│ │
│ 1. Read dimensions │
│ 2. Convert Rcpp inputs → std::vector<double> │
│ (flattenMatrix for matrices, copyVector for vectors) │
│ 3. Allocate zero-filled output vectors │
│ 4. Select kernel name + file (family/link dispatch) │
│ 5. Assemble program string │
│ (OPENCL.cl + shims + nmath subset + kernel source) │
│ │
│ ── #ifdef USE_OPENCL ──────────────────────────────────────── │
│ 6. Call f2_f3_kernel_runner(all_src, kernel_name, ...) │
│ 7. Copy qf_flat → NumericVector qf │
│ ── #endif ─────────────────────────────────────────────────── │
│ │
│ 8. Wrap grad_flat → arma::mat (zero-copy) │
│ 9. Return List(qf = ..., grad = ...) │
└─────────────────────────────────────────────────────────────────┘
│ std::vector<double>: flat, contiguous, R-free, device-safe
│ (only reached when USE_OPENCL is defined)
▼
┌─────────────────────────────────────────────────────────────────┐
│ Kernel runner (compiled only when USE_OPENCL is defined) │
│ │
│ 1. Bind device (fp64-capable GPU) │
│ 2. Create context + command queue │
│ 3. clCreateProgramWithSource + clBuildProgram │
│ → GPU driver compiles the assembled string on-device │
│ 4. clCreateKernel("f2_f3_binomial_logit") │
│ 5. clCreateBuffer × 9 │
│ inputs: READ_ONLY | COPY_HOST_PTR → data copied to VRAM │
│ outputs: WRITE_ONLY, nullptr → uninitialized device memory │
│ 6. clSetKernelArg × 12 (9 buffers + l1, l2, m1) │
│ 7. clEnqueueNDRangeKernel(global_size = m1) │
│ → queues the launch; GPU executes asynchronously │
│ 8. clEnqueueReadBuffer × 2 (blocking) │
│ → waits for GPU; copies results to qf_flat, grad_flat │
│ 9. clRelease* for all objects │
└─────────────────────────────────────────────────────────────────┘
│ clEnqueueNDRangeKernel(global = m1):
│ ALL m1 work-items launched simultaneously — not a loop
▼
┌─────────────────────────────────────────────────────────────────┐
│ Kernel (compiled on-device; m1 instances run in parallel) │
│ │
│ Each work-item j = get_global_id(0) independently: │
│ • quadratic prior term qf[j] += 0.5 × (B[j,]−μ)ᵀP(B[j,]−μ)│
│ • for i in 0..l1: sigmoid → p │
│ nll_binomial_glmb_ocl(y[i], wt[i], p) │
│ gradient accumulation → g_loc[k] │
│ • qf[j] = res_acc │
│ • grad[k*m1 + j] = g_loc[k] for k in 0..l2 │
└─────────────────────────────────────────────────────────────────┘
│
└─► qf_flat, grad_flat ← runner ← wrapper ← R
CRAN’s build servers have no OpenCL SDK installed. A package that
includes CL/cl.h or links against -lOpenCL on
a CRAN server will fail to build and be rejected.
The runner being entirely absent from CPU-only builds is the critical property:
#include <CL/cl.h> appears only at the top of the
runner file, inside #ifdef USE_OPENCL. On a CRAN build the
preprocessor never reaches that line.cl_* type and every clCreate* /
clEnqueue* call lives inside the same #ifdef
block. None of them generate any object-code symbols.-lOpenCL and the
build succeeds on any machine.The wrapper being unconditionally compiled means the full C++ API
surface is present on every build. On a CPU-only build,
f2_f3_opencl returns a list with qf and
grad both zero — the same structure the caller expects,
with zero values instead of GPU-computed values — making it
straightforward to test the function’s existence and return shape
without a GPU.
src/
mypkg_kernel_wrappers.cpp # always compiled; internal #ifdef USE_OPENCL
mypkg_kernel_runners.cpp # all function bodies inside #ifdef USE_OPENCL
inst/cl/
mypkg_src/
myfunction_kernel.cl # __kernel + @all_depends_nmath annotation
For inputs (wrapper converts C++ input types →
std::vector before calling the runner):
| C++ input type | Conversion | Runner argument type |
|---|---|---|
Rcpp::NumericVector |
copyVector(v) |
const std::vector<double>& |
Rcpp::NumericMatrix |
flattenMatrix(m) |
const std::vector<double>& |
Rcpp::IntegerVector |
std::vector<int>(v.begin(), v.end()) |
const std::vector<int>& |
For outputs (wrapper converts std::vector → required
return types after the runner returns):
| Return type needed | Conversion from std::vector<double> |
When to use |
|---|---|---|
Rcpp::NumericVector (small) |
for (j) out[j] = flat[j] |
Clarity |
Rcpp::NumericVector (large) |
Rcpp::NumericVector(v.begin(), v.end()) |
Conciseness |
arma::mat / Rcpp::NumericMatrix |
arma::mat(ptr, nrow, ncol, false, false) |
Avoid large copy |
DESCRIPTIONopenclPort.h
(for load_kernel_source,
load_library_for_kernel, flattenMatrix,
copyVector) and RcppArmadillo.h if returning
matrices.openclPort.h
unconditionally; include CL/cl.h inside
#ifdef USE_OPENCL.DESCRIPTION: add
nmathopencl to LinkingTo; add
Imports: Rcpp, RcppArmadillo as appropriate.Chapter 09 covers the generic openclPort runner
infrastructure — device selection, scalar and matrix runner templates,
and error handling — in detail. Chapter 10 applies this full three-role
pattern end-to-end for all five ex_glmbayes gradient
kernels.