Normal C++ code in an R package is compiled ahead of time by the
package build toolchain and linked into a DLL/.so. OpenCL
device code works differently: the GPU source is compiled at
runtime, on the device, by the GPU driver. This means there is
no offline linker for device code — all GPU code that will run together
must be collected into a single source string before the driver sees
it.
In OpenCL terms, that string is a program. The C++ call sequence is:
// kernel_source is one big string containing all device code
program = clCreateProgramWithSource(context, 1, &src_ptr, &src_len, &status);
clBuildProgram(program, 0, nullptr, nullptr, nullptr, nullptr);
kernel = clCreateKernel(program, "dnorm_kernel", &status);clBuildProgram invokes the vendor GPU compiler on
kernel_source and produces a device binary. A build failure
here is equivalent to a C++ compilation error, except it happens at the
moment of first use, not at R CMD INSTALL time.
#include on the deviceOpenCL C is a restricted dialect of C99. There is no file system
visible to device code: #include is simply not available
inside an OpenCL program string. Every declaration or definition that a
kernel function needs must appear earlier in the same
string.
This creates the fundamental porting challenge for
nmathopencl. R’s nmath library is written as
standard C99 and relies on a chain of #include
directives:
Those headers pull in standard C types (FILE*,
size_t, int32_t), R-specific runtime types
(SEXP, Rboolean), and R callbacks
(error(), warning(), R_alloc()).
None of these concepts exist on a GPU.
The solution is to assemble the program string in layers, prepending
OpenCL-compatible replacements for everything the nmath sources expect
to find via #include.
A complete nmathopencl kernel program is a concatenation
of four layers:
┌─────────────────────────────────────────────────────────────┐
│ Layer 0 – OPENCL.cl │
│ Global configuration: fp64 pragma, HAVE_* macros, ML_NAN │
├─────────────────────────────────────────────────────────────┤
│ Layer 1 – Upstream shims (inst/cl/R_ext_types/, R_shims/, │
│ R_ext_runtime/, System/, ...) │
│ OpenCL-C replacements for C headers the nmath sources need │
├─────────────────────────────────────────────────────────────┤
│ Layer 2 – nmath library (inst/cl/nmath/) │
│ The ported Mathlib functions, exactly the subset required │
│ by this kernel (determined by @all_depends_nmath) │
├─────────────────────────────────────────────────────────────┤
│ Layer 3 – Kernel function (inst/cl/src/) │
│ The __kernel entry point: GPU-parallel version of a CPU │
│ loop over the nmath function │
└─────────────────────────────────────────────────────────────┘
Each layer is a plain text block that is physically concatenated (in
order) to form the final string passed to
clCreateProgramWithSource.
OPENCL.cl — the global configuration
headerinst/cl/OPENCL.cl is always placed at the very top. It
sets the conditions under which the rest of the program will
compile:
#pragma OPENCL EXTENSION cl_khr_fp64 : enable /* require 64-bit floats */
/* Enable expm1/log1p built-ins on OpenCL >= 1.2 */
#if defined(__OPENCL_C_VERSION__) && (__OPENCL_C_VERSION__ >= 120)
#define HAVE_EXPM1 1
#define HAVE_LOG1P 1
#endif
#undef HAVE_LONG_DOUBLE /* no long double on GPU */
#define ML_NAN (0.0/0.0)
#define ML_POSINF INFINITY
#define ML_NEGINF -INFINITY
#define INLINE static inlineThis one file handles the biggest portability concern:
double-precision support (cl_khr_fp64) must be declared
before any double variable appears in device code, and the
HAVE_* macros tell the nmath sources which C99 math
built-ins are available.
R’s nmath sources transitively include a long chain of headers. The shim layer provides OpenCL C replacements, grouped by what they replace:
| Files | What they replace |
|---|---|
System/stdint.cl |
<stdint.h>: int32_t,
uint64_t, … using OpenCL built-in integer types |
R_ext_types/Boolean.cl, Complex.cl,
Constants.cl, etc. |
Type-only parts of <R_ext/Boolean.h>,
<R_ext/Complex.h>, etc. |
R_shims/Rinternals.cl |
Skeletal SEXP, SEXPREC,
Rboolean so transitively included headers compile |
None of these types are used by the nmath math logic; the stubs exist so the preprocessor does not abort on an unrecognized identifier.
R_shims/Rconfig.cl reproduces what R’s
configure script writes into config.h:
Without this, conditional compilation branches inside nmath would
take the wrong path (e.g., falling back to a Taylor series instead of
calling the built-in expm1).
R’s error() and warning() are host-side
callbacks that longjmp into the R runtime — they cannot
exist on the GPU. R_ext_runtime/Error.cl replaces them with
silent macros:
Similarly, R_ext_runtime/Memory.cl stubs
R_alloc() and friends (nmath’s hot paths do not actually
allocate, so these stubs satisfy the preprocessor without ever
executing), and R_ext_runtime/Print.cl stubs
Rprintf.
R_ext_runtime/Arith.cl is the most actively
used shim. It maps R’s numeric predicates to their OpenCL
equivalents:
nmath/Rmath.cl and nmath/nmath.cl close out
the shim layer by providing the forward declarations, mathematical
constants, and error-code macros that the nmath function files expect to
find in Rmath.h / nmath.h.
Every shim is minimally intrusive: it defines only
what the nmath math logic actually needs, and stubs or silences
everything else. The goal is that the nmath .cl files are
as close to verbatim copies of the C sources as possible; the shims
carry all the differences.
inst/cl/nmath/ contains approximately 180
.cl files, one per source file in R’s
src/nmath/. They implement the full suite of Mathlib
functions in OpenCL C.
@provides and @depends annotationsBecause the program is assembled at runtime, the loader must know the correct concatenation order. Each nmath file declares what it provides and what it depends on:
// From inst/cl/nmath/dnorm.cl
// @source_origin: dnorm.c
// @depends: nmath, dpq
// @provides: dnorm, dnorm4
// @all_depends: dpq, Rmath, nmath
// @load_order: 31opencltools::load_kernel_library() reads these tags,
builds a dependency graph, and performs a topological sort — files with
no unsatisfied dependencies are emitted first, then files whose
dependencies have been emitted, and so on. The result is a load order in
which every function is declared before it is called.
A kernel that calls dnorm needs only 4 nmath files. A
kernel that calls dbinom needs 18. Loading all ~180 files
for every kernel would be wasteful and slow to compile. Instead, each
kernel file carries a pre-computed @all_depends_nmath
annotation listing the exact transitive closure:
// dnorm_kernel.cl -- needs 4 files
// @all_depends_nmath: dpq, Rmath, nmath, dnorm
// dbinom_kernel.cl -- needs 18 files
// @all_depends_nmath: dpq, refactored, Rmath, nmath, stirlerr_cycle_free,
// chebyshev, cospi, fmax2, gammalims, lgammacor, log1p, gamma, lgamma,
// pgamma_utils, stirlerr_cycle_dependent, bd0, stirlerr, dbinomload_library_for_kernel() reads
@all_depends_nmath, looks up each stem in a pre-built index
(kernel_dependency_index.tsv) to get the global load order,
and concatenates exactly those files in the correct sequence. No
topological sort is computed at install time — the sort was done once
when the index was built.
The __kernel function is the entry point that the GPU
driver exposes. Conceptually it is a GPU-parallel version of a
CPU for loop:
/* CPU equivalent:
for (int i = 0; i < n; i++) out[i] = dnorm(x[i], mu[i], sigma[i], gl); */
__kernel void dnorm_kernel(
__global const double* x,
__global const double* mu,
__global const double* sigma,
__global const int* give_log,
__global double* out,
const int len
) {
int i = get_global_id(0); /* this work-item handles element i */
if (i >= len) return;
int gl = (give_log[i] != 0) ? 1 : 0;
out[i] = dnorm(x[i], mu[i], sigma[i], gl); /* same nmath call */
}get_global_id(0) returns the index of the current
work-item in the first dimension of the work grid. The host launches
exactly n work-items so each element of the input vector is
handled by one work-item in parallel.
The kernel calls dnorm — the same function name as in
the C source. All the layers below it (shims + nmath files) ensure that
dnorm is defined and reachable in the same program
string.
build_rmath_opencl_program() in
src/kernel_loader.cpp shows precisely how the four layers
are concatenated:
std::string build_rmath_opencl_program(
const std::string& kernel_relative_path,
const std::string& package,
const std::string& nmath_depends_annotation)
{
// Load exactly the nmath subset this kernel needs (via @all_depends_nmath)
const std::string nmath_src = load_library_for_kernel(
kernel_relative_path, "nmath", package, nmath_depends_annotation);
return
load_kernel_source("OPENCL.cl", package) + "\n" // Layer 0
+ load_kernel_library("libR_shims", package, false) + "\n" // Layer 1
+ load_kernel_library("R_ext_types", package, false) + "\n"
+ load_kernel_library("R_shims", package, false) + "\n"
+ load_kernel_library("R_ext_runtime", package, false) + "\n"
+ load_kernel_library("R_ext_internals", package, false) + "\n"
+ load_kernel_library("System", package, false) + "\n"
+ nmath_src + "\n" // Layer 2
+ load_kernel_source(kernel_relative_path, package); // Layer 3
}The returned string is passed directly to
clCreateProgramWithSource and then compiled by the device
driver. At that point the GPU can execute the kernel function.
Note the dual guard: each loader function itself is compiled to a
no-op when USE_OPENCL is absent (see Chapter 02), so
build_rmath_opencl_program is always callable but returns
an empty string on CPU-only builds. Callers check
nmathopencl_has_opencl() before dispatching to the GPU
path.
A downstream package that builds on nmathopencl does not
need to replicate this assembly machinery. Kernel text loading uses
opencltools (load_kernel_source,
load_kernel_library with
package = "nmathopencl").
load_library_for_kernel is re-exported from
nmathopencl (and available via
LinkingTo: nmathopencl) and handles all of layers 0, 1, and
2 automatically.
The downstream developer’s responsibilities are:
__kernel function (Layer 3):
call the nmath functions your computation needs.@depends_nmath and
@all_depends_nmath so the loader knows which nmath subset
to pull in.build_rmath_opencl_program() (or the equivalent chain),
dispatches work-items, and transfers results.Chapter 07 covers how to write and annotate kernel functions. Chapter
10 walks through all three steps end-to-end for the
ex_glmbayes example.