Chapter 05 described the internal structure of a kernel wrapper: how inputs are converted, how the runner is dispatched, and how results are converted back. This chapter takes a step back and looks at how kernel wrappers fit into the broader codebase of a package.
Two questions arise immediately:
What happens when OpenCL is not available? Every kernel wrapper must have a CPU path. A wrapper that simply returns zeros is safe but unhelpful; most real wrappers need to fall back to a correct CPU computation.
How is the wrapper exposed? Some kernel wrappers have a direct interface into R (callable from R code). Others are purely internal C++ components, called by other C++ functions that hold the R-facing API. The choice depends on whether the computation has a natural direct R use.
nmathopencl contains examples of both patterns. The
distribution-function wrappers (dnorm_opencl,
pnorm_opencl, etc.) are exported R functions with rich CPU
fallback logic. The GLM gradient wrapper (f2_f3_opencl) is
a purely internal C++ component, called by a C++ dispatcher that also
has a separate CPU implementation (f2_f3_non_opencl). Both
patterns are explored in detail below.
In Pattern 1 (R interface + R fallback), the kernel wrapper (or a thin
R function that calls it) is exported and callable directly from R. The
CPU fallback is the equivalent computation using standard R or C
functions; in nmathopencl’s case, these are the stats:: distribution
functions.
R caller
  │
  ▼
R wrapper function (exported, input validation, recycling)
  │   if inputs are non-finite, sd == 0, etc. → fallback_full()
  │
  ▼
.opencl_try_or_fallback()
  │   if !nmathopencl_has_opencl() → fallback_expr() (CPU path)
  │   if OpenCL call succeeds → return GPU result
  │   if OpenCL call fails
  │       and fallback = TRUE → fallback_expr() (CPU path)
  │       and fallback = FALSE → propagate error
  ▼
C++ kernel wrapper (internal, not exported)
  │   #ifdef USE_OPENCL + nmathopencl_has_opencl() guard
  │   type conversion + program assembly + runner dispatch
  ▼
GPU result
The fallback can be triggered at two separate levels:
R level (before the C++ call): when input
validation detects a condition the GPU path cannot handle
(e.g. sd == 0, non-finite values).
fallback_full() calls
stats::dnorm(x, mean, sd, log = log) directly.
C++ / runtime level:
.opencl_try_or_fallback() checks
nmathopencl_has_opencl() before attempting the GPU path. If
OpenCL is not available it calls fallback_expr() without
ever touching the C++ kernel wrapper. If a GPU call throws an exception
and fallback = TRUE, it catches the error and calls
fallback_expr().
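The package's own helper for this decision is written in R and appears later in this chapter. Purely as an illustration, the same decision order can be transliterated into standalone C++. Everything in this sketch (`try_or_fallback`, the `Result` alias, the `opencl_available` flag, the two callables) is hypothetical and not part of nmathopencl.

```cpp
#include <cassert>
#include <functional>
#include <stdexcept>
#include <vector>

using Result = std::vector<double>;

// Hypothetical helper mirroring the decision order described above:
//   1. OpenCL unavailable             -> CPU path, GPU path never attempted
//   2. GPU call succeeds              -> GPU result
//   3. GPU call throws, fallback on   -> CPU path
//   4. GPU call throws, fallback off  -> exception propagates to the caller
Result try_or_fallback(const std::function<Result()>& gpu_path,
                       const std::function<Result()>& cpu_path,
                       bool opencl_available,
                       bool fallback) {
    assert(gpu_path && cpu_path);  // both callables must be set
    if (!opencl_available) return cpu_path();
    try {
        return gpu_path();
    } catch (const std::exception&) {
        if (!fallback) throw;  // rethrow the original exception unchanged
        return cpu_path();
    }
}
```

Note that the availability check comes first: when OpenCL is unavailable the CPU path is used whether or not the caller asked for a fallback, which matches the behavior described for `.opencl_try_or_fallback()`.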
In Pattern 2 (C++ dispatch + CPU implementation), the kernel wrapper
has no direct R interface. It is called from within a C++ dispatcher
function alongside a CPU counterpart. The R interface belongs to a
higher-level function that selects between the two based on a
use_opencl flag passed in by the caller.
R caller
  │
  ▼
Exported R function (e.g. Ex_EnvelopeEval)
  │   validates inputs, passes use_opencl flag
  ▼
.EnvelopeEval_cpp() (internal R → C++ bridge, [[Rcpp::export]])
  ▼
EnvelopeEval_cpp() (C++ dispatcher)
  │   if use_opencl && nmathopencl_has_opencl()
  │       → f2_f3_opencl() (OpenCL kernel wrapper)
  │   else
  │       → f2_f3_non_opencl() (pure C++ CPU implementation)
  ▼
Result (qf, grad) returned regardless of path taken
The two implementations — f2_f3_opencl and
f2_f3_non_opencl — share the same function signature and
return the same data structure. The caller cannot tell from the return
value which path was taken.
dnorm_opencl

dnorm_opencl in R/normal_opencl.R is the
user-facing function. It mirrors the interface of
stats::dnorm and adds opencl_parallel,
fallback, and verbose arguments.
# R/normal_opencl.R (simplified)
#' @export
dnorm_opencl <- function(x, mean = 0, sd = 1, log = FALSE,
                         opencl_parallel = NA, fallback = FALSE,
                         verbose = FALSE) {
  # ── Input validation ──────────────────────────────────────────────────────
  # These checks mirror stats::dnorm behavior.
  if (!is.numeric(x)) stop("`x` must be numeric.")
  if (!is.numeric(mean)) stop("`mean` must be numeric.")
  if (!is.numeric(sd)) stop("`sd` must be numeric.")
  if (length(x) == 0L) return(numeric(0))
  # ── Recycling (like stats::dnorm) ─────────────────────────────────────────
  len <- max(length(x), length(mean), length(sd))
  xv <- rep_len(as.double(x), len)
  mv <- rep_len(as.double(mean), len)
  sv <- rep_len(as.double(sd), len)
  logv <- rep_len(log, len)
  # ── R-level fallback function ─────────────────────────────────────────────
  # Called when inputs contain conditions the GPU path cannot handle,
  # when OpenCL is unavailable, or when the GPU call fails and fallback = TRUE.
  fallback_full <- function() {
    stats::dnorm(x, mean = mean, sd = sd, log = log)
  }
  # ── R-level conditions that force the CPU path ────────────────────────────
  if (any(!is.finite(xv) | !is.finite(mv) | !is.finite(sv))) {
    return(fallback_full())  # stats::dnorm handles NaN, Inf, NA
  }
  if (any(sv < 0)) {
    stop("`sd` must be non-negative.", call. = FALSE)
  }
  if (any(sv == 0)) {
    return(fallback_full())  # degenerate case; stats::dnorm handles it
  }
  # ── Dispatch: try GPU, fall back to CPU on failure if fallback = TRUE ─────
  log_int <- as.integer(logv)
  opc <- .encode_opencl_parallel(opencl_parallel)
  .opencl_try_or_fallback(
    opencl_expr = function() .dnorm_opencl(xv, mv, sv, log_int, opc, verbose),
    fallback_expr = fallback_full,
    fallback = fallback,
    verbose = verbose,
    fn_name = "dnorm_opencl"
  )
}

.dnorm_opencl (dot-prefixed) is the internal
Rcpp-exported symbol for the C++ kernel wrapper. It is not part of the
public API; it exists only to make the C++ function callable from R.
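The R-level checks that route a call to the CPU path before any GPU work are easy to state as a predicate. The sketch below restates them in standalone C++ for readers who want the same gate inside a C++ dispatcher; `needs_cpu_path` is a hypothetical name, and the function assumes the three inputs have already been recycled to a common length.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical helper, not part of nmathopencl. Returns true when the inputs
// must be handled by the CPU routine, in the same order as the R wrapper:
// non-finite values first, then the negative-sd error, then sd == 0.
bool needs_cpu_path(const std::vector<double>& x,
                    const std::vector<double>& mean,
                    const std::vector<double>& sd) {
    assert(x.size() == mean.size() && x.size() == sd.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        if (!std::isfinite(x[i]) || !std::isfinite(mean[i]) || !std::isfinite(sd[i]))
            return true;  // NaN or Inf anywhere: leave it to the CPU routine
    }
    bool has_zero_sd = false;
    for (double s : sd) {
        if (s < 0.0) throw std::invalid_argument("sd must be non-negative");
        if (s == 0.0) has_zero_sd = true;  // degenerate case
    }
    return has_zero_sd;
}
```

Running the non-finite check over all elements before the sign check preserves the R wrapper's precedence: an input containing both a NaN and a negative sd falls back rather than raising an error.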
.opencl_try_or_fallback helper

This helper encapsulates the runtime dispatch logic that every Pattern 1 wrapper shares:
# R/opencl_linkage_utils.R
.opencl_try_or_fallback <- function(opencl_expr, fallback_expr,
                                    fallback, verbose, fn_name) {
  if (!nmathopencl_has_opencl()) {
    # OpenCL not available in this build or session: go straight to CPU.
    if (verbose)
      message(sprintf("[%s] OpenCL unavailable; using CPU fallback.", fn_name))
    return(fallback_expr())
  }
  # OpenCL available: try the GPU path.
  out <- tryCatch(opencl_expr(), error = function(e) e)
  if (inherits(out, "error")) {
    if (fallback) {
      # GPU call failed and the caller requested a fallback.
      if (verbose) {
        message(sprintf("[%s] OpenCL call failed; using CPU fallback.", fn_name))
        message(out$message)
      }
      return(fallback_expr())
    }
    stop(out$message, call. = FALSE)  # no fallback requested: propagate error
  }
  out  # GPU call succeeded
}

The design makes the fallback behavior explicit and controllable:
fallback = FALSE (default): if the GPU call fails, the
error propagates to the caller. The caller sees an actual error rather
than silently receiving CPU results.
fallback = TRUE: if the GPU call fails, the CPU path is
used transparently. Useful in batch workflows where any result is better
than an error.

The C++ kernel wrapper .dnorm_opencl is exported to R
via // [[Rcpp::export(name = ".dnorm_opencl")]]. It is the
minimal C++ entry point: it converts inputs, runs the GPU path if
available, and returns zeros if not.
// src/kernel_wrappers.cpp (within nmathopencl namespace)
// [[Rcpp::export(name = ".dnorm_opencl")]]
Rcpp::NumericVector dnorm_opencl(
    const Rcpp::NumericVector& x,
    const Rcpp::NumericVector& mean,
    const Rcpp::NumericVector& sd,
    const Rcpp::IntegerVector& give_log,
    int opencl_parallel_code,
    bool verbose
) {
  const int len = x.size();
  Rcpp::NumericVector out(len);  // zero-initialized
#ifdef USE_OPENCL
  if (!nmathopencl_has_opencl() || len == 0) return out;
  try {
    d_givelog_ndrange_kernel_fill(
        "src/dnorm_kernel.cl", "dnorm_kernel",
        len, {&x, &mean, &sd}, give_log, out, verbose);
  } catch (const std::exception& e) {
    if (verbose) Rcpp::Rcout << e.what() << "\n";
    throw;
  }
#endif
  return out;
}

Note that the C++ wrapper itself returns zeros when
!nmathopencl_has_opencl(). It does not
call stats::dnorm. The R wrapper is responsible for the
fallback to stats::dnorm; the C++ wrapper simply reports
“no GPU result” via zeros. This keeps the C++ layer free of any R
evaluation machinery.
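The combination of a compile-time guard and a runtime guard can be isolated in a few lines of standalone C++. In the sketch below, `has_opencl()`, `run_kernel()` and `wrapper_sketch()` are hypothetical stand-ins (they are not nmathopencl's functions), and `std::vector<double>` takes the place of `Rcpp::NumericVector`.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-ins, declared only in a GPU-enabled build.
#ifdef USE_OPENCL
bool has_opencl();  // runtime probe, defined elsewhere
void run_kernel(const std::vector<double>& x, std::vector<double>& out);
#endif

std::vector<double> wrapper_sketch(const std::vector<double>& x) {
    std::vector<double> out(x.size(), 0.0);  // zero-initialized output
#ifdef USE_OPENCL
    if (!has_opencl() || x.empty()) return out;  // runtime guard
    run_kernel(x, out);                          // GPU path fills `out`
#endif
    return out;  // in a build without USE_OPENCL this is always all zeros
}
```

Because a vector of zeros is indistinguishable from a legitimate result, a wrapper written this way is only safe when its caller checks availability first, which is the role the R wrapper plays in Pattern 1.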
f2_f3_opencl

Ex_EnvelopeEval (in R/ex_glmbayes.R) is the
user-facing function. It accepts a use_opencl flag and
delegates entirely to the C++ dispatcher:
# R/ex_glmbayes.R
#' @export
Ex_EnvelopeEval <- function(G4, y, x, mu, P, alpha, wt,
                            family, link,
                            use_opencl = FALSE,
                            verbose = FALSE) {
  # Input validation (matrix/vector type checks) ...
  .EnvelopeEval_cpp(G4, y, x, mu, P, alpha, wt,
                    family, link, use_opencl, verbose)
}

There is no R-level fallback function here. The fallback is handled entirely inside the C++ dispatcher.
EnvelopeEval_cpp (inside src/) receives
use_opencl and decides which C++ implementation to
call:
// src/ (conceptual structure; details in actual source)
Rcpp::List EnvelopeEval_cpp(
    Rcpp::NumericMatrix G4, Rcpp::NumericVector y,
    Rcpp::NumericMatrix x, Rcpp::NumericMatrix mu,
    Rcpp::NumericMatrix P, Rcpp::NumericVector alpha,
    Rcpp::NumericVector wt,
    std::string family, std::string link,
    bool use_opencl, bool verbose
) {
  // Prepare shared inputs (common to both paths) ...
  if (use_opencl && nmathopencl_has_opencl()) {
    // GPU path: call the OpenCL kernel wrapper
    return ex_glmbayes::opencl::f2_f3_opencl(
        family, link, b, y, x, mu, P, alpha, wt, verbose);
  } else {
    // CPU path: call the pure C++ implementation
    return ex_glmbayes::f2_f3_non_opencl(
        family, link, b, y, x, mu, P, alpha, wt);
  }
}

Both f2_f3_opencl and f2_f3_non_opencl
return a Rcpp::List with identical structure:
list(qf = numeric(m1), grad = matrix(m1, l2)). The
dispatcher’s caller cannot tell from the return value which path was
used.
For Pattern 1 (distribution functions), the CPU fallback is an
existing well-tested function from stats::. No separate CPU
implementation is needed.
For the GLM gradient computation, no equivalent off-the-shelf CPU
function exists. f2_f3_non_opencl is a pure C++
implementation of the same mathematical computation, written without any
OpenCL dependency. It compiles on every platform and produces results
that are numerically equivalent to the GPU path, agreeing to within
double-precision rounding rather than bit for bit.
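Agreement to within rounding is normally checked with a tolerance rather than with exact equality. The following is a minimal sketch of such a check; `all_close` is a hypothetical test helper, and the default tolerances are illustrative values, not thresholds taken from the package.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: true when every pair of elements agrees to within a
// mixed absolute/relative tolerance. Tolerances here are illustrative only.
bool all_close(const std::vector<double>& a, const std::vector<double>& b,
               double rtol = 1e-12, double atol = 1e-15) {
    if (a.size() != b.size()) return false;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double scale = std::max(std::fabs(a[i]), std::fabs(b[i]));
        // Written as !(diff <= tol) so that a NaN on either side fails the check.
        if (!(std::fabs(a[i] - b[i]) <= atol + rtol * scale)) return false;
    }
    return true;
}
```

A test suite could apply this to the `qf` and `grad` components returned by the two paths for the same inputs; how tight the tolerance can be depends on the kernel and the device.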
Having both implementations under explicit control also makes it
possible to benchmark them directly: use_opencl = FALSE
forces the CPU path even on a GPU-equipped machine.
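A direct comparison of the two paths needs nothing more than a wall-clock timer around each call. The sketch below is one possible harness using only the standard library; `mean_ms` is a hypothetical name, and the single warm-up call is an assumption about what a fair measurement needs, not something the package prescribes.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Hypothetical helper: average wall-clock milliseconds per call over `reps` runs.
double mean_ms(const std::function<void()>& fn, int reps) {
    assert(fn && reps > 0);
    using clock = std::chrono::steady_clock;
    fn();  // one untimed warm-up call, so one-time setup is not measured
    const auto t0 = clock::now();
    for (int i = 0; i < reps; ++i) fn();
    const std::chrono::duration<double, std::milli> elapsed = clock::now() - t0;
    return elapsed.count() / reps;
}
```

In use, the callable would wrap one dispatcher call with `use_opencl` fixed to `true` or `false`, so that both timings run on the same machine with the same inputs.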
The choice between Pattern 1 and Pattern 2 comes down to whether there is a natural existing CPU computation to fall back to.
| Criterion | Pattern 1 (R interface + R fallback) | Pattern 2 (C++ dispatch + CPU implementation) |
|---|---|---|
| Existing CPU function available? | Yes (stats::, base::, etc.) | No; need to write the CPU implementation |
| Does the computation have a direct R use? | Yes (called directly from R) | Often not (called from a C++ simulation loop) |
| Where does fallback live? | R level (fallback_full()) + runtime (nmathopencl_has_opencl()) | C++ level (use_opencl && nmathopencl_has_opencl()) |
| Caller can request optional fallback? | Yes (fallback = TRUE/FALSE argument) | Caller controls via use_opencl flag |
| Wrapper directly R-callable? | Yes (exported via [[Rcpp::export]]) | Not necessarily; may be purely internal C++ |
Both patterns guarantee that the package compiles and runs correctly on any machine. The GPU path is always optional; the CPU path always produces a valid (if unaccelerated) result.
nmathopencl uses a consistent naming scheme to make the
role of each function clear:
| Name | Type | Role |
|---|---|---|
| dnorm_opencl | Exported R function | User-facing API; validates inputs; manages fallback |
| .dnorm_opencl | Internal R → C++ bridge | Rcpp export; positional R → C++ call only |
| nmathopencl::dnorm_opencl | C++ kernel wrapper | #ifdef guard; type conversion; runner dispatch |
| nmathopencl::dnorm_kernel_runner | C++ kernel runner | Full OpenCL lifecycle; #ifdef USE_OPENCL only |
| Ex_EnvelopeEval | Exported R function | User-facing API; passes use_opencl flag |
| .EnvelopeEval_cpp | Internal R → C++ bridge | Positional R → C++ call only |
| f2_f3_opencl | C++ kernel wrapper | OpenCL path; used inside dispatcher |
| f2_f3_non_opencl | C++ CPU implementation | CPU path; used inside same dispatcher |
The leading dot on internal R functions signals that
they are not part of the public API and will not appear in
help search or autocompletion.
For your own package, a consistent analogous scheme might be:
myfunc_opencl() # exported R function (if direct R use)
.myfunc_opencl() # internal R → C++ bridge
mypkg::myfunc_opencl() # C++ kernel wrapper (in namespace)
mypkg::myfunc_runner() # C++ kernel runner (in namespace, #ifdef only)
mypkg::myfunc_cpu() # C++ CPU fallback (if Pattern 2)
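To make the C++ side of this scheme concrete, the sketch below lays out the Pattern 2 names in a single namespace. All of it is hypothetical: `mypkg`, the toy element-wise square, the `has_opencl()` placeholder that always reports no device, and the dispatcher name `myfunc` are assumptions for illustration, and the GPU body is elided.

```cpp
#include <cstddef>
#include <vector>

namespace mypkg {

// Placeholder runtime probe; a real package would query the OpenCL runtime.
inline bool has_opencl() { return false; }

// C++ CPU implementation (Pattern 2). Toy computation: element-wise square.
inline std::vector<double> myfunc_cpu(const std::vector<double>& x) {
    std::vector<double> out(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) out[i] = x[i] * x[i];
    return out;
}

// C++ kernel wrapper: same signature and return type as myfunc_cpu().
// The GPU body is elided; this stub reuses the CPU code so the sketch compiles.
inline std::vector<double> myfunc_opencl(const std::vector<double>& x) {
    return myfunc_cpu(x);
}

// Dispatcher: the only place that knows which path was taken.
inline std::vector<double> myfunc(const std::vector<double>& x, bool use_opencl) {
    return (use_opencl && has_opencl()) ? myfunc_opencl(x) : myfunc_cpu(x);
}

}  // namespace mypkg
```

Keeping the two implementations signature-compatible is what lets the dispatcher stay a one-line choice and lets callers ignore which path ran.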
Every kernel wrapper needs a CPU path. The two patterns differ in where that path lives and who controls the dispatch:
Pattern 1 puts the fallback logic in R, using
the existing stats:: ecosystem. It is the right choice when
the computation mirrors an existing R function and has direct R
users.
Pattern 2 puts the fallback logic in C++, alongside a dedicated CPU implementation. It is the right choice when the computation is novel, when it is called from a C++ simulation loop rather than directly from R, or when benchmarking between the two paths is important.
In both patterns the OpenCL infrastructure — the runner and the kernel — is identical. What differs is only how the wrapper is wired into the rest of the package.
Chapter 12 describes the nmathopencl R API in full,
showing how the distribution-function wrappers are documented and
organized. Chapter 10 works through the ex_glmbayes pattern
end-to-end.