Chapter 06: Integrating Kernel Wrappers into Your Codebase

Kjell Nygren

2026-06-11

Introduction

Chapter 05 described the internal structure of a kernel wrapper: how inputs are converted, how the runner is dispatched, and how results are converted back. This chapter takes a step back and looks at how kernel wrappers fit into the broader codebase of a package.

Two questions arise immediately:

  1. What happens when OpenCL is not available? Every kernel wrapper must have a CPU path. A wrapper that simply returns zeros is safe but unhelpful; most real wrappers need to fall back to a correct CPU computation.

  2. How is the wrapper exposed? Some kernel wrappers have a direct interface into R (callable from R code). Others are purely internal C++ components, called by other C++ functions that hold the R-facing API. The choice depends on whether the computation has a natural direct R use.

nmathopencl contains examples of both patterns. The distribution-function wrappers (dnorm_opencl, pnorm_opencl, etc.) are exported R functions with rich CPU fallback logic. The GLM gradient wrapper (f2_f3_opencl) is a purely internal C++ component, called by a C++ dispatcher that also has a separate CPU implementation (f2_f3_non_opencl). Both patterns are explored in detail below.


The two integration patterns

Pattern 1: wrapper with a direct R interface

In this pattern the kernel wrapper (or a thin R function that calls it) is exported and callable directly from R. The CPU fallback is the equivalent computation using standard R or C functions — in nmathopencl’s case, the stats:: distribution functions.

R caller
  │
  ▼
R wrapper function  (exported, input validation, recycling)
  │  if inputs are non-finite, sd == 0, etc. → fallback_full()
  │
  ▼
.opencl_try_or_fallback()
  │  if !nmathopencl_has_opencl()           → fallback_expr()  (CPU path)
  │  if OpenCL call succeeds    → return GPU result
  │  if OpenCL call fails
  │    and fallback = TRUE      → fallback_expr()  (CPU path)
  │    and fallback = FALSE     → propagate error
  ▼
C++ kernel wrapper  (internal, not exported)
  │  #ifdef USE_OPENCL + nmathopencl_has_opencl() guard
  │  type conversion + program assembly + runner dispatch
  ▼
GPU result

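The fallback can be triggered at two separate levels:

  1. R level. The exported R wrapper calls fallback_full() directly, before any OpenCL call is attempted, when the inputs contain non-finite values or sd == 0.

  2. Runtime level. .opencl_try_or_fallback() calls fallback_expr() when nmathopencl_has_opencl() is FALSE, or when the OpenCL call fails and fallback = TRUE.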

Pattern 2: wrapper as an internal C++ component

In this pattern the kernel wrapper has no direct R interface. It is called from within a C++ dispatcher function alongside a CPU counterpart. The R interface belongs to a higher-level function that selects between the two based on a use_opencl flag passed in by the caller.

R caller
  │
  ▼
Exported R function  (e.g. Ex_EnvelopeEval)
  │  validates inputs, passes use_opencl flag
  ▼
.EnvelopeEval_cpp()   (internal R → C++ bridge, [[Rcpp::export]])
  ▼
EnvelopeEval_cpp()    (C++ dispatcher)
  │  if use_opencl && nmathopencl_has_opencl()
  │    → f2_f3_opencl()     (OpenCL kernel wrapper)
  │  else
  │    → f2_f3_non_opencl() (pure C++ CPU implementation)
  ▼
Result (qf, grad) returned regardless of path taken

The two implementations — f2_f3_opencl and f2_f3_non_opencl — share the same function signature and return the same data structure. The caller cannot tell from the return value which path was taken.
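The shared-signature idea can be sketched in plain C++ without Rcpp. Everything in this sketch is hypothetical and for illustration only: GradResult, has_opencl, eval_cpu, eval_gpu and dispatch are not nmathopencl functions, and the "GPU" path is a placeholder that reuses the CPU code.

```cpp
#include <vector>

// Hypothetical result type standing in for the (qf, grad) list.
struct GradResult {
    std::vector<double> qf;    // one function value per input
    std::vector<double> grad;  // one derivative per input
};

// Stand-in for a runtime availability check (assumption: no device here).
bool has_opencl() { return false; }

// CPU implementation of a toy computation: f(b) = b^2, f'(b) = 2b.
GradResult eval_cpu(const std::vector<double>& b) {
    GradResult r;
    for (double v : b) {
        r.qf.push_back(v * v);
        r.grad.push_back(2.0 * v);
    }
    return r;
}

// Placeholder for the OpenCL wrapper. It has the same signature as
// eval_cpu; a real package would call its kernel runner here.
GradResult eval_gpu(const std::vector<double>& b) {
    return eval_cpu(b);
}

// Dispatcher: both the caller's flag and the runtime check must agree
// before the GPU path is taken. The return type is identical either way.
GradResult dispatch(const std::vector<double>& b, bool use_opencl) {
    using Impl = GradResult (*)(const std::vector<double>&);
    Impl impl = (use_opencl && has_opencl()) ? eval_gpu : eval_cpu;
    return impl(b);
}
```

Because both implementations are reached through the same function-pointer type, the compiler rejects any change that makes their signatures drift apart.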


Pattern 1 in detail: dnorm_opencl

The R wrapper

dnorm_opencl in R/normal_opencl.R is the user-facing function. It mirrors the interface of stats::dnorm and adds opencl_parallel, fallback, and verbose arguments.

# R/normal_opencl.R  (simplified)

#' @export
dnorm_opencl <- function(x, mean = 0, sd = 1, log = FALSE,
                         opencl_parallel = NA, fallback = FALSE,
                         verbose = FALSE) {

  # ── Input validation ──────────────────────────────────────────────────────
  # These checks mirror stats::dnorm behavior.
  if (!is.numeric(x))    stop("`x` must be numeric.")
  if (!is.numeric(mean)) stop("`mean` must be numeric.")
  if (!is.numeric(sd))   stop("`sd` must be numeric.")
  if (length(x) == 0L)   return(numeric(0))

  # ── Recycling (like stats::dnorm) ─────────────────────────────────────────
  len  <- max(length(x), length(mean), length(sd))
  xv   <- rep_len(as.double(x),    len)
  mv   <- rep_len(as.double(mean), len)
  sv   <- rep_len(as.double(sd),   len)
  logv <- rep_len(log,             len)

  # ── R-level fallback function ─────────────────────────────────────────────
  # Called when inputs contain conditions the GPU path cannot handle,
  # when OpenCL is unavailable, or when the OpenCL call fails and
  # fallback = TRUE.
  fallback_full <- function() {
    stats::dnorm(x, mean = mean, sd = sd, log = log)
  }

  # ── R-level conditions that force the CPU path ────────────────────────────
  if (any(!is.finite(xv) | !is.finite(mv) | !is.finite(sv))) {
    return(fallback_full())   # stats::dnorm handles NaN, Inf, NA
  }
  if (any(sv < 0)) {
    stop("`sd` must be non-negative.", call. = FALSE)
  }
  if (any(sv == 0)) {
    return(fallback_full())   # degenerate case; stats::dnorm handles it
  }

  # ── Dispatch: try GPU, fall back to CPU on failure if fallback = TRUE ─────
  log_int <- as.integer(logv)
  opc     <- .encode_opencl_parallel(opencl_parallel)

  .opencl_try_or_fallback(
    opencl_expr  = function() .dnorm_opencl(xv, mv, sv, log_int, opc, verbose),
    fallback_expr = fallback_full,
    fallback      = fallback,
    verbose       = verbose,
    fn_name       = "dnorm_opencl"
  )
}

.dnorm_opencl (dot-prefixed) is the internal Rcpp-exported symbol for the C++ kernel wrapper. It is not part of the public API; it exists only to make the C++ function callable from R.
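The input screening that the R wrapper performs before dispatch could equally live in C++, for instance in a package that follows Pattern 2. The following is a rough sketch under that assumption; Route, recycled and screen_inputs are hypothetical names, and the sketch assumes all three vectors are non-empty (the R wrapper handles the empty case earlier). It keeps the same order of checks as the R code: non-finite values first, then negative sd, then sd == 0.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <stdexcept>
#include <vector>

enum class Route { Gpu, CpuFallback };

// Recycled element access, mirroring rep_len(): the index wraps around.
inline double recycled(const std::vector<double>& v, std::size_t i) {
    return v[i % v.size()];
}

// Decide which path the inputs may take (hypothetical helper).
Route screen_inputs(const std::vector<double>& x,
                    const std::vector<double>& mean,
                    const std::vector<double>& sd) {
    if (x.empty() || mean.empty() || sd.empty())
        throw std::invalid_argument("inputs must be non-empty");

    const std::size_t len = std::max({x.size(), mean.size(), sd.size()});
    bool any_nonfinite = false, any_negative = false, any_zero = false;

    for (std::size_t i = 0; i < len; ++i) {
        const double xi = recycled(x, i);
        const double mi = recycled(mean, i);
        const double si = recycled(sd, i);
        if (!std::isfinite(xi) || !std::isfinite(mi) || !std::isfinite(si))
            any_nonfinite = true;
        else if (si < 0.0)
            any_negative = true;
        else if (si == 0.0)
            any_zero = true;
    }

    // Same precedence as the R wrapper.
    if (any_nonfinite) return Route::CpuFallback;  // NaN/Inf handled on CPU
    if (any_negative) throw std::domain_error("`sd` must be non-negative.");
    if (any_zero) return Route::CpuFallback;       // degenerate case
    return Route::Gpu;
}
```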

The .opencl_try_or_fallback helper

This helper encapsulates the runtime dispatch logic that every Pattern 1 wrapper shares:

# R/opencl_linkage_utils.R

.opencl_try_or_fallback <- function(opencl_expr, fallback_expr,
                                    fallback, verbose, fn_name) {
  if (!nmathopencl_has_opencl()) {
    # OpenCL not available in this build or session — go straight to CPU.
    if (verbose)
      message(sprintf("[%s] OpenCL unavailable; using CPU fallback.", fn_name))
    return(fallback_expr())
  }

  # OpenCL available: try the GPU path.
  out <- tryCatch(opencl_expr(), error = function(e) e)

  if (inherits(out, "error")) {
    if (fallback) {
      # GPU call failed and the caller requested a fallback.
      if (verbose) {
        message(sprintf("[%s] OpenCL call failed; using CPU fallback.", fn_name))
        message(out$message)
      }
      return(fallback_expr())
    }
    stop(out$message, call. = FALSE)  # no fallback requested — propagate error
  }

  out  # GPU call succeeded
}

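The design makes the fallback behavior explicit and controllable:

  1. If OpenCL is unavailable, the CPU path is used regardless of the fallback argument.

  2. If the OpenCL call fails and fallback = TRUE, the CPU path is used; with verbose = TRUE the original error message is also reported.

  3. If the OpenCL call fails and fallback = FALSE, the error is propagated to the caller.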
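For a package that wants the same rules inside C++ (for example in a Pattern 2 dispatcher), the try-or-fallback logic can be sketched with std::function and exceptions. This is an illustrative translation, not code from nmathopencl: try_or_fallback is a hypothetical name, and the opencl_available parameter stands in for the call to nmathopencl_has_opencl().

```cpp
#include <exception>
#include <functional>
#include <iostream>
#include <stdexcept>
#include <string>
#include <vector>

using Vec = std::vector<double>;

// Hypothetical C++ analogue of the R helper shown above.
Vec try_or_fallback(const std::function<Vec()>& opencl_expr,
                    const std::function<Vec()>& fallback_expr,
                    bool opencl_available,
                    bool fallback,
                    bool verbose,
                    const std::string& fn_name) {
    if (!opencl_available) {
        // No OpenCL in this build or session: go straight to the CPU path.
        if (verbose)
            std::cerr << "[" << fn_name
                      << "] OpenCL unavailable; using CPU fallback.\n";
        return fallback_expr();
    }

    try {
        return opencl_expr();  // GPU call succeeded
    } catch (const std::exception& e) {
        if (!fallback) throw;  // no fallback requested: propagate the error
        if (verbose)
            std::cerr << "[" << fn_name
                      << "] OpenCL call failed; using CPU fallback.\n"
                      << e.what() << "\n";
        return fallback_expr();
    }
}
```

A caller would typically pass two lambdas that capture the already-validated inputs, so that neither path is evaluated until the helper has decided which one to run.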

The C++ kernel wrapper

The C++ kernel wrapper dnorm_opencl is exported to R under the name .dnorm_opencl via // [[Rcpp::export(name = ".dnorm_opencl")]]. It is the minimal C++ entry point: it converts inputs, runs the GPU path if OpenCL is available, and otherwise returns a zero-filled vector.

// src/kernel_wrappers.cpp  (within nmathopencl namespace)

// [[Rcpp::export(name = ".dnorm_opencl")]]
Rcpp::NumericVector dnorm_opencl(
    const Rcpp::NumericVector& x,
    const Rcpp::NumericVector& mean,
    const Rcpp::NumericVector& sd,
    const Rcpp::IntegerVector& give_log,
    int                        opencl_parallel_code,
    bool                       verbose
) {
  const int len = x.size();
  Rcpp::NumericVector out(len);   // zero-initialized

#ifdef USE_OPENCL
  if (!nmathopencl_has_opencl() || len == 0) return out;

  try {
    d_givelog_ndrange_kernel_fill(
        "src/dnorm_kernel.cl", "dnorm_kernel",
        len, {&x, &mean, &sd}, give_log, out, verbose);
  } catch (const std::exception& e) {
    if (verbose) Rcpp::Rcout << e.what() << "\n";
    throw;
  }
#endif

  return out;
}

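Note that the C++ wrapper itself returns zeros when !nmathopencl_has_opencl(). It does not call stats::dnorm. The R wrapper is responsible for the fallback to stats::dnorm; the C++ wrapper simply reports “no GPU result” via zeros. This keeps the C++ layer free of any R evaluation machinery. Because zero is also a legitimate density value, the placeholder cannot be told apart from a real result, which is why .opencl_try_or_fallback checks nmathopencl_has_opencl() before the wrapper is ever called.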
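The two guards in the wrapper, one at compile time and one at run time, can be isolated in a small Rcpp-free sketch. The names wrapper_sketch, runtime_has_opencl, run_kernel and gpu_result_expected are hypothetical, and the sketch assumes the build system defines USE_OPENCL only when OpenCL was found at configure time.

```cpp
#include <vector>

#ifdef USE_OPENCL
// Hypothetical declarations; a real package would define these elsewhere.
bool runtime_has_opencl();
void run_kernel(const std::vector<double>& x, std::vector<double>& out);
#endif

// Returns a zero-filled vector unless the GPU path actually ran.
std::vector<double> wrapper_sketch(const std::vector<double>& x) {
    std::vector<double> out(x.size(), 0.0);  // placeholder, not a result

#ifdef USE_OPENCL
    // Compile-time guard passed; now check the runtime guard.
    if (!runtime_has_opencl() || x.empty()) return out;
    run_kernel(x, out);
#endif

    return out;
}

// Callers should consult this before trusting the wrapper's output,
// because zeros alone do not say whether the kernel ran.
bool gpu_result_expected() {
#ifdef USE_OPENCL
    return runtime_has_opencl();
#else
    return false;
#endif
}
```

In a build without USE_OPENCL the wrapper compiles down to "allocate and return zeros", so it never references an OpenCL symbol and links on any platform.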


Pattern 2 in detail: f2_f3_opencl

The exported R function

Ex_EnvelopeEval (in R/ex_glmbayes.R) is the user-facing function. It accepts a use_opencl flag and delegates entirely to the C++ dispatcher:

# R/ex_glmbayes.R

#' @export
Ex_EnvelopeEval <- function(G4, y, x, mu, P, alpha, wt,
                            family, link,
                            use_opencl = FALSE,
                            verbose    = FALSE) {
  # Input validation (matrix/vector type checks) ...

  .EnvelopeEval_cpp(G4, y, x, mu, P, alpha, wt,
                    family, link, use_opencl, verbose)
}

There is no R-level fallback function here. The fallback is handled entirely inside the C++ dispatcher.

The C++ dispatcher

EnvelopeEval_cpp (inside src/) receives use_opencl and decides which C++ implementation to call:

// src/ (conceptual structure — details in actual source)

Rcpp::List EnvelopeEval_cpp(
    Rcpp::NumericMatrix G4, Rcpp::NumericVector y,
    Rcpp::NumericMatrix x,  Rcpp::NumericMatrix mu,
    Rcpp::NumericMatrix P,  Rcpp::NumericVector alpha,
    Rcpp::NumericVector wt,
    std::string family, std::string link,
    bool use_opencl, bool verbose
) {
  // Prepare shared inputs (common to both paths), including `b` used below ...

  if (use_opencl && nmathopencl_has_opencl()) {
    // GPU path: call the OpenCL kernel wrapper
    return ex_glmbayes::opencl::f2_f3_opencl(
        family, link, b, y, x, mu, P, alpha, wt, verbose);
  } else {
    // CPU path: call the pure C++ implementation
    return ex_glmbayes::f2_f3_non_opencl(
        family, link, b, y, x, mu, P, alpha, wt);
  }
}

Both f2_f3_opencl and f2_f3_non_opencl return an Rcpp::List with identical structure: list(qf = numeric(m1), grad = matrix(m1, l2)). The dispatcher’s caller cannot tell from the return value which path was used.

Why a dedicated CPU implementation?

For Pattern 1 (distribution functions), the CPU fallback is an existing well-tested function from stats::. No separate CPU implementation is needed.

For the GLM gradient computation, no equivalent off-the-shelf CPU function exists. f2_f3_non_opencl is a pure C++ implementation of the same mathematical computation, written without any OpenCL dependency. It compiles on every platform and produces results that agree with the GPU path to within double-precision rounding; exact bit-for-bit equality should not be expected.
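Agreement "within double-precision rounding" is best checked with a tolerance rather than with ==, since the two paths may order their floating-point operations differently. A minimal sketch of such a check follows; all_close is a hypothetical helper, and the default relative tolerance of 1e-12 is an assumption that should be tuned to the computation at hand.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// True if |a[i] - b[i]| <= atol + rtol * max(|a[i]|, |b[i]|) for all i.
// Vectors of different length, or any NaN, make the check fail.
bool all_close(const std::vector<double>& a,
               const std::vector<double>& b,
               double rtol = 1e-12,
               double atol = 0.0) {
    if (a.size() != b.size()) return false;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double diff  = std::fabs(a[i] - b[i]);
        const double scale = std::max(std::fabs(a[i]), std::fabs(b[i]));
        // Written as !(x <= y) so that NaN comparisons count as failures.
        if (!(diff <= atol + rtol * scale)) return false;
    }
    return true;
}
```

In a test suite, the qf and grad outputs of the two paths would each be flattened to a vector and passed through a check of this kind.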

Having both implementations under explicit control also makes it possible to benchmark them directly: use_opencl = FALSE forces the CPU path even on a GPU-equipped machine.
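A direct comparison of the two paths needs little more than a wall-clock timer. The sketch below uses std::chrono::steady_clock; time_seconds is a hypothetical helper and the default of ten repetitions is arbitrary.

```cpp
#include <chrono>
#include <functional>

// Runs `fn` `reps` times and returns the total elapsed seconds.
double time_seconds(const std::function<void()>& fn, int reps = 10) {
    using clock = std::chrono::steady_clock;
    const auto t0 = clock::now();
    for (int i = 0; i < reps; ++i) fn();
    const std::chrono::duration<double> elapsed = clock::now() - t0;
    return elapsed.count();
}
```

One would time a closure that calls the dispatcher with use_opencl = true and another with use_opencl = false, on identical inputs. The first GPU call may include one-time setup costs, so a warm-up call before timing is usually worthwhile.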


Choosing between the two patterns

The choice between Pattern 1 and Pattern 2 comes down to two questions: whether an existing CPU computation is available to fall back to, and whether the computation has a natural direct use from R.

| Criterion | Pattern 1 (R interface + R fallback) | Pattern 2 (C++ dispatch + CPU implementation) |
|---|---|---|
| Existing CPU function available? | Yes (stats::, base::, etc.) | No; need to write the CPU implementation |
| Does the computation have a direct R use? | Yes (called directly from R) | Often not (called from a C++ simulation loop) |
| Where does fallback live? | R level (fallback_full()) + runtime (nmathopencl_has_opencl()) | C++ level (use_opencl && nmathopencl_has_opencl()) |
| Caller can request optional fallback? | Yes (fallback = TRUE/FALSE argument) | Caller controls via use_opencl flag |
| Wrapper directly R-callable? | Yes (exported via [[Rcpp::export]]) | Not necessarily; may be purely internal C++ |

Both patterns are designed so that the package compiles and runs correctly on machines with or without OpenCL. The GPU path is always optional; the CPU path always produces a valid (if unaccelerated) result.


Naming conventions

nmathopencl uses a consistent naming scheme to make the role of each function clear:

| Name | Type | Role |
|---|---|---|
| dnorm_opencl | Exported R function | User-facing API; validates inputs; manages fallback |
| .dnorm_opencl | Internal R → C++ bridge | Rcpp export; positional R → C++ call only |
| nmathopencl::dnorm_opencl | C++ kernel wrapper | #ifdef guard; type conversion; runner dispatch |
| nmathopencl::dnorm_kernel_runner | C++ kernel runner | Full OpenCL lifecycle; #ifdef USE_OPENCL only |
| Ex_EnvelopeEval | Exported R function | User-facing API; passes use_opencl flag |
| .EnvelopeEval_cpp | Internal R → C++ bridge | Positional R → C++ call only |
| f2_f3_opencl | C++ kernel wrapper | OpenCL path; used inside dispatcher |
| f2_f3_non_opencl | C++ CPU implementation | CPU path; used inside same dispatcher |

The leading dot on internal R functions signals that they are not part of the public API. Because they are not exported, they have no help pages of their own and are not offered by pkg:: autocompletion.

For your own package, a consistent analogous scheme might be:

myfunc_opencl()        # exported R function  (if direct R use)
.myfunc_opencl()       # internal R → C++ bridge
mypkg::myfunc_opencl() # C++ kernel wrapper (in namespace)
mypkg::myfunc_runner() # C++ kernel runner   (in namespace, #ifdef only)
mypkg::myfunc_cpu()    # C++ CPU fallback    (if Pattern 2)

Summary

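Every kernel wrapper needs a CPU path. The two patterns differ in where that path lives and who controls the dispatch:

  1. Pattern 1: the wrapper is reachable from an exported R function. The CPU path is an existing R function (stats::dnorm for dnorm_opencl), and dispatch happens at the R level through fallback_full() and .opencl_try_or_fallback(), controlled by the fallback argument.

  2. Pattern 2: the wrapper is an internal C++ component. The CPU path is a dedicated C++ implementation (f2_f3_non_opencl), and dispatch happens inside a C++ dispatcher, controlled by the use_opencl flag.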

In both patterns the OpenCL infrastructure — the runner and the kernel — is identical. What differs is only how the wrapper is wired into the rest of the package.

Chapter 12 describes the nmathopencl R API in full, showing how the distribution-function wrappers are documented and organized. Chapter 10 works through the ex_glmbayes pattern end-to-end.
