Chapter 09: Generic OpenCL Kernel Runners (openclPort layer)

Kjell Nygren

2026-06-11

The openclPort namespace

The openclPort namespace (declared in openclPort.h and implemented in opencl_kernel_runners.cpp) provides the generic OpenCL infrastructure that underlies all of nmathopencl’s GPU execution. It is designed to be completely independent of any particular mathematical library: it knows only about OpenCL C types and standard C++ types, not about nmath, distributions, or R data structures.

This namespace is currently shipped inside nmathopencl to make the package self-contained. It will migrate to a standalone openclPort CRAN package in a future release (see Chapter 00 for the release roadmap). The design described here will remain stable during that migration; downstream packages using these utilities via #include "openclPort.h" will require only a change in their LinkingTo field.

opencl_dbl_scalar_kernel_runner

This is the single generic runner that powers every kernel in nmathopencl.

Declaration

// openclPort.h (within namespace openclPort, guarded by #ifdef USE_OPENCL)
void opencl_dbl_scalar_kernel_runner(
    const std::string&         kernel_source,
    const char*                kernel_name,
    const std::vector<double>& dargs,
    int                        n_out,
    std::vector<double>&       out_flat
);

Arguments

Parameter Type Description
kernel_source const std::string& Complete, concatenated OpenCL program source (shims + library + kernel). Must be syntactically valid OpenCL C.
kernel_name const char* Name of the __kernel void function to execute, e.g., "dnorm_kernel".
dargs const std::vector<double>& Scalar double arguments to pass to the kernel, in the order the kernel declares them.
n_out int Number of double values to read back from the output buffer. Also becomes the global work size.
out_flat std::vector<double>& Output buffer; resized to n_out on entry and filled with the kernel’s results.

What it does

The runner performs the complete OpenCL lifecycle for a single kernel invocation:

  1. Platform and device selection — queries clGetPlatformIDs and clGetDeviceIDs to select the first available GPU device.
  2. Context and command queue — creates a cl_context and a cl_command_queue for the selected device.
  3. Program build — calls clCreateProgramWithSource followed by clBuildProgram. On failure, it retrieves the build log via CL_PROGRAM_BUILD_LOG and throws a std::runtime_error containing the compiler output.
  4. Kernel creation — calls clCreateKernel with the supplied kernel_name.
  5. Output buffer — allocates a CL_MEM_WRITE_ONLY device buffer of size n_out * sizeof(double).
  6. Argument setting — sets each element of dargs as a scalar kernel argument (clSetKernelArg) in index order, then sets the output buffer as the next argument, and finally n_out as the last integer argument.
  7. Execution — enqueues the kernel with a 1-D global work size of n_out via clEnqueueNDRangeKernel.
  8. Result retrieval — blocks on clFinish then reads the output buffer back to host memory via clEnqueueReadBuffer.
  9. Resource cleanup — releases all OpenCL objects in reverse order of creation.

Error checking is performed after every OpenCL API call. Any non-CL_SUCCESS status causes the runner to clean up all already-created resources and throw a std::runtime_error with a message that includes the status code, its symbolic name, and the platform/device information.

Argument layout contract

The runner assumes the kernel is declared with the following argument signature:

__kernel void my_kernel(
    double arg0, double arg1, ..., double argN,
    __global double* out,
    int n_out
)

The dargs vector maps element-by-element to arg0 through argN. The runner always appends the output buffer and n_out as the last two arguments automatically. Kernels in inst/cl/src/ that accept vector inputs (e.g., dnorm_kernel) do not fit this layout and use a separate, more specialized runner in kernel_runners.cpp; the double-scalar runner handles the scalar-parameter kernels that produce replicated output.

Example: calling the runner from a kernel wrapper

// Inside a kernel_wrappers.cpp function (nmathopencl namespace):
namespace nmathopencl {

Rcpp::NumericVector dnorm_opencl_impl(
    int n, double mu, double sigma, bool give_log, bool verbose)
{
    std::string src = openclPort::load_kernel_library("nmath") + "\n"
                    + openclPort::load_kernel_source("src/dnorm_scalar_kernel.cl");

    std::vector<double> out(n);
    openclPort::opencl_dbl_scalar_kernel_runner(
        src,
        "dnorm_scalar_kernel",
        { mu, sigma, static_cast<double>(give_log) },  // dargs
        n,                                              // n_out
        out
    );
    return Rcpp::NumericVector(out.begin(), out.end());
}

} // namespace nmathopencl

Error-handling utilities

Five inline functions in openclPort.h provide consistent error reporting across all runners. Because they are inline, they are available to any translation unit that includes openclPort.h; downstream packages that LinkingTo: nmathopencl get them for free.

opencl_status_name

inline const char* opencl_status_name(cl_int status);

Returns the symbolic name of an OpenCL error code as a C string:

CL_SUCCESS, CL_DEVICE_NOT_FOUND, CL_BUILD_PROGRAM_FAILURE, ...

Returns "UNKNOWN_OR_VENDOR_SPECIFIC" for codes not in the table. Covers all error codes defined by the OpenCL 3.0 specification that are commonly encountered in practice.

opencl_status_hint

inline const char* opencl_status_hint(cl_int status);

Returns a plain-English diagnostic hint for the most actionable error codes:

Status Hint
CL_OUT_OF_RESOURCES Device/runtime resource limit exceeded (watchdog, register pressure, …)
CL_OUT_OF_HOST_MEMORY Host memory allocation failed
CL_MEM_OBJECT_ALLOCATION_FAILURE Device memory allocation failed
CL_BUILD_PROGRAM_FAILURE Kernel compilation failed; inspect build log
CL_INVALID_CONTEXT OpenCL context is invalid
CL_DEVICE_NOT_AVAILABLE Device present but temporarily unavailable

Returns "No additional hint available." for all other codes.

opencl_read_platform_info_str / opencl_read_device_info_str

inline std::string opencl_read_platform_info_str(
    cl_platform_id platform, cl_platform_info param);

inline std::string opencl_read_device_info_str(
    cl_device_id device, cl_device_info param);

Query string-valued platform or device info (e.g., CL_PLATFORM_NAME, CL_DEVICE_NAME, CL_DRIVER_VERSION) safely: they handle null handles, query failures, and trailing null terminators, returning "unknown" on any error.

Typical usage:

std::string name = openclPort::opencl_read_device_info_str(
    device, CL_DEVICE_NAME);
std::string vendor = openclPort::opencl_read_platform_info_str(
    platform, CL_PLATFORM_VENDOR);

opencl_make_context_error

inline std::runtime_error opencl_make_context_error(
    cl_int status, cl_platform_id platform, cl_device_id device);

Constructs a std::runtime_error with a message of the form:

OpenCL error at clCreateContext (status=-6, name=CL_OUT_OF_HOST_MEMORY).
platform_name=NVIDIA CUDA, platform_vendor=NVIDIA Corporation,
device_name=NVIDIA GeForce RTX 4090, driver_version=560.94.
This may indicate a transient driver/runtime context failure.

Used inside opencl_dbl_scalar_kernel_runner and f2_f3_kernel_runner immediately after a clCreateContext failure.

Other utilities in openclPort

Rcpp -> std::vector conversion

std::vector<double> openclPort::flattenMatrix(const Rcpp::NumericMatrix& mat);
std::vector<double> openclPort::copyVector(const Rcpp::NumericVector& vec);

flattenMatrix converts a column-major R matrix to a row-major (or simply contiguous) std::vector<double> suitable for passing as a device buffer. copyVector does the same for a numeric vector. Both are implemented in OpenCL_helper.cpp.

Device probing (C++ / openclPort)

At the R prompt, host GPU inventory uses opencltools::gpu_names() (see Chapters 00–01). Inside the package DLL, openclPort exposes related C++ helpers for kernel runners:

Rcpp::CharacterVector openclPort::gpu_names();
int                   openclPort::detect_num_gpus_internal();
bool                  openclPort::has_opencl();
int                   openclPort::get_opencl_core_count();

OpenCL build configuration

struct openclPort::OpenCLConfig {
    bool have_expm1;
    bool have_log1p;
    std::string buildOptions;
};

openclPort::OpenCLConfig openclPort::configureOpenCL(
    cl_context context, cl_device_id device);

configureOpenCL probes the device to determine which standard math functions are available as OpenCL built-ins (specifically expm1 and log1p). It returns a set of buildOptions (e.g., -DHAVE_EXPM1 -DHAVE_LOG1P) that are passed to clBuildProgram so that the nmath .cl files can conditionally use the built-in or the ported version.

Using openclPort from a downstream package

A package that LinkingTo: nmathopencl can include openclPort.h directly:

// In a downstream package's source file:
#include "openclPort.h"

void my_kernel_runner(const std::string& program_source, int n) {
    std::vector<double> out;
    openclPort::opencl_dbl_scalar_kernel_runner(
        program_source,
        "my_kernel",
        { 1.0, 2.5 },   // scalar args
        n,
        out
    );
}

No additional CMake or autoconf work is needed beyond the LinkingTo field, because openclPort.h is installed to inst/include/ and the exported symbols are visible in the shared library.