---
title: "Chapter 09: Generic OpenCL Kernel Runners (openclPort layer)"
author: "Kjell Nygren"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Chapter 09: Generic OpenCL Kernel Runners (openclPort layer)}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
```

## The `openclPort` namespace

The `openclPort` namespace (declared in `openclPort.h` and implemented in
`opencl_kernel_runners.cpp`) provides the generic OpenCL infrastructure
that underlies all of `nmathopencl`'s GPU execution. It is designed to be
completely independent of any particular mathematical library: it knows
only about OpenCL C types and standard C++ types, not about `nmath`,
distributions, or R data structures.

This namespace is currently shipped inside `nmathopencl` to make the
package self-contained. It will migrate to a standalone `openclPort` CRAN
package in a future release (see Chapter 00 for the release roadmap). The
design described here will remain stable during that migration; downstream
packages using these utilities via `#include "openclPort.h"` will require
only a change in their `LinkingTo` field.

## `opencl_dbl_scalar_kernel_runner`

This is the single generic runner that powers every kernel in `nmathopencl`.

### Declaration

```cpp
// openclPort.h (within namespace openclPort, guarded by #ifdef USE_OPENCL)
void opencl_dbl_scalar_kernel_runner(
    const std::string&         kernel_source,
    const char*                kernel_name,
    const std::vector<double>& dargs,
    int                        n_out,
    std::vector<double>&       out_flat
);
```

### Arguments

| Parameter | Type | Description |
|-----------|------|-------------|
| `kernel_source` | `const std::string&` | Complete, concatenated OpenCL program source (shims + library + kernel). Must be syntactically valid OpenCL C. |
| `kernel_name` | `const char*` | Name of the `__kernel void` function to execute, e.g., `"dnorm_kernel"`. |
| `dargs` | `const std::vector<double>&` | Scalar `double` arguments to pass to the kernel, in the order the kernel declares them. |
| `n_out` | `int` | Number of `double` values to read back from the output buffer. Also becomes the global work size. |
| `out_flat` | `std::vector<double>&` | Output buffer; resized to `n_out` on entry and filled with the kernel's results. |

### What it does

The runner performs the complete OpenCL lifecycle for a single kernel
invocation:

1. **Platform and device selection** --- queries `clGetPlatformIDs` and
   `clGetDeviceIDs` to select the first available GPU device.
2. **Context and command queue** --- creates a `cl_context` and a
   `cl_command_queue` for the selected device.
3. **Program build** --- calls `clCreateProgramWithSource` followed by
   `clBuildProgram`. On failure, it retrieves the build log via
   `CL_PROGRAM_BUILD_LOG` and throws a `std::runtime_error` containing the
   compiler output.
4. **Kernel creation** --- calls `clCreateKernel` with the supplied
   `kernel_name`.
5. **Output buffer** --- allocates a `CL_MEM_WRITE_ONLY` device buffer of
   size `n_out * sizeof(double)`.
6. **Argument setting** --- sets each element of `dargs` as a scalar kernel
   argument (`clSetKernelArg`) in index order, then sets the output buffer
   as the next argument, and finally `n_out` as the last integer argument.
7. **Execution** --- enqueues the kernel with a 1-D global work size of
   `n_out` via `clEnqueueNDRangeKernel`.
8. **Result retrieval** --- blocks on `clFinish` then reads the output buffer
   back to host memory via `clEnqueueReadBuffer`.
9. **Resource cleanup** --- releases all OpenCL objects in reverse order of
   creation.

Error checking is performed after every OpenCL API call. Any non-`CL_SUCCESS`
status causes the runner to clean up all already-created resources and
throw a `std::runtime_error` with a message that includes the status code,
its symbolic name, and the platform/device information.

### Argument layout contract

The runner assumes the kernel is declared with the following argument
signature:

```c
__kernel void my_kernel(
    double arg0, double arg1, ..., double argN,
    __global double* out,
    int n_out
)
```

The `dargs` vector maps element-by-element to `arg0` through `argN`. The
runner always appends the output buffer and `n_out` as the last two
arguments automatically. Kernels in `inst/cl/src/` that accept vector
inputs (e.g., `dnorm_kernel`) do not fit this layout and use a separate,
more specialized runner in `kernel_runners.cpp`; the double-scalar runner
handles the scalar-parameter kernels that produce replicated output.

### Example: calling the runner from a kernel wrapper

```cpp
// Inside a kernel_wrappers.cpp function (nmathopencl namespace):
namespace nmathopencl {

Rcpp::NumericVector dnorm_opencl_impl(
    int n, double mu, double sigma, bool give_log, bool verbose)
{
    std::string src = openclPort::load_kernel_library("nmath") + "\n"
                    + openclPort::load_kernel_source("src/dnorm_scalar_kernel.cl");

    std::vector<double> out(n);
    openclPort::opencl_dbl_scalar_kernel_runner(
        src,
        "dnorm_scalar_kernel",
        { mu, sigma, static_cast<double>(give_log) },  // dargs
        n,                                              // n_out
        out
    );
    return Rcpp::NumericVector(out.begin(), out.end());
}

} // namespace nmathopencl
```

## Error-handling utilities

Five `inline` functions in `openclPort.h` provide consistent error
reporting across all runners. Because they are `inline`, they are
available to any translation unit that includes `openclPort.h`; downstream
packages that `LinkingTo: nmathopencl` get them for free.

### `opencl_status_name`

```cpp
inline const char* opencl_status_name(cl_int status);
```

Returns the symbolic name of an OpenCL error code as a C string:

```
CL_SUCCESS, CL_DEVICE_NOT_FOUND, CL_BUILD_PROGRAM_FAILURE, ...
```

Returns `"UNKNOWN_OR_VENDOR_SPECIFIC"` for codes not in the table. Covers
all error codes defined by the OpenCL 3.0 specification that are
commonly encountered in practice.

### `opencl_status_hint`

```cpp
inline const char* opencl_status_hint(cl_int status);
```

Returns a plain-English diagnostic hint for the most actionable error codes:

| Status | Hint |
|--------|------|
| `CL_OUT_OF_RESOURCES` | Device/runtime resource limit exceeded (watchdog, register pressure, ...) |
| `CL_OUT_OF_HOST_MEMORY` | Host memory allocation failed |
| `CL_MEM_OBJECT_ALLOCATION_FAILURE` | Device memory allocation failed |
| `CL_BUILD_PROGRAM_FAILURE` | Kernel compilation failed; inspect build log |
| `CL_INVALID_CONTEXT` | OpenCL context is invalid |
| `CL_DEVICE_NOT_AVAILABLE` | Device present but temporarily unavailable |

Returns `"No additional hint available."` for all other codes.

### `opencl_read_platform_info_str` / `opencl_read_device_info_str`

```cpp
inline std::string opencl_read_platform_info_str(
    cl_platform_id platform, cl_platform_info param);

inline std::string opencl_read_device_info_str(
    cl_device_id device, cl_device_info param);
```

Query string-valued platform or device info (e.g., `CL_PLATFORM_NAME`,
`CL_DEVICE_NAME`, `CL_DRIVER_VERSION`) safely: they handle null handles,
query failures, and trailing null terminators, returning `"unknown"` on
any error.

Typical usage:

```cpp
std::string name = openclPort::opencl_read_device_info_str(
    device, CL_DEVICE_NAME);
std::string vendor = openclPort::opencl_read_platform_info_str(
    platform, CL_PLATFORM_VENDOR);
```

### `opencl_make_context_error`

```cpp
inline std::runtime_error opencl_make_context_error(
    cl_int status, cl_platform_id platform, cl_device_id device);
```

Constructs a `std::runtime_error` with a message of the form:

```
OpenCL error at clCreateContext (status=-6, name=CL_OUT_OF_HOST_MEMORY).
platform_name=NVIDIA CUDA, platform_vendor=NVIDIA Corporation,
device_name=NVIDIA GeForce RTX 4090, driver_version=560.94.
This may indicate a transient driver/runtime context failure.
```

Used inside `opencl_dbl_scalar_kernel_runner` and `f2_f3_kernel_runner`
immediately after a `clCreateContext` failure.

## Other utilities in `openclPort`

### Rcpp -> `std::vector` conversion

```cpp
std::vector<double> openclPort::flattenMatrix(const Rcpp::NumericMatrix& mat);
std::vector<double> openclPort::copyVector(const Rcpp::NumericVector& vec);
```

`flattenMatrix` converts a column-major R matrix to a row-major (or simply
contiguous) `std::vector<double>` suitable for passing as a device buffer.
`copyVector` does the same for a numeric vector. Both are implemented in
`OpenCL_helper.cpp`.

### Device probing (C++ / `openclPort`)

At the R prompt, host GPU inventory uses **`opencltools::gpu_names()`** (see
Chapters 00--01). Inside the package DLL, **`openclPort`** exposes related
C++ helpers for kernel runners:

```cpp
Rcpp::CharacterVector openclPort::gpu_names();
int                   openclPort::detect_num_gpus_internal();
bool                  openclPort::has_opencl();
int                   openclPort::get_opencl_core_count();
```

- `openclPort::gpu_names()` --- OpenCL device names from the runtime linked
  into this package (distinct from `opencltools::gpu_names()` R helper).
- `detect_num_gpus_internal()` --- GPU device count for envelope-style runners.
- `nmathopencl_has_opencl()` --- compile-time `USE_OPENCL` flag in this DLL.
- `openclPort::get_opencl_core_count()` --- internal C++ helper (delegates to
  opencltools); at the R prompt use **`opencltools::get_opencl_core_count()`**.

### OpenCL build configuration

```cpp
struct openclPort::OpenCLConfig {
    bool have_expm1;
    bool have_log1p;
    std::string buildOptions;
};

openclPort::OpenCLConfig openclPort::configureOpenCL(
    cl_context context, cl_device_id device);
```

`configureOpenCL` probes the device to determine which standard math
functions are available as OpenCL built-ins (specifically `expm1` and
`log1p`). It returns a set of `buildOptions` (e.g.,
`-DHAVE_EXPM1 -DHAVE_LOG1P`) that are passed to `clBuildProgram` so that
the nmath `.cl` files can conditionally use the built-in or the ported
version.

## Using `openclPort` from a downstream package

A package that `LinkingTo: nmathopencl` can include `openclPort.h` directly:

```cpp
// In a downstream package's source file:
#include "openclPort.h"

void my_kernel_runner(const std::string& program_source, int n) {
    std::vector<double> out;
    openclPort::opencl_dbl_scalar_kernel_runner(
        program_source,
        "my_kernel",
        { 1.0, 2.5 },   // scalar args
        n,
        out
    );
}
```

No additional CMake or autoconf work is needed beyond the `LinkingTo` field,
because `openclPort.h` is installed to `inst/include/` and the exported
symbols are visible in the shared library.
