---
title: "Chapter 11: Testing, Debugging, and Benchmarking GPU Kernels"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Chapter 11: Testing, Debugging, and Benchmarking GPU Kernels}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

This chapter covers strategies for verifying correctness, diagnosing failures,
and measuring performance of OpenCL kernel code developed on top of
`nmathopencl`.

## Correctness testing

Because every kernel wrapper contains a CPU fallback path, the most reliable
testing strategy is to compare the OpenCL output against the CPU reference
output on the same inputs.  Standard R unit-test frameworks (`testthat`,
`tinytest`) work directly: write tests that call the wrapper function and
assert numerical agreement within an appropriate tolerance (typically
`.Machine$double.eps^0.5`, roughly `1.5e-8`, for `double`-precision kernels).
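
As a minimal sketch of such a comparison with `testthat`: `wrapper_dnorm()`
below is a hypothetical stand-in for an OpenCL-backed wrapper (it is not part
of `nmathopencl`), written in plain R so that the example runs without a
device.  In a real suite the wrapper under test would take its place.

```r
library(testthat)

# Hypothetical stand-in for an OpenCL-backed wrapper. It evaluates the
# standard normal density directly so the sketch runs without a GPU.
wrapper_dnorm <- function(x) exp(-0.5 * x^2) / sqrt(2 * pi)

# Tolerance assumption: a double-precision, elementwise kernel.
tol <- sqrt(.Machine$double.eps)

test_that("wrapper agrees with the CPU reference on the same inputs", {
  x <- seq(-5, 5, length.out = 1001)
  expect_equal(wrapper_dnorm(x), stats::dnorm(x), tolerance = tol)
})
```

Comparing against `stats::dnorm()` on a fixed grid keeps the test
deterministic; a `float` kernel would need a correspondingly looser
tolerance.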

Key points:

* Always run the full test suite with **OpenCL disabled** (no driver installed,
  or `nmathopencl_has_opencl()` returning `FALSE`) as well as with it enabled,
  so that the fallback path is exercised too.
* Use `opencltools::verify_opencl_runtime()` as a pre-condition guard in
  any test that requires an active OpenCL device.
* Numerical differences between GPU and CPU results arise from the order of
  floating-point reductions (addition is not associative) and from `float`
  versus `double` precision.  Document your tolerance assumptions alongside
  the tests that rely on them.
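
The effect of reduction order can be seen in plain R, without any GPU.  The
sketch below is only an illustration: the chunked sum loosely imitates a
work-group style reduction and is not how `nmathopencl` reduces data.

```r
# Floating-point addition is not associative: regrouping the same three
# terms changes the rounded result on IEEE 754 doubles.
a <- (0.1 + 0.2) + 0.3
b <- 0.1 + (0.2 + 0.3)
identical(a, b)          # exact comparison is too strict
isTRUE(all.equal(a, b))  # tolerance-based comparison is appropriate

# Sum the same data serially and in 100 chunks of 1000 elements each.
set.seed(1)
x <- runif(1e5)
serial  <- sum(x)
chunks  <- split(x, rep(seq_len(100), each = 1000))
chunked <- sum(vapply(chunks, sum, numeric(1)))

# Any discrepancy should sit far below the usual double-precision tolerance.
rel_diff <- abs(serial - chunked) / abs(serial)
rel_diff < sqrt(.Machine$double.eps)
```

This is why the tests assert agreement within a tolerance rather than exact
equality.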

## Debugging kernel failures

When a kernel fails to compile or execute, the OpenCL runtime reports an error
code.  `nmathopencl` propagates these as R errors via `stop()`.  Common
causes:

* **Build failure** --- syntax error in the `.cl` source.  Inspect the build
  log returned by `clGetProgramBuildInfo`; `nmathopencl` includes it in the
  error message.
* **Device not found** --- no ICD-registered device matches the requested
  type.  Call `opencltools::gpu_names()` to list available devices.
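* **Buffer size mismatch** --- the NDRange size does not match the buffer
  allocation.  Check that the global work size equals the number of output
  elements or, if it is rounded up to a multiple of the work-group size, that
  the kernel guards against out-of-range indices.
* **Precision loss** --- intermediate results computed in `float` instead of
  `double`.  Verify that the `cl_khr_fp64` extension is enabled
  (`#pragma OPENCL EXTENSION cl_khr_fp64 : enable`) and that floating-point
  literals carry no `f` suffix (`1.0`, not `1.0f`).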
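
Because these failures surface as ordinary R errors, they can be caught and
inspected with `tryCatch()`.  The sketch below uses a hypothetical
`run_kernel()` stand-in, and its message text is invented for illustration;
the actual wording of `nmathopencl` errors is not reproduced here.

```r
# Hypothetical stand-in for a wrapper call that fails. The message text is
# invented and only imitates an error raised via stop().
run_kernel <- function(x) {
  stop("kernel build failed (illustrative message; a build log would follow)")
}

outcome <- tryCatch(
  list(ok = TRUE, value = run_kernel(1:10), message = NULL),
  error = function(e) {
    # conditionMessage() extracts the text passed to stop(), which is where
    # a build log included in the error would appear.
    list(ok = FALSE, value = NULL, message = conditionMessage(e))
  }
)

if (!outcome$ok) {
  message("Kernel call failed: ", outcome$message)
}
```

Capturing the message this way makes it easy to log the full text or to
match on it when deciding which of the causes above applies.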

## Benchmarking

Use `bench::mark()` or `microbenchmark::microbenchmark()` to compare the
GPU path against the CPU fallback.  A few guidelines:

* **Warm up** --- the first call to any kernel incurs compilation overhead
  (`clBuildProgram`).  Exclude the first iteration or run a warm-up call
  before timing.
* **Problem size** --- GPU parallelism pays off only for large work sizes
  (typically $N \gtrsim 10^4$).  Benchmark across a range of $N$ values.
* **Transfer cost** --- host-to-device and device-to-host buffer copies
  (`clEnqueueWriteBuffer` / `clEnqueueReadBuffer`) are included in the
  wrapper timing.  For latency-sensitive use cases, consider whether the
  data can remain on the device between calls.
* **Baseline** --- compare against both the `nmathopencl` CPU fallback and
  the upstream `stats::` function to understand relative overheads.
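
A minimal sketch of these guidelines with `bench::mark()` follows.  The
functions `gpu_path()` and `cpu_path()` are hypothetical stand-ins written in
plain R so that the example runs without a device; substitute the OpenCL
wrapper and its CPU fallback when benchmarking a real kernel.  The timings the
stand-ins produce say nothing about GPU performance.

```r
library(bench)

# Hypothetical stand-ins; neither is part of nmathopencl.
gpu_path <- function(x) exp(-0.5 * x^2) / sqrt(2 * pi)
cpu_path <- function(x) stats::dnorm(x)

time_one_size <- function(n) {
  x <- seq(-5, 5, length.out = n)
  invisible(gpu_path(x))  # warm-up call: keeps one-off setup out of the timings
  bm <- bench::mark(
    gpu   = gpu_path(x),
    cpu   = cpu_path(x),
    stats = stats::dnorm(x),  # upstream baseline
    min_time = 0.1,
    check = TRUE  # fail if the results are not all.equal()
  )
  # Rows come back in the order the expressions were supplied.
  data.frame(n = n,
             path = c("gpu", "cpu", "stats"),
             median_seconds = as.numeric(bm$median))
}

# Benchmark across a range of problem sizes.
timings <- do.call(rbind, lapply(10^(2:6), time_one_size))
timings
```

Leaving `check = TRUE` means the benchmark doubles as a consistency check:
it stops if the paths disagree beyond the `all.equal()` tolerance.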
