---
title: "A RAG pipeline"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{A RAG pipeline}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>", eval = FALSE)
```

```{r setup}
library(crawlee)
```

Beyond crawling, crawlee provides three helpers that turn collected text into a
corpus ready for retrieval-augmented generation (RAG): `cr_chunk()`,
`cr_embed()`, and `cr_export()`. They operate on plain tibbles, so they slot in
right after `cr_collect()`.

## 1. Crawl and collect text

```{r}
pages <- crawler("https://books.toscrape.com/") |>
  cr_options(max_requests = 100) |>
  cr_on_html(function(ctx) {
    ctx$push_data(list(
      url   = ctx$request$url,
      title = ctx$page |> rvest::html_element("title") |> rvest::html_text2(),
      text  = ctx$page |> rvest::html_element("body") |> rvest::html_text2()
    ))
    ctx$enqueue_links(glob = "*/catalogue/*")
  }) |>
  cr_run() |>
  cr_collect()
```

## 2. Chunk

`cr_chunk()` splits text into overlapping windows. On a data frame, name the
text column; every other column is carried along as per-chunk metadata (so each
chunk keeps its `url` and `title`).

```{r}
chunks <- cr_chunk(pages, text = text, size = 1000, overlap = 200, by = "char")
chunks
#> columns: doc_id, chunk_id, chunk, text, n_chars, url, title
```

Use `by = "word"` to size chunks in words instead of characters.
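
To make "overlapping windows" concrete, here is a minimal base R sketch of
character windowing. It is illustrative only and is not crawlee's
implementation: `window_chars()` is a hypothetical helper, and `cr_chunk()` may
treat boundaries differently.

```{r}
# Hypothetical helper, for illustration only (not part of crawlee).
# Consecutive windows start `size - overlap` characters apart, so each window
# repeats the last `overlap` characters of the previous one.
window_chars <- function(x, size = 1000, overlap = 200) {
  stopifnot(size > overlap, overlap >= 0)
  n <- nchar(x)
  if (n == 0) return(character())
  step <- size - overlap
  # The last start is chosen so that the final window reaches the end of x.
  starts <- seq(1, max(1, n - overlap), by = step)
  substring(x, starts, pmin(starts + size - 1, n))
}

window_chars("abcdefghij", size = 4, overlap = 2)
#> [1] "abcd" "cdef" "efgh" "ghij"
```

With `size = 1000` and `overlap = 200`, this scheme starts a new window every
800 characters.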

## 3. Embed

`cr_embed()` is **provider-agnostic**: crawlee never calls an embedding service
itself. You pass `embed_fn`, a function that maps a character vector to a
numeric matrix (one row per input) or a list of numeric vectors. It is applied
in batches and adds an `embedding` list-column.

```{r}
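# A real embedder typically calls an HTTP API (any provider) with httr2.
# The endpoint and response shape here are placeholders; adapt them to your provider.
embed_fn <- function(texts) {
  resp <- httr2::request("https://api.example.com/v1/embeddings") |>
    httr2::req_auth_bearer_token(Sys.getenv("EMBEDDINGS_API_KEY")) |>
    # as.list() keeps a single text serialised as a JSON array, not a scalar
    httr2::req_body_json(list(input = as.list(texts))) |>
    httr2::req_perform()
  # Stack one embedding per input into a length(texts) x d numeric matrix
  vectors <- lapply(httr2::resp_body_json(resp)$data, \(x) unlist(x$embedding))
  do.call(rbind, vectors)
}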

embedded <- cr_embed(chunks, embed_fn, batch_size = 32)
```

For a quick local experiment you can pass any function, even a trivial one:

```{r}
fake_embed <- function(x) matrix(nchar(x), nrow = length(x), ncol = 1)
embedded <- cr_embed(chunks, fake_embed)
```
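
The `embed_fn` contract (a character vector in, a matrix or a list of vectors
out, called once per batch) can be sketched in base R as follows. This is an
assumption about the general shape of batched embedding, not a description of
`cr_embed()` internals; `embed_in_batches()` and `toy_embed()` are hypothetical
names.

```{r}
# Hypothetical helper, for illustration only; cr_embed() may work differently.
embed_in_batches <- function(texts, embed_fn, batch_size = 32) {
  if (length(texts) == 0) return(list())
  # Group inputs into consecutive batches of at most `batch_size` elements.
  batches <- split(texts, ceiling(seq_along(texts) / batch_size))
  out <- lapply(batches, function(batch) {
    res <- embed_fn(batch)
    # Normalise both allowed return shapes to a list of numeric vectors.
    if (is.matrix(res)) res <- lapply(seq_len(nrow(res)), function(i) res[i, ])
    stopifnot(length(res) == length(batch))
    res
  })
  # Flatten the per-batch lists into one list with one vector per input.
  unlist(out, recursive = FALSE, use.names = FALSE)
}

# Toy embedder returning a length(x) x 2 matrix.
toy_embed <- function(x) cbind(nchar(x), 1)
vectors <- embed_in_batches(c("a", "bb", "ccc"), toy_embed, batch_size = 2)
length(vectors)
#> [1] 3
```

The result has one element per input text, which is the shape a list-column
such as `embedding` needs.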

## 4. Export for retrieval

`cr_export()` writes the chunk table (with embeddings) to a retrieval-friendly
format. `parquet` and `jsonl` preserve the embedding vectors natively; `csv`
and `duckdb` serialise them to a `[...]` string.

```{r}
cr_export(embedded, "corpus.parquet", format = "parquet")
cr_export(embedded, "corpus.jsonl", format = "jsonl")
cr_export(embedded, "corpus.duckdb", format = "duckdb", table = "chunks")
```
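
If you read a `csv` or `duckdb` export back into R, the embeddings arrive as
strings and need parsing. The sketch below assumes the string is a
comma-separated list of numbers wrapped in square brackets, such as
`"[0.1,0.2]"`; that exact layout is an assumption, so inspect your file first.
`parse_embedding()` is a hypothetical helper, not part of crawlee.

```{r}
# Hypothetical helper: turn one "[x1,x2,...]" string into a numeric vector.
parse_embedding <- function(s) {
  # Drop the surrounding brackets (and any whitespace around them).
  inner <- gsub("^\\s*\\[|\\]\\s*$", "", s)
  # Split on commas; as.numeric() tolerates spaces around each number.
  as.numeric(strsplit(inner, ",", fixed = TRUE)[[1]])
}

parse_embedding("[0.25, -1, 3e-2]")
```

To restore a whole column, apply it element-wise, for example
`lapply(df$embedding, parse_embedding)` for a data frame `df` read from the
export.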

## End to end

```{r}
crawler("https://books.toscrape.com/") |>
  cr_options(max_requests = 100) |>
  cr_on_html(function(ctx) {
    ctx$push_data(list(
      url  = ctx$request$url,
      text = ctx$page |> rvest::html_element("body") |> rvest::html_text2()
    ))
    ctx$enqueue_links(glob = "*/catalogue/*")
  }) |>
  cr_run() |>
  cr_collect() |>
  cr_chunk(text = text, size = 1000, overlap = 200) |>
  cr_embed(embed_fn) |>
  cr_export("corpus.parquet", format = "parquet")
```

From here, load `corpus.parquet` into your vector store or do nearest-neighbour
search in R to retrieve chunks for a prompt.
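
As a rough sketch of nearest-neighbour search in R, the helper below ranks
chunks by cosine similarity to a query vector. It assumes the corpus is a data
frame whose `embedding` list-column holds numeric vectors of equal length;
`top_k_chunks()` is a hypothetical name, and a brute-force scan like this is
only reasonable for small corpora.

```{r}
# Hypothetical helper, for illustration only (not part of crawlee).
top_k_chunks <- function(corpus, query, k = 5) {
  m <- do.call(rbind, corpus$embedding)   # n x d matrix, one row per chunk
  # Cosine similarity between every row of m and the query vector.
  # Zero-length vectors give NaN, which order() places last.
  sims <- as.vector(m %*% query) / (sqrt(rowSums(m^2)) * sqrt(sum(query^2)))
  keep <- head(order(sims, decreasing = TRUE), k)
  out <- corpus[keep, , drop = FALSE]
  out$similarity <- sims[keep]
  out
}

# Toy corpus with 2-dimensional embeddings.
corpus <- data.frame(chunk = c("a", "b", "c"))
corpus$embedding <- list(c(1, 0), c(0, 1), c(1, 1))
hits <- top_k_chunks(corpus, query = c(1, 0.1), k = 2)
hits$chunk
#> [1] "a" "c"
```

The query vector should come from the same `embed_fn` used to build the corpus,
for example `embed_fn("your question")[1, ]` when it returns a matrix.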
