Crawlee separates three kinds of storage: the request queue (what to crawl), the dataset (structured results) and the key-value store (binary blobs). crawlee mirrors that split and adds a one-call setup for reproducible, resumable runs.
Handlers call ctx$push_data() to append records;
cr_collect() returns them as one tibble. By default the
dataset lives in memory.

result <- crawler("https://books.toscrape.com/") |>
  cr_on_html(function(ctx) {
    ctx$push_data(list(url = ctx$request$url))
  }) |>
  cr_run() |>
  cr_collect()

For larger or longer crawls, choose a persistent backend with cr_dataset():

- "jsonl": append-only, schema-flexible, one JSON object per line;
- "duckdb": appended to a DuckDB table, ready for SQL.

crawler("https://books.toscrape.com/") |>
  cr_dataset(backend = "duckdb", path = "books.duckdb") |>
  cr_on_html(function(ctx) ctx$push_data(list(url = ctx$request$url))) |>
  cr_run()

Both persistent backends resume from an existing file: re-opening the same path keeps the rows already there.
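
To illustrate why an append-only file resumes naturally, here is a minimal base-R sketch. It is not crawlee's implementation: append_jsonl() and its hand-rolled JSON formatting are hypothetical helpers that only handle flat, string-valued records without escaping, and the temporary path stands in for a real dataset file.

```r
# Hypothetical helper (not part of crawlee): append one flat record,
# given as a named list of strings, as a single JSON line.
# No escaping is done, so this is for illustration only.
append_jsonl <- function(path, record) {
  fields <- sprintf('"%s":"%s"', names(record), unlist(record))
  line <- paste0("{", paste(fields, collapse = ","), "}")
  cat(line, "\n", sep = "", file = path, append = TRUE)
  invisible(path)
}

path <- tempfile(fileext = ".jsonl")

# First "run" writes two records.
append_jsonl(path, list(url = "https://books.toscrape.com/"))
append_jsonl(path, list(url = "https://books.toscrape.com/catalogue/page-2.html"))

# Second "run" re-opens the same path; the earlier rows are still there.
append_jsonl(path, list(url = "https://books.toscrape.com/catalogue/page-3.html"))

rows <- readLines(path)
length(rows)
```

Because every write is an append, nothing has to be rewritten or merged when the file is opened again; the new rows simply follow the old ones.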
Use the key-value store for raw, non-tabular content — PDFs, images,
page snapshots. ctx$save_body() writes the current response
there, and cr_store() sets the directory.
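
As a rough picture of what a directory-backed key-value store does with raw bodies, consider the base-R sketch below. The helpers kv_path(), kv_put() and kv_get() and the directory name are hypothetical; crawlee's own file naming and layout may differ.

```r
# Hypothetical key-value store backed by a directory (not crawlee's code).
kv_dir <- file.path(tempdir(), "kv-sketch")
dir.create(kv_dir, showWarnings = FALSE, recursive = TRUE)

# Turn an arbitrary key (for example a URL) into a safe file name.
# Distinct keys can collide after sanitising; a real store would guard against that.
kv_path <- function(key) {
  file.path(kv_dir, gsub("[^A-Za-z0-9._-]", "_", key))
}

# Write the raw bytes unchanged.
kv_put <- function(key, body) {
  writeBin(body, kv_path(key))
  invisible(key)
}

# Read back exactly as many bytes as the file holds.
kv_get <- function(key) {
  path <- kv_path(key)
  readBin(path, what = "raw", n = file.size(path))
}

# Store a few bytes of a PDF header and read them back.
body <- charToRaw("%PDF-1.7")
kv_put("https://example.com/report.pdf", body)
round_trip_ok <- identical(kv_get("https://example.com/report.pdf"), body)
round_trip_ok
```

Storing bytes as-is, rather than parsing them into rows, is what makes this kind of store suitable for PDFs, images and page snapshots.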
The request queue deduplicates by a normalised key (see
cr_normalize_url()), so each URL is fetched at most once
and a crawl is deterministic. It can also persist its state — pending
requests, seen keys, handled count — which is what makes a crawl
resumable.
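
Deduplication by a normalised key can be sketched in a few lines of base R. The rules in normalise_key() below (drop the fragment, lower-case scheme and host, drop one trailing slash) are assumptions chosen for illustration; cr_normalize_url() may apply different rules. new_queue() and enqueue() are likewise hypothetical names.

```r
# Hypothetical normalisation; cr_normalize_url() may behave differently.
normalise_key <- function(url) {
  url <- sub("#.*$", "", url)  # drop the fragment
  m <- regmatches(url, regexec("^([A-Za-z][A-Za-z0-9+.-]*://[^/?]+)(.*)$", url))[[1]]
  if (length(m) == 3L) {
    url <- paste0(tolower(m[2]), m[3])  # lower-case scheme and host only
  }
  sub("/$", "", url)  # drop one trailing slash
}

# A queue is a list of pending URLs plus the set of keys already seen.
new_queue <- function() {
  q <- new.env()
  q$pending <- character()
  q$seen <- character()
  q
}

# Add a URL unless its normalised key was seen before.
enqueue <- function(q, url) {
  key <- normalise_key(url)
  if (key %in% q$seen) return(invisible(FALSE))
  q$seen <- c(q$seen, key)
  q$pending <- c(q$pending, url)
  invisible(TRUE)
}

q <- new_queue()
enqueue(q, "https://books.toscrape.com/")
enqueue(q, "HTTPS://Books.ToScrape.com")       # same key, skipped
enqueue(q, "https://books.toscrape.com/#top")  # same key, skipped
enqueue(q, "https://books.toscrape.com/catalogue/page-2.html")
q$pending
```

Only two of the four URLs end up pending, because three of them collapse to the same key.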
cr_persist()

cr_persist(dir) wires everything to a run directory:

- the request queue is saved to queue.rds during the run;
- the dataset is written to disk (dataset.jsonl or dataset.duckdb);
- ctx$save_body() writes under kv/;
- a manifest (manifest.rds / manifest.json) records the start URLs, an options snapshot and run statistics.

crawl <- crawler("https://books.toscrape.com/") |>
  cr_persist("runs/books", dataset = "duckdb") |>
  cr_on_html(function(ctx) {
    ctx$push_data(list(url = ctx$request$url))
    ctx$enqueue_links(glob = "*/catalogue/*")
  }) |>
  cr_run()

data <- cr_collect(crawl)
cr_close(crawl) # release the DuckDB connection

If a run is interrupted, run the exact same pipeline again. Because the state already exists in runs/books, cr_persist() restores it and the crawl continues where it left off: already-fetched URLs are skipped.

# Same code as above: it resumes instead of starting over.
crawler("https://books.toscrape.com/") |>
  cr_persist("runs/books", dataset = "duckdb") |>
  cr_on_html(function(ctx) {
    ctx$push_data(list(url = ctx$request$url))
    ctx$enqueue_links(glob = "*/catalogue/*")
  }) |>
  cr_run()

For the DuckDB backend, call cr_collect() before cr_close(), because closing releases the connection.
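
The resume behaviour described above can be pictured as a checkpoint-and-restore cycle. The following base-R sketch is an assumption-laden illustration, not crawlee's on-disk format: only the file name queue.rds comes from the text, while the state layout (pending, seen, handled) and the helpers load_state() and run() are hypothetical, and no network requests are made.

```r
# Hypothetical run directory and state layout; not crawlee's on-disk format.
run_dir <- file.path(tempdir(), "run-sketch")
dir.create(run_dir, showWarnings = FALSE, recursive = TRUE)
state_file <- file.path(run_dir, "queue.rds")
unlink(state_file)  # start the demo from a clean slate

# Restore the saved state if there is one, otherwise start from the seeds.
load_state <- function(seeds) {
  if (file.exists(state_file)) return(readRDS(state_file))
  list(pending = seeds, seen = seeds, handled = 0L)
}

# Handle at most `max_requests` URLs, checkpointing after each one.
run <- function(seeds, max_requests) {
  state <- load_state(seeds)
  while (length(state$pending) > 0 && max_requests > 0) {
    state$pending <- state$pending[-1]  # stand-in for fetching the next URL
    state$handled <- state$handled + 1L
    max_requests <- max_requests - 1
    saveRDS(state, state_file)
  }
  state
}

seeds <- c("https://example.com/a", "https://example.com/b", "https://example.com/c")
first <- run(seeds, max_requests = 2)    # "interrupted" after two requests
second <- run(seeds, max_requests = 10)  # same call again: picks up the rest
c(first$handled, second$handled)
```

The second call receives the same seeds but ignores them, because a saved state already exists; that is the property that lets an identical pipeline resume instead of starting over.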