First release. A tidy, native-R, Crawlee-inspired toolkit for reproducible web crawling.
cr_stream(adaptive = TRUE, min, max) adapts the streaming pool’s
in-flight target at run time (AIMD on back-pressure), like
cr_autoscale() but for the continuous scheduler.

delay / robots.txt Crawl-delay): a host is not hit again until its
interval has elapsed, while different hosts keep running in parallel.

cr_autoscale(min, max) adapts the parallel batch concurrency at run
time (Crawlee autoscaled-pool style): additive-increase on clean
batches, multiplicative-decrease on back-pressure (a transport failure
or HTTP 429/500/502/503/504), clamped to [min, max].

cr_stream(concurrency) adds a continuous-pool scheduler (via
httr2::req_perform_promise() + /): keeps concurrency requests in flight
at all times, dispatching and refilling as each finishes, avoiding the
batch engine’s “wait for the slowest” stall.

cr_parallel(concurrency) enables concurrent fetching for the HTTP
backend (Crawlee’s autoscaled-pool equivalent): the queue is drained in
batches whose network I/O runs concurrently via
httr2::req_perform_parallel(), while handlers still run sequentially in
R (no shared-state hazard). robots.txt, retries, depth/request limits
and queue checkpointing all still apply; delay/Crawl-delay are applied
between batches.

dispatch/error steps used by both the sequential and parallel loops.

cr_persist() ties a crawl to a run directory: the request queue is
checkpointed (queue.rds) during the run and restored on the next run, so
a crawl resumes where it left off without re-fetching seen URLs.

cr_dataset(backend = "jsonl") (append-only, schema-flexible) and
"duckdb" (SQL-ready). The RequestQueue gained
save()/restore()/set_path().

manifest.rds / manifest.json) records the start URLs, options snapshot
and run stats.

cr_close() releases the browser session and DuckDB connection.

cr_chunk() splits text (a character vector or a data-frame column) into
overlapping chunks, by character or word, carrying metadata per chunk.

cr_embed() attaches an embedding list-column via a user-supplied,
provider-agnostic embedding function, applied in batches. crawlee never
calls an external service itself.

cr_export() writes chunks (and embeddings) to Parquet, JSONL, CSV or
DuckDB for retrieval.

cr_use_browser() renders JavaScript-heavy pages with a headless
Chrome/Chromium via , with wait and wait_selector controls. Handlers are
unchanged (ctx$page, enqueue_links()); the context gains
ctx$screenshot(), saved to the KeyValueStore.

fetched object, so handlers behave identically regardless of HTTP vs
browser.

html, pdf, other) and routed to the matching default handler; explicit
request labels still take precedence.

cr_on_pdf() registers a PDF handler. Its context adds pdf_text()
(per-page text via ), body_raw()/body_string() and save_body().

KeyValueStore plus cr_store() and ctx$save_body(): persist raw responses
(PDFs, images, snapshots) on disk alongside the structured dataset.

cr_from_sitemap() enqueues URLs from a sitemap.xml, recursing into
sitemap indexes, transparently handling gzipped sitemaps, with glob
filters and a since filter on <lastmod> for incremental crawls.

cr_from_rss() enqueues items from RSS and Atom feeds, carrying item
title and date into the request’s user_data.

robots.txt is now enforced when respect_robots = TRUE (the default): a
native parser/matcher (User-agent grouping, */$ patterns, longest-match
with Allow override, Crawl-delay), cached per host. Disallowed URLs are
skipped and reported; Crawl-delay is honoured.

crawler() builds a stateful, pipe-friendly crawler.

RequestQueue: deduplicating (normalised unique_key), FIFO, resumable
request queue with retry rescheduling.

cr_options() configures concurrency, depth, delay, retries, user agent
and log verbosity.

cr_use_http() HTTP fetch backend (httr2); cr_use_browser() reserved.

cr_on_html() registers content handlers; handler context exposes
push_data() and enqueue_links() (with glob/include/exclude and
same-domain filtering).

Dataset append-only store; cr_run() drives the crawl and cr_collect()
returns a tibble.

cli.
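
The additive-increase / multiplicative-decrease rule described for cr_autoscale() above can be illustrated with a few lines of base R. This is only a sketch of the general AIMD technique, not the package's implementation: the helper name aimd_step(), the step size of 1, the halving factor of 0.5, the argument names min_conc/max_conc, and the use of NA to stand for a transport failure are all assumptions made for the example. Only the back-pressure status codes and the clamping to [min, max] come from the entry itself.

```r
# Hypothetical helper (not part of the package API): one AIMD update.
# `statuses` holds the HTTP status of each request in the finished batch;
# NA is used here, by assumption, to mark a transport failure.
aimd_step <- function(current, statuses, min_conc = 1L, max_conc = 16L,
                      increase = 1L, decrease = 0.5) {
  # Status codes the changelog entry lists as back-pressure.
  backpressure <- c(429L, 500L, 502L, 503L, 504L)
  stressed <- any(is.na(statuses) | statuses %in% backpressure)

  target <- if (stressed) {
    floor(current * decrease)  # multiplicative decrease
  } else {
    current + increase         # additive increase on a clean batch
  }

  # Clamp the new target to [min_conc, max_conc].
  as.integer(max(min_conc, min(max_conc, target)))
}

# Simulate five batches, starting from a concurrency of 4.
concurrency <- 4L
batches <- list(
  c(200L, 200L, 200L),  # clean batch
  c(200L, 404L),        # 404 is not in the back-pressure list
  c(200L, 503L),        # back-pressure status
  c(NA, 200L),          # assumed encoding of a transport failure
  c(200L)               # clean batch
)

trace <- integer(0)
for (statuses in batches) {
  concurrency <- aimd_step(concurrency, statuses,
                           min_conc = 1L, max_conc = 8L)
  trace <- c(trace, concurrency)
}
print(trace)
#> [1] 5 6 3 1 2
```

Halving on back-pressure lets the pool retreat quickly from an overloaded host, while the one-step increase probes for spare capacity slowly; the clamp keeps both moves inside the bounds the user supplied.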