---
title: "Scaling and politeness"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Scaling and politeness}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>", eval = FALSE)
```

```{r setup}
library(crawlee)
```

This article covers the two sides of crawling at scale, following
[Crawlee](https://crawlee.dev)'s *Scaling our crawlers* and *Avoid getting
blocked* guides: going **faster** (concurrency) while staying **polite** (rate
limits and `robots.txt`).

## Being a good web citizen

By default crawlee is conservative and respectful, and a handful of options
control how politely a crawl behaves:

* **`robots.txt` is honoured** (`respect_robots = TRUE`): disallowed URLs are
  skipped, and a `Crawl-delay` directive is applied;
* **`user_agent`** should be set to something descriptive so site owners can
  identify your crawler;
* **`delay`** adds a pause between requests;
* **`max_requests`** and **`max_depth`** bound the crawl;
* failed requests are **retried** (`max_retries`) with backoff.

```{r}
crawler("https://books.toscrape.com/") |>
  cr_options(
    user_agent = "my-research-bot (you@example.com)",
    delay = 0.5, # seconds between requests
    max_requests = 500,
    max_depth = 4,
    respect_robots = TRUE
  ) |>
  cr_on_html(function(ctx) ctx$enqueue_links()) |>
  cr_run()
```
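
The list above says only that failed requests are retried with backoff; the
exact schedule is not described here. As a rough illustration, assuming a
capped exponential backoff (an assumption, not necessarily what crawlee does),
the wait before each retry could be computed as below. `backoff_delays()`,
`base` and `cap` are hypothetical names used only for this sketch.

```{r}
# Hypothetical helper, NOT part of crawlee: capped exponential backoff.
backoff_delays <- function(max_retries, base = 0.5, cap = 30) {
  attempt <- seq_len(max_retries)
  # The wait doubles after each failed attempt but never exceeds `cap` seconds.
  pmin(cap, base * 2^(attempt - 1))
}

backoff_delays(5)
#> [1] 0.5 1.0 2.0 4.0 8.0
```

Doubling the wait gives a struggling server progressively more room to
recover, while the cap keeps a single bad URL from stalling the crawl for long.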

## Going faster

The default engine is sequential. For higher throughput there are three
concurrent engines. All of them keep handlers running sequentially in R, so
your dataset and queue are never touched concurrently; only the network I/O
runs in parallel.

### Fixed-concurrency batches — `cr_parallel()`

Drains the queue in batches whose network requests run together.

```{r}
crawler("https://books.toscrape.com/") |>
  cr_parallel(concurrency = 8) |>
  cr_on_html(function(ctx) ctx$enqueue_links()) |>
  cr_run()
```

### Adaptive batches — `cr_autoscale()`

Like `cr_parallel()`, but the batch size adapts at run time (additive-increase
on clean batches, halving on back-pressure such as HTTP 429/503 or transport
failures), staying within `[min, max]`.

```{r}
crawler("https://books.toscrape.com/") |>
  cr_autoscale(min = 2, max = 16) |>
  cr_on_html(function(ctx) ctx$enqueue_links()) |>
  cr_run()
```
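
The adaptation rule described above (additive increase, halving on
back-pressure, clamped to `[min, max]`) can be sketched in a few lines of base
R. This is an illustration of the idea only, not crawlee's implementation:
`next_batch_size()`, `lo`, `hi` and `step` are hypothetical names, and the step
of one request per clean batch is an assumption.

```{r}
# Hypothetical sketch of additive-increase / multiplicative-decrease sizing.
# Not crawlee's code; `lo` and `hi` stand in for the engine's min and max.
next_batch_size <- function(current, back_pressure, lo = 2, hi = 16, step = 1) {
  # Halve after back-pressure (e.g. HTTP 429/503), otherwise grow by `step`.
  proposed <- if (back_pressure) floor(current / 2) else current + step
  # Clamp the result to the configured [lo, hi] range.
  max(lo, min(hi, proposed))
}

size <- 2
# TRUE marks a batch that saw back-pressure.
pressure <- c(FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE)
sizes <- numeric(0)
for (bp in pressure) {
  size <- next_batch_size(size, bp)
  sizes <- c(sizes, size)
}
sizes
#> [1] 3 4 5 6 3 4 5
```

The batch size climbs slowly while the server keeps up and drops sharply as
soon as it pushes back, which is why this shape of rule tends to settle near
the level a server can sustain.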

### Continuous streaming pool — `cr_stream()`

Keeps `concurrency` requests in flight at all times: the moment one finishes,
its handler runs and the next request is pulled in. This avoids the batch
engines' "wait for the slowest request in the batch" stall and shines when
response latency varies a lot.

```{r}
crawler("https://books.toscrape.com/") |>
  cr_stream(concurrency = 10) |>
  cr_on_html(function(ctx) ctx$enqueue_links()) |>
  cr_run()
```
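
To make the "wait for the slowest request in the batch" stall concrete, here is
a small simulation in base R. It is a simplified model rather than crawlee
code: `batch_time()` and `stream_time()` are hypothetical helpers, the
latencies are made up, and handler time is ignored.

```{r}
# Hypothetical simulation, not crawlee code: total wall-clock time for the
# same request latencies under batch scheduling and under a continuous pool.
batch_time <- function(latency, concurrency) {
  batch_id <- ceiling(seq_along(latency) / concurrency)
  # Each batch lasts as long as its slowest request.
  sum(tapply(latency, batch_id, max))
}

stream_time <- function(latency, concurrency) {
  free_at <- numeric(concurrency) # time at which each slot becomes free
  for (l in latency) {
    slot <- which.min(free_at)    # the next request takes the earliest free slot
    free_at[slot] <- free_at[slot] + l
  }
  max(free_at)
}

# Seconds per request: mostly fast, with two slow responses.
latency <- c(1, 1, 5, 1, 1, 1, 5, 1)
batch_time(latency, concurrency = 2)
#> [1] 12
stream_time(latency, concurrency = 2)
#> [1] 9
```

In the batch model every fast request that shares a batch with a slow one
waits for it; in the pool model the free slot keeps working through the fast
requests in the meantime.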

## Choosing an engine

| Engine | When to use |
|--------|-------------|
| sequential (default) | small crawls; strict per-request pacing |
| `cr_parallel()` | steady throughput with a known good concurrency |
| `cr_autoscale()` | unknown or variable server capacity; let it find a sustainable level |
| `cr_stream()` | many pages with widely varying latency; maximum throughput |

> Concurrency and politeness pull in opposite directions. The batch engines
> apply `delay` / `Crawl-delay` between batches; the streaming engine treats
> concurrency itself as the throttle and does not enforce per-request pacing.
> For strict rate limits, prefer a batch engine with a `delay`.
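
A back-of-the-envelope estimate shows how `delay` caps the request rate of a
batch engine. The sketch assumes each batch takes about as long as its slowest
response and that one pause of `delay` seconds follows every batch;
`batch_rate()` and `batch_latency` are hypothetical names, not part of crawlee.

```{r}
# Hypothetical helper, not part of crawlee: rough requests-per-second estimate
# for a batch engine with one pause of `delay` seconds between batches.
batch_rate <- function(concurrency, batch_latency, delay) {
  concurrency / (batch_latency + delay)
}

batch_rate(concurrency = 8, batch_latency = 1.5, delay = 0.5)
#> [1] 4
batch_rate(concurrency = 8, batch_latency = 1.5, delay = 2.5)
#> [1] 2
```

Under these assumptions, raising `delay` or lowering `concurrency` are the two
levers for staying under a site's rate limit.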

## Combining with persistence

Any engine composes with `cr_persist()` for resumable, checkpointed runs:

```{r}
crawler("https://books.toscrape.com/") |>
  cr_autoscale(min = 2, max = 16) |>
  cr_persist("runs/books", dataset = "duckdb") |>
  cr_on_html(function(ctx) ctx$enqueue_links()) |>
  cr_run()
```
