Tidy Interface for Reproducible Web Crawling


Documentation for package ‘crawlee’ version 0.1.0

Help Pages

Crawler           Crawler
crawler           Create a crawler
Crawler-class     Crawler
cr_autoscale      Enable autoscaled parallel fetching
cr_chunk          Chunk text for retrieval-augmented generation
cr_close          Release a crawler's resources
cr_collect        Collect crawl results
cr_dataset        Configure the dataset backend
cr_embed          Attach embeddings to chunks
cr_export         Export chunks (and embeddings) for retrieval
cr_from_rss       Discover URLs from an RSS or Atom feed
cr_from_sitemap   Discover URLs from a sitemap
cr_normalize_url  Normalise a URL into a canonical form
cr_on_html        Register an HTML handler
cr_on_pdf         Register a PDF handler
cr_options        Set crawler options
cr_parallel       Enable parallel (concurrent) fetching
cr_persist        Persist a crawl to a run directory (and resume it)
cr_run            Run a crawl
cr_store          Configure the key-value store for binary content
cr_stream         Enable the streaming scheduler
cr_use_browser    Use the headless-browser fetch backend
cr_use_http       Use the HTTP fetch backend
Dataset           Dataset
KeyValueStore     Key-value store
RequestQueue      Request queue