Crawlee Python: The Complete Guide to Web Scraping and Browser Automation

Web scraping in Python has always meant choosing between speed (raw HTTP + parser) and capability (headless browser). Crawlee eliminates this trade-off with a unified interface that handles both — while looking human to bot detection systems.
Built by Apify, the company behind one of the largest web scraping platforms, Crawlee for Python brings production-grade crawling infrastructure to your Python scripts. With 8,400+ stars on GitHub, it's rapidly becoming the modern alternative to Scrapy.
Key Stats
| Metric | Value |
|---|---|
| GitHub Stars | 8,400+ |
| Forks | 662 |
| Created | January 2024 |
| Language | Python |
| License | Apache 2.0 |
| Releases | 58 |
| Created by | Apify |
| Homepage | crawlee.dev/python |
| Used by | 168 projects |
What Is Crawlee?
Crawlee is a web scraping and browser automation library for Python designed to build reliable crawlers. It helps you:
- Extract data for AI, LLMs, RAG pipelines, or GPTs
- Download files — HTML, PDF, JPG, PNG, and more
- Crawl websites with automatic link discovery and queue management
- Bypass bot protection with human-like behavior out of the box
It works with Parsel, BeautifulSoup, Playwright, and raw HTTP — both headful and headless modes, with built-in proxy rotation.
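The crawl loop that Crawlee manages for you — a queue, URL deduplication, and link discovery — can be sketched in plain Python. Here `fetch_page` and `extract_links` are hypothetical stand-ins for real HTTP and parsing code, which Crawlee (along with retries, rate limiting, and storage) handles internally:

```python
import re
from collections import deque

# Hypothetical stand-ins for real fetching/parsing.
def fetch_page(url: str) -> str:
    pages = {'https://example.com': '<a href="https://example.com/about">About</a>'}
    return pages.get(url, '')

def extract_links(html: str) -> list[str]:
    return re.findall(r'href="([^"]+)"', html)

def crawl(start_url: str, max_requests: int = 10) -> list[str]:
    queue = deque([start_url])
    seen = {start_url}   # deduplicate URLs, like Crawlee's request queue
    visited = []
    while queue and len(visited) < max_requests:
        url = queue.popleft()
        visited.append(url)
        for link in extract_links(fetch_page(url)):
            if link not in seen:   # roughly what enqueue_links() automates
                seen.add(link)
                queue.append(link)
    return visited

print(crawl('https://example.com'))
```

Crawlee's persistent request queue plays the role of `queue` and `seen` here, except it survives process restarts.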
Two Crawler Types
BeautifulSoupCrawler
For fast HTML scraping without JavaScript execution:
```python
import asyncio

from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    crawler = BeautifulSoupCrawler(max_requests_per_crawl=10)

    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')
        data = {
            'url': context.request.url,
            'title': context.soup.title.string if context.soup.title else None,
        }
        await context.push_data(data)
        await context.enqueue_links()

    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())
```
PlaywrightCrawler
For JavaScript-heavy sites requiring browser rendering:
```python
import asyncio

from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext


async def main() -> None:
    crawler = PlaywrightCrawler(max_requests_per_crawl=10)

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')
        data = {
            'url': context.request.url,
            'title': await context.page.title(),
        }
        await context.push_data(data)
        await context.enqueue_links()

    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())
```
Key Features
| Feature | Description |
|---|---|
| Unified interface | Same API for HTTP and headless browser crawling |
| Auto-parallel crawling | Scales based on available system resources |
| Automatic retries | Handles errors and blocks gracefully |
| Proxy rotation | Built-in proxy and session management |
| Request routing | Direct URLs to appropriate handlers |
| Persistent queue | Resume crawls after interruptions |
| Pluggable storage | Tabular data (datasets) and files (key-value stores) |
| Type hints | Full type coverage for IDE autocompletion |
| Asyncio-based | Modern async Python, compatible with other async libraries |
| Human-like behavior | Flies under bot detection radar by default |
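At its core, proxy rotation means cycling a pool of proxy URLs across requests. Crawlee's proxy configuration also ties proxies to sessions; this stdlib sketch shows only the round-robin core, with made-up proxy URLs:

```python
from itertools import cycle

class ProxyPool:
    """Round-robin proxy rotation -- a toy version of what a crawler's
    proxy configuration does across outgoing requests."""
    def __init__(self, proxy_urls: list[str]) -> None:
        self._proxies = cycle(proxy_urls)

    def next_proxy(self) -> str:
        return next(self._proxies)

pool = ProxyPool(['http://proxy-a:8000', 'http://proxy-b:8000'])
picks = [pool.next_proxy() for _ in range(3)]
print(picks)  # proxies repeat in round-robin order
```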
Why Crawlee Over Scrapy?
| Aspect | Crawlee | Scrapy |
|---|---|---|
| Async approach | ✅ Native Asyncio | Twisted (custom) |
| Type hints | ✅ Complete | Partial |
| Integration | ✅ Regular Python script | Requires Scrapy framework |
| State persistence | ✅ Built-in | DIY |
| Multiple output types | ✅ Datasets + K/V stores | Items pipeline |
| Browser support | ✅ Playwright built-in | Splash/playwright-scrapy |
| Bot protection bypass | ✅ Default | Middleware needed |
| Learning curve | ✅ Low (plain Python) | Higher (framework concepts) |
Why Crawlee Over Raw HTTP + Parser?
When you use requests + BeautifulSoup directly, you have to build everything yourself:
- Error handling and retries
- Proxy rotation
- Rate limiting
- Queue management
- Data storage
- Parallel execution
Crawlee provides all of this out of the box, so you focus on what matters: extracting the data you need.
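The retry logic alone looks roughly like this exponential-backoff loop. `flaky_fetch` is a hypothetical function that fails twice before succeeding, and the delays are illustrative:

```python
import time

def fetch_with_retries(fetch, url, max_retries=3, base_delay=0.01):
    """Retry a flaky fetch with exponential backoff -- the kind of
    plumbing you must hand-roll with requests + BeautifulSoup."""
    for attempt in range(max_retries + 1):
        try:
            return fetch(url)
        except ConnectionError:
            if attempt == max_retries:
                raise
            time.sleep(base_delay * 2 ** attempt)  # 0.01s, 0.02s, 0.04s, ...

# A hypothetical fetch that fails twice, then succeeds.
calls = {'n': 0}
def flaky_fetch(url):
    calls['n'] += 1
    if calls['n'] < 3:
        raise ConnectionError('temporary failure')
    return f'<html>{url}</html>'

result = fetch_with_retries(flaky_fetch, 'https://example.com')
print(result)
```

Multiply this by proxy rotation, rate limiting, and queue persistence, and the appeal of having it built in becomes clear.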
Installation
With Crawlee CLI (Recommended)
```shell
uvx 'crawlee[cli]' create my-crawler
```
Manual Installation
```shell
pip install 'crawlee[beautifulsoup]'
# or
pip install 'crawlee[playwright]'
```
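If you choose the Playwright extra, the browser binaries are downloaded in a separate step — this is Playwright's own setup command, not something Crawlee-specific:

```shell
playwright install
```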
Running on Apify Platform
Crawlee is open-source and runs anywhere, but it integrates seamlessly with the Apify platform for cloud deployment — scheduled runs, proxy management, storage, and monitoring all included.
Use Cases
- AI/LLM data extraction — Feed structured web data into RAG pipelines
- E-commerce scraping — Product prices, reviews, availability
- News aggregation — Automated content collection
- SEO monitoring — Track rankings, metadata, broken links
- Research — Academic data collection at scale
- Lead generation — Contact information from business directories
Conclusion
Crawlee Python is what web scraping should have been from the start: a single library that handles HTTP scraping, browser automation, proxy rotation, error handling, and data storage — all with a clean, type-hinted, asyncio-based API.
With 8,400+ stars and backing from Apify, whose platform processes billions of web pages, Crawlee gives you production-grade crawling infrastructure without the Scrapy learning curve.