Crawlee Python: The Complete Guide to Web Scraping and Browser Automation

Web scraping in Python has always meant choosing between speed (raw HTTP + parser) and capability (headless browser). Crawlee eliminates this trade-off with a unified interface that handles both — while looking human to bot detection systems.
Built by Apify, the company behind one of the largest web scraping platforms, Crawlee for Python brings production-grade crawling infrastructure to your Python scripts. With 8,400+ stars on GitHub, it's rapidly becoming the modern alternative to Scrapy.
Key Stats
| Metric | Value |
|---|---|
| GitHub Stars | 8,400+ |
| Forks | 662 |
| Created | January 2024 |
| Language | Python |
| License | Apache 2.0 |
| Releases | 58 |
| Created by | Apify |
| Homepage | crawlee.dev/python |
| Used by | 168 projects |
What Is Crawlee?
Crawlee is a web scraping and browser automation library for Python designed to build reliable crawlers. It helps you:
- Extract data for AI, LLMs, RAG pipelines, or GPTs
- Download files — HTML, PDF, JPG, PNG, and more
- Crawl websites with automatic link discovery and queue management
- Bypass bot protection with human-like behavior out of the box
It works with Parsel, BeautifulSoup, Playwright, and raw HTTP — both headful and headless modes, with built-in proxy rotation.
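The crawl loop that Crawlee manages for you — a queue, URL deduplication, and link discovery — can be sketched in plain Python. Here `fetch_page` and `extract_links` are hypothetical stand-ins for real HTTP and parsing code, which Crawlee (along with retries, rate limiting, and storage) handles internally:

```python
import re
from collections import deque

# Hypothetical stand-ins for real fetching/parsing.
def fetch_page(url: str) -> str:
    pages = {'https://example.com': '<a href="https://example.com/about">About</a>'}
    return pages.get(url, '')

def extract_links(html: str) -> list[str]:
    return re.findall(r'href="([^"]+)"', html)

def crawl(start_url: str, max_requests: int = 10) -> list[str]:
    queue = deque([start_url])
    seen = {start_url}   # deduplicate URLs, like Crawlee's request queue
    visited = []
    while queue and len(visited) < max_requests:
        url = queue.popleft()
        visited.append(url)
        for link in extract_links(fetch_page(url)):
            if link not in seen:   # roughly what enqueue_links() automates
                seen.add(link)
                queue.append(link)
    return visited

print(crawl('https://example.com'))
```

Crawlee's persistent request queue plays the role of `queue` and `seen` here, except it survives process restarts.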
Two Crawler Types
BeautifulSoupCrawler
For fast HTML scraping without JavaScript execution:
```python
import asyncio

from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    crawler = BeautifulSoupCrawler(max_requests_per_crawl=10)

    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')
        data = {
            'url': context.request.url,
            'title': context.soup.title.string if context.soup.title else None,
        }
        await context.push_data(data)
        await context.enqueue_links()

    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())
```
PlaywrightCrawler
For JavaScript-heavy sites requiring browser rendering:
```python
import asyncio

from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext


async def main() -> None:
    crawler = PlaywrightCrawler(max_requests_per_crawl=10)

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')
        data = {
            'url': context.request.url,
            'title': await context.page.title(),
        }
        await context.push_data(data)
        await context.enqueue_links()

    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())
```
Key Features
| Feature | Description |
|---|---|
| Unified interface | Same API for HTTP and headless browser crawling |
| Auto-parallel crawling | Scales based on available system resources |
| Automatic retries | Handles errors and blocks gracefully |
| Proxy rotation | Built-in proxy and session management |
| Request routing | Direct URLs to appropriate handlers |
| Persistent queue | Resume crawls after interruptions |
| Pluggable storage | Tabular data (datasets) and files (key-value stores) |
| Type hints | Full type coverage for IDE autocompletion |
| Asyncio-based | Modern async Python, compatible with other async libraries |
| Human-like behavior | Flies under bot detection radar by default |
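At its core, proxy rotation means cycling a pool of proxy URLs across requests. Crawlee's proxy configuration also ties proxies to sessions; this stdlib sketch shows only the round-robin core, with made-up proxy URLs:

```python
from itertools import cycle

class ProxyPool:
    """Round-robin proxy rotation -- a toy version of what a crawler's
    proxy configuration does across outgoing requests."""
    def __init__(self, proxy_urls: list[str]) -> None:
        self._proxies = cycle(proxy_urls)

    def next_proxy(self) -> str:
        return next(self._proxies)

pool = ProxyPool(['http://proxy-a:8000', 'http://proxy-b:8000'])
picks = [pool.next_proxy() for _ in range(3)]
print(picks)  # proxies repeat in round-robin order
```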
Why Crawlee Over Scrapy?
| Aspect | Crawlee | Scrapy |
|---|---|---|
| Async approach | ✅ Native Asyncio | Twisted (custom) |
| Type hints | ✅ Complete | Partial |
| Integration | ✅ Regular Python script | Requires Scrapy framework |
| State persistence | ✅ Built-in | DIY |
| Multiple output types | ✅ Datasets + K/V stores | Items pipeline |
| Browser support | ✅ Playwright built-in | Splash/playwright-scrapy |
| Bot protection bypass | ✅ Default | Middleware needed |
| Learning curve | ✅ Low (plain Python) | Higher (framework concepts) |
Why Crawlee Over Raw HTTP + Parser?
When you use requests + BeautifulSoup directly, you have to build everything yourself:
- Error handling and retries
- Proxy rotation
- Rate limiting
- Queue management
- Data storage
- Parallel execution
Crawlee provides all of this out of the box, so you focus on what matters: extracting the data you need.
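The retry logic alone looks roughly like this exponential-backoff loop. `flaky_fetch` is a hypothetical function that fails twice before succeeding, and the delays are illustrative:

```python
import time

def fetch_with_retries(fetch, url, max_retries=3, base_delay=0.01):
    """Retry a flaky fetch with exponential backoff -- the kind of
    plumbing you must hand-roll with requests + BeautifulSoup."""
    for attempt in range(max_retries + 1):
        try:
            return fetch(url)
        except ConnectionError:
            if attempt == max_retries:
                raise
            time.sleep(base_delay * 2 ** attempt)  # 0.01s, 0.02s, 0.04s, ...

# A hypothetical fetch that fails twice, then succeeds.
calls = {'n': 0}
def flaky_fetch(url):
    calls['n'] += 1
    if calls['n'] < 3:
        raise ConnectionError('temporary failure')
    return f'<html>{url}</html>'

result = fetch_with_retries(flaky_fetch, 'https://example.com')
print(result)
```

Multiply this by proxy rotation, rate limiting, and queue persistence, and the appeal of having it built in becomes clear.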
Installation
With Crawlee CLI (Recommended)
```shell
uvx 'crawlee[cli]' create my-crawler
```
Manual Installation
```shell
pip install 'crawlee[beautifulsoup]'
# or
pip install 'crawlee[playwright]'
```
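If you choose the Playwright extra, the browser binaries are downloaded in a separate step — this is Playwright's own setup command, not something Crawlee-specific:

```shell
playwright install
```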
Running on Apify Platform
Crawlee is open-source and runs anywhere, but it integrates seamlessly with the Apify platform for cloud deployment — scheduled runs, proxy management, storage, and monitoring all included.
Use Cases
- AI/LLM data extraction — Feed structured web data into RAG pipelines
- E-commerce scraping — Product prices, reviews, availability
- News aggregation — Automated content collection
- SEO monitoring — Track rankings, metadata, broken links
- Research — Academic data collection at scale
- Lead generation — Contact information from business directories
Conclusion
Crawlee Python is what web scraping should have been from the start: a single library that handles HTTP scraping, browser automation, proxy rotation, error handling, and data storage — all with a clean, type-hinted, asyncio-based API.
With 8,400+ stars and backing from Apify, whose platform processes billions of web pages, Crawlee gives you production-grade crawling infrastructure without the Scrapy learning curve.