Fix: Scrapy Not Working — Spider Crawl Returns Nothing, Robots.txt Blocked, and Pipeline Errors
Quick Answer
How to fix Scrapy errors — spider yields no items, robots.txt blocking all requests, 403 forbidden response, AttributeError on response.css, item pipeline not processing, AsyncIO reactor errors, and middleware not running.
The Error
You run a spider and it finishes with zero items (in a real stats dump the item_scraped_count key is simply absent when nothing was scraped):
2025-04-09 14:22:01 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'item_scraped_count': 0,
 'log_count/INFO': 12,
 'response_received_count': 1,
 'finish_reason': 'finished'}
Or every request is blocked by robots.txt before it is sent:
2025-04-09 14:22:01 [scrapy.downloadermiddlewares.robotstxt] DEBUG:
Forbidden by robots.txt: <GET https://example.com/products>
Or the spider crawls successfully but no items reach the database, and the pipeline reports no error:
class MyPipeline:
    def process_item(self, item, spider):
        # Items pass through but DB never updates
        return item
Or you upgrade Scrapy and get reactor errors:
twisted.internet.error.ReactorAlreadyInstalledError:
reactor already installed
Scrapy is a full asynchronous framework, not just a library. It manages spiders, middlewares, item pipelines, and a Twisted-based reactor. When something breaks, the failure can be in the spider logic, the settings, the middleware chain, or the underlying async runtime. This guide isolates each layer.
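As a rough first pass, the final stats dump already hints at which layer failed. The sketch below maps a stats dictionary to the fix in this guide that is most likely relevant. The stat keys are ones Scrapy emits; the triage function and its return strings are hypothetical helpers for illustration, and the ordering of the checks is an assumption, not an official diagnostic.

```python
def triage(stats):
    """Guess which layer failed from a Scrapy stats dict (illustrative only)."""
    # Missing keys mean the counter never fired, so default to 0.
    items = stats.get("item_scraped_count", 0)
    dropped = stats.get("item_dropped_count", 0)
    if items > 0:
        return "ok, but some items were dropped (Fix 5)" if dropped else "ok"
    if stats.get("robotstxt/forbidden", 0) > 0:
        return "blocked by robots.txt (Fix 2)"
    if stats.get("downloader/response_status_count/403", 0) > 0:
        return "server returned 403 (Fix 3)"
    if dropped > 0:
        return "every item was dropped in a pipeline (Fix 5)"
    if stats.get("response_received_count", 0) > 0:
        return "pages fetched but nothing extracted (Fix 1)"
    return "no responses received at all"


zero_item_run = {"log_count/INFO": 12, "response_received_count": 1,
                 "finish_reason": "finished"}
verdict = triage(zero_item_run)
print(verdict)
```

In a spider you could call a helper like this from a spider_closed handler with crawler.stats.get_stats().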
Why This Happens
Scrapy uses Twisted as its async engine — older than asyncio and with its own reactor concept. Modern Scrapy (2.0+) supports asyncio-compatible code, but mixing them incorrectly produces reactor errors. Spiders are class-based and follow a strict lifecycle: start_requests → parse callbacks → yielded items → pipelines. If the chain breaks at any point, items don’t reach storage.
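The most common silent failure is robots.txt blocking requests. Projects generated by scrapy startproject are well-behaved out of the box: the generated settings.py sets ROBOTSTXT_OBEY = True, Scrapy caps concurrent requests per domain, and it identifies itself with the Scrapy/2.x.x user agent. (The library-level default for ROBOTSTXT_OBEY is False, so a standalone script that never loads a project settings.py does not obey robots.txt unless you enable it.) The robots.txt and user-agent behaviors in particular can prevent legitimate scraping.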
Fix 1: Spider Yields No Items
{'item_scraped_count': 0, 'response_received_count': 5}
Pages were fetched but no items were yielded. Diagnose by checking each step.
Step 1: Print the response in parse to confirm the page loaded:
import scrapy
class ProductSpider(scrapy.Spider):
    name = 'products'
    start_urls = ['https://example.com/products']
    def parse(self, response):
        # Verify the page loaded
        self.logger.info(f"URL: {response.url}, status: {response.status}")
        self.logger.info(f"Body length: {len(response.body)}")
        self.logger.info(f"First 500 chars: {response.text[:500]}")
        # Then attempt extraction
        for product in response.css('div.product'):
            yield {
                'title': product.css('h2::text').get(),
                'price': product.css('span.price::text').get(),
            }
Step 2: Verify selectors in scrapy shell interactively:
scrapy shell "https://example.com/products"
# In the shell:
>>> response.status
200
>>> response.css('div.product').get()  # First match
>>> response.css('div.product').getall()  # All matches
>>> response.xpath('//div[@class="product"]').getall()
If the shell returns nothing, either the selectors are wrong or the content is not in the downloaded HTML (see Step 3). If they work in the shell but not in the spider, you have a code bug in the spider.
Step 3: Check for JavaScript-rendered content. View response.text in the shell. If the content you want isn’t there, the page uses JavaScript:
scrapy shell "https://example.com/products"
>>> 'class="product' in response.text  # False = markup not in the raw HTML, likely JS-rendered
For JS-rendered sites, integrate Playwright with Scrapy:
pip install scrapy-playwright
playwright install chromium
# settings.py
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
PLAYWRIGHT_BROWSER_TYPE = "chromium"
# spider
import scrapy
from scrapy_playwright.page import PageMethod
class ProductSpider(scrapy.Spider):
    name = 'products'
    def start_requests(self):
        yield scrapy.Request(
            'https://example.com/products',
            meta={
                'playwright': True,
                # Wait for the products to render BEFORE the response is built;
                # waiting inside parse would not update the response object
                'playwright_page_methods': [
                    PageMethod('wait_for_selector', 'div.product'),
                ],
            },
        )
    def parse(self, response):
        # response now holds the rendered HTML
        for product in response.css('div.product'):
            yield {'title': product.css('h2::text').get()}
For Playwright-specific configuration, see Playwright not working.
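The Step 3 check can also be run offline against saved HTML. The following is a minimal sketch using only the standard library: it counts elements with a given tag and class in a raw HTML string, so a count of zero on the downloaded body suggests the markup is injected by JavaScript. The ClassCounter and count_elements names and both sample documents are made up for illustration.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Counts start tags that match a tag name and carry a given CSS class."""

    def __init__(self, tag, css_class):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # The class attribute may hold several space-separated class names.
        classes = (dict(attrs).get("class") or "").split()
        if tag == self.tag and self.css_class in classes:
            self.count += 1


def count_elements(html, tag, css_class):
    parser = ClassCounter(tag, css_class)
    parser.feed(html)
    return parser.count


# Hypothetical sample bodies: one server-rendered, one empty JS application shell.
STATIC_HTML = ('<div class="product featured"><h2>Lamp</h2></div>'
               '<div class="product"><h2>Desk</h2></div>')
JS_SHELL_HTML = '<div id="app"></div><script src="/bundle.js"></script>'

static_count = count_elements(STATIC_HTML, "div", "product")
shell_count = count_elements(JS_SHELL_HTML, "div", "product")
print(static_count, shell_count)
```

Feeding it the debug.html file saved from a response gives the same answer as the shell check without re-fetching the page.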
Fix 2: Forbidden by robots.txt
[scrapy.downloadermiddlewares.robotstxt] DEBUG:
Forbidden by robots.txt: <GET https://example.com/products>
Scrapy respects robots.txt whenever ROBOTSTXT_OBEY is enabled, which is the case in projects generated by scrapy startproject. If a site disallows crawling certain paths, Scrapy skips them.
Check what robots.txt says:
curl https://example.com/robots.txt
# User-agent: *
# Disallow: /products
Disable robots.txt enforcement (only for sites you have permission to scrape):
# settings.py
ROBOTSTXT_OBEY = False
# Or per-spider
class ProductSpider(scrapy.Spider):
    name = 'products'
    custom_settings = {
        'ROBOTSTXT_OBEY': False,
    }
Note: Ignoring robots.txt may violate a site’s terms of service and could be illegal in some jurisdictions. Only do this on sites you own, have explicit permission to scrape, or where the data is clearly intended for public access.
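Before disabling enforcement, it can help to confirm which URLs the rules actually cover. The sketch below uses the standard library's urllib.robotparser on the robots.txt content shown above. Scrapy uses its own parser (Protego by default), so treat this as an approximation that may differ on edge cases such as wildcards. The is_allowed helper is a hypothetical name.

```python
from urllib.robotparser import RobotFileParser

# Same rules as the curl output above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /products
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


products_ok = is_allowed(ROBOTS_TXT, "Scrapy", "https://example.com/products")
about_ok = is_allowed(ROBOTS_TXT, "Scrapy", "https://example.com/about")
print(products_ok, about_ok)
```

If only some paths are disallowed, narrowing start_urls to the allowed ones may be enough and keeps ROBOTSTXT_OBEY on.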
Fix 3: 403 Forbidden — Server Blocks Default User-Agent
Many sites block Scrapy’s default user-agent (Scrapy/2.x.x (+https://scrapy.org)):
2025-04-09 14:22:01 [scrapy.spidermiddlewares.httperror] INFO:
Ignoring response <403 https://example.com/>: HTTP status code is not handled or not allowed
Set a realistic user-agent:
# settings.py
USER_AGENT = (
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
    'AppleWebKit/537.36 (KHTML, like Gecko) '
    'Chrome/124.0.0.0 Safari/537.36'
)
Add common headers:
# settings.py
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.9',
    # Advertise 'br' only if the brotli package is installed; otherwise Scrapy
    # cannot decode Brotli responses and your selectors match nothing
    'Accept-Encoding': 'gzip, deflate',
}
Rotate user-agents with scrapy-user-agents:
pip install scrapy-user-agents
# settings.py
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_user_agents.middlewares.RandomUserAgentMiddleware': 400,
}
For sites with serious bot protection (Cloudflare, DataDome), Scrapy alone isn’t enough. Use scrapy-playwright (real browser) or a service like ScraperAPI.
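If you would rather not add a dependency, user-agent rotation is small enough to write by hand. The sketch below shows the shape of such a downloader middleware: a plain class whose process_request sets the header on each outgoing request. The class name, the two user-agent strings, and the stand-in request object are assumptions for illustration; this is not how scrapy-user-agents is implemented.

```python
import random
from types import SimpleNamespace


class RandomUserAgentSketch:
    """Hypothetical downloader middleware: picks a user agent per request."""

    # Example strings only; maintain your own up-to-date list.
    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
        "Mozilla/5.0 (X11; Linux x86_64; rv:125.0) Gecko/20100101 Firefox/125.0",
    ]

    def __init__(self, rng=None):
        self.rng = rng or random.Random()

    def process_request(self, request, spider):
        request.headers["User-Agent"] = self.rng.choice(self.USER_AGENTS)
        return None  # None tells Scrapy to keep processing this request


# Stand-in for a scrapy.Request so the sketch runs without Scrapy installed.
fake_request = SimpleNamespace(headers={})
RandomUserAgentSketch().process_request(fake_request, spider=None)
print(fake_request.headers["User-Agent"])
```

To use something like this in a project, you would register it in DOWNLOADER_MIDDLEWARES in place of the scrapy_user_agents entry shown above, keeping the built-in UserAgentMiddleware set to None.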
Fix 4: AsyncIO Reactor Errors
twisted.internet.error.ReactorAlreadyInstalledError:
reactor already installed
ValueError: The installed reactor (twisted.internet.epollreactor.EPollReactor)
does not match the requested one
Twisted installs its default reactor the first time twisted.internet.reactor is imported. If something else (asyncio code, scrapy-playwright) requires a different reactor after that, you get one of these errors.
Set the reactor explicitly in settings.py:
# settings.py
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
This is required when using:
- scrapy-playwright
- Async callbacks (async def parse) that await asyncio code
- Any library that uses asyncio internally
For asyncio in spiders:
import scrapy
import asyncio
class AsyncSpider(scrapy.Spider):
    name = 'async_spider'
    custom_settings = {
        'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor',
    }
    async def parse(self, response):
        # Can use await inside parse callbacks
        await asyncio.sleep(0.1)
        for item in response.css('div.product'):
            yield {'title': item.css('h2::text').get()}
The asyncio reactor is required whenever a callback awaits asyncio-based code such as asyncio.sleep, and it changes how Twisted Deferreds and asyncio futures interoperate.
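It may help to see why an event loop is involved at all. An async def function that contains yield is an asynchronous generator, and something has to drive it with async for. The sketch below reproduces that shape with asyncio alone, no Scrapy. Here parse takes a hypothetical list of titles instead of a response, and collect is an illustrative stand-in for the engine that consumes the callback.

```python
import asyncio


async def parse(titles):
    """Async generator with the same shape as an async Scrapy callback."""
    for title in titles:
        await asyncio.sleep(0)  # Any awaitable; yields control to the event loop
        yield {"title": title}


async def collect(agen):
    # An async generator can only be consumed with "async for".
    return [item async for item in agen]


items = asyncio.run(collect(parse(["Lamp", "Desk"])))
print(items)
```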
Fix 5: Item Pipeline Not Processing
Items are yielded but never reach your database. The pipeline isn’t enabled or has a silent error.
Check pipeline registration in settings.py:
# settings.py
ITEM_PIPELINES = {
    'myproject.pipelines.ValidationPipeline': 100,
    'myproject.pipelines.DatabasePipeline': 200,
    # Lower number = runs first
}
If ITEM_PIPELINES is empty, no pipelines run. If a path is wrong, Scrapy fails at startup with an import error instead of skipping the pipeline.
Verify pipeline is actually called:
# pipelines.py
class DatabasePipeline:
    def open_spider(self, spider):
        spider.logger.info("DatabasePipeline opened")
        # create_db_connection() stands in for your own database setup code
        self.connection = create_db_connection()
    def close_spider(self, spider):
        spider.logger.info("DatabasePipeline closed")
        self.connection.close()
    def process_item(self, item, spider):
        spider.logger.info(f"Processing item: {item}")
        try:
            self.connection.insert(item)
        except Exception as e:
            spider.logger.error(f"DB insert failed: {e}")
            raise  # Re-raise so Scrapy logs the error and later pipelines skip this item
        return item  # MUST return item for next pipeline
Common Mistake: Forgetting to return item in process_item. If you don’t return the item, the next pipeline receives None instead of the item, so later pipelines and feed exports have nothing to work with even though the spider appears to be scraping.
# WRONG: forgot to return
def process_item(self, item, spider):
    self.db.insert(item)
    # Missing: return item
# CORRECT
def process_item(self, item, spider):
    self.db.insert(item)
    return item
Drop items intentionally:
from scrapy.exceptions import DropItem
def process_item(self, item, spider):
    if not item.get('price'):
        raise DropItem(f"Missing price: {item}")
    return item
Dropped items are logged at WARNING level; check item_dropped_count in the spider stats.
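The return-value contract is easier to see in a toy model of the pipeline chain. The sketch below is a simplified imitation, not Scrapy's implementation: run_pipelines passes each pipeline's return value to the next one and stops on DropItem. The DropItem class here is a local stand-in for scrapy.exceptions.DropItem so the example runs without Scrapy, and all class names are hypothetical.

```python
class DropItem(Exception):
    """Local stand-in for scrapy.exceptions.DropItem."""


class RequirePrice:
    def process_item(self, item, spider):
        if not item.get("price"):
            raise DropItem("Missing price")
        return item


class ForgetfulPipeline:
    def process_item(self, item, spider):
        item["seen"] = True  # Does its work but forgets to return the item


class Recorder:
    """Final stage that records whatever the previous stage handed over."""

    def __init__(self):
        self.received = []

    def process_item(self, item, spider):
        self.received.append(item)
        return item


def run_pipelines(item, pipelines):
    for pipeline in pipelines:
        try:
            # Whatever one stage returns becomes the input of the next stage.
            item = pipeline.process_item(item, spider=None)
        except DropItem:
            return None  # Chain stops; later stages never see the item
    return item


ok, dropped, forgotten = Recorder(), Recorder(), Recorder()
run_pipelines({"price": "9.99"}, [RequirePrice(), ok])
run_pipelines({"title": "Lamp"}, [RequirePrice(), dropped])
run_pipelines({"price": "9.99"}, [ForgetfulPipeline(), forgotten])
print(ok.received, dropped.received, forgotten.received)
```

The third run is the silent failure: the last stage is still called, but with None.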
Fix 6: Pagination and Following Links
Scraping multiple pages requires yielding new requests in addition to items.
import scrapy
class ProductSpider(scrapy.Spider):
    name = 'products'
    start_urls = ['https://example.com/products?page=1']
    def parse(self, response):
        # Yield items
        for product in response.css('div.product'):
            yield {
                'title': product.css('h2::text').get(),
                'price': product.css('span.price::text').get(),
                'url': response.urljoin(product.css('a::attr(href)').get()),
            }
        # Follow pagination
        next_page = response.css('a.next-page::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
response.follow vs scrapy.Request:
# response.follow: handles relative URLs automatically
yield response.follow('/page/2', callback=self.parse)
# scrapy.Request: needs an absolute URL
yield scrapy.Request('https://example.com/page/2', callback=self.parse)
Multiple parse methods for different page types:
class ShopSpider(scrapy.Spider):
    name = 'shop'
    start_urls = ['https://example.com/categories']
    def parse(self, response):
        # Category index page: follow each category link
        for category in response.css('a.category::attr(href)').getall():
            yield response.follow(category, callback=self.parse_category)
    def parse_category(self, response):
        # Category page: follow each product link
        for product_url in response.css('a.product-link::attr(href)').getall():
            yield response.follow(product_url, callback=self.parse_product)
        # Pagination within category
        next_page = response.css('a.next::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse_category)
    def parse_product(self, response):
        # Product detail page: yield item
        yield {
            'title': response.css('h1::text').get(),
            'price': response.css('.price::text').get(),
            'description': response.css('.description::text').get(),
        }
Pass data between callbacks with cb_kwargs:
def parse_category(self, response):
    category_name = response.css('h1::text').get()
    for product_url in response.css('a.product::attr(href)').getall():
        yield response.follow(
            product_url,
            callback=self.parse_product,
            cb_kwargs={'category': category_name},  # Passed to parse_product
        )
def parse_product(self, response, category):
    yield {
        'title': response.css('h1::text').get(),
        'category': category,  # From the category page
    }
Fix 7: Throttling and Avoiding Bans
Scrapy can hit servers too hard, getting your IP banned. Configure throttling:
# settings.py
# Concurrent requests
CONCURRENT_REQUESTS = 16 # Total simultaneous requests (default)
CONCURRENT_REQUESTS_PER_DOMAIN = 8 # Per domain (default)
# Delay between requests
DOWNLOAD_DELAY = 1.0 # 1 second between requests
RANDOMIZE_DOWNLOAD_DELAY = True # Randomize: 0.5x–1.5x of DOWNLOAD_DELAY
# AutoThrottle — adapts based on server response time
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5
AUTOTHROTTLE_MAX_DELAY = 60
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0 # Target 1 request in flight per domain
# Retry settings
RETRY_ENABLED = True
RETRY_TIMES = 3
RETRY_HTTP_CODES = [500, 502, 503, 504, 408, 429]
Pro Tip: Enable AUTOTHROTTLE_ENABLED for production scrapers. It dynamically adjusts the request rate based on server response times, slowing down when the server is overloaded and speeding up when it’s responsive. This is far more polite (and less likely to get you banned) than hardcoded delays.
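To make the adjustment concrete, here is a simplified sketch of the delay calculation as the AutoThrottle documentation describes it: the target delay is latency divided by the target concurrency, the next delay is the average of the previous delay and that target, and non-200 responses are not allowed to lower the delay. The function name and the sample latencies are made up, and the real extension tracks this per download slot.

```python
def next_autothrottle_delay(prev_delay, latency, status,
                            target_concurrency=1.0,
                            min_delay=0.0, max_delay=60.0):
    """One AutoThrottle-style adjustment step (simplified sketch)."""
    target_delay = latency / target_concurrency
    # Move halfway from the previous delay toward the target...
    new_delay = (prev_delay + target_delay) / 2.0
    # ...but never below the target itself when the server is slow.
    new_delay = max(target_delay, new_delay)
    # Clamp between DOWNLOAD_DELAY (min) and AUTOTHROTTLE_MAX_DELAY (max).
    new_delay = min(max(min_delay, new_delay), max_delay)
    # Error responses are often fast; do not let them speed the crawl up.
    if status != 200 and new_delay <= prev_delay:
        return prev_delay
    return new_delay


delay = 5.0  # AUTOTHROTTLE_START_DELAY
history = []
for latency, status in [(1.0, 200), (1.0, 200), (0.2, 503), (10.0, 200)]:
    delay = next_autothrottle_delay(delay, latency, status)
    history.append(delay)
print(history)  # [3.0, 2.0, 2.0, 10.0]
```

The delay eases down while the server answers quickly, ignores the fast 503, and jumps as soon as a response takes 10 seconds.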
HTTP cache for development — avoid re-fetching the same pages:
# settings.py
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 3600 # 1 hour
HTTPCACHE_DIR = 'httpcache'
HTTPCACHE_IGNORE_HTTP_CODES = [500, 502, 503, 504]
Fix 8: Output Formats and Storage
Run a spider with output to a file:
# JSON
scrapy crawl products -o products.json
# JSON Lines (one JSON object per line — better for streaming)
scrapy crawl products -o products.jsonl
# CSV
scrapy crawl products -o products.csv
# XML
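scrapy crawl products -o products.xml
Output to S3 directly (this needs the AWS client library installed; quote the URI so the shell does not choke on the parentheses):
scrapy crawl products -o 's3://my-bucket/products-%(time)s.jsonl'
# settings.py
AWS_ACCESS_KEY_ID = 'YOUR_KEY'
AWS_SECRET_ACCESS_KEY = 'YOUR_SECRET'
Define structured Items so that a misspelled field name raises an error instead of passing silently: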
# items.py
import scrapy
class ProductItem(scrapy.Item):
    title = scrapy.Field()
    price = scrapy.Field()
    url = scrapy.Field()
    category = scrapy.Field()
# spider
from myproject.items import ProductItem
class ProductSpider(scrapy.Spider):
    name = 'products'
    def parse(self, response):
        for product in response.css('div.product'):
            item = ProductItem()
            item['title'] = product.css('h2::text').get()
            item['price'] = product.css('span.price::text').get()
            yield item
Or use ItemLoaders for cleaner extraction:
from scrapy.loader import ItemLoader
from itemloaders.processors import TakeFirst, MapCompose
class ProductLoader(ItemLoader):
    default_output_processor = TakeFirst()
    price_in = MapCompose(lambda x: x.replace('$', '').strip(), float)
# Spider
def parse(self, response):
    for product in response.css('div.product'):
        # Use the ProductLoader subclass so its processors actually apply
        loader = ProductLoader(item=ProductItem(), selector=product)
        loader.add_css('title', 'h2::text')
        loader.add_css('price', 'span.price::text')
        yield loader.load_item()
Still Not Working?
Debugging with scrapy parse
Test a single URL with your spider’s parse method without running the full crawl:
scrapy parse "https://example.com/products" --spider=products --callback=parse
This runs the URL through the chosen callback and shows the items and follow-up requests it would yield, which is invaluable for debugging selector issues. Add --pipelines to also pass the items through your item pipelines.
Saving Debug Information
class DebugSpider(scrapy.Spider):
    name = 'debug'
    def parse(self, response):
        # Save the response HTML for inspection
        with open('debug.html', 'wb') as f:
            f.write(response.body)
        # Or log the extracted <body> at DEBUG level
        self.logger.debug(response.css('body').get())
Scrapy vs Other Scrapers: Cross-Tool Comparison
Most “Scrapy not working” sessions end with a developer realizing Scrapy wasn’t the right tool. Knowing the alternatives saves the wasted hours.
Scrapy vs Playwright + BeautifulSoup. Scrapy is a full async crawl framework — concurrency, pipelines, middlewares, request scheduling. Playwright + BeautifulSoup is two libraries you compose yourself: Playwright fetches (handles JS), BeautifulSoup parses. The trade-off: Scrapy wins on scale (10k+ pages), Playwright + BS wins on simplicity (a few hundred pages, JS-heavy sites, one-off scrapes). For pure HTML parsing without browser rendering, see BeautifulSoup not working. If you only need browser automation, skip Scrapy entirely.
Scrapy vs Selenium. Selenium is browser automation, not crawling. It’s slower than Scrapy (one page at a time, real browser) but handles JS-heavy sites natively. Use Selenium only when the site fights back hard (Cloudflare with JS challenge, sites that detect headless browsers) and you can’t bypass it. For most JS-rendered sites, scrapy-playwright (covered in Fix 1) gives you Scrapy’s concurrency model with Playwright’s rendering — better than Selenium for nontrivial workloads. For Selenium-specific failures, see Selenium not working.
Scrapy vs Crawlee (JS/TS). Crawlee is Apify’s open-source crawler for Node.js. It has Cheerio, Puppeteer, and Playwright wrappers, plus built-in queue management, session pools, and proxy rotation. The pitch: “Scrapy but TypeScript.” Choose Crawlee if your team is JS-first or you need to share types between scraper and downstream API. Choose Scrapy if you’re already Python-first — the ecosystem (Item Loaders, exporters, stats) is deeper.
Scrapy vs Apify SDK. Apify SDK (Python or JS) is built on Crawlee with cloud-hosted infrastructure (proxy rotation, datasets, scheduling, key-value stores) baked in. If you don’t want to manage infrastructure for scrapers running daily, Apify removes that burden. The cost is vendor lock-in — your data lives in Apify’s storage by default. Scrapy gives you full control; Apify gives you less ops work.
Rate limiting and rotation comparison. This is where tools differ most:
| Tool | Throttle | Proxy rotation | UA rotation | Anti-bot bypass |
|---|---|---|---|---|
| Scrapy | AutoThrottle, DOWNLOAD_DELAY | scrapy-rotating-proxies | scrapy-user-agents | scrapy-playwright + custom middleware |
| Playwright + BS | Manual (asyncio.sleep) | Browser launch args | UA in launch options | Native (real browser) |
| Selenium | Manual sleep | Browser proxy config | UA in options | Native (real browser) |
| Crawlee | Built-in queue | Built-in session pool | Built-in fingerprinting | Built-in stealth plugins |
| Apify SDK | Built-in queue | Built-in (proxy plans) | Built-in fingerprinting | Built-in + Anti-Captcha |
Scrapy needs the most assembly for anti-bot — separate packages for rotation, custom middlewares for fingerprint randomization. Crawlee and Apify ship those features by default. For scrapers targeting sites with serious bot protection (Cloudflare, DataDome, PerimeterX), starting with Crawlee or Apify is often faster than retrofitting Scrapy.
Pro Tip: Match the tool to the bot-detection level, not the team’s familiarity. A team that knows Scrapy will still lose days bypassing Cloudflare on a site that Crawlee handles out of the box. Switching tools for a single project is cheap; debugging the wrong tool isn’t.
Spider Lifecycle Hooks
Spiders have lifecycle methods you can override to set up and tear down resources:
import time
import scrapy
from scrapy import signals
class ProductSpider(scrapy.Spider):
    name = 'products'
    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
        return spider
    def spider_opened(self, spider):
        self.logger.info(f"Spider {spider.name} starting")
        self.start_time = time.time()
    def spider_closed(self, spider, reason):
        elapsed = time.time() - self.start_time
        self.logger.info(f"Spider finished in {elapsed:.1f}s, reason: {reason}")
This is the right place to open database connections, write to log files, or send metrics, rather than inside parse, which runs once per page.
Running Multiple Spiders
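# Run all spiders sequentially (shell)
scrapy crawl spider1 && scrapy crawl spider2
# Or programmatically with CrawlerProcess (parallel, Python)
from scrapy.crawler import CrawlerProcess
from myproject.spiders import Spider1, Spider2  # Hypothetical import path for your spider classes
# FEEDS replaces the deprecated FEED_FORMAT / FEED_URI settings;
# %(name)s gives each spider its own output file
process = CrawlerProcess(settings={'FEEDS': {'%(name)s.json': {'format': 'json'}}})
process.crawl(Spider1)
process.crawl(Spider2)
process.start()  # Blocks until all spiders finish
Scheduling with Airflow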
For scheduled scraping pipelines that integrate with downstream data warehouses, schedule Scrapy spiders with Apache Airflow using BashOperator to run scrapy crawl. For Airflow DAG patterns and scheduler issues, see Airflow not working.
Proxy Rotation Without scrapy-rotating-proxies
If you want lightweight proxy rotation without another dependency, write a downloader middleware:
# middlewares.py
import random
class ProxyRotateMiddleware:
    MAX_PROXY_RETRIES = 3
    def __init__(self):
        # Placeholder proxy URLs; replace with your own
        self.proxies = [
            "http://user:password@proxy1.example.com:8080",
            "http://user:password@proxy2.example.com:8080",
            "http://user:password@proxy3.example.com:8080",
        ]
    def process_request(self, request, spider):
        request.meta["proxy"] = random.choice(self.proxies)
    def process_response(self, request, response, spider):
        retries = request.meta.get("proxy_retries", 0)
        if response.status in (403, 429) and retries < self.MAX_PROXY_RETRIES:
            # Retry the request; process_request picks a proxy again (possibly the same one).
            # dont_filter keeps the dupefilter from discarding the repeated URL.
            retry = request.replace(dont_filter=True)
            retry.meta["proxy_retries"] = retries + 1
            return retry
        return response
# settings.py
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.ProxyRotateMiddleware": 350,
}
This is enough for a few dozen proxies and simple random rotation. Past that, the dedicated package handles dead-proxy detection and weighted selection, so it is worth installing.
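For a sense of what dead-proxy detection and weighted selection involve, here is a minimal standalone sketch. It is not how scrapy-rotating-proxies works internally. The ProxyPool class, its failure threshold, and the proxy addresses are all assumptions for illustration.

```python
import random


class ProxyPool:
    """Hypothetical pool that retires proxies after repeated failures."""

    def __init__(self, proxies, max_failures=3, rng=None):
        self.failures = {proxy: 0 for proxy in proxies}
        self.max_failures = max_failures
        self.rng = rng or random.Random()

    def alive(self):
        return [p for p, n in self.failures.items() if n < self.max_failures]

    def pick(self):
        candidates = self.alive()
        if not candidates:
            raise RuntimeError("all proxies are marked dead")
        # Proxies with fewer recorded failures get a proportionally higher weight.
        weights = [self.max_failures - self.failures[p] for p in candidates]
        return self.rng.choices(candidates, weights=weights, k=1)[0]

    def report_failure(self, proxy):
        self.failures[proxy] += 1

    def report_success(self, proxy):
        self.failures[proxy] = 0  # A good response clears the failure count


pool = ProxyPool(["http://proxy-a.example:8080", "http://proxy-b.example:8080"])
for _ in range(3):
    pool.report_failure("http://proxy-a.example:8080")
survivor = pool.pick()
print(pool.alive(), survivor)
```

In a middleware, pick would be called from process_request, and report_failure or report_success from process_response based on the status code.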
Detecting Honeypot / Layout Changes Early
Sites change CSS classes regularly, silently breaking selectors. Catch breakages before items dry up:
import time
import scrapy
class ProductSpider(scrapy.Spider):
    name = "products"
    def parse(self, response):
        products = response.css("div.product")
        if not products:
            self.logger.error(
                f"NO PRODUCTS on {response.url}: "
                f"selector 'div.product' returned empty. "
                f"Body length: {len(response.body)}"
            )
            # Optionally save the HTML for offline inspection
            with open(f"debug_{int(time.time())}.html", "wb") as f:
                f.write(response.body)
            return
        for product in products:
            yield {"title": product.css("h2::text").get()}
To turn this into a loud failure instead of a silent zero-item run, raise scrapy.exceptions.CloseSpider from the callback in place of the bare return; the reason you pass shows up as finish_reason in the stats.
Persisting Crawl State Across Runs
For resumable crawls (laptops, spot instances, daily incrementals), enable Scrapy’s state persistence:
scrapy crawl products -s JOBDIR=crawls/products-1
Scrapy writes the pending request queue and dupefilter to crawls/products-1/. Stop the spider (press Ctrl-C once and let it shut down cleanly) and restart with the same command, and it picks up where it left off. Combine with HTTPCACHE_ENABLED = True and you can iterate on parsing logic without re-fetching pages every run.
Solo developer based in Japan. Every solution is cross-referenced with official documentation and tested before publishing.