Fix: Scrapy Not Working — Spider Crawl Returns Nothing, Robots.txt Blocked, and Pipeline Errors
Quick Answer
How to fix Scrapy errors — spider yields no items, robots.txt blocking all requests, 403 forbidden response, AttributeError on response.css, item pipeline not processing, AsyncIO reactor errors, and middleware not running.
The Error
You run a spider and it finishes with zero items:
```
2025-04-09 14:22:01 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'item_scraped_count': 0,
 'log_count/INFO': 12,
 'response_received_count': 1,
 'finish_reason': 'finished'}
```

Or every page returns 403:

```
2025-04-09 14:22:01 [scrapy.downloadermiddlewares.robotstxt] DEBUG:
Forbidden by robots.txt: <GET https://example.com/products>
```

Or the spider crawls successfully but no items reach the database — the pipeline silently drops everything:

```python
class MyPipeline:
    def process_item(self, item, spider):
        # Items pass through but the DB never updates
        return item
```

Or you upgrade Scrapy and get reactor errors:

```
twisted.internet.error.ReactorAlreadyInstalledError:
reactor already installed
```

Scrapy is a full asynchronous framework — not just a library. It manages spiders, middlewares, item pipelines, and a Twisted-based reactor. When something breaks, the failure can be in the spider logic, the settings, the middleware chain, or the underlying async runtime. This guide isolates each layer.
Why This Happens
Scrapy uses Twisted as its async engine — older than asyncio and with its own reactor concept. Modern Scrapy (2.0+) supports asyncio-compatible code, but mixing them incorrectly produces reactor errors. Spiders are class-based and follow a strict lifecycle: start_requests → parse callbacks → yielded items → pipelines. If the chain breaks at any point, items don’t reach storage.
The most common silent failure is robots.txt blocking requests by default. Scrapy is well-behaved out of the box — it respects robots.txt, throttles requests, and identifies itself with the Scrapy/2.x.x user agent. All three behaviors can prevent legitimate scraping.
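These behaviors are plain settings you can audit in one place. A sketch of the relevant lines in a freshly generated project's settings.py (Fix 2 and Fix 3 below cover when and how to change them):

```python
# settings.py — defaults that most often block legitimate scraping.
# A project generated by `scrapy startproject` ships with:
ROBOTSTXT_OBEY = True   # disallowed URLs are skipped silently

# Not set by default — Scrapy then identifies itself as
# "Scrapy/2.x (+https://scrapy.org)", which many sites block outright:
# USER_AGENT = '...'

# No delay between requests unless you add one:
# DOWNLOAD_DELAY = 1.0
```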
Fix 1: Spider Yields No Items
```
{'item_scraped_count': 0, 'response_received_count': 5}
```

Pages were fetched but no items were yielded. Diagnose by checking each step.
Step 1: Print the response in parse to confirm the page loaded:
```python
import scrapy

class ProductSpider(scrapy.Spider):
    name = 'products'
    start_urls = ['https://example.com/products']

    def parse(self, response):
        # Verify the page loaded
        self.logger.info(f"URL: {response.url}, status: {response.status}")
        self.logger.info(f"Body length: {len(response.body)}")
        self.logger.info(f"First 500 chars: {response.text[:500]}")
        # Then attempt extraction
        for product in response.css('div.product'):
            yield {
                'title': product.css('h2::text').get(),
                'price': product.css('span.price::text').get(),
            }
```

Step 2: Verify selectors in scrapy shell interactively:
```bash
scrapy shell "https://example.com/products"
```

```python
# In the shell:
>>> response.status
200
>>> response.css('div.product').get()      # First match
>>> response.css('div.product').getall()   # All matches
>>> response.xpath('//div[@class="product"]').getall()
```

If the shell returns nothing, the selectors are wrong. If they work in the shell but not in the spider, you have a code bug in the spider.
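If the two disagree, save the spider's response body to disk (see Saving Debug Information below) and inspect it offline. A stdlib-only sketch that counts how many `<div class="product">` elements a saved page actually contains (the tag and class names follow this guide's running example):

```python
from html.parser import HTMLParser

class ClassCounter(HTMLParser):
    """Count tags carrying a given class — a quick sanity check on saved HTML."""
    def __init__(self, tag, cls):
        super().__init__()
        self.tag, self.cls, self.count = tag, cls, 0

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get('class') or '').split()
        if tag == self.tag and self.cls in classes:
            self.count += 1

html = """
<div class="product"><h2>Widget</h2></div>
<div class="product sale"><h2>Gadget</h2></div>
<div class="promo"><h2>Ad</h2></div>
"""
counter = ClassCounter('div', 'product')
counter.feed(html)
print(counter.count)  # 2 — both product divs are present in the raw HTML
```

If the count is zero on the saved body but nonzero in the shell, the server returned different HTML to the spider (blocking, redirects, or JS rendering).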
Step 3: Check for JavaScript-rendered content. View response.text in the shell. If the content you want isn’t there, the page uses JavaScript:
```bash
scrapy shell "https://example.com/products"
```

```python
>>> 'class="product"' in response.text  # False = markup not in raw HTML, so it is JS-rendered
```

For JS-rendered sites, integrate Playwright with Scrapy:
```bash
pip install scrapy-playwright
playwright install chromium
```

```python
# settings.py
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
PLAYWRIGHT_BROWSER_TYPE = "chromium"
```

```python
import scrapy

class ProductSpider(scrapy.Spider):
    name = 'products'

    def start_requests(self):
        yield scrapy.Request(
            'https://example.com/products',
            meta={'playwright': True, 'playwright_include_page': True},
        )

    async def parse(self, response):
        page = response.meta['playwright_page']
        await page.wait_for_selector('div.product')
        await page.close()
        for product in response.css('div.product'):
            yield {'title': product.css('h2::text').get()}
```

For Playwright-specific configuration, see Playwright not working.
Fix 2: Forbidden by robots.txt
```
[scrapy.downloadermiddlewares.robotstxt] DEBUG:
Forbidden by robots.txt: <GET https://example.com/products>
```

Scrapy respects robots.txt by default. If a site disallows crawling certain paths, Scrapy skips them.
Check what robots.txt says:
```bash
curl https://example.com/robots.txt
# User-agent: *
# Disallow: /products
```

Disable robots.txt enforcement (only for sites you have permission to scrape):

```python
# settings.py
ROBOTSTXT_OBEY = False
```

```python
# Or per-spider
class ProductSpider(scrapy.Spider):
    name = 'products'
    custom_settings = {
        'ROBOTSTXT_OBEY': False,
    }
```

Note: Ignoring robots.txt may violate a site's terms of service and could be illegal in some jurisdictions. Only do this on sites you own, have explicit permission to scrape, or where the data is clearly intended for public access.
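To check programmatically which paths a robots.txt blocks (useful before deciding whether a spider can do its job at all), the standard library ships a parser. A sketch with an inline robots.txt matching the example above:

```python
from urllib.robotparser import RobotFileParser

# Parse a robots.txt body directly; normally you would fetch it
# with rp.set_url("https://example.com/robots.txt") and rp.read()
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /products",
])

print(rp.can_fetch("*", "https://example.com/products"))  # False — blocked
print(rp.can_fetch("*", "https://example.com/about"))     # True — allowed
```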
Fix 3: 403 Forbidden — Server Blocks Default User-Agent
Many sites block Scrapy’s default user-agent (Scrapy/2.x.x (+https://scrapy.org)):
```
2025-04-09 14:22:01 [scrapy.spidermiddlewares.httperror] INFO:
Ignoring response <403 https://example.com/>: HTTP status code is not handled or not allowed
```

Set a realistic user-agent:
```python
# settings.py
USER_AGENT = (
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
    'AppleWebKit/537.36 (KHTML, like Gecko) '
    'Chrome/124.0.0.0 Safari/537.36'
)
```

Add common headers:

```python
# settings.py
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.9',
    'Accept-Encoding': 'gzip, deflate, br',
}
```

Rotate user-agents with scrapy-user-agents:

```bash
pip install scrapy-user-agents
```

```python
# settings.py
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_user_agents.middlewares.RandomUserAgentMiddleware': 400,
}
```

For sites with serious bot protection (Cloudflare, DataDome), Scrapy alone isn't enough. Use scrapy-playwright (a real browser) or a service like ScraperAPI.
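If you'd rather rotate user-agents without an extra dependency, a downloader middleware is only a few lines. A sketch (the agent strings are placeholders; the class itself needs no Scrapy import, so it can be tested standalone):

```python
import random

USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/124.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 Chrome/124.0.0.0 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64; rv:125.0) Gecko/20100101 Firefox/125.0',
]

class RotateUserAgentMiddleware:
    """Downloader-middleware sketch: pick a fresh user-agent per request."""
    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(USER_AGENTS)
        return None  # None tells Scrapy to continue normal processing
```

Enable it the same way as the pip package, e.g. 'myproject.middlewares.RotateUserAgentMiddleware': 400 in DOWNLOADER_MIDDLEWARES (the module path depends on your project layout).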
Fix 4: AsyncIO Reactor Errors
```
twisted.internet.error.ReactorAlreadyInstalledError:
reactor already installed

ValueError: The installed reactor (twisted.internet.epollreactor.EPollReactor)
does not match the requested one
```

Scrapy installs Twisted's default reactor at import time. If something else (asyncio, scrapy-playwright) requires a different reactor, you get this error.

Set the reactor explicitly in settings.py:

```python
# settings.py
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
```

This is required when using:

- scrapy-playwright
- Async callbacks (async def parse)
- Any library that uses asyncio internally
For asyncio in spiders:
```python
import asyncio

import scrapy

class AsyncSpider(scrapy.Spider):
    name = 'async_spider'
    custom_settings = {
        'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor',
    }

    async def parse(self, response):
        # Can use await inside parse callbacks
        await asyncio.sleep(0.1)
        for item in response.css('div.product'):
            yield {'title': item.css('h2::text').get()}
```

For asyncio event loop patterns and how they interact with Scrapy's reactor, see Python asyncio not running.
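Once the asyncio reactor is active, ordinary asyncio patterns work inside callbacks, including running several awaits concurrently. A standalone sketch of the gather pattern (enrich is a stand-in for real async work such as an API call per item; asyncio.run exists here only to make the sketch runnable — in a spider the reactor owns the loop and you simply await inside parse):

```python
import asyncio

async def enrich(title):
    # Stand-in for real async work (e.g. an API call per scraped item)
    await asyncio.sleep(0.01)
    return {'title': title, 'enriched': True}

async def parse_like(titles):
    # Inside an async def parse you can gather awaits the same way
    return await asyncio.gather(*(enrich(t) for t in titles))

items = asyncio.run(parse_like(['Widget', 'Gadget']))
print(items)
```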
Fix 5: Item Pipeline Not Processing
Items are yielded but never reach your database. The pipeline isn’t enabled or has a silent error.
Check pipeline registration in settings.py:
```python
# settings.py
ITEM_PIPELINES = {
    'myproject.pipelines.ValidationPipeline': 100,
    'myproject.pipelines.DatabasePipeline': 200,
    # Lower number = runs first
}
```

If ITEM_PIPELINES is empty or a path is wrong, no pipelines run.
Verify pipeline is actually called:
```python
# pipelines.py
class DatabasePipeline:
    def open_spider(self, spider):
        spider.logger.info("DatabasePipeline opened")
        self.connection = create_db_connection()  # your DB helper

    def close_spider(self, spider):
        spider.logger.info("DatabasePipeline closed")
        self.connection.close()

    def process_item(self, item, spider):
        spider.logger.info(f"Processing item: {item}")
        try:
            self.connection.insert(item)
        except Exception as e:
            spider.logger.error(f"DB insert failed: {e}")
            raise  # Re-raise so the failure is visible, not swallowed
        return item  # MUST return the item for the next pipeline
```

Common Mistake: Forgetting to return item in process_item. If you don't return the item, the next pipeline never receives it — and the spider stats will show items as "scraped" but not "stored."
```python
# WRONG — forgot to return
def process_item(self, item, spider):
    self.db.insert(item)
    # Missing: return item

# CORRECT
def process_item(self, item, spider):
    self.db.insert(item)
    return item
```

Drop items intentionally:
```python
from scrapy.exceptions import DropItem

def process_item(self, item, spider):
    if not item.get('price'):
        raise DropItem(f"Missing price: {item}")
    return item
```

Dropped items show up in the log — check item_dropped_count in the spider stats.
Fix 6: Pagination and Following Links
Scraping multiple pages requires yielding new requests in addition to items.
```python
import scrapy

class ProductSpider(scrapy.Spider):
    name = 'products'
    start_urls = ['https://example.com/products?page=1']

    def parse(self, response):
        # Yield items
        for product in response.css('div.product'):
            yield {
                'title': product.css('h2::text').get(),
                'price': product.css('span.price::text').get(),
                'url': response.urljoin(product.css('a::attr(href)').get()),
            }
        # Follow pagination
        next_page = response.css('a.next-page::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```

response.follow vs scrapy.Request:

```python
# response.follow — handles relative URLs automatically
yield response.follow('/page/2', callback=self.parse)

# scrapy.Request — must construct the full URL
yield scrapy.Request('https://example.com/page/2', callback=self.parse)
```

Multiple parse methods for different page types:
```python
class ShopSpider(scrapy.Spider):
    name = 'shop'
    start_urls = ['https://example.com/categories']

    def parse(self, response):
        # Category listing — follow each category link
        for category in response.css('a.category::attr(href)').getall():
            yield response.follow(category, callback=self.parse_category)

    def parse_category(self, response):
        # Category page — follow each product link
        for product_url in response.css('a.product-link::attr(href)').getall():
            yield response.follow(product_url, callback=self.parse_product)
        # Pagination within the category
        next_page = response.css('a.next::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse_category)

    def parse_product(self, response):
        # Product detail page — yield the item
        yield {
            'title': response.css('h1::text').get(),
            'price': response.css('.price::text').get(),
            'description': response.css('.description::text').get(),
        }
```

Pass data between callbacks with cb_kwargs:
```python
def parse_category(self, response):
    category_name = response.css('h1::text').get()
    for product_url in response.css('a.product::attr(href)').getall():
        yield response.follow(
            product_url,
            callback=self.parse_product,
            cb_kwargs={'category': category_name},  # Passed to parse_product
        )

def parse_product(self, response, category):
    yield {
        'title': response.css('h1::text').get(),
        'category': category,  # From the category page
    }
```

Fix 7: Throttling and Avoiding Bans
Scrapy can hit servers too hard, getting your IP banned. Configure throttling:
```python
# settings.py

# Concurrent requests
CONCURRENT_REQUESTS = 16             # Total simultaneous requests (default)
CONCURRENT_REQUESTS_PER_DOMAIN = 8   # Per domain (default)

# Delay between requests
DOWNLOAD_DELAY = 1.0                 # 1 second between requests
RANDOMIZE_DOWNLOAD_DELAY = True      # Randomize: 0.5x–1.5x of DOWNLOAD_DELAY

# AutoThrottle — adapts based on server response time
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5
AUTOTHROTTLE_MAX_DELAY = 60
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # Target 1 request in flight per domain

# Retry settings
RETRY_ENABLED = True
RETRY_TIMES = 3
RETRY_HTTP_CODES = [500, 502, 503, 504, 408, 429]
```

Pro Tip: Always enable AUTOTHROTTLE_ENABLED for production scrapers. It dynamically adjusts the request rate based on server response times — slowing down when the server is overloaded and speeding up when it's responsive. This is far more polite (and less likely to get you banned) than hardcoded delays.
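As a rough model of RANDOMIZE_DOWNLOAD_DELAY: each request waits a uniform random multiple between 0.5x and 1.5x of DOWNLOAD_DELAY, which breaks up the fixed-interval fingerprint that bot detectors look for. A sketch of that behavior (not Scrapy's actual implementation):

```python
import random

DOWNLOAD_DELAY = 1.0

def next_delay(rng=random):
    # Roughly what RANDOMIZE_DOWNLOAD_DELAY does: uniform in [0.5x, 1.5x]
    return rng.uniform(0.5 * DOWNLOAD_DELAY, 1.5 * DOWNLOAD_DELAY)

delays = [next_delay() for _ in range(5)]
print(delays)  # five values between 0.5 and 1.5 seconds
```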
HTTP cache for development — avoid re-fetching the same pages:
```python
# settings.py
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 3600  # 1 hour
HTTPCACHE_DIR = 'httpcache'
HTTPCACHE_IGNORE_HTTP_CODES = [500, 502, 503, 504]
```

Fix 8: Output Formats and Storage
Run a spider with output to a file:
```bash
# JSON
scrapy crawl products -o products.json

# JSON Lines (one JSON object per line — better for streaming)
scrapy crawl products -o products.jsonl

# CSV
scrapy crawl products -o products.csv

# XML
scrapy crawl products -o products.xml
```

Output to S3 directly:

```bash
scrapy crawl products -o s3://my-bucket/products-%(time)s.jsonl
```

```python
# settings.py
AWS_ACCESS_KEY_ID = 'YOUR_KEY'
AWS_SECRET_ACCESS_KEY = 'YOUR_SECRET'
```

Define structured Items for type safety:
```python
# items.py
import scrapy

class ProductItem(scrapy.Item):
    title = scrapy.Field()
    price = scrapy.Field()
    url = scrapy.Field()
    category = scrapy.Field()
```

```python
# spider
from myproject.items import ProductItem

class ProductSpider(scrapy.Spider):
    name = 'products'

    def parse(self, response):
        for product in response.css('div.product'):
            item = ProductItem()
            item['title'] = product.css('h2::text').get()
            item['price'] = product.css('span.price::text').get()
            yield item
```

Or use ItemLoaders for cleaner extraction:
```python
from scrapy.loader import ItemLoader
from itemloaders.processors import TakeFirst, MapCompose

class ProductLoader(ItemLoader):
    default_output_processor = TakeFirst()
    price_in = MapCompose(lambda x: x.replace('$', '').strip(), float)

# Spider — use the subclass, or its processors never apply
def parse(self, response):
    for product in response.css('div.product'):
        loader = ProductLoader(item=ProductItem(), selector=product)
        loader.add_css('title', 'h2::text')
        loader.add_css('price', 'span.price::text')
        yield loader.load_item()
```

Still Not Working?
Debugging with scrapy parse
Test a single URL with your spider’s parse method without running the full crawl:
```bash
scrapy parse "https://example.com/products" --spider=products --callback=parse
```

This runs the URL through the named callback and prints the items and requests it would yield — invaluable for debugging selector issues.
Saving Debug Information
```python
class DebugSpider(scrapy.Spider):
    name = 'debug'

    def parse(self, response):
        # Save the response HTML for inspection
        with open('debug.html', 'wb') as f:
            f.write(response.body)
        # Or use Scrapy's built-in logging
        self.logger.debug(response.css('body').get())
```

Comparing with BeautifulSoup
For one-off scraping or when you don’t need Scrapy’s framework features (concurrency, pipelines, middlewares), BeautifulSoup is simpler. See BeautifulSoup not working for parser selection and find_all patterns. For browser-based scraping when JavaScript rendering is required, see Selenium not working.
Spider Lifecycle Hooks
Spiders have lifecycle methods you can override to set up and tear down resources:
```python
import time

import scrapy
from scrapy import signals

class ProductSpider(scrapy.Spider):
    name = 'products'

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
        return spider

    def spider_opened(self, spider):
        self.logger.info(f"Spider {spider.name} starting")
        self.start_time = time.time()

    def spider_closed(self, spider, reason):
        elapsed = time.time() - self.start_time
        self.logger.info(f"Spider finished in {elapsed:.1f}s, reason: {reason}")
```

This is the right place to open database connections, write to log files, or send metrics — not inside parse, which runs once per page.
Running Multiple Spiders
```bash
# Run all spiders sequentially
scrapy crawl spider1 && scrapy crawl spider2
```

```python
# Or programmatically with CrawlerProcess (runs them in parallel)
from scrapy.crawler import CrawlerProcess

process = CrawlerProcess(settings={
    'FEEDS': {'output.json': {'format': 'json'}},  # FEED_URI/FEED_FORMAT are deprecated
})
process.crawl(Spider1)
process.crawl(Spider2)
process.start()  # Blocks until all spiders finish
```

Scheduling with Airflow
For scheduled scraping pipelines that integrate with downstream data warehouses, schedule Scrapy spiders with Apache Airflow using BashOperator to run scrapy crawl. For Airflow DAG patterns and scheduler issues, see Airflow not working.
Solo developer based in Japan. Every solution is cross-referenced with official documentation and tested before publishing.