
Fix: Scrapy Not Working — Spider Crawl Returns Nothing, Robots.txt Blocked, and Pipeline Errors

FixDevs

Quick Answer

How to fix Scrapy errors — spider yields no items, robots.txt blocking all requests, 403 forbidden response, AttributeError on response.css, item pipeline not processing, AsyncIO reactor errors, and middleware not running.

The Error

You run a spider and it finishes with zero items:

2025-04-09 14:22:01 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'item_scraped_count': 0,
 'log_count/INFO': 12,
 'response_received_count': 1,
 'finish_reason': 'finished'}

Or every page returns 403:

2025-04-09 14:22:01 [scrapy.downloadermiddlewares.robotstxt] DEBUG:
Forbidden by robots.txt: <GET https://example.com/products>

Or the spider crawls successfully but no items reach the database — the pipeline silently drops everything:

class MyPipeline:
    def process_item(self, item, spider):
        # Items pass through but DB never updates
        return item

Or you upgrade Scrapy and get reactor errors:

twisted.internet.error.ReactorAlreadyInstalledError:
reactor already installed

Scrapy is a full asynchronous framework — not just a library. It manages spiders, middlewares, item pipelines, and a Twisted-based reactor. When something breaks, the failure can be in the spider logic, the settings, the middleware chain, or the underlying async runtime. This guide isolates each layer.

Why This Happens

Scrapy uses Twisted as its async engine — older than asyncio and with its own reactor concept. Modern Scrapy (2.0+) supports asyncio-compatible code, but mixing them incorrectly produces reactor errors. Spiders are class-based and follow a strict lifecycle: start_requests → parse callbacks → yielded items → pipelines. If the chain breaks at any point, items don’t reach storage.

The most common silent failure is robots.txt blocking requests by default. Scrapy is well-behaved out of the box — it respects robots.txt, limits per-domain concurrency, and identifies itself with the Scrapy/2.x.x user agent. The first and the last of these can stop a legitimate crawl dead: robots.txt rules skip requests entirely, and many servers block the telltale user agent.

Fix 1: Spider Yields No Items

{'item_scraped_count': 0, 'response_received_count': 5}

Pages were fetched but no items were yielded. Diagnose by checking each step.

Step 1: Log the response in parse to confirm the page loaded:

import scrapy

class ProductSpider(scrapy.Spider):
    name = 'products'
    start_urls = ['https://example.com/products']

    def parse(self, response):
        # Verify the page loaded
        self.logger.info(f"URL: {response.url}, status: {response.status}")
        self.logger.info(f"Body length: {len(response.body)}")
        self.logger.info(f"First 500 chars: {response.text[:500]}")

        # Then attempt extraction
        for product in response.css('div.product'):
            yield {
                'title': product.css('h2::text').get(),
                'price': product.css('span.price::text').get(),
            }

Step 2: Verify selectors in scrapy shell interactively:

scrapy shell "https://example.com/products"

# In the shell:
>>> response.status
200
>>> response.css('div.product').get()   # First match
>>> response.css('div.product').getall()   # All matches
>>> response.xpath('//div[@class="product"]').getall()

If the shell returns nothing, the selectors are wrong. If they work in the shell but not in the spider, you have a code bug in the spider.
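You can also sanity-check a saved page offline, with no network round-trip. This standard-library sketch (the HTML is made up, standing in for a response you saved to disk) confirms whether elements with the class your selector targets exist at all:

```python
from html.parser import HTMLParser

class ClassCounter(HTMLParser):
    """Count elements carrying a given CSS class in saved HTML."""
    def __init__(self, cls):
        super().__init__()
        self.cls = cls
        self.count = 0

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get('class') or '').split()
        if self.cls in classes:
            self.count += 1

# Made-up HTML standing in for a saved response body
html = (
    '<div class="product"><h2>A</h2></div>'
    '<div class="product sale"><h2>B</h2></div>'
    '<div class="banner">ad</div>'
)
counter = ClassCounter('product')
counter.feed(html)
print(counter.count)   # 2
```

If the count is zero on the saved HTML, your selector targets markup that isn't in the raw response, and no amount of spider debugging will help.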

Step 3: Check for JavaScript-rendered content. View response.text in the shell. If the content you want isn’t there, the page uses JavaScript:

scrapy shell "https://example.com/products"
>>> 'class="product"' in response.text   # False = markup not in raw HTML, JS-rendered

For JS-rendered sites, integrate Playwright with Scrapy:

pip install scrapy-playwright
playwright install chromium
# settings.py
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
PLAYWRIGHT_BROWSER_TYPE = "chromium"

# spider
import scrapy
from scrapy_playwright.page import PageMethod

class ProductSpider(scrapy.Spider):
    name = 'products'

    def start_requests(self):
        yield scrapy.Request(
            'https://example.com/products',
            meta={
                'playwright': True,
                # Wait for the JS-rendered markup before the response is captured
                'playwright_page_methods': [
                    PageMethod('wait_for_selector', 'div.product'),
                ],
            },
        )

    def parse(self, response):
        for product in response.css('div.product'):
            yield {'title': product.css('h2::text').get()}

For Playwright-specific configuration, see Playwright not working.

Fix 2: Forbidden by robots.txt

[scrapy.downloadermiddlewares.robotstxt] DEBUG:
Forbidden by robots.txt: <GET https://example.com/products>

Scrapy respects robots.txt by default. If a site disallows crawling certain paths, Scrapy skips them.

Check what robots.txt says:

curl https://example.com/robots.txt
# User-agent: *
# Disallow: /products

Disable robots.txt enforcement (only for sites you have permission to scrape):

# settings.py
ROBOTSTXT_OBEY = False
# Or per-spider
class ProductSpider(scrapy.Spider):
    name = 'products'
    custom_settings = {
        'ROBOTSTXT_OBEY': False,
    }

Note: Ignoring robots.txt may violate a site’s terms of service and could be illegal in some jurisdictions. Only do this on sites you own, have explicit permission to scrape, or where the data is clearly intended for public access.
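Before deciding anything, you can check programmatically which paths a robots.txt blocks using the standard library's urllib.robotparser. A sketch against the hypothetical robots.txt from the curl output above:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt matching the curl output above
ROBOTS_TXT = """\
User-agent: *
Disallow: /products
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

print(rp.can_fetch("*", "https://example.com/products"))      # False
print(rp.can_fetch("*", "https://example.com/products/42"))   # False (prefix match)
print(rp.can_fetch("*", "https://example.com/about"))         # True
```

This mirrors what Scrapy's RobotsTxtMiddleware decides per request, so it is a quick way to confirm whether robots.txt is really what is blocking a given URL.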

Fix 3: 403 Forbidden — Server Blocks Default User-Agent

Many sites block Scrapy’s default user-agent (Scrapy/2.x.x (+https://scrapy.org)):

2025-04-09 14:22:01 [scrapy.spidermiddlewares.httperror] INFO:
Ignoring response <403 https://example.com/>: HTTP status code is not handled or not allowed

Set a realistic user-agent:

# settings.py
USER_AGENT = (
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
    'AppleWebKit/537.36 (KHTML, like Gecko) '
    'Chrome/124.0.0.0 Safari/537.36'
)

Add common headers:

# settings.py
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.9',
    'Accept-Encoding': 'gzip, deflate, br',
}

Rotate user-agents with scrapy-user-agents:

pip install scrapy-user-agents
# settings.py
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_user_agents.middlewares.RandomUserAgentMiddleware': 400,
}

For sites with serious bot protection (Cloudflare, DataDome), Scrapy alone isn’t enough. Use scrapy-playwright (real browser) or a service like ScraperAPI.

Fix 4: AsyncIO Reactor Errors

twisted.internet.error.ReactorAlreadyInstalledError:
reactor already installed
ValueError: The installed reactor (twisted.internet.epollreactor.EPollReactor)
does not match the requested one

Scrapy installs Twisted’s default reactor at import time. If something else (asyncio, scrapy-playwright) requires a different reactor, you get this error.

Set the reactor explicitly in settings.py:

# settings.py
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

This is required when using:

  • scrapy-playwright
  • Async callbacks (async def parse)
  • Any library that uses asyncio internally

For asyncio in spiders:

import scrapy
import asyncio

class AsyncSpider(scrapy.Spider):
    name = 'async_spider'
    custom_settings = {
        'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor',
    }

    async def parse(self, response):
        # Can use await inside parse callbacks
        await asyncio.sleep(0.1)

        for item in response.css('div.product'):
            yield {'title': item.css('h2::text').get()}

For asyncio event loop patterns and how they interact with Scrapy’s reactor, see Python asyncio not running.

Fix 5: Item Pipeline Not Processing

Items are yielded but never reach your database. The pipeline isn’t enabled or has a silent error.

Check pipeline registration in settings.py:

# settings.py
ITEM_PIPELINES = {
    'myproject.pipelines.ValidationPipeline': 100,
    'myproject.pipelines.DatabasePipeline': 200,
    # Lower number = runs first
}

If ITEM_PIPELINES is empty or the path is wrong, no pipelines run.

Verify pipeline is actually called:

# pipelines.py
import logging

class DatabasePipeline:
    def open_spider(self, spider):
        spider.logger.info("DatabasePipeline opened")
        self.connection = create_db_connection()

    def close_spider(self, spider):
        spider.logger.info("DatabasePipeline closed")
        self.connection.close()

    def process_item(self, item, spider):
        spider.logger.info(f"Processing item: {item}")
        try:
            self.connection.insert(item)
        except Exception as e:
            spider.logger.error(f"DB insert failed: {e}")
            raise   # Re-raise so the failure is visible in the logs instead of vanishing
        return item   # MUST return item for next pipeline

Common Mistake: Forgetting to return item in process_item. If you don’t return the item, the next pipeline (and the feed exporter after it) receives None instead of the item, so items appear “scraped” in the stats but nothing reaches storage.

# WRONG — forgot to return
def process_item(self, item, spider):
    self.db.insert(item)
    # Missing: return item

# CORRECT
def process_item(self, item, spider):
    self.db.insert(item)
    return item
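To see why the missing return matters, here is a pure-Python sketch of how Scrapy chains pipelines: each pipeline's return value becomes the next pipeline's input. It is simplified (the real engine also handles DropItem and async results), and the class and field names are made up:

```python
class BrokenPipeline:
    def process_item(self, item, spider):
        pass   # BUG: forgot `return item`, so the next stage receives None

class PricePipeline:
    def process_item(self, item, spider):
        item['price'] = float(item['price'].lstrip('$'))
        return item

def run_chain(item, pipelines):
    # Each pipeline's output feeds the next, as Scrapy's engine does
    for pipeline in pipelines:
        item = pipeline.process_item(item, spider=None)
    return item

ok = run_chain({'title': 'Widget', 'price': '$9.99'}, [PricePipeline()])
print(ok)   # {'title': 'Widget', 'price': 9.99}

try:
    run_chain({'title': 'Widget', 'price': '$9.99'},
              [BrokenPipeline(), PricePipeline()])
except TypeError as e:
    print('chain broke:', e)   # item was lost upstream; None is not subscriptable
```

The broken chain fails loudly here; in a real spider the symptom is often quieter, which is why logging inside process_item (as above) is worth the noise.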

Drop items intentionally:

from scrapy.exceptions import DropItem

def process_item(self, item, spider):
    if not item.get('price'):
        raise DropItem(f"Missing price: {item}")
    return item

Dropped items are logged (at WARNING level by default) — check item_dropped_count in the spider stats.

Fix 6: Following Links and Pagination

Scraping multiple pages requires yielding new requests in addition to items.

import scrapy

class ProductSpider(scrapy.Spider):
    name = 'products'
    start_urls = ['https://example.com/products?page=1']

    def parse(self, response):
        # Yield items
        for product in response.css('div.product'):
            yield {
                'title': product.css('h2::text').get(),
                'price': product.css('span.price::text').get(),
                'url': response.urljoin(product.css('a::attr(href)').get()),
            }

        # Follow pagination
        next_page = response.css('a.next-page::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

response.follow vs scrapy.Request:

# response.follow — handles relative URLs automatically
yield response.follow('/page/2', callback=self.parse)

# scrapy.Request — must construct full URL
yield scrapy.Request('https://example.com/page/2', callback=self.parse)
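The relative-URL resolution that response.follow performs is standard RFC 3986 joining against response.url, which you can reproduce with the standard library's urljoin:

```python
from urllib.parse import urljoin

# response.follow resolves relative URLs against response.url, roughly like:
base = 'https://example.com/products?page=1'
print(urljoin(base, '/page/2'))    # https://example.com/page/2
print(urljoin(base, '?page=2'))    # https://example.com/products?page=2
print(urljoin(base, 'detail/42'))  # https://example.com/detail/42
```

Note how a bare query string keeps the base path while a path segment replaces the last one, which is why pagination links extracted as href attributes usually "just work" with response.follow.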

Multiple parse methods for different page types:

class ShopSpider(scrapy.Spider):
    name = 'shop'
    start_urls = ['https://example.com/categories']

    def parse(self, response):
        # Categories index — follow each category link
        for category in response.css('a.category::attr(href)').getall():
            yield response.follow(category, callback=self.parse_category)

    def parse_category(self, response):
        # Category page — follow each product link
        for product_url in response.css('a.product-link::attr(href)').getall():
            yield response.follow(product_url, callback=self.parse_product)

        # Pagination within category
        next_page = response.css('a.next::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse_category)

    def parse_product(self, response):
        # Product detail page — yield item
        yield {
            'title': response.css('h1::text').get(),
            'price': response.css('.price::text').get(),
            'description': response.css('.description::text').get(),
        }

Pass data between callbacks with cb_kwargs:

def parse_category(self, response):
    category_name = response.css('h1::text').get()
    for product_url in response.css('a.product::attr(href)').getall():
        yield response.follow(
            product_url,
            callback=self.parse_product,
            cb_kwargs={'category': category_name},   # Passed to parse_product
        )

def parse_product(self, response, category):
    yield {
        'title': response.css('h1::text').get(),
        'category': category,   # From the category page
    }

Fix 7: Throttling and Avoiding Bans

Scrapy can hit servers too hard, getting your IP banned. Configure throttling:

# settings.py

# Concurrent requests
CONCURRENT_REQUESTS = 16              # Total simultaneous requests (default)
CONCURRENT_REQUESTS_PER_DOMAIN = 8    # Per domain (default)

# Delay between requests
DOWNLOAD_DELAY = 1.0                  # 1 second between requests
RANDOMIZE_DOWNLOAD_DELAY = True       # Randomize: 0.5x–1.5x of DOWNLOAD_DELAY

# AutoThrottle — adapts based on server response time
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5
AUTOTHROTTLE_MAX_DELAY = 60
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0   # Target 1 request in flight per domain

# Retry settings
RETRY_ENABLED = True
RETRY_TIMES = 3
RETRY_HTTP_CODES = [500, 502, 503, 504, 408, 429]

Pro Tip: Always enable AUTOTHROTTLE_ENABLED for production scrapers. It dynamically adjusts request rate based on server response times — slowing down when the server is overloaded and speeding up when it’s responsive. This is far more polite (and less likely to get banned) than hardcoded delays.
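AutoThrottle's documented adjustment rule is simple enough to sketch in a few lines. This is a simplified model (the real implementation also refuses to lower the delay after non-200 responses, and clamps against DOWNLOAD_DELAY as the floor):

```python
def next_delay(current_delay, latency, target_concurrency=1.0,
               min_delay=0.0, max_delay=60.0):
    """Simplified sketch of AutoThrottle's rule: aim for
    latency / target_concurrency, move halfway there, then clamp."""
    target = latency / target_concurrency
    return max(min_delay, min((current_delay + target) / 2.0, max_delay))

print(next_delay(5.0, latency=1.0))     # 3.0  : responsive server, speed up
print(next_delay(5.0, latency=30.0))    # 17.5 : struggling server, back off
print(next_delay(5.0, latency=500.0))   # 60.0 : clamped at AUTOTHROTTLE_MAX_DELAY
```

Because the delay moves halfway toward the latency-derived target on each response, it converges smoothly instead of oscillating, which is what makes it gentler than a fixed DOWNLOAD_DELAY.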

HTTP cache for development — avoid re-fetching the same pages:

# settings.py
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 3600   # 1 hour
HTTPCACHE_DIR = 'httpcache'
HTTPCACHE_IGNORE_HTTP_CODES = [500, 502, 503, 504]

Fix 8: Output Formats and Storage

Run a spider with output to a file:

# JSON
scrapy crawl products -o products.json

# JSON Lines (one JSON object per line — better for streaming)
scrapy crawl products -o products.jsonl

# CSV
scrapy crawl products -o products.csv

# XML
scrapy crawl products -o products.xml
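Feeds can also be configured in settings.py via the FEEDS setting (Scrapy 2.1+), which replaced the older FEED_FORMAT/FEED_URI pair; useful when you want the output location version-controlled rather than passed on the command line:

```python
# settings.py, equivalent to `scrapy crawl products -o products.jsonl`
FEEDS = {
    'products.jsonl': {
        'format': 'jsonlines',
        'encoding': 'utf8',
        'overwrite': True,
    },
}
```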

Output to S3 directly:

scrapy crawl products -o s3://my-bucket/products-%(time)s.jsonl
# settings.py
AWS_ACCESS_KEY_ID = 'YOUR_KEY'
AWS_SECRET_ACCESS_KEY = 'YOUR_SECRET'

Define structured Items for type safety:

# items.py
import scrapy

class ProductItem(scrapy.Item):
    title = scrapy.Field()
    price = scrapy.Field()
    url = scrapy.Field()
    category = scrapy.Field()
# spider
from myproject.items import ProductItem

class ProductSpider(scrapy.Spider):
    name = 'products'

    def parse(self, response):
        for product in response.css('div.product'):
            item = ProductItem()
            item['title'] = product.css('h2::text').get()
            item['price'] = product.css('span.price::text').get()
            yield item

Or use ItemLoaders for cleaner extraction:

from scrapy.loader import ItemLoader
from itemloaders.processors import TakeFirst, MapCompose

class ProductLoader(ItemLoader):
    default_output_processor = TakeFirst()
    price_in = MapCompose(lambda x: x.replace('$', '').strip(), float)

# Spider
def parse(self, response):
    for product in response.css('div.product'):
        loader = ProductLoader(item=ProductItem(), selector=product)   # use the custom loader so price_in applies
        loader.add_css('title', 'h2::text')
        loader.add_css('price', 'span.price::text')
        yield loader.load_item()
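If MapCompose's behavior is unclear, this simplified pure-Python model captures the core rule: each function is applied to every value in turn, and values a function maps to None are dropped (the real processor also flattens returned iterables):

```python
def map_compose(*functions):
    # Simplified model of itemloaders' MapCompose
    def process(values):
        for fn in functions:
            values = [out for v in values if (out := fn(v)) is not None]
        return values
    return process

price_in = map_compose(lambda x: x.replace('$', '').strip(), float)
print(price_in([' $19.99 ', '$5']))   # [19.99, 5.0]
```

TakeFirst then picks the first surviving value as the field's output, which is why the loader yields a scalar price rather than a list.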

Still Not Working?

Debugging with scrapy parse

Test a single URL with your spider’s parse method without running the full crawl:

scrapy parse "https://example.com/products" --spider=products --callback=parse

This fetches the URL, runs it through the specified callback, and prints the items and requests it would yield (add --pipelines to also run your item pipelines) — invaluable for debugging selector issues.

Saving Debug Information

class DebugSpider(scrapy.Spider):
    name = 'debug'

    def parse(self, response):
        # Save the response HTML for inspection
        with open('debug.html', 'wb') as f:
            f.write(response.body)
        
        # Or use Scrapy's built-in
        self.logger.debug(response.css('body').get())

Comparing with BeautifulSoup

For one-off scraping or when you don’t need Scrapy’s framework features (concurrency, pipelines, middlewares), BeautifulSoup is simpler. See BeautifulSoup not working for parser selection and find_all patterns. For browser-based scraping when JavaScript rendering is required, see Selenium not working.

Spider Lifecycle Hooks

Spiders have lifecycle methods you can override to set up and tear down resources:

import time

import scrapy
from scrapy import signals

class ProductSpider(scrapy.Spider):
    name = 'products'

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
        return spider

    def spider_opened(self, spider):
        self.logger.info(f"Spider {spider.name} starting")
        self.start_time = time.time()

    def spider_closed(self, spider, reason):
        elapsed = time.time() - self.start_time
        self.logger.info(f"Spider finished in {elapsed:.1f}s, reason: {reason}")

This is the right place to open database connections, write to log files, or send metrics — not inside parse, which runs once per page.

Running Multiple Spiders

# Run all spiders sequentially
scrapy crawl spider1 && scrapy crawl spider2

# Or programmatically with CrawlerProcess (parallel)
from scrapy.crawler import CrawlerProcess

process = CrawlerProcess(settings={
    'FEEDS': {'output.json': {'format': 'json'}},   # replaces deprecated FEED_FORMAT/FEED_URI
})
process.crawl(Spider1)
process.crawl(Spider2)
process.start()   # Blocks until all spiders finish

Scheduling with Airflow

For scheduled scraping pipelines that integrate with downstream data warehouses, schedule Scrapy spiders with Apache Airflow using BashOperator to run scrapy crawl. For Airflow DAG patterns and scheduler issues, see Airflow not working.


FixDevs

Solo developer based in Japan. Every solution is cross-referenced with official documentation and tested before publishing.

