Fix: Scrapy Not Working — Spider Crawl Returns Nothing, Robots.txt Blocked, and Pipeline Errors

FixDevs · (Updated: )

Quick Answer

How to fix Scrapy errors — spider yields no items, robots.txt blocking all requests, 403 forbidden response, AttributeError on response.css, item pipeline not processing, AsyncIO reactor errors, and middleware not running.

The Error

You run a spider and it finishes with zero items:

2025-04-09 14:22:01 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'item_scraped_count': 0,
 'log_count/INFO': 12,
 'response_received_count': 1,
 'finish_reason': 'finished'}

Or every request is skipped before it is sent, because robots.txt disallows it:

2025-04-09 14:22:01 [scrapy.downloadermiddlewares.robotstxt] DEBUG:
Forbidden by robots.txt: <GET https://example.com/products>

Or the spider crawls successfully but no items reach the database — the pipeline silently drops everything:

class MyPipeline:
    def process_item(self, item, spider):
        # Items pass through but DB never updates
        return item

Or you upgrade Scrapy and get reactor errors:

twisted.internet.error.ReactorAlreadyInstalledError:
reactor already installed

Scrapy is a full asynchronous framework — not just a library. It manages spiders, middlewares, item pipelines, and a Twisted-based reactor. When something breaks, the failure can be in the spider logic, the settings, the middleware chain, or the underlying async runtime. This guide isolates each layer.

Why This Happens

Scrapy uses Twisted as its async engine, which predates asyncio and has its own reactor concept. Modern Scrapy (2.0+) supports asyncio-compatible code, but mixing the two incorrectly produces reactor errors. Spiders are class-based and follow a strict lifecycle: start_requests → parse callbacks → yielded items → pipelines. If the chain breaks at any point, items don’t reach storage.

The most common silent failure is robots.txt blocking requests. Projects generated by scrapy startproject set ROBOTSTXT_OBEY = True in settings.py, and Scrapy identifies itself with the Scrapy/2.x.x user agent unless you change it. Either behavior, along with any throttling you configure, can stop a legitimate crawl from returning data.

Fix 1: Spider Yields No Items

{'item_scraped_count': 0, 'response_received_count': 5}

Pages were fetched but no items were yielded. Diagnose by checking each step.

Step 1: Print the response in parse to confirm the page loaded:

import scrapy

class ProductSpider(scrapy.Spider):
    name = 'products'
    start_urls = ['https://example.com/products']

    def parse(self, response):
        # Verify the page loaded
        self.logger.info(f"URL: {response.url}, status: {response.status}")
        self.logger.info(f"Body length: {len(response.body)}")
        self.logger.info(f"First 500 chars: {response.text[:500]}")

        # Then attempt extraction
        for product in response.css('div.product'):
            yield {
                'title': product.css('h2::text').get(),
                'price': product.css('span.price::text').get(),
            }

Step 2: Verify selectors in scrapy shell interactively:

scrapy shell "https://example.com/products"

# In the shell:
>>> response.status
200
>>> response.css('div.product').get()   # First match
>>> response.css('div.product').getall()   # All matches
>>> response.xpath('//div[@class="product"]').getall()

If the shell returns nothing, the selectors are wrong. If they work in the shell but not in the spider, you have a code bug in the spider.

Step 3: Check for JavaScript-rendered content. View response.text in the shell. If the markup you want isn’t there, the page builds it with JavaScript:

scrapy shell "https://example.com/products"
>>> 'class="product"' in response.text   # False = markup not in the raw HTML, likely JS-rendered

For JS-rendered sites, integrate Playwright with Scrapy:

pip install scrapy-playwright
playwright install chromium

# settings.py
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
PLAYWRIGHT_BROWSER_TYPE = "chromium"

import scrapy
from scrapy_playwright.page import PageMethod

class ProductSpider(scrapy.Spider):
    name = 'products'

    def start_requests(self):
        yield scrapy.Request(
            'https://example.com/products',
            meta={
                'playwright': True,
                # Wait for the products to render before the response is built,
                # so response.css() sees the rendered HTML
                'playwright_page_methods': [
                    PageMethod('wait_for_selector', 'div.product'),
                ],
            },
        )

    def parse(self, response):
        for product in response.css('div.product'):
            yield {'title': product.css('h2::text').get()}

For Playwright-specific configuration, see Playwright not working.

Fix 2: Forbidden by robots.txt

[scrapy.downloadermiddlewares.robotstxt] DEBUG:
Forbidden by robots.txt: <GET https://example.com/products>

With ROBOTSTXT_OBEY = True (the value set in projects generated by scrapy startproject), Scrapy fetches robots.txt first and skips any path the site disallows.

Check what robots.txt says:

curl https://example.com/robots.txt
# User-agent: *
# Disallow: /products
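The same check can be scripted before a crawl. Below is a minimal sketch using Python's standard-library urllib.robotparser. The helper name is_allowed and the sample rules are assumptions for illustration, and Scrapy's own robots.txt parser may treat edge cases such as wildcards differently:

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Sample rules matching the curl output above (illustrative only)
ROBOTS_TXT = """\
User-agent: *
Disallow: /products
"""

if __name__ == "__main__":
    for path in ("/products", "/about"):
        url = f"https://example.com{path}"
        print(url, "allowed" if is_allowed(ROBOTS_TXT, url) else "blocked")
```

If the path you need comes back as blocked, that explains the "Forbidden by robots.txt" log line without any further debugging of the spider.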

Disable robots.txt enforcement (only for sites you have permission to scrape):

# settings.py
ROBOTSTXT_OBEY = False

# Or per-spider
class ProductSpider(scrapy.Spider):
    name = 'products'
    custom_settings = {
        'ROBOTSTXT_OBEY': False,
    }

Note: Ignoring robots.txt may violate a site’s terms of service and could be illegal in some jurisdictions. Only do this on sites you own, have explicit permission to scrape, or where the data is clearly intended for public access.

Fix 3: 403 Forbidden — Server Blocks Default User-Agent

Many sites block Scrapy’s default user-agent (Scrapy/2.x.x (+https://scrapy.org)):

2025-04-09 14:22:01 [scrapy.spidermiddlewares.httperror] INFO:
Ignoring response <403 https://example.com/>: HTTP status code is not handled or not allowed

Set a realistic user-agent:

# settings.py
USER_AGENT = (
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
    'AppleWebKit/537.36 (KHTML, like Gecko) '
    'Chrome/124.0.0.0 Safari/537.36'
)

Add common headers:

# settings.py
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.9',
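    # Accept-Encoding is left to Scrapy, which advertises only the encodings it can decode.
    # Forcing 'br' without a Brotli library installed can leave response bodies unreadable.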
}

Rotate user-agents with scrapy-user-agents:

pip install scrapy-user-agents

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_user_agents.middlewares.RandomUserAgentMiddleware': 400,
}

For sites with serious bot protection (Cloudflare, DataDome), Scrapy alone isn’t enough. Use scrapy-playwright (real browser) or a service like ScraperAPI.

Fix 4: AsyncIO Reactor Errors

twisted.internet.error.ReactorAlreadyInstalledError:
reactor already installed
ValueError: The installed reactor (twisted.internet.epollreactor.EPollReactor)
does not match the requested one

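Twisted allows only one reactor per process, and it is installed the first time twisted.internet.reactor is imported. If that import happens before Scrapy can install the reactor named in TWISTED_REACTOR (for example, the asyncio reactor that scrapy-playwright needs), you get one of these errors.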

Set the reactor explicitly in settings.py:

# settings.py
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

This is required when using:

  • scrapy-playwright
  • Async callbacks (async def parse) that await asyncio-based code
  • Any library that uses asyncio internally

For asyncio in spiders:

import scrapy
import asyncio

class AsyncSpider(scrapy.Spider):
    name = 'async_spider'
    custom_settings = {
        'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor',
    }

    async def parse(self, response):
        # Can use await inside parse callbacks
        await asyncio.sleep(0.1)

        for item in response.css('div.product'):
            yield {'title': item.css('h2::text').get()}

The asyncio reactor is required whenever a callback awaits asyncio-based code such as asyncio.sleep, and it changes how Twisted Deferreds and asyncio futures interoperate.
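To find out whether something imported the reactor before Scrapy had a chance to install the one you asked for, you can inspect sys.modules. This is a small diagnostic sketch: the function names are made up for illustration, and it relies on Twisted registering the installed reactor under the module name twisted.internet.reactor.

```python
import sys

REACTOR_MODULE = "twisted.internet.reactor"


def reactor_already_imported(modules=None):
    """Return True if a Twisted reactor is already registered in sys.modules."""
    modules = sys.modules if modules is None else modules
    return REACTOR_MODULE in modules


def installed_reactor_name(modules=None):
    """Return the class path of the registered reactor, or None if there is none."""
    modules = sys.modules if modules is None else modules
    reactor = modules.get(REACTOR_MODULE)
    if reactor is None:
        return None
    cls = type(reactor)
    return f"{cls.__module__}.{cls.__qualname__}"


if __name__ == "__main__":
    # Call this at the top of your run script, before importing your spiders
    print("reactor imported:", reactor_already_imported())
    print("reactor class:", installed_reactor_name())
```

If the first line prints True before your crawl starts, move the offending import (often a top-level "from twisted.internet import reactor") inside the function that uses it.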

Fix 5: Item Pipeline Not Processing

Items are yielded but never reach your database. The pipeline isn’t enabled or has a silent error.

Check pipeline registration in settings.py:

# settings.py
ITEM_PIPELINES = {
    'myproject.pipelines.ValidationPipeline': 100,
    'myproject.pipelines.DatabasePipeline': 200,
    # Lower number = runs first
}

If ITEM_PIPELINES is empty or the path is wrong, no pipelines run.

Verify pipeline is actually called:

# pipelines.py
import logging

class DatabasePipeline:
    def open_spider(self, spider):
        spider.logger.info("DatabasePipeline opened")
        self.connection = create_db_connection()

    def close_spider(self, spider):
        spider.logger.info("DatabasePipeline closed")
        self.connection.close()

    def process_item(self, item, spider):
        spider.logger.info(f"Processing item: {item}")
        try:
            self.connection.insert(item)
        except Exception as e:
            spider.logger.error(f"DB insert failed: {e}")
            raise   # Re-raise: Scrapy logs the error and stops processing this item
        return item   # MUST return item for next pipeline

Common Mistake: Forgetting to return item in process_item. If you don’t return the item, the next pipeline receives None instead of the item, so the run can look successful in the stats while nothing is stored.

# WRONG — forgot to return
def process_item(self, item, spider):
    self.db.insert(item)
    # Missing: return item

# CORRECT
def process_item(self, item, spider):
    self.db.insert(item)
    return item

Drop items intentionally:

from scrapy.exceptions import DropItem

def process_item(self, item, spider):
    if not item.get('price'):
        raise DropItem(f"Missing price: {item}")
    return item

Dropped items are logged at WARNING level. Check item_dropped_count in the spider stats.
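The ordering, return-value, and drop rules are easier to see in a toy model. The sketch below is not Scrapy's implementation: DropItem here is a local stand-in for scrapy.exceptions.DropItem, and run_pipelines and the three pipeline classes are hypothetical names used only to show how an item moves through a priority-ordered chain.

```python
class DropItem(Exception):
    """Local stand-in for scrapy.exceptions.DropItem so the sketch runs without Scrapy."""


class RequirePrice:
    def process_item(self, item, spider):
        if not item.get("price"):
            raise DropItem("Missing price")
        return item


class ForgetsReturn:
    def process_item(self, item, spider):
        item["seen"] = True   # Bug: no return, so the next stage receives None


class Store:
    def __init__(self):
        self.rows = []

    def process_item(self, item, spider):
        self.rows.append(item)
        return item


def run_pipelines(item, pipelines, spider=None):
    """Pass item through pipelines ordered by priority (lower number runs first)."""
    for pipeline, _priority in sorted(pipelines.items(), key=lambda pair: pair[1]):
        try:
            item = pipeline.process_item(item, spider)
        except DropItem:
            return None   # Dropped: later pipelines never see the item
    return item


if __name__ == "__main__":
    good_store, broken_store = Store(), Store()
    run_pipelines({"title": "A", "price": "9.99"}, {RequirePrice(): 100, good_store: 200})
    run_pipelines({"title": "B", "price": "1.00"}, {ForgetsReturn(): 100, broken_store: 200})
    print(good_store.rows)     # the full item
    print(broken_store.rows)   # [None]: the item was lost one stage earlier
```

In the broken chain the storage stage still runs, which is why the failure is silent: it simply has nothing useful to store.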

Fix 6: Following Links and Pagination

Scraping multiple pages requires yielding new requests in addition to items:

import scrapy

class ProductSpider(scrapy.Spider):
    name = 'products'
    start_urls = ['https://example.com/products?page=1']

    def parse(self, response):
        # Yield items
        for product in response.css('div.product'):
            yield {
                'title': product.css('h2::text').get(),
                'price': product.css('span.price::text').get(),
                'url': response.urljoin(product.css('a::attr(href)').get()),
            }

        # Follow pagination
        next_page = response.css('a.next-page::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

response.follow vs scrapy.Request:

# response.follow — handles relative URLs automatically
yield response.follow('/page/2', callback=self.parse)

# scrapy.Request — must construct full URL
yield scrapy.Request('https://example.com/page/2', callback=self.parse)
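Resolving a relative link against the current page follows standard URL-joining rules, which you can try out with urllib.parse.urljoin from the standard library. This is a sketch of the URL arithmetic only; the resolve helper and the sample URLs are illustrative.

```python
from urllib.parse import urljoin


def resolve(page_url: str, href: str) -> str:
    """Resolve a link found on page_url into an absolute URL."""
    return urljoin(page_url, href)


if __name__ == "__main__":
    page = "https://example.com/products?page=1"
    print(resolve(page, "/page/2"))   # https://example.com/page/2
    print(resolve(page, "?page=2"))   # https://example.com/products?page=2
    print(resolve(page, "item/42"))   # https://example.com/item/42
```

When building scrapy.Request by hand, pass the href through response.urljoin (or a helper like this) first, otherwise a relative href is rejected as a URL without a scheme.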

Multiple parse methods for different page types:

class ShopSpider(scrapy.Spider):
    name = 'shop'
    start_urls = ['https://example.com/categories']

    def parse(self, response):
        # Category index page: follow each category link
        for category in response.css('a.category::attr(href)').getall():
            yield response.follow(category, callback=self.parse_category)

    def parse_category(self, response):
        # Category page — follow each product link
        for product_url in response.css('a.product-link::attr(href)').getall():
            yield response.follow(product_url, callback=self.parse_product)

        # Pagination within category
        next_page = response.css('a.next::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse_category)

    def parse_product(self, response):
        # Product detail page — yield item
        yield {
            'title': response.css('h1::text').get(),
            'price': response.css('.price::text').get(),
            'description': response.css('.description::text').get(),
        }

Pass data between callbacks with cb_kwargs:

def parse_category(self, response):
    category_name = response.css('h1::text').get()
    for product_url in response.css('a.product::attr(href)').getall():
        yield response.follow(
            product_url,
            callback=self.parse_product,
            cb_kwargs={'category': category_name},   # Passed to parse_product
        )

def parse_product(self, response, category):
    yield {
        'title': response.css('h1::text').get(),
        'category': category,   # From the category page
    }

Fix 7: Throttling and Avoiding Bans

Scrapy can hit servers too hard, getting your IP banned. Configure throttling:

# settings.py

# Concurrent requests
CONCURRENT_REQUESTS = 16              # Total simultaneous requests (default)
CONCURRENT_REQUESTS_PER_DOMAIN = 8    # Per domain (default)

# Delay between requests
DOWNLOAD_DELAY = 1.0                  # 1 second between requests
RANDOMIZE_DOWNLOAD_DELAY = True       # Randomize: 0.5x–1.5x of DOWNLOAD_DELAY

# AutoThrottle — adapts based on server response time
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5
AUTOTHROTTLE_MAX_DELAY = 60
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0   # Target 1 request in flight per domain

# Retry settings
RETRY_ENABLED = True
RETRY_TIMES = 3
RETRY_HTTP_CODES = [500, 502, 503, 504, 408, 429]
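To get a feel for what RANDOMIZE_DOWNLOAD_DELAY does to the pacing, here is a short sketch of the 0.5x to 1.5x range mentioned in the comment above. The function name randomized_delay is an assumption for illustration, not a Scrapy API.

```python
import random


def randomized_delay(download_delay: float, rng=random) -> float:
    """Pick a wait time between 0.5x and 1.5x of download_delay."""
    return rng.uniform(0.5 * download_delay, 1.5 * download_delay)


if __name__ == "__main__":
    # With DOWNLOAD_DELAY = 1.0, each wait lands somewhere between 0.5s and 1.5s
    for _ in range(5):
        print(f"{randomized_delay(1.0):.2f}s")
```

The average wait stays at DOWNLOAD_DELAY, so randomizing changes the rhythm of the requests rather than the overall crawl speed.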

Pro Tip: Enable AutoThrottle (AUTOTHROTTLE_ENABLED = True) for production scrapers. It adjusts the request rate from measured response latency, slowing down when the server is overloaded and speeding up when it is responsive. This is more polite, and less likely to get you banned, than a hardcoded delay.
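The adjustment rule can be sketched in a few lines. This is a simplified model of the algorithm described in Scrapy's AutoThrottle documentation, not the extension's actual code. The function name next_delay and its parameters are illustrative, with defaults mirroring the settings above.

```python
def next_delay(current_delay, latency, status=200,
               target_concurrency=1.0, min_delay=0.0, max_delay=60.0):
    """Return the delay to use after observing one response's latency."""
    # Delay that would keep target_concurrency requests in flight
    target_delay = latency / target_concurrency
    # Move halfway toward the target, but never end up below it
    new_delay = max(target_delay, (current_delay + target_delay) / 2.0)
    # Clamp to the configured bounds
    new_delay = min(max(min_delay, new_delay), max_delay)
    # Error responses tend to come back fast; don't let them lower the delay
    if status != 200 and new_delay <= current_delay:
        return current_delay
    return new_delay


if __name__ == "__main__":
    print(next_delay(5.0, 1.0))               # 3.0: fast server, delay shrinks
    print(next_delay(5.0, 1.0, status=503))   # 5.0: error response, delay kept
    print(next_delay(5.0, 20.0))              # 20.0: slow server, delay grows
```

The start delay of 5 seconds therefore decays toward the observed latency over the first few responses rather than dropping immediately.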

HTTP cache for development — avoid re-fetching the same pages:

# settings.py
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 3600   # 1 hour
HTTPCACHE_DIR = 'httpcache'
HTTPCACHE_IGNORE_HTTP_CODES = [500, 502, 503, 504]

Fix 8: Output Formats and Storage

Run a spider with output to a file:

# JSON (-o appends to an existing file, which breaks a JSON array; use -O to overwrite)
scrapy crawl products -o products.json

# JSON Lines (one JSON object per line — better for streaming)
scrapy crawl products -o products.jsonl

# CSV
scrapy crawl products -o products.csv

# XML
scrapy crawl products -o products.xml

Output to S3 directly:

scrapy crawl products -o s3://my-bucket/products-%(time)s.jsonl

# settings.py
AWS_ACCESS_KEY_ID = 'YOUR_KEY'
AWS_SECRET_ACCESS_KEY = 'YOUR_SECRET'
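The %(time)s placeholder in the feed URI is filled in with printf-style named substitution when the feed is created. The sketch below shows the mechanism with plain Python string formatting. The helper expand_feed_uri is hypothetical and the exact timestamp layout is an assumption, so check a real run for the precise format your Scrapy version produces.

```python
from datetime import datetime


def expand_feed_uri(template: str, spider_name: str, now: datetime) -> str:
    """Fill %(time)s and %(name)s placeholders in a feed URI template."""
    params = {
        # Colons are swapped out because they are awkward in file names
        "time": now.replace(microsecond=0).isoformat().replace(":", "-"),
        "name": spider_name,
    }
    return template % params


if __name__ == "__main__":
    uri = expand_feed_uri(
        "s3://my-bucket/%(name)s-%(time)s.jsonl",
        "products",
        datetime(2025, 4, 9, 14, 22, 1),
    )
    print(uri)   # s3://my-bucket/products-2025-04-09T14-22-01.jsonl
```

Because each run gets its own timestamped key, repeated crawls do not overwrite each other's output.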

Define structured Items for type safety:

# items.py
import scrapy

class ProductItem(scrapy.Item):
    title = scrapy.Field()
    price = scrapy.Field()
    url = scrapy.Field()
    category = scrapy.Field()

# spider
from myproject.items import ProductItem

class ProductSpider(scrapy.Spider):
    name = 'products'

    def parse(self, response):
        for product in response.css('div.product'):
            item = ProductItem()
            item['title'] = product.css('h2::text').get()
            item['price'] = product.css('span.price::text').get()
            yield item

Or use ItemLoaders for cleaner extraction:

from scrapy.loader import ItemLoader
from itemloaders.processors import TakeFirst, MapCompose
from myproject.items import ProductItem

class ProductLoader(ItemLoader):
    default_output_processor = TakeFirst()
    # Strip the currency symbol and whitespace, then convert to float
    price_in = MapCompose(lambda x: x.replace('$', '').strip(), float)

# Spider
def parse(self, response):
    for product in response.css('div.product'):
        # Use the ProductLoader subclass so its processors are applied
        loader = ProductLoader(item=ProductItem(), selector=product)
        loader.add_css('title', 'h2::text')
        loader.add_css('price', 'span.price::text')
        yield loader.load_item()
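If the processor chain feels opaque, a plain-Python model helps. The functions map_compose and take_first below are simplified stand-ins written for this example, not the itemloaders implementations (the real MapCompose also flattens list results, for instance), but they show the idea: every extracted value runs through each function in turn, then the first usable value is kept.

```python
def map_compose(*functions):
    """Build a processor that applies each function to every value, in order."""
    def process(values):
        for function in functions:
            next_values = []
            for value in values:
                result = function(value)
                if result is not None:   # None results are discarded
                    next_values.append(result)
            values = next_values
        return values
    return process


def take_first(values):
    """Return the first value that is neither None nor an empty string."""
    for value in values:
        if value is not None and value != "":
            return value
    return None


# Same cleaning steps as price_in above
clean_price = map_compose(lambda text: text.replace("$", "").strip(), float)

if __name__ == "__main__":
    extracted = ["$19.99 ", " $5"]   # What a '::text' selector might return
    prices = clean_price(extracted)
    print(prices)               # [19.99, 5.0]
    print(take_first(prices))   # 19.99
```

This is also why a selector that matches nothing produces a missing field rather than an error: the output processor never finds a value to return.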

Still Not Working?

Debugging with scrapy parse

Test a single URL with your spider’s parse method without running the full crawl:

scrapy parse "https://example.com/products" --spider=products --callback=parse

This runs the URL through the chosen callback and prints the items and follow-up requests it yields, which is invaluable for debugging selector issues. Item pipelines are skipped unless you add --pipelines.

Saving Debug Information

import scrapy

class DebugSpider(scrapy.Spider):
    name = 'debug'
    start_urls = ['https://example.com/products']

    def parse(self, response):
        # Save the response HTML for inspection
        with open('debug.html', 'wb') as f:
            f.write(response.body)

        # Or log the parsed body at DEBUG level
        self.logger.debug(response.css('body').get())

Scrapy vs Other Scrapers — Cross-Tool Comparison

Most “Scrapy not working” sessions end with a developer realizing Scrapy wasn’t the right tool. Knowing the alternatives saves the wasted hours.

Scrapy vs Playwright + BeautifulSoup. Scrapy is a full async crawl framework — concurrency, pipelines, middlewares, request scheduling. Playwright + BeautifulSoup is two libraries you compose yourself: Playwright fetches (handles JS), BeautifulSoup parses. The trade-off: Scrapy wins on scale (10k+ pages), Playwright + BS wins on simplicity (a few hundred pages, JS-heavy sites, one-off scrapes). For pure HTML parsing without browser rendering, see BeautifulSoup not working. If you only need browser automation, skip Scrapy entirely.

Scrapy vs Selenium. Selenium is browser automation, not crawling. It’s slower than Scrapy (one page at a time, real browser) but handles JS-heavy sites natively. Use Selenium only when the site fights back hard (Cloudflare with JS challenge, sites that detect headless browsers) and you can’t bypass it. For most JS-rendered sites, scrapy-playwright (covered in Fix 1) gives you Scrapy’s concurrency model with Playwright’s rendering — better than Selenium for nontrivial workloads. For Selenium-specific failures, see Selenium not working.

Scrapy vs Crawlee (JS/TS). Crawlee is Apify’s open-source crawler for Node.js. It has Cheerio, Puppeteer, and Playwright wrappers, plus built-in queue management, session pools, and proxy rotation. The pitch: “Scrapy but TypeScript.” Choose Crawlee if your team is JS-first or you need to share types between scraper and downstream API. Choose Scrapy if you’re already Python-first — the ecosystem (Item Loaders, exporters, stats) is deeper.

Scrapy vs Apify SDK. Apify SDK (Python or JS) is built on Crawlee with cloud-hosted infrastructure (proxy rotation, datasets, scheduling, key-value stores) baked in. If you don’t want to manage infrastructure for scrapers running daily, Apify removes that burden. The cost is vendor lock-in — your data lives in Apify’s storage by default. Scrapy gives you full control; Apify gives you less ops work.

Rate limiting and rotation comparison. This is where tools differ most:

| Tool | Throttle | Proxy rotation | UA rotation | Anti-bot bypass |
| --- | --- | --- | --- | --- |
| Scrapy | AutoThrottle, DOWNLOAD_DELAY | scrapy-rotating-proxies | scrapy-user-agents | scrapy-playwright + custom mw |
| Playwright + BS | Manual (asyncio.sleep) | Browser launch args | UA in launch options | Native (real browser) |
| Selenium | Manual sleep | Browser proxy config | UA in options | Native (real browser) |
| Crawlee | Built-in queue | Built-in session pool | Built-in fingerprinting | Built-in stealth plugins |
| Apify SDK | Built-in queue | Built-in (proxy plans) | Built-in fingerprinting | Built-in + Anti-Captcha |

Scrapy needs the most assembly for anti-bot — separate packages for rotation, custom middlewares for fingerprint randomization. Crawlee and Apify ship those features by default. For scrapers targeting sites with serious bot protection (Cloudflare, DataDome, PerimeterX), starting with Crawlee or Apify is often faster than retrofitting Scrapy.

Pro Tip: Match the tool to the bot-detection level, not the team’s familiarity. A team that knows Scrapy will still lose days bypassing Cloudflare on a site that Crawlee handles out of the box. Switching tools for a single project is cheap; debugging the wrong tool isn’t.

Spider Lifecycle Hooks

Spiders have lifecycle methods you can override to set up and tear down resources:

import time

import scrapy
from scrapy import signals

class ProductSpider(scrapy.Spider):
    name = 'products'

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        # Register handlers for the open and close signals
        crawler.signals.connect(spider.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
        return spider

    def spider_opened(self, spider):
        self.logger.info(f"Spider {spider.name} starting")
        self.start_time = time.time()

    def spider_closed(self, spider, reason):
        elapsed = time.time() - self.start_time
        self.logger.info(f"Spider finished in {elapsed:.1f}s, reason: {reason}")

This is the right place to open database connections, write to log files, or send metrics — not inside parse, which runs once per page.

Running Multiple Spiders

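# Run spiders one after another from the shell
scrapy crawl spider1 && scrapy crawl spider2

# Or run them in one process with CrawlerProcess (concurrently)
from scrapy.crawler import CrawlerProcess

# Spider1 and Spider2 stand for your own spider classes, imported from your project
process = CrawlerProcess(settings={
    # One output file per spider; FEEDS replaces the deprecated FEED_FORMAT / FEED_URI
    'FEEDS': {'%(name)s.json': {'format': 'json'}},
})
process.crawl(Spider1)
process.crawl(Spider2)
process.start()   # Blocks until all spiders finish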

Scheduling with Airflow

For scheduled scraping pipelines that integrate with downstream data warehouses, schedule Scrapy spiders with Apache Airflow using BashOperator to run scrapy crawl. For Airflow DAG patterns and scheduler issues, see Airflow not working.

Proxy Rotation Without scrapy-rotating-proxies

If you want lightweight proxy rotation without another dependency, write a downloader middleware:

# middlewares.py
import random

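class ProxyRotateMiddleware:
    MAX_PROXY_RETRIES = 3   # Stop swapping proxies after this many attempts

    def __init__(self):
        # Placeholder endpoints: replace with your own proxy URLs
        self.proxies = [
            "http://user:pass@proxy1.example.com:8080",
            "http://user:pass@proxy2.example.com:8080",
            "http://user:pass@proxy3.example.com:8080",
        ]

    def process_request(self, request, spider):
        request.meta["proxy"] = random.choice(self.proxies)

    def process_response(self, request, response, spider):
        retries = request.meta.get("proxy_retries", 0)
        if response.status in (403, 429) and retries < self.MAX_PROXY_RETRIES:
            # Retry; process_request picks a proxy again for the new request.
            # dont_filter keeps the dupefilter from discarding the repeat.
            retry = request.replace(dont_filter=True)
            retry.meta["proxy_retries"] = retries + 1
            return retry
        return response

# settings.py
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.ProxyRotateMiddleware": 350,
}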

This is enough for a few dozen proxies and predictable rotation. Past that, the dedicated package handles dead-proxy detection and weighted selection — worth installing.
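For a sense of what dead-proxy detection involves, here is a minimal sketch of a pool that stops handing out proxies after repeated failures. It is an illustration under simple assumptions (a fixed failure threshold, no cooldown or re-check), not how scrapy-rotating-proxies works internally. ProxyPool and its method names are hypothetical.

```python
import random


class ProxyPool:
    """Track failures per proxy and only hand out proxies still considered healthy."""

    def __init__(self, proxies, max_failures=3, rng=None):
        self.failures = {proxy: 0 for proxy in proxies}
        self.max_failures = max_failures
        self.rng = rng or random.Random()

    def alive(self):
        """Return the proxies that have not reached the failure threshold."""
        return [p for p, count in self.failures.items() if count < self.max_failures]

    def choose(self):
        candidates = self.alive()
        if not candidates:
            raise RuntimeError("No healthy proxies left")
        return self.rng.choice(candidates)

    def report_failure(self, proxy):
        self.failures[proxy] += 1

    def report_success(self, proxy):
        self.failures[proxy] = 0   # A good response resets the counter


if __name__ == "__main__":
    pool = ProxyPool(["http://proxy-a:8080", "http://proxy-b:8080"], max_failures=2)
    pool.report_failure("http://proxy-a:8080")
    pool.report_failure("http://proxy-a:8080")
    print(pool.alive())    # ['http://proxy-b:8080']
    print(pool.choose())   # http://proxy-b:8080
```

A middleware could call choose() in process_request and report_failure() or report_success() in process_response, keyed on request.meta["proxy"].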

Detecting Honeypot / Layout Changes Early

Sites change CSS classes regularly, silently breaking selectors. Catch breakages before items dry up:

import time

import scrapy

class ProductSpider(scrapy.Spider):
    name = "products"

    def parse(self, response):
        products = response.css("div.product")
        if not products:
            self.logger.error(
                f"NO PRODUCTS on {response.url}: "
                f"selector 'div.product' returned empty. "
                f"Body length: {len(response.body)}"
            )
            # Optionally save the HTML for offline inspection
            with open(f"debug_{int(time.time())}.html", "wb") as f:
                f.write(response.body)
            return
        for product in products:
            yield {"title": product.css("h2::text").get()}

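Note that CLOSESPIDER_ITEMCOUNT stops a crawl after a given number of items and is disabled at 0, so it cannot flag an empty run. For a loud failure instead of a silent zero-item run, raise scrapy.exceptions.CloseSpider from the callback, or check item_scraped_count in the stats when the spider closes.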
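Another way to make an empty run fail loudly is to check the final stats yourself. The sketch below takes a stats dictionary shaped like the "Dumping Scrapy stats" output shown at the top of this guide. The names EmptyCrawlError and assert_items_scraped are made up for this example, and how you obtain the dictionary (for instance from crawler.stats.get_stats() in a spider_closed handler) is left to your setup.

```python
class EmptyCrawlError(RuntimeError):
    """Raised when a crawl finishes with fewer items than expected."""


def assert_items_scraped(stats: dict, minimum: int = 1) -> int:
    """Return the scraped item count, or raise if it is below minimum."""
    scraped = stats.get("item_scraped_count", 0)
    if scraped < minimum:
        raise EmptyCrawlError(
            f"Expected at least {minimum} item(s), got {scraped} "
            f"(responses received: {stats.get('response_received_count', 0)})"
        )
    return scraped


if __name__ == "__main__":
    healthy = {"item_scraped_count": 42, "response_received_count": 5}
    print(assert_items_scraped(healthy))   # 42

    empty = {"item_scraped_count": 0, "response_received_count": 1}
    try:
        assert_items_scraped(empty)
    except EmptyCrawlError as error:
        print(f"Crawl failed: {error}")
```

Raising here lets a scheduler or CI job mark the run as failed, rather than recording a "finished" crawl that produced nothing.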

Persisting Crawl State Across Runs

For resumable crawls (laptops, spot instances, daily incrementals), enable Scrapy’s state persistence:

scrapy crawl products -s JOBDIR=crawls/products-1

Scrapy writes the pending request queue and dupefilter to crawls/products-1/. Stop the spider (Ctrl-C) and restart with the same command — it picks up where it left off. Combine with HTTPCACHE_ENABLED = True and you can iterate on parsing logic without re-fetching pages every run.

FixDevs

Solo developer based in Japan. Every solution is cross-referenced with official documentation and tested before publishing.
