Scrapy is a fast and reliable open-source Python framework for web scraping and crawling. It's one of the best tools for extracting large amounts of structured data for a variety of purposes.
As with all scraping tools, extracting data with Scrapy requires proxy servers, and there are a couple of ways to go about integrating them.
Why Use Proxies With Scrapy
No matter how well you define your Scrapy spider, automated requests exhibit patterns different from those of an ordinary human user. These patterns can trigger the website's anti-bot systems, and to continue web scraping, you must find a way around them.
Setting custom user agents, limiting the rate of requests, and other tactics are practical only as long as your IP address isn't under suspicion. The more requests you send from one IP, the more chances that it will be flagged, and your Scrapy spider will stop working.
The only sure workaround is using rotating proxies that help you mimic real users and switch up the IPs every set period to avoid detection. Additionally, some data for your Scrapy project might not be available due to geo-restrictions, which using a Scrapy proxy can easily solve.

Setting Up a Basic Scrapy Project
If you haven't already, start by installing Scrapy. You'll need an IDE such as PyCharm or Visual Studio Code. Create a new project, open the terminal within the program, and run the following command:
pip install scrapy
Once Scrapy and its dependencies are installed, you can start your Scrapy project by generating its folder structure. In most cases, the IDE will automatically be in your project folder. If not, start by navigating to the folder of your project.
cd C:\Users\User\Desktop\Scrapyproject
Then, run the scrapy startproject myproject command, which will generate a Scrapy folder structure like the one below.
myproject/
├── scrapy.cfg
└── myproject/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        └── __init__.py
Each file and folder here fulfills a separate function and can be edited accordingly to modify your Scrapy project.
- items.py defines the models for scraped data.
- middlewares.py holds request/response processing hooks, including proxy middleware.
- pipelines.py handles post-scraping data work like cleaning, exporting, and saving.
- settings.py stores project-wide settings.
- The spiders folder contains your spiders.
- scrapy.cfg is a deployment configuration file used in advanced setups.
Now you can change into your project folder at any time with the cd myproject command. Scrapy also gives you a command to generate a spider template:
scrapy genspider example httpbin.org
This creates an example.py file in which you can set up Scrapy proxy credentials or use a custom proxy middleware.
Configuring Proxies in Scrapy
There are two primary methods for integrating proxies in Scrapy, fit for different purposes. One way is to modify your spider Python file (in this case, it's example.py), adding a meta attribute with proxy credentials. This will set up a proxy on a per-request basis.
Method 1: Meta Attributes
1. Open example.py in a source code editor, such as VS Code. The code you'll find will look like this:
import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["httpbin.org"]
    start_urls = ["https://httpbin.org/"]

    def parse(self, response):
        pass
2. Now we can define the meta attribute, which is used by Scrapy's built-in HttpProxyMiddleware to route the request through a proxy. The code might look like this:
def start_requests(self):
    for url in self.start_urls:
        yield scrapy.Request(
            url=url,
            callback=self.parse,
            meta={'proxy': 'http://user:password@proxy1:port'}
        )
3. Head to the IPRoyal dashboard, configure the proxies you want, and save their credentials. Check our quick start guide on how to use the dashboard.
4. Once you have the needed credentials, paste everything into the code. The finished example.py file looks like this.
import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["httpbin.org"]
    start_urls = ["https://httpbin.org/"]

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url=url,
                callback=self.parse,
                # replace with the proxy credentials from your dashboard
                meta={'proxy': 'http://IProyal:password@hostname:3128'}
            )

    def parse(self, response):
        pass
5. Now you only need to save the file and run your Scrapy project commands in the terminal.
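For example, assuming the spider is still named example as above, you can run it and save the results to a JSON file (the output filename here is just an example):
scrapy crawl example -o results.json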
This is an easy proxy integration method that centralizes proxy configs in your Scrapy spider file. It's best when you don’t need to select proxies based on request content or other parameters dynamically. Configuring a proxy middleware is a better option if you need these features or want integration with external APIs.
Method 2: Scrapy Proxy Middleware
You can configure global proxy settings by installing a downloader middleware. It's a low-level system that lets you modify requests globally before they are sent. In this case, we'll use a pre-built proxy middleware, scrapy-rotating-proxies, which can be installed with a pip command.
pip install scrapy-rotating-proxies
After installation, open the settings.py file and modify it by adding your proxy list and downloader middleware of scrapy-rotating-proxies at the end of the code file.
ROTATING_PROXY_LIST = [
    'http://IProyal:password@hostname:3128',  # replace with your proxy credentials
    # you can add as many proxies as you need here
]

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
    'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
}
IPRoyal, however, provides rotating residential proxies by default, so it's unlikely you'll need this method. A single endpoint is provided, so there's no need to build a complete proxy list.
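As a minimal sketch of that simpler setup (the credentials, hostname, and port below are placeholders, not real IPRoyal values), you can point every request at the single rotating endpoint using the meta approach from Method 1 inside your spider class:
def start_requests(self):
    for url in self.start_urls:
        yield scrapy.Request(
            url=url,
            callback=self.parse,
            # one rotating endpoint; the provider changes the exit IP upstream
            meta={'proxy': 'http://username:password@rotating-endpoint:port'}
        )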
Alternatively, you can create a custom proxy middleware inside the middlewares.py file. Open the file in a code editor and create a new class that can achieve what you want. In our example, the class MyProxyMiddleware assigns a proxy to web requests if no proxy is already set.
class MyProxyMiddleware:
    def __init__(self, proxy_address):
        self.proxy_address = proxy_address

    @classmethod
    def from_crawler(cls, crawler):
        # read the proxy address from settings.py
        proxy_address = crawler.settings.get('PROXY_ADDRESS')
        return cls(proxy_address)

    def process_request(self, request, spider):
        # only assign the proxy if the request doesn't already have one
        if 'proxy' not in request.meta and self.proxy_address:
            request.meta['proxy'] = self.proxy_address
For the custom middleware to work with our Scrapy spider, we also need to edit settings.py with a proxy address and downloader middleware.
PROXY_ADDRESS = 'http://yourproxy:port'

DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.MyProxyMiddleware': 750,
}
Rotating Proxies Automatically
The proxy middleware examples we provided are a start, but their most significant advantage lies in their customization to fit your needs. One of the most common customizations is introducing automatic proxy rotation.
The easiest method to achieve rotation is by using IPRoyal's rotating proxies. Then, the IPs will rotate even if you have a static proxy setup in Scrapy. This might not be suitable for large-scale Scrapy projects that require more custom IP rotation.
We can use the already mentioned third-party middleware, scrapy-rotating-proxies. Just as before, we need to edit the settings.py file to add the downloader middleware and a rotating proxy list. Except this time, we'll add multiple static proxies that our Scrapy spider can access.
ROTATING_PROXY_LIST = [
    'http://IProyal:password@hostname1:3128',
    'http://IProyal:password@hostname2:3127',
    'http://IProyal:password@hostname3:3126',
    'http://IProyal:password@hostname4:3125',
    # you can add as many static proxies as you need here
]

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
    'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
}
This Scrapy proxy middleware will automatically rotate your private static proxy pool for each request by selecting a different proxy from your list, ensuring IP cycling. If a proxy is banned, proxy rotation doesn't stop. Instead, the restricted IPs are taken out of rotation, and requests are switched to others.
You can adjust download delays and concurrent requests of your Scrapy spider. Since the IP is changed with every new request, this will effectively change your IP rotation intervals.
DOWNLOAD_DELAY = 5 # delay 5 seconds between requests
CONCURRENT_REQUESTS_PER_DOMAIN = 1
Proxy reuse can happen with such proxy rotation middleware, so it's essential to use as large a static proxy pool as possible. Although the pool is managed automatically by blocking restricted IPs, it might be useful to tune the middleware's settings in settings.py to fit your project.
ROTATING_PROXY_CLOSE_SPIDER = True  # Close the spider when no working proxies are left
ROTATING_PROXY_PAGE_RETRY_TIMES = 5  # Retries with different proxies before a page is considered failed
ROTATING_PROXY_BAN_POLICY = 'rotating_proxies.policy.BanDetectionPolicy'  # Ban detection strategy
RETRY_ENABLED = True
RETRY_TIMES = 10  # Retry failed requests up to 10 times
Testing Your Scrapy Proxy Integration
We demonstrated how to write and run your Scrapy scripts in our full Scrapy tutorial. There are multiple ways of testing whether your Scrapy script sends requests through a proxy. Running a Scrapy shell command is the easiest one to implement since you don't need to write a dedicated script or modify existing ones.
1. Paste the command below into your terminal and wait for Scrapy to fetch the page. Since it's an IP echo URL, it will return your IP address.
scrapy shell 'https://ifconfig.me/ip'
2. Then you only need to print the IP address returned. If it's one of the proxy IPs you defined earlier, your proxy setup works fine. Keep in mind that the shell only applies proxies configured in settings.py or middleware, not those set via meta inside a spider.
print(response.text.strip())
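Alternatively, here's a minimal sketch of a dedicated test spider (the spider name is an assumption) that logs the exit IP for a few requests, which is handy for confirming that rotation actually changes the address:
import scrapy


class IPCheckSpider(scrapy.Spider):
    # hypothetical spider used only to verify the proxy setup
    name = "ipcheck"
    # request the IP echo endpoint a few times to observe rotation
    start_urls = ["https://ifconfig.me/ip"] * 3

    def parse(self, response):
        # with a working proxy setup, this logs proxy IPs, not your own
        self.logger.info("Exit IP: %s", response.text.strip())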
Best Practices for Proxy Management
- Per-request IP rotation is best for avoiding detection and when scraping most webpages.
- Avoid unreliable or free proxies as they are unlikely to bring the desired results, or can even be honeypots used to log activity and leak your IP.
- Rotating residential proxies are best for Scrapy projects, although for less strict websites, datacenter ones might work as well.
- Don't forget to rotate User-Agent and other headers together with proxies (see the sketch after this list).
- Monitor proxy authentication status and health by logging their success rate, latency, ban frequency, and other relevant metrics.
- Replace IPs and add fresh ones to your pool regularly.
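To illustrate the header rotation point above, here's a minimal sketch of a custom downloader middleware that picks a random User-Agent per request (the class name, user-agent strings, and priority are assumptions, not part of any library):
import random


class RandomUserAgentMiddleware:
    # a small example pool; use a larger, up-to-date list in practice
    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0 Safari/537.36",
        "Mozilla/5.0 (X11; Linux x86_64; rv:125.0) Gecko/20100101 Firefox/125.0",
    ]

    def process_request(self, request, spider):
        # overwrite the User-Agent header on every outgoing request
        request.headers["User-Agent"] = random.choice(self.USER_AGENTS)
Register it in DOWNLOADER_MIDDLEWARES in settings.py alongside your proxy middleware, for example as 'myproject.middlewares.RandomUserAgentMiddleware': 400.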
Conclusion
Mastering proxies in Scrapy will enable you to collect large-scale data from various websites. Yet, some websites might not work for reasons other than your proxy use. JavaScript-heavy sites, for example, are better scraped with rendering tools such as Splash, or may call for browser automation with Selenium combined with Beautiful Soup for parsing.
