
How to Use Cloudscraper in Python: 2024 Guide

Eugenijus Denisov


Cloudflare protects numerous websites from malicious attacks such as DDoS. The service is so popular that nearly every major company in the world uses it. While it provides a great service, the platform is somewhat restrictive and will often block web scraping attempts. Even when scraping is innocuous, Cloudflare will frequently deny access to the website, making it impossible to collect data at scale.

As such, the Cloudscraper Python library was developed to help scrape Cloudflare-protected websites. Most of its features revolve around detecting and solving the various anti-bot protections implemented by the company, allowing users to bypass Cloudflare entirely.

What Is Cloudscraper?

Cloudscraper is a Python library that supports web scraping projects by providing easy-to-use ways to bypass Cloudflare protection. While it’s somewhat of a niche library, the ubiquity of Cloudflare across major websites makes Cloudscraper irreplaceable in some projects.

The entire process relies on the foundational idea that Cloudflare should only challenge visitors when there’s strong suspicion they may be bots. After all, most challenges hurt the user experience, so they should be as rare as possible.

So, Cloudscraper uses various features to detect the different challenges the service can present to users. It then uses automation to attempt to solve them or, if a CAPTCHA is served, forwards it to a CAPTCHA-solving service.

While Cloudscraper may need a third-party service to fully bypass Cloudflare, the library itself is lightweight, easy to use, and seamless. It’s built on the popular Requests library, and it even detects when a website is not protected by Cloudflare. In that case, Cloudscraper behaves just like plain Requests.

How Does Cloudscraper Work?

If we read the open-source code, there are a few methods Cloudscraper uses to bypass Cloudflare: JavaScript handling, user agent management, and challenge solving.

JavaScript handling matters for two reasons. First, if Cloudflare detects that a client can’t render JavaScript at all, it will automatically issue a challenge, so scraping a Cloudflare-protected website without JavaScript support is a non-starter.

Second, even with a rendering engine, Cloudflare can trigger challenges and errors for other reasons. As such, the ability to solve JavaScript challenges becomes an essential part of bypassing Cloudflare.

You can even customize the engine that’s used to solve JavaScript challenges, although the default settings are fine for most users.

Outside of JavaScript, there are a few more methods Cloudscraper uses to bypass Cloudflare protection. Since it’s built on the Requests Python library, changing the user agent sent with each request is a necessity.

Requests uses a default user agent that gives away the fact that an HTTP request is being sent through the library. Since almost no legitimate users connect to websites this way, Cloudflare will almost always issue a challenge, if not block the connection attempt entirely.
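To illustrate the difference (the version numbers here are just examples, and the check below is a toy version of what anti-bot systems do, not Cloudflare’s actual logic):

```python
# The default Requests user agent announces the library itself
library_ua = 'python-requests/2.31.0'

# A user agent a real desktop Chrome browser might send
browser_ua = (
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
    'AppleWebKit/537.36 (KHTML, like Gecko) '
    'Chrome/120.0.0.0 Safari/537.36'
)

def looks_like_script(user_agent):
    # A user agent that names an HTTP library is an instant red flag
    suspicious = ('python-requests', 'curl', 'httpx', 'go-http-client')
    return user_agent.lower().startswith(suspicious)

print(looks_like_script(library_ua))   # → True
print(looks_like_script(browser_ua))   # → False
```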

As such, Cloudscraper not only replaces user agents but also uses optimized browser headers in general to avoid arousing suspicion. These are customizable, too, so you can, for example, pick mobile Chrome user agents if those seem to work better than a random selection.

Finally, Cloudscraper can bypass some browser fingerprinting challenges, though it’ll likely fail at more advanced ones. Advanced fingerprinting can check which plugins, fonts, and various other features are installed, which is impossible for Cloudscraper to simulate.

How to Use Cloudscraper in Python?

Assuming you have Python and an IDE installed, start a new project and open up the Terminal. Type in:

pip install cloudscraper

If you’ve used the Python Requests library, Cloudscraper will be an absolute breeze. Let’s send a simple GET HTTP request to the IPRoyal website using Cloudscraper:

import cloudscraper

# Create a Cloudscraper instance
scraper = cloudscraper.create_scraper()

# Perform a GET request to a Cloudflare-protected website
response = scraper.get('https://iproyal.com')

# Print the website's content
print(response.text)

There’s one degree of separation when using Cloudscraper: you first need to create a scraper instance. Even that isn’t anything special, as a Cloudscraper instance is simply an extended Requests session.

After that, we use Cloudscraper exactly as we’d use Requests. In fact, if you have an existing project that uses the latter, switching it over to Cloudscraper takes mere minutes.

Setting Up Proxies in Cloudscraper

To use proxies with Cloudscraper, we’ll need to create a dictionary whose key-value pairs map each protocol to a proxy URL (credentials, host, and port):

import cloudscraper

# Create a Cloudscraper instance
scraper = cloudscraper.create_scraper()

# Proxy dictionary
proxies = {
    'http': 'http://proxy_user:proxy_pass@proxy_host:proxy_port',
    'https': 'http://proxy_user:proxy_pass@proxy_host:proxy_port'
}

# Send a request through the proxy
response = scraper.get('https://iproyal.com', proxies=proxies)

# Print the content of the response
print(response.text)

All that’s changed is an additional dictionary object that stores our proxy data. It’s then added as an argument when sending a GET request through Cloudscraper.
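Since the same proxy URL is typically reused for both protocols, a small helper (hypothetical, not part of Cloudscraper; the host and port below are placeholders) can assemble the dictionary from its parts:

```python
def build_proxies(host, port, user=None, password=None):
    """Assemble a Requests-style proxy dictionary from its parts."""
    credentials = f'{user}:{password}@' if user and password else ''
    proxy_url = f'http://{credentials}{host}:{port}'
    # Requests (and therefore Cloudscraper) picks the entry matching
    # the scheme of the URL being requested
    return {'http': proxy_url, 'https': proxy_url}

proxies = build_proxies('proxy.example.com', 12345, 'user', 'pass')
print(proxies['https'])  # → http://user:pass@proxy.example.com:12345
```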

Handling CAPTCHAs Through Third Parties

While Cloudscraper will attempt to solve some CAPTCHAs on its own, you shouldn’t expect it to do all the legwork. CAPTCHAs evolve quickly, and Cloudscraper is maintained by a single developer who can’t keep up with every change, so it will likely fail often.

So, third-party CAPTCHA-solving services are likely to come into play. Cloudscraper anticipates such usage and supports several services out of the box:

import cloudscraper

# Create a Cloudscraper instance with a captcha service configured
scraper = cloudscraper.create_scraper(
    captcha={
        'provider': 'provider_name',
        'api_key': 'your_provider_api_key'
    }
)

# Perform a GET request to a Cloudflare-protected website
response = scraper.get('https://iproyal.com')

# Print the website's content
print(response.text)

According to the documentation, Cloudscraper supports these CAPTCHA services by default:

  • 2captcha
  • anticaptcha
  • CapSolver
  • CapMonster Cloud
  • deathbycaptcha
  • 9kw
  • return_response

Customizing Headers and Cookies With Cloudscraper

Finally, as mentioned previously, you can customize the header and cookie logic if required. There are two ways to do so, and it’s highly recommended to use Cloudscraper’s internal logic rather than forcing headers through Requests:

import cloudscraper

# Mobile Chrome User-Agents on Android
scraper_android_chrome = cloudscraper.create_scraper(
    browser={
        'browser': 'chrome',  # Specify the browser as Chrome
        'platform': 'android',  # Specify the platform as Android
        'desktop': False  # Set desktop to False to target mobile User-Agents
    }
)

# Check the User-Agent perceived by the website
response_android_chrome = scraper_android_chrome.get('https://httpbin.org/user-agent')
print("Android Chrome User-Agent:", response_android_chrome.json()['user-agent'])


# Desktop Firefox User-Agents on Windows
scraper_windows_firefox = cloudscraper.create_scraper(
    browser={
        'browser': 'firefox',  # Specify the browser as Firefox
        'platform': 'windows',  # Specify the platform as Windows
        'mobile': False  # Set mobile to False to target desktop User-Agents
    }
)

# Check the User-Agent perceived by the website
response_windows_firefox = scraper_windows_firefox.get('https://httpbin.org/user-agent')
print("Windows Firefox User-Agent:", response_windows_firefox.json()['user-agent'])


# Custom User-Agent
scraper_custom = cloudscraper.create_scraper(
    browser={
        'custom': 'ScraperBot/1.0',  # Define a custom User-Agent string
    }
)

# Check the User-Agent perceived by the website
response_custom = scraper_custom.get('https://httpbin.org/user-agent')
print("Custom User-Agent:", response_custom.json()['user-agent'])

By using the first method, you’ll be changing the Cloudscraper default user agent logic. Cloudscraper uses a randomization system to pick out user agents according to your settings (or their default ones).
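Conceptually, that randomization amounts to drawing from a pool of user agents matching your filters. The sketch below is only an illustration of the idea, not Cloudscraper’s actual implementation:

```python
import random

# A tiny illustrative pool; Cloudscraper ships with a far larger one
mobile_chrome_uas = [
    'Mozilla/5.0 (Linux; Android 13; Pixel 7) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/120.0.0.0 Mobile Safari/537.36',
    'Mozilla/5.0 (Linux; Android 12; SM-G991B) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/119.0.0.0 Mobile Safari/537.36',
]

# Each new scraper instance effectively gets one entry from the pool
chosen = random.choice(mobile_chrome_uas)
print('Mobile' in chosen)  # → True
```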

On the other hand, you can force headers just like through Requests:

import cloudscraper

# Create a Cloudscraper instance
scraper = cloudscraper.create_scraper()

# Define custom headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    'Accept-Language': 'en-US,en;q=0.9',
}

# Define cookies
cookies = {
    'sessionid': '1234567890abcdef'
}

# Make a request with custom headers and cookies
response_headers = scraper.get('https://httpbin.org/headers', headers=headers, cookies=cookies)
response_cookies = scraper.get('https://httpbin.org/cookies', headers=headers, cookies=cookies)

# Print the headers received by the server
print("Headers perceived by the server:")
print(response_headers.json())

# Print the cookies received by the server
print("\nCookies perceived by the server:")
print(response_cookies.json())

Doing so isn’t recommended, as Cloudscraper already ships with optimized header handling. Unless you have a good reason to override the default Cloudscraper logic, stick with the first method.

Alternatives to Cloudscraper

Cloudscraper is a powerful library that’ll greatly reduce block rates on Cloudflare-protected websites. It isn’t a panacea, however, and some websites will still be impassable. For those, you may need other methods or libraries:

  • Selenium

A classic browser automation library that has a lot of plugins and add-ons that improve evasion techniques, including those of Cloudflare.

  • Requests-HTML

An alternative library that does not bypass Cloudflare by default but does render JavaScript, so it could be useful in some applications.

  • Playwright

A modern browser automation library that has a lot of customization and evasion capabilities.


Each of the alternatives can be useful in certain scenarios, although browser automation libraries will be somewhat slower than direct HTTP request libraries. On the other hand, they will be less detectable. So, it’s all about making the correct trade-offs.


Author

Eugenijus Denisov

Senior Software Engineer

With over a decade of experience under his belt, Eugenijus has worked on a wide range of projects - from LMS (learning management system) to large-scale custom solutions for businesses and the medical sector. Proficient in PHP, Vue.js, Docker, MySQL, and TypeScript, Eugenijus is dedicated to writing high-quality code while fostering a collaborative team environment and optimizing work processes. Outside of work, you’ll find him running marathons and cycling challenging routes to recharge mentally and build self-confidence.
