Solving 403 Forbidden Errors in Python Requests
Vilius Dumcius
Receiving a “403 Forbidden” HTTP error is par for the course in web scraping. There are various strategies to avoid the error, depending on the programming language, libraries, user agents, and many other factors.
In this guide, we’ll assume you’re using the Python requests library. “403 Forbidden” errors are quite frequent when using requests, but the library is so fast and efficient that there’s good reason to keep using it.
Causes of “403 Forbidden” Errors
While anti-bot systems are a common culprit, the “403 Forbidden” HTTP error can be returned for a multitude of reasons. So, it’s best not to assume you’ve triggered an anti-bot system every time you see it while web scraping.
Authentication Errors
Initially, the “403 Forbidden” HTTP error was intended for pages that the user did not have the authority to access. For example, some pages may be hidden behind logins or other types of authentication. In that case, a “403 Forbidden” error would be thrown.
It’s also used to gate pages that are reserved for specific IP addresses and the like. So, you can still run into the “403 Forbidden” error even if you’ve not received a ban or restriction.
Therefore, receiving a few of these errors while web scraping may not even be a problem. Some pages are simply placed behind a login you don’t have access to, or restricted for other reasons.
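If you do have credentials for a page protected by HTTP Basic Auth, requests can send them for you. The URL and credentials below are placeholders, shown only as a minimal sketch:

import requests

# Hypothetical protected page and credentials, for illustration only
url = 'https://example.com/protected-page'

# Without valid credentials, a 401 or 403 response is expected here
response = requests.get(url, auth=('your_username', 'your_password'))

if response.status_code == 403:
    print('Access denied - the page likely requires different credentials or permissions.')
else:
    print(f'Status code: {response.status_code}')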
User-agent Restrictions
A user agent is a string of metadata about the machine sending an HTTP request to a web server. It usually includes information about your browser, OS, and other details. Servers generally use the user agent to return correctly formatted content to your machine.
There are some caveats, however. Since it exposes important metadata, a user agent can also be used to infer the user’s intentions. Additionally, there are technically no rules on what a user agent string must contain.
So, some webmasters track user agents and flag suspicious ones. Additionally, the Python requests library sends a default User-Agent header that reveals the request is coming from the library, so many webmasters block it by default.
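As a quick illustration (using httpbin.org purely as an echo service), you can see exactly what the library announces about itself:

import requests

# The library exposes its default User-Agent string, which looks like
# 'python-requests/<version>' - an easy signature for webmasters to block.
print(requests.utils.default_user_agent())

# The same value is sent whenever you don't override the headers
response = requests.get('https://httpbin.org/headers')
print(response.json()['headers']['User-Agent'])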
IP Address Blocking
Another common issue when web scraping is having your IP address blocked. While it can be displayed in a multitude of different ways, getting a “403 Forbidden” error is quite a common occurrence.
If your IP address is blocked, changing the user agent won’t help. Additionally, you’ll get the same error message on every URL you crawl, which is a good indicator that you’ve been blocked.
Solving “403 Forbidden” Errors
Depending on the cause, different strategies may be used to avoid the error message. Note that if you need authentication to access the page, none of these strategies will help.
Switch Your User-agent
If your IP address hasn’t been banned, switching your user agent is the way to go. In fact, you should always replace the default Python requests user agent as it’s frequently banned on many websites.
Even if it’s not banned outright, it will instantly arouse some level of suspicion. Luckily, requests lets you manually replace the default user agent with one of your liking quite easily.
import requests

url = 'https://iproyal.com'

# Spoof a common desktop browser instead of the default requests user agent
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}

response = requests.get(url, headers=headers)

if response.status_code == 200:
    print('Success!')
else:
    print(f'Failed with status code: {response.status_code}')
All you need to do is create a headers dictionary and use the 'User-Agent' key to assign any user agent. Usually, it’s best to pick something popular to blend in with other users.
It’s recommended to implement a rotation algorithm. Rotating user agents will make it a lot harder for bot detection systems to prevent your web scraping activities.
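A minimal sketch of such a rotation, cycling through a small assumed pool of common desktop user agents on each request, could look like this:

import requests
from itertools import cycle

# A small, assumed pool of common desktop user agents - extend as needed
user_agents = cycle([
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.5 Safari/605.1.15',
    'Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0',
])

urls = ['https://iproyal.com', 'https://iproyal.com/blog/']  # example targets

for url in urls:
    # Pick the next user agent in the cycle for every request
    headers = {'User-Agent': next(user_agents)}
    response = requests.get(url, headers=headers)
    print(url, response.status_code)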
Use Rotating Proxies
If your IP address has been banned, no amount of user agent switching will save you. Rotating proxies allow you to switch IP addresses on every request or at intervals set by the provider.
import requests
from itertools import cycle

# A pool of proxy endpoints (placeholder addresses) cycled on every request
proxy_pool = cycle([
    'http://10.10.1.10:3128',
    'http://10.10.1.11:1080',
])

url = 'https://example.com'

for _ in range(5):
    proxy = next(proxy_pool)
    # Route both HTTP and HTTPS traffic through the current proxy
    response = requests.get(url, proxies={'http': proxy, 'https': proxy})
    if response.status_code == 200:
        print('Success!')
    else:
        print(f'Failed with status code: {response.status_code}')
Similarly to user agents, we list the proxy endpoints (here just placeholder IP addresses and ports) and use itertools.cycle for memory-efficient cycling through them. On each request, the current proxy is assigned to both the HTTP and HTTPS keys of the proxies dictionary. You can run the for loop for as long as necessary, usually once per URL you have.
Note that if you use a provider that gives you a single endpoint where rotation happens automatically, you won’t need to cycle proxies yourself.
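For illustration, with a hypothetical rotating endpoint and placeholder credentials, the proxies dictionary stays the same while the provider changes the exit IP behind the scenes:

import requests

# Hypothetical rotating endpoint - replace with your provider's hostname,
# port, and credentials. Rotation happens on the provider's side.
proxy = 'http://username:password@rotating.example-provider.com:12321'
proxies = {'http': proxy, 'https': proxy}

for _ in range(3):
    # Each request can exit through a different IP with no client-side cycling
    response = requests.get('https://example.com', proxies=proxies)
    print(response.status_code)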
Implement Rate Limiting
Some websites have a simple system that returns a 403 error if you send too many requests in a short period of time. Rate limiting is the practice of adding delays before each request to avoid triggering that restriction.
All it takes is a simple “time.sleep(x)” call in your loop to implement rate limiting. There are plenty of other ways to achieve the same thing, but using sleep is a simple and effective way to test whether you’re hitting rate limits.
Implementing a retry system is also valuable as it performs a similar function while also reducing the false positive rate of your web scraping project.
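A minimal sketch combining both ideas, a fixed delay between requests plus a few retries with a growing pause, could look like this (the delay values are arbitrary and should be tuned per site):

import time
import requests

url = 'https://example.com'

def fetch_with_retries(url, retries=3, delay=2):
    """Retry a request a few times, sleeping longer after each failure."""
    for attempt in range(retries):
        response = requests.get(url)
        if response.status_code == 200:
            return response
        # Back off a little longer on each failed attempt
        time.sleep(delay * (attempt + 1))
    return response

for _ in range(5):
    result = fetch_with_retries(url)
    print(result.status_code)
    # Fixed delay between consecutive requests to avoid hitting rate limits
    time.sleep(2)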
Use a Headless Browser
If a website’s bot detection systems are still being a pain, your only course of action may be to switch to a headless browser. Bot detection algorithms have a harder time because a real browser sends requests that look far less suspicious.
Additionally, the Python requests library can’t render JavaScript, so a headless browser will also let you extract information from any source that requires JavaScript rendering.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Run Chrome without a visible window
options = Options()
options.add_argument("--headless=new")

driver = webdriver.Chrome(options=options)

# Load the page and grab the fully rendered HTML
driver.get('https://iproyal.com')
content = driver.page_source
print(content)

driver.quit()
Final Thoughts
These methods should allow you to bypass most “403 Forbidden” error messages. Usually, you’ll need to implement a combination of some of them (for example, if sending a request directly throws the error, switch to Selenium), as the same website can have many reasons to give you such an error.