How to Read a Robots.txt File for Effective Web Scraping
Justas Palekas
Robots.txt is a basic text file that almost every website has. You can access it by visiting any website’s homepage and adding /robots.txt at the end of the URL (for example: https://iproyal.com/robots.txt).
While it’s a simple text file, robots.txt is extremely important for all automated browsing, including crawling and scraping. It was originally intended to provide directives for search engine crawlers, but it now applies to all automation.
What Is Robots.txt?
The robots.txt file is a specifically formatted file, available on nearly all websites, that provides directives for bots and automated scripts. Usually, it includes directives about which parts of the website can or cannot be accessed, along with user agent lines indicating which crawlers the rules apply to.
The robots.txt file follows a specific set of formatting rules. For example, the following configuration disallows crawling the /private/ folder of the website for every user agent:
User-agent: *
Disallow: /private/
On the other hand, you can edit the robots.txt file so that a specific user agent is allowed to crawl a page that others cannot:
User-agent: *
Disallow: /private/
User-agent: Googlebot
Disallow: /private/
Allow: /private/public-page.html
In such a case, every user agent is forbidden from accessing /private/, but Googlebot is explicitly allowed to visit the single page /private/public-page.html.
It’s important to note that the robots.txt file is a directive, not a cybersecurity tool. Crawlers and scrapers can potentially ignore these directives and still visit all publicly visible pages. It’s largely considered unethical, however, so most larger companies follow the robots.txt file closely.
Additionally, even if you forbid a user agent (or a list of them) from crawling a specific part of the website, it may still accidentally visit it. For example, if an internal link leads to a forbidden but otherwise publicly accessible page, a crawler may unintentionally visit that page by following the link.
As such, the robots.txt file is important for both web crawlers and SEO. For the former, the robots.txt file provides clear-cut directives on what they can and cannot visit. For the latter, the robots.txt file tells web crawlers which pages can be indexed and added to search engine result pages.
Basic Structure of the Robots.txt File
As mentioned above, the robots.txt file has a specific syntax and structure that has to be followed for maximum effect. There are three main structural elements:
- User agent directive
This element states which rules apply to which user agents.
- Allow/disallow rules
These rules indicate which user agent can or cannot visit a specific part of the website (or even a singular page).
- Special characters and wildcards
While the robots.txt file does not fully support regular expressions, there are a few special characters that you can use. The wildcard (*) matches any sequence of characters, while the dollar sign ($) indicates the end of a URL.
Let’s look back to our previous example:
User-agent: *
Disallow: /private/
User-agent: Googlebot
Disallow: /private/
Allow: /private/public-page.html
User agent directives always come first, as they indicate which user agents the following rules apply to. In the first block, the user agent is set to a wildcard, meaning the rules match every possible user agent. In other words, any user agent is forbidden from accessing /private/.
Disallowing specific areas is important for SEO, as almost every search engine assigns a crawl budget to each website. A crawl budget is the number of URLs a search engine crawler will go through before giving up. If your website is too large, the crawler may miss important areas and fail to index them.
Another way to manage the crawl budget and apply rules is the dollar sign, which indicates the end of a specific URL. While it’s used somewhat rarely, it may be extremely effective in specific cases. For example:
User-agent: *
Disallow: /*?print=true$
Such a rule disallows crawling all print versions of the website’s pages for all user agents. Additionally, websites with query parameters and tags (such as tracking links) can use wildcards to reduce the likelihood of a search engine crawling duplicate content:
User-agent: *
Disallow: /*?utm_source=*
In this case, every user agent is directed to avoid crawling any URL that contains a “utm_source” parameter.
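To make the matching concrete, here’s a minimal sketch that translates such a pattern into a regular expression and tests a few example paths against it. The translation logic and the sample paths are illustrative assumptions, not part of any official robots.txt library:
import re

def robots_pattern_to_regex(pattern):
    # Escape the pattern, then map the robots.txt wildcard (*) to ".*"
    # and the end-of-URL marker ($) to the regex end anchor.
    regex = re.escape(pattern).replace(r"\*", ".*").replace(r"\$", "$")
    return re.compile(regex)

rule = robots_pattern_to_regex("/*?utm_source=*")

# The first two paths carry a utm_source parameter, so they match the rule;
# the last one doesn't and would still be crawlable.
for path in ["/blog?utm_source=newsletter", "/pricing?utm_source=ad&x=1", "/blog"]:
    print(path, "->", "disallowed" if rule.match(path) else "allowed")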
How to Read a Robots.txt File: Step-by-step
On almost every website, the robots.txt file is stored within the root directory and can be accessed by adding /robots.txt to the URL. For example, to access IPRoyal’s robots.txt file, you’d visit https://iproyal.com/robots.txt.
Since it’s just a text file, accessing it through various scripts is equally as easy. A common way to access the robots.txt file with Python is to use the “requests” library to send a GET request to the URL:
import requests

# Fetch the robots.txt file and print its contents
response = requests.get("http://example.com/robots.txt")
print(response.text)
All of the text stored in the robots.txt file will be printed to the output. While that shows how easy it is to access and read the robots.txt file, you’ll likely want to save a copy to reduce the number of times you need to fetch it during your web scraping project:
import requests

# URL of the robots.txt file
url = "http://example.com/robots.txt"

# Send a GET request to the URL
response = requests.get(url)

# Save the content of the response to a file
with open("robots.txt", "w") as file:
    file.write(response.text)

print("robots.txt file downloaded and saved.")
You’ll then be able to open the robots.txt file whenever you want. However, you may want to establish a naming convention if you intend to crawl many websites.
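One simple convention, sketched below, is to name each saved copy after the site’s hostname so the files never overwrite each other. The filename format here is just an assumption you can adapt:
import requests
from urllib.parse import urlparse

def save_robots_txt(site_url):
    # Derive a per-site filename from the hostname, e.g. "iproyal.com_robots.txt"
    hostname = urlparse(site_url).netloc
    filename = f"{hostname}_robots.txt"
    response = requests.get(f"{site_url.rstrip('/')}/robots.txt")
    with open(filename, "w") as file:
        file.write(response.text)
    return filename

print(save_robots_txt("https://iproyal.com"))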
On the other hand, if you’re a website owner, you may want to check Google Search Console to see whether your robots.txt file is formatted correctly. Google Search Console has a dedicated Robots.txt tester that can be used to validate your file.
Best Practices for Web Scraping with Robots.txt
Search engines have nailed the process of following robots.txt, so any scraping or crawling project should follow in their footsteps. Before they visit and crawl a website, search engines check the robots.txt file for directives and follow them as closely as possible.
Your process should be nearly identical. Website administrators have good reasons for forbidding access to specific pages (usually to reduce the server load or to prevent search engines from entering an infinite loop of crawling).
Since your user agent is unlikely to be widely known (unlike, say, Googlebot or Bingbot), you usually only need to look for disallow rules under the wildcard user agent. Since the formatting is consistent, you can parse the robots.txt file with a short script, using the “re” library for pattern matching:
import requests
import re

# Download and save the robots.txt file
url = "http://example.com/robots.txt"
response = requests.get(url)

with open("robots.txt", "w") as file:
    file.write(response.text)

print("robots.txt file downloaded and saved.")

# Parse the robots.txt file to extract disallowed paths
def parse_robots_txt(user_agent, file_path="robots.txt"):
    with open(file_path, "r") as file:
        lines = file.readlines()

    disallowed_paths = []
    current_user_agent = None

    for line in lines:
        line = line.strip()
        if line.startswith("User-agent:"):
            current_user_agent = line.split(":", 1)[1].strip()
        elif line.startswith("Disallow:") and current_user_agent == user_agent:
            path = line.split(":", 1)[1].strip()
            if path:
                disallowed_paths.append(path)

    return disallowed_paths

disallowed_paths = parse_robots_txt("*")
print("Disallowed paths:", disallowed_paths)

# Function to check if a URL path is disallowed
def is_url_disallowed(url_path, disallowed_paths):
    for path in disallowed_paths:
        # Translate the robots.txt wildcard (*) into its regex equivalent (.*)
        pattern = re.escape(path).replace("\\*", ".*")
        if re.match(pattern, url_path):
            return True
    return False
The script outlined above downloads the robots.txt file, parses it for disallow rules that apply to the wildcard user agent, and collects the paths in a “disallowed_paths” list. The is_url_disallowed() function then uses regular expressions to check whether a given URL path matches any of those rules, so you can make sure you’re following the robots.txt directives when scraping.
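For example, assuming the parsed file contained rules like the /private/ and utm_source examples above, a quick check before each request might look like this (the paths are placeholders):
# Check a few candidate paths against the parsed rules before scraping them.
for url_path in ["/private/reports", "/blog/how-to-scrape", "/page?utm_source=ad"]:
    if is_url_disallowed(url_path, disallowed_paths):
        print(f"Skipping {url_path} (disallowed by robots.txt)")
    else:
        print(f"OK to scrape {url_path}")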
While there are more optimal ways to avoid disallowed paths, efficiency is not as high a priority if you’re not running a search engine. Once you reach that scale, you’ll need many modifications to maintain a proper crawl rate.
Additionally, regardless of whether you follow the robots.txt directives, you should limit your crawl rate to a reasonable degree, especially during peak server hours. A web scraper or crawler already sends many more requests than a regular user, so you may negatively affect the website’s user experience.
You can implement such a slowdown relatively easily by routinely checking response times. If they get too high, reduce your crawl rate to avoid overloading the server. Search engines may be less concerned about such issues, as most website owners want Google and other search engines to crawl their pages.
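One such alternative is Python’s built-in urllib.robotparser module, which downloads and parses the file and answers “can I fetch this URL?” queries for you. A minimal sketch, with placeholder URLs:
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.set_url("http://example.com/robots.txt")
parser.read()

# can_fetch() checks whether the given user agent may request the URL
# according to the rules in the parsed robots.txt file.
print(parser.can_fetch("*", "http://example.com/private/page.html"))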
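A minimal sketch of such a slowdown, assuming the URLs and thresholds below are placeholders you’d tune to the target site, might look like this:
import time
import requests

delay = 1.0  # seconds between requests, adjusted as the server responds

for url in ["http://example.com/page1", "http://example.com/page2"]:
    response = requests.get(url)
    # If responses are getting slow, back off; if the server is fast, relax the delay a little.
    if response.elapsed.total_seconds() > 2.0:
        delay = min(delay * 2, 30.0)
    else:
        delay = max(delay * 0.8, 1.0)
    time.sleep(delay)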
Conclusion
Search engines, scrapers, and crawlers should follow the robots.txt file directives, regardless of their purpose. Search engines already default to checking the directives, so your web scraping project should be no different.
Luckily, parsing the file is relatively easy. You mostly only need to follow the wildcard directives, as the rules for other user agents are unlikely to apply to your project. Since the file is so easy to find, you can collect the disallowed paths in no time and create rules for your scraper to follow.
Author
Justas Palekas
Head of Product
Since day one, Justas has been essential in defining the way IPRoyal presents itself to the world. His experience in the proxy and marketing industry enabled IPRoyal to stay at the forefront of innovation, actively shaping the proxy business landscape. Justas focuses on developing and fine-tuning marketing strategies, attending industry-related events, and studying user behavior to ensure the best experience for IPRoyal clients worldwide. Outside of work, you’ll find him exploring the complexities of human behavior or delving into the startup ecosystem.