How to Build a Web Scraper in Python (Step-by-Step)

Learn how to build a web scraper in Python with this step-by-step guide. Gain practical skills and extract data efficiently.

Eugenijus Denisov

Last updated · 12 min read

Key Takeaways

  • Python simplifies web scraping with tools like requests, BeautifulSoup, and Selenium for static and dynamic content.

  • Build a scraper in a few steps: inspect the page, fetch HTML, extract data, and save to CSV or JSON.

  • Avoid blocks by rotating headers, respecting robots.txt, and using frameworks like Scrapy for larger tasks.

Python is one of the top choices for web scraping due to its large community, ease of use, and extensive library support. Python web scraping tools, such as BeautifulSoup and Scrapy, make it easier to work with HTML elements, and Selenium handles JavaScript for dynamic websites.

In this guide, you’ll learn how to scrape data from both static and dynamic websites, how to avoid getting blocked, and how to put it all together into a working project.

To follow this tutorial, you’ll need:

  • Python 3.7 or higher
  • Pip (Python package manager)
  • A little understanding of HTML and an IDE

To get started, install the essentials by running the following command in your IDE’s Terminal:

pip install requests beautifulsoup4 lxml selenium pandas

If you’d like to learn more about Python web scraping in general and how it works before trying to build a scraper, check out our step-by-step guide to Python web scraping.

How to Build a Web Scraper in Python: Step-by-Step Guide

Python web scraping is manageable when you have the right tools and information at your disposal. Here’s a step-by-step guide that covers everything from inspecting HTML elements and fetching web pages to parsing HTML content, extracting the data, and saving it.

1. Inspect the Page Structure

Before you start writing code, open a website in Chrome or Firefox, then right-click and choose 'Inspect'. Now you can see the website’s HTML elements.

Look closely at the tags, class names, IDs, and more. You’ll use these to target data, and it’s where the data extraction process begins.

Sometimes the pages can be messy or full of nested tags, but that’s part of the process. It will take some trial and error before you get it right.

2. Fetch the Web Page With requests

Use the requests library to get the HTML content of a page:

import requests

response = requests.get('http://example.com')
html = response.text
print(html)

The Python requests module makes this part easy, and you now have the full page’s code.
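By default, requests won’t raise an error if the server responds with a 4xx or 5xx status, so it’s worth checking before you parse anything. A minimal sketch using the library’s built-in raise_for_status() call and a timeout:

import requests

response = requests.get('http://example.com', timeout=10)
response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
html = response.text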

3. Parse Content Using BeautifulSoup

Now pass that HTML content to BeautifulSoup. Add another import at the top and create the soup variable below the html one. You can remove the print() call for now:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')

You can search by tag name or CSS selector. This step is key for parsing HTML.

titles = soup.select('h2.title')

4. Extract the Desired Data

Once you have the right selectors, you can start extracting data:

for title in titles:
    print(title.text.strip())

You can use this method for prices, names, links, or any other data extraction task.
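Links are a good example of data that lives in an element’s attributes rather than its text. A short sketch, assuming the page uses standard anchor tags, that collects every href with BeautifulSoup:

for link in soup.select('a'):
    href = link.get('href')  # read the attribute instead of .text
    if href:
        print(href)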

5. Save Data to CSV or JSON

Now it’s time to store the scraped data. Follow the same process as before: imports at the top, the rest of the code at the bottom:

import pandas as pd

data = {'titles': [t.text.strip() for t in titles]}
df = pd.DataFrame(data)
df.to_csv('output.csv', index=False)

It will help you keep your web scraping results organized.
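If you’d rather have JSON than CSV, the same data dictionary can be written with Python’s built-in json module. A small sketch reusing the data variable from above:

import json

with open('output.json', 'w', encoding='utf-8') as f:
    json.dump(data, f, ensure_ascii=False, indent=2)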

Full Example

Let’s put everything together and scrape quotes.toscrape.com:

import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'http://quotes.toscrape.com'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml')

quotes = soup.find_all('span', class_='text')
authors = soup.find_all('small', class_='author')

data = [{'quote': q.text, 'author': a.text} for q, a in zip(quotes, authors)]

df = pd.DataFrame(data)
df.to_csv('quotes.csv', index=False)

This Python web scraper retrieves quotes and their authors in matching pairs. It’s a simple project for testing and developing your web scraping skills in Python.

Scraping JavaScript & Dynamic Sites

Python web scraping can also handle dynamic websites that use JavaScript to load HTML content. You can scrape data from such websites using Selenium. It’s slower compared to requests, but it’s the best way to get HTML elements from interactive or changing websites.

Why Static Scraping Fails

Sometimes the HTML content you fetch is empty or missing information. That’s because some websites load with JavaScript after the initial page loads. It breaks web scraping in Python if you’re only using the requests library.
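A quick way to confirm this is to fetch the page with requests and check whether the element you saw in DevTools actually appears in the response. A sketch, with 'div.product' as a stand-in for whatever selector you found while inspecting the page:

import requests
from bs4 import BeautifulSoup

html = requests.get('http://example.com', timeout=10).text
soup = BeautifulSoup(html, 'lxml')

# An empty list for a selector you can clearly see in DevTools usually means
# the content is rendered by JavaScript after the initial page load.
print(soup.select('div.product'))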

Using Selenium for Dynamic Content

Selenium can help because it controls a real browser and can load JavaScript:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless=new")
driver = webdriver.Chrome(options=options)
driver.get('http://example.com')
html = driver.page_source

Using this, you can grab HTML elements even from JavaScript-heavy sites. The downside is that it’s noticeably slower than requests.
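Since driver.page_source is just a string of HTML, you can hand it straight to BeautifulSoup and reuse the same parsing code as before. A short sketch (remember to close the browser when you’re done):

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
titles = soup.select('h2.title')  # same selectors as in the static example
driver.quit()  # free the browser once you're finished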

Handling JavaScript Rendering and Delays

Some sites load content after a delay. Selenium can wait until the data is ready:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, 'product-item'))
)

You can also scroll, click buttons, or wait for specific HTML elements to appear. These tactics make extracting data from dynamic pages not only possible but also efficient.
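Scrolling, for instance, is just a JavaScript call executed through the driver. A minimal sketch that scrolls to the bottom of the page and gives lazy-loaded content a moment to appear:

import time

driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
time.sleep(2)  # give new content time to load
html = driver.page_source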

Scaling Up With Frameworks: Intro to Scrapy

Once you’ve got a basic Python web scraper working, you’ll start to see the limits of simple scripts. But Scrapy can fix that.

Why Use a Framework Like Scrapy

Scrapy is made for Python web scraping, as the name suggests. It’s faster and more organized than writing everything by hand. You get better speed, less messy code, and support for multiple pages out of the box.

It also helps manage your data extraction flow cleanly with things like items and pipelines. Using Scrapy makes it easier to manage large projects without everything turning into spaghetti.

Creating a Basic Scrapy Spider

Start by installing Scrapy:

pip install scrapy

Then create a project:

scrapy startproject myproject
cd myproject
scrapy genspider quotes quotes.toscrape.com

Inside your spider, define how you want to extract data from the HTML elements. The spider file lives in the 'spiders' directory inside the 'myproject' folder:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com']

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get()
            }

It gives web scraping in Python a more solid structure where you don’t even need the requests library or BeautifulSoup. Scrapy parses HTML for you.

Crawling Multiple Pages

Scrapy also makes pagination simple. You just need to find the 'Next' button and follow it:

next_page = response.css('li.next a::attr(href)').get()
if next_page:
    yield response.follow(next_page, self.parse)

Place this block inside the parse() method of your Scrapy spider, right after you finish extracting data from the current page. That way, Scrapy first grabs the content you want, then checks if there’s another page, and if so, continues crawling automatically.
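Put together, the parse() method of the spider might look like this, using the same selectors as above with the pagination check at the end:

    def parse(self, response):
        # Extract the data on the current page first
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get()
            }

        # Then follow the 'Next' link, if there is one
        next_page = response.css('li.next a::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)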

Now you can scrape data from multiple pages, which is essential for full-site data extraction whenever the content spans more than one page.

Exporting Data With Pipelines

You can use Scrapy to export your scraped data to many formats:

scrapy crawl quotes -o quotes.json

Your output will be in your project directory.

You can also set up pipelines to clean data, send it to a database, or convert formats. It’s way more efficient than just saving with pandas.
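As a rough idea, a pipeline is just a class with a process_item() method that every scraped item passes through. A minimal sketch (the class name is hypothetical, and you’d still need to enable it under ITEM_PIPELINES in settings.py) that strips extra whitespace from each quote:

# pipelines.py
class CleanQuotePipeline:
    def process_item(self, item, spider):
        item['text'] = item['text'].strip()
        return item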

When to Choose Scrapy Over Scripts

In short, you should use Scrapy when you:

  • Need to crawl hundreds or thousands of pages quickly
  • Want to schedule jobs
  • Are managing multiple spiders

It’s the go-to tool for large-scale Python web scraping tasks.

Advanced Python Web Scraping Techniques

Avoiding Blocks and Anti-Bot Measures

As you scale your web scraper up, websites will attempt to block you. You’ll need to take steps to avoid that as much as possible:

  • Change your user agent string with each request or as frequently as required
  • Use different headers and cookies
  • Add time delays or random pauses
  • Retry failed requests after a short wait

To ensure that your operations don’t halt easily, consider adding proxies to your web scraper. Rotating proxies and residential IPs help you stay anonymous. For example, at IPRoyal, we offer pools of clean IPs that reduce your chances of getting blocked.

For large-scale web scraping with Python, it makes a significant difference. This way, you’ll spend more time extracting data and less time solving tedious errors.
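As a rough sketch of what this looks like with the requests library (the user agent strings below are just examples, and the proxy URL is a placeholder you’d swap for your own credentials):

import random
import time
import requests

user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
]

proxies = {
    "http": "http://username:password@proxy.example.com:8080",   # placeholder
    "https": "http://username:password@proxy.example.com:8080",  # placeholder
}

headers = {"User-Agent": random.choice(user_agents)}
response = requests.get("http://quotes.toscrape.com", headers=headers,
                        proxies=proxies, timeout=15)
time.sleep(random.uniform(1, 3))  # random pause between requests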

Handling Complex Content

Some websites use AJAX or endless scroll, which breaks traditional scraping methods. To handle that efficiently:

  • Use Selenium to wait for content to appear
  • Use browser dev tools to find the real API endpoint and skip the UI altogether (see the sketch after this list)
  • Scroll down the page using JavaScript inside Selenium
  • Use schedule with Python or system cron jobs to run your scraper automatically
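For the API approach, once you’ve spotted the request in the browser’s Network tab, you can often call that endpoint directly and get structured JSON back. A sketch with a purely hypothetical endpoint URL:

import requests

# Hypothetical endpoint found in the browser's Network tab
api_url = "http://example.com/api/products?page=1"
response = requests.get(api_url, timeout=15)
response.raise_for_status()

items = response.json()  # structured data, no HTML parsing needed
print(items)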

When it comes to sites with Cloudflare or reCAPTCHA protection, it becomes more challenging. You’ll most likely need CAPTCHA solvers or a bypass service.

This part of web scraping with Python gets more technical, but it also gives you access to more data.

Is Web Scraping Legal?

The legality of web scraping depends on where you live and what you’re scraping. Here’s a brief overview:

  • In the US, public, non-personal, non-copyrighted data is often safe to scrape, but a website’s terms of service can forbid it.
  • In the EU, GDPR rules apply. If you’re extracting data that contains personal information, you need a lawful basis, which can be either consent or legitimate interest. You also need to follow data protection rules.
  • In other regions, laws may vary. It’s always best to conduct thorough research before scraping any information.

As a general rule, it’s recommended to consult with a legal professional first to obtain guidance. Also, remember to respect the site’s rules to avoid potential ethical or legal issues. Always check a site’s robots.txt, don’t ignore the terms of service, and limit request speed to avoid overloading servers.

In short, web scraping isn’t illegal by default, but ignoring terms of service and scraping personal or private information can get you into trouble. Be responsible and respectful when building your Python web scraper.

Full End-to-End Example

Now you’ve got the individual pieces, so let’s see how they all work together. A Python web scraper follows a simple flow: fetch the page, parse the HTML, extract the data, and save the results. First, install the dependencies if you haven’t already:

pip install requests beautifulsoup4 lxml pandas

Then the full script might look like this:

import time
import json
import requests
import pandas as pd
from bs4 import BeautifulSoup
from urllib.parse import urljoin

BASE_URL = "http://quotes.toscrape.com/"

# 1) Fetch: grab raw HTML from a URL
def fetch_html(url, timeout=15):
    headers = {
        "User-Agent": "Mozilla/5.0",
        "Accept-Language": "en-US,en;q=0.9"
    }
    r = requests.get(url, headers=headers, timeout=timeout)
    r.raise_for_status()
    return r.text

# 2) Parse: pull fields from the page
def parse_page(html, base_url):
    soup = BeautifulSoup(html, "lxml")
    rows = []

    # Locate quote blocks
    for q in soup.select("div.quote"):
        quote_text = q.select_one("span.text")
        author = q.select_one("small.author")
        tags = [t.get_text(strip=True) for t in q.select("div.tags a.tag")]

        # Clean values
        text_clean = (quote_text.get_text(strip=True) if quote_text else "").strip('“”')  # strip curly quote marks
        author_clean = author.get_text(strip=True) if author else ""
        tags_clean = ", ".join(tags)

        rows.append(
            {
                "quote": text_clean,
                "author": author_clean,
                "tags": tags_clean
            }
        )

    # 3) Pagination: find the "Next" link
    next_link = soup.select_one("li.next a")
    next_url = urljoin(base_url, next_link["href"]) if next_link else None
    return rows, next_url

# 4) Crawl: loop pages until no next page (or safety limit)
def crawl_all(start_url, delay=1.0, max_pages=30):
    url = start_url
    all_rows = []
    page_no = 0

    while url and page_no < max_pages:
        html = fetch_html(url)
        rows, url = parse_page(html, BASE_URL)  # <- pagination after parsing
        all_rows.extend(rows)
        page_no += 1
        time.sleep(delay)  # be polite

    return all_rows

# 5) Save: CSV and JSON
def save_results(rows, csv_path="quotes.csv", json_path="quotes.json"):
    df = pd.DataFrame(rows)
    # order columns
    cols = ["quote", "author", "tags"]
    df = df.reindex(columns=cols)
    df.to_csv(csv_path, index=False)
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(rows, f, ensure_ascii=False, indent=2)
    return len(df)

def main():
    print("Starting crawl…")
    rows = crawl_all(BASE_URL, delay=0.8)
    count = save_results(rows)
    print(f"Done. Saved {count} rows to quotes.csv and quotes.json")

if __name__ == "__main__":
    main()

This example is not a one-size-fits-all solution. You will need to adjust and change it based on your target websites and scraping needs.

Choose the Right Tool

Not every job needs the same tool:

  • Use the requests library when the page is static and simple
  • Use Selenium when JavaScript builds most of the HTML content
  • Use Scrapy if you need speed, organization, or multi-page crawling across static pages

Choosing the right tools can save you time and effort in your web scraping projects.

Extract and Clean the Data

Once you get the HTML elements, you’ll need to clean them. Tags often come with extra whitespace, hidden text, or nested markup. For example:

title = soup.find('h1').text.strip()

Cleaning ensures your data extraction is accurate. Structure it into lists or dictionaries so you can consistently scrape data across multiple pages later.

Save and Export Results

After you extract data, save it for later use. Most scrapers save to CSV or JSON:

import pandas as pd

df = pd.DataFrame(data)
df.to_csv('output.csv', index=False)

For bigger projects, you can connect your Python web scraper directly to a database. That way, your scraped data stays organized and easy to query.
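For instance, pandas can write the same DataFrame into a SQLite database with to_sql(). A small sketch using Python’s built-in sqlite3 module and a hypothetical 'quotes' table name:

import sqlite3
import pandas as pd

conn = sqlite3.connect('scraped.db')  # creates the file if it doesn't exist
df.to_sql('quotes', conn, if_exists='append', index=False)
conn.close()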

Automate Your Scraper (Optional)

Instead of running scripts manually, you can schedule them. On Linux, use cron. In Python, use the schedule package. Here’s a minimal example:

import schedule, time  

def job():  
    print("Running scraper...")  

schedule.every().day.at("10:00").do(job)  

while True:  
    schedule.run_pending()  
    time.sleep(60) 

You’ll also need a system-level scheduler, or you’ll have to keep the script above running continuously so the job fires at the scheduled time. It’s helpful when you need fresh data every day, week, or any other interval.

Troubleshooting and Common Issues

Even well-built web scraping scripts face issues from time to time. Here are some of the most common ones:

  • 403 errors or timeouts. Your scraper may be blocked. Rotate proxies, adjust headers, or retry after a short wait (see the sketch after this list).
  • Dynamic content not loading. The requests library won’t cut it here; try Selenium instead.
  • Selectors break. Websites change their HTML elements, so update your code.
  • API vs scraping. If an API exists, it’s usually better to use it. It’s faster, more stable, and avoids the messy task of parsing HTML.
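For transient 403s and timeouts, a simple retry loop with an increasing wait often helps. A minimal sketch, not a full retry framework:

import time
import requests

def fetch_with_retries(url, attempts=3):
    for attempt in range(attempts):
        try:
            response = requests.get(url, timeout=15)
            response.raise_for_status()
            return response.text
        except requests.RequestException:
            time.sleep(2 ** attempt)  # wait 1s, 2s, then 4s between attempts
    return None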

Conclusion

Now you know the essentials of building a web scraper in Python. You’ve learned the difference between static and dynamic web scraping, when to use the requests library, Selenium, or Scrapy, and how to manage data extraction properly.

Start small: you can first try building a Python web scraper for quotes, products, or news headlines. Then move on to larger projects as you get better.

FAQ

How do I scrape websites that require login or authentication?

You can use Selenium to log in like a genuine user, especially for JavaScript-heavy login flows. For simpler sites or APIs, you can use the requests library with authentication tokens. Keep in mind, however, that scraping behind a login is often prohibited by a site’s terms of service and can lead to legal issues.

Which tool is best: requests, Selenium, or Scrapy?

The Python requests library is suitable for static content, Selenium works best for dynamic or interactive pages, and Scrapy is ideal for large-scale web scraping in Python.

Can I scrape APIs instead of HTML pages?

Yes. If the site provides an API, it’s usually better to use it. APIs often return structured results for your data extraction, eliminating messy HTML parsing.

How do I bypass CAPTCHAs during scraping?

Use services like 2Captcha or automated solvers. However, if possible, avoid CAPTCHAs by scraping less aggressively or using rotating proxies.

What are the alternatives to web scraping?

Look for RSS feeds, official APIs, or public datasets. They’re safer and usually cleaner than trying to scrape data from messy pages.
