Web Scraping With Selenium and Python

Key Takeaways

  • Use Selenium with Python to scrape JavaScript-heavy web pages that other tools can’t handle and extract their data.

  • Leverage explicit waits, take screenshots for debugging, and maintain best practices to ensure reliable web scraping.

  • Add proxies and rotate IPs in your scripts to avoid bans and make web scraping safer.

Python libraries like Requests and Scrapy can be very useful when you need to scrape static websites or perform general HTTP requests and API interactions. However, they are not suited for the modern web, where single-page applications driven by JavaScript are the norm. Much of the information in these apps is only accessible by executing JavaScript in a browser.

For this reason, modern web scrapers use tools that simulate the actions of real users more closely, taking care to render the page just like it would be rendered for a regular user.

One of these tools is Selenium. In this tutorial, you’ll learn about browser automation using Selenium WebDriver, how to control browsers for efficient automated testing and web scraping, and how you can leverage Selenium with Python to both collect data from dynamic pages and extract it via BeautifulSoup.

What Is Selenium?

Selenium is a collection of open-source projects aimed at automating web browsers. While these tools are mainly used for testing web applications, anything that can control a browser automatically can also be used to scrape dynamic web pages.

It can automate most of the actions a real person can take in the browser: scrolling, typing, clicking on buttons, taking screenshots, and even executing your own JavaScript scraping code.

Interaction occurs through Selenium’s WebDriver API, which provides bindings for popular languages such as JavaScript, Python, and Java. As a result, you can set up a Selenium project to scrape dynamic web pages, control browsers, do automated testing, and more.

Selenium vs BeautifulSoup

While you can perform a lot of basic scraping tasks in Python with Requests and BeautifulSoup or a web scraping framework like Scrapy, these tools struggle with modern web pages that rely heavily on JavaScript. To scrape these websites, you need to interact with the JavaScript on the page, and for that, you need a browser.

Selenium enables you to automate these interactions. It launches a browser, executes the necessary JavaScript, and lets you scrape the results. In contrast, an HTML parser can only see the JavaScript code the page contains; it cannot execute it.
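
To make the difference concrete, here's a minimal sketch. It uses https://quotes.toscrape.com/js/, a common scraping practice site, purely as an example of a JavaScript-rendered page; substitute whichever dynamic site you're targeting.

import requests
from selenium import webdriver

url = "https://quotes.toscrape.com/js/"  # example of a JavaScript-rendered page

# A plain HTTP client only receives the initial HTML, before any JavaScript runs
static_html = requests.get(url).text

# Selenium drives a real browser, so page_source reflects the rendered DOM
driver = webdriver.Chrome()
driver.get(url)
rendered_html = driver.page_source
driver.quit()

print(len(static_html), len(rendered_html))  # the rendered version is usually much larger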


Web Scraping With Selenium Tutorial

In this tutorial, you’ll learn how to use Selenium’s Python bindings to search the r/programming subreddit. You’ll also learn how to mimic actions such as clicking, typing, and scrolling.

In addition, we’ll show you how to add a proxy to the web scraping script so that your real IP doesn’t get detected and blocked by Reddit.

In a previous tutorial on Python web scraping, we covered how to scrape the top posts of a subreddit using the old Reddit UI. The current UI is less friendly to web scrapers. There are two additional difficulties:

  • The page uses infinite scroll instead of pages for posts.
  • The class names of items are obfuscated — the page uses class names like “eYtD2XCVieq6emjKBH3m” instead of “title-blank-outbound”.

But with the help of Selenium, you can easily handle these difficulties.

Setup

You’ll need Python 3 installed on your computer to follow this tutorial. You should also use the command line to install Selenium, a library with Python bindings for WebDriver.

pip install selenium

After installing the bindings (and downloading a WebDriver binary, if your Selenium version doesn’t fetch one automatically), create a new Python file called reddit_scraper.py and open it in a code editor.

Import all the required libraries there:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from time import sleep

After that, define everything needed to launch the WebDriver. In the code snippet below, substitute "C:\your\path\here\chromedriver.exe" with the path to the WebDriver you downloaded.

If Selenium downloaded the WebDriver by itself (recent versions do this automatically via Selenium Manager), you can simply leave service = Service().

service = Service(r"C:\your\path\here\chromedriver.exe")
options = webdriver.ChromeOptions()

driver = webdriver.Chrome(service=service, options=options)
driver.get("https://www.reddit.com/r/programming/")
sleep(4)

If you run the script right now, it should open a browser and go to r/programming. You may be presented with a bot challenge, however, which you can solve manually for now. Instead of relying on a fixed sleep, replace that line with:

input("Press Enter after you've solved the bot challenge...")

Accepting Cookies

When you start a session on Reddit with Selenium, you might be asked about your cookie preferences. This pop-up can interfere with further actions, so it's a good idea to close it.

So, the first thing the script should do is try to click the “Accept all” button for cookies.

An easy way to find the button is by using an XPath selector that matches the first and only button with “Accept all” text.

accept = driver.find_element(
    By.XPATH, '//button[contains(text(), "Accept all")]')

After that, you must instruct the driver to click it via the .click() method.

If the pop-up doesn't show up, trying to click it will lead to an exception, so you should wrap the whole thing in a try-except block.

print("Trying to accept cookies...")
try:
    accept = driver.find_element(By.XPATH, '//button[contains(., "Accept All")]')
    accept.click()
    print("Clicked Accept All")
    sleep(3)  # Give more time for modal to close
except:
    print("Cookie button not found")

As you can see, Selenium's API for locating elements is very similar to the one used by both BeautifulSoup and Scrapy.

After you have gotten the cookie pop-up out of the way, you can start working with the search bar at the top of the page.

If you inspect the Reddit page, it appears to be a standard input field. However, if you try to use a standard CSS selector like input[type="search"], Selenium will throw a NoSuchElementException error.

Reddit has at some point begun using Shadow DOM, which makes scraping significantly harder. You’ll first need to use JavaScript to traverse the additional layers before Selenium can interact with the element.

These are the lines of code that do that:

from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


search_bar = driver.execute_script("""
    return document.querySelector('reddit-search-large')
           .shadowRoot.querySelector('faceplate-search-input')
           .shadowRoot.querySelector('input[name="q"]');
""")

# Check that we actually found the input (good practice for debugging)
if search_bar:
    # Interact with the element; clicking via JavaScript ensures it gets focus
    driver.execute_script("arguments[0].click();", search_bar)

    # Type the query and submit the search
    search_bar.send_keys("selenium")
    sleep(1)
    search_bar.send_keys(Keys.ENTER)

    # Wait for the results to load
    sleep(6)
    print("Search completed.")
else:
    print("Could not find the search bar. The page structure may have changed.")

Remember to add the new imports at the top.

In addition to sleep(), Selenium has more sophisticated tools for waiting for elements to appear, such as an implicit wait, which waits for a set amount of time for an element to appear, and an explicit wait, which waits until some condition is fulfilled.

We'll take a closer look at both of them in the Explicit vs Implicit Waits section below.

This is the full code for searching:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from time import sleep

service = Service()
options = webdriver.ChromeOptions()
options.add_experimental_option("detach", True)  # Keeps window open

driver = webdriver.Chrome(service=service, options=options)
driver.get("https://www.reddit.com/r/programming/")

print("1. Please solve the bot challenge manually if it appears...")
WebDriverWait(driver, 300).until(EC.presence_of_element_located((By.TAG_NAME, "body")))

input("Press Enter here AFTER the page is fully loaded and challenge is solved...")

try:
    accept_btn = WebDriverWait(driver, 5).until(
        EC.element_to_be_clickable((By.XPATH, '//button[contains(., "Accept all")]'))
    )
    accept_btn.click()
    print("Cookies accepted.")
except Exception:
    print("Cookie banner not found.")

sleep(2)

print("Attempting to find Shadow DOM search bar...")

search_input = None

for i in range(10):
    try:
        search_input = driver.execute_script("""
            return document.querySelector('reddit-search-large')
                   .shadowRoot.querySelector('faceplate-search-input')
                   .shadowRoot.querySelector('input[name="q"]');
        """)

        if search_input:
            print("Target found inside Shadow DOM!")
            break
    except Exception:
        pass
    sleep(0.5)

if search_input:
    driver.execute_script("arguments[0].click();", search_input)
    sleep(0.5)

    search_input.send_keys("selenium")
    sleep(0.5)
    search_input.send_keys(Keys.ENTER)
    print("Search executed!")
else:
    print("Failed to find the element even with JS. The structure might have changed.")

Scraping Search Results

Once the search results have loaded, it's pretty easy to scrape their titles: Reddit kindly puts the full title inside the aria-label attribute of each link.

Since that attribute belongs to the host element (the link itself) rather than anything inside a shadow root, standard Selenium commands can read it without extra work.

search_results = driver.find_elements(By.CSS_SELECTOR, 'a[data-testid="post-title"]')

But if there are more than 25 or so results, you won't get all of them. To get more, you need to scroll down the page.

A typical way of doing this is to use Selenium's capabilities to execute any JavaScript code via execute_script. This enables you to use the scrollIntoView function on the last title scraped, which will make the browser load more results.

Here's how you can do that:

for _ in range(5):
    if search_results:
        driver.execute_script("arguments[0].scrollIntoView();", search_results[-1])

    sleep(2)

    search_results = driver.find_elements(By.CSS_SELECTOR, 'a[data-testid="post-title"]')

print(f"Found {len(search_results)} posts:")

The loop above scrolls five times, each time updating the list of titles with the newly revealed search results.

Finally, we can print out the list of titles to the screen and quit the browser.

for result in search_results:
    title = result.get_attribute("aria-label")
    if title:
        print(title)

driver.quit()

Here's the full script:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from time import sleep

service = Service()
options = webdriver.ChromeOptions()
options.add_experimental_option("detach", True)

driver = webdriver.Chrome(service=service, options=options)
driver.get("https://www.reddit.com/r/programming/")

WebDriverWait(driver, 300).until(EC.presence_of_element_located((By.TAG_NAME, "body")))

input("Press Enter here AFTER the page is fully loaded and challenge is solved...")

try:
    accept_btn = WebDriverWait(driver, 5).until(
        EC.element_to_be_clickable((By.XPATH, '//button[contains(., "Accept all")]'))
    )
    accept_btn.click()
except Exception:
    pass

sleep(2)

search_input = None

for i in range(10):
    try:
        search_input = driver.execute_script("""
            return document.querySelector('reddit-search-large')
                   .shadowRoot.querySelector('faceplate-search-input')
                   .shadowRoot.querySelector('input[name="q"]');
        """)
        if search_input:
            break
    except Exception:
        pass
    sleep(0.5)

if search_input:
    driver.execute_script("arguments[0].click();", search_input)
    sleep(0.5)
    search_input.send_keys("selenium")
    sleep(0.5)
    search_input.send_keys(Keys.ENTER)
    sleep(5)

    search_results = driver.find_elements(By.CSS_SELECTOR, 'a[data-testid="post-title"]')

    for _ in range(5):
        if search_results:
            driver.execute_script("arguments[0].scrollIntoView();", search_results[-1])
        sleep(2)
        search_results = driver.find_elements(By.CSS_SELECTOR, 'a[data-testid="post-title"]')

    print(f"Found {len(search_results)} posts:")
    for result in search_results:
        title = result.get_attribute("aria-label")
        if title:
            print(title)

driver.quit()

Explicit vs Implicit Waits

An implicit wait tells the driver to poll the DOM for a set time when using any find_element call. It applies globally to all element-finding in the session. An explicit wait, on the other hand, defines a condition to wait for a particular web element’s state, like visibility or clickability, before proceeding.

Since elements on modern web pages, such as single-page apps or pages that load content via AJAX, may appear at unpredictable times, you may prefer explicit waits over implicit ones.

Explicit waits let you proceed exactly when the condition is met, rather than waiting a fixed amount of time and risking being either too early or waiting longer than necessary.

When both implicit and explicit waits are enabled, they start interfering with each other. For example, if you set a 10-second implicit wait and a 30-second explicit wait, Selenium can end up waiting far longer than 30 seconds before failing.

It makes total wait times unpredictable and inefficient, which is why most people disable implicit waits and rely mainly on explicit ones.

Here’s a short example using WebDriverWait and expected_conditions in Python:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://iproyal.com")
wait = WebDriverWait(driver, 10)
element = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button.CybotCookiebotDialogBodyButton")))
driver.execute_script("arguments[0].click();", element)

In contrast, an implicit wait would look like:

driver.implicitly_wait(10)
driver.get("https://iproyal.com")
driver.find_element(By.CSS_SELECTOR, "button.submit").click()

Notice that you don’t specify a condition with an implicit wait; it simply waits up to 10 seconds for any find_element call to succeed. As a result, explicit waits give you finer control and are more reliable for dynamically loaded pages.

Taking Screenshots for Debugging

When you’re running web scraping operations with headless mode or driving browsers invisibly, it can be hard to know what the browser actually sees. Taking screenshots lets you capture a snapshot of the browser state at any point in your script.

It helps you debug issues with web scraping, browser rendering, or automated web browsing.

The driver.save_screenshot() method captures the current browser viewport (the visible area). If you need to capture the entire length of the web page, including content that requires scrolling, some browser drivers in modern Selenium versions support this natively.

You could do it right after a navigation or after a click to verify that the page loaded as expected, cookie banners didn’t block anything, or dynamic web pages rendered the parts you expected.

In headless mode (no GUI), screenshots become even more important since you don’t see the browser window. Use them as part of your test scripts or web scraping pipeline so you can visually inspect failures.
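
For example, here's a minimal sketch of dropping screenshots into the Reddit script from earlier; the file names are arbitrary:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://www.reddit.com/r/programming/")

# Capture the viewport right after navigation
driver.save_screenshot("after_load.png")

try:
    driver.find_element(By.XPATH, '//button[contains(., "Accept all")]').click()
except Exception:
    pass

# Capture it again after interacting, to verify the banner is gone
driver.save_screenshot("after_cookie_click.png")
driver.quit()

If you're driving Firefox, for instance, recent Selenium versions also expose get_full_page_screenshot_as_file() for capturing the whole page rather than just the viewport.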

Using Selenium With BeautifulSoup

Sometimes, you may want the best of both worlds: use Selenium’s browser automation (via Selenium WebDriver) to handle dynamic pages and JavaScript-heavy web applications, and then hand the fully rendered HTML to a parsing library like BeautifulSoup for efficient extraction.

Here’s how it works in a Python script:

from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get("https://iproyal.com")

html = driver.page_source

soup = BeautifulSoup(html, "html.parser")
titles = [tag.text for tag in soup.select("h3.title")]

driver.quit()

You use Selenium to open and manipulate the web browser, execute JavaScript, scroll, click, etc., then you get the page_source of the web page (including JS-rendered content), and pass that into BeautifulSoup.

From there, parsing and extracting from HTML elements is faster and more convenient than using Selenium alone. It combines the strengths of browser automation for dynamic content and efficient parsing for extraction.

This approach means fewer Selenium calls during extraction and more reliable data capture from modern, JavaScript-heavy pages.
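
Applied to the Reddit example from earlier, the handoff might look like the sketch below, assuming the search results are already loaded in driver and the a[data-testid="post-title"] selector from the tutorial still matches:

from bs4 import BeautifulSoup

# Hand the rendered DOM to BeautifulSoup once, then parse without further browser calls
soup = BeautifulSoup(driver.page_source, "html.parser")

titles = [
    link.get("aria-label")
    for link in soup.select('a[data-testid="post-title"]')
    if link.get("aria-label")
]

for title in titles:
    print(title)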

Adding a Proxy to a Selenium Script

When doing web scraping with Selenium or any other tool, it's important not to expose your real IP address. Many websites consider web scraping intrusive, and their admins may ban IP addresses suspected of belonging to scrapers.

For this reason, web scrapers frequently use proxies, which are servers that hide your real IP address.

Many free proxies are available, but they can be very slow and pose privacy risks, so it's much better to use a paid proxy. Paid proxies are usually inexpensive and provide a faster, more private service.

In this tutorial, you'll use IPRoyal residential proxies. These are great for web scraping projects because they rotate your IP on every request. In addition, they are sourced from a diverse set of locations, which makes your web scraping activity harder to detect.

First, you need to find the link to your proxy. If you're using IPRoyal, you can find the link in your dashboard.

Be careful! A proxy access link usually contains your username and password for the service, so keep it private.
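
One way to keep those credentials out of your script is to read them from environment variables. Here's a small sketch; the variable names are just an example, and the geo.iproyal.com:12321 endpoint matches the one used later in this section:

import os

# Hypothetical environment variable names; set them in your shell before running the script
proxy_user = os.environ["PROXY_USER"]
proxy_pass = os.environ["PROXY_PASS"]

proxy_url = f"http://{proxy_user}:{proxy_pass}@geo.iproyal.com:12321"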

Unfortunately, Chrome doesn't support proxy URLs that include authentication. For this reason, you'll need another library called selenium-wire, which intercepts the requests Chrome makes and handles proxy authentication for you.

First, install it using your command line:

pip install selenium-wire

Then, go to the imports in your code and replace:

from selenium import webdriver

with:

from seleniumwire import webdriver

After that, you'll need to adjust the code that initializes the WebDriver. The driver is created in the same way as before, except that you now pass your proxy details through a seleniumwire_options dictionary.

service = Service()  # or Service(r"C:\chromedriver.exe") if you downloaded the driver manually
options = {
    'proxy': {
        'http': 'http://username:password@geo.iproyal.com:12321',
        'https': 'http://username:password@geo.iproyal.com:12321',
    }
}

driver = webdriver.Chrome(service=service,
                          seleniumwire_options=options)

This is how the top part of your script should look:

from seleniumwire import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from time import sleep


service = Service()
options = {
    'proxy': {
        'http': 'http://username:password@geo.iproyal.com:12321',
        'https': 'http://username:password@geo.iproyal.com:12321',
    }
}

driver = webdriver.Chrome(service=service,
                          seleniumwire_options=options)

The rest of the script should work the same with selenium-wire.

Now, each time you connect to Reddit, you will have a different IP address. If Reddit detects suspicious activity, all it can do is ban one of those addresses, so you can safely keep browsing r/programming from your own IP in your free time.

Conclusion

In this tutorial, you learned how to use Selenium to do basic actions in your web browser, such as clicking, scrolling, and typing. If you want to learn more about what you can do using Selenium, these video series are an excellent resource.

For more web scraping tutorials, you're welcome to read our How To section.

FAQ

Can I use Selenium with non-Chrome browsers such as Firefox or Safari?

Yes, you can download and use the respective drivers for Firefox and Safari with Selenium. For more information, check this link for Firefox and this link for Safari.

Do I need to have the browser running to scrape websites with Selenium?

Yes, but the browser doesn't necessarily need to render the pages graphically. Web scrapers frequently use headless browsers, which run without a graphical interface and are controlled entirely through code. For more information on using Selenium in headless mode, check out this guide.
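
For example, here's a sketch of launching Chrome without a visible window (the --headless=new flag applies to recent Chrome versions; older ones use --headless):

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--headless=new")  # run Chrome without opening a window

driver = webdriver.Chrome(options=options)
driver.get("https://www.reddit.com/r/programming/")
print(driver.title)
driver.quit()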

What actions can Selenium perform?

The WebDriver API supports most user actions, such as clicking, typing, and taking screenshots. For a full list, you can refer to the API documentation for Python. In addition, the API enables you to execute any JavaScript code on request, which means that it can do whatever is possible in a real browser by a sophisticated user.

How can I handle infinite scroll or pagination with Selenium?

You can scroll down repeatedly using execute_script("window.scrollTo(0, document.body.scrollHeight);"), or locate the “load more” button and click it until no new web elements appear. Use explicit waits to detect when new content appears, as in the sketch below.
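
Here's a rough sketch of the scroll-until-nothing-new approach, assuming you already have a driver on the target page; the post-title selector is reused from the tutorial above, and the fixed delay is only a placeholder for a proper explicit wait:

from selenium.webdriver.common.by import By
from time import sleep

previous_count = 0
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    sleep(2)  # ideally, replace with an explicit wait for new elements

    items = driver.find_elements(By.CSS_SELECTOR, 'a[data-testid="post-title"]')
    if len(items) == previous_count:
        break  # nothing new appeared, stop scrolling
    previous_count = len(items)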

How do I deal with pop-ups, alerts, or cookie banners?

You can use Selenium to locate the pop-up via CSS selectors or XPath and click to dismiss it, using waits to detect its presence. Handling elements like cookie prompts early helps you avoid getting blocked later in the script.

Can Selenium be used with browsers other than Chrome?

Yes, you can control Firefox, Safari, Edge, and more by using their respective drivers. For example, you’ll need GeckoDriver for Firefox, or you can let Selenium Manager, which is built into recent Selenium versions, obtain drivers automatically. The Selenium package supports multiple web browsers.

What are the best practices for avoiding detection while web scraping?

  • Use real browser profiles, and vary your browser’s user agent and window size (see the sketch after this list).
  • Introduce random delays and avoid predictable patterns so your traffic mimics a human user rather than an obvious bot.
  • Use proxies to rotate IPs.
  • Respect the site’s rate limits, robots.txt, and terms of service.
  • Avoid firing too many HTTP requests in a short time, and spread your web scraping over time.
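
As a sketch of the first two points, here's how you might randomize delays and override the user agent and window size. The user-agent string and the delay range are arbitrary examples:

import random
from time import sleep
from selenium import webdriver

options = webdriver.ChromeOptions()
# Example user-agent override; use strings that match real, current browsers
options.add_argument(
    "--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
)
options.add_argument("--window-size=1366,768")

driver = webdriver.Chrome(options=options)

for url in ["https://www.reddit.com/r/programming/", "https://www.reddit.com/r/python/"]:
    driver.get(url)
    sleep(random.uniform(3, 8))  # random pause between page loads

driver.quit()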