Playwright Web Scraping: The Complete 2024 Guide

Vilius Dumcius


Web scraping with Playwright is becoming increasingly common because the framework provides features that make data extraction much easier. It’s also one of the few browser automation tools available in multiple programming languages, such as Python and JavaScript (Node.js).

Multi-language support is not the only reason to pick Playwright for web scraping. It also lets you customize browser sessions to suit specific websites, which makes scraping considerably easier.

What Is Playwright?

Playwright is an open-source browser automation framework that has been in active development since 2020. Maintained by Microsoft, it is designed to provide an easy way to automate Chromium, WebKit, and Firefox browsers.

One of its strengths is multi-language, cross-browser support: Playwright is available in several popular programming languages. As a result, it quickly became popular among developers and is used for many different purposes.

Its two primary applications are web scraping and website testing. Both rely on the powerful features the framework provides, allowing developers to automate many processes with ease.

Playwright web scraping is particularly popular due to the framework’s ability to automate multiple browsers while offering extensive customization for each of them. Playwright is also fairly well optimized, making it a great choice for web scraping.

Is Playwright Web Scraping Worth It?

Playwright web scraping is highly effective, customizable, and applicable to most scenarios. Many browser automation or HTTP request libraries lack certain features such as asynchronous programming or auto-waiting until specific elements load.

All of these are available in Playwright. Web scraping solution developers can make great use of these features, making Playwright a better option than most other browser automation libraries.

There are a few drawbacks to web scraping with Playwright. The most notable one is the learning curve, as many of the advanced features can be harder to understand.

Additionally, the Playwright library is a bit heavier because it downloads its own browser builds for Chromium, Firefox, and WebKit. While disk space is rarely an issue, it can be frustrating in some scenarios.

Finally, since the Playwright Python library is still relatively new (in comparison to nearly any other browser automation library), community support and public code are somewhat lacking.

Web Scraping With Playwright

While Playwright code can be written in Python, Java, JavaScript/TypeScript, and .NET languages such as C#, we’ll be using JavaScript through Node.js and Python to create a web scraping tool with each one.

You may need to rewrite some of the syntax if you use a different language, but the underlying logic will be identical.

Setting Up Your Environment

Node.js

To get started with Playwright web scraping, we’ll first need the language runtime and an IDE. You can download and install Node.js from the official website. In terms of IDE choices, there are plenty of options, but we’ll be using IntelliJ IDEA.

Once both are installed, you’ll need to install Playwright itself. Open up your IDE, start a new project, open the terminal, and run the following two commands one after the other. The first installs the library, and the second downloads the browsers Playwright controls:

npm install playwright
npx playwright install

Python

For Python, you’ll again need the language itself and an IDE. You can download Python from the official website, and there are plenty of IDE options. PyCharm Community Edition is completely free and will do everything you need for the tutorial.

Once both are installed, open up your IDE, start a new project, open the terminal, and run the following two commands. As with Node.js, the second one downloads the browsers:

pip install playwright
playwright install

Locating Elements

Playwright provides many ways to find the elements you need, each of which has its own drawbacks and benefits. Over time, you’ll learn to pick the right method just from experience, but here are a few popular options:

  • CSS selectors

One of the most common ways to find elements. You can find them based on CSS classes, IDs, attributes, and many other features.

  • XPath

Another popular way to find elements. XPath lets you move through the HTML DOM structure and pick out the elements you need.

  • Text-based locations

Playwright lets you find elements by the text they contain. It’s highly useful when you want to extract information from buttons or other short, interactive elements.

Here are a few examples of how you can locate elements using these methods.

Example #1: CSS Selectors

Node.js:

const titles = await page.$$eval('.post-title', elements =>
  elements.map(el => el.textContent.trim())
);

console.log('Article Titles:', titles);

Python:

titles = page.query_selector_all('.post-title')
titles_text = [title.text_content().strip() for title in titles]

print('Article Titles:', titles_text)

In a web scraping context, Playwright finds all elements that match the “.post-title” selector. The text of each matched element is then trimmed and stored in a new list, which is printed to the console.

Example #2: XPath

Node.js:

const prices = await page.$$eval(
  '//div[@class="product"]//span[@itemprop="price"]',
  elements => elements.map(el => el.textContent.trim())
);

console.log('Product Prices:', prices);

Python:

prices = page.query_selector_all('//div[@class="product"]//span[@itemprop="price"]')
prices_text = [price.text_content().strip() for price in prices]

print('Product Prices:', prices_text)

Instead of a CSS selector, we now use an XPath expression that matches every price span nested inside a product div. The rest of the code follows the same logic.

Example #3: Text-based Locations

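Node.js:

// .first() avoids a strictness error when several buttons match
const buttonText = await page.locator('button:has-text("Read More")').first().innerText();

console.log('Button Text:', buttonText);

Python:

# .first avoids a strictness error when several buttons match
button = page.locator('button:has-text("Read More")').first
button_text = button.inner_text()

print('Button Text:', button_text)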

Text-based extraction is the most intuitive. All you have to do is find the button in your browser, verify that the string exists in the HTML file, and input it into your code. As long as the string matches, Playwright will be able to find it.

Scraping Text

To avoid cluttering the article with too much code, we’ll be using a classic method for text scraping – CSS selectors. But before we can start scraping text, we first need to create a Playwright browser instance within Node.js:

const { chromium } = require('playwright');

(async () => {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  await page.goto('https://iproyal.com/blog/');

  await browser.close();
})();

Our code launches a Chromium instance through Playwright, opens a new page, and navigates to the assigned URL, in this case the IPRoyal blog. Since the browser runs headless and we don’t print anything yet, the script produces no visible output and simply exits.

Now, we’ll use CSS selectors to find the page titles and output them into the console:

const { chromium } = require('playwright');

(async () => {
    const browser = await chromium.launch();
    const page = await browser.newPage();
    await page.goto('https://iproyal.com/blog/');

    const titles = await page.$$eval('h2.tp-headline-s', elements =>
        elements.map(el => el.textContent.trim())
    );

    console.log('Article Titles:', titles);
    await browser.close();
})();

We’re using the same “$$eval” method as before to extract the titles. You still have to find the correct CSS selector, however.

To find the CSS selector, visit the web page, right-click the element you want, and select “Inspect”. Find the text you need in the code, right-click again, hover over “Copy”, and click “Copy selector”.

Sometimes, the copied CSS selector will be overly precise or complicated. In that case, look at the surrounding HTML for a simpler option. On the IPRoyal website, for example, copied selectors are highly specific and complicated.

But all titles match the selector “h2.tp-headline-s”. Using it, as we did in our code, is enough to collect every title.

Here’s the equivalent code in Python:

from playwright.sync_api import sync_playwright

def scrape_titles():
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto('https://iproyal.com/blog/')

        # Use CSS selector to select all titles inside <h2> tags with the class 'tp-headline-s'
        titles = page.query_selector_all('h2.tp-headline-s')
        titles_text = [title.text_content().strip() for title in titles]

        print('Article Titles:', titles_text)

        browser.close()

scrape_titles()
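
The asynchronous API mentioned earlier follows the same steps as the synchronous script above, but lets several pages load at once. The following is a minimal sketch rather than a definitive implementation: the helper names “gather_limited”, “fetch_title”, and “scrape_all”, as well as the limit of three concurrent pages, are assumptions made for illustration.

```python
import asyncio


async def gather_limited(coros, limit):
    """Run coroutines concurrently, with at most `limit` running at a time."""
    semaphore = asyncio.Semaphore(limit)

    async def run(coro):
        async with semaphore:
            return await coro

    # gather() returns results in the same order as the input coroutines
    return await asyncio.gather(*(run(coro) for coro in coros))


async def fetch_title(browser, url):
    """Open `url` in its own page and return the page title."""
    page = await browser.new_page()
    try:
        await page.goto(url)
        return await page.title()
    finally:
        await page.close()


async def scrape_all(urls, limit=3):
    """Fetch the title of every URL using one shared browser instance."""
    # Imported here so the helpers above can be used without Playwright installed
    from playwright.async_api import async_playwright

    async with async_playwright() as p:
        browser = await p.chromium.launch()
        try:
            return await gather_limited(
                [fetch_title(browser, url) for url in urls], limit
            )
        finally:
            await browser.close()
```

Calling asyncio.run(scrape_all([...])) with a list of URLs would return the titles in the same order as the input. Keeping the limit low avoids opening dozens of pages at once and is gentler on the target website.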

Scraping Images

The Playwright package will allow you to extract any type of data from a website, images included. Take note, however, that images might take up a lot more space than textual data, so you might need some optimization strategies.

We only need a few tweaks to our previous code to download images from the blog:

const { chromium } = require('playwright');
const fs = require('fs');

(async () => {
    const browser = await chromium.launch();
    const page = await browser.newPage();
    await page.goto('https://iproyal.com/blog/');

    // Collect the URL of every <img>, skipping empty and inline data: sources
    const images = await page.$$eval('img', imgs =>
        imgs.map(img => img.src).filter(src => src.startsWith('http'))
    );
    for (let i = 0; i < images.length; i++) {
        // Navigate straight to the image and save the raw response body
        const imageResponse = await page.goto(images[i]);
        await fs.promises.writeFile(`image-${i}.jpg`, await imageResponse.body());
    }
    await browser.close();
})();

We first include “fs” (File System) in our code to allow us to make use of the write file function within Node.js.

Everything else is the same until we hit the “img” CSS selector. We need the URLs of the images, which are stored in the “src” attribute of each “img” tag.

Inside “$$eval”, “imgs” is the array of elements matched by the “img” selector. We then use the “map()” function to extract the URL from each element in the array.

After that, a “for” loop runs once for each URL in our “images” variable. The loop navigates to the URL and uses “writeFile” to save the response body as a file. The files are saved to the current working directory (usually the project directory). Note that every file gets a “.jpg” extension regardless of its actual format.

For Python, we’re following an identical process:

from playwright.sync_api import sync_playwright

def scrape_images():
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto('https://iproyal.com/blog/')

        # Extract all image URLs, skipping empty and inline data: sources
        image_urls = page.eval_on_selector_all(
            'img', 'imgs => imgs.map(img => img.src).filter(src => src.startsWith("http"))'
        )

        # Download and save each image
        for i, image_url in enumerate(image_urls):
            response = page.goto(image_url)
            with open(f'image-{i}.jpg', 'wb') as file:
                file.write(response.body())

        browser.close()

scrape_images()
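
Both scripts save every file as “.jpg” and may download the same image more than once. One possible optimization, sketched below under the assumption that the URL path carries the real file extension, is to deduplicate the URLs and derive each filename from its URL. The helper names “downloadable” and “filename_for”, the extension list, and the “.bin” fallback are hypothetical choices, not part of Playwright.

```python
import posixpath
from urllib.parse import urlparse

# Extensions we trust to describe an image (an illustrative, non-exhaustive list)
KNOWN_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".webp", ".svg", ".avif"}


def downloadable(urls):
    """Keep unique http(s) URLs in their original order.

    Empty strings and inline data: URIs are dropped because there is
    nothing to navigate to.
    """
    seen = set()
    result = []
    for url in urls:
        if url.startswith(("http://", "https://")) and url not in seen:
            seen.add(url)
            result.append(url)
    return result


def filename_for(url, index):
    """Build a filename from the index, keeping the extension found in the URL path."""
    # urlparse() separates the path from the query string, e.g. "?w=200"
    extension = posixpath.splitext(urlparse(url).path)[1].lower()
    if extension not in KNOWN_EXTENSIONS:
        extension = ".bin"  # unknown type: avoid mislabelling it as .jpg
    return f"image-{index}{extension}"
```

In the download loop, you could iterate over downloadable(image_urls) and open filename_for(image_url, i) instead of the hard-coded f'image-{i}.jpg'.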

How to Set Up and Use Proxies in Playwright

Whenever you’re engaged in web scraping, you’ll need proxies. No matter how well you optimize your scraping process, whether through evasion techniques, headful or headless browsers, or custom user agents, you’ll get banned every once in a while.

Proxies let you get around IP bans by routing your requests through a large pool of other IP addresses. Playwright has native proxy support that is easy to use. All you need to do is add a few options when launching the browser:

const { chromium } = require('playwright');

(async () => {
  const browser = await chromium.launch({
    proxy: {
      server: 'http://proxy-server:port',
      username: 'username',
      password: 'password'
    }
  });
  await browser.close();
})();

You simply pass a few proxy-related launch options, and that’s it. If your proxy provider does not support proxy rotation, however, you may need to write your own rotation logic.

Here’s a Python example:

from playwright.sync_api import sync_playwright

def launch_browser_with_proxy():
    with sync_playwright() as p:
        browser = p.chromium.launch(proxy={
            'server': 'http://proxy-server:port',
            'username': 'username',
            'password': 'password'
        })
        page = browser.new_page()
        
        # Example: navigate to a website to test the proxy
        page.goto('https://example.com')
        print(page.title())

        browser.close()

launch_browser_with_proxy()
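
If your provider doesn’t rotate IPs for you, custom rotation logic could look roughly like the sketch below. It cycles through a list of proxy servers in round-robin order and launches a browser with the next one for every URL. The server addresses, the credentials, and the function names “proxy_rotation” and “scrape_with_rotation” are placeholders made up for illustration.

```python
from itertools import cycle

# Placeholder endpoints: replace them with the ones from your provider
PROXY_SERVERS = [
    "http://proxy-1.example:8000",
    "http://proxy-2.example:8000",
    "http://proxy-3.example:8000",
]


def proxy_rotation(servers, username, password):
    """Yield Playwright proxy settings, cycling through `servers` forever."""
    for server in cycle(servers):
        yield {"server": server, "username": username, "password": password}


def scrape_with_rotation(urls):
    """Visit each URL through the next proxy in the rotation and collect titles."""
    # Imported here so proxy_rotation() can be used without Playwright installed
    from playwright.sync_api import sync_playwright

    proxies = proxy_rotation(PROXY_SERVERS, "username", "password")
    titles = []
    with sync_playwright() as p:
        for url in urls:
            # Relaunching per URL is simple but slow; it keeps the sketch
            # identical to the launch-level proxy setup shown above
            browser = p.chromium.launch(proxy=next(proxies))
            page = browser.new_page()
            page.goto(url)
            titles.append(page.title())
            browser.close()
    return titles
```

Round-robin is the simplest strategy. You could just as well pick a proxy at random or switch only after a request fails.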

FAQ

Can Playwright be detected?

Yes, any browser automation or HTTP request library can be detected. There are various optimization strategies available, but there’s no way to completely avoid detection. If you do get banned, use proxies to circumvent the block.

Does Playwright have a UI?

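Playwright launches browsers in headless mode by default, but there’s also a headful mode that opens a visible browser window. You can enable it by passing “headless: false” (“headless=False” in Python) in the launch options, which is useful when debugging.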
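
As a rough sketch, headful mode can be toggled with a single flag. The “launch_options” helper, the “debug” flag, and the 250 ms delay are assumptions for illustration, while “headless” and “slow_mo” are regular launch arguments in the Python library.

```python
def launch_options(debug):
    """Build keyword arguments for chromium.launch()."""
    if debug:
        # headless=False opens a visible window; slow_mo delays each
        # Playwright operation by the given number of milliseconds
        return {"headless": False, "slow_mo": 250}
    return {"headless": True}


def open_blog(debug=False):
    """Open the blog, visibly when `debug` is True, and print its title."""
    # Imported here so launch_options() can be used without Playwright installed
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(**launch_options(debug))
        page = browser.new_page()
        page.goto("https://iproyal.com/blog/")
        print(page.title())
        browser.close()
```

Calling open_blog(debug=True) would let you watch each step in a real browser window.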

Is Selenium better than Playwright for web scraping?

Selenium has better community support and more public code, but it’s also a lot older. Playwright has better features, so both have their strengths and weaknesses. However, Playwright may soon catch up and become the better overall framework.


Author

Vilius Dumcius

Product Owner

With six years of programming experience, Vilius specializes in full-stack web development with PHP (Laravel), MySQL, Docker, Vue.js, and Typescript. Managing a skilled team at IPRoyal for years, he excels in overseeing diverse web projects and custom solutions. Vilius plays a critical role in managing proxy-related tasks for the company, serving as the lead programmer involved in every aspect of the business. Outside of his professional duties, Vilius channels his passion for personal and professional growth, balancing his tech expertise with a commitment to continuous improvement.
