Puppeteer Stealth: Prevent Blocks in Web Scraping
Vilius Dumcius
Puppeteer is a popular Node.js library intended to automate browsers, particularly Google Chrome. It’s widely used to automate various testing tasks and has some applications in web scraping wherever browsers are required.
While browser automation libraries often trigger fewer blocks than sending HTTP requests directly, bot detection systems still catch some web scrapers. To reduce the likelihood even further, the Puppeteer Stealth plugin is often used, as it masks many of the signs that reveal an automated browser.
What Is Puppeteer Extra?
Puppeteer Extra is a wrapper around the regular Puppeteer framework that adds plugin support, including the aforementioned Stealth plugin, an adblocker, and several other capabilities. The adblocker, for example, reduces bandwidth use when loading pages, which can be important when using proxies or other products billed by bandwidth.
The Stealth plugin, however, is the best known of all the plugins. It’s almost a necessity for any web scraping project, as it dramatically reduces bot detection across websites. According to the project’s GitHub page, Puppeteer Stealth should bypass all public bot detection tests.
On the other hand, it’s not the end-all-be-all, as you’ll still get blocked by some websites, especially if they use custom bot detection solutions. Using proxies and various other solutions will still be required.
What Is Puppeteer Stealth?
The Puppeteer Stealth plugin changes the regular browser fingerprint to reduce the likelihood of being detected. There are plenty of things that happen under the hood, but some of the most important changes are:
1. Includes missing features
Headless browsers lack some features and properties that regular browsers expose, such as parts of the window.chrome object and support for certain media codecs. Puppeteer Stealth adds these back in or imitates them.
2. User agent override
Headless browsers have a default user agent string that websites can detect easily, since headless Chrome identifies itself as “HeadlessChrome”. The Stealth plugin overrides the default user agent to match those used by regular browsers.
3. Changing the “navigator.webdriver” value
The navigator.webdriver property is set to true when a browser is controlled by automation, which makes it a dead giveaway. The Stealth plugin modifies the value to resemble a regular browser.
4. Faking plugins and optimizing fingerprints
Browser fingerprinting often checks for various extensions and plugins to verify that it’s a regular user – few people nowadays browse with a completely default browser. Puppeteer Stealth fakes the installation of several popular plugins and extensions to make the session seem more authentic.
5. General behavioral modifications
Automation scripts can be easily detected due to how form submissions, button clicks, and many other behaviors work. Puppeteer Stealth makes modifications (that can be adjusted) to reduce the likelihood of being detected.
Puppeteer evasion techniques have a tremendous impact on the block rates of your web scrapers. It’s always recommended to use Puppeteer Stealth unless you have developed your own evasion techniques that work even better.
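To make the list above more concrete, here is a minimal sketch of the kind of values a detection script might read. The helper names (collectSignals, flagSignals) are hypothetical, and the three checks are an assumption chosen for illustration; real bot detection systems inspect far more than this.

```javascript
// Hypothetical helper: reads a few fingerprint values from a Puppeteer page.
// `page` is assumed to be a Page object returned by browser.newPage().
async function collectSignals(page) {
  return page.evaluate(() => ({
    webdriver: navigator.webdriver,
    userAgent: navigator.userAgent,
    pluginCount: navigator.plugins.length,
  }));
}

// Hypothetical helper: lists which of the collected values look automated.
function flagSignals(signals) {
  const flags = [];
  if (signals.webdriver === true) flags.push('navigator.webdriver is true');
  if (/HeadlessChrome/.test(signals.userAgent)) flags.push('headless user agent');
  if (signals.pluginCount === 0) flags.push('no browser plugins');
  return flags;
}
```

Running collectSignals on a page with and without the Stealth plugin, then passing the result to flagSignals, is one simple way to see which of these values the plugin changes.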
How to Web Scrape With Puppeteer Stealth
First, you’ll need an IDE that can run Node.js. There are plenty of options available, from Visual Studio Code to IntelliJ. Once you have your IDE installed, you’ll also need Node.js installed on your machine.
Note that you may need to install a Node.js plugin into your IDE. Refer to the Node.js guide of your IDE to do so.
Once that’s complete, open up your IDE and use the command line to install both Puppeteer and Puppeteer Extra:
npm install puppeteer puppeteer-extra
Then, you’ll need to install the Stealth plugin separately:
npm install puppeteer-extra-plugin-stealth
Once that’s complete, we’ll need to create a Puppeteer instance with the Stealth plugin enabled:
const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');
puppeteer.use(StealthPlugin());
We load Puppeteer Extra with require() and store it in a constant called “puppeteer”. The Stealth plugin is loaded the same way.
Finally, we tell our Puppeteer variable to use the plugin, turning it into Puppeteer Stealth.
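The individual evasions can also be switched off before the plugin is registered. The sketch below assumes the enabledEvasions set that the Stealth plugin exposes; the wrapper function createStealthPuppeteer is a hypothetical name used only for this example.

```javascript
// Hypothetical wrapper: returns Puppeteer Extra with the Stealth plugin,
// skipping the evasions named in `disabled`. The packages are loaded inside
// the function, so nothing runs until it is called.
function createStealthPuppeteer(disabled = []) {
  const puppeteer = require('puppeteer-extra');
  const StealthPlugin = require('puppeteer-extra-plugin-stealth');

  const stealth = StealthPlugin();
  for (const name of disabled) {
    // enabledEvasions is a Set of evasion names on the plugin instance
    stealth.enabledEvasions.delete(name);
  }
  puppeteer.use(stealth);
  return puppeteer;
}

// Example call (commented out): keep the browser's own user agent.
// const puppeteer = createStealthPuppeteer(['user-agent-override']);
```

Call the function once per process, since every call registers another plugin instance on the shared Puppeteer Extra object.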
puppeteer.launch({ headless: true }).then(async browser => {
  const page = await browser.newPage();
  await page.goto('https://iproyal.com');
  // Perform your scraping tasks
  await browser.close();
});
Our next piece of code launches the Puppeteer Stealth instance in headless mode, opens a new page (tab) in the browser, and uses it to go to the IPRoyal website.
Finally, since we’re just testing, the browser is closed after it finishes loading the page.
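As an illustration of what the scraping step could look like, here is a minimal sketch that collects the page title and the text of every h2 element. The function names (scrapeHeadings, cleanTexts) and the choice of h2 as the selector are assumptions made for this example; adjust them to the page you are scraping.

```javascript
// Hypothetical helper: normalizes whitespace and drops empty or duplicate entries.
function cleanTexts(texts) {
  const seen = new Set();
  const result = [];
  for (const raw of texts) {
    const text = raw.replace(/\s+/g, ' ').trim();
    if (text && !seen.has(text)) {
      seen.add(text);
      result.push(text);
    }
  }
  return result;
}

// Opens `url` and returns the page title plus the text of every <h2> element.
async function scrapeHeadings(url) {
  const puppeteer = require('puppeteer-extra');
  const StealthPlugin = require('puppeteer-extra-plugin-stealth');
  puppeteer.use(StealthPlugin());

  const browser = await puppeteer.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'domcontentloaded' });
    const title = await page.title();
    const headings = await page.$$eval('h2', els => els.map(el => el.textContent));
    return { title, headings: cleanTexts(headings) };
  } finally {
    // Close the browser even if navigation or extraction throws
    await browser.close();
  }
}
```

The try/finally block is a design choice: without it, an error during navigation would leave a headless browser process running in the background.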
If you want to add a proxy to your Puppeteer Stealth implementation, there are a few ways to go about it. One of the easiest is to pass the proxy through the “args” launch option:
puppeteer.launch({
  headless: true,
  args: ['--proxy-server=your-proxy-server-address']
});
All you need to do is replace the placeholder with the IP address and port of your proxy server. Alternatively, Puppeteer Extra supports other plugins alongside Stealth, some of which were made for better proxy support.
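If your proxy requires a username and password, one common approach is to supply them with page.authenticate() after opening a page. The sketch below assumes a proxy object with host, port, username, and password fields; that shape and the function names are hypothetical.

```javascript
// Hypothetical helper: builds launch options for a proxy at host:port.
function buildLaunchOptions(host, port) {
  return {
    headless: true,
    args: [`--proxy-server=${host}:${port}`],
  };
}

// Opens `url` through the proxy and returns the page title.
// `proxy` is assumed to look like { host, port, username, password }.
async function openWithProxy(url, proxy) {
  const puppeteer = require('puppeteer-extra');
  const StealthPlugin = require('puppeteer-extra-plugin-stealth');
  puppeteer.use(StealthPlugin());

  const browser = await puppeteer.launch(buildLaunchOptions(proxy.host, proxy.port));
  try {
    const page = await browser.newPage();
    if (proxy.username) {
      // Supplies credentials when the proxy asks for authentication
      await page.authenticate({ username: proxy.username, password: proxy.password });
    }
    await page.goto(url);
    return await page.title();
  } finally {
    await browser.close();
  }
}
```

For example, buildLaunchOptions('proxy.example.com', 8080) produces the same structure as the launch options shown above, with the placeholder filled in.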
Alternatives to Puppeteer Stealth
If Puppeteer Stealth doesn’t allow you to bypass some website’s anti-scraping measures, there are plenty of alternatives available. Some of them will require you to use a different programming language, such as Python, but if all else fails, these can be good options:
- Playwright
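Another framework that natively supports Node.js. It has most of the same features as Puppeteer and works with several browsers. It doesn’t ship with stealth features of its own, but community packages such as playwright-extra let you use stealth plugins with it.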
- Selenium
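A browser automation framework with bindings for Python, Java, JavaScript, and other languages. It’s commonly used for web scraping, particularly with Python, when automating browsers is required. There are plenty of stealth plugins and undetected webdrivers available to your liking.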
- Cypress
A JavaScript framework built primarily for end-to-end testing that can also be used to automate browsers and perform some web scraping tasks.
Remember that no matter how well your web scraping framework and pipeline are built, you’ll still run into CAPTCHAs and blocks. The goal with these tools is to minimize their occurrence, but you can’t eliminate them entirely. Proxies have to be used to get around blocks when they do happen.
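One way to handle blocks when they do occur is to retry the request through a different proxy. The following is a rough sketch under stated assumptions: the status codes treated as a block (403 and 429), the round-robin order, and all function names are choices made for this example, not a standard.

```javascript
// Status codes treated as a block in this sketch (an assumption).
const BLOCK_STATUSES = new Set([403, 429]);

function isBlocked(status) {
  return BLOCK_STATUSES.has(status);
}

// Picks a proxy in round-robin order based on the attempt number.
function pickProxy(proxies, attempt) {
  return proxies[attempt % proxies.length];
}

// Calls `fetchStatus(proxy)` with a new proxy on each attempt until the
// response is not blocked. `fetchStatus` is a hypothetical callback that
// loads the page through the given proxy and resolves to the HTTP status,
// for example by calling status() on the response returned by page.goto().
async function fetchWithRotation(proxies, fetchStatus, maxAttempts = 3) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const proxy = pickProxy(proxies, attempt);
    const status = await fetchStatus(proxy);
    if (!isBlocked(status)) {
      return { proxy, status, attempts: attempt + 1 };
    }
  }
  return null; // every attempt was blocked
}
```

Keeping the block check and the proxy choice in separate small functions makes it easy to swap in a different rule later, such as detecting a CAPTCHA page by its content.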
Author
Vilius Dumcius
Product Owner
With six years of programming experience, Vilius specializes in full-stack web development with PHP (Laravel), MySQL, Docker, Vue.js, and Typescript. Managing a skilled team at IPRoyal for years, he excels in overseeing diverse web projects and custom solutions. Vilius plays a critical role in managing proxy-related tasks for the company, serving as the lead programmer involved in every aspect of the business. Outside of his professional duties, Vilius channels his passion for personal and professional growth, balancing his tech expertise with a commitment to continuous improvement.