Puppeteer Extra: Plugins, Setup, and Customization Guide
Marijus Narbutas
Puppeteer is a powerful Node.js library for automating Chrome and other Chromium-based browsers. It’s widely used in website testing and web scraping, and is often considered the premier tool for the latter.
While the default configuration of Puppeteer provides a lot of customization and great features, it’s still not fully optimized, especially for web scraping. As such, various developers have created a pool of plugins called “Puppeteer Extra”.
What Is Puppeteer Extra?
Puppeteer Extra is an extension of the regular Puppeteer framework that provides a lot of additional plugins for various tasks and purposes. Puppeteer Extra is largely aimed at web scraping. However, the plugins may be useful for many other browser automation applications.
Additionally, outside of the pre-packaged Puppeteer Extra plugin pool, users can create their own plugins and easily load them through Node.js. So, if you need something extra for your web scraping project, Puppeteer Extra makes it a lot easier to load custom scripts into the browser automation framework.
With all of the Puppeteer Extra plugins, you can greatly extend the functionalities of the framework or use the plugins to disable the default features of the browser. Both of these have various use cases that can be beneficial for web scraping and website testing.
List of Puppeteer Extra Plugins
1. Puppeteer-extra-plugin-stealth
The Puppeteer Extra stealth plugin is a multi-functional addition to your browser that aims to evade various bot detection algorithms. Most web scraping projects run into bot detection systems that can issue bans if they suspect a visitor is automated.
As such, the stealth plugin modifies various aspects of the browser (fingerprints, user agent, navigator properties, JavaScript features, etc.) that all aid in minimizing detection rates. It’s not that useful in website testing, but the stealth plugin is an absolute lifesaver in web scraping.
2. Puppeteer-extra-plugin-recaptcha
While the name of the Puppeteer Extra plugin is rather self-explanatory, there are a few important caveats to its usage. First, the CAPTCHA plugin cannot solve every occurrence of the test on its own.
Most of the current iterations of reCAPTCHA will require you to buy a service that manually solves these tests. The Puppeteer Extra plugin, however, makes it easy to interact with such services.
Additionally, it also provides various fallback mechanisms, such as taking screenshots when something goes wrong.
As such, it’s a great addition to the stealth plugin because, no matter how undetectable you make a browser, reCAPTCHA will still appear rather frequently. Most of the time, it’s used in web scraping, almost always in conjunction with the stealth plugin.
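As a rough illustration of how the plugin hands CAPTCHAs over to a solving service, here is a minimal sketch based on the plugin’s README. The `2captcha` provider id and `page.solveRecaptchas()` come from that README. The helper names (`buildRecaptchaOptions`, `solveOnPage`) and the token you pass in are assumptions made for this example, and the solving service itself is a paid third-party product.

```javascript
// Build the plugin options; the API token is supplied by the caller (hypothetical helper).
function buildRecaptchaOptions(token) {
  if (!token) {
    throw new Error('A solving-service API token is required');
  }
  return {
    provider: { id: '2captcha', token },
    visualFeedback: true, // highlight detected and solved CAPTCHAs when running headful
  };
}

// Packages are required lazily so the helper above can be reused without a browser.
// Register the plugin once per process; this sketch assumes a single call.
async function solveOnPage(url, token) {
  const puppeteer = require('puppeteer-extra');
  const RecaptchaPlugin = require('puppeteer-extra-plugin-recaptcha');
  puppeteer.use(RecaptchaPlugin(buildRecaptchaOptions(token)));

  const browser = await puppeteer.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.goto(url);
    // Method added to the page by the plugin; resolves with details about what was solved
    const { solved, error } = await page.solveRecaptchas();
    return { solved, error };
  } finally {
    await browser.close();
  }
}

module.exports = { buildRecaptchaOptions, solveOnPage };
```

Keeping the token out of the source file (for example, reading it from an environment variable before calling `solveOnPage`) is usually the safer choice.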
3. Puppeteer-extra-plugin-adblocker
Another Puppeteer Extra plugin that’s largely used in web scraping projects. It works exactly like a regular ad-block plugin, eliminating any advertisements that are shown on a website.
While most people run Puppeteer (Extra) with a headless browser (without a GUI), it’s still a useful plugin. Advertisements can take up a lot of bandwidth, and they load even in headless mode, so the plugin can save some traffic and boost page loading speeds.
Both of these are useful in web scraping, as most projects use proxies that are often priced per gigabyte of traffic. Reducing the “weight” of pages will reduce costs and speed up data extraction.
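To make the cost argument concrete, the sketch below registers the ad blocker and compares proxy spend for two page weights. The `blockTrackers` option is taken from the plugin’s README. The page sizes, page count, and price per gigabyte are made-up numbers for illustration only, and `estimateProxyCost` and `useAdblocker` are hypothetical helper names.

```javascript
// Estimate proxy spend when traffic is billed per gigabyte (1 GB = 1024 MB here).
function estimateProxyCost(pageCount, avgPageMB, pricePerGB) {
  return ((pageCount * avgPageMB) / 1024) * pricePerGB;
}

// Register the ad blocker on a puppeteer-extra instance.
// blockTrackers additionally filters requests to known trackers.
function useAdblocker(puppeteer) {
  const AdblockerPlugin = require('puppeteer-extra-plugin-adblocker');
  puppeteer.use(AdblockerPlugin({ blockTrackers: true }));
  return puppeteer;
}

// Hypothetical figures: 10,240 pages at $4 per GB, 3 MB per page with ads vs 2 MB without.
console.log(estimateProxyCost(10240, 3, 4)); // 120
console.log(estimateProxyCost(10240, 2, 4)); // 80
```

Actual savings depend entirely on how ad-heavy the target pages are, so it is worth measuring real traffic before and after enabling the plugin.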
4. Puppeteer-extra-plugin-anonymize-ua
An extremely useful plugin that aids in data extraction. A user agent is a string sent in the User-Agent header of every HTTP request that identifies the browser type and version, the operating system, and similar details.
It’s intended as information for servers, which can then deliver a response that suits the requesting machine. Nowadays, however, a user agent can be used in conjunction with IP addresses to track the same machine.
Additionally, some user agents are blocked by websites by default. The Puppeteer Extra plugin allows you to avoid both of these cases by anonymizing and creating a user agent that’s closer to a regular internet user.
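The plugin’s README documents two default transformations: replacing `HeadlessChrome` with `Chrome` and setting the platform to 64-bit Windows 10. The sketch below previews that effect on a sample string and then registers the plugin with those options spelled out. `previewAnonymizedUa` and `useAnonymizedUa` are hypothetical helpers, and the preview is an approximation of the documented behaviour rather than the plugin’s own code.

```javascript
// Approximate the plugin's documented defaults on a user agent string.
function previewAnonymizedUa(ua) {
  return ua
    .replace('HeadlessChrome/', 'Chrome/') // hide the headless marker
    .replace(/\(([^)]+)\)/, '(Windows NT 10.0; Win64; x64)'); // swap the first platform block
}

// Register the plugin on a puppeteer-extra instance with its defaults made explicit.
function useAnonymizedUa(puppeteer) {
  const AnonymizeUaPlugin = require('puppeteer-extra-plugin-anonymize-ua');
  puppeteer.use(AnonymizeUaPlugin({ stripHeadless: true, makeWindows: true }));
  return puppeteer;
}

const sample =
  'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/120.0.0.0 Safari/537.36';
console.log(previewAnonymizedUa(sample));
```

The plugin also accepts a `customFn` option that receives the user agent and returns a modified one, which is useful when a project needs a specific string.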
5. Puppeteer-extra-plugin-proxy
Integrating proxies is a vital part of any web scraping project. Websites will simply ban the offending IP addresses after some time, which makes changing them an unavoidable part of the practice.
Since proxies are by far the most popular method to change IP addresses when web scraping, the Puppeteer Extra plugin makes integration, authentication, and usage extremely easy and simple.
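For comparison, this is roughly what proxy integration looks like without a dedicated plugin. The sketch uses Chromium’s `--proxy-server` launch argument and Puppeteer’s `page.authenticate()` rather than the proxy plugin itself, whose options are not shown in this article. The host, port, and credentials are placeholders, and `buildProxyConfig` and `openThroughProxy` are hypothetical helper names.

```javascript
// Turn proxy settings into launch options plus optional credentials.
function buildProxyConfig({ host, port, username, password }) {
  return {
    launchOptions: { args: [`--proxy-server=${host}:${port}`] },
    credentials: username ? { username, password } : null,
  };
}

// Open a page through the proxy; the caller is responsible for closing the browser.
async function openThroughProxy(url, proxy) {
  const puppeteer = require('puppeteer-extra');
  const { launchOptions, credentials } = buildProxyConfig(proxy);
  const browser = await puppeteer.launch(launchOptions);
  const page = await browser.newPage();
  if (credentials) {
    // Answers the proxy's authentication challenge for this page
    await page.authenticate(credentials);
  }
  await page.goto(url);
  return { browser, page };
}

module.exports = { buildProxyConfig, openThroughProxy };
```

A plugin wraps these same steps so that every new page is authenticated automatically, which is where most of the convenience comes from.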
6. Puppeteer-extra-plugin-user-preferences
One of the few Puppeteer Extra plugins that’s more useful for website testing than for web scraping. It allows you to simulate various user preferences (e.g., language, screen resolution, etc.) to check how the website reacts to these settings.
There are some use cases for it when performing data extraction, but you’ll generally want to run Puppeteer (Extra) in headful mode, which isn’t used as frequently when scraping.
7. Puppeteer-extra-plugin-devtools
The plugin enables the DevTools function that’s usually available in regular browsers. DevTools are often used to verify network connections, check how code interacts with the browser, and many other functions.
As such, the Puppeteer Extra plugin is mainly used for website testing. Additionally, it’s not as widely used due to the somewhat narrow and niche use case.
8. Puppeteer-extra-plugin-block-resources
Another plugin that’s intended to reduce the time it takes to load pages. Instead of blocking ads, however, this plugin stops images, CSS, (some) fonts, and other resource types from loading.
It’s one of the few plugins that require some knowledge to use properly. If a website has all of the information in HTML and doesn’t require any additional resources, this Puppeteer Extra plugin could be used for web scraping, although it may also increase block rates.
On the other hand, it’s useful in website testing to see how pages load without any of the additional resources.
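A minimal sketch of how the blocked resource types might be chosen per site. The `blockedTypes` option (a `Set` of resource type names) follows the plugin’s README. The default list and the `buildBlockedTypes` and `useBlockResources` helpers are assumptions for illustration and should be adjusted to what the target pages actually need.

```javascript
// Resource types a text-only scrape can often do without (an assumption; tune per site).
const DEFAULT_BLOCKED = ['image', 'stylesheet', 'font', 'media'];

// Build the set of types to block, keeping any types the page needs to render its data.
function buildBlockedTypes(keep = []) {
  return new Set(DEFAULT_BLOCKED.filter(type => !keep.includes(type)));
}

// Register the plugin on a puppeteer-extra instance.
function useBlockResources(puppeteer, keep = []) {
  const BlockResourcesPlugin = require('puppeteer-extra-plugin-block-resources');
  puppeteer.use(BlockResourcesPlugin({ blockedTypes: buildBlockedTypes(keep) }));
  return puppeteer;
}

module.exports = { buildBlockedTypes, useBlockResources };
```

For example, `useBlockResources(puppeteer, ['stylesheet'])` would keep CSS loading while still dropping images, fonts, and media. Blocking scripts is deliberately left out of the default list here, since many sites render their content with JavaScript.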
Setting up Puppeteer Extra
To start using Puppeteer Extra, you’ll first need Node.js installed and, ideally, an IDE that supports it, such as IntelliJ IDEA. Once you have an IDE running, set up a Node.js project and run the following command in the Terminal:
npm install puppeteer puppeteer-extra
That will install both the Puppeteer and the Puppeteer Extra frameworks. Note that Puppeteer Extra does not include any plugins by default; it only enables their usage.
So, let’s pick the most popular one, the stealth plugin, and install it by running another command in the Terminal:
npm install puppeteer-extra-plugin-stealth
Once all of the frameworks and plugins are installed, we’ll need to create a basic script that loads Puppeteer Extra and includes the stealth plugin:
const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');
puppeteer.use(StealthPlugin());
So, we create one constant that holds the imported Puppeteer Extra module and another that holds the plugin. We then call the “use” method to register the stealth plugin.
Continuing onwards, we should use a Puppeteer browser instance to visit a website:
puppeteer.launch({ headless: false }).then(async browser => {
  const page = await browser.newPage();
  await page.goto('https://iproyal.com');
  // Perform your scraping tasks
  await browser.close();
});
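We use asynchronous programming because Puppeteer’s API is promise-based: launching the browser, opening a page, and navigating all return promises. Awaiting them keeps the steps in order without blocking, which suits web scraping well.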
Note that we set “headless” to false. While in most cases you’d want to run the browser in headless mode, for code debugging purposes, the GUI can be valuable.
So, we launch the Puppeteer instance, open a new page, go to the IPRoyal website, and wait until it loads. Once it’s done, the browser instance closes.
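The same flow can also be written as an async function with `try`/`finally`, so the browser is closed even when navigation fails. This is a sketch rather than a replacement for the script above: `fetchTitle` is a hypothetical helper, returning the page title stands in for real scraping work, and the `puppeteer` object is passed in so the function works with either Puppeteer or Puppeteer Extra.

```javascript
// Visit a URL and return the page title, always closing the browser afterwards.
// `puppeteer` is injected so the caller decides which plugins are registered.
async function fetchTitle(puppeteer, url) {
  const browser = await puppeteer.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'domcontentloaded' });
    return await page.title();
  } finally {
    // Runs on success and on error, so no browser process is left behind
    await browser.close();
  }
}

module.exports = { fetchTitle };
```

With the stealth plugin already registered, a call could look like `fetchTitle(puppeteer, 'https://iproyal.com').then(console.log)`.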
Implementing a Custom Plugin
Let’s write a custom Puppeteer Extra plugin that outputs text once a page finishes loading successfully (that is, the main document was returned with the HTTP status code 200).
First, we’ll need to create a new JavaScript file for the plugin. Name it in any way you like, but keep it descriptive; in this example, the file is called puppeteer-extra-plugin-load-check.js. Then we write the plugin in that file:
const { PuppeteerExtraPlugin } = require('puppeteer-extra-plugin');

class LoadCheck extends PuppeteerExtraPlugin {
  constructor(options = {}) {
    // Hand the options to the base class so it can initialize itself
    super(options);
    this.text = options.text || 'Page loaded successfully!';
  }

  // Every plugin has to expose a name
  get name() {
    return 'load-check';
  }

  async onPageCreated(page) {
    let isMainFrameResponseOK = false;

    // Listen for response events to track the main document's status
    page.on('response', response => {
      const request = response.request();
      // Only consider the main frame's document, not iframes or other resources
      if (request.resourceType() === 'document' && request.frame() === page.mainFrame()) {
        // Overwrite the flag on every navigation so an earlier 200 doesn't linger
        isMainFrameResponseOK = response.status() === 200;
      }
    });

    // Listen for the 'load' event
    page.on('load', () => {
      console.log(`Load event fired for ${page.url()}`);
      if (isMainFrameResponseOK) {
        console.log(this.text);
      }
    });
  }
}

module.exports = function(options) {
  return new LoadCheck(options);
};
Our plugin file first imports the PuppeteerExtraPlugin base class from the puppeteer-extra-plugin package (install it with npm install puppeteer-extra-plugin if it isn’t already in your project). We then create a class that extends this base class and start our coding process.
The constructor initializes an object of the class we created. We use “super” to call the constructor of the parent class (the superclass) and ensure that it’s set up correctly. It’s a necessary step, as our class extends an existing one.
We also accept an options object to make the plugin slightly customizable. Here it supports a single text option, which lets users supply their own message or fall back to a default.
After that, we use a getter to give the plugin a name. Note that we’ll be loading the plugin by its file path, not by this name, but Puppeteer Extra still requires every plugin to define one.
To make things simple, we use the “onPageCreated” method, which runs for every new page, including blank ones.
We set the main frame response variable to false. It’s simply a flag that tracks whether the document we’re checking returned a 200 HTTP status code.
After that, we set up a short listener for responses on the newly opened page. If the main document returns a 200 HTTP status code, our plugin sets the flag to true.
Finally, we listen for the load event and check whether the flag is true. We log a line when the load event fires and, if the response was successful, follow it with the configured text.
Now, all you need to do is to slightly update the main code:
const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');
const PageLoadPlugin = require('./puppeteer-extra-plugin-load-check');

// Use each plugin separately
puppeteer.use(StealthPlugin());
puppeteer.use(PageLoadPlugin({ text: 'The page loaded with status 200!' })); // Pass options if needed
Double-check that the path passed to “require” for “PageLoadPlugin” matches your plugin’s file name and location. Running the code should initialize both plugins and print the plugin’s messages to your console.
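The plugin’s listener logic can also be exercised without launching a browser. The sketch below is a variant of that logic that resets the flag on every document response, attached to a plain Node.js `EventEmitter` standing in for a Puppeteer page. `attachLoadCheck` and `fakeResponse` are hypothetical helpers written for this example and are not part of Puppeteer Extra.

```javascript
const { EventEmitter } = require('events');

// Listener logic modelled on the plugin, extracted so it can run against any event emitter.
function attachLoadCheck(page, text, log = console.log) {
  let isMainFrameResponseOK = false;
  page.on('response', response => {
    if (response.request().resourceType() === 'document') {
      // Track only the latest document response
      isMainFrameResponseOK = response.status() === 200;
    }
  });
  page.on('load', () => {
    if (isMainFrameResponseOK) {
      log(text);
    }
  });
}

// Stand-in for a Puppeteer response object (hypothetical helper).
function fakeResponse(status, type = 'document') {
  return { status: () => status, request: () => ({ resourceType: () => type }) };
}

const lines = [];
const page = new EventEmitter();
attachLoadCheck(page, 'Page loaded successfully!', line => lines.push(line));

page.emit('response', fakeResponse(200)); // successful navigation
page.emit('load');
page.emit('response', fakeResponse(404)); // failed navigation, nothing is logged
page.emit('load');

console.log(lines); // [ 'Page loaded successfully!' ]
```

Because `EventEmitter` calls listeners synchronously, the result can be checked immediately after the events are emitted, which makes this kind of check easy to drop into a unit test.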
Author
Marijus Narbutas
Senior Software Engineer
With more than seven years of experience, Marijus has contributed to developing systems in various industries, including healthcare, finance, and logistics. As a backend programmer who specializes in PHP and MySQL, Marijus develops and maintains server-side applications and databases, ensuring our website works smoothly and securely, providing a seamless experience for our clients. In his free time, he enjoys gaming on his PS5 and stays active with sports like tricking, running, and weight lifting.