Superagent-Proxy: Ultimate Guide for Web Scraping with NodeJS
Justas Vitaitis
Superagent is a NodeJS HTTP client library that’s frequently used to send requests to APIs or servers. It’s also widely used for web scraping applications as it’s a good foundational library for making HTTP requests.
On its own, however, superagent only sends HTTP requests. For web scraping applications, you’ll need much more. Most importantly, you’ll need a way to integrate proxies.
Superagent-proxy is an extension package that lets you do exactly that – route traffic through proxies whenever you make HTTP requests. We’ll go through both libraries, how they work, and how you can set up proxies for your web scraping project.
Getting Started with Superagent-Proxy
Prerequisites
Before any proxy integration can begin, you’ll need a few things: an IDE that supports JavaScript, and NodeJS itself. You’ll use the IDE for development, while NodeJS lets you run JavaScript outside the browser. Combined, the two will let you use “superagent”.
While there’s only one NodeJS to install, there’s a wide variety of IDEs you can choose from. A few of the popular options are Visual Studio Code, IntelliJ IDEA, and Atom (note that Atom was officially sunset by GitHub in 2022). IntelliJ IDEA requires a paid license for full JavaScript support, but it does have a 30-day free trial.
Installation and Setup
Once you have NodeJS installed and the IDE at the ready, open up the Terminal to install both libraries:
npm install superagent superagent-proxy
If your IDE doesn’t start you off with a “.js” file, you’ll need to create one yourself, either through the IDE or your operating system’s file manager.
You’ll then need to import both of the libraries to make use of the functions stored within.
const superagent = require('superagent');
require('superagent-proxy')(superagent); // Extends superagent request class with proxy capabilities
The “const superagent = require(...)” line is CommonJS syntax, roughly equivalent to an ES module “import superagent” statement: both give you access to the library’s functions. Most resources recommend the “require” form for these two libraries, since “superagent-proxy” works by patching the superagent object it receives.
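If your project uses ES modules instead ("type": "module" in package.json), the setup might look roughly like the sketch below. It assumes superagent-proxy’s default export is the patching function, which is how the CommonJS example above uses it:
import superagent from 'superagent';
import superagentProxy from 'superagent-proxy';

superagentProxy(superagent); // Patch superagent with proxy capabilities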
Sending a GET Request
Before you can even think of using “superagent-proxy”, you first need to set up communication with a server. If you’re not sending any requests to an external server, there’s no point in setting up a proxy.
const superagent = require('superagent');
require('superagent-proxy')(superagent); // Extends superagent request class with proxy capabilities
const url = 'https://iproyal.com'; // The URL you want to scrape
superagent
  .get(url)
  .then(res => {
    console.log('Data retrieved:', res.text); // Output the content directly to the console
  })
  .catch(err => {
    console.error('Error fetching data:', err);
  });
We’ve added a few new lines. First, we create a new constant (it cannot be changed during execution) with our desired URL.
Once that’s established, we call superagent’s “.get()” method, passing in our “url” constant as the request target.
After the server responds, the “.then()” callback takes the response object (“res”) and outputs its body (“res.text”) to the console, prefixed with “Data retrieved:”.
If an error occurs, the “.catch()” method handles it: instead of an unhandled promise rejection that’s hard to trace, the error is printed to the console with some context.
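If you prefer async/await over promise chains, the same request and error handling can be written with try/catch. Here’s a minimal sketch (the “fetchPage” function name is our own):
const superagent = require('superagent');
require('superagent-proxy')(superagent);

async function fetchPage(url) {
  try {
    const res = await superagent.get(url); // Await the response instead of chaining .then()
    console.log('Data retrieved:', res.text);
  } catch (err) {
    console.error('Error fetching data:', err); // Equivalent of .catch()
  }
}

fetchPage('https://iproyal.com');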
Implementing Rotating Proxies
Once we’re communicating with a website or API, we can begin using proxy servers. The “superagent-proxy” library works primarily with HTTP(S) proxies. If you want to use SOCKS5 proxies instead, there’s a similar library for that, called “socks-proxy-agent”.
Most of the steps for “socks-proxy-agent” are nearly identical, so we’ll focus on HTTP(S) proxies here. These work with “superagent-proxy” and are significantly more popular as well.
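For reference, a rough sketch of the SOCKS5 alternative is shown below. It assumes a recent “socks-proxy-agent” version that exports a SocksProxyAgent class, and it attaches the agent with superagent’s “.agent()” method; check the library’s documentation for the exact API of the version you install:
const superagent = require('superagent');
const { SocksProxyAgent } = require('socks-proxy-agent');

const url = 'https://iproyal.com'; // The URL you want to scrape
const agent = new SocksProxyAgent('socks://username:password@proxyhost:port');

superagent
  .get(url)
  .agent(agent) // Route the request through the SOCKS5 proxy
  .then(res => console.log('Data retrieved:', res.text))
  .catch(err => console.error('Error fetching data:', err));
With “superagent-proxy” and an HTTP(S) proxy, the basic setup looks like this: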
const superagent = require('superagent');
require('superagent-proxy')(superagent); // Extends superagent with proxy capabilities
const url = 'https://iproyal.com'; // The URL you want to scrape
const proxy = 'http://your-proxy.com:port';
superagent
  .get(url)
  .proxy(proxy)
  .then(res => {
    console.log('Data retrieved:', res.text);
    // Here you can process the data or store it as needed
  })
  .catch(err => {
    console.error('Error fetching data:', err);
  });
All we’re doing is adding another constant, “proxy”, which holds our HTTP proxy server’s address and port. We then chain “.proxy(proxy)” onto the request.
While such an implementation can work perfectly fine, the request carries no credentials, so your proxy provider must support IP whitelisting. Additionally, since the code uses only one address, your provider must supply rotating proxies behind a single endpoint.
Not all providers support both. Let’s start with credential-based authentication:
const superagent = require('superagent');
require('superagent-proxy')(superagent); // Extend superagent with proxy support
const url = 'https://example.com/data'; // The URL you want to request
const proxy = 'http://username:password@proxyhost:port'; // Your authenticated proxy
superagent
  .get(url)
  .proxy(proxy) // Set the proxy with authentication
  .then(res => {
    console.log('Data retrieved:', res.text); // Output the fetched data
  })
  .catch(err => {
    console.error('Error fetching data:', err); // Error handling
  });
We simply change the proxy server address to one that embeds the username and password in the URL. Note that some providers use different URL formats, but they should provide a guide on how to integrate their proxy servers.
You may also authorize the proxy through the “Proxy-Authorization” header, which keeps credentials out of the proxy URL at the cost of slightly more code. Note that not all proxy providers support this form of authentication:
const superagent = require('superagent');
require('superagent-proxy')(superagent); // Extend superagent with proxy support
const url = 'https://example.com/data'; // URL to fetch
const proxy = 'http://proxyhost:port'; // Proxy server (no credentials in URL)
// Username and password for proxy
const username = 'yourUsername';
const password = 'yourPassword';
// Encode username and password in Base64 for the Proxy-Authorization header
const base64Credentials = Buffer.from(`${username}:${password}`).toString('base64'); // Template literals need backticks, not quotes
const proxyAuthHeader = `Basic ${base64Credentials}`;
superagent
  .get(url)
  .proxy(proxy) // Set the proxy without embedded credentials
  .set('Proxy-Authorization', proxyAuthHeader) // Manually set the proxy authorization header
  .then(res => {
    console.log('Data retrieved:', res.text); // Output the fetched data
  })
  .catch(err => {
    console.error('Error fetching data:', err); // Error handling
  });
We first create two constants that store our username and password as strings. The Basic authentication scheme used by the proxy authorization header expects the credentials in Base64, so we need to convert our strings to that encoding.
That’s what “base64Credentials” does: it builds a buffer from the combined username and password strings and converts it to Base64. We then construct the proxy authorization header and set it on the request.
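As a quick sanity check, you can confirm the encoding in the Node REPL. With the hypothetical credentials user:pass, the header value should come out as follows:
// Hypothetical credentials, for illustration only
const header = `Basic ${Buffer.from('user:pass').toString('base64')}`;
console.log(header); // Basic dXNlcjpwYXNz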
Implementing Proxy Lists
Some providers may give you a list of IP addresses instead of a single rotating endpoint. In that case, you’ll have to create your own proxy rotation logic.
One of the simplest approaches is to pick an address at random for each request. Let’s start by setting up a list of our proxies:
const superagent = require('superagent');
require('superagent-proxy')(superagent);
// List of proxies
const proxies = [
  'http://username:password@proxyhost1:port',
  'http://username:password@proxyhost2:port',
  'http://username:password@proxyhost3:port',
  // Add more proxies as needed
];
We’ll then create a function to pick a proxy at random from our list:
const superagent = require('superagent');
require('superagent-proxy')(superagent);
// List of proxies
const proxies = [
  'http://username:password@proxyhost1:port',
  'http://username:password@proxyhost2:port',
  'http://username:password@proxyhost3:port',
  // Add more proxies as needed
];
function sendRequestWithRandomProxy(url) {
  const proxy = proxies[Math.floor(Math.random() * proxies.length)]; // Select a random proxy
  console.log(`Using proxy: ${proxy}`);
  superagent
    .get(url)
    .proxy(proxy)
    .then(res => {
      console.log('Data retrieved:', res.text); // Output the fetched data
    })
    .catch(err => {
      console.error('Error fetching data:', err); // Error handling
    });
}
// Example: send multiple requests, each through a randomly selected proxy
const targetURL = 'https://iproyal.com'; // The URL you want to scrape
sendRequestWithRandomProxy(targetURL);
sendRequestWithRandomProxy(targetURL);
sendRequestWithRandomProxy(targetURL); // Each call picks a proxy at random (repeats are possible)
We can then simply call the function for each URL. If you have many URLs to go through, loop over them, as shown in the sketch below.
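For example, a minimal sketch that loops over a hypothetical array of URLs, sending each one through a random proxy:
// Hypothetical list of pages to scrape
const urls = [
  'https://iproyal.com',
  'https://example.com/page1',
  'https://example.com/page2',
];

for (const url of urls) {
  sendRequestWithRandomProxy(url); // Requests fire concurrently, each via a random proxy
}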
Picking the Correct Proxies
There are a few different types of proxies that may significantly influence the performance of your web scraping project:
- Residential proxies – IPs acquired from household devices. Great for use cases where speeds are not as important and bot detection rates are high.
- Datacenter proxies – IPs acquired from data centers. Great for use cases where speeds are most important and detection rates are not that high.
- ISP proxies – a combination of the two: datacenter-hosted IPs registered to consumer ISPs. Best for use cases where lots of IPs are not required, but speed and undetectability are important.
Most of these proxies integrate through the same methods we’ve outlined above. Note that some proxies are charged per IP with no bandwidth limitations, while others are charged per GB of traffic.