PowerShell Web Scraping Tutorial for 2024
Justas Vitaitis
PowerShell is a bit of an unusual tool: Microsoft originally created it to modernize its command-line tooling. Where most command-line shells were limited to computer administration tasks, PowerShell pairs a shell with a full scripting and automation framework.
While it’s not used frequently for web scraping, it’s definitely a powerful tool that can help you acquire data at scale quickly and effectively. PowerShell web scraping is great for small-to-medium sized projects that don’t have too many moving parts.
Can PowerShell Do Web Scraping?
Web scraping with PowerShell is definitely possible. It’s also quite easy to use due to the straightforward nature of the program.
There are a few commands in PowerShell that make web scraping possible, such as Invoke-RestMethod and Invoke-WebRequest. You can communicate with websites and download HTML data using these methods, making them the foundation of web scraping with PowerShell.
When compared to other popular languages for building web scrapers, such as Python with Requests and BeautifulSoup, PowerShell does lag behind. It simply doesn’t have the community support and ecosystem of modules that a general-purpose programming language like Python enjoys.
PowerShell, however, can still be great for environments where users want to introduce as few additional programs as possible while still being able to automate tasks. Additionally, all Windows-based infrastructures will have PowerShell by default.
Finally, there are additional modules for PowerShell that make web scraping easier. For example, the PowerHTML module makes parsing HTML data a lot easier, turning PowerShell into an even better web scraping tool.
Web Scraping With PowerShell
Before you can start web scraping with PowerShell, you’ll need the application itself. If you’re a Windows user, it’s installed by default and you can find it through the search function. If you’re using any other OS, you’ll have to download and install PowerShell from Microsoft’s website.
When PowerShell is up and running, we can start coding our web scraping tool.
Sending a GET Request
PowerShell web scraping is quite similar to most other programming languages. We first have to define a variable that will store the response we get from a website:
$response = Invoke-WebRequest -Uri "https://iproyal.com"
If you run the command, it should take a second to complete. You can then use various methods to access several types of data from the response, such as:
$response.StatusCode
$response.Headers
$response.Links
There’s plenty more you can find by reading through the PowerShell documentation.
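As a sketch of what else the response object exposes, the snippet below inspects a few commonly used properties (the URL is the same placeholder used throughout this article; exact values depend on the site):

```powershell
# Fetch a page and inspect the response object
$response = Invoke-WebRequest -Uri "https://iproyal.com"

# Raw HTML of the page as a single string (first 100 characters here)
$response.Content.Substring(0, 100)

# Size of the raw response body in bytes
$response.RawContentLength

# Filter the parsed links down to absolute URLs only
$response.Links | Where-Object { $_.href -like "https://*" } | Select-Object -First 5 href
```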
On the other hand, you can use Invoke-RestMethod to retrieve data from a website. Unlike Invoke-WebRequest, Invoke-RestMethod automatically parses structured responses such as JSON or XML into PowerShell objects; for a plain HTML page, it simply returns the HTML as a string.
$response = Invoke-RestMethod -Uri "https://iproyal.com"
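To see the difference, here’s a sketch against a public JSON endpoint (httpbin.org, used here purely as an example service). Invoke-RestMethod hands back a PowerShell object whose fields you can access as properties, with no manual JSON parsing:

```powershell
# Invoke-RestMethod parses the JSON response into a PowerShell object
$data = Invoke-RestMethod -Uri "https://httpbin.org/get"

# Access parsed fields as properties instead of parsing text yourself
$data.url           # the URL the endpoint echoes back
$data.headers.Host  # a header field from the echoed request
```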
If you’re planning to perform large-scale web scraping, you’ll need to create PowerShell scripts. You can write them in Visual Studio Code or any text editor: add the code listed here to a file and save it with the .ps1 extension.
Note that most Windows machines disable PowerShell script execution by default. You’ll need to open a terminal as an administrator and run the following command:
Set-ExecutionPolicy RemoteSigned
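If you’d rather not change the machine-wide policy, there are two standard alternatives: scope the change to your user account (no admin rights needed), or bypass the policy for a single run. The script name below is a placeholder:

```powershell
# Check the current policy at each scope first
Get-ExecutionPolicy -List

# Allow local scripts for the current user only
Set-ExecutionPolicy RemoteSigned -Scope CurrentUser

# Or bypass the policy for one run without changing any settings
powershell -ExecutionPolicy Bypass -File .\scraper.ps1
```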
Parsing Data With PowerHTML
PowerHTML is a PowerShell web scraping module that makes extracting and parsing data from HTML a lot easier. You’ll need to create a script that installs the PowerHTML module and then loads the HTML file:
# Install the PowerHTML module
Install-Module -Name PowerHTML -Scope CurrentUser
Import-Module PowerHTML
# Send a request and parse the HTML content
$response = Invoke-WebRequest -Uri "https://iproyal.com"
$html = New-Object HtmlAgilityPack.HtmlDocument
$html.LoadHtml($response.Content)
# Use XPath to select all <a> tags that have an href attribute
$links = $html.DocumentNode.SelectNodes('//a[@href]') | ForEach-Object {
    $_.GetAttributeValue('href', '')
}
# Display the links
$links
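PowerHTML also ships a ConvertFrom-Html cmdlet that wraps the HtmlAgilityPack setup shown above. Here’s a sketch of the same link extraction, plus pulling heading text, assuming the module is already installed:

```powershell
Import-Module PowerHTML

# ConvertFrom-Html replaces the manual HtmlDocument creation
$response = Invoke-WebRequest -Uri "https://iproyal.com"
$doc = ConvertFrom-Html -Content $response.Content

# Extract all href attributes, skipping <a> tags without one
$links = $doc.SelectNodes('//a[@href]') | ForEach-Object {
    $_.GetAttributeValue('href', '')
}

# Grab the trimmed text of every <h2> heading on the page
$headings = $doc.SelectNodes('//h2') | ForEach-Object { $_.InnerText.Trim() }
```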
Sending Web Requests via Proxy in PowerShell
PowerShell has native proxy support, which makes your web scraping efforts a lot easier if you have proxies available. All you need to do is add a couple of lines to your regular web scraping script:
# Define the proxy server address
$proxy = "http://proxyserver:8080"
# Send a request via the proxy
$response = Invoke-WebRequest -Uri "https://iproyal.com" -Proxy $proxy
# Output the response content
$response.Content
Implementing proxies in web scraping is an absolute necessity, as anti-bot measures will frequently ban any automation engine. Additionally, if you have rotating proxies, you won’t need to implement anything else, as your IP address will change with each request.
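Most paid proxies also require authentication, which Invoke-WebRequest supports through the -ProxyCredential parameter. The hostname, port, and account below are placeholders:

```powershell
# Build a credential object for the proxy account (placeholder values)
$password = ConvertTo-SecureString "proxy-password" -AsPlainText -Force
$credential = New-Object System.Management.Automation.PSCredential("proxy-user", $password)

# Send the request through an authenticated proxy
$response = Invoke-WebRequest -Uri "https://iproyal.com" `
    -Proxy "http://proxyserver:8080" `
    -ProxyCredential $credential
```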
Implementing Proxy Rotation in PowerShell
If you don’t have rotating proxies, then you’ll need to create a manual loop that goes through each one in turn. Doing so isn’t extremely difficult, but will still give you a much better chance at bypassing anti-bot measures when web scraping:
# Define a list of proxies
$proxies = @(
"http://proxyserver1:8080",
"http://proxyserver2:8080",
"http://proxyserver3:8080"
)
# Loop through the proxies and send requests
foreach ($proxyUrl in $proxies) {
    $response = Invoke-WebRequest -Uri "https://iproyal.com" -Proxy $proxyUrl
    Write-Host "Response from $($proxyUrl): $($response.StatusCode)"
}
You may also need to create a list of URLs to loop through, as sending numerous requests to the same one won’t benefit most web scraping projects.
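A sketch combining both ideas: iterate over a URL list while rotating proxies round-robin, wrapping each request in try/catch so a single dead proxy doesn’t stop the run. Hostnames and the second URL are placeholders:

```powershell
# Placeholder proxies and target URLs
$proxies = @("http://proxyserver1:8080", "http://proxyserver2:8080")
$urls = @("https://iproyal.com", "https://iproyal.com/blog")

$i = 0
foreach ($url in $urls) {
    # Pick the next proxy in round-robin order
    $proxy = $proxies[$i % $proxies.Count]
    $i++

    try {
        $response = Invoke-WebRequest -Uri $url -Proxy $proxy -TimeoutSec 10
        Write-Host "$url via $proxy -> $($response.StatusCode)"
    }
    catch {
        # A failed proxy or timeout logs a warning instead of aborting the loop
        Write-Warning "$url via $proxy failed: $($_.Exception.Message)"
    }
}
```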
PowerShell web scraping starts to struggle once you need a large number of proxies and have dozens of URLs to work through: throughput drops, and the code gets unnecessarily complicated. At that point, it may be worth switching to a language better suited to web scraping (e.g., Python), where object-oriented support and mature libraries also make the code easier to manage.
Author
Justas Vitaitis
Senior Software Engineer
Justas is a Senior Software Engineer with over a decade of proven expertise. He currently holds a crucial role in IPRoyal’s development team, regularly demonstrating his profound expertise in the Go programming language, contributing significantly to the company’s technological evolution. Justas is pivotal in maintaining our proxy network, serving as the authority on all aspects of proxies. Beyond coding, Justas is a passionate travel enthusiast and automotive aficionado, seamlessly blending his tech finesse with a passion for exploration.