How to Use HtmlAgilityPack for Web Scraping in C#
Learn how HTML Agility Pack helps parse HTML while web scraping with C#, and in what situations you need additional tools for data extraction.

Justas Vitaitis
Key Takeaways
- HTML Agility Pack can be set up in your C# project using the NuGet Package Manager in Visual Studio.
- A basic scraper can use XPath expressions to find relevant tags and CSS styling attributes and extract data.
- A more advanced scraper will use loops to extract and parse HTML at scale or incorporate alternative tools for JavaScript-heavy targets.
HTML Agility Pack is an open-source HTML parser library for .NET that plays a critical role in web scraping with C#. It works by loading HTML content and converting it into a DOM structure that you can traverse and query using XPath expressions, LINQ, and CSS selectors (via extensions).
Parsing HTML documents with regular expressions or writing your own HTML parser may be impossible or counterproductive. HTML Agility Pack is free, and most common scenarios for extracting data, like blogs, lists, and tables, are already covered.
Setting up HtmlAgilityPack in Your C# Project
The steps below are based on the newest version of Visual Studio with .NET, since it’s standard for coding in C#. Using HTML Agility Pack is possible in other development environments as well, but the steps might differ.
- Open or create a C# project.
- Click on Tools and select NuGet Package Manager.
- Choose Package Manager Console from the submenu, and in the command-line interface, type in the install command:
Install-Package HtmlAgilityPack
- Now you only need to add the directive "using HtmlAgilityPack;" at the top of your C# file to tell the compiler which namespace you want to use.
This gives access to HTML Agility Pack classes such as HtmlDocument, HtmlNode, HtmlWeb, and others used for data extraction from websites. Here's how basic code that loads an HTML string might look:
using HtmlAgilityPack;
// Create a new document
var doc = new HtmlDocument();
// Load HTML from a string
string html = "<html><body><h1>Hello World</h1></body></html>";
doc.LoadHtml(html);
// Access the content
var title = doc.DocumentNode.SelectSingleNode("//h1").InnerText;
Console.WriteLine(title); // Output: Hello World
Now you can run the program from the terminal, which should output “Hello World”, by using:
dotnet run
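The intro mentioned that the parsed DOM can also be queried with LINQ, not just XPath. Here's a minimal sketch of that approach using Descendants(); the sample HTML is made up purely for illustration:
using System;
using System.Linq;
using HtmlAgilityPack;

var doc = new HtmlDocument();
doc.LoadHtml("<html><body><p>First</p><p>Second</p></body></html>");

// Descendants("p") returns every <p> node; LINQ projects out the text
var paragraphs = doc.DocumentNode
    .Descendants("p")
    .Select(node => node.InnerText)
    .ToList();

foreach (var text in paragraphs)
    Console.WriteLine(text); // Output: First, then Second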
Basic Web Scraping Example With HtmlAgilityPack
We'll use Quotes to Scrape as a practice target to extract data with HTML Agility Pack.
Before getting started, we need to find the pattern that the web page follows. After inspecting the HTML document with DevTools, we can see that author names are inside <small> tags with the class author.
<small class="author" itemprop="author">Albert Einstein</small>
We'll load the HTML content from the URL and let HTML Agility Pack convert the raw HTML into a navigable HTML document. Then we'll use the XPath expressions that HTML Agility Pack accepts for DOM navigation to find those tags.
For author names, you'd use //small[@class='author'], which tells the web scraper to find <small> tags in the HTML document whose class attribute equals author. In a real-world scenario, the class might be obfuscated or named differently, so more work with XPath selectors might be needed, as in the example below.
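For instance, if the author tag carried several classes, a contains() check is a common fallback. It isn't needed for this particular page and is only an illustration of a slightly more flexible XPath expression:
// Matches <small> tags whose class attribute contains "author",
// even when other classes are present (e.g. class="author highlighted")
var nodes = doc.DocumentNode.SelectNodes("//small[contains(@class, 'author')]");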
HTML Agility Pack allows us to use .InnerText to get only the plain text inside and remove any other HTML content. Then we only need to extract data from the first relevant HTML node. The code might look like this:
using HtmlAgilityPack;
using System;

class Program
{
    static void Main()
    {
        // Step 1: Load the webpage
        var web = new HtmlWeb();
        var doc = web.Load("https://quotes.toscrape.com/");

        // Step 2 & 3: Select the first author element using XPath
        var firstAuthorNode = doc.DocumentNode.SelectSingleNode("//small[@class='author']");

        // Step 4: Extract and display the first author name
        if (firstAuthorNode != null)
        {
            string authorName = firstAuthorNode.InnerText;
            Console.WriteLine("First author found:");
            Console.WriteLine("------------------");
            Console.WriteLine(authorName);
        }
        else
        {
            Console.WriteLine("No author found.");
        }
    }
}
Running the program should output:
First author found:
------------------
Albert Einstein
Some prefer to use CSS selectors over XPath expressions in web scraping. While CSS vs XPath deserves a separate discussion, XPath expressions are the better fit for HTML Agility Pack because they are natively supported. CSS selectors can still be used and are even preferred in some cases.
CSS selectors with HTML Agility Pack are commonly used to improve performance and readability. However, you'll need to set up third-party .NET library extensions, which might create other problems. Consider whether it's worth putting that much trust in unofficial dependencies just to use CSS selectors in HTML Agility Pack.
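If you do decide to go that route, one commonly used extension is the Fizzler.Systems.HtmlAgilityPack NuGet package, which adds QuerySelector and QuerySelectorAll extension methods to HtmlNode. Treat the following as a rough sketch, since the exact package and API you choose may differ:
using System;
using HtmlAgilityPack;
using Fizzler.Systems.HtmlAgilityPack; // third-party extension, installed separately

var web = new HtmlWeb();
var doc = web.Load("https://quotes.toscrape.com/");

// CSS selector equivalent of //small[@class='author']
var authors = doc.DocumentNode.QuerySelectorAll("small.author");

foreach (var author in authors)
    Console.WriteLine(author.InnerText);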
Parsing HTML Elements
Once you can load HTML documents with HTML Agility Pack, you need to start parsing them. It involves selecting HTML content, looping through HTML nodes, and handling cases where things don't go as expected. HTML Agility Pack uses built-in XPath support to accomplish these tasks.
Selecting by Class Attribute
Most web pages, including our example Quotes to Scrape, use CSS classes to style and organize content. The [@class='author'] predicate instructs the parser to extract only elements whose class attribute equals author.
// Select all elements with class="author"
var authors = doc.DocumentNode.SelectNodes("//small[@class='author']");
// Select any element type with a specific class
var headers = doc.DocumentNode.SelectNodes("//*[@class='header']");
Selecting by Tag Name
An even more straightforward way to select HTML document elements is by tag name, which is most helpful when you want all instances of a specific tag type. For example, the //p XPath expression finds all paragraph elements on a page.
var paragraphs = doc.DocumentNode.SelectNodes("//p");
Selecting by ID Attribute
Many web pages use IDs as unique identifiers for specific elements in HTML documents. When you find an ID, you can target it directly by using SelectSingleNode() since IDs are unique and you're selecting only one element.
var navBar = doc.DocumentNode.SelectSingleNode("//nav[@id='navigation']");
Extracting Data With a Loop
When extracting one author from the HTML document, we used a single SelectSingleNode() call; now we'll use SelectNodes("//small[@class='author']"), which finds all matching author elements and returns them as a collection.
The foreach loop processes each matched node one at a time. Unlike in some other languages, in C# you don't need to track counters or manually move from one item to the next. A code example might look like this:
using HtmlAgilityPack;
using System;

class Program
{
    static void Main()
    {
        // Step 1: Load the webpage
        var web = new HtmlWeb();
        var doc = web.Load("https://quotes.toscrape.com/");

        // Step 2: Select all author elements using XPath
        var authorNodes = doc.DocumentNode.SelectNodes("//small[@class='author']");

        // Step 3: Null check before processing
        if (authorNodes != null && authorNodes.Count > 0)
        {
            // Step 4: Extract and display author names
            Console.WriteLine("Authors found:");
            Console.WriteLine("-------------");
            foreach (var node in authorNodes)
            {
                string authorName = node.InnerText;
                Console.WriteLine(authorName);
            }

            // Show total count
            Console.WriteLine($"\nTotal authors: {authorNodes.Count}");
        }
        else
        {
            Console.WriteLine("No authors found on this page.");
        }
    }
}
Running the code above should give you a list of authors and the number of entries.
Null Checks and HTML Document Exceptions
In HTML Agility Pack, SelectNodes() and SelectSingleNode() return null when no matching elements are found. Without null checks, your web scraper will crash when nothing matches. This is especially important when writing a loop for HTML parsing, as the null checks prevent errors when iterating through multiple results.
// After selecting nodes with XPath
var authorNodes = doc.DocumentNode.SelectNodes("//small[@class='author']");

// Check if any nodes were found
if (authorNodes != null && authorNodes.Count > 0)
{
    // Safe to process the collection here
    foreach (var node in authorNodes)
    {
        // Extract and use data
        string data = node.InnerText;
        Console.WriteLine(data);
    }
}
else
{
    // Handle the case where no elements were found
    Console.WriteLine("No matching elements found.");
}
Null checks ensure your scraper doesn't crash when an HTML document's structure changes, for example, when child nodes are moved. Yet null checks handle only one specific failure: missing HTML content. HTML Agility Pack can run into many other problems when parsing HTML:
- The website might be down
- Your internet connection could drop
- The HTML files might be malformed
- The server might block your request to load HTML content
An advanced HTML parser and scraper needs a try-catch exception handling block to catch the errors you might not have anticipated. The exact error handling you should implement depends on your setup and target, and in some cases it might require using alternative tools.
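As a minimal sketch of that structure, here's the earlier author scraper wrapped in a try-catch. The broad catch is only for illustration; a real scraper would likely handle specific exception types and add retries:
using HtmlAgilityPack;
using System;

class Program
{
    static void Main()
    {
        try
        {
            var web = new HtmlWeb();
            var doc = web.Load("https://quotes.toscrape.com/");

            var authorNodes = doc.DocumentNode.SelectNodes("//small[@class='author']");
            if (authorNodes == null)
            {
                Console.WriteLine("No authors found - the page structure may have changed.");
                return;
            }

            foreach (var node in authorNodes)
            {
                Console.WriteLine(node.InnerText);
            }
        }
        catch (Exception ex)
        {
            // Network failures, blocked requests, and malformed responses all surface here;
            // log the error and decide whether to retry or abort.
            Console.WriteLine($"Scraping failed: {ex.Message}");
        }
    }
}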
Alternatives to HtmlAgilityPack
HTML Agility Pack is an excellent choice for parsing static HTML content, but it falls short for many modern web scraping tasks. Nowadays, web pages rely heavily on JavaScript for infinite scrolling, dynamically generated HTML content, and other features. There's also no built-in CSS selector support or any way to perform complex user interactions, like logging in or solving CAPTCHAs.
Some of these shortcomings can be solved, but even then, the performance of HTML Agility Pack is subpar for large-scale scraping use cases. For these reasons, many C# web scrapers use HTML Agility Pack alternatives to scrape HTML content.
- AngleSharp is a more modern, HTML5 standards-compliant parser library for .NET. It provides an API for parsing, navigating, and manipulating CSS, XML, and HTML documents with native CSS selector support (see the short sketch after this list).
- Selenium is a browser automation framework that controls real web browsers (Chrome, Firefox, Edge, and others) programmatically. It's best for testing web applications and scraping dynamic content that requires interaction.
- Puppeteer Sharp is a .NET port of the Puppeteer library that provides an API for controlling headless Chromium browsers via the DevTools Protocol. Like Selenium, Puppeteer Sharp is best for scraping JavaScript-heavy websites and testing.
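To give a rough sense of the difference, here's what the author scraper might look like in AngleSharp with its native CSS selectors. Treat it as a sketch rather than a full tutorial:
using System;
using System.Threading.Tasks;
using AngleSharp;

class Program
{
    static async Task Main()
    {
        // AngleSharp loads the page asynchronously and exposes a browser-like DOM
        var context = BrowsingContext.New(Configuration.Default.WithDefaultLoader());
        var document = await context.OpenAsync("https://quotes.toscrape.com/");

        // Native CSS selector support - no XPath needed
        foreach (var author in document.QuerySelectorAll("small.author"))
            Console.WriteLine(author.TextContent);
    }
}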
All these tools can be installed as NuGet packages, but, with the exception of AngleSharp, they are still more challenging to set up than HTML Agility Pack. Learning to use Selenium and Puppeteer Sharp for advanced projects might also be more difficult.
The most significant advantage of HTML Agility Pack is that it's lightweight and uses little memory when working with most HTML documents. So, if your project doesn't require JavaScript support or complex page interactions, it might be easier to stick with HTML Agility Pack. Otherwise, look for an alternative.
Conclusion
HTML Agility Pack is a fast and easy solution to parse static HTML documents that even a beginner with some knowledge of C# can use. It's part of the reason why HTML Agility Pack is so popular. However, if your web scraping project is more complex, you might need to opt for more powerful tools.