jsoup Tutorial: HTML Parsing and Web Scraping in Java
Learn how to parse HTML and build web scrapers in Java using jsoup, CSS selectors, and DOM traversal in this beginner-friendly tutorial.

Justas Vitaitis
Key Takeaways
- jsoup is a free, open-source Java library for parsing, extracting, and manipulating HTML data.
- jsoup works best with static HTML; it doesn't execute JavaScript, so it can't see JavaScript-rendered content.
- You can use jsoup to fetch pages, extract structured data, clean HTML, and build multi-page scrapers.
- For dynamic websites, combine jsoup with tools like Selenium or Playwright, or scrape JSON APIs directly when possible.
Extracting data from HTML in Java is challenging and often requires serious know-how to execute properly. jsoup makes this extraction easier, allowing users to parse documents, query elements, and scrape web pages without much trouble or extensive experience.
In this tutorial, we’ll walk you through what jsoup is and how to parse HTML, select elements with CSS selectors, extract and clean data, and connect to live web pages for scraping.
What Is jsoup?
jsoup is an open-source Java library built for parsing HTML documents and extracting data from them. It can fetch web pages, parse HTML, and extract or modify data through a simple API based on DOM traversal and CSS selectors.
However, while jsoup does make the process easier, it’s important to note that it only works with static HTML. jsoup can’t execute JavaScript or render dynamic apps like a browser would. So, if you use jsoup on a JavaScript-heavy page, it only sees the HTML the server initially returns, not the content that scripts add later.
The main jsoup features include:
- Fetching and loading HTML directly over HTTP
- Parsing HTML into a DOM-like document structure
- Selecting elements with CSS-style selectors
- Reading and updating text, HTML, and element attributes
- Cleaning and sanitizing unsafe HTML content
jsoup is mostly used for scraping e-commerce sites and news headlines, extracting metadata from documents, and cleaning user-submitted HTML. Overall, it's a lightweight library with a simple API and selector system that Java applications can use for web scraping and HTML processing tasks.
As a side note, jsoup itself is a Java library, although it can also be called from other JVM languages such as Kotlin and Scala. Other top programming languages for web scraping, like Python, Go, and C#, rely on their own HTML parsing libraries, so in this tutorial, we’ll focus solely on Java.
Prerequisites and Setup
Before you open jsoup, you should know a few Java fundamentals. While using jsoup doesn’t require extensive knowledge, users should be comfortable with:
- Core Java syntax and object-oriented programming basics
- Working with methods, classes, and collections
- Basic file I/O and exception handling
- Basic understanding of HTTP requests and responses
- Familiarity with HTTP client libraries like OkHttp , though not required
NOTE: jsoup works with different Java versions, but Java 11 or newer is recommended for the best compatibility and tooling support. The examples in this tutorial use text blocks and records, which require Java 16 or newer.
Add jsoup to Your Project
Maven and Gradle are two of the most popular tools for managing Java projects. Both can automatically download libraries for individual projects and handle testing or packaging. Instead of downloading .jar files and adding them to the classpath by hand, you declare dependencies in a dedicated config file, and Maven/Gradle takes it from there.
To add jsoup using Maven, add the following code to your pom.xml:
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.18.1</version>
</dependency>
If you’re using Gradle, add this to your build.gradle file:
implementation 'org.jsoup:jsoup:1.18.1'
Note that if you’re using VS Code, the generated Maven project may be configured with an outdated compiler version, so make sure it’s set to at least 21.
Verify the Installation
After adding jsoup, check that the dependency was set up correctly. The quickest way to do that is to fetch a web page and print its title.
Here's an example of a relatively simple Java class:
package com.example;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
public class App {
public static void main(String[] args) {
try {
Document document = Jsoup.connect("https://example.com").get();
System.out.println("Page title: " + document.title());
} catch (Exception e) {
e.printStackTrace();
}
}
}
If you've successfully added jsoup and configured it correctly, you should get this response:
Page title: Example Domain
Important note: All code examples below assume your file is in the com.example package; do not remove it when pasting examples. Always add package com.example; as the first line of each file, matching the folder structure Maven generated (src/main/java/com/example/). If you're using a different groupId, adjust accordingly.
Connecting to a Web Page and Parsing HTML
Now that your jsoup is up and running, it's time to connect to a website and parse HTML. The process is pretty straightforward and mostly includes three key steps:
- Fetch HTML from a source
- Parse it into a document
- Select and extract the data you need
Step 1: Connecting With Jsoup.connect()
To get started, the most common entry point is Jsoup.connect(), which prepares an HTTP request and, once you call get(), returns a parsed Document.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
public class BasicConnectionExample {
public static void main(String[] args) throws Exception {
Document document = Jsoup
.connect("https://example.com")
.get();
System.out.println(document.title());
}
}
The above example shows only a basic request. In reality, for most scraping projects, you'll need to configure your request with appropriate headers and timeouts. Here's a list of the most common connection settings:
- userAgent() – to identify the client making the request
- timeout() – to set a maximum wait time
- header() – to add custom HTTP headers
- cookies() – to send cookies with your request
- followRedirects() – to control redirect handling
Like so:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
public class ConfiguredConnectionExample {
public static void main(String[] args) throws Exception {
Document document = Jsoup
.connect("https://example.com")
.userAgent("Mozilla/5.0")
.timeout(10_000)
.header("Accept-Language", "en-US")
.get();
System.out.println(document.title());
}
}
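The cookies() and followRedirects() settings from the list above are not used in the example, so here is a minimal sketch of how they might be combined. The cookie name and value are placeholders invented for illustration. Nothing is sent over the network until get() or execute() is called, so the sketch only builds the request and prints what was configured.

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import java.util.Map;

class ConnectionConfigSketch {
    // Builds a configured connection without sending it.
    static Connection configure(String url) {
        return Jsoup.connect(url)
                .userAgent("Mozilla/5.0")
                .timeout(10_000)                           // milliseconds
                .cookies(Map.of("session", "demo-value"))  // placeholder cookie
                .followRedirects(false);                   // do not follow 3xx responses
    }

    public static void main(String[] args) {
        Connection connection = configure("https://example.com");
        // Inspect the request that would be sent.
        System.out.println("Timeout (ms): " + connection.request().timeout());
        System.out.println("Follow redirects: " + connection.request().followRedirects());
        System.out.println("Cookies: " + connection.request().cookies());
    }
}
```

Turning redirects off can be useful when you want to inspect the redirect target yourself instead of landing on it automatically.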
Step 2: Alternative Input Sources
jsoup doesn't just fetch pages over HTTP – it can also parse HTML from files, streams, and raw strings. But before that, you should know that providing a base URI helps resolve relative links into absolute URLs. Here's how that looks:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
public class BaseUriExample {
public static void main(String[] args) {
String html = """
<a href="/products">Products</a>
""";
Document document = Jsoup.parse(
html,
"https://example.com"
);
Element link = document.selectFirst("a");
System.out.println(link.attr("href"));
System.out.println(link.absUrl("href"));
}
}
And here's the desired output:
/products
https://example.com/products
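To get a feel for the kind of resolution absUrl() performs with the base URI, the sketch below does something similar using only the JDK's java.net.URI. It is an illustration of relative-URL resolution, not jsoup's actual implementation. The class name BaseUriSketch and the sample paths are made up for this example.

```java
import java.net.URI;

class BaseUriSketch {
    // Resolves a possibly relative href against the page's base URI.
    static String resolve(String baseUri, String href) {
        return URI.create(baseUri).resolve(href).toString();
    }

    public static void main(String[] args) {
        String base = "https://example.com/blog/post";
        // Root-relative path: replaces the whole path of the base.
        System.out.println(resolve(base, "/products"));
        // Document-relative path: resolved against the /blog/ directory.
        System.out.println(resolve(base, "images/logo.png"));
        // Already absolute: returned unchanged.
        System.out.println(resolve(base, "https://other.example/page"));
    }
}
```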
Once URLs are in place, let's look at some different parsing examples. Here’s an example of parsing a local HTML file:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.io.File;
public class ParseFileExample {
public static void main(String[] args) throws Exception {
File input = new File("sample.html");
Document document = Jsoup.parse(
input,
"UTF-8"
);
System.out.println(document.title());
}
}
You’ll need to create a sample.html file in the root project directory as such:
<html>
<head>
<title>My Local Page</title>
</head>
<body>
<h1>Hello from a file</h1>
</body>
</html>
If you get a file-not-found error, replace the File input line with one that uses the absolute path to your file, for example:
File input = new File("C:/path/path/path/sample.html");
Example of parsing a raw HTML string:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
public class ParseStringExample {
public static void main(String[] args) {
String html = """
<html>
<head>
<title>Test Page</title>
</head>
<body>
<h1>Hello jsoup</h1>
</body>
</html>
""";
Document document = Jsoup.parse(html);
System.out.println(document.title());
}
}
Step 3: Understanding Document, Element, and Elements
While working with jsoup, you'll often find these three main types:
1. Document: represents the entire parsed HTML page. Usually, all parsing operations start from the document object.
Document document = Jsoup.connect("https://example.com").get();
2. Element: represents a single HTML element, making it easy to read its attributes, text, child nodes, and other HTML content.
Element heading = document.selectFirst("h1");
System.out.println(heading.text());
3. Elements: a collection of matching elements, used when selectors match multiple nodes.
import org.jsoup.select.Elements;
Elements links = document.select("a");
for (Element link : links) {
System.out.println(link.text());
}
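When you only need one value from every match, Elements has bulk helpers that save you from writing the loop yourself. The sketch below uses eachText() and eachAttr() on a small inline HTML string; the sample links are invented for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.select.Elements;
import java.util.List;

class ElementsSketch {
    // Collects the visible text of every link in the given HTML.
    static List<String> linkTexts(String html) {
        Elements links = Jsoup.parse(html).select("a");
        return links.eachText();
    }

    // Collects the href value of every link that has one.
    static List<String> linkHrefs(String html) {
        return Jsoup.parse(html).select("a").eachAttr("href");
    }

    public static void main(String[] args) {
        String html = "<a href=\"/one\">One</a><a href=\"/two\">Two</a>";
        System.out.println(linkTexts(html));
        System.out.println(linkHrefs(html));
    }
}
```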
How jsoup Handles Broken HTML
Issues happen, and one of the most common ones is broken HTML. Conveniently, jsoup has a handy feature that automatically cleans and normalizes incorrect or broken HTML.
Let's look at a practical example. The snippet below shows broken HTML:
<html>
<body>
<p>First paragraph
<p>Second paragraph
</body>
</html>
How jsoup handles it and parses the file:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
public class BrokenHtmlExample {
public static void main(String[] args) {
String html = """
<html>
<body>
<p>First paragraph
<p>Second paragraph
</body>
</html>
""";
Document document = Jsoup.parse(html);
System.out.println(document.body().html());
}
}
Output:
<p>First paragraph</p>
<p>Second paragraph</p>
Selecting HTML Elements
After you've parsed a document, the next step is to find the specific elements you want to extract from your file. With jsoup, you have two choices:
- CSS-style selectors with select()
- Built-in finder methods like getElementById()
Method 1: Using CSS Selectors With select()
This method allows you to query elements with familiar CSS selector syntax. Here are some of the most common selector patterns:
| Selector | Description | Example |
|---|---|---|
| tag | Select by tag name | p |
| .class | Select by class | .quote |
| #id | Select by ID | #main |
| [attr] | Elements with an attribute | [href] |
| [attr=value] | Attribute equals value | [type=text] |
| tag.class | Combined selector | div.quote |
| parent child | Descendant selector | div span |
Let's try selecting elements using a demo website quotes.toscrape.com :
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class SelectorExample {
public static void main(String[] args) throws Exception {
Document document = Jsoup.connect(
"https://quotes.toscrape.com"
).get();
Elements quotes =
document.select("div.quote");
for (Element quote : quotes) {
String text =
quote.select(".text").text();
String author =
quote.select(".author").text();
System.out.println(author + ": " + text);
}
}
}
Once you run the program, this is the output you should get:
Albert Einstein: “The world as we have created it is a process of our thinking...”
J.K. Rowling: “It is our choices, Harry, that show what we truly are...”
[...]
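If you'd like to try the selector patterns from the table without hitting a live site, a small offline sketch like the one below can help. The HTML snippet and the class name SelectorPatternsSketch are invented for this example; the count() helper simply reports how many elements each selector matches.

```java
import org.jsoup.Jsoup;

class SelectorPatternsSketch {
    // Small made-up document that exercises each selector pattern.
    static final String HTML =
            "<div id=\"main\">"
          + "<div class=\"quote\"><span class=\"text\">Stay curious.</span>"
          + "<a href=\"/author/1\">About</a></div>"
          + "<p class=\"quote\">A quote in a paragraph.</p>"
          + "<input type=\"text\" name=\"q\">"
          + "<a name=\"top\">No href here</a>"
          + "</div>";

    // Returns how many elements in the HTML match the CSS query.
    static int count(String html, String cssQuery) {
        return Jsoup.parse(html).select(cssQuery).size();
    }

    public static void main(String[] args) {
        String[] selectors = {
            "p", ".quote", "#main", "[href]", "[type=text]", "div.quote", "div span"
        };
        for (String selector : selectors) {
            System.out.println(selector + " -> " + count(HTML, selector) + " match(es)");
        }
    }
}
```

Note how .quote matches both the div and the paragraph, while div.quote narrows the match to the div only.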
Method 2: Using Built-in Finder Methods
For common operations, jsoup has direct lookup options, which include:
- Find an element by ID:
Element content =
document.getElementById("content");
- Find elements by class:
Elements quotes =
document.getElementsByClass("quote");
- Find elements by tag:
Elements links =
document.getElementsByTag("a");
These methods work well when you already know the exact structure of the page and don’t need more advanced selector logic.
Here’s a complete sample you can try:
package com.example;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class App {
public static void main(String[] args) throws Exception {
Document document = Jsoup.connect(
"https://quotes.toscrape.com"
).get();
// getElementById
Element content = document.getElementById("content");
System.out.println("getElementById('content') found: " + (content != null));
// getElementsByClass
Elements quotes = document.getElementsByClass("quote");
System.out.println("getElementsByClass('quote') count: " + quotes.size());
// getElementsByTag
Elements links = document.getElementsByTag("a");
System.out.println("getElementsByTag('a') count: " + links.size());
// selectFirst
Element title = document.selectFirst("title");
System.out.println("selectFirst('title') text: " + title.text());
}
}
Your output should look like this:
getElementById('content') found: false
getElementsByClass('quote') count: 10
getElementsByTag('a') count: 55
selectFirst('title') text: Quotes to Scrape
Using selectFirst() for Single Matches
In some cases, you only need a single element. For that, use selectFirst() instead of select(). It returns the first matching Element, or null if nothing matches.
Here's an example using the same quotes.toscrape.com page:
Element title =
document.selectFirst("title");
System.out.println(title.text());
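Because selectFirst() returns null when nothing matches, calling text() on the result can throw a NullPointerException on pages that lack the element. A minimal null-safe sketch, with a hypothetical helper name and fallback value, might look like this:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

class SelectFirstSketch {
    // Returns the text of the first match, or the fallback when nothing matches.
    static String firstTextOrDefault(String html, String cssQuery, String fallback) {
        Element match = Jsoup.parse(html).selectFirst(cssQuery); // null when there is no match
        return match != null ? match.text() : fallback;
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Quotes to Scrape</title></head><body></body></html>";
        System.out.println(firstTextOrDefault(html, "title", "n/a")); // element exists
        System.out.println(firstTextOrDefault(html, "h1", "n/a"));    // element is missing
    }
}
```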
Extracting and Manipulating Data
After you select the required elements, it's time to extract the data you need and turn it into a usable structure, so that applications can recognize and work with it. To help with that, jsoup provides APIs for reading text, attributes, HTML fragments, and navigating through related elements in the DOM.
Extracting Text, HTML, and Attribute Values
jsoup has more than one way to extract different data. The most common ones are:
- text() – extracts visible text content
- html() – returns the Element’s inner HTML
- outerHtml() – returns the full element markup
- attr() – reads attribute values
- absUrl() – converts relative URLs into absolute URLs
Let's look at a practical example using the absUrl() method, since it's specifically useful for scraping websites that rely heavily on relative paths for navigation links and assets.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
public class ExtractionExample {
public static void main(String[] args) {
String html = """
<article>
<h2>Learning jsoup</h2>
<a href="/tutorials/jsoup">
Read Tutorial
</a>
</article>
""";
Document document = Jsoup.parse(
html,
"https://example.com"
);
Element article = document.selectFirst("article");
Element link = article.selectFirst("a");
System.out.println("Text:");
System.out.println(article.text());
System.out.println("\nInner HTML:");
System.out.println(article.html());
System.out.println("\nHref:");
System.out.println(link.attr("href"));
System.out.println("\nAbsolute URL:");
System.out.println(link.absUrl("href"));
}
}
Output:
Text:
Learning jsoup Read Tutorial
Inner HTML:
<h2>Learning jsoup</h2>
<a href="/tutorials/jsoup">
Read Tutorial
</a>
Href:
/tutorials/jsoup
Absolute URL:
https://example.com/tutorials/jsoup
Mapping Extracted Data Into Java Objects
After extracting the required information, you'll typically map the data into Java objects for further processing. It's a clean way to represent the data you scraped.
You can create records, such as the Article record below, directly from extracted elements:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
public class MappingExample {
    // Immutable holder for one scraped article (records require Java 16+)
    record Article(String title, String url) {}

    public static void main(String[] args) {
        String html = """
                <article>
                <h2>
                <a href="/article-1">
                Intro to jsoup
                </a>
                </h2>
                </article>
                """;
        // The base URI lets absUrl() turn /article-1 into an absolute URL
        Document document = Jsoup.parse(
                html,
                "https://example.com"
        );
        Element link = document.selectFirst("a");
        Article article = new Article(
                link.text(),
                link.absUrl("href")
        );
        System.out.println(article);
    }
}
Traversing the DOM
In most cases, CSS selectors are more than capable of handling extraction, but there are situations where you need to walk the DOM manually to reach related elements.
Here's how you can do this with jsoup using its traversal methods for moving between parents, children, and siblings:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
public class TraversalExample {
public static void main(String[] args) {
String html = """
<div class="product">
<h2>Laptop</h2>
<p class="price">$1299</p>
<p class="stock">In Stock</p>
</div>
""";
Document document = Jsoup.parse(html);
Element price = document.selectFirst(".price");
System.out.println("Parent:");
System.out.println(price.parent().tagName());
System.out.println("\nPrevious Sibling:");
System.out.println(price.previousElementSibling().text());
System.out.println("\nNext Sibling:");
System.out.println(price.nextElementSibling().text());
System.out.println("\nChildren:");
for (Element child : price.parent().children()) {
System.out.println(child.tagName() + " -> " + child.text());
}
}
}
And here's the kind of response you should get:
Parent:
div
Previous Sibling:
Laptop
Next Sibling:
In Stock
Children:
h2 -> Laptop
p -> $1299
p -> In Stock
Modifying and Cleaning HTML
jsoup can also modify documents programmatically and clean unsafe HTML. Cleaning is done with Jsoup.clean(), a built-in method that removes any tags and attributes not allowed by the safelist you pass in.
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;
public class CleanHtmlExample {
public static void main(String[] args) {
String unsafeHtml = """
<p>Hello</p>
<script>alert('xss')</script>
<a href="https://example.com">Visit</a>
""";
String safeHtml = Jsoup.clean(
unsafeHtml,
Safelist.basic()
);
System.out.println(safeHtml);
}
}
Once you run the program, this is what you should see:
<p>Hello</p>
<a href="https://example.com" rel="nofollow">Visit</a>
In this case, the <script> tag is automatically removed because it isn’t allowed by the designated safelist.
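The safelist you choose decides how much markup survives. As a rough sketch, the example below strips all tags with Safelist.none() and uses Jsoup.isValid() to check whether input already conforms to Safelist.basic() before you decide to clean or reject it. The class name and sample strings are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

class SafelistSketch {
    // Removes every tag and keeps only the text content.
    static String stripAllTags(String html) {
        return Jsoup.clean(html, Safelist.none());
    }

    // Reports whether the input contains only markup allowed by the basic safelist.
    static boolean isBasicSafe(String html) {
        return Jsoup.isValid(html, Safelist.basic());
    }

    public static void main(String[] args) {
        String unsafeHtml = "<p>Hello <b>world</b></p><script>alert('xss')</script>";
        System.out.println(stripAllTags(unsafeHtml));
        System.out.println(isBasicSafe(unsafeHtml));
        System.out.println(isBasicSafe("<p>Hello</p>"));
    }
}
```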
Building a Simple Web Scraper
Now that we’ve covered the key elements of jsoup and how parsing and extraction work, this section shows you how to build a simple web scraper with it.
Our particular example will be able to:
- Crawl multiple pages using pagination
- Extract structured quote data
- Store results in Java objects
- Pause between requests to avoid hammering the server
NOTE: As an example, this scraper targets the paginated demo site quotes.toscrape.com.
Complete Scraper Example
package com.example;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.util.ArrayList;
import java.util.List;
public class App {
public record Quote(
String text,
String author
) {
}
public static void main(String[] args) {
List<Quote> quotes = new ArrayList<>();
String baseUrl = "https://quotes.toscrape.com/page/";
int maxPages = 3;
try {
for (int page = 1; page <= maxPages; page++) {
String url = baseUrl + page + "/";
System.out.println("Scraping: " + url);
Document document = Jsoup.connect(url)
.userAgent("Mozilla/5.0")
.timeout(10_000)
.get();
Elements quoteElements =
document.select("div.quote");
for (Element quoteElement : quoteElements) {
Element textEl =
quoteElement.selectFirst(".text");
Element authorEl =
quoteElement.selectFirst(".author");
if (textEl == null || authorEl == null) {
continue;
}
Quote quote = new Quote(
textEl.text(),
authorEl.text()
);
quotes.add(quote);
}
Thread.sleep(2000);
}
System.out.println("\nScraped " + quotes.size() + " quotes:\n");
for (Quote quote : quotes) {
System.out.println(quote);
}
} catch (Exception e) {
e.printStackTrace();
}
}
}
Example output:
Scraping: https://quotes.toscrape.com/page/1/
Scraping: https://quotes.toscrape.com/page/2/
Scraping: https://quotes.toscrape.com/page/3/
Scraped 30 quotes:
Quote[text=“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”, author=Albert Einstein]
Quote[text=“It is our choices, Harry, that show what we truly are, far more than our abilities.”, author=J.K. Rowling]
Quote[text=“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”, author=Albert Einstein]
Quote[text=“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”, author=Jane Austen]
Quote[text=“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”, author=Marilyn Monroe]
[...]
How the Pagination Loop Works
The scraper uses a simple for loop to move through the paginated pages. You can see it in this line:
for (int page = 1; page <= maxPages; page++) {
Every iteration builds a new URL:
String url = baseUrl + page + "/";
This loop pattern works well for websites whose pagination URLs follow a predictable numeric sequence, like this:
/page/1/
/page/2/
/page/3/
The same approach works for query-string pagination such as ?page=1, ?page=2, and ?page=3. Once the page is fetched, the scraper selects all matching quote elements:
Elements quoteElements =
document.select("div.quote");
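Both common pagination styles come down to simple string building. The sketch below shows a path-style helper matching the quotes.toscrape.com URLs and a query-style helper for sites that use a page parameter. The example.com blog URL is hypothetical and only illustrates the second pattern.

```java
class PaginationUrls {
    // Path-style pagination, e.g. https://quotes.toscrape.com/page/2/
    static String pathStyle(String baseUrl, int page) {
        return baseUrl + page + "/";
    }

    // Query-style pagination, e.g. https://example.com/blog?page=2 (hypothetical site)
    static String queryStyle(String baseUrl, int page) {
        return baseUrl + "?page=" + page;
    }

    public static void main(String[] args) {
        int maxPages = 3;
        for (int page = 1; page <= maxPages; page++) {
            System.out.println(pathStyle("https://quotes.toscrape.com/page/", page));
            System.out.println(queryStyle("https://example.com/blog", page));
        }
    }
}
```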
Using Thread.sleep() for Politeness
At this point, we'd like to emphasize a crucial data scraping practice. Ethical scraping is one thing, and should be reviewed individually, but it's also important to slow down requests so as not to overload the target server.
The Thread.sleep() method allows you to do just that. Let's say you want to introduce a 2-second interval between sending page requests:
Thread.sleep(2000);
Additionally, requests that arrive at perfectly regular intervals are easy for servers to detect and block, so it's advisable to randomize the delay.
Example:
Thread.sleep(1500 + (int)(Math.random() * 1000));
NOTE: Always check a website's robots.txt file and terms of service before scraping. This is general advice only, so we strongly recommend reviewing the policies of every site you target and adhering to ethical data collection practices.
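The randomized delay shown above can be wrapped in a small helper so the base delay and jitter are easy to tune. This is a minimal sketch: the class name PoliteDelay and the chosen values are assumptions for illustration, not recommended limits for any particular site.

```java
import java.util.concurrent.ThreadLocalRandom;

class PoliteDelay {
    // Picks a delay from baseMillis (inclusive) to baseMillis + jitterMillis (exclusive).
    // jitterMillis must be greater than zero.
    static long nextDelayMillis(long baseMillis, long jitterMillis) {
        return baseMillis + ThreadLocalRandom.current().nextLong(jitterMillis);
    }

    // Sleeps for a randomized interval between requests.
    static void pause(long baseMillis, long jitterMillis) throws InterruptedException {
        Thread.sleep(nextDelayMillis(baseMillis, jitterMillis));
    }

    public static void main(String[] args) throws InterruptedException {
        for (int page = 1; page <= 3; page++) {
            System.out.println("Fetching page " + page + " would happen here");
            pause(1500, 1000); // waits between 1.5 and 2.5 seconds
        }
    }
}
```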
Our web scraper is deliberately simple, but you can use it as a basis for a more advanced scraper that can:
- Save extracted data to CSV files
- Export results as JSON
- Write data into a database
- Follow article links to scrape detailed pages
- Run requests in parallel for higher throughput
- Add retry logic and error handling
- Use rotating proxies and user agents for large-scale scraping
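As one example of these extensions, saving results to a CSV file needs only the JDK. The sketch below assumes a Quote record shaped like the one in the scraper and quotes every field so that commas and quotation marks inside the text don't break the columns. The file name quotes.csv and the sample data are placeholders.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

class CsvExportSketch {
    record Quote(String text, String author) {}

    // Wraps a value in double quotes and doubles any embedded quotes.
    static String escape(String value) {
        return "\"" + value.replace("\"", "\"\"") + "\"";
    }

    // Converts one quote into a single CSV line.
    static String toCsvLine(Quote quote) {
        return escape(quote.text()) + "," + escape(quote.author());
    }

    // Writes a header row followed by one line per quote (UTF-8).
    static void writeCsv(Path file, List<Quote> quotes) throws IOException {
        List<String> lines = new ArrayList<>();
        lines.add("text,author");
        for (Quote quote : quotes) {
            lines.add(toCsvLine(quote));
        }
        Files.write(file, lines);
    }

    public static void main(String[] args) throws IOException {
        List<Quote> quotes = List.of(
                new Quote("A sample quote with a \"nested\" quotation.", "Jane Doe"),
                new Quote("Another sample, with a comma.", "John Doe"));
        writeCsv(Path.of("quotes.csv"), quotes);
        System.out.println("Wrote " + quotes.size() + " quotes to quotes.csv");
    }
}
```

For anything beyond simple exports, a dedicated CSV library is usually a safer choice than hand-rolled escaping.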
Error Handling, Politeness, and Best Practices
When you run this web scraper against real websites, it might not perform as intended. Errors can occur because of differences in the target website's structure, network problems, rate limiting, and much more.
That's why web scraping specialists build and customize scrapers for their specific needs. However, the most common errors are well documented and can be handled in code.
Handling HTTP and Network Errors
When a connection request fails, jsoup typically throws one of these two exceptions:
- HttpStatusException – the server returned an HTTP error status like 404 or 429.
- IOException – network or connection-related failures.
Here’s an example of how to handle both cases:
import org.jsoup.HttpStatusException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.io.IOException;
public class ErrorHandlingExample {
public static void main(String[] args) {
String url = "https://example.com";
try {
Document document = Jsoup.connect(url)
.userAgent("MyJavaScraper/1.0")
.timeout(10_000)
.get();
System.out.println(document.title());
} catch (HttpStatusException e) {
System.out.println(
"HTTP error: " +
e.getStatusCode() +
" for URL: " +
e.getUrl()
);
} catch (IOException e) {
System.out.println(
"Connection failed: " +
e.getMessage()
);
}
}
}
Scraping Best Practices
Besides handling the most common errors, web scrapers should follow these best practices, which not only support ethical data extraction but also help avoid network performance drops, rate limits, or even IP blocks:
- Retry failed requests with exponential backoff
- Throttle requests and avoid aggressive crawling
- Set a clear and meaningful User-Agent
- Respect robots.txt rules where appropriate
- Review and comply with the target site’s terms of service
- Cache results when possible to reduce repeated requests
- Monitor for HTML structure changes that can break selectors
- Log failures and unexpected responses for debugging
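The first item, retrying with exponential backoff, can be sketched with the JDK alone. The helper below is an assumption-laden illustration rather than a production retry policy: the IoTask interface and the method names are made up, and the simulated task stands in for a real request such as () -> Jsoup.connect(url).get().

```java
import java.io.IOException;
import java.io.UncheckedIOException;

class RetrySketch {
    // A task that may fail with an IOException, such as an HTTP request.
    interface IoTask<T> {
        T run() throws IOException;
    }

    // Delay before retry number `retry` (0-based): base, 2x base, 4x base, ...
    static long backoffMillis(long baseMillis, int retry) {
        return baseMillis * (1L << retry);
    }

    // Runs the task up to maxAttempts times, waiting longer after each failure.
    static <T> T withRetries(IoTask<T> task, int maxAttempts, long baseMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        IOException lastFailure = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return task.run();
            } catch (IOException e) {
                lastFailure = e;
                if (attempt < maxAttempts - 1) {
                    sleep(backoffMillis(baseMillis, attempt));
                }
            }
        }
        // All attempts failed: surface the last error to the caller.
        throw new UncheckedIOException(lastFailure);
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for callers
            throw new IllegalStateException("Interrupted while waiting to retry", e);
        }
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated request that fails twice before succeeding.
        String result = withRetries(() -> {
            if (++calls[0] < 3) {
                throw new IOException("temporary failure");
            }
            return "succeeded after " + calls[0] + " attempts";
        }, 5, 100);
        System.out.println(result);
    }
}
```

In a real scraper you would likely retry only on transient problems, such as timeouts or 429 and 5xx responses, and give up immediately on errors like 404.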
Proxy Rotation at a Larger Scale
Today, many websites, particularly massive platforms like Google, Amazon, eBay, etc., introduce strict detection and limiting procedures to prevent server overload. To deal with this, many users choose to include proxy services from trusted providers like IPRoyal together with web scraping tools.
Specifically, proxy rotation helps users to avoid IP-related issues by routing traffic through multiple IP addresses to reduce blocks or CAPTCHA challenges. For this, residential proxies are some of the most reliable proxy types because they are sourced from real user devices, which further reduces blocks and bans.
Limitations and When Not to Use jsoup
Overall, jsoup is a great tool for parsing and scraping static HTML pages. Its biggest limitation is that it can't execute JavaScript. This quickly becomes a problem because modern frontend frameworks like React, Vue, and Angular typically render content dynamically in the browser. If you use jsoup to scrape these pages, you may get little more than an empty page shell.
If you need data that's rendered on the client side, use a browser automation tool or a headless browser to render the page first. You can then pass the rendered HTML to jsoup for parsing.
Common alternatives for JavaScript-heavy sites:
- Selenium – browser automation using real browsers like Chrome or Firefox
- Playwright – modern browser automation with strong support for dynamic sites
- HtmlUnit – a lightweight headless browser for Java applications
Pro Tip 1: Check the Network Tab First
Make sure to inspect your browser's network tab in the developer tools before you build your browser-based scraper. A lot of websites load data from internal JSON APIs instead of embedding it directly in the HTML. So, if you can pinpoint the API request, it will be much more efficient to call the API rather than parsing the rendered HTML.
Pro Tip 2: Avoid Loading Entire Documents Into Memory
If you scrape a large HTML file or stream, avoid reading the entire document into a String before parsing. jsoup can parse directly from an InputStream, which reduces memory overhead. For more advanced high-volume processing, jsoup also has StreamParser, which parses the document incrementally for improved performance and memory efficiency.
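Parsing from an InputStream could look like the sketch below. It uses an in-memory stream so it runs anywhere; in practice the stream would more likely come from a file or an HTTP response. The base URI and the sample markup are placeholders.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class InputStreamParseSketch {
    // Parses HTML straight from a stream and returns the page title.
    static String titleOf(InputStream in) {
        try {
            Document document = Jsoup.parse(in, "UTF-8", "https://example.com/");
            return document.title();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] html = "<html><head><title>Streamed Page</title></head><body></body></html>"
                .getBytes(StandardCharsets.UTF_8);
        // try-with-resources closes the stream once parsing is done
        try (InputStream in = new ByteArrayInputStream(html)) {
            System.out.println(titleOf(in));
        }
    }
}
```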
FAQ
Can I use jsoup with Java 8, or do I need a newer Java version?
Absolutely, you can use jsoup with Java 8. However, Java 11 and later are generally recommended for better performance, improved HTTP tooling, and long-term ecosystem support. So, if you're starting a new jsoup project, Java 17 is a strong choice.
Is web scraping with jsoup legal?
Web scraping isn't inherently illegal. However, its legality depends on how jsoup and other web scrapers are used to collect data and, more specifically, on what data is collected. Some of the best practices include reading a target website's terms of service, checking the robots.txt file, respecting rate limits, and complying with applicable regulations.
Can jsoup parse XML or only HTML?
jsoup is designed specifically for HTML and broken HTML parsing, but it can also parse XML documents with an XML parser mode. That being said, if you're looking to work with schema-driven XML workflows, it's better to use dedicated XML libraries like JAXB or DOM/SAX.
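Switching to XML mode is a matter of passing Parser.xmlParser() to Jsoup.parse(). The sketch below reads item titles from a small made-up RSS-like feed. In XML mode, tags such as title and link are treated as ordinary elements rather than getting their special HTML handling.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;
import java.util.List;

class XmlParseSketch {
    // Returns the text of every <title> that is a direct child of an <item>.
    static List<String> itemTitles(String xml) {
        Document document = Jsoup.parse(xml, "", Parser.xmlParser());
        return document.select("item > title").eachText();
    }

    public static void main(String[] args) {
        String xml = "<rss><channel>"
                + "<item><title>First post</title><link>https://example.com/1</link></item>"
                + "<item><title>Second post</title><link>https://example.com/2</link></item>"
                + "</channel></rss>";
        System.out.println(itemTitles(xml));
    }
}
```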
How do I speed up jsoup scraping for large numbers of pages?
If you want to scale up on your scraping without sacrificing speed, there are a few things you can do from the get-go that could reduce bottlenecks while using jsoup and other tools:
- Reuse HTTP sessions and cookies
- Run requests concurrently with thread pools
- Reduce unnecessary page downloads
- Scrape JSON APIs directly if available
- Use connection timeouts
- Rotate proxies to avoid throttling
- Cache already scraped pages
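Running requests concurrently with a thread pool might be sketched as follows. To keep the example runnable without a network connection, the fetcher is passed in as a function and replaced by a stand-in here; in a real scraper it would wrap a call such as Jsoup.connect(url).get(). The class name, pool size, and URLs are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

class ConcurrentFetchSketch {
    // Runs the fetcher for every URL on a fixed-size pool; results keep the input order.
    static List<String> fetchAll(List<String> urls, Function<String, String> fetcher, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String url : urls) {
                Callable<String> job = () -> fetcher.apply(url);
                futures.add(pool.submit(job));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> future : futures) {
                results.add(future.get()); // blocks until that task has finished
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for results", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("A fetch task failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> urls = List.of(
                "https://quotes.toscrape.com/page/1/",
                "https://quotes.toscrape.com/page/2/",
                "https://quotes.toscrape.com/page/3/");
        // Stand-in fetcher: a real one would download and parse the page.
        List<String> results = fetchAll(urls, url -> "fetched " + url, 2);
        results.forEach(System.out::println);
    }
}
```

Keep the pool small and combine it with delays, since concurrency multiplies the load you place on the target server.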