jsoup Tutorial: HTML Parsing and Web Scraping in Java

Learn how to parse HTML and build web scrapers in Java using jsoup, CSS selectors, and DOM traversal in this beginner-friendly tutorial.

Justas Vitaitis

Last updated - June 26, 2026 ‐ 18 min read


Key Takeaways

jsoup is a free, open-source Java library for parsing, extracting, and manipulating HTML data.
jsoup works best with static HTML and doesn't execute JavaScript, so it can't see content that is rendered in the browser.
You can use jsoup to fetch pages, extract structured data, clean HTML, and build multi-page scrapers.
For dynamic websites, combine jsoup with tools like Selenium or Playwright, or scrape JSON APIs directly when possible.

Extracting data from HTML in Java is challenging and often requires serious know-how to execute properly. jsoup makes this extraction easier, allowing users to parse documents, query elements, and scrape web pages without much trouble or extensive experience.

In this tutorial, we'll walk you through what jsoup is and how to parse HTML, select elements with CSS selectors, extract and clean data, and connect to live web pages for scraping.

What Is jsoup?

jsoup is an open-source Java library built for parsing HTML documents and extracting data from them. It can fetch web pages, parse HTML, and extract or modify data through a simple API based on DOM methods and CSS selectors.

However, while jsoup does make the process easier, it's important to note that it only works with static HTML. jsoup can't execute JavaScript or render dynamic apps like a browser would. So, if you use jsoup on a JavaScript-driven page, it will only see the HTML the server initially returns, not the content that scripts add later.

The main jsoup features include:

Fetching and loading HTML directly over HTTP
Parsing HTML into a DOM-like document structure
Selecting elements with CSS-style selectors
Reading and updating text, HTML, and element attributes
Cleaning and sanitizing unsafe HTML content

jsoup is commonly used for scraping e-commerce listings and news headlines, extracting metadata from documents, and cleaning user-submitted HTML. Overall, it's a lightweight library with a simple API and selector system for scraping websites and processing HTML from Java code.

As a side note, jsoup itself is a Java library, so it also works from other JVM languages such as Kotlin and Scala. Other top programming languages for web scraping, like Python, Go, and C#, rely on their own HTML parsing libraries, so in this tutorial, we'll focus solely on Java.

Prerequisites and Setup

Before you start working with jsoup, you should know a few Java fundamentals. While using jsoup doesn't require extensive knowledge, you should be comfortable with:

Core Java syntax and object-oriented programming basics
Working with methods, classes, and collections
Basic file I/O and exception handling
Basic understanding of HTTP requests and responses
Familiarity with HTTP client libraries like OkHttp (helpful, though not required)

NOTE: jsoup works with different Java versions, but Java 11 or newer is recommended for the best compatibility and tooling support.

Add jsoup to Your Project

Maven and Gradle are two of the most popular tools for managing Java projects. Both can automatically download libraries for individual projects and handle testing and packaging. Instead of downloading .jar files and configuring them manually, you declare dependencies in a dedicated config file, and Maven or Gradle takes it from there.

To add jsoup using Maven, add the following code to your pom.xml:

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.18.1</version>
</dependency>

If you're using Gradle, add this to your build.gradle file:

implementation 'org.jsoup:jsoup:1.18.1'

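Note that if you're using VS Code, Maven may configure an outdated compiler version for the project. The examples in this tutorial use text blocks and records, which need Java 16 or newer, so make sure the compiler level is set high enough (for example, 21).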

Verify the Installation

After adding jsoup, check whether the installation was successful. The quickest way to do that is to fetch a web page and print its title.

Here's an example of a relatively simple Java class:

package com.example;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class App {

    public static void main(String[] args) {
        try {
            Document document = Jsoup.connect("https://example.com").get();

            System.out.println("Page title: " + document.title());

        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

If you've successfully added jsoup and configured it correctly, you should get this response:

Page title: Example Domain

Important note: All code examples below assume your file is in the com.example package; do not remove it when pasting examples. Always add package com.example; as the first line of each file, matching the folder structure Maven generated (src/main/java/com/example/). If you're using a different groupId, adjust accordingly.


Connecting to a Web Page and Parsing HTML

Now that jsoup is up and running, it's time to connect to a website and parse HTML. The process is straightforward and consists of three key steps:

Fetch HTML from a source
Parse it into a document
Select and extract the data you need

Step 1: Connecting With Jsoup.connect()

To get started, the most common entry point is Jsoup.connect(). It creates a connection to the URL, and calling get() on that connection sends the HTTP request and returns a parsed Document.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class BasicConnectionExample {

    public static void main(String[] args) throws Exception {

        Document document = Jsoup
                .connect("https://example.com")
                .get();

        System.out.println(document.title());
    }
}

The example above shows only a basic request. For most scraping projects, you'll need to configure your request with appropriate headers and timeouts. Here's a list of the most common connection settings:

userAgent() – to identify the client making the request
timeout() – to set a maximum wait time
header() – to add custom HTTP headers
cookies() – to send cookies with your request
followRedirects() – to control redirect handling

Like so:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ConfiguredConnectionExample {

    public static void main(String[] args) throws Exception {

        Document document = Jsoup
                .connect("https://example.com")
                .userAgent("Mozilla/5.0")
                .timeout(10_000)
                .header("Accept-Language", "en-US")
                .get();

        System.out.println(document.title());
    }
}

Step 2: Alternative Input Sources

jsoup isn't limited to fetching pages over HTTP. It can also parse HTML from files, streams, and raw strings. Before we get to those, note that providing a base URI lets jsoup resolve relative links into absolute URLs. Here's an example:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class BaseUriExample {

    public static void main(String[] args) {

        String html = """
                <a href="/products">Products</a>
                """;

        Document document = Jsoup.parse(
                html,
                "https://example.com"
        );

        Element link = document.selectFirst("a");

        System.out.println(link.attr("href"));
        System.out.println(link.absUrl("href"));
    }
}

And here's the desired output:

/products
https://example.com/products

With base URIs covered, let's look at some different parsing examples. Here's an example of parsing a local HTML file:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.File;

public class ParseFileExample {

    public static void main(String[] args) throws Exception {

        File input = new File("sample.html");

        Document document = Jsoup.parse(
                input,
                "UTF-8"
        );

        System.out.println(document.title());
    }
}

You’ll need to create a sample.html file in the root project directory as such:

<html>
    <head>
        <title>My Local Page</title>
    </head>
    <body>
        <h1>Hello from a file</h1>
    </body>
</html>

If you get an error, you can point to the file explicitly by replacing the File input line with an absolute path:

File input = new File("C:/path/path/path/sample.html");

Example of parsing a raw HTML string:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ParseStringExample {

    public static void main(String[] args) {

        String html = """
                <html>
                    <head>
                        <title>Test Page</title>
                    </head>
                    <body>
                        <h1>Hello jsoup</h1>
                    </body>
                </html>
                """;

        Document document = Jsoup.parse(html);

        System.out.println(document.title());
    }
}

Step 3: Understanding Document, Element, and Elements

While working with jsoup, you'll often find these three main types:

1. Document: represents the entire parsed HTML page. Usually, all parsing operations start from the document object.

Document document = Jsoup.connect("https://example.com").get();

2. Element: represents a single HTML element, making it easy to read its attributes, text, child nodes, and other HTML content.

Element heading = document.selectFirst("h1");
System.out.println(heading.text());

3. Elements: a collection of matching elements, used when selectors match multiple nodes.

import org.jsoup.select.Elements;

Elements links = document.select("a");

for (Element link : links) {
    System.out.println(link.text());
}

How jsoup Handles Broken HTML

Issues happen, and one of the most common ones is broken HTML. Conveniently, jsoup has a handy feature that automatically cleans and normalizes incorrect or broken HTML.

Let's look at a practical example. The snippet below shows broken HTML:

<html>
    <body>
        <p>First paragraph
        <p>Second paragraph
    </body>
</html>

How jsoup handles it and parses the file:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class BrokenHtmlExample {

    public static void main(String[] args) {

        String html = """
                <html>
                    <body>
                        <p>First paragraph
                        <p>Second paragraph
                    </body>
                </html>
                """;

        Document document = Jsoup.parse(html);

        System.out.println(document.body().html());
    }
}

Output:

<p>First paragraph</p>
<p>Second paragraph</p>

Selecting HTML Elements

After you've parsed a document, the next step is to find the specific elements you want to extract from it. With jsoup, you have two choices:

CSS-style selectors with select()
Built-in finder methods like getElementById()

Method 1: Using CSS Selectors With select()

This method allows you to query elements with familiar CSS selector syntax. Here are some of the most common selector patterns:

Selector	Description	Example
tag	Select by tag name	p
.class	Select by class	.quote
#id	Select by ID	#main
[attr]	Elements with an attribute	[href]
[attr=value]	Attribute equals value	[type=text]
tag.class	Combined selector	div.quote
parent child	Descendant selector	div span
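
To make the selector patterns above concrete, here is a minimal offline sketch that runs each of them against a small inline document and prints how many elements match. The markup, the class name SelectorPatternsExample, and the helper countMatches() are made up for this illustration and aren't part of jsoup.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class SelectorPatternsExample {

    // Hypothetical sample markup, made up for this illustration.
    static final String HTML = """
            <div id="main">
                <div class="quote"><span class="text">First</span></div>
                <div class="quote"><span class="text">Second</span></div>
                <p>Plain paragraph</p>
                <a href="/about">About</a>
                <a>No link target</a>
                <input type="text" name="q">
            </div>
            """;

    // Returns how many elements in the sample markup match the selector.
    static int countMatches(String cssSelector) {
        Document document = Jsoup.parse(HTML);
        return document.select(cssSelector).size();
    }

    public static void main(String[] args) {
        String[] selectors = {
                "p", ".quote", "#main", "[href]",
                "[type=text]", "div.quote", "div span"
        };

        for (String selector : selectors) {
            System.out.println(selector + " -> " + countMatches(selector));
        }
    }
}
```

Because the HTML is parsed from a string, you can experiment with selectors this way without sending any requests.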

Let's try selecting elements using a demo website, quotes.toscrape.com:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class SelectorExample {

    public static void main(String[] args) throws Exception {

        Document document = Jsoup.connect(
                "https://quotes.toscrape.com"
        ).get();

        Elements quotes =
                document.select("div.quote");

        for (Element quote : quotes) {

            String text =
                    quote.select(".text").text();

            String author =
                    quote.select(".author").text();

            System.out.println(author + ": " + text);
        }
    }
}

When you run the program, this is the output you should get:

Albert Einstein: “The world as we have created it is a process of our thinking...”
J.K. Rowling: “It is our choices, Harry, that show what we truly are...”
[...]

Method 2: Using Built-in Finder Methods

For common operations, jsoup has direct lookup options, which include:

Find an element by ID:

Element content =
        document.getElementById("content");

Find elements by class:

Elements quotes =
        document.getElementsByClass("quote");

Find elements by tag:

Elements links =
        document.getElementsByTag("a");

These methods work well when you already know the exact structure of the page and don’t need more advanced selector logic.

Here's some sample code you can try:

package com.example;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class App {

    public static void main(String[] args) throws Exception {

        Document document = Jsoup.connect(
                "https://quotes.toscrape.com"
        ).get();

        // getElementById
        Element content = document.getElementById("content");
        System.out.println("getElementById('content') found: " + (content != null));

        // getElementsByClass
        Elements quotes = document.getElementsByClass("quote");
        System.out.println("getElementsByClass('quote') count: " + quotes.size());

        // getElementsByTag
        Elements links = document.getElementsByTag("a");
        System.out.println("getElementsByTag('a') count: " + links.size());

        // selectFirst
        Element title = document.selectFirst("title");
        System.out.println("selectFirst('title') text: " + title.text());
    }
}

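Your output should look like this. The first line prints false because getElementById() returns null when no element on the page has that ID: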

getElementById('content') found: false
getElementsByClass('quote') count: 10
getElementsByTag('a') count: 55
selectFirst('title') text: Quotes to Scrape

Using selectFirst() for Single Matches

In some cases, you only need to match a single element. For that, use selectFirst() instead of select(). It returns the first matching Element, or null if nothing matches.

Here's an example using the same quotes.toscrape.com page:

Element title =
        document.selectFirst("title");

System.out.println(title.text());
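
Since selectFirst() returns null when nothing matches, calling text() directly on the result can throw a NullPointerException on pages that lack the element. One way to guard against that is sketched below. The helper name textOrDefault() and the fallback strings are assumptions made for this example, not jsoup APIs.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class SafeSelectFirstExample {

    // Returns the text of the first match, or the fallback when nothing matches.
    static String textOrDefault(Document document, String cssSelector, String fallback) {
        Element match = document.selectFirst(cssSelector);
        return match != null ? match.text() : fallback;
    }

    public static void main(String[] args) {
        Document document = Jsoup.parse(
                "<html><head><title>Quotes to Scrape</title></head><body></body></html>"
        );

        // The title exists, so its text is returned.
        System.out.println(textOrDefault(document, "title", "(no title)"));

        // There is no <h1>, so the fallback is returned instead.
        System.out.println(textOrDefault(document, "h1", "(no heading)"));
    }
}
```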

Extracting and Manipulating Data

After you select the required elements, it's time to extract the data you need and turn it into a usable structure, so that applications can recognize and work with it. To help with that, jsoup provides APIs for reading text, attributes, HTML fragments, and navigating through related elements in the DOM.

Extracting Text, HTML, and Attribute Values

jsoup has more than one way to extract different data. The most common ones are:

text() – extracts the combined, whitespace-normalized text of an element and its children
html() – returns the Element’s inner HTML
outerHtml() – returns the full element markup
attr() – reads attribute values
absUrl() – converts relative URLs into absolute URLs

Let's look at a practical example that includes the absUrl() method, since it's especially useful for scraping websites that rely heavily on relative paths for navigation links and assets.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ExtractionExample {

    public static void main(String[] args) {

        String html = """
                <article>
                    <h2>Learning jsoup</h2>
                    <a href="/tutorials/jsoup">
                        Read Tutorial
                    </a>
                </article>
                """;

        Document document = Jsoup.parse(
                html,
                "https://example.com"
        );

        Element article = document.selectFirst("article");
        Element link = article.selectFirst("a");

        System.out.println("Text:");
        System.out.println(article.text());

        System.out.println("\nInner HTML:");
        System.out.println(article.html());

        System.out.println("\nHref:");
        System.out.println(link.attr("href"));

        System.out.println("\nAbsolute URL:");
        System.out.println(link.absUrl("href"));
    }
}

Output:

Text:
Learning jsoup Read Tutorial

Inner HTML:
<h2>Learning jsoup</h2>
<a href="/tutorials/jsoup">
 Read Tutorial
</a>

Href:
/tutorials/jsoup

Absolute URL:
https://example.com/tutorials/jsoup

Mapping Extracted Data Into Java Objects

After extracting the required information, the data is typically mapped into Java objects for further processing. Java records are a clean way to represent the data you scraped.

You can create these records directly from extracted Elements:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class MappingExample {

    // Immutable data holder for one scraped article
    public record Article(String title, String url) {
    }

    public static void main(String[] args) {

        String html = """
                <article>
                    <h2>
                        <a href="/article-1">
                            Intro to jsoup
                        </a>
                    </h2>
                </article>
                """;

        Document document = Jsoup.parse(
                html,
                "https://example.com"
        );

        Element link = document.selectFirst("a");

        Article article = new Article(
                link.text(),
                link.absUrl("href")
        );

        System.out.println(article);
    }
}

Output:

Article[title=Intro to jsoup, url=https://example.com/article-1]
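
Real pages usually contain many items rather than one, so the same mapping tends to run in a loop. The sketch below collects every matching link into a list of records. The markup, the selector "article h2 a[href]", and the names MappingListExample and extractArticles() are assumptions chosen for this illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

class MappingListExample {

    // Immutable data holder for one scraped article.
    record Article(String title, String url) {
    }

    // Hypothetical sample markup, made up for this illustration.
    static final String HTML = """
            <article><h2><a href="/article-1">Intro to jsoup</a></h2></article>
            <article><h2><a href="/article-2">CSS selectors</a></h2></article>
            """;

    // Maps every article link in the HTML to an Article record.
    static List<Article> extractArticles(String html, String baseUri) {
        Document document = Jsoup.parse(html, baseUri);
        List<Article> articles = new ArrayList<>();

        for (Element link : document.select("article h2 a[href]")) {
            articles.add(new Article(link.text(), link.absUrl("href")));
        }
        return articles;
    }

    public static void main(String[] args) {
        for (Article article : extractArticles(HTML, "https://example.com")) {
            System.out.println(article);
        }
    }
}
```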

Traversing the DOM

In most cases, CSS selectors can handle the extraction on their own, but there are situations where you need to walk the DOM manually to reach related elements.

Here's how you can do this with jsoup using its traversal methods for moving between parents, children, and siblings:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class TraversalExample {

    public static void main(String[] args) {

        String html = """
                <div class="product">
                    <h2>Laptop</h2>
                    <p class="price">$1299</p>
                    <p class="stock">In Stock</p>
                </div>
                """;

        Document document = Jsoup.parse(html);

        Element price = document.selectFirst(".price");

        System.out.println("Parent:");
        System.out.println(price.parent().tagName());

        System.out.println("\nPrevious Sibling:");
        System.out.println(price.previousElementSibling().text());

        System.out.println("\nNext Sibling:");
        System.out.println(price.nextElementSibling().text());

        System.out.println("\nChildren:");

        for (Element child : price.parent().children()) {
            System.out.println(child.tagName() + " -> " + child.text());
        }
    }
}

And here's the kind of response you should get:

Parent:
div

Previous Sibling:
Laptop

Next Sibling:
In Stock

Children:
h2 -> Laptop
p -> $1299
p -> In Stock

Modifying and Cleaning HTML

jsoup can also modify documents programmatically and clean unsafe HTML. For cleaning, it offers Jsoup.clean(), a built-in sanitizer that removes any tags and attributes not allowed by the safelist you pass in.

import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

public class CleanHtmlExample {

    public static void main(String[] args) {

        String unsafeHtml = """
                <p>Hello</p>
                <script>alert('xss')</script>
                <a href="https://example.com">Visit</a>
                """;

        String safeHtml = Jsoup.clean(
                unsafeHtml,
                Safelist.basic()
        );

        System.out.println(safeHtml);
    }
}

When you run the program, this is what you should see:

<p>Hello</p>
<a href="https://example.com" rel="nofollow">Visit</a>

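In this case, the <script> tag is automatically removed because it isn't allowed by the designated safelist. Safelist.basic() also enforces rel="nofollow" on links, which is why that attribute appears in the output.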
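
Cleaning is only one kind of change. For editing a parsed document directly, a minimal sketch could look like the one below. It removes script elements, sets an attribute on links, and rewrites a heading. The markup and the names ModifyHtmlExample and tidy() are made up for this illustration. For untrusted input, Jsoup.clean() with a safelist remains the safer choice, since this sketch only touches the elements it explicitly selects.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class ModifyHtmlExample {

    // Hypothetical sample markup, made up for this illustration.
    static final String HTML = """
            <h1>Old heading</h1>
            <a href="https://example.com">Visit</a>
            <script>alert('x')</script>
            """;

    static Document tidy(String html) {
        Document document = Jsoup.parse(html);

        // Drop every <script> element from the tree.
        document.select("script").remove();

        // Set an attribute on each link that has an href.
        for (Element link : document.select("a[href]")) {
            link.attr("rel", "nofollow");
        }

        // Replace the heading text and tag it with a CSS class.
        Element heading = document.selectFirst("h1");
        if (heading != null) {
            heading.text("Updated heading").addClass("edited");
        }

        return document;
    }

    public static void main(String[] args) {
        System.out.println(tidy(HTML).body().html());
    }
}
```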

Building a Simple Web Scraper

Now that we've covered how jsoup works and what output to expect from parsing and extraction, this section shows you how to build a simple web scraper with jsoup.

Our particular example will be able to:

Crawl multiple pages using pagination
Extract structured quote data
Store results in Java objects
Pause between requests to avoid hammering the server

NOTE: As an example, this scraper targets quotes.toscrape.com, the same demo website used earlier in this tutorial.

Complete Scraper Example

package com.example;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.util.ArrayList;
import java.util.List;

public class App {

    public record Quote(
            String text,
            String author
    ) {
    }

    public static void main(String[] args) {

        List<Quote> quotes = new ArrayList<>();

        String baseUrl = "https://quotes.toscrape.com/page/";

        int maxPages = 3;

        try {

            for (int page = 1; page <= maxPages; page++) {

                String url = baseUrl + page + "/";

                System.out.println("Scraping: " + url);

                Document document = Jsoup.connect(url)
                        .userAgent("Mozilla/5.0")
                        .timeout(10_000)
                        .get();

                Elements quoteElements =
                        document.select("div.quote");

                for (Element quoteElement : quoteElements) {

                    Element textEl =
                            quoteElement.selectFirst(".text");

                    Element authorEl =
                            quoteElement.selectFirst(".author");

                    if (textEl == null || authorEl == null) {
                        continue;
                    }

                    Quote quote = new Quote(
                            textEl.text(),
                            authorEl.text()
                    );

                    quotes.add(quote);
                }

                Thread.sleep(2000);
            }

            System.out.println("\nScraped " + quotes.size() + " quotes:\n");

            for (Quote quote : quotes) {
                System.out.println(quote);
            }

        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

Example output:

Scraping: https://quotes.toscrape.com/page/1/
Scraping: https://quotes.toscrape.com/page/2/
Scraping: https://quotes.toscrape.com/page/3/

Scraped 30 quotes:

Quote[text=“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”, author=Albert Einstein]
Quote[text=“It is our choices, Harry, that show what we truly are, far more than our abilities.”, author=J.K. Rowling]
Quote[text=“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”, author=Albert Einstein]
Quote[text=“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”, author=Jane Austen]
Quote[text=“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”, author=Marilyn Monroe]
[...]

How the Pagination Loop Works

The scraper uses a simple for loop to move through the paginated pages. You can see it in this line:

for (int page = 1; page <= maxPages; page++) {

Every iteration builds a new URL:

String url = baseUrl + page + "/";

This produces path-based URLs such as https://quotes.toscrape.com/page/2/. The same loop pattern works just as well for websites that use query-string pagination like this:

?page=1
?page=2
?page=3

Once a page is fetched, the scraper selects all matching quote elements:

Elements quoteElements =
        document.select("div.quote");

Using Thread.sleep() for Politeness

At this point, we'd like to emphasize a crucial scraping practice. Whether a given scraping project is ethical has to be judged case by case, but in every case it's important to slow down your requests so that you don't overload the target server.

The Thread.sleep() method lets you do just that. Let's say you want to introduce a 2-second interval between page requests:

Thread.sleep(2000);

Additionally, requests sent at perfectly regular intervals are easy for servers to flag as automated and block, so it's a good idea to randomize the delay.

Example:

Thread.sleep(1500 + (int)(Math.random() * 1000));
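
If you need a randomized pause in several places, it can help to wrap it in a small helper. The sketch below uses ThreadLocalRandom from the standard library. The names PoliteDelayExample, randomDelayMillis(), and pause() are made up for this example, and the 1500 to 2500 millisecond range simply mirrors the line above.

```java
import java.util.concurrent.ThreadLocalRandom;

class PoliteDelayExample {

    // Returns a delay between minMillis (inclusive) and maxMillis (exclusive).
    static long randomDelayMillis(long minMillis, long maxMillis) {
        return ThreadLocalRandom.current().nextLong(minMillis, maxMillis);
    }

    // Sleeps for a random duration within the given range.
    static void pause(long minMillis, long maxMillis) throws InterruptedException {
        Thread.sleep(randomDelayMillis(minMillis, maxMillis));
    }

    public static void main(String[] args) throws InterruptedException {
        for (int page = 1; page <= 3; page++) {
            System.out.println("Pretending to fetch page " + page);

            // Wait somewhere between 1.5 and 2.5 seconds before the next request.
            pause(1500, 2500);
        }
    }
}
```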

NOTE: Always check a website's robots.txt file and terms of service before scraping. This is general advice rather than a guarantee of compliance, so we strongly recommend reviewing the policies of every site you target and adhering to ethical data collection practices.

Our web scraper is deliberately simple, but you can use it as a basis for a more advanced scraper that can:

Save extracted data to CSV files
Export results as JSON
Write data into a database
Follow article links to scrape detailed pages
Run requests in parallel for higher throughput
Add retry logic and error handling
Use rotating proxies and user agents for large-scale scraping
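
As an illustration of the first extension, saving results to CSV, here is a minimal sketch that uses only the Java standard library. The Quote record mirrors the one in the scraper, while the file name quotes.csv and the helpers escapeCsv(), toCsvLines(), and writeCsv() are assumptions made for this example. For production use, a dedicated CSV library is usually the more robust option.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

class CsvExportExample {

    record Quote(String text, String author) {
    }

    // Wraps a field in double quotes and doubles any embedded double quotes.
    static String escapeCsv(String field) {
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    // Builds a header line followed by one CSV line per quote.
    static List<String> toCsvLines(List<Quote> quotes) {
        List<String> lines = new ArrayList<>();
        lines.add("text,author");

        for (Quote quote : quotes) {
            lines.add(escapeCsv(quote.text()) + "," + escapeCsv(quote.author()));
        }
        return lines;
    }

    // Writes the CSV lines to the target file as UTF-8.
    static void writeCsv(List<Quote> quotes, Path target) throws IOException {
        Files.write(target, toCsvLines(quotes), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        List<Quote> quotes = List.of(
                new Quote("He said \"hi\", then left.", "Example Author")
        );

        writeCsv(quotes, Path.of("quotes.csv"));
        System.out.println("Wrote " + quotes.size() + " quote(s) to quotes.csv");
    }
}
```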

Error Handling, Politeness, and Best Practices

In real-life use, this web scraper might not perform as intended. Errors and other issues can come from differences in the target website's structure, network problems, rate limiting, and more.

That's why web scraping specialists customize their scrapers to their needs. Still, some errors are common and well documented, and they're straightforward to handle.

Handling HTTP and Network Errors

When a request fails, jsoup typically throws one of these two exceptions:

HttpStatusException – the server returned an HTTP error status like 404 or 429.
IOException – network or connection-related failures.

Here's an example of how to handle both cases. Because HttpStatusException is a subclass of IOException, its catch block has to come first:

import org.jsoup.HttpStatusException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.IOException;

public class ErrorHandlingExample {

    public static void main(String[] args) {

        String url = "https://example.com";

        try {

            Document document = Jsoup.connect(url)
                    .userAgent("MyJavaScraper/1.0")
                    .timeout(10_000)
                    .get();

            System.out.println(document.title());

        } catch (HttpStatusException e) {

            System.out.println(
                    "HTTP error: " +
                    e.getStatusCode() +
                    " for URL: " +
                    e.getUrl()
            );

        } catch (IOException e) {

            System.out.println(
                    "Connection failed: " +
                    e.getMessage()
            );
        }
    }
}

Scraping Best Practices

Beyond handling the most common errors, follow these web scraping best practices. They not only support ethical data extraction but also help you avoid network performance drops, rate limits, and even IP blocks.

Retry failed requests with exponential backoff
Throttle requests and avoid aggressive crawling
Set a clear and meaningful User-Agent
Respect robots.txt rules where appropriate
Review and comply with the target site’s terms of service
Cache results when possible to reduce repeated requests
Monitor for HTML structure changes that can break selectors
Log failures and unexpected responses for debugging
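
The first item on the list, retrying with exponential backoff, can be sketched with the standard library alone. The helper below retries a task when it throws an IOException, which also covers jsoup's HttpStatusException, and doubles the wait after each failed attempt. The names RetryWithBackoffExample, retry(), and backoffDelayMillis() and the chosen delays are assumptions for this illustration. The task in main() is a simulated stand-in for a real request.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

class RetryWithBackoffExample {

    // Delay before the next try after failed attempt number `attempt` (0-based):
    // base, 2 x base, 4 x base, and so on.
    static long backoffDelayMillis(long baseDelayMillis, int attempt) {
        return baseDelayMillis * (1L << attempt);
    }

    // Runs the task up to maxAttempts times, sleeping between failed attempts.
    static <T> T retry(Callable<T> task, int maxAttempts, long baseDelayMillis) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }

        IOException lastFailure = null;

        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return task.call();
            } catch (IOException e) {
                lastFailure = e;

                // Only wait if another attempt is still coming.
                if (attempt < maxAttempts - 1) {
                    Thread.sleep(backoffDelayMillis(baseDelayMillis, attempt));
                }
            }
        }

        throw lastFailure;
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};

        // Simulated request that fails twice before succeeding.
        // With jsoup, the task could be: () -> Jsoup.connect(url).get()
        String result = retry(() -> {
            calls[0]++;
            if (calls[0] < 3) {
                throw new IOException("Simulated network failure");
            }
            return "Fetched after " + calls[0] + " attempts";
        }, 5, 200);

        System.out.println(result);
    }
}
```

Only IOException is retried here, so programming errors such as a NullPointerException still fail immediately instead of being repeated.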

Proxy Rotation at a Larger Scale

Today, many websites, particularly massive platforms like Google, Amazon, eBay, etc., introduce strict detection and limiting procedures to prevent server overload. To deal with this, many users choose to include proxy services from trusted providers like IPRoyal together with web scraping tools.

Specifically, proxy rotation helps users to avoid IP-related issues by routing traffic through multiple IP addresses to reduce blocks or CAPTCHA challenges. For this, residential proxies are some of the most reliable proxy types because they are sourced from real user devices, which further reduces blocks and bans.

Limitations and When Not to Use jsoup

Overall, jsoup is a great tool for parsing and scraping static HTML pages. But the tool has its limitations, the biggest of which is that jsoup can't execute JavaScript. This can quickly become challenging as modern frontend frameworks like React, Vue, and Angular typically render content dynamically. If you use jsoup to scrape these pages, you might get only a near-empty page shell without the data you're after.

Also, if you need data that's rendered on the client side, your best option is a browser automation tool or a headless browser. Once the page has been rendered, you can still pass the resulting HTML to jsoup for parsing.

Common alternatives for JavaScript-heavy sites:

Selenium – browser automation using real browsers like Chrome or Firefox
Playwright – modern browser automation with strong support for dynamic sites
HtmlUnit – a lightweight headless browser for Java applications

Pro Tip 1: Check the Network Tab First

Make sure to inspect your browser's network tab in the developer tools before you build your browser-based scraper. A lot of websites load data from internal JSON APIs instead of embedding it directly in the HTML. So, if you can pinpoint the API request, it will be much more efficient to call the API rather than parsing the rendered HTML.

Pro Tip 2: Avoid Loading Entire Documents Into Memory

If you scrape a large HTML file or stream, avoid loading the entire document into memory as a string first. jsoup supports parsing directly from an InputStream to reduce memory overhead. Or, if you need to run more advanced high-volume processing, jsoup also has StreamParser for improved performance and memory efficiency.
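
Here is a minimal sketch of the InputStream route. It uses Jsoup.parse(InputStream, String, String), and a ByteArrayInputStream stands in for a real file or network stream so the example stays self-contained. The class and method names are made up for this illustration. Note that jsoup still builds the full DOM in this mode; what you skip is holding the raw HTML as one large string beforehand.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

class ParseStreamExample {

    // Parses HTML straight from a stream and returns the page title.
    static String titleFromStream(InputStream in) throws IOException {
        Document document = Jsoup.parse(in, "UTF-8", "https://example.com");
        return document.title();
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><head><title>Streamed page</title></head><body></body></html>";

        // Stand-in for a real source such as a FileInputStream.
        try (InputStream in = new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8))) {
            System.out.println(titleFromStream(in));
        }
    }
}
```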

FAQ

Can I use jsoup with Java 8, or do I need a newer Java version?

Absolutely, you can use jsoup with Java 8. However, Java 11 and later are generally recommended for better performance, improved HTTP tooling, and long-term ecosystem support. So, if you're starting a new jsoup project, Java 17 is a strong choice. Keep in mind that the examples in this tutorial use text blocks and records, which require Java 16 or newer.

Is web scraping with jsoup legal?

Web scraping isn't illegal in itself. However, its legality depends on how jsoup and other web scrapers are used to collect data and, more specifically, on what data is collected. Best practices include reading a target website's terms of service, checking its robots.txt, respecting rate limits, and complying with applicable data protection laws.

Can jsoup parse XML or only HTML?

jsoup is designed specifically for HTML and broken HTML parsing, but it can also parse XML documents with an XML parser mode. That being said, if you're looking to work with schema-driven XML workflows, it's better to use dedicated XML libraries like JAXB or DOM/SAX.
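
As a quick illustration of the XML parser mode, the sketch below passes Parser.xmlParser() to Jsoup.parse() and then queries the result with the usual selectors. The feed structure and the names ParseXmlExample and itemTitles() are made up for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;

import java.util.List;

class ParseXmlExample {

    // Hypothetical sample XML, made up for this illustration.
    static final String XML = """
            <feed>
                <item><title>First post</title></item>
                <item><title>Second post</title></item>
            </feed>
            """;

    // Parses the input as XML and returns the text of every item title.
    static List<String> itemTitles(String xml) {
        Document document = Jsoup.parse(xml, "", Parser.xmlParser());
        return document.select("item > title").eachText();
    }

    public static void main(String[] args) {
        for (String title : itemTitles(XML)) {
            System.out.println(title);
        }
    }
}
```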

How do I speed up jsoup scraping for large numbers of pages?

If you want to scale up on your scraping without sacrificing speed, there are a few things you can do from the get-go that could reduce bottlenecks while using jsoup and other tools:

Reuse HTTP sessions and cookies
Run requests concurrently with thread pools
Reduce unnecessary page downloads
Scrape JSON APIs directly if available
Use connection timeouts
Rotate proxies to avoid throttling
Cache already scraped pages
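
To show what running requests concurrently with a thread pool can look like, here is a minimal sketch built on ExecutorService. The fetcher is passed in as a function so the example runs without network access. In a real scraper it would wrap a jsoup call and its error handling. The names ConcurrentFetchExample and fetchAll() and the pool size are assumptions for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

class ConcurrentFetchExample {

    // Fetches all URLs on a fixed-size pool and returns results in input order.
    static List<String> fetchAll(List<String> urls,
                                 Function<String, String> fetcher,
                                 int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);

        try {
            List<Future<String>> futures = new ArrayList<>();

            for (String url : urls) {
                Callable<String> task = () -> fetcher.apply(url);
                futures.add(pool.submit(task));
            }

            List<String> results = new ArrayList<>();

            // get() blocks until each task has finished.
            for (Future<String> future : futures) {
                results.add(future.get());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<String> urls = List.of(
                "https://quotes.toscrape.com/page/1/",
                "https://quotes.toscrape.com/page/2/"
        );

        // Placeholder fetcher that does not touch the network.
        Function<String, String> fakeFetcher = url -> "Fetched " + url;

        for (String result : fetchAll(urls, fakeFetcher, 2)) {
            System.out.println(result);
        }
    }
}
```

Keep the pool small and combine it with delays, since concurrency multiplies the load you put on the target server.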


Author

Justas Vitaitis

Senior Software Engineer

Justas is a Senior Software Engineer with over a decade of proven expertise. He currently holds a crucial role in IPRoyal’s development team, regularly demonstrating his profound expertise in the Go programming language, contributing significantly to the company’s technological evolution. Justas is pivotal in maintaining our proxy network, serving as the authority on all aspects of proxies. Beyond coding, Justas is a passionate travel enthusiast and automotive aficionado, seamlessly blending his tech finesse with a passion for exploration.


Article by IPRoyal

