
Parsing HTML in Python With PyQuery: Step-by-Step Tutorial

Marijus Narbutas


Key Takeaways

  • PyQuery provides a simple, jQuery-like syntax for HTML parsing in Python.

  • Combining CSS selectors with DOM traversal and manipulation methods allows you to efficiently parse HTML and manipulate HTML documents.

  • PyQuery is great for web scraping, especially when you need to extract and clean data.


If you’ve been struggling to parse HTML with the PyQuery Python library, this tutorial offers tips and examples to make it painless.

We’ll walk through installing PyQuery, parsing an HTML document, selecting HTML elements with CSS selectors, and extracting data. Whether you’re building a scraper, modifying XML documents, or doing anything else parsing-related, these steps will help you parse HTML efficiently and reliably.

Installing PyQuery

To get started, you first need to install PyQuery using pip, Python’s package manager. Open your command prompt and enter the following command:

pip install pyquery

This installs PyQuery into your environment. After installation, verify that it works:

python -c "from pyquery import PyQuery; print(PyQuery('<div>Check</div>').text())"

If everything is in order, this outputs Check, meaning PyQuery is set up and ready to go. Now you can write a simple script. Create a Python file and include this:

from pyquery import PyQuery as pq

html = "<html><body><h1>Welcome</h1></body></html>"
doc = pq(html)
print(doc("h1").text())

This script shows the basic use of the PyQuery class. It lets you load HTML content and access parts of the document using CSS selectors, preparing you to parse HTML with ease.

Basics of Parsing HTML With PyQuery

There are two methods you can use to easily parse HTML with PyQuery: from a string and from an external file. Before you begin, make sure you import the PyQuery module:

from pyquery import PyQuery as pq

Now, if you want to load HTML content from a string, use this:

html = "<html><body><p>Hello, world!</p></body></html>"
doc = pq(html)

On the other hand, if you’re looking to load HTML content from a file, do this instead:

doc = pq(filename="example.html")

The line above assumes example.html is stored in your working directory.

pq() parses the structure and transforms it into a Python object you can navigate. It’s built mainly for HTML parsing, but it also works with XML documents.

Selecting Elements With CSS Selectors

PyQuery offers full, flexible support for CSS selectors. You can select elements by tag name, class, or ID, and even loop through multiple matches.

1. To select by tag name:

doc("p")

This grabs all <p> tags in the HTML. So, if your HTML looks like this:

<p>Paragraph 1</p>
<p>Paragraph 2</p>

Then doc("p") will find both. You can loop over them or access their text and attributes. It’s useful when you want to parse HTML and collect all text blocks, for example, during web scraping.

2. To select by class name:

doc(".article")

This targets elements like:

<div class="article">News 1</div>
<div class="article">News 2</div>

It will return all elements that include class="article". You can also combine it with tag names for more precision: doc("div.article").

It’s great for narrowing down parts of the page with structured web content you need to scrape.

3. To select by ID:

doc("#main")

This looks for an element like:

<div id="main">Main Section</div>

Unlike classes, IDs should be unique in HTML code, so doc("#main") usually returns a single element. This is useful when you need to start from a known spot before navigating other elements with methods like .children() or .find().

4. To loop through multiple HTML elements:

for item in doc(".article"):
    print(pq(item).text())

This helps when you need to manipulate documents and extract information, which is a common task in web scraping with Python.

DOM Traversal

Once you’re done parsing HTML with PyQuery, you may need to move through the structure of the document, which is called DOM traversal. Using the PyQuery Python library, you can easily navigate up, down, and sideways through elements.

To move from an element to its parent:

doc(".item").parent()

This goes up one level in the DOM. For example, if you have:

<div class="container">
  <div class="item">Text</div>
</div>

Calling .parent() on .item will return the .container div.

To access the children of an element:

doc(".container").children()

This will return all direct children of the container, which is useful when your HTML content has blocks you need to work with.

To get the next sibling element:

doc(".item").next()

This finds the element immediately after .item on the same level.

To get the previous sibling:

doc(".item").prev()

It moves left in the DOM and returns the element that comes before .item.

DOM traversal makes web scraping much easier and keeps your code clean, especially when combined with precise CSS selectors.

Manipulating the DOM

Once you parse HTML using PyQuery, you can do more than just read the content; you can also change it. This is called DOM manipulation, and PyQuery gives you tools to edit the HTML document with a jQuery-like syntax.

To change the text inside an element:

doc("h1").text("Updated Title")

This replaces the inner text of all <h1> elements. It’s useful when cleaning up or reformatting content after web scraping.

To modify attributes like links or image sources:

doc("a").attr("href", "https://example.com")
doc("img").attr("src", "new-image.jpg")

This updates the href or src attributes of the selected elements. If you’re reusing data from XML documents or prepping it for analysis, this is essential.

To remove elements from the HTML document:

doc(".ad-banner").remove()

This deletes every element with the class ad-banner. It’s helpful for getting rid of noise while parsing HTML or preparing scraped pages for clean output.

All of this happens through the PyQuery class, giving you control to manipulate HTML documents dynamically with minimal code.

Extracting Data from HTML

After you parse HTML, the next step is pulling out the data you care about.

To get text content from HTML:

text = doc("p").text()

To extract an attribute value (from the first matched element):

link = doc("a").attr("href")

To get all hyperlinks in an HTML document:

links = [pq(a).attr("href") for a in doc("a")]

This supports web scraping tasks by letting you parse HTML and extract just the details you need instead of wading through the entire document.

PyQuery and jQuery: How Similar Are They?

One reason developers love PyQuery is its jQuery-like syntax. It mimics jQuery’s API closely, so if you’ve used jQuery in the browser, you’ll feel at home using PyQuery in Python.

Both tools let you select elements using CSS selectors, like:

doc(".post-title")

Or in jQuery:

$(".post-title")

They share many of the same methods, such as .text(), .attr(), .parent(), and .children(), which makes switching between them feel natural. The commands are nearly identical, regardless of whether you’re editing the DOM or pulling text.

But there are important differences. jQuery runs in the browser and manipulates live pages with JavaScript. PyQuery, on the other hand, runs in Python and works on static HTML documents or XML documents already loaded into memory.

So, while their APIs look alike, PyQuery is built for Python’s ecosystem. It fits perfectly into scripts that parse HTML, process data, or clean up HTML content before further analysis.

Conclusion

PyQuery brings jQuery-like syntax to HTML parsing. The Python library lets you easily load HTML content, use CSS selectors, traverse the DOM, modify web content, and extract data. It’s ideal for Python web scraping and working with XML or HTML documents in Python.

If you’re not a fan of the PyQuery Python library, you can try web scraping with BeautifulSoup or Python’s lxml instead.


Author

Marijus Narbutas

Senior Software Engineer

With more than seven years of experience, Marijus has contributed to developing systems in various industries, including healthcare, finance, and logistics. As a backend programmer who specializes in PHP and MySQL, Marijus develops and maintains server-side applications and databases, ensuring our website works smoothly and securely, providing a seamless experience for our clients. In his free time, he enjoys gaming on his PS5 and stays active with sports like tricking, running, and weight lifting.
