Web Scraping with Python lxml: A Step-by-Step Tutorial
Web scraping relies on parsing techniques to make sense of the enormous volume of data acquired from HTML files. Since HTML is not intended for data analysis, a lot of information (such as tags) is often superfluous.
Python has a multitude of libraries that make parsing easier, such as Beautiful Soup 4 and lxml. The Python lxml library is often in direct competition with other parsers, so it isn’t always clear-cut which one is better for the task at hand.
What Is Python lxml?
The Python lxml library is intended to provide an easier way to parse XML and HTML files through features such as XPath expression searches, among many others. One of the major benefits of the Python lxml library is that it’s built on top of a C implementation.
Being C-based provides several advantages, chief among them speed. Python is flexible and easy to use, but not as fast as some lower-level languages. The lxml library harnesses the benefits of both languages, making it easy to use but also quick.
Additionally, it’s quite versatile, as it supports XML and HTML files seamlessly, even if the latter are often not as well-structured as the former. So, the lxml library is a good all-around solution for anyone who intends to scrape both XML and HTML files.
Installing Python lxml
As with any Python coding project, you’ll need an IDE and the Python environment. Assuming both are installed, you’ll need to start a new project and install lxml. Once you’re in the project environment, open up the Terminal and type in:
pip install lxml
After the process is finished, you can start working on your lxml project. You’ll first need to import the lxml library (in practice, usually one of its modules, such as etree or html) before any of its features can be used:
import lxml
Parsing XML and HTML with Python lxml
We can now begin processing XML and HTML documents. Before engaging in any web scraping, it’s often valuable to test the functionality of the lxml XML toolkit.
Reading XML Documents
Parsing an XML document will often require the use of the etree module. Since we’ll only be using it for the example, we’ll switch the import up a little bit:
from lxml import etree

# Sample XML data
xml_data = """
<root>
    <item>
        <name>Item 1</name>
        <price>10</price>
    </item>
    <item>
        <name>Item 2</name>
        <price>20</price>
    </item>
</root>
"""

# Parse the XML data
root = etree.fromstring(xml_data)

# Access elements using XPath
for item in root.xpath("//item"):
    name = item.xpath("./name/text()")[0]
    price = item.xpath("./price/text()")[0]
    print(f"Name: {name}, Price: {price}")
We create a sample XML document from which we’ll try to extract the name and price of an item. You may run into similar XML documents when web scraping e-commerce websites, for example.
After that, processing XML documents is quite simple (as long as the structure is clear). We create a for loop that finds all content that’s nested under the <item> tag. Since that also includes a lot of other tags, we’ll need to go through each <item> tag to extract only the valuable information.
So, we select the name and extract its text into a separate “name” variable. We do the same for the price of the item. Finally, we use an f-string to print out the values.
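If your XML lives in a file rather than a string, etree can read it directly with parse(). Here’s a minimal sketch, assuming a hypothetical “items.xml” file with the same structure as the sample above:

from lxml import etree

# "items.xml" is a hypothetical file containing the same <root>/<item> structure
tree = etree.parse("items.xml")
root = tree.getroot()

for item in root.xpath("//item"):
    name = item.xpath("./name/text()")[0]
    price = item.xpath("./price/text()")[0]
    print(f"Name: {name}, Price: {price}")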
Parsing HTML Documents with lxml
HTML documents are usually a little messier and less structured than XML files. Fortunately, lxml is just as capable of handling an HTML file; it just requires some additional tinkering.
from lxml import html

# Sample HTML data
html_data = """
<html>
    <body>
        <div class="product">
            <h2>Product 1</h2>
            <span class="price">$15</span>
        </div>
        <div class="product">
            <h2>Product 2</h2>
            <span class="price">$25</span>
        </div>
    </body>
</html>
"""

# Parse the HTML data
tree = html.fromstring(html_data)

# Extract product details
for product in tree.xpath("//div[@class='product']"):
    name = product.xpath(".//h2/text()")[0]
    price = product.xpath(".//span[@class='price']/text()")[0]
    print(f"Name: {name}, Price: {price}")
Notice that instead of “from lxml import etree”, we now import the “html” module. Importing etree would still work, but it’s not as good at parsing HTML as the dedicated module.
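If you do want to stay with etree, you can hand it an HTML parser explicitly and then query the result with the same XPath expressions. A minimal sketch (the one-line html_data below is a condensed stand-in for the longer sample above):

from lxml import etree

# A condensed stand-in for the sample HTML document above
html_data = "<html><body><div class='product'><h2>Product 1</h2></div></body></html>"

# etree can parse HTML as well, as long as it's given an HTMLParser explicitly
parser = etree.HTMLParser()
tree = etree.fromstring(html_data, parser)

for product in tree.xpath("//div[@class='product']"):
    print(product.xpath(".//h2/text()")[0])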
Again, we create a sample HTML file, which will be used for our parsing process.
Similarly, we load the HTML data into a variable and run a for loop. While we still use XPath, the expression looks quite different, although the underlying logic is the same.
We still use a double slash to find all the “div” tags (in comparison, in the XML document we used “//item” to find all the items). However, “div” is highly generic, so it’s followed by an attribute (“class”) and its value (“product”). All of these are included in our expression when parsing HTML.
The remainder of the extraction process follows the same logic with the altered strings for parsing HTML files.
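One caveat worth knowing: real-world pages often put several classes on a single element (for example, class="product featured"), and an exact @class='product' comparison won’t match those. A rough workaround is XPath’s contains() function, with the caveat that it matches substrings as well:

# tree is the parsed HTML from the example above; contains() also matches
# <div class="product featured">, but beware that it would match an unrelated
# class such as "products" too
for product in tree.xpath("//div[contains(@class, 'product')]"):
    print(product.xpath(".//h2/text()")[0])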
In general, XML files will often be a little easier to parse and read through than HTML documents.
How to Use Python lxml for Web Scraping
Web scraping, unfortunately, won’t give you as much control over the input as when you already have the files you need. You may encounter both HTML and XML documents when web scraping, so you should be prepared to handle both.
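One simple way to handle both is to look at the Content-Type header of the response and pick the parser accordingly. A minimal sketch, assuming a hypothetical URL and the requests library (installed in the next section):

import requests
from lxml import etree, html

# Hypothetical URL; replace it with the page or feed you actually want to scrape
url = "https://example.com/data"
response = requests.get(url)

# Pick a parser based on the Content-Type header the server returns
content_type = response.headers.get("Content-Type", "")
if "xml" in content_type and "html" not in content_type:
    root = etree.fromstring(response.content)
else:
    root = html.fromstring(response.content)

print(root.tag)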
Real-World Web Scraping Example
Before you can perform any web scraping action, additional libraries will be required. Requests is a common HTTP library that’s fast and efficient, though less capable of evading anti-bot measures than some other options:
pip install lxml requests
Running the above in the Terminal will ensure you have both requests and lxml installed.
We’ll try to extract the most popular proxy locations and the number of IP addresses from the IPRoyal proxy location page.
import requests
from lxml import html

# Target URL
url = "https://iproyal.com/proxies-by-location/"

# Send HTTP GET request to fetch the HTML content
response = requests.get(url)

# Parse the HTML content
tree = html.fromstring(response.content)

# XPath to select all location blocks
location_blocks = tree.xpath("//li[@class='astro-npu2bgpv']")

# Extract and print location names and proxy counts
for block in location_blocks:
    # Extract the location name (full name is inside a <span> with a specific class)
    location_name = block.xpath(".//span[@class='hidden sm:block tp-body truncate astro-npu2bgpv']/text()")
    location_name = location_name[0].strip() if location_name else "Unknown"

    # Extract the number of proxies (inside a <span> with a specific class)
    proxy_count = block.xpath(".//span[@class='astro-npu2bgpv']/text()")
    proxy_count = proxy_count[0].strip() if proxy_count else "Unknown"

    print(f"Location: {location_name}, Proxies: {proxy_count}")
Since the page is not an XML document, we’ll be using the HTML parser from lxml.
Our start is quite simple: we set a URL, send a GET request using requests, and parse the response into a “tree” variable. Note that the XPath expression is quite specific: we capture list (“li”) tags with a particular class.
We then use a “for” loop to run through each block. Since XPath always returns a list (usually with just a single item), we pick up the location name with the list index “[0]”.
To make the output cleaner, we strip any leading or trailing whitespace and run an “if” check in case the query fails to find anything.
Proxy count follows the same logic. Finally, the output is printed within the loop to give us the locations and IP address counts.
Alternatively, you could use Python’s built-in csv module or a library like pandas to export your results into a CSV file (or openpyxl for Excel spreadsheets). In most real-world web scraping scenarios, outputting to the console won’t be enough, so you’ll likely be using these or similar libraries, as sketched below.
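For instance, here is a rough sketch that writes the scraped locations to a CSV file with the built-in csv module. It reuses the location_blocks list and XPath expressions from the script above, replacing the print loop; “proxies.csv” is just a chosen output filename:

import csv

# Collect rows inside the existing loop instead of printing them,
# then write everything out once the loop finishes
rows = []
for block in location_blocks:
    location_name = block.xpath(".//span[@class='hidden sm:block tp-body truncate astro-npu2bgpv']/text()")
    proxy_count = block.xpath(".//span[@class='astro-npu2bgpv']/text()")
    rows.append([
        location_name[0].strip() if location_name else "Unknown",
        proxy_count[0].strip() if proxy_count else "Unknown",
    ])

# Write the collected rows to a CSV file
with open("proxies.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["Location", "Proxies"])
    writer.writerows(rows)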
Note that the web scraping process technically ends upon reaching the website and storing the HTML. Web scraping in itself isn’t difficult as long as it’s not done at scale. If you try to collect data from thousands of pages, that’s where difficulties arise and you will need proxies and other techniques to retain access.
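If you do reach that point, requests can route traffic through a proxy via its proxies argument. A minimal sketch with placeholder credentials (substitute your own provider’s endpoint):

import requests

# Placeholder proxy endpoint and credentials; not a real proxy
proxies = {
    "http": "http://username:password@proxy.example.com:8080",
    "https": "http://username:password@proxy.example.com:8080",
}

# Route the request through the proxy; the timeout avoids hanging indefinitely
response = requests.get("https://iproyal.com/proxies-by-location/", proxies=proxies, timeout=10)
print(response.status_code)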
Parsing, however, will often be the most difficult part of the entire data acquisition chain (from web scraping to parsing). HTML files are just not that great for data analysis, but quite easy to acquire through web scraping.
Frequently Asked Questions
How does lxml compare to BeautifulSoup and ElementTree?
BeautifulSoup is the most beginner-friendly and is generally more forgiving of messy HTML than lxml, but it’s slower and has no native XPath support. ElementTree lacks HTML parsing and offers only limited XPath support, but it’s more lightweight, intuitive, and part of the standard library. In general, lxml is the most powerful and the fastest, but also the most difficult to learn.
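For a rough feel of the difference, here is the earlier HTML example rewritten with BeautifulSoup (assuming the beautifulsoup4 package is installed); it relies on CSS selectors instead of XPath:

from bs4 import BeautifulSoup

# A condensed version of the sample markup from the lxml HTML example above
html_data = """
<div class="product"><h2>Product 1</h2><span class="price">$15</span></div>
<div class="product"><h2>Product 2</h2><span class="price">$25</span></div>
"""

soup = BeautifulSoup(html_data, "lxml")

# BeautifulSoup has no native XPath, so CSS selectors are used instead
for product in soup.select("div.product"):
    name = product.find("h2").get_text()
    price = product.select_one("span.price").get_text()
    print(f"Name: {name}, Price: {price}")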
What are the advantages of using XML over plain text or JSON?
XML has a rigid and clear structure that makes data analysis a lot easier, with numerous additional features such as metadata (attributes), namespaces, and easy transformation. Plain text lacks all of the above but is the most freeform of the three. JSON is a middle ground with a less rigid but still clear structure; however, it lacks metadata, namespaces, and other such features.
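For illustration, the same made-up record could look like this in each format; note how the XML attributes (metadata) have no direct JSON equivalent and end up as ordinary keys:

XML (attributes carry metadata):
<item id="1">
    <name>Item 1</name>
    <price currency="USD">10</price>
</item>

JSON (the same data, with the attributes flattened into keys):
{
    "item": {
        "id": "1",
        "name": "Item 1",
        "price": 10,
        "currency": "USD"
    }
}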
How to convert XML to HTML or JSON in Python?
The lxml library supports both conversions. You can convert XML to HTML using XSLT and the transform functions, while for JSON, you’ll first have to convert the XML document into a Python dictionary and then use the json library to finalize the conversion.
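A rough sketch of both directions follows. The XSLT stylesheet and the element-to-dict helper below are simplified assumptions for this sample structure, not the only way to do it:

import json
from lxml import etree

# Same structure as the sample document from the start of the tutorial
xml_data = "<root><item><name>Item 1</name><price>10</price></item><item><name>Item 2</name><price>20</price></item></root>"
root = etree.fromstring(xml_data)

# XML -> HTML: a minimal XSLT stylesheet that turns each <item> into a list entry
xslt_data = """
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:template match="/root">
        <ul>
            <xsl:for-each select="item">
                <li><xsl:value-of select="name"/>: <xsl:value-of select="price"/></li>
            </xsl:for-each>
        </ul>
    </xsl:template>
</xsl:stylesheet>
"""
transform = etree.XSLT(etree.fromstring(xslt_data))
print(str(transform(root)))

# XML -> JSON: a simplified element-to-dict helper (ignores attributes and
# repeated sibling tags inside the same element)
def element_to_dict(element):
    children = list(element)
    if not children:
        return element.text
    return {child.tag: element_to_dict(child) for child in children}

print(json.dumps({root.tag: [element_to_dict(item) for item in root]}, indent=4))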
Author
Justas Vitaitis
Senior Software Engineer
Justas is a Senior Software Engineer with over a decade of proven expertise. He currently holds a crucial role in IPRoyal’s development team, regularly demonstrating his profound expertise in the Go programming language, contributing significantly to the company’s technological evolution. Justas is pivotal in maintaining our proxy network, serving as the authority on all aspects of proxies. Beyond coding, Justas is a passionate travel enthusiast and automotive aficionado, seamlessly blending his tech finesse with a passion for exploration.