Python XML Parsing: Full Guide
Justas Vitaitis
Last updated -
In This Article
eXtensible Markup Language (XML) is a language that’s similar to HTML in many regards. However, it’s primarily used for data exchange and transmission rather than website creation. While it can and is used in web development, it’s often used as configuration files, RSS feeds, and web service data exchange.
One of the standout features of XML is its high customizability due to the ability to create user-defined tags instead of using inbuilt ones like HTML. Many tags will be unanimous due to their common usage, but anyone can define any tag with XML.
XML is widely used in numerous programming languages, APIs, web services, data storage, and various applications. In fact, most modern applications will have in-built XML file support, so it’s an incredibly useful way to structure data.
What is XML?
XML is one of the most popular markup languages with numerous applications across all software. Unlike many other formats, XML documents are highly customizable, only having a basic necessary structure.
Every XML file starts with a declaration, specifying a version and encoding. The remainder of the XML file uses a tree-like structure, with the topmost part being called the root element. All further elements are descendants. Any element that is immediately nested within an upper element (parent) is called a child element.
Here’s an example of an XML file:
<bookstore> <!-- Root Element -->
<book category="fiction"> <!-- Parent Element -->
<title lang="en">The Great Gatsby</title> <!-- Child Element of <book> -->
<author>F. Scott Fitzgerald</author> <!-- Child Element of <book> -->
<year>1925</year> <!-- Child Element of <book> -->
<price>10.99</price> <!-- Child Element of <book> -->
</book>
<book category="non-fiction"> <!-- Parent Element -->
<title lang="en">Sapiens: A Brief History of Humankind</title> <!-- Child Element of <book> -->
<author>Yuval Noah Harari</author> <!-- Child Element of <book> -->
<year>2011</year> <!-- Child Element of <book> -->
<price>15.99</price> <!-- Child Element of <book> -->
</book>
</bookstore>
All of the elements are marked. Note that only <bookstore> is the root element. Everything else is a descendant of it, but parents and children are only named so if they are directly related.
Intuitively, it may seem that XML documents are hard to parse as they’re so customizable. Most of the time, customization makes parsing a lot harder. That’s not the case with XML parsing, however.
Is Python Suitable for XML Parsing?
XML documents aren’t too difficult to parse as they have an inbuilt structure that’s already used for data storage and web communication. As such, Python has a multitude of XML parsing libraries, each of which has its own advantages and disadvantages.
In general, Python is a great choice for XML parsing. While it won’t do much natively, all the aforementioned libraries will do most of the heavy lifting. Couple that with Python’s easy-to-understand syntax and relatively simple development process, and you have the perfect mix to parse XML files.
Python’s Capabilities for XML Parsing
There are a few options when you want to parse XML files with Python. First, there’s the built-in library that provides native support for parsing XML documents.
While it can definitely do the job, it can be a bit slow and complicated for beginners. Third-party libraries for XML parsing are often faster, more memory efficient, and feature-rich when compared to the inbuilt one.
So, unless you have a good reason to use the native Python XML file parsing library, pick something like lxml to do the job.
Methods for Parsing XML in Python
There are quite a few libraries you can use, depending on your goals. Like with most options, there are unique advantages and drawbacks to XML parsers and libraries.
- xml.etree.ElementTree is the inbuilt Python XML document parsing library. Not that difficult to use, decently fast, but can be slow with large XML files.
- Minidom is a third-party library for XML document parsing that uses a Document Object Model (DOM)-like approach. A good alternative to the default library.
- Lxml is a fast, efficient, and feature-rich library for XML file parsing. One of the most popular options.
- Xmltodict is a unique library that converts XML files to Python dictionaries, which makes it easier to convert to other file formats such as JSON.
Lxml is often the preferred choice for XML file parsing due to its reliance on efficient C libraries. C is considered one of the faster programming languages, so it translates well into efficiency for lxml.
While beginners can pick up the built-in library for parsing XML files as it’s not all that complicated, there’s no reason to avoid picking something like lxml. It’s not significantly more difficult if you understand the XML tree structure while being more popular and efficient.
How to Parse an XML File with Python
For the tutorial, we’ll be using both the inbuilt library and lxml. It’ll showcase how similar these libraries are for parsing XML, while the latter is a lot faster.
Let’s start by creating a new project in your favorite IDE. Then, create an XML file in the working directory. You can use the sample XML document we’ve shown above. We’ll be copying it down here so you don’t have to scroll up:
<bookstore> <!-- Root Element -->
<book category="fiction"> <!-- Parent Element -->
<title lang="en">The Great Gatsby</title> <!-- Child Element of <book> -->
<author>F. Scott Fitzgerald</author> <!-- Child Element of <book> -->
<year>1925</year> <!-- Child Element of <book> -->
<price>10.99</price> <!-- Child Element of <book> -->
</book>
<book category="non-fiction"> <!-- Parent Element -->
<title lang="en">Sapiens: A Brief History of Humankind</title> <!-- Child Element of <book> -->
<author>Yuval Noah Harari</author> <!-- Child Element of <book> -->
<year>2011</year> <!-- Child Element of <book> -->
<price>15.99</price> <!-- Child Element of <book> -->
</book>
</bookstore>
Then, we’ll need to install lxml as it’s a third-party XML parsing library. Open up the Terminal and type in the following:
pip install lxml
Let’s start with parsing using the inbuilt XML document parsing library:
import xml.etree.ElementTree as ET
tree = ET.parse('parsing.xml')
root = tree.getroot()
for child in root:
print(child.tag, child.attrib)
We start by importing the XML document parsing library as ET (to make it a bit shorter and readable). The rest is fairly self-explanatory if you understand how preprocessing XML documents works.
So, an object “tree” is created and we select the “parsing.xml” file we created earlier. We’ll then pick up the root (the upper-most element) object.
To get its children (immediate next descendant), create a “for” loop that goes through the “root” object and prints out the tags and attributes of children.
Let’s now parse the same XML document with lxml:
from lxml import etree
tree = etree.parse('parsing.xml')
root = tree.getroot()
for child in root:
print(child.tag, child.attrib)
As you can see, the code is nearly identical However, if you run it, the output will be slightly different. Lxml picks up that there are comments in the XML file while the inbuilt library does not.
Converting XML to a Dictionary or JSON file
While you can perform the same actions manually with the usage of other XML file parsing libraries, xmltodict is a prepackaged solution that’ll make the process a lot easier.
Start by installing the xmltodict library by opening up the Terminal and entering:
pip install xmltodict
Then, we’ll convert our parsing XML file into a Python dictionary:
import xmltodict
with open('parsing.xml') as xml_file:
data_dict = xmltodict.parse(xml_file.read())
print(data_dict)
All our code does is open the XML file and runs a single function to convert it to a dictionary. Instead of having to write complicated code to complete the process, we only need a few lines.
Xmltodict can also be quite useful if you want to convert an XML file to a JSON format:
import json
import xmltodict
with open('parsing.xml') as xml_file:
data_dict = xmltodict.parse(xml_file.read())
json_data = json.dumps(data_dict, indent=4)
print(json_data)
A few lines are added and a new import is made to deal with JSON files. Since there are barely any changes, the process is fairly self-explanatory.
Finally, xmltodict can accept an XML string as well. You don’t need to have a file on-hand as you can copy the XML file directly into your code and turn it into a string.
Parsing XML Files with Namespaces
An XML file can have both default namespaces and qualified namespaces. Dealing with these is slightly more complicated.
Let’s start by modifying our “parsing.xml” file:
<bookstore xmlns="http://www.example.com/ns"> <!-- Root Element with Default Namespace -->
<book category="fiction"> <!-- Parent Element -->
<title lang="en">The Great Gatsby</title> <!-- Child Element of <book> -->
<author>F. Scott Fitzgerald</author> <!-- Child Element of <book> -->
<year>1925</year> <!-- Child Element of <book> -->
<price>10.99</price> <!-- Child Element of <book> -->
</book>
<book category="non-fiction"> <!-- Parent Element -->
<title lang="en">Sapiens: A Brief History of Humankind</title> <!-- Child Element of <book> -->
<author>Yuval Noah Harari</author> <!-- Child Element of <book> -->
<year>2011</year> <!-- Child Element of <book> -->
<price>15.99</price> <!-- Child Element of <book> -->
</book>
</bookstore>
Now, let’s parse the XML file using the inbuilt library:
import xml.etree.ElementTree as ET
tree = ET.parse('parsing.xml')
root = tree.getroot()
ns = {'ns': 'http://www.example.com/ns'}
for book in root.findall('ns:book', ns):
title = book.find('ns:title', ns)
if title is not None:
print(title.text)
We’ll now be looping through the assigned namespace instead of tags. Namespaces are great for highly complex and large XML documents as they provide additional searching and parsing capabilities.
Handling Invalid XML
If you ever run into an issue with XML file formatting, it’s better to implement more complicated error handling . These can be quite common when, for example, you’re web scraping at scale, so having error handling becomes essential:
from lxml import etree
try:
tree = etree.parse('invalid.xml')
except etree.XMLSyntaxError as e:
print(f"Error: {e}")
Saving XML Data to CSV
Quite a common outcome in numerous use cases is converting XML to a CSV file. The latter are often great at numerical data analysis, so are often requested. Again, web scraping is quite a common use case where you might have to convert an XML to a CSV file.
import csv
import xml.etree.ElementTree as ET
# Parse the XML file
tree = ET.parse('parsing.xml')
root = tree.getroot()
# Define the namespace
ns = {'ns': 'http://www.example.com/ns'}
# Open the CSV file for writing
with open('output.csv', 'w', newline='') as csvfile:
writer = csv.writer(csvfile)
writer.writerow(['Tag', 'Text']) # Writing the header
# Iterate over each book element in the XML and extract relevant data
for book in root.findall('ns:book', ns):
# Get the child elements (title, author, year, price)
title = book.find('ns:title', ns).text if book.find('ns:title', ns) is not None else ''
author = book.find('ns:author', ns).text if book.find('ns:author', ns) is not None else ''
year = book.find('ns:year', ns).text if book.find('ns:year', ns) is not None else ''
price = book.find('ns:price', ns).text if book.find('ns:price', ns) is not None else ''
# Write each book's data to the CSV file
writer.writerow(['title', title])
writer.writerow(['author', author])
writer.writerow(['year', year])
writer.writerow(['price', price])
writer.writerow([]) # Empty row to separate each book
You can use various libraries for conversion to a CSV file, but Python’s default one does the job perfectly.
It may look a little intimidating at first, but there’s basically many repeating lines. Everything is the same up to opening the CSV file (which is created during opening).
We give it a name, open it in write mode (‘w’) and set a specific option for new lines as the default ones may cause unusual outcomes.
After that, we use the writer module to create rows in our file and start by using the first row for our headers.
Then, we’ll iterate through the XML file to find all titles, authors, publication years, and prices. Note that if we cannot find one element, we ask the code to input an empty string to avoid errors.
Finally, we enter all the data in sequence into the CSV file. Although don’t forget to add an empty row to separate elements to avoid a messy file.
Conclusion
All of the above functions and libraries will cover most of the common uses for XML files. In the real world, you’ll often be asked to parse, convert to a JSON or CSV file, or handle invalid ones.
Make sure to experiment with the libraries, as some of them may simply be more comfortable for you. If you’re not dealing with enormous XMLs, the speed won’t matter as much.
Author
Justas Vitaitis
Senior Software Engineer
Justas is a Senior Software Engineer with over a decade of proven expertise. He currently holds a crucial role in IPRoyal’s development team, regularly demonstrating his profound expertise in the Go programming language, contributing significantly to the company’s technological evolution. Justas is pivotal in maintaining our proxy network, serving as the authority on all aspects of proxies. Beyond coding, Justas is a passionate travel enthusiast and automotive aficionado, seamlessly blending his tech finesse with a passion for exploration.
Learn More About Justas Vitaitis