Processing XML in Python — ElementTree (2024)

Learn how you can parse, explore, modify and populate XML files with the Python ElementTree package, for loops and XPath expressions. As a data scientist, you’ll find that understanding XML is useful both for web scraping and for general practice in parsing structured documents.

Extensible Markup Language (XML) is a markup language which encodes documents by defining a set of rules in both machine-readable and human-readable format. Extended from SGML (Standard Generalized Markup Language), it lets us describe the structure of the document. In XML, we can define custom tags. We can also use XML as a standard format to exchange information.

  • XML documents have sections, called elements, defined by a beginning and an ending tag. A tag is a markup construct that begins with < and ends with >. The characters between the start-tag and end-tag, if there are any, are the element's content. Elements can contain markup, including other elements, which are called "child elements".
  • The largest, top-level element is called the root, which contains all other elements.
  • Attributes are name–value pairs that exist within a start-tag or empty-element tag. An XML attribute can only have a single value and each attribute can appear at most once on each element.
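The three ideas above (elements, the root, and attributes) can be sketched with a tiny document. The XML string here is a made-up, one-movie stand-in rather than the tutorial's movies.xml:

```python
import xml.etree.ElementTree as ET

# A tiny, hypothetical document: one root, one child element, one attribute.
xml_text = '<collection><movie title="Alien"><year>1979</year></movie></collection>'

root = ET.fromstring(xml_text)   # root is the top-level <collection> element
movie = root[0]                  # first child element of the root
year = movie[0]                  # first child element of <movie>

print(root.tag)      # collection
print(movie.attrib)  # {'title': 'Alien'}
print(year.text)     # 1979
```

Elements behave like lists of their children, which is why indexing with [0] works here.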

Here’s a snapshot of movies.xml that we will be using for this tutorial:

<?xml version="1.0"?>
<collection>
<genre category="Action">
<decade years="1980s">
<movie favorite="True" title="Indiana Jones: The raiders of the lost Ark">
<format multiple="No">DVD</format>
<year>1981</year>
<rating>PG</rating>
<description>
'Archaeologist and adventurer Indiana Jones
is hired by the U.S. government to find the Ark of the Covenant before the Nazis.'
</description>
</movie>
<movie favorite="True" title="THE KARATE KID">
<format multiple="Yes">DVD,Online</format>
<year>1984</year>
<rating>PG</rating>
<description>None provided.</description>
</movie>
<movie favorite="False" title="Back 2 the Future">
<format multiple="False">Blu-ray</format>
<year>1985</year>
<rating>PG</rating>
<description>Marty McFly</description>
</movie>
</decade>
<decade years="1990s">
<movie favorite="False" title="X-Men">
<format multiple="Yes">dvd, digital</format>
<year>2000</year>
<rating>PG-13</rating>
<description>Two mutants come to a private academy for their kind whose resident superhero team must oppose a terrorist organization with similar powers.</description>
</movie>
<movie favorite="True" title="Batman Returns">
<format multiple="No">VHS</format>
<year>1992</year>
<rating>PG13</rating>
<description>NA.</description>
</movie>
<movie favorite="False" title="Reservoir Dogs">
<format multiple="No">Online</format>
<year>1992</year>
<rating>R</rating>
<description>WhAtEvER I Want!!!?!</description>
</movie>
</decade>
</genre>

<genre category="Thriller">
<decade years="1970s">
<movie favorite="False" title="ALIEN">
<format multiple="Yes">DVD</format>
<year>1979</year>
<rating>R</rating>
<description>"""""""""</description>
</movie>
</decade>
<decade years="1980s">
<movie favorite="True" title="Ferris Bueller's Day Off">
<format multiple="No">DVD</format>
<year>1986</year>
<rating>PG13</rating>
<description>Funny movie on funny guy </description>
</movie>
<movie favorite="FALSE" title="American Psycho">
<format multiple="No">blue-ray</format>
<year>2000</year>
<rating>Unrated</rating>
<description>psychopathic Bateman</description>
</movie>
</decade>
</genre>

The XML tree structure makes navigation, modification, and removal relatively simple to do programmatically. Python has a built-in library, ElementTree, that has functions to read and manipulate XML documents (and other similarly structured files).

First, import ElementTree. It's common practice to use the alias ET:

import xml.etree.ElementTree as ET

Parsing XML Data

In the XML file provided, there is a basic collection of movies described. The only problem is the data is a mess! There have been a lot of different curators of this collection and everyone has their own way of entering data into the file. The main goal in this tutorial will be to read and understand the file with Python — then fix the problems.

First you need to read in the file with ElementTree.

tree = ET.parse('movies.xml')
root = tree.getroot()

Now that you have initialized the tree, you should look at the XML and print out values in order to understand how the tree is structured.

root.tag

'collection'

At the top level, you see that this XML is rooted in the collection tag.

root.attrib

{}
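If movies.xml is not at hand, the same parse-and-inspect steps can be tried on an in-memory file. This is a minimal sketch that assumes a made-up two-genre sample; it relies on ET.parse() accepting a file object as well as a filename:

```python
import io
import xml.etree.ElementTree as ET

# Stand-in for movies.xml; the content here is a made-up two-genre sample.
sample = io.StringIO(
    '<collection>'
    '<genre category="Action"/>'
    '<genre category="Thriller"/>'
    '</collection>'
)

tree = ET.parse(sample)   # ET.parse accepts a filename or a file object
root = tree.getroot()

print(root.tag)      # collection
print(root.attrib)   # {}  (the root has no attributes)
print(len(root))     # 2   (number of direct children)
```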

For Loops

You can easily iterate over subelements (commonly called “children”) in the root by using a simple “for” loop.

for child in root:
    print(child.tag, child.attrib)

genre {'category': 'Action'}
genre {'category': 'Thriller'}
genre {'category': 'Comedy'}

Now you know that the children of the root collection are all genre elements. To designate the genre, the XML uses the attribute category. There are Action, Thriller, and Comedy movies according to the genre elements.

Typically it is helpful to know all the elements in the entire tree. One useful function for doing that is root.iter().

[elem.tag for elem in root.iter()]

['collection',
'genre',
'decade',
'movie',
'format',
'year',
'rating',
'description',
'movie',
.
.
.
.
'movie',
'format',
'year',
'rating',
'description']
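A long list of tags is easier to read as a tally. The following sketch counts tags with collections.Counter on a small, made-up tree; the same Counter line would work on the root of movies.xml:

```python
from collections import Counter
import xml.etree.ElementTree as ET

# Hypothetical miniature of the collection/genre/decade/movie layout.
root = ET.fromstring(
    '<collection><genre><decade>'
    '<movie><year>1981</year></movie>'
    '<movie><year>1984</year></movie>'
    '</decade></genre></collection>'
)

# root.iter() walks the whole tree, root included, in document order.
tag_counts = Counter(elem.tag for elem in root.iter())
print(tag_counts)
```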

There is a helpful way to see the whole document at once. If you pass the root to ET.tostring(), it returns the entire document. Within ElementTree, this function takes a slightly strange form.

When you pass an encoding such as 'utf8', ET.tostring() returns a bytes object rather than a string, so you specify the encoding and then decode the result before printing it: print(ET.tostring(root, encoding='utf8').decode('utf8')).
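As a sketch of how serialization behaves, the example below compares the encode-then-decode form with the encoding='unicode' option, which returns a string directly. The one-genre document is a made-up sample:

```python
import xml.etree.ElementTree as ET

root = ET.fromstring('<collection><genre category="Action"/></collection>')

as_bytes = ET.tostring(root, encoding='utf8')    # bytes, with an XML declaration
as_text = as_bytes.decode('utf8')                # decode to get a str
direct = ET.tostring(root, encoding='unicode')   # str directly, no declaration

print(type(as_bytes).__name__)  # bytes
print(as_text)
print(direct)
```

Either form is fine for printing; the 'unicode' variant simply skips the decode step.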

You can expand the use of the iter() function to help with finding particular elements of interest. root.iter() will list all subelements under the root that match the element specified. Here, you will list all attributes of the movie element in the tree:

for movie in root.iter('movie'):
    print(movie.attrib)

{'favorite': 'True', 'title': 'Indiana Jones: The raiders of the lost Ark'}
{'favorite': 'True', 'title': 'THE KARATE KID'}
{'favorite': 'False', 'title': 'Back 2 the Future'}
{'favorite': 'False', 'title': 'X-Men'}
{'favorite': 'True', 'title': 'Batman Returns'}
{'favorite': 'False', 'title': 'Reservoir Dogs'}
{'favorite': 'False', 'title': 'ALIEN'}
{'favorite': 'True', 'title': "Ferris Bueller's Day Off"}
{'favorite': 'FALSE', 'title': 'American Psycho'}
{'favorite': 'False', 'title': 'Batman: The Movie'}
{'favorite': 'True', 'title': 'Easy A'}
{'favorite': 'True', 'title': 'Dinner for SCHMUCKS'}
{'favorite': 'False', 'title': 'Ghostbusters'}
{'favorite': 'True', 'title': 'Robin Hood: Prince of Thieves'}
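To make the reach of iter() concrete, this sketch contrasts it with findall() given a bare tag name, which looks only at direct children. The two-movie tree is a trimmed, hypothetical version of the tutorial's data:

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    '<collection><genre category="Action"><decade years="1980s">'
    '<movie title="THE KARATE KID"/>'
    '<movie title="Back 2 the Future"/>'
    '</decade></genre></collection>'
)

nested = [m.get('title') for m in root.iter('movie')]      # searches every level
direct = [m.get('title') for m in root.findall('movie')]   # direct children only

print(nested)  # ['THE KARATE KID', 'Back 2 the Future']
print(direct)  # []
```

The empty second list is expected: the root's only direct children are genre elements.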

XPath Expressions

Many elements have no attributes and contain only text content. Using the attribute .text, you can print out this content.

Now, print out all the descriptions of the movies.

for description in root.iter('description'):
    print(description.text)

'Archaeologist and adventurer Indiana Jones is hired by the U.S. government to find the Ark of the Covenant before the Nazis.'
None provided.
Marty McFly
Two mutants come to a private academy for their kind whose resident superhero team must oppose a terrorist organization with similar powers.
NA.
WhAtEvER I Want!!!?!
"""""""""
Funny movie about a funny guy
psychopathic Bateman
What a joke!
Emma Stone = Hester Prynne
Tim (Rudd) is a rising executive who “succeeds” in finding the perfect guest, IRS employee Barry (Carell), for his boss’ monthly event, a so-called “dinner for idiots,” which offers certain
advantages to the exec who shows up with the biggest buffoon.
Who ya gonna call?
Robin Hood slaying
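Text content often carries stray whitespace, and an empty element has .text equal to None. A small helper can smooth over both cases; clean_text is a hypothetical name introduced for this sketch, and the sample XML is made up:

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    '<movies>'
    '<description>\n  Marty McFly  \n</description>'
    '<description/>'
    '</movies>'
)

def clean_text(elem):
    """Return the element's text with surrounding whitespace removed ('' if empty)."""
    return (elem.text or '').strip()

cleaned = [clean_text(d) for d in root.iter('description')]
print(cleaned)  # ['Marty McFly', '']
```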

Printing out the XML is helpful, but XPath, a query language for searching XML documents, is quicker and more precise. Understanding XPath is critically important to scanning and populating XML. ElementTree supports a limited subset of XPath through its .findall() method. Given a bare tag name, .findall() searches only the immediate children of the referenced element; a path expression lets it reach deeper into the tree.

Here, you will search the tree for movies that came out in 1992:

for movie in root.findall("./genre/decade/movie[year='1992']"):
    print(movie.attrib)

{'favorite': 'True', 'title': 'Batman Returns'}
{'favorite': 'False', 'title': 'Reservoir Dogs'}
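The child-text predicate can be tried on a small, made-up tree. This sketch also shows the ".//" prefix, which ElementTree supports for searching at any depth, so the full genre/decade path is not required:

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    '<collection><genre><decade>'
    '<movie title="Batman Returns"><year>1992</year></movie>'
    '<movie title="X-Men"><year>2000</year></movie>'
    '<movie title="Reservoir Dogs"><year>1992</year></movie>'
    '</decade></genre></collection>'
)

# [year='1992'] keeps only <movie> elements that have a <year> child with that text.
titles_1992 = [m.get('title') for m in root.findall("./genre/decade/movie[year='1992']")]

# ".//" searches at any depth below the starting element.
same_titles = [m.get('title') for m in root.findall(".//movie[year='1992']")]

print(titles_1992)  # ['Batman Returns', 'Reservoir Dogs']
print(same_titles)  # ['Batman Returns', 'Reservoir Dogs']
```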

The function .findall() always begins at the element specified. This type of function is extremely powerful for a "find and replace". You can even search on attributes!

Now, print out only the movies that are available in multiple formats (an attribute).

for movie in root.findall("./genre/decade/movie/format[@multiple='Yes']"):
    print(movie.attrib)

{'multiple': 'Yes'}
{'multiple': 'Yes'}
{'multiple': 'Yes'}
{'multiple': 'Yes'}
{'multiple': 'Yes'}

Brainstorm why, in this case, the print statement returns the “Yes” values of multiple. Think about how the "for" loop is defined.

Tip: use '..' inside of XPath to return the parent element of the current element.

for movie in root.findall("./genre/decade/movie/format[@multiple='Yes']/.."):
    print(movie.attrib)

{'favorite': 'True', 'title': 'THE KARATE KID'}
{'favorite': 'False', 'title': 'X-Men'}
{'favorite': 'False', 'title': 'ALIEN'}
{'favorite': 'False', 'title': 'Batman: The Movie'}
{'favorite': 'True', 'title': 'Dinner for SCHMUCKS'}
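The difference between selecting the format elements and selecting their parent movies can be seen side by side. This is a minimal sketch on a made-up two-movie tree, using the standard XPath parent step "..":

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    '<collection><genre><decade>'
    '<movie title="THE KARATE KID"><format multiple="Yes">DVD,Online</format></movie>'
    '<movie title="Batman Returns"><format multiple="No">VHS</format></movie>'
    '</decade></genre></collection>'
)

# The path ends at <format>, so <format> elements come back.
formats = root.findall("./genre/decade/movie/format[@multiple='Yes']")

# Adding "/.." steps up one level to the enclosing <movie>.
parents = root.findall("./genre/decade/movie/format[@multiple='Yes']/..")

print([e.tag for e in formats])           # ['format']
print([e.get('title') for e in parents])  # ['THE KARATE KID']
```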

Modifying an XML

Earlier, the movie titles were an absolute mess. Now, print them out again:

for movie in root.iter('movie'):
    print(movie.attrib)

{'favorite': 'True', 'title': 'Indiana Jones: The raiders of the lost Ark'}
{'favorite': 'True', 'title': 'THE KARATE KID'}
{'favorite': 'False', 'title': 'Back 2 the Future'}
{'favorite': 'False', 'title': 'X-Men'}
{'favorite': 'True', 'title': 'Batman Returns'}
{'favorite': 'False', 'title': 'Reservoir Dogs'}
{'favorite': 'False', 'title': 'ALIEN'}
{'favorite': 'True', 'title': "Ferris Bueller's Day Off"}
{'favorite': 'FALSE', 'title': 'American Psycho'}
{'favorite': 'False', 'title': 'Batman: The Movie'}
{'favorite': 'True', 'title': 'Easy A'}
{'favorite': 'True', 'title': 'Dinner for SCHMUCKS'}
{'favorite': 'False', 'title': 'Ghostbusters'}
{'favorite': 'True', 'title': 'Robin Hood: Prince of Thieves'}

Fix the ‘2’ in Back 2 the Future. That should be a find and replace problem. Write code to find the title ‘Back 2 the Future’ and save it as a variable:

b2tf = root.find("./genre/decade/movie[@title='Back 2 the Future']")
print(b2tf)
<Element 'movie' at 0x10ce00ef8>

Notice that using the .find() method returns an element of the tree. Much of the time, it is more useful to edit the content within an element.

Modify the title attribute of the Back 2 the Future element variable to read "Back to the Future". Then, print out the attributes of your variable to see your change. You can easily do this by accessing the attribute of an element and then assigning a new value to it:

b2tf.attrib["title"] = "Back to the Future"
print(b2tf.attrib)
{'favorite': 'False', 'title': 'Back to the Future'}
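One detail worth guarding against: .find() returns None when nothing matches, so assigning to .attrib on the result would raise an error. A hedged sketch of the same fix with that check, using .set() and .get() on a made-up one-movie tree ('No Such Movie' is a hypothetical title):

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    '<collection><genre><decade>'
    '<movie favorite="False" title="Back 2 the Future"/>'
    '</decade></genre></collection>'
)

movie = root.find("./genre/decade/movie[@title='Back 2 the Future']")
if movie is not None:                         # find() returns None when nothing matches
    movie.set('title', 'Back to the Future')  # same effect as movie.attrib['title'] = ...

missing = root.find("./genre/decade/movie[@title='No Such Movie']")

print(movie.get('title'))  # Back to the Future
print(missing)             # None
```

The comparison uses "is not None" rather than a plain truth test because an element without children is treated as false.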

Write out your changes back to the XML so they are permanently fixed in the document. Print out your movie attributes again to make sure your changes worked. Use the .write() method to do this:

tree.write("movies.xml")

tree = ET.parse('movies.xml')
root = tree.getroot()
for movie in root.iter('movie'):
    print(movie.attrib)

{'favorite': 'True', 'title': 'Indiana Jones: The raiders of the lost Ark'}
{'favorite': 'True', 'title': 'THE KARATE KID'}
{'favorite': 'False', 'title': 'Back to the Future'}
{'favorite': 'False', 'title': 'X-Men'}
{'favorite': 'True', 'title': 'Batman Returns'}
{'favorite': 'False', 'title': 'Reservoir Dogs'}
{'favorite': 'False', 'title': 'ALIEN'}
{'favorite': 'True', 'title': "Ferris Bueller's Day Off"}
{'favorite': 'FALSE', 'title': 'American Psycho'}
{'favorite': 'False', 'title': 'Batman: The Movie'}
{'favorite': 'True', 'title': 'Easy A'}
{'favorite': 'True', 'title': 'Dinner for SCHMUCKS'}
{'favorite': 'False', 'title': 'Ghostbusters'}
{'favorite': 'True', 'title': 'Robin Hood: Prince of Thieves'}
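Overwriting the source file is convenient but unforgiving. One cautious alternative, sketched here, is to write to a separate file and re-parse it to confirm the change. The file name movies_fixed.xml and the temporary directory are assumptions made for this example:

```python
import os
import tempfile
import xml.etree.ElementTree as ET

root = ET.fromstring('<collection><movie title="Back 2 the Future"/></collection>')
root[0].set('title', 'Back to the Future')

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'movies_fixed.xml')   # hypothetical output name
    # Writing to a new file keeps the original intact until the result is checked.
    ET.ElementTree(root).write(path, encoding='utf-8', xml_declaration=True)
    reloaded = ET.parse(path).getroot()

print(reloaded[0].get('title'))  # Back to the Future
```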

Fixing Attributes

The multiple attribute is incorrect in some places. Use ElementTree to fix the designator based on how many formats the movie comes in. First, print the format attribute and text to see which parts need to be fixed.

for form in root.findall("./genre/decade/movie/format"):
    print(form.attrib, form.text)

{'multiple': 'No'} DVD
{'multiple': 'Yes'} DVD,Online
{'multiple': 'False'} Blu-ray
{'multiple': 'Yes'} dvd, digital
{'multiple': 'No'} VHS
{'multiple': 'No'} Online
{'multiple': 'Yes'} DVD
{'multiple': 'No'} DVD
{'multiple': 'No'} blue-ray
{'multiple': 'Yes'} DVD,VHS
{'multiple': 'No'} DVD
{'multiple': 'Yes'} DVD,digital,Netflix
{'multiple': 'No'} Online,VHS
{'multiple': 'No'} Blu_Ray

There is some work that needs to be done on this tag.

You can use regex to find commas, which will tell you whether the multiple attribute should be "Yes" or "No". Adding and modifying attributes can be done easily with the .set() method.

import re

for form in root.findall("./genre/decade/movie/format"):
    # Search for the commas in the format text
    match = re.search(',', form.text)
    if match:
        form.set('multiple', 'Yes')
    else:
        form.set('multiple', 'No')

# Write out the tree to the file again
tree.write("movies.xml")

tree = ET.parse('movies.xml')
root = tree.getroot()
for form in root.findall("./genre/decade/movie/format"):
    print(form.attrib, form.text)

{'multiple': 'No'} DVD
{'multiple': 'Yes'} DVD,Online
{'multiple': 'No'} Blu-ray
{'multiple': 'Yes'} dvd, digital
{'multiple': 'No'} VHS
{'multiple': 'No'} Online
{'multiple': 'No'} DVD
{'multiple': 'No'} DVD
{'multiple': 'No'} blue-ray
{'multiple': 'Yes'} DVD,VHS
{'multiple': 'No'} DVD
{'multiple': 'Yes'} DVD,digital,Netflix
{'multiple': 'Yes'} Online,VHS
{'multiple': 'No'} Blu_Ray
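The same rule can be expressed without regex by splitting on commas, which also copes with an empty format element. The helper name multiple_flag and the three-entry sample are assumptions made for this sketch:

```python
import xml.etree.ElementTree as ET

def multiple_flag(format_text):
    """Return 'Yes' when the text lists more than one non-empty format, else 'No'."""
    parts = [p.strip() for p in (format_text or '').split(',')]
    return 'Yes' if len([p for p in parts if p]) > 1 else 'No'

root = ET.fromstring(
    '<movies>'
    '<format multiple="No">Online,VHS</format>'
    '<format multiple="Yes">DVD</format>'
    '<format multiple="False">Blu-ray</format>'
    '</movies>'
)

for form in root.iter('format'):
    form.set('multiple', multiple_flag(form.text))

flags = [f.get('multiple') for f in root.iter('format')]
print(flags)  # ['Yes', 'No', 'No']
```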

Moving Elements

Some of the data has been placed in the wrong decade. Use what you have learned about XML and ElementTree to find and fix the decade data errors.

It will be useful to print out both the decade tags and the year tags throughout the document.

for decade in root.findall("./genre/decade"):
    print(decade.attrib)
    for year in decade.findall("./movie/year"):
        print(year.text)

{'years': '1980s'}
1981
1984
1985
{'years': '1990s'}
2000
1992
1992
{'years': '1970s'}
1979
{'years': '1980s'}
1986
2000
{'years': '1960s'}
1966
{'years': '2010s'}
2010
2011
{'years': '1980s'}
1984
{'years': '1990s'}
1991

The two movies that are in the wrong decade are the ones from the year 2000. Figure out what those movies are, using an XPath expression.

for movie in root.findall("./genre/decade/movie[year='2000']"):
    print(movie.attrib)

{'favorite': 'False', 'title': 'X-Men'}
{'favorite': 'FALSE', 'title': 'American Psycho'}
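Rather than spotting mismatches by eye, the decade can be derived from each year and compared with the enclosing decade element. This is a sketch on a made-up two-movie tree; it assumes every year is a four-digit number and every decade label has the form '1990s':

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    '<collection><genre category="Action">'
    '<decade years="1990s">'
    '<movie title="X-Men"><year>2000</year></movie>'
    '<movie title="Batman Returns"><year>1992</year></movie>'
    '</decade>'
    '</genre></collection>'
)

misplaced = []
for decade in root.findall("./genre/decade"):
    for movie in decade.findall("movie"):
        year = int(movie.findtext("year"))
        expected = f"{year // 10 * 10}s"      # 2000 -> '2000s', 1992 -> '1990s'
        if expected != decade.get("years"):
            misplaced.append((movie.get("title"), decade.get("years"), expected))

print(misplaced)  # [('X-Men', '1990s', '2000s')]
```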

You have to add a new decade tag, the 2000s, to the Action genre in order to move the X-Men data. The ET.SubElement() function can be used to add this tag as the last child of the Action genre element.

action = root.find("./genre[@category='Action']")
new_dec = ET.SubElement(action, 'decade')
new_dec.attrib["years"] = '2000s'

Now append the X-Men movie to the 2000s and remove it from the 1990s, using .append() and .remove(), respectively.

xmen = root.find("./genre/decade/movie[@title='X-Men']")
dec2000s = root.find("./genre[@category='Action']/decade[@years='2000s']")
dec2000s.append(xmen)
dec1990s = root.find("./genre[@category='Action']/decade[@years='1990s']")
dec1990s.remove(xmen)
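The whole move can be rehearsed on a small tree before touching the real file. In this sketch the sample XML is made up, and the years attribute is passed to ET.SubElement() at creation time instead of being assigned afterwards:

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    '<collection><genre category="Action">'
    '<decade years="1990s">'
    '<movie title="X-Men"/><movie title="Batman Returns"/>'
    '</decade>'
    '</genre></collection>'
)

action = root.find("./genre[@category='Action']")
old_decade = action.find("./decade[@years='1990s']")
xmen = old_decade.find("./movie[@title='X-Men']")

new_decade = ET.SubElement(action, 'decade', {'years': '2000s'})  # attributes set on creation
old_decade.remove(xmen)      # detach from the old parent...
new_decade.append(xmen)      # ...and attach to the new one

layout = {d.get('years'): [m.get('title') for m in d] for d in action}
print(layout)  # {'1990s': ['Batman Returns'], '2000s': ['X-Men']}
```

An element object is not copied by append(), so it needs removing from the old parent to avoid appearing in both places.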

Build XML Documents

Nice, so you were able to essentially move an entire movie to a new decade. Save your changes back to the XML.

tree.write("movies.xml")

tree = ET.parse('movies.xml')
root = tree.getroot()
print(ET.tostring(root, encoding='utf8').decode('utf8'))
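The same functions can also build a document from scratch. The following is a minimal sketch; the genre, decade, and movie values are illustrative choices, not a reconstruction of the tutorial's file:

```python
import xml.etree.ElementTree as ET

# Build the tree top-down: each SubElement call attaches a new child to its parent.
collection = ET.Element('collection')
genre = ET.SubElement(collection, 'genre', category='Comedy')
decade = ET.SubElement(genre, 'decade', years='2010s')
movie = ET.SubElement(decade, 'movie', favorite='True', title='Easy A')
ET.SubElement(movie, 'year').text = '2010'

xml_text = ET.tostring(collection, encoding='unicode')
print(xml_text)

# Parsing the text back is a quick check that the output is well-formed.
round_trip = ET.fromstring(xml_text)
```

Wrapping the root in ET.ElementTree(collection) and calling .write() would save it to a file in the same way as the earlier steps.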

ElementTree is an important Python library that allows you to parse and navigate an XML document. Using ElementTree breaks down the XML document into a tree structure that is easy to work with. When in doubt, print it out: print(ET.tostring(root, encoding='utf8').decode('utf8')) is a helpful statement for viewing the entire XML document at once.

FAQs

How do I process an XML file in Python?

To read an XML file using ElementTree, first import the xml.etree.ElementTree module under the name ET (a common convention). Then pass the filename of the XML file to the ET.parse() function to parse the file, and get the root (the top-level element) of the document using getroot().

How do you parse XML in Python ElementTree?

There are two ways to parse XML using the ElementTree module. The first is the parse() function and the second is the fromstring() function. The parse() function parses an XML document supplied as a file, whereas fromstring() parses XML supplied as a string, for example a triple-quoted literal.

Is it safe to parse XML in Python?

Python's interfaces for processing XML are grouped in the xml package. The XML modules are not secure against erroneous or maliciously constructed data. If you need to parse untrusted or unauthenticated data, see the "XML vulnerabilities" and "The defusedxml Package" sections of the Python documentation.

How to handle XML response in Python?

The same two functions apply. The parse() function is used for XML data supplied as a file, while the fromstring() function is used for XML data supplied as a string, which is typically how a response body arrives.

What is the fastest way to read XML in Python?

In the comparison quoted here, lxml was by far the fastest XML parsing library, taking only 0.35 seconds compared to over 2 seconds with the built-in xml.etree.

How to extract XML data using Python?

Load our XML document into memory, and construct an XML ElementTree object. We then use the find method, passing in an XPath selector, which allows us to specify what element we're trying to extract. If the element can't be found, None is returned. If the element can be found, then we'll use the .

Can Python parse XML?

Yes. You can process XML documents using a few language-agnostic strategies. Each has different memory and speed trade-offs, which partly explains the wide range of XML parsers available in Python.

What is an element tree in XML?

ElementTree is an important Python library that allows you to parse and navigate an XML document. Using ElementTree breaks down the XML document into a tree structure that is easy to work with.

Is it easier to parse XML or JSON in Python?

XML must be parsed with an XML parser. JSON is simpler and more flexible; XML is more complex and less flexible. JSON supports numbers, strings, Booleans, arrays, and objects.

What is the easiest language to parse XML?

Perl is one of the most widely used languages for parsing XML files, especially among those whose background is not computer science. Perl is easy to learn.

What is the difference between BeautifulSoup and ElementTree?

If you are scraping HTML or need to handle "wild" HTML, go with BeautifulSoup. ElementTree provides XML-oriented capabilities, while BeautifulSoup is more focused on real-world HTML and scraping tasks. Consider the structure and format of your data when choosing between them.

What is the fastest XML parsing library?

RapidXml. RapidXml is an attempt to create the fastest XML parser possible while retaining usability, portability and reasonable W3C compatibility. It is an in-situ parser written in modern C++, with parsing speed approaching that of the strlen function executed on the same data.

How to parse a large XML file in Python?

ElementTree. The ElementTree XML API provides a simple and intuitive API for parsing and creating XML data in Python. It's a built-in module in Python's standard library, which means you don't need to install anything. For large files, its iterparse() function processes elements incrementally instead of loading the whole tree at once.

How to parse data from XML file in Python?

  • Parse the XML file: use the parse() function of ElementTree to load the XML file into a tree structure.
  • Access elements: navigate to the specific elements in the XML tree that you want to read or modify. You can use methods like find() and findall(), or iterate over elements using loops.


Author: Tyson Zemlak
