Parsing XML in Python with etree.ElementTree (2024)

The xml package in the standard library provides tools for working with XML documents.

The ElementTree module in the xml.etree package offers an intuitive way of parsing and representing XML data.

An ElementTree object represents XML data as a tree whose hierarchy follows the nesting of the XML elements.

Basic parsing example

Suppose we have an XML file called articles.xml with the following content:

<?xml version = '1.0' encoding = 'UTF-8'?>
<articlelist>
  <article>
    <author country = 'India'>John Doe</author>
    <datepublished>2024/04/05</datepublished>
    <title>Lorem ipsum dolor sit amet consectetur adipisicing elit</title>
    <content>Lorem ipsum dolor sit amet consectetur adipisicing elit. Maxime mollitia, molestiae quas vel sint commodi repudiandae consequuntur voluptatum laborum numquam blanditiis harum quisquam eius sed odit fugiat iusto fuga praesentium optio, eaque rerum! Provident similique accusantium nemo autem. </content>
  </article>
  <article>
    <author country = 'Finland'>Mary Smith</author>
    <datepublished>2024/04/07</datepublished>
    <title>Perspiciatis minima nesciunt dolorem</title>
    <content>Perspiciatis minima nesciunt dolorem! Officiis iure rerum voluptates a cumque velit quibusdam sed amet tempora. Sit laborum ab, eius fugit doloribus tenetur fugiat, temporibus enim commodi iusto libero magni deleniti quod quam consequuntur! Commodi minima excepturi repudiandae velit hic maxime doloremque.</content>
  </article>
</articlelist>

We can parse the document by passing the opened file object to the ElementTree.parse() function, as shown below:

from xml.etree import ElementTree

# Parse the file into an ElementTree instance.
with open('articles.xml') as file:
    tree = ElementTree.parse(file)

print(tree)

<xml.etree.ElementTree.ElementTree object at 0x000001B03DB770E0>

As shown in the above example, the ElementTree.parse() helper function creates an ElementTree instance from the given file object.

The ElementTree object represents the structure of the XML document in the form of a tree, where each node in the tree represents the corresponding element in the XML document.
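To make the tree structure more concrete, here is a minimal sketch that reaches the root node with getroot() and lists its direct children. It assumes a trimmed, in-memory stand-in for articles.xml (the xml_text string below is made up for illustration and wrapped in io.StringIO so the example runs without a file on disk).

```python
import io
from xml.etree import ElementTree

# Hypothetical, trimmed stand-in for articles.xml (assumption for illustration).
xml_text = """<articlelist>
  <article><author country="India">John Doe</author></article>
  <article><author country="Finland">Mary Smith</author></article>
</articlelist>"""

# parse() accepts any file-like object, so an in-memory buffer works too.
tree = ElementTree.parse(io.StringIO(xml_text))

# getroot() returns the Element for the outermost tag.
root = tree.getroot()
print(root.tag)

# Iterating over an Element yields its direct children only.
child_tags = [child.tag for child in root]
print(child_tags)
```

With this input, the root tag is articlelist and both of its children are article nodes.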

Traversing an ElementTree

The tree.iter() method returns an iterator that yields the nodes of the parsed tree in document order (depth-first), starting at the root. By default it yields every node in the tree; passing a tag name restricts it to nodes with that tag.

from xml.etree import ElementTree

with open('articles.xml') as file:
    tree = ElementTree.parse(file)

# Visit every node, starting from the root.
for node in tree.iter():
    print(node.tag)
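The traversal order and the optional tag filter can be sketched as follows. The xml_text string is a hypothetical, shortened version of articles.xml (the titles "First" and "Second" are placeholders, not the real file contents).

```python
import io
from xml.etree import ElementTree

# Hypothetical, shortened stand-in for articles.xml (assumption for illustration).
xml_text = (
    "<articlelist>"
    "<article><author>John Doe</author><title>First</title></article>"
    "<article><author>Mary Smith</author><title>Second</title></article>"
    "</articlelist>"
)
tree = ElementTree.parse(io.StringIO(xml_text))

# Without arguments, iter() walks every node depth-first, root included.
all_tags = [node.tag for node in tree.iter()]
print(all_tags)
# ['articlelist', 'article', 'author', 'title', 'article', 'author', 'title']

# With a tag name, only nodes carrying that tag are yielded.
title_texts = [node.text for node in tree.iter('title')]
print(title_texts)
# ['First', 'Second']
```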

Searching for nodes

Parsed trees provide methods for searching for nodes with certain characteristics. You can look for nodes with a given tag, or for nodes that appear at a certain depth of the parse tree.

The two basic methods for searching are find() and findall().

Find single node - `tree.find()`

The tree.find() method returns the first node that matches the search string. It returns None if there is no matching node.

from xml.etree import ElementTree

with open('articles.xml') as file:
    tree = ElementTree.parse(file)

# './/author' matches author nodes at any depth below the root.
n = tree.find('.//author')
print(n.text)

John Doe
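Because find() returns None when nothing matches, accessing .text on the result directly can raise an AttributeError. The sketch below shows two ways to guard against that, assuming a small made-up document and a tag name (editor) that does not occur in it.

```python
import io
from xml.etree import ElementTree

# Hypothetical one-article document (assumption for illustration).
xml_text = "<articlelist><article><author>John Doe</author></article></articlelist>"
tree = ElementTree.parse(io.StringIO(xml_text))

# There is no <editor> element, so find() returns None.
node = tree.find('.//editor')
editor = node.text if node is not None else 'unknown'
print(editor)

# findtext() combines the lookup and the fallback in one call.
editor_text = tree.findtext('.//editor', default='unknown')
author_text = tree.findtext('.//author', default='unknown')
print(editor_text, author_text)
```

Compare with "is not None" rather than relying on truthiness; an Element's truth value has historically depended on whether it has children, so a leaf node that was found could still look false.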

Find all matching elements - `tree.findall()`

The tree.findall() method returns a list of all matching nodes for the given search string.

from xml.etree import ElementTree

with open('articles.xml') as file:
    tree = ElementTree.parse(file)

# Collect every author node in the document.
nodes = tree.findall('.//author')
print(nodes)

for n in nodes:
    print(n.text)

[<Element 'author' at 0x0000011425175FD0>, <Element 'author' at 0x0000011425176160>]
John Doe
Mary Smith
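The search string can do more than name a tag. As a rough sketch, the example below uses an attribute predicate from the limited XPath subset that ElementTree supports, and then searches relative to each article node. The xml_text string is a hypothetical, shortened version of articles.xml.

```python
import io
from xml.etree import ElementTree

# Hypothetical, shortened stand-in for articles.xml (assumption for illustration).
xml_text = """<articlelist>
  <article><author country="India">John Doe</author></article>
  <article><author country="Finland">Mary Smith</author></article>
</articlelist>"""
tree = ElementTree.parse(io.StringIO(xml_text))

# [@attr='value'] keeps only nodes whose attribute has the given value.
finnish_authors = tree.findall(".//author[@country='Finland']")
finnish_names = [a.text for a in finnish_authors]
print(finnish_names)

# A bare tag name matches direct children of the root only;
# each returned Element can then be searched on its own.
authors_per_article = [
    article.findtext('author') for article in tree.findall('article')
]
print(authors_per_article)
```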

A deeper look at nodes

The Element objects produced by methods like tree.iter() and tree.find() each represent a single node in the XML parse tree.

Element objects contain some useful attributes and methods for accessing and manipulating information of the represented xml element. We have already used some of the attributes such as text and tag.

The attrib dictionary of an Element object stores the attributes of the represented xml element.

from xml.etree import ElementTree

with open('articles.xml') as file:
    tree = ElementTree.parse(file)

n = tree.find('.//author')

print(n.attrib)                  # all attributes as a dictionary
print(n.text)                    # text inside the element
print(n.attrib.get('country'))   # a single attribute value

{'country': 'India'}
John Doe
India

You can use the tail attribute to get the text that comes after the closing tag of a given node.
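The difference between text and tail is easiest to see with mixed content. The snippet below is a small sketch using a made-up paragraph element; it builds the node with ElementTree.XML(), which is covered in the section on parsing strings.

```python
from xml.etree import ElementTree

# Hypothetical mixed-content snippet (assumption for illustration).
para = ElementTree.XML("<p>Hello <b>bold</b> world</p>")
bold = para.find('b')

# text: what sits between the opening tag and the first child.
print(repr(para.text))   # 'Hello '

# text of the child itself.
print(repr(bold.text))   # 'bold'

# tail: what follows the child's closing tag, up to the next tag.
print(repr(bold.tail))   # ' world'
```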

Parsing Strings

If the XML data is in the form of a string, we can parse it using the XML() function, which takes the XML string as an argument, parses it, and creates an Element object representation.

from xml.etree import ElementTree

xml_data = '''
<article>
  <author country = 'India'>John Doe</author>
  <datepublished>2024/04/05</datepublished>
  <title>Lorem ipsum dolor sit amet consectetur adipisicing elit</title>
  <content>Lorem ipsum dolor sit amet consectetur adipisicing elit. Maxime mollitia, molestiae quas vel sint commodi repudiandae consequuntur voluptatum laborum numquam blanditiis harum quisquam eius sed odit fugiat iusto fuga praesentium optio, eaque rerum! Provident similique accusantium nemo autem. </content>
</article>'''

# XML() returns the root Element of the parsed string.
article = ElementTree.XML(xml_data)

# findtext() returns the text of the first matching child.
print(article.findtext('author'))
print(article.findtext('datepublished'))
print(article.findtext('title'))

John Doe
2024/04/05
Lorem ipsum dolor sit amet consectetur adipisicing elit

Note that, unlike parse(), which returns an ElementTree instance, XML() returns an Element object.
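The following sketch contrasts the two return types and shows how to move between them. It assumes a tiny made-up document; getroot() gives the root Element of a parsed tree, and the ElementTree class can wrap an existing Element when a tree object is needed.

```python
import io
from xml.etree import ElementTree

# Hypothetical minimal document (assumption for illustration).
xml_data = "<article><title>Hello</title></article>"

element = ElementTree.XML(xml_data)                # Element
tree = ElementTree.parse(io.StringIO(xml_data))    # ElementTree

print(isinstance(element, ElementTree.Element))    # True
print(isinstance(tree, ElementTree.ElementTree))   # True

# From a tree to its root element.
root = tree.getroot()
print(root.tag == element.tag)                     # True

# From an element to a tree that wraps it.
wrapped = ElementTree.ElementTree(element)
print(wrapped.getroot() is element)                # True
```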
