Elementtree Tutorial

xml.etree.ElementTree — The ElementTree XML API …

Source code: Lib/xml/etree/
The xml.etree.ElementTree module implements a simple and efficient API
for parsing and creating XML data.
Changed in version 3.3: This module will use a fast implementation whenever available.
Deprecated since version 3.3: The xml.etree.cElementTree module is deprecated.
The xml.etree.ElementTree module is not secure against
maliciously constructed data. If you need to parse untrusted or
unauthenticated data see XML vulnerabilities.
This is a short tutorial for using xml.etree.ElementTree (ET in
short). The goal is to demonstrate some of the building blocks and basic
concepts of the module.
XML tree and elements
XML is an inherently hierarchical data format, and the most natural way to
represent it is with a tree. ET has two classes for this purpose –
ElementTree represents the whole XML document as a tree, and
Element represents a single node in this tree. Interactions with
the whole document (reading and writing to/from files) are usually done
on the ElementTree level. Interactions with a single XML element
and its sub-elements are done on the Element level.
Parsing XML
We’ll be using the following XML document as the sample data for this section:




We can import this data by reading from a file:
import xml.etree.ElementTree as ET
tree = ET.parse('country_data.xml')
root = tree.getroot()
Or directly from a string:
root = ET.fromstring(country_data_as_string)
fromstring() parses XML from a string directly into an Element,
which is the root element of the parsed tree. Other parsing functions may
create an ElementTree. Check the documentation to be sure.
As an Element, root has a tag and a dictionary of attributes:
It also has children nodes over which we can iterate:
>>> for child in root:
...     print(child.tag, child.attrib)
...
country {'name': 'Liechtenstein'}
country {'name': 'Singapore'}
country {'name': 'Panama'}
Children are nested, and we can access specific child nodes by index:
>>> root[0][1]
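As a minimal sketch of the steps above, the snippet below parses a small stand-in document from a string and inspects the root, its children, and one indexed grandchild. The `SAMPLE` text is an assumption made for illustration (it borrows two country names from the output above); it is not the tutorial's original data file.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the sample data; values are illustrative only.
SAMPLE = """<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
    </country>
</data>"""

root = ET.fromstring(SAMPLE)      # fromstring() returns the root Element

print(root.tag, root.attrib)      # tag name and attribute dict of the root
for child in root:                # iterating an Element yields direct children
    print(child.tag, child.attrib)

first_year = root[0][1]           # second child of the first country
print(first_year.tag, first_year.text)
```

Indexing works because an Element behaves like a sequence of its direct children.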
Not all elements of the XML input will end up as elements of the
parsed tree. Currently, this module skips over any XML comments,
processing instructions, and document type declarations in the
input. Nevertheless, trees built using this module’s API rather
than parsing from XML text can have comments and processing
instructions in them; they will be included when generating XML
output. A document type declaration may be accessed by passing a
custom TreeBuilder instance to the XMLParser constructor.
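The difference between the default behaviour and a custom TreeBuilder can be sketched as follows. This assumes Python 3.8 or later, where TreeBuilder accepts the `insert_comments` keyword; the `TEXT` document is a made-up example.

```python
import xml.etree.ElementTree as ET

TEXT = "<root><!-- a note --><item/></root>"

# Default parser: the comment is dropped from the tree.
plain = ET.fromstring(TEXT)

# A TreeBuilder created with insert_comments=True keeps comments in the tree.
parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
kept = ET.fromstring(TEXT, parser=parser)

print(len(plain), len(kept))                     # number of children in each tree
print(kept[0].tag is ET.Comment, kept[0].text)   # the first child is the comment
```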
Pull API for non-blocking parsing
Most parsing functions provided by this module require the whole document
to be read at once before returning any result. It is possible to use an
XMLParser and feed data into it incrementally, but it is a push API that
calls methods on a callback target, which is too low-level and inconvenient for
most needs. Sometimes what the user really wants is to be able to parse XML
incrementally, without blocking operations, while enjoying the convenience of
fully constructed Element objects.
The most powerful tool for doing this is XMLPullParser. It does not
require a blocking read to obtain the XML data, and is instead fed with data
incrementally with XMLPullParser.feed() calls. To get the parsed XML
elements, call XMLPullParser.read_events(). Here is an example:
>>> parser = ET.XMLPullParser(['start', 'end'])
>>> parser.feed('<mytag>sometext')
>>> list(parser.read_events())
[('start', <Element 'mytag' at 0x...>)]
>>> parser.feed(' more text</mytag>')
>>> for event, elem in parser.read_events():
...     print(event)
...     print(elem.tag, 'text=', elem.text)
...
The obvious use case is applications that operate in a non-blocking fashion
where the XML data is being received from a socket or read incrementally from
some storage device. In such cases, blocking reads are unacceptable.
Because it’s so flexible, XMLPullParser can be inconvenient to use for
simpler use-cases. If you don’t mind your application blocking on reading XML
data but would still like to have incremental parsing capabilities, take a look
at iterparse(). It can be useful when you’re reading a large XML document
and don’t want to hold it wholly in memory.
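A minimal sketch of the iterparse() approach is shown below. The in-memory `source` stands in for a large file and is purely illustrative; clearing each element after use is one common way to keep memory use low, not the only one.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical in-memory document standing in for a large file on disk.
source = io.BytesIO(b"<log><entry id='1'>a</entry><entry id='2'>b</entry></log>")

seen = []
for event, elem in ET.iterparse(source, events=("end",)):
    if elem.tag == "entry":
        seen.append((elem.get("id"), elem.text))
        elem.clear()   # drop the element's content once it has been handled

print(seen)
```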
Finding interesting elements
Element has some useful methods that help iterate recursively over all
the sub-tree below it (its children, their children, and so on). For example,
Element.iter():
>>> for neighbor in root.iter('neighbor'):
...     print(neighbor.attrib)
...
{'name': 'Austria', 'direction': 'E'}
{'name': 'Switzerland', 'direction': 'W'}
{'name': 'Malaysia', 'direction': 'N'}
{'name': 'Costa Rica', 'direction': 'W'}
{'name': 'Colombia', 'direction': 'E'}
Element.findall() finds only elements with a tag which are direct
children of the current element. Element.find() finds the first child
with a particular tag, and Element.text accesses the element’s text
content. Element.get() accesses the element’s attributes:
>>> for country in root.findall('country'):
...     rank = country.find('rank').text
...     name = country.get('name')
...     print(name, rank)
...
Liechtenstein 1
Singapore 4
Panama 68
More sophisticated specification of which elements to look for is possible by
using XPath.
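The following sketch shows a few of the XPath forms that ElementTree's limited XPath support accepts: an attribute predicate, a child-text predicate, and `..` to step up to a parent. The inline document is an assumed miniature of the country data, not the original sample.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature of the country data.
root = ET.fromstring("""<data>
  <country name="Liechtenstein"><rank>1</rank><neighbor name="Austria"/></country>
  <country name="Singapore"><rank>4</rank><neighbor name="Malaysia"/></country>
</data>""")

# Attribute predicate: countries whose name attribute matches.
by_name = root.findall("./country[@name='Singapore']")

# Child-text predicate: countries that have a <rank> child with text '1'.
by_rank = root.findall("./country[rank='1']")

# Descendant search, then '..' to select the parent of each match.
parents = root.findall(".//neighbor[@name='Austria']/..")

print([c.get("name") for c in by_name])
print([c.get("name") for c in by_rank])
print([c.get("name") for c in parents])
```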
Modifying an XML File
ElementTree provides a simple way to build XML documents and write them to files.
The ElementTree.write() method serves this purpose.
Once created, an Element object may be manipulated by directly changing
its fields (such as Element.text), adding and modifying attributes
(Element.set() method), as well as adding new children (for example
with Element.append()).
Let’s say we want to add one to each country’s rank, and add an updated
attribute to the rank element:
>>> for rank in root.iter('rank'):
...     new_rank = int(rank.text) + 1
...     rank.text = str(new_rank)
...     rank.set('updated', 'yes')
...
>>> tree.write('output.xml')
Our XML now looks like this:
We can remove elements using Element.remove(). Let’s say we want to
remove all countries with a rank higher than 50:
>>> for country in root.findall('country'):
...     # using root.findall() to avoid removal during traversal
...     rank = int(country.find('rank').text)
...     if rank > 50:
...         root.remove(country)
...
Note that concurrent modification while iterating can lead to problems,
just like when iterating and modifying Python lists or dicts.
Therefore, the example first collects all matching elements with
root.findall(), and only then iterates over the list of matches.
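Both modifications can be tried end to end on a small in-memory tree. This is a sketch under the assumption of a two-country document with made-up names `A` and `B`; it serialises to a string instead of writing a file.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-country document.
root = ET.fromstring(
    "<data>"
    "<country name='A'><rank>1</rank></country>"
    "<country name='B'><rank>68</rank></country>"
    "</data>"
)

# Add one to every rank and mark it as updated.
for rank in root.iter("rank"):
    rank.text = str(int(rank.text) + 1)
    rank.set("updated", "yes")

# findall() returns a list, so removing from root inside the loop is safe.
for country in root.findall("country"):
    if int(country.find("rank").text) > 50:
        root.remove(country)

print(ET.tostring(root, encoding="unicode"))
```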
Building XML documents
The SubElement() function also provides a convenient way to create new
sub-elements for a given element:
>>> a = ET.Element('a')
>>> b = ET.SubElement(a, 'b')
>>> c = ET.SubElement(a, 'c')
>>> d = ET.SubElement(c, 'd')
>>> ET.dump(a)

Parsing XML with Namespaces
If the XML input has namespaces, tags and attributes
with prefixes in the form prefix:sometag get expanded to
{uri}sometag where the prefix is replaced by the full URI.
Also, if there is a default namespace,
that full URI gets prepended to all of the non-prefixed tags.
Here is an XML example that incorporates two namespaces, one with the
prefix “fictional” and the other serving as the default namespace:
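A document of that shape, and one way to search it, might look like the sketch below. The namespace URIs, the element names, and the prefix names in the `ns` mapping are assumptions chosen for illustration; the prefixes used in a search mapping do not have to match those in the document.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: one prefixed namespace plus a default namespace.
TEXT = """<actors xmlns:fictional="http://characters.example.com"
        xmlns="http://people.example.com">
  <actor>
    <name>John Cleese</name>
    <fictional:character>Lancelot</fictional:character>
  </actor>
</actors>"""

root = ET.fromstring(TEXT)
print(root.tag)   # expanded form: {uri}localname

# Map search prefixes of our own choosing to the namespace URIs.
ns = {"real_person": "http://people.example.com",
      "role": "http://characters.example.com"}

pairs = []
for actor in root.findall("real_person:actor", ns):
    name = actor.find("real_person:name", ns)
    for char in actor.findall("role:character", ns):
        pairs.append((name.text, char.text))

print(pairs)
```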

By default, the href attribute is treated as a file name. You can use custom loaders to override this behaviour. Also note that the standard helper does not support XPointer syntax.
To process this file, load it as usual, and pass the root element to the xml.etree.ElementInclude module:
from xml.etree import ElementTree, ElementInclude
tree = ElementTree.parse("document.xml")
root = tree.getroot()
ElementInclude.include(root)
The ElementInclude module replaces the {http://www.w3.org/2001/XInclude}include element with the root element from the included document. The result might look something like this:
This is a paragraph.
If the parse attribute is omitted, it defaults to “xml”. The href attribute is required.
To include a text document, use the {http://www.w3.org/2001/XInclude}include element, and set the parse attribute to “text”:
Copyright (c) .
The result might look something like:
Copyright (c) 2003.
xml.etree.ElementInclude.default_loader(href, parse, encoding=None)
Default loader. This default loader reads an included resource from disk. href is a URL.
parse is the parse mode, either “xml” or “text”. encoding
is an optional text encoding. If not given, encoding is utf-8. Returns the
expanded resource. If the parse mode is “xml”, this is an ElementTree
instance. If the parse mode is “text”, this is a Unicode string. If the
loader fails, it can return None or raise an exception.
xml.etree.ElementInclude.include(elem, loader=None, base_url=None, max_depth=6)
This function expands XInclude directives. elem is the root element. loader is
an optional resource loader. If omitted, it defaults to default_loader().
If given, it should be a callable that implements the same interface as
default_loader(). base_url is base URL of the original file, to resolve
relative include file references. max_depth is the maximum number of recursive
inclusions. Limited to reduce the risk of malicious content explosion. Pass a
negative value to disable the limitation.
Returns the expanded resource. If the parse mode is
“xml”, this is an ElementTree instance. If the parse mode is “text”,
this is a Unicode string. If the loader fails, it can return None or
raise an exception.
New in version 3.9: The base_url and max_depth parameters.
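A custom loader can be sketched as below. The `RESOURCES` dictionary and the `memory_loader` name are hypothetical; the loader follows the `default_loader()` calling convention and, for "xml" mode, returns a root Element because `include()` places the loader's result directly into the tree.

```python
import xml.etree.ElementTree as ET
from xml.etree import ElementInclude

# Hypothetical in-memory resources standing in for files on disk.
RESOURCES = {
    "source.xml": "<para>This is a paragraph.</para>",
    "year.txt": "2003",
}

def memory_loader(href, parse, encoding=None):
    """Same call signature as default_loader(), but reads from RESOURCES."""
    data = RESOURCES[href]
    return ET.fromstring(data) if parse == "xml" else data

root = ET.fromstring(
    '<document xmlns:xi="http://www.w3.org/2001/XInclude">'
    '<xi:include href="source.xml" parse="xml"/>'
    '<footer>Copyright (c) <xi:include href="year.txt" parse="text"/>.</footer>'
    '</document>'
)

ElementInclude.include(root, loader=memory_loader)   # expands in place
print(ET.tostring(root, encoding="unicode"))
```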
Element Objects
class xml.etree.ElementTree.Element(tag, attrib={}, **extra)
Element class. This class defines the Element interface, and provides a
reference implementation of this interface.
The element name, attribute names, and attribute values can be either
bytestrings or Unicode strings. tag is the element name. attrib is
an optional dictionary, containing element attributes. extra contains
additional attributes, given as keyword arguments.
tag
A string identifying what kind of data this element represents (the
element type, in other words).
text, tail
These attributes can be used to hold additional data associated with
the element. Their values are usually strings but may be any
application-specific object. If the element is created from
an XML file, the text attribute holds either the text between
the element’s start tag and its first child or end tag, or None, and
the tail attribute holds either the text between the element’s
end tag and the next tag, or None. For the XML data
<a><b>1<c>2<d/>3</c></b>4</a>
the a element has None for both text and tail attributes,
the b element has text “1” and tail “4”,
the c element has text “2” and tail None,
and the d element has text None and tail “3”.
To collect the inner text of an element, see itertext(), for
example "".join(element.itertext()).
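The text and tail values described above can be checked directly. The sketch below builds a four-element document whose shape is inferred from that description and prints both attributes for each element.

```python
import xml.etree.ElementTree as ET

# Document shape inferred from the text/tail description above.
a = ET.fromstring("<a><b>1<c>2<d/>3</c></b>4</a>")
b = a[0]
c = b[0]
d = c[0]

for elem in (a, b, c, d):
    print(elem.tag, repr(elem.text), repr(elem.tail))

# itertext() walks text and tail values in document order.
inner = "".join(a.itertext())
print(inner)
```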
Applications may store arbitrary objects in these attributes.
attrib
A dictionary containing the element’s attributes. Note that while the
attrib value is always a real mutable Python dictionary, an ElementTree
implementation may choose to use another internal representation, and
create the dictionary only if someone asks for it. To take advantage of
such implementations, use the dictionary methods below whenever possible.
The following dictionary-like methods work on the element attributes.
clear()
Resets an element. This function removes all subelements, clears all
attributes, and sets the text and tail attributes to None.
get(key, default=None)
Gets the element attribute named key.
Returns the attribute value, or default if the attribute was not found.
items()
Returns the element attributes as a sequence of (name, value) pairs. The
attributes are returned in an arbitrary order.
keys()
Returns the element’s attribute names as a list. The names are returned
in an arbitrary order.
set(key, value)
Set the attribute key on the element to value.
The following methods work on the element’s children (subelements).
append(subelement)
Adds the element subelement to the end of this element’s internal list
of subelements. Raises TypeError if subelement is not an Element.
extend(subelements)
Appends subelements from a sequence object with zero or more elements.
Raises TypeError if a subelement is not an Element.
find(match, namespaces=None)
Finds the first subelement matching match. match may be a tag name
or a path. Returns an element instance
or None. namespaces is an optional mapping from namespace prefix
to full name. Pass '' as prefix to move all unprefixed tag names
in the expression into the given namespace.
findall(match, namespaces=None)
Finds all matching subelements, by tag name or
path. Returns a list containing all matching
elements in document order. namespaces is an optional mapping from
namespace prefix to full name. Pass '' as prefix to move all
unprefixed tag names in the expression into the given namespace.
findtext(match, default=None, namespaces=None)
Finds text for the first subelement matching match. match may be
a tag name or a path. Returns the text content
of the first matching element, or default if no element was found.
Note that if the matching element has no text content an empty string
is returned. namespaces is an optional mapping from namespace prefix
to full name. Pass '' as prefix to move all unprefixed tag names
in the expression into the given namespace.
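The different return values of find(), findall() and findtext() are easy to mix up, so here is a short sketch on a made-up document. Note the difference between an element that exists without text (empty string) and one that is missing (the default).

```python
import xml.etree.ElementTree as ET

# Hypothetical document with a repeated tag and an empty element.
root = ET.fromstring("<doc><title>Intro</title><note/><title>More</title></doc>")

first = root.find("title")                        # first match, or None
titles = [t.text for t in root.findall("title")]  # all matches, document order
empty = root.findtext("note")                     # element exists, no text
missing = root.findtext("author", default="unknown")  # no such element

print(first.text, titles, repr(empty), missing)
```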
insert(index, subelement)
Inserts subelement at the given position in this element. Raises
TypeError if subelement is not an Element.
iter(tag=None)
Creates a tree iterator with the current element as the root.
The iterator iterates over this element and all elements below it, in
document (depth first) order. If tag is not None or '*', only
elements whose tag equals tag are returned from the iterator. If the
tree structure is modified during iteration, the result is undefined.
iterfind(match, namespaces=None)
Finds all matching subelements, by tag name or
path. Returns an iterable yielding all
matching elements in document order. namespaces is an optional mapping
from namespace prefix to full name.
itertext()
Creates a text iterator. The iterator loops over this element and all
subelements, in document order, and returns all inner text.
makeelement(tag, attrib)
Creates a new element object of the same type as this element. Do not
call this method, use the SubElement() factory function instead.
remove(subelement)
Removes subelement from the element. Unlike the find* methods this
method compares elements based on the instance identity, not on tag value
or contents.
Element objects also support the following sequence type methods
for working with subelements: __delitem__(),
__getitem__(), __setitem__(), __len__().
Caution: Elements with no subelements will test as False. This behavior
will change in future versions. Use specific len(elem) or elem is
None test instead.
element = root.find('foo')
if not element:  # careful!
    print("element not found, or element has no subelements")
if element is None:
    print("element not found")
Prior to Python 3.8, the serialisation order of the XML attributes of
elements was artificially made predictable by sorting the attributes by
their name. Based on the now guaranteed ordering of dicts, this arbitrary
reordering was removed in Python 3.8 to preserve the order in which
attributes were originally parsed or created by user code.
In general, user code should try not to depend on a specific ordering of
attributes, given that the XML Information Set explicitly excludes the attribute
order from conveying information. Code should be prepared to deal with
any ordering on input. In cases where deterministic XML output is required,
e.g. for cryptographic signing or test data sets, canonical serialisation
is available with the canonicalize() function.
In cases where canonical output is not applicable but a specific attribute
order is still desirable on output, code should aim for creating the
attributes directly in the desired order, to avoid perceptual mismatches
for readers of the code. In cases where this is difficult to achieve, a
recipe like the following can be applied prior to serialisation to enforce
an order independently from the Element creation:
def reorder_attributes(root):
    for el in root.iter():
        attrib = el.attrib
        if len(attrib) > 1:
            # adjust attribute order, e.g. by sorting
            attribs = sorted(attrib.items())
            attrib.clear()
            attrib.update(attribs)
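A usage sketch for such a recipe follows. The helper name `sort_attributes` and the sample element are hypothetical; the sketch assumes Python 3.8 or later, where serialisation keeps the attribute dictionary's order.

```python
import xml.etree.ElementTree as ET

def sort_attributes(root):
    """Rewrite each element's attribute dict in sorted-name order."""
    for el in root.iter():
        if len(el.attrib) > 1:
            ordered = sorted(el.attrib.items())
            el.attrib.clear()
            el.attrib.update(ordered)

# Hypothetical element created with attributes in non-alphabetical order.
root = ET.Element("item", {"z": "1", "a": "2"})
sort_attributes(root)

print(list(root.attrib))
print(ET.tostring(root, encoding="unicode"))
```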
ElementTree Objects
class xml.etree.ElementTree.ElementTree(element=None, file=None)
ElementTree wrapper class. This class represents an entire element
hierarchy, and adds some extra support for serialization to and from
standard XML.
element is the root element. The tree is initialized with the contents
of the XML file if given.
_setroot(element)
Replaces the root element for this tree. This discards the current
contents of the tree, and replaces it with the given element. Use with
care. element is an element instance.
find(match, namespaces=None)
Same as Element.find(), starting at the root of the tree.
findall(match, namespaces=None)
Same as Element.findall(), starting at the root of the tree.
findtext(match, default=None, namespaces=None)
Same as Element.findtext(), starting at the root of the tree.
getroot()
Returns the root element for this tree.
iter(tag=None)
Creates and returns a tree iterator for the root element. The iterator
loops over all elements in this tree, in section order. tag is the tag
to look for (default is to return all elements).
iterfind(match, namespaces=None)
Same as Element.iterfind(), starting at the root of the tree.
parse(source, parser=None)
Loads an external XML section into this element tree. source is a file
name or file object. parser is an optional parser instance.
If not given, the standard XMLParser parser is used. Returns the
section root element.
write(file, encoding="us-ascii", xml_declaration=None, default_namespace=None, method="xml", *, short_empty_elements=True)
Writes the element tree to a file, as XML. file is a file name, or a
file object opened for writing. encoding is the output
encoding (default is US-ASCII).
xml_declaration controls if an XML declaration should be added to the
file. Use False for never, True for always, None
for only if not US-ASCII or UTF-8 or Unicode (default is None).
default_namespace sets the default XML namespace (for “xmlns”).
method is either “xml”, “html” or “text” (default is “xml”).
The keyword-only short_empty_elements parameter controls the formatting
of elements that contain no content. If True (the default), they are
emitted as a single self-closed tag, otherwise they are emitted as a pair
of start/end tags.
The output is either a string (str) or binary (bytes).
This is controlled by the encoding argument. If encoding is
“unicode”, the output is a string; otherwise, it’s binary. Note that
this may conflict with the type of file if it’s an open
file object; make sure you do not try to write a string to a
binary stream and vice versa.
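The interaction between encoding and the type of the target stream can be sketched with in-memory streams. The tree below is a made-up two-element example; `io.BytesIO` and `io.StringIO` stand in for a binary and a text file object.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical tree with one empty child element.
root = ET.Element("root")
ET.SubElement(root, "empty")
tree = ET.ElementTree(root)

# A byte encoding needs a binary stream; the declaration is forced on here.
binary = io.BytesIO()
tree.write(binary, encoding="utf-8", xml_declaration=True)

# encoding="unicode" produces str, so it needs a text stream.
text = io.StringIO()
tree.write(text, encoding="unicode", short_empty_elements=False)

print(binary.getvalue())
print(text.getvalue())
```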
Changed in version 3.8: The write() method now preserves the attribute order specified by the user.
This is the XML file that is going to be manipulated:

Example page

Moved to .

Example of changing the attribute “target” of every link in first paragraph:
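>>> from xml.etree.ElementTree import ElementTree
>>> tree = ElementTree()
>>> tree.parse("index.xhtml")
<Element 'html' at 0x...>
>>> p = tree.find("body/p")     # Finds first occurrence of tag p in body
>>> p
<Element 'p' at 0x...>
>>> links = list(p.iter("a"))   # Returns list of all links
>>> links
[<Element 'a' at 0x...>, <Element 'a' at 0x...>]
>>> for i in links:             # Iterates through all found links
...     i.attrib["target"] = "blank"
...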
QName Objects
class xml.etree.ElementTree.QName(text_or_uri, tag=None)
QName wrapper. This can be used to wrap a QName attribute value, in order
to get proper namespace handling on output. text_or_uri is a string
containing the QName value, in the form {uri}local, or, if the tag argument
is given, the URI part of a QName. If tag is given, the first argument is
interpreted as a URI, and this argument is interpreted as a local name.
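Both constructor forms, and the effect of a QName attribute value on output, can be sketched as follows. The namespace URI and the names `string`, `value` and `kind` are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

# Two equivalent ways to build the same qualified name (hypothetical URI).
q1 = ET.QName("http://example.com/ns", "string")
q2 = ET.QName("{http://example.com/ns}string")
print(q1.text, q2.text)

# Wrapping an attribute value in QName lets the serialiser declare a
# prefix for the namespace and write the value using that prefix.
elem = ET.Element("value")
elem.set("kind", q1)
xml_text = ET.tostring(elem, encoding="unicode")
print(xml_text)
```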
Python XML with ElementTree: Beginner's Guide - DataCamp

As a data scientist, you’ll find that understanding XML is powerful for both web-scraping and general practice in parsing a structured document.
In this tutorial, you’ll cover the following topics:
You’ll learn more about XML and you’ll get introduced to the Python ElementTree package.
Then, you’ll discover how you can explore XML trees to understand the data that you’re working with better with the help of ElementTree functions, for loops and XPath expressions.
Next, you’ll learn how you can modify an XML file; and
You’ll use XPath expressions to populate XML files.
What is XML?
XML stands for “Extensible Markup Language”. It is mainly used in webpages, where the data has a specific structure and is understood dynamically by the XML framework.
XML creates a tree-like structure that is easy to interpret and supports a hierarchy. Whenever a page follows XML, it can be called an XML document.
XML documents have sections, called elements, defined by a beginning and an ending tag. A tag is a markup construct that begins with < and ends with >. The characters between the start-tag and end-tag, if there are any, are the element’s content. Elements can contain markup, including other elements, which are called “child elements”.
The largest, top-level element is called the root, which contains all other elements.
Attributes are name–value pairs that exist within a start-tag or empty-element tag. An XML attribute can only have a single value and each attribute can appear at most once on each element.
To understand this a little bit better, take a look at the following (shortened) XML file:

DVD 1981

‘Archaeologist and adventurer Indiana Jones
is hired by the U. S. government to find the Ark of the
Covenant before the Nazis. ‘

DVD, Online 1984
None provided.

Blu-ray 1985
Marty McFly

dvd, digital 2000
Two mutants come to a private academy for their kind whose resident superhero team must
oppose a terrorist organization with similar powers.

VHS 1992

Online R
WhAtEvER I Want!!!?!

DVD 1979

Funny movie about a funny guy

blue-ray Unrated
psychopathic Bateman
From what you have read above, you see that <collection> is the single root element: it contains all the other elements, such as <genre> or <movie>, which are the child elements or subelements. As you can see, these elements are nested.
Note that these child elements can also act as parents and contain their own child elements, which are then called “sub-child elements”.
You’ll see that, for example, the <movie> element contains a couple of “attributes”, such as favorite and title, that give even more information!
With this short intro to XML files in mind, you’re ready to learn more about ElementTree!
Introduction to ElementTree
The XML tree structure makes navigation, modification, and removal relatively simple programmatically. Python has a built-in library, ElementTree, that has functions to read and manipulate XMLs (and other similarly structured files).
First, import ElementTree. It’s a common practice to use the alias of ET:
import xml.etree.ElementTree as ET
Parsing XML Data
In the XML file provided, there is a basic collection of movies described. The only problem is the data is a mess! There have been a lot of different curators of this collection and everyone has their own way of entering data into the file. The main goal in this tutorial will be to read and understand the file with Python – then fix the problems.
First you need to read in the file with ElementTree.
tree = ET.parse('movies.xml')
root = tree.getroot()
Now that you have initialized the tree, you should look at the XML and print out values in order to understand how the tree is structured.
Every part of a tree (root included) has a tag that describes the element. In addition, as you have seen in the introduction, elements might have attributes, which are additional descriptors, used especially for repeated tag usage. Attributes also help to validate values entered for that tag, once again contributing to the structured format of an XML.
You’ll see later on in this tutorial that attributes can be pretty powerful when included in an XML!
At the top level, you see that this XML is rooted in the collection tag.
So the root has no attributes.
For Loops
You can easily iterate over subelements (commonly called “children”) in the root by using a simple “for” loop.
for child in root:
    print(child.tag, child.attrib)
genre {'category': 'Action'}
genre {'category': 'Thriller'}
genre {'category': 'Comedy'}
Now you know that the children of the root collection are all genre. To designate the genre, the XML uses the attribute category. There are Action, Thriller, and Comedy movies according to the genre element.
Typically it is helpful to know all the elements in the entire tree. One useful function for doing that is root.iter(). You can put this function into a “for” loop and it will iterate over the entire tree.
[elem.tag for elem in root.iter()]
This gives a general notion for how many elements you have, but it does not show the attributes or levels in the tree.
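To get a quick count per tag rather than a flat list, the same iteration can feed a Counter. The document below is a hypothetical miniature of the movie collection, reduced to two movies for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical miniature of the movie collection.
root = ET.fromstring("""<collection>
  <genre category="Action">
    <decade years="1980s">
      <movie favorite="True" title="THE KARATE KID"><year>1984</year></movie>
    </decade>
  </genre>
  <genre category="Comedy">
    <decade years="1960s">
      <movie favorite="False" title="Batman: The Movie"><year>1966</year></movie>
    </decade>
  </genre>
</collection>""")

tags = [elem.tag for elem in root.iter()]   # document order, root included
counts = Counter(tags)                      # how often each tag occurs
print(counts)
```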
There is a helpful way to see the whole document. ElementTree provides a tostring() function. If you pass the root into tostring(), it returns the whole document. Within ElementTree (remember, aliased as ET), the call takes a slightly strange form.
Because tostring() returns bytes when you give it an encoding, you must specify both the encoding and the decoding of the document you are displaying as a string. For XMLs, use 'utf8', the typical encoding for an XML document.
print(ET.tostring(root, encoding='utf8').decode('utf8'))

DVD, VHS 1966
What a joke!

Emma Stone = Hester Prynne

DVD, digital, Netflix 2011
Tim (Rudd) is a rising executive
who “succeeds” in finding the perfect guest,
IRS employee Barry (Carell), for his boss’ monthly event,
a so-called “dinner for idiots, ” which offers certain
advantages to the exec who shows up with the biggest buffoon.

Online, VHS Who ya gonna call?

Blu_Ray 1991
Robin Hood slaying
You can expand the use of the iter() function to help with finding particular elements of interest. root.iter() will list all subelements under the root that match the element specified. Here, you will list all attributes of the movie element in the tree:
for movie in root.iter('movie'):
    print(movie.attrib)
{'favorite': 'True', 'title': 'Indiana Jones: The raiders of the lost Ark'}
{'favorite': 'True', 'title': 'THE KARATE KID'}
{'favorite': 'False', 'title': 'Back 2 the Future'}
{'favorite': 'False', 'title': 'X-Men'}
{'favorite': 'True', 'title': 'Batman Returns'}
{'favorite': 'False', 'title': 'Reservoir Dogs'}
{'favorite': 'False', 'title': 'ALIEN'}
{'favorite': 'True', 'title': "Ferris Bueller's Day Off"}
{'favorite': 'FALSE', 'title': 'American Psycho'}
{'favorite': 'False', 'title': 'Batman: The Movie'}
{'favorite': 'True', 'title': 'Easy A'}
{'favorite': 'True', 'title': 'Dinner for SCHMUCKS'}
{'favorite': 'False', 'title': 'Ghostbusters'}
{'favorite': 'True', 'title': 'Robin Hood: Prince of Thieves'}
You can already see how the movies have been entered in different ways. Don’t worry about that for now, you’ll get a chance to fix one of the errors later on in this tutorial.
XPath Expressions
Many times elements will not have attributes; they will only have text content. Using the .text attribute, you can print out this content.
Now, print out all the descriptions of the movies.
for description in root.iter('description'):
    print(description.text)
None provided.
Marty McFly
Two mutants come to a private academy for their kind whose resident superhero team must
oppose a terrorist organization with similar powers.
WhAtEvER I Want!!!?!
Funny movie about a funny guy
psychopathic Bateman
What a joke!
Emma Stone = Hester Prynne
Tim (Rudd) is a rising executive
Who ya gonna call?
Robin Hood slaying
Printing out the XML is helpful, but XPath is a query language used to search through an XML quickly and easily. XPath stands for XML Path Language and uses, as the name suggests, a “path like” syntax to identify and navigate nodes in an XML document.
Understanding XPath is critically important to scanning and populating XMLs. ElementTree has a .findall() function that will traverse the immediate children of the referenced element. You can use XPath expressions to specify more useful searches.
Here, you will search the tree for movies that came out in 1992:
for movie in root.findall("./genre/decade/movie/[year='1992']"):
    print(movie.attrib)
The function .findall() always begins at the element specified. This type of function is extremely powerful for a “find and replace”. You can even search on attributes!
Now, print out only the movies that are available in multiple formats (an attribute).
for movie in root.findall("./genre/decade/movie/format/[@multiple='Yes']"):
    print(movie.attrib)
{'multiple': 'Yes'}
Brainstorm why, in this case, the print statement returns the “Yes” values of multiple. Think about how the “for” loop is defined. Could you rewrite this loop to print out the movie titles instead? Try it below:
Tip: use '...' inside of XPath to return the parent element of the current element.
for movie in root.findall("./genre/decade/movie/format[@multiple='Yes']..."):
    print(movie.attrib)
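The parent lookup can also be written with an explicit `/..` step, which is the form described in the standard library's XPath support. The sketch below uses a hypothetical two-movie miniature of the collection and collects titles instead of printing attributes.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature of the movie collection.
root = ET.fromstring("""<collection>
  <genre category="Action">
    <decade years="1980s">
      <movie favorite="True" title="THE KARATE KID">
        <format multiple="Yes">DVD, Online</format>
      </movie>
      <movie favorite="False" title="Back 2 the Future">
        <format multiple="False">Blu-ray</format>
      </movie>
    </decade>
  </genre>
</collection>""")

# Select matching <format> elements, then step up to their parent <movie>.
multi = root.findall("./genre/decade/movie/format[@multiple='Yes']/..")
titles = [movie.get("title") for movie in multi]
print(titles)
```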
Modifying an XML
Earlier, the movie titles were an absolute mess. Now, print them out again:
Fix the ‘2’ in Back 2 the Future. That should be a find and replace problem. Write code to find the title ‘Back 2 the Future’ and save it as a variable:
b2tf = root.find("./genre/decade/movie[@title='Back 2 the Future']")
print(b2tf)

Notice that using the .find() method returns an element of the tree. Much of the time, it is more useful to edit the content within an element.
Modify the title attribute of the Back 2 the Future element variable to read “Back to the Future”. Then, print out the attributes of your variable to see your change. You can easily do this by accessing the attrib attribute of an element and then assigning a new value to it:
b2tf.attrib["title"] = "Back to the Future"
print(b2tf.attrib)
{'favorite': 'False', 'title': 'Back to the Future'}
Write out your changes back to the XML so they are permanently fixed in the document. Print out your movie attributes again to make sure your changes worked. Use the .write() method to do this:
tree.write("movies.xml")
Fixing Attributes
The multiple attribute is incorrect in some places. Use ElementTree to fix the designator based on how many formats the movie comes in. First, print the format attribute and text to see which parts need to be fixed.
for form in root.findall("./genre/decade/movie/format"):
    print(form.attrib, form.text)
{'multiple': 'No'} DVD
{'multiple': 'Yes'} DVD, Online
{'multiple': 'False'} Blu-ray
{'multiple': 'Yes'} dvd, digital
{'multiple': 'No'} VHS
{'multiple': 'No'} Online
{'multiple': 'Yes'} DVD
{'multiple': 'No'} blue-ray
{'multiple': 'Yes'} DVD, VHS
{'multiple': 'Yes'} DVD, digital, Netflix
{'multiple': 'No'} Online, VHS
{'multiple': 'No'} Blu_Ray
There is some work that needs to be done on this tag.
You can use regex to find commas; that will tell whether the multiple attribute should be “Yes” or “No”. Adding and modifying attributes can be done easily with the .set() method.
Note: re is the standard regex interpreter for Python. If you want to know more about regular expressions, consider this tutorial.
import re

for form in root.findall("./genre/decade/movie/format"):
    # Search for the commas in the format text
    match = re.search(',', form.text)
    if match:
        form.set('multiple', 'Yes')
    else:
        form.set('multiple', 'No')

# Write out the tree to the file again
tree.write("movies.xml")
{'multiple': 'No'} Blu-ray
{'multiple': 'Yes'} Online, VHS
Moving Elements
Some of the data has been placed in the wrong decade. Use what you have learned about XML and ElementTree to find and fix the decade data errors.
It will be useful to print out both the decade tags and the year tags throughout the document.
for decade in root.findall("./genre/decade"):
    print(decade.attrib)
    for year in decade.findall("./movie/year"):
        print(year.text, '\n')
{'years': '1980s'}
{'years': '1990s'}
{'years': '1970s'}
{'years': '1960s'}
{'years': '2010s'}
The two years that are in the wrong decade are the movies from the 2000s. Figure out what those movies are, using an XPath expression.
for movie in root.findall("./genre/decade/movie/[year='2000']"):
    print(movie.attrib)
You have to add a new decade tag, the 2000s, to the Action genre in order to move the X-Men data. The ET.SubElement() function can be used to add this tag to the end of the XML.
action = root.find("./genre[@category='Action']")
new_dec = ET.SubElement(action, 'decade')
new_dec.attrib["years"] = '2000s'
print(ET.tostring(action, encoding='utf8').decode('utf8'))


Now append the X-Men movie to the 2000s and remove it from the 1990s, using .append() and .remove(), respectively.
xmen = root.find("./genre/decade/movie[@title='X-Men']")
dec2000s = root.find("./genre[@category='Action']/decade[@years='2000s']")
dec2000s.append(xmen)
dec1990s = root.find("./genre[@category='Action']/decade[@years='1990s']")
dec1990s.remove(xmen)
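The whole move can be sketched on a self-contained tree. The two-movie document below is a hypothetical miniature of the collection; the sketch removes the element from its old parent before appending it to the new one, so it never sits under two parents at once.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature: X-Men (2000) filed under the 1990s by mistake.
root = ET.fromstring("""<collection>
  <genre category="Action">
    <decade years="1990s">
      <movie favorite="False" title="X-Men"><year>2000</year></movie>
      <movie favorite="True" title="Batman Returns"><year>1992</year></movie>
    </decade>
  </genre>
</collection>""")

action = root.find("./genre[@category='Action']")
dec1990s = action.find("./decade[@years='1990s']")

# Create the missing decade, then move the movie into it.
dec2000s = ET.SubElement(action, "decade", {"years": "2000s"})
xmen = dec1990s.find("./movie[@title='X-Men']")
dec1990s.remove(xmen)
dec2000s.append(xmen)

print([m.get("title") for m in dec1990s])
print([m.get("title") for m in dec2000s])
```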

Build XML Documents
Nice, so you were able to essentially move an entire movie to a new decade. Save your changes back to the XML.
Conclusion
There are some key things to remember about XMLs and using ElementTree.
Tags build the tree structure and designate what values should be delineated there. Using smart structuring can make it easy to read and write to an XML. Tags always need opening and closing brackets to show the parent and children relationships.
Attributes further describe how to validate a tag or allow for boolean designations. Attributes typically take very specific values so that the XML parser (and the user) can use the attributes to check the tag values.
ElementTree is an important Python library that allows you to parse and navigate an XML document. Using ElementTree breaks down the XML document in a tree structure that is easy to work with. When in doubt, print it out with print(ET.tostring(root, encoding='utf8').decode('utf8')). Use this helpful print statement to view the entire XML document at once. It helps to check when editing, adding, or removing from an XML.
Now you are equipped to understand XML and begin parsing!
The lxml.etree Tutorial

Stefan Behnel
This is a tutorial on XML processing with lxml.etree. It briefly
overviews the main concepts of the ElementTree API, and some simple
enhancements that make your life as a programmer easier.
For a complete reference of the API, see the generated API documentation.
The Element class
Elements are lists
Elements carry attributes as a dict
Elements contain text
Using XPath to find text
Tree iteration
The ElementTree class
Parsing from strings and files
The fromstring() function
The XML() function
The parse() function
Parser objects
Incremental parsing
Event-driven parsing
The E-factory
A common way to import lxml.etree is as follows:
>>> from lxml import etree
If your code only uses the ElementTree API and does not rely on any
functionality that is specific to lxml.etree, you can also use (any part
of) the following import chain as a fall-back to the original ElementTree:
try:
    from lxml import etree
    print("running with lxml.etree")
except ImportError:
    try:
        # Python 2.5
        import xml.etree.cElementTree as etree
        print("running with cElementTree on Python 2.5+")
    except ImportError:
        try:
            # Python 2.5
            import xml.etree.ElementTree as etree
            print("running with ElementTree on Python 2.5+")
        except ImportError:
            try:
                # normal cElementTree install
                import cElementTree as etree
                print("running with cElementTree")
            except ImportError:
                try:
                    # normal ElementTree install
                    import elementtree.ElementTree as etree
                    print("running with ElementTree")
                except ImportError:
                    print("Failed to import ElementTree from any known place")
To aid in writing portable code, this tutorial makes it clear in the examples
which part of the presented API is an extension of lxml.etree over the
original ElementTree API, as defined by Fredrik Lundh's ElementTree library.
The Element class
An Element is the main container object for the ElementTree API. Most of
the XML tree functionality is accessed through this class. Elements are
easily created through the Element factory:
>>> root = etree.Element("root")
The XML tag name of elements is accessed through the tag property:
>>> print(root.tag)
root
Elements are organised in an XML tree structure. To create child elements and
add them to a parent element, you can use the append() method:
>>> root.append( etree.Element("child1") )
However, this is so common that there is a shorter and much more efficient way
to do this: the SubElement factory. It accepts the same arguments as the
Element factory, but additionally requires the parent as first argument:
>>> child2 = etree.SubElement(root, "child2")
>>> child3 = etree.SubElement(root, "child3")
To see that this is really XML, you can serialise the tree you have created:
>>> print(etree.tostring(root, pretty_print=True))
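The two ways of adding children can be compared side by side. This is a sketch that assumes the standard-library xml.etree.ElementTree as a stand-in for lxml.etree; Element(), append(), SubElement() and tostring() belong to the shared ElementTree API, although the exact serialised text may differ slightly between the two libraries.

```python
import xml.etree.ElementTree as etree  # stdlib stand-in for lxml.etree (assumption)

root = etree.Element("root")

# Variant 1: create the child first, then append it.
root.append(etree.Element("child1"))

# Variant 2: SubElement creates the child and attaches it in one call.
child2 = etree.SubElement(root, "child2")
child3 = etree.SubElement(root, "child3")

# encoding="unicode" returns a str instead of bytes.
xml_text = etree.tostring(root, encoding="unicode")
print(xml_text)
```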

Elements are lists
To make the access to these subelements easy and straightforward,
elements mimic the behaviour of normal Python lists as closely as
possible:
>>> child = root[0]
>>> print(child.tag)
child1
>>> print(len(root))
3
>>> root.index(root[1]) # lxml.etree only!
1
>>> children = list(root)
>>> for child in root:
...     print(child.tag)
child1
child2
child3
>>> root.insert(0, etree.Element("child0"))
>>> start = root[:1]
>>> end = root[-1:]
>>> print(start[0].tag)
child0
>>> print(end[0].tag)
child3
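The list-like operations above can be collected into one runnable sketch. It assumes the standard-library xml.etree.ElementTree as a stand-in for lxml.etree and only uses indexing, len(), insert() and slicing, which both libraries support.

```python
import xml.etree.ElementTree as etree  # stdlib stand-in for lxml.etree (assumption)

root = etree.Element("root")
for name in ("child1", "child2", "child3"):
    etree.SubElement(root, name)

first = root[0]          # index access, like a list
count = len(root)        # number of direct children

root.insert(0, etree.Element("child0"))  # insert at the front

tags = [child.tag for child in root]     # iteration yields the children
head = root[:1]                          # slices return plain lists of Elements
last = root[-1:]
```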
Prior to ElementTree 1.3 and lxml 2.0, you could also check the truth value of
an Element to see if it has children, i.e. if the list of children is empty:
if root: # this no longer works!
    print("The root element has children")
This is no longer supported as people tend to expect that a "something"
evaluates to True and expect Elements to be "something", whether they have
children or not. So, many users find it surprising that any Element
would evaluate to False in an if-statement like the above. Instead,
use len(element), which is both more explicit and less error prone.
>>> print(etree.iselement(root)) # test if it's some kind of Element
True
>>> if len(root): # test if it has children
...     print("The root element has children")
The root element has children
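The two separate questions, "is this an Element?" and "does it have children?", can be kept apart explicitly. This is a sketch assuming the standard-library xml.etree.ElementTree; has_children() is a hypothetical helper name, not part of either library.

```python
import xml.etree.ElementTree as etree  # stdlib stand-in for lxml.etree (assumption)

root = etree.Element("root")
leaf = etree.SubElement(root, "leaf")

def has_children(element):
    """Hypothetical helper: explicit child test instead of truth-testing."""
    return len(element) > 0

# find() returns None when nothing matches, so compare against None
# rather than relying on the truth value of the result.
missing = root.find("no-such-tag")
found = root.find("leaf")
```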
There is another important case where the behaviour of Elements in lxml
(in 2.0 and later) deviates from that of lists and from that of the
original ElementTree (prior to version 1.3 or Python 2.7/3.2):
>>> root[0] = root[-1] # this moves the element in lxml.etree!
In this example, the last element is moved to a different position,
instead of being copied, i.e. it is automatically removed from its
previous position when it is put in a different place. In lists,
objects can appear in multiple positions at the same time, and the
above assignment would just copy the item reference into the first
position, so that both contain the exact same item:
>>> l = [0, 1, 2, 3]
>>> l[0] = l[-1]
>>> l
[3, 1, 2, 3]
Note that in the original ElementTree, a single Element object can sit
in any number of places in any number of trees, which allows for the same
copy operation as with lists. The obvious drawback is that modifications
to such an Element will apply to all places where it appears in a tree,
which may or may not be intended.
The upside of this difference is that an Element in lxml.etree always
has exactly one parent, which can be queried through the getparent()
method. This is not supported in the original ElementTree.
>>> root is root[0].getparent() # lxml.etree only!
True
If you want to copy an element to a different position in lxml.etree,
consider creating an independent deep copy using the copy module
from Python's standard library:
>>> from copy import deepcopy
>>> element = etree.Element("neu")
>>> element.append( deepcopy(root[1]) )
>>> print(element[0].tag)
child1
>>> print([ c.tag for c in root ])
['child3', 'child1', 'child2']
The siblings (or neighbours) of an element are accessed as next and previous
elements:
>>> root[0] is root[1].getprevious() # lxml.etree only!
True
>>> root[1] is root[0].getnext() # lxml.etree only!
True
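The effect of a deep copy can be checked by modifying the copy and confirming that the original is untouched. This is a sketch assuming the standard-library xml.etree.ElementTree; the "state" attribute is made up for the illustration.

```python
from copy import deepcopy
import xml.etree.ElementTree as etree  # stdlib stand-in for lxml.etree (assumption)

root = etree.Element("root")
child = etree.SubElement(root, "child1")
child.set("state", "original")   # hypothetical attribute for the demo

# Put an independent copy of the child into a second tree.
holder = etree.Element("neu")
holder.append(deepcopy(child))

# Changing the copy must not affect the element still sitting in root.
holder[0].set("state", "copy")
```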
Elements carry attributes as a dict
XML elements support attributes. You can create them directly in the Element
factory:
>>> root = etree.Element("root", interesting="totally")
>>> etree.tostring(root)
Attributes are just unordered name-value pairs, so a very convenient way
of dealing with them is through the dictionary-like interface of Elements:
>>> print(root.get("interesting"))
totally
>>> print(root.get("hello"))
None
>>> root.set("hello", "Huhu")
>>> sorted(root.keys())
['hello', 'interesting']
>>> for name, value in sorted(root.items()):
...     print('%s = %r' % (name, value))
hello = 'Huhu'
interesting = 'totally'
For the cases where you want to do item lookup or have other reasons for
getting a 'real' dictionary-like object, e.g. for passing it around,
you can use the attrib property:
>>> attributes = root.attrib
>>> print(attributes["interesting"])
totally
>>> print(attributes.get("no-such-attribute"))
None
>>> attributes["hello"] = "Guten Tag"
>>> print(attributes["hello"])
Guten Tag
Note that attrib is a dict-like object backed by the Element itself.
This means that any changes to the Element are reflected in attrib
and vice versa. It also means that the XML tree stays alive in memory
as long as the attrib of one of its Elements is in use. To get an
independent snapshot of the attributes that does not depend on the XML
tree, copy it into a dict:
>>> d = dict(root.attrib)
>>> sorted(d.items())
[('hello', 'Guten Tag'), ('interesting', 'totally')]
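The difference between the live attrib view and a copied snapshot can be sketched as follows, assuming the standard-library xml.etree.ElementTree as a stand-in for lxml.etree (there attrib is a plain dict owned by the Element, which gives the same observable behaviour for this example).

```python
import xml.etree.ElementTree as etree  # stdlib stand-in for lxml.etree (assumption)

root = etree.Element("root", interesting="totally")
root.set("hello", "Huhu")

live_view = root.attrib        # backed by the Element itself
snapshot = dict(root.attrib)   # independent copy

# A later change shows up in the live view but not in the snapshot.
root.set("hello", "Guten Tag")
```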
Elements contain text
Elements can contain text:
>>> root = etree.Element("root")
>>> root.text = "TEXT"
In many XML documents (data-centric documents), this is the only place where
text can be found. It is encapsulated by a leaf tag at the very bottom of the
tree hierarchy.
However, if XML is used for tagged text documents such as (X)HTML, text can
also appear between different elements, right in the middle of the tree:
<html><body>Hello<br/>World</body></html>
Here, the <br/> tag is surrounded by text. This is often referred to as
document-style or mixed-content XML. Elements support this through their
tail property. It contains the text that directly follows the element, up
to the next element in the XML tree:
>>> html = etree.Element("html")
>>> body = etree.SubElement(html, "body")
>>> body.text = "TEXT"
>>> etree.tostring(html)
>>> br = etree.SubElement(body, "br")
>>> br.tail = "TAIL"
The two properties .text and .tail are enough to represent any
text content in an XML document. This way, the ElementTree API does
not require any special text nodes in addition to the Element
class, that tend to get in the way fairly often (as you might know
from classic DOM APIs).
However, there are cases where the tail text also gets in the way.
For example, when you serialise an Element from within the tree, you
do not always want its tail text in the result (although you would
still want the tail text of its children). For this purpose, the
tostring() function accepts the keyword argument with_tail:
>>> etree.tostring(br)
>>> etree.tostring(br, with_tail=False) # lxml.etree only!
If you want to read only the text, i.e. without any intermediate
tags, you have to recursively concatenate all text and tail
attributes in the correct order. Again, the tostring() function
comes to the rescue, this time using the method keyword:
>>> etree.tostring(html, method="text")
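The interplay of text and tail can be sketched with the standard-library xml.etree.ElementTree standing in for lxml.etree (an assumption; the with_tail keyword is lxml-only and is not used here). itertext() is shown as another way to collect the text chunks in document order.

```python
import xml.etree.ElementTree as etree  # stdlib stand-in for lxml.etree (assumption)

html = etree.Element("html")
body = etree.SubElement(html, "body")
body.text = "TEXT"                 # text before the first child of <body>
br = etree.SubElement(body, "br")
br.tail = "TAIL"                   # text that follows <br/> inside <body>

markup = etree.tostring(html, encoding="unicode")
plain = etree.tostring(html, encoding="unicode", method="text")
chunks = list(html.itertext())     # text and tail pieces in document order
```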
Using XPath to find text
Another way to extract the text content of a tree is XPath, which
also allows you to extract the separate text chunks into a list:
>>> print(html.xpath("string()")) # lxml.etree only!
TEXTTAIL
>>> print(html.xpath("//text()")) # lxml.etree only!
['TEXT', 'TAIL']
If you want to use this more often, you can wrap it in a function:
>>> build_text_list = etree.XPath("//text()") # lxml.etree only!
>>> print(build_text_list(html))
['TEXT', 'TAIL']
Note that a string result returned by XPath is a special 'smart'
object that knows about its origins. You can ask it where it came
from through its getparent() method, just as you would with
Elements:
>>> texts = build_text_list(html)
>>> print(texts[0])
TEXT
>>> parent = texts[0].getparent()
>>> print(parent.tag)
body
>>> print(texts[1])
TAIL
>>> print(texts[1].getparent().tag)
br
You can also find out if it's normal text content or tail text:
>>> print(texts[0].is_text)
True
>>> print(texts[1].is_text)
False
>>> print(texts[1].is_tail)
True
While this works for the results of the text() function, lxml will
not tell you the origin of a string value that was constructed by the
XPath functions string() or concat():
>>> stringify = etree.XPath("string()")
>>> print(stringify(html))
TEXTTAIL
>>> print(stringify(html).getparent())
None
Tree iteration
For problems like the above, where you want to recursively traverse the tree
and do something with its elements, tree iteration is a very convenient
solution. Elements provide a tree iterator for this purpose. It yields
elements in document order, i.e. in the order their tags would appear if you
serialised the tree to XML:
>>> root = etree.Element("root")
>>> etree.SubElement(root, "child").text = "Child 1"
>>> etree.SubElement(root, "child").text = "Child 2"
>>> etree.SubElement(root, "another").text = "Child 3"
>>> print(etree.tostring(root, pretty_print=True))
<root>
  <child>Child 1</child>
  <child>Child 2</child>
  <another>Child 3</another>
</root>
>>> for element in root.iter():
...     print("%s - %s" % (element.tag, element.text))
root - None
child - Child 1
child - Child 2
another - Child 3
If you know you are only interested in a single tag, you can pass its name to
iter() to have it filter for you. Starting with lxml 3.0, you can also
pass more than one tag to intercept on multiple tags during iteration.
>>> for element in root.iter("child"):
...     print("%s - %s" % (element.tag, element.text))
child - Child 1
child - Child 2
>>> for element in root.iter("another", "child"):
...     print("%s - %s" % (element.tag, element.text))
child - Child 1
child - Child 2
another - Child 3
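Document-order iteration and tag filtering can be checked with a short sketch. It assumes the standard-library xml.etree.ElementTree as a stand-in for lxml.etree; passing several tags to iter() at once is an lxml extension and is therefore left out.

```python
import xml.etree.ElementTree as etree  # stdlib stand-in for lxml.etree (assumption)

root = etree.Element("root")
etree.SubElement(root, "child").text = "Child 1"
etree.SubElement(root, "child").text = "Child 2"
etree.SubElement(root, "another").text = "Child 3"

# iter() without arguments walks the whole tree, starting at the element itself.
all_nodes = [(element.tag, element.text) for element in root.iter()]

# Passing a tag name filters the iteration.
only_children = [element.text for element in root.iter("child")]
```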
By default, iteration yields all nodes in the tree, including
ProcessingInstructions, Comments and Entity instances. If you want to
make sure only Element objects are returned, you can pass the
Element factory as tag parameter:
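>>> root.append(etree.Entity("#234"))
>>> root.append(etree.Comment("some comment"))
>>> for element in root.iter():
...     if isinstance(element.tag, basestring): # or 'str' in Python 3
...         print("%s - %s" % (element.tag, element.text))
...     else:
...         print("SPECIAL: %s - %s" % (element, element.text))
root - None
child - Child 1
child - Child 2
another - Child 3
SPECIAL: &#234; - &#234;
SPECIAL: <!--some comment--> - some comment
>>> for element in root.iter(tag=etree.Element):
...     print("%s - %s" % (element.tag, element.text))
root - None
child - Child 1
child - Child 2
another - Child 3
>>> for element in root.iter(tag=etree.Entity):
...     print(element.text)
&#234;
Note that passing a wildcard "*" tag name will also yield all
Element nodes (and only elements).
In lxml.etree, elements provide further iterators for all directions in the
tree: children, parents (or rather ancestors) and siblings.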
Serialisation commonly uses the tostring() function that returns a
string, or the ElementTree.write() method that writes to a file, a
file-like object, or a URL (via FTP PUT or HTTP POST). Both calls accept
the same keyword arguments like pretty_print for formatted output
or encoding to select a specific output encoding other than plain
ASCII:
>>> root = etree.XML('<root><a><b/></a></root>')
>>> print(etree.tostring(root, xml_declaration=True))
>>> print(etree.tostring(root, encoding='iso-8859-1'))
Note that pretty printing appends a newline at the end.
For more fine-grained control over the pretty-printing, you can add
whitespace indentation to the tree before serialising it, using the
indent() function (added in lxml 4. 5):
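>>> root = etree.XML('<root><a><b/>\n</a></root>')
>>> print(etree.tostring(root))
>>> etree.indent(root)
>>> root.text
'\n  '
>>> root[0].text
'\n    '
>>> etree.indent(root, space="    ")
>>> etree.indent(root, space="\t")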
In lxml 2.0 and later (as well as ElementTree 1.3), the serialisation
functions can do more than XML serialisation. You can serialise to
HTML or extract the text content by passing the method keyword:
>>> root = etree.XML(
...     '<html><head/><body><p>Hello<br/>World</p></body></html>')
>>> etree.tostring(root) # default: method = 'xml'
>>> etree.tostring(root, method='xml') # same as above
>>> etree.tostring(root, method='html')
>>> print(etree.tostring(root, method='html', pretty_print=True))
>>> etree.tostring(root, method='text')
As for XML serialisation, the default encoding for plain text
serialisation is ASCII:
>>> br = next(root.iter('br')) # get first result of iteration
>>> br.tail = u'W\xf6rld'
>>> etree.tostring(root, method='text') # doctest: +ELLIPSIS
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' ...
>>> etree.tostring(root, method='text', encoding="UTF-8")
Here, serialising to a Python unicode string instead of a byte string
can come in handy. Just pass the name 'unicode' as encoding:
>>> etree.tostring(root, encoding='unicode', method='text')
The W3C has a good article about the Unicode character set and
character encodings.
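The effect of the encoding and method keywords can be sketched with the standard-library xml.etree.ElementTree, used here as a stand-in for lxml.etree. This is an assumption: the two libraries agree on the results checked below, but they differ in details such as how non-ASCII text is handled under the default ASCII encoding, so that case is only shown for XML output.

```python
import xml.etree.ElementTree as etree  # stdlib stand-in for lxml.etree (assumption)

root = etree.XML("<html><head/><body><p>Hello<br/>World</p></body></html>")
br = next(root.iter("br"))   # first <br> element in document order
br.tail = "W\u00f6rld"       # replace the tail text with a non-ASCII string

# Text-only output as UTF-8 encoded bytes.
as_utf8 = etree.tostring(root, method="text", encoding="UTF-8")

# Text-only output as a Python str.
as_str = etree.tostring(root, method="text", encoding="unicode")

# Default XML output is ASCII bytes; the stdlib escapes the non-ASCII
# character as a numeric character reference.
as_xml = etree.tostring(root)
```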
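The ElementTree class
An ElementTree is mainly a document wrapper around a tree with a
root node. It provides a couple of methods for serialisation and
general document handling.
>>> root = etree.XML('''\
... <?xml version="1.0"?>
... <!DOCTYPE root SYSTEM "test" [ <!ENTITY tasty "parsnips"> ]>
... <root>
...   <a>&tasty;</a>
... </root>
... ''')
>>> tree = etree.ElementTree(root)
>>> print(tree.docinfo.xml_version)
1.0
>>> print(tree.docinfo.doctype)
<!DOCTYPE root SYSTEM "test">
>>> tree.docinfo.public_id = '-//W3C//DTD XHTML 1.0 Transitional//EN'
>>> tree.docinfo.system_url = ''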
An ElementTree is also what you get back when you call the
parse() function to parse files or file-like objects (see the
parsing section below).
One of the important differences is that the ElementTree class
serialises as a complete document, as opposed to a single Element.
This includes top-level processing instructions and comments, as well
as a DOCTYPE and other DTD content in the document:
>>> print(etree.tostring(tree)) # lxml 1.3.4 and later
In the original xml.etree.ElementTree implementation and in lxml
up to 1.3.3, the output looks the same as when serialising only
the root Element:
>>> print(etree.tostring(tree.getroot()))
This serialisation behaviour has changed in lxml 1.3.4. Before,
the tree was serialised without DTD content, which made lxml
lose DTD information in an input-output cycle.
Parsing from strings and files
lxml.etree supports parsing XML in a number of ways and from all
important sources, namely strings, files, URLs (http/ftp) and
file-like objects. The main parse functions are fromstring() and
parse(), both called with the source as first argument. By
default, they use the standard parser, but you can always pass a
different parser as second argument.
The fromstring() function
The fromstring() function is the easiest way to parse a string:
>>> some_xml_data = "<root>data</root>"
>>> root = etree.fromstring(some_xml_data)
>>> print(root.tag)
root
The XML() function
The XML() function behaves like the fromstring() function, but is
commonly used to write XML literals right into the source:
>>> root = etree.XML("<root>data</root>")
There is also a corresponding function HTML() for HTML literals.
>>> root = etree.HTML("<p>data</p>")
>>> etree.tostring(root)
The parse() function
The parse() function is used to parse from files and file-like objects.
As an example of such a file-like object, the following code uses the
BytesIO class for reading from a string instead of an external file.
That class comes from the io module in Python 2.6 and later. In older
Python versions, you will have to use the StringIO class from the
StringIO module. However, in real life, you would obviously avoid
doing this altogether and use the string parsing functions above.
>>> from io import BytesIO
>>> some_file_or_file_like_object = BytesIO(b"<root>data</root>")
>>> tree = etree.parse(some_file_or_file_like_object)
>>> etree.tostring(tree)
Note that parse() returns an ElementTree object, not an Element object as
the string parser functions do:
>>> root = tree.getroot()
>>> print(root.tag)
root
The reasoning behind this difference is that parse() returns a
complete document from a file, while the string parsing functions are
commonly used to parse XML fragments.
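The different return types of parse() and fromstring() can be seen in a short sketch. It assumes the standard-library xml.etree.ElementTree as a stand-in for lxml.etree; both calls exist in either library.

```python
from io import BytesIO
import xml.etree.ElementTree as etree  # stdlib stand-in for lxml.etree (assumption)

# parse() reads from a file-like object and returns an ElementTree (document).
source = BytesIO(b"<root>data</root>")
tree = etree.parse(source)
root = tree.getroot()            # the root Element of that document

# fromstring() returns the root Element directly.
fragment = etree.fromstring("<root>data</root>")
```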
The parse() function supports any of the following sources:
an open file object (make sure to open it in binary mode)
a file-like object that has a .read(byte_count) method returning
a byte string on each call
a filename string
an HTTP or FTP URL string
Note that passing a filename or URL is usually faster than passing an
open file or file-like object. However, the HTTP/FTP client in libxml2
is rather simple, so things like HTTP authentication require a dedicated
URL request library, e.g. urllib2 or requests. These libraries
usually provide a file-like object for the result that you can parse
from while the response is streaming in.
Parser objects
By default, lxml.etree uses a standard parser with a default setup. If
you want to configure the parser, you can create a new instance:
>>> parser = etree.XMLParser(remove_blank_text=True) # lxml.etree only!
This creates a parser that removes empty text between tags while parsing,
which can reduce the size of the tree and avoid dangling tail text if you know
that whitespace-only content is not meaningful for your data. An example:
>>> root = etree.XML("<root>  <a/>   <b>  </b>     </root>", parser)
>>> etree.tostring(root)
Note that the whitespace content inside the <b> tag was not removed, as
content at leaf elements tends to be data content (even if blank). You can
easily remove it in an additional step by traversing the tree:
>>> for element in root.iter("*"):
...     if element.text is not None and not element.text.strip():
...         element.text = None
>>> etree.tostring(root)
See help(etree.XMLParser) to find out about the available parser options.
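Where remove_blank_text is not available, the same clean-up can be approximated by hand. The sketch below swaps in the standard-library xml.etree.ElementTree, which has no such parser option, and strips whitespace-only text and tail values in a single traversal; the sample document is an assumption chosen for illustration.

```python
import xml.etree.ElementTree as etree  # stdlib: no remove_blank_text option

root = etree.XML("<root>  <a/>   <b>  </b>     </root>")

# Drop whitespace-only text and tail content everywhere in the tree.
for element in root.iter():
    if element.text is not None and not element.text.strip():
        element.text = None
    if element.tail is not None and not element.tail.strip():
        element.tail = None

compact = etree.tostring(root, encoding="unicode")
print(compact)
```

Unlike the lxml parser option, this manual pass also blanks whitespace inside leaf elements, so only use it when such content carries no meaning.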
Incremental parsing
lxml.etree provides two ways for incremental step-by-step parsing. One is
through file-like objects, where it calls the read() method repeatedly.
This is best used where the data arrives from a source like urllib or any
other file-like object that can provide data on request. Note that the parser
will block and wait until data becomes available in this case:
>>> class DataSource:
...     data = [ b"<roo", b"t><", b"a/", b"><", b"/root>" ]
...     def read(self, requested_size):
...         try:
...             return self.data.pop(0)
...         except IndexError:
...             return b''
>>> tree = etree.parse(DataSource())
>>> etree.tostring(tree)
The second way is through a feed parser interface, given by the feed(data)
and close() methods:
>>> parser = etree.XMLParser()
>>> parser.feed("<roo")
>>> parser.feed("t><")
>>> parser.feed("a/")
>>> parser.feed("><")
>>> parser.feed("/root>")
>>> root = parser.close()
Here, you can interrupt the parsing process at any time and continue it later
on with another call to the feed() method. This comes in handy if you
want to avoid blocking calls to the parser, e.g. in frameworks like Twisted,
or whenever data comes in slowly or in chunks and you want to do other things
while waiting for the next chunk.
After calling the close() method (or when an exception was raised
by the parser), you can reuse the parser by calling its feed()
method again:
>>> parser.feed("<root/>")
>>> root = parser.close()
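The feed interface can be exercised in a loop over arbitrary chunks. This sketch assumes the standard-library xml.etree.ElementTree, whose XMLParser also offers feed() and close(); it uses a fresh parser for a single document and does not rely on parser reuse.

```python
import xml.etree.ElementTree as etree  # stdlib stand-in for lxml.etree (assumption)

parser = etree.XMLParser()

# The chunk boundaries are arbitrary; they may even split a tag name.
for chunk in ("<roo", "t><", "a/", "><", "/root>"):
    parser.feed(chunk)

# close() finishes parsing and returns the root Element.
root = parser.close()
child_tags = [child.tag for child in root]
```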
Event-driven parsing
Sometimes, all you need from a document is a small fraction somewhere deep
inside the tree, so parsing the whole tree into memory, traversing it and
dropping it can be too much overhead. lxml.etree supports this use case
with two event-driven parser interfaces, one that generates parser events
while building the tree (iterparse), and one that does not build the tree
at all, and instead calls feedback methods on a target object in a SAX-like
fashion.
Here is a simple iterparse() example:
>>> some_file_like = BytesIO(b"<root><a>data</a></root>")
>>> for event, element in etree.iterparse(some_file_like):
...     print("%s, %4s, %s" % (event, element.tag, element.text))
end,    a, data
end, root, None
By default, iterparse() only generates events when it is done parsing an
element, but you can control this through the events keyword argument:
>>> some_file_like = BytesIO(b"<root><a>data</a></root>")
>>> for event, element in etree.iterparse(some_file_like,
...                                       events=("start", "end")):
...     print("%5s, %4s, %s" % (event, element.tag, element.text))
start, root, None
start,    a, data
  end,    a, data
  end, root, None
Note that the text, tail, and children of an Element are not necessarily present
yet when receiving the start event. Only the end event guarantees
that the Element has been parsed completely.
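The order of start and end events can be recorded with a short sketch, assuming the standard-library xml.etree.ElementTree as a stand-in for lxml.etree. Only the tag is read on each event, since text is not guaranteed to be available at start time.

```python
from io import BytesIO
import xml.etree.ElementTree as etree  # stdlib stand-in for lxml.etree (assumption)

some_file_like = BytesIO(b"<root><a>data</a></root>")

# Collect (event, tag) pairs in the order the parser reports them.
seen = [
    (event, element.tag)
    for event, element in etree.iterparse(some_file_like, events=("start", "end"))
]
```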
It also allows you to .clear() or modify the content of an Element to
save memory. So if you parse a large tree and you want to keep memory
usage small, you should clean up parts of the tree that you no longer
need. The keep_tail=True argument to .clear() makes sure that
(tail) text content that follows the current element will not be touched.
It is highly discouraged to modify any content that the parser may not
have completely read through yet.
>>> some_file_like = BytesIO(
...     b"<root><a><b>data</b></a><a><b/></a></root>")
>>> for event, element in etree.iterparse(some_file_like):
...     if element.tag == 'b':
...         print(element.text)
...     elif element.tag == 'a':
...         print("** cleaning up the subtree")
...         element.clear(keep_tail=True)
data
** cleaning up the subtree
None
** cleaning up the subtree
A very important use case for iterparse() is parsing large
generated XML files, e.g. database dumps. Most often, these XML
formats only have one main data item element that hangs directly below
the root node and that is repeated thousands of times. In this case,
it is best practice to let lxml.etree do the tree building and only to
intercept on exactly this one Element, using the normal tree API
for data extraction.
>>> xml_file = BytesIO(b'''\
... <root>
...   <a><b>ABC</b><c>abc</c></a>
...   <a><b>MORE DATA</b><c>more data</c></a>
...   <a><b>XYZ</b><c>xyz</c></a>
... </root>''')
>>> for _, element in etree.iterparse(xml_file, tag='a'):
...     print('%s -- %s' % (element.findtext('b'), element[1].text))
...     element.clear(keep_tail=True)
ABC -- abc
MORE DATA -- more data
XYZ -- xyz
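The same extract-then-clear pattern can be sketched with the standard-library xml.etree.ElementTree. Two differences are assumptions of this stand-in: its iterparse() has no tag keyword, so the tag is checked inside the loop, and its clear() takes no keep_tail argument and also drops the tail.

```python
from io import BytesIO
import xml.etree.ElementTree as etree  # stdlib stand-in for lxml.etree (assumption)

xml_file = BytesIO(
    b"<root>"
    b"<a><b>ABC</b><c>abc</c></a>"
    b"<a><b>XYZ</b><c>xyz</c></a>"
    b"</root>"
)

pairs = []
for _, element in etree.iterparse(xml_file):   # default: "end" events only
    if element.tag == "a":
        # The element is complete at its end event, so its children are safe to read.
        pairs.append((element.findtext("b"), element[1].text))
        # Free the subtree once the data has been extracted.
        element.clear()
```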
If, for some reason, building the tree is not desired at all, the
target parser interface of lxml.etree can be used. It creates
SAX-like events by calling the methods of a target object. By
implementing some or all of these methods, you can control which
events are generated:
>>> class ParserTarget:
...     events = []
...     close_count = 0
...     def start(self, tag, attrib):
...         self.events.append(("start", tag, attrib))
...     def close(self):
...         events, self.events = self.events, []
...         self.close_count += 1
...         return events
>>> parser_target = ParserTarget()
>>> parser = etree.XMLParser(target=parser_target)
>>> events = etree.fromstring('<root test="true"/>', parser)
>>> print(parser_target.close_count)
1
>>> for event in events:
...     print('event: %s - tag: %s' % (event[0], event[1]))
...     for attr, value in event[2].items():
...         print(' * %s = %s' % (attr, value))
event: start - tag: root
 * test = true
You can reuse the parser and its target as often as you like, so you
should take care that the .close() method really resets the
target to a usable state (also in the case of an error!).
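A target object can be tried out with the standard-library xml.etree.ElementTree, whose XMLParser accepts the same target keyword (used here as an assumed stand-in for lxml.etree). The sketch implements all four callbacks and keeps the event list on the instance, so close() hands back the events and leaves the target empty again.

```python
import xml.etree.ElementTree as etree  # stdlib stand-in for lxml.etree (assumption)

class ParserTarget:
    """Collects SAX-like events instead of building a tree."""

    def __init__(self):
        self.events = []

    def start(self, tag, attrib):
        self.events.append(("start", tag, dict(attrib)))

    def end(self, tag):
        self.events.append(("end", tag))

    def data(self, text):
        pass  # character data is ignored in this sketch

    def close(self):
        # Return the collected events and reset to a usable state.
        events, self.events = self.events, []
        return events

target = ParserTarget()
parser = etree.XMLParser(target=target)

# With a target parser, fromstring() returns whatever target.close() returns.
events = etree.fromstring('<root test="true"/>', parser)
```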
The ElementTree API avoids
namespace prefixes
wherever possible and deploys the real namespace (the URI) instead:
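>>> xhtml = etree.Element("{http://www.w3.org/1999/xhtml}html")
>>> body = etree.SubElement(xhtml, "{http://www.w3.org/1999/xhtml}body")
>>> body.text = "Hello World"
>>> print(etree.tostring(xhtml, pretty_print=True))
>>> XHTML_NAMESPACE = "http://www.w3.org/1999/xhtml"
>>> XHTML = "{%s}" % XHTML_NAMESPACE
>>> body.set(XHTML + "bgcolor", "#CCFFAA")
>>> print(body.get("bgcolor"))
None
>>> body.get(XHTML + "bgcolor")
'#CCFFAA'
You can also use XPath with fully qualified names:
>>> find_xhtml_body = etree.ETXPath( # lxml only!
...     "//{%s}body" % XHTML_NAMESPACE)
>>> results = find_xhtml_body(xhtml)
>>> print(results[0].tag)
{http://www.w3.org/1999/xhtml}body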
For convenience, you can use "*" wildcards in all iterators of lxml.etree,
both for tag names and namespaces:
>>> for el in xhtml.iter('*'): print(el.tag) # any element
>>> for el in xhtml.iter('{http://www.w3.org/1999/xhtml}*'): print(el.tag)
>>> for el in xhtml.iter('{*}body'): print(el.tag)
To look for elements that do not have a namespace, either use the
plain tag name or provide the empty namespace explicitly:
>>> [ el.tag for el in xhtml.iter('{http://www.w3.org/1999/xhtml}body') ]
>>> [ el.tag for el in xhtml.iter('body') ]
>>> [ el.tag for el in xhtml.iter('{}body') ]
>>> [ el.tag for el in xhtml.iter('{}*') ]
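Fully qualified names can be checked with a short sketch, assuming the standard-library xml.etree.ElementTree as a stand-in for lxml.etree. It uses only the "{uri}name" notation for tags and attributes; the "{*}" and "{}" wildcard forms in iterators are left out here, and the attribute value is made up for the demonstration.

```python
import xml.etree.ElementTree as etree  # stdlib stand-in for lxml.etree (assumption)

XHTML_NAMESPACE = "http://www.w3.org/1999/xhtml"
XHTML = "{%s}" % XHTML_NAMESPACE

xhtml = etree.Element(XHTML + "html")
body = etree.SubElement(xhtml, XHTML + "body")
body.text = "Hello World"
body.set(XHTML + "bgcolor", "#CCFFAA")   # namespaced attribute, example value

# The qualified name matches; the plain name does not.
qualified = [el.tag for el in xhtml.iter(XHTML + "body")]
unqualified = [el.tag for el in xhtml.iter("body")]
```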
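The E-factory
The E-factory provides a simple and compact syntax for generating XML and
HTML:
>>> from lxml.builder import E
>>> def CLASS(*args): # class is a reserved word in Python
...     return {"class":' '.join(args)}
>>> html = page = (
...   E.html(       # create an Element called "html"
...     E.head(
...       E.title("This is a sample document")
...     ),
...     E.body(
...       E.h1("Hello!", CLASS("title")),
...       E.p("This is a paragraph with ", E.b("bold"), " text in it!"),
...       E.p("This is another paragraph, with a", "\n      ",
...         E.a("link", href=""), "."),
...       E.p("Here are some reserved characters: <spam&egg>."),
...       etree.XML("<p>And finally an embedded XHTML fragment.</p>"),
...     )
...   )
... )
>>> print(etree.tostring(page, pretty_print=True))
<html>
  <head>
    <title>This is a sample document</title>
  </head>
  <body>
    <h1 class="title">Hello!</h1>
    <p>This is a paragraph with <b>bold</b> text in it!</p>
    <p>This is another paragraph, with a
      <a href="">link</a>.</p>
    <p>Here are some reserved characters: &lt;spam&amp;egg&gt;.</p>
    <p>And finally an embedded XHTML fragment.</p>
  </body>
</html>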
Element creation based on attribute access makes it easy to build up a
simple vocabulary for an XML language:
>>> from lxml.builder import ElementMaker # lxml only!
>>> # placeholder URI: the original namespace value is missing from this copy
>>> E = ElementMaker(namespace="http://example.com/ns",
...                  nsmap={'p': "http://example.com/ns"})
>>> DOC = E.doc
>>> TITLE = E.title
>>> SECTION = E.section
>>> PAR = E.par
>>> my_doc = DOC(
...   TITLE("The dog and the hog"),
...   SECTION(
...     TITLE("The dog"),
...     PAR("Once upon a time, ..."),
...     PAR("And then ...")
...   ),
...   SECTION(
...     TITLE("The hog"),
...     PAR("Sooner or later ...")
...   )
... )
>>> print(etree.tostring(my_doc, pretty_print=True))