Lxml

lxml – Processing XML and HTML with Python

lxml is the most feature-rich
and easy-to-use library
for processing XML and HTML
in the Python language.
The lxml XML toolkit is a Pythonic binding for the C libraries
libxml2 and libxslt. It is unique in that it combines the speed and
XML feature completeness of these libraries with the simplicity of a
native Python API, mostly compatible but superior to the well-known
ElementTree API. The latest release works with all CPython versions
from 2.7 to 3.9. See the introduction for more information about
background and goals of the lxml project. Some common questions are
answered in the FAQ.
lxml has been downloaded from the Python Package Index
millions of times and is also available directly in many package
distributions, e.g. for Linux or macOS.
Most people who use lxml do so because they like using it.
You can show us that you like it by blogging about your experience
with it and linking to the project website.
If you are using lxml for your work and feel like giving a bit of
your own benefit back to support the project, consider sending us
money through GitHub Sponsors, Tidelift or PayPal that we can use
to buy us free time for the maintenance of this great library, to
fix bugs in the software, review and integrate code contributions,
to improve its features and documentation, or to just take a deep
breath and have a cup of tea every once in a while.
Please read the Legal Notice below, at the bottom of this page.
Thank you for your support.
Support lxml through GitHub Sponsors, via a Tidelift subscription, or via PayPal.
Please contact Stefan Behnel
for other ways to support the lxml project,
as well as commercial consulting, customisations and trainings on lxml and
fast Python XML processing.
Travis-CI and AppVeyor
support the lxml project with their build and CI servers.
Jetbrains supports the lxml project by donating free licenses of their
PyCharm IDE.
Another supporter of the lxml project is
COLOGNE Webdesign.
The complete lxml documentation is available for download as PDF
documentation. The HTML documentation from this web site is part of
the normal source download.
Tutorials:
the tutorial for XML processing
John Shipman’s tutorial on Python XML processing with lxml
Fredrik Lundh’s tutorial for ElementTree
ElementTree:
ElementTree API
compatibility and differences of lxml.etree
ElementTree performance characteristics and comparison
lxml.etree:
lxml.etree specific API documentation
the generated API documentation as a reference
parsing and validating XML
XPath and XSLT support
Python XPath extension functions for XPath and XSLT
custom XML element classes for custom XML APIs (see EuroPython 2008 talk)
a SAX compliant API for interfacing with other XML tools
a C-level API for interfacing with external C/Cython modules
lxml.objectify:
lxml.objectify API documentation
a brief comparison of objectify and etree
lxml.etree follows the ElementTree API as much as possible, building
it on top of the native libxml2 tree. If you are new to ElementTree,
start with the tutorial for XML processing. See also the
ElementTree compatibility overview and the ElementTree performance
page comparing lxml to the original ElementTree and cElementTree
implementations.
Right after the tutorial for XML processing and the
ElementTree documentation, the next place to look is the
specific API documentation. It describes how lxml extends the
ElementTree API to expose libxml2 and libxslt specific XML
functionality, such as XPath, Relax NG, XML Schema, XSLT, and
c14n (including c14n 2.0).
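As a rough illustration of two of these extensions, the following sketch runs an XPath query and a small XSLT transformation through lxml.etree. The sample document, the stylesheet and all variable names are made up for this example; only `etree.XML`, `xpath()` and `etree.XSLT` come from lxml itself.

```python
from lxml import etree

# A small, made-up document to query and transform
doc = etree.XML("<catalog><book id='1'>A</book><book id='2'>B</book></catalog>")

# XPath: select the text of the book whose id attribute is 2
titles = doc.xpath("/catalog/book[@id='2']/text()")
print(titles)

# XSLT: a tiny stylesheet that turns each <book> into an <li>
stylesheet = etree.XML("""
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/catalog">
    <ul><xsl:for-each select="book"><li><xsl:value-of select="."/></li></xsl:for-each></ul>
  </xsl:template>
</xsl:stylesheet>
""")
transform = etree.XSLT(stylesheet)
result = transform(doc)

# The result behaves like an ElementTree; its root here is the <ul> element
items = [li.text for li in result.getroot()]
print(items)
```

The compiled `transform` object can be reused for any number of input documents.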
Python code can be called from XPath expressions and XSLT
stylesheets through the use of XPath extension functions. lxml
also offers a SAX compliant API that works with the SAX support in
the standard library.
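A minimal sketch of an XPath extension function might look like the following. The function name `shout`, the namespace URI and the `ex` prefix are hypothetical choices for this example; `etree.FunctionNamespace` is the lxml registry used to expose the function. Note that this registry is global to the process.

```python
from lxml import etree

def shout(context, text):
    """Hypothetical extension function; lxml passes the evaluation context first."""
    return text.upper()

# Register the function under a made-up namespace URI
FUNCTIONS_NS = "http://example.org/functions"
ns = etree.FunctionNamespace(FUNCTIONS_NS)
ns["shout"] = shout

root = etree.XML("<root><name>lxml</name></root>")

# Map a prefix to the namespace so the function can be called from XPath
result = root.xpath(
    "ex:shout(string(/root/name))",
    namespaces={"ex": FUNCTIONS_NS},
)
print(result)
```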
There is a separate module lxml.objectify that implements a data-binding
API on top of lxml.etree. See the objectify and etree FAQ entry for a
comparison.
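To give an idea of what data binding means here, the sketch below parses a made-up order document with lxml.objectify and reads child elements as Python attributes. The element names and values are assumptions chosen for the example.

```python
from lxml import objectify

# A made-up document; child elements become attributes of the parent object
order = objectify.fromstring(
    "<order><number>42</number><customer>Ada</customer><total>9.5</total></order>"
)

# .pyval gives the value converted to a Python type based on the text content
order_number = order.number.pyval
total = order.total.pyval
# .text always gives the raw string
customer = order.customer.text

print(order_number, customer, total)
```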
In addition to the ElementTree API, lxml also features a sophisticated
API for custom XML element classes. This is a simple way to write
arbitrary XML driven APIs on top of lxml. lxml.etree also has a
C-level API that can be used to efficiently extend lxml.etree in
external C modules, including fast custom element class support.
The best way to download lxml is to visit lxml at the Python Package
Index (PyPI). It has the source
that compiles on various platforms. The source distribution is signed
with this key.
The latest version is lxml 4.6.3, released 2021-03-21
(changes for 4.6.3). Older versions
are listed below.
Please take a look at the
installation instructions!
This complete web site (including the generated API documentation) is
part of the source distribution, so if you want to download the
documentation for offline use, take the source archive and copy the
doc/html directory out of the source tree, or use the
PDF documentation.
The latest installable developer sources
are available from GitHub. It’s also possible to check out
the latest development version of lxml from GitHub directly, using a command
like this (assuming you use hg and have hg-git installed):
hg clone git+ssh lxml
Alternatively, if you use git, this should work as well:
git clone lxml
You can browse the source repository and its history through
the web. Please read how to build lxml from source
first. The latest CHANGES of the developer version are also
accessible. You can check there if a bug you found has been fixed
or a feature you want has been implemented in the latest trunk version.
Questions? Suggestions? Code to contribute? We have a mailing list.
You can search the archive with Gmane or Google.
lxml uses the launchpad bug tracker. If you are sure you found a
bug in lxml, please file a bug report there. If you are not sure
whether some unexpected behaviour of lxml is a bug or not, please
check the documentation and ask on the mailing list first. Do not
forget to search the archive (e.g. with Gmane)!
The lxml library is shipped under a BSD license. libxml2 and libxslt
themselves are shipped under the MIT license. There should therefore be no
obstacle to using lxml in your codebase.
See the websites of lxml 4.5, 4.4, 4.3, 4.2, 4.1, 4.0, 3.8, 3.7, 3.6, 3.5, 3.4, 3.3, 3.2, 3.1, 3.0, 2.3, 2.2, 2.1, 2.0, 1.3
lxml 4.6.3, released 2021-03-21 (changes for 4.6.3)
lxml 4.6.2, released 2020-11-26 (changes for 4.6.2)
lxml 4.6.1, released 2020-10-18 (changes for 4.6.1)
lxml 4.6.0, released 2020-10-17 (changes for 4.6.0)
lxml 4.5.2, released 2020-07-09 (changes for 4.5.2)
lxml 4.5.1, released 2020-05-19 (changes for 4.5.1)
lxml 4.5.0, released 2020-01-29 (changes for 4.5.0)
lxml 4.4.3, released 2020-01-28 (changes for 4.4.3)
lxml 4.4.2, released 2019-11-25 (changes for 4.4.2)
lxml 4.4.1, released 2019-08-11 (changes for 4.4.1)
lxml 4.4.0, released 2019-07-27 (changes for 4.4.0)
older releases
Total project income in 2019: EUR 717.52 (59.79 € / month)
Tidelift: EUR 360.30
Paypal: EUR 157.22
other: EUR 200.00
Any donation that you make to the lxml project is voluntary and
is not a fee for any services, goods, or advantages. By making
a donation to the lxml project, you acknowledge that we have the
right to use the money you donate in any lawful way and for any
lawful purpose we see fit and we are not obligated to disclose
the way and purpose to any party unless required by applicable
law. Although lxml is free software, to the best of our knowledge
the lxml project does not have any tax exempt status. The lxml
project is neither a registered non-profit corporation nor a
registered charity in any country. Your donation may or may not
be tax-deductible; please consult your tax advisor in this matter.
We will not publish or disclose your name and/or e-mail address
without your consent, unless required by applicable law. Your
donation is non-refundable.
Introduction to the Python lxml Library - Stack Abuse

lxml is a Python library which allows for easy handling of XML and HTML files, and can also be used for web scraping. There are a lot of off-the-shelf XML parsers out there, but for better results, developers sometimes prefer to write their own XML and HTML parsers. This is where the lxml library comes into play. The key benefits of this library are that it is easy to use, extremely fast when parsing large documents, very well documented, and provides easy conversion of data to Python data types, resulting in easier file manipulation.
In this tutorial, we will deep dive into Python’s lxml library, starting with how to set it up for different operating systems, and then discussing its benefits and the wide range of functionalities it offers.
Installation
There are multiple ways to install lxml on your system. We’ll explore some of them below.
Using Pip
Pip is a Python package manager used to download and install Python libraries on your local system with ease, i.e. it also downloads and installs all the dependencies of the package you’re installing.
If you have pip installed on your system, simply run the following command in terminal or command prompt:
$ pip install lxml
Using apt-get
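If you’re using a Debian-based Linux distribution such as Ubuntu, you can install lxml by running this command in your terminal (on older, Python 2 based systems the package is named python-lxml):
$ sudo apt-get install python3-lxml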
Using easy_install
You probably won’t get to this part, but if neither of the above commands works for you for some reason, and your setup still ships easy_install (it is deprecated in current versions of setuptools), try using it:
$ easy_install lxml
Note: If you wish to install a particular version of lxml, append it to the package name when you run the command in the command prompt or terminal, like this: lxml==3.x.y.
By now, you should have a copy of the lxml library installed on your local machine. Let’s now get our hands dirty and see what cool things can be done using this library.
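A quick way to confirm that the installation worked is to import the library and print its version. This is only a small sketch; the exact version numbers you see depend on what was installed.

```python
from lxml import etree

# LXML_VERSION is a tuple of integers describing the installed lxml release
print("lxml version:", etree.LXML_VERSION)

# LIBXML_VERSION describes the libxml2 library that lxml is running against
print("libxml2 version:", etree.LIBXML_VERSION)
```

If the import fails with an ImportError, the library was not installed into the Python environment you are running.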
Functionality
To be able to use the lxml library in your program, you first need to import it. You can do that by using the following command:
from lxml import etree as et
This will import the etree module, the module of our interest, from the lxml library.
Creating HTML/XML Documents
Using the etree module, we can create XML/HTML elements and their subelements, which is a very useful thing if we’re trying to write or manipulate an HTML or XML file. Let’s try to create the basic structure of an HTML file using etree:
root = et.Element('html', version="5.0")
# Pass the parent node, name of the child node,
# and any number of optional attributes
et.SubElement(root, 'head')
et.SubElement(root, 'title', bgcolor="red", fontsize='22')
et.SubElement(root, 'body', fontsize="15")
In the code above, note that the Element function requires at least one parameter, whereas the SubElement function requires at least two. This is because Element only requires the name of the element to be created, whereas SubElement requires both the parent element and the name of the child node to be created.
It’s also important to know that both these functions only have a lower bound on the number of arguments they accept, but no upper bound, because you can associate as many attributes with them as you want. To add an attribute to an element, simply add an additional parameter to the (Sub)Element function in the form attributeName='attribute value'.
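Keyword arguments only work for attribute names that are valid Python identifiers. As a small sketch of the alternative, attributes can also be passed as a dictionary; the `data-page` attribute and the variable names below are made up for illustration.

```python
from lxml import etree as et

root = et.Element("html", version="5.0")

# Keyword arguments become attributes of the new element
title = et.SubElement(root, "title", bgcolor="red")

# Names such as "data-page" are not valid keyword arguments,
# so they are passed in a dictionary; both styles can be combined
body = et.SubElement(root, "body", {"data-page": "1"}, fontsize="15")

print(dict(title.attrib))
print(body.get("data-page"), body.get("fontsize"))
```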
Let’s try to run the code we wrote above to gain a better intuition regarding these functions:
# Use pretty_print=True to indent the HTML output
print(et.tostring(root, pretty_print=True).decode("utf-8"))
Output:
<html version="5.0">
  <head/>
  <title bgcolor="red" fontsize="22"/>
  <body fontsize="15"/>
</html>
There’s another way to create and organize your elements in a hierarchical manner. Let’s explore that as well:
root = et.Element('html')
root.append(et.Element('head'))
root.append(et.Element('body'))
In this case, whenever we create a new element, we simply append it to the root/parent node.
Parsing HTML/XML Documents
Until now, we have only considered creating new elements, assigning attributes to them, etc. Let’s now see an example where we already have an HTML or XML file, and we wish to parse it to extract certain information. Assuming that we have the HTML file that we created in the first example, let’s try to get the tag name of one specific element, followed by printing the tag names of all the elements.
print(root.tag)
Output:
html
Now to iterate through all the child elements in the root node and print their tags:
for e in root:
    print(e.tag)
Output:
head
title
body
Working with Attributes
Let’s now see how we associate attributes to existing elements, as well as how to retrieve the value of a particular attribute for a given element.
Using the root element from the first example, try out the following code:
root.set('newAttribute', 'attributeValue')
# Print root again to see if the new attribute has been added
print(et.tostring(root, pretty_print=True).decode("utf-8"))
Output:
<html version="5.0" newAttribute="attributeValue">
  <head/>
  <title bgcolor="red" fontsize="22"/>
  <body fontsize="15"/>
</html>
Here we can see that newAttribute="attributeValue" has indeed been added to the root element.
Let’s now try to get the values of the attributes we have set in the above code. Here we access a child element using array indexing on the root element, and then use the get() method to retrieve the attribute:
print(root.get('newAttribute'))
print(root[1].get('alpha')) # root[1] accesses the `title` element
print(root[1].get('bgcolor'))
Output:
attributeValue
None
red
Retrieving Text from Elements
Now that we have seen basic functionalities of the etree module, let’s try to do some more interesting things with our HTML and XML files. Almost always, these files have some text in between the tags. So, let’s see how we can add text to our elements:
# Copying the code from the very first example
root = et.Element('html', version="5.0")
et.SubElement(root, 'head')
et.SubElement(root, 'title', bgcolor="red", fontsize="22")
et.SubElement(root, 'body', fontsize="15")
# Add text to the Elements and SubElements
root.text = "This is an HTML file"
root[0].text = "This is the head of that file"
root[1].text = "This is the title of that file"
root[2].text = "This is the body of that file and would contain paragraphs etc"
print(et.tostring(root).decode("utf-8"))
Output:
<html version="5.0">This is an HTML file<head>This is the head of that file</head><title bgcolor="red" fontsize="22">This is the title of that file</title><body fontsize="15">This is the body of that file and would contain paragraphs etc</body></html>
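As a short recap of attributes and text, here is a minimal sketch using a freshly built tree. The element and variable names are only illustrative. It also shows that get() accepts a fallback value for attributes that do not exist.

```python
from lxml import etree as et

root = et.Element("html", version="5.0")
title = et.SubElement(root, "title", bgcolor="red")
title.text = "Example title"

# get() returns None for a missing attribute unless a default is supplied
missing = title.get("alpha", "not set")

# attrib behaves like a dictionary holding all attributes of the element
attributes = dict(title.attrib)

# An element that has no text yet reports None
empty_text = root.text

print(missing, attributes, empty_text, title.text)
```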
Check if an Element has Children
Next, there are two very important things that we should be able to check, as they are required in a lot of web scraping applications for exception handling. The first is whether or not an element has children, and the second is whether or not a node is an Element.
Let’s do that for the nodes we created above:
if len(root) > 0:
    print("True")
else:
    print("False")
The above code will output “True” since the root node does have child nodes. However, if we check the same thing for the root’s child nodes, like in the code below, the output will be “False”.
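for i in range(len(root)):
    if len(root[i]) > 0:
        print("True")
    else:
        print("False")
Output:
False
False
False
Now let’s do the same thing to see if each of the nodes is an Element or not:
for i in range(len(root)):
    print(et.iselement(root[i]))
Output:
True
True
True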
The iselement method is helpful for determining if you have a valid Element object, and thus if you can continue traversing it using the methods we’ve shown here.
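The two checks can be combined into one small helper. This is just a sketch; the name `has_children` is our own and not part of lxml.

```python
from lxml import etree as et

def has_children(node):
    """Return True only if node is an Element with at least one child."""
    return et.iselement(node) and len(node) > 0

root = et.Element("html")
head = et.SubElement(root, "head")

print(has_children(root))              # True: root contains <head>
print(has_children(head))              # False: <head> is empty
print(has_children("not an element"))  # False: a string is not an Element
```

Because iselement is checked first, the helper can safely be called on values that are not elements at all.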
Check if an Element has a Parent
Just now, we showed how to go down the hierarchy, i.e. how to check if an element has children or not. In this section we will go up the hierarchy, i.e. check for and get the parent of a child node.
print(root.getparent())
print(root[0].getparent())
print(root[1].getparent())
The first line should print None, as the root node itself doesn’t have a parent. The other two should both point to the root element, i.e. the html element. Let’s check the output to see if it is what we expect (the memory addresses will differ on your machine):
Output:
None
<Element html at 0x...>
<Element html at 0x...>
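Calling getparent() repeatedly lets us walk all the way up to the root. The sketch below does this on a small made-up document and compares the result with iterancestors(), which yields the same ancestors without the starting node.

```python
from lxml import etree as et

root = et.XML("<html><body><p><b>deep</b></p></body></html>")
node = root[0][0][0]  # the <b> element

# Walk up until getparent() returns None, which happens at the root
path = []
current = node
while current is not None:
    path.append(current.tag)
    current = current.getparent()
print(path)  # ['b', 'p', 'body', 'html']

# iterancestors() yields the ancestors only, nearest first
ancestors = [a.tag for a in node.iterancestors()]
print(ancestors)  # ['p', 'body', 'html']
```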

Retrieving Element Siblings
In this section we will learn how to traverse sideways in the hierarchy, which retrieves an element’s siblings in the tree.
Traversing the tree sideways is quite similar to navigating it vertically. For vertical navigation we used getparent and the length of the element; for sideways navigation we’ll use the getnext and getprevious functions. Let’s try them on nodes that we previously created to see how they work:
# root[1] is the `title` tag
print(root[1].getnext()) # The tag after the `title` tag
print(root[1].getprevious()) # The tag before the `title` tag
Output (the memory addresses will differ on your machine):
<Element body at 0x...>
<Element head at 0x...>
Here you can see that root[1].getnext() retrieved the “body” tag since it was the next element, and root[1].getprevious() retrieved the “head” tag.
Similarly, if we had used the getprevious function on root, it would have returned None, and if we had used the getnext function on root[2], it would also have returned None.
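When we need every sibling rather than just the nearest one, lxml also provides itersiblings(). The following sketch uses a small made-up tree to show both directions.

```python
from lxml import etree as et

root = et.XML("<html><head/><title/><body/></html>")
title = root[1]

# By default itersiblings() yields the siblings that follow the element
following = [e.tag for e in title.itersiblings()]
# With preceding=True it yields the siblings that come before it
preceding = [e.tag for e in title.itersiblings(preceding=True)]

print(following)  # ['body']
print(preceding)  # ['head']

# At the edges of the tree, getnext() and getprevious() return None
print(root.getprevious(), root[2].getnext())
```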
Parsing XML from a String
Moving on, if we have an XML or HTML file and we wish to parse the raw string in order to obtain or manipulate the required information, we can do so by following the example below:
root = et.XML('<html version="5.0">This is an HTML file<head>This is the head of that file</head><title bgcolor="red" fontsize="22">This is the title of that file</title><body fontsize="15">This is the body of that file and would contain paragraphs etc</body></html>')
root[1].text = "The title text has changed!"
print(et.tostring(root, xml_declaration=True).decode('utf-8'))
Output:
<?xml version='1.0' encoding='ASCII'?>
<html version="5.0">This is an HTML file<head>This is the head of that file</head><title bgcolor="red" fontsize="22">The title text has changed!</title><body fontsize="15">This is the body of that file and would contain paragraphs etc</body></html>
As you can see, we successfully changed some text in the HTML document. The XML declaration was also automatically added because of the xml_declaration parameter that we passed to the tostring function.
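Real-world input is not always well formed. The sketch below, using made-up strings, shows how the XML parser rejects broken markup with an XMLSyntaxError, while et.HTML() uses the more forgiving HTML parser.

```python
from lxml import etree as et

# et.fromstring() parses well-formed XML, just like et.XML()
doc = et.fromstring("<root><item>1</item></root>")
print(doc[0].text)

# Malformed XML raises XMLSyntaxError, which we can catch
try:
    et.fromstring("<root><item></root>")
    error = None
except et.XMLSyntaxError as exc:
    error = str(exc)
print("parse error:", error)

# The HTML parser tolerates unclosed tags and adds the missing
# <html> and <body> wrappers around the content
page = et.HTML("<p>first<p>second")
paragraphs = [p.text for p in page.iter("p")]
print(page.tag, paragraphs)
```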
Searching for Elements
The last thing we’re going to discuss is quite handy when parsing XML and HTML files. We will look at ways to check whether an Element has any particular type of children and, if it does, what they contain.
This has many practical use-cases, such as finding all of the link elements on a particular web page.
print(root.find('a')) # No <a> tags exist, so this will be `None`
print(root.find('head').tag)
print(root.findtext('title')) # Directly retrieve the title tag’s text
Output:
None
head
This is the title of that file
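For the link-finding use case mentioned above, a minimal sketch could look like this. The page content is made up; find, findall and xpath are the lxml calls being illustrated.

```python
from lxml import etree as et

page = et.HTML(
    '<html><body><a href="/home">Home</a>'
    '<p>Text <a href="/about">About</a></p></body></html>'
)

# A bare tag name only matches direct children, so this finds nothing
direct = page.find("a")

# './/a' searches all descendants; find() returns the first match,
# findall() returns every match
first_link = page.find(".//a")
all_links = page.findall(".//a")

# XPath can select the attribute values directly
hrefs = page.xpath("//a/@href")

print(direct, first_link.text, len(all_links), hrefs)
```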
Conclusion
In the above tutorial, we started with a basic introduction to what the lxml library is and what it is used for. After that, we learned how to install it in different environments. Moving on, we explored different functionalities that help us traverse the HTML/XML tree vertically as well as sideways. In the end, we also discussed ways to find elements in our tree, as well as obtain information from them.

Frequently Asked Questions about lxml

What is meant by lxml?

lxml is a Python library which allows for easy handling of XML and HTML files, and can also be used for web scraping. (Apr 10, 2019)

What is the difference between XML and lxml?

For most normal XML operations including building document trees and simple searching and parsing of element attributes and node values, even namespaces, ElementTree is a reliable handler. lxml is a third-party module that requires installation. (Apr 2, 2018)

What is lxml in BeautifulSoup?

BeautifulSoup is a Python package that parses broken HTML, just like lxml supports it based on the parser of libxml2. … To prevent users from having to choose their parser library in advance, lxml can interface to the parsing capabilities of BeautifulSoup through the lxml.html.soupparser module.
