BeautifulSoup Parser – lxml
BeautifulSoup is a Python package that parses broken HTML. While libxml2
(and thus lxml) can also parse broken HTML, BeautifulSoup is a bit more
forgiving and has superior support for encoding detection.
lxml can benefit from the parsing capabilities of BeautifulSoup
through the lxml.html.soupparser module. It provides three main
functions: fromstring() and parse() to parse a string or file
using BeautifulSoup, and convert_tree() to convert an existing
BeautifulSoup tree into a list of top-level Elements.
The functions fromstring() and parse() behave as known from
ElementTree. The first returns a root Element, the latter returns an ElementTree.
There is also a legacy module called lxml.html.ElementSoup, which
mimics the interface provided by ElementTree’s own ElementSoup
module. Note that the soupparser module was added in lxml 2.0.3.
Previous versions of lxml 2.x only have the ElementSoup module.
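As a rough illustration of the three entry points, the sketch below runs each of them on a small, made-up fragment. It assumes that lxml and the bs4 package (BeautifulSoup 4) are installed; the sample markup and the variable names are invented for this example.

```python
import io

from bs4 import BeautifulSoup
from lxml.html import soupparser

# Hypothetical sample input: two top-level elements, the second one unclosed.
markup = "<p>one</p><p>two"

# fromstring() returns a single root Element.
root = soupparser.fromstring(markup)

# parse() accepts a file name or file-like object and returns an ElementTree.
tree = soupparser.parse(io.StringIO(markup))

# convert_tree() takes an existing BeautifulSoup tree and returns a list
# of its top-level Elements.
soup = BeautifulSoup(markup, "html.parser")
elements = soupparser.convert_tree(soup)

print(root.tag)
print(tree.getroot().tag)
print([el.tag for el in elements])
```

Because a soup may have several top-level elements, convert_tree() hands back a list rather than a single root.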
Here is a document full of tag soup, similar to, but not quite like, HTML:
>>> tag_soup = ‘
All you need to do is pass it to the fromstring() function:
>>> from lxml.html.soupparser import fromstring
>>> root = fromstring(tag_soup)
To see what we have here, you can serialise it:
>>> from lxml.etree import tostring
>>> print(tostring(root, pretty_print=True, encoding="unicode"))
Not quite what you’d expect from an HTML page, but, well, it was broken
already, right? BeautifulSoup did its best, and so now it’s a tree.
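For readers who want to try the round trip end to end, here is a minimal sketch. The tag soup string is a hypothetical stand-in, not the document from the original example, and the code assumes lxml and bs4 are installed.

```python
from lxml.etree import tostring
from lxml.html.soupparser import fromstring

# Hypothetical tag soup: no <html> element, an unquoted attribute value
# and tags that are never closed.
tag_soup = "<head><title>Hello</title></head><body onload=crash()>Hi all<p>"

# BeautifulSoup parses the soup; lxml receives the result as an Element tree.
root = fromstring(tag_soup)
body = root.find(".//body")

# Serialise to text to inspect the tree that was built.
html_text = tostring(root, pretty_print=True, encoding="unicode")
print(html_text)
```

The exact shape of the resulting tree depends on the BeautifulSoup version and the parser it uses underneath, so it is worth printing the output rather than assuming a particular structure.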
To control which Element implementation is used, you can pass a
makeelement factory function to parse() and fromstring().
By default, this is based on the HTML parser defined in lxml.html.
By default, the BeautifulSoup parser also replaces the entities it
finds by their character equivalent.
>>> tag_soup = '<body>&copy;&euro;&#45;&#245;&#445;'
>>> body = fromstring(tag_soup).find('.//body')
If you want them back on the way out, you can just serialise with the
default encoding, which is ‘US-ASCII’.
>>> tostring(body, method="html")
Any other encoding will output the respective byte sequences.
>>> tostring(body, encoding="utf-8")
>>> tostring(body, method="html", encoding="utf-8")
>>> tostring(body, encoding="unicode")
>>> tostring(body, method="html", encoding="unicode")
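The relationship between entities, characters and the output encoding can be illustrated without lxml at all. The following sketch uses only the Python standard library; it mimics the effect described above and is not how lxml or BeautifulSoup implement it internally.

```python
import html

# Entity and character references, as they might appear in a document.
source = "&copy;&euro;&#45;&#245;&#445;"

# Parsing replaces each reference with its character equivalent.
text = html.unescape(source)

# An ASCII serialisation cannot represent the non-ASCII characters directly,
# so they are written out as numeric character references.
as_ascii = text.encode("ascii", errors="xmlcharrefreplace")

# A UTF-8 serialisation writes the characters as multi-byte sequences.
as_utf8 = text.encode("utf-8")

print(text)
print(as_ascii)
print(as_utf8)
```

Note that the named entities do not come back by name: the ASCII form uses numeric references, which is also why plain ASCII output stays lossless.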
The downside of using this parser is that it is much slower than
the HTML parser of lxml. So if performance matters, you might want to
consider using soupparser only as a fallback for certain cases.
One common problem of lxml’s parser is that it might not get the
encoding right in cases where the document contains a <meta> tag
at the wrong place. In this case, you can exploit the fact that lxml
serialises much faster than most other HTML libraries for Python.
Just serialise the document to unicode and if that gives you an
exception, re-parse it with BeautifulSoup to see if that works better:
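>>> tag_soup = '''… … …'''
>>> import lxml.html
>>> import lxml.html.soupparser
>>> root = lxml.html.fromstring(tag_soup)
>>> try:
...     ignore = tostring(root, encoding="unicode")
... except UnicodeDecodeError:
...     root = lxml.html.soupparser.fromstring(tag_soup)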
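The same fallback idea can be wrapped in a small reusable helper. This is only a sketch: parse_with_fallback, parse_html and their parameter names are hypothetical, and the lxml wiring assumes that lxml and a BeautifulSoup package are installed.

```python
def parse_with_fallback(data, fast_parse, slow_parse, serialise):
    """Parse with fast_parse; retry with slow_parse if serialising fails.

    All names here are hypothetical and chosen for this sketch.
    """
    root = fast_parse(data)
    try:
        # Serialising to text is cheap and exposes undecodable bytes.
        serialise(root)
    except UnicodeDecodeError:
        # The fast parser probably guessed the wrong encoding.
        root = slow_parse(data)
    return root


def parse_html(data):
    """Wire the helper to lxml's HTML parser and the BeautifulSoup parser."""
    # Imported lazily so the helper above has no hard dependency on lxml.
    import lxml.html
    import lxml.html.soupparser
    from lxml.etree import tostring

    return parse_with_fallback(
        data,
        fast_parse=lxml.html.fromstring,
        slow_parse=lxml.html.soupparser.fromstring,
        serialise=lambda root: tostring(root, encoding="unicode"),
    )
```

Keeping the parsers as parameters makes the fallback logic easy to test with stand-in functions, and the slower parser only runs for the documents that actually need it.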
BeautifulSoup and lxml.html – what to prefer? – Stack Overflow
In summary, lxml is positioned as a lightning-fast production-quality HTML and XML parser that, by the way, also includes a soupparser module to fall back on BeautifulSoup’s functionality. BeautifulSoup is a one-person project, designed to save you time to quickly extract data out of poorly-formed HTML or XML.
lxml documentation says that both parsers have advantages and disadvantages. For this reason, lxml provides a soupparser so you can switch back and forth. Quoting,
BeautifulSoup uses a different parsing approach. It is not a real HTML
parser but uses regular expressions to dive through tag soup. It is
therefore more forgiving in some cases and less good in others. It is
not uncommon that lxml/libxml2 parses and fixes broken HTML better,
but BeautifulSoup has superior support for encoding detection. It
very much depends on the input which parser works better.
In the end they are saying,
The downside of using this parser is that it is much slower than
the HTML parser of lxml. So if performance matters, you might want
to consider using soupparser only as a fallback for certain cases.
If I understand them correctly, it means that the soup parser is more robust — it can deal with a “soup” of malformed tags by using regular expressions — whereas lxml is more straightforward and just parses things and builds a tree as you would expect. I assume it also applies to BeautifulSoup itself, not just to the soupparser for lxml.
They also show how to benefit from BeautifulSoup’s encoding detection, while still parsing quickly with lxml:
>>> from BeautifulSoup import UnicodeDammit
>>> def decode_html(html_string):
...     converted = UnicodeDammit(html_string, isHTML=True)
...     if not converted.unicode:
...         raise UnicodeDecodeError(
...             "Failed to detect encoding, tried [%s]",
...             ', '.join(converted.triedEncodings))
...     # print converted.originalEncoding
...     return converted.unicode
>>> root = lxml.html.fromstring(decode_html(tag_soup))
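The snippet above targets the BeautifulSoup 3 API. A rough equivalent for the bs4 package (BeautifulSoup 4) might look like the sketch below. It assumes bs4 is installed, and the sample bytes are invented. It raises ValueError rather than UnicodeDecodeError, because the UnicodeDecodeError constructor requires five arguments.

```python
from bs4 import UnicodeDammit


def decode_html(html_bytes):
    """Return html_bytes decoded to str using BeautifulSoup's detection."""
    converted = UnicodeDammit(html_bytes, is_html=True)
    if converted.unicode_markup is None:
        # tried_encodings lists the encodings that were attempted.
        tried = ", ".join(str(enc) for enc in converted.tried_encodings)
        raise ValueError("Failed to detect encoding, tried [%s]" % tried)
    return converted.unicode_markup


# Hypothetical input: UTF-8 bytes with a matching <meta> declaration.
sample = b'<html><head><meta charset="utf-8"></head><body>caf\xc3\xa9</body></html>'
decoded = decode_html(sample)
print(decoded)
```

The decoded string can then be passed to lxml's HTML parser, so the slow part of BeautifulSoup is limited to encoding detection.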
In words of BeautifulSoup’s creator,
That’s it! Have fun! I wrote Beautiful Soup to save everybody time.
Once you get used to it, you should be able to wrangle data out of
poorly-designed websites in just a few minutes. Send me email if you
have any comments, run into problems, or want me to know about your
project that uses Beautiful Soup.
Quoted from the Beautiful Soup documentation.
I hope this is now clear. The soup is a brilliant one-person project designed to save you time to extract data out of poorly-designed websites. The goal is to save you time right now, to get the job done, not necessarily to save you time in the long term, and definitely not to optimize the performance of your software.
Also, from the lxml website,
lxml has been downloaded from the Python Package Index more than two
million times and is also available directly in many package
distributions, e.g. for Linux or MacOS-X.
And, from Why lxml?,
The C libraries libxml2 and libxslt have huge benefits:…
Standards-compliant… Full-featured… fast. fast! FAST!… lxml
is a new Python binding for libxml2 and libxslt…
> 1. lxml is way faster than BeautifulSoup – this may not matter if all you’re waiting for is the network. But if you’re parsing something on disk, this may be significant.
lxml’s HTML parser is garbage, so is BS’s: they will parse pages in non-obvious ways which do not reflect what you see in your browser, because your browser follows HTML5 tree building. html5lib fixes that (and can construct both lxml and bs trees, and both libraries have html5lib integration), however it’s slow. I don’t know that there is a native compatible parser (there are plenty of native HTML5 parsers, e.g. gumbo or html5ever, but I don’t remember them being able to generate lxml or bs trees).
> 2. Don’t forget to check the status code of r (r.status_code or less generally …)
Alternatively (depending on use case) `r.raise_for_status()`. I’m still annoyed that there’s no way to ask requests to just check it outright.
> Those with a background in coding might prefer the .cssselect method available in whatever object the parsed document results in. That’s obviously a tad slower than find/findall/xpath, but it’s oftentimes too convenient to pass up.
cssselect simply translates CSS selectors to XPath, and while I don’t know for sure, I’m guessing it has an expression cache, so it should not be noticeably slower than XPath (CSS selectors are not a hugely complex language anyway).
Frequently Asked Questions about lxml vs beautifulsoup
Is lxml faster than BeautifulSoup?
lxml is way faster than BeautifulSoup – this may not matter if all you’re waiting for is the network. But if you’re parsing something on disk, this may be significant. … html5lib fixes that (and can construct both lxml and bs trees, and both libraries have html5lib integration), however it’s slow. (Oct 24, 2017)
Which is better BeautifulSoup or lxml?
It is not uncommon that lxml/libxml2 parses and fixes broken HTML better, but BeautifulSoup has superior support for encoding detection. It very much depends on the input which parser works better. In the end they are saying: the downside of using this parser is that it is much slower than the HTML parser of lxml. (Oct 24, 2013)
What is lxml in BeautifulSoup?
BeautifulSoup is a Python package that parses broken HTML, just like lxml supports it based on the parser of libxml2. … To prevent users from having to choose their parser library in advance, lxml can interface to the parsing capabilities of BeautifulSoup through the lxml.html.soupparser module.