html.parser — Simple HTML and XHTML parser … – Python Docs
Source code: Lib/html/
This module defines a class HTMLParser which serves as the basis for
parsing text files formatted in HTML (HyperText Mark-up Language) and XHTML.
class (*, convert_charrefs=True)¶
Create a parser instance able to parse invalid markup.
If convert_charrefs is True (the default), all character
references (except the ones in script/style elements) are
automatically converted to the corresponding Unicode characters.
An HTMLParser instance is fed HTML data and calls handler methods
when start tags, end tags, text, comments, and other markup elements are
encountered. The user should subclass HTMLParser and override its
methods to implement the desired behavior.
This parser does not check that end tags match start tags or call the end-tag
handler for elements which are closed implicitly by closing an outer element.
Changed in version 3. 4: convert_charrefs keyword argument added.
Changed in version 3. 5: The default value for argument convert_charrefs is now True.
Example HTML Parser Application¶
As a basic example, below is a simple HTML parser that uses the
HTMLParser class to print out start tags, end tags, and data
as they are encountered:
from import HTMLParser
class MyHTMLParser(HTMLParser):
def handle_starttag(self, tag, attrs):
print(“Encountered a start tag:”, tag)
def handle_endtag(self, tag):
print(“Encountered an end tag:”, tag)
def handle_data(self, data):
print(“Encountered some data:”, data)
parser = MyHTMLParser()
(‘
‘
Parse me!
‘)
The output will then be:
Encountered a start tag: html
Encountered a start tag: head
Encountered a start tag: title
Encountered some data: Test
Encountered an end tag: title
Encountered an end tag: head
Encountered a start tag: body
Encountered a start tag: h1
Encountered some data: Parse me!
Encountered an end tag: h1
Encountered an end tag: body
Encountered an end tag: html
HTMLParser Methods¶
HTMLParser instances have the following methods:
(data)¶
Feed some text to the parser. It is processed insofar as it consists of
complete elements; incomplete data is buffered until more data is fed or
close() is called. data must be str.
()¶
Force processing of all buffered data as if it were followed by an end-of-file
mark. This method may be redefined by a derived class to define additional
processing at the end of the input, but the redefined version should always call
the HTMLParser base class method close().
Reset the instance. Loses all unprocessed data. This is called implicitly at
instantiation time.
Return current line number and offset.
t_starttag_text()¶
Return the text of the most recently opened start tag. This should not normally
be needed for structured processing, but may be useful in dealing with HTML “as
deployed” or for re-generating input with minimal changes (whitespace between
attributes can be preserved, etc. ).
The following methods are called when data or markup elements are encountered
and they are meant to be overridden in a subclass. The base class
implementations do nothing (except for handle_startendtag()):
HTMLParser. handle_starttag(tag, attrs)¶
This method is called to handle the start of a tag (e. g.
The tag argument is the name of the tag converted to lower case. The attrs
argument is a list of (name, value) pairs containing the attributes found
inside the tag’s <> brackets. The name will be translated to lower case,
and quotes in the value have been removed, and character and entity references
have been replaced.
For instance, for the tag ‘)
Decl: DOCTYPE HTML PUBLIC “-//W3C//DTD HTML 4. 01//EN” ”
Parsing an element with a few attributes and a title:
>>> (‘
Start tag: img
attr: (‘src’, ”)
attr: (‘alt’, ‘The Python logo’)
>>>
>>> (‘
Python
‘)
Start tag: h1
Data: Python
End tag: h1
The content of script and style elements is returned as is, without
further parsing:
>>> (‘
‘)
Start tag: style
attr: (‘type’, ‘text/css’)
Data: #python { color: green}
End tag: style
>>> (‘‘)
Start tag: script
attr: (‘type’, ‘text/javascript’)
Data: alert(“hello! “);
End tag: script
Parsing comments:
>>> (‘‘… ‘IE-specific content‘)
Comment: a comment
Comment: [if IE 9]>IE-specific content‘):
>>> (‘>>>’)
Named ent: >
Num ent: >
Feeding incomplete chunks to feed() works, but
handle_data() might be called more than once
(unless convert_charrefs is set to True):
>>> for chunk in [‘
Start tag: span
Data: buff
Data: ered
Data: text
End tag: span
Parsing invalid HTML (e. unquoted attributes) also works:
>>> (‘
Start tag: p
Start tag: a
attr: (‘class’, ‘link’)
attr: (‘href’, ‘#main’)
Data: tag soup
End tag: p
End tag: a

HTMLParser in Python 3.x – AskPython
provides a very simple and efficient way for coders to read through HTML code. This library comes pre-installed in the stdlib. This simplifies our interfacing with the HTMLParser library as we do not need to install additional packages from the Python Package Index (PyPI) for the same is HTMLParser? Essentially, HTMLParser lets us understand HTML code in a nested fashion. The module has methods that are automatically called when specific HTML elements are met with. It simplifies HTML tags and data fed with HTML data, the tag reads through it one tag at a time, going from start tags to the tags within, then the end tags and so to Use HTMLParser? HTMLParser only identifies the tags or data for us but does not output any data when something is identified. We need to add functionality to the methods before they can output the information they if we need to add functionality, what’s the use of the HTMLParser? This module saves us the time of creating the functionality of identifying tags ’re not going to code how to identify the tags, only what to do once they’re identified. Understood? Great! Now let’s get into creating a parser for ourselves! Subclassing the HTMLParserHow can we add functionality to the HTMLParser methods? By subclassing. Also identified as Inheritance, we create a class that retains the behavior of HTMLParser, while adding more bclassing lets us override the default functionality of a method (which in our case, is to return nothing when tags are identified) and add some better functions instead. Let’s see how to work with the HTMLParser nding Names of The Called MethodsThere are many methods available within the module. We’ll go over the ones you’d need frequently and then learn how to make use of MLParser. handle_starttag(tag, attrs) – Called when start tags are found (example ,
, )HTMLParser. handle_endtag(tag) – Called when end tags are found (example