20.2. `html.parser` - Simple HTML and XHTML parser

This module defines a class HTMLParser which serves as the basis for parsing text files formatted in HTML (HyperText Markup Language) and XHTML.

class html.parser.HTMLParser(strict=False, *, convert_charrefs=False)

An exception is defined as well:

exception html.parser.HTMLParseError
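The convert_charrefs keyword in the constructor signature above controls whether character references such as &gt; reach handle_data() already decoded. The following is a minimal sketch of that difference, not the library's own example. TextCollector and collect are hypothetical names introduced here for illustration, and both values are passed explicitly because the default has differed between Python versions.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper that records text chunks and named references."""

    def __init__(self, *, convert_charrefs):
        super().__init__(convert_charrefs=convert_charrefs)
        self.chunks = []   # arguments passed to handle_data()
        self.refs = []     # arguments passed to handle_entityref()

    def handle_data(self, data):
        self.chunks.append(data)

    def handle_entityref(self, name):
        self.refs.append(name)


def collect(markup, *, convert_charrefs):
    """Parse markup once and return (text chunks, named references)."""
    parser = TextCollector(convert_charrefs=convert_charrefs)
    parser.feed(markup)
    parser.close()
    return parser.chunks, parser.refs


# With conversion on, the reference arrives as part of the text.
converted = collect("<p>a &gt; b</p>", convert_charrefs=True)
# With conversion off, the text is split around a handle_entityref() call.
raw = collect("<p>a &gt; b</p>", convert_charrefs=False)
print(converted)
print(raw)
```

With conversion off, the subclass is responsible for decoding each reference itself, as the handle_entityref() and handle_charref() overrides in the examples section do.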

20.2.1. Example HTML Parser Application

As a basic example, below is a simple HTML parser that uses the HTMLParser class to print out start tags, end tags, and data as they are encountered:

from html.parser import HTMLParser
class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Encountered a start tag:", tag)
    def handle_endtag(self, tag):
        print("Encountered an end tag :", tag)
    def handle_data(self, data):
        print("Encountered some data  :", data)
parser = MyHTMLParser()
parser.feed('<html><head><title>Test</title></head>'
            '<body><h1>Parse me!</h1></body></html>')

The output will then be:

Encountered a start tag: html
Encountered a start tag: head
Encountered a start tag: title
Encountered some data  : Test
Encountered an end tag : title
Encountered an end tag : head
Encountered a start tag: body
Encountered a start tag: h1
Encountered some data  : Parse me!
Encountered an end tag : h1
Encountered an end tag : body
Encountered an end tag : html

20.2.2. `HTMLParser` Methods

HTMLParser instances have the following methods:

HTMLParser.feed(data)

HTMLParser.close()

HTMLParser.reset()

HTMLParser.getpos()

HTMLParser.get_starttag_text()
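As a rough illustration of how these five methods fit together, the sketch below feeds input in two pieces, records the position and raw text of each start tag, and then reuses the same instance after a reset. PositionRecorder and its seen attribute are hypothetical names, not part of the module.

```python
from html.parser import HTMLParser


class PositionRecorder(HTMLParser):
    """Hypothetical subclass that notes where each start tag was found."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # getpos() gives (line number, offset) of the tag being handled;
        # get_starttag_text() gives the raw text of that start tag.
        self.seen.append((tag, self.getpos(), self.get_starttag_text()))


recorder = PositionRecorder()
recorder.feed('<p>\n<a href="x">link')  # input may arrive in pieces
recorder.feed('</a></p>')
recorder.close()                        # process anything still buffered
first_pass = list(recorder.seen)

recorder.reset()                        # drop buffered data and position
recorder.seen.clear()
recorder.feed('<b>again</b>')
recorder.close()
second_pass = list(recorder.seen)

print(first_pass)
print(second_pass)
```

Line numbers start at 1 and offsets at 0, so the first tag of a document is reported at (1, 0).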

The following methods are called when data or markup elements are encountered and they are meant to be overridden in a subclass. The base class implementations do nothing (except for handle_startendtag()):

HTMLParser.handle_starttag(tag, attrs)

HTMLParser.handle_endtag(tag)

HTMLParser.handle_startendtag(tag, attrs)

HTMLParser.handle_data(data)

HTMLParser.handle_entityref(name)

HTMLParser.handle_charref(name)

HTMLParser.handle_comment(data)

HTMLParser.handle_decl(decl)

HTMLParser.handle_pi(data)

HTMLParser.unknown_decl(data)
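The sketch below overrides a few of these handlers to show the shape of their arguments: attrs arrives as a list of (name, value) pairs, and an XHTML-style empty tag goes through the base handle_startendtag(), which in turn calls handle_starttag() and handle_endtag(). EventLogger and events are hypothetical names used only for this illustration.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical subclass that logs handler calls as tuples."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; dict() is used here
        # for readability and would drop repeated attribute names.
        self.events.append(("start", tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_comment(self, data):
        self.events.append(("comment", data))


logger = EventLogger()
logger.feed('<br class="x"/><!--note-->')
logger.close()
for event in logger.events:
    print(event)
```

Overriding handle_startendtag() directly is only needed when an empty tag must be told apart from an element that is opened and closed separately.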

20.2.3. Examples

The following class implements a parser that will be used to illustrate more examples:

from html.parser import HTMLParser
from html.entities import name2codepoint
class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Start tag:", tag)
        for attr in attrs:
            print("     attr:", attr)
    def handle_endtag(self, tag):
        print("End tag  :", tag)
    def handle_data(self, data):
        print("Data     :", data)
    def handle_comment(self, data):
        print("Comment  :", data)
    def handle_entityref(self, name):
        c = chr(name2codepoint[name])
        print("Named ent:", c)
    def handle_charref(self, name):
        if name.startswith('x'):
            c = chr(int(name[1:], 16))
        else:
            c = chr(int(name))
        print("Num ent  :", c)
    def handle_decl(self, data):
        print("Decl     :", data)
parser = MyHTMLParser()

Parsing a doctype:

>>> parser.feed('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" '
...             '"http://www.w3.org/TR/html4/strict.dtd">')
Decl     : DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"

Parsing an element with a few attributes and a title:

>>> parser.feed('<img src="python-logo.png" alt="The Python logo">')
Start tag: img
     attr: ('src', 'python-logo.png')
     attr: ('alt', 'The Python logo')
>>>
>>> parser.feed('<h1>Python</h1>')
Start tag: h1
Data     : Python
End tag  : h1

The content of script and style elements is returned as is, without further parsing:

>>> parser.feed('<style type="text/css">#python { color: green }</style>')
Start tag: style
     attr: ('type', 'text/css')
Data     : #python { color: green }
End tag  : style
>>>
>>> parser.feed('<script type="text/javascript">'
...             'alert("<strong>hello!</strong>");</script>')
Start tag: script
     attr: ('type', 'text/javascript')
Data     : alert("<strong>hello!</strong>");
End tag  : script

Parsing comments:

>>> parser.feed('<!-- a comment -->'
...             '<!--[if IE 9]>IE-specific content<![endif]-->')
Comment  :  a comment
Comment  : [if IE 9]>IE-specific content<![endif]

Parsing named and numeric character references and converting them to the correct char (note: these 3 references are all equivalent to '>'):

>>> parser.feed('&gt;&#62;&#x3E;')
Named ent: >
Num ent  : >
Num ent  : >

Feeding incomplete chunks to feed() works, but handle_data() might be called more than once (unless convert_charrefs is set to True):

>>> for chunk in ['<sp', 'an>buff', 'ered ', 'text</s', 'pan>']:
...     parser.feed(chunk)
...
Start tag: span
Data     : buff
Data     : ered
Data     : text
End tag  : span
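Because handle_data() can fire several times for one run of text, a subclass that needs the whole string usually has to buffer the pieces itself. The following is one possible sketch of that pattern, not something the module provides; TextJoiner, _pieces, _flush and texts are hypothetical names.

```python
from html.parser import HTMLParser


class TextJoiner(HTMLParser):
    """Hypothetical subclass that rejoins text split across handle_data() calls."""

    def __init__(self):
        super().__init__()
        self._pieces = []   # fragments seen since the last tag
        self.texts = []     # one joined string per run of text

    def _flush(self):
        # Emit the buffered fragments as a single string, if there are any.
        if self._pieces:
            self.texts.append("".join(self._pieces))
            self._pieces.clear()

    def handle_data(self, data):
        self._pieces.append(data)

    def handle_starttag(self, tag, attrs):
        self._flush()

    def handle_endtag(self, tag):
        self._flush()

    def close(self):
        super().close()
        self._flush()       # text after the last tag


joiner = TextJoiner()
for chunk in ['<sp', 'an>buff', 'ered ', 'text</s', 'pan>']:
    joiner.feed(chunk)
joiner.close()
print(joiner.texts)
```

Flushing on every tag boundary keeps the buffer small and gives the same result no matter how the input was split into chunks.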

Parsing invalid HTML (e.g. unquoted attributes) also works:

>>> parser.feed('<p><a class=link href=#main>tag soup</p ></a>')
Start tag: p
Start tag: a
     attr: ('class', 'link')
     attr: ('href', '#main')
Data     : tag soup
End tag  : p
End tag  : a

Footnotes

[1]	For backward compatibility reasons strict mode does not raise exceptions for all non-compliant HTML. That is, some invalid HTML is tolerated even in strict mode.
