21.10. `urllib.robotparser` - Parser for robots.txt

Source code: Lib/urllib/robotparser.py

This module provides a single class, RobotFileParser, which answers questions about whether or not a particular user agent can fetch a URL on the Web site that published the robots.txt file. For more details on the structure of robots.txt files, see http://www.robotstxt.org/orig.html.

class urllib.robotparser.RobotFileParser(url='')

This class provides methods to read, parse and answer questions about the robots.txt file at url.

set_url(url): Sets the URL referring to a robots.txt file.

read(): Reads the robots.txt URL and feeds it to the parser.
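As a minimal sketch of combining set_url() and read(), the helper below wraps the fetch so that a crawler can decide what to do when the file cannot be retrieved. The name load_robots is hypothetical and not part of this module, and the sketch assumes that a failure to reach the host surfaces as urllib.error.URLError or another OSError.

```python
import urllib.robotparser


def load_robots(robots_url):
    """Fetch and parse robots_url; return the parser, or None on failure.

    load_robots is a hypothetical helper, not part of urllib.robotparser.
    """
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(robots_url)
    try:
        rp.read()
    except OSError:
        # Assumption: connection-level failures (unknown host, refused
        # connection, timeout) arrive as URLError, a subclass of OSError.
        return None
    return rp
```

A caller would pass the full robots.txt address, for example load_robots("http://www.musi-cal.com/robots.txt"), and treat a None result as "rules unknown".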

parse(lines): Parses the lines argument, a sequence of lines from a robots.txt file.
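Because parse() takes the lines directly, rules can be loaded without any network access, which is handy for tests. The sketch below uses made-up robots.txt content and an example.com URL purely for illustration.

```python
import urllib.robotparser

# Hypothetical robots.txt content, used only for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())  # feed the text one line at a time

allowed_home = rp.can_fetch("*", "http://www.example.com/index.html")
allowed_private = rp.can_fetch("*", "http://www.example.com/private/data.html")
print(allowed_home, allowed_private)
```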

can_fetch(useragent, url): Returns True if useragent is allowed to fetch the url according to the rules contained in the parsed robots.txt file.
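The following sketch shows how the answer from can_fetch() can differ per user agent. The crawler names ExampleBot and OtherBot, the rules, and the URLs are all invented for this illustration.

```python
import urllib.robotparser

# Hypothetical rules: one group for "ExampleBot", one default group.
rules = [
    "User-agent: ExampleBot",
    "Disallow: /drafts/",
    "",
    "User-agent: *",
    "Disallow: /",
]

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules)

# ExampleBot is only kept out of /drafts/.
bot_ok = rp.can_fetch("ExampleBot/1.0", "http://www.example.com/articles/1")
bot_blocked = rp.can_fetch("ExampleBot/1.0", "http://www.example.com/drafts/1")
# Any other agent falls back to the "*" group, which disallows everything.
other_blocked = rp.can_fetch("OtherBot", "http://www.example.com/articles/1")
```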

mtime(): Returns the time the robots.txt file was last fetched. This is useful for long-running web spiders that need to check for new robots.txt files periodically.

modified(): Sets the time the robots.txt file was last fetched to the current time.
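One way a long-running spider might use mtime() and modified() is sketched below. The helper needs_refresh and the one-day MAX_AGE_SECONDS interval are hypothetical, and the sketch assumes that mtime() returns a false value (0 in CPython) until a fetch time has been recorded.

```python
import time
import urllib.robotparser

MAX_AGE_SECONDS = 24 * 60 * 60  # hypothetical refresh interval (one day)


def needs_refresh(rp, max_age=MAX_AGE_SECONDS, now=None):
    """Return True if rp was never loaded or was loaded more than max_age ago.

    needs_refresh is a hypothetical helper, not part of urllib.robotparser.
    """
    if now is None:
        now = time.time()
    last = rp.mtime()
    return not last or (now - last) > max_age


rp = urllib.robotparser.RobotFileParser()
stale_before = needs_refresh(rp)  # nothing has been loaded yet

rp.parse(["User-agent: *", "Disallow:"])
rp.modified()  # record the current time as the last fetch time
stale_after = needs_refresh(rp)

# Simulate a check made just over one day later.
stale_later = needs_refresh(rp, now=rp.mtime() + MAX_AGE_SECONDS + 1)
```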

crawl_delay(useragent): Returns the value of the Crawl-delay parameter from robots.txt for the useragent in question. If there is no such parameter, it doesn't apply to the useragent specified, or the robots.txt entry for this parameter has invalid syntax, returns None.

New in version 3.6.
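Since crawl_delay() can return None, callers usually need a fallback. The sketch below shows one possible approach; the helper polite_delay, its one-second default, and the agent names are assumptions made for illustration.

```python
import urllib.robotparser

# Hypothetical rules: only "SlowBot" is given a Crawl-delay.
rules = [
    "User-agent: SlowBot",
    "Crawl-delay: 10",
    "Disallow: /tmp/",
    "",
    "User-agent: *",
    "Disallow: /tmp/",
]

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules)


def polite_delay(rp, useragent, fallback=1.0):
    """Return the Crawl-delay for useragent, or fallback if none is set.

    polite_delay is a hypothetical helper, not part of urllib.robotparser.
    """
    delay = rp.crawl_delay(useragent)
    return fallback if delay is None else delay


slow_delay = polite_delay(rp, "SlowBot")    # taken from robots.txt
other_delay = polite_delay(rp, "OtherBot")  # no Crawl-delay, so the fallback
```

A crawler could then pass the result to time.sleep() between requests.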

request_rate(useragent): Returns the contents of the Request-rate parameter from robots.txt as a named tuple RequestRate(requests, seconds). If there is no such parameter, it doesn't apply to the useragent specified, or the robots.txt entry for this parameter has invalid syntax, returns None.

New in version 3.6.
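A Request-rate of 3/20 means three requests per twenty seconds. As a rough sketch, the named tuple can be turned into a minimum spacing between requests; the helper min_interval and the rules shown are hypothetical.

```python
import urllib.robotparser

# Hypothetical rules with a Request-rate for every agent.
rules = [
    "User-agent: *",
    "Request-rate: 3/20",
    "Disallow: /cgi-bin/",
]

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules)


def min_interval(rp, useragent):
    """Return the seconds to wait between requests, or None if unspecified.

    min_interval is a hypothetical helper, not part of urllib.robotparser.
    """
    rate = rp.request_rate(useragent)
    if rate is None or rate.requests == 0:
        return None
    return rate.seconds / rate.requests


rate = rp.request_rate("*")
interval = min_interval(rp, "*")  # 20 seconds spread over 3 requests
```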

The following example demonstrates basic use of the RobotFileParser class:

>>> import urllib.robotparser
>>> rp = urllib.robotparser.RobotFileParser()
>>> rp.set_url("http://www.musi-cal.com/robots.txt")
>>> rp.read()
>>> rrate = rp.request_rate("*")
>>> rrate.requests
3
>>> rrate.seconds
20
>>> rp.crawl_delay("*")
6
>>> rp.can_fetch("*", "http://www.musi-cal.com/cgi-bin/search?city=San+Francisco")
False
>>> rp.can_fetch("*", "http://www.musi-cal.com/")
True
