urllib.robotparser — Parser for robots.txt
Source code: Lib/urllib/robotparser.py
This module provides a single class, RobotFileParser, which answers questions about whether or not a particular user agent can fetch a URL on the web site that published the robots.txt file. For more details on the structure of robots.txt files, see http://www.robotstxt.org/orig.html.
class urllib.robotparser.RobotFileParser(url='')
This class provides methods to read, parse and answer questions about the robots.txt file at url.
set_url(url)
Sets the URL referring to a robots.txt file.
read()
Reads the robots.txt URL and feeds it to the parser.
parse(lines)
Parses the lines argument.
can_fetch(useragent, url)
Returns True if the useragent is allowed to fetch the url according to the rules contained in the parsed robots.txt file.
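parse() accepts the lines of a robots.txt file directly, so rules can be checked with can_fetch() without fetching anything over the network. The following sketch uses made-up rules and URLs purely for illustration:

>>> import urllib.robotparser
>>> rp = urllib.robotparser.RobotFileParser()
>>> rp.parse([
...     "User-agent: *",
...     "Disallow: /private/",
... ])
>>> rp.can_fetch("*", "http://example.com/private/page.html")
False
>>> rp.can_fetch("*", "http://example.com/index.html")
True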
mtime()
Returns the time the robots.txt file was last fetched. This is useful for long-running web spiders that need to check for new robots.txt files periodically.
modified()
Sets the time the robots.txt file was last fetched to the current time.
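A long-running spider might combine mtime() and modified() with read() roughly as follows; the host name, refresh interval and helper function are assumptions chosen for this sketch, not part of the module:

import time
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")   # hypothetical site
rp.read()

MAX_AGE = 24 * 60 * 60   # arbitrary choice: refresh robots.txt once a day

def can_fetch_fresh(useragent, url):
    # mtime() reports when robots.txt was last fetched; re-read it when stale.
    if time.time() - rp.mtime() > MAX_AGE:
        rp.read()
        rp.modified()   # record the time of this check explicitly
    return rp.can_fetch(useragent, url)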
crawl_delay(useragent)
Returns the value of the Crawl-delay parameter from robots.txt for the useragent in question. If there is no such parameter or it doesn't apply to the useragent specified or the robots.txt entry for this parameter has invalid syntax, return None.
New in version 3.6.
request_rate(useragent)
Returns the contents of the Request-rate parameter from robots.txt as a named tuple RequestRate(requests, seconds). If there is no such parameter or it doesn't apply to the useragent specified or the robots.txt entry for this parameter has invalid syntax, return None.
New in version 3.6.
site_maps()
Returns the contents of the Sitemap parameter from robots.txt in the form of a list(). If there is no such parameter or the robots.txt entry for this parameter has invalid syntax, return None.
New in version 3.8.
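The following hedged sketch shows crawl_delay(), request_rate() and site_maps() reading values back from a small hand-written robots.txt; the directives and the sitemap URL are invented for the example:

>>> import urllib.robotparser
>>> lines = [
...     "User-agent: *",
...     "Crawl-delay: 10",
...     "Request-rate: 3/20",
...     "Sitemap: https://example.com/sitemap.xml",
... ]
>>> rp = urllib.robotparser.RobotFileParser()
>>> rp.parse(lines)
>>> rp.crawl_delay("*")
10
>>> rp.request_rate("*")
RequestRate(requests=3, seconds=20)
>>> rp.site_maps()
['https://example.com/sitemap.xml']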
The following example demonstrates basic use of the RobotFileParser class:
>>> import urllib.robotparser
>>> rp = urllib.robotparser.RobotFileParser()
>>> rp.set_url("http://www.musi-cal.com/robots.txt")
>>> rp.read()
>>> rrate = rp.request_rate("*")
>>> rrate.requests
3
>>> rrate.seconds
20
>>> rp.crawl_delay("*")
6
>>> rp.can_fetch("*", "http://www.musi-cal.com/cgi-bin/search?city=San+Francisco")
False
>>> rp.can_fetch("*", "http://www.musi-cal.com/")
True