urllib.robotparser: Parser for robots.txt
Source code: Lib/urllib/robotparser.py
This module provides a single class, RobotFileParser, which answers questions about whether or not a particular user agent can fetch a URL on the website that published the robots.txt file. For more details on the structure of robots.txt files, see http://www.robotstxt.org/orig.html.
class urllib.robotparser.RobotFileParser(url='')
This class provides methods to read, parse and answer questions about the robots.txt file at url.
set_url(url)
Sets the URL referring to a robots.txt file.
read()
Reads the robots.txt URL and feeds it to the parser.
parse(lines)
Parses the lines argument.
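Because parse() takes the rules as a sequence of lines, a rule set can be loaded from a string without any network access. The following is a minimal sketch, assuming a made-up rule set and the placeholder host example.com (neither comes from this documentation):

```python
import urllib.robotparser

# Hypothetical robots.txt content, used only for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""

rp = urllib.robotparser.RobotFileParser()
# parse() expects a sequence of lines rather than one large string.
rp.parse(ROBOTS_TXT.splitlines())

home_allowed = rp.can_fetch("*", "http://example.com/")
private_allowed = rp.can_fetch("*", "http://example.com/private/page.html")
print(home_allowed, private_allowed)
```

Loading rules this way is convenient in tests, or when the robots.txt content was already downloaded by other means.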
can_fetch(useragent, url)
Returns True if the useragent is allowed to fetch the url according to the rules contained in the parsed robots.txt file.
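As an illustration of how the answer can differ per user agent, the sketch below parses a rule set with one agent-specific group and one catch-all group. The agent names ExampleBot and OtherBot, the rules, and the host example.com are all assumptions made for this example:

```python
import urllib.robotparser

# Hypothetical rules: block ExampleBot everywhere, and block everyone
# else only from /tmp/.
RULES = [
    "User-agent: ExampleBot",
    "Disallow: /",
    "",
    "User-agent: *",
    "Disallow: /tmp/",
]

rp = urllib.robotparser.RobotFileParser()
rp.parse(RULES)

# The agent-specific group applies to ExampleBot.
examplebot_index = rp.can_fetch("ExampleBot/1.0", "http://example.com/index.html")
# Other agents fall back to the "*" group.
otherbot_index = rp.can_fetch("OtherBot", "http://example.com/index.html")
otherbot_tmp = rp.can_fetch("OtherBot", "http://example.com/tmp/cache.html")
print(examplebot_index, otherbot_index, otherbot_tmp)
```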
mtime()
Returns the time the robots.txt file was last fetched. This is useful for long-running web spiders that need to check for new robots.txt files periodically.
modified()
Sets the time the robots.txt file was last fetched to the current time.
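One way a long-running spider might use mtime() and modified() together is to compare the recorded fetch time against a maximum age. The helper is_stale() below is hypothetical, not part of the module, and the one-hour threshold is an arbitrary assumption:

```python
import time
import urllib.robotparser


def is_stale(parser, max_age_seconds, now=None):
    """Return True if the parser's rules are older than max_age_seconds.

    Hypothetical helper; `now` can be injected to make testing easier.
    """
    if now is None:
        now = time.time()
    return now - parser.mtime() > max_age_seconds


rp = urllib.robotparser.RobotFileParser()
rp.modified()  # record the current time as the last fetch time

stale_now = is_stale(rp, max_age_seconds=3600)
# Pretend two hours have passed since the recorded fetch time.
stale_later = is_stale(rp, max_age_seconds=3600, now=rp.mtime() + 7200)
print(stale_now, stale_later)
```

When is_stale() reports True, a spider would typically fetch the rules again before making further requests.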
crawl_delay(useragent)
Returns the value of the Crawl-delay parameter from robots.txt for the useragent in question. If there is no such parameter, or it doesn't apply to the useragent specified, or the robots.txt entry for this parameter has invalid syntax, returns None.
New in version 3.6.
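Since crawl_delay() may return None, callers usually need a fallback. The sketch below wraps it in a hypothetical helper, polite_delay(), with an assumed default of one second; the rule sets are invented for illustration:

```python
import urllib.robotparser


def polite_delay(parser, useragent, default=1.0):
    """Return the Crawl-delay for useragent, or `default` if none applies.

    Hypothetical helper; the default value is an arbitrary assumption.
    """
    delay = parser.crawl_delay(useragent)
    return default if delay is None else delay


with_delay = urllib.robotparser.RobotFileParser()
with_delay.parse(["User-agent: *", "Crawl-delay: 6", "Disallow: /cgi-bin/"])

without_delay = urllib.robotparser.RobotFileParser()
without_delay.parse(["User-agent: *", "Disallow: /cgi-bin/"])

delay_a = polite_delay(with_delay, "*")     # uses the parsed Crawl-delay
delay_b = polite_delay(without_delay, "*")  # falls back to the default
print(delay_a, delay_b)
```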
request_rate(useragent)
Returns the contents of the Request-rate parameter from robots.txt as a named tuple RequestRate(requests, seconds). If there is no such parameter, or it doesn't apply to the useragent specified, or the robots.txt entry for this parameter has invalid syntax, returns None.
New in version 3.6.
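The named tuple can be turned into a minimum interval between requests. This is a minimal sketch with an invented rule set; dividing seconds by requests is one possible interpretation of the rate, not something the module does itself:

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
# Hypothetical rules: at most 3 requests every 20 seconds.
rp.parse(["User-agent: *", "Request-rate: 3/20", "Disallow: /cgi-bin/"])

rate = rp.request_rate("*")
if rate is not None:
    # Spread the allowed requests evenly over the window.
    seconds_per_request = rate.seconds / rate.requests
else:
    seconds_per_request = None
print(rate, seconds_per_request)
```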
site_maps()
Returns the contents of the Sitemap parameter from robots.txt in the form of a list. If there is no such parameter or the robots.txt entry for this parameter has invalid syntax, returns None.
New in version 3.8.
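Because site_maps() returns None rather than an empty list when no Sitemap entry is present, iterating over the result directly can fail. A small sketch, assuming an invented sitemap URL on the placeholder host example.com:

```python
import urllib.robotparser

with_sitemap = urllib.robotparser.RobotFileParser()
with_sitemap.parse([
    "Sitemap: https://example.com/sitemap.xml",  # hypothetical URL
    "User-agent: *",
    "Disallow: /private/",
])

no_sitemap = urllib.robotparser.RobotFileParser()
no_sitemap.parse(["User-agent: *", "Disallow: /private/"])

# Normalize None to an empty list so the result is always iterable.
found = with_sitemap.site_maps() or []
missing = no_sitemap.site_maps() or []
print(found, missing)
```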
The following example demonstrates basic use of the RobotFileParser class:
>>> import urllib.robotparser
>>> rp = urllib.robotparser.RobotFileParser()
>>> rp.set_url("http://www.musi-cal.com/robots.txt")
>>> rp.read()
>>> rrate = rp.request_rate("*")
>>> rrate.requests
3
>>> rrate.seconds
20
>>> rp.crawl_delay("*")
6
>>> rp.can_fetch("*", "http://www.musi-cal.com/cgi-bin/search?city=San+Francisco")
False
>>> rp.can_fetch("*", "http://www.musi-cal.com/")
True