urllib.robotparser — Parser for robots.txt

Source code: Lib/urllib/robotparser.py
This module provides a single class, RobotFileParser, which answers questions about whether or not a particular user agent can fetch a URL on the web site that published the robots.txt file. For more details on the structure of robots.txt files, see http://www.robotstxt.org/orig.html.
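A minimal usage sketch, assuming a placeholder robots.txt URL and user agent string (substitute the site and crawler name you actually use):

    import urllib.robotparser

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("https://example.com/robots.txt")   # placeholder URL
    rp.read()  # fetch the remote robots.txt and feed it to the parser

    # Ask whether a given user agent may fetch a given URL.
    if rp.can_fetch("MyCrawler/1.0", "https://example.com/some/page.html"):
        print("allowed to fetch")
    else:
        print("disallowed by robots.txt")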
- class urllib.robotparser.RobotFileParser(url='')

  This class provides methods to read, parse and answer questions about the robots.txt file at url.

  - set_url(url)

    Sets the URL referring to a robots.txt file.
  - read()

    Reads the robots.txt URL and feeds it to the parser.
  - parse(lines)

    Parses the lines argument (see the offline example after this list).
  - can_fetch(useragent, url)

    Returns True if the useragent is allowed to fetch the url according to the rules contained in the parsed robots.txt file.
  - mtime()

    Returns the time the robots.txt file was last fetched. This is useful for long-running web spiders that need to check for new robots.txt files periodically.
  - modified()

    Sets the time the robots.txt file was last fetched to the current time.
  - crawl_delay(useragent)

    Returns the value of the Crawl-delay parameter from robots.txt for the useragent in question. If there is no such parameter, or it doesn't apply to the useragent specified, or the robots.txt entry for this parameter has invalid syntax, return None.

    Added in version 3.6.
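A sketch of parse() without any network access: the rules can be supplied as an in-memory list of lines (the rules and URLs below are made up for illustration):

    import urllib.robotparser

    # Rules as they might appear in a robots.txt file, e.g. read from a local cache.
    rules = [
        "User-agent: *",
        "Disallow: /private/",
        "Crawl-delay: 10",
    ]

    rp = urllib.robotparser.RobotFileParser()
    rp.parse(rules)

    print(rp.can_fetch("*", "https://example.com/private/data.html"))  # False
    print(rp.can_fetch("*", "https://example.com/index.html"))         # True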
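A polite-crawler sketch combining crawl_delay(), mtime() and read(); the URL, the user agent string, and the one-hour refresh interval are arbitrary choices for illustration:

    import time
    import urllib.robotparser

    rp = urllib.robotparser.RobotFileParser("https://example.com/robots.txt")  # placeholder URL
    rp.read()

    # Honour the site's requested delay between requests, if one is given (Python 3.6+).
    delay = rp.crawl_delay("MyCrawler/1.0")
    if delay is not None:
        time.sleep(delay)

    # Re-fetch robots.txt if the cached copy is older than an hour;
    # a successful read() re-parses the file and refreshes mtime().
    if time.time() - rp.mtime() > 3600:
        rp.read()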