urllib.parse: Parse URLs into components
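Source code: Lib/urllib/parse.py
This module defines a standard interface to break Uniform Resource Locator (URL) strings up into components (addressing scheme, network location, path, etc.), to combine the components back into a URL string, and to convert a relative URL to an absolute URL given a base URL.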
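As a rough illustration of the three tasks named above (splitting, recombining, and resolving a relative URL), the following sketch uses `urlsplit`, `urlunsplit` and `urljoin` from this module. The `example.com` URLs are placeholders chosen for illustration only.

```python
from urllib.parse import urlsplit, urlunsplit, urljoin

# Placeholder URL, not taken from the documentation.
url = "https://example.com/docs/index.html?lang=en#intro"

# 1. Break the URL string into components.
parts = urlsplit(url)
print(parts.scheme, parts.netloc, parts.path, parts.query, parts.fragment)

# 2. Combine the components back into a URL string.
rebuilt = urlunsplit(parts)
print(rebuilt)

# 3. Convert a relative URL to an absolute one, given a base URL.
absolute = urljoin("https://example.com/docs/index.html", "../images/logo.png")
print(absolute)
```

Here `rebuilt` is identical to the input, and the relative reference resolves against the directory of the base URL's path.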
The module has been designed to match the internet RFC on Relative Uniform Resource Locators. It supports the following URL schemes: file, ftp, gopher, hdl, http, https, imap, itms-services, mailto, mms, news, nntp, prospero, rsync, rtsp, rtsps, rtspu, sftp, shttp, sip, sips, snews, svn, svn+ssh, telnet, wais, ws, wss.
CPython implementation detail: The inclusion of the itms-services URL scheme can prevent an app from passing Apple's App Store review process for the macOS and iOS App Stores. Handling for the itms-services scheme is always removed on iOS; on macOS, it may be removed if CPython has been built with the --with-app-store-compliance option.
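The urllib.parse module defines functions that fall into two broad categories: URL parsing and URL quoting. These are covered in detail in the following sections.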
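To make the distinction between the two categories concrete, here is a minimal sketch that uses one parsing function (`urlsplit`) and two quoting functions (`quote`, `unquote`). The URL and strings are placeholders.

```python
from urllib.parse import urlsplit, quote, unquote

# URL parsing: split a URL string into components.
parts = urlsplit("https://example.com/a%20b?q=1")  # placeholder URL
raw_path = parts.path          # parsing leaves % escapes untouched
print(raw_path)

# URL quoting: escape or unescape special characters in a single component.
escaped = quote("a b/c")       # '/' is treated as safe by default
decoded = unquote(raw_path)
print(escaped, decoded)
```

Parsing and quoting are kept separate on purpose in this sketch: the path is split off first and decoded only afterwards.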
This module's functions use the deprecated term netloc (or net_loc), which was introduced in RFC 1808. However, this term has been obsoleted by RFC 3986, which introduced the term authority as its replacement. The use of netloc is continued for backward compatibility.
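As a small sketch of what the netloc (RFC 3986 authority) component holds, the example below splits a URL whose authority carries user information, a host and a port. The credentials and host are made up for illustration.

```python
from urllib.parse import urlsplit

# Placeholder URL with userinfo, host and port in the authority part.
parts = urlsplit("https://user:secret@example.com:8443/index.html")

print(parts.netloc)    # the whole authority, as a single string
print(parts.username)  # userinfo before the ':'
print(parts.password)  # userinfo after the ':'
print(parts.hostname)  # host, lowercased
print(parts.port)      # port as an integer
```

The result attribute is still named `netloc`, while the convenience attributes expose the pieces that RFC 3986 groups under authority.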
URL Parsing
The URL parsing functions focus on splitting a URL string into its components, or on combining URL components into a URL string.
- urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)
Parse a URL into six components, returning a 6-item named tuple. This corresponds to the general structure of a URL: scheme://netloc/path;parameters?query#fragment. Each tuple item is a string, possibly empty. The components are not broken up into smaller parts (for example, the network location is a single string), and % escapes are not expanded. The delimiters as shown above are not part of the result, except for a leading slash in the path component, which is retained if present. For example:
>>> from urllib.parse import urlparse
>>> urlparse("scheme://netloc/path;parameters?query#fragment")
ParseResult(scheme='scheme', netloc='netloc', path='/path;parameters', params='',
query='query', fragment='fragment')
>>> o = urlparse("http://docs.python.org:80/3/library/urllib.parse.html?"
... "highlight=params#url-parsing")
>>> o
ParseResult(scheme='http', netloc='docs.python.org:80',
path='/3/library/urllib.parse.html', params='',
query='highlight=params', fragment='url-parsing')
>>> o.scheme
'http'
>>> o.netloc
'docs.python.org:80'
>>> o.hostname
'docs.python.org'
>>> o.port
80
>>> o._replace(fragment="").geturl()
'http://docs.python.org:80/3/library/urllib.parse.html?highlight=params'
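The first example above leaves `;parameters` inside `path` because the made-up scheme `scheme` is not one that `urlparse` splits parameters for. The sketch below, using a placeholder `http` URL, shows the `params` field being filled in and confirms that % escapes stay unexpanded.

```python
from urllib.parse import urlparse

# Placeholder URL; with the http scheme, ';type=a' is split off into params.
o = urlparse("http://example.com/%7Euser/file;type=a?q=1#top")

print(len(o))      # the result is a 6-item named tuple
print(o.path)      # % escapes are kept as-is
print(o.params)
print(o.query)
print(o.fragment)

# Plain tuple unpacking works as well as attribute access.
scheme, netloc, path, params, query, fragment = o
print(o.geturl())  # reassembles the components into a URL string
```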
Following the syntax specifications in RFC 1808, urlparse recognizes a netloc only if it is properly introduced by '//'. Otherwise the input is presumed to be a relative URL and thus to start with a path component.
>>> from urllib.parse import urlparse
>>> urlparse('//www.cwi.nl:80/%7Eguido/Python.html')
ParseResult(scheme='', netloc='www.cwi.nl:80', path='/%7Eguido/Python.html',
params='', query='', fragment='')
>>> urlparse('www.cwi.nl/%7Eguido/Python.html')
ParseResult(scheme='', netloc='', path='www.cwi.nl/%7Eguido/Python.html',
params='', query='', fragment='')
>>> urlparse('help/Python.html')
ParseResult(scheme='', netloc='', path='help/Python.html', params='',
query='', fragment='')
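One practical consequence is that user-typed input such as `www.cwi.nl/...` ends up entirely in `path`. The sketch below shows one possible workaround. The helper `parse_host_first` is hypothetical and not part of urllib.parse, and its `'//'` check is deliberately naive (it would mishandle inputs such as `mailto:` addresses).

```python
from urllib.parse import urlparse

def parse_host_first(text, default_scheme="https"):
    """Hypothetical helper: treat input without '//' as starting with a host."""
    if "//" not in text:
        # Introduce the netloc explicitly so urlparse can recognize it.
        text = "//" + text
    # The scheme argument is only used when the URL itself has no scheme.
    return urlparse(text, scheme=default_scheme)

bare = parse_host_first("www.cwi.nl/%7Eguido/Python.html")
print(bare.scheme, bare.netloc, bare.path)

full = parse_host_first("http://docs.python.org/3/")
print(full.scheme, full.netloc, full.path)
```

Whether prefixing `//` is appropriate depends on the application; for genuinely relative references the default behaviour of `urlparse` is the correct one.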