urllib.parse - Parse URLs into components

Source code: Lib/urllib/parse.py


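This module defines a standard interface to break Uniform Resource Locator (URL) strings up into components (addressing scheme, network location, path, etc.), to combine the components back into a URL string, and to convert a “relative URL” to an absolute URL given a “base URL”.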
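As a minimal sketch of the last of these tasks, resolving a relative URL against a base URL, the snippet below uses `urljoin` from this module. The base URL is borrowed from the parsing examples further down; the relative reference `FAQ.html` is an illustrative assumption.

```python
from urllib.parse import urljoin

# Base URL taken from the parsing examples; "FAQ.html" is an assumed
# relative reference used only for illustration.
base = "http://www.cwi.nl/%7Eguido/Python.html"
absolute = urljoin(base, "FAQ.html")

# The last path segment of the base is replaced by the relative reference.
print(absolute)  # http://www.cwi.nl/%7Eguido/FAQ.html
```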

The module has been designed to match the internet RFC on Relative Uniform Resource Locators. It supports the following URL schemes: file, ftp, gopher, hdl, http, https, imap, itms-services, mailto, mms, news, nntp, prospero, rsync, rtsp, rtsps, rtspu, sftp, shttp, sip, sips, snews, svn, svn+ssh, telnet, wais, ws, wss.

CPython implementation detail: The inclusion of the itms-services URL scheme can prevent an app from passing Apple’s App Store review process for the macOS and iOS App Stores. Handling for the itms-services scheme is always removed on iOS; on macOS, it may be removed if CPython has been built with the --with-app-store-compliance option.

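The urllib.parse module defines functions that fall into two broad categories: URL parsing and URL quoting. These are covered in detail in the following sections.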

This module’s functions use the deprecated term netloc (or net_loc), which was introduced in RFC 1808. However, this term has been obsoleted by RFC 3986, which introduced the term authority as its replacement. The use of netloc is continued for backward compatibility.
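To make the terminology concrete, the following sketch shows that the netloc attribute holds what RFC 3986 calls the authority as one string: optional user information, the host, and an optional port. The URL and credentials are hypothetical values chosen for illustration.

```python
from urllib.parse import urlsplit

# Hypothetical URL; the user, password, host, and port are made up.
parts = urlsplit("https://user:secret@example.com:8443/index.html")

# netloc is the whole authority component, kept as a single string.
print(parts.netloc)    # user:secret@example.com:8443

# Convenience attributes expose the pieces of the authority.
print(parts.username)  # user
print(parts.password)  # secret
print(parts.hostname)  # example.com
print(parts.port)      # 8443
```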

URL Parsing

The URL parsing functions focus on splitting a URL string into its components, or on combining URL components into a URL string.
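Both directions can be sketched with `urlparse` and its counterpart `urlunparse`, which is also part of this module. The snippet splits a URL, recombines it unchanged, and then rebuilds it with two components replaced. The replacement scheme and netloc values are assumptions made for illustration.

```python
from urllib.parse import urlparse, urlunparse

url = ("http://docs.python.org:80/3/library/urllib.parse.html"
       "?highlight=params#url-parsing")

# Split the URL into its six components.
parts = urlparse(url)

# Recombining the unmodified components gives back the original string here.
rebuilt = urlunparse(parts)

# Named tuples are immutable, so _replace() returns a modified copy.
https_url = urlunparse(parts._replace(scheme="https",
                                      netloc="docs.python.org"))
print(https_url)
```

For this input the round trip is exact. In general, `urlunparse` may return a slightly different but equivalent URL if the original contained redundant delimiters, such as a `?` followed by an empty query.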

urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)

Parse a URL into six components, returning a 6-item named tuple. This corresponds to the general structure of a URL: scheme://netloc/path;parameters?query#fragment. Each tuple item is a string, possibly empty. The components are not broken up into smaller parts (for example, the network location is a single string), and % escapes are not expanded. The delimiters as shown above are not part of the result, except for a leading slash in the path component, which is retained if present. For example:

>>> from urllib.parse import urlparse
>>> urlparse("scheme://netloc/path;parameters?query#fragment")
ParseResult(scheme='scheme', netloc='netloc', path='/path;parameters', params='',
            query='query', fragment='fragment')
>>> o = urlparse("http://docs.python.org:80/3/library/urllib.parse.html?"
...              "highlight=params#url-parsing")
>>> o
ParseResult(scheme='http', netloc='docs.python.org:80',
            path='/3/library/urllib.parse.html', params='',
            query='highlight=params', fragment='url-parsing')
>>> o.scheme
'http'
>>> o.netloc
'docs.python.org:80'
>>> o.hostname
'docs.python.org'
>>> o.port
80
>>> o._replace(fragment="").geturl()
'http://docs.python.org:80/3/library/urllib.parse.html?highlight=params'
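Because % escapes are left untouched by `urlparse`, decoding is a separate step. A minimal sketch, assuming the caller wants the decoded path, applies `unquote` from this module to the parsed component:

```python
from urllib.parse import urlparse, unquote

parts = urlparse("//www.cwi.nl:80/%7Eguido/Python.html")

# The path keeps its leading slash and its % escapes exactly as given.
raw_path = parts.path
print(raw_path)      # /%7Eguido/Python.html

# unquote() expands the escapes; %7E decodes to '~'.
decoded_path = unquote(raw_path)
print(decoded_path)  # /~guido/Python.html
```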

Following the syntax specifications in RFC 1808, urlparse recognizes a netloc only if it is properly introduced by ‘//’. Otherwise the input is presumed to be a relative URL and thus to start with a path component.

>>> from urllib.parse import urlparse
>>> urlparse('//www.cwi.nl:80/%7Eguido/Python.html')
ParseResult(scheme='', netloc='www.cwi.nl:80', path='/%7Eguido/Python.html',
            params='', query='', fragment='')
>>> urlparse('www.cwi.nl/%7Eguido/Python.html')
ParseResult(scheme='', netloc='', path='www.cwi.nl/%7Eguido/Python.html',
            params='', query='', fragment='')
>>> urlparse('help/Python.html')
ParseResult(scheme='', netloc='', path='help/Python.html', params='',
            query='', fragment='')
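When input is known to begin with a host name but lacks the ‘//’ introducer, one possible workaround is to add it before parsing. The helper below is a hypothetical sketch, not part of the module: the name `parse_host_first` and the default scheme are assumptions, and it would misinterpret input that really is a relative path.

```python
from urllib.parse import urlparse

def parse_host_first(text, default_scheme="http"):
    """Hypothetical helper: parse text that is assumed to start with a host."""
    # Add the '//' introducer only when the input has neither a
    # 'scheme://' prefix nor a leading '//'.
    if "://" not in text and not text.startswith("//"):
        text = "//" + text
    # The scheme argument is only used when the URL does not specify one.
    return urlparse(text, scheme=default_scheme)

result = parse_host_first("www.cwi.nl/%7Eguido/Python.html")
print(result.scheme)  # http
print(result.netloc)  # www.cwi.nl
print(result.path)    # /%7Eguido/Python.html
```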