2. Lexical analysis

A Python program is read by a parser. Input to the parser is a stream of tokens, generated by the lexical analyzer. This chapter describes how the lexical analyzer breaks a file into tokens.
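As a rough illustration of the token stream, the standard library's tokenize module can be used to inspect how a line of source is split up. This is only a sketch: tokenize is a separate scanner that mirrors the lexical analyzer and is not the parser's internal interface, and the sample source string is an assumption chosen for the example.

```python
import io
import tokenize

# Illustrative source text (an assumption, not taken from the reference).
source = "x = 1 + 2\n"

# generate_tokens() expects a readline callable that returns str lines.
tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))

# Map each numeric token type to its symbolic name.
names = [tokenize.tok_name[tok.type] for tok in tokens]

for tok in tokens:
    print(tokenize.tok_name[tok.type], repr(tok.string))
```

Note that tokenize reports every operator and delimiter with the generic type OP; the more specific type is available as the exact_type attribute of each token.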

Python reads program text as Unicode code points; the encoding of a source file can be given by an encoding declaration and defaults to UTF-8 (see PEP 3120 for details). If the source file cannot be decoded, a SyntaxError is raised.

2.1. Line structure

A Python program is divided into a number of logical lines.

2.1.1. Logical lines

The end of a logical line is represented by the token NEWLINE. Statements cannot cross logical line boundaries except where NEWLINE is allowed by the syntax (e.g., between statements in compound statements). A logical line is constructed from one or more physical lines by following the explicit or implicit line joining rules.
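The relationship between physical and logical lines can be sketched with the tokenize module. In the example below, three physical lines form two logical lines because the first statement continues inside parentheses. The helper name count_tokens and the sample source are assumptions made for illustration. The NL token type is specific to the tokenize module, which uses it for line breaks that do not end a logical line.

```python
import io
import tokenize

# Three physical lines; the first two are joined implicitly by parentheses.
source = "total = (1 +\n         2)\nprint(total)\n"


def count_tokens(src, token_type):
    """Count the tokens of the given type in a source string (hypothetical helper)."""
    readline = io.StringIO(src).readline
    return sum(1 for tok in tokenize.generate_tokens(readline)
               if tok.type == token_type)


physical_lines = len(source.splitlines())
logical_lines = count_tokens(source, tokenize.NEWLINE)
non_terminating_breaks = count_tokens(source, tokenize.NL)

print(physical_lines, logical_lines, non_terminating_breaks)
```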

2.1.2. Physical lines

A physical line is a sequence of characters terminated by an end-of-line sequence. In source files and strings, any of the standard platform line termination sequences can be used - the Unix form using ASCII LF (linefeed), the Windows form using the ASCII sequence CR LF (return followed by linefeed), or the old Macintosh form using the ASCII CR (return) character. All of these forms can be used equally, regardless of platform. The end of input also serves as an implicit terminator for the final physical line.

When embedding Python, source code strings should be passed to Python APIs using the standard C conventions for newline characters (the \n character, representing ASCII LF, is the line terminator).
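The claim that all line termination forms are accepted in source strings can be checked with a small experiment. The sketch below compiles the same two-statement program with LF, CR LF, and CR terminators, and once with no terminator on the final line. The helper name run and the variable names are assumptions for illustration only.

```python
def run(source):
    """Compile and execute a source string, returning the value bound to b."""
    namespace = {}
    exec(compile(source, "<example>", "exec"), namespace)
    return namespace["b"]


variants = {
    "unix (LF)": "a = 1\nb = a + 1\n",
    "windows (CR LF)": "a = 1\r\nb = a + 1\r\n",
    "old macintosh (CR)": "a = 1\rb = a + 1\r",
    "no final terminator": "a = 1\nb = a + 1",
}

# Each variant should be read as the same two physical lines.
results = {label: run(text) for label, text in variants.items()}
for label, value in results.items():
    print(label, value)
```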

2.1.3. Comments

A comment starts with a hash character (#) that is not part of a string literal, and ends at the end of the physical line. A comment signifies the end of the logical line unless the implicit line joining rules are invoked. Comments are ignored by the syntax.
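The rule that a hash character inside a string literal does not start a comment can be illustrated with the tokenize module, which reports comments as COMMENT tokens. The sample line below is an assumption chosen so that one # appears inside a string and another begins a real comment.

```python
import io
import tokenize

# One '#' sits inside a string literal, the other starts a comment.
source = 'url = "http://example.com/#anchor"  # trailing comment\n'

readline = io.StringIO(source).readline
comments = [tok.string for tok in tokenize.generate_tokens(readline)
            if tok.type == tokenize.COMMENT]

print(comments)
```

Only the trailing comment is reported; the # in the URL remains part of the STRING token.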

2.1.4. Encoding declarations

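If a comment in the first or second line of the Python script matches the regular expression coding[=:]\s*([-\w.]+), this comment is processed as an encoding declaration; the first group of this expression names the encoding of the source code file. The encoding declaration must appear on a line of its own. If it is the second line, the first line must also be a comment-only line. The recommended form of an encoding expression is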

# -*- coding: <encoding-name> -*-
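As a sketch of how such a declaration is recognized, the regular expression quoted above can be applied directly to a comment line, and the standard library function tokenize.detect_encoding can be used to find the declaration in raw bytes. The sample file contents and the names CODING_RE, declared, and raw_source are assumptions for illustration.

```python
import io
import re
import tokenize

# The pattern quoted in the text above.
CODING_RE = re.compile(r"coding[=:]\s*([-\w.]+)")

line = "# -*- coding: latin-1 -*-"
match = CODING_RE.search(line)
declared = match.group(1)  # group 1 names the encoding

# A hypothetical source file: comment-only first line, declaration on line 2.
raw_source = (
    b"#!/usr/bin/env python3\n"
    b"# -*- coding: latin-1 -*-\n"
    b"name = 'caf\xe9'\n"
)

# detect_encoding() reads at most two lines and returns the encoding name
# together with the lines it consumed.
encoding, lines_read = tokenize.detect_encoding(io.BytesIO(raw_source).readline)
text = raw_source.decode(encoding)

print(declared, encoding, len(lines_read))
```

Note that detect_encoding normalizes some encoding names, so the value it returns may differ in spelling from the name written in the comment.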