Everything you thought you knew about binary data and Unicode has changed.
-
Python 3.0 uses the concepts of
text
and (binary)
data
instead of Unicode strings and 8-bit strings. All text is Unicode; however
encoded
Unicode is represented as binary data. The type used to hold text is
str
, the type used to hold data is
bytes
. The biggest difference with the 2.x situation is that any attempt to mix text and data in Python 3.0 raises
TypeError
, whereas if you were to mix Unicode and 8-bit strings in Python 2.x, it would work if the 8-bit string happened to contain only 7-bit (ASCII) bytes, but you would get
UnicodeDecodeError
if it contained non-ASCII values. This value-specific behavior has caused numerous sad faces over the years.
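A minimal sketch of the stricter behavior described above, assuming a Python 3 interpreter. The variable names are illustrative only.

```python
# Text and binary data are distinct types and cannot be concatenated.
text = "caf\u00e9"            # str: a sequence of Unicode code points
data = text.encode("utf-8")   # bytes: the UTF-8 encoded form

try:
    combined = text + data    # mixing str and bytes
except TypeError as exc:
    error_name = type(exc).__name__
else:
    error_name = None

print(error_name)
```

The failure does not depend on the contents of the values: the same error occurs even when both operands hold only ASCII.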
-
As a consequence of this change in philosophy, almost all code that uses Unicode, encodings or binary data will most likely have to change. The change is for the better, as in the 2.x world there were numerous bugs having to do with mixing encoded and unencoded text. To be prepared in Python 2.x, start using
unicode
for all unencoded text, and
str
for binary or encoded data only. Then the
2to3
tool will do most of the work for you.
-
You can no longer use
u"..."
literals for Unicode text. However, you must use
b"..."
literals for binary data.
-
As the
str
and
bytes
types cannot be mixed, you must always explicitly convert between them. Use
str.encode()
to go from
str
to
bytes
, and
bytes.decode()
to go from
bytes
to
str
. You can also use
bytes(s, encoding=...)
and
str(b, encoding=...)
, respectively.
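The explicit conversions can be sketched as follows. UTF-8 is chosen here purely for illustration; the right encoding depends on the data being handled.

```python
text = "\u20ac100"                        # the euro sign followed by "100"

# str -> bytes
data = text.encode("utf-8")
same_data = bytes(text, encoding="utf-8")

# bytes -> str
round_trip = data.decode("utf-8")
same_text = str(data, encoding="utf-8")

print(data)         # the euro sign occupies three bytes in UTF-8
print(round_trip == text)
```

Both spellings in each direction produce equal results; the method form is usually easier to read.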
-
Like
str
, the
bytes
type is immutable. There is a separate
mutable
type to hold buffered binary data,
bytearray
. Nearly all APIs that accept
bytes
also accept
bytearray
. The mutable API is based on
collections.MutableSequence
.
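A short sketch contrasting the two types. One assumption to note: in current Python releases the abstract base class is imported from `collections.abc` rather than from `collections` directly, so that is the spelling used here.

```python
from collections.abc import MutableSequence  # location in current Python releases

frozen = b"abc"
buf = bytearray(frozen)        # mutable copy of the same bytes

buf[0] = ord("A")              # item assignment takes an int in range(256)
buf.extend(b"def")             # grows in place

try:
    frozen[0] = ord("A")       # bytes does not support item assignment
except TypeError:
    bytes_is_immutable = True
else:
    bytes_is_immutable = False

is_mutable_sequence = isinstance(buf, MutableSequence)
print(buf, bytes_is_immutable, is_mutable_sequence)
```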
-
All backslashes in raw string literals are interpreted literally. This means that
'\U'
and
'\u'
escapes in raw strings are not treated specially. For example,
r'\u20ac'
is a string of 6 characters in Python 3.0, whereas in 2.6,
ur'\u20ac'
was the single “euro” character. (Of course, this change only affects raw string literals; the euro character is
'\u20ac'
in Python 3.0.)
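The difference between the raw and the regular literal can be checked directly:

```python
raw = r'\u20ac'      # backslash, 'u', '2', '0', 'a', 'c'
cooked = '\u20ac'    # the single euro character

print(len(raw), len(cooked))
```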
-
The built-in
basestring
abstract type was removed. Use
str
instead. The
str
and
bytes
types do not have enough functionality in common to warrant a shared base class. The
2to3
tool (see below) replaces every occurrence of
basestring
with
str
.
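As a rough sketch of what code that previously tested against `basestring` might look like afterwards: the helper names below are hypothetical, and whether `bytes` should also be accepted depends on what the original check was guarding.

```python
def is_text(value):
    """Hypothetical helper: true only for Unicode text."""
    return isinstance(value, str)


def is_text_or_data(value):
    """Hypothetical helper for code that must accept both kinds of value."""
    return isinstance(value, (str, bytes))


print(is_text("abc"), is_text(b"abc"), is_text_or_data(b"abc"))
```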
-
Files opened as text files (still the default mode for
open()
) always use an encoding to map between strings (in memory) and bytes (on disk). Binary files (opened with a
b
in the mode argument) always use bytes in memory. This means that if a file is opened using an incorrect mode or encoding, I/O will likely fail loudly, instead of silently producing incorrect data. It also means that even Unix users will have to specify the correct mode (text or binary) when opening a file. There is a platform-dependent default encoding, which on Unixy platforms can be set with the
LANG
environment variable (and sometimes also with some other platform-specific locale-related environment variables). In many cases, but not all, the system default is UTF-8; you should never count on this default. Any application reading or writing more than pure ASCII text should probably have a way to override the encoding. There is no longer any need for using the encoding-aware streams in the
codecs
module.
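A minimal sketch of the two modes, using a temporary file and an explicitly chosen encoding (UTF-8 here is an assumption made for the example, not a recommendation for every application):

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.txt")

    # Text mode: str in memory, encoded to bytes on disk.
    with open(path, "w", encoding="utf-8") as f:
        f.write("na\u00efve")

    # Text mode again: bytes on disk decoded back to str.
    with open(path, "r", encoding="utf-8") as f:
        as_text = f.read()

    # Binary mode: the raw bytes, with no decoding.
    with open(path, "rb") as f:
        as_bytes = f.read()

print(as_text, as_bytes)
```

Passing `encoding` explicitly avoids relying on the platform-dependent default.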
-
The initial values of
sys.stdin
,
sys.stdout
and
sys.stderr
are now Unicode-only text files (i.e., they are instances of
io.TextIOBase
). To read and write bytes data with these streams, you need to use their
io.TextIOBase.buffer
attribute.
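A sketch of writing bytes through the underlying binary buffer. The helper `write_raw` is hypothetical, and the demonstration uses an in-memory `io.TextIOWrapper` as a stand-in for `sys.stdout`, since a replaced or captured standard stream may not expose a `buffer` attribute.

```python
import io
import sys


def write_raw(data, stream=None):
    """Hypothetical helper: write bytes via a text stream's binary buffer."""
    stream = sys.stdout if stream is None else stream
    stream.flush()                    # keep earlier text output in order
    written = stream.buffer.write(data)
    stream.buffer.flush()
    return written


# Stand-in for sys.stdout: a text wrapper around an in-memory byte buffer.
raw = io.BytesIO()
fake_stdout = io.TextIOWrapper(raw, encoding="utf-8")

fake_stdout.write("header: ")         # text goes through the text layer
count = write_raw(b"\x00\xff", fake_stdout)
captured = raw.getvalue()
```

With a real terminal, calling `write_raw(b"...")` with no second argument would send the bytes to standard output.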
-
Filenames are passed to and returned from APIs as (Unicode) strings. This can present platform-specific problems because on some platforms filenames are arbitrary byte strings. (On the other hand, on Windows filenames are natively stored as Unicode.) As a work-around, most APIs (e.g.
open()
and many functions in the
os
module) that take filenames accept
bytes
objects as well as strings, and a few APIs have a way to ask for a
bytes
return value. Thus,
os.listdir()
returns a list of
bytes
instances if the argument is a
bytes
instance, and
os.getcwdb()
returns the current working directory as a
bytes
instance. Note that when
os.listdir()
returns a list of strings, filenames that cannot be decoded properly are omitted rather than raising
UnicodeError
.
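The str and bytes variants can be compared with a short sketch. `os.fsencode()` is used for brevity; note that it was added in a release later than 3.0, so treat it as an assumption about the running interpreter.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Create one empty file so the listing is not empty.
    open(os.path.join(tmp, "notes.txt"), "w").close()

    str_names = os.listdir(tmp)                  # str in, list of str out
    bytes_names = os.listdir(os.fsencode(tmp))   # bytes in, list of bytes out

cwd_bytes = os.getcwdb()                         # current directory as bytes

print(str_names, bytes_names)
```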
-
Some system APIs like
os.environ
and
sys.argv
can also present problems when the bytes made available by the system are not interpretable using the default encoding. Setting the
LANG
variable and rerunning the program is probably the best approach.
-
PEP 3138
:
repr()
of a string no longer escapes non-ASCII characters. It still escapes control characters and code points with non-printable status in the Unicode standard, however.
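A brief illustration, with the built-in `ascii()` shown alongside for cases where an ASCII-only representation is still wanted:

```python
s = "caf\u00e9\n"

shown = repr(s)        # keeps the non-ASCII letter, escapes the newline
escaped = ascii(s)     # escapes both the non-ASCII letter and the newline

print(shown)
print(escaped)
```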
-
PEP 3120
: The default source encoding is now UTF-8.
-
PEP 3131
: Non-ASCII letters are now allowed in identifiers. (However, the standard library remains ASCII-only with the exception of contributor names in comments.)
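For illustration only, a sketch with non-ASCII identifiers; the names are arbitrary and the source file is assumed to be saved as UTF-8 (the default source encoding noted above).

```python
import math

π = math.pi                # Greek letter used as a variable name
größe = 2 * π              # German word with non-ASCII letters

valid = "größe".isidentifier()
print(größe, valid)
```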
-
The
StringIO
and
cStringIO
modules are gone. Instead, import the
io
module and use
io.StringIO
or
io.BytesIO
for text and data respectively.
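A minimal sketch of the two in-memory stream classes:

```python
import io

text_buf = io.StringIO()
text_buf.write("h\u00e9llo")                   # accepts str only
text_value = text_buf.getvalue()

data_buf = io.BytesIO()
data_buf.write("h\u00e9llo".encode("utf-8"))   # accepts bytes-like objects only
data_value = data_buf.getvalue()

try:
    data_buf.write("text")                     # str into a bytes stream
except TypeError:
    rejects_text = True
else:
    rejects_text = False

print(text_value, data_value, rejects_text)
```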
-
See also the
Unicode HOWTO
, which was updated for Python 3.0.