Allen's Weblog: Unicode in Python, and how to prevent it

[UPDATE 16 Aug 2011] Armin Ronacher has written a nice module called unicode-nazi that provides the Unicode warnings I discuss at the end of this article.

Though I can't use Python 3 for any of my projects, it does have a few nice things. One particular behaviour where it improves on Python 2 is forbidding implicit conversions between byte strings and Unicode strings. For example:

Python 3.1.2 (release31-maint, Sep 17 2010, 20:34:23)

[GCC 4.4.5] on linux2

Type "help", "copyright", "credits" or "license" for more information.

>>> 'foo' + b'baz'

Traceback (most recent call last):

  File "<stdin>", line 1, in <module>

TypeError: Can't convert 'bytes' object to str implicitly
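In Python 3 the way out of that TypeError is to convert explicitly at the point where the two types meet. A minimal sketch, assuming the bytes are UTF-8 (the variable names are only for illustration):

```python
# Python 3: choose the encoding explicitly instead of relying on a default.
text = 'foo'
data = b'baz'

# Decode the bytes if the result should be text...
as_text = text + data.decode('utf-8')

# ...or encode the text if the result should be bytes.
as_bytes = text.encode('utf-8') + data
```

Either direction works; the point is that the encoding is named in the code rather than picked up from interpreter state.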

If you do this in Python 2, it invokes the default encoding to convert between bytes and unicode, leading to manifold unhappinesses. So in Python 2, the above example looks like:

Python 2.6.6 (r266:84292, Sep 15 2010, 15:52:39)

[GCC 4.4.5] on linux2

Type "help", "copyright", "credits" or "license" for more information.

>>> u'foo' + b'baz'

u'foobaz'

This looks OK, but problems lurk just below the surface. The default encoding used in nearly all circumstances is ASCII, and this conversion will blow up if non-ASCII bytes are involved.

>>> u'foo' + b'\xe2\x98\x83'

Traceback (most recent call last):

  File "<stdin>", line 1, in <module>

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)
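The same failure can be reproduced without the implicit step by asking for the ASCII codec directly, which is what the default encoding amounts to here. A small sketch that should run on Python 2.6+ and Python 3 (the variable names are illustrative):

```python
snowman_bytes = b'\xe2\x98\x83'  # UTF-8 encoding of U+2603 SNOWMAN

try:
    snowman_bytes.decode('ascii')
except UnicodeDecodeError as exc:
    ascii_failed = True
    failed_at = exc.start  # index of the first byte ASCII could not decode
else:
    ascii_failed = False

# Naming the real encoding succeeds where ASCII could not.
snowman = snowman_bytes.decode('utf-8')
```

Whether the implicit conversion works therefore depends on the data, not on the code, which is why these bugs tend to surface late.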

This is only a 3, at best 3.5 on the Russell scale of API misusability. Even worse, this conversion is done for any function implemented in C that expects unicode or bytes and receives the other type. For example, unicode() converts byte strings to unicode strings, and if not given an explicit encoding argument, uses the default encoding.

>>> unicode('bob')

u'bob'

>>> unicode('\xe2\x98\x83')

Traceback (most recent call last):

  File "<stdin>", line 1, in <module>

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)

The right way to do it is to call .decode() on the byte string.



>>> '\xe2\x98\x83'.decode('utf8')

u'\u2603'

Just be sure you don't call it on a unicode string by mistake, or you'll get very confused:

>>> u'foo'.decode('utf8')

u'foo'

>>> u'\u2603'.decode('utf8')

Traceback (most recent call last):

  File "<stdin>", line 1, in ?

  File "/usr/lib/python2.4/encodings/utf_8.py", line 16, in decode

    return codecs.utf_8_decode(input, errors, True)

UnicodeEncodeError: 'ascii' codec can't encode character u'\u2603' in position 0: ordinal not in range(128)

Hey! That error says "encode", but we asked for it to decode!

It turns out that since the utf8 decoder expects bytes as input, but we gave it unicode... Python helpfully tries to convert the unicode characters into bytes, using the default encoding! This is why the first decode call succeeds - all the characters in it can be converted to ASCII bytes.
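One way to guard against decoding twice is to check the type before decoding. The helper below is a hypothetical sketch, not something from the standard library; `to_text` and its default encoding are assumptions made for illustration.

```python
def to_text(value, encoding='utf-8'):
    """Return value as text, decoding only when it is a byte string."""
    text_type = type(u'')  # unicode on Python 2, str on Python 3
    if isinstance(value, text_type):
        # Already text: calling .decode() here is the mistake shown above.
        return value
    if isinstance(value, bytes):  # bytes is an alias for str on Python 2
        return value.decode(encoding)
    raise TypeError('expected text or bytes, got %r' % type(value))
```

Calling it on `b'\xe2\x98\x83'` and on `u'\u2603'` gives the same text back in both cases, and anything that is neither text nor bytes is rejected rather than coerced.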

Is there any hope?

Since Unicode was added to Python in 2.0, there's been a way to get Python 3's behavior. The site module in the Python standard library sets Python's default encoding on startup. If edited to use "undefined" instead of "ascii", the above examples all fail instead of converting:



>>> u'foo' + b'baz'

Traceback (most recent call last):

  File "<stdin>", line 1, in <module>

  File "/usr/lib/python2.6/encodings/undefined.py", line 22, in decode

    raise UnicodeError("undefined encoding")

UnicodeError: undefined encoding
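The "undefined" codec can also be tried out by name, without editing site.py, to see how it behaves. A small sketch, assuming a stock CPython where the codec ships in the encodings package:

```python
import codecs

# Looking the codec up first confirms that it is available.
codecs.lookup('undefined')

try:
    # Any use of the codec is refused, even for pure-ASCII input.
    b'baz'.decode('undefined')
except UnicodeError:
    refused = True
else:
    refused = False
```

This only exercises the codec explicitly; making it the default encoding is the site.py change described above.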

So why can't we switch to this today? For one thing, lots and lots of code depends on this implicit encoding/decoding, even code in the standard library:

>>> import re

>>> re.compile(u'\N{SNOWMAN}+')

Traceback (most recent call last):

  File "<stdin>", line 1, in <module>

  File "/usr/lib/python2.6/re.py", line 190, in compile

    return _compile(pattern, flags)

  File "/usr/lib/python2.6/re.py", line 243, in _compile

    p = sre_compile.compile(pattern, flags)

  File "/usr/lib/python2.6/sre_compile.py", line 506, in compile

    p = sre_parse.parse(p, flags)

  File "/usr/lib/python2.6/sre_parse.py", line 672, in parse

    source = Tokenizer(str)

  File "/usr/lib/python2.6/sre_parse.py", line 187, in __init__

    self.__next()

  File "/usr/lib/python2.6/sre_parse.py", line 193, in __next

    if char[0] == "\\":

  File "/usr/lib/python2.6/encodings/undefined.py", line 22, in decode

    raise UnicodeError("undefined encoding")

UnicodeError: undefined encoding

Not to mention loads of existing third-party apps and libraries. So what can we do? Fortunately, Python allows registering new codecs. I've written a slight variation on the ASCII codec, ascii_with_complaints, which preserves the default Python behaviour, but also produces warnings.



>>> import re

>>> re.compile(u'\N{SNOWMAN}+')

/usr/lib/python2.6/sre_parse.py:193: UnicodeWarning: Implicit conversion of str to unicode

  if char[0] == "\\":

/usr/lib/python2.6/sre_parse.py:418: UnicodeWarning: Implicit conversion of str to unicode

  if this and this[0] not in SPECIAL_CHARS:

/usr/lib/python2.6/sre_parse.py:421: UnicodeWarning: Implicit conversion of str to unicode

  elif this == "[":

/usr/lib/python2.6/sre_parse.py:478: UnicodeWarning: Implicit conversion of str to unicode

  elif this and this[0] in REPEAT_CHARS:

/usr/lib/python2.6/sre_parse.py:480: UnicodeWarning: Implicit conversion of str to unicode

  if this == "?":

/usr/lib/python2.6/sre_parse.py:482: UnicodeWarning: Implicit conversion of str to unicode

  elif this == "*":

/usr/lib/python2.6/sre_parse.py:485: UnicodeWarning: Implicit conversion of str to unicode

  elif this == "+":

<_sre.SRE_Pattern object at 0xb76d6f80>
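For readers curious how such a codec can be put together, here is a rough sketch using `codecs.register`. It is not the actual ascii_with_complaints source: the codec name `ascii_warn_sketch` and the helper functions are assumptions made for illustration, and the step that installs a codec as the default encoding (the site.py change, Python 2 only) is left out.

```python
import codecs
import warnings

CODEC_NAME = 'ascii_warn_sketch'  # hypothetical name, not the author's codec
_ascii = codecs.lookup('ascii')


def _decode(data, errors='strict'):
    # Behave exactly like ASCII, but complain first.
    warnings.warn('Implicit conversion of str to unicode',
                  UnicodeWarning, stacklevel=2)
    return _ascii.decode(data, errors)


def _encode(text, errors='strict'):
    warnings.warn('Implicit conversion of unicode to str',
                  UnicodeWarning, stacklevel=2)
    return _ascii.encode(text, errors)


def _search(name):
    # Codec search functions return a CodecInfo, or None for unknown names.
    if name == CODEC_NAME:
        return codecs.CodecInfo(name=CODEC_NAME,
                                encode=_encode, decode=_decode)
    return None


codecs.register(_search)
```

After registration, `b'foo'.decode(CODEC_NAME)` returns the same text as the plain ASCII codec and also emits a UnicodeWarning. Wrapping the existing ASCII functions keeps the conversion behaviour identical, so the only observable difference is the warning.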

Hopefully this will also be useful as a tool for ferreting out the Unicode bugs waiting to break your application.
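When hunting for these bugs in a test run, it can help to promote the warnings to exceptions so that the first implicit conversion stops the test. A minimal sketch using the standard warnings module; `run_strict` is a hypothetical helper name:

```python
import warnings


def run_strict(func, *args, **kwargs):
    """Call func with UnicodeWarning promoted to an exception."""
    with warnings.catch_warnings():
        # The filter change is undone when the with block exits.
        warnings.simplefilter('error', UnicodeWarning)
        return func(*args, **kwargs)
```

Code that triggers no UnicodeWarning runs as usual under `run_strict`, while a warning such as the ones shown above surfaces as a raised UnicodeWarning with a full traceback.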

8 comments:

Anonymous said...: can you please correctly license your codec? it looks interesting, but as long as you don't, one can't legally use it.; November 7, 2010 at 7:16 AM
Anonymous said...: adding like logging the line and object where it was triggered is interesting too; November 7, 2010 at 7:29 AM
Allen Short said...: Oops! Good point. This was based on the original 'ascii' codec, and I've marked it as such. Thanks for catching that.; November 7, 2010 at 7:30 AM
Anonymous said...: http://mail.python.org/pipermail/python-dev/2009-August/091406.html

See the suggestion at the end of this post (which is not implemented yet, at least not in the py 2.6.6 I use) and the previous posts in that thread about the general problem.

It is a bit unfortunate that it still seems impossible to use another default encoding without globally modifying the system behaviour (by editing site.py) or using a completely separate python.; November 7, 2010 at 8:36 AM
Anonymous said...: Nice, I wrote something pretty similar a while back and it does come in handy (I thought I'd linked you to it at some point, but perhaps I'm confused).

One thing though: there is code out there (like pygtk) that replaces the default encoding without telling you or checking what the previous value actually is (usually to set it to utf-8). If you use a library like that you'll have to apply this hack *after* triggering the library's default encoding replacement, and you *might* want to use an utf-8 instead of ascii version of the hack.; November 7, 2010 at 5:25 PM
Tarjei Huse said...: This post is the best (and shortest) summary of what you need to know about Python and unicode that I have seen. Kudos!; November 9, 2010 at 11:35 PM
andy said...: i think your example of '\xe2\98\83'.decode('utf8') is wrong, since '\xe2' is not a valid ASCII.; November 30, 2013 at 2:40 AM
Anonymous said...: Thank you!

>>> u'foo'.decode('utf8')

That saved my life. ;-); June 5, 2015 at 12:37 PM
