Unicode in Python, and how to prevent it



[UPDATE 16 Aug 2011] Armin Ronacher has written a nice module called unicode-nazi that provides the Unicode warnings I discuss at the end of this article.

Though I can't use Python 3 for any of my projects, it does have a few nice things. One particular behaviour where it improves on Python 2 is forbidding implicit conversions between byte strings and Unicode strings. For example:
Python 3.1.2 (release31-maint, Sep 17 2010, 20:34:23)
[GCC 4.4.5] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> 'foo' + b'baz'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: Can't convert 'bytes' object to str implicitly
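For contrast, here is a minimal sketch of what Python 3 does accept: the same two values combined after an explicit conversion in either direction. The variable names are just for illustration, and the choice of ASCII as the codec is an assumption made to keep the example small.

```python
# Python 3: text and bytes only combine after an explicit conversion.
text = 'foo'
raw = b'baz'

# Decode the bytes to get text + text ...
as_text = text + raw.decode('ascii')

# ... or encode the text to get bytes + bytes.
as_bytes = text.encode('ascii') + raw

# Mixing the two types directly is refused outright.
try:
    text + raw
except TypeError as exc:
    mixed_error = exc
else:
    mixed_error = None
```

Either way, the codec is named at the point of conversion instead of being picked up from a global default.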

If you do this in Python 2, it invokes the default encoding to convert between bytes and unicode, leading to manifold unhappinesses. So in Python 2, the above example looks like:
Python 2.6.6 (r266:84292, Sep 15 2010, 15:52:39)
[GCC 4.4.5] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> u'foo' + b'baz'
u'foobaz'

This looks OK, but problems lurk just below the surface. The default encoding used in nearly all circumstances is ASCII, and this conversion will blow up if non-ASCII bytes are involved.
>>> u'foo' + b'\xe2\x98\x83'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)
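To make the hidden step visible, here is a rough model of that concatenation, written for Python 3 so it runs on a current interpreter. The function py2_style_concat and its default_encoding parameter are made up for this sketch; Python 2 does the equivalent internally and never exposes it as a function.

```python
def py2_style_concat(text, raw, default_encoding='ascii'):
    """Roughly what Python 2 does for u'...' + b'...'.

    The byte string is silently decoded with the default encoding,
    and only then are the two pieces joined.
    """
    return text + raw.decode(default_encoding)


# Pure-ASCII bytes decode fine, so the problem stays hidden.
ok = py2_style_concat('foo', b'baz')

# The UTF-8 bytes for U+2603 SNOWMAN are not valid ASCII.
try:
    py2_style_concat('foo', b'\xe2\x98\x83')
except UnicodeDecodeError as exc:
    failure = str(exc)
```

The failure depends entirely on the data, which is why code like this tends to pass its tests and then break on real input.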

This is only a 3, at best 3.5 on the Russell scale of API misusability. Even worse, this conversion is done for any function implemented in C that expects unicode or bytes and receives the other type. For example, unicode() converts byte strings to unicode strings, and if not given an explicit encoding argument, uses the default encoding.
>>> unicode('bob')
u'bob'
>>> unicode('\xe2\x98\x83')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)

The right way to do it is to call .decode() on the byte string, naming the encoding explicitly.
>>> '\xe2\x98\x83'.decode('utf8')
u'\u2603'

Just be sure you don't call it on a unicode string by mistake, or you'll get very confused:
>>> u'foo'.decode('utf8')
u'foo'
>>> u'\u2603'.decode('utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/usr/lib/python2.4/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2603' in position 0: ordinal not in range(128)

Hey! That error says "encode", but we asked for it to decode!

It turns out that the utf8 decoder expects bytes as input, but we gave it unicode, so Python helpfully tries to convert the unicode characters into bytes first, using the default encoding! This is why the first decode call succeeds: all the characters in it can be converted to ASCII bytes.
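The two steps can be spelled out as a small model, again written for Python 3 so it runs today. The helper py2_style_unicode_decode is hypothetical; it only imitates the order of operations described above.

```python
def py2_style_unicode_decode(text, encoding, default_encoding='ascii'):
    """Imitate Python 2's u'...'.decode(encoding)."""
    # Step 1 (implicit in Python 2): the decoder wants bytes, so the
    # unicode string is first ENCODED with the default encoding.
    raw = text.encode(default_encoding)
    # Step 2 (what was actually asked for): decode those bytes.
    return raw.decode(encoding)


# ASCII-only text survives step 1, so the call appears to work.
harmless = py2_style_unicode_decode('foo', 'utf8')

# A snowman cannot be encoded as ASCII, so step 1 fails with an
# *encode* error before any decoding happens.
try:
    py2_style_unicode_decode('\u2603', 'utf8')
except UnicodeEncodeError as exc:
    surprise = exc
```

The exception comes from step 1, which is why the message talks about encoding even though the caller asked for a decode.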

Is there any hope?


Since Unicode was added to Python in 2.0, there's been a way to get Python 3's behavior. The site module in the Python standard library sets Python's default encoding on startup. If it is edited to use "undefined" instead of "ascii", the above examples all fail instead of converting:



>>> u'foo' + b'baz'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.6/encodings/undefined.py", line 22, in decode
    raise UnicodeError("undefined encoding")
UnicodeError: undefined encoding
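The "undefined" codec still ships with the standard library, so its behaviour is easy to poke at directly. This sketch runs on Python 3 and passes the codec name explicitly instead of editing site.py; the helper concat_with_default is hypothetical and only stands in for the implicit conversion.

```python
import codecs

# The codec is registered under the name "undefined".
codecs.lookup('undefined')


def concat_with_default(text, raw, default_encoding):
    """Stand-in for an implicit conversion using the given default."""
    return text + raw.decode(default_encoding)


# Even pure-ASCII input is refused: the codec rejects every
# conversion, whatever the data looks like.
try:
    concat_with_default('foo', b'baz', 'undefined')
except UnicodeError as exc:
    message = str(exc)
```

Unlike the ASCII default, this fails on every implicit conversion and not just on the inputs that happen to contain non-ASCII bytes.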


So why can't we switch to this today? Well, for one thing, lots and lots of code depends on this implicit encoding/decoding. Even code in the standard library:


Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.6/re.py", line 190, in compile
    return _compile(pattern, flags)
  File "/usr/lib/python2.6/re.py", line 243, in _compile
    p = sre_compile.compile(pattern, flags)
  File "/usr/lib/python2.6/sre_compile.py", line 506, in compile
    p = sre_parse.parse(p, flags)
  File "/usr/lib/python2.6/sre_parse.py", line 672, in parse
    source = Tokenizer(str)
  File "/usr/lib/python2.6/sre_parse.py", line 187, in __init__
    self.__next()
  File "/usr/lib/python2.6/sre_parse.py", line 193, in __next
    if char[0] == "\\":
  File "/usr/lib/python2.6/encodings/undefined.py", line 22, in decode
    raise UnicodeError("undefined encoding")
UnicodeError: undefined encoding


Not to mention loads of existing third-party apps and libraries. So what can we do? Fortunately, Python allows registering new codecs. I've written a slight variation on the ASCII codec, ascii_with_complaints, which preserves the default Python behaviour, but also produces warnings.
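The general shape of such a codec can be sketched with codecs.register. This is not the actual ascii_with_complaints code, just an illustration of the idea under a few assumptions: the codec name ascii_warn_sketch, the helper names, and the second warning message are invented here, and the example runs on Python 3, where the codec has to be named explicitly because there is no implicit conversion to hook into.

```python
import codecs
import warnings

_ascii = codecs.lookup('ascii')


def _warn_decode(data, errors='strict'):
    # stacklevel=2 attributes the warning to the code that triggered
    # the conversion, not to this wrapper.
    warnings.warn('Implicit conversion of str to unicode',
                  UnicodeWarning, stacklevel=2)
    return _ascii.decode(data, errors)


def _warn_encode(text, errors='strict'):
    warnings.warn('Implicit conversion of unicode to str',
                  UnicodeWarning, stacklevel=2)
    return _ascii.encode(text, errors)


def _search(name):
    # Codec search functions return a CodecInfo for names they
    # recognise and None for everything else.
    if name == 'ascii_warn_sketch':
        return codecs.CodecInfo(name=name,
                                encode=_warn_encode,
                                decode=_warn_decode)
    return None


codecs.register(_search)
```

The conversion result is identical to plain ASCII; the only difference is the warning, so existing code keeps working while its implicit conversions get logged.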


>>> import re
>>> re.compile(u'\N{SNOWMAN}+')
/usr/lib/python2.6/sre_parse.py:193: UnicodeWarning: Implicit conversion of str to unicode
  if char[0] == "\\":
/usr/lib/python2.6/sre_parse.py:418: UnicodeWarning: Implicit conversion of str to unicode
  if this and this[0] not in SPECIAL_CHARS:
/usr/lib/python2.6/sre_parse.py:421: UnicodeWarning: Implicit conversion of str to unicode
  elif this == "[":
/usr/lib/python2.6/sre_parse.py:478: UnicodeWarning: Implicit conversion of str to unicode
  elif this and this[0] in REPEAT_CHARS:
/usr/lib/python2.6/sre_parse.py:480: UnicodeWarning: Implicit conversion of str to unicode
  if this == "?":
/usr/lib/python2.6/sre_parse.py:482: UnicodeWarning: Implicit conversion of str to unicode
  elif this == "*":
/usr/lib/python2.6/sre_parse.py:485: UnicodeWarning: Implicit conversion of str to unicode
  elif this == "+":
<_sre.SRE_Pattern object at 0xb76d6f80>


Hopefully this will also be useful as a tool for ferreting out the Unicode bugs waiting to break your application.

The true blue fate with an artificial calendar

In case you were wondering, here's what I've been busy with lately.

I haven't given up on Python yet. Stay tuned, big stuff on the way.

Two-thirds slow, one-third amazing



My evident neglect of this site was not intentional. Moving across the country and starting a new job tend to reduce one's available time for open source work, and mine hasn't resulted in anything really worth announcing for the past year (or more). But today that changes!

Download PyMeta 0.4.0



Since the last release of Ecru I have been trying to get rid of its dependency on Python, by porting the E parser to E. In the process of doing so, I realized it was probably a bad idea to try to use a parser whose only form of error reporting was the string "Parse error". Since I'm still more familiar with Python than E, I started implementing error reporting in PyMeta. (Also, Python has a debugger.) This resulted in some significant rewrites of the internals.

So what's different in PyMeta 0.4?

Comments!


With its new space-age technology, PyMeta 0.4 will treat '#' as a comment character, just like Python. (Yes, this was rather overdue.)

Reorganized code generator


Previously, code for grammars was generated by the grammar parser calling methods on a builder object that directly emitted Python code as strings. Now the grammar parser builds a tree, which is then consumed by a code generator. If you want to generate something other than Python (or change how Python code gets generated), it should be a lot simpler now. Look at pymeta.builder.PythonWriter for the details, particularly the generate_ methods.
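As an illustration of the tree-then-generator split, here is a toy writer that dispatches on node type through generate_ methods. Everything in it is hypothetical: the class name, the tuple-based tree, the node names, and the emitted strings are invented for this sketch and are not PyMeta's actual PythonWriter or its output.

```python
class ToyWriter(object):
    """Toy code generator; NOT PyMeta's PythonWriter."""

    def generate(self, node):
        # A node is a tuple: (node_type, arg, arg, ...).
        name, args = node[0], node[1:]
        method = getattr(self, 'generate_' + name)
        return method(*args)

    def generate_Exactly(self, literal):
        # Leaf node: match one literal string.
        return 'self.exactly(%r)' % (literal,)

    def generate_Or(self, *branches):
        # Interior node: generate each branch, then combine them.
        parts = [self.generate(branch) for branch in branches]
        thunks = ', '.join('lambda: ' + part for part in parts)
        return 'self._or([%s])' % thunks


tree = ('Or', ('Exactly', 'a'), ('Exactly', 'b'))
code = ToyWriter().generate(tree)
```

Targeting another output language would then mean writing a second class with the same generate_ method names, leaving the grammar parser untouched.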

Error tracking and reporting


This is the big one. Previously, PyMeta expressions returned a value or raised an error. Now each expression evaluated by the parser returns both a value and an error, even when the parse succeeds. If the parse fails, an error still gets raised. Combining expressions with "|" returns the error from the alternative that got furthest into the input; when alternatives tie, their errors are combined.
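The furthest-error rule can be sketched in a few lines. This is a simplified model and not PyMeta's implementation: the error representation (a position plus a set of expected strings) and the names ParseFailure, join_errors, literal, and choice are all assumptions made for the example.

```python
class ParseFailure(Exception):
    """Raised when a parser fails; carries an error tuple."""

    def __init__(self, error):
        Exception.__init__(self, error)
        self.error = error


def join_errors(first, second):
    """Keep the error furthest into the input; merge ties."""
    if first is None:
        return second
    if second is None:
        return first
    if first[0] != second[0]:
        return first if first[0] > second[0] else second
    return (first[0], first[1] | second[1])


def literal(expected):
    def parse(text, pos):
        if text.startswith(expected, pos):
            # Success: a value, the new position, and no error.
            return expected, pos + len(expected), None
        raise ParseFailure((pos, frozenset([expected])))
    return parse


def choice(*alternatives):
    """Model of "|": try each alternative in order."""
    def parse(text, pos):
        error = None
        for alternative in alternatives:
            try:
                value, new_pos, alt_error = alternative(text, pos)
            except ParseFailure as failure:
                error = join_errors(error, failure.error)
            else:
                # Success still reports what else could have matched.
                return value, new_pos, join_errors(error, alt_error)
        raise ParseFailure(error)
    return parse


a_or_b = choice(literal('a'), literal('b'))
value, end, error = a_or_b('b', 0)
```

Here the parse of 'b' succeeds, yet the returned error still records that 'a' was tried and failed at position 0. That leftover information is what lets an enclosing expression report every valid input at the point of failure.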

The result is that you can now get nicely formatted output telling you where stuff went wrong and how. Here's an example, based on the old TinyHTML parser:


When you mismatch tags, the parser notices:


Notice here that the information in ParseError is structured so that future tools can figure out stuff about what failed and how.

If there's more than one possible valid input at the point of failure, the parser will tell you:


Plans for the Future


There are a few directions I'd like to take PyMeta in the future. A really nice thing would be to have a way to generate grammars ahead of time easily, writing out a Python module. This release includes bin/generate_parser which does a rather naive version of this. The problem is figuring out how to make grammars that are subclasses of something other than just OMetaBase.

Also, with the new code generator setup, it'd be fairly easy to generate Cython instead of Python, resulting in grammars that can be compiled as extension modules, hopefully resulting in much faster parse times. PyMeta isn't meant to be blazingly fast -- PEG parsers aren't known to be the most efficient -- but it'd be nice to squeeze all we can out of it.

Other people have asked for event based parsing and incremental output. Seems like a neat idea... I'd just have to figure out what that means. :-)

Special thanks to Marien Zwart and Cory Dodt for their contributions and encouragement for this release.