Unicode in Python, and how to prevent it



[UPDATE 16 Aug 2011] Armin Ronacher has written a nice module called unicode-nazi that provides the Unicode warnings I discuss at the end of this article.

Though I can't use Python 3 for any of my projects, it does have a few nice things. One particular behaviour where it improves on Python 2 is forbidding implicit conversions between byte strings and Unicode strings. For example:
Python 3.1.2 (release31-maint, Sep 17 2010, 20:34:23)
[GCC 4.4.5] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> 'foo' + b'baz'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: Can't convert 'bytes' object to str implicitly

If you do this in Python 2, it invokes the default encoding to convert between bytes and unicode, leading to manifold unhappinesses. So in Python 2, the above example looks like:
Python 2.6.6 (r266:84292, Sep 15 2010, 15:52:39)
[GCC 4.4.5] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> u'foo' + b'baz'
u'foobaz'

This looks OK, but problems lurk just below the surface. The default encoding used in nearly all circumstances is ASCII, and this conversion will blow up if non-ASCII bytes are involved.
>>> u'foo' + b'\xe2\x98\x83'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)

This is only a 3, at best a 3.5, on the Rusty Russell scale of API misusability. Even worse, this conversion is done for any function implemented in C that expects unicode or bytes and receives the other type. For example, unicode() converts byte strings to unicode strings, and if not given an explicit encoding argument, uses the default encoding.
>>> unicode('bob')
u'bob'
>>> unicode('\xe2\x98\x83')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)

The right way to do it is to call .decode() on the byte string.

>>> '\xe2\x98\x83'.decode('utf8')
u'\u2603'

Just be sure you don't call it on a unicode string by mistake, or you'll get very confused:
>>> u'foo'.decode('utf8')
u'foo'
>>> u'\u2603'.decode('utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/usr/lib/python2.4/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2603' in position 0: ordinal not in range(128)

Hey! That error says "encode", but we asked for it to decode!

It turns out that the utf8 decoder expects bytes as input, but we gave it unicode, so Python helpfully tries to convert those unicode characters into bytes first, using the default encoding! This is why the first decode call succeeds: all of its characters can be encoded as ASCII bytes.
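In other words, the failing call above is effectively performing this two-step conversion (a sketch, assuming the stock ASCII default encoding):

>>> u'\u2603'.encode('ascii').decode('utf8')  # the encode step raises UnicodeEncodeError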

Is there any hope?


Since Unicode was added to Python in 2.0, there's been a way to get Python 3's behaviour. The site module in the Python standard library sets Python's default encoding on startup. If edited to use "undefined" instead of "ascii", the above examples all fail instead of converting:



>>> u'foo' + b'baz'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.6/encodings/undefined.py", line 22, in decode
    raise UnicodeError("undefined encoding")
UnicodeError: undefined encoding
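(If you'd rather experiment with this without editing site.py, here's a sketch of the usual workaround: site.py deletes sys.setdefaultencoding once startup finishes, so you have to reload the sys module to get it back.)

>>> import sys
>>> reload(sys)
<module 'sys' (built-in)>
>>> sys.setdefaultencoding('undefined')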


So why can't we switch to this today? Well, for one thing, lots and lots of code depends on this implicit encoding/decoding. Even code in the standard library:


>>> import re
>>> re.compile(u'\N{SNOWMAN}+')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.6/re.py", line 190, in compile
    return _compile(pattern, flags)
  File "/usr/lib/python2.6/re.py", line 243, in _compile
    p = sre_compile.compile(pattern, flags)
  File "/usr/lib/python2.6/sre_compile.py", line 506, in compile
    p = sre_parse.parse(p, flags)
  File "/usr/lib/python2.6/sre_parse.py", line 672, in parse
    source = Tokenizer(str)
  File "/usr/lib/python2.6/sre_parse.py", line 187, in __init__
    self.__next()
  File "/usr/lib/python2.6/sre_parse.py", line 193, in __next
    if char[0] == "\\":
  File "/usr/lib/python2.6/encodings/undefined.py", line 22, in decode
    raise UnicodeError("undefined encoding")
UnicodeError: undefined encoding


Not to mention loads of existing third-party apps and libraries. So what can we do? Fortunately, Python allows registering new codecs. I've written a slight variation on the ASCII codec, ascii_with_complaints, which preserves the default Python behaviour, but also produces warnings.
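The idea, in sketch form (an illustration of the approach, not the released module): wrap the stock ASCII codec functions so they warn before delegating, and register the result under a new name.

import codecs
import warnings

def decode(input, errors='strict'):
    # delegate to the normal ASCII decoder, but complain first
    warnings.warn("Implicit conversion of str to unicode", UnicodeWarning, stacklevel=3)
    return codecs.ascii_decode(input, errors)

def encode(input, errors='strict'):
    warnings.warn("Implicit conversion of unicode to str", UnicodeWarning, stacklevel=3)
    return codecs.ascii_encode(input, errors)

def search_function(name):
    # codecs.register calls this with each encoding name it hasn't seen yet
    if name == 'ascii_with_complaints':
        return codecs.CodecInfo(encode, decode, name=name)

codecs.register(search_function)

With the default encoding pointed at this codec (as with "undefined" above), implicit conversions still work, but each one is reported: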


>>> import re
>>> re.compile(u'\N{SNOWMAN}+')
/usr/lib/python2.6/sre_parse.py:193: UnicodeWarning: Implicit conversion of str to unicode
  if char[0] == "\\":
/usr/lib/python2.6/sre_parse.py:418: UnicodeWarning: Implicit conversion of str to unicode
  if this and this[0] not in SPECIAL_CHARS:
/usr/lib/python2.6/sre_parse.py:421: UnicodeWarning: Implicit conversion of str to unicode
  elif this == "[":
/usr/lib/python2.6/sre_parse.py:478: UnicodeWarning: Implicit conversion of str to unicode
  elif this and this[0] in REPEAT_CHARS:
/usr/lib/python2.6/sre_parse.py:480: UnicodeWarning: Implicit conversion of str to unicode
  if this == "?":
/usr/lib/python2.6/sre_parse.py:482: UnicodeWarning: Implicit conversion of str to unicode
  elif this == "*":
/usr/lib/python2.6/sre_parse.py:485: UnicodeWarning: Implicit conversion of str to unicode
  elif this == "+":
<_sre.SRE_Pattern object at 0xb76d6f80>


Hopefully this will be useful to you as well, as a tool for ferreting out those Unicode bugs waiting to break your application.

The true blue fate with an artificial calendar

In case you were wondering, here's what I've been busy with lately.

I haven't given up on Python yet. Stay tuned, big stuff on the way.

Two-thirds slow, one-third amazing



My evident neglect of this site was not intentional. Moving across the country and starting a new job tend to reduce one's available time for open source work, and mine hasn't resulted in anything really worth announcing for the past year (or more). But today that changes!

Download PyMeta 0.4.0



Since the last release of Ecru I have been trying to get rid of its dependency on Python by porting the E parser to E. In the process, I realized it was probably a bad idea to use a parser whose only form of error reporting was the string "Parse error". Since I'm still more familiar with Python than E, I started implementing error reporting in PyMeta. (Also, Python has a debugger.) This resulted in some significant rewrites of the internals.

So what's different in PyMeta 0.4?

Comments!


With its new space-age technology, PyMeta 0.4 will treat '#' as a comment character, just like Python. (Yes, this was rather overdue.)

Reorganized code generator


Previously, code for grammars was generated by the grammar parser calling methods on a builder object that directly emitted Python code as strings. Now the grammar parser builds a tree, which is then consumed by a code generator. If you want to generate something other than Python (or change how Python code gets generated), it should be a lot simpler now. Look at pymeta.builder.PythonWriter, specifically the generate_ methods, for details.
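As a rough illustration of that split (method names here follow the generate_ convention, but this is a sketch, not PyMeta's actual source), the writer walks tagged tree nodes and dispatches on each node's tag:

class PythonWriter(object):
    # consumes a tree of ("Tag", arg, ...) nodes built by the grammar parser
    def generate(self, node):
        tag = node[0]
        return getattr(self, "generate_" + tag)(*node[1:])

    def generate_Apply(self, ruleName):
        # a rule application becomes a call into the parse-time runtime
        return "self.apply(%r)" % (ruleName,)

    def generate_Exactly(self, literal):
        return "self.exactly(%r)" % (literal,)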

Error tracking and reporting


This is the big one. Previously, PyMeta expressions either returned a value or raised an error. Now each expression evaluated by the parser returns both a value and an error, even when a successful parse was found. (On outright parse failure, an error still gets raised.) Combining expressions with "|" returns the error from whichever alternative got furthest into the input, merging errors that tie.
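A minimal sketch of that "furthest failure wins" rule (illustrative only; not PyMeta's actual ParseError class):

class ParseError(Exception):
    def __init__(self, position, expected):
        self.position = position       # how far into the input the parse got
        self.expected = set(expected)  # what would have been acceptable there

    def combine(self, other):
        # "|" keeps whichever failure got further into the input...
        if self.position != other.position:
            return self if self.position > other.position else other
        # ...and merges the expectations of failures that tie
        return ParseError(self.position, self.expected | other.expected)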

The upshot is that you can now get nicely formatted output telling you where things went wrong, and how. Here's an example, based on the old TinyHTML parser:


When you mismatch tags, the parser notices:


Notice that the information in the ParseError is structured, so future tools can determine what failed and how.

If there's more than one possible valid input at the point of failure, the parser will tell you:


Plans for the Future


There are a few directions I'd like to take PyMeta in the future. A really nice thing would be an easy way to generate grammars ahead of time, writing out a Python module. This release includes bin/generate_parser, which does a rather naive version of this. The problem is figuring out how to make grammars that are subclasses of something other than just OMetaBase.
Also, with the new code generator setup, it'd be fairly easy to emit Cython instead of Python, so grammars could be compiled as extension modules, hopefully with much faster parse times. PyMeta isn't meant to be blazingly fast -- PEG parsers aren't known to be the most efficient -- but it'd be nice to squeeze all we can out of it.

Other people have asked for event-based parsing and incremental output. Seems like a neat idea... I'd just have to figure out what that means. :-)

Special thanks to Marien Zwart and Cory Dodt for their contributions and encouragement for this release.

Ecru 0.2.0

It's been a bit longer for this release.



The major new additions are message buffering on eventual references, the Ref.whenResolved and Ref.whenBroken methods, and the __whenMoreResolved Miranda method. Along with eventual sends and the vat support from the previous release, this is enough to support the when syntax sugar. The Python interface can now control the E event loop: ecru.api.iterate runs a single item in the queue of the vat exposed to Python, and ecru.api.incrementalEval queues an E expression to be run, returning a promise to Python immediately. There's an example IRC-bot REPL in doc/examples/ircbot.py.
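A hypothetical usage sketch of those two calls (the import path and exact argument shapes beyond what's described above are assumptions):

from ecru import api

p = api.incrementalEval("1 + 1")  # queue an E expression; a promise comes back immediately
api.iterate()                     # run one item from the vat's event queue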

Special shout-out to Eric Mangold (aka teratorn) for bugfixes and testing. Thanks Eric!

Ecru 0.1.3


Time for another Ecru release. This release should be a lot more stable: ctypes is no longer used for the Python interface. The major behaviour change I've added is support for Selfless objects, allowing for value equality: for example, Map objects with the same contents now compare equal. The rest of the work has been in code reorganization and preparation for concurrency support; vats are implemented now, and the REPL executes code by putting it into the vat's event queue. Eventual sends work now (i.e., "foo <- doStuff(x, y)"). Goals for the next couple releases are actual multi-vat execution, support for the 'when' expression, and portability to other OSes.

Ecru, A C Runtime For E

I'm happy to announce Ecru, a new implementation of the E language.



E is a language designed both for security and for safe, efficient concurrency. Twisted borrowed the idea of Deferreds from E, where they are much better integrated into the language than in Python, owing to the syntax and library support E provides. E's design is based on capability security, even to the level of allowing mutually untrusting programs to run in the same process. Thanks to its consistent focus on security, it's possible to write secure programs without doing much more than sticking to good object-oriented style.

My goals with Ecru development are, initially, to develop an environment suitable for restricted execution of the type that my esteemed colleague has been looking for, allowing safe scripting of server-hosted applications. A wider goal is to provide an effective replacement to the C implementation of Python for development of network software; a successor to Twisted, as it were. Furthermore, E's semantics lend themselves to more efficient implementation than Python's. Although Ecru has received essentially no optimization work thus far, I believe it may be possible to make it faster than Python for many tasks, without being any more difficult to work with.



Right now Ecru depends on Python, since I'm using PyMeta for its parser. Ecru only implements enough of E to run the compiler; I plan to soon implement OMeta in E, so that Ecru can also be used standalone. There's a simple REPL in Python, so you can download Ecru and try it out right now. Currently I'm focusing on cleaning up the code (bootstraps are usually messy affairs) and replacing some of the standard-library object stubs I wrote in C with the versions implemented in E in use by the existing Java version.

PyMeta 0.3.0 Released

Originally when I was implementing PyMeta I was sure that the input stream implementation that the Javascript version used was inefficient. Rather than having a mutable object with methods for advancing and rewinding the stream, it has immutable objects with "head" and "tail" methods, which return the value at the current position and a new object representing the next position in the stream. All that allocation couldn't be healthy.


Turns out I was wrong. I misunderstood the requirements for OMeta's input stream. Various operations require the ability to save a particular point in the stream and go back to it. To further complicate matters, arguments to rules are passed on the stream by pushing them onto the front, and rules that take arguments read them off of the stream. This is very handy for certain types of pattern matching, but it totally destroys any hope of simply implementing the input as a list and an index into it, because there has to be a way to uniquely identify values pushed onto the stream. If a value gets pushed onto the stream, is read from it, then another one is pushed on, both of them have the same predecessor, but they don't have the same stream position. It becomes more like a tree than a sequence. JS-OMeta handled this by just creating a new stream object for each argument. I didn't give up soon enough on my clever idea when initially implementing PyMeta, and it grew more complicated with each feature I implemented, involving a bunch of lists for buffering things and a complicated mark/rewind scheme.
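A minimal sketch of the JS-style immutable stream (illustrative; not PyMeta's actual classes) shows why pushed arguments get distinct identities: each push allocates a fresh object in front of its parent, so positions form a tree rather than a flat sequence.

class InputStream(object):
    # one object per position; "rewinding" is just keeping an old reference
    def __init__(self, data, position=0):
        self.data = data
        self.position = position

    def head(self):
        return self.data[self.position]

    def tail(self):
        return InputStream(self.data, self.position + 1)

class ArgInput(object):
    # pushing a rule argument creates a brand-new node in front of `parent`;
    # two values pushed at the same position are still distinct stream objects
    def __init__(self, head, parent):
        self._head = head
        self.parent = parent

    def head(self):
        return self._head

    def tail(self):
        return self.parent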


After writing a rather complicated grammar with PyMeta, I began to wonder if I could improve its speed. By this time I knew the JS version's algorithm was less complicated, so I decided to try it out. It cut my grammar's unit tests' runtime from around 40 seconds to 4 seconds. Also, it fixed a bug.



Moral of the story: I'm not going to try to implement clever optimizations of things I don't fully understand any more. :)

I've released a version with this new input implementation. Get it in the usual place.