On Readability

Programs must be written for people to read, and only incidentally for machines to execute. — Abelson & Sussman, Structure and Interpretation of Computer Programs

Code readability gets talked about a lot these days. I haven't yet heard from anyone who's opposed to it. Unfortunately, learning to read code is a skill rarely discussed and even more rarely taught. As the SICP quote above points out, readability is perhaps the most important quality to aim for when writing. But what is code readability, exactly?

I propose there are three essential levels of code readability:

  • "What is this code about?"
  • "What is this code supposed to do?"
  • "What does this code do?"

The first level is important when you're skimming code, trying to develop a picture of its overall structure. Good organization into modules tends to help a lot with this; if you have modules named util then this is harder than it has to be. Supporting docs describing architecture and organization can assist with this as well, along with usage examples if the code you're reading is a library.

The second level is what you encounter once you've found some code and you want to use or modify it. Maybe it's a library with weak or missing documentation and you're trying to discover how a function wants to be called or which methods to override in a class you need to inherit from. Good style, good class/function names, docstrings, and comments can all be very helpful in making your code readable for this case. There's been some research which associates poor-quality identifier names with obvious bugs, so even if your code works well, poor naming can make it look like code that doesn't.

However, neither of these is the most important sense in which readability matters. They're more about communicating intent, so that later readers of your code can figure out what you were thinking. Sometimes it's a way to share hard-won insight that leaves indelible scars. But the final level is what matters most and matters longest.

To understand a program you must become both the machine and the program. — Alan Perlis, Epigrams In Programming #23

The most important level of readability is being able to look at code and understand what it actually does when run. All the factors discussed above — naming, style, documentation — cannot help with this task at all. In fact, they can be actively harmful to this. Even if the comment accurately described the code when it was written, there's no reason it has to now. The most obvious time you'll need to engage in this level of code reading is debugging — when the code looks like it does the right thing, but actually doesn't.

Language design plays a big role in supporting or detracting from the creation of readable code. As the link above shows, C provides a myriad of features that either fight against or outright destroy readability. So when picking a language, include readability as a factor in your decision making — and not just how the syntax looks. Mark Miller provides some excellent thoughts on how to design a language for readability in his notes on The Power of Irrelevance. There are also several people studying how to build tools to help us read code more effectively; Clarity In Code touches on some of the issues being considered in that field.

But given the constraints of the language you're currently using, what can we do to improve readability in our code? The key is promoting local reasoning. The absolute worst case for readability occurs when you have to understand all the code to understand any of the code. So we want to preserve as many barriers between portions of the program as are needed to prevent this.

Since local reasoning is good, global state is bad. The more code that can affect the behavior of the function or class you're looking at right now, the more work it takes to actually discern what it'll do, and when. Similarly, threads (and coroutines) destroy readability, since they destroy the ability to understand control flow locally in a single piece of code. Other forms of "magic" like call stack inspection, adding behavior to a class from other modules, metaclass shenanigans, preprocessor hacks, or macros all detract from local reasoning as well.
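Here's a minimal sketch of the difference, using made-up names (TAX_RATE, price_with_tax) that aren't from any real codebase. The first function can only be understood by knowing every piece of code that might touch the global; the second can be understood by reading it.

```python
# Version 1: behaviour depends on a module-level global that any code can change.
TAX_RATE = 0.25

def price_with_tax_global(price):
    return price * (1 + TAX_RATE)

# Version 2: everything the function needs arrives through its parameters.
def price_with_tax(price, tax_rate):
    return price * (1 + tax_rate)

def unrelated_setup():
    """Stand-in for some distant code that quietly rebinds the global."""
    global TAX_RATE
    TAX_RATE = 0.5

before = price_with_tax_global(100)
unrelated_setup()
after = price_with_tax_global(100)        # same call, different answer

local_before = price_with_tax(100, 0.25)
unrelated_setup()
local_after = price_with_tax(100, 0.25)   # same call, same answer
```

To predict what the first version returns you have to find unrelated_setup and everything like it. The second version needs nothing but its own two lines.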

Since this perspective on programming is so little discussed or taught, it leads to a communications gap between inexperienced and veteran programmers. Once an experienced programmer has spent enough time picking apart badly designed code, it can become the dominant factor in his assessment of all the code he sees. I've experienced this myself fairly often: code that could be easily made a little more readable makes me itch; code that can't easily be fixed that was written with no thought for readability can be quite upsetting. But writing readable code is extra work, so people who haven't spent dozens of hours staring at a debugger prompt are sometimes baffled by the strong emotions these situations inspire. Why is it such a big deal to make that variable global? It works, after all.

Every program has at least one bug and can be shortened by at least one instruction — from which, by induction, it is evident that every program can be reduced to one instruction that does not work. — Ken Arnold

When considering how to write readable code, choice of audience matters a lot. Who's going to read what you're writing? When? For writing prose, we do this all the time. We use quite different style in a chat message or email than in a blog post, and a different style again in a formal letter or article. The wider your audience and the longer the duration you expect the message to be relevant, the more work you put into style, clarity, and readability. The same applies to code. Are you writing a one-off script that's only a few dozen lines long? Using global variables and one-letter identifiers is probably not going to hurt you, because you will most likely delete the code rather than read it again. Writing a library for use in more than one program? Be very careful about using any global state at all. (Also pay special attention to the names you give your classes/modules/functions; they may be very difficult to change later.) If new programmers were taught this idea as well as they're taught how to, e.g., override methods in a class or invoke standard libraries, it would be a lot easier for both new programmers and those mentoring them to relax.

So when you do set out to write readable code, consider your audience. There are some obvious parties to consider. If you're writing a library, your users will certainly read some of your code, even if it's just examples of how to use your library. Anyone who wants to modify your code later will need to read it. If your code is packaged by an OS distribution or other software collection, the packager will often need to read parts of your code to see how it interacts with other elements of the system. Security auditors will want to read your code to know how it handles the authority it's granted or the secrets it protects. And not least, you're writing for your future self! So even if those other people don't exist or don't matter to you — make life easier on that last guy. He'll thank you for it.

Coroutines reduce readability

A recent email thread was brought to my attention which suggested adding greenlet-style coroutines to the Python standard library. I felt like this would be a good time to go into why coroutines are a bad idea.

"Readability counts."

Many keyboards have been worn out debating how to make code more readable, and what affects readability. One of the reasons I've enjoyed using Python so much is that it doesn't fight (much) my efforts to write code that's easy to read. Proponents of coroutines, as used in libraries such as gevent, have claimed that a major advantage is that they make networking code easier to read, compared to other concurrency mechanisms such as generators or callbacks. I am going to argue instead that coroutines make code harder to read. Before I get into that, I'm going to propose this definition of readability:

A program is readable when you can look at its code and understand what it does.

Note particularly that this is different from looking at code and understanding what the author intended the program to do. Readability counts most when you're reading code that doesn't work (such as when debugging) or code that might not work the way it should (such as when doing a security audit). Designing for readability means designing for adversarial review of code.

As Mark Miller and Dave Herman have pointed out, when first learning to program in a language like Python, there are some basic assumptions we make about control flow. The main one I want to talk about here is that it's possible to understand what happens when you call a function by reading the code of the function.

Consider this trivial example.

self._foo.a = self._foo.b
self._foo.b = baz()

Suppose you want to determine whether any code can see self or self._foo while its internal attributes are disarranged — in this case, the time during which its a and b attributes are set to the same value. Normally in Python we'd be able to answer this question by reading the source for baz. However, in the presence of coroutines this isn't sufficient! If baz, or anything it calls, invokes something that causes the current coroutine to suspend, then any other code can be invoked at that point, thus making it impossible to keep this internal mutation from being exposed.
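The hazard can be imitated in plain Python 3 without a coroutine library. In this sketch a bare yield stands in for a suspension hidden somewhere inside baz; the Foo class, the observer, and the toy round-robin scheduler are all assumptions made for illustration. With greenlet-style coroutines the switch would not be visible in the source the way the yield is here.

```python
class Foo(object):
    def __init__(self):
        self.a = 1
        self.b = 2

def update(foo, new_b):
    """Mirrors the two-line example above; the bare `yield` stands in for
    a suspension buried somewhere inside baz()."""
    foo.a = foo.b
    yield                      # another task may run here
    foo.b = new_b

def observer(foo, seen):
    """Any other code that happens to look at foo while update is suspended."""
    seen.append((foo.a, foo.b))
    yield

foo = Foo()
seen = []
tasks = [update(foo, 3), observer(foo, seen)]

# A toy round-robin scheduler: step each task in turn until all have finished.
while tasks:
    for task in list(tasks):
        try:
            next(task)
        except StopIteration:
            tasks.remove(task)
```

The observer records foo with a and b both set to 2, which is exactly the disarranged state the author of update hoped nobody would see.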

"In the face of ambiguity, refuse the temptation to guess."

There are many different situations where this sort of problem arises. In general, any kind of imperative code needs to be able to preserve invariants for its data structures, while still being able to do work that might temporarily violate those invariants. This is why Python has the with and try/finally structures; being able to express some level of transaction-like behaviour is useful, so you can worry about cleanup and invariants at a single place.

These are only useful for operations that aren't extended in time, however. When using coroutines, it's possible to write code where finally blocks don't get a chance to run before something in another coroutine interferes. More distressingly, the finally block may not run at all! When a coroutine is suspended, there's no guarantee it will be resumed before the program terminates.
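A generator gives a small, deterministic way to see the delayed finally block in Python 3. This is only a sketch: the function name and the log list are invented for the example, and a generator is standing in for a suspended coroutine.

```python
log = []

def guarded():
    log.append("acquire")
    try:
        yield                      # suspended here, inside the try block
    finally:
        log.append("release")

task = guarded()
next(task)                         # run up to the suspension point
log.append("other code runs")      # the finally block has not run yet
snapshot = list(log)

task.close()                       # only now does the finally block execute
```

Between next(task) and task.close(), the resource is "acquired" and arbitrary other code runs. If nothing ever resumes or closes the task, when (or whether) the cleanup happens is left to the interpreter.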

If this sounds a lot like using threads, it's because it is. Coroutines are a form of threads; they're the foundation for what are called "green threads" in some language runtimes, such as early versions of Java and Ruby. The problems with threads are well documented, and various tools have been developed to deal with the problems they introduce, such as mutexes, locks, and queues. Not all coroutine libraries provide these tools, and the ones that do don't encourage their pervasive use. The only salient difference in behavior is that OS-provided threads can be interrupted at more points. On the other hand, OS threads can be scheduled on multiple processors at once, providing parallelism. So, in conclusion: coroutines are strictly worse than threads, because they have the same kinds of problems (non-determinism, loss of code readability) and do not offer any unique advantages.

Superior options for concurrency are use of Deferreds to manage callbacks, or generators. The primary historical objection to callbacks is the "pyramid of doom", where functions get nested to ridiculous depths. Deferreds make callback-invoking code composable, and help flatten out the functions used, as David Reid has ably shown. Use of callbacks/Deferreds lets you keep all your normal assumptions about control flow. Invoking a function can return a Deferred, but it can't do anything to suspend your code calling it. Once a function is exited, it can't be re-entered without calling it again. So in a very useful sense, Deferreds make concurrent code much more readable.
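To make the shape of this concrete without depending on Twisted, here's a toy stand-in written for Python 3. TinyDeferred is not Twisted's Deferred and leaves out error handling entirely; it borrows only the addCallback/callback naming to show how a chain of callbacks stays flat. The fetch_page function is hypothetical.

```python
class TinyDeferred(object):
    """Toy stand-in for illustration only; not Twisted's Deferred."""

    def __init__(self):
        self._callbacks = []
        self._fired = False
        self._result = None

    def addCallback(self, fn):
        if self._fired:
            # Result already available: run the callback right away.
            self._result = fn(self._result)
        else:
            self._callbacks.append(fn)
        return self                      # allow chaining

    def callback(self, result):
        # Deliver the result, passing it through each callback in order.
        self._fired = True
        self._result = result
        for fn in self._callbacks:
            self._result = fn(self._result)
        self._callbacks = []


def fetch_page(url):
    """Hypothetical asynchronous operation: returns immediately."""
    return TinyDeferred()

results = []
d = fetch_page("http://example.com/")
d.addCallback(lambda body: body.upper())
d.addCallback(lambda body: len(body))
d.addCallback(results.append)

results_before = list(results)           # nothing has run yet
d.callback("hello")                      # the "network" delivers its result
```

Each step is an ordinary function that runs from start to finish once called. Nothing in the calling code is suspended; fetch_page simply returns an object and the caller carries on.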

Generators let you keep most of your assumptions, but they add an extra rule: a function can be suspended and (maybe) later re-entered when a yield keyword is encountered. This provides the same amount of information as callbacks, but does enable some cases that require a good bit more squinting and head-scratching to figure out.

I believe that better syntax can provide the convenience of generators and the clarity benefits of Deferreds. More about that in a future post.

Exocet: A Second Look

What I didn't really talk about last time is that, beyond letting you express module loading directly, Exocet also implements:

Parameterized Modules



In Python, classes and functions take parameters, but modules don't. When you load a module, you don't have any opportunity to tell it what you want, or what context it's being loaded in. A lot of times, this matters.

For example, some code has optional dependencies. This idiom can be seen a lot in some parts of Twisted:

try:
    from OpenSSL import SSL
except ImportError:
    SSL = None


Code following this checks if 'SSL' is None to decide whether to define methods and classes that provide support for SSL connections in Twisted.
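As a rough illustration of what "code following this" tends to look like, here's a sketch in Python 3. The describe_transports function is invented for the example and isn't taken from Twisted.

```python
try:
    from OpenSSL import SSL
except ImportError:
    SSL = None

def describe_transports():
    """Hypothetical helper: report which transports this process can offer."""
    transports = ["tcp"]
    if SSL is not None:
        # Only advertise SSL when the optional dependency imported cleanly.
        transports.append("ssl")
    return transports
```

The module-level SSL name acts as a global flag that every later definition has to consult.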


Other code can depend on one of multiple providers of an interface. Here's the famous example from the docs for the magnificent lxml library:


try:
    from lxml import etree
    print("running with lxml.etree")
except ImportError:
    try:
        # Python 2.5
        import xml.etree.cElementTree as etree
        print("running with cElementTree on Python 2.5+")
    except ImportError:
        try:
            # Python 2.5
            import xml.etree.ElementTree as etree
            print("running with ElementTree on Python 2.5+")
        except ImportError:
            try:
                # normal cElementTree install
                import cElementTree as etree
                print("running with cElementTree")
            except ImportError:
                try:
                    # normal ElementTree install
                    import elementtree.ElementTree as etree
                    print("running with ElementTree")
                except ImportError:
                    print("Failed to import ElementTree from any known place")


The effect of this code is to try to import, in order, one of:

  1. lxml
  2. cElementTree from the stdlib
  3. ElementTree from the stdlib
  4. cElementTree installed separately
  5. ElementTree installed separately


I don't think it's a stretch to say this is rather silly. How would you feel if you saw a function that tried to access five different global variables in a row in order to decide what to do?

And though all of these modules implement the same interface, they're still different code, and you might hit some edge case where their behaviour differs. How do you test your code using each possible ElementTree implementation? As Glyph points out, it's always sunny in Python, and every piece of bad code can be worked around by writing worse code; you could have your unit tests fool around with sys.modules. But can't there be something better?
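For reference, the sys.modules approach looks roughly like this in Python 3. It's a sketch: my_xml_provider is a hypothetical module name, and the fake module implements only a parse function.

```python
import sys
import types

# A fake provider exposing only what the code under test uses.
fake_etree = types.ModuleType("fake_etree")
fake_etree.parse = lambda source: "parsed:" + source

def code_under_test():
    # Hypothetical module name; the import is satisfied from sys.modules.
    import my_xml_provider
    return my_xml_provider.parse("mydata.xml")

saved = sys.modules.get("my_xml_provider")
sys.modules["my_xml_provider"] = fake_etree
try:
    result = code_under_test()
finally:
    # Forgetting this restore step leaks the fake into every later test.
    if saved is None:
        del sys.modules["my_xml_provider"]
    else:
        sys.modules["my_xml_provider"] = saved
```

It works, but the test has to reach into process-wide state and remember to put everything back afterwards.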

Here's a different way of thinking about it entirely.

In languages that aren't as good as Python, dependency injection is a technique that gets used to deal with this. Dependency injection has many forms and can be rather complicated, but the general idea is code declares the thing it needs, and something else (unfortunately, often an XML file!) describes what objects to provide to satisfy those dependencies.
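Stripped of frameworks and XML files, the core of the idea fits in a few lines of Python 3. In this sketch, count_children and StubEtree are made-up names; the only real API used is fromstring from xml.etree.ElementTree.

```python
import xml.etree.ElementTree as stdlib_etree

def count_children(xml_text, etree):
    """`etree` is any object providing an ElementTree-style fromstring()."""
    root = etree.fromstring(xml_text)
    return len(list(root))

class StubEtree(object):
    """Test double: no parsing at all, just a canned answer."""
    @staticmethod
    def fromstring(text):
        return ["child-1", "child-2", "child-3"]

real_count = count_children("<a><b/><b/></a>", stdlib_etree)
stub_count = count_children("ignored", StubEtree)
```

The function declares what it needs as a parameter, and the caller decides which implementation to hand over.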


With Exocet, we hijack the import statement to describe named parameters, indicating the dependencies our code has.
from exocet.parameters import etree

x = etree.parse(open("mydata.xml"))


Now, if you look in the Exocet tarball, you won't find a parameters.py file. This name doesn't correspond to anything on the filesystem.

So how does this help us? Well, it means you can load your ElementTree-using module like this:
m = exocet.pep302Mapper.withOverrides({"exocet.parameters.etree":
                                       xml.etree.cElementTree})
my_etree_using_module = exocet.loadNamed("my_etree_using_module", m)


Or this:
m = exocet.pep302Mapper.withOverrides({"exocet.parameters.etree":
                                       lxml.etree})
my_etree_using_module = exocet.loadNamed("my_etree_using_module", m)


This way, you can test your code without having to resort to sys.modules hackery, and you can better factor your applications by separating configuration and environment concerns from the rest of your code.

As you may recall from last time, pep302Mapper provides the default Python implementation of module loading. In this case, we've just added another name that can be imported, exocet.parameters.etree.

Being able to provide parameters to modules opens up the possibility to eliminate all use of global objects from your code, and pass objects only to the code that needs them. I believe that with experimentation and refinement of these tools, this technique is going to enable a lot of simpler methods for organizing complex applications and reduce a lot of complications people have to deal with in Python now.

Introducing Exocet

Last time I talked about the deficiencies of Python's module system. Now I'd like to talk about a solution to them.

There are two questions related to Python's module system that come up repeatedly on #python, and don't have obviously good answers.


  1. How do I reload a module?

  2. How do I create a plugin system?

Exocet was written primarily to answer these questions.

Download Exocet 0.5

What is Exocet?


Exocet is a new way to load Python modules. It separates the act of naming a dependency from the act of creating a module object. As a result, more than one instance of a module can be created from the same source file. Also, when creating a module object, precise control can be exerted over what it's allowed to import.

Let's start with a code example.

>>> import exocet, httplib 
>>> urllib = exocet.loadNamed("urllib", exocet.pep302Mapper)

>>> def HTTP_Ex(host):
...     print "making HTTP connection to", host
...     return httplib.HTTP(host)
...

>>> class _ModuleProxy(object):
...     def __getattribute__(self, name):
...         if name == 'HTTP':
...             return HTTP_Ex
...         else:
...             return getattr(httplib, name)

>>> httplib_plus = _ModuleProxy()
>>> overriddenMap = exocet.pep302Mapper.withOverrides({"httplib": httplib_plus})
>>> urllib_plus = exocet.loadNamed("urllib", overriddenMap)

>>> print urllib.urlopen("http://python.org").read(121)
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

>>> print urllib_plus.urlopen("http://python.org").read(121)
making HTTP connection to python.org
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

Quick breakdown of what's going on here: loadNamed takes a module name and a mapper object, and returns a new module object. Note that this is very different from what import does; each invocation of loadNamed returns a new module object, unrelated to any previous ones.

The mapper object is responsible for intercepting all import calls made by the module being loaded. The pep302Mapper object used here just invokes Python's normal importing behaviour, caching loaded modules in sys.modules. But its withOverrides method creates a new mapper, one that looks in a dict first. The urllib_plus module is almost like the urllib module in this example, with one difference: the latter got the real httplib module when it executed the statement import httplib; urllib_plus got the httplib_plus module when it ran that statement. The other mapper that Exocet includes by default is emptyMapper, in which all import statements will fail. You can construct your own, providing any object under any name to be available to import statements in modules you load.

So after constructing these two modules, they're usable as normal. The only difference is that urllib_plus invokes our wrapper function when it calls httplib.HTTP(), producing the printed line before proceeding with its work.
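The general idea can be approximated with nothing but the Python 3 standard library. To be clear, this is not how Exocet is implemented, and load_source is a name invented for this sketch; it only shows that executing source in a fresh module object, with imports routed through a dict of overrides, is enough to get independent module instances with substituted dependencies. It relies on CPython looking up __import__ in the executing module's builtins.

```python
import builtins
import types

def load_source(name, source, overrides):
    """Execute `source` in a brand-new module object. Imports of names found
    in `overrides` get the supplied objects; everything else falls through
    to the normal import machinery."""
    def fake_import(modname, globals_=None, locals_=None, fromlist=(), level=0):
        if modname in overrides:
            return overrides[modname]
        return builtins.__import__(modname, globals_, locals_, fromlist, level)

    module = types.ModuleType(name)
    # Give this module its own builtins with __import__ swapped out.
    module.__dict__["__builtins__"] = dict(vars(builtins), __import__=fake_import)
    exec(compile(source, name, "exec"), module.__dict__)
    return module

# Hypothetical module source that depends on a module called `clock`.
source = """
import clock

def stamp():
    return clock.now()
"""

m1 = load_source("stamper", source, {"clock": types.SimpleNamespace(now=lambda: 42)})
m2 = load_source("stamper", source, {"clock": types.SimpleNamespace(now=lambda: 7)})
```

Here m1 and m2 come from the same source but are separate module objects, each wired to a different clock, and neither is registered in sys.modules.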

So, how does this behaviour answer the questions I mentioned earlier?

Module reloading


As we saw last time, reload isn't suitable for real-world use; it changes things around too much in some places, not enough in others. With Exocet, you don't have to reload modules because you can just load them. When you want a new module with updated code, you just load it and have a new module object. All the old objects still exist as long as you want them to; there's no orphaned-instance problem. If you need new versions of the module's dependencies you can load them first and put them in the mapper so that they're visible to your new module. Old instances, classes, and modules can be removed in the normal way — by Python's garbage collection, when nothing uses them any longer.

Plugin systems


The normal use of Python packages and modules involves importing specific ones by name. However, sometimes you want to load a bunch of modules and call something in each. Since Exocet borrows code from twisted.python.modules, it's possible to iterate over a package and load each module in it:

>>> import exocet 
>>> [exocet.load(x, exocet.pep302Mapper) for x in
... exocet.getModule("twisted.plugins").iterModules()]
[<module 'twisted.plugins.cred_anonymous' from '/home/washort/Projects/Twisted/trunk/twisted/plugins/cred_anonymous.py'>,
<module 'twisted.plugins.cred_file' from '/home/washort/Projects/Twisted/trunk/twisted/plugins/cred_file.py'>,
<module 'twisted.plugins.cred_memory' from '/home/washort/Projects/Twisted/trunk/twisted/plugins/cred_memory.py'>,
<module 'twisted.plugins.cred_unix' from '/home/washort/Projects/Twisted/trunk/twisted/plugins/cred_unix.py'>,
<module 'twisted.plugins.twisted_conch' from '/home/washort/Projects/Twisted/trunk/twisted/plugins/twisted_conch.py'>,
<module 'twisted.plugins.twisted_ftp' from '/home/washort/Projects/Twisted/trunk/twisted/plugins/twisted_ftp.py'>,
<module 'twisted.plugins.twisted_inet' from '/home/washort/Projects/Twisted/trunk/twisted/plugins/twisted_inet.py'>,
<module 'twisted.plugins.twisted_lore' from '/home/washort/Projects/Twisted/trunk/twisted/plugins/twisted_lore.py'>,
<module 'twisted.plugins.twisted_mail' from '/home/washort/Projects/Twisted/trunk/twisted/plugins/twisted_mail.py'>,
<module 'twisted.plugins.twisted_manhole' from '/home/washort/Projects/Twisted/trunk/twisted/plugins/twisted_manhole.py'>,
<module 'twisted.plugins.twisted_names' from '/home/washort/Projects/Twisted/trunk/twisted/plugins/twisted_names.py'>,
<module 'twisted.plugins.twisted_news' from '/home/washort/Projects/Twisted/trunk/twisted/plugins/twisted_news.py'>,
<module 'twisted.plugins.twisted_portforward' from '/home/washort/Projects/Twisted/trunk/twisted/plugins/twisted_portforward.py'>,
<module 'twisted.plugins.twisted_qtstub' from '/home/washort/Projects/Twisted/trunk/twisted/plugins/twisted_qtstub.py'>,
<module 'twisted.plugins.twisted_reactors' from '/home/washort/Projects/Twisted/trunk/twisted/plugins/twisted_reactors.py'>,
<module 'twisted.plugins.twisted_runner' from '/home/washort/Projects/Twisted/trunk/twisted/plugins/twisted_runner.py'>,
<module 'twisted.plugins.twisted_socks' from '/home/washort/Projects/Twisted/trunk/twisted/plugins/twisted_socks.py'>,
<module 'twisted.plugins.twisted_telnet' from '/home/washort/Projects/Twisted/trunk/twisted/plugins/twisted_telnet.py'>,
<module 'twisted.plugins.twisted_trial' from '/home/washort/Projects/Twisted/trunk/twisted/plugins/twisted_trial.py'>,
<module 'twisted.plugins.twisted_web' from '/home/washort/Projects/Twisted/trunk/twisted/plugins/twisted_web.py'>,
<module 'twisted.plugins.twisted_web2' from '/home/washort/Projects/Twisted/trunk/twisted/plugins/twisted_web2.py'>,
<module 'twisted.plugins.twisted_words' from '/home/washort/Projects/Twisted/trunk/twisted/plugins/twisted_words.py'>]


In this case we used the default pep302Mapper since we wanted plugin modules to be able to import anything they wanted. However, a different mapper could supply different modules, restrict plugins to importing only certain ones, or allow no imports at all.
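For comparison, here's a sketch of the same enumerate-and-load pattern using only the Python 3 standard library (pkgutil and importlib) rather than Exocet. The demo_plugins package is created on the fly purely for the example. Unlike exocet.load, importlib.import_module goes through the ordinary import system, so the modules are cached in sys.modules and get no custom mapper.

```python
import importlib
import os
import pkgutil
import sys
import tempfile

# Build a throwaway package with two plugin modules (hypothetical names).
root = tempfile.mkdtemp()
pkg_dir = os.path.join(root, "demo_plugins")
os.mkdir(pkg_dir)
for filename, body in [
    ("__init__.py", ""),
    ("alpha.py", "NAME = 'alpha'\n"),
    ("beta.py", "NAME = 'beta'\n"),
]:
    with open(os.path.join(pkg_dir, filename), "w") as f:
        f.write(body)

sys.path.insert(0, root)
importlib.invalidate_caches()
package = importlib.import_module("demo_plugins")

# Enumerate the package's modules, import each one, and read something from it.
plugins = [
    importlib.import_module(info.name)
    for info in pkgutil.iter_modules(package.__path__, prefix="demo_plugins.")
]
names = sorted(plugin.NAME for plugin in plugins)
```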

Unit testing



Hardly anybody asks about this on #python, but they should. By loading the modules you test with Exocet, you can replace their dependencies with stubs without monkeypatching. Just provide overrides to the mapper object when loading the module, and you have a stubbed or mocked module to test without altering the behaviour of other tests.

Next steps


Exocet lets you load most existing Python modules, but in a way that gives much better control over what happens. This makes room for radically different ways of thinking about how to organize Python programs. Since this approach is so different, practice is needed to come up with new idioms for how to use it effectively. I've tried carefully to provide mechanism here with very little policy. There are a few kinks to still work out in how Exocet works in more complicated scenarios, but give it a try! It might be just the thing for your next IRC bot. ;-)

If you want to help with Exocet development, it happens here, on Launchpad. I look forward to your feedback.

Modules in Python: the Good, the Bad, and the Ugly

I've spent a lot of time living with Python's module system, both in my own
work and in helping people on Freenode's #python channel. A lot of Python's
power comes from its module system; however, it could be better. It can be
hard to think about how modules could be done differently, since it's so
central to the design of Python software, but it's worth the effort. Here's
some stuff I've been thinking about.



The Good




No global namespace for objects

This is the benefit of module systems in general and it's a big one. It's what I miss the most when writing in C, for example. (And, until very recently, JavaScript!) Being able to treat a source file as an actual container for code instead of an arbitrary pile of functions and variables with no structure makes code much easier to comprehend and navigate.

Python modules are easy to write and to import

Particularly I'm comparing this to Scheme 48 and ML, which both have very well-designed and powerful module systems, but they're rather confusing to the newcomer because they require a good bit of up-front knowledge to construct a module that's useful to anyone else. In Python, you just stick some code in a file and then all the names in it are importable. My earliest Python memory was joining #python and asking "how do I import some code I wrote in one file into another"? I was told 'for a file named foo.py, use "import foo"'. My reaction was "Really? That's all?" Providing a low barrier to entry for creating and using modules is an extremely powerful advantage of Python.



The Bad



Modules are in a global namespace

Although modules contain classes, functions, etc., there's no containment hierarchy for module names themselves. Different modules can have functions and classes with the same name in them, but there's nothing that can contain multiple modules with the same name. This shows up as a problem when you want to write unit tests that use fake versions of some modules, for example. When faking a function or a class, one creates a new version. Modules generally have to be modified rather than replaced, since import looks up modules in the global module namespace.

PYTHONPATH is a rather inflexible way to organize modules

Organizing modules by location in the filesystem is a great way to get started, but it's not the only possible thing one might want. This deficiency has been addressed in recent Pythons via the PEP 302 import hooks. However...

PEP302 hooks help, but aren't enough by themselves

The canonical example of alternate module organization is putting them in a zip file, which Python supports via the standard import hooks now. Now you have extra problems, though. Python packages are a good way to organize modules, but they don't provide a way to enumerate their contents. To work around this, everybody looks at the filesystem layout to determine what's in a package. But if your modules aren't being loaded directly from the filesystem, this approach won't work.



The Really Bad



Modules are singletons (i.e., global mutable state)

This is the dark secret at the heart of any large-scale Python project. One can be very careful about organizing one's state into instances and so forth, but all modules are still visible and modifiable by any code at any time.

Still easy to write unreadable code via monkey-patching

It's easy and convenient to assign to module attributes any time you feel like it. The result is that any time you see "from foo import someObject", you can't ever be sure about where that object was defined unless you read all the source code in the application. Even when it's desirable to change module contents (such as for tests), it's easy to fail to do so in a way that doesn't introduce dependencies or conflicts between tests. The classic example is calling some function that initializes module globals from a config file; if one test does it, it can cause tests run after it to fail or incorrectly succeed.
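The config-file case can be reduced to a few lines of Python 3. Everything here is hypothetical: the settings module is built in memory with types.ModuleType, and the two "tests" are plain functions returning True when they would pass.

```python
import types

# Hypothetical application module with a module-level global.
settings = types.ModuleType("settings")
settings.DEBUG = False

def load_config(config):
    """Stand-in for 'initialize module globals from a config file'."""
    settings.DEBUG = config["debug"]

def test_debug_mode():
    load_config({"debug": True})
    return settings.DEBUG is True

def test_default_is_quiet():
    # Silently assumes nobody has touched settings.DEBUG.
    return settings.DEBUG is False

passes_when_run_first = test_default_is_quiet()
test_debug_mode()                                # mutates the shared module
passes_when_run_second = test_default_is_quiet()
```

The second test's outcome depends on whether the first one ran before it, which is the kind of coupling that makes test failures hard to reproduce.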

reload()

The reload function is a symptom of all the above problems. Its inspiration is obvious: loading code that's changed since the current Python process has started is an entirely sensible idea. However, Python's assumptions about how modules work make this rather difficult to do in a sensible manner. It's common to create new lists rather than modify old ones when a new version of some data is wanted. This convention is reinforced by the ease with which list comprehensions can be used to do this job. The convention encouraged by the existence of reload is exactly opposite, though: instead of creating a new module object, the old one is emptied and refilled with fresh objects. The result is that instances of classes in that module are orphaned; the class they were instantiated from can't be reached by its name. Also, it only reloads a single module; no help is provided in updating modules that depend on it, or updating its own dependencies. Figuring out which modules to reload or not reload at any given time is often very tricky. Plenty of other corner cases exist, such as reinitialization of function default arguments, and so forth. Because of all this, the standard advice on #python is that "reload will not make you happy".
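The orphaned-instance problem is easy to reproduce. This sketch uses Python 3, where the function lives at importlib.reload rather than being a builtin, and writes a throwaway module (reload_demo_mod, a made-up name) to a temporary directory.

```python
import importlib
import os
import sys
import tempfile

# Write a tiny module to disk so there is something to import and reload.
tmpdir = tempfile.mkdtemp()
with open(os.path.join(tmpdir, "reload_demo_mod.py"), "w") as f:
    f.write("class Thing(object):\n    pass\n")

sys.path.insert(0, tmpdir)
importlib.invalidate_caches()
import reload_demo_mod

old_instance = reload_demo_mod.Thing()
old_class = reload_demo_mod.Thing
module_before = reload_demo_mod

importlib.reload(reload_demo_mod)

# The module object is reused, but its contents are brand new.
same_module_object = module_before is reload_demo_mod
class_was_replaced = reload_demo_mod.Thing is not old_class
orphaned = not isinstance(old_instance, reload_demo_mod.Thing)
```

Even though the source never changed, old_instance is no longer an instance of the class now reachable as reload_demo_mod.Thing.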



What Now?


So with these problems identified in how Python handles modules, can anything be done?
Well, that's why I wrote Exocet. More about that next time.

Unicode in Python, and how to prevent it



[UPDATE 16 Aug 2011] Armin Ronacher has written a nice module called unicode-nazi that provides the Unicode warnings I discuss at the end of this article.

Though I can't use Python 3 for any of my projects, it does have a few nice things. One particular behaviour where it improves on Python 2 is forbidding implicit conversions between byte strings and Unicode strings. For example:
Python 3.1.2 (release31-maint, Sep 17 2010, 20:34:23)
[GCC 4.4.5] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> 'foo' + b'baz'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: Can't convert 'bytes' object to str implicitly

If you do this in Python 2, it invokes the default encoding to convert between bytes and unicode, leading to manifold unhappinesses. So in Python 2, the above example looks like:
Python 2.6.6 (r266:84292, Sep 15 2010, 15:52:39)
[GCC 4.4.5] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> u'foo' + b'baz'
u'foobaz'

This looks OK, but problems lurk just below the surface. The default encoding used in nearly all circumstances is ASCII, and this conversion will blow up if non-ASCII bytes are involved.
>>> u'foo' + b'\xe2\x98\x83'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)

This is only a 3, at best 3.5 on the Russell scale of API misusability. Even worse, this conversion is done for any function implemented in C that expects unicode or bytes and receives the other type. For example, unicode() converts byte strings to unicode strings, and if not given an explicit encoding argument, uses the default encoding.
>>> unicode('bob')
u'bob'
>>> unicode('\xe2\x98\x83')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)

The right way to do it is to call .decode() on the byte string, naming the encoding explicitly.


>>> '\xe2\x98\x83'.decode('utf8')
u'\u2603'
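For comparison, here is a small sketch of the same explicit decode under Python 3 (an assumption on my part; the transcript above is Python 2). The variable names are illustrative only.

```python
# UTF-8 bytes for U+2603 SNOWMAN, as they might arrive from a file or socket.
snowman_bytes = b"\xe2\x98\x83"

# Decode explicitly, naming the encoding, before treating the data as text.
snowman_text = snowman_bytes.decode("utf8")

# Python 3 refuses to mix the two types without an explicit conversion.
try:
    "foo" + snowman_bytes
    mixing_allowed = True
except TypeError:
    mixing_allowed = False

combined = "foo" + snowman_text
print(repr(combined), mixing_allowed)
```

Python 3's str type has no .decode() method at all, so the mistake described next can only happen on Python 2.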

Just be sure you don't call it on a unicode string by mistake, or you'll get very confused:
>>> u'foo'.decode('utf8')
u'foo'
>>> u'\u2603'.decode('utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/usr/lib/python2.4/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2603' in position 0: ordinal not in range(128)

Hey! That error says "encode", but we asked for it to decode!

It turns out that since the utf8 decoder expects bytes as input, but we gave it unicode... Python helpfully tries to convert the unicode characters into bytes, using the default encoding! This is why the first decode call succeeds - all the characters in it can be converted to ASCII bytes.
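The hidden step can be spelled out by hand. The sketch below is Python 3 code that imitates what Python 2 does when .decode() is called on a unicode string; decode_like_python2 is a hypothetical helper written for this illustration, not anything in the standard library.

```python
def decode_like_python2(text, encoding="utf8", default_encoding="ascii"):
    """Imitate Python 2's unicode.decode(): encode implicitly, then decode."""
    # Hidden step: the text is first turned into bytes with the default codec.
    as_bytes = text.encode(default_encoding)
    # Only then does the decode that was actually requested take place.
    return as_bytes.decode(encoding)


# Pure ASCII text survives the hidden encode step, so the call appears to work.
ascii_result = decode_like_python2("foo")

# Non-ASCII text fails in the hidden step, hence an *encode* error.
try:
    decode_like_python2("\u2603")
    error_name = None
except UnicodeError as exc:
    error_name = type(exc).__name__

print(ascii_result, error_name)
```

Written out this way, the confusing error message makes sense: the failure happens in the encode that nobody asked for, before the decode ever runs.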

Is there any hope?


Since Unicode was added to Python in 2.0, there's been a way to get Python 3's behavior. The site module in the Python standard library sets Python's default encoding on startup. If edited to use "undefined" instead of "ascii", the above examples all fail instead of converting:



>>> u'foo' + b'baz'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.6/encodings/undefined.py", line 22, in decode
    raise UnicodeError("undefined encoding")
UnicodeError: undefined encoding
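The "undefined" codec can also be poked at directly, without editing site.py. This is a small Python 3 sketch (my assumption; the transcript above is Python 2.6) that calls the codec by name, since Python 3 has no implicit conversion to trigger it.

```python
# The "undefined" codec refuses every conversion, even for plain ASCII input.
try:
    b"baz".decode("undefined")
    decode_failed = False
    failure_text = ""
except UnicodeError as exc:
    decode_failed = True
    failure_text = str(exc)

print(decode_failed, failure_text)
```

Making this codec the default turns every implicit conversion into exactly this kind of failure.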


So why can't we switch to this today? Well, for one thing, lots and lots of code depends on this implicit encoding/decoding. Even code in the standard library:

>>> import re
>>> re.compile(u'\N{SNOWMAN}+')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.6/re.py", line 190, in compile
    return _compile(pattern, flags)
  File "/usr/lib/python2.6/re.py", line 243, in _compile
    p = sre_compile.compile(pattern, flags)
  File "/usr/lib/python2.6/sre_compile.py", line 506, in compile
    p = sre_parse.parse(p, flags)
  File "/usr/lib/python2.6/sre_parse.py", line 672, in parse
    source = Tokenizer(str)
  File "/usr/lib/python2.6/sre_parse.py", line 187, in __init__
    self.__next()
  File "/usr/lib/python2.6/sre_parse.py", line 193, in __next
    if char[0] == "\\":
  File "/usr/lib/python2.6/encodings/undefined.py", line 22, in decode
    raise UnicodeError("undefined encoding")
UnicodeError: undefined encoding


Not to mention loads of existing third-party apps and libraries. So what can we do? Fortunately, Python allows registering new codecs. I've written a slight variation on the ASCII codec, ascii_with_complaints, which preserves the default Python behaviour, but also produces warnings.


>>> import re
>>> re.compile(u'\N{SNOWMAN}+')
/usr/lib/python2.6/sre_parse.py:193: UnicodeWarning: Implicit conversion of str to unicode
  if char[0] == "\\":
/usr/lib/python2.6/sre_parse.py:418: UnicodeWarning: Implicit conversion of str to unicode
  if this and this[0] not in SPECIAL_CHARS:
/usr/lib/python2.6/sre_parse.py:421: UnicodeWarning: Implicit conversion of str to unicode
  elif this == "[":
/usr/lib/python2.6/sre_parse.py:478: UnicodeWarning: Implicit conversion of str to unicode
  elif this and this[0] in REPEAT_CHARS:
/usr/lib/python2.6/sre_parse.py:480: UnicodeWarning: Implicit conversion of str to unicode
  if this == "?":
/usr/lib/python2.6/sre_parse.py:482: UnicodeWarning: Implicit conversion of str to unicode
  elif this == "*":
/usr/lib/python2.6/sre_parse.py:485: UnicodeWarning: Implicit conversion of str to unicode
  elif this == "+":
<_sre.SRE_Pattern object at 0xb76d6f80>


Hopefully this will also be useful as a tool for ferreting out the Unicode bugs waiting to break your application.
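For readers curious how a warning codec can be registered at all, here is a rough Python 3 sketch of the general technique. It is not the ascii_with_complaints source: the codec name ascii_with_complaints_sketch and the helper functions are invented for this example. Python 3 performs no implicit conversions, so the codec has to be invoked by name here rather than installed as the default encoding.

```python
import codecs
import warnings

SKETCH_NAME = "ascii_with_complaints_sketch"  # hypothetical codec name


def _complaining_decode(data, errors="strict"):
    # Warn, then defer to the ordinary ASCII decoder.
    warnings.warn("Implicit conversion of str to unicode",
                  UnicodeWarning, stacklevel=2)
    return codecs.ascii_decode(data, errors)


def _complaining_encode(text, errors="strict"):
    # Warn, then defer to the ordinary ASCII encoder.
    warnings.warn("Implicit conversion of unicode to str",
                  UnicodeWarning, stacklevel=2)
    return codecs.ascii_encode(text, errors)


def _search(name):
    # Codec search functions return a CodecInfo for names they know, else None.
    if name == SKETCH_NAME:
        return codecs.CodecInfo(name=SKETCH_NAME,
                                encode=_complaining_encode,
                                decode=_complaining_decode)
    return None


codecs.register(_search)

# Usage: the conversion still succeeds, but it leaves a warning behind.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    decoded = b"bob".decode(SKETCH_NAME)

print(decoded, [str(w.message) for w in caught])
```

The conversion behaves exactly like plain ASCII, so existing code keeps working; the only difference is the trail of warnings pointing at each place a conversion happened.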

The true blue fate with an artificial calendar

In case you were wondering, here's what I've been busy with lately.

I haven't given up on Python yet. Stay tuned, big stuff on the way.