Modules in Python: the Good, the Bad, and the Ugly

I've spent a lot of time living with Python's module system, both in my own
work and in helping people on Freenode's #python channel. A lot of Python's
power comes from its module system; however, it could be better. It can be
hard to think about how modules could be done differently, since it's so
central to the design of Python software, but it's worth the effort. Here's
some stuff I've been thinking about.



The Good




No global namespace for objects

This is the benefit of module systems in general, and it's a big one. It's what I miss the most when writing in C, for example (and, until very recently, JavaScript!). Being able to treat a source file as an actual container for code, instead of an arbitrary pile of functions and variables with no structure, makes code much easier to comprehend and navigate.

Python modules are easy to write and to import

In particular, I'm comparing this to Scheme 48 and ML, which both have very well-designed and powerful module systems but are rather confusing to the newcomer, because they require a good bit of up-front knowledge to construct a module that's useful to anyone else. In Python, you just stick some code in a file, and all the names in it are importable. My earliest Python memory is of joining #python and asking, "How do I import some code I wrote in one file into another?" I was told, "For a file named foo.py, use 'import foo'." My reaction was "Really? That's all?" Providing a low barrier to entry for creating and using modules is an extremely powerful advantage of Python.
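
A minimal sketch of what that low barrier looks like (the file and function names are invented for illustration):

    # foo.py -- any Python source file is already a module.
    def greet(name):
        return "Hello, %s!" % (name,)

    # elsewhere.py -- no declarations, manifests, or exports needed.
    import foo
    print(foo.greet("world"))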



The Bad



Modules are in a global namespace

Although modules contain classes, functions, etc., there's no containment hierarchy for module names themselves. Different modules can have functions and classes with the same name in them, but there's nothing that can contain multiple modules with the same name. This shows up as a problem when you want to write unit tests that use fake versions of some modules, for example. When faking a function or a class, one creates a new version. Modules generally have to be modified rather than replaced, since import looks up modules in the global module namespace.
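
To make that concrete, here's a minimal sketch of the usual workaround: swapping a fake into sys.modules, the single global table where import looks everything up (smtplib is just an example; any module behaves the same way):

    import sys, types

    # Build a throwaway fake standing in for the real smtplib.
    fake = types.ModuleType('smtplib')
    fake.SMTP = lambda host, port=25: ('pretend connection to', host, port)

    # There is exactly one table of modules for the whole process, so
    # installing the fake affects every subsequent 'import smtplib'.
    original = sys.modules.get('smtplib')
    sys.modules['smtplib'] = fake
    try:
        import smtplib              # the code under test gets the fake too
        print(smtplib.SMTP('localhost'))
    finally:
        # The global table has to be carefully restored, or the fake
        # leaks into the rest of the program.
        if original is None:
            del sys.modules['smtplib']
        else:
            sys.modules['smtplib'] = original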

PYTHONPATH is a rather inflexible way to organize modules

Organizing modules by location in the filesystem is a great way to get started, but it's not the only thing one might want.
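
Concretely, PYTHONPATH just seeds sys.path, a flat, ordered list of filesystem locations to search (the directory below is hypothetical):

    import sys

    # PYTHONPATH entries are inserted near the front of sys.path, a
    # plain list of places to look, in order, for a matching file or
    # directory.
    print(sys.path)

    # The only built-in way to make new code importable is to point at
    # another spot on disk:
    sys.path.insert(0, '/opt/myapp/lib')
    # 'import mymodule' would now work if /opt/myapp/lib/mymodule.py exists.

This deficiency has been addressed in recent Pythons via the PEP 302 import hooks. However...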

PEP 302 hooks help, but aren't enough by themselves

The canonical example of alternate module organization is putting modules in a zip file, which Python now supports via the standard import hooks. But this brings extra problems. Python packages are a good way to organize modules, but they don't provide a way to enumerate their contents. To work around this, everybody looks at the filesystem layout to determine what's in a package. But if your modules aren't being loaded directly from the filesystem, that approach won't work.
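
A small self-contained sketch of the problem, using only the stdlib (the zip and package names are invented): build a zip holding a package, import from it, and watch the filesystem-based enumeration trick fail.

    import os, sys, zipfile

    # Build a zip containing a package with two submodules.
    z = zipfile.ZipFile('bundle.zip', 'w')
    z.writestr('mypkg/__init__.py', '')
    z.writestr('mypkg/alpha.py', 'x = 1')
    z.writestr('mypkg/beta.py', 'x = 2')
    z.close()

    sys.path.insert(0, 'bundle.zip')
    import mypkg                      # fine: zipimport handles this

    # The usual "what's in this package?" trick:
    try:
        print(os.listdir(os.path.dirname(mypkg.__file__)))
    except OSError:
        # mypkg.__file__ points *inside* the zip, not at a real
        # directory, so the filesystem approach falls over.
        print('not a real directory: ' + mypkg.__file__)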



The Ugly



Modules are singletons (i.e., global mutable state)

This is the dark secret at the heart of any large-scale Python project. One can be very careful about organizing one's state into instances and so forth, but all modules are still visible and modifiable by any code at any time.
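
Two lines are enough to demonstrate it (mutating a stdlib module purely for illustration):

    import string
    import string as other_view

    # Both names refer to the same singleton module object, so a change
    # made anywhere is instantly visible everywhere.
    string.digits = 'surprise!'
    print(other_view.digits)          # prints 'surprise!'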

Still easy to write unreadable code via monkey-patching

It's easy and convenient to assign to module attributes any time you feel like it. The result is that any time you see "from foo import someObject", you can't ever be sure where that object was defined unless you read all the source code in the application. Even when it's desirable to change module contents (such as for tests), it's hard to do so without introducing dependencies or conflicts between tests. The classic example is calling some function that initializes module globals from a config file; if one test does it, it can cause tests run after it to fail or to succeed incorrectly.
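
A self-contained sketch of that classic example (the settings module and its contents are invented; a real version would read an actual config file):

    import sys, types

    # A stand-in for a hypothetical 'settings' module that fills in
    # its globals from a config file.
    settings = types.ModuleType('settings')
    settings.database_url = None
    settings.load = lambda url: setattr(settings, 'database_url', url)
    sys.modules['settings'] = settings

    def test_a():
        import settings
        settings.load('sqlite:///tmp/test.db')   # mutates the shared module

    def test_b():
        import settings
        # Passes when run alone, fails after test_a: classic test coupling.
        assert settings.database_url is None, 'polluted by test_a!'

    test_a()
    test_b()    # AssertionError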

reload()

The reload function is a symptom of all the above problems. Its inspiration is obvious: loading code that has changed since the current Python process started is an entirely sensible idea. However, Python's assumptions about how modules work make this rather difficult to do in a sensible manner.

It's common in Python to create new lists rather than modify old ones when a new version of some data is wanted, a convention reinforced by the ease with which list comprehensions do that job. The convention encouraged by the existence of reload is exactly the opposite: instead of creating a new module object, the old one is emptied and refilled with fresh objects. The result is that instances of classes in that module are orphaned; the class they were instantiated from can no longer be reached by its name.

Also, reload only reloads a single module; no help is provided in updating modules that depend on it, or in updating its own dependencies, and figuring out which modules to reload or not reload at any given time is often very tricky. Plenty of other corner cases exist, such as reinitialization of function default arguments. Because of all this, the standard advice on #python is that "reload will not make you happy".
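
Here's a sketch of the orphaned-instance problem (reload is a builtin in Python 2; in Python 3 it moved to importlib):

    import time

    # Write a throwaway module to disk so there's something to reload.
    with open('thing.py', 'w') as f:
        f.write('class Thing(object): pass\n')
    import thing
    t = thing.Thing()

    # Change the source, then reload the module object in place.
    time.sleep(1)    # make sure the new source gets a fresh timestamp
    with open('thing.py', 'w') as f:
        f.write('class Thing(object):\n    def shiny(self): return 42\n')
    try:
        reload(thing)               # Python 2 builtin
    except NameError:
        import importlib
        importlib.reload(thing)     # Python 3

    print(isinstance(t, thing.Thing))   # False: t's class is the *old*
                                        # Thing, now unreachable by name
    t.shiny()                           # AttributeError: the old class never had it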



What Now?


So, having identified these problems with how Python handles modules, can anything be done?
Well, that's why I wrote Exocet. More about that next time.

6 comments:

Maciej Fijalkowski said...

PEP302 doesn't come without awful things attached. For starters it introduces a couple of new global caches (path_cache and meta_cache maybe?). Also implementors of it, like zipimporter, introduce new layers of caches.

This is all getting ugly pretty quickly and faking the whole set of different caches is slowly getting into a nightmare.

My thinking was along the lines to be able to import the module with a context. This context would (or would not) contain globals, builtins, caches etc.

Allen Short said...

Maciej:
PEP 302 does introduce three new global objects (sys.meta_path, sys.path_hooks, and sys.path_importer_cache). path_importer_cache caches the results of calling the objects in path_hooks, which get invoked to possibly handle an entry on sys.path specially. I haven't had problems with them yet... but that's probably because I haven't tested Exocet with code that modifies path_hooks. :)

So yes -- it's even worse than I said!

Virgil Dupras said...

About the "Still easy to write unreadable code via monkey-patching": Monkey patching should be used only for unit testing. I don't think that other uses of monkey patching are widespread in the Python world (unlike the Ruby world... These people are real ninjas/rockstars/scuba-divers :) ) so when you use a module, you can usually expect sane behavior from it.

Are you saying that Python shouldn't have the ability to monkey patch easily? Then writing unit tests would become hell.

Allen Short said...

Virgil:
I agree about what should be done. That doesn't often affect what people actually do.

And yes, I do think there's a better way for testing than monkey-patching.

Kamil Kisiel said...

It's not easy or obvious, but you can actually get a list of all the modules and subpackages in a package using functions from pkgutil. I don't mean to imply that this is a great system, but it is certainly possible in the current system.

I think your criticisms are spot on in general.

Matth said...

If you ever feel too annoyed by Python modules, consider yourself lucky that, unlike us Rubyists, you at least *have* a relatively sane and flexible namespacing and import system -- as opposed to a clumsy repurposing of the class/mixin system which has unfortunate limitations and side-effects when you try to use it to import between namespaces.