Allen's Weblog

The true blue fate with an artificial calendar

In case you were wondering, here's what I've been busy with lately.

I haven't given up on Python yet. Stay tuned, big stuff on the way.

Two-thirds slow, one-third amazing

My evident neglect of this site was not intentional. Moving across the country and starting a new job tend to reduce one's available time for open source work, and mine hasn't resulted in anything really worth announcing for the past year (or more). But today that changes!

Download PyMeta 0.4.0

Since the last release of Ecru I have been trying to get rid of its dependency on Python, by porting the E parser to E. In the process of doing so, I realized it was probably a bad idea to try to use a parser whose only form of error reporting was the string "Parse error". Since I'm still more familiar with Python than E, I started implementing error reporting in PyMeta. (Also, Python has a debugger.) This resulted in some significant rewrites of the internals.

So what's different in PyMeta 0.4?

Comments!

With its new space-age technology, PyMeta 0.4 will treat '#' as a comment character, just like Python. (Yes, this was rather overdue.)

Reorganized code generator

Previously, code for grammars was generated by the grammar parser calling methods on a builder object that directly emitted Python code as strings. Now the grammar parser builds a tree, which is then consumed by a code generator. If you want to generate something other than Python (or change how Python code gets generated), it should be a lot simpler now. Look at pymeta.builder.PythonWriter for the details, in particular the generate_ methods.
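To make the tree-then-generator split concrete, here is a rough sketch of the dispatch idea. It is not PyMeta's actual code: SketchWriter, the tuple-shaped tree nodes, and the emitted helper names (match_exact, in_order, first_of) are all made up for illustration. Only the generate_ prefix comes from the description above.

```python
class SketchWriter(object):
    """Hypothetical code generator: walks a grammar tree and emits source text.

    Tree nodes are assumed to be tuples of the form (tag, arg, ...).
    """

    def generate(self, node):
        # Dispatch on the node's tag to a generate_<Tag> method.
        tag, args = node[0], node[1:]
        method = getattr(self, "generate_" + tag)
        return method(*args)

    def generate_Exactly(self, literal):
        # Match one literal value. match_exact is an invented runtime helper.
        return "match_exact(%r)" % (literal,)

    def generate_And(self, children):
        # Match every child in sequence.
        return "in_order([%s])" % ", ".join(self.generate(c) for c in children)

    def generate_Or(self, children):
        # Try each child in turn and take the first that matches.
        return "first_of([%s])" % ", ".join(self.generate(c) for c in children)


tree = ("Or", [("Exactly", "a"),
               ("And", [("Exactly", "b"), ("Exactly", "c")])])
print(SketchWriter().generate(tree))
```

A generator for some other target would subclass this and override the generate_ methods, leaving the grammar parser untouched.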

Error tracking and reporting

This is the big one. Previously, PyMeta expressions returned a value or raised an error. Now each expression evaluated by the parser returns both a value and an error, even when a successful parse was found. If a parse failure occurs, an error still gets raised. Combining expressions with "|" returns the error from furthest into the input; errors tied at the same position are merged.
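Here is a minimal sketch of the "furthest error wins, ties get merged" rule, written as standalone parser combinators rather than as PyMeta internals. Everything in it (ParseFailure, join_errors, literal, sequence, choice, and the (position, expected) error shape) is an assumption made for illustration, not PyMeta's real API.

```python
class ParseFailure(Exception):
    """Raised when no parse is possible; carries the error details."""

    def __init__(self, position, expected):
        Exception.__init__(self, position, expected)
        self.position = position
        self.expected = expected


def join_errors(errors):
    """Keep the error furthest into the input, merging ties.

    Each error is a (position, set_of_expected_things) pair.
    """
    furthest = max(position for position, _ in errors)
    expected = set()
    for position, names in errors:
        if position == furthest:
            expected |= names
    return furthest, expected


def literal(text):
    def parse(source, position):
        if source.startswith(text, position):
            end = position + len(text)
            # Success still returns an error value alongside the result.
            return text, end, (end, set())
        raise ParseFailure(position, {text})
    return parse


def sequence(*parsers):
    def parse(source, position):
        values, errors = [], []
        for parser in parsers:
            # A failing element raises, so the failure simply propagates.
            value, position, error = parser(source, position)
            values.append(value)
            errors.append(error)
        return values, position, join_errors(errors)
    return parse


def choice(*parsers):
    def parse(source, position):
        errors = []
        for parser in parsers:
            try:
                value, end, error = parser(source, position)
            except ParseFailure as failure:
                errors.append((failure.position, failure.expected))
                continue
            errors.append(error)
            return value, end, join_errors(errors)
        # Every alternative failed: report the furthest failure(s).
        raise ParseFailure(*join_errors(errors))
    return parse


grammar = choice(sequence(literal("ab"), literal("c")), literal("x"))
try:
    grammar("abd", 0)
except ParseFailure as failure:
    print(failure.position, sorted(failure.expected))  # prints: 2 ['c']
```

The first alternative gets two characters in before failing, so its error is reported and the "x" alternative's failure at position 0 is dropped. If both alternatives had failed at the same position, the expected sets would have been unioned.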

So the result of this is that you can now get nicely formatted output telling you where stuff went wrong and how. Here's an example, based on the old TinyHTML parser:

When you mismatch tags, the parser notices:

Notice here that the information in ParseError is structured so that future tools can figure out stuff about what failed and how.

If there's more than one possible valid input at the point of failure, the parser will tell you:

Plans for the Future

There are a few directions I'd like to take PyMeta in the future. A really nice thing would be to have a way to generate grammars ahead of time easily, writing out a Python module. This release includes bin/generate_parser, which does a rather naive version of this. The problem is figuring out how to make grammars that are subclasses of something other than just OMetaBase.

Also, with the new code generator setup, it'd be fairly easy to generate Cython instead of Python, resulting in grammars that can be compiled as extension modules, hopefully resulting in much faster parse times. PyMeta isn't meant to be blazingly fast -- PEG parsers aren't known to be the most efficient -- but it'd be nice to squeeze all we can out of it.

Other people have asked for event based parsing and incremental output. Seems like a neat idea... I'd just have to figure out what that means. :-)

Special thanks to Marien Zwart and Cory Dodt for their contributions and encouragement for this release.

Ecru 0.2.0

It's been a bit longer for this release.

The major new additions are message buffering on eventual references and the Ref.whenResolved and Ref.whenBroken methods, as well as the __whenMoreResolved Miranda method. Along with eventual sends and the vat support from the previous release, this is enough to provide support for the when syntax sugar. The Python interface now can control the E event loop: ecru.api.iterate runs a single item in the queue of the vat exposed to Python, and ecru.api.incrementalEval queues an E expression to be run, returning a promise to Python immediately. There's an example IRC-bot REPL in doc/examples/ircbot.py.
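For anyone who hasn't met eventual references before, here is a loose Python analogy of message buffering. It is only a sketch of the idea, not Ecru's implementation or its Python interface: Vat, Promise, send and resolve are hypothetical names, and iterate merely mirrors the "run a single item in the queue" behaviour described above.

```python
from collections import deque


class Vat(object):
    """A single-threaded event queue, loosely modelled on an E vat."""

    def __init__(self):
        self.queue = deque()

    def enqueue(self, thunk):
        self.queue.append(thunk)

    def iterate(self):
        """Run one queued item; return False if there was nothing to run."""
        if not self.queue:
            return False
        self.queue.popleft()()
        return True


class Promise(object):
    """An eventual reference that buffers messages until it is resolved."""

    def __init__(self, vat):
        self.vat = vat
        self.resolved = False
        self.value = None
        self.buffer = []  # pending (method name, args, result promise)

    def send(self, name, *args):
        """Eventual send: returns a promise for the result immediately."""
        result = Promise(self.vat)
        if self.resolved:
            self._deliver(name, args, result)
        else:
            self.buffer.append((name, args, result))
        return result

    def _deliver(self, name, args, result):
        # The call itself happens in a later turn of the event loop.
        self.vat.enqueue(
            lambda: result.resolve(getattr(self.value, name)(*args)))

    def resolve(self, value):
        self.resolved = True
        self.value = value
        pending, self.buffer = self.buffer, []
        for name, args, result in pending:
            self._deliver(name, args, result)


vat = Vat()
greeting = Promise(vat)
shouted = greeting.send("upper")   # buffered: greeting has no value yet
greeting.resolve("hello")          # buffered message is now queued
while vat.iterate():
    pass
print(shouted.value)               # prints: HELLO
```

The point is that the sender never blocks: it gets a promise back right away, and the buffered message is delivered once the target resolves.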

Special shout-out to Eric Mangold (aka teratorn) for bugfixes and testing. Thanks Eric!

Ecru 0.1.3

Time for another Ecru release. This release should be a lot more stable: ctypes is no longer used for the Python interface. The major behaviour change I've added is support for Selfless objects, allowing for value equality: for example, Map objects with the same contents now compare equal. The rest of the work has been in code reorganization and preparation for concurrency support; vats are implemented now, and the REPL executes code by putting it into the vat's event queue. Eventual sends work now (i.e., "foo <- doStuff(x, y)"). Goals for the next couple releases are actual multi-vat execution, support for the 'when' expression, and portability to other OSes.

Ecru, A C Runtime For E

I'm happy to announce Ecru, a new implementation of the E language.

E is a language designed for both security and safe, efficient concurrency. Twisted borrowed the idea of Deferreds from E, where they are much better integrated into the language than in Python, owing to the syntax and library support E provides. E's design is based around capability security, even to the level of allowing mutually untrusting programs to run in the same process. Due to its consistent focus on security, it's possible to write secure programs without having to do much more than stick to good object-oriented style.

My goals with Ecru development are, initially, to develop an environment suitable for restricted execution of the type that my esteemed colleague has been looking for, allowing safe scripting of server-hosted applications. A wider goal is to provide an effective replacement for the C implementation of Python for development of network software; a successor to Twisted, as it were. Furthermore, E's semantics lend themselves to more efficient implementation than Python's. Although Ecru has received essentially no optimization work thus far, I believe it may be possible to make it faster than Python for many tasks, without being any more difficult to work with.

Right now Ecru depends on Python, since I'm using PyMeta for its parser. Ecru only implements enough of E to run the compiler; I plan to soon implement OMeta in E, so that Ecru can also be used standalone. There's a simple REPL in Python, so you can download Ecru and try it out right now. Currently I'm focusing on cleaning up the code (bootstraps are usually messy affairs) and replacing some of the standard-library object stubs I wrote in C with the versions implemented in E in use by the existing Java version.

PyMeta 0.3.0 Released

Originally, when I was implementing PyMeta, I was sure that the input stream implementation that the JavaScript version used was inefficient. Rather than having a mutable object with methods for advancing and rewinding the stream, it has immutable objects with "head" and "tail" methods, which return the value at the current position and a new object representing the next position in the stream. All that allocation couldn't be healthy.

Turns out I was wrong. I misunderstood the requirements for OMeta's input stream. Various operations require the ability to save a particular point in the stream and go back to it. To further complicate matters, arguments to rules are passed on the stream by pushing them onto the front, and rules that take arguments read them off of the stream. This is very handy for certain types of pattern matching, but it totally destroys any hope of simply implementing the input as a list and an index into it, because there has to be a way to uniquely identify values pushed onto the stream. If a value gets pushed onto the stream, is read from it, then another one is pushed on, both of them have the same predecessor, but they don't have the same stream position. It becomes more like a tree than a sequence. JS-OMeta handled this by just creating a new stream object for each argument. I didn't give up soon enough on my clever idea when initially implementing PyMeta, and it grew more complicated with each feature I implemented, involving a bunch of lists for buffering things and a complicated mark/rewind scheme.
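A rough sketch of the immutable head/tail design may make this clearer. It is a reconstruction from the description above, not the code from PyMeta or JS-OMeta; the class names InputStream and ArgInput and the memoized tail are assumptions.

```python
class InputStream(object):
    """An immutable position in some underlying sequence."""

    def __init__(self, data, position=0):
        self.data = data
        self.position = position
        self._tail = None  # cache, so repeated tail() calls share one object

    def head(self):
        if self.position >= len(self.data):
            raise EOFError("end of input")
        return self.data[self.position]

    def tail(self):
        if self._tail is None:
            self._tail = InputStream(self.data, self.position + 1)
        return self._tail


class ArgInput(object):
    """A rule argument pushed onto the front of an existing stream."""

    def __init__(self, value, parent):
        self.value = value
        self.parent = parent

    def head(self):
        return self.value

    def tail(self):
        # Reading the argument puts us back on the stream it was pushed onto.
        return self.parent


start = InputStream("abc")
saved = start                  # "saving a point" is just keeping a reference
current = start.tail().tail()
print(current.head())          # prints: c
print(saved.head())            # prints: a  (rewinding costs nothing)

first = ArgInput(1, current)
second = ArgInput(2, current)
print(first.tail() is second.tail())  # prints: True
print(first is second)                # prints: False
```

Two pushed arguments sit in front of the same underlying position yet remain distinct objects, which is exactly the case a plain list-and-index representation cannot tell apart.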

After writing a rather complicated grammar with PyMeta, I began to wonder if I could improve its speed. By this time I knew the JS version's algorithm was less complicated so I decided to try it out. It cut my grammar's unit tests' runtime from around 40 seconds to 4 seconds. Also, it fixed a bug.

Moral of the story: from now on, I'm not going to try to implement clever optimizations until I understand the original version. :)

I've released a version with this new input implementation. Get it in the usual place.

More Than Just Parsers

PyMeta is more than just a parsing framework; it's a general pattern matching language. Here's a parser for a tiny HTML-like language:


from pymeta.grammar import OMeta
from itertools import chain

tinyHTMLGrammar = """

name ::= <letterOrDigit>+:ls => ''.join(ls)

tag ::= ('<' <spaces> <name>:n <spaces> <attribute>*:attrs '>'
         <html>:c
         '<' '/' <token n> <spaces> '>'
             => [n.lower(), dict(attrs), c])

html ::= (<text> | <tag>)*

text ::= (~('<') <anything>)+:t => ''.join(t)

attribute ::= <spaces> <name>:k <token '='> <quotedString>:v => (k, v)

quotedString ::= (('"' | '\''):q (~<exactly q> <anything>)*:xs <exactly q>
                     => ''.join(xs))

"""
TinyHTML = OMeta.makeGrammar(tinyHTMLGrammar, globals(), name="TinyHTML")

This will parse an HTML-ish string into a tree structure.


Python 2.5.2 (r252:60911, Apr 8 2008, 21:49:41)
[GCC 4.2.3 (Ubuntu 4.2.3-2ubuntu7)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import html
>>> x = html.TinyHTML("<html><title>Yes</title><body><h1>Man, HTML is <i>great</i>.</h1><p>How could you even <b>think</b> otherwise?</p><img src='HIPPO.JPG'></img><a href='http://twistedmatrix.com'>A Good Website</a></body></html>")

>>> tree = x.apply("html")

>>> import pprint

>>> pprint.pprint(tree)
[['html',
  {},
  [['title', {}, ['Yes']],
   ['body',
    {},
    [['h1', {}, ['Man, HTML is ', ['i', {}, ['great']], '.']],
     ['p',
      {},
      ['How could you even ', ['b', {}, ['think']], ' otherwise?']],
     ['img', {'src': 'HIPPO.JPG'}, []],
     ['a', {'href': 'http://twistedmatrix.com'}, ['A Good Website']]]]]]]

Suppose now that we want to turn this tree structure back into the HTML-ish format we found it in. We can write an unparser:


def formatAttrs(attrs):
    """
    Format a dictionary as HTML-ish attributes.
    """
    return ''.join([" %s='%s'" % (k, v) for (k, v) in attrs.iteritems()])


unparserGrammar = """
contents ::= [<tag>*:t] => ''.join(t)
tag ::= ([:name :attrs <contents>:t]
            => "<%s%s>%s</%s>" % (name, formatAttrs(attrs), t, name)
         | <anything>)
"""

TinyHTMLUnparser = OMeta.makeGrammar(unparserGrammar, globals(), name="TinyHTMLUnparser")

Square brackets in a rule indicate that a list with the given contents should be matched. This way we can traverse a tree structure and produce a string.


>>> html.TinyHTMLUnparser([tree]).apply("contents")
"<html><title>Yes</title><body><h1>Man, HTML is <i>great</i>.</h1><p>How could you even <b>think</b> otherwise?</p><img src='HIPPO.JPG'></img><a href='http://twistedmatrix.com'>A Good Website</a></body></html>"
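If the list-matching syntax is unfamiliar, it may help to see roughly what the unparser grammar does when written out by hand. This is a plain-Python approximation, not what PyMeta generates; format_attrs and unparse are illustrative names, and the sketch assumes every list node has exactly the [name, attrs, children] shape.

```python
def format_attrs(attrs):
    """Format a dictionary as HTML-ish attributes."""
    return "".join(" %s='%s'" % (k, v) for (k, v) in attrs.items())


def unparse(nodes):
    """Turn a list of text strings and [name, attrs, children] nodes into text."""
    parts = []
    for node in nodes:
        if isinstance(node, list):
            # Corresponds to the [:name :attrs <contents>:t] alternative.
            name, attrs, children = node
            parts.append("<%s%s>%s</%s>" % (
                name, format_attrs(attrs), unparse(children), name))
        else:
            # Corresponds to the <anything> alternative: plain text.
            parts.append(node)
    return "".join(parts)


sample = [["p", {}, ["How ", ["b", {}, ["think"]], "?"]],
          ["img", {"src": "HIPPO.JPG"}, []]]
print(unparse(sample))
# prints: <p>How <b>think</b>?</p><img src='HIPPO.JPG'></img>
```

The grammar version says the same thing declaratively: the shape of the data is the pattern, and the recursion comes from one rule invoking another.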

Other sorts of transformations are possible, of course: here's an example that ignores everything but the 'src' attribute of IMG and the 'href' attribute of A:


linkExtractorGrammar = """
contents ::= [<tag>*:t] => list(chain(*t))
tag ::= ( ["a" :attrs ?('href' in attrs) <contents>:t] => ([attrs['href']] + t)
        | ["img" :attrs ?('src' in attrs) <contents>:t] => ([attrs['src']] + t)
        | [:name :attrs <contents>:t] => t
        | :text => [])
"""

LinkExtractor = OMeta.makeGrammar(linkExtractorGrammar, globals(), name="LinkExtractor")


>>> html.LinkExtractor([tree]).apply("contents")

['HIPPO.JPG', 'http://twistedmatrix.com']
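The ?(...) guards and the ordering of the alternatives do most of the work in that grammar. A hand-written approximation makes the fall-through explicit; extract_links is a hypothetical helper, not something PyMeta provides, and it assumes the same [name, attrs, children] node shape as the parser output above.

```python
def extract_links(nodes):
    """Collect href values from 'a' nodes and src values from 'img' nodes."""
    links = []
    for node in nodes:
        if not isinstance(node, list):
            continue  # the ":text => []" alternative: text contributes nothing
        name, attrs, children = node
        if name == "a" and "href" in attrs:
            links.append(attrs["href"])
        elif name == "img" and "src" in attrs:
            links.append(attrs["src"])
        # Any other tag, or an a/img whose guard failed, falls through to
        # here, just like the [:name :attrs <contents>:t] alternative.
        links.extend(extract_links(children))
    return links


sample = [["body", {}, [
    ["p", {}, ["How could you even ", ["b", {}, ["think"]], " otherwise?"]],
    ["img", {"src": "HIPPO.JPG"}, []],
    ["a", {"href": "http://twistedmatrix.com"}, ["A Good Website"]],
    ["a", {}, ["no href here"]],
]]]
print(extract_links(sample))
# prints: ['HIPPO.JPG', 'http://twistedmatrix.com']
```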

And here's an example that produces another tree, without B or I elements:


boringifierGrammar = """
contents ::= [<tag>*:t] => list(chain(*t))
tag ::= ( ["b" <anything> <contents>:t] => t
        | ["i" <anything> <contents>:t] => t
        | [:name :attrs <contents>:t] => [[name, attrs, t]]
        | :text => [text])
"""

Boringifier = OMeta.makeGrammar(boringifierGrammar, globals(), name="Boringifier")

And once we have the new tree, we can treat it just like the original:


>>> tree2 = html.Boringifier([tree]).apply("contents")
>>> pprint.pprint(tree2)
[['html',
  {},
  [['title', {}, ['Yes']],
   ['body',
    {},
    [['h1', {}, ['Man, HTML is ', 'great', '.']],
     ['p', {}, ['How could you even ', 'think', ' otherwise?']],
     ['img', {'src': 'HIPPO.JPG'}, []],
     ['a', {'href': 'http://twistedmatrix.com'}, ['A Good Website']]]]]]]
>>> html.TinyHTMLUnparser([tree2]).apply("contents")
"<html><title>Yes</title><body><h1>Man, HTML is great.</h1><p>How could you even think otherwise?</p><img src='HIPPO.JPG'></img><a href='http://twistedmatrix.com'>A Good Website</a></body></html>"

As you can see, the final result shows the original string, but with the <b> and <i> tags removed.

This kind of tree transformation is highly useful for implementing language tools, and might form a good basis for a refactoring library. Analysis can be done on the syntax tree produced by the parser, transformation can be done by other tools, and the unparser can then turn it back into valid source code.