More Than Just Parsers

PyMeta is more than just a parsing framework, it's a general pattern matching language. Here's a parser for a tiny HTML-like language:


from pymeta.grammar import OMeta
from itertools import chain

tinyHTMLGrammar = """

name ::= <letterOrDigit>+:ls => ''.join(ls)

tag ::= ('<' <spaces> <name>:n <spaces> <attribute>*:attrs '>'
<html>:c
'<' '/' <token n> <spaces> '>'
=> [n.lower(), dict(attrs), c])

html ::= (<text> | <tag>)*

text ::= (~('<') <anything>)+:t => ''.join(t)

attribute ::= <spaces> <name>:k <token '='> <quotedString>:v => (k, v)

quotedString ::= (('"' | '\''):q (~<exactly q> <anything>)*:xs <exactly q>
=> ''.join(xs))

"""
TinyHTML = OMeta.makeGrammar(tinyHTMLGrammar, globals(), name="TinyHTML")


This will parse an HTML-ish string into a tree structure.



Python 2.5.2 (r252:60911, Apr 8 2008, 21:49:41)
[GCC 4.2.3 (Ubuntu 4.2.3-2ubuntu7)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import html
>>> x = html.TinyHTML("<html><title>Yes</title><body><h1>Man, HTML is
<i>great</i>.</h1><p>How could you even <b>think</b>
otherwise?</p><img src='HIPPO.JPG'></img><a
href='http://twistedmatrix.com'>A Good Website</a></body></html>")

>>> tree = x.apply("html")

>>> import pprint

>>> pprint.pprint(tree)
[['html',
{},
[['title', {}, ['Yes']],
['body',
{},
[['h1', {}, ['Man, HTML is ', ['i', {}, ['great']], '.']],
['p',
{},
['How could you even ', ['b', {}, ['think']], ' otherwise?']],
['img', {'src': 'HIPPO.JPG'}, []],
['a', {'href': 'http://twistedmatrix.com'}, ['A Good Website']]]]]]]



Suppose now that we want to turn this tree structure back into the HTMLish format we found it in. We can write an unparser:


def formatAttrs(attrs):
"""
Format a dictionary as HTML-ish attributes.
"""
return ''.join([" %s='%s'" % (k, v) for (k, v) in attrs.iteritems()])


unparserGrammar = """
contents ::= [<tag>*:t] => ''.join(t)
tag ::= ([:name :attrs <contents>:t]
=> "<%s%s>%s</%s>" % (name, formatAttrs(attrs), t, name)
| <anything>)
"""

TinyHTMLUnparser = OMeta.makeGrammar(unparserGrammar, globals(), name="TinyHTMLUnparser")


Square brackets in a rule indicate that a list with the given contents should be matched. This way we can traverse a tree structure and produce a string.


>>> html.TinyHTMLUnparser([tree]).apply("contents")
"<html><title>Yes</title><body><h1>Man, HTML is
<i>great</i>.</h1><p>How could you even <b>think</b>
otherwise?</p><img src='HIPPO.JPG'></img>
;<a href='http://twistedmatrix.com'>A Good Website</a></body></html>"



Other sorts of transformations are possible, of course: here's an example that ignores everything but the 'src' attribute of IMG and the 'href' attribute of A:


linkExtractorGrammar = """
contents ::= [<tag>*:t] => list(chain(*t))
tag ::= ( ["a" :attrs ?('href' in attrs) <contents>:t] => ([attrs['href']] + t)
| ["img" :attrs ?('src' in attrs) <contents>:t] => ([attrs['src']] + t)
| [:name :attrs <contents>:t] => t
| :text => [])
"""

LinkExtractor = OMeta.makeGrammar(linkExtractorGrammar, globals(), name="LinkExtractor")



>>> html.LinkExtractor([tree]).apply("contents")

['HIPPO.JPG', 'http://twistedmatrix.com']


And here's an example that produces another tree, without B or I elements:


boringifierGrammar = """
contents ::= [<tag>*:t] => list(chain(*t))
tag ::= ( ["b" <anything> <contents>:t] => t
| ["i" <anything> <contents>:t] => t
| [:name :attrs <contents>:t] => [[name, attrs, t]]
| :text => [text])
"""

Boringifier = OMeta.makeGrammar(boringifierGrammar, globals(), name="Boringifier")


And once we have the new tree, we can treat it just like the original:


>>> tree2 = html.Boringifier([tree]).apply("contents")
>>> pprint.pprint(tree2)
[['html',
{},
[['title', {}, ['Yes']],
['body',
{},
[['h1', {}, ['Man, HTML is ', 'great', '.']],
['p', {}, ['How could you even ', 'think', ' otherwise?']],
['img', {'src': 'HIPPO.JPG'}, []],
['a', {'href': 'http://twistedmatrix.com'}, ['A Good Website']]]]]]]
>>> html.TinyHTMLUnparser([tree2]).apply("contents")
"<html><title>Yes</title><body><h1>Man, HTML is
great.</h1><p>How could you even think otherwise?</p><img src='HIPPO.JPG'>
</img><a href='http://twistedmatrix.com'>A Good Website</a></body>
</html>"


As you can see, the final result shows the original string, but with <b> and <i>tags removed.

This kind of tree transformation is highly useful for implementing language tools, and might form a good basis for a refactoring library. Analysis can be done on the syntax tree produced by the parser, transformation can be done by other tools, and the unparser can then turn it back into valid source code.