Allen's Weblog: More Than Just Parsers

PyMeta is more than just a parsing framework, it's a general pattern matching language. Here's a parser for a tiny HTML-like language:


from pymeta.grammar import OMeta
from itertools import chain

tinyHTMLGrammar = """

name ::= <letterOrDigit>+:ls => ''.join(ls)

tag ::= ('<' <spaces> <name>:n <spaces> <attribute>*:attrs '>'
         <html>:c
         '<' '/' <token n> <spaces> '>'
             => [n.lower(), dict(attrs), c])

html ::= (<text> | <tag>)*

text ::= (~('<') <anything>)+:t => ''.join(t)

attribute ::= <spaces> <name>:k <token '='> <quotedString>:v => (k, v)

quotedString ::= (('"' | '\''):q (~<exactly q> <anything>)*:xs <exactly q>
                     => ''.join(xs))

"""
TinyHTML = OMeta.makeGrammar(tinyHTMLGrammar, globals(), name="TinyHTML")

This will parse an HTML-ish string into a tree structure.


Python 2.5.2 (r252:60911, Apr 8 2008, 21:49:41)
[GCC 4.2.3 (Ubuntu 4.2.3-2ubuntu7)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import html
>>> x = html.TinyHTML("<html><title>Yes</title><body><h1>Man, HTML is
 <i>great</i>.</h1><p>How could you even <b>think</b> 
otherwise?</p><img src='HIPPO.JPG'></img><a 
href='http://twistedmatrix.com'>A Good Website</a></body></html>")

>>> tree = x.apply("html")

>>> import pprint

>>> pprint.pprint(tree)
[['html',
  {},
  [['title', {}, ['Yes']],
   ['body',
    {},
    [['h1', {}, ['Man, HTML is ', ['i', {}, ['great']], '.']],
     ['p',
      {},
      ['How could you even ', ['b', {}, ['think']], ' otherwise?']],
     ['img', {'src': 'HIPPO.JPG'}, []],
     ['a', {'href': 'http://twistedmatrix.com'}, ['A Good Website']]]]]]]

Suppose now that we want to turn this tree structure back into the HTMLish format we found it in. We can write an unparser:


def formatAttrs(attrs):
    """
    Format a dictionary as HTML-ish attributes.
    """
    return ''.join([" %s='%s'" % (k, v) for (k, v) in attrs.iteritems()])


unparserGrammar = """
contents ::= [<tag>*:t] => ''.join(t)
tag ::= ([:name :attrs <contents>:t]
            => "<%s%s>%s</%s>" % (name, formatAttrs(attrs), t, name)
         | <anything>)
"""

TinyHTMLUnparser = OMeta.makeGrammar(unparserGrammar, globals(), name="TinyHTMLUnparser")

Square brackets in a rule indicate that a list with the given contents should be matched. This way we can traverse a tree structure and produce a string.


>>> html.TinyHTMLUnparser([tree]).apply("contents")
"<html><title>Yes</title><body><h1>Man, HTML is 
<i>great</i>.</h1><p>How could you even <b>think</b> 
otherwise?</p><img src='HIPPO.JPG'></img>
;<a href='http://twistedmatrix.com'>A Good Website</a></body></html>"

Other sorts of transformations are possible, of course: here's an example that ignores everything but the 'src' attribute of IMG and the 'href' attribute of A:


linkExtractorGrammar = """
contents ::= [<tag>*:t] => list(chain(*t))
tag ::= ( ["a" :attrs ?('href' in attrs) <contents>:t] => ([attrs['href']] + t)
        | ["img" :attrs ?('src' in attrs) <contents>:t] => ([attrs['src']] + t)
        | [:name :attrs <contents>:t] => t
        | :text => [])
"""

LinkExtractor = OMeta.makeGrammar(linkExtractorGrammar, globals(), name="LinkExtractor")


>>> html.LinkExtractor([tree]).apply("contents")

['HIPPO.JPG', 'http://twistedmatrix.com']

And here's an example that produces another tree, without B or I elements:


boringifierGrammar = """
contents ::= [<tag>*:t] => list(chain(*t))
tag ::= ( ["b" <anything> <contents>:t] => t
        | ["i" <anything> <contents>:t] => t
        | [:name :attrs <contents>:t] => [[name, attrs, t]]
        | :text => [text])
"""

Boringifier = OMeta.makeGrammar(boringifierGrammar, globals(), name="Boringifier")

And once we have the new tree, we can treat it just like the original:


>>> tree2 = html.Boringifier([tree]).apply("contents")
>>> pprint.pprint(tree2)
[['html',
  {},
  [['title', {}, ['Yes']],
   ['body',
    {},
    [['h1', {}, ['Man, HTML is ', 'great', '.']],
     ['p', {}, ['How could you even ', 'think', ' otherwise?']],
     ['img', {'src': 'HIPPO.JPG'}, []],
     ['a', {'href': 'http://twistedmatrix.com'}, ['A Good Website']]]]]]]
>>> html.TinyHTMLUnparser([tree2]).apply("contents")
"<html><title>Yes</title><body><h1>Man, HTML is 
great.</h1><p>How could you even think otherwise?</p><img src='HIPPO.JPG'>
</img><a href='http://twistedmatrix.com'>A Good Website</a></body>
</html>"

As you can see, the final result shows the original string, but with <b> and <i>tags removed.

This kind of tree transformation is highly useful for implementing language tools, and might form a good basis for a refactoring library. Analysis can be done on the syntax tree produced by the parser, transformation can be done by other tools, and the unparser can then turn it back into valid source code.

4 comments:

Enrico Santoemma said...: Pymeta looks nice. How can one find a reference to study the grammar?
A query of PEG doesn't return a lot of material

Enrico; May 7, 2008 at 2:51 AM
kib said...: Very nice project, thanks Allen.; May 8, 2008 at 4:15 PM
Anonymous said...: Allen,

In short Pymeta rocks! Just playing around with it inspires me to build great things. Thanks so much, I'll be following the developments of Pymeta closely on this blog.; May 29, 2008 at 1:34 AM
Anonymous said...: Great package. Unfortunately a reference is missing so probably a lot of possibilities cannot be used. For example after playing for two hours I decided to stop since I was not able to use lookahead (and in the code there's the function but no description how to use it).
I'll give it a try in the future, when some documentation will come up.
Anyway thanks for this interisting stuff.; June 6, 2008 at 7:34 AM

Allen's Weblog

More Than Just Parsers

4 comments:

Ohloh Stats

Projects

Blog Archive

Labels