Tuesday, March 25, 2008

Easy parsing with pyparsing

If you haven't used Paul McGuire's pyparsing module yet, you've been missing out on a great tool. Whenever you hit a wall trying to parse text with regular expressions or string operations, 'think pyparsing'.

I had the need to parse a load balancer configuration file and save certain values in a database. Most of the stuff I needed was fairly easily obtainable with regular expressions or Python string operations. However, I was stumped when I encountered a line such as:

bind http "Customer Server 1" http "Customer Server 2" http

This line 'binds' a 'virtual server' port to one or more 'real servers' and their ports. (I'm using this particular load balancer's jargon here, but the concepts are the same for all load balancers.)

The syntax is 'bind' followed by a word denoting the virtual server port, followed by one or more pairs of real server names and ports. The kicker is that the real server names can be either a single word containing no whitespace, or multiple words enclosed in double quotes.

Splitting the line by spaces or double quotes is not the solution in this case. I started out by rolling my own little algorithm and keeping track of where I was inside the string, then I realized that I was actually writing my own parser at that point. Time to reach for pyparsing.
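To make the problem concrete, here is a small sketch of what the two naive splits do to the sample line. It uses only built-in string methods; the variable names are just for illustration.

```python
line = 'bind http "Customer Server 1" http "Customer Server 2" http'

# Splitting on whitespace breaks each quoted name into separate words.
by_space = line.split()
print(by_space)

# Splitting on double quotes keeps the quoted names intact, but the first
# chunk still fuses 'bind' with the virtual port, the ports keep stray
# spaces, and an unquoted server name would stay glued to its port.
by_quote = line.split('"')
print(by_quote)
```

Either way, more bookkeeping is needed afterwards to recover the server/port pairs.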

I won't go into the details of how to use pyparsing, since there is great documentation available (see Paul's PyCon06 presentation, the examples on the pyparsing site, and also Paul's O'Reilly Shortcut book). Basically you need to define your grammar for the expression you need to parse, then translate it into pyparsing-specific constructs. Because pyparsing's API is so intuitive and powerful, the translation process is straightforward.

Here's how I ended up implementing my pyparsing grammar:

from pyparsing import *

def parse_bind_line(line):
    quoted_real_server = dblQuotedString.setParseAction(removeQuotes)
    real_server = Word(alphas, printables) | quoted_real_server
    port = Word(alphanums)
    real_server_port = Group(real_server + port)
    bind_expr = Suppress(Literal("bind")) + \
                port + \
                OneOrMore(real_server_port)
    return bind_expr.parseString(line)

That's all there is to it. You need to read it from the bottom up to see how the expression gets decomposed into elements, and elements get decomposed into sub-elements.

I'll explain each line, starting with the last one before the return:

bind_expr = Suppress(Literal("bind")) + \
            port + \
            OneOrMore(real_server_port)

A bind expression starts with the literal "bind", followed by a port, followed by one or more real server/port pairs. That's pretty much what the line above actually says, isn't it? The Suppress construct tells pyparsing that we're not interested in returning the literal "bind" in the final token list.


real_server_port = Group(real_server + port)

A real server/port pair is simply a real server name followed by a port. The Group construct tells pyparsing that we want to group these 2 tokens in a list inside the final token list.


port = Word(alphanums)

A port is a word composed of alphanumeric characters. In pyparsing, a Word is a sequence of characters drawn from a given set, so it contains no whitespace unless you explicitly allow it. The 'alphanums' variable is a string constant provided by pyparsing that contains all the alphanumeric characters.


real_server = Word(alphas, printables) | quoted_real_server

A real server is either a single word, or an expression in quotes. Note that we can declare a pyparsing Word with 2 arguments; the 1st argument specifies the allowed characters for the initial character of the word, whereas the 2nd argument specifies the allowed characters for the body of the word. In this case, we're saying that we want a real server name to start with an alphabetical character, but other than that it can contain any printable character.
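Here is a small sketch of the two-argument Word in isolation. The sample strings are made up for illustration, and the only assumption is that pyparsing is installed.

```python
from pyparsing import ParseException, Word, alphas, printables

# First argument: characters allowed in the first position.
# Second argument: characters allowed in the rest of the word.
name = Word(alphas, printables)

# Starts with a letter, so the whole run of printable characters matches
# and the word ends at the first whitespace.
matched = name.parseString("cust1-server1.site.com http")
print(matched.asList())

# Starts with a digit, so the word does not match at all.
try:
    name.parseString("1st-server http")
    rejected = False
except ParseException:
    rejected = True
print(rejected)
```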


quoted_real_server = dblQuotedString.setParseAction(removeQuotes)

Here is where you can glimpse the power of pyparsing. With this single statement we're parsing a sequence of words enclosed in double quotes, and we're saying that we're not interested in the quotes. There's also a sglQuotedString expression for words enclosed in single quotes. Thanks to Paul for bringing this to my attention. My clumsy attempt at manually declaring a sequence of words enclosed in double quotes ran something like this:


no_quote_word = Word(alphanums+"-.")
quoted_real_server = Suppress(Literal("\"")) + \
                     OneOrMore(no_quote_word) + \
                     Suppress(Literal("\""))
quoted_real_server.setParseAction(lambda tokens: " ".join(tokens))

The only useful thing you can take away from this mumbo-jumbo is that you can associate an action with each token. When pyparsing encounters that token, it applies the action (a function or class) you specified to it. This is useful for validating your tokens, for example checking that a token is a valid date. Very powerful stuff.
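As a rough sketch of the date validation idea, the parse action below converts a matched token into a date object and rejects strings that are not real calendar dates. The to_date and iso_date names and the YYYY-MM-DD format are assumptions made for this example, not something from my load balancer code.

```python
from datetime import datetime
from pyparsing import ParseException, Word, nums

def to_date(s, loc, tokens):
    """Convert the matched text to a date, or fail the match if it is not a real date."""
    try:
        return datetime.strptime(tokens[0], "%Y-%m-%d").date()
    except ValueError as err:
        # Raising ParseException from a parse action makes the match fail.
        raise ParseException(s, loc, str(err))

# Hypothetical token: digits and dashes, validated by the parse action.
iso_date = Word(nums + "-").setParseAction(to_date)

good = iso_date.parseString("2008-03-25")[0]
print(good)

try:
    iso_date.parseString("2008-02-30")  # February 30th does not exist
    bad_rejected = False
except ParseException:
    bad_rejected = True
print(bad_rejected)
```

The value returned by the action replaces the original token, so the caller gets a date object instead of a string.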

Now it's time to test my function on a few strings:

if __name__ == "__main__":
    tests = """\
bind http "Customer Server 1" http "Customer Server 2" http
bind http "Customer Server - 11" 81 "Customer Server 12" 82
bind http www.mywebsite.com-server1 http www.mywebsite.com-server2 http
bind ssl www.mywebsite.com-server1 ssl www.mywebsite.com-server2 ssl
bind http TEST-server http
bind http MY-cluster-web11 83 MY-cluster-web-12 83
bind http cust1-server1.site.com http cust1-server2.site.com http
""".splitlines()

    for t in tests:
        print(parse_bind_line(t))


Running the code above produces this output:


$ ./parse_bind.py
['http', ['Customer Server 1', 'http'], ['Customer Server 2', 'http']]
['http', ['Customer Server - 11', '81'], ['Customer Server 12', '82']]
['http', ['www.mywebsite.com-server1', 'http'], ['www.mywebsite.com-server2', 'http']]
['ssl', ['www.mywebsite.com-server1', 'ssl'], ['www.mywebsite.com-server2', 'ssl']]
['http', ['TEST-server', 'http']]
['http', ['MY-cluster-web11', '83'], ['MY-cluster-web-12', '83']]
['http', ['cust1-server1.site.com', 'http'], ['cust1-server2.site.com', 'http']]

From here, I was able to quickly identify for a given virtual server everything I needed: a virtual server port, and all the real server/port pairs associated with it. Inserting all this into a database was just another step. The hard work had already been done by pyparsing.
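A minimal sketch of that database step might look like the following. The in-memory SQLite database, the bindings table and its column names are all assumptions for illustration; the input is just nested lists shaped like the parser output above.

```python
import sqlite3

# Nested lists shaped like the parser output shown above.
parsed_lines = [
    ["http", ["Customer Server 1", "http"], ["Customer Server 2", "http"]],
    ["http", ["MY-cluster-web11", "83"], ["MY-cluster-web-12", "83"]],
]

conn = sqlite3.connect(":memory:")
# Hypothetical table: one row per virtual port / real server / real port.
conn.execute(
    "CREATE TABLE bindings (virtual_port TEXT, real_server TEXT, real_port TEXT)"
)

for tokens in parsed_lines:
    # First token is the virtual port, the rest are [server, port] pairs.
    virtual_port, pairs = tokens[0], tokens[1:]
    conn.executemany(
        "INSERT INTO bindings VALUES (?, ?, ?)",
        [(virtual_port, server, port) for server, port in pairs],
    )
conn.commit()

rows = conn.execute(
    "SELECT virtual_port, real_server, real_port FROM bindings ORDER BY rowid"
).fetchall()
print(rows)
```

Because the pairs are already grouped, each one maps directly onto a row.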

Once more, kudos to Paul McGuire for creating such a useful and fun tool.

5 comments:

Anonymous said...

While I don't doubt pyparsing has plenty of applications, your specific problem is probably more easily solved with shlex.split().

Grig Gheorghiu said...

Anonymous -- thanks for the shlex tip. I didn't know of this module. I googled it and I saw that it was also covered by Doug Hellmann in his PyMOTW series: http://www.oreillynet.com/onlamp/blog/2007/10/pymotw_shlex.html

Nice -- and it would have solved my particular problem.

But I don't regret playing with pyparsing at all, especially since I have other parsing scenarios which are more complicated than the one I described.
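For reference, a minimal sketch of the shlex approach could look like this. The parse_bind_with_shlex name and the basic sanity check are assumptions for illustration; only shlex.split() comes from the standard library.

```python
import shlex

def parse_bind_with_shlex(line):
    """Split a bind line with shell-style quoting and pair up servers and ports."""
    tokens = shlex.split(line)
    # Expect 'bind', a virtual port, then an even number of server/port tokens.
    if len(tokens) < 4 or tokens[0] != "bind" or len(tokens) % 2 != 0:
        raise ValueError("not a valid bind line: %r" % line)
    virtual_port = tokens[1]
    rest = tokens[2:]
    # The grouping that pyparsing's Group does has to be done by hand here.
    pairs = [list(pair) for pair in zip(rest[0::2], rest[1::2])]
    return [virtual_port] + pairs

result = parse_bind_with_shlex(
    'bind http "Customer Server - 11" 81 "Customer Server 12" 82'
)
print(result)
```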

Pedahzur said...

I'll agree with you on the utility of pyparsing. I recently had to parse dumps from an MS SQL database, and pyparsing was great. Naming sections of parsed code makes using the results so easy, because then you can get at it via a dict-like interface. And many thanks to Paul McGuire for his great help on the pyparsing list.
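As a rough illustration of the named-results idea applied to the bind grammar from the post, the sketch below attaches names to parts of the expression. The names "virtual_port", "server" and "port" and the sample line are assumptions chosen for this example.

```python
from pyparsing import (Group, OneOrMore, Suppress, Word, alphanums, alphas,
                       dblQuotedString, printables, removeQuotes)

port = Word(alphanums)
# copy() keeps the shared dblQuotedString expression itself unchanged.
quoted = dblQuotedString.copy().setParseAction(removeQuotes)
real_server = Word(alphas, printables) | quoted

# Calling an expression with a string attaches a results name to a copy of it.
pair = Group(real_server("server") + port("port"))
bind_expr = Suppress("bind") + port("virtual_port") + OneOrMore(pair)

result = bind_expr.parseString('bind http "Customer Server 1" 81 web2.site.com 82')

# Named tokens can be looked up like dictionary keys.
virtual_port = result["virtual_port"]
servers = [(p["server"], p["port"]) for p in result[1:]]
print(virtual_port)
print(servers)
```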

Anonymous said...

I've seen a lot of praise for pyparsing and I'm sure it deserves that. Here is, however, my two cents' worth without pyparsing:

s='bind http "Customer Server 1" http "Customer Server 2" http'
print(s.replace(' "', '|').replace('" ', '|').split('|'))
#prints: ['bind http', 'Customer Server 1', 'http', 'Customer Server 2', 'http']

Grig Gheorghiu said...

Jussi -- thanks for your solution. As I said in a comment above, I still need pyparsing for other types of parsing. The example I gave was maybe a bit simplistic. What I also like about pyparsing is that it can group tokens together for easy processing. So the real server and its port are grouped together in a list in my case. With your solution, I'd have to manually add that step.

Grig
