Text parsing with python and PLY
LVEE 2014
Formal language parsing is a common problem in software development. Standardized formats with already available parsing libraries, such as XML and JSON, have simplified the problem, but did not completely remove it. Python is widely used as a general purpose programming language as well as one for rapid prototyping, so Python programmers face this task as well.
There is a number of parser generators for Python, which use different algorithms and APIs. In this talk we discuss PLY that is a pure Python implementation of classic UNIX tools: lex scanner generator, and YACC parser generator. It is maintained by David Beazley and distributed under thee-clause BSD license.
Lex and YACC are there for decades and have been reimplemented not once. This makes PLY easy to learn for UNIX programmers and relatively easy to convert the Python code into C/C++ if needed.
Parsing basics
There are three typical approaches to make a parser:
- write an ad hoc parser manually,
- use a parser generator,
- write an advanced parsing algorithm manually.
Ad hoc parsers are suitable for simple grammars, but tend to quickly become hard to maintain and modify. Coding an advanced parsing algorithm like an LL parser is usually redundant and only feasible for very complex grammars or in case of special requirements. For the majority of grammars a parser produced with some parser generator is usually the best option.
Most parser implementations do not work directly with the character stream but use a token stream produced by a scanner. Using a separate scanner (also known as lexer) to break the stream into tokens simplifies the task as it allows to write grammar rules in terms of token names rather than literal strings, e.g. all arithmetic operators can be referred to as OPSIGN token in a rule for arithmetic expressions. Scanner code is usually generated automatically from a set of regular expressions as well.
The most common algorithm used by parser generators is LALR (1) that reads tokens until a sequence matching the entire grammar rule is found.
Using PLY
PLY consists of two modules, ply.lex and ply.yacc. Actions for tokens and grammar rules are defined as functions, while token regular expressions and grammar rules are defined in function docstrings. Tokens that need no actions can be defined as variables.
A simple parser for breaking into parts a line consisting of letters and digits groups separated by semicolon will look like:
import re import ply.lex as lex import ply.yacc as yacc
## Lexer part tokens = ( 'LETTER', 'DIGIT', 'SEMI' )
t_LETTER = r'[a-z]'
def t_DIGIT(t): r'[0-9]' t.value = int(t.value) return t
t_SEMI = r';' t_ignore = r' '
lexer = lex.lex(debug=1)
## Parser part start = 'start'
def p_letter_digit_pair(p): ''' pair : LETTER DIGIT SEMI ''' p[0] = (p[1], p[2])
def p_pair_group(p): ''' pair_group : pair_group pair | pair ''' print p[0], p[1] p[0] = [p[1]] if len(p) == 2 else p[1] + [p[2]]
start = 'pair_group'
parser = yacc.yacc(debug=1)
## Test it s = "a 0; b 1; c 2;" print(parser.parse(s))
References
A more elaborate example discussed in the conference talk can be found at https://github.com/dmbaturin/ply-example
Abstract licensed under Creative Commons Attribution-ShareAlike 3.0 license
Back