Text parsing with python and PLY

Даниил Батурин, Томск, Russia

LVEE 2014

Language parsing is a common software development problem. Python is widely used in both production software development and rapid prototyping, and a number of lexer and parser generators were written for it. In this talk we discuss using one of them, PLY, which is a pure python implementation of classic lex and YACC tools, for parsing a made-up configuration file grammar.

Formal language parsing is a common problem in software development. Standardized formats with already available parsing libraries, such as XML and JSON, have simplified the problem, but did not completely remove it. Python is widely used as a general purpose programming language as well as one for rapid prototyping, so Python programmers face this task as well.

There is a number of parser generators for Python, which use different algorithms and APIs. In this talk we discuss PLY that is a pure Python implementation of classic UNIX tools: lex scanner generator, and YACC parser generator. It is maintained by David Beazley and distributed under thee-clause BSD license.

Lex and YACC are there for decades and have been reimplemented not once. This makes PLY easy to learn for UNIX programmers and relatively easy to convert the Python code into C/C++ if needed.

Parsing basics

There are three typical approaches to make a parser:

write an ad hoc parser manually,
use a parser generator,
write an advanced parsing algorithm manually.

Ad hoc parsers are suitable for simple grammars, but tend to quickly become hard to maintain and modify. Coding an advanced parsing algorithm like an LL parser is usually redundant and only feasible for very complex grammars or in case of special requirements. For the majority of grammars a parser produced with some parser generator is usually the best option.

Most parser implementations do not work directly with the character stream but use a token stream produced by a scanner. Using a separate scanner (also known as lexer) to break the stream into tokens simplifies the task as it allows to write grammar rules in terms of token names rather than literal strings, e.g. all arithmetic operators can be referred to as OPSIGN token in a rule for arithmetic expressions. Scanner code is usually generated automatically from a set of regular expressions as well.

The most common algorithm used by parser generators is LALR (1) that reads tokens until a sequence matching the entire grammar rule is found.

Using PLY

PLY consists of two modules, ply.lex and ply.yacc. Actions for tokens and grammar rules are defined as functions, while token regular expressions and grammar rules are defined in function docstrings. Tokens that need no actions can be defined as variables.

A simple parser for breaking into parts a line consisting of letters and digits groups separated by semicolon will look like:

import re
import ply.lex as lex
import ply.yacc as yacc

## Lexer part
tokens = ( 'LETTER', 'DIGIT', 'SEMI' )

t_LETTER = r'[a-z]'

def t_DIGIT(t):
    r'[0-9]'
    t.value = int(t.value)
    return t

t_SEMI = r';'
t_ignore = r' '

lexer = lex.lex(debug=1)

## Parser part
start = 'start'

def p_letter_digit_pair(p):
    ''' pair : LETTER DIGIT SEMI '''
    p[0] = (p[1], p[2])

def p_pair_group(p):
    ''' pair_group : pair_group pair
                   | pair
    '''
    print p[0], p[1]
    p[0] = [p[1]] if len(p) == 2 else p[1] + [p[2]]

start = 'pair_group'

parser = yacc.yacc(debug=1)

## Test it
s = "a 0; b 1; c 2;"
print(parser.parse(s))