13 July 2012

Parsing the Newick format in C using flex and bison.

The following post is my answer for this question on biostar "Newick 2 Json converter".
The Newick tree format is a simple format used to write out trees (using parentheses and commas) in a text file .
The original question asked for a parser based on perl but here, I've implemented a C parser using flex/bison.


((Human:0.3, Chimpanzee:0.2):0.1, Gorilla:0.3, (Mouse:0.6, Rat:0.5):0.2);

A formal grammar for the Newick format is available here
Items in { } may appear zero or more times.
   Items in [ ] are optional, they may appear once or not at all.
   All other punctuation marks (colon, semicolon, parentheses, comma and
         single quote) are required parts of the format.

              tree ==> descendant_list [ root_label ] [ : branch_length ] ;

   descendant_list ==> ( subtree { , subtree } )

           subtree ==> descendant_list [internal_node_label] [: branch_length]
                   ==> leaf_label [: branch_length]

            root_label ==> label
   internal_node_label ==> label
            leaf_label ==> label

                 label ==> unquoted_label
                       ==> quoted_label

        unquoted_label ==> string_of_printing_characters
          quoted_label ==> ' string_of_printing_characters '

         branch_length ==> signed_number
                       ==> unsigned_number

The Flex Lexer

The Flex Lexer is used to extract the terminal tokens of the grammar from the input stream.
Those terminals are '(' ')' ',' ';' ':' , strings and numbers. For the simple and double quoted strings, we tell the lexer to enter in a specific state ( 'apos' and 'quot').

The Bison Scanner

The Bison scanner reads the tokens returned by Flex and implements the grammar.
The simple structure holding the tree is defined in 'struct tree_t'. The code also contains some methods to dump the tree as JSON.



$ make
bison -d newick.y
flex newick.l
gcc -Wall -O3 newick.tab.c lex.yy.c
lex.yy.c:1265:17: warning: ‘yyunput’ defined but not used [-Wunused-function]
lex.yy.c:1306:16: warning: ‘input’ defined but not used [-Wunused-function]
echo "((Human:0.3, Chimpanzee:0.2):0.1, Gorilla:0.3, (Mouse:0.6, Rat:0.5):0.2);" | ./a.out

    "children": [
            "length": 0.1,
            "children": [
                    "label": "Human",
                    "length": 0.3
                    "label": "Chimpanzee",
                    "length": 0.2
            "label": "Gorilla",
            "length": 0.3
            "length": 0.2,
            "children": [
                    "label": "Mouse",
                    "length": 0.6
                    "label": "Rat",
                    "length": 0.5

That's it,



Chris Maloney said...

This is very fun to read, and it brings back memories. I used to work a fair amount with lex and yacc. I didn't try out your code, but I have no doubt that it works.

I have a quick question. If I were to try to code something like this, I don't think I would go back to these tools; I think I'd either do it in Perl, or I'd use it as an excuse to try out antlr (have you heard of it?). So the question is, why flex and bison?

Pierre Lindenbaum said...

Chris, I wrote it in lex/bison because it is just fun dans my code was just a roughly copy-paste of my C code for parsing JSON :-)

About another language: ( I'm not a perl programmer ) I would use JAVACC rather than antlr because javacc doesn't require any dependencies.

Giovanna Furfaro said...

Hello, I am an Italian student and a university project I should do a tree with flex and bison , as what you did in the post . Very good !. I tried your code but i get error :
missing "newick.tab.h".
It was a file to create bison?
I'm sorry for my english.

Pierre Lindenbaum said...

@giovanna show me the full error trace or look at this git repo: https://github.com/lindenb/newicktools

it contains the recent sources.