Showing posts with label bison. Show all posts
Showing posts with label bison. Show all posts

13 July 2012

Parsing the Newick format in C using flex and bison.

The following post is my answer for this question on biostar "Newick 2 Json converter".
The Newick tree format is a simple format used to write out trees (using parentheses and commas) in a text file .
The original question asked for a parser based on perl but here, I've implemented a C parser using flex/bison.


Example:

((Human:0.3, Chimpanzee:0.2):0.1, Gorilla:0.3, (Mouse:0.6, Rat:0.5):0.2);

A formal grammar for the Newick format is available here
Items in { } may appear zero or more times.
   Items in [ ] are optional, they may appear once or not at all.
   All other punctuation marks (colon, semicolon, parentheses, comma and
         single quote) are required parts of the format.


              tree ==> descendant_list [ root_label ] [ : branch_length ] ;

   descendant_list ==> ( subtree { , subtree } )

           subtree ==> descendant_list [internal_node_label] [: branch_length]
                   ==> leaf_label [: branch_length]

            root_label ==> label
   internal_node_label ==> label
            leaf_label ==> label

                 label ==> unquoted_label
                       ==> quoted_label

        unquoted_label ==> string_of_printing_characters
          quoted_label ==> ' string_of_printing_characters '

         branch_length ==> signed_number
                       ==> unsigned_number

The Flex Lexer

The Flex Lexer is used to extract the terminal tokens of the grammar from the input stream.
Those terminals are '(' ')' ',' ';' ':' , strings and numbers. For the simple and double quoted strings, we tell the lexer to enter in a specific state ( 'apos' and 'quot').

The Bison Scanner

The Bison scanner reads the tokens returned by Flex and implements the grammar.
The simple structure holding the tree is defined in 'struct tree_t'. The code also contains some methods to dump the tree as JSON.

Makefile


Testing

compile:
$ make
bison -d newick.y
flex newick.l
gcc -Wall -O3 newick.tab.c lex.yy.c
lex.yy.c:1265:17: warning: ‘yyunput’ defined but not used [-Wunused-function]
lex.yy.c:1306:16: warning: ‘input’ defined but not used [-Wunused-function]
test:
echo "((Human:0.3, Chimpanzee:0.2):0.1, Gorilla:0.3, (Mouse:0.6, Rat:0.5):0.2);" | ./a.out

{
    "children": [
        {
            "length": 0.1,
            "children": [
                {
                    "label": "Human",
                    "length": 0.3
                },
                {
                    "label": "Chimpanzee",
                    "length": 0.2
                }
            ]
        },
        {
            "label": "Gorilla",
            "length": 0.3
        },
        {
            "length": 0.2,
            "children": [
                {
                    "label": "Mouse",
                    "length": 0.6
                },
                {
                    "label": "Rat",
                    "length": 0.5
                }
            ]
        }
    ]
}


That's it,

Pierre