Parsing the Newick format in C using flex and bison.
The following post is my answer for this question on biostar "Newick 2 Json converter".
The Newick tree format is a simple format used to write out trees (using parentheses and commas) in a text file .
The original question asked for a parser based on perl but here, I've implemented a C parser using flex/bison.
Example:
((Human:0.3, Chimpanzee:0.2):0.1, Gorilla:0.3, (Mouse:0.6, Rat:0.5):0.2);
A formal grammar for the Newick format is available here
Items in { } may appear zero or more times. Items in [ ] are optional, they may appear once or not at all. All other punctuation marks (colon, semicolon, parentheses, comma and single quote) are required parts of the format. tree ==> descendant_list [ root_label ] [ : branch_length ] ; descendant_list ==> ( subtree { , subtree } ) subtree ==> descendant_list [internal_node_label] [: branch_length] ==> leaf_label [: branch_length] root_label ==> label internal_node_label ==> label leaf_label ==> label label ==> unquoted_label ==> quoted_label unquoted_label ==> string_of_printing_characters quoted_label ==> ' string_of_printing_characters ' branch_length ==> signed_number ==> unsigned_number
The Flex Lexer
The Flex Lexer is used to extract the terminal tokens of the grammar from the input stream.Those terminals are '(' ')' ',' ';' ':' , strings and numbers. For the simple and double quoted strings, we tell the lexer to enter in a specific state ( 'apos' and 'quot').
The Bison Scanner
The Bison scanner reads the tokens returned by Flex and implements the grammar.The simple structure holding the tree is defined in
'struct tree_t'
. The code also contains some methods to dump the tree as JSON. Makefile
Testing
compile:$ make bison -d newick.y flex newick.l gcc -Wall -O3 newick.tab.c lex.yy.c lex.yy.c:1265:17: warning: ‘yyunput’ defined but not used [-Wunused-function] lex.yy.c:1306:16: warning: ‘input’ defined but not used [-Wunused-function]test:
echo "((Human:0.3, Chimpanzee:0.2):0.1, Gorilla:0.3, (Mouse:0.6, Rat:0.5):0.2);" | ./a.out { "children": [ { "length": 0.1, "children": [ { "label": "Human", "length": 0.3 }, { "label": "Chimpanzee", "length": 0.2 } ] }, { "label": "Gorilla", "length": 0.3 }, { "length": 0.2, "children": [ { "label": "Mouse", "length": 0.6 }, { "label": "Rat", "length": 0.5 } ] } ] }
That's it,
Pierre