Parsing the Newick format in C using flex and bison.
The following post is my answer for this question on biostar "Newick 2 Json converter".
The Newick tree format is a simple format used to write out trees (using parentheses and commas) in a text file .
The original question asked for a parser based on perl but here, I've implemented a C parser using flex/bison.
Example:
((Human:0.3, Chimpanzee:0.2):0.1, Gorilla:0.3, (Mouse:0.6, Rat:0.5):0.2);
A formal grammar for the Newick format is available here
Items in { } may appear zero or more times.
Items in [ ] are optional, they may appear once or not at all.
All other punctuation marks (colon, semicolon, parentheses, comma and
single quote) are required parts of the format.
tree ==> descendant_list [ root_label ] [ : branch_length ] ;
descendant_list ==> ( subtree { , subtree } )
subtree ==> descendant_list [internal_node_label] [: branch_length]
==> leaf_label [: branch_length]
root_label ==> label
internal_node_label ==> label
leaf_label ==> label
label ==> unquoted_label
==> quoted_label
unquoted_label ==> string_of_printing_characters
quoted_label ==> ' string_of_printing_characters '
branch_length ==> signed_number
==> unsigned_number
The Flex Lexer
The Flex Lexer is used to extract the terminal tokens of the grammar from the input stream.Those terminals are '(' ')' ',' ';' ':' , strings and numbers. For the simple and double quoted strings, we tell the lexer to enter in a specific state ( 'apos' and 'quot').
The Bison Scanner
The Bison scanner reads the tokens returned by Flex and implements the grammar.The simple structure holding the tree is defined in
'struct tree_t'. The code also contains some methods to dump the tree as JSON. Makefile
Testing
compile:$ make bison -d newick.y flex newick.l gcc -Wall -O3 newick.tab.c lex.yy.c lex.yy.c:1265:17: warning: ‘yyunput’ defined but not used [-Wunused-function] lex.yy.c:1306:16: warning: ‘input’ defined but not used [-Wunused-function]test:
echo "((Human:0.3, Chimpanzee:0.2):0.1, Gorilla:0.3, (Mouse:0.6, Rat:0.5):0.2);" | ./a.out
{
"children": [
{
"length": 0.1,
"children": [
{
"label": "Human",
"length": 0.3
},
{
"label": "Chimpanzee",
"length": 0.2
}
]
},
{
"label": "Gorilla",
"length": 0.3
},
{
"length": 0.2,
"children": [
{
"label": "Mouse",
"length": 0.6
},
{
"label": "Rat",
"length": 0.5
}
]
}
]
}That's it,
Pierre