The Newick tree format is a simple format used to write out trees (using parentheses and commas) in a text file .
The original question asked for a parser based on perl but here, I've implemented a C parser using flex/bison.
Example:
((Human:0.3, Chimpanzee:0.2):0.1, Gorilla:0.3, (Mouse:0.6, Rat:0.5):0.2);
A formal grammar for the Newick format is available here
Items in { } may appear zero or more times. Items in [ ] are optional, they may appear once or not at all. All other punctuation marks (colon, semicolon, parentheses, comma and single quote) are required parts of the format. tree ==> descendant_list [ root_label ] [ : branch_length ] ; descendant_list ==> ( subtree { , subtree } ) subtree ==> descendant_list [internal_node_label] [: branch_length] ==> leaf_label [: branch_length] root_label ==> label internal_node_label ==> label leaf_label ==> label label ==> unquoted_label ==> quoted_label unquoted_label ==> string_of_printing_characters quoted_label ==> ' string_of_printing_characters ' branch_length ==> signed_number ==> unsigned_number
The Flex Lexer
The Flex Lexer is used to extract the terminal tokens of the grammar from the input stream.Those terminals are '(' ')' ',' ';' ':' , strings and numbers. For the simple and double quoted strings, we tell the lexer to enter in a specific state ( 'apos' and 'quot').
The Bison Scanner
The Bison scanner reads the tokens returned by Flex and implements the grammar.The simple structure holding the tree is defined in
'struct tree_t'
. The code also contains some methods to dump the tree as JSON. Makefile
Testing
compile:$ make bison -d newick.y flex newick.l gcc -Wall -O3 newick.tab.c lex.yy.c lex.yy.c:1265:17: warning: ‘yyunput’ defined but not used [-Wunused-function] lex.yy.c:1306:16: warning: ‘input’ defined but not used [-Wunused-function]test:
echo "((Human:0.3, Chimpanzee:0.2):0.1, Gorilla:0.3, (Mouse:0.6, Rat:0.5):0.2);" | ./a.out { "children": [ { "length": 0.1, "children": [ { "label": "Human", "length": 0.3 }, { "label": "Chimpanzee", "length": 0.2 } ] }, { "label": "Gorilla", "length": 0.3 }, { "length": 0.2, "children": [ { "label": "Mouse", "length": 0.6 }, { "label": "Rat", "length": 0.5 } ] } ] }
That's it,
Pierre
This is very fun to read, and it brings back memories. I used to work a fair amount with lex and yacc. I didn't try out your code, but I have no doubt that it works.
ReplyDeleteI have a quick question. If I were to try to code something like this, I don't think I would go back to these tools; I think I'd either do it in Perl, or I'd use it as an excuse to try out antlr (have you heard of it?). So the question is, why flex and bison?
Chris, I wrote it in lex/bison because it is just fun dans my code was just a roughly copy-paste of my C code for parsing JSON :-)
ReplyDeleteAbout another language: ( I'm not a perl programmer ) I would use JAVACC rather than antlr because javacc doesn't require any dependencies.
Hello, I am an Italian student and a university project I should do a tree with flex and bison , as what you did in the post . Very good !. I tried your code but i get error :
ReplyDeletemissing "newick.tab.h".
It was a file to create bison?
I'm sorry for my english.
@giovanna show me the full error trace or look at this git repo: https://github.com/lindenb/newicktools
ReplyDeleteit contains the recent sources.