YOKOFAKUN: taxonomy


13 July 2012

Parsing the Newick format in C using flex and bison.

This post is my answer to the Biostar question "Newick 2 Json converter".
The Newick tree format is a simple format used to write out trees (using parentheses and commas) in a text file.
The original question asked for a parser written in Perl, but here I've implemented a C parser using flex/bison.

Example:

((Human:0.3, Chimpanzee:0.2):0.1, Gorilla:0.3, (Mouse:0.6, Rat:0.5):0.2);

A formal grammar for the Newick format is reproduced below:

   Items in { } may appear zero or more times.
   Items in [ ] are optional, they may appear once or not at all.
   All other punctuation marks (colon, semicolon, parentheses, comma and
         single quote) are required parts of the format.


              tree ==> descendant_list [ root_label ] [ : branch_length ] ;

   descendant_list ==> ( subtree { , subtree } )

           subtree ==> descendant_list [internal_node_label] [: branch_length]
                   ==> leaf_label [: branch_length]

            root_label ==> label
   internal_node_label ==> label
            leaf_label ==> label

                 label ==> unquoted_label
                       ==> quoted_label

        unquoted_label ==> string_of_printing_characters
          quoted_label ==> ' string_of_printing_characters '

         branch_length ==> signed_number
                       ==> unsigned_number

The Flex Lexer

The Flex lexer extracts the terminal tokens of the grammar from the input stream.
Those terminals are '(' ')' ',' ';' ':', strings and numbers. For single- and double-quoted strings, the lexer enters a dedicated start state ('apos' or 'quot').
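To make the lexer's job concrete, here is a minimal sketch of the same tokenization written as a plain C++ function instead of a Flex specification. The names TokenKind, Token, looksLikeNumber and tokenize are assumptions made for this illustration; they are not taken from newick.l. The quoted-label branch plays the role of the 'apos' and 'quot' states.

```cpp
#include <cctype>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical token kinds (illustrative names, not those of newick.l).
enum class TokenKind { LParen, RParen, Comma, Colon, Semicolon, String, Number };

struct Token {
    TokenKind kind;
    std::string text;  // the label, or the number exactly as written
};

// Accepts an optional sign, digits and an optional fraction (no exponent).
static bool looksLikeNumber(const std::string& s) {
    std::size_t i = 0, digits = 0;
    if (i < s.size() && (s[i] == '+' || s[i] == '-')) ++i;
    while (i < s.size() && std::isdigit(static_cast<unsigned char>(s[i]))) { ++i; ++digits; }
    if (i < s.size() && s[i] == '.') {
        ++i;
        while (i < s.size() && std::isdigit(static_cast<unsigned char>(s[i]))) { ++i; ++digits; }
    }
    return digits > 0 && i == s.size();
}

std::vector<Token> tokenize(const std::string& in) {
    const std::string punctuation = "(),:;";
    static const TokenKind kinds[] = {TokenKind::LParen, TokenKind::RParen, TokenKind::Comma,
                                      TokenKind::Colon, TokenKind::Semicolon};
    std::vector<Token> out;
    std::size_t i = 0;
    while (i < in.size()) {
        const char c = in[i];
        if (std::isspace(static_cast<unsigned char>(c))) { ++i; continue; }
        const std::size_t p = punctuation.find(c);
        if (p != std::string::npos) {  // single-character terminal
            out.push_back(Token{kinds[p], std::string(1, c)});
            ++i;
            continue;
        }
        if (c == '\'' || c == '"') {  // quoted label: read up to the matching quote
            const std::size_t end = in.find(c, i + 1);
            if (end == std::string::npos) throw std::runtime_error("unterminated quoted label");
            out.push_back(Token{TokenKind::String, in.substr(i + 1, end - i - 1)});
            i = end + 1;
            continue;
        }
        const std::size_t start = i;  // unquoted run: stop at space, punctuation or quote
        while (i < in.size() && !std::isspace(static_cast<unsigned char>(in[i])) &&
               punctuation.find(in[i]) == std::string::npos && in[i] != '\'' && in[i] != '"') {
            ++i;
        }
        const std::string word = in.substr(start, i - start);
        out.push_back(Token{looksLikeNumber(word) ? TokenKind::Number : TokenKind::String, word});
    }
    return out;
}
```

A label made only of digits would be reported as a number here; a real lexer can leave that decision to the parser, which knows whether a label or a branch length is expected.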

The Bison Parser

The Bison parser reads the tokens returned by Flex and implements the grammar.
The simple structure holding the tree is defined in 'struct tree_t'. The code also contains some functions to dump the tree as JSON.
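As an illustration of how the grammar maps to code, here is a minimal sketch that swaps Bison for a hand-written recursive-descent parser in C++. It is not the code of newick.y: the Tree struct is a hypothetical stand-in for 'struct tree_t', and NewickParser, toJson and newickToJson are names invented for this example. It reads characters directly, does not handle quoted labels, does not escape labels in the JSON output, and copies branch lengths as written without validating them.

```cpp
#include <cctype>
#include <memory>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for the post's 'struct tree_t'.
struct Tree {
    std::string label;   // empty when the node is unnamed
    std::string length;  // branch length as written; empty when absent
    std::vector<std::unique_ptr<Tree>> children;
};

class NewickParser {
public:
    explicit NewickParser(const std::string& text) : s(text), pos(0) {}

    // tree ==> descendant_list [ root_label ] [ : branch_length ] ;
    // (this sketch is more lenient: it also accepts a lone leaf before ';')
    std::unique_ptr<Tree> parse() {
        std::unique_ptr<Tree> root = subtree();
        expect(';');
        return root;
    }

private:
    std::string s;
    std::size_t pos;

    char peek() {  // next non-blank character, or '\0' at end of input
        while (pos < s.size() && std::isspace(static_cast<unsigned char>(s[pos]))) ++pos;
        return pos < s.size() ? s[pos] : '\0';
    }
    void expect(char c) {
        if (peek() != c) throw std::runtime_error(std::string("expected '") + c + "'");
        ++pos;
    }
    std::string word() {  // unquoted run of characters, possibly empty
        peek();
        const std::size_t start = pos;
        while (pos < s.size() && !std::isspace(static_cast<unsigned char>(s[pos])) &&
               std::string("(),:;").find(s[pos]) == std::string::npos) ++pos;
        return s.substr(start, pos - start);
    }
    // subtree ==> descendant_list [label] [: branch_length] | leaf_label [: branch_length]
    std::unique_ptr<Tree> subtree() {
        std::unique_ptr<Tree> node(new Tree());
        if (peek() == '(') {  // descendant_list ==> ( subtree { , subtree } )
            ++pos;
            node->children.push_back(subtree());
            while (peek() == ',') { ++pos; node->children.push_back(subtree()); }
            expect(')');
        }
        node->label = word();
        if (peek() == ':') { ++pos; node->length = word(); }
        return node;
    }
};

// Write the tree as compact JSON, omitting absent fields.
void toJson(const Tree& t, std::ostream& out) {
    bool first = true;
    auto sep = [&] { if (!first) out << ","; first = false; };
    out << "{";
    if (!t.label.empty()) { sep(); out << "\"label\":\"" << t.label << "\""; }
    if (!t.length.empty()) { sep(); out << "\"length\":" << t.length; }
    if (!t.children.empty()) {
        sep();
        out << "\"children\":[";
        for (std::size_t i = 0; i < t.children.size(); ++i) {
            if (i > 0) out << ",";
            toJson(*t.children[i], out);
        }
        out << "]";
    }
    out << "}";
}

std::string newickToJson(const std::string& text) {
    std::ostringstream out;
    toJson(*NewickParser(text).parse(), out);
    return out.str();
}
```

Each grammar rule becomes one function, which is essentially what Bison generates from the rules, except that Bison builds a table-driven parser instead of recursive calls.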

Makefile

Testing

Compile:

$ make
bison -d newick.y
flex newick.l
gcc -Wall -O3 newick.tab.c lex.yy.c
lex.yy.c:1265:17: warning: ‘yyunput’ defined but not used [-Wunused-function]
lex.yy.c:1306:16: warning: ‘input’ defined but not used [-Wunused-function]

Test:

$ echo "((Human:0.3, Chimpanzee:0.2):0.1, Gorilla:0.3, (Mouse:0.6, Rat:0.5):0.2);" | ./a.out

{
    "children": [
        {
            "length": 0.1,
            "children": [
                {
                    "label": "Human",
                    "length": 0.3
                },
                {
                    "label": "Chimpanzee",
                    "length": 0.2
                }
            ]
        },
        {
            "label": "Gorilla",
            "length": 0.3
        },
        {
            "length": 0.2,
            "children": [
                {
                    "label": "Mouse",
                    "length": 0.6
                },
                {
                    "label": "Rat",
                    "length": 0.5
                }
            ]
        }
    ]
}

That's it,

Pierre

13 August 2011

A FUSE-based filesystem reproducing the NCBI Taxonomy hierarchy.

In this post I will show how I've used FUSE to create a new loadable filesystem for Linux, reproducing the tree of the NCBI Taxonomy.

From the FUSE homepage: With FUSE it is possible to implement a fully functional filesystem in a userspace program. Features include:

Simple library API
Simple installation (no need to patch or recompile the kernel)
Secure implementation
Userspace - kernel interface is very efficient
Usable by non privileged users
Runs on Linux kernels 2.4.X and 2.6.X
Has proven very stable over time

Downloading and installing Fuse

Download FUSE from SourceForge: http://sourceforge.net/projects/fuse/files/fuse-2.X/

$ tar xvfz fuse-2.8.5.tar.gz

$ cd fuse-2.8.5/

$ ./configure

$ make

$ make install

The classes

Taxon

A Taxon is a node in the NCBI Taxonomy:


class Taxon
    {
    public:
	/* node id */
	int id;
	/* parent id */
	int parent_id;
	/* number of children */
	int count_children;
	/* name for this node */
	char* name;
	/* node children */
	Taxon** children;

	Taxon();
	~Taxon();

	/* compare by id */
	static int comparator(const void* p1,const void* p2);
	/* compare by name */
	static int compareByName(const void* p1,const void* p2);
	/* i-th child of this node */
	const Taxon* getChildrenAt(int i) const;
	/* find a child by its name */
	const Taxon* findChildByName(const char* name) const;
	/* recursive: find a child from a Unix path */
	const Taxon* findChildByPath(const char* path) const;
	/* stupid representation of a node as an XML file */
	string xml() const;
    };
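The recursive path lookup is the piece that turns a Unix path into a taxonomy node. Here is a minimal sketch of how such a lookup can work. Node is a simplified, hypothetical stand-in for Taxon that uses std::string and std::vector instead of raw arrays, so this is not the code from the gist.

```cpp
#include <string>
#include <vector>

// Simplified, hypothetical stand-in for the Taxon class.
struct Node {
    std::string name;
    std::vector<Node> children;

    // Linear search among the direct children.
    const Node* findChildByName(const std::string& wanted) const {
        for (const Node& child : children) {
            if (child.name == wanted) return &child;
        }
        return nullptr;
    }

    // Resolve a path such as "/cellular_organisms/Eukaryota" relative to this node.
    // Returns nullptr when a component does not exist.
    const Node* findChildByPath(const std::string& path) const {
        const std::size_t start = path.find_first_not_of('/');
        if (start == std::string::npos) return this;  // nothing left: the path names this node
        const std::size_t end = path.find('/', start);
        const std::string head = path.substr(
            start, end == std::string::npos ? std::string::npos : end - start);
        const Node* child = findChildByName(head);
        if (child == nullptr) return nullptr;
        // Recurse on the remainder of the path, if any.
        return end == std::string::npos ? child : child->findChildByPath(path.substr(end));
    }
};
```

The lookup peels off one path component per call, so its depth equals the depth of the node in the taxonomy.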

FuseTaxonomy

The NCBI Taxonomy:

class FuseTaxonomy
{
private:
	/* number of nodes */
	size_t nTaxons;
	/* taxons ordered by id */
	Taxon** taxons;
	/* taxons ordered by name */
	Taxon** names;

public:
	/* global instance */
	static FuseTaxonomy* INSTANCE;

	FuseTaxonomy();
	~FuseTaxonomy();

	/* find taxon by ID */
	const Taxon* findTaxonById(int id)const;
	/* find taxon by name */
	const Taxon* findTaxonByName(const char* name) const;
	/* root of taxonomy */
	const Taxon* getRoot() const;
	/* read file nodes.dmp */
	void readNodes(const char* nodes);
	/* read file names.dmp */
	void readNames(const char* namesfile);
	/* find taxon node from a Unix path */
	const Taxon* findByPath(const char* path) const;

	/* FUSE CALLBACK: returns the metadata of the file specified by path in a stat structure */
	static int getattr(const char *path, struct stat *stbuf);
	/* FUSE CALLBACK: used to read directory contents */
	static int readdir(const char *path, void *buf, fuse_fill_dir_t filler,  off_t offset, struct fuse_file_info *fi);
	/* FUSE CALLBACK: checks whether the user is permitted to open the file at path with the flags given in the fuse_file_info structure */
	static int open(const char *path, struct fuse_file_info *fi);
	/* FUSE CALLBACK: used to feed the user with data from the file */
	static int read(const char *path, char *buf, size_t size, off_t offset, struct fuse_file_info *fi);
};

The static functions 'getattr', 'readdir', 'open' and 'read' are the callbacks that FUSE invokes to explore the new filesystem; they must be registered in the 'main' function.
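To give an idea of what the bodies of 'getattr' and 'read' typically do, here is a minimal sketch that needs only the POSIX headers, not libfuse. Entry, fillAttributes and readSlice are hypothetical names for this illustration, and the permission bits are assumptions chosen to match the listing further down. The actual callbacks in the gist may be organised differently.

```cpp
#include <cerrno>
#include <cstring>
#include <string>
#include <sys/stat.h>

// Hypothetical lookup result; in the post this information would come from a Taxon.
struct Entry {
    bool exists;
    bool isDirectory;
    std::size_t size;  // number of bytes of content for a regular file
};

// Core of a getattr-style callback: 0 on success, a negated errno on failure.
int fillAttributes(const Entry& e, struct stat* st) {
    std::memset(st, 0, sizeof(struct stat));
    if (!e.exists) return -ENOENT;  // no such file or directory
    if (e.isDirectory) {
        st->st_mode = S_IFDIR | 0755;  // read-only for others, like drwxr-xr-x
        st->st_nlink = 2;
    } else {
        st->st_mode = S_IFREG | 0444;  // read-only regular file
        st->st_nlink = 1;
        st->st_size = static_cast<off_t>(e.size);
    }
    return 0;
}

// Core of a read-style callback: copy at most 'size' bytes of 'content'
// starting at 'offset' into 'buf' and return the number of bytes copied.
int readSlice(const std::string& content, char* buf, std::size_t size, off_t offset) {
    if (offset < 0 || static_cast<std::size_t>(offset) >= content.size()) return 0;
    const std::size_t available = content.size() - static_cast<std::size_t>(offset);
    const std::size_t n = size < available ? size : available;
    std::memcpy(buf, content.data() + offset, n);
    return static_cast<int>(n);
}
```

With FUSE 2.x, callbacks of this kind are assigned to the matching fields of a 'struct fuse_operations', which is then passed to 'fuse_main'.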

Test

Load the NCBI taxonomy

$ wget -O taxdump.tar.gz "ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz"

$ tar xvfz taxdump.tar.gz  names.dmp

$ tar xvfz taxdump.tar.gz  nodes.dmp

Create a temporary folder for our new filesystem:

$ mkdir -p tmp_fuse

Load the NCBI taxonomy and mount the new filesystem:

$ ./fusetaxonomy nodes.dmp names.dmp  tmp_fuse

What is in this new filesystem?

$ ls tmp_fuse

root

And what are the nodes under "root"?

$ ls tmp_fuse/root

cellular_organisms  other_sequences  unclassified_sequences  Viroids  Viruses

And what's under 'Eukaryota'?

$ ls -la tmp_fuse/root/cellular_organisms/Eukaryota
total 0
drwxr-xr-x 22 root root 0 1970-01-01 01:00 .
drwxr-xr-x 3 root root 0 1970-01-01 01:00 ..
drwxr-xr-x 9 root root 0 1970-01-01 01:00 Alveolata
drwxr-xr-x 11 root root 0 1970-01-01 01:00 Amoebozoa
drwxr-xr-x 4 root root 0 1970-01-01 01:00 Apusozoa
drwxr-xr-x 5 root root 0 1970-01-01 01:00 Centroheliozoa
drwxr-xr-x 4 root root 0 1970-01-01 01:00 Cryptophyta
drwxr-xr-x 38 root root 0 1970-01-01 01:00 environmental_samples
drwxr-xr-x 5 root root 0 1970-01-01 01:00 Euglenozoa
drwxr-xr-x 7 root root 0 1970-01-01 01:00 Fornicata
drwxr-xr-x 5 root root 0 1970-01-01 01:00 Glaucocystophyceae
drwxr-xr-x 11 root root 0 1970-01-01 01:00 Haptophyceae
drwxr-xr-x 4 root root 0 1970-01-01 01:00 Heterolobosea
drwxr-xr-x 4 root root 0 1970-01-01 01:00 Jakobida
drwxr-xr-x 3 root root 0 1970-01-01 01:00 Katablepharidophyta
drwxr-xr-x 1 root root 0 1970-01-01 01:00 Malawimonadidae
drwxr-xr-x 6 root root 0 1970-01-01 01:00 Opisthokonta
drwxr-xr-x 8 root root 0 1970-01-01 01:00 Oxymonadida
drwxr-xr-x 9 root root 0 1970-01-01 01:00 Parabasalia
drwxr-xr-x 9 root root 0 1970-01-01 01:00 Rhizaria
drwxr-xr-x 7 root root 0 1970-01-01 01:00 Rhodophyta
drwxr-xr-x 25 root root 0 1970-01-01 01:00 stramenopiles
drwxr-xr-x 36 root root 0 1970-01-01 01:00 unclassified_eukaryotes
drwxr-xr-x 3 root root 0 1970-01-01 01:00 Viridiplantae

Let's use find!

$ find tmp_fuse/root/cellular_organisms/Eukaryota/Opisthokonta/Metazoa/ | head

tmp_fuse/root/cellular_organisms/Eukaryota/Opisthokonta/Metazoa/

tmp_fuse/root/cellular_organisms/Eukaryota/Opisthokonta/Metazoa/Porifera

tmp_fuse/root/cellular_organisms/Eukaryota/Opisthokonta/Metazoa/Porifera/Demospongiae

tmp_fuse/root/cellular_organisms/Eukaryota/Opisthokonta/Metazoa/Porifera/Demospongiae/Tetractinomorpha

tmp_fuse/root/cellular_organisms/Eukaryota/Opisthokonta/Metazoa/Porifera/Demospongiae/Tetractinomorpha/Astrophorida

tmp_fuse/root/cellular_organisms/Eukaryota/Opisthokonta/Metazoa/Porifera/Demospongiae/Tetractinomorpha/Astrophorida/Geodiidae

tmp_fuse/root/cellular_organisms/Eukaryota/Opisthokonta/Metazoa/Porifera/Demospongiae/Tetractinomorpha/Astrophorida/Geodiidae/Geodia

tmp_fuse/root/cellular_organisms/Eukaryota/Opisthokonta/Metazoa/Porifera/Demospongiae/Tetractinomorpha/Astrophorida/Geodiidae/Geodia/Geodia_cydonium

tmp_fuse/root/cellular_organisms/Eukaryota/Opisthokonta/Metazoa/Porifera/Demospongiae/Tetractinomorpha/Astrophorida/Geodiidae/Geodia/Geodia_neptuni

tmp_fuse/root/cellular_organisms/Eukaryota/Opisthokonta/Metazoa/Porifera/Demospongiae/Tetractinomorpha/Astrophorida/Geodiidae/Geodia/Geodia_papyracea

What is the content of Homo_sapiens_neanderthalensis?

$ cat tmp_fuse/root/cellular_organisms/Eukaryota/Opisthokonta/Metazoa/Eumetazoa/Bilateria/Coelomata/Deuterostomia/Chordata/Craniata/Vertebrata/Gnathostomata/Teleostomi/Euteleostomi/Sarcopterygii/Tetrapoda/Amniota/Mammalia/Theria/Eutheria/Euarchontoglires/Primates/Haplorrhini/Simiiformes/Catarrhini/Hominoidea/Hominidae/Homininae/Homo/Homo_sapiens/Homo_sapiens_neanderthalensis

<?xml version="1.0"?>

<Taxon-Id>63221</Taxon-Id>

Unmount the NCBI filesystem:

$ sudo ${FUSEDIR}/bin/fusermount -u  tmp_fuse

The full source code

The full source code is available as a 'gist' on github.com.
The code is also available in a GitHub repository.

That's it !

Pierre

07 January 2011

SVG image in Wikipedia + Zoom.it = a zoomable "Tree of life"

Zoom.it is a free service for viewing and sharing high-resolution imagery: you give it the link to any image on the web (including SVG and PDF files) and it returns a zoomable view along with a short URL. As a test, I used the SVG file "Tree of life with genome size.svg" from Wikipedia, and here is the awesome result generated by Zoom.it:

http://zoom.it/1zy

That's it,
Pierre

30 June 2010

XSLT+NCBI-Taxonomy=Graphviz Dot

The following post was inspired by this question on Biostar.com: http://biostar.stackexchange.com/questions/1549: "lets say I want to know which taxonomic level groups Tribolium castaneum and Drosophila melanogaster. Insects, right? (...) Now lets say I have 10 pairs of such species and I want to see how close & distant they are... How can I do this easily?"
I suggested two solutions, both using an XSLT stylesheet. I then wondered whether an XSLT stylesheet could draw a tree of life with the help of Graphviz. The stylesheet I wrote is available at:

http://code.google.com/(...)/taxonomy2dot.xsl

Usage

xsltproc --novalid taxonomy2dot.xsl \
"http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?id=7070,32351,9605,9606&db=taxonomy&retmode=xml" |\
dot -o/home/pierre/file.svg -Tsvg

The main difficulty with this stylesheet was to draw one and only one edge between two nodes, even when that connection appears several times in the XML file. The trick is to use the XPath axis preceding-sibling:: to check whether the connection has already been printed.
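The same "print each edge only once" idea can be sketched outside XSLT. The C++ example below is an illustration under assumptions, not a translation of taxonomy2dot.xsl: it takes each lineage as a list of names from the root down to a species (a hypothetical input format), and it replaces the preceding-sibling:: check with a std::set of edges already written.

```cpp
#include <set>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Turn several lineages into a Graphviz digraph, writing every
// parent -> child edge exactly once even when lineages share ancestors.
std::string lineagesToDot(const std::vector<std::vector<std::string>>& lineages) {
    std::set<std::pair<std::string, std::string>> seen;  // edges already written
    std::ostringstream dot;
    dot << "digraph G {\n";
    for (const std::vector<std::string>& lineage : lineages) {
        for (std::size_t i = 1; i < lineage.size(); ++i) {
            const std::pair<std::string, std::string> edge(lineage[i - 1], lineage[i]);
            // insert() reports whether the edge is new; skip it otherwise.
            if (seen.insert(edge).second) {
                dot << "  \"" << edge.first << "\" -> \"" << edge.second << "\";\n";
            }
        }
    }
    dot << "}\n";
    return dot.str();
}
```

The set lookup costs a logarithmic search per edge, whereas a preceding-sibling:: test rescans the earlier siblings each time; for a handful of species either approach is fast enough.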

Result

That's it !

Pierre
