A FUSE-based filesystem reproducing the NCBI Taxonomy hierarchy.
In this post I will show how I've used FUSE to create a new loadable filesystem for Linux, reproducing the tree of the NCBI Taxonomy.
From the Fuse Homepage:With FUSE it is possible to implement a fully functional filesystem in a userspace program. Features include:
.
Downloading and installing Fuse
Download Fuse from sourceforge:http://sourceforge.net/projects/fuse/files/fuse-2.X/.$ tar xvfz fuse-2.8.5.tar.gz
$ cd fuse-2.8.5/
$ ./configure
$ make
$ make install
The classes
Taxon
A Taxon is a node in the NCBI Taxonomy:
class Taxon
{
public:
/* node id */
int id;
/* parent id */
int parent_id;
/* number of children */
int count_children;
/* name for this node */
char* name;
/* node children */
Taxon** children;
Taxon();
~Taxon();
/* compare by id */
static int comparator(const void* p1,const void* p2);
/* compare by name*/
static int compareByName(const void* p1,const void* p2);
const Taxon* getChildrenAt(int i) const;
/* find children by its name*/
const Taxon* findChildByName(const char* name) const;
/* recursive. find child from a unix path */
const Taxon* findChildByPath(const char* path) const;
/* stupid representation of a node as a XML file */
string xml() const;
};
FuseTaxonomy
The NCBI Taxonomy:class FuseTaxonomy
{
private:
/* number of nodes */
size_t nTaxons;
/** taxons ordered by ids */
Taxon** taxons;
/** taxons ordered by names */
Taxon** names;
public:
/* global instance */
static FuseTaxonomy* INSTANCE;
FuseTaxonomy();
~FuseTaxonomy();
/* find taxon by ID */
const Taxon* findTaxonById(int id)const;
/* find taxon by name */
const Taxon* findTaxonByName(const char* name) const;
/* root of taxonomy */
const Taxon* getRoot() const;
/* read file nodes.dmp */
void readNodes(const char* nodes);
/* read file names.dmp */
void readNames(const char* namesfile);
/* find taxon node from unix path */
const Taxon* findByPath(const char* path) const;
/** FUSE CALLBACK: This function returns metadata concerning a file specified by path in a special stat structure. */
static int getattr(const char *path, struct stat *stbuf);
/* FUSE CALLBACK: used to read directory contents */
static int readdir(const char *path, void *buf, fuse_fill_dir_t filler, off_t offset, struct fuse_file_info *fi);
/* FUSE CALLBACK: checks whatever user is permitted to open the /hello file with flags given in the fuse_file_info structure. */
static int open(const char *path, struct fuse_file_info *fi);
/* FUSE CALLBACK: used to feed the user with data from the file. */
static int read(const char *path, char *buf, size_t size, off_t offset, struct fuse_file_info *fi);
};
The static functions 'getattr', 'readdir', 'open' and 'read' are the callbacks called by the FUSE API to explore the new filesystem and must be initialized in the 'main' method.
Test
Load the NCBI taxonomy
$ wget -O taxdump.tar.gz "ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz"
$ tar xvfz taxdump.tar.gz names.dmp
$ tar xvfz taxdump.tar.gz nodes.dmp
Create a temporary folder for our new filesystem:
$ mkdir -p tmp_fuse
load the NCBI taxonomy and install the new filesystem
$ ./fusetaxonomy nodes.dmp names.dmp tmp_fuse
What's is in this new filesystem ?
$ls tmp_fuse
root
And what are the nodes under "root" ?
$ls tmp_fuse
cellular_organisms other_sequences unclassified_sequences Viroids Viruses
And what's under 'Eukaryota' ?
ls -la tmp_fuse/root/cellular_organisms/Eukaryota
total 0
drwxr-xr-x 22 root root 0 1970-01-01 01:00 .
drwxr-xr-x 3 root root 0 1970-01-01 01:00 ..
drwxr-xr-x 9 root root 0 1970-01-01 01:00 Alveolata
drwxr-xr-x 11 root root 0 1970-01-01 01:00 Amoebozoa
drwxr-xr-x 4 root root 0 1970-01-01 01:00 Apusozoa
drwxr-xr-x 5 root root 0 1970-01-01 01:00 Centroheliozoa
drwxr-xr-x 4 root root 0 1970-01-01 01:00 Cryptophyta
drwxr-xr-x 38 root root 0 1970-01-01 01:00 environmental_samples
drwxr-xr-x 5 root root 0 1970-01-01 01:00 Euglenozoa
drwxr-xr-x 7 root root 0 1970-01-01 01:00 Fornicata
drwxr-xr-x 5 root root 0 1970-01-01 01:00 Glaucocystophyceae
drwxr-xr-x 11 root root 0 1970-01-01 01:00 Haptophyceae
drwxr-xr-x 4 root root 0 1970-01-01 01:00 Heterolobosea
drwxr-xr-x 4 root root 0 1970-01-01 01:00 Jakobida
drwxr-xr-x 3 root root 0 1970-01-01 01:00 Katablepharidophyta
drwxr-xr-x 1 root root 0 1970-01-01 01:00 Malawimonadidae
drwxr-xr-x 6 root root 0 1970-01-01 01:00 Opisthokonta
drwxr-xr-x 8 root root 0 1970-01-01 01:00 Oxymonadida
drwxr-xr-x 9 root root 0 1970-01-01 01:00 Parabasalia
drwxr-xr-x 9 root root 0 1970-01-01 01:00 Rhizaria
drwxr-xr-x 7 root root 0 1970-01-01 01:00 Rhodophyta
drwxr-xr-x 25 root root 0 1970-01-01 01:00 stramenopiles
drwxr-xr-x 36 root root 0 1970-01-01 01:00 unclassified_eukaryotes
drwxr-xr-x 3 root root 0 1970-01-01 01:00 Viridiplantae
total 0
drwxr-xr-x 22 root root 0 1970-01-01 01:00 .
drwxr-xr-x 3 root root 0 1970-01-01 01:00 ..
drwxr-xr-x 9 root root 0 1970-01-01 01:00 Alveolata
drwxr-xr-x 11 root root 0 1970-01-01 01:00 Amoebozoa
drwxr-xr-x 4 root root 0 1970-01-01 01:00 Apusozoa
drwxr-xr-x 5 root root 0 1970-01-01 01:00 Centroheliozoa
drwxr-xr-x 4 root root 0 1970-01-01 01:00 Cryptophyta
drwxr-xr-x 38 root root 0 1970-01-01 01:00 environmental_samples
drwxr-xr-x 5 root root 0 1970-01-01 01:00 Euglenozoa
drwxr-xr-x 7 root root 0 1970-01-01 01:00 Fornicata
drwxr-xr-x 5 root root 0 1970-01-01 01:00 Glaucocystophyceae
drwxr-xr-x 11 root root 0 1970-01-01 01:00 Haptophyceae
drwxr-xr-x 4 root root 0 1970-01-01 01:00 Heterolobosea
drwxr-xr-x 4 root root 0 1970-01-01 01:00 Jakobida
drwxr-xr-x 3 root root 0 1970-01-01 01:00 Katablepharidophyta
drwxr-xr-x 1 root root 0 1970-01-01 01:00 Malawimonadidae
drwxr-xr-x 6 root root 0 1970-01-01 01:00 Opisthokonta
drwxr-xr-x 8 root root 0 1970-01-01 01:00 Oxymonadida
drwxr-xr-x 9 root root 0 1970-01-01 01:00 Parabasalia
drwxr-xr-x 9 root root 0 1970-01-01 01:00 Rhizaria
drwxr-xr-x 7 root root 0 1970-01-01 01:00 Rhodophyta
drwxr-xr-x 25 root root 0 1970-01-01 01:00 stramenopiles
drwxr-xr-x 36 root root 0 1970-01-01 01:00 unclassified_eukaryotes
drwxr-xr-x 3 root root 0 1970-01-01 01:00 Viridiplantae
Let's use find !
find tmp_fuse/root/cellular_organisms/Eukaryota/Opisthokonta/Metazoa/ | head
tmp_fuse/root/cellular_organisms/Eukaryota/Opisthokonta/Metazoa/
tmp_fuse/root/cellular_organisms/Eukaryota/Opisthokonta/Metazoa/Porifera
tmp_fuse/root/cellular_organisms/Eukaryota/Opisthokonta/Metazoa/Porifera/Demospongiae
tmp_fuse/root/cellular_organisms/Eukaryota/Opisthokonta/Metazoa/Porifera/Demospongiae/Tetractinomorpha
tmp_fuse/root/cellular_organisms/Eukaryota/Opisthokonta/Metazoa/Porifera/Demospongiae/Tetractinomorpha/Astrophorida
tmp_fuse/root/cellular_organisms/Eukaryota/Opisthokonta/Metazoa/Porifera/Demospongiae/Tetractinomorpha/Astrophorida/Geodiidae
tmp_fuse/root/cellular_organisms/Eukaryota/Opisthokonta/Metazoa/Porifera/Demospongiae/Tetractinomorpha/Astrophorida/Geodiidae/Geodia
tmp_fuse/root/cellular_organisms/Eukaryota/Opisthokonta/Metazoa/Porifera/Demospongiae/Tetractinomorpha/Astrophorida/Geodiidae/Geodia/Geodia_cydonium
tmp_fuse/root/cellular_organisms/Eukaryota/Opisthokonta/Metazoa/Porifera/Demospongiae/Tetractinomorpha/Astrophorida/Geodiidae/Geodia/Geodia_neptuni
tmp_fuse/root/cellular_organisms/Eukaryota/Opisthokonta/Metazoa/Porifera/Demospongiae/Tetractinomorpha/Astrophorida/Geodiidae/Geodia/Geodia_papyracea
What is the content of Homo_sapiens_neanderthalensis ?:
$ cat tmp_fuse/root/cellular_organisms/Eukaryota/Opisthokonta/Metazoa/Eumetazoa/Bilateria/Coelomata/Deuterostomia/Chordata/Craniata/Vertebrata/Gnathostomata/Teleostomi/Euteleostomi/Sarcopterygii/Tetrapoda/Amniota/Mammalia/Theria/Eutheria/Euarchontoglires/Primates/Haplorrhini/Simiiformes/Catarrhini/Hominoidea/Hominidae/Homininae/Homo/Homo_sapiens/Homo_sapiens_neanderthalensis
<?xml version="1.0"?>
<Taxon-Id>63221</Taxon-Id>
unmount the NCBI filesystem
sudo ${FUSEDIR}/bin/fusermount -u tmp_fuse
The Full source code.
The full source code is available as a 'gist' on github.com.The code is also on github here
That's it !
Pierre
2 comments:
Cool!
This is almost unbearably awesome.
Does it work for writing as well as reading?
If so, then we're well on the way to a dream of maintaining the taxonomy in a git module, allowing specialist taxonomists to clone, work on their specialist area, and push (if trusted) or issue a pull request (otherwise).
Post a Comment