13 August 2011

A FUSE-based filesystem reproducing the NCBI Taxonomy hierarchy.

In this post I will show how I've used FUSE to create a new loadable filesystem for Linux, reproducing the tree of the NCBI Taxonomy.

From the Fuse homepage: With FUSE it is possible to implement a fully functional filesystem in a userspace program. Features include:

  • Simple library API
  • Simple installation (no need to patch or recompile the kernel)
  • Secure implementation
  • Userspace - kernel interface is very efficient
  • Usable by non privileged users
  • Runs on Linux kernels 2.4.X and 2.6.X
  • Has proven very stable over time

Downloading and installing Fuse

Download Fuse from SourceForge: http://sourceforge.net/projects/fuse/files/fuse-2.X/

$ tar xvfz fuse-2.8.5.tar.gz

$ cd fuse-2.8.5/
$ ./configure
$ make
$ make install


The classes


Taxon

A Taxon is a node in the NCBI Taxonomy:

class Taxon
{
public:
    /* node id */
    int id;
    /* parent id */
    int parent_id;
    /* number of children */
    int count_children;
    /* name for this node */
    char* name;
    /* node children */
    Taxon** children;
    Taxon();
    ~Taxon();
    /* compare by id */
    static int comparator(const void* p1,const void* p2);
    /* compare by name */
    static int compareByName(const void* p1,const void* p2);
    /* get the i-th child */
    const Taxon* getChildrenAt(int i) const;
    /* find a child by its name */
    const Taxon* findChildByName(const char* name) const;
    /* recursive: find a descendant from a unix path */
    const Taxon* findChildByPath(const char* path) const;
    /* stupid representation of a node as an XML file */
    string xml() const;
};
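
To illustrate what findChildByPath has to do, here is a minimal sketch of resolving a unix path against a tree of named nodes. It is not the code from the gist: 'TreeNode', 'find_child_by_name' and 'find_child_by_path' are hypothetical names, the children live in a std::vector rather than a raw array, and the walk is iterative where the original is described as recursive.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for Taxon.
struct TreeNode {
    std::string name;
    std::vector<TreeNode> children;
};

// Return the direct child called 'name', or nullptr if there is none.
const TreeNode* find_child_by_name(const TreeNode& parent, const std::string& name) {
    assert(!name.empty());
    for (const TreeNode& child : parent.children) {
        if (child.name == name) return &child;
    }
    return nullptr;
}

// Resolve a '/'-separated path relative to 'start'.
// Empty components (leading, trailing or doubled slashes) are skipped,
// so "/" and "" both resolve to 'start' itself.
const TreeNode* find_child_by_path(const TreeNode& start, const std::string& path) {
    const TreeNode* current = &start;
    std::size_t pos = 0;
    while (pos < path.size() && current != nullptr) {
        std::size_t slash = path.find('/', pos);
        if (slash == std::string::npos) slash = path.size();
        if (slash > pos) {
            current = find_child_by_name(*current, path.substr(pos, slash - pos));
        }
        pos = slash + 1;
    }
    return current;  // nullptr when a component was not found
}
```

Each step is a linear scan over the children; with children sorted by name, a binary search would do the same job faster.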

FuseTaxonomy

The NCBI Taxonomy:
class FuseTaxonomy
{
private:
    /* number of nodes */
    size_t nTaxons;
    /** taxons ordered by ids */
    Taxon** taxons;
    /** taxons ordered by names */
    Taxon** names;
public:
    /* global instance */
    static FuseTaxonomy* INSTANCE;
    FuseTaxonomy();
    ~FuseTaxonomy();
    /* find taxon by ID */
    const Taxon* findTaxonById(int id) const;
    /* find taxon by name */
    const Taxon* findTaxonByName(const char* name) const;
    /* root of taxonomy */
    const Taxon* getRoot() const;
    /* read file nodes.dmp */
    void readNodes(const char* nodes);
    /* read file names.dmp */
    void readNames(const char* namesfile);
    /* find taxon node from unix path */
    const Taxon* findByPath(const char* path) const;
    /** FUSE CALLBACK: returns metadata about the file specified by path in a stat structure. */
    static int getattr(const char *path, struct stat *stbuf);
    /* FUSE CALLBACK: used to read directory contents */
    static int readdir(const char *path, void *buf, fuse_fill_dir_t filler, off_t offset, struct fuse_file_info *fi);
    /* FUSE CALLBACK: checks whether the user is permitted to open the file at 'path' with the flags given in the fuse_file_info structure. */
    static int open(const char *path, struct fuse_file_info *fi);
    /* FUSE CALLBACK: used to feed the user with data from the file. */
    static int read(const char *path, char *buf, size_t size, off_t offset, struct fuse_file_info *fi);
};
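
Keeping the taxons "ordered by ids" together with a qsort-style comparator suggests that findTaxonById is a binary search. Here is a minimal sketch of that idea using the standard qsort/bsearch pair. 'IdNode', 'compare_by_id' and 'find_by_id' are hypothetical names, and I am only assuming this is roughly what the real lookup does.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical stand-in for Taxon; only the id matters here.
struct IdNode {
    int id;
};

// qsort/bsearch comparator. The array elements are pointers to nodes,
// so each argument is a pointer to such a pointer.
int compare_by_id(const void* p1, const void* p2) {
    const IdNode* a = *static_cast<const IdNode* const*>(p1);
    const IdNode* b = *static_cast<const IdNode* const*>(p2);
    return (a->id > b->id) - (a->id < b->id);
}

// Binary search in an array of node pointers already sorted by id.
// Returns nullptr when no node has the requested id.
const IdNode* find_by_id(IdNode** sorted, std::size_t n, int id) {
    assert(sorted != nullptr || n == 0);
    IdNode key{id};
    const IdNode* key_ptr = &key;
    void* hit = std::bsearch(&key_ptr, sorted, n, sizeof(IdNode*), compare_by_id);
    return hit == nullptr ? nullptr : *static_cast<IdNode**>(hit);
}
```

The array has to be sorted once with std::qsort and the same comparator after loading; every lookup is then logarithmic in the number of nodes.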

The static functions 'getattr', 'readdir', 'open' and 'read' are the callbacks that FUSE invokes to explore the new filesystem. They must be registered in the 'main' function, through the 'struct fuse_operations' that is handed to FUSE.
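
As an illustration of the 'getattr' part, here is a minimal sketch of the bookkeeping such a callback performs once the node for a path has been looked up. It only uses the POSIX 'struct stat', so it compiles without the FUSE headers. 'fill_node_stat' is a hypothetical helper, and I am assuming that nodes with children are exposed as directories and leaves as read-only files. The 0755 mode matches the directory listings shown later in this post; the 0444 mode and the fixed link counts are my own simplifications.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <cstring>
#include <sys/stat.h>
#include <sys/types.h>

// Hypothetical helper: fill 'st' for a node that has already been looked up.
// 'found' tells whether the path matched a node at all.
// Returns 0 on success or a negative errno value, as FUSE callbacks do.
int fill_node_stat(bool found, int child_count, std::size_t content_size, struct stat* st) {
    assert(st != nullptr);
    // Zeroed timestamps are what 'ls' displays as 1970-01-01.
    std::memset(st, 0, sizeof(struct stat));
    if (!found) return -ENOENT;
    if (child_count > 0) {
        // Assumption: a node with children is shown as a directory.
        st->st_mode = S_IFDIR | 0755;
        st->st_nlink = 2;  // simplified link count
    } else {
        // Assumption: a leaf is shown as a read-only regular file.
        st->st_mode = S_IFREG | 0444;
        st->st_nlink = 1;
        st->st_size = static_cast<off_t>(content_size);
    }
    return 0;
}
```

A real callback would first resolve 'path' to a node (for example with findByPath) and pass the node's child count and the length of its XML content.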

Test


Download the NCBI taxonomy and extract the two files we need:
$ wget -O taxdump.tar.gz "ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz"

$ tar xvfz taxdump.tar.gz names.dmp
$ tar xvfz taxdump.tar.gz nodes.dmp
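
Both files use the same row layout: fields are separated by a tab, a pipe and a tab, and each row ends with a tab and a pipe. In nodes.dmp the first two fields are the node id and the parent id. Here is a minimal sketch of splitting such a row; 'split_dmp_row', 'NodeRow' and 'parse_node_row' are hypothetical names, not the ones used in my readNodes and readNames.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

// Split one taxdump row into its fields.
std::vector<std::string> split_dmp_row(std::string row) {
    const std::string terminator = "\t|";
    const std::string separator = "\t|\t";
    // Drop the end-of-line characters, then the row terminator.
    while (!row.empty() && (row.back() == '\n' || row.back() == '\r')) row.pop_back();
    if (row.size() >= terminator.size() &&
        row.compare(row.size() - terminator.size(), terminator.size(), terminator) == 0) {
        row.erase(row.size() - terminator.size());
    }
    std::vector<std::string> fields;
    std::size_t pos = 0;
    while (true) {
        std::size_t next = row.find(separator, pos);
        if (next == std::string::npos) {
            fields.push_back(row.substr(pos));  // last field
            break;
        }
        fields.push_back(row.substr(pos, next - pos));
        pos = next + separator.size();
    }
    return fields;
}

// Hypothetical record holding the two nodes.dmp fields the tree needs.
struct NodeRow {
    int id;
    int parent_id;
};

// Parse the node id and parent id; returns false if the row has fewer than two fields.
bool parse_node_row(const std::string& row, NodeRow* out) {
    assert(out != nullptr);
    std::vector<std::string> fields = split_dmp_row(row);
    if (fields.size() < 2) return false;
    out->id = std::atoi(fields[0].c_str());
    out->parent_id = std::atoi(fields[1].c_str());
    return true;
}
```

The same splitter works for names.dmp, where the second field is the name and the fourth is the name class (such as "scientific name").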

Create a temporary folder for our new filesystem:
$ mkdir -p tmp_fuse

Load the NCBI taxonomy and mount the new filesystem:
$ ./fusetaxonomy nodes.dmp names.dmp tmp_fuse

What is in this new filesystem?
$ ls tmp_fuse

root

And what are the nodes under "root"?
$ ls tmp_fuse/root

cellular_organisms other_sequences unclassified_sequences Viroids Viruses

And what's under 'Eukaryota'?
$ ls -la tmp_fuse/root/cellular_organisms/Eukaryota
total 0
drwxr-xr-x 22 root root 0 1970-01-01 01:00 .
drwxr-xr-x 3 root root 0 1970-01-01 01:00 ..
drwxr-xr-x 9 root root 0 1970-01-01 01:00 Alveolata
drwxr-xr-x 11 root root 0 1970-01-01 01:00 Amoebozoa
drwxr-xr-x 4 root root 0 1970-01-01 01:00 Apusozoa
drwxr-xr-x 5 root root 0 1970-01-01 01:00 Centroheliozoa
drwxr-xr-x 4 root root 0 1970-01-01 01:00 Cryptophyta
drwxr-xr-x 38 root root 0 1970-01-01 01:00 environmental_samples
drwxr-xr-x 5 root root 0 1970-01-01 01:00 Euglenozoa
drwxr-xr-x 7 root root 0 1970-01-01 01:00 Fornicata
drwxr-xr-x 5 root root 0 1970-01-01 01:00 Glaucocystophyceae
drwxr-xr-x 11 root root 0 1970-01-01 01:00 Haptophyceae
drwxr-xr-x 4 root root 0 1970-01-01 01:00 Heterolobosea
drwxr-xr-x 4 root root 0 1970-01-01 01:00 Jakobida
drwxr-xr-x 3 root root 0 1970-01-01 01:00 Katablepharidophyta
drwxr-xr-x 1 root root 0 1970-01-01 01:00 Malawimonadidae
drwxr-xr-x 6 root root 0 1970-01-01 01:00 Opisthokonta
drwxr-xr-x 8 root root 0 1970-01-01 01:00 Oxymonadida
drwxr-xr-x 9 root root 0 1970-01-01 01:00 Parabasalia
drwxr-xr-x 9 root root 0 1970-01-01 01:00 Rhizaria
drwxr-xr-x 7 root root 0 1970-01-01 01:00 Rhodophyta
drwxr-xr-x 25 root root 0 1970-01-01 01:00 stramenopiles
drwxr-xr-x 36 root root 0 1970-01-01 01:00 unclassified_eukaryotes
drwxr-xr-x 3 root root 0 1970-01-01 01:00 Viridiplantae

Let's use find!
$ find tmp_fuse/root/cellular_organisms/Eukaryota/Opisthokonta/Metazoa/ | head

tmp_fuse/root/cellular_organisms/Eukaryota/Opisthokonta/Metazoa/
tmp_fuse/root/cellular_organisms/Eukaryota/Opisthokonta/Metazoa/Porifera
tmp_fuse/root/cellular_organisms/Eukaryota/Opisthokonta/Metazoa/Porifera/Demospongiae
tmp_fuse/root/cellular_organisms/Eukaryota/Opisthokonta/Metazoa/Porifera/Demospongiae/Tetractinomorpha
tmp_fuse/root/cellular_organisms/Eukaryota/Opisthokonta/Metazoa/Porifera/Demospongiae/Tetractinomorpha/Astrophorida
tmp_fuse/root/cellular_organisms/Eukaryota/Opisthokonta/Metazoa/Porifera/Demospongiae/Tetractinomorpha/Astrophorida/Geodiidae
tmp_fuse/root/cellular_organisms/Eukaryota/Opisthokonta/Metazoa/Porifera/Demospongiae/Tetractinomorpha/Astrophorida/Geodiidae/Geodia
tmp_fuse/root/cellular_organisms/Eukaryota/Opisthokonta/Metazoa/Porifera/Demospongiae/Tetractinomorpha/Astrophorida/Geodiidae/Geodia/Geodia_cydonium
tmp_fuse/root/cellular_organisms/Eukaryota/Opisthokonta/Metazoa/Porifera/Demospongiae/Tetractinomorpha/Astrophorida/Geodiidae/Geodia/Geodia_neptuni
tmp_fuse/root/cellular_organisms/Eukaryota/Opisthokonta/Metazoa/Porifera/Demospongiae/Tetractinomorpha/Astrophorida/Geodiidae/Geodia/Geodia_papyracea

What is the content of Homo_sapiens_neanderthalensis?
$ cat tmp_fuse/root/cellular_organisms/Eukaryota/Opisthokonta/Metazoa/Eumetazoa/Bilateria/Coelomata/Deuterostomia/Chordata/Craniata/Vertebrata/Gnathostomata/Teleostomi/Euteleostomi/Sarcopterygii/Tetrapoda/Amniota/Mammalia/Theria/Eutheria/Euarchontoglires/Primates/Haplorrhini/Simiiformes/Catarrhini/Hominoidea/Hominidae/Homininae/Homo/Homo_sapiens/Homo_sapiens_neanderthalensis

<?xml version="1.0"?>
<Taxon-Id>63221</Taxon-Id>
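
A 'cat' like this one ends up in the 'read' callback, which must copy the requested slice of the file content into the buffer supplied by FUSE. Here is a minimal sketch of that logic. 'taxon_xml' and 'read_slice' are hypothetical helpers, and the XML layout is simply modelled on the output above rather than taken from the real Taxon::xml().

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <sys/types.h>

// Hypothetical: build the file content for a taxon id.
std::string taxon_xml(int id) {
    return "<?xml version=\"1.0\"?>\n<Taxon-Id>" + std::to_string(id) + "</Taxon-Id>\n";
}

// Copy at most 'size' bytes of 'content', starting at 'offset', into 'buf'.
// Returns the number of bytes copied; 0 signals end of file.
int read_slice(const std::string& content, char* buf, std::size_t size, off_t offset) {
    assert(buf != nullptr);
    if (offset < 0 || static_cast<std::size_t>(offset) >= content.size()) return 0;
    const std::size_t start = static_cast<std::size_t>(offset);
    std::size_t n = content.size() - start;
    if (n > size) n = size;  // never write past the caller's buffer
    std::memcpy(buf, content.data() + start, n);
    return static_cast<int>(n);
}
```

Clamping against both the remaining content and the buffer size matters because the kernel may ask for more bytes than the file holds, or read it in several chunks.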

Unmount the NCBI filesystem:
$ sudo ${FUSEDIR}/bin/fusermount -u tmp_fuse

The Full source code.

The full source code is available as a 'gist' on github.com.
The code is also on github here

That's it !

Pierre

2 comments:

Egon Willighagen said...

Cool!

Mike Taylor said...

This is almost unbearably awesome.

Does it work for writing as well as reading?

If so, then we're well on the way to a dream of maintaining the taxonomy in a git module, allowing specialist taxonomists to clone, work on their specialist area, and push (if trusted) or issue a pull request (otherwise).