YOKOFAKUN: August 2011

13 August 2011

A FUSE-based filesystem reproducing the NCBI Taxonomy hierarchy.

In this post I will show how I've used FUSE to create a new loadable filesystem for Linux, reproducing the tree of the NCBI Taxonomy.

From the Fuse Homepage:With FUSE it is possible to implement a fully functional filesystem in a userspace program. Features include:

Simple library API
Simple installation (no need to patch or recompile the kernel)
Secure implementation
Userspace - kernel interface is very efficient
Usable by non privileged users
Runs on Linux kernels 2.4.X and 2.6.X
Has proven very stable over time

Downloading and installing Fuse

Download Fuse from sourceforge:http://sourceforge.net/projects/fuse/files/fuse-2.X/.

$ tar xvfz fuse-2.8.5.tar.gz

$ cd fuse-2.8.5/

$ ./configure

$ make

$ make install

The classes

Taxon

A Taxon is a node in the NCBI Taxonomy:


class Taxon

    {

    public:

	 /* node id */

	int id;

	/* parent id */

	int parent_id;

	/* number of children */

	int count_children;

	/* name for this node */

	char* name;

	/* node children */

	Taxon** children;

	Taxon();

	~Taxon();

	/* compare by id */

	static int comparator(const void* p1,const void* p2);

	/* compare by name*/

	static int compareByName(const void* p1,const void* p2);

	const Taxon* getChildrenAt(int i) const;

	/* find children by its name*/

	const Taxon* findChildByName(const char* name) const;

	/* recursive. find child from a unix path */

	const Taxon* findChildByPath(const char* path) const;

	/* stupid representation of a node as  a XML file */

	string xml() const;

    };

FuseTaxonomy

The NCBI Taxonomy:

class FuseTaxonomy

{

private:

	/* number of nodes */

	size_t nTaxons;

	/** taxons ordered by ids */

	Taxon** taxons;

	/** taxons ordered by names */

	Taxon** names;

public:

	/* global instance */

	static FuseTaxonomy* INSTANCE;

	FuseTaxonomy();

	~FuseTaxonomy();

	/* find taxon by ID */

	const Taxon* findTaxonById(int id)const;

	/* find taxon by name */

	const Taxon* findTaxonByName(const char* name) const;

	/* root of taxonomy */

	const Taxon* getRoot() const;

	/* read file nodes.dmp */

	void readNodes(const char* nodes);

	/* read file names.dmp */

	void readNames(const char* namesfile);

	/* find taxon node from unix path */

	const Taxon* findByPath(const char* path) const;

	/** FUSE CALLBACK:  This function returns metadata concerning a file specified by path in a special stat structure. */

	static int getattr(const char *path, struct stat *stbuf);

	/* FUSE CALLBACK:  used to read directory contents */

	static int readdir(const char *path, void *buf, fuse_fill_dir_t filler,  off_t offset, struct fuse_file_info *fi);

	/* FUSE CALLBACK: checks whatever user is permitted to open the /hello file with flags given in the fuse_file_info structure.  */

	static int open(const char *path, struct fuse_file_info *fi);

	/* FUSE CALLBACK:  used to feed the user with data from the file. */

	static int read(const char *path, char *buf, size_t size, off_t offset, struct fuse_file_info *fi);

};

The static functions 'getattr', 'readdir', 'open' and 'read' are the callbacks called by the FUSE API to explore the new filesystem and must be initialized in the 'main' method.

Test

Load the NCBI taxonomy

$ wget -O taxdump.tar.gz "ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz"

$ tar xvfz taxdump.tar.gz  names.dmp

$ tar xvfz taxdump.tar.gz  nodes.dmp

Create a temporary folder for our new filesystem:

$ mkdir -p tmp_fuse

load the NCBI taxonomy and install the new filesystem

$ ./fusetaxonomy nodes.dmp names.dmp  tmp_fuse

What's is in this new filesystem ?

$ls tmp_fuse

root

And what are the nodes under "root" ?

$ls tmp_fuse

cellular_organisms  other_sequences  unclassified_sequences  Viroids  Viruses

And what's under 'Eukaryota' ?

ls -la tmp_fuse/root/cellular_organisms/Eukaryota
total 0
drwxr-xr-x 22 root root 0 1970-01-01 01:00 .
drwxr-xr-x 3 root root 0 1970-01-01 01:00 ..
drwxr-xr-x 9 root root 0 1970-01-01 01:00 Alveolata
drwxr-xr-x 11 root root 0 1970-01-01 01:00 Amoebozoa
drwxr-xr-x 4 root root 0 1970-01-01 01:00 Apusozoa
drwxr-xr-x 5 root root 0 1970-01-01 01:00 Centroheliozoa
drwxr-xr-x 4 root root 0 1970-01-01 01:00 Cryptophyta
drwxr-xr-x 38 root root 0 1970-01-01 01:00 environmental_samples
drwxr-xr-x 5 root root 0 1970-01-01 01:00 Euglenozoa
drwxr-xr-x 7 root root 0 1970-01-01 01:00 Fornicata
drwxr-xr-x 5 root root 0 1970-01-01 01:00 Glaucocystophyceae
drwxr-xr-x 11 root root 0 1970-01-01 01:00 Haptophyceae
drwxr-xr-x 4 root root 0 1970-01-01 01:00 Heterolobosea
drwxr-xr-x 4 root root 0 1970-01-01 01:00 Jakobida
drwxr-xr-x 3 root root 0 1970-01-01 01:00 Katablepharidophyta
drwxr-xr-x 1 root root 0 1970-01-01 01:00 Malawimonadidae
drwxr-xr-x 6 root root 0 1970-01-01 01:00 Opisthokonta
drwxr-xr-x 8 root root 0 1970-01-01 01:00 Oxymonadida
drwxr-xr-x 9 root root 0 1970-01-01 01:00 Parabasalia
drwxr-xr-x 9 root root 0 1970-01-01 01:00 Rhizaria
drwxr-xr-x 7 root root 0 1970-01-01 01:00 Rhodophyta
drwxr-xr-x 25 root root 0 1970-01-01 01:00 stramenopiles
drwxr-xr-x 36 root root 0 1970-01-01 01:00 unclassified_eukaryotes
drwxr-xr-x 3 root root 0 1970-01-01 01:00 Viridiplantae

Let's use find !

find tmp_fuse/root/cellular_organisms/Eukaryota/Opisthokonta/Metazoa/ | head

tmp_fuse/root/cellular_organisms/Eukaryota/Opisthokonta/Metazoa/

tmp_fuse/root/cellular_organisms/Eukaryota/Opisthokonta/Metazoa/Porifera

tmp_fuse/root/cellular_organisms/Eukaryota/Opisthokonta/Metazoa/Porifera/Demospongiae

tmp_fuse/root/cellular_organisms/Eukaryota/Opisthokonta/Metazoa/Porifera/Demospongiae/Tetractinomorpha

tmp_fuse/root/cellular_organisms/Eukaryota/Opisthokonta/Metazoa/Porifera/Demospongiae/Tetractinomorpha/Astrophorida

tmp_fuse/root/cellular_organisms/Eukaryota/Opisthokonta/Metazoa/Porifera/Demospongiae/Tetractinomorpha/Astrophorida/Geodiidae

tmp_fuse/root/cellular_organisms/Eukaryota/Opisthokonta/Metazoa/Porifera/Demospongiae/Tetractinomorpha/Astrophorida/Geodiidae/Geodia

tmp_fuse/root/cellular_organisms/Eukaryota/Opisthokonta/Metazoa/Porifera/Demospongiae/Tetractinomorpha/Astrophorida/Geodiidae/Geodia/Geodia_cydonium

tmp_fuse/root/cellular_organisms/Eukaryota/Opisthokonta/Metazoa/Porifera/Demospongiae/Tetractinomorpha/Astrophorida/Geodiidae/Geodia/Geodia_neptuni

tmp_fuse/root/cellular_organisms/Eukaryota/Opisthokonta/Metazoa/Porifera/Demospongiae/Tetractinomorpha/Astrophorida/Geodiidae/Geodia/Geodia_papyracea

What is the content of Homo_sapiens_neanderthalensis ?:

$ cat tmp_fuse/root/cellular_organisms/Eukaryota/Opisthokonta/Metazoa/Eumetazoa/Bilateria/Coelomata/Deuterostomia/Chordata/Craniata/Vertebrata/Gnathostomata/Teleostomi/Euteleostomi/Sarcopterygii/Tetrapoda/Amniota/Mammalia/Theria/Eutheria/Euarchontoglires/Primates/Haplorrhini/Simiiformes/Catarrhini/Hominoidea/Hominidae/Homininae/Homo/Homo_sapiens/Homo_sapiens_neanderthalensis

<?xml version="1.0"?>

<Taxon-Id>63221</Taxon-Id>

unmount the NCBI filesystem

sudo ${FUSEDIR}/bin/fusermount -u  tmp_fuse

The Full source code.

The full source code is available as a 'gist' on github.com.

	FUSEDIR=/home/lindenb/tmp/FUSE
	CFLAGS= -D_FILE_OFFSET_BITS=64 -I $(FUSEDIR)/include -Wall
	LDFLAGS=-L $(FUSEDIR)/lib
	all:test
	test:fusetaxonomy nodes.dmp names.dmp
	rm -f /tmp/jeter.txt
	mkdir -p tmp_fuse
	./fusetaxonomy nodes.dmp names.dmp tmp_fuse
	-ls tmp_fuse
	-ls tmp_fuse/root/
	-ls tmp_fuse/root/cellular_organisms
	-ls -la tmp_fuse/root/cellular_organisms/Eukaryota
	find tmp_fuse/root/cellular_organisms/Eukaryota/Opisthokonta/Metazoa/ \| head
	cat tmp_fuse/root/cellular_organisms/Eukaryota/Opisthokonta/Metazoa/Eumetazoa/Bilateria/Coelomata/Deuterostomia/Chordata/Craniata/Vertebrata/Gnathostomata/Teleostomi/Euteleostomi/Sarcopterygii/Tetrapoda/Amniota/Mammalia/Theria/Eutheria/Euarchontoglires/Primates/Haplorrhini/Simiiformes/Catarrhini/Hominoidea/Hominidae/Homininae/Homo/Homo_sapiens/Homo_sapiens_neanderthalensis
	sudo $(FUSEDIR)/bin/fusermount -u tmp_fuse
	fusetaxonomy:ncbi.cpp
	g++ $(CFLAGS) -o $@ ncbi.cpp $(LDFLAGS) -lfuse

	nodes.dmp names.dmp:
	wget -O taxdump.tar.gz "ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz"
	tar xvfz taxdump.tar.gz names.dmp
	tar xvfz taxdump.tar.gz nodes.dmp
	rm taxdump.tar.gz
	clean:
	rm -f taxdump.tar.gz names.dmp nodes.dmp a.out fusetaxonomy

view raw Makefile hosted with ❤ by GitHub

	/*
	* Author:
	* Pierre Lindenbaum PhD
	* WWW
	* http://plindenbaum.blogspot.com
	* Date
	* 2011-08-15
	* Motivation:
	* use the FUSE API to browse the NCBI taxonomy as a filesystem
	* Compilation:
	* g++ -D_FILE_OFFSET_BITS=64 -I $(FUSE)/include -Wall -o fusetaxonomy ncbi.cpp -L $(FUSE)/lib -lfuse
	* Reference:
	* http://sourceforge.net/apps/mediawiki/fuse/index.php?title=Hello_World
	*/
	/* required, see top of fuse/fuse.h */
	#define FUSE_USE_VERSION 26
	#include <fuse/fuse.h>
	#include <cstdio>
	#include <cstdlib>
	#include <cstring>
	#include <cerrno>
	#include <iostream>
	#include <stdexcept>
	#include <string>
	#include <cassert>
	#include <sstream>
	#include <fcntl.h>

	using namespace std;

	FILE safeFOpen (const char filename,const char *modes)
	{
	FILE* f=std::fopen(filename,modes);
	if(f==NULL)
	{
	cerr << "[I/O] Cannot open " << filename << " " << strerror(errno)<< ".\n";
	std::exit(EXIT_FAILURE);
	}
	return f;
	}


	void safeRealloc (void ptr, size_t size)
	{
	ptr=std::realloc(ptr,size);
	if(ptr==NULL)
	{
	cerr << "[Out of memory] Cannot re-allocate " << size << " bytes.\n";
	std::exit(EXIT_FAILURE);
	}
	return ptr;
	}

	void *safeMalloc (std::size_t size)
	{
	void* p=std::malloc(size);
	if(p==NULL)
	{
	cerr << "[Out of memory] Cannot allocate " << size << " bytes.\n";
	std::exit(EXIT_FAILURE);
	}
	return p;
	}
	char* safeStrndup(const char* s,size_t len)
	{
	char p=(char)safeMalloc((len+1)*sizeof(char));
	memcpy((void)p,s,lensizeof(char));
	p[len]=0;
	return p;
	}

	char* safeStrdup(const char* s)
	{
	return safeStrndup(s,strlen(s));
	}


	/** Line reader Utility */
	class LineReader
	{
	private:
	FILE* in;
	char* buffer;
	std::size_t buffer_capacity;
	public:
	LineReader(FILE* in):in(in),buffer(NULL),buffer_capacity(BUFSIZ)
	{
	buffer=(char*)safeMalloc(buffer_capacity);
	}
	virtual ~LineReader()
	{
	free(buffer);
	}
	char* readLine(std::size_t* userlen)
	{
	if(feof(in))
	{
	if(userlen!=NULL) *userlen=0UL;
	return NULL;
	}
	int c;
	size_t len=0;
	while((c=fgetc(in))!=EOF && c!='\n')
	{
	if(len+2>=buffer_capacity)
	{
	buffer_capacity+=BUFSIZ;
	buffer=(char)safeRealloc(buffer,buffer_capacitysizeof(char));
	}
	buffer[len++]=c;
	}
	buffer[len]='\0';
	if(userlen!=NULL) *userlen=len;
	return buffer;
	}

	};

	/**
	* A node in the NCBI taxonomy
	*/
	class Taxon
	{
	public:
	/* node id */
	int id;
	/* parent id */
	int parent_id;
	/* number of children */
	int count_children;
	/* name for this node */
	char* name;
	/* children */
	Taxon** children;
	Taxon():id(-1),parent_id(-1),count_children(0),name(NULL),children(NULL)
	{

	}
	~Taxon()
	{
	free(name);
	free(children);
	}
	/* compare by id */
	static int comparator(const void* p1,const void* p2)
	{
	Taxon tx1=(Taxon)p1;
	Taxon tx2=(Taxon)p2;

	return (tx1)->id - (tx2)->id;
	}
	/* compare by name*/
	static int compareByName(const void* p1,const void* p2)
	{
	Taxon tx1=(Taxon)p1;
	Taxon tx2=(Taxon)p2;

	return strcmp((tx1)->name, (tx2)->name);
	}
	/* get i-th children */
	const Taxon* getChildrenAt(int i) const
	{
	assert(i>=0 && i< count_children);
	return children[i];
	}
	/* find children by its name*/
	const Taxon* findChildByName(const char* name) const
	{
	for(int i=0;i< count_children;++i)
	{
	const Taxon* c = getChildrenAt(i);
	if(strcmp(c->name,name)==0) return c;
	}
	return NULL;
	}

	/* find child from a unix path */
	const Taxon* findChildByPath(const char* path) const
	{
	char* p=(char*)strchr(path,'/');
	if(p==NULL)
	{
	const Taxon* child=findChildByName(path);
	return child;
	}
	char* s=safeStrndup(path,p-path);
	const Taxon* child=findChildByName(s);
	free(s);
	if(child==NULL) return NULL;
	return child->findChildByPath(&p[1]);
	}

	/* stupid representation of a node as a XML file */
	string xml() const
	{
	ostringstream os;
	os << "<?xml version=\"1.0\"?>\n<Taxon-Id>"<< id << "</Taxon-Id>\n";
	return os.str();
	}
	};

	/** the NCBI taxonomy */
	class FuseTaxonomy
	{
	private:
	/* number of nodes */
	size_t nTaxons;
	/** taxons ordered by ids */
	Taxon** taxons;
	/** taxons ordered by names */
	Taxon** names;

	public:
	/* global instance */
	static FuseTaxonomy* INSTANCE;
	FuseTaxonomy():nTaxons(0),taxons(NULL),names(NULL)
	{

	}
	~FuseTaxonomy()
	{
	for(size_t i=0;i< nTaxons;++i)
	{
	delete taxons[i];
	}
	free(taxons);
	free(names);
	}
	/* find taxon by ID */
	const Taxon* findTaxonById(int id)const
	{
	Taxon key;
	key.id=id;
	Taxon* keyPtr=&key;
	Taxon v=(Taxon)bsearch(
	&keyPtr,taxons,
	nTaxons,
	sizeof(Taxon*),
	Taxon::comparator
	);
	return v==NULL?NULL:*v;
	}
	/* find taxon by name */
	const Taxon* findTaxonByName(const char* name) const
	{
	Taxon key;
	key.name=(char*)name;//simple reference, no copy
	Taxon* keyPtr=&key;
	Taxon v=(Taxon)bsearch(
	&keyPtr,names,
	nTaxons,
	sizeof(Taxon*),
	Taxon::compareByName
	);
	key.name=NULL;//prevent destructor to crash
	return v==NULL?NULL:*v;
	}
	/* root of taxonomy */
	const Taxon* getRoot() const
	{
	const Taxon* root=findTaxonById(1);
	assert(root!=NULL);
	return root;
	}

	/* read file nodes.dmp */
	void readNodes(const char* nodes)
	{
	FILE* in=safeFOpen(nodes,"r");
	LineReader reader(in);
	char* line=NULL;
	while((line=reader.readLine(NULL))!=NULL)
	{
	char* pipe1=strchr(line,'\|');
	if(pipe1==NULL) continue;
	*pipe1=0;
	++pipe1;
	char* pipe2= strchr(pipe1,'\|');
	if(pipe2==NULL) continue;
	*pipe2=0;
	Taxon* taxon=new Taxon;
	taxon->id= atoi(line);

	taxon->parent_id=atoi(pipe1);
	taxons=(Taxon*)safeRealloc(taxons,sizeof(Taxon)*(nTaxons+1));
	taxons[nTaxons]=taxon;
	nTaxons++;
	}
	fclose(in);
	qsort ((void)taxons,nTaxons,sizeof(Taxon), Taxon::comparator);
	for(size_t i=0;i< nTaxons;++i)
	{
	Taxon* taxon=taxons[i];


	Taxon* parent=(Taxon*)findTaxonById(taxon->parent_id);
	if(parent==NULL \|\| parent->id==taxon->id)
	{
	continue;
	}
	parent->children=(Taxon*)safeRealloc(parent->children,sizeof(Taxon)*(parent->count_children+1));
	parent->children[parent->count_children]=taxon;
	parent->count_children++;
	}
	}

	/* read file names.dmp */
	void readNames(const char* namesfile)
	{
	FILE* in=safeFOpen(namesfile,"r");
	LineReader reader(in);
	size_t len;
	char* line=NULL;
	while((line=reader.readLine(&len))!=NULL)
	{
	char* pipe1=strchr(line,'\|');
	if(pipe1==NULL) continue;
	*pipe1=0;
	++pipe1;
	while(isspace(*pipe1)) ++pipe1;

	char* pipe2= strchr(pipe1,'\|');
	if(pipe2==NULL) continue;
	*pipe2=0;
	char* p3=pipe2-1;
	while(pipe1< p3 && isspace(*p3)) --p3;
	*(p3+1)=0;


	int id=atoi(line);
	Taxon* taxon=(Taxon*)findTaxonById(id);
	if(taxon==NULL) continue;
	if(taxon->name!=NULL)
	{
	if(strstr(pipe2+1,"scientific name")==NULL) continue;
	free(taxon->name);
	}

	taxon->name= safeStrdup(pipe1);
	char* p4=taxon->name;
	while(*p4!=0)
	{
	if(!isalnum(*p4))
	{
	*p4='_';
	}
	++p4;
	}
	}
	fclose(in);
	names=(Taxon*)safeMalloc(sizeof(Taxon)*(nTaxons));
	memcpy((void)names,taxons,sizeof(Taxon)*(nTaxons));
	qsort ((void)names,nTaxons,sizeof(Taxon), Taxon::compareByName);
	}



	/* find taxon node from unix path */
	const Taxon* findByPath(const char* path) const
	{

	if(path==NULL )
	{
	return NULL;
	}
	if(path[0]!='/')
	{
	return NULL;
	}
	char* path1=(char*)&path[1];
	if(strcmp(path1,getRoot()->name)==0)
	{
	return getRoot();
	}
	char* slash=(char*)strchr(path1,'/');
	if(slash==NULL)
	{
	return NULL;
	}
	if(strncmp(getRoot()->name,path1,strlen(getRoot()->name))!=0)
	{
	return NULL;
	}

	return getRoot()->findChildByPath(&slash[1]);
	}
	/** FUSE CALLBACK: This function returns metadata concerning a file specified by path in a special stat structure. */
	static int getattr(const char path, struct stat stbuf)
	{
	int res = 0;
	FuseTaxonomy* taxonomy=FuseTaxonomy::INSTANCE;
	//Reset memory for the stat structure
	memset(stbuf, 0, sizeof(struct stat));

	if(path[0]!='/')
	{
	return ENOENT;
	}
	if(path[1]=='.')
	{
	return ENOENT;
	}
	const Taxon* taxon= taxonomy->findByPath(path);

	if(strcmp("/",path)==0)
	{
	stbuf->st_mode = S_IFDIR \| 0755;
	stbuf->st_nlink = 1 ;
	}
	else if(taxon!=NULL && taxon->count_children!=0)
	{
	stbuf->st_mode = S_IFDIR \| 0755;
	stbuf->st_nlink = taxon->count_children;
	}
	else if(taxon!=NULL )
	{
	stbuf->st_mode = S_IFREG \| 0444;
	stbuf->st_nlink = 1;
	stbuf->st_size = taxon->xml().size();
	}
	else
	{
	res = -ENOENT;
	}

	return res;
	}
	/* FUSE CALLBACK: used to read directory contents */
	static int readdir(const char path, void buf, fuse_fill_dir_t filler,
	off_t offset, struct fuse_file_info *fi)
	{
	(void) offset;
	(void) fi;
	FuseTaxonomy* taxonomy=FuseTaxonomy::INSTANCE;

	if(strcmp(path,"/")==0)
	{

	filler(buf, ".", NULL, 0);
	filler(buf, "..", NULL, 0);
	filler(buf, taxonomy->getRoot()->name, NULL, 0);
	}
	else
	{
	const Taxon* taxon= taxonomy->findByPath(path);
	if(taxon==NULL)
	{
	return -ENOENT;
	}

	if(taxon->count_children>0)
	{
	filler(buf, ".", NULL, 0);
	filler(buf, "..", NULL, 0);
	}

	for(int i=0;i< taxon->count_children;++i)
	{
	const Taxon* c=taxon->getChildrenAt(i);
	filler(buf, c->name, NULL, 0);
	}
	}
	return 0;
	}

	/* FUSE CALLBACK: checks whatever user is permitted to open the /hello file with flags given in the fuse_file_info structure. */
	static int open(const char path, struct fuse_file_info fi)
	{

	if(strcmp(path,"/")==0) return -ENOENT;
	FuseTaxonomy* taxonomy=FuseTaxonomy::INSTANCE;
	const Taxon* taxon= taxonomy->findByPath(path);
	if(taxon==NULL) return -ENOENT;
	//if the user wants to open the file for anything else than reading only, we tell him that he does not have sufficient permissions.
	if((fi->flags & 3) != O_RDONLY)
	return -EACCES;

	return 0;
	}
	/* FUSE CALLBACK: used to feed the user with data from the file. */
	static int read(const char path, char buf, size_t size, off_t offset,
	struct fuse_file_info *fi)
	{
	FuseTaxonomy* taxonomy=FuseTaxonomy::INSTANCE;
	const Taxon* taxon= taxonomy->findByPath(path);
	if(taxon==NULL) return -ENOENT;
	string xml = taxon -> xml();

	size_t len = 0;
	for(size_t i=offset;i< xml.size() && i< size;++i)
	{
	buf[i]=xml.at(i);
	++len;
	}
	return len;
	}
	};

	FuseTaxonomy* FuseTaxonomy::INSTANCE=NULL;

	int main(int argc, char *argv[])
	{
	if(argc!=4)
	{
	fprintf(stderr,"%s: Pierre Lindenbaum PhD. 2011. Compilation:%s \n",argv[0],__DATE__);
	fprintf(stderr,"Usage:%s nodes.dmp names.dmp fuse.directory\n",argv[0]);
	return EXIT_FAILURE;
	}
	void* userData=NULL;
	FuseTaxonomy::INSTANCE=new FuseTaxonomy;
	FuseTaxonomy::INSTANCE->readNodes(argv[1]);
	FuseTaxonomy::INSTANCE->readNames(argv[2]);
	assert(FuseTaxonomy::INSTANCE->getRoot()!=NULL);

	struct fuse_operations operations;
	memset((void*)&operations,0,sizeof( struct fuse_operations));
	operations.getattr = FuseTaxonomy::getattr;
	operations.readdir = FuseTaxonomy::readdir;
	operations.open = FuseTaxonomy::open;
	operations.read = FuseTaxonomy::read;

	return fuse_main(argc-2, &argv[2], &operations,userData);
	}

view raw ncbi.cpp hosted with ❤ by GitHub

The code is also on github here

That's it !

Pierre

10 August 2011

Wikipedia OpenSearch and semantic bash completion

I wrote a tool that use the opensearch API for mediawiki to display the results of a query in wikipedia:

$ opensearch Charles Darw
Charles_Darwin   Charles Robert Darwin FRS (12 February 1809 – 19 April 1882) was an English...
Charles_Darwin_University   Charles Darwin University (CDU) is an Australian public university wi...
Charles_Darwin%27s_health   For much of his adult life, Charles Darwin's health was repeatedly co...
Charles_Darwin_Foundation   The Charles Darwin Foundation was founded in 1959, under the auspices...
Charles_Darwin_National_Park   Charles Darwin National Park is in the Northern Territory of Austr...
Charles_Darwin_Research_Station   The Charles Darwin Research Station (CDRS) is a biological rese...
Charles_Darwin%27s_education   Charles Darwin's education gave him a foundation in the doctrine o...
Charles_Darwin_%281758%E2%80%931778%29   Charles Darwin (3 September 1758–15 May 1778) was the ...
Charles_Darwin%27s_religious_views   Charles Darwin's views on religion have been the subject of ...
Charles_Darwin_and_the_Tree_of_Life   Charles Darwin and the Tree of Life is a 2009 television do...
Charles_Darwin_School   Charles Darwin School is the only secondary school in the Biggin Hill area. 
Charles_Darwin_%28aviator%29   Major Charles John Wharton Darwin was a First World War flying ace...
Charles_Darwin_Reserve   Charles Darwin Reserve is a 686 km2 nature reserve in Western Australia.
Charles_Darwin_bibliography   This is a partial list of the writings of Charles Darwin, including...
Darwin   Darwin may refer to:

This tool is available on github at:

https://github.com/lindenb/ccsandbox/blob/master/src/opensearch.cpp

Bash completion

Now I want to use this tool for bash completion. Say, I wrote a tool named 'foo' that expects the title of a wikipedia article as an argument. In ~/.bash_completion.d/, I wrote a file named opensearch:

_opensearch()

    	{

        local cur choice reply

        COMPREPLY=()

        cur="${COMP_WORDS[COMP_CWORD]}"

        choice=$(opensearch -d -c ${cur})

        reply=$(compgen -W "${choice}" -- ${cur})

        COMPREPLY=( $reply )

        return 0

    	} &&



complete -F _opensearch foo

This script is activated when a new terminal is loaded in ${HOME}/.bashrc:

(...)
if [ -f ${HOME}/.bash_completion.d/opensearch ]; then
    . ${HOME}/.bash_completion.d/opensearch
fi

Now, typing 'foo', a keyword and <TAB> activates this new completion:

$ foo Charles_D <TAB> <TAB>

Charles_Dance              Charles_Darwin_University  Charles_Doolittle_Walcott  Charles_Dutoit
Charles_Darwin             Charles_Dickens            Charles_Durning
$ foo Charles_D

That's it,

Pierre

09 August 2011

Quick tip: bash completion for Bioinformatics

The default behavior for the <tab> completion can be extended by creating a new file:


${HOME}/.bash_completion

in your home.
For example, you can write your own Bash Completion for samtools, bwa or your favourite tool by adding the following line in "${HOME}/.bash_completion":


complete -f -X '!*.@(bam|sam|fasta|fa|fa.gz|fastq.gz|fasta.gz)' samtools bwa

Open a new bash and type:


$ samtools index <tab>

myfile.fastq.gz  sequence.fasta

Pierre

01 August 2011

Using the Freebase and the Bioportal Widgets to create a semantic object.

The following HTML code uses:

a Widget from the NCBO Bioportal for the Protein-protein interaction Ontology
And two Widgets from Freebase, linking to the proteins, the genes and the diseases.

After completion, it generates a JSON object describing a semantic object:

{
"subject":"http://www.freebase.com/view/en/nsp3",
"predicate":"http://purl.org/obo/owl/MI#MI_0407",
"value":"http://www.freebase.com/view/en/roxan",
"pmid":15047801
}

Source

You can download the source and test it:

	<html>
	<head>
	<link type="text/css" rel="stylesheet" href="http://freebaselibs.com/static/suggest/1.3/suggest.min.css" />
	<script type="text/javascript" src="http://ajax.googleapis.com/ajax/libs/jquery/1.4.2/jquery.min.js"></script>
	<script type="text/javascript" src="http://freebaselibs.com/static/suggest/1.3/suggest.min.js"></script>
	<script type="text/javascript" src="http://bioportal.bioontology.org/javascripts/widgets/form_complete.js"></script>
	<script type="text/javascript">
	var freebase_topics= {
	type:[ '/biology/protein','/biology/gene','/medicine/disease']
	};

	function update_json_field()
	{
	var o={subject:null,predicate:null,value:null,pmid:null};
	var p=document.getElementById("prot1").value;
	if(p.indexOf("http://")==0) o["subject"]=p;
	p=document.getElementById("prot2").value;
	if(p.indexOf("http://")==0) o["value"]=p;
	p=document.getElementById("predicate").value;
	if(p.indexOf("http://")==0) o["predicate"]=p;
	p=document.getElementById("pmid").value;
	if(p.length!=0 && p.match(/[1-9][0-9]*/)) o["pmid"]= parseInt(p);
	var root=document.getElementById("json");
	root.value=JSON.stringify(o);
	}

	$(function() {
	$("#prot1").suggest(freebase_topics).bind("fb-select",function(e, data)
	{
	var root=document.getElementById("prot1");
	root.value="http://www.freebase.com/view"+data.id;
	update_json_field();
	});
	});
	$(function() {
	$("#prot2").suggest(freebase_topics).bind("fb-select",function(e, data)
	{
	var root=document.getElementById("prot2");
	root.value="http://www.freebase.com/view"+data.id;
	update_json_field();
	});
	});

	</script>
	</head>
	<body>


	<fieldset class="search">
	<legend>Create a new Fact</legend>
	<table>
	<tr>
	<th><label for="prot1">Protein 1:</label></th>
	<td><input type="text" placeholder="Protein 1..." id="prot1" size="50"/></td>
	</tr>
	<tr>
	<th><label for="predicate">Predicate:</label></th>
	<td><input type="text" placeholder="Predicate..." id="predicate" class="bp_form_complete-1040-uri" size="50"/></td>
	</tr>
	<tr>
	<th><label for="prot2">Protein 2:</label></th>
	<td><input type="text" placeholder="Protein 2..." id="prot2" size="50" onchange="update_json_field();"/></td>
	</tr>
	<tr>
	<th><label for="pmid">PMID:</label></th>
	<td><input type="number" min="1" placeholder="PMID..." id="pmid" size="50" onchange="update_json_field();"/></td>
	</tr>
	<tr>
	<th><label for="json">JSON</label></th>
	<td><textarea type="text" cols="50" rows="5" readonly="true" placeholder="{}" id="json" size="50"></textarea></td>
	</tr>
	</table>
	</fieldset>


	</body>
	</html>

view raw fact.html hosted with ❤ by GitHub

That's it,

Pierre

Memory-Mapping the Human Genome with 'mmap': my notebook

In this post, I've explored how to use a memory-mapped file to handle the 3Go of the Human Genome sequence as if it was entirely loaded in memory.
Via wikipedia: "A memory-mapped file is a segment of virtual memory which has been assigned a direct byte-for-byte correlation with some portion of a file or file-like resource. This resource is typically a file that is physically present on-disk... In computing, mmap is a POSIX-compliant Unix system call that maps files or devices into memory. It is a method of memory-mapped file I/O. It naturally implements demand paging, because initially file contents are not entirely read from disk and do not use physical RAM at all. The actual reads from disk are performed in "lazy" manner, after a specific location is accessed."
Using a C++ program, I'm going to memory-map the fasta sequence of the human genome (indexed with samtools faidx) and search the position of some short-reads using a BoyerMoore algorithm:

Required members:


/* maps a chromosome to its samtools faidx index */
map<string,faidx1_t> name2index;
/* used to get the size of the file */
struct stat buf;
/* genome fasta file file descriptor */
int fd;
/* the mmap (memory mapped) pointer */
char *mapptr;

Opening the mmap

string faidx(fastaFile);
string line;
faidx.append(".fai");
/* open *.fai file */
ifstream in(faidx.c_str(),ios::in);
/* read indexes in the .fai file that was created with samtools */
while(getline(in,line,'\n'))
 {
 faidx1_t index;
 //parse the faidx line...
 //...
 name2index.insert(make_pair(chrom,index));
 }
/* close index file */
in.close();

/* get the whole size of the fasta file */
stat(fasta, &buf);
/* open the fasta file */
fd = open(fastaFile, O_RDONLY);
/* open a memory mapped file associated to this fasta file descriptor */
mapptr = (char*)mmap(0, buf.st_size, PROT_READ, MAP_SHARED, fd, 0);

It's a kind of MAGIC: Getting the base at index 'position-th' of chromosome 'chrom'


std::map<std::string,faidx1_t>::iterator r=name2index.find(chrom);
faidx1_t& index=r->second;
char base=at(&index,position);
(...)
/* returns the base at position 'index' for the chromosome indexed by faidx */
char at(const FaidxPtr faidx,int64_t index)
 {
 long pos= faidx->offset +
  index / faidx->line_blen * faidx->line_len +
  index % faidx->line_blen
  ;
 /* here is the magic: no need to fseek/fread/ftell the file */
 return toupper(mapptr[pos]);
 }

Mapping the short reads

I've hacked a simple Boyer-Moore-Horspool algorithm from ttp://en.wikipedia.org/wiki/Boyer-Moore-Horspool_algorithm. Of course, you wouldn't use this algorithm to map your short reads for real :-) .

Disposing the mmap

/* close memory mapped map */
if(mapptr!=NULL) munmap(mapptr,buf.st_size);
 /* dispose fasta file descriptor */
if(fd!=-1) close(fd);

Compile & run


$  g++ -Wall testmmap.cpp -lz

$ ./a.out -g /path/tp/hg19.fa /path/to/my.fastq.gz

ATCATTTTCCTCCTAACAGATTAAAAATCAAGAAATATAAACCAGATGTAGCAG chr11 93065362
(...)

Source code

	/**
	* Author:
	* Pierre Lindenbaum PhD
	* Contact:
	* plindenbaum@yahoo.fr
	* WWW:
	* http://plindenbaum.blogspot.com
	* Reference:
	*
	* Motivation:
	* Mapping a large fasta file in memory using mmap
	* Usage:
	* g++ -Wall testmmap.cpp -lz
	* ./a.out -g /path/to/genome/indexed/with/samtools-faidx/hg.fasta file.fastq
	*/
	#include <cstdio>
	#include <cstdlib>
	#include <cstring>
	#include <climits>
	#include <stdint.h>
	#include <cassert>
	#include <fstream>
	#include <iostream>
	#include <algorithm>
	#include <map>
	#include <cerrno>
	#include <sys/types.h>
	#include <sys/stat.h>
	#include <fcntl.h>
	#include <sys/mman.h>
	#include <zlib.h>

	using namespace std;

	/* it is the structure created by samtools faidx */
	typedef struct faidx1_t
	{
	int32_t line_len, line_blen;
	int64_t len;
	uint64_t offset;
	}faidx1_t,*FaidxPtr;


	/** returns the anti parallele base */
	static const char antiparralele(char c)
	{
	switch(c)
	{
	case 'A': return 'T';
	case 'T': return 'A';
	case 'G': return 'C';
	case 'C': return 'G';
	default:return 'N';
	}
	}
	/**
	* wrapper for a mmap, a fileid and some faidx indexes
	*/
	class IndexedGenome
	{
	private:
	/* maps a chromosome to the samtools faidx index */
	map<string,faidx1_t> name2index;
	/* used to get the size of the file */
	struct stat buf;
	/* genome fasta file file descriptor */
	int fd;
	/* the mmap (memory mapped) pointer */
	char *mapptr;
	/** reads an fill a string */
	bool readline(gzFile in,string& line)
	{
	if(gzeof(in)) return false;
	line.clear();
	int c=-1;
	while((c=gzgetc(in))!=EOF && c!='\n') line+=(char)c;
	return true;
	}

	public:
	/** constructor
	* @param fasta: the path to the genomic fasta file indexed with samtools faidx
	*/
	IndexedGenome(const char* fasta):fd(-1),mapptr(NULL)
	{
	string faidx(fasta);
	string line;
	faidx.append(".fai");
	/* open .fai file /
	ifstream in(faidx.c_str(),ios::in);
	if(!in.is_open())
	{
	cerr << "cannot open " << faidx << endl;
	exit(EXIT_FAILURE);
	}
	/* read indexes in fai file */
	while(getline(in,line,'\n'))
	{
	if(line.empty()) continue;
	const char* p=line.c_str();
	char* tab=(char*)strchr(p,'\t');
	if(tab==NULL) continue;
	string chrom(p,tab-p);
	++tab;
	faidx1_t index;
	if(sscanf(tab,"%ld\t%ld\t%d\t%d",
	&index.len, &index.offset, &index.line_blen,&index.line_len
	)!=4)
	{
	cerr << "Cannot read index in "<< line << endl;
	exit(EXIT_FAILURE);
	}
	/* insert in the map(chrom,faidx) */
	name2index.insert(make_pair(chrom,index));
	}
	/* close index file */
	in.close();

	/* get the whole size of the fasta file */
	if(stat(fasta, &buf)!=0)
	{
	perror("Cannot stat");
	exit(EXIT_FAILURE);
	}

	/* open the fasta file */
	fd = open(fasta, O_RDONLY);
	if (fd == -1)
	{
	perror("Error opening file for reading");
	exit(EXIT_FAILURE);
	}
	/* open a memory mapped file associated to this fasta file descriptor */
	mapptr = (char*)mmap(0, buf.st_size, PROT_READ, MAP_SHARED, fd, 0);
	if (mapptr == MAP_FAILED)
	{
	close(fd);
	perror("Error mmapping the file");
	exit(EXIT_FAILURE);
	}
	}
	/* destructor */
	~IndexedGenome()
	{
	/* close memory mapped map */
	if(mapptr!=NULL && munmap(mapptr,buf.st_size) == -1)
	{
	perror("Error un-mmapping the file");
	}
	/* dispose fasta file descriptor */
	if(fd!=-1) close(fd);
	}

	/* return the base at position 'index' for the chromosome indexed by faidx */
	char at(const FaidxPtr faidx,int64_t index)
	{
	long pos= faidx->offset +
	index / faidx->line_blen * faidx->line_len +
	index % faidx->line_blen
	;
	/* here is the magic: no need to fseek/fread/ftell the file */
	return toupper(mapptr[pos]);
	}


	private:
	/* search a DNA string in the genome
	* simple Boyer-Moore-Horspool_algorithm hacked from
	* http://en.wikipedia.org/wiki/Boyer-Moore-Horspool_algorithm
	*/
	void boyerMoore(const string& search)
	{
	size_t bad_char_skip[UCHAR_MAX + 1];


	if (search.empty()) return;


	for (size_t scan = 0; scan <= UCHAR_MAX; scan++)
	{
	bad_char_skip[scan] = search.size();
	}

	size_t last = search.size() - 1;


	for (size_t scan = 0; scan < last; scan++)
	{
	bad_char_skip[(size_t)search[scan]] = last - scan;
	}
	/* loop over each chromosome */
	for(std::map<std::string,faidx1_t>::iterator r=name2index.begin();
	r!=name2index.end();
	++r)
	{
	faidx1_t& index=r->second;
	size_t haystack_index=0;

	while (haystack_index+ search.size() <= (size_t)index.len)
	{

	size_t scan=last;
	while( at(&index,haystack_index+scan) == search[scan] &&
	scan>0)
	{
	--scan;
	}
	if(scan==0)
	{
	cout << search << "\t"<< r->first << "\t" << haystack_index << endl;
	haystack_index+=search.size();
	}
	else
	{
	char c=at(&index,haystack_index+last);
	haystack_index += bad_char_skip[c];
	}
	}
	}
	}
	public:
	/** read a fastq file an searches the sequences in the genome */
	void mapReads(gzFile in)
	{
	string name;
	string seq;
	string name2;
	string qual;

	for(;;)
	{
	if(!readline(in,name)) break;
	if(!readline(in,seq)) break;
	if(!readline(in,name2)) break;
	if(!readline(in,qual)) break;

	while(!seq.empty() && seq[0]=='N')
	{
	seq.erase(0,1);
	}
	while(!seq.empty() && seq[seq.size()-1]=='N')
	{
	seq.erase(seq.size()-1,1);
	}

	if(seq.size()<10) continue;

	std::transform(seq.begin(), seq.end(),seq.begin(), ::toupper);
	//search
	boyerMoore(seq);
	//rev-comp and search
	reverse(seq.begin(),seq.end());
	std::transform(seq.begin(), seq.end(),seq.begin(), ::antiparralele);
	boyerMoore(seq);
	}
	}
	};

	int main(int argc,char** argv)
	{
	char* genomePath=NULL;
	int optind=1;
	while(optind < argc)
	{
	if(strcmp(argv[optind],"-h")==0)
	{
	fprintf(stderr,"%s: Pierre Lindenbaum PHD. 2010.\n",argv[0]);
	fprintf(stderr,"Compilation: %s at %s.\n",__DATE__,__TIME__);
	fprintf(stderr," -g <path/to/genome/indexed/with/samtools/faids> FASQ[stdin\|files] \n");
	exit(EXIT_FAILURE);
	}
	else if(strcmp(argv[optind],"-g")==0 && optind+1<argc)
	{
	genomePath=argv[++optind];
	}
	else if(strcmp(argv[optind],"--")==0)
	{
	++optind;
	break;
	}
	else if(argv[optind][0]=='-')
	{
	fprintf(stderr,"unknown option '%s'\n",argv[optind]);
	exit(EXIT_FAILURE);
	}
	else
	{
	break;
	}
	++optind;
	}
	if(genomePath==NULL)
	{
	cerr << "genome path missing"<< endl;
	return EXIT_FAILURE;
	}
	/* open IndexedGenome */
	IndexedGenome* genome=new IndexedGenome(genomePath);
	if(optind==argc)
	{
	/* read fastqs from stdin */
	genome->mapReads(gzdopen(fileno(stdin), "r"));
	}
	else
	{
	/** loop over each fastq file */
	while(optind< argc)
	{
	char* filename=argv[optind++];
	errno=0;
	gzFile in=gzopen(filename,"rb");
	if(in==NULL)
	{
	cerr << argv[0]
	<< ": Cannot open "<< filename
	<< ":" << strerror(errno)
	<< endl;
	exit(EXIT_FAILURE);
	}
	genome->mapReads(in);
	gzclose(in);
	}
	}
	/* dispose genome */
	delete genome;
	cerr << "Done." << endl;
	return 0;
	}

view raw testmmap.cpp hosted with ❤ by GitHub

That's it,

Pierre

YOKOFAKUN

13 August 2011

A FUSE-based filesystem reproducing the NCBI Taxonomy hierarchy.

Downloading and installing Fuse

The classes

Taxon

FuseTaxonomy

Test

The Full source code.

10 August 2011

Wikipedia OpenSearch and semantic bash completion

Bash completion

09 August 2011

Quick tip: bash completion for Bioinformatics

01 August 2011

Using the Freebase and the Bioportal Widgets to create a semantic object.

Source

Memory-Mapping the Human Genome with 'mmap': my notebook

Required members:

Opening the mmap

It's a kind of MAGIC: Getting the base at index 'position-th' of chromosome 'chrom'

Mapping the short reads

Disposing the mmap

Compile & run

Source code

About Me

Feeds

Blog Archive

Web2.0

Labels