13 August 2011

A FUSE-based filesystem reproducing the NCBI Taxonomy hierarchy.

In this post I will show how I've used FUSE to create a new loadable filesystem for Linux, reproducing the tree of the NCBI Taxonomy.

From the Fuse Homepage:With FUSE it is possible to implement a fully functional filesystem in a userspace program. Features include:

  • Simple library API
  • Simple installation (no need to patch or recompile the kernel)
  • Secure implementation
  • Userspace - kernel interface is very efficient
  • Usable by non privileged users
  • Runs on Linux kernels 2.4.X and 2.6.X
  • Has proven very stable over time

Downloading and installing Fuse

Download Fuse from sourceforge:http://sourceforge.net/projects/fuse/files/fuse-2.X/.

$ tar xvfz fuse-2.8.5.tar.gz

$ cd fuse-2.8.5/
$ ./configure
$ make
$ make install

The classes


A Taxon is a node in the NCBI Taxonomy:

class Taxon
/* node id */
int id;
/* parent id */
int parent_id;
/* number of children */
int count_children;
/* name for this node */
char* name;
/* node children */
Taxon** children;
/* compare by id */
static int comparator(const void* p1,const void* p2);
/* compare by name*/
static int compareByName(const void* p1,const void* p2);
const Taxon* getChildrenAt(int i) const;
/* find children by its name*/
const Taxon* findChildByName(const char* name) const;
/* recursive. find child from a unix path */
const Taxon* findChildByPath(const char* path) const;
/* stupid representation of a node as a XML file */
string xml() const;


The NCBI Taxonomy:
class FuseTaxonomy

/* number of nodes */
size_t nTaxons;
/** taxons ordered by ids */
Taxon** taxons;
/** taxons ordered by names */
Taxon** names;
/* global instance */
static FuseTaxonomy* INSTANCE;
/* find taxon by ID */
const Taxon* findTaxonById(int id)const;
/* find taxon by name */
const Taxon* findTaxonByName(const char* name) const;
/* root of taxonomy */
const Taxon* getRoot() const;
/* read file nodes.dmp */
void readNodes(const char* nodes);
/* read file names.dmp */
void readNames(const char* namesfile);
/* find taxon node from unix path */
const Taxon* findByPath(const char* path) const;
/** FUSE CALLBACK: This function returns metadata concerning a file specified by path in a special stat structure. */
static int getattr(const char *path, struct stat *stbuf);
/* FUSE CALLBACK: used to read directory contents */
static int readdir(const char *path, void *buf, fuse_fill_dir_t filler, off_t offset, struct fuse_file_info *fi);
/* FUSE CALLBACK: checks whatever user is permitted to open the /hello file with flags given in the fuse_file_info structure. */
static int open(const char *path, struct fuse_file_info *fi);
/* FUSE CALLBACK: used to feed the user with data from the file. */
static int read(const char *path, char *buf, size_t size, off_t offset, struct fuse_file_info *fi);

The static functions 'getattr', 'readdir', 'open' and 'read' are the callbacks called by the FUSE API to explore the new filesystem and must be initialized in the 'main' method.


Load the NCBI taxonomy
$ wget -O taxdump.tar.gz "ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz"

$ tar xvfz taxdump.tar.gz names.dmp
$ tar xvfz taxdump.tar.gz nodes.dmp

Create a temporary folder for our new filesystem:
$ mkdir -p tmp_fuse

load the NCBI taxonomy and install the new filesystem
$ ./fusetaxonomy nodes.dmp names.dmp  tmp_fuse

What's is in this new filesystem ?
$ls tmp_fuse


And what are the nodes under "root" ?
$ls tmp_fuse

cellular_organisms other_sequences unclassified_sequences Viroids Viruses

And what's under 'Eukaryota' ?
ls -la tmp_fuse/root/cellular_organisms/Eukaryota
total 0
drwxr-xr-x 22 root root 0 1970-01-01 01:00 .
drwxr-xr-x 3 root root 0 1970-01-01 01:00 ..
drwxr-xr-x 9 root root 0 1970-01-01 01:00 Alveolata
drwxr-xr-x 11 root root 0 1970-01-01 01:00 Amoebozoa
drwxr-xr-x 4 root root 0 1970-01-01 01:00 Apusozoa
drwxr-xr-x 5 root root 0 1970-01-01 01:00 Centroheliozoa
drwxr-xr-x 4 root root 0 1970-01-01 01:00 Cryptophyta
drwxr-xr-x 38 root root 0 1970-01-01 01:00 environmental_samples
drwxr-xr-x 5 root root 0 1970-01-01 01:00 Euglenozoa
drwxr-xr-x 7 root root 0 1970-01-01 01:00 Fornicata
drwxr-xr-x 5 root root 0 1970-01-01 01:00 Glaucocystophyceae
drwxr-xr-x 11 root root 0 1970-01-01 01:00 Haptophyceae
drwxr-xr-x 4 root root 0 1970-01-01 01:00 Heterolobosea
drwxr-xr-x 4 root root 0 1970-01-01 01:00 Jakobida
drwxr-xr-x 3 root root 0 1970-01-01 01:00 Katablepharidophyta
drwxr-xr-x 1 root root 0 1970-01-01 01:00 Malawimonadidae
drwxr-xr-x 6 root root 0 1970-01-01 01:00 Opisthokonta
drwxr-xr-x 8 root root 0 1970-01-01 01:00 Oxymonadida
drwxr-xr-x 9 root root 0 1970-01-01 01:00 Parabasalia
drwxr-xr-x 9 root root 0 1970-01-01 01:00 Rhizaria
drwxr-xr-x 7 root root 0 1970-01-01 01:00 Rhodophyta
drwxr-xr-x 25 root root 0 1970-01-01 01:00 stramenopiles
drwxr-xr-x 36 root root 0 1970-01-01 01:00 unclassified_eukaryotes
drwxr-xr-x 3 root root 0 1970-01-01 01:00 Viridiplantae

Let's use find !
find tmp_fuse/root/cellular_organisms/Eukaryota/Opisthokonta/Metazoa/ | head


What is the content of Homo_sapiens_neanderthalensis ?:
$ cat tmp_fuse/root/cellular_organisms/Eukaryota/Opisthokonta/Metazoa/Eumetazoa/Bilateria/Coelomata/Deuterostomia/Chordata/Craniata/Vertebrata/Gnathostomata/Teleostomi/Euteleostomi/Sarcopterygii/Tetrapoda/Amniota/Mammalia/Theria/Eutheria/Euarchontoglires/Primates/Haplorrhini/Simiiformes/Catarrhini/Hominoidea/Hominidae/Homininae/Homo/Homo_sapiens/Homo_sapiens_neanderthalensis

<?xml version="1.0"?>

unmount the NCBI filesystem
sudo ${FUSEDIR}/bin/fusermount -u  tmp_fuse

The Full source code.

The full source code is available as a 'gist' on github.com.
The code is also on github here

That's it !


10 August 2011

Wikipedia OpenSearch and semantic bash completion

I wrote a tool that use the opensearch API for mediawiki to display the results of a query in wikipedia:

$ opensearch Charles Darw
Charles_Darwin Charles Robert Darwin FRS (12 February 1809 – 19 April 1882) was an English...
Charles_Darwin_University Charles Darwin University (CDU) is an Australian public university wi...
Charles_Darwin%27s_health For much of his adult life, Charles Darwin's health was repeatedly co...
Charles_Darwin_Foundation The Charles Darwin Foundation was founded in 1959, under the auspices...
Charles_Darwin_National_Park Charles Darwin National Park is in the Northern Territory of Austr...
Charles_Darwin_Research_Station The Charles Darwin Research Station (CDRS) is a biological rese...
Charles_Darwin%27s_education Charles Darwin's education gave him a foundation in the doctrine o...
Charles_Darwin_%281758%E2%80%931778%29 Charles Darwin (3 September 1758–15 May 1778) was the ...
Charles_Darwin%27s_religious_views Charles Darwin's views on religion have been the subject of ...
Charles_Darwin_and_the_Tree_of_Life Charles Darwin and the Tree of Life is a 2009 television do...
Charles_Darwin_School Charles Darwin School is the only secondary school in the Biggin Hill area.
Charles_Darwin_%28aviator%29 Major Charles John Wharton Darwin was a First World War flying ace...
Charles_Darwin_Reserve Charles Darwin Reserve is a 686 km2 nature reserve in Western Australia.
Charles_Darwin_bibliography This is a partial list of the writings of Charles Darwin, including...
Darwin Darwin may refer to:

This tool is available on github at:

Bash completion

Now I want to use this tool for bash completion. Say, I wrote a tool named 'foo' that expects the title of a wikipedia article as an argument. In ~/.bash_completion.d/, I wrote a file named opensearch:

local cur choice reply
choice=$(opensearch -d -c ${cur})
reply=$(compgen -W "${choice}" -- ${cur})
COMPREPLY=( $reply )
return 0
} &&

complete -F _opensearch foo

This script is activated when a new terminal is loaded in ${HOME}/.bashrc:
if [ -f ${HOME}/.bash_completion.d/opensearch ]; then
. ${HOME}/.bash_completion.d/opensearch
Now, typing 'foo', a keyword and <TAB> activates this new completion:
$ foo Charles_D <TAB> <TAB>

Charles_Dance Charles_Darwin_University Charles_Doolittle_Walcott Charles_Dutoit
Charles_Darwin Charles_Dickens Charles_Durning
$ foo Charles_D

That's it,


09 August 2011

Quick tip: bash completion for Bioinformatics

The default behavior for the <tab> completion can be extended by creating a new file:

in your home.
For example, you can write your own Bash Completion for samtools, bwa or your favourite tool by adding the following line in "${HOME}/.bash_completion":

complete -f -X '!*.@(bam|sam|fasta|fa|fa.gz|fastq.gz|fasta.gz)' samtools bwa

Open a new bash and type:

$ samtools index <tab>
myfile.fastq.gz sequence.fasta


01 August 2011

Using the Freebase and the Bioportal Widgets to create a semantic object.

The following HTML code uses:
After completion, it generates a JSON object describing a semantic object:


You can download the source and test it:

That's it,


Memory-Mapping the Human Genome with 'mmap': my notebook

In this post, I've explored how to use a memory-mapped file to handle the 3Go of the Human Genome sequence as if it was entirely loaded in memory.
Via wikipedia: "A memory-mapped file is a segment of virtual memory which has been assigned a direct byte-for-byte correlation with some portion of a file or file-like resource. This resource is typically a file that is physically present on-disk... In computing, mmap is a POSIX-compliant Unix system call that maps files or devices into memory. It is a method of memory-mapped file I/O. It naturally implements demand paging, because initially file contents are not entirely read from disk and do not use physical RAM at all. The actual reads from disk are performed in "lazy" manner, after a specific location is accessed."
Using a C++ program, I'm going to memory-map the fasta sequence of the human genome (indexed with samtools faidx) and search the position of some short-reads using a BoyerMoore algorithm:

Required members:

/* maps a chromosome to its samtools faidx index */
map<string,faidx1_t> name2index;
/* used to get the size of the file */
struct stat buf;
/* genome fasta file file descriptor */
int fd;
/* the mmap (memory mapped) pointer */
char *mapptr;

Opening the mmap

string faidx(fastaFile);
string line;
/* open *.fai file */
ifstream in(faidx.c_str(),ios::in);
/* read indexes in the .fai file that was created with samtools */
faidx1_t index;
//parse the faidx line...
/* close index file */

/* get the whole size of the fasta file */
stat(fasta, &buf);
/* open the fasta file */
fd = open(fastaFile, O_RDONLY);
/* open a memory mapped file associated to this fasta file descriptor */
mapptr = (char*)mmap(0, buf.st_size, PROT_READ, MAP_SHARED, fd, 0);

It's a kind of MAGIC: Getting the base at index 'position-th' of chromosome 'chrom'

std::map<std::string,faidx1_t>::iterator r=name2index.find(chrom);
faidx1_t& index=r->second;
char base=at(&index,position);
/* returns the base at position 'index' for the chromosome indexed by faidx */
char at(const FaidxPtr faidx,int64_t index)
long pos= faidx->offset +
index / faidx->line_blen * faidx->line_len +
index % faidx->line_blen
/* here is the magic: no need to fseek/fread/ftell the file */
return toupper(mapptr[pos]);

Mapping the short reads

I've hacked a simple Boyer-Moore-Horspool algorithm from ttp://en.wikipedia.org/wiki/Boyer-Moore-Horspool_algorithm. Of course, you wouldn't use this algorithm to map your short reads for real :-) .

Disposing the mmap

/* close memory mapped map */
if(mapptr!=NULL) munmap(mapptr,buf.st_size);
/* dispose fasta file descriptor */
if(fd!=-1) close(fd);

Compile & run

$ g++ -Wall testmmap.cpp -lz

$ ./a.out -g /path/tp/hg19.fa /path/to/my.fastq.gz


Source code

That's it,