11 April 2007

How does BLAST work?

A few years ago, I wondered how BLAST was implemented: was there a way to play with the binary files in which the sequences are indexed? I had a glance at the NCBI C toolkit, but I was a little lost in all that source code. I asked the question on Usenet and received an e-mail from M. Dumontier, who suggested that I have a look at the SLRI toolkit:

The Samuel Lunenfeld Research Institute (SLRI) Toolkit is a cross-platform toolkit for manipulating biological information. The SLRI toolkit is based mainly in C and derives many functions from the NCBI toolkit. The SLRI toolkit was developed mainly for data pertaining to protein structure and function but can be used to manipulate other data such as gene sequences.

Last Sunday, I added a short new entry to Wikipedia about formatdb, and I wondered again how the software was implemented: what is the format of those files? How are the proteins and the degenerate nucleotides packed? Could I implement a reader/writer in another language (Java?)? Just out of curiosity, I would be interested in more information about how BLAST was implemented. Feel free to add more information on this subject to Wikipedia.


