17 May 2010

Elementary school for Bioinformatics indexing/extracting a sequence in the FASTA format

Not Exactly Rocket Science: Inspired by the faidx command in Samtools, I've created a simple basic tool indexing some FASTA sequences. Based on the fact that all the lines in a fasta should (hopefully) have the very same length, it should be straightforward to get any sub-sequence knowing the starting index of a named sequence in a file.
The first tool (source code available here) indexes the sequences in one more fasta file and produces a RDF file (more is better) describing each entry:

java -jar fastaindexer.jar ~/rotavirus.fa

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:fi="uri:fr.inserm.umr915:fastaidx:1.0:" xmlns:dc="http://purl.org/dc/elements/1.1/">
<fi:Index>
<fi:source rdf:resource="file:///home/pierre/rotavirus.fa"/>
<dc:title>gi|164611561</dc:title>
<fi:line-size>70</fi:line-size>
<fi:start>14</fi:start>
<fi:end>1104</fi:end>
<fi:length>1075</fi:length>
</fi:Index>
<fi:Index>
<fi:source rdf:resource="file:///home/pierre/rotavirus.fa"/>
<dc:title>gi|122938370</dc:title>
<fi:line-size>70</fi:line-size>
<fi:start>1120</fi:start>
<fi:end>1728</fi:end>
<fi:length>600</fi:length>
</fi:Index>
<fi:Index>
<fi:source rdf:resource="file:///home/pierre/rotavirus.fa"/>
<dc:title>gi|78064499</dc:title>
<fi:line-size>70</fi:line-size>
<fi:start>1743</fi:start>
<fi:end>20623</fi:end>
<fi:length>18615</fi:length>
</fi:Index>
<fi:Index>
<fi:source rdf:resource="file:///home/pierre/rotavirus.fa"/>
<dc:title>gi|66774335</dc:title>
<fi:line-size>70</fi:line-size>
<fi:start>20638</fi:start>
<fi:end>39518</fi:end>
<fi:length>18615</fi:length>
</fi:Index>
<fi:Index>
<fi:source rdf:resource="file:///home/pierre/rotavirus.fa"/>
<dc:title>gi|10242107</dc:title>
<fi:line-size>70</fi:line-size>
<fi:start>39533</fi:start>
<fi:end>40208</fi:end>
<fi:length>666</fi:length>
</fi:Index>
<fi:Index>
<fi:source rdf:resource="file:///home/pierre/rotavirus.fa"/>
<dc:title>gi|164611570</dc:title>
<fi:line-size>70</fi:line-size>
<fi:start>40224</fi:start>
<fi:end>40897</fi:end>
<fi:length>664</fi:length>
</fi:Index>
<fi:Index>
<fi:source rdf:resource="file:///home/pierre/rotavirus.fa"/>
<dc:title>gi|164611569</dc:title>
<fi:line-size>70</fi:line-size>
<fi:start>40913</fi:start>
<fi:end>42583</fi:end>
<fi:length>1647</fi:length>
</fi:Index>
<fi:Index>
<fi:source rdf:resource="file:///home/pierre/rotavirus.fa"/>
<dc:title>gi|164611567</dc:title>
<fi:line-size>70</fi:line-size>
<fi:start>42599</fi:start>
<fi:end>43673</fi:end>
<fi:length>1059</fi:length>
</fi:Index>
<fi:Index>
<fi:source rdf:resource="file:///home/pierre/rotavirus.fa"/>
<dc:title>gi|164611565</dc:title>
<fi:line-size>70</fi:line-size>
<fi:start>43689</fi:start>
<fi:end>45505</fi:end>
<fi:length>1791</fi:length>
</fi:Index>
<fi:Index>
<fi:source rdf:resource="file:///home/pierre/rotavirus.fa"/>
<dc:title>gi|164611563</dc:title>
<fi:line-size>70</fi:line-size>
<fi:start>45521</fi:start>
<fi:end>47252</fi:end>
<fi:length>1707</fi:length>
</fi:Index>
</rdf:RDF>

From this index, any substring can be extracted for a given sequence:

FastaIndex.java
/**
* returns a sequence from a RandomAccessFile for this index
* @param in a fasta file opened with a RandomAccessFile
* @param start the start offset of the sequence (first base=0)
* @param length number of symbols to read
* @return an array of bytes containing the sequence
* @throws IOException
*/
public byte[] getSequence(
RandomAccessFile in,
int start,
int length
) throws IOException
{
if(start<0) throw new IllegalArgumentException("start<0:"+start);
if(start+length> getLength()) throw new IllegalArgumentException("start="+start+" length="+length+"> size="+getLength());
/* the array of byte where we store the final sequence */
byte seq[]=new byte[length];

/** the fasta row index */
int row_index=start/this.lineSize;
/** index of 'start' in this current row */
int index_in_row=start-row_index*this.lineSize;
/** prepare a buffer to read some bytes from the fasta file */
byte buffer[]=new byte[FastaIndex.getBufferSize()];
/** number of bytes in the buffer */
int buffer_length=0;
/** current position in the buffer */
int index_in_buffer=0;
/** current position in the final array of bytes (sequence) */
int index_in_seq=0;
/** move the IO cursor to the beginning of the sequence */
in.seek(
this.getSeqStart()+
row_index*(this.lineSize+1)+
index_in_row
);
while(length>0)
{
/* buffer empty ? fill it */
if(buffer_length==0)
{
buffer_length=in.read(buffer);
index_in_buffer=0;
if(buffer_length<=0) throw new IOException("cannot fill buffer");
}
/* number of byte to copy into the sequence */
int n_to_copy= Math.min(buffer_length-index_in_buffer,Math.min(length,this.lineSize-index_in_row));
//System.err.println("n_to_copy="+n_to_copy+" index_in_row="+index_in_row+" \""+new String(buffer,index_in_buffer,n_to_copy)+"\"");

/* copy the bytes from the buffer to the sequence */
System.arraycopy(buffer, index_in_buffer, seq, index_in_seq, n_to_copy);

/* move the offsets */
index_in_seq+=n_to_copy;
length-=n_to_copy;
index_in_buffer+=n_to_copy;
index_in_row=(index_in_row+n_to_copy)%this.lineSize;

/* check the next input is a CR/LF */
if(length>0)
{
if(index_in_row==0)
{
/* buffer is filled, read from input stream */
if(index_in_buffer==buffer_length)
{
if(in.read()!='\n')
{
throw new IOException("I/O error expected a carriage return at offset "+in.getFilePointer());
}
}
else /* check the buffer */
{
if(buffer[index_in_buffer]!='\n')
{
throw new IOException("I/O error expected a carriage return at offset "+in.getFilePointer());
}
index_in_buffer++;
}
}

/* reset the buffer it is filled */
if(index_in_buffer==buffer_length)
{
buffer_length=0;
index_in_buffer=0;
}
}


}
return seq;
}


Example

:
echo -e "gi|10242107\t100\t278" | java -jar genomeindex.jar -f ~/jeter.xml
>gi|10242107|100-278
AACTCTTTCTGGAAAATCTATTGGTAGGAGTGAACAGTACATTTCACCAG
ATGCAGAAGCATTCAGTAAATACATGCTGTCAAAGTCTCCAGAAGATATT
GGACCATCTGATTCTGCTTCAAACGATCCACTCACCAGCTTTTCGATTAG
ATCGAATGCAGTTAAGACAAATGCAGAC



That's it
Pierre

7 comments:

id said...

Why output the index in RDF ? - Greg Tyrelle (Open ID, doesn't give me a name ?)

Pierre Lindenbaum said...

Why not ? :-) why should I create a new XML schema when there is RDF ?

Greg said...

Indeed why not! I'm not questioning it as an approach, I'm just curious. I thought you may have had a specific reason, some additional utility that you could see from having the index in RDF, SPARQL queries, integration of other RDF data sources etc.

Pierre Lindenbaum said...

No :-) I also could have used a JSON format but I always want to have the possibility to process this kind of file with XSLT.

Greg said...

Processing RDF/XML with XSLT is only for the truly brave :)

Pierre Lindenbaum said...

:-) no, if the structure of the XML element is always the same :-)

BTW, *I* can process RDF with XSLT: http://plindenbaum.blogspot.com/2009/06/rdf-javascript-xsl-stylesheets.html
;-)

casbon said...

I'm calling this Casbon's Corollary to Greenspun's tenth: any
indexed data format will contain an ad hoc informally-specified
bug-ridden slow implementation of half of SQL.

http://lists.idyll.org/pipermail/biology-in-python/2009-November/000501.html

http://ivory.idyll.org/blog/mar-10/storing-and-retrieving-sequences

tl;dr: sqlite3 is better than your index