28 January 2010

Elementary school for Bioinformatics: Suffix Arrays

I was recently asked to map the flanking ends of a large set of SNPs on the human genome. As a perfect match algorithm was sufficient I've implemented a suffix array algorithm to map each flanking sequence. Via Wikipedia:A suffix array is an array of integers giving the starting positions of suffixes of a string in lexicographical order. It can be used as an index to quickly locate every occurrence of a substring within the string. Finding every occurrence of the substring is equivalent to finding every suffix that begins with the substring. Thanks to the lexicographical ordering, these suffixes will be grouped together in the suffix array, and can be found efficiently with a binary search.

Say, the sequence of your genome is "gggtaaagctataactattgatcaggcgtt".
The array of integers giving the starting positions of the suffixes is:

[00]    gggtaaagctataactattgatcaggcgtt
[01] ggtaaagctataactattgatcaggcgtt
[02] gtaaagctataactattgatcaggcgtt
[03] taaagctataactattgatcaggcgtt
[04] aaagctataactattgatcaggcgtt
[05] aagctataactattgatcaggcgtt
[06] agctataactattgatcaggcgtt
[07] gctataactattgatcaggcgtt
[08] ctataactattgatcaggcgtt
[09] tataactattgatcaggcgtt
[10] ataactattgatcaggcgtt
[11] taactattgatcaggcgtt
[12] aactattgatcaggcgtt
[13] actattgatcaggcgtt
[14] ctattgatcaggcgtt
[15] tattgatcaggcgtt
[16] attgatcaggcgtt
[17] ttgatcaggcgtt
[18] tgatcaggcgtt
[19] gatcaggcgtt
[20] atcaggcgtt
[21] tcaggcgtt
[22] caggcgtt
[23] aggcgtt
[24] ggcgtt
[25] gcgtt
[26] cgtt
[27] gtt
[28] tt
[29] t
This array is sorted using the lexicographical order of the prefixes:
[04]    aaagctataactattgatcaggcgtt
[12] aactattgatcaggcgtt
[05] aagctataactattgatcaggcgtt
[13] actattgatcaggcgtt
[06] agctataactattgatcaggcgtt
[23] aggcgtt
[10] ataactattgatcaggcgtt
[20] atcaggcgtt
[16] attgatcaggcgtt
[22] caggcgtt
[26] cgtt
[08] ctataactattgatcaggcgtt
[14] ctattgatcaggcgtt
[19] gatcaggcgtt
[25] gcgtt
[07] gctataactattgatcaggcgtt
[24] ggcgtt
[00] gggtaaagctataactattgatcaggcgtt
[01] ggtaaagctataactattgatcaggcgtt
[02] gtaaagctataactattgatcaggcgtt
[27] gtt
[29] t
[03] taaagctataactattgatcaggcgtt
[11] taactattgatcaggcgtt
[09] tataactattgatcaggcgtt
[15] tattgatcaggcgtt
[21] tcaggcgtt
[18] tgatcaggcgtt
[28] tt
[17] ttgatcaggcgtt
Now, using a Binary Search we can quickly find where is the sequence "aagctataacta" in the genome:
[04]    aaagctataactattgatcaggcgtt
[12] aactattgatcaggcgtt
[05] aagctataactattgatcaggcgtt
HIT! aagctataacta
[13] actattgatcaggcgtt
[06] agctataactattgatcaggcgtt
(...)
I've implemented this basic algorithm in Java for my needs (finding the pairs of hits, etc...). The source code is available here:. The program was able to map 17E6 * 2 sequences on the human genome in about half a day and, as each sequence can be searched regardless of the others, I now plan to make the code multithreaded.

Implementation (Samples)


Building the index


//write genomic sequence in this dynamic byte array
ByteArrayOutputStream baos=new ByteArrayOutputStream(150000000);
//open the genomiic sequence
FileInputStream fin=new FileInputStream(chromFile);

//ignore the fasta header
int c=fin.read();
if(c!='>') throw new IOException("not starting with '>' "+chromFile);
///skip fasta header
while((c=fin.read())!=-1)
{
if(c=='\n') break;
}
//read the fasta sequence
byte array[]=new byte[1000000];
int nRead;
while((nRead=fin.read(array))!=-1)
{
int j=0;
//remove blanks
for(int i=0;i< nRead;++i)
{
if( Character.isWhitespace(array[i])||
Character.isDigit(array[i])) continue;
array[j]=(byte)Character.toUpperCase(array[i]);
if(array[j]=='>') throw new IOException("expected only one sequence in "+chromFile);
j++;
}
baos.write(array,0,j);
}
fin.close();
//get the genomic sequence as an array of bytes
this.chrom_sequence=baos.toByteArray();
baos=null;
LOG.info("OK in byte array. length= "+chrom_sequence.length);

//build the array of pointers
this.pointers=new int[chrom_sequence.length];
for(int i=0;i+this.flanking<= this.pointers.length;++i)
{
this.pointers[i]=i;
}

(...)
//sort the suffix array
quicksort(0,this.pointers.length-1);
(...)

Binary Search


private int lower_bound(
int first, int last,
byte shortSeq[]
)
{
int len = last - first;
while (len > 0)
{
int half = len / 2;
int middle = first + half;
if (compare(middle,shortSeq)<0)
{
first = middle + 1;
len -= half + 1;
}
else
{
len = half;
}
}
return first;
}
(...)
byte shortSeq[]=...;
int i=lower_bound(0,this.pointers.length,shortSeq);
if(i>= this.pointers.length ) return;
while(i< this.pointers.length &&
compare(i, shortSeq)==0)
{
System.out.println(
shortName+"\t"+
chromName+"\t"+
strand+"\t"+
this.pointers[i]+"\t"+
(this.pointers[i]+shortSeq.length)
);
i++;
}

That's it
Pierre

1 comment:

sahay said...

That's awesome.I am taking a course which requires me to write a proposal on bioinformatics problems. Fortunately/unfortunately I mentioned to propose a solution for the problem of multi-read mapping. What do you suggest would be a good exploratory algorithm to mention.

thanks
Sahay