Say, the sequence of your genome is
"gggtaaagctataactattgatcaggcgtt"
.The array of integers giving the starting positions of the suffixes is:
[00] gggtaaagctataactattgatcaggcgttThis array is sorted using the lexicographical order of the prefixes:
[01] ggtaaagctataactattgatcaggcgtt
[02] gtaaagctataactattgatcaggcgtt
[03] taaagctataactattgatcaggcgtt
[04] aaagctataactattgatcaggcgtt
[05] aagctataactattgatcaggcgtt
[06] agctataactattgatcaggcgtt
[07] gctataactattgatcaggcgtt
[08] ctataactattgatcaggcgtt
[09] tataactattgatcaggcgtt
[10] ataactattgatcaggcgtt
[11] taactattgatcaggcgtt
[12] aactattgatcaggcgtt
[13] actattgatcaggcgtt
[14] ctattgatcaggcgtt
[15] tattgatcaggcgtt
[16] attgatcaggcgtt
[17] ttgatcaggcgtt
[18] tgatcaggcgtt
[19] gatcaggcgtt
[20] atcaggcgtt
[21] tcaggcgtt
[22] caggcgtt
[23] aggcgtt
[24] ggcgtt
[25] gcgtt
[26] cgtt
[27] gtt
[28] tt
[29] t
[04] aaagctataactattgatcaggcgttNow, using a Binary Search we can quickly find where is the sequence
[12] aactattgatcaggcgtt
[05] aagctataactattgatcaggcgtt
[13] actattgatcaggcgtt
[06] agctataactattgatcaggcgtt
[23] aggcgtt
[10] ataactattgatcaggcgtt
[20] atcaggcgtt
[16] attgatcaggcgtt
[22] caggcgtt
[26] cgtt
[08] ctataactattgatcaggcgtt
[14] ctattgatcaggcgtt
[19] gatcaggcgtt
[25] gcgtt
[07] gctataactattgatcaggcgtt
[24] ggcgtt
[00] gggtaaagctataactattgatcaggcgtt
[01] ggtaaagctataactattgatcaggcgtt
[02] gtaaagctataactattgatcaggcgtt
[27] gtt
[29] t
[03] taaagctataactattgatcaggcgtt
[11] taactattgatcaggcgtt
[09] tataactattgatcaggcgtt
[15] tattgatcaggcgtt
[21] tcaggcgtt
[18] tgatcaggcgtt
[28] tt
[17] ttgatcaggcgtt
"aagctataacta"
in the genome:[04] aaagctataactattgatcaggcgttI've implemented this basic algorithm in Java for my needs (finding the pairs of hits, etc...). The source code is available here:. The program was able to map 17E6 * 2 sequences on the human genome in about half a day and, as each sequence can be searched regardless of the others, I now plan to make the code multithreaded.
[12] aactattgatcaggcgtt
[05] aagctataactattgatcaggcgtt
HIT!aagctataacta
[13] actattgatcaggcgtt
[06] agctataactattgatcaggcgtt
(...)
Implementation (Samples)
Building the index
//write genomic sequence in this dynamic byte array
ByteArrayOutputStream baos=new ByteArrayOutputStream(150000000);
//open the genomiic sequence
FileInputStream fin=new FileInputStream(chromFile);
//ignore the fasta header
int c=fin.read();
if(c!='>') throw new IOException("not starting with '>' "+chromFile);
///skip fasta header
while((c=fin.read())!=-1)
{
if(c=='\n') break;
}
//read the fasta sequence
byte array[]=new byte[1000000];
int nRead;
while((nRead=fin.read(array))!=-1)
{
int j=0;
//remove blanks
for(int i=0;i< nRead;++i)
{
if( Character.isWhitespace(array[i])||
Character.isDigit(array[i])) continue;
array[j]=(byte)Character.toUpperCase(array[i]);
if(array[j]=='>') throw new IOException("expected only one sequence in "+chromFile);
j++;
}
baos.write(array,0,j);
}
fin.close();
//get the genomic sequence as an array of bytes
this.chrom_sequence=baos.toByteArray();
baos=null;
LOG.info("OK in byte array. length= "+chrom_sequence.length);
//build the array of pointers
this.pointers=new int[chrom_sequence.length];
for(int i=0;i+this.flanking<= this.pointers.length;++i)
{
this.pointers[i]=i;
}
(...)
//sort the suffix array
quicksort(0,this.pointers.length-1);
(...)
ByteArrayOutputStream baos=new ByteArrayOutputStream(150000000);
//open the genomiic sequence
FileInputStream fin=new FileInputStream(chromFile);
//ignore the fasta header
int c=fin.read();
if(c!='>') throw new IOException("not starting with '>' "+chromFile);
///skip fasta header
while((c=fin.read())!=-1)
{
if(c=='\n') break;
}
//read the fasta sequence
byte array[]=new byte[1000000];
int nRead;
while((nRead=fin.read(array))!=-1)
{
int j=0;
//remove blanks
for(int i=0;i< nRead;++i)
{
if( Character.isWhitespace(array[i])||
Character.isDigit(array[i])) continue;
array[j]=(byte)Character.toUpperCase(array[i]);
if(array[j]=='>') throw new IOException("expected only one sequence in "+chromFile);
j++;
}
baos.write(array,0,j);
}
fin.close();
//get the genomic sequence as an array of bytes
this.chrom_sequence=baos.toByteArray();
baos=null;
LOG.info("OK in byte array. length= "+chrom_sequence.length);
//build the array of pointers
this.pointers=new int[chrom_sequence.length];
for(int i=0;i+this.flanking<= this.pointers.length;++i)
{
this.pointers[i]=i;
}
(...)
//sort the suffix array
quicksort(0,this.pointers.length-1);
(...)
Binary Search
private int lower_bound(
int first, int last,
byte shortSeq[]
)
{
int len = last - first;
while (len > 0)
{
int half = len / 2;
int middle = first + half;
if (compare(middle,shortSeq)<0)
{
first = middle + 1;
len -= half + 1;
}
else
{
len = half;
}
}
return first;
}
(...)
byte shortSeq[]=...;
int i=lower_bound(0,this.pointers.length,shortSeq);
if(i>= this.pointers.length ) return;
while(i< this.pointers.length &&
compare(i, shortSeq)==0)
{
System.out.println(
shortName+"\t"+
chromName+"\t"+
strand+"\t"+
this.pointers[i]+"\t"+
(this.pointers[i]+shortSeq.length)
);
i++;
}
int first, int last,
byte shortSeq[]
)
{
int len = last - first;
while (len > 0)
{
int half = len / 2;
int middle = first + half;
if (compare(middle,shortSeq)<0)
{
first = middle + 1;
len -= half + 1;
}
else
{
len = half;
}
}
return first;
}
(...)
byte shortSeq[]=...;
int i=lower_bound(0,this.pointers.length,shortSeq);
if(i>= this.pointers.length ) return;
while(i< this.pointers.length &&
compare(i, shortSeq)==0)
{
System.out.println(
shortName+"\t"+
chromName+"\t"+
strand+"\t"+
this.pointers[i]+"\t"+
(this.pointers[i]+shortSeq.length)
);
i++;
}
That's it
Pierre
That's awesome.I am taking a course which requires me to write a proposal on bioinformatics problems. Fortunately/unfortunately I mentioned to propose a solution for the problem of multi-read mapping. What do you suggest would be a good exploratory algorithm to mention.
ReplyDeletethanks
Sahay