Given an input string X, an integer k and a frequency f, report all k-mers that occur with frequency higher than f. Expect the length of X to be from a few hundred thousands to a few millions and k to be between 5 and 20.
I wrote a simple java code to solve this problem. This is not big science as I used a brute-force algorithm, but you might be interested about how the k-mers were mapped to their occurrences. Here, my code uses the java implementation of the BerkeleyDB API (http://www.oracle.com/technology/documentation/berkeley-db/je/index.html) to map the k-mers to their occurrences.
The source code is available here:http://anybody.cephb.fr/perso/lindenb/tmp/StringChallenge.java (Opps, please remove this extra (c=fin.read();), I cannot change this at this time)
The code
First the BerkeleyDB environment is initialized and we create a temporary database that will map the k-mers to the counts.
File envHome=new File(System.getProperty("java.io.tmpdir"));
EnvironmentConfig envCfg= new EnvironmentConfig();
envCfg.setAllowCreate(true);
envCfg.setReadOnly(false);
Environment env= new Environment(envHome,envCfg);
//create a first database mapping k-mers to count
DatabaseConfig cfg= new DatabaseConfig();
cfg.setAllowCreate(true);
cfg.setReadOnly(false);
cfg.setTemporary(true);
Database db= env.openDatabase(null, "kmers", cfg);
DatabaseEntry key = new DatabaseEntry();
DatabaseEntry value = new DatabaseEntry();
EnvironmentConfig envCfg= new EnvironmentConfig();
envCfg.setAllowCreate(true);
envCfg.setReadOnly(false);
Environment env= new Environment(envHome,envCfg);
//create a first database mapping k-mers to count
DatabaseConfig cfg= new DatabaseConfig();
cfg.setAllowCreate(true);
cfg.setReadOnly(false);
cfg.setTemporary(true);
Database db= env.openDatabase(null, "kmers", cfg);
DatabaseEntry key = new DatabaseEntry();
DatabaseEntry value = new DatabaseEntry();
The sequence is then scanned and an array of bytes of length k is filled with the characters.
FileReader fin= new FileReader(file);
byte array[]=new byte[kmer];
int c=-1;
int array_size=0;
//c=fin.read(); oopps, I should have removed this....
while((c=fin.read())!=-1)
{
if(Character.isWhitespace(c)) continue;
c=Character.toUpperCase(c);
array[array_size++]=(byte)c;
}
byte array[]=new byte[kmer];
int c=-1;
int array_size=0;
//c=fin.read(); oopps, I should have removed this....
while((c=fin.read())!=-1)
{
if(Character.isWhitespace(c)) continue;
c=Character.toUpperCase(c);
array[array_size++]=(byte)c;
}
This array is filled, searched in the database and the count is incremented back in the BerkeleyDB. The the content of the array is shifted to the left.
key.setData(array);
int count=0;
//is this data already exists ?
if(db.get(null,key, value, null)==OperationStatus.SUCCESS)
{
count =IntegerBinding.entryToInt(value);
}
IntegerBinding.intToEntry(count+1, value);
db.put(null,key, value);
//switch to left
for(int i=1;i< kmer;++i)
{
array[i-1]=array[i];
}
array_size--;
int count=0;
//is this data already exists ?
if(db.get(null,key, value, null)==OperationStatus.SUCCESS)
{
count =IntegerBinding.entryToInt(value);
}
IntegerBinding.intToEntry(count+1, value);
db.put(null,key, value);
//switch to left
for(int i=1;i< kmer;++i)
{
array[i-1]=array[i];
}
array_size--;
At the end, in order to have a set of ordered results, a reverse database is created . It maps the counts to the k-mers.
//create a second database mapping count to k-mers
cfg= new DatabaseConfig();
cfg.setAllowCreate(true);
cfg.setReadOnly(false);
cfg.setTemporary(true);
cfg.setSortedDuplicates(true);
Database db2= env.openDatabase(null, "occ",cfg);
key = new DatabaseEntry();
value = new DatabaseEntry();
Cursor cursor= db.openCursor(null, null);
while(cursor.getNext(key, value,null)==OperationStatus.SUCCESS)
{
int count=IntegerBinding.entryToInt(value);
if((count/(float)total)< freq) continue;
db2.put(null, value, key);
}
cursor.close();
This second database is then scanned to print the ordered results.
while(cursor.getNext(key, value,null)==OperationStatus.SUCCESS)
{
int count=IntegerBinding.entryToInt(key);
String seq= new String(value.getData(),value.getOffset(),value.getSize());
System.out.println(seq+"\t"+count+"/"+total);
}
{
int count=IntegerBinding.entryToInt(key);
String seq= new String(value.getData(),value.getOffset(),value.getSize());
System.out.println(seq+"\t"+count+"/"+total);
}
Compilation
javac -cp ${BERKELEY-JE-PATH}/lib/je-3.3.75.jar:. StringChallenge.java
Excecution
java -cp ${BERKELEY-JE-PATH}/lib/je-3.3.75.jar:. StringChallenge
Test with the Human chr22 (~49E6 bp)
time java -cp ${BERKELEY-JE-PATH}/lib/je-3.3.75.jar:. StringChallenge -k 8 -f 0.001 chr22.fa
AAAAAAAA 71811/49691430
TTTTTTTT 72474/49691430
NNNNNNNN 14840030/49691430
real 5m44.240s
user 5m56.186s
sys 0m3.178s
time java -cp ${BERKELEY-JE-PATH}/lib/je-3.3.75.jar:. StringChallenge -k 10 -f 0.001 chr22.fa
AAAAAAAAAA 50128/49691428
TTTTTTTTTT 50841/49691428
NNNNNNNNNN 14840010/49691428
real 8m15.152s
user 8m41.429s
sys 0m3.748s
AAAAAAAA 71811/49691430
TTTTTTTT 72474/49691430
NNNNNNNN 14840030/49691430
real 5m44.240s
user 5m56.186s
sys 0m3.178s
time java -cp ${BERKELEY-JE-PATH}/lib/je-3.3.75.jar:. StringChallenge -k 10 -f 0.001 chr22.fa
AAAAAAAAAA 50128/49691428
TTTTTTTTTT 50841/49691428
NNNNNNNNNN 14840010/49691428
real 8m15.152s
user 8m41.429s
sys 0m3.748s
That's it. Now I'd like to know how the 'elegant' solutions will be implemented :-)
No comments:
Post a Comment