OK, after Scifoo 2007 (http://plindenbaum.blogspot.com/2007/07/scifoo-07-anxiety-from-homebody.html). Here are my apprehensions for BioHackathon 2009, you know, I'm lost and anxious when I cannot see the Peripherique ;-)
12 March 2009
My notebook for a Stupid RDF Server
In this post, I'm writing my notes about the java package com.sun.net.httpserver.*. This package contains a lightweight HTTP server and I've tested it to create a simple-and-stupid RDF server.
The source code is available at:
OK. First we need a 'Statement' class holding a RDF triple:
The statements are just stored in a synchronized set. That is to say, we the server stops, all the statements are lost :-).
The 'servlet' StupidRDFServer is a class that implementing com.sun.net.httpserver.HttpHandler, that is to say it must implement the method public void handle(HttpExchange http) throws IOException
The following code starts the server.
The 'Add Statements' button adds a statement in the list:
The 'Query N3' button prints the triples, using the parameters of the form as a filter:
The 'Query RDF' button prints the triples as RDF, using the parameters of the form as a filter. Example of output:
That's it
The source code is available at:
OK. First we need a 'Statement' class holding a RDF triple:
private static class Statement
{
/** subject of this statement */
private URI subject;
/** predicate of this statement */
private URI predicate;
/** value of this statement a String or a URI*/
private Object value;
boolean isLiteral()
{
return value.getClass()==String.class;
}
@Override
public int hashCode() {
final int prime = 31;
int result = 1;
result = prime * result + predicate.hashCode();
result = prime * result + subject.hashCode();
result = prime * result + value.hashCode();
return result;
}
@Override
public boolean equals(Object obj) {
if (this == obj) return true;
if (obj == null) return false;
Statement other = (Statement) obj;
return subject.equals(other.subject) &&
predicate.equals(other.predicate) &&
value.equals(other.value)
;
}
@Override
public String toString() {
return "<"+subject+"> <"+predicate+"> "+(isLiteral()
?"\""+C.escape(String.class.cast(value))+"\""
:"<"+value+"> ."
);
}
}
{
/** subject of this statement */
private URI subject;
/** predicate of this statement */
private URI predicate;
/** value of this statement a String or a URI*/
private Object value;
boolean isLiteral()
{
return value.getClass()==String.class;
}
@Override
public int hashCode() {
final int prime = 31;
int result = 1;
result = prime * result + predicate.hashCode();
result = prime * result + subject.hashCode();
result = prime * result + value.hashCode();
return result;
}
@Override
public boolean equals(Object obj) {
if (this == obj) return true;
if (obj == null) return false;
Statement other = (Statement) obj;
return subject.equals(other.subject) &&
predicate.equals(other.predicate) &&
value.equals(other.value)
;
}
@Override
public String toString() {
return "<"+subject+"> <"+predicate+"> "+(isLiteral()
?"\""+C.escape(String.class.cast(value))+"\""
:"<"+value+"> ."
);
}
}
The statements are just stored in a synchronized set. That is to say, we the server stops, all the statements are lost :-).
private Set<Statement> statements= Collections.synchronizedSet(new HashSet<Statement>());
The 'servlet' StupidRDFServer is a class that implementing com.sun.net.httpserver.HttpHandler, that is to say it must implement the method public void handle(HttpExchange http) throws IOException
The following code starts the server.
public static void main(String[] args)
{
try
{
HttpServer server = HttpServer.create(new InetSocketAddress(PORT), 0);
server.createContext(CONTEXT, new StupidRDFServer());
server.start();
}
catch(IOException err)
{
err.printStackTrace();
}
}
If the query is empty, the following form is returned to the client.{
try
{
HttpServer server = HttpServer.create(new InetSocketAddress(PORT), 0);
server.createContext(CONTEXT, new StupidRDFServer());
server.start();
}
catch(IOException err)
{
err.printStackTrace();
}
}
The 'Add Statements' button adds a statement in the list:
respHeader.add("Content-Type", "text/html");
http.sendResponseHeaders(200, 0);
boolean added=this.statements.add(stmt);
printForm(out,(added?"Statement Added":"Statement already in model"));
http.sendResponseHeaders(200, 0);
boolean added=this.statements.add(stmt);
printForm(out,(added?"Statement Added":"Statement already in model"));
The 'Query N3' button prints the triples, using the parameters of the form as a filter:
respHeader.add("Content-Type", "text/plain");
http.sendResponseHeaders(200, 0);
synchronized(statements)
{
Iterator<Statement> iter= statements.iterator();
while(iter.hasNext())
{
Statement triple=iter.next();
if(
(stmt.subject==null || stmt.subject.equals(triple.subject)) &&
(stmt.predicate==null || stmt.predicate.equals(triple.predicate)) &&
(stmt.value==null || stmt.value.equals(triple.value))
)
{
out.println(triple);
}
}
}
http.sendResponseHeaders(200, 0);
synchronized(statements)
{
Iterator<Statement> iter= statements.iterator();
while(iter.hasNext())
{
Statement triple=iter.next();
if(
(stmt.subject==null || stmt.subject.equals(triple.subject)) &&
(stmt.predicate==null || stmt.predicate.equals(triple.predicate)) &&
(stmt.value==null || stmt.value.equals(triple.value))
)
{
out.println(triple);
}
}
}
The 'Query RDF' button prints the triples as RDF, using the parameters of the form as a filter. Example of output:
<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Statement>
<rdf:subject rdf:resource="foaf%3Ame"/>
<rdf:predicate rdf:resource="foaf%3Aname"/>
<rdf:object>Pierre</rdf:object>
</rdf:Statement>
<rdf:Statement>
<rdf:subject rdf:resource="urn%3Aother"/>
<rdf:predicate rdf:resource="foaf%3Aknows"/>
<rdf:object rdf:resource="foaf%3Aknows"/>
</rdf:Statement>
</rdf:RDF>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Statement>
<rdf:subject rdf:resource="foaf%3Ame"/>
<rdf:predicate rdf:resource="foaf%3Aname"/>
<rdf:object>Pierre</rdf:object>
</rdf:Statement>
<rdf:Statement>
<rdf:subject rdf:resource="urn%3Aother"/>
<rdf:predicate rdf:resource="foaf%3Aknows"/>
<rdf:object rdf:resource="foaf%3Aknows"/>
</rdf:Statement>
</rdf:RDF>
That's it
07 March 2009
A lightweight java parser for RDF
About one year ago, I wrote a lightweight java parser for RDF based on the Stream API for XML (Stax). It is far from being perfect as , for example, it does not handle the reified statements, xml:base, ... but it is small (24K) and works fine with most RDF files. Inspired by the XML SAX parsers, this RDF parser doesn't keep the statements in memory but calls a method
The code is available at
First we need a small internal class to record the content of each triple
First we scan the elements of the document until the <rdf:RDF> element is found. Then, the method
All the nodes are then scanned .The method parseDescription is called for each element.
The current element will be the subject of the triple.
The URI of this subject need to extracted.
First we check if this URI can be extracted from an attribute
If it was not found, the attribute
If it was not found, the attribute
If it was not found, this is an anonymous node. We create a random URI.
The qualified name of the element contains the
The other attributes of the current element may contains some new triples.
We then loop over the children of the current element. Those nodes are the predicates of the current subject. The method parsePredicate is called, each time a new element is found.
First the property attributes of the current element are scanned, and some new triples may be created. e.g:
During this process, the value of the attribute rdf:parseType is noted if it was found.
Furthermore, if there was an attribute rdf:resource, then this element is a new triple linking another resource.
If rdf:parseType="Literal" then we transform the children of the current node into a string, and a new triple is created.
If rdf:parseType="Resource", then the current node is a blank node: The rdf:Description will be omitted. A blank node is created and we call recursively parsePredicate using this blank node has the new subject.
If rdf:parseType="Collection", The children elements give the set of subject nodes of the collection. We call recursively parseDescription for each of these nodes.
Else this is the default rdf:parseType.
If a new element is found, then , this is the subject of a new resource (We call recursively parseDescription), else the current statement has a Literal as the object of this statement and we concatenate all the text.
The following code parses go.rdf.gz (1744 Ko) and returns the number of statements.
Result:
That's it.
Pierre
"found"
each time a triple is found. This method can be overridden to implement your own code.Source code
The code is available at
RDFEvent
First we need a small internal class to record the content of each triple
private static class RDFEvent
{
URI subject=null;
URI predicate=null;
Object value=null;
URI valueType=null;
String lang=null;
int listIndex=-1;
(...)
}
{
URI subject=null;
URI predicate=null;
Object value=null;
URI valueType=null;
String lang=null;
int listIndex=-1;
(...)
}
Searching for rdf:RDF
First we scan the elements of the document until the <rdf:RDF> element is found. Then, the method
parseRDF
is called.this.parser = this.factory.createXMLEventReader(in);
while(getReader().hasNext())
{
XMLEvent event = getReader().nextEvent();
if(event.isStartElement())
{
StartElement start=(StartElement)event;
if(name2string(start).equals(RDF.NS+"RDF"))
{
parseRDF();
}
}
}
while(getReader().hasNext())
{
XMLEvent event = getReader().nextEvent();
if(event.isStartElement())
{
StartElement start=(StartElement)event;
if(name2string(start).equals(RDF.NS+"RDF"))
{
parseRDF();
}
}
}
parseRDF: Searching the statements
All the nodes are then scanned .The method parseDescription is called for each element.
while(getReader().hasNext())
{
XMLEvent event = getReader().nextEvent();
if(event.isEndElement())
{
return;
}
else if(event.isStartElement())
{
parseDescription(event.asStartElement());
}
else if(event.isProcessingInstruction())
{
throw new XMLStreamException("Found Processing Instruction in RDF ???");
}
else if(event.isCharacters() &&
event.asCharacters().getData().trim().length()>0)
{
throw new XMLStreamException("Found text in RDF ???");
}
}
{
XMLEvent event = getReader().nextEvent();
if(event.isEndElement())
{
return;
}
else if(event.isStartElement())
{
parseDescription(event.asStartElement());
}
else if(event.isProcessingInstruction())
{
throw new XMLStreamException("Found Processing Instruction in RDF ???");
}
else if(event.isCharacters() &&
event.asCharacters().getData().trim().length()>0)
{
throw new XMLStreamException("Found text in RDF ???");
}
}
parseDescription: Parsing the subject of a triple
The current element will be the subject of the triple.
The URI of this subject need to extracted.
First we check if this URI can be extracted from an attribute
rdf:about
Attribute att= description.getAttributeByName(new QName(RDF.NS,"about"));
if(att!=null) descriptionURI= createURI( att.getValue());
if(att!=null) descriptionURI= createURI( att.getValue());
If it was not found, the attribute
rdf:nodeID
is searched:att= description.getAttributeByName(new QName(RDF.NS,"nodeID"));
if(att!=null) descriptionURI= createURI( att.getValue());
if(att!=null) descriptionURI= createURI( att.getValue());
If it was not found, the attribute
rdf:ID
is searched.att= description.getAttributeByName(new QName(RDF.NS,"ID"));
if(att!=null) descriptionURI= resolveBase(att.getValue());
if(att!=null) descriptionURI= resolveBase(att.getValue());
If it was not found, this is an anonymous node. We create a random URI.
descriptionURI= createAnonymousURI();
rdf:type
The qualified name of the element contains the
rdf:type
of this statement. We can emit a new triple about this type:QName qn=description.getName();
if(!(qn.getNamespaceURI().equals(RDF.NS) &&
qn.getLocalPart().equals("Description")))
{
RDFEvent evt= new RDFEvent();
evt.subject=descriptionURI;
evt.predicate= createURI(RDF.NS+"type");
evt.value=name2uri(qn);
found(evt);
}
if(!(qn.getNamespaceURI().equals(RDF.NS) &&
qn.getLocalPart().equals("Description")))
{
RDFEvent evt= new RDFEvent();
evt.subject=descriptionURI;
evt.predicate= createURI(RDF.NS+"type");
evt.value=name2uri(qn);
found(evt);
}
Other attributes
The other attributes of the current element may contains some new triples.
for(Iterator<?> i=description.getAttributes();
i.hasNext();)
{
att=(Attribute)i.next();
qn= att.getName();
String local= qn.getLocalPart();
if(qn.getNamespaceURI().equals(RDF.NS) &&
( local.equals("about") ||
local.equals("ID") ||
local.equals("nodeID")))
{
continue;
}
RDFEvent evt= new RDFEvent();
evt.subject=descriptionURI;
evt.predicate= name2uri(qn);
evt.value= att.getValue();
found(evt);
}
i.hasNext();)
{
att=(Attribute)i.next();
qn= att.getName();
String local= qn.getLocalPart();
if(qn.getNamespaceURI().equals(RDF.NS) &&
( local.equals("about") ||
local.equals("ID") ||
local.equals("nodeID")))
{
continue;
}
RDFEvent evt= new RDFEvent();
evt.subject=descriptionURI;
evt.predicate= name2uri(qn);
evt.value= att.getValue();
found(evt);
}
Searching the predicates
We then loop over the children of the current element. Those nodes are the predicates of the current subject. The method parsePredicate is called, each time a new element is found.
while(getReader().hasNext())
{
XMLEvent event = getReader().nextEvent();
if(event.isEndElement())
{
return descriptionURI;
}
else if(event.isStartElement())
{
parsePredicate(descriptionURI,event.asStartElement());
}
else if(event.isProcessingInstruction())
{
throw new XMLStreamException("Found Processing Instruction in RDF ???");
}
else if(event.isCharacters() &&
event.asCharacters().getData().trim().length()>0)
{
throw new XMLStreamException("Found text in RDF ??? \""+
event.asCharacters().getData()+"\""
);
}
}
{
XMLEvent event = getReader().nextEvent();
if(event.isEndElement())
{
return descriptionURI;
}
else if(event.isStartElement())
{
parsePredicate(descriptionURI,event.asStartElement());
}
else if(event.isProcessingInstruction())
{
throw new XMLStreamException("Found Processing Instruction in RDF ???");
}
else if(event.isCharacters() &&
event.asCharacters().getData().trim().length()>0)
{
throw new XMLStreamException("Found text in RDF ??? \""+
event.asCharacters().getData()+"\""
);
}
}
parsePredicate: Parsing the predicate of the current triple
First the property attributes of the current element are scanned, and some new triples may be created. e.g:
<rdf:Description ex:fullName="Dave Beckett">
<ex:homePage rdf:resource="http://purl.org/net/dajobe/"/>
</rdf:Description>
<ex:homePage rdf:resource="http://purl.org/net/dajobe/"/>
</rdf:Description>
During this process, the value of the attribute rdf:parseType is noted if it was found.
Furthermore, if there was an attribute rdf:resource, then this element is a new triple linking another resource.
<ex:homePage rdf:resource="http://purl.org/net/dajobe/"/>
If rdf:parseType="Literal" then we transform the children of the current node into a string, and a new triple is created.
if(parseType.equals("Literal"))
{
StringBuilder b= parseLiteral();
RDFEvent evt= new RDFEvent();
evt.subject=descriptionURI;
evt.predicate= predicateURI;
evt.value= b.toString();
evt.lang=lang;
evt.valueType=datatype;
found(evt);
}
{
StringBuilder b= parseLiteral();
RDFEvent evt= new RDFEvent();
evt.subject=descriptionURI;
evt.predicate= predicateURI;
evt.value= b.toString();
evt.lang=lang;
evt.valueType=datatype;
found(evt);
}
If rdf:parseType="Resource", then the current node is a blank node: The rdf:Description will be omitted. A blank node is created and we call recursively parsePredicate using this blank node has the new subject.
URI blanck = createAnonymousURI();
RDFEvent evt= new RDFEvent();
evt.subject=descriptionURI;
evt.predicate= predicateURI;
evt.value=blanck;
evt.lang=lang;
evt.valueType=datatype;
found(evt);
while(getReader().hasNext())
{
XMLEvent event = getReader().nextEvent();
if(event.isStartElement())
{
parsePredicate(blanck, event.asStartElement());
}
else if(event.isEndElement())
{
return;
}
}
RDFEvent evt= new RDFEvent();
evt.subject=descriptionURI;
evt.predicate= predicateURI;
evt.value=blanck;
evt.lang=lang;
evt.valueType=datatype;
found(evt);
while(getReader().hasNext())
{
XMLEvent event = getReader().nextEvent();
if(event.isStartElement())
{
parsePredicate(blanck, event.asStartElement());
}
else if(event.isEndElement())
{
return;
}
}
If rdf:parseType="Collection", The children elements give the set of subject nodes of the collection. We call recursively parseDescription for each of these nodes.
int index=0;
while(getReader().hasNext())
{
XMLEvent event = getReader().nextEvent();
if(event.isStartElement())
{
URI value= parseDescription(event.asStartElement());
RDFEvent evt= new RDFEvent();
evt.subject=descriptionURI;
evt.predicate= predicateURI;
evt.value=value;
evt.lang=lang;
evt.valueType=datatype;
evt.listIndex=(++index);
found(evt);
}
else if(event.isEndElement())
{
return;
}
}
while(getReader().hasNext())
{
XMLEvent event = getReader().nextEvent();
if(event.isStartElement())
{
URI value= parseDescription(event.asStartElement());
RDFEvent evt= new RDFEvent();
evt.subject=descriptionURI;
evt.predicate= predicateURI;
evt.value=value;
evt.lang=lang;
evt.valueType=datatype;
evt.listIndex=(++index);
found(evt);
}
else if(event.isEndElement())
{
return;
}
}
Else this is the default rdf:parseType.
If a new element is found, then , this is the subject of a new resource (We call recursively parseDescription), else the current statement has a Literal as the object of this statement and we concatenate all the text.
StringBuilder b= new StringBuilder();
while(getReader().hasNext())
{
XMLEvent event = getReader().nextEvent();
if(event.isStartElement())
{
URI childURI=parseDescription(event.asStartElement());
RDFEvent evt= new RDFEvent();
evt.subject=descriptionURI;
evt.predicate= predicateURI;
evt.value= childURI;
found(evt);
b.setLength(0);
foundResourceAsChild=true;
}
else if(event.isCharacters())
{
b.append(event.asCharacters().getData());
}
else if(event.isEndElement())
{
if(!foundResourceAsChild)
{
RDFEvent evt= new RDFEvent();
evt.subject=descriptionURI;
evt.predicate= predicateURI;
evt.value= b.toString();
evt.lang=lang;
evt.valueType=datatype;
found(evt);
}
else
{
if(b.toString().trim().length()!=0) throw new XMLStreamException("Found bad text "+b);
}
return;
}
}
while(getReader().hasNext())
{
XMLEvent event = getReader().nextEvent();
if(event.isStartElement())
{
URI childURI=parseDescription(event.asStartElement());
RDFEvent evt= new RDFEvent();
evt.subject=descriptionURI;
evt.predicate= predicateURI;
evt.value= childURI;
found(evt);
b.setLength(0);
foundResourceAsChild=true;
}
else if(event.isCharacters())
{
b.append(event.asCharacters().getData());
}
else if(event.isEndElement())
{
if(!foundResourceAsChild)
{
RDFEvent evt= new RDFEvent();
evt.subject=descriptionURI;
evt.predicate= predicateURI;
evt.value= b.toString();
evt.lang=lang;
evt.valueType=datatype;
found(evt);
}
else
{
if(b.toString().trim().length()!=0) throw new XMLStreamException("Found bad text "+b);
}
return;
}
}
Testing
The following code parses go.rdf.gz (1744 Ko) and returns the number of statements.
long now= System.currentTimeMillis();
URL url= new URL("ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/rdf/go.rdf.gz");
InputStream r= new GZIPInputStream(url.openStream());
RDFHandler h= new RDFHandler()
{
@Override
public void found(URI subject, URI predicate, Object value,
URI dataType, String lang, int index)
throws IOException {
++count;
}
};
h.parse(r);
r.close();
System.out.println("time:"+((System.currentTimeMillis()-now)/1000)+" secs count:"+count+" triples");
URL url= new URL("ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/rdf/go.rdf.gz");
InputStream r= new GZIPInputStream(url.openStream());
RDFHandler h= new RDFHandler()
{
@Override
public void found(URI subject, URI predicate, Object value,
URI dataType, String lang, int index)
throws IOException {
++count;
}
};
h.parse(r);
r.close();
System.out.println("time:"+((System.currentTimeMillis()-now)/1000)+" secs count:"+count+" triples");
Result:
time:17 secs count:188391 triples
That's it.
Pierre
03 March 2009
String Challenge: My (brute-force) solution.
In this post, I present my (brute-force/quick n'dirty) solution to the recent 'String Challenge' submited by Thomas Mailund on his blog: http://www.mailund.dk/index.php/2009/03/02/a-string-algorithms-challenge/. Briefly, here is the problem:
Given an input string X, an integer k and a frequency f, report all k-mers that occur with frequency higher than f. Expect the length of X to be from a few hundred thousands to a few millions and k to be between 5 and 20.
I wrote a simple java code to solve this problem. This is not big science as I used a brute-force algorithm, but you might be interested about how the k-mers were mapped to their occurrences. Here, my code uses the java implementation of the BerkeleyDB API (http://www.oracle.com/technology/documentation/berkeley-db/je/index.html) to map the k-mers to their occurrences.
The source code is available here:http://anybody.cephb.fr/perso/lindenb/tmp/StringChallenge.java (Opps, please remove this extra (c=fin.read();), I cannot change this at this time)
First the BerkeleyDB environment is initialized and we create a temporary database that will map the k-mers to the counts.
The sequence is then scanned and an array of bytes of length k is filled with the characters.
This array is filled, searched in the database and the count is incremented back in the BerkeleyDB. The the content of the array is shifted to the left.
At the end, in order to have a set of ordered results, a reverse database is created . It maps the counts to the k-mers.
//create a second database mapping count to k-mers
cfg= new DatabaseConfig();
cfg.setAllowCreate(true);
cfg.setReadOnly(false);
cfg.setTemporary(true);
cfg.setSortedDuplicates(true);
Database db2= env.openDatabase(null, "occ",cfg);
key = new DatabaseEntry();
value = new DatabaseEntry();
Cursor cursor= db.openCursor(null, null);
while(cursor.getNext(key, value,null)==OperationStatus.SUCCESS)
{
int count=IntegerBinding.entryToInt(value);
if((count/(float)total)< freq) continue;
db2.put(null, value, key);
}
cursor.close();
This second database is then scanned to print the ordered results.
That's it. Now I'd like to know how the 'elegant' solutions will be implemented :-)
Given an input string X, an integer k and a frequency f, report all k-mers that occur with frequency higher than f. Expect the length of X to be from a few hundred thousands to a few millions and k to be between 5 and 20.
I wrote a simple java code to solve this problem. This is not big science as I used a brute-force algorithm, but you might be interested about how the k-mers were mapped to their occurrences. Here, my code uses the java implementation of the BerkeleyDB API (http://www.oracle.com/technology/documentation/berkeley-db/je/index.html) to map the k-mers to their occurrences.
The source code is available here:http://anybody.cephb.fr/perso/lindenb/tmp/StringChallenge.java (Opps, please remove this extra (c=fin.read();), I cannot change this at this time)
The code
First the BerkeleyDB environment is initialized and we create a temporary database that will map the k-mers to the counts.
File envHome=new File(System.getProperty("java.io.tmpdir"));
EnvironmentConfig envCfg= new EnvironmentConfig();
envCfg.setAllowCreate(true);
envCfg.setReadOnly(false);
Environment env= new Environment(envHome,envCfg);
//create a first database mapping k-mers to count
DatabaseConfig cfg= new DatabaseConfig();
cfg.setAllowCreate(true);
cfg.setReadOnly(false);
cfg.setTemporary(true);
Database db= env.openDatabase(null, "kmers", cfg);
DatabaseEntry key = new DatabaseEntry();
DatabaseEntry value = new DatabaseEntry();
EnvironmentConfig envCfg= new EnvironmentConfig();
envCfg.setAllowCreate(true);
envCfg.setReadOnly(false);
Environment env= new Environment(envHome,envCfg);
//create a first database mapping k-mers to count
DatabaseConfig cfg= new DatabaseConfig();
cfg.setAllowCreate(true);
cfg.setReadOnly(false);
cfg.setTemporary(true);
Database db= env.openDatabase(null, "kmers", cfg);
DatabaseEntry key = new DatabaseEntry();
DatabaseEntry value = new DatabaseEntry();
The sequence is then scanned and an array of bytes of length k is filled with the characters.
FileReader fin= new FileReader(file);
byte array[]=new byte[kmer];
int c=-1;
int array_size=0;
//c=fin.read(); oopps, I should have removed this....
while((c=fin.read())!=-1)
{
if(Character.isWhitespace(c)) continue;
c=Character.toUpperCase(c);
array[array_size++]=(byte)c;
}
byte array[]=new byte[kmer];
int c=-1;
int array_size=0;
//c=fin.read(); oopps, I should have removed this....
while((c=fin.read())!=-1)
{
if(Character.isWhitespace(c)) continue;
c=Character.toUpperCase(c);
array[array_size++]=(byte)c;
}
This array is filled, searched in the database and the count is incremented back in the BerkeleyDB. The the content of the array is shifted to the left.
key.setData(array);
int count=0;
//is this data already exists ?
if(db.get(null,key, value, null)==OperationStatus.SUCCESS)
{
count =IntegerBinding.entryToInt(value);
}
IntegerBinding.intToEntry(count+1, value);
db.put(null,key, value);
//switch to left
for(int i=1;i< kmer;++i)
{
array[i-1]=array[i];
}
array_size--;
int count=0;
//is this data already exists ?
if(db.get(null,key, value, null)==OperationStatus.SUCCESS)
{
count =IntegerBinding.entryToInt(value);
}
IntegerBinding.intToEntry(count+1, value);
db.put(null,key, value);
//switch to left
for(int i=1;i< kmer;++i)
{
array[i-1]=array[i];
}
array_size--;
At the end, in order to have a set of ordered results, a reverse database is created . It maps the counts to the k-mers.
//create a second database mapping count to k-mers
cfg= new DatabaseConfig();
cfg.setAllowCreate(true);
cfg.setReadOnly(false);
cfg.setTemporary(true);
cfg.setSortedDuplicates(true);
Database db2= env.openDatabase(null, "occ",cfg);
key = new DatabaseEntry();
value = new DatabaseEntry();
Cursor cursor= db.openCursor(null, null);
while(cursor.getNext(key, value,null)==OperationStatus.SUCCESS)
{
int count=IntegerBinding.entryToInt(value);
if((count/(float)total)< freq) continue;
db2.put(null, value, key);
}
cursor.close();
This second database is then scanned to print the ordered results.
while(cursor.getNext(key, value,null)==OperationStatus.SUCCESS)
{
int count=IntegerBinding.entryToInt(key);
String seq= new String(value.getData(),value.getOffset(),value.getSize());
System.out.println(seq+"\t"+count+"/"+total);
}
{
int count=IntegerBinding.entryToInt(key);
String seq= new String(value.getData(),value.getOffset(),value.getSize());
System.out.println(seq+"\t"+count+"/"+total);
}
Compilation
javac -cp ${BERKELEY-JE-PATH}/lib/je-3.3.75.jar:. StringChallenge.java
Excecution
java -cp ${BERKELEY-JE-PATH}/lib/je-3.3.75.jar:. StringChallenge
Test with the Human chr22 (~49E6 bp)
time java -cp ${BERKELEY-JE-PATH}/lib/je-3.3.75.jar:. StringChallenge -k 8 -f 0.001 chr22.fa
AAAAAAAA 71811/49691430
TTTTTTTT 72474/49691430
NNNNNNNN 14840030/49691430
real 5m44.240s
user 5m56.186s
sys 0m3.178s
time java -cp ${BERKELEY-JE-PATH}/lib/je-3.3.75.jar:. StringChallenge -k 10 -f 0.001 chr22.fa
AAAAAAAAAA 50128/49691428
TTTTTTTTTT 50841/49691428
NNNNNNNNNN 14840010/49691428
real 8m15.152s
user 8m41.429s
sys 0m3.748s
AAAAAAAA 71811/49691430
TTTTTTTT 72474/49691430
NNNNNNNN 14840030/49691430
real 5m44.240s
user 5m56.186s
sys 0m3.178s
time java -cp ${BERKELEY-JE-PATH}/lib/je-3.3.75.jar:. StringChallenge -k 10 -f 0.001 chr22.fa
AAAAAAAAAA 50128/49691428
TTTTTTTTTT 50841/49691428
NNNNNNNNNN 14840010/49691428
real 8m15.152s
user 8m41.429s
sys 0m3.748s
That's it. Now I'd like to know how the 'elegant' solutions will be implemented :-)