YOKOFAKUN: March 2009

12 March 2009

A few nightmares before biohackathon 2009.

OK, after Scifoo 2007 (http://plindenbaum.blogspot.com/2007/07/scifoo-07-anxiety-from-homebody.html). Here are my apprehensions for BioHackathon 2009, you know, I'm lost and anxious when I cannot see the Peripherique ;-)

My notebook for a Stupid RDF Server

In this post, I'm writing my notes about the java package com.sun.net.httpserver.*. This package contains a lightweight HTTP server and I've tested it to create a simple-and-stupid RDF server.
The source code is available at:

StupidRDFServer.java

OK. First we need a 'Statement' class holding a RDF triple:

private static class Statement
        {
        /** subject of this statement */
        private URI subject;
        /** predicate of this statement */
        private URI predicate;
        /** value of this statement a String or a URI*/
        private Object value;


        boolean isLiteral()
            {
            return value.getClass()==String.class;
            }

        @Override
        public int hashCode() {
            final int prime = 31;
            int result = 1;
            result = prime * result + predicate.hashCode();
            result = prime * result + subject.hashCode();
            result = prime * result + value.hashCode();
            return result;
        }

        @Override
        public boolean equals(Object obj) {
            if (this == obj) return true;
            if (obj == null) return false;
            Statement other = (Statement) obj;
            return  subject.equals(other.subject) &&
                    predicate.equals(other.predicate) &&
                    value.equals(other.value)
                    ;
            }


        @Override
        public String toString() {
            return "<"+subject+"> <"+predicate+"> "+(isLiteral()
                    ?"\""+C.escape(String.class.cast(value))+"\""
                    :"<"+value+"> ."
                    );
            }
        }

The statements are just stored in a synchronized set. That is to say, we the server stops, all the statements are lost :-).

private Set<Statement> statements= Collections.synchronizedSet(new HashSet<Statement>());

The 'servlet' StupidRDFServer is a class that implementing com.sun.net.httpserver.HttpHandler, that is to say it must implement the method public void handle(HttpExchange http) throws IOException

The following code starts the server.

public static void main(String[] args)
          {
                try
                        {
                            HttpServer server = HttpServer.create(new InetSocketAddress(PORT), 0);
                            server.createContext(CONTEXT, new StupidRDFServer());
                            server.start();
                        }
                catch(IOException err)
                        {
                        err.printStackTrace();
                        }
          }

If the query is empty, the following form is returned to the client.

The 'Add Statements' button adds a statement in the list:

respHeader.add("Content-Type", "text/html");
                http.sendResponseHeaders(200, 0);
                boolean added=this.statements.add(stmt);
                printForm(out,(added?"Statement Added":"Statement already in model"));

The 'Query N3' button prints the triples, using the parameters of the form as a filter:

respHeader.add("Content-Type", "text/plain");
http.sendResponseHeaders(200, 0);
synchronized(statements)
 {
 Iterator<Statement> iter= statements.iterator();
 while(iter.hasNext())
  {
  Statement triple=iter.next();
  if(
  (stmt.subject==null || stmt.subject.equals(triple.subject)) &&
  (stmt.predicate==null || stmt.predicate.equals(triple.predicate)) &&
  (stmt.value==null || stmt.value.equals(triple.value))
  )
  {
  out.println(triple);
  }
  }
 }

The 'Query RDF' button prints the triples as RDF, using the parameters of the form as a filter. Example of output:

<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Statement>
 <rdf:subject rdf:resource="foaf%3Ame"/>
 <rdf:predicate rdf:resource="foaf%3Aname"/>
 <rdf:object>Pierre</rdf:object>
</rdf:Statement>
<rdf:Statement>
 <rdf:subject rdf:resource="urn%3Aother"/>
 <rdf:predicate rdf:resource="foaf%3Aknows"/>
 <rdf:object rdf:resource="foaf%3Aknows"/>
</rdf:Statement>
</rdf:RDF>

That's it

07 March 2009

A lightweight java parser for RDF

About one year ago, I wrote a lightweight java parser for RDF based on the Stream API for XML (Stax). It is far from being perfect as , for example, it does not handle the reified statements, xml:base, ... but it is small (24K) and works fine with most RDF files. Inspired by the XML SAX parsers, this RDF parser doesn't keep the statements in memory but calls a method "found" each time a triple is found. This method can be overridden to implement your own code.

Source code

The code is available at

http://code.google.com/p/lindenb/source/browse/trunk/src/java/org/lindenb/sw/io/RDFHandler.java

RDFEvent

First we need a small internal class to record the content of each triple

private static class RDFEvent
                {
                URI subject=null;
                URI predicate=null;
                Object value=null;
                URI valueType=null;
                String lang=null;
                int listIndex=-1;
  (...)
  }

Searching for rdf:RDF

First we scan the elements of the document until the <rdf:RDF> element is found. Then, the method parseRDF is called.

this.parser = this.factory.createXMLEventReader(in);

while(getReader().hasNext())
 {
 XMLEvent event = getReader().nextEvent();
 if(event.isStartElement())
  {
  StartElement start=(StartElement)event;
  if(name2string(start).equals(RDF.NS+"RDF"))
   {
   parseRDF();
   }
  }
 }

parseRDF: Searching the statements

All the nodes are then scanned .The method parseDescription is called for each element.

while(getReader().hasNext())
 {
 XMLEvent event = getReader().nextEvent();
 if(event.isEndElement())
  {
  return;
  }
 else if(event.isStartElement())
  {
  parseDescription(event.asStartElement());
  }
 else if(event.isProcessingInstruction())
  {
  throw new XMLStreamException("Found Processing Instruction in RDF ???");
  }
 else if(event.isCharacters() &&
  event.asCharacters().getData().trim().length()>0)
  {
  throw new XMLStreamException("Found text in RDF ???");
  }
 }

parseDescription: Parsing the subject of a triple

The current element will be the subject of the triple.
The URI of this subject need to extracted.
First we check if this URI can be extracted from an attribute rdf:about

Attribute att= description.getAttributeByName(new QName(RDF.NS,"about"));
if(att!=null) descriptionURI= createURI( att.getValue());

If it was not found, the attribute rdf:nodeID is searched:

att= description.getAttributeByName(new QName(RDF.NS,"nodeID"));
if(att!=null) descriptionURI=  createURI( att.getValue());

If it was not found, the attribute rdf:ID is searched.

att= description.getAttributeByName(new QName(RDF.NS,"ID"));
if(att!=null) descriptionURI= resolveBase(att.getValue());

If it was not found, this is an anonymous node. We create a random URI.

descriptionURI= createAnonymousURI();

rdf:type

The qualified name of the element contains the rdf:type of this statement. We can emit a new triple about this type:

QName qn=description.getName();
if(!(qn.getNamespaceURI().equals(RDF.NS) &&
 qn.getLocalPart().equals("Description")))
 {
 RDFEvent evt= new RDFEvent();
 evt.subject=descriptionURI;
 evt.predicate= createURI(RDF.NS+"type");
 evt.value=name2uri(qn);
 found(evt);
 }

Other attributes

The other attributes of the current element may contains some new triples.

for(Iterator<?> i=description.getAttributes();
 i.hasNext();)
 {
 att=(Attribute)i.next();
 qn= att.getName();
 String local= qn.getLocalPart();
 if(qn.getNamespaceURI().equals(RDF.NS) &&
  ( local.equals("about") ||
  local.equals("ID") ||
  local.equals("nodeID")))
  {
  continue;
  }
 RDFEvent evt= new RDFEvent();
 evt.subject=descriptionURI;
 evt.predicate= name2uri(qn);
 evt.value= att.getValue();
 found(evt);
 }

Searching the predicates

We then loop over the children of the current element. Those nodes are the predicates of the current subject. The method parsePredicate is called, each time a new element is found.

while(getReader().hasNext())
 {
 XMLEvent event = getReader().nextEvent();
 if(event.isEndElement())
  {
  return descriptionURI;
  }
 else if(event.isStartElement())
  {
  parsePredicate(descriptionURI,event.asStartElement());
  }
 else if(event.isProcessingInstruction())
  {
  throw new XMLStreamException("Found Processing Instruction in RDF ???");
  }
 else if(event.isCharacters() &&
   event.asCharacters().getData().trim().length()>0)
  {
  throw new XMLStreamException("Found text in RDF ??? \""+
   event.asCharacters().getData()+"\""
   );
  }
 }

parsePredicate: Parsing the predicate of the current triple

First the property attributes of the current element are scanned, and some new triples may be created. e.g:

<rdf:Description ex:fullName="Dave Beckett">
  <ex:homePage rdf:resource="http://purl.org/net/dajobe/"/>
</rdf:Description>

During this process, the value of the attribute rdf:parseType is noted if it was found.
Furthermore, if there was an attribute rdf:resource, then this element is a new triple linking another resource.

<ex:homePage rdf:resource="http://purl.org/net/dajobe/"/>

If rdf:parseType="Literal" then we transform the children of the current node into a string, and a new triple is created.

if(parseType.equals("Literal"))
 {
 StringBuilder b= parseLiteral();
 RDFEvent evt= new RDFEvent();
 evt.subject=descriptionURI;
 evt.predicate= predicateURI;
 evt.value= b.toString();
 evt.lang=lang;
 evt.valueType=datatype;
 found(evt);
 }

If rdf:parseType="Resource", then the current node is a blank node: The rdf:Description will be omitted. A blank node is created and we call recursively parsePredicate using this blank node has the new subject.

URI blanck = createAnonymousURI();
RDFEvent evt= new RDFEvent();
evt.subject=descriptionURI;
evt.predicate= predicateURI;
evt.value=blanck;
evt.lang=lang;
evt.valueType=datatype;
found(evt);

while(getReader().hasNext())
 {
 XMLEvent event = getReader().nextEvent();
 if(event.isStartElement())
  {
  parsePredicate(blanck, event.asStartElement());
  }
 else if(event.isEndElement())
  {
  return;
  }
 }

If rdf:parseType="Collection", The children elements give the set of subject nodes of the collection. We call recursively parseDescription for each of these nodes.

int index=0;
while(getReader().hasNext())
 {
 XMLEvent event = getReader().nextEvent();
 if(event.isStartElement())
  {
  URI value= parseDescription(event.asStartElement());
  RDFEvent evt= new RDFEvent();
  evt.subject=descriptionURI;
  evt.predicate= predicateURI;
  evt.value=value;
  evt.lang=lang;
  evt.valueType=datatype;
  evt.listIndex=(++index);

  found(evt);
  }
 else if(event.isEndElement())
  {
  return;
  }
 }

Else this is the default rdf:parseType.
If a new element is found, then , this is the subject of a new resource (We call recursively parseDescription), else the current statement has a Literal as the object of this statement and we concatenate all the text.

 StringBuilder b= new StringBuilder();
while(getReader().hasNext())
 {
 XMLEvent event = getReader().nextEvent();
 if(event.isStartElement())
  {
  URI childURI=parseDescription(event.asStartElement());
  RDFEvent evt= new RDFEvent();
  evt.subject=descriptionURI;
  evt.predicate= predicateURI;
  evt.value= childURI;
  found(evt);
  b.setLength(0);
  foundResourceAsChild=true;
  }
 else if(event.isCharacters())
  {
  b.append(event.asCharacters().getData());
  }
 else if(event.isEndElement())
  {
  if(!foundResourceAsChild)
   {
   RDFEvent evt= new RDFEvent();
   evt.subject=descriptionURI;
   evt.predicate= predicateURI;
   evt.value= b.toString();
   evt.lang=lang;
   evt.valueType=datatype;
   found(evt);
   }
  else
   {
   if(b.toString().trim().length()!=0) throw new XMLStreamException("Found bad text "+b);
   }
  return;
  }
 }

Testing

The following code parses go.rdf.gz (1744 Ko) and returns the number of statements.

long now= System.currentTimeMillis();
URL url= new URL("ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/rdf/go.rdf.gz");
InputStream r= new GZIPInputStream(url.openStream());
RDFHandler h= new RDFHandler()
{
@Override
public void found(URI subject, URI predicate, Object value,
URI dataType, String lang, int index)
throws IOException {
++count;
}
};
h.parse(r);
r.close();
System.out.println("time:"+((System.currentTimeMillis()-now)/1000)+" secs count:"+count+" triples");

Result:

time:17 secs count:188391 triples

That's it.

Pierre

03 March 2009

String Challenge: My (brute-force) solution.

In this post, I present my (brute-force/quick n'dirty) solution to the recent 'String Challenge' submited by Thomas Mailund on his blog: http://www.mailund.dk/index.php/2009/03/02/a-string-algorithms-challenge/. Briefly, here is the problem:

Given an input string X, an integer k and a frequency f, report all k-mers that occur with frequency higher than f. Expect the length of X to be from a few hundred thousands to a few millions and k to be between 5 and 20.

I wrote a simple java code to solve this problem. This is not big science as I used a brute-force algorithm, but you might be interested about how the k-mers were mapped to their occurrences. Here, my code uses the java implementation of the BerkeleyDB API (http://www.oracle.com/technology/documentation/berkeley-db/je/index.html) to map the k-mers to their occurrences.

The source code is available here:http://anybody.cephb.fr/perso/lindenb/tmp/StringChallenge.java (Opps, please remove this extra (c=fin.read();), I cannot change this at this time)

The code

First the BerkeleyDB environment is initialized and we create a temporary database that will map the k-mers to the counts.

File envHome=new File(System.getProperty("java.io.tmpdir"));
        EnvironmentConfig envCfg= new EnvironmentConfig();
        envCfg.setAllowCreate(true);
        envCfg.setReadOnly(false);
        Environment env= new Environment(envHome,envCfg);

        //create a first database mapping k-mers to count
        DatabaseConfig cfg= new DatabaseConfig();
        cfg.setAllowCreate(true);
        cfg.setReadOnly(false);
        cfg.setTemporary(true);
        Database db= env.openDatabase(null, "kmers", cfg);
        DatabaseEntry key = new DatabaseEntry();
        DatabaseEntry value = new DatabaseEntry();

The sequence is then scanned and an array of bytes of length k is filled with the characters.

       FileReader fin= new FileReader(file);
        byte array[]=new  byte[kmer];
        int c=-1;
        int array_size=0;
      //c=fin.read(); oopps, I should have removed this....


        while((c=fin.read())!=-1)
            {
            if(Character.isWhitespace(c)) continue;
            c=Character.toUpperCase(c);
            array[array_size++]=(byte)c;

           }

This array is filled, searched in the database and the count is incremented back in the BerkeleyDB. The the content of the array is shifted to the left.

 key.setData(array);

                int count=0;
                //is this data already exists ?
                if(db.get(null,key, value, null)==OperationStatus.SUCCESS)
                    {
                    count =IntegerBinding.entryToInt(value);
                    }

               IntegerBinding.intToEntry(count+1, value);
               db.put(null,key, value);

                //switch to left
                for(int i=1;i< kmer;++i)
                        {
                        array[i-1]=array[i];
                        }
                array_size--;

At the end, in order to have a set of ordered results, a reverse database is created . It maps the counts to the k-mers.

//create a second database mapping count to k-mers
        cfg= new DatabaseConfig();
        cfg.setAllowCreate(true);
        cfg.setReadOnly(false);
        cfg.setTemporary(true);
        cfg.setSortedDuplicates(true);
        Database db2= env.openDatabase(null, "occ",cfg);
        key = new DatabaseEntry();
        value = new DatabaseEntry();
        
        Cursor cursor= db.openCursor(null, null);
        while(cursor.getNext(key, value,null)==OperationStatus.SUCCESS)
            {
            int count=IntegerBinding.entryToInt(value);
            if((count/(float)total)< freq) continue;
            db2.put(null, value, key);
            }
        cursor.close();

This second database is then scanned to print the ordered results.

 while(cursor.getNext(key, value,null)==OperationStatus.SUCCESS)
            {
            int count=IntegerBinding.entryToInt(key);
            String seq= new String(value.getData(),value.getOffset(),value.getSize());
            System.out.println(seq+"\t"+count+"/"+total);
            }

Compilation

javac -cp ${BERKELEY-JE-PATH}/lib/je-3.3.75.jar:. StringChallenge.java

Excecution

java -cp ${BERKELEY-JE-PATH}/lib/je-3.3.75.jar:. StringChallenge

Test with the Human chr22 (~49E6 bp)

time java -cp  ${BERKELEY-JE-PATH}/lib/je-3.3.75.jar:. StringChallenge -k 8 -f 0.001 chr22.fa
AAAAAAAA    71811/49691430
TTTTTTTT    72474/49691430
NNNNNNNN    14840030/49691430

real    5m44.240s
user    5m56.186s
sys    0m3.178s

time java -cp ${BERKELEY-JE-PATH}/lib/je-3.3.75.jar:. StringChallenge -k 10 -f  0.001 chr22.fa
AAAAAAAAAA      50128/49691428
TTTTTTTTTT      50841/49691428
NNNNNNNNNN      14840010/49691428

real    8m15.152s
user    8m41.429s
sys     0m3.748s

That's it. Now I'd like to know how the 'elegant' solutions will be implemented :-)