
12 June 2009

RDF storage with a Key/Value Engine

Just curious: can I store some RDF statements in a key/value engine like BerkeleyDB (Java Edition)?
Yes, it's like re-inventing the wheel but, again, I like re-inventing the wheel :-)

A BerkeleyDB database contains a set of key/value pairs, e.g.:

Key                 | Value
--------------------|----------------------------------
SecurityNumber:9877 | FirstName:John LastName:Doe
SecurityNumber:9899 | FirstName:Peter LastName:Parker
SecurityNumber:9988 | FirstName:Edith LastName:Parker

Data are stored as arrays of bytes. Keys are compared byte by byte and the records are stored in a B-Tree. A database can be configured to allow unique or duplicate keys.
In BerkeleyDB, a Cursor is an iterator used to scan the database: as the keys are sorted, accessing a range of keys is very fast.
Some indexes (Secondary Databases) can be linked to a database. For example, in the previous table, if you wanted to quickly access the persons having LastName=="Parker", you would create a secondary database on LastName. Deleting an item in the secondary database automatically deletes the corresponding item in the main database. As far as I know, you cannot create a secondary database if your primary database allows duplicate keys.





Key2            | Key1                | Value1
----------------|---------------------|---------------------------------
LastName:Doe    | SecurityNumber:9877 | FirstName:John LastName:Doe
LastName:Parker | SecurityNumber:9899 | FirstName:Peter LastName:Parker
LastName:Parker | SecurityNumber:9988 | FirstName:Edith LastName:Parker
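
As an illustration (this snippet is not from the original post), creating that LastName index with the BerkeleyDB JE API looks roughly like this; it assumes an open Environment 'environment', a primary Database 'personDB', and that the values are stored as plain UTF-8 strings of the form "FirstName:... LastName:...":

SecondaryConfig lastNameCfg= new SecondaryConfig();
lastNameCfg.setAllowCreate(true);
lastNameCfg.setSortedDuplicates(true); /* several persons may share the same LastName */
lastNameCfg.setKeyCreator(new SecondaryKeyCreator()
{
public boolean createSecondaryKey(SecondaryDatabase secondary,
DatabaseEntry key,
DatabaseEntry data,
DatabaseEntry result) throws DatabaseException
{
/* extract "LastName:xxxx" from the primary value and use it as the secondary key */
String value= new String(data.getData());
int i= value.indexOf("LastName:");
if(i==-1) return false; /* no secondary key for this record */
result.setData(value.substring(i).getBytes());
return true;
}
});
SecondaryDatabase lastName2person= environment.openSecondaryDatabase(null,"lastName2person", personDB, lastNameCfg);

/* find every person named "Parker" */
SecondaryCursor c= lastName2person.openSecondaryCursor(null, null);
DatabaseEntry k= new DatabaseEntry("LastName:Parker".getBytes());
DatabaseEntry v= new DatabaseEntry();
if(c.getSearchKey(k, v, LockMode.DEFAULT)==OperationStatus.SUCCESS)
{
do { System.out.println(new String(v.getData())); }
while(c.getNextDup(k, v, LockMode.DEFAULT)==OperationStatus.SUCCESS);
}
c.close();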


OK, now I want to store some RDF statements, that is to say something like the following triple:
{
SUBJECT = RESOURCE;
PREDICATE = RESOURCE;
OBJECT = ( RESOURCE || LITERAL)
}

I need an index to quickly find any statement matching one, two or all three components of the statement.
In the solution I've implemented, there is only one primary database: all the components of a statement are part of the 'DATA', and the 'KEY' of the database is just a unique number.





Key | Value
----|-----------
1   | (s1,p1,o1)
2   | (s2,p2,o2)
3   | (s3,p3,o3)

Some secondary databases (subject2triple, predicate2triple, objectLiteral2triple, objectRsrc2triple) are used to quickly find each RDF statement from a given node.

Creating the Database 'triplesDB' :
EnvironmentConfig envCfg= new EnvironmentConfig();
envCfg.setAllowCreate(true);
this.environment= new Environment(envFile,envCfg);
DatabaseConfig cfg= new DatabaseConfig();
cfg.setAllowCreate(true);
cfg.setSortedDuplicates(false);
this.triplesDB= this.environment.openDatabase( null, "triples", cfg);


We then create the secondary indexes for each component of a statement (subject2triple, predicate2triple, objectLiteral2triple, objectRsrc2triple). For example, the following code opens the secondary database 'objectRsrc2triple'. We create a 'SecondaryKeyCreator' to tell BerkeleyDB how it should extract the secondary key from the primary data.

config2= new SecondaryConfig();
config2.setAllowCreate(true);
config2.setSortedDuplicates(true);
config2.setKeyCreator(new SecondaryKeyCreator()
{
@Override
public boolean createSecondaryKey(SecondaryDatabase arg0,
DatabaseEntry key,
DatabaseEntry data,
DatabaseEntry result) throws DatabaseException {
Statement stmt= STMT_VALUE_BINDING.entryToObject(data);
if(!stmt.getValue().isResource()) return false;
Resource L= Resource.class.cast(stmt.getValue());
TupleOutput out= new TupleOutput();
saveResource(L, out);
result.setData(out.toByteArray());
return true;
}
});
this.objectRsrc2triple=this.environment.openSecondaryDatabase(null,"objectRsrc2triple", triplesDB, config2);

We need some methods to write and read the components of a Statement from/to an array of bytes:
public void objectToEntry(Statement stmt, TupleOutput out)
{
saveResource(stmt.getSubject(),out);
saveResource(stmt.getPredicate(),out);
if(stmt.getValue().isResource())
{
out.writeByte(OPCODE_RESOURCE);
saveResource(Resource.class.cast(stmt.getValue()),out);
}
else
{
out.writeByte(OPCODE_LITERAL);
saveLiteral(Literal.class.cast(stmt.getValue()),out);
}
}

public Statement entryToObject(TupleInput in)
{
Resource subject = readResource(in);
Resource predicate = readResource(in);
RDFNode object=null;
switch(in.readByte())
{
case OPCODE_RESOURCE:
{
object= readResource(in);
break;
}
case OPCODE_LITERAL:
{
object= readLiteral(in);
break;
}
default: throw new IllegalStateException("Unknown opcode");
}
return new Statement(subject,predicate,object);
}
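
The helper methods used above just write and read the node as a single string (they also appear in the full source at the end of this post):

private static void saveResource(Resource rsrc,TupleOutput out)
{
out.writeString(rsrc.getURI());
}

private static Resource readResource(TupleInput in)
{
return new Resource(in.readString());
}

private static void saveLiteral(Literal literal,TupleOutput out)
{
out.writeString(literal.getLexicalForm());
}

private static Literal readLiteral(TupleInput in)
{
return new Literal(in.readString());
}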

I've also wrapped the Cursor in a java.util.Iterator. One interesting iterator is the JoinIterator, which quickly retrieves the *common* statements returned by a set of distinct 'Cursors'. On the first call, we position each cursor on its searched key; then we let the BerkeleyDB API find the intersection of those cursors.
private class JoinIterator
extends AbstractIterator<Statement>
{
/** the joined iterators */
protected List<CursorAndEntries> cursorEntries;
/** our join cursor */
private JoinCursor joinCursor=null;
/** current key */
protected DatabaseEntry keyEntry=new DatabaseEntry();
/** current value */
protected DatabaseEntry valueEntry=new DatabaseEntry();
protected JoinIterator(List<CursorAndEntries> cursorsEntries)
{
this.cursorEntries=new ArrayList<CursorAndEntries>(cursorsEntries);
}
protected Statement readNext() throws DatabaseException
{
if(super._firstCall)
{
super._firstCall=false;
for(CursorAndEntries ca:this.cursorEntries)
{
if(ca.cursor.getSearchKey(ca.keyEntry,ca.valueEntry,null)!=OperationStatus.SUCCESS)
{
return null;
}
}
Cursor cursors[]= new Cursor[this.cursorEntries.size()];
for(int i=0;i< this.cursorEntries.size();++i)
{
cursors[i]= cursorEntries.get(i).cursor;
}
joinCursor = BerkeleyDBModel.this.triplesDB.join(cursors, null);
}
if(joinCursor.getNext(keyEntry, valueEntry,LockMode.DEFAULT)==OperationStatus.SUCCESS)
{
return STMT_VALUE_BINDING.entryToObject(valueEntry);
}
else
{
return null;
}
}

@Override
public void close()
{
for(CursorAndEntries cursor:cursorEntries)
{
try {
if(cursor.cursor!=null) cursor.cursor.close();
cursor.cursor=null;
} catch (Exception e) {
e.printStackTrace();
}
}
try {
if(joinCursor!=null) joinCursor.close();
joinCursor=null;
} catch (Exception e) {
e.printStackTrace();
}
}
}

This JoinCursor is then used for any specific query. For example, the following method 'listStatements' takes three parameters (s,p,o) and returns an iterator over the matching statements. If a parameter is null, it is used as a wildcard: this is where we use the secondary databases to find the statements.
public CloseableIterator<Statement> listStatements(
Resource s,
Resource p,
RDFNode o) throws RDFException
{
if(s==null && p==null && o==null)
{
return listStatements();
}
try
{
List<CursorAndEntries> cursors= new ArrayList<CursorAndEntries>(3);
if(s!=null)/** subject is not null */
{
CursorAndEntries ca= new CursorAndEntries();
ca.cursor=this.subject2triple.openCursor(null, null);
RSRC_KEY_BINDING.objectToEntry(s, ca.keyEntry);
cursors.add(ca);
}
if(p!=null)/** predicate is not null */
{
CursorAndEntries ca= new CursorAndEntries();
ca.cursor=this.predicate2triple.openCursor(null, null);
RSRC_KEY_BINDING.objectToEntry(p, ca.keyEntry);
cursors.add(ca);
}
if(o!=null)/** object is not null */
{
if(o.isResource())
{
CursorAndEntries ca= new CursorAndEntries();
ca.cursor=this.objectRsrc2triple.openCursor(null, null);
RSRC_KEY_BINDING.objectToEntry(Resource.class.cast(o), ca.keyEntry);
cursors.add(ca);
}
else
{
CursorAndEntries ca= new CursorAndEntries();
ca.cursor=this.objectLiteral2triple.openCursor(null, null);
LITERAL_KEY_BINDING.objectToEntry(Literal.class.cast(o), ca.keyEntry);
cursors.add(ca);
}
}
return new JoinIterator(cursors);
}
catch (DatabaseException e)
{
throw new RDFException(e);
}
}
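
For reference, here is a minimal usage sketch of this class (not in the original post; the URIs are arbitrary examples and the environment directory must already exist):

BerkeleyDBModel model= null;
try
{
model= new BerkeleyDBModel(new File("/tmp/rdfdb"));
Resource tim  = model.createResource("http://example.org/TimBL");
Resource knows= model.createResource("http://xmlns.com/foaf/0.1/knows");
Resource dan  = model.createResource("http://example.org/DanC");
model.add(tim, knows, dan);
model.add(tim, model.createResource("http://purl.org/dc/elements/1.1/title"), model.createLiteral("Tim"));

/* wildcard query: every statement having 'tim' as subject */
CloseableIterator<Statement> iter= model.listStatements(tim, null, null);
while(iter.hasNext())
{
Statement stmt= iter.next();
System.out.println(stmt.getSubject()+" "+stmt.getPredicate()+" "+stmt.getValue());
}
iter.close();
}
catch(Exception err)
{
err.printStackTrace();
}
finally
{
if(model!=null) try { model.close(); } catch(Exception err) { err.printStackTrace(); }
}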


Performance


This RDFStore was used to download and parse a remote gzipped file from geneontology.org. The uncompressed file is ~61.0 Mo and contains 774,579 statements. It took about 6 min to download and digest the file. Remember that each time a statement is about to be inserted, we need to check that it doesn't already exist in the database. The amount of space required to store the database was 591 Mo (ouch !!).

This code was my first idea about how to solve this problem. Obviously, some other engines are far more efficient :-) : "(...) Good progress though, last night's run yielded 5 billion triples loaded in just under 10 hours for an average throughput of 135k triples per second. Max throughput was just above 210k triples per second. 1 billion triples was reached in an astonishing 78 minutes. (...)"

As a conclusion, here is the full source code for the first version of this storage engine:
package org.lindenb.sw.model;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import javax.xml.stream.XMLStreamException;
import org.lindenb.sw.RDFException;
import org.lindenb.sw.io.RDFHandler;
import org.lindenb.sw.nodes.Literal;
import org.lindenb.sw.nodes.RDFNode;
import org.lindenb.sw.nodes.Resource;
import org.lindenb.sw.nodes.Statement;
import org.lindenb.util.iterator.CloseableIterator;

import com.sleepycat.bind.tuple.IntegerBinding;
import com.sleepycat.bind.tuple.TupleBinding;
import com.sleepycat.bind.tuple.TupleInput;
import com.sleepycat.bind.tuple.TupleOutput;
import com.sleepycat.je.Cursor;
import com.sleepycat.je.Database;
import com.sleepycat.je.DatabaseConfig;
import com.sleepycat.je.DatabaseEntry;
import com.sleepycat.je.DatabaseException;
import com.sleepycat.je.Environment;
import com.sleepycat.je.EnvironmentConfig;
import com.sleepycat.je.JoinCursor;
import com.sleepycat.je.LockMode;
import com.sleepycat.je.OperationStatus;
import com.sleepycat.je.SecondaryConfig;
import com.sleepycat.je.SecondaryDatabase;
import com.sleepycat.je.SecondaryKeyCreator;

public class BerkeleyDBModel
{
private static final byte OPCODE_RESOURCE = 'R';
private static final byte OPCODE_LITERAL = 'L';

private Environment environment;
private Database triplesDB;
private SecondaryDatabase subject2triple= null;
private SecondaryDatabase predicate2triple= null;
private SecondaryDatabase objectLiteral2triple= null;
private SecondaryDatabase objectRsrc2triple= null;

private static final StatementBinding STMT_VALUE_BINDING= new StatementBinding();
private static final ResourceBinding RSRC_KEY_BINDING= new ResourceBinding();
private static final LiteralBinding LITERAL_KEY_BINDING= new LiteralBinding();

/**
* TupleBinding for a Resource
*
*/
private static class ResourceBinding
extends TupleBinding<Resource>
{
public void objectToEntry(Resource rsrc, TupleOutput out)
{
saveResource(rsrc,out);
}

public Resource entryToObject(TupleInput in)
{
return readResource(in);
}
}

/**
* TupleBinding for a Literal
*
*/
private static class LiteralBinding
extends TupleBinding<Literal>
{
public void objectToEntry(Literal rsrc, TupleOutput out)
{
saveLiteral(rsrc,out);
}

public Literal entryToObject(TupleInput in)
{
return readLiteral(in);
}
}

/**
* TupleBinding for a Statement
*
*/
private static class StatementBinding
extends TupleBinding<Statement>
{
public void objectToEntry(Statement stmt, TupleOutput out)
{
saveResource(stmt.getSubject(),out);
saveResource(stmt.getPredicate(),out);
if(stmt.getValue().isResource())
{
out.writeByte(OPCODE_RESOURCE);
saveResource(Resource.class.cast(stmt.getValue()),out);
}
else
{
out.writeByte(OPCODE_LITERAL);
saveLiteral(Literal.class.cast(stmt.getValue()),out);
}
}

public Statement entryToObject(TupleInput in)
{
Resource subject = readResource(in);
Resource predicate = readResource(in);
RDFNode object=null;
switch(in.readByte())
{
case OPCODE_RESOURCE:
{
object= readResource(in);
break;
}
case OPCODE_LITERAL:
{
object= readLiteral(in);
break;
}
default: throw new IllegalStateException("Unknown opcode");
}
return new Statement(subject,predicate,object);
}
}






/**
* AbstractIterator
*/
private abstract class AbstractIterator<T>
implements CloseableIterator<T>
{
protected T _object=null;
private boolean _nextTested=false;
private boolean _hasNext=false;
protected boolean _firstCall=true;
protected AbstractIterator()
{
}

@Override
public void remove() {
throw new UnsupportedOperationException();
}

@Override
public boolean hasNext()
{
if(_nextTested) return _hasNext;
_nextTested=true;
_hasNext=false;

T obj=null;
try
{
obj=readNext();
_firstCall=false;
if(obj!=null)
{
_object=obj;
_hasNext=true;
}
}
catch(DatabaseException err)
{
err.printStackTrace();
}

if(!_hasNext)
{
close();
}
return _hasNext;
}

protected abstract T readNext() throws DatabaseException;

@Override
public T next()
{
if(!_nextTested)
{
if(!hasNext()) throw new IllegalStateException();
}
_nextTested=false;
_hasNext=false;

T x= _object;
_object=null;
return x;
}

@Override
public abstract void close();
}


/**
* CursorIterator
*/
private abstract class CursorIterator<T>
extends AbstractIterator<T>
{
protected Cursor cursor;
private DatabaseEntry keyEntry;
private DatabaseEntry valueEntry;

protected CursorIterator(Cursor cursor)
{
this.cursor=cursor;
this.keyEntry= new DatabaseEntry();
this.valueEntry= new DatabaseEntry();
}

@Override
public void remove() {
throw new UnsupportedOperationException();
}

protected abstract T readNext(DatabaseEntry key,DatabaseEntry value) throws DatabaseException;


protected final T readNext() throws DatabaseException
{
if(this.cursor==null) return null;
return readNext(this.keyEntry,this.valueEntry);
}

@Override
public void close()
{
if(this.cursor!=null)
{
try { this.cursor.close(); }
catch (Exception e) { e.printStackTrace();}
}
this.cursor=null;
}
}

/**
* A container for the 3 values used
* in the following next JoinIterator
**/
private static class CursorAndEntries
{
Cursor cursor;
DatabaseEntry keyEntry=new DatabaseEntry();
DatabaseEntry valueEntry=new DatabaseEntry();
}

/**
* JoinIterator
*/
private class JoinIterator
extends AbstractIterator<Statement>
{
/** the joined iterators */
protected List<CursorAndEntries> cursorEntries;
/** our join cursor */
private JoinCursor joinCursor=null;
/** current key */
protected DatabaseEntry keyEntry=new DatabaseEntry();
/** current value */
protected DatabaseEntry valueEntry=new DatabaseEntry();
protected JoinIterator(List<CursorAndEntries> cursorsEntries)
{
this.cursorEntries=new ArrayList<CursorAndEntries>(cursorsEntries);
}

@Override
public void remove() {
throw new UnsupportedOperationException();
}

protected Statement readNext() throws DatabaseException
{
if(super._firstCall)
{
super._firstCall=false;
for(CursorAndEntries ca:this.cursorEntries)
{
if(ca.cursor.getSearchKey(ca.keyEntry,ca.valueEntry,null)!=OperationStatus.SUCCESS)
{
return null;
}
}
Cursor cursors[]= new Cursor[this.cursorEntries.size()];
for(int i=0;i< this.cursorEntries.size();++i)
{
cursors[i]= cursorEntries.get(i).cursor;
}
joinCursor = BerkeleyDBModel.this.triplesDB.join(cursors, null);
}
if(joinCursor.getNext(keyEntry, valueEntry, LockMode.DEFAULT)==OperationStatus.SUCCESS)
{
return STMT_VALUE_BINDING.entryToObject(valueEntry);
}
else
{
return null;
}
}

@Override
public void close()
{

for(CursorAndEntries cursor:cursorEntries)
{
try {
if(cursor.cursor!=null) cursor.cursor.close();
cursor.cursor=null;
} catch (Exception e) {
e.printStackTrace();
}
}
try {
if(joinCursor!=null) joinCursor.close();
joinCursor=null;
} catch (Exception e) {
e.printStackTrace();
}
}
}


/**
*
* BerkeleyDBModel
*
*/
public BerkeleyDBModel(
File envFile
) throws RDFException
{

try {
EnvironmentConfig envCfg= new EnvironmentConfig();
envCfg.setAllowCreate(true);
this.environment= new Environment(envFile,envCfg);


DatabaseConfig cfg= new DatabaseConfig();
cfg.setAllowCreate(true);
cfg.setSortedDuplicates(false);


this.triplesDB= this.environment.openDatabase(
null, "triples", cfg);


/* create secondary key on literal as value */
SecondaryConfig config2= new SecondaryConfig();
config2.setAllowCreate(true);
config2.setSortedDuplicates(true);

config2.setKeyCreator(new SecondaryKeyCreator()
{
@Override
public boolean createSecondaryKey(SecondaryDatabase arg0,
DatabaseEntry key,
DatabaseEntry data,
DatabaseEntry result) throws DatabaseException
{
Statement stmt= STMT_VALUE_BINDING.entryToObject(data);
if(stmt.getValue().isResource()) return false;
Literal L= Literal.class.cast(stmt.getValue());
TupleOutput out= new TupleOutput();
saveLiteral(L, out);
result.setData(out.toByteArray());
return true;
}
});
this.objectLiteral2triple=this.environment.openSecondaryDatabase(null,"objectLiteral2triple", triplesDB, config2);

/* create secondary key on resource as value */
config2= new SecondaryConfig();
config2.setAllowCreate(true);
config2.setSortedDuplicates(true);
config2.setKeyCreator(new SecondaryKeyCreator()
{
@Override
public boolean createSecondaryKey(SecondaryDatabase arg0,
DatabaseEntry key,
DatabaseEntry data,
DatabaseEntry result) throws DatabaseException {
Statement stmt= STMT_VALUE_BINDING.entryToObject(data);
if(!stmt.getValue().isResource()) return false;
Resource L= Resource.class.cast(stmt.getValue());
TupleOutput out= new TupleOutput();
saveResource(L, out);
result.setData(out.toByteArray());
return true;
}
});
this.objectRsrc2triple=this.environment.openSecondaryDatabase(null,"objectRsrc2triple", triplesDB, config2);

/* create secondary key on subject */
config2= new SecondaryConfig();
config2.setAllowCreate(true);
config2.setSortedDuplicates(true);
config2.setKeyCreator(new SecondaryKeyCreator()
{
@Override
public boolean createSecondaryKey(SecondaryDatabase arg0,
DatabaseEntry key,
DatabaseEntry data,
DatabaseEntry result) throws DatabaseException {
Statement stmt= STMT_VALUE_BINDING.entryToObject(data);
TupleOutput out= new TupleOutput();
saveResource(stmt.getSubject(), out);
result.setData(out.toByteArray());
return true;
}
});
this.subject2triple=this.environment.openSecondaryDatabase(null, "subject2triple", triplesDB, config2);


/* create secondary key on predicate */
config2= new SecondaryConfig();
config2.setAllowCreate(true);
config2.setSortedDuplicates(true);
config2.setKeyCreator(new SecondaryKeyCreator()
{
@Override
public boolean createSecondaryKey(SecondaryDatabase arg0,
DatabaseEntry key,
DatabaseEntry data,
DatabaseEntry result) throws DatabaseException {
Statement stmt= STMT_VALUE_BINDING.entryToObject(data);
TupleOutput out= new TupleOutput();
saveResource(stmt.getPredicate(), out);
result.setData(
out.getBufferBytes(),
out.getBufferOffset(),
out.getBufferLength()
);
return true;
}
});
this.predicate2triple=this.environment.openSecondaryDatabase(null, "predicate2triple", triplesDB, config2);
}
catch (DatabaseException e)
{
throw new RDFException(e);
}
}

/** Close this model */
public void close() throws RDFException
{
try
{
subject2triple.close();
predicate2triple.close();
objectLiteral2triple.close();
objectRsrc2triple.close();
triplesDB.close();
environment.close();
} catch(DatabaseException err)
{
throw new RDFException(err);
}
subject2triple=null;
predicate2triple=null;
objectLiteral2triple=null;
objectRsrc2triple=null;
triplesDB=null;
environment=null;
}

public void clear() throws RDFException
{
try {
DatabaseEntry key= new DatabaseEntry();
DatabaseEntry data= new DatabaseEntry();
Cursor c= triplesDB.openCursor(null, null);
while(c.getNext(key, data, null)==OperationStatus.SUCCESS)
{
c.delete();
}
c.close();
} catch (Exception e) {
e.printStackTrace();
}
}


protected Environment getEnvironment()
{
return this.environment;
}

protected Database getTripleDB()
{
return this.triplesDB;
}

public CloseableIterator<Resource> listSubjects() throws RDFException
{
try {
return new CursorIterator<Resource>(this.subject2triple.openCursor(null, null))
{
@Override
protected Resource readNext(DatabaseEntry key,DatabaseEntry value)
throws DatabaseException
{
if(this.cursor.getNext(key, value, null)==OperationStatus.SUCCESS)
{
return RSRC_KEY_BINDING.entryToObject(value);
}
return null;
}
};
} catch (Exception e)
{
throw new RDFException(e);
}

}


public CloseableIterator<Statement> listStatements() throws RDFException
{
try {
return new CursorIterator<Statement>(this.triplesDB.openCursor(null, null))
{
@Override
protected Statement readNext(DatabaseEntry key,DatabaseEntry
value) throws DatabaseException
{
if(this.cursor.getNext(key, value, null)==OperationStatus.SUCCESS)
{
return STMT_VALUE_BINDING.entryToObject(value);
}
return null;
}
};
} catch (DatabaseException e)
{
throw new RDFException(e);
}
}


public CloseableIterator<Statement> listStatements(
Resource s,
Resource p,
RDFNode o) throws RDFException
{
if(s==null && p==null && o==null)
{
return listStatements();
}
try
{
List<CursorAndEntries> cursors= new ArrayList<CursorAndEntries>(3);
if(s!=null)
{
CursorAndEntries ca= new CursorAndEntries();
ca.cursor=this.subject2triple.openCursor(null, null);
RSRC_KEY_BINDING.objectToEntry(s, ca.keyEntry);
cursors.add(ca);
}
if(p!=null)
{
CursorAndEntries ca= new CursorAndEntries();
ca.cursor=this.predicate2triple.openCursor(null, null);
RSRC_KEY_BINDING.objectToEntry(p, ca.keyEntry);
cursors.add(ca);
}
if(o!=null)
{
if(o.isResource())
{
CursorAndEntries ca= new CursorAndEntries();
ca.cursor=this.objectRsrc2triple.openCursor(null, null);
RSRC_KEY_BINDING.objectToEntry(Resource.class.cast(o), ca.keyEntry);
cursors.add(ca);
}
else
{
CursorAndEntries ca= new CursorAndEntries();
ca.cursor=this.objectLiteral2triple.openCursor(null, null);
LITERAL_KEY_BINDING.objectToEntry(Literal.class.cast(o), ca.keyEntry);
cursors.add(ca);
}
}
return new JoinIterator(cursors);
}
catch (DatabaseException e)
{
throw new RDFException(e);
}

}


public Resource createResource(String uri)
{
return new Resource(uri);
}

public Resource createResource(URI uri)
{
return createResource(uri.toString());
}

public Literal createLiteral(String text)
{
return new Literal(text);
}


private static void saveResource(Resource rsrc,TupleOutput out)
{
out.writeString(rsrc.getURI());
}

private static Resource readResource(TupleInput in)
{
return new Resource(in.readString());
}

private static void saveLiteral(Literal literal,TupleOutput out)
{
out.writeString(literal.getLexicalForm());

}

private static Literal readLiteral(TupleInput in)
{
String s= in.readString();
return new Literal(s);
}


public long size()throws RDFException
{
try
{
return getTripleDB().count();
}catch(DatabaseException err)
{
throw new RDFException(err);
}
}

public boolean contains(Statement stmt ) throws RDFException
{
CloseableIterator<Statement> iter=null;
try {
iter= listStatements(stmt.getSubject(), stmt.getPredicate(),stmt.getValue());
return (iter.hasNext());
} catch (RDFException e) {
throw e;
}
finally
{
if(iter!=null) iter.close();
}
}

public BerkeleyDBModel add(Resource s,Resource p,RDFNode o) throws RDFException
{
return add(new Statement(s,p,o));
}


public BerkeleyDBModel add(Statement stmt) throws RDFException
{
DatabaseEntry key= new DatabaseEntry();
DatabaseEntry value= new DatabaseEntry();
Cursor c=null;
try
{
if(contains(stmt)) return this;
c= triplesDB.openCursor(null, null);
int id=0;

if(c.getLast(key, value, null)==OperationStatus.SUCCESS)
{
id= IntegerBinding.entryToInt(key);
}
STMT_VALUE_BINDING.objectToEntry(stmt, value);
IntegerBinding.intToEntry(id+1,key);
getTripleDB().put(null, key, value);
return this;
}
catch(DatabaseException error)
{
throw new RDFException(error);
}
finally
{
if(c!=null) try {c.close(); } catch(DatabaseException err) {}
}
}

public void read(InputStream in) throws
IOException,RDFException,XMLStreamException
{
org.lindenb.sw.io.RDFHandler h= new RDFHandler()
{
@Override
public void found(URI subject, URI predicate, Object value,
URI dataType, String lang, int index) {
try
{
if(value instanceof URI)
{
add(createResource(subject),createResource(predicate),createResource((URI)value));
}
else
{
add(createResource(subject),createResource(predicate),createLiteral((String)value));
}
} catch(RDFException err)
{
throw new RuntimeException(err);
}
}
};
h.parse(in);
}

public static void main(String[] args) {
BerkeleyDBModel rdfStore= null;
try {
URL url= new
URL("http://archive.geneontology.org/latest-lite/go_20090607-termdb.owl.gz");

rdfStore = new BerkeleyDBModel(new File("/tmp/rdfdb"));
for(int i=0;i<10;i++)
{
long now= System.currentTimeMillis();
rdfStore.clear();
InputStream in= new GZIPInputStream(url.openStream());
rdfStore.read(in);
in.close();
System.err.println("("+i+") "+rdfStore.size()+" time="+(System.currentTimeMillis()-now)/1000);
}
rdfStore.clear();
}
catch (Exception e) {
e.printStackTrace();
}
finally
{
if(rdfStore!=null) try { rdfStore.close(); } catch(Exception err) {err.printStackTrace();}
}
}

}



That's it
Pierre

25 November 2008

Taxonomy and Semantic Web: writing an extension for ARQ/SPARQL

In this post I'll show how I've implemented a custom function in ARQ, the SPARQL/Jena engine for querying an RDF graph. The new function tests whether a node in the NCBI taxonomy hierarchy has a given ancestor.

Requirements


Here is a sample of the very first lines of nodes.dmp (from the NCBI taxonomy dump): the first column is the node-id of the taxon, the second column is its parent-id.
cat nodes.dmp | cut -c 1-20 | head
1 | 1 | no rank | |
2 | 131567 | superki
6 | 335928 | genus |
7 | 6 | species | AC
9 | 32199 | species
10 | 135621 | genus
11 | 10 | species |
13 | 203488 | genus
14 | 13 | species |
16 | 32011 | genus |



The input


Our input is an RDF file:

<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:tax="http://species.lindenb.org"
>

<tax:Individual rdf:about="http://fr.wikipedia.org/wiki/Tintin">
<dc:title xml:lang="fr">Tintin</dc:title>
<dc:title xml:lang="en">Tintin</dc:title>
<tax:taxon rdf:resource="lsid:ncbi.nlm.nih.gov:taxonomy:9606"/>
</tax:Individual>

<tax:Individual rdf:about="http://fr.wikipedia.org/wiki/Babar">
<dc:title xml:lang="fr">Babar</dc:title>
<dc:title xml:lang="en">Babar</dc:title>
<tax:taxon rdf:resource="lsid:ncbi.nlm.nih.gov:taxonomy:9785"/>
</tax:Individual>

<tax:Individual rdf:about="http://fr.wikipedia.org/wiki/Milou">
<dc:title xml:lang="fr">Milou</dc:title>
<dc:title xml:lang="en">Snowy</dc:title>
<tax:taxon rdf:resource="lsid:ncbi.nlm.nih.gov:taxonomy:9615"/>
</tax:Individual>

<tax:Individual rdf:about="http://fr.wikipedia.org/wiki/Donald_Duck">
<dc:title xml:lang="fr">Donald</dc:title>
<dc:title xml:lang="en">Donald Duck</dc:title>
<tax:taxon rdf:resource="lsid:ncbi.nlm.nih.gov:taxonomy:8839"/>
</tax:Individual>

<tax:Individual rdf:about="http://fr.wikipedia.org/wiki/Le_L%C3%A9zard">
<dc:title xml:lang="fr">Lezard</dc:title>
<dc:title xml:lang="en">Lizard</dc:title>
<dc:title xml:lang="fr">Curt Connors</dc:title>
<dc:title xml:lang="en">Curt Connors</dc:title>
<tax:taxon rdf:resource="lsid:ncbi.nlm.nih.gov:taxonomy:9606"/>
<tax:taxon rdf:resource="lsid:ncbi.nlm.nih.gov:taxonomy:8504"/>
</tax:Individual>

</rdf:RDF>

[Images via Wikipedia: Tintin & Snowy, Babar, Donald, The Lizard]

Basically this file describes:
  • 5 individuals: Tintin (human), Snowy (dog), Donald (duck), Babar (elephant) and Dr Connors/The Lizard (Spider-Man's foe)
  • Each individual is unambiguously identified by its URI in Wikipedia
  • Each individual is named in English and in French
  • For each individual, its ID in the NCBI hierarchy is specified using a simple URI (here I've tried to use an LSID, but it could have been something else (a URL...))


A basic query


The following SPARQL query retrieves the URI, the taxon and the English name of each individual.

The query

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX tax: <http://species.lindenb.org>

SELECT ?individual ?taxon ?title
{
?individual a tax:Individual .
?individual dc:title ?title .
?individual tax:taxon ?taxon .
FILTER langMatches( lang(?title), "en" )
}

Invoking ARQ


arq --query query01.rq --data taxonomy.rdf

Result


-------------------------------------------------------------------------------------------------------------
| individual | taxon | title |
=============================================================================================================
| <http://fr.wikipedia.org/wiki/Le_L%C3%A9zard> | <lsid:ncbi.nlm.nih.gov:taxonomy:8504> | "Curt Connors"@en |
| <http://fr.wikipedia.org/wiki/Le_L%C3%A9zard> | <lsid:ncbi.nlm.nih.gov:taxonomy:9606> | "Curt Connors"@en |
| <http://fr.wikipedia.org/wiki/Le_L%C3%A9zard> | <lsid:ncbi.nlm.nih.gov:taxonomy:8504> | "Lizard"@en |
| <http://fr.wikipedia.org/wiki/Le_L%C3%A9zard> | <lsid:ncbi.nlm.nih.gov:taxonomy:9606> | "Lizard"@en |
| <http://fr.wikipedia.org/wiki/Donald_Duck> | <lsid:ncbi.nlm.nih.gov:taxonomy:8839> | "Donald Duck"@en |
| <http://fr.wikipedia.org/wiki/Milou> | <lsid:ncbi.nlm.nih.gov:taxonomy:9615> | "Snowy"@en |
| <http://fr.wikipedia.org/wiki/Babar> | <lsid:ncbi.nlm.nih.gov:taxonomy:9785> | "Babar"@en |
| <http://fr.wikipedia.org/wiki/Tintin> | <lsid:ncbi.nlm.nih.gov:taxonomy:9606> | "Tintin"@en |
-------------------------------------------------------------------------------------------------------------
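
As a side note (not part of the original post), the same query can also be run programmatically with the Jena/ARQ Java API instead of the arq command line; a minimal sketch:

import com.hp.hpl.jena.query.Query;
import com.hp.hpl.jena.query.QueryExecution;
import com.hp.hpl.jena.query.QueryExecutionFactory;
import com.hp.hpl.jena.query.QueryFactory;
import com.hp.hpl.jena.query.ResultSet;
import com.hp.hpl.jena.query.ResultSetFormatter;
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.util.FileManager;

public class RunQuery
{
public static void main(String[] args)
{
/* load the RDF data and the SPARQL query from the current directory */
Model model= FileManager.get().loadModel("taxonomy.rdf");
Query query= QueryFactory.read("query01.rq");
QueryExecution qexec= QueryExecutionFactory.create(query, model);
try
{
ResultSet results= qexec.execSelect();
/* print the bindings as a text table */
ResultSetFormatter.out(System.out, results, query);
}
finally
{
qexec.close();
}
}
}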


Adding a custom function


Now, I want to add a new function in SPARQL. This function 'isA' will take two parameters as input: the taxon/LSID of the child and the taxon/LSID of the parent, and it will return the boolean 'true' if the 'child' has the 'parent' in its phylogeny. This new function is implemented by extending the class com.hp.hpl.jena.sparql.function.FunctionBase2. This new class contains an associative array child2parent mapping each taxon-id to its parent. This map is loaded as described below:

Pattern pat= Pattern.compile("[ \t]*\\|[ \t]*");
String line;
BufferedReader r= new BufferedReader(new FileReader(TAXONOMY_NODES_PATH));
while((line=r.readLine())!=null)
{
String tokens[]=pat.split(line, 3);
this.child2parent.put(
Integer.parseInt(tokens[0]),
Integer.parseInt(tokens[1])
);
}
r.close();
(...)

The function 'exec' checks that the two arguments are URIs and then invokes the method isChildOf:

public NodeValue exec(NodeValue childNode, NodeValue parentNode)
{
(...check the nodes are URI)
return NodeValue.makeBoolean(isChildOf(childId,parentId));
}


The function 'isChildOf' walks up the child2parent map to check whether the parent is an ancestor of the child:

while(true)
{
Integer id= child2parent.get(childid);
if(id==null || id==childid) return false;
if(id==parentid) return true;
childid=id;
}

Here is the complete source code of this class:

package org.lindenb.arq4taxonomy;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

import com.hp.hpl.jena.sparql.expr.ExprEvalException;
import com.hp.hpl.jena.sparql.expr.NodeValue;
import com.hp.hpl.jena.sparql.function.FunctionBase2;

public class isA
extends FunctionBase2
{
public static final String LSID="lsid:ncbi.nlm.nih.gov:taxonomy:";
public static final String TAXONOMY_NODES_PATH="/home/lindenb/tmp/TAXONOMY_NCBI/nodes.dmp";
private Map<Integer, Integer> child2parent=null;

public isA()
{

}
/**
* return a associative map child.id -> parent.id
* @return
*/
private Map<Integer, Integer> getTaxonomy()
{
if(this.child2parent==null)
{
this.child2parent= new HashMap<Integer, Integer>();
try
{
Pattern pat= Pattern.compile("[ \t]*\\|[ \t]*");
String line;
BufferedReader r= new BufferedReader(new FileReader(TAXONOMY_NODES_PATH));
while((line=r.readLine())!=null)
{
String tokens[]=pat.split(line, 3);
this.child2parent.put(
Integer.parseInt(tokens[0]),
Integer.parseInt(tokens[1])
);
}
r.close();
System.err.println(this.child2parent.size());
}
catch(IOException err)
{
err.printStackTrace();
throw new ExprEvalException(err);
}
}
return this.child2parent;
}

private boolean isChildOf(int childid,int parentid)
{
if(childid==parentid) return true;
Map<Integer,Integer> map= getTaxonomy();
while(true)
{
Integer id= map.get(childid);
if(id==null || id==childid) return false;
if(id==parentid) return true;
childid=id;
}
}

@Override
public NodeValue exec(NodeValue childNode, NodeValue parentNode)
{

if( childNode.isLiteral() ||
parentNode.isLiteral() ||
childNode.asNode().isBlank() ||
parentNode.asNode().isBlank())
{
return NodeValue.makeBoolean(false);
}

String childURI = childNode.asNode().getURI();
if(!childURI.startsWith(LSID))
{
return NodeValue.makeBoolean(false);
}


String parentURI = parentNode.asNode().getURI();
if(!parentURI.startsWith(LSID))
{
return NodeValue.makeBoolean(false);
}

int childId=0;
try {
childId= Integer.parseInt(childURI.substring(LSID.length()));
}
catch (NumberFormatException e)
{
return NodeValue.makeBoolean(false);
}

int parentId=0;
try {
parentId= Integer.parseInt(parentURI.substring(LSID.length()));
}
catch (NumberFormatException e)
{
return NodeValue.makeBoolean(false);
}

return NodeValue.makeBoolean(isChildOf(childId,parentId));
}

}

This class is then compiled and packaged into the file tax.jar:

javac -cp $(ARQ_CLASSPATH):. -sourcepath src src/org/lindenb/arq4taxonomy/isA.java
jar cvf tax.jar -C src org


and we add this jar in the classpath:
export CP=$PWD/tax.jar

To tell ARQ about this new function, we just declare its Java package as a new PREFIX in the SPARQL query:
PREFIX fn: <java:org.lindenb.arq4taxonomy.>



First test


The following SPARQL query retrieves all the Mammals (http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Undef&id=40674) in the data set.

The query


PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX tax: <http://species.lindenb.org>
PREFIX fn: <java:org.lindenb.arq4taxonomy.>

SELECT ?individual ?taxon ?title
{
?individual a tax:Individual .
?individual dc:title ?title .
?individual tax:taxon ?taxon .
FILTER fn:isA(?taxon,<lsid:ncbi.nlm.nih.gov:taxonomy:40674> )
FILTER langMatches( lang(?title), "en" )
}

The command line


arq --query query02.rq --data taxonomy.rdf


The result


-------------------------------------------------------------------------------------------------------------
| individual | taxon | title |
=============================================================================================================
| <http://fr.wikipedia.org/wiki/Le_L%C3%A9zard> | <lsid:ncbi.nlm.nih.gov:taxonomy:9606> | "Curt Connors"@en |
| <http://fr.wikipedia.org/wiki/Le_L%C3%A9zard> | <lsid:ncbi.nlm.nih.gov:taxonomy:9606> | "Lizard"@en |
| <http://fr.wikipedia.org/wiki/Milou> | <lsid:ncbi.nlm.nih.gov:taxonomy:9615> | "Snowy"@en |
| <http://fr.wikipedia.org/wiki/Babar> | <lsid:ncbi.nlm.nih.gov:taxonomy:9785> | "Babar"@en |
| <http://fr.wikipedia.org/wiki/Tintin> | <lsid:ncbi.nlm.nih.gov:taxonomy:9606> | "Tintin"@en |
-------------------------------------------------------------------------------------------------------------


Second query


The following SPARQL query retrieves all the 'Sauropsida' (http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Undef&id=8457) in the RDF file.

The SPARQL file


PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX tax: <http://species.lindenb.org>
PREFIX fn: <java:org.lindenb.arq4taxonomy.>

SELECT ?individual ?taxon ?title
{
?individual a tax:Individual .
?individual dc:title ?title .
?individual tax:taxon ?taxon .
FILTER fn:isA(?taxon,<lsid:ncbi.nlm.nih.gov:taxonomy:8457> )
FILTER langMatches( lang(?title), "en" )
}

Command line


arq --query query03.rq --data taxonomy.rdf


The result


-------------------------------------------------------------------------------------------------------------
| individual | taxon | title |
=============================================================================================================
| <http://fr.wikipedia.org/wiki/Le_L%C3%A9zard> | <lsid:ncbi.nlm.nih.gov:taxonomy:8504> | "Curt Connors"@en |
| <http://fr.wikipedia.org/wiki/Le_L%C3%A9zard> | <lsid:ncbi.nlm.nih.gov:taxonomy:8504> | "Lizard"@en |
| <http://fr.wikipedia.org/wiki/Donald_Duck> | <lsid:ncbi.nlm.nih.gov:taxonomy:8839> | "Donald Duck"@en |
-------------------------------------------------------------------------------------------------------------



Et hop ! voila ! That's it !

04 April 2008

BerkeleyDB and Hapmap: My notebook.

I'm currently trying to find the best way to store some genotypes. For example, I need to store 278.766.958 Illumina genotypes (marker, individual, allele1, allele2) and MySQL, even with indexes, is getting slow when I'm looking for the Mendelian incompatibilities. Deepak suggested via twitter to use HDF5 (http://hdf.ncsa.uiuc.edu/HDF5/) but, as far as I understand the documentation, HDF5 is "just" a smarter implementation of the C functions fseek/fread/fwrite.

I've been looking at the Java implementation of BerkeleyDB (BDB) just to check its performance against my needs. This engine can be used as an embedded database and doesn't need a database server running as a daemon (just like JavaDB). A BDB database contains records where each record is a "key" and a "value" (a kind of two-column database, or a kind of C++ std::multimap). In the current post, I'll describe a program which 1) downloads some genotypes from HapMap, 2) finds the pedigree of the samples, 3) loops over the markers (ordered by their position on the genome) and computes the allele frequencies, and 4) finds the Mendelian incompatibilities.

The source code is available on pastie at:



First we need a few classes:

class Position holds a position on a chromosome
private static class Position
{
String chromosome;
int position;
Position(String chromosome,int position)
{
this.chromosome=chromosome;
this.position=position;
}
}


class Marker contains the information about a SNP:
private static class Marker
{
int rsId;
String alleles;
Position pos;
char strand;
}


class Individual contains the information about an individual, including his 'id' in Coriell (see http://cardiogenomics.med.harvard.edu):
private static class Individual
{
/**Coriell Repository Number*/
String name;
/** id in corriel */
String individualID=null;
/** father id */
String fatherID=null;
/** mother id */
String motherID=null;
}


As BDB is a collection of pairs (key,value), we need a class MarkerIndividual holding the pair (marker,individual), so the genotypes can be stored as f(pair(marker,individual)) = genotype.
private static class MarkerIndividual
{
private int rsId;
/**Coriell Repository Number*/
private String individualId;
}


Finally, we need a class Genotype to store the two alleles:
private static class Genotype
{
char a1,a2;

public boolean same(char a1,char a2)
{
return (this.a1==a1 && this.a2==a2) ||
(this.a1==a2 && this.a2==a1)
;
}
}


BDB offers several alternatives for reading and writing the Java objects from/to the databases; this operation requires the object to be converted into an array of bytes: 1) Java serialization can be used, or 2) a TupleBinding can be implemented; this class must implement two functions which encode/decode the object to/from an array of bytes. I've chosen the latter option, and here is, for example, the TupleBinding implementation for the class Individual:

private TupleBinding individualTupleBinding=new TupleBinding()
{
@Override
public Object entryToObject(TupleInput input)
{
Individual indi= new Individual();
indi.name= input.readString();
indi.individualID= input.readString();
indi.fatherID=input.readString();
indi.motherID= input.readString();
return indi;
}

@Override
public void objectToEntry(Object object, TupleOutput output) {
Individual indi= Individual.class.cast(object);
output.writeString(indi.name);
output.writeString(indi.individualID);
output.writeString(indi.fatherID);
output.writeString(indi.motherID);
}
};
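
The other bindings (marker, pair (marker,individual), genotype) follow the same pattern. As an illustration, the binding for the composite key MarkerIndividual could look like the sketch below (my reconstruction, the code on pastie may differ); writing the marker id first means the genotype records are grouped by marker:

private TupleBinding markerIndividualTupleBinding=new TupleBinding()
{
@Override
public Object entryToObject(TupleInput input)
{
/* read back the marker id then the Coriell name of the individual */
return new MarkerIndividual(input.readInt(), input.readString());
}

@Override
public void objectToEntry(Object object, TupleOutput output) {
MarkerIndividual mi= MarkerIndividual.class.cast(object);
output.writeInt(mi.rsId);
output.writeString(mi.individualId);
}
};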


To open and create a BerkeleyDB Environment, the following code was written:
EnvironmentConfig envConf= new EnvironmentConfig();
envConf.setAllowCreate(!readOnly);
envConf.setReadOnly(readOnly);
envConf.setTransactional(false);
this.env= new Environment(
file,
envConf
);


Then, we open each database. We need 3 databases: markers, individuals and genotypes. Opening a Database looks like this:

DatabaseConfig dbConfig= new DatabaseConfig();
/* allow create if database does not exists */
dbConfig.setAllowCreate(!readOnly);
dbConfig.setReadOnly(readOnly);
Database db= this.env.openDatabase(null, "database-name", dbConfig);


Although BDB is based on pairs (key,value), another component of the value may need to be searched, and therefore indexed. This is done with a "Secondary Database". For example, in this project I created secondary databases 1) to find/loop over the markers using their position and 2) to find/loop over the individuals using individualID instead of name. We need a class implementing SecondaryKeyCreator to extract this secondary key from the original value. For example, here is the class extracting the Position from the Marker:
class MarkerPositionCreator implements SecondaryKeyCreator
{
public boolean createSecondaryKey(
SecondaryDatabase secDb,
DatabaseEntry keyEntry,
DatabaseEntry dataEntry,
DatabaseEntry resultEntry)
{
Marker marker= (Marker)Test01.this.markerTupleBinding.entryToObject(dataEntry);
Test01.this.positionTupleBinding.objectToEntry(marker.pos, resultEntry);
return true;
}
}

We also need to open those "secondary" databases
SecondaryConfig secondConfig= new SecondaryConfig();
secondConfig.setAllowCreate(!readOnly);
/* on open, if the secondary database is empty then the primary
database is read in its entirety and additions/modifications to the secondary's records occur automatically */
secondConfig.setAllowPopulate(true);
secondConfig.setSortedDuplicates(true);
MarkerPositionCreator posCreator = new MarkerPositionCreator();
/* Identifies the key creator object to be used for secondary key creation. */
secondConfig.setKeyCreator(posCreator);
positionDB = this.env.openSecondaryDatabase(null, "position", this.markerDB,secondConfig);


OK, more genetics now. The HapMap genotypes are available here: http://www.hapmap.org/genotypes/latest/rs_strand/non-redundant/ (the path may change). A file looks like a table f(marker,individual)=genotype:
rs# SNPalleles chrom pos strand genome_build center protLSID assayLSID panelLSID QC_code NA18940 NA18942 NA189
rs28412942 A/T chrMT 410 + ncbi_B36 affymetrix GenomeWideSNP_6.02 Japanese:2 QC+ AA AA AA AA AA AA AA AA AA AA
rs3937039 A/G chrMT 665 + ncbi_b36 broad genotype_protocol_11 Japanese:1 QC+ AA AA AA AA AA AA AA AA AA AA AA
rs2853517 A/G chrMT 711 + ncbi_b36 broad genotype_protocol_11 Japanese:1 QC+ GG GG GG GG GG GG GG GG GG GG GG
rs28358568 C/T chrMT 712 + ncbi_b36 broad genotype_protocol_11 Japanese:1 QC+ TT TT TT TT TT TT TT TT TT TT TT
(...)


The file is processed as follows.

Pattern space=Pattern.compile("[ ]");
String HEADER[]=new String[]{"rs#","SNPalleles","chrom","pos","strand","genome_build","center","protLSID","assayLSID","panelLSID","QC_code"};
BufferedReader in= new BufferedReader(new InputStreamReader(new GZIPInputStream(url.openStream())));

String line= in.readLine();
if(line==null) throw new IOException("Empty file");
/* The first line of this file is the header*/
String header[]=space.split(line);
for(int i=0;i< HEADER.length;++i)
{
if(!header[i].equals(HEADER[i])) throw new IOException("Bad header "+header[i]+" expected "+HEADER[i]);
}
/* the header contains the name of the Individual which will be inserted. At this time we don't know what are the relationships between those individuals.*/
for(int i=HEADER.length;i< header.length;++i)
{
Individual individual= new Individual();
individual.name=header[i];
DatabaseEntry key= new DatabaseEntry(individual.name.getBytes());
DatabaseEntry data= new DatabaseEntry();
this.individualTupleBinding.objectToEntry(individual, data);
getIndividualDB().put(null
,key
,data
);
}
/** the following lines are the markers and the genotypes */
TupleBinding INT_BINDING=TupleBinding.getPrimitiveBinding(Integer.class);
while((line=in.readLine())!=null)
{
if(!line.startsWith("rs")) continue;
String tokens[]=space.split(line);
//System.err.println(line);
/* fill the information of this marker */
Marker marker= new Marker(Integer.parseInt( tokens[0].substring(2)));
marker.alleles= tokens[1];
marker.pos= new Position(tokens[2],Integer.parseInt(tokens[3]));
marker.strand= tokens[4].charAt(0);

DatabaseEntry key= new DatabaseEntry();
INT_BINDING.objectToEntry(marker.getRSId(), key);
DatabaseEntry data= new DatabaseEntry();
this.markerTupleBinding.objectToEntry(marker, data);
getMarkerDB().put(null
,key
,data
);
/** loop over this marker and get the genotypes */
for(int i=HEADER.length;i<header.length;++i)
{
if(tokens[i].length()!=2 || tokens[i].equals("NN")) continue;
/** create the KEY */
MarkerIndividual mi= new MarkerIndividual(marker.rsId,header[i]);
this.markerIndividualTupleBinding.objectToEntry(mi, key);
/** create the genotype */
this.genotypeTupleBinding.objectToEntry(
new Genotype(tokens[i].charAt(0),tokens[i].charAt(1)),
data
);
/** put the new pair( pair(marker,individual), genotype) in the BDB */
getGenotypeDB().put(null
,key
,data
);
}
}
in.close();


OK, now I want to find the relationships between those individuals; this information is available here. For each "Coriell Repository Number" we look up the individual in our database; if it exists we add the pedigree information and write the individual back to the database (see function makePedigree, line 466).
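
The makePedigree function itself is only in the pastie; a rough sketch of its update step could look like this (my reconstruction; parsing the pedigree file is omitted and the parameter values are assumed to come from that file):

/* update one individual with its pedigree information */
void updatePedigree(String coriellName, String individualID, String fatherID, String motherID)
throws DatabaseException
{
DatabaseEntry key= new DatabaseEntry(coriellName.getBytes());
DatabaseEntry data= new DatabaseEntry();
/* does this individual exist in our database ? */
if(getIndividualDB().get(null, key, data, LockMode.DEFAULT)!=OperationStatus.SUCCESS) return;
Individual indi= Individual.class.cast(this.individualTupleBinding.entryToObject(data));
/* add the pedigree information */
indi.individualID= individualID;
indi.fatherID= fatherID;
indi.motherID= motherID;
/* write the individual back to the database */
this.individualTupleBinding.objectToEntry(indi, data);
getIndividualDB().put(null, key, data);
}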

To retrieve the genotype g=f(marker,individual) I wrote the following simple utility function getGenotypeAt:

Genotype getGenotypeAt(Marker marker,Individual indi) throws DatabaseException
{
if(marker==null || indi==null) return null;
DatabaseEntry key=new DatabaseEntry();
DatabaseEntry value=new DatabaseEntry();
this.markerIndividualTupleBinding.objectToEntry(new MarkerIndividual(marker.rsId,indi.name),key);
if(this.genotypeDB.get(null, key, value, LockMode.DEFAULT)!=OperationStatus.SUCCESS) return null;
Genotype g= Genotype.class.cast(this.genotypeTupleBinding.entryToObject(value));
return g;
}



To get the allele frequencies, we loop over each marker (using a secondary database so the markers come out ordered by their position on the genome, not by rs#) and, for each marker, we get the genotypes of all the individuals. To loop over a BDB database, a Cursor instance (similar to a java.util.Iterator) is used.
void frequencies() throws DatabaseException
{
SecondaryCursor cM= this.positionDB.openSecondaryCursor(null, null);
DatabaseEntry key=new DatabaseEntry();
DatabaseEntry value=new DatabaseEntry();
while(cM.getNext(key, value,LockMode.DEFAULT)==OperationStatus.SUCCESS)
{
Marker m= Marker.class.cast(this.markerTupleBinding.entryToObject(value));
HashMap<Character, Integer> allele2count= new HashMap<Character, Integer>();
int totalGenotyped=0;
int totalFailures=0;
System.out.print("rs"+m.rsId+" "+m.alleles+" "+m.pos.chromosome+" "+m.pos.position+" "+m.strand);
Cursor cI= this.individualDB.openCursor(null, null);
while(cI.getNext(key, value,LockMode.DEFAULT)==OperationStatus.SUCCESS)
{
Individual indi=Individual.class.cast(this.individualTupleBinding.entryToObject(value));
Genotype g= getGenotypeAt(m, indi);
if(g!=null)
{
totalGenotyped++;
for(int i=0;i< 2;++i)
{
char c= (i==0?g.a1:g.a2);
Integer count= allele2count.get(c);
if(count==null) count=0;
allele2count.put(c,count+1);
}
}
else
{
totalFailures++;
}
}
cI.close();
System.out.print(" genotyped:"+(int)(100.0*(totalGenotyped-totalFailures)/(float)totalGenotyped)+"%");
for(Character allele: allele2count.keySet())
{
System.out.print(" f("+allele+")="+allele2count.get(allele)/(2.0*totalGenotyped));
}
System.out.println();
}
cM.close();
}



Finding the Mendelian incompatibilities works much the same way: I used brute force here; we loop over each individual and over each marker. If the individual has any known parent, we check that its genotype is compatible with theirs.
void incompats() throws DatabaseException
{
DatabaseEntry key=new DatabaseEntry();
DatabaseEntry value=new DatabaseEntry();

Cursor cI= this.individualDB.openCursor(null, null);
while(cI.getNext(key, value,LockMode.DEFAULT)==OperationStatus.SUCCESS)
{
Individual indi=Individual.class.cast(this.individualTupleBinding.entryToObject(value));

if(indi.fatherID==null && indi.motherID==null)
{
continue;
}

Individual father= findIndividualByCorrielId(indi.fatherID);
Individual mother= findIndividualByCorrielId(indi.motherID);

Cursor cM= this.markerDB.openCursor(null, null);
while(cM.getNext(key, value,LockMode.DEFAULT)==OperationStatus.SUCCESS)
{
Marker m= Marker.class.cast(this.markerTupleBinding.entryToObject(value));
Genotype gChild= getGenotypeAt(m, indi);
Genotype gFather= getGenotypeAt(m, father);
if(isIncompat(gChild,gFather))
{
System.out.println("Incompat: for rs"+m.getRSId()+"("+m.alleles+") "+
indi.individualID+" is "+gChild+" and his father "+indi.fatherID+" is "+
gFather
);
continue;
}
Genotype gMother= getGenotypeAt(m, mother);
if(isIncompat(gChild,gMother))
{
System.out.println("Incompat: for rs"+m.getRSId()+"("+m.alleles+") "+
indi.individualID+" is "+gChild+" and his mother "+indi.motherID+" is "+
gMother
);
continue;
}
if(isIncompat(gChild,gFather,gMother))
{
System.out.println("Incompat: for rs"+m.getRSId()+"("+m.alleles+") "+
indi.individualID+" is "+gChild+
" and his father "+indi.fatherID+" is "+ gFather+" and his mother "+indi.motherID+" is "+ gMother
);
}
}
cM.close();
}
cI.close();
}
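
The isIncompat helpers are only in the pastie; here is a rough sketch of what those Mendelian checks could look like (my reconstruction, not the original code; 'fromParent' is a hypothetical helper):

/** a child and a single known parent are incompatible if they share no allele */
private boolean isIncompat(Genotype child,Genotype parent)
{
if(child==null || parent==null) return false; /* missing genotype: cannot conclude */
return  child.a1!=parent.a1 && child.a1!=parent.a2 &&
        child.a2!=parent.a1 && child.a2!=parent.a2;
}

/** could this allele have been transmitted by this parent ? */
private boolean fromParent(char allele,Genotype parent)
{
return allele==parent.a1 || allele==parent.a2;
}

/** with both parents known, one allele must come from each parent */
private boolean isIncompat(Genotype child,Genotype father,Genotype mother)
{
if(child==null || father==null || mother==null) return false;
boolean possible=
   (fromParent(child.a1,father) && fromParent(child.a2,mother)) ||
   (fromParent(child.a2,father) && fromParent(child.a1,mother));
return !possible;
}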


That's it. I first tested chromosome 1 at http://www.hapmap.org/genotypes/latest/rs_strand/non-redundant/genotypes_chr1_CEU_r23a_nr.b36.txt.gz (11 Mo) but I pressed Ctrl-C when the files reached 1.4 Go!
I then used the chr22 file, downloaded directly to my computer. The space required by BerkeleyDB to hold the database and the indexes was 721 Mo whereas the zipped original source of data was 2 Mo (25 Mo unzipped)!!! (Arghhhhhhhhhhhh !!!!!)

  • Individuals count:=90

  • Markers count:=54786

  • Genotypes count:=4853237


Time required to load the hapmap file : 1174secs (20min)

rs9624968 A/G chr22 24783030 + genotyped:86% f(G)=0.879746835443038 f(A)=0.12025316455696203
rs9624969 C/T chr22 24784595 + genotyped:87% f(T)=0.075 f(C)=0.925
rs6004919 C/T chr22 24785216 + genotyped:100% f(T)=0.12777777777777777 f(C)=0.8722222222222222
rs4585127 A/G chr22 24785559 + genotyped:100% f(G)=0.8722222222222222 f(A)=0.12777777777777777
rs5752262 A/G chr22 24786367 + genotyped:95% f(G)=0.5116279069767442 f(A)=0.4883720930232558
rs16981296 C/G chr22 24787784 + genotyped:95% f(G)=0.8488372093023255 f(C)=0.1511627906976744
rs1003547 G/T chr22 24788134 + genotyped:86% f(T)=0.44936708860759494 f(G)=0.5506329113924051
rs9613094 A/G chr22 24788388 + genotyped:100% f(G)=0.2222222222222222 f(A)=0.7777777777777778
rs16986627 A/G chr22 24789298 + genotyped:88% f(G)=0.2654320987654321 f(A)=0.7345679012345679
(...)


Time required to generate the frequencies: 955 secs

Incompat: for rs133457(C/T) 1341M02 is CC and his father 1341MF13 is TT
Incompat: for rs136009(C/T) 1341M02 is CT and his father 1341MF13 is TT and his mother 1341MM14 is TT
Incompat: for rs394518(C/T) 1341M02 is CT and his father 1341MF13 is TT and his mother 1341MM14 is TT
Incompat: for rs628437(A/C) 1341M02 is AC and his father 1341MF13 is CC and his mother 1341MM14 is CC
Incompat: for rs731403(C/G) 1341M02 is GG and his mother 1341MM14 is CC
Incompat: for rs1122940(C/T) 1341M02 is CT and his father 1341MF13 is TT and his mother 1341MM14 is TT
Incompat: for rs2845343(A/T) 1341M02 is AA and his mother 1341MM14 is TT
Incompat: for rs4822498(C/T) 1341M02 is CC and his father 1341MF13 is TT
Incompat: for rs4822499(C/T) 1341M02 is TT and his father 1341MF13 is CC
Incompat: for rs4823195(A/G) 1341M02 is AA and his mother 1341MM14 is GG
Incompat: for rs4991267(C/T) 1341M02 is CC and his mother 1341MM14 is TT
Incompat: for rs5755047(A/T) 1341M02 is TT and his father 1341MF13 is AA
Incompat: for rs5755343(A/T) 1341M02 is AA and his mother 1341MM14 is TT
Incompat: for rs5755420(A/C) 1341M02 is AC and his father 1341MF13 is CC and his mother 1341MM14 is CC
Incompat: for rs5765056(C/T) 1341M02 is CT and his father 1341MF13 is CC and his mother 1341MM14 is CC
Incompat: for rs5765436(A/C) 1341M02 is AC and his father 1341MF13 is CC and his mother 1341MM14 is CC
Incompat: for rs5765499(C/T) 1341M02 is CC and his father 1341MF13 is TT
Incompat: for rs5768636(G/T) 1341M02 is GT and his father 1341MF13 is GG and his mother 1341MM14 is GG
Incompat: for rs5769710(C/T) 1341M02 is CC and his father 1341MF13 is TT
Incompat: for rs5770600(A/C) 1341M02 is CC and his father 1341MF13 is AA
Incompat: for rs5997220(A/G) 1341M02 is AG and his father 1341MF13 is GG and his mother 1341MM14 is GG
(...)
764 errors


Time required to find the Mendelian incompatibilities: 11021secs (~3H00)

When the program closed, the database was compacted down to 710Mo.

Conclusion: too slow, and some huge files generated. Definitely a bad choice for handling this kind of data.

10 March 2008

Custom Search Engine For Bioinformatics

Mostly because I need this every time, I wrote a few "Custom Search" engines for dbSNP, HapMap and the UCSC Genome Browser. Those engines are available at http://lindenb.integragen.org/opensearch/opensearch.html.

Pierre