Just curious: can I store some RDF statements in a key/value engine like BerkeleyDB (Java Edition)?
Yes, it's like re-inventing the wheel but, again, I like re-inventing the wheel :-)
A BerkeleyDB database contains a set of key/value pairs, e.g. a table of persons (FirstName/LastName) keyed by an ID.
Data are stored as arrays of bytes. Keys are sorted in a byte-wise order and the records are stored in a B-Tree. A database can be configured to store unique or duplicate keys.
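Since Java bytes are signed, that byte-wise ordering is easy to get wrong; here is a small illustration of the unsigned comparison (my own helper, not part of the BerkeleyDB API):

```java
public class ByteOrder {
    /** Compare two byte arrays the way BerkeleyDB sorts keys:
     *  byte by byte, treating each byte as UNSIGNED (0..255). */
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xFF; // mask to get the unsigned value
            int y = b[i] & 0xFF;
            if (x != y) return x - y;
        }
        return a.length - b.length; // shorter array sorts first
    }

    public static void main(String[] args) {
        byte[] k1 = {0x01};
        byte[] k2 = {(byte) 0xFF}; // -1 as a signed Java byte
        // A signed comparison would put k2 first; unsigned puts it last.
        System.out.println(compareUnsigned(k1, k2) < 0); // true
    }
}
```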
In BerkeleyDB, a Cursor is an iterator used to scan the database: since the keys are sorted, accessing a range of keys is very fast.
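A sketch of such a range scan with the com.sleepycat.je API (class and method names are mine): the cursor jumps to the first key greater than or equal to a starting key, then walks forward in key order.

```java
import com.sleepycat.je.Cursor;
import com.sleepycat.je.Database;
import com.sleepycat.je.DatabaseEntry;
import com.sleepycat.je.DatabaseException;
import com.sleepycat.je.LockMode;
import com.sleepycat.je.OperationStatus;

public class RangeScan {
    public static void scanFrom(Database db, byte[] fromKey) throws DatabaseException {
        Cursor cursor = db.openCursor(null, null); // no transaction, default config
        try {
            DatabaseEntry key = new DatabaseEntry(fromKey);
            DatabaseEntry data = new DatabaseEntry();
            // position on the first key >= fromKey
            OperationStatus st = cursor.getSearchKeyRange(key, data, LockMode.DEFAULT);
            while (st == OperationStatus.SUCCESS) {
                // ... use key.getData() / data.getData() here ...
                st = cursor.getNext(key, data, LockMode.DEFAULT); // next key in order
            }
        } finally {
            cursor.close();
        }
    }
}
```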
Some indexes (Secondary Databases) can be linked to a database. For example, in the previous table, if you want to quickly find the person having LastName=="Parker", I would create a secondary database on LastName. Deleting an item in the secondary database automatically deletes the corresponding item in the primary database. As far as I know, you cannot create a secondary database if your primary database allows duplicate keys.
OK, now I want to store some RDF statements, that is to say, triples of the following form:
SUBJECT = RESOURCE;
PREDICATE = RESOURCE;
OBJECT = ( RESOURCE || LITERAL)
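A minimal Java model of such a triple could look like this (the class is my own sketch, not taken from any RDF library):

```java
import java.util.Objects;

/** A minimal RDF statement: subject and predicate are resource URIs,
 *  the object is either a resource URI or a literal string. */
public class Statement {
    final String subject;    // a RESOURCE (URI)
    final String predicate;  // a RESOURCE (URI)
    final String object;     // a RESOURCE or a LITERAL
    final boolean objectIsLiteral;

    public Statement(String subject, String predicate,
                     String object, boolean objectIsLiteral) {
        this.subject = subject;
        this.predicate = predicate;
        this.object = object;
        this.objectIsLiteral = objectIsLiteral;
    }

    @Override public boolean equals(Object o) {
        if (!(o instanceof Statement)) return false;
        Statement s = (Statement) o;
        return subject.equals(s.subject)
            && predicate.equals(s.predicate)
            && object.equals(s.object)
            && objectIsLiteral == s.objectIsLiteral;
    }

    @Override public int hashCode() {
        return Objects.hash(subject, predicate, object, objectIsLiteral);
    }
}
```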
I need to create indexes to quickly find any statement matching one, two, or all three components of the statement.
In the solution I've implemented, there is only one primary database; all the components of a statement are part of the 'DATA', and the 'KEY' of the database is just a unique number.
Creating the database 'triplesDB':
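A sketch of this step with the com.sleepycat.je API; the directory name is my assumption, and the keys are declared unique since the secondary indexes require a primary without duplicate keys:

```java
import java.io.File;
import com.sleepycat.je.Database;
import com.sleepycat.je.DatabaseConfig;
import com.sleepycat.je.Environment;
import com.sleepycat.je.EnvironmentConfig;

public class TriplesDB {
    public static Database open(File dir) {
        EnvironmentConfig envCfg = new EnvironmentConfig();
        envCfg.setAllowCreate(true); // create the environment if needed
        Environment env = new Environment(dir, envCfg);

        DatabaseConfig cfg = new DatabaseConfig();
        cfg.setAllowCreate(true);
        cfg.setSortedDuplicates(false); // unique keys: required for secondary DBs
        return env.openDatabase(null, "triplesDB", cfg); // no transaction
    }
}
```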
We then create the secondary indexes for each component of a statement (subject2triple, predicate2triple, objectLiteral2triple, objectRsrc2triple). For example, the following code opens the secondary database 'objectRsrc2triple'. We create a 'SecondaryKeyCreator' to tell BerkeleyDB how it should extract the secondary key from the primary data.
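A sketch of what opening 'objectRsrc2triple' with a SecondaryKeyCreator can look like; the helper extractObjectResource is hypothetical:

```java
import com.sleepycat.je.Database;
import com.sleepycat.je.DatabaseEntry;
import com.sleepycat.je.Environment;
import com.sleepycat.je.SecondaryConfig;
import com.sleepycat.je.SecondaryDatabase;
import com.sleepycat.je.SecondaryKeyCreator;

public class ObjectRsrcIndex {
    public static SecondaryDatabase open(Environment env, Database triplesDB) {
        SecondaryConfig cfg = new SecondaryConfig();
        cfg.setAllowCreate(true);
        cfg.setSortedDuplicates(true); // many triples can share one object
        cfg.setKeyCreator(new SecondaryKeyCreator() {
            @Override
            public boolean createSecondaryKey(SecondaryDatabase secondary,
                    DatabaseEntry key, DatabaseEntry data, DatabaseEntry result) {
                byte[] objectBytes = extractObjectResource(data.getData());
                if (objectBytes == null) return false; // object is not a resource: skip this record
                result.setData(objectBytes);
                return true;
            }
        });
        return env.openSecondaryDatabase(null, "objectRsrc2triple", triplesDB, cfg);
    }

    /** Hypothetical helper: decode the serialized statement and return the
     *  object-resource bytes, or null when the object is a literal. */
    static byte[] extractObjectResource(byte[] statementData) {
        return null; // placeholder
    }
}
```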
We need some methods to write and read the components of a Statement from/to an array of bytes:
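A sketch of such a codec using length-prefixed UTF-8 strings (the actual byte layout used here may differ):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

/** Write/read the components of a statement as an array of bytes. */
public class StatementCodec {
    public static byte[] write(String subject, String predicate,
                               String object, boolean objectIsLiteral) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bos);
            out.writeUTF(subject);            // length-prefixed UTF-8
            out.writeUTF(predicate);
            out.writeBoolean(objectIsLiteral);
            out.writeUTF(object);
            out.flush();
            return bos.toByteArray();
        } catch (IOException e) {
            throw new RuntimeException(e); // cannot happen on an in-memory stream
        }
    }

    /** Returns {subject, predicate, object, "true"|"false"} for illustration. */
    public static String[] read(byte[] data) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            String subject = in.readUTF();
            String predicate = in.readUTF();
            boolean isLiteral = in.readBoolean();
            String object = in.readUTF();
            return new String[] { subject, predicate, object, Boolean.toString(isLiteral) };
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}
```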
I've also wrapped the Cursor in a java.util.Iterator. One interesting iterator is the JoinIterator, which quickly retrieves the *common* statements returned from a distinct set of 'Cursors'. When we first retrieve the values of those cursors, we seek to each searched key. Then we let the BerkeleyDB API find the intersection of those cursors.
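A sketch of that two-step join with the com.sleepycat.je API: each secondary cursor is first positioned on its searched key, then Database.join() walks the intersection (the class and method names are mine):

```java
import com.sleepycat.je.Cursor;
import com.sleepycat.je.Database;
import com.sleepycat.je.DatabaseEntry;
import com.sleepycat.je.DatabaseException;
import com.sleepycat.je.JoinCursor;
import com.sleepycat.je.LockMode;
import com.sleepycat.je.OperationStatus;
import com.sleepycat.je.SecondaryCursor;
import com.sleepycat.je.SecondaryDatabase;

public class JoinExample {
    public static void scan(Database triplesDB,
                            SecondaryDatabase subject2triple, DatabaseEntry subjectKey,
                            SecondaryDatabase predicate2triple, DatabaseEntry predicateKey)
            throws DatabaseException {
        SecondaryCursor c1 = subject2triple.openSecondaryCursor(null, null);
        SecondaryCursor c2 = predicate2triple.openSecondaryCursor(null, null);
        JoinCursor join = null;
        try {
            DatabaseEntry data = new DatabaseEntry();
            // step 1: seek each cursor to its searched key
            if (c1.getSearchKey(subjectKey, data, LockMode.DEFAULT) != OperationStatus.SUCCESS) return;
            if (c2.getSearchKey(predicateKey, data, LockMode.DEFAULT) != OperationStatus.SUCCESS) return;
            // step 2: let BerkeleyDB find the intersection
            join = triplesDB.join(new Cursor[] { c1, c2 }, null);
            DatabaseEntry key = new DatabaseEntry();
            DatabaseEntry value = new DatabaseEntry();
            while (join.getNext(key, value, LockMode.DEFAULT) == OperationStatus.SUCCESS) {
                // 'value' holds a serialized statement matching both cursors
            }
        } finally {
            if (join != null) join.close();
            c2.close();
            c1.close();
        }
    }
}
```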
This JoinCursor is then used for any specific query. For example, the following method 'listStatement' takes three parameters (s, p, o) and returns an Iterator over a list of Statements. If a parameter is null, it is used as a wildcard: this is where we use the secondary databases to find the statements.
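A sketch of that wildcard dispatch; the helper methods (scanAll, seek, iterate, joinIterate) and the placeholder Statement class are hypothetical:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import com.sleepycat.je.Cursor;
import com.sleepycat.je.SecondaryDatabase;

public abstract class RDFStoreSketch {
    /** Placeholder for the triple class. */
    static class Statement {}

    SecondaryDatabase subject2triple, predicate2triple, object2triple;

    public Iterator<Statement> listStatement(String s, String p, String o) {
        if (s == null && p == null && o == null)
            return scanAll();                       // no constraint: full scan of the primary DB
        List<Cursor> cursors = new ArrayList<Cursor>();
        if (s != null) cursors.add(seek(subject2triple, s));   // position on the subject
        if (p != null) cursors.add(seek(predicate2triple, p)); // position on the predicate
        if (o != null) cursors.add(seek(object2triple, o));    // position on the object
        if (cursors.size() == 1)
            return iterate(cursors.get(0));         // one index: plain cursor scan
        return joinIterate(cursors);                // several indexes: JoinCursor intersection
    }

    abstract Iterator<Statement> scanAll();
    abstract Cursor seek(SecondaryDatabase db, String key);
    abstract Iterator<Statement> iterate(Cursor c);
    abstract Iterator<Statement> joinIterate(List<Cursor> cursors);
}
```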
This RDFStore was used to download and parse a remote gzipped file from geneontology.org. The uncompressed file is ~61 MB and contains 774,579 statements. It took about 6 minutes to download and digest the file. Remember that each time a statement is about to be inserted, we need to check whether it already exists in the database. The amount of disk space required to store the database was 591 MB (ouch!!).
This code was just my first idea about how to solve this problem. Obviously, some other engines are far more efficient :-) : "(...) Good progress though, last night's run yielded 5 billion triples loaded in just under 10 hours for an average throughput of 135k triples per second. Max throughput was just above 210k triples per second. 1 billion triples was reached in an astonishing 78 minutes. (...)"
As a conclusion, here is the full source code for the first version of this storage engine: