
07 February 2013

A tool to compare BAMs

Following that thread on Biostar, I've created a tool that compares two or more BAMs. This Java program uses the Picard and BerkeleyDB-JE libraries and is available at: http://code.google.com/p/jvarkit/wiki/CompareBams.

Download & install

See http://code.google.com/p/jvarkit/wiki/CompareBams.

Example

The following Makefile aligns the same pair of FASTQs with five different values of the bwa aln -O option (gap open penalty):

FASTQ1=001.fastq.gz
FASTQ2=002.fastq.gz
REF=/path/to/human_g1k_v37.fasta
BWA=bwa
SAMTOOLS=samtools
ALL_BAMS= 

define SAI
$(1)_1.sai : ${FASTQ1}  $(REF)
        $(BWA) aln  $(2) -t 2  -f $$@ $(REF) $$<
$(1)_2.sai :${FASTQ2}  $(REF)
        $(BWA) aln  $(2) -t 2  -f $$@ $(REF) $$<

endef

define ALN

ALL_BAMS+= $(1).bam 

$(eval $(call SAI,$(1),$(2)))

$(1).bam: $(1)_1.sai $(1)_2.sai
        $(BWA) sampe $(3)  ${REF} \
                $(1)_1.sai $(1)_2.sai  \
                $(FASTQ1) $(FASTQ2) |\
        $(SAMTOOLS) view -S -b -o $$@ -T $(REF) - 

endef

.PHONY:all

all: diff.gz

$(eval $(foreach GAP, 8 9 10 11 12 , $(call ALN,GAP$(GAP), -O $(GAP) , )))


diff.gz: $(ALL_BAMS)
        mkdir -p tmp.bdb
        java -jar  /commun/data/packages/jvarkit/comparebams.jar -d tmp.bdb $^ | gzip --best > $@

Execute:
$ make

(...)
java -jar  /path/to/jvarkit/comparebams.jar -d tmp.bdb GAP8.bam GAP9.bam GAP10.bam GAP11.bam GAP12.bam | gzip --best > diff.gz
Feb 07, 2013 2:09:57 PM fr.inserm.umr1087.jvarkit.tools.cmpbams.CompareBams run
INFO: in GAP8.bam count:1
Feb 07, 2013 2:10:30 PM fr.inserm.umr1087.jvarkit.tools.cmpbams.CompareBams run
INFO: in GAP9.bam count:1
(...)

The output is a tab-delimited file containing:

  • the name of the read
  • the result of comparing each BAM with the others (EQ=equal, NE=not equal)
  • the positions of the read in each BAM
$ gunzip -c diff.gz | egrep -E  '(#|NE)' | head


#READ-Name      GAP8.bam GAP9.bam|GAP8.bam GAP10.bam|GAP8.bam GAP11.bam|GAP8.bam GAP12.bam|GAP9.bam GAP10.bam|GAP9.bam GAP11.bam|GAP9.bam GAP12.bam|GAP10.bam GAP11.bam|GAP10.bam GAP12.bam|GAP11.bam GAP12.bam GAP8.bam        GAP9.bam        GAP10.bam       GAP11.bam       GAP12.bam
M00491:10:000000000-A27BP:1:1101:10029:10672    EQ|NE|NE|NE|NE|NE|NE|EQ|EQ|EQ   6:123892013,6:123892006 6:123892013,6:123892006 6:123892005,6:123892013 6:123892005,6:123892013 6:123892005,6:123892013
M00491:10:000000000-A27BP:1:1101:10265:10054    EQ|EQ|NE|NE|EQ|NE|NE|NE|NE|NE   19:49671437,19:49671412 19:49671437,19:49671412 19:49671437,19:49671412 19:49671435     19:49671412,19:49671435
M00491:10:000000000-A27BP:1:1101:10904:12333    EQ|NE|NE|NE|NE|NE|NE|EQ|EQ|EQ   10:88681151,10:88681156 10:88681151,10:88681156 10:88681156,10:88681150 10:88681156,10:88681150 10:88681156,10:88681150
M00491:10:000000000-A27BP:1:1101:11211:13492    EQ|NE|NE|NE|NE|NE|NE|EQ|EQ|EQ   8:52321469      8:52321469      8:52321470      8:52321470      8:52321470
M00491:10:000000000-A27BP:1:1101:11298:18283    NE|NE|NE|NE|EQ|EQ|EQ|EQ|EQ|EQ   6:126071103,6:126071110 6:126071104,6:126071110 6:126071104,6:126071110 6:126071104,6:126071110 6:126071104,6:126071110
M00491:10:000000000-A27BP:1:1101:11381:15675    EQ|NE|NE|NE|NE|NE|NE|EQ|EQ|EQ   1:156106900,1:156106905 1:156106900,1:156106905 1:156106905,1:156106899 1:156106905,1:156106899 1:156106905,1:156106899
M00491:10:000000000-A27BP:1:1101:12189:14088    EQ|NE|NE|EQ|NE|NE|EQ|EQ|NE|NE   15:22015803     15:22015803     15:21009140     15:21009140     15:22015803
M00491:10:000000000-A27BP:1:1101:12382:11193    EQ|NE|NE|NE|NE|NE|NE|EQ|EQ|EQ   4:111542263,4:111542256 4:111542263,4:111542256 4:111542263,4:111542254 4:111542263,4:111542254 4:111542263,4:111542254
M00491:10:000000000-A27BP:1:1101:12998:24492    EQ|NE|NE|NE|NE|NE|NE|EQ|EQ|EQ   2:179433503,2:179433496 2:179433503,2:179433496 2:179433503,2:179433497 2:179433503,2:179433497 2:179433503,2:179433497
(....)
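
If you want to post-process this output, a few lines of Java are enough to, say, count the reads that were mapped differently by at least one pair of settings. Here is a minimal sketch (the column layout is taken from the output above; the class name is mine):

import java.io.*;

public class CountDiscordant
    {
    public static void main(String[] args) throws IOException
        {
        long discordant = 0, total = 0;
        BufferedReader r = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = r.readLine()) != null)
            {
            if (line.startsWith("#")) continue; // skip the header
            ++total;
            // the 2nd column holds the pairwise comparisons, e.g. EQ|NE|NE|...
            if (line.split("\t")[1].contains("NE")) ++discordant;
            }
        System.out.println(discordant + "/" + total + " reads differ in at least one comparison");
        }
    }

Run it with: gunzip -c diff.gz | java CountDiscordant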

That's it
Pierre

19 June 2011

A CouchDB-like application using Apache Jetty and BerkeleyDB. My notebook


This is my notebook for the Jetty API. Jetty provides a web server and a javax.servlet container. It can be embedded in devices, tools, frameworks, application servers, and clusters. I find it easier for developing and testing a new web application, as it doesn't require the deployment descriptors, configuration files, web container, etc. needed by Apache Tomcat. As I will show below, a whole web application can be embedded in a single Java class.
As an example, I wrote a CouchDB-like web application named DivanDB ('divan' is the French word for couch), using Jetty and BerkeleyDB. The source file is available on github. (See also my previous post: CouchDB for Bioinformatics: Storing SNPs - My Notebook.)

The BerkeleyDB-based key/value datastore

The class BDBStorage consists of a BerkeleyDB environment and a BerkeleyDB database. The database stores the keys and the values as Strings.
/** a berkeley-db String/String datastore */
private static class BDBStorage
    {
    /** bdb environment */
    private Environment environment = null;
    /** string/string database */
    private Database database = null;

    private BDBStorage()
        {
        }

    /** open environment & database */
    private void open(File dbHome) throws DatabaseException
        {
        (...)
        }

    /** close environment & database */
    private void close()
        {
        (...)
        }
    }
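
For the record, the elided open and close methods would look something like this with the BerkeleyDB-JE API (a minimal sketch: no transactions, and the database name "divandb" is my assumption; see the real source for details):

/** open environment & database */
private void open(File dbHome) throws DatabaseException
    {
    EnvironmentConfig envCfg = new EnvironmentConfig();
    envCfg.setAllowCreate(true); // create the environment files if they don't exist
    this.environment = new Environment(dbHome, envCfg);
    DatabaseConfig dbCfg = new DatabaseConfig();
    dbCfg.setAllowCreate(true);
    this.database = this.environment.openDatabase(null, "divandb", dbCfg);
    }

/** close environment & database */
private void close()
    {
    try
        {
        if (this.database != null) this.database.close();
        if (this.environment != null) this.environment.close();
        }
    catch (DatabaseException err) { /* ignored on close */ }
    this.database = null;
    this.environment = null;
    }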


The web context

A servlet context is created, and we put a BDBStorage instance in it. We also tell the context about the DivanDB servlet and the path of the application.

DivanDB app=new DivanDB();
BDBStorage storage=new BDBStorage();
ServletContextHandler context = new ServletContextHandler();
context.setAttribute(STORAGE_ATTRIBUTE, storage);
context.addServlet(new ServletHolder(app),"/*");
context.setContextPath("/divandb");
context.setResourceBase(".");

The server

The embedded server listens on the specified port. We add the context to the server; it is then started and answers queries forever.
/* create a new server */
Server server = new Server(port);
/* context */
server.setHandler(context);
/* start server */
server.start();
/* loop forever */
server.join();


The Servlet

The servlet DivanDB receives the GET, PUT, POST and DELETE queries, just like CouchDB. It decodes the JSON data using a JSON API provided in the Jetty distribution. See the code for more information.
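
Schematically, the servlet dispatch looks like this (a sketch, not the actual source; the JSON parsing and error handling live in the real code, and STORAGE_ATTRIBUTE is the constant used above):

import java.io.IOException;
import javax.servlet.ServletException;
import javax.servlet.http.*;

public class DivanDB extends HttpServlet
    {
    private BDBStorage getStorage()
        {
        /* the storage was attached to the context at startup */
        return (BDBStorage) getServletContext().getAttribute(STORAGE_ATTRIBUTE);
        }

    @Override protected void doGet(HttpServletRequest req, HttpServletResponse resp)
        throws ServletException, IOException
        {
        /* list the records, honoring the startkey/endkey/start/limit parameters */
        }

    @Override protected void doPut(HttpServletRequest req, HttpServletResponse resp)
        throws ServletException, IOException
        {
        /* decode the JSON body, store the id/value pairs, answer {"ok":"true",...} */
        }

    @Override protected void doDelete(HttpServletRequest req, HttpServletResponse resp)
        throws ServletException, IOException
        {
        /* decode the JSON body and remove the record(s) for the given "id" */
        }
    }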

Start the server

$java -jar dist/divandb.jar
2011-06-19 18:12:16.823:INFO::jetty-8.0.0.M3
2011-06-19 18:12:16.896:INFO::Started SelectChannelConnector@0.0.0.0:8080 STARTING

Testing, using CURL

Add a single SNP. Here the id is the concatenation of the zero-padded chromosome number and the genomic position:
$curl -X PUT -d '{"id":"010000011390","name":"rs75100478","avHet":0.5,"chrom":"chr1","position":11390}' http://localhost:8080/divandb/
{"ok":"true","id":["010000011390"]}
Add a single SNP without an id (one will be generated):
$curl -X PUT -d '{"name":"rs3877545","avHet":0.375,"chrom":"chr1","position":11507}' http://localhost:8080/divandb/
{"ok":"true","id":["id972755720"]}
Add an array of SNPs:
$ curl -X PUT -d '[{"id":"010000011390","name":"rs75100478","avHet":0.5,"chrom":"chr1","position":11390},{"id":"010000011507","name":"rs3877545","avHet":0.375,"chrom":"chr1","position":11507},{"id":"010000011863","name":"rs76933399","avHet":0.46875,"chrom":"chr1","position":11863},{"id":"010000011920","name":"rs78105014","avHet":0.5,"chrom":"chr1","position":11920},{"id":"010000011959","name":"rs71234145","avHet":0.5,"chrom":"chr1","position":11959},{"id":"010000012048","name":"rs7357889","avHet":0.277778,"chrom":"chr1","position":12048},{"id":"010000012073","name":"rs76972523","avHet":0.5,"chrom":"chr1","position":12073},{"id":"010000012120","name":"rs78253371","avHet":0.5,"chrom":"chr1","position":12120},{"id":"010000012148","name":"rs78371031","avHet":0.5,"chrom":"chr1","position":12148},{"id":"010000012158","name":"rs79313092","avHet":0.5,"chrom":"chr1","position":12158},{"id":"010000012168","name":"rs74400200","avHet":0.5,"chrom":"chr1","position":12168},{"id":"010000012169","name":"rs79903148","avHet":0.375,"chrom":"chr1","position":12169},{"id":"010000012213","name":"rs79272378","avHet":0.5,"chrom":"chr1","position":12213},{"id":"010000012264","name":"rs2475482","avHet":0.444444,"chrom":"chr1","position":12264},{"id":"010000012295","name":"rs76096671","avHet":0.375,"chrom":"chr1","position":12295},{"id":"010000012295","name":"rs79461972","avHet":0.375,"chrom":"chr1","position":12295},{"id":"010000012305","name":"rs71236994","avHet":0.5,"chrom":"chr1","position":12305},{"id":"010000012332","name":"rs2981846","avHet":0.375,"chrom":"chr1","position":12332},{"id":"010000012336","name":"rs76760687","avHet":0.375,"chrom":"chr1","position":12336},{"id":"010000012355","name":"rs76588012","avHet":0.444444,"chrom":"chr1","position":12355}]' http://localhost:8080/divandb/
{"ok":"true","id":["010000011390","010000011507","010000011863","010000011920","010000011959","010000012048","010000012073","010000012120","010000012148","010000012158","010000012168","010000012169","010000012213","010000012264","010000012295","010000012295","010000012305","010000012332","010000012336","010000012355"]
Deleting a SNP:
$curl -X DELETE -d '{"id":"2065420809"}' http://localhost:8080/divandb/
{"ok":"true","id":["id972755720"]}
List all the SNPs:
$ curl http://localhost:8080/divandb/
[
{"position":11390,"id":"010000011390","_timestamp":"1308501619331","name":"rs75100478","avHet":0.5,"chrom":"chr1"},
{"position":11507,"id":"010000011507","_timestamp":"1308501619335","name":"rs3877545","avHet":0.375,"chrom":"chr1"},
{"position":11863,"id":"010000011863","_timestamp":"1308501619336","name":"rs76933399","avHet":0.46875,"chrom":"chr1"},
{"position":11920,"id":"010000011920","_timestamp":"1308501619336","name":"rs78105014","avHet":0.5,"chrom":"chr1"},
{"position":11959,"id":"010000011959","_timestamp":"1308501619337","name":"rs71234145","avHet":0.5,"chrom":"chr1"},
{"position":12048,"id":"010000012048","_timestamp":"1308501619337","name":"rs7357889","avHet":0.277778,"chrom":"chr1"},
{"position":12073,"id":"010000012073","_timestamp":"1308501619338","name":"rs76972523","avHet":0.5,"chrom":"chr1"},
{"position":12120,"id":"010000012120","_timestamp":"1308501619338","name":"rs78253371","avHet":0.5,"chrom":"chr1"},
{"position":12148,"id":"010000012148","_timestamp":"1308501619339","name":"rs78371031","avHet":0.5,"chrom":"chr1"},
{"position":12158,"id":"010000012158","_timestamp":"1308501619339","name":"rs79313092","avHet":0.5,"chrom":"chr1"},
{"position":12168,"id":"010000012168","_timestamp":"1308501619339","name":"rs74400200","avHet":0.5,"chrom":"chr1"},
{"position":12169,"id":"010000012169","_timestamp":"1308501619340","name":"rs79903148","avHet":0.375,"chrom":"chr1"},
{"position":12213,"id":"010000012213","_timestamp":"1308501619340","name":"rs79272378","avHet":0.5,"chrom":"chr1"},
{"position":12264,"id":"010000012264","_timestamp":"1308501619341","name":"rs2475482","avHet":0.444444,"chrom":"chr1"},
{"position":12295,"id":"010000012295","_timestamp":"1308501619342","name":"rs79461972","avHet":0.375,"chrom":"chr1"},
{"position":12305,"id":"010000012305","_timestamp":"1308501619344","name":"rs71236994","avHet":0.5,"chrom":"chr1"},
{"position":12332,"id":"010000012332","_timestamp":"1308501619348","name":"rs2981846","avHet":0.375,"chrom":"chr1"},
{"position":12336,"id":"010000012336","_timestamp":"1308501619348","name":"rs76760687","avHet":0.375,"chrom":"chr1"},
{"position":12355,"id":"010000012355","_timestamp":"1308501619349","name":"rs76588012","avHet":0.444444,"chrom":"chr1"}
]
List the SNPs on chr1 between 12000 and 12150:
$ curl "http://localhost:8080/divandb/?startkey=010000012000&endkey=010000012150"
[
{"position":12048,"id":"010000012048","_timestamp":"1308501619337","name":"rs7357889","avHet":0.277778,"chrom":"chr1"},
{"position":12073,"id":"010000012073","_timestamp":"1308501619338","name":"rs76972523","avHet":0.5,"chrom":"chr1"},
{"position":12120,"id":"010000012120","_timestamp":"1308501619338","name":"rs78253371","avHet":0.5,"chrom":"chr1"},
{"position":12148,"id":"010000012148","_timestamp":"1308501619339","name":"rs78371031","avHet":0.5,"chrom":"chr1"}
]

List 3 SNPs starting from index=2:
$ curl "http://localhost:8080/divandb/?start=2&limit=3"
[
{"position":11507,"id":"010000011507","_timestamp":"1308501619335","name":"rs3877545","avHet":0.375,"chrom":"chr1"},{"position":11863,"id":"010000011863","_timestamp":"1308501619336","name":"rs76933399","avHet":0.46875,"chrom":"chr1"},
{"position":11920,"id":"010000011920","_timestamp":"1308501619336","name":"rs78105014","avHet":0.5,"chrom":"chr1"}
]

That's it


Pierre

26 November 2009

A Java implementation of Jan Aerts' LocusTree

This post describes my implementation of Jan Aerts' LocusTree algorithm (I want to thank Jan: our discussion and his comments were a great source of inspiration), based on BerkeleyDB-JE, a key/value datastore. This implementation has been used to build a genome browser displaying its data in the SVG format. In brief: splitting each chromosome with a dichotomic approach makes it possible to quickly find all the features in a given genomic region at a given resolution. A count of the total number of objects in the subtree of each child node is used to produce a histogram of the number of objects smaller than the given resolution.
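
The dichotomic splitting itself can be sketched as follows (hypothetical code, using the LTNode class sketched in the 'Nodes' section below; the real implementation is more subtle):

/** recursively split the interval [start,end[ into a dichotomic tree */
static LTNode split(int start, int end, int minLength)
    {
    LTNode node = new LTNode(start, end);
    if (end - start > minLength)
        {
        int mid = start + (end - start) / 2;
        node.addChild(split(start, mid, minLength));
        node.addChild(split(mid, end, minLength));
        }
    return node;
    }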

JSON/Metadata

All the information is stored in BerkeleyDB, and I've used JSON to add some metadata to each object. The JSON is serialized, gzipped and stored in BerkeleyDB.
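
Something along those lines (a sketch, not the actual code):

import java.io.*;
import java.util.zip.GZIPOutputStream;

/** serialize & gzip a JSON string into bytes suitable for a BerkeleyDB value */
static byte[] compressJson(String json) throws IOException
    {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    GZIPOutputStream gz = new GZIPOutputStream(bytes);
    gz.write(json.getBytes("UTF-8"));
    gz.finish();
    gz.close();
    return bytes.toByteArray(); /* wrap in a com.sleepycat.je.DatabaseEntry to store it */
    }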

Organism

Each organism is defined by an ID and a name. The BerkeleyDB key is the organism ID.

The organisms are loaded in the database using a simple XML file:
<organisms>
  <organism id="36">
    <name>hg18</name>
    <description>Human Genome Build v.36</description>
    <metadata>{"taxon-id":9606}</metadata>
  </organism>
</organisms>


Chromosome

Each chromosome is defined by an ID, a name, its length and its organism ID. The BerkeleyDB key is the chromosome ID.

The chromosomes are loaded in the database using a simple XML file:
<chromosomes organism-id="36">
  <chromosome id="1">
    <name>chr1</name>
    <metadata>{"size":247249719,"type":"autosomal"}</metadata>
  </chromosome>
  <chromosome id="10">
    <name>chr10</name>
    <metadata>{"size":135374737,"type":"autosomal"}</metadata>
  </chromosome>
  (...)
</chromosomes>



Track

Each track is defined by an ID and a name. The BerkeleyDB key is the track ID.

The descriptions of the tracks are loaded in the database using a simple XML file:
<tracks>
  <track id="1">
    <name>cytobands</name>
    <description>UCSC cytobands</description>
  </track>
  <track id="2">
    <name>knownGene</name>
    <description>UCSC knownGene</description>
  </track>
  <track id="3">
    <name>snp130</name>
    <description>dbSNP v.130</description>
  </track>
  <track id="4">
    <name>snp130Coding</name>
    <description>UCSC coding Snp</description>
  </track>
  <track id="5">
    <name>all_mrna</name>
    <description>UCSC All mRNA</description>
  </track>
</tracks>

Nodes


Each LocusTree node (LTNode) is linked to a chromosome and a track through a database named 'TrackChrom'. Here the BerkeleyDB key is a composite key (chromosome/track).

The structure of an LTNode is described below. Each node contains a link to its parent and links to its children, as well as the set of genomic entities whose length is greater than or equal to 'this.length'.
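
In Java, the node could be sketched like this (my own simplified version, not the actual source; MappedObject is the type returned by the loaders below):

import java.util.*;

public class LTNode
    {
    private final int start, end;      /* genomic interval covered by this node */
    private LTNode parent = null;      /* link to the parent node */
    private final List<LTNode> children = new ArrayList<LTNode>();
    /** entities whose length is greater than or equal to this node's length */
    private final Set<MappedObject> objects = new HashSet<MappedObject>();
    /** total number of objects in this subtree, used for the histogram */
    private int countRecursive = 0;

    public LTNode(int start, int end) { this.start = start; this.end = end; }
    public int getStart() { return this.start; }
    public int getEnd() { return this.end; }
    public int getLength() { return this.end - this.start; }
    public List<LTNode> getChildren() { return this.children; }
    public Set<MappedObject> getObjects() { return this.objects; }
    public int getCountRecursive() { return this.countRecursive; }
    public void addChild(LTNode child) { child.parent = this; this.children.add(child); }
    }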


To load the content of each LocusTree, I've defined a simple Java interface called LTStreamLoader, which looks like this:
public interface LTLoader
    {
    public MappedObject getMappedObject();
    public String getChromosome();
    public Set<String> getKeywords();
    }

public interface LTStreamLoader extends LTLoader
    {
    public void open(String uri) throws IOException;
    public void close() throws IOException;
    public boolean next() throws IOException;
    }
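
For instance, a hypothetical minimal loader for a BED-like tab-delimited file (chrom/start/end/name) might look like this (simplified from the real UCSC loaders; I pretend here that MappedObject has a (start, end, name) constructor):

import java.io.*;
import java.net.URL;
import java.util.*;
import java.util.zip.GZIPInputStream;

public class SimpleBedLoader implements LTStreamLoader
    {
    private BufferedReader reader = null;
    private String tokens[] = null;

    @Override public void open(String uri) throws IOException
        {
        InputStream in = new URL(uri).openStream();
        if (uri.endsWith(".gz")) in = new GZIPInputStream(in); // UCSC files are gzipped
        this.reader = new BufferedReader(new InputStreamReader(in));
        }

    @Override public boolean next() throws IOException
        {
        String line = this.reader.readLine();
        if (line == null) return false;
        this.tokens = line.split("\t");
        return true;
        }

    @Override public String getChromosome() { return this.tokens[0]; }

    @Override public MappedObject getMappedObject()
        {
        /* hypothetical constructor: start, end, name */
        return new MappedObject(Integer.parseInt(this.tokens[1]), Integer.parseInt(this.tokens[2]), this.tokens[3]);
        }

    @Override public Set<String> getKeywords() { return Collections.singleton(this.tokens[3]); }

    @Override public void close() throws IOException { this.reader.close(); }
    }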
An instance of this interface is used to load the content of a tab-delimited file, as declared in the following XML file:
<loaders>
  <load organism-id="36" track-id="5" class-loader="fr.cephb.locustree.loaders.UCSCAllMrnaLoader" limit="10000">
    http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/all_mrna.txt.gz
  </load>
  <load organism-id="36" track-id="4" class-loader="fr.cephb.locustree.loaders.UCSCSnpCodingLoader" limit="10000">
    http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/snp130CodingDbSnp.txt.gz
  </load>
  <load organism-id="36" track-id="1" class-loader="fr.cephb.locustree.loaders.UCSCCytoBandsLoader">
    http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/cytoBand.txt.gz
  </load>
  <load organism-id="36" track-id="2" class-loader="fr.cephb.locustree.loaders.UCSCKnownGeneLoader">
    http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/knownGene.txt.gz
  </load>
  <load organism-id="36" track-id="3" class-loader="fr.cephb.locustree.loaders.UCSCSnpLoader" limit="10000">
    http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/snp130.txt.gz
  </load>
</loaders>
It took about three hours to load 'snp130.txt.gz', and the size of the indexed BerkeleyDB/LocusTree database was 16 GB (ouch!).

Building the Genome Browser

The LocusTree database was used to create (yet another) genome browser. My current implementation runs smoothly under Apache Tomcat. The SVG vector format was used to draw and hyperlink the data. Here is a screenshot of the first version I wrote one week ago. As you can see, the objects that were too small to be drawn were displayed within a histogram.
Later, I added some labels.
And my latest version uses the JSON metadata available in each object to display the spliced structure of the genes.
The browser is fast (sorry, I cannot show it at this time), but I need to play with the configuration of BerkeleyDB to speed up the insertions and reduce the size of the database.

That's it.
Pierre

NB: The figures of this post were created using SVGToCanvas.

27 September 2009

Extracting Scientists & SF writers from Wikipedia


Images via Wikipedia
In a recent post on FriendFeed, Christopher Harris asked: do you know of any science fiction writer who is/was also a scientist? My first approach to automatically retrieving those names was to use Freebase. For example, the following MQL query retrieves the people who are both scientists and SF writers.

[{
  "id":null,
  "name":null,
  "type":"/people/person",
  "a:profession":[{"name":"Scientist"}],
  "b:profession":[{"name":"Science-Fiction Writer"}],
  "limit":100
}]
The MQL query Editor returned the following result:
{
  "code": "/api/status/ok",
  "result": [
    {
      "a:profession": [{
        "name": "Scientist"
      }],
      "b:profession": [{
        "name": "Science-fiction writer"
      }],
      "id": "/en/edward_llewellyn",
      "name": "Edward Llewellyn",
      "type": "/people/person"
    },
    {
      "a:profession": [{
        "name": "Scientist"
      }],
      "b:profession": [{
        "name": "Science-fiction writer"
      }],
      "id": "/en/konrad_fialkowski",
      "name": "Konrad Fiałkowski",
      "type": "/people/person"
    }
  ],
  "status": "200 OK",
  "transaction_id": "cache;cache01.p01.sjc1:8101;2009-09-27T15:58:11Z;0002"
}
Only two people! That's not much. Articles in Wikipedia, as well as in Freebase, are classified using hierarchical categories (sadly, the category graph is not acyclic), but there is no tool to find the articles matching the sub-categories. So you'd have to repeat this query for the "British scientists", the "French biologists", etc. (By the way, I think Wikipedians should not have been allowed to mix two distinct kinds of categories, e.g. profession and nationality; it messes up the classification.) (Do you know if this can be achieved using SPARQL and DBpedia?)

Then I wrote a Java tool that extracts the pages belonging to a given Wikipedia category, using the Wikipedia API. This tool, "wpsubcat", is available here: http://code.google.com/p/lindenb/downloads/list and requires BerkeleyDB Java Edition to store the temporary results. The source code is available here: WPSubCat.
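
At its core, the tool just queries the MediaWiki API. The kind of request it sends can be sketched in a few lines of Java (regex-based extraction for brevity; the cmcontinue pagination and error handling are omitted):

import java.io.*;
import java.net.URL;
import java.util.regex.*;

public class ListSubCategories
    {
    public static void main(String[] args) throws IOException
        {
        String url = "http://en.wikipedia.org/w/api.php?action=query&list=categorymembers" +
                     "&cmnamespace=14&cmlimit=500&format=xml&cmtitle=Category:Scientists";
        BufferedReader r = new BufferedReader(new InputStreamReader(new URL(url).openStream(), "UTF-8"));
        Pattern title = Pattern.compile("title=\"([^\"]+)\"");
        String line;
        while ((line = r.readLine()) != null)
            {
            // print the title attribute of each <cm/> element
            Matcher m = title.matcher(line);
            while (m.find()) System.out.println(m.group(1));
            }
        r.close();
        }
    }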

Usage

-debug-level <java.util.logging.Level> default:OFF
-base <url> default:http://en.wikipedia.org
-ns <int> restrict results to the given namespace default:14 (Category)
-db-home BerkeleyDB default directory:/tmp/bdb
-d <integer> max recursion depth default:3

-add <category> add a starting article
OR
(stdin|files) containing articles' titles

Examples


Retrieve all the sub-categories of 'Category:Scientists'
java -cp je-3.3.75.jar:wpsubcat.jar org.lindenb.tinytools.WPSubCat \
-add "Category:Scientists" > catscientists.txt

Retrieve all the scientists.
java -cp je-3.3.75.jar:wpsubcat.jar org.lindenb.tinytools.WPSubCat \
-ns 0 -d 0 catscientists.txt > scientists.txt


Result


After a series of 'sort' and 'comm' operations, the result is the following list (in fact, it is an underestimate: I've since slightly improved the way the sub-categories are retrieved):

That's it

Pierre

23 September 2009

db_sql: the new utility for BerkeleyDB. My Notebook.

A new version of BerkeleyDB, 4.8, has been released. This new version of the key/value storage engine comes with a new utility called db_sql.
From Oracle: Db_sql is a utility program that translates a schema description written in a SQL Data Definition Language dialect into C code that implements the schema using Berkeley DB. It is intended to provide a quick and easy means of getting started with Berkeley DB for users who are already conversant with SQL. It also introduces a convenient way to express a Berkeley DB schema in a format that is both external to the program that uses it and compatible with relational databases.

On my side, I still use Apache Velocity to generate this kind of code (see this older post).

Let's generate the C code for a simple database storing some SNPs. The heterozygosity, the flanking sequences and the observed variation will be stored, and the rs-id will be indexed. My first attempt was:

CREATE DATABASE snpDatabase;

CREATE TABLE snp (
name varchar(50) NOT NULL PRIMARY KEY,
rs_id VARCHAR(20) NULL,
avHet float,
seq5 text,
observed text,
seq3 text,
class enum('unknown','single','in-del','het','microsatellite','named','mixed','mnp','insertion','deletion')
);

CREATE INDEX dbsnp_id ON snp(rs_id);


But it seemed that enums, the TEXT type and the NULL/NOT NULL modifiers are not supported. So my second try was:
CREATE DATABASE snpDatabase;

CREATE TABLE snp (
name varchar(50) PRIMARY KEY,
rs_id varchar(20) NULL ,
avHet float,
seq5 varchar(300),
observed varchar(20),
seq3 varchar(300)
);

CREATE INDEX dbsnp_id ON snp(rs_id);


Invoking db_sql


The following command generates the C code for managing the SNPs in the database, as well as a test program. The code contains the structure 'struct _snp_data' describing a SNP, together with the functions to open/close the database, insert/remove a _snp_data, and serialize/de-serialize the structure.
/usr/local/BerkeleyDB.4.8/bin/db_sql -i schema.sql -o snp.c -h snp.h -t snp_test.c


snp.h

/*
* Header file for a Berkeley DB implementation
* generated from SQL DDL by db_sql
*/
#include <sys/types.h>
#include <sys/stat.h>
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include "db.h"

/*
* Array size constants.
*/
#define SNP_DATA_NAME_LENGTH 50
#define SNP_DATA_RS_ID_LENGTH 20
#define SNP_DATA_SEQ5_LENGTH 300
#define SNP_DATA_OBSERVED_LENGTH 20
#define SNP_DATA_SEQ3_LENGTH 300

/*
* Data structures representing the record layouts
*/
typedef struct _snp_data {
char name[SNP_DATA_NAME_LENGTH];
char rs_id[SNP_DATA_RS_ID_LENGTH];
double avHet;
char seq5[SNP_DATA_SEQ5_LENGTH];
char observed[SNP_DATA_OBSERVED_LENGTH];
char seq3[SNP_DATA_SEQ3_LENGTH];
} snp_data;


snp_data snp_data_specimen;

/*
* Macros for the maximum length of the
* records after serialization. This is
* the maximum size of the data that is stored
*/
#define SNP_DATA_SERIALIZED_LENGTH (sizeof(snp_data_specimen.name) + \
sizeof(snp_data_specimen.rs_id) + \
sizeof(snp_data_specimen.avHet) + \
sizeof(snp_data_specimen.seq5) + \
sizeof(snp_data_specimen.observed) + \
sizeof(snp_data_specimen.seq3))

/*
* These typedefs are prototypes for the user-written
* iteration callback functions, which are invoked during
* full iteration and secondary index queries
*/
typedef void (*snp_iteration_callback)(void *, snp_data *);

/*
* The environment creation/initialization function
*/
int create_snpDatabase_env(DB_ENV **envpp);

/*
* Database creation/initialization functions
*/
int create_snp_database(DB_ENV *envp, DB **dbpp);

/*
* Database removal functions
*/
int remove_snp_database(DB_ENV *envp);

/*
* Functions for inserting records by providing
* the full corresponding data structure
*/
int snp_insert_struct(DB *dbp, snp_data *snpp);

/*
* Functions for inserting records by providing
* each field value as a separate argument
*/
int snp_insert_fields(DB * dbp,
char *name,
char *rs_id,
double avHet,
char *seq5,
char *observed,
char *seq3);

/*
* Functions for retrieving records by key
*/
int get_snp_data(DB *dbp, char *snp_key, snp_data *data);

/*
* Functions for deleting records by key
*/
int delete_snp_key(DB *dbp, char *snp_key);

/*
* Functions for doing iterations over
* an entire primary database
*/
int snp_full_iteration(DB *dbp,
snp_iteration_callback user_func,
void *user_data);

/*
* Index creation and removal functions
*/
int create_dbsnp_id_secondary(DB_ENV *envp, DB *dbpp, DB **secondary_dbpp);

int remove_dbsnp_id_index(DB_ENV * envp);

int dbsnp_id_query_iteration(DB *secondary_dbp,
char *dbsnp_id_key,
snp_iteration_callback user_func,
void *user_data);

/*
* This convenience method invokes all of the
* environment and database creation methods necessary
* to initialize the complete BDB environment. It uses
* the global environment and database pointers declared
* below. You may bypass this function and use your own
* environment and database pointers, if you wish.
*/
int initialize_snpDatabase_environment();

extern DB_ENV * snpDatabase_envp;
extern DB *snp_dbp;
extern DB *dbsnp_id_dbp;


snp.c

#include "snp.h"


int create_snpDatabase_env(DB_ENV **envpp)
{
int ret, flags;
char *env_name = "./snpDatabase";

if ((ret = db_env_create(envpp, 0)) != 0) {
fprintf(stderr, "db_env_create: %s", db_strerror(ret));
return 1;
}

(*envpp)->set_errfile(*envpp, stderr);
flags = DB_CREATE | DB_INIT_MPOOL;


if ((ret = (*envpp)->open(*envpp, env_name, flags, 0)) != 0) {
(*envpp)->err(*envpp, ret, "DB_ENV->open: %s", env_name);
return 1;
}
return 0;
}

/*
* These are custom comparator functions for integer keys. They are
* needed to make integers sort well on little-endian architectures,
* such as x86. cf. discussion of btree comparators in 'Getting Started
* with Data Storage' manual.
*/
static int
compare_int(DB *dbp, const DBT *a, const DBT *b)
{
int ai, bi;

memcpy(&ai, a->data, sizeof(int));
memcpy(&bi, b->data, sizeof(int));
return (ai - bi);
}

int
compare_long(DB *dbp, const DBT *a, const DBT *b)
{
long ai, bi;

memcpy(&ai, a->data, sizeof(long));
memcpy(&bi, b->data, sizeof(long));
return (ai - bi);
}

/*
* A generic function for creating and opening a database
*/
int
create_database(DB_ENV *envp,
char *db_name,
DB **dbpp,
int flags,
DBTYPE type,
int moreflags,
int (*comparator)(DB *, const DBT *, const DBT *))
{
int ret;
FILE *errfilep;

if ((ret = db_create(dbpp, envp, 0)) != 0) {
envp->err(envp, ret, "db_create");
return ret;
}

if (moreflags != 0)
(*dbpp)->set_flags(*dbpp, moreflags);

if (comparator != NULL)
(*dbpp)->set_bt_compare(*dbpp, comparator);

envp->get_errfile(envp, &errfilep);
(*dbpp)->set_errfile(*dbpp, errfilep);
if ((ret = (*dbpp)->open(*dbpp, NULL, db_name,
NULL, type, flags, 0644)) != 0) {
(*dbpp)->err(*dbpp, ret, "DB->open: %s", db_name);
return ret;
}

return 0;
}

int create_snp_database(DB_ENV *envp, DB **dbpp)
{
return create_database(envp, "snp.db", dbpp,
DB_CREATE, DB_BTREE, 0, NULL);
}


int remove_snp_database(DB_ENV *envp)
{
return envp->dbremove(envp, NULL, "snp.db", NULL, 0);
}


int serialize_snp_data(snp_data *snpp, char *buffer)
{
size_t len;
char *p;

memset(buffer, 0, SNP_DATA_SERIALIZED_LENGTH);
p = buffer;


len = strlen(snpp->name) + 1;
assert(len <= SNP_DATA_NAME_LENGTH);
memcpy(p, snpp->name, len);
p += len;


len = strlen(snpp->rs_id) + 1;
assert(len <= SNP_DATA_RS_ID_LENGTH);
memcpy(p, snpp->rs_id, len);
p += len;

memcpy(p, &snpp->avHet, sizeof(snpp->avHet));
p += sizeof(snpp->avHet);

len = strlen(snpp->seq5) + 1;
assert(len <= SNP_DATA_SEQ5_LENGTH);
memcpy(p, snpp->seq5, len);
p += len;


len = strlen(snpp->observed) + 1;
assert(len <= SNP_DATA_OBSERVED_LENGTH);
memcpy(p, snpp->observed, len);
p += len;


len = strlen(snpp->seq3) + 1;
assert(len <= SNP_DATA_SEQ3_LENGTH);
memcpy(p, snpp->seq3, len);
p += len;


return p - buffer;
}

void deserialize_snp_data(char *buffer, snp_data *snpp)
{
size_t len;

memset(snpp, 0, sizeof(*snpp));

len = strlen(buffer) + 1;
assert(len <= SNP_DATA_NAME_LENGTH);
memcpy(snpp->name, buffer, len);
buffer += len;


len = strlen(buffer) + 1;
assert(len <= SNP_DATA_RS_ID_LENGTH);
memcpy(snpp->rs_id, buffer, len);
buffer += len;

memcpy(&snpp->avHet, buffer, sizeof(snpp->avHet));
buffer += sizeof(snpp->avHet);

len = strlen(buffer) + 1;
assert(len <= SNP_DATA_SEQ5_LENGTH);
memcpy(snpp->seq5, buffer, len);
buffer += len;


len = strlen(buffer) + 1;
assert(len <= SNP_DATA_OBSERVED_LENGTH);
memcpy(snpp->observed, buffer, len);
buffer += len;


len = strlen(buffer) + 1;
assert(len <= SNP_DATA_SEQ3_LENGTH);
memcpy(snpp->seq3, buffer, len);
buffer += len;

}


int snp_insert_struct( DB *dbp, snp_data *snpp)
{
DBT key_dbt, data_dbt;
char serialized_data[SNP_DATA_SERIALIZED_LENGTH];
int ret, serialized_size;
char *snp_key = snpp->name;

memset(&key_dbt, 0, sizeof(key_dbt));
memset(&data_dbt, 0, sizeof(data_dbt));

key_dbt.data = snp_key;
key_dbt.size = strlen(snp_key) + 1;

serialized_size = serialize_snp_data(snpp, serialized_data);

data_dbt.data = serialized_data;
data_dbt.size = serialized_size;

if ((ret = dbp->put(dbp, NULL, &key_dbt, &data_dbt, 0)) != 0) {
dbp->err(dbp, ret, "Inserting key %d", snp_key);
return -1;
}
return 0;
}

int snp_insert_fields(DB *dbp,
char *name,
char *rs_id,
double avHet,
char *seq5,
char *observed,
char *seq3)
{
snp_data data;
assert(strlen(name) < SNP_DATA_NAME_LENGTH);
strncpy(data.name, name, SNP_DATA_NAME_LENGTH);
assert(strlen(rs_id) < SNP_DATA_RS_ID_LENGTH);
strncpy(data.rs_id, rs_id, SNP_DATA_RS_ID_LENGTH);
data.avHet = avHet;
assert(strlen(seq5) < SNP_DATA_SEQ5_LENGTH);
strncpy(data.seq5, seq5, SNP_DATA_SEQ5_LENGTH);
assert(strlen(observed) < SNP_DATA_OBSERVED_LENGTH);
strncpy(data.observed, observed, SNP_DATA_OBSERVED_LENGTH);
assert(strlen(seq3) < SNP_DATA_SEQ3_LENGTH);
strncpy(data.seq3, seq3, SNP_DATA_SEQ3_LENGTH);
return snp_insert_struct(dbp, &data);
}


int get_snp_data(DB *dbp,
char *snp_key,
snp_data *data)
{
DBT key_dbt, data_dbt;
int ret;
char *canonical_key = snp_key;

memset(&key_dbt, 0, sizeof(key_dbt));
memset(&data_dbt, 0, sizeof(data_dbt));

key_dbt.data = canonical_key;
key_dbt.size = strlen(canonical_key) + 1;

if ((ret = dbp->get(dbp, NULL, &key_dbt, &data_dbt, 0)) != 0) {
dbp->err(dbp, ret, "Retrieving key %d", snp_key);
return ret;
}

assert(data_dbt.size <= SNP_DATA_SERIALIZED_LENGTH);

deserialize_snp_data(data_dbt.data, data);
return 0;
}


int delete_snp_key(DB *dbp, char *snp_key)
{
DBT key_dbt;
int ret;
char *canonical_key = snp_key;

memset(&key_dbt, 0, sizeof(key_dbt));

key_dbt.data = canonical_key;
key_dbt.size = strlen(canonical_key) + 1;

if ((ret = dbp->del(dbp, NULL, &key_dbt, 0)) != 0) {
dbp->err(dbp, ret, "deleting key %d", snp_key);
return ret;
}

return 0;
}


int snp_full_iteration(DB *dbp,
snp_iteration_callback user_func,
void *user_data)
{
DBT key_dbt, data_dbt;
DBC *cursorp;
snp_data deserialized_data;
int ret;

memset(&key_dbt, 0, sizeof(key_dbt));
memset(&data_dbt, 0, sizeof(data_dbt));

if ((ret = dbp->cursor(dbp, NULL, &cursorp, 0)) != 0) {
dbp->err(dbp, ret, "creating cursor");
return ret;
}

while ((ret = cursorp->get(cursorp, &key_dbt, &data_dbt, DB_NEXT)) == 0) {
deserialize_snp_data(data_dbt.data, &deserialized_data);
(*user_func)(user_data, &deserialized_data);
}

if (ret != DB_NOTFOUND) {
dbp->err(dbp, ret, "Full iteration");
cursorp->close(cursorp);
return ret;
}

cursorp->close(cursorp);

return 0;
}

int dbsnp_id_callback(DB *dbp,
const DBT *key_dbt,
const DBT *data_dbt,
DBT *secondary_key_dbt)
{

int ret;
snp_data deserialized_data;

deserialize_snp_data(data_dbt->data, &deserialized_data);

memset(secondary_key_dbt, 0, sizeof(*secondary_key_dbt));
secondary_key_dbt->size = strlen(deserialized_data.rs_id) + 1;
secondary_key_dbt->data = malloc(secondary_key_dbt->size);
memcpy(secondary_key_dbt->data, deserialized_data.rs_id,
secondary_key_dbt->size);

/* tell the caller to free memory referenced by secondary_key_dbt */
secondary_key_dbt->flags = DB_DBT_APPMALLOC;

return 0;
}

int create_dbsnp_id_secondary(DB_ENV *envp,
DB *primary_dbp,
DB **secondary_dbpp)
{
int ret;
char * secondary_name = "dbsnp_id.db";

if ((ret = create_database(envp, secondary_name, secondary_dbpp,
DB_CREATE, DB_BTREE, DB_DUPSORT, NULL)) != 0)
return ret;

if ((ret = primary_dbp->associate(primary_dbp, NULL, *secondary_dbpp,
&dbsnp_id_callback, DB_CREATE)) != 0) {
(*secondary_dbpp)->err(*secondary_dbpp, ret,
"DB->associate: %s.db", secondary_name);
return ret;
}
return 0;
}

int remove_dbsnp_id_index(DB_ENV *envp)
{
return envp->dbremove(envp, NULL, "dbsnp_id.db", NULL, 0);
}

int dbsnp_id_query_iteration(DB *secondary_dbp,
char *dbsnp_id_key,
snp_iteration_callback user_func,
void *user_data)
{
DBT key_dbt, data_dbt;
DBC *cursorp;
snp_data deserialized_data;
int ret;

memset(&key_dbt, 0, sizeof(key_dbt));
memset(&data_dbt, 0, sizeof(data_dbt));

if ((ret = secondary_dbp->cursor(secondary_dbp, NULL, &cursorp, 0)) != 0) {
secondary_dbp->err(secondary_dbp, ret, "creating cursor");
return ret;
}

key_dbt.data = dbsnp_id_key;
key_dbt.size = strlen(dbsnp_id_key) + 1;

for (ret = cursorp->get(cursorp, &key_dbt, &data_dbt, DB_SET);
ret == 0;
ret = cursorp->get(cursorp, &key_dbt, &data_dbt, DB_NEXT_DUP)) {
deserialize_snp_data(data_dbt.data, &deserialized_data);
(*user_func)(user_data, &deserialized_data);
}

if (ret != DB_NOTFOUND) {
secondary_dbp->err(secondary_dbp, ret, "Querying secondary");
return ret;
}

cursorp->close(cursorp);

return 0;
}

DB_ENV * snpDatabase_envp = NULL;
DB *snp_dbp = NULL;
DB *dbsnp_id_dbp = NULL;

int initialize_snpDatabase_environment()
{
if (create_snpDatabase_env(&snpDatabase_envp) != 0)
goto exit_error;

if (create_snp_database(snpDatabase_envp, &snp_dbp) != 0)
goto exit_error;

if (create_dbsnp_id_secondary(snpDatabase_envp, snp_dbp, &dbsnp_id_dbp) != 0)
goto exit_error;

return 0;

exit_error:

fprintf(stderr, "Stopping initialization because of error\n");
return -1;
}

snp_test.c


/*
* Simple test for a Berkeley DB implementation
* generated from SQL DDL by db_sql
*/

#include "snp.h"

/*
* These are the iteration callback functions. One is defined per
* database(table). They are used for both full iterations and for
* secondary index queries. When a retrieval returns multiple records,
* as in full iteration over an entire database, one of these functions
* is called for each record found
*/

void snp_iteration_callback_test(void *msg, snp_data *snp_record)
{
printf("In iteration callback, message is: %s\n", (char *)msg);

printf("snp->name: %s\n", snp_record->name);
printf("snp->rs_id: %s\n", snp_record->rs_id);
printf("snp->avHet: %lf\n", snp_record->avHet);
printf("snp->seq5: %s\n", snp_record->seq5);
printf("snp->observed: %s\n", snp_record->observed);
printf("snp->seq3: %s\n", snp_record->seq3);
}


int main(int argc, char **argv)
{
int i;
int ret;

snp_data snp_record;
snp_dbp = NULL;
dbsnp_id_dbp = NULL;

/*
* Use the convenience method to initialize the environment.
* The initializations for each entity and environment can be
* done discretely if you prefer, but this is the easy way.
*/
ret = initialize_snpDatabase_environment();
if (ret != 0){
printf("Initialize error");
return ret;
}

/*
* Now that everything is initialized, insert a single
* record into each database, using the ...insert_fields
* functions. These functions take each field of the
* record as a separate argument
*/
ret = snp_insert_fields( snp_dbp, "ninety-nine", "ninety-nine", 99.5, "ninety-nine", "ninety-nine", "ninety-nine");
if (ret != 0){
printf("Insert error\n");
goto exit_error;
}


/*
* Next, retrieve the records just inserted, looking them up
* by their key values
*/

printf("Retrieval of snp record by key\n");
ret = get_snp_data( snp_dbp, "ninety-nine", &snp_record);

printf("snp.name: %s\n", snp_record.name);
printf("snp.rs_id: %s\n", snp_record.rs_id);
printf("snp.avHet: %lf\n", snp_record.avHet);
printf("snp.seq5: %s\n", snp_record.seq5);
printf("snp.observed: %s\n", snp_record.observed);
printf("snp.seq3: %s\n", snp_record.seq3);
if (ret != 0)
{
printf("Retrieve error\n");
goto exit_error;
}


/*
* Now try iterating over every record, using the ...full_iteration
* functions for each database. For each record found, the
* appropriate ...iteration_callback_test function will be invoked
* (these are defined above).
*/
ret = snp_full_iteration(snp_dbp, &snp_iteration_callback_test,
"retrieval of snp record through full iteration");
if (ret != 0){
printf("Full Iteration Error\n");
goto exit_error;
}


/*
* For the secondary indexes, query for the known keys. This also
* results in the ...iteration_callback_test function's being called
* for each record found.
*/
dbsnp_id_query_iteration(dbsnp_id_dbp, "ninety-nine",
&snp_iteration_callback_test,
"retrieval of snp record through dbsnp_id query");

/*
* Now delete a record from each database using its primary key.
*/
ret = delete_snp_key( snp_dbp, "ninety-nine");
if (ret != 0) {
printf("Delete error\n");
goto exit_error;
}


exit_error:
/*
* Close the secondary index databases
*/
if (dbsnp_id_dbp != NULL)
dbsnp_id_dbp->close(dbsnp_id_dbp, 0);


/*
* Close the primary databases
*/
if (snp_dbp != NULL)
snp_dbp->close(snp_dbp, 0);


/*
* Delete the secondary index databases
*/
remove_dbsnp_id_index(snpDatabase_envp);

/*
* Delete the primary databases
*/
remove_snp_database(snpDatabase_envp);

/*
* Finally, close the environment
*/
snpDatabase_envp->close(snpDatabase_envp, 0);
return ret;
}

Compiling and Running the test


export LD_LIBRARY_PATH=/usr/local/BerkeleyDB.4.8/lib
gcc snp_test.c snp.c -ldb
mkdir snpDatabase
./a.out
Retrieval of snp record by key
snp.name: ninety-nine
snp.rs_id: ninety-nine
snp.avHet: 99.500000
snp.seq5: ninety-nine
snp.observed: ninety-nine
snp.seq3: ninety-nine
In iteration callback, message is: retrieval of snp record through full iteration
snp->name: ninety-nine
snp->rs_id: ninety-nine
snp->avHet: 99.500000
snp->seq5: ninety-nine
snp->observed: ninety-nine
snp->seq3: ninety-nine
In iteration callback, message is: retrieval of snp record through dbsnp_id query
snp->name: ninety-nine
snp->rs_id: ninety-nine
snp->avHet: 99.500000
snp->seq5: ninety-nine
snp->observed: ninety-nine
snp->seq3: ninety-nine



That's it
Pierre