07 February 2010

I Really need to sleep: inserting the SNPs into MongoDB with C++

In the previous post I showed how to parse dbSNP/XML with libxml. In the current post I'll insert the results into MongoDB using the native Mongo C++ API. Some others have already posted about MongoDB: for example, see Jan's, Brad's or Neil's posts. The Boost C++ library is required: my code failed to compile with Boost 1.4* but it compiled fine with Boost 1.39.

Starting MongoDB

> mkdir ~/tmp/MONGODB/data
> mongodb-linux-i686-1.2.2/bin/mongod -dbpath ~/tmp/MONGODB/data

Sun Feb 7 19:22:36 Mongo DB : starting : pid = 12123 port = 27017 dbpath = /home/pierre/tmp/MONGODB/data master = 0 slave = 0 32-bit
** NOTE: when using MongoDB 32 bit, you are limited to about 2 gigabytes of data
** see http://blog.mongodb.org/post/137788967/32-bit-limitations for more

Sun Feb 7 19:22:36 db version v1.2.2, pdfile version 4.5
Sun Feb 7 19:22:36 git version: 8a4fb8b1c7cb78648c55368d806ba35054f6be54
Sun Feb 7 19:22:36 sys info: Linux domU-12-31-39-01-70-B4 2.6.21.7-2.fc8xen #1 SMP Fri Feb 15 12:39:36 EST 2008 i686 BOOST_LIB_VERSION=1_37
Sun Feb 7 19:22:36 waiting for connections on port 27017

Refactoring the XML parser for dbSNP


I've added a 'mongo::DBClientConnection connection' to the MongoDB database in the class DBSNPHandler. This connection is simply opened with:
state.connection.connect("localhost");
Now, in the function "endElement", when a new SNP is found, a new mongo::BSONObjBuilder object is filled with the fields describing the SNP. The "rs####" identifier is used as the key ("_id") of the document in the database:
mongo::BSONObjBuilder b;
b.append("_id", state->rs_id);
b.append("name", state->rs_id);
b.append("seq5", state->seq5);
b.append("observed", state->observed);
b.append("seq3", state->seq3);
This object is then converted to a mongo::BSONObj and inserted into the MongoDB database:
mongo::BSONObj p = b.obj();
state->connection.insert("ncbi.dbsnp", p);
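Since the rs id doubles as "_id", and MongoDB requires "_id" to be unique within a collection, loading the same SNP twice should not create a second document. Here is a minimal sketch of that keyed behaviour, using a std::map as a stand-in for the collection. It does not use the Mongo driver at all, and 'Snp', 'SnpCollection' and 'insert_snp' are hypothetical names introduced only for this illustration.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

// Hypothetical stand-in for one dbSNP document; field names mirror the post.
struct Snp {
    std::string rs_id;     // e.g. "rs69624490", used as the key
    std::string seq5;      // 5' flanking sequence
    std::string observed;  // observed variation, e.g. "C/G"
    std::string seq3;      // 3' flanking sequence
};

// Hypothetical stand-in for the "ncbi.dbsnp" collection, keyed by _id.
typedef std::map<std::string, Snp> SnpCollection;

// Store a SNP under its rs id. Returns false and leaves the collection
// unchanged when the key already exists, which models a unique _id.
bool insert_snp(SnpCollection& collection, const Snp& snp) {
    assert(!snp.rs_id.empty());  // a document needs a key
    return collection.insert(std::make_pair(snp.rs_id, snp)).second;
}
```

With this model, a second insert under an existing rs id is refused and the first record is kept.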
At the end, we can loop over all the items:
std::cout << "count:" << state.connection.count("ncbi.dbsnp") << std::endl;
mongo::BSONObj emptyObj;
std::auto_ptr<mongo::DBClientCursor> cursor = state.connection.query("ncbi.dbsnp", emptyObj);
while( cursor->more() )
    {
    std::cout << cursor->next().toString() << std::endl;
    }
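
A word on where the "_id" comes from: it is built from the "rsId" attribute of each "Rs" element. In libxml2's SAX2 "startElementNs" callback, every attribute occupies five consecutive slots of the "attributes" array (local name, prefix, URI, start of value, end of value), and the value is delimited by two pointers rather than being NUL-terminated. Here is a minimal sketch of a lookup helper under those assumptions. 'find_attribute' and 'XmlCharLike' are hypothetical names; the typedef stands in for libxml2's xmlChar so that the snippet compiles without the libxml2 headers.

```cpp
#include <cassert>
#include <cstring>
#include <string>

// Local stand-in for libxml2's xmlChar, which is an unsigned char.
typedef unsigned char XmlCharLike;

// Look up an attribute by local name in a SAX2 attribute array.
// Each attribute takes five slots:
//   [0] local name, [1] prefix, [2] URI, [3] value begin, [4] value end.
// Returns true and fills 'value' when the attribute is present.
bool find_attribute(const XmlCharLike** attributes, int nb_attributes,
                    const char* name, std::string& value) {
    assert(attributes != 0 || nb_attributes == 0);
    for (int i = 0; i < nb_attributes; ++i) {
        const XmlCharLike** slot = attributes + i * 5;
        if (std::strcmp(reinterpret_cast<const char*>(slot[0]), name) != 0) {
            continue;
        }
        const char* begin = reinterpret_cast<const char*>(slot[3]);
        const char* end = reinterpret_cast<const char*>(slot[4]);
        value.assign(begin, end);  // copy exactly the bytes in [begin, end)
        return true;
    }
    return false;
}
```

The handler in the source code below does the same thing inline, then prefixes the value with "rs".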

Compilation

Here, the program was compiled using:
export LD_LIBRARY_PATH=../boost/boost_1_39_0/stage/lib:../mongodb-linux-i686-1.2.2/lib

g++ `xml2-config --cflags --libs` \
-I ../mongodb-linux-i686-1.2.2/include/mongo \
-I ../boost/boost_1_39_0 \
-L ../mongodb-linux-i686-1.2.2/lib \
-L ../boost/boost_1_39_0/stage/lib \
dbsnp.c \
-lmongoclient \
-lboost_thread-gcc43-mt \
-lboost_filesystem-gcc43-mt

Execution


./a.out ds_ch17.xml.gz
count:740
{ _id: "rs69624490", name: "rs69624490", seq5: "CTCCAGCCCGGGCCCACCCTACAGCCGACACCAAGTTGCGTCACCGTGATCTGGACACCCAGACGTACAT...", observed: "C/G", seq3: "GGTACTCCACCAGCAGGAGCAGGAGGCTCTCCCACCTCCACCTGCCACTGGCCGACAGCTCCCAGTGCGT..." }
{ _id: "rs69626771", name: "rs69626771", seq5: "GAAATTCCTACAAAAATCCATCTTTGTGGCATCATCAGGGGTGATCTGTCCTTGCAGGAATATCTCACGT...", observed: "G/T", seq3: "CTGTAGTTAAACTGGTAACGAAGGTTCCACCCTTCCTCCCAGGCCACAGCGCCCCCAGCAGTCTTGCACC..." }
{ _id: "rs69628798", name: "rs69628798", seq5: "ACCTCAGTAAAATGAAGATTATTACTATGTGCCCTATGCAGGACAGGGACTGTGTTCTGATACAGGCCCT...", observed: "A/G", seq3: "TTGAACAGATATCAGAAAAAGGGGGAGAGAGAAATCAGTTGGTTGGGAGGAGAATGAGGGGGGCAGGAGG..." }
{ _id: "rs69633675", name: "rs69633675", seq5: "GTTGTTGGTGAGGGGAGGGAGTGGGGCAAGAGGAACAGTGTGGTCTAGAGGATAGAGCAAGGGAATGGGA...", observed: "A/G", seq3: "TGGCTCAGTGGAAAGAACACGGGCTTTGGAGTCAGAGATCAGGGGTTCGAATCCCGGCTCTGCCACTTGG..." }
{ _id: "rs69638381", name: "rs69638381", seq5: "GGCCACGCTGCTTGTCATAGCGGCTTTCCAGCTCTGCCCTCCGCAGCAAGGGCAGCGCTCACCTAGGCAA...", observed: "G/T", seq3: "GGGCCGTACCGATGAGTTCTCCCGGGGAGAGACCAGGAGCTCTGAGTCAGGAAGGGAATCAAAAGGCGAC..." }
{ _id: "rs69640257", name: "rs69640257", seq5: "TTTCCTCTTCTCCTTGGCCCTGCATTATCCCCACTATTTGACTTGTCCAGGTCAGCCTTTGTAAATAAAA...", observed: "A/G", seq3: "ACTCTTTACCATCACTTATTCTCAAGGCTGGTCAAAACCTTCTCTGTCATTGTCATTTGTTTATTGAGCA..." }
{ _id: "rs69640260", name: "rs69640260", seq5: "CTCAAAAGAGAATGGTTCCTTTAGGTCCCTGAGGACACCCCAGGAAGGTTGGGTATCCCTTGCTTCAATT...", observed: "C/T", seq3: "AGCGTGTCTTAGTGGAAAAAGCACAAGTCTGGGAGTCAGAGTATCTGGGTTCTAATCCCCACCACCCAAT..." }
{ _id: "rs69641743", name: "rs69641743", seq5: "TCTATACATTGTTTAGTTCCTCTCCCCCACTAGACTGTAAACACCTTGAGGGCAAGGAGCATCTCTTCTG...", observed: "A/C", seq3: "CTAGCAAATAGAGCACGGGCCTGGGAATCTGAATAGGTTCTAATCCTGGCTCTGCCACTTGTCTGCTATG..." }
{ _id: "rs69642295", name: "rs69642295", seq5: "AGCATGCACGATATAAAAAATGCCTGAGCACGTTTTCGACCACCTGAGGGAAGCAGAGGAAAGAGTGAAA...", observed: "A/G", seq3: "GGGATAGCTCTTGTGTGGATCAGGAGTGTGGTTCAACAGCACAAATTCATTTACAGAGATCACTAGAAGC..." }
{ _id: "rs69643084", name: "rs69643084", seq5: "AATTCCCTGATACAGTGTTTTGCACAGGGTGCTTTACTGCTGAAAGACTGAATGCTGCCCTATAGCCTGC...", observed: "C/T", seq3: "GTTTTCTTAAGAACCCATGGGGCTCTGGGCTACTTCTTTTCTTTGTGACTCCATAGTAGTTTACAAAAGC..." }
(...)

Source code


#include <libxml/parser.h>
#include <string>
#include <cstring>
#include <iostream>
#include <memory>
#include "client/dbclient.h"
#include "db/jsobj.h"

/**
 * Holds the state of the parser
 */
class DBSNPHandler
    {
    public:
        // connection to the MongoDB server
        mongo::DBClientConnection connection;
        // current rs### id
        std::string rs_id;
        // current 5' sequence
        std::string seq5;
        // current observed variation
        std::string observed;
        // current 3' sequence
        std::string seq3;
        // string currently being filled by the SAX handler
        std::string* content;
        // did we find the sequence?
        bool sequence_found;

        DBSNPHandler():content(NULL),sequence_found(false)
            {
            }

        ~DBSNPHandler()
            {
            clear();
            }

        void clear()
            {
            // deleting a NULL pointer is a no-op
            delete content;
            content=NULL;
            sequence_found=false;
            rs_id.clear();
            seq5.clear();
            observed.clear();
            seq3.clear();
            }
    };

/** called when a tag is opened */
static void startElement(void * ctx,
    const xmlChar * localname,
    const xmlChar * prefix,
    const xmlChar * URI,
    int nb_namespaces,
    const xmlChar ** namespaces,
    int nb_attributes,
    int nb_defaulted,
    const xmlChar ** attributes)
    {
    DBSNPHandler* state=(DBSNPHandler*)ctx;
    if(strcmp( (char*) localname,"Rs")==0)
        {
        state->clear();

        // each attribute uses 5 slots: localname, prefix, URI, value start, value end
        for(int i=0;i< nb_attributes;++i)
            {
            if(strcmp((char*) attributes[i*5],"rsId")!=0) continue;
            int len=(char*) attributes[i*5+4]-(char*) attributes[i*5+3];
            state->rs_id.assign("rs");
            state->rs_id.append(
                (char*) attributes[i*5+3],
                len
                );
            break;
            }
        }
    // only the first Seq5/Observed/Seq3 of the current SNP is recorded
    else if((strcmp( (char*) localname,"Seq5")==0 ||
        strcmp( (char*) localname,"Observed")==0 ||
        strcmp( (char*) localname,"Seq3")==0) &&
        state->sequence_found==false
        )
        {
        state->content=new std::string;
        }
    }


/** called when a tag is closed */
static void endElement(void * ctx,
    const xmlChar * localname,
    const xmlChar * prefix,
    const xmlChar * URI)
    {
    DBSNPHandler* state=(DBSNPHandler*)ctx;
    if(strcmp( (char*) localname,"Rs")==0)
        {
        // build the document for this SNP, keyed by its rs id
        mongo::BSONObjBuilder b;
        b.append("_id", state->rs_id);
        b.append("name", state->rs_id);
        b.append("seq5", state->seq5);
        b.append("observed", state->observed);
        b.append("seq3", state->seq3);
        mongo::BSONObj p = b.obj();

        state->connection.insert("ncbi.dbsnp", p);

        //we're done with this SNP, clear the state
        state->clear();
        }
    else if(state->content!=NULL && strcmp( (char*) localname,"Seq5")==0)
        {
        state->seq5.assign(*(state->content));
        delete state->content;
        state->content=NULL;
        }
    else if(state->content!=NULL && strcmp( (char*) localname,"Observed")==0)
        {
        state->observed.assign(*(state->content));
        delete state->content;
        state->content=NULL;
        }
    else if(state->content!=NULL && strcmp( (char*) localname,"Seq3")==0)
        {
        state->seq3.assign(*(state->content));
        delete state->content;
        state->content=NULL;
        state->sequence_found=true;
        }
    }

/** called for each chunk of text; appends it to the current buffer, if any */
static void handleCharacters(void * ctx, const xmlChar * ch, int len)
    {
    DBSNPHandler* state=(DBSNPHandler*)ctx;
    if(state->content!=NULL)
        {
        state->content->append((char*)ch,len);
        }
    }


int main(int argc,char **argv)
    {
    int res;
    LIBXML_TEST_VERSION
    xmlSAXHandler handler;
    DBSNPHandler state;

    memset(&handler,0,sizeof(xmlSAXHandler));

    handler.startElementNs=startElement;
    handler.endElementNs=endElement;
    handler.characters=handleCharacters;
    handler.initialized=XML_SAX2_MAGIC;

    state.connection.connect("localhost");

    for(int i=1;i< argc;++i)
        {
        res=xmlSAXUserParseFile(&handler,&state,argv[i]);

        if(res!=0)
            {
            std::cerr << "Error " << res << " while parsing " << argv[i] << std::endl;
            }
        }

    //dump results
    std::cout << "count:" << state.connection.count("ncbi.dbsnp") << std::endl;
    mongo::BSONObj emptyObj;
    std::auto_ptr<mongo::DBClientCursor> cursor = state.connection.query("ncbi.dbsnp", emptyObj);
    while( cursor->more() )
        {
        std::cout << cursor->next().toString() << std::endl;
        }

    xmlCleanupParser();
    xmlMemoryDump();
    return(0);
    }
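
A note on "handleCharacters": a SAX parser is free to deliver the text of a single element in several chunks, and each chunk comes as a pointer plus a length rather than as a NUL-terminated string. That is why the handler appends to the buffer instead of assigning to it. Here is a minimal sketch of that accumulation with no libxml2 dependency. 'TextBuffer' and 'on_characters' are hypothetical names, and the 'capturing' flag plays the role of the non-NULL 'content' pointer in the listing above.

```cpp
#include <cassert>
#include <string>

// Hypothetical, simplified parser state.
struct TextBuffer {
    bool capturing;       // true while inside Seq5, Observed or Seq3
    std::string content;  // text collected so far for the current element
    TextBuffer() : capturing(false) {}
};

// Same shape as a SAX "characters" callback: 'ch' is not NUL-terminated,
// so exactly 'len' bytes are appended, and only while capturing.
void on_characters(TextBuffer& state, const char* ch, int len) {
    assert(len >= 0);
    if (state.capturing) {
        state.content.append(ch, static_cast<std::string::size_type>(len));
    }
}
```

Calling it twice with two pieces of one sequence leaves the concatenation of both pieces in the buffer, which is what the real handler relies on for long flanking sequences.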


Hey... here comes the sandman... :-)


That's it !
Pierre
