29 April 2009

CouchDB for Bioinformatics: Storing SNPs - My Notebook


In a previous post, I've been playing with Apache Hadoop.
I've encountered some technical difficulties with Hadoop (such as the simple question: "How should can I read my data from a file stored on the Hadoop File System (HDFS) ?? ") , so I now have a look at Apache CouchDB

(Via http://couchdb.apache.org/:) Apache CouchDB is a
  • A document database server, accessible via a RESTful JSON API
  • Ad-hoc and schema-free with a flat address space
  • Distributed, featuring robust, incremental replication with bi-directional conflict detection and management
  • Query-able and index-able, featuring a table oriented reporting engine that uses Javascript as a query language


Installation


Installing couchdb on my computer as easy as described on the wiki and after starting the default CouchDB server (host: localhost, port:5984) I got a {"couchdb":"Welcome","version":"0.10.0a769334"} when I opened my browser on http://localhost:5984/, as well as an interactive console at http://localhost:5984/_utils/.
the Hadoop console
.

Sending Request to CouchDB


Documents and Databases are created, deleted,found, etc.. using a REST protocol based on HTTP and its methods (GET/POST/PUT/DELETE/...). In order to play with this couchdb, I've created a small java program using the apache http client library (see source at the end of this post) (I know there is already a JAVA API for couchdb, but I wanted to be sure that I understand the concepts).

Creating a Database


First I want to create a database of Single Nucleotide Polymorphisms (SNPs): a PUT request is sent with the name of the database http://localhost:5984/position2snp
public void createDatabase(String database) throws IOException
{
PutMethod method= new PutMethod("http://localhost:5984/"+database);
HttpClient client=new HttpClient();
client.executeMethod(method);
method.releaseConnection();
}

And couchdb returns the following JSON object.
{"ok":true}


Listing the databases


A special identifier (_all_dbs) in the request path and a GET method are used to to get all the databases:
public void getAllDatabases() throws IOException
{
GetMethod method= new GetMethod("http://localhost:5984/_all_dbs");
HttpClient client=new HttpClient();
client.executeMethod(method);
method.releaseConnection();
}

And couchdb returns the following JSON array containing the one and only database:
["position2snp"]


Adding some SNPs in the Database


CouchDB is a key/value store of JSON-based document. Here the KEY will be a fixed-size (We'll use some '0' for padding) concatenation of the chromosome and the genomic position for each SNP e.g. "chr01_00000006".
static String formatPosition(int chromosome,int position)
{
StringWriter w= new StringWriter();
PrintWriter out= new PrintWriter(w);
out.printf("chr%02d_%08d", chromosome,position);
return w.toString();
}

The VALUE will be a structured description of the SNP. Here I've simulated a set of random SNP.
Random rand= new Random();
for(int i=0;i< 50;++i)
{
int chromosome= 1+rand.nextInt(4);
int position = 1+rand.nextInt(100);
putDocument(POS2SNP,formatPosition(chromosome, position),
"{\"rs\":\"rs"+(i+1)+"\"," +
"\"avHet\":"+(rand.nextFloat()*0.5f)+"," +
"\"snpClass\":\""+(i%2==0?"mutation":"silent")+"\"," +
"\"mapping\":{" +
"\"chromosome\":\"chr"+chromosome+"\"," +
"\"position\":"+position+"" +
"}}");
}

The new document is loaded by sending a PUT request, with the structured description of the SNP in the body of the request, to "http://localhost:5984/position2snp/THE_KEY" and .
public void putDocument(String id,String json) throws IOException
{
PutMethod method= new PutMethod("http://localhost:5984/position2snp/"+id);
method.setRequestEntity(new StringRequestEntity(json,"application/json", "UTF-8"));
HttpClient client=new HttpClient();
client.executeMethod(method);
method.releaseConnection();
}

(Note: the POST method can be used instead of PUT to generate a unique key for a document)
For example, when I post a new SNP, couchdb returns the following JSON object containing the status ("ok"), the id/key of the new document/snp and its revision-id:
{"ok":true,"id":"chr02_00000081","rev":"1-948851818"}

Retrieving only one row in the Database


A GET request is sent with the special keyword _all_docs and the parameter limit. e.g: " http://localhost:5984/position2snp/_all_docs?limit=1".
Couchdb returns the following JSON object:
{"total_rows":49,"offset":0,"rows":[
{"id":"chr01_00000006","key":"chr01_00000006","value":{"rev":"1-1656436337"}}
]}

Retrieving the document from its KEY/ID


A GET request is sent to http://localhost:5984/position2snp/THE_KEY
For example sending http://localhost:5984/position2snp/chr03_00000046 returns:
{"_id":"chr03_00000046","_rev":"1-3263055450","rs":"rs9","avHet":0.09515628,"snpClass":"mutation","mapping":{"chromosome":"chr3","position":46}}

If the KEY was not found when sending http://localhost:5984/position2snp/chr03_00000000 , Couchdb returns:
{"error":"not_found","reason":"missing"}


Finding all the SNPs in a defined Genomic Segment


As the KEYs are sorted, have the same length, and define the position of the SNP on the genome, we can search CouchDB for all the SNPs in a defined region of the genome by sending a GET HTTP request with the parameters startkey and endkey. For example, if we want all the SNPs on the chromosome 2 between the bases 30 and 60 we send http://localhost:5984/position2snp/_all_docs?startkey=%22chr02_00000030%22&endkey=%22chr02_00000060%22 and the result is:
{"total_rows":47,"offset":18,"rows":[
{"id":"chr02_00000032","key":"chr02_00000032","value":{"rev":"1-4178679245"}},
{"id":"chr02_00000040","key":"chr02_00000040","value":{"rev":"1-392133644"}},
{"id":"chr02_00000056","key":"chr02_00000056","value":{"rev":"1-3847661844"}}
]}

Views


Views are the primary tool used for querying and reporting on CouchDB documents. Views are stored inside special documents called design documents, and can be accessed via an HTTP GET request to the URI /{dbname}/{docid}/{viewname} . The view is defined by a JavaScript function that MAPs view keys to values. If a view has a REDUCE function, it is used to produce aggregate results for that view. A reduce function is passed a set of intermediate values and combines them to a single value. Reduce functions must accept, as input, results emitted by its corresponding MAP function

We upload our view called _design/genotypage:
putDocument("position2snp","_design/genotypage",
"{\"views\":{" +
"\"snpMutation\":{ \"map\":\"function(doc) {if(doc.snpClass='mutation') emit(null,doc); }\"},"+
"\"snpByClass\":{ \"map\":\"function(doc) { emit(doc.snpClass,doc.avHet); }\"},"+
"\"snpByName\":{ \"map\":\"function(doc) { emit(doc.rs,doc); }\"},"+
"\"snpByClassMaxHet\":{" +
"\"map\":\"function(doc) { emit(doc.snpClass,doc.avHet); }\"," +
"\"reduce\":\"function(keys, values) { var mean=0.0; for ( var i = 0; i < values.length; ++i ) { mean+=values[i];} return mean/(values.length);}\"" +
"}"+
"}}" +
"");

This view contains four functions: snpMutation, snpByClass, snpByName and snpByClassMaxHet

Find all SNPs having a class='mutation'


A GET request is sent to "http://localhost:5984/position2snp/_design/genotypage/_view/snpMutation" and couchdb returns
{"total_rows":47,"offset":0,"rows":[
{"id":"chr01_00000001","key":null,"value":{"_id":"chr01_00000001","_rev":"1-3508311375","rs":"rs12","avHet":0.30527565,"snpClass":"mutation","mapping":{"chromosome":"chr1","position":1}}},
{"id":"chr01_00000006","key":null,"value":{"_id":"chr01_00000006","_rev":"1-3309871741","rs":"rs31","avHet":0.30192727,"snpClass":"mutation","mapping":{"chromosome":"chr1","position":6}}},
{"id":"chr01_00000009","key":null,"value":{"_id":"chr01_00000009","_rev":"1-4077528375","rs":"rs3","avHet":0.44473252,"snpClass":"mutation","mapping":{"chromosome":"chr1","position":9}}},
{"id":"chr01_00000015","key":null,"value":{"_id":"chr01_00000015","_rev":"1-247108112","rs":"rs17","avHet":0.42058986,"snpClass":"mutation","mapping":{"chromosome":"chr1","position":15}}},
{"id":"chr01_00000016","key":null,"value":{"_id":"chr01_00000016","_rev":"1-3568315779","rs":"rs43","avHet":0.4113328,"snpClass":"mutation","mapping":{"chromosome":"chr1","position":16}}},
(...)
{"id":"chr04_00000033","key":null,"value":{"_id":"chr04_00000033","_rev":"1-3823284043","rs":"rs20","avHet":0.17454243,"snpClass":"mutation","mapping":{"chromosome":"chr4","position":33}}},
{"id":"chr04_00000035","key":null,"value":{"_id":"chr04_00000035","_rev":"1-1400920328","rs":"rs10","avHet":0.33354515,"snpClass":"mutation","mapping":{"chromosome":"chr4","position":35}}},
{"id":"chr04_00000045","key":null,"value":{"_id":"chr04_00000045","_rev":"1-3632023176","rs":"rs49","avHet":0.44040334,"snpClass":"mutation","mapping":{"chromosome":"chr4","position":45}}},
{"id":"chr04_00000058","key":null,"value":{"_id":"chr04_00000058","_rev":"1-1711768614","rs":"rs14","avHet":0.3784455,"snpClass":"mutation","mapping":{"chromosome":"chr4","position":58}}}
]}


Create a table containing the snpClass and the avHet for each SNP


A GET request is sent to "method http://localhost:5984/position2snp/_design/genotypage/_view/snpByClass" and couchdb returns:
{"total_rows":47,"offset":0,"rows":[
{"id":"chr01_00000006","key":"mutation","value":0.30192727},
{"id":"chr01_00000009","key":"mutation","value":0.44473252},
{"id":"chr01_00000015","key":"mutation","value":0.42058986},
{"id":"chr01_00000016","key":"mutation","value":0.4113328},
(...)
{"id":"chr04_00000019","key":"silent","value":0.069098294},
{"id":"chr04_00000033","key":"silent","value":0.17454243},
{"id":"chr04_00000035","key":"silent","value":0.33354515},
{"id":"chr04_00000058","key":"silent","value":0.3784455}
]}


Find the snp named "rs1"


A GET request is sent to "http://localhost:5984/position2snp/_design/genotypage/_view/snpByName?key=%22rs1%22" , and using the request parameter key="rs1". Couchdb returns:
{"total_rows":47,"offset":0,"rows":[
{"id":"chr01_00000023","key":"rs1","value":{"_id":"chr01_00000023","_rev":"1-49396577","rs":"rs1","avHet":0.035366535,"snpClass":"mutation","mapping":{"chromosome":"chr1","position":23}}}
]}


Map/Reduce: map all the SNP by they snpClass and avHet. Reduce this map to have the mean avHet for the two types of snpClass


A GET request is sent to "http://localhost:5984/position2snp/_design/genotypage/_view/snpByClassMaxHet?group=true" using the request parameter group=true to group by snpClass. Couchdb returns:
{"rows":[
{"key":"mutation","value":0.29830500208},
{"key":"silent","value":0.2255268866772728}
]}


Dropping the Database


A DELETE request with the name of the database is sent to "http://localhost:5984/position2snp".
public void deleteDatabase(String database) throws IOException
{
DeleteMethod method= new DeleteMethod("http://localhost:5984/"+database);
HttpClient client=new HttpClient();
client.executeMethod(method);
method.releaseConnection();
}

And couchdb returns the following JSON object:
{"ok":true}

Source Code


package test;
import java.io.IOException;
import java.io.InputStream;
import java.io.PrintWriter;
import java.io.StringWriter;
import java.net.URLEncoder;

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.HttpMethod;
import org.apache.commons.httpclient.NameValuePair;
import org.apache.commons.httpclient.methods.DeleteMethod;
import org.apache.commons.httpclient.methods.GetMethod;
import org.apache.commons.httpclient.methods.PostMethod;
import org.apache.commons.httpclient.methods.PutMethod;
import org.apache.commons.httpclient.methods.StringRequestEntity;
import org.lindenb.io.TInputStream;
import org.lindenb.json.Parser;
import org.lindenb.util.C;

public class CouchDBTest01
{
private static final int DEFAULT_PORT=5984;
private static final String DEFAULT_HOST="localhost";
private String host= DEFAULT_HOST;
private int port= DEFAULT_PORT;
private HttpClient client;

CouchDBTest01()
{
this.client = new HttpClient();
}

public String getHost()
{
return host;
}
public int getPort() {
return port;
}

private String getPath()
{
return "http://"+getHost()+":"+getPort();
}

private Object parseResult(String message,HttpMethod method)throws IOException
{
System.err.println("\n#################" + message+"#####################");

System.err.println("Sending :"+ method.getName()+" method "+getPath()+method.getPath()+(method.getQueryString()==null?"":"?"+method.getQueryString()));


this.client.executeMethod(method);
InputStream in=method.getResponseBodyAsStream();
in= new TInputStream(in,System.out);
Object o= new Parser().parse(in);
in.close();
method.releaseConnection();
return o;
}


public void createDatabase(String database) throws IOException
{
PutMethod method= new PutMethod(getPath()+"/"+database);
parseResult("Create database "+database,method);
}


public void deleteDatabase(String database) throws IOException
{
DeleteMethod method= new DeleteMethod(getPath()+"/"+database);
parseResult("Drop database "+database,method);
}

public void getAllDatabases() throws IOException
{
GetMethod method= new GetMethod(getPath()+"/"+"_all_dbs");
parseResult("Get All Databases",method);

}

public void getDocument(String database,String docid) throws IOException
{
GetMethod method= new GetMethod(getPath()+"/"+database+"/"+docid);
parseResult("Get document "+docid,method);
}

public void putDocument(String database,String docid,String json) throws IOException
{
PutMethod method= new PutMethod(getPath()+"/"+database+"/"+docid);
method.setRequestEntity(new StringRequestEntity(json,"application/json", "UTF-8"));
parseResult("Create document "+docid,method);
}

public void putDocument(String database,String json) throws IOException
{
PostMethod method= new PostMethod(getPath()+"/"+database+"/");
method.setRequestEntity(new StringRequestEntity(json,"application/json", "UTF-8"));
parseResult("Create document",method);
}

public void getDocuments(String database,
String startkey,
String endkey,
Integer limit,
Boolean descending
) throws IOException
{
List<NameValuePair> params= new ArrayList<NameValuePair>();
if(startkey!=null) params.add(new NameValuePair("startkey",startkey));
if(endkey!=null) params.add(new NameValuePair("endkey",endkey));
if(limit!=null) params.add(new NameValuePair("limit",limit.toString()));
if(descending!=null) params.add(new NameValuePair("descending",descending.toString()));

GetMethod method= new GetMethod(getPath()+"/"+database+"/"+"_all_docs");
method.setQueryString(params.toArray(new NameValuePair[params.size()]));
parseResult("Get Documents",method);
}

static String formatPosition(int chrom,int position)
{
StringWriter w= new StringWriter();
PrintWriter out= new PrintWriter(w);
out.printf("chr%02d_%08d", chrom,position);
return w.toString();
}

static String quote(String s)
{
return "\""+C.escape(s)+"\"";
}


void makeTest() throws IOException
{
final String POS2SNP="position2snp";
createDatabase(POS2SNP);
getAllDatabases();


Random rand= new Random();
for(int i=0;i< 50;++i)
{
int chromosome= 1+rand.nextInt(4);
int position = 1+rand.nextInt(100);
putDocument(POS2SNP,formatPosition(chromosome, position),
"{\"rs\":\"rs"+(i+1)+"\"," +
"\"avHet\":"+(rand.nextFloat()*0.5f)+"," +
"\"snpClass\":\""+(i%2==0?"mutation":"silent")+"\"," +
"\"mapping\":{" +
"\"chromosome\":\"chr"+chromosome+"\"," +
"\"position\":"+position+"" +
"}}");
getDocument(POS2SNP,formatPosition(chromosome, 0));
}



getDocuments(POS2SNP,null,null,1,null);

getDocuments(POS2SNP,quote(formatPosition(2, 30)),quote(formatPosition(2, 60)),null,null);

putDocument(POS2SNP,"_design/genotypage",
"{\"views\":{" +
"\"snpMutation\":{ \"map\":\"function(doc) {if(doc.snpClass='mutation') emit(null,doc); }\"},"+
"\"snpByClass\":{ \"map\":\"function(doc) { emit(doc.snpClass,doc.avHet); }\"},"+
"\"snpByName\":{ \"map\":\"function(doc) { emit(doc.rs,doc); }\"},"+
"\"snpByClassMaxHet\":{" +
"\"map\":\"function(doc) { emit(doc.snpClass,doc.avHet); }\"," +
"\"reduce\":\"function(keys, values) { var mean=0.0; for ( var i = 0; i < values.length; ++i ) { mean+=values[i];} return mean/(values.length);}\"" +
"}"+
"}}" +
"");

GetMethod method= new GetMethod(getPath()+"/"+POS2SNP+"/_design/genotypage/_view/snpMutation");
parseResult("Map1",method);
method= new GetMethod(getPath()+"/"+POS2SNP+"/_design/genotypage/_view/snpByClass");
parseResult("Map2",method);
method= new GetMethod(getPath()+"/"+POS2SNP+"/_design/genotypage/_view/snpByClassMaxHet");
method.setQueryString("group=true");
parseResult("Map3",method);
method= new GetMethod(getPath()+"/"+POS2SNP+"/_design/genotypage/_view/snpByName");
method.setQueryString("key="+URLEncoder.encode("\"rs1\"","UTF8"));
parseResult("Map4",method);


deleteDatabase(POS2SNP) ;
}

public static void main(String[] args) {
try {
int optind=0;
CouchDBTest01 app=new CouchDBTest01();
while(optind<args.length)
{
if(args[optind].equals("-h"))
{
System.err.println("Pierre Lindenbaum PhD.");
System.err.println("-h this screen");
System.err.println("-H (host)");
System.err.println("-p (port)");
System.err.println("-d debug");
return;
}
else if (args[optind].equals("-H"))
{
app.host=args[++optind];
}
else if (args[optind].equals("-p"))
{
app.port=Integer.parseInt(args[++optind]);
}
else if (args[optind].equals("--"))
{
++optind;
break;
}
else if (args[optind].startsWith("-"))
{
System.err.println("bad argument " + args[optind]);
System.exit(-1);
}
else
{
break;
}
++optind;
}
app.makeTest();

} catch (Exception e) {
e.printStackTrace();
}
}
}

5 comments:

dalloliogm said...

The idea is good, and thank you for the post.
However, I am not sure on which are the advantages of couchdb over a relational database, for bioinformatics...

dalloliogm said...

for example, let's say you want to collect all the snps within 40,000 bp of any gene in a list of genes.

How would you do that with couchdb? Will it be simpler with respect to a relational database?

Or another example may be representing a graph into couchdb. Do you think it would be easier to navigate it with this solutions?

Pierre Lindenbaum said...

@dalloliogm I don't think you can do things like that (join,...) with a *single* query. But I would store the start positions of the genes just like I stored the snp , I would query each gene for each snp and store the result in a temporary database before digesting the result.

There is no SQL for this kind of key/value databases because you can store anything in the entries. You have to create your own tool to query and digest your data.


Hadoop & Hierachical data: http://wiki.apache.org/couchdb/How_to_store_hierarchical_dataPierre

dalloliogm said...

ok, very interesting then.
I think I will stick with sqlalchemy and mysql for the moment, but for the next project I will look at it.
I was also looking at hdf5, a binary format which is similar to a hierarchical database.

Mic L said...

Could you please post the whole project on GitHub?