CouchDB for Bioinformatics: Storing SNPs - My Notebook
In a previous post, I've been playing with Apache Hadoop.
I've encountered some technical difficulties with Hadoop (such as the simple question: "How should can I read my data from a file stored on the Hadoop File System (HDFS) ?? ") , so I now have a look at Apache CouchDB
- A document database server, accessible via a RESTful JSON API
- Ad-hoc and schema-free with a flat address space
- Distributed, featuring robust, incremental replication with bi-directional conflict detection and management
- Query-able and index-able, featuring a table oriented reporting engine that uses Javascript as a query language
Installation
Installing couchdb on my computer as easy as described on the wiki and after starting the default CouchDB server (host: localhost, port:5984) I got a
{"couchdb":"Welcome","version":"0.10.0a769334"}
when I opened my browser on http://localhost:5984/, as well as an interactive console at http://localhost:5984/_utils/..
Sending Request to CouchDB
Documents and Databases are created, deleted,found, etc.. using a REST protocol based on HTTP and its methods (GET/POST/PUT/DELETE/...). In order to play with this couchdb, I've created a small java program using the apache http client library (see source at the end of this post) (I know there is already a JAVA API for couchdb, but I wanted to be sure that I understand the concepts).
Creating a Database
First I want to create a database of Single Nucleotide Polymorphisms (SNPs): a PUT request is sent with the name of the database http://localhost:5984/position2snp
{
PutMethod method= new PutMethod("http://localhost:5984/"+database);
HttpClient client=new HttpClient();
client.executeMethod(method);
method.releaseConnection();
}
And couchdb returns the following JSON object.
Listing the databases
A special identifier (_all_dbs) in the request path and a GET method are used to to get all the databases:
{
GetMethod method= new GetMethod("http://localhost:5984/_all_dbs");
HttpClient client=new HttpClient();
client.executeMethod(method);
method.releaseConnection();
}
And couchdb returns the following JSON array containing the one and only database:
Adding some SNPs in the Database
CouchDB is a key/value store of JSON-based document. Here the KEY will be a fixed-size (We'll use some '0' for padding) concatenation of the chromosome and the genomic position for each SNP e.g.
"chr01_00000006"
.{
StringWriter w= new StringWriter();
PrintWriter out= new PrintWriter(w);
out.printf("chr%02d_%08d", chromosome,position);
return w.toString();
}
The VALUE will be a structured description of the SNP. Here I've simulated a set of random SNP.
for(int i=0;i< 50;++i)
{
int chromosome= 1+rand.nextInt(4);
int position = 1+rand.nextInt(100);
putDocument(POS2SNP,formatPosition(chromosome, position),
"{\"rs\":\"rs"+(i+1)+"\"," +
"\"avHet\":"+(rand.nextFloat()*0.5f)+"," +
"\"snpClass\":\""+(i%2==0?"mutation":"silent")+"\"," +
"\"mapping\":{" +
"\"chromosome\":\"chr"+chromosome+"\"," +
"\"position\":"+position+"" +
"}}");
}
The new document is loaded by sending a PUT request, with the structured description of the SNP in the body of the request, to "http://localhost:5984/position2snp/THE_KEY" and .
{
PutMethod method= new PutMethod("http://localhost:5984/position2snp/"+id);
method.setRequestEntity(new StringRequestEntity(json,"application/json", "UTF-8"));
HttpClient client=new HttpClient();
client.executeMethod(method);
method.releaseConnection();
}
(Note: the POST method can be used instead of PUT to generate a unique key for a document)
For example, when I post a new SNP, couchdb returns the following JSON object containing the status ("ok"), the id/key of the new document/snp and its revision-id:
Retrieving only one row in the Database
A GET request is sent with the special keyword _all_docs and the parameter limit. e.g: "
http://localhost:5984/position2snp/_all_docs?limit=1
".Couchdb returns the following JSON object:
{"id":"chr01_00000006","key":"chr01_00000006","value":{"rev":"1-1656436337"}}
]}
Retrieving the document from its KEY/ID
A GET request is sent to http://localhost:5984/position2snp/THE_KEY
For example sending
http://localhost:5984/position2snp/chr03_00000046
returns:If the KEY was not found when sending
http://localhost:5984/position2snp/chr03_00000000
, Couchdb returns:Finding all the SNPs in a defined Genomic Segment
As the KEYs are sorted, have the same length, and define the position of the SNP on the genome, we can search CouchDB for all the SNPs in a defined region of the genome by sending a GET HTTP request with the parameters startkey and endkey. For example, if we want all the SNPs on the chromosome 2 between the bases 30 and 60 we send
http://localhost:5984/position2snp/_all_docs?startkey=%22chr02_00000030%22&endkey=%22chr02_00000060%22
and the result is:{"id":"chr02_00000032","key":"chr02_00000032","value":{"rev":"1-4178679245"}},
{"id":"chr02_00000040","key":"chr02_00000040","value":{"rev":"1-392133644"}},
{"id":"chr02_00000056","key":"chr02_00000056","value":{"rev":"1-3847661844"}}
]}
Views
Views are the primary tool used for querying and reporting on CouchDB documents. Views are stored inside special documents called design documents, and can be accessed via an HTTP GET request to the URI /{dbname}/{docid}/{viewname} . The view is defined by a JavaScript function that MAPs view keys to values. If a view has a REDUCE function, it is used to produce aggregate results for that view. A reduce function is passed a set of intermediate values and combines them to a single value. Reduce functions must accept, as input, results emitted by its corresponding MAP function
We upload our view called _design/genotypage:
"{\"views\":{" +
"\"snpMutation\":{ \"map\":\"function(doc) {if(doc.snpClass='mutation') emit(null,doc); }\"},"+
"\"snpByClass\":{ \"map\":\"function(doc) { emit(doc.snpClass,doc.avHet); }\"},"+
"\"snpByName\":{ \"map\":\"function(doc) { emit(doc.rs,doc); }\"},"+
"\"snpByClassMaxHet\":{" +
"\"map\":\"function(doc) { emit(doc.snpClass,doc.avHet); }\"," +
"\"reduce\":\"function(keys, values) { var mean=0.0; for ( var i = 0; i < values.length; ++i ) { mean+=values[i];} return mean/(values.length);}\"" +
"}"+
"}}" +
"");
This view contains four functions: snpMutation, snpByClass, snpByName and snpByClassMaxHet
Find all SNPs having a class='mutation'
A GET request is sent to "
http://localhost:5984/position2snp/_design/genotypage/_view/snpMutation
" and couchdb returns {"id":"chr01_00000001","key":null,"value":{"_id":"chr01_00000001","_rev":"1-3508311375","rs":"rs12","avHet":0.30527565,"snpClass":"mutation","mapping":{"chromosome":"chr1","position":1}}},
{"id":"chr01_00000006","key":null,"value":{"_id":"chr01_00000006","_rev":"1-3309871741","rs":"rs31","avHet":0.30192727,"snpClass":"mutation","mapping":{"chromosome":"chr1","position":6}}},
{"id":"chr01_00000009","key":null,"value":{"_id":"chr01_00000009","_rev":"1-4077528375","rs":"rs3","avHet":0.44473252,"snpClass":"mutation","mapping":{"chromosome":"chr1","position":9}}},
{"id":"chr01_00000015","key":null,"value":{"_id":"chr01_00000015","_rev":"1-247108112","rs":"rs17","avHet":0.42058986,"snpClass":"mutation","mapping":{"chromosome":"chr1","position":15}}},
{"id":"chr01_00000016","key":null,"value":{"_id":"chr01_00000016","_rev":"1-3568315779","rs":"rs43","avHet":0.4113328,"snpClass":"mutation","mapping":{"chromosome":"chr1","position":16}}},
(...)
{"id":"chr04_00000033","key":null,"value":{"_id":"chr04_00000033","_rev":"1-3823284043","rs":"rs20","avHet":0.17454243,"snpClass":"mutation","mapping":{"chromosome":"chr4","position":33}}},
{"id":"chr04_00000035","key":null,"value":{"_id":"chr04_00000035","_rev":"1-1400920328","rs":"rs10","avHet":0.33354515,"snpClass":"mutation","mapping":{"chromosome":"chr4","position":35}}},
{"id":"chr04_00000045","key":null,"value":{"_id":"chr04_00000045","_rev":"1-3632023176","rs":"rs49","avHet":0.44040334,"snpClass":"mutation","mapping":{"chromosome":"chr4","position":45}}},
{"id":"chr04_00000058","key":null,"value":{"_id":"chr04_00000058","_rev":"1-1711768614","rs":"rs14","avHet":0.3784455,"snpClass":"mutation","mapping":{"chromosome":"chr4","position":58}}}
]}
Create a table containing the snpClass and the avHet for each SNP
A GET request is sent to "
method http://localhost:5984/position2snp/_design/genotypage/_view/snpByClass
" and couchdb returns:{"id":"chr01_00000006","key":"mutation","value":0.30192727},
{"id":"chr01_00000009","key":"mutation","value":0.44473252},
{"id":"chr01_00000015","key":"mutation","value":0.42058986},
{"id":"chr01_00000016","key":"mutation","value":0.4113328},
(...)
{"id":"chr04_00000019","key":"silent","value":0.069098294},
{"id":"chr04_00000033","key":"silent","value":0.17454243},
{"id":"chr04_00000035","key":"silent","value":0.33354515},
{"id":"chr04_00000058","key":"silent","value":0.3784455}
]}
Find the snp named "rs1"
A GET request is sent to "
http://localhost:5984/position2snp/_design/genotypage/_view/snpByName?key=%22rs1%22
" , and using the request parameter key="rs1"
. Couchdb returns:{"id":"chr01_00000023","key":"rs1","value":{"_id":"chr01_00000023","_rev":"1-49396577","rs":"rs1","avHet":0.035366535,"snpClass":"mutation","mapping":{"chromosome":"chr1","position":23}}}
]}
Map/Reduce: map all the SNP by they snpClass and avHet. Reduce this map to have the mean avHet for the two types of snpClass
A GET request is sent to "
http://localhost:5984/position2snp/_design/genotypage/_view/snpByClassMaxHet?group=true
" using the request parameter group=true
to group by snpClass. Couchdb returns:{"key":"mutation","value":0.29830500208},
{"key":"silent","value":0.2255268866772728}
]}
Dropping the Database
A DELETE request with the name of the database is sent to "http://localhost:5984/position2snp".
{
DeleteMethod method= new DeleteMethod("http://localhost:5984/"+database);
HttpClient client=new HttpClient();
client.executeMethod(method);
method.releaseConnection();
}
And couchdb returns the following JSON object:
Source Code
import java.io.IOException;
import java.io.InputStream;
import java.io.PrintWriter;
import java.io.StringWriter;
import java.net.URLEncoder;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.HttpMethod;
import org.apache.commons.httpclient.NameValuePair;
import org.apache.commons.httpclient.methods.DeleteMethod;
import org.apache.commons.httpclient.methods.GetMethod;
import org.apache.commons.httpclient.methods.PostMethod;
import org.apache.commons.httpclient.methods.PutMethod;
import org.apache.commons.httpclient.methods.StringRequestEntity;
import org.lindenb.io.TInputStream;
import org.lindenb.json.Parser;
import org.lindenb.util.C;
public class CouchDBTest01
{
private static final int DEFAULT_PORT=5984;
private static final String DEFAULT_HOST="localhost";
private String host= DEFAULT_HOST;
private int port= DEFAULT_PORT;
private HttpClient client;
CouchDBTest01()
{
this.client = new HttpClient();
}
public String getHost()
{
return host;
}
public int getPort() {
return port;
}
private String getPath()
{
return "http://"+getHost()+":"+getPort();
}
private Object parseResult(String message,HttpMethod method)throws IOException
{
System.err.println("\n#################" + message+"#####################");
System.err.println("Sending :"+ method.getName()+" method "+getPath()+method.getPath()+(method.getQueryString()==null?"":"?"+method.getQueryString()));
this.client.executeMethod(method);
InputStream in=method.getResponseBodyAsStream();
in= new TInputStream(in,System.out);
Object o= new Parser().parse(in);
in.close();
method.releaseConnection();
return o;
}
public void createDatabase(String database) throws IOException
{
PutMethod method= new PutMethod(getPath()+"/"+database);
parseResult("Create database "+database,method);
}
public void deleteDatabase(String database) throws IOException
{
DeleteMethod method= new DeleteMethod(getPath()+"/"+database);
parseResult("Drop database "+database,method);
}
public void getAllDatabases() throws IOException
{
GetMethod method= new GetMethod(getPath()+"/"+"_all_dbs");
parseResult("Get All Databases",method);
}
public void getDocument(String database,String docid) throws IOException
{
GetMethod method= new GetMethod(getPath()+"/"+database+"/"+docid);
parseResult("Get document "+docid,method);
}
public void putDocument(String database,String docid,String json) throws IOException
{
PutMethod method= new PutMethod(getPath()+"/"+database+"/"+docid);
method.setRequestEntity(new StringRequestEntity(json,"application/json", "UTF-8"));
parseResult("Create document "+docid,method);
}
public void putDocument(String database,String json) throws IOException
{
PostMethod method= new PostMethod(getPath()+"/"+database+"/");
method.setRequestEntity(new StringRequestEntity(json,"application/json", "UTF-8"));
parseResult("Create document",method);
}
public void getDocuments(String database,
String startkey,
String endkey,
Integer limit,
Boolean descending
) throws IOException
{
List<NameValuePair> params= new ArrayList<NameValuePair>();
if(startkey!=null) params.add(new NameValuePair("startkey",startkey));
if(endkey!=null) params.add(new NameValuePair("endkey",endkey));
if(limit!=null) params.add(new NameValuePair("limit",limit.toString()));
if(descending!=null) params.add(new NameValuePair("descending",descending.toString()));
GetMethod method= new GetMethod(getPath()+"/"+database+"/"+"_all_docs");
method.setQueryString(params.toArray(new NameValuePair[params.size()]));
parseResult("Get Documents",method);
}
static String formatPosition(int chrom,int position)
{
StringWriter w= new StringWriter();
PrintWriter out= new PrintWriter(w);
out.printf("chr%02d_%08d", chrom,position);
return w.toString();
}
static String quote(String s)
{
return "\""+C.escape(s)+"\"";
}
void makeTest() throws IOException
{
final String POS2SNP="position2snp";
createDatabase(POS2SNP);
getAllDatabases();
Random rand= new Random();
for(int i=0;i< 50;++i)
{
int chromosome= 1+rand.nextInt(4);
int position = 1+rand.nextInt(100);
putDocument(POS2SNP,formatPosition(chromosome, position),
"{\"rs\":\"rs"+(i+1)+"\"," +
"\"avHet\":"+(rand.nextFloat()*0.5f)+"," +
"\"snpClass\":\""+(i%2==0?"mutation":"silent")+"\"," +
"\"mapping\":{" +
"\"chromosome\":\"chr"+chromosome+"\"," +
"\"position\":"+position+"" +
"}}");
getDocument(POS2SNP,formatPosition(chromosome, 0));
}
getDocuments(POS2SNP,null,null,1,null);
getDocuments(POS2SNP,quote(formatPosition(2, 30)),quote(formatPosition(2, 60)),null,null);
putDocument(POS2SNP,"_design/genotypage",
"{\"views\":{" +
"\"snpMutation\":{ \"map\":\"function(doc) {if(doc.snpClass='mutation') emit(null,doc); }\"},"+
"\"snpByClass\":{ \"map\":\"function(doc) { emit(doc.snpClass,doc.avHet); }\"},"+
"\"snpByName\":{ \"map\":\"function(doc) { emit(doc.rs,doc); }\"},"+
"\"snpByClassMaxHet\":{" +
"\"map\":\"function(doc) { emit(doc.snpClass,doc.avHet); }\"," +
"\"reduce\":\"function(keys, values) { var mean=0.0; for ( var i = 0; i < values.length; ++i ) { mean+=values[i];} return mean/(values.length);}\"" +
"}"+
"}}" +
"");
GetMethod method= new GetMethod(getPath()+"/"+POS2SNP+"/_design/genotypage/_view/snpMutation");
parseResult("Map1",method);
method= new GetMethod(getPath()+"/"+POS2SNP+"/_design/genotypage/_view/snpByClass");
parseResult("Map2",method);
method= new GetMethod(getPath()+"/"+POS2SNP+"/_design/genotypage/_view/snpByClassMaxHet");
method.setQueryString("group=true");
parseResult("Map3",method);
method= new GetMethod(getPath()+"/"+POS2SNP+"/_design/genotypage/_view/snpByName");
method.setQueryString("key="+URLEncoder.encode("\"rs1\"","UTF8"));
parseResult("Map4",method);
deleteDatabase(POS2SNP) ;
}
public static void main(String[] args) {
try {
int optind=0;
CouchDBTest01 app=new CouchDBTest01();
while(optind<args.length)
{
if(args[optind].equals("-h"))
{
System.err.println("Pierre Lindenbaum PhD.");
System.err.println("-h this screen");
System.err.println("-H (host)");
System.err.println("-p (port)");
System.err.println("-d debug");
return;
}
else if (args[optind].equals("-H"))
{
app.host=args[++optind];
}
else if (args[optind].equals("-p"))
{
app.port=Integer.parseInt(args[++optind]);
}
else if (args[optind].equals("--"))
{
++optind;
break;
}
else if (args[optind].startsWith("-"))
{
System.err.println("bad argument " + args[optind]);
System.exit(-1);
}
else
{
break;
}
++optind;
}
app.makeTest();
} catch (Exception e) {
e.printStackTrace();
}
}
}