Showing posts with label indexing. Show all posts

24 May 2013

A Tribble/FeatureCodec handling JSON-based annotation files.

I wrote a Java FeatureCodec for JSON with the Tribble library.
Citing the GATK team: "The Tribble project was started as an effort to overhaul our reference-ordered data system; we had many different formats that were shoehorned into a common framework that didn't really work as intended. What we wanted was a common framework that allowed for searching of reference ordered data, regardless of the underlying type. Jim Robinson had developed indexing schemes for text-based files, which was incorporated into the Tribble library."

The library is available at: https://github.com/lindenb/jsontribble.

The library contains the tools to sort, index and query the JSON file.
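
To give an idea of what sorting and querying by genomic interval involves, here is a minimal Python sketch. It is not the jsontribble code: the function names are hypothetical, the features are assumed to be 0-based half-open intervals as in the UCSC tables, and the linear scan stands in for the index that Tribble builds to avoid reading the whole file.

```python
def sort_features(features):
    """Order features by chromosome, then start, then end (assumed sort order)."""
    return sorted(features, key=lambda f: (f["chrom"], f["start"], f["end"]))


def query_region(sorted_features, chrom, start, end):
    """Return the features overlapping [start, end) on chrom.

    Assumes 0-based half-open coordinates. This is a plain linear scan;
    a real index would jump directly to the relevant part of the file.
    """
    hits = []
    for feature in sorted_features:
        if feature["chrom"] != chrom:
            continue
        if feature["start"] >= end:
            # Sorted by start: nothing further on this chromosome can overlap.
            break
        if feature["end"] > start:
            hits.append(feature)
    return hits
```

With features sorted this way, a region query only has to look at a contiguous run of records, which is what makes an on-disk index practical.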

As a proof of concept, I also created a REST-based service to query those files.

REST/JSON

For example http://localhost:8080/jsontribble/rest/tribble/resources/dbsnp/annotations.json?chrom=chr1&start=881826&end=981826 returns:
{"header":{"description":"UCSC  snp137: select count(*) from snp137 where FIND_IN_SET(func,\"missense\")>0 and avHet>0.1"}
,"features":[
{"chrom":"chr1","start":881826,"end":881827,"name":"rs112341375","score":0,"strand":"+","refNCBI":"G","refUCSC":"G","observed":"C/G","class":"single","valid":["by-frequency"],"avHet":0.5,"func":["missense"],"submitters":["BUSHMAN"]}
{"chrom":"chr1","start":897119,"end":897120,"name":"rs28530579","score":0,"strand":"+","refNCBI":"G","refUCSC":"G","observed":"C/G","class":"single","valid":["unknown"],"avHet":0.375,"func":["missense"],"submitters":["ABI","ENSEMBL","SSAHASNP"]}
{"chrom":"chr1","start":907739,"end":907740,"name":"rs112235940","score":0,"strand":"+","refNCBI":"G","refUCSC":"G","observed":"A/G","class":"single","valid":["unknown"],"avHet":0.5,"func":["missense"],"submitters":["COMPLETE_GENOMICS"]}
{"chrom":"chr1","start":949607,"end":949608,"name":"rs1921","score":0,"strand":"+","refNCBI":"G","refUCSC":"G","observed":"A/C/G","class":"single","valid":["by-cluster","by-frequency","by-1000genomes"],"avHet":0.464348,"func":["missense"],"submitters":["1000GENOMES","AFFY","BGI","BUSHMAN","CGAP-GAI","CLINSEQ_SNP","COMPLETE_GENOMICS","CORNELL","DEBNICK","EXOME_CHIP","GMI","HGSV","ILLUMINA","ILLUMINA-UK","KRIBB_YJKIM","LEE","MGC_GENOME_DIFF","NHLBI-ESP","SC_JCM","SC_SNP","SEATTLESEQ","SEQUENOM","UWGC","WIAF","YUSUKE"],"bitfields":["maf-5-some-pop","maf-5-all-pops"]}
]}

REST/XML

Example: http://localhost:8080/jsontribble/rest/tribble/resources/dbsnp/annotations.xml?chrom=chr1&start=897119&end=981826
<?xml version="1.0" encoding="UTF-8"?>
<annotations chrom="chr1" start="897119" end="981826">
  <header>
    <description>UCSC  snp137: select count(*) from snp137 where FIND_IN_SET(func,"missense")&gt;0 and avHet&gt;0.1</description>
  </header>
  <features>
    <feature>
      <chrom>chr1</chrom>
      <start type="integer">897119</start>
      <end type="integer">897120</end>
      <name>rs28530579</name>
      <score type="integer">0</score>
      <strand>+</strand>
      <refNCBI>G</refNCBI>
      <refUCSC>G</refUCSC>
      <observed>C/G</observed>
      <class>single</class>
      <valid>

BED/text

Example: http://localhost:8080/jsontribble/rest/tribble/resources/merge/annotations.bed?chrom=chr1&start=897119&end=981826.
chr1    895966  901099  {"chrom":"chr1","start":895966,"end":901099,"strand":"+","name":"uc001aca.2","cds...
chr1    896828  897858  {"chrom":"chr1","start":896828,"end":897858,"strand":"+","name":"uc001acb.1","cds...
chr1    897008  897858  {"chrom":"chr1","start":897008,"end":897858,"strand":"+","name":"uc010nya.1","cds...
chr1    897119  897120  {"chrom":"chr1","start":897119,"end":897120,"name":"rs28530579","score":0,"strand...
chr1    897734  899229  {"chrom":"chr1","start":897734,"end":899229,"strand":"+","name":"uc010nyb.1","cds...
chr1    901876  910484  {"chrom":"chr1","start":901876,"end":910484,"strand":"+","name":"uc001acd.3","cds...
chr1    901876  910484  {"chrom":"chr1","start":901876,"end":910484,"strand":"+","name":"uc001ace.3","cds...
chr1    901876  910484  {"chrom":"chr1","start":901876,"end":910484,"strand":"+","name":"uc001acf.3","cds...
chr1    907739  907740  {"chrom":"chr1","start":907739,"end":907740,"name":"rs112235940","score":0,"stran...
chr1    910578  917473  {"chrom":"chr1","start":910578,"end":917473,"strand":"-","name":"uc001ach.2","cds...
chr1    934341  935552  {"chrom":"chr1","start":934341,"end":935552,"strand":"-","name":"uc001aci.2","cds...
chr1    934341  935552  {"chrom":"chr1","start":934341,"end":935552,"strand":"-","name":"uc010nyc.1","cds...
chr1    948846  949919  {"chrom":"chr1","start":948846,"end":949919,"strand":"+","name":"uc001acj.4","cds...
chr1    949607  949608  {"chrom":"chr1","start":949607,"end":949608,"name":"rs1921","score":0,"strand":"+...
chr1    955502  991499  {"chrom":"chr1","start":955502,"end":991499,"strand":"+","name":"uc001ack.2","cds...
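
The BED/text output appears to pair the three BED coordinate columns with the JSON of each feature (the last column is shown truncated in the listing above). A minimal Python sketch of that layout, assuming tab-separated columns and compact JSON; the function name is hypothetical and this is not the service's own code:

```python
import json


def to_bed_line(feature):
    """Format one feature as chrom, start, end plus its JSON in a 4th column."""
    # Compact separators mimic the JSON style seen in the listing (assumption).
    json_column = json.dumps(feature, separators=(",", ":"))
    return "\t".join(
        [feature["chrom"], str(feature["start"]), str(feature["end"]), json_column]
    )
```

Keeping the whole record in the last column means the output can be handled by BED-aware tools while the full annotation remains recoverable.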

06 April 2012

Indexing the content of Gene Ontology with Apache SOLR

Via Wikipedia: "Solr (http://lucene.apache.org/solr/) is an open source enterprise search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, and rich document (e.g., Word, PDF) handling. Providing distributed search and index replication, Solr is highly scalable." In this post, I'll show how I've used SOLR to index the content of Gene Ontology.

Download and install SOLR

Download from http://mirrors.ircam.fr/pub/apache/lucene/solr/3.5.0/apache-solr-3.5.0.tgz.
tar xvfz apache-solr-3.5.0.tgz
rm apache-solr-3.5.0.tgz

Configure schema.xml

We need to tell SOLR which fields of GO will be indexed, what their types are, and how they should be tokenized and parsed. This information is defined in schema.xml. The following components will be indexed: accession, name, synonym and definition. Edit apache-solr-3.5.0/example/solr/conf/schema.xml and add the following <field> definitions:

<field name="go_name" type="text_general" indexed="true" stored="true" multiValued="false"/>
<field name="go_synonym" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="go_definition" type="text_general" indexed="true" stored="true" multiValued="false"/>

Start the SOLR server

In this example, the SOLR server is started using the simple Jetty server provided in the distribution:

$ cd apache-solr-3.5.0/example
$ java -jar start.jar

(...)

Indexing Gene Ontology

GO is downloaded as RDF/XML from http://archive.geneontology.org/latest-termdb/go_daily-termdb.rdf-xml.gz:
 
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE go:go PUBLIC "-//Gene Ontology//Custom XML/RDF Version 2.0//EN" "http://www.geneontology.org/dtd/go.dtd">

<go:go xmlns:go="http://www.geneontology.org/dtds/go.dtd#" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:RDF>
        <go:term rdf:about="http://www.geneontology.org/go#GO:0000001">
            <go:accession>GO:0000001</go:accession>
            <go:name>mitochondrion inheritance</go:name>
            <go:synonym>mitochondrial inheritance</go:synonym>
            <go:definition>The distribution of mitochondria, including the mitochondrial genome, into daughter cells after mitosis or meiosis, mediated by interactions between mitochondria and the cytoskeleton.</go:definition>
            <go:is_a rdf:resource="http://www.geneontology.org/go#GO:0048308" />
            <go:is_a rdf:resource="http://www.geneontology.org/go#GO:0048311" />
        </go:term>
        <go:term rdf:about="http://www.geneontology.org/go#GO:0000002">
            <go:accession>GO:0000002</go:accession>
            <go:name>mitochondrial genome maintenance</go:name>
            <go:definition>The maintenance of the structure and integrity of the mitochondrial genome; includes replication and segregation of the mitochondrial chromosome.</go:definition>
            <go:is_a rdf:resource="http://www.geneontology.org/go#GO:0007005" />
            <go:dbxref rdf:parseType="Resource">
                <go:database_symbol>InterPro</go:database_symbol>
(...)
 
We now need to transform this XML file into another XML file that can be indexed by the SOLR server.

"You can modify a Solr index by POSTing XML Documents containing instructions to add (or update) documents, delete documents, commit pending adds and deletes, and optimize your index."

An XSLT stylesheet (go2solr.xsl) is used to transform the RDF/XML for GO:

$ xsltproc --novalid go2solr.xsl go_daily-termdb.rdf-xml.gz > add.xml
$ cat add.xml
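
As an alternative illustration of this transformation (not the go2solr.xsl stylesheet itself), here is a minimal Python sketch that turns the go:term elements into a Solr <add> document. The mapping of go:accession to a field named "id" is an assumption based on the unique key of the Solr example schema, and the function name is hypothetical; the three go_* field names are the ones declared in schema.xml above.

```python
import xml.etree.ElementTree as ET

# Namespace declared on the <go:go> root element of the GO RDF/XML file.
GO_NS = "http://www.geneontology.org/dtds/go.dtd#"

# GO child element -> Solr field name. "id" is an assumption (see lead-in).
FIELD_MAP = [
    ("accession", "id"),
    ("name", "go_name"),
    ("synonym", "go_synonym"),
    ("definition", "go_definition"),
]


def go_to_solr_add(rdf_xml):
    """Build a Solr <add> XML string with one <doc> per go:term."""
    root = ET.fromstring(rdf_xml)
    add = ET.Element("add")
    for term in root.iter("{%s}term" % GO_NS):
        doc = ET.SubElement(add, "doc")
        for go_tag, solr_field in FIELD_MAP:
            # A term may carry several synonyms: emit one <field> per value.
            for child in term.findall("{%s}%s" % (GO_NS, go_tag)):
                if child.text:
                    field = ET.SubElement(doc, "field", name=solr_field)
                    field.text = child.text
    return ET.tostring(add, encoding="unicode")
```

The resulting string has the <add><doc><field name="...">...</field></doc></add> shape that Solr's XML update handler accepts.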

Before indexing, the disk usage under apache-solr-3.5.0 is 136 MB. We can now use the Java utility post.jar to index Gene Ontology.

$ cd ~/package/apache-solr-3.5.0/example/exampledocs
$ java -jar post.jar add.xml

SimplePostTool: version 1.4
SimplePostTool: POSTing files to http://localhost:8983/solr/update..
SimplePostTool: POSTing file jeter.xml
SimplePostTool: COMMITting Solr index changes..

After indexing, the disk usage under apache-solr-3.5.0 is 153 MB.

Querying

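Search for the GO terms having a go:definition containing "cancer" or a go:name containing "genome" (clauses separated by a space are combined with the default operator, which is OR in the example schema), but discard those having a go:definition containing "metabolism":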
 curl "http://localhost:8983/solr/select/?q=go_definition%3Acancer+go_name%3Agenome+-go_definition%3Ametabolism&version=2.2&start=0&rows=10&indent=on"
Same query, but return the result as a JSON structure:
 curl "http://localhost:8983/solr/select/?q=go_definition%3Acancer+go_name%3Agenome+-go_definition%3Ametabolism&version=2.2&start=0&rows=10&indent=on&wt=json"
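
The percent-encoded query strings above can be generated rather than typed by hand. A minimal Python sketch, assuming the same local server and parameters as the curl commands; the function name is hypothetical:

```python
from urllib.parse import urlencode


def solr_select_url(query, start=0, rows=10, wt=None,
                    base="http://localhost:8983/solr/select/"):
    """Build a Solr select URL; urlencode escapes ':' and spaces in the query."""
    params = {
        "q": query,
        "version": "2.2",
        "start": start,
        "rows": rows,
        "indent": "on",
    }
    if wt is not None:
        # e.g. wt="json" to get the response as JSON instead of XML.
        params["wt"] = wt
    return base + "?" + urlencode(params)


url = solr_select_url(
    "go_definition:cancer go_name:genome -go_definition:metabolism", wt="json"
)
```

The readable form of the query is `go_definition:cancer go_name:genome -go_definition:metabolism`; the leading minus sign excludes matching documents.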
That's it, Pierre