Showing posts with label annotation. Show all posts
Showing posts with label annotation. Show all posts

12 July 2013

Inside the Variation Toolkit: Gene Ontology for VCF, GUI for VCF

A quick note about three java-based tools for VCF files I wrote today.

VcfViewGui

VcfViewGui : a Simple java-Swing-based VCF viewer.


VCFGeneOntology

vcfgo reads a VCF annotated with VEP or SNPEFF, loads the data from GeneOntology and GOA and adds a new field in the INFO column for the GO terms for each position.
Example:
$ java -jar dist/vcfgo.jar I="https://raw.github.com/arq5x/gemini/master/test/tes.snpeff.vcf" |\
    grep -v -E '^##' | head -n 3

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  1094PC0005  1094PC0009  1094PC0012  1094PC0013
chr1    30860   .   G   C   33.46   .   AC=2;AF=0.053;AN=38;BaseQRankSum=2.327;DP=49;Dels=0.00;EFF=DOWNSTREAM(MODIFIER||||85|FAM138A|protein_coding|CODING|ENST00000417324|),DOWNSTREAM(MODIFIER|||||FAM138A|processed_transcript|CODING|ENST00000461467|),DOWNSTREAM(MODIFIER|||||MIR1302-10|miRNA|NON_CODING|ENST00000408384|),INTRON(MODIFIER|||||MIR1302-10|antisense|NON_CODING|ENST00000469289|),INTRON(MODIFIER|||||MIR1302-10|antisense|NON_CODING|ENST00000473358|),UPSTREAM(MODIFIER|||||WASH7P|unprocessed_pseudogene|NON_CODING|ENST00000423562|),UPSTREAM(MODIFIER|||||WASH7P|unprocessed_pseudogene|NON_CODING|ENST00000430492|),UPSTREAM(MODIFIER|||||WASH7P|unprocessed_pseudogene|NON_CODING|ENST00000438504|),UPSTREAM(MODIFIER|||||WASH7P|unprocessed_pseudogene|NON_CODING|ENST00000488147|),UPSTREAM(MODIFIER|||||WASH7P|unprocessed_pseudogene|NON_CODING|ENST00000538476|);FS=3.128;HRun=0;HaplotypeScore=0.6718;InbreedingCoeff=0.1005;MQ=36.55;MQ0=0;MQRankSum=0.217;QD=16.73;ReadPosRankSum=2.017 GT:AD:DP:GQ:PL  0/0:7,0:7:15.04:0,15,177    0/0:2,0:2:3.01:0,3,39   0/0:6,0:6:12.02:0,12,143    0/0:4,0:4:9.03:0,9,119
chr1    69270   .   A   G   2694.18 .   AC=40;AF=1.000;AN=40;DP=83;Dels=0.00;EFF=SYNONYMOUS_CODING(LOW|SILENT|tcA/tcG|S60|305|OR4F5|protein_coding|CODING|ENST00000335137|exon_1_69091_70008);FS=0.000;GOA=OR4F5|GO:0004984&GO:0005886&GO:0004930&GO:0016021;HRun=0;HaplotypeScore=0.0000;InbreedingCoeff=-0.0598;MQ=31.06;MQ0=0;QD=32.86   GT:AD:DP:GQ:PL  ./. ./. 1/1:0,3:3:9.03:106,9,0  1/1:0,6:6:18.05:203,18,0

VCFFilterGeneOntology

vcffiltergo reads a VCF annotated with VEP or SNPEFF, loads the data from GeneOntology and GOA and adds a filter in the FILTER column if a gene at the current genomic location is a descendant of a given GO term.
Example:
$  java -jar dist/vcffiltergo.jar I="https://raw.github.com/arq5x/gemini/master/test/test1.snpeff.vcf"  \
    CHILD_OF=GO:0005886 FILTER=MEMBRANE  |\
    grep -v "^##"   | head -n 3

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  1094PC0005  1094PC0009  1094PC0012  1094PC0013
chr1    30860   .   G   C   33.46   PASS    AC=2;AF=0.053;AN=38;BaseQRankSum=2.327;DP=49;Dels=0.00;EFF=DOWNSTREAM(MODIFIER||||85|FAM138A|protein_coding|CODING|ENST00000417324|),DOWNSTREAM(MODIFIER|||||FAM138A|processed_transcript|CODING|ENST00000461467|),DOWNSTREAM(MODIFIER|||||MIR1302-10|miRNA|NON_CODING|ENST00000408384|),INTRON(MODIFIER|||||MIR1302-10|antisense|NON_CODING|ENST00000469289|),INTRON(MODIFIER|||||MIR1302-10|antisense|NON_CODING|ENST00000473358|),UPSTREAM(MODIFIER|||||WASH7P|unprocessed_pseudogene|NON_CODING|ENST00000423562|),UPSTREAM(MODIFIER|||||WASH7P|unprocessed_pseudogene|NON_CODING|ENST00000430492|),UPSTREAM(MODIFIER|||||WASH7P|unprocessed_pseudogene|NON_CODING|ENST00000438504|),UPSTREAM(MODIFIER|||||WASH7P|unprocessed_pseudogene|NON_CODING|ENST00000488147|),UPSTREAM(MODIFIER|||||WASH7P|unprocessed_pseudogene|NON_CODING|ENST00000538476|);FS=3.128;HRun=0;HaplotypeScore=0.6718;InbreedingCoeff=0.1005;MQ=36.55;MQ0=0;MQRankSum=0.217;QD=16.73;ReadPosRankSum=2.017 GT:AD:DP:GQ:PL  0/0:7,0:7:15.04:0,15,177    0/0:2,0:2:3.01:0,3,39   0/0:6,0:6:12.02:0,12,143    0/0:4,0:4:9.03:0,9,119
chr1    69270   .   A   G   2694.18 MEMBRANE    AC=40;AF=1.000;AN=40;DP=83;Dels=0.00;EFF=SYNONYMOUS_CODING(LOW|SILENT|tcA/tcG|S60|305|OR4F5|protein_coding|CODING|ENST00000335137|exon_1_69091_70008);FS=0.000;HRun=0;HaplotypeScore=0.0000;InbreedingCoeff=-0.0598;MQ=31.06;MQ0=0;QD=32.86 GT:AD:DP:GQ:PL  ./. ./. 1/1:0,3:3:9.03:106,9,0  1/1:0,6:6:18.05:203,18,0



That's it,

Pierre

09 July 2013

Mapping the annotations of a query sequence on a BLAST hit, my notebook.

This post is the answer to my own question on biostar "BLASTN / TBLASTN : mapping the features of the query to the hit.". I wrote a java program to map the annotations of a sequence to the Hit of a Blast result. The tool is available on github at https://github.com/lindenb/jvarkit.
For example, say you want to map the features of the Uniprot record for Rotavirus NSP3 (http://www.uniprot.org/uniprot/P04514) on Genbank http://www.ncbi.nlm.nih.gov/nuccore/AY065842.1 (RNA).

Download P04514 as XML

TblastN P04514(protein) vs AY065842 (RNA) and save the BLAST result as XML:

<BlastOutput_program>tblastn</BlastOutput_program>
  <BlastOutput_version>TBLASTN 2.2.28+</BlastOutput_version>
(...)
<Hit>
  <Hit_num>1</Hit_num>
  <Hit_id>gi|18139606|gb|AY065842.1|</Hit_id>
  <Hit_def>AY065842</Hit_def>
  <Hit_accession>AY065842</Hit_accession>
  <Hit_len>1078</Hit_len>
  <Hit_hsps>
    <Hsp>
      <Hsp_num>1</Hsp_num>
      <Hsp_bit-score>546.584</Hsp_bit-score>
      <Hsp_score>1407</Hsp_score>
      <Hsp_evalue>0</Hsp_evalue>
      <Hsp_query-from>1</Hsp_query-from>
      <Hsp_query-to>313</Hsp_query-to>
      <Hsp_hit-from>26</Hsp_hit-from>
      <Hsp_hit-to>964</Hsp_hit-to>
      <Hsp_query-frame>0</Hsp_query-frame>
      <Hsp_hit-frame>2</Hsp_hit-frame>
      <Hsp_identity>260</Hsp_identity>
      <Hsp_positive>260</Hsp_positive>
      <Hsp_gaps>0</Hsp_gaps>
      <Hsp_align-len>313</Hsp_align-len>
      <Hsp_qseq>MLKMESTQQMASSIINTSFEAAVVAATSTLELMGIQYDYNEIYTRVKSKFDYVMDDSGVKNNLLGKAATIDQALNGKFGSVMRNKNWMTDSRTVAKLDEDVNKLRMMLSSKGIDQKMRVLNACFSVKRIPGKSSSVIKCTRLMKDKIERGAVEVDDSFVEEKMEVDTVDWKSRYDQLERRFESLKQRVNEKYTTWVQKAKKVNENMYSLQNVISQQQNQIADLQNYCSKLEADLQNKVGSLVSSVEWYLKSMELPDEVKTDIEQQLNSIDTISPINAIDDLEILIRNLIHDYDRTFLMFKGLLRQCNYEYAYE</Hsp_qseq>
      <Hsp_hseq>MLKMESTQQMASSIINSSFEAAVVAATSTLELMGIQYDYNEVYTRVKSKFDLVMDDSGVKNNLIGKAITIDQALNGKFSSAIRNRNWMTDSRTVAKLDEDVNKLRIMLSSKGIDQKMRVLNACFSVKRIPGKSSSIVKCTRLMKDKLERGEVEVDDSFVEEKMEVDTIDWKSRYEQLEKRFESLKHRVNEKYNHWVLKARKVNENMNSLQNVISQQQAHINELQMYNNKLERDLQSKIGSVVSSIEWYLRSMELSDDVKSDIEQQLNSIDQLNPVNAIDDFESILRNLISDYDRLFIMFKGLLQQCNYTYTYE</Hsp_hseq>
      <Hsp_midline>MLKMESTQQMASSIIN SFEAAVVAATSTLELMGIQYDYNE YTRVKSKFD VMDDSGVKNNL GKA TIDQALNGKF S  RN NWMTDSRTVAKLDEDVNKLR MLSSKGIDQKMRVLNACFSVKRIPGKSSS  KCTRLMKDK ERG VEVDDSFVEEKMEVDT DWKSRY QLE RFESLK RVNEKY  WV KA KVNENM SLQNVISQQQ  I  LQ Y  KLE DLQ K GS VSS EWYL SMEL D VK DIEQQLNSID   P NAIDD E   RNLI DYDR F MFKGLL QCNY Y YE</Hsp_midline>
    </Hsp>
   (..)
   
Now generate a BED file to map the features of P04514 on AY065842:
$ java -jar dist/blastmapannots.jar I=P04514.xml B=blast.xml

AY065842 25 961 Non-structural_protein_3 943 + 25961 255,255,255 1 936 25
AY065842 34 469 RNA-binding 970 + 34 469 255,255,255 1 435 34
AY065842 472 640 Dimerization 947 + 472 640 255,255,255 1 168 472
AY065842 532 724 Interaction_with_ZC3H7B 917 + 532 724 255,255,255 1 192 532
AY065842 646 961 Interaction_with_EIF4G1 905 + 646 961 255,255,255 1 315 646
AY065842 520 733 coiled-coil_region 916 + 520 733 255,255,255 1 213 520

That's it,

Pierre

29 May 2013

Binding a C library with Javascript/ #mozilla. An example with the Tabix library

In this post I'll show how to bind a C API to javascript using the mozilla/xul-runner API and the tabix library.

About xpcshell

XULRunner is a Mozilla runtime package. The SDK package contains xpcshell, a JavaScript Shell application that lets you run JavaScript code. "Unlike the ordinary JS shell (js), xpcshell lets the scripts running in it access the mozila technologies (XPCOM)." I've tested the current code with
$ xulrunner -v
Mozilla XULRunner 22.0 - 20130521223249
XULRunner is not installed by default on ubuntu on needs to be downloaded.

The js.type library

The js-ctypes is a foreign-function library for Mozilla's privileged JavaScript. It provides C-compatible data types and allows JS code to call functions in shared libraries (dll, so, dylib) and implement callback functions.

Tabix

Heng Li's Tabix is "a generic tool that indexes position sorted files in TAB-delimited formats such as GFF, BED, PSL, SAM and SQL export, and quickly retrieves features overlapping specified regions.". The code is available in github at https://github.com/samtools/tabix.

Binding the Tabix library to javascript

First of all, the dynamic library for tabix must be compiled:
$ cd /path/to/tabix.dir
$ make libtabix.so.1
A javascript file tabix.js is created. At the top, we tell the javascrpipt engine we want to use the js.type library:
Components.utils.import("resource://gre/modules/ctypes.jsm")
The dynamic library for tabix is loaded:
var lib = ctypes.open("libtabix.so.1");
We bind each required methods of the tabix library to javascript. As an example we're going to bind ti_open. The C declaration for this method is:
tabix_t *ti_open(const char *fn, const char *fnidx);
Using js.type, the call to that method is wrapped to javascript using declare/:
var DLOpen= lib.declare("ti_open",/* method name */
 ctypes.default_abi,/* Application binary interface type */
 ctypes.voidptr_t, /* return type is a pointer 'void*' */
 ctypes.char.ptr,  /* first argument is 'char*' */
 ctypes.int32_t /* second argument is 'int' */
 );
In javascript, the library is used by invoking DLOpen :
function TabixFile(filename)
 {
 this.ptr= DLOpen(filename,0);
 if(this.ptr.isNull()) throw "I/O ERROR: Cannot open \""+filename+"\"";
 };
var tabix=new TabixFile("annotatons.bed.gz");

The tabix.js library

All in one, I wrote the following file.

Testing


load("tabix.js");
var tabix=new TabixFile("/path/to/tabix-0.2.5/example.gtf.gz");
var iter=tabix.query("chr2:32800-35441");
while((line=iter.next())!=null)
 {
 print(line);
 }
tabix.close();

Set the dynamic library path (LD_LIBRARY_PATH) and invoke this script with xpcshell:
LD_LIBRARY_PATH=/path/to/xulrunner-sdk/bin:/path/to/tabix-0.2.5 /path/to/xulrunner-sdk/bin/xpcshell -f test.js
Output:
chr2 HAVANA transcript 28814 36385 . - . gene_id "ENSG00000184731"; transcript_id "ENST00000327669"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "FAM110C"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "FAM110C-001"; level 2; tag "CCDS"; ccdsid "CCDS42645"; havana_gene "OTTHUMG00000151321"; havana_transcript "OTTHUMT00000322220";
chr2 HAVANA gene 28814 36870 . - . gene_id "ENSG00000184731"; transcript_id "ENSG00000184731"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "FAM110C"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "FAM110C"; level 2; havana_gene "OTTHUMG00000151321";
chr2 HAVANA transcript 31220 32952 . - . gene_id "ENSG00000184731"; transcript_id "ENST00000460464"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "FAM110C"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "FAM110C-003"; level 2; havana_gene "OTTHUMG00000151321"; havana_transcript "OTTHUMT00000322222";
chr2 HAVANA transcript 31221 36870 . - . gene_id "ENSG00000184731"; transcript_id "ENST00000461026"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "FAM110C"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "FAM110C-002"; level 2; havana_gene "OTTHUMG00000151321"; havana_transcript "OTTHUMT00000322221";
chr2 HAVANA exon 32809 32952 . - . gene_id "ENSG00000184731"; transcript_id "ENST00000460464"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "FAM110C"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "FAM110C-003"; level 2; havana_gene "OTTHUMG00000151321"; havana_transcript "OTTHUMT00000322222";
chr2 HAVANA CDS 35440 36385 . - 0 gene_id "ENSG00000184731"; transcript_id "ENST00000327669"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "FAM110C"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "FAM110C-001"; level 2; tag "CCDS"; ccdsid "CCDS42645"; havana_gene "OTTHUMG00000151321"; havana_transcript "OTTHUMT00000322220";
chr2 HAVANA exon 35440 36385 . - . gene_id "ENSG00000184731"; transcript_id "ENST00000327669"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "FAM110C"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "FAM110C-001"; level 2; tag "CCDS"; ccdsid "CCDS42645"; havana_gene "OTTHUMG00000151321"; havana_transcript "OTTHUMT00000322220";

That's it,

Pierre

24 May 2013

A Tribble/FeatureCodec handling JSON-based annotations files.

I wrote a java FeatureCodec for JSON with a the tribble library.
Citing the GATK tream: "The Tribble project was started as an effort to overhaul our reference-ordered data system; we had many different formats that were shoehorned into a common framework that didn't really work as intended. What we wanted was a common framework that allowed for searching of reference ordered data, regardless of the underlying type. Jim Robinson had developed indexing schemes for text-based files, which was incorporated into the Tribble library."".

The library is available at:https://github.com/lindenb/jsontribble.


The library contains the tools to sort, index and query the json file.

As a proof of concept, I also created a REST-based service to query those files.

REST/JSON

For example http://localhost:8080/jsontribble/rest/tribble/resources/dbsnp/annotations.json?chrom=chr1&start=881826&end=981826 returns:
{"header":{"description":"UCSC  snp137: select count(*) from snp137 where FIND_IN_SET(func,\"missense\")>0 and avHet>0.1"}
,"features":[
{"chrom":"chr1","start":881826,"end":881827,"name":"rs112341375","score":0,"strand":"+","refNCBI":"G","refUCSC":"G","observed":"C/G","class":"single","valid":["by-frequency"],"avHet":0.5,"func":["missense"],"submitters":["BUSHMAN"]}
{"chrom":"chr1","start":897119,"end":897120,"name":"rs28530579","score":0,"strand":"+","refNCBI":"G","refUCSC":"G","observed":"C/G","class":"single","valid":["unknown"],"avHet":0.375,"func":["missense"],"submitters":["ABI","ENSEMBL","SSAHASNP"]}
{"chrom":"chr1","start":907739,"end":907740,"name":"rs112235940","score":0,"strand":"+","refNCBI":"G","refUCSC":"G","observed":"A/G","class":"single","valid":["unknown"],"avHet":0.5,"func":["missense"],"submitters":["COMPLETE_GENOMICS"]}
{"chrom":"chr1","start":949607,"end":949608,"name":"rs1921","score":0,"strand":"+","refNCBI":"G","refUCSC":"G","observed":"A/C/G","class":"single","valid":["by-cluster","by-frequency","by-1000genomes"],"avHet":0.464348,"func":["missense"],"submitters":["1000GENOMES","AFFY","BGI","BUSHMAN","CGAP-GAI","CLINSEQ_SNP","COMPLETE_GENOMICS","CORNELL","DEBNICK","EXOME_CHIP","GMI","HGSV","ILLUMINA","ILLUMINA-UK","KRIBB_YJKIM","LEE","MGC_GENOME_DIFF","NHLBI-ESP","SC_JCM","SC_SNP","SEATTLESEQ","SEQUENOM","UWGC","WIAF","YUSUKE"],"bitfields":["maf-5-some-pop","maf-5-all-pops"]}
]}

REST/XML

Example http://localhost:8080/jsontribble/rest/tribble/resources/dbsnp/annotations.xml?chrom=chr1&start=897119&end=981826
<?xml version="1.0" encoding="UTF-8"?>
<annotations chrom="chr1" start="897119" end="981826">
  <header>
    <description>UCSC  snp137: select count(*) from snp137 where FIND_IN_SET(func,"missense")&gt;0 and avHet&gt;0.1</description>
  </header>
  <features>
    <feature>
      <chrom>chr1</chrom>
      <start type="integer">897119</start>
      <end type="integer">897120</end>
      <name>rs28530579</name>
      <score type="integer">0</score>
      <strand>+</strand>
      <refNCBI>G</refNCBI>
      <refUCSC>G</refUCSC>
      <observed>C/G</observed>
      <class>single</class>
      <valid>

BED/text

Example: http://localhost:8080/jsontribble/rest/tribble/resources/merge/annotations.bed?chrom=chr1&start=897119&end=981826.
chr1    895966  901099  {"chrom":"chr1","start":895966,"end":901099,"strand":"+","name":"uc001aca.2","cds...
chr1    896828  897858  {"chrom":"chr1","start":896828,"end":897858,"strand":"+","name":"uc001acb.1","cds...
chr1    897008  897858  {"chrom":"chr1","start":897008,"end":897858,"strand":"+","name":"uc010nya.1","cds...
chr1    897119  897120  {"chrom":"chr1","start":897119,"end":897120,"name":"rs28530579","score":0,"strand...
chr1    897734  899229  {"chrom":"chr1","start":897734,"end":899229,"strand":"+","name":"uc010nyb.1","cds...
chr1    901876  910484  {"chrom":"chr1","start":901876,"end":910484,"strand":"+","name":"uc001acd.3","cds...
chr1    901876  910484  {"chrom":"chr1","start":901876,"end":910484,"strand":"+","name":"uc001ace.3","cds...
chr1    901876  910484  {"chrom":"chr1","start":901876,"end":910484,"strand":"+","name":"uc001acf.3","cds...
chr1    907739  907740  {"chrom":"chr1","start":907739,"end":907740,"name":"rs112235940","score":0,"stran...
chr1    910578  917473  {"chrom":"chr1","start":910578,"end":917473,"strand":"-","name":"uc001ach.2","cds...
chr1    934341  935552  {"chrom":"chr1","start":934341,"end":935552,"strand":"-","name":"uc001aci.2","cds...
chr1    934341  935552  {"chrom":"chr1","start":934341,"end":935552,"strand":"-","name":"uc010nyc.1","cds...
chr1    948846  949919  {"chrom":"chr1","start":948846,"end":949919,"strand":"+","name":"uc001acj.4","cds...
chr1    949607  949608  {"chrom":"chr1","start":949607,"end":949608,"name":"rs1921","score":0,"strand":"+...
chr1    955502  991499  {"chrom":"chr1","start":955502,"end":991499,"strand":"+","name":"uc001ack.2","cds...

27 February 2013

4 Tools I wrote today to annotate VCF files.

Here are four java-bases tools I wrote today for @SolenaLS to annotate VCF files.

VCFTabix

VCFTabix : we want to use the VCF of the 1Kgenomes project (ALL.wgs.phase1_release_v3.20101123.snps_indels_sv.sites.vcf.gz) as a source of annotation for our VCF.

Options

-f (vcf indexed with tabix) REQUIRED.
 -T (tag String) VCF-INFO-ID optional can be used several times.
 -R doesn't use REF allele
 -A doesn't use ALT allele
 -I don't replace ID if it exists.
 -F don't replace INFO field if it exists.
 -C (TAG) use this tag in case of conflict with the ALT allele.

Example:

java -jar dist/vcftabix.jar -f ALL.wgs.phase1_release_v3.20101123.snps_indels_sv.sites.vcf.gz \
   -T AC -T SNPSOURCE  input.vcf 

##fileformat=VCFv4.1
(....)
##Annotated with fr.inserm.umr1087.jvarkit.tools.vcftabix.VCFTabix:ALL.wgs.phase1_release_v3.20101123.snps_indels_sv.sites.vcf.gz
##INFO=<ID=AC,Number=.,Type=Integer,Description="Alternate Allele Count">
##INFO=<ID=SNPSOURCE,Number=.,Type=String,Description="indicates if a snp was called when analysing the low coverage or exome alignment data">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  CD10977
5       2999819 rs21234 A       G       1620    .       AC=2156;AF=1.00;AN=2;DB;DP=41;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=59.44;MQ0=0;QD=39.51;SB=-6.519e-03;SNPSOURCE=LOWCOV;       GT:AD:DP:GQ:PL  1/1:0,41:41:99:1653,123,0
(...)


EVS2BED

EVS2BED: bulk download of the XML data of EVS into a BED file CHROM/START/END/XML (see my related post: http://plindenbaum.blogspot.fr/2012/05/downloading-xml-data-from-exome-variant.html )

Example

Download 10 records and index the data with tabix:
java -jar dist/evs2bed.jar -L 10 | LC_ALL=C sort -k1,1 -k2,2n -k3,3n -k4,4 -u |\
  /path/totabix-0.2.6/bgzip -c > test.bed.gz
/path/totabix-0.2.6/tabix -p bed -f test.bed.gz


$ gunzip -c  test.bed.gz  | cut -c 1-500 | fold -w 80
1 69427 69428 <snpList><positionString>1:69428</positionString><chrPos
ition>69428</chrPosition><alleles>G/T</alleles><uaAlleleCounts>G=313/T=6535</uaA
lleleCounts><aaAlleleCounts>G=14/T=3808</aaAlleleCounts><totalAlleleCounts>G=327
/T=10343</totalAlleleCounts><uaMAF>4.5707</uaMAF><aaMAF>0.3663</aaMAF><totalMAF>
3.0647</totalMAF><avgSampleReadDepth>110</avgSampleReadDepth><geneList>OR4F5</ge
neList><snpFunction><chromosome>1</chromosome><position>69428</position><conserv
ationScore>1.0</conservationSc
1 69475 69476 <snpList><positionString>1:69476</positionString><chrPos
ition>69476</chrPosition><alleles>C/T</alleles><uaAlleleCounts>C=2/T=7020</uaAll
eleCounts><aaAlleleCounts>C=0/T=3908</aaAlleleCounts><totalAlleleCounts>C=2/T=10
928</totalAlleleCounts><uaMAF>0.0285</uaMAF><aaMAF>0.0</aaMAF><totalMAF>0.0183</
totalMAF><avgSampleReadDepth>123</avgSampleReadDepth><geneList>OR4F5</geneList><
snpFunction><chromosome>1</chromosome><position>69476</position><conservationSco
re>0.6</conservationScore><con
1 69495 69496 <snpList><positionString>1:69496</positionString><chrPos
ition>69496</chrPosition><alleles>A/G</alleles><uaAlleleCounts>A=2/G=6764</uaAll
eleCounts><aaAlleleCounts>A=23/G=3785</aaAlleleCounts><totalAlleleCounts>A=25/G=
10549</totalAlleleCounts><uaMAF>0.0296</uaMAF><aaMAF>0.604</aaMAF><totalMAF>0.23
64</totalMAF><avgSampleReadDepth>91</avgSampleReadDepth><geneList>OR4F5</geneLis
t><snpFunction><chromosome>1</chromosome><position>69496</position><conservation
Score>0.5</conservationScore><
1 69510 69511 <snpList><positionString>1:69511</positionString><chrPos
ition>69511</chrPosition><alleles>G/A</alleles><uaAlleleCounts>G=5337/A=677</uaA
lleleCounts><aaAlleleCounts>G=1937/A=1623</aaAlleleCounts><totalAlleleCounts>G=7
274/A=2300</totalAlleleCounts><uaMAF>11.2571</uaMAF><aaMAF>45.5899</aaMAF><total
MAF>24.0234</totalMAF><avgSampleReadDepth>69</avgSampleReadDepth><geneList>OR4F5
</geneList><snpFunction><chromosome>1</chromosome><position>69511</position><con
servationScore>1.0</conservati
1 69589 69590 <snpList><positionString>1:69590</positionString><chrPos
ition>69590</chrPosition><alleles>A/T</alleles><uaAlleleCounts>A=0/T=6214</uaAll
eleCounts><aaAlleleCounts>A=1/T=3555</aaAlleleCounts><totalAlleleCounts>A=1/T=97
69</totalAlleleCounts><uaMAF>0.0</uaMAF><aaMAF>0.0281</aaMAF><totalMAF>0.0102</t
otalMAF><avgSampleReadDepth>119</avgSampleReadDepth><geneList>OR4F5</geneList><s
npFunction><chromosome>1</chromosome><position>69590</position><conservationScor
e>0.8</conservationScore><cons
1 69593 69594 <snpList><positionString>1:69594</positionString><chrPos
ition>69594</chrPosition><alleles>C/T</alleles><uaAlleleCounts>C=2/T=6190</uaAll
eleCounts><aaAlleleCounts>C=0/T=3548</aaAlleleCounts><totalAlleleCounts>C=2/T=97
38</totalAlleleCounts><uaMAF>0.0323</uaMAF><aaMAF>0.0</aaMAF><totalMAF>0.0205</t
otalMAF><avgSampleReadDepth>109</avgSampleReadDepth><geneList>OR4F5</geneList><s
npFunction><chromosome>1</chromosome><position>69594</position><conservationScor
e>0.8</conservationScore><cons
1 69619 69620 <snpList><positionString>1:69620</positionString><chrPos
ition>69620</chrPosition><alleles>T/TA</alleles><uaAlleleCounts>A1=2/R=4694</uaA
lleleCounts><aaAlleleCounts>A1=0/R=2954</aaAlleleCounts><totalAlleleCounts>A1=2/
R=7648</totalAlleleCounts><uaMAF>0.0426</uaMAF><aaMAF>0.0</aaMAF><totalMAF>0.026
1</totalMAF><avgSampleReadDepth>56</avgSampleReadDepth><geneList>OR4F5</geneList
><snpFunction><chromosome>1</chromosome><position>69620</position><conservationS
core>0.7</conservationScore><c
1 69744 69745 <snpList><positionString>1:69745</positionString><chrPos
ition>69745</chrPosition><alleles>CA/C</alleles><uaAlleleCounts>A1=4/R=4254</uaA
lleleCounts><aaAlleleCounts>A1=0/R=2716</aaAlleleCounts><totalAlleleCounts>A1=4/
R=6970</totalAlleleCounts><uaMAF>0.0939</uaMAF><aaMAF>0.0</aaMAF><totalMAF>0.057
4</totalMAF><avgSampleReadDepth>10</avgSampleReadDepth><geneList>OR4F5</geneList
><snpFunction><chromosome>1</chromosome><position>69745</position><conservationS
core>0.9</conservationScore><c
1 69760 69761 <snpList><positionString>1:69761</positionString><chrPos
ition>69761</chrPosition><alleles>T/A</alleles><uaAlleleCounts>T=645/A=4093</uaA
lleleCounts><aaAlleleCounts>T=62/A=2840</aaAlleleCounts><totalAlleleCounts>T=707
/A=6933</totalAlleleCounts><uaMAF>13.6133</uaMAF><aaMAF>2.1365</aaMAF><totalMAF>
9.2539</totalMAF><avgSampleReadDepth>8</avgSampleReadDepth><geneList>OR4F5</gene
List><snpFunction><chromosome>1</chromosome><position>69761</position><conservat
ionScore>0.1</conservationScor

VCFTabixml

VCFTabixml retrieves the annotations from a BED (CHROM/START/END/XML) and, using XLST, annotates a VCF.

Example

In the following example, a few variations are downloaded from EVS using EVS2BED. Then a small VCF is created and annotated using the bed and the following XSLT stylesheet: evs2vcf.xsl.
java -jar dist/evs2bed.jar -L 10  |\
       LC_ALL=C sort -k1,1 -k2,2n -k3,3n -k4,4 -u |\
       tabix-0.2.6/bgzip -c > test.bed.gz

/tabix-0.2.6/tabix -p bed -f test.bed.gz
java -jar  dist/vcftabixml.jar \
   -f test.bed.gz -x  dist/evs2vcf.xsl test.vcf



echo "#CHROM POS ID REF ALT QUAL FILTER INFO" > test.vcf
echo "1 1 . C T . . T=test1" >> test.vcf
echo "1 69476 . T C . . T=test2" >> test.vcf
echo "1 69511 . A C . . T=test3" >> test.vcf 
java -jar  dist/vcftabixml.jar \
    -f test.bed.gz -x  dist/evs2vcf.xsl test.vcf

Result:
##Annotated with fr.inserm.umr1087.jvarkit.tools.vcftabixml.VCFTabixml:test.bed.gz
#CHROM POS ID REF ALT QUAL FILTER INFO
1 1 . C T . . T=test1;
1 69476 . T C . . T=test2;EVS_UAMAF=0.0285;EVS_AAMAF=0.0;EVS_TOTALMAF=0.0183;EVS_AVGSAMPLEREADDEPTH=123;EVS_GENELIST=OR4F5;EVS_CONSERVATIONSCORE=0.6;EVS_CONSERVATIONSCOREGERP=2.3;EVS_RSIDS=rs148502021;EVS_CLINICALLINK=unknown;EVS_ONEXOMECHIP=false;EVS_GWASPUBMEDIDS=unknown;
1 69511 . A C . . T=test3;EVS_UAMAF=11.2571;EVS_AAMAF=45.5899;EVS_TOTALMAF=24.0234;EVS_AVGSAMPLEREADDEPTH=69;EVS_GENELIST=OR4F5;EVS_CONSERVATIONSCORE=1.0;EVS_CONSERVATIONSCOREGERP=1.1;EVS_RSIDS=rs75062661;EVS_CLINICALLINK=unknown;EVS_ONEXOMECHIP=false;EVS_GWASPUBMEDIDS=unknown;EVS_CONFLICTALT=G;

VCFBigWig

VCFBigWig use a bigwig file to annotate a VCF file.

Example:

add the GERP score to a VCF:
$ java -jar dist/vcfbigwig.jar -f /commun/data/pubdb/ucsc/hg19/bbi/All_hg19_RS_noprefix.bw  test.vep.vcf
##fileformat=VCFv4.1
##Annotated with class fr.inserm.umr1087.jvarkit.tools.vcfbigwig.VCFBigWig:/commun/data/pubdb/ucsc/hg19/bbi/All_hg19_RS_noprefix.bw
##INFO=<ID=All_hg19_RS_noprefix,Number=1,Type=Float,Description="Annotations from /commun/data/pubdb/ucsc/hg19/bbi/All_hg19_RS_noprefix.bw">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
22      12345   .       G       T       100.0   .       All_hg19_RS_noprefix=4.829999923706055

27 September 2012

Playing with the #Ensembl REST API: filtering a VCF with javascript

The new Ensembl REST API was announced today: "We are pleased to announce the beta release of our programming language agnostic REST API, for Release 68 data, at http://beta.rest.ensembl. Our initial release provides access to:

  • Sequences (genomic, cDNA, CDS and protein)
  • VEP (Variant Effect Predictor)
  • Homologies
  • Gene Trees
  • Assembly and coordinate mapping."

In the current post I will filter a VCF file with this API and javascript.

The VCF

Our initial file is the following VCF:
##fileformat=VCFv4.0
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
#CHROM POS ID REF ALT QUAL FILTER INFO
1 10327 rs112750067 T C . PASS DP=65
1 69427 rs148502021 T A . PASS DP=1557
1 69452 rs142004627 A G . PASS DP=155
1 69475 rs148502021 C T . PASS DP=231
1 865583 rs148711625 A G . PASS DP=231
1 866460 rs148884928 A G . PASS DP=23
1 866461 . G A . PASS DP=24

The javascript

The VCF will be read on the standard input using the following script and the Rhino JS engine. The script reads the VCF and for each substitution, it calls the Ensembl REST API, parses the JSON response and return the transcript identifier if the mutation is a missense_variant or a polyphen_prediction="probably damaging":
importPackage(java.io);
importPackage(java.lang);



function sleep(milliseconds)
  {
  /* hacked from http://www.phpied.com/sleep-in-javascript/ */
  var start = new Date().getTime();
  for (var i = 0; i < 1e7; i++) {
    if ((new Date().getTime() - start) > milliseconds){
      break;
      }
    }
  }


function damagingTranscript(json)
 {
 for(var d in json.data)
  {
  var transcripts=json.data[d].transcripts;
 
  if(!transcripts) return null;
  for(var t in transcripts)
   {
   var transcript=transcripts[t];
   for(var a in transcript.alleles)
    {
    var allele=transcript.alleles[a];
    if(allele.polyphen_prediction=="probably damaging" ||
       allele.consequence_terms.indexOf("missense_variant")!=-1 )
        {
        return transcript.transcript_id;
        }
    }
   }
  }
 return null;
 }

var baseRegex=new RegExp(/^[ATGCatgc]$/);



var stdin = new java.io.BufferedReader( new java.io.InputStreamReader(java.lang.System['in']) );
var line;
while((line=stdin.readLine())!=null)
 {
 if(line.startsWith("#"))
  {
  print(line); continue;
  }
 var tokens=line.split("\t");
 var chrom=tokens[0];
 var pos= parseInt(tokens[1]);
 var ref= tokens[3];
 var alt= tokens[4];
 if(!baseRegex.test(ref)) continue;
 if(!baseRegex.test(alt)) continue;
 
 sleep(200);
 var url="http://beta.rest.ensembl.org/vep/human/"+
  chrom+":"+pos+"-"+pos+"/"+alt+
  "/consequences?content-type=application/json";
 var jsonStr=readUrl(url,"UTF-8");
 var json=eval("("+jsonStr+")");
 var transcript=damagingTranscript(json);
 if(transcript==null) continue;
 tokens[7]+=(";TRANSCRIPT="+transcript);
 print(tokens.join('\t'));
 }

As an example, here is the response from the ENSEMBL server for 1:866460 A/G:

http://beta.rest.ensembl.org/vep/human/1:866460-866460/G/consequences?content-type=application/json
{
    "data": [
        {
            "location": {
                "coord_system": "chromosome",
                "name": "1",
                "strand": 1,
                "end": "866460",
                "start": "866460"
            },
            "hgvs": {
                "G": "1:g.866460C>G"
            },
            "transcripts": [
                {
                    "translation_stable_id": "ENSP00000411579",
                    "intron_number": null,
                    "cdna_end": 385,
                    "translation_end": 99,
                    "exon_number": "4/7",
                    "is_canonical": 0,
                    "transcript_id": "ENST00000420190",
                    "cdna_start": 385,
                    "gene_id": "ENSG00000187634",
                    "cds_start": 296,
                    "translation_start": 99,
                    "alleles": {
                        "C/G": {
                            "sift_prediction": "deleterious",
                            "polyphen_prediction": "benign",
                            "polyphen_score": 0.001,
                            "display_codon_allele_string": "gCg/gGg",
                            "hgvs_protein": "ENSP00000411579.1:p.Ala99Gly",
                            "sift_score": 0.04,
                            "consequence_terms": [
                                "missense_variant"
                            ],
                            "pep_allele_string": "A/G",
                            "codon_allele_string": "GCG/GGG",
                            "hgvs_transcript": "ENST00000420190.1:c.296C>G"
                        }
                    },
                    "name": "SAMD11",
                    "biotype": "protein_coding",
                    "cds_end": 296,
                    "cdna_allele_string": "C/G",
                    "codon_position": 2
                },
                {
                    "translation_stable_id": "ENSP00000393181",
                    "intron_number": null,
                    "cdna_end": 356,
                    "translation_end": 99,
                    "exon_number": "4/5",
                    "is_canonical": 0,
                    "transcript_id": "ENST00000437963",
                    "cdna_start": 356,
                    "gene_id": "ENSG00000187634",
                    "cds_start": 296,
                    "translation_start": 99,
                    "alleles": {
                        "C/G": {
                            "sift_prediction": "deleterious",
                            "polyphen_prediction": "benign",
                            "polyphen_score": 0.001,
                            "display_codon_allele_string": "gCg/gGg",
                            "hgvs_protein": "ENSP00000393181.1:p.Ala99Gly",
                            "sift_score": 0.03,
                            "consequence_terms": [
                                "missense_variant"
                            ],
                            "pep_allele_string": "A/G",
                            "codon_allele_string": "GCG/GGG",
                            "hgvs_transcript": "ENST00000437963.1:c.296C>G"
                        }
                    },
                    "name": "SAMD11",
                    "biotype": "protein_coding",
                    "cds_end": 296,
                    "cdna_allele_string": "C/G",
                    "codon_position": 2
                },
                {
                    "translation_stable_id": "ENSP00000342313",
                    "intron_number": null,
                    "cdna_end": 379,
                    "translation_end": 99,
                    "exon_number": "4/14",
                    "is_canonical": 1,
                    "transcript_id": "ENST00000342066",
                    "cdna_start": 379,
                    "gene_id": "ENSG00000187634",
                    "cds_start": 296,
                    "alleles": {
                        "C/G": {
                            "sift_prediction": "deleterious",
                            "polyphen_prediction": "benign",
                            "polyphen_score": 0.001,
                            "display_codon_allele_string": "gCg/gGg",
                            "hgvs_protein": "ENSP00000342313.3:p.Ala99Gly",
                            "sift_score": 0.01,
                            "consequence_terms": [
                                "missense_variant"
                            ],
                            "pep_allele_string": "A/G",
                            "codon_allele_string": "GCG/GGG",
                            "hgvs_transcript": "ENST00000342066.3:c.296C>G"
                        }
                    },
                    "translation_start": 99,
                    "name": "SAMD11",
                    "biotype": "protein_coding",
                    "cds_end": 296,
                    "cdna_allele_string": "C/G",
                    "ccds": "CCDS2.2",
                    "codon_position": 2
                },
                {
                    "translation_stable_id": "ENSP00000349216",
                    "intron_number": null,
                    "cdna_end": 67,
                    "translation_end": 23,
                    "exon_number": "2/12",
                    "is_canonical": 0,
                    "transcript_id": "ENST00000341065",
                    "cdna_start": 67,
                    "gene_id": "ENSG00000187634",
                    "cds_start": 68,
                    "translation_start": 23,
                    "alleles": {
                        "C/G": {
                            "sift_prediction": "deleterious",
                            "polyphen_prediction": "benign",
                            "polyphen_score": 0.008,
                            "display_codon_allele_string": "gCg/gGg",
                            "hgvs_protein": "ENSP00000349216.4:p.Ala23Gly",
                            "sift_score": 0.01,
                            "consequence_terms": [
                                "missense_variant"
                            ],
                            "pep_allele_string": "A/G",
                            "codon_allele_string": "GCG/GGG",
                            "hgvs_transcript": "ENST00000341065.4:c.67C>G"
                        }
                    },
                    "name": "SAMD11",
                    "biotype": "protein_coding",
                    "cds_end": 68,
                    "cdna_allele_string": "C/G",
                    "codon_position": 2
                }
            ]
        }
    ]
}

Invoking RHINO, filtering the VCF

$ cat input.vcf | rhino -f restensembl.js
  
##fileformat=VCFv4.0
##INFO=
#CHROM POS ID REF ALT QUAL FILTER INFO
1 69427 rs148502021 T A . PASS DP=1557;TRANSCRIPT=ENST00000335137
1 69452 rs142004627 A G . PASS DP=155;TRANSCRIPT=ENST00000335137
1 69475 rs148502021 C T . PASS DP=231;TRANSCRIPT=ENST00000335137
1 865583 rs148711625 A G . PASS DP=231;TRANSCRIPT=ENST00000420190
1 866460 rs148884928 A G . PASS DP=23;TRANSCRIPT=ENST00000420190

Limitations for the Ensembl REST API:




That's it,

Pierre

01 January 2011

My tool to annotate VCF files.

The Variant Call Format is a text file format generated by many tools for NGS. It contains meta-information lines, a header line, and then data lines describing how the mutations were called. I don't like this format because it cannot be used to store some hierachical annotations (like json or xml), nevertheless it is a de facto standard.

I wrote a tool called vcfannotator to append a set of annotations from the UCSC database to a VCF file. As I wanted to keep this tool simple and without any dependencies, it only uses the flat files available from the download area at the UCSC.

This tools appends several informations:

  • A prediction of the mutation: is it in the cDNA ? in an intron ? in an exon ? is it a non-synonymous mutation ? was a stop codon lost or gained ? is there any consequence on the splicing process ? Here, the UCSC DAS server and the KnonwGenes table are used to retrieve the genomic DNA and the structure of the gene.
  • dbSNP: was this mutation found before in dbSNP ?
  • Personal genomes: was this mutation found before in Venter's genome ? In Watson's genome ? etc...
  • The mapability (Uniqueness of Reference Genome ): gives an idea about how the context is unique in the genome
  • genomicSuperDups: is this mutation located in a segmental genomic duplication ?
  • tfbsConsSites: is this mutation located in the site of a transcription factor ?
  • phastCons44way: how is the mutation conserved within a multiple alignments of 44 vertebrate genomes to the human genome ?


The tool is available on github at: https://github.com/lindenb/jsandbox in jsandbox/src/sandbox/VCFAnnotator.java.

Compilation

> cd jsandbox
> ant vcfannotator

Execution & Options

> java -jar dist/vcfannotator.jar -h
VCF annotator
Pierre Lindenbaum PhD. 2010. http://plindenbaum.blogspot.com
Options:
-b ucsc.build default:hg18
-m mapability.
-g genomicSuperDups
-p basic prediction
-c phastcons prediction (phastCons44way)
-t transcription factors sites prediction
-pg personal genomes
-snp <id> add ucsc <id> must be present in "http://hgdownload.cse.ucsc.edu/goldenPath/<ucscdb>/database/<id>.txt.gz" e.g. snp129
-log <level> one value from java.util.logging.Level default:OFF
-proxyHost <host>
-proxyPort <port>

Example

Input

##fileformat=VCFv3.2
##fileDate=20091120
##source=gigabayes
##reference=ncbi36
##phasing=none
#CHROM POS ID REF ALT QUAL FILTER INFO
1 1105366 . T C 99 0 NS=471;DP=16179;AC=8;AN=942;HWE=0.0685233
1 1105373 . C T 99 0 NS=469;DP=15318;AC=5;AN=938;HWE=0.0267954
1 1108138 rs61733845 C T 99 0 NS=450;DP=16134;AC=110;AN=900;dbSNP;HWE=0.930276
1 1109206 . G A 99 0 NS=696;DP=54701;AC=2;AN=1392;HWE=0.0028777
1 1109262 . C T 99 0 NS=696;DP=55783;AC=12;AN=1392;HWE=0.104349
1 1110233 . C G 99 0 NS=694;DP=34301;AC=43;AN=1388;HWE=4.91397
1 1110240 . T A 99 0 NS=695;DP=35846;AC=3;AN=1390;HWE=0.00648883
1 1110294 rs1320571 G A 99 0 NS=608;DP=36050;AC=246;AN=1216;dbSNP;HM;HWE=3.05358
1 1110351 . A C 99 0 NS=696;DP=26284;AC=15;AN=1392;HWE=0.163402
1 1110358 . T C 99 0 NS=694;DP=24596;AC=24;AN=1388;HWE=0.422309
1 1110366 . G A 99 0 NS=697;DP=22511;AC=16;AN=1394;HWE=0.185781
1 3537495 . C T 99 0 NS=697;DP=25302;AC=15;AN=1394;HWE=0.163165
1 3537910 . G A 99 0 NS=605;DP=15896;AC=34;AN=1210;HWE=0.46683
1 3537996 rs2760321 T C 99 0 NS=571;DP=15376;AC=1040;AN=1142;dbSNP;HM;HWE=4.28852
(...)

Command:

java -jar dist/vcfannotator.jar -pg -m -g -p -c -t -snp snp130 -log ALL input.vcf
## wait !

Result

##fileformat=VCFv3.2
##fileDate=20091120
##source=gigabayes
##reference=ncbi36
##phasing=none
##SNP130=table snp130 from UCSC
##INFO=NA12878,1,String,"NA12878's Personal genome"
##INFO=NA12891,1,String,"NA12891's Personal genome"
(...)
chr1 1108138 rs61733845 C T 99 0 NS=450;DP=16134;AC=110;AN=900;dbSNP;HWE=0.930276;MAPABILITY_WGENCODEBROADMAPABILITYALIGN36MER=1;MAPABILITY_WGENCODEDUKEUNIQUENESS20=1;MAPABILITY_WGENCODEDUKEUNIQUENESS24=1;MAPABILITY_WGENCODEDUKEUNIQUENESS35=1;phastCons44way=1.000;PREDICTION=geneSymbol:TTLL10|wild.codon:TGC|wild.aa:C|pos.cdna:935|kgId:uc001acy.1|position.protein:312|strand:+|mut.aa:C|mut.codon:TGT|exon.name:Exon 11|type:EXON_CODING_SYNONYMOUS;PREDICTION=geneSymbol:TTLL10|wild.codon:TGC|wild.aa:C|pos.cdna:716|kgId:uc001acz.1|position.protein:239|strand:+|mut.aa:C|mut.codon:TGT|exon.name:Exon 7|type:EXON_CODING_SYNONYMOUS
(...)
chr1 1110351 . A C 99 0 NS=696;DP=26284;AC=15;AN=1392;HWE=0.163402;MAPABILITY_WGENCODEBROADMAPABILITYALIGN36MER=1;MAPABILITY_WGENCODEDUKEUNIQUENESS20=1;MAPABILITY_WGENCODEDUKEUNIQUENESS24=1;MAPABILITY_WGENCODEDUKEUNIQUENESS35=1;phastCons44way=1.000;PREDICTION=geneSymbol:TTLL10|wild.codon:AAG|wild.aa:K|splicing:SPLICING_DONOR|pos.cdna:1399|kgId:uc001acy.1|position.protein:467|strand:+|mut.aa:T|mut.codon:ACG|exon.name:Exon 13|type:EXON_CODING_NON_SYNONYMOUS;PREDICTION=geneSymbol:TTLL10|wild.codon:AAG|wild.aa:K|pos.cdna:1180|kgId:uc001acz.1|position.protein:394|strand:+|mut.aa:T|mut.codon:ACG|exon.name:Exon 9|type:EXON_CODING_NON_SYNONYMOUS
(...)
chr1 113068276 . C G 99 0 NS=697;DP=12774;AC=7;AN=1394;HWE=0.0353282;MAPABILITY_WGENCODEBROADMAPABILITYALIGN36MER=1;MAPABILITY_WGENCODEDUKEUNIQUENESS20=1;MAPABILITY_WGENCODEDUKEUNIQUENESS24=1;MAPABILITY_WGENCODEDUKEUNIQUENESS35=1;phastCons44way=1.000;PREDICTION=geneSymbol:FAM19A3|wild.codon:CCA|wild.aa:P|pos.cdna:451|kgId:uc001ecu.1|position.protein:151|strand:+|mut.aa:R|mut.codon:CGA|exon.name:Exon 4|type:EXON_CODING_NON_SYNONYMOUS;PREDICTION=geneSymbol:FAM19A3|wild.codon:ACC|wild.aa:T|pos.cdna:383|kgId:uc001ecv.1|position.protein:128|strand:+|mut.aa:T|mut.codon:ACG|exon.name:Exon 4|type:EXON_CODING_SYNONYMOUS
(...)
Interestingly, for this last mutation chr1:113068276, there are two transcripts at the same position with two different translation frames, so there are two predictions: one synonymous mutation and one non-synonymous mutation.


That's it,

Pierre

17 November 2010

BLAST/XML+Annotations

I recently asked on Biostar if it would be possible to align two sequences while displaying their respective annotations.

As both answers I received (SPICE and jalview ) require a graphical interface, I quickly wrote a command-line java program doing the job. This program reads a NCBI/BLAST XML output and, if the 'query' or the 'hit' definition lines start with "gi|....", it fetches the Genbank records and the annotations for the sequence and map them onto the alignments.

The program is available on github at https://github.com/lindenb/jsandbox/blob/master/src/sandbox/BlastAnnotation.java.

Do we need an external library parsing Blast?

No, the java binding compiler, ${JAVA_HOME}/bin/xjc, can generate a java parser from BLAST DTD:
xjc -d src -p sandbox.ncbi.blast -dtd http://www.ncbi.nlm.nih.gov/dtd/NCBI_BlastOutput.dtd

parsing a schema...
compiling a schema...
sandbox/ncbi/blast/BlastOutput.java
sandbox/ncbi/blast/BlastOutputIterations.java
sandbox/ncbi/blast/BlastOutputMbstat.java
sandbox/ncbi/blast/BlastOutputParam.java
sandbox/ncbi/blast/Hit.java
sandbox/ncbi/blast/HitHsps.java
sandbox/ncbi/blast/Hsp.java
sandbox/ncbi/blast/Iteration.java
sandbox/ncbi/blast/IterationHits.java
sandbox/ncbi/blast/IterationStat.java
sandbox/ncbi/blast/ObjectFactory.java
sandbox/ncbi/blast/Parameters.java
sandbox/ncbi/blast/Statistics.java


And do we need an external library parsing Genbank?

No, again xjc did the job:
xjc -d src -p sandbox.ncbi.gbc -dtd http://www.ncbi.nlm.nih.gov/dtd/INSD_INSDSeq.dtd

parsing a schema...
compiling a schema...
sandbox/ncbi/gbc/INSDAltSeqData.java
sandbox/ncbi/gbc/INSDAltSeqDataItems.java
sandbox/ncbi/gbc/INSDAltSeqItem.java
sandbox/ncbi/gbc/INSDAltSeqItemInterval.java
sandbox/ncbi/gbc/INSDAltSeqItemIsgap.java
sandbox/ncbi/gbc/INSDAuthor.java
sandbox/ncbi/gbc/INSDComment.java
sandbox/ncbi/gbc/INSDCommentItem.java
sandbox/ncbi/gbc/INSDCommentParagraph.java
sandbox/ncbi/gbc/INSDCommentParagraphItems.java
sandbox/ncbi/gbc/INSDCommentParagraphs.java
sandbox/ncbi/gbc/INSDFeature.java
sandbox/ncbi/gbc/INSDFeatureIntervals.java
sandbox/ncbi/gbc/INSDFeaturePartial3.java
sandbox/ncbi/gbc/INSDFeaturePartial5.java
sandbox/ncbi/gbc/INSDFeatureQuals.java
sandbox/ncbi/gbc/INSDFeatureSet.java
sandbox/ncbi/gbc/INSDFeatureSetFeatures.java
sandbox/ncbi/gbc/INSDFeatureXrefs.java
sandbox/ncbi/gbc/INSDInterval.java
sandbox/ncbi/gbc/INSDIntervalInterbp.java
sandbox/ncbi/gbc/INSDIntervalIscomp.java
sandbox/ncbi/gbc/INSDKeyword.java
sandbox/ncbi/gbc/INSDQualifier.java
sandbox/ncbi/gbc/INSDReference.java
sandbox/ncbi/gbc/INSDReferenceAuthors.java
sandbox/ncbi/gbc/INSDReferenceXref.java
sandbox/ncbi/gbc/INSDSecondaryAccn.java
sandbox/ncbi/gbc/INSDSeq.java
sandbox/ncbi/gbc/INSDSeqAltSeq.java
sandbox/ncbi/gbc/INSDSeqCommentSet.java
sandbox/ncbi/gbc/INSDSeqFeatureSet.java
sandbox/ncbi/gbc/INSDSeqFeatureTable.java
sandbox/ncbi/gbc/INSDSeqKeywords.java
sandbox/ncbi/gbc/INSDSeqOtherSeqids.java
sandbox/ncbi/gbc/INSDSeqReferences.java
sandbox/ncbi/gbc/INSDSeqSecondaryAccessions.java
sandbox/ncbi/gbc/INSDSeqStrucComments.java
sandbox/ncbi/gbc/INSDSeqid.java
sandbox/ncbi/gbc/INSDSet.java
sandbox/ncbi/gbc/INSDStrucComment.java
sandbox/ncbi/gbc/INSDStrucCommentItem.java
sandbox/ncbi/gbc/INSDStrucCommentItems.java
sandbox/ncbi/gbc/INSDXref.java
sandbox/ncbi/gbc/ObjectFactory.java

Example


As an example I've aligned the "human eif4G1" (gi|303227906) with "Mus musculus eif4G1" (gi|56699433).
The very first lines of the BLAST report are:
<?xml version="1.0"?>
<!DOCTYPE BlastOutput PUBLIC "-//NCBI//NCBI BlastOutput/EN" "NCBI_BlastOutput.dt
<BlastOutput>
<BlastOutput_program>blastn</BlastOutput_program>
<BlastOutput_version>BLASTN 2.2.24+</BlastOutput_version>
<BlastOutput_reference>Stephen F. Altschul, Thomas L. Madden, Alejandro A. Sch
<BlastOutput_db>n/a</BlastOutput_db>
<BlastOutput_query-ID>gi|303227906|ref|NM_198241.2|</BlastOutput_query-ID>
<BlastOutput_query-def>Homo sapiens eukaryotic translation initiation factor 4
<BlastOutput_query-len>5538</BlastOutput_query-len>
<BlastOutput_param>
<Parameters>
<Parameters_expect>10</Parameters_expect>
<Parameters_sc-match>2</Parameters_sc-match>
<Parameters_sc-mismatch>-3</Parameters_sc-mismatch>
<Parameters_gap-open>5</Parameters_gap-open>
<Parameters_gap-extend>2</Parameters_gap-extend>
<Parameters_filter>L;m;</Parameters_filter>
</Parameters>
</BlastOutput_param>
<BlastOutput_iterations>
<Iteration>
<Iteration_iter-num>1</Iteration_iter-num>
<Iteration_query-ID>gi|303227906|ref|NM_198241.2|</Iteration_query-ID>
<Iteration_query-def>Homo sapiens eukaryotic translation initiation factor 4 g
<Iteration_query-len>5538</Iteration_query-len>
<Iteration_hits>
<Hit>
<Hit_num>1</Hit_num>
<Hit_id>gi|56699433|ref|NM_001005331.1|</Hit_id>
<Hit_def>Mus musculus eukaryotic translation initiation factor 4, gamma 1 (Eif
<Hit_accession>NM_001005331</Hit_accession>
<Hit_len>5460</Hit_len>
<Hit_hsps>
<Hsp>
<Hsp_num>1</Hsp_num>
<Hsp_bit-score>6818.02</Hsp_bit-score>
<Hsp_score>7560</Hsp_score>
<Hsp_evalue>0</Hsp_evalue>
<Hsp_query-from>53</Hsp_query-from>
<Hsp_query-to>5538</Hsp_query-to>
<Hsp_hit-from>1</Hsp_hit-from>
<Hsp_hit-to>5418</Hsp_hit-to>
<Hsp_query-frame>1</Hsp_query-frame>
<Hsp_hit-frame>1</Hsp_hit-frame>
<Hsp_identity>4820</Hsp_identity>
<Hsp_positive>4820</Hsp_positive>
<Hsp_gaps>138</Hsp_gaps>
<Hsp_align-len>5521</Hsp_align-len>
<Hsp_qseq>GGCGCCGGCTGCGCCTGCGGAGAAGCGGTGGCCGCCGAGCGGGATCTGTGCGGGGAGCCGGAAA...
<Hsp_hseq>GGCGCTGGCTGCGCCTGCGGAGAAGCGGTGGCCGCCGAGCGGGATCTGTGCGGGGAGCCGGAAA...
<Hsp_midline>||||| |||||||||||||||||||||||||||||||||||||||||||||||||||||||...
</Hsp>
</Hit_hsps>
</Hit>
</Iteration_hits>
<Iteration_stat>
<Statistics>
<Statistics_db-num>0</Statistics_db-num>
<Statistics_db-len>0</Statistics_db-len>
<Statistics_hsp-len>0</Statistics_hsp-len>
<Statistics_eff-space>0</Statistics_eff-space>
<Statistics_kappa>-1</Statistics_kappa>
<Statistics_lambda>-1</Statistics_lambda>
<Statistics_entropy>-1</Statistics_entropy>
</Statistics>
</Iteration_stat>
</Iteration>
</BlastOutput_iterations>
</BlastOutput>

And here is the output of my program:

java -jar dist/blastannot.jar ~/jeter.blast.xml

QUERY: Homo sapiens eukaryotic translation initiation factor 4 gamma, 1 (EIF4G1), transcript variant 2, mRNA
ID:gi|303227906|ref|NM_198241.2| Len:5538
>Mus musculus eukaryotic translation initiation factor 4, gamma 1 (Eif4g1), transcript variant 2, mRNA
NM_001005331
id:gi|56699433|ref|NM_001005331.1| len:5460

e-value:0 gap:138 bitScore:6818.02

#####:############################################ exon 1..180 gene:EIF4G1
QUERY 000000053 GGCGCCGGCTGCGCCTGCGGAGAAGCGGTGGCCGCCGAGCGGGATCTGTG 000000102
||||| ||||||||||||||||||||||||||||||||||||||||||||
HIT 000000001 GGCGCTGGCTGCGCCTGCGGAGAAGCGGTGGCCGCCGAGCGGGATCTGTG 000000050
#####:############################################ exon 1..128 gene:Eif4g1



################################################## exon 1..180 gene:EIF4G1
QUERY 000000103 CGGGGAGCCGGAAATGGTTGTGGACTACGTCTGTGCGGCTGCGTGGGGCT 000000152
||||||||||||||||||||||||||||||||||||||||||||||||||
HIT 000000051 CGGGGAGCCGGAAATGGTTGTGGACTACGTCTGTGCGGCTGCGTGGGGCT 000000100
################################################## exon 1..128 gene:Eif4g1



############::::::::::###### exon 1..180 gene:EIF4G1
#:::::::::::::###::::: exon 181..237 gene:EIF4G1
QUERY 000000153 CGGCCGCGCGGACTGAAGGAGACTGAAGGCCCTCGGATGCCCAGAACCTG 000000202
|||||||||||| ||||||| |||
HIT 000000101 CGGCCGCGCGGA----------CTGAAGG-------------AGA----- 000000122
############----------#######-------------### gene 1..5460 gene:Eif4g1
############----------#######-------------### exon 1..128 gene:Eif4g1



::::::::::::::::::::::##:##:::::::# exon 181..237 gene:EIF4G1
############### exon 238..331 gene:EIF4G1
QUERY 000000203 TAGGCCGCACCGTGGACTTGTTCTTAATCGAGGGGGTGCTGGGGGGACCC 000000252
|| || ||||||||||||||||
HIT 000000123 ----------------------CTGAA-------GGTGCTGGGGGGACCC 000000143
----------------------##:##-------# exon 1..128 gene:Eif4g1
############### exon 129..222 gene:Eif4g1



#:###############################:###:############ exon 238..331 gene:EIF4G1
##############:###:############ CDS 272..5071 gene:EIF4G1
QUERY 000000253 TGATGTGGCACCAAATGAAATGAACAAAGCTCCACAGTCCACAGGCCCCC 000000302
| ||||||||||||||||||||||||||||||| ||| ||||||||||||
HIT 000000144 TAATGTGGCACCAAATGAAATGAACAAAGCTCCCCAGCCCACAGGCCCCC 000000193
#:###############################:###:############ exon 129..222 gene:Eif4g1
##############:###:############ CDS 163..4944 gene:Eif4g1


(...)


############:#:#:#####:######:#:########:##:###### exon 4890..5521 gene:EIF4G1
############:#:#:#####:######:#:########:##:###### STS 4948..5505 gene:EIF4G1
############:#:#:#####:######:#:########:##:###### STS 5174..5403 gene:EIF4G1
QUERY 000005319 TTGGTGTGTCTTGGGGTGGGGAGGGGCACCAACGCCTGCCCCTGGGGTCC 000005368
|||||||||||| | | ||||| |||||| | |||||||| || ||||||
HIT 000005201 TTGGTGTGTCTTTGCGGGGGGAAGGGCACTACCGCCTGCCTCTAGGGTCC 000005250
############:#:#:#####:######:#:########:##:###### exon 4760..5396 gene:Eif4g1



::##############:##########:###################### exon 4890..5521 gene:EIF4G1
::##############:##########:###################### STS 4948..5505 gene:EIF4G1
::##############:##########:####### STS 5174..5403 gene:EIF4G1
QUERY 000005369 TTTTTTTTATTTTCTGAAAATCACTCTCGGGACTGCCGTCCTCGCTGCTG 000005418
|||||||||||||| |||||||||| ||||||||||||||||||||||
HIT 000005251 --TTTTTTATTTTCTG-AAATCACTCTTGGGACTGCCGTCCTCGCTGCTG 000005297
--##############-##########:###################### exon 4760..5396 gene:Eif4g1



######################:#############:############# exon 4890..5521 gene:EIF4G1
######################:#############:############# STS 4948..5505 gene:EIF4G1
QUERY 000005419 GGGGCATATGCCCCAGCCCCTGTACCACCCCTGCTGTTGCCTGGGCAGGG 000005468
|||||||||||||||||||||| ||||||||||||| |||||||||||||
HIT 000005298 GGGGCATATGCCCCAGCCCCTGCACCACCCCTGCTGCTGCCTGGGCAGGG 000005347
######################:#############:############# exon 4760..5396 gene:Eif4g1



#:##-############################################: exon 4890..5521 gene:EIF4G1
#:##-################################# STS 4948..5505 gene:EIF4G1
###### polyA_signal 5496..5501 gene:EIF4G1
# polyA_site 5516 gene:EIF4G1
QUERY 000005469 GGAA-GGGGGGGCACGGTGCCTGTAATTATTAAACATGAATTCAATTAAG 000005517
| || ||||||||||||||||||||||||||||||||||||||||||||
HIT 000005348 GAAAGGGGGGGGCACGGTGCCTGTAATTATTAAACATGAATTCAATTAAA 000005397
#:##:############################################ exon 4760..5396 gene:Eif4g1



:::# exon 4890..5521 gene:EIF4G1
# polyA_site 5521 gene:EIF4G1
QUERY 000005518 CTCAAAAAAAAAAAAAAAAAA 000005538
||||||||||||||||||
HIT 000005398 AAAAAAAAAAAAAAAAAAAAA 000005418



That's it,
Pierre

11 October 2010

A custom JSP tag for mediawiki

I'm pretty happy with the following custom JSP tag for mediawiki I wrote last week (http://code.google.com/p/code915/source/browse/trunk/charpak/src/WEB-INF/src/fr/inserm/umr915/charpak/j2ee/tags/MediaWikiTag.java). For a given title, it calls the mediawiki api to test if an article exists in a mediawiki installation. In my case, this system will be useful to let my users write some custom annotations to some records from our database.

<u915:mediawiki title="${geneName}" preload="Template:Gene"/>
If the article does exist on the mediawiki installation a simple hyperlink is created. If the article does not exist, an hyperlink for creating/editing the new article is displayed with a predload option: Preloading wikitext presents the user with a partially created page rather than a blank page, possibly with inline instructions for content organization..For example, in the following web application, the blue and red icons point to some existing articles or some articles to be created:


Internals

This custom tag calls the mediawiki api (e.g api.php?action=query&format=xml&titles=Charles+Darwin )and get the XML response as a DOM document. An XPath expression is then evaluated to test if an attribute '@missing' was declared for the given article.
boolean pageFound=false;
String termUTF8 = URLEncoder.encode(term,"UTF-8");
URL url=new URL(baseurl+"/api.php?action=query&format=xml&titles="+termUTF8);
URLConnection con=url.openConnection();
in=con.getInputStream();
Document dom=this.domBuilder.parse(in);
Element e=(Element)this.xpathPage.evaluate(dom,XPathConstants.NODE);
if(e!=null && e.getAttributeNode("missing")==null)
{
pageFound=true;
}
in.close();


That's it,
Pierre

01 September 2009

Using the BerkeleyDB Direct Persistence Layer: my notebook

In this post I show how I've used the Java BerkeleyDB API / Direct Persistence Layer to store a set of individuals in a BerkeleyDB database.


In a prevous post, I've shown how to use the BerkeleyDB API, a key/value database, to store some RDF statements. In this example, a set of TupleBinding was created to read and write the java Objects from/to the BerkeleyDB database.
Via Oracle: The Direct Persistence Layer (DPL) is one of two APIs that BerkeleyDB provides for interaction with databases. The DPL provides the ability to cause any Java type to be persistent without implementing special interfaces. The only real requirement is that each persistent class have a default constructor. No hand-coding of bindings is required. A binding is a way of transforming data types into a format which can be stored in a JE database. No external schema is required to define primary and secondary index keys. Java annotations are used to define all metadata.

OK, say you want to store a set of individuals in a BerkeleyDB database. The class Individual will be annotated with the @Entity annotation to tell the DPL that it should save this class. The primary key will be annotated with @PrimaryKey and will be automatically filled by the BerkeleyDB engine.
@Entity //Indicates a persistent entity class.
public class Individual
{
@PrimaryKey(sequence="individual") //Indicates the primary key field of an entity class
private long id;
(...)
}
. We also want to have a quick access to the family names, to the fathers and to the mothers. A @SecondaryKey is used to create those secondary indexes. Those secondary indexes also act as a constraint: references to the parents are allowed only if their ID already exist in the database.
@Entity
public class Individual
{
@PrimaryKey(sequence="individual")
private long id;
private String firstName=null;
@SecondaryKey(relate=Relationship.MANY_TO_ONE)
private String lastName=null;
@SecondaryKey(relate=Relationship.MANY_TO_ONE,
relatedEntity=Individual.class,
onRelatedEntityDelete=DeleteAction.NULLIFY
)
private Long fatherId=null;
@SecondaryKey(relate=Relationship.MANY_TO_ONE,
relatedEntity=Individual.class,
onRelatedEntityDelete=DeleteAction.NULLIFY)
private Long motherId=null;
(...)
}

At the end, here is the full source code of the class Individual.
package dpl;
import com.sleepycat.persist.model.DeleteAction;
import com.sleepycat.persist.model.Entity;
import com.sleepycat.persist.model.PrimaryKey;
import com.sleepycat.persist.model.Relationship;
import com.sleepycat.persist.model.SecondaryKey;

@Entity
public class Individual
{
@PrimaryKey(sequence="individual")
private long id;
private String firstName=null;
@SecondaryKey(relate=Relationship.MANY_TO_ONE)
private String lastName=null;
@SecondaryKey(relate=Relationship.MANY_TO_ONE,
relatedEntity=Individual.class,
onRelatedEntityDelete=DeleteAction.NULLIFY
)
private Long fatherId=null;
@SecondaryKey(relate=Relationship.MANY_TO_ONE,
relatedEntity=Individual.class,
onRelatedEntityDelete=DeleteAction.NULLIFY)
private Long motherId=null;
private int gender=0;

public Individual()
{

}

public Individual(String firstName,String lastName,int gender)
{
this.firstName=firstName;
this.lastName=lastName;
this.gender=gender;
}

public long getId() {
return id;
}



public String getFirstName() {
return firstName;
}
public void setFirstName(String firstName) {
this.firstName = firstName;
}
public String getLastName() {
return lastName;
}
public void setLastName(String lastName) {
this.lastName = lastName;
}

public long getFatherId() {
return fatherId;
}
public void setFatherId(long fatherId) {
this.fatherId = fatherId;
}
public long getMotherId() {
return motherId;
}
public void setMotherId(long motherId) {
this.motherId = motherId;
}

public void setGender(int gender) {
this.gender = gender;
}
public int getGender() {
return gender;
}

@Override
public int hashCode() {
return 31 + (int) (id ^ (id >>> 32));
}

@Override
public boolean equals(Object obj) {
if (this == obj)
return true;
if (obj == null)
return false;
if (!(obj instanceof Individual))
return false;
Individual other = (Individual) obj;
if (id != other.id)
return false;
return true;
}

@Override
public String toString() {
return getFirstName()+" "+getLastName();
}
}

Opening the Database


The database, the datastore, and the indexes are opened:
EnvironmentConfig EnvironmentConfig envCfg= new EnvironmentConfig();
StoreConfig storeCfg= new StoreConfig();
envCfg.setAllowCreate(true);
envCfg.setTransactional(true);
storeCfg.setAllowCreate(true);
storeCfg.setTransactional(true);
this.environment= new Environment(dataDirectory,envCfg);
this.store= new EntityStore(this.environment,"StoreName",storeCfg);
this.individualById = this.store.getPrimaryIndex(Long.class, Individual.class);
this.individualByLastName= this.store.getSecondaryIndex(this.individualById, String.class, "lastName");

Creating a few Individuals


A transaction is opened, some individuals of Charles Darwin's family are inserted in the datastore and the transaction is commited.
Transaction txn;
//create a transaction
txn= environment.beginTransaction(null, null);

Individual gp1= new Individual("Robert","Darwin",1);
individualById.put(gp1);
Individual gm1= new Individual("Susannah","Wedgwood",2);
individualById.put(gm1);
Individual gp2= new Individual("Josiah","Wedgwood",1);
individualById.put(gp2);
Individual gm2= new Individual("Elisabeth","Allen",2);
individualById.put(gm2);

Individual father= new Individual("Charles","Darwin",1);
father.setFatherId(gp1.getId());
father.setMotherId(gm1.getId());
individualById.put(father);
Individual mother= new Individual("Emma","Wedgwood",2);
mother.setFatherId(gp2.getId());
mother.setMotherId(gm2.getId());
individualById.put(mother);


Individual c1= new Individual("William","Darwin",1);
c1.setFatherId(father.getId());
c1.setMotherId(mother.getId());
individualById.put(c1);
Individual c2= new Individual("Anne Elisabeth","Darwin",2);
c2.setFatherId(father.getId());
c2.setMotherId(mother.getId());
individualById.put(c2);

txn.commit();

Using the secondary indexes


An EntityCursor obtained from the secondary index individualByLastName is used to iterate over all the individuals named "Darwin":
EntityCursor<Individual> cursor = individualByLastName.entities("Darwin", true, "Darwin", true);
for(Individual indi:cursor)
{
System.out.println(indi.getLastName()+"\t"+indi.getFirstName()+"\t"+indi.getId());
}
cursor.close();


Output

###Listing all Darwin
Darwin Robert 1
Darwin Charles 5
Darwin William 7
Darwin Anne Elisabeth 8


Source code



package dpl;

import java.io.File;
import java.util.logging.Logger;

import com.sleepycat.je.DatabaseException;
import com.sleepycat.je.Environment;
import com.sleepycat.je.EnvironmentConfig;
import com.sleepycat.je.Transaction;
import com.sleepycat.persist.EntityCursor;
import com.sleepycat.persist.EntityStore;
import com.sleepycat.persist.PrimaryIndex;
import com.sleepycat.persist.SecondaryIndex;
import com.sleepycat.persist.StoreConfig;

public class DirectPersistenceLayerTest
{
private static Logger LOG= Logger.getLogger(DirectPersistenceLayerTest.class.getName());
private Environment environment=null;
private EntityStore store;
private PrimaryIndex<Long, Individual> individualById;
private SecondaryIndex<String, Long, Individual> individualByLastName;


public void open(File dir) throws DatabaseException
{
close();
EnvironmentConfig envCfg= new EnvironmentConfig();
StoreConfig storeCfg= new StoreConfig();
envCfg.setAllowCreate(true);
envCfg.setTransactional(true);
storeCfg.setAllowCreate(true);
storeCfg.setTransactional(true);
LOG.info("opening "+dir);
this.environment= new Environment(dir,envCfg);
this.store= new EntityStore(this.environment,"StoreName",storeCfg);
this.individualById = this.store.getPrimaryIndex(Long.class, Individual.class);
this.individualByLastName= this.store.getSecondaryIndex(this.individualById, String.class, "lastName");
}

public void close()
{
if(this.store!=null)
{
LOG.info("close store");
try {
this.store.close();
}
catch (DatabaseException e)
{
LOG.warning(e.getMessage());
}
this.store=null;
}

if(this.environment!=null)
{
LOG.info("close env");
try {
this.environment.cleanLog();
this.environment.close();
}
catch (DatabaseException e)
{
LOG.warning(e.getMessage());
}
this.environment=null;
}
}

void run() throws DatabaseException
{
Transaction txn;
LOG.info("count.individuals="+ individualById.count());
//create a transaction
txn= environment.beginTransaction(null, null);

Individual gp1= new Individual("Robert","Darwin",1);
individualById.put(gp1);
Individual gm1= new Individual("Susannah","Wedgwood",2);
individualById.put(gm1);
Individual gp2= new Individual("Josiah","Wedgwood",1);
individualById.put(gp2);
Individual gm2= new Individual("Elisabeth","Allen",2);
individualById.put(gm2);

Individual father= new Individual("Charles","Darwin",1);
father.setFatherId(gp1.getId());
father.setMotherId(gm1.getId());
individualById.put(father);
Individual mother= new Individual("Emma","Wedgwood",2);
mother.setFatherId(gp2.getId());
mother.setMotherId(gm2.getId());
individualById.put(mother);


Individual c1= new Individual("William","Darwin",1);
c1.setFatherId(father.getId());
c1.setMotherId(mother.getId());
individualById.put(c1);
Individual c2= new Individual("Anne Elisabeth","Darwin",2);
c2.setFatherId(father.getId());
c2.setMotherId(mother.getId());
individualById.put(c2);

txn.commit();



System.out.println("###Listing all Darwin");
EntityCursor<Individual> cursor = individualByLastName.entities("Darwin", true, "Darwin", true);
for(Individual indi:cursor)
{
System.out.println(indi.getLastName()+"\t"+indi.getFirstName()+"\t"+indi.getId());
}
cursor.close();

LOG.info("count.individuals="+individualById.count());
}

public static void main(String[] args)
{
DirectPersistenceLayerTest app= new DirectPersistenceLayerTest();
try
{
int optind=0;
while(optind< args.length)
{
if(args[optind].equals("-h"))
{
System.err.println("");
}
else if(args[optind].equals("--"))
{
optind++;
break;
}
else if(args[optind].startsWith("-"))
{
System.err.println("Unknown option "+args[optind]);
}
else
{
break;
}
++optind;
}
app.open(new File("/tmp/bdb"));
app.run();
}
catch(Throwable err)
{
err.printStackTrace();
}
finally
{
app.close();
}
LOG.info("done.");
}
}

That's it.
Pierre

22 November 2008

A Web Service for ONSolubility.

This post is about the ONSolubility project (For references search FriendFeed for Solubility). This post is about how I've used Egon's code to create a web service to query the data of solubility. Egon has already done a great job by using the google java spreasheet API to download Jean-Claude's Solubility data. On his side, Rajarshi Guha wrote an HTML page querying those data using the Google Query-API. Here I show how I have created a webservice searching for the measurements based on their solvent/solute/concentration.

Server Side


Classes


I've added some JAXB(Java Architecture for XML Binding) annotations to Egon's Measurement.java. Those annotations help the web-service compiler (wsgen) to understand how the data will be transmitted to the client.
@javax.xml.bind.annotation .XmlRootElement(name="Measurement")
public class Measurement
implements Serializable
{
(...)

Then we create the WebService ONService.java. This service is just a java class containing also a few annotations. First we flag the class as a webservice:
@javax.jws.WebService(
name="onsolubility",
serviceName="ons"
)
public class ONService
{
Then comes the function seach provided by this service. This function will download the data from google using Egon's API and will return a collection of Measurement based on their solute/solvent/concentration. Again the java annotations will help the compiler to implement the service
@WebMethod(action="urn:search",operationName="search")
public List search(
@WebParam(name="solute")String solute,
@WebParam(name="solvent")String solvent,
@WebParam(name="concMin")Double concMin,
@WebParam(name="concMax")Double concMax
) throws Exception
{....
. The web service is launched with only 3 lines of code (!).
ONService service=new ONService();
Endpoint endpoint = Endpoint.create(service);
endpoint.publish("http://localhost:8080/onsolubility");

Compilation


I've create a ant file invoking wsgen generating the stubs and installing the webservice. Here is the ouput
compile-webservice:
[javac] Compiling 1 source file to /home/pierre/tmp/onssolubility/ons.solubility.data/bin
[wsgen] command line: wsgen -classpath (...) -verbose ons.solubility.ws.ONService
[wsgen] Note: ap round: 1
[wsgen] [ProcessedMethods Class: ons.solubility.ws.ONService]
[wsgen] [should process method: search hasWebMethods: true ]
[wsgen] [endpointReferencesInterface: false]
[wsgen] [declaring class has WebSevice: true]
[wsgen] [returning: true]
[wsgen] [WrapperGen - method: search(java.lang.String,java.lang.String,java.lang.Double,java.lang.Double)]
[wsgen] [method.getDeclaringType(): ons.solubility.ws.ONService]
[wsgen] [requestWrapper: ons.solubility.ws.jaxws.Search]
[wsgen] [should process method: main hasWebMethods: true ]
[wsgen] [webMethod == null]
[wsgen] [ProcessedMethods Class: java.lang.Object]
[wsgen] ons/solubility/ws/jaxws/ExceptionBean.java
[wsgen] ons/solubility/ws/jaxws/Search.java
[wsgen] ons/solubility/ws/jaxws/SearchResponse.java
[wsgen] Note: ap round: 2

publish-webservice:
[java] Publishing Service on http://localhost:8080/onsolubility?WSDL
.
And... that's it. When I open my browser on http://localhost:8080/onsolubility?WSDL , I can now see the WSDL description/schema of this service.

Client Side


Writing a client using this api looks the same way I did for a previous post about the IntAct/EBI API where the wsimport command generated the stubs from the WSDL file. I then wrote a simple test ONServiceTest.java, invoking our service several times.
private void test(
String solute,
String solvent,
Double concMin,
Double concMax)
{
try
{
Ons service=new Ons();
Onsolubility port=service.getOnsolubilityPort();
List data=port.search(solute, solvent, concMin, concMax);

for(Measurement measure:data)
{
System.out.println(
" sample :\t"+measure.getSample()+"\n"+
" solute :\t"+measure.getSolute()+"\n"+
" solvent :\t"+measure.getSolvent()+"\n"+
" experiment:\t"+measure.getExperiment()+"\n"+
" reference :\t"+measure.getReference()+"\n"+
" conc :\t"+measure.getConcentration()+"\n"
);
}
} catch(Throwable err)

{
System.err.println("#error:"+err.getMessage());
}
}
private void test()
{
test(null,null,null,null);
test("4-nitrobenzaldehyde",null,null,null);
test("4-nitrobenzaldehyde",null,0.3,0.4);
}
Here is the output
ant test-webservice
Buildfile: build.xml
test-webservice
[wsimport] parsing WSDL...
[wsimport] generating code...
[javac] Compiling 1 source file to onssolubility/ons.solubility.data/bin
[java] ##Searching solute: null solvent: null conc: null-null
[java] sample : 9
[java] solute : D-Glucose
[java] solvent : THF
[java] experiment: 1
[java] reference : http://onschallenge.wikispaces.com/JennyHale-1
[java] conc : 0.00222
[java]
[java] sample : 6
[java] solute : D-Mannitol
[java] solvent : Methanol
[java] experiment: 1
[java] reference : http://onschallenge.wikispaces.com/JennyHale-1
[java] conc : 0.00548
[java]
(...)
[java]
[java] sample : 10
[java] solute : D-Mannitol
[java] solvent : THF
[java] experiment: 1
[java] reference : http://onschallenge.wikispaces.com/JennyHale-1
[java] conc : 0.01098
[java] ##Searching solute: 4-nitrobenzaldehyde solvent: null conc: 0.3-0.4
[java] sample : 2b
[java] solute : 4-nitrobenzaldehyde
[java] solvent : Methanol
[java] experiment: 212
[java] reference : http://usefulchem.wikispaces.com/exp212
[java] conc : 0.38

That's it and that's enough code for the week-end.

Pierre

10 May 2007

SharedCopy: A collaborative tool for annotating web pages

Via TechCrunch: SharedCopy is a collaborative tool for annotating web pages. It takes a snapshot of the current page and uses it for annotations: then the visitors will sees the original page. I've tested it on a paper in pubmedcentral:


Pierre

10 May 2006

Playing with the connotea API (2/2)

A few monthes ago Ben Lund gave me the opportunity to test a beta-version of the connotea API and I wondered if I was able to build an Annozilla server that could act as a bridge between the firefox web browser and the connotea server allowing scientists to see and share comments about a web site/ a paper. As it is said on the annozilla server:

The Annozilla project is designed to view and create annotations associated with a web page, as defined by the W3C Annotea project. The idea is to store annotations as RDF on a server, using XPointer (or at least XPointer-like constructs) to identify the region of the document being annotated. The intention of Annozilla is to use Mozilla's native facilities to manipulate annotation data - its built-in RDF handling to parse the annotations, and nsIXmlHttpRequest to submit data when creating annotations..

annozilla02
In this example: firefox opened the NCBI home page (right side). Once the page is loaded, annozilla fetches the bookmarks about NCBI from connotea (top left). Double clicking in the annotations makes annozilla download the body of those bookmarks (bottom left).


The JAVA servlet I wrote is available at :



But wait, there is a problem: For "security reasons" Annozilla does not use the "GET" parameters in an URL (I really understand that). So when the following URL is submited:

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=pubmed&dopt=Abstract&list_uids=14870871

Annozilla ignores all the parameters and inserts the comments for:

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi



which is really less interesting !!!... Anyway, this was still a nice to write and it was proof of concept on how to use the connotea API.

update: 2010-08-12: source code

/*
Copyright (c) 2006 Pierre Lindenbaum PhD

Permission is hereby granted, free of charge, to any person obtaining
a copy of this software and associated documentation files (the
``Software''), to deal in the Software without restriction, including
without limitation the rights to use, copy, modify, merge, publish,
distribute, sublicense, and/or sell copies of the Software, and to
permit persons to whom the Software is furnished to do so, subject to
the following conditions:

The above copyright notice and this permission notice shall be included
in all copies or substantial portions of the Software.

The name of the authors when specified in the source files shall be
kept unmodified.

THE SOFTWARE IS PROVIDED ``AS IS'', WITHOUT WARRANTY OF ANY KIND, EXPRESS
OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
IN NO EVENT SHALL 4XT.ORG BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE
USE OR OTHER DEALINGS IN THE SOFTWARE.


$Id: $
$Author: $
$Revision: $
$Date: $
$Locker: $
$RCSfile: $
$Source: $
$State: $
$Name: $
$Log: $


*************************************************************************/
package org.lindenb.annotea.server;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.StringWriter;
import java.net.MalformedURLException;
import java.net.Socket;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLEncoder;
import java.security.MessageDigest;
import java.util.Enumeration;
import java.util.HashSet;
import java.util.Iterator;

import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.Result;
import javax.xml.transform.Source;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;


import org.w3c.dom.CDATASection;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.Text;
import org.xml.sax.SAXException;

import com.oreilly.servlet.Base64Decoder;


/**
* @author lindenb
* http://islande:8080/annotea/Annotea
*/
public class AnnoteaServer extends HttpServlet
{
/**
* serialVersionUID
*/
private static final long serialVersionUID = 1L;
/** flag for debugging on/off */
private static boolean DEBUG=false;

//static declaration of xml namespaces
static public final String RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
static public final String RDFS = "http://www.w3.org/2000/01/rdf-schema#";
static public final String DC = "http://purl.org/dc/elements/1.1/";
static public final String XMLNS= "http://www.w3.org/2000/xmlns/";
static public final String AN = "http://www.w3.org/2000/10/annotation-ns#";
static public final String XHTML = "http://www.w3.org/1999/xhtml" ;
static public final String HTTP = "http://www.w3.org/1999/xx/http#";
static public final String CONNOTEA ="http://www.connotea.org/2005/01/schema#";
static public final String TERMS = "http://purl.org/dc/terms/";



/** query parameter name as specified in the spec */
static final String QUERY_ANNOTATION_PARAMETER="w3c_annotates";
/** default number of rows to fetch from http://www.connotea.org */
static final int CONNOTEA_DEFAULT_NUM_ROWS=10;
/** number of rows to fetch from http://www.connotea.org */
int connoteaNumRowsToFetch =CONNOTEA_DEFAULT_NUM_ROWS;



/** @see javax.servlet.GenericServlet#init() */
public void init() throws ServletException
{
//init number of rows
String s = getInitParameter("connoteaNumRowsToFetch");
if(s!=null)
{
try
{
this.connoteaNumRowsToFetch=Integer.parseInt(s);
if(this.connoteaNumRowsToFetch<=0) this.connoteaNumRowsToFetch=CONNOTEA_DEFAULT_NUM_ROWS;
}
catch(Exception err)
{
throw new ServletException(err);
}
}
}


/** convert a string to MD5 */
static String toMD5(String url) throws ServletException
{
StringBuffer result= new StringBuffer();
try
{
MessageDigest md = MessageDigest.getInstance("MD5");

md.update(url.getBytes());
byte[] digest = md.digest();

for (int k=0; k<digest.length; k++)
{
String hex = Integer.toHexString(digest[k]);
if (hex.length() == 1) hex = "0" + hex;
result.append(hex.substring(hex.length()-2));
}
}
catch(Exception err)
{
throw new ServletException(err);
}
return result.toString();
}

/** @see javax.servlet.http.HttpServlet#doGet(javax.servlet.http.HttpServletRequest, javax.servlet.http.HttpServletResponse) */
protected void doGet(HttpServletRequest req, HttpServletResponse res) throws ServletException, IOException
{
res.setContentType("text/xml");
String urlquery= req.getParameter(QUERY_ANNOTATION_PARAMETER);
PrintWriter out=res.getWriter();

String pathInfo = req.getPathInfo(); // /a/b;c=123

/**
*
* WE FOUND URLQUERY
*
*/
if(urlquery!=null)
{
debug("query is "+urlquery);
String authorizationUTF8= URLEncoder.encode(getAuthorization(req),"UTF-8");
String md5=toMD5(urlquery);
Document doc= getConnoteaRDF(
"http://www.connotea.org/data/bookmarks/uri/"+md5,
req
);

String postCount=null;
String created=null;
if(doc!=null)
{
Element root= doc.getDocumentElement();
if(root==null || !isA(root,RDF,"RDF")) throw new ServletException("Bad XML root from connotea");

for(Node n1= root.getFirstChild();n1!=null;n1=n1.getNextSibling())
{
if(!isA(n1,TERMS,"URI")) continue;
for(Node n2= n1.getFirstChild();n2!=null;n2=n2.getNextSibling())
{
if(isA(n2,CONNOTEA,"postCount"))
{
postCount=textContent(n2).replace('\n',' ').trim();
}
else if(isA(n2,CONNOTEA,"created"))
{
created=textContent(n2).replace('\n',' ').trim();
}
}
break;
}
}

debug("postcount="+postCount);

out.print(
"<?xml version=\"1.0\" ?>\n" +
"<r:RDF xmlns:r=\""+RDF+"\"\n" +
" xmlns:a=\""+AN+"\"\n" +
" xmlns:d=\""+DC+"\">\n"
);
if(postCount!=null)
{
out.print(
" <r:Description r:about=\""+getBaseURL(req)+"/"+md5+"\">\n" +
" <r:type r:resource=\""+ AN +"Annotation\"/>\n" +
" <r:type r:resource=\"http://www.w3.org/2000/10/annotationType#Comment\"/>\n" +
" <a:annotates r:resource=\""+urlquery+"\"/>\n" +
" <d:title>"+postCount+" Annotation(s) of "+urlquery+" on connotea</d:title>\n" +
" <a:context>" + urlquery+"#xpointer(/html[1])</a:context>\n" +
" <d:creator>Connotea</d:creator>\n" +
" <a:created>"+created+"</a:created>\n" +
" <d:date>"+created+"</d:date>\n" +
" <a:body r:resource=\""+getBaseURL(req)+"/body/"+md5+"?authorization="+authorizationUTF8+"\">"+
"</r:Description>"
);
}

out.print("</r:RDF>\n");
}
/**
*
* /BODY/ in pathinfo
*
*/
else if(pathInfo!=null && pathInfo.startsWith("/body/"))
{
String md5=pathInfo.substring(6);
Document doc=getConnoteaRDF("http://www.connotea.org/data/uri/"+md5+
"?num="+this.connoteaNumRowsToFetch,
req);
if(doc==null)
{
throw new ServletException("Cannot get http://www.connotea.org/data/uri/"+md5);
}
Element root= doc.getDocumentElement();
if(root==null || !isA(root,RDF,"RDF")) throw new ServletException("Bad XML root from connotea");


out.print("<?xml version=\"1.0\" ?>\n"+
"<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">"
);

printHTMLBodyFromConnotea(new PrintWriter(new FileWriter("/tmp/logAnnotea.txt")),md5,root);
printHTMLBodyFromConnotea(out,md5,root);
}
/**
*
* other
*
*/
else
{
debug("pathinfo not handled ?");
out.print(
"<?xml version=\"1.0\" ?>\n" +
"<r:RDF xmlns:r=\""+RDF+"\"\n" +
" xmlns:a=\""+AN+"\"\n" +
" xmlns:d=\""+DC+"\">\n" +
"<a:annotates r:resource=\""+
req.getRequestURL()+ "\"/>"+
"</r:RDF>\n"
);
}


out.flush();
}

/** return and parse an HTML annotation from connotea */
private void printHTMLBodyFromConnotea(PrintWriter out,String md5,Element root)
{

out.print("<html xmlns=\"http://www.w3.org/1999/xhtml\" xml:lang=\"en\" lang=\"en\">"+
"<body>");
out.print("<img src=\"http://www.connotea.org/connotealogo.gif\" alt=\"connotea\"/><br/>");

for(Node n1= root.getFirstChild();n1!=null;n1=n1.getNextSibling())
{
if(isA(n1,CONNOTEA,"Post"))
{

String title=null;
String user=null;
String uri=null;
String created=null;
HashSet subject= new HashSet();
for(Node n2= n1.getFirstChild();n2!=null;n2=n2.getNextSibling())
{
if(isA(n2,DC,"creator")) { user=textContent(n2).trim();}
else if(isA(n2,CONNOTEA,"title")) { title=textContent(n2).trim();}
else if(isA(n2,CONNOTEA,"created")) {
created=textContent(n2).trim();
int T=created.indexOf('T');
if(T!=-1) created= created.substring(0,T);
}
else if(isA(n2,DC,"subject")) { subject.add(textContent(n2).trim());}
else if(isA(n2,CONNOTEA,"uri"))
{
for(Node n3= n2.getFirstChild();n3!=null;n3=n3.getNextSibling())
{
if(isA(n3,TERMS,"URI"))
{
Element e3=(Element)n3;
uri=e3.getAttributeNS(RDF,"about").trim();
}
}
}
}
out.print("<div>");

if(title!=null)
{
if(uri!=null) out.print("<h4><a target=\"ext\" title=\""+escape(uri)+"\" href=\""+escape(uri)+"\">");
out.print(""+escape(title));
if(uri!=null) out.print("</a>");
out.print(" (<a target=\"ext\" title=\"http://www.connotea.org/uri/"+md5+"\" href=\"http://www.connotea.org/uri/"+md5+"\">info</a>)");
out.print("</h4>");
}


if(user!=null)
{
out.print("Posted by <a target=\"ext\" href=\"http://www.connotea.org/user/"+user+"\">"+user+"</a>");
if(!subject.isEmpty())
{
out.print(" to");
for (Iterator iter = subject.iterator(); iter.hasNext();)
{
String sbj=escape((String)iter.next());
out.print(" <a target=\"ext\" href=\"http://www.connotea.org/tag/"+sbj+"\">"+sbj+"</a>");

}
}
if(created!=null ) out.print(" on <a target=\"ext\" href=\"http://www.connotea.org/date/"+created+"\">"+created+"</a>");
out.print("<br/>");
}
out.print("</div><hr/>");
}


}
out.print("<a title=\"mailto:plindenbaum@yahoo.fr\" href=\"mailto:plindenbaum@yahoo.fr\">plindenbaum@yahoo.fr</a> Pierre Lindenbaum PhD<br/>");
out.print("<div align=\"center\"><img border=\"1\" src=\"http://www.integragen.com/img/title.png\"/></div>");
out.print("</body></html>");
out.flush();
}


/**
* @see javax.servlet.http.HttpServlet#doPost(javax.servlet.http.HttpServletRequest, javax.servlet.http.HttpServletResponse) */
protected void doPost(HttpServletRequest req, HttpServletResponse res) throws ServletException, IOException
{
Document doc=null;
try
{
doc= newDocumentBuilder().parse(req.getInputStream());
}
catch(SAXException err)
{
debug("cannot get document from input stream");
throw new ServletException(err);
}

String context=null;
String content=null;
Element root= doc.getDocumentElement();
if(root==null) throw new ServletException("Cannot find root element of input");
debug("DOM1 content is "+DOM2String(root));
for(Node c1= root.getFirstChild();c1!=null;c1=c1.getNextSibling())
{
if(!isA(c1,RDF,"Description")) continue;
for(Node c2= c1.getFirstChild();c2!=null;c2=c2.getNextSibling())
{
if(c2.hasChildNodes()&& isA(c2,AN,"context"))
{
Element e2=(Element)c2;
context=textContent(e2).trim();
int xpointer=context.indexOf("#xpointer");
if(xpointer!=-1) context=context.substring(0,xpointer);
}
else if(c2.hasChildNodes()&& isA(c2,AN,"body"))
{
for(Node c3= c2.getFirstChild();c3!=null;c3=c3.getNextSibling())
{
if(!(c3.hasChildNodes()&& isA(c3,HTTP,"Message"))) continue;

for(Node c4= c3.getFirstChild();c4!=null;c4=c4.getNextSibling())
{
if(!(c4.hasChildNodes()&& isA(c4,HTTP,"Body"))) continue;
content= textContent(c4).trim().replaceAll("[ ]+"," ");
debug("DOM content is "+DOM2String(c4));
}

}
}
}
}
if(context==null)
{
throw new ServletException("Cannot find context in "+root);
}
if(content==null)
{
throw new ServletException("Cannot find content in "+DOM2String(root));
}


/** check if boomarks already exists */
try
{
String usertitle=null;
String comment=null;
StringBuffer description=new StringBuffer();
HashSet tags= new HashSet();

doc= getConnoteaRDF(
"http://www.connotea.org/data/user/"+
getLogin(req)+"/uri/"+
toMD5(context),req
);
if(doc!=null)
{
root= doc.getDocumentElement();
if(root!=null)
{
for(Node n1= root.getFirstChild();n1!=null;n1=n1.getNextSibling())
{
if(isA(n1,CONNOTEA,"Post"))
{
for(Node n2= n1.getFirstChild();n2!=null;n2=n2.getNextSibling())
{
if(isA(n2,CONNOTEA,"title")) { usertitle=textContent(n2).trim();}
else if(isA(n2,DC,"subject")) { tags.add(textContent(n2).trim().toLowerCase());}
else if(isA(n2,CONNOTEA,"description")) { description= new StringBuffer(textContent(n2).trim()+"\n");}
}
break;
}
}
}
debug("existed: title="+usertitle+" subject="+tags+ " desc="+description);
}


/* construct URL to connotea */

BufferedReader buffReader= new BufferedReader(new StringReader(content));
debug("content is "+content);
String line=null;

while((line=buffReader.readLine())!=null)
{
line= line.trim();
String upper= line.toUpperCase();
//parse tags
if(upper.startsWith("TAG:"))
{
try
{
StreamTokenizer parser= new StreamTokenizer(
new StringReader(line.substring(4))
);
parser.quoteChar('"');
parser.quoteChar('\'');
parser.eolIsSignificant(false);
parser.lowerCaseMode(true);
parser.ordinaryChars('0','9');
parser.wordChars('0','9');
parser.whitespaceChars(',',',');
parser.whitespaceChars(';',';');
parser.whitespaceChars('(','(');
parser.whitespaceChars(')',')');

while ( parser.nextToken() != StreamTokenizer.TT_EOF )
{
if ( parser.ttype == StreamTokenizer.TT_WORD)
{
tags.add(parser.sval.toLowerCase());
}
else if ( parser.ttype == StreamTokenizer.TT_NUMBER )
{
//hum... ignoring number
//items.add(""+parser.nval);
}
else if ( parser.ttype == StreamTokenizer.TT_EOL )
{
continue;
}
else if(parser.sval!=null)
{
tags.add(parser.sval.toLowerCase()) ;
}
}
}
catch(Exception ex)
{

}
}
else if(upper.startsWith("TI:"))
{
usertitle= line.substring(3).trim();
if(usertitle.length()==0) usertitle=null;
}
else if(upper.startsWith("COM:"))
{
comment= line.substring(4).trim();
if(comment.length()==0) comment=null;
}
else
{
description.append(line+"\n");
}
}
//put one tag if no tags was declared
if(tags.isEmpty()) tags.add("annoteated");

if(description.length()==0) description= new StringBuffer(context);
//build tags parameter
StringBuffer tagStr= new StringBuffer();
for (Iterator iter = tags.iterator(); iter.hasNext();)
{
if(tagStr.length()!=0) tagStr.append(",");
tagStr.append(iter.next().toString());
}

debug("tagstr="+tagStr);


URL url=new URL("http://www.connotea.org:80/data/add");

String postbody=
"uri=" + URLEncoder.encode(context,"UTF-8")+
(usertitle==null?"":"&usertitle="+URLEncoder.encode(usertitle,"UTF-8"))+
"&description="+URLEncoder.encode(description.toString(),"UTF-8")+
(comment==null?"":"&comment="+URLEncoder.encode(comment,"UTF-8"))+
"&tags=" + URLEncoder.encode(tagStr.toString(),"UTF-8")+
"&annoteaflag=hiBenThisIsPierre"
;

StringBuffer poststring= new StringBuffer();
poststring.append("POST "+url.getFile()+" HTTP/1.1\n");
poststring.append("Host: "+url.getHost()+"\n");
poststring.append("authorization: "+getAuthorization(req)+"\n");
poststring.append("Content-length: "+postbody.length()+"\n");
poststring.append("\n");
poststring.append(postbody.toString());


Socket socket= new Socket(url.getHost(),url.getPort());
InputStream from_server= socket.getInputStream();
PrintWriter to_server= new PrintWriter(
new OutputStreamWriter(socket.getOutputStream()));


to_server.print(poststring.toString());
to_server.flush();


StringBuffer response= new StringBuffer();
int c; while((c=from_server.read())!=-1) { response.append((char)c);}


to_server.close();
from_server.close();
debug("sent "+ poststring+" response is:\n"+response+"\n");
}
catch(Exception err)
{
debug("error "+err);
throw new ServletException(err);
}





{
String msg="<?xml version=\"1.0\" encoding=\"UTF-8\" ?>\n" +
"<rdf:RDF xmlns:rdf=\""+RDF+"\"\n" +
" xmlns:a=\""+AN+"\"\n" +
" xmlns:dc=\""+DC+"\">\n" +
" <rdf:Description rdf:about=\""+
getBaseURL(req)+"/"+toMD5(context)
+"\">\n" +
" <dc:creator>connotea user</dc:creator>\n"+
" <a:created>2006-01-31</a:created>\n"+
" <dc:date>2006-01-31</dc:date>\n"+
" <a:annotates rdf:resource=\""+escape(context)+"\"/>\n" +
" <rdf:type rdf:resource=\""+ AN +"Annotation\"/>\n" +
" <rdf:type rdf:resource=\"http://www.w3.org/2000/10/annotationType#Comment\"/>\n" +
//" <a:body rdf:resource=\""+getBaseURL(req)+"/body/"+toMD5(context) +"\"/>\n" +
" </rdf:Description>\n" +
"</rdf:RDF>";



res.setStatus(HttpServletResponse.SC_CREATED);
res.setHeader("Connection","close");
res.setHeader("Pragma","no-cache");
res.setContentType("text/xml");
res.setContentLength(msg.length());



PrintWriter out= res.getWriter();
out.print(msg);
debug("message sent is "+msg);
out.flush();
}
}




/** return the BASE URL of this servet */
private static String getBaseURL(HttpServletRequest req) throws ServletException
{
String scheme = req.getScheme(); // http
String serverName = req.getServerName(); // hostname.com
int serverPort = req.getServerPort(); // 80
String contextPath = req.getContextPath(); // /mywebapp
String servletPath = req.getServletPath(); // /servlet/MyServlet
//String pathInfo = req.getPathInfo(); // /a/b;c=123
//String queryString = req.getQueryString(); // d=789
return scheme+"://"+serverName+":"+serverPort+contextPath+servletPath;
}

/** creates a new namespace aware DocumentBuilder parsing DOM */
private static DocumentBuilder newDocumentBuilder() throws ServletException
{
DocumentBuilderFactory factory= DocumentBuilderFactory.newInstance();
factory.setCoalescing(true);
factory.setNamespaceAware(true);
factory.setExpandEntityReferences(true);
factory.setIgnoringComments(true);
factory.setValidating(false);
try
{
return factory.newDocumentBuilder();
} catch (ParserConfigurationException error)
{
throw new ServletException(error);
}
}

/** simple escape XML function */
public static String escape(String s)
{
if(s==null) return s;
StringBuffer buff= new StringBuffer();
for(int i=0;i< s.length();++i)
{
switch(s.charAt(i))
{
case('\"') : buff.append("&quot;"); break;
case('\'') : buff.append("&apos;"); break;
case('&') : buff.append("&amp;"); break;
case('<') : buff.append("&lt;"); break;
case('>') : buff.append("&gt;"); break;
default: buff.append(s.charAt(i)); break;
}
}
return buff.toString();
}

/** simple test for XML element */
static public boolean isA(Node e,String ns, String localname)
{
if(e==null) return false;
return ns.equals(e.getNamespaceURI()) && e.getLocalName().equals(localname);
}

/** get text content of a DOM node */
static public String textContent(Node node)
{
return textContent(node,new StringBuffer()).toString();
}

/** get text content of a DOM node */
static private StringBuffer textContent(Node node,StringBuffer s)
{
if(node==null) return s;
for(Node c= node.getFirstChild();c!=null;c=c.getNextSibling())
{
if(isA(c,XHTML,"br"))
{
s.append("\n");
}
else if(c.getNodeType()==Node.CDATA_SECTION_NODE)
{
s.append(((CDATASection)c).getNodeValue());
}
else if(c.getNodeType()==Node.TEXT_NODE)
{
s.append(((Text)c).getNodeValue());
}
else
{
textContent(c,s);
}
}
return s;
}


/** download a document from connotea or null if the url was not found */
private Document getConnoteaRDF(String url,HttpServletRequest req) throws ServletException
{
DocumentBuilder builder= newDocumentBuilder();
try
{
return builder.parse(
openSecureURLInputStream(
url,
req
)
);
}
catch (FileNotFoundException e)
{
debug("file not found for "+url+" returning empty rdf document");
return null;
}
catch(Exception err)
{
debug("cannot download :"+url+ " "+err);
throw new ServletException(err);
}
}


/** get login from header authorization */
static private String getLogin(HttpServletRequest req) throws ServletException
{
return getLoginAndPassword(req)[0];
}

/** get login and passord from header authorization */
static private String[] getLoginAndPassword(HttpServletRequest req) throws ServletException
{
String s64= getAuthorization(req);
if(!s64.startsWith("Basic ")) throw new ServletException("header \"authorization\" does not start with \"Basic \" ");
s64=s64.substring(6);
String decoded= Base64Decoder.decode(s64);
int loc= decoded.indexOf(':');
if(loc==-1) throw new ServletException("no \":\" in decoded authorization "+s64);
return new String[]{decoded.substring(0,loc),decoded.substring(loc+1)};
}

/** find parameter authorization in http request */
static String getAuthorization(HttpServletRequest req) throws ServletException
{
String s64= req.getHeader("authorization");
if(s64==null)
{
s64=req.getParameter("authorization");
}
if(s64==null)
{
s64=req.getParameter("Authorization");
}
if(s64==null)
{
Enumeration e= req.getHeaderNames();
StringBuffer headers= new StringBuffer();
while(e.hasMoreElements())
{
String key=(String)e.nextElement();
headers.append(key).append("=").append(req.getHeader(key)).append(";\n");
}
throw new ServletException("no header \"authorization\" was found in\n"+headers);
}
return s64;
}

/** open a URL, filling authorization header */
static private InputStream openSecureURLInputStream(String urlstr,HttpServletRequest req)
throws ServletException,FileNotFoundException
{
URL url=null;
String s64=null;
try
{
s64= getAuthorization(req);
url= new URL(urlstr);
URLConnection uc = url.openConnection();
uc.setRequestProperty("Authorization", s64);
InputStream content = uc.getInputStream();
return content;
}
catch (MalformedURLException e)
{
throw new ServletException(e);
}
catch (FileNotFoundException e)
{
throw e;
}
catch (IOException e)
{
throw new ServletException(e);
}
}


/** quick n dirty debugging function: append the message to "/tmp/logAnnotea.txt" */
static private void debug(Object o)
{
if(!DEBUG) return;
try
{
File f= new File("/tmp/logAnnotea.txt");
PrintWriter pout= new PrintWriter(new FileWriter(f,true));
pout.println("###"+System.currentTimeMillis()+"########");
pout.println(o);
pout.flush();
pout.close();

}
catch (Exception err)
{
err.printStackTrace();
}

}

private String DOM2String(Node doc)throws ServletException
{
StringWriter out= new StringWriter();
printDOM(new PrintWriter(out,true),doc);
return out.toString();
}

/* print DOM document for debugging purpose...*/
private static void printDOM(PrintWriter log,Node doc) throws ServletException
{
Source source = new DOMSource(doc);
Result result = new StreamResult(log);


try
{
Transformer xformer = TransformerFactory.newInstance().newTransformer();
xformer.transform(source, result);
}
catch(TransformerException error)
{
error.printStackTrace();
error.printStackTrace(log);
log.flush();
throw new ServletException(error);
}
}


}