Showing posts with label data. Show all posts

17 October 2013

Rapid prototyping of a read-only LIMS using JSON and Apache Velocity.

In a previous post, I showed how to use the Apache Velocity template engine to format JSON data.



Since that post, I've moved my application to a GitHub repository: https://github.com/lindenb/jsvelocity. The project contains a Java-based standalone tool to process the JSON data.
Here is an example. The JSON data:

{
individuals:[
    {
    name: "Riri",
    age: 8,
    duck: true
    },
    {
    name: "Fifi",
    age: 9,
    duck: true
    },
    {
    name: "Loulou",
    age: 10,
    duck: true
    }
    ]
}
... and the Velocity template:
#foreach($indi in ${all.individuals})
<h1>${indi['name']}</h1>
Age:${indi.age}<br/>${indi.duck}
#end
... with the following command line ...
$ java -jar dist/jsvelocity.jar \
    -f all test.json \
    test.vm
... produces the following output ...
<h1>Riri</h1>
Age:8<br/>true
<h1>Fifi</h1>
Age:9<br/>true
<h1>Loulou</h1>
Age:10<br/>true
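
For readers who don't know Velocity, here is a minimal sketch of what the template does, written as an ordinary Python loop. It is only a stand-in for illustration: it is not how jsvelocity works internally, and the helper names `as_text` and `render` are made up for this example.

```python
# Plain-Python stand-in for the Velocity template above (not jsvelocity itself).
data = {
    "individuals": [
        {"name": "Riri", "age": 8, "duck": True},
        {"name": "Fifi", "age": 9, "duck": True},
        {"name": "Loulou", "age": 10, "duck": True},
    ]
}


def as_text(value):
    """Render a value as the output above shows it (booleans in lower case)."""
    if isinstance(value, bool):
        return "true" if value else "false"
    return str(value)


def render(all_data):
    """Mimic '#foreach($indi in ${all.individuals})' with an ordinary loop."""
    lines = []
    for indi in all_data["individuals"]:
        lines.append(f"<h1>{as_text(indi['name'])}</h1>")
        lines.append(f"Age:{as_text(indi['age'])}<br/>{as_text(indi['duck'])}")
    return "\n".join(lines)


print(render(data))
```

The point of the template engine is that this loop lives in a text file (test.vm) instead of in the program, so the layout can be changed without recompiling anything.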


Today I wrote a web version of the tool using the Jetty server. I wanted to quickly write a web interface to display various summaries of our NGS experiments.
My JSON input looks like this:
{
"sequencer":[
 {
 "name":"HiSeq"
 },
 {
 "name":"MiSeq"
 }
 ],
"run":[ {
 "sequencer":"HiSeq",
 "flowcell":"C2AVTACXX",
 "name":"131010_C2AVTACXX_61",
 "date":"2013-10-10",
 "comment":"A comment",
 "samples":[
  {
  "name":"CD0001",
  "meancov": 10
  },
  {
  "name":"CD0002",
  "meancov": 20.0
  }
  ,
  {
  "name":"CD0003",
  "meancov": 30.0
  }
  ]
 },
 {
 "sequencer":"MiSeq",
 "flowcell":"C3VATACYY",
 "name":"131011_C3VATACYY_62",
 "date":"2013-10-11",
 "comment":"Another comment",
 "samples":[
  {
  "name":"CD0001",
  "meancov": 11
  },
  {
  "name":"CD0006",
  "meancov": 21.0
  }
  ,
  {
  "name":"CD0008",
  "meancov": null
  }
  ]
 },
 {
 "sequencer":"MiSeq",
 "flowcell":"C4VATACYZ",
 "name":"131012_C4VATACYZ_63",
 "date":"2013-10-12",
 "comment":"Another comment",
 "samples":[
  {
  "name":"CD0010",
  "meancov":1,
  "comment":"Failed, please, re-sequence"
  }
  ]
 }
 
 
 ],
"samples":[ 
 { "name":"CD0001" },
 { "name":"CD0002" },
 { "name":"CD0003" }, 
 { "name":"CD0004" }, 
 { "name":"CD0005" }, 
 { "name":"CD0006" }, 
 { "name":"CD0007" }, 
 { "name":"CD0008" },
 { "name":"CD0009" },
 { "name":"CD0010" }
 ],
"projects":[
 {
 "name":"Disease1",
 "description": "sequencing Project 1",
 "samples":["CD0001","CD0002","CD0006","CD0009"]
 },
 {
 "name":"Disease2",
 "description": "sequencing Project 2",
 "samples":["CD0002","CD0003","CD0008","CD0009"]
 }
 ]

}
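
To make the structure of this JSON more concrete, here is a small Python sketch of the kind of lookup a summary page needs: for each sample of a project, in which runs was it sequenced and with which mean coverage? The JSON below is a trimmed copy of the data above (first two runs, first project), and the function names are hypothetical; the real lims.vm template may organise things differently.

```python
import json

# Trimmed copy of the LIMS JSON above: the first two runs and the first project.
LIMS_JSON = """
{
 "run": [
  {"name": "131010_C2AVTACXX_61", "sequencer": "HiSeq",
   "samples": [{"name": "CD0001", "meancov": 10},
               {"name": "CD0002", "meancov": 20.0},
               {"name": "CD0003", "meancov": 30.0}]},
  {"name": "131011_C3VATACYY_62", "sequencer": "MiSeq",
   "samples": [{"name": "CD0001", "meancov": 11},
               {"name": "CD0006", "meancov": 21.0},
               {"name": "CD0008", "meancov": null}]}
 ],
 "projects": [
  {"name": "Disease1", "samples": ["CD0001", "CD0002", "CD0006", "CD0009"]}
 ]
}
"""


def runs_by_sample(lims):
    """Map each sample name to the (run name, mean coverage) pairs it appears in."""
    index = {}
    for run in lims["run"]:
        for sample in run["samples"]:
            index.setdefault(sample["name"], []).append(
                (run["name"], sample["meancov"])
            )
    return index


def project_report(lims, project_name):
    """For every sample of a project, list the runs it was found in (maybe none)."""
    index = runs_by_sample(lims)
    project = next(p for p in lims["projects"] if p["name"] == project_name)
    return {name: index.get(name, []) for name in project["samples"]}


lims = json.loads(LIMS_JSON)
for sample, hits in project_report(lims, "Disease1").items():
    print(sample, hits if hits else "not found in any run")
```

Note that the samples are linked to runs and projects by name only, so the 'database' has no referential integrity: a typo in a sample name silently produces an empty list.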
A single Velocity template is used to browse this 'database': https://github.com/lindenb/jsvelocity/blob/master/src/test/resources/velocity/lims.vm.
The server is started like this:
java -jar dist/webjsvelocity.jar  \
    -F lims src/test/resources/json/lims.json \
    src/test/resources/velocity/lims.vm

2013-10-17 12:43:35.566:INFO:oejs.Server:main: jetty-9.1.0.M0
2013-10-17 12:43:35.602:INFO:oejs.ServerConnector:main: Started ServerConnector@72dcb6{HTTP/1.1}{0.0.0.0:8080}
(...)
And here is a screenshot of the result:






That's it,

Pierre

18 July 2013

Inside the Variation toolkit: annotating a VCF with the data of NCBI biosystems mapped to BED.

Let's annotate a VCF file with the data from the NCBI biosystem.

First, the 'NCBI biosystem' data are mapped to a BED file using the following script. It joins "ncbi:biosystem2gene", "ncbi:biosystem-label" and "biomart-ensembl:gene".


It produces a tabix-indexed BED mapping the data of 'NCBI biosystem':

$ gunzip -c ncbibiosystem.bed.gz | head
1 69091 70008 79501 106356 30 Signaling_by_GPCR
1 69091 70008 79501 106383 50 Olfactory_Signaling_Pathway
1 69091 70008 79501 119548 40 GPCR_downstream_signaling
1 69091 70008 79501 477114 30 Signal_Transduction
1 69091 70008 79501 498 40 Olfactory_transduction
1 69091 70008 79501 83087 60 Olfactory_transduction
1 367640 368634 26683 106356 30 Signaling_by_GPCR
1 367640 368634 26683 106383 50 Olfactory_Signaling_Pathway
1 367640 368634 26683 119548 40 GPCR_downstream_signaling
1 367640 368634 26683 477114 30 Signal_Transduction
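
The join behind this BED file can be sketched in a few lines of Python. This is only an illustration under assumptions: the three toy tables hold nothing but values visible in the listing above, their column layout is my guess at what the real NCBI and BioMart files provide, and `join_to_bed` is a hypothetical name.

```python
# Toy tables holding only values visible in the BED listing above.
# The real inputs are much larger and their column layout is an assumption here.
biosystem2gene = [  # (biosystem id, gene id, score)
    (106356, 79501, 30),
    (106383, 79501, 50),
    (106356, 26683, 30),
]
biosystem_label = {  # biosystem id -> label
    106356: "Signaling_by_GPCR",
    106383: "Olfactory_Signaling_Pathway",
}
gene_position = {  # gene id -> (chromosome, start, end)
    79501: ("1", 69091, 70008),
    26683: ("1", 367640, 368634),
}


def join_to_bed(links, labels, positions):
    """Inner-join the three tables and return rows sorted like a BED file."""
    rows = []
    for bsid, gene_id, score in links:
        if gene_id not in positions or bsid not in labels:
            continue  # inner join: drop links with a missing side
        chrom, start, end = positions[gene_id]
        rows.append((chrom, start, end, gene_id, bsid, score, labels[bsid]))
    # tabix expects the file to be sorted by chromosome and position
    return sorted(rows, key=lambda r: (r[0], r[1], r[2]))


for row in join_to_bed(biosystem2gene, biosystem_label, gene_position):
    print("\t".join(str(field) for field in row))
```

The result has the same seven columns as the listing: chromosome, start, end, gene id, biosystem id, score and label.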

I wrote a tool named VCFBed: it takes a VCF as input and annotates it with the data of a tabix-indexed BED. This tool is available on GitHub at https://github.com/lindenb/jvarkit#vcfbed. Let's annotate a remote VCF with ncbibiosystem.bed.gz:
java -jar dist/vcfbed.jar \
   TABIXFILE=~/ncbibiosystem.bed.gz \
   TAG=NCBIBIOSYS \
   FMT='($4|$5|$6|$7)'|\
grep -E '(^#CHR|NCBI)'

##INFO=<ID=NCBIBIOSYS,Number=.,Type=String,Description="metadata added from /home/lindenb/ncbibiosystem.bed.gz . Format was ($4|$5|$6|$7)">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 1094PC0005 1094PC0009 1094PC0012 1094PC0013
1 69270 . A G 2694.18 . AC=40;AF=1.000;AN=40;DP=83;Dels=0.00;EFF=SYNONYMOUS_CODING(LOW|SILENT|tcA/tcG|S60|305|OR4F5|protein_coding|CODING|ENST00000335137|exon_1_69091_70008);FS=0.000;HRun=0;HaplotypeScore=0.0000;InbreedingCoeff=-0.0598;MQ=31.06;MQ0=0;NCBIBIOSYS=(79501|119548|40|GPCR_downstream_signaling),(79501|106356|30|Signaling_by_GPCR),(79501|498|40|Olfactory_transduction),(79501|83087|60|Olfactory_transduction),(79501|477114|30|Signal_Transduction),(79501|106383|50|Olfactory_Signaling_Pathway);QD=32.86 GT:AD:DP:GQ:PL ./. ./. 1/1:0,3:3:9.03:106,9,0 1/1:0,6:6:18.05:203,18,0
1 69511 . A G 77777.27 . AC=49;AF=0.875;AN=56;BaseQRankSum=0.150;DP=2816;DS;Dels=0.00;EFF=NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|Aca/Gca|T141A|305|OR4F5|protein_coding|CODING|ENST00000335137|exon_1_69091_70008);FS=21.286;HRun=0;HaplotypeScore=3.8956;InbreedingCoeff=0.0604;MQ=32.32;MQ0=0;MQRankSum=1.653;NCBIBIOSYS=(79501|119548|40|GPCR_downstream_signaling),(79501|106356|30|Signaling_by_GPCR),(79501|498|40|Olfactory_transduction),(79501|83087|60|Olfactory_transduction),(79501|477114|30|Signal_Transduction),(79501|106383|50|Olfactory_Signaling_Pathway);QD=27.68;ReadPosRankSum=2.261 GT:AD:DP:GQ:PL ./. ./. 0/1:2,4:6:15.70:16,0,40 0/1:2,2:4:21.59:22,0,40
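
To show the idea behind the annotation, here is a minimal Python sketch that finds the BED records overlapping a variant and formats them with a pattern such as '($4|$5|$6|$7)'. It is not the vcfbed code: it scans a small in-memory list instead of querying a tabix index, the function names are hypothetical, and I assume the usual conventions (BED intervals are 0-based and half-open, VCF positions are 1-based).

```python
import re

# Three records copied from the BED listing above (tab-separated).
BED_TEXT = """\
1\t69091\t70008\t79501\t106356\t30\tSignaling_by_GPCR
1\t69091\t70008\t79501\t106383\t50\tOlfactory_Signaling_Pathway
1\t367640\t368634\t26683\t106356\t30\tSignaling_by_GPCR
"""


def parse_bed(text):
    """Split each non-empty line into its tab-separated columns."""
    return [line.split("\t") for line in text.splitlines() if line]


def overlaps(record, chrom, pos):
    """True if the 1-based VCF position falls inside the 0-based, half-open BED interval."""
    return record[0] == chrom and int(record[1]) < pos <= int(record[2])


def apply_format(record, fmt):
    """Replace every $N in the pattern with the N-th (1-based) BED column."""
    return re.sub(r"\$(\d+)", lambda m: record[int(m.group(1)) - 1], fmt)


def annotate(records, chrom, pos, fmt="($4|$5|$6|$7)"):
    """Build the value of the INFO tag for one variant (empty string if no hit)."""
    hits = [apply_format(r, fmt) for r in records if overlaps(r, chrom, pos)]
    return ",".join(hits)


records = parse_bed(BED_TEXT)
print("NCBIBIOSYS=" + annotate(records, "1", 69270))
```

The order of the annotations in the real output above differs from the order of the BED file, so this sketch should not be expected to reproduce it item for item.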

That's it,

Pierre

15 July 2013

Playing with the "UCSC Genome Browser Track Hubs": my notebook

The UCSC has recently created the Genome Browser Track Hubs: "Track hubs are web-accessible directories of genomic data that can be viewed on the UCSC Genome Browser." I've created a hub for the Rotavirus genome, hosted on GitHub at: https://github.com/lindenb/genomehub.
My data were primarily described in an XML file. It contains a description of the genome and of the tracks, the path to the FASTA sequence, etc. The FASTA sequence was provided by Dr Didier Poncet (CNRS/Gig). As far as I understand, it is not currently possible to specify that a track describes a protein.

<?xml version="1.0" encoding="UTF-8"?>
<genomeHub >
  <name>Rotavirus</name>
  <shortLabel>Rotavirus</shortLabel>
  <longLabel>Rotavirus</longLabel>
  (...)
  <accessions id="set1">
   <acn>GU144588</acn>
 <acn source="uniprot">Q0H8C5</acn>
 <acn source="uniprot">Q45UF6</acn>
 (..)
  <genome id="rf11">
    <description>Rotavirus RF11</description>
    <organism>Rotavirus</organism>
    <defaultPos>RF01:1-10</defaultPos>
    <scientificName>Rotavirus</scientificName>
    <organism>Rotavirus</organism>
    <orderKey>10970</orderKey>
    <fasta>rotavirus/rf/rf.fa</fasta>
     (...)
 <group id="active_site"><accessions ref="set1"/><include>active site</include></group>
 <group id="calcium-binding_region"><accessions ref="set1"/><include>calcium-binding region</include></group>
 <group id="chain"><accessions ref="set1"/><include>chain</include></group>
   (...)
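
As an illustration of how such a description can drive the hub files, here is a Python sketch that reads a closed-up, well-formed subset of the XML above and prints a stanza in the style of a UCSC genomes.txt file. This is an assumption-laden sketch, not my actual pipeline (which uses XSLT): the function name is hypothetical, and I assume that the trackDb file and the 2bit file sit next to the FASTA file, which matches the directory listing further down.

```python
import xml.etree.ElementTree as ET
from pathlib import PurePosixPath

# A well-formed subset of the excerpt above; the elided parts are simply left out.
HUB_XML = """<genomeHub>
  <name>Rotavirus</name>
  <genome id="rf11">
    <description>Rotavirus RF11</description>
    <organism>Rotavirus</organism>
    <defaultPos>RF01:1-10</defaultPos>
    <scientificName>Rotavirus</scientificName>
    <orderKey>10970</orderKey>
    <fasta>rotavirus/rf/rf.fa</fasta>
  </genome>
</genomeHub>"""


def genome_stanza(genome):
    """Turn one <genome> element into 'key value' lines for genomes.txt."""
    fasta = PurePosixPath(genome.findtext("fasta"))
    fields = [
        ("genome", genome.get("id")),
        # Assumption: trackDb.txt and the .2bit file live beside the FASTA file.
        ("trackDb", str(fasta.with_name("trackDb.txt"))),
        ("twoBitPath", str(fasta.with_suffix(".2bit"))),
        ("description", genome.findtext("description")),
        ("organism", genome.findtext("organism")),
        ("defaultPos", genome.findtext("defaultPos")),
        ("orderKey", genome.findtext("orderKey")),
        ("scientificName", genome.findtext("scientificName")),
    ]
    return "\n".join(f"{key} {value}" for key, value in fields)


root = ET.fromstring(HUB_XML)
for genome in root.findall("genome"):
    print(genome_stanza(genome))
```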
This XML file is then processed with the following XSL stylesheet: https://github.com/lindenb/genomehub/blob/master/data/genomehub.xml. It generates a Makefile that translates the FASTA sequence to 2bit, creates the BED files by aligning some annotated files to the reference with BLAST, and converts them to bigBed.
At the end, my directory contains the following files:
./data/genomehub.xml
./data/genomehub2make.xsl
./data/sequence2fasta.xsl
./data/hub.txt
./data/genomes.txt
./data/rotavirus
./data/rotavirus/rf
./data/rotavirus/rf/signal_peptide.bed
./data/rotavirus/rf/CDS.bed
./data/rotavirus/rf/turn.bb
./data/rotavirus/rf/chrom.sizes
./data/rotavirus/rf/site.bed
./data/rotavirus/rf/coiled-coil_region.bed
./data/rotavirus/rf/mutagenesis_site.bb
./data/rotavirus/rf/UTR.bed
./data/rotavirus/rf/reference.fa~
./data/rotavirus/rf/misc_feature.bed
./data/rotavirus/rf/CDS.bb
./data/rotavirus/rf/helix.bed
./data/rotavirus/rf/strand.bb
./data/rotavirus/rf/sequence_conflict.bb
./data/rotavirus/rf/modified_residue.bb
./data/rotavirus/rf/coiled-coil_region.bb
./data/rotavirus/rf/topological_domain.bb
./data/rotavirus/rf/active_site.bed
./data/rotavirus/rf/sequence_variant.bb
./data/rotavirus/rf/transmembrane_region.bb
./data/rotavirus/rf/zinc_finger_region.bed
./data/rotavirus/rf/region_of_interest.bb
./data/rotavirus/rf/glycosylation_site.bb
./data/rotavirus/rf/domain.bb
./data/rotavirus/rf/region_of_interest.bed
./data/rotavirus/rf/misc_feature.bb
./data/rotavirus/rf/topological_domain.bed
./data/rotavirus/rf/sequence_conflict.bed
./data/rotavirus/rf/UTR.bb
./data/rotavirus/rf/compositionally_biased_region.bed
./data/rotavirus/rf/chain.bed
./data/rotavirus/rf/glycosylation_site.bed
./data/rotavirus/rf/trackDb.txt
./data/rotavirus/rf/modified_residue.bed
./data/rotavirus/rf/disulfide_bond.bed
./data/rotavirus/rf/strand.bed
./data/rotavirus/rf/helix.bb
./data/rotavirus/rf/compositionally_biased_region.bb
./data/rotavirus/rf/transmembrane_region.bed
./data/rotavirus/rf/rf.fa
./data/rotavirus/rf/rf.2bit
./data/rotavirus/rf/splice_variant.bed
./data/rotavirus/rf/short_sequence_motif.bed
./data/rotavirus/rf/rf.fa.nsq
./data/rotavirus/rf/ALL.bed.blast.xml~
./data/rotavirus/rf/gene.bed
./data/rotavirus/rf/sequence_variant.bed
./data/rotavirus/rf/disulfide_bond.bb
./data/rotavirus/rf/signal_peptide.bb
./data/rotavirus/rf/rf.fa.nin
./data/rotavirus/rf/short_sequence_motif.bb
./data/rotavirus/rf/turn.bed
./data/rotavirus/rf/domain.bed
./data/rotavirus/rf/mutagenesis_site.bed
./data/rotavirus/rf/zinc_finger_region.bb
./data/rotavirus/rf/chain.bb
./data/rotavirus/rf/rf.fa.nhr
./data/rotavirus/rf/splice_variant.bb
./data/rotavirus/rf/active_site.bb
./data/rotavirus/rf/site.bb
./data/rotavirus/rf/description.html
./README.md
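
The conversion steps behind these files can be sketched as a list of command lines. The Python below only builds and prints the commands, it does not run them, and it is not the generated Makefile: `build_commands` is a hypothetical helper, and the BLAST step and the creation of chrom.sizes are left out. It assumes the UCSC tools `faToTwoBit` (input FASTA, output 2bit) and `bedToBigBed` (input BED, chrom.sizes, output bigBed).

```python
import shlex
from pathlib import PurePosixPath


def build_commands(fasta, bed_files):
    """Return the command lines that convert one FASTA and its BED files."""
    fasta = PurePosixPath(fasta)
    two_bit = fasta.with_suffix(".2bit")
    # Assumption: chrom.sizes sits in the same directory as the FASTA file.
    chrom_sizes = fasta.with_name("chrom.sizes")
    commands = [["faToTwoBit", str(fasta), str(two_bit)]]
    for bed in bed_files:
        bed = PurePosixPath(bed)
        big_bed = bed.with_suffix(".bb")
        # bedToBigBed expects a BED file sorted by chromosome and start position.
        commands.append(["bedToBigBed", str(bed), str(chrom_sizes), str(big_bed)])
    return commands


for command in build_commands(
    "data/rotavirus/rf/rf.fa",
    ["data/rotavirus/rf/CDS.bed", "data/rotavirus/rf/chain.bed"],
):
    print(shlex.join(command))
```

Each .bed/.bb pair in the listing above corresponds to one `bedToBigBed` line of this kind.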

The files required by the UCSC are then pushed to GitHub and the URL pointing to hub.txt (https://raw.github.com/lindenb/genomehub/master/data/hub.txt) is registered at http://genome.ucsc.edu/cgi-bin/hgHubConnect. And a few clicks later...





That's it,

Pierre