20 December 2012

RDF/Jena: a simple extension for XSLT/XALAN. Testing with NCBI-Gene

In a previous post, I've shown that the XALAN XSLT engine can be extended with custom function returning a DOM Document that will be used by the xslt-stylesheet. Here, I'll create an extension for XALAN getting some RDF statements from a Jena/RDF model. The RDF model will be loaded in memory but one can imagine to use a persistent model ( TDB or SDB). I'll download a record from NCBI-gene, transform it to html and use the disease-ontology database as RDF to annotate it.

A Gene record is downloaded as XML from NCBI gene:

curl "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=gene&id=4853&retmode=xml" > notch2.html

The disease ontology is downloaded as RDF/XML:

curl -odoid.owl "http://www.berkeleybop.org/ontologies/doid.owl"

The XSLT Stylesheet

The stylesheet declares the extension jena, loads the RDF model ("$model"), searches for the OMIM identifiers in the Gene record and loads the RDF statements related to that OMIM-ID.
For example the following xpath expression:

jena:query(
   $model,
   $doiid,
   'http://www.geneontology.org/formats/oboInOwl#hasExactSynonym',
   ''
   )

returns a rdf/XML document containing the RDF statements having a subject=$doiid, a property "http://www.geneontology.org/formats/oboInOwl#hasExactSynonym" and any object.

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Statement>
    <rdf:subject rdf:resource="http://purl.obolibrary.org/obo/DOID_0050721"/>
    <rdf:predicate rdf:resource="http://www.geneontology.org/formats/oboInOwl#hasExactSynonym"/>
    <rdf:object>Phosphoserine phosphatase deficiency</rdf:object>
  </rdf:Statement>
</rdf:RDF>

The stylesheet:

The Java code

This is the java extension: the constructor loads the RDF model in memory. The function query(..) returns a RDF/XML document matching the query.

Makefile

config.mk:

Result

java -cp ${class.path} org.apache.xalan.xslt.Process \
 -IN notch2.xml \
 -XSL gene2html.xsl -EDUMP -OUT result.html

NOTCH2

Omim ID 610205

Label

Alagille syndrome

Synonym

Arteriohepatic dysplasia (disorder)

Sub-Class Of

Label

gastrointestinal system disease

Synonym

gastrointestinal disease

Sub-Class Of

Label

disease of anatomical entity

Sub-Class Of

Label: disease

Omim ID 102500

Label

Hajdu-Cheney syndrome

Synonym

Hajdu-Cheney syndrome (disorder)

Sub-Class Of

Label

autosomal dominant disease

Sub-Class Of

Label

autosomal genetic disease

Sub-Class Of

Label

monogenic disease

Sub-Class Of

Label

genetic disease

Sub-Class Of

Label: disease

That's it,

Pierre

29 November 2012

Reading/Writing a VCF file with the GATK-API.

This is a simple post to save my notes about reading a VCF file and writing it to another file using the java libraries of the GATK. The only way I found requires a SAMSequenceDictionary and always writes an index.:

The code

import java.io.*;
import org.broad.tribble.AbstractFeatureReader;
import org.broad.tribble.FeatureReader;
import org.broadinstitute.sting.utils.Utils;
import org.broadinstitute.sting.utils.codecs.vcf.*;
import org.broadinstitute.sting.utils.variantcontext.VariantContext;
import org.broadinstitute.sting.utils.variantcontext.writer.*;
import net.sf.samtools.SAMSequenceDictionary;
import net.sf.picard.reference.*;

import java.util.Iterator;
import java.util.Map;
/**
 * motivation:
 *      copy a VCF 
 * usage:
 * javac -cp ${GATK}  ReadVCF.java
 * java -cp ${GATK}:. ReadVCF ref.fa my.vcf
 */
public class ReadVCF
 {
 public static void main(String args[]) throws Exception
  {
  /** latest VCF specification */
  final VCFCodec vcfCodec = new VCFCodec();
  /** we don't need some indexed VCFs */
  boolean requireIndex=false;
  /* load a SAM sequence dictionary */
  SAMSequenceDictionary dict=new IndexedFastaSequenceFile(
    new File(args[0])).getSequenceDictionary();
  /* loop over each vcf */
  for(int i=1;i< args.length;++i)
   {
   /* input VCF */
   String filename=args[i];
   /* output VCF */
   File fileout=new File("tmp"+i+".vcf"); 
   VariantContextWriter writer=VariantContextWriterFactory.create(fileout,dict);
   /* get A VCF Reader */
   FeatureReader<VariantContext> reader = AbstractFeatureReader.getFeatureReader(
      filename, vcfCodec, requireIndex);
   /* read the header */
   VCFHeader header = (VCFHeader)reader.getHeader();
   /* write the header */
   writer.writeHeader(header);
   /** loop over each Variation */
   Iterator<VariantContext> it = reader.iterator();
              while ( it.hasNext() )
               {
               /* get next variation and save it */
     VariantContext vc = it.next();
     writer.add(vc);
    }
   /* we're done */
   reader.close();
   writer.close();
   }  
  }
 }

Makefile

GATK=GenomeAnalysisTKLite-2.2-15-g4828906/GenomeAnalysisTKLite.jar
VCF=gatk-master/public/testdata/exampleDBSNP.vcf
REF=./gatk-master/public/testdata/exampleFASTA.fasta
all: 
 javac -cp ${GATK} -nowarn ReadVCF.java
 java -cp ${GATK}:. ReadVCF $(REF) ${VCF}

That's it,
Pierre

21 November 2012

visualizing the dependencies in a Makefile

Update 2014: I wrote a C version at https://github.com/lindenb/makefile2graph.

I've just coded a tool to visualize the dependencies in a Makefile. The java source code is available on github at : https://github.com/lindenb/jsandbox/blob/master/src/sandbox/MakeGraphDependencies.java. This simple tool parses the ouput of

make -dq

( here option '-d' is 'Print lots of debugging information' and '-q' is 'Run no commands') and prints a graphiz-dot file.

Example

Below is a simple NGS workflow:

%.bam.bai : %.bam
 
file.vcf:  merged.bam.bai ref.fa
merged.bam : sorted1.bam sorted2.bam
sorted1.bam: lane1_1.fastq  lane1_2.fastq ref.fa
sorted2.bam: lane2_1.fastq  lane2_2.fastq ref.fa

Invoking the program:

make -d --dry-run | java -jar makegraphdependencies.jar

generates the following graphiz-dot file:

digraph G {
n9[label="sorted2.bam" ];
n3[label="merged.bam.bai" ];
n10[label="lane2_1.fastq" ];
n11[label="lane2_2.fastq" ];
n2[label="file.vcf" ];
n4[label="merged.bam" ];
n6[label="lane1_1.fastq" ];
n8[label="ref.fa" ];
n7[label="lane1_2.fastq" ];
n0[label="[ROOT]" ];
n5[label="sorted1.bam" ];
n1[label="Makefile" ];
n10->n9;
n11->n9;
n8->n9;
n4->n3;
n3->n2;
n8->n2;
n9->n4;
n5->n4;
n2->n0;
n1->n0;
n6->n5;
n8->n5;
n7->n5;
}

The result: (here using the google chart API for Graphviz)

That's it,
Pierre

13 November 2012

Creating a virtual RDF graph describing a set of OpenOffice spreadsheets with Apache Jena and Fuseki

In the current post, I will use the Jena API for RDF to implement a virtual RDF graph describing the content of a set of openoffice/libreoffice spreasheets.

Fact: An openoffice file (*.ods) is a Zip file

An openoffice file is nothing but a zip file:

$ unzip -t jeter.ods 
Archive:  jeter.ods
    testing: mimetype                 OK
    testing: meta.xml                 OK
    testing: settings.xml             OK
    testing: content.xml              OK
    testing: Thumbnails/thumbnail.png   OK
    testing: Configurations2/images/Bitmaps/   OK
    testing: Configurations2/popupmenu/   OK
    testing: Configurations2/toolpanel/   OK
    testing: Configurations2/statusbar/   OK
    testing: Configurations2/progressbar/   OK
    testing: Configurations2/toolbar/   OK
    testing: Configurations2/menubar/   OK
    testing: Configurations2/accelerator/current.xml   OK
    testing: Configurations2/floater/   OK
    testing: styles.xml               OK
    testing: META-INF/manifest.xml    OK
No errors detected in compressed data of jeter.ods.

The entry content.xml is a XML file describing the tables in the spreadsheet:

$ unzip -c jeter.ods content.xml |\
grep -v Archive |\
grep -v inflating | xmllint --format - |\
head -n 20


<?xml version="1.0" encoding="UTF-8"?>
<office:document-content xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0" xmlns:style="urn:oasis:names:tc:opendocument:xmlns:style:1.0" xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0" xmlns:table="urn:oasis:names:tc:opendocument:xmlns:table:1.0" xmlns:draw="urn:oasis:names:tc:opendocument:xmlns:drawing:1.0" xmlns:fo="urn:oasis:names:tc:opendocument:xmlns:xsl-fo-compatible:1.0" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:meta="urn:oasis:names:tc:opendocument:xmlns:meta:1.0" xmlns:number="urn:oasis:names:tc:opendocument:xmlns:datastyle:1.0" xmlns:presentation="urn:oasis:names:tc:opendocument:xmlns:presentation:1.0" xmlns:svg="urn:oasis:names:tc:opendocument:xmlns:svg-compatible:1.0" xmlns:chart="urn:oasis:names:tc:opendocument:xmlns:chart:1.0" xmlns:dr3d="urn:oasis:names:tc:opendocument:xmlns:dr3d:1.0" xmlns:math="http://www.w3.org/1998/Math/MathML" xmlns:form="urn:oasis:names:tc:opendocument:xmlns:form:1.0" xmlns:script="urn:oasis:names:tc:opendocument:xmlns:script:1.0" xmlns:ooo="http://openoffice.org/2004/office" xmlns:ooow="http://openoffice.org/2004/writer" xmlns:oooc="http://openoffice.org/2004/calc" xmlns:dom="http://www.w3.org/2001/xml-events" xmlns:xforms="http://www.w3.org/2002/xforms" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:rpt="http://openoffice.org/2005/report" xmlns:of="urn:oasis:names:tc:opendocument:xmlns:of:1.2" xmlns:xhtml="http://www.w3.org/1999/xhtml" xmlns:grddl="http://www.w3.org/2003/g/data-view#" xmlns:tableooo="http://openoffice.org/2009/table" xmlns:field="urn:openoffice:names:experimental:ooo-ms-interop:xmlns:field:1.0" xmlns:formx="urn:openoffice:names:experimental:ooxml-odf-interop:xmlns:form:1.0" xmlns:css3t="http://www.w3.org/TR/css3-text/" office:version="1.2">
  <office:scripts/>
  <office:font-face-decls>
    <style:font-face style:name="Liberation Sans" svg:font-family="'Liberation Sans'" style:font-family-generic="swiss" style:font-pitch="variable"/>
    <style:font-face style:name="DejaVu Sans" svg:font-family="'DejaVu Sans'" style:font-family-generic="system" style:font-pitch="variable"/>
    <style:font-face style:name="Lohit Hindi" svg:font-family="'Lohit Hindi'" style:font-family-generic="system" style:font-pitch="variable"/>
    <style:font-face style:name="WenQuanYi Micro Hei" svg:font-family="'WenQuanYi Micro Hei'" style:font-family-generic="system" style:font-pitch="variable"/>
  </office:font-face-decls>
  <office:automatic-styles>
    <style:style style:name="co1" style:family="table-column">
      <style:table-column-properties fo:break-before="auto" style:column-width="0.889in"/>
    </style:style>
    <style:style style:name="ro2" style:family="table-row">
      <style:table-row-properties style:row-height="0.178in" fo:break-before="auto" style:use-optimal-row-height="true"/>
    </style:style>
    <style:style style:name="ro3" style:family="table-row">
      <style:table-row-properties style:row-height="0.1681in" fo:break-before="auto" style:use-optimal-row-height="true"/>
    </style:style>
    <style:style style:name="ta1" style:family="table" style:master-page-name="Default">

Fact: Implementing a simple virtual RDF graph with Jena is easy

By virtual I mean that there is no RDFStore, the triples are created on the fly.
Implementing a simple virtual RDF graph with Jena is easy: you simply have to extend the class com.hp.hpl.jena.graph.impl.GraphBase and only implement the method graphBaseFind which returns all the RDF Triples matching a TripleMatch.

(...)
 @Override
    protected ExtendedIterator<Triple> graphBaseFind(TripleMatch matcher)
        {
        return ...;
        }
(...)

The code

My implementation of a RDFGraph for a set of OpenOffice Calc is not effective but it works fine: for each call of graphBaseFind, it creates an "Iterator<Triple>" scanning each content.xml entry of each openoffice file. This iterator creates some new Triples, add them to a list of Triples that will be filtered by the TripleMatcher.

Compilation

the Makefile:

CP=...#path to the jars of JENA/ARQ/etc... e.g: =`find ${ARQ} -name "*.jar" |  | tr "\n" ":"`
.PHONY: all
all:
 javac -cp ${CP} -sourcepath src src/oocalc/OpenOfficeCalcGraph.java
 jar cvf dist/openoffice2rdf.jar -C src .

Querying using sparql

Now that the Graph has been implemented and compiled, one can query it using ARQ, the sparql engine of Jena:

The spreadsheet

I've created the following spreadsheet and saved it in a file named "jeter.ods":

CHROM	START	END	NAME
chr1	100	200	rs654
chr1	150	250	rs264
chr1	200	300	rs610
chr1	250	350	rs929
chr1	300	400	rs408
chr1	350	450	rs346
chr1	400	500	rs430
chr1	450	550	rs735
chr1	500	600	rs575
chr1	550	650	rs891
chr1	600	700	rs627
chr1	650	750	rs650
chr1	700	800	rs715
chr1	750	850	rs467
chr1	800	900	rs882
chr1	850	950	rs301
chr1	900	1000	rs643
chr1	950	1050	rs246
chr1	1000	1100	rs178
chr1	1050	1150	rs928
chr1	1100	1200	rs213

The sparql query

The following SPARQL returns the informations about the cells in the 3rd row of the spreadsheet:

Invoke:

java -cp `find /home/lindenb/.ivy2/cache -name "*.jar" | tr "\n" ":"`:dist/openoffice2rdf.jar  \
 oocalc.OpenOfficeCalcGraph test.sparql /home/lindenb/jeter.ods

Result:

-----------------------------------------------------------------------------------------------------------------------------------
| s                                       | p                                                 | o                                 |
===================================================================================================================================
| <file:/home/lindenb/jeter.ods/t1/y3/x1> | office:table                                      | <file:/home/lindenb/jeter.ods/t1> |
| <file:/home/lindenb/jeter.ods/t1/y3/x1> | <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> | office:Cell                       |
| <file:/home/lindenb/jeter.ods/t1/y3/x1> | office:X                                          | "1"^^xsd:int                      |
| <file:/home/lindenb/jeter.ods/t1/y3/x1> | office:Y                                          | "3"^^xsd:int                      |
| <file:/home/lindenb/jeter.ods/t1/y3/x1> | office:value                                      | "chr1"                            |
| <file:/home/lindenb/jeter.ods/t1/y3/x2> | office:table                                      | <file:/home/lindenb/jeter.ods/t1> |
| <file:/home/lindenb/jeter.ods/t1/y3/x2> | <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> | office:Cell                       |
| <file:/home/lindenb/jeter.ods/t1/y3/x2> | office:X                                          | "2"^^xsd:int                      |
| <file:/home/lindenb/jeter.ods/t1/y3/x2> | office:Y                                          | "3"^^xsd:int                      |
| <file:/home/lindenb/jeter.ods/t1/y3/x2> | office:value                                      | "150"^^xsd:float                  |
| <file:/home/lindenb/jeter.ods/t1/y3/x3> | office:table                                      | <file:/home/lindenb/jeter.ods/t1> |
| <file:/home/lindenb/jeter.ods/t1/y3/x3> | <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> | office:Cell                       |
| <file:/home/lindenb/jeter.ods/t1/y3/x3> | office:X                                          | "3"^^xsd:int                      |
| <file:/home/lindenb/jeter.ods/t1/y3/x3> | office:Y                                          | "3"^^xsd:int                      |
| <file:/home/lindenb/jeter.ods/t1/y3/x3> | office:value                                      | "250"^^xsd:float                  |
| <file:/home/lindenb/jeter.ods/t1/y3/x4> | office:table                                      | <file:/home/lindenb/jeter.ods/t1> |
| <file:/home/lindenb/jeter.ods/t1/y3/x4> | <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> | office:Cell                       |
| <file:/home/lindenb/jeter.ods/t1/y3/x4> | office:X                                          | "4"^^xsd:int                      |
| <file:/home/lindenb/jeter.ods/t1/y3/x4> | office:Y                                          | "3"^^xsd:int                      |
| <file:/home/lindenb/jeter.ods/t1/y3/x4> | office:value                                      | "rs264"                           |
-----------------------------------------------------------------------------------------------------------------------------------

Serving the OpenOffice spreadsheets as RDF over HTTP

Fuseki is a SPARQL server. It provides REST-style SPARQL HTTP Update, SPARQL Query, and SPARQL Update using the SPARQL protocol over HTTP. We're going to deploy the OpenOfficeCalcGraph in Fuseki to query a set of OpenOffice files.

Download an install Fuseki

wget https://repository.apache.org/content/repositories/releases/org/apache/jena/jena-fuseki/0.2.5/jena-fuseki-0.2.5-distribution.tar.gz
tar xfz jena-fuseki-0.2.5-distribution.tar.gz
rm jena-fuseki-0.2.5-distribution.tar.gz

Tell Fuseki about our OpenOfficeCalcGraph

We need to create a config file for Fuseki. That was the most complicated part as the process is not clearly documented:

The line:

[] ja:loadClass "oocalc.OpenOfficeCalcGraph" .

loads the class oocalc.OpenOfficeCalcGraph. The class OpenOfficeCalcGraph contains a static initialisation method:

(...)
static { init() ; }
    private static void init()
        {
        (...)

In this static method, a Jena Assembler for OpenOfficeCalcGraph is registered under the resource named: "http://rdf.lindenb.org/build".

public static OpenOfficeAssembler assembler = new OpenOfficeAssembler();
(...)
private static final Resource buildRsrc=ResourceFactory.createResource(NS+"build");
(...)
Assembler.general.implementWith(buildRsrc,assembler);
(...)

An Assembler configures a Graph from a RDF config file. In our example, the config contains the path to the OpenOffice spreadsheets:

<#ooservice> rdf:type openoffice:build ;
    openoffice:file "/home/lindenb/jeter.ods" ;
    openoffice:file "/home/lindenb/jeter2.ods" ;
.

This config is read in the Assembler:

public static class OpenOfficeAssembler extends AssemblerBase implements Assembler
      {
      @Override
      public Object open( Assembler a, Resource root, Mode mode )
            {
            Property fileRsrc=ResourceFactory.createProperty(NS+"file");
            //read the configuration an get the files
            List<File> files=new ArrayList<File>();
            StmtIterator iter=root.listProperties(fileRsrc);
     (...)

Start Fuseki with the config file:

$ cd jena-fuseki-0.2.5
$ java -cp fuseki-server.jar:/path/to/openoffice2rdf.jar  org.apache.jena.fuseki.FusekiCmd \
    --debug  -v --config /path/to/openoffice.ttl
14:11:50 INFO  Config               :: Configuration file: ../openoffice.ttl
14:11:50 INFO  Config               :: Service: :service1
14:11:50 INFO  Config               ::   name = ds
14:11:50 INFO  Config               ::   query = /ds/query
14:11:50 INFO  Config               ::   query = /ds/sparql
14:11:50 INFO  Config               ::   update = /ds/update
14:11:50 INFO  Config               ::   upload = /ds/upload
14:11:50 INFO  Config               ::   graphStore(RW) = /ds/data
14:11:50 INFO  Config               ::   graphStore(R) = /ds/get
14:11:50 INFO  ooffice2rdf          :: Calling OpenOfficeCalcGraph init
14:11:50 INFO  Config               :: Service: OpenOffice Service (R)
14:11:50 INFO  Config               ::   name = openoffice
14:11:50 INFO  Config               ::   query = /openoffice/sparql
14:11:50 INFO  Config               ::   query = /openoffice/query
14:11:50 INFO  Config               ::   update = /openoffice/update
14:11:50 INFO  Config               ::   graphStore(R) = /openoffice/get
14:11:50 INFO  Config               ::   graphStore(R) = /openoffice/data
14:11:51 INFO  Server               :: Dataset path = /ds
14:11:51 INFO  Server               :: Dataset path = /openoffice
14:11:51 INFO  Server               :: Fuseki 0.2.5 2012-10-20T17:03:29+0100
14:11:51 INFO  Server               :: Started 2012/11/13 14:11:51 CET on port 3030

Open your browser at http://localhost:3030, select the control panel at http://localhost:3030/control-panel.tpl and select /openoffice:

Fuseki Control Panel

The following form is displayed:

SPARQL Query

You can now copy, paste and run the previous sparql query:

--------------------------------------------------------------------------------------------------------------------------------------------------
| s                                        | p                                                 | o                                               |
==================================================================================================================================================
| <file:/home/lindenb/jeter.ods/t1/y3/x1>  | <http://rdf.lindenb.org/table>                    | <file:/home/lindenb/jeter.ods/t1>               |
| <file:/home/lindenb/jeter.ods/t1/y3/x1>  | <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> | <http://rdf.lindenb.org/Cell>                   |
| <file:/home/lindenb/jeter.ods/t1/y3/x1>  | <http://rdf.lindenb.org/X>                        | "1"^^<http://www.w3.org/2001/XMLSchema#int>     |
| <file:/home/lindenb/jeter.ods/t1/y3/x1>  | <http://rdf.lindenb.org/Y>                        | "3"^^<http://www.w3.org/2001/XMLSchema#int>     |
| <file:/home/lindenb/jeter.ods/t1/y3/x1>  | <http://rdf.lindenb.org/value>                    | "chr1"                                          |
| <file:/home/lindenb/jeter.ods/t1/y3/x2>  | <http://rdf.lindenb.org/table>                    | <file:/home/lindenb/jeter.ods/t1>               |
| <file:/home/lindenb/jeter.ods/t1/y3/x2>  | <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> | <http://rdf.lindenb.org/Cell>                   |
| <file:/home/lindenb/jeter.ods/t1/y3/x2>  | <http://rdf.lindenb.org/X>                        | "2"^^<http://www.w3.org/2001/XMLSchema#int>     |
| <file:/home/lindenb/jeter.ods/t1/y3/x2>  | <http://rdf.lindenb.org/Y>                        | "3"^^<http://www.w3.org/2001/XMLSchema#int>     |
| <file:/home/lindenb/jeter.ods/t1/y3/x2>  | <http://rdf.lindenb.org/value>                    | "150"^^<http://www.w3.org/2001/XMLSchema#float> |
| <file:/home/lindenb/jeter.ods/t1/y3/x3>  | <http://rdf.lindenb.org/table>                    | <file:/home/lindenb/jeter.ods/t1>               |
| <file:/home/lindenb/jeter.ods/t1/y3/x3>  | <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> | <http://rdf.lindenb.org/Cell>                   |
| <file:/home/lindenb/jeter.ods/t1/y3/x3>  | <http://rdf.lindenb.org/X>                        | "3"^^<http://www.w3.org/2001/XMLSchema#int>     |
| <file:/home/lindenb/jeter.ods/t1/y3/x3>  | <http://rdf.lindenb.org/Y>                        | "3"^^<http://www.w3.org/2001/XMLSchema#int>     |
| <file:/home/lindenb/jeter.ods/t1/y3/x3>  | <http://rdf.lindenb.org/value>                    | "250"^^<http://www.w3.org/2001/XMLSchema#float> |
| <file:/home/lindenb/jeter.ods/t1/y3/x4>  | <http://rdf.lindenb.org/table>                    | <file:/home/lindenb/jeter.ods/t1>               |
| <file:/home/lindenb/jeter.ods/t1/y3/x4>  | <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> | <http://rdf.lindenb.org/Cell>                   |
| <file:/home/lindenb/jeter.ods/t1/y3/x4>  | <http://rdf.lindenb.org/X>                        | "4"^^<http://www.w3.org/2001/XMLSchema#int>     |
| <file:/home/lindenb/jeter.ods/t1/y3/x4>  | <http://rdf.lindenb.org/Y>                        | "3"^^<http://www.w3.org/2001/XMLSchema#int>     |
| <file:/home/lindenb/jeter.ods/t1/y3/x4>  | <http://rdf.lindenb.org/value>                    | "rs264"                                         |
| <file:/home/lindenb/jeter2.ods/t1/y3/x1> | <http://rdf.lindenb.org/table>                    | <file:/home/lindenb/jeter2.ods/t1>              |
| <file:/home/lindenb/jeter2.ods/t1/y3/x1> | <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> | <http://rdf.lindenb.org/Cell>                   |
| <file:/home/lindenb/jeter2.ods/t1/y3/x1> | <http://rdf.lindenb.org/X>                        | "1"^^<http://www.w3.org/2001/XMLSchema#int>     |
| <file:/home/lindenb/jeter2.ods/t1/y3/x1> | <http://rdf.lindenb.org/Y>                        | "3"^^<http://www.w3.org/2001/XMLSchema#int>     |
| <file:/home/lindenb/jeter2.ods/t1/y3/x1> | <http://rdf.lindenb.org/value>                    | "1"^^<http://www.w3.org/2001/XMLSchema#float>   |
| <file:/home/lindenb/jeter2.ods/t1/y3/x2> | <http://rdf.lindenb.org/table>                    | <file:/home/lindenb/jeter2.od
| <file:/home/lindenb/jeter2.ods/t1/y3/x2> | <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> | <http://rdf.lindenb.org/Cell>                   |
| <file:/home/lindenb/jeter2.ods/t1/y3/x2> | <http://rdf.lindenb.org/X>                        | "2"^^<http://www.w3.org/2001/XMLSchema#int>     |
| <file:/home/lindenb/jeter2.ods/t1/y3/x2> | <http://rdf.lindenb.org/Y>                        | "3"^^<http://www.w3.org/2001/XMLSchema#int>     |
| <file:/home/lindenb/jeter2.ods/t1/y3/x2> | <http://rdf.lindenb.org/value>                    | "2"^^<http://www.w3.org/2001/XMLSchema#float>   |
| <file:/home/lindenb/jeter2.ods/t1/y3/x3> | <http://rdf.lindenb.org/table>                    | <file:/home/lindenb/jeter2.ods/t1>              |
| <file:/home/lindenb/jeter2.ods/t1/y3/x3> | <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> | <http://rdf.lindenb.org/Cell>                   |
| <file:/home/lindenb/jeter2.ods/t1/y3/x3> | <http://rdf.lindenb.org/X>                        | "3"^^<http://www.w3.org/2001/XMLSchema#int>     |
| <file:/home/lindenb/jeter2.ods/t1/y3/x3> | <http://rdf.lindenb.org/Y>                        | "3"^^<http://www.w3.org/2001/XMLSchema#int>     |
| <file:/home/lindenb/jeter2.ods/t1/y3/x3> | <http://rdf.lindenb.org/value>                    | "3"^^<http://www.w3.org/2001/XMLSchema#float>   |
| <file:/home/lindenb/jeter2.ods/t1/y3/x4> | <http://rdf.lindenb.org/table>                    | <file:/home/lindenb/jeter2.ods/t1>              |
| <file:/home/lindenb/jeter2.ods/t1/y3/x4> | <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> | <http://rdf.lindenb.org/Cell>                   |
| <file:/home/lindenb/jeter2.ods/t1/y3/x4> | <http://rdf.lindenb.org/X>                        | "4"^^<http://www.w3.org/2001/XMLSchema#int>     |
| <file:/home/lindenb/jeter2.ods/t1/y3/x4> | <http://rdf.lindenb.org/Y>                        | "3"^^<http://www.w3.org/2001/XMLSchema#int>     |
| <file:/home/lindenb/jeter2.ods/t1/y3/x4> | <http://rdf.lindenb.org/value>                    | "4"^^<http://www.w3.org/2001/XMLSchema#float>   |
--------------------------------------------------------------------------------------------------------------------------------------------------

That's it,

Pierre

02 November 2012

Saving your tweets in a database using sqlite, rhino, scribe, javascript

In the current post, I 'll describe a simple method to save your tweets in a sqlite database using Mozilla Rhino.

Prerequisites

sqlite
Apache Rhino. I think it should be de-facto available when the java developer toolkit (JDK) is installed
Scribe, the simple OAuth library for Java . It also requires Apache codec

The config.js file

Open an account on https://dev.twitter.com/ and create an App to receive an API-key and an API-secret.
Create the following file 'config.js' filled with the correct parameters.

The javascript

The following javascript file opens a Oauth connection, retrieves the tweets and stores them into sqlite. I've commented the code, I hope it is clear enough.

Running the script using Rhino

scribe.libs=/path/to/scribe-1.3.2.jar:/path/to/commons-codec.jar
rhino.libs=/usr/share/java/js.jar:/usr/share/java/jline.jar
sqlite.libs=/path/to/sqlitejdbc-v056.jar
CLASSPATH=${rhino.libs}:${scribe.libs}:${sqlite.libs}

java -cp ${CLASSPATH} org.mozilla.javascript.tools.shell.Main -f twitter2sqlite.js

At the first time, the user is asked to authorize the application to use the twitter API

The script runs forever (Ctrl-C to break), listening to the new tweets.

As a test, I wrote the following tweet:

wrote a tool to save my tweets: This is a test . ( #rhino, #jdbc, #sqlite, #scribe #javascript )
- Pierre Lindenbaum (@yokofakun) November 2, 2012

... and the tweet was later inserted in the database...

Sleep...



Inserted ({created_at:"Fri Nov 02 20:29:04 +0000 2012", id:264464160664981500, id_str:"264464160664981504", text:"wrote a tool to save my tweets: This is a test . ( #rhino, #jdbc, #sqlite, #scribe #javascript )", source:"web", truncated:false, in_reply_to_status_id:null, in_reply_to_status_id_str:null, in_reply_to_user_id:null, in_reply_to_user_id_str:null, in_reply_to_screen_name:null, geo:null, coordinates:null, place:null, contributors:null, retweet_count:0, entities:{hashtags:[{text:"rhino", indices:[51, 57]}, {text:"jdbc", indices:[59, 64]}, {text:"sqlite", indices:[66, 73]}, {text:"scribe", indices:[75, 82]}, {text:"javascript", indices:[83, 94]}], urls:[], user_mentions:[]}, favorited:false, retweeted:false})



Sleep...

Sleep...

Sleep...

Later, the tweets can be extracted using the sqlite command line:

$  sqlite3 tweets.sqlite 'select * from tweet'

264464160664981504|({created_at:"Fri Nov 02 20:29:04 +0000 2012", id:264464160664981500, id_str:"264464160664981504", text:"wrote a tool to save my tweets: This
264421310841638913|({created_at:"Fri Nov 02 17:38:47 +0000 2012", id:264421310841638900, id_str:"264421310841638913", text:"The tools for recalibration have cha
264264932097400832|({created_at:"Fri Nov 02 07:17:24 +0000 2012", id:264264932097400830, id_str:"264264932097400832", text:"@warandpeace you're welcome. Your sh
264158323287416832|({created_at:"Fri Nov 02 00:13:46 +0000 2012", id:264158323287416830, id_str:"264158323287416832", text:"Drawing of the day November 1, 2012.
264142732174438400|({created_at:"Thu Nov 01 23:11:49 +0000 2012", id:264142732174438400, id_str:"264142732174438400", text:"[delicious] PLOS Collections: How th
264064117558624256|({created_at:"Thu Nov 01 17:59:26 +0000 2012", id:264064117558624260, id_str:"264064117558624256", text:"I've added a stupid basic dependency
264025607724204034|({created_at:"Thu Nov 01 15:26:24 +0000 2012", id:264025607724204030, id_str:"264025607724204034", text:"in the desert lab, checking my on-go
264013563704795136|({created_at:"Thu Nov 01 14:38:33 +0000 2012", id:264013563704795140, id_str:"264013563704795136", text:"Drawing of the day November 1, 2012.
263996436679630848|({created_at:"Thu Nov 01 13:30:29 +0000 2012", id:263996436679630850, id_str:"263996436679630848", text:"RT @RealistComics: he's tall, dark a
263966759210590208|({created_at:"Thu Nov 01 11:32:34 +0000 2012", id:263966759210590200, id_str:"263966759210590208", text:"RT @guermonprez: #Aubry Un avion nor
263946369847398402|({created_at:"Thu Nov 01 10:11:33 +0000 2012", id:263946369847398400, id_str:"263946369847398402", text:"[delicious] OVal: object validation 
263946366919790593|({created_at:"Thu Nov 01 10:11:32 +0000 2012", id:263946366919790600, id_str:"263946366919790593", text:"[delicious] MyBatis #tweet: a first 
263941020729896960|({created_at:"Thu Nov 01 09:50:17 +0000 2012", id:263941020729896960, id_str:"263941020729896960", text:"RT @josh_wills: I have never been pr
263938670187388928|({created_at:"Thu Nov 01 09:40:57 +0000 2012", id:263938670187388930, id_str:"263938670187388928", text:"RT @softmodeling @peterneubauer: Usi
263936362716200960|({created_at:"Thu Nov 01 09:31:47 +0000 2012", id:263936362716200960, id_str:"263936362716200960", text:"declined to review an article about 
263934528186351616|({created_at:"Thu Nov 01 09:24:29 +0000 2012", id:263934528186351600, id_str:"263934528186351616", text:"@figshare Thanks, ( was http://t.co/
263815846139412480|({created_at:"Thu Nov 01 01:32:53 +0000 2012", id:263815846139412480, id_str:"263815846139412480", text:"Drawing of the day October 30, 2012.
263731855919026176|({created_at:"Wed Oct 31 19:59:09 +0000 2012", id:263731855919026180, id_str:"263731855919026176", text:"[delicious] An integrated map of gen
263726281647067136|({created_at:"Wed Oct 31 19:36:59 +0000 2012", id:263726281647067140, id_str:"263726281647067136", text:"RT @bryan_howie: 1000 Genomes paper 
263695076516052992|({created_at:"Wed Oct 31 17:33:00 +0000 2012", id:263695076516053000, id_str:"263695076516052992", text:"\"Forget your Past\" ( abandoned Bul

That's it
Pierre

15 October 2012

Modifying the GATK so it supports a XML-based format for VCF.

I've modified the sources of the GATK in order to support a XML-based format for the variation in addition of the VCF format.
Here are the sources I've modified or added:
new file: org.broadinstitute.sting.utils.variantcontext.writer:

package org.broadinstitute.sting.utils.variantcontext.writer.AbstractVCFWriter;

import java.io.File;
import java.io.OutputStream;
import java.lang.reflect.Array;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

import org.broad.tribble.util.ParsingUtils;
import org.broadinstitute.sting.utils.Utils;
import org.broadinstitute.sting.utils.codecs.vcf.VCFConstants;
import org.broadinstitute.sting.utils.codecs.vcf.VCFHeader;
import org.broadinstitute.sting.utils.exceptions.ReviewedStingException;
import org.broadinstitute.sting.utils.variantcontext.Allele;
import org.broadinstitute.sting.utils.variantcontext.Genotype;
import org.broadinstitute.sting.utils.variantcontext.VariantContext;

import net.sf.samtools.SAMSequenceDictionary;

public abstract class AbstractVCFWriter
 extends IndexingVariantContextWriter
 {
    // the VCF header we're storing
 protected VCFHeader mHeader = null;

 protected IntGenotypeFieldAccessors intGenotypeFieldAccessors = new IntGenotypeFieldAccessors();

    // should we write genotypes or just sites?
    final protected boolean doNotWriteGenotypes;
    final protected boolean allowMissingFieldsInHeader;

 
    protected AbstractVCFWriter(
      final File location,
      final OutputStream output,
      final SAMSequenceDictionary refDict,
            final boolean enableOnTheFlyIndexing,
            boolean doNotWriteGenotypes,
            final boolean allowMissingFieldsInHeader
            )
     {
     super(writerName(location, output), location, output, refDict, enableOnTheFlyIndexing);
        this.doNotWriteGenotypes = doNotWriteGenotypes;
        this.allowMissingFieldsInHeader = allowMissingFieldsInHeader;

     }
    
    protected VCFHeader getVCFHeader()
     {
     return this.mHeader;
     }
    
   protected static Map<Allele, String> buildAlleleMap(final VariantContext vc) {
        final Map<Allele, String> alleleMap = new HashMap<Allele, String>(vc.getAlleles().size()+1);
        alleleMap.put(Allele.NO_CALL, VCFConstants.EMPTY_ALLELE); // convenience for lookup

        final List<Allele> alleles = vc.getAlleles();
        for ( int i = 0; i < alleles.size(); i++ ) {
            alleleMap.put(alleles.get(i), String.valueOf(i));
        }

        return alleleMap;
    }

   
   private static final String QUAL_FORMAT_STRING = "%.2f";
   private static final String QUAL_FORMAT_EXTENSION_TO_TRIM = ".00";

   protected String formatQualValue(double qual) {
       String s = String.format(QUAL_FORMAT_STRING, qual);
       if ( s.endsWith(QUAL_FORMAT_EXTENSION_TO_TRIM) )
           s = s.substring(0, s.length() - QUAL_FORMAT_EXTENSION_TO_TRIM.length());
       return s;
   }

   public static final void missingSampleError(final VariantContext vc, final VCFHeader header) {
       final List<String> badSampleNames = new ArrayList<String>();
       for ( final String x : header.getGenotypeSamples() )
           if ( ! vc.hasGenotype(x) ) badSampleNames.add(x);
       throw new ReviewedStingException("BUG: we now require all samples in VCFheader to have genotype objects.  Missing samples are " + Utils.join(",", badSampleNames));
   }
   
   protected boolean isMissingValue(String s) {
       // we need to deal with the case that it's a list of missing values
       return (countOccurrences(VCFConstants.MISSING_VALUE_v4.charAt(0), s) + countOccurrences(',', s) == s.length());
   }

   
   /**
    * Takes a double value and pretty prints it to a String for display
    *
    * Large doubles => gets %.2f style formatting
    * Doubles < 1 / 10 but > 1/100 </>=> get %.3f style formatting
    * Double < 1/100 => %.3e formatting
    * @param d
    * @return
    */
   public static final String formatVCFDouble(final double d) {
       String format;
       if ( d < 1 ) {
           if ( d < 0.01 ) {
               if ( Math.abs(d) >= 1e-20 )
                   format = "%.3e";
               else {
                   // return a zero format
                   return "0.00";
               }
           } else {
               format = "%.3f";
           }
       } else {
           format = "%.2f";
       }

       return String.format(format, d);
   }

   public static String formatVCFField(Object val) {
       String result;
       if ( val == null )
           result = VCFConstants.MISSING_VALUE_v4;
       else if ( val instanceof Double )
           result = formatVCFDouble((Double) val);
       else if ( val instanceof Boolean )
           result = (Boolean)val ? "" : null; // empty string for true, null for false
       else if ( val instanceof List ) {
           result = formatVCFField(((List)val).toArray());
       } else if ( val.getClass().isArray() ) {
           int length = Array.getLength(val);
           if ( length == 0 )
               return formatVCFField(null);
           StringBuffer sb = new StringBuffer(formatVCFField(Array.get(val, 0)));
           for ( int i = 1; i < length; i++) {
               sb.append(",");
               sb.append(formatVCFField(Array.get(val, i)));
           }
           result = sb.toString();
       } else
           result = val.toString();

       return result;
   }

   /**
    * Determine which genotype fields are in use in the genotypes in VC
    * @param vc
    * @return an ordered list of genotype fields in use in VC.  If vc has genotypes this will always include GT first
    */
   public static List<String> calcVCFGenotypeKeys(final VariantContext vc, final VCFHeader header) {
       Set<String> keys = new HashSet<String>();

       boolean sawGoodGT = false;
       boolean sawGoodQual = false;
       boolean sawGenotypeFilter = false;
       boolean sawDP = false;
       boolean sawAD = false;
       boolean sawPL = false;
       for ( final Genotype g : vc.getGenotypes() ) {
           keys.addAll(g.getExtendedAttributes().keySet());
           if ( g.isAvailable() ) sawGoodGT = true;
           if ( g.hasGQ() ) sawGoodQual = true;
           if ( g.hasDP() ) sawDP = true;
           if ( g.hasAD() ) sawAD = true;
           if ( g.hasPL() ) sawPL = true;
           if (g.isFiltered()) sawGenotypeFilter = true;
       }

       if ( sawGoodQual ) keys.add(VCFConstants.GENOTYPE_QUALITY_KEY);
       if ( sawDP ) keys.add(VCFConstants.DEPTH_KEY);
       if ( sawAD ) keys.add(VCFConstants.GENOTYPE_ALLELE_DEPTHS);
       if ( sawPL ) keys.add(VCFConstants.GENOTYPE_PL_KEY);
       if ( sawGenotypeFilter ) keys.add(VCFConstants.GENOTYPE_FILTER_KEY);

       List<String> sortedList = ParsingUtils.sortList(new ArrayList<String>(keys));

       // make sure the GT is first
       if ( sawGoodGT ) {
           List<String> newList = new ArrayList<String>(sortedList.size()+1);
           newList.add(VCFConstants.GENOTYPE_KEY);
           newList.addAll(sortedList);
           sortedList = newList;
       }

       if ( sortedList.isEmpty() && header.hasGenotypingData() ) {
           // this needs to be done in case all samples are no-calls
           return Collections.singletonList(VCFConstants.GENOTYPE_KEY);
       } else {
           return sortedList;
       }
   }


   private static int countOccurrences(char c, String s) {
          int count = 0;
          for (int i = 0; i < s.length(); i++) {
              count += s.charAt(i) == c ? 1 : 0;
          }
          return count;
   }


 }

modified file: org.broadinstitute.sting.utils.variantcontext.writer.VCFWriter:

/*
 * Copyright (c) 2010, The Broad Institute
 *
 * Permission is hereby granted, free of charge, to any person
 * obtaining a copy of this software and associated documentation
 * files (the "Software"), to deal in the Software without
 * restriction, including without limitation the rights to use,
 * copy, modify, merge, publish, distribute, sublicense, and/or sell
 * copies of the Software, and to permit persons to whom the
 * Software is furnished to do so, subject to the following
 * conditions:
 *
 * The above copyright notice and this permission notice shall be
 * included in all copies or substantial portions of the Software.
 * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
 * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES
 * OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
 * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
 * HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
 * WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
 * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
 * OTHER DEALINGS IN THE SOFTWARE.
 */

package org.broadinstitute.sting.utils.variantcontext.writer;

import net.sf.samtools.SAMSequenceDictionary;
import org.broad.tribble.TribbleException;
import org.broad.tribble.util.ParsingUtils;
import org.broadinstitute.sting.utils.codecs.vcf.*;
import org.broadinstitute.sting.utils.exceptions.ReviewedStingException;
import org.broadinstitute.sting.utils.exceptions.UserException;
import org.broadinstitute.sting.utils.variantcontext.*;

import java.io.*;
import java.util.*;

/**
 * this class writes VCF files
 */
class VCFWriter extends AbstractVCFWriter {
    private final static String VERSION_LINE = VCFHeader.METADATA_INDICATOR + VCFHeaderVersion.VCF4_1.getFormatString() + "=" + VCFHeaderVersion.VCF4_1.getVersionString();

    // the print stream we're writing to
    final protected BufferedWriter mWriter;




    public VCFWriter(final File location, final OutputStream output, final SAMSequenceDictionary refDict,
                     final boolean enableOnTheFlyIndexing,
                     boolean doNotWriteGenotypes,
                     final boolean allowMissingFieldsInHeader )
     {
        super(location, output, refDict, enableOnTheFlyIndexing,doNotWriteGenotypes,allowMissingFieldsInHeader);
        mWriter = new BufferedWriter(new OutputStreamWriter(getOutputStream())); // todo -- fix buffer size
     }

    // --------------------------------------------------------------------------------
    //
    // VCFWriter interface functions
    //
    // --------------------------------------------------------------------------------

    @Override
    public void writeHeader(VCFHeader header) {
        // note we need to update the mHeader object after this call because they header
        // may have genotypes trimmed out of it, if doNotWriteGenotypes is true
        super.mHeader = writeHeader(header, mWriter, doNotWriteGenotypes, getVersionLine(), getStreamName());
     }

    public static final String getVersionLine() {
        return VERSION_LINE;
    }

    public static VCFHeader writeHeader(VCFHeader header,
                                        final Writer writer,
                                        final boolean doNotWriteGenotypes,
                                        final String versionLine,
                                        final String streamNameForError) {
        header = doNotWriteGenotypes ? new VCFHeader(header.getMetaDataInSortedOrder()) : header;
        
        try {
            // the file format field needs to be written first
            writer.write(versionLine + "\n");

            for ( VCFHeaderLine line : header.getMetaDataInSortedOrder() ) {
                if ( VCFHeaderVersion.isFormatString(line.getKey()) )
                    continue;

                writer.write(VCFHeader.METADATA_INDICATOR);
                writer.write(line.toString());
                writer.write("\n");
            }

            // write out the column line
            writer.write(VCFHeader.HEADER_INDICATOR);
            boolean isFirst = true;
            for ( VCFHeader.HEADER_FIELDS field : header.getHeaderFields() ) {
                if ( isFirst )
                    isFirst = false; // don't write out a field separator
                else
                    writer.write(VCFConstants.FIELD_SEPARATOR);
                writer.write(field.toString());
            }

            if ( header.hasGenotypingData() ) {
                writer.write(VCFConstants.FIELD_SEPARATOR);
                writer.write("FORMAT");
                for ( String sample : header.getGenotypeSamples() ) {
                    writer.write(VCFConstants.FIELD_SEPARATOR);
                    writer.write(sample);
                }
            }

            writer.write("\n");
            writer.flush();  // necessary so that writing to an output stream will work
        }
        catch (IOException e) {
            throw new ReviewedStingException("IOException writing the VCF header to " + streamNameForError, e);
        }

        return header;
    }

    /**
     * attempt to close the VCF file
     */
    @Override
    public void close() {
        // try to close the vcf stream
        try {
            mWriter.flush();
            mWriter.close();
        } catch (IOException e) {
            throw new ReviewedStingException("Unable to close " + getStreamName(), e);
        }

        super.close();
    }

    /**
     * add a record to the file
     *
     * @param vc      the Variant Context object
     */
    @Override
    public void add(VariantContext vc) {
        if ( mHeader == null )
            throw new IllegalStateException("The VCF Header must be written before records can be added: " + getStreamName());

        if ( doNotWriteGenotypes )
            vc = new VariantContextBuilder(vc).noGenotypes().make();

        try {
            super.add(vc);

            Map<Allele, String> alleleMap = buildAlleleMap(vc);

            // CHROM
            mWriter.write(vc.getChr());
            mWriter.write(VCFConstants.FIELD_SEPARATOR);

            // POS
            mWriter.write(String.valueOf(vc.getStart()));
            mWriter.write(VCFConstants.FIELD_SEPARATOR);

            // ID
            String ID = vc.getID();
            mWriter.write(ID);
            mWriter.write(VCFConstants.FIELD_SEPARATOR);

            // REF
            String refString = vc.getReference().getDisplayString();
            mWriter.write(refString);
            mWriter.write(VCFConstants.FIELD_SEPARATOR);

            // ALT
            if ( vc.isVariant() ) {
                Allele altAllele = vc.getAlternateAllele(0);
                String alt = altAllele.getDisplayString();
                mWriter.write(alt);

                for (int i = 1; i < vc.getAlternateAlleles().size(); i++) {
                    altAllele = vc.getAlternateAllele(i);
                    alt = altAllele.getDisplayString();
                    mWriter.write(",");
                    mWriter.write(alt);
                }
            } else {
                mWriter.write(VCFConstants.EMPTY_ALTERNATE_ALLELE_FIELD);
            }
            mWriter.write(VCFConstants.FIELD_SEPARATOR);

            // QUAL
            if ( !vc.hasLog10PError() )
                mWriter.write(VCFConstants.MISSING_VALUE_v4);
            else
                mWriter.write(formatQualValue(vc.getPhredScaledQual()));
            mWriter.write(VCFConstants.FIELD_SEPARATOR);

            // FILTER
            String filters = getFilterString(vc);
            mWriter.write(filters);
            mWriter.write(VCFConstants.FIELD_SEPARATOR);

            // INFO
            Map<String, String> infoFields = new TreeMap<String, String>();
            for ( Map.Entry<String, Object> field : vc.getAttributes().entrySet() ) {
                String key = field.getKey();

                if ( ! mHeader.hasInfoLine(key) )
                    fieldIsMissingFromHeaderError(vc, key, "INFO");

                String outputValue = formatVCFField(field.getValue());
                if ( outputValue != null )
                    infoFields.put(key, outputValue);
            }
            writeInfoString(infoFields);

            // FORMAT
            final GenotypesContext gc = vc.getGenotypes();
            if ( gc.isLazyWithData() && ((LazyGenotypesContext)gc).getUnparsedGenotypeData() instanceof String ) {
                mWriter.write(VCFConstants.FIELD_SEPARATOR);
                mWriter.write(((LazyGenotypesContext)gc).getUnparsedGenotypeData().toString());
            } else {
                List<String> genotypeAttributeKeys = calcVCFGenotypeKeys(vc, mHeader);
                if ( ! genotypeAttributeKeys.isEmpty() ) {
                    for ( final String format : genotypeAttributeKeys )
                        if ( ! mHeader.hasFormatLine(format) )
                            fieldIsMissingFromHeaderError(vc, format, "FORMAT");

                    final String genotypeFormatString = ParsingUtils.join(VCFConstants.GENOTYPE_FIELD_SEPARATOR, genotypeAttributeKeys);

                    mWriter.write(VCFConstants.FIELD_SEPARATOR);
                    mWriter.write(genotypeFormatString);

                    addGenotypeData(vc, alleleMap, genotypeAttributeKeys);
                }
            }
            
            mWriter.write("\n");
            mWriter.flush();  // necessary so that writing to an output stream will work
        } catch (IOException e) {
            throw new RuntimeException("Unable to write the VCF object to " + getStreamName());
        }
    }


    // --------------------------------------------------------------------------------
    //
    // implementation functions
    //
    // --------------------------------------------------------------------------------

    private final String getFilterString(final VariantContext vc) {
        if ( vc.isFiltered() ) {
            for ( final String filter : vc.getFilters() )
                if ( ! mHeader.hasFilterLine(filter) )
                    fieldIsMissingFromHeaderError(vc, filter, "FILTER");

            return ParsingUtils.join(";", ParsingUtils.sortList(vc.getFilters()));
        }
        else if ( vc.filtersWereApplied() )
            return VCFConstants.PASSES_FILTERS_v4;
        else
            return VCFConstants.UNFILTERED;
    }


    /**
     * create the info string; assumes that no values are null
     *
     * @param infoFields a map of info fields
     * @throws IOException for writer
     */
    private void writeInfoString(Map<String, String> infoFields) throws IOException {
        if ( infoFields.isEmpty() ) {
            mWriter.write(VCFConstants.EMPTY_INFO_FIELD);
            return;
        }

        boolean isFirst = true;
        for ( Map.Entry<String, String> entry : infoFields.entrySet() ) {
            if ( isFirst )
                isFirst = false;
            else
                mWriter.write(VCFConstants.INFO_FIELD_SEPARATOR);

            String key = entry.getKey();
            mWriter.write(key);

            if ( !entry.getValue().equals("") ) {
                VCFInfoHeaderLine metaData = mHeader.getInfoHeaderLine(key);
                if ( metaData == null || metaData.getCountType() != VCFHeaderLineCount.INTEGER || metaData.getCount() != 0 ) {
                    mWriter.write("=");
                    mWriter.write(entry.getValue());
                }
            }
        }
    }

    /**
     * add the genotype data
     *
     * @param vc                     the variant context
     * @param genotypeFormatKeys  Genotype formatting string
     * @param alleleMap              alleles for this context
     * @throws IOException for writer
     */
    private void addGenotypeData(VariantContext vc, Map<Allele, String> alleleMap, List<String> genotypeFormatKeys)
    throws IOException {
        for ( String sample : mHeader.getGenotypeSamples() ) {
            mWriter.write(VCFConstants.FIELD_SEPARATOR);

            Genotype g = vc.getGenotype(sample);
            if ( g == null ) {
                missingSampleError(vc, mHeader);
            }

            final List<String> attrs = new ArrayList<String>(genotypeFormatKeys.size());
            for ( String field : genotypeFormatKeys ) {
                if ( field.equals(VCFConstants.GENOTYPE_KEY) ) {
                    if ( !g.isAvailable() ) {
                        throw new ReviewedStingException("GTs cannot be missing for some samples if they are available for others in the record");
                    }

                    writeAllele(g.getAllele(0), alleleMap);
                    for (int i = 1; i < g.getPloidy(); i++) {
                        mWriter.write(g.isPhased() ? VCFConstants.PHASED : VCFConstants.UNPHASED);
                        writeAllele(g.getAllele(i), alleleMap);
                    }

                    continue;
                } else {
                    String outputValue;
                    if ( field.equals(VCFConstants.GENOTYPE_FILTER_KEY ) ) {
                        outputValue = g.isFiltered() ? g.getFilters() : VCFConstants.PASSES_FILTERS_v4;
                    } else {
                        final IntGenotypeFieldAccessors.Accessor accessor = intGenotypeFieldAccessors.getAccessor(field);
                        if ( accessor != null ) {
                            final int[] intValues = accessor.getValues(g);
                            if ( intValues == null )
                                outputValue = VCFConstants.MISSING_VALUE_v4;
                            else if ( intValues.length == 1 ) // fast path
                                outputValue = Integer.toString(intValues[0]);
                            else {
                                StringBuilder sb = new StringBuilder();
                                sb.append(intValues[0]);
                                for ( int i = 1; i < intValues.length; i++) {
                                    sb.append(",");
                                    sb.append(intValues[i]);
                                }
                                outputValue = sb.toString();
                            }
                        } else {
                            Object val = g.hasExtendedAttribute(field) ? g.getExtendedAttribute(field) : VCFConstants.MISSING_VALUE_v4;

                            VCFFormatHeaderLine metaData = mHeader.getFormatHeaderLine(field);
                            if ( metaData != null ) {
                                int numInFormatField = metaData.getCount(vc);
                                if ( numInFormatField > 1 && val.equals(VCFConstants.MISSING_VALUE_v4) ) {
                                    // If we have a missing field but multiple values are expected, we need to construct a new string with all fields.
                                    // For example, if Number=2, the string has to be ".,."
                                    StringBuilder sb = new StringBuilder(VCFConstants.MISSING_VALUE_v4);
                                    for ( int i = 1; i < numInFormatField; i++ ) {
                                        sb.append(",");
                                        sb.append(VCFConstants.MISSING_VALUE_v4);
                                    }
                                    val = sb.toString();
                                }
                            }

                            // assume that if key is absent, then the given string encoding suffices
                            outputValue = formatVCFField(val);
                        }
                    }

                    if ( outputValue != null )
                        attrs.add(outputValue);
                }
            }

            // strip off trailing missing values
            for (int i = attrs.size()-1; i >= 0; i--) {
                if ( isMissingValue(attrs.get(i)) )
                    attrs.remove(i);
                else
                    break;
            }

            for (int i = 0; i < attrs.size(); i++) {
                if ( i > 0 || genotypeFormatKeys.contains(VCFConstants.GENOTYPE_KEY) )
                    mWriter.write(VCFConstants.GENOTYPE_FIELD_SEPARATOR);
                mWriter.write(attrs.get(i));
            }
        }
    }



    private void writeAllele(Allele allele, Map<Allele, String> alleleMap) throws IOException {
        String encoding = alleleMap.get(allele);
        if ( encoding == null )
            throw new TribbleException.InternalCodecException("Allele " + allele + " is not an allele in the variant context");
        mWriter.write(encoding);
    }

    private final void fieldIsMissingFromHeaderError(final VariantContext vc, final String id, final String field) {
        if ( !allowMissingFieldsInHeader)
            throw new UserException.MalformedVCFHeader("Key " + id + " found in VariantContext field " + field
                    + " at " + vc.getChr() + ":" + vc.getStart()
                    + " but this key isn't defined in the VCFHeader.  The GATK now requires all VCFs to have"
                    + " complete VCF headers by default.  This error can be disabled with the engine argument"
                    + " -U LENIENT_VCF_PROCESSING");
    }
}

modified file: org.broadinstitute.sting.utils.variantcontext.writer.VariantContextWriterFactory:

/*
 * Copyright (c) 2012, The Broad Institute
 *
 * Permission is hereby granted, free of charge, to any person
 * obtaining a copy of this software and associated documentation
 * files (the "Software"), to deal in the Software without
 * restriction, including without limitation the rights to use,
 * copy, modify, merge, publish, distribute, sublicense, and/or sell
 * copies of the Software, and to permit persons to whom the
 * Software is furnished to do so, subject to the following
 * conditions:
 *
 * The above copyright notice and this permission notice shall be
 * included in all copies or substantial portions of the Software.
 * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
 * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES
 * OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
 * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
 * HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
 * WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
 * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
 * OTHER DEALINGS IN THE SOFTWARE.
 */

package org.broadinstitute.sting.utils.variantcontext.writer;

import net.sf.samtools.SAMSequenceDictionary;
import org.broadinstitute.sting.utils.exceptions.UserException;

import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.util.EnumSet;

import javax.xml.stream.XMLStreamException;

/**
 * Factory methods to create VariantContext writers
 *
 * @author depristo
 * @since 5/12
 */
public class VariantContextWriterFactory {

    public static final EnumSet<Options> DEFAULT_OPTIONS = EnumSet.of(Options.INDEX_ON_THE_FLY);
    public static final EnumSet<Options> NO_OPTIONS = EnumSet.noneOf(Options.class);

    private VariantContextWriterFactory() {}

    public static VariantContextWriter create(final File location, final SAMSequenceDictionary refDict) {
        return create(location, openOutputStream(location), refDict, DEFAULT_OPTIONS);
    }

    public static VariantContextWriter create(final File location, final SAMSequenceDictionary refDict, final EnumSet<Options> options) {
        return create(location, openOutputStream(location), refDict, options);
    }

    public static VariantContextWriter create(final File location,
                                              final OutputStream output,
                                              final SAMSequenceDictionary refDict) {
        return create(location, output, refDict, DEFAULT_OPTIONS);
    }

    public static VariantContextWriter create(final OutputStream output,
                                              final SAMSequenceDictionary refDict,
                                              final EnumSet<Options> options) {
        return create(null, output, refDict, options);
    }

    public static VariantContextWriter create(final File location,
                                              final OutputStream output,
                                              final SAMSequenceDictionary refDict,
                                              final EnumSet<Options> options) {
        final boolean enableBCF = isBCFOutput(location, options);

        if ( enableBCF )
            return new BCF2Writer(location, output, refDict,
                    options.contains(Options.INDEX_ON_THE_FLY),
                    options.contains(Options.DO_NOT_WRITE_GENOTYPES));
        else if(location!=null && location.getName().endsWith(".xml"))
         {
         try {
            return new XMLVariantContextWriter(location, output, refDict,
                    options.contains(Options.INDEX_ON_THE_FLY),
                    options.contains(Options.DO_NOT_WRITE_GENOTYPES),
                    options.contains(Options.ALLOW_MISSING_FIELDS_IN_HEADER)
                    );
         } catch(XMLStreamException err)
          {
          throw new UserException.CouldNotCreateOutputFile(location, "Unable to create XML writer", err);
          }
         }
        else
         {
            return new VCFWriter(location, output, refDict,
                    options.contains(Options.INDEX_ON_THE_FLY),
                    options.contains(Options.DO_NOT_WRITE_GENOTYPES),
                    options.contains(Options.ALLOW_MISSING_FIELDS_IN_HEADER));
         }
     } 

    /**
     * Should we output a BCF file based solely on the name of the file at location?
     *
     * @param location
     * @return
     */
    public static boolean isBCFOutput(final File location) {
        return isBCFOutput(location, EnumSet.noneOf(Options.class));
    }

    public static boolean isBCFOutput(final File location, final EnumSet<Options> options) {
        return options.contains(Options.FORCE_BCF) || (location != null && location.getName().contains(".bcf"));
    }

    public static VariantContextWriter sortOnTheFly(final VariantContextWriter innerWriter, int maxCachingStartDistance) {
        return sortOnTheFly(innerWriter, maxCachingStartDistance, false);
    }

    public static VariantContextWriter sortOnTheFly(final VariantContextWriter innerWriter, int maxCachingStartDistance, boolean takeOwnershipOfInner) {
        return new SortingVariantContextWriter(innerWriter, maxCachingStartDistance, takeOwnershipOfInner);
    }

    /**
     * Returns a output stream writing to location, or throws a UserException if this fails
     * @param location
     * @return
     */
    protected static OutputStream openOutputStream(final File location) {
        try {
            return new FileOutputStream(location);
        } catch (FileNotFoundException e) {
            throw new UserException.CouldNotCreateOutputFile(location, "Unable to create VCF writer", e);
        }
    }
}

new file: org.broadinstitute.sting.utils.variantcontext.writer.XMLVariantContextWriter:

package org.broadinstitute.sting.utils.variantcontext.writer;

import java.io.File;
import java.io.OutputStream;
import java.util.Map;

import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

import org.broadinstitute.sting.utils.codecs.vcf.VCFContigHeaderLine;
import org.broadinstitute.sting.utils.codecs.vcf.VCFFilterHeaderLine;
import org.broadinstitute.sting.utils.codecs.vcf.VCFFormatHeaderLine;
import org.broadinstitute.sting.utils.codecs.vcf.VCFHeader;
import org.broadinstitute.sting.utils.codecs.vcf.VCFHeaderVersion;
import org.broadinstitute.sting.utils.codecs.vcf.VCFInfoHeaderLine;
import org.broadinstitute.sting.utils.exceptions.ReviewedStingException;
import org.broadinstitute.sting.utils.variantcontext.Allele;
import org.broadinstitute.sting.utils.variantcontext.VariantContext;

import net.sf.samtools.SAMSequenceDictionary;

public class XMLVariantContextWriter
 extends AbstractVCFWriter
 {
 public final String NS="http://xml.1000genomes.org/";
    // the print stream we're writing to
    final protected XMLStreamWriter mWriter;

 public XMLVariantContextWriter(final File location, final OutputStream output, final SAMSequenceDictionary refDict,
            final boolean enableOnTheFlyIndexing,
            boolean doNotWriteGenotypes,
            final boolean allowMissingFieldsInHeader )
   throws XMLStreamException
  {
  super(location, output, refDict, enableOnTheFlyIndexing,doNotWriteGenotypes,allowMissingFieldsInHeader);
  XMLOutputFactory factory=XMLOutputFactory.newInstance();
  this.mWriter=factory.createXMLStreamWriter(super.getOutputStream());
  }

 protected void writeMetaData(String key,String value)
  throws XMLStreamException
  {
  if(value!=null)
   {
      this.mWriter.writeStartElement("metadata");
      this.mWriter.writeAttribute("key",key);
      this.mWriter.writeCharacters(value);
      this.mWriter.writeEndElement();
   }
  else
   {
      this.mWriter.writeEmptyElement("metadata");
      this.mWriter.writeAttribute("key",key);
   }
  }
 
 @Override
 public void writeHeader(VCFHeader header)
  {
   
     //
     
        header = doNotWriteGenotypes ? new VCFHeader(header.getMetaDataInSortedOrder()) : header;
        
        try {
         this.mWriter.writeStartElement("vcf");
         this.mWriter.writeAttribute("xmlns", NS);
         this.mWriter.writeStartElement("head");
         
         writeMetaData(
           VCFHeaderVersion.VCF4_1.getFormatString(),
           VCFHeaderVersion.VCF4_1.getVersionString()
           );
         //INFO
         this.mWriter.writeStartElement("info-list");
         for ( VCFInfoHeaderLine line : header.getInfoHeaderLines() )
            {
             this.mWriter.writeStartElement("info");
             this.mWriter.writeAttribute("ID",line.getID());
             this.mWriter.writeAttribute("type",line.getType().name());
             if(line.isFixedCount()) this.mWriter.writeAttribute("count",String.valueOf(line.getCount()));
             this.mWriter.writeCharacters(line.getDescription());
             this.mWriter.writeEndElement();
            }
         this.mWriter.writeEndElement();
         
         //FORMAT
         this.mWriter.writeStartElement("format-list");
         for ( VCFFormatHeaderLine line : header.getFormatHeaderLines() )
            {
             this.mWriter.writeStartElement("format");
             this.mWriter.writeAttribute("ID",line.getID());
             this.mWriter.writeAttribute("type",line.getType().name());
             if(line.isFixedCount()) this.mWriter.writeAttribute("count",String.valueOf(line.getCount()));
             this.mWriter.writeCharacters(line.getDescription());
             this.mWriter.writeEndElement();
            }
         this.mWriter.writeEndElement();
         
         //FILTER
         this.mWriter.writeStartElement("filters-list");
         for ( VCFFilterHeaderLine line : header.getFilterLines() )
            {
             this.mWriter.writeStartElement("filter");
             this.mWriter.writeAttribute("ID",line.getID());
             this.mWriter.writeCharacters(line.getValue());
             this.mWriter.writeEndElement();
            }
         this.mWriter.writeEndElement();

         //CONTIGS
         this.mWriter.writeStartElement("contigs-list");
         for ( VCFContigHeaderLine line : header.getContigLines() )
            {
             this.mWriter.writeStartElement("contig");
             this.mWriter.writeAttribute("ID",line.getID());
             this.mWriter.writeAttribute("index",String.valueOf(line.getContigIndex()));
             this.mWriter.writeEndElement();
            }
         this.mWriter.writeEndElement();
         
         //SAMPLES
         this.mWriter.writeStartElement("samples-list");
         for (int i=0;i< header.getSampleNamesInOrder().size();++i )
            {
             this.mWriter.writeStartElement("sample");
             this.mWriter.writeAttribute("id",String.valueOf(i+1));
             this.mWriter.writeCharacters(header.getSampleNamesInOrder().get(i));
             this.mWriter.writeEndElement();
            }
         this.mWriter.writeEndElement();

         this.mWriter.writeEndElement();//head
         this.mWriter.writeStartElement("body");
         this.mWriter.writeStartElement("variations");
        }
        catch (XMLStreamException e)
      {
         throw new ReviewedStingException("IOException writing the VCF/XML header to " + super.getStreamName(), e);
      }

     }
 
 @Override
    public void add(VariantContext vc)
     { 
        try
         {
         super.add(vc);

            Map<Allele, String> alleleMap = buildAlleleMap(vc);
             
         this.mWriter.writeStartElement("variation");
         
         this.mWriter.writeAttribute("chrom",vc.getChr());
         this.mWriter.writeAttribute("pos",String.valueOf(vc.getStart()));

         
            String ID = vc.getID();
            if(!(ID==null || ID.isEmpty() || ID.equals(".")))
             {
             this.mWriter.writeStartElement("id");
             this.mWriter.writeCharacters(ID);
             this.mWriter.writeEndElement();//id
             }
            
         this.mWriter.writeStartElement("id");
         this.mWriter.writeCharacters(ID);
         this.mWriter.writeEndElement();//body

         this.mWriter.writeStartElement("ref");
         this.mWriter.writeCharacters( vc.getReference().getDisplayString());
         this.mWriter.writeEndElement();


         if ( vc.isVariant() )
          {
                for (int i = 0; i < vc.getAlternateAlleles().size(); i++)
                 {
                    Allele altAllele = vc.getAlternateAllele(i);
                 this.mWriter.writeStartElement("alt");
                 this.mWriter.writeCharacters(altAllele.getDisplayString());
                 this.mWriter.writeEndElement();
                 }
          }
         
         
         this.mWriter.writeEndElement();//variation
         }
     catch(XMLStreamException err)
      {
      throw new ReviewedStingException("Cannot close XMLStream",err);
      }
  }

 
 @Override
    public void close()
     {
        super.close();
        try
         {
         this.mWriter.writeEndElement();//variations
         this.mWriter.writeEndElement();//body
         this.mWriter.writeEndElement();//vcf
         this.mWriter.flush();
         this.mWriter.close();
         }
        catch(XMLStreamException err)
         {
         throw new ReviewedStingException("Cannot close XMLStream",err);
         }
        try
         {
         getOutputStream().close();
         }
       catch(Throwable err)
         {
         throw new ReviewedStingException("Cannot close ouputstream",err);
         }
     }
 }

Compiling and testing

With my version, if the filename ends with "*.xml", the XML-Writer is used instead of the standard VCF-writer.

$ ant
(...)
java -jar dist/GenomeAnalysisTK.jar  \
   -T UnifiedGenotyper \
   -o ex1f.vcf.xml \
   -R ex1.fa \
   -I sorted.bam

INFO  17:12:28,358 HelpFormatter - ---------------------------------------------------------------------------------------------------------- 
INFO  17:12:28,361 HelpFormatter - The Genome Analysis Toolkit (GATK) vdbffd2fa3e7a043a6951d8ac58dd619e68a6caa8, Compiled 2012/10/15 16:53:32 
INFO  17:12:28,361 HelpFormatter - Copyright (c) 2010 The Broad Institute 
INFO  17:12:28,361 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk 
INFO  17:12:28,362 HelpFormatter - Program Args: -T UnifiedGenotyper -o  -o ex1f.vcf.xml -R ex1.fa -I sorted.bam 
INFO  17:12:28,363 HelpFormatter - Date/Time: 2012/10/15 17:12:28 
INFO  17:12:28,364 HelpFormatter - ---------------------------------------------------------------------------------------------------------- 
INFO  17:12:28,364 HelpFormatter - ---------------------------------------------------------------------------------------------------------- 
INFO  17:12:28,392 GenomeAnalysisEngine - Strictness is SILENT 
INFO  17:12:28,430 SAMDataSource$SAMReaders - Initializing SAMRecords in serial 
INFO  17:12:28,444 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.01 
INFO  17:12:28,835 TraversalEngine - [INITIALIZATION COMPLETE; TRAVERSAL STARTING] 
INFO  17:12:28,835 TraversalEngine -        Location processed.sites  runtime per.1M.sites completed total.runtime remaining 
INFO  17:12:30,721 TraversalEngine - Total runtime 2.00 secs, 0.03 min, 0.00 hours 
INFO  17:12:30,723 TraversalEngine - 108 reads were filtered out during traversal out of 9921 total (1.09%) 
INFO  17:12:30,727 TraversalEngine -   -> 108 reads (1.09% of total) failing UnmappedReadFilter

And here is the XML file produced (I've just played with the xml format, handling the INFO and the genotypes for each variation was on my todo list).

xmllint --format   ex1f.vcf.xml

<?xml version="1.0"?>
<vcf xmlns="http://xml.1000genomes.org/">
  <head>
    <metadata key="fileformat">VCFv4.1</metadata>
    <info-list>
      <info ID="FS" type="Float" count="1">Phred-scaled p-value using Fisher's exact test to detect strand bias</info>
      <info ID="AN" type="Integer" count="1">Total number of alleles in called genotypes</info>
      <info ID="BaseQRankSum" type="Float" count="1">Z-score from Wilcoxon rank sum test of Alt Vs. Ref base qualities</info>
      <info ID="MQ" type="Float" count="1">RMS Mapping Quality</info>
      <info ID="AF" type="Float">Allele Frequency, for each ALT allele, in the same order as listed</info>
       (....)
    </info-list>
    <format-list>
      <format ID="DP" type="Integer" count="1">Approximate read depth (reads with MQ=255 or with bad mates are filtered)</format>
      <format ID="GT" type="String" count="1">Genotype</format>
      <format ID="PL" type="Integer">Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification</format>
      <format ID="GQ" type="Integer" count="1">Genotype Quality</format>
      <format ID="AD" type="Integer">Allelic depths for the ref and alt alleles in the order listed</format>
    </format-list>
    <filters-list>
      <filter ID="LowQual"/>
    </filters-list>
    <contigs-list>
      <contig ID="seq1" index="0"/>
      <contig ID="seq2" index="1"/>
    </contigs-list>
    <samples-list>
      <sample id="1">ex1</sample>
      <sample id="2">ex1b</sample>
    </samples-list>
  </head>
  <body>
    <variations>
      <variation chrom="seq1" pos="285">
        <id>.</id>
        <ref>T</ref>
        <alt>A</alt>
      </variation>
      <variation chrom="seq1" pos="287">
        <id>.</id>
        <ref>C</ref>
        <alt>A</alt>
      </variation>
      (....)
  </body>
</vcf>

Image via wikipedia

Sorry, didn't mean to make you sad! But we <3 VCF... RT @yokofakun: ... and the @gatk_dev team said "no" #gatk #VCF #ngs #Imsosaaaaad
— GATK Dev Team (@gatk_dev) October 15, 2012

That's it ;-) ,

Pierre

YOKOFAKUN

20 December 2012

RDF/Jena: a simple extension for XSLT/XALAN. Testing with NCBI-Gene

The XSLT Stylesheet

The Java code

Makefile

Result

NOTCH2

Omim ID 610205

Omim ID 102500

29 November 2012

Reading/Writing a VCF file with the GATK-API.

The code

Makefile

21 November 2012

visualizing the dependencies in a Makefile

Example

13 November 2012

Creating a virtual RDF graph describing a set of OpenOffice spreadsheets with Apache Jena and Fuseki

Fact: An openoffice file (*.ods) is a Zip file

Fact: Implementing a simple virtual RDF graph with Jena is easy

The code

Compilation

Querying using sparql

The spreadsheet

The sparql query

Serving the OpenOffice spreadsheets as RDF over HTTP

Download an install Fuseki

Tell Fuseki about our OpenOfficeCalcGraph

Start Fuseki with the config file:

02 November 2012

Saving your tweets in a database using sqlite, rhino, scribe, javascript

Prerequisites

The config.js file

The javascript

Running the script using Rhino

15 October 2012

Modifying the GATK so it supports a XML-based format for VCF.

Compiling and testing

About Me

Feeds

Blog Archive

Web2.0

Labels