19 May 2010

First rule of Bioinfo-Club

The first rule of Bioinfo-Club is:

If you don't like a
Standard File Format,
create a new one


and, I don't like the VCF format. VCF is a text file format (most likely stored in a compressed manner). It contains metainformation lines, a header line, and then data lines each containing information about a position in the genome.. A VCF file looks like this:
##fileformat=VCFv3.3
##FORMAT=DP,1,Integer,"Read Depth (only filtered reads used for calling)"
##FORMAT=GL,3,Float,"Log-scaled likelihoods for AA,AB,BB genotypes where A=ref and B=alt; not applicable if site is not biallelic"
##FORMAT=GQ,1,Float,"Genotype Quality"
##FORMAT=GT,1,String,"Genotype"
##INFO=AB,1,Float,"Allele Balance for hets (ref/(ref+alt))"
##INFO=AC,1,Integer,"Allele count in genotypes, for each ALT allele, in the same order as listed"
##INFO=AF,1,Float,"Allele Frequency"
##INFO=AN,1,Integer,"Total number of alleles in called genotypes"
##INFO=DP,1,Integer,"Total Depth"
##INFO=Dels,1,Float,"Fraction of Reads Containing Spanning Deletions"
##INFO=HRun,1,Integer,"Largest Contiguous Homopolymer Run of Variant Allele In Either Direction"
##INFO=MQ,1,Float,"RMS Mapping Quality"
##INFO=MQ0,1,Integer,"Total Mapping Quality Zero Reads"
##INFO=QD,1,Float,"Variant Confidence/Quality by Depth"
##INFO=SB,1,Float,"Strand Bias"
##reference=chrN.fa
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT chr.markdup.bam chr1.sorted.bam
chr1 10109 . A T 94.50 . AB=0.67;AC=2;AF=0.50;AN=4;DP=14;Dels=0.00;HRun=0;MQ=19.70;MQ0=4;QD=6.75;SB=-64.74 GT:DP:GL:GQ 0/1:4:-8.74,-2.21,-6.8
0:45.90 0/1:4:-8.74,-2.21,-6.80:45.90
chr1 10578 . G A 120.87 . AC=4;AF=1.00;AN=4;DP=4;Dels=0.00;HRun=0;MQ=50.00;MQ0=0;QD=30.22;SB=-87.10 GT:DP:GL:GQ 1/1:2:-8.92,-1.60,-1.00:6.02 1/1:2:-8.92,-1.60,-1.00:6.02
(...)


The problem is that I want to add some informations about each variation in this file: it started to become messy when I wanted to annotate the mutations (position in the protein ? in the cDNA ? is it non-synonymous ?, which exon ? which gene ? what are the GO terms ?, etc...). So I created a tool named 'VCFToXML' (source available here) transforming the VCF format into XML:
cat file.vcf | java -jar vcf2xml.jar > vcf.xml

The XML-VCF now looks like this:
.
Of course, it is less compact and less easy to process with the common unix tools (cut/sort/awk/etc...). But with the help of a second tool named 'vcfannot' (source available here), I can now add some structured annotations (here some information about UCSC/knownGenes and UCSC/segDups):

A first XSLT stylesheet: back to VCF

Great ! I can use XSLT to transform this XML into another file format. My first XSLT stylesheet xvcf2vcf.xsl transforms the XML-VCF back to the plain text format:
xsltproc http://code915.googlecode.com/svn/trunk/core/src/xsl/xvcf2vcf.xsl vcf.xml

##fileformat=VCFv3.3
##INFO=AC,1,Integer,"Allele count in genotypes, for each ALT allele, in the same order as listed"
##INFO=HRun,1,Integer,"Largest Contiguous Homopolymer Run of Variant Allele In Either Direction"
##INFO=AB,1,Float,"Allele Balance for hets (ref/(ref+alt))"
##INFO=Dels,1,Float,"Fraction of Reads Containing Spanning Deletions"
##INFO=DP,1,Integer,"Total Depth"
##INFO=QD,1,Float,"Variant Confidence/Quality by Depth"
##INFO=AF,1,Float,"Allele Frequency"
##INFO=MQ0,1,Integer,"Total Mapping Quality Zero Reads"
##INFO=MQ,1,Float,"RMS Mapping Quality"
##INFO=SB,1,Float,"Strand Bias"
##INFO=AN,1,Integer,"Total number of alleles in called genotypes"
##FORMAT=GT,1,String,"Genotype"
##FORMAT=GQ,1,Float,"Genotype Quality"
##FORMAT=DP,1,Integer,"Read Depth (only filtered reads used for calling)"
##FORMAT=GL,3,Float,"Log-scaled likelihoods for AA,AB,BB genotypes where A=ref and B=alt; not applicable if site is not biallelic"
##(...)
##reference=chrN.fa
##source=UnifiedGenotyper
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT chrN.markdup.bam chrN.sorted.bam
chrN 10109 . A T 94.5 . AC=2;HRun=0;AB=0.67;Dels=0.00;DP=14;QD=6.75;AF=0.50;MQ0=4;MQ=19.70;SB=-64.74;AN=4 GT:DP:GL:GQ 1/1:4:-8.74,-2.21,-6.80:45.90 0/1:4:-8.74,-2.21,-6.80:45.90
chrN 10578 . G A 120.87 . AC=4;HRun=0;Dels=0.00;DP=4;QD=30.22;AF=1.00;MQ0=0;MQ=50.00;SB=-87.10;AN=4 GT:DP:GL:GQ 0/1:2:-8.92,-1.60,-1.00:6.02 0/1:2:-8.92,-1.60,-1.00:6.02

Second XSLT stylesheet: insert into SQL


The second XSL Stylesheet xvcf2sql.xsl generates a mySQL script for inserting the VCF into a database:
xsltproc http://code915.googlecode.com/svn/trunk/core/src/xsl/xvcf2sql.xsl vcf.xml |\
mysql -D test


Then, for example, the following query scans the database and returns the genotypes/info for each sample:
select
V.chrom,
V.position,
V.ref,
V.alt,
S.name,
concat(left(T.descr,10),"...") as "Format",
F.content
from
vcf_variant as V,
vcf_sample as S,
vcf_call as C,
vcf_callfeature as F,
vcf_format as T
where
C.variant_id=V.id and
C.sample_id=S.id and
F.call_id=C.id and
T.id=F.format_id;

Result:


(xsltproc http://code915.googlecode.com/svn/trunk/core/src/xsl/xvcf2sql.xsl vcf.xml; cat query.sql )|\
mysql -D test

chrom position ref alt name Format content
chrN 10109 A T chr.markdup.bam Genotype... 1/1
chrN 10109 A T chr.markdup.bam Read Depth... 4
chrN 10109 A T chr.markdup.bam Log-scaled... -8.74,-2.21,-6.80
chrN 10109 A T chr.markdup.bam Genotype Q... 45.90
chrN 10109 A T chrN.sorted.bam Genotype... 0/1
chrN 10109 A T chrN.sorted.bam Read Depth... 4
chrN 10109 A T chrN.sorted.bam Log-scaled... -8.74,-2.21,-6.80
chrN 10109 A T chrN.sorted.bam Genotype Q... 45.90
chrN 10578 G A chr.markdup.bam Genotype... 0/1
chrN 10578 G A chr.markdup.bam Read Depth... 2
chrN 10578 G A chr.markdup.bam Log-scaled... -8.92,-1.60,-1.00
chrN 10578 G A chr.markdup.bam Genotype Q... 6.02
chrN 10578 G A chrN.sorted.bam Genotype... 0/1
chrN 10578 G A chrN.sorted.bam Read Depth... 2
chrN 10578 G A chrN.sorted.bam Log-scaled... -8.92,-1.60,-1.00
chrN 10578 G A chrN.sorted.bam Genotype Q... 6.02


That's it
Pierre

6 comments:

dalloliogm said...

what about coding those informations in another file? or in a table?

VCF is designed to store only genotypes and raw informations, from a certain point of view this could be an advantage.

Pierre Lindenbaum said...

I don't meant that the NGS softwares should output some XML but I needed a file format where I could add some informations about the mutations. So let's say that's VCF is OK (it is compact, sortable, etc..) but I've only created a new kind of file to manage my data.

Unknown said...

thanks for this post it. I find it very educational and a good example of XML use.

Sylvain said...

Just one tiny remark:

cat file.vcf | java -jar vcf2xml.jar > vcf.xml
is considered bad Un*x style, indeed there is an unnecessary call to cat...

The same can be done with pure shell as
java -jar vcf2xml.jar < file.vcf > vcf.xml

Sylvain

James Casbon said...

My bugbear with these formats is that the embedded structured data should be done in JSON, not some ad hoc format. In vcf, you have


GT:DP:GL:GQ 1/1:11:-28.60,-3.15,-0.15:30.00


In GFF3 you have

ID=blah;Parent=foo


If you are going to embed data in a single column, use JSON. i.e.:


{'GT': '1/0', 'DP': 11, ...}


or

{'ID': 'blah', 'Parent': 'foo'}


It would save everyone a lot of hassle.

Anonymous said...

I believe more than half of people are using command lines. JSON/XML will utterly ruin their lives. Also in the world of NGS, efficiency is often of the first priority. A few of us increasingly feel VCF is not efficient to parse. JSON/XML will only make this worse.