19 May 2010

First rule of Bioinfo-Club

The first rule of Bioinfo-Club is:

If you don't like a
Standard File Format,
create a new one


and, I don't like the VCF format. VCF is a text file format (most likely stored in a compressed manner). It contains metainformation lines, a header line, and then data lines each containing information about a position in the genome.. A VCF file looks like this:
##fileformat=VCFv3.3
##FORMAT=DP,1,Integer,"Read Depth (only filtered reads used for calling)"
##FORMAT=GL,3,Float,"Log-scaled likelihoods for AA,AB,BB genotypes where A=ref and B=alt; not applicable if site is not biallelic"
##FORMAT=GQ,1,Float,"Genotype Quality"
##FORMAT=GT,1,String,"Genotype"
##INFO=AB,1,Float,"Allele Balance for hets (ref/(ref+alt))"
##INFO=AC,1,Integer,"Allele count in genotypes, for each ALT allele, in the same order as listed"
##INFO=AF,1,Float,"Allele Frequency"
##INFO=AN,1,Integer,"Total number of alleles in called genotypes"
##INFO=DP,1,Integer,"Total Depth"
##INFO=Dels,1,Float,"Fraction of Reads Containing Spanning Deletions"
##INFO=HRun,1,Integer,"Largest Contiguous Homopolymer Run of Variant Allele In Either Direction"
##INFO=MQ,1,Float,"RMS Mapping Quality"
##INFO=MQ0,1,Integer,"Total Mapping Quality Zero Reads"
##INFO=QD,1,Float,"Variant Confidence/Quality by Depth"
##INFO=SB,1,Float,"Strand Bias"
##reference=chrN.fa
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT chr.markdup.bam chr1.sorted.bam
chr1 10109 . A T 94.50 . AB=0.67;AC=2;AF=0.50;AN=4;DP=14;Dels=0.00;HRun=0;MQ=19.70;MQ0=4;QD=6.75;SB=-64.74 GT:DP:GL:GQ 0/1:4:-8.74,-2.21,-6.8
0:45.90 0/1:4:-8.74,-2.21,-6.80:45.90
chr1 10578 . G A 120.87 . AC=4;AF=1.00;AN=4;DP=4;Dels=0.00;HRun=0;MQ=50.00;MQ0=0;QD=30.22;SB=-87.10 GT:DP:GL:GQ 1/1:2:-8.92,-1.60,-1.00:6.02 1/1:2:-8.92,-1.60,-1.00:6.02
(...)


The problem is that I want to add some informations about each variation in this file: it started to become messy when I wanted to annotate the mutations (position in the protein ? in the cDNA ? is it non-synonymous ?, which exon ? which gene ? what are the GO terms ?, etc...). So I created a tool named 'VCFToXML' (source available here) transforming the VCF format into XML:
cat file.vcf | java -jar vcf2xml.jar > vcf.xml

The XML-VCF now looks like this:
.
Of course, it is less compact and less easy to process with the common unix tools (cut/sort/awk/etc...). But with the help of a second tool named 'vcfannot' (source available here), I can now add some structured annotations (here some information about UCSC/knownGenes and UCSC/segDups):

A first XSLT stylesheet: back to VCF

Great ! I can use XSLT to transform this XML into another file format. My first XSLT stylesheet xvcf2vcf.xsl transforms the XML-VCF back to the plain text format:
xsltproc http://code915.googlecode.com/svn/trunk/core/src/xsl/xvcf2vcf.xsl vcf.xml

##fileformat=VCFv3.3
##INFO=AC,1,Integer,"Allele count in genotypes, for each ALT allele, in the same order as listed"
##INFO=HRun,1,Integer,"Largest Contiguous Homopolymer Run of Variant Allele In Either Direction"
##INFO=AB,1,Float,"Allele Balance for hets (ref/(ref+alt))"
##INFO=Dels,1,Float,"Fraction of Reads Containing Spanning Deletions"
##INFO=DP,1,Integer,"Total Depth"
##INFO=QD,1,Float,"Variant Confidence/Quality by Depth"
##INFO=AF,1,Float,"Allele Frequency"
##INFO=MQ0,1,Integer,"Total Mapping Quality Zero Reads"
##INFO=MQ,1,Float,"RMS Mapping Quality"
##INFO=SB,1,Float,"Strand Bias"
##INFO=AN,1,Integer,"Total number of alleles in called genotypes"
##FORMAT=GT,1,String,"Genotype"
##FORMAT=GQ,1,Float,"Genotype Quality"
##FORMAT=DP,1,Integer,"Read Depth (only filtered reads used for calling)"
##FORMAT=GL,3,Float,"Log-scaled likelihoods for AA,AB,BB genotypes where A=ref and B=alt; not applicable if site is not biallelic"
##(...)
##reference=chrN.fa
##source=UnifiedGenotyper
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT chrN.markdup.bam chrN.sorted.bam
chrN 10109 . A T 94.5 . AC=2;HRun=0;AB=0.67;Dels=0.00;DP=14;QD=6.75;AF=0.50;MQ0=4;MQ=19.70;SB=-64.74;AN=4 GT:DP:GL:GQ 1/1:4:-8.74,-2.21,-6.80:45.90 0/1:4:-8.74,-2.21,-6.80:45.90
chrN 10578 . G A 120.87 . AC=4;HRun=0;Dels=0.00;DP=4;QD=30.22;AF=1.00;MQ0=0;MQ=50.00;SB=-87.10;AN=4 GT:DP:GL:GQ 0/1:2:-8.92,-1.60,-1.00:6.02 0/1:2:-8.92,-1.60,-1.00:6.02

Second XSLT stylesheet: insert into SQL


The second XSL Stylesheet xvcf2sql.xsl generates a mySQL script for inserting the VCF into a database:
xsltproc http://code915.googlecode.com/svn/trunk/core/src/xsl/xvcf2sql.xsl vcf.xml |\
mysql -D test


Then, for example, the following query scans the database and returns the genotypes/info for each sample:
select
V.chrom,
V.position,
V.ref,
V.alt,
S.name,
concat(left(T.descr,10),"...") as "Format",
F.content
from
vcf_variant as V,
vcf_sample as S,
vcf_call as C,
vcf_callfeature as F,
vcf_format as T
where
C.variant_id=V.id and
C.sample_id=S.id and
F.call_id=C.id and
T.id=F.format_id;

Result:


(xsltproc http://code915.googlecode.com/svn/trunk/core/src/xsl/xvcf2sql.xsl vcf.xml; cat query.sql )|\
mysql -D test

chrom position ref alt name Format content
chrN 10109 A T chr.markdup.bam Genotype... 1/1
chrN 10109 A T chr.markdup.bam Read Depth... 4
chrN 10109 A T chr.markdup.bam Log-scaled... -8.74,-2.21,-6.80
chrN 10109 A T chr.markdup.bam Genotype Q... 45.90
chrN 10109 A T chrN.sorted.bam Genotype... 0/1
chrN 10109 A T chrN.sorted.bam Read Depth... 4
chrN 10109 A T chrN.sorted.bam Log-scaled... -8.74,-2.21,-6.80
chrN 10109 A T chrN.sorted.bam Genotype Q... 45.90
chrN 10578 G A chr.markdup.bam Genotype... 0/1
chrN 10578 G A chr.markdup.bam Read Depth... 2
chrN 10578 G A chr.markdup.bam Log-scaled... -8.92,-1.60,-1.00
chrN 10578 G A chr.markdup.bam Genotype Q... 6.02
chrN 10578 G A chrN.sorted.bam Genotype... 0/1
chrN 10578 G A chrN.sorted.bam Read Depth... 2
chrN 10578 G A chrN.sorted.bam Log-scaled... -8.92,-1.60,-1.00
chrN 10578 G A chrN.sorted.bam Genotype Q... 6.02


That's it
Pierre

6 comments:

dalloliogm said...

what about coding those informations in another file? or in a table?

VCF is designed to store only genotypes and raw informations, from a certain point of view this could be an advantage.

Pierre Lindenbaum said...

I don't meant that the NGS softwares should output some XML but I needed a file format where I could add some informations about the mutations. So let's say that's VCF is OK (it is compact, sortable, etc..) but I've only created a new kind of file to manage my data.

diffusing thoughts said...

thanks for this post it. I find it very educational and a good example of XML use.

Sylvain said...

Just one tiny remark:

cat file.vcf | java -jar vcf2xml.jar > vcf.xml
is considered bad Un*x style, indeed there is an unnecessary call to cat...

The same can be done with pure shell as
java -jar vcf2xml.jar < file.vcf > vcf.xml

Sylvain

casbon said...

My bugbear with these formats is that the embedded structured data should be done in JSON, not some ad hoc format. In vcf, you have


GT:DP:GL:GQ 1/1:11:-28.60,-3.15,-0.15:30.00


In GFF3 you have

ID=blah;Parent=foo


If you are going to embed data in a single column, use JSON. i.e.:


{'GT': '1/0', 'DP': 11, ...}


or

{'ID': 'blah', 'Parent': 'foo'}


It would save everyone a lot of hassle.

Anonymous said...

I believe more than half of people are using command lines. JSON/XML will utterly ruin their lives. Also in the world of NGS, efficiency is often of the first priority. A few of us increasingly feel VCF is not efficient to parse. JSON/XML will only make this worse.