If you don't like a
Standard File Format,
create a new one
Standard File Format,
create a new one
and, I don't like the VCF format. VCF is a text file format (most likely stored in a compressed manner). It contains metainformation lines, a header line, and then data lines each containing information about a position in the genome.. A VCF file looks like this:
##fileformat=VCFv3.3
##FORMAT=DP,1,Integer,"Read Depth (only filtered reads used for calling)"
##FORMAT=GL,3,Float,"Log-scaled likelihoods for AA,AB,BB genotypes where A=ref and B=alt; not applicable if site is not biallelic"
##FORMAT=GQ,1,Float,"Genotype Quality"
##FORMAT=GT,1,String,"Genotype"
##INFO=AB,1,Float,"Allele Balance for hets (ref/(ref+alt))"
##INFO=AC,1,Integer,"Allele count in genotypes, for each ALT allele, in the same order as listed"
##INFO=AF,1,Float,"Allele Frequency"
##INFO=AN,1,Integer,"Total number of alleles in called genotypes"
##INFO=DP,1,Integer,"Total Depth"
##INFO=Dels,1,Float,"Fraction of Reads Containing Spanning Deletions"
##INFO=HRun,1,Integer,"Largest Contiguous Homopolymer Run of Variant Allele In Either Direction"
##INFO=MQ,1,Float,"RMS Mapping Quality"
##INFO=MQ0,1,Integer,"Total Mapping Quality Zero Reads"
##INFO=QD,1,Float,"Variant Confidence/Quality by Depth"
##INFO=SB,1,Float,"Strand Bias"
##reference=chrN.fa
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT chr.markdup.bam chr1.sorted.bam
chr1 10109 . A T 94.50 . AB=0.67;AC=2;AF=0.50;AN=4;DP=14;Dels=0.00;HRun=0;MQ=19.70;MQ0=4;QD=6.75;SB=-64.74 GT:DP:GL:GQ 0/1:4:-8.74,-2.21,-6.8
0:45.90 0/1:4:-8.74,-2.21,-6.80:45.90
chr1 10578 . G A 120.87 . AC=4;AF=1.00;AN=4;DP=4;Dels=0.00;HRun=0;MQ=50.00;MQ0=0;QD=30.22;SB=-87.10 GT:DP:GL:GQ 1/1:2:-8.92,-1.60,-1.00:6.02 1/1:2:-8.92,-1.60,-1.00:6.02
(...)
##FORMAT=DP,1,Integer,"Read Depth (only filtered reads used for calling)"
##FORMAT=GL,3,Float,"Log-scaled likelihoods for AA,AB,BB genotypes where A=ref and B=alt; not applicable if site is not biallelic"
##FORMAT=GQ,1,Float,"Genotype Quality"
##FORMAT=GT,1,String,"Genotype"
##INFO=AB,1,Float,"Allele Balance for hets (ref/(ref+alt))"
##INFO=AC,1,Integer,"Allele count in genotypes, for each ALT allele, in the same order as listed"
##INFO=AF,1,Float,"Allele Frequency"
##INFO=AN,1,Integer,"Total number of alleles in called genotypes"
##INFO=DP,1,Integer,"Total Depth"
##INFO=Dels,1,Float,"Fraction of Reads Containing Spanning Deletions"
##INFO=HRun,1,Integer,"Largest Contiguous Homopolymer Run of Variant Allele In Either Direction"
##INFO=MQ,1,Float,"RMS Mapping Quality"
##INFO=MQ0,1,Integer,"Total Mapping Quality Zero Reads"
##INFO=QD,1,Float,"Variant Confidence/Quality by Depth"
##INFO=SB,1,Float,"Strand Bias"
##reference=chrN.fa
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT chr.markdup.bam chr1.sorted.bam
chr1 10109 . A T 94.50 . AB=0.67;AC=2;AF=0.50;AN=4;DP=14;Dels=0.00;HRun=0;MQ=19.70;MQ0=4;QD=6.75;SB=-64.74 GT:DP:GL:GQ 0/1:4:-8.74,-2.21,-6.8
0:45.90 0/1:4:-8.74,-2.21,-6.80:45.90
chr1 10578 . G A 120.87 . AC=4;AF=1.00;AN=4;DP=4;Dels=0.00;HRun=0;MQ=50.00;MQ0=0;QD=30.22;SB=-87.10 GT:DP:GL:GQ 1/1:2:-8.92,-1.60,-1.00:6.02 1/1:2:-8.92,-1.60,-1.00:6.02
(...)
The problem is that I want to add some informations about each variation in this file: it started to become messy when I wanted to annotate the mutations (position in the protein ? in the cDNA ? is it non-synonymous ?, which exon ? which gene ? what are the GO terms ?, etc...). So I created a tool named 'VCFToXML' (source available here) transforming the VCF format into XML:
cat file.vcf | java -jar vcf2xml.jar > vcf.xml
The XML-VCF now looks like this:.
Of course, it is less compact and less easy to process with the common unix tools (cut/sort/awk/etc...). But with the help of a second tool named 'vcfannot' (source available here), I can now add some structured annotations (here some information about UCSC/knownGenes and UCSC/segDups):
A first XSLT stylesheet: back to VCF
Great ! I can use XSLT to transform this XML into another file format. My first XSLT stylesheet xvcf2vcf.xsl transforms the XML-VCF back to the plain text format:xsltproc http://code915.googlecode.com/svn/trunk/core/src/xsl/xvcf2vcf.xsl vcf.xml
##fileformat=VCFv3.3
##INFO=AC,1,Integer,"Allele count in genotypes, for each ALT allele, in the same order as listed"
##INFO=HRun,1,Integer,"Largest Contiguous Homopolymer Run of Variant Allele In Either Direction"
##INFO=AB,1,Float,"Allele Balance for hets (ref/(ref+alt))"
##INFO=Dels,1,Float,"Fraction of Reads Containing Spanning Deletions"
##INFO=DP,1,Integer,"Total Depth"
##INFO=QD,1,Float,"Variant Confidence/Quality by Depth"
##INFO=AF,1,Float,"Allele Frequency"
##INFO=MQ0,1,Integer,"Total Mapping Quality Zero Reads"
##INFO=MQ,1,Float,"RMS Mapping Quality"
##INFO=SB,1,Float,"Strand Bias"
##INFO=AN,1,Integer,"Total number of alleles in called genotypes"
##FORMAT=GT,1,String,"Genotype"
##FORMAT=GQ,1,Float,"Genotype Quality"
##FORMAT=DP,1,Integer,"Read Depth (only filtered reads used for calling)"
##FORMAT=GL,3,Float,"Log-scaled likelihoods for AA,AB,BB genotypes where A=ref and B=alt; not applicable if site is not biallelic"
##(...)
##reference=chrN.fa
##source=UnifiedGenotyper
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT chrN.markdup.bam chrN.sorted.bam
chrN 10109 . A T 94.5 . AC=2;HRun=0;AB=0.67;Dels=0.00;DP=14;QD=6.75;AF=0.50;MQ0=4;MQ=19.70;SB=-64.74;AN=4 GT:DP:GL:GQ 1/1:4:-8.74,-2.21,-6.80:45.90 0/1:4:-8.74,-2.21,-6.80:45.90
chrN 10578 . G A 120.87 . AC=4;HRun=0;Dels=0.00;DP=4;QD=30.22;AF=1.00;MQ0=0;MQ=50.00;SB=-87.10;AN=4 GT:DP:GL:GQ 0/1:2:-8.92,-1.60,-1.00:6.02 0/1:2:-8.92,-1.60,-1.00:6.02
##fileformat=VCFv3.3
##INFO=AC,1,Integer,"Allele count in genotypes, for each ALT allele, in the same order as listed"
##INFO=HRun,1,Integer,"Largest Contiguous Homopolymer Run of Variant Allele In Either Direction"
##INFO=AB,1,Float,"Allele Balance for hets (ref/(ref+alt))"
##INFO=Dels,1,Float,"Fraction of Reads Containing Spanning Deletions"
##INFO=DP,1,Integer,"Total Depth"
##INFO=QD,1,Float,"Variant Confidence/Quality by Depth"
##INFO=AF,1,Float,"Allele Frequency"
##INFO=MQ0,1,Integer,"Total Mapping Quality Zero Reads"
##INFO=MQ,1,Float,"RMS Mapping Quality"
##INFO=SB,1,Float,"Strand Bias"
##INFO=AN,1,Integer,"Total number of alleles in called genotypes"
##FORMAT=GT,1,String,"Genotype"
##FORMAT=GQ,1,Float,"Genotype Quality"
##FORMAT=DP,1,Integer,"Read Depth (only filtered reads used for calling)"
##FORMAT=GL,3,Float,"Log-scaled likelihoods for AA,AB,BB genotypes where A=ref and B=alt; not applicable if site is not biallelic"
##(...)
##reference=chrN.fa
##source=UnifiedGenotyper
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT chrN.markdup.bam chrN.sorted.bam
chrN 10109 . A T 94.5 . AC=2;HRun=0;AB=0.67;Dels=0.00;DP=14;QD=6.75;AF=0.50;MQ0=4;MQ=19.70;SB=-64.74;AN=4 GT:DP:GL:GQ 1/1:4:-8.74,-2.21,-6.80:45.90 0/1:4:-8.74,-2.21,-6.80:45.90
chrN 10578 . G A 120.87 . AC=4;HRun=0;Dels=0.00;DP=4;QD=30.22;AF=1.00;MQ0=0;MQ=50.00;SB=-87.10;AN=4 GT:DP:GL:GQ 0/1:2:-8.92,-1.60,-1.00:6.02 0/1:2:-8.92,-1.60,-1.00:6.02
Second XSLT stylesheet: insert into SQL
The second XSL Stylesheet xvcf2sql.xsl generates a mySQL script for inserting the VCF into a database:
xsltproc http://code915.googlecode.com/svn/trunk/core/src/xsl/xvcf2sql.xsl vcf.xml |\
mysql -D test
mysql -D test
Then, for example, the following query scans the database and returns the genotypes/info for each sample:
select
V.chrom,
V.position,
V.ref,
V.alt,
S.name,
concat(left(T.descr,10),"...") as "Format",
F.content
from
vcf_variant as V,
vcf_sample as S,
vcf_call as C,
vcf_callfeature as F,
vcf_format as T
where
C.variant_id=V.id and
C.sample_id=S.id and
F.call_id=C.id and
T.id=F.format_id;
V.chrom,
V.position,
V.ref,
V.alt,
S.name,
concat(left(T.descr,10),"...") as "Format",
F.content
from
vcf_variant as V,
vcf_sample as S,
vcf_call as C,
vcf_callfeature as F,
vcf_format as T
where
C.variant_id=V.id and
C.sample_id=S.id and
F.call_id=C.id and
T.id=F.format_id;
Result:
(xsltproc http://code915.googlecode.com/svn/trunk/core/src/xsl/xvcf2sql.xsl vcf.xml; cat query.sql )|\
mysql -D test
chrom position ref alt name Format content
chrN 10109 A T chr.markdup.bam Genotype... 1/1
chrN 10109 A T chr.markdup.bam Read Depth... 4
chrN 10109 A T chr.markdup.bam Log-scaled... -8.74,-2.21,-6.80
chrN 10109 A T chr.markdup.bam Genotype Q... 45.90
chrN 10109 A T chrN.sorted.bam Genotype... 0/1
chrN 10109 A T chrN.sorted.bam Read Depth... 4
chrN 10109 A T chrN.sorted.bam Log-scaled... -8.74,-2.21,-6.80
chrN 10109 A T chrN.sorted.bam Genotype Q... 45.90
chrN 10578 G A chr.markdup.bam Genotype... 0/1
chrN 10578 G A chr.markdup.bam Read Depth... 2
chrN 10578 G A chr.markdup.bam Log-scaled... -8.92,-1.60,-1.00
chrN 10578 G A chr.markdup.bam Genotype Q... 6.02
chrN 10578 G A chrN.sorted.bam Genotype... 0/1
chrN 10578 G A chrN.sorted.bam Read Depth... 2
chrN 10578 G A chrN.sorted.bam Log-scaled... -8.92,-1.60,-1.00
chrN 10578 G A chrN.sorted.bam Genotype Q... 6.02
That's it
Pierre
what about coding those informations in another file? or in a table?
ReplyDeleteVCF is designed to store only genotypes and raw informations, from a certain point of view this could be an advantage.
I don't meant that the NGS softwares should output some XML but I needed a file format where I could add some informations about the mutations. So let's say that's VCF is OK (it is compact, sortable, etc..) but I've only created a new kind of file to manage my data.
ReplyDeletethanks for this post it. I find it very educational and a good example of XML use.
ReplyDeleteJust one tiny remark:
ReplyDeletecat file.vcf | java -jar vcf2xml.jar > vcf.xml
is considered bad Un*x style, indeed there is an unnecessary call to cat...
The same can be done with pure shell as
java -jar vcf2xml.jar < file.vcf > vcf.xml
Sylvain
My bugbear with these formats is that the embedded structured data should be done in JSON, not some ad hoc format. In vcf, you have
ReplyDeleteGT:DP:GL:GQ 1/1:11:-28.60,-3.15,-0.15:30.00
In GFF3 you have
ID=blah;Parent=foo
If you are going to embed data in a single column, use JSON. i.e.:
{'GT': '1/0', 'DP': 11, ...}
or
{'ID': 'blah', 'Parent': 'foo'}
It would save everyone a lot of hassle.
I believe more than half of people are using command lines. JSON/XML will utterly ruin their lives. Also in the world of NGS, efficiency is often of the first priority. A few of us increasingly feel VCF is not efficient to parse. JSON/XML will only make this worse.
ReplyDelete