My tool to annotate VCF files.
The Variant Call Format is a text file format generated by many tools for NGS. It contains meta-information lines, a header line, and then data lines describing how the mutations were called. I don't like this format because it cannot be used to store some hierachical annotations (like json or xml), nevertheless it is a de facto standard.
I wrote a tool called vcfannotator to append a set of annotations from the UCSC database to a VCF file. As I wanted to keep this tool simple and without any dependencies, it only uses the flat files available from the download area at the UCSC.
This tools appends several informations:
- A prediction of the mutation: is it in the cDNA ? in an intron ? in an exon ? is it a non-synonymous mutation ? was a stop codon lost or gained ? is there any consequence on the splicing process ? Here, the UCSC DAS server and the KnonwGenes table are used to retrieve the genomic DNA and the structure of the gene.
- dbSNP: was this mutation found before in dbSNP ?
- Personal genomes: was this mutation found before in Venter's genome ? In Watson's genome ? etc...
- The mapability (Uniqueness of Reference Genome ): gives an idea about how the context is unique in the genome
- genomicSuperDups: is this mutation located in a segmental genomic duplication ?
- tfbsConsSites: is this mutation located in the site of a transcription factor ?
- phastCons44way: how is the mutation conserved within a multiple alignments of 44 vertebrate genomes to the human genome ?
The tool is available on github at: https://github.com/lindenb/jsandbox in
jsandbox/src/sandbox/VCFAnnotator.java
.Compilation
> cd jsandbox
> ant vcfannotator
> ant vcfannotator
Execution & Options
> java -jar dist/vcfannotator.jar -h
VCF annotator
Pierre Lindenbaum PhD. 2010. http://plindenbaum.blogspot.com
Options:
-b ucsc.build default:hg18
-m mapability.
-g genomicSuperDups
-p basic prediction
-c phastcons prediction (phastCons44way)
-t transcription factors sites prediction
-pg personal genomes
-snp <id> add ucsc <id> must be present in "http://hgdownload.cse.ucsc.edu/goldenPath/<ucscdb>/database/<id>.txt.gz" e.g. snp129
-log <level> one value from java.util.logging.Level default:OFF
-proxyHost <host>
-proxyPort <port>
VCF annotator
Pierre Lindenbaum PhD. 2010. http://plindenbaum.blogspot.com
Options:
-b ucsc.build default:hg18
-m mapability.
-g genomicSuperDups
-p basic prediction
-c phastcons prediction (phastCons44way)
-t transcription factors sites prediction
-pg personal genomes
-snp <id> add ucsc <id> must be present in "http://hgdownload.cse.ucsc.edu/goldenPath/<ucscdb>/database/<id>.txt.gz" e.g. snp129
-log <level> one value from java.util.logging.Level default:OFF
-proxyHost <host>
-proxyPort <port>
Example
Input
##fileformat=VCFv3.2
##fileDate=20091120
##source=gigabayes
##reference=ncbi36
##phasing=none
#CHROM POS ID REF ALT QUAL FILTER INFO
1 1105366 . T C 99 0 NS=471;DP=16179;AC=8;AN=942;HWE=0.0685233
1 1105373 . C T 99 0 NS=469;DP=15318;AC=5;AN=938;HWE=0.0267954
1 1108138 rs61733845 C T 99 0 NS=450;DP=16134;AC=110;AN=900;dbSNP;HWE=0.930276
1 1109206 . G A 99 0 NS=696;DP=54701;AC=2;AN=1392;HWE=0.0028777
1 1109262 . C T 99 0 NS=696;DP=55783;AC=12;AN=1392;HWE=0.104349
1 1110233 . C G 99 0 NS=694;DP=34301;AC=43;AN=1388;HWE=4.91397
1 1110240 . T A 99 0 NS=695;DP=35846;AC=3;AN=1390;HWE=0.00648883
1 1110294 rs1320571 G A 99 0 NS=608;DP=36050;AC=246;AN=1216;dbSNP;HM;HWE=3.05358
1 1110351 . A C 99 0 NS=696;DP=26284;AC=15;AN=1392;HWE=0.163402
1 1110358 . T C 99 0 NS=694;DP=24596;AC=24;AN=1388;HWE=0.422309
1 1110366 . G A 99 0 NS=697;DP=22511;AC=16;AN=1394;HWE=0.185781
1 3537495 . C T 99 0 NS=697;DP=25302;AC=15;AN=1394;HWE=0.163165
1 3537910 . G A 99 0 NS=605;DP=15896;AC=34;AN=1210;HWE=0.46683
1 3537996 rs2760321 T C 99 0 NS=571;DP=15376;AC=1040;AN=1142;dbSNP;HM;HWE=4.28852
(...)
##fileDate=20091120
##source=gigabayes
##reference=ncbi36
##phasing=none
#CHROM POS ID REF ALT QUAL FILTER INFO
1 1105366 . T C 99 0 NS=471;DP=16179;AC=8;AN=942;HWE=0.0685233
1 1105373 . C T 99 0 NS=469;DP=15318;AC=5;AN=938;HWE=0.0267954
1 1108138 rs61733845 C T 99 0 NS=450;DP=16134;AC=110;AN=900;dbSNP;HWE=0.930276
1 1109206 . G A 99 0 NS=696;DP=54701;AC=2;AN=1392;HWE=0.0028777
1 1109262 . C T 99 0 NS=696;DP=55783;AC=12;AN=1392;HWE=0.104349
1 1110233 . C G 99 0 NS=694;DP=34301;AC=43;AN=1388;HWE=4.91397
1 1110240 . T A 99 0 NS=695;DP=35846;AC=3;AN=1390;HWE=0.00648883
1 1110294 rs1320571 G A 99 0 NS=608;DP=36050;AC=246;AN=1216;dbSNP;HM;HWE=3.05358
1 1110351 . A C 99 0 NS=696;DP=26284;AC=15;AN=1392;HWE=0.163402
1 1110358 . T C 99 0 NS=694;DP=24596;AC=24;AN=1388;HWE=0.422309
1 1110366 . G A 99 0 NS=697;DP=22511;AC=16;AN=1394;HWE=0.185781
1 3537495 . C T 99 0 NS=697;DP=25302;AC=15;AN=1394;HWE=0.163165
1 3537910 . G A 99 0 NS=605;DP=15896;AC=34;AN=1210;HWE=0.46683
1 3537996 rs2760321 T C 99 0 NS=571;DP=15376;AC=1040;AN=1142;dbSNP;HM;HWE=4.28852
(...)
Command:
java -jar dist/vcfannotator.jar -pg -m -g -p -c -t -snp snp130 -log ALL input.vcf
## wait !
## wait !
Result
##fileformat=VCFv3.2
##fileDate=20091120
##source=gigabayes
##reference=ncbi36
##phasing=none
##SNP130=table snp130 from UCSC
##INFO=NA12878,1,String,"NA12878's Personal genome"
##INFO=NA12891,1,String,"NA12891's Personal genome"
(...)
chr1 1108138 rs61733845 C T 99 0 NS=450;DP=16134;AC=110;AN=900;dbSNP;HWE=0.930276;MAPABILITY_WGENCODEBROADMAPABILITYALIGN36MER=1;MAPABILITY_WGENCODEDUKEUNIQUENESS20=1;MAPABILITY_WGENCODEDUKEUNIQUENESS24=1;MAPABILITY_WGENCODEDUKEUNIQUENESS35=1;phastCons44way=1.000;PREDICTION=geneSymbol:TTLL10|wild.codon:TGC|wild.aa:C|pos.cdna:935|kgId:uc001acy.1|position.protein:312|strand:+|mut.aa:C|mut.codon:TGT|exon.name:Exon 11|type:EXON_CODING_SYNONYMOUS;PREDICTION=geneSymbol:TTLL10|wild.codon:TGC|wild.aa:C|pos.cdna:716|kgId:uc001acz.1|position.protein:239|strand:+|mut.aa:C|mut.codon:TGT|exon.name:Exon 7|type:EXON_CODING_SYNONYMOUS
(...)
chr1 1110351 . A C 99 0 NS=696;DP=26284;AC=15;AN=1392;HWE=0.163402;MAPABILITY_WGENCODEBROADMAPABILITYALIGN36MER=1;MAPABILITY_WGENCODEDUKEUNIQUENESS20=1;MAPABILITY_WGENCODEDUKEUNIQUENESS24=1;MAPABILITY_WGENCODEDUKEUNIQUENESS35=1;phastCons44way=1.000;PREDICTION=geneSymbol:TTLL10|wild.codon:AAG|wild.aa:K|splicing:SPLICING_DONOR|pos.cdna:1399|kgId:uc001acy.1|position.protein:467|strand:+|mut.aa:T|mut.codon:ACG|exon.name:Exon 13|type:EXON_CODING_NON_SYNONYMOUS;PREDICTION=geneSymbol:TTLL10|wild.codon:AAG|wild.aa:K|pos.cdna:1180|kgId:uc001acz.1|position.protein:394|strand:+|mut.aa:T|mut.codon:ACG|exon.name:Exon 9|type:EXON_CODING_NON_SYNONYMOUS
(...)
chr1 113068276 . C G 99 0 NS=697;DP=12774;AC=7;AN=1394;HWE=0.0353282;MAPABILITY_WGENCODEBROADMAPABILITYALIGN36MER=1;MAPABILITY_WGENCODEDUKEUNIQUENESS20=1;MAPABILITY_WGENCODEDUKEUNIQUENESS24=1;MAPABILITY_WGENCODEDUKEUNIQUENESS35=1;phastCons44way=1.000;PREDICTION=geneSymbol:FAM19A3|wild.codon:CCA|wild.aa:P|pos.cdna:451|kgId:uc001ecu.1|position.protein:151|strand:+|mut.aa:R|mut.codon:CGA|exon.name:Exon 4|type:EXON_CODING_NON_SYNONYMOUS;PREDICTION=geneSymbol:FAM19A3|wild.codon:ACC|wild.aa:T|pos.cdna:383|kgId:uc001ecv.1|position.protein:128|strand:+|mut.aa:T|mut.codon:ACG|exon.name:Exon 4|type:EXON_CODING_SYNONYMOUS
(...)
Interestingly, for this last mutation chr1:113068276, there are two transcripts at the same position with two different translation frames, so there are two predictions: one synonymous mutation and one non-synonymous mutation.##fileDate=20091120
##source=gigabayes
##reference=ncbi36
##phasing=none
##SNP130=table snp130 from UCSC
##INFO=NA12878,1,String,"NA12878's Personal genome"
##INFO=NA12891,1,String,"NA12891's Personal genome"
(...)
chr1 1108138 rs61733845 C T 99 0 NS=450;DP=16134;AC=110;AN=900;dbSNP;HWE=0.930276;MAPABILITY_WGENCODEBROADMAPABILITYALIGN36MER=1;MAPABILITY_WGENCODEDUKEUNIQUENESS20=1;MAPABILITY_WGENCODEDUKEUNIQUENESS24=1;MAPABILITY_WGENCODEDUKEUNIQUENESS35=1;phastCons44way=1.000;PREDICTION=geneSymbol:TTLL10|wild.codon:TGC|wild.aa:C|pos.cdna:935|kgId:uc001acy.1|position.protein:312|strand:+|mut.aa:C|mut.codon:TGT|exon.name:Exon 11|type:EXON_CODING_SYNONYMOUS;PREDICTION=geneSymbol:TTLL10|wild.codon:TGC|wild.aa:C|pos.cdna:716|kgId:uc001acz.1|position.protein:239|strand:+|mut.aa:C|mut.codon:TGT|exon.name:Exon 7|type:EXON_CODING_SYNONYMOUS
(...)
chr1 1110351 . A C 99 0 NS=696;DP=26284;AC=15;AN=1392;HWE=0.163402;MAPABILITY_WGENCODEBROADMAPABILITYALIGN36MER=1;MAPABILITY_WGENCODEDUKEUNIQUENESS20=1;MAPABILITY_WGENCODEDUKEUNIQUENESS24=1;MAPABILITY_WGENCODEDUKEUNIQUENESS35=1;phastCons44way=1.000;PREDICTION=geneSymbol:TTLL10|wild.codon:AAG|wild.aa:K|splicing:SPLICING_DONOR|pos.cdna:1399|kgId:uc001acy.1|position.protein:467|strand:+|mut.aa:T|mut.codon:ACG|exon.name:Exon 13|type:EXON_CODING_NON_SYNONYMOUS;PREDICTION=geneSymbol:TTLL10|wild.codon:AAG|wild.aa:K|pos.cdna:1180|kgId:uc001acz.1|position.protein:394|strand:+|mut.aa:T|mut.codon:ACG|exon.name:Exon 9|type:EXON_CODING_NON_SYNONYMOUS
(...)
chr1 113068276 . C G 99 0 NS=697;DP=12774;AC=7;AN=1394;HWE=0.0353282;MAPABILITY_WGENCODEBROADMAPABILITYALIGN36MER=1;MAPABILITY_WGENCODEDUKEUNIQUENESS20=1;MAPABILITY_WGENCODEDUKEUNIQUENESS24=1;MAPABILITY_WGENCODEDUKEUNIQUENESS35=1;phastCons44way=1.000;PREDICTION=geneSymbol:FAM19A3|wild.codon:CCA|wild.aa:P|pos.cdna:451|kgId:uc001ecu.1|position.protein:151|strand:+|mut.aa:R|mut.codon:CGA|exon.name:Exon 4|type:EXON_CODING_NON_SYNONYMOUS;PREDICTION=geneSymbol:FAM19A3|wild.codon:ACC|wild.aa:T|pos.cdna:383|kgId:uc001ecv.1|position.protein:128|strand:+|mut.aa:T|mut.codon:ACG|exon.name:Exon 4|type:EXON_CODING_SYNONYMOUS
(...)
That's it,
Pierre
16 comments:
Great !
Similar work in Plants seems harder to be done !
Very useful tool, thanks!
Just a comment : the tool works without a problem in hg18, but some url's are not directly translatable to hg19, thus the tool cannot fetch some of the flat files for hg19.
It would be great if it worked. But sadly doesn't at all.
some ulrs does not exist even for HG18! This tool does not work at all.
ahaha :-) just tell the UCSC ! :-)
http://www.lazycodeslinger.com/pics/works-on-my-machine-starburst.png :-)
Dear Pierre,
I would be very grateful if you could have a look on a crash report from VCFannotator test run from yesterday. It seems it's some argument issue with PredictionAnnotator.
I will gladly provide you a complete log and VCF file used for the analysis, but haven't been able to dig out your email address.
Please contact me on: michal.kovac@well.ox.ac.uk or michal.kovac@unibas.ch
Many thanks for your time.
Michal
Pierre, looks like a great tool. But I've never used jsandbox and couldn't find any information on how to install and use it. Sorry for maybe stupid question.
download the code from github, go in the directory jsandbox and type "ant vcfannotator". If its sounds too technical , ask for help after a informatician.
Hello Pierre,
First of all, thanks for making your code available. I tried running your annotation code but had problems... I get the same error with the input file on your blog. I get:
java.io.IOException: Error in header got 1 1105366 . T C 99 0 NS=471;DP=16179;AC=8;AN=942;HWE=0.0685233 but expected #CHROM POS ID REF ALT QUAL FILTER INFO
Any ideas?
Thanks
Adrian
@anonymous, I guess your VCF is missing the required VCF header starting with "#CHROM..." , see the VCF specification.
Initially I had the same java.io.IOException: Error in header error as the previous poster.
Turns out it was due to the whitespace being spaces and not tabs.
Quick conversion using vim can be done:
:%s/\s\+/\t/g
Then the sample file should work.
My comment is that the annotation process seems to take a looooong time before it finishes - even for this short sample data.
With your hadoop experience - would this be something worth applying the mapreduce paradigm to? Seems like it should be fairly trivial as all the variants are all independent of one another.
We have a slow script that does some similar annotations, and I'm looking for solutions to improve the speed/replace it.
Thanks again!
Hi Pierre,
I didn't find any way to have the output written into another vcf file. Is that possible?
Thanks in advance for your kind attention,
Francesco
@Francesco the tool was designed for Linux. You can can redirect the output to a file http://www.linuxsa.org.au/tips/io-redirection.html
Seems like a very useful tool!
However, when I try to run it, I get the following error:
java.io.IOException: illegal number of columns in
at sandbox.VCFFile.read(VCFAnnotator.java:521)
at sandbox.VCFFile.parse(VCFAnnotator.java:576)
at sandbox.VCFAnnotator.main(VCFAnnotator.java:2778)
My VCF-file is an output of GATK, so I think the format should be fine. I already tried with a few lines which are the same as your example. But it still keeps throwing me this error. It might be a stupid question, but do you know what I'm doing wrong?
Thanks!
This software is now outdated, you should use something like galaxy or snpeff or etc...
Dear Pierre,
Sorry for my question yesterday. Very stupid indeed. Apparently my VCF-file contains an empty row at the bottom of the file and this is the reason why vcfannotator wouldn't work.
It works fine now. Thanks for this great tool!
Post a Comment