01 January 2011

My tool to annotate VCF files.

The Variant Call Format is a text file format generated by many tools for NGS. It contains meta-information lines, a header line, and then data lines describing how the mutations were called. I don't like this format because it cannot be used to store some hierachical annotations (like json or xml), nevertheless it is a de facto standard.

I wrote a tool called vcfannotator to append a set of annotations from the UCSC database to a VCF file. As I wanted to keep this tool simple and without any dependencies, it only uses the flat files available from the download area at the UCSC.

This tools appends several informations:

  • A prediction of the mutation: is it in the cDNA ? in an intron ? in an exon ? is it a non-synonymous mutation ? was a stop codon lost or gained ? is there any consequence on the splicing process ? Here, the UCSC DAS server and the KnonwGenes table are used to retrieve the genomic DNA and the structure of the gene.
  • dbSNP: was this mutation found before in dbSNP ?
  • Personal genomes: was this mutation found before in Venter's genome ? In Watson's genome ? etc...
  • The mapability (Uniqueness of Reference Genome ): gives an idea about how the context is unique in the genome
  • genomicSuperDups: is this mutation located in a segmental genomic duplication ?
  • tfbsConsSites: is this mutation located in the site of a transcription factor ?
  • phastCons44way: how is the mutation conserved within a multiple alignments of 44 vertebrate genomes to the human genome ?


The tool is available on github at: https://github.com/lindenb/jsandbox in jsandbox/src/sandbox/VCFAnnotator.java.

Compilation

> cd jsandbox
> ant vcfannotator

Execution & Options

> java -jar dist/vcfannotator.jar -h
VCF annotator
Pierre Lindenbaum PhD. 2010. http://plindenbaum.blogspot.com
Options:
-b ucsc.build default:hg18
-m mapability.
-g genomicSuperDups
-p basic prediction
-c phastcons prediction (phastCons44way)
-t transcription factors sites prediction
-pg personal genomes
-snp <id> add ucsc <id> must be present in "http://hgdownload.cse.ucsc.edu/goldenPath/<ucscdb>/database/<id>.txt.gz" e.g. snp129
-log <level> one value from java.util.logging.Level default:OFF
-proxyHost <host>
-proxyPort <port>

Example

Input

##fileformat=VCFv3.2
##fileDate=20091120
##source=gigabayes
##reference=ncbi36
##phasing=none
#CHROM POS ID REF ALT QUAL FILTER INFO
1 1105366 . T C 99 0 NS=471;DP=16179;AC=8;AN=942;HWE=0.0685233
1 1105373 . C T 99 0 NS=469;DP=15318;AC=5;AN=938;HWE=0.0267954
1 1108138 rs61733845 C T 99 0 NS=450;DP=16134;AC=110;AN=900;dbSNP;HWE=0.930276
1 1109206 . G A 99 0 NS=696;DP=54701;AC=2;AN=1392;HWE=0.0028777
1 1109262 . C T 99 0 NS=696;DP=55783;AC=12;AN=1392;HWE=0.104349
1 1110233 . C G 99 0 NS=694;DP=34301;AC=43;AN=1388;HWE=4.91397
1 1110240 . T A 99 0 NS=695;DP=35846;AC=3;AN=1390;HWE=0.00648883
1 1110294 rs1320571 G A 99 0 NS=608;DP=36050;AC=246;AN=1216;dbSNP;HM;HWE=3.05358
1 1110351 . A C 99 0 NS=696;DP=26284;AC=15;AN=1392;HWE=0.163402
1 1110358 . T C 99 0 NS=694;DP=24596;AC=24;AN=1388;HWE=0.422309
1 1110366 . G A 99 0 NS=697;DP=22511;AC=16;AN=1394;HWE=0.185781
1 3537495 . C T 99 0 NS=697;DP=25302;AC=15;AN=1394;HWE=0.163165
1 3537910 . G A 99 0 NS=605;DP=15896;AC=34;AN=1210;HWE=0.46683
1 3537996 rs2760321 T C 99 0 NS=571;DP=15376;AC=1040;AN=1142;dbSNP;HM;HWE=4.28852
(...)

Command:

java -jar dist/vcfannotator.jar -pg -m -g -p -c -t -snp snp130 -log ALL input.vcf
## wait !

Result

##fileformat=VCFv3.2
##fileDate=20091120
##source=gigabayes
##reference=ncbi36
##phasing=none
##SNP130=table snp130 from UCSC
##INFO=NA12878,1,String,"NA12878's Personal genome"
##INFO=NA12891,1,String,"NA12891's Personal genome"
(...)
chr1 1108138 rs61733845 C T 99 0 NS=450;DP=16134;AC=110;AN=900;dbSNP;HWE=0.930276;MAPABILITY_WGENCODEBROADMAPABILITYALIGN36MER=1;MAPABILITY_WGENCODEDUKEUNIQUENESS20=1;MAPABILITY_WGENCODEDUKEUNIQUENESS24=1;MAPABILITY_WGENCODEDUKEUNIQUENESS35=1;phastCons44way=1.000;PREDICTION=geneSymbol:TTLL10|wild.codon:TGC|wild.aa:C|pos.cdna:935|kgId:uc001acy.1|position.protein:312|strand:+|mut.aa:C|mut.codon:TGT|exon.name:Exon 11|type:EXON_CODING_SYNONYMOUS;PREDICTION=geneSymbol:TTLL10|wild.codon:TGC|wild.aa:C|pos.cdna:716|kgId:uc001acz.1|position.protein:239|strand:+|mut.aa:C|mut.codon:TGT|exon.name:Exon 7|type:EXON_CODING_SYNONYMOUS
(...)
chr1 1110351 . A C 99 0 NS=696;DP=26284;AC=15;AN=1392;HWE=0.163402;MAPABILITY_WGENCODEBROADMAPABILITYALIGN36MER=1;MAPABILITY_WGENCODEDUKEUNIQUENESS20=1;MAPABILITY_WGENCODEDUKEUNIQUENESS24=1;MAPABILITY_WGENCODEDUKEUNIQUENESS35=1;phastCons44way=1.000;PREDICTION=geneSymbol:TTLL10|wild.codon:AAG|wild.aa:K|splicing:SPLICING_DONOR|pos.cdna:1399|kgId:uc001acy.1|position.protein:467|strand:+|mut.aa:T|mut.codon:ACG|exon.name:Exon 13|type:EXON_CODING_NON_SYNONYMOUS;PREDICTION=geneSymbol:TTLL10|wild.codon:AAG|wild.aa:K|pos.cdna:1180|kgId:uc001acz.1|position.protein:394|strand:+|mut.aa:T|mut.codon:ACG|exon.name:Exon 9|type:EXON_CODING_NON_SYNONYMOUS
(...)
chr1 113068276 . C G 99 0 NS=697;DP=12774;AC=7;AN=1394;HWE=0.0353282;MAPABILITY_WGENCODEBROADMAPABILITYALIGN36MER=1;MAPABILITY_WGENCODEDUKEUNIQUENESS20=1;MAPABILITY_WGENCODEDUKEUNIQUENESS24=1;MAPABILITY_WGENCODEDUKEUNIQUENESS35=1;phastCons44way=1.000;PREDICTION=geneSymbol:FAM19A3|wild.codon:CCA|wild.aa:P|pos.cdna:451|kgId:uc001ecu.1|position.protein:151|strand:+|mut.aa:R|mut.codon:CGA|exon.name:Exon 4|type:EXON_CODING_NON_SYNONYMOUS;PREDICTION=geneSymbol:FAM19A3|wild.codon:ACC|wild.aa:T|pos.cdna:383|kgId:uc001ecv.1|position.protein:128|strand:+|mut.aa:T|mut.codon:ACG|exon.name:Exon 4|type:EXON_CODING_SYNONYMOUS
(...)
Interestingly, for this last mutation chr1:113068276, there are two transcripts at the same position with two different translation frames, so there are two predictions: one synonymous mutation and one non-synonymous mutation.


That's it,

Pierre

16 comments:

whuanleeu said...

Great !
Similar work in Plants seems harder to be done !

zeynep kalender said...

Very useful tool, thanks!
Just a comment : the tool works without a problem in hg18, but some url's are not directly translatable to hg19, thus the tool cannot fetch some of the flat files for hg19.

Anonymous said...

It would be great if it worked. But sadly doesn't at all.

Anonymous said...

some ulrs does not exist even for HG18! This tool does not work at all.

Pierre Lindenbaum said...

ahaha :-) just tell the UCSC ! :-)

http://www.lazycodeslinger.com/pics/works-on-my-machine-starburst.png :-)

Michal Kovac said...

Dear Pierre,

I would be very grateful if you could have a look on a crash report from VCFannotator test run from yesterday. It seems it's some argument issue with PredictionAnnotator.

I will gladly provide you a complete log and VCF file used for the analysis, but haven't been able to dig out your email address.

Please contact me on: michal.kovac@well.ox.ac.uk or michal.kovac@unibas.ch

Many thanks for your time.
Michal

Anonymous said...

Pierre, looks like a great tool. But I've never used jsandbox and couldn't find any information on how to install and use it. Sorry for maybe stupid question.

Pierre Lindenbaum said...

download the code from github, go in the directory jsandbox and type "ant vcfannotator". If its sounds too technical , ask for help after a informatician.

Anonymous said...

Hello Pierre,

First of all, thanks for making your code available. I tried running your annotation code but had problems... I get the same error with the input file on your blog. I get:
java.io.IOException: Error in header got 1 1105366 . T C 99 0 NS=471;DP=16179;AC=8;AN=942;HWE=0.0685233 but expected #CHROM POS ID REF ALT QUAL FILTER INFO

Any ideas?

Thanks
Adrian

Pierre Lindenbaum said...

@anonymous, I guess your VCF is missing the required VCF header starting with "#CHROM..." , see the VCF specification.

Jim V said...

Initially I had the same java.io.IOException: Error in header error as the previous poster.

Turns out it was due to the whitespace being spaces and not tabs.

Quick conversion using vim can be done:

:%s/\s\+/\t/g

Then the sample file should work.

My comment is that the annotation process seems to take a looooong time before it finishes - even for this short sample data.

With your hadoop experience - would this be something worth applying the mapreduce paradigm to? Seems like it should be fairly trivial as all the variants are all independent of one another.

We have a slow script that does some similar annotations, and I'm looking for solutions to improve the speed/replace it.

Thanks again!

Francesco Nea said...

Hi Pierre,
I didn't find any way to have the output written into another vcf file. Is that possible?

Thanks in advance for your kind attention,

Francesco

Pierre Lindenbaum said...

@Francesco the tool was designed for Linux. You can can redirect the output to a file http://www.linuxsa.org.au/tips/io-redirection.html

Lien said...

Seems like a very useful tool!
However, when I try to run it, I get the following error:
java.io.IOException: illegal number of columns in
at sandbox.VCFFile.read(VCFAnnotator.java:521)
at sandbox.VCFFile.parse(VCFAnnotator.java:576)
at sandbox.VCFAnnotator.main(VCFAnnotator.java:2778)

My VCF-file is an output of GATK, so I think the format should be fine. I already tried with a few lines which are the same as your example. But it still keeps throwing me this error. It might be a stupid question, but do you know what I'm doing wrong?
Thanks!

Pierre Lindenbaum said...

This software is now outdated, you should use something like galaxy or snpeff or etc...

Lien said...

Dear Pierre,
Sorry for my question yesterday. Very stupid indeed. Apparently my VCF-file contains an empty row at the bottom of the file and this is the reason why vcfannotator wouldn't work.
It works fine now. Thanks for this great tool!