4 Tools I wrote today to annotate VCF files.
Here are four java-bases tools I wrote today for @SolenaLS to annotate VCF files.
VCFTabix
VCFTabix : we want to use the VCF of the 1Kgenomes project (ALL.wgs.phase1_release_v3.20101123.snps_indels_sv.sites.vcf.gz) as a source of annotation for our VCF.Options
-f (vcf indexed with tabix) REQUIRED. -T (tag String) VCF-INFO-ID optional can be used several times. -R doesn't use REF allele -A doesn't use ALT allele -I don't replace ID if it exists. -F don't replace INFO field if it exists. -C (TAG) use this tag in case of conflict with the ALT allele.
Example:
java -jar dist/vcftabix.jar -f ALL.wgs.phase1_release_v3.20101123.snps_indels_sv.sites.vcf.gz \ -T AC -T SNPSOURCE input.vcf ##fileformat=VCFv4.1 (....) ##Annotated with fr.inserm.umr1087.jvarkit.tools.vcftabix.VCFTabix:ALL.wgs.phase1_release_v3.20101123.snps_indels_sv.sites.vcf.gz ##INFO=<ID=AC,Number=.,Type=Integer,Description="Alternate Allele Count"> ##INFO=<ID=SNPSOURCE,Number=.,Type=String,Description="indicates if a snp was called when analysing the low coverage or exome alignment data"> #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT CD10977 5 2999819 rs21234 A G 1620 . AC=2156;AF=1.00;AN=2;DB;DP=41;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=59.44;MQ0=0;QD=39.51;SB=-6.519e-03;SNPSOURCE=LOWCOV; GT:AD:DP:GQ:PL 1/1:0,41:41:99:1653,123,0 (...)
EVS2BED
EVS2BED: bulk download of the XML data of EVS into a BED file CHROM/START/END/XML (see my related post: http://plindenbaum.blogspot.fr/2012/05/downloading-xml-data-from-exome-variant.html )Example
Download 10 records and index the data with tabix:java -jar dist/evs2bed.jar -L 10 | LC_ALL=C sort -k1,1 -k2,2n -k3,3n -k4,4 -u |\ /path/totabix-0.2.6/bgzip -c > test.bed.gz /path/totabix-0.2.6/tabix -p bed -f test.bed.gz $ gunzip -c test.bed.gz | cut -c 1-500 | fold -w 80 1 69427 69428 <snpList><positionString>1:69428</positionString><chrPos ition>69428</chrPosition><alleles>G/T</alleles><uaAlleleCounts>G=313/T=6535</uaA lleleCounts><aaAlleleCounts>G=14/T=3808</aaAlleleCounts><totalAlleleCounts>G=327 /T=10343</totalAlleleCounts><uaMAF>4.5707</uaMAF><aaMAF>0.3663</aaMAF><totalMAF> 3.0647</totalMAF><avgSampleReadDepth>110</avgSampleReadDepth><geneList>OR4F5</ge neList><snpFunction><chromosome>1</chromosome><position>69428</position><conserv ationScore>1.0</conservationSc 1 69475 69476 <snpList><positionString>1:69476</positionString><chrPos ition>69476</chrPosition><alleles>C/T</alleles><uaAlleleCounts>C=2/T=7020</uaAll eleCounts><aaAlleleCounts>C=0/T=3908</aaAlleleCounts><totalAlleleCounts>C=2/T=10 928</totalAlleleCounts><uaMAF>0.0285</uaMAF><aaMAF>0.0</aaMAF><totalMAF>0.0183</ totalMAF><avgSampleReadDepth>123</avgSampleReadDepth><geneList>OR4F5</geneList>< snpFunction><chromosome>1</chromosome><position>69476</position><conservationSco re>0.6</conservationScore><con 1 69495 69496 <snpList><positionString>1:69496</positionString><chrPos ition>69496</chrPosition><alleles>A/G</alleles><uaAlleleCounts>A=2/G=6764</uaAll eleCounts><aaAlleleCounts>A=23/G=3785</aaAlleleCounts><totalAlleleCounts>A=25/G= 10549</totalAlleleCounts><uaMAF>0.0296</uaMAF><aaMAF>0.604</aaMAF><totalMAF>0.23 64</totalMAF><avgSampleReadDepth>91</avgSampleReadDepth><geneList>OR4F5</geneLis t><snpFunction><chromosome>1</chromosome><position>69496</position><conservation Score>0.5</conservationScore>< 1 69510 69511 <snpList><positionString>1:69511</positionString><chrPos ition>69511</chrPosition><alleles>G/A</alleles><uaAlleleCounts>G=5337/A=677</uaA lleleCounts><aaAlleleCounts>G=1937/A=1623</aaAlleleCounts><totalAlleleCounts>G=7 274/A=2300</totalAlleleCounts><uaMAF>11.2571</uaMAF><aaMAF>45.5899</aaMAF><total MAF>24.0234</totalMAF><avgSampleReadDepth>69</avgSampleReadDepth><geneList>OR4F5 </geneList><snpFunction><chromosome>1</chromosome><position>69511</position><con servationScore>1.0</conservati 1 69589 69590 <snpList><positionString>1:69590</positionString><chrPos ition>69590</chrPosition><alleles>A/T</alleles><uaAlleleCounts>A=0/T=6214</uaAll eleCounts><aaAlleleCounts>A=1/T=3555</aaAlleleCounts><totalAlleleCounts>A=1/T=97 69</totalAlleleCounts><uaMAF>0.0</uaMAF><aaMAF>0.0281</aaMAF><totalMAF>0.0102</t otalMAF><avgSampleReadDepth>119</avgSampleReadDepth><geneList>OR4F5</geneList><s npFunction><chromosome>1</chromosome><position>69590</position><conservationScor e>0.8</conservationScore><cons 1 69593 69594 <snpList><positionString>1:69594</positionString><chrPos ition>69594</chrPosition><alleles>C/T</alleles><uaAlleleCounts>C=2/T=6190</uaAll eleCounts><aaAlleleCounts>C=0/T=3548</aaAlleleCounts><totalAlleleCounts>C=2/T=97 38</totalAlleleCounts><uaMAF>0.0323</uaMAF><aaMAF>0.0</aaMAF><totalMAF>0.0205</t otalMAF><avgSampleReadDepth>109</avgSampleReadDepth><geneList>OR4F5</geneList><s npFunction><chromosome>1</chromosome><position>69594</position><conservationScor e>0.8</conservationScore><cons 1 69619 69620 <snpList><positionString>1:69620</positionString><chrPos ition>69620</chrPosition><alleles>T/TA</alleles><uaAlleleCounts>A1=2/R=4694</uaA lleleCounts><aaAlleleCounts>A1=0/R=2954</aaAlleleCounts><totalAlleleCounts>A1=2/ R=7648</totalAlleleCounts><uaMAF>0.0426</uaMAF><aaMAF>0.0</aaMAF><totalMAF>0.026 1</totalMAF><avgSampleReadDepth>56</avgSampleReadDepth><geneList>OR4F5</geneList ><snpFunction><chromosome>1</chromosome><position>69620</position><conservationS core>0.7</conservationScore><c 1 69744 69745 <snpList><positionString>1:69745</positionString><chrPos ition>69745</chrPosition><alleles>CA/C</alleles><uaAlleleCounts>A1=4/R=4254</uaA lleleCounts><aaAlleleCounts>A1=0/R=2716</aaAlleleCounts><totalAlleleCounts>A1=4/ R=6970</totalAlleleCounts><uaMAF>0.0939</uaMAF><aaMAF>0.0</aaMAF><totalMAF>0.057 4</totalMAF><avgSampleReadDepth>10</avgSampleReadDepth><geneList>OR4F5</geneList ><snpFunction><chromosome>1</chromosome><position>69745</position><conservationS core>0.9</conservationScore><c 1 69760 69761 <snpList><positionString>1:69761</positionString><chrPos ition>69761</chrPosition><alleles>T/A</alleles><uaAlleleCounts>T=645/A=4093</uaA lleleCounts><aaAlleleCounts>T=62/A=2840</aaAlleleCounts><totalAlleleCounts>T=707 /A=6933</totalAlleleCounts><uaMAF>13.6133</uaMAF><aaMAF>2.1365</aaMAF><totalMAF> 9.2539</totalMAF><avgSampleReadDepth>8</avgSampleReadDepth><geneList>OR4F5</gene List><snpFunction><chromosome>1</chromosome><position>69761</position><conservat ionScore>0.1</conservationScor
VCFTabixml
VCFTabixml retrieves the annotations from a BED (CHROM/START/END/XML) and, using XLST, annotates a VCF.Example
In the following example, a few variations are downloaded from EVS using EVS2BED. Then a small VCF is created and annotated using the bed and the following XSLT stylesheet: evs2vcf.xsl.java -jar dist/evs2bed.jar -L 10 |\ LC_ALL=C sort -k1,1 -k2,2n -k3,3n -k4,4 -u |\ tabix-0.2.6/bgzip -c > test.bed.gz /tabix-0.2.6/tabix -p bed -f test.bed.gz java -jar dist/vcftabixml.jar \ -f test.bed.gz -x dist/evs2vcf.xsl test.vcf echo "#CHROM POS ID REF ALT QUAL FILTER INFO" > test.vcf echo "1 1 . C T . . T=test1" >> test.vcf echo "1 69476 . T C . . T=test2" >> test.vcf echo "1 69511 . A C . . T=test3" >> test.vcf java -jar dist/vcftabixml.jar \ -f test.bed.gz -x dist/evs2vcf.xsl test.vcf
Result:
##Annotated with fr.inserm.umr1087.jvarkit.tools.vcftabixml.VCFTabixml:test.bed.gz #CHROM POS ID REF ALT QUAL FILTER INFO 1 1 . C T . . T=test1; 1 69476 . T C . . T=test2;EVS_UAMAF=0.0285;EVS_AAMAF=0.0;EVS_TOTALMAF=0.0183;EVS_AVGSAMPLEREADDEPTH=123;EVS_GENELIST=OR4F5;EVS_CONSERVATIONSCORE=0.6;EVS_CONSERVATIONSCOREGERP=2.3;EVS_RSIDS=rs148502021;EVS_CLINICALLINK=unknown;EVS_ONEXOMECHIP=false;EVS_GWASPUBMEDIDS=unknown; 1 69511 . A C . . T=test3;EVS_UAMAF=11.2571;EVS_AAMAF=45.5899;EVS_TOTALMAF=24.0234;EVS_AVGSAMPLEREADDEPTH=69;EVS_GENELIST=OR4F5;EVS_CONSERVATIONSCORE=1.0;EVS_CONSERVATIONSCOREGERP=1.1;EVS_RSIDS=rs75062661;EVS_CLINICALLINK=unknown;EVS_ONEXOMECHIP=false;EVS_GWASPUBMEDIDS=unknown;EVS_CONFLICTALT=G;
VCFBigWig
VCFBigWig use a bigwig file to annotate a VCF file.Example:
add the GERP score to a VCF:$ java -jar dist/vcfbigwig.jar -f /commun/data/pubdb/ucsc/hg19/bbi/All_hg19_RS_noprefix.bw test.vep.vcf ##fileformat=VCFv4.1 ##Annotated with class fr.inserm.umr1087.jvarkit.tools.vcfbigwig.VCFBigWig:/commun/data/pubdb/ucsc/hg19/bbi/All_hg19_RS_noprefix.bw ##INFO=<ID=All_hg19_RS_noprefix,Number=1,Type=Float,Description="Annotations from /commun/data/pubdb/ucsc/hg19/bbi/All_hg19_RS_noprefix.bw"> #CHROM POS ID REF ALT QUAL FILTER INFO 22 12345 . G T 100.0 . All_hg19_RS_noprefix=4.829999923706055
4 comments:
Only four? The whole day? :-)
well... I started yesterday. The sources are small, that was not Rocket Science :-)
why you no try www.gene-talk.de ?
there you can annotate, communicate, filter, and so on...
Dear Pierre Lindenbaum, I admire your work, therefore I would like to know what you think about GeneTalk.
It would be nice if you would visit www.gene-talk.de and give some critics.
thanks
Alex
Post a Comment