The NHLBI Exome Sequencing Project provides a download area, but I wanted to build a local database for the richer XML data returned by their Web Services (previously described here on my blog). The following Java program sends XML/SOAP requests to the EVS server for each chromosome, using a genomic window of 150000 bp, and parses the XML responses.
It then dumps the results into a tab-delimited file.
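To illustrate how a chromosome can be cut into query windows, here is a minimal Java sketch. It is not taken from DumpExomeVariantServerData.java: the class name GenomicWindows, the method windows and the chromosome length used in main are hypothetical, and the step (200000) and span (200010) values are only read off the log printed in the Execution step, so treat them as assumptions. The sketch only builds the chrom:start-end labels; it sends nothing to the EVS server.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: enumerates overlapping 1-based windows on one chromosome. */
final class GenomicWindows {

    /**
     * Returns labels of the form "chrom:start-end".
     * A new window starts every 'step' bases and ends 'span' bases after its
     * start, clipped to the chromosome length.
     */
    static List<String> windows(String chrom, int chromLength, int step, int span) {
        if (step <= 0 || span < 0) {
            throw new IllegalArgumentException("step must be > 0 and span >= 0");
        }
        List<String> out = new ArrayList<>();
        // long avoids overflow when start + span exceeds Integer.MAX_VALUE
        for (long start = 1; start <= chromLength; start += step) {
            long end = Math.min(start + span, (long) chromLength);
            out.add(chrom + ":" + start + "-" + end);
        }
        return out;
    }

    public static void main(String[] args) {
        // 500000 is an arbitrary, made-up chromosome length for the demo.
        // Prints: 1:1-200011, 1:200001-400011, 1:400001-500000 (one per line)
        for (String w : windows("1", 500000, 200000, 200010)) {
            System.out.println(w);
        }
    }
}
```

Making consecutive windows overlap by a few bases, as the log suggests, means a variant sitting on a window boundary is not missed, at the cost of possible duplicate rows that have to be removed later.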
Compilation:
javac DumpExomeVariantServerData.java
Execution:
java DumpExomeVariantServerData > input.evs.tsv
May 8, 2012 7:59:58 AM sandbox.DumpExomeVariantServerData fetchEvsData
INFO: 1:1-200011
May 8, 2012 8:00:02 AM sandbox.DumpExomeVariantServerData fetchEvsData
INFO: 1:200001-400011
May 8, 2012 8:00:03 AM sandbox.DumpExomeVariantServerData fetchEvsData
INFO: 1:400001-600011
May 8, 2012 8:00:04 AM sandbox.DumpExomeVariantServerData fetchEvsData
INFO: 1:600001-800011
May 8, 2012 8:00:05 AM sandbox.DumpExomeVariantServerData fetchEvsData
INFO: 1:800001-1000011
(...)

head input.evs.tsv
1	69116	<snpList><positionString>1:69116</positionString><chrPosition>69116</chr
1	69134	<snpList><positionString>1:69134</positionString><chrPosition>69134</chr
1	69270	<snpList><positionString>1:69270</positionString><chrPosition>69270</chr
1	69428	<snpList><positionString>1:69428</positionString><chrPosition>69428</chr
1	69453	<snpList><positionString>1:69453</positionString><chrPosition>69453</chr
1	69476	<snpList><positionString>1:69476</positionString><chrPosition>69476</chr
1	69496	<snpList><positionString>1:69496</positionString><chrPosition>69496</chr
1	69511	<snpList><positionString>1:69511</positionString><chrPosition>69511</chr
1	69552	<snpList><positionString>1:69552</positionString><chrPosition>69552</chr
1	69590	<snpList><positionString>1:69590</positionString><chrPosition>69590</chr
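Each line of input.evs.tsv therefore holds three tab-separated columns: chromosome, position and the raw XML record. As a small illustration, the following Java sketch splits one such line into its fields. The class EvsTsvLine is hypothetical and is not part of the original program; it assumes the XML column is always the third and last column.

```java
/** Hypothetical value class for one line of input.evs.tsv: chrom, pos, xml. */
final class EvsTsvLine {
    final String chrom;
    final int pos;
    final String xml;

    private EvsTsvLine(String chrom, int pos, String xml) {
        this.chrom = chrom;
        this.pos = pos;
        this.xml = xml;
    }

    /** Splits on the first two tabs only, so any tab inside the XML is kept. */
    static EvsTsvLine parse(String line) {
        String[] tokens = line.split("\t", 3);
        if (tokens.length != 3) {
            throw new IllegalArgumentException("expected 3 tab-separated columns: " + line);
        }
        return new EvsTsvLine(tokens[0], Integer.parseInt(tokens[1]), tokens[2]);
    }

    public static void main(String[] args) {
        EvsTsvLine rec = parse("1\t69552\t<snpList><positionString>1:69552</positionString></snpList>");
        // Prints: 1:69552 xml-length=62
        System.out.println(rec.chrom + ":" + rec.pos + " xml-length=" + rec.xml.length());
    }
}
```

Keeping the chromosome as text rather than a number matches the TEXT column used in the sqlite table and also copes with names such as X or Y.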
Inserting the EVS data in a sqlite database
We can now create a sqlite3 database and insert the data ...
$ sqlite3 evs.sqlite
sqlite> create table evsData(chrom TEXT NOT NULL,pos INT NOT NULL,xml TEXT NOT NULL);
sqlite> create index chrompos on evsData(chrom,pos);
sqlite> .separator "\t"
sqlite> .import "input.evs.tsv" evsData
... and query this database
$ sqlite3 evs.sqlite 'select xml from evsData where chrom="1" and pos=69552' |\
  xmllint --format -
<?xml version="1.0"?>
<snpList>
  <positionString>1:69552</positionString>
  <chrPosition>69552</chrPosition>
  <alleles>C/G</alleles>
  <uaAlleleCounts>C=4/G=4644</uaAlleleCounts>
  <aaAlleleCounts>C=0/G=2944</aaAlleleCounts>
  <totalAlleleCounts>C=4/G=7588</totalAlleleCounts>
  <uaMAF>0.0861</uaMAF>
  <aaMAF>0.0</aaMAF>
  <totalMAF>0.0527</totalMAF>
  <avgSampleReadDepth>143</avgSampleReadDepth>
  <geneList>OR4F5</geneList>
  <snpFunction>
    <chromosome>1</chromosome>
    <position>69552</position>
    <conservationScore>1.0</conservationScore>
    <conservationScoreGERP>-0.1</conservationScoreGERP>
    <snpFxnList>
      <mrnaAccession>NM_001005484.1</mrnaAccession>
      <fxnClassGVS>coding-synonymous</fxnClassGVS>
      <aminoAcids>none</aminoAcids>
      <proteinPos>154/306</proteinPos>
      <cdnaPos>462</cdnaPos>
      <pphPrediction>unknown</pphPrediction>
      <granthamScore>NA</granthamScore>
    </snpFxnList>
    <refAllele>G</refAllele>
    <ancestralAllele>C</ancestralAllele>
    <firstRsId>55874132</firstRsId>
    <secondRsId>0</secondRsId>
    <filters>SVM</filters>
    <clinicalLink>unknown</clinicalLink>
  </snpFunction>
  <conservationScore>1.0</conservationScore>
  <conservationScoreGERP>-0.1</conservationScoreGERP>
  <refAllele>G</refAllele>
  <altAlleles>C</altAlleles>
  <ancestralAllele>C</ancestralAllele>
  <chromosome>1</chromosome>
  <hasAtLeastOneAccession>true</hasAtLeastOneAccession>
  <rsIds>rs55874132</rsIds>
  <filters>SVM</filters>
  <clinicalLink>unknown</clinicalLink>
  <dbsnpVersion>dbSNP_129</dbsnpVersion>
  <uaGenotypeCounts>CC=0/CG=4/GG=2320</uaGenotypeCounts>
  <aaGenotypeCounts>CC=0/CG=0/GG=1472</aaGenotypeCounts>
  <totalGenotypeCounts>CC=0/CG=4/GG=3792</totalGenotypeCounts>
  <onExomeChip>false</onExomeChip>
  <gwasPubmedIds>unknown</gwasPubmedIds>
</snpList>
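Once a record like the one above has been fetched, its fields can be read with the XML parser shipped with the JDK. The sketch below is only an illustration under my own assumptions: the class SnpListFields and the method leafFields are hypothetical names, and the choice to keep only the direct children of snpList that contain no nested element (so snpFunction is skipped) is a simplification, not a description of how any existing tool works.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

/** Hypothetical helper: flattens one snpList record into name/value pairs. */
final class SnpListFields {

    /** Returns the direct children of the root that hold text only, in document order. */
    static Map<String, String> leafFields(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("cannot parse XML record", e);
        }
        Map<String, String> fields = new LinkedHashMap<>();
        for (Node n = doc.getDocumentElement().getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() != Node.ELEMENT_NODE) continue; // skip whitespace text nodes
            if (hasChildElement(n)) continue;                   // skip nested blocks such as snpFunction
            fields.put(n.getNodeName(), n.getTextContent().trim());
        }
        return fields;
    }

    private static boolean hasChildElement(Node parent) {
        for (Node c = parent.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c.getNodeType() == Node.ELEMENT_NODE) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        // Shortened record built from values shown in the query output above.
        String xml = "<snpList><positionString>1:69552</positionString>"
                + "<totalMAF>0.0527</totalMAF>"
                + "<snpFunction><chromosome>1</chromosome></snpFunction>"
                + "<refAllele>G</refAllele></snpList>";
        // Prints: {positionString=1:69552, totalMAF=0.0527, refAllele=G}
        System.out.println(leafFields(xml));
    }
}
```

A LinkedHashMap keeps the fields in document order, which is convenient if they are later written out as columns.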
The Variation Toolkit
I also wrote a C++ program, part of my (always-beta) Variation Toolkit, that uses this sqlite database to annotate VCF-like files. See http://code.google.com/p/variationtoolkit/wiki/VcfEvs
Example 1
$ echo -e "#CHROM\tPOS\n1\t69511\n1\t69512\n1\t69552" |\
  vcfevs -f evs.sqlite
#CHROM	POS	evs.positionString	evs.chrPosition	evs.alleles	evs.uaAlleleCounts	evs.aaAlleleCounts	evs.totalAlleleCounts	evs.uaMAF	evs.aaMAF	evs.totalMAF	evs.avgSampleReadDepth	evs.geneList	evs.conservationScore	evs.conservationScoreGERP	evs.refAllele	evs.altAlleles	evs.ancestralAllele	evs.chromosome	evs.hasAtLeastOneAccession	evs.rsIds	evs.filters	evs.clinicalLink	evs.dbsnpVersion	evs.uaGenotypeCounts	evs.aaGenotypeCounts	evs.totalGenotypeCounts	evs.onExomeChip	evs.gwasPubmedIds
1	69511	1:69511	69511	G/A	G=4235/A=483	G=1707/A=1297	G=5942/A=1780	10.2374	43.1758	23.051	74	OR4F5	1.0	1.1	A	G	G	1	true	rs75062661	PASS	unknown	dbSNP_131	GG=1964/GA=307/AA=88	GG=703/GA=301/AA=498	GG=2667/GA=608/AA=586	false	unknown
1	69512	.	.	.	.	.	.	.	.	.	.	.	.	.	.	.	.	.	.	.	.	.	.	.	.	.	.	.
1	69552	1:69552	69552	C/G	C=4/G=4644	C=0/G=2944	C=4/G=7588	0.0861	0.0	0.0527	143	OR4F5	1.0	-0.1	G	C	C	1	true	rs55874132	SVM	unknown	dbSNP_129	CC=0/CG=4/GG=2320	CC=0/CG=0/GG=1472	CC=0/CG=4/GG=3792	false	unknown
Example 2
$ echo -e "#CHROM\tPOS\n1\t69511\n1\t69512\n1\t69552" |\
  vcfevs -f ~/WORK/20120506.evs.download/evs.sqlite -c uaMAF
#CHROM	POS	evs.uaMAF
1	69511	10.2374
1	69512	.
1	69552	0.0861
Example 3
$ echo -e "#CHROM\tPOS\n1\t69511\n1\t69512\n1\t69552" |\
  vcfevs -f evs.sqlite -x -c _ | cut -c 1-200
#CHROM	POS	evs.xml
1	69511	<snpList><positionString>1:69511</positionString><chrPosition>69511</chrPosition><alleles>G/A</alleles><uaAlleleCounts>G=4235/A=483</uaAlleleCounts><aaAlleleCounts>G=1707/A=1297</aaAlleleCount
1	69512	.
1	69552	<snpList><positionString>1:69552</positionString><chrPosition>69552</chrPosition><alleles>C/G</alleles><uaAlleleCounts>C=4/G=4644</uaAlleleCounts><aaAlleleCounts>C=0/G=2944</aaAlleleCounts><to
That's it,
Pierre