The NHLBI Exome Sequencing Project provides a download area but I wanted to build a local database for the richer XML data returned by their Web Services (previously described here on my blog ). The following java program sends some XML/SOAP requests to the EVS server for each chromosome using a genomic window of 150000 bp and parses the XML response.
it then dumps the results into a tab-delimited file:
Compilation:
javac DumpExomeVariantServerData.java
Execution:
java DumpExomeVariantServerData > input.evs.tsv May 8, 2012 7:59:58 AM sandbox.DumpExomeVariantServerData fetchEvsData INFO: 1:1-200011 May 8, 2012 8:00:02 AM sandbox.DumpExomeVariantServerData fetchEvsData INFO: 1:200001-400011 May 8, 2012 8:00:03 AM sandbox.DumpExomeVariantServerData fetchEvsData INFO: 1:400001-600011 May 8, 2012 8:00:04 AM sandbox.DumpExomeVariantServerData fetchEvsData INFO: 1:600001-800011 May 8, 2012 8:00:05 AM sandbox.DumpExomeVariantServerData fetchEvsData INFO: 1:800001-1000011 (...)
head input.evs.tsv 1 69116 <snpList><positionString>1:69116</positionString><chrPosition>69116</chr 1 69134 <snpList><positionString>1:69134</positionString><chrPosition>69134</chr 1 69270 <snpList><positionString>1:69270</positionString><chrPosition>69270</chr 1 69428 <snpList><positionString>1:69428</positionString><chrPosition>69428</chr 1 69453 <snpList><positionString>1:69453</positionString><chrPosition>69453</chr 1 69476 <snpList><positionString>1:69476</positionString><chrPosition>69476</chr 1 69496 <snpList><positionString>1:69496</positionString><chrPosition>69496</chr 1 69511 <snpList><positionString>1:69511</positionString><chrPosition>69511</chr 1 69552 <snpList><positionString>1:69552</positionString><chrPosition>69552</chr 1 69590 <snpList><positionString>1:69590</positionString><chrPosition>69590</chr
Inserting the EVS data in a sqlite database
We can now create a sqlite3 database to insert the data ...$ sqlite3 evs.sqlite sqlite> create table evsData(chrom TEXT NOT NULL,pos INT NOT NULL,xml TEXT NOT NULL); sqlite> create index chrompos on evsData(chrom,pos); sqlite> .separator "\t"; sqlite> .import "input.evs.tsv" evsData
... and query this database
$ sqlite3 evs.sqlite 'select xml from evsData where chrom="1" and pos=69552' |\
xmllint --format -
<?xml version="1.0"?>
<snpList>
<positionString>1:69552</positionString>
<chrPosition>69552</chrPosition>
<alleles>C/G</alleles>
<uaAlleleCounts>C=4/G=4644</uaAlleleCounts>
<aaAlleleCounts>C=0/G=2944</aaAlleleCounts>
<totalAlleleCounts>C=4/G=7588</totalAlleleCounts>
<uaMAF>0.0861</uaMAF>
<aaMAF>0.0</aaMAF>
<totalMAF>0.0527</totalMAF>
<avgSampleReadDepth>143</avgSampleReadDepth>
<geneList>OR4F5</geneList>
<snpFunction>
<chromosome>1</chromosome>
<position>69552</position>
<conservationScore>1.0</conservationScore>
<conservationScoreGERP>-0.1</conservationScoreGERP>
<snpFxnList>
<mrnaAccession>NM_001005484.1</mrnaAccession>
<fxnClassGVS>coding-synonymous</fxnClassGVS>
<aminoAcids>none</aminoAcids>
<proteinPos>154/306</proteinPos>
<cdnaPos>462</cdnaPos>
<pphPrediction>unknown</pphPrediction>
<granthamScore>NA</granthamScore>
</snpFxnList>
<refAllele>G</refAllele>
<ancestralAllele>C</ancestralAllele>
<firstRsId>55874132</firstRsId>
<secondRsId>0</secondRsId>
<filters>SVM</filters>
<clinicalLink>unknown</clinicalLink>
</snpFunction>
<conservationScore>1.0</conservationScore>
<conservationScoreGERP>-0.1</conservationScoreGERP>
<refAllele>G</refAllele>
<altAlleles>C</altAlleles>
<ancestralAllele>C</ancestralAllele>
<chromosome>1</chromosome>
<hasAtLeastOneAccession>true</hasAtLeastOneAccession>
<rsIds>rs55874132</rsIds>
<filters>SVM</filters>
<clinicalLink>unknown</clinicalLink>
<dbsnpVersion>dbSNP_129</dbsnpVersion>
<uaGenotypeCounts>CC=0/CG=4/GG=2320</uaGenotypeCounts>
<aaGenotypeCounts>CC=0/CG=0/GG=1472</aaGenotypeCounts>
<totalGenotypeCounts>CC=0/CG=4/GG=3792</totalGenotypeCounts>
<onExomeChip>false</onExomeChip>
<gwasPubmedIds>unknown</gwasPubmedIds>
</snpList>The Variation Toolkit
I also wrote a C++ program (that is part of my (always-beta) Variation Toolkit) to use this sqlite database to annotate some VCF-like files. See http://code.google.com/p/variationtoolkit/wiki/VcfEvsExample 1
$ echo -e "#CHROM\tPOS\n1\t69511\n1\t69512\n1\t69552" |\ vcfevs -f evs.sqlite #CHROM POS evs.positionString evs.chrPosition evs.alleles evs.uaAlleleCounts evs.aaAlleleCounts evs.totalAlleleCounts evs.uaMAF evs.aaMAF evs.totalMAF evs.avgSampleReadDepth evs.geneList evs.conservationScore evs.conservationScoreGERP evs.refAllele evs.altAllelesevs.ancestralAllele evs.chromosome evs.hasAtLeastOneAccession evs.rsIds evs.filters evs.clinicalLink evs.dbsnpVersion evs.uaGenotypeCounts evs.aaGenotypeCounts evs.totalGenotypeCounts evs.onExomeChip evs.gwasPubmedIds 1 69511 1:69511 69511 G/A G=4235/A=483 G=1707/A=1297 G=5942/A=1780 10.2374 43.1758 23.051 74 OR4F5 1.0 1.1 A G G1 true rs75062661 PASS unknown dbSNP_131 GG=1964/GA=307/AA=88 GG=703/GA=301/AA=498 GG=2667/GA=608/AA=586 false unknown 1 69512 . . . . . . . . . . . . . . . . . . .. . . . . . . . 1 69552 1:69552 69552 C/G C=4/G=4644 C=0/G=2944 C=4/G=7588 0.0861 0.0 0.0527 143 OR4F5 1.0 -0.1 G C C1 true rs55874132 SVM unknown dbSNP_129 CC=0/CG=4/GG=2320 CC=0/CG=0/GG=1472 CC=0/CG=4/GG=3792 false unknown
Example 2
$ echo -e "#CHROM\tPOS\n1\t69511\n1\t69512\n1\t69552" |\ vcfevs -f ~/WORK/20120506.evs.download/evs.sqlite -c uaMAF #CHROM POS evs.uaMAF 1 69511 10.2374 1 69512 . 1 69552 0.0861
Example 3
$ echo -e "#CHROM\tPOS\n1\t69511\n1\t69512\n1\t69552" |\ vcfevs -f evs.sqlite -x -c _ | cut -c 1-200 #CHROM POS evs.xml 1 69511 <snpList><positionString>1:69511</positionString><chrPosition>69511</chrPosition><alleles>G/A</alleles><uaAlleleCounts>G=4235/A=483</uaAlleleCounts><aaAlleleCounts>G=1707/A=1297</aaAlleleCount 1 69512 . 1 69552 <snpList><positionString>1:69552</positionString><chrPosition>69552</chrPosition><alleles>C/G</alleles><uaAlleleCounts>C=4/G=4644</uaAlleleCounts><aaAlleleCounts>C=0/G=2944</aaAlleleCounts><to
That's it,
Pierre