Downloading the XML data from the Exome Variant Server
From EVS: "The goal of the NHLBI GO Exome Sequencing Project (ESP) is to discover novel genes and mechanisms contributing to heart, lung and blood disorders by pioneering the application of next-generation sequencing of the protein coding regions of the human genome across diverse, richly-phenotyped populations and to share these datasets and findings with the scientific community to extend and enrich the diagnosis, management and treatment of heart, lung and blood disorders."
The NHLBI Exome Sequencing Project provides a download area but I wanted to build a local database for the richer XML data returned by their Web Services (previously described here on my blog ). The following java program sends some XML/SOAP requests to the EVS server for each chromosome using a genomic window of 150000 bp and parses the XML response.
it then dumps the results into a tab-delimited file:
Compilation:
javac DumpExomeVariantServerData.java
Execution:
java DumpExomeVariantServerData > input.evs.tsv May 8, 2012 7:59:58 AM sandbox.DumpExomeVariantServerData fetchEvsData INFO: 1:1-200011 May 8, 2012 8:00:02 AM sandbox.DumpExomeVariantServerData fetchEvsData INFO: 1:200001-400011 May 8, 2012 8:00:03 AM sandbox.DumpExomeVariantServerData fetchEvsData INFO: 1:400001-600011 May 8, 2012 8:00:04 AM sandbox.DumpExomeVariantServerData fetchEvsData INFO: 1:600001-800011 May 8, 2012 8:00:05 AM sandbox.DumpExomeVariantServerData fetchEvsData INFO: 1:800001-1000011 (...)
head input.evs.tsv 1 69116 <snpList><positionString>1:69116</positionString><chrPosition>69116</chr 1 69134 <snpList><positionString>1:69134</positionString><chrPosition>69134</chr 1 69270 <snpList><positionString>1:69270</positionString><chrPosition>69270</chr 1 69428 <snpList><positionString>1:69428</positionString><chrPosition>69428</chr 1 69453 <snpList><positionString>1:69453</positionString><chrPosition>69453</chr 1 69476 <snpList><positionString>1:69476</positionString><chrPosition>69476</chr 1 69496 <snpList><positionString>1:69496</positionString><chrPosition>69496</chr 1 69511 <snpList><positionString>1:69511</positionString><chrPosition>69511</chr 1 69552 <snpList><positionString>1:69552</positionString><chrPosition>69552</chr 1 69590 <snpList><positionString>1:69590</positionString><chrPosition>69590</chr
Inserting the EVS data in a sqlite database
We can now create a sqlite3 database to insert the data ...$ sqlite3 evs.sqlite sqlite> create table evsData(chrom TEXT NOT NULL,pos INT NOT NULL,xml TEXT NOT NULL); sqlite> create index chrompos on evsData(chrom,pos); sqlite> .separator "\t"; sqlite> .import "input.evs.tsv" evsData
... and query this database
$ sqlite3 evs.sqlite 'select xml from evsData where chrom="1" and pos=69552' |\ xmllint --format - <?xml version="1.0"?> <snpList> <positionString>1:69552</positionString> <chrPosition>69552</chrPosition> <alleles>C/G</alleles> <uaAlleleCounts>C=4/G=4644</uaAlleleCounts> <aaAlleleCounts>C=0/G=2944</aaAlleleCounts> <totalAlleleCounts>C=4/G=7588</totalAlleleCounts> <uaMAF>0.0861</uaMAF> <aaMAF>0.0</aaMAF> <totalMAF>0.0527</totalMAF> <avgSampleReadDepth>143</avgSampleReadDepth> <geneList>OR4F5</geneList> <snpFunction> <chromosome>1</chromosome> <position>69552</position> <conservationScore>1.0</conservationScore> <conservationScoreGERP>-0.1</conservationScoreGERP> <snpFxnList> <mrnaAccession>NM_001005484.1</mrnaAccession> <fxnClassGVS>coding-synonymous</fxnClassGVS> <aminoAcids>none</aminoAcids> <proteinPos>154/306</proteinPos> <cdnaPos>462</cdnaPos> <pphPrediction>unknown</pphPrediction> <granthamScore>NA</granthamScore> </snpFxnList> <refAllele>G</refAllele> <ancestralAllele>C</ancestralAllele> <firstRsId>55874132</firstRsId> <secondRsId>0</secondRsId> <filters>SVM</filters> <clinicalLink>unknown</clinicalLink> </snpFunction> <conservationScore>1.0</conservationScore> <conservationScoreGERP>-0.1</conservationScoreGERP> <refAllele>G</refAllele> <altAlleles>C</altAlleles> <ancestralAllele>C</ancestralAllele> <chromosome>1</chromosome> <hasAtLeastOneAccession>true</hasAtLeastOneAccession> <rsIds>rs55874132</rsIds> <filters>SVM</filters> <clinicalLink>unknown</clinicalLink> <dbsnpVersion>dbSNP_129</dbsnpVersion> <uaGenotypeCounts>CC=0/CG=4/GG=2320</uaGenotypeCounts> <aaGenotypeCounts>CC=0/CG=0/GG=1472</aaGenotypeCounts> <totalGenotypeCounts>CC=0/CG=4/GG=3792</totalGenotypeCounts> <onExomeChip>false</onExomeChip> <gwasPubmedIds>unknown</gwasPubmedIds> </snpList>
The Variation Toolkit
I also wrote a C++ program (that is part of my (always-beta) Variation Toolkit) to use this sqlite database to annotate some VCF-like files. See http://code.google.com/p/variationtoolkit/wiki/VcfEvsExample 1
$ echo -e "#CHROM\tPOS\n1\t69511\n1\t69512\n1\t69552" |\ vcfevs -f evs.sqlite #CHROM POS evs.positionString evs.chrPosition evs.alleles evs.uaAlleleCounts evs.aaAlleleCounts evs.totalAlleleCounts evs.uaMAF evs.aaMAF evs.totalMAF evs.avgSampleReadDepth evs.geneList evs.conservationScore evs.conservationScoreGERP evs.refAllele evs.altAllelesevs.ancestralAllele evs.chromosome evs.hasAtLeastOneAccession evs.rsIds evs.filters evs.clinicalLink evs.dbsnpVersion evs.uaGenotypeCounts evs.aaGenotypeCounts evs.totalGenotypeCounts evs.onExomeChip evs.gwasPubmedIds 1 69511 1:69511 69511 G/A G=4235/A=483 G=1707/A=1297 G=5942/A=1780 10.2374 43.1758 23.051 74 OR4F5 1.0 1.1 A G G1 true rs75062661 PASS unknown dbSNP_131 GG=1964/GA=307/AA=88 GG=703/GA=301/AA=498 GG=2667/GA=608/AA=586 false unknown 1 69512 . . . . . . . . . . . . . . . . . . .. . . . . . . . 1 69552 1:69552 69552 C/G C=4/G=4644 C=0/G=2944 C=4/G=7588 0.0861 0.0 0.0527 143 OR4F5 1.0 -0.1 G C C1 true rs55874132 SVM unknown dbSNP_129 CC=0/CG=4/GG=2320 CC=0/CG=0/GG=1472 CC=0/CG=4/GG=3792 false unknown
Example 2
$ echo -e "#CHROM\tPOS\n1\t69511\n1\t69512\n1\t69552" |\ vcfevs -f ~/WORK/20120506.evs.download/evs.sqlite -c uaMAF #CHROM POS evs.uaMAF 1 69511 10.2374 1 69512 . 1 69552 0.0861
Example 3
$ echo -e "#CHROM\tPOS\n1\t69511\n1\t69512\n1\t69552" |\ vcfevs -f evs.sqlite -x -c _ | cut -c 1-200 #CHROM POS evs.xml 1 69511 <snpList><positionString>1:69511</positionString><chrPosition>69511</chrPosition><alleles>G/A</alleles><uaAlleleCounts>G=4235/A=483</uaAlleleCounts><aaAlleleCounts>G=1707/A=1297</aaAlleleCount 1 69512 . 1 69552 <snpList><positionString>1:69552</positionString><chrPosition>69552</chrPosition><alleles>C/G</alleles><uaAlleleCounts>C=4/G=4644</uaAlleleCounts><aaAlleleCounts>C=0/G=2944</aaAlleleCounts><to
That's it,
Pierre
4 comments:
Why not using a VCF ?
http://evs.gs.washington.edu/esp5500_bulk_data/ESP5400.snps.vcf.tar.gz
The VCF is not suitable for those structured data and as far as I remember it doesn't contains all the data you can find in the XML files.
Hello Sir
Well written Post..
Please Add a Email Subscription option for your blog So it would be essayer for us (readers) to reach your new posts
Thank you
you can just use a RSS reader (see thunderbird or GoogleReader)
Post a Comment