08 May 2012

Downloading the XML data from the Exome Variant Server

From EVS: "The goal of the NHLBI GO Exome Sequencing Project (ESP) is to discover novel genes and mechanisms contributing to heart, lung and blood disorders by pioneering the application of next-generation sequencing of the protein coding regions of the human genome across diverse, richly-phenotyped populations and to share these datasets and findings with the scientific community to extend and enrich the diagnosis, management and treatment of heart, lung and blood disorders."

The NHLBI Exome Sequencing Project provides a download area but I wanted to build a local database for the richer XML data returned by their Web Services (previously described here on my blog ). The following java program sends some XML/SOAP requests to the EVS server for each chromosome using a genomic window of 150000 bp and parses the XML response.



it then dumps the results into a tab-delimited file:

Compilation:

javac DumpExomeVariantServerData.java

Execution:
java DumpExomeVariantServerData > input.evs.tsv

May 8, 2012 7:59:58 AM sandbox.DumpExomeVariantServerData fetchEvsData
INFO: 1:1-200011
May 8, 2012 8:00:02 AM sandbox.DumpExomeVariantServerData fetchEvsData
INFO: 1:200001-400011
May 8, 2012 8:00:03 AM sandbox.DumpExomeVariantServerData fetchEvsData
INFO: 1:400001-600011
May 8, 2012 8:00:04 AM sandbox.DumpExomeVariantServerData fetchEvsData
INFO: 1:600001-800011
May 8, 2012 8:00:05 AM sandbox.DumpExomeVariantServerData fetchEvsData
INFO: 1:800001-1000011
(...)

head input.evs.tsv
1 69116 <snpList><positionString>1:69116</positionString><chrPosition>69116</chr
1 69134 <snpList><positionString>1:69134</positionString><chrPosition>69134</chr
1 69270 <snpList><positionString>1:69270</positionString><chrPosition>69270</chr
1 69428 <snpList><positionString>1:69428</positionString><chrPosition>69428</chr
1 69453 <snpList><positionString>1:69453</positionString><chrPosition>69453</chr
1 69476 <snpList><positionString>1:69476</positionString><chrPosition>69476</chr
1 69496 <snpList><positionString>1:69496</positionString><chrPosition>69496</chr
1 69511 <snpList><positionString>1:69511</positionString><chrPosition>69511</chr
1 69552 <snpList><positionString>1:69552</positionString><chrPosition>69552</chr
1 69590 <snpList><positionString>1:69590</positionString><chrPosition>69590</chr

Inserting the EVS data in a sqlite database

We can now create a sqlite3 database to insert the data ...
$ sqlite3 evs.sqlite
sqlite> create table evsData(chrom TEXT NOT NULL,pos INT NOT NULL,xml TEXT NOT NULL);
sqlite> create index chrompos on evsData(chrom,pos);
sqlite> .separator "\t";
sqlite> .import "input.evs.tsv" evsData

... and query this database
$ sqlite3 evs.sqlite 'select xml from evsData where chrom="1" and pos=69552' |\
xmllint  --format -

<?xml version="1.0"?>
<snpList>
  <positionString>1:69552</positionString>
  <chrPosition>69552</chrPosition>
  <alleles>C/G</alleles>
  <uaAlleleCounts>C=4/G=4644</uaAlleleCounts>
  <aaAlleleCounts>C=0/G=2944</aaAlleleCounts>
  <totalAlleleCounts>C=4/G=7588</totalAlleleCounts>
  <uaMAF>0.0861</uaMAF>
  <aaMAF>0.0</aaMAF>
  <totalMAF>0.0527</totalMAF>
  <avgSampleReadDepth>143</avgSampleReadDepth>
  <geneList>OR4F5</geneList>
  <snpFunction>
    <chromosome>1</chromosome>
    <position>69552</position>
    <conservationScore>1.0</conservationScore>
    <conservationScoreGERP>-0.1</conservationScoreGERP>
    <snpFxnList>
      <mrnaAccession>NM_001005484.1</mrnaAccession>
      <fxnClassGVS>coding-synonymous</fxnClassGVS>
      <aminoAcids>none</aminoAcids>
      <proteinPos>154/306</proteinPos>
      <cdnaPos>462</cdnaPos>
      <pphPrediction>unknown</pphPrediction>
      <granthamScore>NA</granthamScore>
    </snpFxnList>
    <refAllele>G</refAllele>
    <ancestralAllele>C</ancestralAllele>
    <firstRsId>55874132</firstRsId>
    <secondRsId>0</secondRsId>
    <filters>SVM</filters>
    <clinicalLink>unknown</clinicalLink>
  </snpFunction>
  <conservationScore>1.0</conservationScore>
  <conservationScoreGERP>-0.1</conservationScoreGERP>
  <refAllele>G</refAllele>
  <altAlleles>C</altAlleles>
  <ancestralAllele>C</ancestralAllele>
  <chromosome>1</chromosome>
  <hasAtLeastOneAccession>true</hasAtLeastOneAccession>
  <rsIds>rs55874132</rsIds>
  <filters>SVM</filters>
  <clinicalLink>unknown</clinicalLink>
  <dbsnpVersion>dbSNP_129</dbsnpVersion>
  <uaGenotypeCounts>CC=0/CG=4/GG=2320</uaGenotypeCounts>
  <aaGenotypeCounts>CC=0/CG=0/GG=1472</aaGenotypeCounts>
  <totalGenotypeCounts>CC=0/CG=4/GG=3792</totalGenotypeCounts>
  <onExomeChip>false</onExomeChip>
  <gwasPubmedIds>unknown</gwasPubmedIds>
</snpList>

The Variation Toolkit

I also wrote a C++ program (that is part of my (always-beta) Variation Toolkit) to use this sqlite database to annotate some VCF-like files. See http://code.google.com/p/variationtoolkit/wiki/VcfEvs

Example 1

$ echo -e "#CHROM\tPOS\n1\t69511\n1\t69512\n1\t69552" |\
vcfevs -f evs.sqlite 

#CHROM  POS     evs.positionString      evs.chrPosition evs.alleles     evs.uaAlleleCounts      evs.aaAlleleCounts      evs.totalAlleleCounts   evs.uaMAF       evs.aaMAF       evs.totalMAF    evs.avgSampleReadDepth  evs.geneList    evs.conservationScore   evs.conservationScoreGERP       evs.refAllele   evs.altAllelesevs.ancestralAllele       evs.chromosome  evs.hasAtLeastOneAccession      evs.rsIds       evs.filters     evs.clinicalLink        evs.dbsnpVersion        evs.uaGenotypeCounts    evs.aaGenotypeCounts    evs.totalGenotypeCounts evs.onExomeChip evs.gwasPubmedIds
1       69511   1:69511 69511   G/A     G=4235/A=483    G=1707/A=1297   G=5942/A=1780   10.2374 43.1758 23.051  74      OR4F5   1.0     1.1     A       G       G1      true    rs75062661      PASS    unknown dbSNP_131       GG=1964/GA=307/AA=88    GG=703/GA=301/AA=498    GG=2667/GA=608/AA=586   false   unknown
1       69512   .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       .       ..      .       .       .       .       .       .       .
1       69552   1:69552 69552   C/G     C=4/G=4644      C=0/G=2944      C=4/G=7588      0.0861  0.0     0.0527  143     OR4F5   1.0     -0.1    G       C       C1      true    rs55874132      SVM     unknown dbSNP_129       CC=0/CG=4/GG=2320       CC=0/CG=0/GG=1472       CC=0/CG=4/GG=3792       false   unknown

Example 2

$ echo -e "#CHROM\tPOS\n1\t69511\n1\t69512\n1\t69552" |\
vcfevs -f ~/WORK/20120506.evs.download/evs.sqlite -c uaMAF 
#CHROM  POS     evs.uaMAF
1       69511   10.2374
1       69512   .
1       69552   0.0861

Example 3

$ echo -e "#CHROM\tPOS\n1\t69511\n1\t69512\n1\t69552" |\
vcfevs -f evs.sqlite  -x -c _ | cut -c 1-200

#CHROM  POS     evs.xml
1       69511   <snpList><positionString>1:69511</positionString><chrPosition>69511</chrPosition><alleles>G/A</alleles><uaAlleleCounts>G=4235/A=483</uaAlleleCounts><aaAlleleCounts>G=1707/A=1297</aaAlleleCount
1       69512   .
1       69552   <snpList><positionString>1:69552</positionString><chrPosition>69552</chrPosition><alleles>C/G</alleles><uaAlleleCounts>C=4/G=4644</uaAlleleCounts><aaAlleleCounts>C=0/G=2944</aaAlleleCounts><to

That's it,

Pierre

4 comments:

Raony Guimaraes said...

Why not using a VCF ?
http://evs.gs.washington.edu/esp5500_bulk_data/ESP5400.snps.vcf.tar.gz

Pierre Lindenbaum said...

The VCF is not suitable for those structured data and as far as I remember it doesn't contains all the data you can find in the XML files.

Unknown said...

Hello Sir
Well written Post..
Please Add a Email Subscription option for your blog So it would be essayer for us (readers) to reach your new posts
Thank you

Pierre Lindenbaum said...

you can just use a RSS reader (see thunderbird or GoogleReader)