25 March 2010

Playing with Biomart. My notebook.


Thanks to Neil, I've discovered that Biomart was not just a web interace, it also contains a powerful API: Citing http://www.biomart.org/martservice.html:
To submit a query using our webservices generate an XML document conforming to our Query XML syntax. This can be achieved simply by building up your query using MartView and hitting the XML button . This XML should be posted to http://(...)/martservice attached to a single parameter of query.

The base URL for Biomart is http://www.biomart.org/biomart.

Retrieving the Registry information for this Biomart installation: http://www.biomart.org/biomart/martservice?type=registry
<MartRegistry>
<MartURLLocation database="ensembl_mart_57" default="1" displayName="ENSEMBL GENES 57 (SANGER UK)" host="www.biomart.org" includeDatasets="" martUser="" name="ensembl" path="/biomart/martservice" port="80" serverVirtualSchema="default" visible="1" />
<MartURLLocation database="hapmart_27_36" default="1" displayName="HAPMAP 27 (NCBI US)" host="hapmap.ncbi.nlm.nih.gov" includeDatasets="" name="HapMap_rel27" path="/biomart/martservice" port="80" redirect="1" serverVirtualSchema="rel27_NCBI_Build36" visible="1" />
<MartURLLocation database="vega_mart_57" default="0" displayName="VEGA 37 (SANGER UK)" host="www.biomart.org" includeDatasets="" martUser="" name="vega" path="/biomart/martservice" port="80" serverVirtualSchema="default" visible="1" />
<MartURLLocation database="genomic_features_mart_57" default="0" displayName="ENSEMBL GENOMIC FEATURES 57 (SANGER UK)" host="www.biomart.org" includeDatasets="" martUser="" name="genomic_features" path="/biomart/martservice" port="80" serverVirtualSchema="default" visible="0" />
(...)
</MartRegistry>


Retrieving the datasets available for the registry HapMap_rel27: http://www.biomart.org/biomart/martservice?type=datasets&mart=HapMap_rel27 :
TableSet hm27_variation_yri HapMap Population YRI - Yoruba in Ibadan, Nigeria (West Africa) 1 20050000 default 2009-10-28 11:19:52
TableSet hm27_variation_tsi HapMap Population TSI - Toscans in Italy 1 200 50000 default 2009-10-28 11:13:07
TableSet hm27_gene hm27_gene 0 200 50000 default 2009-09-17 12:04:20
TableSet hm27_variation_chd HapMap Population CHD - Chinese in Metropolitan Denver, Colorado 1 200 50000 default 2009-10-28 11:07:01
TableSet hm27_variation All Populations 1 200 50000 default 2009-10-28 11:22:21
TableSet hm27_variation_chb HapMap Population CHB - Han Chinese in Beijing, China 1 200 50000 default 2009-10-28 11:05:57
TableSet hm27_variation_asw HapMap Population ASW - African ancestry in Southwest USA 1 20050000 default 2009-10-28 08:53:43
TableSet hm27_variation_ceu HapMap Population CEU - Utah residents with Northern and Western European ancestry from the CEPH collection 1 200 50000 default 2009-10-28 11:04:39
TableSet hm27_variation_mkk HapMap Population MKK - Maasai in Kinyawa, Kenya 1 200 50000 default 2009-10-28 11:12:12
(...)

Where can I find a description of those columns ?


Retrieving the attributes for the dataset hm27_variation_ceu: http://www.biomart.org/biomart/martservice?type=attributes&dataset=hm27_variation_ceu :
chrom chromosome naive_attributes html,txt,csv,tsv,xls hm27_variation_ceu__variation__main chrom
start position naive_attributes html,txt,csv,tsv,xls hm27_variation_ceu__variation__main start
strand strand naive_attributes html,txt,csv,tsv,xls hm27_variation_ceu__variation__main strand
marker1 marker id naive_attributes html,txt,csv,tsv,xls hm27_variation_ceu__variation__main marker1
alleles alleles naive_attributes html,txt,csv,tsv,xls hm27_variation_ceu__variation__main alleles
refallele reference allele naive_attributes html,txt,csv,tsv,xls hm27_variation_ceu__variation__main refallele
center genotyping center naive_attributes html,txt,csv,tsv,xls hm27_variation_ceu__variation__maincenter
platform genotyping platform naive_attributes html,txt,csv,tsv,xls hm27_variation_ceu__variation__main platform
ceu_id CEU genotype naive_attributes html,txt,csv,tsv,xls hm27_variation_ceu__variation__dm genotype
ref_allele reference allele naive_attributes html,txt,csv,tsv,xls hm27_variation_ceu__variation__main ref_allele
ref_allele_freq reference allele frequency naive_attributes html,txt,csv,tsv,xls hm27_variation_ceu__variation__main ref_allele_freq
ref_allele_count reference allele count naive_attributes html,txt,csv,tsv,xls hm27_variation_ceu__variation__main ref_allele_count
other_allele other allele naive_attributes html,txt,csv,tsv,xls hm27_variation_ceu__variation__mainother_allele
other_allele_freq other allele frequency naive_attributes html,txt,csv,tsv,xls hm27_variation_ceu__variation__main other_allele_freq
other_allele_count other allele count naive_attributes html,txt,csv,tsv,xls hm27_variation_ceu__variation__main other_allele_count
total_allele_count total allele count naive_attributes html,txt,csv,tsv,xls hm27_variation_ceu__variation__main total_allele_count
refhom_gt reference homozygote genotype naive_attributes html,txt,csv,tsv,xls hm27_variation_ceu__variation__main refhom_gt
het_gt heterozygote genotype naive_attributes html,txt,csv,tsv,xls hm27_variation_ceu__variation__mainhet_gt
otherhom_gt other homozygote genotype naive_attributes html,txt,csv,tsv,xls hm27_variation_ceu__variation__main otherhom_gt
refhom_gtfreq reference homozygote genotype frequency naive_attributes html,txt,csv,tsv,xls hm27_variation_ceu__variation__main refhom_gtfreq
het_gtfreq heterozygote genotype frequency naive_attributes html,txt,csv,tsv,xls hm27_variation_ceu__variation__main het_gtfreq
otherhom_gtfreq other homozygote genotype frequency naive_attributes html,txt,csv,tsv,xls hm27_variation_ceu__variation__main otherhom_gtfreq
refhom_gtcount reference homozygote genotype count naive_attributes html,txt,csv,tsv,xls hm27_variation_ceu__variation__main refhom_gtcount
het_gtcount heterozygote genotype count naive_attributes html,txt,csv,tsv,xls hm27_variation_ceu__variation__main het_gtcount
otherhom_gtcount other homozygote genotype count naive_attributes html,txt,csv,tsv,xls hm27_variation_ceu__variation__main otherhom_gtcount
total_gtcount total genotype count naive_attributes html,txt,csv,tsv,xls hm27_variation_ceu__variation__main total_gtcount
prot_lsid protocol LSID naive_attributes html,txt,csv,tsv,xls hm27_variation_ceu__assay__dm prot_lsid
assay_id assay LSID naive_attributes html,txt,csv,tsv,xls hm27_variation_ceu__assay__dm assay_lsid

(Where can I find a description of those columns ?). Each attribute is a specific column in the final result. Example of attributes available for the dataset hm27_variation_ceu:
  • chrom
  • start
  • marker1
  • alleles
  • (...)


Retrieving the filters for the dataset hm27_variation_ceu: http://www.biomart.org/biomart/martservice?type=filters&dataset=hm27_variation_ceu :
allele_freq Minor Allele Frequency [>=] [0.0,0.01,0.05,0.1,0.15,0.2,0.25,0.3,0.35,0.4,0.45,0.5] naive_filters list >= hm27_variation_ceu__variation__main avg_dprime
center Genotyping Center [affymetrix,bcm,illumina,imsut-riken,mcgill-gqic,perlegen,ucsf-wu,broad,sanger] naive_filters list = hm27_variation_ceu__variation__main center
platform Genotyping Platform [AFFY_100K,AFFY_500K,BeadArray,FP-TDI,Infinium_H1,Infinium_H300,Invader,MIP,Perlegen,AFFY_6.0,Illumina_1M] naive_filters list = hm27_variation_ceu__variation__main platform
monomorphic Monomorphic SNPs [,0] naive_filters boolean_num =only,excluded hm27_variation_ceu__variation__main is_reference_bool
marker_name List of HapMap SNPs [] naive_filters = hm27_variation_ceu__variation__mainmarker1
utr3 3' UTR [] naive_filters boolean_list only,excluded hm27_variation_ceu__variation__main is_utr_bool
utr5 5' UTR [] naive_filters boolean_list only,excluded hm27_variation_ceu__variation__main is_5utr_bool
coding_synon Exons - synonymous coding SNPs [] naive_filters boolean_list only,excluded hm27_variation_ceu__variation__main is_coding_synon_bool
coding_nonsynon Exons - non-synonymous coding SNPs [] naive_filters boolean_list only,excluded hm27_variation_ceu__variation__main is_coding_nonsynon_bool
intronic introns [] naive_filters boolean_list only,excluded hm27_variation_ceu__variation__mainis_intronic_bool
chrom chromosome [chr1,chr2,chr3,chr4,chr5,chr6,chr7,chr8,chr9,chr10,chr11,chr12,chr13,chr14,chr15,chr16,chr17,chr18,chr19,chr20,chr21,chr22,chrX,chrY] naive_filters list = hm27_variation_ceu__variation__main chrom
start From position [] naive_filters text >= hm27_variation_ceu__variation__main start
stop To position [] naive_filters text <= hm27_variation_ceu__variation__main stop
gene_list List of Genes [] naive_filters =,in pointer dataset gene_primary_symbol
encode_region ENCODE region [ENm010: 7:26924045..27424045,ENm013: 7:89621624..90736048,ENm014: 7:125865891..127029088,ENr112: 2:51512208..52012208,ENr113: 4:118466103..118966103,ENr123: 12:38626476:39126476,ENr131: 2:234156563..234656627,ENr213: 18:23719231..24219231,ENr232: 9:130725122..131225122,ENr321: 8:118882220..119382220] naive_filters list = pointer dataset glook_encode_region
(Where can I find a description of those columns ?) Here we can find some filters for hm27_variation_ceu (e.g.: allele_freq)

Retrieving the configurations for the dataset hm27_variation_ceu: http://www.biomart.org/biomart/martservice?type=configuration&dataset=hm27_variation_ceu :
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE DatasetConfig>
<DatasetConfig dataset="hm27_variation_ceu" datasetID="2" displayName="HapMap Population CEU - Utah residents with Northern and Western European ancestry from the CEPH collection" interfaces="default" internalName="default" martUsers="default" modified="2009-10-28 11:04:39" softwareVersion="0.6" template="hm27_variation_ceu" type="TableSet" visible="1">
<MainTable>hm27_variation_ceu__variation__main</MainTable>
<Key>marker_id_key</Key>
<Importable filters="chrom,start,stop" internalName="encode_filter" linkName="encode_filter" name="encode_filter" type="link"/>
<Importable displayName="gene_filter_35" filters="chrom,start,stop" internalName="gene_filter_35" linkName="gene_filter_35" name="gene_filter_35" type="link"/>
<FilterPage displayName="FILTERS" internalName="naive_filters">
<FilterGroup displayName="FILTERS" internalName="filters">
<FilterCollection displayName="MINOR ALLELE FREQUENCY [&gt;=]" internalName="pop_filter">
<FilterDescription displayName="Minor Allele Frequency [&gt;=]" displayType="list" field="avg_dprime" internalName="allele_freq" key="marker_id_key" legal_qualifiers="&gt;=" qualifier="&gt;=" style="menu" tableConstraint="main" type="list">
<Option displayName="0.0" internalName="0.0" isSelectable="true" value="0.0"/>
<Option displayName="0.01" internalName="0.01" isSelectable="true" value="0.01"/>
<Option displayName="0.05" internalName="0.05" isSelectable="true" value="0.05"/>
<Option displayName="0.1" internalName="0.1" isSelectable="true" value="0.1"/>
<Option displayName="0.15" internalName="0.15" isSelectable="true" value="0.15"/>
<Option displayName="0.2" internalName="0.2" isSelectable="true" value="0.2"/>
<Option displayName="0.25" internalName="0.25" isSelectable="true" value="0.25"/>
<Option displayName="0.3" internalName="0.3" isSelectable="true" value="0.3"/>
<Option displayName="0.35" internalName="0.35" isSelectable="true" value="0.35"/>
(...)
As far as I understand , this file is used to 'group' the filters.

At the end, here is Query sent to Biomart:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE Query>
<Query virtualSchemaName="default" formatter="TSV" header="0" uniqueRows="0" count="" datasetConfigVersion="0.6">
<Dataset name="hm27_variation_ceu" interface="default">
<Filter name="stop" value="1000000"/>
<Filter name="allele_freq" value="0.3"/>
<Filter name="chrom" value="chr1"/>
<Filter name="start" value="0"/>
<Attribute name="chrom"/>
<Attribute name="start"/>
<Attribute name="strand"/>
<Attribute name="marker1"/>
<Attribute name="alleles"/>
<Attribute name="ceu_id"/>
</Dataset>
</Query>
and we can send it with curl:
curl --data 'query=<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE Query><Query virtualSchemaName="default" formatter="TSV" header="0" uniqueRows="0" count="" datasetConfigVersion="0.6" ><Dataset name="hm27_variation_ceu" interface="default" ><Filter name="stop" value="1000000"/><Filter name="allele_freq" value="0.3"/><Filter name="chrom" value="chr1"/><Filter name="start" value="0"/><Attribute name="chrom" /><Attribute name="start" /><Attribute name="strand" /><Attribute name="marker1" /><Attribute name="alleles" /><Attribute name="ceu_id" /></Dataset></Query>' http://www.biomart.org/biomart/martservice


Response from the Biomart server:
chr1 843817 + rs1806509 A/C AC AA AC AC AC CC AC CC AC CC AC NN AC CC AA NN AC AC NN CC AC CC AC AC AA AC AC AC CC AA AC AC AC AC AA CC AC AC AA AA AA AA NN AA AA AA AC AA NN AA NN AC AC CC AA AA AC AC AC AC AA AA AA AC AA AA CC CC AC AA AC AC AC AA CC AA AC CC AA NN CC CC AA AC AA AC AA AA AC AC AC AC AA AA NN AA AA CC AC AC AA AA AA AC AC AC AA AC AA AC AC AA AA AC AA CC AC AC AC AC AA AA AA CC AC CC AC NN AC AC AA AA CC AA AC AC AC AA AC AA AC AC AC AA AC CC AA AA CC AA AA AC AC AC AC AA CC AA AC CC CC AC AC AA CC AA AC AC AC AA AA CC AA AA
chr1 881808 + rs13303106 A/G AG GG AG GG GG AG AG AG GG GG GG NN AA AG GG NN GG GG NN GG AG AG GG AG GG GG AG GG AA GG AG GG AG AA GG AG GG AG GG GG GG GG NN GG GG GG AA AG NN GG NN AG AG AA GG AG AG AA AG AG GG GG GG AG GG GG AG AA GG GG AG GG AG GG AG GG AG AG GG NN AA GG GG AG GG AG AG AG AG AG GG AG GG GG NN GG GG AG AG AG AG GG AG AG GG GG GG AG GG AG AG GG AG AG AG AA GG AG AG AG GG GG GG AG AG AA AG NN AG AG GG GG AA GG GG GG AA GG GG AG AG AG AG GG AG AA GG GG AA GG GG AG AG AG GG GG AA GG GG AA AA AG AG AG AA GG AA AG AG GG AG AA GG GG
chr1 906697 + rs6694632 A/G AG AA AA AG AA AG AG AG AA AA AG NN GG AA GG NN AA AA NN AA AG AG AA AG AA AA AG AA GG GG GG AG AA GG AA AA AA AG AG AG AG AA NN AA AA AA AG AG NN AA NN AG AG AG AG AG AG GG AG AG AA AA AA AG AG AG AA GG AA AA AA GG GG AA AG GG AA AG AA NN AG AA AG AG AA AG GG GG AA GG AG AG AG AG NN AA AG AG AG AG AG AA AA GG AA AA AA AA AA AG AG AA AG AG AG AA AG AA AG AG AA AA GG AG AG GG AG NN AG AG GG AG AA AG AG AA AG AA AA AA AG AA AG AA AA GG AA AA GG AA AA AG AG AG AA AA GG AA AA GG GG AG AG AG AG AG GG AA AG AG AG AG AG AG
chr1 908247 + rs13303118 G/T GT TT TT GT TT GT GT GT TT TT GT NN GG TT GG NN TT TT NN TT GT GT TT GT TT TT GT TT GG GG GG GT TT GG TT TT TT GT GT GT GT TT NN TT TT TT GT GT NN TT NN NN GT GT GT GT GT GG GT GT TT TT TT GT GT GT TT GG TT TT TT GG GG TT GT GG TT GT TT NN GT TT GT GT TT GT GG GG TT GG GT GT GT GT NN TT GT GT GT GT GT TT TT GG TT TT TT TT TT GT GT TT GT GT GT TT GT TT GT GT TT TT GG GT GT GG GT NN GT GT GG GT TT GT GT TT GT TT TT TT GT TT GT TT TT GG TT TT GG TT TT GT GT GT TT TT GG TT TT GG GG GT GT GT GT GT GG TT GT GT GT GT GT GT
chr1 908436 + rs2341354 A/G AG GG GG AG GG AG AG AG GG GG AG NN AA GG AA NN GG GG NN GG AG AG GG AG GG GG AG GG AA AA AA AG GG AA GG GG GG AG AG AG AG GG NN GG GG GG AG AG NN GG NN AG AG AG AG AG AG AA AG AG GG GG GG AG AG AG AG AA GG GG GG AA AA GG AG AA GG AG GG NN AG GG AG AG GG AG AA AA GG AA AG AG AG AG NN GG AG AG AG AG AG GG GG AA GG GG GG GG GG AG AG GG AG AG AG GG AG GG AG AG GG GG AA AG AG AA AG NN AG AG AA AG GG AG AG GG AG GG GG GG AG GG AG GG GG AA GG GG AA GG GG AG AG AG GG GG AA GG GG AA AA AG AG AG AG AG AA GG AG AG AG AG AG AG
chr1 934427 + rs3128117 C/T TT TT CT CC TT CT CC CT TT TT CT NN CT CT CC NN TT TT NN TT TT CT TT CC TT TT CT TT CC CC TT TT TT CT TT TT CT CC CT CT CT TT NN TT CT TT CT CT NN TT NN CT CC TT CT CT CT CC CT CT TT CT TT CT CT CT TT CT TT TT CT CC CT TT CT CC CT CT TT NN CT TT CT CT CT CT CC CC TT CC CT TT TT CC NN TT TT CT CC CT TT TT TT CT CT TT TT TT TT TT TT TT CT CT CT CT CT CT CC CT CT TT CC CT CT CT CT NN CC CC CC CT CT CT CC TT CT TT TT TT CT TT CC CT TT CC CT TT CC TT TT CT CC CT TT TT CT TT TT CT CC TT CT TT CT CT CT CT TT CT TT TT CT CT
chr1 940106 + rs1891906 A/C AA AA AC CC AA AC AC AC AA AA AC NN AC AC CC NN AA AA NN AA AA AC AA CC AA AA AC AA CC CC AA AA AA AC AA AA AC CC AC AC AC AA NN AA AC AA AC AC NN AA NN AC CC AA AC AC AC CC AC AC AA AC AA AC AC AC AA AC AA AA AC CC AC AA AC AC AC AC AA NN AC AA AC AC AC AC CC CC AA NN AC AA AA CC NN AA AA AC CC AC AA AA AA AC AC AA AA AA AA AA AA AA AC AC AC AC AC AC CC AC AC AA CC AC AC AC AC NN CC CC AC AC AC AC CC AA AC AA AA AA AC AA CC AC AA CC AC AA CC AA AA AC CC AC AA AA AC AA AA AC CC AA AC AA AC AC AC AC AA AC AA AA AC AC
chr1 949705 + rs2710888 C/T CT CC CT CT CC CT CT CT CC CC CT NN CT CT TT NN CC CC NN CC CC CC NN CT CC CC CT CC TT CT CC CC CT CC CC CC CT TT CT CT CT CC NN CC CC CT TT CT NN CC NN CT TT CC CT CC CT TT CT CT CT CC CC CT CC CT CC CT CC CC CT CT CT CT CT CC CT CT CC NN CT CC CT CT CT CT CT TT CT TT CC CC CC TT NN CC CC CT TT CC CC CC CT CT CT CC CC CT CC CC CT CC CT CT CT TT CC CT TT CT CT CC CT CT CT CT CC NN CC TT CT CT CT CT CT CC CT CC CC CC CT CT TT CT TT TT CT CC TT CC CC CT TT CT CC CC CT CC CC CT TT CC CT CC CT CT CT CT CT CC CT CT CC CT
chr1 952073 + rs3128126 A/G NN AA NN NN AA AG AG NN NN AA NN AG AG AG NN AG NN NN AG NN AG AA AG NN NN AA NN AA NN AG AA AA NN NN AA AG NN NN NN AG AG NN AA NN NN AA AG AG AA AA AG AG GG NN NN AA AG GG AA AG AA NN AG AG NN NN NN NN NN NN NN NN NN NN AG AG AG AG AA AG GG AA AG AG NN AG AG GG AG GG AA AA AA GG GG AA AG AG GG NN NN NN NN NN NN NN NN NN NN NN NN NN NN NN NN NN NN NN NN NN NN NN NN NN GG NN AG GG NN NN AG NN NN AG AG AA NN AA AA AA AG NN NN NN NN NN NN AA GG AA AA AG GG NN NN NN NN NN NN NN NN NN GG AA AG GG AG AG NN AA NN NN AA AG
chr1 952469 + rs4970393 A/G AG AA AG GG AA AG AG AG AA AA AG NN AG AG AG NN AA AA NN AA AG AA AG AG AA AA AG AA GG GG AA AA AA GG AA AA AG GG AG AG AG AA NN AA AG AA AG AG NN AA NN AG GG AG AG AG AG GG AA AG AA AG AA AG AG AG AA GG AA AA AG AG AG AG AG AG AG AG AA NN AG AA AG AG AG AG GG NN AG GG AG AA AA GG NN AA AA AG GG AG AG AA AA AG AG AA AA AG AA AA AG AA AG AG AG AG AG AG GG AG AG AG NN AG AG AG AA NN AG NN GG AG AG AG GG AA AG AA AA AA AG AG GG AG GG GG AG AA GG AA AA AG GG AG AA AA AG AA AA GG AG AG AG AG AG AG NN AG AA AG AA AA AG AG



Note: I've also tried to use the SOAP/WSDL Biomart API, but as it was discussed on Biostar, the API does not work.


That's it

Pierre

No comments: