Searching for Genotypes with SPARQL.
This weekend I noticed that the NCBI has an interface called the Genotype Query Form, which is used to query genotypes and generates the following kind of XML output:
<GenoExchange xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.ncbi.nlm.nih.gov/SNP/geno" xsi:schemaLocation="http://www.ncbi.nlm.nih.gov/SNP/geno ftp://ftp.ncbi.nlm.nih.gov/snp/specs/genoex_1_4.xsd" dbSNPBuildNo="129">
<Population popId="1409" handle="CSHL-HAPMAP" locPopId="HapMap-CEU">
<popClass self="NOT SPECIFIED" />
</Population>
<Individual indId="170" taxId="9606" sex="F" indGroup="European">
<SourceInfo source="Coriell" sourceType="repository" ncbiPedId="80" pedId="1340" indId="NA07000" maId="0" paId="0" srcIndGroup="Western and Nothern European" />
<SubmitInfo popId="1409" submittedIndId="NA07000" subIndGroup="Western and Northern European" />
</Individual>
<Individual indId="621" taxId="9606" sex="F" indGroup="European">
(...)
<SnpLoc genomicAssembly="36:reference" chrom="1" start="1286927" locType="2" rsOrientToChrom="rev" contigAllele="C" />
<SsInfo ssId="3906671" locSnpId="AL139287.6_22772" ssOrientToRs="fwd">
<ByPop popId="1409" sampleSize="120">
<AlleleFreq allele="A" freq="0.117" />
<AlleleFreq allele="G" freq="0.883" />
<GTypeFreq gtype="A/G" freq="0.233" />
<GTypeFreq gtype="G/G" freq="0.767" />
(...)
<GTypeByInd indId="636" gtype="G/G" />
<GTypeByInd indId="456" gtype="G/G" />
<GTypeByInd indId="536" gtype="G/G" />
</ByPop>
</SsInfo>
<GTypeFreq gtype="A/A" freq="0.380952380952381" />
<GTypeFreq gtype="A/G" freq="0.352380952380952" />
<GTypeFreq gtype="G/G" freq="0.266666666666667" />
</SnpInfo>
<SnpInfo rsId="2765021" observed="A/G">
(...)
I wanted to see how one could query this kind of data with SPARQL. I'm sure that RDF is one of the most inefficient ways to store this kind of data, but I wanted to see what could be extracted from such an RDF store with a semantic query. First, I wrote an XSLT stylesheet transforming <GenoExchange/> to <rdf:RDF/>. The stylesheet is available at http://code.google.com/p/lindenb/source/browse/trunk/src/xsl/genoexch2rdf.xsl.
Transform the data
About 639 HapMap SNPs on chromosome 1 were extracted using the HTML form and saved as XML to the file 'SNPgenotype-100201-1244-3905.xml' (size: 4 MB). The XML was converted to RDF with the xsltproc engine:
xsltproc --stringparam "with-sequence" yes --novalid genoexch2rdf.xsl SNPgenotype-100201-1244-3905.xml > input.rdf
The size of 'input.rdf' (including the flanking sequences of the SNPs) was 20 MB.
Result
<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:g="http://www.ncbi.nlm.nih.gov/SNP/geno" xmlns:snp="http://www.ncbi.nlm.nih.gov/SNP/docsum" xmlns="http://ontology.lindenb.org/genotypes/">
<Population rdf:about="http://www.ncbi.nlm.nih.gov/projects/SNP/snp_viewTable.cgi?type=pop&amp;pop_id=1409">
<handle>CSHL-HAPMAP</handle>
<locPopId>HapMap-CEU</locPopId>
</Population>
<Individual rdf:about="http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ind.cgi?ind_id=170">
<hasPop rdf:resource="http://www.ncbi.nlm.nih.gov/projects/SNP/snp_viewTable.cgi?type=pop&amp;pop_id=1409"/>
<sex>F</sex>
<name>NA07000</name>
</Individual>
<Individual rdf:about="http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ind.cgi?ind_id=621">
<hasPop rdf:resource="http://www.ncbi.nlm.nih.gov/projects/SNP/snp_viewTable.cgi?type=pop&amp;pop_id=1409"/>
<sex>F</sex>
<name>NA12875</name>
</Individual>
<Individual rdf:about="http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ind.cgi?ind_id=538">
<hasPop rdf:resource="http://www.ncbi.nlm.nih.gov/projects/SNP/snp_viewTable.cgi?type=pop&amp;pop_id=1409"/>
<sex>F</sex>
<name>NA12753</name>
(...)
<SNP rdf:about="http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=307347">
<het rdf:datatype="http://www.w3.org/2001/XMLSchema#float">0.1</het>
<name>rs307347</name>
<seq5>GGGGATGGCTGCTCCTGGGCCTCAGAAAGATGCAGTCCCATAGACTTCCAGCACGCCCCTCCCCTCCTCGGGCCTTAATTTTGTCCACTGAGAAGATGGTCTCTGAGGCTCTGGGGTTTCCTTCTTGGTCACCAGATATTCTGCGGGCCTTGCCTTCCTGCCCAGATTCGAGCCAGTGGCAAACAGAAGCTGCCAGGAGC</seq5>
<observed>C/T</observed>
<seq3>TCTCAGAGCTGTGGCTGGTGGCTCGGTAACAACAGGAAGGGCAGTGGCTGTGCAGGAGGCAGGCAGCTTGCCAGCCCAGGAAGGTGACCCAGGACACCTCCAGGCCTTTCCCAGGGCAGCCCAACGGCCCAAGGTCAGGGCCGGGCGCGAGGGCGGCCTGAGCACAGAGCACGGGGGCTGACAGCAGGCTGGGGGGCCAG</seq3>
</SNP>
<MapLoc>
<hasSNP rdf:resource="http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=307347"/>
<strand>+</strand>
<chrom>1</chrom>
<start rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">1320381</start>
<assembly rdf:resource="urn:assembly:Celera:36_3"/>
<type>exact</type>
</MapLoc>
(...)
<Genotype>
<hasIndi rdf:resource="http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ind.cgi?ind_id=465"/>
<hasSNP rdf:resource="http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=940550"/>
<allele1>T</allele1>
<allele2>T</allele2>
</Genotype>
<Genotype>
<hasIndi rdf:resource="http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ind.cgi?ind_id=253"/>
<hasSNP rdf:resource="http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=940550"/>
<allele1>T</allele1>
<allele2>T</allele2>
</Genotype>
</rdf:RDF>
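The stylesheet itself is not reproduced here, so the following is only a rough Python sketch of the kind of mapping it seems to perform for <Individual/> elements, judging from the sample input and output above. The choice of `SubmitInfo/@popId` for `hasPop` and `SubmitInfo/@submittedIndId` for `name` is an assumption, the function name `individual_triples` is hypothetical, and `SAMPLE` is a trimmed copy of the first individual shown earlier.

```python
import xml.etree.ElementTree as ET

GENO_NS = "http://www.ncbi.nlm.nih.gov/SNP/geno"
ONT = "http://ontology.lindenb.org/genotypes/"
IND_URL = "http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ind.cgi?ind_id="
POP_URL = "http://www.ncbi.nlm.nih.gov/projects/SNP/snp_viewTable.cgi?type=pop&pop_id="

# Trimmed copy of the first <Individual/> of the sample GenoExchange document.
SAMPLE = """<GenoExchange xmlns="http://www.ncbi.nlm.nih.gov/SNP/geno">
  <Individual indId="170" taxId="9606" sex="F" indGroup="European">
    <SourceInfo source="Coriell" sourceType="repository" indId="NA07000" />
    <SubmitInfo popId="1409" submittedIndId="NA07000" />
  </Individual>
</GenoExchange>"""


def individual_triples(xml_text):
    """Yield (subject, predicate, object) tuples for each <Individual/>.

    Assumed mapping (not taken from the real stylesheet):
    sex <- @sex, hasPop <- SubmitInfo/@popId, name <- SubmitInfo/@submittedIndId.
    """
    root = ET.fromstring(xml_text)
    for ind in root.iter("{%s}Individual" % GENO_NS):
        subject = IND_URL + ind.get("indId")
        yield (subject, ONT + "sex", ind.get("sex"))
        submit = ind.find("{%s}SubmitInfo" % GENO_NS)
        if submit is not None:
            yield (subject, ONT + "hasPop", POP_URL + submit.get("popId"))
            yield (subject, ONT + "name", submit.get("submittedIndId"))


triples = list(individual_triples(SAMPLE))
for triple in triples:
    print(triple)
```

Each tuple corresponds to one of the child elements of the first <Individual/> in the RDF above; the real XSLT also emits populations, SNPs, map locations and genotypes, which this sketch leaves out.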
Invoking ARQ
export ARQROOT=ARQ-2.5.0
ARQ-2.5.0/bin/arq --data ~/input.rdf --query ~/query01.rq
Dump All
- Query
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX g: <http://ontology.lindenb.org/genotypes/>
SELECT ?s ?p ?o { ?s ?p ?o . }
- Result
| _:b0 | g:allele2 | "C" |
| _:b0 | g:allele1 | "C" |
| _:b0 | g:hasSNP | <http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=17160669> |
| _:b0 | g:hasIndi | <http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ind.cgi?ind_id=636> |
| _:b0 | rdf:type | g:Genotype |
| _:b1 | g:allele2 | "T" |
| _:b1 | g:allele1 | "C" |
| _:b1 | g:hasSNP | <http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=17160669> |
| _:b1 | g:hasIndi | <http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ind.cgi?ind_id=361> |
| _:b1 | rdf:type | g:Genotype |
| <http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ind.cgi?ind_id=174> | g:name | "NA07048" |
| <http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ind.cgi?ind_id=174> | g:sex | "M" |
| <http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ind.cgi?ind_id=174> | g:hasPop | <http://www.ncbi.nlm.nih.gov/projects/SNP/snp_viewTable.cgi?type=pop&pop_id=1409> |
| <http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ind.cgi?ind_id=174> | rdf:type | g:Individual |
| <http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ind.cgi?ind_id=566> | g:name | "NA12802" |
| <http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ind.cgi?ind_id=566> | g:sex | "F" |
| <http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ind.cgi?ind_id=566> | g:hasPop | <http://www.ncbi.nlm.nih.gov/projects/SNP/snp_viewTable.cgi?type=pop&pop_id=1409> |
| <http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ind.cgi?ind_id=566> | rdf:type | g:Individual |
| _:b2 | g:allele2 | "A" |
| _:b2 | g:allele1 | "A" |
| _:b2 | g:hasSNP | <http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=2765021> |
| _:b2 | g:hasIndi | <http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ind.cgi?ind_id=429> |
| _:b2 | rdf:type | g:Genotype |
| _:b3 | g:allele2 | "C" |
| _:b3 | g:allele1 | "C" |
| _:b3 | g:hasSNP | <http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=17160669> |
| _:b3 | g:hasIndi | <http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ind.cgi?ind_id=546> |
| _:b3 | rdf:type | g:Genotype |
| _:b4 | g:allele2 | "T" |
| _:b4 | g:allele1 | "C" |
| _:b4 | g:hasSNP | <http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=17160669> |
| _:b4 | g:hasIndi | <http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ind.cgi?ind_id=159> |
| _:b4 | rdf:type | g:Genotype |
| <http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ind.cgi?ind_id=621> | g:name | "NA12875" |
| <http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ind.cgi?ind_id=621> | g:sex | "F" |
(...)
Select the populations
- Query
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX g: <http://ontology.lindenb.org/genotypes/>
SELECT ?pop
{
  ?s a g:Population .
  ?s g:handle ?pop .
}
- Result
-----------------
| pop |
=================
| "CSHL-HAPMAP" |
-----------------
List six individuals, with their population
- Query
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX g: <http://ontology.lindenb.org/genotypes/>
SELECT ?pop ?indi_name ?good
{
  ?s a g:Population .
  ?s g:handle ?pop .
  ?s2 a g:Individual .
  ?s2 g:hasPop ?s .
  ?s2 g:sex ?good .
  ?s2 g:name ?indi_name
}
LIMIT 6
- Result
------------------------------------
| pop | indi_name | good |
====================================
| "CSHL-HAPMAP" | "NA10854" | "F" |
| "CSHL-HAPMAP" | "NA12264" | "M" |
| "CSHL-HAPMAP" | "NA11993" | "F" |
| "CSHL-HAPMAP" | "NA10830" | "M" |
| "CSHL-HAPMAP" | "NA12762" | "M" |
| "CSHL-HAPMAP" | "NA12155" | "M" |
------------------------------------
List the SNPs having a flanking sequence containing 'CACACA'
- Query
- PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX g: <http://ontology.lindenb.org/genotypes/>
PREFIX fn: <http://www.w3.org/2005/xpath-functions#>
SELECT ?name ?seq5 ?observed ?seq3
WHERE
{
?s a g:SNP .
?s g:name ?name .
?s g:seq5 ?seq5 .
?s g:seq3 ?seq3 .
?s g:observed ?observed .
FILTER (
fn:contains(fn:upper-case(?seq5), "CACACA") ||
fn:contains(fn:upper-case(?seq3), "CACACA")
)
} - Result
- -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| name | seq5 | observed | seq3 |
=============================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================
| "rs17160669" | "GCCACCGCGCCTGGCCCACAAGCATAACTTTTATAAAAATAATTTACTTTTACAATTAAGCTTAGGAATCACACAGACTCAGGGCTGGCTCATGGCTTCC" | "C/T" | "GGCAAGTTAAACTCTGTACTTAGGCTCGGCGCGTATGAAATGGCTAATTCTAATCAGTGGTGCAATGAAGTAACTCCTCTAAAGAACTTATCGGGCCGGG" |
| "rs2765023" | "ACTTGTAAATTTAGTCAGCATACATAACTAACCAAAACTTCAATATATCTTGAGACCCCCTTGGGGGGCTGTCTCCATAAAAGTGACTTTCCCAGGAGAGTGACTGGATGTGATTGGCCAACACCGTCTTAGCCCGCAGGGGTTCCTGGCGCGGAAGCCTCACGTCCCTCCCCACAGCGAGTTTTCAGAATCCAAAGGCCGTAGGAGAAAGAAGGCTGGCGGTGTTTCCTCTTAGAGGGGAGAAACTCAGCCTGGGTAGGAGACCCAGCCCCACGCAGGGAAAACTGTGCTAACGCTTCC" | "A/G" | "ATGTGCGTGGCAGGTGCGGCGGCGGCGAATACGGTTTGTCCTCGAGCCTAACCCTGTCTGTGTTGGTGTCAGCAGTGGCCCCCCTACCACACACACAGGGTCCCTGGCGTCCCAAGACCACTCCTGGCAGCCCCGCCACTGGCTGCGCCTGGAAGCCGCGTCCTCAGGCCTCGCCTGGCATTTGCTGTCACAGAGGTTGCTTCCTTGGGTCCGTCCGTCCTCGCCCCTCCAGCCTGGGCGCCCCCCCACCCCTGTCTCATTCCCTCCACCACATGCAGCACAGTCCAGGAGGCTGGGGTC" |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Get 12 Heterozygous Genotypes
- Query
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX g: <http://ontology.lindenb.org/genotypes/>
PREFIX fn: <http://www.w3.org/2005/xpath-functions#>
SELECT ?indi ?snp ?a1 ?a2
WHERE
{
  ?s a g:Genotype .
  ?s g:allele1 ?a1 .
  ?s g:allele2 ?a2 .
  ?s g:hasIndi ?s2 .
  ?s2 g:name ?indi .
  ?s g:hasSNP ?s3 .
  ?s3 g:name ?snp .
  FILTER ( ?a1 != ?a2 )
}
LIMIT 12
- Result
----------------------------------------
| indi | snp | a1 | a2 |
========================================
| "NA12056" | "rs17160669" | "C" | "T" |
| "NA12716" | "rs17160669" | "C" | "T" |
| "NA12761" | "rs17160669" | "C" | "T" |
| "NA10839" | "rs2765023" | "A" | "G" |
| "NA12813" | "rs2765023" | "A" | "G" |
| "NA12760" | "rs2765023" | "A" | "G" |
| "NA12865" | "rs17160669" | "C" | "T" |
| "NA07056" | "rs17160669" | "C" | "T" |
| "NA12146" | "rs2765023" | "A" | "G" |
| "NA10860" | "rs2765023" | "A" | "G" |
| "NA10839" | "rs17160669" | "C" | "T" |
| "NA12812" | "rs17160669" | "C" | "T" |
----------------------------------------
List 12 SNPs on chr1 between 100000 and 500000 on the reference assembly, order by chrom/position
- Query
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX g: <http://ontology.lindenb.org/genotypes/>
PREFIX fn: <http://www.w3.org/2005/xpath-functions#>
SELECT ?snp ?chrom ?orient ?start
WHERE
{
  ?s a g:SNP .
  ?s g:name ?snp .
  ?s2 a g:MapLoc .
  ?s2 g:hasSNP ?s .
  ?s2 g:chrom ?chrom .
  ?s2 g:chrom "1" .
  ?s2 g:strand ?orient .
  ?s2 g:start ?start .
  ?s2 g:assembly <urn:assembly:reference:36_3> .
  FILTER ( ?start > 100000 && ?start < 500000 )
}
ORDER BY ?chrom ?start
LIMIT 12
- Result
------------------------------------------
| snp | chrom | orient | start |
==========================================
| "rs17009015" | "1" | "-" | 121810 |
| "rs11490937" | "1" | "+" | 222076 |
| "rs12041624" | "1" | "+" | 232164 |
| "rs11514575" | "1" | "-" | 235726 |
| "rs4731490" | "1" | "+" | 311783 |
| "rs4006867" | "1" | "+" | 325493 |
| "rs7462951" | "1" | "-" | 360984 |
| "rs4030300" | "1" | "+" | 392471 |
| "rs4030303" | "1" | "+" | 392552 |
| "rs9661032" | "1" | "-" | 396549 |
| "rs3872250" | "1" | "-" | 400742 |
| "rs3907361" | "1" | "-" | 412985 |
------------------------------------------
List the positions of 10 SNPs on the reference assembly and chr1, print the heterozygosity if it exists and is greater than 0.1
- Query
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX g: <http://ontology.lindenb.org/genotypes/>
SELECT ?snp ?chrom ?orient ?start ?het
WHERE
{
  ?s a g:SNP .
  ?s g:name ?snp .
  ?s2 a g:MapLoc .
  ?s2 g:hasSNP ?s .
  ?s2 g:chrom ?chrom .
  ?s2 g:chrom "1" .
  ?s2 g:strand ?orient .
  ?s2 g:start ?start .
  ?s2 g:assembly <urn:assembly:reference:36_3> .
  OPTIONAL { ?s g:het ?het . FILTER ( ?het > 0.1 ) }
}
LIMIT 10
- Result
------------------------------------------------------------------------------------------------
| snp | chrom | orient | start | het |
================================================================================================
| "rs7417504" | "1" | "+" | 555799 | |
| "rs10018120" | "1" | "-" | 241387750 | "0.48"^^<http://www.w3.org/2001/XMLSchema#float> |
| "rs12043546" | "1" | "+" | 224043895 | |
| "rs4023296" | "1" | "-" | 141776514 | |
| "rs1320571" | "1" | "+" | 1110293 | "0.31"^^<http://www.w3.org/2001/XMLSchema#float> |
| "rs1359759" | "1" | "+" | 115826181 | "0.49"^^<http://www.w3.org/2001/XMLSchema#float> |
| "rs7553429" | "1" | "+" | 1080419 | "0.19"^^<http://www.w3.org/2001/XMLSchema#float> |
| "rs4245756" | "1" | "+" | 789325 | |
| "rs3766177" | "1" | "-" | 1471210 | "0.5"^^<http://www.w3.org/2001/XMLSchema#float> |
| "rs9442372" | "1" | "+" | 1008566 | "0.46"^^<http://www.w3.org/2001/XMLSchema#float> |
------------------------------------------------------------------------------------------------
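Because the FILTER sits inside the OPTIONAL group, a SNP is never dropped from the result: when the heterozygosity is missing, or present but not greater than 0.1, the row is kept and ?het is simply left unbound (the empty cells above). A small Python sketch of that behaviour, using hypothetical rows rather than the real data:

```python
# Hypothetical rows standing in for SNPs and their optional g:het value.
snps = [
    {"snp": "rsA", "het": 0.48},   # het present and above the threshold
    {"snp": "rsB", "het": None},   # no g:het triple at all
    {"snp": "rsC", "het": 0.05},   # het present but not > 0.1
]


def optional_het(rows, threshold=0.1):
    """Mimic OPTIONAL { ?s g:het ?het . FILTER(?het > threshold) }.

    Every row is kept; the het value is reported only when it exists and
    passes the filter, otherwise it is left as None (unbound).
    """
    result = []
    for row in rows:
        het = row["het"]
        bound = het is not None and het > threshold
        result.append((row["snp"], het if bound else None))
    return result


for snp, het in optional_het(snps):
    print(snp, het)
```

Moving the FILTER outside the OPTIONAL block would behave differently: rows whose ?het is unbound or too small would be removed instead of printed with an empty cell.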
Print 10 differences between the Reference Assembly and the Celera Assembly
- Query
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX g: <http://ontology.lindenb.org/genotypes/>
SELECT ?snp ?chrom1 ?orient1 ?start1 ?chrom2 ?orient2 ?start2
WHERE
{
  ?s a g:SNP .
  ?s g:name ?snp .
  ?s2 a g:MapLoc .
  ?s2 g:hasSNP ?s .
  ?s2 g:chrom ?chrom1 .
  ?s2 g:strand ?orient1 .
  ?s2 g:start ?start1 .
  ?s2 g:assembly <urn:assembly:Celera:36_3> .
  ?s3 a g:MapLoc .
  ?s3 g:hasSNP ?s .
  ?s3 g:chrom ?chrom2 .
  ?s3 g:strand ?orient2 .
  ?s3 g:start ?start2 .
  ?s3 g:assembly <urn:assembly:reference:36_3> .
}
LIMIT 10
- Result
-----------------------------------------------------------------------------
| snp | chrom1 | orient1 | start1 | chrom2 | orient2 | start2 |
=============================================================================
| "rs7553640" | "1" | "-" | 833104 | "1" | "+" | 1751873 |
| "rs3951936" | "9" | "-" | 41330304 | "4" | "+" | 49186295 |
| "rs3951936" | "9" | "-" | 41330304 | "1" | "-" | 142233119 |
| "rs3951936" | "9" | "-" | 41330304 | "1" | "+" | 142038296 |
| "rs3951936" | "9" | "-" | 41330304 | "1" | "-" | 141781399 |
| "rs3951936" | "9" | "-" | 41330304 | "1" | "+" | 141641811 |
| "rs41319344" | "Y" | "+" | 10690990 | "Y" | "-" | 25853159 |
| "rs41319344" | "Y" | "+" | 10690990 | "Y" | "+" | 24928047 |
| "rs41319344" | "Y" | "+" | 10690990 | "1" | "-" | 241194834 |
| "rs10907183" | "1" | "-" | 1511375 | "1" | "+" | 1060980 |
-----------------------------------------------------------------------------
Create a new RDF graph of 10 pairs of SNPs lying less than 500 bp apart
- Query
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX g: <http://ontology.lindenb.org/genotypes/>
PREFIX fn: <http://www.w3.org/2005/xpath-functions#>
CONSTRUCT { ?snp1 g:hasNeighbour ?snp2 . }
WHERE
{
  ?snp1 a g:SNP .
  ?snp2 a g:SNP .
  ?s1 a g:MapLoc .
  ?s1 g:hasSNP ?snp1 .
  ?s1 g:chrom ?chrom1 .
  ?s1 g:strand ?orient1 .
  ?s1 g:start ?start1 .
  ?s1 g:assembly <urn:assembly:reference:36_3> .
  ?s2 a g:MapLoc .
  ?s2 g:hasSNP ?snp2 .
  ?s2 g:chrom ?chrom2 .
  ?s2 g:strand ?orient2 .
  ?s2 g:start ?start2 .
  ?s2 g:assembly <urn:assembly:reference:36_3> .
  FILTER ( (fn:abs(?start1 - ?start2) < 500) && ?chrom1 = ?chrom2 && ?snp1 != ?snp2 )
}
LIMIT 10
- Result
@prefix : <http://ontology.lindenb.org/genotypes/> .
@prefix g: <http://ontology.lindenb.org/genotypes/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix fn: <http://www.w3.org/2005/xpath-functions#> .
<http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=7545812>
:hasNeighbour <http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=9970455> .
<http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=1043506>
:hasNeighbour <http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=12126411> .
<http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=6603793>
:hasNeighbour <http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=7548693> ;
:hasNeighbour <http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=7553066> .
<http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=10907178>
:hasNeighbour <http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=10907177> ;
:hasNeighbour <http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=11260588> ;
:hasNeighbour <http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=11260587> ;
:hasNeighbour <http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=6701114> ;
:hasNeighbour <http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=3737728> ;
:hasNeighbour <http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=9442398> .
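The neighbour query is essentially a self-join on the map locations with a distance test. As a plain Python sketch of the same logic (not how ARQ evaluates it), here it is applied to four of the chr1 reference positions listed in an earlier result; the function name `neighbours` and the dictionary layout are just illustrative choices:

```python
from itertools import permutations

# Four SNPs with (chromosome, start) taken from the chr1 listing above.
locations = {
    "rs4030300": ("1", 392471),
    "rs4030303": ("1", 392552),
    "rs9661032": ("1", 396549),
    "rs3872250": ("1", 400742),
}


def neighbours(locs, max_dist=500):
    """Return ordered (snp1, snp2) pairs on the same chromosome whose
    start positions differ by less than max_dist.

    Like the SPARQL FILTER, the test is symmetric, so each pair of
    neighbours shows up once in each direction.
    """
    pairs = []
    for (snp1, (chrom1, start1)), (snp2, (chrom2, start2)) in permutations(locs.items(), 2):
        if chrom1 == chrom2 and abs(start1 - start2) < max_dist:
            pairs.append((snp1, snp2))
    return pairs


for snp1, snp2 in neighbours(locations):
    print(snp1, "hasNeighbour", snp2)
```

Only rs4030300 and rs4030303 (81 bp apart) qualify in this small subset, and they are reported in both directions, just as the CONSTRUCT output can contain both `A hasNeighbour B` and `B hasNeighbour A`.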
That's it !
Pierre
2 comments:
What RDF store are you using right now? TDB?
I suggest you retry your SPARQL queries after putting the RDF in a Virtuoso instance... see my blog for practical tips on setting it up, though I have yet to work out proper indexing. For that, see this page:
http://docs.openlinksw.com/virtuoso/rdfperformancetuning.html#rdfperfgeneraldbpedia
Hi Egon ! :-)
I'm just running my queries on a flat RDF file. Again, I just wanted to play with SPARQL again (before BioHackathon 2010, which will focus on the semantic web).
I guess I'll play with Virtuoso next week in Japan, but again, I cannot believe that an RDF store can be used to store a large amount of genotypes.