Showing posts with label sparql. Show all posts

13 November 2012

Creating a virtual RDF graph describing a set of OpenOffice spreadsheets with Apache Jena and Fuseki

In the current post, I will use the Jena API for RDF to implement a virtual RDF graph describing the content of a set of OpenOffice/LibreOffice spreadsheets.

Fact: An openoffice file (*.ods) is a Zip file

An openoffice file is nothing but a zip file:
$ unzip -t jeter.ods 
Archive:  jeter.ods
    testing: mimetype                 OK
    testing: meta.xml                 OK
    testing: settings.xml             OK
    testing: content.xml              OK
    testing: Thumbnails/thumbnail.png   OK
    testing: Configurations2/images/Bitmaps/   OK
    testing: Configurations2/popupmenu/   OK
    testing: Configurations2/toolpanel/   OK
    testing: Configurations2/statusbar/   OK
    testing: Configurations2/progressbar/   OK
    testing: Configurations2/toolbar/   OK
    testing: Configurations2/menubar/   OK
    testing: Configurations2/accelerator/current.xml   OK
    testing: Configurations2/floater/   OK
    testing: styles.xml               OK
    testing: META-INF/manifest.xml    OK
No errors detected in compressed data of jeter.ods.

The entry content.xml is an XML file describing the tables in the spreadsheet:
$ unzip -c jeter.ods content.xml |\
grep -v Archive |\
grep -v inflating | xmllint --format - |\
head -n 20


<?xml version="1.0" encoding="UTF-8"?>
<office:document-content xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0" xmlns:style="urn:oasis:names:tc:opendocument:xmlns:style:1.0" xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0" xmlns:table="urn:oasis:names:tc:opendocument:xmlns:table:1.0" xmlns:draw="urn:oasis:names:tc:opendocument:xmlns:drawing:1.0" xmlns:fo="urn:oasis:names:tc:opendocument:xmlns:xsl-fo-compatible:1.0" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:meta="urn:oasis:names:tc:opendocument:xmlns:meta:1.0" xmlns:number="urn:oasis:names:tc:opendocument:xmlns:datastyle:1.0" xmlns:presentation="urn:oasis:names:tc:opendocument:xmlns:presentation:1.0" xmlns:svg="urn:oasis:names:tc:opendocument:xmlns:svg-compatible:1.0" xmlns:chart="urn:oasis:names:tc:opendocument:xmlns:chart:1.0" xmlns:dr3d="urn:oasis:names:tc:opendocument:xmlns:dr3d:1.0" xmlns:math="http://www.w3.org/1998/Math/MathML" xmlns:form="urn:oasis:names:tc:opendocument:xmlns:form:1.0" xmlns:script="urn:oasis:names:tc:opendocument:xmlns:script:1.0" xmlns:ooo="http://openoffice.org/2004/office" xmlns:ooow="http://openoffice.org/2004/writer" xmlns:oooc="http://openoffice.org/2004/calc" xmlns:dom="http://www.w3.org/2001/xml-events" xmlns:xforms="http://www.w3.org/2002/xforms" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:rpt="http://openoffice.org/2005/report" xmlns:of="urn:oasis:names:tc:opendocument:xmlns:of:1.2" xmlns:xhtml="http://www.w3.org/1999/xhtml" xmlns:grddl="http://www.w3.org/2003/g/data-view#" xmlns:tableooo="http://openoffice.org/2009/table" xmlns:field="urn:openoffice:names:experimental:ooo-ms-interop:xmlns:field:1.0" xmlns:formx="urn:openoffice:names:experimental:ooxml-odf-interop:xmlns:form:1.0" xmlns:css3t="http://www.w3.org/TR/css3-text/" office:version="1.2">
  <office:scripts/>
  <office:font-face-decls>
    <style:font-face style:name="Liberation Sans" svg:font-family="'Liberation Sans'" style:font-family-generic="swiss" style:font-pitch="variable"/>
    <style:font-face style:name="DejaVu Sans" svg:font-family="'DejaVu Sans'" style:font-family-generic="system" style:font-pitch="variable"/>
    <style:font-face style:name="Lohit Hindi" svg:font-family="'Lohit Hindi'" style:font-family-generic="system" style:font-pitch="variable"/>
    <style:font-face style:name="WenQuanYi Micro Hei" svg:font-family="'WenQuanYi Micro Hei'" style:font-family-generic="system" style:font-pitch="variable"/>
  </office:font-face-decls>
  <office:automatic-styles>
    <style:style style:name="co1" style:family="table-column">
      <style:table-column-properties fo:break-before="auto" style:column-width="0.889in"/>
    </style:style>
    <style:style style:name="ro2" style:family="table-row">
      <style:table-row-properties style:row-height="0.178in" fo:break-before="auto" style:use-optimal-row-height="true"/>
    </style:style>
    <style:style style:name="ro3" style:family="table-row">
      <style:table-row-properties style:row-height="0.1681in" fo:break-before="auto" style:use-optimal-row-height="true"/>
    </style:style>
    <style:style style:name="ta1" style:family="table" style:master-page-name="Default">
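
The same extraction can be done from Java with the standard java.util.zip classes. The block below is only a sketch: the class name OdsContentReader and the in-memory demo archive are assumptions made for illustration and are not part of the original code.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

final class OdsContentReader {

    /** Returns the bytes of the "content.xml" entry, or null when the archive has no such entry. */
    static byte[] readContentXml(InputStream ods) {
        try (ZipInputStream zin = new ZipInputStream(ods)) {
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                if (entry.getName().equals("content.xml")) {
                    // readAllBytes() stops at the end of the current zip entry
                    return zin.readAllBytes();
                }
            }
            return null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Builds a tiny in-memory zip standing in for a real .ods file (demo data only). */
    static byte[] demoArchive() {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zout = new ZipOutputStream(bytes)) {
            zout.putNextEntry(new ZipEntry("mimetype"));
            zout.write("application/vnd.oasis.opendocument.spreadsheet".getBytes(StandardCharsets.UTF_8));
            zout.closeEntry();
            zout.putNextEntry(new ZipEntry("content.xml"));
            zout.write("<doc>hello</doc>".getBytes(StandardCharsets.UTF_8));
            zout.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        byte[] xml = readContentXml(new ByteArrayInputStream(demoArchive()));
        System.out.println(new String(xml, StandardCharsets.UTF_8));
    }
}
```

To read a real spreadsheet, pass a java.io.FileInputStream opened on the .ods file instead of the demo archive.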

Fact: Implementing a simple virtual RDF graph with Jena is easy

By virtual, I mean that there is no RDF store: the triples are created on the fly.
Implementing a simple virtual RDF graph with Jena is easy: you simply have to extend the class com.hp.hpl.jena.graph.impl.GraphBase and only implement the method graphBaseFind which returns all the RDF Triples matching a TripleMatch.

(...)
 @Override
    protected ExtendedIterator<Triple> graphBaseFind(TripleMatch matcher)
        {
        return ...;
        }
(...)

The code

My implementation of an RDF graph for a set of OpenOffice Calc spreadsheets is not efficient, but it works fine: on each call to graphBaseFind, it creates an "Iterator<Triple>" that scans the content.xml entry of each OpenOffice file. This iterator creates new Triples and adds them to a list, which is then filtered by the TripleMatch.
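
The "generate everything, then filter" idea can be sketched without Jena. The block below is not the original implementation: it uses plain strings instead of Jena's Node and Triple classes, drops the datatypes and the rdf:type triple, and the class name VirtualCellGraph is an assumption. The URI scheme (file/tN/yN/xN) and the predicates under http://rdf.lindenb.org/ are the ones visible in the query results of this post.

```java
import java.util.ArrayList;
import java.util.List;

final class VirtualCellGraph {

    /** A triple of plain strings; real Jena code would use Node and Triple instead. */
    static final class Triple {
        final String s, p, o;
        Triple(String s, String p, String o) { this.s = s; this.p = p; this.o = o; }
        @Override public String toString() { return s + " " + p + " " + o; }
    }

    static final String NS = "http://rdf.lindenb.org/";

    /** Creates the triples describing one cell on the fly; nothing is stored. */
    static List<Triple> cellTriples(String fileUri, int table, int y, int x, String value) {
        String tableUri = fileUri + "/t" + table;
        String cellUri = tableUri + "/y" + y + "/x" + x;
        List<Triple> out = new ArrayList<>();
        out.add(new Triple(cellUri, NS + "table", tableUri));
        out.add(new Triple(cellUri, NS + "X", String.valueOf(x)));
        out.add(new Triple(cellUri, NS + "Y", String.valueOf(y)));
        out.add(new Triple(cellUri, NS + "value", value));
        return out;
    }

    /** Keeps the triples matching the pattern; a null component acts as a wildcard. */
    static List<Triple> find(List<Triple> all, String s, String p, String o) {
        List<Triple> hits = new ArrayList<>();
        for (Triple t : all) {
            if ((s == null || s.equals(t.s))
                    && (p == null || p.equals(t.p))
                    && (o == null || o.equals(t.o))) {
                hits.add(t);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<Triple> all = new ArrayList<>();
        all.addAll(cellTriples("file:/home/lindenb/jeter.ods", 1, 3, 1, "chr1"));
        all.addAll(cellTriples("file:/home/lindenb/jeter.ods", 1, 3, 2, "150"));
        // Pattern (any subject, predicate "value", any object)
        for (Triple t : find(all, null, NS + "value", null)) {
            System.out.println(t);
        }
    }
}
```

Scanning every file for every pattern is what makes this approach slow: each call pays the full cost of reading all the cells, whatever the pattern.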

Compilation

The Makefile:
CP=...#path to the jars of JENA/ARQ/etc., e.g. CP=`find ${ARQ} -name "*.jar" | tr "\n" ":"`
.PHONY: all
all:
	javac -cp ${CP} -sourcepath src src/oocalc/OpenOfficeCalcGraph.java
	jar cvf dist/openoffice2rdf.jar -C src .

Querying using sparql

Now that the Graph has been implemented and compiled, one can query it using ARQ, the sparql engine of Jena:

The spreadsheet

I've created the following spreadsheet and saved it in a file named "jeter.ods":
CHROM  START  END   NAME
chr1   100    200   rs654
chr1   150    250   rs264
chr1   200    300   rs610
chr1   250    350   rs929
chr1   300    400   rs408
chr1   350    450   rs346
chr1   400    500   rs430
chr1   450    550   rs735
chr1   500    600   rs575
chr1   550    650   rs891
chr1   600    700   rs627
chr1   650    750   rs650
chr1   700    800   rs715
chr1   750    850   rs467
chr1   800    900   rs882
chr1   850    950   rs301
chr1   900    1000  rs643
chr1   950    1050  rs246
chr1   1000   1100  rs178
chr1   1050   1150  rs928
chr1   1100   1200  rs213

The sparql query

The following SPARQL query returns the information about the cells in the 3rd row of the spreadsheet:


Invoke:
java -cp `find /home/lindenb/.ivy2/cache -name "*.jar" | tr "\n" ":"`:dist/openoffice2rdf.jar  \
 oocalc.OpenOfficeCalcGraph test.sparql /home/lindenb/jeter.ods

Result:
-----------------------------------------------------------------------------------------------------------------------------------
| s                                       | p                                                 | o                                 |
===================================================================================================================================
| <file:/home/lindenb/jeter.ods/t1/y3/x1> | office:table                                      | <file:/home/lindenb/jeter.ods/t1> |
| <file:/home/lindenb/jeter.ods/t1/y3/x1> | <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> | office:Cell                       |
| <file:/home/lindenb/jeter.ods/t1/y3/x1> | office:X                                          | "1"^^xsd:int                      |
| <file:/home/lindenb/jeter.ods/t1/y3/x1> | office:Y                                          | "3"^^xsd:int                      |
| <file:/home/lindenb/jeter.ods/t1/y3/x1> | office:value                                      | "chr1"                            |
| <file:/home/lindenb/jeter.ods/t1/y3/x2> | office:table                                      | <file:/home/lindenb/jeter.ods/t1> |
| <file:/home/lindenb/jeter.ods/t1/y3/x2> | <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> | office:Cell                       |
| <file:/home/lindenb/jeter.ods/t1/y3/x2> | office:X                                          | "2"^^xsd:int                      |
| <file:/home/lindenb/jeter.ods/t1/y3/x2> | office:Y                                          | "3"^^xsd:int                      |
| <file:/home/lindenb/jeter.ods/t1/y3/x2> | office:value                                      | "150"^^xsd:float                  |
| <file:/home/lindenb/jeter.ods/t1/y3/x3> | office:table                                      | <file:/home/lindenb/jeter.ods/t1> |
| <file:/home/lindenb/jeter.ods/t1/y3/x3> | <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> | office:Cell                       |
| <file:/home/lindenb/jeter.ods/t1/y3/x3> | office:X                                          | "3"^^xsd:int                      |
| <file:/home/lindenb/jeter.ods/t1/y3/x3> | office:Y                                          | "3"^^xsd:int                      |
| <file:/home/lindenb/jeter.ods/t1/y3/x3> | office:value                                      | "250"^^xsd:float                  |
| <file:/home/lindenb/jeter.ods/t1/y3/x4> | office:table                                      | <file:/home/lindenb/jeter.ods/t1> |
| <file:/home/lindenb/jeter.ods/t1/y3/x4> | <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> | office:Cell                       |
| <file:/home/lindenb/jeter.ods/t1/y3/x4> | office:X                                          | "4"^^xsd:int                      |
| <file:/home/lindenb/jeter.ods/t1/y3/x4> | office:Y                                          | "3"^^xsd:int                      |
| <file:/home/lindenb/jeter.ods/t1/y3/x4> | office:value                                      | "rs264"                           |
-----------------------------------------------------------------------------------------------------------------------------------

Serving the OpenOffice spreadsheets as RDF over HTTP

Fuseki is a SPARQL server. It provides REST-style SPARQL HTTP Update, SPARQL Query, and SPARQL Update using the SPARQL protocol over HTTP. We're going to deploy the OpenOfficeCalcGraph in Fuseki to query a set of OpenOffice files.

Download and install Fuseki

wget https://repository.apache.org/content/repositories/releases/org/apache/jena/jena-fuseki/0.2.5/jena-fuseki-0.2.5-distribution.tar.gz
tar xfz jena-fuseki-0.2.5-distribution.tar.gz
rm jena-fuseki-0.2.5-distribution.tar.gz

Tell Fuseki about our OpenOfficeCalcGraph

We need to create a config file for Fuseki. That was the most complicated part as the process is not clearly documented:

The line:
[] ja:loadClass "oocalc.OpenOfficeCalcGraph" .
loads the class oocalc.OpenOfficeCalcGraph. The class OpenOfficeCalcGraph contains a static initialisation method:
(...)
static { init() ; }
    private static void init()
        {
        (...)
In this static method, a Jena Assembler for OpenOfficeCalcGraph is registered under the resource named: "http://rdf.lindenb.org/build".
public static OpenOfficeAssembler assembler = new OpenOfficeAssembler();
(...)
private static final Resource buildRsrc=ResourceFactory.createResource(NS+"build");
(...)
Assembler.general.implementWith(buildRsrc,assembler);
(...)
An Assembler configures a Graph from an RDF config file. In our example, the config contains the paths to the OpenOffice spreadsheets:
<#ooservice> rdf:type openoffice:build ;
    openoffice:file "/home/lindenb/jeter.ods" ;
    openoffice:file "/home/lindenb/jeter2.ods" ;
.
This config is read in the Assembler:
public static class OpenOfficeAssembler extends AssemblerBase implements Assembler
      {
      @Override
      public Object open( Assembler a, Resource root, Mode mode )
            {
            Property fileRsrc=ResourceFactory.createProperty(NS+"file");
            //read the configuration and get the files
            List<File> files=new ArrayList<File>();
            StmtIterator iter=root.listProperties(fileRsrc);
     (...)
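
The ja:loadClass trick relies on a plain Java mechanism: a static initializer runs the first time the class is initialized, so merely loading the class is enough to register something in a global registry. The sketch below shows that mechanism alone, without Jena. It assumes that Fuseki loads the named class in a way that triggers initialization, which the "Calling OpenOfficeCalcGraph init" line in the startup log suggests. The names LoadClassDemo, Plugin and REGISTRY are invented for the example.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class LoadClassDemo {

    /** Stand-in for an assembler registry: maps a resource URI to a description. */
    static final Map<String, String> REGISTRY = new ConcurrentHashMap<>();

    /** Registers itself when the JVM initializes the class. */
    static final class Plugin {
        static { REGISTRY.put("http://rdf.lindenb.org/build", "Plugin assembler"); }
    }

    /** Loads and initializes a class by name, which runs its static initializer. */
    static void loadClass(String className) {
        try {
            Class.forName(className);
        } catch (ClassNotFoundException e) {
            throw new IllegalArgumentException("class not found: " + className, e);
        }
    }

    public static void main(String[] args) {
        // Nothing has touched Plugin yet, so the registry is still empty
        System.out.println(REGISTRY.containsKey("http://rdf.lindenb.org/build"));
        // A class literal does not initialize a class, so build the nested class name by hand
        loadClass(LoadClassDemo.class.getName() + "$Plugin");
        System.out.println(REGISTRY.containsKey("http://rdf.lindenb.org/build"));
    }
}
```

The first line printed is false and the second is true: the registration happens as a side effect of loading the class, which is why the config file only needs to name it.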

Start Fuseki with the config file:

$ cd jena-fuseki-0.2.5
$ java -cp fuseki-server.jar:/path/to/openoffice2rdf.jar  org.apache.jena.fuseki.FusekiCmd \
    --debug  -v --config /path/to/openoffice.ttl
14:11:50 INFO  Config               :: Configuration file: ../openoffice.ttl
14:11:50 INFO  Config               :: Service: :service1
14:11:50 INFO  Config               ::   name = ds
14:11:50 INFO  Config               ::   query = /ds/query
14:11:50 INFO  Config               ::   query = /ds/sparql
14:11:50 INFO  Config               ::   update = /ds/update
14:11:50 INFO  Config               ::   upload = /ds/upload
14:11:50 INFO  Config               ::   graphStore(RW) = /ds/data
14:11:50 INFO  Config               ::   graphStore(R) = /ds/get
14:11:50 INFO  ooffice2rdf          :: Calling OpenOfficeCalcGraph init
14:11:50 INFO  Config               :: Service: OpenOffice Service (R)
14:11:50 INFO  Config               ::   name = openoffice
14:11:50 INFO  Config               ::   query = /openoffice/sparql
14:11:50 INFO  Config               ::   query = /openoffice/query
14:11:50 INFO  Config               ::   update = /openoffice/update
14:11:50 INFO  Config               ::   graphStore(R) = /openoffice/get
14:11:50 INFO  Config               ::   graphStore(R) = /openoffice/data
14:11:51 INFO  Server               :: Dataset path = /ds
14:11:51 INFO  Server               :: Dataset path = /openoffice
14:11:51 INFO  Server               :: Fuseki 0.2.5 2012-10-20T17:03:29+0100
14:11:51 INFO  Server               :: Started 2012/11/13 14:11:51 CET on port 3030
Open your browser at http://localhost:3030, select the control panel at http://localhost:3030/control-panel.tpl and select /openoffice:
[Fuseki Control Panel: "Dataset" selector]

The following form is displayed:
[SPARQL Query form with the fields: query text, "Output", "XSLT style sheet (blank for none)" and the option "Force the accept header to text/plain regardless"]

You can now copy, paste and run the previous sparql query:
--------------------------------------------------------------------------------------------------------------------------------------------------
| s                                        | p                                                 | o                                               |
==================================================================================================================================================
| <file:/home/lindenb/jeter.ods/t1/y3/x1>  | <http://rdf.lindenb.org/table>                    | <file:/home/lindenb/jeter.ods/t1>               |
| <file:/home/lindenb/jeter.ods/t1/y3/x1>  | <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> | <http://rdf.lindenb.org/Cell>                   |
| <file:/home/lindenb/jeter.ods/t1/y3/x1>  | <http://rdf.lindenb.org/X>                        | "1"^^<http://www.w3.org/2001/XMLSchema#int>     |
| <file:/home/lindenb/jeter.ods/t1/y3/x1>  | <http://rdf.lindenb.org/Y>                        | "3"^^<http://www.w3.org/2001/XMLSchema#int>     |
| <file:/home/lindenb/jeter.ods/t1/y3/x1>  | <http://rdf.lindenb.org/value>                    | "chr1"                                          |
| <file:/home/lindenb/jeter.ods/t1/y3/x2>  | <http://rdf.lindenb.org/table>                    | <file:/home/lindenb/jeter.ods/t1>               |
| <file:/home/lindenb/jeter.ods/t1/y3/x2>  | <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> | <http://rdf.lindenb.org/Cell>                   |
| <file:/home/lindenb/jeter.ods/t1/y3/x2>  | <http://rdf.lindenb.org/X>                        | "2"^^<http://www.w3.org/2001/XMLSchema#int>     |
| <file:/home/lindenb/jeter.ods/t1/y3/x2>  | <http://rdf.lindenb.org/Y>                        | "3"^^<http://www.w3.org/2001/XMLSchema#int>     |
| <file:/home/lindenb/jeter.ods/t1/y3/x2>  | <http://rdf.lindenb.org/value>                    | "150"^^<http://www.w3.org/2001/XMLSchema#float> |
| <file:/home/lindenb/jeter.ods/t1/y3/x3>  | <http://rdf.lindenb.org/table>                    | <file:/home/lindenb/jeter.ods/t1>               |
| <file:/home/lindenb/jeter.ods/t1/y3/x3>  | <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> | <http://rdf.lindenb.org/Cell>                   |
| <file:/home/lindenb/jeter.ods/t1/y3/x3>  | <http://rdf.lindenb.org/X>                        | "3"^^<http://www.w3.org/2001/XMLSchema#int>     |
| <file:/home/lindenb/jeter.ods/t1/y3/x3>  | <http://rdf.lindenb.org/Y>                        | "3"^^<http://www.w3.org/2001/XMLSchema#int>     |
| <file:/home/lindenb/jeter.ods/t1/y3/x3>  | <http://rdf.lindenb.org/value>                    | "250"^^<http://www.w3.org/2001/XMLSchema#float> |
| <file:/home/lindenb/jeter.ods/t1/y3/x4>  | <http://rdf.lindenb.org/table>                    | <file:/home/lindenb/jeter.ods/t1>               |
| <file:/home/lindenb/jeter.ods/t1/y3/x4>  | <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> | <http://rdf.lindenb.org/Cell>                   |
| <file:/home/lindenb/jeter.ods/t1/y3/x4>  | <http://rdf.lindenb.org/X>                        | "4"^^<http://www.w3.org/2001/XMLSchema#int>     |
| <file:/home/lindenb/jeter.ods/t1/y3/x4>  | <http://rdf.lindenb.org/Y>                        | "3"^^<http://www.w3.org/2001/XMLSchema#int>     |
| <file:/home/lindenb/jeter.ods/t1/y3/x4>  | <http://rdf.lindenb.org/value>                    | "rs264"                                         |
| <file:/home/lindenb/jeter2.ods/t1/y3/x1> | <http://rdf.lindenb.org/table>                    | <file:/home/lindenb/jeter2.ods/t1>              |
| <file:/home/lindenb/jeter2.ods/t1/y3/x1> | <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> | <http://rdf.lindenb.org/Cell>                   |
| <file:/home/lindenb/jeter2.ods/t1/y3/x1> | <http://rdf.lindenb.org/X>                        | "1"^^<http://www.w3.org/2001/XMLSchema#int>     |
| <file:/home/lindenb/jeter2.ods/t1/y3/x1> | <http://rdf.lindenb.org/Y>                        | "3"^^<http://www.w3.org/2001/XMLSchema#int>     |
| <file:/home/lindenb/jeter2.ods/t1/y3/x1> | <http://rdf.lindenb.org/value>                    | "1"^^<http://www.w3.org/2001/XMLSchema#float>   |
| <file:/home/lindenb/jeter2.ods/t1/y3/x2> | <http://rdf.lindenb.org/table>                    | <file:/home/lindenb/jeter2.od
| <file:/home/lindenb/jeter2.ods/t1/y3/x2> | <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> | <http://rdf.lindenb.org/Cell>                   |
| <file:/home/lindenb/jeter2.ods/t1/y3/x2> | <http://rdf.lindenb.org/X>                        | "2"^^<http://www.w3.org/2001/XMLSchema#int>     |
| <file:/home/lindenb/jeter2.ods/t1/y3/x2> | <http://rdf.lindenb.org/Y>                        | "3"^^<http://www.w3.org/2001/XMLSchema#int>     |
| <file:/home/lindenb/jeter2.ods/t1/y3/x2> | <http://rdf.lindenb.org/value>                    | "2"^^<http://www.w3.org/2001/XMLSchema#float>   |
| <file:/home/lindenb/jeter2.ods/t1/y3/x3> | <http://rdf.lindenb.org/table>                    | <file:/home/lindenb/jeter2.ods/t1>              |
| <file:/home/lindenb/jeter2.ods/t1/y3/x3> | <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> | <http://rdf.lindenb.org/Cell>                   |
| <file:/home/lindenb/jeter2.ods/t1/y3/x3> | <http://rdf.lindenb.org/X>                        | "3"^^<http://www.w3.org/2001/XMLSchema#int>     |
| <file:/home/lindenb/jeter2.ods/t1/y3/x3> | <http://rdf.lindenb.org/Y>                        | "3"^^<http://www.w3.org/2001/XMLSchema#int>     |
| <file:/home/lindenb/jeter2.ods/t1/y3/x3> | <http://rdf.lindenb.org/value>                    | "3"^^<http://www.w3.org/2001/XMLSchema#float>   |
| <file:/home/lindenb/jeter2.ods/t1/y3/x4> | <http://rdf.lindenb.org/table>                    | <file:/home/lindenb/jeter2.ods/t1>              |
| <file:/home/lindenb/jeter2.ods/t1/y3/x4> | <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> | <http://rdf.lindenb.org/Cell>                   |
| <file:/home/lindenb/jeter2.ods/t1/y3/x4> | <http://rdf.lindenb.org/X>                        | "4"^^<http://www.w3.org/2001/XMLSchema#int>     |
| <file:/home/lindenb/jeter2.ods/t1/y3/x4> | <http://rdf.lindenb.org/Y>                        | "3"^^<http://www.w3.org/2001/XMLSchema#int>     |
| <file:/home/lindenb/jeter2.ods/t1/y3/x4> | <http://rdf.lindenb.org/value>                    | "4"^^<http://www.w3.org/2001/XMLSchema#float>   |
--------------------------------------------------------------------------------------------------------------------------------------------------
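
The browser form is not required: the startup log shows a query endpoint at /openoffice/query, and the SPARQL protocol accepts a GET request with the query text in the "query" parameter. Here is a minimal sketch using the HTTP client of Java 11. The class name FusekiQueryClient is an assumption, and the query string is a guess at a query selecting the cells of row 3, built from the predicate and datatype visible in the results above. It is not the query file used in this post.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

final class FusekiQueryClient {

    /** Builds a SPARQL-protocol GET URI: the query text goes into the "query" parameter. */
    static URI buildQueryUri(String endpoint, String sparql) {
        return URI.create(endpoint + "?query=" + URLEncoder.encode(sparql, StandardCharsets.UTF_8));
    }

    /** Sends the request and returns the response body; needs a running server. */
    static String send(URI uri) throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(uri)
                .header("Accept", "application/sparql-results+json")
                .GET()
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        // Assumed query: all the triples of the cells located in row 3
        String sparql = "SELECT ?s ?p ?o WHERE { "
                + "?s <http://rdf.lindenb.org/Y> \"3\"^^<http://www.w3.org/2001/XMLSchema#int> . "
                + "?s ?p ?o }";
        URI uri = buildQueryUri("http://localhost:3030/openoffice/query", sparql);
        System.out.println(uri);
        // Only contact the server when asked to, so the demo also runs offline
        if (args.length > 0 && args[0].equals("--send")) {
            System.out.println(send(uri));
        }
    }
}
```

Run it with the --send argument while Fuseki is listening on port 3030 to get the results as JSON.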

That's it,

Pierre

24 April 2012

Mapping the genes involved in a category of disease: the GeneWikiPlus + SPARQL way.

In my previous post, I've used the RDF/XML files of the Disease Ontology to map all the genes involved in a cardiac disease.

Andrew Su immediately mentioned on Twitter that he was working on GeneWiki+, an integration of GeneWiki on Semantic-MediaWiki that could answer the same question.





Later, Benjamin Good announced that a SPARQL endpoint for GeneWiki+ was now available:


The following Java code uses the Jena/ARQ API to query this SPARQL endpoint. For a given Disease Ontology accession identifier, it fetches all the genes associated with this disease, then runs recursively on the subclasses of this disease.
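
The recursion itself can be sketched independently of the endpoint. The block below is not the original program: the two remote SPARQL lookups are replaced by in-memory functions, the class name DiseaseGeneWalker is an assumption, and the "EX:..." identifiers and GENE_* names are placeholders, not real accessions or genes.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

final class DiseaseGeneWalker {

    /** Collects "gene<TAB>disease" lines for a disease and, recursively, its subclasses. */
    static void collect(String diseaseId,
                        Function<String, List<String>> genesOf,
                        Function<String, List<String>> subClassesOf,
                        Set<String> visited,
                        List<String> out) {
        // A term may be reachable through several parents: visit it only once
        if (!visited.add(diseaseId)) return;
        for (String gene : genesOf.apply(diseaseId)) {
            out.add(gene + "\t" + diseaseId);
        }
        for (String child : subClassesOf.apply(diseaseId)) {
            collect(child, genesOf, subClassesOf, visited, out);
        }
    }

    public static void main(String[] args) {
        // In-memory stand-ins for the two lookups that would query the endpoint
        Map<String, List<String>> genes = Map.of(
                "DOID:114", List.of("ABCA1"),
                "EX:sub1", List.of("GENE_A", "GENE_B"),
                "EX:sub2", List.of("GENE_C"));
        Map<String, List<String>> children = Map.of(
                "DOID:114", List.of("EX:sub1", "EX:sub2"),
                "EX:sub1", List.of("EX:sub2"));
        List<String> out = new ArrayList<>();
        collect("DOID:114",
                id -> genes.getOrDefault(id, List.of()),
                id -> children.getOrDefault(id, List.of()),
                new HashSet<>(), out);
        out.forEach(System.out::println);
    }
}
```

The visited set matters because a disease can be a subclass of more than one parent. Without it, the same genes would be reported once per path.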



Here is the output (gene-name, gene-id, disease) with DOID:114 ("Heart Disease"):
Protein C 5624 Heart disease
HMG-CoA reductase 3156 Heart disease
SCARB1 949 Heart disease
Coagulation factor II receptor 2149 Heart disease
Cathepsin S 1520 Heart disease
ABCA1 19 Heart disease
CHD7 55636 Heart disease
GJA5 2702 Heart disease
ENTPD1 953 Heart disease
PEDF 5176 Heart disease
HMG CoA reductase 3156 Heart disease
PROC 5624 Heart disease
F2R 2149 Heart disease
SERPINF1 5176 Heart disease
HMGCR 3156 Heart disease
CTSS 1520 Heart disease
Cytochrome c 54205 Heart failure
FOXP1 27086 Heart failure
Vasoactive intestinal peptide 7432 Heart failure
Angiotensin-converting enzyme 1636 Heart failure
PPP1CA 5499 Heart failure
Transferrin 7018 Heart failure
Natriuretic peptide precursor C 4880 Heart failure
Insulin-like growth factor 1 3479 Heart failure
CA-125 94025 Heart failure
Myosin binding protein C, cardiac 4607 Heart failure
MYH7 4625 Heart failure
Tafazzin 6901 Heart failure
5-HT2B receptor 3357 Heart failure
Beta-1 adrenergic receptor 153 Heart failure
PTGS2 5743 Heart failure
EPAS1 2034 Heart failure
Nociceptin receptor 4987 Heart failure
Cystatin C 1471 Heart failure
Ryanodine receptor 2 6262 Heart failure
Multidrug resistance-associated protein 2 1244 Heart failure
KCNA5 3741 Heart failure
ANXA6 309 Heart failure
CMA1 1215 Heart failure
KLF15 28999 Heart failure
IL1RL1 9173 Heart failure
JPH2 57158 Heart failure
Heart-type fatty acid binding protein 2170 Heart failure
TF 7018 Heart failure
ABCC2 1244 Heart failure
Cytochrome-c 54205 Heart failure
HTR2B 3357 Heart failure
Cytochrome C 54205 Heart failure
Hif2a 2034 Heart failure
FABP3 2170 Heart failure
MYBPC3 4607 Heart failure
Angiotensin converting enzyme 1636 Heart failure
IGF-1 3479 Heart failure
Insulin-like growth factor-1 3479 Heart failure
Stress-induced polymorphic ventricular tachycardia 6262 Heart failure
C-type natriuretic peptide 4880 Heart failure
OPRL1 4987 Heart failure
CYCS 54205 Heart failure
ADRB1 153 Heart failure
TAZ 6901 Heart failure
VIP 7432 Heart failure
IGF1 3479 Heart failure
NPPC 4880 Heart failure
ACE 1636 Heart failure
CST3 1471 Heart failure
MUC16 94025 Heart failure
RYR2 6262 Heart failure
Aquaporin-2 359 Congestive heart failure
Aquaporin 2 359 Congestive heart failure
Atrial natriuretic peptide 4878 Congestive heart failure
Brain natriuretic peptide 4879 Congestive heart failure
Phospholamban 5350 Congestive heart failure
CYP2C9 1559 Congestive heart failure
RAGE (receptor) 177 Congestive heart failure
Angiotensin II receptor type 1 185 Congestive heart failure
Programmed cell death 1 5133 Congestive heart failure
AGTR1 185 Congestive heart failure
Atrial natriuretic factor 4878 Congestive heart failure
PDCD1 5133 Congestive heart failure
AGER 177 Congestive heart failure
AQP2 359 Congestive heart failure
PLN 5350 Congestive heart failure
NPPB 4879 Congestive heart failure
NPPA 4878 Congestive heart failure
GroEL 3329 Endocarditis
Ornithine transcarbamylase 5009 Endocarditis
Valosin-containing protein 7415 Endocarditis
Parathyroid hormone 1 receptor 5745 Endocarditis
VDAC1 7416 Endocarditis
RuvB-like 1 8607 Endocarditis
TUBB2A 7280 Endocarditis
ACTG1 71 Endocarditis
ACTC1 70 Endocarditis
PRDX6 9588 Endocarditis
Hyaluronan-mediated motility receptor 3161 Endocarditis
HSPB6 126393 Endocarditis
Parathyroid hormone receptor 1 5745 Endocarditis
VCP 7415 Endocarditis
OTC 5009 Endocarditis
PTH1R 5745 Endocarditis
HSPD1 3329 Endocarditis
HMMR 3161 Endocarditis
RUVBL1 8607 Endocarditis
HCN4 10021 Sick sinus syndrome
Heparin-binding EGF-like growth factor 1839 Aortic valve disease
HBEGF 1839 Aortic valve disease
Von Willebrand factor 7450 Aortic valve stenosis
ADAMTS13 11093 Aortic valve stenosis
VWF 7450 Aortic valve stenosis
Elastin 2006 Supravalvular aortic stenosis
ELN 2006 Supravalvular aortic stenosis
PRG4 10216 Pericarditis
Histamine H3 receptor 11255 Myocardial ischemia
MAP3K7IP1 10454 Myocardial ischemia
Vascular endothelial growth factor A 7422 Myocardial ischemia
Cathepsin L1 1514 Myocardial ischemia
VEGF-A 7422 Myocardial ischemia
VEGFA 7422 Myocardial ischemia
CTSL1 1514 Myocardial ischemia
TAB1 10454 Myocardial ischemia
HRH3 11255 Myocardial ischemia
APOA1 335 Coronary heart disease
APOC3 345 Coronary heart disease
Lipoprotein(a) 4018 Coronary heart disease
Brain natriuretic peptide 4879 Coronary heart disease
Beta-3 adrenergic receptor 155 Coronary heart disease
Insulin-like growth factor 1 3479 Coronary heart disease
Perlecan 3339 Coronary heart disease
PCSK9 255738 Coronary heart disease
Cholesterylester transfer protein 1071 Coronary heart disease
Arachidonate 5-lipoxygenase 240 Coronary heart disease
Apolipoprotein B 338 Coronary heart disease
Apolipoprotein A1 335 Coronary heart disease
Beta-1 adrenergic receptor 153 Coronary heart disease
Apolipoprotein C3 345 Coronary heart disease
Lipoprotein-associated phospholipase A2 7941 Coronary heart disease
NEUROG3 50674 Coronary heart disease
5-lipoxygenase 240 Coronary heart disease
ApoA1 335 Coronary heart disease
CETP 1071 Coronary heart disease
ApoB 338 Coronary heart disease
IGF-1 3479 Coronary heart disease
Insulin-like growth factor-1 3479 Coronary heart disease
ApoCIII 345 Coronary heart disease
PLA2G7 7941 Coronary heart disease
ADRB3 155 Coronary heart disease
ADRB1 153 Coronary heart disease
APOB 338 Coronary heart disease
ALOX5 240 Coronary heart disease
IGF1 3479 Coronary heart disease
NPPB 4879 Coronary heart disease
HSPG2 3339 Coronary heart disease
LPA 4018 Coronary heart disease
CYP7A1 1581 Myocardial infarction
Caspase 3 836 Myocardial infarction
C-reactive protein 1401 Myocardial infarction
Renin 5972 Myocardial infarction
Factor VII 2155 Myocardial infarction
Factor H 3075 Myocardial infarction
Hepatic lipase 3990 Myocardial infarction
Myeloperoxidase 4353 Myocardial infarction
Endothelial protein C receptor 10544 Myocardial infarction
ALDH2 217 Myocardial infarction
C1-inhibitor 710 Myocardial infarction
Basic fibroblast growth factor 2247 Myocardial infarction
Myocyte-specific enhancer factor 2A 4205 Myocardial infarction
5-Lipoxygenase-activating protein 241 Myocardial infarction
RAGE (receptor) 177 Myocardial infarction
OLR1 4973 Myocardial infarction
Beta-1 adrenergic receptor 153 Myocardial infarction
PTGS2 5743 Myocardial infarction
Cholesterol 7 alpha-hydroxylase 1581 Myocardial infarction
GPVI 51206 Myocardial infarction
Adrenomedullin 133 Myocardial infarction
Prostacyclin synthase 5740 Myocardial infarction
Cystatin C 1471 Myocardial infarction
Tenascin X 7148 Myocardial infarction
Thymosin beta-4 7114 Myocardial infarction
GCLM 2730 Myocardial infarction
S100A9 6280 Myocardial infarction
IL1RL1 9173 Myocardial infarction
LGALS2 3957 Myocardial infarction
CKM (gene) 1158 Myocardial infarction
ABCC9 10060 Myocardial infarction
Renalase 55328 Myocardial infarction
VTI1A 143187 Myocardial infarction
MIAT (gene) 440823 Myocardial infarction
BFGF 2247 Myocardial infarction
TMSB4X 7114 Myocardial infarction
CASP3 836 Myocardial infarction
Caspase-3 836 Myocardial infarction
Complement factor H 3075 Myocardial infarction
MEF2A 4205 Myocardial infarction
5-lipoxygenase activating protein 241 Myocardial infarction
Factor VIIa 2155 Myocardial infarction
PROCR 10544 Myocardial infarction
GP6 51206 Myocardial infarction
F7 2155 Myocardial infarction
AGER 177 Myocardial infarction
ADRB1 153 Myocardial infarction
MIAT 440823 Myocardial infarction
CFH 3075 Myocardial infarction
CKM 1158 Myocardial infarction
CRP 1401 Myocardial infarction
LIPC 3990 Myocardial infarction
RNLS 55328 Myocardial infarction
PTGIS 5740 Myocardial infarction
TNXB 7148 Myocardial infarction
SERPING1 710 Myocardial infarction
FGF2 2247 Myocardial infarction
REN 5972 Myocardial infarction
ADM 133 Myocardial infarction
CST3 1471 Myocardial infarction
MPO 4353 Myocardial infarction
ALOX5AP 241 Myocardial infarction
Myoglobin 4151 Acute myocardial infarction
Tissue plasminogen activator 5327 Acute myocardial infarction
MIRN21 406991 Acute myocardial infarction
Apolipoprotein B 338 Acute myocardial infarction
Endothelin 1 1906 Acute myocardial infarction
MMP3 4314 Acute myocardial infarction
Heart-type fatty acid binding protein 2170 Acute myocardial infarction
Alteplase 5327 Acute myocardial infarction
FABP3 2170 Acute myocardial infarction
ApoB 338 Acute myocardial infarction
MB 4151 Acute myocardial infarction
APOB 338 Acute myocardial infarction
PLAT 5327 Acute myocardial infarction
EDN1 1906 Acute myocardial infarction
MIR21 406991 Acute myocardial infarction
Adenosine A1 receptor 134 Myocardial stunning
SOD2 6648 Myocardial stunning
ADORA1 134 Myocardial stunning
MYH7 4625 Endocardial fibroelastosis
Tafazzin 6901 Endocardial fibroelastosis
TAZ 6901 Endocardial fibroelastosis
Nav1.5 6331 Conduction disease
SCN5A 6331 Conduction disease
PRKAG2 51422 Wolff-Parkinson-White syndrome
TNNT2 7139 Restrictive cardiomyopathy
Titin 7273 Hypertrophic cardiomyopathy
CSRP3 8048 Hypertrophic cardiomyopathy
CD36 948 Hypertrophic cardiomyopathy
Myosin binding protein C, cardiac 4607 Hypertrophic cardiomyopathy
MYH7 4625 Hypertrophic cardiomyopathy
MYL9 10398 Hypertrophic cardiomyopathy
TNNT2 7139 Hypertrophic cardiomyopathy
ACTC1 70 Hypertrophic cardiomyopathy
Endothelin 2 1907 Hypertrophic cardiomyopathy
MYL2 4633 Hypertrophic cardiomyopathy
MYH6 4624 Hypertrophic cardiomyopathy
MYBPC1 4604 Hypertrophic cardiomyopathy
MYL3 4634 Hypertrophic cardiomyopathy
JPH2 57158 Hypertrophic cardiomyopathy
MYLK2 85366 Hypertrophic cardiomyopathy
MYBPC3 4607 Hypertrophic cardiomyopathy
CD-36 948 Hypertrophic cardiomyopathy
TTN 7273 Hypertrophic cardiomyopathy
EDN2 1907 Hypertrophic cardiomyopathy
Titin 7273 Dilated cardiomyopathy
CSRP3 8048 Dilated cardiomyopathy
Phospholamban 5350 Dilated cardiomyopathy
Tafazzin 6901 Dilated cardiomyopathy
Beta-1 adrenergic receptor 153 Dilated cardiomyopathy
LMNA 4000 Dilated cardiomyopathy
Palladin 23022 Dilated cardiomyopathy
Fukutin 2218 Dilated cardiomyopathy
TNNT2 7139 Dilated cardiomyopathy
ACTC1 70 Dilated cardiomyopathy
SGCD 6444 Dilated cardiomyopathy
Programmed cell death 1 5133 Dilated cardiomyopathy
LDB3 11155 Dilated cardiomyopathy
ABCC9 10060 Dilated cardiomyopathy
PDCD1 5133 Dilated cardiomyopathy
ADRB1 153 Dilated cardiomyopathy
TTN 7273 Dilated cardiomyopathy
TAZ 6901 Dilated cardiomyopathy
PLN 5350 Dilated cardiomyopathy
PALLD 23022 Dilated cardiomyopathy
FKTN 2218 Dilated cardiomyopathy

Note: In my previous post, ADA was found to be associated with DOID:3363 (coronary arteriosclerosis). This result was not retrieved using SPARQL, and this information is not available on the GeneWiki+ page for ADA. But keep in mind that GeneWiki+ is still under development.

That's it,

Pierre


11 February 2010

Mapping RDBMS to RDF with D2RQ (yet another geeky title)

One of the coolest thing have seen here at Biohackathon 2010 is D2RQ (thank you Jan !):
D2RQ is a declarative language to describe mappings between relational database schemata and OWL/RDFS ontologies. The D2RQ Platform uses these mapping to enables applications to access a RDF-view on a non-RDF database.
In this post, I'll describe how I installed a D2RQ server.
First, download D2RQ:

wget http://downloads.sourceforge.net/project/d2rq-map/D2R%20Server/v0.7%20%28alpha%29/d2r-server-0.7.tar.gz
tar xfz d2r-server-0.7.tar.gz

Check that the Java MySQL driver is present in the lib folder (yes, it is):
ls lib/mysql-connector-java-5.1.7-bin.jar

Create one table in mysql describing some SNPs:
mysql> create table snp(id int unsigned primary key, name varchar(20) not null,avHet float);
Query OK, 0 rows affected (0.04 sec)

insert into snp(id,name,avHet) values (3210717,"rs3210717",0.2408);
insert into snp(id,name,avHet) values (1045871,"rs1045871",0.4278);
insert into snp(id,name,avHet) values (1045862,"rs1045862",0.2688);
insert into snp(id,name,avHet) values (17149433,"rs17149433",0.1958);
insert into snp(id,name,avHet) values (17149429,"rs17149429",0.1128);
insert into snp(id,name,avHet) values (16925319,"rs16925319",0.2822);
insert into snp(id,name,avHet) values (17353727,"rs17353727",0.495);
insert into snp(id,name,avHet) values (17157186,"rs17157186",0.4118);
insert into snp(id,name,avHet) values (3210688,"rs3210688",0.1638);
insert into snp(id,name,avHet) values (17157183,"rs17157183",0.4422);

Now call generate-mapping to generate the mapping between MySQL and RDF:
./generate-mapping -u root -d com.mysql.jdbc.Driver -o mapping.n3 -b "my:bio:database" "jdbc:mysql://localhost/test"

Here is the generated file mapping.n3:
@prefix map: <file:/home/pierre/tmp/D2RQ/d2r-server-0.7/mapping.n3#> .
@prefix db: <> .
@prefix vocab: <my:bio:databasevocab/resource/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix d2rq: <http://www.wiwiss.fu-berlin.de/suhl/bizer/D2RQ/0.1#> .
@prefix jdbc: <http://d2rq.org/terms/jdbc/> .

map:database a d2rq:Database;
d2rq:jdbcDriver "com.mysql.jdbc.Driver";
d2rq:jdbcDSN "jdbc:mysql://localhost/test";
d2rq:username "root";
jdbc:autoReconnect "true";
jdbc:zeroDateTimeBehavior "convertToNull";
.

# Table snp
map:snp a d2rq:ClassMap;
d2rq:dataStorage map:database;
d2rq:uriPattern "snp/@@snp.id@@";
d2rq:class vocab:snp;
d2rq:classDefinitionLabel "snp";
.
map:snp__label a d2rq:PropertyBridge;
d2rq:belongsToClassMap map:snp;
d2rq:property rdfs:label;
d2rq:pattern "snp #@@snp.id@@";
.
map:snp_id a d2rq:PropertyBridge;
d2rq:belongsToClassMap map:snp;
d2rq:property vocab:snp_id;
d2rq:propertyDefinitionLabel "snp id";
d2rq:column "snp.id";
d2rq:datatype xsd:int;
.
map:snp_name a d2rq:PropertyBridge;
d2rq:belongsToClassMap map:snp;
d2rq:property vocab:snp_name;
d2rq:propertyDefinitionLabel "snp name";
d2rq:column "snp.name";
.
map:snp_avHet a d2rq:PropertyBridge;
d2rq:belongsToClassMap map:snp;
d2rq:property vocab:snp_avHet;
d2rq:propertyDefinitionLabel "snp avHet";
d2rq:column "snp.avHet";
d2rq:datatype xsd:float;
.
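To make the d2rq:uriPattern and d2rq:pattern lines above more concrete, here is a minimal sketch of what such a pattern means: each @@table.column@@ placeholder is replaced by the value of that column in the current row. This is not D2RQ code; the class UriPatternSketch and its expand method are names I made up for the illustration, and the row is one of the SNPs inserted earlier.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustration only: this is NOT D2RQ code, just a model of what a pattern means. */
final class UriPatternSketch {
    // Matches placeholders such as @@snp.id@@
    private static final Pattern PLACEHOLDER = Pattern.compile("@@([^@]+)@@");

    /** Replaces every @@table.column@@ placeholder by the value found in 'row'. */
    static String expand(String pattern, Map<String, String> row) {
        Matcher m = PLACEHOLDER.matcher(pattern);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = row.get(m.group(1));
            if (value == null) {
                throw new IllegalArgumentException("no value for column " + m.group(1));
            }
            // quoteReplacement protects '$' and '\' that may occur in the value
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // One row of the 'snp' table, keyed by "table.column"
        Map<String, String> row = new LinkedHashMap<>();
        row.put("snp.id", "3210717");
        row.put("snp.name", "rs3210717");
        System.out.println(expand("snp/@@snp.id@@", row));  // the uriPattern of map:snp
        System.out.println(expand("snp #@@snp.id@@", row)); // the pattern of map:snp__label
    }
}
```

With the row above, the two patterns expand to snp/3210717 and snp #3210717: one resource and one label per row of the table.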
...now, start the d2r-server:
./d2r-server -p 8080 mapping.n3
03:15:03 INFO log :: Logging to org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via org.mortbay.log.Slf4jLog
03:15:03 INFO log :: jetty-6.1.10
03:15:03 INFO log :: NO JSP Support for , did not find org.apache.jasper.servlet.JspServlet
03:15:03 INFO D2RServer :: using config file: file:/home/pierre/tmp/D2RQ/d2r-server-0.7/ucsc.n3
(...)

Open your web browser at http://localhost:8080/snorql/ and TADA!!!!!!!! Here is a functional SPARQL engine mapping your database.


FTW !

That's it !
Pierre

03 February 2010

Using a FASTA file as a source of RDF statements for SPARQL.


In this post, I'll show how a Fasta file can be used as a source of RDF statements for the Jena API.
The DNA sequences in the Fasta file will be used by Jena without any prior transformation: the file will be used as a Graph by Jena by implementing com.hp.hpl.jena.graph.Graph.

Here, my example uses a Fasta file, but it could have been any kind of input: an SQL database, an XML file, a GFF file, etc.

How it works


com.hp.hpl.jena.graph.Graph is the interface to be satisfied by implementations maintaining collections of RDF triples. The core interface is small (add, delete, find, contains) and is augmented by additional classes to handle more complicated matters such as reification, query handling, bulk update, event management, and transaction handling. My implementation of this interface extends com.hp.hpl.jena.graph.impl.GraphBase; it reads a Fasta file and creates a set of RDF triples for each sequence. All we need is (love and) an implementation of the abstract function ExtendedIterator<Triple> graphBaseFind(TripleMatch matcher):
public class FastaModel
extends GraphBase
{
protected ExtendedIterator<Triple> graphBaseFind(TripleMatch matcher)
{
//the function to be implemented....
}
}

The FastaSequence


A simple container for a name and a sequence.
/** a simple fasta sequence */
private static class FastaSequence
{
StringBuilder name=new StringBuilder();
StringBuilder sequence=new StringBuilder();
}

Reading the next Fasta Sequence

... Not Rocket Science...
/** the file reader */
private PushbackReader reader;
(...)
reader=new PushbackReader(new FileReader(fastaFile));
(...)
private FastaSequence readNext() throws IOException
{
if(this.reader==null) return null;
int c;
FastaSequence seq=null;
while((c=this.reader.read())!=-1)
{
if(c=='>')
{
if(seq!=null)
{
this.reader.unread(c);
return seq;
}
seq=new FastaSequence();
while((c=this.reader.read())!=-1)
{
if(c=='\n') break;
seq.name.append((char)c);
}
}
else if(seq!=null && Character.isLetter(c))
{
seq.sequence.append((char)c);
}
}
this.close();//close the FileReader
return seq;
}

Implementing the Iterator of Triples


My FastaIterator extends jena.util.iterator.NiceIterator<Triple>, a class implementing the ExtendedIterator returned by the Graph function ExtendedIterator<Triple> graphBaseFind(TripleMatch matcher). The class contains three fields:
  • a PushbackReader wrapping a FileReader
  • a com.hp.hpl.jena.graph.Triple that is used as a filter
  • a stack/queue of RDF Triples
The constructor for 'FastaIterator' opens the stream:
FastaIterator(TripleMatch matcher) throws IOException
{
this.filter=matcher.asTriple();
try
{
this.reader=new PushbackReader(new FileReader(FastaModel.this.fastaFile));
}
catch (IOException e)
{
throw new JenaException(e);
}
}
The method 'close' just closes the input stream:
@Override
public void close()
{
try
{
if(this.reader!=null) reader.close();
}
catch (IOException e)
{
throw new JenaException(e);
}
finally
{
this.reader=null;
super.close();
}
}
The method 'next()' checks whether the RDF queue is empty and, if so, calls hasNext() to refill it; a triple is then removed from the queue and returned:
@Override
public Triple next()
{
if(this.triples_queue.isEmpty()) hasNext();
if(this.triples_queue.isEmpty()) throw new IllegalStateException();
return this.triples_queue.pop();
}
The method 'hasNext()' returns true if the queue of RDF triples is not empty. Otherwise, it reads the next Fasta sequence from the input stream and transforms it into a set of RDF triples, which are added to the queue if they match this.filter. That is to say, the following Fasta sequence...:


>gi|227935373|gb|FJ425127.1| Rotavirus G8 isolate 6862/2000/ARN NSP3 gene, partial cds
GGCCACTTCAACATTAGAATTAATGGGTATTCAATATGATTACAATGAAGTATTTACCAGAGTTAAAAGT
AAATTTGATTATGTGATGGATGACTCTGGTGTTAAAAACAATCTTTTGGGTAAAGCTATAACTATTGATC
AGGCATTAAATGGAAAGTTTGGCTCAGCTATTAGAAATAGAAATTGGATGACTGATTCTAAAACGGTTGC
TAAATTAGATGAAGACGTGAATAAACTTAGAATGACATTATCTTCTAAAGGAATCGACCAAAAGATGAGA
GTACTTAATGCTTGTTTAGTGTA

... will generate these four RDF statements:
<http://www.ncbi.nlm.nih.gov/nuccore/227935373> <urn:lindenb:ontology:length> "303"^^<http://www.w3.org/2001/XMLSchema#int>
<http://www.ncbi.nlm.nih.gov/nuccore/227935373> <urn:lindenb:ontology:sequence> "GGCCACTTCAACATTAGAATTAATGGGTATTCAATATGATTACAATGAAGTATTTACCAGAGTTAAAAGTAAATTTGATTATGTGATGGATGACTCTGGTGTTAAAAACAATCTTTTGGGTAAAGCTATAACTATTGATCAGGCATTAAATGGAAAGTTTGGCTCAGCTATTAGAAATAGAAATTGGATGACTGATTCTAAAACGGTTGCTAAATTAGATGAAGACGTGAATAAACTTAGAATGACATTATCTTCTAAAGGAATCGACCAAAAGATGAGAGTACTTAATGCTTGTTTAGTGTA"
<http://www.ncbi.nlm.nih.gov/nuccore/227935373> <http://purl.org/dc/elements/1.1/title> "gi|227935373|gb|FJ425127.1| Rotavirus G8 isolate 6862/2000/ARN NSP3 gene, partial cds"
<http://www.ncbi.nlm.nih.gov/nuccore/227935373> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <urn:lindenb:ontology:Sequence>
Here is the code for the 'hasNext' function:
@Override
public boolean hasNext()
{
if(!triples_queue.isEmpty()) return true;
if(this.reader==null) return false;
try
{
/* loop until the queue is not empty or the stream is closed */
while(this.triples_queue.isEmpty())
{
//try to get a new fasta sequence
FastaSequence seq=readNext();
if(seq==null) return false;

String name=seq.name.toString();
//check it is a genbank file with a gi
if(!name.startsWith("gi|"))
{
continue;
}
int i=name.indexOf('|',3);
if(i==-1) continue;
//create the subject
Node subject =Node.createURI("http://www.ncbi.nlm.nih.gov/nuccore/"+name.substring(3,i));

//make a triple for the rdf:type
Triple triple=new Triple(
subject,
RDF.type.asNode(),
Node.createURI("urn:lindenb:ontology:Sequence")
);
//append this triple to the queue if it is accepted by this.filter
if(this.filter.asTriple().matches(triple))
{
this.triples_queue.add(triple);
}

//make a triple for the dc:title
triple=new Triple(
subject,
DC.title.asNode(),
Node.createLiteral(name)
);

//append this triple to the queue if it is accepted by this.filter
if(this.filter.asTriple().matches(triple))
{
this.triples_queue.add(triple);
}

//make a triple for the DNA sequence
triple=new Triple(
subject,
Node.createURI("urn:lindenb:ontology:sequence"),
Node.createLiteral(seq.sequence.toString())
);

//append this triple to the queue if it is accepted by this.filter
if(this.filter.asTriple().matches(triple))
{
this.triples_queue.add(triple);
}

//make a triple for the size of this sequence
triple=new Triple(
subject,
Node.createURI("urn:lindenb:ontology:length"),
Node.createLiteral(String.valueOf(seq.sequence.length()),null,XSDDatatype.XSDint)
);

//append this triple to the queue if it is accepted by this.filter
if(this.filter.asTriple().matches(triple))
{
this.triples_queue.add(triple);
}

}
}
catch (IOException e)
{
close();
throw new JenaException(e);
}
return !triples_queue.isEmpty();
}
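
The filtering step in 'hasNext' can be tried out without Jena. The sketch below is only a plain-Java model of it: SimpleTriple, giOf and triplesFor are hypothetical names, strings stand in for Jena nodes, "rdf:type" and "dc:title" are shorthand for the full predicate URIs, and a null field in a pattern plays the role of a wildcard (what Jena expresses with Node.ANY).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Plain-Java model of the filtering step; the names here are hypothetical, not Jena classes. */
final class TripleFilterSketch {
    /** A triple of strings; in a pattern, null means "match anything". */
    static final class SimpleTriple {
        final String s, p, o;
        SimpleTriple(String s, String p, String o) { this.s = s; this.p = p; this.o = o; }
        /** True if every non-null field of this pattern equals the matching field of t. */
        boolean matches(SimpleTriple t) {
            return (s == null || s.equals(t.s))
                && (p == null || p.equals(t.p))
                && (o == null || o.equals(t.o));
        }
        @Override public String toString() { return s + " " + p + " " + o; }
    }

    /** Extracts the gi number from a header like "gi|227935373|gb|...", or null if absent. */
    static String giOf(String header) {
        if (!header.startsWith("gi|")) return null;
        int i = header.indexOf('|', 3);
        return i == -1 ? null : header.substring(3, i);
    }

    /** Builds the four triples for one sequence and keeps those accepted by 'filter'. */
    static List<SimpleTriple> triplesFor(String header, String dna, SimpleTriple filter) {
        List<SimpleTriple> kept = new ArrayList<>();
        String gi = giOf(header);
        if (gi == null) return kept; // not a GenBank-style header: no triples
        String subject = "http://www.ncbi.nlm.nih.gov/nuccore/" + gi;
        List<SimpleTriple> all = Arrays.asList(
            new SimpleTriple(subject, "rdf:type", "urn:lindenb:ontology:Sequence"),
            new SimpleTriple(subject, "dc:title", header),
            new SimpleTriple(subject, "urn:lindenb:ontology:sequence", dna),
            new SimpleTriple(subject, "urn:lindenb:ontology:length", String.valueOf(dna.length())));
        for (SimpleTriple t : all) {
            if (filter.matches(t)) kept.add(t);
        }
        return kept;
    }

    public static void main(String[] args) {
        // Pattern (?s, length, ?o): only the length triple survives
        SimpleTriple onlyLength = new SimpleTriple(null, "urn:lindenb:ontology:length", null);
        for (SimpleTriple t : triplesFor("gi|227935373|gb|FJ425127.1| Rotavirus", "GGCCACTT", onlyLength)) {
            System.out.println(t);
        }
    }
}
```

A pattern made of three nulls keeps all four triples, which corresponds to what listStatements() asks of the graph.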

Using the graph

Creating a new Jena RDF Model
Model m=ModelFactory.createModelForGraph(
new FastaModel(
new File("rotavirus.fa")
));

Looping over the RDF statements
After creating this new Model, it can be used as a regular Jena RDF Model, e.g.:
StmtIterator i=m.listStatements();
while(i.hasNext())
{
System.err.println(i.next());
}
Result
[http://www.ncbi.nlm.nih.gov/nuccore/227935373, urn:lindenb:ontology:length, "303"^^http://www.w3.org/2001/XMLSchema#int]
[http://www.ncbi.nlm.nih.gov/nuccore/227935373, urn:lindenb:ontology:sequence, "GGCCACTTCAACATTAGAATTAATGGGTATTCAATATGATTACAATGAAGTATTTACCAGAGTTAAAAGTAAATTTGATTATGTGATGGATGACTCTGGTGTTAAAAACAATCTTTTGGGTAAAGCTATAACTATTGATCAGGCATTAAATGGAAAGTTTGGCTCAGCTATTAGAAATAGAAATTGGATGACTGATTCTAAAACGGTTGCTAAATTAGATGAAGACGTGAATAAACTTAGAATGACATTATCTTCTAAAGGAATCGACCAAAAGATGAGAGTACTTAATGCTTGTTTAGTGTA"]
[http://www.ncbi.nlm.nih.gov/nuccore/227935373, http://purl.org/dc/elements/1.1/title, "gi|227935373|gb|FJ425127.1| Rotavirus G8 isolate 6862/2000/ARN NSP3 gene, partial cds"]
[http://www.ncbi.nlm.nih.gov/nuccore/227935373, http://www.w3.org/1999/02/22-rdf-syntax-ns#type, urn:lindenb:ontology:Sequence]
[http://www.ncbi.nlm.nih.gov/nuccore/227935371, urn:lindenb:ontology:length, "303"^^http://www.w3.org/2001/XMLSchema#int]
[http://www.ncbi.nlm.nih.gov/nuccore/227935371, urn:lindenb:ontology:sequence, "GGCCACTTCAACATTAGAATTAATGGGTATTCAATATGATTACAATGAAGTATTTACCAGAGTTAAAAGTAAATTTGATTATGTGATGGATGACTCTGGTGTTAAAAACAATCTTTTGGGTAAAGCTATAACTATTGATCAGGCATTAAATGGAAAGTTTGGCTCAGCTATTAGAAATAGAAATTGGATGACTGATTCTAAAACGGTTGCTAAATTAGATGAAGACGTGAATAAACTTAGAATGACATTATCTTCTAAAGGAATCGACCAAAAGATGAGAGTACTTAATGCTTGTTTAGTGTA"]
[http://www.ncbi.nlm.nih.gov/nuccore/227935371, http://purl.org/dc/elements/1.1/title, "gi|227935371|gb|FJ425126.1| Rotavirus G8 isolate 6854/2002/ARN NSP3 gene, partial cds"]
[http://www.ncbi.nlm.nih.gov/nuccore/227935371, http://www.w3.org/1999/02/22-rdf-syntax-ns#type, urn:lindenb:ontology:Sequence]
[http://www.ncbi.nlm.nih.gov/nuccore/227935369, urn:lindenb:ontology:length, "303"^^http://www.w3.org/2001/XMLSchema#int]
[http://www.ncbi.nlm.nih.gov/nuccore/227935369, urn:lindenb:ontology:sequence, "GGCCACTTCAACATTAGAATTAATGGGTATTCAATATGATTACAATGAAGTATTTACCAGAGTTAAAAGTAAATTTGATTATGTGATGGATGACTCTGGTGTTAAAAACAATCTTCTGGGTAAAGCTATAACTATTGATCAGGCATTAAATGGAAAGTTTGGCTCAGCTATTAGAAATAGAAATTGGATGACTGATTCTAAAACGGTTGCTAAATTAGATGAAGACGTGAATAAACTTAGAATGACATTATCTTCTAAAGGAATCGACCAAAAGATGAGAGTACTTAATGCTTGTTTAGTGTA"]
[http://www.ncbi.nlm.nih.gov/nuccore/227935369, http://purl.org/dc/elements/1.1/title, "gi|227935369|gb|FJ425125.1| Rotavirus G8 isolate 6810/2004/ARN NSP3 gene, partial cds"]
(...)

This model can also be used as a source of RDF by ARQ, the SPARQL engine for Jena (!). Here we create a new SPARQL query execution and list the pairs of sequences in which the first is shorter than the second:
Query query=QueryFactory.create(
"SELECT ?Seq1 ?Len1 ?Seq2 ?Len2" +
"{" +
"?Seq1 a <urn:lindenb:ontology:Sequence> . " +
"?Seq1 <urn:lindenb:ontology:length> ?Len1 . " +
"?Seq2 a <urn:lindenb:ontology:Sequence> . " +
"?Seq2 <urn:lindenb:ontology:length> ?Len2 . " +
"FILTER (?Seq1!=?Seq2 && ?Len1 < ?Len2) "+

"}"
);
QueryExecution execution = QueryExecutionFactory.create(query, m);
ResultSet row=execution.execSelect();
while(row.hasNext())
{
QuerySolution solution=row.next();

for(Iterator<String> si=solution.varNames();si.hasNext();)
{
String name=si.next();
System.out.println(name+" : "+solution.get(name));
}
System.out.println();
}
Result:
Seq1 : http://www.ncbi.nlm.nih.gov/nuccore/227935373
Len1 : 303^^http://www.w3.org/2001/XMLSchema#int
Seq2 : http://www.ncbi.nlm.nih.gov/nuccore/227935361
Len2 : 304^^http://www.w3.org/2001/XMLSchema#int

Seq1 : http://www.ncbi.nlm.nih.gov/nuccore/227935373
Len1 : 303^^http://www.w3.org/2001/XMLSchema#int
Seq2 : http://www.ncbi.nlm.nih.gov/nuccore/227935359
Len2 : 304^^http://www.w3.org/2001/XMLSchema#int

Seq1 : http://www.ncbi.nlm.nih.gov/nuccore/227935373
Len1 : 303^^http://www.w3.org/2001/XMLSchema#int
Seq2 : http://www.ncbi.nlm.nih.gov/nuccore/215489730
Len2 : 305^^http://www.w3.org/2001/XMLSchema#int

Seq1 : http://www.ncbi.nlm.nih.gov/nuccore/227935371
Len1 : 303^^http://www.w3.org/2001/XMLSchema#int
Seq2 : http://www.ncbi.nlm.nih.gov/nuccore/227935361
Len2 : 304^^http://www.w3.org/2001/XMLSchema#int

Seq1 : http://www.ncbi.nlm.nih.gov/nuccore/227935371
Len1 : 303^^http://www.w3.org/2001/XMLSchema#int
Seq2 : http://www.ncbi.nlm.nih.gov/nuccore/227935359
Len2 : 304^^http://www.w3.org/2001/XMLSchema#int

Seq1 : http://www.ncbi.nlm.nih.gov/nuccore/227935371
Len1 : 303^^http://www.w3.org/2001/XMLSchema#int
Seq2 : http://www.ncbi.nlm.nih.gov/nuccore/215489730
Len2 : 305^^http://www.w3.org/2001/XMLSchema#int

Seq1 : http://www.ncbi.nlm.nih.gov/nuccore/227935369
Len1 : 303^^http://www.w3.org/2001/XMLSchema#int
Seq2 : http://www.ncbi.nlm.nih.gov/nuccore/227935361
Len2 : 304^^http://www.w3.org/2001/XMLSchema#int
(...)
Hey, I think it's coool ! :-)
BTW I wonder how, knowing the FILTER of the SPARQL query, searching the Graph can be optimized, for example if we know that the sequences have been sorted in the fasta file according to their lengths.... Any idea ?

Compiling



export JENAPATH=${JENALIB}/icu4j-3.4.4.jar:${JENALIB}/iri-0.7.jar:${JENALIB}/jena-2.6.2.jar:${JENALIB}/jena-2.6.2-tests.jar:${JENALIB}/junit-4.5.jar:${JENALIB}/log4j-1.2.13.jar:${JENALIB}/lucene-core-2.3.1.jar:${JENALIB}/slf4j-api-1.5.6.jar:${JENALIB}/slf4j-log4j12-1.5.6.jar:${JENALIB}/stax-api-1.0.1.jar:${JENALIB}/wstx-asl-3.2.9.jar:${JENALIB}/xercesImpl-2.7.1.jar:${JENALIB}/arq-2.8.1.jar
javac -cp ${JENAPATH}:. -d bin -sourcepath src src/test/FastaModel.java

Running


java -cp ${JENAPATH}:bin test.FastaModel

All, in one, here is the code



That's it !
Pierre

01 February 2010

Searching for Genotypes with SPARQL.

This weekend, I noticed that the NCBI has an interface called Genotype Query Form, which is used to query genotypes and generates the following kind of XML output:
<GenoExchange xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.ncbi.nlm.nih.gov/SNP/geno" xsi:schemaLocation="http://www.ncbi.nlm.nih.gov/SNP/geno ftp://ftp.ncbi.nlm.nih.gov/snp/specs/genoex_1_4.xsd" dbSNPBuildNo="129">
<Population popId="1409" handle="CSHL-HAPMAP" locPopId="HapMap-CEU">
<popClass self="NOT SPECIFIED" />
</Population>
<Individual indId="170" taxId="9606" sex="F" indGroup="European">
<SourceInfo source="Coriell" sourceType="repository" ncbiPedId="80" pedId="1340" indId="NA07000" maId="0" paId="0" srcIndGroup="Western and Nothern European" />
<SubmitInfo popId="1409" submittedIndId="NA07000" subIndGroup="Western and Northern European" />
</Individual>
<Individual indId="621" taxId="9606" sex="F" indGroup="European">

(...)
<SnpLoc genomicAssembly="36:reference" chrom="1" start="1286927" locType="2" rsOrientToChrom="rev" contigAllele="C" />
<SsInfo ssId="3906671" locSnpId="AL139287.6_22772" ssOrientToRs="fwd">
<ByPop popId="1409" sampleSize="120">
<AlleleFreq allele="A" freq="0.117" />
<AlleleFreq allele="G" freq="0.883" />
<GTypeFreq gtype="A/G" freq="0.233" />
<GTypeFreq gtype="G/G" freq="0.767" />
(...)
<GTypeByInd indId="636" gtype="G/G" />
<GTypeByInd indId="456" gtype="G/G" />
<GTypeByInd indId="536" gtype="G/G" />
</ByPop>
</SsInfo>
<GTypeFreq gtype="A/A" freq="0.380952380952381" />
<GTypeFreq gtype="A/G" freq="0.352380952380952" />
<GTypeFreq gtype="G/G" freq="0.266666666666667" />
</SnpInfo>
<SnpInfo rsId="2765021" observed="A/G">
(...)
I wanted to see how one could query this kind of data with SPARQL... well, I'm sure that RDF is one of the most inefficient ways to store this kind of data, but I wanted to see what a semantic query could extract from such an RDF store. First, I wrote an XSLT stylesheet transforming <GenoExchange/> to <rdf:RDF/>. The stylesheet is available at http://code.google.com/p/lindenb/source/browse/trunk/src/xsl/genoexch2rdf.xsl.

Transform the data

About 639 HapMap SNPs on chromosome 1 were extracted using the HTML form and saved as XML to the file 'SNPgenotype-100201-1244-3905.xml' (size: 4 MB). The XML was converted to RDF with xsltproc:
xsltproc --stringparam "with-sequence" yes --novalid genoexch2rdf.xsl SNPgenotype-100201-1244-3905.xml > input.rdf
The size of 'input.rdf' (including the flanking sequences of the SNPs) was 20 MB.
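
For readers who prefer to stay in Java, the same kind of transformation can be run with the JDK's javax.xml.transform API instead of xsltproc. The sketch below is only an illustration: the inline stylesheet is a tiny stand-in I wrote for this example, not genoexch2rdf.xsl, and it only handles <Individual/>. I assume here that the name comes from SourceInfo/@indId, which is consistent with the input and output shown in this post.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

/** Sketch: applies a toy stylesheet (NOT genoexch2rdf.xsl) to a GenoExchange fragment. */
final class GenoToRdfSketch {
    // Hypothetical stand-in stylesheet: one RDF resource per <Individual/>
    static final String XSL =
          "<xsl:stylesheet version='1.0'"
        + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
        + " xmlns:g='http://www.ncbi.nlm.nih.gov/SNP/geno'"
        + " xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'"
        + " xmlns='http://ontology.lindenb.org/genotypes/'"
        + " exclude-result-prefixes='g'>"
        + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
        + "<xsl:template match='/'>"
        + "<rdf:RDF><xsl:apply-templates select='//g:Individual'/></rdf:RDF>"
        + "</xsl:template>"
        + "<xsl:template match='g:Individual'>"
        + "<Individual rdf:about='http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ind.cgi?ind_id={@indId}'>"
        + "<sex><xsl:value-of select='@sex'/></sex>"
        + "<name><xsl:value-of select='g:SourceInfo/@indId'/></name>"
        + "</Individual>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // A cut-down version of the NCBI output shown above
    static final String XML =
          "<GenoExchange xmlns='http://www.ncbi.nlm.nih.gov/SNP/geno'>"
        + "<Individual indId='170' taxId='9606' sex='F' indGroup='European'>"
        + "<SourceInfo source='Coriell' sourceType='repository' indId='NA07000'/>"
        + "</Individual>"
        + "</GenoExchange>";

    /** Runs the stylesheet 'xsl' on the document 'xml' and returns the serialized result. */
    static String transform(String xsl, String xml) {
        try {
            Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xsl)));
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(transform(XSL, XML));
    }
}
```

For a 4 MB input, the strings would of course be replaced by StreamSource objects built from files.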

Result


<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:g="http://www.ncbi.nlm.nih.gov/SNP/geno" xmlns:snp="http://www.ncbi.nlm.nih.gov/SNP/docsum" xmlns="http://ontology.lindenb.org/genotypes/">
<Population rdf:about="http://www.ncbi.nlm.nih.gov/projects/SNP/snp_viewTable.cgi?type=pop&amp;pop_id=1409">
<handle>CSHL-HAPMAP</handle>
<locPopId>HapMap-CEU</locPopId>
</Population>
<Individual rdf:about="http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ind.cgi?ind_id=170">
<hasPop rdf:resource="http://www.ncbi.nlm.nih.gov/projects/SNP/snp_viewTable.cgi?type=pop&amp;pop_id=1409"/>
<sex>F</sex>
<name>NA07000</name>
</Individual>
<Individual rdf:about="http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ind.cgi?ind_id=621">
<hasPop rdf:resource="http://www.ncbi.nlm.nih.gov/projects/SNP/snp_viewTable.cgi?type=pop&amp;pop_id=1409"/>
<sex>F</sex>
<name>NA12875</name>
</Individual>
<Individual rdf:about="http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ind.cgi?ind_id=538">
<hasPop rdf:resource="http://www.ncbi.nlm.nih.gov/projects/SNP/snp_viewTable.cgi?type=pop&amp;pop_id=1409"/>
<sex>F</sex>
<name>NA12753</name>
(...)
<SNP rdf:about="http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=307347">
<het rdf:datatype="http://www.w3.org/2001/XMLSchema#float">0.1</het>
<name>rs307347</name>
<seq5>GGGGATGGCTGCTCCTGGGCCTCAGAAAGATGCAGTCCCATAGACTTCCAGCACGCCCCTCCCCTCCTCGGGCCTTAATTTTGTCCACTGAGAAGATGGTCTCTGAGGCTCTGGGGTTTCCTTCTTGGTCACCAGATATTCTGCGGGCCTTGCCTTCCTGCCCAGATTCGAGCCAGTGGCAAACAGAAGCTGCCAGGAGC</seq5>
<observed>C/T</observed>
<seq3>TCTCAGAGCTGTGGCTGGTGGCTCGGTAACAACAGGAAGGGCAGTGGCTGTGCAGGAGGCAGGCAGCTTGCCAGCCCAGGAAGGTGACCCAGGACACCTCCAGGCCTTTCCCAGGGCAGCCCAACGGCCCAAGGTCAGGGCCGGGCGCGAGGGCGGCCTGAGCACAGAGCACGGGGGCTGACAGCAGGCTGGGGGGCCAG</seq3>
</SNP>
<MapLoc>
<hasSNP rdf:resource="http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=307347"/>
<strand>+</strand>
<chrom>1</chrom>
<start rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">1320381</start>
<assembly rdf:resource="urn:assembly:Celera:36_3"/>
<type>exact</type>
</MapLoc>
(...)
<Genotype>
<hasIndi rdf:resource="http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ind.cgi?ind_id=465"/>
<hasSNP rdf:resource="http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=940550"/>
<allele1>T</allele1>
<allele2>T</allele2>
</Genotype>
<Genotype>
<hasIndi rdf:resource="http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ind.cgi?ind_id=253"/>
<hasSNP rdf:resource="http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=940550"/>
<allele1>T</allele1>
<allele2>T</allele2>
</Genotype>
</rdf:RDF>

Invoking ARQ

export ARQROOT=ARQ-2.5.0
ARQ-2.5.0/bin/arq --data ~/input.rdf --query ~/query01.rq

Dump All



Query

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX g: <http://ontology.lindenb.org/genotypes/>
SELECT ?s ?p ?o {?s ?p ?o.}

Result

| _:b0 | g:allele2 | "C" |
| _:b0 | g:allele1 | "C" |
| _:b0 | g:hasSNP | <http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=17160669> |
| _:b0 | g:hasIndi | <http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ind.cgi?ind_id=636> |
| _:b0 | rdf:type | g:Genotype |
| _:b1 | g:allele2 | "T" |
| _:b1 | g:allele1 | "C" |
| _:b1 | g:hasSNP | <http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=17160669> |
| _:b1 | g:hasIndi | <http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ind.cgi?ind_id=361> |
| _:b1 | rdf:type | g:Genotype |
| <http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ind.cgi?ind_id=174> | g:name | "NA07048" |
| <http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ind.cgi?ind_id=174> | g:sex | "M" |
| <http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ind.cgi?ind_id=174> | g:hasPop | <http://www.ncbi.nlm.nih.gov/projects/SNP/snp_viewTable.cgi?type=pop&pop_id=1409> |
| <http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ind.cgi?ind_id=174> | rdf:type | g:Individual |
| <http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ind.cgi?ind_id=566> | g:name | "NA12802" |
| <http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ind.cgi?ind_id=566> | g:sex | "F" |
| <http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ind.cgi?ind_id=566> | g:hasPop | <http://www.ncbi.nlm.nih.gov/projects/SNP/snp_viewTable.cgi?type=pop&pop_id=1409> |
| <http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ind.cgi?ind_id=566> | rdf:type | g:Individual |
| _:b2 | g:allele2 | "A" |
| _:b2 | g:allele1 | "A" |
| _:b2 | g:hasSNP | <http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=2765021> |
| _:b2 | g:hasIndi | <http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ind.cgi?ind_id=429> |
| _:b2 | rdf:type | g:Genotype |
| _:b3 | g:allele2 | "C" |
| _:b3 | g:allele1 | "C" |
| _:b3 | g:hasSNP | <http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=17160669> |
| _:b3 | g:hasIndi | <http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ind.cgi?ind_id=546> |
| _:b3 | rdf:type | g:Genotype |
| _:b4 | g:allele2 | "T" |
| _:b4 | g:allele1 | "C" |
| _:b4 | g:hasSNP | <http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=17160669> |
| _:b4 | g:hasIndi | <http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ind.cgi?ind_id=159> |
| _:b4 | rdf:type | g:Genotype |
| <http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ind.cgi?ind_id=621> | g:name | "NA12875" |
| <http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ind.cgi?ind_id=621> | g:sex | "F" |
(...)


Select the populations



Query

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX g: <http://ontology.lindenb.org/genotypes/>
SELECT ?pop
{
?s a g:Population .
?s g:handle ?pop .
}

Result

-----------------
| pop |
=================
| "CSHL-HAPMAP" |
-----------------


List six individuals with their population



Query

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX g: <http://ontology.lindenb.org/genotypes/>
SELECT ?pop ?indi_name ?good
{
?s a g:Population .
?s g:handle ?pop .
?s2 a g:Individual .
?s2 g:hasPop ?s .
?s2 g:sex ?good .
?s2 g:name ?indi_name
}
limit 6

Result

------------------------------------
| pop | indi_name | good |
====================================
| "CSHL-HAPMAP" | "NA10854" | "F" |
| "CSHL-HAPMAP" | "NA12264" | "M" |
| "CSHL-HAPMAP" | "NA11993" | "F" |
| "CSHL-HAPMAP" | "NA10830" | "M" |
| "CSHL-HAPMAP" | "NA12762" | "M" |
| "CSHL-HAPMAP" | "NA12155" | "M" |
------------------------------------


List the SNPs having a flanking sequence containing 'CACACA'



Query

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX g: <http://ontology.lindenb.org/genotypes/>
PREFIX fn: <http://www.w3.org/2005/xpath-functions#>

SELECT ?name ?seq5 ?observed ?seq3
WHERE
{
?s a g:SNP .
?s g:name ?name .
?s g:seq5 ?seq5 .
?s g:seq3 ?seq3 .
?s g:observed ?observed .

FILTER (
fn:contains(fn:upper-case(?seq5), "CACACA") ||
fn:contains(fn:upper-case(?seq3), "CACACA")
)
}

Result

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| name | seq5 | observed | seq3 |
=============================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================
| "rs17160669" | "GCCACCGCGCCTGGCCCACAAGCATAACTTTTATAAAAATAATTTACTTTTACAATTAAGCTTAGGAATCACACAGACTCAGGGCTGGCTCATGGCTTCC" | "C/T" | "GGCAAGTTAAACTCTGTACTTAGGCTCGGCGCGTATGAAATGGCTAATTCTAATCAGTGGTGCAATGAAGTAACTCCTCTAAAGAACTTATCGGGCCGGG" |
| "rs2765023" | "ACTTGTAAATTTAGTCAGCATACATAACTAACCAAAACTTCAATATATCTTGAGACCCCCTTGGGGGGCTGTCTCCATAAAAGTGACTTTCCCAGGAGAGTGACTGGATGTGATTGGCCAACACCGTCTTAGCCCGCAGGGGTTCCTGGCGCGGAAGCCTCACGTCCCTCCCCACAGCGAGTTTTCAGAATCCAAAGGCCGTAGGAGAAAGAAGGCTGGCGGTGTTTCCTCTTAGAGGGGAGAAACTCAGCCTGGGTAGGAGACCCAGCCCCACGCAGGGAAAACTGTGCTAACGCTTCC" | "A/G" | "ATGTGCGTGGCAGGTGCGGCGGCGGCGAATACGGTTTGTCCTCGAGCCTAACCCTGTCTGTGTTGGTGTCAGCAGTGGCCCCCCTACCACACACACAGGGTCCCTGGCGTCCCAAGACCACTCCTGGCAGCCCCGCCACTGGCTGCGCCTGGAAGCCGCGTCCTCAGGCCTCGCCTGGCATTTGCTGTCACAGAGGTTGCTTCCTTGGGTCCGTCCGTCCTCGCCCCTCCAGCCTGGGCGCCCCCCCACCCCTGTCTCATTCCCTCCACCACATGCAGCACAGTCCAGGAGGCTGGGGTC" |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------


Get 12 Heterozygous Genotypes



Query

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX g: <http://ontology.lindenb.org/genotypes/>
PREFIX fn: <http://www.w3.org/2005/xpath-functions#>

SELECT ?indi ?snp ?a1 ?a2
WHERE
{
?s a g:Genotype .
?s g:allele1 ?a1 .
?s g:allele2 ?a2 .
?s g:hasIndi ?s2 .
?s2 g:name ?indi .
?s g:hasSNP ?s3 .
?s3 g:name ?snp .
FILTER ( ?a1 != ?a2 )
}
LIMIT 12

Result

----------------------------------------
| indi | snp | a1 | a2 |
========================================
| "NA12056" | "rs17160669" | "C" | "T" |
| "NA12716" | "rs17160669" | "C" | "T" |
| "NA12761" | "rs17160669" | "C" | "T" |
| "NA10839" | "rs2765023" | "A" | "G" |
| "NA12813" | "rs2765023" | "A" | "G" |
| "NA12760" | "rs2765023" | "A" | "G" |
| "NA12865" | "rs17160669" | "C" | "T" |
| "NA07056" | "rs17160669" | "C" | "T" |
| "NA12146" | "rs2765023" | "A" | "G" |
| "NA10860" | "rs2765023" | "A" | "G" |
| "NA10839" | "rs17160669" | "C" | "T" |
| "NA12812" | "rs17160669" | "C" | "T" |
----------------------------------------


List 12 SNPs on chr1 between 100000 and 500000 on the reference assembly, ordered by chrom/position



Query

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX g: <http://ontology.lindenb.org/genotypes/>
PREFIX fn: <http://www.w3.org/2005/xpath-functions#>

SELECT ?snp ?chrom ?orient ?start
WHERE
{
?s a g:SNP .
?s g:name ?snp .
?s2 a g:MapLoc .
?s2 g:hasSNP ?s .
?s2 g:chrom ?chrom .
?s2 g:chrom "1" .
?s2 g:strand ?orient .
?s2 g:start ?start .
?s2 g:assembly <urn:assembly:reference:36_3> .
FILTER ( ?start > 100000 && ?start< 500000)
}
ORDER BY ?chrom ?start
LIMIT 12

Result

------------------------------------------
| snp | chrom | orient | start |
==========================================
| "rs17009015" | "1" | "-" | 121810 |
| "rs11490937" | "1" | "+" | 222076 |
| "rs12041624" | "1" | "+" | 232164 |
| "rs11514575" | "1" | "-" | 235726 |
| "rs4731490" | "1" | "+" | 311783 |
| "rs4006867" | "1" | "+" | 325493 |
| "rs7462951" | "1" | "-" | 360984 |
| "rs4030300" | "1" | "+" | 392471 |
| "rs4030303" | "1" | "+" | 392552 |
| "rs9661032" | "1" | "-" | 396549 |
| "rs3872250" | "1" | "-" | 400742 |
| "rs3907361" | "1" | "-" | 412985 |
------------------------------------------


List the positions of 10 SNPs on the reference assembly and chr1, print the heterozygosity if it exists and is greater than 0.1



Query

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX g: <http://ontology.lindenb.org/genotypes/>

SELECT ?snp ?chrom ?orient ?start ?het
WHERE
{
?s a g:SNP .
?s g:name ?snp .
?s2 a g:MapLoc .
?s2 g:hasSNP ?s .
?s2 g:chrom ?chrom .
?s2 g:chrom "1" .
?s2 g:strand ?orient .
?s2 g:start ?start .
?s2 g:assembly <urn:assembly:reference:36_3> .
OPTIONAL { ?s g:het ?het . FILTER ( ?het > 0.1 ) }
}
LIMIT 10

Result

------------------------------------------------------------------------------------------------
| snp | chrom | orient | start | het |
================================================================================================
| "rs7417504" | "1" | "+" | 555799 | |
| "rs10018120" | "1" | "-" | 241387750 | "0.48"^^<http://www.w3.org/2001/XMLSchema#float> |
| "rs12043546" | "1" | "+" | 224043895 | |
| "rs4023296" | "1" | "-" | 141776514 | |
| "rs1320571" | "1" | "+" | 1110293 | "0.31"^^<http://www.w3.org/2001/XMLSchema#float> |
| "rs1359759" | "1" | "+" | 115826181 | "0.49"^^<http://www.w3.org/2001/XMLSchema#float> |
| "rs7553429" | "1" | "+" | 1080419 | "0.19"^^<http://www.w3.org/2001/XMLSchema#float> |
| "rs4245756" | "1" | "+" | 789325 | |
| "rs3766177" | "1" | "-" | 1471210 | "0.5"^^<http://www.w3.org/2001/XMLSchema#float> |
| "rs9442372" | "1" | "+" | 1008566 | "0.46"^^<http://www.w3.org/2001/XMLSchema#float> |
------------------------------------------------------------------------------------------------
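
The OPTIONAL { ... FILTER ... } block in the query above behaves like a left join: every SNP is kept, and ?het is bound only when a value exists and passes the filter. As a rough illustration only, the same rule can be restated in plain Java. The class, the method names and the toy data below are assumptions made for this sketch; this is not how Jena evaluates the query.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Illustration only: plain-Java restatement of
//   OPTIONAL { ?s g:het ?het . FILTER ( ?het > 0.1 ) }
// All class and method names here are hypothetical.
final class OptionalHetSketch {

    /** Keeps the value only if it exists and is strictly greater than the threshold. */
    static Optional<Double> optionalHet(Double het, double threshold) {
        return Optional.ofNullable(het).filter(h -> h > threshold);
    }

    /** Formats one row: the SNP always appears, the ?het column may stay empty. */
    static String row(String snp, Double het) {
        return snp + " | " + optionalHet(het, 0.1).map(Object::toString).orElse("");
    }

    public static void main(String[] args) {
        // Toy data (assumption): null stands for "no g:het triple".
        Map<String, Double> hetBySnp = new LinkedHashMap<>();
        hetBySnp.put("snpA", null);
        hetBySnp.put("snpB", 0.31);
        hetBySnp.put("snpC", 0.05);
        hetBySnp.forEach((snp, het) -> System.out.println(row(snp, het)));
    }
}
```

An empty het column in the result therefore means either that the SNP has no heterozygosity value or that its value is 0.1 or less; the output alone cannot tell the two cases apart.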


Print 10 differences between the Reference Assembly and the Celera Assembly



Query

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX g: <http://ontology.lindenb.org/genotypes/>

SELECT ?snp ?chrom1 ?orient1 ?start1 ?chrom2 ?orient2 ?start2
WHERE
{
?s a g:SNP .
?s g:name ?snp .

?s2 a g:MapLoc .
?s2 g:hasSNP ?s .
?s2 g:chrom ?chrom1 .
?s2 g:strand ?orient1 .
?s2 g:start ?start1 .
?s2 g:assembly <urn:assembly:Celera:36_3> .

?s3 a g:MapLoc .
?s3 g:hasSNP ?s .
?s3 g:chrom ?chrom2 .
?s3 g:strand ?orient2 .
?s3 g:start ?start2 .
?s3 g:assembly <urn:assembly:reference:36_3> .

}
LIMIT 10

Result

-----------------------------------------------------------------------------
| snp | chrom1 | orient1 | start1 | chrom2 | orient2 | start2 |
=============================================================================
| "rs7553640" | "1" | "-" | 833104 | "1" | "+" | 1751873 |
| "rs3951936" | "9" | "-" | 41330304 | "4" | "+" | 49186295 |
| "rs3951936" | "9" | "-" | 41330304 | "1" | "-" | 142233119 |
| "rs3951936" | "9" | "-" | 41330304 | "1" | "+" | 142038296 |
| "rs3951936" | "9" | "-" | 41330304 | "1" | "-" | 141781399 |
| "rs3951936" | "9" | "-" | 41330304 | "1" | "+" | 141641811 |
| "rs41319344" | "Y" | "+" | 10690990 | "Y" | "-" | 25853159 |
| "rs41319344" | "Y" | "+" | 10690990 | "Y" | "+" | 24928047 |
| "rs41319344" | "Y" | "+" | 10690990 | "1" | "-" | 241194834 |
| "rs10907183" | "1" | "-" | 1511375 | "1" | "+" | 1060980 |
-----------------------------------------------------------------------------


Create a new RDF graph of 10 SNPs having a neighbour less than 500 bp away



Query

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX g: <http://ontology.lindenb.org/genotypes/>
PREFIX fn: <http://www.w3.org/2005/xpath-functions#>

CONSTRUCT { ?snp1 g:hasNeighbour ?snp2 . }
WHERE
{
?snp1 a g:SNP .
?snp2 a g:SNP .

?s1 a g:MapLoc .
?s1 g:hasSNP ?snp1 .
?s1 g:chrom ?chrom1 .
?s1 g:strand ?orient1 .
?s1 g:start ?start1 .
?s1 g:assembly <urn:assembly:reference:36_3> .

?s2 a g:MapLoc .
?s2 g:hasSNP ?snp2 .
?s2 g:chrom ?chrom2 .
?s2 g:strand ?orient2 .
?s2 g:start ?start2 .
?s2 g:assembly <urn:assembly:reference:36_3> .

FILTER( (fn:abs(?start1 - ?start2) < 500) && ?chrom1=?chrom2 && ?snp1!=?snp2)

}
LIMIT 10

Result

@prefix : <http://ontology.lindenb.org/genotypes/> .
@prefix g: <http://ontology.lindenb.org/genotypes/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix fn: <http://www.w3.org/2005/xpath-functions#> .

<http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=7545812>
:hasNeighbour <http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=9970455> .

<http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=1043506>
:hasNeighbour <http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=12126411> .

<http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=6603793>
:hasNeighbour <http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=7548693> ;
:hasNeighbour <http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=7553066> .

<http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=10907178>
:hasNeighbour <http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=10907177> ;
:hasNeighbour <http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=11260588> ;
:hasNeighbour <http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=11260587> ;
:hasNeighbour <http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=6701114> ;
:hasNeighbour <http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=3737728> ;
:hasNeighbour <http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=9442398> .
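
The FILTER in the CONSTRUCT query above compares every pair of positions mapped on the reference assembly. The sketch below restates that condition in plain Java to make the three tests explicit. The MapLoc class, the SNP names and the positions are invented for the illustration; this is not code from the original program.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustration only: the FILTER condition of the CONSTRUCT query, in plain Java.
final class NeighbourSketch {

    /** Hypothetical stand-in for a g:MapLoc on the reference assembly. */
    static final class MapLoc {
        final String snp;
        final String chrom;
        final int start;

        MapLoc(String snp, String chrom, int start) {
            this.snp = snp;
            this.chrom = chrom;
            this.start = start;
        }
    }

    /** Same chromosome, different SNP, and distance strictly below maxDist. */
    static boolean areNeighbours(MapLoc a, MapLoc b, int maxDist) {
        return a.chrom.equals(b.chrom)
            && !a.snp.equals(b.snp)
            && Math.abs(a.start - b.start) < maxDist;
    }

    /** Returns one "x hasNeighbour y" line per ordered pair passing the test. */
    static List<String> neighbours(List<MapLoc> locs, int maxDist) {
        List<String> out = new ArrayList<>();
        for (MapLoc a : locs) {
            for (MapLoc b : locs) {
                if (areNeighbours(a, b, maxDist)) {
                    out.add(a.snp + " hasNeighbour " + b.snp);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Toy positions (assumption), not taken from the dataset.
        List<MapLoc> locs = Arrays.asList(
            new MapLoc("snpA", "1", 100),
            new MapLoc("snpB", "1", 400),
            new MapLoc("snpC", "1", 1000),
            new MapLoc("snpD", "2", 150));
        neighbours(locs, 500).forEach(System.out::println);
    }
}
```

Like the query, the nested loop is quadratic in the number of positions and reports each neighbouring pair in both directions.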



That's it!
Pierre

06 November 2009

Handling RDF Statements with Apache Velocity

This post is about using Apache Velocity (a Java-based template engine) together with the Jena RDF library. My aim was to use Velocity to handle the content of one or more RDF stores without compiling anything, just by writing a custom Velocity template. The idea was largely inspired by Egon Willighagen's posts, where the RDF was handled with a scripting engine embedded in Bioclipse. It also seems that I'm not the first to think of combining Velocity and RDF: see [here].
OK, my experimental source code for the program JenaVelocity is available here:

Describing the RDF stores


On the command line, one or more RDF datasets are described in a JSON document. In the following example the dataset is a remote file, but it could also be a persistent database, an N3 file, etc. This RDF file will also be used later to resolve some names from the bio2rdf repository, which is why I also added a table of prefix mappings. This RDF model will be inserted into the VelocityContext under the name "$store1":
[
{
"name": "store1",
"url":"http://www.lri.fr/~pietriga/foaf.rdf",
"prefix-mapping":{
"uniprot":"http://bio2rdf.org/uniprot:"
}
}
]

Example 1

In the following example, all the RDF stores are inserted in the VelocityContext as $rdfstores. The template loops over each store and builds an HTML list containing the names of all the individuals in the FOAF file described previously.
<html><body>
#set($RDF="http://www.w3.org/1999/02/22-rdf-syntax-ns#")
#set($FOAF="http://xmlns.com/foaf/0.1/")
<h1>Staff</h1>
<ul>
#foreach($store in $rdfstores)
#set($pred = ${store.model.createProperty("${FOAF}","name")})
#foreach($stmt in
${store.model.listStatements(null,${store.model.createProperty("${FOAF}","name")},null,null)})
<li>${stmt.object.string}</li>
#end
#end
</ul></body></html>

After running JenaVelocity, I got the following result:

Staff

  • Jean-Daniel Fekete
  • Chris Bizer
  • Caroline Appert
  • Ralph Swick
  • Vincent Quint
  • Jean-Yves Vion-Dury
  • Yves Guiard
  • Eric Miller
  • Renaud Blanch
  • Emmanuel Pietriga
  • Jose Kahan
  • Eric Prud'hommeaux
  • Catherine Letondal
  • Olivier Chapuis
  • Michel Beaudouin-Lafon
  • Nicolas Roussel
  • Ryan Lee
  • Wendy Mackay

Example 2


Here, I've inserted an object called $sparql in the VelocityContext. This object is used to send a SPARQL query to the bio2rdf SPARQL endpoint: the statements about resources whose rdf:type is http://bio2rdf.org/ns/uniprot:Strain are fetched and displayed in an HTML table. For each resource, we try to get a short form of its URI using our previously defined $store1. If the object of a statement is a literal, the quoted string is printed.
<html><body>
#set($RDF="http://www.w3.org/1999/02/22-rdf-syntax-ns#")
#set($FOAF="http://xmlns.com/foaf/0.1/")
<h1>Strains</h1>
<table>
#foreach($row in
$sparql.select("http://quebec.bio2rdf.org/sparql","select distinct ?s
?p ?o where { ?s a <http://bio2rdf.org/ns/uniprot:Strain> . ?s ?p ?o}
LIMIT 100
"))
<tr>
<td><a href="${row.get("s").getURI()}">${store1.shortForm(${row.get("s").getURI()})}</a></td>
<td><a href="${row.get("p").getURI()}">${store1.shortForm(${row.get("p").getURI()})}</a></td>
<td>#if(${row.get("o").isResource()})
<a href="${row.get("o").getURI()}">${store1.shortForm(${row.get("o").getURI()})}</a>
#else
<span>"$row.get("o").string"</span>
#end</td>
</tr>
#end
</table>
</body></html>

After running JenaVelocity, I got the following result:

Strains

| uniprot:Q8C7G5_5 | rdf:type | http://bio2rdf.org/ns/uniprot:Strain |
| uniprot:Q8C7G5_5 | dc:title | "C57BL/6J" |
| uniprot:Q8C7G5_8 | rdf:type | http://bio2rdf.org/ns/uniprot:Strain |
| uniprot:Q8C7G5_8 | dc:title | "FVB/N" |
| uniprot:Q8C7H1_2 | rdf:type | http://bio2rdf.org/ns/uniprot:Strain |
| uniprot:Q8C7H1_2 | dc:title | "C57BL/6J" |
| uniprot:Q8C7K6_2 | rdf:type | http://bio2rdf.org/ns/uniprot:Strain |
| uniprot:Q8C7K6_2 | dc:title | "C57BL/6J" |
| uniprot:Q8C7K6_5 | rdf:type | http://bio2rdf.org/ns/uniprot:Strain |
| uniprot:Q8C7K6_5 | dc:title | "C57BL/6" |
| uniprot:Q8C7M3_2 | rdf:type | http://bio2rdf.org/ns/uniprot:Strain |
| uniprot:Q8C7M3_2 | dc:title | "C57BL/6J" |
| uniprot:Q8C7M3_A | rdf:type | http://bio2rdf.org/ns/uniprot:Strain |
| uniprot:Q8C7M3_A | dc:title | "C57BL/6" |
| uniprot:Q8C7N7_2 | rdf:type | http://bio2rdf.org/ns/uniprot:Strain |
| uniprot:Q8C7N7_2 | dc:title | "C57BL/6J" |
| uniprot:Q8C7N7_3 | rdf:type | http://bio2rdf.org/ns/uniprot:Strain |
| uniprot:Q8C7N7_3 | dc:title | "NOD" |
| uniprot:Q8C7N7_7 | rdf:type | http://bio2rdf.org/ns/uniprot:Strain |
| uniprot:Q8C7N7_7 | dc:title | "C57BL/6" |
| uniprot:Q8C7Q4_3 | rdf:type | http://bio2rdf.org/ns/uniprot:Strain |
| uniprot:Q8C7Q4_3 | dc:title | "C57BL/6J" |
| uniprot:Q8C7R4_2 | rdf:type | http://bio2rdf.org/ns/uniprot:Strain |
| uniprot:Q8C7R4_2 | dc:title | "C57BL/6J" |
| uniprot:Q8C7R4_5 | rdf:type | http://bio2rdf.org/ns/uniprot:Strain |
| uniprot:Q8C7R4_5 | dc:title | "C57BL/6" |
| uniprot:Q8C7U1_2 | rdf:type | http://bio2rdf.org/ns/uniprot:Strain |
| uniprot:Q8C7U1_2 | dc:title | "C57BL/6J" |
| uniprot:Q8C7U1_3 | rdf:type | http://bio2rdf.org/ns/uniprot:Strain |
| uniprot:Q8C7U1_3 | dc:title | "NOD" |
| uniprot:Q8C7U1_7 | rdf:type | http://bio2rdf.org/ns/uniprot:Strain |
| uniprot:Q8C7U1_7 | dc:title | "FVB/N" |
| uniprot:Q8C7U7_4 | rdf:type | http://bio2rdf.org/ns/uniprot:Strain |
| uniprot:Q8C7U7_4 | dc:title | "C57BL/6J" |
| uniprot:Q8C7U7_5 | rdf:type | http://bio2rdf.org/ns/uniprot:Strain |
| uniprot:Q8C7U7_5 | dc:title | "NOD" |
| uniprot:Q8C7V3_4 | rdf:type | http://bio2rdf.org/ns/uniprot:Strain |
| uniprot:Q8C7V3_4 | dc:title | "C57BL/6J" |
| uniprot:Q8C7V3_7 | rdf:type | http://bio2rdf.org/ns/uniprot:Strain |
| uniprot:Q8C7V3_7 | dc:title | "C57BL/6" |
| uniprot:Q8C7V8_2 | rdf:type | http://bio2rdf.org/ns/uniprot:Strain |
| uniprot:Q8C7V8_2 | dc:title | "C57BL/6J" |
| uniprot:Q8C7V8_3 | rdf:type | http://bio2rdf.org/ns/uniprot:Strain |
| uniprot:Q8C7V8_3 | dc:title | "NOD" |
| uniprot:Q8C7V8_8 | rdf:type | http://bio2rdf.org/ns/uniprot:Strain |
| uniprot:Q8C7V8_8 | dc:title | "C57BL/6" |
| uniprot:Q8C7W7_2 | rdf:type | http://bio2rdf.org/ns/uniprot:Strain |
| uniprot:Q8C7W7_2 | dc:title | "C57BL/6J" |
| uniprot:Q8C7W7_3 | rdf:type | http://bio2rdf.org/ns/uniprot:Strain |
| uniprot:Q8C7W7_3 | dc:title | "NOD" |
| uniprot:Q8C7W7_8 | rdf:type | http://bio2rdf.org/ns/uniprot:Strain |
| uniprot:Q8C7W7_8 | dc:title | "Czech II" |
| (...) | (...) | (...) |
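
In the result above, the subjects are shortened to uniprot:... while the object http://bio2rdf.org/ns/uniprot:Strain is printed in full: its URI does not start with the namespace http://bio2rdf.org/uniprot: declared in the prefix mapping of $store1. The plain-Java sketch below illustrates that kind of prefix substitution. It is a simplified stand-in written for this explanation, not the code behind $store1.shortForm, and the rdf entry is an assumption (such a prefix could, for instance, be declared in the FOAF file itself).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustration only: a minimal prefix table with a shortForm() lookup.
// Class and method names are hypothetical.
final class ShortFormSketch {
    private final Map<String, String> prefixToNamespace = new LinkedHashMap<>();

    /** Registers a prefix for a namespace URI. */
    void setNsPrefix(String prefix, String namespace) {
        prefixToNamespace.put(prefix, namespace);
    }

    /** Returns prefix:localName for the first matching namespace, else the URI unchanged. */
    String shortForm(String uri) {
        for (Map.Entry<String, String> e : prefixToNamespace.entrySet()) {
            if (uri.startsWith(e.getValue())) {
                return e.getKey() + ":" + uri.substring(e.getValue().length());
            }
        }
        return uri;
    }

    public static void main(String[] args) {
        ShortFormSketch store = new ShortFormSketch();
        store.setNsPrefix("uniprot", "http://bio2rdf.org/uniprot:");
        // Assumption: the rdf prefix is known to the store as well.
        store.setNsPrefix("rdf", "http://www.w3.org/1999/02/22-rdf-syntax-ns#");

        System.out.println(store.shortForm("http://bio2rdf.org/uniprot:Q8C7G5_5"));
        System.out.println(store.shortForm("http://www.w3.org/1999/02/22-rdf-syntax-ns#type"));
        System.out.println(store.shortForm("http://bio2rdf.org/ns/uniprot:Strain"));
    }
}
```

Adding a second entry for the http://bio2rdf.org/ns/uniprot: namespace to the "prefix-mapping" table of the JSON description would presumably shorten the object column too.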


Conclusion: Velocity templates make it possible to handle and render RDF data without compiling anything. However, prior knowledge of the Jena API is required.

That's it.

Pierre