Creating a virtual RDF graph describing a set of OpenOffice spreadsheets with Apache Jena and Fuseki
In the current post, I will use the Jena API for RDF to implement a virtual RDF graph describing the content of a set of openoffice/libreoffice spreasheets.
The entry content.xml is a XML file describing the tables in the spreadsheet:
Implementing a simple virtual RDF graph with Jena is easy: you simply have to extend the class com.hp.hpl.jena.graph.impl.GraphBase and only implement the method graphBaseFind which returns all the RDF Triples matching a TripleMatch.
Invoke:
Result:
The line:
That's it,
Pierre
Fact: An openoffice file (*.ods) is a Zip file
An openoffice file is nothing but a zip file:$ unzip -t jeter.ods
Archive: jeter.ods
testing: mimetype OK
testing: meta.xml OK
testing: settings.xml OK
testing: content.xml OK
testing: Thumbnails/thumbnail.png OK
testing: Configurations2/images/Bitmaps/ OK
testing: Configurations2/popupmenu/ OK
testing: Configurations2/toolpanel/ OK
testing: Configurations2/statusbar/ OK
testing: Configurations2/progressbar/ OK
testing: Configurations2/toolbar/ OK
testing: Configurations2/menubar/ OK
testing: Configurations2/accelerator/current.xml OK
testing: Configurations2/floater/ OK
testing: styles.xml OK
testing: META-INF/manifest.xml OK
No errors detected in compressed data of jeter.ods.
The entry content.xml is a XML file describing the tables in the spreadsheet:
$ unzip -c jeter.ods content.xml |\ grep -v Archive |\ grep -v inflating | xmllint --format - |\ head -n 20 <?xml version="1.0" encoding="UTF-8"?> <office:document-content xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0" xmlns:style="urn:oasis:names:tc:opendocument:xmlns:style:1.0" xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0" xmlns:table="urn:oasis:names:tc:opendocument:xmlns:table:1.0" xmlns:draw="urn:oasis:names:tc:opendocument:xmlns:drawing:1.0" xmlns:fo="urn:oasis:names:tc:opendocument:xmlns:xsl-fo-compatible:1.0" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:meta="urn:oasis:names:tc:opendocument:xmlns:meta:1.0" xmlns:number="urn:oasis:names:tc:opendocument:xmlns:datastyle:1.0" xmlns:presentation="urn:oasis:names:tc:opendocument:xmlns:presentation:1.0" xmlns:svg="urn:oasis:names:tc:opendocument:xmlns:svg-compatible:1.0" xmlns:chart="urn:oasis:names:tc:opendocument:xmlns:chart:1.0" xmlns:dr3d="urn:oasis:names:tc:opendocument:xmlns:dr3d:1.0" xmlns:math="http://www.w3.org/1998/Math/MathML" xmlns:form="urn:oasis:names:tc:opendocument:xmlns:form:1.0" xmlns:script="urn:oasis:names:tc:opendocument:xmlns:script:1.0" xmlns:ooo="http://openoffice.org/2004/office" xmlns:ooow="http://openoffice.org/2004/writer" xmlns:oooc="http://openoffice.org/2004/calc" xmlns:dom="http://www.w3.org/2001/xml-events" xmlns:xforms="http://www.w3.org/2002/xforms" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:rpt="http://openoffice.org/2005/report" xmlns:of="urn:oasis:names:tc:opendocument:xmlns:of:1.2" xmlns:xhtml="http://www.w3.org/1999/xhtml" xmlns:grddl="http://www.w3.org/2003/g/data-view#" xmlns:tableooo="http://openoffice.org/2009/table" xmlns:field="urn:openoffice:names:experimental:ooo-ms-interop:xmlns:field:1.0" xmlns:formx="urn:openoffice:names:experimental:ooxml-odf-interop:xmlns:form:1.0" xmlns:css3t="http://www.w3.org/TR/css3-text/" office:version="1.2"> <office:scripts/> <office:font-face-decls> <style:font-face style:name="Liberation Sans" svg:font-family="'Liberation Sans'" style:font-family-generic="swiss" style:font-pitch="variable"/> <style:font-face style:name="DejaVu Sans" svg:font-family="'DejaVu Sans'" style:font-family-generic="system" style:font-pitch="variable"/> <style:font-face style:name="Lohit Hindi" svg:font-family="'Lohit Hindi'" style:font-family-generic="system" style:font-pitch="variable"/> <style:font-face style:name="WenQuanYi Micro Hei" svg:font-family="'WenQuanYi Micro Hei'" style:font-family-generic="system" style:font-pitch="variable"/> </office:font-face-decls> <office:automatic-styles> <style:style style:name="co1" style:family="table-column"> <style:table-column-properties fo:break-before="auto" style:column-width="0.889in"/> </style:style> <style:style style:name="ro2" style:family="table-row"> <style:table-row-properties style:row-height="0.178in" fo:break-before="auto" style:use-optimal-row-height="true"/> </style:style> <style:style style:name="ro3" style:family="table-row"> <style:table-row-properties style:row-height="0.1681in" fo:break-before="auto" style:use-optimal-row-height="true"/> </style:style> <style:style style:name="ta1" style:family="table" style:master-page-name="Default">
Fact: Implementing a simple virtual RDF graph with Jena is easy
By virtual I mean that there is no RDFStore, the triples are created on the fly.Implementing a simple virtual RDF graph with Jena is easy: you simply have to extend the class com.hp.hpl.jena.graph.impl.GraphBase and only implement the method graphBaseFind which returns all the RDF Triples matching a TripleMatch.
(...) @Override protected ExtendedIterator<Triple> graphBaseFind(TripleMatch matcher) { return ...; } (...)
The code
My implementation of a RDFGraph for a set of OpenOffice Calc is not effective but it works fine: for each call of graphBaseFind, it creates an "Iterator<Triple>" scanning each content.xml entry of each openoffice file. This iterator creates some new Triples, add them to a list of Triples that will be filtered by the TripleMatcher.Compilation
the Makefile:CP=...#path to the jars of JENA/ARQ/etc... e.g: =`find ${ARQ} -name "*.jar" | | tr "\n" ":"` .PHONY: all all: javac -cp ${CP} -sourcepath src src/oocalc/OpenOfficeCalcGraph.java jar cvf dist/openoffice2rdf.jar -C src .
Querying using sparql
Now that the Graph has been implemented and compiled, one can query it using ARQ, the sparql engine of Jena:The spreadsheet
I've created the following spreadsheet and saved it in a file named "jeter.ods":CHROM | START | END | NAME |
chr1 | 100 | 200 | rs654 |
chr1 | 150 | 250 | rs264 |
chr1 | 200 | 300 | rs610 |
chr1 | 250 | 350 | rs929 |
chr1 | 300 | 400 | rs408 |
chr1 | 350 | 450 | rs346 |
chr1 | 400 | 500 | rs430 |
chr1 | 450 | 550 | rs735 |
chr1 | 500 | 600 | rs575 |
chr1 | 550 | 650 | rs891 |
chr1 | 600 | 700 | rs627 |
chr1 | 650 | 750 | rs650 |
chr1 | 700 | 800 | rs715 |
chr1 | 750 | 850 | rs467 |
chr1 | 800 | 900 | rs882 |
chr1 | 850 | 950 | rs301 |
chr1 | 900 | 1000 | rs643 |
chr1 | 950 | 1050 | rs246 |
chr1 | 1000 | 1100 | rs178 |
chr1 | 1050 | 1150 | rs928 |
chr1 | 1100 | 1200 | rs213 |
The sparql query
The following SPARQL returns the informations about the cells in the 3rd row of the spreadsheet:Invoke:
java -cp `find /home/lindenb/.ivy2/cache -name "*.jar" | tr "\n" ":"`:dist/openoffice2rdf.jar \ oocalc.OpenOfficeCalcGraph test.sparql /home/lindenb/jeter.ods
Result:
----------------------------------------------------------------------------------------------------------------------------------- | s | p | o | =================================================================================================================================== | <file:/home/lindenb/jeter.ods/t1/y3/x1> | office:table | <file:/home/lindenb/jeter.ods/t1> | | <file:/home/lindenb/jeter.ods/t1/y3/x1> | <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> | office:Cell | | <file:/home/lindenb/jeter.ods/t1/y3/x1> | office:X | "1"^^xsd:int | | <file:/home/lindenb/jeter.ods/t1/y3/x1> | office:Y | "3"^^xsd:int | | <file:/home/lindenb/jeter.ods/t1/y3/x1> | office:value | "chr1" | | <file:/home/lindenb/jeter.ods/t1/y3/x2> | office:table | <file:/home/lindenb/jeter.ods/t1> | | <file:/home/lindenb/jeter.ods/t1/y3/x2> | <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> | office:Cell | | <file:/home/lindenb/jeter.ods/t1/y3/x2> | office:X | "2"^^xsd:int | | <file:/home/lindenb/jeter.ods/t1/y3/x2> | office:Y | "3"^^xsd:int | | <file:/home/lindenb/jeter.ods/t1/y3/x2> | office:value | "150"^^xsd:float | | <file:/home/lindenb/jeter.ods/t1/y3/x3> | office:table | <file:/home/lindenb/jeter.ods/t1> | | <file:/home/lindenb/jeter.ods/t1/y3/x3> | <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> | office:Cell | | <file:/home/lindenb/jeter.ods/t1/y3/x3> | office:X | "3"^^xsd:int | | <file:/home/lindenb/jeter.ods/t1/y3/x3> | office:Y | "3"^^xsd:int | | <file:/home/lindenb/jeter.ods/t1/y3/x3> | office:value | "250"^^xsd:float | | <file:/home/lindenb/jeter.ods/t1/y3/x4> | office:table | <file:/home/lindenb/jeter.ods/t1> | | <file:/home/lindenb/jeter.ods/t1/y3/x4> | <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> | office:Cell | | <file:/home/lindenb/jeter.ods/t1/y3/x4> | office:X | "4"^^xsd:int | | <file:/home/lindenb/jeter.ods/t1/y3/x4> | office:Y | "3"^^xsd:int | | <file:/home/lindenb/jeter.ods/t1/y3/x4> | office:value | "rs264" | -----------------------------------------------------------------------------------------------------------------------------------
Serving the OpenOffice spreadsheets as RDF over HTTP
Fuseki is a SPARQL server. It provides REST-style SPARQL HTTP Update, SPARQL Query, and SPARQL Update using the SPARQL protocol over HTTP. We're going to deploy the OpenOfficeCalcGraph in Fuseki to query a set of OpenOffice files.Download an install Fuseki
wget https://repository.apache.org/content/repositories/releases/org/apache/jena/jena-fuseki/0.2.5/jena-fuseki-0.2.5-distribution.tar.gz tar xfz jena-fuseki-0.2.5-distribution.tar.gz rm jena-fuseki-0.2.5-distribution.tar.gz
Tell Fuseki about our OpenOfficeCalcGraph
We need to create a config file for Fuseki. That was the most complicated part as the process is not clearly documented:The line:
[] ja:loadClass "oocalc.OpenOfficeCalcGraph" .loads the class oocalc.OpenOfficeCalcGraph. The class OpenOfficeCalcGraph contains a static initialisation method:
(...) static { init() ; } private static void init() { (...)In this static method, a Jena Assembler for OpenOfficeCalcGraph is registered under the resource named: "http://rdf.lindenb.org/build".
public static OpenOfficeAssembler assembler = new OpenOfficeAssembler(); (...) private static final Resource buildRsrc=ResourceFactory.createResource(NS+"build"); (...) Assembler.general.implementWith(buildRsrc,assembler); (...)An Assembler configures a Graph from a RDF config file. In our example, the config contains the path to the OpenOffice spreadsheets:
<#ooservice> rdf:type openoffice:build ; openoffice:file "/home/lindenb/jeter.ods" ; openoffice:file "/home/lindenb/jeter2.ods" ; .This config is read in the Assembler:
public static class OpenOfficeAssembler extends AssemblerBase implements Assembler { @Override public Object open( Assembler a, Resource root, Mode mode ) { Property fileRsrc=ResourceFactory.createProperty(NS+"file"); //read the configuration an get the files List<File> files=new ArrayList<File>(); StmtIterator iter=root.listProperties(fileRsrc); (...)
Start Fuseki with the config file:
$ cd jena-fuseki-0.2.5 $ java -cp fuseki-server.jar:/path/to/openoffice2rdf.jar org.apache.jena.fuseki.FusekiCmd \ --debug -v --config /path/to/openoffice.ttl 14:11:50 INFO Config :: Configuration file: ../openoffice.ttl 14:11:50 INFO Config :: Service: :service1 14:11:50 INFO Config :: name = ds 14:11:50 INFO Config :: query = /ds/query 14:11:50 INFO Config :: query = /ds/sparql 14:11:50 INFO Config :: update = /ds/update 14:11:50 INFO Config :: upload = /ds/upload 14:11:50 INFO Config :: graphStore(RW) = /ds/data 14:11:50 INFO Config :: graphStore(R) = /ds/get 14:11:50 INFO ooffice2rdf :: Calling OpenOfficeCalcGraph init 14:11:50 INFO Config :: Service: OpenOffice Service (R) 14:11:50 INFO Config :: name = openoffice 14:11:50 INFO Config :: query = /openoffice/sparql 14:11:50 INFO Config :: query = /openoffice/query 14:11:50 INFO Config :: update = /openoffice/update 14:11:50 INFO Config :: graphStore(R) = /openoffice/get 14:11:50 INFO Config :: graphStore(R) = /openoffice/data 14:11:51 INFO Server :: Dataset path = /ds 14:11:51 INFO Server :: Dataset path = /openoffice 14:11:51 INFO Server :: Fuseki 0.2.5 2012-10-20T17:03:29+0100 14:11:51 INFO Server :: Started 2012/11/13 14:11:51 CET on port 3030Open your browser at http://localhost:3030, select the control panel at http://localhost:3030/control-panel.tpl and select /openoffice:
Fuseki Control Panel
The following form is displayed:SPARQL Query
You can now copy, paste and run the previous sparql query:-------------------------------------------------------------------------------------------------------------------------------------------------- | s | p | o | ================================================================================================================================================== | <file:/home/lindenb/jeter.ods/t1/y3/x1> | <http://rdf.lindenb.org/table> | <file:/home/lindenb/jeter.ods/t1> | | <file:/home/lindenb/jeter.ods/t1/y3/x1> | <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> | <http://rdf.lindenb.org/Cell> | | <file:/home/lindenb/jeter.ods/t1/y3/x1> | <http://rdf.lindenb.org/X> | "1"^^<http://www.w3.org/2001/XMLSchema#int> | | <file:/home/lindenb/jeter.ods/t1/y3/x1> | <http://rdf.lindenb.org/Y> | "3"^^<http://www.w3.org/2001/XMLSchema#int> | | <file:/home/lindenb/jeter.ods/t1/y3/x1> | <http://rdf.lindenb.org/value> | "chr1" | | <file:/home/lindenb/jeter.ods/t1/y3/x2> | <http://rdf.lindenb.org/table> | <file:/home/lindenb/jeter.ods/t1> | | <file:/home/lindenb/jeter.ods/t1/y3/x2> | <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> | <http://rdf.lindenb.org/Cell> | | <file:/home/lindenb/jeter.ods/t1/y3/x2> | <http://rdf.lindenb.org/X> | "2"^^<http://www.w3.org/2001/XMLSchema#int> | | <file:/home/lindenb/jeter.ods/t1/y3/x2> | <http://rdf.lindenb.org/Y> | "3"^^<http://www.w3.org/2001/XMLSchema#int> | | <file:/home/lindenb/jeter.ods/t1/y3/x2> | <http://rdf.lindenb.org/value> | "150"^^<http://www.w3.org/2001/XMLSchema#float> | | <file:/home/lindenb/jeter.ods/t1/y3/x3> | <http://rdf.lindenb.org/table> | <file:/home/lindenb/jeter.ods/t1> | | <file:/home/lindenb/jeter.ods/t1/y3/x3> | <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> | <http://rdf.lindenb.org/Cell> | | <file:/home/lindenb/jeter.ods/t1/y3/x3> | <http://rdf.lindenb.org/X> | "3"^^<http://www.w3.org/2001/XMLSchema#int> | | <file:/home/lindenb/jeter.ods/t1/y3/x3> | <http://rdf.lindenb.org/Y> | "3"^^<http://www.w3.org/2001/XMLSchema#int> | | <file:/home/lindenb/jeter.ods/t1/y3/x3> | <http://rdf.lindenb.org/value> | "250"^^<http://www.w3.org/2001/XMLSchema#float> | | <file:/home/lindenb/jeter.ods/t1/y3/x4> | <http://rdf.lindenb.org/table> | <file:/home/lindenb/jeter.ods/t1> | | <file:/home/lindenb/jeter.ods/t1/y3/x4> | <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> | <http://rdf.lindenb.org/Cell> | | <file:/home/lindenb/jeter.ods/t1/y3/x4> | <http://rdf.lindenb.org/X> | "4"^^<http://www.w3.org/2001/XMLSchema#int> | | <file:/home/lindenb/jeter.ods/t1/y3/x4> | <http://rdf.lindenb.org/Y> | "3"^^<http://www.w3.org/2001/XMLSchema#int> | | <file:/home/lindenb/jeter.ods/t1/y3/x4> | <http://rdf.lindenb.org/value> | "rs264" | | <file:/home/lindenb/jeter2.ods/t1/y3/x1> | <http://rdf.lindenb.org/table> | <file:/home/lindenb/jeter2.ods/t1> | | <file:/home/lindenb/jeter2.ods/t1/y3/x1> | <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> | <http://rdf.lindenb.org/Cell> | | <file:/home/lindenb/jeter2.ods/t1/y3/x1> | <http://rdf.lindenb.org/X> | "1"^^<http://www.w3.org/2001/XMLSchema#int> | | <file:/home/lindenb/jeter2.ods/t1/y3/x1> | <http://rdf.lindenb.org/Y> | "3"^^<http://www.w3.org/2001/XMLSchema#int> | | <file:/home/lindenb/jeter2.ods/t1/y3/x1> | <http://rdf.lindenb.org/value> | "1"^^<http://www.w3.org/2001/XMLSchema#float> | | <file:/home/lindenb/jeter2.ods/t1/y3/x2> | <http://rdf.lindenb.org/table> | <file:/home/lindenb/jeter2.od | <file:/home/lindenb/jeter2.ods/t1/y3/x2> | <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> | <http://rdf.lindenb.org/Cell> | | <file:/home/lindenb/jeter2.ods/t1/y3/x2> | <http://rdf.lindenb.org/X> | "2"^^<http://www.w3.org/2001/XMLSchema#int> | | <file:/home/lindenb/jeter2.ods/t1/y3/x2> | <http://rdf.lindenb.org/Y> | "3"^^<http://www.w3.org/2001/XMLSchema#int> | | <file:/home/lindenb/jeter2.ods/t1/y3/x2> | <http://rdf.lindenb.org/value> | "2"^^<http://www.w3.org/2001/XMLSchema#float> | | <file:/home/lindenb/jeter2.ods/t1/y3/x3> | <http://rdf.lindenb.org/table> | <file:/home/lindenb/jeter2.ods/t1> | | <file:/home/lindenb/jeter2.ods/t1/y3/x3> | <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> | <http://rdf.lindenb.org/Cell> | | <file:/home/lindenb/jeter2.ods/t1/y3/x3> | <http://rdf.lindenb.org/X> | "3"^^<http://www.w3.org/2001/XMLSchema#int> | | <file:/home/lindenb/jeter2.ods/t1/y3/x3> | <http://rdf.lindenb.org/Y> | "3"^^<http://www.w3.org/2001/XMLSchema#int> | | <file:/home/lindenb/jeter2.ods/t1/y3/x3> | <http://rdf.lindenb.org/value> | "3"^^<http://www.w3.org/2001/XMLSchema#float> | | <file:/home/lindenb/jeter2.ods/t1/y3/x4> | <http://rdf.lindenb.org/table> | <file:/home/lindenb/jeter2.ods/t1> | | <file:/home/lindenb/jeter2.ods/t1/y3/x4> | <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> | <http://rdf.lindenb.org/Cell> | | <file:/home/lindenb/jeter2.ods/t1/y3/x4> | <http://rdf.lindenb.org/X> | "4"^^<http://www.w3.org/2001/XMLSchema#int> | | <file:/home/lindenb/jeter2.ods/t1/y3/x4> | <http://rdf.lindenb.org/Y> | "3"^^<http://www.w3.org/2001/XMLSchema#int> | | <file:/home/lindenb/jeter2.ods/t1/y3/x4> | <http://rdf.lindenb.org/value> | "4"^^<http://www.w3.org/2001/XMLSchema#float> | --------------------------------------------------------------------------------------------------------------------------------------------------
That's it,
Pierre
No comments:
Post a Comment