Creating a virtual RDF graph describing a set of OpenOffice spreadsheets with Apache Jena and Fuseki

Fact: An openoffice file (*.ods) is a Zip file
An OpenOffice file is nothing but a zip file:
$ unzip -t jeter.ods
Archive: jeter.ods
testing: mimetype OK
testing: meta.xml OK
testing: settings.xml OK
testing: content.xml OK
testing: Thumbnails/thumbnail.png OK
testing: Configurations2/images/Bitmaps/ OK
testing: Configurations2/popupmenu/ OK
testing: Configurations2/toolpanel/ OK
testing: Configurations2/statusbar/ OK
testing: Configurations2/progressbar/ OK
testing: Configurations2/toolbar/ OK
testing: Configurations2/menubar/ OK
testing: Configurations2/accelerator/current.xml OK
testing: Configurations2/floater/ OK
testing: styles.xml OK
testing: META-INF/manifest.xml OK
No errors detected in compressed data of jeter.ods.
The entry content.xml is an XML file describing the tables in the spreadsheet:
$ unzip -c jeter.ods content.xml |\
  grep -v Archive |\
  grep -v inflating | xmllint --format - |\
  head -n 20
<?xml version="1.0" encoding="UTF-8"?>
<office:document-content xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0" xmlns:style="urn:oasis:names:tc:opendocument:xmlns:style:1.0" xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0" xmlns:table="urn:oasis:names:tc:opendocument:xmlns:table:1.0" xmlns:draw="urn:oasis:names:tc:opendocument:xmlns:drawing:1.0" xmlns:fo="urn:oasis:names:tc:opendocument:xmlns:xsl-fo-compatible:1.0" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:meta="urn:oasis:names:tc:opendocument:xmlns:meta:1.0" xmlns:number="urn:oasis:names:tc:opendocument:xmlns:datastyle:1.0" xmlns:presentation="urn:oasis:names:tc:opendocument:xmlns:presentation:1.0" xmlns:svg="urn:oasis:names:tc:opendocument:xmlns:svg-compatible:1.0" xmlns:chart="urn:oasis:names:tc:opendocument:xmlns:chart:1.0" xmlns:dr3d="urn:oasis:names:tc:opendocument:xmlns:dr3d:1.0" xmlns:math="http://www.w3.org/1998/Math/MathML" xmlns:form="urn:oasis:names:tc:opendocument:xmlns:form:1.0" xmlns:script="urn:oasis:names:tc:opendocument:xmlns:script:1.0" xmlns:ooo="http://openoffice.org/2004/office" xmlns:ooow="http://openoffice.org/2004/writer" xmlns:oooc="http://openoffice.org/2004/calc" xmlns:dom="http://www.w3.org/2001/xml-events" xmlns:xforms="http://www.w3.org/2002/xforms" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:rpt="http://openoffice.org/2005/report" xmlns:of="urn:oasis:names:tc:opendocument:xmlns:of:1.2" xmlns:xhtml="http://www.w3.org/1999/xhtml" xmlns:grddl="http://www.w3.org/2003/g/data-view#" xmlns:tableooo="http://openoffice.org/2009/table" xmlns:field="urn:openoffice:names:experimental:ooo-ms-interop:xmlns:field:1.0" xmlns:formx="urn:openoffice:names:experimental:ooxml-odf-interop:xmlns:form:1.0" xmlns:css3t="http://www.w3.org/TR/css3-text/" office:version="1.2">
  <office:scripts/>
  <office:font-face-decls>
    <style:font-face style:name="Liberation Sans" svg:font-family="'Liberation Sans'" style:font-family-generic="swiss" style:font-pitch="variable"/>
    <style:font-face style:name="DejaVu Sans" svg:font-family="'DejaVu Sans'" style:font-family-generic="system" style:font-pitch="variable"/>
    <style:font-face style:name="Lohit Hindi" svg:font-family="'Lohit Hindi'" style:font-family-generic="system" style:font-pitch="variable"/>
    <style:font-face style:name="WenQuanYi Micro Hei" svg:font-family="'WenQuanYi Micro Hei'" style:font-family-generic="system" style:font-pitch="variable"/>
  </office:font-face-decls>
  <office:automatic-styles>
    <style:style style:name="co1" style:family="table-column">
      <style:table-column-properties fo:break-before="auto" style:column-width="0.889in"/>
    </style:style>
    <style:style style:name="ro2" style:family="table-row">
      <style:table-row-properties style:row-height="0.178in" fo:break-before="auto" style:use-optimal-row-height="true"/>
    </style:style>
    <style:style style:name="ro3" style:family="table-row">
      <style:table-row-properties style:row-height="0.1681in" fo:break-before="auto" style:use-optimal-row-height="true"/>
    </style:style>
    <style:style style:name="ta1" style:family="table" style:master-page-name="Default">
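The same lookup can be done from Java with the standard library alone. The sketch below is only an illustration: the class name OdsPeek and its helper fakeOds are made up for this example, and it reads the archive from memory with ZipInputStream, whereas the implementation further down opens the file with ZipFile.getEntry("content.xml"). It counts the table:table-row elements found in the content.xml entry.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical helper: counts elements in the content.xml entry of an .ods archive. */
final class OdsPeek {
    static final String TABLE_NS = "urn:oasis:names:tc:opendocument:xmlns:table:1.0";

    /** Scans a zip stream for "content.xml" and counts the elements named {ns}localName. */
    static int countElements(InputStream zipBytes, String ns, String localName) {
        try (ZipInputStream zin = new ZipInputStream(zipBytes)) {
            for (ZipEntry e = zin.getNextEntry(); e != null; e = zin.getNextEntry()) {
                if (!e.getName().equals("content.xml")) continue;
                XMLStreamReader r = XMLInputFactory.newInstance().createXMLStreamReader(zin);
                int n = 0;
                while (r.hasNext()) {
                    if (r.next() == XMLStreamConstants.START_ELEMENT
                            && ns.equals(r.getNamespaceURI())
                            && localName.equals(r.getLocalName())) n++;
                }
                r.close();
                return n;
            }
            throw new IllegalArgumentException("no content.xml entry");
        } catch (IOException | XMLStreamException err) {
            throw new RuntimeException(err);
        }
    }

    /** Builds a tiny in-memory archive that stands in for a real .ods file. */
    static byte[] fakeOds(String contentXml) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ZipOutputStream zout = new ZipOutputStream(bytes)) {
                zout.putNextEntry(new ZipEntry("content.xml"));
                zout.write(contentXml.getBytes(StandardCharsets.UTF_8));
                zout.closeEntry();
            }
            return bytes.toByteArray();
        } catch (IOException err) {
            throw new RuntimeException(err);
        }
    }

    public static void main(String[] args) {
        String xml = "<d xmlns:table='" + TABLE_NS + "'><table:table table:name='S1'>"
                + "<table:table-row/><table:table-row/></table:table></d>";
        int rows = countElements(new ByteArrayInputStream(fakeOds(xml)), TABLE_NS, "table-row");
        System.out.println(rows + " rows"); // prints "2 rows"
    }
}
```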
Fact: Implementing a simple virtual RDF graph with Jena is easy
By virtual, I mean that there is no RDF store: the triples are created on the fly. To implement a simple virtual RDF graph with Jena, extend the class com.hp.hpl.jena.graph.impl.GraphBase and implement a single method, graphBaseFind, which returns all the RDF triples matching a TripleMatch.
(...)
@Override
protected ExtendedIterator<Triple> graphBaseFind(TripleMatch matcher)
	{
	return ...;
	}
(...)
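To make the idea of "matching" concrete, here is a minimal sketch that does not use Jena at all. MiniTriple is a made-up stand-in for a triple and for a triple pattern, where a null field plays the role of a wildcard; the real TripleMatch and Triple classes are richer than this.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Simplified stand-in for a triple and a triple pattern (not the Jena classes). */
final class MiniTriple {
    final String s, p, o;

    MiniTriple(String s, String p, String o) { this.s = s; this.p = p; this.o = o; }

    /** A null field in the pattern acts as a wildcard. */
    boolean matches(MiniTriple pattern) {
        return (pattern.s == null || pattern.s.equals(s))
            && (pattern.p == null || pattern.p.equals(p))
            && (pattern.o == null || pattern.o.equals(o));
    }

    /** What a find() does conceptually: generate every triple, keep the matching ones. */
    static List<MiniTriple> find(Iterable<MiniTriple> generated, MiniTriple pattern) {
        List<MiniTriple> hits = new ArrayList<>();
        for (MiniTriple t : generated) if (t.matches(pattern)) hits.add(t);
        return hits;
    }

    @Override public String toString() { return s + " " + p + " " + o; }

    public static void main(String[] args) {
        List<MiniTriple> all = Arrays.asList(
            new MiniTriple("cell1", "rdf:type", "Cell"),
            new MiniTriple("cell1", "Y", "3"),
            new MiniTriple("cell2", "Y", "4"));
        // pattern (?s, Y, "3")
        for (MiniTriple t : find(all, new MiniTriple(null, "Y", "3"))) System.out.println(t);
    }
}
```

Running it prints the single triple "cell1 Y 3". A SPARQL engine answers a whole query by issuing several such pattern lookups against the graph.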
The code
My implementation of an RDF graph for a set of OpenOffice Calc files is not efficient, but it works: on each call to graphBaseFind, it creates an Iterator<Triple> that scans the content.xml entry of each OpenOffice file. This iterator creates new triples and adds them to a buffer; the buffered triples are then filtered against the TripleMatch.
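The buffering pattern can be shown in isolation. The sketch below is a simplification with invented names (BatchIterator, drain): each "batch" stands for the triples produced by one XML event, and hasNext() keeps pulling batches until the buffer holds something or the source is exhausted.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/** Iterator that refills a small buffer from successive batches, one batch at a time. */
final class BatchIterator<T> implements Iterator<T> {
    private final Iterator<List<T>> batches;            // e.g. one batch per XML event
    private final Deque<T> buffer = new ArrayDeque<>();

    BatchIterator(Iterator<List<T>> batches) { this.batches = batches; }

    @Override public boolean hasNext() {
        // Pull batches until something is buffered or the source is exhausted.
        while (buffer.isEmpty() && batches.hasNext()) buffer.addAll(batches.next());
        return !buffer.isEmpty();
    }

    @Override public T next() {
        if (!hasNext()) throw new NoSuchElementException();
        return buffer.removeFirst();
    }

    /** Convenience for demos: drains an iterator into a list. */
    static <T> List<T> drain(Iterator<T> it) {
        List<T> out = new ArrayList<>();
        while (it.hasNext()) out.add(it.next());
        return out;
    }

    public static void main(String[] args) {
        List<List<String>> source = Arrays.asList(
            Arrays.asList("file-triple-1", "file-triple-2"),
            Arrays.<String>asList(),                     // an event that yields nothing
            Arrays.asList("cell-triple-1"));
        // prints [file-triple-1, file-triple-2, cell-triple-1]
        System.out.println(drain(new BatchIterator<String>(source.iterator())));
    }
}
```

The author's complete implementation follows.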
/**
 * Author: Pierre Lindenbaum PhD
 * plindenbaum@yahoo.fr
 * Date: 2012-11
 * Motivation: RDFGraph from openoffice calc files
 *
 */
package oocalc;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.namespace.QName;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.Attribute;
import javax.xml.stream.events.EndElement;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;
import com.hp.hpl.jena.assembler.assemblers.AssemblerBase;
import com.hp.hpl.jena.assembler.Assembler;
import com.hp.hpl.jena.sparql.core.assembler.AssemblerUtils;
import com.hp.hpl.jena.assembler.Mode;
import com.hp.hpl.jena.datatypes.RDFDatatype;
import com.hp.hpl.jena.datatypes.xsd.XSDDatatype;
import com.hp.hpl.jena.rdf.model.Resource;
import com.hp.hpl.jena.rdf.model.Statement;
import com.hp.hpl.jena.rdf.model.StmtIterator;
import com.hp.hpl.jena.rdf.model.Property;
import com.hp.hpl.jena.graph.Node;
import com.hp.hpl.jena.graph.Triple;
import com.hp.hpl.jena.graph.TripleMatch;
import com.hp.hpl.jena.graph.TripleMatchIterator;
import com.hp.hpl.jena.graph.impl.GraphBase;
import com.hp.hpl.jena.rdf.model.AnonId;
import com.hp.hpl.jena.rdf.model.ResourceFactory;
import com.hp.hpl.jena.rdf.model.impl.ModelCom;
import com.hp.hpl.jena.util.iterator.ExtendedIterator;
import com.hp.hpl.jena.util.iterator.NiceIterator;
import com.hp.hpl.jena.sparql.core.DatasetImpl;
import com.hp.hpl.jena.vocabulary.DC;
import com.hp.hpl.jena.vocabulary.RDF;
import com.hp.hpl.jena.vocabulary.XSD;
import com.hp.hpl.jena.query.Dataset;
import org.slf4j.LoggerFactory;
import com.hp.hpl.jena.query.*;
/**
 * implementation of a RDF Graph for OpenOffice calc
 *
 */
public class OpenOfficeCalcGraph
    extends GraphBase
{
    /** logger */
    protected static final org.slf4j.Logger LOG= LoggerFactory.getLogger("ooffice2rdf");
    /** namespaces */
    private static final String OFFICE="urn:oasis:names:tc:opendocument:xmlns:office:1.0";
    private static final String TABLE="urn:oasis:names:tc:opendocument:xmlns:table:1.0";
    private static final String TEXT="urn:oasis:names:tc:opendocument:xmlns:text:1.0";
    private static final String NS="http://rdf.lindenb.org/";
    /** attributes */
    private static final QName number_columns_repeated=new QName(TABLE,"number-columns-repeated","table");
    private static final QName number_rows_repeated=new QName(TABLE,"number-rows-repeated","table");
    private static final QName value_type=new QName(OFFICE,"value-type","office");
    private static final QName value=new QName(OFFICE,"value","office");
    private static final QName name=new QName(TABLE,"name","table");
    //rdf:type Node
    private static final Node rdfType=Node.createURI(RDF.type.getURI());
    //all open office files
    private List<File> caclFiles=null;
    /** static Assembler for OpenOfficeCalcGraph
     * An assembler creates a Dataset(graph) from a RDF-based configuration file.
     * It is called by Fuseki
     */
    public static OpenOfficeAssembler assembler = new OpenOfficeAssembler();
    public static class OpenOfficeAssembler extends AssemblerBase implements Assembler
    {
        @Override
        public Object open( Assembler a, Resource root, Mode mode )
        {
            //read the configuration and get the files
            List<File> files=new ArrayList<File>();
            StmtIterator iter=root.listProperties(fileRsrc);
            while(iter.hasNext())
            {
                Statement stmt=iter.nextStatement();
                if(!stmt.getObject().isLiteral()) throw new RuntimeException("Not a literal "+stmt);
                String lit=stmt.getString();
                File file=new File(lit);
                if(!file.exists()) throw new RuntimeException("File not found : "+file);
                if(!file.getName().endsWith(".ods")) throw new RuntimeException("Not an .ods file : "+file);
                files.add(file);
            }
            iter.close();
            OpenOfficeCalcGraph g=new OpenOfficeCalcGraph(files);
            OpenOfficeCalcModel m=new OpenOfficeCalcModel(g);
            Dataset ds=new DatasetImpl(m);
            return ds;
        }
    }
    /** Initializer for Fuseki */
    private static boolean init_called = false ;
    private static final Resource buildRsrc=ResourceFactory.createResource(NS+"build");
    private static final Property fileRsrc=ResourceFactory.createProperty(NS+"file");
    /** static initializer, when this class is invoked,
     * it tells Fuseki that there is another assembler using Assembler.general
     * the resource-name for this assembler is this.buildRsrc
     */
    static { init() ; }
    private static void init()
    {
        if(init_called) return;
        LOG.info("Calling OpenOfficeCalcGraph init");
        AssemblerUtils.init();
        Assembler.general.implementWith(buildRsrc,assembler);
        init_called=true;
    }
    /** RDF Model for OpenOfficeCalcGraph */
    public static class OpenOfficeCalcModel extends ModelCom
    {
        public OpenOfficeCalcModel(OpenOfficeCalcGraph g)
        {
            super(g);
        }
    }
    /* one row in the spreadsheet */
    private static class Row
    {
        int repeat=1;
        private List<Cell> cells=new ArrayList<Cell>();
    }
    /* one cell in the spreadsheet */
    private static class Cell
    {
        int repeat=1;
        String type=null;
        String value=null;
        String literal=null;
    }
    /** Constructor from a list of OO files */
    public OpenOfficeCalcGraph(List<File> calcFiles)
    {
        this.caclFiles=new ArrayList<File>(calcFiles);
        this.getPrefixMapping().setNsPrefix("office", NS);
        this.getPrefixMapping().setNsPrefix("xsd", XSD.getURI());
        this.getPrefixMapping().setNsPrefix("dc", DC.getURI());
    }
    /** every call re-scans all the files; TripleMatchIterator keeps the matching triples */
    @Override
    protected ExtendedIterator<Triple> graphBaseFind(TripleMatch matcher)
    {
        return new TripleMatchIterator((Triple)matcher, new CellIterator());
    }
    /** parse the openoffice files and get the Triples */
    private class CellIterator extends NiceIterator<Triple>
    {
        /** current index in array of OO files */
        private int fileIndex=-1;
        /** buffer of triples */
        private List<Triple> buffer=new LinkedList<Triple>();
        /** next triple to be returned */
        private Triple next=null;
        /** was hasNext() called ? */
        private boolean hasNextCalled=false;
        /** current OO file opened */
        private File ioFile=null;
        /** Zip Handler for OO file */
        private ZipFile zipFile=null;
        /** Input Stream for current Zip entry */
        private InputStream zipInputStream;
        /** xml-handler for current zip entry */
        private XMLEventReader xmlEventReader;
        /* rdf subject for file */
        private Node fileRsrc=null;
        /* rdf subject for tab */
        private Node tabRsrc=null;
        /** current tab index */
        private int tabIndex=0;
        /* current column */
        private int X=0;
        /** current row */
        private int Y=0;
        private void add(Node s,Node p,Node o)
        {
            this.buffer.add(Triple.create(s, p, o));
        }
        public CellIterator()
        {
        }
        /** true if evt is a start or end element named {ns}localName */
        private boolean isA(XMLEvent evt,String ns,String localName)
        {
            QName q=null;
            if(evt.isStartElement())
            {
                q=evt.asStartElement().getName();
            }
            else if(evt.isEndElement())
            {
                q=evt.asEndElement().getName();
            }
            return q!=null &&
                q.getNamespaceURI().equals(ns) &&
                q.getLocalPart().equals(localName)
                ;
        }
        @Override
        public boolean hasNext()
        {
            if(!hasNextCalled)
            {
                hasNextCalled=true;
                next=null;
                for(;;)
                {
                    if(!buffer.isEmpty())
                    {
                        next=buffer.remove(0);
                        break;
                    }
                    try
                    {
                        if(xmlEventReader==null)
                        {
                            //open next file
                            if(fileIndex+1>=OpenOfficeCalcGraph.this.caclFiles.size()) break;
                            this.fileIndex++;
                            this.tabIndex=0;
                            //open XML StaX reader for current OO file
                            XMLInputFactory xmlInputFactory = XMLInputFactory.newInstance();
                            xmlInputFactory.setProperty(XMLInputFactory.IS_NAMESPACE_AWARE, Boolean.TRUE);
                            xmlInputFactory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
                            xmlInputFactory.setProperty(XMLInputFactory.IS_REPLACING_ENTITY_REFERENCES, Boolean.TRUE);
                            try
                            {
                                this.ioFile=OpenOfficeCalcGraph.this.caclFiles.get(this.fileIndex);
                                this.zipFile=new ZipFile(this.ioFile);
                                ZipEntry zipEntry=zipFile.getEntry("content.xml");
                                if(zipEntry==null) throw new RuntimeException("Cannot get content.xml");
                                this.zipInputStream=this.zipFile.getInputStream(zipEntry);
                                xmlEventReader= xmlInputFactory.createXMLEventReader(this.zipInputStream);
                                //describe the file as RDF
                                this.fileRsrc=Node.createURI(this.ioFile.toURI().toASCIIString());
                                add(this.fileRsrc,rdfType,Node.createURI(NS+"Spreadsheet"));
                                add(this.fileRsrc,Node.createURI(DC.title.getURI()),Node.createLiteral(this.ioFile.getName()));
                                continue;
                            }
                            catch (Exception e)
                            {
                                throw new RuntimeException(e);
                            }
                        }
                        if(xmlEventReader.hasNext())
                        {
                            Attribute att=null;
                            XMLEvent evt=xmlEventReader.nextEvent();
                            if(evt.isStartElement())
                            {
                                StartElement E=evt.asStartElement();
                                if(isA(E,TABLE,"table"))
                                {
                                    att=E.getAttributeByName(name);
                                    this.tabIndex++;
                                    //describe the tab as RDF
                                    this.tabRsrc=Node.createURI(this.ioFile.toURI().toASCIIString()+"/t"+tabIndex);
                                    add(this.tabRsrc,Node.createURI(NS+"file"),this.fileRsrc);
                                    add(this.tabRsrc,rdfType,Node.createURI(NS+"Table"));
                                    //the title is skipped if the table has no table:name attribute
                                    if(att!=null) add(this.tabRsrc,Node.createURI(DC.title.getURI()),Node.createLiteral(att.getValue()));
                                    this.X=0;
                                    this.Y=0;
                                }
                                else if(isA(E,TABLE,"table-row"))
                                {
                                    //parse the row
                                    Row row=parseRow(E);
                                    //create the statements for that row
                                    for(int i=0;i< row.repeat;++i)
                                    {
                                        this.X=0;
                                        this.Y++;
                                        for(Cell cell:row.cells)
                                        {
                                            for(int j=0;j< cell.repeat;++j)
                                            {
                                                this.X++;
                                                //empty cells advance the column but produce no triple
                                                if(cell.value==null && cell.literal==null) continue;
                                                Node subject=Node.createURI(this.ioFile.toURI().toASCIIString()+"/t"+tabIndex+"/y"+Y+"/x"+X);
                                                add(subject,Node.createURI(NS+"table"),this.tabRsrc);
                                                add(subject,rdfType,Node.createURI(NS+"Cell"));
                                                add(subject,Node.createURI(NS+"X"),Node.createLiteral(String.valueOf(X),null,XSDDatatype.XSDint));
                                                add(subject,Node.createURI(NS+"Y"),Node.createLiteral(String.valueOf(Y),null,XSDDatatype.XSDint));
                                                Node cellValue=null;
                                                if(cell.type!=null && cell.value!=null)
                                                {
                                                    XSDDatatype dataType=XSDDatatype.XSDstring;
                                                    if(cell.type.equals("float"))
                                                    {
                                                        dataType=XSDDatatype.XSDfloat;
                                                    }
                                                    else if(cell.type.equals("int"))
                                                    {
                                                        dataType=XSDDatatype.XSDint;
                                                    }
                                                    cellValue=Node.createLiteral(cell.value, null, dataType);
                                                }
                                                else
                                                {
                                                    cellValue=Node.createLiteral(String.valueOf(cell.literal));
                                                }
                                                add( subject,
                                                    Node.createURI(NS+"value"),
                                                    cellValue
                                                    );
                                            }
                                        }
                                    }
                                }
                            }
                            else if(evt.isEndElement())
                            {
                                if(isA(evt,TABLE,"table"))
                                {
                                    this.tabRsrc=null;
                                }
                            }
                        }
                        else //we're done for that file.
                        {
                            this.xmlEventReader.close();
                            this.zipInputStream.close();
                            this.zipFile.close();
                            this.xmlEventReader=null;
                            this.zipInputStream=null;
                            this.zipFile=null;
                            this.fileRsrc=null;
                            this.ioFile=null;
                        }
                    }
                    catch(Exception err)
                    {
                        throw new RuntimeException(err);
                    }
                }
            }
            return next!=null;
        }
        @Override
        public void close()
        {
            try { if(this.xmlEventReader!=null) this.xmlEventReader.close(); } catch (Exception e) {}
            this.xmlEventReader=null;
            try { if(this.zipInputStream!=null) this.zipInputStream.close(); } catch (Exception e) {}
            this.zipInputStream=null;
            try { if(this.zipFile!=null) this.zipFile.close(); } catch (Exception e) {}
            this.zipFile=null;
            this.buffer.clear();
            this.fileIndex=caclFiles.size();
        }
        @Override
        public Triple next()
        {
            //hasNext() fills 'next' if needed; the Iterator contract asks for NoSuchElementException at the end
            if(!hasNext()) throw new java.util.NoSuchElementException();
            Triple t=next;
            next=null;
            hasNextCalled=false;
            return t;
        }
        /** parses a table:table-row */
        private Row parseRow(StartElement root)
            throws XMLStreamException
        {
            Row row=new Row();
            Attribute att=root.getAttributeByName(number_rows_repeated);
            if(att!=null)
            {
                row.repeat=Integer.parseInt(att.getValue());
            }
            while(this.xmlEventReader.hasNext())
            {
                XMLEvent evt=this.xmlEventReader.nextEvent();
                if(evt.isStartElement())
                {
                    StartElement E=evt.asStartElement();
                    if(isA(E,TABLE,"table-cell"))
                    {
                        row.cells.add(parseCell(E));
                    }
                }
                else if(evt.isEndElement())
                {
                    if(isA(evt,TABLE,"table-row"))
                    {
                        break;
                    }
                }
            }
            return row;
        }
        /** parses a table:table-cell */
        private Cell parseCell(StartElement root)
            throws XMLStreamException
        {
            Cell cell=new Cell();
            Attribute att=root.getAttributeByName(number_columns_repeated);
            if(att!=null)
            {
                cell.repeat=Integer.parseInt(att.getValue());
            }
            att=root.getAttributeByName(value_type);
            if(att!=null)
            {
                cell.type=att.getValue();
            }
            att=root.getAttributeByName(value);
            if(att!=null)
            {
                cell.value=att.getValue();
                cell.literal=cell.value;
            }
            while(this.xmlEventReader.hasNext())
            {
                XMLEvent evt=this.xmlEventReader.nextEvent();
                if(evt.isStartElement())
                {
                    StartElement E=evt.asStartElement();
                    if(isA(E,TEXT,"p"))
                    {
                        cell.literal=parseText(E);
                    }
                }
                else if(evt.isEndElement())
                {
                    if(isA(evt,TABLE,"table-cell"))
                    {
                        break;
                    }
                }
            }
            return cell;
        }
        /** returns the content of <text:p/>; nested elements inside the paragraph are not supported */
        private String parseText(StartElement root)
            throws XMLStreamException
        {
            StringBuilder b=new StringBuilder();
            while(xmlEventReader.hasNext())
            {
                XMLEvent evt=this.xmlEventReader.nextEvent();
                if(evt.isStartElement())
                {
                    throw new IllegalStateException();
                }
                else if(evt.isEndElement())
                {
                    if(isA(evt,TEXT,"p"))
                    {
                        return b.toString();
                    }
                }
                else if(evt.isCharacters())
                {
                    b.append(evt.asCharacters().getData());
                }
            }
            throw new IllegalStateException();
        }
    }
    public static void main(String[] args) throws Exception
    {
        if(args.length<2)
        {
            System.err.println("Usage: query.sparql file1.ods, file2.ods... filen.ods");
            return;
        }
        List<File> files=new ArrayList<File>();
        for(int optind=1;optind< args.length;++optind)
        {
            files.add(new File(args[optind]));
        }
        OpenOfficeCalcGraph g=new OpenOfficeCalcGraph(files);
        OpenOfficeCalcModel m=new OpenOfficeCalcModel(g);
        com.hp.hpl.jena.query.Query query = QueryFactory.read(args[0]) ;
        LOG.info("starting query");
        QueryExecution qexec = QueryExecutionFactory.create(query, m) ;
        try {
            ResultSet results = qexec.execSelect();
            ResultSetFormatter.out(System.out,results,g.getPrefixMapping());
        } finally { qexec.close() ; }
    }
}
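The coordinate bookkeeping buried in hasNext() is easier to follow on its own. The following sketch isolates it without Jena or XML; RepeatExpander and its Cell class are made-up stand-ins for this example. It shows how the repeat counts (table:number-rows-repeated, table:number-columns-repeated) are expanded into 1-based positions, and how the /tN/yN/xN suffix of each cell URI is built.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Expands the repeated cells of one row into 1-based cell paths. */
final class RepeatExpander {
    /** Hypothetical stand-in for the Cell class of the listing. */
    static final class Cell {
        final int repeat;
        final String value;
        Cell(int repeat, String value) { this.repeat = repeat; this.value = value; }
    }

    /**
     * Returns paths such as "/t1/y3/x2" for the non-empty cells of a row that
     * is repeated rowRepeat times, rowsSeen rows having been read before it.
     */
    static List<String> expand(int tab, int rowsSeen, int rowRepeat, List<Cell> cells) {
        List<String> paths = new ArrayList<>();
        int y = rowsSeen;
        for (int i = 0; i < rowRepeat; i++) {
            y++;
            int x = 0;
            for (Cell c : cells) {
                for (int j = 0; j < c.repeat; j++) {
                    x++;                          // empty cells still advance the column
                    if (c.value == null) continue;
                    paths.add("/t" + tab + "/y" + y + "/x" + x);
                }
            }
        }
        return paths;
    }

    public static void main(String[] args) {
        // one value, two empty columns (repeat=2), then another value
        List<Cell> row = Arrays.asList(new Cell(1, "chr1"), new Cell(2, null), new Cell(1, "rs264"));
        System.out.println(expand(1, 2, 1, row)); // prints [/t1/y3/x1, /t1/y3/x4]
    }
}
```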
Compilation
The Makefile:
CP=...#path to the jars of JENA/ARQ/etc... e.g: =`find ${ARQ} -name "*.jar" | tr "\n" ":"`
.PHONY: all
all:
	javac -cp ${CP} -sourcepath src src/oocalc/OpenOfficeCalcGraph.java
	jar cvf dist/openoffice2rdf.jar -C src .
Querying using sparql
Now that the graph has been implemented and compiled, one can query it using ARQ, the SPARQL engine of Jena.
The spreadsheet
I've created the following spreadsheet and saved it in a file named "jeter.ods":
CHROM | START | END | NAME |
chr1 | 100 | 200 | rs654 |
chr1 | 150 | 250 | rs264 |
chr1 | 200 | 300 | rs610 |
chr1 | 250 | 350 | rs929 |
chr1 | 300 | 400 | rs408 |
chr1 | 350 | 450 | rs346 |
chr1 | 400 | 500 | rs430 |
chr1 | 450 | 550 | rs735 |
chr1 | 500 | 600 | rs575 |
chr1 | 550 | 650 | rs891 |
chr1 | 600 | 700 | rs627 |
chr1 | 650 | 750 | rs650 |
chr1 | 700 | 800 | rs715 |
chr1 | 750 | 850 | rs467 |
chr1 | 800 | 900 | rs882 |
chr1 | 850 | 950 | rs301 |
chr1 | 900 | 1000 | rs643 |
chr1 | 950 | 1050 | rs246 |
chr1 | 1000 | 1100 | rs178 |
chr1 | 1050 | 1150 | rs928 |
chr1 | 1100 | 1200 | rs213 |
The sparql query
The following SPARQL query returns the information about the cells in the 3rd row of the spreadsheet:
PREFIX oo: <http://rdf.lindenb.org/>
SELECT ?s ?p ?o
WHERE
  {
  ?s oo:Y "3"^^<http://www.w3.org/2001/XMLSchema#int> .
  ?s a oo:Cell .
  ?s ?p ?o .
  }
Invoke:
java -cp `find /home/lindenb/.ivy2/cache -name "*.jar" | tr "\n" ":"`:dist/openoffice2rdf.jar \
  oocalc.OpenOfficeCalcGraph test.sparql /home/lindenb/jeter.ods
Result:
-----------------------------------------------------------------------------------------------------------------------------------
| s | p | o |
===================================================================================================================================
| <file:/home/lindenb/jeter.ods/t1/y3/x1> | office:table | <file:/home/lindenb/jeter.ods/t1> |
| <file:/home/lindenb/jeter.ods/t1/y3/x1> | <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> | office:Cell |
| <file:/home/lindenb/jeter.ods/t1/y3/x1> | office:X | "1"^^xsd:int |
| <file:/home/lindenb/jeter.ods/t1/y3/x1> | office:Y | "3"^^xsd:int |
| <file:/home/lindenb/jeter.ods/t1/y3/x1> | office:value | "chr1" |
| <file:/home/lindenb/jeter.ods/t1/y3/x2> | office:table | <file:/home/lindenb/jeter.ods/t1> |
| <file:/home/lindenb/jeter.ods/t1/y3/x2> | <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> | office:Cell |
| <file:/home/lindenb/jeter.ods/t1/y3/x2> | office:X | "2"^^xsd:int |
| <file:/home/lindenb/jeter.ods/t1/y3/x2> | office:Y | "3"^^xsd:int |
| <file:/home/lindenb/jeter.ods/t1/y3/x2> | office:value | "150"^^xsd:float |
| <file:/home/lindenb/jeter.ods/t1/y3/x3> | office:table | <file:/home/lindenb/jeter.ods/t1> |
| <file:/home/lindenb/jeter.ods/t1/y3/x3> | <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> | office:Cell |
| <file:/home/lindenb/jeter.ods/t1/y3/x3> | office:X | "3"^^xsd:int |
| <file:/home/lindenb/jeter.ods/t1/y3/x3> | office:Y | "3"^^xsd:int |
| <file:/home/lindenb/jeter.ods/t1/y3/x3> | office:value | "250"^^xsd:float |
| <file:/home/lindenb/jeter.ods/t1/y3/x4> | office:table | <file:/home/lindenb/jeter.ods/t1> |
| <file:/home/lindenb/jeter.ods/t1/y3/x4> | <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> | office:Cell |
| <file:/home/lindenb/jeter.ods/t1/y3/x4> | office:X | "4"^^xsd:int |
| <file:/home/lindenb/jeter.ods/t1/y3/x4> | office:Y | "3"^^xsd:int |
| <file:/home/lindenb/jeter.ods/t1/y3/x4> | office:value | "rs264" |
-----------------------------------------------------------------------------------------------------------------------------------
Serving the OpenOffice spreadsheets as RDF over HTTP
Fuseki is a SPARQL server. It provides REST-style SPARQL HTTP Update, SPARQL Query, and SPARQL Update using the SPARQL protocol over HTTP. We're going to deploy the OpenOfficeCalcGraph in Fuseki to query a set of OpenOffice files.
Download and install Fuseki
wget https://repository.apache.org/content/repositories/releases/org/apache/jena/jena-fuseki/0.2.5/jena-fuseki-0.2.5-distribution.tar.gz
tar xfz jena-fuseki-0.2.5-distribution.tar.gz
rm jena-fuseki-0.2.5-distribution.tar.gz
Tell Fuseki about our OpenOfficeCalcGraph
We need to create a config file for Fuseki. That was the most complicated part as the process is not clearly documented:
# Licensed under the terms of http://www.apache.org/licenses/LICENSE-2.0
## A collection of example configurations for Fuseki
@prefix : <#> .
@prefix fuseki: <http://jena.apache.org/fuseki#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix openoffice: <http://rdf.lindenb.org/> .
@prefix ja: <http://jena.hpl.hp.com/2005/11/Assembler#> .
[] rdf:type fuseki:Server ;
   # Timeout - server-wide default: milliseconds.
   # Format 1: "1000" -- 1 second timeout
   # Format 2: "10000,60000" -- 10s timeout to first result, then 60s timeout for the rest of the query.
   # See java doc for ARQ.queryTimeout
   # ja:context [ ja:cxtName "arq:queryTimeout" ; ja:cxtValue "10000" ] ;
   # ja:loadClass "your.code.Class" ;
   fuseki:services (
     <#service1>
     <#ooservice>
   ) .
# Custom code.
[] ja:loadClass "oocalc.OpenOfficeCalcGraph" .
openoffice:GraphTDB rdfs:subClassOf ja:Model .
# OpenOffice
## ---------------------------------------------------------------
## Updatable in-memory dataset.
<#service1> rdf:type fuseki:Service ;
    # URI of the dataset -- http://host:port/ds
    fuseki:name "ds" ;
    # SPARQL query services e.g. http://host:port/ds/sparql?query=...
    fuseki:serviceQuery "sparql" ;
    fuseki:serviceQuery "query" ;
    # SPARQL Update service -- http://host:port/ds/update?request=...
    fuseki:serviceUpdate "update" ; # SPARQL update service -- /ds/update
    # Upload service -- http://host:port/ds/upload?graph=default or ?graph=URI or ?default
    # followed by a multipart body, each part being RDF syntax.
    # Syntax determined by the file name extension.
    fuseki:serviceUpload "upload" ; # Non-SPARQL upload service
    # SPARQL Graph store protocol (read and write)
    # GET, PUT, POST DELETE to http://host:port/ds/data?graph= or ?default=
    fuseki:serviceReadWriteGraphStore "data" ;
    # A separate read-only graph store endpoint:
    fuseki:serviceReadGraphStore "get" ; # Graph store protocol (read only) -- /ds/get
    fuseki:dataset <#emptyDataset> ;
    .
## In-memory, initially empty.
<#emptyDataset> rdf:type ja:RDFDataset .
<#ooservice> rdf:type fuseki:Service ;
    rdfs:label "OpenOffice Service (R)" ;
    fuseki:name "openoffice" ;
    fuseki:serviceQuery "query" ;
    fuseki:serviceQuery "sparql" ;
    fuseki:serviceUpdate "update" ;
    fuseki:serviceReadGraphStore "data" ;
    fuseki:serviceReadGraphStore "get" ;
    fuseki:dataset <#ooservice> ;
    .
<#ooservice> rdf:type openoffice:build ;
    openoffice:file "/home/lindenb/jeter.ods" ;
    openoffice:file "/home/lindenb/jeter2.ods" ;
    .
# ---- RDFS Inference models
# These must be incorporated in a dataset in order to use them.
# All in one file.
<#model_inf_1> rdfs:label "Inf-1" ;
    ja:baseModel
        [ a ja:MemoryModel ;
          ja:content [ja:externalContent <file:Data/test_data_rdfs.ttl>] ;
        ] ;
    ja:reasoner
        [ ja:reasonerURL <http://jena.hpl.hp.com/2003/RDFSExptRuleReasoner> ]
    .
# Separate ABox and TBox
<#model_inf_2> rdfs:label "Inf-2" ;
    ja:baseModel
        [ a ja:MemoryModel ;
          ja:content [ja:externalContent <file:Data/test_abox.ttl>] ;
          ja:content [ja:externalContent <file:Data/test_tbox.ttl>] ;
        ] ;
    ja:reasoner
        [ ja:reasonerURL <http://jena.hpl.hp.com/2003/RDFSExptRuleReasoner> ]
    .
The line:
[] ja:loadClass "oocalc.OpenOfficeCalcGraph" .
loads the class oocalc.OpenOfficeCalcGraph. The class OpenOfficeCalcGraph contains a static initialisation method:
(...)
static { init() ; }
private static void init()
	{
(...)
In this static method, a Jena Assembler for OpenOfficeCalcGraph is registered under the resource named: "http://rdf.lindenb.org/build".
public static OpenOfficeAssembler assembler = new OpenOfficeAssembler();
(...)
private static final Resource buildRsrc=ResourceFactory.createResource(NS+"build");
(...)
Assembler.general.implementWith(buildRsrc,assembler);
(...)
An Assembler configures a graph from an RDF config file. In our example, the config contains the paths to the OpenOffice spreadsheets:
<#ooservice> rdf:type openoffice:build ;
    openoffice:file "/home/lindenb/jeter.ods" ;
    openoffice:file "/home/lindenb/jeter2.ods" ;
    .
This config is read in the Assembler:
public static class OpenOfficeAssembler extends AssemblerBase implements Assembler
	{
	@Override
	public Object open( Assembler a, Resource root, Mode mode )
		{
		Property fileRsrc=ResourceFactory.createProperty(NS+"file");
		//read the configuration and get the files
		List<File> files=new ArrayList<File>();
		StmtIterator iter=root.listProperties(fileRsrc);
		(...)
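The mechanism boils down to "loading a class runs its static initializer, which registers a builder under a type URI". The sketch below reproduces that pattern with the standard library only. Everything in it (AssemblerSketch, REGISTRY, SpreadsheetSource, loadClass, open) is invented for the illustration and is not the Jena API; it also assumes that loading a class by name initializes it, as Class.forName does.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Plain-Java illustration of "load a class, let it register its own builder". */
final class AssemblerSketch {
    /** type URI -> builder; plays the role of the general assembler registry. */
    static final Map<String, Function<List<String>, Object>> REGISTRY = new HashMap<>();

    /** Stand-in for the graph class: registers a builder when the class is initialised. */
    static final class SpreadsheetSource {
        static final String BUILD = "http://rdf.lindenb.org/build";
        static { REGISTRY.put(BUILD, files -> new SpreadsheetSource(files)); }
        final List<String> files;
        SpreadsheetSource(List<String> files) { this.files = files; }
    }

    /** Forces class initialisation by name, which runs the static block above. */
    static void loadClass(String className) {
        try {
            Class.forName(className);
        } catch (ClassNotFoundException e) {
            throw new IllegalArgumentException(e);
        }
    }

    /** Looks up the builder registered for the type URI and applies it to the config. */
    static Object open(String typeUri, List<String> files) {
        Function<List<String>, Object> builder = REGISTRY.get(typeUri);
        if (builder == null) throw new IllegalArgumentException("No assembler for " + typeUri);
        return builder.apply(files);
    }

    public static void main(String[] args) {
        loadClass(SpreadsheetSource.class.getName());
        Object ds = open(SpreadsheetSource.BUILD, Arrays.asList("/home/lindenb/jeter.ods"));
        System.out.println(((SpreadsheetSource) ds).files); // prints [/home/lindenb/jeter.ods]
    }
}
```

Without the loadClass call, the registry would have no entry for the build URI and open would fail, which is why the config file needs the ja:loadClass line.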
Start Fuseki with the config file:
$ cd jena-fuseki-0.2.5
$ java -cp fuseki-server.jar:/path/to/openoffice2rdf.jar org.apache.jena.fuseki.FusekiCmd \
    --debug -v --config /path/to/openoffice.ttl
14:11:50 INFO Config :: Configuration file: ../openoffice.ttl
14:11:50 INFO Config :: Service: :service1
14:11:50 INFO Config :: name = ds
14:11:50 INFO Config :: query = /ds/query
14:11:50 INFO Config :: query = /ds/sparql
14:11:50 INFO Config :: update = /ds/update
14:11:50 INFO Config :: upload = /ds/upload
14:11:50 INFO Config :: graphStore(RW) = /ds/data
14:11:50 INFO Config :: graphStore(R) = /ds/get
14:11:50 INFO ooffice2rdf :: Calling OpenOfficeCalcGraph init
14:11:50 INFO Config :: Service: OpenOffice Service (R)
14:11:50 INFO Config :: name = openoffice
14:11:50 INFO Config :: query = /openoffice/sparql
14:11:50 INFO Config :: query = /openoffice/query
14:11:50 INFO Config :: update = /openoffice/update
14:11:50 INFO Config :: graphStore(R) = /openoffice/get
14:11:50 INFO Config :: graphStore(R) = /openoffice/data
14:11:51 INFO Server :: Dataset path = /ds
14:11:51 INFO Server :: Dataset path = /openoffice
14:11:51 INFO Server :: Fuseki 0.2.5 2012-10-20T17:03:29+0100
14:11:51 INFO Server :: Started 2012/11/13 14:11:51 CET on port 3030
Open your browser at http://localhost:3030, select the control panel at http://localhost:3030/control-panel.tpl and select /openoffice:
Fuseki Control Panel
The following form is displayed:
SPARQL Query
You can now copy, paste and run the previous SPARQL query:
--------------------------------------------------------------------------------------------------------------------------------------------------
| s                                        | p                                                 | o                                               |
==================================================================================================================================================
| <file:/home/lindenb/jeter.ods/t1/y3/x1>  | <http://rdf.lindenb.org/table>                    | <file:/home/lindenb/jeter.ods/t1>               |
| <file:/home/lindenb/jeter.ods/t1/y3/x1>  | <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> | <http://rdf.lindenb.org/Cell>                   |
| <file:/home/lindenb/jeter.ods/t1/y3/x1>  | <http://rdf.lindenb.org/X>                        | "1"^^<http://www.w3.org/2001/XMLSchema#int>     |
| <file:/home/lindenb/jeter.ods/t1/y3/x1>  | <http://rdf.lindenb.org/Y>                        | "3"^^<http://www.w3.org/2001/XMLSchema#int>     |
| <file:/home/lindenb/jeter.ods/t1/y3/x1>  | <http://rdf.lindenb.org/value>                    | "chr1"                                          |
| <file:/home/lindenb/jeter.ods/t1/y3/x2>  | <http://rdf.lindenb.org/table>                    | <file:/home/lindenb/jeter.ods/t1>               |
| <file:/home/lindenb/jeter.ods/t1/y3/x2>  | <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> | <http://rdf.lindenb.org/Cell>                   |
| <file:/home/lindenb/jeter.ods/t1/y3/x2>  | <http://rdf.lindenb.org/X>                        | "2"^^<http://www.w3.org/2001/XMLSchema#int>     |
| <file:/home/lindenb/jeter.ods/t1/y3/x2>  | <http://rdf.lindenb.org/Y>                        | "3"^^<http://www.w3.org/2001/XMLSchema#int>     |
| <file:/home/lindenb/jeter.ods/t1/y3/x2>  | <http://rdf.lindenb.org/value>                    | "150"^^<http://www.w3.org/2001/XMLSchema#float> |
| <file:/home/lindenb/jeter.ods/t1/y3/x3>  | <http://rdf.lindenb.org/table>                    | <file:/home/lindenb/jeter.ods/t1>               |
| <file:/home/lindenb/jeter.ods/t1/y3/x3>  | <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> | <http://rdf.lindenb.org/Cell>                   |
| <file:/home/lindenb/jeter.ods/t1/y3/x3>  | <http://rdf.lindenb.org/X>                        | "3"^^<http://www.w3.org/2001/XMLSchema#int>     |
| <file:/home/lindenb/jeter.ods/t1/y3/x3>  | <http://rdf.lindenb.org/Y>                        | "3"^^<http://www.w3.org/2001/XMLSchema#int>     |
| <file:/home/lindenb/jeter.ods/t1/y3/x3>  | <http://rdf.lindenb.org/value>                    | "250"^^<http://www.w3.org/2001/XMLSchema#float> |
| <file:/home/lindenb/jeter.ods/t1/y3/x4>  | <http://rdf.lindenb.org/table>                    | <file:/home/lindenb/jeter.ods/t1>               |
| <file:/home/lindenb/jeter.ods/t1/y3/x4>  | <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> | <http://rdf.lindenb.org/Cell>                   |
| <file:/home/lindenb/jeter.ods/t1/y3/x4>  | <http://rdf.lindenb.org/X>                        | "4"^^<http://www.w3.org/2001/XMLSchema#int>     |
| <file:/home/lindenb/jeter.ods/t1/y3/x4>  | <http://rdf.lindenb.org/Y>                        | "3"^^<http://www.w3.org/2001/XMLSchema#int>     |
| <file:/home/lindenb/jeter.ods/t1/y3/x4>  | <http://rdf.lindenb.org/value>                    | "rs264"                                         |
| <file:/home/lindenb/jeter2.ods/t1/y3/x1> | <http://rdf.lindenb.org/table>                    | <file:/home/lindenb/jeter2.ods/t1>              |
| <file:/home/lindenb/jeter2.ods/t1/y3/x1> | <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> | <http://rdf.lindenb.org/Cell>                   |
| <file:/home/lindenb/jeter2.ods/t1/y3/x1> | <http://rdf.lindenb.org/X>                        | "1"^^<http://www.w3.org/2001/XMLSchema#int>     |
| <file:/home/lindenb/jeter2.ods/t1/y3/x1> | <http://rdf.lindenb.org/Y>                        | "3"^^<http://www.w3.org/2001/XMLSchema#int>     |
| <file:/home/lindenb/jeter2.ods/t1/y3/x1> | <http://rdf.lindenb.org/value>                    | "1"^^<http://www.w3.org/2001/XMLSchema#float>   |
| <file:/home/lindenb/jeter2.ods/t1/y3/x2> | <http://rdf.lindenb.org/table>                    | <file:/home/lindenb/jeter2.od                   |
| <file:/home/lindenb/jeter2.ods/t1/y3/x2> | <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> | <http://rdf.lindenb.org/Cell>                   |
| <file:/home/lindenb/jeter2.ods/t1/y3/x2> | <http://rdf.lindenb.org/X>                        | "2"^^<http://www.w3.org/2001/XMLSchema#int>     |
| <file:/home/lindenb/jeter2.ods/t1/y3/x2> | <http://rdf.lindenb.org/Y>                        | "3"^^<http://www.w3.org/2001/XMLSchema#int>     |
| <file:/home/lindenb/jeter2.ods/t1/y3/x2> | <http://rdf.lindenb.org/value>                    | "2"^^<http://www.w3.org/2001/XMLSchema#float>   |
| <file:/home/lindenb/jeter2.ods/t1/y3/x3> | <http://rdf.lindenb.org/table>                    | <file:/home/lindenb/jeter2.ods/t1>              |
| <file:/home/lindenb/jeter2.ods/t1/y3/x3> | <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> | <http://rdf.lindenb.org/Cell>                   |
| <file:/home/lindenb/jeter2.ods/t1/y3/x3> | <http://rdf.lindenb.org/X>                        | "3"^^<http://www.w3.org/2001/XMLSchema#int>     |
| <file:/home/lindenb/jeter2.ods/t1/y3/x3> | <http://rdf.lindenb.org/Y>                        | "3"^^<http://www.w3.org/2001/XMLSchema#int>     |
| <file:/home/lindenb/jeter2.ods/t1/y3/x3> | <http://rdf.lindenb.org/value>                    | "3"^^<http://www.w3.org/2001/XMLSchema#float>   |
| <file:/home/lindenb/jeter2.ods/t1/y3/x4> | <http://rdf.lindenb.org/table>                    | <file:/home/lindenb/jeter2.ods/t1>              |
| <file:/home/lindenb/jeter2.ods/t1/y3/x4> | <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> | <http://rdf.lindenb.org/Cell>                   |
| <file:/home/lindenb/jeter2.ods/t1/y3/x4> | <http://rdf.lindenb.org/X>                        | "4"^^<http://www.w3.org/2001/XMLSchema#int>     |
| <file:/home/lindenb/jeter2.ods/t1/y3/x4> | <http://rdf.lindenb.org/Y>                        | "3"^^<http://www.w3.org/2001/XMLSchema#int>     |
| <file:/home/lindenb/jeter2.ods/t1/y3/x4> | <http://rdf.lindenb.org/value>                    | "4"^^<http://www.w3.org/2001/XMLSchema#float>   |
--------------------------------------------------------------------------------------------------------------------------------------------------
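Instead of pasting the query into the web form, the same query could also be sent to the server from a program. The following is only a minimal sketch, not part of the original setup: it uses the plain Java 11 HTTP client and the standard SPARQL protocol (GET with a url-encoded "query" parameter). The endpoint URL (port 3030, dataset name "/ds") and the class name FusekiQuery are assumptions and must be adapted to the way your Fuseki server was started. The query string is a placeholder too; replace it with the query used above.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class FusekiQuery {
    // ASSUMPTION: hypothetical endpoint; the dataset name "/ds" depends on your Fuseki configuration.
    static final String ENDPOINT = "http://localhost:3030/ds/query";

    /** Builds the GET URL of the SPARQL protocol: endpoint?query=<url-encoded query>. */
    static URI buildQueryUri(String endpoint, String sparql) {
        String encoded = URLEncoder.encode(sparql, StandardCharsets.UTF_8);
        return URI.create(endpoint + "?query=" + encoded);
    }

    public static void main(String[] args) throws Exception {
        // Placeholder query: paste the real SPARQL query here.
        String sparql = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10";

        HttpRequest request = HttpRequest.newBuilder(buildQueryUri(ENDPOINT, sparql))
            // Ask for the standard JSON serialization of SPARQL results.
            .header("Accept", "application/sparql-results+json")
            .GET()
            .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString());

        // Print the raw response body returned by the server.
        System.out.println(response.body());
    }
}
```

The server must be running before this program is started. The result is printed as returned by the server rather than as the text table shown above.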
That's it,
Pierre