13 November 2012

Creating a virtual RDF graph describing a set of OpenOffice spreadsheets with Apache Jena and Fuseki

In this post, I will use the Jena API for RDF to implement a virtual RDF graph describing the content of a set of OpenOffice/LibreOffice spreadsheets.

Fact: An openoffice file (*.ods) is a Zip file

An openoffice file is nothing but a zip file:
$ unzip -t jeter.ods 
Archive:  jeter.ods
    testing: mimetype                 OK
    testing: meta.xml                 OK
    testing: settings.xml             OK
    testing: content.xml              OK
    testing: Thumbnails/thumbnail.png   OK
    testing: Configurations2/images/Bitmaps/   OK
    testing: Configurations2/popupmenu/   OK
    testing: Configurations2/toolpanel/   OK
    testing: Configurations2/statusbar/   OK
    testing: Configurations2/progressbar/   OK
    testing: Configurations2/toolbar/   OK
    testing: Configurations2/menubar/   OK
    testing: Configurations2/accelerator/current.xml   OK
    testing: Configurations2/floater/   OK
    testing: styles.xml               OK
    testing: META-INF/manifest.xml    OK
No errors detected in compressed data of jeter.ods.
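
The same inspection can be done from Java with the standard java.util.zip package. The following is a minimal sketch, not part of the original code: the class name OdsZipSketch and its two helper methods are hypothetical, and the entry is assumed to be UTF-8 text.

```java
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

/** Hypothetical helper: inspects an .ods file as a plain zip archive. */
class OdsZipSketch {
    /** Returns the names of all the entries found in the archive. */
    static List<String> listEntries(File ods) {
        try (ZipFile zip = new ZipFile(ods)) {
            List<String> names = new ArrayList<>();
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                names.add(entries.nextElement().getName());
            }
            return names;
        } catch (IOException err) {
            throw new UncheckedIOException(err);
        }
    }

    /** Returns the content of one entry (e.g. "content.xml"), decoded as UTF-8. */
    static String readEntry(File ods, String entryName) {
        try (ZipFile zip = new ZipFile(ods)) {
            ZipEntry entry = zip.getEntry(entryName);
            if (entry == null) {
                throw new IllegalArgumentException("No entry " + entryName + " in " + ods);
            }
            try (InputStream in = zip.getInputStream(entry)) {
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] buffer = new byte[8192];
                int n;
                while ((n = in.read(buffer)) != -1) {
                    out.write(buffer, 0, n);
                }
                return new String(out.toByteArray(), StandardCharsets.UTF_8);
            }
        } catch (IOException err) {
            throw new UncheckedIOException(err);
        }
    }
}
```

Calling OdsZipSketch.listEntries(new File("jeter.ods")) should return the entry names listed by unzip above, and readEntry(..., "content.xml") the raw XML of the spreadsheet.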

The entry content.xml is an XML file describing the tables in the spreadsheet:
$ unzip -c jeter.ods content.xml |\
grep -v Archive |\
grep -v inflating | xmllint --format - |\
head -n 20


<?xml version="1.0" encoding="UTF-8"?>
<office:document-content xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0" xmlns:style="urn:oasis:names:tc:opendocument:xmlns:style:1.0" xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0" xmlns:table="urn:oasis:names:tc:opendocument:xmlns:table:1.0" xmlns:draw="urn:oasis:names:tc:opendocument:xmlns:drawing:1.0" xmlns:fo="urn:oasis:names:tc:opendocument:xmlns:xsl-fo-compatible:1.0" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:meta="urn:oasis:names:tc:opendocument:xmlns:meta:1.0" xmlns:number="urn:oasis:names:tc:opendocument:xmlns:datastyle:1.0" xmlns:presentation="urn:oasis:names:tc:opendocument:xmlns:presentation:1.0" xmlns:svg="urn:oasis:names:tc:opendocument:xmlns:svg-compatible:1.0" xmlns:chart="urn:oasis:names:tc:opendocument:xmlns:chart:1.0" xmlns:dr3d="urn:oasis:names:tc:opendocument:xmlns:dr3d:1.0" xmlns:math="http://www.w3.org/1998/Math/MathML" xmlns:form="urn:oasis:names:tc:opendocument:xmlns:form:1.0" xmlns:script="urn:oasis:names:tc:opendocument:xmlns:script:1.0" xmlns:ooo="http://openoffice.org/2004/office" xmlns:ooow="http://openoffice.org/2004/writer" xmlns:oooc="http://openoffice.org/2004/calc" xmlns:dom="http://www.w3.org/2001/xml-events" xmlns:xforms="http://www.w3.org/2002/xforms" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:rpt="http://openoffice.org/2005/report" xmlns:of="urn:oasis:names:tc:opendocument:xmlns:of:1.2" xmlns:xhtml="http://www.w3.org/1999/xhtml" xmlns:grddl="http://www.w3.org/2003/g/data-view#" xmlns:tableooo="http://openoffice.org/2009/table" xmlns:field="urn:openoffice:names:experimental:ooo-ms-interop:xmlns:field:1.0" xmlns:formx="urn:openoffice:names:experimental:ooxml-odf-interop:xmlns:form:1.0" xmlns:css3t="http://www.w3.org/TR/css3-text/" office:version="1.2">
  <office:scripts/>
  <office:font-face-decls>
    <style:font-face style:name="Liberation Sans" svg:font-family="'Liberation Sans'" style:font-family-generic="swiss" style:font-pitch="variable"/>
    <style:font-face style:name="DejaVu Sans" svg:font-family="'DejaVu Sans'" style:font-family-generic="system" style:font-pitch="variable"/>
    <style:font-face style:name="Lohit Hindi" svg:font-family="'Lohit Hindi'" style:font-family-generic="system" style:font-pitch="variable"/>
    <style:font-face style:name="WenQuanYi Micro Hei" svg:font-family="'WenQuanYi Micro Hei'" style:font-family-generic="system" style:font-pitch="variable"/>
  </office:font-face-decls>
  <office:automatic-styles>
    <style:style style:name="co1" style:family="table-column">
      <style:table-column-properties fo:break-before="auto" style:column-width="0.889in"/>
    </style:style>
    <style:style style:name="ro2" style:family="table-row">
      <style:table-row-properties style:row-height="0.178in" fo:break-before="auto" style:use-optimal-row-height="true"/>
    </style:style>
    <style:style style:name="ro3" style:family="table-row">
      <style:table-row-properties style:row-height="0.1681in" fo:break-before="auto" style:use-optimal-row-height="true"/>
    </style:style>
    <style:style style:name="ta1" style:family="table" style:master-page-name="Default">

Fact: Implementing a simple virtual RDF graph with Jena is easy

By virtual, I mean that there is no RDF store: the triples are created on the fly.
Implementing a simple virtual RDF graph with Jena is easy: you extend the class com.hp.hpl.jena.graph.impl.GraphBase and implement a single method, graphBaseFind, which returns all the RDF triples matching a TripleMatch.

(...)
 @Override
    protected ExtendedIterator<Triple> graphBaseFind(TripleMatch matcher)
        {
        return ...;
        }
(...)
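
To make the idea of "triples created on the fly" concrete without depending on Jena, here is a plain-Java analogy. It is only a sketch: MiniTriple and VirtualGraphSketch are hypothetical names, triples are reduced to three strings, and a null field in the pattern plays the role of a wildcard. Nothing is stored; each call to find() recomputes the triples and keeps those matching the pattern.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

/** Hypothetical, simplified triple of strings; a null field in a pattern is a wildcard. */
final class MiniTriple {
    final String s, p, o;

    MiniTriple(String s, String p, String o) {
        this.s = s;
        this.p = p;
        this.o = o;
    }

    /** True if this triple is matched by the given pattern. */
    boolean matches(MiniTriple pattern) {
        return (pattern.s == null || pattern.s.equals(s))
            && (pattern.p == null || pattern.p.equals(p))
            && (pattern.o == null || pattern.o.equals(o));
    }

    @Override
    public String toString() {
        return s + " " + p + " " + o;
    }
}

/** A "virtual" graph: the triples (n, square, n*n) are generated at query time. */
class VirtualGraphSketch {
    private final int size;

    VirtualGraphSketch(int size) {
        this.size = size;
    }

    /** Plays the role of graphBaseFind: lazily yields the triples matching the pattern. */
    Iterator<MiniTriple> find(final MiniTriple pattern) {
        return new Iterator<MiniTriple>() {
            private int i = 0;
            private MiniTriple next = null;

            @Override
            public boolean hasNext() {
                // generate triples until one matches or the source is exhausted
                while (next == null && i < size) {
                    MiniTriple t = new MiniTriple("urn:ex:n" + i, "urn:ex:square", String.valueOf(i * i));
                    i++;
                    if (t.matches(pattern)) next = t;
                }
                return next != null;
            }

            @Override
            public MiniTriple next() {
                if (!hasNext()) throw new NoSuchElementException();
                MiniTriple t = next;
                next = null;
                return t;
            }
        };
    }

    public static void main(String[] args) {
        Iterator<MiniTriple> it = new VirtualGraphSketch(5).find(new MiniTriple(null, null, "9"));
        while (it.hasNext()) {
            System.out.println(it.next()); // prints: urn:ex:n3 urn:ex:square 9
        }
    }
}
```

In the real Jena class, the pattern is the TripleMatch and the iterator must be an ExtendedIterator<Triple>; the look-ahead logic in hasNext()/next() is the same idea.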

The code

My implementation of an RDF graph for a set of OpenOffice Calc files is not efficient, but it works fine: for each call of graphBaseFind, it creates an "Iterator<Triple>" that scans the content.xml entry of each OpenOffice file. This iterator creates new Triples and adds them to a buffer; the triples are then filtered by a TripleMatchIterator, which keeps only those matching the TripleMatch.
/**
* Author: Pierre Lindenbaum PhD
* plindenbaum@yahoo.fr
* Date: 2012-11
* Motivation: RDFGraph from openoffice calc files
*
*/
package oocalc;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.namespace.QName;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.Attribute;
import javax.xml.stream.events.EndElement;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;
import com.hp.hpl.jena.assembler.assemblers.AssemblerBase;
import com.hp.hpl.jena.assembler.Assembler;
import com.hp.hpl.jena.sparql.core.assembler.AssemblerUtils;
import com.hp.hpl.jena.assembler.Mode;
import com.hp.hpl.jena.datatypes.RDFDatatype;
import com.hp.hpl.jena.datatypes.xsd.XSDDatatype;
import com.hp.hpl.jena.rdf.model.Resource;
import com.hp.hpl.jena.rdf.model.Statement;
import com.hp.hpl.jena.rdf.model.StmtIterator;
import com.hp.hpl.jena.rdf.model.Property;
import com.hp.hpl.jena.graph.Node;
import com.hp.hpl.jena.graph.Triple;
import com.hp.hpl.jena.graph.TripleMatch;
import com.hp.hpl.jena.graph.TripleMatchIterator;
import com.hp.hpl.jena.graph.impl.GraphBase;
import com.hp.hpl.jena.rdf.model.AnonId;
import com.hp.hpl.jena.rdf.model.ResourceFactory;
import com.hp.hpl.jena.rdf.model.impl.ModelCom;
import com.hp.hpl.jena.util.iterator.ExtendedIterator;
import com.hp.hpl.jena.util.iterator.NiceIterator;
import com.hp.hpl.jena.sparql.core.DatasetImpl;
import com.hp.hpl.jena.vocabulary.DC;
import com.hp.hpl.jena.vocabulary.RDF;
import com.hp.hpl.jena.vocabulary.XSD;
import com.hp.hpl.jena.query.Dataset;
import org.slf4j.LoggerFactory;
import com.hp.hpl.jena.query.*;
/**
* implementation of a RDF Graph for OpenOffice calc
*
*/
public class OpenOfficeCalcGraph
extends GraphBase
{
/** logger */
protected static final org.slf4j.Logger LOG= LoggerFactory.getLogger("ooffice2rdf");
/** namespaces */
private static final String OFFICE="urn:oasis:names:tc:opendocument:xmlns:office:1.0";
private static final String TABLE="urn:oasis:names:tc:opendocument:xmlns:table:1.0";
private static final String TEXT="urn:oasis:names:tc:opendocument:xmlns:text:1.0";
private static final String NS="http://rdf.lindenb.org/";
/** attributes */
private static final QName number_columns_repeated=new QName(TABLE,"number-columns-repeated","table");
private static final QName number_rows_repeated=new QName(TABLE,"number-rows-repeated","table");
private static final QName value_type=new QName(OFFICE,"value-type","office");
private static final QName value=new QName(OFFICE,"value","office");
private static final QName name=new QName(TABLE,"name","table");
//rdf:type Node
private static final Node rdfType=Node.createURI(RDF.type.getURI());
//all open office files
private List<File> caclFiles=null;
/** static Assembler for OpenOfficeCalcGraph
* An assembler creates a Dataset(graph) from a RDF-based configuration file.
* It is called by Fuseki
*/
public static OpenOfficeAssembler assembler = new OpenOfficeAssembler();
public static class OpenOfficeAssembler extends AssemblerBase implements Assembler
{
@Override
public Object open( Assembler a, Resource root, Mode mode )
{
//read the configuration and get the files
List<File> files=new ArrayList<File>();
StmtIterator iter=root.listProperties(fileRsrc);
while(iter.hasNext())
{
Statement stmt=iter.nextStatement();
if(!stmt.getObject().isLiteral()) throw new RuntimeException("Not a literal "+stmt);
String lit=stmt.getString();
File file=new File(lit);
if(!file.exists()) throw new RuntimeException("File not found : "+file);
if(!file.getName().endsWith(".ods")) throw new RuntimeException("Not an .ods file : "+file);
files.add(file);
}
iter.close();
OpenOfficeCalcGraph g=new OpenOfficeCalcGraph(files);
OpenOfficeCalcModel m=new OpenOfficeCalcModel(g);
Dataset ds=new DatasetImpl(m);
return ds;
}
}
/** Initializer for Fuseki */
private static boolean init_called = false ;
private static final Resource buildRsrc=ResourceFactory.createResource(NS+"build");
private static final Property fileRsrc=ResourceFactory.createProperty(NS+"file");
/** static initializer, when this class is loaded,
* it tells Fuseki that there is another assembler using Assembler.general
* the resource-name for this assembler is buildRsrc
*/
static { init() ; }
private static void init()
{
if(init_called) return;
LOG.info("Calling OpenOfficeCalcGraph init");
AssemblerUtils.init();
Assembler.general.implementWith(buildRsrc,assembler);
init_called=true;
}
/** RDF Model for OpenOfficeCalcGraph */
public static class OpenOfficeCalcModel extends ModelCom
{
public OpenOfficeCalcModel(OpenOfficeCalcGraph g)
{
super(g);
}
}
/* one row in the spreadsheet */
private static class Row
{
int repeat=1;
private List<Cell> cells=new ArrayList<Cell>();
}
/* one cell in the spreadsheet */
private static class Cell
{
int repeat=1;
String type=null;
String value=null;
String literal=null;
}
/** Constructor from an array of OO files */
public OpenOfficeCalcGraph(List<File> calcFiles)
{
this.caclFiles=new ArrayList<File>(calcFiles);
this.getPrefixMapping().setNsPrefix("office", NS);
this.getPrefixMapping().setNsPrefix("xsd", XSD.getURI());
this.getPrefixMapping().setNsPrefix("dc", DC.getURI());
}
@Override
protected ExtendedIterator<Triple> graphBaseFind(TripleMatch matcher)
{
return new TripleMatchIterator((Triple)matcher, new CellIterator());
}
/** parse the openoffice files and get the Triples */
private class CellIterator extends NiceIterator<Triple>
{
/** current index in array of OO files */
private int fileIndex=-1;
/** buffer of triples */
private List<Triple> buffer=new LinkedList<Triple>();
/** next triple to be returned */
private Triple next=null;
/** was hasNext() called ? */
private boolean hasNextCalled=false;
/** current OO file opened */
private File ioFile=null;
/** Zip Handler for OO file */
private ZipFile zipFile=null;
/** Input Stream for current Zip entry */
private InputStream zipInputStream;
/** xml-handler for current zip entry */
private XMLEventReader xmlEventReader;
/* rdf subject for file */
private Node fileRsrc=null;
/* rdf subject for tab */
private Node tabRsrc=null;
/** current tab index */
private int tabIndex=0;
/* current column */
private int X=0;
/** current row */
private int Y=0;
private void add(Node s,Node p,Node o)
{
this.buffer.add(Triple.create(s, p, o));
}
public CellIterator()
{
}
private boolean isA(XMLEvent evt,String ns,String localName)
{
QName q=null;
if(evt.isStartElement())
{
q=evt.asStartElement().getName();
}
else if(evt.isEndElement())
{
q=evt.asEndElement().getName();
}
return q!=null &&
q.getNamespaceURI().equals(ns) &&
q.getLocalPart().equals(localName)
;
}
@Override
public boolean hasNext()
{
if(!hasNextCalled)
{
hasNextCalled=true;
next=null;
for(;;)
{
if(!buffer.isEmpty())
{
next=buffer.remove(0);
break;
}
try
{
if(xmlEventReader==null)
{
//open next file
if(fileIndex+1>=OpenOfficeCalcGraph.this.caclFiles.size()) break;
this.fileIndex++;
this.tabIndex=0;
//open XML StaX reader for current OO file
XMLInputFactory xmlInputFactory = XMLInputFactory.newInstance();
xmlInputFactory.setProperty(XMLInputFactory.IS_NAMESPACE_AWARE, Boolean.TRUE);
xmlInputFactory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
xmlInputFactory.setProperty(XMLInputFactory.IS_REPLACING_ENTITY_REFERENCES, Boolean.TRUE);
try
{
this.ioFile=OpenOfficeCalcGraph.this.caclFiles.get(this.fileIndex);
this.zipFile=new ZipFile(this.ioFile);
ZipEntry zipEntry=zipFile.getEntry("content.xml");
if(zipEntry==null) throw new RuntimeException("Cannot get content.xml");
this.zipInputStream=this.zipFile.getInputStream(zipEntry);
xmlEventReader= xmlInputFactory.createXMLEventReader(this.zipInputStream);
//describe the file as RDF
this.fileRsrc=Node.createURI(this.ioFile.toURI().toASCIIString());
add(this.fileRsrc,rdfType,Node.createURI(NS+"Spreadsheet"));
add(this.fileRsrc,Node.createURI(DC.title.getURI()),Node.createLiteral(this.ioFile.getName()));
continue;
}
catch (Exception e)
{
throw new RuntimeException(e);
}
}
if(xmlEventReader.hasNext())
{
Attribute att=null;
XMLEvent evt=xmlEventReader.nextEvent();
if(evt.isStartElement())
{
StartElement E=evt.asStartElement();
if(isA(E,TABLE,"table"))
{
att=E.getAttributeByName(name);
this.tabIndex++;
//describe the tab as RDF
this.tabRsrc=Node.createURI(this.ioFile.toURI().toASCIIString()+"/t"+tabIndex);
add(this.tabRsrc,Node.createURI(NS+"file"),this.fileRsrc);
add(this.tabRsrc,rdfType,Node.createURI(NS+"Table"));
add(this.tabRsrc,Node.createURI(DC.title.getURI()),Node.createLiteral(att.getValue()));
this.X=0;
this.Y=0;
}
else if(isA(E,TABLE,"table-row"))
{
//parse the row
Row row=parseRow(E);
//create the statements for that row
for(int i=0;i< row.repeat;++i)
{
this.X=0;
this.Y++;
for(Cell cell:row.cells)
{
for(int j=0;j< cell.repeat;++j)
{
this.X++;
if(cell.value==null && cell.literal==null) continue;
Node subject=Node.createURI(this.ioFile.toURI().toASCIIString()+"/t"+tabIndex+"/y"+Y+"/x"+X);
add(subject,Node.createURI(NS+"table"),this.tabRsrc);
add(subject,rdfType,Node.createURI(NS+"Cell"));
add(subject,Node.createURI(NS+"X"),Node.createLiteral(String.valueOf(X),null,XSDDatatype.XSDint));
add(subject,Node.createURI(NS+"Y"),Node.createLiteral(String.valueOf(Y),null,XSDDatatype.XSDint));
Node cellValue=null;
if(cell.type!=null && cell.value!=null)
{
XSDDatatype dataType=XSDDatatype.XSDstring;
if(cell.type.equals("float"))
{
dataType=XSDDatatype.XSDfloat;
}
else if(cell.type.equals("int"))
{
dataType=XSDDatatype.XSDint;
}
cellValue=Node.createLiteral(cell.value, null, dataType);
}
else
{
cellValue=Node.createLiteral(String.valueOf(cell.literal));
}
add( subject,
Node.createURI(NS+"value"),
cellValue
);
}
}
}
}
}
else if(evt.isEndElement())
{
if(isA(evt,TABLE,"table"))
{
this.tabRsrc=null;
}
}
}
else //we're done for that file.
{
this.xmlEventReader.close();
this.zipInputStream.close();
this.zipFile.close();
this.xmlEventReader=null;
this.zipInputStream=null;
this.zipFile=null;
this.fileRsrc=null;
this.ioFile=null;
}
}
catch(Exception err)
{
throw new RuntimeException(err);
}
}
}
return next!=null;
}
@Override
public void close()
{
try { if(this.xmlEventReader!=null) this.xmlEventReader.close(); } catch (Exception e) {}
this.xmlEventReader=null;
try { if(this.zipInputStream!=null) this.zipInputStream.close(); } catch (Exception e) {}
this.zipInputStream=null;
try { if(this.zipFile!=null) this.zipFile.close(); } catch (Exception e) {}
this.zipFile=null;
this.buffer.clear();
this.fileIndex=caclFiles.size();
}
@Override
public Triple next()
{
if(!hasNextCalled) hasNext();
if(!hasNext()) throw new IllegalStateException();
Triple t=next;
next=null;
hasNextCalled=false;
return t;
}
/** parses a table:table-row */
private Row parseRow(StartElement root)
throws XMLStreamException
{
Row row=new Row();
Attribute att=root.getAttributeByName(number_rows_repeated);
if(att!=null)
{
row.repeat=Integer.parseInt(att.getValue());
}
while(this.xmlEventReader.hasNext())
{
XMLEvent evt=this.xmlEventReader.nextEvent();
if(evt.isStartElement())
{
StartElement E=evt.asStartElement();
if(isA(E,TABLE,"table-cell"))
{
row.cells.add(parseCell(E));
}
}
else if(evt.isEndElement())
{
if(isA(evt,TABLE,"table-row"))
{
break;
}
}
}
return row;
}
/** parses a table:table-cell */
private Cell parseCell(StartElement root)
throws XMLStreamException
{
Cell cell=new Cell();
Attribute att=root.getAttributeByName(number_columns_repeated);
if(att!=null)
{
cell.repeat=Integer.parseInt(att.getValue());
}
att=root.getAttributeByName(value_type);
if(att!=null)
{
cell.type=att.getValue();
}
att=root.getAttributeByName(value);
if(att!=null)
{
cell.value=att.getValue();
cell.literal=cell.value;
}
while(this.xmlEventReader.hasNext())
{
XMLEvent evt=this.xmlEventReader.nextEvent();
if(evt.isStartElement())
{
StartElement E=evt.asStartElement();
if(isA(E,TEXT,"p"))
{
cell.literal=parseText(E);
}
}
else if(evt.isEndElement())
{
if(isA(evt,TABLE,"table-cell"))
{
break;
}
}
}
return cell;
}
/** returns the content of <text:p/> */
private String parseText(StartElement root)
throws XMLStreamException
{
StringBuilder b=new StringBuilder();
while(xmlEventReader.hasNext())
{
XMLEvent evt=this.xmlEventReader.nextEvent();
if(evt.isStartElement())
{
throw new IllegalStateException();
}
else if(evt.isEndElement())
{
if(isA(evt,TEXT,"p"))
{
return b.toString();
}
}
else if(evt.isCharacters())
{
b.append(evt.asCharacters().getData());
}
}
throw new IllegalStateException();
}
}
public static void main(String[] args) throws Exception
{
if(args.length<2)
{
System.err.println("Usage: query.sparql file1.ods, file2.ods... filen.ods");
return;
}
List<File> files=new ArrayList<File>();
for(int optind=1;optind< args.length;++optind)
{
files.add(new File(args[optind]));
}
OpenOfficeCalcGraph g=new OpenOfficeCalcGraph(files);
OpenOfficeCalcModel m=new OpenOfficeCalcModel(g);
com.hp.hpl.jena.query.Query query = QueryFactory.read(args[0]) ;
LOG.info("starting query");
QueryExecution qexec = QueryExecutionFactory.create(query, m) ;
try {
ResultSet results = qexec.execSelect();
ResultSetFormatter.out(System.out,results,g.getPrefixMapping());
} finally { qexec.close() ; }
}
}
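
The StAX scanning at the heart of CellIterator can be tried in isolation. The sketch below is a simplified, hypothetical variant (the class CellScanSketch and its output format "row,column=text" are mine): it only handles table:number-columns-repeated and the text:p content, and ignores table:number-rows-repeated, value types and multiple tables.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.namespace.QName;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.Attribute;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;

/** Hypothetical, simplified cell scanner for a content.xml fragment. */
class CellScanSketch {
    static final String TABLE = "urn:oasis:names:tc:opendocument:xmlns:table:1.0";
    static final String TEXT = "urn:oasis:names:tc:opendocument:xmlns:text:1.0";
    static final QName COLS_REPEATED = new QName(TABLE, "number-columns-repeated");

    private static boolean is(QName q, String ns, String localName) {
        return q.getNamespaceURI().equals(ns) && q.getLocalPart().equals(localName);
    }

    /** Returns "y,x=text" for each non-empty cell; x and y are 1-based. */
    static List<String> scan(String xml) {
        List<String> out = new ArrayList<>();
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            factory.setProperty(XMLInputFactory.IS_NAMESPACE_AWARE, Boolean.TRUE);
            factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
            XMLEventReader reader = factory.createXMLEventReader(new StringReader(xml));
            int x = 0, y = 0, repeat = 1;
            StringBuilder text = null; // non-null only while inside <text:p>
            String cellText = null;    // text of the current cell, if any
            while (reader.hasNext()) {
                XMLEvent evt = reader.nextEvent();
                if (evt.isStartElement()) {
                    StartElement e = evt.asStartElement();
                    if (is(e.getName(), TABLE, "table-row")) {
                        y++;
                        x = 0;
                    } else if (is(e.getName(), TABLE, "table-cell")) {
                        Attribute att = e.getAttributeByName(COLS_REPEATED);
                        repeat = (att == null ? 1 : Integer.parseInt(att.getValue()));
                        cellText = null;
                    } else if (is(e.getName(), TEXT, "p")) {
                        text = new StringBuilder();
                    }
                } else if (evt.isCharacters() && text != null) {
                    text.append(evt.asCharacters().getData());
                } else if (evt.isEndElement()) {
                    QName q = evt.asEndElement().getName();
                    if (is(q, TEXT, "p") && text != null) {
                        cellText = text.toString();
                        text = null;
                    } else if (is(q, TABLE, "table-cell")) {
                        // a repeated cell occupies 'repeat' consecutive columns
                        for (int j = 0; j < repeat; ++j) {
                            x++;
                            if (cellText != null) out.add(y + "," + x + "=" + cellText);
                        }
                    }
                }
            }
            reader.close();
        } catch (XMLStreamException err) {
            throw new IllegalStateException(err);
        }
        return out;
    }
}
```

Fed with the content.xml of a spreadsheet, scan() lists the position and text of each non-empty cell, which is the information OpenOfficeCalcGraph turns into triples.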

Compilation

the Makefile:
CP=...#path to the jars of JENA/ARQ/etc... e.g: =`find ${ARQ} -name "*.jar" | tr "\n" ":"`
.PHONY: all
all:
 javac -cp ${CP} -sourcepath src src/oocalc/OpenOfficeCalcGraph.java
 jar cvf dist/openoffice2rdf.jar -C src .

Querying using sparql

Now that the Graph has been implemented and compiled, one can query it using ARQ, the sparql engine of Jena:

The spreadsheet

I've created the following spreadsheet and saved it in a file named "jeter.ods":
CHROM  START  END   NAME
chr1   100    200   rs654
chr1   150    250   rs264
chr1   200    300   rs610
chr1   250    350   rs929
chr1   300    400   rs408
chr1   350    450   rs346
chr1   400    500   rs430
chr1   450    550   rs735
chr1   500    600   rs575
chr1   550    650   rs891
chr1   600    700   rs627
chr1   650    750   rs650
chr1   700    800   rs715
chr1   750    850   rs467
chr1   800    900   rs882
chr1   850    950   rs301
chr1   900    1000  rs643
chr1   950    1050  rs246
chr1   1000   1100  rs178
chr1   1050   1150  rs928
chr1   1100   1200  rs213

The sparql query

The following SPARQL query returns the information about the cells in the 3rd row of the spreadsheet:
PREFIX oo: <http://rdf.lindenb.org/>
SELECT ?s ?p ?o
WHERE
{
?s oo:Y "3"^^<http://www.w3.org/2001/XMLSchema#int> .
?s a oo:Cell .
?s ?p ?o .
}


Invoke:
java -cp `find /home/lindenb/.ivy2/cache -name "*.jar" | tr "\n" ":"`:dist/openoffice2rdf.jar  \
 oocalc.OpenOfficeCalcGraph test.sparql /home/lindenb/jeter.ods

Result:
-----------------------------------------------------------------------------------------------------------------------------------
| s                                       | p                                                 | o                                 |
===================================================================================================================================
| <file:/home/lindenb/jeter.ods/t1/y3/x1> | office:table                                      | <file:/home/lindenb/jeter.ods/t1> |
| <file:/home/lindenb/jeter.ods/t1/y3/x1> | <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> | office:Cell                       |
| <file:/home/lindenb/jeter.ods/t1/y3/x1> | office:X                                          | "1"^^xsd:int                      |
| <file:/home/lindenb/jeter.ods/t1/y3/x1> | office:Y                                          | "3"^^xsd:int                      |
| <file:/home/lindenb/jeter.ods/t1/y3/x1> | office:value                                      | "chr1"                            |
| <file:/home/lindenb/jeter.ods/t1/y3/x2> | office:table                                      | <file:/home/lindenb/jeter.ods/t1> |
| <file:/home/lindenb/jeter.ods/t1/y3/x2> | <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> | office:Cell                       |
| <file:/home/lindenb/jeter.ods/t1/y3/x2> | office:X                                          | "2"^^xsd:int                      |
| <file:/home/lindenb/jeter.ods/t1/y3/x2> | office:Y                                          | "3"^^xsd:int                      |
| <file:/home/lindenb/jeter.ods/t1/y3/x2> | office:value                                      | "150"^^xsd:float                  |
| <file:/home/lindenb/jeter.ods/t1/y3/x3> | office:table                                      | <file:/home/lindenb/jeter.ods/t1> |
| <file:/home/lindenb/jeter.ods/t1/y3/x3> | <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> | office:Cell                       |
| <file:/home/lindenb/jeter.ods/t1/y3/x3> | office:X                                          | "3"^^xsd:int                      |
| <file:/home/lindenb/jeter.ods/t1/y3/x3> | office:Y                                          | "3"^^xsd:int                      |
| <file:/home/lindenb/jeter.ods/t1/y3/x3> | office:value                                      | "250"^^xsd:float                  |
| <file:/home/lindenb/jeter.ods/t1/y3/x4> | office:table                                      | <file:/home/lindenb/jeter.ods/t1> |
| <file:/home/lindenb/jeter.ods/t1/y3/x4> | <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> | office:Cell                       |
| <file:/home/lindenb/jeter.ods/t1/y3/x4> | office:X                                          | "4"^^xsd:int                      |
| <file:/home/lindenb/jeter.ods/t1/y3/x4> | office:Y                                          | "3"^^xsd:int                      |
| <file:/home/lindenb/jeter.ods/t1/y3/x4> | office:value                                      | "rs264"                           |
-----------------------------------------------------------------------------------------------------------------------------------
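
The subjects in this table follow the naming scheme used in CellIterator: the URI of the file, followed by the 1-based index of the tab, the row and the column. A minimal sketch of that scheme (the class CellUriSketch and the method cellUri are hypothetical names, not part of the original code):

```java
import java.io.File;

/** Hypothetical helper reproducing the subject URIs built by CellIterator. */
class CellUriSketch {
    /** file URI + "/t" + tab + "/y" + row + "/x" + column, all indexes being 1-based. */
    static String cellUri(File ods, int tab, int row, int column) {
        return ods.toURI().toASCIIString() + "/t" + tab + "/y" + row + "/x" + column;
    }

    public static void main(String[] args) {
        // second column of the third row of the first tab
        System.out.println(cellUri(new File("/home/lindenb/jeter.ods"), 1, 3, 2));
    }
}
```

On a Unix-like system, the printed URI should have the same form as the subjects shown in the result above; the exact file: prefix depends on the platform.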

Serving the OpenOffice spreadsheets as RDF over HTTP

Fuseki is a SPARQL server. It provides REST-style SPARQL HTTP Update, SPARQL Query, and SPARQL Update using the SPARQL protocol over HTTP. We're going to deploy the OpenOfficeCalcGraph in Fuseki to query a set of OpenOffice files.

Download and install Fuseki

wget https://repository.apache.org/content/repositories/releases/org/apache/jena/jena-fuseki/0.2.5/jena-fuseki-0.2.5-distribution.tar.gz
tar xfz jena-fuseki-0.2.5-distribution.tar.gz
rm jena-fuseki-0.2.5-distribution.tar.gz

Tell Fuseki about our OpenOfficeCalcGraph

We need to create a config file for Fuseki. That was the most complicated part as the process is not clearly documented:
# Licensed under the terms of http://www.apache.org/licenses/LICENSE-2.0
## A collection of example configurations for Fuseki
@prefix : <#> .
@prefix fuseki: <http://jena.apache.org/fuseki#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix openoffice: <http://rdf.lindenb.org/> .
@prefix ja: <http://jena.hpl.hp.com/2005/11/Assembler#> .
[] rdf:type fuseki:Server ;
# Timeout - server-wide default: milliseconds.
# Format 1: "1000" -- 1 second timeout
# Format 2: "10000,60000" -- 10s timeout to first result, then 60s timeout to for rest of query.
# See java doc for ARQ.queryTimeout
# ja:context [ ja:cxtName "arq:queryTimeout" ; ja:cxtValue "10000" ] ;
# ja:loadClass "your.code.Class" ;
fuseki:services (
<#service1>
<#ooservice>
) .
# Custom code.
[] ja:loadClass "oocalc.OpenOfficeCalcGraph" .
openoffice:GraphTDB rdfs:subClassOf ja:Model .
# OpenOffice
## ---------------------------------------------------------------
## Updatable in-memory dataset.
<#service1> rdf:type fuseki:Service ;
# URI of the dataset -- http://host:port/ds
fuseki:name "ds" ;
# SPARQL query services e.g. http://host:port/ds/sparql?query=...
fuseki:serviceQuery "sparql" ;
fuseki:serviceQuery "query" ;
# SPARQL Update service -- http://host:port/ds/update?request=...
fuseki:serviceUpdate "update" ; # SPARQL query service -- /ds/update
# Upload service -- http://host:port/ds/upload?graph=default or ?graph=URI or ?default
# followed by a multipart body, each part being RDF syntax.
# Syntax determined by the file name extension.
fuseki:serviceUpload "upload" ; # Non-SPARQL upload service
# SPARQL Graph store protocol (read and write)
# GET, PUT, POST DELETE to http://host:port/ds/data?graph= or ?default=
fuseki:serviceReadWriteGraphStore "data" ;
# A separate read-only graph store endpoint:
fuseki:serviceReadGraphStore "get" ; # Graph store protocol (read only) -- /ds/get
fuseki:dataset <#emptyDataset> ;
.
## In-memory, initially empty.
<#emptyDataset> rdf:type ja:RDFDataset .
<#ooservice> rdf:type fuseki:Service ;
rdfs:label "OpenOffice Service (R)" ;
fuseki:name "openoffice" ;
fuseki:serviceQuery "query" ;
fuseki:serviceQuery "sparql" ;
fuseki:serviceUpdate "update" ;
fuseki:serviceReadGraphStore "data" ;
fuseki:serviceReadGraphStore "get" ;
fuseki:dataset <#ooservice> ;
.
<#ooservice> rdf:type openoffice:build ;
openoffice:file "/home/lindenb/jeter.ods" ;
openoffice:file "/home/lindenb/jeter2.ods" ;
.
# ---- RDFS Inference models
# These must be incorporated in a dataset in order to use them.
# All in one file.
<#model_inf_1> rdfs:label "Inf-1" ;
ja:baseModel
[ a ja:MemoryModel ;
ja:content [ja:externalContent <file:Data/test_data_rdfs.ttl>] ;
] ;
ja:reasoner
[ ja:reasonerURL <http://jena.hpl.hp.com/2003/RDFSExptRuleReasoner> ]
.
# Separate ABox and TBox
<#model_inf_2> rdfs:label "Inf-2" ;
ja:baseModel
[ a ja:MemoryModel ;
ja:content [ja:externalContent <file:Data/test_abox.ttl>] ;
ja:content [ja:externalContent <file:Data/test_tbox.ttl>] ;
] ;
ja:reasoner
[ ja:reasonerURL <http://jena.hpl.hp.com/2003/RDFSExptRuleReasoner> ]
.

The line:
[] ja:loadClass "oocalc.OpenOfficeCalcGraph" .
loads the class oocalc.OpenOfficeCalcGraph. The class OpenOfficeCalcGraph contains a static initialisation method:
(...)
static { init() ; }
    private static void init()
        {
        (...)
In this static method, a Jena Assembler for OpenOfficeCalcGraph is registered under the resource named: "http://rdf.lindenb.org/build".
public static OpenOfficeAssembler assembler = new OpenOfficeAssembler();
(...)
private static final Resource buildRsrc=ResourceFactory.createResource(NS+"build");
(...)
Assembler.general.implementWith(buildRsrc,assembler);
(...)
An Assembler configures a Graph from an RDF config file. In our example, the config contains the paths to the OpenOffice spreadsheets:
<#ooservice> rdf:type openoffice:build ;
    openoffice:file "/home/lindenb/jeter.ods" ;
    openoffice:file "/home/lindenb/jeter2.ods" ;
.
This config is read in the Assembler:
public static class OpenOfficeAssembler extends AssemblerBase implements Assembler
      {
      @Override
      public Object open( Assembler a, Resource root, Mode mode )
            {
            Property fileRsrc=ResourceFactory.createProperty(NS+"file");
            //read the configuration and get the files
            List<File> files=new ArrayList<File>();
            StmtIterator iter=root.listProperties(fileRsrc);
     (...)
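
The mechanism (loading a class by name runs its static initializer, which registers a factory under a known resource name) can be reproduced in plain Java. This is only an analogy and not the Jena Assembler API: SketchRegistry stands in for Assembler.general, SketchPlugin for OpenOfficeCalcGraph, and the factory simply returns a string instead of a Dataset.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical stand-in for Assembler.general: maps a type URI to a factory. */
class SketchRegistry {
    static final Map<String, Function<List<String>, String>> FACTORIES = new HashMap<>();
}

/** Hypothetical stand-in for OpenOfficeCalcGraph: registers itself when initialized. */
class SketchPlugin {
    static {
        // same role as Assembler.general.implementWith(buildRsrc, assembler)
        SketchRegistry.FACTORIES.put("http://rdf.lindenb.org/build",
            files -> "graph over " + files);
    }
}

/** Hypothetical stand-in for the server side: handles 'loadClass', then builds the object. */
class LoadClassSketch {
    static String build(String className, String typeUri, List<String> files) {
        try {
            // loading and initializing the class runs its static block
            Class.forName(className);
        } catch (ClassNotFoundException err) {
            throw new IllegalArgumentException(err);
        }
        Function<List<String>, String> factory = SketchRegistry.FACTORIES.get(typeUri);
        if (factory == null) {
            throw new IllegalStateException("No factory registered for " + typeUri);
        }
        return factory.apply(files);
    }
}
```

Here, LoadClassSketch.build(SketchPlugin.class.getName(), "http://rdf.lindenb.org/build", files) plays the role of the config file: the class name corresponds to ja:loadClass, the URI to the rdf:type of <#ooservice>, and the list to the openoffice:file properties.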

Start Fuseki with the config file:

$ cd jena-fuseki-0.2.5
$ java -cp fuseki-server.jar:/path/to/openoffice2rdf.jar  org.apache.jena.fuseki.FusekiCmd \
    --debug  -v --config /path/to/openoffice.ttl
14:11:50 INFO  Config               :: Configuration file: ../openoffice.ttl
14:11:50 INFO  Config               :: Service: :service1
14:11:50 INFO  Config               ::   name = ds
14:11:50 INFO  Config               ::   query = /ds/query
14:11:50 INFO  Config               ::   query = /ds/sparql
14:11:50 INFO  Config               ::   update = /ds/update
14:11:50 INFO  Config               ::   upload = /ds/upload
14:11:50 INFO  Config               ::   graphStore(RW) = /ds/data
14:11:50 INFO  Config               ::   graphStore(R) = /ds/get
14:11:50 INFO  ooffice2rdf          :: Calling OpenOfficeCalcGraph init
14:11:50 INFO  Config               :: Service: OpenOffice Service (R)
14:11:50 INFO  Config               ::   name = openoffice
14:11:50 INFO  Config               ::   query = /openoffice/sparql
14:11:50 INFO  Config               ::   query = /openoffice/query
14:11:50 INFO  Config               ::   update = /openoffice/update
14:11:50 INFO  Config               ::   graphStore(R) = /openoffice/get
14:11:50 INFO  Config               ::   graphStore(R) = /openoffice/data
14:11:51 INFO  Server               :: Dataset path = /ds
14:11:51 INFO  Server               :: Dataset path = /openoffice
14:11:51 INFO  Server               :: Fuseki 0.2.5 2012-10-20T17:03:29+0100
14:11:51 INFO  Server               :: Started 2012/11/13 14:11:51 CET on port 3030
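
Once the server is started, the /openoffice/query endpoint listed in the log can also be called from a program, since the SPARQL protocol accepts the query as a "query" parameter of an HTTP GET. The following sketch uses only the JDK; the class FusekiQueryUrlSketch and its methods are hypothetical, and the Accept header (the standard media type for SPARQL JSON results) is an assumption about what one would want back.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.io.UnsupportedEncodingException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

/** Hypothetical client for a SPARQL endpoint such as http://localhost:3030/openoffice/query */
class FusekiQueryUrlSketch {
    /** Builds the GET URL: endpoint + "?query=" + URL-encoded SPARQL query. */
    static String buildQueryUrl(String endpoint, String sparql) {
        try {
            return endpoint + "?query=" + URLEncoder.encode(sparql, "UTF-8");
        } catch (UnsupportedEncodingException err) {
            throw new IllegalStateException(err); // UTF-8 is always supported
        }
    }

    /** Sends the GET request and returns the response body as text. */
    static String fetch(String url) {
        try {
            HttpURLConnection con = (HttpURLConnection) new URL(url).openConnection();
            // assumption: ask for SPARQL results in JSON
            con.setRequestProperty("Accept", "application/sparql-results+json");
            try (InputStream in = con.getInputStream()) {
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] buffer = new byte[8192];
                int n;
                while ((n = in.read(buffer)) != -1) {
                    out.write(buffer, 0, n);
                }
                return new String(out.toByteArray(), StandardCharsets.UTF_8);
            } finally {
                con.disconnect();
            }
        } catch (IOException err) {
            throw new UncheckedIOException(err);
        }
    }
}
```

For example, fetch(buildQueryUrl("http://localhost:3030/openoffice/query", query)) would send the SPARQL query to the running server; it fails with an exception if no server is listening on that port.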
Open your browser at http://localhost:3030, select the control panel at http://localhost:3030/control-panel.tpl and select /openoffice:
(Fuseki Control Panel: a "Dataset" selector.)

The following form is displayed, with a "SPARQL Query" text area, an "Output" selector, an "XSLT style sheet (blank for none)" field and a "Force the accept header to text/plain regardless" option.



You can now copy, paste and run the previous sparql query:
--------------------------------------------------------------------------------------------------------------------------------------------------
| s                                        | p                                                 | o                                               |
==================================================================================================================================================
| <file:/home/lindenb/jeter.ods/t1/y3/x1>  | <http://rdf.lindenb.org/table>                    | <file:/home/lindenb/jeter.ods/t1>               |
| <file:/home/lindenb/jeter.ods/t1/y3/x1>  | <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> | <http://rdf.lindenb.org/Cell>                   |
| <file:/home/lindenb/jeter.ods/t1/y3/x1>  | <http://rdf.lindenb.org/X>                        | "1"^^<http://www.w3.org/2001/XMLSchema#int>     |
| <file:/home/lindenb/jeter.ods/t1/y3/x1>  | <http://rdf.lindenb.org/Y>                        | "3"^^<http://www.w3.org/2001/XMLSchema#int>     |
| <file:/home/lindenb/jeter.ods/t1/y3/x1>  | <http://rdf.lindenb.org/value>                    | "chr1"                                          |
| <file:/home/lindenb/jeter.ods/t1/y3/x2>  | <http://rdf.lindenb.org/table>                    | <file:/home/lindenb/jeter.ods/t1>               |
| <file:/home/lindenb/jeter.ods/t1/y3/x2>  | <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> | <http://rdf.lindenb.org/Cell>                   |
| <file:/home/lindenb/jeter.ods/t1/y3/x2>  | <http://rdf.lindenb.org/X>                        | "2"^^<http://www.w3.org/2001/XMLSchema#int>     |
| <file:/home/lindenb/jeter.ods/t1/y3/x2>  | <http://rdf.lindenb.org/Y>                        | "3"^^<http://www.w3.org/2001/XMLSchema#int>     |
| <file:/home/lindenb/jeter.ods/t1/y3/x2>  | <http://rdf.lindenb.org/value>                    | "150"^^<http://www.w3.org/2001/XMLSchema#float> |
| <file:/home/lindenb/jeter.ods/t1/y3/x3>  | <http://rdf.lindenb.org/table>                    | <file:/home/lindenb/jeter.ods/t1>               |
| <file:/home/lindenb/jeter.ods/t1/y3/x3>  | <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> | <http://rdf.lindenb.org/Cell>                   |
| <file:/home/lindenb/jeter.ods/t1/y3/x3>  | <http://rdf.lindenb.org/X>                        | "3"^^<http://www.w3.org/2001/XMLSchema#int>     |
| <file:/home/lindenb/jeter.ods/t1/y3/x3>  | <http://rdf.lindenb.org/Y>                        | "3"^^<http://www.w3.org/2001/XMLSchema#int>     |
| <file:/home/lindenb/jeter.ods/t1/y3/x3>  | <http://rdf.lindenb.org/value>                    | "250"^^<http://www.w3.org/2001/XMLSchema#float> |
| <file:/home/lindenb/jeter.ods/t1/y3/x4>  | <http://rdf.lindenb.org/table>                    | <file:/home/lindenb/jeter.ods/t1>               |
| <file:/home/lindenb/jeter.ods/t1/y3/x4>  | <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> | <http://rdf.lindenb.org/Cell>                   |
| <file:/home/lindenb/jeter.ods/t1/y3/x4>  | <http://rdf.lindenb.org/X>                        | "4"^^<http://www.w3.org/2001/XMLSchema#int>     |
| <file:/home/lindenb/jeter.ods/t1/y3/x4>  | <http://rdf.lindenb.org/Y>                        | "3"^^<http://www.w3.org/2001/XMLSchema#int>     |
| <file:/home/lindenb/jeter.ods/t1/y3/x4>  | <http://rdf.lindenb.org/value>                    | "rs264"                                         |
| <file:/home/lindenb/jeter2.ods/t1/y3/x1> | <http://rdf.lindenb.org/table>                    | <file:/home/lindenb/jeter2.ods/t1>              |
| <file:/home/lindenb/jeter2.ods/t1/y3/x1> | <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> | <http://rdf.lindenb.org/Cell>                   |
| <file:/home/lindenb/jeter2.ods/t1/y3/x1> | <http://rdf.lindenb.org/X>                        | "1"^^<http://www.w3.org/2001/XMLSchema#int>     |
| <file:/home/lindenb/jeter2.ods/t1/y3/x1> | <http://rdf.lindenb.org/Y>                        | "3"^^<http://www.w3.org/2001/XMLSchema#int>     |
| <file:/home/lindenb/jeter2.ods/t1/y3/x1> | <http://rdf.lindenb.org/value>                    | "1"^^<http://www.w3.org/2001/XMLSchema#float>   |
| <file:/home/lindenb/jeter2.ods/t1/y3/x2> | <http://rdf.lindenb.org/table>                    | <file:/home/lindenb/jeter2.od
| <file:/home/lindenb/jeter2.ods/t1/y3/x2> | <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> | <http://rdf.lindenb.org/Cell>                   |
| <file:/home/lindenb/jeter2.ods/t1/y3/x2> | <http://rdf.lindenb.org/X>                        | "2"^^<http://www.w3.org/2001/XMLSchema#int>     |
| <file:/home/lindenb/jeter2.ods/t1/y3/x2> | <http://rdf.lindenb.org/Y>                        | "3"^^<http://www.w3.org/2001/XMLSchema#int>     |
| <file:/home/lindenb/jeter2.ods/t1/y3/x2> | <http://rdf.lindenb.org/value>                    | "2"^^<http://www.w3.org/2001/XMLSchema#float>   |
| <file:/home/lindenb/jeter2.ods/t1/y3/x3> | <http://rdf.lindenb.org/table>                    | <file:/home/lindenb/jeter2.ods/t1>              |
| <file:/home/lindenb/jeter2.ods/t1/y3/x3> | <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> | <http://rdf.lindenb.org/Cell>                   |
| <file:/home/lindenb/jeter2.ods/t1/y3/x3> | <http://rdf.lindenb.org/X>                        | "3"^^<http://www.w3.org/2001/XMLSchema#int>     |
| <file:/home/lindenb/jeter2.ods/t1/y3/x3> | <http://rdf.lindenb.org/Y>                        | "3"^^<http://www.w3.org/2001/XMLSchema#int>     |
| <file:/home/lindenb/jeter2.ods/t1/y3/x3> | <http://rdf.lindenb.org/value>                    | "3"^^<http://www.w3.org/2001/XMLSchema#float>   |
| <file:/home/lindenb/jeter2.ods/t1/y3/x4> | <http://rdf.lindenb.org/table>                    | <file:/home/lindenb/jeter2.ods/t1>              |
| <file:/home/lindenb/jeter2.ods/t1/y3/x4> | <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> | <http://rdf.lindenb.org/Cell>                   |
| <file:/home/lindenb/jeter2.ods/t1/y3/x4> | <http://rdf.lindenb.org/X>                        | "4"^^<http://www.w3.org/2001/XMLSchema#int>     |
| <file:/home/lindenb/jeter2.ods/t1/y3/x4> | <http://rdf.lindenb.org/Y>                        | "3"^^<http://www.w3.org/2001/XMLSchema#int>     |
| <file:/home/lindenb/jeter2.ods/t1/y3/x4> | <http://rdf.lindenb.org/value>                    | "4"^^<http://www.w3.org/2001/XMLSchema#float>   |
--------------------------------------------------------------------------------------------------------------------------------------------------
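
In the result above, each subject URI appears to encode the spreadsheet file, the table index, the row (y) and the column (x), which match the values of the X, Y and table properties. As a minimal sketch, assuming that every cell URI follows the pattern <file>/t<table>/y<row>/x<column> seen in this output, such a URI could be split back into its parts with plain Java. The class CellUri and its methods are hypothetical names used for illustration; they are not part of Jena or of the code in this post.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: splits a cell URI such as file:/home/lindenb/jeter.ods/t1/y3/x1. */
final class CellUri {
    // Assumed layout, inferred from the query result: <file URI>/t<table>/y<row>/x<column>
    private static final Pattern CELL =
        Pattern.compile("^(.+)/t(\\d+)/y(\\d+)/x(\\d+)$");

    final String file; // URI of the spreadsheet file
    final int table;   // table index inside the file
    final int y;       // row index
    final int x;       // column index

    private CellUri(String file, int table, int y, int x) {
        this.file = file;
        this.table = table;
        this.y = y;
        this.x = x;
    }

    /** Parses a cell URI, or throws IllegalArgumentException if it does not match the assumed layout. */
    static CellUri parse(String uri) {
        Matcher m = CELL.matcher(uri);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a cell URI: " + uri);
        }
        return new CellUri(
            m.group(1),
            Integer.parseInt(m.group(2)),
            Integer.parseInt(m.group(3)),
            Integer.parseInt(m.group(4)));
    }

    /** URI of the table owning this cell, as seen in the object of the 'table' property. */
    String tableUri() {
        return file + "/t" + table;
    }

    public static void main(String[] args) {
        CellUri c = parse("file:/home/lindenb/jeter.ods/t1/y3/x1");
        System.out.println(c.file + " table=" + c.table + " y=" + c.y + " x=" + c.x);
    }
}
```

Run as is, the sketch prints file:/home/lindenb/jeter.ods table=1 y=3 x=1.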

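The control panel form is not the only way to send a query: under the SPARQL protocol, a query can also be passed to an endpoint as the URL-encoded query parameter of an HTTP GET request. The sketch below only builds such a URL. It assumes that the query service of the dataset is exposed at http://localhost:3030/openoffice/query, which depends on the service names declared in your Fuseki configuration. The class FusekiQueryUrl and the sample query are illustrative; the query is not the one used earlier in this post.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

/** Hypothetical helper: builds a SPARQL-protocol GET URL for a query endpoint. */
final class FusekiQueryUrl {

    /** Returns endpoint + "?query=" + the URL-encoded SPARQL query. */
    static String build(String endpoint, String sparql) {
        try {
            return endpoint + "?query=" + URLEncoder.encode(sparql, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a Java platform
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Assumption: the query service name is "query"; adapt it to your configuration
        String endpoint = "http://localhost:3030/openoffice/query";
        // Illustrative query only
        String sparql = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10";
        System.out.println(build(endpoint, sparql));
    }
}
```

The printed URL can then be opened in a browser or fetched with any HTTP client while the server is running.
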
That's it,

Pierre
