About one year ago, I wrote a lightweight Java parser for RDF based on the Streaming API for XML (StAX). It is far from perfect: for example, it does not handle reified statements or xml:base, among other things. But it is small (24K) and works fine with most RDF files. Inspired by the XML SAX parsers, this RDF parser doesn't keep the statements in memory but calls a method "found" each time a triple is found. This method can be overridden to implement your own code.
Source code
The code is available at
RDFEvent
First we need a small internal class to record the content of each triple:
private static class RDFEvent
{
    URI subject=null;
    URI predicate=null;
    Object value=null;
    URI valueType=null;
    String lang=null;
    int listIndex=-1;
    (...)
}
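To illustrate the callback idea, here is a minimal sketch of a handler whose "found" method is overridden to count triples without storing them. The names TripleHandler, TripleHandlerDemo and emitSample are hypothetical and only stand in for the real parser, which is not reproduced here.

```java
import java.net.URI;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical callback-style handler; the names are illustrative, not the parser's API. */
abstract class TripleHandler {
    /** Called once per triple; subclasses override this instead of storing statements. */
    abstract void found(URI subject, URI predicate, Object value);

    /** Stand-in for a real parser: reports two hard-coded triples to the callback. */
    void emitSample() {
        URI s = URI.create("http://example.org/doc");
        found(s, URI.create("http://purl.org/dc/elements/1.1/title"), "A title");
        found(s, URI.create("http://purl.org/dc/elements/1.1/creator"),
              URI.create("http://example.org/me"));
    }
}

class TripleHandlerDemo {
    /** Counts the triples reported to the callback without keeping any of them. */
    static int countSample() {
        final AtomicInteger count = new AtomicInteger();
        TripleHandler h = new TripleHandler() {
            @Override
            void found(URI subject, URI predicate, Object value) {
                count.incrementAndGet();
            }
        };
        h.emitSample();
        return count.get();
    }

    public static void main(String[] args) {
        System.out.println("triples: " + countSample());
    }
}
```

Because nothing is retained between calls, memory use stays flat however large the input is, which is the same trade-off SAX makes for XML.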
Searching for rdf:RDF
First we scan the elements of the document until the <rdf:RDF> element is found. Then the method parseRDF is called.
this.parser = this.factory.createXMLEventReader(in);
while(getReader().hasNext())
{
    XMLEvent event = getReader().nextEvent();
    if(event.isStartElement())
    {
        StartElement start=(StartElement)event;
        if(name2string(start).equals(RDF.NS+"RDF"))
        {
            parseRDF();
        }
    }
}
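The scan for rdf:RDF can be tried on its own with a few lines of StAX. The sketch below is self-contained and only assumes the standard javax.xml.stream API; the class FindRdfRoot and the method containsRdfRoot are hypothetical names, and the element is matched on namespace URI and local name rather than through the name2string helper used above.

```java
import java.io.StringReader;
import javax.xml.namespace.QName;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

class FindRdfRoot {
    static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    /** Returns true as soon as an rdf:RDF start element is seen anywhere in the document. */
    static boolean containsRdfRoot(String xml) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        XMLEventReader reader = factory.createXMLEventReader(new StringReader(xml));
        try {
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                if (!event.isStartElement()) continue;
                QName name = event.asStartElement().getName();
                // Compare the namespace URI and local name, never the prefix.
                if (RDF_NS.equals(name.getNamespaceURI()) && "RDF".equals(name.getLocalPart())) {
                    return true;
                }
            }
            return false;
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<wrapper><r:RDF xmlns:r='" + RDF_NS + "'/></wrapper>";
        System.out.println(containsRdfRoot(xml));
    }
}
```

Scanning instead of requiring rdf:RDF as the document root means RDF embedded inside another XML vocabulary is still picked up.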
parseRDF: Searching the statements
All the nodes are then scanned. The method parseDescription is called for each element.
while(getReader().hasNext())
{
    XMLEvent event = getReader().nextEvent();
    if(event.isEndElement())
    {
        return;
    }
    else if(event.isStartElement())
    {
        parseDescription(event.asStartElement());
    }
    else if(event.isProcessingInstruction())
    {
        throw new XMLStreamException("Found Processing Instruction in RDF ???");
    }
    else if(event.isCharacters() &&
        event.asCharacters().getData().trim().length()>0)
    {
        throw new XMLStreamException("Found text in RDF ???");
    }
}
parseDescription: Parsing the subject of a triple
The current element will be the subject of the triple. The URI of this subject needs to be extracted. First we check whether this URI can be extracted from an attribute rdf:about:
Attribute att= description.getAttributeByName(new QName(RDF.NS,"about"));
if(att!=null) descriptionURI= createURI( att.getValue());
If it was not found, the attribute rdf:nodeID is searched:
att= description.getAttributeByName(new QName(RDF.NS,"nodeID"));
if(att!=null) descriptionURI= createURI( att.getValue());
If it was not found, the attribute rdf:ID is searched:
att= description.getAttributeByName(new QName(RDF.NS,"ID"));
if(att!=null) descriptionURI= resolveBase(att.getValue());
If it was not found, this is an anonymous node. We create a random URI:
descriptionURI= createAnonymousURI();
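The four cases can be gathered into one function. This is only a sketch under stated assumptions: the class SubjectResolver, the urn:x-bnode: and urn:x-anon: prefixes, and the explicit base parameter are all hypothetical, and the helpers createURI, resolveBase and createAnonymousURI from the parser are not reproduced. In RDF/XML, rdf:ID="x" names the resource at the base URI followed by "#x", which is what the resolve call does here.

```java
import java.io.StringReader;
import java.net.URI;
import java.util.UUID;
import javax.xml.namespace.QName;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.Attribute;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;

class SubjectResolver {
    static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    /** Value of the rdf:* attribute with the given local name, or null if absent. */
    static String rdfAttr(StartElement e, String local) {
        Attribute a = e.getAttributeByName(new QName(RDF_NS, local));
        return a == null ? null : a.getValue();
    }

    /** Tries rdf:about, then rdf:nodeID, then rdf:ID, then falls back to a random URI. */
    static URI subjectOf(StartElement e, URI base) {
        String v = rdfAttr(e, "about");
        if (v != null) return base.resolve(v);
        v = rdfAttr(e, "nodeID");
        if (v != null) return URI.create("urn:x-bnode:" + v);   // assumed scheme for blank nodes
        v = rdfAttr(e, "ID");
        if (v != null) return base.resolve("#" + v);            // base URI + "#" + ID
        return URI.create("urn:x-anon:" + UUID.randomUUID());   // assumed scheme for anonymous nodes
    }

    /** Convenience for experiments: resolves the subject of the first element in xml. */
    static URI subjectOfFirstElement(String xml, URI base) throws XMLStreamException {
        XMLEventReader r = XMLInputFactory.newInstance().createXMLEventReader(new StringReader(xml));
        try {
            while (r.hasNext()) {
                XMLEvent ev = r.nextEvent();
                if (ev.isStartElement()) return subjectOf(ev.asStartElement(), base);
            }
            throw new XMLStreamException("no element found");
        } finally {
            r.close();
        }
    }
}
```

Keeping blank nodes under a recognizable prefix lets a consumer tell them apart from real resource URIs, at the cost of inventing a private URI scheme.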
rdf:type
Unless the element is a plain rdf:Description, its qualified name gives the rdf:type of the subject. We can emit a new triple about this type:
QName qn=description.getName();
if(!(qn.getNamespaceURI().equals(RDF.NS) &&
    qn.getLocalPart().equals("Description")))
{
    RDFEvent evt= new RDFEvent();
    evt.subject=descriptionURI;
    evt.predicate= createURI(RDF.NS+"type");
    evt.value=name2uri(qn);
    found(evt);
}
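The helper name2uri is not shown in this post. A plausible sketch, assuming it follows the RDF/XML rule of concatenating the namespace URI and the local name, could look like this (the class TypedNode and the method impliedType are hypothetical names):

```java
import java.net.URI;
import javax.xml.namespace.QName;

class TypedNode {
    static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    /** Assumed behaviour of name2uri: namespace URI followed by the local name. */
    static URI name2uri(QName qn) {
        return URI.create(qn.getNamespaceURI() + qn.getLocalPart());
    }

    /** Returns the rdf:type implied by the element name, or null for rdf:Description. */
    static URI impliedType(QName qn) {
        boolean plain = RDF_NS.equals(qn.getNamespaceURI())
                && "Description".equals(qn.getLocalPart());
        return plain ? null : name2uri(qn);
    }

    public static void main(String[] args) {
        // A <foaf:Person> element implies rdf:type foaf:Person.
        System.out.println(impliedType(new QName("http://xmlns.com/foaf/0.1/", "Person")));
    }
}
```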
Other attributes
The other attributes of the current element may contain some new triples.
for(Iterator<?> i=description.getAttributes();
    i.hasNext();)
{
    att=(Attribute)i.next();
    qn= att.getName();
    String local= qn.getLocalPart();
    if(qn.getNamespaceURI().equals(RDF.NS) &&
        ( local.equals("about") ||
          local.equals("ID") ||
          local.equals("nodeID")))
    {
        continue;
    }
    RDFEvent evt= new RDFEvent();
    evt.subject=descriptionURI;
    evt.predicate= name2uri(qn);
    evt.value= att.getValue();
    found(evt);
}
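To see what this loop produces, the sketch below runs the same filtering on the first element of a small XML string and returns the resulting triples as text. It is a standalone approximation: the class PropertyAttributes, the method attributeTriples and the "(anonymous)" placeholder are hypothetical, and triples are rendered as plain strings instead of RDFEvent objects.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import javax.xml.namespace.QName;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.Attribute;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;

class PropertyAttributes {
    static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    /** True for the rdf:* attributes that identify the subject rather than describe it. */
    static boolean isSubjectAttribute(QName qn) {
        String local = qn.getLocalPart();
        return RDF_NS.equals(qn.getNamespaceURI())
                && (local.equals("about") || local.equals("ID") || local.equals("nodeID"));
    }

    /** Renders one "subject predicate value" line per property attribute of the first element. */
    static List<String> attributeTriples(String xml) throws XMLStreamException {
        XMLEventReader reader = XMLInputFactory.newInstance()
                .createXMLEventReader(new StringReader(xml));
        List<String> triples = new ArrayList<String>();
        try {
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                if (!event.isStartElement()) continue;
                StartElement start = event.asStartElement();
                Attribute about = start.getAttributeByName(new QName(RDF_NS, "about"));
                String subject = about == null ? "(anonymous)" : about.getValue();
                for (Iterator<?> i = start.getAttributes(); i.hasNext();) {
                    Attribute att = (Attribute) i.next();
                    QName qn = att.getName();
                    if (isSubjectAttribute(qn)) continue;
                    triples.add(subject + " " + qn.getNamespaceURI() + qn.getLocalPart()
                            + " \"" + att.getValue() + "\"");
                }
                break; // only the first element is inspected in this sketch
            }
        } finally {
            reader.close();
        }
        return triples;
    }
}
```

Namespace declarations (xmlns:...) are not returned by getAttributes, so they never show up as triples.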
Searching the predicates
We then loop over the children of the current element. Those nodes are the predicates of the current subject. The method parsePredicate is called each time a new element is found.
while(getReader().hasNext())
{
    XMLEvent event = getReader().nextEvent();
    if(event.isEndElement())
    {
        return descriptionURI;
    }
    else if(event.isStartElement())
    {
        parsePredicate(descriptionURI,event.asStartElement());
    }
    else if(event.isProcessingInstruction())
    {
        throw new XMLStreamException("Found Processing Instruction in RDF ???");
    }
    else if(event.isCharacters() &&
        event.asCharacters().getData().trim().length()>0)
    {
        throw new XMLStreamException("Found text in RDF ??? \""+
            event.asCharacters().getData()+"\""
            );
    }
}
parsePredicate: Parsing the predicate of the current triple
First the property attributes of the current element are scanned, and some new triples may be created, e.g.:
<rdf:Description ex:fullName="Dave Beckett">
    <ex:homePage rdf:resource="http://purl.org/net/dajobe/"/>
</rdf:Description>
During this process, the value of the attribute rdf:parseType is noted if it was found. Furthermore, if there was an attribute rdf:resource, then this element is a new triple linking to another resource.
<ex:homePage rdf:resource="http://purl.org/net/dajobe/"/>
If rdf:parseType="Literal", then we transform the children of the current node into a string, and a new triple is created.
if(parseType.equals("Literal"))
{
    StringBuilder b= parseLiteral();
    RDFEvent evt= new RDFEvent();
    evt.subject=descriptionURI;
    evt.predicate= predicateURI;
    evt.value= b.toString();
    evt.lang=lang;
    evt.valueType=datatype;
    found(evt);
}
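The body of parseLiteral is not shown in this post. One possible way to turn the child nodes back into a string, assuming the start tag of the property element has already been consumed, is to copy events into an XMLEventWriter while tracking the nesting depth. The class LiteralSketch and its methods are hypothetical, and this sketch does not re-declare namespaces inherited from ancestor elements.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

class LiteralSketch {
    /** Copies events until the end tag matching the already-consumed start tag is reached. */
    static String readLiteral(XMLEventReader reader) throws XMLStreamException {
        StringWriter out = new StringWriter();
        XMLEventWriter writer = XMLOutputFactory.newInstance().createXMLEventWriter(out);
        int depth = 0; // number of child elements currently open
        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            if (event.isStartElement()) {
                depth++;
            } else if (event.isEndElement()) {
                if (depth == 0) break; // end tag of the property element itself
                depth--;
            }
            writer.add(event);
        }
        writer.flush();
        return out.toString();
    }

    /** Convenience for experiments: serializes the content of the first element in xml. */
    static String literalOf(String xml) throws XMLStreamException {
        XMLEventReader reader = XMLInputFactory.newInstance()
                .createXMLEventReader(new StringReader(xml));
        try {
            while (reader.hasNext()) {
                if (reader.nextEvent().isStartElement()) return readLiteral(reader);
            }
            throw new XMLStreamException("no element found");
        } finally {
            reader.close();
        }
    }
}
```

Calling literalOf on a string such as "<p>some <b>bold</b> text</p>" should yield the inner markup without the enclosing p element; the exact serialization depends on the StAX implementation in use.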
If rdf:parseType="Resource", then the current node is a blank node: the rdf:Description is omitted. A blank node is created and we recursively call parsePredicate using this blank node as the new subject.
URI blanck = createAnonymousURI();
RDFEvent evt= new RDFEvent();
evt.subject=descriptionURI;
evt.predicate= predicateURI;
evt.value=blanck;
evt.lang=lang;
evt.valueType=datatype;
found(evt);
while(getReader().hasNext())
{
    XMLEvent event = getReader().nextEvent();
    if(event.isStartElement())
    {
        parsePredicate(blanck, event.asStartElement());
    }
    else if(event.isEndElement())
    {
        return;
    }
}
If rdf:parseType="Collection", the child elements give the set of subject nodes of the collection. We recursively call parseDescription for each of these nodes.
int index=0;
while(getReader().hasNext())
{
    XMLEvent event = getReader().nextEvent();
    if(event.isStartElement())
    {
        URI value= parseDescription(event.asStartElement());
        RDFEvent evt= new RDFEvent();
        evt.subject=descriptionURI;
        evt.predicate= predicateURI;
        evt.value=value;
        evt.lang=lang;
        evt.valueType=datatype;
        evt.listIndex=(++index);
        found(evt);
    }
    else if(event.isEndElement())
    {
        return;
    }
}
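In the code above each member of the collection is reported with a listIndex starting at 1, while RDFEvent initializes listIndex to -1. A consumer could use that to rebuild the ordered list, as in this small sketch. The class CollectionOrder is hypothetical and takes plain strings instead of RDFEvent objects.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class CollectionOrder {
    /** Members keyed by their 1-based position; TreeMap keeps them sorted by index. */
    private final Map<Integer, String> members = new TreeMap<Integer, String>();

    /** Records a value only when it carries a positive list index. */
    void found(String value, int listIndex) {
        if (listIndex > 0) {
            members.put(listIndex, value);
        }
    }

    /** Returns the collected members in list order. */
    List<String> ordered() {
        return new ArrayList<String>(members.values());
    }

    public static void main(String[] args) {
        CollectionOrder c = new CollectionOrder();
        c.found("http://example.org/banana", 2);
        c.found("http://example.org/apple", 1);
        c.found("not a list member", -1);
        System.out.println(c.ordered());
    }
}
```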
Otherwise this is the default rdf:parseType. If a new element is found, then this is the subject of a new resource (we recursively call parseDescription); otherwise the object of the current statement is a literal and we concatenate all the text.
StringBuilder b= new StringBuilder();
while(getReader().hasNext())
{
    XMLEvent event = getReader().nextEvent();
    if(event.isStartElement())
    {
        URI childURI=parseDescription(event.asStartElement());
        RDFEvent evt= new RDFEvent();
        evt.subject=descriptionURI;
        evt.predicate= predicateURI;
        evt.value= childURI;
        found(evt);
        b.setLength(0);
        foundResourceAsChild=true;
    }
    else if(event.isCharacters())
    {
        b.append(event.asCharacters().getData());
    }
    else if(event.isEndElement())
    {
        if(!foundResourceAsChild)
        {
            RDFEvent evt= new RDFEvent();
            evt.subject=descriptionURI;
            evt.predicate= predicateURI;
            evt.value= b.toString();
            evt.lang=lang;
            evt.valueType=datatype;
            found(evt);
        }
        else
        {
            if(b.toString().trim().length()!=0) throw new XMLStreamException("Found bad text "+b);
        }
        return;
    }
}
Testing
The following code parses go.rdf.gz (1744 KB) and returns the number of statements.
long now= System.currentTimeMillis();
URL url= new URL("ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/rdf/go.rdf.gz");
InputStream r= new GZIPInputStream(url.openStream());
RDFHandler h= new RDFHandler()
{
    @Override
    public void found(URI subject, URI predicate, Object value,
        URI dataType, String lang, int index)
        throws IOException {
        ++count;
    }
};
h.parse(r);
r.close();
System.out.println("time:"+((System.currentTimeMillis()-now)/1000)+" secs count:"+count+" triples");
Result:
time:17 secs count:188391 triples
That's it.
Pierre