Taxonomy and Semantic Web: writing an extension for ARQ/SPARQL
In this post I'll show how I implemented a custom function in ARQ, the SPARQL engine of Jena, for querying an RDF graph. The new function tests whether a node in the NCBI taxonomy hierarchy has a given ancestor.
Requirements
- Jena/ARQ: http://jena.sourceforge.net/ARQ/
- A Java 1.6 compiler
- nodes.dmp, the NCBI taxonomy downloaded from ftp://ftp.ncbi.nih.gov/pub/taxonomy/
Here is a sample of the very first lines of nodes.dmp: the first column is the node-id of the taxon, the second column is its parent-id.
1 | 1 | no rank | |
2 | 131567 | superki
6 | 335928 | genus |
7 | 6 | species | AC
9 | 32199 | species
10 | 135621 | genus
11 | 10 | species |
13 | 203488 | genus
14 | 13 | species |
16 | 32011 | genus |
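To make the column layout concrete, here is a minimal sketch of how one line of nodes.dmp could be split into its first three fields. The class NodesDmpLine and its field names are hypothetical (they are not part of the code of this post), and the sketch assumes that fields are separated by a pipe surrounded by optional tabs or spaces.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original post: holds the first
// three columns of one line of nodes.dmp.
class NodesDmpLine {
    // Assumption: fields are separated by '|' with optional tabs/spaces around it.
    private static final Pattern SEP = Pattern.compile("[ \t]*\\|[ \t]*");

    final int taxonId;   // column 1: node-id of the taxon
    final int parentId;  // column 2: node-id of its parent
    final String rank;   // column 3: rank (genus, species, ...)

    NodesDmpLine(int taxonId, int parentId, String rank) {
        this.taxonId = taxonId;
        this.parentId = parentId;
        this.rank = rank;
    }

    static NodesDmpLine parse(String line) {
        // Limit 4: id, parent id, rank, and the rest of the line left unsplit.
        String[] tokens = SEP.split(line, 4);
        if (tokens.length < 3) {
            throw new IllegalArgumentException("Not a nodes.dmp line: " + line);
        }
        return new NodesDmpLine(
            Integer.parseInt(tokens[0].trim()),
            Integer.parseInt(tokens[1].trim()),
            tokens[2].trim());
    }

    public static void main(String[] args) {
        NodesDmpLine n = parse("7\t|\t6\t|\tspecies\t|\tAC\t|");
        System.out.println(n.taxonId + " -> " + n.parentId + " (" + n.rank + ")");
    }
}
```

Only the first two columns are needed to rebuild the hierarchy; the rank is parsed here just to show where it sits.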
The input
Our input is an RDF file:
<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:tax="http://species.lindenb.org"
>
<tax:Individual rdf:about="http://fr.wikipedia.org/wiki/Tintin">
<dc:title xml:lang="fr">Tintin</dc:title>
<dc:title xml:lang="en">Tintin</dc:title>
<tax:taxon rdf:resource="lsid:ncbi.nlm.nih.gov:taxonomy:9606"/>
</tax:Individual>
<tax:Individual rdf:about="http://fr.wikipedia.org/wiki/Babar">
<dc:title xml:lang="fr">Babar</dc:title>
<dc:title xml:lang="en">Babar</dc:title>
<tax:taxon rdf:resource="lsid:ncbi.nlm.nih.gov:taxonomy:9785"/>
</tax:Individual>
<tax:Individual rdf:about="http://fr.wikipedia.org/wiki/Milou">
<dc:title xml:lang="fr">Milou</dc:title>
<dc:title xml:lang="en">Snowy</dc:title>
<tax:taxon rdf:resource="lsid:ncbi.nlm.nih.gov:taxonomy:9615"/>
</tax:Individual>
<tax:Individual rdf:about="http://fr.wikipedia.org/wiki/Donald_Duck">
<dc:title xml:lang="fr">Donald</dc:title>
<dc:title xml:lang="en">Donald Duck</dc:title>
<tax:taxon rdf:resource="lsid:ncbi.nlm.nih.gov:taxonomy:8839"/>
</tax:Individual>
<tax:Individual rdf:about="http://fr.wikipedia.org/wiki/Le_L%C3%A9zard">
<dc:title xml:lang="fr">Lezard</dc:title>
<dc:title xml:lang="en">Lizard</dc:title>
<dc:title xml:lang="fr">Curt Connors</dc:title>
<dc:title xml:lang="en">Curt Connors</dc:title>
<tax:taxon rdf:resource="lsid:ncbi.nlm.nih.gov:taxonomy:9606"/>
<tax:taxon rdf:resource="lsid:ncbi.nlm.nih.gov:taxonomy:8504"/>
</tax:Individual>
</rdf:RDF>
[Images: Tintin & Snowy, Babar, Donald, The Lizard]
Basically this file describes:
- 5 individuals: Tintin (human), Snowy (dog), Donald (duck), Babar (elephant) and Dr Connors/The Lizard (Spider-Man's foe)
- Each individual is unambiguously identified by its URI in Wikipedia
- Each individual is named in English and in French
- For each individual, its ID in the NCBI hierarchy is specified using a simple URI (here I've tried to use an LSID, but it could have been something else, such as a URL)
A basic query
The following SPARQL query retrieves the URI, the taxon and the English name of each individual.
The query
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX tax: <http://species.lindenb.org>
SELECT ?individual ?taxon ?title
{
?individual a tax:Individual .
?individual dc:title ?title .
?individual tax:taxon ?taxon .
FILTER langMatches( lang(?title), "en" )
}
Invoking ARQ
Result
| individual | taxon | title |
=============================================================================================================
| <http://fr.wikipedia.org/wiki/Le_L%C3%A9zard> | <lsid:ncbi.nlm.nih.gov:taxonomy:8504> | "Curt Connors"@en |
| <http://fr.wikipedia.org/wiki/Le_L%C3%A9zard> | <lsid:ncbi.nlm.nih.gov:taxonomy:9606> | "Curt Connors"@en |
| <http://fr.wikipedia.org/wiki/Le_L%C3%A9zard> | <lsid:ncbi.nlm.nih.gov:taxonomy:8504> | "Lizard"@en |
| <http://fr.wikipedia.org/wiki/Le_L%C3%A9zard> | <lsid:ncbi.nlm.nih.gov:taxonomy:9606> | "Lizard"@en |
| <http://fr.wikipedia.org/wiki/Donald_Duck> | <lsid:ncbi.nlm.nih.gov:taxonomy:8839> | "Donald Duck"@en |
| <http://fr.wikipedia.org/wiki/Milou> | <lsid:ncbi.nlm.nih.gov:taxonomy:9615> | "Snowy"@en |
| <http://fr.wikipedia.org/wiki/Babar> | <lsid:ncbi.nlm.nih.gov:taxonomy:9785> | "Babar"@en |
| <http://fr.wikipedia.org/wiki/Tintin> | <lsid:ncbi.nlm.nih.gov:taxonomy:9606> | "Tintin"@en |
-------------------------------------------------------------------------------------------------------------
Adding a custom function
Now, I want to add a new function to SPARQL. This function 'isA' takes two parameters as input, the taxon/LSID of the child and the taxon/LSID of the parent, and returns the boolean 'true' if the 'child' has the 'parent' in its lineage. The new function is implemented by extending the class com.hp.hpl.jena.sparql.function.FunctionBase2. The new class contains an associative array child2parent mapping each taxon-id to its parent. This map is loaded as described below:
Pattern pat= Pattern.compile("[ \t]*\\|[ \t]*");
String line;
BufferedReader r= new BufferedReader(new FileReader(TAXONOMY_NODES_PATH));
while((line=r.readLine())!=null)
{
    String tokens[]=pat.split(line, 3);
    this.child2parent.put(
        Integer.parseInt(tokens[0]),
        Integer.parseInt(tokens[1])
        );
}
r.close();
(...)
The function 'exec' checks that both arguments are URIs and then invokes the method isChildOf:
public NodeValue exec(NodeValue childNode, NodeValue parentNode)
{
    // (... check that both nodes are URIs ...)
    return NodeValue.makeBoolean(isChildOf(childId,parentId));
}
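The check elided above boils down to turning an LSID string into a taxon-id. Here is a minimal, Jena-free sketch of that step. The class LsidParser and the convention of returning -1 for an invalid URI are my own assumptions for illustration; the complete class further down returns a false NodeValue instead.

```java
// Hypothetical helper, not part of the original post: extracts the NCBI
// taxon-id from a URI such as lsid:ncbi.nlm.nih.gov:taxonomy:9606.
class LsidParser {
    static final String LSID = "lsid:ncbi.nlm.nih.gov:taxonomy:";

    /** Returns the taxon-id, or -1 (assumed convention) if the URI is not a taxonomy LSID. */
    static int taxonId(String uri) {
        if (uri == null || !uri.startsWith(LSID)) {
            return -1;
        }
        try {
            int id = Integer.parseInt(uri.substring(LSID.length()));
            // Taxon-ids are positive; treat anything else as invalid.
            return id > 0 ? id : -1;
        } catch (NumberFormatException e) {
            return -1;
        }
    }

    public static void main(String[] args) {
        System.out.println(taxonId("lsid:ncbi.nlm.nih.gov:taxonomy:9606"));
        System.out.println(taxonId("http://fr.wikipedia.org/wiki/Tintin"));
    }
}
```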
The function 'isChildOf' walks up the map child2parent, from the child towards the root, to check whether the parent is an ancestor of the child:
while(true)
{
    Integer id= child2parent.get(childid);
    if(id==null || id==childid) return false;
    if(id==parentid) return true;
    childid=id;
}
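The walk can be tried without Jena or the real nodes.dmp. The sketch below runs the same loop on a tiny hand-made hierarchy. The taxon-ids in main are made up for the demo, and the class AncestorWalk and its 'seen' set (a guard against a cycle in malformed data) are additions of mine, not part of the original class.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-alone version of the ancestor walk.
class AncestorWalk {
    /** Returns true if ancestorId is found strictly above childId in the hierarchy. */
    static boolean hasAncestor(Map<Integer, Integer> child2parent, int childId, int ancestorId) {
        Set<Integer> seen = new HashSet<Integer>();
        int current = childId;
        // seen.add() returns false when a node is visited twice, i.e. on a cycle.
        while (seen.add(current)) {
            Integer parent = child2parent.get(current);
            // Unknown node, or the root (which is its own parent): stop.
            if (parent == null || parent == current) return false;
            if (parent == ancestorId) return true;
            current = parent;
        }
        return false;
    }

    public static void main(String[] args) {
        // Toy hierarchy with made-up ids: 1 is the root, 10 and 30 are
        // children of the root, 20 is a child of 10.
        Map<Integer, Integer> child2parent = new HashMap<Integer, Integer>();
        child2parent.put(1, 1);
        child2parent.put(10, 1);
        child2parent.put(20, 10);
        child2parent.put(30, 1);
        System.out.println(hasAncestor(child2parent, 20, 1));  // true
        System.out.println(hasAncestor(child2parent, 30, 10)); // false
    }
}
```

Like the loop above, this sketch does not count a node as its own ancestor; the complete class handles that case with an extra test before the loop.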
Here is the complete source code of this class:
package org.lindenb.arq4taxonomy;
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;
import com.hp.hpl.jena.sparql.expr.ExprEvalException;
import com.hp.hpl.jena.sparql.expr.NodeValue;
import com.hp.hpl.jena.sparql.function.FunctionBase2;
public class isA
    extends FunctionBase2
{
    public static final String LSID="lsid:ncbi.nlm.nih.gov:taxonomy:";
    public static final String TAXONOMY_NODES_PATH="/home/lindenb/tmp/TAXONOMY_NCBI/nodes.dmp";
    private Map<Integer, Integer> child2parent=null;
    public isA()
    {
    }
    /**
     * Returns an associative map child.id -> parent.id,
     * loaded from nodes.dmp on first use.
     */
    private Map<Integer, Integer> getTaxonomy()
    {
        if(this.child2parent==null)
        {
            // Fill a local map first, so that a failed load is not cached as an empty taxonomy.
            Map<Integer, Integer> map= new HashMap<Integer, Integer>();
            BufferedReader r= null;
            try
            {
                Pattern pat= Pattern.compile("[ \t]*\\|[ \t]*");
                String line;
                r= new BufferedReader(new FileReader(TAXONOMY_NODES_PATH));
                while((line=r.readLine())!=null)
                {
                    String tokens[]=pat.split(line, 3);
                    map.put(
                        Integer.parseInt(tokens[0]),
                        Integer.parseInt(tokens[1])
                        );
                }
                System.err.println(map.size());
                this.child2parent= map;
            }
            catch(IOException err)
            {
                err.printStackTrace();
                throw new ExprEvalException(err);
            }
            finally
            {
                // Java 1.6 has no try-with-resources: close the reader by hand.
                if(r!=null) try { r.close(); } catch(IOException ignored) { }
            }
        }
        return this.child2parent;
    }
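    private boolean isChildOf(int childid,int parentid)
    {
        // A taxon is considered a child of itself.
        if(childid==parentid) return true;
        Map<Integer,Integer> map= getTaxonomy();
        // Walk up the hierarchy until the parent or the root is reached.
        while(true)
        {
            Integer id= map.get(childid);
            // Unknown taxon, or the root (which is its own parent).
            if(id==null || id==childid) return false;
            if(id==parentid) return true;
            childid=id;
        }
    }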
    @Override
    public NodeValue exec(NodeValue childNode, NodeValue parentNode)
    {
        // Only URI nodes can be taxonomy LSIDs: reject literals and blank nodes.
        if( childNode.isLiteral() ||
            parentNode.isLiteral() ||
            childNode.asNode().isBlank() ||
            parentNode.asNode().isBlank())
        {
            return NodeValue.makeBoolean(false);
        }
        // Both URIs must start with the NCBI taxonomy LSID prefix.
        String childURI = childNode.asNode().getURI();
        if(!childURI.startsWith(LSID))
        {
            return NodeValue.makeBoolean(false);
        }
        String parentURI = parentNode.asNode().getURI();
        if(!parentURI.startsWith(LSID))
        {
            return NodeValue.makeBoolean(false);
        }
        // What follows the prefix must be a numeric taxon-id.
        int childId=0;
        try {
            childId= Integer.parseInt(childURI.substring(LSID.length()));
        }
        catch (NumberFormatException e)
        {
            return NodeValue.makeBoolean(false);
        }
        int parentId=0;
        try {
            parentId= Integer.parseInt(parentURI.substring(LSID.length()));
        }
        catch (NumberFormatException e)
        {
            return NodeValue.makeBoolean(false);
        }
        return NodeValue.makeBoolean(isChildOf(childId,parentId));
    }
}
This class is then compiled and packaged into the file tax.jar:
javac -cp $(ARQ_CLASSPATH):. -sourcepath src src/org/lindenb/arq4taxonomy/isA.java
jar cvf tax.jar -C src org
and we add this jar in the classpath:
To tell ARQ about this new function, we just declare its Java package as a new PREFIX in the SPARQL query:
PREFIX fn: <java:org.lindenb.arq4taxonomy.>
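The queries below declare the prefix fn as <java:org.lindenb.arq4taxonomy.> and call fn:isA. As far as I understand ARQ's dynamic loading of functions, the prefixed name is expanded to a full IRI in the java: scheme, and the part after java: is taken as the name of the class to load. The sketch below only illustrates this string handling; the class JavaFunctionUri is hypothetical and is not ARQ's actual code.

```java
// Hypothetical illustration of how a prefixed function name maps to a Java
// class name. This is NOT ARQ's implementation, only the naming convention.
class JavaFunctionUri {
    static final String JAVA_SCHEME = "java:";

    /** Expands a prefixed name: the prefix IRI followed by the local name. */
    static String expand(String prefixIri, String localName) {
        return prefixIri + localName;
    }

    /** Returns the class name carried by a function IRI in the java: scheme. */
    static String className(String functionIri) {
        if (!functionIri.startsWith(JAVA_SCHEME)) {
            throw new IllegalArgumentException("Not a java: IRI: " + functionIri);
        }
        return functionIri.substring(JAVA_SCHEME.length());
    }

    public static void main(String[] args) {
        String iri = expand("java:org.lindenb.arq4taxonomy.", "isA");
        System.out.println(iri);
        System.out.println(className(iri));
    }
}
```

This is why the prefix must end with a dot and why the class must be reachable from the classpath under exactly that package name.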
First test
The following SPARQL query retrieves all the mammals (http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Undef&id=40674) in the data set.
The query
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX tax: <http://species.lindenb.org>
PREFIX fn: <java:org.lindenb.arq4taxonomy.>
SELECT ?individual ?taxon ?title
{
?individual a tax:Individual .
?individual dc:title ?title .
?individual tax:taxon ?taxon .
FILTER fn:isA(?taxon,<lsid:ncbi.nlm.nih.gov:taxonomy:40674> )
FILTER langMatches( lang(?title), "en" )
}
The command line
The result
| individual | taxon | title |
=============================================================================================================
| <http://fr.wikipedia.org/wiki/Le_L%C3%A9zard> | <lsid:ncbi.nlm.nih.gov:taxonomy:9606> | "Curt Connors"@en |
| <http://fr.wikipedia.org/wiki/Le_L%C3%A9zard> | <lsid:ncbi.nlm.nih.gov:taxonomy:9606> | "Lizard"@en |
| <http://fr.wikipedia.org/wiki/Milou> | <lsid:ncbi.nlm.nih.gov:taxonomy:9615> | "Snowy"@en |
| <http://fr.wikipedia.org/wiki/Babar> | <lsid:ncbi.nlm.nih.gov:taxonomy:9785> | "Babar"@en |
| <http://fr.wikipedia.org/wiki/Tintin> | <lsid:ncbi.nlm.nih.gov:taxonomy:9606> | "Tintin"@en |
-------------------------------------------------------------------------------------------------------------
Second query
The following SPARQL query retrieves all the 'Sauropsida' (http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Undef&id=8457) in the RDF file.
The SPARQL file
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX tax: <http://species.lindenb.org>
PREFIX fn: <java:org.lindenb.arq4taxonomy.>
SELECT ?individual ?taxon ?title
{
?individual a tax:Individual .
?individual dc:title ?title .
?individual tax:taxon ?taxon .
FILTER fn:isA(?taxon,<lsid:ncbi.nlm.nih.gov:taxonomy:8457> )
FILTER langMatches( lang(?title), "en" )
}
Command line
The result
| individual | taxon | title |
=============================================================================================================
| <http://fr.wikipedia.org/wiki/Le_L%C3%A9zard> | <lsid:ncbi.nlm.nih.gov:taxonomy:8504> | "Curt Connors"@en |
| <http://fr.wikipedia.org/wiki/Le_L%C3%A9zard> | <lsid:ncbi.nlm.nih.gov:taxonomy:8504> | "Lizard"@en |
| <http://fr.wikipedia.org/wiki/Donald_Duck> | <lsid:ncbi.nlm.nih.gov:taxonomy:8839> | "Donald Duck"@en |
-------------------------------------------------------------------------------------------------------------
Et hop ! voila ! That's it !