25 November 2008

Taxonomy and Semantic Web: writing an extension for ARQ/SPARQL

In this post I'll show how I implemented a custom function in ARQ, the SPARQL engine of Jena, for querying an RDF graph. The new function tests whether a node in the NCBI taxonomy hierarchy has a given ancestor.

Requirements


Here is a sample of the very first lines of nodes.dmp: the first column is the node-id of the taxon, the second column is the id of its parent.
cat nodes.dmp | cut -c 1-20 | head
1 | 1 | no rank | |
2 | 131567 | superki
6 | 335928 | genus |
7 | 6 | species | AC
9 | 32199 | species
10 | 135621 | genus
11 | 10 | species |
13 | 203488 | genus
14 | 13 | species |
16 | 32011 | genus |



The input


Our input is an RDF file:

<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:tax="http://species.lindenb.org"
>

<tax:Individual rdf:about="http://fr.wikipedia.org/wiki/Tintin">
<dc:title xml:lang="fr">Tintin</dc:title>
<dc:title xml:lang="en">Tintin</dc:title>
<tax:taxon rdf:resource="lsid:ncbi.nlm.nih.gov:taxonomy:9606"/>
</tax:Individual>

<tax:Individual rdf:about="http://fr.wikipedia.org/wiki/Babar">
<dc:title xml:lang="fr">Babar</dc:title>
<dc:title xml:lang="en">Babar</dc:title>
<tax:taxon rdf:resource="lsid:ncbi.nlm.nih.gov:taxonomy:9785"/>
</tax:Individual>

<tax:Individual rdf:about="http://fr.wikipedia.org/wiki/Milou">
<dc:title xml:lang="fr">Milou</dc:title>
<dc:title xml:lang="en">Snowy</dc:title>
<tax:taxon rdf:resource="lsid:ncbi.nlm.nih.gov:taxonomy:9615"/>
</tax:Individual>

<tax:Individual rdf:about="http://fr.wikipedia.org/wiki/Donald_Duck">
<dc:title xml:lang="fr">Donald</dc:title>
<dc:title xml:lang="en">Donald Duck</dc:title>
<tax:taxon rdf:resource="lsid:ncbi.nlm.nih.gov:taxonomy:8839"/>
</tax:Individual>

<tax:Individual rdf:about="http://fr.wikipedia.org/wiki/Le_L%C3%A9zard">
<dc:title xml:lang="fr">Lezard</dc:title>
<dc:title xml:lang="en">Lizard</dc:title>
<dc:title xml:lang="fr">Curt Connors</dc:title>
<dc:title xml:lang="en">Curt Connors</dc:title>
<tax:taxon rdf:resource="lsid:ncbi.nlm.nih.gov:taxonomy:9606"/>
<tax:taxon rdf:resource="lsid:ncbi.nlm.nih.gov:taxonomy:8504"/>
</tax:Individual>

</rdf:RDF>

Images via Wikipedia: Tintin & Snowy, Babar, Donald, The Lizard

Basically this file describes:
  • 5 individuals: Tintin (human), Snowy (dog), Donald (duck), Babar (elephant) and Dr Connors/The Lizard (Spider-Man's foe)
  • Each individual is unambiguously identified by its URI in Wikipedia
  • Each individual is named in English and in French
  • For each individual, its ID in the NCBI hierarchy is specified using a simple URI (here I've tried to use an LSID, but it could have been something else, such as a URL)


A basic query


The following SPARQL query retrieves the URI, the taxon and the English name of each individual.

The query

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX tax: <http://species.lindenb.org>

SELECT ?individual ?taxon ?title
{
?individual a tax:Individual .
?individual dc:title ?title .
?individual tax:taxon ?taxon .
FILTER langMatches( lang(?title), "en" )
}
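
As a side note, langMatches does more than a plain string comparison: as far as I understand the SPARQL specification, it applies basic language-range filtering, so "en" also accepts tags such as "en-GB". Here is a minimal Java sketch of that rule. The class and method names are made up for illustration and this is not ARQ's own code.

```java
import java.util.Locale;

/** Hypothetical illustration of the langMatches rule; not ARQ's implementation. */
final class LangMatch {
    static boolean langMatches(String tag, String range) {
        // "*" matches any non-empty language tag
        if (range.equals("*")) return !tag.isEmpty();
        String t = tag.toLowerCase(Locale.ROOT);
        String r = range.toLowerCase(Locale.ROOT);
        // exact match, or the range followed by '-' is a prefix of the tag
        return t.equals(r) || t.startsWith(r + "-");
    }

    public static void main(String[] args) {
        System.out.println(langMatches("en", "en"));
        System.out.println(langMatches("fr", "en"));
    }
}
```

With the titles of our file, only the literals tagged @en pass the filter and the @fr ones are dropped.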

Invoking ARQ


arq --query query01.rq --data taxonomy.rdf

Result


-------------------------------------------------------------------------------------------------------------
| individual | taxon | title |
=============================================================================================================
| <http://fr.wikipedia.org/wiki/Le_L%C3%A9zard> | <lsid:ncbi.nlm.nih.gov:taxonomy:8504> | "Curt Connors"@en |
| <http://fr.wikipedia.org/wiki/Le_L%C3%A9zard> | <lsid:ncbi.nlm.nih.gov:taxonomy:9606> | "Curt Connors"@en |
| <http://fr.wikipedia.org/wiki/Le_L%C3%A9zard> | <lsid:ncbi.nlm.nih.gov:taxonomy:8504> | "Lizard"@en |
| <http://fr.wikipedia.org/wiki/Le_L%C3%A9zard> | <lsid:ncbi.nlm.nih.gov:taxonomy:9606> | "Lizard"@en |
| <http://fr.wikipedia.org/wiki/Donald_Duck> | <lsid:ncbi.nlm.nih.gov:taxonomy:8839> | "Donald Duck"@en |
| <http://fr.wikipedia.org/wiki/Milou> | <lsid:ncbi.nlm.nih.gov:taxonomy:9615> | "Snowy"@en |
| <http://fr.wikipedia.org/wiki/Babar> | <lsid:ncbi.nlm.nih.gov:taxonomy:9785> | "Babar"@en |
| <http://fr.wikipedia.org/wiki/Tintin> | <lsid:ncbi.nlm.nih.gov:taxonomy:9606> | "Tintin"@en |
-------------------------------------------------------------------------------------------------------------


Adding a custom function


Now I want to add a new function to SPARQL. This function, 'isA', takes two parameters as input: the taxon/LSID of the child and the taxon/LSID of the parent. It returns the boolean 'true' if the child has the parent in its lineage. The function is implemented by extending the class com.hp.hpl.jena.sparql.function.FunctionBase2. The new class contains an associative array, child2parent, mapping each taxon-id to the id of its parent. This map is loaded as described below:

Pattern pat= Pattern.compile("[ \t]*\\|[ \t]*");
String line;
BufferedReader r= new BufferedReader(new FileReader(TAXONOMY_NODES_PATH));
while((line=r.readLine())!=null)
{
String tokens[]=pat.split(line, 3);
this.child2parent.put(
Integer.parseInt(tokens[0]),
Integer.parseInt(tokens[1])
);
}
r.close();
(...)
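
To try the loading step without the full NCBI dump, here is a small self-contained sketch that reads the same format from any Reader. The class name NodesDmpLoader and the method load are my own illustration (they are not in the original class), and I assume the columns are separated by a pipe surrounded by tabs, as in nodes.dmp.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of the original isA class. */
final class NodesDmpLoader {
    // Same separator as in the post: a pipe surrounded by optional blanks or tabs
    private static final Pattern SEPARATOR = Pattern.compile("[ \t]*\\|[ \t]*");

    /** Reads child-id -> parent-id pairs from text in the nodes.dmp format. */
    static Map<Integer, Integer> load(Reader in) {
        Map<Integer, Integer> child2parent = new HashMap<>();
        // try-with-resources closes the reader even if a line fails to parse
        try (BufferedReader r = new BufferedReader(in)) {
            String line;
            while ((line = r.readLine()) != null) {
                if (line.trim().isEmpty()) continue; // skip blank lines
                // limit 3: only the first two columns are needed
                String[] tokens = SEPARATOR.split(line, 3);
                child2parent.put(Integer.parseInt(tokens[0].trim()),
                                 Integer.parseInt(tokens[1].trim()));
            }
        } catch (IOException err) {
            throw new UncheckedIOException(err);
        }
        return child2parent;
    }

    public static void main(String[] args) {
        // Three rows taken from the sample of nodes.dmp shown earlier
        String sample = "1\t|\t1\t|\tno rank\t|\n"
                      + "6\t|\t335928\t|\tgenus\t|\n"
                      + "7\t|\t6\t|\tspecies\t|\n";
        System.out.println(load(new StringReader(sample)));
    }
}
```

Reading from a Reader rather than a hard-coded path is only a convenience for testing; the class in this post reads the file at TAXONOMY_NODES_PATH.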

The function 'exec' checks that both arguments are URIs and then invokes the method isChildOf:

public NodeValue exec(NodeValue childNode, NodeValue parentNode)
{
(...check the nodes are URI)
return NodeValue.makeBoolean(isChildOf(childId,parentId));
}
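
The check on the URIs boils down to turning an LSID into a numeric taxon-id. Here is a minimal sketch of that step on plain strings, so it runs without Jena. The helper name LsidParser.parseTaxonId is hypothetical; the complete class does the same thing inline on the NodeValue arguments.

```java
import java.util.OptionalInt;

/** Hypothetical helper for illustration; the real class does this inline in exec(). */
final class LsidParser {
    static final String LSID = "lsid:ncbi.nlm.nih.gov:taxonomy:";

    /** Returns the numeric taxon-id of an LSID, or empty if the URI is not a taxonomy LSID. */
    static OptionalInt parseTaxonId(String uri) {
        if (uri == null || !uri.startsWith(LSID)) return OptionalInt.empty();
        try {
            // everything after the LSID prefix must be an integer
            return OptionalInt.of(Integer.parseInt(uri.substring(LSID.length())));
        } catch (NumberFormatException e) {
            return OptionalInt.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(parseTaxonId("lsid:ncbi.nlm.nih.gov:taxonomy:9606"));
        System.out.println(parseTaxonId("http://fr.wikipedia.org/wiki/Tintin"));
    }
}
```

An empty result corresponds to the cases where exec returns 'false' without looking at the taxonomy.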


The function 'isChildOf' walks up the map child2parent, from the child towards the root, to check whether the parent is an ancestor of the child:

while(true)
{
Integer id= child2parent.get(childid);
if(id==null || id==childid) return false;
if(id==parentid) return true;
childid=id;
}
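
The same walk can be tried on a tiny map built from three rows of the nodes.dmp sample (1 -> 1, 6 -> 335928, 7 -> 6). This is only a sketch: the class name AncestorWalk is made up, and the map is passed as a parameter instead of being a field.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical standalone version of the ancestor walk, for illustration only. */
final class AncestorWalk {
    static boolean isChildOf(Map<Integer, Integer> child2parent, int childId, int parentId) {
        // as in the complete class, a taxon is considered a child of itself
        if (childId == parentId) return true;
        int current = childId;
        while (true) {
            Integer parent = child2parent.get(current);
            // unknown node, or the root (which is its own parent): stop
            if (parent == null || parent == current) return false;
            if (parent == parentId) return true;
            current = parent; // climb one level
        }
    }

    public static void main(String[] args) {
        Map<Integer, Integer> map = new HashMap<>();
        map.put(1, 1);        // root: its parent is itself
        map.put(6, 335928);   // genus
        map.put(7, 6);        // species, child of genus 6
        System.out.println(isChildOf(map, 7, 335928)); // 7 -> 6 -> 335928
        System.out.println(isChildOf(map, 6, 7));      // 7 is not above 6
    }
}
```

The loop ends either at the root, which is its own parent in nodes.dmp, or at an id that is missing from the map.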

Here is the complete source code of this class:

package org.lindenb.arq4taxonomy;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

import com.hp.hpl.jena.sparql.expr.ExprEvalException;
import com.hp.hpl.jena.sparql.expr.NodeValue;
import com.hp.hpl.jena.sparql.function.FunctionBase2;

public class isA
extends FunctionBase2
{
public static final String LSID="lsid:ncbi.nlm.nih.gov:taxonomy:";
public static final String TAXONOMY_NODES_PATH="/home/lindenb/tmp/TAXONOMY_NCBI/nodes.dmp";
private Map<Integer, Integer> child2parent=null;

public isA()
{

}
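/**
* Returns the associative map child.id -> parent.id.
* The map is loaded from nodes.dmp the first time it is requested.
* @return the map child.id -> parent.id
*/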
private Map<Integer, Integer> getTaxonomy()
{
if(this.child2parent==null)
{
this.child2parent= new HashMap<Integer, Integer>();
try
{
Pattern pat= Pattern.compile("[ \t]*\\|[ \t]*");
String line;
BufferedReader r= new BufferedReader(new FileReader(TAXONOMY_NODES_PATH));
while((line=r.readLine())!=null)
{
String tokens[]=pat.split(line, 3);
this.child2parent.put(
Integer.parseInt(tokens[0]),
Integer.parseInt(tokens[1])
);
}
r.close();
System.err.println(this.child2parent.size());
}
catch(IOException err)
{
err.printStackTrace();
throw new ExprEvalException(err);
}
}
return this.child2parent;
}

private boolean isChildOf(int childid,int parentid)
{
if(childid==parentid) return true;
Map<Integer,Integer> map= getTaxonomy();
while(true)
{
Integer id= map.get(childid);
if(id==null || id==childid) return false;
if(id==parentid) return true;
childid=id;
}
}

@Override
public NodeValue exec(NodeValue childNode, NodeValue parentNode)
{

if( childNode.isLiteral() ||
parentNode.isLiteral() ||
childNode.asNode().isBlank() ||
parentNode.asNode().isBlank())
{
return NodeValue.makeBoolean(false);
}

String childURI = childNode.asNode().getURI();
if(!childURI.startsWith(LSID))
{
return NodeValue.makeBoolean(false);
}


String parentURI = parentNode.asNode().getURI();
if(!parentURI.startsWith(LSID))
{
return NodeValue.makeBoolean(false);
}

int childId=0;
try {
childId= Integer.parseInt(childURI.substring(LSID.length()));
}
catch (NumberFormatException e)
{
return NodeValue.makeBoolean(false);
}

int parentId=0;
try {
parentId= Integer.parseInt(parentURI.substring(LSID.length()));
}
catch (NumberFormatException e)
{
return NodeValue.makeBoolean(false);
}

return NodeValue.makeBoolean(isChildOf(childId,parentId));
}

}

This class is then compiled and packaged into the file tax.jar:

javac -cp ${ARQ_CLASSPATH}:. -sourcepath src src/org/lindenb/arq4taxonomy/isA.java
jar cvf tax.jar -C src org


and we add this jar to the classpath:
export CP=$PWD/tax.jar

To tell ARQ about this new function, we just add its Java package, written with the 'java:' URI scheme, as a new PREFIX in the SPARQL query:
PREFIX fn: <java:org.lindenb.arq4taxonomy.>
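
As far as I understand it, ARQ expands fn:isA to the URI java:org.lindenb.arq4taxonomy.isA, strips the 'java:' scheme and loads the class of that name from the classpath. The sketch below only illustrates that idea with plain reflection. JavaUriResolver and its methods are hypothetical names and this is not ARQ's actual code.

```java
/** Hypothetical illustration of resolving a 'java:' function URI; not ARQ's code. */
final class JavaUriResolver {
    static final String SCHEME = "java:";

    /** Expands prefix + local name and strips the scheme to get a class name. */
    static String toClassName(String prefixUri, String localName) {
        String expanded = prefixUri + localName;
        if (!expanded.startsWith(SCHEME)) {
            throw new IllegalArgumentException("not a java: URI: " + expanded);
        }
        return expanded.substring(SCHEME.length());
    }

    /** Loads the class and calls its public no-argument constructor. */
    static Object instantiate(String className) {
        try {
            return Class.forName(className).getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException err) {
            throw new IllegalStateException("cannot load " + className, err);
        }
    }

    public static void main(String[] args) {
        System.out.println(toClassName("java:org.lindenb.arq4taxonomy.", "isA"));
        // java.util.ArrayList stands in for isA, which is not on this sketch's classpath
        System.out.println(instantiate(toClassName("java:java.util.", "ArrayList")).getClass());
    }
}
```

This would explain why the class is named isA, with a lower-case first letter, and why it has a public constructor without arguments: the local name in the query must match the class name exactly.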



First test


The following SPARQL query retrieves all the mammals (http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Undef&id=40674) in the data set.

The query


PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX tax: <http://species.lindenb.org>
PREFIX fn: <java:org.lindenb.arq4taxonomy.>

SELECT ?individual ?taxon ?title
{
?individual a tax:Individual .
?individual dc:title ?title .
?individual tax:taxon ?taxon .
FILTER fn:isA(?taxon,<lsid:ncbi.nlm.nih.gov:taxonomy:40674> )
FILTER langMatches( lang(?title), "en" )
}

The command line


arq --query query02.rq --data taxonomy.rdf


The result


-------------------------------------------------------------------------------------------------------------
| individual | taxon | title |
=============================================================================================================
| <http://fr.wikipedia.org/wiki/Le_L%C3%A9zard> | <lsid:ncbi.nlm.nih.gov:taxonomy:9606> | "Curt Connors"@en |
| <http://fr.wikipedia.org/wiki/Le_L%C3%A9zard> | <lsid:ncbi.nlm.nih.gov:taxonomy:9606> | "Lizard"@en |
| <http://fr.wikipedia.org/wiki/Milou> | <lsid:ncbi.nlm.nih.gov:taxonomy:9615> | "Snowy"@en |
| <http://fr.wikipedia.org/wiki/Babar> | <lsid:ncbi.nlm.nih.gov:taxonomy:9785> | "Babar"@en |
| <http://fr.wikipedia.org/wiki/Tintin> | <lsid:ncbi.nlm.nih.gov:taxonomy:9606> | "Tintin"@en |
-------------------------------------------------------------------------------------------------------------


Second query


The following SPARQL query retrieves all the Sauropsida (http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Undef&id=8457) in the RDF file.

The SPARQL file


PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX tax: <http://species.lindenb.org>
PREFIX fn: <java:org.lindenb.arq4taxonomy.>

SELECT ?individual ?taxon ?title
{
?individual a tax:Individual .
?individual dc:title ?title .
?individual tax:taxon ?taxon .
FILTER fn:isA(?taxon,<lsid:ncbi.nlm.nih.gov:taxonomy:8457> )
FILTER langMatches( lang(?title), "en" )
}

Command line


arq --query query03.rq --data taxonomy.rdf


The result


-------------------------------------------------------------------------------------------------------------
| individual | taxon | title |
=============================================================================================================
| <http://fr.wikipedia.org/wiki/Le_L%C3%A9zard> | <lsid:ncbi.nlm.nih.gov:taxonomy:8504> | "Curt Connors"@en |
| <http://fr.wikipedia.org/wiki/Le_L%C3%A9zard> | <lsid:ncbi.nlm.nih.gov:taxonomy:8504> | "Lizard"@en |
| <http://fr.wikipedia.org/wiki/Donald_Duck> | <lsid:ncbi.nlm.nih.gov:taxonomy:8839> | "Donald Duck"@en |
-------------------------------------------------------------------------------------------------------------



Et hop ! voila ! That's it !
