Requirements
- Jena/ARQ: http://jena.sourceforge.net/ARQ/
- A Java 1.6 compiler
- nodes.dmp, the NCBI taxonomy dump, downloaded from ftp://ftp.ncbi.nih.gov/pub/taxonomy/
Here is a sample of the very first lines of nodes.dmp: the first column is the node-id of the taxon, the second column is the node-id of its parent.
cat nodes.dmp | cut -c 1-20 | head
1 | 1 | no rank | |
2 | 131567 | superki
6 | 335928 | genus |
7 | 6 | species | AC
9 | 32199 | species
10 | 135621 | genus
11 | 10 | species |
13 | 203488 | genus
14 | 13 | species |
16 | 32011 | genus |
The input
Our input is an RDF file:
<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:tax="http://species.lindenb.org"
>
<tax:Individual rdf:about="http://fr.wikipedia.org/wiki/Tintin">
<dc:title xml:lang="fr">Tintin</dc:title>
<dc:title xml:lang="en">Tintin</dc:title>
<tax:taxon rdf:resource="lsid:ncbi.nlm.nih.gov:taxonomy:9606"/>
</tax:Individual>
<tax:Individual rdf:about="http://fr.wikipedia.org/wiki/Babar">
<dc:title xml:lang="fr">Babar</dc:title>
<dc:title xml:lang="en">Babar</dc:title>
<tax:taxon rdf:resource="lsid:ncbi.nlm.nih.gov:taxonomy:9785"/>
</tax:Individual>
<tax:Individual rdf:about="http://fr.wikipedia.org/wiki/Milou">
<dc:title xml:lang="fr">Milou</dc:title>
<dc:title xml:lang="en">Snowy</dc:title>
<tax:taxon rdf:resource="lsid:ncbi.nlm.nih.gov:taxonomy:9615"/>
</tax:Individual>
<tax:Individual rdf:about="http://fr.wikipedia.org/wiki/Donald_Duck">
<dc:title xml:lang="fr">Donald</dc:title>
<dc:title xml:lang="en">Donald Duck</dc:title>
<tax:taxon rdf:resource="lsid:ncbi.nlm.nih.gov:taxonomy:8839"/>
</tax:Individual>
<tax:Individual rdf:about="http://fr.wikipedia.org/wiki/Le_L%C3%A9zard">
<dc:title xml:lang="fr">Lezard</dc:title>
<dc:title xml:lang="en">Lizard</dc:title>
<dc:title xml:lang="fr">Curt Connors</dc:title>
<dc:title xml:lang="en">Curt Connors</dc:title>
<tax:taxon rdf:resource="lsid:ncbi.nlm.nih.gov:taxonomy:9606"/>
<tax:taxon rdf:resource="lsid:ncbi.nlm.nih.gov:taxonomy:8504"/>
</tax:Individual>
</rdf:RDF>
(Pictures: Tintin & Snowy, Babar, Donald, The Lizard)
Basically, this file describes:
- 5 individuals: Tintin (human), Snowy (dog), Donald (duck), Babar (elephant) and Dr Connors/The Lizard (Spider-Man's foe, both human and lizard)
- Each individual is unambiguously identified by its URI in Wikipedia
- Each individual is named in English and in French
- For each individual, its ID in the NCBI hierarchy is specified using a simple URI (here I've tried to use an LSID, but it could have been something else, a URL for example)
A basic query
The following SPARQL query retrieves the URI, the taxon and the English name of each individual.
The query
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX tax: <http://species.lindenb.org>
SELECT ?individual ?taxon ?title
{
?individual a tax:Individual .
?individual dc:title ?title .
?individual tax:taxon ?taxon .
FILTER langMatches( lang(?title), "en" )
}
Invoking ARQ
arq --query query01.rq --data taxonomy.rdf
Result
-------------------------------------------------------------------------------------------------------------
| individual | taxon | title |
=============================================================================================================
| <http://fr.wikipedia.org/wiki/Le_L%C3%A9zard> | <lsid:ncbi.nlm.nih.gov:taxonomy:8504> | "Curt Connors"@en |
| <http://fr.wikipedia.org/wiki/Le_L%C3%A9zard> | <lsid:ncbi.nlm.nih.gov:taxonomy:9606> | "Curt Connors"@en |
| <http://fr.wikipedia.org/wiki/Le_L%C3%A9zard> | <lsid:ncbi.nlm.nih.gov:taxonomy:8504> | "Lizard"@en |
| <http://fr.wikipedia.org/wiki/Le_L%C3%A9zard> | <lsid:ncbi.nlm.nih.gov:taxonomy:9606> | "Lizard"@en |
| <http://fr.wikipedia.org/wiki/Donald_Duck> | <lsid:ncbi.nlm.nih.gov:taxonomy:8839> | "Donald Duck"@en |
| <http://fr.wikipedia.org/wiki/Milou> | <lsid:ncbi.nlm.nih.gov:taxonomy:9615> | "Snowy"@en |
| <http://fr.wikipedia.org/wiki/Babar> | <lsid:ncbi.nlm.nih.gov:taxonomy:9785> | "Babar"@en |
| <http://fr.wikipedia.org/wiki/Tintin> | <lsid:ncbi.nlm.nih.gov:taxonomy:9606> | "Tintin"@en |
-------------------------------------------------------------------------------------------------------------
Adding a custom function
Now, I want to add a new function to SPARQL. This function, 'isA', takes two parameters as input: the taxon/LSID of the child and the taxon/LSID of the parent. It returns the boolean 'true' if the 'child' has the 'parent' in its lineage. The new function is implemented by extending the class com.hp.hpl.jena.sparql.function.FunctionBase2. This new class contains an associative array, child2parent, mapping each taxon-id to the id of its parent. This map is loaded as described below:
Pattern pat= Pattern.compile("[ \t]*\\|[ \t]*");
String line;
BufferedReader r= new BufferedReader(new FileReader(TAXONOMY_NODES_PATH));
while((line=r.readLine())!=null)
{
    String tokens[]=pat.split(line, 3);
    this.child2parent.put(
        Integer.parseInt(tokens[0]),
        Integer.parseInt(tokens[1])
        );
}
r.close();
(...)
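To see what this loop produces, here is a minimal, self-contained sketch that applies the same split to a few rows held in memory instead of reading the real file. The names NodesDmpSketch and parse are made up for this illustration, and the three rows only mimic the first columns of nodes.dmp (the real file is tab-delimited and has many more columns).

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original class.
class NodesDmpSketch {
    // Same delimiter as above: a pipe surrounded by optional spaces or tabs.
    private static final Pattern PIPE = Pattern.compile("[ \t]*\\|[ \t]*");

    /** Builds a child-id -> parent-id map from rows shaped like "id | parent-id | ...". */
    static Map<Integer, Integer> parse(List<String> lines) {
        Map<Integer, Integer> child2parent = new HashMap<Integer, Integer>();
        for (String line : lines) {
            // limit=3: we only need the first two columns, the rest stays in tokens[2]
            String[] tokens = PIPE.split(line, 3);
            child2parent.put(Integer.parseInt(tokens[0]), Integer.parseInt(tokens[1]));
        }
        return child2parent;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
            "1\t|\t1\t|\tno rank\t|",
            "6\t|\t335928\t|\tgenus\t|",
            "7\t|\t6\t|\tspecies\t|");
        Map<Integer, Integer> map = parse(sample);
        System.out.println("parent of 7 = " + map.get(7));
        System.out.println("parent of 6 = " + map.get(6));
    }
}
```

Note that the root (id 1) is its own parent, which is what lets the ancestor search know where to stop.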
The function 'exec' checks that the two arguments are URIs and then invokes the method isChildOf:
public NodeValue exec(NodeValue childNode, NodeValue parentNode)
{
    // (...) check that both nodes are LSID URIs and extract childId and parentId
    return NodeValue.makeBoolean(isChildOf(childId,parentId));
}
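The elided check boils down to stripping the LSID prefix and parsing what is left as an integer. Here is a minimal sketch of that step on plain strings, leaving the Jena NodeValue handling aside. The class LsidSketch and the helper taxonIdOf are hypothetical names used only for this illustration.

```java
// Hypothetical helper, not part of the original class.
class LsidSketch {
    static final String LSID = "lsid:ncbi.nlm.nih.gov:taxonomy:";

    /** Returns the numeric taxon-id of an NCBI taxonomy LSID, or null if the URI is not one. */
    static Integer taxonIdOf(String uri) {
        if (uri == null || !uri.startsWith(LSID)) return null;
        try {
            return Integer.valueOf(uri.substring(LSID.length()));
        } catch (NumberFormatException e) {
            return null; // the suffix is not a number
        }
    }

    public static void main(String[] args) {
        System.out.println(taxonIdOf("lsid:ncbi.nlm.nih.gov:taxonomy:9606"));
        System.out.println(taxonIdOf("http://fr.wikipedia.org/wiki/Tintin"));
    }
}
```

Returning null for anything that is not a taxonomy LSID plays the same role as the early 'false' answers in the real function: a bad argument never matches.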
The function 'isChildOf' walks up the map child2parent, from parent to parent, to check whether the parent is an ancestor of the child. It stops when an id is unknown or when it reaches the root, which is its own parent:
while(true)
{
    Integer id= child2parent.get(childid);
    if(id==null || id==childid) return false;
    if(id==parentid) return true;
    childid=id;
}
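The walk is easier to follow on a toy tree. The sketch below runs the same loop on a tiny hand-made map. The ids 1, 10, 20 and 30 are invented for the example and are not real NCBI taxons, and AncestorWalkSketch is a hypothetical name. Like the complete class, it also answers 'true' when both ids are equal.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical demo class, not part of the original code.
class AncestorWalkSketch {
    /** True if parentId equals childId or is met while walking up from childId. */
    static boolean isChildOf(Map<Integer, Integer> child2parent, int childId, int parentId) {
        if (childId == parentId) return true;
        int current = childId;
        while (true) {
            Integer parent = child2parent.get(current);
            // unknown id, or the root (its own parent): nothing left to visit
            if (parent == null || parent == current) return false;
            if (parent == parentId) return true;
            current = parent;
        }
    }

    /** Toy tree: 1 is the root, 10 is under 1, 20 and 30 are under 10. */
    static Map<Integer, Integer> toyTree() {
        Map<Integer, Integer> map = new HashMap<Integer, Integer>();
        map.put(1, 1);
        map.put(10, 1);
        map.put(20, 10);
        map.put(30, 10);
        return map;
    }

    public static void main(String[] args) {
        Map<Integer, Integer> tree = toyTree();
        System.out.println(isChildOf(tree, 20, 1));  // 20 -> 10 -> 1
        System.out.println(isChildOf(tree, 20, 30)); // siblings
        System.out.println(isChildOf(tree, 1, 20));  // the root has no ancestor
    }
}
```

The loop ends only because the root points to itself and every other id eventually leads to it. A map containing any other cycle would make this version loop forever.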
Here is the complete source code of this class:
package org.lindenb.arq4taxonomy;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

import com.hp.hpl.jena.sparql.expr.ExprEvalException;
import com.hp.hpl.jena.sparql.expr.NodeValue;
import com.hp.hpl.jena.sparql.function.FunctionBase2;

/**
 * SPARQL filter function isA(child, parent) for ARQ.
 * Returns true if the taxon 'parent' is the taxon 'child' itself
 * or one of its ancestors in the NCBI taxonomy.
 */
public class isA
    extends FunctionBase2
{
    /** URI prefix of a NCBI taxon: the numeric taxon-id follows it */
    public static final String LSID="lsid:ncbi.nlm.nih.gov:taxonomy:";
    /** path to nodes.dmp: adapt it to your own system */
    public static final String TAXONOMY_NODES_PATH="/home/lindenb/tmp/TAXONOMY_NCBI/nodes.dmp";
    /** map child.id -> parent.id, loaded the first time it is needed */
    private Map<Integer, Integer> child2parent=null;

    public isA()
    {
    }

    /**
     * Loads nodes.dmp the first time it is called.
     * @return an associative map child.id -> parent.id
     */
    private Map<Integer, Integer> getTaxonomy()
    {
        if(this.child2parent==null)
        {
            // fill a local map first, so a failed load never leaves a partial map behind
            Map<Integer, Integer> map= new HashMap<Integer, Integer>();
            BufferedReader r=null;
            try
            {
                Pattern pat= Pattern.compile("[ \t]*\\|[ \t]*");
                String line;
                r= new BufferedReader(new FileReader(TAXONOMY_NODES_PATH));
                while((line=r.readLine())!=null)
                {
                    // we only need the first two columns: id and parent-id
                    String tokens[]=pat.split(line, 3);
                    map.put(
                        Integer.parseInt(tokens[0]),
                        Integer.parseInt(tokens[1])
                        );
                }
                // debug: number of taxons loaded
                System.err.println(map.size());
            }
            catch(IOException err)
            {
                err.printStackTrace();
                throw new ExprEvalException(err);
            }
            finally
            {
                // close the file even if the reading failed
                if(r!=null) try { r.close(); } catch(IOException ignored) { }
            }
            this.child2parent=map;
        }
        return this.child2parent;
    }

    /** walks up the taxonomy from childid until parentid, the root or an unknown id is found */
    private boolean isChildOf(int childid,int parentid)
    {
        if(childid==parentid) return true;
        Map<Integer,Integer> map= getTaxonomy();
        while(true)
        {
            Integer id= map.get(childid);
            // unknown taxon, or root of the taxonomy (the root is its own parent)
            if(id==null || id==childid) return false;
            if(id==parentid) return true;
            childid=id;
        }
    }

    @Override
    public NodeValue exec(NodeValue childNode, NodeValue parentNode)
    {
        // both arguments must be URIs
        if( childNode.isLiteral() ||
            parentNode.isLiteral() ||
            childNode.asNode().isBlank() ||
            parentNode.asNode().isBlank())
        {
            return NodeValue.makeBoolean(false);
        }
        // both URIs must be NCBI taxonomy LSIDs
        String childURI = childNode.asNode().getURI();
        if(!childURI.startsWith(LSID))
        {
            return NodeValue.makeBoolean(false);
        }
        String parentURI = parentNode.asNode().getURI();
        if(!parentURI.startsWith(LSID))
        {
            return NodeValue.makeBoolean(false);
        }
        // extract the numeric taxon-ids
        int childId=0;
        try {
            childId= Integer.parseInt(childURI.substring(LSID.length()));
        }
        catch (NumberFormatException e)
        {
            return NodeValue.makeBoolean(false);
        }
        int parentId=0;
        try {
            parentId= Integer.parseInt(parentURI.substring(LSID.length()));
        }
        catch (NumberFormatException e)
        {
            return NodeValue.makeBoolean(false);
        }
        return NodeValue.makeBoolean(isChildOf(childId,parentId));
    }
}
This class is then compiled and packaged into the file tax.jar:
javac -cp ${ARQ_CLASSPATH}:. -sourcepath src src/org/lindenb/arq4taxonomy/isA.java
jar cvf tax.jar -C src org
and we add this jar to the classpath:
export CP=$PWD/tax.jar
To tell ARQ about this new function, we just declare its Java package as a new PREFIX in the SPARQL query:
PREFIX fn: <java:org.lindenb.arq4taxonomy.>
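With this prefix, fn:isA expands to the URI java:org.lindenb.arq4taxonomy.isA, and ARQ loads the class named after the 'java:' scheme from the classpath. The sketch below only mimics that lookup with plain reflection to show the idea. It is not ARQ's actual code, and JavaSchemeSketch and classFor are made-up names.

```java
// Hypothetical illustration of the 'java:' lookup, not ARQ's implementation.
class JavaSchemeSketch {
    static final String SCHEME = "java:";

    /** Expands prefix + local name and tries to load the class it names; null if not found. */
    static Class<?> classFor(String prefixIri, String localName) {
        String uri = prefixIri + localName; // e.g. "java:org.lindenb.arq4taxonomy." + "isA"
        if (!uri.startsWith(SCHEME)) return null;
        try {
            return Class.forName(uri.substring(SCHEME.length()));
        } catch (ClassNotFoundException e) {
            return null; // the class is not on the classpath
        }
    }

    public static void main(String[] args) {
        // java.util.HashMap stands in for a function class that is always available
        System.out.println(classFor("java:java.util.", "HashMap"));
        // null here unless tax.jar is on the classpath
        System.out.println(classFor("java:org.lindenb.arq4taxonomy.", "isA"));
    }
}
```

This is also why the class is named isA, in lower case: the local name used in the query must match the class name exactly, and the prefix must end with a dot.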
First test
The following SPARQL query retrieves all the mammals (Mammalia, http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Undef&id=40674) in the data set.
The query
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX tax: <http://species.lindenb.org>
PREFIX fn: <java:org.lindenb.arq4taxonomy.>
SELECT ?individual ?taxon ?title
{
?individual a tax:Individual .
?individual dc:title ?title .
?individual tax:taxon ?taxon .
FILTER fn:isA(?taxon,<lsid:ncbi.nlm.nih.gov:taxonomy:40674> )
FILTER langMatches( lang(?title), "en" )
}
The command line
arq --query query02.rq --data taxonomy.rdf
The result
-------------------------------------------------------------------------------------------------------------
| individual | taxon | title |
=============================================================================================================
| <http://fr.wikipedia.org/wiki/Le_L%C3%A9zard> | <lsid:ncbi.nlm.nih.gov:taxonomy:9606> | "Curt Connors"@en |
| <http://fr.wikipedia.org/wiki/Le_L%C3%A9zard> | <lsid:ncbi.nlm.nih.gov:taxonomy:9606> | "Lizard"@en |
| <http://fr.wikipedia.org/wiki/Milou> | <lsid:ncbi.nlm.nih.gov:taxonomy:9615> | "Snowy"@en |
| <http://fr.wikipedia.org/wiki/Babar> | <lsid:ncbi.nlm.nih.gov:taxonomy:9785> | "Babar"@en |
| <http://fr.wikipedia.org/wiki/Tintin> | <lsid:ncbi.nlm.nih.gov:taxonomy:9606> | "Tintin"@en |
-------------------------------------------------------------------------------------------------------------
Second query
The following SPARQL query retrieves all the 'Sauropsida' (http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Undef&id=8457) in the RDF file.
The SPARQL file
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX tax: <http://species.lindenb.org>
PREFIX fn: <java:org.lindenb.arq4taxonomy.>
SELECT ?individual ?taxon ?title
{
?individual a tax:Individual .
?individual dc:title ?title .
?individual tax:taxon ?taxon .
FILTER fn:isA(?taxon,<lsid:ncbi.nlm.nih.gov:taxonomy:8457> )
FILTER langMatches( lang(?title), "en" )
}
Command line
arq --query query03.rq --data taxonomy.rdf
The result
-------------------------------------------------------------------------------------------------------------
| individual | taxon | title |
=============================================================================================================
| <http://fr.wikipedia.org/wiki/Le_L%C3%A9zard> | <lsid:ncbi.nlm.nih.gov:taxonomy:8504> | "Curt Connors"@en |
| <http://fr.wikipedia.org/wiki/Le_L%C3%A9zard> | <lsid:ncbi.nlm.nih.gov:taxonomy:8504> | "Lizard"@en |
| <http://fr.wikipedia.org/wiki/Donald_Duck> | <lsid:ncbi.nlm.nih.gov:taxonomy:8839> | "Donald Duck"@en |
-------------------------------------------------------------------------------------------------------------
Et hop ! voila ! That's it !