Showing posts with label arq. Show all posts
Showing posts with label arq. Show all posts

24 April 2012

Mapping the genes involved in a category of disease: the GeneWikiPlus + SPARQL way.

In my previous post, I've used the RDF/XML files of the Disease Ontology to map all the genes involved in a cardiac disease.

Andrew Su immediately mentioned on Twitter that he was working on GeneWiki+, an integration of GeneWiki on Semantic-MediaWiki that could answer the same question.





Later, Benjamin Good announced that a SPARQL endpoint for GeneWiki+ was now available:


The following java code uses the Jena/ARQ API to query this SPARQL endpoint. For a given Disease Ontology accession identifier, it fetches all the genes associated to this disease and run recursively with the sub-classes of this disease.



Here is the output (gene-name, gene-id, disease) with DOID:114 ("Heart Disease"):
Protein C 5624 Heart disease
HMG-CoA reductase 3156 Heart disease
SCARB1 949 Heart disease
Coagulation factor II receptor 2149 Heart disease
Cathepsin S 1520 Heart disease
ABCA1 19 Heart disease
CHD7 55636 Heart disease
GJA5 2702 Heart disease
ENTPD1 953 Heart disease
PEDF 5176 Heart disease
HMG CoA reductase 3156 Heart disease
PROC 5624 Heart disease
F2R 2149 Heart disease
SERPINF1 5176 Heart disease
HMGCR 3156 Heart disease
CTSS 1520 Heart disease
Cytochrome c 54205 Heart failure
FOXP1 27086 Heart failure
Vasoactive intestinal peptide 7432 Heart failure
Angiotensin-converting enzyme 1636 Heart failure
PPP1CA 5499 Heart failure
Transferrin 7018 Heart failure
Natriuretic peptide precursor C 4880 Heart failure
Insulin-like growth factor 1 3479 Heart failure
CA-125 94025 Heart failure
Myosin binding protein C, cardiac 4607 Heart failure
MYH7 4625 Heart failure
Tafazzin 6901 Heart failure
5-HT2B receptor 3357 Heart failure
Beta-1 adrenergic receptor 153 Heart failure
PTGS2 5743 Heart failure
EPAS1 2034 Heart failure
Nociceptin receptor 4987 Heart failure
Cystatin C 1471 Heart failure
Ryanodine receptor 2 6262 Heart failure
Multidrug resistance-associated protein 2 1244 Heart failure
KCNA5 3741 Heart failure
ANXA6 309 Heart failure
CMA1 1215 Heart failure
KLF15 28999 Heart failure
IL1RL1 9173 Heart failure
JPH2 57158 Heart failure
Heart-type fatty acid binding protein 2170 Heart failure
TF 7018 Heart failure
ABCC2 1244 Heart failure
Cytochrome-c 54205 Heart failure
HTR2B 3357 Heart failure
Cytochrome C 54205 Heart failure
Hif2a 2034 Heart failure
FABP3 2170 Heart failure
MYBPC3 4607 Heart failure
Angiotensin converting enzyme 1636 Heart failure
IGF-1 3479 Heart failure
Insulin-like growth factor-1 3479 Heart failure
Stress-induced polymorphic ventricular tachycardia 6262 Heart failure
C-type natriuretic peptide 4880 Heart failure
OPRL1 4987 Heart failure
CYCS 54205 Heart failure
ADRB1 153 Heart failure
TAZ 6901 Heart failure
VIP 7432 Heart failure
IGF1 3479 Heart failure
NPPC 4880 Heart failure
ACE 1636 Heart failure
CST3 1471 Heart failure
MUC16 94025 Heart failure
RYR2 6262 Heart failure
Aquaporin-2 359 Congestive heart failure
Aquaporin 2 359 Congestive heart failure
Atrial natriuretic peptide 4878 Congestive heart failure
Brain natriuretic peptide 4879 Congestive heart failure
Phospholamban 5350 Congestive heart failure
CYP2C9 1559 Congestive heart failure
RAGE (receptor) 177 Congestive heart failure
Angiotensin II receptor type 1 185 Congestive heart failure
Programmed cell death 1 5133 Congestive heart failure
AGTR1 185 Congestive heart failure
Atrial natriuretic factor 4878 Congestive heart failure
PDCD1 5133 Congestive heart failure
AGER 177 Congestive heart failure
AQP2 359 Congestive heart failure
PLN 5350 Congestive heart failure
NPPB 4879 Congestive heart failure
NPPA 4878 Congestive heart failure
GroEL 3329 Endocarditis
Ornithine transcarbamylase 5009 Endocarditis
Valosin-containing protein 7415 Endocarditis
Parathyroid hormone 1 receptor 5745 Endocarditis
VDAC1 7416 Endocarditis
RuvB-like 1 8607 Endocarditis
TUBB2A 7280 Endocarditis
ACTG1 71 Endocarditis
ACTC1 70 Endocarditis
PRDX6 9588 Endocarditis
Hyaluronan-mediated motility receptor 3161 Endocarditis
HSPB6 126393 Endocarditis
Parathyroid hormone receptor 1 5745 Endocarditis
VCP 7415 Endocarditis
OTC 5009 Endocarditis
PTH1R 5745 Endocarditis
HSPD1 3329 Endocarditis
HMMR 3161 Endocarditis
RUVBL1 8607 Endocarditis
HCN4 10021 Sick sinus syndrome
Heparin-binding EGF-like growth factor 1839 Aortic valve disease
HBEGF 1839 Aortic valve disease
Von Willebrand factor 7450 Aortic valve stenosis
ADAMTS13 11093 Aortic valve stenosis
VWF 7450 Aortic valve stenosis
Elastin 2006 Supravalvular aortic stenosis
ELN 2006 Supravalvular aortic stenosis
PRG4 10216 Pericarditis
Histamine H3 receptor 11255 Myocardial ischemia
MAP3K7IP1 10454 Myocardial ischemia
Vascular endothelial growth factor A 7422 Myocardial ischemia
Cathepsin L1 1514 Myocardial ischemia
VEGF-A 7422 Myocardial ischemia
VEGFA 7422 Myocardial ischemia
CTSL1 1514 Myocardial ischemia
TAB1 10454 Myocardial ischemia
HRH3 11255 Myocardial ischemia
APOA1 335 Coronary heart disease
APOC3 345 Coronary heart disease
Lipoprotein(a) 4018 Coronary heart disease
Brain natriuretic peptide 4879 Coronary heart disease
Beta-3 adrenergic receptor 155 Coronary heart disease
Insulin-like growth factor 1 3479 Coronary heart disease
Perlecan 3339 Coronary heart disease
PCSK9 255738 Coronary heart disease
Cholesterylester transfer protein 1071 Coronary heart disease
Arachidonate 5-lipoxygenase 240 Coronary heart disease
Apolipoprotein B 338 Coronary heart disease
Apolipoprotein A1 335 Coronary heart disease
Beta-1 adrenergic receptor 153 Coronary heart disease
Apolipoprotein C3 345 Coronary heart disease
Lipoprotein-associated phospholipase A2 7941 Coronary heart disease
NEUROG3 50674 Coronary heart disease
5-lipoxygenase 240 Coronary heart disease
ApoA1 335 Coronary heart disease
CETP 1071 Coronary heart disease
ApoB 338 Coronary heart disease
IGF-1 3479 Coronary heart disease
Insulin-like growth factor-1 3479 Coronary heart disease
ApoCIII 345 Coronary heart disease
PLA2G7 7941 Coronary heart disease
ADRB3 155 Coronary heart disease
ADRB1 153 Coronary heart disease
APOB 338 Coronary heart disease
ALOX5 240 Coronary heart disease
IGF1 3479 Coronary heart disease
NPPB 4879 Coronary heart disease
HSPG2 3339 Coronary heart disease
LPA 4018 Coronary heart disease
CYP7A1 1581 Myocardial infarction
Caspase 3 836 Myocardial infarction
C-reactive protein 1401 Myocardial infarction
Renin 5972 Myocardial infarction
Factor VII 2155 Myocardial infarction
Factor H 3075 Myocardial infarction
Hepatic lipase 3990 Myocardial infarction
Myeloperoxidase 4353 Myocardial infarction
Endothelial protein C receptor 10544 Myocardial infarction
ALDH2 217 Myocardial infarction
C1-inhibitor 710 Myocardial infarction
Basic fibroblast growth factor 2247 Myocardial infarction
Myocyte-specific enhancer factor 2A 4205 Myocardial infarction
5-Lipoxygenase-activating protein 241 Myocardial infarction
RAGE (receptor) 177 Myocardial infarction
OLR1 4973 Myocardial infarction
Beta-1 adrenergic receptor 153 Myocardial infarction
PTGS2 5743 Myocardial infarction
Cholesterol 7 alpha-hydroxylase 1581 Myocardial infarction
GPVI 51206 Myocardial infarction
Adrenomedullin 133 Myocardial infarction
Prostacyclin synthase 5740 Myocardial infarction
Cystatin C 1471 Myocardial infarction
Tenascin X 7148 Myocardial infarction
Thymosin beta-4 7114 Myocardial infarction
GCLM 2730 Myocardial infarction
S100A9 6280 Myocardial infarction
IL1RL1 9173 Myocardial infarction
LGALS2 3957 Myocardial infarction
CKM (gene) 1158 Myocardial infarction
ABCC9 10060 Myocardial infarction
Renalase 55328 Myocardial infarction
VTI1A 143187 Myocardial infarction
MIAT (gene) 440823 Myocardial infarction
BFGF 2247 Myocardial infarction
TMSB4X 7114 Myocardial infarction
CASP3 836 Myocardial infarction
Caspase-3 836 Myocardial infarction
Complement factor H 3075 Myocardial infarction
MEF2A 4205 Myocardial infarction
5-lipoxygenase activating protein 241 Myocardial infarction
Factor VIIa 2155 Myocardial infarction
PROCR 10544 Myocardial infarction
GP6 51206 Myocardial infarction
F7 2155 Myocardial infarction
AGER 177 Myocardial infarction
ADRB1 153 Myocardial infarction
MIAT 440823 Myocardial infarction
CFH 3075 Myocardial infarction
CKM 1158 Myocardial infarction
CRP 1401 Myocardial infarction
LIPC 3990 Myocardial infarction
RNLS 55328 Myocardial infarction
PTGIS 5740 Myocardial infarction
TNXB 7148 Myocardial infarction
SERPING1 710 Myocardial infarction
FGF2 2247 Myocardial infarction
REN 5972 Myocardial infarction
ADM 133 Myocardial infarction
CST3 1471 Myocardial infarction
MPO 4353 Myocardial infarction
ALOX5AP 241 Myocardial infarction
Myoglobin 4151 Acute myocardial infarction
Tissue plasminogen activator 5327 Acute myocardial infarction
MIRN21 406991 Acute myocardial infarction
Apolipoprotein B 338 Acute myocardial infarction
Endothelin 1 1906 Acute myocardial infarction
MMP3 4314 Acute myocardial infarction
Heart-type fatty acid binding protein 2170 Acute myocardial infarction
Alteplase 5327 Acute myocardial infarction
FABP3 2170 Acute myocardial infarction
ApoB 338 Acute myocardial infarction
MB 4151 Acute myocardial infarction
APOB 338 Acute myocardial infarction
PLAT 5327 Acute myocardial infarction
EDN1 1906 Acute myocardial infarction
MIR21 406991 Acute myocardial infarction
Adenosine A1 receptor 134 Myocardial stunning
SOD2 6648 Myocardial stunning
ADORA1 134 Myocardial stunning
MYH7 4625 Endocardial fibroelastosis
Tafazzin 6901 Endocardial fibroelastosis
TAZ 6901 Endocardial fibroelastosis
Nav1.5 6331 Conduction disease
SCN5A 6331 Conduction disease
PRKAG2 51422 Wolff-Parkinson-White syndrome
TNNT2 7139 Restrictive cardiomyopathy
Titin 7273 Hypertrophic cardiomyopathy
CSRP3 8048 Hypertrophic cardiomyopathy
CD36 948 Hypertrophic cardiomyopathy
Myosin binding protein C, cardiac 4607 Hypertrophic cardiomyopathy
MYH7 4625 Hypertrophic cardiomyopathy
MYL9 10398 Hypertrophic cardiomyopathy
TNNT2 7139 Hypertrophic cardiomyopathy
ACTC1 70 Hypertrophic cardiomyopathy
Endothelin 2 1907 Hypertrophic cardiomyopathy
MYL2 4633 Hypertrophic cardiomyopathy
MYH6 4624 Hypertrophic cardiomyopathy
MYBPC1 4604 Hypertrophic cardiomyopathy
MYL3 4634 Hypertrophic cardiomyopathy
JPH2 57158 Hypertrophic cardiomyopathy
MYLK2 85366 Hypertrophic cardiomyopathy
MYBPC3 4607 Hypertrophic cardiomyopathy
CD-36 948 Hypertrophic cardiomyopathy
TTN 7273 Hypertrophic cardiomyopathy
EDN2 1907 Hypertrophic cardiomyopathy
Titin 7273 Dilated cardiomyopathy
CSRP3 8048 Dilated cardiomyopathy
Phospholamban 5350 Dilated cardiomyopathy
Tafazzin 6901 Dilated cardiomyopathy
Beta-1 adrenergic receptor 153 Dilated cardiomyopathy
LMNA 4000 Dilated cardiomyopathy
Palladin 23022 Dilated cardiomyopathy
Fukutin 2218 Dilated cardiomyopathy
TNNT2 7139 Dilated cardiomyopathy
ACTC1 70 Dilated cardiomyopathy
SGCD 6444 Dilated cardiomyopathy
Programmed cell death 1 5133 Dilated cardiomyopathy
LDB3 11155 Dilated cardiomyopathy
ABCC9 10060 Dilated cardiomyopathy
PDCD1 5133 Dilated cardiomyopathy
ADRB1 153 Dilated cardiomyopathy
TTN 7273 Dilated cardiomyopathy
TAZ 6901 Dilated cardiomyopathy
PLN 5350 Dilated cardiomyopathy
PALLD 23022 Dilated cardiomyopathy
FKTN 2218 Dilated cardiomyopathy

Note: In my previous post ADA was found to be associated to DOID:3363 (coronary arteriosclerosis). This result was not retrieved using SPARQL and this information is not available on the GeneWiki+ page for ADA. But keep in mind that GeneWiki+ is still under development.

That's it,

Pierre


03 February 2010

Using a FASTA file as a source of RDF statements for SPARQL.


In this post, I'll show how a Fasta file can be used as a source of RDF statements for the Jena API.
The DNA sequences in the Fasta file will be used by Jena without any prior transformation: the file will be used as a Graph by Jena by implementing com.hp.hpl.jena.graph.Graph.

Here, my example uses a Fasta file but it could have been any kind of input: a SQL database, a XML file, a GFF file, etc...

How it works


com.hp.hpl.jena.graph.Graph is the interface to be satisfied by implementations maintaining collections of RDF triples. The core interface is small (add, delete, find, contains) and is augmented by additional classes to handle more complicated matters such as reification, query handling, bulk update, event management, and transaction handling. My implementation for this interface extends com.hp.hpl.jena.graph.impl.GraphBase, will read a Fasta file and create set of RDF triple for each sequence. All we need is (love and ) implementing the abstract function ExtendedIterator<Triple> graphBaseFind(TripleMatch matcher)
public class FastaModel
extends GraphBase
{
protected ExtendedIterator<Triple> graphBaseFind(TripleMatch matcher)
{
//the function to be implemented....
}
}

The FastaSequence


A simple container for a name and a sequence.
/** a simple fasta sequence */
private static class FastaSequence
{
StringBuilder name=new StringBuilder();
StringBuilder sequence=new StringBuilder();
}

Reading the next Fasta Sequence

... Not Rocket Science...
/** the file reader */
private PushbackReader reader;
(...)
reader=new PushbackReader(new FileReader(fastaFile));
(...)
private FastaSequence readNext() throws IOException
{
if(this.reader==null) return null;
int c;
FastaSequence seq=null;
while((c=this.reader.read())!=-1)
{
if(c=='>')
{
if(seq!=null)
{
this.reader.unread(c);
return seq;
}
seq=new FastaSequence();
while((c=this.reader.read())!=-1)
{
if(c=='\n') break;
seq.name.append((char)c);
}
}
else if(seq!=null && Character.isLetter(c))
{
seq.sequence.append((char)c);
}
}
this.close();//close the FileReader
return seq;
}

Implementing the Iterator of Triples


My FastaIterator extends jena.util.iterator.NiceIterator<Triple>, a class extening the ExtendedIterator returned by the Graph function ExtendedIterator<Triple> graphBaseFind(TripleMatch matcher). The class contains three fields:
  • a FileReader
  • com.hp.hpl.jena.graph.Triple that is used as a filter
  • a stack/queue of RDF Triples
. The constructor for 'FastaIterator' opens the stream:
FastaIterator(TripleMatch matcher) throws IOException
{
this.filter=matcher.asTriple();
try
{
this.reader=new PushbackReader(new FileReader(FastaModel.this.fastaFile));
}
catch (IOException e)
{
throw new JenaException(e);
}
}
The method 'close' just close the input stream
@Override
public void close()
{
try
{
if(this.reader!=null) reader.close();
}
catch (IOException e)
{
throw new JenaException(e);
}
finally
{
this.reader=null;
super.close();
}
The method 'next()' check if there is something in the RDF queue, if true a RDF triple is removed and returned:
@Override
public Triple next()
{
if(this.triples_queue.isEmpty()) hasNext();
if(this.triples_queue.isEmpty()) throw new IllegalStateException();
return this.triples_queue.pop();
}
The method 'hasNext()' returns true if the queue of RDF triple is not empty. Otherwise, it gets the next Fasta Sequence
from the input stream and transforms it into a set of RDF triple that are added to the RDF queue if they match this.filter.That is to say, the following fasta sequence...:


>gi|227935373|gb|FJ425127.1| Rotavirus G8 isolate 6862/2000/ARN NSP3 gene, partial cds
GGCCACTTCAACATTAGAATTAATGGGTATTCAATATGATTACAATGAAGTATTTACCAGAGTTAAAAGT
AAATTTGATTATGTGATGGATGACTCTGGTGTTAAAAACAATCTTTTGGGTAAAGCTATAACTATTGATC
AGGCATTAAATGGAAAGTTTGGCTCAGCTATTAGAAATAGAAATTGGATGACTGATTCTAAAACGGTTGC
TAAATTAGATGAAGACGTGAATAAACTTAGAATGACATTATCTTCTAAAGGAATCGACCAAAAGATGAGA
GTACTTAATGCTTGTTTAGTGTA

... will generate those four RDF statements:
<http://www.ncbi.nlm.nih.gov/nuccore/227935373> <urn:lindenb:ontology:length> "303"^^<http://www.w3.org/2001/XMLSchema#int>
<http://www.ncbi.nlm.nih.gov/nuccore/227935373> <urn:lindenb:ontology:sequence> "GGCCACTTCAACATTAGAATTAATGGGTATTCAATATGATTACAATGAAGTATTTACCAGAGTTAAAAGTAAATTTGATTATGTGATGGATGACTCTGGTGTTAAAAACAATCTTTTGGGTAAAGCTATAACTATTGATCAGGCATTAAATGGAAAGTTTGGCTCAGCTATTAGAAATAGAAATTGGATGACTGATTCTAAAACGGTTGCTAAATTAGATGAAGACGTGAATAAACTTAGAATGACATTATCTTCTAAAGGAATCGACCAAAAGATGAGAGTACTTAATGCTTGTTTAGTGTA"
<http://www.ncbi.nlm.nih.gov/nuccore/227935373> <http://purl.org/dc/elements/1.1/title> "gi|227935373|gb|FJ425127.1| Rotavirus G8 isolate 6862/2000/ARN NSP3 gene, partial cds"
<http://www.ncbi.nlm.nih.gov/nuccore/227935373> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <urn:lindenb:ontology:Sequence>
Here is the code for the 'hasNext' function:
@Override
public boolean hasNext()
{
if(!triples_queue.isEmpty()) return true;
if(this.reader==null) return false;
try
{
/* loop until the queue is not empty or the stream is closed */
while(this.triples_queue.isEmpty())
{
//try to get a new fasta sequence
FastaSequence seq=readNext();
if(seq==null) return false;

String name=seq.name.toString();
//check it is a genbank file with a gi
if(!name.startsWith("gi|"))
{
continue;
}
int i=name.indexOf('|',3);
if(i==-1) continue;
//create the subject
Node subject =Node.createURI("http://www.ncbi.nlm.nih.gov/nuccore/"+name.substring(3,i));

//make a triple for the rdf:type
Triple triple=new Triple(
subject,
RDF.type.asNode(),
Node.createURI("urn:lindenb:ontology:Sequence")
);
//append this triple to the queue if it is accepted by this.filter
if(this.filter.asTriple().matches(triple))
{
this.triples_queue.add(triple);
}

//make a triple for the dc:title
triple=new Triple(
subject,
DC.title.asNode(),
Node.createLiteral(name)
);

//append this triple to the queue if it is accepted by this.filter
if(this.filter.asTriple().matches(triple))
{
this.triples_queue.add(triple);
}

//make a triple for the DNA sequence
triple=new Triple(
subject,
Node.createURI("urn:lindenb:ontology:sequence"),
Node.createLiteral(seq.sequence.toString())
);

//append this triple to the queue if it is accepted by this.filter
if(this.filter.asTriple().matches(triple))
{
this.triples_queue.add(triple);
}

//make a triple for the size of this sequence
triple=new Triple(
subject,
Node.createURI("urn:lindenb:ontology:length"),
Node.createLiteral(String.valueOf(seq.sequence.length()),null,XSDDatatype.XSDint)
);

//append this triple to the queue if it is accepted by this.filter
if(this.filter.asTriple().matches(triple))
{
this.triples_queue.add(triple);
}

}
}
catch (IOException e)
{
close();
throw new JenaException(e);
}
return !triples_queue.isEmpty();
}

Using the graph

.
Creating a new Jena RDF Model
Model m=ModelFactory.createModelForGraph(
new FastaModel(
new File("rotavirus.fa")
));

Looping over the RDF statements
After creating this new Model, it can be used as a regular Jena RDF Model. e.g:
StmtIterator i=m.listStatements();
while(i.hasNext())
{
System.err.println(i.next());
}
Result
[http://www.ncbi.nlm.nih.gov/nuccore/227935373, urn:lindenb:ontology:length, "303"^^http://www.w3.org/2001/XMLSchema#int]
[http://www.ncbi.nlm.nih.gov/nuccore/227935373, urn:lindenb:ontology:sequence, "GGCCACTTCAACATTAGAATTAATGGGTATTCAATATGATTACAATGAAGTATTTACCAGAGTTAAAAGTAAATTTGATTATGTGATGGATGACTCTGGTGTTAAAAACAATCTTTTGGGTAAAGCTATAACTATTGATCAGGCATTAAATGGAAAGTTTGGCTCAGCTATTAGAAATAGAAATTGGATGACTGATTCTAAAACGGTTGCTAAATTAGATGAAGACGTGAATAAACTTAGAATGACATTATCTTCTAAAGGAATCGACCAAAAGATGAGAGTACTTAATGCTTGTTTAGTGTA"]
[http://www.ncbi.nlm.nih.gov/nuccore/227935373, http://purl.org/dc/elements/1.1/title, "gi|227935373|gb|FJ425127.1| Rotavirus G8 isolate 6862/2000/ARN NSP3 gene, partial cds"]
[http://www.ncbi.nlm.nih.gov/nuccore/227935373, http://www.w3.org/1999/02/22-rdf-syntax-ns#type, urn:lindenb:ontology:Sequence]
[http://www.ncbi.nlm.nih.gov/nuccore/227935371, urn:lindenb:ontology:length, "303"^^http://www.w3.org/2001/XMLSchema#int]
[http://www.ncbi.nlm.nih.gov/nuccore/227935371, urn:lindenb:ontology:sequence, "GGCCACTTCAACATTAGAATTAATGGGTATTCAATATGATTACAATGAAGTATTTACCAGAGTTAAAAGTAAATTTGATTATGTGATGGATGACTCTGGTGTTAAAAACAATCTTTTGGGTAAAGCTATAACTATTGATCAGGCATTAAATGGAAAGTTTGGCTCAGCTATTAGAAATAGAAATTGGATGACTGATTCTAAAACGGTTGCTAAATTAGATGAAGACGTGAATAAACTTAGAATGACATTATCTTCTAAAGGAATCGACCAAAAGATGAGAGTACTTAATGCTTGTTTAGTGTA"]
[http://www.ncbi.nlm.nih.gov/nuccore/227935371, http://purl.org/dc/elements/1.1/title, "gi|227935371|gb|FJ425126.1| Rotavirus G8 isolate 6854/2002/ARN NSP3 gene, partial cds"]
[http://www.ncbi.nlm.nih.gov/nuccore/227935371, http://www.w3.org/1999/02/22-rdf-syntax-ns#type, urn:lindenb:ontology:Sequence]
[http://www.ncbi.nlm.nih.gov/nuccore/227935369, urn:lindenb:ontology:length, "303"^^http://www.w3.org/2001/XMLSchema#int]
[http://www.ncbi.nlm.nih.gov/nuccore/227935369, urn:lindenb:ontology:sequence, "GGCCACTTCAACATTAGAATTAATGGGTATTCAATATGATTACAATGAAGTATTTACCAGAGTTAAAAGTAAATTTGATTATGTGATGGATGACTCTGGTGTTAAAAACAATCTTCTGGGTAAAGCTATAACTATTGATCAGGCATTAAATGGAAAGTTTGGCTCAGCTATTAGAAATAGAAATTGGATGACTGATTCTAAAACGGTTGCTAAATTAGATGAAGACGTGAATAAACTTAGAATGACATTATCTTCTAAAGGAATCGACCAAAAGATGAGAGTACTTAATGCTTGTTTAGTGTA"]
[http://www.ncbi.nlm.nih.gov/nuccore/227935369, http://purl.org/dc/elements/1.1/title, "gi|227935369|gb|FJ425125.1| Rotavirus G8 isolate 6810/2004/ARN NSP3 gene, partial cds"]
(...)

This model can also be used as a source of RDF by ARQ , the SPARQL engine for Jena (!). Here we create a new SPARQL engine and list the sequences having a length lower than the others
Query query=QueryFactory.create(
"SELECT ?Seq1 ?Len1 ?Seq2 ?Len2" +
"{" +
"?Seq1 a <urn:lindenb:ontology:Sequence> . " +
"?Seq1 <urn:lindenb:ontology:length> ?Len1 . " +
"?Seq2 a <urn:lindenb:ontology:Sequence> . " +
"?Seq2 <urn:lindenb:ontology:length> ?Len2 . " +
"FILTER (?Seq1!=?Seq2 && ?Len1 < ?Len2) "+

"}"
);
QueryExecution execution = QueryExecutionFactory.create(query, m);
ResultSet row=execution.execSelect();
while(row.hasNext())
{
QuerySolution solution=row.next();

for(Iterator<String> si=solution.varNames();si.hasNext();)
{
String name=si.next();
System.out.println(name+" : "+solution.get(name));
}
System.out.println();
}
Result:
Seq1 : http://www.ncbi.nlm.nih.gov/nuccore/227935373
Len1 : 303^^http://www.w3.org/2001/XMLSchema#int
Seq2 : http://www.ncbi.nlm.nih.gov/nuccore/227935361
Len2 : 304^^http://www.w3.org/2001/XMLSchema#int

Seq1 : http://www.ncbi.nlm.nih.gov/nuccore/227935373
Len1 : 303^^http://www.w3.org/2001/XMLSchema#int
Seq2 : http://www.ncbi.nlm.nih.gov/nuccore/227935359
Len2 : 304^^http://www.w3.org/2001/XMLSchema#int

Seq1 : http://www.ncbi.nlm.nih.gov/nuccore/227935373
Len1 : 303^^http://www.w3.org/2001/XMLSchema#int
Seq2 : http://www.ncbi.nlm.nih.gov/nuccore/215489730
Len2 : 305^^http://www.w3.org/2001/XMLSchema#int

Seq1 : http://www.ncbi.nlm.nih.gov/nuccore/227935371
Len1 : 303^^http://www.w3.org/2001/XMLSchema#int
Seq2 : http://www.ncbi.nlm.nih.gov/nuccore/227935361
Len2 : 304^^http://www.w3.org/2001/XMLSchema#int

Seq1 : http://www.ncbi.nlm.nih.gov/nuccore/227935371
Len1 : 303^^http://www.w3.org/2001/XMLSchema#int
Seq2 : http://www.ncbi.nlm.nih.gov/nuccore/227935359
Len2 : 304^^http://www.w3.org/2001/XMLSchema#int

Seq1 : http://www.ncbi.nlm.nih.gov/nuccore/227935371
Len1 : 303^^http://www.w3.org/2001/XMLSchema#int
Seq2 : http://www.ncbi.nlm.nih.gov/nuccore/215489730
Len2 : 305^^http://www.w3.org/2001/XMLSchema#int

Seq1 : http://www.ncbi.nlm.nih.gov/nuccore/227935369
Len1 : 303^^http://www.w3.org/2001/XMLSchema#int
Seq2 : http://www.ncbi.nlm.nih.gov/nuccore/227935361
Len2 : 304^^http://www.w3.org/2001/XMLSchema#int
(...)
Hey, I thinks it's coool ! :-)
BTW I wonder how, knowing the FILTER of the SPARQL query, searching the Graph can be optimized, for example if we know that the sequences have been sorted in the fasta file according to their lengths.... Any idea ?

Compiling



export JENAPATH=${JENALIB}/icu4j-3.4.4.jar:${JENALIB}/iri-0.7.jar:${JENALIB}/jena-2.6.2.jar:${JENALIB}/jena-2.6.2-tests.jar:${JENALIB}/junit-4.5.jar:${JENALIB}/log4j-1.2.13.jar:${JENALIB}/lucene-core-2.3.1.jar:${JENALIB}/slf4j-api-1.5.6.jar:${JENALIB}/slf4j-log4j12-1.5.6.jar:${JENALIB}/stax-api-1.0.1.jar:${JENALIB}/wstx-asl-3.2.9.jar:${JENALIB}/xercesImpl-2.7.1.jar:${JENALIB}/icu4j-3.4.4.jar:${JENALIB}/iri-0.7.jar:${JENALIB}/jena-2.6.2.jar:${JENALIB}/jena-2.6.2-tests.jar:${JENALIB}/junit-4.5.jar:${JENALIB}/log4j-1.2.13.jar:${JENALIB}/lucene-core-2.3.1.jar:${JENALIB}/slf4j-api-1.5.6.jar:${JENALIB}/slf4j-log4j12-1.5.6.jar:${JENALIB}/stax-api-1.0.1.jar:${JENALIB}/wstx-asl-3.2.9.jar:${JENALIB}/xercesImpl-2.7.1.jar:${JENALIB}/arq-2.8.1.jar
javac -cp ${JENAPATH}:. -d bin -sourcepath src src/test/FastaModel.java

Running


java -cp ${JENAPATH}:bin test.FastaModel

All, in one, here is the code



That's it !
Pierre

25 November 2008

Taxonomy and Semantic Web: writing an extension for ARQ/SPARQL

In this post I'll show how I've implemented a custom function in ARQ, the SPARQL/Jena engine for querying a RDF graph. The new function implemented tests if a node in the NCBI-taxonomy hierarchy as a given ancestor.

Requirements


Here are a sample of the very first lines of nodes.dmp: the first column is the node-id of the taxon, the second column is its parent-id.
cat nodes.dmp | cut -c 1-20 | head
1 | 1 | no rank | |
2 | 131567 | superki
6 | 335928 | genus |
7 | 6 | species | AC
9 | 32199 | species
10 | 135621 | genus
11 | 10 | species |
13 | 203488 | genus
14 | 13 | species |
16 | 32011 | genus |



The input


our input is a RDF file:

<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:tax="http://species.lindenb.org"
>

<tax:Individual rdf:about="http://fr.wikipedia.org/wiki/Tintin">
<dc:title xml:lang="fr">Tintin</dc:title>
<dc:title xml:lang="en">Tintin</dc:title>
<tax:taxon rdf:resource="lsid:ncbi.nlm.nih.gov:taxonomy:9606"/>
</tax:Individual>

<tax:Individual rdf:about="http://fr.wikipedia.org/wiki/Babar">
<dc:title xml:lang="fr">Babar</dc:title>
<dc:title xml:lang="en">Babar</dc:title>
<tax:taxon rdf:resource="lsid:ncbi.nlm.nih.gov:taxonomy:9785"/>
</tax:Individual>

<tax:Individual rdf:about="http://fr.wikipedia.org/wiki/Milou">
<dc:title xml:lang="fr">Milou</dc:title>
<dc:title xml:lang="en">Snowy</dc:title>
<tax:taxon rdf:resource="lsid:ncbi.nlm.nih.gov:taxonomy:9615"/>
</tax:Individual>

<tax:Individual rdf:about="http://fr.wikipedia.org/wiki/Donald_Duck">
<dc:title xml:lang="fr">Donald</dc:title>
<dc:title xml:lang="en">Donald Duck</dc:title>
<tax:taxon rdf:resource="lsid:ncbi.nlm.nih.gov:taxonomy:8839"/>
</tax:Individual>

<tax:Individual rdf:about="http://fr.wikipedia.org/wiki/Le_L%C3%A9zard">
<dc:title xml:lang="fr">Lezard</dc:title>
<dc:title xml:lang="en">Lizard</dc:title>
<dc:title xml:lang="fr">Curt Connors</dc:title>
<dc:title xml:lang="en">Curt Connors</dc:title>
<tax:taxon rdf:resource="lsid:ncbi.nlm.nih.gov:taxonomy:9606"/>
<tax:taxon rdf:resource="lsid:ncbi.nlm.nih.gov:taxonomy:8504"/>
</tax:Individual>

</rdf:RDF>

Images via wikipedia

Tintin & Snowy

Babar

Donald

The Lizard

Basically this file describes
  • 4 individuals: Tintin (human), Snowy (dog), Donal (duck) , Babar (Elephant) and Dr Connors/The Lizard (spiderman's foe)
  • Each individual unambigously identified by his URI in wikipedia
  • Each individual is named in english and in french
  • For each individual, is ID in the NCBI hierarchy is specified using a simple URI (here I've tried to use a LSID, but it could have been something else (a URL... ))


A basic query


The following SPARQL query retrieve the URI, the taxonomy and the english name for each individuals.

The query

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX tax: <http://species.lindenb.org>

SELECT ?individual ?taxon ?title
{
?individual a tax:Individual .
?individual dc:title ?title .
?individual tax:taxon ?taxon .
FILTER langMatches( lang(?title), "en" )
}

Invoking ARQ


arq --query query01.rq --data taxonomy.rdf

Result


-------------------------------------------------------------------------------------------------------------
| individual | taxon | title |
=============================================================================================================
| <http://fr.wikipedia.org/wiki/Le_L%C3%A9zard> | <lsid:ncbi.nlm.nih.gov:taxonomy:8504> | "Curt Connors"@en |
| <http://fr.wikipedia.org/wiki/Le_L%C3%A9zard> | <lsid:ncbi.nlm.nih.gov:taxonomy:9606> | "Curt Connors"@en |
| <http://fr.wikipedia.org/wiki/Le_L%C3%A9zard> | <lsid:ncbi.nlm.nih.gov:taxonomy:8504> | "Lizard"@en |
| <http://fr.wikipedia.org/wiki/Le_L%C3%A9zard> | <lsid:ncbi.nlm.nih.gov:taxonomy:9606> | "Lizard"@en |
| <http://fr.wikipedia.org/wiki/Donald_Duck> | <lsid:ncbi.nlm.nih.gov:taxonomy:8839> | "Donald Duck"@en |
| <http://fr.wikipedia.org/wiki/Milou> | <lsid:ncbi.nlm.nih.gov:taxonomy:9615> | "Snowy"@en |
| <http://fr.wikipedia.org/wiki/Babar> | <lsid:ncbi.nlm.nih.gov:taxonomy:9785> | "Babar"@en |
| <http://fr.wikipedia.org/wiki/Tintin> | <lsid:ncbi.nlm.nih.gov:taxonomy:9606> | "Tintin"@en |
-------------------------------------------------------------------------------------------------------------


Adding a custom function


Now, I want to add a new function in sparql. This function 'isA' will take as input to parameters: the taxon/LSID of the child and the taxon/LSID of the parent and it will return a boolean 'true' if the 'child' has the 'parent' in his phylogeny. This new function is implemented by extending the class com.hp.hpl.jena.sparql.function.FunctionBase2. This new class contains an associative array child2parent mapping each taxon-id to its parent. This map is loaded as described bellow:

Pattern pat= Pattern.compile("[ \t]*\\|[ \t]*");
String line;
BufferedReader r= new BufferedReader(new FileReader(TAXONOMY_NODES_PATH));
while((line=r.readLine())!=null)
{
String tokens[]=pat.split(line, 3);
this.child2parent.put(
Integer.parseInt(tokens[0]),
Integer.parseInt(tokens[1])
);
}
r.close();
(...)

The function 'exec' will check if the two arguments are an URI and will invoke the method isChildOf

public NodeValue exec(NodeValue childNode, NodeValue parentNode)
{
(...check the nodes are URI)
return NodeValue.makeBoolean(isChildOf(childId,parentId));
}


The function 'isChildOf' loops in the map child2parent to check if the parent is an ancestor of the child:

while(true)
{
Integer id= child2parent.get(childid);
if(id==null || id==childid) return false;
if(id==parentid) return true;
childid=id;
}

Here is the complete source code of this class:

package org.lindenb.arq4taxonomy;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

import com.hp.hpl.jena.sparql.expr.ExprEvalException;
import com.hp.hpl.jena.sparql.expr.NodeValue;
import com.hp.hpl.jena.sparql.function.FunctionBase2;

public class isA
extends FunctionBase2
{
public static final String LSID="lsid:ncbi.nlm.nih.gov:taxonomy:";
public static final String TAXONOMY_NODES_PATH="/home/lindenb/tmp/TAXONOMY_NCBI/nodes.dmp";
private Map<Integer, Integer> child2parent=null;

public isA()
{

}
/**
* return a associative map child.id -> parent.id
* @return
*/
private Map<Integer, Integer> getTaxonomy()
{
if(this.child2parent==null)
{
this.child2parent= new HashMap<Integer, Integer>();
try
{
Pattern pat= Pattern.compile("[ \t]*\\|[ \t]*");
String line;
BufferedReader r= new BufferedReader(new FileReader(TAXONOMY_NODES_PATH));
while((line=r.readLine())!=null)
{
String tokens[]=pat.split(line, 3);
this.child2parent.put(
Integer.parseInt(tokens[0]),
Integer.parseInt(tokens[1])
);
}
r.close();
System.err.println(this.child2parent.size());
}
catch(IOException err)
{
err.printStackTrace();
throw new ExprEvalException(err);
}
}
return this.child2parent;
}

private boolean isChildOf(int childid,int parentid)
{
if(childid==parentid) return true;
Map<Integer,Integer> map= getTaxonomy();
while(true)
{
Integer id= map.get(childid);
if(id==null || id==childid) return false;
if(id==parentid) return true;
childid=id;
}
}

@Override
public NodeValue exec(NodeValue childNode, NodeValue parentNode)
{

if( childNode.isLiteral() ||
parentNode.isLiteral() ||
childNode.asNode().isBlank() ||
parentNode.asNode().isBlank())
{
return NodeValue.makeBoolean(false);
}

String childURI = childNode.asNode().getURI();
if(!childURI.startsWith(LSID))
{
return NodeValue.makeBoolean(false);
}


String parentURI = parentNode.asNode().getURI();
if(!parentURI.startsWith(LSID))
{
return NodeValue.makeBoolean(false);
}

int childId=0;
try {
childId= Integer.parseInt(childURI.substring(LSID.length()));
}
catch (NumberFormatException e)
{
return NodeValue.makeBoolean(false);
}

int parentId=0;
try {
parentId= Integer.parseInt(parentURI.substring(LSID.length()));
}
catch (NumberFormatException e)
{
return NodeValue.makeBoolean(false);
}

return NodeValue.makeBoolean(isChildOf(childId,parentId));
}

}

This class is then compiled and packaged into the file tax.jar:

javac -cp $(ARQ_CLASSPATH):. -sourcepath src src/org/lindenb/arq4taxonomy/isA.java
jar cvf tax.jar -C src org


and we add this jar in the classpath:
export CP=$PWD/tax.jar

To tell ARQ about this new functio,n we just add its classpath as a new PREFIX in the SPARQL query:
PREFIX fn: <java:org.lindenb.arq4taxonomy.>



First test


the following SPARQL query retrieves all the Mammals (http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Undef&id=40674) in the data set.

The query


PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX tax: <http://species.lindenb.org>
PREFIX fn: <java:org.lindenb.arq4taxonomy.>

SELECT ?individual ?taxon ?title
{
?individual a tax:Individual .
?individual dc:title ?title .
?individual tax:taxon ?taxon .
FILTER fn:isA(?taxon,<lsid:ncbi.nlm.nih.gov:taxonomy:40674> )
FILTER langMatches( lang(?title), "en" )
}

The command line


arq --query query02.rq --data taxonomy.rdf


The result


-------------------------------------------------------------------------------------------------------------
| individual | taxon | title |
=============================================================================================================
| <http://fr.wikipedia.org/wiki/Le_L%C3%A9zard> | <lsid:ncbi.nlm.nih.gov:taxonomy:9606> | "Curt Connors"@en |
| <http://fr.wikipedia.org/wiki/Le_L%C3%A9zard> | <lsid:ncbi.nlm.nih.gov:taxonomy:9606> | "Lizard"@en |
| <http://fr.wikipedia.org/wiki/Milou> | <lsid:ncbi.nlm.nih.gov:taxonomy:9615> | "Snowy"@en |
| <http://fr.wikipedia.org/wiki/Babar> | <lsid:ncbi.nlm.nih.gov:taxonomy:9785> | "Babar"@en |
| <http://fr.wikipedia.org/wiki/Tintin> | <lsid:ncbi.nlm.nih.gov:taxonomy:9606> | "Tintin"@en |
-------------------------------------------------------------------------------------------------------------


Second query


the following SPARQL query retrieves all the 'Sauropdias' (http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Undef&id=8457) in the RDF file.

The SPARQL file


PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX tax: <http://species.lindenb.org>
PREFIX fn: <java:org.lindenb.arq4taxonomy.>

SELECT ?individual ?taxon ?title
{
?individual a tax:Individual .
?individual dc:title ?title .
?individual tax:taxon ?taxon .
FILTER fn:isA(?taxon,<lsid:ncbi.nlm.nih.gov:taxonomy:8457> )
FILTER langMatches( lang(?title), "en" )
}

Command line


arq --query query03.rq --datataxonomy.rdf


The result


-------------------------------------------------------------------------------------------------------------
| individual | taxon | title |
=============================================================================================================
| <http://fr.wikipedia.org/wiki/Le_L%C3%A9zard> | <lsid:ncbi.nlm.nih.gov:taxonomy:8504> | "Curt Connors"@en |
| <http://fr.wikipedia.org/wiki/Le_L%C3%A9zard> | <lsid:ncbi.nlm.nih.gov:taxonomy:8504> | "Lizard"@en |
| <http://fr.wikipedia.org/wiki/Donald_Duck> | <lsid:ncbi.nlm.nih.gov:taxonomy:8839> | "Donald Duck"@en |
-------------------------------------------------------------------------------------------------------------



Et hop ! voila ! That's it !