24 April 2012

Mapping the genes involved in a category of disease: the GeneWikiPlus + SPARQL way.

In my previous post, I've used the RDF/XML files of the Disease Ontology to map all the genes involved in a cardiac disease.

Andrew Su immediately mentioned on Twitter that he was working on GeneWiki+, an integration of GeneWiki on Semantic-MediaWiki that could answer the same question.





Later, Benjamin Good announced that a SPARQL endpoint for GeneWiki+ was now available:


The following java code uses the Jena/ARQ API to query this SPARQL endpoint. For a given Disease Ontology accession identifier, it fetches all the genes associated to this disease and run recursively with the sub-classes of this disease.



Here is the output (gene-name, gene-id, disease) with DOID:114 ("Heart Disease"):
Protein C 5624 Heart disease
HMG-CoA reductase 3156 Heart disease
SCARB1 949 Heart disease
Coagulation factor II receptor 2149 Heart disease
Cathepsin S 1520 Heart disease
ABCA1 19 Heart disease
CHD7 55636 Heart disease
GJA5 2702 Heart disease
ENTPD1 953 Heart disease
PEDF 5176 Heart disease
HMG CoA reductase 3156 Heart disease
PROC 5624 Heart disease
F2R 2149 Heart disease
SERPINF1 5176 Heart disease
HMGCR 3156 Heart disease
CTSS 1520 Heart disease
Cytochrome c 54205 Heart failure
FOXP1 27086 Heart failure
Vasoactive intestinal peptide 7432 Heart failure
Angiotensin-converting enzyme 1636 Heart failure
PPP1CA 5499 Heart failure
Transferrin 7018 Heart failure
Natriuretic peptide precursor C 4880 Heart failure
Insulin-like growth factor 1 3479 Heart failure
CA-125 94025 Heart failure
Myosin binding protein C, cardiac 4607 Heart failure
MYH7 4625 Heart failure
Tafazzin 6901 Heart failure
5-HT2B receptor 3357 Heart failure
Beta-1 adrenergic receptor 153 Heart failure
PTGS2 5743 Heart failure
EPAS1 2034 Heart failure
Nociceptin receptor 4987 Heart failure
Cystatin C 1471 Heart failure
Ryanodine receptor 2 6262 Heart failure
Multidrug resistance-associated protein 2 1244 Heart failure
KCNA5 3741 Heart failure
ANXA6 309 Heart failure
CMA1 1215 Heart failure
KLF15 28999 Heart failure
IL1RL1 9173 Heart failure
JPH2 57158 Heart failure
Heart-type fatty acid binding protein 2170 Heart failure
TF 7018 Heart failure
ABCC2 1244 Heart failure
Cytochrome-c 54205 Heart failure
HTR2B 3357 Heart failure
Cytochrome C 54205 Heart failure
Hif2a 2034 Heart failure
FABP3 2170 Heart failure
MYBPC3 4607 Heart failure
Angiotensin converting enzyme 1636 Heart failure
IGF-1 3479 Heart failure
Insulin-like growth factor-1 3479 Heart failure
Stress-induced polymorphic ventricular tachycardia 6262 Heart failure
C-type natriuretic peptide 4880 Heart failure
OPRL1 4987 Heart failure
CYCS 54205 Heart failure
ADRB1 153 Heart failure
TAZ 6901 Heart failure
VIP 7432 Heart failure
IGF1 3479 Heart failure
NPPC 4880 Heart failure
ACE 1636 Heart failure
CST3 1471 Heart failure
MUC16 94025 Heart failure
RYR2 6262 Heart failure
Aquaporin-2 359 Congestive heart failure
Aquaporin 2 359 Congestive heart failure
Atrial natriuretic peptide 4878 Congestive heart failure
Brain natriuretic peptide 4879 Congestive heart failure
Phospholamban 5350 Congestive heart failure
CYP2C9 1559 Congestive heart failure
RAGE (receptor) 177 Congestive heart failure
Angiotensin II receptor type 1 185 Congestive heart failure
Programmed cell death 1 5133 Congestive heart failure
AGTR1 185 Congestive heart failure
Atrial natriuretic factor 4878 Congestive heart failure
PDCD1 5133 Congestive heart failure
AGER 177 Congestive heart failure
AQP2 359 Congestive heart failure
PLN 5350 Congestive heart failure
NPPB 4879 Congestive heart failure
NPPA 4878 Congestive heart failure
GroEL 3329 Endocarditis
Ornithine transcarbamylase 5009 Endocarditis
Valosin-containing protein 7415 Endocarditis
Parathyroid hormone 1 receptor 5745 Endocarditis
VDAC1 7416 Endocarditis
RuvB-like 1 8607 Endocarditis
TUBB2A 7280 Endocarditis
ACTG1 71 Endocarditis
ACTC1 70 Endocarditis
PRDX6 9588 Endocarditis
Hyaluronan-mediated motility receptor 3161 Endocarditis
HSPB6 126393 Endocarditis
Parathyroid hormone receptor 1 5745 Endocarditis
VCP 7415 Endocarditis
OTC 5009 Endocarditis
PTH1R 5745 Endocarditis
HSPD1 3329 Endocarditis
HMMR 3161 Endocarditis
RUVBL1 8607 Endocarditis
HCN4 10021 Sick sinus syndrome
Heparin-binding EGF-like growth factor 1839 Aortic valve disease
HBEGF 1839 Aortic valve disease
Von Willebrand factor 7450 Aortic valve stenosis
ADAMTS13 11093 Aortic valve stenosis
VWF 7450 Aortic valve stenosis
Elastin 2006 Supravalvular aortic stenosis
ELN 2006 Supravalvular aortic stenosis
PRG4 10216 Pericarditis
Histamine H3 receptor 11255 Myocardial ischemia
MAP3K7IP1 10454 Myocardial ischemia
Vascular endothelial growth factor A 7422 Myocardial ischemia
Cathepsin L1 1514 Myocardial ischemia
VEGF-A 7422 Myocardial ischemia
VEGFA 7422 Myocardial ischemia
CTSL1 1514 Myocardial ischemia
TAB1 10454 Myocardial ischemia
HRH3 11255 Myocardial ischemia
APOA1 335 Coronary heart disease
APOC3 345 Coronary heart disease
Lipoprotein(a) 4018 Coronary heart disease
Brain natriuretic peptide 4879 Coronary heart disease
Beta-3 adrenergic receptor 155 Coronary heart disease
Insulin-like growth factor 1 3479 Coronary heart disease
Perlecan 3339 Coronary heart disease
PCSK9 255738 Coronary heart disease
Cholesterylester transfer protein 1071 Coronary heart disease
Arachidonate 5-lipoxygenase 240 Coronary heart disease
Apolipoprotein B 338 Coronary heart disease
Apolipoprotein A1 335 Coronary heart disease
Beta-1 adrenergic receptor 153 Coronary heart disease
Apolipoprotein C3 345 Coronary heart disease
Lipoprotein-associated phospholipase A2 7941 Coronary heart disease
NEUROG3 50674 Coronary heart disease
5-lipoxygenase 240 Coronary heart disease
ApoA1 335 Coronary heart disease
CETP 1071 Coronary heart disease
ApoB 338 Coronary heart disease
IGF-1 3479 Coronary heart disease
Insulin-like growth factor-1 3479 Coronary heart disease
ApoCIII 345 Coronary heart disease
PLA2G7 7941 Coronary heart disease
ADRB3 155 Coronary heart disease
ADRB1 153 Coronary heart disease
APOB 338 Coronary heart disease
ALOX5 240 Coronary heart disease
IGF1 3479 Coronary heart disease
NPPB 4879 Coronary heart disease
HSPG2 3339 Coronary heart disease
LPA 4018 Coronary heart disease
CYP7A1 1581 Myocardial infarction
Caspase 3 836 Myocardial infarction
C-reactive protein 1401 Myocardial infarction
Renin 5972 Myocardial infarction
Factor VII 2155 Myocardial infarction
Factor H 3075 Myocardial infarction
Hepatic lipase 3990 Myocardial infarction
Myeloperoxidase 4353 Myocardial infarction
Endothelial protein C receptor 10544 Myocardial infarction
ALDH2 217 Myocardial infarction
C1-inhibitor 710 Myocardial infarction
Basic fibroblast growth factor 2247 Myocardial infarction
Myocyte-specific enhancer factor 2A 4205 Myocardial infarction
5-Lipoxygenase-activating protein 241 Myocardial infarction
RAGE (receptor) 177 Myocardial infarction
OLR1 4973 Myocardial infarction
Beta-1 adrenergic receptor 153 Myocardial infarction
PTGS2 5743 Myocardial infarction
Cholesterol 7 alpha-hydroxylase 1581 Myocardial infarction
GPVI 51206 Myocardial infarction
Adrenomedullin 133 Myocardial infarction
Prostacyclin synthase 5740 Myocardial infarction
Cystatin C 1471 Myocardial infarction
Tenascin X 7148 Myocardial infarction
Thymosin beta-4 7114 Myocardial infarction
GCLM 2730 Myocardial infarction
S100A9 6280 Myocardial infarction
IL1RL1 9173 Myocardial infarction
LGALS2 3957 Myocardial infarction
CKM (gene) 1158 Myocardial infarction
ABCC9 10060 Myocardial infarction
Renalase 55328 Myocardial infarction
VTI1A 143187 Myocardial infarction
MIAT (gene) 440823 Myocardial infarction
BFGF 2247 Myocardial infarction
TMSB4X 7114 Myocardial infarction
CASP3 836 Myocardial infarction
Caspase-3 836 Myocardial infarction
Complement factor H 3075 Myocardial infarction
MEF2A 4205 Myocardial infarction
5-lipoxygenase activating protein 241 Myocardial infarction
Factor VIIa 2155 Myocardial infarction
PROCR 10544 Myocardial infarction
GP6 51206 Myocardial infarction
F7 2155 Myocardial infarction
AGER 177 Myocardial infarction
ADRB1 153 Myocardial infarction
MIAT 440823 Myocardial infarction
CFH 3075 Myocardial infarction
CKM 1158 Myocardial infarction
CRP 1401 Myocardial infarction
LIPC 3990 Myocardial infarction
RNLS 55328 Myocardial infarction
PTGIS 5740 Myocardial infarction
TNXB 7148 Myocardial infarction
SERPING1 710 Myocardial infarction
FGF2 2247 Myocardial infarction
REN 5972 Myocardial infarction
ADM 133 Myocardial infarction
CST3 1471 Myocardial infarction
MPO 4353 Myocardial infarction
ALOX5AP 241 Myocardial infarction
Myoglobin 4151 Acute myocardial infarction
Tissue plasminogen activator 5327 Acute myocardial infarction
MIRN21 406991 Acute myocardial infarction
Apolipoprotein B 338 Acute myocardial infarction
Endothelin 1 1906 Acute myocardial infarction
MMP3 4314 Acute myocardial infarction
Heart-type fatty acid binding protein 2170 Acute myocardial infarction
Alteplase 5327 Acute myocardial infarction
FABP3 2170 Acute myocardial infarction
ApoB 338 Acute myocardial infarction
MB 4151 Acute myocardial infarction
APOB 338 Acute myocardial infarction
PLAT 5327 Acute myocardial infarction
EDN1 1906 Acute myocardial infarction
MIR21 406991 Acute myocardial infarction
Adenosine A1 receptor 134 Myocardial stunning
SOD2 6648 Myocardial stunning
ADORA1 134 Myocardial stunning
MYH7 4625 Endocardial fibroelastosis
Tafazzin 6901 Endocardial fibroelastosis
TAZ 6901 Endocardial fibroelastosis
Nav1.5 6331 Conduction disease
SCN5A 6331 Conduction disease
PRKAG2 51422 Wolff-Parkinson-White syndrome
TNNT2 7139 Restrictive cardiomyopathy
Titin 7273 Hypertrophic cardiomyopathy
CSRP3 8048 Hypertrophic cardiomyopathy
CD36 948 Hypertrophic cardiomyopathy
Myosin binding protein C, cardiac 4607 Hypertrophic cardiomyopathy
MYH7 4625 Hypertrophic cardiomyopathy
MYL9 10398 Hypertrophic cardiomyopathy
TNNT2 7139 Hypertrophic cardiomyopathy
ACTC1 70 Hypertrophic cardiomyopathy
Endothelin 2 1907 Hypertrophic cardiomyopathy
MYL2 4633 Hypertrophic cardiomyopathy
MYH6 4624 Hypertrophic cardiomyopathy
MYBPC1 4604 Hypertrophic cardiomyopathy
MYL3 4634 Hypertrophic cardiomyopathy
JPH2 57158 Hypertrophic cardiomyopathy
MYLK2 85366 Hypertrophic cardiomyopathy
MYBPC3 4607 Hypertrophic cardiomyopathy
CD-36 948 Hypertrophic cardiomyopathy
TTN 7273 Hypertrophic cardiomyopathy
EDN2 1907 Hypertrophic cardiomyopathy
Titin 7273 Dilated cardiomyopathy
CSRP3 8048 Dilated cardiomyopathy
Phospholamban 5350 Dilated cardiomyopathy
Tafazzin 6901 Dilated cardiomyopathy
Beta-1 adrenergic receptor 153 Dilated cardiomyopathy
LMNA 4000 Dilated cardiomyopathy
Palladin 23022 Dilated cardiomyopathy
Fukutin 2218 Dilated cardiomyopathy
TNNT2 7139 Dilated cardiomyopathy
ACTC1 70 Dilated cardiomyopathy
SGCD 6444 Dilated cardiomyopathy
Programmed cell death 1 5133 Dilated cardiomyopathy
LDB3 11155 Dilated cardiomyopathy
ABCC9 10060 Dilated cardiomyopathy
PDCD1 5133 Dilated cardiomyopathy
ADRB1 153 Dilated cardiomyopathy
TTN 7273 Dilated cardiomyopathy
TAZ 6901 Dilated cardiomyopathy
PLN 5350 Dilated cardiomyopathy
PALLD 23022 Dilated cardiomyopathy
FKTN 2218 Dilated cardiomyopathy

Note: In my previous post ADA was found to be associated to DOID:3363 (coronary arteriosclerosis). This result was not retrieved using SPARQL and this information is not available on the GeneWiki+ page for ADA. But keep in mind that GeneWiki+ is still under development.

That's it,

Pierre


13 April 2012

Using the Disease ontology (DO) to map the genes involved in a category of disease. My notebook

In the current post, I'll use the disease ontology (DO) to map all the genes involved in a cardiac disease.


Using The BioPortal, I found that my term of interest is DOID:114 ("Heart Disease"). I now need to find all the descendants of this term.


The Disease Ontology is available for download here: http://www.obofoundry.org/cgi-bin/detail.cgi?id=disease_ontology. The following XSLT stylesheet retrieves of all the descendants of a given term using a recursive algorithm:



Usage:

xsltproc  --stringparam ID "DOID:114"   do.xsl  do.owl|\
sort | uniq | cut -f 1 & doids.txt 

Result:
$ head doids.txt

DOID:0050650
DOID:0060000
DOID:0060036
DOID:0060068
DOID:10234
DOID:10266
DOID:10272
DOID:10273
DOID:10314
DOID:10392

In Annotating the human genome with Disease Ontology, Osborne & al. have mapped the terms of DO to OMIM and to NCBI Gene. The database dump is available at http://projects.bioinformatics.northwestern.edu/do_rif/do_rif.human.txt. We can use the file "doids.txt" and the fgrep command to extract the genes associated to our selected terms.

~$ curl -s "http://projects.bioinformatics.northwestern.edu/do_rif/do_rif.human.txt" | fgrep -w -f doids.txt

100133941 A decrease in CD4+CD25+ T cell numbers in mitral stenosis patients might suggest a role for cellular autoimmunity in a smoldering rheumatic process. 17944116 C0026269 DOID:1754 in mitral stenosis patients 734
10014 Chronic upregulation/activation of CaMKIID, and PKD in heart failure shifts HDAC5 out of the nucleus, derepressing transcription of hypertrophic genes. 18218981 C0018801 DOID:6000 in heart failure 1000
10068 IL-18 levels, which are determined in part by variation in IL18/IL18BP, play a role in coronary heart disease development and postsurgery outcome. 17951325 C0010054 DOID:3363 in coronary heart disease development 756
10068 IL-18 levels, which are determined in part by variation in IL18/IL18BP, play a role in coronary heart disease development and postsurgery outcome. 17951325 C0010068 DOID:3393 in coronary heart disease development 756
100 ADA*2 allele may decrease genetic susceptibility to coronary artery disease. 17287605 C0010054 DOID:3363 to coronary artery disease 1000

(...)
The first column contains the NCBI/Gene ID. Let's extract this column and ask the mysql server of the UCSC for the positions of those genes:


$ curl -s "http://projects.bioinformatics.northwestern.edu/do_rif/do_rif.human.txt" |\
fgrep -w -f doids.txt | cut -d '   ' -f 1 | sort | uniq |\
awk '{printf("select distinct R.chrom,R.txStart,R.txEnd,L.product,L.locusLinkId from refLink as L,refGene as R where R.name=L.mrnaAcc and L.locusLinkId=%s;\n",$1);}' | \
mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -D hg19 -N

chr20 43248162 43280376 adenosine deaminase 100
chrY 21152525 21154705 signal transducer CD24 precursor 100133941
chr17 42154120 42201014 histone deacetylase 5 isoform 1 10014
chr17 42154120 42201014 histone deacetylase 5 isoform 3 10014
chr11 71710108 71713574 interleukin-18-binding protein isoform a precursor 10068
chr11 71709957 71713574 interleukin-18-binding protein isoform a precursor 10068
chr11 71710972 71713574 interleukin-18-binding protein isoform b precursor 10068
chr11 71710662 71713574 interleukin-18-binding protein isoform a precursor 10068
chr11 71709957 71713850 interleukin-18-binding protein isoform d precursor 10068
chr11 71710108 71713965 interleukin-18-binding protein isoform c precursor 10068
chr19 16435650 16438339 Krueppel-like factor 2 10365
chr7 30464142 30518393 nucleotide-binding oligomerization domain-containing protein 1 10392
chr20 35169886 35178226 myosin regulatory light polypeptide 9 isoform a 10398
chr20 35169886 35178226 myosin regulatory light polypeptide 9 isoform b 10398
chr12 48128452 48152889 rap guanine nucleotide exchange factor 3 isoform a 10411
chr12 48128452 48152244 rap guanine nucleotide exchange factor 3 isoform b 10411
chr12 48128452 48152181 rap guanine nucleotide exchange factor 3 isoform b 10411
chr16 56995834 57017756 cholesteryl ester transfer protein precursor 1071
chr1 11104854 11107296 mannan-binding lectin serine protease 2 isoform 2 precursor 10747
chr1 11086579 11107296 mannan-binding lectin serine protease 2 isoform 1 preproprotein 10747


checking; the first gene is ADA adenosine deaminase. It is associated to DOID:3363 (coronary arteriosclerosis) and it is cited in pmid:17287605 "ADA*2 allele of the adenosine deaminase gene may protect against coronary artery disease.".


That's it,

Pierre


I get mail: result of the contest "Draw a PhD Candidate"

I recently submitted the following illustration for a contest organized by the French University of Nice : "Draw a PHD candidate".



'Draw a PhD' Contest by ~yokofakun on deviantART



and I received the following email today:

 
De : D. Rauch
Envoyé le : Jeudi 12 avril 2012 21h31
Objet : résultat concours "Dessine-moi un doctorant !"

Monsieur,

L’AJC 06  a l’honneur de vous informer que votre œuvre « Gwennan André,
doctorante » a été retenue par le jury du concours « Dessine-moi un doctorant !
» parmi les œuvres gagnantes.

Votre œuvre ayant été classée à la première place dans la catégorie "public",
vous avez gagné un Ipad 2.
(...)

(translation:)

Sir,

The association AJC 06 is proud to inform you that your work titled "Gwennan André,
PhD student " was chosen by the jury of the contest" Draw me a PhD!
"Among the winning entries.

Your work has been ranked in first place in the "public" section, you have won an IPad 2.
 

Unfortunately I cannot go to Nice next week !!!

I have to renounce and say "goodbye" to this IPAD2 :-((


Sniffffffff


Pierre


EDIT: Simple & stupid Google translation of the text:

"Zig-zag path to understanding myself that it was research health that interested me, because this is an area overlooked by leaving the baccalaureate (At least when the secondary school is far from major cities, large research centers and our families are not the domain!). So the first year of medical school. This is where I found my way, I loved to understand the disease but did not want to be in direct contact with the patient. So reorientation IUT biological engineering: essentially the stage "business" of the second year. Two months in a lab search for "Robert Gordon University" in Aberdeen, Scotland history to be quickly immersed in a research lab and into human health in English. Excellent and successful experience: motivation professional found. Then L3 cell biology, physiology and genetics in Rennes and Master of Nantes "Biology Health" search option with a therapeutic internship at the Institute of alternation in the Thorax team Gervaise Loirand.
In 2010, got a scholarship to complete my thesis Ministerial in this team on a project of Vincent Sauzeau. Our goal is to understand the mechanisms underlying vascular disease to better manage diseases often heavy consequences.
Projects after the thesis: do a postdoc abroad to live fully international aspect of the research and see Search elsewhere. Trying to reconcile this with the personal life, something not always evident with extension to these studies, especially for a woman. Then return to do research in France, ideally in a position of Research Fellow in the audience. But it's still very open in my head (private, public, engineer ... to see according to experience and opportinities in 5 years ...!)
"

06 April 2012

Apache SOLR and GeneOntology: Creating the JQUERY-UI client (with autocompletion)

In the previous post I showed how to use Apache SOLR to index the content of gene ontology. In this post I will show how the SOLR server can be used as a source of data for a "JQuery-UI autocomplete" HTML input.

The JQuery autocomplete Widget expects an JSON array of Objects with label and value properties :
[ { label: "Choice1", value: "value1" }, ... ]

We need to tell SOLR to return the GO terms formatted this way.

Fortunately SOLR provides a mechanism to captures the output of the normal XML response and applies a XSLT transformation to it.

The following XSLT stylesheet transforms the XML response to JSON, it is created in the conf/xslt directory (apache-solr-3.5.0/example/solr/conf/xslt/solr2json.xsl)


We can test the stylesheet by adding the tr parameter in the query URL:
$ curl "http://localhost:8983/solr/select?q=go_definition:cancer&wt=xslt&tr=solr2json.xsl"

[
    {
        "label": "Z-phenylacetaldoxime metabolic process",
        "value": "GO:0018983"
    },
    {
        "label": "epothilone metabolic process",
        "value": "GO:0050813"
    },
    {
        "label": "epothilone biosynthetic process",
        "value": "GO:0050814"
    },
    {
        "label": "aflatoxin biosynthetic process",
        "value": "GO:0045122"
    },
    {
        "label": "aflatoxin metabolic process",
        "value": "GO:0046222"
    },
    {
        "label": "aflatoxin catabolic process",
        "value": "GO:0046223"
    }
]

The jquery client

Let's create the HTML/JQuery Client, I have downloaded the component for jquery and jquery-ui in the apache-solr-3.5.0/example/work/Jetty_xxxx_solr.war__solr__xxxx/webapp/ (and that's a bad idea, everything will be deleted with the SOLR server will be stopped or re-deployed).

I've also created the following HTML page 'solr.html' in the same directory: Let's visit http://localhost:8983/solr/solr.html to test the page:


(image via openwetware)

 That worked ! :-)

That's it,

Pierre

Indexing the content of Gene Ontology with apache SOLR

Via Wikipedia:"Solr (http://lucene.apache.org/solr/) is an open source enterprise search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, and rich document (e.g., Word, PDF) handling. Providing distributed search and index replication, Solr is highly scalable." In the this post, I'll show how I've used SOLR to index the content of GeneOntology.

Download and install SOLR

Download from http://mirrors.ircam.fr/pub/apache/lucene/solr/3.5.0/apache-solr-3.5.0.tgz.
tar xvfz apache-solr-3.5.0.tgz
rm apache-solr-3.5.0.tgz

Configure schema.xml

We need to tell SOLR about the which fields of GO will be indexed, what are their type, how they should be tokenized and parsed. This information is defined in the schema.xml. The following components will be indexed: accession, name, synonym and definition. Edit apache-solr-3.5.0/example/solr/conf/schema.xml and add the following <fields>:

<field name="go_name" type="text_general" indexed="true" stored="true" multiValued="false"/>
<field name="go_synonym" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="go_definition" type="text_general" indexed="true" stored="true" multiValued="false"/>

Start the SOLR server

In this example, the SOLR server is started using the simple Jetty server provided in the distribution:

$ cd apache-solr-3.5.0/example/example
$ java -jar start.jar

(...)

Indexing Gene Ontology

Go is downloaded as RDF/XML from http://archive.geneontology.org/latest-termdb/go_daily-termdb.rdf-xml.gz
 
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE go:go PUBLIC "-//Gene Ontology//Custom XML/RDF Version 2.0//EN" "http://www.geneontology.org/dtd/go.dtd">

<go:go xmlns:go="http://www.geneontology.org/dtds/go.dtd#" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:RDF>
        <go:term rdf:about="http://www.geneontology.org/go#GO:0000001">
            <go:accession>GO:0000001</go:accession>
            <go:name>mitochondrion inheritance</go:name>
            <go:synonym>mitochondrial inheritance</go:synonym>
            <go:definition>The distribution of mitochondria, including the mitochondrial genome, into daughter cells after mitosis or meiosis, mediated by interactions between mitochondria and the cytoskeleton.</go:definition>
            <go:is_a rdf:resource="http://www.geneontology.org/go#GO:0048308" />
            <go:is_a rdf:resource="http://www.geneontology.org/go#GO:0048311" />
        </go:term>
        <go:term rdf:about="http://www.geneontology.org/go#GO:0000002">
            <go:accession>GO:0000002</go:accession>
            <go:name>mitochondrial genome maintenance</go:name>
            <go:definition>The maintenance of the structure and integrity of the mitochondrial genome; includes replication and segregation of the mitochondrial chromosome.</go:definition>
            <go:is_a rdf:resource="http://www.geneontology.org/go#GO:0007005" />
            <go:dbxref rdf:parseType="Resource">
                <go:database_symbol>InterPro</go:database_symbol>
(...)
 
We now need to transform this XML file to another XML file that can be indexed by the SOLR server.  

"You can modify a Solr index by POSTing XML Documents containing instructions to add (or update) documents, delete documents, commit pending adds and deletes, and optimize your index."

The following XSLT stylesheet is used to transform the RDF/XML for GO:


$ xsltproc --novalid go2solr.xsl go_daily-termdb.rdf-xml.gz > add.xml
$ cat add.xml

Before indexing the current disk usage under apache-solr-3.5.0 is 136Mo. We can now use the java utiliy post.jar to index GeneOntology.

 $ cd  ~/package/apache-solr-3.5.0/example/exampledocs
 $ java -jar post.jar  add.xml

SimplePostTool: version 1.4
SimplePostTool: POSTing files to http://localhost:8983/solr/update..
SimplePostTool: POSTing file jeter.xml
SimplePostTool: COMMITting Solr index changes..

After indexing, the disk usage under apache-solr-3.5.0 is 153Mo.

Querying

Search for the GO terms having go:definition containing "cancer" a go:name containing "genome" but discard those having go:definition containing "metabolism".
 curl "http://localhost:8983/solr/select/?q=go_definition%3Acancer+go_name%3Agenome+-go_definition%3Ametabolism&version=2.2&start=0&rows=10&indent=on"
Same query, but return the result as a JSON structure:
 curl "http://localhost:8983/solr/select/?q=go_definition%3Acancer+go_name%3Agenome+-go_definition%3Ametabolism&version=2.2&start=0&rows=10&indent=on&wt=json"
That's it, Pierre