Using the Disease Ontology (DO) to map the genes involved in a category of disease
In this post, I'll use the Disease Ontology (DO) to map all the genes involved in cardiac diseases.
Using BioPortal, I found that my term of interest is DOID:114 ("heart disease"). I now need to find all the descendants of this term.
The Disease Ontology is available for download here: http://www.obofoundry.org/cgi-bin/detail.cgi?id=disease_ontology. The following XSLT stylesheet retrieves all the descendants of a given term using a recursive algorithm:
<?xml version='1.0' encoding="ISO-8859-1"?>
<xsl:stylesheet
    xmlns:xsl='http://www.w3.org/1999/XSL/Transform'
    xmlns:obo="http://purl.obolibrary.org/obo/"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
    xmlns:doid="http://purl.obolibrary.org/obo/doid/src/doid.obo#"
    xmlns:owl="http://www.w3.org/2002/07/owl#"
    xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:oboInOwl="http://www.geneontology.org/formats/oboInOwl#"
    version='1.0'
    >
<xsl:output method='text' encoding="UTF-8"/>
<!-- ID of the root term, e.g. DOID:114 -->
<xsl:param name="ID"/>

<!-- start from the class whose oboInOwl:id is $ID -->
<xsl:template match="/">
  <xsl:apply-templates select="rdf:RDF/owl:Class[oboInOwl:id=$ID]"/>
</xsl:template>

<!-- print id, label and first parent (tab-separated), then recurse into the direct subclasses -->
<xsl:template match="owl:Class">
  <xsl:variable name="about" select="@rdf:about"/>
  <xsl:value-of select="oboInOwl:id"/>
  <xsl:text>&#9;</xsl:text>
  <xsl:value-of select="rdfs:label"/>
  <xsl:text>&#9;</xsl:text>
  <xsl:value-of select="rdfs:subClassOf/@rdf:resource"/>
  <xsl:text>&#10;</xsl:text>
  <xsl:apply-templates select="/rdf:RDF/owl:Class[rdfs:subClassOf/@rdf:resource=$about]"/>
</xsl:template>

</xsl:stylesheet>
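For readers who would rather not use XSLT, the same walk over the subclass links can be sketched in Python. This is only a minimal sketch, not the method used in this post: it assumes the OWL file uses the same `owl:Class`, `oboInOwl:id` and `rdfs:subClassOf/@rdf:resource` layout that the stylesheet relies on, it replaces the recursion with an explicit stack plus a visited set (so a term reachable through several parents is reported once), and the `TOY_OWL` document with its `TOY:*` identifiers is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespaces copied from the stylesheet above.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "owl": "http://www.w3.org/2002/07/owl#",
    "oboInOwl": "http://www.geneontology.org/formats/oboInOwl#",
}
RDF = "{%s}" % NS["rdf"]


def descendants(owl_xml, root_id):
    """Return the ids of root_id and of all its descendants as a set."""
    root = ET.fromstring(owl_xml)
    id_of = {}     # rdf:about URI -> oboInOwl:id
    children = {}  # parent URI -> list of child URIs
    for cls in root.findall("owl:Class", NS):
        about = cls.get(RDF + "about")
        if about is None:
            continue
        id_of[about] = cls.findtext("oboInOwl:id", default="", namespaces=NS)
        for sub in cls.findall("rdfs:subClassOf", NS):
            parent = sub.get(RDF + "resource")
            if parent is not None:
                children.setdefault(parent, []).append(about)

    # Depth-first walk starting from the class(es) carrying root_id.
    stack = [uri for uri, term_id in id_of.items() if term_id == root_id]
    seen = set()
    while stack:
        uri = stack.pop()
        if uri in seen:
            continue
        seen.add(uri)
        stack.extend(children.get(uri, []))
    return {id_of[uri] for uri in seen}


# Hypothetical toy ontology: A <- B <- C, and an unrelated D.
TOY_OWL = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:owl="http://www.w3.org/2002/07/owl#"
         xmlns:oboInOwl="http://www.geneontology.org/formats/oboInOwl#">
  <owl:Class rdf:about="urn:toy:A"><oboInOwl:id>TOY:A</oboInOwl:id></owl:Class>
  <owl:Class rdf:about="urn:toy:B"><oboInOwl:id>TOY:B</oboInOwl:id>
    <rdfs:subClassOf rdf:resource="urn:toy:A"/></owl:Class>
  <owl:Class rdf:about="urn:toy:C"><oboInOwl:id>TOY:C</oboInOwl:id>
    <rdfs:subClassOf rdf:resource="urn:toy:B"/></owl:Class>
  <owl:Class rdf:about="urn:toy:D"><oboInOwl:id>TOY:D</oboInOwl:id></owl:Class>
</rdf:RDF>"""

print(sorted(descendants(TOY_OWL, "TOY:A")))  # ['TOY:A', 'TOY:B', 'TOY:C']
```

As in the stylesheet, the starting term itself is part of the result.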
Usage:
xsltproc --stringparam ID "DOID:114" do.xsl do.owl |\
  sort | uniq | cut -f 1 > doids.txt
Result:
$ head doids.txt
DOID:0050650
DOID:0060000
DOID:0060036
DOID:0060068
DOID:10234
DOID:10266
DOID:10272
DOID:10273
DOID:10314
DOID:10392
In "Annotating the human genome with Disease Ontology", Osborne et al. mapped the terms of DO to OMIM and to NCBI Gene. The database dump is available at http://projects.bioinformatics.northwestern.edu/do_rif/do_rif.human.txt. We can use the file "doids.txt" and the fgrep command to extract the genes associated with our selected terms.
$ curl -s "http://projects.bioinformatics.northwestern.edu/do_rif/do_rif.human.txt" | fgrep -w -f doids.txt
100133941	A decrease in CD4+CD25+ T cell numbers in mitral stenosis patients might suggest a role for cellular autoimmunity in a smoldering rheumatic process.	17944116	C0026269	DOID:1754	in mitral stenosis patients	734
10014	Chronic upregulation/activation of CaMKIID, and PKD in heart failure shifts HDAC5 out of the nucleus, derepressing transcription of hypertrophic genes.	18218981	C0018801	DOID:6000	in heart failure	1000
10068	IL-18 levels, which are determined in part by variation in IL18/IL18BP, play a role in coronary heart disease development and postsurgery outcome.	17951325	C0010054	DOID:3363	in coronary heart disease development	756
10068	IL-18 levels, which are determined in part by variation in IL18/IL18BP, play a role in coronary heart disease development and postsurgery outcome.	17951325	C0010068	DOID:3393	in coronary heart disease development	756
100	ADA*2 allele may decrease genetic susceptibility to coronary artery disease.	17287605	C0010054	DOID:3363	to coronary artery disease	1000
(...)
The first column contains the NCBI Gene ID. Let's extract this column and query the UCSC MySQL server for the positions of those genes:
$ curl -s "http://projects.bioinformatics.northwestern.edu/do_rif/do_rif.human.txt" |\
  fgrep -w -f doids.txt | cut -f 1 | sort | uniq |\
  awk '{printf("select distinct R.chrom,R.txStart,R.txEnd,L.product,L.locusLinkId from refLink as L,refGene as R where R.name=L.mrnaAcc and L.locusLinkId=%s;\n",$1);}' |\
  mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -D hg19 -N
chr20	43248162	43280376	adenosine deaminase	100
chrY	21152525	21154705	signal transducer CD24 precursor	100133941
chr17	42154120	42201014	histone deacetylase 5 isoform 1	10014
chr17	42154120	42201014	histone deacetylase 5 isoform 3	10014
chr11	71710108	71713574	interleukin-18-binding protein isoform a precursor	10068
chr11	71709957	71713574	interleukin-18-binding protein isoform a precursor	10068
chr11	71710972	71713574	interleukin-18-binding protein isoform b precursor	10068
chr11	71710662	71713574	interleukin-18-binding protein isoform a precursor	10068
chr11	71709957	71713850	interleukin-18-binding protein isoform d precursor	10068
chr11	71710108	71713965	interleukin-18-binding protein isoform c precursor	10068
chr19	16435650	16438339	Krueppel-like factor 2	10365
chr7	30464142	30518393	nucleotide-binding oligomerization domain-containing protein 1	10392
chr20	35169886	35178226	myosin regulatory light polypeptide 9 isoform a	10398
chr20	35169886	35178226	myosin regulatory light polypeptide 9 isoform b	10398
chr12	48128452	48152889	rap guanine nucleotide exchange factor 3 isoform a	10411
chr12	48128452	48152244	rap guanine nucleotide exchange factor 3 isoform b	10411
chr12	48128452	48152181	rap guanine nucleotide exchange factor 3 isoform b	10411
chr16	56995834	57017756	cholesteryl ester transfer protein precursor	1071
chr1	11104854	11107296	mannan-binding lectin serine protease 2 isoform 2 precursor	10747
chr1	11086579	11107296	mannan-binding lectin serine protease 2 isoform 1 preproprotein	10747
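The filter-and-query step can also be sketched in Python, which makes the assumptions of the pipeline explicit. This is a rough sketch only: the function names `genes_for_terms` and `build_queries` are hypothetical, the dump is assumed to be tab-separated with the DOID in the fifth column (as the output above suggests), and unlike `fgrep -w`, which matches an ID anywhere on the line, the sketch only looks at that one column. The first two sample rows are copied from the output above; the third is a made-up row that should be rejected.

```python
# Same SQL statement as in the awk command, one query per gene ID.
SQL = ("select distinct R.chrom,R.txStart,R.txEnd,L.product,L.locusLinkId "
       "from refLink as L,refGene as R "
       "where R.name=L.mrnaAcc and L.locusLinkId=%d;")


def genes_for_terms(lines, doids, doid_col=4):
    """Return the sorted unique gene IDs (column 1) of rows whose DOID is in doids."""
    genes = set()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) > doid_col and fields[doid_col] in doids:
            genes.add(fields[0])
    # Plain string sort, like the `sort` command in the pipeline.
    return sorted(genes)


def build_queries(gene_ids):
    """Build one SQL statement per gene; int() rejects non-numeric IDs."""
    return [SQL % int(gene_id) for gene_id in gene_ids]


rows = [
    "100\tADA*2 allele may decrease genetic susceptibility to coronary artery disease."
    "\t17287605\tC0010054\tDOID:3363\tto coronary artery disease\t1000",
    "10014\tChronic upregulation/activation of CaMKIID, and PKD in heart failure shifts "
    "HDAC5 out of the nucleus, derepressing transcription of hypertrophic genes."
    "\t18218981\tC0018801\tDOID:6000\tin heart failure\t1000",
    # Hypothetical row, not from the dump: its DOID is not in the selection.
    "999999\ttoy row\t0\tC0000000\tDOID:TOY\ttoy\t0",
]

selected = {"DOID:3363", "DOID:6000"}
for query in build_queries(genes_for_terms(rows, selected)):
    print(query)
```

The printed statements can be piped to the same `mysql` command as above.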
Checking: the first gene is ADA (adenosine deaminase). It is associated with DOID:3363 (coronary arteriosclerosis) and is cited in PMID:17287605, "ADA*2 allele of the adenosine deaminase gene may protect against coronary artery disease."
That's it,
Pierre