The best answer was given by Neil who suggested to use the BiomartR package.
On my side, I submitted two solutions:
1st solution: use a recursive XSLT stylesheet
The following XSLT stylesheet gets the GeneID from the mRNA, download the XML document describing this gene and extracts the identifiers for Gene Ontology :<?xml version="1.0" encoding="ISO-8859-1" ?>
<xsl:stylesheet version="1.0" xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>
<xsl:output method="text" encoding="ISO-8859-1"/>
<xsl:template match="/">
<xsl:apply-templates/>
</xsl:template>
<xsl:template match="Dbtag[Dbtag_db='GeneID']">
<xsl:for-each select="Dbtag_tag/Object-id/Object-id_id">
<xsl:variable name="genid" select="."/>
<xsl:variable name="url" select="concat('http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=gene&retmode=xml&id=',$genid)"/>
<xsl:apply-templates select="document($url,//Other-source)" mode="go"/>
</xsl:for-each>
</xsl:template>
<xsl:template match="Other-source" mode="go">
<xsl:if test="Other-source_src/Dbtag/Dbtag_db='GO'">
<xsl:value-of select="concat('GO:',Other-source_src/Dbtag/Dbtag_tag/Object-id/Object-id_id)"/>
<xsl:text> </xsl:text>
<xsl:value-of select="Other-source_anchor"/>
<xsl:text>
</xsl:text>
</xsl:if>
</xsl:template>
<xsl:template match="text()">
<xsl:apply-templates/>
</xsl:template>
<xsl:template match="text()" mode="go">
</xsl:template>
</xsl:stylesheet>
<xsl:stylesheet version="1.0" xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>
<xsl:output method="text" encoding="ISO-8859-1"/>
<xsl:template match="/">
<xsl:apply-templates/>
</xsl:template>
<xsl:template match="Dbtag[Dbtag_db='GeneID']">
<xsl:for-each select="Dbtag_tag/Object-id/Object-id_id">
<xsl:variable name="genid" select="."/>
<xsl:variable name="url" select="concat('http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=gene&retmode=xml&id=',$genid)"/>
<xsl:apply-templates select="document($url,//Other-source)" mode="go"/>
</xsl:for-each>
</xsl:template>
<xsl:template match="Other-source" mode="go">
<xsl:if test="Other-source_src/Dbtag/Dbtag_db='GO'">
<xsl:value-of select="concat('GO:',Other-source_src/Dbtag/Dbtag_tag/Object-id/Object-id_id)"/>
<xsl:text> </xsl:text>
<xsl:value-of select="Other-source_anchor"/>
<xsl:text>
</xsl:text>
</xsl:if>
</xsl:template>
<xsl:template match="text()">
<xsl:apply-templates/>
</xsl:template>
<xsl:template match="text()" mode="go">
</xsl:template>
</xsl:stylesheet>
Running the stylesheet with xsltproc
xsltproc --novalid stylesheet.xsl \
"http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=NM_030621&rettype=xml"
"http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=NM_030621&rettype=xml"
Result:
GO:5524 ATP binding
GO:8026 ATP-dependent helicase activity
GO:3725 double-stranded RNA binding
GO:4519 endonuclease activity
GO:4386 helicase activity
GO:16787 hydrolase activity
GO:46872 metal ion binding
GO:166 nucleotide binding
GO:5515 protein binding
GO:4525 ribonuclease III activity
GO:6396 RNA processing
GO:1525 angiogenesis
GO:48754 branching morphogenesis of a tube
GO:35116 embryonic hindlimb morphogenesis
GO:31047 gene silencing by RNA
GO:30324 lung development
GO:31054 pre-microRNA processing
GO:30422 production of siRNA involved in RNA interference
GO:19827 stem cell maintenance
GO:30423 targeting of mRNA for destruction involved in RNA interference
GO:16442 RNA-induced silencing complex
GO:5737 cytoplasm
GO:5622 intracellular
GO:8026 ATP-dependent helicase activity
GO:3725 double-stranded RNA binding
GO:4519 endonuclease activity
GO:4386 helicase activity
GO:16787 hydrolase activity
GO:46872 metal ion binding
GO:166 nucleotide binding
GO:5515 protein binding
GO:4525 ribonuclease III activity
GO:6396 RNA processing
GO:1525 angiogenesis
GO:48754 branching morphogenesis of a tube
GO:35116 embryonic hindlimb morphogenesis
GO:31047 gene silencing by RNA
GO:30324 lung development
GO:31054 pre-microRNA processing
GO:30422 production of siRNA involved in RNA interference
GO:19827 stem cell maintenance
GO:30423 targeting of mRNA for destruction involved in RNA interference
GO:16442 RNA-induced silencing complex
GO:5737 cytoplasm
GO:5622 intracellular
2nd solution: using the ensembl.org mysql server
After doing some little reverse engineering with the SQL schema of ensembl.org, I wrote the following SQL query:use homo_sapiens_core_48_36j;
select distinct
GENE_ID.stable_id as "ensembl.gene",
RNA_ID.stable_id as "ensembl.transcript",
PROT_ID.stable_id as "ensembl.translation",
GO.acc as "go.acc",
GO.name as "go.name",
GOXREF.linkage_type as "evidence"
from
ensembl_go_54.term as GO,
external_db as EXTDB0,
external_db as EXTDB1,
object_xref as OX0,
object_xref as OX1,
xref as XREF0,
xref as XREF1,
transcript as RNA,
transcript_stable_id as RNA_ID,
gene as GENE,
gene_stable_id as GENE_ID,
translation as PROT,
translation_stable_id as PROT_ID,
go_xref as GOXREF
where
XREF0.dbprimary_acc="NM_030621" and
XREF0.external_db_id=EXTDB0.external_db_id and
EXTDB0.db_name="RefSeq_dna" and
OX0.xref_id=XREF0.xref_id and
RNA.gene_id=GENE.gene_id and
GENE.gene_id= GENE_ID.gene_id and
RNA.transcript_id=OX0.ensembl_id and
RNA_ID.transcript_id=RNA.transcript_id and
PROT.transcript_id = RNA.transcript_id and
OX1.ensembl_id=PROT.translation_id and
PROT.translation_id=PROT_ID.translation_id and
OX1.ensembl_object_type='Translation' and
OX1.xref_id=XREF1.xref_id and
GOXREF.object_xref_id=OX1.object_xref_id and
XREF1.external_db_id=EXTDB1.external_db_id and
EXTDB1.db_name="GO" and
GO.acc=XREF1.dbprimary_acc
order by GO.acc;
select distinct
GENE_ID.stable_id as "ensembl.gene",
RNA_ID.stable_id as "ensembl.transcript",
PROT_ID.stable_id as "ensembl.translation",
GO.acc as "go.acc",
GO.name as "go.name",
GOXREF.linkage_type as "evidence"
from
ensembl_go_54.term as GO,
external_db as EXTDB0,
external_db as EXTDB1,
object_xref as OX0,
object_xref as OX1,
xref as XREF0,
xref as XREF1,
transcript as RNA,
transcript_stable_id as RNA_ID,
gene as GENE,
gene_stable_id as GENE_ID,
translation as PROT,
translation_stable_id as PROT_ID,
go_xref as GOXREF
where
XREF0.dbprimary_acc="NM_030621" and
XREF0.external_db_id=EXTDB0.external_db_id and
EXTDB0.db_name="RefSeq_dna" and
OX0.xref_id=XREF0.xref_id and
RNA.gene_id=GENE.gene_id and
GENE.gene_id= GENE_ID.gene_id and
RNA.transcript_id=OX0.ensembl_id and
RNA_ID.transcript_id=RNA.transcript_id and
PROT.transcript_id = RNA.transcript_id and
OX1.ensembl_id=PROT.translation_id and
PROT.translation_id=PROT_ID.translation_id and
OX1.ensembl_object_type='Translation' and
OX1.xref_id=XREF1.xref_id and
GOXREF.object_xref_id=OX1.object_xref_id and
XREF1.external_db_id=EXTDB1.external_db_id and
EXTDB1.db_name="GO" and
GO.acc=XREF1.dbprimary_acc
order by GO.acc;
Execute
mysql -A -h ensembldb.ensembl.org -u anonymous -P 5306 < query.sql
Result
ensembl.gene | ensembl.transcript | ensembl.translation | go.acc | go.name |
---|---|---|---|---|
ENSG00000100697 | ENST00000393063 | ENSP00000376783 | GO:0000166 | nucleotide binding |
ENSG00000100697 | ENST00000393063 | ENSP00000376783 | GO:0001525 | angiogenesis |
ENSG00000100697 | ENST00000393063 | ENSP00000376783 | GO:0003676 | nucleic acid binding |
ENSG00000100697 | ENST00000393063 | ENSP00000376783 | GO:0003677 | DNA binding |
ENSG00000100697 | ENST00000393063 | ENSP00000376783 | GO:0003723 | RNA binding |
ENSG00000100697 | ENST00000393063 | ENSP00000376783 | GO:0003725 | double-stranded RNA binding |
ENSG00000100697 | ENST00000393063 | ENSP00000376783 | GO:0004386 | helicase activity |
ENSG00000100697 | ENST00000393063 | ENSP00000376783 | GO:0004519 | endonuclease activity |
ENSG00000100697 | ENST00000393063 | ENSP00000376783 | GO:0004525 | ribonuclease III activity |
ENSG00000100697 | ENST00000393063 | ENSP00000376783 | GO:0005515 | protein binding |
ENSG00000100697 | ENST00000393063 | ENSP00000376783 | GO:0005524 | ATP binding |
ENSG00000100697 | ENST00000393063 | ENSP00000376783 | GO:0005622 | intracellular |
ENSG00000100697 | ENST00000393063 | ENSP00000376783 | GO:0006396 | RNA processing |
ENSG00000100697 | ENST00000393063 | ENSP00000376783 | GO:0008026 | ATP-dependent helicase activity |
ENSG00000100697 | ENST00000393063 | ENSP00000376783 | GO:0016787 | hydrolase activity |
ENSG00000100697 | ENST00000393063 | ENSP00000376783 | GO:0019827 | stem cell maintenance |
ENSG00000100697 | ENST00000393063 | ENSP00000376783 | GO:0030324 | lung development |
ENSG00000100697 | ENST00000393063 | ENSP00000376783 | GO:0030422 | RNA interference, production of siRNA |
ENSG00000100697 | ENST00000393063 | ENSP00000376783 | GO:0030423 | RNA interference, targeting of mRNA for destruction |
ENSG00000100697 | ENST00000393063 | ENSP00000376783 | GO:0031047 | gene silencing by RNA |
ENSG00000100697 | ENST00000393063 | ENSP00000376783 | GO:0035116 | embryonic hindlimb morphogenesis |
ENSG00000100697 | ENST00000393063 | ENSP00000376783 | GO:0035196 | gene silencing by miRNA, production of miRNAs |
ENSG00000100697 | ENST00000393063 | ENSP00000376783 | GO:0048754 | branching morphogenesis of a tube |
Note: my solutions don't list all the GO terms visible in http://www.ensembl.org/Homo_sapiens/Transcript/(...)/ENSG00000100697. E.g.: I can see GO:0016442 on the web page, but not in my result ). The ensembl schema is complex, and I might have missed some links ("left join" ?) between the tables (anybody knows ?).UPDATE:The results should be the same as ensembl when using "homo_sapiens_core_58_37c" (Thank you @joachimbaran. )
That's it
Pierre
No comments:
Post a Comment