29 May 2010

From mRNA to GO

In this post , I've just copied the solutions I gave on http://biostar.stackexchange.com/ for the following problem:"How do I do simple GO term lookup given a gene (or mRNA) identifier?".
The best answer was given by Neil who suggested to use the BiomartR package.

On my side, I submitted two solutions:

1st solution: use a recursive XSLT stylesheet

The following XSLT stylesheet gets the GeneID from the mRNA, download the XML document describing this gene and extracts the identifiers for Gene Ontology :
<?xml version="1.0" encoding="ISO-8859-1" ?>
<xsl:stylesheet version="1.0" xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>

<xsl:output method="text" encoding="ISO-8859-1"/>

<xsl:template match="/">
<xsl:apply-templates/>
</xsl:template>

<xsl:template match="Dbtag[Dbtag_db='GeneID']">
<xsl:for-each select="Dbtag_tag/Object-id/Object-id_id">
<xsl:variable name="genid" select="."/>
<xsl:variable name="url" select="concat('http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=gene&retmode=xml&id=',$genid)"/>
<xsl:apply-templates select="document($url,//Other-source)" mode="go"/>
</xsl:for-each>
</xsl:template>

<xsl:template match="Other-source" mode="go">
<xsl:if test="Other-source_src/Dbtag/Dbtag_db='GO'">
<xsl:value-of select="concat('GO:',Other-source_src/Dbtag/Dbtag_tag/Object-id/Object-id_id)"/>
<xsl:text> </xsl:text>
<xsl:value-of select="Other-source_anchor"/>
<xsl:text>
</xsl:text>
</xsl:if>
</xsl:template>

<xsl:template match="text()">
<xsl:apply-templates/>
</xsl:template>

<xsl:template match="text()" mode="go">
</xsl:template>

</xsl:stylesheet>

Running the stylesheet with xsltproc

Result:

GO:5524 ATP binding
GO:8026 ATP-dependent helicase activity
GO:3725 double-stranded RNA binding
GO:4519 endonuclease activity
GO:4386 helicase activity
GO:16787 hydrolase activity
GO:46872 metal ion binding
GO:166 nucleotide binding
GO:5515 protein binding
GO:4525 ribonuclease III activity
GO:6396 RNA processing
GO:1525 angiogenesis
GO:48754 branching morphogenesis of a tube
GO:35116 embryonic hindlimb morphogenesis
GO:31047 gene silencing by RNA
GO:30324 lung development
GO:31054 pre-microRNA processing
GO:30422 production of siRNA involved in RNA interference
GO:19827 stem cell maintenance
GO:30423 targeting of mRNA for destruction involved in RNA interference
GO:16442 RNA-induced silencing complex
GO:5737 cytoplasm
GO:5622 intracellular

2nd solution: using the ensembl.org mysql server

After doing some little reverse engineering with the SQL schema of ensembl.org, I wrote the following SQL query:
use homo_sapiens_core_48_36j;

select distinct
GENE_ID.stable_id as "ensembl.gene",
RNA_ID.stable_id as "ensembl.transcript",
PROT_ID.stable_id as "ensembl.translation",
GO.acc as "go.acc",
GO.name as "go.name",
GOXREF.linkage_type as "evidence"
from

ensembl_go_54.term as GO,
external_db as EXTDB0,
external_db as EXTDB1,
object_xref as OX0,
object_xref as OX1,
xref as XREF0,
xref as XREF1,
transcript as RNA,
transcript_stable_id as RNA_ID,
gene as GENE,
gene_stable_id as GENE_ID,
translation as PROT,
translation_stable_id as PROT_ID,
go_xref as GOXREF
where
XREF0.dbprimary_acc="NM_030621" and
XREF0.external_db_id=EXTDB0.external_db_id and
EXTDB0.db_name="RefSeq_dna" and
OX0.xref_id=XREF0.xref_id and
RNA.gene_id=GENE.gene_id and
GENE.gene_id= GENE_ID.gene_id and
RNA.transcript_id=OX0.ensembl_id and
RNA_ID.transcript_id=RNA.transcript_id and
PROT.transcript_id = RNA.transcript_id and
OX1.ensembl_id=PROT.translation_id and
PROT.translation_id=PROT_ID.translation_id and
OX1.ensembl_object_type='Translation' and
OX1.xref_id=XREF1.xref_id and
GOXREF.object_xref_id=OX1.object_xref_id and
XREF1.external_db_id=EXTDB1.external_db_id and
EXTDB1.db_name="GO" and
GO.acc=XREF1.dbprimary_acc

order by GO.acc;

Execute

mysql -A -h ensembldb.ensembl.org -u anonymous -P 5306 < query.sql

Result

ensembl.geneensembl.transcriptensembl.translationgo.accgo.name
ENSG00000100697ENST00000393063ENSP00000376783GO:0000166nucleotide binding
ENSG00000100697ENST00000393063ENSP00000376783GO:0001525angiogenesis
ENSG00000100697ENST00000393063ENSP00000376783GO:0003676nucleic acid binding
ENSG00000100697ENST00000393063ENSP00000376783GO:0003677DNA binding
ENSG00000100697ENST00000393063ENSP00000376783GO:0003723RNA binding
ENSG00000100697ENST00000393063ENSP00000376783GO:0003725double-stranded RNA binding
ENSG00000100697ENST00000393063ENSP00000376783GO:0004386helicase activity
ENSG00000100697ENST00000393063ENSP00000376783GO:0004519endonuclease activity
ENSG00000100697ENST00000393063ENSP00000376783GO:0004525ribonuclease III activity
ENSG00000100697ENST00000393063ENSP00000376783GO:0005515protein binding
ENSG00000100697ENST00000393063ENSP00000376783GO:0005524ATP binding
ENSG00000100697ENST00000393063ENSP00000376783GO:0005622intracellular
ENSG00000100697ENST00000393063ENSP00000376783GO:0006396RNA processing
ENSG00000100697ENST00000393063ENSP00000376783GO:0008026ATP-dependent helicase activity
ENSG00000100697ENST00000393063ENSP00000376783GO:0016787hydrolase activity
ENSG00000100697ENST00000393063ENSP00000376783GO:0019827stem cell maintenance
ENSG00000100697ENST00000393063ENSP00000376783GO:0030324lung development
ENSG00000100697ENST00000393063ENSP00000376783GO:0030422RNA interference, production of siRNA
ENSG00000100697ENST00000393063ENSP00000376783GO:0030423RNA interference, targeting of mRNA for destruction
ENSG00000100697ENST00000393063ENSP00000376783GO:0031047gene silencing by RNA
ENSG00000100697ENST00000393063ENSP00000376783GO:0035116embryonic hindlimb morphogenesis
ENSG00000100697ENST00000393063ENSP00000376783GO:0035196gene silencing by miRNA, production of miRNAs
ENSG00000100697ENST00000393063ENSP00000376783GO:0048754branching morphogenesis of a tube


Note: my solutions don't list all the GO terms visible in http://www.ensembl.org/Homo_sapiens/Transcript/(...)/ENSG00000100697. E.g.: I can see GO:0016442 on the web page, but not in my result ). The ensembl schema is complex, and I might have missed some links ("left join" ?) between the tables (anybody knows ?).UPDATE:The results should be the same as ensembl when using "homo_sapiens_core_58_37c" (Thank you @joachimbaran. )

That's it

Pierre

No comments: