15 December 2008

An idea: Twitter as a tool to build a protein-protein interactions database

In this post I describe how http://twitter.com could be used as a tool to build a collaborative database of protein-protein interactions. The idea was inspired by the recent creation of http://twitter.com/omnee. Omnee is said to be the "first organic directory for Twitter which you can control directly via your tweets": using a tag-based structure in your tweets, it gives you the freedom to add yourself to multiple "groups" quickly and easily.


Chris Upton's tags
+informatics, +ipod touch, +genomics, +proteomics, +dnasequencing, + mac, +semanticweb, -ipodtoch, +bioinformatics, +virology, #omnee

How about building a collaborative biological database with this kind of tool? One could create a database of protein-protein interactions using Twitter. For example, say the @biotecher account is used as the core account to harvest the tweets. Anybody could then submit a new component of the interactome by sending a tweet to @biotecher containing the gi numbers of the two proteins, a PubMed ID (pmid) as a reference, and a special hashtag, say #interactome.

E.g.: Rotavirus protein NSP3 interacts with human EIF4G1:

@biotecher gi:41019505 gi:255458 pmid:9755181 #interactome
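
As a minimal sketch of how such a tweet could be parsed (this is not taken from TwitterOmics.java; the class name InteractomeTweet and its fields are hypothetical), a couple of regular expressions are enough to pull out the two gi numbers and the pmid. The sketch assumes that a valid tweet mentions @biotecher, carries the #interactome hashtag, and contains exactly two gi: tokens and at least one pmid: token.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: one parsed "#interactome" tweet. */
final class InteractomeTweet {
    private static final Pattern GI = Pattern.compile("\\bgi:(\\d+)\\b");
    private static final Pattern PMID = Pattern.compile("\\bpmid:(\\d+)\\b");

    final String gi1;
    final String gi2;
    final String pmid;

    private InteractomeTweet(String gi1, String gi2, String pmid) {
        this.gi1 = gi1;
        this.gi2 = gi2;
        this.pmid = pmid;
    }

    /** Returns the parsed tweet, or null if the text does not follow the proposed format. */
    static InteractomeTweet parse(String text) {
        if (text == null || !text.contains("@biotecher") || !text.contains("#interactome")) {
            return null;
        }
        // Collect every gi number in order of appearance.
        List<String> gis = new ArrayList<>();
        Matcher giMatcher = GI.matcher(text);
        while (giMatcher.find()) {
            gis.add(giMatcher.group(1));
        }
        // Exactly two proteins and one literature reference are required.
        Matcher pmidMatcher = PMID.matcher(text);
        if (gis.size() != 2 || !pmidMatcher.find()) {
            return null;
        }
        return new InteractomeTweet(gis.get(0), gis.get(1), pmidMatcher.group(1));
    }

    public static void main(String[] args) {
        InteractomeTweet t = parse("@biotecher gi:41019505 gi:255458 pmid:9755181 #interactome");
        // prints: 41019505 <-> 255458 (PMID 9755181)
        System.out.println(t.gi1 + " <-> " + t.gi2 + " (PMID " + t.pmid + ")");
    }
}
```

Rejecting anything that does not match exactly keeps ordinary replies to @biotecher out of the database, at the cost of silently dropping tweets with a typo.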

With such a system the metadata (who gave this information? when?) is also recorded by twitter.com, so we can imagine filtering the information according to our own network ("I don't trust the information supplied by this user, discard it").
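
The filtering idea can be sketched as follows. This is only an illustration under assumptions: the TrustFilter and Statement names are hypothetical, and the author name is assumed to be available as a plain string for each harvested tweet.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical sketch: drop statements tweeted by users we do not trust. */
final class TrustFilter {
    /** One harvested statement: who tweeted it, and the tweet text. */
    static final class Statement {
        final String author;
        final String text;

        Statement(String author, String text) {
            this.author = author;
            this.text = text;
        }
    }

    /** Keeps the statements whose author is not in the (lower-case) distrusted set. */
    static List<Statement> keepTrusted(List<Statement> all, Set<String> distrusted) {
        List<Statement> kept = new ArrayList<>();
        for (Statement s : all) {
            // Compare case-insensitively: screen names are lower-cased before the lookup.
            if (!distrusted.contains(s.author.toLowerCase(Locale.ROOT))) {
                kept.add(s);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Statement> all = Arrays.asList(
            new Statement("yokofakun", "@biotecher gi:41019505 gi:255458 pmid:9755181 #interactome"),
            // "someuser" is a made-up account used only for this example.
            new Statement("SomeUser", "@biotecher gi:1 gi:2 pmid:3 #interactome"));
        Set<String> distrusted = new HashSet<>(Arrays.asList("someuser"));
        for (Statement s : keepTrusted(all, distrusted)) {
            System.out.println(s.author + ": " + s.text);
        }
    }
}
```

Because the filter runs on the reader's side, each person can apply their own list without the shared data being edited.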

I've also written a short piece of code as a proof of concept: the program searches for the tweets tagged #interactome and addressed to @biotecher. It then downloads some information from the NCBI (the organism and name of each protein, the title of the paper, etc.) and outputs the network as an RDF graph. The code (Java) of this program is available at: http://code.google.com/p/lindenb/source/browse/trunk/proj/tinytools/src/org/lindenb/tinytools/TwitterOmics.java.
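
For the NCBI lookup step, one possible approach is the E-utilities ESummary service. The sketch below only builds the request URLs and performs no network access. It is an assumption that this endpoint suits the purpose: it has not been checked which NCBI service TwitterOmics.java actually calls, and the class name NcbiUrls is hypothetical.

```java
/** Hypothetical sketch: build NCBI E-utilities ESummary URLs for a gi number or a pmid. */
final class NcbiUrls {
    // Assumption: the current public E-utilities base URL; the original program may use another one.
    private static final String ESUMMARY =
        "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi";

    private static String summaryUrl(String db, String id) {
        // Identifiers in the proposed tweet format are purely numeric.
        if (id == null || !id.matches("\\d+")) {
            throw new IllegalArgumentException("numeric identifier expected: " + id);
        }
        return ESUMMARY + "?db=" + db + "&id=" + id;
    }

    /** URL of the summary record (title, organism, ...) of a protein. */
    static String proteinSummaryUrl(String gi) {
        return summaryUrl("protein", gi);
    }

    /** URL of the summary record (title, authors, ...) of an article. */
    static String pubmedSummaryUrl(String pmid) {
        return summaryUrl("pubmed", pmid);
    }

    public static void main(String[] args) {
        System.out.println(proteinSummaryUrl("41019505"));
        System.out.println(pubmedSummaryUrl("9755181"));
    }
}
```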

Here is the output for 3 interactions. As you will see, each interaction is stored as a resource of the rdf:Class <Interaction> and is identified by the URL of the tweet. Each interaction contains a reference to the author, the proteins, the date and the article in PubMed.

<?xml version="1.0" encoding="UTF-8"?>

<foaf:Person rdf:about="http://twitter.com/yokofakun">
  <foaf:name>yokofakun (Pierre Lindenbaum)</foaf:name>
</foaf:Person>

<Organism rdf:about="lsid:ncbi.nlm.nih.gov:taxonomy:4932">
  <dc:title>Saccharomyces cerevisiae</dc:title>
</Organism>

<Protein rdf:about="lsid:ncbi.nlm.nih.gov:protein:417441">
  <dc:title>RecName: Full=Polyadenylate-binding protein, cytoplasmic and nuclear; Short=Poly(A)-binding protein; Short=PABP; AltName: Full=ARS consensus-binding protein ACBP-67; AltName: Full=Polyadenylate tail-binding protein</dc:title>
  <organism rdf:resource="lsid:ncbi.nlm.nih.gov:taxonomy:4932"/>
</Protein>

<Organism rdf:about="lsid:ncbi.nlm.nih.gov:taxonomy:9606">
  <dc:title>Homo sapiens</dc:title>
</Organism>

<Protein rdf:about="lsid:ncbi.nlm.nih.gov:protein:41019505">
  <dc:title>RecName: Full=Eukaryotic translation initiation factor 4 gamma 1; Short=eIF-4-gamma 1; Short=eIF-4G 1; Short=eIF-4G1; AltName: Full=p220</dc:title>
  <organism rdf:resource="lsid:ncbi.nlm.nih.gov:taxonomy:9606"/>
</Protein>

<bibo:Article rdf:about="http://www.ncbi.nlm.nih.gov/pubmed/9418852">
  <dc:title>RNA recognition motif 2 of yeast Pab1p is required for its functional interaction with eukaryotic translation initiation factor 4G.</dc:title>
</bibo:Article>

<Interaction rdf:about="http://twitter.com/yokofakun/statuses/1058586293">
  <interactor rdf:resource="lsid:ncbi.nlm.nih.gov:protein:417441"/>
  <interactor rdf:resource="lsid:ncbi.nlm.nih.gov:protein:41019505"/>
  <reference rdf:resource="http://www.ncbi.nlm.nih.gov/pubmed/9418852"/>
  <dc:creator rdf:resource="http://twitter.com/yokofakun"/>
</Interaction>

<Organism rdf:about="lsid:ncbi.nlm.nih.gov:taxonomy:10922">
  <dc:title>Simian rotavirus</dc:title>
</Organism>

<Protein rdf:about="lsid:ncbi.nlm.nih.gov:protein:255458">
  <dc:title>NS34=gene 7 nonstructural protein [simian rotavirus, SA114F, serotype G3, Peptide, 315 aa]</dc:title>
  <organism rdf:resource="lsid:ncbi.nlm.nih.gov:taxonomy:10922"/>
</Protein>

<Protein rdf:about="lsid:ncbi.nlm.nih.gov:protein:6176338">
  <dc:title>ubiquitous tetratricopeptide containing protein RoXaN [Homo sapiens]</dc:title>
  <organism rdf:resource="lsid:ncbi.nlm.nih.gov:taxonomy:9606"/>
</Protein>

<bibo:Article rdf:about="http://www.ncbi.nlm.nih.gov/pubmed/15047801">
  <dc:title>RoXaN, a novel cellular protein containing TPR, LD, and zinc finger motifs, forms a ternary complex with eukaryotic initiation factor 4G and rotavirus NSP3.</dc:title>
</bibo:Article>

<Interaction rdf:about="http://twitter.com/yokofakun/statuses/1058292539">
  <interactor rdf:resource="lsid:ncbi.nlm.nih.gov:protein:255458"/>
  <interactor rdf:resource="lsid:ncbi.nlm.nih.gov:protein:6176338"/>
  <reference rdf:resource="http://www.ncbi.nlm.nih.gov/pubmed/15047801"/>
  <dc:creator rdf:resource="http://twitter.com/yokofakun"/>
</Interaction>

<bibo:Article rdf:about="http://www.ncbi.nlm.nih.gov/pubmed/9755181">
  <dc:title>Rotavirus RNA-binding protein NSP3 interacts with eIF4GI and evicts the poly(A) binding protein from eIF4F.</dc:title>
</bibo:Article>

<Interaction rdf:about="http://twitter.com/yokofakun/statuses/1058290564">
  <interactor rdf:resource="lsid:ncbi.nlm.nih.gov:protein:41019505"/>
  <interactor rdf:resource="lsid:ncbi.nlm.nih.gov:protein:255458"/>
  <reference rdf:resource="http://www.ncbi.nlm.nih.gov/pubmed/9755181"/>
  <dc:creator rdf:resource="http://twitter.com/yokofakun"/>
</Interaction>
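
To illustrate how one such Interaction element could be produced from the fields of a tweet, here is a minimal sketch. It is not the code of TwitterOmics.java: the class name InteractionRdf is hypothetical, and the sketch emits only the fragment for one interaction, leaving out the enclosing RDF root element and its namespace declarations, which are not shown in this post.

```java
/** Hypothetical sketch: serialize one interaction as an RDF/XML fragment. */
final class InteractionRdf {
    // URI prefixes copied from the sample output in this post.
    private static final String PROTEIN_LSID = "lsid:ncbi.nlm.nih.gov:protein:";
    private static final String PUBMED_URL = "http://www.ncbi.nlm.nih.gov/pubmed/";

    /** Escapes the characters that are not allowed as-is inside an XML attribute value. */
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    static String toRdf(String tweetUrl, String authorUrl, String gi1, String gi2, String pmid) {
        StringBuilder sb = new StringBuilder();
        // The tweet URL is the identifier of the interaction.
        sb.append("<Interaction rdf:about=\"").append(escape(tweetUrl)).append("\">\n");
        sb.append("  <interactor rdf:resource=\"").append(PROTEIN_LSID).append(escape(gi1)).append("\"/>\n");
        sb.append("  <interactor rdf:resource=\"").append(PROTEIN_LSID).append(escape(gi2)).append("\"/>\n");
        sb.append("  <reference rdf:resource=\"").append(PUBMED_URL).append(escape(pmid)).append("\"/>\n");
        sb.append("  <dc:creator rdf:resource=\"").append(escape(authorUrl)).append("\"/>\n");
        sb.append("</Interaction>\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(toRdf(
            "http://twitter.com/yokofakun/statuses/1058290564",
            "http://twitter.com/yokofakun",
            "41019505", "255458", "9755181"));
    }
}
```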


What do you think?



Michael Kuhn said...

I can only repeat what I said on FriendFeed: use a dedicated account (@interactome?) so that you don't add noise for everyone following you and biotecher.

I wonder if the use cases for this wouldn't correspond to those for CBioC.

Pierre Lindenbaum said...

Yes, I agree, Michael. I also unsubscribed from @omnee because I didn't want to receive all its updates.

Ntino said...

A-w-e-s-o-m-e ! (and you've coded it super-ultra fast) :-)

Alex Tolley said...

I don't understand what your suggestion buys you exactly. Apart from a repository and the tools, Twitter would either constantly send you updates on your protein or you would still need to search for them in the standard way.

A possibly simpler approach is to use Google Base and build a few simple tools on top of that to build the repository with search capability.

Or is the idea to simply piggyback on the "free" platform?

Pierre Lindenbaum said...


Most databases of protein-protein interactions contain false positives, either because they are built with a text-mining algorithm scanning the abstracts of PubMed (http://www.ncbi.nlm.nih.gov/pubmed/18834490) or because the information comes from a high-throughput experiment (http://www.ncbi.nlm.nih.gov/pubmed/16267556). But here, scientists working at the bench on a single pair of proteins *know* their job and are the best placed to find the correct IDs of the proteins and articles. Adding this simple piece of information on Twitter is easy and straightforward. There is no need for the information to be controlled, or for the user to register on a new site. And there is already an API that everybody can use.

Google Base? I don't know much about it... I'd rather use http://www.freebase.com to store this kind of information.

Oh, and I think that using Twitter for this... is just fun :-)


Alex Tolley said...

I was in bioinformatics, although not dealing with protein-protein interactions. I suppose if the idea is to connect pairs of objects with a note and data references, then this applies to anything with a graph structure. You could do this for biological pathways, gene transcription products, etc.

I certainly see the "fun" part, and I think that building on existing infrastructure can be very convenient, especially when it takes away the burden of maintaining it and allows you to focus on the analysis tools.

Google Base is Google's method of storing data as XML files, built on their data format. You would store your RDF data using Base and use similar tools to retrieve all the data items and analyze them. It would work very similarly to the Twitter idea, except that you would need to build more tools to manage the data.