22 December 2008

Knime.org: creating a new Source Node reading Fasta sequences.

This post is about how I wrote a Java plugin for the workflow engine KNIME (http://www.knime.org). The plugin reads a FASTA file containing one or more sequences and transforms it into a table with two columns: one for the name of the sequence and the other for the sequence itself.

In the last few weeks, I've been looking for a workflow engine that could be easily handled by the members of my lab. I tested three tools:

  • Taverna (now in version 2) is mainly devoted to web services. I've tried to learn how to use Taverna a few times, but I still find it neither user-friendly nor intuitive
  • http://kepler-project.org/: a 141 MB download! Ouch!
  • http://www.knime.org: until the latest version, the development build crashed at startup on my computer, but now everything works fine


KNIME, the Konstanz Information Miner, is a modular environment which enables easy visual assembly and interactive execution of a data pipeline. KNIME is built on top of Eclipse, and each node in a workflow is developed as a plugin. As far as I understand KNIME, it only handles tabular data, and that's why I wrote this new plugin converting a set of FASTA files. The KNIME SDK comes with a dialog wizard creating the Java stubs required for a new KNIME node. Here are a few of the generated files:


XXXNodeModel
implements the logic of the node. The most important method
BufferedDataTable[] execute(final BufferedDataTable[] inData,
final ExecutionContext exec) throws Exception
takes one or more tables as input, transforms them, and returns an array of one or more tables

XXXNodeDialog
A Swing-based dialog used to set the options of the node. Here, I've created a dialog for selecting the FASTA file

XXXNodeView
Visualizes the result of the node. Here, I didn't write a view, but one could imagine a graphical interface drawing the GC%, etc.

XXXNodeFactory
A class generating the model, the dialog, the views, ...

XXXNodePlugin
Describes the plugin. It is only used by Eclipse



The sources I wrote are available here. Warning: these sources are a draft and I'm still learning the KNIME API; I guess there must be cleverer/smarter/safer ways to write this stuff.
  • FastaIterator.java: iterator over a fasta file
  • FastaTable.java: the tabular representation of the sequences. It contains two columns: Name and Sequence
  • ReadFastaNodeDialog.java: the dialog selecting the fasta file
  • ReadFastaNodeFactory.java: the node factory
  • ReadFastaNodeModel.java: implements the logic of the node. Creates and returns the FastaTable
  • ReadFastaNodePlugin: used by Eclipse
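For illustration, here is a minimal, KNIME-independent sketch of the kind of parsing FastaIterator performs (the class and method names below are mine, not the plugin's): it turns a FASTA stream into (name, sequence) pairs, i.e. the two columns of the table.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;

/** Minimal FASTA parser: yields one (name, sequence) pair per record. */
public class FastaParser {
    /** One FASTA record: the header line (without '>') and the concatenated sequence. */
    public static final class Record {
        public final String name;
        public final String sequence;
        Record(String name, String sequence) { this.name = name; this.sequence = sequence; }
    }

    /** Reads every record from a FASTA source. */
    public static List<Record> read(Reader source) throws IOException {
        BufferedReader in = new BufferedReader(source);
        List<Record> records = new ArrayList<Record>();
        String name = null;
        StringBuilder seq = new StringBuilder();
        String line;
        while ((line = in.readLine()) != null) {
            if (line.startsWith(">")) {            // header line: flush the previous record
                if (name != null) records.add(new Record(name, seq.toString()));
                name = line.substring(1).trim();
                seq.setLength(0);
            } else {                               // sequence line: append
                seq.append(line.trim());
            }
        }
        if (name != null) records.add(new Record(name, seq.toString()));
        return records;
    }
}
```

In the actual node, each such record would become one row of the output table.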


Here is a screenshot: my node reads a FASTA file, transforms it into a two-column table, 'greps' all the sequences containing the word ONCOGENE, sorts the sequences, and outputs the result in a table (smaller window at the bottom).







That's it for tonight.

Pierre

17 December 2008

Putting semantics in the spreadsheets.

Just a few ideas

I've recently been asked to find a way to store a set of heterogeneous files (pedigrees, linkage files, results of unix pipelines...). My first idea was to upload each file to a wiki and append some well-chosen categories so the file could easily be retrieved later. I also imagined using a Template to create a form where the user would add some semi-structured annotations (see my test on openwetware.org here).

But, of course, users always want more. There must be a Murphy's law for this...

Now I should create a robot that can find any file of a given type (say, a linkage file) containing a given piece of information (say, a SNP identified by its rs-id). So I've started to create a set of two RDFS-based ontologies that can be used to describe what a file is about (e.g. File -> Plain Text -> Tab-Delimited -> Pedigree) and what its columns are about (e.g. xsd:string -> biological entity -> genetic marker -> snp -> rs-id). A robot would then be able to identify and parse the files and, for example, find the columns containing "SNP" or "Microsatellite" when I ask for the columns containing a 'Genetic Marker'.
The two drafts are available here:

  • http://code.google.com/p/fileontology/source/browse/trunk/files/ont/columns.rdf
  • http://code.google.com/p/fileontology/source/browse/trunk/files/ont/files.rdf

    I don't know if this idea has already been implemented elsewhere. Nevertheless, Frank Gibson suggested that I have a look at the Information Artifact Ontology (IAO): "a new ontology of information entities, originally driven by work by the OBI digital entity and realizable information entity branch". Lots of information there...

    I'm still exploring this subject.


    Pierre

    Validating JSON with lex & yacc

    In a recent post on Twitter, Chris Lasher/agbiotec said:
    I did a quick Google for "JSON schema" and "JSON validation"; looks like there's nothing in place yet like XML Schema.
    I suggested that lex/yacc could be used to create a trivial tool for this kind of validation. Here is an example.
    Say you have a linkage file expressed as JSON. This file contains some information about a set of genetic markers, a set of samples and some genotypes.

    {
    "markers":[
    {
    "id":1,
    "name":"rs1",
    "chrom":"chr1",
    "position":1
    },
    {
    "id":2,
    "name":"rs2",
    "chrom":"chr1",
    "position":2
    },
    {
    "id":3,
    "name":"rs3",
    "chrom":"chr1",
    "position":3
    }
    ],
    "samples":[
    {
    "id":1,
    "name":"Individual1",
    "father-id":2,
    "mother-id":3,
    "illness":true
    },
    {
    "id":2,
    "name":"Individual2",
    "father-id":0,
    "mother-id":0,
    "illness":false
    },
    {
    "id":3,
    "name":"Individual3",
    "father-id":0,
    "mother-id":0,
    "illness":true
    }
    ],
    "genotypes" : [

    {
    "sample":1,
    "marker":2,
    "allele-1":"A",
    "allele-2":"T"
    },
    {
    "sample":2,
    "marker":2,
    "allele-1":"A",
    "allele-2":"T"
    }

    ]
    }


    To validate this file with lex/yacc (or flex/bison), we need:

    • A parser generated by bison. The parser contains the grammar rules.

    • A lexer generated by flex. The lexer transforms the input into a stream of semantic tokens.



    sample.y: the Parser


    %{
    #include <stdio.h>
    int yywrap() { return 1;}
    void yyerror(const char* s) {fprintf(stderr,"Error:%s.\n",s);}
    %}


    %token ALLELE_1 ALLELE_2 CHROM FATHER_ID GENOTYPES ID ILLNESS MARKER MARKERS MOTHER_ID NAME POSITION SAMPLE SAMPLES
    %token BOOLEAN INTEGER STRING
    %start linkage
    %%

    linkage: '{' markers ',' samples ',' genotypes '}' ;

    markers: MARKERS ':' '[' marker_list ']';
    marker_list: marker | marker_list ',' marker ;
    marker: '{'
    ID ':' INTEGER ','
    NAME ':' STRING ','
    CHROM ':' STRING ','
    POSITION ':' INTEGER
    '}';



    samples: SAMPLES ':' '[' sample_list ']';
    sample_list: sample | sample_list ',' sample ;
    sample: '{'
    ID ':' INTEGER ','
    NAME ':' STRING ','
    FATHER_ID ':' INTEGER ','
    MOTHER_ID ':' INTEGER ','
    ILLNESS ':' BOOLEAN
    '}';


    genotypes: GENOTYPES ':' '[' genotype_list ']';
    genotype_list: genotype | genotype_list ',' genotype;
    genotype: '{'
    SAMPLE ':' INTEGER ','
    MARKER ':' INTEGER ','
    ALLELE_1 ':' STRING ','
    ALLELE_2 ':' STRING
    '}';
    %%

    int main(int argc,char** argv)
    {
    return yyparse();
    }
    In this file, %token declares the keywords that will be accepted by the parser. The grammar %starts with the 'linkage' rule. This rule starts with an opening curly brace, followed by a 'markers' rule, a comma, a 'samples' rule, a comma, a 'genotypes' rule, and a closing curly brace.
    The 'markers' rule is the MARKERS keyword followed by a 'marker_list' within a pair of brackets.
    A marker_list is a marker, or a marker_list (recursive rule) followed by a comma and another marker.
    A marker is a set of JSON key/value pairs. (Here, for simplicity, I expect all the fields to appear in a fixed order.)
    etc...

    To convert this file into a C source and a C header:
    bison -d sample.y

    sample.l: the Lexer


    Here the lexer is basically an ordered set of regular expressions returning a semantic identifier (e.g. BOOLEAN, INTEGER, MARKERS...) for each token found in the input. Those identifiers are declared in the C header generated by bison.
    %{
    #include <stdio.h>
    #include "sample.tab.h"/* generated by the scanner */
    %}
    %%
    "\"allele-1\"" return ALLELE_1;
    "\"allele-2\"" return ALLELE_2;
    "\"chrom\"" return CHROM;
    "\"father-id\"" return FATHER_ID;
    "\"mother-id\"" return MOTHER_ID;
    "\"id\"" return ID;
    "\"markers\"" return MARKERS;
    "\"marker\"" return MARKER;
    "\"illness\"" return ILLNESS;
    "\"genotypes\"" return GENOTYPES;
    "\"name\"" return NAME;
    "\"position\"" return POSITION;
    "\"samples\"" return SAMPLES;
    "\"sample\"" return SAMPLE;
    true return BOOLEAN;
    false return BOOLEAN;
    [0-9]+ return INTEGER;
    \"[^\"]*\" return STRING;/* a very simple string without escapes... */
    [ \n\t\r] ;/* ignore */
    . return yytext[0];
    %%

    To convert this file into a C source :
    flex sample.l

    Compilation


    gcc -o validate sample.tab.c lex.yy.c

    Testing


    cat sample.json | ./validate
    echo "Hello"| ./validate
    Error:syntax error.
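As an aside, for readers without flex/bison at hand, the same fixed-order grammar can be checked with a short hand-written recursive-descent parser. Here is a sketch in Java (my own illustration, not part of the original tool); each private method mirrors one of the yacc rules above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Recursive-descent check of the fixed-order "linkage" JSON grammar. */
public class LinkageValidator {
    private static final String INT = "\\d+", STR = "\"[^\"]*\"", BOOL = "true|false";
    private final List<String> tok = new ArrayList<String>();
    private int pos = 0;

    public LinkageValidator(String json) {
        // Tokenize: quoted strings, integers, booleans, single punctuation characters.
        Matcher m = Pattern.compile("\"[^\"]*\"|\\d+|true|false|[{}\\[\\]:,]").matcher(json);
        while (m.find()) tok.add(m.group());
    }

    private void eat(String s) { // consume one exact token or fail
        if (pos >= tok.size() || !tok.get(pos).equals(s))
            throw new IllegalArgumentException("expected " + s + " at token #" + pos);
        pos++;
    }
    private void pair(String key, String valueRegex) { // "key" ':' value
        eat("\"" + key + "\""); eat(":");
        if (pos >= tok.size() || !tok.get(pos).matches(valueRegex))
            throw new IllegalArgumentException("bad value for " + key);
        pos++;
    }
    private void list(String key, Runnable element) { // "key" ':' '[' element (',' element)* ']'
        eat("\"" + key + "\""); eat(":"); eat("[");
        element.run();
        while (pos < tok.size() && tok.get(pos).equals(",")) { pos++; element.run(); }
        eat("]");
    }

    private void marker() {   // marker: '{' id, name, chrom, position '}'
        eat("{"); pair("id", INT); eat(","); pair("name", STR); eat(",");
        pair("chrom", STR); eat(","); pair("position", INT); eat("}");
    }
    private void sample() {   // sample: '{' id, name, father-id, mother-id, illness '}'
        eat("{"); pair("id", INT); eat(","); pair("name", STR); eat(",");
        pair("father-id", INT); eat(","); pair("mother-id", INT); eat(",");
        pair("illness", BOOL); eat("}");
    }
    private void genotype() { // genotype: '{' sample, marker, allele-1, allele-2 '}'
        eat("{"); pair("sample", INT); eat(","); pair("marker", INT); eat(",");
        pair("allele-1", STR); eat(","); pair("allele-2", STR); eat("}");
    }

    /** linkage: '{' markers ',' samples ',' genotypes '}' */
    public void validate() {
        eat("{");
        list("markers", this::marker); eat(",");
        list("samples", this::sample); eat(",");
        list("genotypes", this::genotype);
        eat("}");
        if (pos != tok.size()) throw new IllegalArgumentException("trailing tokens");
    }
}
```

Of course this hard-codes one schema; the point of the flex/bison version is that the grammar file is easier to read and to change.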


    That's it

    Pierre

    Darwin's evolution: four days later.



    See my previous post about the underlying genetic algorithm: http://plindenbaum.blogspot.com/2008/12/random-notes-2008-12.html

    15 December 2008

    An idea: Twitter as a tool to build a protein-protein interactions database

    In this post I describe how http://twitter.com could be used as a tool to build a collaborative database of protein-protein interactions. The idea was inspired by the recent creation of http://twitter.com/omnee. Omnee is said to be the "first organic directory for Twitter which you can control directly via your tweets": using a tag-based structure in your tweets gives you the freedom to add yourself to multiple "groups" quickly and easily.

    e.g.:


    Chris Upton's tags
    +informatics, +ipod touch, +genomics, +proteomics, +dnasequencing, + mac, +semanticweb, -ipodtoch, +bioinformatics, +virology, #omnee

    How about building a collaborative biological database with this kind of tool? One could create a database of protein-protein interactions using Twitter. For example, say the @biotecher account is used as the core account to harvest the tweets: anybody could submit a new component of the interactome by sending a tweet to @biotecher with the GIs of the two proteins, a PubMed id as a reference, and a special hashtag, say #interactome.

    E.g.: Rotavirus protein NSP3 interacts with human EIF4G1 (view tweet)

    Tweet
    @biotecher gi:41019505 gi:255458 pmid:9755181 #interactome


    With such a system, the metadata (who gave this information? when?) is also recorded by twitter.com, so we can imagine filtering the information according to our network ("I don't trust the information supplied by this user, discard it").

    I've also created a short piece of code as a proof of concept: the program searches for the tweets tagged #interactome and addressed to @biotecher. It then downloads some information from the NCBI (the organism and the name of each protein, the title of the paper, etc.) and outputs the network as an RDF graph. The (Java) code of this program is available at: http://code.google.com/p/lindenb/source/browse/trunk/proj/tinytools/src/org/lindenb/tinytools/TwitterOmics.java.
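The harvesting step essentially boils down to extracting the gi:/pmid: tokens from each matching tweet. A small sketch of that step (a hypothetical stand-alone class, not the actual TwitterOmics code):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Extracts the two protein GIs and the PubMed id from an #interactome tweet. */
public class InteractionTweet {
    private static final Pattern GI = Pattern.compile("gi:(\\d+)");
    private static final Pattern PMID = Pattern.compile("pmid:(\\d+)");

    /** Returns {gi1, gi2, pmid}, or null if the tweet is not a valid interaction. */
    public static String[] parse(String tweet) {
        // must be addressed to the core account and carry the special hashtag
        if (!tweet.contains("@biotecher") || !tweet.contains("#interactome")) return null;
        List<String> gis = new ArrayList<String>();
        Matcher m = GI.matcher(tweet);
        while (m.find()) gis.add(m.group(1));
        Matcher p = PMID.matcher(tweet);
        if (gis.size() != 2 || !p.find()) return null; // exactly two proteins + one reference
        return new String[] { gis.get(0), gis.get(1), p.group(1) };
    }
}
```

Everything else (NCBI lookups, RDF output) is just decoration around these three identifiers.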

    Here is the output with 3 interactions. As you will see, each interaction is an instance of the rdf:Class <Interaction>, identified by the URL of the tweet. Each interaction contains references to the author, the proteins, the date, and the article in PubMed.

    <?xml version="1.0" encoding="UTF-8"?>
    <rdf:RDF
    xmlns:foaf="http://xmlns.com/foaf/0.1/"
    xmlns:bibo="http://purl.org/ontology/bibo/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns="http://twitteromics.lindenb.org"
    >

    <foaf:Person rdf:about="http://twitter.com/yokofakun">
    <foaf:name>yokofakun (Pierre Lindenbaum)</foaf:name>
    </foaf:Person>

    <Organism rdf:about="lsid:ncbi.nlm.nih.gov:taxonomy:4932">
    <taxId>4932</taxId>
    <dc:title>Saccharomyces cerevisiae</dc:title>
    </Organism>

    <Protein rdf:about="lsid:ncbi.nlm.nih.gov:protein:417441">
    <gi>417441</gi>
    <dc:title>RecName: Full=Polyadenylate-binding protein, cytoplasmic and nuclear; Short=Poly(A)-binding protein; Short=PABP; AltName: Full=ARS consensus-binding protein ACBP-67; AltName: Full=Polyadenylate tail-binding protein</dc:title>
    <organism rdf:resource="lsid:ncbi.nlm.nih.gov:taxonomy:4932"/>
    </Protein>

    <Organism rdf:about="lsid:ncbi.nlm.nih.gov:taxonomy:9606">
    <taxId>9606</taxId>
    <dc:title>Homo sapiens</dc:title>
    </Organism>

    <Protein rdf:about="lsid:ncbi.nlm.nih.gov:protein:41019505">
    <gi>41019505</gi>
    <dc:title>RecName: Full=Eukaryotic translation initiation factor 4 gamma 1; Short=eIF-4-gamma 1; Short=eIF-4G 1; Short=eIF-4G1; AltName: Full=p220</dc:title>
    <organism rdf:resource="lsid:ncbi.nlm.nih.gov:taxonomy:9606"/>
    </Protein>

    <bibo:Article rdf:about="http://www.ncbi.nlm.nih.gov/pubmed/9418852">
    <bibo:pmid>9418852</bibo:pmid>
    <dc:title>RNA recognition motif 2 of yeast Pab1p is required for its functional interaction with eukaryotic translation initiation factor 4G.</dc:title>
    </bibo:Article>

    <Interaction rdf:about="http://twitter.com/yokofakun/statuses/1058586293">
    <interactor rdf:resource="lsid:ncbi.nlm.nih.gov:protein:417441"/>
    <interactor rdf:resource="lsid:ncbi.nlm.nih.gov:protein:41019505"/>
    <reference rdf:resource="http://www.ncbi.nlm.nih.gov/pubmed/9418852"/>
    <dc:creator rdf:resource="http://twitter.com/yokofakun"/>
    <dc:date>2008-12-15T14:51:42Z</dc:date>
    </Interaction>

    <Organism rdf:about="lsid:ncbi.nlm.nih.gov:taxonomy:10922">
    <taxId>10922</taxId>
    <dc:title>Simian rotavirus</dc:title>
    </Organism>

    <Protein rdf:about="lsid:ncbi.nlm.nih.gov:protein:255458">
    <gi>255458</gi>
    <dc:title>NS34=gene 7 nonstructural protein [simian rotavirus, SA114F, serotype G3, Peptide, 315 aa]</dc:title>
    <organism rdf:resource="lsid:ncbi.nlm.nih.gov:taxonomy:10922"/>
    </Protein>

    <Protein rdf:about="lsid:ncbi.nlm.nih.gov:protein:6176338">
    <gi>6176338</gi>
    <dc:title>ubiquitous tetratricopeptide containing protein RoXaN [Homo sapiens]</dc:title>
    <organism rdf:resource="lsid:ncbi.nlm.nih.gov:taxonomy:9606"/>
    </Protein>

    <bibo:Article rdf:about="http://www.ncbi.nlm.nih.gov/pubmed/15047801">
    <bibo:pmid>15047801</bibo:pmid>
    <dc:title>RoXaN, a novel cellular protein containing TPR, LD, and zinc finger motifs, forms a ternary complex with eukaryotic initiation factor 4G and rotavirus NSP3.</dc:title>
    </bibo:Article>

    <Interaction rdf:about="http://twitter.com/yokofakun/statuses/1058292539">
    <interactor rdf:resource="lsid:ncbi.nlm.nih.gov:protein:255458"/>
    <interactor rdf:resource="lsid:ncbi.nlm.nih.gov:protein:6176338"/>
    <reference rdf:resource="http://www.ncbi.nlm.nih.gov/pubmed/15047801"/>
    <dc:creator rdf:resource="http://twitter.com/yokofakun"/>
    <dc:date>2008-12-15T11:01:10Z</dc:date>
    </Interaction>

    <bibo:Article rdf:about="http://www.ncbi.nlm.nih.gov/pubmed/9755181">
    <bibo:pmid>9755181</bibo:pmid>
    <dc:title>Rotavirus RNA-binding protein NSP3 interacts with eIF4GI and evicts the poly(A) binding protein from eIF4F.</dc:title>
    </bibo:Article>

    <Interaction rdf:about="http://twitter.com/yokofakun/statuses/1058290564">
    <interactor rdf:resource="lsid:ncbi.nlm.nih.gov:protein:41019505"/>
    <interactor rdf:resource="lsid:ncbi.nlm.nih.gov:protein:255458"/>
    <reference rdf:resource="http://www.ncbi.nlm.nih.gov/pubmed/9755181"/>
    <dc:creator rdf:resource="http://twitter.com/yokofakun"/>
    <dc:date>2008-12-15T10:59:19Z</dc:date>
    </Interaction>

    </rdf:RDF>


    What do you think ?

    Pierre

    12 December 2008

    Genetic Algorithm with Darwin's Face: Dynamic SVG

    In a previous post I described how I've implemented a genetic algorithm finding the best set of colored triangles to re-create an image. I've just changed the output of the program: it now saves the output as a dynamic SVG picture. Watch the creation of the picture here:


    (yes, I know: this is a colored image whereas the original was in B&W. That's because my program automatically converts the colors of the triangles to an 8-bit gray-level image.)




    The current iteration on Friday:



    Pierre

    11 December 2008

    Random notes 2008-12:

    Genetic Algorithm


    Evolution of Charles Darwin. I've implemented my own version of the Genetic Algorithm described by Roger Alsing in his blog ( http://rogeralsing.com/2008/12/07/genetic-programming-evolution-of-mona-lisa ). This algorithm finds the best set of colored triangles that could be used to re-create an original image.
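The heart of such a program is a mutate-score-keep loop. Here is a deliberately tiny sketch of that loop (my own illustration, evolving a flat array of gray pixels instead of rendered triangles): mutate a copy of the current best candidate, score it against the target image, and keep it only if it is fitter.

```java
import java.util.Random;

/** A stripped-down mutate-and-keep-if-better loop:
 *  evolve a grayscale "image" (an int array) toward a target image. */
public class TinyEvolver {
    /** Sum of squared pixel differences: lower is fitter. */
    static long fitness(int[] candidate, int[] target) {
        long sum = 0;
        for (int i = 0; i < target.length; i++) {
            long d = candidate[i] - target[i];
            sum += d * d;
        }
        return sum;
    }

    /** Runs n mutation steps; each step tweaks one random pixel and
     *  keeps the change only if it improves the fitness. */
    static int[] evolve(int[] target, int steps, long seed) {
        Random rnd = new Random(seed);
        int[] best = new int[target.length]; // start from an all-black image
        long bestScore = fitness(best, target);
        for (int s = 0; s < steps; s++) {
            int[] child = best.clone();
            child[rnd.nextInt(child.length)] = rnd.nextInt(256); // mutate one pixel
            long score = fitness(child, target);
            if (score < bestScore) { best = child; bestScore = score; }
        }
        return best;
    }
}
```

In the real program the candidate is a set of colored triangles rendered to an image, and the fitness compares the rendered image to the target pixel by pixel; the loop itself is the same.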



    On the left: the original image (via wikipedia); on the right: the current image generated by the genetic algorithm at generation 240 (population: 20 individuals of 50 triangles each). My algorithm is still running.
    The source is available here: http://tinyurl.com/57xaeb
    A short doc is available here: http://code.google.com/p/lindenb/wiki/GAMonaLisa
    I've also uploaded an executable jar here: http://code.google.com/p/lindenb/downloads/list

    Workbench


    I've uploaded a beta version of a spreadsheet-like program that I wrote for the people of my lab.
    It was designed to help people handle large tables in a rich graphical environment. It currently performs a few tasks that are common under unix: for example, it can fetch information about a column of SNPs, and I've implemented a grep/awk-like function filtering the rows with a simple javascript expression. The data are stored with the Java Berkeley DB API, which is used to index each row of a table.


    This screenshot shows a Java JTable displaying the HapMap genotypes for chr1/build36/CEU. The size of the original file is 146 MB.

    The tool is available as a java webstart application. See http://code.google.com/p/cephlib/wiki/Workbench.

    Wiki


    I gave a presentation on how to use a wiki in a lab, using both OWW and Wikipedia. I showed how to edit/follow/track a page (http://tinyurl.com/6ejw35), how to create/discuss a page with templates and categories (http://tinyurl.com/5l5bw5), and how files can be uploaded to a wiki and commented (http://tinyurl.com/5ouc7y), plus a demo of the Wikipedia API (http://tinyurl.com/2dp5r4).
    People were then interested in storing+annotating (linkage) files in a wiki.

    FriendFeed


    Thank you to all the crowd in FriendFeed. Really motivating.


    Pierre