YOKOFAKUN: September 2008

19 September 2008

Center for the Study of Human Polymorphisms: Week 3

In my previous post I showed how I used Apache Velocity to generate some 'C' code for the Operon project, which is based on BerkeleyDB. I also generated the Makefiles and some Lex and Yacc files to create a simple language for querying each database. Today I compiled and linked my first applications. Each application uses this simple language to query a database, so I don't have to write a new piece of code for each new kind of query.

For example, the database called 'snpIds' contains a sequence of structures defined as:

typedef struct snpIds_t
  {
  char* featureid;
  char* rs_number;
  }snpIds,*snpIdsPtr;

I can now query this database like this:

snpiddump -q "OR( EQ({rs_number},\"rs10043098\"), EQ({rs_number},\"rs2377171\") ) " -f xml 

(OK, the syntax looks ugly, but this design was the simplest way to avoid the shift/reduce conflicts in the yacc parser). The query part is broken into tokens by the lexer and interpreted by the yacc parser. The parser builds a parse tree, which can be drawn like this:

             "rs2377171"
            /
      EQUALS
     /      \
    /       {rs_number}
--OR
    \        {rs_number}
     \      /
      EQUALS
            \
             "rs10043098"

This tree is then evaluated against each record in the database. When a record matches, it is printed as XML, JSON or text, e.g.:

<?xml version="1.0" encoding="UTF-8"?>
<op:operon xmlns:op="http://operon.cng.fr">
<op:SnpIds>
  <op:featureid>101051105133288</op:featureid>
  <op:rs_number>rs10043098</op:rs_number>
</op:SnpIds>
<op:SnpIds>
  <op:featureid>101161015120774</op:featureid>
  <op:rs_number>rs2377171</op:rs_number>
</op:SnpIds>
</op:operon>
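
The evaluation step described above can be sketched in a few lines of C. This is only an illustration: the node type, the enum values and the functions nodeMatches and countMatches are hypothetical names, not the ones produced by my generated parser, and the array scan merely stands in for the real BerkeleyDB cursor loop.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Record layout from the snpIds example above. */
typedef struct snpIds_t {
    char *featureid;
    char *rs_number;
} snpIds;

/* Hypothetical node kinds; the generated parser's real types are not shown. */
typedef enum { NODE_OR, NODE_AND, NODE_EQ, NODE_FIELD, NODE_STRING } NodeKind;

typedef struct Node {
    NodeKind kind;
    const char *text;          /* field name or literal, used by leaves only */
    const struct Node *left;
    const struct Node *right;
} Node;

/* Resolve a leaf to a string: a field reads the record, a literal is itself. */
static const char *leafValue(const Node *n, const snpIds *rec) {
    if (n->kind == NODE_STRING) return n->text;
    if (n->kind == NODE_FIELD) {
        if (strcmp(n->text, "rs_number") == 0) return rec->rs_number;
        if (strcmp(n->text, "featureid") == 0) return rec->featureid;
    }
    return NULL; /* unknown field, or not a leaf */
}

/* Return 1 when the record satisfies the tree, 0 otherwise. */
int nodeMatches(const Node *n, const snpIds *rec) {
    assert(n != NULL && rec != NULL);
    switch (n->kind) {
    case NODE_OR:
        return nodeMatches(n->left, rec) || nodeMatches(n->right, rec);
    case NODE_AND:
        return nodeMatches(n->left, rec) && nodeMatches(n->right, rec);
    case NODE_EQ: {
        const char *a = leafValue(n->left, rec);
        const char *b = leafValue(n->right, rec);
        return a != NULL && b != NULL && strcmp(a, b) == 0;
    }
    default:
        return 0; /* a bare leaf is not a boolean expression */
    }
}

/* Count the matching records in an array (stand-in for the database scan). */
size_t countMatches(const Node *query, const snpIds *records, size_t n) {
    size_t hits = 0;
    for (size_t i = 0; i < n; i++) {
        if (nodeMatches(query, &records[i])) hits++;
    }
    return hits;
}
```

The tree is walked once per record, so the query string is parsed only once, however many records are scanned.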

Again, most of the code was written using a velocity template [here].

Pierre

17 September 2008

Generating C code with apache-velocity

I'm currently working on Operon ( http://regulon.cng.fr/), a database developed by Mario Foglio at the National Center of Genotyping. The whole database/storage layer is built around the BerkeleyDB C API, and I've been asked to write a clean 'C' API to access the data. Most data are stored as C structures, and I wanted to quickly write the methods to:
* create a new instance of each structure
* free the resources allocated by each structure
* create a vector of those structures with the common methods (addElement, removeElement, getSize, clear, etc...)
* etc...
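
To give an idea of the boilerplate I wanted to generate, here is a hand-written sketch for one two-string record, borrowing the snpIds layout from the 19 September post. The function names (snpIdsNew, snpIdsFree, snpIdsVectorAdd, ...) and the doubling growth policy are my assumptions for this illustration, not the actual generated API.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct snpIds_t {
    char *featureid;
    char *rs_number;
} snpIds, *snpIdsPtr;

/* Hypothetical helper: heap copy of a string, NULL on allocation failure. */
static char *dupString(const char *s) {
    size_t n = strlen(s) + 1;
    char *copy = malloc(n);
    if (copy != NULL) memcpy(copy, s, n);
    return copy;
}

/* Create a new instance that owns copies of both strings. */
snpIdsPtr snpIdsNew(const char *featureid, const char *rs_number) {
    snpIdsPtr p = malloc(sizeof(snpIds));
    if (p == NULL) return NULL;
    p->featureid = dupString(featureid);
    p->rs_number = dupString(rs_number);
    if (p->featureid == NULL || p->rs_number == NULL) {
        free(p->featureid);
        free(p->rs_number);
        free(p);
        return NULL;
    }
    return p;
}

/* Free the resources allocated by one structure. */
void snpIdsFree(snpIdsPtr p) {
    if (p == NULL) return;
    free(p->featureid);
    free(p->rs_number);
    free(p);
}

/* Growable vector of owned snpIds pointers. */
typedef struct snpIdsVector_t {
    snpIdsPtr *items;
    size_t size;
    size_t capacity;
} snpIdsVector;

void snpIdsVectorInit(snpIdsVector *v) {
    assert(v != NULL);
    v->items = NULL;
    v->size = 0;
    v->capacity = 0;
}

/* Append an element; the vector takes ownership. 0 on success, -1 on failure. */
int snpIdsVectorAdd(snpIdsVector *v, snpIdsPtr p) {
    assert(v != NULL);
    if (v->size == v->capacity) {
        size_t cap = v->capacity == 0 ? 8 : v->capacity * 2;
        snpIdsPtr *tmp = realloc(v->items, cap * sizeof(snpIdsPtr));
        if (tmp == NULL) return -1;
        v->items = tmp;
        v->capacity = cap;
    }
    v->items[v->size++] = p;
    return 0;
}

size_t snpIdsVectorSize(const snpIdsVector *v) {
    assert(v != NULL);
    return v->size;
}

/* Free every element and reset the vector to empty. */
void snpIdsVectorClear(snpIdsVector *v) {
    assert(v != NULL);
    for (size_t i = 0; i < v->size; i++) snpIdsFree(v->items[i]);
    free(v->items);
    snpIdsVectorInit(v);
}
```

Every structure needs the same four or five functions with only the field list changing, which is exactly why a template engine is attractive here.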

I wrote a description of a few structures in xml. Something like this:

<?xml version="1.0" encoding="UTF-8"?>
<op:operon
        xmlns:h="http://www.w3.org/1999/xhtml"
        xmlns:op="http://operon.cng.fr"
        >
<op:table name="SnpIds">
        <op:description>
                SNPIDS Berkeley Hash db: stores all SNP ids. The key for this
                database is the acn, and duplicate acn keys are allowed.
        </op:description>
        <op:column name="fid" type="char*">
                <op:description>fid: SNP feature id</op:description>
        </op:column>
        <op:column name="acn" type="char*">
                <op:description>acn: SNP accession</op:description>
        </op:column>
</op:table>

To generate my C code I first tried XSLT, but I found the result too ugly.
I then looked for something like a standalone version of JavaServer Pages (JSP). I didn't find one (it would have been nice to re-use the custom tags).
I then tried apache-velocity ( http://velocity.apache.org/), a Java template engine, and this is the technology I used.

OK, this kind of C structure can be described with Java interfaces:

public interface CField
  {
  public String getName();
  public String getType();
  (...)
  }

public interface CStructure
  {
  public Collection<CField> getFields();
  public String getName();
  (...)
  }

Those objects are created by parsing the XML description of the structures and are then associated with a string key in the Velocity 'context' (source code [here]).

CStructure mystructure;
  (...)
  velocityContext.put("struct",mystructure);

The Velocity engine is then called; it uses reflection on those objects to resolve the Velocity statements. For example, the following template:

 typedef struct $struct.typedef
  {
  #foreach($field in ${struct.fields})
        /**
    * ${field.name}
    * ${field.description}
    */
   ${field.type} ${field.name};
  #end
  } ${struct.name}, *${struct.name}Ptr;

will generate the C header for this structure.
The Velocity templates generating the *.c and *.h files are available [here] and [here] (warning: this is a work in progress).

But that is not all: I also wanted to query each Berkeley database without having to write new code for each new kind of query. So I used Velocity to generate the Flex/lex and Bison/yacc files. Those tools then generate a simple parser that builds a concrete syntax tree, which is then used to search each database.

YNodePtr search = mydatabaseParseQuery("AND(LT([chromEnd],10000),GT([chromStart],100))");
myDatabaseArray array= myDatabaseSearch(search);
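
For readers who have never seen what such a parser has to do, here is a small hand-written recursive-descent version of the prefix syntax used above. It is a stand-in, not the generated flex/bison code: QNode, parseQuery and qnodeFree are hypothetical names, and the grammar is reduced to binary operators, [field] references and integer literals.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

typedef struct QNode {
    char *text;                 /* operator name, field name or literal */
    char kind;                  /* 'o' operator, 'f' field, 'n' number */
    struct QNode *left, *right;
} QNode;

static QNode *newNode(char kind, const char *start, size_t len) {
    QNode *n = calloc(1, sizeof(QNode));
    if (n == NULL) return NULL;
    n->text = malloc(len + 1);
    if (n->text == NULL) { free(n); return NULL; }
    memcpy(n->text, start, len);
    n->text[len] = '\0';
    n->kind = kind;
    return n;
}

void qnodeFree(QNode *n) {
    if (n == NULL) return;
    qnodeFree(n->left);
    qnodeFree(n->right);
    free(n->text);
    free(n);
}

static void skipSpaces(const char **p) {
    while (isspace((unsigned char)**p)) (*p)++;
}

/* expr := NAME '(' expr ',' expr ')' | '[' NAME ']' | DIGITS */
static QNode *parseExpr(const char **p) {
    skipSpaces(p);
    const char *start = *p;
    if (**p == '[') {                        /* field reference */
        start = ++(*p);
        while (isalnum((unsigned char)**p) || **p == '_') (*p)++;
        if (**p != ']' || *p == start) return NULL;
        QNode *field = newNode('f', start, (size_t)(*p - start));
        (*p)++;
        return field;
    }
    if (isdigit((unsigned char)**p)) {       /* integer literal */
        while (isdigit((unsigned char)**p)) (*p)++;
        return newNode('n', start, (size_t)(*p - start));
    }
    if (isalpha((unsigned char)**p)) {       /* binary operator call */
        while (isalpha((unsigned char)**p)) (*p)++;
        QNode *op = newNode('o', start, (size_t)(*p - start));
        if (op == NULL) return NULL;
        skipSpaces(p);
        if (**p != '(') { qnodeFree(op); return NULL; }
        (*p)++;
        op->left = parseExpr(p);
        skipSpaces(p);
        if (op->left == NULL || **p != ',') { qnodeFree(op); return NULL; }
        (*p)++;
        op->right = parseExpr(p);
        skipSpaces(p);
        if (op->right == NULL || **p != ')') { qnodeFree(op); return NULL; }
        (*p)++;
        return op;
    }
    return NULL;
}

/* Parse a whole query; returns NULL on any syntax error or trailing input. */
QNode *parseQuery(const char *query) {
    const char *p = query;
    QNode *root = parseExpr(&p);
    if (root == NULL) return NULL;
    skipSpaces(&p);
    if (*p != '\0') { qnodeFree(root); return NULL; }
    return root;
}
```

Because every operator is written as NAME(left,right), the parser never has to decide between two readings of the same tokens, which is the practical benefit of the prefix notation.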

The Velocity templates for flex and bison are available [here] and [here] (again, warning: this is a work in progress).

That's it

Pierre

16 September 2008

What is in a list of snp ?

Here is a common question: "Here is a list of SNPs genotyped with a high p-value. Is there anything interesting in this list? Is there a common link between those SNPs?".
Today, to answer this question, I've played with NCBI ELink. ELink checks for the existence of an external or Related Articles link from a list of one or more primary IDs; retrieves IDs and relevancy scores for links to Entrez databases or Related Articles; creates a hyperlink to the primary LinkOut provider for a specific ID and database, or lists LinkOut URLs and attributes for multiple IDs.
The Java tool I created, AboutRsList, takes as input a list of rs identifiers. For each rs it calls ELink and gets the links from this SNP to PubMed, OMIM, NCBI Gene, etc. The information (title...) about each SNP is then retrieved using NCBI EFetch. Then the program creates a set of clusters of SNPs in which no cluster has a link to another one. Each cluster is then saved as an SVG figure using Graphviz dot.
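
In graph terms, the clustering step amounts to finding connected components. My tool is written in Java and does not necessarily work this way; the sketch below uses a union-find structure in C only to show the idea. The function name clusterNodes and the flat array of link pairs are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Find the representative of x, halving the path as we go. */
static size_t findRoot(size_t *parent, size_t x) {
    while (parent[x] != x) {
        parent[x] = parent[parent[x]];
        x = parent[x];
    }
    return x;
}

/*
 * Nodes 0..nodeCount-1 are SNPs, papers and genes mixed together.
 * links holds linkCount pairs: links[2*k] and links[2*k+1] are joined.
 * On return cluster[i] is the label of the cluster containing node i;
 * the function returns the number of distinct clusters.
 */
size_t clusterNodes(size_t nodeCount, const size_t *links, size_t linkCount,
                    size_t *cluster) {
    assert(cluster != NULL || nodeCount == 0);
    for (size_t i = 0; i < nodeCount; i++) cluster[i] = i;

    for (size_t k = 0; k < linkCount; k++) {
        assert(links[2 * k] < nodeCount && links[2 * k + 1] < nodeCount);
        size_t a = findRoot(cluster, links[2 * k]);
        size_t b = findRoot(cluster, links[2 * k + 1]);
        if (a != b) cluster[b] = a;   /* merge the two clusters */
    }

    size_t count = 0;
    for (size_t i = 0; i < nodeCount; i++) {
        cluster[i] = findRoot(cluster, i);
        if (cluster[i] == i) count++;
    }
    return count;
}
```

Two SNPs end up with the same label as soon as they share a paper or a gene, directly or through a chain of other nodes.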

Here is an example of a cluster, showing all the links between a set of rs###, papers and genes:

The circles are the rs###, the ellipses are the papers in PubMed, and the polygons are the genes.

I put the sources here: http://code.google.com/p/lindenb/source/browse/trunk/proj/tinytools/src/org/lindenb/tinytools/AboutIdentifiers.java
And an executable jar is available here: http://code.google.com/p/lindenb/downloads/list.

Enjoy

Pierre

05 September 2008

Center for the Study of Human Polymorphisms: Week 1

I've started my first week at the Center for the Study of Human Polymorphisms, and today we had our first meeting with Mario Foglio and some others to define what my job will be in the coming months. As I said, I will collaborate with the National Center of Genotyping on Operon, a feasible bioinformatics platform to centralize scientific software and biomedical data with internal results. It was curious to find that nobody there uses most of the tools used/discussed by the biogang (RSS feeds, social bookmarking, etc.), and I hope to present some slides about this later.

I will have to refactor the current 'C' code of Operon (written over BerkeleyDB) to build a new, clean C API that will be used by other people.

What is cool is that this is an open source project and we will host it on google (http://code.google.com/p/polymorphism/).
I've also created a mailing list on google.groups: http://groups.google.com/group/operon-dev, showed my collaborators how to share a calendar on google-calendar (to find possible dates for organizing a meeting), and we have already started to share some documents using google-docs. Thank you google.

The 'C' language was chosen because it is a low-level language and it seems that the developers at the CNG prefer it. I hope I will create some wrappers around this API in some other languages. I already know it is possible with Java using the Java Native Interface (JNI, see my previous post about this). SWIG (http://www.swig.org/), a tool generating wrappers in various languages (Python, Perl...), might also be of help. Using a Java wrapper will allow us to deploy any application in a Java web server such as Tomcat.

I haven't played much with 'C' since 1998 (I then played with C++ for 4 years before switching to Java), but I (hope I) still have some good skills, and I know I now have better programming practices.

That's it for tonight.

Pierre

02 September 2008

Ubiquity: Arf-arf ! smooch ! Achoo! Wee Woo !

OK, after a few others (Pawel, Thomas Lemberger, Egon), I've succumbed to Mozilla Ubiquity, an experimental Firefox extension that (they say) gives you a powerful way to interact with the Web. The following useless script, comics, inserts a speech balloon using the font samples from http://www.dafont.com/

CmdUtils.CreateCommand({
  name: "comics",
  author: { name: "Pierre Lindenbaum", email: "plindenbaum@yahoo.fr"},
  description: "Comics",
  takes: {"Your text": noun_arb_text},
  help: "Insert a speech balloon with a comic font ",

  preview: function( pblock, theShout ) {
    var msg = "Inserts a speech balloon : (<i>"+ theShout.summary+"</i>)";
    pblock.innerHTML = CmdUtils.renderTemplate( msg );
  },

  execute: function(theShout) {
    CmdUtils.setSelection(
          "<img src=\'http://img.dafont.com/preview.php?text=" +
          escape(theShout.text)+
          "&ttf=badaboom_bb0&size=49&psize=m&y=58'/>"
          );
  }
})

It worked fine with GMail !

Update: The script is available here.

Pierre

01 September 2008

I'm not looking for a job anymore: Welcome at the CEPH

Today was my first day as a bioinformatician at the Center for the Study of Human Polymorphisms (CEPH http://www.cephb.fr/en/cephdb) and I want to thank my former colleagues Christine K and Philippe Gesnouin (philguess on twitter/FF ) who helped me to find this position. It's a short term contract (one year).

The CEPH is located in Paris near the St-Louis Hospital and the "Place de la République". It maintains a database of genotypes for genetic markers that have been typed on the CEPH reference family resource for linkage mapping (Genomics 6: 575-577, 1990; Science 265: 2049-2054, 1994). The CEPH works in conjunction with the National Center of Genotyping (CNG/Evry), where I also worked eight years ago, and both centers are managed by Dr Mark Lathrop. One of my first objectives is to develop a set of tools around OPERON with the help of its author, Mario Foglio.

As far as I understand Operon today (I may be wrong), it is a C program handling a large set of genotypes (among other things...) using BerkeleyDB as a storage engine (I blogged about BerkeleyDB a few posts ago). It seems that, using this strategy, the genotypes can be quickly accessed using something like fseek(table,sizeof(genotype_t)*(sample_count*marker_index+sample_index),SEEK_SET).
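
To make the offset arithmetic concrete, here is a minimal sketch that treats the table as a flat file of fixed-size records, ordered marker by marker. The two-character genotype_t and the functions genotypeOffset and readGenotype are hypothetical; I have not seen how Operon actually lays out its records.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical fixed-size record; Operon's real genotype_t may differ. */
typedef struct genotype_t {
    char allele1;
    char allele2;
} genotype_t;

/* Byte offset of one genotype when all samples of a marker are contiguous. */
long genotypeOffset(long sample_count, long marker_index, long sample_index) {
    assert(marker_index >= 0);
    assert(sample_index >= 0 && sample_index < sample_count);
    return (long)sizeof(genotype_t) *
           (sample_count * marker_index + sample_index);
}

/* Read one genotype from the table; returns 0 on success, -1 on failure. */
int readGenotype(FILE *table, long sample_count, long marker_index,
                 long sample_index, genotype_t *out) {
    long offset = genotypeOffset(sample_count, marker_index, sample_index);
    if (fseek(table, offset, SEEK_SET) != 0) return -1;
    return fread(out, sizeof(genotype_t), 1, table) == 1 ? 0 : -1;
}
```

With 4 samples per marker, the genotype of sample 1 for marker 2 is record number 4*2+1 = 9, so the read jumps straight to byte 9*sizeof(genotype_t) without scanning anything before it.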

As a java programmer, I wish I could write a wrapper around the Operon C API, that would be useful to embed this model in a web container (servlet, jsp) or to write a Swing interface. My first ideas to achieve this are:
* using JNI (Java Native Interface, which allows calling C from Java) to write a Java wrapper around the C API
* reading the data in the berkeleyDB files using the BerkeleyDB Java API.
* ...

That's it for tonight.

Pierre
