Showing posts with label operon. Show all posts

19 September 2008

Center for the Study of Human Polymorphisms: Week 3

In my previous post I showed how I used Apache Velocity to generate some C code for the Operon project, which is built on BerkeleyDB. I also generated the Makefiles and some Lex and Yacc files that define a simple language for querying each database. Today I compiled and linked my first applications. Each application uses this simple language to query its database, so I don't have to write a new piece of code for each new kind of query.

For example, the database called 'snpIds' contains a sequence of records, each defined as:

typedef struct snpIds_t
{
char* featureid;
char* rs_number;
}snpIds,*snpIdsPtr;


I can now query this database like this:
snpiddump -q "OR( EQ({rs_number},\"rs10043098\"), EQ({rs_number},\"rs2377171\") ) " -f xml

(OK, the syntax looks ugly, but this design was the simplest way to avoid shift/reduce conflicts in the yacc parser.) The query is broken into tokens by the lexer and interpreted by the yacc parser. The parser builds a parse tree, which can be drawn like this:

              "rs2377171"
             /
       EQUALS
      /      \
     /        {rs_number}
--OR
     \        {rs_number}
      \      /
       EQUALS
             \
              "rs10043098"

This tree is then evaluated against each record in the database. When a record matches, it is printed out as xml, json or text, e.g.:
<?xml version="1.0" encoding="UTF-8"?>
<op:operon xmlns:op="http://operon.cng.fr">
<op:SnpIds>
<op:featureid>101051105133288</op:featureid>
<op:rs_number>rs10043098</op:rs_number>
</op:SnpIds>
<op:SnpIds>
<op:featureid>101161015120774</op:featureid>
<op:rs_number>rs2377171</op:rs_number>
</op:SnpIds>
</op:operon>
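
The evaluation step can be sketched in a few lines of C. This is only an illustration, not the generated Operon code: the Node type, the NodeKind values and the evalNode and leafValue functions are names I made up for the example, and only string equality is handled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Same layout as the snpIds structure shown above. */
typedef struct snpIds_t {
    char *featureid;
    char *rs_number;
} snpIds, *snpIdsPtr;

/* Hypothetical node kinds; the real parser's types are not shown in the post. */
typedef enum { NODE_OR, NODE_AND, NODE_EQ, NODE_FIELD, NODE_STRING } NodeKind;

typedef struct Node {
    NodeKind kind;
    const char *text;          /* field name or string literal, else NULL */
    const struct Node *left;   /* children, NULL for leaves */
    const struct Node *right;
} Node;

/* Resolves a leaf to a string: a field is looked up in the record,
 * a literal is returned as-is. Anything else yields NULL. */
static const char *leafValue(const Node *n, const snpIds *rec) {
    if (n->kind == NODE_STRING) return n->text;
    if (n->kind == NODE_FIELD) {
        if (strcmp(n->text, "featureid") == 0) return rec->featureid;
        if (strcmp(n->text, "rs_number") == 0) return rec->rs_number;
    }
    return NULL;
}

/* Returns 1 if the record matches the tree, 0 otherwise. */
int evalNode(const Node *n, const snpIds *rec) {
    assert(n != NULL && rec != NULL);
    switch (n->kind) {
    case NODE_OR:
        return evalNode(n->left, rec) || evalNode(n->right, rec);
    case NODE_AND:
        return evalNode(n->left, rec) && evalNode(n->right, rec);
    case NODE_EQ: {
        const char *a = leafValue(n->left, rec);
        const char *b = leafValue(n->right, rec);
        return a != NULL && b != NULL && strcmp(a, b) == 0;
    }
    default:
        return 0; /* a bare leaf is not a boolean expression */
    }
}
```

A scan over the database would then call evalNode once per record and print the records for which it returns 1.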

Again, most of the code was written using a velocity template [here].

Pierre

17 September 2008

Generating C code with apache-velocity

I'm currently working on Operon ( http://regulon.cng.fr/), a database developed by Mario Foglio at The National Center of Genotyping. The whole database/storage layer is built around the Berkeley DB C API, and I've been asked to write a clean 'C' API to access the data. Most data are stored as C structures and I wanted to quickly write the methods to:
* create a new instance of each structure
* free the resources allocated by each structure
* create a vector of those structures with the common methods (addElement, removeElement, getSize, clear, etc...)
* etc...
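
To make the list concrete, here is a hand-written sketch of what such methods could look like for a two-string record like the snpIds structure from the 19 September post. All function names (snpIdsNew, snpIdsFree, snpIdsVectorAdd, snpIdsVectorSize, snpIdsVectorClear, copyString) are hypothetical and are not taken from the generated Operon code.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct snpIds_t {
    char *featureid;
    char *rs_number;
} snpIds, *snpIdsPtr;

/* Hypothetical helper: duplicates a string with malloc. */
static char *copyString(const char *s) {
    size_t n = strlen(s) + 1;
    char *copy = malloc(n);
    if (copy != NULL) memcpy(copy, s, n);
    return copy;
}

/* Creates a new instance; returns NULL on allocation failure. */
snpIdsPtr snpIdsNew(const char *featureid, const char *rs_number) {
    snpIdsPtr p = calloc(1, sizeof(snpIds));
    if (p == NULL) return NULL;
    p->featureid = copyString(featureid);
    p->rs_number = copyString(rs_number);
    if (p->featureid == NULL || p->rs_number == NULL) {
        free(p->featureid);
        free(p->rs_number);
        free(p);
        return NULL;
    }
    return p;
}

/* Frees the structure and the strings it owns. */
void snpIdsFree(snpIdsPtr p) {
    if (p == NULL) return;
    free(p->featureid);
    free(p->rs_number);
    free(p);
}

/* Hypothetical growable vector that owns its elements. */
typedef struct snpIdsVector_t {
    snpIdsPtr *items;
    size_t size;
    size_t capacity;
} snpIdsVector;

/* Appends an element; returns 1 on success, 0 on allocation failure. */
int snpIdsVectorAdd(snpIdsVector *v, snpIdsPtr item) {
    assert(v != NULL);
    if (v->size == v->capacity) {
        size_t newCapacity = v->capacity == 0 ? 8 : v->capacity * 2;
        snpIdsPtr *grown = realloc(v->items, newCapacity * sizeof *grown);
        if (grown == NULL) return 0;
        v->items = grown;
        v->capacity = newCapacity;
    }
    v->items[v->size++] = item;
    return 1;
}

size_t snpIdsVectorSize(const snpIdsVector *v) {
    assert(v != NULL);
    return v->size;
}

/* Frees every element and resets the vector to empty. */
void snpIdsVectorClear(snpIdsVector *v) {
    size_t i;
    assert(v != NULL);
    for (i = 0; i < v->size; i++) snpIdsFree(v->items[i]);
    free(v->items);
    v->items = NULL;
    v->size = 0;
    v->capacity = 0;
}
```

Writing this by hand for every structure is exactly the repetitive work that a code generator can take over.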

I wrote a description of a few structures in xml. Something like this:

<?xml version="1.0" encoding="UTF-8"?>
<op:operon
xmlns:h="http://www.w3.org/1999/xhtml"
xmlns:op="http://operon.cng.fr"
>
<op:table name="SnpIds">
<op:description>
SNPIDS Berkeley Hash db: stores all SNP ids. The key for this
database is the acn, and
duplicate acn keys are allowed.
</op:description>
<op:column name="fid" type="char*">
<op:description>fid: SNP feature id</op:description>
</op:column>
<op:column name="acn" type="char*">
<op:description>acn: SNP accession</op:description>
</op:column>
</op:table>
</op:operon>


To generate my C code I first tried XSLT, but I found the result too ugly.
I then looked for something like a standalone version of Java Server Pages (JSP). I didn't find one (it would have been nice to re-use the custom tags).
I then tried apache-velocity ( http://velocity.apache.org/), a Java template engine, and this is the technology I ended up using.

OK, this kind of C structure can be described with Java interfaces:
public interface CField
{
public String getName();
public String getType();
(...)
}

public interface CStructure
{
public Collection<CField> getFields();
public String getName();
(...)
}


Those objects are created by parsing the XML description of the structures and are then associated with a string in the 'context' of velocity. (source code [here]).
CStructure mystructure;
(...)
velocityContext.put("struct",mystructure);
The velocity engine is then called. It uses reflection on these objects to resolve the velocity statements. For example the following template:
 typedef struct $struct.typedef
{
#foreach($field in ${struct.fields})
/**
* ${field.name}
* ${field.description}
*/
${field.type} ${field.name};
#end
} ${struct.name}, *${struct.name}Ptr;
will generate the C header for this structure.
The velocity templates generating the *.c and the *.h files are available [here] and [here] (warning: this is a work in progress).
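
For the SnpIds table described in the XML above, the header produced by this template would look roughly like the following. This is my own reconstruction, not actual generator output: I am assuming that $struct.typedef resolves to SnpIds_t and that ${field.description} returns the text of the op:description element.

```c
/* Assumed expansion of the template for the SnpIds table. */
typedef struct SnpIds_t
{
    /**
     * fid
     * fid: SNP feature id
     */
    char* fid;
    /**
     * acn
     * acn: SNP accession
     */
    char* acn;
} SnpIds, *SnpIdsPtr;
```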

But that is not all: I also wanted to query each Berkeley database without having to write new code for each new kind of query. So I've used velocity to generate the Flex/lex and Bison/yacc files. Those tools then generate a simple parser that builds a concrete syntax tree, which is then used to search each database.
YNodePtr search = mydatabaseParseQuery("AND(LT([chromEnd],10000),GT([chromStart],100))");
myDatabaseArray array= myDatabaseSearch(search);

The velocity templates for flex and bison are available [here] and [here] (again, warning: this is a work in progress).

That's it

Pierre

05 September 2008

Center for the Study of Human Polymorphisms: Week 1

I've started my first week at the Center for the Study of Human Polymorphisms, and today we had our first meeting with Mario Foglio and some others to define what my job will be in the following months. As I said, I will collaborate with the National Center of Genotyping on Operon, a feasible bioinformatics platform to centralize scientific software and biomedical data with internal results. It was curious because I found that nobody there uses most of the tools used/discussed with the biogang (rss feeds, social bookmarking, etc...) and I hope I will present some slides about this later.

I will have to refactor the current 'C' code of operon (written over BerkeleyDB) to build a new clean C API that will be used by other people.

What is cool is that this is an open source project and we will host it on google (http://code.google.com/p/polymorphism/).
I've also created a mailing list on google.groups: http://groups.google.com/group/operon-dev, showed my collaborators how to share a calendar on google-calendar (to find possible dates for a meeting), and we have already started to share some documents using google-docs. Thank you google.

The 'C' language was chosen because it is a low-level language and it seems that the developers at the CNG prefer it. I hope I will create some wrappers around this API in some other languages. I already know it is possible with java using the Java Native Interface (JNI, see my previous post about this). SWIG (http://www.swig.org/), a tool that generates wrappers in various languages (python, perl...), might also be of help. Using a Java wrapper will allow us to deploy any application in a java web server such as tomcat.

I haven't played much with 'C' since 1998 (I then played with C++ for 4 years before switching to java), but I (hope I) still have some good skills and I know I now have better programming practices.

That's it for tonight.

Pierre