YOKOFAKUN: 2009

17 December 2009

Playing with Erlang (3): concurrency

In this post I'll show how to write a multithreaded erlang program where each child process will calculate if a given number is a Prime Number (not rocket science, a brute force algorithm will be used).

First this erlang module is called 'prime':

-module(prime).

The method named 'test' with no argument is public

-export([test/0]).

The method named 'start' creates a new thread for the function 'loop'. It returns the ID of this new thread.

start()-> spawn(fun loop/0).

The 'loop' method waits for a message in a child process. When it receives a message matching the pattern '{ _From, {prime,Number},Start }' it invokes the recursive function is_prime and prints the result:

loop()->
receive
{ _From, {prime,Number},Start }->
io:format("result=~p\n Process-ID=~p\n Duration=~p ~n",[
is_prime(2,Number),
self(),
timer:now_diff(now(),Start)/1.0e6
])
;
{_From,_Other} ->
io:format("Booomm~n",[])
end.

The function 'is_prime' is a simple-and-stupid recursive method testing if a number is a prime where we search if the numbers 'Div' lower than 'N' have a modulus of 'Number/Div=0':

is_prime(Div,Number)->
if
Div =:= Number -> {result,{number,Number},{prime,true}};
Number rem Div =:= 0 -> {result,{number,Number},{prime,false}};
true-> is_prime(Div+1,Number)
end.

The method named 'rpc' sends the ID of the current process, the prime number and the current system time to the child process:

rpc(Pid,Request)->
Pid ! { self(), Request , now()}
.

The 'test' method creates a list 'NUMS' of numbers. For each number N in this list the function prime:start is called and it returns a new thread-ID. This Thread-ID is sent with 'N' to the function rpc.

test()->
NUMS=[111, 197951153,197951154,102950143,65537,3,7,8,12],
lists:foreach( fun(N)-> Pid= start(), rpc(Pid,{prime,N}) end, NUMS )
.

All in one, here is the full code:

-module(prime).
-export([test/0]).

start()-> spawn(fun loop/0).

rpc(Pid,Request)->
Pid ! { self(), Request , now()}
.

loop()->
receive
{ _From, {prime,Number},Start }->
io:format("result=~p\n Process-ID=~p\n Duration=~p ~n",[
is_prime(2,Number),
self(),
timer:now_diff(now(),Start)/1.0e6
])
;
{_From,_Other} ->
io:format("Booomm~n",[])
end.

is_prime(Div,Number)->
if
Div =:= Number -> {result,{number,Number},{prime,true}};
Number rem Div =:= 0 -> {result,{number,Number},{prime,false}};
true-> is_prime(Div+1,Number)
end.

test()->
NUMS=[111, 197951153,197951154,102950143,65537,3,7,8,12],
lists:foreach( fun(N)-> Pid= start(), rpc(Pid,{prime,N}) end, NUMS )
.

Compiling:

erlc prime.erl

Running:

erl -noshell -s prime test

Result (as you can see, the biggest prime appears at the end of the list because it took more time for the calculation):

result={result,{number,65537},{prime,true}}
Process-ID=<0.34.0>
Duration=0.006522
result={result,{number,111},{prime,false}}
Process-ID=<0.30.0>
Duration=7.6e-5
result={result,{number,197951154},{prime,false}}
Process-ID=<0.32.0>
Duration=4.57e-4
result={result,{number,3},{prime,true}}
Process-ID=<0.35.0>
Duration=7.25e-4
result={result,{number,7},{prime,true}}
Process-ID=<0.36.0>
Duration=7.23e-4
result={result,{number,8},{prime,false}}
Process-ID=<0.37.0>
Duration=7.22e-4
result={result,{number,12},{prime,false}}
Process-ID=<0.38.0>
Duration=7.2e-4
result={result,{number,102950143},{prime,true}}
Process-ID=<0.33.0>
Duration=8.856508
result={result,{number,197951153},{prime,true}}
Process-ID=<0.31.0>
Duration=49.51542

That's it.
Pierre.

14 December 2009

Parsing a genetic code using flex and bison. Part 2/2

(this post was inspired from this tutorial: "Writing a Reentrant Parser with Flex and Bison").
In the previous post I've shown how to parse a NCBI genetic code in ASN.1 using flex and bison. However the generated code is non reentrant that is to say that some globals variables prevent it to be used in a concurrent environment.

If we want to use this code in a multithread environment, some changes must be added: first we need a new structure GCState where the state of the lexer will be saved between each time bison requests for a new token:

typedef struct gcState
{
/** reference to the flex lexer */
void* handle_to_the_scanner;
}GCStateStruct,*GCState;

The Lexer

In the lexer some options must be added: we tell Flex to create a reentrant parser and that the yylex function generated by flex takes an extra argument yylval

%option bison-bridge
%option reentrant

Prior to be used, this context 'yyscan_t' has to be initialized:

void gcParse(FILE* in)
{
GCStateStruct ctx;
memset(&ctx,0,sizeof(GCStateStruct));
yyscan_t scanner;//structure created by flex

if(yylex_init_extra(&ctx,&scanner)!=0)//initialize the scanner
{
return;
}
ctx.handle_to_the_scanner=&scanner;//will tell bison where is the lexer
yyset_in (in,scanner );//change the input stream

gc_parse(ctx);//generated by bison
yylex_destroy(scanner);
}

.The for each method in yylex, a reference to the state in GCState is added:

[0-9]+ {yyget_lval(yyscanner)->integer=atoi(yyget_text(yyscanner)) ;return INTEGER;}
\"[^\"]+\" {
yyget_lval(yyscanner)->s= malloc(yyget_leng(yyscanner)-1);
if(yyget_lval(yyscanner)->s==NULL) exit(EXIT_FAILURE);
strncpy(yyget_lval(yyscanner)->s,&yytext[1],yyget_leng(yyscanner)-2);
return TEXT;

The Parser

We tell bison that we want a reentrant parser

%pure-parser

, each time the lexer will be asked for a new token, we want to pass an extra GCState parameter:

%parse-param{GCState state}

. We also tell bison that the yylex function generated by flex takes an extra parameter hoding the state of the scanner:

%lex-param{void* handle_to_the_scanner}

.A #define is added to tell bison how to extract the handle to the scanner from the GCState:

#define handle_to_the_scanner (state->handle_to_the_scanner)

All in one

Here is the lexer gc.l:

%option prefix="gc_"
%option noyywrap
%option bison-bridge
%option reentrant

%{
#include <stdio.h>
#include <stdlib.h>
#include "gc.h"
#include "gc.tab.h" //generated by bison
%}

%%
\-\-.*\n ;/* ignore comments */

Genetic\-code\-table return GENETIC_CODE_TABLE;
name return NAME;
id return ID;
ncbieaa return NCBIEAA;
sncbieaa return SNCBIEAA;
[0-9]+ {yyget_lval(yyscanner)->integer=atoi(yyget_text(yyscanner)) ;return INTEGER;}
\:\:= return ASSIGN;
, return COMMA;
\{ return OPEN;
\} return CLOSE;
\"[^\"]+\" {
yyget_lval(yyscanner)->s= malloc(yyget_leng(yyscanner)-1);
if(yyget_lval(yyscanner)->s==NULL) exit(EXIT_FAILURE);
strncpy(yyget_lval(yyscanner)->s,&yytext[1],yyget_leng(yyscanner)-2);
return TEXT;
}
[\n\t ] ;//ignore blanks
. return yytext[0];

%%

void gcParse(FILE* in)
{
GCStateStruct ctx;
memset(&ctx,0,sizeof(GCStateStruct));
yyscan_t scanner;

if(yylex_init_extra(&ctx,&scanner)!=0)
{
return;
}
ctx.handle_to_the_scanner=&scanner;
yyset_in (in,scanner );
yyset_debug(1,scanner);
gc_parse(ctx);
yylex_destroy(scanner);
}

Here is the parser gc.y:

%{
#include <stdio.h>
#include <stdlib.h>
#include "gc.h"
%}

%union {
char* s;
int integer;
GeneticCode gCode;
}

%pure-parser
%name-prefix="gc_"
%defines
%error-verbose
%parse-param{GCState state}
%lex-param{void* handle_to_the_scanner}

%token GENETIC_CODE_TABLE NAME OPEN CLOSE COMMA ASSIGN NCBIEAA SNCBIEAA ID
%token<integer> INTEGER
%token<s> TEXT

%type<gCode> code
%type<gCode> codes
%type<s> optional_name
%start input

%{
void yyerror(GCState state,const char* message)
{
fprintf(stderr,"ERROR:%s\n",message);
}
#define handle_to_the_scanner (state->handle_to_the_scanner)

%}

%%

input: GENETIC_CODE_TABLE ASSIGN OPEN codes CLOSE
{
GeneticCode gc=$4;
fputs("Genetic-code-table ::= {",stdout);
while(gc!=NULL)
{
printf("{name \"%s\",", gc->name1);
if(gc->name2!=NULL) printf("name \"%s\",", gc->name2);
printf("id %d,ncbieaa \"%s\",sncbieaa \"%s\"}",
gc->id, gc->ncbieaa, gc->sncbieaa
);
if(gc->next!=NULL) fputc(',',stdout);
gc=gc->next;
}
fputc('}',stdout);
/** free memory here.... */
};

codes: code
{
$$=$1;
}
| codes COMMA code
{
$$=$3;
$$->next=$1;
}
;

code: OPEN
NAME TEXT COMMA
optional_name
ID INTEGER COMMA
NCBIEAA TEXT COMMA
SNCBIEAA TEXT
CLOSE
{
$$ = malloc(sizeof(GeneticCodeStruct));
if($$==NULL) exit(EXIT_FAILURE);
$$->name1=$3;
$$->name2=$5;
$$->id=$7;
$$->ncbieaa=$10;
$$->sncbieaa=$13;
}
;
optional_name:/* nothing */
{
$$=NULL;
}
| NAME TEXT COMMA
{
$$=$2;
}
%%

extern void gcParse(FILE* in);

int main(int argc,char** argv)
{
gcParse(stdin);
return 0;
}

and the file gc.h

#ifndef GENETIC_CODE_H
#define GENETIC_CODE_H

typedef struct geneticCode
{
char* name1;
char* name2;
int id;
char* ncbieaa;
char* sncbieaa;
struct geneticCode *next;
}GeneticCodeStruct,*GeneticCode;

typedef struct gcState
{
/** reference to the flex lexer */
void* handle_to_the_scanner;
}GCStateStruct,*GCState;

#endif

Compiling...

flex gc.l
bison gc.y
gcc gc.tab.c lex.gc_.c

Testing:

cat gc.ptr |./a.out | ./a.out |./a.out |./a.out

Genetic-code-table ::= {{name "Standard",name "SGC0",id 1,ncbieaa "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",sncbieaa "---M---------------M---------------M----------------------------"},
{name "Vertebrate Mitochondrial",name "SGC1",id 2,ncbieaa "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSS**VVVVAAAADDEEGGGG",sncbieaa "--------------------------------MMMM---------------M------------"},
(...)
,{name "Thraustochytrium Mitochondrial",id 23,ncbieaa "FF*LSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",sncbieaa "--------------------------------M--M---------------M------------"}}

That's it.
Pierre

Parsing a genetic code using flex and bison. Part 1/2

In this post I'll show how to write a lexer and a parser using Flex and Bison for parsing the genetic codes at NCBI. I'll write a non-reentrant parser that is to say that the generated code contains some global variables and cannot be safely executed concurrently. The input file is an ASN1 file that looks like this:

--*************************************************************************
-- This is the NCBI genetic code table
-- (...)
--**************************************************************************
Genetic-code-table ::= {
{
name "Standard" ,
name "SGC0" ,
id 1 ,
ncbieaa "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
sncbieaa "---M---------------M---------------M----------------------------"
-- Base1 TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG
-- Base2 TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG
-- Base3 TCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG
},
{
(...)
} ,
{
name "Thraustochytrium Mitochondrial" ,
id 23 ,
ncbieaa "FF*LSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
sncbieaa "--------------------------------M--M---------------M------------"
-- Base1 TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG
-- Base2 TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG
-- Base3 TCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG
}
}

Each genetic code will be stored as a linked list of C structures called GeneticCode:
file gc.h:

.The LEXER generated by flex converts the input into a sequence of tokens, and the parser generated by bison converts a grammar description into a C program.

The Parser

The grammar of Bison says that the lexical context might carry a string (the code ID) , a string (ncbieaa,ncbieaa...) or a GeneticCode:

%union {
char* s;
int integer;
GeneticCode gCode;
}
(...)
%token<integer> INTEGER
%token<s> TEXT

A genetic code is a series of tokens consisting in one or two names, the id, the ncbiaa and the sncbiaa strings. Every time this context is found, a new GeneticCode is allocated and its fields are filled:

%type<gCode> code
(...)
code: OPEN
NAME TEXT COMMA
optional_name
ID INTEGER COMMA
NCBIEAA TEXT COMMA
SNCBIEAA TEXT
CLOSE
{
$$ = malloc(sizeof(GeneticCodeStruct));
if($$==NULL) exit(EXIT_FAILURE);
$$->name1=$3;
$$->name2=$5;
$$->id=$7;
$$->ncbieaa=$10;
$$->sncbieaa=$13;
}
;

An optional name is 'nothing' or the following tokens: 'NAME TEXT COMMA'

optional_name:/* nothing */
{
$$=NULL;
}
| NAME TEXT COMMA
{
$$=$2;
}

A series of genetic codes is one genetic code, or a series of genetic codes followed by one genetic code:

codes: code
{
$$=$1;
}
| codes COMMA code
{
$$=$3;
$$->next=$1;
}
;

At last, the input constists in the tokens 'GENETIC_CODE_TABLE ASSIGN OPEN' and 'CLOSE' , surrounding a series of genetic codes. Each time, an input has been parsed, the linked list is scanned so each genetic code is printed to stdout.

input: GENETIC_CODE_TABLE ASSIGN OPEN codes CLOSE
{
GeneticCode gc=$4;
fputs("Genetic-code-table ::= {",stdout);
while(gc!=NULL)
{
printf("{name \"%s\",", gc->name1);
if(gc->name2!=NULL) printf("name \"%s\",", gc->name2);
printf("id %d,ncbieaa \"%s\",sncbieaa \"%s\"}",
gc->id, gc->ncbieaa, gc->sncbieaa
);
if(gc->next!=NULL) fputc(',',stdout);
gc=gc->next;
}
fputc('}',stdout);
/** free memory here.... */
};

The Lexer

The lexers breaks the input into tokens. In this lexer, comments are ignored:

\-\-.*\n ;/* ignore comments */

... keywords are returned

Genetic\-code\-table return GENETIC_CODE_TABLE;
name return NAME;
id return ID;
ncbieaa return NCBIEAA;
sncbieaa return SNCBIEAA;
\:\:= return ASSIGN;
, return COMMA;
\{ return OPEN;
\} return CLOSE;

... and the tokens carrying a context (integers, strings...) are decoded:

[0-9]+ {gc_lval.integer=atoi(yytext) ;return INTEGER;}
\"[^\"]+\" {
//copy the string without the quotes
gc_lval.s= malloc(yyleng-1);
if(gc_lval.s==NULL) exit(EXIT_FAILURE);
strncpy(gc_lval.s,&yytext[1],yyleng-2);
return TEXT;
}

All in one here is the flex file: gc.l

%option prefix="gc_"
%option noyywrap

%{
#include <stdio.h>
#include <stdlib.h>
#include "gc.h"
#include "gc.tab.h" //generated by bison
%}

%%
\-\-.*\n ;/* ignore comments */

Genetic\-code\-table return GENETIC_CODE_TABLE;
name return NAME;
id return ID;
ncbieaa return NCBIEAA;
sncbieaa return SNCBIEAA;
[0-9]+ {gc_lval.integer=atoi(yytext) ;return INTEGER;}
\:\:= return ASSIGN;
, return COMMA;
\{ return OPEN;
\} return CLOSE;
\"[^\"]+\" {
gc_lval.s= malloc(yyleng-1);
if(gc_lval.s==NULL) exit(EXIT_FAILURE);
strncpy(gc_lval.s,&yytext[1],yyleng-2);
return TEXT;
}
[\n\t ] ;//ignore blanks
. return yytext[0];

%%

... the bison file gc.y:

%{
#include <stdio.h>
#include <stdlib.h>
#include "gc.h"
%}

%union {
char* s;
int integer;
GeneticCode gCode;
}

%name-prefix="gc_"
%defines
%error-verbose

%token GENETIC_CODE_TABLE NAME OPEN CLOSE COMMA ASSIGN NCBIEAA SNCBIEAA ID
%token<integer> INTEGER
%token<s> TEXT

%type<gCode> code
%type<gCode> codes
%type<s> optional_name
%start input

%{
void yyerror(const char* message)
{
fprintf(stderr,"ERROR:%s\n",message);
}

%}

%%

input: GENETIC_CODE_TABLE ASSIGN OPEN codes CLOSE
{
GeneticCode gc=$4;
fputs("Genetic-code-table ::= {",stdout);
while(gc!=NULL)
{
printf("{name \"%s\",", gc->name1);
if(gc->name2!=NULL) printf("name \"%s\",", gc->name2);
printf("id %d,ncbieaa \"%s\",sncbieaa \"%s\"}",
gc->id, gc->ncbieaa, gc->sncbieaa
);
if(gc->next!=NULL) fputc(',',stdout);
gc=gc->next;
}
fputc('}',stdout);
/** free memory here.... */
};

codes: code
{
$$=$1;
}
| codes COMMA code
{
$$=$3;
$$->next=$1;
}
;

code: OPEN
NAME TEXT COMMA
optional_name
ID INTEGER COMMA
NCBIEAA TEXT COMMA
SNCBIEAA TEXT
CLOSE
{
$$ = malloc(sizeof(GeneticCodeStruct));
if($$==NULL) exit(EXIT_FAILURE);
$$->name1=$3;
$$->name2=$5;
$$->id=$7;
$$->ncbieaa=$10;
$$->sncbieaa=$13;
}
;
optional_name:/* nothing */
{
$$=NULL;
}
| NAME TEXT COMMA
{
$$=$2;
}
%%

int main(int argc,char** argv)
{
yyparse();
return 0;
}

Compiling:

flex gc.l
bison gc.y
gcc -o a.out gc.tab.c lex.gc_.c

Testing:

That's it.
Pierre

In the next post I'll write a reentrant parser using flex and bison.

07 December 2009

Playing with SOAP. Implementing a WebService for the LocusTree Server

In a previous post I've described the LocusTree server and showed how the wsimport command can be used to generate the java code that will query a Web-Service. Today, I've implemented a few web services in the LocusTree server but I wrote the entire code generating the SOAP messages rather than using the Java API for Web Services (JAXWS-API) because 1) I wanted to learn about the SOAP internals 2) I wanted to return a big volume of data to the client by writing a stream of data rather than building a xml response and then echoing the xml tree (DOM).
Ok. In this example I'm going to describe a WebService returning a list of chromosomes for a given organism-id:

The WSDL file.

The signature of our function is something like: getChromosomesByOrganismId(int orgId). In the WSDL file (the file describing our web services), the operation is called getChromosomes . The input for this function will be a tns:getChromosomes and the object returned by this function is a tns:GetChromosomesResponse. The prefix 'tns' is a reference to a xml schema that is will to defined later.

We now define those two messages (input and output parameters) for this web service. The input value (the organism-id) is defined in an external xml schema as an element named 'tns:getChromosomes'. The ouput value (a list of chromosomes) is defined in an external xml schema as an element named 'tns:Chromosomes'.

But where can we find this external schema ? It is referenced in the WSDL file under the <types> element. The following 'types' says that the schema describing our objects is available at 'http://localhost:8080//locustree/static/ws/schema.xsd'

The Http protocol will be used to send and receive the SOAP messages, so we have to bind our method 'getChromosomesByOrganismsId' to this protocol.

<soap:binding transport="http://schemas.xmlsoap.org/soap/http"
style="document"/>
<operation name="getChromosomes">
<soap:operation/>
<input>
<soap:body use="literal"/>
</input>
<output>
<soap:body use="literal"/>
</output>
</operation>
</binding>

Finally, we tell the client where the server is located

The XSD Schema

This XML schema describes the structures that will be used and returned by the server.We need to describe what is...

A Chromosome

A Chromosome is a structure holding an ID, a name, a length, an organism-id etc...

<xs:complexType name="Chromosome">
<xs:annotation>
<xs:documentation>A Chromosome</xs:documentation>
</xs:annotation>
<xs:sequence>
<xs:element name="id" type="xs:int" nillable="false" />
<xs:element name="organismId" type="xs:int" nillable="false" />
<xs:element name="name" type="xs:string" nillable="false"/>
<xs:element name="length" type="xs:int" nillable="false"/>
<xs:element name="metadata" type="xs:string"/>
</xs:sequence>
</xs:complexType>

A List of Chromosomes

.. is just a sequence of Chromosomes

<xs:complexType name="Chromosomes">
<xs:annotation>
<xs:documentation>Set of Chromosomes</xs:documentation>
</xs:annotation>
<xs:sequence>
<xs:element ref="tns:Chromosome" maxOccurs="unbounded"/>
</xs:sequence>
</xs:complexType>

GetChromosome

This is the structure that is used as a parameter for our web service. It just holds an organism-id.

<xs:complexType name="getChromosomes">
<xs:annotation>
<xs:documentation>return the chromosomes for a given organism </xs:documentation>
</xs:annotation>
<xs:sequence>
<xs:element name="organismId" type="xs:int" nillable="false">
<xs:annotation>
<xs:documentation>The Organism Id</xs:documentation>
</xs:annotation>
</xs:element>
</xs:sequence>
</xs:complexType>

All in one, here is the WSDL file:

<definitions
 xmlns="http://schemas.xmlsoap.org/wsdl/"
 xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:soap="http://schemas.xmlsoap.org/wsdl/soap/"
 xmlns:tns="http://webservices.cephb.fr/locustree/"
 targetNamespace="http://webservices.cephb.fr/locustree/"
 name="LocusTreeWebServices">
  <types>
    <xsd:schema>
      <xsd:import namespace="http://webservices.cephb.fr/locustree/" schemaLocation="http://localhost:8080//locustree/static/ws/schema.xsd"/>
    </xsd:schema>
  </types>
  <message name="getChromosomes">
    <part name="parameters" element="tns:getChromosomes"/>
  </message>
  <message name="GetChromosomesResponse">
    <part name="parameters" element="tns:Chromosomes"/>
  </message>
  <portType name="LocusTree">
    <operation name="getChromosomes">
      <input message="tns:getChromosomes"/>
      <output message="tns:GetChromosomesResponse"/>
    </operation>
  </portType>
  <binding name="LocusTreePortBinding" type="tns:LocusTree">
    <soap:binding transport="http://schemas.xmlsoap.org/soap/http" style="document"/>
    <operation name="getChromosomes">
      <soap:operation/>
      <input>
        <soap:body use="literal"/>
      </input>
      <output>
        <soap:body use="literal"/>
      </output>
    </operation>
  </binding>
  <service name="LocusTreeService">
    <port name="LocusTreeSePort" binding="tns:LocusTreePortBinding">
      <soap:address location="http://localhost:8080//locustree/locustree/soap"/>
    </port>
  </service>
</definitions>

...and the XSD/Schema file:

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
        xmlns:tns="http://webservices.cephb.fr/locustree/"
        targetNamespace="http://webservices.cephb.fr/locustree/"
        elementFormDefault="qualified">

  <xs:annotation>
    <xs:documentation>XML schema for LocusTreeWebServices</xs:documentation>
  </xs:annotation>
  
  <xs:element name="Chromosomes" type="tns:Chromosomes"/>
  <xs:element name="Chromosome" type="tns:Chromosome"/>
  <xs:element name="getChromosomes" type="tns:getChromosomes"/>

        <xs:complexType name="getChromosomes">
         <xs:annotation>
      <xs:documentation>return the chromosomes for a given organism </xs:documentation>
    </xs:annotation>
                <xs:sequence>
                          <xs:element name="organismId" type="xs:int" nillable="false">
                                <xs:annotation>
                              <xs:documentation>The Organism Id</xs:documentation>
                            </xs:annotation>
                          </xs:element>
                </xs:sequence>
        </xs:complexType>

  <xs:complexType name="Chromosomes">
    <xs:annotation>
      <xs:documentation>Set of Organisms</xs:documentation>
    </xs:annotation>
      <xs:sequence>
        <xs:element ref="tns:Chromosome" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>

        <xs:complexType name="Chromosome">
            <xs:annotation>
              <xs:documentation>A Chromosome</xs:documentation>
            </xs:annotation>
              <xs:sequence>
                <xs:element name="id" type="xs:int" nillable="false"/>
                <xs:element name="organismId" type="xs:int" nillable="false"/>
                <xs:element name="name" type="xs:string" nillable="false"/>
                <xs:element name="length" type="xs:int" nillable="false"/>
                <xs:element name="metadata" type="xs:string"/>
              </xs:sequence>
        </xs:complexType>


</xs:schema>

Generating the client

The stubs on the client side are generated using the ${JAVA_HOME}/bin/wsimport command:

> wsimport -keep http://localhost:8080/locustree/locustree/soap?wsdl
parsing WSDL...
generating code...
compiling code...
> find fr
fr/cephb/webservices/locustree/Chromosomes.java
fr/cephb/webservices/locustree/ObjectFactory.java
fr/cephb/webservices/locustree/LocusTreeService.java
fr/cephb/webservices/locustree/LocusTree.java
fr/cephb/webservices/locustree/GetChromosomes.java
fr/cephb/webservices/locustree/Chromosome.java
(...)
> more fr/cephb/webservices/locustree/LocusTree.java
package fr.cephb.webservices.locustree;
(...)
public interface LocusTree
{
(...)
public List<Chromosome> getChromosomes(int organismId);
}

Ok, the function was successfully generated. Let's test it with a tiny java program:

file Test.java

import fr.cephb.webservices.locustree.*;

public class Test
{
public static void main(String args[])
{
LocusTreeService service=new LocusTreeService();
LocusTree locustree=service.getLocusTreeSePort();
final int organismId=36;
for(Chromosome chrom:locustree.getChromosomes(organismId))
{
System.out.println(
chrom.getId()+"\t"+
chrom.getName()+"\t"+
chrom.getOrganismId()+"\t"+
chrom.getLength()
);
}

}
}

Compiling and running:

javac -cp . Test.java
java -cp . Test
1 chr1 36 247249719
2 chr2 36 242951149
3 chr3 36 199501827
4 chr4 36 191273063
5 chr5 36 180857866
6 chr6 36 170899992
7 chr7 36 158821424
8 chr8 36 146274826
9 chr9 36 140273252
10 chr10 36 135374737
11 chr11 36 134452384
12 chr12 36 132349534
13 chr13 36 114142980
14 chr14 36 106368585
15 chr15 36 100338915
(...)

SOAP internals

Here is the XML/SOAP query for getChromosomes sent to the sever via a POST query.

<S:Envelope xmlns:S="http://schemas.xmlsoap.org/soap/envelope/">
        <S:Body>
                <getChromosomes xmlns='http://webservices.cephb.fr/locustree/'>
                        <organismId>36</organismId>
                </getChromosomes>
        </S:Body>
</S:Envelope>

This can be checked using curl:

curl \
-X POST\
-H "Content-Type: text/xml" \
-d '<?xml version="1.0" ?>;<S:Envelope xmlns:S="http://schemas.xmlsoap.org/soap/envelope/"><S:Body><getChromosomes xmlns="http://webservices.cephb.fr/locustree/"><organismId>36</organismId></getChromosomes></S:Body></S:Envelope>' \
'http://localhost:8080/locustree/locustree/soap'

And here is the (my) response from the server:

<Envelope xmlns="http://schemas.xmlsoap.org/soap/envelope/" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/1999/XMLSchema-instance" >
  <Body>
    <ceph:Chromosomes>
      <ceph:Chromosome>
        <ceph:id>1</ceph:id>
        <ceph:organismId>36</ceph:organismId>
        <ceph:name>chr1</ceph:name>
        <ceph:length>247249719</ceph:length>
        <ceph:metadata>{'type':'autosomal','size':247249719}</ceph:metadata>
      </ceph:Chromosome>
      <ceph:Chromosome>
        <ceph:id>2</ceph:id>
        <ceph:organismId>36</ceph:organismId>
        <ceph:name>chr2</ceph:name>
        <ceph:length>242951149</ceph:length>
        <ceph:metadata>{'type':'autosomal','size':242951149}</ceph:metadata>
      </ceph:Chromosome>
      <ceph:Chromosome>
        <ceph:id>3</ceph:id>
        <ceph:organismId>36</ceph:organismId>
        <ceph:name>chr3</ceph:name>
        <ceph:length>199501827</ceph:length>
        <ceph:metadata>{'type':'autosomal','size':199501827}</ceph:metadata>
      </ceph:Chromosome>
      <ceph:Chromosome>
        <ceph:id>4</ceph:id>
        <ceph:organismId>36</ceph:organismId>
        <ceph:name>chr4</ceph:name>
        <ceph:length>191273063</ceph:length>
        <ceph:metadata>{'type':'autosomal','size':191273063}</ceph:metadata>
      </ceph:Chromosome>
   (...)
    </ceph:Chromosomes>
  </Body>
</Envelope>

On the server/servlet side I've decoded the SOAP query using javax.xml.soap.MessageFactory;. It looks like that

(...)
MimeHeaders headers=new MimeHeaders();
Enumeration<?> e=req.getHeaderNames();
(... copy the http headers to 'headers' ...);
SOAPMessage message=getMessageFactory().createMessage(headers,req.getInputStream());
SOAPBody body=message.getSOAPBody();
Iterator<?> iter=body.getChildElements();
while(iter.hasNext())
{
SOAPElement child =SOAPElement.class.cast(iter.next());
Name name= child.getElementName();
if(!name.getURI().equals(child.getNamespaceURI())) continue;
if(name.getLocalName().equals("getChromosomes"))
{
processGetChromosomes(w,message,child,req,res);
return;
}
}

And I'm streaming the response using the XML Streaming API (StaX).

(...)
w.writeStartElement(pfx, "Chromosomes", getTargetNamespace());
w.writeAttribute(XMLConstants.XMLNS_ATTRIBUTE,XMLConstants.XML_NS_URI,pfx,getTargetNamespace());
for(ChromInfo ci:model.getChromsomesByOrganismId(getTransaction(), organismId))
{
w.writeStartElement(pfx, "Chromosome", getTargetNamespace());
w.writeStartElement(pfx, "id", getTargetNamespace());
w.writeCharacters(String.valueOf(ci.getId()));
w.writeEndElement();
w.writeStartElement(pfx, "organismId", getTargetNamespace());
w.writeCharacters(String.valueOf(ci.getOrganismId()));
w.writeEndElement();
(...)
w.writeEndElement();
}
w.writeEndElement();(...)

And as a final note, I'll cite this tweet I received today from Paul Joseph Davis :-)

@yokofakun Everytime someone uses SOAP, an angel cries.

Mon Dec 07 16:50:41

That's it !
Pierre

30 November 2009

Playing with Erlang (II)

transcripting a DNA to a RNA sequence with an anonymous function

lists:map(fun($T)->$U;(OTHER)->OTHER end,"AATAGCTGATCGACAATGTTAGCTAGGC").
>"AAUAGCUGAUCGACAAUGUUAGCUAGGC"

Testing if a symbol is an acid nucleic

> ACIDNUCLEICS="ATGCatgc".
"ATGCatgc"
> lists:member($A,ACIDNUCLEICS).
true
> lists:member($P,ACIDNUCLEICS).
false

Filtering a sequence with lists:filter

> IsAcidNucleic=fun(BASE)->lists:member(BASE,ACIDNUCLEICS) end.
#Fun<erl_eval.6.13229925>
> lists:filter( IsAcidNucleic, "1 ACGT ACGT ACGT ACGT ACGT ACGT 36").
"ACGTACGTACGTACGTACGTACGT"

Generating an array of numbers

> lists:seq(1,10).
[1,2,3,4,5,6,7,8,9,10]
> lists:seq(65,65+24).
"ABCDEFGHIJKLMNOPQRSTUVWXY"

The following notation says: add 32 to each item 'X' of the sequence 65 to 89.

> [X+32 || X<-lists:seq(65,65+24)].
"abcdefghijklmnopqrstuvwxy"

Create a list of structures {number,'X'} for each item 'X' of the sequence 1 to 5.

> [{number,X}|| X<-lists:seq(1,5)].
[{number,1},{number,2},{number,3},{number,4},{number,5}]

Create a list of structures {number,'X'} for each item 'X' of the sequence 1 to 100 where the modulus of X/15 is '0'.

> [{number,X}|| X<-lists:seq(1,100), X rem 15 =:= 0].
[{number,15},
{number,30},
{number,45},
{number,60},
{number,75},
{number,90}]

Create a list of all the pairs(x,y) for x and y between 1 and 4:

> [{pair,X,Y}|| X<-lists:seq(1,4),Y<-lists:seq(1,4)].
[{pair,1,1},
{pair,1,2},
{pair,1,3},
{pair,1,4},
{pair,2,1},
{pair,2,2},
{pair,2,3},
{pair,2,4},
{pair,3,1},
{pair,3,2},
{pair,3,3},
{pair,3,4},
{pair,4,1},
{pair,4,2},
{pair,4,3},
{pair,4,4}]

Create a list of all the pairs(x,y) for x and y between 1 and 4 having X==Y:

> [{pair,X,Y}|| X<-lists:seq(1,4),Y<-lists:seq(1,4),X=:=Y].
[{pair,1,1},{pair,2,2},{pair,3,3},{pair,4,4}]

A few is_XXXX functions

> is_integer("AAA").
false
> is_integer($A).
true
> is_list(1).
false
> is_list("AZAZAZ").
true
> is_atom(1).
false
> is_atom(hello).
true
> is_tuple(1).
false
> is_tuple({person,{name,"Pierre"}}).
true

A few functions:

% get the header of a list
> hd([100,99,98,97]).
100
% get the 3rd element of a list
> lists:nth(3,["EcoRI","HindIII","BamHI","PstI"]).
"BamHI"
>size({test,{name,"Hello"},{age,3}}).
3
> element(2,{test,{name,"Hello"},{age,3}}).
{name,"Hello"}
> abs(-99).
99

Translating a DNA to a protein.
The file bio.erl:

-module(bio).
-export([translate/1]).

%The standard genetic code table
geneticCode()->"FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG".

%convert a base to an index in the table
base2index($T)-> 0;
base2index($C)-> 1;
base2index($A)-> 2;
base2index($G)-> 3.

%convert a codon to an index, lookup the table and return the amino acid
codon2AminoAcid(A1,A2,A3) -> lists:nth(
(base2index(A1)*16 + base2index(A2)*4 + base2index(A3))+1,
geneticCode())
.

%return an empty array if the argument is an empty sequence or a sequence with less than 3 bases
translate([])->[];
translate([_])->[];
translate([_,_])->[];
%translate the 3 first bases and append the remaining translation.
translate([A1,A2,A3|REMAIN])->[codon2AminoAcid(A1,A2,A3)|translate(REMAIN)].

Invoking bio:translate:

c(bio).

bio:translate("ATGGAGAGGCAGAAACGGAAGGCGGACATCGAGAAG").
"MERQKRKADIEK"

Records

A record file defines the structure of an object. Example: the following file snp.hrl defines the 'class' snp with its default values:

-record(snp,{
name="rs",
chrom="?",
chromStart=0,
chromEnd=0,
avHet= -1
}).

We can now use this class in an erlang shell:

%load the structure of a SNP
rr("snp.hrl").
[snp]
%create an empty instance of a snp
EMPTY_SNP= #snp{}.
#snp{name = "rs",chrom = "?",chromStart = 0,chromEnd = 0,
avHet = -1}
%create and fill an new instance of a snp
RS84=#snp{name="rs84",chrom="chr7",chromStart=25669317,chromEnd=25669318}.
#snp{name = "rs84",chrom = "chr7",chromStart = 25669317,
chromEnd = 25669318,avHet = -1}
%re-use the content of RS84 and fill the value of 'avHet'
RS84_2=RS84#snp{avHet=0.475045}.
#snp{name = "rs84",chrom = "chr7",chromStart = 25669317,
chromEnd = 25669318,avHet = 0.475045}
%extract the name and the avHet for RS84_2
#snp{name=NAME_RS84,avHet=AVHET_RS84}=RS84_2.
NAME_RS84.
"rs84"
AVHET_RS84.
0.475045

That's it
Pierre

26 November 2009

Playing with Erlang (I)

I'm currently reading Joe Armstrong's "Programming Erlang". Here are a couple of notes about ERLANG.

Starting and stopping the Erlang shell

:~> erl
Erlang R13B01 (erts-5.7.2) [source] [smp:2:2] [rq:2] [async-threads:0] [hipe] [kernel-poll:false]

Eshell V5.7.2 (abort with ^G)
1> halt().
:~>

Simple Math

Input:

2*(3+4).
PI=3.14159.
R=2.
SURFACE=PI*R*R.
R=3.

Output:

1> 14
2> 3.14159
3> 2
4> 12.56636
##Variables in erlang are immutable R3 cannot be changed
5> ** exception error: no match of right hand side value 3

Two structure defining a SNP and a Gene are created with a Tuple:

RS94={snp,
{chrom,chr6},
{chromStart,98339675},
{chromEnd,98339676},
{name,"rs94"},
{strand,plus},
{func,unknown},
{avHet,0}
}.
RS47={snp,
{chrom,chr7},
{chromStart,11547645},
{chromEnd,11547646},
{name,"rs47"},
{strand,minus},
{func,missence},
{avHet,0.246}
}.
NP_056019={gene,
{chrom,chr7},
{txStart,11380695},
{txEnd,11838349},
{cdsStart,11381945},
{csdEnd,11838097},
{name,"NP_056019"},
{strand,minus},
{exonCount,27}
}.

Values are extracted using '_' as a placeholder for the unwanted variables.

{_,{_,_},{_,_},{_,_},{_,NAME_OF_RS94},{_,_},{_,_},{_,_}}=RS94.
NAME_OF_RS94.
"rs94"

Create a list of SNP:

LIST_OF_SNP1=[RS94,RS47].

Add P_056019 to this list and create a list of genomic objects:

LIST2=[NP_056019|LIST_OF_SNP1].

Extract the first and second element of LIST2, put the remaining list in LIST3:

[ITEM1,ITEM2|LIST3]=LIST2.
ITEM1.
{gene,{chrom,chr7},
{txStart,11380695},
{txEnd,11838349},
{cdsStart,11381945},
{csdEnd,11838097},
{name,"NP_056019"},
{strand,minus},
{exonCount,27}}
ITEM2.
{snp,{chrom,chr6},
{chromStart,98339675},
{chromEnd,98339676},
{name,"rs94"},
{strand,plus},
{func,unknown},
{avHet,0}}
LIST3.
[{snp,{chrom,chr7},
{chromStart,11547645},
{chromEnd,11547646},
{name,"rs47"},
{strand,minus},
{func,missence},
{avHet,0.246}}]

A String is just an Array of integer:

ALPHABET=[65,66,67,68,69,70].
"ABCDEF"

The dollar( $ ) notation is used to get the 'int' value of a 'char'.

HELLO=[$H,$e,$l,$l,$o].
"Hello"

Methods & Functions

A file named "bio.erl" is created. This file is a erlang module that contains a kind of polymorphic function distance returning the length of a given object (its number of arguments is '/1'). If this object is identified as a atom=snp the value "chromEnd-chromStart" is returned. Else, if atom=gene the value "txEnd-txStart" is returned.

-module(bio).
-export([distance/1]).
distance({snp,
{chrom,_},
{chromStart,START},
{chromEnd,END},
{name,_},
{strand,_},
{func,_},
{avHet,_}}
)-> END - START;
distance({gene,
{chrom,_},
{txStart,START},
{txEnd,END},
{cdsStart,_},
{csdEnd,_},
{name,_},
{strand,_},
{exonCount,_}
})-> END - START.

We now use this module:

c(bio).
RS94={snp,
{chrom,chr6},
{chromStart,98339675},
{chromEnd,98339676},
{name,"rs94"},
{strand,plus},
{func,unknown},
{avHet,0}
}.

RS47={snp,
{chrom,chr7},
{chromStart,11547645},
{chromEnd,11547646},
{name,"rs47"},
{strand,minus},
{func,missence},
{avHet,0.246}
}.

NP_056019={gene,
{chrom,chr7},
{txStart,11380695},
{txEnd,11838349},
{cdsStart,11381945},
{csdEnd,11838097},
{name,"NP_056019"},
{strand,minus},
{exonCount,27}
}.

bio:distance(RS94).
1
bio:distance(RS47).
1
bio:distance(NP_056019).
457654

We now want to calculate the GC percent of a DNA: the bio.erl file is modified as follow:

-module(bio).
-export([gcPercent/1]).
-export([distance/1]).
(...)

gc($A) -> 0.0;
gc($T) -> 0.0;
gc($C) -> 1.0;
gc($G) -> 1.0;
gc([])->0;
gc([BASE|REMAIN])->gc(BASE)+gc(REMAIN).

gcPercent(ADN)->100.0*(gc(ADN)/erlang:length(ADN)).

Here the method gc returns '1' or '0' if the argument is a base; returns 0 if the array is empty, or return the sum of the gc(first character of the string) plus the gc(remaining string). The method gcPercent divide the sum of gc by the length of the string and multiply it by 100.

c(bio).
bio:gcPercent("GCATG").
60.0

That's it.
Pierre

A Java implementation of Jan Aerts' LocusTree

This post is a description of my implementation of Jan Aerts' LocusTree algorithm (I want to thank Jan, our discussion and his comments were as great source of inspiration) based on BerkeleyDB-JE, a Key/Value datastore. This implementation has been used to build a genome browser displaying its data with the SVG format. In brief: splicing each chromosome using a dichotomic approach allows to quickly find all the features in a given genomic region for a given resolution. A count of the total number of objects in the descent of each child node is used to produce a histogram of the number of objects smaller than the given resolution.

JSON/Metadata

All the information is stored in BerkeleyDB and I've used JSON to add some metadata about each object. The JSON is serialized, gzipped and stored in BerkeleyDB.

Organism

Each organism is defined by an ID and a Name. The Key of the BerkeleyDB is the organism.id.

The organisms are loaded in the database using a simple XML file:

<organisms>
        <organism id="36">
                <name>hg18</name>
                <description>Human Genome Build v.36</description>
                <metadata>{"taxon-id":9606}</metadata>
        </organism>
</organisms>

Chromosome

Each chromosome is defined by an ID, a Name, its length and its organism-ID. The Key in berkeleyDB is the chromosome ID.

The chromosomes are loaded in the database using a simple XML file:

<chromosomes organism-id="36">
  <chromosome id="1">
    <name>chr1</name>
    <metadata>{"size":247249719,"type":"autosomal"}</metadata>
  </chromosome>
  <chromosome id="10">
    <name>chr10</name>
    <metadata>{"size":135374737,"type":"autosomal"}</metadata>
  </chromosome>
(...)
</chromosomes>

Track

Each track is defined by an ID and a Name. The Key in berkeleyDB is the track ID.

The descriptions of the tracks are loaded in the database using a simple XML file:

<tracks>
        <track id="1">
                <name>cytobands</name>
                <description>UCSC cytobands</description>
        </track>
        <track id="2">
                <name>knownGene</name>
                <description>UCSC knownGene</description>
        </track>
        <track id="3">
                <name>snp130</name>
                <description>dbSNP v.130</description>
        </track>
        <track id="4">
                <name>snp130Coding</name>
                <description>UCSC coding Snp</description>
        </track>
        <track id="5">
                <name>all_mrna</name>
                <description>UCSC All mRNA</description>
        </track>
</tracks>

Nodes

Each LocusTree Node (LTNode) is linked to a Chromosome and a Track using a database named 'TrackChrom'. Here the Key of the BerkeleyDB is a composite key (chromosome/track).

The structure of a LTNode is described below. Each node contains a link to its parent, the links to its children as well as a set of genomic entities whose length is greater or equals that 'this.length'.

To load the content of each LocusTree, I've defined a simple java interface called LTStreamLoader which looks like this:

public interface LTLoader
{
public MappedObject getMappedObject();
public String getChromosome();
public Set<String> getKeywords();
}
public interface LTStreamLoader
extends LTLoader
{
public void open(String uri) throws IOException;
public void close() throws IOException;
public boolean next() throws IOException;
}

An instance of this interface is used to load the content of a tab delimited file as defined in the following XML file:

<loaders>
        <load organism-id="36" track-id="5" class-loader="fr.cephb.locustree.loaders.UCSCAllMrnaLoader" limit="10000">
                http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/all_mrna.txt.gz
        </load>
        <load organism-id="36" track-id="4" class-loader="fr.cephb.locustree.loaders.UCSCSnpCodingLoader" limit="10000">
                http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/snp130CodingDbSnp.txt.gz
        </load>
        <load organism-id="36" track-id="1" class-loader="fr.cephb.locustree.loaders.UCSCCytoBandsLoader">
                http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/cytoBand.txt.gz
        </load>
        <load organism-id="36" track-id="2" class-loader="fr.cephb.locustree.loaders.UCSCKnownGeneLoader">
                http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/knownGene.txt.gz
        </load>
        <load organism-id="36" track-id="3" class-loader="fr.cephb.locustree.loaders.UCSCSnpLoader" limit="10000">
                http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/snp130.txt.gz
        </load>
</loaders>

It took about 3H00 to load 'snp130.txt.gz' and the size of the indexed BerkeleyDB/LocusTree database was 16Go (ouch!).

Building the Genome Browser

The locus tree database was used to create (yet another) Genome Browser. My current implementation runs smoothly under Apache Tomcat. The SVG vectorial format was used to draw and hyperlink the data. Here is a screenshot of the first version I wrote one week ago. As you can see, the objects that were too small to be drawn, were displayed within a histogram.

Later, I've added some labels.

And my latest version uses the JSON metadata available in each objet to display the spliced structure of the genes:

The browser is fast (sorry, I cannot show it at this time) but I need to play with the config of BerkeleyDB to speed up the insertions and reduce the size of the database.

That's it.
Pierre

NB: The figures of this post were created using SVGToCanvas.

YOKOFAKUN

17 December 2009

Playing with Erlang (3): concurrency

14 December 2009

Parsing a genetic code using flex and bison. Part 2/2

The Lexer

The Parser

All in one

Parsing a genetic code using flex and bison. Part 1/2

The Parser

The Lexer

07 December 2009

Playing with SOAP. Implementing a WebService for the LocusTree Server

The WSDL file.

The XSD Schema

A Chromosome

A List of Chromosomes

GetChromosome

Generating the client

SOAP internals

30 November 2009

Playing with Erlang (II)

Records

26 November 2009

Playing with Erlang (I)

Starting and stopping the Erlang shell

Simple Math

Methods & Functions

A Java implementation of Jan Aerts' LocusTree

JSON/Metadata

Organism

Chromosome

Track

Nodes

Building the Genome Browser

About Me

Feeds

Blog Archive

Web2.0

Labels