25 September 2007

A mysql user defined function to get the reverse complement of a sequence

I was asked today to compare some experimental genotypes to the
theoretical data. Unfortunaly many genotypes were given on the
anti-parallele strand compared to their references. So I wrote a small
href="http://dev.mysql.com/doc/refman/5.0/en/adding-functions.html">mysql
user defined function (UDF)
to implement a new function
called 'revcomp' in mysql used to return the reverse complement of
a DNA sequence. Written in 'C/C++', this kind of function can be used
to embed bioinformatics into mysql. The coding was straighforward as I
already wrote a UDF translating a cDNA to a proteic sequence in a href="http://plindenbaum.blogspot.com/2006/07/mysql-user-defined-function-udf-for.html">previous
post
.

Here is the code.


#include <my_global.h>
#include <m_ctype.h>
#include <mysql.h>

#include <m_string.h>
/* this function is called by mysql to initialize it */
my_bool revcomp_init(UDF_INIT *initid, UDF_ARGS *args, char *message);
/* this function is called by mysql to dispose it */
void revcomp_deinit(UDF_INIT *initid);
/* the main function with get the reverse complement of a dna */
char *revcomp(UDF_INIT *initid, UDF_ARGS *args, char *result,
unsigned long *length, char *is_null, char *error);

/* a trivial function returning the complementary base of an acid nucleic */
static char complement(char b)
{
switch(b)
{
case 'A': return 'T';
case 'T': return 'A';
case 'G': return 'C';
case 'C': return 'G';

case 'a': return 't';
case 't': return 'a';
case 'g': return 'c';
case 'c': return 'g';

case 'w': return 'w';
case 'W': return 'W';

case 's': return 's';
case 'S': return 'S';

case 'y': return 'r';
case 'Y': return 'R';

case 'r': return 'y';
case 'R': return 'Y';

case 'k': return 'm';
case 'K': return 'M';

case 'm': return 'k';
case 'M': return 'K';

case 'b': return 'v';
case 'd': return 'h';
case 'h': return 'd';
case 'v': return 'b';


case 'B': return 'V';
case 'D': return 'H';
case 'H': return 'D';
case 'V': return 'B';

case 'N': return 'N';
case 'n': return 'n';

}
return '?';
}


/** this function is called by mysql to initialize our revcomp function */
my_bool revcomp_init(
UDF_INIT *initid,
UDF_ARGS *args,
char *message
)
{
/* check we have one STRING argument */
if (!(args->arg_count == 1 && args->arg_type[0] == STRING_RESULT ))
{
strncpy(message,"Bad parameter, expected a DNA",MYSQL_ERRMSG_SIZE);
return 1;
}
initid->maybe_null=1;
/* initid->ptr will be used to store the transformed sequence */
initid->ptr= (char*)malloc(0);
/* out of memory ? */
if(initid->ptr==NULL)
{
strncpy(message,"Out Of Memory",MYSQL_ERRMSG_SIZE);
return 1;
}
return 0;
}

/** this function is called by mysql to dispose our revcomp function */
void revcomp_deinit(UDF_INIT *initid)
{
/* free the user ptr */
if(initid->ptr!=NULL) free(initid->ptr);
}

/** this is the function called by mysql to reverse-complement a DNA */
char *revcomp(UDF_INIT *initid, UDF_ARGS *args, char *result,
unsigned long *length, char *is_null, char *error)
{
long i;
/* the size of the input */
long size= args->lengths[0];
/* the DNA given as input */
const char *dna=args->args[0];
char *ptr=NULL;

if (dna==NULL) // DNA is a null argument
{
*is_null=1;
return NULL;
}
/* the length of the returned string will be 'size' */
*length=size;


/** try to reallocate our memory to store the new transformed DNA sequence */
ptr= (char*)realloc(initid->ptr,sizeof(char)*(size));//no need (size+1)

/* out of memory ? */
if(ptr==NULL)
{
*is_null=1;
*error=1;
strncpy(error,"Out Of Memory",MYSQL_ERRMSG_SIZE);
return NULL;
}
initid->ptr=ptr;
*is_null=0;
*error=0;

/* build the reverse complement */
for(i=0;i< size;++i)
{
initid->ptr[i] = complement( dna[(size-1)-i] );
}

/* return our pointer */
return initid->ptr;
}



And here is the Makefile for my machine...


/usr/lib/revcomp.so:revcomp.c
gcc -fPIC -shared -I/usr/include/mysql -DDBUG_OFF -O3 -lmysqlclient -o $@ $<




The function was installed on mysql using the following statement:
>mysql create function revcomp returns string SONAME "revcomp.so";


TESTS:


mysql> select revcomp("GAATTC");
+-------------------+
| revcomp("GAATTC") |
+-------------------+
| GAATTC |
+-------------------+
1 row in set (0,00 sec)

mysql> select revcomp(revcomp("AAATTTaaatttGC"));
+------------------------------------+
| revcomp(revcomp("AAATTTaaatttGC")) |
+------------------------------------+
| AAATTTaaatttGC |
+------------------------------------+
1 row in set (0,00 sec)

mysql> select revcomp("SWatgcatgAAATTTaaatttGC");
+------------------------------------+
| revcomp("SWatgcatgAAATTTaaatttGC") |
+------------------------------------+
| GCaaatttAAATTTcatgcatWS |
+------------------------------------+
1 row in set (0,00 sec)



here I create a table of primers and I find the all sub-sequences that could be amplified.


mysql> create table primers(id int primary key auto_increment, primer
varchar(100));
Query OK, 0 rows affected (0,02 sec)

mysql> desc primers;
+--------+--------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+--------+--------------+------+-----+---------+----------------+
| id | int(11) | | PRI | NULL | auto_increment |
| primer | varchar(100) | YES | | NULL | |
+--------+--------------+------+-----+---------+----------------+
2 rows in set (0,00 sec)


mysql> insert into primers(primer) values
("CAAAATGAAACAGGT"),
("TAATCAAATAATGCCTGGATTTCTT"),
("TGAACATCCGTCCTCTCCCCACAAA"),
("TCATAGTCCTGAGGAAAGAGAA"),
("TCAAACAAGGAAAATGGAAAACAAATTCA"),
("TCACTGCTGGATGTGTGGGAAAAACTGCA"),
("CACTGTTGCAGTTTTTCCCACAC"),
("TCAGGACTATGAGCAAAGGAACA"),
("TACCAAGTACCTGCGCTCCAGGTACATTT"),
("AGATACCTGTTTCATTT"),
("CTTTTCAGTGGTTGATGCTCAAGATG"),
("GCCTGACGAAGTTAAAACTGATATTGA")
;

Query OK, 12 rows affected (0,00 sec)
Records: 12 Duplicates: 0 Warnings: 0


mysql> set
@seq:="CACTGCTTCTCACTGTTGCAGTTTTTCCCACACATCCAGCAGTGAAAGTCCACTGTAACTTCAGCATAAT
CTGTTGGCATGTGAATTTGTTTTCCATTTTCCTTGTTTGACTGACTGGCTATGTCTTCACTTTTTTCATTTTTTTTG
ACTCTTGAGCCATAGTTCGTAAAATTGCTCCATATCTTGTATCCCATTCTCCTTCATGTAAGTCCAAACTTCTCTTT
CCTCAGGACTATGAGCAAAGGAACAGTTTCCAACATATTGACATTTTTTCCCAGAAGCAATATGGTTGCACAGATCA
AACTGTAAAGGCATTTGTTTCTTTGTGGGGAGAGGACGGATGTTCATCCACTTCTTACGTTCAATAGACATCACTCT
CATCGCACGCCGGTCTTTGGTCCACGAATGCCTTGCTTTTGCACTACAATATTTTCTGTTTTTGTCTGGTTCAATGA
CTTGACCGTTTCTCAGACACTGGGCGCACACAAACTTTATCTTCATATTAAGAAATCCAGGCATTATTTGATTACCA
AGTACCTGCGCTCCAGGTACATTTGCTTCCAAATTCTGCCAATATCGTTTAGACTCTTGAGCAATAGCATCATGTGA
GATACCTGTTTCATTTTGCCG";
Query OK, 0 rows affected (0,00 sec)

mysql> select
concat("ID.",Forward.id) as "Forward" ,
locate(Forward.primer,@seq) as "Start",
concat("ID.",Revers.id) as "Reverse",
locate(revcomp(Revers.primer),@seq)+length(Revers.primer) as "End",
repeat('*',(locate(revcomp(Revers.primer),@seq)+length(Revers.primer)-locate(Forward.primer,@seq) )/16) as "schema"
from
primers as Forward,
primers as Revers
where
locate(Forward.primer,@seq)>0 and
locate(revcomp(Revers.primer),@seq) > locate(Forward.primer,@seq)
group by 1,3 order by 5,2;

+---------+-------+---------+------+----------------------------------------+
| Forward | Start | Reverse | End | schema |
+---------+-------+---------+------+----------------------------------------+
| ID.10 | 609 | ID.1 | 628 | * |
| ID.7 | 11 | ID.6 | 46 | ** |
| ID.7 | 11 | ID.5 | 111 | ****** |
| ID.9 | 528 | ID.1 | 628 | ****** |
| ID.8 | 227 | ID.3 | 348 | ******* |
| ID.7 | 11 | ID.4 | 239 | ************** |
| ID.8 | 227 | ID.2 | 530 | ****************** |
| ID.7 | 11 | ID.3 | 348 | ********************* |
| ID.8 | 227 | ID.1 | 628 | ************************* |
| ID.7 | 11 | ID.2 | 530 | ******************************** |
| ID.7 | 11 | ID.1 | 628 | ************************************** |
+---------+-------+---------+------+----------------------------------------+
11 rows in set (0,00 sec)


That's it.

Pierre

Idiographica

Via aziesel on del.icio.us, I found the tool I was looking for:


Idiographica: a general-purpose web application to build idiograms on-demand for human, mouse and rat

Kin T and Ono Y, Bioinformatics 2007; doi:10.1093/bioinformatics/btm455



Idiographica a web server which serves as a general purpose idiogram rendering service, and allows users to generate high-quality idiograms with custom annotation according to their own genome-wide mapping/annotation data through an easy-to-use interface. The generated idiograms are suitable not only for visualizing summaries of genome-wide analysis but also for many types of presentation material including web pages, conference posters, oral presentations, etc.Idiographica is freely available at http://www.ncrna.org/idiographica/

24 September 2007

Sketchcast

Sketchcasting is a tool to communicate something online by recording a sketch, optionally with your voice speaking. Any sketch can then be embedded on your blog/ homepage for people to play-back, and you can also point people to your sketchcast channel here.


Here is my very first attempt to draw with this tool (please don't flame :-) ) where I tried to explain the technology GenomeHip used in my company to find the region(s) involved in a genetci disease. The approach relies on isolation of identical-by-descent regions from relative-pairs sharing the same disease.




Added Later: http://www.imaginationcubed.com/LaunchPage is far more powerful but you cannot blog your drawings.

Pierre

23 September 2007

google view:timeline

seen on Timeline and map views: here you can see the results of your query on a timeline or a map. With the timeline and map views, Google’s technology extracts key dates and locations from select search results so you can view the information in a different dimension.

See Charles Darwin's Timeline
Charles Darwin's Map
Bioinformatics conferences



Pierre

11 September 2007

NCBI Resource Locator

Via the public-semweb-lifesci mailing list:


"The NCBI Resource Locator provides stable, uniform addressing for NCBI content, making it easy to link to individual records. Some NCBI resources also provide services (like search) through these URLs."

http://view.ncbi.nlm.nih.gov/



How does it work?
Each URL has the form

http://view.ncbi.nlm.nih.gov/<noun>/<verb>/<expression>

Where:

  • <noun> is an NCBI resource (e.g., pubmed, gene, nucleotide, etc.)

  • <verb> is the action to perform (e.g., search, get,etc.). If <verb> is missing, the default verb "get" is used.

  • <expression> is data used by the action to perform the request


Some examples:


Note: but I guess this kind of REST URL doesn't allow to specify the output format (XML, ASN1, etc...)

06 September 2007

IBM CoScripter: A system for capturing, sharing, and automating tasks on the Web.

Via O'Reilly Radar:

CoScripter is firefox extension created by IBM. It is a system for recording, automating, and sharing processes performed in a web browser such as printing photos online, requesting a vacation hold for postal mail, or checking bank account information. Instructions for processes are recorded and stored in easy-to-read text here on the CoScripter web site, so anyone can make use of them.

02 September 2007

Google Earth Sky and Flight Simulator

Via: Transnet.

The newest version of GoogleEarth contains the new "Google Sky" but it also contains a hidden Fligh Simulator !!!! Press Ctrl-Alt-A under Linux.

Procrastination about social Networks during my vacation

"All the nerds have an account on Facebook, what about you ?"...
Ok I got one...

"All the geeks have an account on Twitter, what about you ?..."
Ok, I got one...

etc...

My favorite social network remains LinkedIn but I'm starting on being bored about all those social networks just because I cannot re-create another network elsewhere by sending again and again some invitations to my friends/contacts: they will definitely hate me :-). I don't remember where I read "I don't need a social network, the internet IS the network" but this could be true if anyone could host is own profile(foaf ? xml ?) in a file somewhere on the internet and a software could use it.

Some unordered ideas about this:

- use a kind of FOAF more structured/simple than RDF because I don't want anyone else to add a RDF statement about me.
- create a format for individuals, groups, jobs
- search by keywords ,dc:subject, location, degrees of separation
- cache the xml files in a cache on my computer (flat files ? javadb ?)
- implement a graphical tool and/or a command line tool
- trusted relations are bidirectional: he knows my mail (SHA1), I know his mail
- link to a picture
- location can be geo:lat/geo:long and/or name of country
- describe relationships (kind of relation, since...) instead of using foaf:knows
- include metadata about how long the file should be cached
- transform into rss to track the modifications.

I did it again

Okay, In a previous post I said that "I would not spam the Nature Network with the data collected from the NCBI" using the batch invitations, but this was before it was possible to add a custom message with the single invitations. The desire to test the method was too strong and I sent more than 2700 personalized invitations ("....as you published an article titled xxxx in 2003 in Bioinformatics...") to join bioformatics group. The number of members jumped to more than 300 persons in 3 weeks.... Corie then asked me kindly but strongly to "not do to this again...". Oups.... :-)

By the way, I also created a group called History Of Science where I asked for a structured source of data about History.

Pierre

No more bees

"If the bee disappears from the surface of the earth, man would have no more than four years to live. No more bees, no more pollination ... no more men!" [A Einstein]

A recent article of the French Newspaper "Le Monde" titled "The bees sick of the man" it is said: ...In the USA, where it is described as the “syndrome of collapse of the colonies”, about 25% of the livestock would have disappeared during the winter 2006-2007. In Europe, France, Belgium, Italy, Germany, Switzerland, Spain, Greece, Poland, the Netherlands were touched since the beginning of the years 2000. The losses can reach, locally, up to 90% of the colonies.

For pessimistics, see also: We're all going to die on Nature Scintilla.

Back from Holidays...

I'm now back from three weeks of holidays: there were a thousand posts in my "google-reader" so I just pressed the infamous button "Mark all as read". Sorry if I missed something really terrific :-)

Pierre