26 March 2010

'R' = dna.translate("AGG") . A custom C function for R, My notebook.


In the following post, I will show how I've implemented a custom C function for R. This C function will translate a DNA to a protein. I'm very new to 'R' so feel free to make any comment about the code.

C code


The data in 'R' are stored in an opaque structure named 'SEXP'. A custom C function receives some SEXP arguments and returns another SEXP. So, the declaration of my function translate_dna is:
SEXP translate_dna(SEXP sequences)
{
(...)
}

We check if the argument 'sequences' has a type='character'.
if(!isString(sequences))
error("argument is not a string[]");
The input consists of one or more DNA, so our return value will be an SEXP array of type 'character' (STRSXP) with a size equals to the number of DNAs. This array is allocated:
SEXP array_of_peptides = allocVector(STRSXP,length(sequences));
We loop over each DNA sequence
for(in i=0;i<length(sequences);++i) ...
and we allocate some memory for the peptide:
const char *dna=CHAR(STRING_ELT(sequences, i));
int dna_length=strlen(dna);
char* peptide =malloc(dna_length/3+1);
The peptide is then filled with its amino acids:
for(j=0;j+2 < dna_length;j+=3)
{
peptide[n++]=_translate(dna[j],dna[j+1],dna[j+2]);
}
peptide[n]=0;
And we put this new string/peptide in the returned value
SET_STRING_ELT(array_of_peptides,i,Rf_mkChar(peptide));
Here, I am not sure how 'R' handles its memory. Should I care about the way R runs its garbage manager ? how should I use the macros PROTECT and UNPROTECT ?

Compilation


The code 'translate.c' is compiled as a dynamic library ' libtranslate.so':

gcc -fPIC -I -g -c -Wall -I ${R_HOME}/include translate.c
gcc -shared -Wl,-soname,libtranslate.so.1 -o libtranslate.so translate.o

R code

on the R side , the previous dynamic library is loaded:
dyn.load(paste("libtranslate", .Platform$dynlib.ext, sep=""))
A function dna.translate is declared: it forces the array to be an array of string and it invokes the C function 'translate_dna'
storage.mode(dna) <- "character"
.Call("translate_dna", dna)
We can now invoke the R function dna.translate:
peptides <- dna.translate(
c( "ATGGAGAGGCAGAAACGGAAGGCGGACATCGAGAAAGGGCTGCAGTTCATTCAGTCGACACTAC",
NULL,
"CCCAAAAGCAAGAAGAATATGAGGCCTTTCTGCT",
"CAAACTGGTGCAGAATCTGTTTGCTGAGGGCAATGA",
NULL,
"GGCAGATCAGGGAACTTCTAATGGATTGGGGTCC",
"GGATAACTGCACCTTCGCCTACCATCAGGAGGA",
"GGGTCCCAGGCAGCGCTGCCTGGGGGCTGGGGAG",
"TTGCGACAGGCTCCAGAAGGGCAAAGCCTGCCCAGAT",
"GCACCCCCCTCCTCTCCACCCTACCTTCCATCAACCA",
"AGG"
))

print(peptides)

Result:
[1] "MERQKRKADIEKGLQFIQSTL" "PKSKKNMRPFC" "QTGAESVC*GQ*"
[4] "GRSGNF*WIGV" "G*LHLRLPSGG" "GSQAALPGGWG"
[7] "LRQAPEGQSLPR" "APPSSPPYLPST" "R"


Full source code

:
C code:
#include <stdio.h>
#include <ctype.h>
#include <R.h>
#include <Rinternals.h>
/* the genetic code */
static const char* STANDARD_GENETIC_CODE="FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG";

static char _translate(char a,char b,char c);


/** translates DNA to protein
* @arg sequences one or more DNA sequence
* @return an array of peptides
*/
SEXP translate_dna(SEXP sequences)
{
int i;
SEXP array_of_peptides;
//check input has type= characters
if(!isString(sequences))
error("argument is not a string[]");
//prepare a new list for the translated sequence
array_of_peptides = allocVector(STRSXP,length(sequences));

//loop over the input
for(i=0;i< length(sequences);++i)
{
int j=0,n=0;
//transform the sequence into a ptr* of char
const char *dna=CHAR(STRING_ELT(sequences, i));
//get the length of this sequence
int dna_length=strlen(dna);
//alloc the protein sequence
char* peptide =malloc(dna_length/3+1);
if(peptide==NULL) error("out of memory");
//loop over the codons
for(j=0;j+2 < dna_length;j+=3)
{
//set the amino acid at 'n'
peptide[n++]=_translate(dna[j],dna[j+1],dna[j+2]);
}
//add EOS
peptide[n]=0;
//put a copy of this peptide in the vector
SET_STRING_ELT(array_of_peptides,i,Rf_mkChar(peptide));
//free our copy
free(peptide);
}
//return the array of petptide
return array_of_peptides;
}


static int base2index(char c)
{
switch(tolower(c))
{
case 't': return 0;
case 'c': return 1;
case 'a': return 2;
case 'g': return 3;
default: return -1;
}
}

static char _translate(char a,char b,char c)
{
int base1= base2index(a);
int base2= base2index(b);
int base3= base2index(c);
if(base1==-1 || base2==-1 || base3==-1)
{
return '?';
}
else
{
return STANDARD_GENETIC_CODE[base1*16+base2*4+base3];
}

}

Makefile:
run:translate.so
${R_HOME}/bin/R --no-save < translate.R

translate.so:translate.c
gcc -fPIC -I -g -c -Wall -I ${R_HOME}/include translate.c
gcc -shared -Wl,-soname,libtranslate.so.1 -o libtranslate.so translate.o

R code:
#load the dynamic library translate
dyn.load(paste("libtranslate", .Platform$dynlib.ext, sep=""))
#declare the function dna.translate
dna.translate <- function(dna)
{
#force dna to be type=character
storage.mode(dna) <- "character"
#call the C function
.Call("translate_dna", dna)
}

peptides <- dna.translate(
c( "ATGGAGAGGCAGAAACGGAAGGCGGACATCGAGAAAGGGCTGCAGTTCATTCAGTCGACACTAC",
NULL,
"CCCAAAAGCAAGAAGAATATGAGGCCTTTCTGCT",
"CAAACTGGTGCAGAATCTGTTTGCTGAGGGCAATGA",
NULL,
"GGCAGATCAGGGAACTTCTAATGGATTGGGGTCC",
"GGATAACTGCACCTTCGCCTACCATCAGGAGGA",
"GGGTCCCAGGCAGCGCTGCCTGGGGGCTGGGGAG",
"TTGCGACAGGCTCCAGAAGGGCAAAGCCTGCCCAGAT",
"GCACCCCCCTCCTCTCCACCCTACCTTCCATCAACCA",
"AGG"
))

print(peptides)


That's it

Pierre

3 comments:

István Albert said...

This is pretty neat. Seems fairly straigthforward.

Jermdemo said...

very impressive

my R_INCLUDE_DIR was actually in /usr/include/R instead of the default ${R_HOME}/include because I used RPMs to install R-devel

what resources did you use to learn this?

Pierre Lindenbaum said...

Jeremy, I just used http://cran.r-project.org/doc/manuals/R-exts.html. But I don't know if there is any memory leaks here....