Showing posts with label r. Show all posts
Showing posts with label r. Show all posts

30 July 2014

writing #rstats bindings for bwa-mem, my notebook.

I wanted to learn how to bind a C library to R, so I've created the following bindings for BWA. My code is available on github at :

Most of the C code was inspired from Heng Li's code https://github.com/lh3/bwa/blob/master/example.c.

A short description of the C code

In https://github.com/lindenb/rbwa/blob/master/rbwa.c:
struct RBwaHandler
This structure holds a pointer to the bwa-index (bwaidx_t) and to the options for bwa (mem_opt_t).
RBwaOpen(filename)
This methods opens the bwa index, wrap the pointer in a 'R' tructure using R_MakeExternalPtr and registers a destructor '_RBwaFinalizer' to be called by the garbage manager.
_RBwaFinalizer(handler)
This is the destructor called by the garbage manager. It calls 'RBwaClose'
RBwaClose(handler)
retrieves the pointer to the 'struct RBwaHandler' using 'R_ExternalPtrAddr', disposes the resources, free the RBwaHandler using 'R_ClearExternalPtr'
RBwaMap(handler,DNAsequence)
This is the workhorse of the code: it retrieves the pointer to the 'struct RBwaHandler' using 'R_ExternalPtrAddr', creates a "data.frame" with 6 columns (chromosome, position, strand, mapq, NM, secondary), maps the DNA sequence by calling bwa::mem_align1 and insert the hits in the data.frame.

The R code

See https://github.com/lindenb/rbwa/blob/master/rbwa.R. This R code loads the dynamic C libraries and declares the R functions calling the previous C functions using .Call:
  • bwa.open(filename)
  • bwa.close(bwt)
  • bwa.map(bwt,dnasequence)

Example

source("rbwa.R")
bwt <- bwa.open("test_files/chrM.fa")

for(s in c(
 "GCATGTGTAATCTTACTAAGAGCTAATAGAAAGGCTAGGACCAAACCTAT",
 "GCATGTGTAATCTTACTAAGCTAATAGAAAGGCTAGGACCAAACCTAT",
 "CTATCTACTTCAAATTCCTCCCTGTACGAAAGGACAAGAGAAATAAGGCCTCACAAAGCGCCTTCCCCCGTAAATGATATCATCTCAACTTAGTAT",
 "TACTAAACCC",
 "GCGAACCCAACTTCGATTCCCTCGCCGATCTCCGACGGAGCCGTGTGCAT" 
 ))
 {
 print(paste("QUERY:",s));
 hits<-bwa.map(bwt,s)
 print(hits)
 }

bwa.close(bwt);
here is the R output:
> source("rbwa.R")
> 
> bwt <- bwa.open("test_files/chrM.fa")
> 
> for(s in c(
+   "GCATGTGTAATCTTACTAAGAGCTAATAGAAAGGCTAGGACCAAACCTAT",
+   "GCATGTGTAATCTTACTAAGCTAATAGAAAGGCTAGGACCAAACCTAT",
+   "CTATCTACTTCAAATTCCTCCCTGTACGAAAGGACAAGAGAAATAAGGCCTCACAAAGCGCCTTCCCCCGTAAATGATATCATCTCAACTTAGTAT",
+   "TACTAAACCC",
+   "GCGAACCCAACTTCGATTCCCTCGCCGATCTCCGACGGAGCCGTGTGCAT" 
+   ))
+  {
+  print(paste("QUERY:",s));
+  hits<-bwa.map(bwt,s)
+  print(hits)
+  }
[1] "QUERY: GCATGTGTAATCTTACTAAGAGCTAATAGAAAGGCTAGGACCAAACCTAT"
  chrom pos strand mapq NM secondary
1  chrM 650      1   60  0         0
[1] "QUERY: GCATGTGTAATCTTACTAAGCTAATAGAAAGGCTAGGACCAAACCTAT"
  chrom pos strand mapq NM secondary
1  chrM 650      1   60  2         0
[1] "QUERY: CTATCTACTTCAAATTCCTCCCTGTACGAAAGGACAAGAGAAATAAGGCCTCACAAAGCGCCTTCCCCCGTAAATGATATCATCTCAACTTAGTAT"
  chrom  pos strand mapq NM secondary
1  chrM 3100      0   60  4         0
[1] "QUERY: TACTAAACCC"
[1] chrom     pos       strand    mapq      NM        secondary
<0 rows> (or 0-length row.names)
[1] "QUERY: GCGAACCCAACTTCGATTCCCTCGCCGATCTCCGACGGAGCCGTGTGCAT"
[1] chrom     pos       strand    mapq      NM        secondary
<0 rows> (or 0-length row.names)
> 
> bwa.close(bwt);
[1] TRUE

That's it,

Pierre

30 January 2014

Parallelizing #RStats using #make

In the current post, I'll show how to use R as the main SHELL of GNU-Make instead of using a classical linux shell like 'bash'. Why would you do this ?

  • awesomeness
  • Make-based workflow management
  • Make-based execution with --jobs. GNU make knows how to execute several recipes at once. Normally, make will execute only one recipe at a time, waiting for it to finish before executing the next. However, the '-j' or '--jobs' option tells make to execute many recipes simultaneously.
The following recipe has been tested with GNU-Make 4.0 and I'm not sure it would world with '<=3.81'.

The only problem is that R doesn't accept a multiline-argument on the command line (see http://stackoverflow.com/questions/21442674) so I created a wrapper 'mockR' that save the argument '-e "code"' into a file and pipe it into R:

(Edit1: A comment from madscientist : Re your script; you can save yourself some wear-and-tear on your disk and avoid the need for temp files and cleanup by just piping the input directly: echo "$R" | R --vanilla --no-readline --quiet . Just a thought. ")

(Edit2: the exit value of 'R' should also be returned by 'mockR'.)

This file is set as executable:
$ chmod u+x ./mockR
In the makefile, we tell 'make' to use 'mockR' instead of '/usr/bin/sh':
SHELL = ./mockR
The R code will be passed to 'mockR' using the argument '-e "code"'
.SHELLFLAGS= -e
We also set 'ONESHELL': "If .ONESHELL is mentioned as a target, then when a target is built all lines of the recipe will be given to a single invocation of the shell rather than each line being invoked separately"
.ONESHELL:

Example 1

We download the table 'knownGene' from the UCSC and we plot a pdf file 'countExons=f(txStart)'. Please, note that the targets are created using some R statements, NOT bash statements:

Now Invoke make


Example 2

Using a the eval and the call function we can make the previous 'Makefile' applicable for all the chromosomes:

Now Invoke make USING TRHEE PARALLEL JOBS





You can now watch the final pdf files:




That's it,
Pierre

24 September 2010

Connecting to a MongoDB database from R using the C API for MongoDB

Today, Neil posted an article titled" Connecting to a MongoDB database from R using Java". In the current post, I'll show how to use the C API for MongoDB to fetch some MongoDB data from R. The code will be somehow similar to my previous post "A stateful C function for R: parsing Fasta sequences".

OK, First, let's add a few values in mongo:

for(i=1;i> 20;++i)
db.dbsnps.save({_id:"rs"+i,name:"rs"+i});


The C code contains 3 functions.

The first function mongoRconnect connects to the MongoDB server and put the pointer into a R variable.
SEXP mongoRconnect()
{
mongo_connection* conn; /* ptr */
mongo_connection_options opts[1];
mongo_conn_return status;

conn=(mongo_connection*)malloc(sizeof(mongo_connection));
strcpy( opts->host , "127.0.0.1" );
opts->port = 27017;
status = mongo_connect( conn, opts );

return R_MakeExternalPtr(conn, R_NilValue, R_NilValue);
}
The second method mongoRdiconnect closes the connection:
SEXP mongoRdiconnect(SEXP r_handle)
{
mongo_connection* conn;
conn = (mongo_connection*)R_ExternalPtrAddr(r_handle);;
if(conn==NULL) MONGO_ERROR("conn==NULL");
mongo_destroy( conn );
free(conn);
R_ClearExternalPtr(r_handle);
return ScalarInteger(0);
}
The last method mongoRquery scans the database test.dbsnps and inserts the name of the snps into an R array:
SEXP mongoRquery(SEXP r_handle)
{
SEXP values=NULL;
mongo_cursor *cursor;
bson empty[1];
bson_empty( empty );
int i;
mongo_connection* conn = R_ExternalPtrAddr(r_handle);
SEXP* array=NULL;
int array_size=0;
//the R value contains two objects


if(conn==NULL) MONGO_ERROR("handle==NULL");
cursor = mongo_find( conn,
"test.dbsnps",/* ns */
empty,/* fields */
empty,/* return */
0,/* return */
0,/* skip */
0 /* options */
);

while( mongo_cursor_next( cursor ) )
{
bson_iterator it[1];
if ( bson_find( it, &(cursor->current), "name" ))
{
array=(SEXP*)realloc(array,(array_size+1)*sizeof(SEXP));
if(array==NULL) error("out of memory");
array[array_size]=mkChar( bson_iterator_string( it ));
array_size++;
}
}
mongo_cursor_destroy( cursor );
PROTECT(values = allocVector(STRSXP, array_size));
for(i=0;i< array_size;++i)
{
SET_STRING_ELT(values, i, array[i]);
}
free(array);
UNPROTECT(1);
return values;
}
This C code is then be called from R:
mongo <- mongo.open()
mongo.snps(mongo)
mongo.close(mongo)

Result:
[1] "rs1" "rs2" "rs3" "rs4" "rs5" "rs6" "rs7" "rs8" "rs9" "rs10"
[11] "rs11" "rs12" "rs13" "rs14" "rs15" "rs16" "rs17" "rs18" "rs19"



Source code


Makefile

R_HOME=R-2.11.0
MONGO_HOME=mongo-c-driver
run:
gcc -fPIC -I -g -c -Wall -DMONGO_HAVE_STDINT -I ${R_HOME}/include -I ${MONGO_HOME}/src mongoR.c ${MONGO_HOME}/src/*.c
gcc -shared -Wl,-soname,rmongo.so.1 -o librmongo.so *.o
${R_HOME}/bin/R --no-save < mongo.R
clean:
rm *.o


mongoR.c

(again, I'm not sure about those PROTECT/UNPROTECT ...)
#include <ctype.h>
#include <errno.h>
#include <R.h>
#include <Rinternals.h>
#include <bson.h>
#include <mongo.h>

#define MONGO_ERROR(a) { error(a); fputs(a,stdout);exit(EXIT_FAILURE);}

/**
* connect to MONGO
*/
SEXP mongoRconnect()
{
mongo_connection* conn; /* ptr */
mongo_connection_options opts[1];
mongo_conn_return status;

conn=(mongo_connection*)malloc(sizeof(mongo_connection));
if(conn==NULL)
{
MONGO_ERROR("out of memory");
}

strcpy( opts->host , "127.0.0.1" );
opts->port = 27017;

status = mongo_connect( conn, opts );
if(status!= mongo_conn_success)
{
MONGO_ERROR("connection failed");
}

/** the handle is bound a R variable */
return R_MakeExternalPtr(conn, R_NilValue, R_NilValue);
}

/**
* close the mongo connection
*/
SEXP mongoRdiconnect(SEXP r_handle)
{
mongo_connection* conn;
conn = (mongo_connection*)R_ExternalPtrAddr(r_handle);;
if(conn==NULL) MONGO_ERROR("conn==NULL");
mongo_destroy( conn );
free(conn);
R_ClearExternalPtr(r_handle);
return ScalarInteger(0);
}

/**
* get all SNPS
*/
SEXP mongoRquery(SEXP r_handle)
{
SEXP values=NULL;
mongo_cursor *cursor;
bson empty[1];
bson_empty( empty );
int i;
mongo_connection* conn = R_ExternalPtrAddr(r_handle);
SEXP* array=NULL;
int array_size=0;
//the R value contains two objects


if(conn==NULL) MONGO_ERROR("handle==NULL");
cursor = mongo_find( conn,
"test.dbsnps",/* ns */
empty,/* fields */
empty,/* return */
0,/* return */
0,/* skip */
0 /* options */
);

while( mongo_cursor_next( cursor ) )
{
bson_iterator it[1];
if ( bson_find( it, &(cursor->current), "name" ))
{
array=(SEXP*)realloc(array,(array_size+1)*sizeof(SEXP));
if(array==NULL) error("out of memory");
array[array_size]=mkChar( bson_iterator_string( it ));
array_size++;
}
}
mongo_cursor_destroy( cursor );
PROTECT(values = allocVector(STRSXP, array_size));
for(i=0;i< array_size;++i)
{
SET_STRING_ELT(values, i, array[i]);
}
free(array);
UNPROTECT(1);
return values;
}


mongo.R

dyn.load(paste("librmongo", .Platform$dynlib.ext, sep=""))

mongo.open <- function()
{
.Call("mongoRconnect")
}

mongo.close <- function(handler)
{
.Call("mongoRdiconnect",handler)
}

mongo.snps <- function(handler)
{
.Call("mongoRquery",handler)
}

mongo <- mongo.open()
mongo.snps(mongo)
mongo.close(mongo)


That's it,
Pierre

19 April 2010

A stateful C function for R: parsing Fasta sequences

In the following post, I'll create a C extension for R. This extension will iterate over all the FASTA sequences in a file and will return a pair(name,sequence) for each sequence, that is to say that I won't store all the sequences in memory.

The C code

The structure FastaHandler holds the state of the fasta parser. It contains a pointer to the FILE, the current sequence and fasta header. We also need to save the previous line.
/** stateful structure for the fastafile */
typedef struct fastaHandler_t
{
/** input file */
FILE* in;
/** save the previous line, it will contains the next fasta header */
char* previous_line;
/** sequence name */
char *seq_name;
/** sequence dna */
char *seq_dna;
/** sequence_length */
int seq_length;
}FastaHandler,*FastaHandlerPtr;

The function fasta_open opens the fasta file and initialize a new FastaHandlerPtr. This pointer is then persisted into a 'R' variable using R_MakeExternalPtr.
/**
* open a fasta file, init the FastaHandler
*/
SEXP fasta_open(SEXP filename)
{
FastaHandlerPtr handle=NULL;
if(!isString(filename)) error("filename is not a string");
if(length(filename)!=1) error("expected only one filename");

handle=(FastaHandlerPtr)calloc(1,sizeof(FastaHandler));
if(handle==NULL)
{
error("Cannot alloc FastaHandler");
}
const char* c_filename=CHAR(STRING_ELT(filename,0));
errno=0;
handle->in= fopen( c_filename,"r");
if(handle->in==NULL)
{
error("Cannot open \"%s\": %s",c_filename,strerror(errno));
}
/** the handle is bound a R variable */
return R_MakeExternalPtr(handle, R_NilValue, R_NilValue);
}
We also need a function to close the structure. The pointer to the FastaHandlerPtr is retrieved using R_ExternalPtrAddr(R_variable):
/**
* close the FastaHandlerPtr
*/
SEXP fasta_close(SEXP r_handle)
{
FastaHandlerPtr handle = R_ExternalPtrAddr(r_handle);;
if(handle==NULL) error("handle==NULL");
if(handle->in!=NULL)
{
fclose(handle->in);
handle->in=NULL;
}
if(handle->previous_line!=NULL)
{
free(handle->previous_line);
handle->previous_line=NULL;
}
free(handle->seq_name);
handle->seq_name=NULL;
free(handle->seq_dna);
handle->seq_dna=NULL;
R_ClearExternalPtr(r_handle);
return ScalarInteger(0);
}
The function fasta_next return the next pair(name,sequence):
/**
* read the next fasta sequence
*/
SEXP fasta_next(SEXP r_handle)
{
int count=0;
char* ptr=NULL;
FastaHandlerPtr handle = R_ExternalPtrAddr(r_handle);;
if(handle==NULL) error("handle==NULL");
if(handle->in==NULL) error("handle->in==NULL");

while((ptr=readLine(handle))!=NULL)
{
if(ptr[0]=='>')
{
//this is the header of a fasta sequence
if(handle->seq_name!=NULL || handle->seq_dna!=NULL)
{
handle->previous_line=ptr;
SEXP return_value=NULL;
//if there a sequence in memory ?
if( handle->seq_name!=NULL &&
handle->seq_dna!=NULL &&
handle->seq_length>0)
{
//create the sequence
return_value=make_sequence(
&handle->seq_name[1],
handle->seq_dna
);
}
free(handle->seq_name);
free(handle->seq_dna);
handle->seq_name=NULL;
handle->seq_dna=NULL;
handle->seq_length=0;
//return the sequence if any
if(return_value!=NULL) return return_value;
}
//cleanup
handle->seq_name=ptr;
handle->seq_length=0;
handle->seq_dna=NULL;
}
else
{
int j=0;
int len=0;
//remove blank characters
while(ptr[j]!=0)
{
if(!isspace(ptr[j]))
{
ptr[len++]=ptr[j];
}
++j;
}
//enlarge the dna sequence
handle->seq_dna=realloc(
handle->seq_dna,
sizeof(char)*(handle->seq_length+1+len)
);
if(handle->seq_dna==NULL) error("cannot realloc seq_dna");
//append the new line
memcpy(&handle->seq_dna[handle->seq_length],ptr,sizeof(char)*len);
handle->seq_length+=len;
handle->seq_dna[handle->seq_length]=0;
free(ptr);
ptr=NULL;
}
}
//last sequence
SEXP return_value=NULL;
if( handle->seq_name!=NULL &&
handle->seq_dna!=NULL &&
handle->seq_length>0)
{
return_value=make_sequence(
&handle->seq_name[1],
handle->seq_dna
);
}
free(handle->seq_name);
free(handle->seq_dna);
handle->seq_name=NULL;
handle->seq_dna=NULL;
handle->seq_length=0;
return (return_value==NULL?R_NilValue:return_value);
}
This function calls readLine reading the next non-empty line:
/**
* return the next non empty line from a FastaHandlerPtr
* should be free()
*/
static char* readLine(FastaHandlerPtr handle)
{
int c;
char* ptr=NULL;
int length=0;
/* if there was a saved line, return it */
if(handle->previous_line!=NULL)
{
ptr=handle->previous_line;
handle->previous_line=NULL;
return ptr;
}
/** while the line is empty */
while(length==0)
{
int capacity=0;
//no more to read ?
if(feof(handle->in)) return NULL;
ptr=NULL;
//read each char
while((c=fgetc(handle->in))!=EOF)
{
//if this is an EOL, break
if(c=='\n') break;
//enlarge the buffer if it is too small
if(ptr==NULL || (length+2)>=capacity)
{
capacity+=500;
ptr=(char*)realloc(ptr,capacity*sizeof(char));
if(ptr==NULL) error("cannot realloc to %d bytes",capacity);
}
//append the char to the buffer
ptr[(length)++]=c;
}
//nothing was read (empty line) ? continue
if(length==0)
{
free(ptr);
ptr=NULL;
}
}
//add a end-of-string char
ptr[length]=0;
//return the line
return ptr;
}
It also invokes the function make_sequence(name,dna). This function creates a new R pair(name,sequence) :
/** create a pair list containing the name and the dna */
static SEXP make_sequence(const char* seq_name,const char* length)
{
SEXP values,names;
//the R value contains two objects
PROTECT(values = allocVector(VECSXP, 2));
//first item is the sequence name
SET_VECTOR_ELT(values, 0, mkString(seq_name));
//second item is the sequence dna
SET_VECTOR_ELT(values, 1, mkString(length));

//the labels
PROTECT(names = allocVector(STRSXP, 2));
//first label is the name
SET_STRING_ELT(names, 0, mkChar("name"));
//2nd label is the name
SET_STRING_ELT(names, 1, mkChar("sequence"));

//bind the labels to the values
setAttrib(values, R_NamesSymbol, names);
UNPROTECT(2);
return values;
}

The 'R' code

The 'C' dynamic library is loaded and each C function is bound to R using ".Call":
dyn.load(paste("libfasta", .Platform$dynlib.ext, sep=""))

fasta.open <- function(filename)
{
.Call("fasta_open", filename)
}

fasta.close <- function(handler)
{
.Call("fasta_close",handler)
}

fasta.next <- function(handler)
{
.Call("fasta_next",handler)
}

Compiling

Here is my makefile:
run:libfasta.so rs_chLG1.fas
${R_HOME}/bin/R --no-save < fasta.R

rs_chLG1.fas:
wget -O $@.gz ftp://ftp.ncbi.nih.gov/snp/organisms/bee_7460/rs_fasta/rs_chLG1.fas.gz
gunzip $@.gz

libfasta.so:fasta.c
gcc -fPIC -I -g -c -Wall -I ${R_HOME}/include fasta.c
gcc -shared -Wl,-soname,fasta.so.1 -o $@ fasta.o

Testing

I've downloaded some fasta sequences from dbSNP/Apis Mellifera. This file is open with fasta.open, we loop over each sequence and we print those sequences having a length lower than 300bp, at the end we close the stream with fasta.close:
f<-fasta.open("rs_chLG1.fas")

while(!is.null(snp <-fasta.next(f)))
{
if(nchar(as.character(snp$sequence))< 300)
print(snp)
}

fasta.close(f)

Result:
$name
[1] "gnl|dbSNP|rs44106732 rs=44106732|pos=48|len=298|taxid=7460|mol=\"genomic\"|class=1|alleles=\"C/T\"|build=127"

$sequence
[1] "GAAAATCAAAGACAATTTTTGGAATGGAACTAAATAACTTTATTCTTYCTTTCTTTCTCGTCGAGAATATCGTTATCGTTGCATGGGTTATGGAATAGCGTTCGTTAAAAATGTTATATTTCGAGGAAATATCGAAGATAGGCTTTGCGAAAGTCTGTTTCTCTAGAATTAAGATTTATATGTGTTGCAAGGGGAAGTTCAAAGAGAAATCGTGGCCAGTTCGAACATTATTATGTCTATCAATGATCGAGAAGTGTCATTGATGCAAGAGAAAAGTTTTCTCTTGCATTTAAGTATC"

$name
[1] "gnl|dbSNP|rs44108824 rs=44108824|pos=251|len=268|taxid=7460|mol=\"genomic\"|class=1|alleles=\"C/T\"|build=127"

$sequence
[1] "GTTAAATGGGAATTTTGGGGATTATTGGAGGAGGATTTACGTTTCGAGGATTGTTGATGATCTTAGGATTGTTTTCAGTTTGGAATTTTTTCTTCTTCTTCAACGTTGTACAACATTCTCCTCGAATTTTGTGTAACGAGGAGGATTAAACCTTTGGAAAATCACGTAAATTAGAGGACGATATATTGGGTTGGCAACTAAGTAATTGCGGATTTTTTTTAGAAAATCAAAGACAATTTTTGGAATGGAAYTAAATAACTTTATTCTT"

$name
[1] "gnl|dbSNP|rs44150807 rs=44150807|pos=251|len=274|taxid=7460|mol=\"genomic\"|class=1|alleles=\"C/T\"|build=127"

$sequence
[1] "TTCGATCTTCGTTTCACGCTCACGTTTCACGTTTCTCGTCCGGCGTAACGAACGGTATTCCGCTGATCACGAATAATTTCTTTCTGGAGAGTCCATTAGGGGACCGTCCCCTCCCCCCTCTCGCGCACACAGACACCCATCGCTTTCGACGCCTCGTCCATTCGAGGGAGAAACGAACGACTATTAGAAAAAAATCTTCTTTATATCTCTATAAATTCAATTTGCAGAGAAGCAAAGAGCTTTAAAATATYAACCATTATAACCGAACTTGTTG"

(....)


Full Source code


Makefile


run:libfasta.so rs_chLG1.fas
${R_HOME}/bin/R --no-save < fasta.R

rs_chLG1.fas:
wget -O $@.gz ftp://ftp.ncbi.nih.gov/snp/organisms/bee_7460/rs_fasta/rs_chLG1.fas.gz
gunzip $@.gz

libfasta.so:fasta.c
gcc -fPIC -I -g -c -Wall -I ${R_HOME}/include fasta.c
gcc -shared -Wl,-soname,fasta.so.1 -o $@ fasta.o

fasta.c


#include <ctype.h>
#include <errno.h>
#include <R.h>
#include <Rinternals.h>

/** stateful structure for the fastafile */
typedef struct fastaHandler_t
{
/** input file */
FILE* in;
/** save previous line, it will contains the next fasta header */
char* previous_line;
/** sequence name */
char *seq_name;
/** sequence dna */
char *seq_dna;
/** sequence_length */
int seq_length;

}FastaHandler,*FastaHandlerPtr;


/**
* return the next non empty line from a FastaHandlerPtr
* should be free()
*/
static char* readLine(FastaHandlerPtr handle)
{
int c;
char* ptr=NULL;
int length=0;
/* if there was a saved line, return it */
if(handle->previous_line!=NULL)
{
ptr=handle->previous_line;
handle->previous_line=NULL;
return ptr;
}
/** while the line is empty */
while(length==0)
{
int capacity=0;
//no more to read ?
if(feof(handle->in)) return NULL;
ptr=NULL;
//read each char
while((c=fgetc(handle->in))!=EOF)
{
//if this is an EOL, break
if(c=='\n') break;
//enlarge the buffer if it is too small
if(ptr==NULL || (length+2)>=capacity)
{
capacity+=500;
ptr=(char*)realloc(ptr,capacity*sizeof(char));
if(ptr==NULL) error("cannot realloc to %d bytes",capacity);
}
//append the char to the buffer
ptr[(length)++]=c;
}
//nothing was read (empty line) ? continue
if(length==0)
{
free(ptr);
ptr=NULL;
}
}
//add a end-of-string char
ptr[length]=0;
//return the line
return ptr;
}

/**
* open a fasta file, init the FastaHandler
*/
SEXP fasta_open(SEXP filename)
{
FastaHandlerPtr handle=NULL;
if(!isString(filename)) error("filename is not a string");
if(length(filename)!=1) error("expected only one filename");

handle=(FastaHandlerPtr)calloc(1,sizeof(FastaHandler));
if(handle==NULL)
{
error("Cannot alloc FastaHandler");
}
const char* c_filename=CHAR(STRING_ELT(filename,0));
errno=0;
handle->in= fopen( c_filename,"r");
if(handle->in==NULL)
{
error("Cannot open \"%s\": %s",c_filename,strerror(errno));
}
/** the handle is bound a R variable */
return R_MakeExternalPtr(handle, R_NilValue, R_NilValue);
}

/** create a pair list containing the name and the dna */
static SEXP make_sequence(const char* seq_name,const char* length)
{
SEXP values,names;
//the R value contains two objects
PROTECT(values = allocVector(VECSXP, 2));
//first item is the sequence name
SET_VECTOR_ELT(values, 0, mkString(seq_name));
//second item is the sequence dna
SET_VECTOR_ELT(values, 1, mkString(length));

//the labels
PROTECT(names = allocVector(STRSXP, 2));
//first label is the name
SET_STRING_ELT(names, 0, mkChar("name"));
//2nd label is the name
SET_STRING_ELT(names, 1, mkChar("sequence"));

//bind the labels to the values
setAttrib(values, R_NamesSymbol, names);
UNPROTECT(2);
return values;
}

/**
* read the next fasta sequence
*/
SEXP fasta_next(SEXP r_handle)
{
int count=0;
char* ptr=NULL;
FastaHandlerPtr handle = R_ExternalPtrAddr(r_handle);;
if(handle==NULL) error("handle==NULL");
if(handle->in==NULL) error("handle->in==NULL");

while((ptr=readLine(handle))!=NULL)
{
if(ptr[0]=='>')
{
//this is the header of a fasta sequence
if(handle->seq_name!=NULL || handle->seq_dna!=NULL)
{
handle->previous_line=ptr;
SEXP return_value=NULL;
//if there a sequence in memory ?
if( handle->seq_name!=NULL &&
handle->seq_dna!=NULL &&
handle->seq_length>0)
{
//create the sequence
return_value=make_sequence(
&handle->seq_name[1],
handle->seq_dna
);
}
free(handle->seq_name);
free(handle->seq_dna);
handle->seq_name=NULL;
handle->seq_dna=NULL;
handle->seq_length=0;
//return the sequence if any
if(return_value!=NULL) return return_value;
}
//cleanup
handle->seq_name=ptr;
handle->seq_length=0;
handle->seq_dna=NULL;
}
else
{
int j=0;
int len=0;
//remove blank characters
while(ptr[j]!=0)
{
if(!isspace(ptr[j]))
{
ptr[len++]=ptr[j];
}
++j;
}
//enlarge the dna sequence
handle->seq_dna=realloc(
handle->seq_dna,
sizeof(char)*(handle->seq_length+1+len)
);
if(handle->seq_dna==NULL) error("cannot realloc seq_dna");
//append the new line
memcpy(&handle->seq_dna[handle->seq_length],ptr,sizeof(char)*len);
handle->seq_length+=len;
handle->seq_dna[handle->seq_length]=0;
free(ptr);
ptr=NULL;
}
}
//last sequence
SEXP return_value=NULL;
if( handle->seq_name!=NULL &&
handle->seq_dna!=NULL &&
handle->seq_length>0)
{
return_value=make_sequence(
&handle->seq_name[1],
handle->seq_dna
);
}
free(handle->seq_name);
free(handle->seq_dna);
handle->seq_name=NULL;
handle->seq_dna=NULL;
handle->seq_length=0;
return (return_value==NULL?R_NilValue:return_value);
}

/**
* close the FastaHandlerPtr
*/
SEXP fasta_close(SEXP r_handle)
{
FastaHandlerPtr handle = R_ExternalPtrAddr(r_handle);;
if(handle==NULL) error("handle==NULL");
if(handle->in!=NULL)
{
fclose(handle->in);
handle->in=NULL;
}
if(handle->previous_line!=NULL)
{
free(handle->previous_line);
handle->previous_line=NULL;
}
free(handle->seq_name);
handle->seq_name=NULL;
free(handle->seq_dna);
handle->seq_dna=NULL;
R_ClearExternalPtr(r_handle);
return ScalarInteger(0);
}

fasta.R


dyn.load(paste("libfasta", .Platform$dynlib.ext, sep=""))

fasta.open <- function(filename)
{
.Call("fasta_open", filename)
}
fasta.close <- function(handler)
{
.Call("fasta_close",handler)
}

fasta.next <- function(handler)
{
.Call("fasta_next",handler)
}

f<-fasta.open("rs_chLG1.fas")

while(!is.null(snp <-fasta.next(f)))
{
if(nchar(as.character(snp$sequence))< 300)
print(snp)
}

fasta.close(f)

That's it !
Pierre

14 April 2010

Object Oriented Programming with R: My notebook

In the following post, I describe how I've used the OOP features of R to create and use the following class hierarchy:

Your browser does not support the <CANVAS> element !


First the class Person is defined. It contains four fields : firstName, lastName, birthDate and birthPlace.
setClass("Person",
representation(
firstName="character",
lastName="character",
birthDate="Date",
birthPlace="character"
))

A kind of 'constructor' function can be called for Person to check that both firtsName and lastName are not empty:
setValidity("Person",
function(object)
{
length(object@firstName)>0 &&
length(object@lastName)>0
}
)
DeceasedPerson is a subClass of Person, it contains two more fields: deathPlace and deathDate:
setClass("DeceasedPerson",
representation(
deathDate="Date",
deathPlace="character"
),
contains="Person"
)
Scientist is another subClass of Person it contains one more field:'knownFor':
setClass("Scientist",
representation(
knownFor="character"
),
contains="Person"
)
Lastly, DeceasedScientist is a subClass of both Scientist and DeceasedPerson:
setClass("DeceasedScientist",
contains=c("Scientist","DeceasedPerson")
)
Let's define a 'generic' function 'age' returning the age of an individual from his 'birthdate':
age <- function(individual)
{
as.integer((Sys.Date()-individual@birthDate)/365)
}
setGeneric("age")
Polymorphism: for the DeceasedPerson another function will be used, it will calculate the age from both 'deathDate' and 'birthDate':
age.of.death <- function(individual)
{
as.integer((individual@deathDate-individual@birthDate)/365)
}
setMethod(age,signature=c("DeceasedPerson"),definition=age.of.death)
Ok, let's play with our class, we can first create a new instance of Scientist for Craig Venter:
craigVenter <-new(
"Scientist",
firstName="Craig",
lastName="Venter",
birthPlace="Salt Lake City",
birthDate=as.Date("1946-10-14", "%Y-%m-%d"),
knownFor=c("The Institute for Genomic Research","J. Craig Venter Institute")
)
... and Charles Darwin is a DeceasedScientist:
charlesDarwin <-new(
"DeceasedScientist",
firstName="Charles",
lastName="Darwin",
birthDate=as.Date("1809-02-12", "%Y-%m-%d"),
deathDate=as.Date("1882-04-19", "%Y-%m-%d"),
knownFor=c("Natural Selection","The Voyage of the Beagle")
)
Hey , we know where Charles was born!
charlesDarwin@birthPlace="Shrewsbury"
The following statement fails because the firstName is empty:
> try(new("Person",lastName="Darwin",birthDate=as.Date("1809-02-12", "%Y-%m-%d")),FALSE)
Error in validObject(.Object) : invalid class "Person" object: FALSE
Is Darwin a valid object?:
> validObject(charlesDarwin)
[1] TRUE
Print both individuals:
> charlesDarwin
An object of class “DeceasedScientist”
Slot "knownFor":
[1] "Natural Selection" "The Voyage of the Beagle"

Slot "firstName":
[1] "Charles"

Slot "lastName":
[1] "Darwin"

Slot "birthDate":
[1] "1809-02-12"

Slot "birthPlace":
[1] "Shrewsbury"

Slot "deathDate":
[1] "1882-04-19"

Slot "deathPlace":
character(0)

> craigVenter
An object of class “Scientist”
Slot "knownFor":
[1] "The Institute for Genomic Research" "J. Craig Venter Institute"

Slot "firstName":
[1] "Craig"

Slot "lastName":
[1] "Venter"

Slot "birthDate":
[1] "1946-10-14"

Slot "birthPlace":
[1] "Salt Lake City"
Let's use the 'is' operator:
> is(craigVenter,"Person")
[1] TRUE
> is(craigVenter,"DeceasedScientist")
[1] FALSE
> is(charlesDarwin,"DeceasedScientist")
[1] TRUE
Finally let's invoke the polymorhic function 'age' for both individuals:
> age(charlesDarwin)
[1] 73 #age.of.death was called
> age(craigVenter)
[1] 63 #generic "age'


Full source code

setClass("Person",
representation(
firstName="character",
lastName="character",
birthDate="Date",
birthPlace="character"
))

setValidity("Person",
function(object)
{
length(object@firstName)>0 &&
length(object@lastName)>0
}
)

setClass("DeceasedPerson",
representation(
deathDate="Date",
deathPlace="character"
),
contains="Person"
)

setClass("Scientist",
representation(
knownFor="character"
),
contains="Person"
)

age <- function(individual)
{
as.integer((Sys.Date()-individual@birthDate)/365)
}

setGeneric("age")

age.of.death <- function(individual)
{
as.integer((individual@deathDate-individual@birthDate)/365)
}


setClass("DeceasedScientist",
contains=c("Scientist","DeceasedPerson")
)

setMethod(age,signature=c("DeceasedPerson"),definition=age.of.death)

craigVenter <-new(
"Scientist",
firstName="Craig",
lastName="Venter",
birthPlace="Salt Lake City",
birthDate=as.Date("1946-10-14", "%Y-%m-%d"),
knownFor=c("The Institute for Genomic Research","J. Craig Venter Institute")
)

charlesDarwin <-new(
"DeceasedScientist",
firstName="Charles",
lastName="Darwin",
birthDate=as.Date("1809-02-12", "%Y-%m-%d"),
deathDate=as.Date("1882-04-19", "%Y-%m-%d"),
knownFor=c("Natural Selection","The Voyage of the Beagle")
)

try(new("Person",lastName="Darwin",birthDate=as.Date("1809-02-12", "%Y-%m-%d")),FALSE)

charlesDarwin@birthPlace="Shrewsbury"

validObject(charlesDarwin)

charlesDarwin
craigVenter


is(craigVenter,"Person")
is(craigVenter,"DeceasedScientist")
is(charlesDarwin,"DeceasedScientist")
age(charlesDarwin)
age(craigVenter)


That's it !
Pierre

26 March 2010

'R' = dna.translate("AGG") . A custom C function for R, My notebook.


In the following post, I will show how I've implemented a custom C function for R. This C function will translate a DNA to a protein. I'm very new to 'R' so feel free to make any comment about the code.

C code


The data in 'R' are stored in an opaque structure named 'SEXP'. A custom C function receives some SEXP arguments and returns another SEXP. So, the declaration of my function translate_dna is:
SEXP translate_dna(SEXP sequences)
{
(...)
}

We check if the argument 'sequences' has a type='character'.
if(!isString(sequences))
error("argument is not a string[]");
The input consists of one or more DNA, so our return value will be an SEXP array of type 'character' (STRSXP) with a size equals to the number of DNAs. This array is allocated:
SEXP array_of_peptides = allocVector(STRSXP,length(sequences));
We loop over each DNA sequence
for(in i=0;i<length(sequences);++i) ...
and we allocate some memory for the peptide:
const char *dna=CHAR(STRING_ELT(sequences, i));
int dna_length=strlen(dna);
char* peptide =malloc(dna_length/3+1);
The peptide is then filled with its amino acids:
for(j=0;j+2 < dna_length;j+=3)
{
peptide[n++]=_translate(dna[j],dna[j+1],dna[j+2]);
}
peptide[n]=0;
And we put this new string/peptide in the returned value
SET_STRING_ELT(array_of_peptides,i,Rf_mkChar(peptide));
Here, I am not sure how 'R' handles its memory. Should I care about the way R runs its garbage manager ? how should I use the macros PROTECT and UNPROTECT ?

Compilation


The code 'translate.c' is compiled as a dynamic library ' libtranslate.so':

gcc -fPIC -I -g -c -Wall -I ${R_HOME}/include translate.c
gcc -shared -Wl,-soname,libtranslate.so.1 -o libtranslate.so translate.o

R code

on the R side , the previous dynamic library is loaded:
dyn.load(paste("libtranslate", .Platform$dynlib.ext, sep=""))
A function dna.translate is declared: it forces the array to be an array of string and it invokes the C function 'translate_dna'
storage.mode(dna) <- "character"
.Call("translate_dna", dna)
We can now invoke the R function dna.translate:
peptides <- dna.translate(
c( "ATGGAGAGGCAGAAACGGAAGGCGGACATCGAGAAAGGGCTGCAGTTCATTCAGTCGACACTAC",
NULL,
"CCCAAAAGCAAGAAGAATATGAGGCCTTTCTGCT",
"CAAACTGGTGCAGAATCTGTTTGCTGAGGGCAATGA",
NULL,
"GGCAGATCAGGGAACTTCTAATGGATTGGGGTCC",
"GGATAACTGCACCTTCGCCTACCATCAGGAGGA",
"GGGTCCCAGGCAGCGCTGCCTGGGGGCTGGGGAG",
"TTGCGACAGGCTCCAGAAGGGCAAAGCCTGCCCAGAT",
"GCACCCCCCTCCTCTCCACCCTACCTTCCATCAACCA",
"AGG"
))

print(peptides)

Result:
[1] "MERQKRKADIEKGLQFIQSTL" "PKSKKNMRPFC" "QTGAESVC*GQ*"
[4] "GRSGNF*WIGV" "G*LHLRLPSGG" "GSQAALPGGWG"
[7] "LRQAPEGQSLPR" "APPSSPPYLPST" "R"


Full source code

:
C code:
#include <stdio.h>
#include <ctype.h>
#include <R.h>
#include <Rinternals.h>
/* the genetic code */
static const char* STANDARD_GENETIC_CODE="FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG";

static char _translate(char a,char b,char c);


/** translates DNA to protein
* @arg sequences one or more DNA sequence
* @return an array of peptides
*/
SEXP translate_dna(SEXP sequences)
{
int i;
SEXP array_of_peptides;
//check input has type= characters
if(!isString(sequences))
error("argument is not a string[]");
//prepare a new list for the translated sequence
array_of_peptides = allocVector(STRSXP,length(sequences));

//loop over the input
for(i=0;i< length(sequences);++i)
{
int j=0,n=0;
//transform the sequence into a ptr* of char
const char *dna=CHAR(STRING_ELT(sequences, i));
//get the length of this sequence
int dna_length=strlen(dna);
//alloc the protein sequence
char* peptide =malloc(dna_length/3+1);
if(peptide==NULL) error("out of memory");
//loop over the codons
for(j=0;j+2 < dna_length;j+=3)
{
//set the amino acid at 'n'
peptide[n++]=_translate(dna[j],dna[j+1],dna[j+2]);
}
//add EOS
peptide[n]=0;
//put a copy of this peptide in the vector
SET_STRING_ELT(array_of_peptides,i,Rf_mkChar(peptide));
//free our copy
free(peptide);
}
//return the array of petptide
return array_of_peptides;
}


static int base2index(char c)
{
switch(tolower(c))
{
case 't': return 0;
case 'c': return 1;
case 'a': return 2;
case 'g': return 3;
default: return -1;
}
}

static char _translate(char a,char b,char c)
{
int base1= base2index(a);
int base2= base2index(b);
int base3= base2index(c);
if(base1==-1 || base2==-1 || base3==-1)
{
return '?';
}
else
{
return STANDARD_GENETIC_CODE[base1*16+base2*4+base3];
}

}

Makefile:
run:translate.so
${R_HOME}/bin/R --no-save < translate.R

translate.so:translate.c
gcc -fPIC -I -g -c -Wall -I ${R_HOME}/include translate.c
gcc -shared -Wl,-soname,libtranslate.so.1 -o libtranslate.so translate.o

R code:
#load the dynamic library translate
dyn.load(paste("libtranslate", .Platform$dynlib.ext, sep=""))
#declare the function dna.translate
dna.translate <- function(dna)
{
#force dna to be type=character
storage.mode(dna) <- "character"
#call the C function
.Call("translate_dna", dna)
}

peptides <- dna.translate(
c( "ATGGAGAGGCAGAAACGGAAGGCGGACATCGAGAAAGGGCTGCAGTTCATTCAGTCGACACTAC",
NULL,
"CCCAAAAGCAAGAAGAATATGAGGCCTTTCTGCT",
"CAAACTGGTGCAGAATCTGTTTGCTGAGGGCAATGA",
NULL,
"GGCAGATCAGGGAACTTCTAATGGATTGGGGTCC",
"GGATAACTGCACCTTCGCCTACCATCAGGAGGA",
"GGGTCCCAGGCAGCGCTGCCTGGGGGCTGGGGAG",
"TTGCGACAGGCTCCAGAAGGGCAAAGCCTGCCCAGAT",
"GCACCCCCCTCCTCTCCACCCTACCTTCCATCAACCA",
"AGG"
))

print(peptides)


That's it

Pierre