30 July 2014

writing #rstats bindings for bwa-mem, my notebook.

I wanted to learn how to bind a C library to R, so I've created the following bindings for BWA. My code is available on github at :

Most of the C code was inspired from Heng Li's code https://github.com/lh3/bwa/blob/master/example.c.

A short description of the C code

In https://github.com/lindenb/rbwa/blob/master/rbwa.c:
struct RBwaHandler
This structure holds a pointer to the bwa-index (bwaidx_t) and to the options for bwa (mem_opt_t).
RBwaOpen(filename)
This methods opens the bwa index, wrap the pointer in a 'R' tructure using R_MakeExternalPtr and registers a destructor '_RBwaFinalizer' to be called by the garbage manager.
_RBwaFinalizer(handler)
This is the destructor called by the garbage manager. It calls 'RBwaClose'
RBwaClose(handler)
retrieves the pointer to the 'struct RBwaHandler' using 'R_ExternalPtrAddr', disposes the resources, free the RBwaHandler using 'R_ClearExternalPtr'
RBwaMap(handler,DNAsequence)
This is the workhorse of the code: it retrieves the pointer to the 'struct RBwaHandler' using 'R_ExternalPtrAddr', creates a "data.frame" with 6 columns (chromosome, position, strand, mapq, NM, secondary), maps the DNA sequence by calling bwa::mem_align1 and insert the hits in the data.frame.

The R code

See https://github.com/lindenb/rbwa/blob/master/rbwa.R. This R code loads the dynamic C libraries and declares the R functions calling the previous C functions using .Call:
  • bwa.open(filename)
  • bwa.close(bwt)
  • bwa.map(bwt,dnasequence)

Example

source("rbwa.R")
bwt <- bwa.open("test_files/chrM.fa")

for(s in c(
 "GCATGTGTAATCTTACTAAGAGCTAATAGAAAGGCTAGGACCAAACCTAT",
 "GCATGTGTAATCTTACTAAGCTAATAGAAAGGCTAGGACCAAACCTAT",
 "CTATCTACTTCAAATTCCTCCCTGTACGAAAGGACAAGAGAAATAAGGCCTCACAAAGCGCCTTCCCCCGTAAATGATATCATCTCAACTTAGTAT",
 "TACTAAACCC",
 "GCGAACCCAACTTCGATTCCCTCGCCGATCTCCGACGGAGCCGTGTGCAT" 
 ))
 {
 print(paste("QUERY:",s));
 hits<-bwa.map(bwt,s)
 print(hits)
 }

bwa.close(bwt);
here is the R output:
> source("rbwa.R")
> 
> bwt <- bwa.open("test_files/chrM.fa")
> 
> for(s in c(
+   "GCATGTGTAATCTTACTAAGAGCTAATAGAAAGGCTAGGACCAAACCTAT",
+   "GCATGTGTAATCTTACTAAGCTAATAGAAAGGCTAGGACCAAACCTAT",
+   "CTATCTACTTCAAATTCCTCCCTGTACGAAAGGACAAGAGAAATAAGGCCTCACAAAGCGCCTTCCCCCGTAAATGATATCATCTCAACTTAGTAT",
+   "TACTAAACCC",
+   "GCGAACCCAACTTCGATTCCCTCGCCGATCTCCGACGGAGCCGTGTGCAT" 
+   ))
+  {
+  print(paste("QUERY:",s));
+  hits<-bwa.map(bwt,s)
+  print(hits)
+  }
[1] "QUERY: GCATGTGTAATCTTACTAAGAGCTAATAGAAAGGCTAGGACCAAACCTAT"
  chrom pos strand mapq NM secondary
1  chrM 650      1   60  0         0
[1] "QUERY: GCATGTGTAATCTTACTAAGCTAATAGAAAGGCTAGGACCAAACCTAT"
  chrom pos strand mapq NM secondary
1  chrM 650      1   60  2         0
[1] "QUERY: CTATCTACTTCAAATTCCTCCCTGTACGAAAGGACAAGAGAAATAAGGCCTCACAAAGCGCCTTCCCCCGTAAATGATATCATCTCAACTTAGTAT"
  chrom  pos strand mapq NM secondary
1  chrM 3100      0   60  4         0
[1] "QUERY: TACTAAACCC"
[1] chrom     pos       strand    mapq      NM        secondary
<0 rows> (or 0-length row.names)
[1] "QUERY: GCGAACCCAACTTCGATTCCCTCGCCGATCTCCGACGGAGCCGTGTGCAT"
[1] chrom     pos       strand    mapq      NM        secondary
<0 rows> (or 0-length row.names)
> 
> bwa.close(bwt);
[1] TRUE

That's it,

Pierre

29 July 2014

Including the hash for the current git-commit in a C program

Say you wrote the following simple C program:
It includes a file "githash.h" that would contain the hash for the current commit in Git:



Because you're working with a Makefile, the file "githash.h" is generated by invoking 'git rev-parse HEAD ':



the file "githash.h" loooks like this:




But WAIT that is not so simple, once the file 'githash.h' has been created it won't be updated by Make as it already exists. This file should be removed each time 'git commit' is invoked. We can do this by creating POST COMMIT git hook: we create a bash script named ".git/hooks/post-commit" removing 'githash.h:



don't forget make it executable: `chmod +x .git/hooks/post-commit`

Now, each time 'git commit' is called, the file githash.h for the previous git-commit will be deleted !


That's it,

Pierre








05 July 2014

Pushed : makefile2graph , creating a graph of dependencies from GNU-Make.

I pushed makefile2graph at https://github.com/lindenb/makefile2graph. This is the standalone and 'C' implementation of a program I first wrote in java in 2012. The program creates a graph of dependencies from GNU-Make and its output is a graphiz-dot file or a Gexf/Gephi-XML file.

Usage

$ make -Bnd | make2graph > output.dot
$ make -Bnd | make2graph -x > gephi.gexf.xml 

Example

Here is the output of makefile2graph for Tabix:
$ cd tabix-0.2.5
$ make -Bnd |make2graph
digraph G {
n1[label="", color="green"];
n2[label="Makefile", color="green"];
n4[label="all", color="red"];
n3[label="all-recur", color="red"];
n23[label="bedidx.c", color="green"];
n22[label="bedidx.o", color="red"];
n9[label="bgzf.c", color="green"];
n10[label="bgzf.h", color="green"];
n8[label="bgzf.o", color="red"];
n27[label="bgzip", color="red"];
n29[label="bgzip.c", color="green"];
n28[label="bgzip.o", color="red"];
n18[label="index.c", color="green"];
n17[label="index.o", color="red"];
n20[label="khash.h", color="green"];
n16[label="knetfile.c", color="green"];
n11[label="knetfile.h", color="green"];
n15[label="knetfile.o", color="red"];
n24[label="kseq.h", color="green"];
n21[label="ksort.h", color="green"];
n13[label="kstring.c", color="green"];
n14[label="kstring.h", color="green"];
n12[label="kstring.o", color="red"];
n6[label="lib", color="red"];
n7[label="libtabix.a", color="red"];
n26[label="main.c", color="green"];
n25[label="main.o", color="red"];
n5[label="tabix", color="red"];
n19[label="tabix.h", color="green"];
n2 -> n1 ; 
n4 -> n1 ; 
n3 -> n1 ; 
(..)
}

That's it
Pierre

02 July 2014

Pushed today: verticalize , an everyday linux command to verticalize tab-delimited files

FYI: this morning, I pushed verticalize on github. Verticalize is an everyday simple command to 'verticalize' delimited files: https://github.com/lindenb/verticalize.

$ curl "https://raw.githubusercontent.com/ekg/vcflib/master/samples/sample.vcf" |\
  grep -vE "^##" |\
  verticalize

>>> 2
$1    #CHROM : 19
$2       POS : 111
$3        ID : .
$4       REF : A
$5       ALT : C
$6      QUAL : 9.6
$7    FILTER : .
$8      INFO : .
$9    FORMAT : GT:HQ
$10  NA00001 : 0|0:10,10
$11  NA00002 : 0|0:10,10
$12  NA00003 : 0/1:3,3
<<< 2

>>> 3
$1    #CHROM : 19
$2       POS : 112
$3        ID : .
$4       REF : A
$5       ALT : G
$6      QUAL : 10
$7    FILTER : .
$8      INFO : .
$9    FORMAT : GT:HQ
$10  NA00001 : 0|0:10,10
$11  NA00002 : 0|0:10,10
$12  NA00003 : 0/1:3,3
<<< 3



.

That's it,
Pierre