08 March 2012

My first walker for the GATK : my notebook

This is my first notebook for developing a new Walker for the Genome Analysis Toolkit (GATK). This post was mostly inspired by the following PDF: kvg_20_line_lifesavers_mad_v2.pptx.pdf.

Get the sources

git clone http://github.com/broadgsa/gatk.git GATK.dev
The javac compiler also requires the following library from Google: http://code.google.com/p/cofoja/.

A first "Short-Reads" walker

The following ReadWalker class scans the reads and prints them in FASTA format. The @Output annotation tells the GATK that we're going to channel our output through a java.io.PrintStream object. This field is filled in automatically by the application runtime.
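As a rough illustration of the "print them as FASTA" step only, here is a minimal, GATK-free sketch. The class FastaFormatter and its toFasta method are hypothetical names of mine, not part of the GATK. In a real walker, the name and bases would presumably come from the SAMRecord (getReadName() and getReadString()) and the result would be written to the @Output PrintStream.

```java
/** Hypothetical helper (not a GATK class): formats one read as a FASTA record. */
final class FastaFormatter {
    private FastaFormatter() {
    }

    /**
     * Builds a FASTA record: a '>' header line followed by the bases,
     * wrapped at lineWidth characters per line.
     */
    static String toFasta(String readName, String bases, int lineWidth) {
        if (lineWidth <= 0) {
            throw new IllegalArgumentException("lineWidth must be positive");
        }
        StringBuilder sb = new StringBuilder();
        sb.append('>').append(readName).append('\n');
        for (int i = 0; i < bases.length(); i += lineWidth) {
            int end = Math.min(i + lineWidth, bases.length());
            // append(CharSequence, start, end) copies bases[i..end)
            sb.append(bases, i, end).append('\n');
        }
        return sb.toString();
    }
}
```

For example, FastaFormatter.toFasta("r1", "ACGTAC", 4) yields a header line ">r1" followed by the lines "ACGT" and "AC".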

Compilation

javac -cp /path/to/GenomeAnalysisTK.jar:/path/to/cofoja-1.0-r139.jar:. \
 -sourcepath src \
 -d tmp src/mygatk/HelloRead.java
jar cvf HelloRead.jar -C tmp .

Running

Here I'm using a BAM from the 'examples' folder of samtools (we need to pre-process this BAM with Picard AddOrReplaceReadGroups). We then use our library as follows:
java -cp path/to/GenomeAnalysisTK.jar:HelloRead.jar \
org.broadinstitute.sting.gatk.CommandLineGATK -T HelloRead \
 -I test.bam \
 -R ${SAMTOOLS}/examples/ex1.fa 

Result:

The Makefile

That's it, Pierre

04 March 2012

Java Remote Method Invocation (RMI) for Bioinformatics

"Java Remote Method Invocation (Java RMI) enables the programmer to create distributed Java technology-based to Java technology-based applications, in which the methods of remote Java objects can be invoked from other Java virtual machines, possibly on different hosts." [Oracle] In this post, a Java client uses RMI to send a Java class to a server, which uses that class to analyze a DNA sequence fetched from the NCBI.

Files and directories

In this example, my files are organized as follows:
./sandbox/client/FirstBases.java
./sandbox/client/GCPercent.java
./sandbox/client/SequenceAnalyzerClient.java
./sandbox/server/SequenceAnalyzerServiceImpl.java
./sandbox/shared/SequenceAnalyzerService.java
./sandbox/shared/SequenceAnalyzer.java
./client.policy
./server.policy

The Service: SequenceAnalyzerService.java

The remote service provided by the server is defined as an interface named SequenceAnalyzerService: it fetches a DNA sequence for a given NCBI-gi, processes the sequence with an instance of SequenceAnalyzer (see below) and returns a serializable value (that is to say, we can transmit this value through the network).

Extract a value from a DNA sequence : SequenceAnalyzer

The interface SequenceAnalyzer defines how the remote service should parse a sequence. A SAX parser will be used by the 'SequenceAnalyzerService' to process a TinySeq-XML document from the NCBI. The method characters is called each time a chunk of sequence is found. At the end, the remote server will return the value returned by getResult:
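A minimal sketch of what the two shared interfaces could look like, assuming the method names used in this post (characters, getResult, analyse). The exact signatures and the BaseCounter example class are my assumptions, not necessarily the original code.

```java
import java.io.Serializable;
import java.rmi.Remote;
import java.rmi.RemoteException;

/** Sent by the client; it must be Serializable to travel over RMI. */
interface SequenceAnalyzer extends Serializable {
    /** Called for each chunk of sequence, mirroring the SAX characters() callback. */
    void characters(char[] chars, int start, int length);

    /** Value sent back to the client once the whole sequence has been read. */
    Serializable getResult();
}

/** Remote service: every remote method must declare RemoteException. */
interface SequenceAnalyzerService extends Remote {
    /** Signature assumed: fetches the sequence for 'gi' and feeds it to 'analyzer'. */
    Serializable analyse(int gi, SequenceAnalyzer analyzer) throws RemoteException;
}

/** Hypothetical example analyzer: counts the characters it receives. */
final class BaseCounter implements SequenceAnalyzer {
    private static final long serialVersionUID = 1L;
    private long count = 0;

    @Override
    public void characters(char[] chars, int start, int length) {
        count += length;
    }

    @Override
    public Serializable getResult() {
        return Long.valueOf(count);
    }
}
```

Keeping the result typed as Serializable lets each analyzer return whatever it likes (a number, a string) as long as it can cross the network.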

Server side : an implementation of SequenceAnalyzerService

The class SequenceAnalyzerServiceImpl is an implementation of the service SequenceAnalyzerService. In the method analyse, a SAXParser is created and the sequence for the given 'gi' is downloaded from the NCBI. The instance of SequenceAnalyzer received from the client is invoked for each chunk of DNA. At the end, the "value" calculated by the instance of SequenceAnalyzer is returned to the client through the network. The 'main' method contains the code to bind this service to the RMI registry:
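The SAX part of analyse can be sketched as follows. To keep the block self-contained, it parses a TinySeq-XML string held in memory instead of downloading it from the NCBI, and it forwards the text of each TSeq_sequence element to a callback. TinySeqScanner and ChunkSink are hypothetical names standing in for the real service and for SequenceAnalyzer.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class TinySeqScanner {
    /** Hypothetical callback standing in for SequenceAnalyzer.characters. */
    interface ChunkSink {
        void characters(char[] chars, int start, int length);
    }

    private TinySeqScanner() {
    }

    /** Parses TinySeq XML and forwards each chunk found inside TSeq_sequence. */
    static void scan(String tinySeqXml, final ChunkSink sink) {
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            InputSource source = new InputSource(new StringReader(tinySeqXml));
            parser.parse(source, new DefaultHandler() {
                private boolean inSequence = false;

                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes atts) {
                    if ("TSeq_sequence".equals(qName)) {
                        inSequence = true;
                    }
                }

                @Override
                public void endElement(String uri, String localName, String qName) {
                    if ("TSeq_sequence".equals(qName)) {
                        inSequence = false;
                    }
                }

                @Override
                public void characters(char[] ch, int start, int length) {
                    // SAX may deliver one sequence in several chunks
                    if (inSequence) {
                        sink.characters(ch, start, length);
                    }
                }
            });
        } catch (Exception e) {
            throw new IllegalStateException("Cannot parse TinySeq XML", e);
        }
    }
}
```

Because SAX is free to split the text of an element across several calls, the analyzer has to accumulate its state across calls rather than assume one call per sequence.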

Client side

On the client side, we're going to connect to the SequenceAnalyzerService and send two distinct implementations of SequenceAnalyzer. What's interesting here is that the server doesn't know anything about those implementations of SequenceAnalyzer: the client's compiled Java classes have to be sent to the service.

GCPercent.java

A first implementation of 'SequenceAnalyzer' computes the GC% of a sequence:
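Here is a hedged sketch of such an analyzer. It reports a percentage between 0 and 100, which is my assumption about the formula. The "implements SequenceAnalyzer" clause is left out so that the block compiles on its own; in the project it would implement sandbox.shared.SequenceAnalyzer.

```java
import java.io.Serializable;

/** Sketch of a GC% analyzer; state is accumulated chunk by chunk. */
final class GCPercent implements Serializable {
    private static final long serialVersionUID = 1L;
    private long gc = 0;
    private long total = 0;

    /** Counts G and C (either case) among the non-whitespace characters. */
    public void characters(char[] chars, int start, int length) {
        for (int i = start; i < start + length; i++) {
            char base = Character.toUpperCase(chars[i]);
            if (Character.isWhitespace(base)) {
                continue;
            }
            total++;
            if (base == 'G' || base == 'C') {
                gc++;
            }
        }
    }

    /** Returns 100 * GC / total, or 0 when nothing has been seen yet. */
    public Double getResult() {
        return total == 0 ? 0.0 : 100.0 * gc / total;
    }
}
```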

FirstBases

The second implementation of 'SequenceAnalyzer' retrieves the first bases of a sequence.
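A possible sketch, assuming the number of bases to keep is a constructor parameter (maxBases is my own name). As the sequence arrives in chunks, the analyzer only copies what is still missing from its prefix. The "implements SequenceAnalyzer" clause is again omitted to keep the block standalone.

```java
import java.io.Serializable;

/** Sketch of an analyzer keeping only the first maxBases characters. */
final class FirstBases implements Serializable {
    private static final long serialVersionUID = 1L;
    private final int maxBases;
    private final StringBuilder prefix = new StringBuilder();

    FirstBases(int maxBases) {
        this.maxBases = maxBases;
    }

    /** Appends at most as many characters as are still missing. */
    public void characters(char[] chars, int start, int length) {
        int wanted = Math.min(length, maxBases - prefix.length());
        if (wanted > 0) {
            prefix.append(chars, start, wanted);
        }
    }

    /** Returns the bases collected so far. */
    public String getResult() {
        return prefix.toString();
    }
}
```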

The Client

And here is the java code for the client. The client connects to the RMI server and invokes 'analyse' with the two instances of SequenceAnalyzer for some NCBI-gi:
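The bind / lookup / invoke cycle can be tried in a single JVM with the sketch below. It is only an illustration: the server and the client share one process, so no codebase or policy file is involved, and the names RmiRoundTrip, DemoService and CountBases are hypothetical stand-ins for the classes of this post.

```java
import java.io.Serializable;
import java.net.ServerSocket;
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.rmi.registry.LocateRegistry;
import java.rmi.registry.Registry;
import java.rmi.server.UnicastRemoteObject;

final class RmiRoundTrip {
    /** Hypothetical task sent by the client; Serializable so RMI can copy it. */
    static final class CountBases implements Serializable {
        private static final long serialVersionUID = 1L;

        int count(String dna) {
            return dna.length();
        }
    }

    /** Hypothetical remote interface, a stand-in for SequenceAnalyzerService. */
    interface DemoService extends Remote {
        int analyse(String dna, CountBases task) throws RemoteException;
    }

    static final class DemoServiceImpl implements DemoService {
        @Override
        public int analyse(String dna, CountBases task) {
            return task.count(dna); // runs on the "server" side
        }
    }

    private RmiRoundTrip() {
    }

    static int run(String dna) {
        try {
            // Make the stubs point at the loopback address
            System.setProperty("java.rmi.server.hostname", "127.0.0.1");
            int port;
            try (ServerSocket probe = new ServerSocket(0)) {
                port = probe.getLocalPort(); // pick a free port
            }
            // Server side: start a registry, export the service, bind it
            Registry serverRegistry = LocateRegistry.createRegistry(port);
            DemoServiceImpl impl = new DemoServiceImpl();
            Remote stub = UnicastRemoteObject.exportObject(impl, 0);
            serverRegistry.rebind("DemoService", stub);
            try {
                // Client side: locate the registry, look up the stub, call it
                Registry registry = LocateRegistry.getRegistry("127.0.0.1", port);
                DemoService service = (DemoService) registry.lookup("DemoService");
                return service.analyse(dna, new CountBases());
            } finally {
                // Unexport so that the RMI threads let the JVM exit
                UnicastRemoteObject.unexportObject(impl, true);
                UnicastRemoteObject.unexportObject(serverRegistry, true);
            }
        } catch (Exception e) {
            throw new IllegalStateException("RMI round trip failed", e);
        }
    }
}
```

With two separate JVMs, the server would not have the client's analyzer classes on its classpath, which is why the real setup needs the codebase property and the policy files.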

A note about security

Since neither the server nor the client wants to receive malicious code, we have to use policy files:
server.policy:

client.policy:

Compiling and Running

Compiling the client

javac -cp . sandbox/client/SequenceAnalyzerClient.java

Compiling the server

javac -cp . sandbox/server/SequenceAnalyzerServiceImpl.java

Starting the RMI registry

${JAVA_HOME}/bin/rmiregistry

Starting the SequenceAnalyzerServiceImpl

$ java \
 -Djava.security.policy=server.policy \
 -Djava.rmi.server.codebase=file:///path/to/RMI/ \
 -cp . sandbox.server.SequenceAnalyzerServiceImpl

SequenceAnalyzerService bound.

Running the client

$ java  \
 -Djava.rmi.server.codebase=file:///path/to/RMI/ \
 -Djava.security.policy=client.policy  \
 -cp . sandbox.client.SequenceAnalyzerClient  localhost

gi=25 gc%=2.1530612244897958
gi=25 start=TAGTTATTC
gi=26 gc%=2.1443298969072164
gi=26 start=TAGTTATTAA
gi=27 gc%=2.3022222222222224
gi=27 start=AACCAGTATTA
gi=28 gc%=2.376543209876543
gi=28 start=TCGTA
gi=29 gc%=2.2014742014742015
gi=29 start=TCTTTG
That's it, Pierre