08 March 2012

My first walker for the GATK : my notebook

This is my first notebook for developping a new Walker for the Genome Analysis Toolkit. This post was mostly inspired by the following pdf: kvg_20_line_lifesavers_mad_v2.pptx.pdf.

Get the sources

git clone http://github.com/broadgsa/gatk.git GATK.dev
the javac compiler also requires the following library from google :http://code.google.com/p/cofoja/.

A first "Short-Reads" walker

The following class ReadWalker scans the reads and print them as fasta. The @Output annotation tells the GATK that we're going to channel our output through the java.io.PrintStream object. This field is automatically filled by the application runtime.
package mygatk;
import java.io.PrintStream;
import org.broadinstitute.sting.commandline.Output;
import org.broadinstitute.sting.gatk.contexts.ReferenceContext;
import org.broadinstitute.sting.gatk.refdata.ReadMetaDataTracker;
import org.broadinstitute.sting.gatk.walkers.ReadWalker;
import org.broadinstitute.sting.utils.sam.GATKSAMRecord;
public class HelloRead extends ReadWalker<Integer, Integer>
{
/**
tells the GATK that weʼre going to
channel our output through this
object. */
@Output(shortName="o",fullName="output",doc="save as")
PrintStream out;
@Override
public Integer map(ReferenceContext ref, GATKSAMRecord read,
ReadMetaDataTracker metaDataTracker) {
out.println(">"+read.getReadName());
out.println(new String(read.getReadBases()));
return null;
}
@Override
public Integer reduceInit()
{
return null;
}
@Override
public Integer reduce(Integer value, Integer sum)
{
return null;
}
}
view raw HelloRead.java hosted with ❤ by GitHub

Compilation

javac -cp /path/to/GenomeAnalysisTK.jar:/path/to/cofoja-1.0-r139.jar:. \
 -sourcepath src \
 -d tmp src/mygatk/HelloRead.java
jar cvf HelloRead.jar -C tmp .

Running

Here I'm using a BAM from the 'examples' folder of samtools. (We need to pre-process this BAM with picard AddOrReplaceReadGroups). We then use our library as follow:
java -cp path/to/GenomeAnalysisTK.jar:HelloRead.jar \
org.broadinstitute.sting.gatk.CommandLineGATK -T HelloRead \
 -I test.bam \
 -R ${SAMTOOLS}/examples/ex1.fa 

Result:

mkdir -p tmp
javac -cp /home/lindenb/package/GenomeAnalysisTK-1.4-37-g0b29d54/GenomeAnalysisTK.jar:/home/lindenb/package/GATK.dev/lib/cofoja-1.0-r139.jar:. -sourcepath src -d tmp src/mygatk/HelloRead.java
warning: Supported source version 'RELEASE_6' from annotation processor 'com.google.java.contract.core.apt.AnnotationProcessor' less than -source '1.7'
1 warning
jar cvf HelloRead.jar -C tmp .
added manifest
adding: mygatk/(in = 0) (out= 0)(stored 0%)
adding: mygatk/HelloRead.class(in = 1861) (out= 803)(deflated 56%)
java -jar ~/package/picard/picard-tools-1.63/AddOrReplaceReadGroups.jar \
I=/home/lindenb/package/samtools-0.1.18/examples/ex1.bam \
O=test.bam \
RGLB=test.lib \
RGPL=illumina \
RGPU=testunit \
SM=test.sample \
VALIDATION_STRINGENCY=LENIENT
[Thu Mar 08 16:17:55 CET 2012] net.sf.picard.sam.AddOrReplaceReadGroups INPUT=/home/lindenb/package/samtools-0.1.18/examples/ex1.bam OUTPUT=test.bam RGLB=test.lib RGPL=illumina RGPU=testunit RGSM=test.sample VALIDATION_STRINGENCY=LENIENT RGID=1 VERBOSITY=INFO QUIET=false COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
[Thu Mar 08 16:17:55 CET 2012] Executing as lindenb@hardyweinberg on Linux 3.0.0-16-generic i386; Java HotSpot(TM) Server VM 1.7.0_01-b08; Picard version: 1.63(1131)
INFO 2012-03-08 16:17:55 AddOrReplaceReadGroups Created read group ID=1 PL=illumina LB=test.lib SM=test.sample
[Thu Mar 08 16:17:55 CET 2012] net.sf.picard.sam.AddOrReplaceReadGroups done. Elapsed time: 0.01 minutes.
Runtime.totalMemory()=54853632
/home/lindenb/package/samtools-0.1.18/samtools index test.bam
java -cp /home/lindenb/package/GenomeAnalysisTK-1.4-37-g0b29d54/GenomeAnalysisTK.jar:HelloRead.jar org.broadinstitute.sting.gatk.CommandLineGATK -T HelloRead \
-I test.bam \
-R /home/lindenb/package/samtools-0.1.18/examples/ex1.fa
INFO 16:17:57,700 HelpFormatter - ---------------------------------------------------------------------------------
INFO 16:17:57,702 HelpFormatter - The Genome Analysis Toolkit (GATK) v1.4-37-g0b29d54, Compiled 2012/02/27 19:56:39
INFO 16:17:57,702 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 16:17:57,702 HelpFormatter - Please view our documentation at http://www.broadinstitute.org/gsa/wiki
INFO 16:17:57,703 HelpFormatter - For support, please view our support site at http://getsatisfaction.com/gsa
INFO 16:17:57,703 HelpFormatter - Program Args: -T HelloRead -I test.bam -R /home/lindenb/package/samtools-0.1.18/examples/ex1.fa
INFO 16:17:57,703 HelpFormatter - Date/Time: 2012/03/08 16:17:57
INFO 16:17:57,703 HelpFormatter - ---------------------------------------------------------------------------------
INFO 16:17:57,704 HelpFormatter - ---------------------------------------------------------------------------------
INFO 16:17:57,707 GenomeAnalysisEngine - Strictness is SILENT
INFO 16:17:57,743 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO 16:17:57,758 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.02
>B7_591:4:96:693:509
CACTAGTGGCTCATTGTAAATGTGTGGTTTAACTCG
INFO 16:17:57,972 TraversalEngine - [INITIALIZATION COMPLETE; TRAVERSAL STARTING]
INFO 16:17:57,973 TraversalEngine - Location processed.reads runtime per.1M.reads completed total.runtime remaining
>EAS54_65:7:152:368:113
CTAGTGGCTCATTGTAAATGTGTGGTTTAACTCGT
>EAS51_64:8:5:734:57
AGTGGCTCATTGTAAATGTGTGGTTTAACTCGTCC
>B7_591:1:289:587:906
GTGGCTCATTGTAATTTTTTGTTTTAACTCTTCTCT
>EAS56_59:8:38:671:758
GCTCATTGTAAATGTGTGGTTTAACTCGTCCATGG
>EAS56_61:6:18:467:281
ATTGTAAATGTGTGGTTTAACTCGTCCCTGGCCCA
>EAS114_28:5:296:340:699
ATTGTAAATGTGTGGTTTAACTCGTCCATGGCCCAG
>B7_597:6:194:894:408
TGTAAATGTGTGGTTTAACTCGTCCATTGCCCAGC
>EAS188_4:8:12:628:973
AAATGTGTGGTTTAACTCGTCCATGGCCCAGCATT
(...)
>EAS139_11:7:50:1229:1313
TTTTTTCTTTTTTTTTTTTTTTTTTTTGCATGCCA
>EAS54_65:3:320:20:250
TTTTTTTTTTTTTTTTTTTTTTTGCATGCCAGAAA
>EAS114_26:7:37:79:581
TTTTTTTTTTTTTTTTTTTTTTTCATGCCAGAAAA
INFO 16:18:40,276 Walker - [REDUCE RESULT] Traversal result is: null
INFO 16:18:40,277 TraversalEngine - Total runtime 0.45 secs, 0.01 min, 0.00 hours
INFO 16:18:40,298 TraversalEngine - 0 reads were filtered out during traversal out of 3310 total (0.00%)
view raw session.txt hosted with ❤ by GitHub

The Makefile

GATKDIR=${HOME}/package/GenomeAnalysisTK-1.4-37-g0b29d54
COFOJA=${HOME}/package/GATK.dev/lib/cofoja-1.0-r139.jar
SAMTOOLS=${HOME}/package/samtools-0.1.18
test:compile
java -jar ~/package/picard/picard-tools-1.63/AddOrReplaceReadGroups.jar \
I=${SAMTOOLS}/examples/ex1.bam \
O=test.bam \
RGLB=test.lib \
RGPL=illumina \
RGPU=testunit \
SM=test.sample \
VALIDATION_STRINGENCY=LENIENT
${SAMTOOLS}/samtools index test.bam
java -cp ${GATKDIR}/GenomeAnalysisTK.jar:HelloRead.jar org.broadinstitute.sting.gatk.CommandLineGATK -T HelloRead \
-I test.bam \
-R ${SAMTOOLS}/examples/ex1.fa
compile:
mkdir -p tmp
javac -cp ${GATKDIR}/GenomeAnalysisTK.jar:${COFOJA}:. -sourcepath src -d tmp src/mygatk/HelloRead.java
jar cvf HelloRead.jar -C tmp .
view raw Makefile hosted with ❤ by GitHub
That's it, Pierre

04 March 2012

Java Remote Method Invocation (RMI) for Bioinformatics

"Java Remote Method Invocation (Java RMI) enables the programmer to create distributed Java technology-based to Java technology-based applications, in which the methods of remote Java objects can be invoked from other Java virtual machines*, possibly on different hosts. "[Oracle] In the current post a java client will send a java class to the server that will analyze a DNA sequence fetched from the NCBI, using the RMI technology.

Files and directories

I In this example, my files are structured as defined below:
./sandbox/client/FirstBases.java
./sandbox/client/GCPercent.java
./sandbox/client/SequenceAnalyzerClient.java
./sandbox/server/SequenceAnalyzerServiceImpl.java
./sandbox/shared/SequenceAnalyzerService.java
./sandbox/shared/SequenceAnalyzer.java
./client.policy
./server.policy

The Service: SequenceAnalyzerService.java

The remote service provided by the server is defined as an interface named SequenceAnalyzerService: it fetches a DNA sequence for a given NCBI-gi, processes the sequence with an instance of SequenceAnalyzer (see below) and returns a serializable value (that is to say, we can transmit this value through the network).
package sandbox.shared;
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.io.IOException;
import org.xml.sax.SAXException;
public interface SequenceAnalyzerService extends Remote
{
public static final String SERVICE_NAME="efetch";
public java.io.Serializable analyse(int gi,SequenceAnalyzer analyzer) throws IOException,SAXException;
}

Extract a value from a DNA sequence : SequenceAnalyzer

The interface SequenceAnalyzer defines how the remote service should parse a sequence. A SAX Parser will be used by the 'SequenceAnalyzerService' to process a TinySeq-XML document from the NCBI. The method characters is called each time a chunck of sequence is found. At the end, the remote server will return the value calculated from getResult:
package sandbox.shared;
import java.io.Serializable;
public interface SequenceAnalyzer extends Serializable
{
public void characters(char content[],int pos,int length);
public Serializable getResult();
}

Server side : an implementation of SequenceAnalyzerService

The class SequenceAnalyzerServiceImpl is an implementation of the service SequenceAnalyzerService. In the method analyse, a SAXParser is created and the given 'gi' sequence is downloaded from the NCBI. The instance of SequenceAnalyzer received from the client is invoked for each chunck of DNA. At the end, the "value" calculated by the instance of SequenceAnalyzer is returned to the client through the network. The 'main' method contains the code to bind this service to the RMI registry:
package sandbox.server;
import java.rmi.RemoteException;
import java.rmi.registry.LocateRegistry;
import java.rmi.registry.Registry;
import java.rmi.server.UnicastRemoteObject;
import java.io.Serializable;
import java.io.IOException;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.parsers.SAXParser;
import org.xml.sax.*;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.SAXException;
import sandbox.shared.*;
public class SequenceAnalyzerServiceImpl implements SequenceAnalyzerService
{
static private class Handler extends DefaultHandler
{
boolean inSeq=false;
SequenceAnalyzer analyzer;
public void startElement(String uri,
String localName,
String qName,
Attributes attributes) throws SAXException
{
if(qName.equals("TSeq_sequence")) inSeq=true;
}
public void characters(char[] ch,
int start,
int length) throws SAXException
{
if(!inSeq) return;
analyzer.characters(ch,start,length);
}
public void endElement(String uri,
String localName,
String qName) throws SAXException
{
inSeq=false;
}
}
public SequenceAnalyzerServiceImpl()
{
}
@Override
public Serializable analyse(int gi,SequenceAnalyzer analyzer) throws IOException,SAXException
{
try
{
SAXParserFactory factory = SAXParserFactory.newInstance();
SAXParser saxParser = factory.newSAXParser();
Handler handler=new Handler();
handler.analyzer=analyzer;
saxParser.parse(
"http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?"+
"db=nucleotide&rettype=fasta&retmode=xml&id="+gi, handler);
return analyzer.getResult();
}
catch(javax.xml.parsers.ParserConfigurationException err)
{
throw new RuntimeException(err);
}
}
public static void main(String[] args)
{
/* protects access to system resources from untrusted downloaded
code running within the Java virtual machine. */
if (System.getSecurityManager() == null)
{
System.setSecurityManager(new SecurityManager());
}
try
{
SequenceAnalyzerService engine = new SequenceAnalyzerServiceImpl();
/* exports the supplied remote object so that it can
receive invocations of its remote methods from remote clients. */
SequenceAnalyzerService stub = (SequenceAnalyzerService) UnicastRemoteObject.exportObject(
engine,
0 //TCP port
);
/* adds the name to the RMI registry running on the server */
Registry registry = LocateRegistry.getRegistry();
registry.rebind(SequenceAnalyzerService.SERVICE_NAME, stub);
System.out.println("SequenceAnalyzerService bound.");
}
catch (Exception e)
{
e.printStackTrace();
}
}
}

Client side

On the client side, we're going to connect to the SequenceAnalyzerService and send two distinct implementations of SequenceAnalyzer. What's interesting here: the server doesn't know anything about those implementations of SequenceAnalyzer. The client's java compiled classes have to be sent to the service.

GCPercent.java

A first implementation of 'SequenceAnalyzer' computes the GC% of a sequence:
package sandbox.client;
import java.io.Serializable;
import sandbox.shared.SequenceAnalyzer;
public class GCPercent implements SequenceAnalyzer
{
private transient double total=0;
private transient double gc=0;
public GCPercent()
{
}
@Override
public void characters(char content[],int pos,int length)
{
total+=length;
for(int i=0;i< length;++i)
{
switch(content[pos+i])
{
case 'G':case 'C':
case 'g':case 'c':
case 's': gc++;
default:break;
}
}
}
@Override
public Serializable getResult()
{
if(total==0) return null;
return total/gc;
}
}
view raw GCPercent.java hosted with ❤ by GitHub

FirstBases

The second implementation of 'SequenceAnalyzer' retrieves the first bases of a sequence.
package sandbox.client;
import java.io.Serializable;
import sandbox.shared.SequenceAnalyzer;
public class FirstBases implements SequenceAnalyzer
{
private int count=0;
private transient StringBuilder sequence=null;
public FirstBases()
{
}
public int getCount()
{
return count;
}
public void setCount(int count)
{
this.count=count;
}
@Override
public void characters(char content[],int pos,int length)
{
if(sequence==null) sequence=new StringBuilder(getCount());
for(int i=0;i< length && sequence.length() < getCount();++i)
{
sequence.append(content[i+pos]);
}
}
@Override
public Serializable getResult()
{
if(sequence==null) return null;
return sequence.toString();
}
}
view raw FirstBases.java hosted with ❤ by GitHub

The Client

And here is the java code for the client. The client connects to the RMI server and invokes 'analyse' with the two instances of SequenceAnalyzer for some NCBI-gi:
package sandbox.client;
import java.rmi.registry.LocateRegistry;
import java.rmi.registry.Registry;
import sandbox.shared.*;
public class SequenceAnalyzerClient
{
public static void main(String args[]) throws Exception
{
/* the process of receiving the server remote object's stub
could require downloading class definitions from the server. */
if (System.getSecurityManager() == null)
{
System.setSecurityManager(new SecurityManager());
}
Registry registry = LocateRegistry.getRegistry(args[0]);
SequenceAnalyzerService comp = (SequenceAnalyzerService) registry.lookup(SequenceAnalyzerService.SERVICE_NAME);
for(int gi=25;gi<30;++gi)
{
GCPercent analyzer1=new GCPercent();
FirstBases analyzer2=new FirstBases();
analyzer2.setCount(5+gi%7);
System.err.println("gi="+gi+" gc%="+comp.analyse(gi,analyzer1));
System.err.println("gi="+gi+" start="+comp.analyse(gi,analyzer2));
}
}
}

A note about security

As the server/client doesn't want to receive some malicious code, we have to use some policy files:
server.policy:
grant {
permission java.security.AllPermission;
};
view raw server.policy hosted with ❤ by GitHub

client.policy:
grant {
permission java.security.AllPermission;
};
view raw client.policy hosted with ❤ by GitHub

Compiling and Running

Compiling the client

javac -cp . sandbox/client/SequenceAnalyzerClient.java

Compiling the server

javac -cp . sandbox/server/SequenceAnalyzerServiceImpl.java

Starting the RMI registry

${JAVA_HOME}/bin/rmiregistry

Starting the SequenceAnalyzerServiceImpl

$ java \
 -Djava.security.policy=server.policy \
 -Djava.rmserver.codebase=file:///path/to/RMI/ \
 -cp . sandbox.server.SequenceAnalyzerServiceImpl

SequenceAnalyzerService bound.

Running the client

$ java  \
 -Djava.rmi.server.codebase=file:///path/to/RMI/ \
 -Djava.security.policy=client.policy  \
 -cp . sandbox.client.SequenceAnalyzerClient  localhost

gi=25 gc%=2.1530612244897958
gi=25 start=TAGTTATTC
gi=26 gc%=2.1443298969072164
gi=26 start=TAGTTATTAA
gi=27 gc%=2.3022222222222224
gi=27 start=AACCAGTATTA
gi=28 gc%=2.376543209876543
gi=28 start=TCGTA
gi=29 gc%=2.2014742014742015
gi=29 start=TCTTTG
That's it, Pierre