08 March 2012

My first walker for the GATK : my notebook

This is my first notebook for developping a new Walker for the Genome Analysis Toolkit. This post was mostly inspired by the following pdf: kvg_20_line_lifesavers_mad_v2.pptx.pdf.

Get the sources

git clone http://github.com/broadgsa/gatk.git GATK.dev
the javac compiler also requires the following library from google :http://code.google.com/p/cofoja/.

A first "Short-Reads" walker

The following class ReadWalker scans the reads and print them as fasta. The @Output annotation tells the GATK that we're going to channel our output through the java.io.PrintStream object. This field is automatically filled by the application runtime.
package mygatk;
import java.io.PrintStream;
import org.broadinstitute.sting.commandline.Output;
import org.broadinstitute.sting.gatk.contexts.ReferenceContext;
import org.broadinstitute.sting.gatk.refdata.ReadMetaDataTracker;
import org.broadinstitute.sting.gatk.walkers.ReadWalker;
import org.broadinstitute.sting.utils.sam.GATKSAMRecord;
public class HelloRead extends ReadWalker<Integer, Integer>
{
/**
tells the GATK that weʼre going to
channel our output through this
object. */
@Output(shortName="o",fullName="output",doc="save as")
PrintStream out;
@Override
public Integer map(ReferenceContext ref, GATKSAMRecord read,
ReadMetaDataTracker metaDataTracker) {
out.println(">"+read.getReadName());
out.println(new String(read.getReadBases()));
return null;
}
@Override
public Integer reduceInit()
{
return null;
}
@Override
public Integer reduce(Integer value, Integer sum)
{
return null;
}
}
view raw HelloRead.java hosted with ❤ by GitHub

Compilation

javac -cp /path/to/GenomeAnalysisTK.jar:/path/to/cofoja-1.0-r139.jar:. \
 -sourcepath src \
 -d tmp src/mygatk/HelloRead.java
jar cvf HelloRead.jar -C tmp .

Running

Here I'm using a BAM from the 'examples' folder of samtools. (We need to pre-process this BAM with picard AddOrReplaceReadGroups). We then use our library as follow:
java -cp path/to/GenomeAnalysisTK.jar:HelloRead.jar \
org.broadinstitute.sting.gatk.CommandLineGATK -T HelloRead \
 -I test.bam \
 -R ${SAMTOOLS}/examples/ex1.fa 

Result:

mkdir -p tmp
javac -cp /home/lindenb/package/GenomeAnalysisTK-1.4-37-g0b29d54/GenomeAnalysisTK.jar:/home/lindenb/package/GATK.dev/lib/cofoja-1.0-r139.jar:. -sourcepath src -d tmp src/mygatk/HelloRead.java
warning: Supported source version 'RELEASE_6' from annotation processor 'com.google.java.contract.core.apt.AnnotationProcessor' less than -source '1.7'
1 warning
jar cvf HelloRead.jar -C tmp .
added manifest
adding: mygatk/(in = 0) (out= 0)(stored 0%)
adding: mygatk/HelloRead.class(in = 1861) (out= 803)(deflated 56%)
java -jar ~/package/picard/picard-tools-1.63/AddOrReplaceReadGroups.jar \
I=/home/lindenb/package/samtools-0.1.18/examples/ex1.bam \
O=test.bam \
RGLB=test.lib \
RGPL=illumina \
RGPU=testunit \
SM=test.sample \
VALIDATION_STRINGENCY=LENIENT
[Thu Mar 08 16:17:55 CET 2012] net.sf.picard.sam.AddOrReplaceReadGroups INPUT=/home/lindenb/package/samtools-0.1.18/examples/ex1.bam OUTPUT=test.bam RGLB=test.lib RGPL=illumina RGPU=testunit RGSM=test.sample VALIDATION_STRINGENCY=LENIENT RGID=1 VERBOSITY=INFO QUIET=false COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
[Thu Mar 08 16:17:55 CET 2012] Executing as lindenb@hardyweinberg on Linux 3.0.0-16-generic i386; Java HotSpot(TM) Server VM 1.7.0_01-b08; Picard version: 1.63(1131)
INFO 2012-03-08 16:17:55 AddOrReplaceReadGroups Created read group ID=1 PL=illumina LB=test.lib SM=test.sample
[Thu Mar 08 16:17:55 CET 2012] net.sf.picard.sam.AddOrReplaceReadGroups done. Elapsed time: 0.01 minutes.
Runtime.totalMemory()=54853632
/home/lindenb/package/samtools-0.1.18/samtools index test.bam
java -cp /home/lindenb/package/GenomeAnalysisTK-1.4-37-g0b29d54/GenomeAnalysisTK.jar:HelloRead.jar org.broadinstitute.sting.gatk.CommandLineGATK -T HelloRead \
-I test.bam \
-R /home/lindenb/package/samtools-0.1.18/examples/ex1.fa
INFO 16:17:57,700 HelpFormatter - ---------------------------------------------------------------------------------
INFO 16:17:57,702 HelpFormatter - The Genome Analysis Toolkit (GATK) v1.4-37-g0b29d54, Compiled 2012/02/27 19:56:39
INFO 16:17:57,702 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 16:17:57,702 HelpFormatter - Please view our documentation at http://www.broadinstitute.org/gsa/wiki
INFO 16:17:57,703 HelpFormatter - For support, please view our support site at http://getsatisfaction.com/gsa
INFO 16:17:57,703 HelpFormatter - Program Args: -T HelloRead -I test.bam -R /home/lindenb/package/samtools-0.1.18/examples/ex1.fa
INFO 16:17:57,703 HelpFormatter - Date/Time: 2012/03/08 16:17:57
INFO 16:17:57,703 HelpFormatter - ---------------------------------------------------------------------------------
INFO 16:17:57,704 HelpFormatter - ---------------------------------------------------------------------------------
INFO 16:17:57,707 GenomeAnalysisEngine - Strictness is SILENT
INFO 16:17:57,743 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO 16:17:57,758 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.02
>B7_591:4:96:693:509
CACTAGTGGCTCATTGTAAATGTGTGGTTTAACTCG
INFO 16:17:57,972 TraversalEngine - [INITIALIZATION COMPLETE; TRAVERSAL STARTING]
INFO 16:17:57,973 TraversalEngine - Location processed.reads runtime per.1M.reads completed total.runtime remaining
>EAS54_65:7:152:368:113
CTAGTGGCTCATTGTAAATGTGTGGTTTAACTCGT
>EAS51_64:8:5:734:57
AGTGGCTCATTGTAAATGTGTGGTTTAACTCGTCC
>B7_591:1:289:587:906
GTGGCTCATTGTAATTTTTTGTTTTAACTCTTCTCT
>EAS56_59:8:38:671:758
GCTCATTGTAAATGTGTGGTTTAACTCGTCCATGG
>EAS56_61:6:18:467:281
ATTGTAAATGTGTGGTTTAACTCGTCCCTGGCCCA
>EAS114_28:5:296:340:699
ATTGTAAATGTGTGGTTTAACTCGTCCATGGCCCAG
>B7_597:6:194:894:408
TGTAAATGTGTGGTTTAACTCGTCCATTGCCCAGC
>EAS188_4:8:12:628:973
AAATGTGTGGTTTAACTCGTCCATGGCCCAGCATT
(...)
>EAS139_11:7:50:1229:1313
TTTTTTCTTTTTTTTTTTTTTTTTTTTGCATGCCA
>EAS54_65:3:320:20:250
TTTTTTTTTTTTTTTTTTTTTTTGCATGCCAGAAA
>EAS114_26:7:37:79:581
TTTTTTTTTTTTTTTTTTTTTTTCATGCCAGAAAA
INFO 16:18:40,276 Walker - [REDUCE RESULT] Traversal result is: null
INFO 16:18:40,277 TraversalEngine - Total runtime 0.45 secs, 0.01 min, 0.00 hours
INFO 16:18:40,298 TraversalEngine - 0 reads were filtered out during traversal out of 3310 total (0.00%)
view raw session.txt hosted with ❤ by GitHub

The Makefile

GATKDIR=${HOME}/package/GenomeAnalysisTK-1.4-37-g0b29d54
COFOJA=${HOME}/package/GATK.dev/lib/cofoja-1.0-r139.jar
SAMTOOLS=${HOME}/package/samtools-0.1.18
test:compile
java -jar ~/package/picard/picard-tools-1.63/AddOrReplaceReadGroups.jar \
I=${SAMTOOLS}/examples/ex1.bam \
O=test.bam \
RGLB=test.lib \
RGPL=illumina \
RGPU=testunit \
SM=test.sample \
VALIDATION_STRINGENCY=LENIENT
${SAMTOOLS}/samtools index test.bam
java -cp ${GATKDIR}/GenomeAnalysisTK.jar:HelloRead.jar org.broadinstitute.sting.gatk.CommandLineGATK -T HelloRead \
-I test.bam \
-R ${SAMTOOLS}/examples/ex1.fa
compile:
mkdir -p tmp
javac -cp ${GATKDIR}/GenomeAnalysisTK.jar:${COFOJA}:. -sourcepath src -d tmp src/mygatk/HelloRead.java
jar cvf HelloRead.jar -C tmp .
view raw Makefile hosted with ❤ by GitHub
That's it, Pierre

1 comment:

Anonymous said...

This is a great tutorial. If your readers are hungry for more ways to leverage the power of the GATK, the GATK team at the Broad Institute is planning a workshop for users this Fall. If you’re interested in attending the workshop, you can vote on the topics and activities that you’d like the workshop to include by filling in this survey: http://www.surveymonkey.com/s/T799FQK