Writing a Custom ReadFilter for the GATK, my notebook.
The GATK contains a set of predefined read filters that "filter or transfer incoming SAM/BAM data files":
- BadCigar
- BadMate
- CountingRead
- DuplicateRead
- FailsVendorQualityCheck
- LibraryRead
- MalformedRead
- MappingQuality
- MappingQualityUnavailable
- (...)
The Read filters extend the class org.broadinstitute.gatk.engine.filters.ReadFilter:
public abstract class ReadFilter implements SamRecordFilter { public void initialize(GenomeAnalysisEngine engine) {} public boolean filterOut(final SAMRecord first, final SAMRecord second) { throw new UnsupportedOperationException("Paired filter not implemented: " + this.getClass()); } }Thus the implementation of our filter, named My is defined below:
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
/** package is package org.broadinstitute.gatk.engine.filters like the other filter in GATK */ | |
package org.broadinstitute.gatk.engine.filters; | |
import htsjdk.samtools.*; | |
import org.broadinstitute.gatk.utils.commandline.Argument; | |
import org.broadinstitute.gatk.engine.filters.*; | |
import org.broadinstitute.gatk.engine.GenomeAnalysisEngine; | |
/** | |
* author: Pierre Lindenbaum PhD | |
* An example of custom ReadFilter for GATK | |
*/ | |
public class MyFilter extends ReadFilter { | |
@Argument(fullName = "clippattern", shortName = "clippat", doc="Discard reads having a clip containing the sequence 'pattern'", required=true) | |
/** the DNA sequence to search */ | |
private String pattern; | |
/** the reverse complement of 'pattern' , will be initialized in 'initialize(engine)' */ | |
private String reverseCompl = null; | |
public MyFilter() | |
{ | |
} | |
/** initialize this filter . Create the reverse complement of pattern*/ | |
@Override | |
public void initialize(final GenomeAnalysisEngine engine) { | |
this.pattern = this.pattern.toUpperCase(); | |
this.reverseCompl = new String( org.broadinstitute.gatk.utils.BaseUtils.simpleReverseComplement(this.pattern.getBytes() )); | |
} | |
@Override | |
public boolean filterOut(final SAMRecord read) { | |
/* no clip for unmapped reads */ | |
if(read.getReadUnmappedFlag()) return false; | |
final Cigar cigar = read.getCigar(); | |
/* cigar must exist and must have at least 2 elements */ | |
if( cigar==null || cigar.numCigarElements() <2 ) return false; | |
/* check 5' and 3' side of the cigar */ | |
for(int side=0;side < 2;++side) | |
{ | |
/* get cigarelement */ | |
final CigarElement ce = (side==0?cigar.getCigarElement(0) : cigar.getCigarElement(cigar.numCigarElements()-1)); | |
/* cigar element must be a clip */ | |
if( ce.getOperator() != CigarOperator.S || ce.getLength()< pattern.length() ) continue; | |
/* get the clipped sequence */ | |
final String clipSeq; | |
if(side==0) | |
{ | |
clipSeq = read.getReadString().substring(0, ce.getLength()).toUpperCase(); | |
} | |
else | |
{ | |
clipSeq = read.getReadString().substring(read.getReadLength()-ce.getLength()).toUpperCase(); | |
} | |
/* discard the read of the clip contains the pattern */ | |
if( clipSeq.contains(this.pattern) || clipSeq.contains(this.reverseCompl) ) return true; | |
} | |
/* at this point, don't discard the read */ | |
return false; | |
} | |
} |
The class is compiled and packaged using the gatk itself as a library :
mkdir -p org/broadinstitute/gatk/engine/filters cp MyFilter.java org/broadinstitute/gatk/engine/filters javac -cp /path/to/GenomeAnalysisTK.jar org/broadinstitute/gatk/engine/filters/MyFilter.java jar cvf myfilter.jar org rm -rf orgThe GATK main class is org.broadinstitute.gatk.engine.CommandLineGATK, we can now check that there is a filter named My in the list of read Filters.
$ java -cp myfilter.jar:/path/to/GenomeAnalysisTK.jar org.broadinstitute.gatk.engine.CommandLineGATK \
-T ReadLengthDistribution \
--read_filter generate_and_error_please
(...)
##### ERROR MESSAGE: Invalid command line: Malformed read filter: Read filter generate_and_error_please not found. Available read filters:
##### ERROR
##### ERROR FilterName Documentation
##### ERROR MissingReadGroup https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_engine_filters_MissingReadGroupFilter.php
##### ERROR My https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_engine_filters_MyFilter.php
##### ERROR NoOriginalQualityScores https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_engine_filters_NoOriginalQualityScoresFilter.php
We can now use the GATK tools with or without the 'My' filter .java -cp myfilter.jar:/path/to/GenomeAnalysisTK.jar org.broadinstitute.gatk.engine.CommandLineGATK \ -T ReadLengthDistribution \ -R /path/to/ref.fa \ -I /path/to/test.bam \ -L 22 -o dist.ATGATA.tbl \ --read_filter My --clippattern ATGATA (...) INFO 10:53:32,412 MicroScheduler - 32 reads were filtered out during the traversal out of approximately 12434 total reads (0.26%) INFO 10:53:32,412 MicroScheduler - -> 0 reads (0.00% of total) failing BadCigarFilter INFO 10:53:32,412 MicroScheduler - -> 0 reads (0.00% of total) failing MalformedReadFilter INFO 10:53:32,413 MicroScheduler - -> 32 reads (0.26% of total) failing MyFilter $ cat dist.ATGATA.tbl #:GATKReport.v1.1:1 #:GATKTable:2:130:%s:%s:; #:GATKTable:ReadLengthDistribution:Table of read length distributions readLength Sample 19 6 20 4 21 4 22 13 23 4 24 12
while without 'My' the output is:
java -cp myfilter.jar:/path/to/GenomeAnalysisTK.jar org.broadinstitute.gatk.engine.CommandLineGATK \ -T ReadLengthDistribution \ -R /path/to/ref.fa \ -I /path/to/test.bam \ -L 22 -o dist.all.tbl (...) INFO 10:53:35,713 MicroScheduler - 0 reads were filtered out during the traversal out of approximately 12434 total reads (0.00%) INFO 10:53:35,713 MicroScheduler - -> 0 reads (0.00% of total) failing BadCigarFilter INFO 10:53:35,714 MicroScheduler - -> 0 reads (0.00% of total) failing MalformedReadFilter ------------------------------------------------------------------------------------------ Done. There were no warn messages. ------------------------------------------------------------------------------------------ $ cat dist.all.tbl #:GATKReport.v1.1:1 #:GATKTable:2:130:%s:%s:; #:GATKTable:ReadLengthDistribution:Table of read length distributions readLength Sample 19 6 20 4 21 4 22 13 23 4 24 12
The Makefile
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
gatk.jar=/path/to/GenomeAnalysisTK.jar | |
REF=/path/to/fasta.fa | |
BAM=example.bam | |
all: dist.ATGATA.tbl dist.all.tbl | |
# test with filter | |
dist.ATGATA.tbl : myfilter.jar | |
java -cp myfilter.jar:${gatk.jar} org.broadinstitute.gatk.engine.CommandLineGATK \ | |
-T ReadLengthDistribution \ | |
-R ${REF} \ | |
-I ${BAM} \ | |
-L 22 -o $@ \ | |
--read_filter My --clippattern ATGATA | |
# test without filter | |
dist.all.tbl : myfilter.jar | |
java -cp myfilter.jar:${gatk.jar} org.broadinstitute.gatk.engine.CommandLineGATK \ | |
-T ReadLengthDistribution \ | |
-R ${REF} \ | |
-I ${BAM} \ | |
-L 22 -o $@ | |
#compile and package | |
myfilter.jar : MyFilter.java | |
mkdir -p org/broadinstitute/gatk/engine/filters | |
cp $< org/broadinstitute/gatk/engine/filters | |
javac -cp ${gatk.jar} org/broadinstitute/gatk/engine/filters/MyFilter.java | |
jar cvf $@ org | |
rm -rf org | |
clean: | |
rm -rf org myfilter.jar dist.all.tbl dist.ATGATA.tbl |
That's it,
Pierre