22 September 2016

Writing a Custom ReadFilter for the GATK, my notebook.


The GATK contains a set of predefined read filters that "filter or transfer incoming SAM/BAM data files":

With the help of the modular architecture of the GATK, it's possible to write a custom ReadFilter. In this post I'll write a ReadFilter that removes the reads if they contain a specific sequence in a soft-clipped segment.
The Read filters extend the class org.broadinstitute.gatk.engine.filters.ReadFilter:
public abstract class ReadFilter implements SamRecordFilter {
    public void initialize(GenomeAnalysisEngine engine) {}
    public boolean filterOut(final SAMRecord first, final SAMRecord second) {
        throw new UnsupportedOperationException("Paired filter not implemented: " + this.getClass());
    }
}
Thus the implementation of our filter, named My is defined below:
/** package is package org.broadinstitute.gatk.engine.filters like the other filter in GATK */
package org.broadinstitute.gatk.engine.filters;
import htsjdk.samtools.*;
import org.broadinstitute.gatk.utils.commandline.Argument;
import org.broadinstitute.gatk.engine.filters.*;
import org.broadinstitute.gatk.engine.GenomeAnalysisEngine;
/**
* author: Pierre Lindenbaum PhD
* An example of custom ReadFilter for GATK
*/
public class MyFilter extends ReadFilter {
@Argument(fullName = "clippattern", shortName = "clippat", doc="Discard reads having a clip containing the sequence 'pattern'", required=true)
/** the DNA sequence to search */
private String pattern;
/** the reverse complement of 'pattern' , will be initialized in 'initialize(engine)' */
private String reverseCompl = null;
public MyFilter()
{
}
/** initialize this filter . Create the reverse complement of pattern*/
@Override
public void initialize(final GenomeAnalysisEngine engine) {
this.pattern = this.pattern.toUpperCase();
this.reverseCompl = new String( org.broadinstitute.gatk.utils.BaseUtils.simpleReverseComplement(this.pattern.getBytes() ));
}
@Override
public boolean filterOut(final SAMRecord read) {
/* no clip for unmapped reads */
if(read.getReadUnmappedFlag()) return false;
final Cigar cigar = read.getCigar();
/* cigar must exist and must have at least 2 elements */
if( cigar==null || cigar.numCigarElements() <2 ) return false;
/* check 5' and 3' side of the cigar */
for(int side=0;side < 2;++side)
{
/* get cigarelement */
final CigarElement ce = (side==0?cigar.getCigarElement(0) : cigar.getCigarElement(cigar.numCigarElements()-1));
/* cigar element must be a clip */
if( ce.getOperator() != CigarOperator.S || ce.getLength()< pattern.length() ) continue;
/* get the clipped sequence */
final String clipSeq;
if(side==0)
{
clipSeq = read.getReadString().substring(0, ce.getLength()).toUpperCase();
}
else
{
clipSeq = read.getReadString().substring(read.getReadLength()-ce.getLength()).toUpperCase();
}
/* discard the read of the clip contains the pattern */
if( clipSeq.contains(this.pattern) || clipSeq.contains(this.reverseCompl) ) return true;
}
/* at this point, don't discard the read */
return false;
}
}
view raw MyFilter.java hosted with ❤ by GitHub

The class is compiled and packaged using the gatk itself as a library :
mkdir -p  org/broadinstitute/gatk/engine/filters
cp MyFilter.java org/broadinstitute/gatk/engine/filters
javac -cp /path/to/GenomeAnalysisTK.jar org/broadinstitute/gatk/engine/filters/MyFilter.java
jar cvf myfilter.jar org
rm -rf org
The GATK main class is org.broadinstitute.gatk.engine.CommandLineGATK, we can now check that there is a filter named My in the list of read Filters.
$ java -cp myfilter.jar:/path/to/GenomeAnalysisTK.jar  org.broadinstitute.gatk.engine.CommandLineGATK  \
  -T ReadLengthDistribution \
  --read_filter generate_and_error_please
(...)
##### ERROR MESSAGE: Invalid command line: Malformed read filter: Read filter generate_and_error_please not found. Available read filters:
##### ERROR 
##### ERROR                                 FilterName        Documentation
##### ERROR                           MissingReadGroup        https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_engine_filters_MissingReadGroupFilter.php
##### ERROR                                         My        https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_engine_filters_MyFilter.php
##### ERROR                    NoOriginalQualityScores        https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_engine_filters_NoOriginalQualityScoresFilter.php
We can now use the GATK tools with or without the 'My' filter .
java -cp myfilter.jar:/path/to/GenomeAnalysisTK.jar org.broadinstitute.gatk.engine.CommandLineGATK \
  -T ReadLengthDistribution \
  -R /path/to/ref.fa \
  -I /path/to/test.bam \
  -L 22 -o dist.ATGATA.tbl \
  --read_filter My --clippattern ATGATA 
(...)
INFO  10:53:32,412 MicroScheduler - 32 reads were filtered out during the traversal out of approximately 12434 total reads (0.26%) 
INFO  10:53:32,412 MicroScheduler -   -> 0 reads (0.00% of total) failing BadCigarFilter 
INFO  10:53:32,412 MicroScheduler -   -> 0 reads (0.00% of total) failing MalformedReadFilter 
INFO  10:53:32,413 MicroScheduler -   -> 32 reads (0.26% of total) failing MyFilter

$ cat dist.ATGATA.tbl
#:GATKReport.v1.1:1
#:GATKTable:2:130:%s:%s:;
#:GATKTable:ReadLengthDistribution:Table of read length distributions
readLength  Sample
        19        6
        20        4
        21        4
        22       13
        23        4
        24       12



while without 'My' the output is:

java -cp myfilter.jar:/path/to/GenomeAnalysisTK.jar org.broadinstitute.gatk.engine.CommandLineGATK \
  -T ReadLengthDistribution \
  -R /path/to/ref.fa \
  -I /path/to/test.bam \
  -L 22 -o dist.all.tbl
(...)
INFO  10:53:35,713 MicroScheduler - 0 reads were filtered out during the traversal out of approximately 12434 total reads (0.00%) 
INFO  10:53:35,713 MicroScheduler -   -> 0 reads (0.00% of total) failing BadCigarFilter 
INFO  10:53:35,714 MicroScheduler -   -> 0 reads (0.00% of total) failing MalformedReadFilter 
------------------------------------------------------------------------------------------
Done. There were no warn messages.
------------------------------------------------------------------------------------------

$ cat  dist.all.tbl
#:GATKReport.v1.1:1
#:GATKTable:2:130:%s:%s:;
#:GATKTable:ReadLengthDistribution:Table of read length distributions
readLength  Sample
        19        6
        20        4
        21        4
        22       13
        23        4
        24       12

The Makefile

gatk.jar=/path/to/GenomeAnalysisTK.jar
REF=/path/to/fasta.fa
BAM=example.bam
all: dist.ATGATA.tbl dist.all.tbl
# test with filter
dist.ATGATA.tbl : myfilter.jar
java -cp myfilter.jar:${gatk.jar} org.broadinstitute.gatk.engine.CommandLineGATK \
-T ReadLengthDistribution \
-R ${REF} \
-I ${BAM} \
-L 22 -o $@ \
--read_filter My --clippattern ATGATA
# test without filter
dist.all.tbl : myfilter.jar
java -cp myfilter.jar:${gatk.jar} org.broadinstitute.gatk.engine.CommandLineGATK \
-T ReadLengthDistribution \
-R ${REF} \
-I ${BAM} \
-L 22 -o $@
#compile and package
myfilter.jar : MyFilter.java
mkdir -p org/broadinstitute/gatk/engine/filters
cp $< org/broadinstitute/gatk/engine/filters
javac -cp ${gatk.jar} org/broadinstitute/gatk/engine/filters/MyFilter.java
jar cvf $@ org
rm -rf org
clean:
rm -rf org myfilter.jar dist.all.tbl dist.ATGATA.tbl
view raw Makefile hosted with ❤ by GitHub


That's it,
Pierre

No comments: