26 April 2013

Java JNI bindings for BWA(mem-lite)

Motivation

BWA 7.4(http://bio-bwa.sourceforge.net/) contains a small C example(https://github.com/lh3/bwa/blob/master/example.c) for running bwa-mem as a library (bwamem-lite). I created some JNI bindings to see if I can bind the C bwa library to java and get the same output than bwamem-lite. I put the code on github at https://github.com/lindenb/jbwa.

Example


(compare to https://github.com/lh3/bwa/blob/master/example.c )

System.loadLibrary("bwajni");
BwaIndex index=new BwaIndex(new File("hg19.fa"));
BwaMem mem=new BwaMem(index);
KSeq kseq=new KSeq(new File("input.fastq.gz");
ShortRead read=null;
while((read=kseq.next())!=null)
        {
        for(AlnRgn a: mem.align(read))
                {
                if(a.getSecondary()>=0) continue;
                System.out.println(  read.getName()+"\t"+  a.getStrand()+"\t"+  a.getChrom()+"\t"+
                        a.getPos()+"\t"+ a.getMQual()+"\t"+ a.getCigar()+"\t"+  a.getNm() );
                }
        }
kseq.dispose();
index.close();
mem.dispose();

Testing


Here is the ouput of the JAVA version:

gunzip -c input.fastq.gz | head -n 4000 |\
java  -Djava.library.path=src/main/native -cp src/main/java \
   com.github.lindenb.jbwa.jni.Example human_g1k_v37.fasta -| tail 


HWI-1KL149:20:C1CU7ACXX:4:1101:3077:33410       +       3       38647538        60      89M11S  1
HWI-1KL149:20:C1CU7ACXX:4:1101:3396:33445       +       8       52567289        60      100M    1
HWI-1KL149:20:C1CU7ACXX:4:1101:10013:33288      -       1       156104115       60      100M    1
HWI-1KL149:20:C1CU7ACXX:4:1101:10390:33496      -       6       123824853       60      100M    1
HWI-1KL149:20:C1CU7ACXX:4:1101:13537:33483      +       2       157367092       60      100M    1
HWI-1KL149:20:C1CU7ACXX:4:1101:14139:33390      +       20      31413797        60      100M    1
HWI-1KL149:20:C1CU7ACXX:4:1101:14514:33458      +       2       179401813       60      100M    1
HWI-1KL149:20:C1CU7ACXX:4:1101:15292:33282      +       15      63335820        60      100M    1
HWI-1KL149:20:C1CU7ACXX:4:1101:16960:33276      -       12      110782784       60      100M    1
HWI-1KL149:20:C1CU7ACXX:4:1101:17355:33322      +       6       126077895       60      100M    1

And the ouput of the Native C version:

gunzip -c input.fastq.gz | head -n 4000 |\
bwa-0.7.4/bwamem-lite human_g1k_v37.fasta - | tail 

HWI-1KL149:20:C1CU7ACXX:4:1101:3077:33410       +       3       38647538        60      89M11S  1
HWI-1KL149:20:C1CU7ACXX:4:1101:3396:33445       +       8       52567289        60      100M    1
HWI-1KL149:20:C1CU7ACXX:4:1101:10013:33288      -       1       156104115       60      100M    1
HWI-1KL149:20:C1CU7ACXX:4:1101:10390:33496      -       6       123824853       60      100M    1
HWI-1KL149:20:C1CU7ACXX:4:1101:13537:33483      +       2       157367092       60      100M    1
HWI-1KL149:20:C1CU7ACXX:4:1101:14139:33390      +       20      31413797        60      100M    1
HWI-1KL149:20:C1CU7ACXX:4:1101:14514:33458      +       2       179401813       60      100M    1
HWI-1KL149:20:C1CU7ACXX:4:1101:15292:33282      +       15      63335820        60      100M    1
HWI-1KL149:20:C1CU7ACXX:4:1101:16960:33276      -       12      110782784       60      100M    1
HWI-1KL149:20:C1CU7ACXX:4:1101:17355:33322      +       6       126077895       60      100M    1

GUI


As a test I also created a swing-Based interface for BWA:

java  -Djava.library.path=src/main/native  -cp src/main/java \
    com.github.lindenb.jbwa.jni.BwaFrame human_g1k_v37.fasta



That's it,


Pierre

23 April 2013

playing with BWA-MEM : my notebook

BWA-MEM, is a new tool that is part of the latest version of BWA. As said Heng Li on Biostar bwa-mem can be used to identify the viral integration sites in the human genome. Here I've used various short reads to explore how bwa maps the pairs. The sequences I used below are:

  • NOTCH2a and NOTCH2b two sequences on the chromosomes 1, on the same strand
  • NOTCH2del : same as NOTCH2a but with a deletion
  • ROXAN1a and ROXAN1b: two sequences on the chromosome 22, on the same strand
  • NOTCH2ins is NOTCH2a with and insertion of ROXAN1a
The following results were generated with the makefile below. Each pair of fastq was submitted twice (1_* and 2_*). I also added some pairs of reads from another source of FASTQs to let bwa infer the size of the fragments (see biostar: http://www.biostars.org/p/69694).

NOTCH2a vs NOTCH2b BWA-MEM options: none

NOTCH2a vs NOTCH2b
BWA-MEM options: none
NOT Properly mapped reads because they're on the same strand.
ReadFlagChromPosqualCigarExtra
1_Notch2p1=651120572000950MNM:i:0 AS:i:50 XS:i:45
1_Notch2p2=1291120572200950MNM:i:0 AS:i:50 XS:i:45
2_Notch2p11120572000950MNM:i:0 AS:i:50 XS:i:45
2_Notch2p21120572200950MNM:i:0 AS:i:50 XS:i:45

NOTCH2a vs revcomp(NOTCH2b) BWA-MEM options: none

NOTCH2a vs revcomp(NOTCH2b)
BWA-MEM options: none
Properly mapped reads.
ReadFlagChromPosqualCigarExtra
1_Notch2pPR1=9911205720003950MNM:i:0 AS:i:50 XS:i:45
1_Notch2pPr2=14711205722003950MNM:i:0 AS:i:50 XS:i:45
2_Notch2pPR111205720003950MNM:i:0 AS:i:50 XS:i:45
2_Notch2pPr211205722003950MNM:i:0 AS:i:50 XS:i:45

NOTCH2del vs revcomp(NOTCH2b) BWA-MEM options: none

NOTCH2del vs revcomp(NOTCH2b)
BWA-MEM options: none
Deletion in Read1
ReadFlagChromPosqualCigarExtra
1_Notch2delpPR1=9911205720003937M20D31MNM:i:20 AS:i:42 XS:i:37
1_Notch2delpPr2=14711205722003950MNM:i:0 AS:i:50 XS:i:45
2_Notch2delpPR111205720003937M20D31MNM:i:20 AS:i:42 XS:i:37
2_Notch2delpPr211205722003950MNM:i:0 AS:i:50 XS:i:45

NOTCH2ins vs revcomp(NOTCH2b) BWA-MEM options:

NOTCH2ins vs revcomp(NOTCH2b)
BWA-MEM options: non
Soft clipping of chr1: two hits on different chromosomes.
ReadFlagChromPosqualCigarExtra
1_NOTCH2insRoxanpR1=9722417215666023S50M27SNM:i:0 AS:i:50 XS:i:0
1_NOTCH2insRoxanpr2=1451120572200950MNM:i:0 AS:i:50 XS:i:45
2_NOTCH2insRoxanpR122417215666023S50M27SNM:i:0 AS:i:50 XS:i:0
2_NOTCH2insRoxanpr21120572200950MNM:i:0 AS:i:50 XS:i:45


NOTCH2a+ROXAN1a vs revcomp(NOTCH2b) BWA-MEM options: none

NOTCH2a+ROXAN1a vs revcomp(NOTCH2b)
BWA-MEM options: none
two hits on chromosome chr1 and chr22 for the 5' seq
nevertheless, properly paired was set.
ReadFlagChromPosqualCigarExtra
1_Notch2RoxanpPR1=991120572000950M50SNM:i:0 AS:i:50 XS:i:45 XP:Z:22,+41721566,50S50M,9,0;
1_Notch2RoxanpPR1=992241721566950S50MNM:i:0 AS:i:50 XS:i:0 XP:Z:1,+120572000,50M50S,9,0;
1_Notch2RoxanpPr2=1471120572200950MNM:i:0 AS:i:50 XS:i:45
2_Notch2RoxanpPR11120572000950M50SNM:i:0 AS:i:50 XS:i:45 XP:Z:22,+41721566,50S50M,9,0;
2_Notch2RoxanpPR12241721566950S50MNM:i:0 AS:i:50 XS:i:0 XP:Z:1,+120572000,50M50S,9,0;
2_Notch2RoxanpPr21120572200950MNM:i:0 AS:i:50 XS:i:45

NOTCH2a+ROXAN1a vs revcomp(NOTCH2b+ROXAN1b) BWA-MEM options: none

NOTCH2a+ROXAN1a vs revcomp(NOTCH2b+ROXAN1b)
BWA-MEM options: none
All fragments on chr1/chr22.
Reads are NOT properly paired.
ReadFlagChromPosqualCigarExtra
1_Notch2RoxanpR1=971120572000950M50SNM:i:0 AS:i:50 XS:i:45 XP:Z:22,+41721566,50S50M,9,0;
1_Notch2RoxanpR1=972241721566950S50MNM:i:0 AS:i:50 XS:i:0 XP:Z:1,+120572000,50M50S,9,0;
1_Notch2Roxanpr2=14522417217666050S50MNM:i:0 AS:i:50 XS:i:0 XP:Z:1,-120572200,50M50S,9,0;
1_Notch2Roxanpr2=1451120572200950M50SNM:i:0 AS:i:50 XS:i:45 XP:Z:22,-41721766,50S50M,60,0;
2_Notch2RoxanpR11120572000950M50SNM:i:0 AS:i:50 XS:i:45 XP:Z:22,+41721566,50S50M,9,0;
2_Notch2RoxanpR12241721566950S50MNM:i:0 AS:i:50 XS:i:0 XP:Z:1,+120572000,50M50S,9,0;
2_Notch2Roxanpr222417217666050S50MNM:i:0 AS:i:50 XS:i:0 XP:Z:1,-120572200,50M50S,9,0;
2_Notch2Roxanpr21120572200950M50SNM:i:0 AS:i:50 XS:i:45 XP:Z:22,-41721766,50S50M,60,0;

NOTCH2a+ROXAN1a vs revcomp(NOTCH2b) BWA-MEM options: -C

NOTCH2a+ROXAN1a vs revcomp(NOTCH2b)
BWA-MEM options: -C
I cannot see the effect of the option '-C':
"append FASTA/FASTQ comment to SAM output"
ReadFlagChromPosqualCigarExtra
1_Notch2RoxanpPR1=991120572000950M50SNM:i:0 AS:i:50 XS:i:45 XP:Z:22,+41721566,50S50M,9,0;
1_Notch2RoxanpPR1=992241721566950S50MNM:i:0 AS:i:50 XS:i:0 XP:Z:1,+120572000,50M50S,9,0;
1_Notch2RoxanpPr2=1471120572200950MNM:i:0 AS:i:50 XS:i:45
2_Notch2RoxanpPR11120572000950M50SNM:i:0 AS:i:50 XS:i:45 XP:Z:22,+41721566,50S50M,9,0;
2_Notch2RoxanpPR12241721566950S50MNM:i:0 AS:i:50 XS:i:0 XP:Z:1,+120572000,50M50S,9,0;
2_Notch2RoxanpPr21120572200950MNM:i:0 AS:i:50 XS:i:45

NOTCH2a+ROXAN1a vs revcomp(NOTCH2b) BWA-MEM options: -M

NOTCH2a+ROXAN1a vs revcomp(NOTCH2b)
BWA-MEM options: -M
Set the duplicate flags for the multiple hits.
ReadFlagChromPosqualCigarExtra
1_Notch2RoxanpPR1=991120572000950M50SNM:i:0 AS:i:50 XS:i:45 XP:Z:22,+41721566,50S50M,9,0;
1_Notch2RoxanpPR1s=3552241721566950S50MNM:i:0 AS:i:50 XS:i:0 XP:Z:1,+120572000,50M50S,9,0;
1_Notch2RoxanpPr21120572200950MNM:i:0 AS:i:50 XS:i:45
2_Notch2RoxanpPR11120572000950M50SNM:i:0 AS:i:50 XS:i:45 XP:Z:22,+41721566,50S50M,9,0;
2_Notch2RoxanpPR1s2241721566950S50MNM:i:0 AS:i:50 XS:i:0 XP:Z:1,+120572000,50M50S,9,0;
2_Notch2RoxanpPr21120572200950MNM:i:0 AS:i:50 XS:i:45

NOTCH2ins vs revcomp(ROXAN1b) BWA-MEM options: none

NOTCH2ins vs revcomp(ROXAN1b)
BWA-MEM options: none
Reads are mapped on chr22.
ReadFlagChromPosqualCigarExtra
1_NOTCH2insRoxanpPR122417215666023S50M27SNM:i:0 AS:i:50 XS:i:0
1_NOTCH2insRoxanpPr222417217666050MNM:i:0 AS:i:50 XS:i:0
2_NOTCH2insRoxanpPR122417215666023S50M27SNM:i:0 AS:i:50 XS:i:0
2_NOTCH2insRoxanpPr222417217666050MNM:i:0 AS:i:50 XS:i:0

That's it,

Pierre

The makefile