22 May 2013

Drawing a Timeline with jquery.

In 2008, I wrote a XUL-based interface displaying a timeline (http://...freebase-and-history-of-sciences.html).
History of Sciences / Freebase
Here I've played with jquery to display another timeline:

html

There is no json data: everyting is stored in the HTML. The years are surrounded by a <span/> element having a css class "start/end".


javascript

We use jquery to sort and layout each event.

CSS

A basic CSS for my timeline

Result


It's far from being perfect. I don't know how to handle the images overflowing the "div".

That's it,

Pierre


15 May 2013

VIZBAM and NGSProject

FYI: I've started the following projects on github:

VizBam

Vizbam ( https://github.com/lindenb/vizbam ) is a java library used to display a SAM alignment just like samtools tview

NGSProject

NGSProject ( https://github.com/lindenb/ngsproject ) is a java web-server used to display some SAM-records and SAM some alignments.

Screentshot







That's it,

Pierre

01 May 2013

Inserting the result of a BLAST into a Database using XSLT.

Here is the XML output of a BLAST:

<?xml version="1.0"?>
<!DOCTYPE BlastOutput PUBLIC "-//NCBI//NCBI BlastOutput/EN" "NCBI_BlastOutput.dtd">
<BlastOutput>
  <BlastOutput_program>tblastn</BlastOutput_program>
  <BlastOutput_version>TBLASTN 2.2.27+</BlastOutput_version>
  <BlastOutput_reference>Stephen F. Altschul, Thomas L. Madden, Alejandro A. Sch&amp;auml;ffer, Jinghui Z
hang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), &quot;Gapped BLAST and PSI-BLAST: a new generation 
of protein database search programs&quot;, Nucleic Acids Res. 25:3389-3402.</BlastOutput_reference>
  <BlastOutput_db>nr</BlastOutput_db>
  <BlastOutput_query-ID>52385</BlastOutput_query-ID>
  <BlastOutput_query-def>myseq</BlastOutput_query-def>
  <BlastOutput_query-len>30</BlastOutput_query-len>
  <BlastOutput_param>
    <Parameters>
      <Parameters_matrix>BLOSUM62</Parameters_matrix>
      <Parameters_expect>10</Parameters_expect>
      <Parameters_gap-open>11</Parameters_gap-open>
      <Parameters_gap-extend>1</Parameters_gap-extend>
      <Parameters_filter>L;</Parameters_filter>
    </Parameters>
  </BlastOutput_param>
<BlastOutput_iterations>
<Iteration>
  <Iteration_iter-num>1</Iteration_iter-num>
  <Iteration_query-ID>52385</Iteration_query-ID>
  <Iteration_query-def>myseq</Iteration_query-def>
  <Iteration_query-len>30</Iteration_query-len>
<Iteration_hits>
<Hit>
  <Hit_num>1</Hit_num>
  <Hit_id>gi|110624327|dbj|AK225891.1|</Hit_id>
  <Hit_def>Homo sapiens mRNA for zinc finger CCCH-type containing 7B variant, clone: FCC121C02</Hit_def>
  <Hit_accession>AK225891</Hit_accession>
  <Hit_len>1829</Hit_len>
  <Hit_hsps>
    <Hsp>
      <Hsp_num>1</Hsp_num>
      <Hsp_bit-score>62.3882</Hsp_bit-score>
      <Hsp_score>150</Hsp_score>
      <Hsp_evalue>4.01658e-10</Hsp_evalue>
      <Hsp_query-from>1</Hsp_query-from>
      <Hsp_query-to>30</Hsp_query-to>
      <Hsp_hit-from>250</Hsp_hit-from>
      <Hsp_hit-to>339</Hsp_hit-to>
      <Hsp_query-frame>0</Hsp_query-frame>
      <Hsp_hit-frame>1</Hsp_hit-frame>
      <Hsp_identity>30</Hsp_identity>
      <Hsp_positive>30</Hsp_positive>
      <Hsp_gaps>0</Hsp_gaps>
      <Hsp_align-len>30</Hsp_align-len>
      <Hsp_qseq>MERQKRKADIEKGLQFIQSTLPLKQEEYEA</Hsp_qseq>
      <Hsp_hseq>MERQKRKADIEKGLQFIQSTLPLKQEEYEA</Hsp_hseq>
      <Hsp_midline>MERQKRKADIEKGLQFIQSTLPLKQEEYEA</Hsp_midline>
    </Hsp>
  </Hit_hsps>
</Hit>
<Hit>
  <Hit_num>2</Hit_num>
  <Hit_id>gi|6176337|gb|AF188530.1|AF188530</Hit_id>
  <Hit_def>Homo sapiens ubiquitous tetratricopeptide containing protein RoXaN mRNA, partial cds</Hit_def&g
t;
  <Hit_accession>AF188530</Hit_accession>
  <Hit_len>2398</Hit_len>
  <Hit_hsps>
    <Hsp>
      <Hsp_num>1</Hsp_num>
      <Hsp_bit-score>62.3882</Hsp_bit-score>
      <Hsp_score>150</Hsp_score>
      <Hsp_evalue>4.12279e-10</Hsp_evalue>
      <Hsp_query-from>1</Hsp_query-from>
      <Hsp_query-to>30</Hsp_query-to>
      <Hsp_hit-from>105</Hsp_hit-from>
      <Hsp_hit-to>194</Hsp_hit-to>
      <Hsp_query-frame>0</Hsp_query-frame>
      <Hsp_hit-frame>3</Hsp_hit-frame>
      <Hsp_identity>30</Hsp_identity>
      <Hsp_positive>30</Hsp_positive>
      <Hsp_gaps>0</Hsp_gaps>
      <Hsp_align-len>30</Hsp_align-len>
      <Hsp_qseq>MERQKRKADIEKGLQFIQSTLPLKQEEYEA</Hsp_qseq>
      <Hsp_hseq>MERQKRKADIEKGLQFIQSTLPLKQEEYEA</Hsp_hseq>
      <Hsp_midline>MERQKRKADIEKGLQFIQSTLPLKQEEYEA</Hsp_midline>
    </Hsp>
  </Hit_hsps>
(...)
We want to insert that XML file into a database. I wrote the following XSLT stylesheet , it transforms the blast-xml into a set of SQL statements for sqlite3.
xsltproc --novalid blast2sqlite.xsl blast.xml 

create table BlastOutput(
(...)


BEGIN TRANSACTION;

insert into BlastOutput(
 program,
 version,
 reference,
 db,
 query_ID,
 query_def,
 query_len
 )
 values (
  'tblastn',
  'TBLASTN 2.2.27+',
  'Stephen F. Altschul, Thomas L. Madden, Alejandro A. Sch&auml;ffer, Jinghui Zhang,Zheng Zhang, Webb Miller, and David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402.',
  'nr',
  '52385',
  'myseq',
  30
  );


insert into Parameters(
 blastOutput_id,
 expect,
 matrix,
 sc_match,
 sc_mismatch,
 gap_open,
 gap_extend,
 filter
 )
select MAX(id),
 10,
 'BLOSUM62',
 NULL,
 NULL,
 11,
 1,
 'L;'
 from BlastOutput;
 


insert into Iteration(
 blastOutput_id,
 iter_num,
 query_id,
 query_def,
 query_len
 )
select MAX(id),
 1,
 '52385',
 'myseq',
 30
from BlastOutput;

insert into Hit(iteration_id,num,hit_id,def,accession,len)
select MAX(id),
 1,
 'gi|110624327|dbj|AK225891.1|',
 'Homo sapiens mRNA for zinc finger CCCH-type containing 7B variant, clone: FCC121C02',
 'AK225891',
 1829
from Iteration;
(...)
All in one you can redirect the output to sqlite3.
xsltproc --novalid blast2sqlite.xsl blast.xml |\
 sqlite3 input.db
and query the database:
$ sqlite3 -header -line input.sqlite \
 'select * from Hsp,Hit where Hsp.hit_id=Hit.id limit 2'

          id = 1
      hit_id = 1
         num = 1
   bit_score = 62.3882
       score = 150.0
      evalue = 4.01658e-10
  query_from = 1
    query_to = 30
    hit_from = 250
      hit_to = 339
 query_frame = 0
   hit_frame = 1
    identity = 30
    positive = 30
        gaps = 0
   align_len = 30
        qseq = MERQKRKADIEKGLQFIQSTLPLKQEEYEA
        hseq = MERQKRKADIEKGLQFIQSTLPLKQEEYEA
     midline = MERQKRKADIEKGLQFIQSTLPLKQEEYEA
          id = 1
iteration_id = 1
         num = 1
      hit_id = gi|110624327|dbj|AK225891.1|
         def = Homo sapiens mRNA for zinc finger CCCH-type containing 7B variant, clone: FCC121C02
   accession = AK225891
         len = 1829

          id = 2
      hit_id = 2
         num = 1
   bit_score = 62.3882
       score = 150.0
      evalue = 4.12279e-10
  query_from = 1
    query_to = 30
    hit_from = 105
      hit_to = 194
 query_frame = 0
   hit_frame = 3
    identity = 30
    positive = 30
        gaps = 0
   align_len = 30
        qseq = MERQKRKADIEKGLQFIQSTLPLKQEEYEA
        hseq = MERQKRKADIEKGLQFIQSTLPLKQEEYEA
     midline = MERQKRKADIEKGLQFIQSTLPLKQEEYEA
          id = 2
iteration_id = 1
         num = 2
      hit_id = gi|6176337|gb|AF188530.1|AF188530
         def = Homo sapiens ubiquitous tetratricopeptide containing protein RoXaN mRNA, partial cds
   accession = AF188530
         len = 2398
That's it,

Pierre

26 April 2013

Java JNI bindings for BWA(mem-lite)

Motivation

BWA 7.4(http://bio-bwa.sourceforge.net/) contains a small C example(https://github.com/lh3/bwa/blob/master/example.c) for running bwa-mem as a library (bwamem-lite). I created some JNI bindings to see if I can bind the C bwa library to java and get the same output than bwamem-lite. I put the code on github at https://github.com/lindenb/jbwa.

Example


(compare to https://github.com/lh3/bwa/blob/master/example.c )

System.loadLibrary("bwajni");
BwaIndex index=new BwaIndex(new File("hg19.fa"));
BwaMem mem=new BwaMem(index);
KSeq kseq=new KSeq(new File("input.fastq.gz");
ShortRead read=null;
while((read=kseq.next())!=null)
        {
        for(AlnRgn a: mem.align(read))
                {
                if(a.getSecondary()>=0) continue;
                System.out.println(  read.getName()+"\t"+  a.getStrand()+"\t"+  a.getChrom()+"\t"+
                        a.getPos()+"\t"+ a.getMQual()+"\t"+ a.getCigar()+"\t"+  a.getNm() );
                }
        }
kseq.dispose();
index.close();
mem.dispose();

Testing


Here is the ouput of the JAVA version:

gunzip -c input.fastq.gz | head -n 4000 |\
java  -Djava.library.path=src/main/native -cp src/main/java \
   com.github.lindenb.jbwa.jni.Example human_g1k_v37.fasta -| tail 


HWI-1KL149:20:C1CU7ACXX:4:1101:3077:33410       +       3       38647538        60      89M11S  1
HWI-1KL149:20:C1CU7ACXX:4:1101:3396:33445       +       8       52567289        60      100M    1
HWI-1KL149:20:C1CU7ACXX:4:1101:10013:33288      -       1       156104115       60      100M    1
HWI-1KL149:20:C1CU7ACXX:4:1101:10390:33496      -       6       123824853       60      100M    1
HWI-1KL149:20:C1CU7ACXX:4:1101:13537:33483      +       2       157367092       60      100M    1
HWI-1KL149:20:C1CU7ACXX:4:1101:14139:33390      +       20      31413797        60      100M    1
HWI-1KL149:20:C1CU7ACXX:4:1101:14514:33458      +       2       179401813       60      100M    1
HWI-1KL149:20:C1CU7ACXX:4:1101:15292:33282      +       15      63335820        60      100M    1
HWI-1KL149:20:C1CU7ACXX:4:1101:16960:33276      -       12      110782784       60      100M    1
HWI-1KL149:20:C1CU7ACXX:4:1101:17355:33322      +       6       126077895       60      100M    1

And the ouput of the Native C version:

gunzip -c input.fastq.gz | head -n 4000 |\
bwa-0.7.4/bwamem-lite human_g1k_v37.fasta - | tail 

HWI-1KL149:20:C1CU7ACXX:4:1101:3077:33410       +       3       38647538        60      89M11S  1
HWI-1KL149:20:C1CU7ACXX:4:1101:3396:33445       +       8       52567289        60      100M    1
HWI-1KL149:20:C1CU7ACXX:4:1101:10013:33288      -       1       156104115       60      100M    1
HWI-1KL149:20:C1CU7ACXX:4:1101:10390:33496      -       6       123824853       60      100M    1
HWI-1KL149:20:C1CU7ACXX:4:1101:13537:33483      +       2       157367092       60      100M    1
HWI-1KL149:20:C1CU7ACXX:4:1101:14139:33390      +       20      31413797        60      100M    1
HWI-1KL149:20:C1CU7ACXX:4:1101:14514:33458      +       2       179401813       60      100M    1
HWI-1KL149:20:C1CU7ACXX:4:1101:15292:33282      +       15      63335820        60      100M    1
HWI-1KL149:20:C1CU7ACXX:4:1101:16960:33276      -       12      110782784       60      100M    1
HWI-1KL149:20:C1CU7ACXX:4:1101:17355:33322      +       6       126077895       60      100M    1

GUI


As a test I also created a swing-Based interface for BWA:

java  -Djava.library.path=src/main/native  -cp src/main/java \
    com.github.lindenb.jbwa.jni.BwaFrame human_g1k_v37.fasta



That's it,


Pierre

23 April 2013

playing with BWA-MEM : my notebook

BWA-MEM, is a new tool that is part of the latest version of BWA. As said Heng Li on Biostar bwa-mem can be used to identify the viral integration sites in the human genome. Here I've used various short reads to explore how bwa maps the pairs. The sequences I used below are:

  • NOTCH2a and NOTCH2b two sequences on the chromosomes 1, on the same strand
  • NOTCH2del : same as NOTCH2a but with a deletion
  • ROXAN1a and ROXAN1b: two sequences on the chromosome 22, on the same strand
  • NOTCH2ins is NOTCH2a with and insertion of ROXAN1a
The following results were generated with the makefile below. Each pair of fastq was submitted twice (1_* and 2_*). I also added some pairs of reads from another source of FASTQs to let bwa infer the size of the fragments (see biostar: http://www.biostars.org/p/69694).

NOTCH2a vs NOTCH2b BWA-MEM options: none

NOTCH2a vs NOTCH2b
BWA-MEM options: none
NOT Properly mapped reads because they're on the same strand.
ReadFlagChromPosqualCigarExtra
1_Notch2p1=651120572000950MNM:i:0 AS:i:50 XS:i:45
1_Notch2p2=1291120572200950MNM:i:0 AS:i:50 XS:i:45
2_Notch2p11120572000950MNM:i:0 AS:i:50 XS:i:45
2_Notch2p21120572200950MNM:i:0 AS:i:50 XS:i:45

NOTCH2a vs revcomp(NOTCH2b) BWA-MEM options: none

NOTCH2a vs revcomp(NOTCH2b)
BWA-MEM options: none
Properly mapped reads.
ReadFlagChromPosqualCigarExtra
1_Notch2pPR1=9911205720003950MNM:i:0 AS:i:50 XS:i:45
1_Notch2pPr2=14711205722003950MNM:i:0 AS:i:50 XS:i:45
2_Notch2pPR111205720003950MNM:i:0 AS:i:50 XS:i:45
2_Notch2pPr211205722003950MNM:i:0 AS:i:50 XS:i:45

NOTCH2del vs revcomp(NOTCH2b) BWA-MEM options: none

NOTCH2del vs revcomp(NOTCH2b)
BWA-MEM options: none
Deletion in Read1
ReadFlagChromPosqualCigarExtra
1_Notch2delpPR1=9911205720003937M20D31MNM:i:20 AS:i:42 XS:i:37
1_Notch2delpPr2=14711205722003950MNM:i:0 AS:i:50 XS:i:45
2_Notch2delpPR111205720003937M20D31MNM:i:20 AS:i:42 XS:i:37
2_Notch2delpPr211205722003950MNM:i:0 AS:i:50 XS:i:45

NOTCH2ins vs revcomp(NOTCH2b) BWA-MEM options:

NOTCH2ins vs revcomp(NOTCH2b)
BWA-MEM options: non
Soft clipping of chr1: two hits on different chromosomes.
ReadFlagChromPosqualCigarExtra
1_NOTCH2insRoxanpR1=9722417215666023S50M27SNM:i:0 AS:i:50 XS:i:0
1_NOTCH2insRoxanpr2=1451120572200950MNM:i:0 AS:i:50 XS:i:45
2_NOTCH2insRoxanpR122417215666023S50M27SNM:i:0 AS:i:50 XS:i:0
2_NOTCH2insRoxanpr21120572200950MNM:i:0 AS:i:50 XS:i:45


NOTCH2a+ROXAN1a vs revcomp(NOTCH2b) BWA-MEM options: none

NOTCH2a+ROXAN1a vs revcomp(NOTCH2b)
BWA-MEM options: none
two hits on chromosome chr1 and chr22 for the 5' seq
nevertheless, properly paired was set.
ReadFlagChromPosqualCigarExtra
1_Notch2RoxanpPR1=991120572000950M50SNM:i:0 AS:i:50 XS:i:45 XP:Z:22,+41721566,50S50M,9,0;
1_Notch2RoxanpPR1=992241721566950S50MNM:i:0 AS:i:50 XS:i:0 XP:Z:1,+120572000,50M50S,9,0;
1_Notch2RoxanpPr2=1471120572200950MNM:i:0 AS:i:50 XS:i:45
2_Notch2RoxanpPR11120572000950M50SNM:i:0 AS:i:50 XS:i:45 XP:Z:22,+41721566,50S50M,9,0;
2_Notch2RoxanpPR12241721566950S50MNM:i:0 AS:i:50 XS:i:0 XP:Z:1,+120572000,50M50S,9,0;
2_Notch2RoxanpPr21120572200950MNM:i:0 AS:i:50 XS:i:45

NOTCH2a+ROXAN1a vs revcomp(NOTCH2b+ROXAN1b) BWA-MEM options: none

NOTCH2a+ROXAN1a vs revcomp(NOTCH2b+ROXAN1b)
BWA-MEM options: none
All fragments on chr1/chr22.
Reads are NOT properly paired.
ReadFlagChromPosqualCigarExtra
1_Notch2RoxanpR1=971120572000950M50SNM:i:0 AS:i:50 XS:i:45 XP:Z:22,+41721566,50S50M,9,0;
1_Notch2RoxanpR1=972241721566950S50MNM:i:0 AS:i:50 XS:i:0 XP:Z:1,+120572000,50M50S,9,0;
1_Notch2Roxanpr2=14522417217666050S50MNM:i:0 AS:i:50 XS:i:0 XP:Z:1,-120572200,50M50S,9,0;
1_Notch2Roxanpr2=1451120572200950M50SNM:i:0 AS:i:50 XS:i:45 XP:Z:22,-41721766,50S50M,60,0;
2_Notch2RoxanpR11120572000950M50SNM:i:0 AS:i:50 XS:i:45 XP:Z:22,+41721566,50S50M,9,0;
2_Notch2RoxanpR12241721566950S50MNM:i:0 AS:i:50 XS:i:0 XP:Z:1,+120572000,50M50S,9,0;
2_Notch2Roxanpr222417217666050S50MNM:i:0 AS:i:50 XS:i:0 XP:Z:1,-120572200,50M50S,9,0;
2_Notch2Roxanpr21120572200950M50SNM:i:0 AS:i:50 XS:i:45 XP:Z:22,-41721766,50S50M,60,0;

NOTCH2a+ROXAN1a vs revcomp(NOTCH2b) BWA-MEM options: -C

NOTCH2a+ROXAN1a vs revcomp(NOTCH2b)
BWA-MEM options: -C
I cannot see the effect of the option '-C':
"append FASTA/FASTQ comment to SAM output"
ReadFlagChromPosqualCigarExtra
1_Notch2RoxanpPR1=991120572000950M50SNM:i:0 AS:i:50 XS:i:45 XP:Z:22,+41721566,50S50M,9,0;
1_Notch2RoxanpPR1=992241721566950S50MNM:i:0 AS:i:50 XS:i:0 XP:Z:1,+120572000,50M50S,9,0;
1_Notch2RoxanpPr2=1471120572200950MNM:i:0 AS:i:50 XS:i:45
2_Notch2RoxanpPR11120572000950M50SNM:i:0 AS:i:50 XS:i:45 XP:Z:22,+41721566,50S50M,9,0;
2_Notch2RoxanpPR12241721566950S50MNM:i:0 AS:i:50 XS:i:0 XP:Z:1,+120572000,50M50S,9,0;
2_Notch2RoxanpPr21120572200950MNM:i:0 AS:i:50 XS:i:45

NOTCH2a+ROXAN1a vs revcomp(NOTCH2b) BWA-MEM options: -M

NOTCH2a+ROXAN1a vs revcomp(NOTCH2b)
BWA-MEM options: -M
Set the duplicate flags for the multiple hits.
ReadFlagChromPosqualCigarExtra
1_Notch2RoxanpPR1=991120572000950M50SNM:i:0 AS:i:50 XS:i:45 XP:Z:22,+41721566,50S50M,9,0;
1_Notch2RoxanpPR1s=3552241721566950S50MNM:i:0 AS:i:50 XS:i:0 XP:Z:1,+120572000,50M50S,9,0;
1_Notch2RoxanpPr21120572200950MNM:i:0 AS:i:50 XS:i:45
2_Notch2RoxanpPR11120572000950M50SNM:i:0 AS:i:50 XS:i:45 XP:Z:22,+41721566,50S50M,9,0;
2_Notch2RoxanpPR1s2241721566950S50MNM:i:0 AS:i:50 XS:i:0 XP:Z:1,+120572000,50M50S,9,0;
2_Notch2RoxanpPr21120572200950MNM:i:0 AS:i:50 XS:i:45

NOTCH2ins vs revcomp(ROXAN1b) BWA-MEM options: none

NOTCH2ins vs revcomp(ROXAN1b)
BWA-MEM options: none
Reads are mapped on chr22.
ReadFlagChromPosqualCigarExtra
1_NOTCH2insRoxanpPR122417215666023S50M27SNM:i:0 AS:i:50 XS:i:0
1_NOTCH2insRoxanpPr222417217666050MNM:i:0 AS:i:50 XS:i:0
2_NOTCH2insRoxanpPR122417215666023S50M27SNM:i:0 AS:i:50 XS:i:0
2_NOTCH2insRoxanpPr222417217666050MNM:i:0 AS:i:50 XS:i:0

That's it,

Pierre

The makefile

25 March 2013

Embedding Pubmed, Graphiviz and a remote image in #LaTeX. My notebook. .

I'm learning LaTeX. Today I learned how to create a new command in LaTeX.

\newcommand{name}[num]{definition}
"Basically the command requires two arguments: the name of the command you want to create, and the definition of the command" . I played with LaTeX and wrote the following three commands:

Embedding a remote picture

The following LaTeX document defines a new command "\remoteimage". It takes 3 parameters: a filename, a URL and some parameters for \includegraphics. If the file doesn't exist, the url is downloaded and saved in 'file'. The downloaded image is then included in the final LaTeX document.

Note: latex files must be compiled with --enable-write18 to enable system-calls.
pdflatex --enable-write18 input.tex
Result:

External Image /Latex by lindenb


GraphViz Dot

The second LaTex Document works the same way. It defines a command "\graphviz" , sends the content of the 2nd argument to graphviz dot and save the resulting image before importing it into the LaTeX document.

Result:

GraphViz / Latex by lindenb


Pubmed

The last command define "\pmid" . It needs one Pubmed identifer. It downloads the XML record for this pmid, transforms it to LaTeX with xsltproc and the following XSLT stylesheet:

The LaTeX document includes four pubmed identifiers:

Result:

Pumed / Latex by lindenb






That's it,

Pierre




13 March 2013

Filtering a BAM with javascript.

The following was created to answer that question on biostar: "Exclude reads with XA tag". SamJs is a java program filtering the reads of a BAM file using the javascript engine for java (rhino).

Usage

java -jar samjs.jar [options] (-e script | -f script) (file.bam|stdin)
for each SamRecord in the BAM, the program injects the following variables in the javascript context:
  • 'record' a SamRecord ( http://picard.sourceforge.net/javadoc/net/sf/samtools/SAMRecord.html ).
  • 'header' a SAMFileHeader ( http://picard.sourceforge.net/javadoc/net/sf/samtools/SAMFileHeader.html)

Options

  • -S : SAM output.
  • -e (script).
  • -f (script-file).
  • -L (int) limit to 'L' records.

Example

$ java -jar dist/samjs.jar -L 4  -S \
  -e '(record.inferredInsertSize > 60 || record.alignmentStart  < 50000000) && !record.readUnmappedFlag && record.getAttribute("MD")!=null ' \
 my.bam
@HD VN:1.0 SO:unsorted
@SQ SN:chr1 LN:247249719
IL31_4368:1:1:997:15684 163 chr1 241356442 60 54M = 241356612 224 CAGCCTCAGATTCAGCATTCTCAAATTCAGCTGCGGCTGAAACAGCAGCAGGAC EEEEDEEE9EAEEDEEEEEEEEEECEEAAEEDEE<CD=D=*BCAC?;CB,<D@, X0:i:1 X1:i:0 MD:Z:53T0 XG:i:0 AM:i:37 NM:i:1 SM:i:37 XM:i:1 XO:i:0 XT:A:U
IL31_4368:1:1:997:1657 163 chr1 143630066 60 54M = 143630364 352 CCCACCTCTCTCAATGTTTTCCATATGGCAGGGACTCAGCACAGGTGGATTAAT A;0A?AA+@A<7A7019/<65,3A;'''07<A=<=>?7=?6&)'9('*%,>/(< X0:i:1 X1:i:0 MD:Z:54 XG:i:0 AM:i:37 NM:i:0 SM:i:37 XM:i:0 XO:i:0 XT:A:U
IL31_4368:1:1:999:9391 163 chr1 195364100 29 54M = 195364272 226 AAAAAAAAAAACCCTCATTTTTTTTAAGTACTAAATTTTTTTTCCCATTTGAAA 1>??E?>@BB>0A/43;,=9A98A(',0/<*4>>/@=90A51&(**3;>'*;=; MD:Z:11A5A7C12A2G1A1T1C6 XG:i:0 AM:i:29 NM:i:8 SM:i:29 XM:i:8 XO:i:0 XT:A:M
IL31_4368:1:1:1002:11012 99 chr1 147142473 60 54M = 147142575 156 NGATTAGTACATAGTAAGTACTCAATAGATGTTAGCTATTATTGTAATCACCGC (1*4+236332679?1..87><-6<@<7>>@><7;@@@>962$-6075584093 X0:i:1 X1:i:0 MD:Z:0A40C12 XG:i:0 AM:i:37 NM:i:2 SM:i:37 XM:i:2 XO:i:0 XT:A:U

That's it,
Pierre

27 February 2013

4 Tools I wrote today to annotate VCF files.

Here are four java-bases tools I wrote today for @SolenaLS to annotate VCF files.

VCFTabix

VCFTabix : we want to use the VCF of the 1Kgenomes project (ALL.wgs.phase1_release_v3.20101123.snps_indels_sv.sites.vcf.gz) as a source of annotation for our VCF.

Options

-f (vcf indexed with tabix) REQUIRED.
 -T (tag String) VCF-INFO-ID optional can be used several times.
 -R doesn't use REF allele
 -A doesn't use ALT allele
 -I don't replace ID if it exists.
 -F don't replace INFO field if it exists.
 -C (TAG) use this tag in case of conflict with the ALT allele.

Example:

java -jar dist/vcftabix.jar -f ALL.wgs.phase1_release_v3.20101123.snps_indels_sv.sites.vcf.gz \
   -T AC -T SNPSOURCE  input.vcf 

##fileformat=VCFv4.1
(....)
##Annotated with fr.inserm.umr1087.jvarkit.tools.vcftabix.VCFTabix:ALL.wgs.phase1_release_v3.20101123.snps_indels_sv.sites.vcf.gz
##INFO=<ID=AC,Number=.,Type=Integer,Description="Alternate Allele Count">
##INFO=<ID=SNPSOURCE,Number=.,Type=String,Description="indicates if a snp was called when analysing the low coverage or exome alignment data">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  CD10977
5       2999819 rs21234 A       G       1620    .       AC=2156;AF=1.00;AN=2;DB;DP=41;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=59.44;MQ0=0;QD=39.51;SB=-6.519e-03;SNPSOURCE=LOWCOV;       GT:AD:DP:GQ:PL  1/1:0,41:41:99:1653,123,0
(...)


EVS2BED

EVS2BED: bulk download of the XML data of EVS into a BED file CHROM/START/END/XML (see my related post: http://plindenbaum.blogspot.fr/2012/05/downloading-xml-data-from-exome-variant.html )

Example

Download 10 records and index the data with tabix:
java -jar dist/evs2bed.jar -L 10 | LC_ALL=C sort -k1,1 -k2,2n -k3,3n -k4,4 -u |\
  /path/totabix-0.2.6/bgzip -c > test.bed.gz
/path/totabix-0.2.6/tabix -p bed -f test.bed.gz


$ gunzip -c  test.bed.gz  | cut -c 1-500 | fold -w 80
1 69427 69428 <snpList><positionString>1:69428</positionString><chrPos
ition>69428</chrPosition><alleles>G/T</alleles><uaAlleleCounts>G=313/T=6535</uaA
lleleCounts><aaAlleleCounts>G=14/T=3808</aaAlleleCounts><totalAlleleCounts>G=327
/T=10343</totalAlleleCounts><uaMAF>4.5707</uaMAF><aaMAF>0.3663</aaMAF><totalMAF>
3.0647</totalMAF><avgSampleReadDepth>110</avgSampleReadDepth><geneList>OR4F5</ge
neList><snpFunction><chromosome>1</chromosome><position>69428</position><conserv
ationScore>1.0</conservationSc
1 69475 69476 <snpList><positionString>1:69476</positionString><chrPos
ition>69476</chrPosition><alleles>C/T</alleles><uaAlleleCounts>C=2/T=7020</uaAll
eleCounts><aaAlleleCounts>C=0/T=3908</aaAlleleCounts><totalAlleleCounts>C=2/T=10
928</totalAlleleCounts><uaMAF>0.0285</uaMAF><aaMAF>0.0</aaMAF><totalMAF>0.0183</
totalMAF><avgSampleReadDepth>123</avgSampleReadDepth><geneList>OR4F5</geneList><
snpFunction><chromosome>1</chromosome><position>69476</position><conservationSco
re>0.6</conservationScore><con
1 69495 69496 <snpList><positionString>1:69496</positionString><chrPos
ition>69496</chrPosition><alleles>A/G</alleles><uaAlleleCounts>A=2/G=6764</uaAll
eleCounts><aaAlleleCounts>A=23/G=3785</aaAlleleCounts><totalAlleleCounts>A=25/G=
10549</totalAlleleCounts><uaMAF>0.0296</uaMAF><aaMAF>0.604</aaMAF><totalMAF>0.23
64</totalMAF><avgSampleReadDepth>91</avgSampleReadDepth><geneList>OR4F5</geneLis
t><snpFunction><chromosome>1</chromosome><position>69496</position><conservation
Score>0.5</conservationScore><
1 69510 69511 <snpList><positionString>1:69511</positionString><chrPos
ition>69511</chrPosition><alleles>G/A</alleles><uaAlleleCounts>G=5337/A=677</uaA
lleleCounts><aaAlleleCounts>G=1937/A=1623</aaAlleleCounts><totalAlleleCounts>G=7
274/A=2300</totalAlleleCounts><uaMAF>11.2571</uaMAF><aaMAF>45.5899</aaMAF><total
MAF>24.0234</totalMAF><avgSampleReadDepth>69</avgSampleReadDepth><geneList>OR4F5
</geneList><snpFunction><chromosome>1</chromosome><position>69511</position><con
servationScore>1.0</conservati
1 69589 69590 <snpList><positionString>1:69590</positionString><chrPos
ition>69590</chrPosition><alleles>A/T</alleles><uaAlleleCounts>A=0/T=6214</uaAll
eleCounts><aaAlleleCounts>A=1/T=3555</aaAlleleCounts><totalAlleleCounts>A=1/T=97
69</totalAlleleCounts><uaMAF>0.0</uaMAF><aaMAF>0.0281</aaMAF><totalMAF>0.0102</t
otalMAF><avgSampleReadDepth>119</avgSampleReadDepth><geneList>OR4F5</geneList><s
npFunction><chromosome>1</chromosome><position>69590</position><conservationScor
e>0.8</conservationScore><cons
1 69593 69594 <snpList><positionString>1:69594</positionString><chrPos
ition>69594</chrPosition><alleles>C/T</alleles><uaAlleleCounts>C=2/T=6190</uaAll
eleCounts><aaAlleleCounts>C=0/T=3548</aaAlleleCounts><totalAlleleCounts>C=2/T=97
38</totalAlleleCounts><uaMAF>0.0323</uaMAF><aaMAF>0.0</aaMAF><totalMAF>0.0205</t
otalMAF><avgSampleReadDepth>109</avgSampleReadDepth><geneList>OR4F5</geneList><s
npFunction><chromosome>1</chromosome><position>69594</position><conservationScor
e>0.8</conservationScore><cons
1 69619 69620 <snpList><positionString>1:69620</positionString><chrPos
ition>69620</chrPosition><alleles>T/TA</alleles><uaAlleleCounts>A1=2/R=4694</uaA
lleleCounts><aaAlleleCounts>A1=0/R=2954</aaAlleleCounts><totalAlleleCounts>A1=2/
R=7648</totalAlleleCounts><uaMAF>0.0426</uaMAF><aaMAF>0.0</aaMAF><totalMAF>0.026
1</totalMAF><avgSampleReadDepth>56</avgSampleReadDepth><geneList>OR4F5</geneList
><snpFunction><chromosome>1</chromosome><position>69620</position><conservationS
core>0.7</conservationScore><c
1 69744 69745 <snpList><positionString>1:69745</positionString><chrPos
ition>69745</chrPosition><alleles>CA/C</alleles><uaAlleleCounts>A1=4/R=4254</uaA
lleleCounts><aaAlleleCounts>A1=0/R=2716</aaAlleleCounts><totalAlleleCounts>A1=4/
R=6970</totalAlleleCounts><uaMAF>0.0939</uaMAF><aaMAF>0.0</aaMAF><totalMAF>0.057
4</totalMAF><avgSampleReadDepth>10</avgSampleReadDepth><geneList>OR4F5</geneList
><snpFunction><chromosome>1</chromosome><position>69745</position><conservationS
core>0.9</conservationScore><c
1 69760 69761 <snpList><positionString>1:69761</positionString><chrPos
ition>69761</chrPosition><alleles>T/A</alleles><uaAlleleCounts>T=645/A=4093</uaA
lleleCounts><aaAlleleCounts>T=62/A=2840</aaAlleleCounts><totalAlleleCounts>T=707
/A=6933</totalAlleleCounts><uaMAF>13.6133</uaMAF><aaMAF>2.1365</aaMAF><totalMAF>
9.2539</totalMAF><avgSampleReadDepth>8</avgSampleReadDepth><geneList>OR4F5</gene
List><snpFunction><chromosome>1</chromosome><position>69761</position><conservat
ionScore>0.1</conservationScor

VCFTabixml

VCFTabixml retrieves the annotations from a BED (CHROM/START/END/XML) and, using XLST, annotates a VCF.

Example

In the following example, a few variations are downloaded from EVS using EVS2BED. Then a small VCF is created and annotated using the bed and the following XSLT stylesheet: evs2vcf.xsl.
java -jar dist/evs2bed.jar -L 10  |\
       LC_ALL=C sort -k1,1 -k2,2n -k3,3n -k4,4 -u |\
       tabix-0.2.6/bgzip -c > test.bed.gz

/tabix-0.2.6/tabix -p bed -f test.bed.gz
java -jar  dist/vcftabixml.jar \
   -f test.bed.gz -x  dist/evs2vcf.xsl test.vcf



echo "#CHROM POS ID REF ALT QUAL FILTER INFO" > test.vcf
echo "1 1 . C T . . T=test1" >> test.vcf
echo "1 69476 . T C . . T=test2" >> test.vcf
echo "1 69511 . A C . . T=test3" >> test.vcf 
java -jar  dist/vcftabixml.jar \
    -f test.bed.gz -x  dist/evs2vcf.xsl test.vcf

Result:
##Annotated with fr.inserm.umr1087.jvarkit.tools.vcftabixml.VCFTabixml:test.bed.gz
#CHROM POS ID REF ALT QUAL FILTER INFO
1 1 . C T . . T=test1;
1 69476 . T C . . T=test2;EVS_UAMAF=0.0285;EVS_AAMAF=0.0;EVS_TOTALMAF=0.0183;EVS_AVGSAMPLEREADDEPTH=123;EVS_GENELIST=OR4F5;EVS_CONSERVATIONSCORE=0.6;EVS_CONSERVATIONSCOREGERP=2.3;EVS_RSIDS=rs148502021;EVS_CLINICALLINK=unknown;EVS_ONEXOMECHIP=false;EVS_GWASPUBMEDIDS=unknown;
1 69511 . A C . . T=test3;EVS_UAMAF=11.2571;EVS_AAMAF=45.5899;EVS_TOTALMAF=24.0234;EVS_AVGSAMPLEREADDEPTH=69;EVS_GENELIST=OR4F5;EVS_CONSERVATIONSCORE=1.0;EVS_CONSERVATIONSCOREGERP=1.1;EVS_RSIDS=rs75062661;EVS_CLINICALLINK=unknown;EVS_ONEXOMECHIP=false;EVS_GWASPUBMEDIDS=unknown;EVS_CONFLICTALT=G;

VCFBigWig

VCFBigWig use a bigwig file to annotate a VCF file.

Example:

add the GERP score to a VCF:
$ java -jar dist/vcfbigwig.jar -f /commun/data/pubdb/ucsc/hg19/bbi/All_hg19_RS_noprefix.bw  test.vep.vcf
##fileformat=VCFv4.1
##Annotated with class fr.inserm.umr1087.jvarkit.tools.vcfbigwig.VCFBigWig:/commun/data/pubdb/ucsc/hg19/bbi/All_hg19_RS_noprefix.bw
##INFO=<ID=All_hg19_RS_noprefix,Number=1,Type=Float,Description="Annotations from /commun/data/pubdb/ucsc/hg19/bbi/All_hg19_RS_noprefix.bw">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
22      12345   .       G       T       100.0   .       All_hg19_RS_noprefix=4.829999923706055

11 February 2013

Counting the reads in a BAM file using SGE and the Open-MPI library: my notebook.

In the current post, I'll describe my first Open MPI program. Open MPI is "a Message Passing Interface (MPI) library, a standardized and portable message-passing system to function on a wide variety of parallel computers". My C program takes a list of BAMs, distributes some jobs on the SGE (SUN/Oracle Grid Engine) to count the number of reads and returns the results to a master process.

Downloading and installing

The library was downloaded from http://www.open-mpi.org/software/ompi/v1.6/, compiled with the option --with-sge and installed in '/commun/data/packages/openmpi'.
./configure --prefix=/commun/data/packages/openmpi --with-sge
make 
make install

Configuring SGE to run MPI-based program

As described in http://docs.oracle.com/cd/E19080-01/n1.grid.eng6/817-5677/6ml49n2c0/index.html .
$ su
$ source /commun/sge_root/bird/common/settings.csh
$ setenv ARCH lx24-amd64
$ qconf -ap orte
And added:
pe_name            orte
slots              448
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $round_robin
control_slaves     TRUE
job_is_first_task  TRUE
urgency_slots      min
accounting_summary FALSE
Add orte to the parallele env list:
qconf -mattr queue pe_list "orte" all.q
Using qconf -mconf I've also set shell_start_mode to 'unix_behavior'.

Algorithm

In the 'main' method, test if this is the master(==0) or a slave process(!=0):
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
if (rank == 0)
 {
 master(argc-1,&argv[1]);
 }
else
 {
 slave(argc-1,&argv[1]);
 }

MASTER

The master gets the number of available proc:
MPI_Comm_size(MPI_COMM_WORLD, &proc_count)
Loop over the number of available processors and the BAMS, for each BAM, create a fixed size structure containing the path to the BAM as well as the count of reads initialized to '0':
typedef struct Params_t
 {
 char filename[FILENAME_MAX];
 int is_error;
 long count_total;
 long count_mapped;
 long count_properly_paired;
 long count_dup;
 }Params;
This structure is sent to the slaves:
 MPI_Send(
 &param, /* message buffer */
 sizeof(Params), /* one data item */
 MPI_CHAR, /* data item is an array of bytes */
 rank, /* destination process rank */
 TAG_COUNT_BAM, /* user chosen message tag */
 MPI_COMM_WORLD /* default communicator */
 ); 
and we wait for the slaves to return those 'Params' with the filled values.:
MPI_Recv(&param, /* message buffer */
 sizeof(Params), /* one data item */
 MPI_CHAR, /* data item is an array of bytes */
 MPI_ANY_SOURCE, /* receive from any sender */
 MPI_ANY_TAG, /* any type of message */
 MPI_COMM_WORLD, /* default communicator */
 &status);
And the result is printed:
printf("%s\t%ld\t%ld\t%ld\t%ld\n",
 param.filename,
 param.count_total,
 param.count_mapped,
 param.count_properly_paired,
 param.count_dup
 );
At the end of the master method, when no more BAM is available, the slaves are released (tag is 'TAG_RELEASE_SLAVE') with :
MPI_Send(
 &empty, /* message buffer */
 sizeof(Params), /* one data item */
 MPI_CHAR, /* data item is an array of bytes */
 rank, /* destination process rank */
 TAG_RELEASE_SLAVE, /* SHutdown */
 MPI_COMM_WORLD
 ); /* default communicator */

SLAVE

The slave method receives some messages from the master:
MPI_Recv(
 &param,
 sizeof(Params),
 MPI_CHAR,
 0,
 MPI_ANY_TAG,
 MPI_COMM_WORLD,
 &status);
If the message was 'TAG_RELEASE_SLAVE' we exit the method. Else the BAM file is scanned using the SAmtools API for C.
count_reads(&param);
and the result is sent back to the master:
MPI_Send(
 &param,
 sizeof(Params),
 MPI_CHAR,
 0,
 0,
 MPI_COMM_WORLD
 );
.

Running the Job on SGE

I still don't understand why I need to write the following line:
# qsh -pe orte 4
Your job 84581 ("INTERACTIVE") has been submitted
waiting for interactive job to be scheduled ...
Your interactive job 84581 has been successfully scheduled.
Run the analysis:
qrsh -cwd -pe orte 100 /commun/data/packages/openmpi/bin/mpirun \
 /path/to/ompibam \
 `find /commun/data/projects/folder1234/align -name "*.bam"`
qstat -f shows the jobs running:
arch          states
   ---------------------------------------------------------------------------------
   all.q@node01                   BIP   0/15/64        2.76 lx24-amd64
      84598 0.55500 mpirun     lindenb      r     02/11/2013 12:03:36    15
   ---------------------------------------------------------------------------------
   all.q@node02                   BIP   0/14/64        3.89 lx24-amd64
      84598 0.55500 mpirun     lindenb      r     02/11/2013 12:03:36    14
   ---------------------------------------------------------------------------------
   all.q@node03                   BIP   0/14/64        3.23 lx24-amd64
      84598 0.55500 mpirun     lindenb      r     02/11/2013 12:03:36    14
   ---------------------------------------------------------------------------------
   all.q@node04                   BIP   0/14/64        3.68 lx24-amd64
      84598 0.55500 mpirun     lindenb      r     02/11/2013 12:03:36    14
   ---------------------------------------------------------------------------------
   all.q@node05                   BIP   0/15/64        2.91 lx24-amd64
      84598 0.55500 mpirun     lindenb      r     02/11/2013 12:03:36    15
   ---------------------------------------------------------------------------------
   all.q@node06                   BIP   0/14/64        3.91 lx24-amd64
      84598 0.55500 mpirun     lindenb      r     02/11/2013 12:03:36    14
   ---------------------------------------------------------------------------------
   all.q@node07                   BIP   0/14/64        3.79 lx24-amd64
      84598 0.55500 mpirun     lindenb      r     02/11/2013 12:03:36    14 
output:
#filename total mapped properly_paired dup
/commun/data/projects/folder1234/11X0589_markdup.bam 1179766 973943 684794 0
/commun/data/projects/folder1234/11X0589_pair1_sorted.bam 1179766 973943 684794 0
/commun/data/projects/folder1234/11X0589_realigned.bam 1179766 973943 684794 0
/commun/data/projects/folder1234/11X0589_recal.bam 1179766 973943 684794 0
/commun/data/projects/folder1234/11X0602_pair1_sorted.bam 982006 815636 566136 0
/commun/data/projects/folder1234/11X0602_realigned.bam 982006 815636 566136 0
/commun/data/projects/folder1234/11X0602_markdup.bam 982006 815636 566136 0
/commun/data/projects/folder1234/11X0602_pair1_unsorted.bam 982006 815636 566136 0
/commun/data/projects/folder1234/11X0375_pair1_sorted.bam 1255772 1031155 735480 0
/commun/data/projects/folder1234/11X0375_realigned.bam 1255772 1031155 735480 0
/commun/data/projects/folder1234/11X0375_markdup.bam 1255772 1031155 735480 0
/commun/data/projects/folder1234/11X0367_pair1_sorted.bam 1422144 1158315 796466 0
/commun/data/projects/folder1234/11X0367_markdup.bam 1422144 1158315 796466 0
/commun/data/projects/folder1234/11X0375_pair1_unsorted.bam 1255772 1031155 735480 0
/commun/data/projects/folder1234/11X0367_realigned.bam 1422144 1158315 796466 0
/commun/data/projects/folder1234/11X0291_pair1_sorted.bam 1296294 1067233 743924 0
/commun/data/projects/folder1234/11X0291_markdup.bam 1296294 1067233 743924 0
/commun/data/projects/folder1234/11X0291_realigned.bam 1296294 1067233 743924 0
/commun/data/projects/folder1234/11X0311_markdup.bam 1353534 1117463 788474 0
/commun/data/projects/folder1234/11X0311_realigned.bam 1353534 1117463 788474 0
/commun/data/projects/folder1234/11X0375_recal.bam 1255772 1031155 735480 0
/commun/data/projects/folder1234/11X0589_pair1_unsorted.bam 1179766 973943 684794 0
/commun/data/projects/folder1234/11X0367_pair1_unsorted.bam 1422144 1158315 796466 0
/commun/data/projects/folder1234/11X0290_pair1_sorted.bam 1436970 1191044 837726 0
/commun/data/projects/folder1234/11X0290_realigned.bam 1436970 1191044 837726 0
/commun/data/projects/folder1234/11X0291_pair1_unsorted.bam 1296294 1067233 743924 0
/commun/data/projects/folder1234/11X0311_pair1_sorted.bam 1353534 1117463 788474 0
/commun/data/projects/folder1234/11X0240_pair1_sorted.bam 1489672 1229654 830042 0
/commun/data/projects/folder1234/11X0240_realigned.bam 1489672 1229654 830042 0
/commun/data/projects/folder1234/11X0240_markdup.bam 1489672 1229654 830042 0
/commun/data/projects/folder1234/XX10272_pair1_sorted.bam 1579608 1296449 876040 0
/commun/data/projects/folder1234/XX10272_markdup.bam 1579608 1296449 876040 0
/commun/data/projects/folder1234/XX10272_realigned.bam 1579608 1296449 876040 0
/commun/data/projects/folder1234/11X0602_recal.bam 982006 815636 566136 0
/commun/data/projects/folder1234/11X0290_markdup.bam 1436970 1191044 837726 0
/commun/data/projects/folder1234/11X0290_pair1_unsorted.bam 1436970 1191044 837726 0
/commun/data/projects/folder1234/11X0247_pair1_sorted.bam 1489832 1233208 862842 0
/commun/data/projects/folder1234/11X0456_pair1_sorted.bam 1486490 1220969 848738 0
/commun/data/projects/folder1234/11X0456_realigned.bam 1486490 1220969 848738 0
/commun/data/projects/folder1234/11X0247_markdup.bam 1489832 1233208 862842 0
/commun/data/projects/folder1234/11X0247_realigned.bam 1489832 1233208 862842 0
/commun/data/projects/folder1234/11X0456_markdup.bam 1486490 1220969 848738 0
/commun/data/projects/folder1234/11X0367_recal.bam 1422144 1158315 796466 0
/commun/data/projects/folder1234/11X0305_pair1_sorted.bam 1818236 1489267 1027458 0
/commun/data/projects/folder1234/11X0305_markdup.bam 1818236 1489267 1027458 0
/commun/data/projects/folder1234/11X0305_realigned.bam 1818236 1489267 1027458 0
/commun/data/projects/folder1234/11X0542_pair1_sorted.bam 1784860 1469183 1016314 0
/commun/data/projects/folder1234/XX07551_recal.bam 1778358 1456518 1018694 0
/commun/data/projects/folder1234/11X0542_realigned.bam 1784860 1469183 1016314 0
/commun/data/projects/folder1234/11X0542_markdup.bam 1784860 1469183 1016314 0
/commun/data/projects/folder1234/11X0311_pair1_unsorted.bam 1353534 1117463 788474 0
/commun/data/projects/folder1234/11X0291_recal.bam 1296294 1067233 743924 0
/commun/data/projects/folder1234/XX07551_pair1_sorted.bam 1778358 1456518 1018694 0
/commun/data/projects/folder1234/11X0240_pair1_unsorted.bam 1489672 1229654 830042 0
/commun/data/projects/folder1234/XX10272_pair1_unsorted.bam 1579608 1296449 876040 0
/commun/data/projects/folder1234/XX07551_markdup.bam 1778358 1456518 1018694 0
/commun/data/projects/folder1234/XX07551_realigned.bam 1778358 1456518 1018694 0
/commun/data/projects/folder1234/11X0305_pair1_unsorted.bam 1818236 1489267 1027458 0
/commun/data/projects/folder1234/11X0311_recal.bam 1353534 1117463 788474 0
/commun/data/projects/folder1234/11X0456_pair1_unsorted.bam 1486490 1220969 848738 0
/commun/data/projects/folder1234/11X0247_pair1_unsorted.bam 1489832 1233208 862842 0
/commun/data/projects/folder1234/11X0542_pair1_unsorted.bam 1784860 1469183 1016314 0
/commun/data/projects/folder1234/XX07551_pair1_unsorted.bam 1778358 1456518 1018694 0
/commun/data/projects/folder1234/11X0433_pair1_sorted.bam 2178798 1800943 1249738 0
/commun/data/projects/folder1234/11X0433_markdup.bam 2178798 1800943 1249738 0
/commun/data/projects/folder1234/XX10272_recal.bam 1579608 1296449 876040 0
/commun/data/projects/folder1234/11X0433_realigned.bam 2178798 1800943 1249738 0
/commun/data/projects/folder1234/11X0240_recal.bam 1489672 1229654 830042 0
/commun/data/projects/folder1234/11X0290_recal.bam 1436970 1191044 837726 0
/commun/data/projects/folder1234/11X0456_recal.bam 1486490 1220969 848738 0
/commun/data/projects/folder1234/11X0247_recal.bam 1489832 1233208 862842 0
/commun/data/projects/folder1234/11X0433_pair1_unsorted.bam 2178798 1800943 1249738 0
/commun/data/projects/folder1234/11X0305_recal.bam 1818236 1489267 1027458 0
/commun/data/projects/folder1234/11X0542_recal.bam 1784860 1469183 1016314 0
/commun/data/projects/folder1234/11X0433_recal.bam 2178798 1800943 1249738 0
However I noticed seems that running this code on the cluster is slower that running it on the master node...

The C code

Note: the C code is compiled with ortecc, not gcc.


I'm still very new to this API and to those concepts. Feel free to suggest anything :-)
That's it,
Pierre

07 February 2013

A tool to compare the BAMs

Following that thread on Biostar, I've created a tool that compares two or more BAMs. This java program uses the Picard and berkeleydb-JE libraries and is available at: http://code.google.com/p/jvarkit/wiki/CompareBams.

Download & install

see http://code.google.com/p/jvarkit/wiki/CompareBams.

Example

The following Makefile align the same pair of FASTQs with 5 different parameters for bwa aln -O (gap_open_penalty)

FASTQ1=001.fastq.gz
FASTQ2=002.fastq.gz
REF=/path/to/human_g1k_v37.fasta
BWA=bwa
SAMTOOLS=samtools
ALL_BAMS= 

define SAI
$(1)_1.sai : ${FASTQ1}  $(REF)
        $(BWA) aln  $(2) -t 2  -f $$@ $(REF) $$<
$(1)_2.sai :${FASTQ2}  $(REF)
        $(BWA) aln  $(2) -t 2  -f $$@ $(REF) $$<

endef

define ALN

ALL_BAMS+= $(1).bam 

$(eval $(call SAI,$(1),$(2)))

$(1).bam: $(1)_1.sai $(1)_2.sai
        $(BWA) sampe $(3)  ${REF} \
                $(1)_1.sai $(1)_2.sai  \
                $(FASTQ1) $(FASTQ2) |\
        $(SAMTOOLS) view -S -b -o $$@ -T $(REF) - 

endef

.PHONY:all

all: diff.gz

$(eval $(foreach GAP, 8 9 10 11 12 , $(call ALN,GAP$(GAP), -O $(GAP) , )))


diff.gz: $(ALL_BAMS)
        mkdir -p tmp.bdb
        java -jar  /commun/data/packages/jvarkit/comparebams.jar -d tmp.bdb $^ | gzip --best > $@
execute:
$ make

(...)
java -jar  /path/to/jvarkit/comparebams.jar -d tmp.bdb GAP8.bam GAP9.bam GAP10.bam GAP11.bam GAP12.bam | gzip --best > diff.gz
Feb 07, 2013 2:09:57 PM fr.inserm.umr1087.jvarkit.tools.cmpbams.CompareBams run
#INFO: in GAP8.bam count:1
Feb 07, 2013 2:10:30 PM fr.inserm.umr1087.jvarkit.tools.cmpbams.CompareBams run
INFO: in GAP9.bam count:1
(...)

The output is a tab delimited file containing

  • the name of the read
  • the comparison of each bam vs the others EQ=equals, NE=not-equals
  • the positions of the reads in each bam
$ gunzip -c diff.gz | egrep -E  '(#|NE)' | head


#READ-Name      GAP8.bam GAP9.bam|GAP8.bam GAP10.bam|GAP8.bam GAP11.bam|GAP8.bam GAP12.bam|GAP9.bam GAP10.bam|GAP9.bam GAP11.bam|GAP9.bam GAP12.bam|GAP10.bam GAP11.bam|GAP10.bam GAP12.bam|GAP11.bam GAP12.bam GAP8.bam        GAP9.bam        GAP10.bam       GAP11.bam       GAP12.bam
M00491:10:000000000-A27BP:1:1101:10029:10672    EQ|NE|NE|NE|NE|NE|NE|EQ|EQ|EQ   6:123892013,6:123892006 6:123892013,6:123892006 6:123892005,6:123892013 6:123892005,6:123892013 6:123892005,6:123892013
M00491:10:000000000-A27BP:1:1101:10265:10054    EQ|EQ|NE|NE|EQ|NE|NE|NE|NE|NE   19:49671437,19:49671412 19:49671437,19:49671412 19:49671437,19:49671412 19:49671435     19:49671412,19:49671435
M00491:10:000000000-A27BP:1:1101:10904:12333    EQ|NE|NE|NE|NE|NE|NE|EQ|EQ|EQ   10:88681151,10:88681156 10:88681151,10:88681156 10:88681156,10:88681150 10:88681156,10:88681150 10:88681156,10:88681150
M00491:10:000000000-A27BP:1:1101:11211:13492    EQ|NE|NE|NE|NE|NE|NE|EQ|EQ|EQ   8:52321469      8:52321469      8:52321470      8:52321470      8:52321470
M00491:10:000000000-A27BP:1:1101:11298:18283    NE|NE|NE|NE|EQ|EQ|EQ|EQ|EQ|EQ   6:126071103,6:126071110 6:126071104,6:126071110 6:126071104,6:126071110 6:126071104,6:126071110 6:126071104,6:126071110
M00491:10:000000000-A27BP:1:1101:11381:15675    EQ|NE|NE|NE|NE|NE|NE|EQ|EQ|EQ   1:156106900,1:156106905 1:156106900,1:156106905 1:156106905,1:156106899 1:156106905,1:156106899 1:156106905,1:156106899
M00491:10:000000000-A27BP:1:1101:12189:14088    EQ|NE|NE|EQ|NE|NE|EQ|EQ|NE|NE   15:22015803     15:22015803     15:21009140     15:21009140     15:22015803
M00491:10:000000000-A27BP:1:1101:12382:11193    EQ|NE|NE|NE|NE|NE|NE|EQ|EQ|EQ   4:111542263,4:111542256 4:111542263,4:111542256 4:111542263,4:111542254 4:111542263,4:111542254 4:111542263,4:111542254
M00491:10:000000000-A27BP:1:1101:12998:24492    EQ|NE|NE|NE|NE|NE|NE|EQ|EQ|EQ   2:179433503,2:179433496 2:179433503,2:179433496 2:179433503,2:179433497 2:179433503,2:179433497 2:179433503,2:179433497
(....)

That's it
Pierre