20 November 2013

Inside Jvarkit: Shrink your fastqs by 40% by aligning them to REF before compression.

BamToFastq is an implementation of the idea in this tweet: https://twitter.com/DNAntonie/status/402909852277932032

Example: piping bwa mem

$ bwa mem -M  human_g1k_v37.fasta  Sample1_L001_R1_001.fastq.gz Sample2_S5_L001_R2_001.fastq.gz |\
  java -jar dist/bam2fastq.jar  -F tmpR1.fastq.gz -R tmpR2.fastq.gz
$ ls -lah Sample1_L001_R1_001.fastq.gz Sample2_S5_L001_R2_001.fastq.gz
-rw-r--r-- 1 lindenb lindenb 181M Jun 14 15:20 Sample1_L001_R1_001.fastq.gz
-rw-r--r-- 1 lindenb lindenb 190M Jun 14 15:20 Sample1_L001_R2_001.fastq.gz

After (these are HaloPlex data, with a lot of duplicates):

$ ls -lah tmpR1.fastq.gz  tmpR2.fastq.gz
-rw-rw-r-- 1 lindenb lindenb  96M Nov 20 17:10 tmpR1.fastq.gz
-rw-rw-r-- 1 lindenb lindenb 106M Nov 20 17:10 tmpR2.fastq.gz

Using bzip2:

$  ls -lah *.bz2
-rw-rw-r-- 1 lindenb lindenb 77M Nov 20 17:55 tmpR1.fastq.bz2
-rw-rw-r-- 1 lindenb lindenb 87M Nov 20 17:55 tmpR2.fastq.bz2
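
To put numbers on the shrinkage, here is a small sketch that computes the percentage saved from the sizes printed by ls above. The helper name shrink is made up for this illustration, and the sizes are the rounded megabyte values from the listings, so the results are approximate. Note that the bz2 figures compare against the original gzip files, so they mix the effect of the reordering with the effect of the different compressor.

```shell
#!/bin/sh
# Hypothetical helper: percentage saved going from size $1 to size $2.
# Sizes are the rounded MB values copied from the ls listings.
shrink() {
  LC_ALL=C awk -v before="$1" -v after="$2" \
    'BEGIN { printf "%.1f\n", 100 * (before - after) / before }'
}

r1_gz=$(shrink 181 96)    # R1: original .gz vs realigned .gz
r2_gz=$(shrink 190 106)   # R2: original .gz vs realigned .gz
r1_bz2=$(shrink 181 77)   # R1: original .gz vs realigned .bz2
r2_bz2=$(shrink 190 87)   # R2: original .gz vs realigned .bz2

echo "R1 gz:  $r1_gz % smaller"
echo "R2 gz:  $r2_gz % smaller"
echo "R1 bz2: $r1_bz2 % smaller"
echo "R2 bz2: $r2_bz2 % smaller"
```

With these rounded sizes the gzip files come out about 47% (R1) and 44% (R2) smaller than the originals.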

That's it


Frogee said...

Hi Pierre,

I'm not sure I understand how this works. Does having reads with similar sequence in close proximity in the file allow the compression to be more efficient?

Pierre Lindenbaum said...

The reads map to the same part of the genome, so the compressor finds more similar strings to index. See: http://www.quora.com/Data-Compression/How-does-file-compression-work
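
The effect described in this answer can be tried on toy data. The sketch below is only an illustration under stated assumptions: it uses a random synthetic reference and plain sequence lines rather than real FASTQ, it is not what bam2fastq does internally, and the variable names (reflen, nreads, readlen) are invented for the example. It compresses the same reads twice with gzip, once in random order and once sorted by position on the reference, and prints both sizes.

```shell
#!/bin/sh
# Toy demonstration on synthetic data: the same reads, gzipped in random
# order and then sorted by their position on a made-up reference.
set -eu
tmp=$(mktemp -d)

# Build a random reference, then sample reads at random positions.
# Output columns: position <TAB> read sequence (already in random order).
awk -v reflen=200000 -v nreads=20000 -v readlen=100 'BEGIN {
  srand(42)
  split("A C G T", base, " ")
  for (i = 1; i <= reflen; i++) ref[i] = base[int(rand() * 4) + 1]
  for (r = 0; r < nreads; r++) {
    pos = int(rand() * (reflen - readlen + 1)) + 1
    seq = ""
    for (j = 0; j < readlen; j++) seq = seq ref[pos + j]
    printf "%d\t%s\n", pos, seq
  }
}' > "$tmp/reads.tsv"

# Same reads, two orders: as generated, and sorted by reference position.
cut -f2 "$tmp/reads.tsv" | gzip -9 > "$tmp/unsorted.gz"
sort -n -k1,1 "$tmp/reads.tsv" | cut -f2 | gzip -9 > "$tmp/sorted.gz"

unsorted_size=$(wc -c < "$tmp/unsorted.gz" | tr -d ' ')
sorted_size=$(wc -c < "$tmp/sorted.gz" | tr -d ' ')
echo "random order:       $unsorted_size bytes"
echo "sorted by position: $sorted_size bytes"

rm -rf "$tmp"
```

In the sorted file, overlapping reads sit on adjacent lines, so gzip can encode most of each read as a back-reference to the previous one. In the random-order file an overlapping read is rarely within reach of the compressor's window.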