YOKOFAKUN: Inside Jvarkit: Shrink your fastqs by 40% by aligning them to REF before compression.

BamToFastq is an implementation of https://twitter.com/DNAntonie/status/402909852277932032 :

Shrink your FASTQ.bz2 files by 40+% using this one weird tip -> order them by alignment to reference before compression!
— Antonie DNA Software (@DNAntonie) November 19, 2013

Example : piping bwa mem

$ bwa mem -M human_g1k_v37.fasta Sample1_L001_R1_001.fastq.gz Sample1_L001_R2_001.fastq.gz |\
  java -jar dist/bam2fastq.jar -F tmpR1.fastq.gz -R tmpR2.fastq.gz

before:

$ ls -lah Sample1_L001_R1_001.fastq.gz Sample1_L001_R2_001.fastq.gz
-rw-r--r-- 1 lindenb lindenb 181M Jun 14 15:20 Sample1_L001_R1_001.fastq.gz
-rw-r--r-- 1 lindenb lindenb 190M Jun 14 15:20 Sample1_L001_R2_001.fastq.gz

after (these are HaloPlex data, with a lot of duplicates):

$ ls -lah tmpR1.fastq.gz  tmpR2.fastq.gz
-rw-rw-r-- 1 lindenb lindenb  96M Nov 20 17:10 tmpR1.fastq.gz
-rw-rw-r-- 1 lindenb lindenb 106M Nov 20 17:10 tmpR2.fastq.gz

using BZ2:

$  ls -lah *.bz2
-rw-rw-r-- 1 lindenb lindenb 77M Nov 20 17:55 tmpR1.fastq.bz2
-rw-rw-r-- 1 lindenb lindenb 87M Nov 20 17:55 tmpR2.fastq.bz2
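
To put numbers on the listings above, here is a small sketch that computes the relative size reduction for the gzip files. The sizes are the rounded values printed by `ls -lah`, so the percentages are approximate, and the helper name `reduction_percent` is only illustrative.

```python
def reduction_percent(before_mb, after_mb):
    """Relative size reduction, in percent, going from before_mb to after_mb."""
    return 100.0 * (before_mb - after_mb) / before_mb

# Sizes in MB as reported (rounded) by `ls -lah` in the listings above:
# original .fastq.gz versus the re-ordered .fastq.gz.
gz_sizes = {
    "R1": (181, 96),
    "R2": (190, 106),
}

for name, (before, after) in gz_sizes.items():
    pct = reduction_percent(before, after)
    print(f"{name}: {before}M -> {after}M ({pct:.0f}% smaller)")
```

With these rounded sizes this gives about 47% for R1 and 44% for R2. The bz2 files are smaller still, but comparing them to the original gzip files would mix the effect of the ordering with the effect of the compressor.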

That's it
Pierre

2 comments:

Frogee, Thursday, 21 November, 2013
Hi Pierre,

I'm not sure I understand how this works. Does having reads with similar sequence in close proximity in the file allow the compression to be more efficient?
Pierre Lindenbaum, Thursday, 21 November, 2013
Yes: reads mapping to the same part of the genome end up next to each other in the file, so the compressor finds more similar strings to index. See: http://www.quora.com/Data-Compression/How-does-file-compression-work
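
A toy illustration of this effect, as a hedged sketch rather than anything taken from jvarkit: it samples error-free reads from a random made-up reference, then gzips the sequences once in sampling order and once sorted by position. All names and parameters (`simulate_reads`, `gzip_size`, the reference length, the read count) are assumptions chosen for the example. Real FASTQ files also carry read names and qualities, so the gain on real data will differ.

```python
import gzip
import random

def simulate_reads(ref_len=200_000, n_reads=20_000, read_len=100, seed=0):
    """Sample (position, sequence) pairs from a random toy reference.

    Assumption: reads are error-free substrings of the reference,
    returned in random order, as they might come off a sequencer.
    """
    rng = random.Random(seed)
    ref = "".join(rng.choice("ACGT") for _ in range(ref_len))
    reads = []
    for _ in range(n_reads):
        pos = rng.randrange(ref_len - read_len)
        reads.append((pos, ref[pos:pos + read_len]))
    return reads

def gzip_size(seqs):
    """Size in bytes of the gzip-compressed, newline-joined sequences."""
    return len(gzip.compress("\n".join(seqs).encode("ascii")))

reads = simulate_reads()

# Sequencer order: neighbouring reads come from unrelated positions.
unsorted_size = gzip_size([seq for _, seq in reads])

# Alignment order: neighbouring reads overlap, so the compressor can
# reuse strings it has just seen within its sliding window.
sorted_size = gzip_size([seq for _, seq in sorted(reads)])

print("unsorted:", unsorted_size, "bytes")
print("sorted:  ", sorted_size, "bytes")
```

In this toy setting the sorted stream should compress noticeably better, because gzip only looks back over a limited window and sorting places overlapping reads inside that window.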

20 November 2013
