Inside Jvarkit: Shrink your fastqs by 40% by aligning them to REF before compression.
BamToFastq is an implementation of the idea in this tweet: https://twitter.com/DNAntonie/status/402909852277932032
Shrink your FASTQ.bz2 files by 40+% using this one weird tip -> order them by alignment to reference before compression!
— Antonie DNA Software (@DNAntonie) November 19, 2013
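To see why the ordering matters, here is a toy sketch in Python. It is not what BamToFastq does internally: it draws error-free reads from a random, made-up reference (the sizes `REF_LEN`, `READ_LEN` and `N_READS` are arbitrary assumptions), then compresses the same FASTQ records once in "sequencer" order and once sorted by position. Sorted reads overlap their neighbours, so the compressor finds the repeated sequence inside its window. Real data has sequencing errors and variable qualities, so the gain will differ.

```python
import random
import zlib

random.seed(42)  # fixed seed so the run is repeatable

# Hypothetical sizes, chosen only to keep the demo fast.
REF_LEN = 200_000   # length of the made-up reference
READ_LEN = 100      # length of each simulated read
N_READS = 10_000    # roughly 5x coverage

reference = "".join(random.choice("ACGT") for _ in range(REF_LEN))

# Each entry is (position, FASTQ record). Reads are generated in random
# order, which mimics the order they come off the sequencer.
reads = []
for i in range(N_READS):
    pos = random.randrange(REF_LEN - READ_LEN)
    seq = reference[pos:pos + READ_LEN]
    record = f"@read{i}\n{seq}\n+\n{'I' * READ_LEN}\n"
    reads.append((pos, record))


def compressed_size(records):
    """Concatenate FASTQ records and return their zlib-compressed size in bytes."""
    return len(zlib.compress("".join(records).encode("ascii"), 9))


unsorted_size = compressed_size(rec for _, rec in reads)
sorted_size = compressed_size(rec for _, rec in sorted(reads, key=lambda r: r[0]))

print(f"sequencer order: {unsorted_size} bytes")
print(f"alignment order: {sorted_size} bytes")
print(f"saved: {100 * (1 - sorted_size / unsorted_size):.1f}%")
```

The same records go into both runs; only their order changes.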
Example: piping bwa mem
$ bwa mem -M human_g1k_v37.fasta Sample1_L001_R1_001.fastq.gz Sample2_S5_L001_R2_001.fastq.gz |\
  java -jar dist/bam2fastq.jar -F tmpR1.fastq.gz -R tmpR2.fastq.gz
before:
$ ls -lah Sample1_L001_R1_001.fastq.gz Sample2_S5_L001_R2_001.fastq.gz
-rw-r--r-- 1 lindenb lindenb 181M Jun 14 15:20 Sample1_L001_R1_001.fastq.gz
-rw-r--r-- 1 lindenb lindenb 190M Jun 14 15:20 Sample1_L001_R2_001.fastq.gz
after (these are Haloplex data, with a lot of duplicates):
$ ls -lah tmpR1.fastq.gz tmpR2.fastq.gz
-rw-rw-r-- 1 lindenb lindenb  96M Nov 20 17:10 tmpR1.fastq.gz
-rw-rw-r-- 1 lindenb lindenb 106M Nov 20 17:10 tmpR2.fastq.gz
using BZ2:
$ ls -lah *.bz2
-rw-rw-r-- 1 lindenb lindenb 77M Nov 20 17:55 tmpR1.fastq.bz2
-rw-rw-r-- 1 lindenb lindenb 87M Nov 20 17:55 tmpR2.fastq.bz2
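For the record, the savings can be worked out from the sizes listed by `ls` (in MB, as rounded by `ls -h`). A small sketch; the dictionary layout and the `percent_saved` helper are mine, only the numbers come from the listings. Note that the bz2 figures are compared against the original gzip files, since I did not bzip2 the unsorted reads.

```python
# Sizes in MB, copied from the ls listings.
sizes_mb = {
    "R1": {"original_gz": 181, "sorted_gz": 96, "sorted_bz2": 77},
    "R2": {"original_gz": 190, "sorted_gz": 106, "sorted_bz2": 87},
}


def percent_saved(before, after):
    """Return the size reduction from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


savings = {}
for mate, s in sizes_mb.items():
    savings[mate] = {
        "gz": percent_saved(s["original_gz"], s["sorted_gz"]),
        "bz2": percent_saved(s["original_gz"], s["sorted_bz2"]),
    }
    print(f"{mate}: gz {savings[mate]['gz']:.0f}% smaller, "
          f"bz2 {savings[mate]['bz2']:.0f}% smaller than the original .gz")
```

That is about 47% and 44% for the sorted gzip files, in line with the 40+% of the tweet.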
That's it
Pierre