Inside Jvarkit: Shrink your fastqs by 40% by aligning them to REF before compression.
BamToFastq is an implementation of the tip from this tweet: https://twitter.com/DNAntonie/status/402909852277932032
Shrink your FASTQ.bz2 files by 40+% using this one weird tip -> order them by alignment to reference before compression!
— Antonie DNA Software (@DNAntonie) November 19, 2013
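To get a feel for why the ordering matters, here is a toy simulation. It is only a sketch and not part of jvarkit: the reference is random, the reads carry no sequencing errors, the qualities are a constant dummy string, and the names `simulate_fastq` and `gzip_size` are made up for this illustration. It uses gzip from the Python standard library rather than bzip2. Real data will gain less than this idealized case, because qualities and read names do not benefit from the ordering.

```python
import gzip
import random


def simulate_fastq(ref_len=200_000, n_reads=20_000, read_len=100, seed=42):
    """Sample error-free reads from a random reference.

    Returns a list of (position, fastq_record) pairs in sampling order.
    """
    rng = random.Random(seed)
    reference = "".join(rng.choices("ACGT", k=ref_len))
    records = []
    for i in range(n_reads):
        pos = rng.randrange(ref_len - read_len + 1)
        seq = reference[pos:pos + read_len]
        qual = "I" * read_len  # constant dummy qualities (assumption)
        records.append((pos, f"@read{i}\n{seq}\n+\n{qual}\n"))
    return records


def gzip_size(records):
    """Size in bytes of the gzip-compressed concatenation of the records."""
    return len(gzip.compress("".join(records).encode("ascii")))


reads = simulate_fastq()

# Original order: reads come from random places in the reference.
unsorted_size = gzip_size([rec for _, rec in reads])

# Ordered by position: overlapping reads become neighbours in the file.
by_position = sorted(reads, key=lambda r: r[0])
sorted_size = gzip_size([rec for _, rec in by_position])

print(f"unsorted: {unsorted_size} bytes")
print(f"sorted:   {sorted_size} bytes")
print(f"ratio:    {sorted_size / unsorted_size:.2f}")
```

In the position-sorted file, each read largely repeats the read just before it, so the compressor can replace most of the sequence line with a short back-reference.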
Example: piping bwa mem
$ bwa mem -M human_g1k_v37.fasta Sample1_L001_R1_001.fastq.gz Sample2_S5_L001_R2_001.fastq.gz |\
  java -jar dist/bam2fastq.jar -F tmpR1.fastq.gz -R tmpR2.fastq.gz
before:
$ ls -lah Sample1_L001_R1_001.fastq.gz Sample2_S5_L001_R2_001.fastq.gz
-rw-r--r-- 1 lindenb lindenb 181M Jun 14 15:20 Sample1_L001_R1_001.fastq.gz
-rw-r--r-- 1 lindenb lindenb 190M Jun 14 15:20 Sample1_L001_R2_001.fastq.gz
after (these are Haloplex data, with a lot of duplicates):
$ ls -lah tmpR1.fastq.gz tmpR2.fastq.gz
-rw-rw-r-- 1 lindenb lindenb  96M Nov 20 17:10 tmpR1.fastq.gz
-rw-rw-r-- 1 lindenb lindenb 106M Nov 20 17:10 tmpR2.fastq.gz
using BZ2:
$ ls -lah *.bz2
-rw-rw-r-- 1 lindenb lindenb 77M Nov 20 17:55 tmpR1.fastq.bz2
-rw-rw-r-- 1 lindenb lindenb 87M Nov 20 17:55 tmpR2.fastq.bz2
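For the record, the reduction can be worked out from the rounded sizes that ls printed. The snippet below is only arithmetic on those numbers; the dictionary layout and the `reduction` helper are made up for this illustration. Note that the bz2 figures are compared against the original gzip files (no size is given here for unsorted bz2 files), so they mix the effect of the ordering with the change of compressor.

```python
# Sizes in MB, as printed by `ls -lah` (rounded by ls).
sizes_mb = {
    "R1": {"original_gz": 181, "sorted_gz": 96, "sorted_bz2": 77},
    "R2": {"original_gz": 190, "sorted_gz": 106, "sorted_bz2": 87},
}


def reduction(before, after):
    """Percentage by which `after` is smaller than `before`."""
    return 100.0 * (before - after) / before


for name, s in sizes_mb.items():
    gz = reduction(s["original_gz"], s["sorted_gz"])
    bz = reduction(s["original_gz"], s["sorted_bz2"])
    print(f"{name}: sorted gz -{gz:.1f}%, sorted bz2 -{bz:.1f}% (vs. original gz)")

# R1: sorted gz -47.0%, sorted bz2 -57.5% (vs. original gz)
# R2: sorted gz -44.2%, sorted bz2 -54.2% (vs. original gz)
```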
That's it
Pierre
2 comments:
Hi Pierre,
I'm not sure I understand how this works. Does having reads with similar sequences in close proximity in the file allow the compression to be more efficient?
Yes: reads that map to the same part of the genome end up next to each other in the file, so the compressor finds more similar strings to index. See: http://www.quora.com/Data-Compression/How-does-file-compression-work