YOKOFAKUN: Inside Jvarkit: Shrink your fastqs by 40% by aligning them to REF before compression.

20 November 2013

BamToFastq is an implementation of the idea in this tweet: https://twitter.com/DNAntonie/status/402909852277932032

Shrink your FASTQ.bz2 files by 40+% using this one weird tip -> order them by alignment to reference before compression!
— Antonie DNA Software (@DNAntonie) November 19, 2013

Example: piping the output of bwa mem into BamToFastq

$ bwa mem -M human_g1k_v37.fasta Sample1_L001_R1_001.fastq.gz Sample1_L001_R2_001.fastq.gz |\
  java -jar dist/bam2fastq.jar -F tmpR1.fastq.gz -R tmpR2.fastq.gz

before:

$ ls -lah Sample1_L001_R1_001.fastq.gz Sample1_L001_R2_001.fastq.gz
-rw-r--r-- 1 lindenb lindenb 181M Jun 14 15:20 Sample1_L001_R1_001.fastq.gz
-rw-r--r-- 1 lindenb lindenb 190M Jun 14 15:20 Sample1_L001_R2_001.fastq.gz

after (these are HaloPlex data, with a lot of duplicates):

$ ls -lah tmpR1.fastq.gz  tmpR2.fastq.gz
-rw-rw-r-- 1 lindenb lindenb  96M Nov 20 17:10 tmpR1.fastq.gz
-rw-rw-r-- 1 lindenb lindenb 106M Nov 20 17:10 tmpR2.fastq.gz
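
To relate these listings to the "40%" in the title, the saving can be worked out from the sizes above. This is a minimal sketch: `pct_saved` is a hypothetical helper name, and the inputs are the rounded values printed by `ls -h`, so the percentages are approximate.

```shell
# pct_saved BEFORE AFTER: print the size reduction as a whole percentage.
# (helper name is an assumption for illustration, not part of jvarkit)
pct_saved() {
  awk -v before="$1" -v after="$2" 'BEGIN { printf "%.0f%%\n", (1 - after / before) * 100 }'
}

pct_saved 181 96    # R1: 181M -> 96M, prints 47%
pct_saved 190 106   # R2: 190M -> 106M, prints 44%
```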

using bzip2:

$ ls -lah *.bz2
-rw-rw-r-- 1 lindenb lindenb 77M Nov 20 17:55 tmpR1.fastq.bz2
-rw-rw-r-- 1 lindenb lindenb 87M Nov 20 17:55 tmpR2.fastq.bz2
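
The command used to produce the .bz2 files is not shown above. One plausible way, assuming the standard `gunzip` and `bzip2` command-line tools, is to decompress each reordered file and recompress it. `gz_to_bz2` is a hypothetical helper name and may differ from what was actually run.

```shell
# gz_to_bz2 FILE.gz: write FILE.bz2 next to the input, keeping the original.
# (a sketch under the assumptions stated above, not the post's actual command)
gz_to_bz2() {
  gunzip -c "$1" | bzip2 -9 > "${1%.gz}.bz2"
}

# Recompress the reordered FASTQ files if they are present.
for f in tmpR1.fastq.gz tmpR2.fastq.gz; do
  [ -f "$f" ] || continue
  gz_to_bz2 "$f"
done
```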

That's it
Pierre

2 comments:

Frogee said...: Hi Pierre,

I'm not sure I understand how this works. Does having reads with similar sequence in close proximity in the file allow the compression to be more efficient?; Thursday, 21 November, 2013
Pierre Lindenbaum said...: Reads mapping to the same part of the genome end up next to each other in the file, so the compressor finds more similar strings to index. See: http://www.quora.com/Data-Compression/How-does-file-compression-work; Thursday, 21 November, 2013
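
The effect described in this exchange can be reproduced on synthetic data. The sketch below is only an illustration, not the BamToFastq code: it assumes `awk`, `sort`, `cut`, `gzip` and `mktemp` are available, and the reference length, read length and read count are arbitrary choices. It samples reads from a random "reference", then compresses them once in random order and once sorted by position.

```shell
workdir=$(mktemp -d)

# Sample 8000 reads of 100 bp at random positions of a random 200 kb reference.
# Each output line is "position<TAB>sequence"; the file order is already random.
awk 'BEGIN {
  srand(42)
  n = 200000; len = 100; reads = 8000
  split("A C G T", base, " ")
  for (i = 1; i <= n; i++) ref[i] = base[int(rand() * 4) + 1]
  for (r = 1; r <= reads; r++) {
    pos = int(rand() * (n - len)) + 1
    seq = ""
    for (j = 0; j < len; j++) seq = seq ref[pos + j]
    print pos "\t" seq
  }
}' > "$workdir/reads.tsv"

# Compressed size of the sequences in their original (random) order.
random_bytes=$(cut -f2 "$workdir/reads.tsv" | gzip -c | wc -c | tr -d ' ')

# Compressed size after sorting by position, so overlapping reads are adjacent.
sorted_bytes=$(sort -k1,1n "$workdir/reads.tsv" | cut -f2 | gzip -c | wc -c | tr -d ' ')

echo "random order: $random_bytes bytes"
echo "sorted order: $sorted_bytes bytes"
```

The sorted stream should come out smaller because overlapping reads fall within gzip's 32 KB match window. Real FASTQ files also carry read names and quality strings, so the saving on real data will differ.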
