I recently asked on Biostar if it would be worth to mask the non-genic sequence before aligning the short reads on the reference after an exome sequencing. Although I was convinced by the answer of lh3, I was curious to observe the difference with some real data.
I've downloaded two fastqs files from ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/pilot_data/data/NA20772/sequence_read/ERR004053_*.recal.fastq.gz and the sequence for the human chromosome chr1 from the UCSC.
One copy of chr1.fa was masked using the UCSC knownGene table +/- 10kb using a custom software. The bases were replaced by a 'N' if they were not contained in a genic region+/-10kb.
- Number of 'N' in chr1 without masking:22,250,000
- Number of 'N' in chr1 with masking:108,399,153
Then, the two fastq were aligned on each chr1 (masked/not masked) and the mutations were called with 'samtools pileup':
bwa-0.5.9rc1/bwa index chr1.fa
bwa-0.5.9rc1/bwa aln chr1.fa ERR004053_1.recal.fastq.gz > aln1.sai
bwa-0.5.9rc1/bwa aln chr1.fa ERR004053_2.recal.fastq.gz > aln2.sai
bwa-0.5.9rc1/bwa sampe chr1.fa aln1.sai aln2.sai ERR004053_1.recal.fastq.gz ERR004053_2.recal.fastq.gz > file.sam
samtools-0.1.10/samtools faidx chr1.fa
samtools-0.1.10/samtools view -b -t chr1.fa.fai file.sam > file.bam
samtools-0.1.10/samtools sort file.bam sorted
samtools-0.1.10/samtools index sorted.bam
samtools-0.1.10/samtools pileup -vcf chr1.fa sorted.bam |\
awk '($3=="*"&&$6>=50)||($3!="*"&&$6>=20)' |\
cut -d ' ' -f 1-4 |\
sort | uniq > pileup.txt
At the end:
- Number of mutations (no masking): 24921
- Number of mutations (masking): 26062
- Number of mutations common in 'masking'/'no masking': 13573
- Number of mutations unique in 'no masking': 12489
- Number of mutations unique in 'masking': 11348
'chr1:100005960 c/A': a mutation from the 'masked' sequence but not found in 'not-masked':
chr1 masked
100005921 100005931 100005941 100005951 100005961 100005971
tgctaattggtcagattggagatggaatca*tggggggtcgacgtgaggttttcttgctgtcttct
....G.........G............... MM.......R.A....RK.................
.. ,,,,,,,,,,,,,,cac,,,,,,,,,a,,,,a,,, ,,,,,,,,,,,,,
.... ..........*CA.......A.A.......N....... ,,,,,,
.. ..........*CA.......A.A............... ,,,,
G...G..G.. ..........*CA.......A.A...............
....G.........G..T.. ...........T.................
....G....... .....A.....T.................
,,,,g,,,,,,,,,g,,,,,,,,,,,,,,,*,,,,, ..............G........
chr1 NOT masked
100005931 100005941 100005951 100005961 100005971 100005981
tcagattggagatggaatcatggggggtcgacgtgaggttttcttgctgtcttctgttcctgggtg
..........CA.......R.A.....K............................
..........CA.......A.A.......N.......
...........T.........................
.....A.....T.........................
.A..................................
..............G...................
in this case, it is visiblethat the reads have been more correctly aligned on the non-masked sequence.
That's it
Pierre