YOKOFAKUN: Don't mask this sequence, please.

I recently asked on Biostar whether it would be worth masking the non-genic sequence of the reference before aligning the short reads from an exome-sequencing experiment. Although I was convinced by lh3's answer, I was curious to see the difference with some real data.

I downloaded two FASTQ files from ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/pilot_data/data/NA20772/sequence_read/ERR004053_*.recal.fastq.gz and the sequence of the human chromosome chr1 from the UCSC.

One copy of chr1.fa was masked with a custom program, using the UCSC knownGene table: every base that did not fall within a genic region +/- 10 kb was replaced by an 'N'.
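The custom masking program is not shown in this post. As a rough illustration only, the idea can be sketched with awk, assuming the knownGene regions have been exported as a BED file (0-based start, end-exclusive). The function name `mask_outside` and the file names in the usage comment are hypothetical, and this naive version tests every base against every interval, so it is only practical on small inputs.

```shell
#!/bin/sh
# mask_outside FASTA BED FLANK   (hypothetical helper, not the program used in the post)
# Prints the FASTA with every base lying outside the BED intervals,
# each extended by FLANK bases on both sides, replaced by 'N'.
# Assumes a non-empty BED file.
mask_outside() {
    awk -v flank="$3" '
        # First input: BED intervals (chrom, 0-based start, exclusive end)
        NR == FNR {
            n++
            chrom[n] = $1
            start[n] = $2 - flank   # a negative start is harmless here
            stop[n]  = $3 + flank
            next
        }
        # Second input: FASTA header, reset the 0-based position counter
        /^>/ { name = substr($1, 2); pos = 0; print; next }
        # Second input: sequence line, keep a base only if an interval covers it
        {
            out = ""
            len = length($0)
            for (i = 1; i <= len; i++) {
                keep = 0
                for (k = 1; k <= n; k++)
                    if (chrom[k] == name && pos >= start[k] && pos < stop[k]) {
                        keep = 1
                        break
                    }
                out = out (keep ? substr($0, i, 1) : "N")
                pos++
            }
            print out
        }
    ' "$2" "$1"
}

# Usage (hypothetical file names):
# mask_outside chr1.fa knownGene.chr1.bed 10000 > chr1.masked.fa
```

For a whole chromosome, an interval index or a dedicated tool would be a better choice than this quadratic loop.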

Number of 'N' in chr1 without masking: 22,250,000
Number of 'N' in chr1 with masking: 108,399,153
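Counts like these can be obtained with standard tools. The following is a minimal sketch, not necessarily how the numbers above were produced; `count_n` is a hypothetical helper name and it counts uppercase 'N' only.

```shell
#!/bin/sh
# count_n FASTA   (hypothetical helper)
# Prints the number of 'N' bases in a FASTA file, ignoring header lines.
count_n() {
    grep -v '^>' "$1" |   # drop the '>' header lines
    tr -cd 'N' |          # delete every character except 'N'
    wc -c |               # count what is left
    tr -d ' '             # some wc implementations pad with spaces
}

# Usage:
# count_n chr1.fa
```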

Then the two FASTQ files were aligned against each version of chr1 (masked / not masked) and the mutations were called with 'samtools pileup':

bwa-0.5.9rc1/bwa index chr1.fa
bwa-0.5.9rc1/bwa aln chr1.fa ERR004053_1.recal.fastq.gz > aln1.sai
bwa-0.5.9rc1/bwa aln chr1.fa ERR004053_2.recal.fastq.gz > aln2.sai
bwa-0.5.9rc1/bwa sampe chr1.fa aln1.sai aln2.sai ERR004053_1.recal.fastq.gz ERR004053_2.recal.fastq.gz > file.sam
samtools-0.1.10/samtools faidx chr1.fa
samtools-0.1.10/samtools view -b -t chr1.fa.fai file.sam > file.bam
samtools-0.1.10/samtools sort file.bam sorted
samtools-0.1.10/samtools index sorted.bam
samtools-0.1.10/samtools pileup -vcf chr1.fa sorted.bam |\
awk '($3=="*"&&$6>=50)||($3!="*"&&$6>=20)' |\
cut -f 1-4 |\
sort | uniq > pileup.txt

At the end:

Number of mutations (no masking): 24921

Number of mutations (masking): 26062

Number of mutations common in 'masking'/'no masking': 13573

Number of mutations unique in 'no masking': 12489

Number of mutations unique in 'masking': 11348
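A comparison of this kind can be done with `sort` and `comm`. This is a sketch under the assumption that the two runs were saved under different names (the file names in the usage comment are hypothetical); `compare_calls` is likewise a hypothetical helper. For each file, the common count plus its unique count should add up to that file's total, which makes a quick sanity check.

```shell
#!/bin/sh
# compare_calls FILE_A FILE_B   (hypothetical helper)
# Prints how many distinct lines are shared by both files
# and how many are found in only one of them.
compare_calls() {
    cc_a=$(mktemp) && cc_b=$(mktemp) || return 1
    # comm requires both inputs to be sorted in the same collation order
    LC_ALL=C sort -u "$1" > "$cc_a"
    LC_ALL=C sort -u "$2" > "$cc_b"
    printf 'common\t%s\n' \
        "$(LC_ALL=C comm -12 "$cc_a" "$cc_b" | wc -l | tr -d ' ')"
    printf 'only_first\t%s\n' \
        "$(LC_ALL=C comm -23 "$cc_a" "$cc_b" | wc -l | tr -d ' ')"
    printf 'only_second\t%s\n' \
        "$(LC_ALL=C comm -13 "$cc_a" "$cc_b" | wc -l | tr -d ' ')"
    rm -f "$cc_a" "$cc_b"
}

# Usage (hypothetical file names):
# compare_calls pileup.nomask.txt pileup.mask.txt
```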

'chr1:100005960 c/A': a mutation called on the 'masked' sequence but not found on the 'not masked' one:

chr1 masked

  100005921 100005931 100005941  100005951 100005961 100005971
tgctaattggtcagattggagatggaatca*tggggggtcgacgtgaggttttcttgctgtcttct
....G.........G............... MM.......R.A....RK.................
..              ,,,,,,,,,,,,,,cac,,,,,,,,,a,,,,a,,,  ,,,,,,,,,,,,,
....                ..........*CA.......A.A.......N.......  ,,,,,,
..                  ..........*CA.......A.A...............    ,,,,
G...G..G..          ..........*CA.......A.A...............
....G.........G..T..                 ...........T.................
....G.......                         .....A.....T.................
,,,,g,,,,,,,,,g,,,,,,,,,,,,,,,*,,,,,       ..............G........

chr1 NOT masked

  100005931 100005941 100005951 100005961 100005971 100005981
tcagattggagatggaatcatggggggtcgacgtgaggttttcttgctgtcttctgttcctgggtg
          ..........CA.......R.A.....K............................
          ..........CA.......A.A.......N.......
                          ...........T.........................
                          .....A.....T.........................
                              .A..................................
                                ..............G...................

In this case, it is clear that the reads were aligned more correctly on the non-masked sequence.

That's it.

Pierre

YOKOFAKUN

05 January 2011
