Showing posts with label sequencing. Show all posts
Showing posts with label sequencing. Show all posts

18 September 2012

Notes about bwa 0.6.2 and the multiple hits.

Here are my notes about the way bwa 0.6.2 handles the multiple hits.


I've created a sub-sequence (~6Mb ) of the human chr22 (see the Makefile below). The sequence doesn't contain any base 'N' or any lowercase (=repeatMasked) letter.
1E6 reads (forward and reverse) have been generated from this ~chr22 with samtools wgsim. No mutation and no sequencing error have been allowed.

This ~chr22 sequence have been duplicated and renamed as 'DUP'. Both sequences have been merged in one fasta file named 'twoChrom.fa'

The reads have been aligned with BWA (v. 0.6.2-r126) on this genome.


There's only one SAM alignment per read

$ samtools view align.sorted.bam  | wc -l
2000000

1% of the reads were not properly paired/mapped on the genome

$ samtools view  -f 2 align.sorted.bam | wc -l
1999926

The properly mapped reads contain some informations about the alternative hits in 'XA'

$samtools view  -f 2 align.sorted.bam | grep 22_10_402_0 | verticalize -n

>>> 1
$1   22_10_402_0:0:0_0:0:0_c1608
$2   99
$3   22
$4   10
$5   0
$6   70M
$7   =
$8   333
$9   393
$10  GGGACGGTCATGCAATCTGGACAACATTCACCTTTAAAAGTTTATTGATCTTTTGTGACATGCACGTGGG
$11  IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
$12  XT:A:R
$13  NM:i:0
$14  SM:i:0
$15  AM:i:0
$16  X0:i:2
$17  X1:i:0
$18  XM:i:0
$19  XO:i:0
$20  XG:i:0
$21  MD:Z:70
$22  XA:Z:DUP,+10,70M,0;
<<< 1

>>> 2
$1   22_10_402_0:0:0_0:0:0_c1608
$2   147
$3   22
$4   333
$5   0
$6   70M
$7   =
$8   10
$9   -393
$10  ATACCTGCCAGATGAGTCACTGGCAAAAGGTGCTGCTCCCTGGTGAGGGAGAAACACCAGGGGCTGGGAG
$11  IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
$12  XT:A:R
$13  NM:i:0
$14  SM:i:0
$15  AM:i:0
$16  X0:i:2
$17  X1:i:0
$18  XM:i:0
$19  XO:i:0
$20  XG:i:0
$21  MD:Z:70
$22  XA:Z:DUP,-333,70M,0;
<<< 2

50% mapped on chr22, 50% mapped on DUP

$ samtools view  -f 2 align.sorted.bam | cut -d '      ' -f 3 | sort | uniq -c
 999860 22
1000066 DUP

Some pairs have been unmapped because the mate have been mapped on the other chromosome !

$ samtools-0.1.18/samtools view  -F 2 align.sorted.bam | grep 22_1210321_1210924 | verticalize -n

>>> 1
$1   22_1210321_1210924_0:0:0_0:0:0_e28f5
$2   81
$3   22
$4   1210855
$5   0
$6   70M
$7   DUP
$8   1210321
$9   0
$10  AGAGACTGAGTCCGGCTAGAGAACAGGGTGGAGCCCCTTTGGACCTTAGAGCTGGGCCTTTGGGCCTTGG
$11  IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
$12  XT:A:R
$13  NM:i:0
$14  SM:i:0
$15  AM:i:0
$16  X0:i:6
$17  X1:i:0
$18  XM:i:0
$19  XO:i:0
$20  XG:i:0
$21  MD:Z:70
$22  XA:Z:22,-1996342,70M,0;DUP,-1996342,70M,0;22,+2551778,70M,0;DUP,+2551778,70M,0;DUP,-1210855,70M,0;
<<< 1

>>> 2
$1   22_1210321_1210924_0:0:0_0:0:0_e28f5
$2   161
$3   DUP
$4   1210321
$5   0
$6   70M
$7   22
$8   1210855
$9   0
$10  CCCGGGCCGGATGGCTCGCCTGCGCGGCCAGCTCCGGGCCGAAGCGGCTTCGCGGTCCGAGGTGCCGCGG
$11  IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
$12  XT:A:R
$13  NM:i:0
$14  SM:i:0
$15  AM:i:0
$16  X0:i:2
$17  X1:i:0
$18  XM:i:0
$19  XO:i:0
$20  XG:i:0
$21  MD:Z:70
$22  XA:Z:22,+1210321,70M,0;
<<< 2

The Makefile

SAMDIR=/usr/local/package/samtools-0.1.18
SAMTOOLS=$(SAMDIR)/samtools
BCFTOOLS=$(SAMDIR)/bcftools/bcftools
BWA=/usr/local/package/bwa-0.6.2/bwa
REF1=chr22.fa
REF2=twoChrom.fa

.INTERMEDIATE : align.sam random_1.sai random_2.sai align.bam variations.bcf

%.bam : %.sam
        $(SAMTOOLS) view -o $@ -b -S -T $(REF2) $<
%.bam.bai : %.bam
        $(SAMTOOLS) index $<



align.sorted.bam : align.bam
        $(SAMTOOLS) sort $< align.sorted


align.sam : random_1.sai random_2.sai  
        $(BWA) sampe -a 600 $(REF2) $^ random_1.fq.gz random_2.fq.gz > $@

$(REF1):
        curl -s "http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/$(REF1).gz" |\
        gunzip -c | sed 's/^>chr/>/' | grep -v '[Nnatgc]' | head -n 100000 > $@


random_1.sai :  random_1.fq.gz $(REF2).bwt
        $(BWA) aln -f $@ $(REF2) $<

random_2.sai :  random_2.fq.gz $(REF2).bwt
        $(BWA) aln -f $@ $(REF2) $<

random_1.fq.gz random_2.fq.gz : $(REF1)
        $(SAMDIR)/misc/wgsim -N 1000000 -e 0.0 -r 0.0 -d 400 $< random_1.fq random_2.fq > wgsim.output
        gzip -f --best random_1.fq random_2.fq

$(REF2): $(REF1)
        cp $< $@
        sed 's/^>.*/>DUP/' $< >> $@

$(REF2).bwt : $(REF2)
        $(BWA) index -a bwtsw $<

$(REF2).fai :  $(REF2)
        $(SAMTOOLS) faidx $< 



clean:
        rm -f chr22.* *.bam *.vcf *.bcf *.sai *.gz *.fq *.bai  wgsim.output *.sam
That's it,

Pierre