08 April 2010

BAM, BWA and SAMTOOLS: my notebook

Here are my notes about MAQ,BWA and SAMTOOLS.

MAQ


Maq is a software that builds mapping assemblies from short reads generated by the next-generation sequencing machines. Here I used chrM.fa, the reference sequence for the mitochondrial genome
>chrM
GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCAT
TTGGTATTTTCGTCTGGGGGGTGTGCACGCGATAGCATTGCGAGACGCTG
GAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTGCCTCATT
CTATTATTTATCGCACCTACGTTCAATATTACAGGCGAACATACCTACTA
AAGTGTGTTAATTAATTAATGCTTGTAGGACATAATAATAACAATTGAAT
GTCTGCACAGCCGCTTTCCACACAGACATCATAACAAAAAATTTCCACCA
AACCCCCCCCTCCCCCCGCTTCTGGCCACAGCACTTAAACACATCTCTGC
CAAACCCCAAAAACAAAGAACCCTAACACCAGCCTAACCAGATTTCAAAT
TTTATCTTTAGGCGGTATGCACTTTTAACAGTCACCCCCCAACTA.....
fasta2bfa converts the reference sequence to the binary fasta format
$(MAQ) fasta2bfa chrM.fa _reference.bfa


I also generated a set of random fastq sequences for the chrM. Each read contains a mutation A/G every 10 bases and an insertion of 3 bases every 50 bases.
(...)
@chrM|8|2
GTATATCACCCTATTAACCACTAACGGGAG
+
kzzst|{efokkdyirlqkflupvofumxe
@chrM|20|3
ATTAACCACTAACGGGAGCTATCCATGCAT
+
zplixjsdlmeemjrxxkt|nrd?qwqwym
@chrM|25|4
CCACTAACGGGAGCTATCCATGCATAAATGGT
+
|uklu}+rqepiuoi|qjyrhpyzlllrq|oz
@chrM|32|5
CGGGAGCTATCCATGCATAAATGGTATTTTAG
+
qyvylqiqkhddve4rddxh|frmznluouxp
(...)

fastq2bfq converts the short reads to the binary fasta format:
$(MAQ) fastq2bfq chrM.fastq _reads.bfq

We can now align the reads to the chrM:
$(MAQ) map -u _poor.txt -H _multiple.data _align.map _reference.bfa _reads.bfq
'_multiple.data' contains the multiple/all 01-mismatch hits , and '_poor.txt' contains the unmapped and poorly aligned reads:
chrM|0|1 99 AAAATCACAGGTATATCACCCTATTAACCACT ````````T``````)````````````````
chrM|25|4 99 CCACTAACGGGAGCTATCCATGCATAAATGGT ``````+`````````````````````````
chrM|32|5 99 CGGGAGCTATCCATGCATAAATGGTATTTTAG ``````````````4`````````````````
chrM|35|6 99 GAGCTATCCATGCATAAATGGTATTTTAGTCT ``````````````````````2`````````
chrM|49|7 99 TAAATGGTATTTTAGTCTGGGGGATGTGCACG ````````````````````````````````
chrM|76|10 99 ACGCAATAGCATTGAGAGACGCTGAAAAGCCG `````````````&``````````````````
chrM|80|11 99 AATAGCATTGAGAGACGCTGAAAAGCCGGAGC ``````````````B```Q``````2``````
chrM|89|12 99 GAGAGACGCTGAAAAGCCGGAGCACCCTATGT ````````````````:```````````````
chrM|90|13 99 AGAGACGCTGAAAAGCCGGAGCACCCTATGTC ````````````````````````````````
chrM|92|14 99 AGACGCTGAAAAGCCGGAGCACCCTATGTCAC ``````````````````?`````````````

View the mapping alignment with mapview:
$(MAQ) mapview _align.map > _mapview.txt

##result 'verticalized'

$1 read.name : chrM|71|26834
$2 chromosome : chrM
$3 position : 72
$4 strand : +
$5 insert.size : 0
$6 paired.flag : 0
$7 mapping.quality : 59
$8 single-end.mapping.quality : 59
$9 alternative.mapping.quality: 59
$10 number.mismatches.of.best.hit : 2
$11 sum.qualities.mismatched.bases.best.hit : 60
$12 number of 0-mismatch hits of the first 24bp : 0
$13 number of 1-mismatch hits of the first 24bp on ref : 1
$14 length of the read : 32
$15 sequence : TGTGCACGCGATAGCATTGGGAGACGCTGGGG
$16 quality : ```````````````````````X````````
(...)


mapstat : get statistics from the alignment:
$(MAQ) mapstat _align.map > _mapstat.txt


-- Total number of reads: 308
-- Sum of read length: 9856
-- Error rate: 0.058746
-- Average read length: 32.00

-- Mismatch statistics:

-- MM 0 1
-- MM 1 37
-- MM 2 268
-- MM 3 2

-- Mapping quality statistics:

-- MQ 20-29 1 1
-- MQ 40-49 138 138
-- MQ 50-59 107 107
-- MQ 70-79 62 62

-- Flag statistics:

-- FG 0 308


mapcheck: Read quality check
$(MAQ) mapcheck _reference.bfa _align.map > _mapcheck.txt

Number of reference sequences: 1
Length of reference sequences exlcuding gaps: 16571
Length of gaps in the reference sequences: 0
Length of non-gap regions covered by reads: 3846
Length of 24bp unique regions of the reference: 3846
Reference nucleotide composition: A: 30.86%, C: 31.33%, G: 13.16%, T: 24.66%
Reads nucleotide composition: A: 41.35%, C: 24.81%, G: 13.91%, T: 19.93%
Average depth across all non-gap regions: 0.591
Average depth across 24bp unique regions: 0.591

A C G T : AC AG AT CA CG CT GA GC GT TA TC TG : 0? 1? 2? 3? 4? 5? 6? : 0? 1? 2? 3? 4? 5? 6?
1 52.1 18.5 16.8 12.5 : 0 33 25 253 0 0 125 0 0 288 0 182 : 3 204 0 0 254 340 198 : 15 998 167 167 0 0 0
2 56.4 16.2 12.9 14.5 : 0 41 0 164 33 0 171 0 0 220 0 34 : 3 36 0 3 412 346 198 : 15 991 167 15 200 29 0
3 52.1 19.5 12.5 15.8 : 0 29 7 114 43 0 222 0 0 138 0 52 : 3 7 3 7 438 340 201 : 15 8 15 8 255 10 0
4 45.9 24.1 10.2 19.8 : 0 15 15 52 0 0 97 0 0 48 0 16 : 10 3 3 3 438 336 204 : 5 15 15 15 105 10 0
5 39.9 25.7 15.2 19.1 : 0 34 0 13 0 0 68 0 0 48 0 16 : 0 3 3 3 442 353 195 : 167 15 15 15 90 0 0
6 35.6 25.7 13.5 25.1 : 0 10 0 37 0 0 48 0 0 26 0 0 : 3 10 0 7 445 343 191 : 15 5 167 8 59 0 0
7 38.3 31.4 12.5 17.8 : 0 0 0 10 0 0 73 0 0 0 0 0 : 0 7 0 7 445 340 201 : 167 8 167 8 30 0 0
8 38.0 26.7 14.2 21.1 : 0 0 0 0 0 0 0 0 0 0 0 0 : 3 7 3 3 435 346 201 : 15 8 15 15 0 0 0
9 39.9 20.1 18.5 21.5 : 0 0 0 116 0 0 18 0 0 58 0 0 : 0 17 7 0 425 353 198 : 167 3 8 167 8 112 0
10 48.5 15.2 17.2 19.1 : 0 0 0 238 32 0 82 0 0 113 0 70 : 0 0 3 7 435 353 201 : 167 167 15 8 46 261 0
11 36.6 27.4 17.5 18.5 : 0 0 0 0 0 0 0 0 0 0 0 0 : 3 13 7 10 425 353 188 : 15 4 8 5 0 0 0
(...)


pileup displays the alignment in a ‘pileup’ text format.
$(MAQ) pileup _reference.bfa _align.map > _pileup.txt

#chrom pos. ref depth base.on.read(@=same)
chrM 1 G 0 @
chrM 2 A 0 @
chrM 3 T 0 @
(...)
chrM 72 T 1 @,
chrM 73 G 1 @,
chrM 74 T 1 @,
chrM 75 G 1 @,
(...)
chrM 110 C 2 @,,
chrM 111 A 2 @GG
chrM 112 C 2 @,,
(..)


assemble creates a consensus sequences from read mapping.
$(MAQ) assemble _consensus.cns _reference.bfa _align.map 2> /dev/null


cns2snp extracts SNP sites from the consensus.
$(MAQ) cns2snp _consensus.cns > _snp.txt

##verticalized
$1 chromosome : chrM
$2 position : 91
$3 ref.base : C
$4 consensus.base : G
$5 phred.qual : 30
$6 read.depth : 1
$7 average number of hits of reads : 1.00
$8 highest mapping quality of the reads covering the position : 59
$9 minimum consensus quality in the 3bp flanking regions at each side of the site : 30
$10 second best call : N
$11 log likelihood ratio of the second best and the third best call : 29
$12 third best call : N
(...)


cns2view shows detailed information at all sites from the consensus.
$(MAQ) cns2view _consensus.cns _viewsnp.txt

##verticalized
$1 chromosome : chrM
$2 position : 1
$3 ref.base : G
$4 consensus.base : N
$5 phred.qual : 0
$6 read.depth : 0
$7 average number of hits of reads : 0.00
$8 highest mapping quality of the reads covering the position : 0
$9 minimum consensus quality in the 3bp flanking regions at each side of the site : 0
$10 second best call : N
$11 log likelihood ratio of the second best and the third best call : 0
$12 third best call : N
(...)


cns2fq extracts the consensus sequences in FASTQ format. "Bases in lower case are essentially repeats or do not have sufficient coverage; bases in upper case indicate regions where SNPs can be reliably called"
$(MAQ) cns2fq _consensus.cns > _cns2fq.txt


@chrM
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnntgtgcacgcgatagcattgggagacgcTggrGccggagcgccctatgtc
gcagtatctgnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnntcaATATT
ACAGGCGAACATACCTACTAAAAAGtgtnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnncagACATCATAACAAAAAATTTCCACCA
AAAACCccccnnnnnnnnnnnntggccacaacacttaaacacatctctgcaaaannnnnn
(...)
+
(...)
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!???????????????????????????2EE8EBBBBBBBBBBBBBBBBB
BBBBBBBBBB!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!BHHHKKKK
KKKKKKKKKKKKKKKKKKKKKKKKE???!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
(...)

BWA


Burrows-Wheeler Aligner (BWA) is an efficient program that aligns relatively short nucleotide sequences against a long reference sequence such as the human genome.

index indexes the chrM in the FASTA format:
$(BWA) index -a is -p _bwa.db chrM.fa


aln finds the suffix array coordinates of the input reads.
$(BWA) aln _bwa.db chrM.fastq > _bam.aln.sai


samse generates alignments in the SAM format given single-end reads.
$(BWA) samse _bwa.db _bam.aln.sai chrM.fastq > _bam.samse.sam

#_bam.samse.sam verticalized
(...)
$1 query : chrM|12602|40780
$2 flag : 0
$3 ref: chrM
$4 position : 12603
$5 quality : 37
$6 cigar.string : 30M
$7 mate.ref.sequence : *
$8 mate.pos : 0
$9 insert.size : 0
$10 query.seq : TCATCCCTGTAGCATTGTGCGTTACATGGT
$11 query.qual : kdfz=|srknznewhnkftyowhtmevnhz
$12 ? : XT:A:U ##cf. manual
$13 ? : NM:i:1
$14 ? : X0:i:1
$15 ? : X1:i:0
$16 ? : XM:i:1
$17 ? : XO:i:0
$18 ? : XG:i:0
$19 ? : MD:Z:18T11

(...)


SAM tools

The SAM format is a generic format for storing large nucleotide sequence alignments. SAM Tools provide various utilities for manipulating alignments in the SAM format, including sorting, merging, indexing and generating alignments in a per-position format. Sometimes this tool needs a tab-delimited file 'chrM.ref' containing the length of each chromosome:
chrM 16571


import converts the alignment into BAM (a binary/compact version of SAM)
$(SAM) import chrM.ref _bam.samse.sam _unsorted.bam


sort BAM alignments by leftmost coordinates
$(SAM) sort _unsorted.bam _sorted


sort prints the alignments:
$(SAM) view -h _sorted.bam _sam.view.txt

#verticalized result (BAM format):

$1 ? : chrM|8|2
$2 ? : 0
$3 ? : chrM
$4 ? : 9
$5 ? : 25
$6 ? : 30M
$7 ? : *
$8 ? : 0
$9 ? : 0
$10 ? : GTATATCACCCTATTAACCACTAACGGGAG
$11 ? : kzzst|{efokkdyirlqkflupvofumxe
$12 ? : XT:A:U
$13 ? : NM:i:2
$14 ? : X0:i:1
$15 ? : X1:i:0
$16 ? : XM:i:2
$17 ? : XO:i:0
$18 ? : XG:i:0
$19 ? : MD:Z:2C19C7


pileup prints the alignment in the pileup format.
$(SAM) pileup -c -f chrM.fa _sorted.bam > _sam.pileup.txt

#verticalized...
(...)
$1 chromosome name : chrM
$2 ref coordinate : 11
$3 reference base : C
$4 read base : A
$5 consensus quality : 60
$6 SNP quality : 60
$7 root mean square mapping quality : 25
$8 reads covering the position : 11
$9 read bases at a SNP line : AAAAAAAAA^:A^:A
$10 read quality : yj
(...)


That's it
Pierre

3 comments:

Anonymous said...

Thanks!

István Albert said...

Great stuff, Pierre!

Laozi said...

Fantastic !!