NCBI MAGIC Blast was recently mentioned by BioMickWatson on twitter.
Here, I'll be playing with magicblast and I'll compare its output with bwa (Makefile below).
First, here is an extract of the manual for
magicblast.
USAGE
DESCRIPTION
Short read mapper
*** Input query options
-query <File_In>
Input file name
Default = `-'
* Incompatible with: sra
-infmt <String, `asn1', `asn1b', `fasta', `fastc', `fastq'>
Input format for sequences
Default = `fasta'
* Incompatible with: sra
-paired
Input query sequences are paired
-query_mate <File_In>
FASTA file with mates for query sequences (if given in another file)
* Requires: query
-sra <String>
Comma-separated SRA accessions
* Incompatible with: query, infmt
*** Formatting options
-outfmt <String, Permissible values: 'asn' 'sam' 'tabular' >
alignment view options:
sam = SAM format,
tabular = Tabular format,
text ASN.1
Default = `sam'
Indexing the reference for magicblast
I've indexed the human chr22 using the good old NCBI
makeblastdb:
$ wget -O "chr22.fa.gz" "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/chromosomes/chr22.fa.gz"
$ gunzip -f chr22.fa.gz
$ makeblastdb -in chr22.fa -dbtype nucl -title chr22
Mapping Paired-End reads with magicblast
First I've removed the read names' suffixes '/1' and '/2' because they still appear in the final bam.
gunzip -c R1.aa.fq.gz | sed 's/\/1$$//' | gzip > R1.fq.gz
gunzip -c R2.aa.fq.gz | sed 's/\/2$$//' | gzip > R2.fq.gz
The reads are then mapped with
magicblast using the following command:
magicblast -db chr22 \
-infmt fastq -paired \
-query R1.fq.gz \
-query_mate R2.fq.gz |\
sed 's/gnl\|BL_ORD_ID\|0/chr22/g' |\
samtools sort -o child.magic.bam -T child.magic --reference chr22.fa -O BAM
As far as I could see, there was no option to specify the read group (
RG).
Here,I've transformed the name of the contig because it looks like a blast-id identifier:
I've also mapped the reads using bwa:
bwa mem chr22.fa R1.fq.gz R2.fq.gz |\
samtools sort -o child.bwa.bam -T child.bwa --reference chr22.fa -O BAM
BAM headers
Here is the BAM header for magicblast: Magic blast uses the spec v1.2 and has a Group-Order flag.
@HD VN:1.2 GO:query SO:coordinate
@SQ SN:chr22 LN:50818468
@PG ID:0 PN:magicblast magicblast -db chr22.fa -infmt fastq -paired -query R1.fq.gz -query_mate R2.fq.gz
And the BAM header for bwa:
@HD VN:1.3 SO:coordinate
@SQ SN:chr22 LN:50818468
@PG ID:bwa PN:bwa VN:0.7.15-r1140 bwa mem chr22.fa R1.fq.gz R2.fq.gz
Samtools samflags
for magicblast:
24387 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
24387 + 0 mapped (100.00% : N/A)
24387 + 0 paired in sequencing
12222 + 0 read1
12222 + 0 read2
24330 + 0 properly paired (99.77% : N/A)
24387 + 0 with itself and mate mapped
0 + 0 singletons (0.00% : N/A)
57 + 0 with mate mapped to a different chr
57 + 0 with mate mapped to a different chr (mapQ>=5)
for bwa:
24001 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
1 + 0 supplementary
0 + 0 duplicates
23976 + 0 mapped (99.90% : N/A)
24000 + 0 paired in sequencing
12000 + 0 read1
12000 + 0 read2
23840 + 0 properly paired (99.33% : N/A)
23952 + 0 with itself and mate mapped
23 + 0 singletons (0.10% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
Validationg with picard
Validation of child.magic.bam with picard generates many errors:
$ java -jar picard.jar ValidateSamFile I=child.magic.bam IGNORE=RECORD_MISSING_READ_GROUP
ERROR: Read groups is empty
ERROR: Header version: 1.2 does not match any of the acceptable versions: 1.0, 1.3, 1.4, 1.5
ERROR: Record 4, Read name HWI-1KL149:97:C6Y6VACXX:4:2306:7588:23627, Mate negative strand flag does not match read negative strand flag of mate
ERROR: Record 6, Read name HWI-1KL149:97:C6Y6VACXX:5:2303:6772:28953, Mate negative strand flag does not match read negative strand flag of mate
ERROR: Record 15, Read name HWI-1KL149:97:C6Y6VACXX:4:1207:16508:83772, Mate negative strand flag does not match read negative strand flag of mate
ERROR: Record 17, Read name HWI-1KL149:97:C6Y6VACXX:4:1216:17305:32405, Mate negative strand flag does not match read negative strand flag of mate
ERROR: Record 19, Read name HWI-1KL149:97:C6Y6VACXX:5:1315:20109:45547, Mate negative strand flag does not match read negative strand flag of mate
ERROR: Record 29, Read name HWI-1KL149:97:C6Y6VACXX:4:2313:12478:85202, Mate negative strand flag does not match read negative strand flag of mate
ERROR: Record 36, Read name HWI-1KL149:97:C6Y6VACXX:4:1202:11437:96159, Mate negative strand flag does not match read negative strand flag of mate
ERROR: Record 40, Read name HWI-1KL149:97:C6Y6VACXX:4:1213:16611:87818, Mate negative strand flag does not match read negative strand flag of mate
ERROR: Record 45, Read name HWI-1KL149:97:C6Y6VACXX:4:1208:7944:83271, Mate negative strand flag does not match read negative strand flag of mate
ERROR: Record 49, Read name HWI-1KL149:97:C6Y6VACXX:4:1216:17305:32405, Mate negative strand flag does not match read negative strand flag of mate
ERROR: Record 22, Read name HWI-1KL149:97:C6Y6VACXX:5:1105:17083:36022, Mate alignment does not match alignment start of mate
ERROR: Record 51, Read name HWI-1KL149:97:C6Y6VACXX:5:1105:17083:36022, Mate alignment does not match alignment start of mate
ERROR: Record 51, Read name HWI-1KL149:97:C6Y6VACXX:5:1105:17083:36022, Mate negative strand flag does not match read negative strand flag of mate
ERROR: Record 25, Read name HWI-1KL149:97:C6Y6VACXX:4:2209:8319:82198, Mate negative strand flag does not match read negative strand flag of mate
(...)
while there is ~no error for child.bwa.bam:
$ java -jar picard.jar ValidateSamFile I=child.bwa.bam IGNORE=RECORD_MISSING_READ_GROUP
ERROR: Read groups is empty
[Fri Sep 09 16:09:28 CEST 2016] picard.sam.ValidateSamFile done. Elapsed time: 0.01 minutes.
Runtime.totalMemory()=78643200
Variant Calling
Both BAMs are mapped with samtools/vcftools (min DP=10)
samtools mpileup --uncompressed --output-tags DP --reference chr22.fa child.bwa.bam |\
bcftools call --ploidy GRCh38 --multiallelic-caller --variants-only --format-fields GQ,GP - |\
bcftools filter -e 'DP<10' --output-type v -o child.bwa.vcf -
samtools mpileup --uncompressed --output-tags DP --reference chr22.fa child.magic.bam |\
bcftools call --ploidy GRCh38 --multiallelic-caller --variants-only --format-fields GQ,GP - |\
bcftools filter -e 'DP<10' --output-type v -o child.magic.vcf -
Number of variants:
$ grep -v "#" child.magic.vcf | wc -l
111
$ grep -v "#" child.bwa.vcf | wc -l
166
Here is the diff for CHROM/POS/REF/ALT:
- Uniq to BWA:
'comm -13 <(grep -v "#" child.magic.vcf|cut -f 1,2,4,5 | sort | uniq) <(grep -v "#" child.bwa.vcf|cut -f 1,2,4,5 | sort | uniq) | wc -l'
: 64 variants
- Uniq to Magic:
'comm -23 <(grep -v "#" child.magic.vcf|cut -f 1,2,4,5 | sort | uniq) <(grep -v "#" child.bwa.vcf|cut -f 1,2,4,5 | sort | uniq) | wc -l'
: 9 variants
- Common variants:
'comm -12 <(grep -v "#" child.magic.vcf|cut -f 1,2,4,5 | sort | uniq) <(grep -v "#" child.bwa.vcf|cut -f 1,2,4,5 | sort | uniq) | wc -l'
: 102 variants
The Makefile
SHELL=/bin/bash
data.dir=${HOME}/src/DATA
ncbi.bin=${HOME}/packages/magicblast/bin
REF=chr22.fa
samtools.exe=${HOME}/packages/samtools/samtools
bwa.exe=${HOME}/packages/bwa-0.7.15/bwa
all: child.magic.bam child.bwa.bam
R1.fq.gz : ${HOME}/src/DATA/child.R1.aa.fq.gz
gunzip -c $< | sed 's/\/1$$//' | gzip > $@
R2.fq.gz : ${HOME}/src/DATA/child.R2.aa.fq.gz
gunzip -c $< | sed 's/\/2$$//' | gzip > $@
child.bwa.bam : ${REF}.bwt R1.fq.gz R2.fq.gz
${bwa.exe} mem ${REF} $(word 2,$^) $(word 3,$^) |\
${samtools.exe} sort -o $@ -T $(basename $@) --reference ${REF} -O BAM
child.magic.bam : ${REF}.nin R1.fq.gz R2.fq.gz
${ncbi.bin}/magicblast -db ${REF} -infmt fastq -paired -query $(word 2,$^) -query_mate $(word 3,$^) |\
sed 's/gnl\|BL_ORD_ID\|0/$(basename ${REF})/g' |\
${samtools.exe} sort -o $@ -T $(basename $@) --reference ${REF} -O BAM
${REF}.bwt : ${REF}
${bwa.exe} index $<
${REF}.nin : ${REF}
${ncbi.bin}/makeblastdb -in $< -dbtype nucl -title $(basename ${REF})
${REF} :
wget -O "$@.gz" "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/chromosomes/$@.gz"
gunzip -f $@.gz
That's it,
Pierre