Playing with #magicblast, the #NCBI Short read mapper. My notebook
NCBI MAGIC Blast was recently mentioned by BioMickWatson on twitter.
Looks pretty cool. Perhaps once again the answer to all bfx questions will be BLAST
- Mick Watson (@BioMickWatson) September 9, 2016
RE https://t.co/4D5e9QQnrb pic.twitter.com/bwW3y0yl2n
Here, I'll be playing with magicblast and I'll compare its output with bwa (Makefile below).
First, here is an extract of the manual for magicblast.
USAGE DESCRIPTION Short read mapper *** Input query options -query <File_In> Input file name Default = `-' * Incompatible with: sra -infmt <String, `asn1', `asn1b', `fasta', `fastc', `fastq'> Input format for sequences Default = `fasta' * Incompatible with: sra -paired Input query sequences are paired -query_mate <File_In> FASTA file with mates for query sequences (if given in another file) * Requires: query -sra <String> Comma-separated SRA accessions * Incompatible with: query, infmt *** Formatting options -outfmt <String, Permissible values: 'asn' 'sam' 'tabular' > alignment view options: sam = SAM format, tabular = Tabular format, text ASN.1 Default = `sam'
Indexing the reference for magicblast
I've indexed the human chr22 using the good old NCBI makeblastdb:$ wget -O "chr22.fa.gz" "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/chromosomes/chr22.fa.gz" $ gunzip -f chr22.fa.gz $ makeblastdb -in chr22.fa -dbtype nucl -title chr22
Mapping Paired-End reads with magicblast
First I've removed the read names' suffixes '/1' and '/2' because they still appear in the final bam.gunzip -c R1.aa.fq.gz | sed 's/\/1$$//' | gzip > R1.fq.gz gunzip -c R2.aa.fq.gz | sed 's/\/2$$//' | gzip > R2.fq.gzThe reads are then mapped with magicblast using the following command:
magicblast -db chr22 \ -infmt fastq -paired \ -query R1.fq.gz \ -query_mate R2.fq.gz |\ sed 's/gnl\|BL_ORD_ID\|0/chr22/g' |\ samtools sort -o child.magic.bam -T child.magic --reference chr22.fa -O BAMAs far as I could see, there was no option to specify the read group (RG).
Here,I've transformed the name of the contig because it looks like a blast-id identifier:
I've also mapped the reads using bwa:
bwa mem chr22.fa R1.fq.gz R2.fq.gz |\ samtools sort -o child.bwa.bam -T child.bwa --reference chr22.fa -O BAM
BAM headers
Here is the BAM header for magicblast: Magic blast uses the spec v1.2 and has a Group-Order flag.@HD VN:1.2 GO:query SO:coordinate @SQ SN:chr22 LN:50818468 @PG ID:0 PN:magicblast magicblast -db chr22.fa -infmt fastq -paired -query R1.fq.gz -query_mate R2.fq.gz
And the BAM header for bwa:
@HD VN:1.3 SO:coordinate @SQ SN:chr22 LN:50818468 @PG ID:bwa PN:bwa VN:0.7.15-r1140 bwa mem chr22.fa R1.fq.gz R2.fq.gz
Samtools samflags
for magicblast:24387 + 0 in total (QC-passed reads + QC-failed reads) 0 + 0 secondary 0 + 0 supplementary 0 + 0 duplicates 24387 + 0 mapped (100.00% : N/A) 24387 + 0 paired in sequencing 12222 + 0 read1 12222 + 0 read2 24330 + 0 properly paired (99.77% : N/A) 24387 + 0 with itself and mate mapped 0 + 0 singletons (0.00% : N/A) 57 + 0 with mate mapped to a different chr 57 + 0 with mate mapped to a different chr (mapQ>=5)
for bwa:
24001 + 0 in total (QC-passed reads + QC-failed reads) 0 + 0 secondary 1 + 0 supplementary 0 + 0 duplicates 23976 + 0 mapped (99.90% : N/A) 24000 + 0 paired in sequencing 12000 + 0 read1 12000 + 0 read2 23840 + 0 properly paired (99.33% : N/A) 23952 + 0 with itself and mate mapped 23 + 0 singletons (0.10% : N/A) 0 + 0 with mate mapped to a different chr 0 + 0 with mate mapped to a different chr (mapQ>=5)
Validationg with picard
Validation of child.magic.bam with picard generates many errors:$ java -jar picard.jar ValidateSamFile I=child.magic.bam IGNORE=RECORD_MISSING_READ_GROUP ERROR: Read groups is empty ERROR: Header version: 1.2 does not match any of the acceptable versions: 1.0, 1.3, 1.4, 1.5 ERROR: Record 4, Read name HWI-1KL149:97:C6Y6VACXX:4:2306:7588:23627, Mate negative strand flag does not match read negative strand flag of mate ERROR: Record 6, Read name HWI-1KL149:97:C6Y6VACXX:5:2303:6772:28953, Mate negative strand flag does not match read negative strand flag of mate ERROR: Record 15, Read name HWI-1KL149:97:C6Y6VACXX:4:1207:16508:83772, Mate negative strand flag does not match read negative strand flag of mate ERROR: Record 17, Read name HWI-1KL149:97:C6Y6VACXX:4:1216:17305:32405, Mate negative strand flag does not match read negative strand flag of mate ERROR: Record 19, Read name HWI-1KL149:97:C6Y6VACXX:5:1315:20109:45547, Mate negative strand flag does not match read negative strand flag of mate ERROR: Record 29, Read name HWI-1KL149:97:C6Y6VACXX:4:2313:12478:85202, Mate negative strand flag does not match read negative strand flag of mate ERROR: Record 36, Read name HWI-1KL149:97:C6Y6VACXX:4:1202:11437:96159, Mate negative strand flag does not match read negative strand flag of mate ERROR: Record 40, Read name HWI-1KL149:97:C6Y6VACXX:4:1213:16611:87818, Mate negative strand flag does not match read negative strand flag of mate ERROR: Record 45, Read name HWI-1KL149:97:C6Y6VACXX:4:1208:7944:83271, Mate negative strand flag does not match read negative strand flag of mate ERROR: Record 49, Read name HWI-1KL149:97:C6Y6VACXX:4:1216:17305:32405, Mate negative strand flag does not match read negative strand flag of mate ERROR: Record 22, Read name HWI-1KL149:97:C6Y6VACXX:5:1105:17083:36022, Mate alignment does not match alignment start of mate ERROR: Record 51, Read name HWI-1KL149:97:C6Y6VACXX:5:1105:17083:36022, Mate alignment does not match alignment start of mate ERROR: Record 51, Read name HWI-1KL149:97:C6Y6VACXX:5:1105:17083:36022, Mate negative strand flag does not match read negative strand flag of mate ERROR: Record 25, Read name HWI-1KL149:97:C6Y6VACXX:4:2209:8319:82198, Mate negative strand flag does not match read negative strand flag of mate (...)while there is ~no error for child.bwa.bam:
$ java -jar picard.jar ValidateSamFile I=child.bwa.bam IGNORE=RECORD_MISSING_READ_GROUP ERROR: Read groups is empty [Fri Sep 09 16:09:28 CEST 2016] picard.sam.ValidateSamFile done. Elapsed time: 0.01 minutes. Runtime.totalMemory()=78643200
Variant Calling
Both BAMs are mapped with samtools/vcftools (min DP=10)samtools mpileup --uncompressed --output-tags DP --reference chr22.fa child.bwa.bam |\ bcftools call --ploidy GRCh38 --multiallelic-caller --variants-only --format-fields GQ,GP - |\ bcftools filter -e 'DP<10' --output-type v -o child.bwa.vcf - samtools mpileup --uncompressed --output-tags DP --reference chr22.fa child.magic.bam |\ bcftools call --ploidy GRCh38 --multiallelic-caller --variants-only --format-fields GQ,GP - |\ bcftools filter -e 'DP<10' --output-type v -o child.magic.vcf -Number of variants:
$ grep -v "#" child.magic.vcf | wc -l 111 $ grep -v "#" child.bwa.vcf | wc -l 166
Here is the diff for CHROM/POS/REF/ALT:
- Uniq to BWA:
'comm -13 <(grep -v "#" child.magic.vcf|cut -f 1,2,4,5 | sort | uniq) <(grep -v "#" child.bwa.vcf|cut -f 1,2,4,5 | sort | uniq) | wc -l'
: 64 variants - Uniq to Magic:
'comm -23 <(grep -v "#" child.magic.vcf|cut -f 1,2,4,5 | sort | uniq) <(grep -v "#" child.bwa.vcf|cut -f 1,2,4,5 | sort | uniq) | wc -l'
: 9 variants - Common variants:
'comm -12 <(grep -v "#" child.magic.vcf|cut -f 1,2,4,5 | sort | uniq) <(grep -v "#" child.bwa.vcf|cut -f 1,2,4,5 | sort | uniq) | wc -l'
: 102 variants
The Makefile
SHELL=/bin/bash data.dir=${HOME}/src/DATA ncbi.bin=${HOME}/packages/magicblast/bin REF=chr22.fa samtools.exe=${HOME}/packages/samtools/samtools bwa.exe=${HOME}/packages/bwa-0.7.15/bwa all: child.magic.bam child.bwa.bam R1.fq.gz : ${HOME}/src/DATA/child.R1.aa.fq.gz gunzip -c $< | sed 's/\/1$$//' | gzip > $@ R2.fq.gz : ${HOME}/src/DATA/child.R2.aa.fq.gz gunzip -c $< | sed 's/\/2$$//' | gzip > $@ child.bwa.bam : ${REF}.bwt R1.fq.gz R2.fq.gz ${bwa.exe} mem ${REF} $(word 2,$^) $(word 3,$^) |\ ${samtools.exe} sort -o $@ -T $(basename $@) --reference ${REF} -O BAM child.magic.bam : ${REF}.nin R1.fq.gz R2.fq.gz ${ncbi.bin}/magicblast -db ${REF} -infmt fastq -paired -query $(word 2,$^) -query_mate $(word 3,$^) |\ sed 's/gnl\|BL_ORD_ID\|0/$(basename ${REF})/g' |\ ${samtools.exe} sort -o $@ -T $(basename $@) --reference ${REF} -O BAM ${REF}.bwt : ${REF} ${bwa.exe} index $< ${REF}.nin : ${REF} ${ncbi.bin}/makeblastdb -in $< -dbtype nucl -title $(basename ${REF}) ${REF} : wget -O "$@.gz" "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/chromosomes/$@.gz" gunzip -f $@.gz
That's it,
Pierre
1 comment:
Hi Pierre
Thanks for posting the results. How come you are getting a weird result with magic-BLAST indicating a few read pairs mapping to different chromosomes when your ref sequence database contain only one chromosome?
Jeremy
Post a Comment