A reference genome with or without the 'chr' prefix
The name of the chromosomes in the fasta files for the human genome are prefixed with 'chr' :
$ grep ">" hg19.fa >chr1 >chr2 >chr3 >chr4 >chr5 >chr6 (...)The FAIDX index for this fasta file looks like this:
chr1 249250621 6 50 51 chr2 243199373 254235646 50 51 chr3 198022430 502299013 50 51 chr4 191154276 704281898 50 51 chr5 180915260 899259266 50 51 chr6 171115067 1083792838 50 51 (...).Today, I've been asked to call the variations for a set of BAM files mapped on a reference genome without this 'chr' prefix. One way to get around this problem is to change the header for those BAM. Another way is to create a copy of the faidx file where the 'chr' prefixes have been removed (the faidx is still valid as the positions in the chromosomes didn't change):
sed 's/^chr//' hg19.fa.fai > hg19_NOPREFIX.fa.faiand to create a symbolic link named hg19_NOPREFIX.fa pointing to the original reference:
ln -s hg19.fa hg19_NOPREFIX.fa. The result:
ls -lah -rw-r--r-- 1 root root 3.0G Jan 4 2011 hg19.fa -rw-r--r-- 1 root root 788 Jan 27 2011 hg19.fa.fai lrwxrwxrwx 1 root root 7 Oct 20 16:12 hg19_NOPREFIX.fa -> hg19.fa -rw-r--r-- 1 root root 713 Oct 20 16:12 hg19_NOPREFIX.fa.faiThis solution worked so far with samtools mpileup.
That's it,
Pierre
3 comments:
Thanks for the tip.
It seems that you wrote hg19.fa.fa instead of hg19.fa.fai :-)
@fredebibs thanks ! it's fixed.
Thanks for this tip, it worked !
Post a Comment