21 October 2011

A reference genome with or without the 'chr' prefix

The name of the chromosomes in the fasta files for the human genome are prefixed with 'chr' :
$  grep ">" hg19.fa 
>chr1
>chr2
>chr3
>chr4
>chr5
>chr6
(...)
The FAIDX index for this fasta file looks like this:
chr1 249250621 6 50 51
chr2 243199373 254235646 50 51
chr3 198022430 502299013 50 51
chr4 191154276 704281898 50 51
chr5 180915260 899259266 50 51
chr6 171115067 1083792838 50 51
(...)
.Today, I've been asked to call the variations for a set of BAM files mapped on a reference genome without this 'chr' prefix. One way to get around this problem is to change the header for those BAM. Another way is to create a copy of the faidx file where the 'chr' prefixes have been removed (the faidx is still valid as the positions in the chromosomes didn't change):
sed 's/^chr//' hg19.fa.fai > hg19_NOPREFIX.fa.fai
and to create a symbolic link named hg19_NOPREFIX.fa pointing to the original reference:
 ln -s hg19.fa hg19_NOPREFIX.fa
. The result:
ls -lah

-rw-r--r-- 1 root root 3.0G Jan  4  2011 hg19.fa
-rw-r--r-- 1 root    root   788 Jan 27  2011 hg19.fa.fai
lrwxrwxrwx 1 root    root     7 Oct 20 16:12 hg19_NOPREFIX.fa -> hg19.fa
-rw-r--r-- 1 root    root   713 Oct 20 16:12 hg19_NOPREFIX.fa.fai
This solution worked so far with samtools mpileup.

That's it,

Pierre

3 comments:

  1. Thanks for the tip.
    It seems that you wrote hg19.fa.fa instead of hg19.fa.fai :-)

    ReplyDelete
  2. Thanks for this tip, it worked !

    ReplyDelete