01 August 2011

Memory-Mapping the Human Genome with 'mmap': my notebook

In this post, I explore how to use a memory-mapped file to handle the 3 GB of the human genome sequence as if it were entirely loaded in memory.
Via Wikipedia: "A memory-mapped file is a segment of virtual memory which has been assigned a direct byte-for-byte correlation with some portion of a file or file-like resource. This resource is typically a file that is physically present on-disk... In computing, mmap is a POSIX-compliant Unix system call that maps files or devices into memory. It is a method of memory-mapped file I/O. It naturally implements demand paging, because initially file contents are not entirely read from disk and do not use physical RAM at all. The actual reads from disk are performed in "lazy" manner, after a specific location is accessed."
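To make the definition above concrete, here is a minimal sketch (not the code of this post) that maps a file read-only and returns a single byte from it. The helper name byte_at is hypothetical; the calls themselves (open, fstat, mmap, munmap, close) are the standard POSIX ones.

```cpp
#include <cassert>
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: returns the byte stored at 'offset' in the file 'path',
   read through a read-only memory mapping, or -1 on any error. */
int byte_at(const char* path, std::size_t offset) {
    assert(path != nullptr);
    int fd = open(path, O_RDONLY);
    if (fd == -1) return -1;

    /* the file size is needed both for mmap and for the bounds check */
    struct stat st;
    if (fstat(fd, &st) == -1 || st.st_size <= 0 ||
        offset >= static_cast<std::size_t>(st.st_size)) {
        close(fd);
        return -1;
    }
    const std::size_t length = static_cast<std::size_t>(st.st_size);

    void* p = mmap(nullptr, length, PROT_READ, MAP_SHARED, fd, 0);
    /* the mapping remains valid after the descriptor is closed */
    close(fd);
    /* mmap signals failure with MAP_FAILED, not with a null pointer */
    if (p == MAP_FAILED) return -1;

    /* plain array access: the kernel pages the data in on demand */
    int c = static_cast<unsigned char>(static_cast<const char*>(p)[offset]);
    munmap(p, length);
    return c;
}
```

Mapping and unmapping the file for every byte would of course be wasteful; a real program keeps the mapping open for its whole lifetime, as the code in the rest of this post does.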
Using a C++ program, I'm going to memory-map the FASTA sequence of the human genome (indexed with samtools faidx) and search for the positions of some short reads using a Boyer-Moore-Horspool algorithm:

Required members:

/* maps a chromosome to its samtools faidx index */
map<string,faidx1_t> name2index;
/* used to get the size of the file */
struct stat buf;
/* genome fasta file file descriptor */
int fd;
/* the mmap (memory mapped) pointer */
char *mapptr;

Opening the mmap

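/* 'samtools faidx' writes its index next to the fasta file, with a '.fai' suffix */
string faidx(fastaFile);
faidx.append(".fai");
string line;
/* open *.fai file */
ifstream in(faidx.c_str(),ios::in);
/* read indexes in the .fai file that was created with samtools */
while(getline(in,line))
    {
    faidx1_t index;
    //parse the faidx line...
    }
/* close index file */
in.close();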

/* get the whole size of the fasta file */
stat(fastaFile, &buf);
/* open the fasta file */
fd = open(fastaFile, O_RDONLY);
/* open a memory mapped file associated to this fasta file descriptor */
mapptr = (char*)mmap(0, buf.st_size, PROT_READ, MAP_SHARED, fd, 0);
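The parsing of each index line is elided above. As a sketch only, here is one way it could be done, relying on the five tab-separated columns that samtools documents for a .fai file: NAME, LENGTH, OFFSET, LINEBASES and LINEWIDTH. The struct FaiEntry and the function parse_fai_line are hypothetical stand-ins; the post only shows that faidx1_t has the fields offset, line_blen and line_len.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>

/* Hypothetical stand-in for faidx1_t (field names borrowed from the post). */
struct FaiEntry {
    int64_t len;        /* LENGTH: number of bases in the sequence */
    int64_t offset;     /* OFFSET: byte offset of the first base in the file */
    int32_t line_blen;  /* LINEBASES: bases per line */
    int32_t line_len;   /* LINEWIDTH: bytes per line, newline included */
};

/* Parses one line of a .fai file into 'name' and 'e'.
   Returns false if a column is missing, malformed or inconsistent. */
bool parse_fai_line(const std::string& line, std::string& name, FaiEntry& e) {
    std::istringstream in(line);
    /* first column: the sequence name, up to the first tab */
    if (!std::getline(in, name, '\t') || name.empty()) return false;
    /* remaining four columns are integers separated by tabs */
    if (!(in >> e.len >> e.offset >> e.line_blen >> e.line_len)) return false;
    /* a line holds at least as many bytes as bases */
    return e.line_blen > 0 && e.line_len >= e.line_blen;
}
```

Each successfully parsed entry would then be stored in the map under its chromosome name, for example name2index[name] = entry.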

It's a kind of MAGIC: getting the base at position 'position' of chromosome 'chrom'

/* look up the faidx record of 'chrom' (assumes 'chrom' is present in the index) */
std::map<std::string,faidx1_t>::iterator r=name2index.find(chrom);
faidx1_t& index=r->second;
char base=at(&index,position);

/* returns the base at position 'index' for the chromosome indexed by faidx */
char at(const FaidxPtr faidx,int64_t index)
    {
    /* 64-bit offset: the genome file is larger than 2^31 bytes */
    int64_t pos= faidx->offset +
        index / faidx->line_blen * faidx->line_len +
        index % faidx->line_blen;
    /* here is the magic: no need to fseek/fread/ftell the file */
    return toupper(mapptr[pos]);
    }
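The offset arithmetic is easier to follow on a worked case. Assuming, for illustration, 50 bases per line (line_blen) and 51 bytes per line (line_len, newline included), the base at index 120 sits 2 full lines into the sequence (120 / 50 = 2, so 2 * 51 = 102 bytes) plus 20 bases into the third line (120 % 50 = 20), that is at offset + 122. The sketch below replays the same computation on a tiny in-memory FASTA; the names FaiFields, file_offset and base_at are hypothetical.

```cpp
#include <cassert>
#include <cctype>
#include <cstdint>

/* Hypothetical subset of faidx1_t: only the fields used by the formula. */
struct FaiFields {
    int64_t offset;     /* byte offset of the first base of the sequence */
    int32_t line_blen;  /* bases per line */
    int32_t line_len;   /* bytes per line, newline included */
};

/* Converts a 0-based position in the sequence to a byte offset in the file:
   skip the full lines before the base, then the bases before it on its line. */
int64_t file_offset(const FaiFields& f, int64_t index) {
    assert(index >= 0 && f.line_blen > 0);
    return f.offset + (index / f.line_blen) * f.line_len + index % f.line_blen;
}

/* Returns the upper-cased base at 'index', reading from the mapped bytes. */
char base_at(const char* data, const FaiFields& f, int64_t index) {
    unsigned char c = static_cast<unsigned char>(data[file_offset(f, index)]);
    return static_cast<char>(std::toupper(c));
}
```

For the text ">chr\nacgt\nTTGA\nCC\n" the header takes 5 bytes, so the fields would be offset=5, line_blen=4 and line_len=5, and base_at(text, f, 6) lands on byte 12, the 'G' of the second line.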

Mapping the short reads

I've hacked a simple Boyer-Moore-Horspool algorithm from http://en.wikipedia.org/wiki/Boyer-Moore-Horspool_algorithm. Of course, you wouldn't use this algorithm to map your short reads for real :-) .
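For readers who have not seen it, here is a minimal sketch of a Boyer-Moore-Horspool search over a byte buffer such as the mapped file. It is not the code used in this post, and the function name horspool_find_all is hypothetical. It is also deliberately naive: it compares raw bytes, so it is case-sensitive and will miss a read that spans a line break in the FASTA file.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <string>
#include <vector>

/* Returns the start offsets of every occurrence of 'pat' in text[0..n). */
std::vector<std::size_t> horspool_find_all(const char* text, std::size_t n,
                                           const std::string& pat) {
    std::vector<std::size_t> hits;
    const std::size_t m = pat.size();
    if (m == 0 || m > n) return hits;

    /* bad-character table: how far the window may slide when its last
       byte is c; bytes absent from the pattern allow a full jump of m */
    std::size_t shift[UCHAR_MAX + 1];
    for (std::size_t c = 0; c <= UCHAR_MAX; ++c) shift[c] = m;
    for (std::size_t i = 0; i + 1 < m; ++i)
        shift[static_cast<unsigned char>(pat[i])] = m - 1 - i;

    std::size_t pos = 0;
    while (pos + m <= n) {
        /* compare the window with the pattern from right to left */
        std::size_t j = m;
        while (j > 0 && text[pos + j - 1] == pat[j - 1]) --j;
        if (j == 0) hits.push_back(pos);
        /* slide according to the last byte of the current window */
        pos += shift[static_cast<unsigned char>(text[pos + m - 1])];
    }
    return hits;
}
```

With the mapping from this post, a call would look like horspool_find_all(mapptr, buf.st_size, read), where read is the sequence of one short read.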

Disposing the mmap

/* close memory mapped map (mmap reports failure with MAP_FAILED, not NULL) */
if(mapptr!=NULL && mapptr!=MAP_FAILED) munmap(mapptr,buf.st_size);
/* dispose fasta file descriptor */
if(fd!=-1) close(fd);

Compile & run

$ g++ -Wall testmmap.cpp -lz

$ ./a.out -g /path/to/hg19.fa /path/to/my.fastq.gz


Source code

That's it,



Nikhil Gopal said...

Hey Pierre,

You have an awesome blog. How fast does this run? I'm finding that this method takes quite a while to run to completion.

Pierre Lindenbaum said...

Hi Nikhil, I'm sorry, I didn't notice your comment for a long time. As far as I remember, I only tested it on chr22.