YOKOFAKUN: Storing SNPs in a HDF5 file: my notebook

09 July 2011

Storing SNPs in a HDF5 file: my notebook

I discovered the benefits of using twitter in 2008, the day Deepak Singh replied to one of my tweets related to the storage of a large number of genotypes.

Since that day, I've tried to use the HDF5 library, without any success (there's a large disheartening documentation/API on the HDF5 site and the API seems to be only used by a small number of geeks). Furthermore, HDF5 is a technology used by the IGV genome browser.

In that post, I'll describe a C program loading a set of SNPs defined by the following C structure:

typedef struct structSnp
 {
 char rsId[RS_LENGTH];//rs##
 char chrom[CHROM_LENGTH];//chromosome name
 int chromStart;//genomic start index
 int chromEnd;//genomic end index
 }Snp;

The SNPs will be read from stdin. Four columns are expected: rs-id, chrom, chromStart and chromEnd. Here is the command line I used to get a small input file:

curl -s "http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/snp132.txt.gz" |\
 gunzip -c |\
 cut -d ' ' -f 2-5 |\
 head -n 10000 > sample.tsv

Source code

The C program is described below. I hope my comments will make the code readable.

The Makefile

Compile and run

gcc -Wall -o a.out -I  /path/to/hdf5/include -L /path/to/hdf5/lib dbsnp2hdf5.c  -lhdf5
./a.out < sample.tsv

Dump the data

At the en, the program creates a structured binary file containing our SNPs. The HDF5 utility h5dump can be used to display the data as text:

$ h5dump storage.h5

HDF5 "storage.h5" {
GROUP "/" {
   GROUP "variations" {
      DATASET "dbSNP" {
         DATATYPE  H5T_COMPOUND {
            H5T_STRING {
               STRSIZE 13;
               STRPAD H5T_STR_NULLTERM;
               CSET H5T_CSET_ASCII;
               CTYPE H5T_C_S1;
            } "rs";
            H5T_STRING {
               STRSIZE 7;
               STRPAD H5T_STR_NULLTERM;
               CSET H5T_CSET_ASCII;
               CTYPE H5T_C_S1;
            } "chrom";
            H5T_STD_I32LE "chromStart";
            H5T_STD_I32LE "chromEnd";
         }
         DATASPACE  SIMPLE { ( 10000 ) / ( H5S_UNLIMITED ) }
         DATA {
         (0): {
               "rs112750067",
               "chr1",
               10326,
               10327
            },
         (1): {
               "rs56289060",
               "chr1",
               10433,
               10433
            },
         (2): {
               "rs112155239",
               "chr1",
               10439,
               10440
            },
         (3): {
               "rs112766696",
               "chr1",
               10439,
               10440
            },
          (...)
               450425,
               450426
            }
         }
         ATTRIBUTE "rdfs:comment" {
            DATATYPE  H5T_STRING {
                  STRSIZE 15;
                  STRPAD H5T_STR_NULLTERM;
                  CSET H5T_CSET_ASCII;
                  CTYPE H5T_C_S1;
               }
            DATASPACE  SCALAR
            DATA {
            (0): "My list of SNPs"
            }
         }
      }
   }
}
}

That's it !
Pierre

1 comment:

Anonymous said...: Well and then there's GWASpi, which loads genotypes in a HDF compatible netCDF file.
It seems to be the same type of storage as you can open up netCDf files with HDFView.

For sequences (reads and alignments) there's also the BioHDF libraries which do all the storage and I/O work for you.; Tuesday, 19 July, 2011

YOKOFAKUN

09 July 2011

Storing SNPs in a HDF5 file: my notebook

Source code

The Makefile

Compile and run

Dump the data

1 comment:

About Me

Feeds

Blog Archive

Web2.0

Labels