24 February 2009

1000 human genomes! How !#@$* do you manage 1000 genomes?!

The 1000 genomes project (http://www.1000genomes.org/page.php) has started to release its data. The 1000 Genomes Project is an international research consortium formed to create the most detailed and medically useful picture to date of human genetic variation. The project involves sequencing the genomes of approximately 1200 people from around the world.

The data are already available on the ftp site. TONS of short reads are here. For example, here are a few numbers for one FASTQ file: ftp://ftp.1000genomes.ebi.ac.uk/data/NA12878/sequence_read/SRR001113.recal.fastq.gz.

size of the file: 1,325,489,979 bytes
number of lines: 36,193,608
number of short reads: 9,048,402
number of distinct short reads: 8,591,003
number of short reads without any 'N': 6,301,673
number of distinct short reads without any 'N': 6,260,140
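As a sketch of how such counts can be recomputed, here is a small Python reader for a gzipped FASTQ file. It assumes the standard 4-lines-per-record layout ('N' marking an uncalled base); the function name is mine, not from any library:

```python
# Sketch: recompute the counts above from a gzipped FASTQ file.
# Assumes the usual 4-lines-per-record FASTQ layout; keeping all
# sequences in memory is fine for a sketch, not for a 1.3 GB file.
import gzip

def fastq_stats(path):
    n_lines = 0
    reads = []
    with gzip.open(path, "rt") as f:
        for i, line in enumerate(f):
            n_lines += 1
            if i % 4 == 1:          # second line of each record = sequence
                reads.append(line.strip())
    no_n = [r for r in reads if "N" not in r]
    return {
        "lines": n_lines,
        "reads": len(reads),
        "distinct": len(set(reads)),
        "reads_without_N": len(no_n),
        "distinct_without_N": len(set(no_n)),
    }
```

For the real file one would stream and hash instead of holding every read in a list, but the counting logic is the same.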

How many FASTQ files are there? 8,215 FASTQ files.
(Just curious: how do you back up this amount of data?)

And, now, here HE comes: the USER.
The USER and his !#@$ questions.

  • For a given rs number, where is it located on those 1000 genomes?

  • Here is a new SNP in my favorite gene. Is it a true SNP? I want to see the short reads and their qualities aligned at this SNP.

  • For a given SNP, what is its location on the reference sequence? On Watson's sequence? On the Celera assembly?

  • I want to see all the SNPs in a given region for a given individual and his parents.

  • Is this mutation (e.g. a substitution) in this individual the same as in another individual (e.g. an insertion)?

  • ...

How do you store and manage this amount of information? In a classical RDBMS?! My colleague Mario Foglio is currently looking into alternatives.




baoilleach said...

Would pygr be useful? e.g. http://code.google.com/p/pygr/wiki/DataStorageUsingpygr

Jan Aerts said...


Here at Sanger/EBI we do _not_ put those files in a database. Databases just can't handle that amount of data.

Alignments in the 1000genomes project are now made available as bam-files ("binary SAM") and we use the SAMtools API to get information out. For example "samtools view aln.sorted.bam chr2:20,100,000-20,200,000" will extract that region from the named bam-file and convert it to human-readable text. See http://samtools.sourceforge.net for more information.
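As a side note on that command, the "chr2:20,100,000-20,200,000" region syntax is trivial to handle programmatically. A small sketch, assuming the chrom[:start[-end]] form with optional thousands separators; parse_region is a hypothetical helper of mine, not part of the SAMtools API:

```python
# Sketch: parse a samtools-style region string such as
# "chr2:20,100,000-20,200,000" into (chromosome, start, end).
# parse_region is a hypothetical helper, not part of SAMtools itself.
def parse_region(region):
    chrom, _, span = region.partition(":")
    if not span:
        return chrom, None, None            # whole chromosome
    start, _, end = span.partition("-")
    start = int(start.replace(",", ""))     # drop thousands separators
    end = int(end.replace(",", "")) if end else None
    return chrom, start, end
```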

Pierre Lindenbaum said...

Thanks Jan, this is very useful. But another problem here is how to handle the information about the SNPs. I'm still investigating.

cariaso said...

I am very interested in any rs# related solutions you may build/find.

Andrew Dalke said...

That's about 10TB of data. Figuring $200/TB storage means $2000 to store/back up the data on hard disks. Roll-your-own RAID and the cost is probably $4000 or so.

You can buy a retail "home" NAS (designed for movie storage) with 15 TB for $15K, and Sun's high-end storage systems go up to 574 TB for lots of money.

Backups depend on your requirements. You likely don't want the archival quality that tape gives you. After all, you can always refetch the data. So RAID might be enough. I did find an estimate from 2005 that a corporate IT tape backup system for 500 TB costs $250,000. Scaling gives about $5,000 for 10TB.

A big question is - how do you download that amount of data? A T3 line would need 22 days. OC12 would take 1.6 days. Shipping hard drives would likely be the fastest.
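Those transfer times are easy to sanity-check, assuming nominal line rates of 45 Mbit/s for a T3 and 622 Mbit/s for an OC-12 (protocol overhead would make real transfers somewhat slower):

```python
# Back-of-the-envelope transfer times for 10 TB over a T3 (45 Mbit/s)
# and an OC-12 (622 Mbit/s), at nominal line rate with no overhead.
data_bits = 10e12 * 8                      # 10 TB expressed in bits
t3_days = data_bits / 45e6 / 86400
oc12_days = data_bits / 622e6 / 86400
print(round(t3_days, 1), round(oc12_days, 1))   # 20.6 1.5
```

About 21 days and 1.5 days at line rate, consistent with the figures above once overhead is added.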

Of course, easiest would be to use Amazon's S3. 10TB is in the smallest storage tier for them, and would cost $1500/month and $1000 to transfer the data. In this case they would also manage the backups for you so you wouldn't need on-site staff to manage that, and you would get to use their impressive bandwidth.

The cheapest tier at S3 is for those with over 500 TB storage.

1000 genomes is big data, sure, but there's plenty bigger in the world and there are off-the-shelf and service solutions if you have the money.

Tim Yates said...

This smells like a Bloom filter requirement... It won't tell you where the answer is, but it should quickly tell you where it isn't... Of course, that depends on knowing what the question is going to be beforehand... :-(
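For readers unfamiliar with the idea: a Bloom filter is a bit array plus k hash functions, giving membership tests with no false negatives but a tunable false-positive rate. A minimal sketch; the sizes and md5-based hashing are illustrative choices of mine, not a recommendation:

```python
# Minimal Bloom filter: add items, then test membership.
# "in" can return a false positive, never a false negative.
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1 << 20, n_hashes=4):
        self.size = size_bits
        self.n = n_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        # derive n_hashes bit positions from salted md5 digests
        for i in range(self.n):
            h = hashlib.md5(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))
```

For the use case above, one filter per individual over observed SNP positions would quickly rule out genomes that cannot contain a given variant.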

dalloliogm said...

We are thinking about using HDF5 files to store SNP data and associated annotations, but we are still not sure about the format.
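In the spirit of that idea: an HDF5 table is essentially an array of typed, fixed-width records with O(1) access by index, plus chunking and compression. A real implementation would go through h5py or PyTables; this stdlib-only sketch just illustrates the record layout, using a hypothetical 7-byte SNP record of my own devising:

```python
# Illustration of the fixed-record layout an HDF5 SNP table relies on.
# Real code would use h5py/PyTables; struct shows the same idea:
# typed records of constant size, so record i lives at offset i * size.
import struct

# hypothetical record: chrom (uint8), position (uint32), ref, alt (1 byte each)
REC = struct.Struct("<BIcc")

def pack_snps(snps):
    return b"".join(REC.pack(c, pos, ref, alt) for c, pos, ref, alt in snps)

def read_snp(buf, i):
    # constant-time lookup by record index, no scanning
    return REC.unpack_from(buf, i * REC.size)
```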

dalloliogm said...

Moreover, it seems that SNP data for the 1000 genomes will be released in the VCF format. A tool to parse VCF files and merge them with BED files is available at http://vcftools.sf.net , and other libraries are probably coming.
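To show how approachable the tab-delimited VCF layout is, here is a minimal parser for the eight fixed leading columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO). Real work should use vcftools or a maintained library; this is only a sketch of the format:

```python
# Minimal VCF parser sketch: yields a dict per data line, skipping
# "#" header lines. Handles only the 8 fixed columns, not genotypes.
def parse_vcf_lines(lines):
    for line in lines:
        if line.startswith("#"):
            continue                         # meta / column-header lines
        f = line.rstrip("\n").split("\t")
        # INFO is ";"-separated key=value pairs; bare keys are flags
        info = dict(kv.split("=", 1) if "=" in kv else (kv, True)
                    for kv in f[7].split(";"))
        yield {"chrom": f[0], "pos": int(f[1]), "id": f[2],
               "ref": f[3], "alt": f[4].split(","), "info": info}
```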

Sean Davis said...

Keep in mind that BAM format files can be read via random access directly from the web. So, using appropriate tools (see samtools and picard), you can read only the reads that overlap your rs of interest. Of course, having the reads in hand won't give you the genotype directly.

Also, take a look at the Integrative Genome Viewer (IGV) from the Broad. Again, it reads only the parts of the file that are requested and over http, so the entire file need not be stored on a local disk.

Lee Watkins said...

PacBio data is even bigger than that from other NGS platforms (HiSeq, 454, etc.), so data from their primary analysis pipeline is stored as four kinds of HDF5 files, AND they provide Java and R APIs into them as well as conversion utilities to the SAM/BAM, VCF and BED file formats. So they must be able to represent in HDF5 all the same information contained in those other file formats, which means you could too. The question then becomes whether you would spend so much time converting between formats that storing the data in HDF5 isn't really worth it, unless you rework everything to do I/O directly from the HDF5 files, e.g., using the Java API.