The data are already available on the FTP site. TONS of short reads are there. For example, here are some statistics for one FASTQ file: ftp://ftp.1000genomes.ebi.ac.uk/data/NA12878/sequence_read/SRR001113.recal.fastq.gz.
size of the file: 1,325,489,979 bytes
number of lines: 36,193,608
number of short reads: 9,048,402
number of distinct short reads: 8,591,003
number of short reads without any 'N': 6,301,673
number of distinct short reads without any 'N': 6,260,140
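Those counts are easy to reproduce yourself. A minimal Python sketch (assuming the plain 4-lines-per-record FASTQ layout, with the sequence on the second line of each record):

```python
import gzip

def fastq_stats(path):
    """Count total reads, distinct reads, and reads free of 'N'
    in a gzipped FASTQ file."""
    total = 0
    no_n = 0
    distinct = set()
    distinct_no_n = set()
    with gzip.open(path, "rt") as fh:
        for i, line in enumerate(fh):
            if i % 4 == 1:  # the sequence line of each 4-line record
                seq = line.strip()
                total += 1
                distinct.add(seq)
                if "N" not in seq:
                    no_n += 1
                    distinct_no_n.add(seq)
    return total, len(distinct), no_n, len(distinct_no_n)
```

Holding two sets of millions of 30-something-bp strings in memory is fine on a workstation, but it already hints at the scaling problem below.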
How many FASTQ files are there? 8,215 FASTQ files.
(Just curious: how do you back up this amount of data?)
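A back-of-envelope answer to "how much data", assuming every file is roughly the size of SRR001113.recal.fastq.gz (real sizes surely vary):

```python
# Rough estimate only: treats one file's size as typical for all 8,215.
file_size_bytes = 1_325_489_979
n_files = 8_215
total_tb = file_size_bytes * n_files / 1e12
print(f"~{total_tb:.1f} TB")  # ~10.9 TB of compressed FASTQ
```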
And now, here HE comes: the USER.
The USER and his !#@$ questions.
AND HE WANTS SOME ANSWERS.
- For a given rs number, where is it located in those 1,000 genomes?
- Here is a new SNP in my favorite gene. Is it a true SNP? I want to see the short reads and their qualities aligned at this SNP.
- For a given SNP, what is its location on the reference sequence? On Watson's sequence? On the Celera assembly?
- I want to see all the SNPs in a given region for a given individual and his parents.
- Is this mutation in this individual (e.g. a substitution) the same as in another individual (e.g. an insertion)?
How do you store and manage this amount of information? In a classical RDBMS?! My colleague Mario Foglio is currently looking into alternatives.
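One reason a naive row-per-read table feels hopeless, assuming SRR001113 (9,048,402 reads) is a typical file:

```python
# Rough estimate only: assumes every file holds about as many reads
# as SRR001113.recal.fastq.gz.
reads_per_file = 9_048_402
n_files = 8_215
total_rows = reads_per_file * n_files
print(f"{total_rows:,} rows")  # ~74 billion rows for a "reads" table
```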
BIG DATA!