26 July 2013

g1kv37 vs hg19

In order to create a class to translate chromosome names from one naming convention to another, I've compared the MD5 sums of the sequences in the human genome versions g1k/v37 and ucsc/hg19. Here is the Java program to create the MD5s:
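A minimal sketch of such a program, not the original listing. It assumes a plain FASTA input, keeps only the first word of each header as the sequence name, and upper-cases the bases before hashing (the convention the SAM specification uses for the `M5` tag), so that soft-masked lower-case bases do not change the digest. The class name `FastaMd5` and the method `md5BySequence` are illustrative names.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper: one MD5 digest per FASTA record. */
final class FastaMd5 {

    /** Returns sequence name -> MD5 hex digest, in file order. */
    static Map<String, String> md5BySequence(Reader in) {
        Map<String, String> digests = new LinkedHashMap<>();
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            BufferedReader reader = new BufferedReader(in);
            String name = null;
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.startsWith(">")) {
                    // A new record starts: close the previous one.
                    // digest() also resets the MessageDigest for the next record.
                    if (name != null) digests.put(name, toHex(md5.digest()));
                    // Assumption: the name is the first word of the header line.
                    name = line.substring(1).trim().split("\\s+")[0];
                } else if (name != null) {
                    // Assumption: hash upper-cased bases, ignoring line breaks.
                    String bases = line.trim().toUpperCase(Locale.ROOT);
                    md5.update(bases.getBytes(StandardCharsets.US_ASCII));
                }
            }
            if (name != null) digests.put(name, toHex(md5.digest()));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
        return digests;
    }

    /** Formats a digest as 32 lower-case hexadecimal characters. */
    private static String toHex(byte[] digest) {
        return String.format("%032x", new BigInteger(1, digest));
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("Usage: java FastaMd5 genome.fa");
            return;
        }
        try (Reader in = Files.newBufferedReader(Paths.get(args[0]), StandardCharsets.ISO_8859_1)) {
            for (Map.Entry<String, String> e : md5BySequence(in).entrySet()) {
                System.out.println(e.getValue() + "\t" + e.getKey());
            }
        }
    }
}
```

Run once per genome (for example `java FastaMd5 genome.fa`, where the file name is a placeholder), this prints one `md5<TAB>name` line per sequence, which is a convenient shape for joining the two builds on the digest.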

The MD5 sums were extracted as follows:



Here are the common chromosomes, joined on the hash-sum:


And here are the sequences that could not be paired:
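The join on the hash-sum, and the list of leftovers, could be sketched as below. This is an illustration under assumptions, not the code used for the post: each build is represented as a map from sequence name to MD5 digest, and the names `Md5Join`, `pairByMd5` and `unpaired` are hypothetical.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper: pairs sequence names of two builds through their MD5 digests. */
final class Md5Join {

    /**
     * Maps each name of 'from' to the name of 'to' that carries the same digest.
     * Names of 'from' whose digest is absent from 'to' are left out.
     */
    static Map<String, String> pairByMd5(Map<String, String> from, Map<String, String> to) {
        // Index the target build by digest.
        Map<String, String> nameByMd5 = new HashMap<>();
        for (Map.Entry<String, String> e : to.entrySet()) {
            nameByMd5.put(e.getValue(), e.getKey());
        }
        Map<String, String> paired = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : from.entrySet()) {
            String partner = nameByMd5.get(e.getValue());
            if (partner != null) paired.put(e.getKey(), partner);
        }
        return paired;
    }

    /** Names of 'from' that have no sequence with the same digest in 'to'. */
    static Set<String> unpaired(Map<String, String> from, Map<String, String> to) {
        Set<String> names = new LinkedHashSet<>(from.keySet());
        names.removeAll(pairByMd5(from, to).keySet());
        return names;
    }
}
```

The map returned by `pairByMd5` is already the name translator: looking up a g1k/v37 name gives the ucsc/hg19 name of the identical sequence, and anything returned by `unpaired` (chr3 and chrY here) needs to be handled by hand.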


I knew about the problem with chrY (http://www.biostars.org/p/58143/) but not with chr3. What is the problem with this chromosome?

Edit: Here are the base counts for UCSC/chr3:

{T=58760485, G=38670110, A=58713343, C=38653197, N=3225295}
and for g1kv37:
{T=58760485, G=38670110, A=58713343, R=2, C=38653197, M=1, N=3225292}
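The two tables differ only in the ambiguous symbols: UCSC/chr3 has three more N (3225295 - 3225292 = 3), while g1kv37 has two R and one M, so both sequences have the same length. That is consistent with three positions carrying an IUPAC ambiguity code (R = A or G, M = A or C) in g1kv37 and an N in hg19. Counts like these can be obtained with a sketch such as the following, which assumes a plain FASTA input and upper-cases the bases before counting; `BaseCounter` and `countBases` are illustrative names, not the code used for the post.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper: counts each symbol of one FASTA record. */
final class BaseCounter {

    /** Returns symbol -> number of occurrences for the record named 'wanted'. */
    static Map<Character, Long> countBases(Reader in, String wanted) {
        Map<Character, Long> counts = new TreeMap<>();
        try {
            BufferedReader reader = new BufferedReader(in);
            boolean inTarget = false;
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.startsWith(">")) {
                    // Assumption: the name is the first word of the header line.
                    String name = line.substring(1).trim().split("\\s+")[0];
                    inTarget = name.equals(wanted);
                } else if (inTarget) {
                    // Assumption: soft-masked (lower-case) bases count as their upper-case form.
                    String bases = line.trim().toUpperCase(Locale.ROOT);
                    for (int i = 0; i < bases.length(); i++) {
                        counts.merge(bases.charAt(i), 1L, Long::sum);
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return counts;
    }
}
```

Calling `countBases` with "chr3" on the UCSC file and with "3" on the g1kv37 file, then comparing the two maps key by key, shows directly which symbols account for the different MD5 sums.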

That's it,



Pierre.

2 comments:

Pablo Marin-Garcia said...

Hello Pierre, about chr3: you helped me find an answer 2 years ago ;-)
http://www.biostars.org/p/9464/

My last comment on that page shows a possible answer.

Pierre Lindenbaum said...

@pablo Haha :-) I didn't remember that post :-)