04 April 2008

BerkeleyDB and Hapmap: My notebook.

I'm currently trying to find the best way to store some genotypes. For example, I need to store 278,766,958 Illumina genotypes (marker, individual, allele1, allele2), and MySQL, even with indexes, is getting slow when I'm looking for the Mendelian incompatibilities. Deepak suggested via Twitter to use HDF5 (http://hdf.ncsa.uiuc.edu/HDF5/), but as far as I understand the documentation, HDF5 is "just" a smarter implementation of the C functions fseek/fread/fwrite.

I've been looking at the Java implementation of Berkeley DB (BDB) to see how its performance matches my needs. This engine can be used as an embedded database and doesn't need a database server running as a daemon (just like JavaDB). A BDB database contains records, where each record holds a "key" and a "value" (a kind of two-column database, or a kind of C++ std::multimap). In the current post, I'll describe a program which 1) downloads some genotypes from HapMap, 2) finds the pedigree of the samples, 3) loops over the markers (ordered by their position on the genome) and gets the allele frequencies, and 4) finds the Mendelian incompatibilities.

The source code is available on pastie at:



First we need a few classes:

The class Position holds a position on a chromosome:
private static class Position
{
    String chromosome;
    int position;
    Position(String chromosome, int position)
    {
        this.chromosome = chromosome;
        this.position = position;
    }
}


The class Marker contains the information about a SNP (a constructor and a getter are added here so it matches the way the class is used below):
private static class Marker
{
    int rsId;
    String alleles;
    Position pos;
    char strand;
    Marker(int rsId) { this.rsId = rsId; }
    int getRSId() { return this.rsId; }
}


The class Individual contains the information about an individual, including his ID at Coriell (see http://cardiogenomics.med.harvard.edu):
private static class Individual
{
    /** Coriell Repository Number */
    String name;
    /** id at Coriell */
    String individualID = null;
    /** father id */
    String fatherID = null;
    /** mother id */
    String motherID = null;
}


As BDB is a collection of (key, value) pairs, we need a class MarkerIndividual holding the pair (marker, individual), so we can store the genotypes as f(pair(marker,individual)) = genotype.
private static class MarkerIndividual
{
    private int rsId;
    /** Coriell Repository Number */
    private String individualId;
    MarkerIndividual(int rsId, String individualId)
    {
        this.rsId = rsId;
        this.individualId = individualId;
    }
}
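This composite key must itself be flattened to a byte array before BDB can use it. As a minimal, hypothetical sketch using only the standard library (the real program does this with a TupleBinding for MarkerIndividual), the encoding could look like this:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class MarkerIndividualKey
{
    /** flatten the pair (rsId, individualId) into a byte array usable as a BDB key */
    static byte[] encode(int rsId, String individualId)
    {
        try
        {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeInt(rsId);          // fixed-width marker id first
            out.writeUTF(individualId);  // then the Coriell sample name
            out.flush();
            return bytes.toByteArray();
        }
        catch(IOException err) { throw new RuntimeException(err); }
    }

    /** decode the byte array back into its two components */
    static Object[] decode(byte[] data)
    {
        try
        {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            return new Object[]{ in.readInt(), in.readUTF() };
        }
        catch(IOException err) { throw new RuntimeException(err); }
    }
}
```

Writing the int first means the byte-wise key order groups all individuals of a marker together, which is the natural grouping here.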


Finally, we need a class Genotype to store the two alleles (again with the constructor used below):
private static class Genotype
{
    char a1, a2;
    Genotype(char a1, char a2) { this.a1 = a1; this.a2 = a2; }

    public boolean same(char a1, char a2)
    {
        return (this.a1==a1 && this.a2==a2) ||
               (this.a1==a2 && this.a2==a1);
    }
}


BDB offers several alternatives to read and write Java objects from/to the databases; this operation requires the object to be converted into an array of bytes: 1) Java serialization can be used, or 2) a TupleBinding can be implemented — this class must implement two functions which encode/decode the object to/from an array of bytes. I chose the latter option, and here is, for example, the TupleBinding implementation for the class Individual:

private TupleBinding individualTupleBinding = new TupleBinding()
{
    @Override
    public Object entryToObject(TupleInput input)
    {
        Individual indi = new Individual();
        indi.name = input.readString();
        indi.individualID = input.readString();
        indi.fatherID = input.readString();
        indi.motherID = input.readString();
        return indi;
    }

    @Override
    public void objectToEntry(Object object, TupleOutput output)
    {
        Individual indi = Individual.class.cast(object);
        output.writeString(indi.name);
        output.writeString(indi.individualID);
        output.writeString(indi.fatherID);
        output.writeString(indi.motherID);
    }
};


To open and create a Berkeley Environment, the following code was written:
EnvironmentConfig envConf = new EnvironmentConfig();
envConf.setAllowCreate(!readOnly);
envConf.setReadOnly(readOnly);
envConf.setTransactional(false);
this.env = new Environment(file, envConf);


Then we open each database. We need three databases: markers, individuals and genotypes. Opening a Database looks like this:

DatabaseConfig dbConfig = new DatabaseConfig();
/* allow create if the database does not exist */
dbConfig.setAllowCreate(!readOnly);
dbConfig.setReadOnly(readOnly);
Database db = this.env.openDatabase(null, "database-name", dbConfig);


Although BDB is based on (key, value) pairs, sometimes another component of the value needs to be searched, and thus indexed. This is handled by a "Secondary Database". For this project I created two secondary databases: 1) to find/loop over the markers using their position, and 2) to find/loop over the individuals using individualID instead of name. We need a class implementing SecondaryKeyCreator to extract this secondary key from the original value. For example, here is the class extracting the Position from a Marker:
class MarkerPositionCreator implements SecondaryKeyCreator
{
    public boolean createSecondaryKey(
        SecondaryDatabase secDb,
        DatabaseEntry keyEntry,
        DatabaseEntry dataEntry,
        DatabaseEntry resultEntry)
    {
        Marker marker = (Marker)Test01.this.markerTupleBinding.entryToObject(dataEntry);
        Test01.this.positionTupleBinding.objectToEntry(marker.pos, resultEntry);
        return true;
    }
}

We also need to open those "secondary" databases:
SecondaryConfig secondConfig = new SecondaryConfig();
secondConfig.setAllowCreate(!readOnly);
/* on open, if the secondary database is empty then the primary
   database is read in its entirety and additions/modifications
   to the secondary's records occur automatically */
secondConfig.setAllowPopulate(true);
secondConfig.setSortedDuplicates(true);
MarkerPositionCreator posCreator = new MarkerPositionCreator();
/* identifies the key creator object to be used for secondary key creation */
secondConfig.setKeyCreator(posCreator);
positionDB = this.env.openSecondaryDatabase(null, "position", this.markerDB, secondConfig);


OK, more genetics now. The HapMap genotypes are available here: http://www.hapmap.org/genotypes/latest/rs_strand/non-redundant/ (the path may change). A file looks like a table f(marker,individual)=genotype:
rs# SNPalleles chrom pos strand genome_build center protLSID assayLSID panelLSID QC_code NA18940 NA18942 NA189
rs28412942 A/T chrMT 410 + ncbi_B36 affymetrix GenomeWideSNP_6.02 Japanese:2 QC+ AA AA AA AA AA AA AA AA AA AA
rs3937039 A/G chrMT 665 + ncbi_b36 broad genotype_protocol_11 Japanese:1 QC+ AA AA AA AA AA AA AA AA AA AA AA
rs2853517 A/G chrMT 711 + ncbi_b36 broad genotype_protocol_11 Japanese:1 QC+ GG GG GG GG GG GG GG GG GG GG GG
rs28358568 C/T chrMT 712 + ncbi_b36 broad genotype_protocol_11 Japanese:1 QC+ TT TT TT TT TT TT TT TT TT TT TT
(...)


The file is processed as follows:

Pattern space = Pattern.compile("[ ]");
String HEADER[] = new String[]{"rs#","SNPalleles","chrom","pos","strand","genome_build","center","protLSID","assayLSID","panelLSID","QC_code"};
BufferedReader in = new BufferedReader(new InputStreamReader(new GZIPInputStream(url.openStream())));

String line = in.readLine();
if(line==null) throw new IOException("Empty file");
/* the first line of this file is the header */
String header[] = space.split(line);
for(int i=0; i< HEADER.length; ++i)
{
    if(!header[i].equals(HEADER[i])) throw new IOException("Bad header "+header[i]+" expected "+HEADER[i]);
}
/* the header contains the names of the individuals, which will be inserted.
   At this time we don't know the relationships between those individuals. */
for(int i=HEADER.length; i< header.length; ++i)
{
    Individual individual = new Individual();
    individual.name = header[i];
    DatabaseEntry key = new DatabaseEntry(individual.name.getBytes());
    DatabaseEntry data = new DatabaseEntry();
    this.individualTupleBinding.objectToEntry(individual, data);
    getIndividualDB().put(null, key, data);
}
/* the following lines are the markers and the genotypes */
TupleBinding INT_BINDING = TupleBinding.getPrimitiveBinding(Integer.class);
while((line=in.readLine())!=null)
{
    if(!line.startsWith("rs")) continue;
    String tokens[] = space.split(line);
    /* fill the information of this marker */
    Marker marker = new Marker(Integer.parseInt(tokens[0].substring(2)));
    marker.alleles = tokens[1];
    marker.pos = new Position(tokens[2], Integer.parseInt(tokens[3]));
    marker.strand = tokens[4].charAt(0);

    DatabaseEntry key = new DatabaseEntry();
    INT_BINDING.objectToEntry(marker.getRSId(), key);
    DatabaseEntry data = new DatabaseEntry();
    this.markerTupleBinding.objectToEntry(marker, data);
    getMarkerDB().put(null, key, data);
    /* loop over this marker's columns and get the genotypes */
    for(int i=HEADER.length; i< header.length; ++i)
    {
        if(tokens[i].length()!=2 || tokens[i].equals("NN")) continue;
        /* create the KEY */
        MarkerIndividual mi = new MarkerIndividual(marker.rsId, header[i]);
        this.markerIndividualTupleBinding.objectToEntry(mi, key);
        /* create the genotype */
        this.genotypeTupleBinding.objectToEntry(
            new Genotype(tokens[i].charAt(0), tokens[i].charAt(1)),
            data
        );
        /* put the new pair( pair(marker,individual), genotype) in the BDB */
        getGenotypeDB().put(null, key, data);
    }
}
in.close();


OK, now I want to find the relationships between those individuals; this information is available here. For each "Coriell Repository Number" we find the individual in our database; if it exists, we add the information and write the individual back to the database (see the function makePedigree, line 466).

To retrieve the genotype g=f(marker,individual) I wrote the following simple utility function getGenotypeAt:

Genotype getGenotypeAt(Marker marker, Individual indi) throws DatabaseException
{
    if(marker==null || indi==null) return null;
    DatabaseEntry key = new DatabaseEntry();
    DatabaseEntry value = new DatabaseEntry();
    this.markerIndividualTupleBinding.objectToEntry(new MarkerIndividual(marker.rsId, indi.name), key);
    if(this.genotypeDB.get(null, key, value, LockMode.DEFAULT)!=OperationStatus.SUCCESS) return null;
    return Genotype.class.cast(this.genotypeTupleBinding.entryToObject(value));
}



To get the allele frequencies, we loop over each marker (using the secondary database so the markers come ordered by their position on the genome, not by rs id) and, for each marker, we get the genotypes of all the individuals. To loop over a BDB, a Cursor instance (which looks like a java.util.Iterator) is used.
void frequencies() throws DatabaseException
{
    SecondaryCursor cM = this.positionDB.openSecondaryCursor(null, null);
    DatabaseEntry key = new DatabaseEntry();
    DatabaseEntry value = new DatabaseEntry();
    while(cM.getNext(key, value, LockMode.DEFAULT)==OperationStatus.SUCCESS)
    {
        Marker m = Marker.class.cast(this.markerTupleBinding.entryToObject(value));
        HashMap<Character, Integer> allele2count = new HashMap<Character, Integer>();
        int totalGenotyped = 0;
        int totalFailures = 0;
        System.out.print("rs"+m.rsId+" "+m.alleles+" "+m.pos.chromosome+" "+m.pos.position+" "+m.strand);
        Cursor cI = this.individualDB.openCursor(null, null);
        while(cI.getNext(key, value, LockMode.DEFAULT)==OperationStatus.SUCCESS)
        {
            Individual indi = Individual.class.cast(this.individualTupleBinding.entryToObject(value));
            Genotype g = getGenotypeAt(m, indi);
            if(g!=null)
            {
                totalGenotyped++;
                for(int i=0; i< 2; ++i)
                {
                    char c = (i==0 ? g.a1 : g.a2);
                    Integer count = allele2count.get(c);
                    if(count==null) count=0;
                    allele2count.put(c, count+1);
                }
            }
            else
            {
                totalFailures++;
            }
        }
        cI.close();
        System.out.print(" genotyped:"+(int)(100.0*(totalGenotyped-totalFailures)/(float)totalGenotyped)+"%");
        for(Character allele : allele2count.keySet())
        {
            System.out.print(" f("+allele+")="+allele2count.get(allele)/(2.0*totalGenotyped));
        }
        System.out.println();
    }
    cM.close();
}
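Stripped of the BDB plumbing, the frequency computation itself is just allele counting: f(allele) = count/(2*N) over the N genotyped individuals. A standalone sketch (a hypothetical helper, not part of the pastie):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class AlleleFreq
{
    /** compute f(allele) = count/(2*N) from genotype strings like "AA" or "AG"; "NN" means missing */
    static Map<Character,Double> frequencies(List<String> genotypes)
    {
        Map<Character,Integer> allele2count = new HashMap<Character,Integer>();
        int totalGenotyped = 0;
        for(String g : genotypes)
        {
            if(g.length()!=2 || g.equals("NN")) continue; // skip missing genotypes
            totalGenotyped++;
            for(int i=0; i< 2; ++i)
            {
                Integer count = allele2count.get(g.charAt(i));
                if(count==null) count=0;
                allele2count.put(g.charAt(i), count+1);
            }
        }
        Map<Character,Double> freqs = new HashMap<Character,Double>();
        for(Character allele : allele2count.keySet())
        {
            // each genotyped individual carries two alleles, hence the 2*N denominator
            freqs.put(allele, allele2count.get(allele)/(2.0*totalGenotyped));
        }
        return freqs;
    }
}
```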



Finding the Mendelian incompatibilities works much the same way: I used brute force here, looping over each individual and over each marker. If the individual has a known parent, we check that his genotype is compatible with the parent's.
void incompats() throws DatabaseException
{
    DatabaseEntry key = new DatabaseEntry();
    DatabaseEntry value = new DatabaseEntry();

    Cursor cI = this.individualDB.openCursor(null, null);
    while(cI.getNext(key, value, LockMode.DEFAULT)==OperationStatus.SUCCESS)
    {
        Individual indi = Individual.class.cast(this.individualTupleBinding.entryToObject(value));

        if(indi.fatherID==null && indi.motherID==null)
        {
            continue;
        }

        Individual father = findIndividualByCorrielId(indi.fatherID);
        Individual mother = findIndividualByCorrielId(indi.motherID);

        Cursor cM = this.markerDB.openCursor(null, null);
        while(cM.getNext(key, value, LockMode.DEFAULT)==OperationStatus.SUCCESS)
        {
            Marker m = Marker.class.cast(this.markerTupleBinding.entryToObject(value));
            Genotype gChild = getGenotypeAt(m, indi);
            Genotype gFather = getGenotypeAt(m, father);
            if(isIncompat(gChild, gFather))
            {
                System.out.println("Incompat: for rs"+m.getRSId()+"("+m.alleles+") "+
                    indi.individualID+" is "+gChild+" and his father "+indi.fatherID+" is "+gFather
                );
                continue;
            }
            Genotype gMother = getGenotypeAt(m, mother);
            if(isIncompat(gChild, gMother))
            {
                System.out.println("Incompat: for rs"+m.getRSId()+"("+m.alleles+") "+
                    indi.individualID+" is "+gChild+" and his mother "+indi.motherID+" is "+gMother
                );
                continue;
            }
            if(isIncompat(gChild, gFather, gMother))
            {
                System.out.println("Incompat: for rs"+m.getRSId()+"("+m.alleles+") "+
                    indi.individualID+" is "+gChild+
                    " and his father "+indi.fatherID+" is "+gFather+
                    " and his mother "+indi.motherID+" is "+gMother
                );
            }
        }
        cM.close();
    }
    cI.close();
}
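The isIncompat functions themselves are defined in the pastie; their logic boils down to a few allele comparisons. Here is a minimal sketch of the duo and trio checks (a hypothetical reimplementation, not the author's exact code):

```java
public class MendelCheck
{
    /** duo check: the child must share at least one allele with the parent */
    static boolean isIncompat(char c1, char c2, char p1, char p2)
    {
        return c1!=p1 && c1!=p2 && c2!=p1 && c2!=p2;
    }

    /** trio check: one child allele must come from the father, the other from the mother */
    static boolean isIncompat(char c1, char c2, char f1, char f2, char m1, char m2)
    {
        boolean compatible =
            ((c1==f1 || c1==f2) && (c2==m1 || c2==m2)) ||
            ((c2==f1 || c2==f2) && (c1==m1 || c1==m2));
        return !compatible;
    }
}
```

For example, for rs136009 below the child is CT while both parents are TT: each duo is compatible (the child shares a T with each parent), but the trio is not, since the C allele can come from neither parent.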


That's it. I first tested chromosome 1 (http://www.hapmap.org/genotypes/latest/rs_strand/non-redundant/genotypes_chr1_CEU_r23a_nr.b36.txt.gz, 11 MB) but I pressed Ctrl-C when the database files reached 1.4 GB!
I then used the chr22 file, downloaded directly to my computer. The space required by Berkeley DB to hold the database and the indexes was 721 MB, whereas the zipped original source of data was 2 MB (25 MB unzipped)!!! (Arghhhhhhhhhhhh!!!!!)

  • Individuals count: 90

  • Markers count: 54786

  • Genotypes count: 4853237


Time required to load the HapMap file: 1174 secs (~20 min)

rs9624968 A/G chr22 24783030 + genotyped:86% f(G)=0.879746835443038 f(A)=0.12025316455696203
rs9624969 C/T chr22 24784595 + genotyped:87% f(T)=0.075 f(C)=0.925
rs6004919 C/T chr22 24785216 + genotyped:100% f(T)=0.12777777777777777 f(C)=0.8722222222222222
rs4585127 A/G chr22 24785559 + genotyped:100% f(G)=0.8722222222222222 f(A)=0.12777777777777777
rs5752262 A/G chr22 24786367 + genotyped:95% f(G)=0.5116279069767442 f(A)=0.4883720930232558
rs16981296 C/G chr22 24787784 + genotyped:95% f(G)=0.8488372093023255 f(C)=0.1511627906976744
rs1003547 G/T chr22 24788134 + genotyped:86% f(T)=0.44936708860759494 f(G)=0.5506329113924051
rs9613094 A/G chr22 24788388 + genotyped:100% f(G)=0.2222222222222222 f(A)=0.7777777777777778
rs16986627 A/G chr22 24789298 + genotyped:88% f(G)=0.2654320987654321 f(A)=0.7345679012345679
(...)


Time required to generate the frequencies: 955 secs (~16 min)

Incompat: for rs133457(C/T) 1341M02 is CC and his father 1341MF13 is TT
Incompat: for rs136009(C/T) 1341M02 is CT and his father 1341MF13 is TT and his mother 1341MM14 is TT
Incompat: for rs394518(C/T) 1341M02 is CT and his father 1341MF13 is TT and his mother 1341MM14 is TT
Incompat: for rs628437(A/C) 1341M02 is AC and his father 1341MF13 is CC and his mother 1341MM14 is CC
Incompat: for rs731403(C/G) 1341M02 is GG and his mother 1341MM14 is CC
Incompat: for rs1122940(C/T) 1341M02 is CT and his father 1341MF13 is TT and his mother 1341MM14 is TT
Incompat: for rs2845343(A/T) 1341M02 is AA and his mother 1341MM14 is TT
Incompat: for rs4822498(C/T) 1341M02 is CC and his father 1341MF13 is TT
Incompat: for rs4822499(C/T) 1341M02 is TT and his father 1341MF13 is CC
Incompat: for rs4823195(A/G) 1341M02 is AA and his mother 1341MM14 is GG
Incompat: for rs4991267(C/T) 1341M02 is CC and his mother 1341MM14 is TT
Incompat: for rs5755047(A/T) 1341M02 is TT and his father 1341MF13 is AA
Incompat: for rs5755343(A/T) 1341M02 is AA and his mother 1341MM14 is TT
Incompat: for rs5755420(A/C) 1341M02 is AC and his father 1341MF13 is CC and his mother 1341MM14 is CC
Incompat: for rs5765056(C/T) 1341M02 is CT and his father 1341MF13 is CC and his mother 1341MM14 is CC
Incompat: for rs5765436(A/C) 1341M02 is AC and his father 1341MF13 is CC and his mother 1341MM14 is CC
Incompat: for rs5765499(C/T) 1341M02 is CC and his father 1341MF13 is TT
Incompat: for rs5768636(G/T) 1341M02 is GT and his father 1341MF13 is GG and his mother 1341MM14 is GG
Incompat: for rs5769710(C/T) 1341M02 is CC and his father 1341MF13 is TT
Incompat: for rs5770600(A/C) 1341M02 is CC and his father 1341MF13 is AA
Incompat: for rs5997220(A/G) 1341M02 is AG and his father 1341MF13 is GG and his mother 1341MM14 is GG
(...)
764 errors


Time required to find the Mendelian incompatibilities: 11021 secs (~3 hours)

When the program closed, the database was compacted down to 710 MB.

Conclusion: too slow, and huge files were generated. Definitely a bad choice to handle this kind of data.

4 comments:

Paul Guermonprez said...

Hi,

For what it's worth, I've seen huge bioinformatics software store BIG files with HDF5. So you may test it if you have a minute or two.

Memory access pattern is very important for this kind of task. Having aligned data and parsing it with a predictable pattern speeds things up a lot.

paul.

Anonymous said...

Hi, I wrote an information retrieval system last year, using Perl. I don't know about the Java implementation of BDB, but I know that with the standard Solaris version, if I mixed reading and writing to the db file, my db file would balloon to 3-4x what it should have been.

As a result I stored the data in intermediate structures before persisting to the file. By isolating writes from reads I increased performance on index creation by about 300%.

Might be something to think about.

Philippe said...

I've been working with large genotyping projects for several years now. In fact, I work at one of the centers that produced the HapMap Phase 1 data... So I can relate to your issues: it really is a challenging problem.

We've used several methods to store and analyse genotypes over the years, and we ended up creating our own storage engine and API: GenoByte. We've released it under GPL3: www.obiba.org/genobyte. We'd be glad to help you try it out!

We actively use this engine to store and analyse the HapMap data (but also many other genotyping technologies). The whole CEU data takes up ~2GB of disk space, loading the data is a long process, but analysing it afterwards is very fast. Have a look!

Pierre Lindenbaum said...

Thank you Philippe, this was also my next alternative! :-)
http://twitter.com/yokofakun/statuses/783921446