23 July 2010

Rules Engine for Bioinformatics. Playing with Drools and the Hapmap data

Drools is a rules engine built on the Rete algorithm. From http://java.sys-con.com/node/45082: "A rule engine is a system for applying some set of if/then rules to the data in a system and then taking some action on the data or the system itself. Its primary purpose is to separate the business logic from the system logic - to externalize the business logic so it can be maintained separately. Use a rule engine to separate the if/then rules from system code - to externalize them so they can be maintained separately from the system code. The behavior of the system can then be modified without changing code or needing to recompile/redeploy. Rules are stored in human-readable form in a file so they can be changed with a text editor or rule editor."
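
To make the if/then idea concrete, here is a minimal sketch in plain Java of rules kept as data and fired in priority order. It is only an illustration: the class TinyRules, its Rule type and the fireAll method are names I made up, and the naive double loop has nothing to do with the Rete algorithm that Drools actually uses.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Predicate;

public class TinyRules {
    /** One if/then rule: a name, a priority, a condition and an action (hypothetical type). */
    static final class Rule<T> {
        final String name;
        final int salience;
        final Predicate<T> when;
        final Consumer<T> then;

        Rule(String name, int salience, Predicate<T> when, Consumer<T> then) {
            this.name = name;
            this.salience = salience;
            this.when = when;
            this.then = then;
        }
    }

    /**
     * Tests every rule against every fact, highest salience first.
     * Returns the names of the rules that fired, in firing order.
     */
    static <T> List<String> fireAll(List<Rule<T>> rules, List<T> facts) {
        List<Rule<T>> ordered = new ArrayList<>(rules);
        ordered.sort(Comparator.comparingInt((Rule<T> r) -> r.salience).reversed());
        List<String> fired = new ArrayList<>();
        for (Rule<T> rule : ordered) {
            for (T fact : facts) {
                if (rule.when.test(fact)) {
                    rule.then.accept(fact);
                    fired.add(rule.name);
                }
            }
        }
        return fired;
    }

    public static void main(String[] args) {
        List<Rule<Integer>> rules = new ArrayList<>();
        rules.add(new Rule<Integer>("big", 10, n -> n > 5, n -> System.out.println(n + " is big")));
        rules.add(new Rule<Integer>("even", 30, n -> n % 2 == 0, n -> System.out.println(n + " is even")));
        // "even" has the highest salience, so it is evaluated first
        System.out.println(fireAll(rules, Arrays.asList(4, 7))); // prints [even, big]
    }
}
```

The point is that the list of rules could be built from a file instead of being hard-coded, which is what a real engine such as Drools offers, together with a much smarter matching strategy.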

I would like to know if JBoss Drools can be used for some common bioinformatics tasks. For example, can I use Drools when my users want to select candidate genes with various parameters: "I want a gene with 3 SNPs in the first exon but with fewer than 2 microsatellites in the 3'UTR, and if there is another gene close to this one I want ... etc... etc..."?

In this post, I've played with the Drools engine to search for Mendelian incompatibilities in the HapMap data.

First, download the pedigrees and the genotypes:

Here is the structure of my project:

Individual, Genotype and Snp are three simple POJO classes.


package test;
import java.io.Serializable;

public class Individual implements Serializable
    {
    private static final long serialVersionUID = 1L;
    private int familyId;
    private String name;
    private String fatherName;
    private String motherName;
    private int gender;
    //constructor, getters, setters, etc...
    }


package test;
import java.io.Serializable;

public class Snp
implements Serializable
    {
    private static final long serialVersionUID = 1L;
    private String name;
    private String alleles;
    private String chrom;
    private int position;
    //constructor, getters, setters, etc...
    }


package test;
import java.io.Serializable;

public class Genotype
implements Serializable
    {
    private static final long serialVersionUID = 1L;
    private Snp snp;
    private Individual individual;
    private char a1;
    private char a2;

    public Genotype(Snp snp, Individual individual, String observed)
        {
        this.snp = snp;
        this.individual = individual;

        //store the two alleles in sorted order, so that "GA" and "AG" give the same genotype
        this.a1 = observed.charAt(0);
        this.a2 = observed.charAt(1);
        if(this.a1> this.a2)
            {
            this.a1 = observed.charAt(1);
            this.a2 = observed.charAt(0);
            }
        }

    public boolean contains(char allele)
        {
        return getA1()==allele || getA2()==allele;
        }

    //getters, setters, etc...
    }


hapmap.drl is the Drools rules file. It contains the "business logic" and can be modified without changing anything in the Java program:
package test

rule "Select snps"
salience 30
when
    $snp : Snp($chrom:chrom,$pos:position)
    //equals() rather than ==, which would compare references in a Java expression
    eval(!("chr21".equals($chrom) && $pos> 9880000 && $pos< 10043000))
then
    retract($snp);
    System.err.println("Removing : "+$snp.getName());
end

rule "One Parent"
salience 20
when
    $children : Individual($dad:fatherName,$mum:motherName)
    $parent: Individual(name==$dad || name==$mum)
    $genotypeChild : Genotype(individual==$children,$snp:snp,a1!='N',a2!='N' )
    $genotypeParent : Genotype(individual==$parent,snp==$snp,a1!='N',a2!='N')
    //the parent must carry at least one allele of the child
    eval(!(
        $genotypeParent.contains($genotypeChild.getA1()) ||
        $genotypeParent.contains($genotypeChild.getA2())
        ))
then
    System.out.println($snp.getName()+" : problem with "+
        $children+"("+$genotypeChild.getA1()+"/"+$genotypeChild.getA2()+") incompatibility with parent:"+
        $parent+"("+$genotypeParent.getA1()+"/"+$genotypeParent.getA2()+")"
        );
end

rule "Both Parents"
salience 10
when
    $children : Individual($dad:fatherName,$mum:motherName)
    $father: Individual(name==$dad )
    $mother: Individual(name==$mum )
    $genotypeChild : Genotype(individual==$children,$snp:snp,a1!='N',a2!='N' )
    $genotypeDad : Genotype(individual==$father,snp==$snp,a1!='N',a2!='N')
    $genotypeMum : Genotype(individual==$mother,snp==$snp,a1!='N',a2!='N')
    //one allele of the child must come from the father and the other from the mother
    eval(!(
        ($genotypeDad.contains( $genotypeChild.getA1()) && $genotypeMum.contains( $genotypeChild.getA2())) ||
        ($genotypeDad.contains( $genotypeChild.getA2()) && $genotypeMum.contains( $genotypeChild.getA1()))
        ))
then
    System.out.println($snp.getName()+" : problem with "+
        $children+"("+$genotypeChild.getA1()+"/"+$genotypeChild.getA2()+") incompatibility with parents:"+
        $father+"("+$genotypeDad.getA1()+"/"+$genotypeDad.getA2()+") "+
        $mother+"("+$genotypeMum.getA1()+"/"+$genotypeMum.getA2()+") "
        );
end
  • The first rule "Select snps" (the one with the highest priority, or salience) removes all the SNPs that are not in "chr21:9880000-10043000"
  • The second rule "One Parent" prints a message if there is an incompatibility between a child and one of his parents
  • The last rule "Both Parents" prints a message if there is an incompatibility between a child and both of his parents
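
The two Mendelian tests can also be sketched in plain Java, outside of any rules engine. This is just an illustration of the logic: the class MendelCheck and its methods are hypothetical names, and genotypes are passed as two-letter strings such as "CT" rather than as Genotype objects.

```java
public class MendelCheck {
    /** True if the two-letter genotype carries the given allele. */
    static boolean has(String genotype, char allele) {
        return genotype.charAt(0) == allele || genotype.charAt(1) == allele;
    }

    /** "One Parent" test: the parent must carry at least one of the child's alleles. */
    static boolean compatibleWithParent(String child, String parent) {
        return has(parent, child.charAt(0)) || has(parent, child.charAt(1));
    }

    /** "Both Parents" test: one allele must come from the father, the other from the mother. */
    static boolean compatibleWithParents(String child, String dad, String mum) {
        char c1 = child.charAt(0);
        char c2 = child.charAt(1);
        return (has(dad, c1) && has(mum, c2)) || (has(dad, c2) && has(mum, c1));
    }

    public static void main(String[] args) {
        // child T/T with a parent A/A: the parent carries no allele of the child
        System.out.println(compatibleWithParent("TT", "AA"));        // prints false
        // child C/T with two C/C parents: each parent alone is fine, the trio is not
        System.out.println(compatibleWithParent("CT", "CC"));        // prints true
        System.out.println(compatibleWithParents("CT", "CC", "CC")); // prints false
        // child A/G with parents A/A and G/G: compatible
        System.out.println(compatibleWithParents("AG", "AA", "GG")); // prints true
    }
}
```

The second case shows why both rules are useful: a C/T child with two C/C parents passes the one-parent test against each parent and is only caught when both parents are considered together.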


Drools01.java initializes the Drools engine, parses the HapMap files, inserts the resulting objects into a session created from the "KnowledgeBase" and fires all the rules:
package test;
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

import org.drools.KnowledgeBase;
import org.drools.KnowledgeBaseFactory;
import org.drools.builder.KnowledgeBuilder;
import org.drools.builder.KnowledgeBuilderError;
import org.drools.builder.KnowledgeBuilderErrors;
import org.drools.builder.KnowledgeBuilderFactory;
import org.drools.builder.ResourceType;
import org.drools.definition.KnowledgePackage;
import org.drools.io.ResourceFactory;
import org.drools.runtime.StatefulKnowledgeSession;

public class Drools01
    {
    private void loadHapMap(StatefulKnowledgeSession session)throws Exception
        {
        //read the pedigrees, keep the CEU individuals only
        Map<String, Individual> name2individual=new HashMap<String, Individual>();
        String line;
        BufferedReader in= new BufferedReader(new FileReader(
            /* path of the pedigree file */
            ));
        while((line=in.readLine())!=null)
            {
            if(line.startsWith("FID") || !line.endsWith("CEU")) continue;
            String tokens[]=line.split("[\t]");
            Individual indi=new Individual(
                /* constructor arguments taken from tokens */
                );
            name2individual.put(indi.getName(), indi);
            session.insert(indi);
            }
        in.close();

        in= new BufferedReader(new FileReader("/home/lindenb/genotypes_chr21_CEU_r27_nr.b36_fwd.txt"));
        int r=0;
        //header: the names of the individuals start at column 11
        line=in.readLine();
        String tokens[]=line.split("[ ]");
        Map<Integer, Individual> index2individual=new HashMap<Integer, Individual>(tokens.length);

        for(int i=11;i <tokens.length;++i)
            {
            Individual indi=name2individual.get(tokens[i]);
            if(indi==null) throw new NullPointerException("Cannot get "+tokens[i]);
            index2individual.put(i, indi);
            }
        //one SNP per row: insert the SNP and one Genotype per individual
        while((line=in.readLine())!=null)
            {
            tokens=line.split("[ ]");
            Snp snp=new Snp(tokens[0], tokens[1],tokens[2], Integer.parseInt(tokens[3]));
            session.insert(snp);
            for(int i=11;i <tokens.length;++i)
                {
                session.insert(new Genotype(snp, index2individual.get(i), tokens[i]));
                }
            if(r++>2000) break;//only read the first 2000 rows or so, not everything
            }
        in.close();
        }

    private void run()throws Exception
        {
        KnowledgeBase kbase = KnowledgeBaseFactory.newKnowledgeBase();
        KnowledgeBuilder kbuilder = KnowledgeBuilderFactory.newKnowledgeBuilder();
        kbuilder.add( ResourceFactory.newClassPathResource(
            "hapmap.drl",
            Drools01.class ),
            ResourceType.DRL );
        //report the errors found while compiling the rules
        KnowledgeBuilderErrors errors= kbuilder.getErrors();
        for(KnowledgeBuilderError error:errors)
            {
            System.err.println(error);
            }

        Collection<KnowledgePackage> pkgs = kbuilder.getKnowledgePackages();
        kbase.addKnowledgePackages( pkgs );
        StatefulKnowledgeSession ksession = kbase.newStatefulKnowledgeSession();
        loadHapMap(ksession);
        ksession.fireAllRules();
        ksession.dispose();
        }

    /**
     * @param args
     */
    public static void main(String[] args) throws Exception
        {
        try {
            new Drools01().run();
            } catch(Exception err)
            {
            err.printStackTrace();
            }
        }
    }
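
The way the loader above walks through the genotype file can be tried without Drools and without the real files. The sketch below parses one header line and one data line held in memory. Everything in it is made up for the illustration: the class HapMapLineSketch, the sample names IND1 and IND2, the SNP name rsTEST and the filler columns c4 to c10. The only things taken from the loader are the space separator and the fact that the per-individual columns start at index 11.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class HapMapLineSketch {
    /** Index of the first per-individual column, as in the loader. */
    static final int FIRST_SAMPLE_COLUMN = 11;

    /** Returns the names of the individuals listed in the header line. */
    static List<String> sampleNames(String header) {
        String[] tokens = header.split("[ ]");
        return new ArrayList<>(Arrays.asList(tokens).subList(FIRST_SAMPLE_COLUMN, tokens.length));
    }

    /** Returns one "snp individual genotype" string per individual for a data line. */
    static List<String> genotypes(String header, String row) {
        List<String> names = sampleNames(header);
        String[] tokens = row.split("[ ]");
        List<String> out = new ArrayList<>();
        for (int i = FIRST_SAMPLE_COLUMN; i < tokens.length; ++i) {
            out.add(tokens[0] + " " + names.get(i - FIRST_SAMPLE_COLUMN) + " " + tokens[i]);
        }
        return out;
    }

    public static void main(String[] args) {
        // synthetic lines: 4 SNP columns, 7 ignored columns, then the individuals
        String header = "rs# alleles chrom pos c4 c5 c6 c7 c8 c9 c10 IND1 IND2";
        String row = "rsTEST A/G chr21 9900000 x x x x x x x AG NN";
        System.out.println(sampleNames(header));    // prints [IND1, IND2]
        System.out.println(genotypes(header, row)); // prints [rsTEST IND1 AG, rsTEST IND2 NN]
    }
}
```

In the real program each of those triples becomes a Genotype object inserted in the session, and the "NN" genotypes are later skipped by the a1!='N',a2!='N' constraints of the rules.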



javac -cp ${CP}:. test/Drools01.java
java -cp ${CP}:. test.Drools01


Removing : rs885550
Removing : rs28363862
Removing : rs28783163
rs1028272 : problem with NA10838(T/T) incompatibility with parent:NA12003(A/A)
rs9647052 : problem with NA10847(A/A) incompatibility with parent:NA12239(C/C)
rs1882882 : problem with NA07014(A/A) incompatibility with parent:NA07031(G/G)
rs12627714 : problem with NA07048(G/G) incompatibility with parent:NA07034(A/A)
rs17240368 : problem with NA07048(A/A) incompatibility with parent:NA07034(G/G)
rs2822605 : problem with NA07048(C/C) incompatibility with parent:NA07034(G/G)
rs2822593 : problem with NA07048(G/G) incompatibility with parent:NA07034(A/A)
rs7276922 : problem with NA07048(T/T) incompatibility with parent:NA07034(C/C)
rs10439653 : problem with NA07048(A/A) incompatibility with parent:NA07034(C/C)
rs10439652 : problem with NA07048(C/C) incompatibility with parent:NA07034(T/T)
rs17001769 : problem with NA10856(A/A) incompatibility with parent:NA11829(C/C)
rs9977658 : problem with NA10860(C/C) incompatibility with parent:NA11992(T/T)
rs8133625 : problem with NA10856(G/G) incompatibility with parent:NA11830(A/A)
rs12627045 : problem with NA12740(A/A) incompatibility with parent:NA12750(G/G)
rs416083 : problem with NA10843(G/G) incompatibility with parent:NA11919(A/A)
rs2822484 : problem with NA06991(A/A) incompatibility with parent:NA06985(G/G)
rs9977169 : problem with NA12865(C/C) incompatibility with parent:NA12875(T/T)
rs379724 : problem with NA07019(G/G) incompatibility with parent:NA07056(A/A)
rs13046593 : problem with NA10860(C/C) incompatibility with parent:NA11992(G/G)
rs9984592 : problem with NA10854(G/G) incompatibility with parent:NA11840(A/A)
rs2187275 : problem with NA10831(C/C) incompatibility with parent:NA12156(T/T)
rs6516605 : problem with NA12708(G/G) incompatibility with parent:NA12718(C/C)
rs7283783 : problem with NA10830(G/G) incompatibility with parent:NA12154(A/A)
rs13051673 : problem with NA12739(G/G) incompatibility with parent:NA12748(A/A)
rs3115488 : problem with NA10860(A/A) incompatibility with parent:NA11993(G/G)
rs8132413 : problem with NA10860(T/T) incompatibility with parent:NA11993(A/A)
rs2207842 : problem with NA10838(A/A) incompatibility with parent:NA12004(G/G)
rs2821973 : problem with NA12832(C/C) incompatibility with parent:NA12843(T/T)
rs469471 : problem with NA07349(A/A) incompatibility with parent:NA07347(G/G)
rs8129674 : problem with NA10839(G/G) incompatibility with parent:NA12006(A/A)
rs2257224 : problem with NA10854(G/G) incompatibility with parent:NA11840(A/A)
rs865859 : problem with NA10855(C/C) incompatibility with parent:NA11832(T/T)
rs2742158 : problem with NA10855(C/C) incompatibility with parent:NA11832(T/T)
rs4111253 : problem with NA10836(C/C) incompatibility with parent:NA12275(T/T)
rs240444 : problem with NA10861(T/T) incompatibility with parent:NA11994(C/C)
rs469812 : problem with NA06991(C/C) incompatibility with parent:NA06985(G/G)
rs210534 : problem with NA06991(T/T) incompatibility with parent:NA06985(A/A)
rs2822670 : problem with NA10861(C/T) incompatibility with parents:NA11994(C/C) NA11995(C/C)
rs9305335 : problem with NA10831(C/T) incompatibility with parents:NA12155(C/C) NA12156(C/C)
rs9305297 : problem with NA12801(A/G) incompatibility with parents:NA12812(G/G) NA12813(G/G)
rs2822641 : problem with NA07349(A/C) incompatibility with parents:NA07347(C/C) NA07346(C/C)
rs9977057 : problem with NA12801(G/T) incompatibility with parents:NA12812(T/T) NA12813(T/T)
rs2822614 : problem with NA12877(C/T) incompatibility with parents:NA12889(C/C) NA12890(C/C)
rs2178907 : problem with NA07349(A/G) incompatibility with parents:NA07347(G/G) NA07346(G/G)
rs1124322 : problem with NA10837(A/G) incompatibility with parents:NA12272(G/G) NA12273(G/G)
rs2822537 : problem with NA12336(A/G) incompatibility with parents:NA12342(G/G) NA12343(G/G)
rs386524 : problem with NA12753(C/T) incompatibility with parents:NA12762(C/C) NA12763(C/C)
rs367249 : problem with NA12802(A/G) incompatibility with parents:NA12814(G/G) NA12815(G/G)
rs17001380 : problem with NA10856(G/T) incompatibility with parents:NA11829(G/G) NA11830(G/G)
rs2155965 : problem with NA12336(A/T) incompatibility with parents:NA12342(A/A) NA12343(A/A)
rs8133601 : problem with NA10854(C/T) incompatibility with parents:NA11839(C/C) NA11840(C/C)
rs8134986 : problem with NA07014(A/C) incompatibility with parents:NA07051(C/C) NA07031(C/C)
rs3094804 : problem with NA12865(C/T) incompatibility with parents:NA12874(T/T) NA12875(T/T)
rs1929150 : problem with NA06997(C/T) incompatibility with parents:NA06986(T/T) NA07045(T/T)
rs7276195 : problem with NA07019(A/G) incompatibility with parents:NA07022(A/A) NA07056(A/A)
rs3855691 : problem with NA12877(A/G) incompatibility with parents:NA12889(G/G) NA12890(G/G)
rs17468376 : problem with NA10835(A/G) incompatibility with parents:NA12248(A/A) NA12249(A/A)
rs2747618 : problem with NA10835(C/T) incompatibility with parents:NA12248(C/C) NA12249(C/C)
rs2943900 : problem with NA10854(C/T) incompatibility with parents:NA11839(T/T) NA11840(T/T)
rs4009972 : problem with NA10835(G/T) incompatibility with parents:NA12248(G/G) NA12249(G/G)
rs8133159 : problem with NA12767(A/G) incompatibility with parents:NA12777(A/A) NA12778(A/A)


So many questions: how should I model my data? What if those data are already present in a database? How does Drools handle a large amount of data? etc... etc...

That's it

