Showing posts with label genetics. Show all posts

22 February 2015

Drawing a Manhattan plot in SVG using a GWAS+XML model.

On Friday, I saw my colleague @b_l_k start writing SVG+XML code by hand to draw a Manhattan plot. I told him that a better idea would be to describe the data in XML and to transform that XML into SVG using XSLT.


So, let's do this. I put the XSLT stylesheet on GitHub at https://github.com/lindenb/xslt-sandbox/blob/master/stylesheets/bio/manhattan.xsl . The data model looks like this (I took the data from @genetics_blog's http://www.gettinggeneticsdone.com/2011/04/annotated-manhattan-plots-and-qq-plots.html ):


<?xml version="1.0"?>
<manhattan>
  <chromosome name="1" length="1500">
    <data rs="1" position="1" p="0.914806043496355"/>
    <data rs="2" position="2" p="0.937075413297862"/>
    <data rs="3" position="3" p="0.286139534786344"/>
    <data rs="4" position="4" p="0.830447626067325"/>
    <data rs="5" position="5" p="0.641745518893003"/>
 (...)
  </chromosome>
  <chromosome name="22" length="535">
    <data rs="15936" position="1" p="0.367785102687776"/>
    <data rs="15937" position="2" p="0.628192085539922"/>
(...)
    <data rs="1" position="1" p="0.914806043496355"/>
  </chromosome>
</manhattan>

The stylesheet


At the beginning, we declare the size of the drawing:


<xsl:variable name="width" select="number(1000)"/>
<xsl:variable name="height" select="number(400)"/>

Then we need the total size of the genome:


<xsl:variable name="genomeSize">
    <xsl:call-template name="sumLengthChrom">
      <xsl:with-param name="length" select="number(0)"/>
      <xsl:with-param name="node" select="manhattan/chromosome[1]"/>
    </xsl:call-template>
</xsl:variable>

We could use the XPath function 'sum()', but here the user is free to omit the size of a chromosome. If the attribute '@length' is not declared, we use the maximum SNP position in that chromosome:


<xsl:template name="sumLengthChrom">
    <xsl:param name="length"/>
    <xsl:param name="node"/>
    <xsl:variable name="chromlen">
      <xsl:apply-templates select="$node" mode="length"/>
    </xsl:variable>
    <xsl:choose>
      <xsl:when test="count($node/following-sibling::chromosome)&gt;0">
        <xsl:call-template name="sumLengthChrom">
          <xsl:with-param name="length" select="$length + number($chromlen)"/>
          <xsl:with-param name="node" select="$node/following-sibling::chromosome[1]"/>
        </xsl:call-template>
      </xsl:when>
      <xsl:otherwise>
        <xsl:value-of select="$length + number($chromlen)"/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>
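XSLT 1.0 variables are immutable, so the running total is threaded through the recursion as an accumulator parameter instead of being updated in a loop. The same pattern, sketched in Java over a plain array of chromosome lengths (class and method names are mine, not from the stylesheet):

```java
public class GenomeSize {
    // Tail-recursive accumulator, mirroring the sumLengthChrom template:
    // add the current chromosome's length, then recurse on the next sibling.
    static long sum(long acc, int[] lengths, int i) {
        long total = acc + lengths[i];
        return (i + 1 < lengths.length) ? sum(total, lengths, i + 1) : total;
    }

    public static void main(String[] args) {
        System.out.println(sum(0, new int[]{1500, 1200, 535}, 0)); // 3235
    }
}
```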

We get the smallest p-value:


<xsl:variable name="minpvalue">
    <xsl:for-each select="manhattan/chromosome/data">
      <xsl:sort select="@p" data-type="number" order="ascending"/>
      <xsl:if test="position() = 1">
        <xsl:value-of select="number(@p)"/>
      </xsl:if>
    </xsl:for-each>
  </xsl:variable>

Then we plot each chromosome. The XSL parameter "previous" holds the number of bases already plotted; we use the SVG 'transform' attribute to translate the current chromosome within the drawing:


<xsl:template name="plotChromosomes">
    <xsl:param name="previous"/>
    <xsl:param name="node"/>
(...)
      <xsl:attribute name="transform">
        <xsl:text>translate(</xsl:text>
        <xsl:value-of select="(number($previous) div number($genomeSize)) * $width"/>
        <xsl:text>,0)</xsl:text>
      </xsl:attribute>

We plot each SNP:


<svg:g style="fill-opacity:0.5;">
        <xsl:apply-templates select="data" mode="plot"/>
      </svg:g>

and we plot the remaining chromosomes, if any:


<xsl:if test="count($node/following-sibling::chromosome)&gt;0">
      <xsl:call-template name="plotChromosomes">
        <xsl:with-param name="previous" select="$previous + number($chromlen)"/>
        <xsl:with-param name="node" select="$node/following-sibling::chromosome[1]"/>
      </xsl:call-template>
    </xsl:if>

To plot each SNP, we compute the X coordinate within the current chromosome:


<xsl:template match="data" mode="x">
    <xsl:variable name="chromWidth">
      <xsl:apply-templates select=".." mode="width"/>
    </xsl:variable>
    <xsl:variable name="chromLength">
      <xsl:apply-templates select=".." mode="length"/>
    </xsl:variable>
    <xsl:value-of select="(number(@position) div number($chromLength)) * $chromWidth"/>
  </xsl:template>

and the Y coordinate:


<xsl:template match="data" mode="y">
    <xsl:value-of select="$height - (( (math:log(number(@p)) * -1 ) div $maxlog2value ) * $height )"/>
  </xsl:template>
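EXSLT's math:log() is the natural logarithm, so this template maps −log(p) linearly onto the drawing height: p = 1 sits on the bottom axis and smaller p-values climb toward the top. The same mapping as a Java sketch (method and parameter names are mine):

```java
public class ManhattanY {
    /** y = height - ((-ln(p)) / maxLog) * height, as in the template above. */
    static double y(double p, double maxLog, double height) {
        return height - ((Math.log(p) * -1.0) / maxLog) * height;
    }

    public static void main(String[] args) {
        System.out.println(y(1.0, 10.0, 400.0));  // p=1 lands on the bottom axis: 400.0
        System.out.println(y(1e-4, 10.0, 400.0) < y(1e-2, 10.0, 400.0)); // smaller p is higher: true
    }
}
```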

We can also wrap the data in a hyperlink when an @rs attribute exists:


<xsl:choose>
      <xsl:when test="@rs">
        <svg:a target="_blank">
          <xsl:attribute name="xlink:href">
            <xsl:value-of select="concat('http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=',@rs)"/>
          </xsl:attribute>
          <xsl:apply-templates select="." mode="plotshape"/>
        </svg:a>
      </xsl:when>
      <xsl:otherwise>
        <xsl:apply-templates select="." mode="plotshape"/>
      </xsl:otherwise>
    </xsl:choose>

We plot the SNP itself as a circle:


<xsl:template match="data" mode="plotshape">
    <svg:circle r="5">
      <xsl:attribute name="cx">
        <xsl:apply-templates select="." mode="x"/>
      </xsl:attribute>
      <xsl:attribute name="cy">
        <xsl:apply-templates select="." mode="y"/>
      </xsl:attribute>
    </svg:circle>
  </xsl:template>

Result:


$ xsltproc manhattan.xsl model.xml > plot.svg

Plot


That's it,
Pierre

26 July 2013

g1kv37 vs hg19

In order to create a class translating chromosome names from one naming convention to another, I've compared the MD5 sums of the chromosomes in the human genome releases g1k/v37 and ucsc/hg19. Here is the Java program creating the MD5s:

The MD5 sums were extracted as follows:



Here are the common chromosomes, joined on the hash-sum:


And here are the unpairable data:


I knew about the problem for chrY ( http://www.biostars.org/p/58143/ ) but not for chr3. What is the problem with this chromosome?

Edit: Here are the base counts for UCSC/chr3:

{T=58760485, G=38670110, A=58713343, C=38653197, N=3225295}

and for g1kv37:

{T=58760485, G=38670110, A=58713343, R=2, C=38653197, M=1, N=3225292}

So g1kv37/chr3 contains three IUPAC ambiguity bases (R = A/G, M = A/C) that are hard-masked to N in hg19 (3225292 + 2 + 1 = 3225295): that alone is enough to change the MD5.
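Counts like the ones above can be produced with a simple character tally over the chromosome sequence. A minimal sketch (an illustration, not the program used for this post):

```java
import java.util.TreeMap;

public class BaseCount {
    /** Tally every character of the sequence, including N and IUPAC codes. */
    static TreeMap<Character, Integer> count(String seq) {
        TreeMap<Character, Integer> counts = new TreeMap<>();
        for (char c : seq.toUpperCase().toCharArray()) {
            counts.merge(c, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(count("ACGTNNRM")); // {A=1, C=1, G=1, M=1, N=2, R=1, T=1}
    }
}
```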

That's it,



Pierre.

15 June 2011

Sorting and Joining some Genotypes with Hadoop.



This post describes how I've used Hadoop to merge two large files produced by our Affymetrix Axiom genotyping platform.
(Note: working with the latest version of Hadoop has been a headache: it contains some classes with the same name ('org.apache.hadoop.mapreduce.Reducer' and 'org.apache.hadoop.mapred.Reducer'), many classes have been deprecated (org.apache.hadoop.mapred.JobConf) but are still needed to run some specific algorithms (`jobConf.setInputFormat(CompositeInputFormat.class)`), etc.)

The Input Files


The annotation file

This file describes the genotyped markers (chrom, position, alleleA, alleleB, etc.).

(... header ...)
"Probe Set ID","dbSNP RS ID","Chromosome","Physical Position","Strand","ChrX pseudo-autosomal region 1","Cytoband","Flank","Allele A","Allele B","Associated Gene","Genetic Map","Microsatellite","Allele Frequencies","Heterozygous Allele Frequencies","Number of individuals/Number of chromosomes","In Hapmap","Strand Versus dbSNP","Probe Count","ChrX pseudo-autosomal region 2","Minor Allele","Minor Allele Frequency","OMIM"
"AX-11086612","rs10001348","4","29912308","+","0","p15.1","gtattcagttgaacacaaatcagtgcatgt[A/G]","A","G","ENST00000467087 // downstream // 2885305 // Hs.724550 // STIM2 // 57620 // stromal interaction molecule 2 /// ENST00000361762 // upstream // 809728 // Hs.479439 // PCDH7 // 5099 // protocadherin 7 /// NM_001169117 // downstream // 2885305 // Hs.724550 // STIM2 // 57620 // stromal interaction molecule 2 /// NM_032456 // upstream // 809728 // Hs.724529 // PCDH7 // 5099 // protocadherin 7","50.0229786923511 // D4S418 // D4S2408 // --- // --- /// 44.2128449948474 // D4S2397 // D4S2430 // ATA27C07 // GCT6F03 /// 42.4637111703432 // --- // --- // 226002 // 46437","D4S333 // upstream // 47955 /// D4S605 // downstream // 97312","0.872881356 // 0.127118644 // Caucasian /// 0.833333333 // 0.166666667 // Han Chinese /// 0.777777778 // 0.222222222 // Japanese /// 0.775 // 0.225 // Yoruban","0.254237288 // Caucasian /// 0.288888889 // Han Chinese /// 0.355555556 // Japanese /// 0.35 // Yoruban","60.0 // Caucasian /// 45.0 // Han Chinese /// 45.0 // Japanese /// 60.0 // Yoruban","YES","same","1","0","G // Caucasian /// G // Han Chinese /// G // Japanese /// G // Yoruban","0.127118644 // Caucasian /// 0.166666667 // Han Chinese /// 0.222222222 // Japanese /// 0.225 // Yoruban","---"
"AX-11086611","rs10001340","4","130341127","+","0","q28.2","[A/C]agggcattcatctcagcttactatttgggaaaaat","A","C","ENST00000281146 // downstream // 306645 // Hs.567679 // C4orf33 // 132321 // chromosome 4 open reading frame 33 /// ENST00000394248 // upstream // 3729342 // Hs.192859 // PCDH10 // 57575 // protocadherin 10 /// NM_173487 // downstream // 307285 // Hs.567679 // C4orf33 // 132321 // chromosome 4 open reading frame 33 /// NM_020815 // upstream // 3729342 // Hs.192859 // PCDH10 // 57575 // protocadherin 10","127.864057946266 // D4S1615 // D4S2365 // --- // --- /// 129.756132396152 // D4S2938 // D4S2365 // AFMA284WG5 // GATA10A12 /// 124.03426335901 // D4S2394 // --- // --- // 55218","D4S3198 // upstream // 331274 /// D4S2394 // downstream // 43310","0.815789474 // 0.184210526 // Caucasian /// 1.0 // 0.0 // Han Chinese /// 1.0 // 0.0 // Japanese /// 0.816666667 // 0.183333333 // Yoruban","0.368421053 // Caucasian /// 0.0 // Han Chinese /// 0.0 // Japanese /// 0.266666667 // Yoruban","60.0 // Caucasian /// 45.0 // Han Chinese /// 45.0 // Japanese /// 60.0 // Yoruban","YES","same","1","0","C // Caucasian /// C // Han Chinese /// C // Japanese /// C // Yoruban","0.184210526 // Caucasian /// 0.0 // Han Chinese /// 0.0 // Japanese /// 0.183333333 // Yoruban","---"
"AX-11086610","rs10001337","4","54351529","+","0","q12","atgaggagtagccacatgatctaagcacct[C/T]","T","C","ENST00000306888 // --- // 0 // Hs.518760 // LNX1 // 84708 // ligand of numb-protein X 1 /// ENST00000263925 // --- // 0 // Hs.518760 // LNX1 // 84708 // ligand of numb-protein X 1 /// NM_001126328 // intron // 0 // Hs.518760 // LNX1 // 84708 // ligand of numb-protein X 1 /// NM_032622 // intron // 0 // Hs.518760 // LNX1 // 84708 // ligand of numb-protein X 1","67.4182086016315 // D4S2971 // D4S1594 // --- // --- /// 62.2091955728879 // D4S2971 // UNKNOWN // AFMB312YG1 // GATA61B02 /// 61.6059658777947 // --- // GATA61B02 // 149925 // ---","D4S461 // upstream // 151923 /// D4S2583E // downstream // 24481","0.118644068 // 0.881355932 // Caucasian /// 0.111111111 // 0.888888889 // Han Chinese /// 0.122222222 // 0.877777778 // Japanese /// 0.025 // 0.975 // Yoruban","0.203389831 // Caucasian /// 0.133333333 // Han Chinese /// 0.244444444 // Japanese /// 0.05 // Yoruban","60.0 // Caucasian /// 45.0 // Han Chinese /// 45.0 // Japanese /// 60.0 // Yoruban","YES","same","1","0","T // Caucasian /// T // Han Chinese /// T // Japanese /// T // Yoruban","0.118644068 // Caucasian /// 0.111111111 // Han Chinese /// 0.122222222 // Japanese /// 0.025 // Yoruban","---"
(...)

Calls

This file is returned by the Axiom machine. Each number encodes a genotype (AA/AB/BB/nil).
(..header...)
probeset_id A05.CEL A06.CEL A03.CEL A04.CEL A01.CEL A10.CEL A02.CEL A11.CEL A12.CEL A09.CEL A07.CEL A08.CEL B01.CEL B02.CEL B06.CEL B03.CEL B04.CEL B05.CEL B07.CEL B08.CEL B09.CEL B10.CEL C02.CEL B11.CEL B12.CEL C01.CEL C03.CEL C04.CEL C05.CEL C06.CEL C07.CEL C08.CEL C09.CEL C10.CEL C11.CEL C12.CEL D01.CEL D02.CEL D03.CEL D04.CEL D05.CEL D06.CEL D07.CEL D08.CEL D09.CEL D10.CEL D11.CEL D12.CEL E01.CEL E02.CEL E03.CEL E04.CEL E05.CEL E06.CEL E07.CEL E08.CEL E09.CEL E10.CEL E11.CEL E12.CEL F01.CEL F02.CEL F03.CEL F04.CEL F05.CEL F06.CEL F07.CEL F08.CEL H08.CEL H07.CEL G07.CEL G08.CEL H03.CEL H04.CEL H06.CEL H05.CEL G09.CEL G10.CEL H01.CEL H02.CEL G11.CEL G12.CEL G03.CEL G04.CEL H10.CEL H09.CEL F10.CEL F09.CEL G01.CEL G02.CEL F11.CEL F12.CEL G05.CEL G06.CEL H11.CEL H12.CEL
AX-11086525 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 1 0 1 0 0 1 1 2 1 1 0 0 0 0 1 0 0 1 0 0 1 1 0 1 0 1 1 2 0 1 0 0 1 2 0 2
AX-11086526 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 1 2 2 0 2 2 2 1 1 1 2 2 2 2 2 2 2 2 2 1 1 1 2 1 2 1 2 2 0 2 1 2 2 1 2 2 2 2 1 2 2 2 2 2 1 2 1 1 2 2 1 2 2 2 2 2 2 2 2 2 1 2 2 2
AX-11086527 2 0 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 1 2 2 1 1 0 2 1 2 1 2 2 1 1 1 1 2 2 1 0 2 1 2 1 2 2 2 1 2 1 1 1 1 2 2 2 2 2 1 1 0 2 0 1 0 1 0 1 2 1 1 0 1 1 0 2 1 1 1 0 1 1 1 2 2 1 1
(...)

Joined file

... and this is what I want to obtain at the end:
AX-11086525 3 165621955 rs10000341 T T T T T T T T T T T T T T T G T T T T T T T T T T T T T T T T T T T T T T T G T G T T T T T G T T T T T T T T T T T T T T T T T T T T T T T T T T T G T T T T T G T T T G T T T T T T T T T T T T T T T T T T T T T G T T T T T T T T T G T T T G T T T G T T T T T G T G G G T G T G T T T T T T T T T G T T T T T G T T T T T G T G T T T G T T T G T G G G T T T G T T T T T G G G T T G G
AX-11086526 3 5237152 rs10000142 C C T C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C T C C C C C C C C C C C C C C C C C C C C C C C T C C C C C T T C C C C C C T C T C T C C C C C C C C C C C C C C C C C C C T C T C T C C C T C C C T C C C C C T T C C T C C C C C T C C C C C C C C C T C C C C C C C C C C C T C C C T C T C C C C C T C C C C C C C C C C C C C C C C C C C T C C C C C C C
(...)
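To produce this joined file, each numeric call must be decoded against the marker's two alleles. The post only says the numbers encode AA/AB/BB/nil; assuming the usual convention (0 = AA, 1 = AB, 2 = BB, -1 = no call), the decoding could look like this sketch (names are mine):

```java
public class CallDecoder {
    /** Decode an Axiom call code using the marker's A and B alleles. */
    static String decode(int call, String alleleA, String alleleB) {
        switch (call) {
            case 0:  return alleleA + " " + alleleA; // homozygous AA
            case 1:  return alleleA + " " + alleleB; // heterozygous AB
            case 2:  return alleleB + " " + alleleB; // homozygous BB
            default: return "N N";                   // -1: no call
        }
    }

    public static void main(String[] args) {
        // alleles T/G are illustrative, not taken from the annotation file
        System.out.println(decode(0, "T", "G")); // T T
        System.out.println(decode(1, "T", "G")); // T G
        System.out.println(decode(2, "T", "G")); // G G
    }
}
```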

Setting up Hadoop and HDFS


export JAVA_HOME=/your/path/to/JDK
cd hadoop-0.20.203.0

Change the config
in "conf/core-site.xml":
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>

in "conf/hdfs-site.xml":
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>

in "conf/mapred-site.xml":
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
</property>
</configuration>


Setup ssh for no password:
$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

change chmod for ssh:
$ chmod 700 ~/.ssh/
$ chmod 640 ~/.ssh/authorized_keys

if needed, add an alias in '/etc/hosts' and restart:"sudo /etc/rc.d/init.d/network restart"
Format HDFS:
bin/hadoop namenode -format
11/06/15 08:32:13 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = srv-clc-04.u915.irt.univ-nantes.prive3/127.0.0.1
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 0.20.203.0
STARTUP_MSG: build = http://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20-security-203 -r 1099333; compiled by 'oom' on Wed May 4 07:57:50 PDT 2011
************************************************************/
11/06/15 08:32:13 INFO util.GSet: VM type = 64-bit
11/06/15 08:32:13 INFO util.GSet: 2% max memory = 19.1675 MB
11/06/15 08:32:13 INFO util.GSet: capacity = 2^21 = 2097152 entries
11/06/15 08:32:13 INFO util.GSet: recommended=2097152, actual=2097152
11/06/15 08:32:13 INFO namenode.FSNamesystem: fsOwner=lindenb
11/06/15 08:32:13 INFO namenode.FSNamesystem: supergroup=supergroup
11/06/15 08:32:13 INFO namenode.FSNamesystem: isPermissionEnabled=true
11/06/15 08:32:13 INFO namenode.FSNamesystem: dfs.block.invalidate.limit=100
11/06/15 08:32:13 INFO namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)
11/06/15 08:32:13 INFO namenode.NameNode: Caching file names occuring more than 10 times
11/06/15 08:32:13 INFO common.Storage: Image file of size 113 saved in 0 seconds.
11/06/15 08:32:14 INFO common.Storage: Storage directory /home/lindenb/tmp/HADOOP/dfs/name has been successfully formatted.
11/06/15 08:32:14 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at srv-clc-04.u915.irt.univ-nantes.prive3/127.0.0.1

... and start HDFS:
$ ./bin/start-all.sh 
namenode running as process 1151. Stop it first.
localhost: starting datanode, logging to /home/lindenb/package/hadoop-0.20.203.0/bin/../logs/hadoop-lindenb-datanode-srv-clc-04.u915.irt.univ-nantes.prive3.out
localhost: secondarynamenode running as process 1490. Stop it first.
jobtracker running as process 1588. Stop it first.
localhost: starting tasktracker, logging to /home/lindenb/package/hadoop-0.20.203.0/bin/../logs/hadoop-lindenb-tasktracker-srv-clc-04.u915.irt.univ-nantes.prive3.out

Copy the Axiom data to HDFS:
$bin/hadoop fs  -mkdir myfolder
$ bin/hadoop fs -copyFromLocal ~/Axiom_GW_Hu_SNP.r2.na31.annot.csv myfolder/
$ bin/hadoop fs -copyFromLocal ~/AxiomGT1.calls.txt myfolder/

The Classes

The annotation file and the genotypes file will both be sorted, and we're going to merge the two results. The classes implement Writable so they can be (de)serialized from/to HDFS.

Genotypes

This class holds, for one marker, the array of genotype codes (0, 1, 2 or -1, one per sample):
 static public class Genotypes implements Writable
{
private byte array[];
public Genotypes()
{
this(0);
}
public Genotypes(int n)
{
this.array=new byte[n];
}
@Override
public void readFields(DataInput in) throws IOException
{
int n=in.readInt();
this.array=new byte[n];
in.readFully(this.array);
}
@Override
public void write(DataOutput out) throws IOException
{
out.writeInt(this.array.length);
out.write(this.array);
}
}
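A Writable must be symmetric: whatever write() emits, readFields() must consume in the same order. That contract can be exercised without Hadoop using plain java.io streams; this sketch mimics the length-prefixed layout of the Genotypes class above (class and method names are mine):

```java
import java.io.*;

public class GenotypesRoundTrip {
    /** Serialize: length first, then the raw bytes, as in Genotypes.write(). */
    static byte[] write(byte[] array) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        out.writeInt(array.length);
        out.write(array);
        return bos.toByteArray();
    }

    /** Deserialize in the same order, as in Genotypes.readFields(). */
    static byte[] read(byte[] serialized) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(serialized));
        byte[] array = new byte[in.readInt()];
        in.readFully(array);
        return array;
    }

    public static void main(String[] args) throws IOException {
        byte[] calls = {0, 1, 2, -1};
        System.out.println(java.util.Arrays.equals(calls, read(write(calls)))); // true
    }
}
```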

Position

A position on the genome.
static public class Position implements WritableComparable<Position>
{
private String chrom="N/A";
private int position=-1;

public Position()
{
}
public Position(String chrom,int position)
{
this.chrom=chrom;
this.position=position;
}
public String getChrom()
{
return chrom;
}

public int getPosition()
{
return position;
}

@Override
public void readFields(DataInput in) throws IOException {
this.chrom=in.readUTF();
this.position=in.readInt();
}
@Override
public void write(DataOutput out) throws IOException {
out.writeUTF(chrom);
out.writeInt(position);
}
@Override
public int compareTo(Position o) {
int i=chrom.compareTo(o.chrom);
if(i!=0) return i;
return position - o.position;
}
@Override
public String toString() {
return chrom+":"+position;
}
}

Marker

This class contains the data available in the annotation file:
static public class Marker implements WritableComparable<Marker>
{
private Position position=new Position();
private String alleleA;
private String alleleB;
private String probeSetId;
private String rsid;
public Marker()
{
}

public Position getPosition() {
return position;
}

public String getAlleleA() {
return alleleA;
}

public String getAlleleB() {
return alleleB;
}

public String getProbeSetId() {
return probeSetId;
}

public String getRsid() {
return rsid;
}

@Override
public void readFields(DataInput in) throws IOException {
this.position.readFields(in);
this.alleleA=in.readUTF();
this.alleleB=in.readUTF();
this.probeSetId=in.readUTF();
this.rsid=in.readUTF();
}
@Override
public void write(DataOutput out) throws IOException {
this.position.write(out);
out.writeUTF(this.alleleA);
out.writeUTF(this.alleleB);
out.writeUTF(this.probeSetId);
out.writeUTF(this.rsid);
}
@Override
public int compareTo(Marker o) {
return position.compareTo(o.position);
}
@Override
public String toString() {
return this.probeSetId+" "+position;
}
}

MapReduce for sorting the Annotation file.

We need to split each CSV line and sort the file on the probeSetId. The output will be a key/value [probeSetId(String)/Marker].
Mapper
public static class SortMarkerByProbIdMapper extends Mapper<LongWritable, Text, Text, Marker>
{
(...)
@Override
protected void map(
LongWritable key,
Text value,
Context context)
throws java.io.IOException ,InterruptedException
{
//ignore header and comment
if( value.find("\"Probe Set ID\"")==0 ||
value.find("#")==0 )
{
return;
}

List<String> array=splitCSV(new String(value.getBytes(),0,value.getLength()));
if(array.get(this.chromCol).equals("---")) return;//undefined chromosome

Marker m=new Marker();
m.position=new Position(
array.get(this.chromCol),
Integer.parseInt(array.get(this.posCol))
);
m.alleleA=array.get(this.alleleACol);
m.alleleB=array.get(this.alleleBCol);
m.probeSetId=array.get(this.probeSetIdCol);
m.rsid=array.get(this.rsIdCol);
context.write(new Text(m.probeSetId),m);
}
}
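The splitCSV() helper is elided above. A minimal quoted-field splitter sufficient for this annotation file might look like the following (an assumption, not the post's actual code; it does not handle escaped quotes inside fields):

```java
import java.util.ArrayList;
import java.util.List;

public class Csv {
    /** Split one CSV line, honoring double-quoted fields. */
    static List<String> splitCSV(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        boolean quoted = false;
        for (char c : line.toCharArray()) {
            if (c == '"') quoted = !quoted;     // toggle in/out of a quoted field
            else if (c == ',' && !quoted) {     // an unquoted comma ends the field
                fields.add(cur.toString());
                cur.setLength(0);
            } else cur.append(c);
        }
        fields.add(cur.toString());             // last field
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(splitCSV("\"AX-11086612\",\"rs10001348\",\"4\",\"29912308\""));
        // [AX-11086612, rs10001348, 4, 29912308]
    }
}
```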

Reducer
public static class SortMarkerByProbIdReducer extends Reducer<Text, Marker, Text, Marker>
{
@Override
protected void reduce(
Text key,
Iterable<Marker> values,
Context context
) throws java.io.IOException ,InterruptedException
{
Marker marker=null;
for(Marker i:values)
{
if(marker!=null) throw new IOException("Duplicate marker id "+key);
marker=i;
}
if(marker==null) return;
context.write(key,marker);
}
}

Job
private void sortMarkersByProbeid(Configuration conf) throws Exception
{
//JobConf jobConf=new JobConf(conf);

FileSystem fs=FileSystem.get(conf);
final Path outPath=new Path(SORTED_MARKERS_PATH);
final Path inPath=new Path("myfolder/Axiom_GW_Hu_SNP.r2.na31.annot.csv");

List<String> header=null;
(...)
/*
(...)
open the file and find the column indexes
(...)
*/

Job job = new Job(conf, Hadoop01.class.getName());
job.setJarByClass(Hadoop01.class);
job.setMapperClass(SortMarkerByProbIdMapper.class);
job.setReducerClass(SortMarkerByProbIdReducer.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Marker.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Marker.class);
job.setOutputFormatClass(org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat.class);


if(fs.exists(outPath))
{
fs.delete(outPath, true);
}

FileInputFormat.addInputPath(job, inPath);
FileOutputFormat.setOutputPath(job, outPath);


if(!job.waitForCompletion(true) )
{
throw new IOException("Cannot complete job");
}
}

MapReduce for sorting the 'Genotypes' file.

The output will be a key/value [probeSetId(String)/Genotypes].
Mapper
public static class SortCallByProbIdMapper extends Mapper<LongWritable, Text, Text, Genotypes>
{
final private Pattern tab=Pattern.compile("[\t]");
@Override
protected void map(
LongWritable key,
Text value,
Context context)
throws java.io.IOException ,InterruptedException
{
//ignore header and comment
if( value.find("probeset_id")==0 ||
value.find("%")==0 )
{
return;
}

String tokens[]=tab.split(new String(value.getBytes(),0,value.getLength()));
Genotypes genotypes=new Genotypes(tokens.length-1);
for(int i=1;i< tokens.length;++i)
{
genotypes.array[i-1]=Byte.parseByte(tokens[i]);
}
context.write(new Text(tokens[0]),genotypes);
}
}

Reducer
public static class SortCallByProbIdReducer extends Reducer<Text, Genotypes, Text, Genotypes>
{
@Override
protected void reduce(
Text key,
Iterable<Genotypes> values,
Context context
) throws java.io.IOException ,InterruptedException
{
Genotypes array=null;
for(Genotypes i:values)
{
if(array!=null) throw new IOException("Duplicate marker id "+key);
array=i;
}
if(array==null) return;
context.write(key,array);
}
}

Job
public void sortCallsByProbeid(Configuration conf) throws Exception
{
//JobConf jobConf=new JobConf(conf);

FileSystem fs=FileSystem.get(conf);
final Path outPath=new Path(SORTED_CALLS_PATH);
final Path inPath=new Path("myfolder/AxiomGT1.calls.txt");

Job job = new Job(conf, Hadoop01.class.getName());
job.setJarByClass(Hadoop01.class);
job.setMapperClass(SortCallByProbIdMapper.class);
job.setReducerClass(SortCallByProbIdReducer.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Genotypes.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Genotypes.class);

job.setOutputFormatClass(org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat.class);



if(fs.exists(outPath))
{
fs.delete(outPath, true);
}

FileInputFormat.addInputPath(job, inPath);
FileOutputFormat.setOutputPath(job, outPath);


if(!job.waitForCompletion(true) )
{
throw new IOException("Cannot complete job");
}
}

Joining the two sorted files

This was the most complicated part, as I didn't find much documentation about how to 'join' two files in HDFS with the latest Hadoop API. The following solution seems to work, and it only needs a reducer:
Reducer
public static class SortCallByProbIdReducer extends Reducer<Text, Genotypes, Text, Genotypes>
{
@Override
protected void reduce(
Text key,
Iterable<Genotypes> values,
Context context
) throws java.io.IOException ,InterruptedException
{
Genotypes array=null;
for(Genotypes i:values)
{
if(array!=null) throw new IOException("Duplicate marker id "+key);
array=i;
}
if(array==null) return;
context.write(key,array);
}
}

Job
private void join(Configuration conf) throws Exception
{
FileSystem fs=FileSystem.get(conf);
Path outPath=new Path(JOIN_PATH);
if(fs.exists(outPath))
{
fs.delete(outPath, true);
}

final String compose=CompositeInputFormat.compose(
"inner",
SequenceFileInputFormat.class,
new Path(SORTED_MARKERS_PATH),
new Path(SORTED_CALLS_PATH)
);
System.err.println(compose);
JobConf jobConf = new JobConf(conf, getClass());
jobConf.setJobName("join");
jobConf.setInputFormat(CompositeInputFormat.class);
jobConf.set("mapred.join.expr",
compose);

jobConf.setMapOutputKeyClass(Text.class);
jobConf.setMapOutputValueClass(TupleWritable.class);
jobConf.setOutputValueClass(Text.class);//TupleWritable ?
jobConf.setOutputKeyClass(Text.class);
jobConf.setOutputFormat(TextOutputFormat.class);
jobConf.setReducerClass(JoinReducer.class);

//jobConf.setMapOutputValueClass(Text.class);
org.apache.hadoop.mapred.FileOutputFormat.setOutputPath(jobConf, outPath);
JobClient.runJob(jobConf);
}

Full source code



That's it,

Pierre

05 January 2011

Coding a CXF web service translating a DNA to a protein. My notebook

Apache CXF is a Web Services framework. In this post, I'll describe how I implemented a Web Service translating a DNA sequence to a protein, using the Apache Tomcat web server and the CXF libraries.

Defining the interface

First, a simple Java interface bio.Translate is needed to describe the service. This simple service receives a string (the DNA) and returns a string (the peptide). The annotations will be used by CXF to name the parameters in the WSDL file (see below):

Implementing the service

bio.TranslateImpl implements bio.Translate. The setter/getter for ncbiString will be used by a configuration file to specify a genetic code (standard, mitochondrial) for the service. The methods initIt and cleanUp can be used to acquire and release resources when the service is created and/or disposed.
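The implementation itself is embedded as a gist and not reproduced here; at its core, the service just walks the DNA three bases at a time and looks each codon up in a genetic-code table. A minimal sketch of that logic with a deliberately tiny codon table (the real service uses a full, configurable NCBI genetic code):

```java
import java.util.HashMap;
import java.util.Map;

public class TinyTranslate {
    // A few entries of the standard genetic code, for illustration only.
    static final Map<String, Character> CODE = new HashMap<>();
    static {
        CODE.put("ATG", 'M'); CODE.put("GAA", 'E'); CODE.put("TTT", 'F');
        CODE.put("TAA", '*'); // stop codon
    }

    /** Translate a DNA string codon by codon; 'X' marks codons outside the tiny table. */
    static String translate(String dna) {
        StringBuilder pep = new StringBuilder();
        for (int i = 0; i + 3 <= dna.length(); i += 3) {
            Character aa = CODE.get(dna.substring(i, i + 3).toUpperCase());
            pep.append(aa == null ? 'X' : aa);
        }
        return pep.toString();
    }

    public static void main(String[] args) {
        System.out.println(translate("ATGGAATTTTAA")); // MEF*
    }
}
```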

Configuring the service

CXF uses the libraries of the Spring framework (I blogged about Spring here). An XML config file, beans.xml, makes it easy to configure two Java beans, one for the standard genetic code and one for the mitochondrial code. In this config file we also tell Spring about the two methods initIt and cleanUp. Those two beans will be used by two Web Services.

Defining the CXF application for Tomcat

The following web.xml file simply tells Tomcat, the web server, to use the CXFServlet to listen for SOAP queries.

Compile & Deploy

Installing a CXF web service requires many libraries, and in the end the size of the deployed 'war' file was 8.5 MB(!). The current structure of the project is:
./translate/WEB-INF/classes/bio/TranslateImpl.java
./translate/WEB-INF/classes/bio/Translate.java
./translate/WEB-INF/beans.xml
./translate/WEB-INF/web.xml
The service was compiled and deployed using the following Makefile:
cxf.lib=apache-cxf-2.3.1/lib
all:
mkdir -p translate/WEB-INF/lib
javac -d translate/WEB-INF/classes -sourcepath translate/WEB-INF/classes translate/WEB-INF/classes/bio/TranslateImpl.java
cp ${cxf.lib}/cxf-2.3.1.jar \
${cxf.lib}/geronimo-activation_1.1_spec-1.1.jar \
${cxf.lib}/geronimo-annotation_1.0_spec-1.1.1.jar \
${cxf.lib}/geronimo-javamail_1.4_spec-1.7.1.jar \
${cxf.lib}/geronimo-servlet_3.0_spec-1.0.jar \
${cxf.lib}/geronimo-ws-metadata_2.0_spec-1.1.3.jar \
${cxf.lib}/geronimo-jaxws_2.2_spec-1.0.jar \
${cxf.lib}/geronimo-stax-api_1.0_spec-1.0.1.jar \
${cxf.lib}/jaxb-api-2.2.1.jar \
${cxf.lib}/jaxb-impl-2.2.1.1.jar \
${cxf.lib}/neethi-2.0.4.jar \
${cxf.lib}/saaj-api-1.3.jar \
${cxf.lib}/saaj-impl-1.3.2.jar \
${cxf.lib}/wsdl4j-1.6.2.jar \
${cxf.lib}/XmlSchema-1.4.7.jar \
${cxf.lib}/xml-resolver-1.2.jar \
${cxf.lib}/aopalliance-1.0.jar \
${cxf.lib}/spring-core-3.0.5.RELEASE.jar \
${cxf.lib}/spring-beans-3.0.5.RELEASE.jar \
${cxf.lib}/spring-context-3.0.5.RELEASE.jar \
${cxf.lib}/spring-web-3.0.5.RELEASE.jar \
${cxf.lib}/commons-logging-1.1.1.jar \
${cxf.lib}/spring-asm-3.0.5.RELEASE.jar \
${cxf.lib}/spring-expression-3.0.5.RELEASE.jar \
${cxf.lib}/spring-aop-3.0.5.RELEASE.jar \
translate/WEB-INF/lib
jar cvf translate.war -C translate .
mv translate.war path-to-tomcat/webapps

Checking the URL

We can see that the service was correctly deployed by pointing a web browser at http://localhost:8080/translate/, where we can see the two services:
Available SOAP services:
Translate
  • translate
Endpoint address: http://localhost:8080/translate/translateMit
WSDL : {http://bio/}TranslateService
Target namespace: http://bio/
Translate
  • translate
Endpoint address: http://localhost:8080/translate/translateStd
WSDL : {http://bio/}TranslateService
Target namespace: http://bio/

Here, the URLs link to the WSDL definition for the web service:

Creating a client

To create a client consuming this service, I first used the code generated by CXF's wsdl2java, but there was a bug with one of the generated classes (it is a known bug), so here I'm going to use the standard ${JAVA_HOME}/bin/wsimport.
> wsimport -p generated -d client -keep "http://localhost:8080/translate/translateStd?wsdl"
parsing WSDL...
generating code...
compiling code...
I wrote a java client MyClient.java using this generated API:

Compiling

> cd client
> javac MyClient.java

Running

> java MyClient
EFIDHSIAC*


That's it,

Pierre

23 July 2010

Rules Engine for Bioinformatics. Playing with Drools and the Hapmap data


Drools is a Rules Engine implementing the Rete algorithm. From "http://java.sys-con.com/node/45082": a rule engine is a system for applying some set of if/then rules to the data in a system and then taking some action on the data or on the system itself. Its primary purpose is to separate the business logic from the system logic, externalizing the if/then rules so they can be maintained separately from the system code. The behavior of the system can then be modified without changing code or needing to recompile/redeploy: the rules are stored in human-readable form in a file, so they can be changed with a text editor or a rule editor.

I would like to know if Drools can be used for some common bioinformatics tasks. For example, can I use Drools when my users want to select candidate genes with various parameters: "I want a gene with 3 SNPs in the first exon but with fewer than 2 microsatellites in the 3'UTR, and if there is another gene close to this one I want... etc... etc...".

In the current post, I've played with the Drools engine to search for Mendelian incompatibilities in the HapMap data.

First, download the pedigree and genotype files:

Here is the structure of my project:
./test
./test/Individual.java
./test/Snp.java
./test/Drools01.java
./test/Genotype.java
./test/hapmap.drl


Individual, Genotype and Snp are 3 simple POJOs.

Individual.java


package test;
import java.io.Serializable;

public class Individual implements Serializable
{
    private static final long serialVersionUID = 1L;
    private int familyId;
    private String name;
    private String fatherName;
    private String motherName;
    private int gender;
    //constructor, getters, setters, etc...
}


Snp.java


package test;
import java.io.Serializable;

public class Snp implements Serializable
{
    private static final long serialVersionUID = 1L;
    private String name;
    private String alleles;
    private String chrom;
    private int position;
    //constructor, getters, setters, etc...
}


Genotype.java


package test;
import java.io.Serializable;

public class Genotype implements Serializable
{
    private static final long serialVersionUID = 1L;
    private Snp snp;
    private Individual individual;
    private char a1;
    private char a2;

    public Genotype(Snp snp, Individual individual, String observed)
    {
        super();
        this.snp = snp;
        this.individual = individual;
        this.a1 = observed.charAt(0);
        this.a2 = observed.charAt(1);
        // normalize: always store the two alleles in lexicographic order
        if(this.a1 > this.a2)
        {
            char tmp = this.a1;
            this.a1 = this.a2;
            this.a2 = tmp;
        }
    }

    public boolean contains(char allele)
    {
        return getA1() == allele || getA2() == allele;
    }

    //getters, setters, etc...
}
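
The allele swap in the constructor simply normalizes the pair so that a1 <= a2, which makes "A/T" and "T/A" compare equal. A minimal standalone sketch of that logic (the class name AllelePair is mine, not part of the project):

```java
// Standalone sketch of the allele normalization done in Genotype's
// constructor: the two observed alleles are stored in lexicographic
// order so that "TA" and "AT" represent the same genotype.
public class AllelePair {
    private final char a1;
    private final char a2;

    public AllelePair(String observed) {
        char x = observed.charAt(0);
        char y = observed.charAt(1);
        // store the smaller allele first
        this.a1 = (x <= y) ? x : y;
        this.a2 = (x <= y) ? y : x;
    }

    public char getA1() { return a1; }
    public char getA2() { return a2; }

    // true if this genotype carries the given allele
    public boolean contains(char allele) {
        return a1 == allele || a2 == allele;
    }
}
```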

hapmap.drl


hapmap.drl is the Drools rules file. It contains the "business logic" and can be modified without changing anything in the Java program:
package test

rule "Select snps"
salience 30
when
$snp:Snp($chrom:chrom,$pos:position)
eval(!($chrom=="chr21" && $pos> 9880000 && $pos< 10043000))
then
retract($snp);
System.err.println("Removing : "+$snp.getName());
end

rule "One Parent"
salience 20

when
$children : Individual($dad:fatherName,$mum:motherName)
$parent: Individual(name==$dad || name==$mum)
$genotypeChild : Genotype(individual==$children,$snp:snp,a1!='N',a2!='N' )
$genotypeParent : Genotype(individual==$parent,snp==$snp,a1!='N',a2!='N')
eval(!(
$genotypeParent.contains($genotypeChild.getA1()) ||
$genotypeParent.contains($genotypeChild.getA2())
))
then
retract($genotypeChild);
System.out.println($snp.getName()+" : problem with "+
$children+"("+$genotypeChild.getA1()+"/"+$genotypeChild.getA2()+") incompatibility with parent:"+
$parent+"("+$genotypeParent.getA1()+"/"+$genotypeParent.getA2()+")"
);
end

rule "Both Parents"
salience 10
when
$children : Individual($dad:fatherName,$mum:motherName)
$father: Individual(name==$dad )
$mother: Individual(name==$mum )
$genotypeChild : Genotype(individual==$children,$snp:snp,a1!='N',a2!='N' )
$genotypeDad : Genotype(individual==$father,snp==$snp,a1!='N',a2!='N')
$genotypeMum : Genotype(individual==$mother,snp==$snp,a1!='N',a2!='N')
eval(!(
($genotypeDad.contains( $genotypeChild.getA1()) && $genotypeMum.contains( $genotypeChild.getA2())) ||
($genotypeDad.contains( $genotypeChild.getA2()) && $genotypeMum.contains( $genotypeChild.getA1()))
))
then
retract($genotypeChild);
System.out.println($snp.getName()+" : problem with "+
$children+"("+$genotypeChild.getA1()+"/"+$genotypeChild.getA2()+") incompatibility with parents:"+
$father+"("+$genotypeDad.getA1()+"/"+$genotypeDad.getA2()+") "+
$mother+"("+$genotypeMum.getA1()+"/"+$genotypeMum.getA2()+") "
);
end
  • The first rule, "Select snps" (the one with the highest priority (salience)), removes all the SNPs that are not in "chr21:9880000-10043000"
  • The second rule, "One Parent", prints a message if there is an incompatibility between a child and one of its parents
  • The last rule, "Both Parents", prints a message if there is an incompatibility between a child and both of its parents
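
The incompatibility tests performed by the "One Parent" and "Both Parents" rules boil down to simple allele-containment checks. A minimal sketch in plain Java (hypothetical helper class, with genotypes encoded as two-character strings):

```java
// Sketch of the Mendelian checks encoded in the "One Parent" and
// "Both Parents" rules, on genotypes written as two-character strings.
public class MendelCheck {
    // true if the genotype carries the given allele
    public static boolean contains(String gt, char allele) {
        return gt.charAt(0) == allele || gt.charAt(1) == allele;
    }

    // "One Parent": the child must share at least one allele with the parent
    public static boolean compatibleWithOneParent(String child, String parent) {
        return contains(parent, child.charAt(0))
            || contains(parent, child.charAt(1));
    }

    // "Both Parents": one child allele must come from the father and the
    // other from the mother, in either order
    public static boolean compatibleWithBothParents(String child, String dad, String mum) {
        return (contains(dad, child.charAt(0)) && contains(mum, child.charAt(1)))
            || (contains(dad, child.charAt(1)) && contains(mum, child.charAt(0)));
    }
}
```

For example, a T/T child of an A/A parent fails the first check (the case reported for rs1028272 above), and a C/T child of two C/C parents passes the one-parent check but fails the two-parent check (the case reported for rs2822670).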


Drools01.java

Drools01.java initializes the Drools engine, parses the HapMap files, puts those objects in the "KnowledgeBase" and fires all the rules:
package test;
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

import org.drools.KnowledgeBase;
import org.drools.KnowledgeBaseFactory;
import org.drools.builder.KnowledgeBuilder;
import org.drools.builder.KnowledgeBuilderError;
import org.drools.builder.KnowledgeBuilderErrors;
import org.drools.builder.KnowledgeBuilderFactory;
import org.drools.builder.ResourceType;
import org.drools.definition.KnowledgePackage;
import org.drools.io.ResourceFactory;
import org.drools.runtime.StatefulKnowledgeSession;

public class Drools01
{
    private void loadHapMap(StatefulKnowledgeSession session) throws Exception
    {
        Map<String, Individual> name2individual = new HashMap<String, Individual>();
        String line;
        BufferedReader in = new BufferedReader(new FileReader(
            "/home/lindenb/relationships_w_pops_121708.txt"
            ));

        while((line = in.readLine()) != null)
        {
            if(line.startsWith("FID") || !line.endsWith("CEU")) continue;
            String tokens[] = line.split("[\t]");
            Individual indi = new Individual(
                Integer.parseInt(tokens[0]),
                tokens[1],
                tokens[2].equals("0") ? null : tokens[2],
                tokens[3].equals("0") ? null : tokens[3],
                Integer.parseInt(tokens[4])
                );
            name2individual.put(indi.getName(), indi);
        }
        in.close();

        in = new BufferedReader(new FileReader("/home/lindenb/genotypes_chr21_CEU_r27_nr.b36_fwd.txt"));
        int r = 0;

        line = in.readLine();
        String tokens[] = line.split("[ ]");
        Map<Integer, Individual> index2individual = new HashMap<Integer, Individual>(tokens.length);

        for(int i = 11; i < tokens.length; ++i)
        {
            Individual indi = name2individual.get(tokens[i]);
            if(indi == null) throw new NullPointerException("Cannot get " + tokens[i]);
            session.insert(indi);
            index2individual.put(i, indi);
        }

        while((line = in.readLine()) != null)
        {
            tokens = line.split("[ ]");
            Snp snp = new Snp(tokens[0], tokens[1], tokens[2], Integer.parseInt(tokens[3]));
            session.insert(snp);
            for(int i = 11; i < tokens.length; ++i)
            {
                session.insert(new Genotype(snp, index2individual.get(i), tokens[i]));
            }
            if(r++ > 2000) break; //just read the first 2000 rows
        }
        in.close();
    }

    private void run() throws Exception
    {
        KnowledgeBase kbase = KnowledgeBaseFactory.newKnowledgeBase();
        KnowledgeBuilder kbuilder = KnowledgeBuilderFactory.newKnowledgeBuilder();
        kbuilder.add(ResourceFactory.newClassPathResource(
                "hapmap.drl",
                Drools01.class),
            ResourceType.DRL);
        KnowledgeBuilderErrors errors = kbuilder.getErrors();
        if(!errors.isEmpty())
        {
            for(KnowledgeBuilderError error : errors)
            {
                System.err.println(error.getMessage());
            }
            return;
        }

        Collection<KnowledgePackage> pkgs = kbuilder.getKnowledgePackages();
        kbase.addKnowledgePackages(pkgs);
        StatefulKnowledgeSession ksession = kbase.newStatefulKnowledgeSession();
        loadHapMap(ksession);

        ksession.fireAllRules();
    }

    public static void main(String[] args)
    {
        try {
            new Drools01().run();
        } catch(Exception err)
        {
            err.printStackTrace();
        }
    }
}

Compilation


Makefile:
DROOLS=/home/lindenb/package/drools-5.0
ECLIPSE=/home/lindenb/package/eclipse/plugins
CP=$(DROOLS)/drools-core-5.0.1.jar:$(DROOLS)/drools-api-5.0.1.jar:$(DROOLS)/lib/mvel2-2.0.10.jar:$(DROOLS)/lib/antlr-runtime-3.1.1.jar:$(DROOLS)/drools-compiler-5.0.1.jar:$(ECLIPSE)/org.eclipse.jdt.core_3.5.2.v_981_R35x.jar

all:
	javac -cp ${CP}:. test/Drools01.java
	java -cp ${CP}:. test.Drools01

Results

Removing : rs885550
Removing : rs28363862
Removing : rs28783163
rs1028272 : problem with NA10838(T/T) incompatibility with parent:NA12003(A/A)
rs9647052 : problem with NA10847(A/A) incompatibility with parent:NA12239(C/C)
rs1882882 : problem with NA07014(A/A) incompatibility with parent:NA07031(G/G)
rs12627714 : problem with NA07048(G/G) incompatibility with parent:NA07034(A/A)
rs17240368 : problem with NA07048(A/A) incompatibility with parent:NA07034(G/G)
rs2822605 : problem with NA07048(C/C) incompatibility with parent:NA07034(G/G)
rs2822593 : problem with NA07048(G/G) incompatibility with parent:NA07034(A/A)
rs7276922 : problem with NA07048(T/T) incompatibility with parent:NA07034(C/C)
rs10439653 : problem with NA07048(A/A) incompatibility with parent:NA07034(C/C)
rs10439652 : problem with NA07048(C/C) incompatibility with parent:NA07034(T/T)
rs17001769 : problem with NA10856(A/A) incompatibility with parent:NA11829(C/C)
rs9977658 : problem with NA10860(C/C) incompatibility with parent:NA11992(T/T)
rs8133625 : problem with NA10856(G/G) incompatibility with parent:NA11830(A/A)
rs12627045 : problem with NA12740(A/A) incompatibility with parent:NA12750(G/G)
rs416083 : problem with NA10843(G/G) incompatibility with parent:NA11919(A/A)
rs2822484 : problem with NA06991(A/A) incompatibility with parent:NA06985(G/G)
rs9977169 : problem with NA12865(C/C) incompatibility with parent:NA12875(T/T)
rs379724 : problem with NA07019(G/G) incompatibility with parent:NA07056(A/A)
rs13046593 : problem with NA10860(C/C) incompatibility with parent:NA11992(G/G)
rs9984592 : problem with NA10854(G/G) incompatibility with parent:NA11840(A/A)
rs2187275 : problem with NA10831(C/C) incompatibility with parent:NA12156(T/T)
rs6516605 : problem with NA12708(G/G) incompatibility with parent:NA12718(C/C)
rs7283783 : problem with NA10830(G/G) incompatibility with parent:NA12154(A/A)
rs13051673 : problem with NA12739(G/G) incompatibility with parent:NA12748(A/A)
rs3115488 : problem with NA10860(A/A) incompatibility with parent:NA11993(G/G)
rs8132413 : problem with NA10860(T/T) incompatibility with parent:NA11993(A/A)
rs2207842 : problem with NA10838(A/A) incompatibility with parent:NA12004(G/G)
rs2821973 : problem with NA12832(C/C) incompatibility with parent:NA12843(T/T)
rs469471 : problem with NA07349(A/A) incompatibility with parent:NA07347(G/G)
rs8129674 : problem with NA10839(G/G) incompatibility with parent:NA12006(A/A)
rs2257224 : problem with NA10854(G/G) incompatibility with parent:NA11840(A/A)
rs865859 : problem with NA10855(C/C) incompatibility with parent:NA11832(T/T)
rs2742158 : problem with NA10855(C/C) incompatibility with parent:NA11832(T/T)
rs4111253 : problem with NA10836(C/C) incompatibility with parent:NA12275(T/T)
rs240444 : problem with NA10861(T/T) incompatibility with parent:NA11994(C/C)
rs469812 : problem with NA06991(C/C) incompatibility with parent:NA06985(G/G)
rs210534 : problem with NA06991(T/T) incompatibility with parent:NA06985(A/A)
rs2822670 : problem with NA10861(C/T) incompatibility with parents:NA11994(C/C) NA11995(C/C)
rs9305335 : problem with NA10831(C/T) incompatibility with parents:NA12155(C/C) NA12156(C/C)
rs9305297 : problem with NA12801(A/G) incompatibility with parents:NA12812(G/G) NA12813(G/G)
rs2822641 : problem with NA07349(A/C) incompatibility with parents:NA07347(C/C) NA07346(C/C)
rs9977057 : problem with NA12801(G/T) incompatibility with parents:NA12812(T/T) NA12813(T/T)
rs2822614 : problem with NA12877(C/T) incompatibility with parents:NA12889(C/C) NA12890(C/C)
rs2178907 : problem with NA07349(A/G) incompatibility with parents:NA07347(G/G) NA07346(G/G)
rs1124322 : problem with NA10837(A/G) incompatibility with parents:NA12272(G/G) NA12273(G/G)
rs2822537 : problem with NA12336(A/G) incompatibility with parents:NA12342(G/G) NA12343(G/G)
rs386524 : problem with NA12753(C/T) incompatibility with parents:NA12762(C/C) NA12763(C/C)
rs367249 : problem with NA12802(A/G) incompatibility with parents:NA12814(G/G) NA12815(G/G)
rs17001380 : problem with NA10856(G/T) incompatibility with parents:NA11829(G/G) NA11830(G/G)
rs2155965 : problem with NA12336(A/T) incompatibility with parents:NA12342(A/A) NA12343(A/A)
rs8133601 : problem with NA10854(C/T) incompatibility with parents:NA11839(C/C) NA11840(C/C)
rs8134986 : problem with NA07014(A/C) incompatibility with parents:NA07051(C/C) NA07031(C/C)
rs3094804 : problem with NA12865(C/T) incompatibility with parents:NA12874(T/T) NA12875(T/T)
rs1929150 : problem with NA06997(C/T) incompatibility with parents:NA06986(T/T) NA07045(T/T)
rs7276195 : problem with NA07019(A/G) incompatibility with parents:NA07022(A/A) NA07056(A/A)
rs3855691 : problem with NA12877(A/G) incompatibility with parents:NA12889(G/G) NA12890(G/G)
rs17468376 : problem with NA10835(A/G) incompatibility with parents:NA12248(A/A) NA12249(A/A)
rs2747618 : problem with NA10835(C/T) incompatibility with parents:NA12248(C/C) NA12249(C/C)
rs2943900 : problem with NA10854(C/T) incompatibility with parents:NA11839(T/T) NA11840(T/T)
rs4009972 : problem with NA10835(G/T) incompatibility with parents:NA12248(G/G) NA12249(G/G)
rs8133159 : problem with NA12767(A/G) incompatibility with parents:NA12777(A/A) NA12778(A/A)


Questions

So many questions: how should I model my data? What if those data are already stored in a database? How does Drools handle a large amount of data? etc... etc...

That's it

Pierre

01 February 2010

Searching for Genotypes with SPARQL.

This week-end, I noticed that the NCBI has an interface called Genotype Query Form that can be used to query genotypes, generating the following kind of XML output:
<GenoExchange xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.ncbi.nlm.nih.gov/SNP/geno" xsi:schemaLocation="http://www.ncbi.nlm.nih.gov/SNP/geno ftp://ftp.ncbi.nlm.nih.gov/snp/specs/genoex_1_4.xsd" dbSNPBuildNo="129">
<Population popId="1409" handle="CSHL-HAPMAP" locPopId="HapMap-CEU">
<popClass self="NOT SPECIFIED" />
</Population>
<Individual indId="170" taxId="9606" sex="F" indGroup="European">
<SourceInfo source="Coriell" sourceType="repository" ncbiPedId="80" pedId="1340" indId="NA07000" maId="0" paId="0" srcIndGroup="Western and Nothern European" />
<SubmitInfo popId="1409" submittedIndId="NA07000" subIndGroup="Western and Northern European" />
</Individual>
<Individual indId="621" taxId="9606" sex="F" indGroup="European">

(...)
<SnpLoc genomicAssembly="36:reference" chrom="1" start="1286927" locType="2" rsOrientToChrom="rev" contigAllele="C" />
<SsInfo ssId="3906671" locSnpId="AL139287.6_22772" ssOrientToRs="fwd">
<ByPop popId="1409" sampleSize="120">
<AlleleFreq allele="A" freq="0.117" />
<AlleleFreq allele="G" freq="0.883" />
<GTypeFreq gtype="A/G" freq="0.233" />
<GTypeFreq gtype="G/G" freq="0.767" />
(...)
<GTypeByInd indId="636" gtype="G/G" />
<GTypeByInd indId="456" gtype="G/G" />
<GTypeByInd indId="536" gtype="G/G" />
</ByPop>
</SsInfo>
<GTypeFreq gtype="A/A" freq="0.380952380952381" />
<GTypeFreq gtype="A/G" freq="0.352380952380952" />
<GTypeFreq gtype="G/G" freq="0.266666666666667" />
</SnpInfo>
<SnpInfo rsId="2765021" observed="A/G">
(...)
I wanted to see how one could query this kind of data with SPARQL... well, I'm sure that RDF is one of the most inefficient ways to store this kind of data, but I wanted to see what could be extracted from such an RDF store with a semantic query. First, I wrote an XSLT stylesheet transforming <GenoExchange/> to <rdf:RDF/>. The stylesheet is available at http://code.google.com/p/lindenb/source/browse/trunk/src/xsl/genoexch2rdf.xsl.

Transform the data

About 639 HapMap SNPs on chromosome 1 were extracted using the HTML form and saved as XML to the file 'SNPgenotype-100201-1244-3905.xml' (size: 4 MB). The XML was converted to RDF with the xsltproc engine:
xsltproc --stringparam "with-sequence" yes --novalid genoexch2rdf.xsl SNPgenotype-100201-1244-3905.xml > input.rdf
The size of 'input.rdf' (including the flanking sequences of the SNPs) was 20 MB.

Result


<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:g="http://www.ncbi.nlm.nih.gov/SNP/geno" xmlns:snp="http://www.ncbi.nlm.nih.gov/SNP/docsum" xmlns="http://ontology.lindenb.org/genotypes/">
<Population rdf:about="http://www.ncbi.nlm.nih.gov/projects/SNP/snp_viewTable.cgi?type=pop&amp;pop_id=1409">
<handle>CSHL-HAPMAP</handle>
<locPopId>HapMap-CEU</locPopId>
</Population>
<Individual rdf:about="http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ind.cgi?ind_id=170">
<hasPop rdf:resource="http://www.ncbi.nlm.nih.gov/projects/SNP/snp_viewTable.cgi?type=pop&amp;pop_id=1409"/>
<sex>F</sex>
<name>NA07000</name>
</Individual>
<Individual rdf:about="http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ind.cgi?ind_id=621">
<hasPop rdf:resource="http://www.ncbi.nlm.nih.gov/projects/SNP/snp_viewTable.cgi?type=pop&amp;pop_id=1409"/>
<sex>F</sex>
<name>NA12875</name>
</Individual>
<Individual rdf:about="http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ind.cgi?ind_id=538">
<hasPop rdf:resource="http://www.ncbi.nlm.nih.gov/projects/SNP/snp_viewTable.cgi?type=pop&amp;pop_id=1409"/>
<sex>F</sex>
<name>NA12753</name>
(...)
<SNP rdf:about="http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=307347">
<het rdf:datatype="http://www.w3.org/2001/XMLSchema#float">0.1</het>
<name>rs307347</name>
<seq5>GGGGATGGCTGCTCCTGGGCCTCAGAAAGATGCAGTCCCATAGACTTCCAGCACGCCCCTCCCCTCCTCGGGCCTTAATTTTGTCCACTGAGAAGATGGTCTCTGAGGCTCTGGGGTTTCCTTCTTGGTCACCAGATATTCTGCGGGCCTTGCCTTCCTGCCCAGATTCGAGCCAGTGGCAAACAGAAGCTGCCAGGAGC</seq5>
<observed>C/T</observed>
<seq3>TCTCAGAGCTGTGGCTGGTGGCTCGGTAACAACAGGAAGGGCAGTGGCTGTGCAGGAGGCAGGCAGCTTGCCAGCCCAGGAAGGTGACCCAGGACACCTCCAGGCCTTTCCCAGGGCAGCCCAACGGCCCAAGGTCAGGGCCGGGCGCGAGGGCGGCCTGAGCACAGAGCACGGGGGCTGACAGCAGGCTGGGGGGCCAG</seq3>
</SNP>
<MapLoc>
<hasSNP rdf:resource="http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=307347"/>
<strand>+</strand>
<chrom>1</chrom>
<start rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">1320381</start>
<assembly rdf:resource="urn:assembly:Celera:36_3"/>
<type>exact</type>
</MapLoc>
(...)
<Genotype>
<hasIndi rdf:resource="http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ind.cgi?ind_id=465"/>
<hasSNP rdf:resource="http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=940550"/>
<allele1>T</allele1>
<allele2>T</allele2>
</Genotype>
<Genotype>
<hasIndi rdf:resource="http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ind.cgi?ind_id=253"/>
<hasSNP rdf:resource="http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=940550"/>
<allele1>T</allele1>
<allele2>T</allele2>
</Genotype>
</rdf:RDF>
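
As a side note, this kind of XSLT transformation can also be run from Java with the JDK's built-in XSLT 1.0 engine (javax.xml.transform) instead of xsltproc. A minimal sketch, using a toy stylesheet rather than the real genoexch2rdf.xsl:

```java
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import java.io.StringReader;
import java.io.StringWriter;

public class XsltDemo {
    // apply an XSLT 1.0 stylesheet to an XML document, both given as strings
    public static String transform(String xml, String xsl) throws Exception {
        Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xsl)));
        StringWriter out = new StringWriter();
        t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // toy stylesheet: copy each Individual/@name into an <indi> element
        String xsl =
            "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>" +
            "<xsl:output method='xml' omit-xml-declaration='yes'/>" +
            "<xsl:template match='/'><r><xsl:for-each select='//Individual'>" +
            "<indi><xsl:value-of select='@name'/></indi>" +
            "</xsl:for-each></r></xsl:template>" +
            "</xsl:stylesheet>";
        String xml = "<pop><Individual name='NA07000'/><Individual name='NA12875'/></pop>";
        System.out.println(transform(xml, xsl));
    }
}
```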

Invoking ARQ

export ARQROOT=ARQ-2.5.0
ARQ-2.5.0/bin/arq --data ~/input.rdf --query ~/query01.rq

Dump All



Query

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX g: <http://ontology.lindenb.org/genotypes/>
SELECT ?s ?p ?o {?s ?p ?o.}

Result

| _:b0 | g:allele2 | "C" |
| _:b0 | g:allele1 | "C" |
| _:b0 | g:hasSNP | <http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=17160669> |
| _:b0 | g:hasIndi | <http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ind.cgi?ind_id=636> |
| _:b0 | rdf:type | g:Genotype |
| _:b1 | g:allele2 | "T" |
| _:b1 | g:allele1 | "C" |
| _:b1 | g:hasSNP | <http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=17160669> |
| _:b1 | g:hasIndi | <http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ind.cgi?ind_id=361> |
| _:b1 | rdf:type | g:Genotype |
| <http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ind.cgi?ind_id=174> | g:name | "NA07048" |
| <http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ind.cgi?ind_id=174> | g:sex | "M" |
| <http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ind.cgi?ind_id=174> | g:hasPop | <http://www.ncbi.nlm.nih.gov/projects/SNP/snp_viewTable.cgi?type=pop&pop_id=1409> |
| <http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ind.cgi?ind_id=174> | rdf:type | g:Individual |
| <http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ind.cgi?ind_id=566> | g:name | "NA12802" |
| <http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ind.cgi?ind_id=566> | g:sex | "F" |
| <http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ind.cgi?ind_id=566> | g:hasPop | <http://www.ncbi.nlm.nih.gov/projects/SNP/snp_viewTable.cgi?type=pop&pop_id=1409> |
| <http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ind.cgi?ind_id=566> | rdf:type | g:Individual |
| _:b2 | g:allele2 | "A" |
| _:b2 | g:allele1 | "A" |
| _:b2 | g:hasSNP | <http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=2765021> |
| _:b2 | g:hasIndi | <http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ind.cgi?ind_id=429> |
| _:b2 | rdf:type | g:Genotype |
| _:b3 | g:allele2 | "C" |
| _:b3 | g:allele1 | "C" |
| _:b3 | g:hasSNP | <http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=17160669> |
| _:b3 | g:hasIndi | <http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ind.cgi?ind_id=546> |
| _:b3 | rdf:type | g:Genotype |
| _:b4 | g:allele2 | "T" |
| _:b4 | g:allele1 | "C" |
| _:b4 | g:hasSNP | <http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=17160669> |
| _:b4 | g:hasIndi | <http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ind.cgi?ind_id=159> |
| _:b4 | rdf:type | g:Genotype |
| <http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ind.cgi?ind_id=621> | g:name | "NA12875" |
| <http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ind.cgi?ind_id=621> | g:sex | "F" |
(...)


Select the populations



Query

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX g: <http://ontology.lindenb.org/genotypes/>
SELECT ?pop
{
?s a g:Population .
?s g:handle ?pop .
}

Result

-----------------
| pop |
=================
| "CSHL-HAPMAP" |
-----------------


List six individuals for each population



Query

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX g: <http://ontology.lindenb.org/genotypes/>
SELECT ?pop ?indi_name ?good
{
?s a g:Population .
?s g:handle ?pop .
?s2 a g:Individual .
?s2 g:hasPop ?s .
?s2 g:sex ?good .
?s2 g:name ?indi_name
}
limit 6

Result

------------------------------------
| pop | indi_name | good |
====================================
| "CSHL-HAPMAP" | "NA10854" | "F" |
| "CSHL-HAPMAP" | "NA12264" | "M" |
| "CSHL-HAPMAP" | "NA11993" | "F" |
| "CSHL-HAPMAP" | "NA10830" | "M" |
| "CSHL-HAPMAP" | "NA12762" | "M" |
| "CSHL-HAPMAP" | "NA12155" | "M" |
------------------------------------


List the SNPs having a flanking sequence containing 'CACACA'



Query

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX g: <http://ontology.lindenb.org/genotypes/>
PREFIX fn: <http://www.w3.org/2005/xpath-functions#>

SELECT ?name ?seq5 ?observed ?seq3
WHERE
{
?s a g:SNP .
?s g:name ?name .
?s g:seq5 ?seq5 .
?s g:seq3 ?seq3 .
?s g:observed ?observed .

FILTER (
fn:contains(fn:upper-case(?seq5), "CACACA") ||
fn:contains(fn:upper-case(?seq3), "CACACA")
)
}

Result

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| name | seq5 | observed | seq3 |
=============================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================
| "rs17160669" | "GCCACCGCGCCTGGCCCACAAGCATAACTTTTATAAAAATAATTTACTTTTACAATTAAGCTTAGGAATCACACAGACTCAGGGCTGGCTCATGGCTTCC" | "C/T" | "GGCAAGTTAAACTCTGTACTTAGGCTCGGCGCGTATGAAATGGCTAATTCTAATCAGTGGTGCAATGAAGTAACTCCTCTAAAGAACTTATCGGGCCGGG" |
| "rs2765023" | "ACTTGTAAATTTAGTCAGCATACATAACTAACCAAAACTTCAATATATCTTGAGACCCCCTTGGGGGGCTGTCTCCATAAAAGTGACTTTCCCAGGAGAGTGACTGGATGTGATTGGCCAACACCGTCTTAGCCCGCAGGGGTTCCTGGCGCGGAAGCCTCACGTCCCTCCCCACAGCGAGTTTTCAGAATCCAAAGGCCGTAGGAGAAAGAAGGCTGGCGGTGTTTCCTCTTAGAGGGGAGAAACTCAGCCTGGGTAGGAGACCCAGCCCCACGCAGGGAAAACTGTGCTAACGCTTCC" | "A/G" | "ATGTGCGTGGCAGGTGCGGCGGCGGCGAATACGGTTTGTCCTCGAGCCTAACCCTGTCTGTGTTGGTGTCAGCAGTGGCCCCCCTACCACACACACAGGGTCCCTGGCGTCCCAAGACCACTCCTGGCAGCCCCGCCACTGGCTGCGCCTGGAAGCCGCGTCCTCAGGCCTCGCCTGGCATTTGCTGTCACAGAGGTTGCTTCCTTGGGTCCGTCCGTCCTCGCCCCTCCAGCCTGGGCGCCCCCCCACCCCTGTCTCATTCCCTCCACCACATGCAGCACAGTCCAGGAGGCTGGGGTC" |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------


Get 12 Heterozygous Genotypes



Query

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX g: <http://ontology.lindenb.org/genotypes/>
PREFIX fn: <http://www.w3.org/2005/xpath-functions#>

SELECT ?indi ?snp ?a1 ?a2
WHERE
{
?s a g:Genotype .
?s g:allele1 ?a1 .
?s g:allele2 ?a2 .
?s g:hasIndi ?s2 .
?s2 g:name ?indi .
?s g:hasSNP ?s3 .
?s3 g:name ?snp .
FILTER ( ?a1 != ?a2 )
}
LIMIT 12

Result

----------------------------------------
| indi | snp | a1 | a2 |
========================================
| "NA12056" | "rs17160669" | "C" | "T" |
| "NA12716" | "rs17160669" | "C" | "T" |
| "NA12761" | "rs17160669" | "C" | "T" |
| "NA10839" | "rs2765023" | "A" | "G" |
| "NA12813" | "rs2765023" | "A" | "G" |
| "NA12760" | "rs2765023" | "A" | "G" |
| "NA12865" | "rs17160669" | "C" | "T" |
| "NA07056" | "rs17160669" | "C" | "T" |
| "NA12146" | "rs2765023" | "A" | "G" |
| "NA10860" | "rs2765023" | "A" | "G" |
| "NA10839" | "rs17160669" | "C" | "T" |
| "NA12812" | "rs17160669" | "C" | "T" |
----------------------------------------


List 12 SNPs on chr1 between 100000 and 500000 on the reference assembly, ordered by chrom/position



Query

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX g: <http://ontology.lindenb.org/genotypes/>
PREFIX fn: <http://www.w3.org/2005/xpath-functions#>

SELECT ?snp ?chrom ?orient ?start
WHERE
{
?s a g:SNP .
?s g:name ?snp .
?s2 a g:MapLoc .
?s2 g:hasSNP ?s .
?s2 g:chrom ?chrom .
?s2 g:chrom "1" .
?s2 g:strand ?orient .
?s2 g:start ?start .
?s2 g:assembly <urn:assembly:reference:36_3> .
FILTER ( ?start > 100000 && ?start< 500000)
}
ORDER BY ?chrom ?start
LIMIT 12

Result

------------------------------------------
| snp | chrom | orient | start |
==========================================
| "rs17009015" | "1" | "-" | 121810 |
| "rs11490937" | "1" | "+" | 222076 |
| "rs12041624" | "1" | "+" | 232164 |
| "rs11514575" | "1" | "-" | 235726 |
| "rs4731490" | "1" | "+" | 311783 |
| "rs4006867" | "1" | "+" | 325493 |
| "rs7462951" | "1" | "-" | 360984 |
| "rs4030300" | "1" | "+" | 392471 |
| "rs4030303" | "1" | "+" | 392552 |
| "rs9661032" | "1" | "-" | 396549 |
| "rs3872250" | "1" | "-" | 400742 |
| "rs3907361" | "1" | "-" | 412985 |
------------------------------------------


List the positions of 10 SNPs on the reference assembly and chr1, print the heterozygosity if it exists and is greater than 0.1



Query

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX g: <http://ontology.lindenb.org/genotypes/>

SELECT ?snp ?chrom ?orient ?start ?het
WHERE
{
?s a g:SNP .
?s g:name ?snp .
?s2 a g:MapLoc .
?s2 g:hasSNP ?s .
?s2 g:chrom ?chrom .
?s2 g:chrom "1" .
?s2 g:strand ?orient .
?s2 g:start ?start .
?s2 g:assembly <urn:assembly:reference:36_3> .
OPTIONAL { ?s g:het ?het . FILTER ( ?het > 0.1 ) }
}
LIMIT 10

Result

------------------------------------------------------------------------------------------------
| snp | chrom | orient | start | het |
================================================================================================
| "rs7417504" | "1" | "+" | 555799 | |
| "rs10018120" | "1" | "-" | 241387750 | "0.48"^^<http://www.w3.org/2001/XMLSchema#float> |
| "rs12043546" | "1" | "+" | 224043895 | |
| "rs4023296" | "1" | "-" | 141776514 | |
| "rs1320571" | "1" | "+" | 1110293 | "0.31"^^<http://www.w3.org/2001/XMLSchema#float> |
| "rs1359759" | "1" | "+" | 115826181 | "0.49"^^<http://www.w3.org/2001/XMLSchema#float> |
| "rs7553429" | "1" | "+" | 1080419 | "0.19"^^<http://www.w3.org/2001/XMLSchema#float> |
| "rs4245756" | "1" | "+" | 789325 | |
| "rs3766177" | "1" | "-" | 1471210 | "0.5"^^<http://www.w3.org/2001/XMLSchema#float> |
| "rs9442372" | "1" | "+" | 1008566 | "0.46"^^<http://www.w3.org/2001/XMLSchema#float> |
------------------------------------------------------------------------------------------------


Print 10 differences between the Reference Assembly and the Celera Assembly



Query

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX g: <http://ontology.lindenb.org/genotypes/>

SELECT ?snp ?chrom1 ?orient1 ?start1 ?chrom2 ?orient2 ?start2
WHERE
{
?s a g:SNP .
?s g:name ?snp .

?s2 a g:MapLoc .
?s2 g:hasSNP ?s .
?s2 g:chrom ?chrom1 .
?s2 g:strand ?orient1 .
?s2 g:start ?start1 .
?s2 g:assembly <urn:assembly:Celera:36_3> .

?s3 a g:MapLoc .
?s3 g:hasSNP ?s .
?s3 g:chrom ?chrom2 .
?s3 g:strand ?orient2 .
?s3 g:start ?start2 .
?s3 g:assembly <urn:assembly:reference:36_3> .

}
LIMIT 10

Result

-----------------------------------------------------------------------------
| snp | chrom1 | orient1 | start1 | chrom2 | orient2 | start2 |
=============================================================================
| "rs7553640" | "1" | "-" | 833104 | "1" | "+" | 1751873 |
| "rs3951936" | "9" | "-" | 41330304 | "4" | "+" | 49186295 |
| "rs3951936" | "9" | "-" | 41330304 | "1" | "-" | 142233119 |
| "rs3951936" | "9" | "-" | 41330304 | "1" | "+" | 142038296 |
| "rs3951936" | "9" | "-" | 41330304 | "1" | "-" | 141781399 |
| "rs3951936" | "9" | "-" | 41330304 | "1" | "+" | 141641811 |
| "rs41319344" | "Y" | "+" | 10690990 | "Y" | "-" | 25853159 |
| "rs41319344" | "Y" | "+" | 10690990 | "Y" | "+" | 24928047 |
| "rs41319344" | "Y" | "+" | 10690990 | "1" | "-" | 241194834 |
| "rs10907183" | "1" | "-" | 1511375 | "1" | "+" | 1060980 |
-----------------------------------------------------------------------------


Create a new RDF graph of 10 SNPs having a neighbour at a distance of less than 500 bp



Query

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX g: <http://ontology.lindenb.org/genotypes/>
PREFIX fn: <http://www.w3.org/2005/xpath-functions#>

CONSTRUCT { ?snp1 g:hasNeighbour ?snp2 . }
WHERE
{
?snp1 a g:SNP .
?snp2 a g:SNP .

?s1 a g:MapLoc .
?s1 g:hasSNP ?snp1 .
?s1 g:chrom ?chrom1 .
?s1 g:strand ?orient1 .
?s1 g:start ?start1 .
?s1 g:assembly <urn:assembly:reference:36_3> .

?s2 a g:MapLoc .
?s2 g:hasSNP ?snp2 .
?s2 g:chrom ?chrom2 .
?s2 g:strand ?orient2 .
?s2 g:start ?start2 .
?s2 g:assembly <urn:assembly:reference:36_3> .

FILTER( (fn:abs(?start1 - ?start2) < 500) && ?chrom1=?chrom2 && ?snp1!=?snp2)

}
LIMIT 10

Result

@prefix : <http://ontology.lindenb.org/genotypes/> .
@prefix g: <http://ontology.lindenb.org/genotypes/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix fn: <http://www.w3.org/2005/xpath-functions#> .

<http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=7545812>
:hasNeighbour <http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=9970455> .

<http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=1043506>
:hasNeighbour <http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=12126411> .

<http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=6603793>
:hasNeighbour <http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=7548693> ;
:hasNeighbour <http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=7553066> .

<http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=10907178>
:hasNeighbour <http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=10907177> ;
:hasNeighbour <http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=11260588> ;
:hasNeighbour <http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=11260587> ;
:hasNeighbour <http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=6701114> ;
:hasNeighbour <http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=3737728> ;
:hasNeighbour <http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=9442398> .
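
The FILTER of this CONSTRUCT query is just a pairwise distance test over the mapped positions. A minimal sketch of the same computation in plain Java (hypothetical classes, not how ARQ actually evaluates the query):

```java
import java.util.ArrayList;
import java.util.List;

public class Neighbours {
    // minimal SNP mapping: rs name, chromosome and start position
    public record MapLoc(String rs, String chrom, int start) {}

    // emit (snp1, snp2) pairs on the same chromosome closer than maxDist bp,
    // mirroring the FILTER of the CONSTRUCT query (both orientations of a
    // pair are produced, as in the SPARQL cross-product)
    public static List<String[]> neighbours(List<MapLoc> locs, int maxDist) {
        List<String[]> pairs = new ArrayList<>();
        for (MapLoc a : locs)
            for (MapLoc b : locs)
                if (!a.rs().equals(b.rs())
                        && a.chrom().equals(b.chrom())
                        && Math.abs(a.start() - b.start()) < maxDist)
                    pairs.add(new String[]{a.rs(), b.rs()});
        return pairs;
    }
}
```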



That's it !
Pierre