14 August 2012

Apache Pig: first contact with some 'bio' data.

via wikipedia: "Apache Pig is a high-level platform for creating MapReduce programs used with Hadoop. The language for this platform is called Pig Latin. Pig Latin abstracts the programming from the Java MapReduce idiom into a notation which makes MapReduce programming high level, similar to that of SQL for RDBMS systems.".

here I've played with PIG using a local datastore (that is, different from a distributed environment) to handle some data from the UCSC database.

Download and Install

Install Pig:
$ wget "http://mirror.cc.columbia.edu/pub/software/apache/pig/pig-0.10.0/pig-0.10.0.tar.gz"
$ tar xvfz pig-0.10.0.tar.gz
$ rm pig-0.10.0.tar.gz
$ cd pig-0.10.0
$ export PIG_INSTALL=${PWD}
$ export JAVA_HOME=/your/path/to/jdk1.7

Download 'knownGene' from the UCSC:
$ wget "http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/knownGene.txt.gz"
$ gunzip knownGene.txt.gz

Start the command line interface

run Pig’s Grunt shell in local mode:
$ pig -x local 
2012-08-15 10:37:25,247 [main] INFO  org.apache.pig.Main - Apache Pig version 0.10.0 (r1328203) compiled Apr 19 2012, 22:54:12
2012-08-15 10:37:25,248 [main] INFO  org.apache.pig.Main - Logging error messages to: /home/lindenb/tmp/HADOOP/pig-0.10.0/pig_1344933445243.log
2012-08-15 10:37:25,622 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: file:///

Getting the number of Genes by Chromosome

knownGenes = LOAD '/home/lindenb/tmp/HADOOP/knownGene.txt' as 
  ( name:chararray ,  chrom:chararray ,  strand:chararray ,  txStart:int , 
    txEnd:int ,  cdsStart:int ,  cdsEnd:int ,  exonCount:chararray ,
     exonStarts:chararray ,  exonEnds:chararray ,  proteinID:chararray , 
     alignID:chararray );
keep the genes coding for a protein:
coding = FILTER knownGenes BY cdsStart < cdsEnd ;
Remove some columns:
coding = FOREACH coding GENERATE chrom,cdsStart,cdsEnd,strand,name;
Group by chromosome:
C = GROUP coding by chrom;
Filter out some chromosomes:
C = FILTER C by NOT(
  group matches  '.*_random' OR
  group matches  'chrUn_.*' OR
  group matches '.*hap.*'
  );
Get the count of genes for each chromosome:
D= FOREACH C GENERATE
 group as CHROM,
 COUNT(coding.name) as numberOfGenes
 ;
And dump the data:
dump D;
Result:
(chr1,6139)
(chr2,3869)
(chr3,3406)
(chr4,2166)
(chr5,2515)
(chr6,3021)
(chr7,2825)
(chr8,1936)
(chr9,2338)
(chrM,1)
(chrX,2374)
(chrY,318)
(chr10,2470)
(chr11,3579)
(chr12,3066)
(chr13,924)
(chr14,2009)
(chr15,1819)
(chr16,2510)
(chr17,3396)
(chr18,862)
(chr19,3784)
(chr20,1570)
(chr21,663)
(chr22,1348)
Interestingly, noting seems to happen until your ask to dump the data.

Finding the overlapping genes

I want to get a list of pairs of genes overlapping and having an opposite strand. I've not been able to find a quick way to join two tables using a complex criteria.

Create a two identical lists of genes E1 and E2 . Add an extra column "1" that will be used to join both tables.

E1 = FOREACH coding GENERATE 1 as pivot , $0 , $1 , $2 , $3, $4;
E2 = FOREACH coding GENERATE 1 as pivot , $0 , $1 , $2 , $3, $4;
Join the tables using the extra column.
E3 = join E1 by pivot, E2 by pivot;
Extract and rename the fields from the join:
E3 = FOREACH E3 generate 
 $1 as chrom1, $2 as start1, $3 as end1, $4 as strand1, $5 as name1,
 $7 as chrom2, $8 as start2, $9 as end2, $10 as strand2, $11 as name2
 ;
At this point, the data in E3 look like this:
(...)
(chr1,664484,665108,-,uc009vjm.3,chr1,324342,325605,+,uc001aau.3)
(chr1,664484,665108,-,uc009vjm.3,chr1,664484,665108,-,uc001abe.4)
(chr1,664484,665108,-,uc009vjm.3,chr1,324342,325605,+,uc009vjk.2)
(chr1,664484,665108,-,uc009vjm.3,chr1,664484,665108,-,uc009vjm.3)
(chr1,664484,665108,-,uc009vjm.3,chr1,12189,13639,+,uc010nxq.1)
(chr1,664484,665108,-,uc009vjm.3,chr1,367658,368597,+,uc010nxu.2)
(chr1,664484,665108,-,uc009vjm.3,chr1,621095,622034,-,uc010nxv.2)
(chr1,664484,665108,-,uc009vjm.3,chr1,324514,325605,+,uc021oeh.1)
(chr1,664484,665108,-,uc009vjm.3,chr1,327745,328213,+,uc021oei.1)
(...)
Extract the overlapping genes:
E3= FILTER E3 BY
    name1 < name2 AND
    chrom1==chrom2 AND
    strand1!=strand2 AND
    NOT(end1 < start2 OR end2 < start1);
and dump the result:
dump E3
After a few hours the result is computed:
(...)
(chr9,119188129,120177216,-,uc004bjt.2,chr9,119460021,119461983,+,uc004bjw.2)
(chr9,119188129,120177216,-,uc004bjt.2,chr9,119460021,119461983,+,uc004bjx.2)
(chr9,119188129,120177216,-,uc004bjt.2,chr9,119460021,119461983,+,uc022bmo.1)
(chr9,119460021,119461983,+,uc004bjw.2,chr9,119188129,119903719,-,uc022bml.1)
(chr9,119460021,119461983,+,uc004bjw.2,chr9,119188129,119903719,-,uc022bmm.1)
(chr9,119460021,119461983,+,uc004bjx.2,chr9,119188129,119903719,-,uc022bml.1)
(chr9,119460021,119461983,+,uc004bjx.2,chr9,119188129,119903719,-,uc022bmm.1)
(chr9,129724568,129981048,+,uc004bqo.2,chr9,129851217,129871010,-,uc004bqr.1)
(chr9,129724568,129981048,+,uc004bqo.2,chr9,129851217,129856116,-,uc010mxg.1)
(chr9,129724568,129979280,+,uc004bqq.4,chr9,129851217,129871010,-,uc004bqr.1)
(chr9,129724568,129979280,+,uc004bqq.4,chr9,129851217,129856116,-,uc010mxg.1)
(chr9,129851217,129871010,-,uc004bqr.1,chr9,129724568,129940183,+,uc022bno.1)
(chr9,129851217,129871010,-,uc004bqr.1,chr9,129724568,129946390,+,uc011mab.2)
(chr9,129851217,129871010,-,uc004bqr.1,chr9,129724568,129979280,+,uc011mac.2)
(chr9,129851217,129856116,-,uc010mxg.1,chr9,129724568,129940183,+,uc022bno.1)
(chr9,129851217,129856116,-,uc010mxg.1,chr9,129724568,129946390,+,uc011mab.2)
(chr9,129851217,129856116,-,uc010mxg.1,chr9,129724568,129979280,+,uc011mac.2)
(chr9,130455527,130477918,-,uc004brm.3,chr9,130469310,130476184,+,uc004brn.1)
(chr9,131703812,131719311,+,uc004bwq.1,chr9,131707965,131709582,-,uc004bwr.3)
(chrX,11156982,11445715,-,uc004cun.1,chrX,11312908,11318732,+,uc004cus.3)
(chrX,11156982,11445715,-,uc004cun.1,chrX,11312908,11318732,+,uc004cut.3)
(chrX,11156982,11445715,-,uc004cun.1,chrX,11312908,11318732,+,uc004cuu.3)
(chrX,11156982,11682948,-,uc004cup.1,chrX,11312908,11318732,+,uc004cus.3)
(chrX,11156982,11682948,-,uc004cup.1,chrX,11312908,11318732,+,uc004cut.3)
(chrX,11156982,11682948,-,uc004cup.1,chrX,11312908,11318732,+,uc004cuu.3)
(...)
That was very slow. There might be a better way to do this and I wonder if using a hadoop filesystem would really speed the computation. At this point I'll continue to use a SQL database for such small amount of data.

That's it.

Pierre




5 comments:

alain coletta said...

Hi Pierre,

How many nodes did you use to run your job?

Pierre Lindenbaum said...

@alain: I played with a local instance of Pig, I don't think the number of nodes matters here. (Isn't it ?)

alain coletta said...
This comment has been removed by the author.
alain coletta said...

The aim with Pig is to perform operations in a hadoop cluster. Although you can run it locally, performance will/should dramatically increase when you use it with a hadoop cluster and the performance gain will be related to the amount of nodes you have in your cluster.
You can create a hadoop cluster in EC2 and give it a spin. We're testing this at InSilico DB and are getting pretty good results.

Andre Schumacher said...

Hi Pierre,

if you are interested in Pig processing aligned sequence data stored in SAM/BAM format or un-aligned fastq/qseq, have a look at SeqPig.