Showing posts with label pig. Show all posts
Showing posts with label pig. Show all posts

14 August 2012

Apache Pig: first contact with some 'bio' data.

via wikipedia: "Apache Pig is a high-level platform for creating MapReduce programs used with Hadoop. The language for this platform is called Pig Latin. Pig Latin abstracts the programming from the Java MapReduce idiom into a notation which makes MapReduce programming high level, similar to that of SQL for RDBMS systems.".

here I've played with PIG using a local datastore (that is, different from a distributed environment) to handle some data from the UCSC database.

Download and Install

Install Pig:
$ wget "http://mirror.cc.columbia.edu/pub/software/apache/pig/pig-0.10.0/pig-0.10.0.tar.gz"
$ tar xvfz pig-0.10.0.tar.gz
$ rm pig-0.10.0.tar.gz
$ cd pig-0.10.0
$ export PIG_INSTALL=${PWD}
$ export JAVA_HOME=/your/path/to/jdk1.7

Download 'knownGene' from the UCSC:
$ wget "http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/knownGene.txt.gz"
$ gunzip knownGene.txt.gz

Start the command line interface

run Pig’s Grunt shell in local mode:
$ pig -x local 
2012-08-15 10:37:25,247 [main] INFO  org.apache.pig.Main - Apache Pig version 0.10.0 (r1328203) compiled Apr 19 2012, 22:54:12
2012-08-15 10:37:25,248 [main] INFO  org.apache.pig.Main - Logging error messages to: /home/lindenb/tmp/HADOOP/pig-0.10.0/pig_1344933445243.log
2012-08-15 10:37:25,622 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: file:///

Getting the number of Genes by Chromosome

knownGenes = LOAD '/home/lindenb/tmp/HADOOP/knownGene.txt' as 
  ( name:chararray ,  chrom:chararray ,  strand:chararray ,  txStart:int , 
    txEnd:int ,  cdsStart:int ,  cdsEnd:int ,  exonCount:chararray ,
     exonStarts:chararray ,  exonEnds:chararray ,  proteinID:chararray , 
     alignID:chararray );
keep the genes coding for a protein:
coding = FILTER knownGenes BY cdsStart < cdsEnd ;
Remove some columns:
coding = FOREACH coding GENERATE chrom,cdsStart,cdsEnd,strand,name;
Group by chromosome:
C = GROUP coding by chrom;
Filter out some chromosomes:
C = FILTER C by NOT(
  group matches  '.*_random' OR
  group matches  'chrUn_.*' OR
  group matches '.*hap.*'
  );
Get the count of genes for each chromosome:
D= FOREACH C GENERATE
 group as CHROM,
 COUNT(coding.name) as numberOfGenes
 ;
And dump the data:
dump D;
Result:
(chr1,6139)
(chr2,3869)
(chr3,3406)
(chr4,2166)
(chr5,2515)
(chr6,3021)
(chr7,2825)
(chr8,1936)
(chr9,2338)
(chrM,1)
(chrX,2374)
(chrY,318)
(chr10,2470)
(chr11,3579)
(chr12,3066)
(chr13,924)
(chr14,2009)
(chr15,1819)
(chr16,2510)
(chr17,3396)
(chr18,862)
(chr19,3784)
(chr20,1570)
(chr21,663)
(chr22,1348)
Interestingly, noting seems to happen until your ask to dump the data.

Finding the overlapping genes

I want to get a list of pairs of genes overlapping and having an opposite strand. I've not been able to find a quick way to join two tables using a complex criteria.

Create a two identical lists of genes E1 and E2 . Add an extra column "1" that will be used to join both tables.

E1 = FOREACH coding GENERATE 1 as pivot , $0 , $1 , $2 , $3, $4;
E2 = FOREACH coding GENERATE 1 as pivot , $0 , $1 , $2 , $3, $4;
Join the tables using the extra column.
E3 = join E1 by pivot, E2 by pivot;
Extract and rename the fields from the join:
E3 = FOREACH E3 generate 
 $1 as chrom1, $2 as start1, $3 as end1, $4 as strand1, $5 as name1,
 $7 as chrom2, $8 as start2, $9 as end2, $10 as strand2, $11 as name2
 ;
At this point, the data in E3 look like this:
(...)
(chr1,664484,665108,-,uc009vjm.3,chr1,324342,325605,+,uc001aau.3)
(chr1,664484,665108,-,uc009vjm.3,chr1,664484,665108,-,uc001abe.4)
(chr1,664484,665108,-,uc009vjm.3,chr1,324342,325605,+,uc009vjk.2)
(chr1,664484,665108,-,uc009vjm.3,chr1,664484,665108,-,uc009vjm.3)
(chr1,664484,665108,-,uc009vjm.3,chr1,12189,13639,+,uc010nxq.1)
(chr1,664484,665108,-,uc009vjm.3,chr1,367658,368597,+,uc010nxu.2)
(chr1,664484,665108,-,uc009vjm.3,chr1,621095,622034,-,uc010nxv.2)
(chr1,664484,665108,-,uc009vjm.3,chr1,324514,325605,+,uc021oeh.1)
(chr1,664484,665108,-,uc009vjm.3,chr1,327745,328213,+,uc021oei.1)
(...)
Extract the overlapping genes:
E3= FILTER E3 BY
    name1 < name2 AND
    chrom1==chrom2 AND
    strand1!=strand2 AND
    NOT(end1 < start2 OR end2 < start1);
and dump the result:
dump E3
After a few hours the result is computed:
(...)
(chr9,119188129,120177216,-,uc004bjt.2,chr9,119460021,119461983,+,uc004bjw.2)
(chr9,119188129,120177216,-,uc004bjt.2,chr9,119460021,119461983,+,uc004bjx.2)
(chr9,119188129,120177216,-,uc004bjt.2,chr9,119460021,119461983,+,uc022bmo.1)
(chr9,119460021,119461983,+,uc004bjw.2,chr9,119188129,119903719,-,uc022bml.1)
(chr9,119460021,119461983,+,uc004bjw.2,chr9,119188129,119903719,-,uc022bmm.1)
(chr9,119460021,119461983,+,uc004bjx.2,chr9,119188129,119903719,-,uc022bml.1)
(chr9,119460021,119461983,+,uc004bjx.2,chr9,119188129,119903719,-,uc022bmm.1)
(chr9,129724568,129981048,+,uc004bqo.2,chr9,129851217,129871010,-,uc004bqr.1)
(chr9,129724568,129981048,+,uc004bqo.2,chr9,129851217,129856116,-,uc010mxg.1)
(chr9,129724568,129979280,+,uc004bqq.4,chr9,129851217,129871010,-,uc004bqr.1)
(chr9,129724568,129979280,+,uc004bqq.4,chr9,129851217,129856116,-,uc010mxg.1)
(chr9,129851217,129871010,-,uc004bqr.1,chr9,129724568,129940183,+,uc022bno.1)
(chr9,129851217,129871010,-,uc004bqr.1,chr9,129724568,129946390,+,uc011mab.2)
(chr9,129851217,129871010,-,uc004bqr.1,chr9,129724568,129979280,+,uc011mac.2)
(chr9,129851217,129856116,-,uc010mxg.1,chr9,129724568,129940183,+,uc022bno.1)
(chr9,129851217,129856116,-,uc010mxg.1,chr9,129724568,129946390,+,uc011mab.2)
(chr9,129851217,129856116,-,uc010mxg.1,chr9,129724568,129979280,+,uc011mac.2)
(chr9,130455527,130477918,-,uc004brm.3,chr9,130469310,130476184,+,uc004brn.1)
(chr9,131703812,131719311,+,uc004bwq.1,chr9,131707965,131709582,-,uc004bwr.3)
(chrX,11156982,11445715,-,uc004cun.1,chrX,11312908,11318732,+,uc004cus.3)
(chrX,11156982,11445715,-,uc004cun.1,chrX,11312908,11318732,+,uc004cut.3)
(chrX,11156982,11445715,-,uc004cun.1,chrX,11312908,11318732,+,uc004cuu.3)
(chrX,11156982,11682948,-,uc004cup.1,chrX,11312908,11318732,+,uc004cus.3)
(chrX,11156982,11682948,-,uc004cup.1,chrX,11312908,11318732,+,uc004cut.3)
(chrX,11156982,11682948,-,uc004cup.1,chrX,11312908,11318732,+,uc004cuu.3)
(...)
That was very slow. There might be a better way to do this and I wonder if using a hadoop filesystem would really speed the computation. At this point I'll continue to use a SQL database for such small amount of data.

That's it.

Pierre