Apache Pig: first contact with some 'bio' data.
via wikipedia: "Apache Pig is a high-level platform for creating MapReduce programs used with Hadoop. The language for this platform is called Pig Latin. Pig Latin abstracts the programming from the Java MapReduce idiom into a notation which makes MapReduce programming high level, similar to that of SQL for RDBMS systems.".
here I've played with PIG using a local datastore (that is, different from a distributed environment) to handle some data from the UCSC database.
Download and Install
Install Pig:$ wget "http://mirror.cc.columbia.edu/pub/software/apache/pig/pig-0.10.0/pig-0.10.0.tar.gz" $ tar xvfz pig-0.10.0.tar.gz $ rm pig-0.10.0.tar.gz $ cd pig-0.10.0 $ export PIG_INSTALL=${PWD} $ export JAVA_HOME=/your/path/to/jdk1.7
Download 'knownGene' from the UCSC:
$ wget "http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/knownGene.txt.gz" $ gunzip knownGene.txt.gz
Start the command line interface
run Pig’s Grunt shell in local mode:$ pig -x local 2012-08-15 10:37:25,247 [main] INFO org.apache.pig.Main - Apache Pig version 0.10.0 (r1328203) compiled Apr 19 2012, 22:54:12 2012-08-15 10:37:25,248 [main] INFO org.apache.pig.Main - Logging error messages to: /home/lindenb/tmp/HADOOP/pig-0.10.0/pig_1344933445243.log 2012-08-15 10:37:25,622 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: file:///
Getting the number of Genes by Chromosome
knownGenes = LOAD '/home/lindenb/tmp/HADOOP/knownGene.txt' as ( name:chararray , chrom:chararray , strand:chararray , txStart:int , txEnd:int , cdsStart:int , cdsEnd:int , exonCount:chararray , exonStarts:chararray , exonEnds:chararray , proteinID:chararray , alignID:chararray );keep the genes coding for a protein:
coding = FILTER knownGenes BY cdsStart < cdsEnd ;Remove some columns:
coding = FOREACH coding GENERATE chrom,cdsStart,cdsEnd,strand,name;Group by chromosome:
C = GROUP coding by chrom;Filter out some chromosomes:
C = FILTER C by NOT( group matches '.*_random' OR group matches 'chrUn_.*' OR group matches '.*hap.*' );Get the count of genes for each chromosome:
D= FOREACH C GENERATE group as CHROM, COUNT(coding.name) as numberOfGenes ;And dump the data:
dump D;Result:
(chr1,6139) (chr2,3869) (chr3,3406) (chr4,2166) (chr5,2515) (chr6,3021) (chr7,2825) (chr8,1936) (chr9,2338) (chrM,1) (chrX,2374) (chrY,318) (chr10,2470) (chr11,3579) (chr12,3066) (chr13,924) (chr14,2009) (chr15,1819) (chr16,2510) (chr17,3396) (chr18,862) (chr19,3784) (chr20,1570) (chr21,663) (chr22,1348)Interestingly, noting seems to happen until your ask to dump the data.
Finding the overlapping genes
I want to get a list of pairs of genes overlapping and having an opposite strand. I've not been able to find a quick way to join two tables using a complex criteria.Create a two identical lists of genes E1 and E2 . Add an extra column "1" that will be used to join both tables.E1 = FOREACH coding GENERATE 1 as pivot , $0 , $1 , $2 , $3, $4; E2 = FOREACH coding GENERATE 1 as pivot , $0 , $1 , $2 , $3, $4;Join the tables using the extra column.
E3 = join E1 by pivot, E2 by pivot;Extract and rename the fields from the join:
E3 = FOREACH E3 generate $1 as chrom1, $2 as start1, $3 as end1, $4 as strand1, $5 as name1, $7 as chrom2, $8 as start2, $9 as end2, $10 as strand2, $11 as name2 ;At this point, the data in E3 look like this:
(...) (chr1,664484,665108,-,uc009vjm.3,chr1,324342,325605,+,uc001aau.3) (chr1,664484,665108,-,uc009vjm.3,chr1,664484,665108,-,uc001abe.4) (chr1,664484,665108,-,uc009vjm.3,chr1,324342,325605,+,uc009vjk.2) (chr1,664484,665108,-,uc009vjm.3,chr1,664484,665108,-,uc009vjm.3) (chr1,664484,665108,-,uc009vjm.3,chr1,12189,13639,+,uc010nxq.1) (chr1,664484,665108,-,uc009vjm.3,chr1,367658,368597,+,uc010nxu.2) (chr1,664484,665108,-,uc009vjm.3,chr1,621095,622034,-,uc010nxv.2) (chr1,664484,665108,-,uc009vjm.3,chr1,324514,325605,+,uc021oeh.1) (chr1,664484,665108,-,uc009vjm.3,chr1,327745,328213,+,uc021oei.1) (...)Extract the overlapping genes:
E3= FILTER E3 BY name1 < name2 AND chrom1==chrom2 AND strand1!=strand2 AND NOT(end1 < start2 OR end2 < start1);and dump the result:
dump E3After a few hours the result is computed:
(...) (chr9,119188129,120177216,-,uc004bjt.2,chr9,119460021,119461983,+,uc004bjw.2) (chr9,119188129,120177216,-,uc004bjt.2,chr9,119460021,119461983,+,uc004bjx.2) (chr9,119188129,120177216,-,uc004bjt.2,chr9,119460021,119461983,+,uc022bmo.1) (chr9,119460021,119461983,+,uc004bjw.2,chr9,119188129,119903719,-,uc022bml.1) (chr9,119460021,119461983,+,uc004bjw.2,chr9,119188129,119903719,-,uc022bmm.1) (chr9,119460021,119461983,+,uc004bjx.2,chr9,119188129,119903719,-,uc022bml.1) (chr9,119460021,119461983,+,uc004bjx.2,chr9,119188129,119903719,-,uc022bmm.1) (chr9,129724568,129981048,+,uc004bqo.2,chr9,129851217,129871010,-,uc004bqr.1) (chr9,129724568,129981048,+,uc004bqo.2,chr9,129851217,129856116,-,uc010mxg.1) (chr9,129724568,129979280,+,uc004bqq.4,chr9,129851217,129871010,-,uc004bqr.1) (chr9,129724568,129979280,+,uc004bqq.4,chr9,129851217,129856116,-,uc010mxg.1) (chr9,129851217,129871010,-,uc004bqr.1,chr9,129724568,129940183,+,uc022bno.1) (chr9,129851217,129871010,-,uc004bqr.1,chr9,129724568,129946390,+,uc011mab.2) (chr9,129851217,129871010,-,uc004bqr.1,chr9,129724568,129979280,+,uc011mac.2) (chr9,129851217,129856116,-,uc010mxg.1,chr9,129724568,129940183,+,uc022bno.1) (chr9,129851217,129856116,-,uc010mxg.1,chr9,129724568,129946390,+,uc011mab.2) (chr9,129851217,129856116,-,uc010mxg.1,chr9,129724568,129979280,+,uc011mac.2) (chr9,130455527,130477918,-,uc004brm.3,chr9,130469310,130476184,+,uc004brn.1) (chr9,131703812,131719311,+,uc004bwq.1,chr9,131707965,131709582,-,uc004bwr.3) (chrX,11156982,11445715,-,uc004cun.1,chrX,11312908,11318732,+,uc004cus.3) (chrX,11156982,11445715,-,uc004cun.1,chrX,11312908,11318732,+,uc004cut.3) (chrX,11156982,11445715,-,uc004cun.1,chrX,11312908,11318732,+,uc004cuu.3) (chrX,11156982,11682948,-,uc004cup.1,chrX,11312908,11318732,+,uc004cus.3) (chrX,11156982,11682948,-,uc004cup.1,chrX,11312908,11318732,+,uc004cut.3) (chrX,11156982,11682948,-,uc004cup.1,chrX,11312908,11318732,+,uc004cuu.3) (...)That was very slow. There might be a better way to do this and I wonder if using a hadoop filesystem would really speed the computation. At this point I'll continue to use a SQL database for such small amount of data.
That's it.
Pierre
5 comments:
Hi Pierre,
How many nodes did you use to run your job?
@alain: I played with a local instance of Pig, I don't think the number of nodes matters here. (Isn't it ?)
The aim with Pig is to perform operations in a hadoop cluster. Although you can run it locally, performance will/should dramatically increase when you use it with a hadoop cluster and the performance gain will be related to the amount of nodes you have in your cluster.
You can create a hadoop cluster in EC2 and give it a spin. We're testing this at InSilico DB and are getting pretty good results.
Hi Pierre,
if you are interested in Pig processing aligned sequence data stored in SAM/BAM format or un-aligned fastq/qseq, have a look at SeqPig.
Post a Comment