28 June 2011

The java library for BigBed and BigWig: my notebook

Jim Robinson and his team at the Broad Institute (IGV) have recently released a java library for parsing the BigBed and BigWig formats. Here is my notebook for this API.

Download and compile the library

The sources are hosted at: http://bigwig.googlecode.com/.
$ svn checkout http://bigwig.googlecode.com/svn/trunk/ bigwig-read-only
$ cd bigwig-read-only
$ ant

Buildfile: build.xml

compile:
[mkdir] Created dir: /path/to/bigwig-read-only/build
[javac] Compiling 38 source files to /path/to/bigwig-read-only/build
[javac] Note: /path/to/bigwig-read-only/src/org/broad/igv/bbfile/BPTree.java uses unchecked or unsafe operations.
[javac] Note: Recompile with -Xlint:unchecked for details.

dist:
[mkdir] Created dir: /path/to/bigwig-read-only/dist
[jar] Building jar: /path/to/bigwig-read-only/dist/BigWig.jar

BUILD SUCCESSFUL
Total time: 3 seconds

Code

The following java code prints all the Bed or Wig data in a given genomic region.
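The original BigCat.java source was embedded in the post and is not reproduced here. As a rough sketch only (my own reconstruction, not the original code), such a program could be written with the library's BBFileReader, BedFeature and WigItem classes; the '-p region' and '-c' (contained-only) option handling below is an assumption based on the invocations shown further down:

import org.broad.igv.bbfile.*;

/** sketch of a BigCat-like program: dump BigBed/BigWig records, optionally for a region */
public class BigCat
    {
    public static void main(String[] args) throws Exception
        {
        String region = null;      // -p chrom:start-end (assumed option)
        boolean contained = false; // -c : only features fully contained in the region (assumed option)
        int optind = 0;
        while(optind < args.length && args[optind].startsWith("-"))
            {
            if(args[optind].equals("-p")) region = args[++optind];
            else if(args[optind].equals("-c")) contained = true;
            ++optind;
            }
        String chrom = null;
        int start = 0, end = Integer.MAX_VALUE;
        if(region != null)
            {
            int colon = region.lastIndexOf(':');
            int hyphen = region.lastIndexOf('-');
            chrom = region.substring(0, colon);
            start = Integer.parseInt(region.substring(colon + 1, hyphen));
            end = Integer.parseInt(region.substring(hyphen + 1));
            }
        BBFileReader reader = new BBFileReader(args[optind]);
        if(reader.isBigBedFile())
            {
            BigBedIterator iter = (chrom == null
                ? reader.getBigBedIterator()
                : reader.getBigBedIterator(chrom, start, chrom, end, contained));
            while(iter.hasNext())
                {
                BedFeature feat = iter.next();
                System.out.println(feat.getChromosome() + " " + feat.getStartBase() + " " + feat.getEndBase());
                }
            }
        else
            {
            BigWigIterator iter = (chrom == null
                ? reader.getBigWigIterator()
                : reader.getBigWigIterator(chrom, start, chrom, end, contained));
            while(iter.hasNext())
                {
                WigItem item = iter.next();
                System.out.println(item.getChromosome() + " " + item.getStartBase() + " " + item.getEndBase() + " " + item.getWigValue());
                }
            }
        }
    }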

Compile

$ javac -cp /path/to/bigwig-read-only/dist/BigWig.jar:. BigCat.java

Test

List the data in a BigBed file:
$ java -cp /path/to/bigwig-read-only/dist/BigWig.jar:/path/to/bigwig-read-only/lib/log4j-1.2.15.jar:. BigCat /path/to/bigwig-read-only/test/data/chr21.bb  | head
chr21 9434178 9434609
chr21 9434178 9434609
chr21 9508110 9508214
chr21 9516607 9516987
chr21 9903013 9903230
chr21 9903013 9903230
chr21 9905661 9906613
chr21 9907217 9907519
chr21 9907241 9907415
chr21 9907597 9908258

List the data in a BigBed file for the region 'chr21:9906000-9908000', allowing features that only overlap the region:
$ java -cp /path/to/bigwig-read-only/dist/BigWig.jar:/path/to/bigwig-read-only/lib/log4j-1.2.15.jar:. BigCat -p chr21:9906000-9908000 /path/to/bigwig-read-only/test/data/chr21.bb  | head
chr21 9905661 9906613
chr21 9907217 9907519
chr21 9907241 9907415
chr21 9907597 9908258

List the data in a BigBed file for the region 'chr21:9906000-9908000', keeping only features fully contained in the region (-c):
$ java -cp /path/to/bigwig-read-only/dist/BigWig.jar:/path/to/bigwig-read-only/lib/log4j-1.2.15.jar:. BigCat -p chr21:9906000-9908000 -c /path/to/bigwig-read-only/test/data/chr21.bb  | head
chr21 9907217 9907519
chr21 9907241 9907415

List the data in a BigWig file:
$ java -cp /path/to/bigwig-read-only/dist/BigWig.jar:/path/to/bigwig-read-only/lib/log4j-1.2.15.jar:. BigCat  /path/to/bigwig-read-only/test/data/wigVarStepExample.bw  | head
chr21 9411190 9411195 50.0
chr21 9411195 9411200 40.0
chr21 9411200 9411205 60.0
chr21 9411205 9411210 20.0
chr21 9411210 9411215 20.0
chr21 9411215 9411220 20.0
chr21 9411220 9411225 40.0
chr21 9411225 9411230 60.0
chr21 9411230 9411235 40.0
chr21 9411235 9411240 40.0

List the data in a BigWig file for the region 'chr21:9906000-9908000':
$ java -cp /path/to/bigwig-read-only/dist/BigWig.jar:/path/to/bigwig-read-only/lib/log4j-1.2.15.jar:. BigCat -p chr21:9906000-9908000 /path/to/bigwig-read-only/test/data/wigVarStepExample.bw  | head
chr21 9906000 9906005 20.0
chr21 9906005 9906010 60.0
chr21 9906010 9906015 60.0
chr21 9906015 9906020 60.0
chr21 9906020 9906025 80.0
chr21 9906025 9906030 60.0
chr21 9906030 9906035 40.0
chr21 9906035 9906040 80.0
chr21 9906040 9906045 80.0
chr21 9906045 9906050 80.0

That's it,

Pierre

Bioinformatician 2.0 (JeBiF Workshop 2011)

Here is the presentation I gave for the JeBiF Workshop 2011 (jobs and careers in bioinformatics).







Pierre Legrain (CEA) and Franck Molina (CNRS) talking about working in bioinformatics in industry and/or academia.

19 June 2011

A couchdb-like application using apache JETTY and BerkeleyDB. My notebook


This is my notebook for the Jetty API. Jetty provides a Web server and a javax.servlet container. It can be embedded in devices, tools, frameworks, application servers, and clusters. I find it easier for developing and testing a new web application, as it doesn't require deployment descriptors, configuration files, a separate web container, etc. the way apache tomcat does. As I will show below, a whole web application can be embedded in a single java class.
As an example, I wrote a couchdb-like web application named DivanDB ('divan' is the french word for couch), using Jetty and BerkeleyDB. The source file is available on github. (See also my previous post: CouchDB for Bioinformatics: Storing SNPs - My Notebook.)

The BerkeleyDB-based key/value datastore

The class BDBStorage wraps a BerkeleyDB environment and a BerkeleyDB database. The database stores both keys and values as Strings.
/** a berkeley-db String/String datastore */ 
private static class BDBStorage
{
/** bdb environment */
private Environment environment=null;
/** string/string database */
private Database database=null;

private BDBStorage()
{
}

/** open environment & database */
private void open(File dbHome) throws DatabaseException
{
(...)
}
/** close environment & database */
private void close()
{
(...)
}
}
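The open() and close() methods are elided above. A minimal sketch of what they might look like, assuming Berkeley DB Java Edition (com.sleepycat.je) and a database named "divandb" (the name is my assumption):

/** open environment & database (sketch, assuming Berkeley DB Java Edition) */
private void open(File dbHome) throws DatabaseException
    {
    EnvironmentConfig envCfg = new EnvironmentConfig();
    envCfg.setAllowCreate(true); // create the environment files if they don't exist
    this.environment = new Environment(dbHome, envCfg);
    DatabaseConfig dbCfg = new DatabaseConfig();
    dbCfg.setAllowCreate(true); // create the database if it doesn't exist
    this.database = this.environment.openDatabase(null, "divandb", dbCfg); // "divandb" is an assumed name
    }

/** close environment & database */
private void close()
    {
    try { if(this.database != null) this.database.close(); } catch(DatabaseException err) {}
    try { if(this.environment != null) this.environment.close(); } catch(DatabaseException err) {}
    this.database = null;
    this.environment = null;
    }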


The web context

A servlet context is created and the BDBStorage is attached to it as an attribute. We also register the DivanDB servlet and set the context path of the application.

DivanDB app=new DivanDB();
BDBStorage storage=new BDBStorage();
ServletContextHandler context = new ServletContextHandler();
context.setAttribute(STORAGE_ATTRIBUTE, storage);
context.addServlet(new ServletHolder(app),"/*");
context.setContextPath("/divandb");
context.setResourceBase(".");

The server

The embedded server listens on the specified port. We add the context to the server, start it, and let it serve the queries forever.
/* create a new server */
Server server = new Server(port);
/* context */
server.setHandler(context);
/* start server */
server.start();
/* loop forever */
server.join();


The Servlet

The DivanDB servlet handles the HTTP queries (PUT, GET, DELETE, ...) just like couchdb. It decodes the JSON data using the JSON API provided in the jetty distribution. See the code for more information.
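The servlet code itself is not reproduced in this post. As an illustration only (a sketch, not the actual DivanDB code), here is how a doPut method could decode the JSON body with the org.eclipse.jetty.util.ajax.JSON utility bundled with jetty and write back the {"ok","id"} answer; the storage.put(...) helper and the id-generation scheme are hypothetical placeholders:

/* sketch of a doPut method inside the DivanDB HttpServlet; needs java.util.*,
 * javax.servlet.*, javax.servlet.http.* and org.eclipse.jetty.util.ajax.JSON */
@Override
protected void doPut(HttpServletRequest req, HttpServletResponse resp)
    throws ServletException, IOException
    {
    BDBStorage storage = (BDBStorage)getServletContext().getAttribute(STORAGE_ATTRIBUTE);
    Object parsed = JSON.parse(req.getReader()); // a Map for one document, an Object[] for an array
    Object docs[] = (parsed instanceof Object[] ? (Object[])parsed : new Object[]{parsed});
    List<Object> ids = new ArrayList<Object>();
    for(Object o : docs)
        {
        @SuppressWarnings("unchecked")
        Map<String,Object> doc = (Map<String,Object>)o;
        Object id = doc.get("id");
        if(id == null) { id = "id" + Math.abs(doc.hashCode()); doc.put("id", id); } // generate an id (hypothetical scheme)
        ids.add(id);
        storage.put(String.valueOf(id), JSON.toString(doc)); // hypothetical BDBStorage helper
        }
    Map<String,Object> answer = new HashMap<String,Object>();
    answer.put("ok", "true");
    answer.put("id", ids.toArray());
    resp.setContentType("application/json");
    resp.getWriter().print(JSON.toString(answer));
    }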

Start the server

$ java -jar dist/divandb.jar
2011-06-19 18:12:16.823:INFO::jetty-8.0.0.M3
2011-06-19 18:12:16.896:INFO::Started SelectChannelConnector@0.0.0.0:8080 STARTING

Testing, using CURL

Add a single snp. Here the id is a concatenation of the chromosome number and the genomic position:
$ curl -X PUT -d '{"id":"010000011390","name":"rs75100478","avHet":0.5,"chrom":"chr1","position":11390}' http://localhost:8080/divandb/
{"ok":"true","id":["010000011390"]}
Add a single snp without an id (one will be generated):
$ curl -X PUT -d '{"name":"rs3877545","avHet":0.375,"chrom":"chr1","position":11507}' http://localhost:8080/divandb/
{"ok":"true","id":["id972755720"]}
Add an array of SNPs:
$ curl -X PUT -d '[{"id":"010000011390","name":"rs75100478","avHet":0.5,"chrom":"chr1","position":11390},{"id":"010000011507","name":"rs3877545","avHet":0.375,"chrom":"chr1","position":11507},{"id":"010000011863","name":"rs76933399","avHet":0.46875,"chrom":"chr1","position":11863},{"id":"010000011920","name":"rs78105014","avHet":0.5,"chrom":"chr1","position":11920},{"id":"010000011959","name":"rs71234145","avHet":0.5,"chrom":"chr1","position":11959},{"id":"010000012048","name":"rs7357889","avHet":0.277778,"chrom":"chr1","position":12048},{"id":"010000012073","name":"rs76972523","avHet":0.5,"chrom":"chr1","position":12073},{"id":"010000012120","name":"rs78253371","avHet":0.5,"chrom":"chr1","position":12120},{"id":"010000012148","name":"rs78371031","avHet":0.5,"chrom":"chr1","position":12148},{"id":"010000012158","name":"rs79313092","avHet":0.5,"chrom":"chr1","position":12158},{"id":"010000012168","name":"rs74400200","avHet":0.5,"chrom":"chr1","position":12168},{"id":"010000012169","name":"rs79903148","avHet":0.375,"chrom":"chr1","position":12169},{"id":"010000012213","name":"rs79272378","avHet":0.5,"chrom":"chr1","position":12213},{"id":"010000012264","name":"rs2475482","avHet":0.444444,"chrom":"chr1","position":12264},{"id":"010000012295","name":"rs76096671","avHet":0.375,"chrom":"chr1","position":12295},{"id":"010000012295","name":"rs79461972","avHet":0.375,"chrom":"chr1","position":12295},{"id":"010000012305","name":"rs71236994","avHet":0.5,"chrom":"chr1","position":12305},{"id":"010000012332","name":"rs2981846","avHet":0.375,"chrom":"chr1","position":12332},{"id":"010000012336","name":"rs76760687","avHet":0.375,"chrom":"chr1","position":12336},{"id":"010000012355","name":"rs76588012","avHet":0.444444,"chrom":"chr1","position":12355}]' http://localhost:8080/divandb/
{"ok":"true","id":["010000011390","010000011507","010000011863","010000011920","010000011959","010000012048","010000012073","010000012120","010000012148","010000012158","010000012168","010000012169","010000012213","010000012264","010000012295","010000012295","010000012305","010000012332","010000012336","010000012355"]
Deleting a SNP:
$ curl -X DELETE -d '{"id":"2065420809"}' http://localhost:8080/divandb/
{"ok":"true","id":["id972755720"]}
List all the SNPs:
$ curl http://localhost:8080/divandb/
[
{"position":11390,"id":"010000011390","_timestamp":"1308501619331","name":"rs75100478","avHet":0.5,"chrom":"chr1"},
{"position":11507,"id":"010000011507","_timestamp":"1308501619335","name":"rs3877545","avHet":0.375,"chrom":"chr1"},
{"position":11863,"id":"010000011863","_timestamp":"1308501619336","name":"rs76933399","avHet":0.46875,"chrom":"chr1"},
{"position":11920,"id":"010000011920","_timestamp":"1308501619336","name":"rs78105014","avHet":0.5,"chrom":"chr1"},
{"position":11959,"id":"010000011959","_timestamp":"1308501619337","name":"rs71234145","avHet":0.5,"chrom":"chr1"},
{"position":12048,"id":"010000012048","_timestamp":"1308501619337","name":"rs7357889","avHet":0.277778,"chrom":"chr1"},
{"position":12073,"id":"010000012073","_timestamp":"1308501619338","name":"rs76972523","avHet":0.5,"chrom":"chr1"},
{"position":12120,"id":"010000012120","_timestamp":"1308501619338","name":"rs78253371","avHet":0.5,"chrom":"chr1"},
{"position":12148,"id":"010000012148","_timestamp":"1308501619339","name":"rs78371031","avHet":0.5,"chrom":"chr1"},
{"position":12158,"id":"010000012158","_timestamp":"1308501619339","name":"rs79313092","avHet":0.5,"chrom":"chr1"},
{"position":12168,"id":"010000012168","_timestamp":"1308501619339","name":"rs74400200","avHet":0.5,"chrom":"chr1"},
{"position":12169,"id":"010000012169","_timestamp":"1308501619340","name":"rs79903148","avHet":0.375,"chrom":"chr1"},
{"position":12213,"id":"010000012213","_timestamp":"1308501619340","name":"rs79272378","avHet":0.5,"chrom":"chr1"},
{"position":12264,"id":"010000012264","_timestamp":"1308501619341","name":"rs2475482","avHet":0.444444,"chrom":"chr1"},
{"position":12295,"id":"010000012295","_timestamp":"1308501619342","name":"rs79461972","avHet":0.375,"chrom":"chr1"},
{"position":12305,"id":"010000012305","_timestamp":"1308501619344","name":"rs71236994","avHet":0.5,"chrom":"chr1"},
{"position":12332,"id":"010000012332","_timestamp":"1308501619348","name":"rs2981846","avHet":0.375,"chrom":"chr1"},
{"position":12336,"id":"010000012336","_timestamp":"1308501619348","name":"rs76760687","avHet":0.375,"chrom":"chr1"},
{"position":12355,"id":"010000012355","_timestamp":"1308501619349","name":"rs76588012","avHet":0.444444,"chrom":"chr1"}
]
List the SNPs on chr1 between 12000 and 12150:
$ curl "http://localhost:8080/divandb/?startkey=010000012000&endkey=010000012150"
[
{"position":12048,"id":"010000012048","_timestamp":"1308501619337","name":"rs7357889","avHet":0.277778,"chrom":"chr1"},
{"position":12073,"id":"010000012073","_timestamp":"1308501619338","name":"rs76972523","avHet":0.5,"chrom":"chr1"},
{"position":12120,"id":"010000012120","_timestamp":"1308501619338","name":"rs78253371","avHet":0.5,"chrom":"chr1"},
{"position":12148,"id":"010000012148","_timestamp":"1308501619339","name":"rs78371031","avHet":0.5,"chrom":"chr1"}
]

List 3 SNPs starting from index=2:
$ curl "http://localhost:8080/divandb/?start=2&limit=3"
[
{"position":11507,"id":"010000011507","_timestamp":"1308501619335","name":"rs3877545","avHet":0.375,"chrom":"chr1"},
{"position":11863,"id":"010000011863","_timestamp":"1308501619336","name":"rs76933399","avHet":0.46875,"chrom":"chr1"},
{"position":11920,"id":"010000011920","_timestamp":"1308501619336","name":"rs78105014","avHet":0.5,"chrom":"chr1"}
]

That's it


Pierre

15 June 2011

Sorting and Joining some Genotypes with Hadoop.



This post describes how I've used hadoop to merge two large files produced by our Affymetrix-Axiom Genotyping platform.
Image via OpenWetWare
(Note: working with the latest version of hadoop has been a headache: it contains some classes with the same name ('org.apache.hadoop.mapreduce.Reducer' and 'org.apache.hadoop.mapred.Reducer'), many classes have been deprecated (org.apache.hadoop.mapred.JobConf) but are still needed to run some specific algorithms (`jobConf.setInputFormat(CompositeInputFormat.class)`), etc.)

The Input Files


The annotation file

This file describes the genotyped markers (chrom, position, alleleA, alleleB, etc.).

(... header ...)
"Probe Set ID","dbSNP RS ID","Chromosome","Physical Position","Strand","ChrX pseudo-autosomal region 1","Cytoband","Flank","Allele A","Allele B","Associated Gene","Genetic Map","Microsatellite","Allele Frequencies","Heterozygous Allele Frequencies","Number of individuals/Number of chromosomes","In Hapmap","Strand Versus dbSNP","Probe Count","ChrX pseudo-autosomal region 2","Minor Allele","Minor Allele Frequency","OMIM"
"AX-11086612","rs10001348","4","29912308","+","0","p15.1","gtattcagttgaacacaaatcagtgcatgt[A/G]","A","G","ENST00000467087 // downstream // 2885305 // Hs.724550 // STIM2 // 57620 // stromal interaction molecule 2 /// ENST00000361762 // upstream // 809728 // Hs.479439 // PCDH7 // 5099 // protocadherin 7 /// NM_001169117 // downstream // 2885305 // Hs.724550 // STIM2 // 57620 // stromal interaction molecule 2 /// NM_032456 // upstream // 809728 // Hs.724529 // PCDH7 // 5099 // protocadherin 7","50.0229786923511 // D4S418 // D4S2408 // --- // --- /// 44.2128449948474 // D4S2397 // D4S2430 // ATA27C07 // GCT6F03 /// 42.4637111703432 // --- // --- // 226002 // 46437","D4S333 // upstream // 47955 /// D4S605 // downstream // 97312","0.872881356 // 0.127118644 // Caucasian /// 0.833333333 // 0.166666667 // Han Chinese /// 0.777777778 // 0.222222222 // Japanese /// 0.775 // 0.225 // Yoruban","0.254237288 // Caucasian /// 0.288888889 // Han Chinese /// 0.355555556 // Japanese /// 0.35 // Yoruban","60.0 // Caucasian /// 45.0 // Han Chinese /// 45.0 // Japanese /// 60.0 // Yoruban","YES","same","1","0","G // Caucasian /// G // Han Chinese /// G // Japanese /// G // Yoruban","0.127118644 // Caucasian /// 0.166666667 // Han Chinese /// 0.222222222 // Japanese /// 0.225 // Yoruban","---"
"AX-11086611","rs10001340","4","130341127","+","0","q28.2","[A/C]agggcattcatctcagcttactatttgggaaaaat","A","C","ENST00000281146 // downstream // 306645 // Hs.567679 // C4orf33 // 132321 // chromosome 4 open reading frame 33 /// ENST00000394248 // upstream // 3729342 // Hs.192859 // PCDH10 // 57575 // protocadherin 10 /// NM_173487 // downstream // 307285 // Hs.567679 // C4orf33 // 132321 // chromosome 4 open reading frame 33 /// NM_020815 // upstream // 3729342 // Hs.192859 // PCDH10 // 57575 // protocadherin 10","127.864057946266 // D4S1615 // D4S2365 // --- // --- /// 129.756132396152 // D4S2938 // D4S2365 // AFMA284WG5 // GATA10A12 /// 124.03426335901 // D4S2394 // --- // --- // 55218","D4S3198 // upstream // 331274 /// D4S2394 // downstream // 43310","0.815789474 // 0.184210526 // Caucasian /// 1.0 // 0.0 // Han Chinese /// 1.0 // 0.0 // Japanese /// 0.816666667 // 0.183333333 // Yoruban","0.368421053 // Caucasian /// 0.0 // Han Chinese /// 0.0 // Japanese /// 0.266666667 // Yoruban","60.0 // Caucasian /// 45.0 // Han Chinese /// 45.0 // Japanese /// 60.0 // Yoruban","YES","same","1","0","C // Caucasian /// C // Han Chinese /// C // Japanese /// C // Yoruban","0.184210526 // Caucasian /// 0.0 // Han Chinese /// 0.0 // Japanese /// 0.183333333 // Yoruban","---"
"AX-11086610","rs10001337","4","54351529","+","0","q12","atgaggagtagccacatgatctaagcacct[C/T]","T","C","ENST00000306888 // --- // 0 // Hs.518760 // LNX1 // 84708 // ligand of numb-protein X 1 /// ENST00000263925 // --- // 0 // Hs.518760 // LNX1 // 84708 // ligand of numb-protein X 1 /// NM_001126328 // intron // 0 // Hs.518760 // LNX1 // 84708 // ligand of numb-protein X 1 /// NM_032622 // intron // 0 // Hs.518760 // LNX1 // 84708 // ligand of numb-protein X 1","67.4182086016315 // D4S2971 // D4S1594 // --- // --- /// 62.2091955728879 // D4S2971 // UNKNOWN // AFMB312YG1 // GATA61B02 /// 61.6059658777947 // --- // GATA61B02 // 149925 // ---","D4S461 // upstream // 151923 /// D4S2583E // downstream // 24481","0.118644068 // 0.881355932 // Caucasian /// 0.111111111 // 0.888888889 // Han Chinese /// 0.122222222 // 0.877777778 // Japanese /// 0.025 // 0.975 // Yoruban","0.203389831 // Caucasian /// 0.133333333 // Han Chinese /// 0.244444444 // Japanese /// 0.05 // Yoruban","60.0 // Caucasian /// 45.0 // Han Chinese /// 45.0 // Japanese /// 60.0 // Yoruban","YES","same","1","0","T // Caucasian /// T // Han Chinese /// T // Japanese /// T // Yoruban","0.118644068 // Caucasian /// 0.111111111 // Han Chinese /// 0.122222222 // Japanese /// 0.025 // Yoruban","---"
(...)

Calls

This file is returned by the Axiom machine. Each number encodes a genotype (0=AA, 1=AB, 2=BB, -1=no call).
(..header...)
probeset_id A05.CEL A06.CEL A03.CEL A04.CEL A01.CEL A10.CEL A02.CEL A11.CEL A12.CEL A09.CEL A07.CEL A08.CEL B01.CEL B02.CEL B06.CEL B03.CEL B04.CEL B05.CEL B07.CEL B08.CEL B09.CEL B10.CEL C02.CEL B11.CEL B12.CEL C01.CEL C03.CEL C04.CEL C05.CEL C06.CEL C07.CEL C08.CEL C09.CEL C10.CEL C11.CEL C12.CEL D01.CEL D02.CEL D03.CEL D04.CEL D05.CEL D06.CEL D07.CEL D08.CEL D09.CEL D10.CEL D11.CEL D12.CEL E01.CEL E02.CEL E03.CEL E04.CEL E05.CEL E06.CEL E07.CEL E08.CEL E09.CEL E10.CEL E11.CEL E12.CEL F01.CEL F02.CEL F03.CEL F04.CEL F05.CEL F06.CEL F07.CEL F08.CEL H08.CEL H07.CEL G07.CEL G08.CEL H03.CEL H04.CEL H06.CEL H05.CEL G09.CEL G10.CEL H01.CEL H02.CEL G11.CEL G12.CEL G03.CEL G04.CEL H10.CEL H09.CEL F10.CEL F09.CEL G01.CEL G02.CEL F11.CEL F12.CEL G05.CEL G06.CEL H11.CEL H12.CEL
AX-11086525 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 1 0 1 0 0 1 1 2 1 1 0 0 0 0 1 0 0 1 0 0 1 1 0 1 0 1 1 2 0 1 0 0 1 2 0 2
AX-11086526 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 1 2 2 0 2 2 2 1 1 1 2 2 2 2 2 2 2 2 2 1 1 1 2 1 2 1 2 2 0 2 1 2 2 1 2 2 2 2 1 2 2 2 2 2 1 2 1 1 2 2 1 2 2 2 2 2 2 2 2 2 1 2 2 2
AX-11086527 2 0 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 1 2 2 1 1 0 2 1 2 1 2 2 1 1 1 1 2 2 1 0 2 1 2 1 2 2 2 1 2 1 1 1 1 2 2 2 2 2 1 1 0 2 0 1 0 1 0 1 2 1 1 0 1 1 0 2 1 1 1 0 1 1 1 2 2 1 1
(...)

Joined file

... and this is what I want to obtain at the end:
AX-11086525 3 165621955 rs10000341 T T T T T T T T T T T T T T T G T T T T T T T T T T T T T T T T T T T T T T T G T G T T T T T G T T T T T T T T T T T T T T T T T T T T T T T T T T T G T T T T T G T T T G T T T T T T T T T T T T T T T T T T T T T G T T T T T T T T T G T T T G T T T G T T T T T G T G G G T G T G T T T T T T T T T G T T T T T G T T T T T G T G T T T G T T T G T G G G T T T G T T T T T G G G T T G G
AX-11086526 3 5237152 rs10000142 C C T C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C T C C C C C C C C C C C C C C C C C C C C C C C T C C C C C T T C C C C C C T C T C T C C C C C C C C C C C C C C C C C C C T C T C T C C C T C C C T C C C C C T T C C T C C C C C T C C C C C C C C C T C C C C C C C C C C C T C C C T C T C C C C C T C C C C C C C C C C C C C C C C C C C T C C C C C C C
(...)

Setting up Hadoop and HDFS


export JAVA_HOME=/your/path/to/JDK
cd hadoop-0.20.203.0

Change the config
in "conf/core-site.xml":
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>

in "conf/hdfs-site.xml":
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>

in "conf/mapred-site.xml":
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
</property>
</configuration>


Set up ssh for passwordless login:
$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

Set the permissions for ssh:
$ chmod 700 ~/.ssh/
$ chmod 640 ~/.ssh/authorized_keys

If needed, add an alias in '/etc/hosts' and restart the network: "sudo /etc/rc.d/init.d/network restart".
Format HDFS:
$ bin/hadoop namenode -format
11/06/15 08:32:13 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = srv-clc-04.u915.irt.univ-nantes.prive3/127.0.0.1
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 0.20.203.0
STARTUP_MSG: build = http://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20-security-203 -r 1099333; compiled by 'oom' on Wed May 4 07:57:50 PDT 2011
************************************************************/
11/06/15 08:32:13 INFO util.GSet: VM type = 64-bit
11/06/15 08:32:13 INFO util.GSet: 2% max memory = 19.1675 MB
11/06/15 08:32:13 INFO util.GSet: capacity = 2^21 = 2097152 entries
11/06/15 08:32:13 INFO util.GSet: recommended=2097152, actual=2097152
11/06/15 08:32:13 INFO namenode.FSNamesystem: fsOwner=lindenb
11/06/15 08:32:13 INFO namenode.FSNamesystem: supergroup=supergroup
11/06/15 08:32:13 INFO namenode.FSNamesystem: isPermissionEnabled=true
11/06/15 08:32:13 INFO namenode.FSNamesystem: dfs.block.invalidate.limit=100
11/06/15 08:32:13 INFO namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)
11/06/15 08:32:13 INFO namenode.NameNode: Caching file names occuring more than 10 times
11/06/15 08:32:13 INFO common.Storage: Image file of size 113 saved in 0 seconds.
11/06/15 08:32:14 INFO common.Storage: Storage directory /home/lindenb/tmp/HADOOP/dfs/name has been successfully formatted.
11/06/15 08:32:14 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at srv-clc-04.u915.irt.univ-nantes.prive3/127.0.0.1

... and start HDFS:
$ ./bin/start-all.sh 
namenode running as process 1151. Stop it first.
localhost: starting datanode, logging to /home/lindenb/package/hadoop-0.20.203.0/bin/../logs/hadoop-lindenb-datanode-srv-clc-04.u915.irt.univ-nantes.prive3.out
localhost: secondarynamenode running as process 1490. Stop it first.
jobtracker running as process 1588. Stop it first.
localhost: starting tasktracker, logging to /home/lindenb/package/hadoop-0.20.203.0/bin/../logs/hadoop-lindenb-tasktracker-srv-clc-04.u915.irt.univ-nantes.prive3.out

Copy the Axiom data to HDFS:
$ bin/hadoop fs -mkdir myfolder
$ bin/hadoop fs -copyFromLocal ~/Axiom_GW_Hu_SNP.r2.na31.annot.csv myfolder/
$ bin/hadoop fs -copyFromLocal ~/AxiomGT1.calls.txt myfolder/

The Classes

The annotation file and the genotypes file will each be sorted, and we're going to merge both results. The following classes implement Writable so they can be (de)serialized from/to HDFS.

Genotypes

This class holds, for one marker, the genotype codes (0, 1, 2 or -1) of all the samples:
 static public class Genotypes implements Writable
{
private byte array[];
public Genotypes()
{
this(0);
}
public Genotypes(int n)
{
this.array=new byte[n];
}
@Override
public void readFields(DataInput in) throws IOException
{
int n=in.readInt();
this.array=new byte[n];
in.readFully(this.array);
}
@Override
public void write(DataOutput out) throws IOException
{
out.writeInt(this.array.length);
out.write(this.array);
}
}

Position

A position on the genome.
static public class Position implements WritableComparable<Position>
{
private String chrom="N/A";
private int position=-1;

public Position()
{
}
public Position(String chrom,int position)
{
this.chrom=chrom;
this.position=position;
}
public String getChrom()
{
return chrom;
}

public int getPosition()
{
return position;
}

@Override
public void readFields(DataInput in) throws IOException {
this.chrom=in.readUTF();
this.position=in.readInt();
}
@Override
public void write(DataOutput out) throws IOException {
out.writeUTF(chrom);
out.writeInt(position);
}
@Override
public int compareTo(Position o) {
int i=chrom.compareTo(o.chrom);
if(i!=0) return i;
return position - o.position;
}
@Override
public String toString() {
return chrom+":"+position;
}
}

Marker

This class contains the data available in the annotation file:
static public class Marker implements WritableComparable<Marker>
{
private Position position=new Position();
private String alleleA;
private String alleleB;
private String probeSetId;
private String rsid;
public Marker()
{
}

public Position getPosition() {
return position;
}

public String getAlleleA() {
return alleleA;
}

public String getAlleleB() {
return alleleB;
}

public String getProbeSetId() {
return probeSetId;
}

public String getRsid() {
return rsid;
}

@Override
public void readFields(DataInput in) throws IOException {
this.position.readFields(in);
this.alleleA=in.readUTF();
this.alleleB=in.readUTF();
this.probeSetId=in.readUTF();
this.rsid=in.readUTF();
}
@Override
public void write(DataOutput out) throws IOException {
this.position.write(out);
out.writeUTF(this.alleleA);
out.writeUTF(this.alleleB);
out.writeUTF(this.probeSetId);
out.writeUTF(this.rsid);
}
@Override
public int compareTo(Marker o) {
return position.compareTo(o.position);
}
@Override
public String toString() {
return this.probeSetId+" "+position;
}
}

MapReduce for sorting the Annotation file.

We need to split each CSV line and sort the file on the probeSetId. The output will be a key/value [probeSetId(String)/Marker].
Mapper
public static class SortMarkerByProbIdMapper extends Mapper<LongWritable, Text, Text, Marker>
{
(...)
@Override
protected void map(
LongWritable key,
Text value,
Context context)
throws java.io.IOException ,InterruptedException
{
//ignore header and comment
if( value.find("\"Probe Set ID\"")==0 ||
value.find("#")==0 )
{
return;
}

List<String> array=splitCSV(new String(value.getBytes(),0,value.getLength()));
if(array.get(this.chromCol).equals("---")) return;//undefined chromosome

Marker m=new Marker();
m.position=new Position(
array.get(this.chromCol),
Integer.parseInt(array.get(this.posCol))
);
m.alleleA=array.get(this.alleleACol);
m.alleleB=array.get(this.alleleBCol);
m.probeSetId=array.get(this.probeSetIdCol);
m.rsid=array.get(this.rsIdCol);
context.write(new Text(m.probeSetId),m);
}
}
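The splitCSV(...) helper and the column-index lookup are elided in the mapper above. A minimal sketch of such a splitter (my own, not the original helper; uses java.util.List/ArrayList), handling the double-quoted fields of the annotation file that may contain commas:

/** sketch: split one line of the annotation CSV; fields are double-quoted and may contain commas */
private static List<String> splitCSV(String line)
    {
    List<String> tokens = new ArrayList<String>();
    StringBuilder current = new StringBuilder();
    boolean inQuotes = false;
    for(int i = 0; i < line.length(); ++i)
        {
        char c = line.charAt(i);
        if(c == '\"')
            {
            inQuotes = !inQuotes; // toggle the quoted state, drop the quote itself
            }
        else if(c == ',' && !inQuotes)
            {
            tokens.add(current.toString()); // end of field
            current.setLength(0);
            }
        else
            {
            current.append(c);
            }
        }
    tokens.add(current.toString()); // last field
    return tokens;
    }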

Reducer
public static class SortMarkerByProbIdReducer extends Reducer<Text, Marker, Text, Marker>
{
@Override
protected void reduce(
Text key,
Iterable<Marker> values,
Context context
) throws java.io.IOException ,InterruptedException
{
Marker marker=null;
for(Marker i:values)
{
if(marker!=null) throw new IOException("Duplicate marker id "+key);
marker=i;
}
if(marker==null) return;
context.write(key,marker);
}
}

Job
private void sortMarkersByProbeid(Configuration conf) throws Exception
{
//JobConf jobConf=new JobConf(conf);

FileSystem fs=FileSystem.get(conf);
final Path outPath=new Path(SORTED_MARKERS_PATH);
final Path inPath=new Path("myfolder/Axiom_GW_Hu_SNP.r2.na31.annot.csv");

List<String> header=null;
(...)
/*
(...)
open the file and find the column indexes
(...)
*/

Job job = new Job(conf, Hadoop01.class.getName());
job.setJarByClass(Hadoop01.class);
job.setMapperClass(SortMarkerByProbIdMapper.class);
job.setReducerClass(SortMarkerByProbIdReducer.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Marker.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Marker.class);
job.setOutputFormatClass(org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat.class);


if(fs.exists(outPath))
{
fs.delete(outPath, true);
}

FileInputFormat.addInputPath(job, inPath);
FileOutputFormat.setOutputPath(job, outPath);


if(!job.waitForCompletion(true) )
{
throw new IOException("Cannot complete job");
}
}

MapReduce for sorting the 'Genotypes' file.

The output will be a key/value [probeSetId(String)/Genotypes].
Mapper
public static class SortCallByProbIdMapper extends Mapper<LongWritable, Text, Text, Genotypes>
{
final private Pattern tab=Pattern.compile("[\t]");
@Override
protected void map(
LongWritable key,
Text value,
Context context)
throws java.io.IOException ,InterruptedException
{
//ignore header and comment
if( value.find("probeset_id")==0 ||
value.find("%")==0 )
{
return;
}

String tokens[]=tab.split(new String(value.getBytes(),0,value.getLength()));
Genotypes genotypes=new Genotypes(tokens.length-1);
for(int i=1;i< tokens.length;++i)
{
genotypes.array[i-1]=Byte.parseByte(tokens[i]);
}
context.write(new Text(tokens[0]),genotypes);
}
}

Reducer
public static class SortCallByProbIdReducer extends Reducer<Text, Genotypes, Text, Genotypes>
{
@Override
protected void reduce(
Text key,
Iterable<Genotypes> values,
Context context
) throws java.io.IOException ,InterruptedException
{
Genotypes array=null;
for(Genotypes i:values)
{
if(array!=null) throw new IOException("Duplicate marker id "+key);
array=i;
}
if(array==null) return;
context.write(key,array);
}
}

Job
public void sortCallsByProbeid(Configuration conf) throws Exception
{
//JobConf jobConf=new JobConf(conf);

FileSystem fs=FileSystem.get(conf);
final Path outPath=new Path(SORTED_CALLS_PATH);
final Path inPath=new Path("myfolder/AxiomGT1.calls.txt");

Job job = new Job(conf, Hadoop01.class.getName());
job.setJarByClass(Hadoop01.class);
job.setMapperClass(SortCallByProbIdMapper.class);
job.setReducerClass(SortCallByProbIdReducer.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Genotypes.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Genotypes.class);

job.setOutputFormatClass(org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat.class);



if(fs.exists(outPath))
{
fs.delete(outPath, true);
}

FileInputFormat.addInputPath(job, inPath);
FileOutputFormat.setOutputPath(job, outPath);


if(!job.waitForCompletion(true) )
{
throw new IOException("Cannot complete job");
}
}

Joining the two sorted files

This was the most complicated part, as I didn't find much documentation about how to 'join' two files in HDFS with the latest hadoop API. The following solution seems to work and only needs a reducer:
Reducer
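The JoinReducer registered in the job below is not reproduced in this post. As a rough sketch only (my reconstruction, not the original code), a reducer for the old 'mapred' API could unpack the TupleWritable produced by the composite join (the Marker from the first input, the Genotypes from the second) and emit the merged line; the 0=AA, 1=AB, 2=BB, -1=no-call mapping is an assumption:

/* sketch of a join reducer (old 'mapred' API), assumed to be a static nested class
 * next to Marker and Genotypes so that genotypes.array is accessible */
public static class JoinReducer
    extends org.apache.hadoop.mapred.MapReduceBase
    implements org.apache.hadoop.mapred.Reducer<Text, TupleWritable, Text, Text>
    {
    @Override
    public void reduce(
        Text key,
        java.util.Iterator<TupleWritable> values,
        org.apache.hadoop.mapred.OutputCollector<Text, Text> output,
        org.apache.hadoop.mapred.Reporter reporter
        ) throws IOException
        {
        while(values.hasNext())
            {
            TupleWritable tuple = values.next();
            Marker marker = (Marker)tuple.get(0);         // from the sorted-markers sequence file
            Genotypes genotypes = (Genotypes)tuple.get(1); // from the sorted-calls sequence file
            StringBuilder line = new StringBuilder();
            line.append(marker.getPosition().getChrom());
            line.append(' ').append(marker.getPosition().getPosition());
            line.append(' ').append(marker.getRsid());
            for(byte code : genotypes.array)
                {
                switch(code) // assumed coding: 0=AA, 1=AB, 2=BB, -1=no call
                    {
                    case 0: line.append(' ').append(marker.getAlleleA()).append(' ').append(marker.getAlleleA()); break;
                    case 1: line.append(' ').append(marker.getAlleleA()).append(' ').append(marker.getAlleleB()); break;
                    case 2: line.append(' ').append(marker.getAlleleB()).append(' ').append(marker.getAlleleB()); break;
                    default: line.append(" . ."); break;
                    }
                }
            output.collect(key, new Text(line.toString()));
            }
        }
    }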

Job
private void join(Configuration conf) throws Exception
{
FileSystem fs=FileSystem.get(conf);
Path outPath=new Path(JOIN_PATH);
if(fs.exists(outPath))
{
fs.delete(outPath, true);
}

final String compose=CompositeInputFormat.compose(
"inner",
SequenceFileInputFormat.class,
new Path(SORTED_MARKERS_PATH),
new Path(SORTED_CALLS_PATH)
);
System.err.println(compose);
JobConf jobConf = new JobConf(conf, getClass());
jobConf.setJobName("join");
jobConf.setInputFormat(CompositeInputFormat.class);
jobConf.set("mapred.join.expr",
compose);

jobConf.setMapOutputKeyClass(Text.class);
jobConf.setMapOutputValueClass(TupleWritable.class);
jobConf.setOutputValueClass(Text.class);//TupleWritable ?
jobConf.setOutputKeyClass(Text.class);
jobConf.setOutputFormat(TextOutputFormat.class);
jobConf.setReducerClass(JoinReducer.class);

//jobConf.setMapOutputValueClass(Text.class);
org.apache.hadoop.mapred.FileOutputFormat.setOutputPath(jobConf, outPath);
JobClient.runJob(jobConf);
}

Full source code



That's it,

Pierre

13 June 2011

XSLT+SVG= "It Came from Outer Space" (anaglyphs)


"It Came from Outer Space", Universal's first film to be filmed in 3-D.

Just like the Gimp, Inkscape, an SVG editor, can manage layers:
Each layer appears in the SVG document as an svg:g element with the attribute inkscape:groupmode="layer" under the root svg:svg element. I've written an XSLT stylesheet that converts the SVG image to an anaglyph image; the stylesheet is available on github. For each layer it creates a new SVG filter that splits the layer into a red and a blue image.
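The stylesheet itself is not reproduced here. As an illustration only (my own sketch, not the stylesheet's actual output), here is the kind of per-layer filter it could generate: the layer is duplicated with a small horizontal offset, one copy keeps only the red channel (left eye), the other keeps the green and blue channels (right eye), and the two are blended back together:

<filter id="anaglyphLayer1" x="-20%" y="-20%" width="140%" height="140%">
  <!-- left eye: shift slightly to the left and keep only the red channel -->
  <feOffset in="SourceGraphic" dx="-5" dy="0" result="shiftedLeft"/>
  <feColorMatrix in="shiftedLeft" type="matrix"
    values="1 0 0 0 0  0 0 0 0 0  0 0 0 0 0  0 0 0 1 0" result="redEye"/>
  <!-- right eye: shift slightly to the right and keep the green and blue channels -->
  <feOffset in="SourceGraphic" dx="5" dy="0" result="shiftedRight"/>
  <feColorMatrix in="shiftedRight" type="matrix"
    values="0 0 0 0 0  0 1 0 0 0  0 0 1 0 0  0 0 0 1 0" result="cyanEye"/>
  <!-- combine the two color-separated images -->
  <feBlend in="redEye" in2="cyanEye" mode="screen"/>
</filter>

The filter would then be referenced from the layer element, e.g. <g inkscape:groupmode="layer" filter="url(#anaglyphLayer1)">.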

Example


The original SVG picture:


After processing with the XSLT stylesheet:


SVG Source:

SVG Result:


That's it,

Pierre