21 April 2009

Hadoop, my notebook: HDFS

This post is about Apache Hadoop, an open-source framework implementing the MapReduce algorithm. This first notebook focuses on HDFS, the Hadoop Distributed File System, and follows the great Yahoo! Hadoop Tutorial Home. Forget the clusters: I'm running this Hadoop engine on my one and only laptop.

Downloading & Installing

~/tmp/HADOOP> wget "http://apache.multidist.com/hadoop/core/hadoop-0.19.1/hadoop-0.19.1.tar.gz"
Saving to: `hadoop-0.19.1.tar.gz'

100%[======================================>] 55,745,146 487K/s in 1m 53s

2009-04-21 20:52:04 (480 KB/s) - `hadoop-0.19.1.tar.gz' saved [55745146/55745146]
~/tmp/HADOOP> tar xfz hadoop-0.19.1.tar.gz
~/tmp/HADOOP> rm hadoop-0.19.1.tar.gz
~/tmp/HADOOP> mkdir -p hdfs/data
~/tmp/HADOOP> mkdir -p hdfs/name
#hum... this step was not clear as I'm not an ssh guru: the start scripts use ssh to log into each node (here, localhost), so passwordless login must be set up. I had to give my root password to make the server start.
~/tmp/HADOOP> ssh-keygen -t rsa -P 'password' -f ~/.ssh/id_rsa
Generating public/private rsa key pair.
Your identification has been saved in /home/pierre/.ssh/id_rsa.
Your public key has been saved in /home/pierre/.ssh/id_rsa.pub.
The key fingerprint is:
17:c0:29:b4:56:d1:d3:dd:ae:d5:ba:3e:5b:33:b0:99 pierre@linux-zfgk
~/tmp/HADOOP> cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

Editing the Cluster configuration

Edit the file hadoop-0.19.1/conf/hadoop-site.xml. The three properties follow the Yahoo! tutorial; the values match the directories created above and the URI reported later by HDFS:

<configuration>
 <property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:9000</value>
  <description>This is the URI (protocol specifier, hostname, and port) that describes the NameNode (main node) for the cluster.</description>
 </property>
 <property>
  <name>dfs.data.dir</name>
  <value>/home/pierre/tmp/HADOOP/hdfs/data</value>
  <description>This is the path on the local file system in which the DataNode instance should store its data.</description>
 </property>
 <property>
  <name>dfs.name.dir</name>
  <value>/home/pierre/tmp/HADOOP/hdfs/name</value>
  <description>This is the path on the local file system of the NameNode instance where the NameNode metadata is stored.</description>
 </property>
</configuration>

Formatting HDFS

HDFS is the Hadoop Distributed File System. From the tutorial: "HDFS is a block-structured file system: individual files are broken into blocks of a fixed size. These blocks are stored across a cluster of one or more machines with data storage capacity. A file can be made of several blocks, and they are not necessarily stored on the same machine (...) If several machines must be involved in the serving of a file, then a file could be rendered unavailable by the loss of any one of those machines. HDFS combats this problem by replicating each block across a number of machines (3, by default)."
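The block/replication idea above can be sketched in a few lines of Python. This is only an illustration: the round-robin placement, node names and file size are made up here, and real HDFS placement is rack-aware; only the 64 MB block size and the replication factor of 3 are actual Hadoop defaults.

```python
# Sketch of HDFS-style block placement: split a file into fixed-size
# blocks and assign each block to `replication` distinct datanodes.
# Round-robin placement for simplicity; real HDFS placement is rack-aware.
def place_blocks(file_size, block_size, datanodes, replication=3):
    n_blocks = -(-file_size // block_size)  # ceiling division
    return {b: [datanodes[(b + r) % len(datanodes)]
                for r in range(replication)]
            for b in range(n_blocks)}

plan = place_blocks(file_size=300_000_000,        # a hypothetical ~300 MB file
                    block_size=64 * 1024 * 1024,  # 64 MB, the default block size
                    datanodes=["node1", "node2", "node3", "node4"])
print(len(plan))   # 5 blocks
print(plan[0])     # ['node1', 'node2', 'node3']
```

Losing any single node still leaves two copies of every block, which is why the loss-of-one-machine problem quoted above goes away.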
~/tmp/HADOOP> hadoop-0.19.1/bin/hadoop namenode -format
09/04/21 21:11:18 INFO namenode.NameNode: STARTUP_MSG:
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = linux-zfgk.site/
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 0.19.1
STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.19 -r 745977; compiled by 'ndaley' on Fri Feb 20 00:16:34 UTC 2009
Re-format filesystem in /home/pierre/tmp/HADOOP/hdfs/name ? (Y or N) Y
09/04/21 21:11:29 INFO namenode.FSNamesystem: fsOwner=pierre,users,dialout,video
09/04/21 21:11:29 INFO namenode.FSNamesystem: supergroup=supergroup
09/04/21 21:11:29 INFO namenode.FSNamesystem: isPermissionEnabled=true
09/04/21 21:11:29 INFO common.Storage: Image file of size 96 saved in 0 seconds.
09/04/21 21:11:29 INFO common.Storage: Storage directory /home/pierre/tmp/HADOOP/hdfs/name has been successfully formatted.
09/04/21 21:11:29 INFO namenode.NameNode: SHUTDOWN_MSG:
SHUTDOWN_MSG: Shutting down NameNode at linux-zfgk.site/

Starting HDFS

~/tmp/HADOOP> hadoop-0.19.1/bin/start-dfs.sh
starting namenode, logging to /home/pierre/tmp/HADOOP/hadoop-0.19.1/bin/../logs/hadoop-pierre-namenode-linux-zfgk.out
localhost: starting datanode, logging to /home/pierre/tmp/HADOOP/hadoop-0.19.1/bin/../logs/hadoop-pierre-datanode-linux-zfgk.out
localhost: starting secondarynamenode, logging to /home/pierre/tmp/HADOOP/hadoop-0.19.1/bin/../logs/hadoop-pierre-secondarynamenode-linux-zfgk.out

Playing with HDFS

First, download a few SNPs from UCSC (dbSNP build snp129) into ~/local.xls.
~/tmp/HADOOP> mysql -N --user=genome --host=genome-mysql.cse.ucsc.edu -A -D hg18 -e 'select name,chrom,chromStart,avHet from snp129 where avHet!=0 and name like "rs12345%" ' > ~/local.xls

Creating directories
~/tmp/HADOOP> hadoop-0.19.1/bin/hadoop dfs -mkdir /user
~/tmp/HADOOP> hadoop-0.19.1/bin/hadoop dfs -mkdir /user/pierre

Copying the file "local.xls" from the local file system to HDFS, where it is stored under the name "stored.xls"
~/tmp/HADOOP> hadoop-0.19.1/bin/hadoop dfs -put ~/local.xls stored.xls

Recursive listing of HDFS
~/tmp/HADOOP> hadoop-0.19.1/bin/hadoop dfs -lsr /
drwxr-xr-x - pierre supergroup 0 2009-04-21 21:45 /user
drwxr-xr-x - pierre supergroup 0 2009-04-21 21:45 /user/pierre
-rw-r--r-- 3 pierre supergroup 308367 2009-04-21 21:45 /user/pierre/stored.xls
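The second column of that listing is the replication factor: "-" for directories, and 3 for the file, i.e. HDFS still records the default of 3 replicas even on this single-node setup. A small Python sketch parsing the three lines above (the listing text is simply pasted in; nothing here queries HDFS):

```python
# Parse the `hadoop dfs -lsr` output shown above. Fields are:
# permissions, replication, owner, group, size, date, time, path.
listing = """\
drwxr-xr-x - pierre supergroup 0 2009-04-21 21:45 /user
drwxr-xr-x - pierre supergroup 0 2009-04-21 21:45 /user/pierre
-rw-r--r-- 3 pierre supergroup 308367 2009-04-21 21:45 /user/pierre/stored.xls"""

for line in listing.splitlines():
    perms, repl, owner, group, size, date, time, path = line.split()
    kind = "dir " if perms.startswith("d") else "file"
    print(f"{kind} {path} replication={repl} size={size}")
```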

'cat' the first lines of the SNP file stored on HDFS:
~/tmp/HADOOP> hadoop-0.19.1/bin/hadoop dfs -cat /user/pierre/stored.xls | head
rs12345003 chr9 1765426 0.02375
rs12345004 chr9 2962430 0.055768
rs12345006 chr9 74304094 0.009615
rs12345007 chr9 73759324 0.112463
rs12345008 chr9 88421765 0.014184
rs12345013 chr9 78951530 0.104463
rs12345014 chr9 78542260 0.490608
rs12345015 chr9 10121973 0.201446
rs12345016 chr9 2698257 0.456279
rs12345027 chr9 8399632 0.04828
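Since stored.xls is just tab-delimited text (name, chrom, chromStart, avHet), the rows that come back from HDFS can be processed like any flat file. A quick sketch, with the ten rows above hard-coded:

```python
# Parse the SNP rows printed above and find the most heterozygous one.
# Columns: name, chrom, chromStart, avHet (average heterozygosity).
rows = """\
rs12345003 chr9 1765426 0.02375
rs12345004 chr9 2962430 0.055768
rs12345006 chr9 74304094 0.009615
rs12345007 chr9 73759324 0.112463
rs12345008 chr9 88421765 0.014184
rs12345013 chr9 78951530 0.104463
rs12345014 chr9 78542260 0.490608
rs12345015 chr9 10121973 0.201446
rs12345016 chr9 2698257 0.456279
rs12345027 chr9 8399632 0.04828"""

snps = [line.split() for line in rows.splitlines()]
best = max(snps, key=lambda s: float(s[3]))
print(best[0], best[3])   # rs12345014 0.490608
```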

Removing a file. Note: "On startup, the NameNode enters a special state called Safemode." I could not delete the file before leaving safe mode with "dfsadmin -safemode leave".
~/tmp/HADOOP> hadoop-0.19.1/bin/hadoop dfsadmin -safemode leave
Safe mode is OFF
~/tmp/HADOOP> hadoop-0.19.1/bin/hadoop dfs -rm /user/pierre/stored.xls
Deleted hdfs://localhost:9000/user/pierre/stored.xls

Check that there is NO file named stored.xls in the local file system: HDFS stores the data as anonymous blocks under hdfs/data, not as named files.
~/tmp/HADOOP> find hdfs/


Stopping HDFS

~/tmp/HADOOP> hadoop-0.19.1/bin/stop-dfs.sh
stopping namenode
localhost: stopping datanode
localhost: stopping secondarynamenode

