YOKOFAKUN: Hadoop, my notebook: HDFS

21 April 2009

This post is about Apache Hadoop, an open-source framework implementing the MapReduce programming model. This first notebook focuses on HDFS, the Hadoop file system, and follows the great Yahoo! Hadoop Tutorial Home. Forget the clusters: I'm running this Hadoop engine on my one and only laptop.

Downloading & Installing

~/tmp/HADOOP> wget "http://apache.multidist.com/hadoop/core/hadoop-0.19.1/hadoop-0.19.1.tar.gz"
Saving to: `hadoop-0.19.1.tar.gz'

100%[======================================>] 55,745,146   487K/s   in 1m 53s

2009-04-21 20:52:04 (480 KB/s) - `hadoop-0.19.1.tar.gz' saved [55745146/55745146]
~/tmp/HADOOP> tar xfz hadoop-0.19.1.tar.gz
~/tmp/HADOOP> rm hadoop-0.19.1.tar.gz
~/tmp/HADOOP> mkdir -p hdfs/data
~/tmp/HADOOP> mkdir -p hdfs/name
# Hmm... this step was not clear to me as I'm not an ssh guru. I had to give my root password to make the servers start.
~/tmp/HADOOP> ssh-keygen -t rsa -P 'password' -f ~/.ssh/id_rsa 
Generating public/private rsa key pair.
Your identification has been saved in /home/pierre/.ssh/id_rsa.
Your public key has been saved in /home/pierre/.ssh/id_rsa.pub.
The key fingerprint is:
17:c0:29:b4:56:d1:d3:dd:ae:d5:ba:3e:5b:33:b0:99 pierre@linux-zfgk
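# Note: sshd reads ~/.ssh/authorized_keys. The run recorded here used the misspelling "autorized_keys", which is probably why the start/stop scripts below still prompt for a password.
~/tmp/HADOOP> cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys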
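
The key installation step above can be sketched as a small shell function. This is only a sketch: the function name install_pubkey is hypothetical, and it assumes a standard OpenSSH setup where the server reads ~/.ssh/authorized_keys. Note that a key created with a non-empty passphrase will still prompt for that passphrase unless an ssh agent holds it.

```shell
# Hypothetical helper: append a public key to an authorized_keys file
# only if it is not already listed, then restrict the file permissions.
# $1 = public key file, $2 = authorized_keys file
install_pubkey() {
    pubkey_file=$1
    auth_file=$2
    mkdir -p "$(dirname "$auth_file")"
    touch "$auth_file"
    # -q quiet, -x whole-line match, -F fixed string (no regex)
    if ! grep -qxF -- "$(cat "$pubkey_file")" "$auth_file"; then
        cat "$pubkey_file" >> "$auth_file"
    fi
    chmod 600 "$auth_file"
}

# Usage on a real account (not executed here):
#   install_pubkey ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys
#   ssh localhost true    # should no longer ask for the account password
```

Calling the function twice leaves a single copy of the key in the file, so it is safe to re-run.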

Editing the Cluster configuration

Edit the file hadoop-0.19.1/conf/hadoop-site.xml.

<configuration>
<property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
    <description>This is the URI (protocol specifier, hostname, and port) that describes the NameNode (main Node) for the cluster.</description>
</property>
<property>
    <name>dfs.data.dir</name>
    <value>/home/pierre/tmp/HADOOP/hdfs/data</value>
    <description>This is the path on the local file system in which the DataNode instance should store its data</description>
</property>
<property>
    <name>dfs.name.dir</name>
    <value>/home/pierre/tmp/HADOOP/hdfs/name</value>
    <description>This is the path on the local file system of the NameNode instance where the NameNode metadata is stored.</description>
</property>
</configuration>
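
Rather than editing the file by hand, the same three properties could be generated from a base directory. The sketch below is an assumption-laden convenience, not part of the tutorial: the function name write_site_xml is hypothetical, and the description elements are left out for brevity.

```shell
# Hypothetical helper: write a single-node hadoop-site.xml.
# $1 = base directory containing hdfs/data and hdfs/name
# $2 = output file
write_site_xml() {
    base_dir=$1
    out_file=$2
    cat > "$out_file" <<EOF
<configuration>
<property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
</property>
<property>
    <name>dfs.data.dir</name>
    <value>$base_dir/hdfs/data</value>
</property>
<property>
    <name>dfs.name.dir</name>
    <value>$base_dir/hdfs/name</value>
</property>
</configuration>
EOF
}

# Usage with the paths from this post (not executed here):
#   write_site_xml "$HOME/tmp/HADOOP" hadoop-0.19.1/conf/hadoop-site.xml
```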

Formatting HDFS

HDFS is the Hadoop Distributed File System. Quoting the tutorial: "HDFS is a block-structured file system: individual files are broken into blocks of a fixed size. These blocks are stored across a cluster of one or more machines with data storage capacity. A file can be made of several blocks, and they are not necessarily stored on the same machine (...) If several machines must be involved in the serving of a file, then a file could be rendered unavailable by the loss of any one of those machines. HDFS combats this problem by replicating each block across a number of machines (3, by default)."
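
To make the block and replication arithmetic concrete, here is a rough calculation for the 308,367-byte file used later in this post. The 64 MB block size is an assumption (I believe it is the default in this Hadoop version), the helper names are hypothetical, and on a one-machine setup like mine only one copy of each block can actually be placed, whatever the replication factor says.

```shell
# Number of blocks = ceil(size / block_size), using integer arithmetic.
# $1 = file size in bytes, $2 = block size in bytes
block_count() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

# Raw bytes consumed across the cluster if every replica is placed.
# $1 = file size in bytes, $2 = replication factor
raw_bytes() {
    echo $(( $1 * $2 ))
}

size=308367          # size of the SNP file stored on HDFS in this post
block_size=67108864  # assumption: 64 MB block size
replication=3        # default replication quoted above

echo "blocks: $(block_count "$size" "$block_size")"
echo "raw bytes if fully replicated: $(raw_bytes "$size" "$replication")"
```

With these numbers the file fits in a single block, and three replicas would take 925,101 bytes of raw storage.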

~/tmp/HADOOP> hadoop-0.19.1/bin/hadoop namenode -format
09/04/21 21:11:18 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = linux-zfgk.site/127.0.0.2
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 0.19.1
STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.19 -r 745977; compiled by 'ndaley' on Fri Feb 20 00:16:34 UTC 2009
************************************************************/
Re-format filesystem in /home/pierre/tmp/HADOOP/hdfs/name ? (Y or N) Y
09/04/21 21:11:29 INFO namenode.FSNamesystem: fsOwner=pierre,users,dialout,video
09/04/21 21:11:29 INFO namenode.FSNamesystem: supergroup=supergroup
09/04/21 21:11:29 INFO namenode.FSNamesystem: isPermissionEnabled=true
09/04/21 21:11:29 INFO common.Storage: Image file of size 96 saved in 0 seconds.
09/04/21 21:11:29 INFO common.Storage: Storage directory /home/pierre/tmp/HADOOP/hdfs/name has been successfully formatted.
09/04/21 21:11:29 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at linux-zfgk.site/127.0.0.2
************************************************************/

Starting HDFS

~/tmp/HADOOP>  hadoop-0.19.1/bin/start-dfs.sh
starting namenode, logging to /home/pierre/tmp/HADOOP/hadoop-0.19.1/bin/../logs/hadoop-pierre-namenode-linux-zfgk.out
Password:
localhost: starting datanode, logging to /home/pierre/tmp/HADOOP/hadoop-0.19.1/bin/../logs/hadoop-pierre-datanode-linux-zfgk.out
Password:
localhost: starting secondarynamenode, logging to /home/pierre/tmp/HADOOP/hadoop-0.19.1/bin/../logs/hadoop-pierre-secondarynamenode-linux-zfgk.out

Playing with HDFS

First, download a few SNPs from UCSC/dbSNP into ~/local.xls.

~/tmp/HADOOP>  mysql -N --user=genome --host=genome-mysql.cse.ucsc.edu -A -D hg18 -e 'select name,chrom,chromStart,avHet from snp129 where avHet!=0 and name like "rs12345%" ' > ~/local.xls

Creating directories

~/tmp/HADOOP> hadoop-0.19.1/bin/hadoop dfs -mkdir /user
~/tmp/HADOOP> hadoop-0.19.1/bin/hadoop dfs -mkdir /user/pierre

Copying a file "local.xls" from your local file system to HDFS

~/tmp/HADOOP> hadoop-0.19.1/bin/hadoop dfs -put ~/local.xls stored.xls

Recursive listing of HDFS

~/tmp/HADOOP> hadoop-0.19.1/bin/hadoop dfs -lsr /
drwxr-xr-x   - pierre supergroup          0 2009-04-21 21:45 /user
drwxr-xr-x   - pierre supergroup          0 2009-04-21 21:45 /user/pierre
-rw-r--r--   3 pierre supergroup     308367 2009-04-21 21:45 /user/pierre/stored.xls
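
The listing can be post-processed with awk, for example to keep only regular files with their size and replication factor. This sketch assumes the 8-column layout shown above (permissions, replication, owner, group, size, date, time, path) and paths without spaces; list_files is a hypothetical name.

```shell
# Read "dfs -lsr" output on stdin and print "path size replication"
# for every entry that is not a directory.
list_files() {
    awk '$1 !~ /^d/ { print $8, $5, $2 }'
}

# Usage (not executed here):
#   hadoop-0.19.1/bin/hadoop dfs -lsr / | list_files
```

Directories show "-" in the replication column, which is why they are filtered on the leading "d" of the permission string rather than on that column.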

'cat' the first lines of the SNP file stored on HDFS:

~/tmp/HADOOP> hadoop-0.19.1/bin/hadoop dfs -cat /user/pierre/stored.xls | head
rs12345003      chr9    1765426 0.02375
rs12345004      chr9    2962430 0.055768
rs12345006      chr9    74304094        0.009615
rs12345007      chr9    73759324        0.112463
rs12345008      chr9    88421765        0.014184
rs12345013      chr9    78951530        0.104463
rs12345014      chr9    78542260        0.490608
rs12345015      chr9    10121973        0.201446
rs12345016      chr9    2698257 0.456279
rs12345027      chr9    8399632 0.04828

Removing a file. Note: "On startup, the NameNode enters a special state called Safemode." I could not delete the file until I ran "dfsadmin -safemode leave".

~/tmp/HADOOP> hadoop-0.19.1/bin/hadoop dfsadmin -safemode leave
Safe mode is OFF
~/tmp/HADOOP> hadoop-0.19.1/bin/hadoop dfs -rm /user/pierre/stored.xls
Deleted hdfs://localhost:9000/user/pierre/stored.xls
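
Instead of forcing the NameNode out of safe mode, a script could wait until it leaves on its own. The sketch below polls "dfsadmin -safemode get" and assumes its output contains "OFF" in the same way as the "Safe mode is OFF" message above. The function name and the HADOOP variable are hypothetical. As far as I know, "dfsadmin -safemode wait" does the waiting for you.

```shell
# Path to the hadoop launcher; override by setting HADOOP beforehand.
HADOOP=${HADOOP:-hadoop-0.19.1/bin/hadoop}

# Poll the NameNode until safe mode is reported OFF.
# $1 = maximum number of tries (default 30), one second apart.
# Returns 0 once safe mode is off, 1 if it is still on after all tries.
wait_safemode_off() {
    tries=${1:-30}
    while [ "$tries" -gt 0 ]; do
        case "$("$HADOOP" dfsadmin -safemode get 2>/dev/null)" in
            *OFF*) return 0 ;;
        esac
        tries=$((tries - 1))
        sleep 1
    done
    return 1
}

# Usage (not executed here):
#   wait_safemode_off 60 && "$HADOOP" dfs -rm /user/pierre/stored.xls
```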

Check that there is NO file named stored.xls on the local file system: HDFS keeps its data as block files (blk_*) under hdfs/data.

~/tmp/HADOOP> find hdfs/
hdfs/
hdfs/data
hdfs/data/detach
hdfs/data/in_use.lock
hdfs/data/tmp
hdfs/data/current
hdfs/data/current/blk_3340572659657793789
hdfs/data/current/dncp_block_verification.log.curr
hdfs/data/current/blk_3340572659657793789_1002.meta
hdfs/data/current/VERSION
hdfs/data/storage
hdfs/name
hdfs/name/in_use.lock
hdfs/name/current
hdfs/name/current/edits
hdfs/name/current/VERSION
hdfs/name/current/fsimage
hdfs/name/current/fstime
hdfs/name/image
hdfs/name/image/fsimage
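
The same check can be scripted. This sketch counts files with a given name and lists the block files, skipping the .meta files that, as far as I understand, hold the checksums of each block. Both function names are hypothetical.

```shell
# Count regular files with a given name under a directory.
# $1 = directory, $2 = file name pattern
count_named() {
    find "$1" -type f -name "$2" | wc -l | tr -d ' '
}

# List HDFS block files (blk_<id>) without their .meta companions.
# $1 = DataNode storage directory
list_blocks() {
    find "$1" -type f -name 'blk_*' ! -name '*.meta'
}

# Usage (not executed here):
#   count_named hdfs stored.xls    # expected to print 0
#   list_blocks hdfs/data
```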

Stopping HDFS

~/tmp/HADOOP>  hadoop-0.19.1/bin/stop-dfs.sh
stopping namenode
Password:
localhost: stopping datanode
Password:
localhost: stopping secondarynamenode

Pierre
