How to set up HBase (Part I)

HBase is used to store (with read/write access) large amounts of data in real time. The goal of this project is hosting very large tables — billions of rows X millions of columns — atop clusters of commodity hardware. Apache HBase is an open-source, distributed, versioned, column-oriented store, modeled after Google’s Bigtable: A Distributed Storage System for Structured Data by Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File System, Apache HBase provides Bigtable-like capabilities on top of Hadoop and HDFS.

Next, we present the basic procedure to set up HBase in a pseudo-distributed fashion. The first step is to download HBase from the following link: Apache Download HBase.

For this entry, we are going to work with the stable version, HBase 1.0.0.

A. HBase setup

First, we unpack the .tar file by means of the following command:

[undercloud@localhost Downloads]$ tar xfz hbase-1.0.0-bin.tar.gz

Then, move the extracted folder to /usr/local. Remember to change to the root user in order to have permission to move the folder.

[undercloud@localhost Downloads]$ su -
[root@localhost Downloads]# mv hbase-1.0.0 /usr/local

Change the ownership of the hbase-1.0.0 folder to the undercloud user:

[root@localhost Downloads]# chown -R undercloud /usr/local/hbase-1.0.0

 

B. Configuring HBase

The next step is to configure HBase to run in pseudo-distributed mode. Pseudo-distributed mode means that HBase still runs completely on a single host, but each HBase daemon (HMaster, HRegionServer, and Zookeeper) runs as a separate process.

Configuring JAVA_HOME environment variable

It is required to set the JAVA_HOME environment variable before starting HBase. HBase provides a central mechanism to configure the JAVA_HOME variable – conf/hbase-env.sh. Edit this file, uncomment the line starting with JAVA_HOME, and set it to the appropriate location for your operating system. The JAVA_HOME variable should be set to a directory which contains the executable file bin/java.

[root@localhost hbase-1.0.0]# exit
logout
[undercloud@localhost Downloads]$ cd /usr/local/hbase-1.0.0
[undercloud@localhost hbase-1.0.0]$ nano conf/hbase-env.sh

Uncomment the JAVA_HOME line and set it to the Java installation folder location.

export JAVA_HOME=/usr/java/jdk1.7.0_67

Next, it is necessary to edit the conf/hbase-site.xml file to specify custom configurations such as hbase.rootdir, the directory where HBase writes its data, and hbase.zookeeper.property.dataDir, the directory where ZooKeeper writes its data. Next, we present the content of our conf/hbase-site.xml file.

<configuration>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://localhost:9000/hbase</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/home/undercloud/zookeeper</value>
  </property>
</configuration>
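
Once HBase is up and running (section C below), a quick way to confirm that hbase.rootdir is really pointing at HDFS is to list that directory; this is a minimal sketch assuming the paths used in this post:

[undercloud@localhost hadoop-2.6.0]$ bin/hdfs dfs -ls /hbase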

C. Starting HBase

First, it is necessary to start the Hadoop Distributed File System (HDFS) and YARN services as follows:

[undercloud@localhost hadoop-2.6.0]$ sbin/start-dfs.sh
Starting namenodes on [localhost]
localhost: starting namenode, logging to /usr/local/hadoop-2.6.0/logs/hadoop-undercloud-namenode-localhost.localdomain.out
localhost: starting datanode, logging to /usr/local/hadoop-2.6.0/logs/hadoop-undercloud-datanode-localhost.localdomain.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /usr/local/hadoop-2.6.0/logs/hadoop-undercloud-secondarynamenode-localhost.localdomain.out
[undercloud@localhost hadoop-2.6.0]$ sbin/start-yarn.sh
starting yarn daemons
starting resourcemanager, logging to /usr/local/hadoop-2.6.0/logs/yarn-undercloud-resourcemanager-localhost.localdomain.out
localhost: starting nodemanager, logging to /usr/local/hadoop-2.6.0/logs/yarn-undercloud-nodemanager-localhost.localdomain.out
[undercloud@localhost hadoop-2.6.0]$

Next, we can start HBase by means of the following commands:

[undercloud@localhost hadoop-2.6.0]$ cd /usr/local/hbase-1.0.0
[undercloud@localhost hbase-1.0.0]$ bin/start-hbase.sh
localhost: starting zookeeper, logging to /usr/local/hbase-1.0.0/bin/../logs/hbase-undercloud-zookeeper-localhost.localdomain.out
starting master, logging to /usr/local/hbase-1.0.0/bin/../logs/hbase-undercloud-master-localhost.localdomain.out
starting regionserver, logging to /usr/local/hbase-1.0.0/bin/../logs/hbase-undercloud-1-regionserver-localhost.localdomain.out
[undercloud@localhost hbase-1.0.0]$
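
As a quick sanity check, the HBase daemons should now show up in jps next to the Hadoop processes; a sketch (the jps path assumes the Oracle JDK RPM layout used in these posts):

[undercloud@localhost hbase-1.0.0]$ /usr/java/default/bin/jps
# Expect HMaster, HRegionServer and HQuorumPeer entries in addition to the HDFS/YARN daemons.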

Now, we can start working with HBase through its shell as follows:

[undercloud@localhost hbase-1.0.0]$ bin/hbase shell
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/hbase-1.0.0/lib/slf4j-log4j12-1.7.7.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/hadoop-2.6.0/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 1.0.0, r6c98bff7b719efdb16f71606f3b7d8229445eb81, Sat Feb 14 19:49:22 PST 2015

hbase(main):001:0> list
TABLE                                                                           
0 row(s) in 0.4580 seconds

=> []
hbase(main):002:0>
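
From here you can run a small smoke test along the lines of the HBase quickstart; the table and column family names below are just illustrative examples:

hbase(main):002:0> create 'test', 'cf'
hbase(main):003:0> put 'test', 'row1', 'cf:a', 'value1'
hbase(main):004:0> scan 'test'
hbase(main):005:0> disable 'test'
hbase(main):006:0> drop 'test'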

 D. Stopping HBase

Now, we can stop the HBase service with the bin/stop-hbase.sh command, but first it is necessary to quit the HBase shell by means of the exit command, as follows:

hbase(main):002:0> exit
[undercloud@localhost hbase-1.0.0]$ bin/stop-hbase.sh
stopping hbase.......................
localhost: stopping zookeeper.
[undercloud@localhost hbase-1.0.0]$
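
If you also want to shut down the Hadoop services started earlier, the matching stop scripts can be used; a sketch assuming the same hadoop-2.6.0 layout:

[undercloud@localhost hbase-1.0.0]$ cd /usr/local/hadoop-2.6.0
[undercloud@localhost hadoop-2.6.0]$ sbin/stop-yarn.sh
[undercloud@localhost hadoop-2.6.0]$ sbin/stop-dfs.sh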

 


How to install Hadoop 2.3.0 (YARN)

This entry presents the procedure to set up Hadoop 2.3.0 (YARN). Apache Hadoop 2.3.0 includes a lot of improvements over previous releases. In fact, Hadoop 2.3.0, also known as Hadoop YARN, is a breakthrough in the Hadoop architecture because it provides a more general processing platform beyond MapReduce. The main feature of Hadoop YARN is that resource management is no longer tied exclusively to MapReduce; it is now oriented toward running multiple types of applications on Hadoop.

You can find more information about the new Hadoop 2.3.0 architecture in the following links:

Official website of Apache Hadoop 2.3.0

Hadoop YARN – Hortonworks
MapReduce 2.0 in Apache Hadoop 0.23 – Cloudera

Note: The following procedure has been updated for the Hadoop 2.6.0 version (January 2015).

The software used to set up Apache Hadoop 2.6.0 is:

CentOS 7 x86_64
Java JDK 7 update 67
Hadoop Apache 2.6.0

1 Installing JDK 7 on CentOS

1.1 The first step is to change to the root user

[undercloud@localhost ~]$ su -
Password: 

1.2 Install Java JDK 7 package

[root@localhost Downloads]# rpm -Uvh jdk-7u67-linux-x64.rpm

1.3 Register java, javaws, libjavaplugin.so, javac and jar with the alternatives --install command

[root@localhost Downloads]# alternatives --install /usr/bin/java java /usr/java/latest/jre/bin/java 200000
[root@localhost Downloads]# alternatives --install /usr/bin/javaws javaws /usr/java/latest/jre/bin/javaws 200000
[root@localhost Downloads]# alternatives --install /usr/lib64/mozilla/plugins/libjavaplugin.so libjavaplugin.so.x86_64 /usr/java/latest/jre/lib/amd64/libnpjp2.so 200000
[root@localhost Downloads]# alternatives --install /usr/bin/javac javac /usr/java/latest/bin/javac 200000
[root@localhost Downloads]# alternatives --install /usr/bin/jar jar /usr/java/latest/bin/jar 200000
[root@localhost Downloads]#

1.4 Alternatively, use the absolute JDK version path (/usr/java/jdk1.7.0_67)

[root@localhost Downloads]# alternatives --install /usr/bin/java java /usr/java/jdk1.7.0_67/jre/bin/java 200000
[root@localhost Downloads]# alternatives --install /usr/bin/javaws javaws /usr/java/jdk1.7.0_67/jre/bin/javaws 200000
[root@localhost Downloads]# alternatives --install /usr/lib64/mozilla/plugins/libjavaplugin.so libjavaplugin.so.x86_64 /usr/java/jdk1.7.0_67/jre/lib/amd64/libnpjp2.so 200000
[root@localhost Downloads]# alternatives --install /usr/bin/javac javac /usr/java/jdk1.7.0_67/bin/javac 200000
[root@localhost Downloads]# alternatives --install /usr/bin/jar jar /usr/java/jdk1.7.0_67/bin/jar 200000
[root@localhost Downloads]#
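
If more than one JDK ends up registered, you can review or switch the active java binary interactively; a quick sketch:

[root@localhost Downloads]# alternatives --config java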

1.5 Check java version

 
[root@localhost Downloads]# java -version
java version "1.7.0_67"
Java(TM) SE Runtime Environment (build 1.7.0_67-b01)
Java HotSpot(TM) 64-Bit Server VM (build 24.65-b04, mixed mode)
[root@localhost Downloads]#

1.6 Finally, add the JAVA_HOME environment variable to the /etc/profile or $HOME/.bash_profile file, picking the export that matches your installation

[root@localhost Downloads]# export JAVA_HOME="/usr/java/latest"
[root@localhost Downloads]# export JAVA_HOME="/usr/java/jdk1.7.0_67"
[root@localhost Downloads]# export JAVA_HOME="/usr/java/jre1.7.0_67"
[root@localhost Downloads]#
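
Exporting the variable at the prompt only affects the current shell. To make it persistent, append the export you chose to /etc/profile (or $HOME/.bash_profile) and reload it; a sketch assuming the absolute JDK path:

[root@localhost Downloads]# echo 'export JAVA_HOME="/usr/java/jdk1.7.0_67"' >> /etc/profile
[root@localhost Downloads]# source /etc/profile
[root@localhost Downloads]# echo $JAVA_HOME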

2 Installing OpenSSH-Server

2.1 Install OpenSSH-Server

[root@localhost Downloads]# yum install openssh-server

2.2 Verify the service status of OpenSSH-Server

[root@localhost Downloads]# service sshd status
sshd.service - OpenSSH server daemon
Loaded: loaded (/usr/lib/systemd/system/sshd.service; disabled)
Active: inactive (dead)
[root@localhost Downloads]#

2.3 If it is stopped, then start it by means of the following command

[root@localhost Downloads]# service sshd start
Redirecting to /bin/systemctl start sshd.service
[root@localhost Downloads]# service sshd status
Redirecting to /bin/systemctl status sshd.service
sshd.service - OpenSSH server daemon
 Loaded: loaded (/usr/lib/systemd/system/sshd.service; disabled)
 Active: active (running) since Wed 2015-01-21 12:02:30 CST; 25s ago
 Process: 5464 ExecStartPre=/usr/sbin/sshd-keygen (code=exited, status=0/SUCCESS)
 Main PID: 5466 (sshd)
 CGroup: /system.slice/sshd.service
 └─5466 /usr/sbin/sshd -D

Jan 21 12:02:30 localhost.localdomain systemd[1]: Started OpenSSH server daemon.
Jan 21 12:02:31 localhost.localdomain sshd[5466]: Server listening on 0.0.0.0...
Jan 21 12:02:31 localhost.localdomain sshd[5466]: Server listening on :: port...
Hint: Some lines were ellipsized, use -l to show in full.
[root@localhost Downloads]#

2.4 Set the service to always start when booting the system

[root@localhost Downloads]#  systemctl enable sshd.service
ln -s '/usr/lib/systemd/system/sshd.service' '/etc/systemd/system/multi-user.target.wants/sshd.service'
[root@localhost Downloads]#

2.5 SSH configuration without a pass-phrase prompt. First, be sure you are NOT logged in as the root user

[root@localhost Downloads]# exit
logout
[undercloud@localhost ~]$ 

2.6 Then, generate a DSA key by means of the following command:

[undercloud@localhost ~]$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
Generating public/private dsa key pair.
Created directory '/home/undercloud/.ssh'.
Your identification has been saved in /home/undercloud/.ssh/id_dsa.
Your public key has been saved in /home/undercloud/.ssh/id_dsa.pub.
The key fingerprint is:
f5:e4:fd:1b:f4:dc:04:5f:e4:64:94:0b:99:13:0d:bb undercloud@localhost.localdomain
The key's randomart image is:
+--[ DSA 1024]----+
|             o*.*|
|             =.B |
|          . ..+ +|
|         . + ..+.|
|        S   oE..o|
|              .+o|
|               .=|
|                o|
|               . |
+-----------------+
[undercloud@localhost ~]$ 

2.7 Configure the SSH service without pass-phrase by means of the following commands:

[undercloud@localhost ~]$ chmod 755 ~/.ssh
[undercloud@localhost ~]$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
[undercloud@localhost ~]$ chmod 644 ~/.ssh/authorized_keys
[undercloud@localhost ~]$ ssh-add
Identity added: /home/undercloud/.ssh/id_dsa (/home/undercloud/.ssh/id_dsa)
[undercloud@localhost ~]$
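
If ssh-add complains that it cannot open a connection to your authentication agent, you may need to start an agent in the current shell first; a minimal sketch:

[undercloud@localhost ~]$ eval $(ssh-agent -s)
[undercloud@localhost ~]$ ssh-add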

2.8 You can verify the SSH connection without a pass-phrase as follows:

[undercloud@localhost ~]$ ssh localhost
The authenticity of host 'localhost (127.0.0.1)' can't be established.
RSA key fingerprint is 48:b3:71:8b:c9:f8:1c:16:8c:64:8a:b0:1a:35:42:2f.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (RSA) to the list of known hosts.
Last login: Sat Jan 18 02:06:53 2015 from 192.168.1.114
[undercloud@localhost ~]$ exit
logout
Connection to localhost closed.
[undercloud@localhost ~]$ ssh localhost
Last login: Wed Jan 21 11:28:56 2015
[undercloud@localhost ~]$ 

3 Installing Hadoop 2.6.0 (YARN)

3.1 Extract the content of the hadoop-2.6.0.tar.gz file and move the extracted folder to the /usr/local directory

[undercloud@localhost Downloads]$ tar vxzf hadoop-2.6.0.tar.gz
[undercloud@localhost Downloads]$ su -
Password: 
[root@localhost ~]# mv /home/undercloud/Downloads/hadoop-2.6.0 /usr/local
[root@localhost ~]#

3.2 Change the ownership of the hadoop-2.6.0 directory to the undercloud user

[root@localhost ~]# chown -R undercloud /usr/local/hadoop-2.6.0
[root@localhost ~]# exit
[undercloud@localhost Downloads]$ 

3.3 Set up Hadoop environment variables, adding the following variables at the end of the .bashrc file

#Hadoop variables
export JAVA_HOME=/usr/java/jdk1.7.0_67
export HADOOP_INSTALL=/usr/local/hadoop-2.6.0
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export HADOOP_YARN_HOME=$HADOOP_INSTALL
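
After saving .bashrc you can reload it and do a quick sanity check; a small sketch:

[undercloud@localhost ~]$ source ~/.bashrc
[undercloud@localhost ~]$ echo $HADOOP_INSTALL
[undercloud@localhost ~]$ hadoop version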

3.4 Modify the JAVA_HOME variable in the /usr/local/hadoop-2.6.0/etc/hadoop/hadoop-env.sh file

export JAVA_HOME=/usr/java/jdk1.7.0_67

4 Configure Hadoop 2.6.0

4.1 The first step is to configure the core-site.xml file located in /usr/local/hadoop-2.6.0/etc/hadoop

<configuration>
   <property>
      <name>fs.default.name</name>
      <value>hdfs://localhost:9000</value>
   </property> 
</configuration>

4.2 Second, configure the yarn-site.xml file as follows

<configuration>
   <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle</value>
   </property>
   <property>
      <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
      <value>org.apache.hadoop.mapred.ShuffleHandler</value>
   </property>
</configuration>

4.3 Third, copy (or move) the mapred-site.xml.template file to mapred-site.xml and then edit its content as follows

<configuration>
   <property>
      <name>mapreduce.framework.name</name>
      <value>yarn</value>
   </property>
</configuration>
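
A quick way to create mapred-site.xml from the bundled template (cp keeps the template around; use mv if you prefer), assuming the default /usr/local/hadoop-2.6.0 layout:

[undercloud@localhost ~]$ cd /usr/local/hadoop-2.6.0/etc/hadoop
[undercloud@localhost hadoop]$ cp mapred-site.xml.template mapred-site.xml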

4.4 Create the folders where the NameNode and DataNode data will be stored

[undercloud@localhost hadoop]$ cd ~
[undercloud@localhost ~]$ mkdir -p hadoopData/hdfs/namenode
[undercloud@localhost ~]$ mkdir -p hadoopData/hdfs/datanode
[undercloud@localhost ~]$

4.5 Configure the hdfs-site.xml file

<configuration>
   <property>
      <name>dfs.replication</name>
      <value>1</value>
   </property>
   <property>
      <name>dfs.namenode.name.dir</name>
      <value>file:/home/undercloud/hadoopData/hdfs/namenode</value>
   </property>
   <property>
      <name>dfs.datanode.data.dir</name>
      <value>file:/home/undercloud/hadoopData/hdfs/datanode</value>
   </property>
</configuration>

5 Starting Hadoop YARN services

5.1 First, it is necessary to format the HDFS NameNode as usual

[undercloud@localhost hadoop]$ cd /usr/local/hadoop-2.6.0
[undercloud@localhost hadoop-2.6.0]$ bin/hadoop namenode -format
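
The hadoop namenode entry point still works in 2.6.0 but is reported as deprecated; the equivalent hdfs subcommand can be used instead, as a sketch:

[undercloud@localhost hadoop-2.6.0]$ bin/hdfs namenode -format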

5.2 Start the HDFS services

[undercloud@localhost hadoop-2.6.0]$  sbin/start-dfs.sh
Starting namenodes on [localhost]
localhost: starting namenode, logging to /usr/local/hadoop-2.6.0/logs/hadoop-undercloud-namenode-localhost.localdomain.out
localhost: starting datanode, logging to /usr/local/hadoop-2.6.0/logs/hadoop-undercloud-datanode-localhost.localdomain.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /usr/local/hadoop-2.6.0/logs/hadoop-undercloud-secondarynamenode-localhost.localdomain.out
[undercloud@localhost hadoop-2.6.0]$

5.3 Start the YARN services

[undercloud@localhost hadoop-2.6.0]$ sbin/start-yarn.sh
starting yarn daemons
starting resourcemanager, logging to /usr/local/hadoop-2.6.0/logs/yarn-bautista-resourcemanager-localhost.localdomain.out
localhost: starting nodemanager, logging to /usr/local/hadoop-2.6.0/logs/yarn-bautista-nodemanager-localhost.localdomain.out
[undercloud@localhost hadoop-2.6.0]$ 

5.4 You can see the running services by means of the jps command

[undercloud@localhost bin]$ cd /usr/local/hadoop-2.6.0
[undercloud@localhost hadoop-2.6.0]$ /usr/java/default/bin/jps
5485 NameNode
5660 ResourceManager
6101 Jps
5568 DataNode
5986 JobHistoryServer
[undercloud@localhost hadoop-2.6.0]$
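
The JobHistoryServer shown above is not started by start-dfs.sh or start-yarn.sh; if it is missing from your jps output, it can be launched separately, for example:

[undercloud@localhost hadoop-2.6.0]$ sbin/mr-jobhistory-daemon.sh start historyserver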

5.5 You can see the ResourceManager web page at http://localhost:8088

[Screenshot: Hadoop ResourceManager web UI]

5.6 You can also see the NameNode overview at http://localhost:50070
[Screenshot: NameNode overview web UI]

5.7 Finally, you should be up and running. You can run the pi example as follows:

[undercloud@localhost hadoop-2.6.0]$ bin/hadoop jar $HADOOP_MAPRED_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar pi 2 5
Number of Maps  = 2
Samples per Map = 5
Java HotSpot(TM) 64-Bit Server VM warning: You have loaded library /usr/local/hadoop-2.6.0/lib/native/libhadoop.so.1.0.0 which might have disabled stack guard. The VM will try to fix the stack guard now.
It's highly recommended that you fix the library with 'execstack -c ', or link it with '-z noexecstack'.
14/05/10 14:42:28 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Wrote input for Map #0
Wrote input for Map #1
Starting Job
14/05/10 14:42:30 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
14/05/10 14:42:31 INFO input.FileInputFormat: Total input paths to process : 2
14/05/10 14:42:31 INFO mapreduce.JobSubmitter: number of splits:2
14/05/10 14:42:31 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1399745002298_0003
14/05/10 14:42:32 INFO impl.YarnClientImpl: Submitted application application_1399745002298_0003
14/05/10 14:42:32 INFO mapreduce.Job: The url to track the job: http://localhost.localdomain:8088/proxy/application_1399745002298_0003/
14/05/10 14:42:32 INFO mapreduce.Job: Running job: job_1399745002298_0003

How to collect Hadoop metrics with Chukwa (Part II)

This is the second part of the post “How to collect Hadoop metrics with Chukwa”. In this entry we’re going to configure Chukwa to store the collected metrics in HBase instead of HDFS. It is highly recommended to perform the steps described in the first part before continuing with this post. In this entry, we’re going to use the following software:

  • Hadoop 1.0.4
  • HBase 0.94.1
  • Chukwa 0.5.0
  • Java SE 1.6 update 37

The HBase docs mention that it is used to store (with read/write access) large amounts of data in real time. The goal of this project is hosting very large tables — billions of rows X millions of columns — atop clusters of commodity hardware. Apache HBase is an open-source, distributed, versioned, column-oriented store modeled after Google’s Bigtable: A Distributed Storage System for Structured Data by Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File System, Apache HBase provides Bigtable-like capabilities on top of Hadoop and HDFS.

In order to be able to store Chukwa metrics into HBase, it’s necessary to configure HBase to integrate it with HDFS and Chukwa.

A. HBase setup

First, we need to download HBase from the following link. Once the hbase-0.94.1.tar.gz file has been downloaded, we unpack it by means of the following commands:

$ tar xfz hbase-0.94.1.tar.gz
$ cd hbase-0.94.1

A.1 Next, we need to edit the conf/hbase-env.sh file and specify the JAVA_HOME and HBASE_REGIONSERVERS variables as follows:

export JAVA_HOME=/usr/java/jdk1.6.0_26
export HBASE_REGIONSERVERS=${HBASE_HOME}/conf/regionservers

A.2 Also, it’s necessary to edit the conf/hbase-site.xml file to specify custom configurations such as hbase.rootdir, the directory where HBase writes its data, and hbase.zookeeper.property.dataDir, the directory where ZooKeeper writes its data. Next, we present the content of our hbase-site.xml file.

<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://localhost:9000/hbase</value>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>localhost</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <property>
    <name>hbase.master</name>
    <value>localhost:60000</value>
    <description>The host and port that the HBase master runs at.</description>
  </property>
  <property>
    <name>dfs.support.append</name>
    <value>true</value>
  </property>
</configuration>

A.3 Once the conf/hbase-env.sh and conf/hbase-site.xml files have been modified, it’s also necessary to modify the conf/regionservers file. The regionservers file lists all the hosts running HRegionServers, one host per line (this file in HBase is like the Hadoop slaves file). All servers listed in this file will be started and stopped when the HBase cluster starts or stops. As we run Chukwa in a stand-alone way, our conf/regionservers file only contains one line:

localhost

A.4 For HBase to work with the same type of metrics that Hadoop uses, it’s necessary to modify the conf/hadoop-metrics.properties file in HBase as follows:

# Configuration of the "hbase" context for null
hbase.class=org.apache.hadoop.metrics.spi.NullContext

# Configuration of the "hbase" context for file
# hbase.class=org.apache.hadoop.hbase.metrics.file.TimeStampingFileContext
# hbase.period=10
# hbase.fileName=/tmp/metrics_hbase.log

# HBase-specific configuration to reset long-running stats (e.g. compactions)
# If this variable is left out, then the default is no expiration.
hbase.extendedperiod = 3600

# Configuration of the "hbase" context for ganglia
# Pick one: Ganglia 3.0 (former) or Ganglia 3.1 (latter)
# hbase.class=org.apache.hadoop.metrics.ganglia.GangliaContext
# hbase.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
# hbase.period=10
# hbase.servers=GMETADHOST_IP:8649

# Configuration of the "jvm" context for null
jvm.class=org.apache.hadoop.metrics.spi.NullContext

# Configuration of the "jvm" context for file
# jvm.class=org.apache.hadoop.hbase.metrics.file.TimeStampingFileContext
# jvm.period=10
# jvm.fileName=/tmp/metrics_jvm.log

# Configuration of the "jvm" context for ganglia
# Pick one: Ganglia 3.0 (former) or Ganglia 3.1 (latter)
# jvm.class=org.apache.hadoop.metrics.ganglia.GangliaContext
# jvm.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
# jvm.period=10
# jvm.servers=GMETADHOST_IP:8649

# Configuration of the "rpc" context for null
rpc.class=org.apache.hadoop.metrics.spi.NullContext

# Configuration of the "rpc" context for file
# rpc.class=org.apache.hadoop.hbase.metrics.file.TimeStampingFileContext
# rpc.period=10
# rpc.fileName=/tmp/metrics_rpc.log

# Configuration of the "rpc" context for ganglia
# Pick one: Ganglia 3.0 (former) or Ganglia 3.1 (latter)
# rpc.class=org.apache.hadoop.metrics.ganglia.GangliaContext
# rpc.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
# rpc.period=10
# rpc.servers=GMETADHOST_IP:8649

# Configuration of the "rest" context for ganglia
# Pick one: Ganglia 3.0 (former) or Ganglia 3.1 (latter)
# rest.class=org.apache.hadoop.metrics.ganglia.GangliaContext
# rest.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
# rest.period=10
# rest.servers=GMETADHOST_IP:8649

A.5 Finally, we need to copy the following files into the hbase-0.94.1/lib directory:

  • chukwa-0.5.0.jar
  • chukwa-0.5.0-client.jar
  • hadoop-core-1.0.4.jar
  • hadoop-client-1.0.4.jar

In addition, it’s recommended to copy the hbase-0.94.1.jar file into the hadoop-1.0.4/lib and chukwa-0.5.0/share/chukwa/lib directories.
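
As a concrete sketch of these copies (the source locations are assumptions based on the /usr/local layout used in these posts; adjust them to wherever the jars live in your installation):

$ cp /usr/local/chukwa-0.5.0/share/chukwa/chukwa-0.5.0*.jar /usr/local/hbase-0.94.1/lib/
$ cp /usr/local/hadoop-1.0.4/hadoop-core-1.0.4.jar /usr/local/hadoop-1.0.4/hadoop-client-1.0.4.jar /usr/local/hbase-0.94.1/lib/
$ cp /usr/local/hbase-0.94.1/hbase-0.94.1.jar /usr/local/hadoop-1.0.4/lib/
$ cp /usr/local/hbase-0.94.1/hbase-0.94.1.jar /usr/local/chukwa-0.5.0/share/chukwa/lib/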

B. Chukwa configuration

B.1 Modifying the collector configuration. As mentioned, Chukwa is capable of storing the collected metrics in HDFS or in HBase. The first part of this tutorial described how to store metrics in HDFS; now we will modify the $CHUKWA_HOME/etc/chukwa/chukwa-collector-conf.xml file to indicate to Chukwa that HBase will be used to store the collected metrics. The modified file is presented next.

<configuration>
  <property>
    <name>chukwaCollector.writerClass</name>
    <value>org.apache.hadoop.chukwa.datacollection.writer.PipelineStageWriter</value>
  </property>

  <!-- Sequence File Writer parameters
  <property>
    <name>chukwaCollector.pipeline</name>
    <value>org.apache.hadoop.chukwa.datacollection.writer.SocketTeeWriter,org.apache.hadoop.chukwa.datacollection.writer.SeqFileWriter</value>
  </property>
  -->

  <property>
    <name>chukwaCollector.localOutputDir</name>
    <value>/tmp/chukwa/dataSink/</value>
    <description>Chukwa local data sink directory, see LocalWriter.java</description>
  </property>
  <property>
    <name>chukwaCollector.writerClass</name>
    <value>org.apache.hadoop.chukwa.datacollection.writer.localfs.LocalWriter</value>
    <description>Local chukwa writer, see LocalWriter.java</description>
  </property>

  <property>
    <name>chukwaCollector.pipeline</name>
    <value>org.apache.hadoop.chukwa.datacollection.writer.SocketTeeWriter,org.apache.hadoop.chukwa.datacollection.writer.hbase.HBaseWriter, org.apache.hadoop.chukwa.datacollection.writer.SeqFileWriter</value>
  </property>
  <property>
    <name>chukwaCollector.writerClass</name>
    <value>org.apache.hadoop.chukwa.datacollection.writer.hbase.HBaseWriter</value>
  </property>
  <property>
    <name>hbase.demux.package</name>
    <value>org.apache.hadoop.chukwa.extraction.demux.processor</value>
    <description>Demux parser class package, HBaseWriter uses this package name to validate HBase for annotated demux parser classes.</description>
  </property>
  <property>
    <name>hbase.writer.verify.schema</name>
    <value>false</value>
    <description>Verify HBase Table schema with demux parser schema, log
    warning if there are mismatch between hbase schema and demux parsers.
    </description>
  </property>
  <property>
    <name>hbase.writer.halt.on.schema.mismatch</name>
    <value>false</value>
    <description>If this option is set to true, and HBase table schema
    is mismatched with demux parser, collector will shut down itself.
    </description>
  </property>

  <property>
    <name>writer.hdfs.filesystem</name>
    <value>hdfs://localhost:9000</value>
    <description>HDFS to dump to</description>
  </property>
  <property>
    <name>chukwaCollector.outputDir</name>
    <value>/chukwa/logs/</value>
    <description>Chukwa data sink directory</description>
  </property>
  <property>
    <name>chukwaCollector.rotateInterval</name>
    <value>300000</value>
    <description>Chukwa rotate interval (ms)</description>
  </property>
  <property>
    <name>chukwaCollector.isFixedTimeRotatorScheme</name>
    <value>false</value>
    <description>A flag to indicate that the collector should close at a fixed
    offset after every rotateInterval. The default value is false which uses
    the default scheme where collectors close after regular rotateIntervals.
    If set to true then specify chukwaCollector.fixedTimeIntervalOffset value.
    e.g., if isFixedTimeRotatorScheme is true and fixedTimeIntervalOffset is
    set to 10000 and rotateInterval is set to 300000, then the collector will
    close its files at 10 seconds past the 5 minute mark, if
    isFixedTimeRotatorScheme is false, collectors will rotate approximately
    once every 5 minutes
    </description>
  </property>
  <property>
    <name>chukwaCollector.fixedTimeIntervalOffset</name>
    <value>30000</value>
    <description>Chukwa fixed time interval offset value (ms)</description>
  </property>
  <property>
    <name>chukwaCollector.http.port</name>
    <value>8080</value>
    <description>The HTTP port number the collector will listen on</description>
  </property>
</configuration>

C. Starting the services

C.1 Chukwa agents. Once all the modifications have been done, we’re going to start the services, beginning with the Chukwa agent as follows:

[bautista@Zen-UnderLinx chukwa-0.5.0]$ bin/chukwa agent
[bautista@Zen-UnderLinx chukwa-0.5.0]$ OK chukwaAgent.checkpoint.dir [File] = /tmp/chukwa/log/
OK chukwaAgent.checkpoint.interval [Time] = 5000
WARN: option chukwaAgent.collector.retries may not exist; val = 144000
Guesses:
chukwaAgent.connector.retryRate Time
chukwaAgent.sender.retries Integral
chukwaAgent.control.remote Boolean
WARN: option chukwaAgent.collector.retryInterval may not exist; val = 20000
Guesses:
chukwaAgent.sender.retryInterval Integral
chukwaAgent.connector.retryRate Time
chukwaCollector.rotateInterval Time
OK chukwaAgent.control.port [Portno] = 9093
WARN: option chukwaAgent.hostname may not exist; val = localhost
Guesses:
chukwaAgent.control.remote Boolean
chukwaAgent.checkpoint.enabled Boolean
chukwaAgent.sender.retries Integral
OK chukwaAgent.sender.fastRetries [Integral] = 4
WARN: option syslog.adaptor.port.9095.facility.LOCAL1 may not exist; val = HADOOP
Guesses:
adaptor.dirscan.intervalMs Integral
adaptor.memBufWrapper.size Integral
chukwaAgent.adaptor.context.switch.time Time
No checker rules for: chukwaAgent.checkpoint.name chukwaAgent.tags

C.2 Next we run the Hadoop services.

[bautista@Zen-UnderLinx hadoop-1.0.4]$ bin/start-all.sh
starting namenode, logging to /usr/local/hadoop-1.0.4/libexec/../logs/hadoop-bautista-namenode-Zen-UnderLinx.out
localhost: starting datanode, logging to /usr/local/hadoop-1.0.4/libexec/../logs/hadoop-bautista-datanode-Zen-UnderLinx.out
localhost: starting secondarynamenode, logging to /usr/local/hadoop-1.0.4/libexec/../logs/hadoop-bautista-secondarynamenode-Zen-UnderLinx.out
starting jobtracker, logging to /usr/local/hadoop-1.0.4/libexec/../logs/hadoop-bautista-jobtracker-Zen-UnderLinx.out
localhost: starting tasktracker, logging to /usr/local/hadoop-1.0.4/libexec/../logs/hadoop-bautista-tasktracker-Zen-UnderLinx.out
[bautista@Zen-UnderLinx hadoop-1.0.4]$

C.3 Next we run HBase services

[bautista@Zen-UnderLinx hbase-0.94.1]$ bin/start-hbase.sh
localhost: starting zookeeper, logging to /usr/local/hbase-0.94.1/bin/../logs/hbase-bautista-zookeeper-Zen-UnderLinx.out
starting master, logging to /usr/local/hbase-0.94.1/bin/../logs/hbase-bautista-master-Zen-UnderLinx.out
localhost: starting regionserver, logging to /usr/local/hbase-0.94.1/bin/../logs/hbase-bautista-regionserver-Zen-UnderLinx.out
[bautista@Zen-UnderLinx hbase-0.94.1]$

Note: Before being able to run the Chukwa collector, you must have created the HBase tables (schema) where the metrics are going to be stored. You can do this with the following command:

/path/to/hbase-0.94.1/bin/hbase shell < /path/to/chukwa/etc/chukwa/hbase.schema

C.4 Start the chukwa collector service

[bautista@Zen-UnderLinx chukwa-0.5.0]$ bin/chukwa collector
[bautista@Zen-UnderLinx chukwa-0.5.0]$ WARN: option chukwa.data.dir may not exist; val = /chukwa
Guesses:
chukwaRootDir null
fs.default.name URI
nullWriter.dataRate Time
WARN: option chukwa.tmp.data.dir may not exist; val = /chukwa/temp
Guesses:
chukwaRootDir null
nullWriter.dataRate Time
chukwaCollector.tee.port Integral
WARN: option chukwaCollector.fixedTimeIntervalOffset may not exist; val = 30000
Guesses:
chukwaCollector.minPercentFreeDisk Integral
chukwaCollector.tee.keepalive Boolean
chukwaCollector.http.threads Integral
OK chukwaCollector.http.port [Integral] = 8080
WARN: option chukwaCollector.isFixedTimeRotatorScheme may not exist; val = false
Guesses:
chukwaCollector.writeChunkRetries Integral
chukwaCollector.showLogs.enabled Boolean
chukwaCollector.minPercentFreeDisk Integral
OK chukwaCollector.localOutputDir [File] = /tmp/chukwa/dataSink/
WARN chukwaCollector.pipeline = org.apache.hadoop.chukwa.datacollection.writer.SocketTeeWriter,org.apache.hadoop.chukwa.datacollection.writer.hbase.HBaseWriter, org.apache.hadoop.chukwa.datacollection.writer.SeqFileWriter -- org.apache.hadoop.chukwa.datacollection.writer.SeqFileWriter
OK chukwaCollector.rotateInterval [Time] = 300000
OK chukwaCollector.writerClass [ClassName] = org.apache.hadoop.chukwa.datacollection.writer.hbase.HBaseWriter
WARN: option hbase.demux.package may not exist; val = org.apache.hadoop.chukwa.extraction.demux.processor
Guesses:
fs.default.name URI
nullWriter.dataRate Time
CHUKWA_DATA_DIR null
WARN: option hbase.writer.halt.on.schema.mismatch may not exist; val = false
Guesses:
nullWriter.dataRate Time
fs.default.name URI
chukwaCollector.http.threads Integral
WARN: option hbase.writer.verify.schema may not exist; val = false
Guesses:
nullWriter.dataRate Time
fs.default.name URI
httpConnector.asyncAcks Boolean
OK writer.hdfs.filesystem [URI] = hdfs://localhost:9000
No checker rules for: chukwaCollector.outputDir
started Chukwa http collector on port 8080

C.5 Starting the ETL Process

For Chukwa to aggregate the metrics and get relevant results (see the raw log collection and aggregation workflow), it’s necessary to execute the Demux MapReduce job with the following command.

[bautista@Zen-UnderLinx chukwa-0.5.0]$ bin/chukwa demux

D. Checking results in HBase. 

Finally, after a few minutes we’re going to be able to query some results in HBase. For example, if we have executed some MapReduce programs, we can see their metrics by querying the Jobs table as follows:

[bautista@Zen-UnderLinx hbase-0.94.1]$ bin/hbase shell
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 0.94.1, r1365210, Tue Jul 24 18:40:10 UTC 2012

hbase(main):001:0> list
TABLE
ClusterSummary
Hadoop
HadoopLog
Jobs
SystemMetrics
chukwa
6 row(s) in 1.3490 seconds

hbase(main):002:0> scan 'Jobs'
ROW COLUMN+CELL
1351800840000-job_20 column=summary:capp, timestamp=1351800840000, value=Zen-Un
1211011604_0001 derLinx
1351800840000-job_20 column=summary:cluster, timestamp=1351800840000, value=chu
1211011604_0001 kwa
1351800840000-job_20 column=summary:clusterMapCapacity, timestamp=1351800840000
1211011604_0001 , value=2
1351800840000-job_20 column=summary:clusterReduceCapacity, timestamp=1351800840
1211011604_0001 000, value=2
1351800840000-job_20 column=summary:csource, timestamp=1351800840000, value=Zen
1211011604_0001 -UnderLinx
1351800840000-job_20 column=summary:ctags, timestamp=1351800840000, value= clus
1211011604_0001 ter="chukwa"
1351800840000-job_20 column=summary:finishTime, timestamp=1351800840000, value=
1211011604_0001 1351800875355
1351800840000-job_20 column=summary:firstJobCleanupTaskLaunchTime, timestamp=13
1211011604_0001 51800840000, value=1351800866341
1351800840000-job_20 column=summary:firstJobSetupTaskLaunchTime, timestamp=1351
1211011604_0001 800840000, value=1351800660259
1351800840000-job_20 column=summary:firstMapTaskLaunchTime, timestamp=135180084
1211011604_0001 0000, value=1351800669413
1351800840000-job_20 column=summary:firstReduceTaskLaunchTime, timestamp=135180
1211011604_0001 0840000, value=1351800681452
1351800840000-job_20 column=summary:jobId, timestamp=1351800840000, value=job_2
1211011604_0001 01211011604_0001
1351800840000-job_20 column=summary:jobName, timestamp=1351800840000, value=Chu
1211011604_0001 kwa-HourlyArchiveBuilder-Stream
1351800840000-job_20 column=summary:launchTime, timestamp=1351800840000, value=
1211011604_0001 1351800660026
1351800840000-job_20 column=summary:mapSlotSeconds, timestamp=1351800840000, va
1211011604_0001 lue=212
1351800840000-job_20 column=summary:numMaps, timestamp=1351800840000, value=22
1211011604_0001
1351800840000-job_20 column=summary:numReduces, timestamp=1351800840000, value=
1211011604_0001 5
1351800840000-job_20 column=summary:numSlotsPerMap, timestamp=1351800840000, va
1211011604_0001 lue=1
1351800840000-job_20 column=summary:numSlotsPerReduce, timestamp=1351800840000,
1211011604_0001 value=1
1351800840000-job_20 column=summary:queue, timestamp=1351800840000, value=defau
1211011604_0001 lt
1351800840000-job_20 column=summary:reduceSlotsSeconds, timestamp=1351800840000
1211011604_0001 , value=348
1351800840000-job_20 column=summary:status, timestamp=1351800840000, value=SUCC
1211011604_0001 EEDED
1351800840000-job_20 column=summary:submitTime, timestamp=1351800840000, value=
1211011604_0001 1351800659473
1351800840000-job_20 column=summary:user, timestamp=1351800840000, value=bautis
1211011604_0001 ta ...

...

hbase(main):003:0>

We can work with the HBase shell commands to get relevant information about the different metrics and analyze the system performance. You can refer to the HBase documentation for more details.
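
For instance, a few more shell queries you might try (the row key is taken from the scan output above, and the table names come from the earlier list output):

hbase(main):003:0> get 'Jobs', '1351800840000-job_201211011604_0001'
hbase(main):004:0> scan 'SystemMetrics', {LIMIT => 2}
hbase(main):005:0> count 'Hadoop'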


How to collect Hadoop metrics with Chukwa (Part I)

Chukwa is an Apache project built on top of the HDFS and MapReduce frameworks. According to the Chukwa site, it is an open source data collection system for monitoring distributed systems, and more specifically Hadoop clusters. In this post (which is the first of two), we’ll install and configure Chukwa in a standalone scheme over HDFS, and then over HBase. So, before continuing with this post, it is highly recommended to read the Chukwa Architecture.

For this, we’re going to use:

  • Hadoop 1.0.4
  • HBase 0.94.1
  • Chukwa 0.5.0
  • Java SE 1.6 update 37 or later

Next, we present a description of the Chukwa components:

  • Agents are processes that run on each computer and emit data to Collectors. The emitted data is generated by means of adaptors. Adaptors generally wrap some other data source, such as a file or a Unix command-line tool, from which the information is extracted.
  • Collectors receive data from the agent and write it to stable storage. According to Chukwa site, rather than have each adaptor write directly to HDFS, data is sent across the network to a collector process, that does the HDFS writes. Each collector receives data from up to several hundred hosts.
  • ETL Processes for parsing and archiving the data. Collectors can write data directly to HBase or sequence files in HDFS.  Chukwa has a toolbox of MapReduce jobs for organizing and processing incoming data. These jobs come in two kinds, Archiving and Demux.
  • Data Analytics Scripts for aggregate Hadoop cluster health. These scripts provide visualization and interpretation of health of Hadoop cluster.
  • HICC, the Hadoop Infrastructure Care Center; a web-portal style interface for displaying data. Data is fetched from HBase, which in turn is populated by collector or data analytic scripts that runs on the collected data, after Demux.

This post is based on the Chukwa Administration Guide, to which we have added a few comments such as configuration examples and compatibility options. In addition, it’s important to mention that we are working with the Hadoop 1.0.4 and Chukwa 0.5.0 versions, which can be downloaded at this link. Thus, in this first part we will describe the basic setup of Chukwa, which is made up of three components: agents, collectors and the ETL process.

A. Agent configuration

  1. Obtain a copy of Chukwa. You can find the latest release on the Chukwa release page.
  2. Un-tar the release, via tar xzf.
  3. We refer to the directory containing Chukwa as CHUKWA_HOME. It may be helpful to set CHUKWA_HOME explicitly in your environment, but Chukwa does not require that you do so.
  4. Make sure that JAVA_HOME is set correctly and points to a Java 1.6 JRE. It’s generally best to set this in etc/chukwa/chukwa-env.sh.
  5. In etc/chukwa/chukwa-env.sh, set CHUKWA_LOG_DIR and CHUKWA_PID_DIR to the directories where Chukwa should store its console logs and pid files. The pid directory must not be shared between different Chukwa instances: it should be local, not NFS-mounted.
  6. Optionally, set CHUKWA_IDENT_STRING. This string is used to name Chukwa’s own console log files. An example file is presented next.
# The java implementation to use. Required.
export JAVA_HOME=/usr/java/jdk1.6.0_37

# Optional
# The location of HBase Configuration directory. For writing data to
# HBase, you need to set environment variable HBASE_CONF to HBase conf
# directory.
export HBASE_CONF_DIR="${HBASE_CONF_DIR}";

# Hadoop Configuration directory
export HADOOP_CONF_DIR="/usr/local/hadoop-1.0.4/conf";

# The location of chukwa data repository (in either HDFS or your local
# file system, whichever you are using)
export chukwaRecordsRepository="/chukwa/repos/"

# The directory where pid files are stored. CHUKWA_HOME/var/run by default.
export CHUKWA_PID_DIR=/tmp/chukwa/pidDir

# The location of chukwa logs, defaults to CHUKWA_HOME/logs
export CHUKWA_LOG_DIR=/tmp/chukwa/log

# The location to store chukwa data, defaults to CHUKWA_HOME/data
#export CHUKWA_DATA_DIR="{CHUKWA_HOME}/data"

# Instance name for chukwa deployment
export CHUKWA_IDENT_STRING=$USER
export JAVA_PLATFORM=Linux-i386-32
export JAVA_LIBRARY_PATH=${HADOOP_HOME}/lib/native/${JAVA_PLATFORM}

# Database driver name for storing Chukwa Data.
export JDBC_DRIVER=${TODO_CHUKWA_JDBC_DRIVER}

# Database URL prefix for Database Loader.
export JDBC_URL_PREFIX=${TODO_CHUKWA_JDBC_URL_PREFIX}

# HICC Jetty Server heap memory settings
# Specify min and max size of heap to JVM, e.g. 300M
export CHUKWA_HICC_MIN_MEM=
export CHUKWA_HICC_MAX_MEM=

# HICC Jetty Server port, defaults to 4080
#export CHUKWA_HICC_PORT=
export CLASSPATH=${CLASSPATH}:${HBASE_CONF_DIR}:${HADOOP_CONF_DIR}

Note. It is important to mention that in this first part we are NOT going to work with HBase as the repository for the collected data; instead, we are going to store it in HDFS.

  7. Agents send the collected data to a random collector from a list of collectors, so it’s necessary to indicate what the collector list is. The collector list is specified in the $CHUKWA_HOME/etc/chukwa/collectors file, so the file should look something like:
http://collector1HostName:collector1Port/
http://collector2HostName:collector2Port/
http://collector3HostName:collector3Port/

Our collectors file only contains localhost:

localhost
  8. Another file that should be modified is $CHUKWA_HOME/etc/chukwa/chukwa-agent-conf.xml. The most important value to modify is the cluster/group name, which identifies the monitored source nodes. This value is stored in each Chunk of collected data and can be used to distinguish data coming from different clusters. Our chukwa-agent-conf.xml looks like:

<configuration>
  <property>
    <name>chukwaAgent.tags</name>
    <value>cluster="chukwa"</value>
    <description>The cluster's name for this agent</description>
  </property>

  <property>
    <name>chukwaAgent.control.port</name>
    <value>9093</value>
    <description>The socket port number the agent's control interface can be contacted at.</description>
  </property>

  <property>
    <name>chukwaAgent.hostname</name>
    <value>localhost</value>
    <description>The hostname of the agent on this node. Usually localhost, this is used by the chukwa instrumentation agent-control interface library</description>
  </property>

  <property>
    <name>chukwaAgent.checkpoint.name</name>
    <value>chukwa_agent_checkpoint</value>
    <description>the prefix to prepend to the agent's checkpoint file(s)</description>
  </property>

  <property>
    <name>chukwaAgent.checkpoint.dir</name>
    <value>${CHUKWA_LOG_DIR}/</value>
    <description>the location to put the agent's checkpoint file(s)</description>
  </property>

  <property>
    <name>chukwaAgent.checkpoint.interval</name>
    <value>5000</value>
    <description>the frequency interval for the agent to do checkpoints, in milliseconds</description>
  </property>

  <property>
    <name>chukwaAgent.sender.fastRetries</name>
    <value>4</value>
    <description>the number of post attempts to make to a single collector, before marking it failed</description>
  </property>

  <property>
    <name>chukwaAgent.collector.retries</name>
    <value>144000</value>
    <description>the number of attempts to find a working collector</description>
  </property>

  <property>
    <name>chukwaAgent.collector.retryInterval</name>
    <value>20000</value>
    <description>the number of milliseconds to wait between searches for a collector</description>
  </property>

  <property>
    <name>syslog.adaptor.port.9095.facility.LOCAL1</name>
    <value>HADOOP</value>
  </property>
</configuration>

Note. It’s important to mention that it’s necessary to open the 9093, 9095 and 9097 ports in our firewall to be able to connect to the agent.
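
On a host that uses firewalld (such as the CentOS 7 machine from the Hadoop post), opening those ports as root might look like the following sketch; adapt it to your own firewall tooling:

# firewall-cmd --permanent --add-port=9093/tcp
# firewall-cmd --permanent --add-port=9095/tcp
# firewall-cmd --permanent --add-port=9097/tcp
# firewall-cmd --reload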

  9. Configuring Hadoop for monitoring. One of the key goals of Chukwa is to collect logs from Hadoop clusters. The Hadoop configuration files are located in HADOOP_HOME/etc/hadoop (HADOOP_HOME/conf for Hadoop 1.x, as used here). To set up Chukwa to collect logs from Hadoop, we need to change some of the Hadoop configuration files:
  • Copy CHUKWA_HOME/etc/chukwa/hadoop-log4j.properties file to HADOOP_CONF_DIR/log4j.properties
  • Copy CHUKWA_HOME/etc/chukwa/hadoop-metrics2.properties file to HADOOP_CONF_DIR/hadoop-metrics2.properties
  • Edit the HADOOP_CONF_DIR/log4j.properties file and change "hadoop.log.dir" to your actual Chukwa log directory (i.e., CHUKWA_HOME/var/log)
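
A minimal sketch of the two copy steps above, assuming CHUKWA_HOME and HADOOP_CONF_DIR are both set in your shell:

$ cp $CHUKWA_HOME/etc/chukwa/hadoop-log4j.properties $HADOOP_CONF_DIR/log4j.properties
$ cp $CHUKWA_HOME/etc/chukwa/hadoop-metrics2.properties $HADOOP_CONF_DIR/hadoop-metrics2.properties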

Note. To avoid the following error: log4j:ERROR Could not instantiate class [org.apache.hadoop.chukwa.inputtools.log4j.ChukwaDailyRollingFileAppender], we should copy the chukwa-client-xx.jar and json-simple-xx.jar files into the hadoop/lib directory. These files should be available on all Hadoop nodes.

  10. Collector configuration. Since we are going to use HDFS for data storage in this tutorial, we must disable the HBase options and work only with the HDFS configuration parameters such as writer.hdfs.filesystem, which should be set to the HDFS root URL on which Chukwa will store data. Next, an example of the chukwa-collector-conf.xml file is presented.

<configuration>
  <property>
    <name>chukwaCollector.writerClass</name>
    <value>org.apache.hadoop.chukwa.datacollection.writer.PipelineStageWriter</value>
  </property>

  <property>
    <name>chukwaCollector.pipeline</name>
    <value>org.apache.hadoop.chukwa.datacollection.writer.SocketTeeWriter,org.apache.hadoop.chukwa.datacollection.writer.SeqFileWriter</value>
  </property>

  <property>
    <name>chukwaCollector.localOutputDir</name>
    <value>/tmp/chukwa/dataSink/</value>
    <description>Chukwa local data sink directory, see LocalWriter.java</description>
  </property>
  <property>
    <name>chukwaCollector.writerClass</name>
    <value>org.apache.hadoop.chukwa.datacollection.writer.localfs.LocalWriter</value>
    <description>Local chukwa writer, see LocalWriter.java</description>
  </property>

  <!-- HBaseWriter parameters
  <property>
    <name>chukwaCollector.pipeline</name>
    <value>org.apache.hadoop.chukwa.datacollection.writer.SocketTeeWriter,org.apache.hadoop.chukwa.datacollection.writer.hbase.HBaseWriter, org.apache.hadoop.chukwa.datacollection.writer.SeqFileWriter</value>
  </property>
  <property>
    <name>chukwaCollector.writerClass</name>
    <value>org.apache.hadoop.chukwa.datacollection.writer.hbase.HBaseWriter</value>
  </property>
  <property>
    <name>hbase.demux.package</name>
    <value>org.apache.hadoop.chukwa.extraction.demux.processor</value>
    <description>Demux parser class package, HBaseWriter uses this package name to validate HBase for annotated demux parser classes.</description>
  </property>
  <property>
    <name>hbase.writer.verify.schema</name>
    <value>false</value>
    <description>Verify HBase Table schema with demux parser schema, log
    warning if there are mismatch between hbase schema and demux parsers.
    </description>
  </property>
  <property>
    <name>hbase.writer.halt.on.schema.mismatch</name>
    <value>false</value>
    <description>If this option is set to true, and HBase table schema
    is mismatched with demux parser, collector will shut down itself.
    </description>
  </property>
  -->

  <property>
    <name>writer.hdfs.filesystem</name>
    <value>hdfs://localhost:9000</value>
    <description>HDFS to dump to</description>
  </property>
  <property>
    <name>chukwaCollector.outputDir</name>
    <value>/chukwa/logs/</value>
    <description>Chukwa data sink directory</description>
  </property>
  <property>
    <name>chukwaCollector.rotateInterval</name>
    <value>300000</value>
    <description>Chukwa rotate interval (ms)</description>
  </property>
  <property>
    <name>chukwaCollector.isFixedTimeRotatorScheme</name>
    <value>false</value>
    <description>A flag to indicate that the collector should close at a fixed
    offset after every rotateInterval. The default value is false which uses
    the default scheme where collectors close after regular rotateIntervals.
    If set to true then specify chukwaCollector.fixedTimeIntervalOffset value.
    e.g., if isFixedTimeRotatorScheme is true and fixedTimeIntervalOffset is
    set to 10000 and rotateInterval is set to 300000, then the collector will
    close its files at 10 seconds past the 5 minute mark, if
    isFixedTimeRotatorScheme is false, collectors will rotate approximately
    once every 5 minutes
    </description>
  </property>
  <property>
    <name>chukwaCollector.fixedTimeIntervalOffset</name>
    <value>30000</value>
    <description>Chukwa fixed time interval offset value (ms)</description>
  </property>
  <property>
    <name>chukwaCollector.http.port</name>
    <value>8080</value>
    <description>The HTTP port number the collector will listen on</description>
  </property>
</configuration>

Note. Chukwa 0.5.0 includes the Hadoop libraries hadoop-core-1.0.0.jar and hadoop-test-1.0.0.jar to communicate with IPC Server version 4. So it’s necessary to replace the above libraries with the hadoop-core-1.0.4.jar and hadoop-test-1.0.4.jar files in the chukwa-0.5.0/share/chukwa/lib directory.
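
For example (the paths are assumptions based on the layout used in this post; adjust them as needed):

$ cd chukwa-0.5.0/share/chukwa/lib
$ rm hadoop-core-1.0.0.jar hadoop-test-1.0.0.jar
$ cp /usr/local/hadoop-1.0.4/hadoop-core-1.0.4.jar /usr/local/hadoop-1.0.4/hadoop-test-1.0.4.jar .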

  11. Once our configuration files have been modified, we can start the services and collect data with Chukwa. For this, we are going to start the agent first and then the collector, as follows:
[bautista@Zen-UnderLinx chukwa-0.5.0]$ bin/chukwa agent
OK chukwaAgent.checkpoint.dir [File] = /tmp/chukwa/log/
OK chukwaAgent.checkpoint.interval [Time] = 5000
WARN: option chukwaAgent.collector.retries may not exist; val = 144000
Guesses:
chukwaAgent.connector.retryRate Time
chukwaAgent.sender.retries Integral
chukwaAgent.control.remote Boolean
WARN: option chukwaAgent.collector.retryInterval may not exist; val = 20000
Guesses:
chukwaAgent.sender.retryInterval Integral
chukwaAgent.connector.retryRate Time
chukwaCollector.rotateInterval Time
OK chukwaAgent.control.port [Portno] = 9093
WARN: option chukwaAgent.hostname may not exist; val = localhost
Guesses:
chukwaAgent.control.remote Boolean
chukwaAgent.checkpoint.enabled Boolean
chukwaAgent.sender.retries Integral
OK chukwaAgent.sender.fastRetries [Integral] = 4
WARN: option syslog.adaptor.port.9095.facility.LOCAL1 may not exist; val = HADOOP
Guesses:
adaptor.dirscan.intervalMs Integral
adaptor.memBufWrapper.size Integral
chukwaAgent.adaptor.context.switch.time Time
No checker rules for: chukwaAgent.checkpoint.name chukwaAgent.tags
[bautista@Zen-UnderLinx chukwa-0.5.0]$
  12. Next, we start the Hadoop services, like this:
[bautista@Zen-UnderLinx hadoop-1.0.4]$ bin/start-all.sh
starting namenode, logging to /usr/local/hadoop-1.0.4/bin/../logs/hadoop-bautista-namenode-Zen-UnderLinx.out
localhost: starting datanode, logging to /usr/local/hadoop-1.0.4/bin/../logs/hadoop-bautista-datanode-Zen-UnderLinx.out
localhost: starting secondarynamenode, logging to /usr/local/hadoop-1.0.4/bin/../logs/hadoop-bautista-secondarynamenode-Zen-UnderLinx.out
starting jobtracker, logging to /usr/local/hadoop-1.0.4/bin/../logs/hadoop-bautista-jobtracker-Zen-UnderLinx.out
localhost: starting tasktracker, logging to /usr/local/hadoop-1.0.4/bin/../logs/hadoop-bautista-tasktracker-Zen-UnderLinx.out
[bautista@Zen-UnderLinx hadoop-1.0.4]$
  13. Finally, we start the collector with the following command:
[bautista@Zen-UnderLinx chukwa-0.5.0]$ bin/chukwa collector
[bautista@Zen-UnderLinx chukwa-0.5.0]$ WARN: option chukwa.data.dir may not exist; val = /chukwa
Guesses:
chukwaRootDir null
fs.default.name URI
nullWriter.dataRate Time
WARN: option chukwa.tmp.data.dir may not exist; val = /chukwa/temp
Guesses:
chukwaRootDir null
nullWriter.dataRate Time
chukwaCollector.tee.port Integral
WARN: option chukwaCollector.fixedTimeIntervalOffset may not exist; val = 30000
Guesses:
chukwaCollector.minPercentFreeDisk Integral
chukwaCollector.tee.keepalive Boolean
chukwaCollector.http.threads Integral
OK chukwaCollector.http.port [Integral] = 8080
WARN: option chukwaCollector.isFixedTimeRotatorScheme may not exist; val = false
Guesses:
chukwaCollector.writeChunkRetries Integral
chukwaCollector.showLogs.enabled Boolean
chukwaCollector.minPercentFreeDisk Integral
OK chukwaCollector.localOutputDir [File] = /tmp/chukwa/dataSink/
OK chukwaCollector.pipeline [ClassName list] = org.apache.hadoop.chukwa.datacollection.writer.SocketTeeWriter,org.apache.hadoop.chukwa.datacollection.writer.SeqFileWriter
OK chukwaCollector.rotateInterval [Time] = 300000
OK chukwaCollector.writerClass [ClassName] = org.apache.hadoop.chukwa.datacollection.writer.localfs.LocalWriter
OK writer.hdfs.filesystem [URI] = hdfs://localhost:9000
No checker rules for: chukwaCollector.outputDir
started Chukwa http collector on port 8080

[bautista@Zen-UnderLinx chukwa-0.5.0]$
  14. In a few minutes, we will see that Chukwa has collected some dataSinkArchives files, which include different metrics.

In the next post, we are going to modify our configuration files to store the collected metrics in HBase.


XML Input Files for a MapReduce Application

Normally, MapReduce inputs come from a set of input files loaded onto the HDFS cluster. Although several input formats are provided with Hadoop, it does not include an XML format for processing this type of file. The types of files that can be processed by Hadoop are defined by their InputFormat class. Next, we present a table of the standard input formats provided with Hadoop.

  • TextInputFormat: the default format; reads lines of text files. Key: the byte offset of the line. Value: the line contents.
  • KeyValueInputFormat: parses lines into (key, value) pairs. Key: everything up to the first tab character. Value: the remainder of the line.
  • SequenceFileInputFormat: a Hadoop-specific high-performance binary format. Key: user-defined. Value: user-defined.

InputFormats provided by MapReduce

Thus, sometimes it is necessary to develop a custom input format to be able to read files in a particular format. In this post I’ll present how to develop an XML reader which takes XML files as input and writes the output as a set of values in a (key, value) format. A good way to do this is to create a subclass of the FileInputFormat class, which provides the basics for manipulating different types of files. If you need to parse a file in a particular form, it is necessary to override the getRecordReader() method, which returns an object that reads from the input source. For example, suppose we want to read an XML file such as the following:

<book>
	<author>Gambardella, Matthew</author>
	<title>XML Developer's Guide</title>
	<genre>Computer</genre>
	<price>44.95</price>
	<publish_date>2000-10-01</publish_date>
	<description>An in-depth look at creating applications with XML.</description>
</book>
<book>
	<author>Ralls, Kim</author>
	<title>Midnight Rain</title>
	<genre>Fantasy</genre>
	<price>5.95</price>
	<publish_date>2000-12-16</publish_date>
	<description>A former architect battles corporate zombies, an evil sorceress, and her own childhood to become queen of the world.</description>
</book>

Thus, the main idea of our XML reader is to read each element between the book tags (<book> and </book>) as an input record in our MapReduce application. In our example, the XmlInputFormat class reads records from files and defines a factory method for the RecordReader implementation, as shown.

package org.undercloud.mapreduce.example3;

import java.io.IOException;

import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class XmlInputFormat extends FileInputFormat<Text, XmlContent> {

  public RecordReader<Text, XmlContent> getRecordReader(InputSplit input, JobConf job, Reporter reporter)
      throws IOException {

    reporter.setStatus(input.toString());
    return new XmlRecordReader(job, (FileSplit) input);
  }
}

In our application, the RecordReader implementation is where the input file data is read and parsed. This is implemented in the XmlRecordReader class, which makes use of the LineRecordReader class. That class is the implementation used by TextInputFormat to read lines from files and return them. We have wrapped the LineRecordReader with our own implementation, which extracts the elements between the defined XML tags as a record. In the event that a record spans an InputSplit boundary, the record reader takes care of it, so we do not have to worry about it.

package org.undercloud.mapreduce.example3;

import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class XmlRecordReader implements RecordReader<Text, XmlContent> {

	private String startTagS = "<book>";
	private String endTagS = "</book>";
	private byte[] startTag;
	private byte[] endTag;
	private long start;
	private long end;
	private FSDataInputStream fsin;
	private DataOutputBuffer buffer = new DataOutputBuffer();
	private LineRecordReader lineReader;
	private LongWritable lineKey;
	private Text lineValue;

    public XmlRecordReader(JobConf job, FileSplit split) throws IOException {
      lineReader = new LineRecordReader(job, split);
      lineKey = lineReader.createKey();
      lineValue = lineReader.createValue();
      startTag = startTagS.getBytes();
      endTag = endTagS.getBytes();

      // Open the file and seek to the start of the split
      start = split.getStart();
      end = start + split.getLength();
      Path file = split.getPath();
      FileSystem fs = file.getFileSystem(job);
      fsin = fs.open(split.getPath());
      fsin.seek(start);
   }

   public boolean next(Text key, XmlContent value) throws IOException {
       // Get the next record while we are still inside this split
       if (fsin.getPos() < end) {
           if (readUntilMatch(startTag, false)) {
               try {
                   buffer.write(startTag);
                   if (readUntilMatch(endTag, true)) {
                       key.set(Long.toString(fsin.getPos()));
                       value.bufferData = buffer.getData();
                       value.offsetData = 0;
                       value.lenghtData = buffer.getLength();
                       return true;
                   }
               }
               finally {
                   buffer.reset();
               }
           }
       }
       return false;
   }
   
   private boolean readUntilMatch(byte[] match, boolean withinBlock) throws IOException {
       int i = 0;
       while (true) {
           int b = fsin.read();
           // End of file: no further match is possible
           if (b == -1) return false;
           // While reading inside a record, save every byte to the buffer
           if (withinBlock) buffer.write(b);
           // Match the tag byte by byte; restart the match on a mismatch
           if (b == match[i]) {
               i++;
               if (i >= match.length) return true;
           } else i = 0;
           // See if we have passed the end of the split while looking for a start tag
           if (!withinBlock && i == 0 && fsin.getPos() >= end) return false;
       }
   }

   public Text createKey() {
     return new Text("");
   }

   public XmlContent createValue() {
     return new XmlContent();
   }

   public long getPos() throws IOException {
     return lineReader.getPos();
   }

   public void close() throws IOException {
     lineReader.close();
   }

   public float getProgress() throws IOException {
     return lineReader.getProgress();
   }
}

As we can see, the XmlRecordReader class makes use of the XmlContent class. This class defines our custom data type, which allows us to use the XML content as a value type. This class implements Writable because that allows values to be transmitted over the network. The following code listing shows the XmlContent class:

package org.undercloud.mapreduce.example3;

import java.io.*;

import org.apache.hadoop.io.*;

public class XmlContent implements Writable{

    public byte[] bufferData;
    public int offsetData;
    public int lenghtData;
    
   
	  public XmlContent(byte[] bufferData, int offsetData, int lenghtData) {
		  this.bufferData = bufferData;
		  this.offsetData = offsetData;
		  this.lenghtData = lenghtData;
		  }
	  
	  public XmlContent(){
		  this(null,0,0);
	  }
	  
	  public void write(DataOutput out) throws IOException {
		  // Write the length first, then only the valid bytes of the buffer
		  out.writeInt(lenghtData);
		  out.write(bufferData, offsetData, lenghtData);
	  }

	  public void readFields(DataInput in) throws IOException {
		  // Read the length, allocate a matching buffer and fill it
		  lenghtData = in.readInt();
		  offsetData = 0;
		  bufferData = new byte[lenghtData];
		  in.readFully(bufferData);
	  }

	  public String toString() {
		    return Integer.toString(offsetData) + ", "
		        + Integer.toString(lenghtData) + ", "
		        + new String(bufferData, offsetData, lenghtData);
		  }

}
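
As a quick sanity check of the Writable contract, the following stand-alone snippet (illustration only, not part of the job) serializes an XmlContent instance with write() and reads it back with readFields():

package org.undercloud.mapreduce.example3;

import org.apache.hadoop.io.DataInputBuffer;
import org.apache.hadoop.io.DataOutputBuffer;

// Stand-alone illustration of the Writable round trip for XmlContent.
public class XmlContentCheck {
    public static void main(String[] args) throws Exception {
        byte[] record = "<book><title>Example</title></book>".getBytes();
        XmlContent original = new XmlContent(record, 0, record.length);

        // Serialize the value as Hadoop would when moving it between tasks
        DataOutputBuffer out = new DataOutputBuffer();
        original.write(out);

        // Deserialize it back into a fresh instance
        DataInputBuffer in = new DataInputBuffer();
        in.reset(out.getData(), out.getLength());
        XmlContent copy = new XmlContent();
        copy.readFields(in);

        System.out.println(copy); // prints offset, length and the XML text
    }
}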

Finally, it is necessary to declare our main class, which defines the Map and Reduce classes, creates the job to be processed, and sets its configuration. The XmlReader class takes as its first argument the HDFS location of the input XML files. The second argument is the HDFS location where the application will write its results, which have a (key, value) structure.

package org.undercloud.mapreduce.example3;

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.fs.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class XmlReader {
	public static class Map extends MapReduceBase implements Mapper<Text, XmlContent, Text, Text> {
		private final static XmlContent xmlContent = new XmlContent();
		private Text keyxml = new Text();
		private Text content = new Text();
		public void map(Text key, XmlContent value, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
				keyxml.set(Integer.toString(value.offsetData));
				// Use only the valid portion of the buffer as the record content
				content.set(value.bufferData, value.offsetData, value.lenghtData);
				output.collect(keyxml, content);
		}
}
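
Putting the pieces together, a driver along the lines of the word-count driver further down could configure and launch the job. The sketch below is only an illustration: the XmlInputFormat name comes from the sketch above, the class and job names are assumptions, and IdentityReducer is used as a stand-in reducer that simply forwards the mapper output.

package org.undercloud.mapreduce.example3;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.mapred.lib.IdentityReducer;

// Hypothetical driver sketch for the XML-reading job.
public class XmlReaderDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(XmlReaderDriver.class);
        conf.setJobName("xmlreader");

        // (key, value) types emitted by the job
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);

        // Use the custom input format so each <book> element becomes one record
        conf.setInputFormat(XmlInputFormat.class);

        conf.setMapperClass(XmlReader.Map.class);
        conf.setReducerClass(IdentityReducer.class);

        // First argument: input directory in HDFS; second: output directory
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}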

Word Count MapReduce Application (Tutorial part II)

In this post, I'll describe the structure of a MapReduce application. The application that I'll present is the classic "word count" program. This program takes a set of files as input and determines how many times different words appear in those files. As was said in the first part of this tutorial, a MapReduce program processes large volumes of data in parallel. A MapReduce program transforms lists of input data elements into lists of output data elements, and is composed of two elements: the mapper and the reducer. The mapper function transforms each input element of a list into an output element of a new list (see figure 1).

In this program, the mapper function receives a set of files as input (we assume that we have several files) and emits (word, 1) pairs that are sent to the reducer function. The reducer is responsible for processing the list of values associated with each word. For example, for an input line such as "to be or not to be", the mappers would emit (to,1), (be,1), (or,1), (not,1), (to,1), (be,1), and the reducers would combine them into (to,2), (be,2), (or,1), (not,1). In this application we have written the mapper and reducer in separate classes. The next code (WordCountMapper.java) presents the mapper function.

package org.undercloud.mapreduce.example2;

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class WordCountMapper extends MapReduceBase 
     implements Mapper<LongWritable, Text, Text, IntWritable> {
	private final IntWritable one = new IntWritable(1);
	private Text word = new Text();
	public void map(LongWritable key, Text value,
			OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
		String line = value.toString();
		StringTokenizer itr = new StringTokenizer(line.toLowerCase());
		while(itr.hasMoreTokens()) {
				word.set(itr.nextToken());
				output.collect(word, one);
		}
	}
}

As we can see in the mapper class, the default input format used by Hadoop presents each line of an input file as a separate input to the map function, as a (LongWritable key, Text value) pair. The mapper uses a StringTokenizer object to split the line into words. In addition, the OutputCollector object is passed in as a parameter; it receives the values to be delivered to the next stage of execution (the reducer).
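
Note that StringTokenizer splits only on whitespace, so punctuation stays attached to the surrounding characters; this is why tokens such as !! and !@ show up in the sample output further down. The following stand-alone snippet (illustration only) shows this behavior:

import java.util.StringTokenizer;

// Illustration of how the mapper tokenizes a line: whitespace-only splitting,
// so punctuation remains part of each token.
public class TokenizeDemo {
    public static void main(String[] args) {
        StringTokenizer itr = new StringTokenizer("Hello, world!! hello".toLowerCase());
        while (itr.hasMoreTokens()) {
            System.out.println(itr.nextToken()); // prints: hello,  world!!  hello
        }
    }
}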

In the reducer class, each reducer processes the list of values associated with a different word. The list of values is a list of 1's, which the reducer sums into a final count for each word. The reducer then emits the (word, count) output, which is written to an output file. The next code (WordCountReducer.java) presents the reducer function.

package org.undercloud.mapreduce.example2;

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class WordCountReducer extends MapReduceBase 
   implements Reducer <Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {

    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get(); // add up each 1 emitted for this word
    }

    output.collect(key, new IntWritable(sum));
  }
}

Recall that several instances of the mapper function, as well as of the reducer function, are created on different machines of the cluster, so each mapper instance processes a different portion (split) of the input files.

Finally, there is one more element in a MapReduce application: the driver. The driver initializes the job, configures the input data, and establishes where the output will be placed. The next code (WordCount.java) presents the driver, which is also the main class of our application.

package org.undercloud.mapreduce.example2;

import org.apache.hadoop.fs.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class WordCount {

  public static void main(String[] args) {
    JobClient client = new JobClient();
    JobConf conf = new JobConf(WordCount.class);

    // specify output types
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    // specify input and output dirs
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    // specify a mapper
    conf.setMapperClass(WordCountMapper.class);

    // specify a reducer
    conf.setReducerClass(WordCountReducer.class);
    conf.setCombinerClass(WordCountReducer.class);

    client.setConf(conf);
    try {
      JobClient.runJob(conf);
    } catch (Exception e) {
      e.printStackTrace();
    }
  }
}

The driver sets up a job that runs the program over all the files in the input directory (the first argument). The output from the reducers is written to files in the directory identified by the second argument. The configuration information needed to run the job is captured in the JobConf object. The mapping and reducing functions are identified by the setMapperClass() and setReducerClass() methods; note that WordCountReducer is also registered as a combiner with setCombinerClass(), which is valid here because summing counts is associative and commutative, so partial sums can be computed on the map side.
The data types emitted by the reducer are identified by setOutputKeyClass() and setOutputValueClass(). The input types fed to the mapper are controlled by the InputFormat. Input and output formats will be discussed in upcoming posts.

After running the application, we can see the result (the word counts) in the HDFS directory indicated as the second argument. We can inspect it by executing the following command:

[bautista@Zen-UnderLinx hadoop-0.21.0]$ bin/hadoop dfs -ls /user/bautista/output/wordcount
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

12/01/25 14:42:21 INFO security.Groups: Group mapping impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping; cacheTimeout=300000
12/01/25 14:42:22 WARN conf.Configuration: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
Found 2 items
-rw-r--r-- 3 bautista supergroup 0 2012-01-25 12:27 /user/bautista/output/wordcount/_SUCCESS
-rw-r--r-- 3 bautista supergroup 66674917 2012-01-25 12:27 /user/bautista/output/wordcount/part-00000
[bautista@Zen-UnderLinx hadoop-0.21.0]$

As we can see, the word counts are located in the file /user/bautista/output/wordcount/part-00000, which was created by the application. We can see its content using the following command:

bin/hadoop dfs -cat /user/bautista/output/wordcount/part-00000
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

12/01/25 14:52:21 INFO security.Groups: Group mapping impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping; cacheTimeout=300000
12/01/25 14:52:21 WARN conf.Configuration: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
^C! 147
!! 32
!!! 24
!!!! 5
!!!!! 1
!!!!!! 2
!*usa*! 1
!*~dianamite~*! 1
!3 2
!= 1
!?! 1
!@ 1
!@#$%^&*()_$#@!%^&*() 1
!djnicksilva! 1
!fiesta! 1
....

In the next post I’ll describe the different input and output data types for MapReduce applications.


First MapReduce Application (Tutorial part I)

In this post I'll try to explain how to develop a MapReduce application. Although this first program can be found in almost any MapReduce tutorial, I'll include it as the first of several samples in this tutorial series. MapReduce is a programming model and an associated implementation developed by Google for processing and generating large data sets (Dean and Ghemawat 2004). The approach taken by MapReduce applications is divide and conquer: a large problem is partitioned into smaller sub-problems that can be tackled in parallel by different machines in a cluster. Although I won't explain the MapReduce architecture here, I strongly recommend reading about it in the Yahoo MapReduce Guide.

The first program, called RecordCount, is a very simple application that counts the records of a dataset. The first step is to upload a data set into the Hadoop Distributed File System (HDFS). I've loaded a data set containing Amazon product information. You can get different data sets (including this one) from the following GitHub link; they have been published by the HackReduce folks (thanks!). To upload the data set into HDFS, we execute the following command:

[bautista@Zen-UnderLinx hadoop-0.21.0]$ bin/hadoop dfs -put datasets/amazon/amazon-memberinfo-locations.txt input/example1/amazon-memberinfo-locations.txt
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

11/10/25 20:22:10 INFO security.Groups: Group mapping impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping; cacheTimeout=300000
11/10/25 20:22:10 WARN conf.Configuration: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id

To verify that the data set has been loaded properly, we can run the following command:

[bautista@Zen-UnderLinx hadoop-0.21.0]$ bin/hadoop dfs -ls input/example1
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

11/10/25 20:28:44 INFO security.Groups: Group mapping impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping; cacheTimeout=300000
11/10/25 20:28:45 WARN conf.Configuration: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
Found 1 items
-rw-r--r--   1 bautista supergroup  142507999 2011-10-25 20:22 /user/bautista/input/example1/amazon-memberinfo-locations.txt

Once we have loaded the data set, the next step is to write the code of our MapReduce application. I've used Eclipse Helios Service Release 2 to write the RecordCount.java program, since there is a very simple plugin which allows us to run MapReduce applications directly on the Hadoop cluster. The Eclipse plugin (hadoop-eclipse-plugin-0.20.3-SNAPSHOT.jar) can be downloaded from the Eclipse plugin page of the Hadoop Wiki; however, in some Eclipse versions this plugin does not work properly, so it may be necessary to download an earlier version (hadoop-0.20.2-eclipse-plugin.jar) from the following link.

Once the plugin has been downloaded, it must be placed in the plugins subdirectory of the Eclipse installation. The complete set of steps to install and configure the Eclipse plugin can be found in the tutorial "Hadoop on windows with eclipse". The next step is to write the RecordCount.java program, which is shown below:

package org.undercloud.mapreduce.example1;

import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class RecordCount {
	public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
		private final static IntWritable one = new IntWritable(1);
		private Text record = new Text();
		public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
			record.set("record");
			output.collect(record, one);
		}
	}

	public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
		public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
			int sum = 0;
			while (values.hasNext()) {
				sum += values.next().get();
			}
			output.collect(key, new IntWritable(sum));
		}
	}

	public static void main(String[] args) throws Exception {
		JobConf conf = new JobConf(RecordCount.class);
		conf.setJobName("recordcount");
		conf.setOutputKeyClass(Text.class);
		conf.setOutputValueClass(IntWritable.class);
		conf.setMapperClass(Map.class);
		conf.setCombinerClass(Reduce.class);
		conf.setReducerClass(Reduce.class);
		conf.setInputFormat(TextInputFormat.class);
		conf.setOutputFormat(TextOutputFormat.class);
		FileInputFormat.setInputPaths(conf, new Path(args[0]));
		FileOutputFormat.setOutputPath(conf, new Path(args[1]));
		JobClient.runJob(conf);
	}
}

It is important to mention that the Hadoop libraries must be added to the Java build path of our project in order to run the program from Eclipse. The Hadoop MapReduce libraries are located in the lib directory of the Hadoop distribution and must be referenced from our project.

Before running the MapReduce application, it is necessary to configure the Hadoop plugin. To do this, the first step is to open the MapReduce perspective using Window -> Open Perspective -> Other, and then selecting the MapReduce perspective.

After this, we must configure the Hadoop plugin's connection options. This can be done from the MapReduce perspective -> right click -> Edit. According to my Hadoop cluster configuration, the fields are filled in as follows (you should adapt them to your own setup).

Once this is done, the next step is to configure the arguments of our MapReduce program. We do this through the run configuration of our program, using the following arguments:

The first argument is the location from which the program reads its data (/user/bautista/input/example1), previously uploaded into HDFS. The second argument (/user/bautista/output/example1) is the HDFS location where the program will put its results. Now we are able to run the program by selecting it in the project explorer -> right click -> Run As -> Run on Hadoop, after which the job's progress output appears in the Eclipse console.
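
Alternatively, if you prefer the command line, the compiled classes can be packaged into a jar and launched with the hadoop command. A sketch of such an invocation (the jar name is just an example) would be:

bin/hadoop jar recordcount.jar org.undercloud.mapreduce.example1.RecordCount input/example1 output/example1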

We can see the result (the record count) in the HDFS directory that was indicated as the second argument by running the following command:

[bautista@Zen-UnderLinx hadoop-0.21.0]$ bin/hadoop dfs -ls /user/bautista/output/example1
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

11/10/26 13:15:24 INFO security.Groups: Group mapping impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping; cacheTimeout=300000
11/10/26 13:15:24 WARN conf.Configuration: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
Found 2 items
-rw-r--r--   3 bautista supergroup          0 2011-10-26 12:32 /user/bautista/output/example1/_SUCCESS
-rw-r--r--   3 bautista supergroup         15 2011-10-26 12:32 /user/bautista/output/example1/part-00000
[bautista@Zen-UnderLinx hadoop-0.21.0]$

As we can see, the total number of records counted is located in the file /user/bautista/output/example1/part-00000, which was created by our application. We can see its content (the total number of records) with the following command:

[bautista@Zen-UnderLinx hadoop-0.21.0]$ bin/hadoop dfs -cat /user/bautista/output/example1/part-00000
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

11/10/26 13:27:52 INFO security.Groups: Group mapping impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping; cacheTimeout=300000
11/10/26 13:27:52 WARN conf.Configuration: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
record	2237769
[bautista@Zen-UnderLinx hadoop-0.21.0]$

As we can see, the total number of records counted is 2237769, shown in the last output line. In the next post I'll explain the Map and Reduce classes of this program, as well as the configuration methods. We will also restructure the application into separate classes.
