Sunday, October 7, 2012

Installing Hadoop on Ubuntu (12.04) - single node


Installing Java
sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java7-installer
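To confirm the JDK installed correctly, you can check the version it reports (the exact version string will depend on what the installer fetched):

$ java -version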

Creating user
$ sudo addgroup hadoop
$ sudo adduser --ingroup hadoop hduser
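A quick way to verify that the account and its group membership were created as expected:

$ id hduser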

Configuring SSH
su - hduser
ssh-keygen -t rsa -P ""
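One step that is easily missed here: the new public key must also be authorized for login on the local machine. Assuming the default key location used above, append it to authorized_keys:

$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys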

To be sure that the SSH installation went well, you can open a new terminal and try to create an SSH session as hduser with the following command:

$ssh localhost

If you cannot connect to localhost, reinstall the SSH server:
sudo apt-get install openssh-server
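After the install you can check that the SSH daemon is actually running (on Ubuntu 12.04 the service is called ssh):

$ sudo service ssh status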

Edit Sudoers
pkexec visudo

Add below line to add hduser into sudoers
hduser ALL=(ALL) ALL

Press Ctrl+O to save and Ctrl+X to exit (pkexec visudo opens the file in nano).

Disable IPv6
Run the following command using an account with sudo rights:
$sudo gedit /etc/sysctl.conf
This command opens sysctl.conf in a text editor; copy the following lines to the end of the file:

#disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1

If you face a problem telling you that you don't have permission, just remember to run the previous commands from your root account.
These changes require a system reboot, but alternatively you can run the following command to reload the configuration:

$sudo sysctl -p
To make sure that IPv6 is disabled, you can run the following command (it should print 1):
$cat /proc/sys/net/ipv6/conf/all/disable_ipv6

Configuration of Hadoop
Installing Hadoop

Now we can download Hadoop to begin the installation. Go to Apache Downloads and download Hadoop version 0.20.2. To avoid permission issues, download the tar file into the hduser home directory, for example /home/hduser.

Then you need to extract the tar file and rename the extracted folder to 'hadoop'. Open a new terminal and run the following commands:

$ cd /home/hduser
$ sudo tar xzf hadoop-0.20.2.tar.gz
$ sudo mv hadoop-0.20.2 hadoop
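Because the archive was extracted with sudo, the files end up owned by root; handing them over to hduser (the user and group created earlier) avoids permission errors later:

$ sudo chown -R hduser:hadoop /home/hduser/hadoop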

Update $HOME/.bashrc
You will need to update .bashrc for hduser (and for every user who needs to administer Hadoop). Open the file as root:

$sudo gedit /home/hduser/.bashrc

Then add the following configuration at the end of the .bashrc file:

# Set Hadoop-related environment variables

export HADOOP_HOME=/home/hduser/hadoop

# Set JAVA_HOME (we will also configure JAVA_HOME directly for Hadoop later on)

export JAVA_HOME=/usr/lib/jvm/java-7-oracle

# Some convenient aliases and functions for running Hadoop-related commands

unalias fs &> /dev/null
alias fs="hadoop fs"
unalias hls &> /dev/null
alias hls="fs -ls"

# If you have LZO compression enabled in your Hadoop cluster and
# compress job outputs with LZOP (not covered in this tutorial):
# Conveniently inspect an LZOP compressed file from the command
# line; run via:
#
# $ lzohead /hdfs/path/to/lzop/compressed/file.lzo
#
# Requires installed 'lzop' command.
#
lzohead () {
    hadoop fs -cat "$1" | lzop -dc | head -1000 | less
}

# Add Hadoop bin/ directory to PATH
export PATH=$PATH:$HADOOP_HOME/bin
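For the new variables and aliases to take effect in the current shell, reload the file; asking Hadoop for its version is a quick sanity check that PATH is set correctly (it should report 0.20.2):

$ source ~/.bashrc
$ hadoop version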

Hadoop Configuration

Now we need to configure the Hadoop framework on the Ubuntu machine. The following are the configuration files we will edit. To learn more about Hadoop configuration, see the official Hadoop documentation.

hadoop-env.sh
We only need to update the JAVA_HOME variable in this file. Simply open the file in a text editor using the following command:

$sudo gedit /home/hduser/hadoop/conf/hadoop-env.sh

or

nano /home/hduser/hadoop/conf/hadoop-env.sh

Then you will need to change the following line

# export JAVA_HOME=/usr/lib/j2sdk1.5-sun

To 

export JAVA_HOME=/usr/lib/jvm/java-7-oracle

Note: if you face an "Error: JAVA_HOME is not set" error while starting the services, it seems that you forgot to uncomment the previous line (just remove the #).

core-site.xml
First, we need to create a temp directory for the Hadoop framework. If you need this environment for testing or quick prototyping (e.g. developing simple Hadoop programs for your personal tests), I suggest creating this folder under the /home/hduser/ directory; otherwise you could create it in a shared location (like /usr/local), but you may face some security issues there (such as java.io.IOException at runtime). To avoid those exceptions, I have created the tmp folder under the hduser space.

To create this folder, type the following command:

$ sudo mkdir /home/hduser/tmp
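Note that mkdir via sudo leaves the directory owned by root, so hand it over to hduser first (the same ownership scheme suggested for additional users below):

$ sudo chown hduser:hadoop /home/hduser/tmp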

Please note that if you want to add another admin user (e.g. hduser2 in the hadoop group), you should grant it read and write permissions on this folder using the following commands:


$ sudo chown hduser2:hadoop /home/hduser/tmp

$ sudo chmod 755 /home/hduser/tmp
Now we can open hadoop/conf/core-site.xml to edit the hadoop.tmp.dir entry.
Open core-site.xml in a text editor:

$sudo gedit /home/hduser/hadoop/conf/core-site.xml

or

nano /home/hduser/hadoop/conf/core-site.xml

Then add the following properties between the <configuration> ... </configuration> elements:

<property>
  <name>hadoop.tmp.dir</name>
  <value>/home/hduser/tmp</value>
  <description>A base for other temporary directories.</description>
</property>

<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
  <description>The name of the default file system.  A URI whose
  scheme and authority determine the FileSystem implementation.  The
  uri's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class.  The uri's authority is used to
  determine the host, port, etc. for a filesystem.</description>
</property>

mapred-site.xml
We will open hadoop/conf/mapred-site.xml in a text editor and add the following property (just like core-site.xml):

nano /home/hduser/hadoop/conf/mapred-site.xml

<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
  <description>The host and port that the MapReduce job tracker runs
  at.  If "local", then jobs are run in-process as a single map
  and reduce task.</description>
</property>

hdfs-site.xml
Open hadoop/conf/hdfs-site.xml in a text editor and add the following property:

nano /home/hduser/hadoop/conf/hdfs-site.xml

<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication.
  The actual number of replications can be specified when the file is created.
  The default is used if replication is not specified at create time.</description>
</property>

Formatting NameNode

You need to format the NameNode of your HDFS before using the cluster for the first time. Do not do this while the system is running; it is usually done only once, when you first install the cluster.
Run the following command:

$/home/hduser/hadoop/bin/hadoop namenode -format

NameNode Formatting

Starting Hadoop Cluster

You will need to navigate to the hadoop/bin directory and run the ./start-all.sh script.
cd /home/hduser/hadoop/bin/
./start-all.sh

Starting Hadoop Services using ./start-all.sh
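Once the daemons are up, this Hadoop release also serves web status pages you can check from a browser: the NameNode UI at http://localhost:50070 and the JobTracker UI at http://localhost:50030 (the default ports in the 0.20.x line).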

There is a nice tool called jps. You can use it to ensure that all the services are up.

Using jps tool
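On a healthy single-node setup, jps should report one JVM per Hadoop daemon, along these lines (the process IDs will of course differ):

$ jps
2287 TaskTracker
2149 JobTracker
1938 DataNode
2085 SecondaryNameNode
2349 Jps
1788 NameNode

To shut the cluster down again, use the companion script:

$ /home/hduser/hadoop/bin/stop-all.sh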


EDIT: I observed many changes in the folder structure of the latest releases of Hadoop. For those releases, the updated configuration files are as follows:
core-site.xml:
<property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
</property>

mapred-site.xml:
<property>
    <name>mapred.job.tracker</name>
    <value>hdfs://localhost:9001</value>
</property>

hdfs-site.xml:
<property>
    <name>dfs.replication</name>
    <value>1</value>
</property>


hadoop-env.sh:
# The java implementation to use.
export JAVA_HOME="$(readlink -f /usr/bin/javac | sed "s:/bin/javac::")"
export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-"/etc/hadoop"}



