
TRANSCRIPT

MY CLASS NOTES

Hello and welcome back. In the previous videos,
we discussed in depth the architecture of HDFS
and MapReduce.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
In this section of the videos, we will explain in
detail how to set up Hadoop. For this video, it is
essential that you have downloaded the required
software. Primarily, we require a virtual machine
which has Linux installed, along with VMware
Player and the PuTTY utility, all of which are
free downloads. So, without much ado, let us start
off.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Before we move on to setting up Hadoop, it is
really important for us to recall Hadoop's
architecture, so let us quickly do that.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
What does running Hadoop actually mean? It means
running a set of daemons, or processes, on
different machines. These processes handle the two
core components of Hadoop: HDFS, which provides
distributed storage, and MapReduce, which provides
the parallel processing framework. The daemons
have specific roles. Some exist on only one
server, whereas others exist across multiple
servers.

1|Page
Jigsaw Academy Education Pvt Ltd


>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
There are three major categories of machine roles
in a Hadoop deployment: the master nodes, the
slave nodes and the client machines.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
The master nodes run two key processes. The first
is the NameNode, which oversees and coordinates
the data storage function, that is, HDFS. It keeps
track of where the data is stored rather than
holding the data itself. The second is the
JobTracker, which oversees and coordinates the
parallel processing of data using MapReduce.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
We also have the slave nodes, which make up the
vast majority of machines and do all the dirty work
of storing the data and running the computations.
The slave nodes run two processes. The DataNode
communicates with and receives instructions from
its master, the NameNode, and is responsible for
storing the data in HDFS. The TaskTracker receives
instructions from its master, the JobTracker, and
is responsible for running your business logic as
a MapReduce program.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Lastly, we have the client machines, which have
Hadoop installed on them and whose purpose is to
load data into the cluster, submit MapReduce jobs
and retrieve their results.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Essentially, we have many processes running in a
Hadoop cluster: the NameNode, the secondary
NameNode, the DataNodes, the JobTracker and the
TaskTrackers.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
There are three ways you can set up Hadoop, based
on where you run these processes: the local or
stand-alone mode, the pseudo-distributed mode or
single-node Hadoop cluster, and finally the
distributed mode or multi-node Hadoop cluster. We
will need to change certain configuration files to
change the mode in which Hadoop is set up. In the
case of distributed mode, in addition to the
configuration file changes, we also need multiple
machines. We will discuss this more as we get to
that stage. First, let us get down to setting up
Hadoop.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
There are a couple of prerequisites for setting up
Hadoop. Firstly, we have the operating system.
Linux is the official development and production
platform for Hadoop. A significant number of
production clusters run on Red Hat Enterprise
Linux or its freely available sister, CentOS.
Ubuntu, SUSE Linux Enterprise and Debian
deployments also exist in production and work
perfectly well.
It is important to note that Apache Hadoop version
2.2 onwards also supports running Hadoop on
Windows, but in our class we are going to use
Ubuntu Linux. The second prerequisite for running
Hadoop is Java version 1.6 or higher.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
What we are going to do is set up Hadoop on our
virtual machine, so please download the VMware
virtual machine which has Ubuntu Linux 12.04
installed on it. Hopefully you have already
downloaded it. If not, the link to download this
virtual machine is here, and it is free. Please
also note the usernames and passwords for the
virtual machine.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
After you download, you should see a file called
Ubuntu 1204.zip. Now, you just have to unzip this
file to a folder and remember the path. In the
interest of time, I have already done this, but in
general, it will take a minute or so to unzip the
files. Now, here is my unzipped folder. In order
to run this, you need to download and install
VMware player, which is also a free download.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

Go to the link below and download the appropriate
version.
Install VMware Player just as you would install
any other Windows software. After you have
installed it, launch VMware Player from the Start
menu in order to open the Ubuntu virtual machine
you have downloaded.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
I have already launched VMware Player. Click on
Open a Virtual Machine, go to the folder where you
unzipped the Ubuntu virtual machine, select the
VMX file and click on Open.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Before you play the virtual machine, edit the
virtual machine settings. Make sure you choose at
least 2 GB of RAM. You also have the option of
changing the virtual machine name. Then click on
OK.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Now click on Play virtual machine. This will take
a couple of minutes to load. It will also prompt
you to choose whether you want the virtual machine
to be started every time you start the machine, so
choose No for this. It might also prompt you for
updates, so choose No for now.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>


Once the virtual machine is up and running, I can
now launch a terminal inside the virtual machine
and perform my Linux operations.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Let us run some basic Linux commands to check
that the virtual machine is working fine.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Let me launch a terminal window and run some
basic Linux commands. whoami tells me who the
current user is, ls lists the current directory
and pwd prints out the present working directory.
Next, and importantly, we need to update the local
installation packages with the latest updates from
their sources, so to do this type the following
command. It prompts you for the password. This
will take a few seconds to complete. As you can
see, various packages are being downloaded from
the repositories. Once this is done, next we need
to set up SSH.
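
The commands for this step are shown on screen in the video rather than read out, so what follows is only a sketch of what it could look like. The first three commands are the ones named above. The apt-get update line is an assumption about the update command on an Ubuntu virtual machine, and RUN_UPDATE is a hypothetical switch added here so that nothing privileged runs unless you ask for it.

```shell
# Basic checks that the terminal in the virtual machine works.
whoami   # prints the current user name
ls       # lists the contents of the current directory
pwd      # prints the present working directory

# Assumed update command for Ubuntu; it needs sudo rights and network access.
# RUN_UPDATE is a hypothetical switch: set it to "yes" on the VM to run it.
if [ "${RUN_UPDATE:-no}" = "yes" ]; then
    sudo apt-get update   # prompts for the password, then refreshes the package lists
fi
```

On the virtual machine itself you would most likely just type the sudo line directly. The switch is only there to keep the sketch harmless when it is copied somewhere else.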
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
SSH stands for Secure Shell. It is basically a
program to log into another computer over a
network, to execute commands on a remote machine
and to move files from one machine to another. It
provides strong authentication and secure
communications over insecure channels.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Now, we will install OpenSSH by typing the below
command. Type yes at the prompt. This command
downloads and installs the OpenSSH server and the
client as well. Let us wait for the command to
run successfully.
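
The install command itself is not spoken in the video, so this is a hedged sketch rather than the exact command used. It assumes the openssh-server package from the standard Ubuntu repositories, which also brings in the client. RUN_INSTALL is a hypothetical switch added here so that the install only runs when you explicitly enable it.

```shell
# Assumed install command for Ubuntu; needs sudo rights and network access.
# RUN_INSTALL is a hypothetical switch: set it to "yes" on the VM to run it.
if [ "${RUN_INSTALL:-no}" = "yes" ]; then
    sudo apt-get install openssh-server   # answer yes at the prompt
fi

# Report whether an SSH client is available on this machine.
if command -v ssh >/dev/null 2>&1; then
    ssh_status="installed"
else
    ssh_status="missing"
fi
echo "ssh client: $ssh_status"
```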
Now that SSH is set up, I can also connect to this
Linux virtual machine from the Windows host.
Now, you might wonder: we are already on the
desktop of the virtual machine and have launched
the terminal, so why not perform the necessary
operations in that terminal instead of connecting
to the virtual machine from Windows? Well, you can
do this and it would work just fine. However,
remember that in most environments, whether you
are using Amazon Web Services or you have a Hadoop
cluster in your own data centre, you will not be
able to log in directly to a NameNode or to the
client and access the terminal from the desktop.
You will invariably have to use utility tools like
PuTTY to connect to the nodes of your cluster.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Essentially, we are doing this just so you get
used to using PuTTY, but what is PuTTY? PuTTY
basically allows a Windows user to securely
connect to remote systems, let us say your Linux
server. This will be over the internet via the
Telnet or SSH protocols. Using SSH allows you to
connect to remote systems via a secure shell,
thereby encrypting information before it is
transferred.


>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
For our class, we have a Windows operating system
which has an Ubuntu Linux virtual machine
configured on top of it. We will now connect to
our Linux virtual machine using PuTTY. In order to
do that, I need either the IP address or the host
name of the virtual machine.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
To get the IP address of the virtual machine, type
the below command. What you see here is the IP
address of my virtual machine.
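
The command is only visible on screen in the video, so this is a sketch under the assumption that it is one of the usual address-listing tools. ifconfig was the common choice on Ubuntu 12.04, and the ip command is what newer systems ship. The variable name vm_addresses is made up for this example.

```shell
# Pick whichever address-listing tool exists on this machine.
if command -v ifconfig >/dev/null 2>&1; then
    vm_addresses="$(ifconfig)"
elif command -v ip >/dev/null 2>&1; then
    vm_addresses="$(ip addr show)"
else
    vm_addresses="no address tool found"
fi
echo "$vm_addresses"   # look for the inet entry of the network interface
```

The address to type into PuTTY is the inet value of the virtual machine's network interface, not the 127.0.0.1 loopback entry.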
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Now let us open PuTTY. For that, we have to
double-click the putty.exe application file, enter
the IP address and then hit the Enter key. You
will be prompted to choose whether to store the
server's host key in the registry. Choose Yes,
otherwise you will get this prompt every time you
connect to this virtual machine.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>


This time we will log in as the root user. The
authentication was successful, so you can see that
we have now logged into our virtual machine using
PuTTY as the root user. Let us quickly run some
basic commands just to verify. This time, the
whoami command gives the user name as root. Let us
quickly check the IP address. It is the same IP
address as that of our virtual machine.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Next, since Hadoop is written in Java, we need to
have Java installed on the machine. We can do this
by typing the below commands.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
These commands basically add a repository from
which we can download the latest install files
for Java. Now let us run the update command to
get the latest install files from the repository
that we just added. Once the reading of the
package lists is done, we can finally run the
command to install Java version 1.7.
Choose Yes, hit the Enter key to accept the
agreement and use the left arrow key to choose Yes
for the license terms. This command will first
download and then install Java on our virtual
machine. This will take a bit of time to complete.
Okay, as you can see, the download of Java is done
and the installation is still being completed, so
let us give it a little more time.


Okay, once the Java installation is done, we need
to set the JAVA_HOME variable to the path of our
Java installation. Since we did not provide a path
for installing Java, we need to find out where
Java was installed. You can use the find command
to look for files whose names start with java-, so
you can find out where Java was installed.
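
A search like the one just described could be sketched as follows. The exact options used in the video are not spoken, so the depth limit and the error handling here are assumptions made to keep the search quick and quiet. The variable name java_candidates is made up for this example.

```shell
# Search from the root directory for names starting with "java-".
# -maxdepth 4 keeps the search quick, errors from unreadable folders are hidden,
# and "|| true" ignores the non-zero status those folders cause.
cd /
java_candidates="$(find . -maxdepth 4 -name 'java-*' 2>/dev/null || true)"
echo "$java_candidates"
```

On Ubuntu, packaged Java installations normally live under /usr/lib/jvm, so that is where the interesting results tend to appear. If nothing is printed, Java may be installed deeper in the tree, or not at all.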
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Let us use the find command to do that. Note that
you need to be in the root directory to do this. It
gives you a list of results, typically in this
folder if it is Java version 1.7. Let us copy the
path. I am sure we will need it somewhere. Once
we have the path of our Java installation, let us
update our JAVA_HOME variable.
We can do this in two ways. One is by using an
export command. Let us check if this command
actually worked for us by echoing the variable. It
did. However, remember that the export command
applies only to the current session: if you close
the session and log in again later, the variable
will no longer be set.
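
As a minimal sketch of the export and echo steps: the path below is hypothetical, since the actual path is only shown on screen, so use whatever path the find command reported on your machine.

```shell
# Hypothetical install path; replace it with the path found on your machine.
export JAVA_HOME=/usr/lib/jvm/java-7-oracle
echo "$JAVA_HOME"   # prints /usr/lib/jvm/java-7-oracle
```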

To avoid this, we will simply include this export
command in the .bashrc file in your home
directory. But what is a .bashrc file? Well, we
know that Linux can have multiple shells, and a
shell is basically a program that takes your
commands from the keyboard and gives them to the
operating system to perform.
Bash, which stands for Bourne Again Shell, is the
most commonly used shell, and the .bashrc file is
typically used to set environment variables,
change prompts and define shell procedures. The
.bashrc file is read when a new terminal window is
launched or the shell is invoked the first time.
Now let us edit this file, which is in the home
directory. Let us go to the home directory of the
current user and list the files of the directory
by typing the command ls. I do not find the file
here. The reason for this is that .bashrc is a dot
file, and dot files are hidden. To view hidden
files, you should use the command ls -a.
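
The difference between ls and ls -a can be seen in a small sketch. It uses a scratch directory with made-up contents instead of the real home directory, so it can be run anywhere without touching your files. demo_home and notes.txt are hypothetical names.

```shell
# Scratch directory standing in for the home directory (hypothetical contents).
demo_home="$(mktemp -d)"
touch "$demo_home/.bashrc" "$demo_home/notes.txt"

ls "$demo_home"      # shows notes.txt only; the dot file is hidden
ls -a "$demo_home"   # also shows ., .. and .bashrc
```

On the virtual machine, the equivalent is to go to the home directory and run ls -a there.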
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
As you can see, the .bashrc file is listed here.
Let us edit the .bashrc file using the vi editor.
Scroll all the way down and paste the export
command. Once you have done this, save and exit by
pressing Escape and typing :wq.
Now let us quickly execute the file to reflect the
changes we made. For that, we use the source
command. source is basically a bash built-in
command that executes the contents of the file
passed to it as an argument in the current shell.
In our case, it is the .bashrc file. With that, we
have successfully installed Java and also updated
our JAVA_HOME variable. You need to edit the
.bashrc file of every user who will be logging
into the virtual machine using PuTTY.
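
The edit-and-source steps above can be sketched without vi. This is only an illustration under stated assumptions: a scratch file stands in for the real .bashrc so that nothing in your home directory changes, the Java path is hypothetical, and the line is appended with echo rather than pasted in an editor.

```shell
# Scratch file standing in for ~/.bashrc, so the real file is left untouched.
rc_file="$(mktemp)"

# Append the export line, as done in vi in the video (hypothetical Java path).
echo 'export JAVA_HOME=/usr/lib/jvm/java-7-oracle' >> "$rc_file"

unset JAVA_HOME      # start from a shell where the variable is not set
. "$rc_file"         # "source $rc_file" is the bash spelling of the same built-in
echo "$JAVA_HOME"    # prints /usr/lib/jvm/java-7-oracle
```

Because the file is executed in the current shell rather than in a child process, the variable is still set after the command returns, which is the reason for using source here.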
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

With this, we come to the end of the first part of
the videos on setting up Hadoop. In the next couple
of videos, we will talk in more detail about adding
dedicated users to manage Hadoop, setting up SSH
for Hadoop, etc.
Thank you for watching.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
