Apache Hadoop Tutorial for Beginners

The Apache Hadoop open source software stack runs on a cluster of machines and provides distributed storage and processing for very large data sets. The objectives of this Hadoop tutorial are to get you started with Hadoop and to introduce the code and homework submission mechanisms. The tutorial covers the Hadoop fundamentals along with HDFS, MapReduce, YARN, and related components.

Hadoop is open source software, released under the Apache License, Version 2.0. Here, you will discover how to write, compile, test, and run a straightforward Hadoop program. The first section of the topic serves as a tutorial and covers a basic introduction. The second section asks you to create your own Hadoop program. Although completing the tutorial portion is optional, make sure you submit your work on time. This tutorial must be finished on your own.

Hadoop Tutorial For Beginners

Hadoop Introduction

The Hadoop framework offers the power and flexibility to perform data processing tasks that were previously impractical. Doug Cutting, the creator of the popular text search library Apache Lucene, developed Hadoop. Hadoop grew out of Apache Nutch, an open source web search engine that was itself part of the Lucene project. The name Hadoop is a made-up word rather than an acronym.

HDFS – Hadoop Distributed File System

HDFS, which stands for Hadoop Distributed File System, is the distributed file system that comes with Hadoop. HDFS is designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.

In order to reduce the cost of seeks, HDFS blocks are larger than disk blocks. If the block is large enough, the time to transfer the data from the disk is significantly longer than the time to seek to the start of the block. A large file made up of many blocks can therefore be read at close to the disk transfer rate.

A quick calculation shows that if the seek time is around 10 ms and the transfer rate is 100 MB/s, then to make the seek time 1% of the transfer time we need a block size of around 100 MB. The default is actually 128 MB, although many HDFS installations use even larger block sizes.
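Spelled out, the arithmetic behind that figure (using the same assumed numbers) is:

\[
\text{transfer time for a 100 MB block} = \frac{100\ \text{MB}}{100\ \text{MB/s}} = 1\ \text{s},
\qquad
\frac{\text{seek time}}{\text{transfer time}} = \frac{10\ \text{ms}}{1000\ \text{ms}} = 1\%
\]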

File System Operations

Now that the file system is ready for use, we can perform all of the common file system operations, such as reading files, creating directories, moving files, deleting data, and listing directories. You can type hadoop fs -help to get detailed help on every command.


Start by copying a file from the local file system to HDFS:

% hadoop fs -copyFromLocal input/docs/quangle.txt \
hdfs://localhost/user/tom/quangle.txt

This command invokes Hadoop’s file system shell command fs, which supports a number of subcommands; in this case, we are running -copyFromLocal. The local file quangle.txt is copied to the file /user/tom/quangle.txt on the HDFS instance running on localhost. In fact, we could have omitted the scheme and host of the URI and picked up the default, hdfs://localhost, as specified in core-site.xml:

% hadoop fs -copyFromLocal input/docs/quangle.txt /user/tom/quangle.txt
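For reference, that default file system is controlled by the fs.defaultFS property in core-site.xml. A minimal sketch for a single-machine setup looks like this (the value shown is an assumption matching the localhost example above):

<?xml version="1.0"?>
<!-- core-site.xml: minimal sketch; the hostname is an assumption for a
     single-machine (pseudo-distributed) setup. -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost/</value>
  </property>
</configuration>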

We also could have used a relative path and copied the file to our home directory in HDFS, which in this case is /user/tom:

% hadoop fs -copyFromLocal input/docs/quangle.txt quangle.txt

Let’s copy the file back to the local file system and check whether it’s the same:

% hadoop fs -copyToLocal quangle.txt quangle.copy.txt
% md5 input/docs/quangle.txt quangle.copy.txt
MD5 (input/docs/quangle.txt) = e7891a2627cf263a079fb0f18256ffb2
MD5 (quangle.copy.txt) = e7891a2627cf263a079fb0f18256ffb2

The MD5 digests are the same, showing that the file survived its trip to HDFS and is back intact.

Finally, let’s look at an HDFS file listing. We create a directory first just to see how it is displayed in the listing:

% hadoop fs -mkdir books
% hadoop fs -ls .
Found 2 items
drwxr-xr-x - tom supergroup 0 2014-10-04 13:22 books
-rw-r--r-- 1 tom supergroup 119 2014-10-04 13:21 quangle.txt

Hadoop File Systems

Hadoop has an abstract notion of file systems, of which HDFS is just one implementation. The Java abstract class org.apache.hadoop.fs.FileSystem represents the client interface to a file system in Hadoop, and there are several concrete implementations of it.
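To make that abstraction concrete, here is a minimal sketch in Java (the class name FileSystemCat is our own, and the URI is assumed to point at a running HDFS instance) that reads a file through the generic FileSystem API and copies it to standard output:

import java.io.InputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Reads a file from a Hadoop file system and writes it to standard output.
public class FileSystemCat {
    public static void main(String[] args) throws Exception {
        String uri = args[0];                     // e.g. hdfs://localhost/user/tom/quangle.txt
        Configuration conf = new Configuration(); // picks up core-site.xml, hdfs-site.xml
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        InputStream in = null;
        try {
            in = fs.open(new Path(uri));          // works for any FileSystem implementation
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}

Because the program is written against FileSystem rather than a specific implementation, the same code can read from the local file system or any other supported file system simply by changing the URI scheme.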

YARN

The Hadoop cluster resource management system is called Apache YARN (Yet Another Resource Negotiator). YARN was introduced in Hadoop 2 to improve the MapReduce implementation, but it is general enough to support other distributed computing paradigms as well.

Although YARN provides APIs for requesting and working with cluster resources, user code rarely uses these APIs directly. Instead, users write to higher-level APIs provided by distributed computing frameworks, which are themselves built on YARN and hide the resource management details from the user.

Distributed computing frameworks such as MapReduce run as YARN applications on the cluster compute layer (YARN), on top of the cluster storage layer (HDFS and HBase).

YARN provides its core services through two types of long-running daemons: a resource manager (one per cluster) to manage the use of resources across the cluster, and node managers (one running on every node in the cluster) to launch and monitor containers.
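As a rough sketch of how these daemons are wired together, a minimal yarn-site.xml tells each node manager where the resource manager runs and enables the auxiliary shuffle service that MapReduce needs (the hostname below is an assumption for a small test cluster):

<?xml version="1.0"?>
<!-- yarn-site.xml: minimal sketch; the resource manager hostname is an assumption. -->
<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>resourcemanager.example.com</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>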

YARN Scheduling

In an ideal world, the requests that a YARN application makes would be granted immediately. In the real world, however, resources are limited, and on a busy cluster an application will often need to wait for some of its requests to be fulfilled. It is the job of the YARN scheduler to allocate resources to applications according to a set of defined policies.

MapReduce

MapReduce is a programming model for data processing. The model is simple, yet expressive enough to write useful programs in. MapReduce programs can be written in a variety of languages and run on Hadoop.

A MapReduce job works by breaking the processing into two phases: the map phase and the reduce phase. Each phase has key-value pairs as input and output, and the programmer chooses the types of these pairs as well as the map and reduce functions themselves.

Because the inputs and outputs of the map and reduce functions are simply key-value pairs, MapReduce has a straightforward data processing model. This section examines the MapReduce model, focusing on the ways in which it can be used with data in a variety of formats, from plain text to complex binary objects.
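To make the model concrete, here is a minimal word count sketch in Java using the org.apache.hadoop.mapreduce API (the class names WordCount, TokenizerMapper, and IntSumReducer are illustrative): the map phase emits a (word, 1) pair for every word in the input, and the reduce phase sums the counts for each word.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: for each input line, emit (word, 1) for every word in the line.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts for each word and emit (word, total).
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // combiner reuses the reducer
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged into a JAR, a job like this is submitted with the hadoop jar command, passing the input and output paths as arguments.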

This Apache Hadoop tutorial for beginners explained what Hadoop is and gave a brief introduction to the important Hadoop concepts: HDFS, YARN, and MapReduce. I will be adding more posts to the Hadoop tutorial, so please bookmark this post for future reference.
