Hadoop Interview Questions and Answers
Hadoop Interview Questions and answers for beginners and experts. Here is a list of frequently asked Hadoop interview questions with answers by Besant Technologies. We hope these Hadoop interview questions and answers are useful and will help you get the best job in the networking industry. These Hadoop interview questions and answers were prepared by Hadoop professionals based on MNC companies' expectations. Stay tuned; we will update new Hadoop interview questions with answers frequently. If you want practical Hadoop training, please go through our Hadoop Training in Chennai and Hadoop Training in Bangalore.
Best Hadoop Interview Questions and answers
Besant Technologies supports students by providing Hadoop interview questions and answers for job placements and job purposes. Hadoop is a leading, in-demand course at present because of the large number of job openings and the high salaries paid for Hadoop and related roles. We also provide Hadoop online training for students around the world through the Gangboard medium. These top Hadoop interview questions and answers were prepared by our institute's experienced trainers.
Hadoop Interview Questions and answers for the job placements
Here is the list of the most frequently asked Hadoop interview questions and answers in technical interviews. These questions and answers are suitable for both freshers and experienced professionals at any level. The questions are aimed at intermediate to somewhat advanced Hadoop professionals, but even if you are just a beginner or fresher you should be able to understand the answers and explanations given here.
Before explaining about Kafka Producer, we first have to know about what Kafka is and why it came into existence.
Kafka is an open source API cluster for processing stream data.
Kafka Includes these Core API’s – Producer API, Consumer API, Streams API, Connect API
The use cases of the Kafka APIs are – Website Activity Tracking, Messaging, Metrics, Log Aggregation, Event Sourcing, Stream Processing and Commit Log.
Let’s go in detail about Producer API:
These API’s are mainly used for Publishing and Consuming Messages using Java Client.
The Kafka Producer API (Apache) has a class called “KafkaProducer” which takes the Kafka broker configuration in its constructor and provides the following methods – send, flush and metrics.
Send Method-
e.g. producer.send(new ProducerRecord<byte[],byte[]>(topic, partition, key, value), userCallback);
In the above example code-
ProducerRecord – a key/value pair that is sent to the Kafka cluster; its constructor takes the topic, partition, key and value as parameters. The producer maintains a buffer of such records waiting to be sent.
UserCallback – It is a User callback function to execute when the record has been acknowledged by the server. If it is null that means there is no callback.
Flush Method – this method blocks until all previously sent records have actually completed, i.e. it makes sure all buffered messages are really sent.
e.g. public void flush ()
Metrics – the metrics() method returns the producer’s internal metrics at runtime. A related method, partitionsFor(), returns the partition metadata for a given topic and is useful when doing custom partitioning.
e.g. public Map metrics()
After all the send requests are completed, we need to call the close method to release the producer’s resources.
e.g. public void close()
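Putting these methods together, here is a minimal, hedged Java sketch of a producer; the broker address localhost:9092 and the topic name my-topic are placeholder values, not taken from the text above.
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SimpleProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        KafkaProducer<String, String> producer = new KafkaProducer<>(props);
        // send() is asynchronous; the callback runs once the broker acknowledges the record
        producer.send(new ProducerRecord<>("my-topic", "key1", "value1"),
                (metadata, exception) -> {
                    if (exception != null) exception.printStackTrace();
                });
        producer.flush(); // block until all buffered records are actually sent
        producer.close(); // always close the producer when finished
    }
}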
Overview of Kafka Producer API’s:
There are 2 types of producers i.e. Synchronous (Sync) and Asynchronous (Async)
Sync – this producer sends messages directly and waits for each send to complete before continuing with other messages.
e.g. kafka.producer.SyncProducer
Async – Kafka provides an asynchronous send method to send a record to a topic. The big difference from the sync case is that the call returns immediately, and we usually pass a callback (often a lambda expression) that is invoked when the record has been acknowledged.
e.g. kafka.producer.async.AsyncProducer.
Example Program-
class Producer
{
/* the data that is partitioned by key to the topic is sent using either the synchronous or the asynchronous producer */
public void send(kafka.javaapi.producer.ProducerData<K,V> producerData);
public void send(java.util.List<kafka.javaapi.producer.ProducerData<K,V>> producerData);
/* finally, close the producer to clean up */
public void close();
}
A Monad is a class for wrapping objects, e.g. identity with unit and bind with map. It provides the two operations below:
identity (return in Haskell, unit in Scala)
bind (>>= in Haskell, flatMap in Scala)
Scala doesn’t have a built-in monad type, so we need to model the monad ourselves. However, libraries built on Scala, such as Scalaz, do have a built-in Monad type class, together with the related theory family of applicatives, functors, monoids and so on.
A sample program that models a monad with a generic trait in Scala, providing methods like unit() and flatMap(), is below. Let’s denote the monad by M for short.
trait M[A] {
  def flatMap[B](f: A => M[B]): M[B]
}
def unit[A](x: A): M[A]
Apache Flume provides a reliable and distributed system for collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store.
This work, informally referred to as Flume NG, has gone through two internal milestones – NG Alpha 1 and NG Alpha 2 – and a formal incubator release of Flume NG is in the works.
The core concepts of Flume NG are – Event, Flow, Client, Agent, Source, Channel and Sink. These core concepts make up the architecture of Flume NG and help it achieve this objective.
An interceptor is a Flume plug-in that listens to incoming events and can alter an event’s content on the fly.
e.g. an interceptor implementation for JSON data.
The main channel types of Flume-NG are Memory Channel, JDBC Channel, Kafka Channel, File Channel, Spillable Memory Channel, Pseudo Transaction Channel.
In basic Flume, we have channel types like memory, JDBC, file and Kafka.
A base class is a class that facilitates the creation of other classes; in object-oriented programming, a class that inherits from it is referred to as a derived class. The derived class implicitly reuses the code of the base class, except for constructors and destructors.
The base-class concept is the same in both Java and Scala; only the syntax differs. In Scala the relevant keywords are extends and override, as in the Base and Derived classes below.
Ex:
abstract class Base(val x: String)
final class Derived(x: String) extends Base("Base's " + x) {
  override def toString = x
}
Resilient Distributed Dataset(RDD) is core of Apache Spark which provides primary data abstraction.
These are features of RDDs:
Resilient means fault-tolerant: with the help of the RDD lineage graph, it is easy to re-compute missing or damaged partitions caused by the failure of any node.
Distributed means this feature works with data residing on multiple nodes in a cluster.
Dataset means collection of partitioned data with primitive values or values of values, e.g. tuples or other objects.
Fault tolerance can be defined as the proper functioning of the system without any data loss, even if some hardware components of the system fail. This property lets Hadoop compute large data sets with parallel and distributed algorithms in the cluster without failures, using the heart of Hadoop, i.e. MapReduce.
Immutability is the idea that data or objects cannot be modified once they are created. This concept underpins Hadoop’s ability to compute large data sets without data loss or failures. Programming languages like Java and Python treat strings as immutable objects, which means we cannot change them once created.
We have three modes in which Hadoop can run and which are:
Standalone (local) mode: Default mode of Hadoop, it uses the local file system for input and output operations. This mode is used for debugging purpose, and it does not support the use of HDFS.
Pseudo-distributed mode: In this case, you need to configure all three configuration files (core-site.xml, hdfs-site.xml and mapred-site.xml). All daemons run on one node, and thus both Master and Slave nodes are on the same machine.
Fully distributed mode: This is the production phase of Hadoop where data is distributed across several nodes on a Hadoop cluster. Different nodes are allotted as Master and Slaves.
The Hadoop Distributed File System (HDFS) is formatted using the bin/hadoop namenode -format command. This command formats HDFS via the NameNode and is only used the first time. Formatting the file system means initializing the directory specified by the dfs.name.dir variable. If you execute this command on an existing filesystem, you will delete all the data stored on your NameNode. Formatting the NameNode does not format the DataNodes.
The masters file contains information about Secondary NameNode server location.
The three important hdfs-site.xml properties are:
dfs.name.dir gives the location where the NameNode stores its metadata (FsImage and edit logs). dfs.data.dir gives the location where the DataNodes store the data. fs.checkpoint.dir is the directory on the filesystem where the Secondary NameNode stores the temporary images of the edit logs, which are to be merged with the FsImage for backup.
The map output is stored in an in-memory buffer; when this buffer is almost full, then spilling phase begins in order to transport the data to a temp folder.
Map output is first written to a buffer, and the buffer size is decided by mapreduce.task.io.sort.mb; by default it is 100 MB.
When the buffer reaches a certain threshold, it starts spilling the buffer data to disk. This threshold is specified in mapreduce.map.sort.spill.percent.
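As an illustration, these buffer and spill settings can be tuned on the job configuration. A minimal Java sketch; the values below are only examples, and 0.80 is assumed to be the usual default spill threshold:
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SpillTuning {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        // size of the in-memory map-output buffer in MB (example value)
        conf.setInt("mapreduce.task.io.sort.mb", 256);
        // start spilling buffer contents to disk at 80% full (assumed default threshold)
        conf.setFloat("mapreduce.map.sort.spill.percent", 0.80f);
        Job job = Job.getInstance(conf, "spill-tuning-example");
        // ... the rest of the job setup (mapper, reducer, paths) would follow here
    }
}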
We require password-less SSH in a fully distributed environment because, when the cluster is live and working in fully distributed mode, the communication is very frequent. The DataNode and the NodeManager should be able to send messages to the master server quickly.
A Hadoop cluster is an isolated cluster and generally has nothing to do with the internet; it has a different kind of configuration. We don’t need to worry about that kind of security breach, for instance someone hacking in through the internet. Hadoop also has a very secure way of connecting to other machines to fetch and process data.
When the ResourceManager is not working, it will not be functional (for submitting jobs), but the NameNode will still be available. So the cluster is still accessible if the NameNode is working, even if the ResourceManager is not in a working state.
This is one of the important questions, as fully distributed mode is used in the production environment, where we have ‘n’ number of machines forming a Hadoop cluster. Hadoop daemons run on a cluster of machines. There is one node on which the NameNode runs and other nodes on which DataNodes run. NodeManagers are installed on every DataNode and are responsible for executing tasks on that DataNode. The ResourceManager manages all these NodeManagers; it receives the processing requests and passes the relevant parts of each request to the corresponding NodeManagers.
The expansion of fsck is File System Check. The Hadoop Distributed File System supports the fsck command to check for different inconsistencies. It is designed to report problems with the files in HDFS, for example missing blocks of a file or under-replicated blocks.
hadoop fs -copyFromLocal localfilepath hdfsfilepath
Yes, it is possible to set the number of reducers to zero in MapReduce (Hadoop).
When the number of reducers is set to zero, no reducers will be executed, and the output
of each mapper process will be stored to a separate file on HDFS.
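In the job driver this is a single call; a fragment sketch, assuming a Job object named job has already been created:
job.setNumReduceTasks(0);   // zero reducers: the job becomes map-only
// each mapper's output is then written directly to HDFS (files typically named part-m-00000, part-m-00001, ...)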
To optimize the performance in Hive queries, we can use Map-side Join in Hive. We will use Map-Side Join when one of the tables in the join is small in size and can be loaded into primary memory.
That way the join can be performed within a mapper process without using the reduce step.
A managed table stores its data in the /user/hive/warehouse/tablename folder, and once you drop the table, the data is lost along with the table schema.
An external table stores its data in a user-specified location, and once you drop the table, only the table schema is lost; the data is still available in HDFS for further use.
Bucketing – The bucketing concept is mainly used for data sampling. We can use Hive bucketing on both managed and external tables. Bucketing is performed on a single column, and the values of this column are distributed into a number of buckets by using a hash algorithm. Bucketing is an optimization technique and improves performance.
Partitioning – We can partition by one or more columns, and sub-partitioning (a partition within a partition) is allowed. In static partitioning, we have to specify the partition values manually while loading the data, whereas in dynamic partitioning the number of partitions is decided by the number of unique values in the partitioned column.
create table tablename ( var1 datatype1, var2 datatype2, var3 datatype3 ) PARTITIONED BY (var4 datatype4, var5 datatype5) ROW FORMAT DELIMITED FIELDS TERMINATED BY 'delimiter' LINES TERMINATED BY '\n' TBLPROPERTIES ("skip.header.line.count"="1")
split-by is used for parallel import/export of data between HDFS and an RDBMS with multiple mappers, so that the workload can be distributed into multiple parts.
split-by specifies the column of the table used to generate the splits for the import, i.e. which column should be used to create the splits.
Generally, select min(split-by column) from table and select max(split-by column) from table decide the outer boundaries for the splits (the boundary query). We need to define the column used to create splits for parallel imports; otherwise, Sqoop splits the workload based on the primary key of the table.
Syntax: bin/sqoop import --connect jdbc:mysql://localhost/database --table tablename --split-by column
Delimited Text and sequenceFile
Delimited text is the default import file format; it can be specified explicitly with --as-textfile.
SequenceFile is a binary format (--as-sequencefile).
The default number of mappers in a sqoop command is 4.
The maximum number of mappers depends on many variables:
Database type.
Hardware that is used for your database server.
Impact to other requests that your database needs to process.
External data source ==> Source ==> Channel ==> Sink ==> HDFS
ps
Partitions allow us to store the data in different sub-folders under the main folder, based on a partitioned column.
Static Partitions: User has to load the data into static partitioned table manually.
Dynamic Partitions: We can load the data from a non-partitioned table to partitioned table using dynamic partitions.
set hive.exec.dynamic.partition = true
set hive.exec.dynamic.partition.mode = nonstrict
set hive.exec.max.dynamic.partitions = 10000
set hive.exec.max.dynamic.partitions.pernode = 1000
ORC File format – Optimized Row Columnar file format
RC File format – Row Columnar file format
TEXT File format – Default file format
Sequence file format – If the size of a file is smaller than the data block size in Hadoop, we can consider it as a small file. Due to this, metadata increases which will become an overhead to the NameNode. To solve this problem, sequence files are introduced. Sequence files act as containers to store multiple small files.
Avro file format
Custom INPUT FILE FORMAT and OUTPUT FILE FORMAT
create table tablename ( var1 datatype1, var2 datatype2, var3 datatype3 ) PARTITIONED BY (var4 datatype4, var5 datatype5) CLUSTERED BY (var1) INTO 5 BUCKETS ROW FORMAT DELIMITED FIELDS TERMINATED BY 'delimiter' LINES TERMINATED BY '\n' TBLPROPERTIES ("skip.header.line.count"="1")
A custom partitioner is a mechanism that allows us to route map output to different reducers based on a user-defined condition. By setting a partitioner to partition by the key, we can guarantee that records with the same key go to the same reducer.
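A minimal sketch of such a partitioner; the Text/IntWritable types and the hash-based routing rule are example choices, and the class name is made up:
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class KeyHashPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // map-only job: everything goes to a single output
        if (numReduceTasks == 0) {
            return 0;
        }
        // records with the same key always land in the same reducer
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
// registered in the driver with: job.setPartitionerClass(KeyHashPartitioner.class);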
Hive supports sortby – sort the data per reducer and orderby – sort the data for all reducers (mean sort the total data)
Zoo Keeper assists in cluster management.
Manage configuration across nodes: Hadoop cluster will have hundreds of systems. Zoo Keeper helps in synchronization of configurations across the cluster.
As many systems are involved, race condition and deadlocks are common problems when implementing distributed applications.
A race condition occurs when a system tries to perform two or more operations at the same time; this is taken care of by the serialization (ordering) property of ZooKeeper.
Deadlock is when two or more systems try to access same shared resource at the same time. Synchronization helps to solve the deadlock.
Partial failure of a process can lead to uncertainty about the data. ZooKeeper handles this through atomicity, which means either the whole process finishes or nothing persists after a failure.
bin/sqoop import --connect jdbc:mysql://localhost/database --table table_name --incremental lastmodified --check-column column_name --last-value 'value' -m 1
MR1 – It consists of a JobTracker and TaskTrackers (for processing) and a NameNode and DataNodes (for storage). It supports only the MapReduce framework.
MR2 – The JobTracker has been split into two parts: the ApplicationMaster (one per MR job) and the ResourceManager (only one). It supports the MapReduce framework as well as other frameworks too (Spark, Storm).
give the same results in both the scenarios.
Explode – explodes an array of values into individual rows.
Syntax – select pageid, adid from page LATERAL VIEW explode (adid_list) mytable as adid;
implode (collect_set()/collect_list()) – aggregates records from multiple rows into an array or map; it is the opposite of explode().
syntax – select userid, collect_set(actor_id) from actor group by userid;
Interceptors are designed to modify or drop events in flight. Flume is designed to pick up data from a source and deliver it to a sink.
Timestamp Interceptor: adds the timestamp at which the event was processed to the event header.
Host Interceptor: writes the hostname or IP address of the host on which the agent or process is running to the event header.
Static Interceptor: adds a static string with a static header to all events.
UUID Interceptor: UUID stands for Universally Unique Identifier; this sets a UUID on all events that are intercepted.
Search and Replace Interceptor: searches for a string and replaces it with a value in the event data.
Regex Filtering Interceptor: used to include/exclude events. It filters events selectively by interpreting the event body as text and matching it against a configured regular expression.
Regex Extractor Interceptor: extracts matches of a configured regular expression from the event body.
HDFS – Hadoop Distributed File system
GFS – Google File System
MapR File system
Ceph File system
IBM General Parallel file system (GPFS)
First we need to enter the Pig shell with the useHCatalog option (pig -useHCatalog).
A = LOAD 'tablename' USING org.apache.hive.hcatalog.pig.HCatLoader();
A = LOAD 'airline.airdata' USING org.apache.hive.hcatalog.pig.HCatLoader();
sqoop import --connect jdbc:mysql://localhost/database --table table_name --where "time_stamp > day(now()-1)"
select_statement UNION [ALL | DISTINCT] select_statement
MINUS keyword is not available in Hive
INTERSECT keyword is not available in Hive
Distribute by – Distribute the data among n reducers (un-sorted manner).
Cluster by – Distribute the data among n reducers and sort the data (Distribute by and sort by).
order by – sort the data for all reducers.
sort by – sort the data per reducer.
dfs.name.dir gives the location where the NameNode stores its metadata (FsImage and edit logs) and where DFS is located – on the local disk or on a remote directory.
dfs.data.dir gives the location where a DataNode stores its data.
fs.checkpoint.dir is the directory on the filesystem where the Secondary NameNode stores the temporary images of the edit logs, which are to be merged with the FsImage for backup.
Hadoop was developed as the solution to the Big Data problem. Hadoop is described as the framework that offers a suite of tools and services in order to store and process Big Data. It also plays a relevant role in analysing big data and making effective business decisions when it is difficult to do so with conventional methods. Hadoop offers a vast toolset that makes it possible to store and process data very quickly. Here are the main components of Hadoop:
1. HDFS
2. Hadoop MapReduce
3. YARN
4. PIG and HIVE – The Data Access Components.
5. HBase – For Data Storage
6. Apache Flume, Sqoop, Chukwa – The Data Integration Components
7. Ambari, Oozie and ZooKeeper – Data Management and Monitoring Component
8. Thrift and Avro – Data Serialization components
9. Apache Mahout and Drill – Data Intelligence Components
1. Text Input Format: The text input format is the default input format in Hadoop.
2. Sequence File Input Format: This input format is used to read files in order.
3. Key Value Input Format: This input format is used for plain text files where each line is split into a key and a value.
YARN stands for Yet Another Resource Negotiator; it is the Hadoop processing framework. YARN manages resources and establishes an execution environment for the processes.
1. The Hadoop framework uses commodity hardware, and this is one of the great features of the Hadoop framework: even a common DataNode crash in a Hadoop cluster is handled gracefully.
2. Ease of scaling is yet another primary feature of the Hadoop framework; the cluster can be scaled in line with the rapid increase in data volume.
In Hadoop, Rack Awareness is defined as the algorithm by which the NameNode decides how blocks and their replicas are stored in the Hadoop cluster. This is done via rack definitions that minimize the traffic between DataNodes within the same rack. Let’s take an example – we know that the default value of the replication factor is 3. According to the “Replica Placement Policy”, two replicas of every block of data are stored in a single rack, whereas the third copy is stored in a different rack.
In Hadoop, Speculative Execution is a process that takes place during the slower execution of a task at a node. In this process, the master node starts executing another instance of that same task on a different node. The task which is completed first is accepted, and the execution of the other is stopped by killing it.
1. The Hadoop framework is built on Google MapReduce, which is based on Google’s Big Data file system (GFS).
2. The Hadoop framework can solve many questions efficiently for Big Data analysis.
Yahoo (running Hadoop at scale), Facebook (developed Hive for analysis), Amazon, Adobe, Spotify, Netflix, eBay and Twitter are other well-known, established companies that are using Hadoop.
1. RDBMS is designed to store structured data, whereas Hadoop can store any kind of data i.e. unstructured, structured, or semi-structured.
2. RDBMS follows the “schema on write” method, while Hadoop is based on the “schema on read” policy.
3. The schema of data is already known in RDBMS, which makes reads fast, whereas HDFS performs no schema validation during writes, so writes are fast.
4. RDBMS is licensed software, so one has to pay for it, whereas Hadoop is open source software, so it is free of cost. 5. RDBMS is used for Online Transaction Processing (OLTP), whereas Hadoop is used for data analytics, data discovery, and OLAP systems as well.
Active NameNode – The NameNode that runs in the Hadoop cluster is the Active NameNode. Passive NameNode – The standby NameNode that stores the same data as the Active NameNode is the Passive NameNode. On the failure of the Active NameNode, the Passive NameNode replaces it and takes charge. In this way, there is always a running NameNode in the cluster, and thus it never fails.
1. Region Server: A table can be split into several regions. A group of these regions is served to clients by a Region Server.
2. HMaster: This coordinates and manages the Region Servers.
3. ZooKeeper: This acts as a coordinator inside the HBase distributed environment. It works by maintaining server state inside the cluster through communication in sessions.
The NameNode continuously receives a signal (heartbeat) from all the DataNodes in the Hadoop cluster, which indicates that the DataNode is functioning properly. The record of all the blocks present on a DataNode is stored in a block report. If a DataNode fails to send the signal to the NameNode, it is marked dead after a particular time period. Then the NameNode replicates/copies the blocks of the dead node to other DataNodes using the previously built replicas.
The process of NameNode recovery helps to keep the Hadoop cluster running, and can be described by the following steps –
Step 1: To start a new NameNode, use the file system metadata replica (FsImage).
Step 2: Configure the clients and DataNodes to acknowledge the new NameNode. Step 3: Once the new NameNode finishes loading the last checkpoint FsImage and receives enough block reports from the DataNodes, it starts serving the clients.
The different schedulers available in Hadoop are – COSHH – it schedules decisions by considering the cluster, the workload, and heterogeneity. FIFO Scheduler – it orders the jobs on the basis of their arrival time in a queue, without considering heterogeneity. Fair Sharing – it defines a pool for each user that contains a number of map and reduce slots on a resource; each user is allowed to use their own pool for the execution of jobs.
DataNodes are commodity hardware, since like laptops and personal computers they only need to store data, and they are needed in high numbers. In contrast, the NameNode is the master node; it stores metadata about all the blocks stored in HDFS. It needs a large memory space, and thus runs as a high-end machine with plenty of memory.
NameNode – The master node, responsible for storing the metadata of all directories and files, is known as the NameNode. It also holds metadata about each block of data and its allocation in the Hadoop cluster.
Secondary NameNode – This daemon merges the edit log with the FsImage and stores the updated Filesystem Image in persistent storage, which can be used in case the NameNode fails. DataNode – The slave node containing the actual data is the DataNode. NodeManager – Running on the slave machines, the NodeManager handles the launch of application containers, monitors resource usage and reports it to the ResourceManager.
ResourceManager – It is the central authority responsible for managing resources and scheduling applications running on top of YARN.
JobHistoryServer – It is responsible for keeping all the information about the MapReduce jobs after the Application Master stops working (terminates).
Checkpointing is a procedure that compacts an FsImage and edit log into a new FsImage. In this way, the NameNode loads the final in-memory state directly from the FsImage, instead of replaying the edit log. The Secondary NameNode is responsible for performing the checkpointing process. Benefit of Checkpointing: checkpointing is an extremely efficient process and reduces the startup time of the NameNode.
1.Fully-distributed mode
2. Pseudo-distributed mode
3. Standalone mode
To process a large data set in parallel across a Hadoop cluster, the Hadoop MapReduce framework is used. Data analysis uses a two-step map and reduce process.
In the word-count example, during the map stage MapReduce counts the words in each document, while during the reduce stage it aggregates the data per document across the entire collection. During the map stage, the input data is divided into splits, which are analysed by map tasks running in parallel across the Hadoop cluster.
The process by which the framework sorts the map outputs and transfers them to the reducers as their input is called the shuffle.
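As an illustration of the map and reduce stages described above, here is a standard word-count sketch in Java (a minimal version; the job setup is omitted):
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
    // Map stage: emit (word, 1) for every word in the input split
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce stage: after the shuffle, sum the counts for each word
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}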
Distributed Cache is an important feature provided by the MapReduce framework. Whenever you want to share some files across all nodes in a Hadoop cluster, the Distributed Cache is used. The files can be executable JAR files or simple properties files.
The NameNode is a node in Hadoop where Hadoop stores all the file location information for HDFS (the Hadoop Distributed File System). In other words, the NameNode is the centrepiece of the HDFS file system. It keeps a record of all the files in the file system and tracks the file data across the cluster or on multiple machines.
The JobTracker is used in Hadoop to submit and monitor MapReduce jobs. The JobTracker runs in its own JVM process.
The JobTracker performs the following actions in Hadoop:
The client application submits jobs to the JobTracker.
The JobTracker contacts the NameNode to determine the data location.
The JobTracker locates TaskTracker nodes near the data or with available slots.
It submits the work to the chosen TaskTracker nodes.
If a task fails, the JobTracker is notified and decides what to do next.
The TaskTracker nodes are monitored by the JobTracker.
A heartbeat is a signal used between a DataNode and the NameNode, and between a TaskTracker and the JobTracker; if the NameNode or JobTracker does not respond to the signal, it is considered that there is some issue with the DataNode or TaskTracker.
Combiners are used to increase the efficiency of MapReduce. The amount of data transferred to the reducers can be reduced with the help of a combiner. If the operation performed is commutative and associative, you can use your reducer code as the combiner. The execution of the combiner is not guaranteed.
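For a sum-style aggregation such as word count, the reducer class itself can often be reused as the combiner. A fragment sketch, assuming a Job object named job and a reducer class like the IntSumReducer shown earlier:
// The combiner runs on the map side, on each node's local map output.
// Because addition is commutative and associative, the reduce class can double as the combiner.
job.setCombinerClass(WordCount.IntSumReducer.class);
// The framework may run the combiner zero, one or many times, so the job must
// remain correct even if the combiner never runs.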
If a DataNode fails:
The JobTracker and NameNode detect the failure.
All the tasks on the failed node are re-scheduled.
The NameNode replicates the user’s data to another node.
During speculative execution in Hadoop, duplicate copies of certain tasks are launched: on another slave node, multiple copies of the same map or reduce task can be executed. In simpler terms, if a particular node is taking a long time to complete a task, Hadoop creates a duplicate task on another node. The output of the task that finishes first is retained, and the other, unfinished copies are killed.
The basic parameters of a Mapper are:
LongWritable and Text (input key and value)
Text and IntWritable (output key and value)
The function of the MapReduce partitioner is to make sure that all the values of a single key go to the same reducer, which ultimately helps to distribute the map output evenly over the reducers.
The logical division of the data is called a split, while the physical division of the data is called an HDFS block.
In the text input format, each line in the text file is a record. The byte offset of the line is the key and the content of the line is the value. For example, key: LongWritable, value: Text.
The MapReduce Framework user must specify
The job’s input location(s) in the distributed file system
The job’s output location in the distributed file system
Input format
Output format
Class with map functionality
Class containing the reduce function
JAR file containing the mapper, reducer and driver classes (see the driver sketch below)
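A hedged driver sketch specifying these items follows; the paths come from command-line arguments and the mapper/reducer classes reuse the word-count sketch shown earlier (all names here are examples):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word-count");
        job.setJarByClass(WordCountDriver.class);               // JAR containing mapper, reducer and driver

        FileInputFormat.addInputPath(job, new Path(args[0]));   // job input location in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // job output location in HDFS

        job.setInputFormatClass(TextInputFormat.class);         // input format
        job.setOutputFormatClass(TextOutputFormat.class);       // output format

        job.setMapperClass(WordCount.TokenizerMapper.class);    // class containing the map function
        job.setReducerClass(WordCount.IntSumReducer.class);     // class containing the reduce function

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}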
WebDAV is a set of extensions for HTTP to support editing and updating the files. In most operating systems WebDAV shares can be loaded into file systems, so you can access HDFS as a standard file system by introducing HDFS to WebDAV.
The TaskTracker sends heartbeat messages to the JobTracker every few minutes to ensure that the JobTracker is active and working. The message also informs the JobTracker about the number of available slots, so the JobTracker stays up to date on where in the cluster work can be delegated.
SequenceFileInputFormat is used to read files in sequence. It is a specific compressed binary file format which is optimized for passing data between the output of one MapReduce job and the input of another MapReduce job.
Conf.setMapperClass sets the mapper class and all things related to the map job, such as reading the data and generating key-value pairs out of the mapper.
It is an open-source software framework for storing data and running applications on clusters of commodity hardware. It offers massive processing power and massive storage for any type of data.
RDBMS vs Hadoop:
RDBMS is a relational database management system, whereas Hadoop is a node-based flat structure.
RDBMS is used for OLTP processing, whereas Hadoop is currently used for analytics and for BIG DATA processing.
In RDBMS, the database cluster uses the same data files stored in shared storage, whereas in Hadoop the storage data is stored locally on each processing node.
In RDBMS you need to pre-process (structure) the data before storing it, whereas in Hadoop you don’t.
Hadoop includes these key elements:
HDFS
Map Reduce.
In Hadoop, the NameNode stores all the file location information for HDFS. It is the master node, which works along with the JobTracker and holds the metadata.
The data access components that Hadoop uses are:
Pig
Hive
The data storage component used by Hadoop is HBase.
The most common input formats defined in Hadoop are:
TextInputFormat
KeyValueInputFormat
SequenceFileInputFormat.
It divides the input files into chunks and assigns each split to a mapper for processing.
To write a custom partitioner for a Hadoop job, you follow this path:
Create a new class that extends the Partitioner class
Override the getPartition method
In the wrapper that runs MapReduce
Add the custom partitioner to the job programmatically using the set Partitioner class method, or add the custom partitioner to the job as a config file.
No, the number of mappers cannot be changed directly. The number of map tasks is determined by the number of input splits.
To store binary key/value pairs, sequence files are used. Unlike regular compressed files, sequence files support splitting even when the data inside the file is compressed.
The NameNode is a single point of failure in HDFS, so your cluster becomes unavailable when the NameNode is down.
Hadoop has its own way of indexing. Depending on the block size, once the data is stored, HDFS keeps on storing the last part of the data, which points to where the next part of the data will be.
Yes, you can search for files using wildcards.
There are three configuration files
core-site.xml
mapred-site.xml
hdfs-site.xml.
Besides using the jps command, you can check whether the NameNode is working by using:
/etc/init.d/hadoop-0.20-namenode status.
In Hadoop, a map is a phase in HDFS query solving. A map reads data from an input location and outputs a key-value pair according to the input type.
In Hadoop, the reducer collects the output produced by the mapper, processes it, and creates a final output of its own.
In Hadoop, the hadoop-metrics.properties file controls reporting.
To use Hadoop, the list of network requirements is as follows:
Password-less SSH connection
A secure shell (SSH) to start server processes.
Rack awareness determines how to set blocks based on rack definitions.
A TaskTracker in Hadoop is a slave node daemon in the cluster that accepts tasks from a JobTracker. It sends heartbeat messages to the JobTracker every few minutes to confirm that the JobTracker is still alive.
The master node runs daemons such as the NameNode and JobTracker.
Each slave node runs the TaskTracker and DataNode daemons.
Popular methods of debugging Hadoop code:
By using the web interface provided by the Hadoop framework
By using counters
Storage node: the machine or computer where your file system resides to store the data being processed.
Compute node: the machine or computer where the actual business logic is executed.
The Context object allows the mapper to interact with the rest of the Hadoop system. It includes configuration data for the job as well as interfaces which allow it to emit output.
The output of the Mapper or MapTask is sorted, and partitions are created for that output.
In Hadoop, the default partition is the “hash” partitioner.
In Hadoop, the RecordReader loads the data from its source and converts it into (key, value) pairs suitable for reading by the Mapper.
RDBMS | HDFS |
Based on structured data. | Any type of data can be used. |
Limited processing capacity. | Processing in parallel manner. |
Schema on write. | Schema on read. |
Read is faster than write. | Write is faster than read. |
Licensed | Open source |
Big data refers to a group of complex and large data. It is difficult to be processed with RDBMS tools. It is also not easy to capture, store, share, search, transfer, visualize and analyze this data. But it is helpful to make business-related decisions after deriving value from it.
5 Vs of Big data are:
- Volume: Amount of data, growing with an exponential rate.
- Velocity: Rate of data growth. Social media is the biggest contributor here.
- Variety: Heterogeneity of types of data. It could be videos, CSV, audios, etc.
- Veracity: Uncertainty of data due to the inconsistency of data and its incompleteness.
- Value: Big data turned into some value is useful. It should add benefits to the organization.
Hadoop was the solution to the problem of Big Data. Apache Hadoop provides a framework for different tools, helping in processing and storage of Big Data. This is useful in making business decisions by analyzing Big Data. It has the following components:
- Storage unit
- Processing framework
Hadoop Distributed File System or HDFS is the storage module of Hadoop, responsible for storage of various kinds of data. It does so by using blocks of distributed environment. The topology used here is master-slave topology.
It has the following two components:
- NameNode: Master node of the distributed environment. It is used to maintain the metadata information of the blocks of data stored in HDFS.
- DataNode: Slave nodes. They manage storage of the actual data in HDFS and are managed by the NameNode.
YARN offers a processing framework for Hadoop. It handles resources and helps in providing an environment for execution of the processes.
Components:
- ResourceManager: Gets processing requests, passes them to NodeManagers accordingly for actual processing.
- NodeManager: Installed on all DataNode, responsible for the task execution
Various daemons are NameNode, Secondary NameNode, DataNode, ResourceManager, NodeManager and JobHistoryServer.
Roles played by them:
- NameNode: Master node is used to store metadata related to files as well as directories.
- Datanode: A Slave node where actual data is contained.
- Secondary NameNode: Used to periodically merge the edit-log changes with the FsImage and store the modified FsImage in persistent storage.
- ResourceManager: The chief authority that performs resource management and application scheduling.
- NodeManager: Runs on the slave machines. It launches the application’s containers, monitors the resource usage and reports the information to the ResourceManager.
- JobHistoryServer: Information regarding MapReduce jobs is maintained after the termination of Application Master.
- NAS or Network-attached storage is a file-level storage server. It is used for providing access to data to heterogeneous set of clients. It is either hardware or software and offers services for accessing and storing files.
- On the other hand, Hadoop Distributed File System or HDFS, represents a distributed file system and is used to store data with commodity hardware.
- Data Blocks in HDFS are distributed over all the machines. Whereas in NAS, a dedicated hardware is used to store data.
- HDFS works easily with MapReduce paradigm. NAS is not meant for MapReduce.
- HDFS makes use of commodity hardware and it is cost-effective, while a NAS uses high-end devices for storage and is of high cost.
Hadoop 1 | Hadoop 2 |
Failure point is NameNode. | On the failure of active NameNode, passive one takes charge. |
MRV 1 Processing | MRV 2 Processing |
Not all tools can be used for processing. | Can be used via YARN |
Single (active) NameNode only | Active and Passive NameNodes |
The Active NameNode works in the cluster and the Passive NameNode has the same data as the Active one, standing by as a backup. Thus there is never a state when the cluster has no NameNode, and this is why the cluster never fails.
Commodity hardware means DataNodes crash often. As the data volume grows, the Hadoop framework can be scaled up. This is the reason why a Hadoop administrator needs to add or remove DataNodes in the cluster.
HDFS provides exclusive writes.
- When one client wants to write in the file, NameNode provides lease to create this file.
- When another client tries to use the same file to write in it, NameNode rejects the request as first client is still writing in the file.
From time to time, the NameNode receives a signal from every DataNode, which indicates that the DataNode is functioning properly. If no signal is received after a particular time period, the DataNode is considered to be not working properly. The NameNode then replicates every block of the non-functioning node to a different DataNode using the replicas created earlier.
- A fresh NameNode is started with the FsImage.
- The clients and DataNodes are configured so that they acknowledge the presence of this new NameNode.
- The NameNode starts serving clients after it completes loading from the last checkpoint and has received a good number of block reports from the DataNodes.
Checkpointing is an approach that takes an FsImage and edit log and compacts them into a new FsImage. Replaying the edit log is then not required: the NameNode loads the final in-memory state directly from the FsImage. It proves to be an efficient operation and reduces the startup time of the NameNode. The Secondary NameNode is used for performing checkpointing.
Data stored on HDFS is replicated to many DataNodes by the NameNode. The replication factor is 3 by default and can be changed as per need. Once a DataNode goes down, the NameNode automatically copies the data to a different node using the replicas. That way the data stays available, and HDFS becomes fault-tolerant.
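The replication factor of an existing file can also be changed programmatically. A minimal Java sketch; the file path and the new factor are example values:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ChangeReplication {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // raise the replication factor of one file from the default of 3 to 5 (example values)
        fs.setReplication(new Path("/data/important.csv"), (short) 5);
        fs.close();
    }
}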
A DataNode is commodity hardware because, like a laptop, it just stores data, and DataNodes are required in big numbers. The NameNode is the master node; it keeps metadata for every block kept in HDFS. High RAM is required by the NameNode, so it is a high-end machine with a good amount of memory space.
HDFS is better suited for a small number of large files than for a large number of small ones. The NameNode stores the metadata about the file system in RAM, so the amount of memory limits the number of files that can be kept in the HDFS file system; many small files eventually mean a lot of metadata, and storing it in RAM becomes a challenge.
A block in HDFS is the smallest contiguous location on the hard drive used to store data. HDFS distributes the data stored as blocks over the Hadoop cluster. Also, files are kept as block-sized chunks, stored as independent units.
In Hadoop 1, the default block size is 64 MB and in Hadoop 2, the default block size is 128 MB.
The jps command helps to check which Hadoop daemons are running.
Let’s say a node runs a task very slowly; the master node will redundantly execute another instance of the same task on another node. The task that completes first is accepted, and the other is ignored or killed. This is called speculative execution in Hadoop.
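Speculative execution can be turned on or off per job. A fragment sketch, assuming a Job object named job; the property names are the MRv2 ones and the values shown are only examples:
Configuration conf = job.getConfiguration();
conf.setBoolean("mapreduce.map.speculative", true);     // allow speculative attempts for map tasks
conf.setBoolean("mapreduce.reduce.speculative", false); // disable them for reduce tasks (example choice)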
This can be done using command /sbin/hadoop-daemon.sh start namenode.
HDFS Block represents data’s physical division while Input Split represents logical division. HDFS partitions data in blocks to store the blocks together, while MapReduce partitions the data into the split to submit it to mapper function.
- Standalone
- Pseudo-distributed
- Fully distributed
MapReduce is a framework used to process huge data sets on computers cluster via parallel programming.
- Input and output location of Job in distributed file system
- Input and output data format
- Class with map function
- Class with reduce function
We can’t. This is because mapper function doesn’t support sorting. It occurs in reducer only.
It is a class. It takes data from the source, converts it into pair of (key, value) and makes it available for “Mapper” to read.
It is a provision given by the MapReduce framework for caching files required by applications. After caching, it becomes available with every data node.
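A fragment sketch of the newer Job-based API, assuming a Job object named job in the driver and the Mapper's Context on the task side; the HDFS path is a placeholder:
// Driver side: register a small lookup file so every task gets a local copy
job.addCacheFile(new java.net.URI("/apps/lookup/countries.txt"));   // placeholder HDFS path

// Task side (e.g. in Mapper.setup()): locate the cached copy on the local node
// URI[] cachedFiles = context.getCacheFiles();
// ... open cachedFiles[0] and load it into an in-memory map ...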
Reducers can’t communicate.
It is used to make all values of same key go to one reducer.
- Create a class extending Partitioner Class
- getPartition method is overridden, in the wrapper running in MapReduce.
- Custom partitioner is added to the job
It is a mini “reducer” and it performs the “reduce” task locally. It gets the input of the “mapper” on a specific “node” then it gives the output to the “reducer”.
This is an input format. It is used for reading in sequence files. It has been optimized to pass information between input and output of job of MapReduce.
Apache Pig represents a platform. It is used to analyze huge data sets that are also represented as data flows. It provides MapReduce abstraction and thus reduces program writing complexities of MapReduce.
Atomic: These are also called scalar types. They are int, long, float, double, bytearray and chararray.
Complex Data Types: Bag, Map, Tuple.
- Order by
- For each
- Group
- Filters
- Join
- Limit
- Distinct
User Defined Functions (UDFs) are used when some operations are not built in; they are used to create the required functionality.
The “SerDe” allows instruction of “Hive” about processing of a record. It is a combined form of “Serializer” with “Deserializer”.
It is used in unit testing. Multiple users cannot use it at the same time.
It is in /user/hive/warehouse.
It is an open source, distributed, multidimensional, scalable, NoSQL database. It is written in Java. It provides BigTable like capabilities, fault-tolerant ways, high throughput, etc.
- Region Server
- HMaster
- ZooKeeper
WAL i.e. Write Ahead Log
Block Cache
MemStore
HFile
A file attached to every Region Server inside the distributed environment is known as the WAL (Write Ahead Log). It stores new data that has not yet been persisted and is used to recover data sets after a failure.
Q114) What are HBase properties?
- Schema-less
- Column-oriented
- Stores de-normalized information
- Sparsely populated tables
- Automated partitioning
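To tie the storage pieces above together, here is a minimal Java client sketch of writing one row; the put is first recorded in the Region Server's WAL and buffered in the MemStore before being flushed to HFiles. The table name 'users' and column family 'cf' are placeholder values.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBasePutExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) {  // placeholder table
            Put put = new Put(Bytes.toBytes("row1"));                           // row key
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("name"), Bytes.toBytes("alice"));
            table.put(put);  // goes to the WAL, then the MemStore of the owning Region Server
        }
    }
}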
It is a framework used in real-time analytics of data in distributed computing. It runs in-memory computations and increases the data processing speed.
One can do that for a particular Hadoop version.
- It stands for Resilient Distribution Datasets.
- It is a set of operational elements running in parallel.
It coordinates with multiple services of distributed environment. A lot of time is saved by synchronization, grouping, naming, configuration maintenance etc.
It is integrated with Hadoop stack and supports many jobs like “Java MapReduce”, Pig, “Streaming MapReduce”, Sqoop, “Hive” etc.
This is an algorithm where “NameNode” decides the placement of blocks with their replicas. This is based on definitions of racks so that network traffic can be minimized between “DataNodes” in the same rack.