The NameNode is the single point of failure (SPOF) in a Hadoop 1.x configuration. It maintains the locations of all of the data blocks in the cluster. In Hadoop 1.x we have the concept of the Secondary NameNode, which holds a copy of the NameNode metadata. If your NameNode goes down, you can take the metadata copy stored on the Secondary NameNode and use it to resume your work once your NameNode is up again.
It is important to note that the Secondary NameNode is not a backup for the NameNode; it merely performs a checkpoint process periodically. The data is therefore almost certainly stale when recovering from a Secondary NameNode checkpoint. However, recovering from a NameNode failure using an old filesystem state is better than not being able to recover at all. Since it is possible to recover from a previous checkpoint generated by the Secondary NameNode, in case of NameNode failure Hadoop admins have to manually recover the data from the Secondary NameNode.
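As an alternative to copying the checkpoint files by hand, Hadoop 1.x also supports starting the NameNode with the -importCheckpoint option, which loads the latest checkpoint from the directory configured in fs.checkpoint.dir, provided dfs.name.dir does not already contain a valid image:
$ cd $HADOOP_HOME
$ bin/hadoop namenode -importCheckpoint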
In Hadoop 2.x, with the introduction of HA (High Availability), the Standby NameNode came into the picture. The Standby NameNode removes the problem of the SPOF (Single Point of Failure) that was there in Hadoop 1.x: it takes over from the Active NameNode if the latter fails, automatically if automatic failover is configured.
Moreover, enabling HA is not mandatory. But when it is enabled, you can't use the Secondary NameNode: either the Secondary NameNode is running or the Standby NameNode is, never both. For these reasons, adding high availability (HA) to the HDFS NameNode became one of the top priorities for the HDFS community.
In other words, in Hadoop 2.x you can have more than one NameNode. In case the primary NameNode goes down, the redundant NameNode can take over (either manually or automatically) so that your cluster doesn't stop working. In this implementation there is a pair of NameNodes in an active/standby configuration. In the event of the failure of the active NameNode, the standby takes over its duties to continue servicing client requests.
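As a rough sketch, an active/standby pair is declared in hdfs-site.xml along these lines (the nameservice name, NameNode IDs, and hostnames below are illustrative assumptions, not values from this article, and a complete HA setup also needs a shared edits directory and client failover settings):
<property>
<name>dfs.nameservices</name>
<value>mycluster</value>
</property>
<property>
<name>dfs.ha.namenodes.mycluster</name>
<value>nn1,nn2</value>
</property>
<property>
<name>dfs.namenode.rpc-address.mycluster.nn1</name>
<value>namenode1.example.com:8020</value>
</property>
<property>
<name>dfs.namenode.rpc-address.mycluster.nn2</name>
<value>namenode2.example.com:8020</value>
</property>
<property>
<name>dfs.ha.automatic-failover.enabled</name>
<value>true</value>
</property>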
Steps for recovering from a NameNode failure
1. Stop the Secondary NameNode:
$ cd $HADOOP_HOME
$ bin/hadoop-daemon.sh stop secondarynamenode
2. Bring up a new machine to act as the new NameNode. This machine should have Hadoop installed and configured exactly like the previous NameNode, and SSH password-less login should be set up. It should also have the same IP address and hostname as the previous NameNode.
3. Copy the contents of fs.checkpoint.dir on the Secondary NameNode to the dfs.name.dir folder on the new NameNode machine.
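For example, assuming fs.checkpoint.dir is /path/to/checkpoint and dfs.name.dir is /path/to/dfs/name (both paths and the newnamenode hostname are illustrative; check your hdfs-site.xml), you could run on the Secondary NameNode:
$ scp -r /path/to/checkpoint/* newnamenode:/path/to/dfs/name/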
4. Start the new NameNode on the new machine:
$ bin/hadoop-daemon.sh start namenode
5. Start the Secondary NameNode on the Secondary NameNode machine:
$ bin/hadoop-daemon.sh start secondarynamenode
6. Verify that the NameNode
started successfully by looking at the NameNode status page http://localhost:50070/.
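As an extra sanity check (not part of the original procedure), you can also confirm from the command line that the new NameNode is answering requests:
$ bin/hadoop dfsadmin -report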
Thus, first we logged into the Secondary NameNode and stopped the service. Next, we set up a new machine in the exact manner we set up the failed NameNode. Then we copied all of the checkpoint and edits files from the Secondary NameNode to the new NameNode; this allows us to recover the filesystem status, metadata, and edits as of the last checkpoint. Finally, we restarted the new NameNode and Secondary NameNode.
Additionally
Recovering using the old data is unacceptable for certain processing environments. Instead, another option is to set up some type of offsite storage where the NameNode can write its image and edits files. This way, if there is a hardware failure of the NameNode, you can recover the latest filesystem state without resorting to restoring old data from the Secondary NameNode snapshot.
The first step is to designate a new machine to hold the NameNode image and edits file backups. Next, mount the backup machine on the NameNode server. Finally, modify the hdfs-site.xml file on the server running the NameNode so that it writes to both the local filesystem and the backup machine mount.
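For example, the backup location could be an NFS export mounted on the NameNode server (the backuphost name and paths are illustrative):
$ sudo mount -t nfs backuphost:/export/namenode-backup /path/to/backup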
$ cd $HADOOP_HOME
Edit hdfs-site.xml:
$ sudo vi conf/hdfs-site.xml
<property>
<name>dfs.name.dir</name>
<value>/path/to/hadoop/cache/hadoop/dfs,/path/to/backup</value>
</property>
Now the NameNode will write all of the filesystem metadata to both the /path/to/hadoop/cache/hadoop/dfs and the mounted /path/to/backup folders.
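After restarting the NameNode, you can check that both locations are being populated; in Hadoop 1.x the fsimage and edits files live in a current subdirectory under each dfs.name.dir entry (the paths are the illustrative ones above):
$ ls /path/to/hadoop/cache/hadoop/dfs/current /path/to/backup/current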
Hope you have enjoyed the article.
Author: Iqubal Mustafa Kaki, Technical Specialist.
Want to connect with me
If you want to connect with me, please connect through my email - iqubal.kaki@gmail.com