By Ian Lumb | October 30, 2014 | Bright Cluster Manager, Hadoop, Hadoop Cluster Management, Apache Hadoop, Apache Spark, Hadoop Analytics Stack, Software Maintenance
Pop quiz: How many steps does it take to upgrade your Hadoop distribution?
Choose one answer:
If you chose 1 step, you must be using Bright Cluster Manager for Apache Hadoop, as all other approaches require multiple steps. Need the details? (You will at some point!) Your options are summarized below.
The Apache Project details a 4-step rolling upgrade process:
In preparing for the upgrade (Step 1), a snapshot of the Hadoop filesystem (HDFS) metadata is made for downgrade or rollback purposes (if required). It’s important to keep in mind that this is HDFS metadata (i.e., data about the data you’ve stored in HDFS, but not the data itself). Because replication is built into Hadoop, redundant copies of your data are your failsafe against data loss during the upgrade.
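As a rough sketch of what Step 1 boils down to, the metadata snapshot is requested from the active NameNode with the standard `hdfs dfsadmin -rollingUpgrade` subcommands; the Python wrapper below is purely illustrative and not part of the Apache procedure itself:

```python
import subprocess

def dfsadmin(*args):
    """Run an 'hdfs dfsadmin' subcommand and return its output."""
    result = subprocess.run(["hdfs", "dfsadmin", *args],
                            check=True, capture_output=True, text=True)
    return result.stdout

# Step 1: ask the active NameNode to create a rollback fsimage -- a snapshot
# of the HDFS metadata, not of the data blocks themselves.
print(dfsadmin("-rollingUpgrade", "prepare"))

# Check whether the rollback image has been created yet; repeat until it has.
print(dfsadmin("-rollingUpgrade", "query"))
```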
Steps 2 and 3 of this process make it clear that upgrading Hadoop amounts to upgrading HDFS services. In Highly Available (HA) configurations, the standby NameNode (NN2) is upgraded to the latest software release first in Step 2; when re-instantiated, the standby assumes the active role and ingests the HDFS metadata. The same upgrade is then applied to the node that was active prior to the start of the upgrade (NN1). The upgrade of the DataNodes is less choreographed, with subsets of nodes being upgraded simultaneously. This DataNode upgrade (Step 3) is repeated until all of the nodes are running the latest release of the software.
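For a sense of what the DataNode portion (Step 3) involves, the sketch below walks the DataNodes one at a time (the Apache procedure also permits subsets) using the standard `hdfs dfsadmin` shutdown and status commands. The endpoint list and the `upgrade_datanode_software` helper are hypothetical placeholders for however the new release actually gets installed:

```python
import subprocess
import time

# Hypothetical DataNode IPC endpoints (50020 is the default IPC port); in
# practice these would come from 'hdfs dfsadmin -report' or your cluster
# manager's inventory.
DATANODES = ["dn001:50020", "dn002:50020", "dn003:50020"]

def dfsadmin(*args):
    return subprocess.run(["hdfs", "dfsadmin", *args], check=False,
                          capture_output=True, text=True)

def upgrade_datanode_software(endpoint):
    """Placeholder: install the new release and restart this DataNode."""
    pass  # distro-specific package upgrade and service restart

for endpoint in DATANODES:
    # Ask the DataNode to shut down for upgrade; clients reading from it are
    # given time to fall back to replicas held on other nodes.
    dfsadmin("-shutdownDatanode", endpoint, "upgrade")

    # Wait until the DataNode has actually stopped responding.
    while dfsadmin("-getDatanodeInfo", endpoint).returncode == 0:
        time.sleep(5)

    # Install the new release and restart before moving to the next node.
    upgrade_datanode_software(endpoint)
```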
The final step (Step 4) then commits the cluster to the upgraded release of the software.
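Committing (Step 4) is, again roughly speaking, a single finalize call; after finalization the rollback image created in Step 1 can no longer be used:

```python
import subprocess

# Step 4: once every NameNode and DataNode is running the new release,
# commit to it. Finalizing discards the ability to roll back to the
# rollback fsimage created in Step 1.
subprocess.run(["hdfs", "dfsadmin", "-rollingUpgrade", "finalize"], check=True)
```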
If your cluster has not been configured for HA, the upgrade process is more involved, and downtime is inevitable. If federated clusters have been configured, the process needs to be repeated for each namespace.
The upgrade process ignores the JournalNodes and ZooKeeperNodes in your Hadoop cluster. The Apache Project's rationale is that JournalNodes, which handle HDFS transaction logging, are "relatively stable and [do] not require upgrade when upgrading HDFS in most of the cases." If, however, JournalNodes and ZooKeeperNodes do require upgrade, downtime may be involved.
Other than mentioning the ZooKeeper coordination service in passing, the Apache Project places emphasis squarely on HDFS-related services in the Hadoop upgrade process. Of course, your platform for Big Data Analytics involves more than just HDFS. Your deployment likely relies upon YARN for managing workloads, as well as a stack of analytics applications, in addition to HDFS and ZooKeeper. When you factor in the rest of the stack, the number of steps more than doubles in the cases of CDH and HDP upgrades. In fact, when you factor in the CDH 5 components, you’re looking at about 20 additional upgrades.
Cloudera Manager automates aspects of the CDH upgrade, but still requires a number of steps - some of which need to be taken manually.
HDP relies on Apache Ambari for deploying, managing and monitoring clusters. Today, upgrading equates to redeploying HDP in its entirety. Note: The consequences of a multistep redeployment process need to be carefully considered before any action is taken. Automated stack upgrades are a planned enhancement for a future release of Ambari.
A Bright Cluster Manager role is a task that can be performed by a node in your cluster. Take, for example, a Bright-managed Hadoop cluster configured with NameNode HA (and Automatic Failover): nodes are assigned DataNode, NameNode, JournalNode and ZooKeeperNode roles - see the screenshot below. Bright roles make relationships explicit, so the dependencies between these services, and between the services and the nodes that host them, are captured and managed directly.
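Purely as an illustration of the idea (this is not Bright's actual interface), role assignments can be thought of as an explicit node-to-roles mapping against which service dependencies can be checked:

```python
# Illustrative model only; Bright's real interface is its GUI and cmsh, not
# this Python sketch. Node names and role placement are made up.
roles = {
    "node001": {"NameNode", "JournalNode", "ZooKeeperNode"},
    "node002": {"NameNode", "JournalNode", "ZooKeeperNode"},
    "node003": {"DataNode", "JournalNode", "ZooKeeperNode"},
    "node004": {"DataNode"},
}

def nodes_with(role):
    return [node for node, assigned in roles.items() if role in assigned]

# NameNode HA with Automatic Failover depends on a shared edit log
# (JournalNodes) and a coordination quorum (ZooKeeperNodes).
assert len(nodes_with("NameNode")) == 2, "HA needs an active and a standby NameNode"
assert len(nodes_with("JournalNode")) >= 3, "JournalNodes need a quorum"
assert len(nodes_with("ZooKeeperNode")) >= 3, "ZooKeeper needs a quorum"
```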
Because Bright Cluster Manager roles allow Hadoop services to be defined, assigned and composed, the Apache Project’s upgrade procedure can be collapsed into a single script. Bright’s 1-step upgrade script also incorporates the following enhancements:
Bright Cluster Manager for Apache Hadoop has been validated for various rolling-upgrade scenarios - see the table below for the details. Particularly noteworthy is the cascading upgrade of CDH: In two steps with Bright you can upgrade from CDH 5.0.4 to CDH 5.1.3, and then immediately from CDH 5.1.3 to CDH 5.2.0. Our upgrade process will even allow you to revert to the pre-upgrade state, should you need to.
| Distribution | Pre-Upgrade Version | Intermediate Version | Post-Upgrade Version |
| --- | --- | --- | --- |
| Apache Hadoop | 2.4.1 | | 2.5.1 |
| CDH (based upon Apache Hadoop 2.3.0) | 5.0.4 | 5.1.3 | 5.2.0 |
| HDP (based upon Apache Hadoop 2.4.0) | 2.1.2.0 | | 2.1.5.0 |
So much for HDFS and its services; what about the rest of the stack, i.e., components like YARN and the analytics applications? Bright Cluster Manager for Apache Hadoop maintains YARN, Apache Spark and other components. Maintains is the operative word here. Bright Computing ensures that updates to HDFS, YARN, Spark and other components are made available via YUM updates to our product on a regular basis. Just as importantly, Bright Computing ensures compatibility between components on an ongoing basis. Because maintaining the stack of Hadoop software is what we do, your distro-upgrade process is greatly simplified. Translation: No extra steps are required to maintain your Hadoop stack.
Bright Computing customers won’t find any of this surprising, as we’ve earned a solid reputation for easing the burden of management outside the Big Data Analytics arena - most notably, in High Performance Computing (HPC) and Cloud computing. Because we have over a decade’s worth of experience in managing complex IT infrastructures, Bright Cluster Manager is a mature and robust solution that saves you time and effort.
Interested in learning more? Please join us next week for a webinar on this topic.
Do you have an upgrade scenario in mind? If so, please get in touch with us. We have the product, people and process to rapidly execute pain-free Hadoop upgrades without downtime.