My Two Key Takeaways from Hadoop Summit 2016: Security & Storage versus Compute

    

The podcast from Roaring Elephant summarized the theme of the 2016 Hadoop Summit as a "Continuation of the Journey" and the importance of partnering up. The message that "you are not alone in this" was emphasized a few times in different keynotes. I also felt the speakers tried to coin the theme "It's all about the apps," but as this was my first Summit, I'm not sure whether that was new for 2016.

Overall, the atmosphere at Hadoop Summit was great. The 1,600 attendees represented 46 countries, and all of the presentations I attended (apart from the occasional commercial one) were really informative. The talks reflected the current state of Hadoop, and at times I wished some of the projects were more mature. There were lots of preview and incubating projects, and I learned about much of the interesting work happening right now around Hadoop, Spark, NiFi, and the cloud. All the videos from Hadoop Summit are already online on the event's YouTube channel.

Security

Regarding security, many projects seem to be developing in parallel; at least they were in the beginning, and because of this there is some overlap between them. For instance, Apache Eagle and Apache Atlas have overlapping goals, but they don't solve exactly the same problem. Apache Eagle is being developed by eBay and PayPal and is already used in their production environments. Apache Atlas has a preview VM available; I'm not sure whether it's already being used in production.

Apache Atlas aims to provide a foundation for governance services such as data classification, centralized auditing, search & lineage, and a security & policy engine. Eagle focuses more on monitoring: identifying sensitive data, recognizing attacks and other malicious activity, and taking action in real time. Apache Ranger can be used to enforce policies using Hadoop ACLs and to provide basic Hadoop security together with Kerberos.

One of the interesting questions that popped up multiple times in security talks was: "After enabling all security settings in Hadoop/Spark, how would I know my cluster is actually secure?"  The answers that followed were not completely satisfying, unfortunately.

We have asked ourselves this question at Bright too, and we use our Metrics system to actively monitor all nodes in the cluster: checking whether Kerberos/SPNEGO is active, whether IPC is protected, whether encryption is enabled, whether unauthorized users are denied access to create keys and zones in the KMS server, and so on. This solves part of the problem, but it's clear that more work needs to be done in the Hadoop community to raise its security profile.
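To give an idea of what such a check can look like (this is a minimal sketch, not Bright's actual implementation), the snippet below parses the standard Hadoop configuration files and verifies a few well-known security properties. The property names are real Hadoop settings; the configuration path and the expected values are assumptions chosen as one possible hardening baseline.

```python
import xml.etree.ElementTree as ET

# Security-related Hadoop properties and the values we expect on a hardened
# cluster. The property names are standard; the expected values below are
# one assumed baseline, not the only valid configuration.
EXPECTED = {
    "core-site.xml": {
        "hadoop.security.authentication": "kerberos",  # Kerberos enabled
        "hadoop.rpc.protection": "privacy",            # IPC encrypted
    },
    "hdfs-site.xml": {
        "dfs.encrypt.data.transfer": "true",           # DataNode transfer encryption
        "dfs.http.policy": "HTTPS_ONLY",               # web UIs over TLS only
    },
}

def read_properties(path):
    """Parse a Hadoop-style configuration file into a {name: value} dict."""
    props = {}
    for prop in ET.parse(path).getroot().iter("property"):
        name = prop.findtext("name")
        value = prop.findtext("value", default="")
        if name:
            props[name] = value.strip()
    return props

def check(conf_dir="/etc/hadoop/conf"):  # assumed config location
    failures = []
    for filename, expected in EXPECTED.items():
        actual = read_properties(f"{conf_dir}/{filename}")
        for key, want in expected.items():
            got = actual.get(key, "<unset>")
            if got.lower() != want.lower():
                failures.append(f"{filename}: {key} = {got!r}, expected {want!r}")
    return failures

if __name__ == "__main__":
    problems = check()
    if problems:
        print("Cluster does NOT match the hardening baseline:")
        for p in problems:
            print("  -", p)
    else:
        print("All checked security settings match the baseline.")
```

Running something like this periodically on every node only answers part of the question, of course; it tells you the settings are in place, not that the cluster is actually secure.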

Storage versus Compute

Another interesting theme for me was the storage versus compute debate. There is a trade-off to make with regard to data locality: keep data local on HDFS, or store it in the cloud (e.g., Azure or Amazon). In many cases, choosing separate compute and storage nodes seems more appealing. The main reason is that it allows storage to scale independently of compute: don't pay for what you don't use!
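A toy calculation makes the scaling argument concrete. All the numbers below are hypothetical and only illustrate why independent scaling matters when data grows faster than compute demand.

```python
# Toy comparison of coupled vs. decoupled storage and compute.
# All figures are hypothetical, purely to illustrate the scaling argument.

data_tb = 900          # total data set size (TB)
compute_needed = 20    # nodes actually needed for the compute workload

# Coupled model: each worker node carries both CPUs and, say, 30 TB of disk,
# so the cluster size is driven by whichever resource you need more of.
node_storage_tb = 30
coupled_nodes = max(compute_needed, -(-data_tb // node_storage_tb))  # ceiling division

# Decoupled model: buy exactly the compute you need and grow the storage
# tier (e.g., an object store) independently.
print(f"Coupled cluster: {coupled_nodes} nodes "
      f"({coupled_nodes - compute_needed} of them only there to hold disks)")
print(f"Decoupled: {compute_needed} compute nodes + {data_tb} TB of storage")
```

With these made-up numbers, the coupled cluster carries ten idle-CPU nodes whose only job is to hold disks; the decoupled setup pays for the extra terabytes and nothing else.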

Reading data from local disks is still faster than reading it over the network, but networks are getting faster while disks aren't. There also seems to be quite a bit of momentum toward moving data into memory, making disks even less relevant. Today the norm is 1 Gigabit or even 10 Gigabit Ethernet, and speeds are expected to keep increasing. A 2011 study of Facebook jobs found that, 85% of the time, tasks reading over the network ran just as fast as disk-local tasks. The same study found that 96% of the Hadoop workload could fit its entire data set in memory (assuming 32 GB per server).
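A quick back-of-envelope calculation shows why the network is no longer the obvious bottleneck. The throughput figures below are rough, assumed ballpark numbers, not measurements.

```python
# Back-of-envelope: local disk versus network reads for one HDFS block.
# The throughput figures are rough, assumed ballpark numbers.

block_mb = 128                 # one HDFS block (default size)
disk_mb_s = 120                # sequential read from a single spinning disk
gige_mb_s = 1_000 / 8          # 1 Gigabit Ethernet  ~ 125 MB/s
ten_gige_mb_s = 10_000 / 8     # 10 Gigabit Ethernet ~ 1,250 MB/s

for name, rate in [("local disk", disk_mb_s),
                   ("1 GbE", gige_mb_s),
                   ("10 GbE", ten_gige_mb_s)]:
    print(f"{name:>10}: {block_mb / rate:.2f} s per 128 MB block")

# At 1 GbE the wire already keeps pace with a single disk; at 10 GbE a remote
# read is limited by the remote disk (or memory) it comes from, not the network.
```

Under these assumptions a 1 GbE link already keeps up with a single spinning disk, which is consistent with the study's observation that network reads were usually no slower than disk-local ones.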


I would strongly recommend this two-day conference to anyone who is interested in solving big data technical problems. It’s a hothouse for ideas and innovation.
