First let me say, by my own admission, I am an infrastructure guy! That said, I have been lucky enough recently to be given the chance to dive into the world of Big Data and PaaS. As a techie, I find the extensive technology options in this area very impressive, and as a consequence I have had a lot of fun, combined with late nights and heavy learning.
I have always worked on the KISS principle, 'Keep It Simple Stupid!', and Hadoop fits into that with the exception of its file services layer, HDFS. Because HDFS sits as a layer of abstraction above the underlying file system, you cannot manage or populate it directly; everything has to go through tools written specifically for the task.
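To make that concrete, here is a minimal sketch of what even basic file operations look like through the HDFS command-line tools (the paths and file names are hypothetical, for illustration only):

```shell
# You can't just 'cp' a file into HDFS from the OS shell; it has to go
# through the Hadoop client tooling, e.g. the 'hdfs dfs' commands.

# Create a directory inside HDFS (hypothetical path)
hdfs dfs -mkdir -p /user/demo/input

# Copy a local file into HDFS
hdfs dfs -put sales-2013.csv /user/demo/input/

# List what HDFS sees -- note this is HDFS's namespace,
# not the local file system's
hdfs dfs -ls /user/demo/input
```

Every standard operation you take for granted on a normal file system (copy, list, delete, permissions) needs its HDFS-aware equivalent.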
I know DAS and scale-out through the data nodes are great, and they build a big pond for your big data by combining the compute and disk resources of 1000+ server nodes. Yay, all good! But putting my EMC and infrastructure hat on for a moment, what about the following:
- How do I back up my HDFS-based data?
- How do I use those other cool storage capabilities such as snapshots, auto-tiering, etc.?
- How do I get real-time analysis on data without having to move it into HDFS?
- How do I share the data held within HDFS?
- What if I need more compute resources in my cluster but not storage, or the other way around?
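The backup question illustrates the point: within the stock Hadoop toolset, the usual answer is a batch copy job with DistCp rather than any array-level snapshot or backup integration. A sketch, assuming two clusters with hypothetical NameNode addresses:

```shell
# DistCp runs a MapReduce job to copy data between HDFS clusters --
# effectively a bulk file copy, not a storage-array backup.
# 'nn1'/'nn2' and the paths are hypothetical placeholders.
hadoop distcp \
  hdfs://nn1:8020/user/demo/input \
  hdfs://nn2:8020/backup/demo/input
```

It works, but it consumes cluster compute to move data, which is exactly the kind of thing dedicated storage features handle transparently elsewhere.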