After more than a decade of hype and promised potential around Big Data, Big Analysis is turning that promise into reality: our ability to make sense of data has grown dramatically and is still improving. Big Analysis draws on data about almost everyone, including demographic data, social media habits, purchasing power, job experience, search history, preferences and much more. Companies that want to remain competitive will need this intelligence to create products that suit their markets and to advertise where the message will be best received.
In the early days, Big Data needed a lot of raw computing power, storage and parallelism, so organizations spent a lot of money building the infrastructure needed to support Big Data analytics. For many years, the data management field was dominated by closed-source products with a hefty price tag. Only large-scale organizations and Fortune 500 companies could afford such an infrastructure.
The rise of open source in the database community was possibly an artifact of the popularity of the Hadoop project. Hadoop was created by Doug Cutting and his colleagues in 2005, more as a necessity to rein in Big Data than as a pure research invention. It was inspired by Google’s MapReduce and Google File System and cultivated at Yahoo. Hadoop started as a large-scale distributed batch processing framework and was tailored to meet the need for an affordable, scalable and flexible data structure that could be used for working with very large data sets. As organizations tried to make the most of Big Data, MapReduce entered the mainstream. The MapReduce programming paradigm made massive scalability possible across hundreds or thousands of servers in a Hadoop cluster.
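To make the MapReduce paradigm more concrete, here is a minimal single-process sketch of its three stages (map, shuffle, reduce) applied to a word count. It is not Hadoop code and does not use any Hadoop API; the function names `map_phase`, `shuffle_phase`, `reduce_phase` and `word_count` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Emit a (word, 1) pair for every word in one input document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Group all emitted values by key, the step that sits between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values collected for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence over a list of documents."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "Big analysis"]))
```

In this sketch everything runs in one process. In a cluster, each map call and each reduce call is independent of the others, which is what lets the framework spread them across many servers.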
The first generation of Hadoop addressed affordable scalability and a flexible data structure, but it was really only the first step in the Big Data journey. Its batch-oriented job processing and consolidated resource management were limitations that drove the development of Yet Another Resource Negotiator (YARN).
YARN fundamentally became the architectural center of Hadoop since it allowed multiple data processing engines to handle data stored in one platform.
This modern data architecture made it possible for Apache Hadoop to become a true data operating system and platform. YARN separated data persistence from the execution models, so multiple workloads could run against the same data. Hadoop version 2.0 provides the groundwork for today’s data lake strategy. A data lake is basically a large object-based storage repository that holds data in its native format until it is required. However, using a data lake only as a consolidated data repository holds back its true potential; Hadoop was really meant to be used as an interactive, multi-workload, operational data platform.
When the Apache Hadoop project started, MapReduce V1 was the only choice as a Compute model (Execution Engine) on Hadoop. Now in addition to MapReduce V2, we have Apache Tez, Apache Spark and Apache Flink.
Here is how Apache Tez is branding itself:
According to Hortonworks, Apache™ Tez is an extensible framework for building high performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop. Tez improves the MapReduce paradigm by dramatically improving its speed while maintaining MapReduce’s ability to scale to petabytes of data. Important Hadoop ecosystem projects like Apache Hive and Apache Pig use Apache Tez, as do a growing number of third party data access applications developed for the broader Hadoop ecosystem.
Here is how Apache Spark is branding itself:
According to Databricks, Apache Spark™ is a powerful open source processing engine built around speed, ease of use and sophisticated analytics. It was originally developed at UC Berkeley in 2009. Databricks was founded by the creators of Spark in 2013. Since its release, Spark has seen rapid adoption by enterprises across a wide range of industries. Internet powerhouses such as Yahoo, Baidu and Tencent have eagerly deployed Spark on a massive scale, collectively processing multiple petabytes of data on clusters of over 8,000 nodes. It has quickly become the largest open source community in Big Data with over 500 contributors from 200+ organizations. To help Spark achieve this growth, Databricks continues to contribute broadly throughout the project, both with roadmap development and with community evangelism.
Here is how Apache Flink is branding itself:
Apache Flink, like Apache Hadoop and Apache Spark, is a community-driven open source framework for distributed Big Data analytics. The Apache Flink engine exploits data streaming, in-memory processing and iteration operators to improve performance. Its execution model supports batch, interactive and real-time streaming workloads. Flink joined the Apache Incubator in April 2014 and graduated as an Apache Top-Level Project in December of the same year. Flink can work completely independently of existing technologies like Hadoop, but it can also run on top of HDFS and YARN.
Today’s computing power means that nearly one billion calculations can take place in the blink of an eye. Emerging computing models with lightning-fast speed are challenging one another and forcing each other toward better performance and predictability.
What are leading vendors and market movers talking about?
Both are already happening:
As the battle for the next-generation Big Data analysis framework continues, the question isn’t whether some critical Hadoop components are being replaced by new open source technologies. The question is, what’s next for Hadoop? Will the Hadoop ecosystem keep growing to compete with the emerging open source technologies, or will technologies like Spark, Flink, Ceph, Kafka and others evolve into something entirely new?
At this point there are a lot of “what-ifs,” and we don’t know the answers. Only time will tell. But the fact remains that we will see more dynamic data management techniques going mainstream in the near future.