The Battle for the Next Generation Big Data Analysis Framework

With all the hype and promised potential surrounding Big Data for more than a decade, Big Analysis is now turning that promise into reality: our ability to make sense of data has grown exponentially and is still improving. Big Analysis leverages all sorts of data about virtually everyone, drawing on demographics, social media habits, purchasing power, job experience, search history, preferences and much more. Companies that want to remain competitive will need this intelligence to create better products suited to their markets and to advertise where the message will be best received.

To get it, they must look for the “Next Generation Big Data Analysis Framework.”

In its early days, Big Data demanded a great deal of raw computing power, storage and parallelism, and organizations spent heavily to build the infrastructure needed to support Big Data Analytics. For many years the data management field was dominated by closed-source products with hefty price tags, so only large-scale organizations and Fortune 500 companies could afford such an infrastructure.

The rise of open source in the database community was arguably an artifact of the popularity of the Hadoop project. Created by Doug Cutting and his colleagues in 2005, Hadoop was born more out of the necessity to rein in Big Data than as a pure research invention. Inspired by Google’s MapReduce and Google File System papers and cultivated at Yahoo, it started as a large-scale distributed batch processing framework, tailored to the need for an affordable, scalable and flexible structure for working with very large data sets. MapReduce brought Big Data processing into the mainstream: the MapReduce programming paradigm made massive scalability possible across hundreds or thousands of servers in a Hadoop cluster.
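
To make the paradigm concrete, here is a minimal sketch of the canonical word-count job written against the Hadoop MapReduce Java API. The map phase emits (word, 1) pairs and the reduce phase sums them; the input and output paths are passed on the command line and are purely illustrative:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Map phase: emit (word, 1) for every word in the input split.
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          for (String token : value.toString().split("\\s+")) {
            if (token.isEmpty()) continue;
            word.set(token);
            context.write(word, ONE);
          }
        }
      }

      // Reduce phase: sum the counts emitted for each distinct word.
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          context.write(key, new IntWritable(sum));
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on mappers
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. an HDFS directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

The framework handles the shuffle between the two phases and restarts failed tasks, which is what lets the same code scale from one node to thousands.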

The first generation of Hadoop addressed affordable scalability and a flexible data structure, but it was really only the first step in the Big Data journey. Its batch-oriented job processing and consolidated resource management were limitations that drove the development of Yet Another Resource Negotiator (YARN).

YARN became the architectural center of Hadoop because it allows multiple data processing engines to work on data stored in a single platform.

This modern data architecture made it possible for Apache Hadoop to become a true data operating system and platform. YARN decoupled data persistence from the various execution models, letting multiple workloads share the same data. Hadoop 2.0 laid the groundwork for today’s Data Lake strategy: a large, object-based storage repository that holds data in its native format until it is required. However, using the Data Lake only as a consolidated data repository held back its true potential; Hadoop was really meant to be an interactive, multi-workload, operational data platform.

Evolution of Compute Models

When the Apache Hadoop project started, MapReduce V1 was the only compute model (execution engine) available on Hadoop. Now, in addition to MapReduce V2, we have Apache Tez, Apache Spark and Apache Flink.

Apache Tez:

Here is how Apache Tez is branding itself:

According to Hortonworks, Apache™ Tez is an extensible framework for building high performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop. Tez improves the MapReduce paradigm by dramatically improving its speed while maintaining MapReduce’s ability to scale to petabytes of data. Important Hadoop ecosystem projects like Apache Hive and Apache Pig use Apache Tez, as do a growing number of third party data access applications developed for the broader Hadoop ecosystem.
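
The key idea is that Tez expresses a job as a directed acyclic graph (DAG) of processing vertices rather than a rigid map-then-reduce pipeline, so multi-stage jobs avoid writing intermediate results to HDFS between steps. Below is a rough sketch of wiring two stages together with the Tez Java DAG API; the processor and edge I/O class names (example.*) are hypothetical placeholders for user-supplied logic, not real Tez classes:

    import org.apache.tez.dag.api.DAG;
    import org.apache.tez.dag.api.Edge;
    import org.apache.tez.dag.api.EdgeProperty;
    import org.apache.tez.dag.api.EdgeProperty.DataMovementType;
    import org.apache.tez.dag.api.EdgeProperty.DataSourceType;
    import org.apache.tez.dag.api.EdgeProperty.SchedulingType;
    import org.apache.tez.dag.api.InputDescriptor;
    import org.apache.tez.dag.api.OutputDescriptor;
    import org.apache.tez.dag.api.ProcessorDescriptor;
    import org.apache.tez.dag.api.Vertex;

    public class TezDagSketch {
      public static DAG buildDag() {
        // Each vertex runs user-supplied processing logic at a chosen parallelism.
        Vertex tokenizer = Vertex.create("Tokenizer",
            ProcessorDescriptor.create("example.TokenProcessor"), 4);
        Vertex summation = Vertex.create("Summation",
            ProcessorDescriptor.create("example.SumProcessor"), 2);

        // A scatter-gather edge shuffles keyed output from Tokenizer to Summation,
        // analogous to the shuffle between map and reduce.
        EdgeProperty shuffle = EdgeProperty.create(
            DataMovementType.SCATTER_GATHER,
            DataSourceType.PERSISTED,
            SchedulingType.SEQUENTIAL,
            OutputDescriptor.create("example.PartitionedOutput"),
            InputDescriptor.create("example.GroupedInput"));

        return DAG.create("WordCountDag")
            .addVertex(tokenizer)
            .addVertex(summation)
            .addEdge(Edge.create(tokenizer, summation, shuffle));
      }
    }

A real application would submit this DAG through a TezClient running under YARN; the point here is only the shape of the API: arbitrary graphs of vertices instead of fixed map and reduce slots.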

Apache Spark:

Here is how Apache Spark is branding itself:

According to Databricks, Apache Spark™ is a powerful open source processing engine built around speed, ease of use and sophisticated analytics. It was originally developed at UC Berkeley in 2009. Databricks was founded by the creators of Spark in 2013. Since its release, Spark has seen rapid adoption by enterprises across a wide range of industries. Internet powerhouses such as Yahoo, Baidu and Tencent have eagerly deployed Spark on a massive scale, collectively processing multiple petabytes of data on clusters of over 8,000 nodes. It has quickly become the largest open source community in Big Data with over 500 contributors from 200+ organizations. To help Spark achieve this growth, Databricks continues to contribute broadly throughout the project, both with roadmap development and with community evangelism.
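
Some of that claimed ease of use shows in how compactly the same word count from the MapReduce example above can be written with Spark’s Java API (Spark 2.x style); the HDFS paths are illustrative:

    import java.util.Arrays;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    public class SparkWordCount {
      public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("word count");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Load text, split it into words, count each word and save the result,
        // all as one chained pipeline that Spark can keep in memory.
        JavaRDD<String> lines = sc.textFile("hdfs:///data/input");  // illustrative path
        JavaPairRDD<String, Integer> counts = lines
            .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
            .mapToPair(word -> new Tuple2<>(word, 1))
            .reduceByKey((a, b) -> a + b);
        counts.saveAsTextFile("hdfs:///data/output");               // illustrative path

        sc.stop();
      }
    }

Because intermediate RDDs can be cached in memory, iterative workloads such as machine learning avoid the repeated disk round-trips that a chain of MapReduce jobs would incur.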

Apache Flink:

Here is how Apache Flink is branding itself:

Apache Flink, like Apache Hadoop and Apache Spark, is a community-driven open source framework for distributed Big Data Analytics. The Flink engine exploits data streaming, in-memory processing and iteration operators to improve performance, and its execution model supports batch, interactive and real-time streaming workloads. Flink joined the Apache Incubator in April 2014 and graduated to an Apache Top-Level Project in December of the same year. Flink can work completely independently of existing technologies like Hadoop, but it can also run on top of HDFS and YARN.
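
Flink’s streaming-first model looks similar at the API level. Here is a minimal sketch of a continuous word count over a socket source using the Flink DataStream Java API; the host and port are illustrative:

    import org.apache.flink.api.common.functions.FlatMapFunction;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.util.Collector;

    public class FlinkStreamingWordCount {
      public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
            StreamExecutionEnvironment.getExecutionEnvironment();

        // Read an unbounded stream of lines from a socket (illustrative source).
        DataStream<String> lines = env.socketTextStream("localhost", 9999);

        // Tokenize, key by word, and keep a running count per word.
        DataStream<Tuple2<String, Integer>> counts = lines
            .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
              @Override
              public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
                for (String word : line.split("\\s+")) {
                  out.collect(new Tuple2<>(word, 1));
                }
              }
            })
            .keyBy(t -> t.f0)  // group the stream by the word field
            .sum(1);           // running sum of the count field

        counts.print();
        env.execute("streaming word count");  // results update as data arrives
      }
    }

Unlike a batch job, this program never terminates on its own; the counts are updated continuously as new lines arrive, which is exactly the behavior Flink’s streaming engine is built around.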

The Battle for the Next Generation Big Data Analysis Framework

Today’s computing power means that nearly one billion calculations can take place in the blink of an eye. Emerging computing models with lightning-fast speed are challenging one another and pushing each other toward better performance and predictability.

What are leading vendors and market movers talking about?

Databricks:

  • “Spark and Hadoop are Working Together.”
  • “Uniform API for diverse workloads over diverse storage systems and …”
  • “The goal of Apache Spark is to have one engine for all data sources, workloads and …”

Cloudera:

  • “Spark is already an excellent piece of software and is advancing very quickly. No vendor — no new project — is likely to catch up. Chasing Spark would be a waste of time and would delay availability of real-time analytic and processing services for no good reason.”
  • “Apache Spark is an open source, parallel data processing framework that complements Apache Hadoop to make it easy to develop fast, unified Big Data applications combining batch, streaming and interactive analytics on all your data.”

MapR:

  • “Apache Spark is a general-purpose engine for large-scale data processing. Spark supports rapid application development for Big Data and allows for code reuse across batch, interactive and streaming applications. Spark also provides advanced execution graphs with in-memory pipelining to speed up end-to-end application performance.”
  • “MapR Adds Complete Apache Spark Stack to its Distribution for Hadoop.”

Hortonworks:

  • “Apache Spark provides an elegant, attractive development API and allows data workers to rapidly iterate over data via machine learning and other data science techniques that require fast, in-memory data processing.”
  • “A shared vision for Apache Spark on Hadoop.”
  • “At Hortonworks, we love Spark and want to help our customers leverage all its benefits.”

Gartner:

  • “Is Apache Spark replacing Hadoop or complementing existing Hadoop practice? Both are already happening:
      • With uncertainty about ‘What is Hadoop?’ there is no reason to think solution stacks built on Spark, not positioned as Hadoop, will not continue to proliferate as the technology matures.
      • At the same time, Hadoop distributions are all embracing Spark and including it in their offerings.”

Forrester:

  • “After hearing the confusion between Spark and Hadoop one too many times, I was inspired to write a report.” Source: The Hadoop Ecosystem Overview, Q4 2014
  • “For those that have day jobs that don’t include constantly tracking Hadoop evolution, I dove in and worked with Hadoop vendors and trusted consultants to create a framework. We divided the complex Hadoop ecosystem into a core set of tools that all work closely with data stored in the Hadoop File System, and an extended group of components that leverage, but do not require, it.” Source: Elephants, Pigs, Rhinos and Giraphs; Oh My! – It’s Time To Get A Handle On Hadoop, posted by Brian Hopkins on November 26, 2014

As the battle for the Next Generation Big Data Analysis Framework continues, the question isn’t whether some critical Hadoop components are being replaced by new open source technologies. The question is: what’s next for Hadoop? Will the Hadoop ecosystem keep growing to compete with emerging open source technologies, or will technologies like Spark, Flink, Ceph, Kafka and others evolve into something entirely new?

At this point there are a lot of “what ifs” and we don’t know the answers; only time will tell. But the fact remains that we will see more dynamic data management techniques going mainstream in the near future.

Article written by Rashid Jamal