Enterprise Spark at Scale: Driving More Business Value from Big Data
Recently, Apache Spark set the world of Big Data on fire. Given its promise of amazing performance and comfortable APIs, some thought that Spark was bound to replace Hadoop MapReduce. But is it really? On closer inspection, Spark looks more like a natural complement to Apache Hadoop YARN, the architectural center of Hadoop…
Hadoop is already transforming many industries, accelerating Big Data projects to help businesses translate information into competitive advantage. Everywhere you look, you can find companies using Hadoop in large-scale projects to enable deep data discovery, to capture a single view of customers across multiple data sets, and to help data scientists perform predictive analytics. In these ways, companies meet current customer needs, anticipate shifting market dynamics and consumer behaviors, and test business hypotheses—all crucial capabilities to help them outmaneuver and outperform their competitors.
The booming demand for Big Data has fueled a dizzying rise in spending on the technologies that make it possible. One of the most active and remarkable open source projects in the Apache Software Foundation is Apache Spark™, which can run programs up to 100X faster than MapReduce in memory, using an advanced DAG (directed acyclic graph) execution engine that supports acyclic data flows and in-memory computing. Spark is also developer-friendly: it offers APIs in Java, Scala, Python and R, with 80 high-level operators that make it easy to build parallel apps. Because Spark combines SQL, streaming and complex analytics, it is compatible with a broad range of tools, a key advantage for running analytics against diverse data sources.
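To make the idea of chained high-level operators and in-memory computing more concrete, here is a minimal sketch in plain Python. It is a toy model, not Spark's API: the `LazyDataset` class and its method names are hypothetical and only loosely mirror the style of Spark's operators. Each operator records a step in a lineage (a linear chain, the simplest kind of DAG), nothing runs until a result is requested, and `cache()` keeps the computed result in memory for reuse.

```python
from functools import reduce


class LazyDataset:
    """Toy stand-in for a distributed dataset (hypothetical, single-process)."""

    def __init__(self, source, steps=()):
        self._source = source        # any re-iterable collection of records
        self._steps = tuple(steps)   # lineage: ordered (kind, function) pairs
        self._cache = None           # materialized result, if cache() was called

    def map(self, fn):
        # Record the step; no data is touched yet.
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        # Record the step; no data is touched yet.
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    def cache(self):
        # Run the lineage once and keep the result in memory.
        self._cache = self.collect()
        return self

    def collect(self):
        # Reuse the in-memory result when available, else replay the lineage.
        if self._cache is not None:
            return list(self._cache)
        data = iter(self._source)
        for kind, fn in self._steps:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return list(data)

    def reduce(self, fn):
        return reduce(fn, self.collect())


events = LazyDataset(range(1, 11))
even_squares = events.filter(lambda x: x % 2 == 0).map(lambda x: x * x).cache()
total = even_squares.reduce(lambda a, b: a + b)
print(even_squares.collect(), total)
```

A real engine would split the data across many machines and schedule the steps in parallel; the sketch only shows why describing the work first and running it later lets the engine plan execution and reuse results held in memory.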
Apache Spark has generated a lot of excitement in the Big Data community, inspiring contributions by more than 400 developers since the project started in 2009. Given the promise of such performance and comfortable APIs, some thought this could be the end of Hadoop MapReduce. But while Spark performs better when all the data fits in memory, especially on dedicated clusters, Hadoop MapReduce is designed for data that doesn’t fit in memory, and it runs well alongside other services. Both have benefits, and most companies won’t use Spark on its own; they still need HDFS to store the data and may want to use HBase, Hive, Pig, Impala or other Hadoop projects. This means they will still need to run Hadoop and MapReduce alongside Spark for a full Big Data package.
The truth is that Spark is a natural complement to Apache Hadoop YARN, the architectural center of Hadoop, which allows multiple data processing engines to interact with data stored in a single platform. YARN provides resource management and a central platform to deliver consistent operations, security, and data governance tools across Hadoop clusters. It unlocks an entirely new approach to Big Data analytics, and Spark is a key pillar in that approach. Together, Spark and YARN enable a modern data architecture that allows users to store data in a single location and interact with it in multiple ways, using whichever data processing engine best matches the analysis.
This cross-stack integration makes Spark on YARN one of the best options for unlocking the value of large-scale Big Data repositories and extracting rich insights from a data lake, and enterprise customers want to take full advantage of YARN’s unique power for multi-tenant Big Data analysis. Now data scientists can substantiate machine-learning insights from Spark with interactive insights from Apache Hive or real-time insights from Apache Storm (to name just two of the multiple engines managed by YARN).
However, broad Hadoop adoption requires not just powerful analytics but also enterprise-grade services for operations, data security and governance.
Some companies begin their journey with Spark use cases on clusters that either don’t contain sensitive data or are dedicated to a single application, meaning that they aren’t subject to broad security requirements. These deployments are relatively self-contained.
Then, with early Spark successes under their belts, those companies want to capture YARN’s unique multi-tenant value. They seek to deploy Spark-based applications alongside other applications in a single cluster, but with this deeper integration they need to meet higher security standards.
The combination of Spark and Hadoop helps address several challenges that companies face, such as:
Costs: Improving storage and processing scalability can help to cut costs by 20-40% while simultaneously adding high volumes of data.
Duplication: Deploying a big data platform in the cloud can unify separate clusters into one that supports both Spark and Hadoop.
Analytics: With only a retrospective view of data, companies have limited predictive capabilities, hampering Big Data’s strategic value to anticipate emerging market trends and customer needs. Spark helps to process billions of events per day at a blistering analytical pace of 40 milliseconds per event.
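To put the figures in the last item into perspective, here is a rough back-of-the-envelope calculation. It rests on assumptions the text does not state: that 40 milliseconds is the processing time for one event on a single worker, that "billions" means one billion as the smallest case, and that events arrive evenly over the day. The function name is hypothetical.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def workers_needed(events_per_day, seconds_per_event):
    """Average number of fully busy workers needed to keep up with the daily volume."""
    total_work_seconds = events_per_day * seconds_per_event
    return total_work_seconds / SECONDS_PER_DAY


# Assumption: one billion events per day, 40 ms of processing per event.
average_busy_workers = workers_needed(1_000_000_000, 0.040)
print(math.ceil(average_busy_workers))
```

Under those assumptions, one billion events represent 40 million seconds of work per day, so a single worker could not keep up; the load has to be spread across several hundred workers running in parallel, which is the kind of scale-out Spark on a shared cluster is meant to provide.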
That’s why we are convinced that consistent work to integrate Spark with the security constructs of the broader Hadoop platform is crucial, so that Spark can run on a secure Hadoop cluster. There is also a community effort to ensure that Spark runs on a Kerberos-enabled cluster, so that only authenticated users can submit Spark jobs. The potential benefits for companies are huge.
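As an illustration of what submitting a Spark job to a Kerberos-enabled YARN cluster can look like, the sketch below assembles a `spark-submit` command line in Python. The `--master`, `--deploy-mode`, `--queue`, `--principal` and `--keytab` options are standard `spark-submit` options for YARN as far as we know, but the helper function, application name, principal, keytab path and queue name are all hypothetical placeholders, and the exact setup depends on the Spark version and cluster configuration.

```python
import shlex


def build_spark_submit(app, principal, keytab, queue="default"):
    """Assemble a spark-submit argument list for a Kerberos-secured YARN cluster."""
    return [
        "spark-submit",
        "--master", "yarn",            # let YARN manage the cluster resources
        "--deploy-mode", "cluster",    # run the driver inside the cluster
        "--queue", queue,              # YARN queue shared with other tenants
        "--principal", principal,      # Kerberos identity submitting the job
        "--keytab", keytab,            # credentials file for that identity
        app,
    ]


# All values below are made-up placeholders.
cmd = build_spark_submit(
    app="analytics_job.py",
    principal="analyst@EXAMPLE.COM",
    keytab="/etc/security/keytabs/analyst.keytab",
    queue="analytics",
)
print(shlex.join(cmd))
```

The script only prints the command rather than running it, so it can be reviewed before use. Submitting to a named queue is one way a Spark application can share a multi-tenant cluster with other workloads.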