Apache Spark is a fast data processing engine built for Big Data. It processes large volumes of data in a distributed manner (cluster computing). Very popular for several years now, this framework is often presented as a replacement for Hadoop. Its main advantages are its speed, ease of use, and versatility. Here is everything you need to know about Apache Spark.
It is an open source parallel data processing engine for large-scale analysis across clustered machines. It should not be confused with Cisco's Spark messaging software, available on Windows, nor with Amazon's Spark social network. Written in Scala, Spark can process data from repositories such as the Hadoop Distributed File System (HDFS), NoSQL databases, or relational data stores such as Apache Hive. The engine also supports in-memory processing, which increases the performance of Big Data analytical applications. It can fall back on conventional disk-based processing when datasets are too large for system memory.
Apache Spark definition: Big Data as the main application
Its main advantage is its speed: it can run programs up to 100 times faster than Hadoop MapReduce in memory, and up to 10 times faster on disk. Its advanced DAG (directed acyclic graph) execution engine supports acyclic data flow and in-memory computing. It is also easy to use, and applications can be developed in Java, Scala, Python and R. Its programming model is simpler than that of Hadoop MapReduce. With more than 80 high-level operators, the software makes it easy to develop parallel applications.
Another advantage of Apache Spark is its generality. It acts at the same time as a SQL query engine (Spark SQL), a stream processing engine (Spark Streaming), and a graph processing system (GraphX). Apache Spark also includes MLlib, a library of Machine Learning algorithms. These libraries can easily be combined within the same application.
The engine can run on Hadoop 2 clusters managed by the YARN resource manager, or on Mesos. It can also be launched in standalone mode, or in the cloud on Amazon's Elastic Compute Cloud (EC2) service. It provides access to various data sources such as HDFS, Cassandra, HBase and S3.
The other strength of this engine is its massive community. Apache Spark is used by a large number of companies to process large datasets. The community can be reached through mailing lists, or at events and summits. As an open source platform, Apache Spark is developed by contributors from over 200 companies. Since 2009, more than 1000 developers have contributed to the project. As a result, many Spark tutorials are available.
With its data processing speed, its ability to federate many types of databases, and its capacity to run a variety of analytical applications, Spark can unify Big Data applications. That is why this framework could soon supplant Hadoop.
Big Data: Apache Spark vs Hadoop
For more than 10 years, Hadoop has been regarded as the leading Big Data processing technology, and it remains a solution of choice for processing large datasets. MapReduce is very efficient for “one-pass” calculations, but it is less convenient for use cases that require multi-pass calculations and algorithms. The reason is that each step of the data processing must be broken down into a Map phase and a Reduce phase.
Between each step, the output data must be stored in the distributed file system before the next step can begin. In practice, this approach is very slow. In addition, Hadoop solutions typically rely on clusters that are difficult to configure and manage. Several tools must also be integrated to cover the different Big Data use cases. For Machine Learning, for example, Mahout has to be used. For the processing of data streams, Storm has to be integrated.
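The step-by-step pattern described above can be sketched in plain Python. This is only a minimal illustration, not Hadoop code: the function names (`run_step`, `load_step`, `frequency_histogram`) and the JSON files standing in for the distributed file system are assumptions made for the example. It chains two Map + Reduce steps, and the second step can only start by reading back what the first one wrote to storage.

```python
import json
import tempfile
from collections import defaultdict
from pathlib import Path


def run_step(records, map_fn, out_path):
    """Run one Map + Reduce step and write its output to storage."""
    # Map phase: every input record emits zero or more (key, value) pairs.
    pairs = [pair for record in records for pair in map_fn(record)]
    # Reduce phase: sum the values that share the same key.
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    # The result is persisted before any later step may use it.
    out_path.write_text(json.dumps(list(totals.items())))
    return out_path


def load_step(path):
    """Read a previous step's output back from storage."""
    return [tuple(item) for item in json.loads(path.read_text())]


def words_to_pairs(line):
    """Map function for step 1: emit (word, 1) for each word of a line."""
    return [(word.lower(), 1) for word in line.split()]


def count_to_pair(record):
    """Map function for step 2: emit (frequency, 1) for each (word, frequency)."""
    _word, count = record
    return [(count, 1)]


def frequency_histogram(lines, workdir):
    """Two chained steps: word counts, then how many words share each count."""
    step1 = run_step(lines, words_to_pairs, Path(workdir) / "step1.json")
    step2 = run_step(load_step(step1), count_to_pair, Path(workdir) / "step2.json")
    return dict(load_step(step2))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # "spark" and "hadoop" appear twice, "storm" once.
        print(frequency_histogram(["spark hadoop spark", "hadoop storm"], tmp))
```

Every additional pass over the data adds another write and another read of this kind, which is the cost the article attributes to multi-pass workloads on MapReduce.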
For its part, Apache Spark lets programmers develop complex multi-step data pipelines using the directed acyclic graph (DAG) pattern. Spark also supports in-memory data sharing across DAGs, so that different tasks can work on the same data. It runs on top of an existing HDFS infrastructure to provide enhanced and additional functionality. Applications can be deployed on a Hadoop V1 cluster with SIMR, on a Hadoop V2 YARN cluster, or on Apache Mesos.
Rather than a replacement for Hadoop, Spark can be considered an alternative to Hadoop MapReduce. Spark is not intended to replace Hadoop, but to provide a unified and comprehensive solution for managing different Big Data use cases.
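To make the idea of a multi-step pipeline with in-memory sharing more concrete, here is a toy model in plain Python. It is not Spark's implementation or API: `LazyDataset` and `demo` are hypothetical names, and the method names `map`, `filter`, `cache` and `collect` merely echo operations that exist on Spark datasets. The sketch records transformations as a chain of nodes, computes nothing until a result is requested, and keeps a marked intermediate result in memory so that two tasks can reuse it without going back to the source.

```python
class LazyDataset:
    """Toy dataset: transformations are recorded, and run only on collect()."""

    def __init__(self, source=None, parent=None, transform=None):
        self._source = source        # callable producing the raw records
        self._parent = parent        # upstream node in the graph, if any
        self._transform = transform  # function applied to the parent's records
        self._keep_in_memory = False
        self._cached = None

    def map(self, fn):
        """Record a per-record transformation; nothing is computed yet."""
        return LazyDataset(parent=self, transform=lambda rows: [fn(r) for r in rows])

    def filter(self, predicate):
        """Record a filtering step; nothing is computed yet."""
        return LazyDataset(parent=self, transform=lambda rows: [r for r in rows if predicate(r)])

    def cache(self):
        """Mark this node so its records stay in memory once computed."""
        self._keep_in_memory = True
        return self

    def collect(self):
        """Walk the graph back towards the source and compute the records."""
        if self._cached is not None:
            return self._cached
        if self._parent is None:
            rows = list(self._source())
        else:
            rows = self._transform(self._parent.collect())
        if self._keep_in_memory:
            self._cached = rows
        return rows


def demo():
    """Run two tasks on one cached dataset and count reads of the source."""
    source_reads = []

    def load():
        source_reads.append(1)  # stands in for an expensive read from storage
        return range(1, 11)

    evens = LazyDataset(source=load).filter(lambda n: n % 2 == 0).cache()
    reads_before = len(source_reads)  # still 0: only the graph has been built

    total_of_squares = sum(evens.map(lambda n: n * n).collect())  # task 1
    largest = max(evens.collect())                                # task 2
    return reads_before, total_of_squares, largest, len(source_reads)


if __name__ == "__main__":
    print(demo())
```

In this sketch the source is read once even though two tasks consume the filtered data. Removing the call to `cache()` would make each task trigger its own read, which is the situation the MapReduce model imposes between steps.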
The story of Apache Spark
This engine was originally created in 2009 by Matei Zaharia at the AMPLab of the University of California, Berkeley. The initial goal of the project was to take advantage of the declining cost of RAM, and to respond to the exponential growth of Big Data. It was released as open source in 2010 under the BSD license. In 2013, the project was handed over to the Apache Software Foundation and released under the Apache 2.0 license.
Apache took the project into its incubator and promoted it to Top-Level Project in 2014. Version 1.0.0 was released the same year. In November 2014, Zaharia's company, Databricks, broke the record for large-scale data sorting using Spark. With more than 1000 contributors in 2015, Spark became one of the most active projects of the Apache Software Foundation, and one of the most active open source Big Data projects as well. Given the popularity of the platform, companies such as General Assembly or The Data Incubator have been offering Apache Spark training courses since 2014.
Recent news about the Spark engine
In July 2016, Apache Spark was upgraded to version 2.0. This major update notably made the API easier to use and improved performance. It also brought support for SQL:2003, user-defined functions (UDFs) in R, and Structured Streaming. In addition, this version included 2500 patches from over 300 contributors. At the time of writing, the current version of Apache Spark is 2.2, released on July 11, 2017.
On November 16th, Microsoft announced support for this new version of the data processing engine in its Azure cloud. Developers can thus use their database tools to carry out their Big Data analyses. To do so, the Redmond-based firm relied on Databricks to integrate the latest version of Spark into Azure. Cassandra and MariaDB are also available for companies that prefer them, but the company founded by Bill Gates seems to favor the engine featured in this article.
Companies are also particularly fond of Spark for building the data lakes their business requires. Syncsort, a company specializing in Big Data technologies, surveyed 200 IT managers. Nearly 70% of them use a data processing engine such as Spark or Hadoop to build these data lakes.