MapReduce: all about the Hadoop framework for Big Data processing

MapReduce is a framework for processing data on clusters. Built around two functions, Map and Reduce, it lets you distribute data processing tasks across different computers and then combine the results into a single output. Discover its origins, how it works, its advantages, and the criticisms levelled at it.

What is MapReduce? Definition

MapReduce is a programming model created by Google for processing and generating large datasets on computer clusters. It is a central component of the Apache Hadoop software framework, which enables the resilient, distributed processing of massive unstructured data sets on clusters where each node has its own storage. Specifically, the framework offers two main features: it distributes the work across the individual nodes of the cluster (map), then collects and reduces the results provided by each node into a single, consistent answer to a query (reduce). This is made possible by its HDFS distributed file system.

What are the advantages of MapReduce?

MapReduce runs on large clusters of machines and is highly scalable. It can be implemented in several forms using different programming languages such as Java, C# and C++. For novice developers, the framework is convenient because library routines can be used to create parallel programs without worrying about intra-cluster communication, task monitoring or error handling. Programmers without experience in parallel and distributed systems can easily use the resources of large distributed systems.

To distribute the input data and consolidate the results, it operates in parallel on massive clusters. The size of a cluster has no impact on data processing: tasks can be distributed over any number of servers. This is why MapReduce and Hadoop simplify software development. The framework is available in several languages, including C, C++, Java, Ruby, Perl and Python. Programmers can use the Java-based MapReduce libraries to create tasks without worrying about communication or coordination between nodes.

The main advantage of this framework is its fault tolerance. Tasks are transferred from node to node, and if the master node notices that a node has been silent longer than expected, it reassigns that node's task to another node. This creates resilience and makes it practical to run the framework on inexpensive servers.
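This heartbeat-and-reassign idea can be sketched in a few lines of Python. Everything here (function and variable names, the timeout value) is a hypothetical illustration of the principle, not Hadoop's actual implementation:

```python
import time

# Illustrative sketch: a master presumes any worker silent longer than
# TIMEOUT is dead and hands its task to an idle worker.
TIMEOUT = 10.0  # seconds without a heartbeat before a worker is presumed dead

def reassign_stale_tasks(assignments, last_heartbeat, idle_workers, now=None):
    """assignments maps task -> worker; last_heartbeat maps worker -> timestamp."""
    now = time.time() if now is None else now
    for task, worker in list(assignments.items()):
        if now - last_heartbeat.get(worker, 0.0) > TIMEOUT and idle_workers:
            assignments[task] = idle_workers.pop()  # move the task to a live node

# Example: worker "w1" went silent 30 s ago, so its task moves to "w3".
assignments = {"map-0": "w1", "map-1": "w2"}
heartbeats = {"w1": 100.0, "w2": 128.0}
reassign_stale_tasks(assignments, heartbeats, ["w3"], now=130.0)
print(assignments)  # {'map-0': 'w3', 'map-1': 'w2'}
```

Because map tasks are side-effect-free and re-runnable, reassigning one costs nothing but the lost work, which is what makes cheap, failure-prone hardware viable.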

MapReduce: how does the framework work?

The framework is divided into two parts. The Map function distributes the work across the different nodes of the distributed cluster. The Reduce function collects the results from those nodes and reduces them to a single result. Hence the name, which combines both functions. The framework also consists of several components. The JobTracker is the master node that manages all the tasks and resources of a cluster. TaskTrackers are the agents deployed on each machine of the cluster to run the map and reduce tasks. The JobHistoryServer is a component that tracks completed jobs, usually deployed as a separate function or alongside the JobTracker.

Processing takes place on data stored in the file system or in a database. A set of key-value pairs is taken as input and transformed into another set of key-value pairs produced by the framework. The runtime system takes care of partitioning the input data, scheduling program execution across the set of machines, handling machine failures and managing communication between machines.
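The key-value transformation can be made concrete with the classic word-count job, sketched here in plain Python (no Hadoop required): Map emits a `(word, 1)` pair per word, the framework groups pairs by key, and Reduce sums each group.

```python
from collections import defaultdict

def map_phase(doc_id, text):
    """Map: emit a (key, value) pair for every word in an input split."""
    for word in text.lower().split():
        yield (word, 1)

def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return (key, sum(values))

# Two input "splits", as if stored on two different nodes.
splits = {"part-0": "the cat sat", "part-1": "the cat ran"}

pairs = [p for doc, text in splits.items() for p in map_phase(doc, text)]
result = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(result)  # {'the': 2, 'cat': 2, 'sat': 1, 'ran': 1}
```

In a real cluster each split is mapped on the node that stores it, and the shuffle moves pairs over the network so that all values for one key land on the same reducer.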

Every single day, many MapReduce programs are run on Google’s clusters. The programs are automatically parallelized and executed on large clusters of machines. MapReduce is used for distributed sorting, reversing web link graphs, web access statistics, document clustering, machine learning and statistical machine translation.

For example, while it is possible to manually count the number of times a word appears in a novel, doing so takes a very long time. If you divide the task among about 20 people, it goes much faster: each person takes a page of the novel and writes down the number of times each word appears on that page. This is the Map part of MapReduce. If one person leaves, another takes their place, which illustrates MapReduce's fault tolerance. When all the pages have been processed, the participants sort the words into 26 boxes based on each word's first letter. Each participant takes one box and sorts its words in alphabetical order. Totalling the per-page counts for each word is an example of the Reduce part of MapReduce.
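The 26-boxes step corresponds to what the framework calls partitioning: a partition function decides which reducer receives each key. A minimal Python sketch of the analogy follows; routing by first letter mirrors the example, whereas Hadoop's default partitioner hashes the key instead:

```python
from collections import defaultdict

def partition(word, num_reducers=26):
    """Route a key to a 'box' by its first letter, like the 26-boxes analogy.
    (Hadoop's default partitioner uses a hash of the key instead.)"""
    return ord(word[0].lower()) - ord("a")

words = ["apple", "banana", "avocado", "cherry", "blueberry"]
boxes = defaultdict(list)
for w in words:
    boxes[partition(w)].append(w)

# Each reducer sorts its own box independently, like each person in the example.
for box in boxes.values():
    box.sort()

print(boxes[0])  # words starting with 'a': ['apple', 'avocado']
print(boxes[1])  # words starting with 'b': ['banana', 'blueberry']
```

Because every key goes to exactly one box, the reducers can work in parallel without coordinating with each other.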

Diagram: how the MapReduce framework works

The origins of the framework

Originally, MapReduce was a framework developed for indexing web pages on the Google search engine. It allowed the firm to replace its old indexing algorithms. While the term originally referred to Google’s proprietary technology, it is now used generically. As of 2014, Google no longer uses MapReduce as its primary model for Big Data processing. MapReduce libraries have been written in many different programming languages, the Apache Hadoop implementation being the most popular.

MapReduce: what are the criticisms of the framework?

Several criticisms are levelled at MapReduce. Computer scientists David DeWitt and Michael Stonebraker, who specialize in parallel databases and shared-nothing architectures, consider that MapReduce solves only a narrow class of problems. According to them, MapReduce is not as innovative as its creators claim, and Teradata has offered similar functionality for nearly two decades.

It should also be noted that Google has been granted a controversial patent on an equivalent of this programming architecture, even though the product remains very similar to others. For example, the Map and Reduce functions can easily be implemented in Oracle PL/SQL databases. They are also supported by distributed database architectures such as the Clusterpoint XML database and the MongoDB NoSQL database.

Likewise, the two experts argue that because MapReduce reads its input files in full and offers little schema support, it cannot match the performance gains that indexing brings to most common databases. However, projects such as Pig, Sawzall, Apache Hive, YSmart, HBase and BigTable address some of these issues, and some teams use Spark alongside MapReduce to extend its functionality.

However, Greg Jorgensen disputed these criticisms, arguing that MapReduce was never designed as a database and that DeWitt and Stonebraker's analysis is therefore beside the point. In response, DeWitt and Stonebraker published a benchmark comparing the Hadoop MapReduce and RDBMS approaches on several specific problems. Their conclusion is that relational databases offer real advantages for many uses of data, especially for complex processing, but that MapReduce is easier to use for simple or one-off tasks.

More about Hadoop MapReduce

    • Cluster Hadoop: Understanding everything about server clusters
    • Comparative Hadoop: top 7 commercial distribution sellers
    • Deploying Hadoop, a real ordeal according to the experts
    • Hadoop – All about the main Big Data platform
