MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster. MapReduce-based systems have emerged as a prominent framework for large-scale data analysis, with fault tolerance as one of their key features: MapReduce introduced simple yet efficient mechanisms to handle different kinds of failures, including crashes, omissions, and arbitrary failures. Spark goes further and provides a distributed memory abstraction that lets the programmer perform in-memory computation on a large cluster of machines in a fault-tolerant manner. Conceptually, you have one big flat array, this array is distributed across multiple machines, and you have an API to work on it seamlessly; the RDD provides exactly that abstraction.
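The distributed-array idea above can be sketched in a few lines of plain Python. This is a toy stand-in, not Spark's actual API: the `ToyRDD` class and its methods are illustrative names of my own, showing one logical array split into partitions with a single `map()` that applies the same function to every partition.

```python
class ToyRDD:
    """Toy stand-in for an RDD: one logical array stored as partitions."""

    def __init__(self, partitions):
        self.partitions = partitions

    @classmethod
    def parallelize(cls, data, num_partitions):
        # Split the flat array into roughly equal partitions
        # (each partition would live on a different machine in a real cluster).
        size = (len(data) + num_partitions - 1) // num_partitions
        return cls([data[i:i + size] for i in range(0, len(data), size)])

    def map(self, fn):
        # The same function runs on every partition; the caller
        # never addresses an individual machine or row.
        return ToyRDD([[fn(x) for x in part] for part in self.partitions])

    def collect(self):
        # Gather all partitions back into one flat list.
        return [x for part in self.partitions for x in part]


rdd = ToyRDD.parallelize(list(range(10)), num_partitions=3)
doubled = rdd.map(lambda x: x * 2)
print(doubled.collect())  # [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
```

The key design point is that the API is expressed over the whole array, while the partitioning stays invisible to the caller.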
A big-data system is not suited to touching individual rows; it works in batch mode, operating on billions of rows at a time. So if you plan to store intermediate results in distributed shared memory through a database, you are using a system that is well suited to fine-grained updates, while the logic you will implement for your iterative algorithms consists of coarse-grained updates. The modern Hadoop ecosystem does provide a reliable distributed aggregation system that seamlessly delivers data parallelism, as well as analytics that yield useful insights, but further problems arise when you need a fault-tolerant version of this data: updating such a vast amount of data in a database is itself a concern, and keeping a fault-tolerant copy of that data in a database is another big one. So Spark came up with a concept called the RDD (Resilient Distributed Dataset) to solve the problems of distributed shared memory.
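The fault-tolerance idea behind RDDs can be sketched as follows: instead of replicating intermediate data, the system remembers the coarse-grained transformation that produced each partition (its lineage), so a partition lost in a crash can simply be recomputed from its parent. This is a minimal plain-Python sketch under my own naming, not Spark's real implementation:

```python
class LineageRDD:
    """Toy lineage: a derived dataset defined by (parent data, transform)."""

    def __init__(self, parent_partitions, fn):
        self.parent = parent_partitions  # parent data, by partition
        self.fn = fn                     # the coarse-grained transform
        self.cache = {}                  # materialized derived partitions

    def compute(self, i):
        # Materialize partition i on demand; if it was lost, this
        # recomputes it from the lineage instead of from a replica.
        if i not in self.cache:
            self.cache[i] = [self.fn(x) for x in self.parent[i]]
        return self.cache[i]

    def lose_partition(self, i):
        # Simulate a node crash: the derived partition disappears,
        # but the lineage (parent + fn) survives.
        self.cache.pop(i, None)


parent = [[1, 2], [3, 4]]
rdd = LineageRDD(parent, lambda x: x * 10)
first = rdd.compute(1)        # [30, 40]
rdd.lose_partition(1)         # crash
recovered = rdd.compute(1)    # recomputed from lineage: [30, 40]
```

Because the transform is coarse-grained (one function over a whole partition), logging the lineage is cheap, whereas logging billions of fine-grained row updates would not be.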
A stop-gap arrangement one could think of is shared memory: for example, put all the intermediate results into a table and make that table available to all the computing nodes of the cluster. But this would have its own implementation issues.
Step-by-step process on the new technologies
- Hadoop has a good programming abstraction for accessing a cluster as a computation resource, but a poor abstraction for leveraging distributed memory. In a cloud environment, node and task failures are no longer accidental but a common feature of large-scale systems.
- Most current iterative algorithms need to access the results of previous computations; they have to reuse intermediate results across multiple iterations.
- Coarse-grained processing: all of the billions of rows are processed with the same specific logic.
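The last two points combine naturally: an iterative job applies the same coarse-grained step to every row on each round, reusing the previous round's result from memory rather than re-reading it from storage. A trivial sketch in plain Python (the data and step are placeholders of my own):

```python
rows = list(range(5))    # stand-in for billions of rows
intermediate = rows      # kept in memory between iterations

for _ in range(3):
    # One piece of logic for the entire batch -- no per-row
    # (fine-grained) updates, and the previous result is reused
    # directly instead of being written out and read back.
    intermediate = [x + 1 for x in intermediate]

print(intermediate)  # [3, 4, 5, 6, 7]
```

In Spark this reuse is what caching an RDD in memory buys you across iterations.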