While attending university, I recall spending more than a little time discussing the nuances and efficiency of various sorting algorithms. Barely a decade later, those considerations seem largely irrelevant. Modern distributed computing problems need a programming abstraction that is better than MapReduce, one that wraps MapReduce's ideas into a single higher-level concept.
Just as early engines relied upon a single large cylinder to produce power, pioneering computer scientists relied upon a single, massive computer to produce compute power, a very expensive approach that was very limited in top-end capability. Hadoop introduced the MapReduce processing framework, a fundamentally new abstraction for programmers approaching distributed computing problems. Similarly, Spark is a newer programming abstraction for distributed computing problems, and it is better and faster than MapReduce in several areas of distributed processing.
One driving need was iterative algorithms, which were a tough challenge to solve using the traditional MapReduce abstraction. What a distributed system enables you to do is scale horizontally: going back to our previous example of the single database server, the only way to handle more traffic would be to upgrade the hardware the database is running on, which is called scaling vertically. Iterative and graph algorithms are those that work on the same data again and again, whereas Hadoop is designed to transform the data, reduce it, and then move on to the next step, writing intermediate results out between steps. This limitation, primarily, gave rise to Spark and its new abstraction, the RDD (Resilient Distributed Dataset).
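The difference this makes for iterative workloads can be sketched in plain Python. The `ToyRDD` class below is a hypothetical stand-in for an RDD, not Spark's real API: transformations are lazy recipes, and `cache()` keeps the materialized result in memory so that repeated actions do not recompute the whole lineage, the way MapReduce would reread intermediate results from disk between iterations.

```python
# Minimal sketch of RDD-style lazy lineage and caching (illustrative
# assumption: ToyRDD is a toy class, not Spark's actual implementation).

class ToyRDD:
    """A recipe (lineage) for producing a dataset on demand."""

    def __init__(self, compute):
        self._compute = compute      # how to (re)build this dataset
        self._cached = None          # in-memory copy, filled after cache()
        self._use_cache = False

    def map(self, fn):
        # Transformations are lazy: we only record a new recipe.
        return ToyRDD(lambda: [fn(x) for x in self.collect()])

    def cache(self):
        self._use_cache = True
        return self

    def collect(self):
        # Actions materialize the data; without cache(), the entire
        # lineage is recomputed on every action.
        if self._use_cache:
            if self._cached is None:
                self._cached = self._compute()
            return self._cached
        return self._compute()


recomputations = 0

def expensive(x):
    global recomputations
    recomputations += 1
    return x * 2

base = ToyRDD(lambda: [1, 2, 3])
doubled = base.map(expensive).cache()

doubled.collect()      # first action computes: 3 calls
doubled.collect()      # second action is served from memory
print(recomputations)  # -> 3 (would be 6 without cache())
```

Running the same two actions without `cache()` would invoke `expensive` six times, which is the recomputation cost an iterative algorithm pays on every pass under a MapReduce-style model.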
In June 2010, Matei Zaharia released his research paper on the project, titled “Spark: Cluster Computing with Working Sets,” and followed up later that year by releasing the project as open source. In 2013, Spark collaborators founded Databricks, a company focused on supporting Spark as an open source product, and Spark switched its license to Apache. Cloud computing brings a number of advantages to consumers in terms of accessibility and elasticity, since it is based on centralizing resources that possess huge processing power and storage capacity.
A step-by-step look at how these technologies evolved:
- Apache Spark is a fast, easy-to-use framework that is sweeping the big data world.
- MapReduce was built to conquer the problems of big data by parallelizing processing across a cluster of machines and taking the processing to the data (data locality). But MapReduce has its own difficulties: complex algorithms are hard to express, and it cannot efficiently perform anything beyond batch processing. Still, grouping together a network of computers offered the promise of far more capability for much less expense, and thus modern distributed computing systems started gaining momentum.
- MapReduce explosion: MapReduce led to an explosion of specialized libraries, each adding new logic, code, and APIs to your big data model, and each attempting to solve a different problem.
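The MapReduce model the list above refers to can be sketched in a few lines of plain Python: a map phase emits (key, value) pairs, a shuffle groups values by key, and a reduce phase aggregates each group. This is a toy single-machine illustration of the programming model, not Hadoop's actual API.

```python
# Toy word count in the MapReduce style: map -> shuffle -> reduce.
from collections import defaultdict

def map_phase(lines):
    # Mapper: emit (word, 1) for every word in every input line.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle: group all emitted values by key, as the framework
    # does between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reducer: sum the counts for each word.
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["spark is fast", "mapreduce is batch"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts)  # -> {'spark': 1, 'is': 2, 'fast': 1, 'mapreduce': 1, 'batch': 1}
```

Every step here is a pure function over key-value pairs, which is exactly what makes the model easy to parallelize across machines, and also what makes multi-pass or iterative algorithms awkward to express.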