Spark revolves around the concept of a resilient distributed dataset (RDD), which is a fault-tolerant collection of elements that can be operated on in parallel. There are two ways to create RDDs: parallelizing an existing collection in the driver program, or referencing a dataset in an external storage system such as a shared file system or HDFS.
Each RDD is divided into partitions, and each of these partitions can reside in memory or be stored on disk on different machines in a cluster. RDDs are immutable (read-only) data structures: you cannot change the original RDD, but you can always transform it into a different RDD with all the changes you want. Operations on RDDs fall into two groups. An action is an operation that returns a value to the calling application or exports data to the storage system; examples of actions are count, collect, and save. Actions are different from transformations and are close to the reduce functionality of MapReduce. A transformation such as map takes a function with the signature U => V, where U denotes the type of elements the source RDD contains; being an element-by-element transformer, it applies this function to each element and yields a new RDD whose elements have type V. An RDD can be persisted for future computation, that is, kept in memory or saved on disk; this means RDDs can be reloaded and kept in memory. You can even treat an RDD as an SQL store, using a concept called DataFrames.
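A minimal sketch of these ideas in Scala (the app name and data are illustrative, and a local SparkContext is created here only for demonstration):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("RDDBasics").setMaster("local[*]")
val sc = new SparkContext(conf)

val words = sc.parallelize(Seq("spark", "rdd", "action"))

// map is an element-by-element transformation; here String => Int.
val lengths = words.map(w => w.length)

// Persist the RDD so it is kept in memory for reuse across actions.
lengths.cache()

// Actions return a value to the calling application...
println(lengths.count())                  // 3
println(lengths.collect().mkString(","))  // 5,3,6

// ...or export data to the storage system (path is illustrative):
// lengths.saveAsTextFile("/tmp/lengths")
```

Note that `words` itself is never modified; `map` produces a new RDD, in keeping with RDD immutability.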
RDDs differ from distributed shared memory (DSM). In DSM, the abstraction is always at the level of a single record and deals with access to arbitrary memory locations: for example, if you have 10 records, you can look at, update, or delete any individual record, so DSM can be treated as an in-memory database. In RDD, by contrast, a new RDD is created as the result of a transformation of an existing RDD. An operation is a method that can be applied to an RDD to accomplish a certain task; RDD supports two types of operations, actions and transformations. An operation can be something as simple as sorting, filtering, or summarizing data. Here, you cannot point to individual records and there are no pointed updates; you talk only about transformations. The abstraction is always at the level of a set of records and deals with transformations.
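The set-of-records abstraction can be sketched as follows (assumes an existing SparkContext named sc; the variable names are illustrative):

```scala
// Assumes an existing SparkContext `sc`.
val records = sc.parallelize(1 to 10)

// No pointed update or delete: to "remove" record 7 you transform
// the whole set into a new RDD; `records` itself is unchanged.
val without7 = records.filter(r => r != 7)

println(records.count())   // 10 -- the original RDD is immutable
println(without7.count())  // 9
```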
There is more than one language possibility for writing a Spark application. The most obvious one is Java; Spark also supports Python, and even another big data language, R. In this module, we will use Scala. Consider a short example. The first line defines a base RDD from an external file. This dataset is not loaded in memory or otherwise acted on: lines is merely a pointer to the file. The second line defines lineLengths as the result of a map transformation. Again, lineLengths is not immediately computed, due to laziness.
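The two lines just described can be sketched in Scala as follows (assumes an existing SparkContext named sc; the file path is illustrative):

```scala
// Assumes an existing SparkContext `sc`; the path is illustrative.
val lines = sc.textFile("data.txt")          // base RDD: merely a pointer to the file
val lineLengths = lines.map(s => s.length)   // lazy: not computed yet

// Only an action, such as reduce, forces the computation:
val totalLength = lineLengths.reduce((a, b) => a + b)
```

Until the action on the last line runs, Spark has done no work beyond recording the lineage of transformations.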
- Spark exposes RDDs through language-specific APIs; these APIs are available in Scala, Python, and Java.
- Datasets are exposed as objects, and transformations (map, flatMap, etc.) are invoked as methods on these objects. With map, we have the flexibility that the element type of the resulting RDD may differ from that of the input RDD.
- RDDs are used to perform actions: an action can be an operation that returns a value to the calling application or exports data to the storage system.
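The bullet points above can be sketched together (assumes an existing SparkContext named sc; the data and paths are illustrative):

```scala
// Transformations are invoked as methods on RDD objects.
val sentences = sc.parallelize(Seq("hello world", "spark rdd"))

// flatMap: one input element may yield several output elements.
val tokens = sentences.flatMap(s => s.split(" "))   // RDD[String]

// map may change the element type: String => Int.
val lens = tokens.map(t => t.length)                // RDD[Int]

// An action returning a value to the calling application:
println(lens.collect().mkString(","))  // 5,5,5,3

// An action exporting data to the storage system (path illustrative):
// lens.saveAsTextFile("hdfs:///tmp/token-lengths")
```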