In graph theory, a graph is a set of vertices connected by edges. In a directed graph, each edge has a direction: it goes from one vertex to another, one way only.
When a client submits Spark application code, the driver implicitly converts the code containing transformations into a logical directed acyclic graph (DAG). It then converts the logical DAG into a physical execution plan made up of a set of stages. After creating the physical execution plan, it creates small physical execution units, referred to as tasks, under each stage. A directed acyclic graph is a directed graph with no cycles: following the directed edges, it is impossible to return to a vertex you have already visited.
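As a rough illustration of stage creation, here is a toy sketch in plain Python (not Spark's actual planner, and the transformation list is hypothetical): consecutive narrow transformations are grouped into one stage, and a wide transformation such as `reduceByKey`, which requires a shuffle, closes the current stage.

```python
# Hypothetical chain of transformations; "wide" ones need a shuffle.
transformations = [
    ("textFile", "narrow"),
    ("flatMap", "narrow"),
    ("map", "narrow"),
    ("reduceByKey", "wide"),   # shuffle boundary
    ("filter", "narrow"),
]

def split_into_stages(ops):
    """Group consecutive narrow transformations into a stage;
    a wide transformation closes the stage and a new one begins."""
    stages, current = [], []
    for name, width in ops:
        current.append(name)
        if width == "wide":
            stages.append(current)
            current = []
    if current:
        stages.append(current)
    return stages

print(split_into_stages(transformations))
# → [['textFile', 'flatMap', 'map', 'reduceByKey'], ['filter']]
```

Each resulting stage would then be broken into tasks, typically one per partition of the data.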
The driver program then talks to the cluster manager and negotiates for resources. The cluster manager launches executors on the worker nodes on behalf of the driver. The driver then sends tasks to the executors based on data placement, and the executors start executing the tasks assigned to them. At any point while the Spark application is running, the driver program monitors the set of executors that are running.
Conclusion: At a high level, the structure of a Spark program is this: RDDs are created from the input data, new RDDs are derived from existing RDDs using transformations, and finally an action is performed on the data. In any Spark program, the DAG of operations is created by default, and whenever the driver runs a job (through an action), the DAG is converted into a physical execution plan. DAGs can model many different kinds of information. For example, a spreadsheet can be modeled as a DAG, with a vertex for each cell and an edge whenever the formula in one cell uses the value from another; a topological ordering of this DAG can be used to update all cell values when the spreadsheet is changed.
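The spreadsheet example can be sketched with Python's standard-library `graphlib`. The cell names and formulas here are made up for illustration: suppose `C1 = A1 + B1` and `D1 = C1 * 2`; mapping each cell to the set of cells it depends on gives a DAG whose topological order is a valid recalculation order.

```python
from graphlib import TopologicalSorter

# Hypothetical spreadsheet: C1 = A1 + B1, D1 = C1 * 2.
# Each cell maps to the set of cells its formula reads.
dependencies = {
    "C1": {"A1", "B1"},
    "D1": {"C1"},
}

# static_order() yields each cell only after all of its
# dependencies, so recomputing cells in this order is safe.
order = list(TopologicalSorter(dependencies).static_order())
print(order)
# A1 and B1 may appear in either order, but both precede C1,
# which precedes D1.
```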
Step-by-step summary of the driver's role:
- The driver program schedules the job execution and negotiates with the cluster manager.
- It translates the RDD transformations into an execution graph (DAG) and splits the graph into multiple stages.
- The driver program accesses Apache Spark through a SparkContext object, which represents a connection to the Spark computing cluster.
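The lazy DAG-then-action model described above can be mimicked in a few lines of plain Python. This is a toy class, not the real Spark API: transformations only record lineage, and nothing executes until an action (`collect`) runs the whole recorded chain.

```python
# Toy sketch of lazy evaluation (not Spark itself): transformations
# append to a lineage list; the action replays the lineage.
class ToyRDD:
    def __init__(self, data, lineage=None):
        self.data = data
        self.lineage = lineage or []          # recorded transformations

    def map(self, fn):                        # transformation: lazy
        return ToyRDD(self.data, self.lineage + [("map", fn)])

    def filter(self, pred):                   # transformation: lazy
        return ToyRDD(self.data, self.lineage + [("filter", pred)])

    def collect(self):                        # action: executes the chain
        out = self.data
        for op, fn in self.lineage:
            if op == "map":
                out = [fn(x) for x in out]
            else:
                out = [x for x in out if fn(x)]
        return out

rdd = ToyRDD([1, 2, 3, 4]).map(lambda x: x * 10).filter(lambda x: x > 15)
print(rdd.collect())   # → [20, 30, 40]
```

In real Spark, the same shape holds: `map` and `filter` build up the DAG, and only an action such as `collect` triggers the driver to turn that DAG into stages and tasks.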