42

I recently found a framework named ecto.

In this framework, a basic component is the "plasm", which is ecto's Directed Acyclic Graph. In ecto, a plasm can be operated on by an ecto scheduler.

I am wondering: what is the advantage of this mechanism, and in what other situations can we exploit the concept of a DAG?

Po-Jen Lai
  • 563
  • 1
  • 4
  • 9

9 Answers

36

Nice Question.

  • Code may be represented by a DAG describing the inputs and outputs of each of the arithmetic operations performed within the code; this representation allows the compiler to perform common subexpression elimination efficiently.
  • Most Source Control Management Systems implement the revisions as a DAG.
  • Several Programming languages describe systems of values that are related to each other by a directed acyclic graph. When one value changes, its successors are recalculated; each value is evaluated as a function of its predecessors in the DAG.
  • DAGs are handy in detecting deadlocks, as they illustrate the dependencies amongst a set of processes and resources.
  • In many randomized algorithms in computational geometry, the algorithm maintains a history DAG representing features of some geometric construction that have been replaced by later, finer-scale features; point location queries can be answered by following paths in this DAG.
  • Once we have the DAG in memory, we can write algorithms to calculate the maximum execution time of the entire set.
  • While programming spreadsheet systems, the dependency graph that connects one cell to another if the first cell stores a formula that uses the value in the second cell must be a directed acyclic graph. Cycles of dependencies are disallowed because they cause the cells involved in the cycle to not have a well-defined value. Additionally, requiring the dependencies to be acyclic allows a topological order to be used to schedule the recalculations of cell values when the spreadsheet is changed.
  • Using a DAG, we can write algorithms to evaluate the computations in the correct order (a sketch follows this list).
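
For the last two points, here is a minimal sketch in Python (an illustration only; the graph representation and the cell names are made up) of scheduling such recalculations with a topological sort (Kahn's algorithm):

    from collections import deque

    def topological_order(dependencies):
        """Return an evaluation order for a DAG given as {node: [nodes it depends on]}."""
        # Build reverse edges (dependency -> dependents) and count unmet dependencies.
        dependents = {}
        unmet = {}
        for node, deps in dependencies.items():
            unmet.setdefault(node, 0)
            for dep in deps:
                dependents.setdefault(dep, []).append(node)
                unmet.setdefault(dep, 0)
                unmet[node] += 1

        # Kahn's algorithm: repeatedly emit nodes whose dependencies are all satisfied.
        ready = deque(node for node, count in unmet.items() if count == 0)
        order = []
        while ready:
            node = ready.popleft()
            order.append(node)
            for dependent in dependents.get(node, []):
                unmet[dependent] -= 1
                if unmet[dependent] == 0:
                    ready.append(dependent)

        if len(order) != len(unmet):
            raise ValueError("cycle detected: no valid evaluation order exists")
        return order

    # Hypothetical spreadsheet: C1 = A1 + B1, D1 = C1 * 2
    print(topological_order({"A1": [], "B1": [], "C1": ["A1", "B1"], "D1": ["C1"]}))
    # ['A1', 'B1', 'C1', 'D1']

The same routine doubles as a cycle detector: if it cannot emit every node, the graph is not acyclic, which is exactly the condition a spreadsheet has to reject.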

EDIT :

  • Ordering of formula cell evaluation when recomputing formula values in spreadsheets can be done using DAGs.
  • Git uses DAGs for content storage, reference pointers for heads, object model representation, and remote protocol.
  • DAGs are used in trace scheduling, the first practical approach to global scheduling: trace scheduling tries to optimize the control-flow path that is executed most often.
  • Ecto is a processing framework, and it uses a DAG to model processing graphs so that the graphs perform ordered, synchronous execution. The plasm in Ecto is the DAG, and the scheduler operates on it.
  • DAGs are used in software pipelining, a technique for optimizing loops in a manner that parallels hardware pipelining.


Md Mahbubur Rahman
  • 4,809
  • 5
  • 34
  • 38
14

The answer is that it doesn't have much of anything to do with programming. It has to do with problem solving.

Just like linked-lists are data structures used for certain classes of problems, graphs are useful for representing certain relationships. Linked lists, trees, graphs, and other abstract structures only have a connection to programming in that you can implement them in code. They exist at a higher level of abstraction. It's not about programming, it's about applying data structures in the solution of problems.

If you still want some relationship with programming, then please consider the following points:

  • DAGs (in this context known as wait-for graphs) are handy in detecting deadlocks, as they illustrate the dependencies amongst a set of processes and resources (both are nodes in the DAG). A deadlock is detected when a cycle is found in this graph.
  • Once you have the DAG in memory, you can write algorithms to:
    • make sure the computations are evaluated in the correct order (topological sort)
    • if computations can be done in parallel but each computation has a maximum execution time, you can calculate the maximum execution time of the entire set (see the sketch after this list)
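
A sketch of the second point (the task names and durations below are invented; graphlib.TopologicalSorter has been in the Python standard library since 3.9): if independent tasks run fully in parallel, the maximum execution time of the whole set is the length of the longest time-weighted path through the DAG, the so-called critical path.

    from graphlib import TopologicalSorter  # Python 3.9+

    def critical_path_time(deps, duration):
        """Longest time-weighted path through a task DAG.

        deps:     {task: set of tasks that must finish before it starts}
        duration: {task: execution time of that task}
        """
        finish = {}  # earliest possible finish time of each task
        for task in TopologicalSorter(deps).static_order():
            start = max((finish[d] for d in deps.get(task, ())), default=0)
            finish[task] = start + duration[task]
        return max(finish.values())

    # Hypothetical pipeline: load -> {parse, validate} -> report
    deps = {"load": set(), "parse": {"load"}, "validate": {"load"},
            "report": {"parse", "validate"}}
    duration = {"load": 3, "parse": 5, "validate": 2, "report": 1}
    print(critical_path_time(deps, duration))  # 9  (load -> parse -> report)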
Vaibhav Agarwal
  • 1,248
  • 2
  • 8
  • 21
8

Other people have applied DAGs to data, but I think they are at least as applicable (if not more so) to code. Mahbubur R Aaman mentions this, so really this is more of an addendum to his answer than a complete answer on its own.

It occurs to me that any imperative computer program that is free of infinite loops (thanks @AndresF.) is a Directed Acyclic Graph (DAG). That is, the possible paths of execution of the code are directed (first this, then that) and acyclic (not forming infinite loops). They form a graph because the path through any significant code is rarely as simple as a list or a tree.

I worked in XSLT for maybe 4 years. I had a terrible time trying to explain why it was not a good general purpose programming language, but DAG is the reason. Specifically, XSLT is a data driven language. You define functions (yes, in the functional-programming sense) but you don't necessarily call these functions from your code. Rather, XSLT sets up a combination of selection of, and iteration through, the nodes of an input XML document. This lets the structure of the input data determine which functions are called and in what order.

This was very interesting and very cool until, at 2:30 AM, your program encountered a data condition that you didn't test for and you had to wake up and fix it. When you let the data define the DAG, the definition of the DAG becomes the set of all possible input conditions - which, for any non-trivial business application, is beyond incalculable; it is unimaginable.

At first I thought that functional programming might not be a DAG, because execution order is sometimes not clear to, or even thought about by, the programmer. But a functional program does define dependencies. In fact, the declarative nature of functional programming could be thought of as defining only dependencies (a^2 = b^2 + c^2) without specifying execution order (it does not matter whether 'b' or 'c' is squared first, so long as they are both squared before being added together).

But while Functional programming may be deliberately vague about order of operations at a detailed level, it is exquisitely clear about dependencies. These are the very features that make it so amenable to concurrency. In any case, there is still a graph of paths through the code, and that graph is still directed (dependencies must be evaluated before dependent tasks), so I think DAG applies there as well.

Nice question - thanks for posting!

GlenPeterson
  • 14,950
6

Currently, DAGs are underrated in programming. Historically, many things related to development were modelled with trees and hierarchies, because putting things in boxes is a convenient way for our brains to make complex things easier to manage. But if you look at events and how they depend on other events and states, you get a DAG: anything in our lives, and in a program, can depend on anything in the past but not on anything in the future, so the relations are perfectly acyclic and the DAG concept applies. Although this is rarely used explicitly in development, keeping it in mind helps to understand things better.

Maksee
  • 2,663
3

I am wondering what's the advantage of Plasm in Ecto...

A DAG can be used to model a collection of tasks with the constraint that certain tasks must be done before others. Ecto is a processing framework, and it uses a DAG to model processing graphs so that the graphs perform ordered, synchronous execution. The plasm in Ecto is the DAG, and the scheduler operates on it.

in what other situations can we exploit the concept of DAG?

  • A DAWG (directed acyclic word graph) is a data structure that represents a set of strings and allows a query operation that tests whether a given string belongs to the set in time proportional to its length.
  • Git uses DAGs for content storage, reference pointers for heads, object model representation, and remote protocol.
theD
  • 2,913
3

DAGs are extremely useful in data science and especially in statistical programming.

In statistics and data science, we typically build models with the end goal of either 1) making good predictions on out-of-sample data, or 2) inferring causal relationships from observed data. DAGs are very useful tools for the latter, so what follows relates only to causal inference.

Very often, an analyst has a dataset and a research question about the causal effect of a variable (the "main exposure") on an outcome, in the presence of other variables. The task that the analyst has is to determine the causal effect of the main exposure on the outcome, while taking into account the other variables in the dataset. This "taking account of other variables" can be very difficult and lies at the heart of what causal inference is all about. These other variables can be broadly categorised into:

  1. Potential confounders,
  2. Competing exposures,
  3. Mediators

Confounders are variables that are a cause (or a proxy of a cause) of both the main exposure and the outcome. Statistical models should condition on confounders in order to reduce or avoid confounder bias. A simple example is the causal effect of coffee drinking on lung cancer. On the face of it, there should be no causal effect, but if we examine the relation between coffee consumption and lung cancer we will find a positive association (correlation). So a simple regression of lung cancer on coffee drinking will likely find a "significant" result. The problem with this is that there is an unmeasured confounder - namely, smoking. Coffee drinkers are more likely to smoke than those who don't drink coffee. Once we add smoking to the model, we will no longer find a significant effect of coffee drinking. This is obviously a very simple example.
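
A toy simulation makes this concrete (a sketch only, using NumPy; the effect sizes are invented and smoking is assumed to be the sole confounder). Coffee has no causal effect on the outcome in this simulated world, yet the unadjusted regression finds one:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100_000

    # Smoking causes both coffee drinking and lung cancer risk; coffee causes nothing.
    smoking = rng.binomial(1, 0.3, size=n).astype(float)
    coffee = 2.0 * smoking + rng.normal(0, 1, size=n)        # cups per day (toy scale)
    cancer_risk = 1.5 * smoking + rng.normal(0, 1, size=n)   # latent risk score (toy scale)

    def ols_coefficients(y, *covariates):
        """Ordinary least squares; returns (intercept, slope_1, slope_2, ...)."""
        X = np.column_stack([np.ones_like(y), *covariates])
        return np.linalg.lstsq(X, y, rcond=None)[0]

    # Unadjusted: coffee appears to affect cancer risk (coefficient well above zero).
    print(ols_coefficients(cancer_risk, coffee)[1])

    # Adjusted for the confounder: the coffee coefficient collapses towards zero.
    print(ols_coefficients(cancer_risk, coffee, smoking)[1])

The adjusted coefficient shrinks towards zero because conditioning on smoking blocks the only backdoor path from coffee to the outcome.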

Competing exposures are variables that have no association with the main exposure but are associated with the outcome. These should be included in the model (or "conditioned on", in statistics terminology) as they will increase the precision of the estimate of the causal effect we are interested in.

Mediators are variables that are "caused" by the main exposure and are themselves, in turn, a cause of the outcome. Conditioning on these will introduce bias, which can be so extreme as to change the sign of the effect we are interested in. The general name for this is the "reversal paradox", and it includes examples such as Simpson's Paradox and Lord's Paradox (Tu et al., 2008).

Things quickly get very complicated the larger (in terms of variables) the dataset is. The DAG-based approach of causal inference can produce a "minimally sufficient" set of variables to include in the model. Many novice analysts simply throw all the variables into their model, or use "stepwise" procedures to determine their model, both of which are gigantic mistakes.

A DAG directly informs which variables to include and which to exclude from the model. A very useful tool for this is Dagitty.

Much of the theory of DAGs in causality was developed by Judea Pearl, a rather famous computer scientist who has written extensively on this topic and is regarded as one of the pioneers of causal inference. Dagitty is based on Pearl's work. The canonical reference is Pearl (2009).

For further reading on this, see:

How do DAGs help to reduce bias in causal inference?

Backward Stepwise Regression

References:

Pearl, J. (2009). Causality. Cambridge University Press.

Tu, Y. K., Gunnell, D., & Gilthorpe, M. S. (2008). Simpson's Paradox, Lord's Paradox, and suppression effects are the same phenomenon: the reversal paradox. Emerging Themes in Epidemiology, 5, 1-9.

-1

As a real world example, our software is similar to an IDE where the end user can define a series of operations to be performed on an image (machine vision inspection). These inspections can have dependencies on other inspections or may have inspections dependent upon them. Since this is all configurable by the end user, we cannot make optimizations for parallel processing at design time. By representing these inspections and dependencies as a DAG, we can optimize the parallelism of the overall inspection for maximum performance at run time.
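
This is not the actual system described above, just a sketch of the idea using the Python standard library (the inspection names below are invented): each inspection is submitted to a thread pool as soon as every inspection it depends on has finished, so independent inspections run concurrently.

    from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
    from graphlib import TopologicalSorter  # Python 3.9+
    import time

    def run_dag_in_parallel(deps, tasks, max_workers=4):
        """Run the callables in `tasks`, respecting the dependency DAG `deps`.

        deps:  {name: set of names that must finish before `name` may start}
        tasks: {name: zero-argument callable performing that inspection}
        """
        sorter = TopologicalSorter(deps)
        sorter.prepare()                  # also rejects cyclic dependency graphs
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            running = {}                  # future -> task name
            while sorter.is_active():
                for name in sorter.get_ready():       # all tasks whose deps are done
                    running[pool.submit(tasks[name])] = name
                done, _ = wait(running, return_when=FIRST_COMPLETED)
                for future in done:
                    future.result()                   # re-raise any task exception
                    sorter.done(running.pop(future))  # unblock dependents

    # Hypothetical inspection graph: two measurements depend on one image capture.
    deps = {"capture": set(), "edges": {"capture"}, "blobs": {"capture"},
            "verdict": {"edges", "blobs"}}
    tasks = {name: (lambda n=name: (time.sleep(0.1), print(n, "done"))) for name in deps}
    run_dag_in_parallel(deps, tasks)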

Dave Nay
  • 3,827
-1

Just as another example, the memory management rules in Cocoa apps are designed so that all strong references form a directed acyclic graph; this is done to guarantee the absence of leaks.

millenomi
  • 107
-2

Adding another answer, since I have not seen a reference to build systems like make, which use a DAG to work out the dependencies for building.

More details here
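
To sketch that core rule in Python (an illustration of the idea, not how make is implemented; the dictionaries below stand in for a Makefile): a target is rebuilt when it is missing or older than any of its dependencies, and the dependencies themselves are brought up to date first by walking the DAG.

    import os

    def build(target, deps, recipes):
        """Bring `target` up to date, make-style.

        deps:    {target: list of files it depends on}
        recipes: {target: zero-argument callable that (re)creates the file}
        Targets without a recipe are treated as plain source files.
        """
        for dep in deps.get(target, []):
            build(dep, deps, recipes)         # dependencies first (post-order walk)
        if target not in recipes:
            return
        out_of_date = not os.path.exists(target) or any(
            os.path.getmtime(dep) > os.path.getmtime(target)
            for dep in deps.get(target, [])
        )
        if out_of_date:
            recipes[target]()

    # Hypothetical project: an app linked from two object files compiled from sources.
    deps = {"app": ["main.o", "util.o"], "main.o": ["main.c"], "util.o": ["util.c"]}
    recipes = {"app": lambda: print("link app"),
               "main.o": lambda: print("compile main.c"),
               "util.o": lambda: print("compile util.c")}
    build("app", deps, recipes)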