DAGs are extremely useful in data science and especially in statistical programming.
In statistics and data science, we typically build models with one of two end goals: 1) making good predictions on out-of-sample data, or 2) inferring causal relationships from observed data. DAGs are very useful tools for the latter, so what follows relates only to causal inference.
Very often, an analyst has a dataset and a research question about the causal effect of a variable (the "main exposure") on an outcome, in the presence of other variables. The analyst's task is to estimate the causal effect of the main exposure on the outcome while taking the other variables into account. This "taking account of other variables" can be very difficult and lies at the heart of what causal inference is all about. These other variables can be broadly categorised into:
- Potential confounders
- Competing exposures
- Mediators
Confounders are variables that are a cause (or a proxy of a cause) of both the main exposure and the outcome. Statistical models should condition on confounders in order to reduce or avoid confounding bias. A simple example is the causal effect of coffee drinking on lung cancer. On the face of it there should be no causal effect, yet if we examine the relation between coffee consumption and lung cancer we will find a positive association (correlation). So a simple regression of lung cancer on coffee drinking will likely find a "significant" result. The problem is that there is an unmeasured confounder: smoking. Coffee drinkers are more likely to smoke than those who don't drink coffee, and smoking causes lung cancer. Once we add smoking to the model, we will no longer find a significant effect of coffee drinking. This is obviously a very simple example.
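To make this concrete, here is a rough simulation sketch of my own (the variable names, effect sizes, and the continuous "risk" outcome are all made up purely for illustration; it assumes numpy and statsmodels are available):

```python
# Sketch: smoking (unmeasured confounder) causes both coffee drinking and lung cancer;
# coffee itself has no causal effect. All numbers are arbitrary.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 100_000

smoking = rng.binomial(1, 0.3, n)                       # the confounder
coffee = rng.binomial(1, 0.2 + 0.5 * smoking)           # smokers drink more coffee
cancer = 0.05 + 0.10 * smoking + rng.normal(0, 0.1, n)  # risk depends on smoking only

# Naive model: regress the outcome on coffee alone -> a spurious positive "effect"
naive = sm.OLS(cancer, sm.add_constant(coffee)).fit()

# Adjusted model: condition on the confounder -> the coffee coefficient shrinks to ~0
adjusted = sm.OLS(cancer, sm.add_constant(np.column_stack([coffee, smoking]))).fit()

print("coffee coefficient, smoking omitted: ", naive.params[1])
print("coffee coefficient, smoking included:", adjusted.params[1])
```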
Competing exposures are variables that have no association with the main exposure but are associated with the outcome. These should be included in the model (or "conditioned on", in statistics terminology) because they increase the precision of the estimate of the causal effect we are interested in.
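A minimal sketch of the precision argument, again with invented numbers: the competing exposure is independent of the main exposure, so including it barely moves the estimated effect but noticeably shrinks its standard error:

```python
# Sketch: z is a competing exposure (independent of x, but a cause of y).
# Conditioning on z reduces residual variance, so the standard error of
# x's coefficient drops. Effect sizes are arbitrary.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 10_000

x = rng.normal(size=n)                      # main exposure
z = rng.normal(size=n)                      # competing exposure, independent of x
y = 0.3 * x + 1.5 * z + rng.normal(size=n)  # both cause the outcome

without_z = sm.OLS(y, sm.add_constant(x)).fit()
with_z = sm.OLS(y, sm.add_constant(np.column_stack([x, z]))).fit()

print("SE of x's coefficient, z omitted: ", without_z.bse[1])
print("SE of x's coefficient, z included:", with_z.bse[1])
```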
Mediators are variables that are "caused" by the main exposure and are themselves a cause of the outcome. Conditioning on these will introduce bias, which can be so extreme as to change the sign of the effect we are interested in. The general name for this is the "reversal paradox", which includes examples such as Simpson's Paradox and Lord's Paradox (Tu, Gunnell, & Gilthorpe, 2008).
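A similar sketch for a mediator (made-up coefficients again): regressing the outcome on the exposure alone recovers the total causal effect, whereas also conditioning on the mediator recovers only the direct part of it, so the total effect is badly under-estimated:

```python
# Sketch: m is a mediator on the path x -> m -> y, alongside a direct x -> y path.
# The total effect of x is 0.2 + 0.5 * 0.8 = 0.6; conditioning on m leaves only 0.2.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 10_000

x = rng.normal(size=n)
m = 0.8 * x + rng.normal(size=n)            # mediator, caused by x
y = 0.2 * x + 0.5 * m + rng.normal(size=n)  # outcome, caused by both

total = sm.OLS(y, sm.add_constant(x)).fit()
conditioned = sm.OLS(y, sm.add_constant(np.column_stack([x, m]))).fit()

print("total effect of x (correct):        ", total.params[1])        # ~0.6
print("effect of x after conditioning on m:", conditioned.params[1])  # ~0.2
```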
Things quickly become complicated as the number of variables in the dataset grows. The DAG-based approach to causal inference can produce a "minimally sufficient" set of variables to include in the model. Many novice analysts simply throw all the variables into their model, or use "stepwise" procedures to select their model, both of which are gigantic mistakes.
A DAG directly informs which variables to include in, and which to exclude from, the model. A very useful tool for this is DAGitty.
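As a toy illustration of the underlying idea (this is not DAGitty's actual algorithm, which implements the full back-door adjustment logic; I'm just using Python's networkx package to hand-roll the coffee example), the DAG can be encoded as a graph and the common ancestors of the exposure and the outcome read off as candidate confounders:

```python
# Sketch: encode the coffee/smoking/lung-cancer DAG and list common ancestors of
# exposure and outcome as candidate confounders. DAGitty does this (and much more,
# e.g. minimal adjustment sets via the back-door criterion) automatically.
import networkx as nx

g = nx.DiGraph()
g.add_edges_from([
    ("smoking", "coffee"),       # smoking causes coffee drinking
    ("smoking", "lung_cancer"),  # smoking causes lung cancer
    ("coffee", "lung_cancer"),   # the causal path we want to estimate
])
assert nx.is_directed_acyclic_graph(g)

exposure, outcome = "coffee", "lung_cancer"
candidates = nx.ancestors(g, exposure) & nx.ancestors(g, outcome)
print("candidate confounders to condition on:", candidates)  # {'smoking'}
```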
Much of the theory of DAGs in causality was developed by Judea Pearl, a rather famous computer scientist who has written extensively on this topic and is regarded as one of the pioneers of causal inference. DAGitty is based on Pearl's work. The canonical reference is Pearl (2009).
For further reading on this, see:
- How do DAGs help to reduce bias in causal inference?
- Backward Stepwise Regression
References:
Pearl, J. (2009). Causality. Cambridge University Press.
Tu, Y. K., Gunnell, D., & Gilthorpe, M. S. (2008). Simpson's Paradox, Lord's Paradox, and Suppression Effects are the same phenomenon – the reversal paradox. Emerging Themes in Epidemiology, 5, 1–9.