Causal Inference 360 - Resources

Welcome to IBM Research Causal Inference 360

What is causal inference

Understanding cause and effect is the ultimate aim of scientific inquiry. It is only by such understanding that one can truly explain a phenomenon and ensure that actions yield intended results. Causal modeling is crucial for questions of trust, such as fairness, robust decision making and interpretability across disciplines, industries and sectors.

Causal inference is the field of estimating the effect of an action, rather than identifying the cause of something. An example of such a question may be "how would the decision to quit smoking affect my weight in 10 years?" Causal inference is nuanced and difficult because we can only observe one of two possible futures: either an object received a treatment or did not receive a treatment. The outcome that is observed is only the outcome given the observed treatment. The outcome that would have been obtained if given the opposite treatment is called the counterfactual outcome.

An additional complication arises from the fact that the decision for an object to receive a treatment often depends on many different factors that are also related to the outcome (confounding). In the example above, for instance, the decision to quit smoking may be driven by a switch to a healthier lifestyle, which changes not only smoking but also diet and physical activity. This would confound our estimate of how much quitting smoking affects weight, because these other changes affect weight as well.
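The effect of confounding can be seen in a small simulation. The sketch below is purely illustrative: the numbers, the `outcome` function and the single "healthy lifestyle" confounder are all invented, and the true effect of quitting is fixed at +2 kg so that both potential outcomes are known.

```python
# Toy illustration of confounding; every number here is invented.
TRUE_EFFECT = 2.0  # kg gained by quitting, built into the simulation


def outcome(healthy, quit_smoking):
    """Weight change in kg: a healthy lifestyle lowers it, quitting raises it."""
    return 3.0 - 5.0 * healthy + TRUE_EFFECT * quit_smoking


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# (healthy_lifestyle, quit_smoking): healthy people quit far more often.
records = [(1, 1)] * 8 + [(1, 0)] * 2 + [(0, 1)] * 2 + [(0, 0)] * 8
data = [(h, q, outcome(h, q)) for h, q in records]

# Naive comparison of quitters and non-quitters mixes in the lifestyle effect.
naive = mean(y for _, q, y in data if q == 1) - mean(y for _, q, y in data if q == 0)

# Compare within each lifestyle stratum, then average the strata
# (both strata hold 10 people, so a plain average is enough here).
adjusted = mean(
    mean(y for h, q, y in data if h == s and q == 1)
    - mean(y for h, q, y in data if h == s and q == 0)
    for s in (0, 1)
)

print(naive, adjusted)  # -1.0 2.0
```

The naive comparison even gets the sign wrong, while adjusting for the confounder recovers the effect that was built into the simulation.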

What are the alternatives

Randomized experiments are often described as the gold standard for estimating causal effects from data. However, randomized experiments may be prohibitively expensive, unethical, or simply infeasible. Therefore, sophisticated algorithms must be applied to observational data. Causal inference differs from the supervised machine learning problem, but many of the tools of supervised learning, such as deep learning, can be useful components of causal inference algorithms because of their ability to deal with large-scale, high-dimensional data.

What does this package provide

The Causal Inference 360 Python package provides a suite of methods, under a unified API, to estimate the causal effect of an intervention on an outcome using real-world observational data. It does so by implementing several meta-algorithms that allow one to plug in arbitrarily complex machine learning algorithms in order to provide highly flexible causal inference modeling. The interactive demo provides a gentle introduction to the concepts and capabilities by walking through an example use case and comparing the results of a causal analysis to a non-causal one. The tutorials and other notebooks offer a deeper, data scientist-oriented introduction. The complete API is also available.

Many degrees of freedom when modeling may result in confusion regarding which model and which method to use and when. One of the highlights of the package is an evaluation module that helps in model selection, parameter tuning, and selection of the causal meta-method itself. These evaluation methods help detect problems in the definition of the dataset and eliminate models with bad performance. Moreover, we have created some guidance material that can be consulted, providing a summary of the different methods and how the provided evaluation module can help check that models are not obviously wrong.

It should be noted that causal inference is only valid under a set of assumptions that cannot be evaluated from the data itself (for example, that all important confounding variables are measured). As a result, there is no guarantee that a well-performing model provides a real and accurate estimate of the effect. Poorly performing models, however, are guaranteed to provide unreliable results, and IBM Research Causal Inference 360 helps identify and eliminate such models.

The philosophy behind Causal Inference 360 is:
  1. Be approachable for data scientists and machine learning practitioners by unifying many causal models under one scikit-learn-inspired API.
  2. Provide the ability to plug in arbitrarily complex machine learning models to be employed by the causal models.
  3. Enable estimation on out-of-bag samples.
  4. Perform evaluation to detect cohort-misspecification and guide model-selection.
  5. Be extensible. We encourage the contribution of your causal inference models. Please join the community to get started as a contributor.


Developer tutorials

The following tutorials provide different examples of causal inference. View them individually below or open the set of Jupyter notebooks in GitHub.

Smoking Cessation and Weight Loss
Understand if there is a cause-and-effect relationship between quitting smoking and losing weight.

Econometric Evaluation of Job Training
Understand the effectiveness of job training programs on the earnings of individuals two years later.

Guidance on choosing methods

Causal Inference 360 includes several estimation methods, each of which can utilize several machine learning models. This flexibility raises the question of what combination yields the best overall causal model. Due to the nature of counterfactual prediction, we cannot evaluate models against ground-truth as is usually done in machine learning. However, we can discuss the pros and cons of different modeling strategies and provide tools to eliminate poor-performing model candidates.

We follow the workflow depicted in Shimoni et al. For a step-by-step interactive version, please visit the demo.

1. Defining the dataset

A first important step in causal analysis is defining the cohort or dataset. In doing so, it is important to ensure that causal assumptions hold and biases are not introduced inadvertently. Unfortunately, this requires domain knowledge of the problem, and no automated or magic solutions are available. It is also beyond the scope of this package, which focuses on causal methods and not on cohort building. However, we mention two things to consider:

First, the study design – which individuals from the dataset should be excluded or included in the analysis. It is intended to avoid selection biases that may render the groups incomparable and bias the effect estimation. A common approach is to try and emulate the criteria for joining a randomized control trial – who would you recruit for the experiment? How long would you look at their history? How long would you follow-up on them to measure their outcome?

Second, which variables are relevant confounders and should be the input to the model. It might be the case that not all variables are causally related to the problem. Incorporating unnecessary variables might increase the variance of the estimate and, unlike in ML, even bias it. For more information, we refer the reader to the backdoor criterion in causal graphs.

Once we have defined a cohort, we should have three things in hand:

  • a data matrix with features for each individual, called the covariate matrix (e.g., the patient's age, sex, etc.),
  • a vector with the treatment assignment of each individual (e.g., have they undergone a medical procedure or not?),
  • and a vector with the outcome observed in these individuals (e.g., did their condition improve or not? Or what were the levels of HbA1C in their blood?).
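The three inputs can be pictured as three aligned containers. The following is a minimal sketch; the `Cohort` class and the field names `X`, `a` and `y` are illustrative choices made here, not part of the package.

```python
from dataclasses import dataclass


@dataclass
class Cohort:
    """Hypothetical container for the three inputs of a causal analysis."""

    X: list  # covariate matrix: one row of features per individual
    a: list  # treatment assignment per individual (e.g., 0 or 1)
    y: list  # observed outcome per individual

    def __post_init__(self):
        # All three must describe the same individuals, in the same order.
        if not (len(self.X) == len(self.a) == len(self.y)):
            raise ValueError("X, a and y must have the same number of rows")


cohort = Cohort(
    X=[[54, 1], [61, 0], [47, 1]],  # e.g., age and sex (invented values)
    a=[1, 0, 1],                    # underwent the procedure or not
    y=[6.1, 7.4, 5.9],              # e.g., an HbA1C level (invented values)
)
```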

2. Choosing a causal model

Causal models can be broadly cast into two families: weight models and outcome models.

Weight models operate by using the covariates to balance the treatment groups, such that the weighted distributions of covariates in the groups are more similar. Once each individual has its own weight, we can also weight the outcomes in each group to obtain unbiased potential outcomes. Then we can compare (say, subtract) the estimates of the potential outcomes to obtain a causal treatment effect, in the sense that we have canceled the effect of the covariates (now that they are similarly distributed across groups) and are left only with the effect of the treatment.

A special, and the most common, type of weight model is the propensity model, which estimates the weights by first estimating the probability of being treated given the covariates (also called a propensity score). Propensity models are backed by a vast theoretical literature. For our purposes, since they estimate probabilities, they can be evaluated more rigorously, as we mention below.
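As an illustration of propensity-based weighting, here is a minimal sketch on invented data with a single binary covariate. The propensity is estimated as the treated fraction within each covariate value, and the weighted averages are normalized by the sum of weights; both are simplifying choices for the example, not a description of the package's implementation.

```python
from collections import defaultdict

# Invented records: (covariate, treatment, outcome).
data = [(1, 1, 0.0)] * 8 + [(1, 0, -2.0)] * 2 + [(0, 1, 5.0)] * 2 + [(0, 0, 3.0)] * 8


def fit_propensity(data):
    """Estimate P(treated | covariate) as the treated fraction per covariate value."""
    treated, total = defaultdict(int), defaultdict(int)
    for x, a, _ in data:
        total[x] += 1
        treated[x] += a
    return {x: treated[x] / total[x] for x in total}


def ipw_potential_outcome(data, propensity, treatment):
    """Weighted mean outcome of one treatment group, weights = 1 / P(own treatment)."""
    weighted_sum = weight_total = 0.0
    for x, a, y in data:
        if a != treatment:
            continue
        p_own = propensity[x] if treatment == 1 else 1 - propensity[x]
        weight = 1 / p_own
        weighted_sum += weight * y
        weight_total += weight
    return weighted_sum / weight_total


propensity = fit_propensity(data)
y1 = ipw_potential_outcome(data, propensity, treatment=1)
y0 = ipw_potential_outcome(data, propensity, treatment=0)
effect = y1 - y0  # close to 2.0, whereas the unweighted group difference is -1.0
```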

Outcome models, the second family of causal models, predict the potential outcomes directly from the covariates and treatment assignment. This allows them to perform individual-level prediction of potential outcomes, while weight models can only predict at the sample level (e.g., the average within the treatment groups). However, the individual-level prediction is less reliable, since more assumptions are needed for it to hold.
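A minimal sketch of the outcome-model idea, on invented data with one binary covariate: the "model" is simply the mean outcome for each (covariate, treatment) combination, standing in for any regression model. Predicting every individual once as treated and once as untreated gives individual-level potential outcomes, which can then be averaged.

```python
from collections import defaultdict

# Invented records: (covariate, treatment, outcome).
data = [(1, 1, 0.0)] * 8 + [(1, 0, -2.0)] * 2 + [(0, 1, 5.0)] * 2 + [(0, 0, 3.0)] * 8


def fit_outcome_model(data):
    """Mean outcome per (covariate, treatment) cell; a stand-in for a regression."""
    sums, counts = defaultdict(float), defaultdict(int)
    for x, a, y in data:
        sums[(x, a)] += y
        counts[(x, a)] += 1
    return {key: sums[key] / counts[key] for key in sums}


model = fit_outcome_model(data)

# Individual level: predict both potential outcomes for every individual.
individual_effects = [model[(x, 1)] - model[(x, 0)] for x, _, _ in data]

# Population level: average the predictions over the whole cohort.
y1 = sum(model[(x, 1)] for x, _, _ in data) / len(data)
y0 = sum(model[(x, 0)] for x, _, _ in data) / len(data)
effect = y1 - y0
```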

Doubly robust models are an additional type of causal model that combines weight and outcome models. There are multiple ways to combine them, hence there are multiple doubly robust models. The premise of doubly robust models is that their estimation is, well, more robust. Theoretically, it is sufficient that either the weight model or the outcome model be unbiased for the composite estimate to be unbiased.
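One well-known way of combining the two is the augmented inverse probability weighting (AIPW) estimator. The sketch below is one possible combination, shown on invented data with deliberately wrong component models; the function and variable names are hypothetical and it is not claimed to match any specific model in the package.

```python
# Invented records: (covariate, treatment, outcome); the built-in effect is 2.0.
data = [(1, 1, 0.0)] * 8 + [(1, 0, -2.0)] * 2 + [(0, 1, 5.0)] * 2 + [(0, 0, 3.0)] * 8


def aipw_effect(data, propensity, outcome_model):
    """AIPW: outcome-model prediction plus an inverse-propensity-weighted residual.

    propensity(x) returns P(treated | x); outcome_model(x, a) returns a predicted outcome.
    """
    n = len(data)
    y1 = y0 = 0.0
    for x, a, y in data:
        e = propensity(x)
        m1, m0 = outcome_model(x, 1), outcome_model(x, 0)
        y1 += m1 + a * (y - m1) / e
        y0 += m0 + (1 - a) * (y - m0) / (1 - e)
    return y1 / n - y0 / n


def good_propensity(x):
    return 0.8 if x == 1 else 0.2


def bad_propensity(x):
    return 0.5  # ignores the covariate


def good_outcome(x, a):
    return {(1, 1): 0.0, (1, 0): -2.0, (0, 1): 5.0, (0, 0): 3.0}[(x, a)]


def bad_outcome(x, a):
    return 0.0  # ignores everything


only_weights_right = aipw_effect(data, good_propensity, bad_outcome)
only_outcome_right = aipw_effect(data, bad_propensity, good_outcome)
both_wrong = aipw_effect(data, bad_propensity, bad_outcome)
```

In this toy setting the estimate stays at the built-in effect as long as one of the two components is correct, and falls back to the biased naive difference when both are wrong.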

3. Choosing a machine learning core model

Each causal model encapsulates an underlying machine learning model and uses it to make inference in a causal way. For example, propensity models utilize a classification model to learn a mapping from covariates to treatment assignment and then use its probabilistic output to generate weights by taking the inverse. The statistical literature will most probably use logistic regression to estimate the probabilities, but these probabilities can be estimated by any other model, such as random forests or neural networks. We designed causallib to enable plug & play of arbitrary machine learning models, as long as they support scikit-learn’s interface.
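The plug & play idea can be sketched as a wrapper that only relies on the learner exposing `fit` and `predict_proba`. Both classes below are hypothetical and written for this illustration; they are not causallib's classes, and the toy learner stands in for any scikit-learn-style classifier.

```python
class StratumFrequencyClassifier:
    """Hypothetical toy learner: P(a=1 | x) is the treated fraction among identical rows."""

    def fit(self, X, a):
        counts = {}
        for row, t in zip(X, a):
            treated, total = counts.get(tuple(row), (0, 0))
            counts[tuple(row)] = (treated + t, total + 1)
        self.p_ = {key: treated / total for key, (treated, total) in counts.items()}
        return self

    def predict_proba(self, X):
        # Columns follow the scikit-learn convention: [P(a=0), P(a=1)].
        # Rows never seen during fit would raise KeyError in this toy learner.
        return [[1 - self.p_[tuple(row)], self.p_[tuple(row)]] for row in X]


class InverseProbabilityWeighter:
    """Hypothetical wrapper: works with any learner offering fit and predict_proba."""

    def __init__(self, learner):
        self.learner = learner

    def fit(self, X, a):
        self.learner.fit(X, a)
        return self

    def compute_weights(self, X, a):
        proba = self.learner.predict_proba(X)
        # Weight = 1 / probability of the treatment actually received.
        return [1.0 / p[t] for p, t in zip(proba, a)]


X = [[1]] * 4 + [[0]] * 4
a = [1, 1, 1, 0, 1, 0, 0, 0]
weights = InverseProbabilityWeighter(StratumFrequencyClassifier()).fit(X, a).compute_weights(X, a)
```

Swapping the learner for a different classifier changes how the probabilities are estimated without touching the weighting logic.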

Now the options for a causal model are numerous – first we choose the causal method, then we choose the underlying machine learning model and then we also need to tune the hyperparameters of that model. We also developed an evaluation framework to guide model selection.

4. Model selection

We suggest a suite of ML-inspired evaluations that can be performed in cross-validation to help eliminate poorly performing models or detect cohort misspecification. Weight and outcome models each have their own set of evaluations, and doubly robust models can simply apply the evaluations to both of their components. Here we give a short summary of the guidelines, but we advise the reader to go over the more complete description found in our manuscript.

Weight models

Evaluating weight models can be done marginally using what is commonly known as a Love plot. It calculates, for each feature separately, a statistic of the distance between its distributions in the treated and untreated groups before and after weighting (the absolute standardized mean difference, ASMD, is a common metric). Ideally, after weighting, the difference in each feature is small. An unofficial rule of thumb is that all ASMD values should be below 0.1.
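The quantity behind a Love plot can be sketched for a single feature as follows. This is a minimal illustration on invented data; standardizing by the unweighted pooled standard deviation both before and after weighting is one common convention and is an assumption here, as are the known propensities used to build the weights.

```python
import math


def weighted_mean(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def sample_variance(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


def asmd(x, a, w=None):
    """Absolute standardized mean difference of one feature between treatment groups."""
    w = [1.0] * len(x) if w is None else w
    x1, w1 = zip(*[(xi, wi) for xi, ai, wi in zip(x, a, w) if ai == 1])
    x0, w0 = zip(*[(xi, wi) for xi, ai, wi in zip(x, a, w) if ai == 0])
    # Assumption: standardize by the unweighted pooled standard deviation.
    pooled_sd = math.sqrt((sample_variance(x1) + sample_variance(x0)) / 2)
    return abs(weighted_mean(x1, w1) - weighted_mean(x0, w0)) / pooled_sd


# Invented data: the feature is far more common among the treated.
x = [1] * 8 + [0] * 2 + [1] * 2 + [0] * 8
a = [1] * 10 + [0] * 10
propensity = [0.8 if xi == 1 else 0.2 for xi in x]
ip_weights = [1 / p if ai == 1 else 1 / (1 - p) for p, ai in zip(propensity, a)]

before = asmd(x, a)             # well above the 0.1 rule of thumb
after = asmd(x, a, ip_weights)  # essentially zero once the groups are balanced
```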

Propensity models

Since propensity models output probabilities, we can apply more evaluations in addition to the one of weight models.

Probability calibration: we can use calibration plots to verify that the scores provided by the model agree with the empirical probabilities of the labels. Since the theoretical guarantees assume the propensity is a probability, it is best to check that this assumption holds. Furthermore, a poorly calibrated model will likely create bias when applied to subsets of the data. Ideally, the curve in a calibration plot should lie as close as possible to the diagonal y=x. If the model is poorly calibrated, it is best to choose another ML model.
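The numbers behind a calibration plot can be computed by binning the predicted probabilities and comparing each bin's mean prediction with the observed treated fraction. The following is a minimal sketch with equal-width bins; the function name, the bin count and the data are choices made for this example.

```python
def calibration_curve(probs, labels, n_bins=4):
    """Return (mean predicted probability, observed treated fraction, count) per non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for p, label in zip(probs, labels):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[index].append((p, label))
    curve = []
    for members in bins:
        if members:
            mean_pred = sum(p for p, _ in members) / len(members)
            observed = sum(label for _, label in members) / len(members)
            curve.append((mean_pred, observed, len(members)))
    return curve


# Invented, perfectly calibrated example: 1 of 4 treated at p=0.25, 3 of 4 at p=0.75.
probs = [0.25] * 4 + [0.75] * 4
labels = [1, 0, 0, 0, 1, 1, 1, 0]

curve = calibration_curve(probs, labels)
# Largest vertical distance from the diagonal y=x; zero for this example.
max_gap = max(abs(pred - obs) for pred, obs, _ in curve)
```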

Propensity distribution: A useful plot is of the distributions of propensity scores for each treatment group separately. This plot can provide several insights:

  1. Since the propensity is assumed to be the probability of being treated, for any given propensity value p we should observe a proportion p of the instances treated and 1-p untreated. Therefore, for any propensity value between 0 and 1 that occurs in the data, we should see instances from both treatment groups. We should therefore observe a complete overlap in the support of the two distributions, while, importantly, the distributions should not be identical. Any deviation from overlap may be an indication of an inconsistent propensity model, overfitting (which can be checked by comparing the training and validation performance), or a positivity violation.
  2. Positivity assumes that all samples can be treated. This means that we should not observe propensity values of either 0 or 1. If such values are observed, this is an indication that the model was able to identify a subset of the data for which the treatment assignment is known. In this case, the criteria defining this subset need to be identified and these samples must be removed from the data. Accordingly, this requires a slight change in the causal question being asked, since it no longer applies to the whole dataset.
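Both checks can be sketched numerically alongside the plot. In the minimal example below, the function names, the tolerance `eps` and the propensity values are assumptions made for illustration; what counts as "too close" to 0 or 1 is a judgment call.

```python
def positivity_violations(propensity, eps=0.05):
    """Indices of samples whose propensity is within eps of 0 or 1 (eps is arbitrary)."""
    return [i for i, p in enumerate(propensity) if p < eps or p > 1 - eps]


def common_support(propensity, a):
    """Range of propensity values covered by both treatment groups, or None if disjoint."""
    treated = [p for p, t in zip(propensity, a) if t == 1]
    untreated = [p for p, t in zip(propensity, a) if t == 0]
    low = max(min(treated), min(untreated))
    high = min(max(treated), max(untreated))
    return (low, high) if low <= high else None


# Invented propensity scores and treatment assignments.
propensity = [0.02, 0.3, 0.5, 0.7, 0.99, 0.4]
a = [0, 1, 0, 1, 1, 0]

violations = positivity_violations(propensity)
support = common_support(propensity, a)
```

Here `violations` flags the samples at 0.02 and 0.99, and `support` shows that the two groups only share the range from 0.3 to 0.5.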

ROC curves: The propensity model is trained on a binary treatment assignment and can therefore be evaluated using an ROC curve. However, the propensity model is not meant to be a good classifier, but rather a risk model. Therefore, a “good” ROC curve by ML standards (sharp horizontal/vertical parts in the curve, high AUC) is NOT a good ROC curve for a propensity model. Perhaps counter-intuitively, good prediction performance suggests that the treated and untreated are too easily distinguished, and are therefore incomparable for causal analysis and probably suffer from a positivity violation. On the other hand, a propensity model that is too close to random is still not useful. We should therefore check that the propensity model provides an ROC curve with a modest AUC (0.6-0.8) and without any horizontal or vertical parts.

We provide two additional weighted curves:

  1. The expected curve, where each sample contributes its propensity as a positive label and the complement of the propensity as a negative label. If the propensity model is self-consistent, then these are the probabilities by which each sample would contribute to each treatment category. The curve should therefore ideally coincide with the non-weighted curve if the model outputs true propensities.
  2. The IP-weighted curve, which checks the AUC under balancing between the groups. A good propensity model should re-weigh the samples such that the distributions are indistinguishable, and therefore this curve should ideally converge with the diagonal (AUC of 0.5).
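The three curves can be summarized by their AUCs. The sketch below computes a weighted AUC as the weighted fraction of (positive, negative) pairs in which the positive has the higher score, with ties counted as half; it is a minimal pairwise implementation on invented, perfectly calibrated propensities, and the helper name is hypothetical.

```python
def weighted_auc(scores, labels, weights=None):
    """Weighted probability that a positive outranks a negative (ties count half)."""
    weights = [1.0] * len(scores) if weights is None else weights
    pos = [(s, w) for s, label, w in zip(scores, labels, weights) if label == 1]
    neg = [(s, w) for s, label, w in zip(scores, labels, weights) if label == 0]
    won = total = 0.0
    for sp, wp in pos:
        for sn, wn in neg:
            pair = wp * wn
            total += pair
            if sp > sn:
                won += pair
            elif sp == sn:
                won += 0.5 * pair
    return won / total


# Invented example: 8 of 10 are treated where p=0.8, 2 of 10 where p=0.2.
propensity = [0.8] * 10 + [0.2] * 10
a = [1] * 8 + [0] * 2 + [1] * 2 + [0] * 8

# Non-weighted curve: actual labels, equal weights.
plain_auc = weighted_auc(propensity, a)

# Expected curve: every sample appears as a positive with weight p
# and as a negative with weight 1 - p.
expected_auc = weighted_auc(
    propensity * 2,
    [1] * len(a) + [0] * len(a),
    propensity + [1 - p for p in propensity],
)

# IP-weighted curve: actual labels, inverse-propensity weights.
ip_weights = [1 / p if t == 1 else 1 / (1 - p) for p, t in zip(propensity, a)]
ip_auc = weighted_auc(propensity, a, ip_weights)
```

For these data the non-weighted and expected AUCs agree (both about 0.8) and the IP-weighted AUC is about 0.5, which is the pattern described above for a self-consistent propensity model.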

Outcome models

Evaluating outcome models is similar to evaluating regular ML models, where we aim for the best prediction scores. One important point to remember is that for the effect estimation to be unbiased, one should check that the residuals in each treatment group are distributed similarly (or at least have the same mean). A proxy test for exchangeability can be done by plotting the individual counterfactual predictions and verifying that the treatment groups overlap. If comparing multiple outcome models, it might be useful to also calibrate the output of the outcome model, as suggested for propensity models.
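The residual check can be sketched as a comparison of mean residuals per treatment group. The function name, observations and predictions below are invented for illustration.

```python
def mean_residual_by_group(y, y_pred, a):
    """Mean of (observed - predicted) within each treatment group."""
    groups = {}
    for observed, predicted, treatment in zip(y, y_pred, a):
        groups.setdefault(treatment, []).append(observed - predicted)
    return {t: sum(r) / len(r) for t, r in groups.items()}


# Invented outcomes and predictions from some outcome model.
y = [3.0, 5.0, 4.0, 8.0, 9.0, 10.0]
y_pred = [3.5, 4.5, 4.0, 7.0, 8.0, 9.0]
a = [0, 0, 0, 1, 1, 1]

residual_means = mean_residual_by_group(y, y_pred, a)
# The model is centered for the untreated but underpredicts the treated by 1.0,
# so a difference of predictions would carry that error into the effect estimate.
gap = residual_means[1] - residual_means[0]
```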

We can perform all evaluations in cross-validation to assess whether violations arise from the model or the data. If violations are exhibited only in the train set, we can assume they are due to model overfit and the underlying ML model should be re-specified. However, if violations are exhibited also in the test set, we can suspect they are due to structures in the data, and therefore we should reconsider our study-design or the covariates used.

5. Estimation

Once we are happy with the model at hand, we can go on and perform estimation. Unlike the well-defined “predict” action in ML, there are several estimations that can be done in causal inference. Estimations can be made at two levels, either the individual or the population (aggregated) level, depending on the causal model used. The first kind of estimation is of the potential outcomes, in case we wish to know what the outcome would have been if everyone in the cohort had been assigned to some treatment. The second, and probably better known, is the treatment effect, which can be defined as the difference, ratio or odds ratio (if applicable) of the potential outcomes. There are also intermediate estimations, for example the sample weights in weight models. Therefore, we made each estimation type very explicit. For example, for effect estimation, one must first estimate the potential outcomes and then use them explicitly to estimate the effect.
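The three effect definitions can be written down explicitly. This is a minimal sketch: the `effect` helper and its `kind` labels are hypothetical names chosen here, and the two population-level potential outcomes are invented probabilities of a binary outcome.

```python
def effect(outcome_treated, outcome_untreated, kind="diff"):
    """Contrast two potential outcomes as a difference, ratio or odds ratio."""
    if kind == "diff":
        return outcome_treated - outcome_untreated
    if kind == "ratio":
        return outcome_treated / outcome_untreated
    if kind == "or":
        # Odds ratio: only meaningful when the outcomes are probabilities in (0, 1).
        def odds(p):
            return p / (1 - p)

        return odds(outcome_treated) / odds(outcome_untreated)
    raise ValueError(f"unknown effect kind: {kind!r}")


# Step 1 (assumed already done): potential outcomes estimated by some causal model.
y1, y0 = 0.3, 0.2

# Step 2: the effect is computed explicitly from the potential outcomes.
difference = effect(y1, y0, "diff")
ratio = effect(y1, y0, "ratio")
odds_ratio = effect(y1, y0, "or")
```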



Covariates
The characteristics of each sample in the dataset, usually referred to as features in machine learning.
For example, age, sex or lab results of patients.

Confounders
Covariates of interest that need to be adjusted for to ensure that no spurious correlations are present and the derived effect is not biased.

Treatment assignment
A special variable of interest, not found in regular machine learning, whose effect on the outcome we want to assess.
For example, going through some medical procedure or not.

Observed outcome
The outcome of interest observed in the sample.
For example, how much weight was lost after 6 months, or whether the individual had a heart-attack during the 12 months following treatment.


Potential outcomes
The outcomes that would have been observed if the individual had gotten treated and if they were left untreated.

Counterfactual outcomes
The potential outcome corresponding to the treatment assignment the individual ended up not getting.

Individual outcome
Individual-level outcomes. Usually it is harder to evaluate the potential outcomes at the individual level.

Population outcome
An aggregation (usually mean but can be any statistic) of the outcomes on an entire group.
For example, average outcome in the treated, or average potential outcome in males if they would all be treated.

Effect
How much the potential outcome changes between treatment groups, i.e., how much the outcome would change if everyone were treated vs. if everyone were untreated. The most common quantity to look at is the difference between the potential outcomes.


Weight models
Estimate the potential outcomes by balancing the treatment groups, i.e., providing each individual with a weight so that the groups appear to have a similar distribution of covariates. They usually allow predicting only population outcomes and not individual ones.

Propensity
The probability of an individual being treated given their covariates. The inverse of these propensity scores is a popular weighting scheme.

Outcome models
Estimate the potential outcomes by directly predicting the outcome from the covariates and treatment assignment of the individual. Can be used to predict individual potential outcomes.

Doubly robust methods
Models that combine a weight model with an outcome model to create a more robust estimation of the effect.

Learners
Every causal model has a machine learning model at its core, which it utilizes for causal inference, whether to learn the outcomes or the weights. We call these underlying machine learning models Learners.

Causal inference assumptions

Causal analysis also relies on three mathematical assumptions, which must be satisfied for the quantities it derives to be considered truly causal.

Positivity
The treatment group and the control group have similar properties.
For example, both males and females are treated. If we have both in the untreated group but only females in the treatment group, then the groups are incomparable.

Exchangeability
The distribution of the potential outcomes does not depend on the actual treatment assignment. Therefore, they are distributed equally between the groups.

Consistency
The observed outcome under a given treatment and the potential outcome for that treatment coincide.

SUTVA
The Stable Unit Treatment Value Assumption is an extended independence assumption under which an intervention on one person (i.e., a 'unit') does not affect the outcome for another person.

Disclaimer: some simplifications were made to appeal to a non-technical audience.