CLadder: Assessing Causal Reasoning in Language Models

Zhijing Jin, Yuen Chen, Felix Leeb, Luigi Gresele, Ojasv Kamal, Zhiheng Lyu, Kevin Blin, Fernando Gonzalez Adauto, Max Kleiman-Weiner, Mrinmaya Sachan, Bernhard Schölkopf

Abstract

Previous work in NLP focuses mostly on evaluating commonsense causal reasoning in LLMs. This work instead evaluates LLMs' ability to perform causal inference in accordance with a set of well-defined formal rules, introducing CLadder, a dataset of 10k samples. CLadder is built from a collection of causal graphs and queries (associational, interventional, and counterfactual). The paper also proposes a bespoke chain-of-thought prompting strategy, CausalCoT.

Introduction

Recent advances in LLMs have shifted the public's attention to their capability to perform sound causal reasoning and answer causal questions at scale with ease. This raises the question: "Do LLMs understand causality?"

The majority of previous work focuses on commonsense causality, exploring LLMs as knowledge bases and assessing how well the commonsense knowledge about causal relationships embedded in LLMs aligns with human knowledge. However, few studies have examined their capability for formal causal reasoning. Often, LLMs may simply be "causal parrots" that perform unreliable amortized causal inference, answering from repeated patterns in their training data.

Therefore, the authors introduce CLadder, a dataset to evaluate formal causal reasoning in LLMs. The dataset contains 10k causal questions covering all three rungs of the Ladder of Causation: associational, interventional, and counterfactual. In addition, to probe whether LLMs employ amortized causal inference, the dataset contains nonsensical causal relations, for which amortized causal inference would fail while formal causal reasoning would still yield correct answers.

Moreover, the authors propose CausalCoT, a chain-of-thought prompting strategy to elicit sound causal reasoning in LLMs and help them solve challenging causality questions. This strategy improves the performance of vanilla GPT-4 on CLadder by 8.37%, reaching 70.40%.

To summarize, the contributions of this paper are as follows:

  1. Introduces CLadder, a dataset for formal causal reasoning with 10k causal questions spanning all three rungs of the Ladder of Causation, several causal graphs, and various stories for verbalization.
  2. Develops CausalCoT, a chain-of-thought prompting strategy to elicit formal causal reasoning in LLMs.
  3. Assesses eight LLMs, showcasing their limitations in formal causal reasoning.

Preliminaries on Causal Inference

The Ladder of Causation

The Ladder of Causation is a proposed taxonomy and hierarchy of causal inference tasks. It consists of three distinct rungs: association ("seeing"), intervention ("doing"), and counterfactuals ("imagining").
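
In Pearl's standard formulation (background knowledge, not specific to this paper), the three rungs correspond to three kinds of queries:

  1. Association ("seeing"): $P(y \mid x)$, e.g. "How likely is cancer among smokers?"
  2. Intervention ("doing"): $P(y \mid \text{do}(x))$, e.g. "How likely is cancer if we force everyone to smoke?"
  3. Counterfactual ("imagining"): $P(y_x \mid x', y')$, e.g. "Would this smoker who developed cancer have avoided it had they not smoked?"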

To learn more about the Ladder of Causation, Fabian Dablander wrote a very interesting and thorough blog post on the matter.

Causal Inference

It is extremely hard to perform causal inference when we only have measurements from the lower rungs but need to reason about higher ones. Some researchers argue that one may be able to reason at a higher rung given a combination of partial knowledge of the underlying structural causal model (SCM), in the form of a causal graph, and data from lower rungs. The graphical structure therefore plays a crucial role in transforming higher-rung queries into expressions that can be estimated from lower-rung quantities.
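
A standard illustration (textbook causal inference, included here for exposition): in a graph where $Z$ confounds $X$ and $Y$ (edges $Z \to X$, $Z \to Y$, $X \to Y$), the Rung-2 query $P(y \mid \text{do}(x))$ can be rewritten in terms of Rung-1 quantities via the backdoor adjustment formula:

$$P(y \mid \text{do}(x)) = \sum_{z} P(y \mid x, z)\, P(z)$$

The causal graph is what licenses this rewrite: because $Z$ blocks every backdoor path from $X$ to $Y$, the interventional query becomes answerable from purely observational data.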

Composing the CLadder Dataset

Task Formulation

The CLadder dataset consists of $N$ triples $\mathcal{D} \coloneqq \{(q_i, a_i, e_i)\}_{i=1}^{N}$, where $q_i$ is a question, $a_i \in \{\text{Yes}, \text{No}\}$ is its binary answer, and $e_i$ is the ground-truth explanation.
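
A minimal sketch of what one such triple might look like (the field names and the concrete example below are hypothetical illustrations, not the dataset's actual schema):

```python
from dataclasses import dataclass

@dataclass
class CLadderSample:
    """One (q_i, a_i, e_i) triple from the dataset D."""
    question: str     # q_i: a natural-language causal question
    answer: str       # a_i: the binary ground-truth answer, "Yes" or "No"
    explanation: str  # e_i: the ground-truth reasoning behind the answer

# Hypothetical example, for illustration only:
sample = CLadderSample(
    question=(
        "Smoking causes tar deposits, and tar deposits cause cancer. "
        "Given the stated probabilities, does smoking increase the chance of cancer?"
    ),
    answer="Yes",
    explanation="Interventional query (Rung 2); estimand: P(cancer | do(smoking)) ...",
)
```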

Design Principles

To compose a comprehensive dataset, the authors ensure broad coverage of all rungs of the Ladder of Causation. They also use binary variables instead of continuous ones, given the abundance of binary and categorical variables in real-world causal questions. Moreover, they design the questions around graphs with few (three to four) variables, as LLMs struggle with heavy computation.

Overall Pipeline

  1. Formal Part: specify the inputs, and derive the ground-truth answers and explanations with the CI (causal inference) engine.
  2. Natural Language Part: verbalize the formal queries and specifications by associating them with a story (a toy sketch of the pipeline follows below).
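
A toy, end-to-end version of this two-part pipeline for a single fixed graph and query type (everything below is an illustrative simplification; the paper's actual pipeline covers many graphs, query types, and stories):

```python
import random

def generate_sample():
    """Toy generator: one confounded graph (Z -> X, Z -> Y, X -> Y) and one
    interventional query. Names and numbers are illustrative, not the paper's."""
    # --- Formal part: sample a parameterization and solve it exactly ---
    p_z = random.uniform(0.1, 0.9)            # P(Z=1)
    p_y = {(x, z): random.uniform(0.1, 0.9)   # P(Y=1 | X=x, Z=z)
           for x in (0, 1) for z in (0, 1)}

    # Mini "CI engine": backdoor adjustment,
    # P(Y=1 | do(X=x)) = sum_z P(Y=1 | x, z) * P(Z=z)
    def p_y_do(x):
        return p_y[(x, 1)] * p_z + p_y[(x, 0)] * (1 - p_z)

    ate = p_y_do(1) - p_y_do(0)               # average treatment effect
    answer = "Yes" if ate > 0 else "No"
    explanation = f"E[Y | do(X=1)] - E[Y | do(X=0)] = {ate:+.3f}"

    # --- Natural language part: verbalize the specification through a story ---
    question = (
        "Z affects both X and Y, and X affects Y. "
        f"P(Z=1) = {p_z:.2f}; P(Y=1 | X, Z) = { {k: round(v, 2) for k, v in p_y.items()} }. "
        "If we intervene and set X to 1, does Y become more likely?"
    )
    return question, answer, explanation

print(generate_sample())
```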

CausalCoT Model

To guide LLMs in correctly answering the questions in CLadder, the authors draw inspiration from the CI engine, breaking problems down into multiple symbolically grounded, simpler steps. The proposed prompting framework, CausalCoT, consists of two phases. The first is the Preparation phase, which contains four steps:

  1. Extract the causal graph
  2. Determine the query type
  3. Formalize the query
  4. Gather all relevant data

The second phase is the Solution phase, which involves two steps:

  1. Deduce the estimand using causal inference
  2. Calculate the estimand

Lastly, CausalCoT prompts the LLM to answer the initial question with just "Yes" or "No".
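
A minimal sketch of what such a multi-step prompt could look like (the wording below is an approximation for illustration; the paper's actual prompt differs):

```python
CAUSALCOT_PROMPT = """\
Q: {question}

Guidance: Address the question by following the steps below:
Step 1) Extract the causal graph: identify the variables and the causal edges among them.
Step 2) Determine the query type: associational, interventional, or counterfactual.
Step 3) Formalize the query as a symbolic expression, e.g. P(Y | do(X)).
Step 4) Gather all relevant data: collect the probabilities stated in the question.
Step 5) Deduce the estimand: rewrite the query in terms of the available data,
        using the rules of causal inference (e.g. do-calculus, backdoor adjustment).
Step 6) Calculate the estimand: plug in the numbers and compute the result.
Based on the result, answer the original question with just "Yes" or "No".
"""

# Hypothetical usage: fill in any CLadder question.
prompt = CAUSALCOT_PROMPT.format(question="<a CLadder question>")
```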

Results

CausalCoT achieves the best performance across all three rungs of causal questions, with performance decreasing monotonically as the rung gets higher. Moreover, CausalCoT enhances reasoning ability across all levels, especially on the anti-commonsensical data.