Paper Review

FAITHFUL EXPLANATIONS OF BLACK-BOX NLP MODELS USING LLM-GENERATED COUNTERFACTUALS

Kim doing 2025. 4. 13. 20:13

Introduction

  • To ensure the safe and trustworthy deployment of NLP models, it is essential to provide explanations that reflect the true reasoning behind a model’s predictions. Traditional explanation methods often rely on correlation rather than causation, which can result in misleading interpretations. This paper proposes two model-agnostic approaches for generating faithful explanations using counterfactual reasoning: (1) counterfactual generation via large language models (LLMs) that modify a specific concept in a text while preserving confounding variables, and (2) a more efficient matching-based approach that learns a causal embedding space using LLM-generated counterfactuals at training time. These methods aim to estimate the causal effect of high-level concepts on model predictions, offering faithful and generalizable insights into black-box models.

Background

  • NLP models are often black boxes, making their internal decision-making processes hard to interpret.
  • Traditional explanation methods (e.g., feature attribution, probing) mostly focus on what is encoded in the model, not what is actually used.
  • This leads to correlation-based explanations, which may not reflect the true causal reasoning of the model.
  • Counterfactual explanations—how the output changes when a concept is altered—offer a more causally grounded approach.
  • However, previous counterfactual generation methods were limited by:
    • Simplistic local editing (e.g., word replacement)
    • Costly manual annotation
  • Large Language Models (LLMs) can overcome these limitations by:
    • Generating high-quality counterfactuals
    • Enabling model-agnostic explanation methods that are both faithful and efficient

 

Method 

1. MODEL EXPLANATION WITH COUNTERFACTUALS

 

  • The goal of this method is to explain black-box NLP models by estimating the causal effect of high-level concepts (like "ambiance" or "service") on a model’s prediction.

These equations define the average (CaCE) and individual (ICaCE) causal impact of changing a concept in the data generating process.
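Written out, the two quantities plausibly take the following form. The notation is an assumption made for this sketch and may differ from the paper's: G stands for the data-generating process, f for the explained model, and T for the intervened concept with values t and t'.

```latex
\begin{align}
\mathrm{CaCE}_f(T, t, t') &=
  \mathbb{E}_{x' \sim \mathcal{G}}\big[f(x') \mid do(T = t')\big]
  - \mathbb{E}_{x \sim \mathcal{G}}\big[f(x) \mid do(T = t)\big] \\
\mathrm{ICaCE}_f(x_t, T, t') &=
  \mathbb{E}_{x' \sim \mathcal{G}}\big[f(x') \mid x_t,\, do(T = t')\big]
  - f(x_t)
\end{align}
```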

 

  • The method assumes access to a causal graph (not the full DGP), representing relationships among variables like concepts, text, and prediction.
  • Since true counterfactuals are not available, the method uses approximated counterfactuals to estimate the causal effect.

These equations show how to estimate causal effects using approximate counterfactuals. ICaCE is the difference between the prediction for the counterfactual and the original input. CaCE is the average of these differences across multiple examples.
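The two estimators described above can be sketched in a few lines of Python. This is only an illustration: `toy_model` is a hypothetical stand-in for the black-box classifier, and the function names are not from the paper.

```python
import numpy as np

def icace_estimate(f, x, x_cf):
    """Estimated individual effect: prediction on the approximate
    counterfactual minus prediction on the original text."""
    return np.asarray(f(x_cf), dtype=float) - np.asarray(f(x), dtype=float)

def cace_estimate(f, pairs):
    """Estimated average effect: mean of the individual estimates over
    (original, approximate counterfactual) pairs."""
    return np.mean([icace_estimate(f, x, x_cf) for x, x_cf in pairs], axis=0)

# Hypothetical black-box model: returns [P(negative), P(positive)]
# based on a single keyword. A real model would replace this.
def toy_model(text):
    return [0.2, 0.8] if "friendly" in text else [0.7, 0.3]

pairs = [
    ("the waiter was rude", "the waiter was friendly"),
    ("rude staff, good food", "friendly staff, good food"),
]
effect = cace_estimate(toy_model, pairs)
print(effect)
```

The result is a vector with one entry per output class, so a positive entry means the intervention raises that class's probability on average.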

  • To improve robustness, the method employs Top-K counterfactual matching, averaging over the K most similar matched texts.

This equation uses Top-K matching, averaging the causal effects over the K most similar counterfactuals. It helps make the explanation more stable and reliable.
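As a rough sketch of the Top-K idea, the estimate below averages the prediction differences over the k candidates with the highest similarity to the input. The similarity scores are assumed to be given, and the function name and toy scores are hypothetical.

```python
import numpy as np

def top_k_effect(f, x, candidates, similarities, k=3):
    """Average f(candidate) - f(x) over the k candidates most similar to x."""
    order = np.argsort(similarities)[::-1][:k]  # indices of the k highest scores
    fx = np.asarray(f(x), dtype=float)
    diffs = [np.asarray(f(candidates[i]), dtype=float) - fx for i in order]
    return np.mean(diffs, axis=0)

# Hypothetical model outputs for four texts, keyed by a short label.
scores = {"orig": 0.2, "a": 0.8, "b": 0.6, "c": 0.0}
f = lambda text: [scores[text]]

result = top_k_effect(f, "orig", ["a", "b", "c"], [0.9, 0.8, 0.1], k=2)
print(result)
```

With k=2 the least similar candidate "c" is ignored, which is what keeps one poor match from dominating the estimate.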


 

2. COUNTERFACTUALS AS AN IDEAL MODEL EXPLANATION

 

  • A faithful explanation should reflect how the model actually makes its decisions—not just what it has learned.
  • Many existing explanation methods focus on what features are encoded, not what features are used in predictions, leading to correlation-based (not causal) explanations.
  • Counterfactual explanations directly show why a prediction was made by answering:
    “What would the model predict if this concept were different?”
  • The authors introduce a key property, Order-Faithfulness:
    • If an explanation method says concept A has more impact than concept B, the actual causal effect of A should be higher than that of B.
      (This ensures the explanation ranks concepts correctly.)
  • Formal definition of Order-Faithfulness:

This equation formally defines Order-Faithfulness. It means that if an explanation method ranks concept C_1 as having more impact than C_2, then the true causal effect of C_1 must be greater than that of C_2, and vice versa.
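One way to make the property concrete is a pairwise check: for every ordered pair of concepts, the explanation's ranking and the true ranking must agree. The sketch below assumes each concept has a single scalar score; the function name and the numbers are hypothetical.

```python
from itertools import permutations

def is_order_faithful(explained, true_effect):
    """True if, for every ordered pair (a, b), the explanation ranks a above b
    exactly when the true causal effect of a is larger than that of b."""
    for a, b in permutations(explained, 2):
        if (explained[a] > explained[b]) != (true_effect[a] > true_effect[b]):
            return False
    return True

true_effect = {"food": 0.7, "service": 0.4, "noise": 0.2}
faithful = {"food": 0.9, "service": 0.5, "noise": 0.1}      # same ordering
unfaithful = {"food": 0.9, "service": 0.5, "noise": 0.6}    # noise ranked too high

print(is_order_faithful(faithful, true_effect))
print(is_order_faithful(unfaithful, true_effect))
```

Checking ordered pairs rather than unordered ones enforces the "and vice versa" part of the definition, so a tie in the explanation scores fails when the true effects differ.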

 

 

  • Counterfactual-based methods (like the one proposed) satisfy Order-Faithfulness, while many non-causal methods do not.
  • Therefore, counterfactuals are ideal for faithful explanations, especially in tasks like model comparison, fairness analysis, and debugging.

3. LLM-GENERATED COUNTERFACTUALS

  • This section introduces the first method for approximating counterfactuals:
    → Generating counterfactuals using a Large Language Model (LLM), such as ChatGPT.
  • Given a causal graph and an intervention (e.g., changing “service” from bad → good), the LLM is prompted to rewrite the input text by modifying only the target concept while keeping other variables fixed.

Original: "Great pizza and vibe, but the waiter was rude."
Counterfactual: "Great pizza and vibe, and the waiter was friendly."

  • This is achieved by:
    • Identifying confounders using the back-door criterion (adjustment set)
    • Avoiding changes to mediators and colliders, which could distort the effect.

Figure 1 shows the causal graph used in the CEBaB dataset. It illustrates how four high-level concepts, Food (F), Service (S), Ambiance (A), and Noise (N), affect both the text (X) and the rating prediction (Y). The model f predicts Y using only the text X. This graph helps determine which variables should be changed or held fixed when generating counterfactuals.

  • For the direct causal effect, the prompt asks the LLM to change the target concept but keep confounders and mediators fixed.
  • For the total causal effect, mediators are allowed to change based on the intervention.
  • To improve precision, the LLM prompt can also include:
    • Which concepts to change
    • Which concepts to hold constant
    • The causal context
  • Multiple counterfactuals can be generated for the same input and their effects averaged, as in Top-K matching.
    → This supports more robust estimation by reducing variability.
  • Limitation: While effective, LLM-based generation is:
    • Slow (autoregressive decoding)
    • Costly (requires API calls)
    • Potentially restricted by privacy or deployment constraints
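To illustrate what such a prompt might contain (the concept to change, the concepts to hold constant, and the text itself), here is a small prompt-builder sketch. The wording of the template is hypothetical and is not the prompt used in the paper; no LLM call is made.

```python
def build_counterfactual_prompt(text, target, old_value, new_value, keep_fixed):
    """Assemble an instruction asking an LLM to edit only the target concept."""
    fixed = ", ".join(keep_fixed)
    return (
        f"Rewrite the restaurant review below so that the {target} is "
        f"described as {new_value} instead of {old_value}.\n"
        f"Keep everything about the following concepts unchanged: {fixed}.\n"
        "Change as little of the original wording as possible.\n\n"
        f"Review: {text}\n"
        "Rewritten review:"
    )

prompt = build_counterfactual_prompt(
    text="Great pizza and vibe, but the waiter was rude.",
    target="service",
    old_value="negative",
    new_value="positive",
    keep_fixed=["food", "ambiance", "noise"],
)
print(prompt)
```

The returned string would then be sent to whichever LLM is available; listing the held-fixed concepts explicitly is what encodes the adjustment set from the causal graph.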

4. CAUSAL REPRESENTATION LEARNING FOR MATCHING

 

 

  • This section presents the second method for counterfactual approximation:
    Matching, which finds similar real examples instead of generating counterfactuals with an LLM.
  • The idea is to build a causal embedding space, where:
    • Valid counterfactuals and matched examples are close to the input,
    • Invalid or irrelevant examples are far apart.
  • To achieve this, the model uses a learned text encoder ϕ(⋅) (e.g., RoBERTa or Sentence-BERT).
  • Given an input x_t, the model searches in a control group (where the concept value is t′)
    to find the best match x′.

Matching objective

  • The similarity function s(⋅,⋅) is cosine similarity: s(u, v) = (u · v) / (‖u‖ ‖v‖)
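The matching step amounts to picking the control-group example whose embedding has the highest cosine similarity to the query. A minimal sketch, with toy 2-d vectors standing in for the encoder outputs ϕ(x_t) and ϕ(x′); the vectors and function names are hypothetical.

```python
import numpy as np

def cosine_similarity(u, v):
    """s(u, v) = u.v / (||u|| ||v||)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def best_match(query_vec, candidate_vecs):
    """Index and score of the candidate embedding closest to the query."""
    sims = [cosine_similarity(query_vec, c) for c in candidate_vecs]
    best = int(np.argmax(sims))
    return best, sims[best]

# Toy embeddings; in practice these would come from the trained encoder.
query = np.array([1.0, 0.0])
control_group = np.array([[0.0, 1.0], [2.0, 0.1], [-1.0, 0.0]])

idx, score = best_match(query, control_group)
print(idx, round(score, 4))
```

Because cosine similarity ignores vector length, the second candidate wins even though its norm is twice that of the query.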

 

  • The encoder is trained using contrastive loss that:
    • Pulls together true counterfactuals and valid matches,
    • Pushes apart misspecified or invalid samples.
  • Training uses four sets:
    • X_CF: LLM-generated counterfactuals
    • X_M: valid matches (same confounders)
    • X_¬CF: counterfactuals from the wrong intervention
    • X_¬M: mismatched samples (wrong confounders)
  • After training, inference is very efficient—just a similarity search, no generation.
    It’s up to 1000× faster than LLM generation.
  • To train the encoder ϕ(x), the model uses a contrastive loss,
    which makes the representation of a query example closer to positive examples (true counterfactuals or valid matches) and farther from negative ones (misspecified or invalid examples).

Contrastive loss function

Where:

  • x: query example
  • X+: positive set (e.g., counterfactuals, valid matches)
  • X−: negative set (e.g., misspecified counterfactuals or invalid matches)
  • τ: temperature hyperparameter (controls softmax sharpness)
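Given these symbols, a common form for such a loss is the negative log of the share of softmax mass that falls on the positive set. The sketch below implements that InfoNCE-style form on plain vectors; it is an assumption about the general shape, and the paper's exact equation may differ.

```python
import numpy as np

def contrastive_loss(query, positives, negatives, tau=0.1):
    """-log( sum_pos exp(s/tau) / sum_{pos and neg} exp(s/tau) ),
    where s is the cosine similarity to the query embedding."""
    def cos(u, v):
        return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

    pos = np.array([cos(query, p) for p in positives]) / tau
    neg = np.array([cos(query, n) for n in negatives]) / tau
    all_logits = np.concatenate([pos, neg])
    m = all_logits.max()  # subtract the max for numerical stability
    log_num = m + np.log(np.exp(pos - m).sum())
    log_den = m + np.log(np.exp(all_logits - m).sum())
    return float(log_den - log_num)

q = np.array([1.0, 0.0])
near = np.array([[1.0, 0.1]])   # embedding close to the query
far = np.array([[-1.0, 0.0]])   # embedding pointing the opposite way

good = contrastive_loss(q, positives=near, negatives=far)
bad = contrastive_loss(q, positives=far, negatives=near)
print(good, bad)
```

The loss is near zero when positives sit close to the query and negatives far away, and grows when the roles are reversed; a smaller τ sharpens that contrast.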

Final loss

This complete objective encourages:

  • Similarity with correct CFs and matches
  • Dissimilarity with incorrect ones
  • Robustness across varied candidate sets

Training procedure

 

  • Step 1: Prepare training set with concept annotations
    (or predict concept values using a zero-shot LLM if human labels are unavailable)
  • Step 2: For each example x_t, construct the four sets: X_CF, X_M, X_¬CF, X_¬M
  • Step 3: Sample one example from each set and compute the loss in Equation (8)
  • Step 4: Train the encoder (e.g., RoBERTa or Sentence-BERT) for multiple epochs
  • Step 5: Select the best checkpoint based on validation loss
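Part of Step 2 can be sketched from concept annotations alone: among candidates that already carry the new target value, those agreeing with the query on all other concepts are valid matches, and the rest are mismatched. The sketch below treats every non-target concept as part of the adjustment set, which is a simplifying assumption; the data layout and function name are hypothetical, and the two counterfactual sets (which need LLM generations) are not shown.

```python
def split_matches(query, candidates, target, new_value):
    """Split candidates that carry the new target value into valid matches
    (all other concepts equal to the query's) and mismatched ones."""
    others = [c for c in query["concepts"] if c != target]
    valid, invalid = [], []
    for cand in candidates:
        if cand["concepts"][target] != new_value:
            continue  # not in the control group for this intervention
        same = all(cand["concepts"][c] == query["concepts"][c] for c in others)
        (valid if same else invalid).append(cand["text"])
    return valid, invalid

query = {"text": "Great pizza, rude waiter.",
         "concepts": {"food": "positive", "service": "negative"}}
candidates = [
    {"text": "Great pasta, lovely staff.",
     "concepts": {"food": "positive", "service": "positive"}},
    {"text": "Bland soup, lovely staff.",
     "concepts": {"food": "negative", "service": "positive"}},
    {"text": "Great pasta, slow staff.",
     "concepts": {"food": "positive", "service": "negative"}},
]

valid, invalid = split_matches(query, candidates, "service", "positive")
print(valid, invalid)
```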

 

EXPERIMENTAL SETUP

-Benchmark Dataset: CEBaB (Causal Estimation-Based Benchmark)

  • A high-quality benchmark for evaluating causal explanations in NLP.
  • Based on OpenTable restaurant reviews.
  • For each original review:
    • Human annotators rewrote it with a counterfactual edit (e.g., change “service” from negative → positive).
    • Each version received a 5-star sentiment rating and concept-level annotations for:
      • Food (F), Service (S), Ambiance (A), Noise (N)

Total examples:

  • Train (exclusive): 1,463 (split in half: 731 train / 732 candidate set)
  • Dev: 1,672
  • Test: 1,688

 

-Evaluation Pipeline

 

  • For each input example x_t and intervention T: t→t′, the method estimates ICaCE using:
    • The approximated counterfactual
    • Or, the matched example
  • This is compared to the ICaCE computed from a ground-truth counterfactual written by humans.
  • The error is calculated as the distance between the estimated effect and this ground-truth effect.

-Evaluation Metrics (3 types of distance)

 

1. L2 Distance  

 

 

2. Cosine Distance

 

 

3. Norm Difference (ND)

 

 

  • Final score = average error over 24 interventions (4 concepts × 6 value changes)
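The three distances can be sketched as follows, using their usual definitions: Euclidean distance, one minus cosine similarity, and the absolute difference of vector norms. These definitions are assumed here to match the benchmark's; the example vectors are hypothetical effect vectors, not results from the paper.

```python
import numpy as np

def l2_distance(a, b):
    """Euclidean distance between the two effect vectors."""
    return float(np.linalg.norm(np.asarray(a) - np.asarray(b)))

def cosine_distance(a, b):
    """1 - cosine similarity; undefined if either vector is all zeros."""
    a, b = np.asarray(a), np.asarray(b)
    return float(1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def norm_difference(a, b):
    """Absolute difference between the magnitudes of the two vectors."""
    return float(abs(np.linalg.norm(a) - np.linalg.norm(b)))

true_effect = np.array([3.0, 4.0])   # hypothetical ground-truth ICaCE
estimate = np.array([3.0, 0.0])      # hypothetical estimated ICaCE

print(l2_distance(true_effect, estimate))
print(cosine_distance(true_effect, estimate))
print(norm_difference(true_effect, estimate))
```

The three metrics capture different failures: cosine distance penalizes a wrong direction of the effect, norm difference a wrong magnitude, and L2 both at once.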

 

-Models Explained

 

 

  • Three fine-tuned models for 5-star sentiment prediction:
    • DistilBERT
    • BERT
    • RoBERTa
  • Two zero-shot LLMs:
    • LLaMA-2 7B
    • LLaMA-2 13B

-Explanation Methods Compared

 

Generative methods:

  • Zero-shot LLM (ChatGPT)
  • Few-shot LLM
  • Fine-tuned T5

Matching methods:

  • Random Match
  • Propensity Score Matching
  • Approx (baseline from CEBaB)
  • Pretrained RoBERTa & S-BERT matching
  • The proposed Causal Representation Matching

RESULTS

 

Table 1

 

 

  1. LLM-generated counterfactuals = SOTA explainers
    • Fine-tuned T5 shows best results among generative methods
    • Few-shot > Zero-shot
    • But LLM generation is slow & expensive at inference
  2. The causal matching model outperforms all other matching methods
    • Performs better than Approx (CEBaB baseline) and pretrained matching with RoBERTa/S-BERT
    • When using ground-truth CFs in the candidate set, the causal model even beats generative methods (see first row in Table 1)
  3. Top-K matching universally improves all methods
    • K = 10 significantly reduces error compared to K = 1
    • Performance gains apply to both matching and generative approaches

DISCUSSION

 

  • Faithful explanations require causality
    → Truly understanding why a model makes a decision requires causal reasoning, not just correlational patterns.
  • LLMs are powerful for counterfactual generation
    → LLMs (e.g., ChatGPT, GPT-4) generate high-quality counterfactuals that outperform other explanation methods.
  • But LLMs are inefficient for inference
    → They are slow, costly, and sometimes infeasible for real-time or privacy-sensitive applications.
  • Matching offers a fast, scalable alternative
    → Matching with learned causal representations allows fast, model-agnostic, and still faithful explanations.
  • Top-K matching consistently improves all methods
    → Using multiple matched counterfactuals increases robustness and lowers variance in causal effect estimates.
  • Future benchmarks can be built using LLMs
    → Instead of manually collecting counterfactuals, GPT-4 can be used to generate realistic and diverse counterfactual benchmarks.

CONCLUSION

 

  • The paper introduces a framework for faithful, model-agnostic explanations based on counterfactual reasoning.
  • Two methods are proposed:
    1. LLM-based counterfactual generation: highly effective, but costly at inference time.
    2. Matching using causal representations: efficient and scalable, while maintaining faithfulness.
  • A new theoretical property is proposed: Order-Faithfulness,
    which ensures explanation rankings reflect actual causal impact.
  • Experiments show that:
    • LLM-generated counterfactuals are state-of-the-art explainers.
    • Matching methods trained on causal objectives closely rival LLMs and are 1000× faster at inference.
    • Top-K counterfactuals improve performance across the board.
  • GPT-4 can be used not only for explanations, but also to automatically generate benchmark datasets, enabling new standards for explainability research.

 

OWN REVIEW

  • This paper presents a highly valuable contribution to the explainability field by combining strong theoretical grounding with practical applicability. It balances two worlds: LLM-powered explanations that achieve state-of-the-art performance, and matching-based methods that offer scalability. The introduction of the “Order-Faithfulness” principle is not only elegant but actionable—it sets a new standard for what it means to be “faithful” in explanation. While LLM-based generation is powerful, its cost still limits adoption. The matching approach, on the other hand, is both fast and interpretable. This work will likely influence future research in interpretable AI, counterfactual reasoning, and benchmark creation.

 

REFERENCES

 

1. Faithful Explanations of Black-box NLP Models Using LLM-generated Counterfactuals

Main paper. Proposes model-agnostic causal explanations using counterfactuals generated via LLMs and learned matching.
https://arxiv.org/abs/2310.00603

 


 

2. CEBaB: Estimating the Causal Effects of Real-world Concepts on NLP Model Behavior

Introduces the CEBaB dataset, which is used as the benchmark in this paper for causal evaluation.
https://papers.nips.cc/paper_files/paper/2022/hash/701ec28790b29a5bc33832b7bdc4c3b6-Abstract-Conference.html

 

 

3. Causality: Models, Reasoning and Inference

The foundational work on causal inference, including the do-calculus and causal graphs used throughout this paper.
Pearl, Cambridge University Press.

 

4. Learning the Difference That Makes a Difference with Counterfactually-Augmented Data

Demonstrates how counterfactual examples improve model robustness and fairness. A key precedent for counterfactual data generation.
https://arxiv.org/abs/1909.12434