Concept Bottleneck Models

Summary: Neural network architectures that explicitly route information through human-interpretable concept representations as an intermediate bottleneck layer. These models first predict concepts from raw inputs, then use those concepts to make final predictions, enabling interpretability and human intervention during inference.

Overview

Concept Bottleneck Models (CBMs) are an approach to interpretable machine learning that forces a neural network to make its decisions through human-understandable intermediate representations. Unlike traditional black-box models that map inputs directly to outputs, CBMs decompose prediction into two stages: first predicting a set of pre-defined concepts from the input, then using only those concepts to generate the final prediction.

This architecture creates an interpretability bottleneck where all decision-relevant information must pass through concept activations that humans can understand and potentially modify. The concept layer serves as a compressed, interpretable representation of the input that retains sufficient information for the downstream task while being meaningful to human domain experts.

CBMs enable several key capabilities: understanding which concepts drive specific decisions, intervening on concept predictions at test time to correct model errors, and constraining model behavior, since the final prediction can depend on the input only through the values in the concept space.

Key Details

Architecture Components:

Concept Predictor: Neural network mapping raw inputs (images, text, etc.) to concept predictions
Label Predictor: Network taking concept activations as input to produce final task predictions
Concept Space: Pre-defined set of human-interpretable attributes or features
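
The composition of these components can be sketched in a few lines. This is a minimal illustration in plain Python, not a reference implementation: the class name ConceptBottleneck, the hand-picked weights, and the choice of single linear layers for both predictors are all assumptions made for the example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def linear(weights, bias, x):
    """Compute W @ x + b for a list-of-rows weight matrix."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


class ConceptBottleneck:
    """Toy CBM: input -> concept probabilities -> class label."""

    def __init__(self, concept_w, concept_b, label_w, label_b):
        self.concept_w, self.concept_b = concept_w, concept_b
        self.label_w, self.label_b = label_w, label_b

    def predict_concepts(self, x):
        # Concept predictor: one sigmoid probability per pre-defined concept.
        return [sigmoid(z) for z in linear(self.concept_w, self.concept_b, x)]

    def predict_label(self, concepts):
        # Label predictor: sees only the concept vector, never the raw input.
        logits = linear(self.label_w, self.label_b, concepts)
        return max(range(len(logits)), key=logits.__getitem__)

    def forward(self, x):
        concepts = self.predict_concepts(x)
        return concepts, self.predict_label(concepts)


# Hypothetical weights: 3 input features, 2 concepts, 2 classes.
model = ConceptBottleneck(
    concept_w=[[4.0, 0.0, 0.0], [0.0, 4.0, 0.0]],
    concept_b=[-2.0, -2.0],
    label_w=[[-1.0, -1.0], [1.0, 1.0]],
    label_b=[1.0, -1.0],
)

concepts, label = model.forward([1.0, 1.0, 0.3])
print(concepts, label)
```

Because predict_label receives only the concept vector, any property of the input that is not captured by a concept cannot influence the final label. That restriction is what makes the concept layer a bottleneck.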

Training Approaches:

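Joint Training: End-to-end optimization of both components on a weighted sum of the concept loss and the task loss
Sequential Training: Train the concept predictor first, then train the label predictor on its predicted concepts
Independent Training: Train each component separately, fitting the label predictor on ground-truth concept annotations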
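
The three regimes differ only in what the label predictor sees during training and in whether the two losses are optimized together. One way to write them down, using assumed notation (g for the concept predictor, f for the label predictor, L_C and L_Y for the concept and task losses, and a trade-off weight lambda, none of which appear in the text above), is:

```latex
\begin{align*}
\text{Independent:}\quad & \hat{g} = \arg\min_{g} \sum_{i} L_C\big(g(x_i),\, c_i\big), \qquad \hat{f} = \arg\min_{f} \sum_{i} L_Y\big(f(c_i),\, y_i\big) \\
\text{Sequential:}\quad & \hat{g} \text{ as above}, \qquad \hat{f} = \arg\min_{f} \sum_{i} L_Y\big(f(\hat{g}(x_i)),\, y_i\big) \\
\text{Joint:}\quad & (\hat{f}, \hat{g}) = \arg\min_{f,\, g} \sum_{i} \Big[ L_Y\big(f(g(x_i)),\, y_i\big) + \lambda\, L_C\big(g(x_i),\, c_i\big) \Big]
\end{align*}
```

Here x_i is an input, c_i its annotated concept vector, and y_i its task label. In the joint objective, a larger lambda pushes the bottleneck toward faithful concept predictions, while a smaller one lets the task loss dominate.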

Concept Selection Challenges:

Manual concept curation requires domain expertise and is costly
No guarantee that manually selected concepts preserve task performance
Decision-Relevant Selection algorithms can automatically identify optimal concept subsets
Concept selection is NP-hard but tractable approximations exist
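
To make the selection problem concrete, the sketch below uses a simple greedy heuristic. It is not the Decision-Relevant Selection algorithm and not a MILP formulation. The scoring rule (errors of the best lookup-table predictor on the chosen concepts) and the function names majority_errors and greedy_select are assumptions chosen for illustration. In an RL reading, each row would be a state's concept vector and each label the preferred action in that state.

```python
from collections import Counter, defaultdict
from itertools import product


def majority_errors(rows, labels, subset):
    """Errors made by the best lookup table that sees only `subset` concepts."""
    groups = defaultdict(Counter)
    for row, y in zip(rows, labels):
        key = tuple(row[j] for j in subset)
        groups[key][y] += 1
    # Within each group, every sample outside the majority label is an error.
    return sum(sum(cnt.values()) - max(cnt.values()) for cnt in groups.values())


def greedy_select(rows, labels, k):
    """Add, one at a time, the concept that most reduces majority_errors."""
    selected = []
    remaining = list(range(len(rows[0])))
    for _ in range(k):
        best = min(remaining,
                   key=lambda j: majority_errors(rows, labels, selected + [j]))
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data: 4 binary concepts; the label depends only on concepts 0 and 2.
rows = list(product([0, 1], repeat=4))
labels = [2 * r[0] + r[2] for r in rows]

chosen = greedy_select(rows, labels, k=2)
print(chosen, majority_errors(rows, labels, chosen))
```

A greedy pass like this carries no optimality guarantee and can miss concepts that are only informative in combination. That gap is one reason exact formulations of concept selection are of interest despite the problem's hardness.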

Performance Characteristics:

Test-time interventions on concepts are reported to improve accuracy by 40-87%
Decision-Relevant Concepts provide near-optimal performance bounds
Concept quality vs. quantity trade-offs affect final task performance
Imperfect concept predictors can still enable effective interpretability
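
A test-time intervention amounts to overwriting some predicted concept values with values supplied by a person and re-running only the label predictor. The sketch below illustrates this under stated assumptions: predict_label is a hypothetical hand-written rule standing in for a trained label predictor, and the concept values are invented for the example.

```python
def predict_label(concepts):
    """Hypothetical label predictor: positive when at least two concepts are active."""
    return int(sum(c >= 0.5 for c in concepts) >= 2)


def intervene(predicted, corrections):
    """Return a copy of `predicted` with expert values written over chosen indices."""
    patched = list(predicted)
    for idx, value in corrections.items():
        patched[idx] = value
    return patched


# The concept predictor under-estimated concept 1 for this input.
predicted = [0.9, 0.2, 0.1]
before = predict_label(predicted)

# An expert confirms concept 1 is present; only the label predictor is re-run.
after = predict_label(intervene(predicted, {1: 1.0}))
print(before, after)
```

The concept predictor is left untouched and no retraining is needed. The correction propagates because the label depends on the input only through the concept vector.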

Applications:

Medical diagnosis with interpretable symptom reasoning
Interpretable Reinforcement Learning for transparent decision-making
Safety-critical systems requiring explainable predictions
Scientific discovery through concept-based hypothesis testing

Relationships

Interpretable Machine Learning — CBMs are a core technique for creating interpretable neural networks
Feature Selection — concept selection is a specialized form of feature selection with interpretability constraints
State Abstraction — in RL, concepts serve as state abstractions that preserve decision-relevant information
Human-AI Interaction — CBMs enable human intervention through concept modification at test time
Reinforcement Learning — concept-based RL policies route decisions through interpretable concept representations
Test-Time Intervention — key capability enabled by CBM architecture for human oversight
Abstraction Error — measures how well concept bottlenecks preserve task-relevant information
Mixed Integer Linear Programming — used in algorithms for optimal concept subset selection

Sources

sources/selecting-decision-relevant-concepts-in-reinforcement-learning — contributed automated concept selection methods, decision-relevance theory, and performance bounds for concept-based RL