Concept Bottleneck Models
Summary: Neural network architectures that explicitly route information through human-interpretable concept representations as an intermediate bottleneck layer. These models first predict concepts from raw inputs, then use those concepts to make final predictions, enabling interpretability and human intervention during inference.
Overview
Concept Bottleneck Models (CBMs) are an approach to interpretable machine learning that forces a neural network to make its decisions through human-understandable intermediate representations. Unlike black-box models that map inputs directly to outputs, CBMs decompose prediction into two stages: first predicting a set of pre-defined concepts from the input, then using only those concepts to generate the final prediction.
This architecture creates an interpretability bottleneck where all decision-relevant information must pass through concept activations that humans can understand and potentially modify. The concept layer serves as a compressed, interpretable representation of the input that retains sufficient information for the downstream task while being meaningful to human domain experts.
CBMs enable several key capabilities: understanding which concepts drive a specific decision, intervening on concept predictions at test time to correct model errors, and reasoning about model behavior in terms of the constrained concept space, since the label predictor sees nothing else.
Key Details
Architecture Components:
- Concept Predictor: Neural network mapping raw inputs (images, text, etc.) to concept predictions
- Label Predictor: Network taking concept activations as input to produce final task predictions
- Concept Space: Pre-defined set of human-interpretable attributes or features
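The three components above can be sketched as a small forward pass. This is a minimal illustration rather than a reference implementation: it assumes a linear concept predictor with sigmoid outputs and a linear label predictor, and the concept names, weights, and class meanings are hypothetical values chosen by hand.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def linear(weights, bias, x):
    """Return W @ x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

class ConceptBottleneck:
    """Input -> concept activations -> label; the label sees only concepts."""

    def __init__(self, concept_names, w_concept, b_concept, w_label, b_label):
        self.concept_names = concept_names
        self.w_concept, self.b_concept = w_concept, b_concept
        self.w_label, self.b_label = w_label, b_label

    def predict_concepts(self, x):
        # Concept predictor: raw input -> one probability per concept.
        return [sigmoid(z) for z in linear(self.w_concept, self.b_concept, x)]

    def predict_label(self, concepts):
        # Label predictor: concept activations -> index of the top class.
        scores = linear(self.w_label, self.b_label, concepts)
        return max(range(len(scores)), key=scores.__getitem__)

    def forward(self, x):
        concepts = self.predict_concepts(x)
        return dict(zip(self.concept_names, concepts)), self.predict_label(concepts)

# Hypothetical concept space and hand-set weights (not learned).
model = ConceptBottleneck(
    concept_names=["has_wings", "has_fur"],
    w_concept=[[2.0, 0.0, -1.0], [0.0, 2.0, -1.0]],
    b_concept=[0.0, 0.0],
    w_label=[[1.0, -1.0], [-1.0, 1.0]],  # class 0 = bird, class 1 = mammal
    b_label=[0.0, 0.0],
)
concepts, label = model.forward([1.0, 0.0, 0.0])
```

Because `predict_label` takes only the concept vector, any explanation or correction can be expressed entirely in terms of the named concepts.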
Training Approaches:
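- Joint Training: End-to-end optimization of both components on a weighted sum of the concept loss and the task loss
- Sequential Training: Train the concept predictor first, then train the label predictor on its predicted concepts
- Independent Training: Train each component separately, with the label predictor fit on ground-truth concepts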
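The difference between the three regimes comes down to which loss is optimized and which concept values the label predictor sees during training. The sketch below follows the formulation commonly used in the CBM literature; the function names and the `concept_weight` parameter are illustrative assumptions, and binary concepts with a cross-entropy concept loss are assumed.

```python
import math

def bce(target, prob, eps=1e-12):
    """Binary cross-entropy for a single concept."""
    prob = min(max(prob, eps), 1.0 - eps)
    return -(target * math.log(prob) + (1 - target) * math.log(1 - prob))

def concept_loss(true_concepts, predicted_concepts):
    """Mean binary cross-entropy over all concepts."""
    pairs = list(zip(true_concepts, predicted_concepts))
    return sum(bce(t, p) for t, p in pairs) / len(pairs)

def joint_loss(task_loss, true_concepts, predicted_concepts, concept_weight=1.0):
    """Joint training objective: task loss plus weighted concept loss."""
    return task_loss + concept_weight * concept_loss(true_concepts, predicted_concepts)

def label_predictor_inputs(regime, true_concepts, predicted_concepts):
    """Concept values fed to the label predictor while it is being trained."""
    if regime == "independent":
        return list(true_concepts)        # ground-truth concepts
    if regime in ("sequential", "joint"):
        return list(predicted_concepts)   # concept predictor outputs
    raise ValueError(f"unknown regime: {regime}")
```

Setting `concept_weight` to zero in `joint_loss` removes concept supervision entirely, at which point the bottleneck layer is no longer tied to the named concepts.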
Concept Selection Challenges:
- Manual concept curation requires domain expertise and is costly
- No guarantees that manually selected concepts preserve task performance
- Decision-Relevant Selection algorithms can automatically identify optimal concept subsets
- Concept selection is NP-hard, but tractable approximations exist
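To give a feel for why selection is combinatorial and what an approximation might look like, here is a greedy, set-cover-style heuristic. It is not the algorithm from the cited source. It assumes a simplified selection criterion (any two states with different labels, such as different optimal actions, must differ in at least one selected concept), and the state data is a made-up toy example.

```python
from itertools import combinations

def conflicting_pairs(states):
    """Index pairs (i, j) of states whose labels differ."""
    return {(i, j)
            for (i, a), (j, b) in combinations(enumerate(states), 2)
            if a[1] != b[1]}

def separated_by(concept, states, pairs):
    """Subset of pairs that the given concept tells apart."""
    return {(i, j) for i, j in pairs
            if states[i][0][concept] != states[j][0][concept]}

def greedy_select(states, n_concepts):
    """Repeatedly pick the concept that separates the most remaining pairs."""
    remaining = conflicting_pairs(states)
    selected = []
    while remaining:
        candidates = [c for c in range(n_concepts) if c not in selected]
        best = max(candidates,
                   key=lambda c: len(separated_by(c, states, remaining)),
                   default=None)
        if best is None or not separated_by(best, states, remaining):
            break  # no unused concept separates any remaining pair
        selected.append(best)
        remaining -= separated_by(best, states, remaining)
    return selected, remaining

# Toy states: (binary concept vector, label). Hypothetical data.
states = [
    ([1, 0, 0], "left"),
    ([1, 1, 0], "right"),
    ([0, 1, 1], "right"),
    ([0, 0, 1], "left"),
]
selected, unresolved = greedy_select(states, n_concepts=3)
```

In this toy case a single concept (index 1) already separates every pair of states with different labels, so the greedy pass stops after one pick. Greedy selection carries no optimality guarantee in general; an exact method would search over subsets instead.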
Performance Characteristics:
- Test-time interventions on concepts are reported to improve accuracy by 40-87%
- Decision-Relevant Concepts come with bounds showing that performance stays near-optimal
- Concept quality vs. quantity trade-offs affect final task performance
- Imperfect concept predictors can still enable effective interpretability
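A test-time intervention replaces selected predicted concept values with values supplied by a human before the label predictor runs. The sketch below shows the mechanism on hand-built toy data. The label rule, the examples, and the choice of which concept to correct are all hypothetical, and the resulting accuracies illustrate the mechanism only; they say nothing about the reported figures above.

```python
def label_from_concepts(concepts):
    """Hypothetical label predictor: class 1 if at least two concepts are active."""
    return int(sum(c >= 0.5 for c in concepts) >= 2)

def intervene(predicted, corrections):
    """Overwrite predicted concept values with expert-supplied ones."""
    patched = list(predicted)
    for index, value in corrections.items():
        patched[index] = value
    return patched

def accuracy(examples, intervene_on=()):
    """Task accuracy when the listed concept indices are set to their true values."""
    correct = 0
    for ex in examples:
        corrections = {i: ex["true"][i] for i in intervene_on}
        concepts = intervene(ex["predicted"], corrections)
        correct += label_from_concepts(concepts) == ex["label"]
    return correct / len(examples)

# Toy examples with an imperfect concept predictor (made-up values).
examples = [
    {"predicted": [0.2, 0.9, 0.1], "true": [1, 1, 0], "label": 1},
    {"predicted": [0.8, 0.7, 0.3], "true": [1, 1, 0], "label": 1},
    {"predicted": [0.6, 0.1, 0.9], "true": [0, 0, 1], "label": 0},
    {"predicted": [0.1, 0.2, 0.1], "true": [0, 0, 0], "label": 0},
]
baseline = accuracy(examples)
with_fix = accuracy(examples, intervene_on=[0])
```

The label predictor is untouched by the intervention; only its input changes, which is what makes the correction auditable concept by concept.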
Applications:
- Medical diagnosis with interpretable symptom reasoning
- Interpretable Reinforcement Learning for transparent decision-making
- Safety-critical systems requiring explainable predictions
- Scientific discovery through concept-based hypothesis testing
Relationships
- Interpretable Machine Learning — CBMs are a core technique for creating interpretable neural networks
- Feature Selection — concept selection is a specialized form of feature selection with interpretability constraints
- State Abstraction — in RL, concepts serve as state abstractions that preserve decision-relevant information
- Human-AI Interaction — CBMs enable human intervention through concept modification at test time
- Reinforcement Learning — concept-based RL policies route decisions through interpretable concept representations
- Test-Time Intervention — key capability enabled by CBM architecture for human oversight
- Abstraction Error — measures how well concept bottlenecks preserve task-relevant information
- Mixed Integer Linear Programming — used in algorithms for optimal concept subset selection
Sources
- sources/selecting-decision-relevant-concepts-in-reinforcement-learning — contributed automated concept selection methods, decision-relevance theory, and performance bounds for concept-based RL