source: "raw/articles/ama-bench-evaluating-long-horizon-memory-for-agentic-applications.md"
Summary: AMA-Bench: Evaluating Long-Horizon Memory for Agentic Applications
TL;DR: Researchers introduce AMA-Bench, a benchmark for evaluating memory systems in long-horizon AI agent applications. They find that current memory methods underperform long-context baselines, and they propose AMA-Agent, which combines causality graphs with tool-augmented retrieval and achieves 57.22% average accuracy.
Key Points
- Existing memory benchmarks focus on dialogue-centric interactions but miss machine-generated, causally grounded agent trajectories
- Agent memory requires three core mechanisms: memory retrieval, memory evolution, and memory condensation
- Four memory capability categories: Recall, Causal Inference, State Updating, and State Abstraction
- AMA-Bench comprises a real-world subset (2,496 QA pairs from 6 domains) and a synthetic subset (controllable scaling)
- Current memory systems fall short of long-context baselines on agent tasks due to lossy compression and similarity-based retrieval
- Memory architecture choice matters more than model scale (improvement range of 0.45 across architectures vs 0.038 across model scales)
- AMA-Agent uses Causality Graph for memory construction and Tool-Augmented Search for retrieval
- AMA-Agent outperforms strongest baselines by 11.16% on average, achieving 57.22% accuracy
- Real-world domains: web navigation, software engineering, Text2SQL, gaming, embodied AI, open-world QA
- Synthetic environments: BabyAI and TextWorld with programmatic QA generation
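The pairing of a causality graph with tool-augmented search described above can be illustrated with a toy sketch. This is not AMA-Agent's implementation: the class `CausalityGraph`, its methods, and the example trajectory are all hypothetical, and the sketch assumes causal links between steps are already known. It only shows why following causal edges can surface evidence that keyword or similarity matching alone would miss.

```python
from collections import defaultdict, deque


class CausalityGraph:
    """Toy memory (hypothetical): nodes are trajectory steps, edges link cause to effect."""

    def __init__(self):
        self.steps = {}                  # step id -> text of the step
        self.causes = defaultdict(set)   # effect id -> ids of its direct causes

    def add_step(self, step_id, text, caused_by=()):
        self.steps[step_id] = text
        for cause in caused_by:
            self.causes[step_id].add(cause)

    def keyword_search(self, query):
        """Return ids of steps whose text contains any query term."""
        terms = query.lower().split()
        return [sid for sid, text in self.steps.items()
                if any(term in text.lower() for term in terms)]

    def causal_ancestors(self, step_id):
        """Breadth-first walk backwards along cause edges."""
        seen, queue = set(), deque([step_id])
        while queue:
            current = queue.popleft()
            for cause in self.causes[current]:
                if cause not in seen:
                    seen.add(cause)
                    queue.append(cause)
        return seen

    def retrieve(self, query):
        """Hybrid retrieval: keyword hits plus everything that caused them."""
        hits = self.keyword_search(query)
        evidence = set(hits)
        for hit in hits:
            evidence |= self.causal_ancestors(hit)
        return sorted(evidence)


# Invented trajectory, for illustration only.
g = CausalityGraph()
g.add_step(1, "agent picks up the red key")
g.add_step(2, "agent unlocks the red door", caused_by=[1])
g.add_step(3, "agent enters the kitchen", caused_by=[2])
g.add_step(4, "agent opens the fridge")

print(g.retrieve("kitchen"))  # [1, 2, 3]
```

The query mentions only the kitchen, yet the result includes the key and door steps because they are causal ancestors of the matching step. A purely keyword-based lookup would return step 3 alone.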
Concepts Covered
- Agent Memory Systems — comprehensive evaluation framework for memory in autonomous agents
- Long-Horizon Reasoning — evaluation of memory capabilities across extended interaction sequences
- Causality Graphs — structured representation preserving causal dependencies in agent trajectories
- Tool-Augmented Retrieval — hybrid search mechanism combining graph traversal and keyword search
- Memory Construction — process of transforming interaction histories into structured representations
- Memory Retrieval — targeted access to relevant evidence from stored agent experiences
- State Tracking — monitoring changes in environment and object states across agent interactions
- POMDP Formulation — partially observable Markov decision process framework for agent-environment interaction
- Needle-in-a-Haystack Protocol — evaluation methodology for long-context memory capabilities
- Machine-Generated Representations — handling of structured data formats beyond natural language
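The state-tracking concept listed above can be made concrete with a small sketch. It is an illustration under stated assumptions, not the benchmark's or AMA-Agent's code: `StateTracker` and its methods are hypothetical names, and observations are assumed to arrive in step order. It shows why a state-updating question needs the most recent value for an object rather than any matching mention of it.

```python
from dataclasses import dataclass, field


@dataclass
class StateTracker:
    """Hypothetical tracker of object states across an agent trajectory."""
    current: dict = field(default_factory=dict)   # (object, attribute) -> latest value
    history: list = field(default_factory=list)   # (step, object, attribute, value)

    def observe(self, step, obj, attribute, value):
        """Record an observation; assumes calls arrive in increasing step order."""
        self.history.append((step, obj, attribute, value))
        self.current[(obj, attribute)] = value

    def state_at(self, step, obj, attribute):
        """Return the last value observed at or before `step`, or None."""
        value = None
        for s, o, a, v in self.history:
            if s <= step and o == obj and a == attribute:
                value = v
        return value


# Invented observations, for illustration only.
tracker = StateTracker()
tracker.observe(1, "door", "status", "locked")
tracker.observe(5, "door", "status", "unlocked")

print(tracker.current[("door", "status")])    # unlocked
print(tracker.state_at(3, "door", "status"))  # locked
```

Keeping the full history alongside the latest value lets the same structure answer both "what is the state now" and "what was the state at step k".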