source: "raw/articles/ama-bench-evaluating-long-horizon-memory-for-agentic-applications.md"

Summary: AMA-Bench: Evaluating Long-Horizon Memory for Agentic Applications

TL;DR: Researchers introduce AMA-Bench, a benchmark for evaluating memory systems in long-horizon AI agent applications, and find that current memory methods underperform long-context baselines. They propose AMA-Agent, which combines a causality graph with tool-augmented retrieval and achieves 57.22% average accuracy.

Key Points

  • Existing memory benchmarks focus on dialogue-centric interactions but miss machine-generated, causally grounded agent trajectories
  • Agent memory requires three core mechanisms: memory retrieval, memory evolution, and memory condensation
  • Four memory capability categories: Recall, Causal Inference, State Updating, and State Abstraction
  • AMA-Bench comprises a real-world subset (2,496 QA pairs from 6 domains) and a synthetic subset (controllable scaling)
  • Current memory systems fall short of long-context baselines on agent tasks due to lossy compression and similarity-based retrieval
  • Memory architecture choice matters more than model scale (improvement range of 0.45 vs 0.038)
  • AMA-Agent uses Causality Graph for memory construction and Tool-Augmented Search for retrieval
  • AMA-Agent outperforms strongest baselines by 11.16% on average, achieving 57.22% accuracy
  • Real-world domains include: Web navigation, Software Engineering, Text2SQL, Gaming, Embodied AI, Open-world QA
  • Synthetic environments: BabyAI and TextWorld with programmatic QA generation
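
To make the Causality Graph and Tool-Augmented Search points more concrete, here is a minimal Python sketch of the general idea: trajectory steps are stored as graph nodes with cause-to-effect links, and retrieval is exposed as explicit search tools rather than embedding similarity alone. This is an illustration under assumptions, not AMA-Agent's implementation. The names `Step`, `CausalMemory`, `keyword_search`, and `causal_chain`, the step schema, and the toy trajectory are all hypothetical.

```python
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass
class Step:
    """One trajectory step (hypothetical schema, not taken from the paper)."""
    step_id: int
    action: str
    observation: str


class CausalMemory:
    """Toy memory: steps are nodes, edges record which steps caused which."""

    def __init__(self):
        self.steps = {}
        self.causes = defaultdict(list)  # effect step id -> cause step ids

    def add(self, step, caused_by=()):
        """Store a step and link it to the earlier steps that enabled it."""
        self.steps[step.step_id] = step
        self.causes[step.step_id].extend(caused_by)

    def keyword_search(self, keyword):
        """Search tool 1: ids of steps whose action or observation mentions keyword."""
        kw = keyword.lower()
        return [s.step_id for s in self.steps.values()
                if kw in s.action.lower() or kw in s.observation.lower()]

    def causal_chain(self, step_id):
        """Search tool 2: every ancestor of a step, in chronological order."""
        seen, queue = set(), deque([step_id])
        while queue:
            current = queue.popleft()
            for cause_id in self.causes.get(current, []):
                if cause_id not in seen:
                    seen.add(cause_id)
                    queue.append(cause_id)
        return [self.steps[i] for i in sorted(seen)]


# Toy trajectory (invented for illustration only).
memory = CausalMemory()
memory.add(Step(1, "pick up key", "You are holding the red key."))
memory.add(Step(2, "go north", "You see a locked red door."))
memory.add(Step(3, "unlock door", "The red door is now open."), caused_by=[1, 2])
memory.add(Step(4, "enter room", "You are in the treasure room."), caused_by=[3])

# A causal-inference style question: "What made entering the treasure room possible?"
hits = memory.keyword_search("treasure")
chain_actions = [s.action for s in memory.causal_chain(hits[0])]
```

In this toy case a pure similarity lookup on "treasure room" would surface only step 4, while following the causal links also recovers the key pickup and the unlocked door, which is the kind of question the Causal Inference category targets.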

Concepts Covered

Related Concepts