source: "raw/articles/ama-bench-evaluating-long-horizon-memory-for-agentic-applications.md"

Summary: AMA-Bench: Evaluating Long-Horizon Memory for Agentic Applications

TL;DR: Researchers introduce AMA-Bench, a benchmark for evaluating memory systems in long-horizon AI agent applications. They find that current memory methods underperform long-context baselines, and they propose AMA-Agent, which combines a causality graph with tool-augmented retrieval and reaches 57.22% average accuracy.

Key Points

Existing memory benchmarks focus on dialogue-centric interactions but miss machine-generated, causally grounded agent trajectories
Agent memory requires three core mechanisms: memory retrieval, memory evolution, and memory condensation
Four memory capability categories: Recall, Causal Inference, State Updating, and State Abstraction
AMA-Bench comprises a real-world subset (2,496 QA pairs from 6 domains) and a synthetic subset (with controllable scaling)
Current memory systems fall short of long-context baselines on agent tasks due to lossy compression and similarity-based retrieval
Memory architecture choice has more impact than model scale (improvement range of 0.45 across architectures vs 0.038 across model scales)
AMA-Agent uses Causality Graph for memory construction and Tool-Augmented Search for retrieval
AMA-Agent outperforms strongest baselines by 11.16% on average, achieving 57.22% accuracy
Real-world domains include: Web navigation, Software Engineering, Text2SQL, Gaming, Embodied AI, Open-world QA
Synthetic environments: BabyAI and TextWorld with programmatic QA generation
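
The idea of programmatic QA generation can be illustrated with a minimal sketch. This is not the benchmark's actual generator: the `Step` record, the toy trajectory, and the question templates are all hypothetical, chosen only to show how ground-truth answers for Recall and State Updating questions could be derived by replaying a trajectory rather than by manual annotation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One step of a toy agent trajectory (hypothetical schema)."""
    t: int           # step index
    action: str      # action the agent took
    obj: str         # object affected by the action
    new_state: str   # state of the object after the action


def state_updating_qa(trajectory):
    """Replay the trajectory and ask for each object's final state."""
    final = {}
    for step in trajectory:
        final[step.obj] = step.new_state  # later steps overwrite earlier ones
    return [
        {
            "category": "State Updating",
            "question": f"What is the state of the {obj} at the end of the episode?",
            "answer": state,
        }
        for obj, state in sorted(final.items())
    ]


def recall_qa(trajectory):
    """Ask which action was taken at each step."""
    return [
        {
            "category": "Recall",
            "question": f"Which action was taken at step {step.t}?",
            "answer": step.action,
        }
        for step in trajectory
    ]


# Toy trajectory; the door changes state twice, so only the last state counts.
trajectory = [
    Step(0, "open door", "door", "open"),
    Step(1, "pick up key", "key", "held"),
    Step(2, "close door", "door", "closed"),
]
qa_pairs = state_updating_qa(trajectory) + recall_qa(trajectory)
```

Because the answers are computed from the trajectory itself, lengthening the trajectory scales the question set without extra labeling, which is one plausible reading of the "controllable scaling" property noted above.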

Concepts Covered

Agent Memory Systems — comprehensive evaluation framework for memory in autonomous agents
Long-Horizon Reasoning — evaluation of memory capabilities across extended interaction sequences
Causality Graphs — structured representation preserving causal dependencies in agent trajectories
Tool-Augmented Retrieval — hybrid search mechanism combining graph traversal and keyword search
Memory Construction — process of transforming interaction histories into structured representations
Memory Retrieval — targeted access to relevant evidence from stored agent experiences
State Tracking — monitoring changes in environment and object states across agent interactions
POMDP Formulation — partially observable Markov decision process framework for agent-environment interaction
Needle-in-a-Haystack Protocol — evaluation methodology for long-context memory capabilities
Machine-Generated Representations — handling of structured data formats beyond natural language
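
To make the Causality Graphs and Tool-Augmented Retrieval entries more concrete, here is a minimal sketch under stated assumptions. It is not AMA-Agent's implementation: the `CausalityGraph` class, its method names, and the whitespace-token keyword match are hypothetical stand-ins. It only illustrates the general pattern of finding entry points by keyword search and then following causal edges to collect supporting evidence.

```python
from collections import defaultdict, deque


class CausalityGraph:
    """Toy event graph: nodes are events, edges point from effect to cause."""

    def __init__(self):
        self.text = {}                   # node id -> event description
        self.causes = defaultdict(set)   # effect id -> ids of direct causes

    def add_event(self, node_id, description, caused_by=()):
        self.text[node_id] = description
        self.causes[node_id].update(caused_by)

    def keyword_search(self, query):
        """Return ids of events sharing at least one whitespace token with the query."""
        terms = set(query.lower().split())
        return [n for n, d in self.text.items() if terms & set(d.lower().split())]

    def causal_ancestors(self, node_id):
        """Breadth-first traversal over cause edges, excluding the start node."""
        seen, queue = set(), deque([node_id])
        while queue:
            current = queue.popleft()
            for cause in self.causes.get(current, ()):
                if cause not in seen:
                    seen.add(cause)
                    queue.append(cause)
        return seen


def retrieve(graph, query):
    """Hybrid retrieval: keyword hits plus everything that causally led to them."""
    hits = graph.keyword_search(query)
    evidence = set(hits)
    for hit in hits:
        evidence |= graph.causal_ancestors(hit)
    return sorted(evidence)


graph = CausalityGraph()
graph.add_event(0, "agent picks up key")
graph.add_event(1, "agent unlocks door", caused_by=[0])
graph.add_event(2, "agent enters room", caused_by=[1])
graph.add_event(3, "agent reads sign")  # causally unrelated to entering the room

evidence_ids = retrieve(graph, "enters room")
```

In this toy case, a purely keyword-based lookup for "enters room" would return only event 2, while the graph traversal also surfaces the key pickup and the door unlock that made it possible, and leaves out the unrelated sign-reading event.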
