Trajectory Distillation

Summary: Training methodology that transfers capabilities from large teacher models to smaller student models by having students learn from successful demonstration trajectories. This approach lets smaller models rival or outperform larger ones while remaining more practical to deploy. It is particularly effective for sequential decision-making tasks.

Overview

Trajectory Distillation is a knowledge transfer technique in which smaller student models learn by observing and imitating successful task completion sequences (trajectories) generated by larger teacher models. Unlike traditional Model Distillation, which transfers knowledge by matching output probability distributions, trajectory distillation learns from complete behavioral sequences that demonstrate how to accomplish specific tasks.
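
One common way to write the contrast drawn above is shown here as a sketch. It is an assumption about the general form of the two objectives, not the loss stated by the source. The symbols are illustrative: p_T is the teacher's output distribution, p_θ the student's, D⁺ the set of successful teacher trajectories, and o_t and a_t the observation and action at step t.

```latex
% Classical distillation: match the teacher's output distribution on each input x.
\mathcal{L}_{\text{KD}}(\theta)
  = \mathbb{E}_{x}\left[ \mathrm{KL}\left( p_T(\cdot \mid x) \,\|\, p_\theta(\cdot \mid x) \right) \right]

% Trajectory distillation (simplest behavior-cloning form): maximize the likelihood
% of the teacher's actions along successful trajectories only.
\mathcal{L}_{\text{TD}}(\theta)
  = -\,\mathbb{E}_{\tau \in \mathcal{D}^{+}}\left[ \sum_{t=1}^{|\tau|}
      \log p_\theta\left( a_t \mid o_{\le t},\, a_{<t} \right) \right]
```

The first objective needs access to the teacher's probabilities. The second needs only the teacher's recorded actions and a success label per trajectory.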

The methodology involves collecting successful trajectories from teacher models performing various tasks, then training student models to reproduce these successful patterns. This approach is particularly effective for Computer-Use Agents and other sequential decision-making systems where the path to success is as important as the final outcome. The technique proves especially valuable when teacher models struggle with low success rates, as it allows students to learn exclusively from successful demonstrations rather than mixed-quality data.
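
The collect-then-imitate loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in the source: `Step`, `Trajectory`, `keep_successful`, and `to_training_examples` are hypothetical names, plain strings stand in for real observations (such as screenshots) and actions, and the `success` flag is assumed to come from some external verifier.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Step:
    observation: str  # stand-in for a screenshot or UI description
    action: str       # stand-in for the teacher's action, e.g. "click(Rename)"


@dataclass
class Trajectory:
    task: str
    steps: List[Step]
    success: bool  # assumed verdict from an external verifier


def keep_successful(trajectories: List[Trajectory]) -> List[Trajectory]:
    """Discard failed teacher rollouts so the student only sees successes."""
    return [t for t in trajectories if t.success]


def to_training_examples(traj: Trajectory) -> List[Tuple[str, str]]:
    """Turn one trajectory into (context, target_action) pairs.

    The context for a step is the task, every earlier observation and action,
    and the current observation. The target is the teacher's action at that step.
    """
    examples: List[Tuple[str, str]] = []
    history: List[str] = [f"TASK: {traj.task}"]
    for step in traj.steps:
        history.append(f"OBS: {step.observation}")
        examples.append(("\n".join(history), step.action))
        history.append(f"ACT: {step.action}")
    return examples


# Two hypothetical teacher rollouts for the same task: one success, one failure.
rollouts = [
    Trajectory(
        "rename file",
        [Step("file list", "right_click(file)"), Step("context menu", "click(Rename)")],
        True,
    ),
    Trajectory("rename file", [Step("file list", "click(Delete)")], False),
]

dataset = [ex for t in keep_successful(rollouts) for ex in to_training_examples(t)]
print(len(dataset))
```

Only the successful rollout contributes examples, one per step. In a real setup the pairs would be tokenized and used for supervised fine-tuning of the student.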

In practice, trajectory distillation addresses the challenge of training capable models for deployment scenarios where computational efficiency matters more than raw capability. By focusing on behavioral patterns that lead to successful task completion, student models can learn to navigate complex environments without requiring the full capacity of their teacher models. The approach has proven particularly effective in environments like CUA-World, where models trained via trajectory distillation significantly outperform baseline approaches across diverse software environments.

Key Details

Performance Gains: In the CUA-World benchmark, a 2B parameter model trained via trajectory distillation outperformed models twice its size, a significant efficiency gain over both baseline approaches and larger models that were not trained on distilled trajectories
Success Rate Amplification: Enables effective learning from successful trajectories even when teacher models have low overall success rates (e.g., GPT-5.4 achieves only a 27.5% pass rate on CUA-World-Long tasks, yet its successful trajectories still provide valuable training signal)
Data Scaling: Performance scales log-linearly with training data across both software count and task count, indicating systematic benefits from diverse trajectory collections across multiple domains
Application Domain: Particularly effective for Computer-Use Agents performing GUI interaction tasks across diverse software environments, demonstrated across 200+ applications spanning all 22 SOC occupation groups in CUA-World
Training Process: Student models learn to predict next actions in successful sequences rather than just final outputs, capturing behavioral patterns necessary for multi-step task completion across hundreds of interaction steps
Efficiency Benefits: Enables deployment of capable models with lower computational requirements than their teacher models while maintaining competitive performance on economically grounded real-world tasks
Cross-Software Generalization: Limited but measurable: students show 22-27% recovery performance on unseen software, compared to 65-87% on seen software, indicating partial transfer of learned interaction patterns
Success Metrics: Effectiveness measured through standardized benchmarks like CUA-World, with particular emphasis on long-horizon tasks requiring 500+ interaction steps where traditional approaches struggle
Quality Control: Benefits from Privileged Information Verification and systematic contamination filtering to ensure student models learn from genuinely successful demonstrations across diverse software environments
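
To make the log-linear scaling claim in the Data Scaling item concrete: if performance follows score = a + b * ln(n), then each doubling of the training data adds the same fixed amount, b * ln(2). The sketch below fits such a line with ordinary least squares. The data points are synthetic placeholders generated from a known line, not results from the benchmark, and `fit_log_linear` is a hypothetical helper name.

```python
import math
from typing import List, Tuple


def fit_log_linear(sizes: List[float], scores: List[float]) -> Tuple[float, float]:
    """Least-squares fit of score = a + b * ln(size); returns (a, b)."""
    xs = [math.log(n) for n in sizes]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(scores) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


# Synthetic placeholder data generated from score = 5 + 3 * ln(n).
# These numbers are illustrative only and do not come from the source.
sizes = [100, 200, 400, 800]
scores = [5 + 3 * math.log(n) for n in sizes]

a, b = fit_log_linear(sizes, scores)
gain_per_doubling = b * math.log(2)  # constant gain for every 2x increase in data
print(f"a={a:.3f}, b={b:.3f}, gain per doubling={gain_per_doubling:.3f}")
```

With real measurements, the same fit could be run separately against task count and software count to compare the two slopes.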

Relationships

Computer-Use Agents — primary application domain where trajectory distillation proves most effective for GUI interaction tasks across diverse software environments, enabling smaller agents to perform complex digital automation
Model Distillation — broader category of knowledge transfer techniques, with trajectory distillation as a specialized behavioral variant focusing on action sequences rather than output probability distributions
Long-Horizon Task Planning — benefits significantly from trajectory distillation as students learn complex multi-step sequences requiring hundreds of steps, crucial for realistic software interaction scenarios like CUA-World-Long tasks
Multi-Agent Environment Creation — trajectory collection can leverage automated environment creation frameworks like Gym-Anything to generate diverse training scenarios across multiple software applications
GDP-Grounded Software Selection — trajectory distillation benefits from economically-grounded task selection to ensure training on relevant, high-impact scenarios that reflect real-world software usage patterns
Test-Time Auditing — can complement trajectory distillation by providing additional feedback during inference, as demonstrated by improvements from 11.5% to 14.0% performance on long-horizon tasks
Privileged Information Verification — essential for validating successful trajectories before use in training, ensuring quality demonstrations through ground-truth data from setup scripts
Creation-Audit Loop — systematic approach to generating high-quality training trajectories through iterative environment creation and verification processes
Cross-Software Generalization — trajectory distillation enables training models that can transfer learned behaviors across different software applications, though with limited recovery compared to seen environments
Behavioral Pattern Analysis — automated analysis of successful trajectories helps identify key patterns for effective knowledge transfer in trajectory distillation training

Sources

sources/arxiv-260406126 — demonstrated trajectory distillation effectiveness in CUA-World benchmark, showing 2B model outperforming larger models through learning from successful teacher trajectories across 200+ software applications with GDP-grounded selection