Contamination Filtering

Summary: A systematic data preprocessing technique that prevents information leakage between training and test sets by analyzing similarity relationships and removing test instances that are too similar to training data. It is essential for reliable model evaluation, since it keeps performance metrics from being inflated by memorization rather than true generalization.

Overview

Contamination filtering addresses a critical problem in machine learning evaluation: data leakage between train and test splits, which inflates performance metrics and masks poor generalization. Leakage occurs when test examples are highly similar to training examples, allowing models to essentially "cheat" by memorizing rather than learning generalizable patterns.

The technique typically involves building similarity graphs in which nodes represent data points and edges connect instances that are similar under the chosen similarity metrics. Any test example whose similarity to at least one training example exceeds a threshold is removed, ensuring clean separation between the datasets.
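The threshold rule described above can be sketched in a few lines. This is a minimal illustration rather than any benchmark's actual implementation: it assumes each example has already been mapped to an embedding vector, uses cosine similarity as the metric, and the names cosine and filter_test_set as well as the 0.8 default are hypothetical choices made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def filter_test_set(train_vecs, test_vecs, threshold=0.8):
    """Return indices of test items whose similarity to every training item is <= threshold."""
    kept = []
    for i, test_vec in enumerate(test_vecs):
        # Highest similarity between this test item and any training item.
        max_sim = max((cosine(test_vec, tr) for tr in train_vecs), default=0.0)
        if max_sim <= threshold:
            kept.append(i)
    return kept


# Toy embeddings (assumed, for illustration only).
train = [[1.0, 0.0], [0.0, 1.0]]
test = [[0.99, 0.05], [0.7, 0.7], [-1.0, 0.0]]

kept = filter_test_set(train, test, threshold=0.8)
print(kept)  # the first test item is a near-duplicate of train[0] and is dropped
```

The loop compares every test item against every training item, so its cost grows with the product of the two set sizes; for large datasets an approximate nearest-neighbor index would likely replace the inner loop.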

In large-scale benchmarks like CUA-World, contamination filtering becomes particularly important when dealing with thousands of tasks across hundreds of software applications. The framework must ensure that training trajectories don't leak information into test scenarios, maintaining the integrity of performance evaluations across different software environments.

Key Details

Core Methodology:

Constructs undirected (symmetric) similarity graphs over all pairs of data points
Applies similarity thresholds (typically 0.7-0.9 for semantic similarity)
Uses multiple similarity metrics simultaneously for robust filtering
Typically removes 5-20% of the original test data, depending on threshold strictness
Employs connected component analysis to identify clusters of similar examples
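The connected component step can be illustrated with a small union-find sketch. It assumes the similarity graph has already been reduced to a list of edges between integer node ids (one edge per pair above the threshold), and it adopts one possible policy: flag every test node that shares a component with a training node. The function names and the policy itself are assumptions for illustration, not a description of a specific framework.

```python
def connected_components(n, edges):
    """Group nodes 0..n-1 into connected components using union-find."""
    parent = list(range(n))

    def find(x):
        # Follow parent links to the root, compressing the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for node in range(n):
        groups.setdefault(find(node), []).append(node)
    return list(groups.values())


def contaminated_test_nodes(n, edges, train_nodes):
    """Return sorted test nodes that share a component with any training node."""
    train_nodes = set(train_nodes)
    flagged = set()
    for component in connected_components(n, edges):
        if any(node in train_nodes for node in component):
            flagged.update(node for node in component if node not in train_nodes)
    return sorted(flagged)


# Nodes 0-2 are training examples, nodes 3-6 are test examples (toy data).
edges = [(0, 3), (3, 4), (5, 6)]
flagged = contaminated_test_nodes(7, edges, train_nodes=[0, 1, 2])
print(flagged)  # node 4 is flagged transitively through node 3
```

Component-based flagging is stricter than a direct threshold check: node 4 has no edge to a training node, but it is removed because it is similar to node 3, which is. Nodes 5 and 6 are similar only to each other and are retained.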

Implementation Considerations:

Computationally expensive for large datasets (O(n²) pairwise similarity calculations for n data points)
Requires domain expertise to select appropriate similarity functions
May create class imbalance if contamination affects some classes more than others
Threshold selection trades off contamination prevention against data retention: a stricter (lower) threshold removes more contaminated examples but also discards more usable test data
Can use semantic embeddings, n-gram overlap, edit distance, or domain-specific measures
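Two of the surface-level measures listed above, n-gram overlap and edit distance, are simple enough to sketch directly. The version below is an assumption-laden illustration: n-gram overlap is computed as Jaccard similarity over whitespace-tokenized word trigrams, and edit distance is Levenshtein distance normalized by the longer string's length. The function names and normalization choices are hypothetical.

```python
def ngram_jaccard(a, b, n=3):
    """Jaccard similarity between the sets of word n-grams of two texts."""
    def grams(text):
        tokens = text.lower().split()
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    grams_a, grams_b = grams(a), grams(b)
    if not grams_a or not grams_b:
        return 0.0  # at least one text is shorter than n tokens
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def edit_similarity(a, b):
    """Map edit distance to a similarity in [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


task_a = "open the file menu and click save"
task_b = "open the file menu and click export"
print(ngram_jaccard(task_a, task_b), edit_similarity(task_a, task_b))
```

When several metrics are used together, one conservative option is to treat a pair as similar if any metric exceeds its own threshold, since each metric catches a different kind of overlap.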

Validation Methods:

Manual inspection of filtered examples
Performance gap analysis before/after filtering
Cross-validation with different threshold values
Temporal validation for time-series data
Automated analysis of behavioral patterns in filtered vs. retained data
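Performance gap analysis and threshold sweeps can be combined into one small report. The sketch below assumes that, for each test example, the maximum similarity to any training example and a 0/1 correctness flag for some model are already available; sweep_thresholds and the field names are hypothetical, and the numbers are toy data rather than measured results.

```python
def sweep_thresholds(max_sims, correct, thresholds):
    """For each threshold, report retention and accuracy on the retained test examples."""
    if not max_sims or len(max_sims) != len(correct):
        raise ValueError("max_sims and correct must be non-empty and the same length")

    total = len(max_sims)
    acc_all = sum(correct) / total  # accuracy before any filtering
    rows = []
    for threshold in thresholds:
        # Keep only examples whose closest training neighbor is at or below the threshold.
        kept = [c for s, c in zip(max_sims, correct) if s <= threshold]
        acc_kept = sum(kept) / len(kept) if kept else None
        rows.append({
            "threshold": threshold,
            "retention": len(kept) / total,
            "acc_kept": acc_kept,
            # Positive gap: unfiltered accuracy was higher, hinting at contamination.
            "gap": None if acc_kept is None else acc_all - acc_kept,
        })
    return rows


# Toy data: the two most train-like test examples are both answered correctly.
max_sims = [0.95, 0.85, 0.60, 0.40, 0.30]
correct = [1, 1, 1, 0, 0]

rows = sweep_thresholds(max_sims, correct, thresholds=[0.9, 0.7])
for row in rows:
    print(row)
```

A gap that grows as the threshold tightens suggests the model benefits from near-duplicates of training data, while a flat gap suggests the removed examples were not inflating the score.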

Domain-Specific Applications:

In Computer-Use Agents, filters similar GUI interaction sequences
For Long-Horizon Task Planning, prevents leakage of multi-step solution patterns
In Benchmark Design, ensures statistical independence between evaluation splits
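For GUI interaction sequences, one simple way to score similarity is to compare the action lists directly. The sketch below uses difflib.SequenceMatcher from the Python standard library, whose ratio is based on matching blocks rather than true edit distance. The action encoding ("click:File", "type:report.txt") and the function name trajectory_similarity are assumptions made for illustration, not the format used by any particular benchmark.

```python
from difflib import SequenceMatcher


def trajectory_similarity(actions_a, actions_b):
    """Similarity in [0, 1] between two action sequences, based on matching blocks."""
    # autojunk is disabled so that frequently repeated actions are never ignored.
    return SequenceMatcher(None, actions_a, actions_b, autojunk=False).ratio()


# Hypothetical trajectories: the same "save as" workflow with a different file name.
train_trajectory = ["click:File", "click:Save As", "type:report.txt", "press:Enter"]
test_trajectory = ["click:File", "click:Save As", "type:notes.txt", "press:Enter"]

score = trajectory_similarity(train_trajectory, test_trajectory)
print(score)  # 3 of 4 actions match in order
```

Because only the typed file name differs, the two trajectories score highly and would likely be treated as contaminated under a threshold in the 0.7-0.9 range; abstracting away arguments such as file names before comparison would make the match exact.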

Relationships

Data Leakage Prevention — contamination filtering is a primary technique for preventing various forms of data leakage
Similarity Metrics — relies heavily on choosing appropriate similarity functions for the data domain
Train-Test Split Validation — ensures splits maintain statistical independence required for valid evaluation
Benchmark Integrity — critical for maintaining reliable benchmark datasets in machine learning research
Cross-Validation — contamination can occur across CV folds, requiring similar filtering approaches
Evaluation Methodology — part of rigorous experimental design for model assessment
Data Preprocessing — standard step in ML pipelines for high-stakes applications
CUA-World — implements contamination filtering to prevent task leakage across software environments
Trajectory Distillation — requires contamination filtering to prevent teacher model trajectories from leaking into student evaluation
Multi-Agent Environment Creation — contamination filtering ensures generated environments don't duplicate existing test scenarios

Sources

sources/arxiv-260406126 — describes implementation in CUA-World benchmark creation to ensure clean train/test splits across software tasks and prevent data leakage in large-scale computer-use agent evaluation