Contamination Filtering

Summary: A systematic data preprocessing technique that prevents information leakage between training and test sets by measuring similarity between examples and removing test instances that are too similar to training data. It is essential for reliable model evaluation because it keeps performance metrics from being inflated by memorization rather than true generalization.

Overview

Contamination filtering addresses a critical problem in machine learning evaluation: data leakage between train and test splits, which inflates performance metrics and masks poor generalization. The leakage occurs when test examples are highly similar to training examples, allowing models to "cheat" by memorizing rather than learning generalizable patterns.

The technique typically involves building a similarity graph in which nodes represent data points and edges connect instances that are similar under the chosen similarity metric. Any test example whose similarity to at least one training example exceeds a threshold is removed, so that the two datasets stay cleanly separated.
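
The thresholding rule described above can be sketched as follows. This is a minimal illustration, not the method of any particular benchmark: it assumes text examples, uses character-trigram Jaccard overlap as a stand-in similarity metric, and the function names and the 0.8 threshold are hypothetical choices for the example.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams of a whitespace-normalized, lowercased string."""
    s = " ".join(text.lower().split())
    return {s[i:i + n] for i in range(max(len(s) - n + 1, 1))}


def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def filter_contaminated(train, test, threshold=0.8, n=3):
    """Split test examples into (kept, removed).

    A test example is removed when its similarity to ANY training
    example reaches the threshold (an assumed, illustrative rule).
    """
    train_grams = [char_ngrams(t, n) for t in train]
    kept, removed = [], []
    for example in test:
        grams = char_ngrams(example, n)
        max_sim = max((jaccard(grams, tg) for tg in train_grams), default=0.0)
        (removed if max_sim >= threshold else kept).append(example)
    return kept, removed


train = ["Open the settings menu and enable dark mode"]
test = [
    "Open the settings menu and enable dark mode",   # exact duplicate of a training task
    "Export the spreadsheet as a PDF file",          # unrelated task
]
kept, removed = filter_contaminated(train, test)
```

In practice the metric would be swapped for whatever suits the data (for example, cosine similarity of embeddings), and the comparison loop would usually be replaced by an approximate nearest-neighbor search on large datasets.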

In large-scale benchmarks like CUA-World, contamination filtering becomes particularly important when dealing with thousands of tasks across hundreds of software applications. The framework must ensure that training trajectories don't leak information into test scenarios, maintaining the integrity of performance evaluations across different software environments.

Key Details

Core Methodology:

  • Constructs an undirected similarity graph over the data, with an edge between each pair of examples whose similarity exceeds the threshold
  • Applies similarity thresholds (typically 0.7-0.9 for semantic similarity)
  • Uses multiple similarity metrics simultaneously for robust filtering
  • Removes 5-20% of original test data depending on threshold strictness
  • Employs connected component analysis to identify clusters of similar examples
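
The connected component step in the list above can be sketched with a small union-find structure. The sketch assumes the similarity edges have already been computed and that nodes are integer indices. The rule of dropping every test node that shares a component with a training node is one possible interpretation, stricter than direct thresholding, and the function names are hypothetical.

```python
def connected_components(num_nodes, edges):
    """Group nodes 0..num_nodes-1 into connected components using union-find."""
    parent = list(range(num_nodes))

    def find(x):
        # Path halving keeps the trees shallow.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for node in range(num_nodes):
        groups.setdefault(find(node), set()).add(node)
    return list(groups.values())


def contaminated_test_nodes(num_nodes, edges, train_nodes):
    """Return test nodes that share a component with at least one training node."""
    train_nodes = set(train_nodes)
    drop = set()
    for component in connected_components(num_nodes, edges):
        if component & train_nodes:
            drop |= component - train_nodes
    return drop


# Nodes 0 and 1 are training examples; nodes 2 to 5 are test examples.
edges = [(0, 2), (2, 3), (4, 5)]
drop = contaminated_test_nodes(6, edges, train_nodes=[0, 1])
```

Here node 3 is dropped even though it is only similar to node 2, not directly to a training example: the component analysis propagates contamination through chains of near-duplicates. Nodes 4 and 5 are similar to each other but not linked to any training node, so they are retained.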

Implementation Considerations:

  • Computationally expensive for large datasets (O(n²) similarity calculations)
  • Requires domain expertise to select appropriate similarity functions
  • May create class imbalance if contamination affects some classes more than others
  • Threshold selection trades off contamination prevention against data retention: a lower threshold catches more near-duplicates but also discards more legitimate test examples
  • Can use semantic embeddings, n-gram overlap, edit distance, or domain-specific measures
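
Two of the lexical measures named above, edit distance and n-gram overlap, can be sketched in plain Python. The normalizations chosen here (dividing by the longer string length, and by the smaller n-gram set) are assumptions for illustration, and the function names are hypothetical.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # deletion
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if characters match)
            ))
        prev = cur
    return prev[-1]


def edit_similarity(a, b):
    """Map edit distance to [0, 1], where 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def word_ngram_overlap(a, b, n=2):
    """Fraction of the smaller word n-gram set that also appears in the other text."""
    def grams(text):
        words = text.lower().split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

    ga, gb = grams(a), grams(b)
    if not ga or not gb:
        return 0.0
    return len(ga & gb) / min(len(ga), len(gb))


distance = levenshtein("kitten", "sitting")
overlap = word_ngram_overlap("click the save button", "click the save icon")
```

Because each train/test pair is compared independently, a naive pass over a training set of size n_train and a test set of size n_test performs n_train × n_test comparisons, which is where the quadratic cost noted above comes from.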

Validation Methods:

  • Manual inspection of filtered examples
  • Performance gap analysis before/after filtering
  • Cross-validation with different threshold values
  • Temporal validation for time-series data
  • Automated analysis of behavioral patterns in filtered vs. retained data
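
Two of the checks above, comparing threshold values and measuring the performance gap before and after filtering, can be sketched as follows. The inputs are assumed to be precomputed: `max_sims` holds each test example's highest similarity to any training example, and `correct` holds per-example model correctness. All names and numbers are hypothetical illustration data, not results from any benchmark.

```python
def retention_by_threshold(max_sims, thresholds):
    """For each threshold, return the fraction of test examples that would be kept.

    An example is kept when its maximum similarity to the training set
    is strictly below the threshold.
    """
    total = len(max_sims)
    report = {}
    for t in thresholds:
        kept = sum(1 for s in max_sims if s < t)
        report[t] = kept / total if total else 0.0
    return report


def accuracy_gap(correct, max_sims, threshold):
    """Accuracy on the full test set minus accuracy on the filtered test set.

    A large positive gap suggests the removed examples were
    disproportionately easy, which is consistent with memorization.
    """
    full_acc = sum(correct) / len(correct)
    kept = [c for c, s in zip(correct, max_sims) if s < threshold]
    if not kept:
        raise ValueError("threshold removes every test example")
    return full_acc - sum(kept) / len(kept)


# Made-up scores for five test examples.
max_sims = [0.95, 0.85, 0.75, 0.40, 0.10]
correct = [1, 1, 1, 0, 1]

report = retention_by_threshold(max_sims, [0.7, 0.8, 0.9])
gap = accuracy_gap(correct, max_sims, threshold=0.8)
```

Sweeping several thresholds before committing to one makes the cost in retained data visible, and the gap gives a rough signal of how much the unfiltered score depended on near-duplicates.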

Relationships

  • Data Leakage Prevention — contamination filtering is a primary technique for preventing various forms of data leakage
  • Similarity Metrics — relies heavily on choosing appropriate similarity functions for the data domain
  • Train-Test Split Validation — ensures splits maintain statistical independence required for valid evaluation
  • Benchmark Integrity — critical for maintaining reliable benchmark datasets in machine learning research
  • Cross-Validation — contamination can occur across CV folds, requiring similar filtering approaches
  • Evaluation Methodology — part of rigorous experimental design for model assessment
  • Data Preprocessing — standard step in ML pipelines for high-stakes applications
  • CUA-World — implements contamination filtering to prevent task leakage across software environments
  • Trajectory Distillation — requires contamination filtering to prevent teacher model trajectories from leaking into student evaluation
  • Multi-Agent Environment Creation — contamination filtering ensures generated environments don't duplicate existing test scenarios

Sources

  • sources/arxiv-260406126 — describes implementation in CUA-World benchmark creation to ensure clean train/test splits across software tasks and prevent data leakage in large-scale computer-use agent evaluation