Gradient Descent

Summary: A fundamental optimization algorithm that iteratively finds the minimum of a function by moving in the direction of steepest descent (negative gradient). Forms the backbone of training neural networks and machine learning models by minimizing loss functions through parameter updates.

Overview

Gradient descent is an iterative optimization algorithm used to find the minimum of a differentiable function. The algorithm starts with an initial guess for the parameters and repeatedly updates them by taking steps proportional to the negative of the gradient (slope) of the function at the current point. This process continues until convergence to a local minimum.

The basic update rule is: θ = θ - α∇f(θ), where θ represents parameters, α is the learning rate, and ∇f(θ) is the gradient of the objective function. The algorithm's effectiveness depends on choosing appropriate learning rates and handling challenges like local minima, saddle points, and varying curvature in high-dimensional spaces.

Three main variants exist based on how much data is used per update: batch gradient descent (entire dataset), stochastic gradient descent (single sample), and mini-batch gradient descent (small batches). Modern deep learning predominantly uses mini-batch approaches with sophisticated optimizers like Adam, RMSprop, and AdaGrad that adapt learning rates dynamically.

Key Details

Learning Rate: Critical hyperparameter controlling step size; too large causes oscillation/divergence, too small causes slow convergence
Convergence Guarantees: Guaranteed to find global minimum for convex functions; may get trapped in local minima for non-convex functions
Computational Complexity: O(n·d) per iteration where n is samples and d is dimensions
Memory Requirements: Minimal for basic version; optimizers like Adam require additional memory for momentum terms
Stopping Criteria: Typically based on gradient magnitude threshold, function value change, or maximum iterations
Batch Size Effects: Larger batches provide more accurate gradients but require more computation; smaller batches add noise that can help escape local minima
Parallelization: Mini-batch versions naturally parallelize across samples and can distribute across multiple GPUs

Relationships

Backpropagation — computes gradients needed for gradient descent in neural networks
Learning Rate Scheduling — techniques to adapt learning rate during training for better convergence
Momentum Optimization — extensions that use previous gradients to accelerate convergence and reduce oscillation
Adam Optimizer — adaptive learning rate method that combines momentum with per-parameter scaling
Loss Functions — objective functions that gradient descent minimizes during model training
Neural Network Training — primary application domain where gradient descent optimizes network weights
Convex Optimization — mathematical framework where gradient descent has strong convergence guarantees
Stochastic Processes — SGD variants introduce randomness that affects convergence properties
Test-Time Training — uses gradient descent for dynamic parameter updates during inference
Parameter Efficient Fine-tuning — applies gradient descent to small parameter subsets for adaptation

Sources

sources/in-place-test-time-training — demonstrates gradient descent application for dynamic inference-time parameter updates