Gradient Descent
Summary: A fundamental optimization algorithm that iteratively finds the minimum of a function by moving in the direction of steepest descent (negative gradient). Forms the backbone of training neural networks and machine learning models by minimizing loss functions through parameter updates.
Overview
Gradient descent is an iterative optimization algorithm used to find the minimum of a differentiable function. The algorithm starts with an initial guess for the parameters and repeatedly updates them by taking steps proportional to the negative of the gradient (slope) of the function at the current point. This process continues until convergence to a local minimum.
The basic update rule is: θ = θ - α∇f(θ), where θ represents parameters, α is the learning rate, and ∇f(θ) is the gradient of the objective function. The algorithm's effectiveness depends on choosing appropriate learning rates and handling challenges like local minima, saddle points, and varying curvature in high-dimensional spaces.
Three main variants exist based on how much data is used per update: batch gradient descent (entire dataset), stochastic gradient descent (single sample), and mini-batch gradient descent (small batches). Modern deep learning predominantly uses mini-batch approaches with sophisticated optimizers like Adam, RMSprop, and AdaGrad that adapt learning rates dynamically.
Key Details
- Learning Rate: Critical hyperparameter controlling step size; too large causes oscillation/divergence, too small causes slow convergence
- Convergence Guarantees: Guaranteed to find global minimum for convex functions; may get trapped in local minima for non-convex functions
- Computational Complexity: O(n·d) per iteration where n is samples and d is dimensions
- Memory Requirements: Minimal for basic version; optimizers like Adam require additional memory for momentum terms
- Stopping Criteria: Typically based on gradient magnitude threshold, function value change, or maximum iterations
- Batch Size Effects: Larger batches provide more accurate gradients but require more computation; smaller batches add noise that can help escape local minima
- Parallelization: Mini-batch versions naturally parallelize across samples and can distribute across multiple GPUs
Relationships
- Backpropagation — computes gradients needed for gradient descent in neural networks
- Learning Rate Scheduling — techniques to adapt learning rate during training for better convergence
- Momentum Optimization — extensions that use previous gradients to accelerate convergence and reduce oscillation
- Adam Optimizer — adaptive learning rate method that combines momentum with per-parameter scaling
- Loss Functions — objective functions that gradient descent minimizes during model training
- Neural Network Training — primary application domain where gradient descent optimizes network weights
- Convex Optimization — mathematical framework where gradient descent has strong convergence guarantees
- Stochastic Processes — SGD variants introduce randomness that affects convergence properties
- Test-Time Training — uses gradient descent for dynamic parameter updates during inference
- Parameter Efficient Fine-tuning — applies gradient descent to small parameter subsets for adaptation
Sources
- sources/in-place-test-time-training — demonstrates gradient descent application for dynamic inference-time parameter updates