Gradient Descent

Summary: A fundamental optimization algorithm that iteratively finds the minimum of a function by moving in the direction of steepest descent (negative gradient). Forms the backbone of training neural networks and machine learning models by minimizing loss functions through parameter updates.

Overview

Gradient descent is an iterative optimization algorithm used to find the minimum of a differentiable function. The algorithm starts with an initial guess for the parameters and repeatedly updates them by taking steps proportional to the negative of the gradient (slope) of the function at the current point. This process continues until convergence to a local minimum.

The basic update rule is: θ = θ - α∇f(θ), where θ represents parameters, α is the learning rate, and ∇f(θ) is the gradient of the objective function. The algorithm's effectiveness depends on choosing appropriate learning rates and handling challenges like local minima, saddle points, and varying curvature in high-dimensional spaces.

Three main variants exist based on how much data is used per update: batch gradient descent (entire dataset), stochastic gradient descent (single sample), and mini-batch gradient descent (small batches). Modern deep learning predominantly uses mini-batch approaches with sophisticated optimizers like Adam, RMSprop, and AdaGrad that adapt learning rates dynamically.

Key Details

  • Learning Rate: Critical hyperparameter controlling step size; too large causes oscillation/divergence, too small causes slow convergence
  • Convergence Guarantees: Guaranteed to find global minimum for convex functions; may get trapped in local minima for non-convex functions
  • Computational Complexity: O(n·d) per iteration where n is samples and d is dimensions
  • Memory Requirements: Minimal for basic version; optimizers like Adam require additional memory for momentum terms
  • Stopping Criteria: Typically based on gradient magnitude threshold, function value change, or maximum iterations
  • Batch Size Effects: Larger batches provide more accurate gradients but require more computation; smaller batches add noise that can help escape local minima
  • Parallelization: Mini-batch versions naturally parallelize across samples and can distribute across multiple GPUs

Relationships

Sources