Gradient Descent 1D — Simulation
Minimising \(f(x)=(x-3)^2\) using gradient descent
f(x) = (x − 3)^2 with update x_{t+1} = x_t − α · f'(x_t), where f'(x) = 2(x − 3).
What you’re seeing
A point sliding down \(f(x)=(x-3)^2\). The dashed line is the tangent with slope \(f'(x)\). Each step updates
\[ x_{t+1}=x_t-\alpha\,f'(x_t), \]
so the point moves against the slope (downhill).
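The update above can be sketched in a few lines of Python (the starting point and step size here are illustrative choices, not values from the simulation):

```python
# Gradient descent on f(x) = (x - 3)^2, whose derivative is f'(x) = 2(x - 3).

def f_prime(x):
    return 2 * (x - 3)

x = 0.0        # illustrative starting point
alpha = 0.1    # illustrative learning rate
for _ in range(100):
    x = x - alpha * f_prime(x)   # step against the slope (downhill)

print(round(x, 4))  # approaches the minimizer x = 3
```

Each step multiplies the error \(x_t-3\) by \(1-2\alpha\), so for \(0<\alpha<1\) the point contracts toward \(x=3\).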
Why this works (intuition)
- The derivative \(f'(x)\) is a local “tilt.”
- Subtracting \(\alpha f'(x)\) steps you in the direction that lowers \(f\).
- If \(\alpha\) is small, steps are careful; if it’s big, steps are bold (and can overshoot).
Where it’s used
- Linear & logistic regression training (closed-form exists for linear, but GD is still used at scale).
- Neural networks (SGD/Adam are gradient-based).
- Transformers, CNNs, RNNs—trained by (variants of) gradient descent on large losses.
- Matrix factorization, recommender models, word embeddings.
When it misbehaves
- Learning rate too large: diverges or jitters.
- Plateaus/saddles: slow progress when gradients are tiny.
- Non-convex losses: can settle in different local minima.
- Poor scaling: features on very different scales warp the landscape, so a single step size is too large in some directions and too small in others.
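The first failure mode is easy to see numerically. For \(f(x)=(x-3)^2\), the step multiplies the error by \(1-2\alpha\), so any \(\alpha>1\) makes the error grow each step. A small sketch (step counts and values are illustrative):

```python
# Compare a safe and an unsafe learning rate on f(x) = (x - 3)^2.

def f_prime(x):
    return 2 * (x - 3)

def run(alpha, x=0.0, steps=20):
    for _ in range(steps):
        x -= alpha * f_prime(x)
    return x

small = run(0.1)   # |1 - 2*0.1| = 0.8 < 1: error shrinks each step
large = run(1.1)   # |1 - 2*1.1| = 1.2 > 1: error grows each step

print(round(small, 3), round(large, 1))
```

With α = 0.1 the iterate lands near 3; with α = 1.1 it overshoots farther on every step and diverges.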
One-liner to remember
Gradient descent = “walk downhill using the slope, with a sensible stride.”
Maximizing \(g(x)=4-(x-1)^2\) using gradient descent
g(x) = 4 − (x − 1)^2. Gradient ascent update: x_{t+1} = x_t + α · g'(x_t), with g'(x) = −2(x − 1).
What you’re seeing
A point climbing the concave parabola \(g(x)=4-(x-1)^2\). Update:
\[ x_{t+1}=x_t+\alpha\,g'(x_t), \]
i.e., step with the slope (uphill) to reach the peak near \(x=1\).
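The ascent version is the same loop with the sign flipped (starting point and step size are again illustrative):

```python
# Gradient ascent on g(x) = 4 - (x - 1)^2, whose derivative is g'(x) = -2(x - 1).

def g_prime(x):
    return -2 * (x - 1)

x = 4.0        # illustrative starting point
alpha = 0.1    # illustrative learning rate
for _ in range(100):
    x = x + alpha * g_prime(x)   # step with the slope (uphill)

print(round(x, 4))  # approaches the maximizer x = 1
```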
Where ascent shows up
- Maximizing log-likelihood (same as minimizing negative log-likelihood).
- Policy gradient methods in RL (maximize expected return).
- Feature selection / sparse coding framed as maximization problems.
Same cautions
Learning rate, plateaus, and scaling issues apply just like descent—only the sign flips.
One-liner to remember
Gradient ascent = “walk uphill using the slope, with a sensible stride.”
Some FAQ
Q: Why does the tangent line matter?
It shows the local slope \(f'(x)\). The sign and magnitude of this slope drive the direction and size of the update.
Q: How do I pick \(\alpha\)?
Start small (e.g., 0.01–0.1) and watch the objective value: if it explodes, shrink \(\alpha\); if it barely moves, increase it a bit. In deep learning, use adaptive optimizers (e.g., Adam) and learning-rate schedules.
Q: What’s the difference between GD, SGD, mini-batch?
GD uses the full dataset gradient; SGD uses one sample; mini-batch averages a small subset. Mini-batch is the sweet spot for modern training.
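The three variants differ only in how many samples feed each gradient estimate. A toy sketch on a one-parameter least-squares problem (the data, step size, and step counts are made up for illustration):

```python
import random

random.seed(0)
# Toy targets; the minimizer of mean((w - y)^2) is their sample mean (near 5).
data = [random.gauss(5.0, 1.0) for _ in range(1000)]

def grad(w, batch):
    # Gradient of mean((w - y)^2) over the batch: 2 * mean(w - y).
    return 2 * sum(w - y for y in batch) / len(batch)

def train(w, alpha, batch_size, steps=500):
    for _ in range(steps):
        batch = random.sample(data, batch_size)
        w -= alpha * grad(w, batch)
    return w

w_sgd  = train(0.0, 0.05, batch_size=1)          # SGD: one sample per step (noisy)
w_mini = train(0.0, 0.05, batch_size=32)         # mini-batch: averages a small subset
w_full = train(0.0, 0.05, batch_size=len(data))  # full-batch GD: exact gradient
```

All three end up near the mean of the data; SGD wanders the most around it, mini-batch much less, which is why mini-batch is the usual compromise between gradient quality and cost per step.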
Q: Why do deep nets converge at all on messy landscapes?
Tricks like normalization, residual connections, good initializations, and adaptive optimizers make the landscape more traversable.