Gradient Descent 1D — Simulation
Minimising \(f(x)=(x-3)^2\) using gradient descent
f(x) = (x − 3)^2 with update x_{t+1} = x_t − α · f'(x_t), where f'(x) = 2(x − 3).
What you’re seeing
A point sliding down \(f(x)=(x-3)^2\). The dashed line is the tangent with slope \(f'(x)\). Each step updates
\[ x_{t+1}=x_t-\alpha\,f'(x_t), \]
so the point moves against the slope (downhill).
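The update above can be sketched in a few lines of Python (the starting point and step size here are illustrative choices, not values from the simulation):

```python
# Gradient descent on f(x) = (x - 3)^2, whose derivative is f'(x) = 2(x - 3).

def f_prime(x):
    return 2 * (x - 3)

x = 0.0        # illustrative starting point
alpha = 0.1    # illustrative learning rate
for _ in range(100):
    x = x - alpha * f_prime(x)   # step against the slope (downhill)

print(round(x, 4))  # approaches the minimizer x = 3
```

Each step multiplies the error \(x_t-3\) by \(1-2\alpha\), so for \(0<\alpha<1\) the point contracts toward \(x=3\).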
Why this works (intuition)
- The derivative \(f'(x)\) is a local “tilt.”
- Subtracting \(\alpha f'(x)\) steps you in the direction that lowers \(f\).
- If \(\alpha\) is small, steps are careful; if it’s big, steps are bold (and can overshoot).
Where it’s used
- Linear & logistic regression training (closed-form exists for linear, but GD is still used at scale).
- Neural networks (SGD/Adam are gradient-based).
- Transformers, CNNs, RNNs—trained by (variants of) gradient descent on large losses.
- Matrix factorization, recommender models, word embeddings.
When it misbehaves
- Learning rate too large: diverges or jitters.
- Plateaus/saddles: slow progress when gradients are tiny.
- Non-convex losses: can settle in different local minima.
- Poor scaling: features on very different scales warp the landscape, so a single step size is too large in some directions and too small in others.
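The first failure mode is easy to see numerically. For \(f(x)=(x-3)^2\), the step multiplies the error by \(1-2\alpha\), so any \(\alpha>1\) makes the error grow each step. A small sketch (step counts and values are illustrative):

```python
# Compare a safe and an unsafe learning rate on f(x) = (x - 3)^2.

def f_prime(x):
    return 2 * (x - 3)

def run(alpha, x=0.0, steps=20):
    for _ in range(steps):
        x -= alpha * f_prime(x)
    return x

small = run(0.1)   # |1 - 2*0.1| = 0.8 < 1: error shrinks each step
large = run(1.1)   # |1 - 2*1.1| = 1.2 > 1: error grows each step

print(round(small, 3), round(large, 1))
```

With α = 0.1 the iterate lands near 3; with α = 1.1 it overshoots farther on every step and diverges.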
One-liner to remember
Gradient descent = “walk downhill using the slope, with a sensible stride.”
Maximizing \(g(x)=4-(x-1)^2\) using gradient descent
g(x) = 4 − (x − 1)^2. Gradient ascent update: x_{t+1} = x_t + α · g'(x_t), with g'(x) = −2(x − 1).
What you’re seeing
A point climbing the concave parabola \(g(x)=4-(x-1)^2\). Update:
\[ x_{t+1}=x_t+\alpha\,g'(x_t), \]
i.e., step with the slope (uphill) to reach the peak near \(x=1\).
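The ascent version is the same loop with the sign flipped (starting point and step size are again illustrative):

```python
# Gradient ascent on g(x) = 4 - (x - 1)^2, whose derivative is g'(x) = -2(x - 1).

def g_prime(x):
    return -2 * (x - 1)

x = 4.0        # illustrative starting point
alpha = 0.1    # illustrative learning rate
for _ in range(100):
    x = x + alpha * g_prime(x)   # step with the slope (uphill)

print(round(x, 4))  # approaches the maximizer x = 1
```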
Where ascent shows up
- Maximizing log-likelihood (same as minimizing negative log-likelihood).
- Policy gradient methods in RL (maximize expected return).
- Feature selection / sparse coding framed as maximization problems.
Same cautions
Learning rate, plateaus, and scaling issues apply just like descent—only the sign flips.
One-liner to remember
Gradient ascent = “walk uphill using the slope, with a sensible stride.”
Some FAQ
Q: Why does the tangent line matter?
It shows the local slope \(f'(x)\). The sign and magnitude of this slope drive the direction and size of the update.
Q: How do I pick \(\alpha\)?
Start small (e.g., 0.01–0.1) and watch the objective value: if it explodes, shrink \(\alpha\); if it barely moves, increase it a bit. In deep learning, use adaptive optimizers (e.g., Adam) and learning-rate schedules.
Q: What’s the difference between GD, SGD, mini-batch?
GD uses the full dataset gradient; SGD uses one sample; mini-batch averages a small subset. Mini-batch is the sweet spot for modern training.
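The three variants differ only in how many samples feed each gradient estimate. A toy sketch on a one-parameter least-squares problem (the data, step size, and step counts are made up for illustration):

```python
import random

random.seed(0)
# Toy targets; the minimizer of mean((w - y)^2) is their sample mean (near 5).
data = [random.gauss(5.0, 1.0) for _ in range(1000)]

def grad(w, batch):
    # Gradient of mean((w - y)^2) over the batch: 2 * mean(w - y).
    return 2 * sum(w - y for y in batch) / len(batch)

def train(w, alpha, batch_size, steps=500):
    for _ in range(steps):
        batch = random.sample(data, batch_size)
        w -= alpha * grad(w, batch)
    return w

w_sgd  = train(0.0, 0.05, batch_size=1)          # SGD: one sample per step (noisy)
w_mini = train(0.0, 0.05, batch_size=32)         # mini-batch: averages a small subset
w_full = train(0.0, 0.05, batch_size=len(data))  # full-batch GD: exact gradient
```

All three end up near the mean of the data; SGD wanders the most around it, mini-batch much less, which is why mini-batch is the usual compromise between gradient quality and cost per step.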
Q: Why do deep nets converge at all on messy landscapes?
Tricks like normalization, residual connections, good initializations, and adaptive optimizers make the landscape more traversable.