Adam vs SGD — Simple example with Animation

Two optimizers descend the same 2D bowl. Watch paths and values per step.
Published August 8, 2025


[Interactive animation] Adam vs SGD: almost the same loss, different paths. Loss: f(x, y) = ½(ax² + by²), an elongated bowl with a = 1, b = 15; both optimizers start at (2.5, 2.5) with the same seed but different update rules. Step sizes: αSGD = 0.03, αAdam = 0.08; SGD here is vanilla gradient descent, Adam uses β₁ = 0.9, β₂ = 0.999. The animation traces each path and reports f and ‖∇f‖ at every step.
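If you prefer to read the animation as code, here is a minimal NumPy sketch of the same setup. The loss, start point, step sizes, and betas come from the caption above; the 200-step horizon and ε = 1e-8 are my assumptions.

```python
import numpy as np

# Elongated bowl from the animation: f(x, y) = 0.5 * (a*x^2 + b*y^2), a=1, b=15.
A = np.array([1.0, 15.0])

def f(theta):
    return 0.5 * np.sum(A * theta**2)

def grad(theta):
    return A * theta

# Same starting point for both optimizers, as in the animation.
theta_sgd = np.array([2.5, 2.5])
theta_adam = np.array([2.5, 2.5])

# Step sizes and Adam hyperparameters from the animation caption.
lr_sgd, lr_adam = 0.03, 0.08
beta1, beta2, eps = 0.9, 0.999, 1e-8
m = np.zeros(2)   # first moment (running mean of gradients)
v = np.zeros(2)   # second moment (running mean of squared gradients)

for t in range(1, 201):
    # Vanilla gradient descent: theta <- theta - lr * g
    theta_sgd = theta_sgd - lr_sgd * grad(theta_sgd)

    # Adam: exponential moving averages, bias correction, per-parameter scaling
    g = grad(theta_adam)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    theta_adam = theta_adam - lr_adam * m_hat / (np.sqrt(v_hat) + eps)

    if t % 50 == 0:
        print(f"step {t:3d}  f_SGD={f(theta_sgd):.6f}  f_Adam={f(theta_adam):.6f}")
```

Printing f for both optimizers every 50 steps mirrors the value readouts in the animation: the losses end up close, but the paths (and the per-coordinate step sizes) differ.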

Comparison table

| If you want… | Pick | Why it helps |
| --- | --- | --- |
| Quick, stable progress with little tuning | Adam/AdamW | Auto-adjusts step sizes per parameter |
| Best final test accuracy (with tuning) | SGD + Momentum | Often generalizes a bit better |
| Works well for fine-tuning pretrained models | Adam/AdamW | Stable updates on mixed-scale layers |
| Handles noisy/sparse gradients | Adam/AdamW | Adaptive steps tame noise & scale |
| Minimal memory overhead | SGD (+Momentum) | No extra moment buffers (or just one) |
| Classic CNN training from scratch | SGD + Momentum | Strong, proven baseline in vision |
| Uneven curvature (zig-zag risk) | Adam/AdamW | Per-parameter normalization smooths steps |
| “Set and forget” defaults | Adam/AdamW | Reasonable results with default betas |

Math at a glance

  • SGD: \(\theta_{t+1}=\theta_t-\alpha\,g_t\)
  • SGD + Momentum: \(v_{t+1}=\beta v_t+g_t,\;\theta_{t+1}=\theta_t-\alpha v_{t+1}\)
  • Adam: \(\theta_{t+1}=\theta_t-\alpha\,\dfrac{\hat m_t}{\sqrt{\hat v_t}+\varepsilon}\), where \(\hat m_t\) and \(\hat v_t\) are bias-corrected running means of the gradients and squared gradients
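The first and third rules are the ones exercised in the animation sketch above; the momentum variant is the one not shown there. A minimal NumPy step might look like this (the names theta, v, g, lr, beta are mine):

```python
import numpy as np

def sgd_momentum_step(theta, v, g, lr=0.03, beta=0.9):
    """One SGD + Momentum update: v <- beta*v + g, then theta <- theta - lr*v."""
    v = beta * v + g
    theta = theta - lr * v
    return theta, v

# Example on the bowl's gradient at the start point (2.5, 2.5):
theta, v = np.array([2.5, 2.5]), np.zeros(2)
g = np.array([1.0, 15.0]) * theta            # gradient of 0.5*(x^2 + 15*y^2)
theta, v = sgd_momentum_step(theta, v, g)
```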

Typical starting points

  • AdamW: \(\alpha=1\mathrm{e}{-3}\) (or \(5\mathrm{e}{-4}\)), \(\beta_1=0.9\), \(\beta_2=0.999\), weight decay \(=0.01\)
  • SGD+Momentum: \(\alpha=0.1\rightarrow0.01\) with schedule, momentum \(=0.9\), weight decay \(=1\mathrm{e}{-4}\)
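These defaults map directly onto PyTorch's optimizer constructors. A minimal sketch, assuming a toy model and a 100-epoch run (both are my placeholders):

```python
import torch

# Hypothetical model and epoch count, just to show the constructor arguments.
model = torch.nn.Linear(10, 1)
epochs = 100

# AdamW with the starting points listed above.
opt_adamw = torch.optim.AdamW(
    model.parameters(), lr=1e-3, betas=(0.9, 0.999), weight_decay=0.01
)

# SGD + Momentum with a cosine schedule decaying alpha from 0.1 toward 0.01.
opt_sgd = torch.optim.SGD(
    model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4
)
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt_sgd, T_max=epochs, eta_min=0.01)
```

Step the scheduler once per epoch (call sched.step() after the epoch's optimizer updates) so the learning rate actually follows the decay.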

When to use which

  • I need something that “just works” with minimal tuning → Adam/AdamW. Adapts the step size per weight; very forgiving early on.
  • I’m fine-tuning a pretrained model (NLP, vision, speech) → Adam/AdamW. Stable, fast convergence from the start.
  • My gradients are noisy or features are sparse (recsys, NLP) → Adam/AdamW. Handles uneven scales and noise better.
  • I care about the very best final accuracy/generalization (from scratch) → SGD + Momentum. Often edges out Adam at the end, provided you tune the learning-rate schedule.
  • I’m training big vision models from scratch (classic CNN setups) → SGD + Momentum (with cosine/step decay). A strong, time-tested default in vision.
  • Memory is tight → SGD (+Momentum). Adam keeps two extra tensors per weight (m, v).
  • My loss landscape is elongated/ill-scaled → Adam/AdamW or SGD + Momentum. Adam smooths steps automatically; momentum reduces zig-zags.
  • Not sure where to start? → Start with AdamW, then try SGD + Momentum once things are stable to see if you gain extra accuracy.
