Adam vs SGD — Simple example with Animation

Two optimizers descend the same 2D bowl. Watch paths and values per step.
Published August 8, 2025


[Interactive animation] Adam vs SGD: almost the same loss, different paths. Loss: f(x, y) = ½(ax² + by²), an elongated bowl with a = 1, b = 15; both optimizers start at (2.5, 2.5) with the same seed but different update rules. Step sizes: αSGD = 0.03, αAdam = 0.08; SGD here is vanilla gradient descent, Adam uses β₁ = 0.9, β₂ = 0.999. The animation traces each path and reports f and ‖∇f‖ at every step.
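If you prefer to read the animation as code, here is a minimal NumPy sketch of the same setup. The loss, start point, step sizes, and betas come from the caption above; the 200-step horizon and ε = 1e-8 are my assumptions.

```python
import numpy as np

# Elongated bowl from the animation: f(x, y) = 0.5 * (a*x^2 + b*y^2), a=1, b=15.
A = np.array([1.0, 15.0])

def f(theta):
    return 0.5 * np.sum(A * theta**2)

def grad(theta):
    return A * theta

# Same starting point for both optimizers, as in the animation.
theta_sgd = np.array([2.5, 2.5])
theta_adam = np.array([2.5, 2.5])

# Step sizes and Adam hyperparameters from the animation caption.
lr_sgd, lr_adam = 0.03, 0.08
beta1, beta2, eps = 0.9, 0.999, 1e-8
m = np.zeros(2)   # first moment (running mean of gradients)
v = np.zeros(2)   # second moment (running mean of squared gradients)

for t in range(1, 201):
    # Vanilla gradient descent: theta <- theta - lr * g
    theta_sgd = theta_sgd - lr_sgd * grad(theta_sgd)

    # Adam: exponential moving averages, bias correction, per-parameter scaling
    g = grad(theta_adam)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    theta_adam = theta_adam - lr_adam * m_hat / (np.sqrt(v_hat) + eps)

    if t % 50 == 0:
        print(f"step {t:3d}  f_SGD={f(theta_sgd):.6f}  f_Adam={f(theta_adam):.6f}")
```

Printing f for both optimizers every 50 steps mirrors the value readouts in the animation: the losses end up close, but the paths (and the per-coordinate step sizes) differ.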

Comparison table

| If you want… | Pick | Why it helps |
| --- | --- | --- |
| Quick, stable progress with little tuning | Adam/AdamW | Auto-adjusts step sizes per parameter |
| Best final test accuracy (with tuning) | SGD + Momentum | Often generalizes a bit better |
| Works well for fine-tuning pretrained models | Adam/AdamW | Stable updates on mixed-scale layers |
| Handles noisy/sparse gradients | Adam/AdamW | Adaptive steps tame noise & scale |
| Minimal memory overhead | SGD (+Momentum) | No extra moment buffers (or just one) |
| Classic CNN training from scratch | SGD + Momentum | Strong, proven baseline in vision |
| Uneven curvature (zig-zag risk) | Adam/AdamW | Per-parameter normalization smooths steps |
| “Set and forget” defaults | Adam/AdamW | Reasonable results with default betas |

Math at a glance

  • SGD: \(\theta_{t+1}=\theta_t-\alpha\,g_t\)
  • SGD + Momentum: \(v_{t+1}=\beta v_t+g_t,\;\theta_{t+1}=\theta_t-\alpha v_{t+1}\)
  • Adam: \(\theta_{t+1}=\theta_t-\alpha\,\dfrac{\hat m_t}{\sqrt{\hat v_t}+\varepsilon}\), where \(\hat m_t\) and \(\hat v_t\) are bias-corrected running means of the gradients and squared gradients
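The first and third rules are the ones exercised in the animation sketch above; the momentum variant is the one not shown there. A minimal NumPy step might look like this (the names theta, v, g, lr, beta are mine):

```python
import numpy as np

def sgd_momentum_step(theta, v, g, lr=0.03, beta=0.9):
    """One SGD + Momentum update: v <- beta*v + g, then theta <- theta - lr*v."""
    v = beta * v + g
    theta = theta - lr * v
    return theta, v

# Example on the bowl's gradient at the start point (2.5, 2.5):
theta, v = np.array([2.5, 2.5]), np.zeros(2)
g = np.array([1.0, 15.0]) * theta            # gradient of 0.5*(x^2 + 15*y^2)
theta, v = sgd_momentum_step(theta, v, g)
```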

Typical starting points

  • AdamW: \(\alpha=1\mathrm{e}{-3}\) (or \(5\mathrm{e}{-4}\)), \(\beta_1=0.9\), \(\beta_2=0.999\), weight decay \(=0.01\)
  • SGD+Momentum: \(\alpha=0.1\rightarrow0.01\) with schedule, momentum \(=0.9\), weight decay \(=1\mathrm{e}{-4}\)
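These defaults map directly onto PyTorch's optimizer constructors. A minimal sketch, assuming a toy model and a 100-epoch run (both are my placeholders):

```python
import torch

# Hypothetical model and epoch count, just to show the constructor arguments.
model = torch.nn.Linear(10, 1)
epochs = 100

# AdamW with the starting points listed above.
opt_adamw = torch.optim.AdamW(
    model.parameters(), lr=1e-3, betas=(0.9, 0.999), weight_decay=0.01
)

# SGD + Momentum with a cosine schedule decaying alpha from 0.1 toward 0.01.
opt_sgd = torch.optim.SGD(
    model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4
)
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt_sgd, T_max=epochs, eta_min=0.01)
```

Step the scheduler once per epoch (call sched.step() after the epoch's optimizer updates) so the learning rate actually follows the decay.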

When to use which

  • I need something that “just works” with minimal tuning → Adam/AdamW. Adapts the step size per weight; very forgiving early on.
  • I’m fine-tuning a pretrained model (NLP, vision, speech) → Adam/AdamW. Stable, fast convergence from the start.
  • My gradients are noisy or features are sparse (recsys, NLP) → Adam/AdamW. Handles uneven scales and noise better.
  • I care about the very best final accuracy/generalization (from scratch) → SGD + Momentum. Often edges out Adam at the end, provided you tune the learning-rate schedule.
  • I’m training big vision models from scratch (classic CNN setups) → SGD + Momentum (with cosine/step decay). A strong, time-tested default in vision.
  • Memory is tight → SGD (+Momentum). Adam keeps two extra tensors per weight (m, v).
  • My loss landscape is elongated/ill-scaled → Adam/AdamW or SGD + Momentum. Adam smooths steps automatically; momentum reduces zig-zags.
  • Not sure where to start? → Start with AdamW, then try SGD + Momentum once things are stable to see if you gain extra accuracy.
