Adam vs SGD — Simple example with Animation
We minimize \(f(x,y)=\tfrac{1}{2}(ax^2+by^2)\), an elongated bowl with \(a=1\) and \(b=15\). Both optimizers start from the same point \((2.5, 2.5)\); only the update rule differs.
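Below is a minimal NumPy sketch of this setup (a reconstruction, not the article's original animation code): it runs plain SGD, SGD + momentum, and Adam from \((2.5, 2.5)\) on the bowl and records each trajectory. The learning rates and step count are illustrative choices; swapping the final print for a matplotlib plot gives the path comparison.

```python
# Minimal sketch: compare SGD, SGD+momentum, and Adam on the elongated bowl
# f(x, y) = 0.5 * (a*x^2 + b*y^2). Hyperparameters here are illustrative only.
import numpy as np

a, b = 1.0, 15.0                      # bowl coefficients (steep along y)
start = np.array([2.5, 2.5])          # common starting point

def grad(theta):
    """Gradient of f(x, y) = 0.5 * (a*x^2 + b*y^2)."""
    return np.array([a * theta[0], b * theta[1]])

def run_sgd(lr=0.1, steps=50):
    theta, path = start.copy(), [start.copy()]
    for _ in range(steps):
        theta = theta - lr * grad(theta)
        path.append(theta.copy())
    return np.array(path)

def run_momentum(lr=0.1, beta=0.9, steps=50):
    theta, v, path = start.copy(), np.zeros(2), [start.copy()]
    for _ in range(steps):
        v = beta * v + grad(theta)         # accumulate velocity
        theta = theta - lr * v
        path.append(theta.copy())
    return np.array(path)

def run_adam(lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=50):
    theta, m, v, path = start.copy(), np.zeros(2), np.zeros(2), [start.copy()]
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g        # first moment (mean of g)
        v = beta2 * v + (1 - beta2) * g**2     # second moment (mean of g^2)
        m_hat = m / (1 - beta1**t)             # bias correction
        v_hat = v / (1 - beta2**t)
        theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
        path.append(theta.copy())
    return np.array(path)

for name, path in [("SGD", run_sgd()), ("SGD+Momentum", run_momentum()),
                   ("Adam", run_adam())]:
    print(f"{name:>14}: final point {path[-1].round(4)}")
```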
Comparison table
| If you want… | Pick | Why it helps |
|---|---|---|
| Quick, stable progress with little tuning | Adam/AdamW | Auto-adjusts step sizes per parameter |
| Best final test accuracy (with tuning) | SGD + Momentum | Often generalizes a bit better |
| Works well for fine-tuning pretrained models | Adam/AdamW | Stable updates on mixed-scale layers |
| Handles noisy/sparse gradients | Adam/AdamW | Adaptive steps tame noise & scale |
| Minimal memory overhead | SGD (+Mom.) | No extra moment buffers (or just one) |
| Classic CNN training from scratch | SGD + Momentum | Strong, proven baseline in vision |
| Uneven curvature (zig-zag risk) | Adam/AdamW | Per-parameter normalization smooths steps |
| “Set and forget” defaults | Adam/AdamW | Reasonable results with default betas |
Math at a glance
- SGD: \(\theta_{t+1}=\theta_t-\alpha\,g_t\)
- SGD + Momentum: \(v_{t+1}=\beta v_t+g_t,\;\theta_{t+1}=\theta_t-\alpha v_{t+1}\)
- Adam: \(m_t=\beta_1 m_{t-1}+(1-\beta_1)g_t\), \(v_t=\beta_2 v_{t-1}+(1-\beta_2)g_t^2\), \(\theta_{t+1}=\theta_t-\alpha\,\dfrac{\hat m_t}{\sqrt{\hat v_t}+\varepsilon}\), where \(\hat m_t,\hat v_t\) are bias-corrected running averages of the gradient and squared gradient (a one-step worked example follows)
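To make the contrast concrete, here is a one-step illustration (assuming a shared learning rate of 0.1, chosen only to expose the geometry of the steps): plain SGD overshoots in the steep \(y\)-direction, while Adam's first step is roughly \(\alpha\cdot\mathrm{sign}(g)\) in every coordinate.

```python
# One step from the start point (2.5, 2.5) on the bowl with a=1, b=15,
# using lr = 0.1 for both optimizers purely for illustration.
import numpy as np

a, b = 1.0, 15.0
theta = np.array([2.5, 2.5])
g = np.array([a * theta[0], b * theta[1]])   # gradient = (2.5, 37.5)
lr, eps = 0.1, 1e-8

# Plain SGD: the steep y-direction overshoots straight past the minimum.
sgd_step = theta - lr * g                    # -> ( 2.25, -1.25)

# Adam, first step: after bias correction m_hat = g and v_hat = g**2, so the
# update is roughly lr * sign(g) in each coordinate, regardless of scale.
adam_step = theta - lr * g / (np.sqrt(g**2) + eps)   # -> (~2.4, ~2.4)

print("SGD  after 1 step:", sgd_step)
print("Adam after 1 step:", adam_step.round(4))
```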
Typical starting points
- AdamW: \(\alpha=1\mathrm{e}{-3}\) (or \(5\mathrm{e}{-4}\)), \(\beta_1=0.9\), \(\beta_2=0.999\), weight decay \(=0.01\)
- SGD + Momentum: \(\alpha=0.1\rightarrow0.01\) with a decay schedule, momentum \(=0.9\), weight decay \(=1\mathrm{e}{-4}\)
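As a reference, a hypothetical PyTorch setup with these starting points could look like the sketch below; the model, epoch count, and cosine floor are placeholders rather than values from the article.

```python
# Hypothetical PyTorch setup using the starting points above.
import torch

model = torch.nn.Linear(128, 10)   # placeholder model

# AdamW with the defaults listed above.
adamw = torch.optim.AdamW(model.parameters(), lr=1e-3,
                          betas=(0.9, 0.999), weight_decay=0.01)

# SGD + momentum, decayed from 0.1 toward 0.01 with a cosine schedule.
sgd = torch.optim.SGD(model.parameters(), lr=0.1,
                      momentum=0.9, weight_decay=1e-4)
epochs = 90                        # placeholder training length
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    sgd, T_max=epochs, eta_min=0.01)
```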
When to use which
- I need something that “just works” with minimal tuning. → Adam/AdamW: adapts the step size per weight; very forgiving early on.
- I’m fine-tuning a pretrained model (NLP, vision, speech). → Adam/AdamW: stable, fast convergence from the start.
- My gradients are noisy or features are sparse (recsys, NLP). → Adam/AdamW: handles uneven scales and noise better.
- I care about the very best final accuracy/generalization (from scratch). → SGD + Momentum: often edges out Adam at the end, if you tune the learning-rate schedule.
- I’m training big vision models from scratch (classic CNN setups). → SGD + Momentum (with cosine/step decay): a strong, time-tested default in vision.
- Memory is tight. → SGD (+Momentum): Adam keeps two extra tensors per weight (m, v).
- My loss landscape is elongated/ill-scaled. → Adam/AdamW or SGD + Momentum: Adam smooths steps automatically; momentum reduces zig-zags.
- Not sure where to start? → Start with AdamW, then try SGD + Momentum once things are stable to see if you gain extra accuracy.
If this article helped you, ☕ buy me a coffee. Your support keeps more things coming!