Article is from 2016. It only mentions AdamW at the very end in passing. These days I rarely see much besides AdamW in production.
Messing with optimizers is one of the ways to enter hyperparameter hell: it’s like legacy code but on steroids because changing it only breaks your training code stochastically. Much better to stop worrying and love AdamW.
The mention of AdamW is brief, but in his defense he includes a link that gives a gloss of it: "An updated overview of recent gradient descent algorithms" [https://johnchenresearch.github.io/demon/].
Luckily we have Shampoo, SOAP, Modula, schedule-free variants, and many more being researched these days! I am particularly excited by the heavyball library.
Been out of the loop for a while, anything exciting?
Read the Modular Norms in Deep Learning paper, and follow the author of the heavyball library on twitter with notifications enabled
Something that stuck out to me in the updated blog [0] is that Demon Adam performed much better than even AdamW, with very interesting learning curves. I'm wondering now why it didn't become the standard. Anyone here have insights into this?
[0] https://johnchenresearch.github.io/demon/
Demon Adam didn’t become standard largely for the same reason many “better” optimizers never see wide adoption: it’s a newer tweak, not clearly superior on every problem, is less familiar to most engineers, and isn’t always bundled in major frameworks. By contrast, AdamW is now the “safe default” that nearly everyone supports and knows how to tune, so teams stick with it unless they have a strong reason not to.
Edit: Demon involves decaying the momentum parameter over time, which introduces a new schedule or formula for how momentum should be reduced during training. That can feel like additional complexity or a potential hyperparameter rabbit hole. Teams trying to ship products quickly often avoid adding new hyperparameters unless the gains are decisive.
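For concreteness, here is a rough sketch of the kind of momentum-decay schedule Demon describes (this is my reading of the Demon paper, so treat the exact formula and the PyTorch wiring in the comments as assumptions):

    # Hypothetical Demon-style schedule: beta shrinks from beta_init toward 0
    # as training progresses.
    def demon_beta(step, total_steps, beta_init=0.9):
        remaining = 1.0 - step / total_steps
        return beta_init * remaining / ((1.0 - beta_init) + beta_init * remaining)

    # e.g. with a PyTorch Adam optimizer (assumed usage), refresh beta1 each step:
    # for step in range(total_steps):
    #     beta1 = demon_beta(step, total_steps)
    #     for group in optimizer.param_groups:
    #         group["betas"] = (beta1, 0.999)

The only extra knobs are beta_init and the training horizon, but it is still one more schedule that has to be chosen and validated.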
Interesting, but it does not seem to be an overview of gradient optimisers in general, rather gradient optimisers in ML, as I see no mention of BFGS and the like.
I'm also curious about gradient-less algorithms
For non-deep-learning applications, Nelder-Mead has saved my butt a few times.
It's with the utmost humility that I confess to falling back on "just use Nelder-Mead" in *scipy.optimize* when something is ill behaved. I consider it to be a sign that I'm doing something wrong, but I certainly respect its use.
Nelder–Mead has often not worked well for me in moderate to high dimensions. I'd recommend trying Powell's method if you want to quickly converge to a local optimum. If you're using scipy's wrappers, it's easy to swap between the two:
https://docs.scipy.org/doc/scipy/reference/optimize.html#loc...
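As an illustration (not from the original comment): with scipy.optimize.minimize, switching is just a change of the method string, so it is cheap to try both on an ill-conditioned test function.

    import numpy as np
    from scipy.optimize import minimize

    def rosenbrock(x):
        # classic ill-conditioned test function
        return np.sum(100.0 * (x[1:] - x[:-1] ** 2) ** 2 + (1.0 - x[:-1]) ** 2)

    x0 = np.zeros(10)
    res_nm = minimize(rosenbrock, x0, method="Nelder-Mead")
    res_powell = minimize(rosenbrock, x0, method="Powell")
    print(res_nm.fun, res_powell.fun)  # compare final objective values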
For nastier optimization problems there are lots of other options, including evolutionary algorithms and Bayesian optimization:
https://facebookresearch.github.io/nevergrad/
https://github.com/facebook/Ax
SAMBO does a good job of finding the global optimum in a black-box manner even compared to Nelder-Mead, according to its own benchmark ...
https://sambo-optimization.github.io
ChatGPT also advised me to use NM a couple of times, which was neat.
Look into zeroth-order optimizers and CMA-ES.
I think the big difference is dimensionality. If the dimensionality is low, then taking account of the 2nd derivatives becomes practical and worthwhile.
What is it that makes higher order derivatives less useful at high dimensionality? Is it related to the Curse of Dimensionality, or maybe something like exploding gradients at higher orders?
In n dimensions, the first derivative is an n-element vector. The second derivative is an n x n (symmetric) matrix. As n grows, the computation required to estimate the matrix increases (as at least n^2) and computation needed to use it increases (possibly faster).
In practice, clever optimisation algorithms that use the 2nd derivative won't actually form this matrix.
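A back-of-the-envelope illustration of that scaling (my own sketch, not from the comment): the gradient is n numbers, a dense Hessian is n^2 numbers, and a plain Newton step solves an n-by-n linear system, which is O(n^3) with dense linear algebra. This is why methods like (L-)BFGS only touch curvature through low-rank updates or matrix-vector products.

    import numpy as np

    n = 1000
    g = np.random.randn(n)                 # gradient: n entries
    A = np.random.randn(n, n)
    H = A @ A.T + n * np.eye(n)            # stand-in symmetric positive-definite Hessian: n*n entries

    gd_step = -0.01 * g                    # first-order step: O(n) work
    newton_step = -np.linalg.solve(H, g)   # Newton step solves H p = -g: O(n^2) memory, O(n^3) work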
An example of the bitter lesson: none of these nuances matter 8 years later, when everyone uses SGD or AdamW.
It's a great summary for ML interview prep.
I disagree, it is old and most of those algorithms aren’t used anymore.
That’s how interviews go though, it’s not like I’ve ever had to use Bayes rule at work but for a few years everyone loved asking about it in screening rounds.
In my experience a lot of people "know" maths, but fail to recognise the opportunities to use it. Some of my colleagues were pleased when I showed them that their ad hoc algorithm was equivalent to an application of Bayes' rule. It gave them insights into the meaning of constants that had formerly been chosen by trial and error.
Everyone’s experience is different but I’ve been in dozens of MLE interviews (some of which I passed!) and have never once been asked to explain the internals of an optimizer. The interviews were all post 2020, though.
Unless someone had a very good reason I would consider it weird to use anything other than AdamW. The compute you could save with a slightly better optimizer pales in comparison to the time you will spend debugging an opaque training bug.
For example, if it is meaningful to use large batch sizes, the gradient variance will be lower and Adam could be equivalent to just momentum.
As a model is trained, the gradient variance typically falls.
Those optimizers all work to reduce the variance of the updates in various ways.
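For reference, the textbook Adam update (a generic sketch, not anyone's production code) shows where that variance estimate enters: the second moment v rescales each coordinate, and if the gradients barely vary between steps that rescaling is roughly constant, which is in the spirit of the equivalence-to-momentum point above.

    import numpy as np

    def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        m = beta1 * m + (1 - beta1) * grad      # first moment: momentum
        v = beta2 * v + (1 - beta2) * grad**2   # second moment: tracks gradient scale/variance
        m_hat = m / (1 - beta1**t)              # bias correction (t starts at 1)
        v_hat = v / (1 - beta2**t)
        return param - lr * m_hat / (np.sqrt(v_hat) + eps), m, v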
I'd still expect an MLE to know it though.
Why would you? Implementing optimizers isn’t something that MLEs do. Even the Deepseek team just uses AdamW.
An MLE should be able to look up and understand the differences between optimizers but memorizing that information is extremely low priority compared with other information they might be asked.