43% of American workers have used AI at work; they are mostly doing it in informal ways, solving their own work problems. Scaling AI across the enterprise is hard.
A lot of firms starting out in this business are "betting the farm" on "scaling AI across the enterprise".
The last time I was reminded of the bitter lesson was when I read about Guidance & Control Networks, after seeing them used in an autonomous drone that beat the best human FPV pilots [0]. Basically it uses a small MLP (Multi-Layer Perceptron) on the order of 200 parameters, taking the drone's state as input and controlling the motors directly with its output. We have all kinds of fancy control theory like MCP (Model Predictive Control), but it turns out that the best solution might be to train a relatively tiny NN on a mix of simulation and collected sensor data instead. It's not better because of huge computation resources; it's actually more computationally efficient than some classic alternatives, but it is more general.
[0] https://www.tudelft.nl/en/2025/lr/autonomous-drone-from-tu-d...
https://www.nature.com/articles/s41586-023-06419-4
https://arxiv.org/abs/2305.13078
https://arxiv.org/abs/2305.02705
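To make the scale concrete, here is a minimal numpy sketch of what a roughly 200-parameter guidance-and-control MLP looks like: state in, motor commands out. The state layout, layer sizes, and random weights are illustrative assumptions, not the architecture from the linked papers; a real G&CNet is trained against a simulator and flight data rather than initialized at random.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer(n_in, n_out):
    # Random weights stand in for parameters learned from simulation + sensor data.
    return rng.normal(scale=0.1, size=(n_out, n_in)), np.zeros(n_out)

# Illustrative dimensions: 8 state inputs -> 8 -> 8 hidden units -> 4 motor commands.
W1, b1 = layer(8, 8)
W2, b2 = layer(8, 8)
W3, b3 = layer(8, 4)

def gcnet(state):
    """Map the drone state directly to normalized motor commands in [0, 1]."""
    h = np.tanh(W1 @ state + b1)
    h = np.tanh(W2 @ h + b2)
    return 1.0 / (1.0 + np.exp(-(W3 @ h + b3)))  # sigmoid -> throttle fractions

# Hypothetical state vector: position error (3), velocity (3), roll and pitch (2).
state = np.array([0.5, -0.2, 1.0, 0.0, 0.1, -0.3, 0.05, 0.02])
print("motor commands:", gcnet(state))

n_params = sum(w.size + b.size for w, b in [(W1, b1), (W2, b2), (W3, b3)])
print("parameter count:", n_params)  # 180, i.e. on the order of 200
```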
But I have also seen people trying to use deep networks to identify rotating machinery faults, like bearings, from raw accelerometer data collected at high frequencies like 40 kHz, whereas the spectrum from running an FFT on the signal contains the fault information much more obviously and clearly.
Throwing a deep network at a problem without some physical insight into it has its disadvantages too, it seems.
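For the bearing case, the classical pipeline is roughly: sample at a high rate, demodulate to get the vibration envelope, take its spectrum, and look for energy at the bearing's characteristic defect frequencies. A minimal numpy/scipy sketch follows; the 40 kHz rate matches the comment above, but the synthetic signal and the 87 Hz "defect frequency" are made-up stand-ins for real accelerometer data and real bearing geometry.

```python
import numpy as np
from scipy.signal import hilbert

rng = np.random.default_rng(0)
fs = 40_000                              # sampling rate from the comment above (Hz)
t = np.arange(0, 1.0, 1 / fs)

# Synthetic stand-in: a 3 kHz structural resonance amplitude-modulated by periodic
# impacts at a hypothetical defect frequency, buried in broadband noise.
f_defect = 87.0                          # made-up characteristic defect frequency (Hz)
carrier = np.sin(2 * np.pi * 3000 * t)
impacts = 0.8 * (np.sin(2 * np.pi * f_defect * t) > 0.9)
signal = (1 + impacts) * carrier + 0.3 * rng.normal(size=t.size)

# Envelope analysis: demodulate with the Hilbert transform, then FFT the envelope.
envelope = np.abs(hilbert(signal))
spectrum = np.abs(np.fft.rfft(envelope - envelope.mean()))
freqs = np.fft.rfftfreq(envelope.size, 1 / fs)

band = (freqs > 5) & (freqs < 500)       # defect frequencies sit far below the resonance
peak = freqs[band][np.argmax(spectrum[band])]
print(f"dominant envelope-spectrum peak: {peak:.1f} Hz (defect injected at {f_defect} Hz)")
```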
Yeah, we're shouting into the wind here. I have had people tell me directly that my ideas from old-school state estimation were irrelevant in the era of deep learning. They may produce (in this case) worse results, but the long game, I'm assured, is superior.
The specific scenario was estimating the orientation of a stationary semi-trailer: an objectively measurable number, and it was consistently off by 30 degrees, yet I was the jerk for suggesting we move from end-to-end DL to traditional Bar-Shalom techniques.
That scene isn't for me anymore.
> That scene isn't for me anymore.
They will learn. At least when the competition beats their solution with a hybrid approach they can't begin to understand.
I'm working on this sort of thing right now in a SaaS product that previously didn't have support for vibration data. One competitor is ML'd up to the hilt, but customers don't like the black box and keep finding that it gives false positives with no explanation. I think one problem is that the buyers don't understand the problem: they just want to plug in a sensor and have insights happen, but without information about the machine, that is never going to provide useful insights.
There are also extremely misleading research articles out there promising good results with deep networks in the area of anomaly detection, without adequate comparison with more classical techniques.
This well-known critical paper shows examples of AI articles/techniques applied to popular datasets with good-looking results. But it also demonstrates that, literally, a single line of MATLAB code can outperform some of these techniques: https://arxiv.org/pdf/2009.13807
> an autonomous drone that beat the best human FPV pilots
Doesn't any such claim come with huge caveats — a pre-specified track/course, no random objects flying in between, etc.? I.e., train and test distributions are kept the same by ensuring test time can never be more complicated than the training data.
Also presumably better sensing than raw visual input.
>It's not better because of huge computation resources; it's actually more computationally efficient than some classic alternatives
It's similar with options pricing. The most sophisticated models like multivariate stochastic volatility are computationally expensive to approximate with classical approaches (and have no closed form solution), so just training a small NN on the output of a vast number of simulations of the underlying processes ends up producing a more efficient model than traditional approaches. Same with stuff like trinomial trees.
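A toy version of that idea, using plain geometric Brownian motion as a stand-in for a multivariate stochastic-volatility model and sklearn's MLPRegressor as the "small NN": pay for the expensive Monte Carlo simulations once, up front, then price with a single cheap forward pass. All parameter ranges here are arbitrary.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

def mc_call_price(s0, k, sigma, t=1.0, r=0.02, n_paths=20_000):
    """Monte Carlo price of a European call under GBM (stand-in for a fancier model)."""
    z = rng.standard_normal(n_paths)
    st = s0 * np.exp((r - 0.5 * sigma**2) * t + sigma * np.sqrt(t) * z)
    return np.exp(-r * t) * np.maximum(st - k, 0.0).mean()

# Build a training set: (spot, strike, vol) -> simulated price.
X = rng.uniform([80, 80, 0.1], [120, 120, 0.5], size=(2000, 3))
y = np.array([mc_call_price(*row) for row in X])

# The "small NN" surrogate: the expensive simulation cost is amortized at training time.
surrogate = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=2000, random_state=0)
surrogate.fit(X, y)

test = np.array([[100.0, 105.0, 0.25]])
print("MC price:       ", mc_call_price(*test[0]))
print("surrogate price:", surrogate.predict(test)[0])
```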
This is really interesting. I think force fields in molecular dynamics have undergone a similar NN revolution. You train your NN on the output of expensive calculations to replace the expensive function with a cheap one. Could you train a small language model with a big one?
> Could you train a small language model with a big one?
Yes, it's called distillation.
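The standard recipe is to train the small model to match the big model's softened output distribution, optionally mixed with the usual hard-label loss. A minimal PyTorch sketch of the distillation loss, with random logits standing in for real teacher and student forward passes:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between the softened teacher and student distributions."""
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature**2

# Dummy logits over a 50k-token vocabulary; in practice these come from running the
# teacher and the student on the same batch of text.
batch, vocab = 4, 50_000
teacher_logits = torch.randn(batch, vocab)
student_logits = torch.randn(batch, vocab, requires_grad=True)

loss = distillation_loss(student_logits, teacher_logits)
loss.backward()
print(loss.item())
```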
Interesting. Are these models the SOTA in the options trading industry (e.g. MM) nowadays?
> "MCP (Model Predictive Control)"
^ that's MPC. (MCP = Model Context Protocol)
I was not at all a fan of "The Bitter Lesson versus The Garbage Can", but this misses the same thing that it missed.
The Bitter Lesson is from the perspective of how to spend your entire career. It is correct over the course of a very long time, and bakes in Moore's Law.
The Bitter Lesson is true because general methods capture these assumed hardware gains that specific methods may not. It was never meant for contrasting methods at a specific moment in time. At a specific moment in time you're just describing Explore vs Exploit.
Right, and if you spot a job that needs doing and can be done by a specialized model, waving your hands about general purpose scale-leveraging models eventually overtaking specialized models has not historically been a winning approach.
Except in the last year or two, which is why people are citing it a lot :)
Probably because this is how bubbles happen.
I think there might be interesting time scales in between “now” and “my entire career” to which the bitter lesson may or may not apply. As an outsider to ML I have questions about the longevity of any given “context engineering” approach in light of the bitter lesson.
The bitter lesson becomes more true over time, because inductive bias becomes less useful over time. Case in point: PCA/hand engineering -> CNN -> ViT.
Why not both?
Thirty-five years ago they gave me a Ph.D. basically for pointing out that the controversy du jour -- reactive vs deliberative control for autonomous robots -- was not a dichotomy. You could have the best of both worlds by combining a reactive system with a deliberative one. The reactive system interfaced directly to the hardware on one end and provided essentially a high-level API on the other end that provided primitives like "go that way". It's a little bit more complicated than that because it turns out you need a glue layer in the middle, but the point is: you don't have to choose. The Bitter Lesson is simply a corollary of Ron's First Law: all extreme positions are wrong. So reactive control by itself has limits, and deliberative control by itself has limits. But put the two together (and add some pretty snazzy image processing) and the result is Waymo.
So it was no surprise to me that Stockfish, with its similar approach of combining deliberative search with a small NN computing its quality metric, blows everything else out of the water. It has been obvious (at least to me) that this is the right approach for decades now.
I'm actually pretty terrified of the results when the mainstream AI companies finally rediscover this. The capabilities of LLMs are already pretty impressive on their own. If they can get a Stockfish-level boost by combining them with a simple search algorithm, the result may very well be the AGI that the rationalist community has been sounding the alarm over for the last 20 years.
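The shape of that combination is easy to sketch: classical alpha-beta search supplies the deliberation, and a small learned evaluation supplies the judgment at the frontier. Below is a minimal negamax with alpha-beta pruning over a toy take-away game, with a random-weight two-layer net standing in for the evaluation; in Stockfish the evaluation is NNUE and the game is chess, but the division of labor is the same.

```python
import math
import numpy as np

rng = np.random.default_rng(0)
W_hidden = rng.normal(size=(8, 2))   # stand-in "small NN" evaluation weights
w_out = rng.normal(size=8)

def evaluate(pile, player):
    """Tiny learned-style evaluation: two features -> hidden layer -> scalar score."""
    features = np.array([pile / 21.0, float(player)])
    return float(w_out @ np.tanh(W_hidden @ features))

def alphabeta(pile, player, depth, alpha, beta):
    # Toy game: players alternately remove 1-3 stones; whoever takes the last stone wins.
    if pile == 0:
        return -1.0                    # no stones left: the player to move has lost
    if depth == 0:
        return evaluate(pile, player)  # the learned heuristic judges the frontier
    best = -math.inf
    for take in (1, 2, 3):
        if take <= pile:
            score = -alphabeta(pile - take, -player, depth - 1, -beta, -alpha)
            best = max(best, score)
            alpha = max(alpha, score)
            if alpha >= beta:
                break                  # beta cutoff: the deliberative part prunes the tree
    return best

print(alphabeta(pile=17, player=1, depth=6, alpha=-math.inf, beta=math.inf))
```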
I've been thinking of the issue like this: a traditional approach to symbolic artificial intelligence frames problem solving as a search of a tree. Do you go depth first, dodging infinite branches? Or breadth first (do you have enough memory?). What about iterative deepening?
Sometimes the solution is in the tree, but it is too deep, and one runs out of time before it is found.
Statistical learning could act as a branch predictor. Sometimes it guides the search to go very deep in the right place and find the hidden solution. Sometimes it guides the search to go very deep in the wrong place, and one runs out of time as usual.
Notice the strength of the hybrid approach. One isn't simply accepting the probably-correct answer from the statistical part. It is only a guide, and if the answer is found, and the symbolic part of the software is correct, the answer will be reliable.
I think this is already being done with maths problems. The LLM is writing proof attempts in Lean. But Lean is traditional symbolic AI. If the LLM can come up with a proof that Lean approves, then it really has a proof. From your comment I learn that Stockfish has already got something like this to work very well.
What you're describing — search + NN — is presently under the term "test-time compute".
The rules / dynamics / objectives of chess ( and Go ) are trivial to encode in a search formulation. I personally don't really get what that tells us about AGI.
No, I don't think that test-time compute is the same thing at all. It's a little challenging to find a definitive definition of TTC, but AFAICT it is just a fairly simple control loop around an LLM. What I'm describing is a merging of components with fundamentally different architectures, each of which is a significant engineering effort in its own right, to produce a whole that is greater than the sum of its parts. Those seem different to me, but to be fair I have not been keeping up with the latest tech so I could be wrong.
I think search is a fairly simple control loop. Beam search is an example of TTC in this modern era.
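For concreteness, here is a minimal beam search over a toy vocabulary; swap the dummy next_log_probs for a real model's log-probabilities and this is essentially the whole control loop. Everything below (vocabulary, scoring function, beam width) is illustrative.

```python
import zlib
import numpy as np

VOCAB = ["the", "cat", "sat", "on", "mat", "<eos>"]

def next_log_probs(prefix):
    """Dummy stand-in for a model call: a fixed pseudo-random distribution per prefix."""
    seed = zlib.crc32(" ".join(prefix).encode())
    logits = np.random.default_rng(seed).normal(size=len(VOCAB))
    return logits - np.log(np.exp(logits).sum())    # log-softmax

def beam_search(beam_width=3, max_len=5):
    beams = [([], 0.0)]                             # (tokens, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens and tokens[-1] == "<eos>":
                candidates.append((tokens, score))  # finished hypotheses carry over
                continue
            logp = next_log_probs(tokens)
            for i, tok in enumerate(VOCAB):
                candidates.append((tokens + [tok], score + logp[i]))
        # Keep only the top-k hypotheses: a wider beam is simply more test-time compute.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams

for tokens, score in beam_search():
    print(f"{score:7.3f}  {' '.join(tokens)}")
```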
It is a very wide term, IME, that means anything besides "one-shot through the network".
I think the point about the search formulation being amenable to domains like chess and Go, but not to other domains, is critical. If LLMs are coming up with effective search formulations for "open-ended" problems, that would be a big deal. Maybe this is what you're alluding to.
Did you name that law by yourself?
https://flownet.com/ron/papers/tla.pdf
>In retrospect, in the story of the three-layer architecture there may be more to be learned about research methodology than about robot control architectures. For many years the field was bogged down in the assumption that planning was sufficient for generating intelligent behavior in situated agents. That it is not sufficient clearly does not justify the conclusion that planning is therefore unnecessary. A lot of effort has been spent defending both of these extreme positions. Some of this passion may be the result of a hidden conviction on the part of AI researchers that at the root of intelligence lies a single, simple, elegant mechanism. But if, as seems likely, there is no One True Architecture, and intelligence relies on a hodgepodge of techniques, then the three-layer architecture offers itself as a way to help organize the mess.
I don't think people understand the point Sutton was making; he's saying that general, simple systems that get better with scale tend to outperform hand-engineered systems that don't. It's a kind of subtle point that's implicitly saying hand engineering inhibits scale because it inhibits generality. He is not saying anything about the rate, and doesn't claim LLMs/gradient descent are the best system; in fact I'd guess he thinks there's likely an even more general approach that would be better. It's comparing two classes of approaches, not commenting on the merits of particular systems.
> I don't think people understand the point Sutton was making; he's saying that general, simple systems that get better with scale tend to outperform hand-engineered systems that don't
This is your reading of Sutton. When I read his original post, I don't extract this level of nuance. The very fact that he calls it a "lesson" rather than something else, such as a "tendency", suggests Sutton may not hold the idea lightly*. In other words, it might have become more than a testable theory; it might have become a narrative.
* Holding an idea lightly is usually a good thing in my book. Very few ideas are foundational.
It occurs to me that the bitter lesson is so often repeated because it involves a slippery slope or moot-and-castle argument, i.e. the meaning people assign to the bitter lesson ranges across all of the following:
General-purpose-algorithms-that-scale will beat algorithms that aren't those
The most simple general purpose, scaling algorithm will win, at least over time
Neural networks will win
LLMs will reach AGI with just more resources
motte and bailey*
Yep, this article is self-centered and perfectly represents the type of ego Sutton was referencing. Maybe in a year or two general methods will improve the author's workflow significantly once again (e.g. better models), and they would still add a bit of human logic on top and claim victory.
The point about training data stands. We usually only think about scaling compute, but we need to scale data as well, maybe even faster than compute. But we have exhausted the source of high-quality organic text, and it doesn't grow exponentially fast.
I think at the moment the best source of data is the chat log, with 1B users and over 1T daily tokens across all LLMs. These chat logs sit at the intersection of human interests and LLM execution errors; they are on-policy for the model, exactly what it needs to improve the next iteration.
The question is when price/performance hits financial limits. That point may be close, if not already passed.
Interestingly, this hasn't happened for wafer fabs. A modern wafer fab costs US$1bn to US$3bn, and there is talk of US$20bn wafer fabs. Around the year 2000, those would have been un-financeable. It was expected that fab cost was going to be a constraint on feature size. That didn't happen.
For years, it was thought that the ASML approach to extreme UV was going to cost too much. It's a horrible hack, shooting off droplets of tin to be vaporized by lasers just to generate soft X-rays. Industry people were hoping for small synchrotrons or X-ray lasers or E-beam machines or something sane. But none of those worked out. Progress went on by making a fundamentally awful process work commercially, at insane cost.
It's hard to value $20bn in 2025 vs say $2bn for cutting edge fab in 2000. Personally I don't think CPI captures the relative value of investments well.
If you compare it as a % of GDP, or relative to M2, or as a % of the S&P 500, or even to the size of the electronics industry, it's maybe 2x-4x or something. Which is still an increase, and still a lot, but doesn't seem as crazy to me.
Another lens on it: the most valuable car company in 2000 was $50bn and in 2025 it is $1000bn, which I think says more about dollars than cars.
Fundamentally awful but spiritually delightful.
Sometimes awful is the best we have: we don't have anything that performs at a level similar to ASML's EUV machines but is much simpler or more tenable than what we have right now, right?
Perhaps we will find something better in the future, but for now awful is the best we've got for the cutting edge.
Also, when is cutting edge not the worst it's ever been?
> The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin.
Be careful when anyone, even a giant in the field such as Sutton, posits a sweeping claim like this.
My take? Sutton's "bitter lesson" is rather vague and unspecified (i.e. hard to pin down and test) for at least two reasons:
1. The word "ultimately" is squishy, when you think about it. When has enough time passed to make the assessment? At what point can we say e.g. "Problem X has a most effective solution"?
2. What do we mean by "most effective"? There is a lot of variation, including but not limited to (a) some performance metric; (b) data efficiency; (c) flexibility / adaptability across different domains; and (d) energy efficiency.
I'm a big fan of Sutton's work. I've read his RL book cover-to-cover and got to meet him briefly. But, to me, the bitter lesson (as articulated in Sutton's original post) is not even wrong. It is sufficiently open-ended that many of us will disagree about what the lesson is, even before we can get to the empirical questions of "First, has it happened in domain D at time T? Second, is it 'settled' now, or might things change?"
I believe the main problem could be reframed as an improper use of analogies. People are pushing the analogy of "artificial intelligence" and "brain", etc., creating a confusion that leads to such "laws". What we have is a situation similar to birds and planes: they do not operate under the same principles at all.
Looking at the original claim, we could take from birds a number of optimizations regarding airflow that are far beyond what any plane can do. But the impact that could be transferred to planes would be minimal compared to a boost in engine technology, which is not surprising, since the ways the two systems achieve "flight" are completely different.
I don't believe such discourse would happen at all if this was just considered to be a number of techniques, of different categories, with their own strengths and weaknesses, used to tackle problems.
Like all fake "laws", it is based on a general idea that is devoid of any time-frame prediction that would make it falsifiable. In "the short term" is beaten by "in the long run". How far is "the long run"? This is like the "mean reversion law", saying that prices will "eventually" go back to their equilibrium price; will you survive bankruptcy by the time of "eventually"?
>> People are pushing the analogy of "artificial intelligence" and "brain", etc., creating a confusion that leads to such "laws". What we have is a situation similar to birds and planes: they do not operate under the same principles at all.
But modern neural nets ARE based on biological neurons. It's not a perfect match by any means, but synaptic "weights" are very much equivalent to model weights. Model structure and size are bigger differences.
One odd thing is that progress in SAT/SMT solvers has been almost as good as progress in neural networks from the 1970s to the present. There was a time I was really interested in production rules and expert system shells. Systems in the early 1980s often didn't even use RETE and didn't have hash indexes, so of course a rule base of 10,000 rules looked unmanageable; by 2015 you could have a million rules in Drools and it worked just fine.
The difference is that SAT/SMT solvers have primarily relied on single-threaded algorithmic improvements [1] and unlike neural networks, we have not [yet] discovered a uniformly effective strategy for leveraging additional computation to accelerate wall-clock runtime. [2]
[1]: https://arxiv.org/pdf/2008.02215
[2]: https://news.ycombinator.com/item?id=36081350
RETE family algorithms did turn out to be somewhat parallelizable, enough to get a real speed-up on ordinary multicore CPUs. There was an idea in the 1980s that symbolic AI would be massively parallelizable that turned out to be a disappointment.
https://en.wikipedia.org/wiki/Fifth_Generation_Computer_Syst...
You could argue that since automatic differentiation and symbolic differentiation are equivalent, [1] symbolic AI did succeed by becoming massively parallelizable, we just needed to scale up the data and hardware in kind.
[1]: https://arxiv.org/pdf/1904.02990
> [2]
In the comments, zero_k posted a link to the SAT competition's parallel track. The 2025 results page is here: https://satcompetition.github.io/2025/results.html Parallel solvers consistently score lower (take less time) than single-threaded solvers, and solve more instances within the time limit. Probably the speedup is nowhere near proportional to the amount of parallelism, but if you just want to get results a little bit faster, throwing more cores at the problem does seem like it generally works.
> The solvers participating in this track will be executed with a wall-clock time limit of 1000 seconds. Each solver will be run on a single AWS machine of the type m6i.16xlarge, which has 64 virtual cores and 256GB of memory.
For comparison, an H100 has 14,592 CUDA cores, with GPU clusters measured in the exaflops. The scaling exponents are clearly favorable for LLM training and inference, but whether the same algorithms used for parallel SAT would benefit from compute scaling is unclear. I maintain that either (1) SAT researchers have not yet learned the bitter lesson, or (2) it is not applicable across all of AI as Sutton claims.
In my experience, there's an opposing "bitter lesson" when trying to make incremental, tactical progress in user-facing AI/ML applications: _you're not a researcher_. Stick to tried-and-true, boring ML methods that have been proven at scale, and then add human knowledge and rules to make it all work.
Then, as the article mentions, some new fundamental shift happens, and practitioners need to jump over to a completely new way of working. Monkeypatching to make it all work. Rinse repeat.
This brings about a good point:
How much of the recent bitter lesson peddling is done by compute salesmen?
How much of it is done by people who can buy a lot of compute?
Deepseek was scandalous for a reason.
I see elements of the bitter lesson in arguments about context window size and RAG. The argument is about retrieval being the equivalent of compute/search. Just improve them, to hell with all else.
However, retrieval is not just google search. Primary key lookups in my db are also retrieval. As are vector index queries or BM25 free text search queries. It's not a general purpose area like compute/search. In summary, i don't think that RAG is dead. Context engineering is just like feature engineering - transform the swamp of data into a structured signal that is easy for in-context learning to learn.
The corollary of all this is that it's not just about scaling up agents by giving them more LLMs and more data via MCP. The bitter lesson doesn't apply to agents yet.
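As a concrete example of "retrieval is more than one thing", here is a self-contained BM25 scorer of the kind that often sits alongside vector search and primary-key lookups in a RAG stack. The corpus and query are made up; k1 and b are the usual textbook defaults.

```python
import math
from collections import Counter

docs = [
    "bearing vibration spectrum shows an outer race fault",
    "model predictive control for quadrotor trajectories",
    "bm25 ranking for free text search queries",
    "primary key lookup in a relational database",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)
avg_len = sum(len(d) for d in tokenized) / N
df = Counter(term for d in tokenized for term in set(d))  # document frequency per term

def bm25_score(query, doc, k1=1.5, b=0.75):
    """Classic BM25: IDF weighting, term-frequency saturation, length normalization."""
    tf = Counter(doc)
    score = 0.0
    for term in query.split():
        if term not in tf:
            continue
        idf = math.log(1 + (N - df[term] + 0.5) / (df[term] + 0.5))
        sat = tf[term] * (k1 + 1) / (tf[term] + k1 * (1 - b + b * len(doc) / avg_len))
        score += idf * sat
    return score

query = "vibration fault spectrum"
for i in sorted(range(N), key=lambda i: bm25_score(query, tokenized[i]), reverse=True):
    print(f"{bm25_score(query, tokenized[i]):.3f}  {docs[i]}")
```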
When The Bitter Lesson essay came out, it did a bunch of important things: it addressed an audience of serious practitioners, it was contrarian and challenged entrenched dogma, and it was written without any serious reputational or (especially) financial stake in the outcome. It needed saying and it was important.
But it's become a lazy crutch for a bunch of people who meet none of those criteria, and it has been perverted into a statement more along the lines of "LLMs trained on NVIDIA cards by one of a handful of US companies are guaranteed to outperform every other approach from here to the Singularity".
Nope. Not at all guaranteed, and at the moment? Not even looking likely.
It will have other stuff in it. Maybe that's prediction in representation space like JEPA, maybe it's MCTS like Alpha*, maybe it's some totally new thing.
And maybe it happens in Hangzhou.
I'm not so sure Stockfish is a good example. The fact that it can run on an iPhone is due to Moore's law, which follows the same pattern. And DeepMind briefly taking its throne was a very good example of the Bitter Lesson.
Stockfish being so strong is not merely a result of scaling computation with search and learning. Basic alpha-beta search doesn't really scale all that well with compute: the number of nodes visited grows exponentially with the number of plies you look ahead. Additionally, alpha-beta search is not embarrassingly parallel. The reason Stockfish is so strong is that it includes pretty much every heuristic improvement to alpha-beta that's been thought of in the history of computer chess, somehow combining all of them while avoiding bugs and performance regressions. Many of these heuristics are based on chess knowledge. There is also a lot of very clever optimisation of data structures (transposition tables, bitboards) to facilitate parallel search and shave off every bit of overhead.
Stockfish is the culmination of a lot of computer science research, chess knowledge, and clever, meticulous design.
While what you mention is true, I'm not sure how it undermines the bitter lesson. Optimizing the use of hardware (which is what NNUE essentially does) is one way of "increasing compute." Also, NNUE was not a chess-specific technique; it was originally developed for Shogi.
> If AI agents can train on outputs alone, any organization that can define quality and provide enough examples might achieve similar results
Great, we're safe!
I think it's a little early (even in these AI times) to call HRM a counterexample of the bitter lesson.
I think it's quite a bit more likely for HRM to scale embarrassingly far and outstrip the tons of RLHF and distillation that's been invested in for transformers, more of a bitter lesson 2.0 than anything else.
I didn't know about HRM until this article, but it looks incredibly impressive, even in its current state. A 27M parameter model that can be trained to perform difficult domain-specific tasks with better results than flagship reasoning models? That is compelling.
The problem with the Bitter Lesson is that it doesn't clearly define what is a computational "hack" and what is a genuine architectural breakthrough. We would be nowhere without transformers, for example.
A better lesson: https://rodneybrooks.com/a-better-lesson/
Rodney Brooks's analysis, as usual, is spot-on.
TL;DR: ML using neural networks is not really replacing human knowledge with computation; human work goes into encoding data, building datasets, and designing hyperparameters.
Does anyone else see the big flaw with the chess engine analogy?
When AlphaZero came along, it blew Stockfish out of the water.
Stockfish is a top engine now because, beyond that initial proof of concept, there's no money to be made by throwing compute at chess.
I’m wondering how prophetic this is for LLMs. My hunch is a lot.
The Neuro-Symbolic approach is what the article describes, without actually naming it.
Perhaps, but I see it more as an endorsement of careful feature selection. Subject matter experts can do this, and once it's done, you can get away with a much smaller model and better price/performance.
This article focuses on the learning aspect of The Bitter Lesson. But The Bitter Lesson is about both search and learning.
This article cites Leela, the chess program, as an example of the Bitter Lesson, as it learns chess using a general method. The article then goes on to cite Stockfish as a counterexample, because it uses human-written heuristics to perform search. However, as you add compute to Stockfish's search, or spend time optimizing compute-expenditure-per-position, Stockfish gets better. Stockfish isn't a counterexample, search is still a part of The Bitter Lesson!
Stockfish is neither. It's a hybrid approach.
Well, hardware has limits, so I guess so. Humans evolved with faculties, not a massive generic organic compute engine.
> The bitter lesson is dependent on high-quality data.
Arguably, so is the alternative: explicitly embedding knowledge!
Nothing is immune to GIGO.
It would not be surprising if a bitter lesson 2.0 comes about as a bitter lesson to the bitter lesson.
Not familiar with the cited essay (added to reading list for the weekend), but the post does make some generally good points on generalization (it me) vs specialization, and the benefits of an optimized and scalable generalist approach vs a niche, specialized approach, specifically with regards to current LLMs (and to a lesser degree, ML as a whole).
Where I furrow my brow is the casual mixing of philosophical conjecture with technical observations or statements. Mixing the two all too often feels like a crutch around defending either singular perspective in an argument by stating the other half of the argument defends the first half. I know I'm not articulating my point well here, but it just comes off as a little...insincere, I guess? I'm sure someone here will find the appropriate words to communicate my point better, if I'm being understood.
One nitpick on the philosophical side of things I'd point out is that a lot of the resistance to AI replacing human labor is less to do with the self-styled importance of humanity, and more the bleak future of a species where a handful of Capitalists will destroy civilization for the remainder to benefit themselves. That is what sticks in our collective craw, and a large reason for the pushback against AI - and nobody in a position of power is taking that threat remotely seriously, largely because the owners of AI have a vested interest in preventing that from being addressed (since it would inevitably curb the very power they're investing in building for themselves).
I should know better than to say anything too enthusiastic about the humanities or feminism on this particular forum, but I just want to say the connection here to Donna Haraway was a surprise and a delight. Anyone open to that world at all would behoove themselves to check her out. "A Cyborg Manifesto" is the one everyone knows, but I recently finished "Staying with the Trouble" and can't recommend it enough!
>Donna Haraway
>Any one open to that world...
The "world" in question being a brand of Marxism that's super-explicitly anti-human. No, I'm not kidding or exaggerating.
The Bitter Lesson assumes Moore's law is alive and well. It may still be alive, but not as full of vim and vigor as it once was.
In my experience LLMs are incredibly useful from a simple text interface (I only work with text, mainly computer code). I am still reeling from how disruptive they are, in that context.
But IMO there is not a lot of money to be made for startups in that context (I expect there is not enough to justify the high valuations of outfits like OpenAI). There should be a name for the curse: revolutionary technology that makes many people vastly more productive, but with no real way to capture that value. Unless "scaling AI across the enterprise" can succeed.
I have my doubts. I am sure there will be niches, and in a decade or so, with hindsight, it will be clear what they are. But there is no reliable way to tell now.
The "Bitter Lesson" seems like a distraction to me. The fundamental problem is related: this technology is generally useful, much more than it is specifically useful.
The "killer app" is a browser window open to https://chat.deepseek.com. There is not much beyond that. Not nothing, just not much.
But so long as you have not bet your farm on "scaling AI across the enterprise" nor been fired by someone else who is trying, we should be very happy. We are in a "steam engine" moment. Nothing will ever be the same.
And if OpenAI and the like all go belly up and demote a swathe of billionaires to being merely rich, that is the cherry on top.
All links render as blue strike-through line in Firefox (underline in Chrome), hurting legibility :(
I'm getting the same effect; it seems to be the CSS property "text-underline-position: under;".
Interesting. I see underlines in Firefox, but the width of the line is 2px in Chrome and 1px in Firefox.
Fine over here using Firefox
This is about AI, despite the title being ambiguous.
Is there more than one bitter lesson?
I've learned many
I don't know, but considering I had never heard of "the bitter lesson" before reading this article I agree the title is not especially clear. Not everyone lives and breathes AI and has heard of stuff like that, after all.
The original "Bitter Lesson" essay is about machine learning, but the linked article appears to be trying to apply it to LLMs.
If we're going to be pedantic:
This is about AI; the title is ambiguous.
"Despite" is unambiguously the wrong word.
"Despite" is absolutely correct when you realize that cheating in the title is a way to make people look at articles when they would rather ignore AI in favor of actually-useful/interesting subjects.
> This views organizations as chaotic “garbage cans” where problems, solutions, and decision-makers are dumped in together, and decisions often happen when these elements collide randomly, rather than through a fully rational process
Only tangentially related, but this has to be one of the worst metaphors I’ve ever heard. Garbage cans are not typically hotbeds of chaotic activity, unless a raccoon gets in or something.