I think the author may be misunderstanding what practitioners mean when they say that "evals are all you need", because the term is overloaded.
There are generic evals (like MMLU), for which the proper term is really "benchmark".
But task-related evals, where you evaluate how a specific model/implementation performs on a task in your project, are, if not _all_ you need, at the very least the most important component by a wide margin. They do not _guarantee_ software performance the way unit tests do for traditional software. But they are the main mechanism we have for evolving a system until it is good enough to use in production. I am not aware of any workable alternative.
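To make "task-related evals" concrete, here is a minimal sketch of what one might look like; the triage task, the cases, and `call_model` below are hypothetical stand-ins, not anything from the article:

```python
# Minimal task-specific eval: small, hand-labelled cases drawn from the actual
# task, scored against whatever model/prompt/pipeline is currently in use.

def call_model(ticket_text: str) -> str:
    """Stand-in for the real system under test (LLM call, agent, pipeline...)."""
    text = ticket_text.lower()
    if "invoice" in text or "charged" in text:
        return "billing"
    if "crash" in text or "error" in text:
        return "bug"
    return "how-to"

EVAL_CASES = [
    {"input": "My invoice is wrong, I was charged twice.", "expected": "billing"},
    {"input": "The app crashes when I upload a photo.",    "expected": "bug"},
    {"input": "How do I export my data to CSV?",           "expected": "how-to"},
]

def run_eval() -> float:
    correct = sum(
        call_model(case["input"]).strip().lower() == case["expected"]
        for case in EVAL_CASES
    )
    return correct / len(EVAL_CASES)

if __name__ == "__main__":
    print(f"task accuracy: {run_eval():.0%}")
```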
> the author may be misunderstanding what practitioners mean
I'm a "practitioner", well a researcher. Most of my peers that I talk to believe evals/benchmarks are a strong indicator of real world performance as well as much more abstract notions such as ability to think or intelligence or whatever those mean. But benchmarks and evals are fairly narrow.
> They do not _guarantee_ software performance in the same way that unit tests do for traditional software.
Unit tests do not guarantee code correctness. TDD is a flawed paradigm. Tests are good, but they shouldn't be the driver nor the measure. The fatal flaw in TDD is that it is entirely dependent on your ability to foresee all possible failure points, as well as on there being no black swans. Its success is highly dependent on the person implementing it, not on the method itself. The common hubris of "this should never happen" is exactly the failure mode it cannot protect against.
> I am not aware of any workable alternative.
The great problem in ML that we're currently facing is that all measurements (of ANY kind) are proxies for the thing you wish to measure. It is easy to forget this, to believe, even with a ruler in front of you, that you are measuring meters (inches, whatever unit you want to pick). You are instead measuring increments of your measuring tape, which is hopefully well aligned with an actual meter. This is almost always "good enough", because the misalignment is less than the uncertainty of the measuring device itself. The problem is, when you get to measuring much more abstract things, it is harder to know how well you are aligned. We forget to even question this notion until long after it becomes a problem.
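A toy illustration of that proxy problem, with made-up strings: two common scoring proxies can disagree about the very same output, and neither of them is the quality you actually care about.

```python
# Two scoring proxies applied to the same (made-up) model output.
import re

reference = "Paris is the capital of France."
prediction = "The capital of France is Paris."

def exact_match(pred: str, ref: str) -> float:
    return float(pred.strip().lower() == ref.strip().lower())

def token_jaccard(pred: str, ref: str) -> float:
    p = set(re.findall(r"[a-z]+", pred.lower()))
    r = set(re.findall(r"[a-z]+", ref.lower()))
    return len(p & r) / len(p | r)

print("exact match  :", exact_match(prediction, reference))    # 0.0 -> "failure"
print("token overlap:", token_jaccard(prediction, reference))  # 1.0 -> "perfect"
# Both numbers measure increments of the measuring tape, not answer quality itself.
```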
Task specific evals are how you determine if an AI system works. Expanding your evals as you test, find new issues, and change techniques is the only way to ensure each change doesn’t regress a bunch of your past work. If you don’t build up evals, you end up stuck on some 3 year old model with mid performance, unable to update because of the risk. Without evals, your assessments are just “vibes”.
Random plug, but it seems relevant here: I launched an eval toolkit earlier today. It includes synthetic eval data generation, automated evals, a UI so everyone on the team can improve quality (QA, PM, not just data scientists), adversarial red-teaming, and human preference correlations. https://docs.getkiln.ai/docs/evaluations
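On the regression point: one lightweight way to make it concrete, independent of any particular toolkit (the file name and eval names below are made up), is to keep a baseline score per eval and fail any change that drops below it:

```python
# Sketch of an eval regression gate; baseline_scores.json and the eval names
# are hypothetical. Run it after every model/prompt/technique change.
import json
import sys

def run_all_evals() -> dict[str, float]:
    """Replace with real eval runs; returns an aggregate score per named eval."""
    return {"ticket_triage": 0.92, "summarization": 0.81}

def main(baseline_path: str = "baseline_scores.json") -> None:
    with open(baseline_path) as f:
        baseline = json.load(f)
    current = run_all_evals()
    tolerance = 0.02  # allow a little run-to-run noise
    regressions = {
        name: (baseline[name], score)
        for name, score in current.items()
        if name in baseline and score < baseline[name] - tolerance
    }
    if regressions:
        for name, (old, new) in regressions.items():
            print(f"REGRESSION {name}: {old:.2f} -> {new:.2f}")
        sys.exit(1)
    print("No regressions; the change is safe to take further.")

if __name__ == "__main__":
    main()
```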
It's an interesting article and I agree with some of the points you bring up. But here are a few I don't agree with:
1. "Evals" is used throughout the article in the sense of LLM benchmarking, but that misses the point: one can effectively evaluate any AI system by building custom evals.
2. The purpose of evals is to help devs systematically improve their AI systems (at least that's how we look at it), not any of the purposes listed in your article. It's not a one-time thing; it's a practice, like the scientific method.
This conversation always results in a ":shrug: I guess we'll never know" at the end.
There's potentially never going to be a silver bullet approach to this, or something that satisfies our need for determinism as in unit testing, but we can still try.
Would love to see as much effort put into this on the open-source framework side as is being put into agentic workflows.
I think the most important part of this article is how people focus too much on evaluating interactions with the model, and not enough on evaluating the whole system that enables a feature or workflow using an LLM.
This is totally true! I've talked with people who concluded that "the LLM is the problem", only to find that the upstream services producing the data fed into the LLM were actually the ones causing trouble.
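A sketch of the distinction (every function below is a hypothetical stand-in): an end-to-end eval exercises the upstream data fetch as well, which is exactly where that kind of bug hides if you only ever score the raw LLM call.

```python
# Hypothetical end-to-end eval: score the whole pipeline, not just the LLM call.

def fetch_customer_context(customer_id: str) -> str:
    """Upstream service call -- the kind of step that gets skipped when only the
    model interaction is evaluated, and where the real bug often lives."""
    return f"customer {customer_id}: plan=pro, open_tickets=2"

def call_llm(prompt: str) -> str:
    """Stand-in for the real model call."""
    return "You are on the Pro plan and have 2 open tickets."

def answer_question(customer_id: str, question: str) -> str:
    context = fetch_customer_context(customer_id)  # not mocked out: part of the eval
    prompt = f"Context: {context}\nQuestion: {question}\nAnswer:"
    return call_llm(prompt)

def eval_end_to_end() -> bool:
    answer = answer_question("c-42", "What plan am I on?")
    return "pro" in answer.lower()  # crude check on the feature's final output

if __name__ == "__main__":
    print("end-to-end eval passed:", eval_end_to_end())
```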
There are so many "if you do that, then it's not good" caveats in this article that it actually seems like evals can in fact be all you need, as long as you do them right? I build an LLM-based platform at work, with a lot of agents and data sources, and we still don't run into any of those "ifs".
> There are generic evals (like MMLU), for which the proper term is really "benchmark".
Exactly. Benchmarks != evals.
> I am not aware of any workable alternative.
The article doesn’t provide any alternatives?
I think there are indeed many challenges when evaluating Compound AI Systems (http://bair.berkeley.edu/blog/2024/02/18/compound-ai-systems...)
But evals in complex systems are the best we have at the moment. It’s a “best-practice” just like all the forms of testing in the “test pyramid” (https://martinfowler.com/articles/practical-test-pyramid.htm...)
Nothing is a silver bullet. Just hard won, ideally automated, integrated quality and verification checks, built deep into the system and SDLC.
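As one concrete shape for "built into the SDLC" (the names and the 0.85 threshold below are illustrative): the eval run can live next to the ordinary test suite, so CI treats a quality drop like any other failing check.

```python
# test_llm_quality.py -- illustrative check that runs in CI alongside unit tests.

def run_task_eval() -> float:
    """Replace with the real eval harness; returns an aggregate task score."""
    return 0.91

def test_task_eval_meets_threshold():
    score = run_task_eval()
    assert score >= 0.85, f"task eval score {score:.2f} fell below the 0.85 bar"
```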