I think the author may be misunderstanding what practitioners mean when they say that "evals are all you need", because the term is overloaded.
There are generic evals (like MMLU), for which the proper term is really "benchmark".
But task-related evals, where you evaluate how a specific model/implementation performs on a task in your project, are, if not _all_ you need, at the very least the most important component by a wide margin. They do not _guarantee_ software performance the way unit tests do for traditional software. But they are the main mechanism we have for evolving a system until it is good enough to use in production. I am not aware of any workable alternative.
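To make "task-related evals" concrete, here is a minimal sketch of what one might look like; the triage task, the cases, and `call_model` below are hypothetical stand-ins, not anything from the article:

```python
# Minimal task-specific eval: small, hand-labelled cases drawn from the actual
# task, scored against whatever model/prompt/pipeline is currently in use.

def call_model(ticket_text: str) -> str:
    """Stand-in for the real system under test (LLM call, agent, pipeline...)."""
    text = ticket_text.lower()
    if "invoice" in text or "charged" in text:
        return "billing"
    if "crash" in text or "error" in text:
        return "bug"
    return "how-to"

EVAL_CASES = [
    {"input": "My invoice is wrong, I was charged twice.", "expected": "billing"},
    {"input": "The app crashes when I upload a photo.",    "expected": "bug"},
    {"input": "How do I export my data to CSV?",           "expected": "how-to"},
]

def run_eval() -> float:
    correct = sum(
        call_model(case["input"]).strip().lower() == case["expected"]
        for case in EVAL_CASES
    )
    return correct / len(EVAL_CASES)

if __name__ == "__main__":
    print(f"task accuracy: {run_eval():.0%}")
```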
> the author may be misunderstanding what practitioners mean
I'm a "practitioner", well a researcher. Most of my peers that I talk to believe evals/benchmarks are a strong indicator of real world performance as well as much more abstract notions such as ability to think or intelligence or whatever those mean. But benchmarks and evals are fairly narrow.
> They do not _guarantee_ software performance in the same way that unit tests do for traditional software.
Unit tests do not guarantee code correctness. TDD is a flawed paradigm. Tests are good, but they shouldn't be the driver nor the measure. The fatal flaw in TDD is that it is entirely dependent on your ability to foresee all possible failure points, as well as on there being no black swans. Its success is highly dependent on the person implementing it, not on the method itself. The common hubris of "this should never happen" is exactly the failure mode it cannot protect against.
> I am not aware of any workable alternative.
The great problem in ML that we're currently facing is that all measurements (of ANY kind) are proxies for the thing you wish to measure. It is easy to forget this, to believe, even with a ruler in front of you, that you are measuring meters (inches, whatever unit you want to pick). You are instead measuring increments of your measuring tape, which is hopefully well aligned with an actual meter. This is almost always "good enough", because the misalignment is less than the uncertainty of the measuring device itself. The problem is, when you get to measuring much more abstract things, it is harder to know how well you are aligned. We forget to even question this notion until long after it becomes a problem.
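A toy illustration of that proxy problem, with made-up strings: two common scoring proxies can disagree about the very same output, and neither of them is the quality you actually care about.

```python
# Two scoring proxies applied to the same (made-up) model output.
import re

reference = "Paris is the capital of France."
prediction = "The capital of France is Paris."

def exact_match(pred: str, ref: str) -> float:
    return float(pred.strip().lower() == ref.strip().lower())

def token_jaccard(pred: str, ref: str) -> float:
    p = set(re.findall(r"[a-z]+", pred.lower()))
    r = set(re.findall(r"[a-z]+", ref.lower()))
    return len(p & r) / len(p | r)

print("exact match  :", exact_match(prediction, reference))    # 0.0 -> "failure"
print("token overlap:", token_jaccard(prediction, reference))  # 1.0 -> "perfect"
# Both numbers measure increments of the measuring tape, not answer quality itself.
```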
Task specific evals are how you determine if an AI system works. Expanding your evals as you test, find new issues, and change techniques is the only way to ensure each change doesn’t regress a bunch of your past work. If you don’t build up evals, you end up stuck on some 3 year old model with mid performance, unable to update because of the risk. Without evals, your assessments are just “vibes”.
Random plug, but it seems relevant here: I launched an eval toolkit earlier today. It includes synthetic eval data generation, automated evals, a UI so everyone on the team can improve quality (QA, PM, not just data scientists), adversarial red-teaming, and human preference correlations. https://docs.getkiln.ai/docs/evaluations
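On the regression point: one lightweight way to make it concrete, independent of any particular toolkit (the file name and eval names below are made up), is to keep a baseline score per eval and fail any change that drops below it:

```python
# Sketch of an eval regression gate; baseline_scores.json and the eval names
# are hypothetical. Run it after every model/prompt/technique change.
import json
import sys

def run_all_evals() -> dict[str, float]:
    """Replace with real eval runs; returns an aggregate score per named eval."""
    return {"ticket_triage": 0.92, "summarization": 0.81}

def main(baseline_path: str = "baseline_scores.json") -> None:
    with open(baseline_path) as f:
        baseline = json.load(f)
    current = run_all_evals()
    tolerance = 0.02  # allow a little run-to-run noise
    regressions = {
        name: (baseline[name], score)
        for name, score in current.items()
        if name in baseline and score < baseline[name] - tolerance
    }
    if regressions:
        for name, (old, new) in regressions.items():
            print(f"REGRESSION {name}: {old:.2f} -> {new:.2f}")
        sys.exit(1)
    print("No regressions; the change is safe to take further.")

if __name__ == "__main__":
    main()
```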
It's an interesting article and I agree with some of the points you bring up. But here are a few I don't agree with:
1. "Evals" is used throughout the article in the sense of LLM benchmarking, but that misses the point: one can effectively evaluate any AI system by building custom evals.
2. The purpose of evals is to help devs systematically improve their AI systems (at least that's how we look at it), not any of the purposes listed in your article. It's not a one-time thing; it's a practice, like the scientific method.
This conversation always results in a ":shrug: I guess we'll never know" at the end.
There's potentially never going to be a silver bullet approach to this, or something that satisfies our need for determinism as in unit testing, but we can still try.
Would love to see as much effort put into this on the open-source framework side as is being put into agentic workflows.
I think the most important part of this article is how people focus too much on evaluating interactions with the model, and not enough on evaluating the whole system that enables a feature or workflow using an LLM.
This is totally true! I've talked with people who concluded that "the LLM is the problem", only to find that the upstream services producing the data fed into the LLM were actually the ones causing trouble.
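A sketch of the distinction (every function below is a hypothetical stand-in): an end-to-end eval exercises the upstream data fetch as well, which is exactly where that kind of bug hides if you only ever score the raw LLM call.

```python
# Hypothetical end-to-end eval: score the whole pipeline, not just the LLM call.

def fetch_customer_context(customer_id: str) -> str:
    """Upstream service call -- the kind of step that gets skipped when only the
    model interaction is evaluated, and where the real bug often lives."""
    return f"customer {customer_id}: plan=pro, open_tickets=2"

def call_llm(prompt: str) -> str:
    """Stand-in for the real model call."""
    return "You are on the Pro plan and have 2 open tickets."

def answer_question(customer_id: str, question: str) -> str:
    context = fetch_customer_context(customer_id)  # not mocked out: part of the eval
    prompt = f"Context: {context}\nQuestion: {question}\nAnswer:"
    return call_llm(prompt)

def eval_end_to_end() -> bool:
    answer = answer_question("c-42", "What plan am I on?")
    return "pro" in answer.lower()  # crude check on the feature's final output

if __name__ == "__main__":
    print("end-to-end eval passed:", eval_end_to_end())
```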
There are so many "if you do that, then it's not good" caveats in this article that it actually seems like evals can in fact be all you need, as long as you do them right? I build an LLM-based platform at work, with a lot of agents and data sources, and we still don't run into any of those "ifs".
> There are generic evals (like MMLU), for which the proper term is really "benchmark".
Exactly. Benchmarks != evals.
> I am not aware of any workable alternative.
The article doesn’t provide any alternatives?
I think there are indeed many challenges when evaluating Compound AI Systems (http://bair.berkeley.edu/blog/2024/02/18/compound-ai-systems...)
But evals in complex systems are the best we have at the moment. It’s a “best-practice” just like all the forms of testing in the “test pyramid” (https://martinfowler.com/articles/practical-test-pyramid.htm...)
Nothing is a silver bullet. Just hard won, ideally automated, integrated quality and verification checks, built deep into the system and SDLC.
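As one concrete shape for "built into the SDLC" (the names and the 0.85 threshold below are illustrative): the eval run can live next to the ordinary test suite, so CI treats a quality drop like any other failing check.

```python
# test_llm_quality.py -- illustrative check that runs in CI alongside unit tests.

def run_task_eval() -> float:
    """Replace with the real eval harness; returns an aggregate task score."""
    return 0.91

def test_task_eval_meets_threshold():
    score = run_task_eval()
    assert score >= 0.85, f"task eval score {score:.2f} fell below the 0.85 bar"
```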