Thank you for writing a clear explanation of how false positive rates determine your minimum change threshold. I've found it surprisingly difficult to explain this to developers and QA engineers without a basic statistical background.
What makes the situation worse, which you didn't mention, is that developers like to write a suite of benchmarks, each of which can independently produce a false positive regression. So even if the FP rate of an individual benchmark is <1%, you can easily end up at a 10% FP rate for the suite as a whole if it is large enough.
I hadn't considered this, but it would be really interesting to take into account. The size of the benchmark suite directly affects the overall false positive rate: counterintuitively, the more benchmarks in the suite, the higher the chance of at least one false positive, even with very stable individual benchmarks. (Thanks, this could also make an interesting follow-up article!)
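To make the compounding concrete, here is a back-of-the-envelope sketch. It assumes the benchmarks' false positives are independent, and the suite sizes are just examples:

    # How per-benchmark false positive rates compound across a suite,
    # assuming independent benchmarks (illustrative numbers only).
    per_benchmark_fp = 0.01   # 1% false positive rate per benchmark
    for suite_size in (10, 50, 100):
        suite_fp = 1 - (1 - per_benchmark_fp) ** suite_size
        print(f"{suite_size:>3} benchmarks -> {suite_fp:.0%} chance of at least one false positive")
    # 10 benchmarks -> 10%, 50 -> 39%, 100 -> 63%

So even "steady" benchmarks at 1% each push a 100-benchmark suite to flagging a fake regression on most runs.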
I've been burnt by performance gates on GitHub Actions. One random timing spike and the whole PR turns red. The coefficient of variation math here nails why: GitHub Actions shows a 2.66% CV, which means a 2% performance gate gives you a 45% false positive rate (basically every other run flags a fake regression). No wonder developers stop trusting the check. In my experience the only way to make benchmarks actionable is to run them on deterministic bare-metal runners, whether CodSpeed's or something you host yourself.
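For anyone who wants to sanity-check that 45% figure, here is one way the numbers line up. I'm assuming roughly normal timing noise and a single measurement compared against the true value, which may not be the article's exact model:

    # Rough check of the 45% figure: probability that pure noise pushes a
    # single measurement past the 2% gate, assuming normally distributed
    # timing noise with sigma equal to the CV.
    from statistics import NormalDist

    cv = 0.0266        # GitHub Actions coefficient of variation from the article
    threshold = 0.02   # 2% performance gate
    fp_rate = 2 * (1 - NormalDist().cdf(threshold / cv))  # deviation in either direction
    print(f"{fp_rate:.0%}")  # ~45%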
Great write-up! The use of coefficient of variation to detect instability seems reasonable (a quick sketch of what I mean by CoV follows after this comment). I've done performance analysis of actual request traffic (not for benchmarking) to gauge the level of noise in a noisy-neighbor environment, and I believe I looked into using CoV for this and found it wasn't particularly reliable for my purposes.
I'd love to learn about more statistical techniques for doing this kind of analysis. For example, one thing I looked into was correlating tenants to identify a likely culprit, and it's often not just a matter of absolute request volume. If multiple tenants' latencies increase at once, it's usually because one of them started doing something, but it's hard to isolate what that is when there are many different types of workloads with unpredictable performance impact.
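Here is the minimal CoV sketch mentioned above, for readers who haven't used it as an instability signal; the 5% threshold and the samples are made up, not the article's:

    # CoV = standard deviation / mean, over a series of latency samples.
    from statistics import mean, stdev

    def coefficient_of_variation(samples):
        return stdev(samples) / mean(samples)

    latencies_ms = [12.1, 12.3, 11.9, 14.8, 12.2, 12.0]  # made-up samples
    cov = coefficient_of_variation(latencies_ms)
    print(f"CoV = {cov:.1%} ->", "unstable" if cov > 0.05 else "stable")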
I think the best way to do this would be to use something deterministic like instruction counts for the actual pass/fail. You can include wall time for information.
Yes, definitely, this is what we already do with CPU simulation. But we had many people whose benchmarks use syscalls for network, filesystem, and other resources, and we found that the only solution for those is to actually measure wall time.
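To make the split between the two metrics concrete, here is a rough sketch of gating on instruction counts while reporting wall time as context. It assumes Linux perf is available and the benchmark is a standalone command; the ./bench invocations and the 2% tolerance are hypothetical, and this is not CodSpeed's implementation:

    # Sketch: pass/fail on a deterministic metric (instruction count via
    # perf stat), with wall time reported for information only.
    import subprocess, time

    def measure(cmd):
        start = time.perf_counter()
        result = subprocess.run(["perf", "stat", "-x,", "-e", "instructions"] + cmd,
                                capture_output=True, text=True, check=True)
        wall = time.perf_counter() - start
        # perf stat writes its CSV counters to stderr; the first field is the value
        insns = next(int(line.split(",")[0])
                     for line in result.stderr.splitlines() if ",instructions" in line)
        return insns, wall

    base_insns, base_wall = measure(["./bench", "--baseline"])    # hypothetical commands
    new_insns, new_wall = measure(["./bench", "--candidate"])
    delta = (new_insns - base_insns) / base_insns
    print(f"instructions: {delta:+.2%}, wall time: {base_wall:.3f}s -> {new_wall:.3f}s (informational)")
    if delta > 0.02:   # pass/fail decided only by the deterministic metric
        raise SystemExit("possible regression")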