Detecting a 0.005% regression means detecting that a 20s task now takes 20.001s.
It's not even easy to reliably detect such a small performance regression on a single thread on a single machine.
I suppose in theory having multiple machines could actually improve the situation, by letting them average out the noise? But on the other hand, it's not like you have identically distributed samples to work with - workloads have variance over time and space, so there's extra noise that isn't uniform across machines.
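Back-of-the-envelope (all numbers made up, just to illustrate the averaging question): with, say, 5% run-to-run noise on a 20s task, the standard error of the mean only drops below the 1ms shift you're looking for after millions of samples per side, and that's assuming the samples really are i.i.d.:

```python
import random

# Back-of-the-envelope simulation (all numbers made up): with 5% run-to-run
# noise on a 20s task, how many i.i.d. samples per side does it take before
# the noise floor drops below a 0.005% (1 ms) shift?
random.seed(0)
baseline = 20.0           # seconds
shift = baseline * 5e-5   # 0.005% regression -> 1 ms
sigma = baseline * 0.05   # assumed 5% per-sample noise

for n in (1_000, 100_000, 1_000_000):
    before = sum(random.gauss(baseline, sigma) for _ in range(n)) / n
    after = sum(random.gauss(baseline + shift, sigma) for _ in range(n)) / n
    noise_floor = sigma * (2 / n) ** 0.5   # std error of the difference of means
    print(f"n={n:>9,}  observed diff={after - before:+.4f}s  "
          f"noise floor~{noise_floor:.4f}s  true shift={shift:.4f}s")
# Even at a million samples per side the noise floor (~1.4 ms) is still above
# the 1 ms shift, and real fleet samples aren't i.i.d. to begin with.
```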
Color me a little skeptical, but it's super cool if actually true.
The 0.005% is a bit of a headline-grabber for sure, but the idea makes sense given the context: they're monitoring large, multi-tenant/multi-use-case systems with very large amounts of diverse traffic. In these cases the regression may be 0.005% of the overall cost, but you don't detect it at that level; you detect it as a 0.5% regression in a use case that accounts for 1% of the cost. They can and do slice the data in various ways (group by endpoint, group by function, etc.) to improve the odds of catching a regression.
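A quick sketch of the arithmetic behind that (the 0.5%/1% numbers are the ones from the comment above; the setup is otherwise hypothetical):

```python
# A 0.5% regression confined to a use case that accounts for 1% of total cost
# is only a 0.005% aggregate regression, but it's perfectly visible once you
# slice by use case.
total_cost = 1_000_000            # arbitrary units of fleet-wide CPU time
slice_share = 0.01                # affected use case: 1% of total cost
slice_regression = 0.005          # that use case got 0.5% slower

extra_cost = total_cost * slice_share * slice_regression
aggregate_regression = extra_cost / total_cost

print(f"per-slice regression:  {slice_regression:.3%}")      # 0.500%
print(f"aggregate regression:  {aggregate_regression:.5%}")  # 0.00500%
```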
You’re right to be skeptical. This entire space, as far as I can tell, is filled with people who overpromise, under-deliver, and use bad metrics to claim success. If you look at their false positive and false negative sections, they perform terribly, but they use wording to claim it's actually good and use flawed logic to extrapolate over missing data (e.g., assuming their rates stay the same for non-responses rather than "people are tired of our tickets and ignore our system"). And as follow-up work, their solution is to keep tuning their parameters (i.e., keep fiddling to overfit past data). You can even tell how the system is perceived: they describe people not even bothering to interact with it during 2 of the 4 high-impact incidents examined, and they blame the developer for one of them because "they didn't integrate the metrics". If a system provided a meaningful cost/benefit, teams would be clamoring to adjust their processes around it. Until demonstrated clearly otherwise, it's dressed-up numerology.
I saw a team at Oculus fail to do this even in highly constrained, isolated environments with repeatable workloads and a much more conservative threshold (e.g., 1-10%). This paper is promulgating filtering your data all to hell, to the point of overfitting.
Thanks, I hadn't read that. But I have to say I had what I feel is a good reason to be skeptical before even reading a single word of the paper, honestly: Facebook never felt so... blazing fast, shall we say, as to make me believe anyone there even wanted to pay attention to tiny performance regressions, let alone had the drive and tooling to do so.