Unit Testing Numerical Routines (buchanan.one)
Submitted by boscillator 5 days ago
  • phdelightful an hour ago

    I have worked on Kokkos Kernels, a set of performance-portable math libraries for the Kokkos programming system. We implement BLAS, sparse matrix operations, a variety of solvers, some ODE stuff, and assorted utilities, across serial and parallel execution modes on x86, ARM, and the three major GPU vendors.

    The simplest and highest-impact tests are all the edge cases - if an input matrix/vector/scalar is 0/1/-1/NaN, that usually tells you a lot about what the outputs should be.
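
    A minimal sketch of that kind of edge-case test, assuming NumPy (my_dot is a hypothetical stand-in for the routine under test):

        import numpy as np

        def my_dot(x, y):
            # Hypothetical stand-in for the routine under test.
            return float(np.dot(x, y))

        def test_edge_cases():
            z = np.zeros(4)
            o = np.ones(4)
            assert my_dot(z, o) == 0.0        # zero vector annihilates
            assert my_dot(o, o) == 4.0        # all-ones gives the length
            assert my_dot(-o, o) == -4.0      # sign flips through
            nan = np.full(4, np.nan)
            assert np.isnan(my_dot(nan, o))   # NaN must propagate, not vanish

        test_edge_cases()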

    It can be difficult to determine sensible numerical error limits for the algorithms. The simplest example is a dot product - floating-point addition is not associative, so summing in parallel is not bitwise identical to summing serially. For dot in particular it's relatively easy to come up with an error bound, but for anything more complicated it takes a particular expertise that is not always available. This has been a work in progress, and sometimes (usually) we just pick a magic tolerance out of thin air that seems to work.
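
    For dot specifically, here is a sketch of a test against the classic a-priori forward error bound, assuming NumPy (np.dot stands in for the parallel implementation under test):

        import numpy as np

        rng = np.random.default_rng(0)
        n = 10_000
        x = rng.standard_normal(n)
        y = rng.standard_normal(n)

        # Reference in extended precision where the platform provides it,
        # result from the routine under test (np.dot as a stand-in).
        ref = np.dot(x.astype(np.longdouble), y.astype(np.longdouble))
        got = np.dot(x, y)

        # Classic a-priori bound: |error| <= n * u * sum(|x_i||y_i|),
        # with u the unit roundoff (half of machine epsilon).
        u = np.finfo(np.float64).eps / 2
        bound = n * u * np.sum(np.abs(x) * np.abs(y))
        assert abs(got - float(ref)) <= bound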

    Solvers are tested using analytical solutions and by inverting them, e.g. if we're solving Ax = y for x, then Ax should come out "close" to the original y (see the error tolerance discussion above).
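
    A sketch of that residual check, with np.linalg.solve standing in for the solver under test:

        import numpy as np

        rng = np.random.default_rng(1)
        n = 50
        A = rng.standard_normal((n, n)) + n * np.eye(n)  # well-conditioned by construction
        y = rng.standard_normal(n)

        x = np.linalg.solve(A, y)  # stand-in for the solver under test

        # Relative residual check: Ax should reproduce y up to a
        # tolerance scaled to the conditioning of the problem.
        rel_residual = np.linalg.norm(A @ x - y) / np.linalg.norm(y)
        assert rel_residual < 1e-10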

    One of the most surprising things to me is that the suite has identified many bugs in vendor math libraries (OpenBLAS, MKL, cuSPARSE, rocSPARSE, etc.). A major component of what we do is wrap these vendor libraries in a common interface so our users don't have to do any work when they switch supercomputers, so in practice we test them all pretty thoroughly as well. Maybe I can let OpenBLAS off the hook due to the wide variety of systems they support, but I expected the other vendors to do a better job since they're better-resourced.

    For this reason we find regression tests to be useful as well.
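
    A regression test might, for instance, pin outputs to a stored golden file with a deliberately loose tolerance; a sketch where routine_under_test and the file name are hypothetical:

        import numpy as np

        def routine_under_test(x):
            # Hypothetical routine whose output we want to pin down.
            return np.cumsum(np.sin(x))

        def test_against_golden(path="golden_cumsum_sin.npy"):
            x = np.linspace(0.0, 10.0, 1000)
            got = routine_under_test(x)
            try:
                golden = np.load(path)
            except FileNotFoundError:
                np.save(path, got)  # first run records the reference
                return
            # Loose tolerance: we want to catch real regressions,
            # not bit-level differences from reordered arithmetic.
            np.testing.assert_allclose(got, golden, rtol=1e-12, atol=1e-12)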

    • alexpotato 35 minutes ago

      Mentioning this b/c it was something that surprised me when I first heard it:

      Many Python numerical libraries change how they internally represent certain data types and/or change internal algorithms from version to version. This means that running the exact same code on two different versions may give slightly different results.

      In a side project this may not be a big deal, but if you are working at, say, a hedge fund or a company with long and complicated data processing pipelines, then version upgrades with no unit testing can be "very bad".
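
      One way to write tests that survive such upgrades is to assert closeness against a high-precision reference instead of bitwise equality; a sketch assuming NumPy:

          import math
          import numpy as np

          def test_sum_is_close_not_bitwise():
              rng = np.random.default_rng(42)
              x = rng.standard_normal(1_000_000)
              got = float(np.sum(x))  # pairwise summation inside NumPy
              ref = math.fsum(x)      # exactly rounded reference sum
              # np.sum's result may differ between versions/hardware as
              # internal blocking changes, so assert closeness, not ==.
              assert math.isclose(got, ref, rel_tol=1e-10, abs_tol=1e-7)

          test_sum_is_close_not_bitwise()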

      • atoav 11 minutes ago

        If you are working at a hedge fund you probably (hopefully?) know it isn't a good idea to represent monetary values using floats. And your numerics library should be checked for correctness on each commit, just in case.
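
        The classic illustration, using Python's decimal module:

            from decimal import Decimal

            # Binary floats cannot represent 0.10 exactly.
            assert 0.1 + 0.2 != 0.3
            assert Decimal("0.10") + Decimal("0.20") == Decimal("0.30")

            # Money as exact decimals, rounded only at well-defined points.
            price = Decimal("19.99")
            total = (price * 3).quantize(Decimal("0.01"))
            assert total == Decimal("59.97")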

      • physicsguy 2 hours ago

        All good stuff. I’d add, though, that for many numerical routines you end up testing on simple, well-known systems that have well-defined analytic answers.

        So, for example, if you’re writing solvers for ODE systems, you tend to test them on things like the simple harmonic oscillator (see the sketch below).

        If you’re very lucky, some algorithms have bounded errors. The fast multipole method for evaluating long-range forces does, for example, so you can check the bound on very simple systems.
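
        For the harmonic-oscillator case above, a sketch assuming SciPy's solve_ivp as the integrator under test:

            import numpy as np
            from scipy.integrate import solve_ivp

            # Simple harmonic oscillator x'' = -x as a first-order system.
            def sho(t, state):
                x, v = state
                return [v, -x]

            t_eval = np.linspace(0.0, 10.0, 201)
            sol = solve_ivp(sho, (0.0, 10.0), [1.0, 0.0], t_eval=t_eval,
                            rtol=1e-10, atol=1e-12)

            # Analytic solution with x(0)=1, v(0)=0 is cos(t); the
            # integrator should match it to roughly the requested tolerance.
            np.testing.assert_allclose(sol.y[0], np.cos(t_eval), atol=1e-6)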

        • LeanderK 32 minutes ago

          yeah that's how I do it. You start simple (known solutions at trivial points), then easy cases with analytic solutions, and then just some random inputs where you test that a solution is reached without errors and makes sense (correct sign etc.) - what the article calls property-based tests.
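
          A sketch of such a property-based test using the hypothesis library, checking sign, symmetry, and homogeneity (np.linalg.norm stands in for the routine under test):

              import numpy as np
              from hypothesis import given, strategies as st

              # Bounded floats: hypothesis excludes NaN/inf automatically.
              finite = st.floats(min_value=-1e6, max_value=1e6)

              @given(st.lists(finite, min_size=1, max_size=100))
              def test_norm_properties(values):
                  x = np.array(values)
                  n = np.linalg.norm(x)
                  assert n >= 0.0                           # correct sign
                  assert np.isclose(np.linalg.norm(-x), n)  # symmetry
                  # Homogeneity: scaling the input scales the output.
                  assert np.isclose(np.linalg.norm(2.0 * x), 2.0 * n, rtol=1e-12)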

        • amelius 2 hours ago

          If you have a numerical routine that converts from one representation to another, you can test it by composing it with the routine implementing the inverse conversion and checking that you get the original input back (within a tolerance).
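
          A round-trip sketch with a hypothetical Cartesian/polar conversion pair:

              import numpy as np

              def to_polar(x, y):
                  return np.hypot(x, y), np.arctan2(y, x)

              def to_cartesian(r, theta):
                  return r * np.cos(theta), r * np.sin(theta)

              rng = np.random.default_rng(7)
              x = rng.standard_normal(1000)
              y = rng.standard_normal(1000)

              # Round trip: converting there and back should land within
              # a few ulps of the original inputs.
              r, theta = to_polar(x, y)
              x2, y2 = to_cartesian(r, theta)
              np.testing.assert_allclose(x2, x, rtol=1e-12, atol=1e-14)
              np.testing.assert_allclose(y2, y, rtol=1e-12, atol=1e-14)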

          • Arech 17 minutes ago

            Honestly, I was expecting to see suggestions for testing numerical stability. This can be a super annoying issue when the data is just right to hit certain peculiarities of the floating-point representation.
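
            A classic instance is catastrophic cancellation in the quadratic formula; a sketch of a stability test comparing the naive form with the standard stable rearrangement:

                import math

                def roots_naive(a, b, c):
                    d = math.sqrt(b * b - 4 * a * c)
                    return (-b + d) / (2 * a), (-b - d) / (2 * a)

                def roots_stable(a, b, c):
                    d = math.sqrt(b * b - 4 * a * c)
                    q = -0.5 * (b + math.copysign(d, b))
                    return q / a, c / q  # avoids subtracting nearly equal numbers

                # When b*b >> 4*a*c, the small root cancels catastrophically
                # in the naive form.
                a, b, c = 1.0, 1e8, 1.0
                small_naive = roots_naive(a, b, c)[0]
                small_stable = roots_stable(a, b, c)[1]
                exact = -1e-8  # to first order, the small root is ~ -c/b
                assert abs(small_stable - exact) / abs(exact) < 1e-8
                assert abs(small_naive - exact) / abs(exact) > 1e-4  # digits lost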