GEMM has been the workhorse of machine learning. It’s amazing how we’ve ratcheted up the TFLOPS over the years.
I wonder what other algorithms lend themselves to this kind of hardware optimization.