Great read and visuals. I think they typo'd the pun on basically/basisally.

It got me thinking about program synthesis in the following scheme: data is embedded as vectors, and program operations are metric tensors (or maybe just fields in general?) which tell the data how to move. Then, given an input/output pair, we seek some program that moves the data from input to output along a low-energy path. Model a whole program as a time-varying (t from 0 to 1) metric tensor (is that a thing?) and optimize to find such an object. Maybe you choose ahead of time the number of operations you're searching over, treat these like spline basis points, and lerp between the metric tensors of each op; or you do it continuously and then somehow recover the operations.

Then you want to find one program which satisfies multiple input/output pairs, i.e. one time-varying metric tensor (or, more generally, field) such that if you integrate from the input points they all end up at (or close to, which makes me think you want some learned metric for closeness) the output points. Right now I'm only thinking of unary ops with no constants; maybe the constants could be appended to the input data symbolically, and you'd also get to optimize that portion of the input vectors, with the constraint that it's a shared parameter across all inputs.
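A very rough sketch of the multi-pair optimization, substituting plain learned vector fields for metric tensors and pretending the hidden program is just "negate" (everything here, names included, is made up for illustration):

```python
import torch
import torch.nn as nn

# Sketch: each candidate "operation" is a small learned vector field,
# a program is a lerp between K such fields over t in [0, 1], and we
# integrate the inputs forward with Euler steps so they land near the
# outputs for every input/output pair simultaneously.

DIM, K, STEPS = 2, 4, 32   # data dim, number of ops (spline basis points), Euler steps

class OpField(nn.Module):
    """One candidate operation: a learned vector field v(x)."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.Tanh(), nn.Linear(64, dim))
    def forward(self, x):
        return self.net(x)

ops = nn.ModuleList([OpField(DIM) for _ in range(K)])

def program_field(x, t):
    """Time-varying field: linear interpolation between consecutive ops."""
    s = t * (K - 1)                # position along the "spline" of ops
    i = min(int(s), K - 2)         # left basis point
    w = s - i                      # lerp weight
    return (1 - w) * ops[i](x) + w * ops[i + 1](x)

def run_program(x0):
    """Integrate the inputs from t=0 to t=1 with forward Euler."""
    x, dt = x0, 1.0 / STEPS
    for step in range(STEPS):
        x = x + dt * program_field(x, step * dt)
    return x

# Toy input/output pairs that one shared program must satisfy.
inputs  = torch.randn(16, DIM)
targets = -inputs                  # the unknown program is "negate"

opt = torch.optim.Adam(ops.parameters(), lr=1e-2)
for _ in range(2000):
    opt.zero_grad()
    loss = ((run_program(inputs) - targets) ** 2).mean()
    loss.backward()
    opt.step()
```

The "low energy path" part isn't in there; you'd add something like a penalty on the integrated field magnitude, and going fully continuous would just mean swapping program_field for one network that takes (x, t).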
Establishing linkages between ML and differential geometry is intriguing (to say the least). But I have this nagging sense that "data manifolds" are too rigidly tied to numerical representations for this program to flourish. Differential geometry is all about invariance: geometric objects have a life of their own, so to speak, irrespective of any particular representation. In the broader data science world, such internal structure is not accessible in general; the systems modeled are too complex, and their capture in data too superficial, to be a reflection of the "true state". In a sense this is analogous to the blind men touching an elephant in different parts and disagreeing about what it is.
I'm not sure I agree that data manifolds are too rigid. When we look at score-based generative models and diffusion, we see clear evidence of how flexible these representations are. We could say the same about statistical manifolds: the fact that the Fisher information is the fundamental metric tensor on a statistical manifold is a key piece of many first- and second-order optimizers today.
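To make that concrete, here's a toy sketch of my own (not from the article) of the Fisher metric doing work inside an optimizer: a natural-gradient step on a 1-D Gaussian, where the ordinary gradient is preconditioned by the inverse Fisher information, i.e. steepest descent in the geometry of the statistical manifold rather than in raw parameter space.

```python
import numpy as np

# Toy model: N(mu, sigma^2) parameterized by theta = (mu, log_sigma).

def fisher(theta):
    """Closed-form Fisher information of N(mu, sigma^2) in (mu, log_sigma) coordinates."""
    _, log_sigma = theta
    return np.diag([np.exp(-2 * log_sigma), 2.0])

def nll_grad(theta, x):
    """Gradient of the average negative log-likelihood over samples x."""
    mu, log_sigma = theta
    sigma2 = np.exp(2 * log_sigma)
    d_mu = -np.mean(x - mu) / sigma2
    d_ls = 1.0 - np.mean((x - mu) ** 2) / sigma2
    return np.array([d_mu, d_ls])

rng = np.random.default_rng(0)
x = rng.normal(3.0, 2.0, size=1000)

theta = np.array([0.0, 0.0])
for _ in range(100):
    g = nll_grad(theta, x)
    theta = theta - 0.1 * np.linalg.solve(fisher(theta), g)   # natural-gradient step

print(theta)   # mu -> ~3, log_sigma -> ~log 2
```

This is the same preconditioning that natural-gradient methods like K-FAC approximate at neural-network scale.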
Would applying https://en.wikipedia.org/wiki/Banach_fixed-point_theorem yield interesting convergence (and uniqueness) guarantees?
The Banach fixed-point theorem is used extensively for convergence proofs in reinforcement learning, but when you operate at the level of gradient descent for deep neural networks it's difficult to apply, because most commonly used optimizers are not guaranteed to converge to a unique fixed point.
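For what it's worth, the RL usage looks roughly like this toy value-iteration sketch (MDP made up for illustration): the Bellman optimality operator is a gamma-contraction in the sup norm, so the Banach theorem gives a unique fixed point V* and iterating the operator converges to it.

```python
import numpy as np

# Tiny random MDP: 3 states, 2 actions.
rng = np.random.default_rng(0)
S, A, gamma = 3, 2, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a] is a distribution over next states
R = rng.normal(size=(S, A))                  # immediate rewards

def bellman(V):
    """T(V)(s) = max_a [ R(s, a) + gamma * sum_s' P(s' | s, a) V(s') ]."""
    return np.max(R + gamma * P @ V, axis=1)

V = np.zeros(S)
for _ in range(200):
    V_next = bellman(V)
    done = np.max(np.abs(V_next - V)) < 1e-10   # sup-norm gap shrinks by a factor of gamma
    V = V_next
    if done:
        break

print(V)   # the unique fixed point V* guaranteed by the contraction argument
```

With SGD on a deep net there is no analogous operator that is provably a contraction, which is the gap I was pointing at.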
The article seems to do the work to define a Fisher information metric space, and contractions with the Stein score, which looks like the hypothesis of the Banach fixed-point theorem, but I'm not quite sure what conclusion we would get in this instance.