Who Invented Backpropagation? (people.idsia.ch)
Submitted by nothrowaways 4 hours ago
  • pncnmnp 3 hours ago

    I have a question that's bothered me for quite a while now. In 2018, Michael Jordan (UC Berkeley) wrote a rather interesting essay - https://medium.com/@mijordan3/artificial-intelligence-the-re... (Artificial Intelligence — The Revolution Hasn’t Happened Yet)

    In it, he stated the following:

    > Indeed, the famous “backpropagation” algorithm that was rediscovered by David Rumelhart in the early 1980s, and which is now viewed as being at the core of the so-called “AI revolution,” first arose in the field of control theory in the 1950s and 1960s. One of its early applications was to optimize the thrusts of the Apollo spaceships as they headed towards the moon.

    I was wondering whether anyone could point me to the paper or piece of work he was referring to. There are many citations in Schmidhuber’s piece, and in my previous attempts I've gotten lost in papers.

    • drsopp 3 hours ago

      Perhaps this:

      Henry J. Kelley (1960). Gradient Theory of Optimal Flight Paths.

      [1] https://claude.ai/public/artifacts/8e1dfe2b-69b0-4f2c-88f5-0...

      • pncnmnp 2 hours ago

        Thanks! This might be it. I looked up Henry J. Kelley on Wikipedia, and in the notes I found a citation to this paper from Stuart Dreyfus (Berkeley): "Artificial Neural Networks, Back Propagation and the Kelley-Bryson Gradient Procedure" (https://gwern.net/doc/ai/nn/1990-dreyfus.pdf).

        I am still going through it, but the latter is quite interesting!

      • duped 3 hours ago

        They're probably talking about Kalman Filters (1961) and LMS filters (1960).

        • pjbk 2 hours ago

          To be fair, any multivariable regulator or filter (estimator) that has a quadratic component (LQR/LQE) will naturally yield a solution similar to backpropagation when an iterative algorithm is used to optimize its cost or error function through a differentiable tangent space.

          • bgnn 20 minutes ago

            So yeah, this was what I was thinking for a while. What about a more nonlinear estimator? Intuitively seems similar to me.

        • psYchotic 3 hours ago
          • pncnmnp 3 hours ago

            Apologies - I should have been clear. I was not referring to Rumelhart et al., but to pieces of work that point to "optimizing the thrusts of the Apollo spaceships" using backprop.

        • cubefox 2 hours ago

          > ... first arose in the field of control theory in the 1950s and 1960s. One of its early applications was to optimize the thrusts of the Apollo spaceships as they headed towards the moon.

          I think "its" refers to control theory, not backpropagation.

          • dataflow 3 hours ago

            I asked ChatGPT and it gave a plausible answer but I haven't fact checked. It says "what you’re thinking of is the “adjoint/steepest-descent” optimal-control method (the same reverse-mode idea behind backprop), developed in aerospace in the early 1960s and applied to Apollo-class vehicles." It gave the following references:

            - Henry J. Kelley (1960), “Gradient Theory of Optimal Flight Paths,” ARS Journal.

            - A.E. Bryson & W.F. Denham (1962), “A Steepest-Ascent Method for Solving Optimum Programming Problems,” Journal of Applied Mechanics.

            - B.G. Junkin (1971), “Application of the Steepest-Ascent Method to an Apollo Three-Dimensional Reentry Optimization Problem,” NASA/MSFC report.

            • throawayonthe 3 hours ago

              it's rude to show people your llm output

              • drsopp 2 hours ago

                Why?

                • danieldk 2 hours ago

                  Because it is terribly low-effort. People are here for interesting and insightful discussions with other humans. If they were interested in unverified LLM output… they would ask an LLM?

                  • drsopp 2 hours ago

                    Who cares if it is low effort? I got lots of upvotes for my link to Claude about this, and pncnmnp seems happy. The downvoted comment from ChatGPT was maybe a bit spammy?

                    • lcnPylGDnU4H9OF 2 hours ago

                      > Who cares if it is low effort?

                      It's a weird thing to wonder after so many people expressed their dislike of the upthread low-effort comment with a down vote (and then another voiced a more explicit opinion). The point is that a reader may want to know that the text they're reading is something a human took the time to write themselves. That fact is what makes it valuable.

                      > pncnmnp seems happy

                      They just haven't commented. There is no reason to attribute this specific motive to that fact.

                      • drsopp an hour ago

                        > The point is that a reader may want to know that the text they're reading is something a human took the time to write themselves.

                        The reader may also simply want information that helps them.

                        > They just haven't commented.

                        Yes, they did.

                • aeonik 2 hours ago

                  I don't think it's rude. It saves me from having to come up with my own prompt and wade through the back and forth to get useful insight from the LLMs, and it saves me from spending my tokens.

                  Also, I quite love it when people clearly demarcate which part of their content came from an LLM and specify which model.

                  The little citation carries a huge amount of useful information.

                  The folks who don't like AI should like it too, as they can easily filter the content.

            • cs702 2 hours ago

              Whatever the facts, the OP comes across as sour grapes. The author, Jürgen Schmidhuber, believes Hopfield and Hinton did not deserve their Nobel Prize in Physics, and that Hinton, Bengio, and LeCun did not deserve their Turing Award. Evidently, many other scientists disagree, because both awards were granted in consultation with the scientific community. Schmidhuber's own work was, in fact, cited by the Nobel Prize committee as background information for the 2024 Nobel.[a] Only future generations of scientists, looking at the past more objectively, will be able to settle these disputes.

              [a] https://www.nobelprize.org/uploads/2024/11/advanced-physicsp...

              • eigenspace an hour ago

                For what it's worth, it's a very mainstream opinion in the physics community that Hinton did not at all deserve a Nobel Prize in Physics for his work. But that's because his work wasn't at all impactful to the physics community.

                • Lerc an hour ago

                  I think Hinton himself has made that observation.

                  In a recent talk he made a quip that he had to change some slides because if you have a Nobel prize in physics you should at least get the units right.

                  • jimsimmons an hour ago

                    An honest person would have rejected it and protected the prize's honour

                    • pretzellogician 26 minutes ago

                      That's a joke, right? Turning down community recognition and a million dollars to make an unclear statement about which category the prize was awarded in?

                      • bgnn 23 minutes ago

                        It's up to the committee to protect that honour

                    • DalasNoin 26 minutes ago

                      At least among my friends studying physics at university, many have had some kind of ML model as part of their thesis project, like an ML model to estimate early universe background radiation. Whether that's actually useful for the field is another question.

                    • empiko 2 hours ago

                      I think the unspoken claim here is that the North American scientific establishment takes credit from other sources and elevates certain personas instead of the true innovators who are overlooked. Arguing that the establishment doesn't agree with this idea is kinda pointless.

                      • icelancer 2 hours ago

                        Didn't click the article, came straight to the comments thinking "I bet it's Schmidhuber being salty."

                        Some things never change.

                      • mindcrime 3 hours ago

                        Who didn't? Depending on exactly how you interpret the notion of "inventing backpropagation" it's been invented, forgotten, re-invented, forgotten again, re-re-invented, etc, about 7 or 8 times. And no, I don't have specific citations in front of me, but I will say that a lot of interesting bits about the history of the development of neural networks (including backpropagation) can be found in the book Talking Nets: An Oral History of Neural Networks[1].

                        [1]: https://www.amazon.com/Talking-Nets-History-Neural-Networks/...

                        • catgary 40 minutes ago

                          I think the move towards GPU-based computing is probably more significant - the constraints put in place by GPU programming (no branching, try not to update tensors in place, etc) sync up with the constraints put in place by differentiable programming.

                          Once people had a sufficiently compelling reason to write differentiable code, the frameworks around differentiable programming (theano, tensorflow, torch, JAX) picked up a lot of steam.

                          • convolvatron 3 hours ago

                            don't undergrad adaptive filters count?

                            https://en.wikipedia.org/wiki/Adaptive_filter

                            doesn't need a differentiation of the forward term, but if you squint it looks pretty close

                          • pjbk 3 hours ago

                            As it is stated, I always thought it came from formulations like Euler-Lagrange procedures in mechanics used in numeric methods for differential geometry. In fact when I recreated the algorithm as an exercise it immediately reminded me of gradient descent for kinematics, with the Jacobian calculation for each layer similar to an iterative pose calculation in generalized coordinates. I never thought it was something "novel".

                            • vonneumannstan 7 minutes ago

                              The only surprise here is that Schmidhuber himself didn't claim to invent it lol

                              • mystraline 3 hours ago

                                > BP's modern version (also called the reverse mode of automatic differentiation)

                                So... Automatic integration?

                                Proportional, integral, derivative. A PID loop sure sounds like what they're talking about.

                                • eigenspace 3 hours ago

                                  Reverse mode automatic differentiation is not integration. It's still differentiation, just a different method of calculating the derivative than the one you'd think to do by hand. It basically just applies the chain rule in the opposite order from what is intuitive to people.

                                  It has a lot more overhead than regular forwards-mode autodiff because you need to cache values from running the function and refer back to them in reverse order, but the advantage is that for functions with many, many inputs and very few outputs (e.g. the classic case of calculating the gradient of a scalar function in a high-dimensional space, as in gradient descent), it is algorithmically more efficient and requires only one pass through the primal function.

                                  On the other hand, traditional forwards mode derivatives are most efficient for functions with very few inputs, but many outputs. It's essentially a duality relationship.
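
                                  A toy sketch of that cache-then-reverse idea (my own illustration, not from the thread; the chain of functions is made up), for a simple chain of scalar ops:

                                    import math

                                    # Chain y = f3(f2(f1(x))). Forward pass caches every intermediate value;
                                    # the backward pass walks those caches in reverse, multiplying in the
                                    # local derivatives (the chain rule applied from the outside in).
                                    fs  = [math.sin, math.exp, lambda v: v * v]    # f1, f2, f3
                                    dfs = [math.cos, math.exp, lambda v: 2.0 * v]  # their derivatives

                                    def value_and_grad(x):
                                        vals = [x]
                                        for f in fs:                      # forward pass: cache intermediates
                                            vals.append(f(vals[-1]))
                                        grad = 1.0
                                        for df, v in zip(reversed(dfs), reversed(vals[:-1])):
                                            grad *= df(v)                 # reverse pass: reuse cached values
                                        return vals[-1], grad

                                    print(value_and_grad(0.5))  # value of the chain and dy/dx at x = 0.5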

                                  • stephencanon 2 hours ago

                                    I don't think most people think to do either direction by hand; it's all just matrix multiplication, you can multiply them in whatever order makes it easier.

                                    • eigenspace an hour ago

                                      I'm just talking about the general algorithm to write down the derivative of `f(g(h(x)))` using the chain rule.

                                      For vector valued functions, the naive way you would learn in a vector calculus class corresponds to forward mode AD.
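
                                      A small numpy sketch of that duality (my own example, with made-up shapes): for f(g(h(x))) with many inputs and one output, the same Jacobian product is much cheaper multiplied left-to-right (reverse mode) than right-to-left (forward mode):

                                        import numpy as np

                                        # y = f(g(h(x))), h: R^1000 -> R^100, g: R^100 -> R^10, f: R^10 -> R^1
                                        rng = np.random.default_rng(0)
                                        Jh = rng.standard_normal((100, 1000))  # dh/dx
                                        Jg = rng.standard_normal((10, 100))    # dg/dh
                                        Jf = rng.standard_normal((1, 10))      # df/dg

                                        forward = Jf @ (Jg @ Jh)  # right-to-left: 10 x 1000 intermediate
                                        reverse = (Jf @ Jg) @ Jh  # left-to-right: 1 x 100 intermediate (a row vector)
                                        print(np.allclose(forward, reverse))   # same gradient, very different cost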

                                  • digikata 2 hours ago

                                        There are large bodies of work on optimization in state space control theory that I strongly suspect have a lot of crossover with AI, and at least have very similar mathematical structure.

                                        e.g. optimization of state space control coefficients looks something like training an LLM matrix...

                                    • imtringued 3 hours ago

                                      Forward mode automatic differentiation creates a formula for each scalar derivative. If you have a billion parameters you have to calculate each derivative from scratch.

                                      As the name implies, the calculation is done forward.

                                      Reverse mode automatic differentiation starts from the root of the symbolic expression and calculates the derivative for each subexpression simultaneously.

                                      The difference between the two is like the difference between calculating the Fibonacci sequence recursively without memoization and calculating it iteratively. You avoid doing redundant work over and over again.
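
                                          The analogy in code (a toy sketch of my own, about the memoization point rather than autodiff itself): the naive recursion recomputes shared subproblems, the memoized version computes each one once:

                                            from functools import lru_cache

                                            def fib_naive(n):                    # recomputes shared subcalls over and over
                                                return n if n < 2 else fib_naive(n - 1) + fib_naive(n - 2)

                                            @lru_cache(maxsize=None)
                                            def fib_memo(n):                     # each subproblem computed exactly once
                                                return n if n < 2 else fib_memo(n - 1) + fib_memo(n - 2)

                                            print(fib_naive(25), fib_memo(25))   # same answer, wildly different amounts of work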

                                    • bjornsing 2 hours ago

                                            The chain rule was explored by Gottfried Wilhelm Leibniz and Isaac Newton in the 17th century. Either of them would have "invented" backpropagation in an instant. It's obvious.

                                      • _fizz_buzz_ 2 hours ago

                                              Funny enough, for me it was the other way around. I always knew how to compute the chain rule, but I really only understood what the chain rule means when I read up on what backpropagation was.

                                        • Lerc an hour ago

                                          That's essentially it. Learning what the chain rule does, and learning what it can be used for, and how to apply it.

                                                Neither is really an invention; they are discoveries. If anything, the chain rule leans slightly more toward invention than backprop.

                                          I understand the need for attribution as a means to track the means and validity of discovery, but I intensely dislike it when people act like it is a deed of ownership of an idea.

                                          • Jensson an hour ago

                                            You don't think the people who invented the chain rule understood what it means?

                                            • _fizz_buzz_ 20 minutes ago

                                              Obviously, Newton and Leibniz and many other Mathematicians (and other people) understood the chain rule before back propagation. But unfortunately I am very far from a Newton or Leibniz, so it took me a lot longer to grasp why the chain rule is the way it is. And back propagation just made it click for me. I was really just talking about me personally.

                                        • Anon84 2 hours ago

                                          Can we back propagate credit?

                                          • amai an hour ago

                                            Good ideas are never invented. They are always rediscovered.

                                            • caycep 2 hours ago

                                              this fight has become legendary and infamous, and also pops up on HN every 2-3 years

                                              • dicroce 2 hours ago

                                                Isn't it just kinda a natural thing once you have the chain rule?

                                                • fritzo 3 hours ago

                                                  TIL that the same Shun'ichi Amari who founded information geometry also made early advances to gradient descent.

                                                  • uoaei 2 hours ago

                                                    Calling the implementation of chain rule "inventing" is most of the problem here.

                                                      • kypro 17 minutes ago

                                                        I've always found it rather crazy that the power of backpropagation and artificial neural networks was doubted by AI researchers for so long. It's really only since the early 2010s that researchers started to take the field seriously. This is despite the core algorithm (backpropagation) being known for decades.

                                                        I remember when I learnt about artificial neural networks at university in the late 00s my professors were really sceptical of them, rightly explaining that they become hard to train as you added more hidden layers.

                                                        See, what makes backpropagation and artificial neural networks work are all of the small optimisations and algorithm improvements that were added on top of backpropagation. Without these improvements it's too computationally inefficient to be practical and you have to contend with issues like exploding gradients.

                                                        I think Geoffrey Hinton has noted a few times that for people like him, who have been working on artificial neural networks for years, it's quite surprising that today neural networks just work, because for years it was so hard to get them to do anything. In this sense, while backpropagation is the foundational algorithm, it's not sufficient on its own. It was the many improvements made on top of backpropagation that actually made artificial neural networks work and take off in the 2010s, when some of the core components of modern neural networks started to fall into place.

                                                        I remember when I first learnt about neural networks I thought maybe coupling them with some kind of evolutionary approach might be what was needed to make them work. I had absolutely no idea what I was doing of course, but I spent so many nights experimenting with neural networks. I just loved the idea of an artificial "neural network" being able to learn a new problem and spit out an answer. The biggest regret of my life was coming out of university and going into web development because there were basically no AI jobs back then, and no such thing as an AI startup. If you wanted to do AI back then you basically had to be a researcher which didn't interest me at the time.

                                                        • PunchTornado 2 hours ago

                                                          Funny that hinton is not mentioned. Like how childish can the author be?

                                                          • cma an hour ago

                                                            I tried to verify this, and it isn't true. This is one of the first footnotes:

                                                            [HIN] J. Schmidhuber (AI Blog, 2020). Critique of Honda Prize for Dr. Hinton. Science must not allow corporate PR to distort the academic record.

                                                          • cubefox 3 hours ago

                                                            See also: The Backstory of Backpropagation - https://yuxi.ml/essays/posts/backstory-of-backpropagation/

                                                            • aaroninsf 2 hours ago

                                                              When I worked on neural networks, I was taught David Rumelhart.

                                                                • dudu24 3 hours ago

                                                                  It's just an application of the chain rule. It's not interesting to ask who invented it.

                                                                  • qarl 3 hours ago

                                                                    From the article:

                                                                    Some ask: "Isn't backpropagation just the chain rule of Leibniz (1676) [LEI07-10] & L'Hopital (1696)?" No, it is the efficient way of applying the chain rule to big networks with differentiable nodes (see Sec. XII of [T22][DLH]). (There are also many inefficient ways of doing this.) It was not published until 1970 [BP1].

                                                                    • uoaei 2 hours ago

                                                                      The article says that, but it's overcomplicating to the point of being actually wrong. You could, I suppose, argue that the big innovation is the application of vectorization to the chain rule (by virtue of the matmul-based architecture of your usual feedforward network), which is a true combination of two mathematical technologies. But it feels like this, and indeed most "innovations" in ML, are only considered as such due to brainrot derived from trying to take maximal credit for minimal work (i.e., IP).