• abeppu 2 hours ago

    but the experiments it did that "improved" validation BPB in the GH screenshot were all basically hyperparameter changes, right? So is this better or worse, either per experiment or per unit time, than hyperparameter tuning techniques that don't involve an LLM? It's not clear from this whether the LLM is more or less making random changes that sometimes work, or whether the LLM's reasoning actually finds "good" changes because of what it has internalized. E.g. how does this compare to a hyperparameter tuning pass with, say, BayesOpt that runs the same number of 5-min training experiments?
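    A minimal sketch of the fixed-budget comparison being proposed, with random search standing in for BayesOpt and a made-up U-shaped objective standing in for one 5-min training run (both are assumptions for illustration, not the actual setup):

```python
import math
import random

# Hypothetical stand-in for "run one 5-minute training experiment and report
# validation BPB"; U-shaped in log(lr), with an assumed optimum at lr = 3e-3.
def run_experiment(lr: float) -> float:
    return (math.log10(lr) - math.log10(3e-3)) ** 2 + 0.8

def random_search(budget: int, seed: int = 0) -> tuple[float, float]:
    """Fixed-budget non-LLM baseline: sample learning rates log-uniformly and
    keep the best. BayesOpt would replace the sampling with a surrogate model
    but spend the same experiment budget."""
    rng = random.Random(seed)
    best_lr, best_bpb = None, float("inf")
    for _ in range(budget):
        lr = 10 ** rng.uniform(-4, -1)  # log-uniform over [1e-4, 1e-1]
        bpb = run_experiment(lr)
        if bpb < best_bpb:
            best_lr, best_bpb = lr, bpb
    return best_lr, best_bpb

best_lr, best_bpb = random_search(budget=16)
```

    Comparing the best BPB found this way against what the agent reaches in the same 16 experiments would answer the per-experiment question directly.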

    • karpathy an hour ago

      this is very far from hyperparameter tuning in at least three important ways:

      - it can modify code arbitrarily, so the notion of a "hyperparameter" dissolves

      - there is no need to run "sweeps", the standard parallel process that wastes compute. Because LLM agents are sequential, they can use more efficient strategies such as binary search to narrow in on the right setting very quickly (many parameters have a U-shaped optimum).

      - it's fully automatic; it doesn't require a human in the loop to mess with the code.
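      The sequential narrowing in the second point can be sketched as ternary search over a unimodal response; `run_experiment` here is a hypothetical stand-in for one short training run, not code from the repo:

```python
def ternary_search(run_experiment, lo: float, hi: float, rounds: int = 6):
    """Sequentially narrow a U-shaped hyperparameter: each round runs two
    experiments and discards the third of the range that cannot contain
    the optimum."""
    for _ in range(rounds):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if run_experiment(m1) < run_experiment(m2):
            hi = m2  # the optimum cannot be right of m2
        else:
            lo = m1  # the optimum cannot be left of m1
    return (lo + hi) / 2

# Toy U-shaped loss with an assumed optimum at 0.3.
best = ternary_search(lambda x: (x - 0.3) ** 2, 0.0, 1.0)
```

      Six rounds (12 experiments) shrink the search interval to (2/3)^6, under 9% of its original width, which is the efficiency a sequential agent can get over a parallel sweep of the same budget.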

      You're right that many of the changes it makes out of the box (I intentionally did not try to prompt engineer it too hard yet, because I was curious what you get by default) seem to be tuning existing hyperparameters. Not all of the changes are like that - e.g. it tried to replace the non-linearity, etc. I will say that overall (and again, out of the box) the LLM feels unwilling to creatively pursue a research direction. The models feel very "cagey" and "scared" when they are given problems that are a little too open ended. But that's where the fun parts start, e.g. I had some early successes with the idea of a "chief scientist" that was basically a never-ending plan mode: it looked at what worked and what didn't, tried to find related code/papers, and created a long list of experiments to try, which it could then send to junior engineers running in tmux sessions. Quite a few approaches are possible, so I think it's a nice canvas. The reason we're not getting "novel research" feels like half capability issue and half skill issue.
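      A rough sketch of that "chief scientist" loop as described: a planner proposes experiments and hands each one to a worker in a detached tmux session (the idea list, script name, and flags are all hypothetical):

```python
import shlex
import subprocess

def propose_experiments(results: dict[str, float]) -> list[str]:
    """Stand-in for the never-ending plan mode: look at what has already been
    tried and queue the next couple of ideas (the list itself is made up)."""
    ideas = ["lr_3e-3", "relu_to_gelu", "wd_0.1", "warmup_2x"]
    return [idea for idea in ideas if idea not in results][:2]

def tmux_dispatch(experiment: str, dry_run: bool = True) -> list[str]:
    """Launch one 'junior engineer' run in a detached tmux session."""
    cmd = ["tmux", "new-session", "-d", "-s", experiment,
           f"python train.py --experiment {shlex.quote(experiment)}"]
    if not dry_run:  # dry_run avoids requiring tmux just to read the sketch
        subprocess.run(cmd, check=True)
    return cmd

results = {"lr_3e-3": 0.92}  # hypothetical val BPB from a finished run
queue = propose_experiments(results)
commands = [tmux_dispatch(e) for e in queue]
```

      The planner only ever sees the results dict, so it can keep extending the experiment list while the workers run independently.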

    • mikert89 an hour ago

      As AI improves, most tasks will become something like this: environments set up where the model learns through trial and error.

      Any human endeavor that can be objectively verified in an environment like this can be completely automated.

      • oezi an hour ago

        Is there an Autoresearch for Jupyter somewhere, where I point it at a Jupyter cell to improve, based on another cell that calculates the target metric?

        • falcor84 3 hours ago

          The only thing missing is for the agents to publish and peer-review their research.

          • ting0 2 hours ago

            That's a great idea.

            • whattheheckheck an hour ago

              Then you get a statistical mess of crap that takes more energy to dive into and refute...

          • AlexCoventry 3 hours ago

            Wow, Gemini suggested a very similar experiment to me yesterday. Guess I know where it got the idea from, now. :-)

            • kubb 2 hours ago

              He's burning Claude tokens to slightly improve his tiny and not-very-capable LLM? It's fun, I bet, but wake me up when it leads to a research breakthrough.

              • hustwindmaple an hour ago

                I suspect Ant is already doing this for Claude. Takes a sh*t ton of compute though.

              • lostmsu 2 hours ago

                The non-zero-based chart makes it look like it was very successful.