• tjungblut 2 days ago

    If you are curious, like me, about how the actual reinforcement learning happens: it uses verl [1] underneath. The paper "HybridFlow: A Flexible and Efficient RLHF Framework" [2] explains it really well.

    [1] https://github.com/volcengine/verl

    [2] https://arxiv.org/abs/2409.19256v2

    • anorwell 2 days ago

      Some of the comments so far seem to be misunderstanding this submission. As I understand it:

      1. Custom scaffolding (system prompt and tools) using Qwen3-32B achieved 13.75% on Terminal-Bench. No training was involved.

      2. The author has built an RL system, but it has not been used for anything due to cost limitations.

      So there's actually no result related to training here. It is well known that the scaffolding used can have a large impact on benchmark outcomes (the Terminal-Bench leaderboard also demonstrates this [1]).

      [1] https://www.tbench.ai/leaderboard

      • esafak a day ago

        It looks like the submission has two aspects that are being conflated.

        1. Tooling for training a terminal agent.

        2. An agent that was _not_ trained with this tooling but was prompt-engineered instead. I could not find the author's discussion on this point.

      • OtherShrezzing 2 days ago

        That you've spent in the low thousands (by the looks of it) and managed to beat GPT-4.1 is an amazing insight into the moat of the big AI labs.

        • rboyd 2 days ago

          Great work! There should be a way for entities to crowdfund model training. Can a model like this be partially evaluated during training, to save money through early stopping?

          What are the best papers/resources on SOTA long-horizon RL?

          Thanks.

          • TarasBob a day ago

            I'm willing to help fund this if the creator is interested. I sent him an email.

            • enigma101 2 days ago

              Did you consider a Kickstarter to overcome the GPU poorness? 30 to 50 should be doable.

              • bravesoul2 2 days ago

                Wow, amazing! Amazing that a "one person band" can do this much. It crosses many skill sets.

                • thomasfromcdnjs 2 days ago

                  How much did you spend?

                  • lostmsu 12 hours ago

                    Why do you need 50k? Can't you tune using LoRA?

                    • Danau5tin 12 hours ago

                      Exactly my first thought when I realised the cost! Currently LoRA is not supported by rLLM (the team told me they aim to support it in the next release), but it is certainly possible to port to verl directly or to another RL framework. I just did not have the time to port again (already done 2x, as other RL frameworks had issues).

                    • erdaltoprak 2 days ago

                      This is incredible work