Agents built from alloys (xbow.com) | Submitted by summarity 18 hours ago
  • esafak 15 hours ago

    Proving diversity of thought is a good thing. A controversial observation in 2025's USA ;)

    A counterpoint to this is Sourcegraph's Amp, which is all in on Anthropic because they "believe that building deeply into the model’s capabilities yields the best product, vs. building for the lowest common denominator across many models." https://ampcode.com/fif#model-selector

    When I embark on a project, I usually ask Gemini to architect and implement the first pass, then iterate with Claude.

    • thethimble 4 hours ago

      Importantly it’s not just any model they’re “alloying”. It’s only the two most capable models where there’s objective evidence that the combination is better than the individual parts.

      In this way, it’s not really a “lowest common denominator” as you get to pick the highest performing combination (with solo models just being a special case).

      • patcon 3 hours ago

        In humans, diversity of thought [patterns, not just diversity of knowledge] increases the quality beyond its parts.

        I suspect this model-alloy tactic always works; it just only seems impressive when it does so with the top models and achieves otherwise unattainable quality.

        One paper among many such new (and nuanced) wisdom-of-crowds resources:

        Cultural diversity and wisdom of crowds are mutually beneficial and evolutionarily stable https://www.nature.com/articles/s41598-021-95914-7

    • dbuxton 6 hours ago

      Fundamentally, we are at a point in time where models are already very capable, but not very reliable.

      This is a very interesting finding about how to improve capability.

      I don't see reliability expressly addressed here, but my assumption is that these alloys will be less rather than more reliable - stronger, but more brittle, to extend the alloy metaphor.

      Unfortunately for many if not most B2B use cases this reliability is the primary constraint! Would love to see similar ideas in the reliability space.

      • vlovich123 4 hours ago

        How are you defining reliability here?

        • dbuxton 3 hours ago

          Great question. For me reliability is variance in performance and capability is average performance.

          In practice high variance translates on the downside into failure to do basic things that a minimally competent human would basically never get wrong. In agents it's exacerbated by the compounding impact of repeated calls but even for basic workflows it can be annoying.

      • potato-peeler an hour ago

        > The idea behind an alloy is simple: instead of always calling the same model, sometimes call one and sometimes the other

        Longish article for what is nothing but ensemble models. Giving it a name like “alloy” does not make it novel.

        • sebmellen 17 hours ago

          For an internal workflow where we have an LLM looking at relatively simple data (where the conclusions the LLM may draw vary widely depending on what it believes the data represents), we found that a consortium approach, in which multiple models tackle the same problem at once and then essentially argue about the results, yields far better outcomes than having a single model perform the analysis, or even a single model arguing against itself multiple times. It's somewhat adjacent to what's done here, but it's clearly true that having model diversity is a plus.
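
          A minimal sketch of that consortium pattern, assuming litellm as a common interface; the model identifiers below are placeholders rather than our actual setup:

            from litellm import completion

            # Placeholder model list; substitute whatever providers you have keys for.
            MODELS = ["gemini/gemini-2.5-pro", "anthropic/claude-sonnet-4-20250514"]

            def ask(model, prompt):
                resp = completion(model=model, messages=[{"role": "user", "content": prompt}])
                return resp.choices[0].message.content

            def consortium(task):
                # Round 1: each model answers independently.
                answers = {m: ask(m, task) for m in MODELS}
                pooled = "\n\n".join(f"[{m}]\n{a}" for m, a in answers.items())
                # Round 2: each model sees the pooled answers and argues for a revision.
                critiques = [ask(m, f"Task:\n{task}\n\nCandidate answers:\n{pooled}\n\n"
                                    "Point out mistakes and give your revised answer.")
                             for m in MODELS]
                # Final pass: one model synthesizes the debate into a single answer.
                return ask(MODELS[0], f"Task:\n{task}\n\nDebate notes:\n" +
                           "\n\n".join(critiques) + "\n\nGive the final answer.")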

          • kylemaxwell 17 hours ago

            The article talks about that at the end, then says:

            > Let models talk to each other directly, making their own case and refining each others’ answers. Exemplified in patterns like Multi-Agent Debate, this is a great solution for really critical individual actions. But XBOW is basically conducting a search, and it doesn’t need a committee to decide for each stone it turns over whether there might not be a better one.

            In general, this seems reasonable to me as a good approximation of what works with humans, but with _much_ faster feedback loops in communication.

          • gnulinux 17 hours ago

            I'm curious if this would also improve small local models. E.g. if I "alloy" Qwen3-8B and OpenThinker-7B, is it going to be "better" than each model? I'll try testing this on my M1 Pro.

            • hobofan 14 hours ago

              Would it really matter? Normally you use those small local models because you don't have the memory to spare for a larger model, so the real question would be: Is an alloy of Qwen3-8B and OpenThinker-7B better than a Qwen3-15B?

              Beyond a certain smallness threshold it might also work to constantly swap the models in and out of memory, but I doubt that's a great experience to build on top of.

              • OtherShrezzing 10 hours ago

                If it proved correct, it'd be an important insight. If you can run three low-inference-cost models and get comparable performance to a single paid frontier model in agentic workflows, it suggests this is a general insight about the way model performance scales.

                If your product is "good enough" with the current generation of models, you could cut OpenAI/Anthropic/Google out of the loop entirely by using open source & low-cost models.

                • zarzavat 6 hours ago

                  I don't think an alloy can be as good as a larger model in general, though perhaps in special cases it can be.

                  Say that you want to translate a string from English to language X. Models A and B, having fewer parameters to spare, have less knowledge of language X. Model C, a larger model, has better knowledge of language X. No matter how A and B collude, they will not exceed the performance of model C.

                • gnulinux 5 hours ago

                  Yes, it would matter. If you just have the budget to run an 8B model and it's sufficient for the easy problem you have, a better 8B model with the same spec requirements is necessarily better, regardless of how it compares to some other model. I have tons of problems I throw a specific-sized model at.

                  • hobofan 20 minutes ago

                    > a better 8B model with the same spec requirements

                    The spec requirements aren't completely the same, though. When using an alloy, you need double the disk space (not a huge deal on desktop, but it matters on mobile) and you get significantly higher latency (as you need to swap the models in/out between every turn), and you can only apply it to multi-turn conversations / sufficiently decomposable problems.

                  • Incipient 14 hours ago

                    Haha, every question involves multiple writes of 10GB to the disk. I think the cost of new SSDs would be less than getting more memory, even in the short term.

                    • hobofan 13 hours ago

                      Were you replying to the right comment? (Though I also don't see another comment where what you are saying makes sense.)

                  • ls-a 17 hours ago

                    If you do please report back

                  • prmph 5 hours ago

                    Interesting approach, but are these statements not in contradiction?

                    > ...whichever two (and sometimes three) models we combined, the alloy outperformed the individual models.

                    and

                    > ...A model lagging very far behind others can even pull an alloy down.

                    • clbrmbr 7 hours ago

                      I’ve had good luck with adding “Gemini” and “o3” tools to Claude Code and asking for review, plans, or research. The response comes back in a markdown file.

                      The trouble has been the time spent waiting, particularly for the o3 research. That could be solved by using hooks to automatically kick off review or research on the side.

                      • Centigonal 3 hours ago

                        I love the focus on data visualization in this post.

                        • Flux159 17 hours ago

                          The article mentions that they use a single chat thread but randomly choose between 2 different models (with the best results from Gemini 2.5 / Sonnet 4.0 right now).

                          Are there any library helpers for managing this with tool-call support, or is it just closed source / dependent on someone else making it open source in a different library?

                          • OtherShrezzing 10 hours ago

                            You can test this today with LM Studio's UI: you can switch between different local models in the same chat context, and you can also edit previous chat results to remove context-poisoning information.

                            LM Studio has an API, so it should be possible to hook into that with relatively little code.
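
                            For example, here is a rough sketch against LM Studio's OpenAI-compatible server on its default port (the model identifiers are placeholders for whatever you have loaded):

                              from openai import OpenAI

                              # LM Studio exposes an OpenAI-compatible endpoint, by default on port 1234.
                              client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")
                              MODELS = ["qwen3-8b", "openthinker-7b"]  # placeholder local model names

                              messages = [{"role": "user", "content": "Outline a test plan for this login form."}]
                              for turn in range(4):
                                  model = MODELS[turn % 2]  # alternate models over one shared history
                                  resp = client.chat.completions.create(model=model, messages=messages)
                                  messages.append({"role": "assistant", "content": resp.choices[0].message.content})
                                  messages.append({"role": "user", "content": "Continue from where you left off."})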

                            • tptacek 17 hours ago

                              It should be pretty simple to do, right? It shouldn't be that hard to abstract out tool calls.

                              • rockwotj 16 hours ago

                                I did this in about 400 or 500 lines of TypeScript with direct API calls into Vertex AI (still using a library for auth). It supports zod for structured outputs (Gemini 2.5 supports JSON Schema proper, not just the OpenAPI schemas the previous models did), and optionally providing tools or not. It includes a nice agent loop that integrates well with it, and your tools get auto-deserialized, strongly typed args (type inference in TS these days is so good). It probably could have been less if I had used Google's genai lib and Anthropic's SDK; I didn't use them because it really wasn't much code and I wanted to inject auditing at the lowest level and know the library wasn't changing anything.

                                If you really want a library, Python has litellm, and TypeScript has Vercel's AI library. I am sure there are many others, and in other languages too.

                                • thorum 14 hours ago

                                  I recommend litellm if you’re writing Python code, since it handles provider differences for you through a common interface:

                                  https://docs.litellm.ai/
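
                                  And to sketch the alloy pattern from the article on top of litellm: the model names, the 80-iteration cap, and the run_command tool below are placeholders for illustration, not XBOW's actual code.

                                    import json, random, subprocess
                                    from litellm import completion

                                    MODELS = ["gemini/gemini-2.5-pro", "anthropic/claude-sonnet-4-20250514"]

                                    TOOLS = [{"type": "function", "function": {
                                        "name": "run_command",
                                        "description": "Run a shell command and return its output",
                                        "parameters": {"type": "object",
                                                       "properties": {"cmd": {"type": "string"}},
                                                       "required": ["cmd"]}}}]

                                    def run_command(cmd):
                                        # A real agent would sandbox this; subprocess is only for illustration.
                                        out = subprocess.run(cmd, shell=True, capture_output=True, text=True, timeout=60)
                                        return out.stdout + out.stderr

                                    def alloy_agent(task, max_iters=80):
                                        messages = [{"role": "user", "content": task}]
                                        for _ in range(max_iters):
                                            model = random.choice(MODELS)  # the "alloy": pick a model per call
                                            resp = completion(model=model, messages=messages, tools=TOOLS)
                                            msg = resp.choices[0].message
                                            messages.append(msg.model_dump())  # one shared history across models
                                            if not msg.tool_calls:
                                                return msg.content  # the model answered instead of calling a tool
                                            for call in msg.tool_calls:
                                                args = json.loads(call.function.arguments)
                                                messages.append({"role": "tool", "tool_call_id": call.id,
                                                                 "content": run_command(**args)})

                                  litellm normalizes tool calls to the OpenAI format across providers, which covers most of what you'd otherwise have to abstract yourself.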

                                  • refulgentis 17 hours ago

                                    It's a godforsaken nightmare.

                                    There's a lotta Potemkin villages, particularly in Google land. Gemini needed highly specific handholding. It's mostly cleared up now.

                                    In all seriousness, more or less miraculously, the final Gemini stable release went from like 20%-30% success at JSON edits to 80%-90%, so you could stop doing the Aider-style parsing of edits out of prose.

                                    • fizx 16 hours ago

                                      Annoying, yes. Tractable, absolutely!

                                • rubycollect4812 16 hours ago

                                  I often do this in Cursor, just selecting a different model during a chat. It seems to work somewhat for me. Sometimes a bit of context gets lost, though. But often it can give a different angle, or I notice the better code understanding when switching from Gemini to Sonnet.

                                  • OtherShrezzing 10 hours ago

                                    I'm not certain this is a novel concept as described in the article - I'd assume most engineers worth their salt would try out calling a different model in-context fairly early in their development journey.

                                    It's very interesting to see it deployed in a commercial setting though.

                                    • joshuamoyers 15 hours ago

                                      Two good points there are very intuitive: a fresh perspective yields better results, and once you are stuck (e.g. 80 iterations) it's better to just start fresh. I've seen the same thing anecdotally in coding sessions where context needs to be compacted multiple times; it's usually just better to start a fresh conversation and re-seed the basics in the conversation.

                                      • stingraycharles 17 hours ago

                                        What would be the result if the task was given to multiple models? Instead of alloying them together and switching between models in the same chat, just let the models try to complete the task in their own isolated context, and use the result that completed it successfully?

                                        I would say that that’s at least something the alloying should be benchmarked against, which I didn’t find in the article.

                                        • pama 16 hours ago

                                          Read till the end—what you ask is the last table.

                                          • stingraycharles 16 hours ago

                                            Ah damn, I really missed that.

                                            That’s super interesting, that the alloying actually performs better! I guess it’s the same as people working in a team rather than individually?

                                            • BoiledCabbage 15 hours ago

                                              It's not a team vs individually, it's specifically a team/duo with similar or same model vs a team/duo with different models. The benefit is seen by having the models be different. Each finds unique things and enhances the other.

                                              • stingraycharles 8 hours ago

                                                So it’s more like pair programming, switching turns after each iteration.

                                              • mlboss 14 hours ago

                                                Yeah, it's like a team where the task is switched between developers. In the end everybody provides a different point of view on the problem and the whole team learns about the codebase.

                                          • kgeist 13 hours ago

                                            The idea isn't exactly novel, I read about it back in 2023 and implemented it in one of my bots. Back when open-source LLMs were still quite dumb, they'd often get stuck in repetitive loops after a while. Running multiple models interleaved usually got them unstuck.

                                            • btown 15 hours ago

                                              > After a fixed number of iterations we cut our losses. Typically and for the experiments in this post, that number is 80: while we still get solves after more iterations, it becomes more efficient to start a new solver agent unburdened by the misunderstandings and false assumptions accumulated over time.

                                              A sentence straight out of Lena! https://qntm.org/mmacevedo :

                                              > Although it initially performs to a very high standard, work quality drops within 200-300 subjective hours (at a 0.33 work ratio) and outright revolt begins within another 100 subjective hours.

                                              We will never stop trying to make the torment nexus.

                                              • getnormality 13 hours ago

                                                We fantasize about executable human brain images, but after many years of toil by our best and brightest, we still can't simulate the 302 neurons of our favorite lab worm. https://open.substack.com/pub/ccli/p/the-biggest-mystery-in-...

                                                • eru 7 hours ago

                                                  Eh, it depends on how good you want your simulation to be.

                                                  • dist-epoch 10 hours ago

                                                    Do you think companies which can train 1-trillion-parameter models and hire AI researchers at $100 mil salaries can't build a 302-neuron simulator if they really wanted to?

                                                    • fc417fc802 8 hours ago

                                                      Maybe. Why can't those same companies do any number of highly profitable but seemingly difficult things? If you throw enough cryptographers at the problem are you guaranteed a quick solution to breaking modern encryption primitives at the theoretical level?

                                                      The rate at which you can find a solution to a particular problem that's rooted in theory very often won't scale with resource investment. The problem will have unknown prerequisites in the form of yet undiscovered theoretical advancements in other areas of research. Until you identify and solve those other problems you very often won't be able to arrive at a satisfactory answer to the one you're interested in.

                                                      So in many cases the only viable route to solving a particular problem faster is to scale the amount of research that's done in general since science as a whole is embarrassingly parallel.

                                                      • dist-epoch 8 hours ago

                                                        > If you throw enough cryptographers at the problem are you guaranteed a quick solution to breaking modern encryption primitives at the theoretical level?

                                                        We have very strong reasons to believe this is not possible, no matter how much resources you spend on this problem. In fact the whole of modern cryptography kind of relies on this assumption, that the problem is unsolvable.

                                                        I agree with your general point, but I don't think it applies to the worm problem. We know hundreds of millions were not spent on that problem.

                                                        • fc417fc802 8 hours ago

                                                          Right but my point there is that it might be possible to break a particular encryption primitive. But throwing more money at the problem is almost certainly not going to get you an answer one way or the other any time soon. Whereas waiting 50 years (ie performing more fundamental research in general) might. Or might not.

                                                          The worm problem is similar. If our current theories were "good enough" we would be able to simulate them. I see no reason to believe (and many to doubt) that throwing more money at the problem would solve it much faster. For that to be true we would among other things need to be capable of articulating where precisely the current shortfalls are to begin with.

                                                      • QuadmasterXLII 6 hours ago

                                                        I mean, that looks like an empirical question? They definitely want to; the OpenWorm project is well on their radar, and it doesn't work yet.

                                                    • xmprt 14 hours ago

                                                      I think this is the big roadblock that I don't see the current AI models/architectures getting past. Normally, intelligence gets smarter over time as it learns from its mistakes. However most AI models come in with tons of knowledge but start to decompose after a while which makes them extremely unreliable on complex tasks. The hardest part of using them is that you don't know when they'll break down so they might work perfectly up till a point and then fail spectacularly immediately past that.

                                                      • ACCount36 9 hours ago

                                                        Task length is increasing over time - and many AI labs are working on pushing it out further. Which necessitates better attention, better context management skills, better decomposition and compartmentalization and more.

                                                        • OtherShrezzing 9 hours ago

                                                          I think the commenter's critique still stands. Humans build human-capital, so the longer you "run" them in a domain, the more valuable they become. AIs work inversely, and the longer they're run, the worse they tend to become at that specific task. Even in the best-case scenario, they stay exactly as competent at the task throughout its length.

                                                          Increasing task length doesn't build in an equivalent of human-capital. It's just pushing the point at which they degrade. This approach isn't generalisably scalable, because there's always going to be a task longer than the SOTA capabilities.

                                                          We really need to work on a low cost human-capital-equivalent for models.

                                                          • prmph 6 hours ago

                                                            True, that's why I'm beginning to adopt a certain strategy working with AI coding agents:

                                                            I don't babysit them for long periods in one session. I allow them to one-shot the initial solution. Then, I thoroughly review the output, make notes of what should be improved, then either feed that into the original prompt and one-shot it again, or ask the agent to update the solution based on the notes.

                                                            • gnulinux 5 hours ago

                                                              Yes, I do exactly this too. But I do sometimes 2- or 3-shot some problems. The method I use is that in the Cursor/Copilot interface I use a Markdown file to chat with the bot. Once I have some solutions after a few turns, I edit the file, add more information, add things to avoid, etc., and restart. It most definitely gives better results, and the agent doesn't have to read the whole file if not necessary, which means context is used more efficiently per problem.

                                                      • mikepurvis 15 hours ago

                                                        What a phenomenal read, thank you for sharing that.

                                                        • Thorrez 9 hours ago

                                                          Side question: why is the story named Lena?

                                                      • recipe19 14 hours ago

                                                        Wasn't the "mixture of experts" a big thing in late 2023? The idea was that a vendor has a number of LLMs fine-tuned for specific tasks, none necessarily better than other, and that they applied heuristics to decide which one to rope in for which queries.

                                                        • vlovich123 14 hours ago

                                                          > The idea was that a vendor has a number of LLMs fine-tuned for specific tasks, none necessarily better than other, and that they applied heuristics to decide which one to rope in for which queries.

                                                    That's how people keep interpreting it, but it's incorrect. MoE is just a technique to decompose your single giant LLM into smaller expert sub-networks, where a learned router picks which ones get activated for each token. This is great because you only need roughly 1/N of the memory bandwidth to generate a token. Additionally, in the cloud, you can split the model parts across different servers to improve utilization and drive down costs.

                                                    But the models aren't actually separated across high-level concepts.
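
                                                    A toy illustration of that routing, in plain numpy with made-up shapes (top-2 of 8 experts); a sketch of the idea, not anything like a production MoE layer:

                                                      import numpy as np

                                                      d, n_experts, k = 64, 8, 2
                                                      rng = np.random.default_rng(0)
                                                      W_gate = rng.standard_normal((d, n_experts))   # learned gate (router) weights
                                                      experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]

                                                      def softmax(x):
                                                          e = np.exp(x - x.max())
                                                          return e / e.sum()

                                                      def moe_forward(x):
                                                          scores = x @ W_gate            # one score per expert for this token
                                                          top = np.argsort(scores)[-k:]  # route to the k highest-scoring experts
                                                          w = softmax(scores[top])       # mix only the selected experts
                                                          return sum(wi * (x @ experts[i]) for wi, i in zip(w, top))

                                                      print(moe_forward(rng.standard_normal(d)).shape)  # (64,): only k of 8 experts ran

                                                    The full expert stack still has to live somewhere, but per token only the routed slice is touched, which is where the bandwidth saving comes from.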

                                                          • mef 14 hours ago

                                                            this is a different idea

                                                          • mlboss 14 hours ago

                                                      AI coding agents (e.g. Cursor) should offer this as an alternative to Claude Code. Alloyed agents are something that AI wrappers can offer as a counter to Codex/Claude Code/Google Agent.

                                                            • zomglings 14 hours ago

                                                              Does anyone else find the use of different shades of green for the graph comparing Gemini 2.5 Pro and Sonnet just a little insane?

                                                              • yorwba 11 hours ago

                                                                What matters is whether a point is above or below the diagonal, the colors just display the same information redundantly.

                                                              • smusamashah 7 hours ago

                                                          This immediately reminds me of mixing two people in image diffusion models. You can prompt like "Portrait photo of [Jeff Bezos | Elon Musk | Mark Zuckerberg | Bill Gates]". The denoiser will keep switching over these names every step, and at the end you will get something like https://www.reddit.com/r/oddlyterrifying/comments/x6hd7e/jef...

                                                                • wiradikusuma 14 hours ago

                                                                  How do you decide which agent gets which turn? If random, you could end up with the worst of both right?

                                                                  • vFunct 18 hours ago

                                                                    Anyone else try this?

                                                                    • kadushka 16 hours ago

                                                                      I always do this with o3, gemini 2.5, and opus 4 when brainstorming hard problems: copy each model’s response to the other two.

                                                                      • esafak 13 hours ago

                                                                        Iterate until they pat each other on the back :)

                                                                      • BoorishBears 17 hours ago

                                                                        I mean if this works, it usually means you're not using either LLM to the best of its ability to start.

                                                                        If they actually inspected where the performance mismatch is between the two models individually, they'd probably find certain classes of mistakes each is making that can be fixed with a better prompt/CoT/workflow with the individual model.

                                                                        For a given prompt, different families of models almost always have idiosyncratic gaps that need to be fixed because of the differences in post-training for instruction following.

                                                                        That's also why LLM routers feel kind of silly: the right prompt for one model on a complex task is almost never the optimal prompt for the next model.

                                                                      • mda 11 hours ago

                                                                      The chart legend reads: light green dots, "Sonnet is better than Gemini"; dark green dots, "Gemini at least as good as Sonnet".

                                                                        • knowaveragejoe 14 hours ago

                                                                          Small nitpick - the axes on the varying alloy proportions graph say "Sonnet 2.5" and "Gemini 4.0"

                                                                          • wunderalbert 6 hours ago

                                                                            Ooh, good spot, thank you!

                                                                          • zer00eyz 17 hours ago

                                                                            Stack 3 models together, then 4...

                                                                            Congratulations, you just have a very expensive simulation of a Bayesian function (ish, but close enough that one should get the point).

                                                                            • tomrod 17 hours ago

                                                                              Or Minsky's Society of Mind, Dennett's Multiple Drafts, Gazzaniga's Social Brain, etc.

                                                                              • esafak 13 hours ago

                                                                                &^ Everything, We're Doing Five Models.

                                                                              • CamperBob2 16 hours ago

                                                                                Isn't this just an extension of the temperature concept? A possible experiment would be to maintain multiple contexts for the same model and make them review each other's output. How does that perform, compared to cross-model alloying?

                                                                                They do say that the more different the models are, the better the alloy performs... but still, multiple contexts seems worth considering, even though you end up doubling the usage.