• irthomasthomas 4 hours ago

    I'm impressed by the scope of this drop. The raw intelligence of open models seems to be falling behind the closed ones, but I think that's because frontier models from OpenAI and Anthropic are not just raw models; they probably layer on things like chain-of-thought, best-of-N sampling, or control vectors.
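
    To be concrete about what I mean by "best of N": sample several candidate completions and keep the one a scorer likes best. A minimal sketch (the `generate` and `score` functions are hypothetical stand-ins, not anyone's actual API):

    ```python
    import random

    def generate(prompt: str) -> str:
        # Stand-in for one sampled completion from a model at temperature > 0.
        return f"candidate {random.randint(0, 9999)} for: {prompt}"

    def score(prompt: str, completion: str) -> float:
        # Stand-in for a reward model / verifier that rates a completion.
        return random.random()

    def best_of_n(prompt: str, n: int = 8) -> str:
        # Sample n candidates and keep the one the scorer rates highest.
        candidates = [generate(prompt) for _ in range(n)]
        return max(candidates, key=lambda c: score(prompt, c))

    print(best_of_n("Why is the sky blue?"))
    ```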

    • jcoc611 18 hours ago

      Probably an ignorant question, but could someone explain why the Context Length is much larger than the Generation Length?

      • dacox 18 hours ago

        When doing inference for an LLM, there are two stages.

        The first phase is referred to as "prefill", where the input is processed to create the KV Cache.

        After that, the "decode" phase runs auto-regressively: each decode pass yields one new token.

        This post on [Inference Memory Requirements](https://huggingface.co/blog/llama31#inference-memory-require...) is quite good.

        These two phases have pretty different performance characteristics - prefill can really saturate GPU memory, and for long contexts it can be nigh impossible to do it all in a single pass, so frameworks like vLLM use a technique called "chunked prefill".

        The decode phase is compute intensive, but tends not to maximize GPU memory.

        If you are serving these models, you really want larger batch sizes during inference, which only really comes with scale - for a smaller app, you won't want to make the user wait that long for a batch to fill.

        So a long context only has to be processed _once_ per request, which is basically a scheduling problem.

        But the number of decode passes scales linearly with the output length. If it were unlimited, you could get some requests just _always_ present in an inference batch, reducing throughput for everyone.
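
        In (very) simplified form, the two phases look like this - `forward` below is a hypothetical stand-in for a real transformer pass, not any particular framework's API:

        ```python
        import random

        def forward(tokens, kv_cache):
            # Stand-in for a transformer forward pass: it appends this call's
            # keys/values to the cache and returns a fake "next token" id.
            kv_cache.extend(("kv", t) for t in tokens)
            return random.randint(0, 31999)

        def generate(prompt_tokens, max_new_tokens):
            kv_cache = []

            # Prefill: the whole prompt is processed in one (or a few chunked)
            # passes, populating the KV cache. Cost scales with prompt length.
            next_token = forward(prompt_tokens, kv_cache)
            output = [next_token]

            # Decode: one pass per generated token, each reusing the cache
            # and adding a single new entry to it.
            for _ in range(max_new_tokens - 1):
                next_token = forward([output[-1]], kv_cache)
                output.append(next_token)
            return output

        print(generate(list(range(100)), max_new_tokens=10))
        ```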

        • mmoskal 13 hours ago

          Decode speed is generally memory bandwidth bound. Prefill is typically arithmetic bound. This is the reason for mixed batches (both decode and prefill) - it lets you saturate both memory and arithmetic.

          Chunked prefill is for minimizing latency for decode entries in the same batch. It's not needed if you have only one request - in that case it's fastest to just prefill in one chunk.

          I'm pretty sure the sibling comment is right about the different length limits - it comes down to training, and the model starts talking nonsense if you let it run too long.
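
          Back-of-envelope, with rough A100-class specs (the numbers below are approximate and just for illustration of why the two phases hit different limits):

          ```python
          PARAMS = 32e9            # e.g. a 32B dense model
          BYTES_PER_PARAM = 2      # fp16/bf16 weights
          FLOPS_PER_PARAM = 2      # ~2 FLOPs per parameter per token (multiply + add)

          PEAK_FLOPS = 312e12      # ~A100 fp16 tensor throughput (approx.)
          PEAK_BW = 2.0e12         # ~A100 HBM bandwidth, bytes/s (approx.)
          KNEE = PEAK_FLOPS / PEAK_BW   # FLOPs/byte needed to become compute bound

          def intensity(tokens_per_pass):
              # Weights are read once per pass no matter how many tokens share them.
              flops = FLOPS_PER_PARAM * PARAMS * tokens_per_pass
              bytes_moved = BYTES_PER_PARAM * PARAMS
              return flops / bytes_moved

          print(f"roofline knee:             ~{KNEE:.0f} FLOPs/byte")
          print(f"decode, 1 token/pass:      ~{intensity(1):.0f} FLOPs/byte (bandwidth bound)")
          print(f"prefill, 2048-token chunk: ~{intensity(2048):.0f} FLOPs/byte (compute bound)")
          ```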

          • easygenes 16 hours ago

            It is also a training issue. The model has to be trained to reinforce longer outputs, which has a quadratic train-time cost and requires suitable long-context response training data.
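
            Roughly speaking, the self-attention term grows with the square of sequence length, so training on responses a few times longer costs an order of magnitude more in the attention layers (illustrative numbers only):

            ```python
            def attention_cost(seq_len):
                # Pairwise token interactions dominate the quadratic term.
                return seq_len * seq_len

            base = attention_cost(2_000)
            for seq_len in (2_000, 8_000, 32_000):
                print(f"{seq_len:>6} tokens -> ~{attention_cost(seq_len) / base:.0f}x the attention cost")
            ```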

            • jcoc611 18 hours ago

              That's a great explanation, thank you!

              • grayxu 6 hours ago

                Besides the technical details, in normal usage the context length is also much greater than the generation length.

              • freeqaz 20 hours ago

                32B is a nice size for 2x 3090s. That comfortably fits across the two GPUs with minimal quantization and still leaves extra memory for the long context length.

                70B is just a littttle rough trying to run without offloading some layers to the CPU.
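
                Rough memory math for why 32B fits and 70B gets tight on 48 GB. The layer/KV-head counts below are my assumptions for these model families (check each model's config.json), and an fp16 KV cache is assumed:

                ```python
                TOTAL_VRAM_GB = 48   # 2x RTX 3090
                KV_BYTES = 2         # fp16 KV cache

                def weights_gb(params_b, bits_per_weight):
                    return params_b * bits_per_weight / 8

                def kv_cache_gb(context_tokens, n_layers, n_kv_heads=8, head_dim=128):
                    per_token = 2 * n_layers * n_kv_heads * head_dim * KV_BYTES  # K and V
                    return context_tokens * per_token / 1e9

                # (label, params in billions, bits/weight, context, layer count)
                for name, p, bits, ctx, layers in [
                    ("32B q8, 32k ctx ", 32.8, 8, 32_768, 64),
                    ("32B q5, 128k ctx", 32.8, 5, 131_072, 64),
                    ("70B q5, 32k ctx ", 70.6, 5, 32_768, 80),
                ]:
                    total = weights_gb(p, bits) + kv_cache_gb(ctx, layers)
                    print(f"{name}: ~{total:.0f} GB of {TOTAL_VRAM_GB} GB")
                ```

                At the full 128k context the fp16 cache alone eats most of the headroom, which is where KV-cache quantization or a smaller context limit comes in.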

                • a_wild_dandan 17 hours ago

                  70B+ models typically run great with my MacBook's 96GB of (V)RAM. I want a Mac Studio to run e.g. llama-405B, but I can't justify the marginal model quality ROI for like $7k or whatever. (But I waaant iiit!)

                  • tarruda 17 hours ago

                    You can get a refurbished Mac Studio M1 Ultra with 128GB VRAM for ~$3k on eBay. The M1 Ultra has 800GB/s memory bandwidth, same as the M2 Ultra.

                    Not sure if 128GB VRAM is enough for running 405b (maybe at 3-bit quant?), but it seems to offer great value for running 70B models at 8-bit.

                    • a_wild_dandan 10 hours ago

                      Yeah, I would want the 192GB Mac for attempting such hefty models. But I have such basic bitch needs that 405B is overkill haha.

                      • tarruda 6 hours ago

                        405B even at low quants would have a very low token generation speed, so even if you got the 192GB it would probably not be a good experience. I think 405B is the kind of model that only makes sense to run on clusters of A100s/H100s.

                        IMO it is not worth it; 70B models at q8 are already pretty darn good, and 128GB is more than enough for those.
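
                        A crude ceiling, since each generated token has to stream (at least) all of the active weights from memory: tokens/s <= bandwidth / weight bytes. This assumes a dense model and ignores the KV cache and other overhead, so real numbers land lower:

                        ```python
                        BANDWIDTH_GBPS = 800  # M1/M2 Ultra unified memory, GB/s

                        def ceiling_tok_s(params_b, bits_per_weight):
                            weight_gb = params_b * bits_per_weight / 8
                            return BANDWIDTH_GBPS / weight_gb

                        print(f"70B  @ q8: <= {ceiling_tok_s(70, 8):.0f} tok/s")
                        print(f"70B  @ q4: <= {ceiling_tok_s(70, 4):.0f} tok/s")
                        print(f"405B @ q3: <= {ceiling_tok_s(405, 3):.0f} tok/s")
                        ```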

                    • diggan 15 hours ago

                      > run great

                      How many tokens/second is that approx?

                      For reference, Qwen 2.5 32B on CPU (5950X) with GPU offloading (to an RTX 3090 Ti) gets about 8.5 tokens/s, while 14B (fully on GPU) gets around 64 tokens/s.

                      • a_wild_dandan 10 hours ago

                        For 70B models, I usually get 15-25 t/s on my laptop. Obviously that heavily depends on which quant, context length, etc. I usually roll with q5s, since the loss is so minuscule.

                      • BaculumMeumEst 16 hours ago

                        What quant are you running on that rig? I've been running q4; not sure if I can bump that up to q5 across the board (or if it's worth it in general).

                    • Flux159 20 hours ago

                      It would be nice to have comparisons to Claude 3.5 for the coder model; only comparing to open-source models isn't super helpful, because I want to compare against the model I'm currently using for development work.

                      • imjonse 20 hours ago

                        Aider will probably have some numbers at https://aider.chat/docs/leaderboards/

                        • Deathmax 20 hours ago

                          They've posted their own run of the Aider benchmark [1] if you want to compare; it achieved 57.1%.

                          [1]: https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2.5/Qwen...

                          • reissbaker 8 hours ago

                            Oof. I'm really not sure why companies keep releasing these mini coding models; 57.1% is worse than gpt-3.5-turbo, and running it locally will be slower than OpenAI's API. I guess you could use it if you took your laptop into the woods, but with such poor coding ability, would you even want to?

                            The Qwen2.5-72B model seems to do pretty well on coding benchmarks, though there's no word on Aider numbers yet.

                        • diggan 20 hours ago

                          Here is a comparison of the prompt "I want to create a basic Flight simulator in Bevy and Rust. Help me figure out the core properties I need for take off, in air flight and landing" between Claude Sonnet 3.5 and Qwen2.5-14B-Instruct-Q4_K_M.gguf:

                          https://gist.github.com/victorb/7749e76f7c27674f3ae36d791e20...

                          AFAIK, there aren't any (micro)benchmark comparisons out yet.

                          • yourMadness 19 hours ago

                            14B with Q4_K_M quantization is about 9 GB.

                            Remarkable that it is at all comparable to Sonnet 3.5.

                            • diggan 18 hours ago

                              Comparable, I guess, but the result is a lot worse than Sonnet's for sure. Parts of the example code don't make much sense, whereas Sonnet seems to be aware of the latest Bevy API and its answer mostly makes sense.

                          • Sn0wCoder 17 hours ago

                            This might be what you are asking for... https://qwenlm.github.io/blog/qwen2.5-coder/

                            Ctrl-F "Code Reasoning".

                          • ekojs 21 hours ago

                            Actually really impressive. They went up from 7T tokens to 18T tokens. Curious to see how they perform after finetuning.

                            • cateye 18 hours ago

                              > we are inspired by the recent advancements in reinforcement learning (e.g., o1)

                              It will be interesting to see what the future brings as models incorporate chain-of-thought approaches, and whether o1 gets outperformed by open-source models.

                              • GaggiX 21 hours ago

                                >our latest large-scale dataset, encompassing up to 18 trillion tokens

                                I remember when GPT-3 was trained on 300B tokens.

                                • imjonse 21 hours ago

                                  and was considered too dangerous to be released publicly.

                                  • Prbeek 5 hours ago

                                    I remember when PS2 chips were considered so advanced that the US government banned shipments of PlayStations to China, lest the PLA get hold of them.

                                    • baq 20 hours ago

                                      They are dangerous... for folks who need to scrape the web for low-background tokens to train their transformers.

                                      • abc-1 18 hours ago

                                        Nobody ever really believed this; the truth is rarely in vogue.

                                        • GaggiX 21 hours ago

                                          The larger GPT-2s were also considered too dangerous to release publicly at first.

                                          • Workaccount2 20 hours ago

                                            I remember being very understanding of it too, after seeing the incredible (but, in retrospect, absolutely terrible) outputs.

                                            • GaggiX 19 hours ago

                                              I wasn't really convinced at the time; nothing has changed.
