• simonw 11 hours ago

    I think gemma-3-27b-it-qat-4bit is my new favorite local model - or at least it's right up there with Mistral Small 3.1 24B.

    I've been trying it on an M2 64GB via both Ollama and MLX. It's very, very good, and it only uses ~22GB (via Ollama) or ~15GB (MLX), leaving plenty of memory for running other apps.

    Some notes here: https://simonwillison.net/2025/Apr/19/gemma-3-qat-models/

    Last night I had it write me a complete plugin for my LLM tool like this:

      llm install llm-mlx
      llm mlx download-model mlx-community/gemma-3-27b-it-qat-4bit
    
      llm -m mlx-community/gemma-3-27b-it-qat-4bit \
        -f https://raw.githubusercontent.com/simonw/llm-hacker-news/refs/heads/main/llm_hacker_news.py \
        -f https://raw.githubusercontent.com/simonw/tools/refs/heads/main/github-issue-to-markdown.html \
        -s 'Write a new fragments plugin in Python that registers
        issue:org/repo/123 which fetches that issue
            number from the specified github repo and uses the same
            markdown logic as the HTML page to turn that into a
            fragment'
    
    It gave a solid response! https://gist.github.com/simonw/feccff6ce3254556b848c27333f52... - more notes here: https://simonwillison.net/2025/Apr/20/llm-fragments-github/
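
A minimal sketch of the argument handling such a plugin needs, assuming the public GitHub REST API. This is not the code the model generated. The helper names `issue_api_url` and `fetch_issue_markdown` are hypothetical, and the sketch leaves out issue comments and the LLM plugin registration itself.

```python
import json
import urllib.request


def issue_api_url(argument: str) -> str:
    """Turn 'org/repo/123' into the GitHub REST API URL for that issue."""
    parts = argument.strip().split("/")
    if len(parts) != 3 or not all(parts) or not parts[2].isdigit():
        raise ValueError(f"expected org/repo/number, got {argument!r}")
    org, repo, number = parts
    return f"https://api.github.com/repos/{org}/{repo}/issues/{number}"


def fetch_issue_markdown(argument: str) -> str:
    """Fetch the issue and render its title and body as Markdown (needs network)."""
    request = urllib.request.Request(
        issue_api_url(argument),
        headers={"Accept": "application/vnd.github+json"},
    )
    with urllib.request.urlopen(request) as response:
        issue = json.load(response)
    # The body can be null for issues without a description.
    return f"# {issue['title']}\n\n{issue.get('body') or ''}"
```
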
    • rs186 11 hours ago

      Can you quote tps?

      More and more I start to realize that cost saving is a small problem for local LLMs. If it is too slow, it becomes unusable, so much that you might as well use public LLM endpoints. Unless you really care about getting things done locally without sending information to another server.

      With OpenAI API/ChatGPT, I get response much faster than I can read, and for simple question, it means I just need a glimpse of the response, copy & paste and get things done. Whereas on local LLM, I watch it painstakingly prints preambles that I don't care about, and get what I actually need after 20 seconds (on a fast GPU).

      And I am not yet talking about context window etc.

      I have been researching about how people integrate local LLMs in their workflows. My finding is that most people play with it for a short time and that's about it, and most people are much better off spending money on OpenAI credits (which can last a very long time with typical usage) than getting a beefed up Mac Studio or building a machine with 4090.

      • simonw 10 hours ago

        My tooling doesn't measure TPS yet. It feels snappy to me on MLX.

        I agree that hosted models are usually a better option for most people - much faster, higher quality, handle longer inputs, really cheap.

        I enjoy local models for research and for the occasional offline scenario.

        I'm also interested in their applications for journalism, specifically for dealing with extremely sensitive data like leaked information from confidential sources.

        • freeamz 10 hours ago

          >I'm also interested in their applications for journalism, specifically for dealing with extremely sensitive data like leaked information from confidential sources.

          Think it is NOT just you. Most company with decent management also would not want their data going to anything outside the physical server they have in control of. But yeah, most people should just use an app and a hosted server. But this is HN, there are people here hosting their own email servers, so it shouldn't be too hard to run an LLM locally.

          • simonw 9 hours ago

            "Most company with decent management also would not want their data going to anything outside the physical server they have in control of."

            I don't think that's been true for over a decade: AWS wouldn't be a trillion dollar business if most companies still wanted to stay on-premise.

            • ipdashc 2 hours ago

              Yeah, this has been confusing me a bit. I'm not complaining by ANY means, but why does it suddenly feel like everyone cares about data privacy in LLM contexts, way more than previous attitudes to allowing data to sit on a bunch of random SaaS products?

              I assume because of the assumption that the AI companies will train off of your data, causing it to leak? But I thought all these services had enterprise tiers where they'll promise not to do that?

              Again, I'm not complaining, it's good to see people caring about where their data goes. Just interesting that they care now, but not before. (In some ways LLMs should be one of the safer services, since they don't even really need to store any data, they can delete it after the query or conversation is over.)

              • pornel 35 minutes ago

                It is due to the risk of a leak.

                Laundering of data through training makes it a more complicated case than a simple data theft or copyright infringement.

                Leaks could be accidental, e.g. due to an employee logging in to their free-as-in-labor personal account instead of a no-training Enterprise account. It's safer to have a complete ban on providers that may collect data for training.

              • terhechte 9 hours ago

                Or GitHub. I’m always amused when people don’t want to send fractions of their code to a LLM but happily host it on GitHub. All big llm providers offer no-training-on-your-data business plans.

                • tarruda 8 hours ago

                  > I’m always amused when people don’t want to send fractions of their code to a LLM but happily host it on GitHub

                  What amuses me even more is people thinking their code is too unique and precious, and that GitHub/Microsoft wants to steal it.

                  • AlexCoventry 8 hours ago

                    Concern about platform risk in regard to Microsoft is historically justified.

                    • vikarti 5 hours ago

                      Regulations sometimes matter. Stupid "security" rules sometimes matter too.

                      • Terretta 8 hours ago

                        Unlikely they think Microsoft or GitHub wants to steal it.

                        With LLMs, they're thinking of examples that regurgitated proprietary code, and contrary to everyday general observation, valuable proprietary code does exist.

                        But with GitHub, the thinking is generally the opposite: the worry is that the code is terrible, and seeing it would be like giant blinkenlights* indicating the way in.

                        * https://en.wikipedia.org/wiki/Blinkenlights

                    • __float 9 hours ago

                      While none of that is false, I think there's a big difference from shipping your data to an external LLM API and using AWS.

                      Using AWS is basically a "physical server they have control of".

                      • simonw 8 hours ago

                        That's why AWS Bedrock and Google Vertex AI and Azure AI model inference exist - they're all hosted LLM services that offer the same compliance guarantees that you get from regular AWS-style hosting agreements.

                        • IanCal 7 hours ago

                          As in aws is a much bigger security concern?

                  • overfeed 10 hours ago

                    > Whereas on local LLM, I watch it painstakingly prints preambles that I don't care about, and get what I actually need after 20 seconds.

                    You may need to "right-size" the models you use to match your hardware, model, and TPS expectations, which may involve using a smaller version of the model with faster TPS, upgrading your hardware, or paying for hosted models.

                    Alternatively, if you can use agentic workflows or tools like Aider, you don't have to watch the model work slowly with large models locally. Instead you queue work for it, go to sleep, or eat, or do other work, and then much later look over the Pull Requests whenever it completes them.

                    • rs186 4 hours ago

                      I have a 4070 super for gaming, and used it to play with LLM a few times. It is by no means a bad card, but I realize that unless I want to get 4090 or new Macs that I don't have any other use for, I can only use it to run smaller models. However, most smaller models aren't satisfactory and are still slower than hosted LLMs. I haven't found a model that I am happy with for my hardware.

                      Regarding agentic workflows -- sounds nice but I am too scared to try it out, based on my experience with standard LLMs like GPT or Claude for writing code. Small snippets or filling in missing unit tests, fine, anything more complicated? Has been a disaster for me.

                    • trees101 3 hours ago

                      Not sure how accurate my stats are. I used ollama with the --verbose flag. Using a 4090 and all default settings, I get about 40 TPS for the Gemma 27B model.

                      `ollama run gemma3:27b --verbose` gives me 42.5 TPS ±0.3 TPS

                      `ollama run gemma3:27b-it-qat --verbose` gives me 41.5 TPS ±0.3 TPS

                      Strange results; the full model gives me slightly more TPS.

                    • k__ 5 hours ago

                      The local LLM is your project manager, the big remote ones are the engineers and designers :D

                      • starik36 2 hours ago

                        On an A5000 with 24GB, this model typically gets between 20 to 25 tps.

                        • DJHenk 6 hours ago

                          > More and more I start to realize that cost saving is a small problem for local LLMs. If it is too slow, it becomes unusable, so much that you might as well use public LLM endpoints. Unless you really care about getting things done locally without sending information to another server.

                          There is another aspect to consider, aside from privacy.

                          These models are trained by downloading every scrap of information from the internet, including the works of many, many authors who have never consented to that. And they for sure are not going to get a share of the profits, if there are ever going to be any. If you use a cloud provider, you are basically saying that is all fine. You are happy to pay them, and make yourself dependent on their service, based on work that wasn't theirs to use.

                          However, if you use a local model, the authors still did not give consent, but one could argue that the company that made the model is at least giving back to the community. They don't get any money out of it, and you are not becoming dependent on their hyper capitalist service. No rent-seeking. The benefits of the work are free to use for everyone. This makes using AI a little more acceptable from a moral standpoint.

                          • ein0p 6 hours ago

                            Sometimes TPS doesn't matter. I've generated textual descriptions for 100K or so images in my photo archive, some of which I have absolutely no interest in uploading to someone else's computer. This works pretty well with Gemma. I use local LLMs all the time for things where privacy is even remotely important. I estimate this constitutes easily a quarter of my LLM usage.

                            • lodovic 6 hours ago

                              This is a really cool idea. Do you pretrain the model so it can tag people? I have so many photos that it seems impossible to ever categorize them; using a workflow like yours might help a lot.

                              • ein0p 6 hours ago

                                No, tagging of people is already handled by another model. Gemma just describes what's in the image, and produces a comma separated list of keywords. No additional training is required besides a few tweaks to the prompt so that it outputs just the description, without any "fluff". E.g. it normally prepends such outputs with "Here's a description of the image:" unless you really insist that it should output only the description. I suppose I could use constrained decoding into JSON or something to achieve the same, but I didn't mess with that.

                                On some images where Gemma3 struggles Mistral Small produces better descriptions, BTW. But it seems harder to make it follow my instructions exactly.

                                I'm looking forward to the day when I can also do this with videos, a lot of which I also have no interest in uploading to someone else's computer.
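
A small sketch of a fallback for the "fluff" problem described above, for cases where the prompt alone does not suppress the lead-in. It assumes the lead-in looks like "Here's a description of the image:" and ends with a colon. The regex and both helper names are hypothetical, not part of the workflow described in the comment.

```python
import re

# Lead-ins such as "Here's a description of the image:" (assumed shape).
PREAMBLE = re.compile(r"^\s*(?:here(?:['’]s| is)|sure\b)[^\n:]*:\s*", re.IGNORECASE)


def strip_preamble(text: str) -> str:
    """Drop one leading 'Here's ...:' style lead-in, if present."""
    return PREAMBLE.sub("", text, count=1).strip()


def split_keywords(line: str) -> list[str]:
    """Split a comma separated keyword line into clean, non-empty keywords."""
    return [word.strip() for word in line.split(",") if word.strip()]
```

Text that does not start with a recognised lead-in passes through unchanged, so the helper can run on every model output.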

                                • fer 4 hours ago

                                  How do you use the keywords after? I have Immich running which does some analysis, but the querying is a bit of a hit and miss.

                                  • ein0p 4 hours ago

                                    Search is indeed hit and miss. Immich, for instance, currently does absolutely nothing with the EXIF "description" field, so I store textual descriptions on the side as well. I have found Immich's search by image embeddings to be pretty weak at recall, and even weaker at ranking. IIRC Lightroom Classic (which I also use, but haven't found a way to automate this for without writing an extension) does search that field, but ranking is a bit of a dumpster fire, so your best bet is searching uncommon terms or constraining search by metadata (e.g. not just "black kitten" but "black kitten AND 2025"). I expect this to improve significantly over time - it's a fairly obvious thing to add given the available tech.

                              • starik36 2 hours ago

                                I was thinking of doing the same, but I would like to include people's names in the description. For example "Jennifer looking out in the desert sky.".

                                As it stands, Gemma will just say "Woman looking out in the desert sky."

                              • otabdeveloper4 8 hours ago

                                The only actually useful application of LLMs is processing large amounts of data for classification and/or summarizing purposes.

                                That's not the stuff you want to send to a public API, this is something you want as a 24/7 locally running batch job.

                                ("AI assistant" is an evolutionary dead end, and Star Trek be damned.)

                              • prvc 5 hours ago

                                > ~15GB (MLX) leaving plenty of memory for running other apps.

                                Is that small enough to run well (without thrashing) on a system with only 16GiB RAM?

                                • simonw 4 hours ago

                                  I expect not. On my Mac at least I've found I need a bunch of GB free to have anything else running at all.

                                  • mnoronha 3 hours ago

                                    Any idea why MLX and ollama use such different amounts of ram?

                                • codybontecou 3 hours ago

                                  Can you run the mlx-variation of this model through Ollama so that I can interact with it in Open WebUI?

                                • paprots 6 hours ago

                                  The original gemma3:27b also took only 22GB using Ollama on my 64GB MacBook. I'm quite confused that the QAT took the same. Do you know why? Which model is better? `gemma3:27b`, or `gemma3:27b-qat`?

                                  • zorgmonkey an hour ago

                                    Both versions are quantized and should use the same amount of RAM. The difference with QAT is that the quantization happens during training, and it should result in slightly better output (closer to the bf16 weights).

                                    • kgwgk 4 hours ago

                                      Look up 27b in https://ollama.com/library/gemma3/tags

                                      You'll find the id a418f5838eaf which also corresponds to 27b-it-q4_K_M

                                      • nolist_policy 6 hours ago

                                        I suspect your "original gemma3:27b" was a quantized model since the non-quantized (16-bit) version needs around 54GB.
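
The 54GB figure follows from simple arithmetic: parameters times bits per weight, divided by 8. A quick sketch, assuming a round 27 billion parameters and counting weights only (no KV cache, activations, or runtime overhead). Real 4-bit formats also store per-block scales, so actual files come out somewhat larger than the 4-bit line here.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (10^9 bytes), weights only."""
    return params_billion * bits_per_weight / 8


for label, bits in [("bf16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"27B @ {label}: ~{weight_memory_gb(27, bits):.1f} GB")
```
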

                                        • superkuh 4 hours ago

                                          Quantization aware training just means having the model deal with quantized values a bit during training so it handles the quantization better when it is quantized after training/etc. It doesn't change the model size itself.
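
To make "deal with quantized values during training" concrete, here is a toy sketch of the usual fake-quantization step: weights are rounded onto a low-bit grid and mapped back to floats, so the forward pass sees the rounding error while a full-precision copy keeps being updated. This illustrates the general idea under simple assumptions (symmetric, per-tensor scale). It is not Google's actual recipe for Gemma 3, and `fake_quantize` is a hypothetical name.

```python
def fake_quantize(weights: list[float], bits: int = 4) -> list[float]:
    """Round weights onto a symmetric signed grid, then map back to floats."""
    qmax = 2 ** (bits - 1) - 1                       # 7 for 4-bit
    # One scale for the whole tensor; fall back to 1.0 if all weights are zero.
    scale = max(abs(w) for w in weights) / qmax or 1.0
    quantized = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return [q * scale for q in quantized]
```

The output has the same length and float type as the input, which is why the model size during training does not change. Only the values are restricted to at most 15 distinct levels for 4 bits.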

                                        • tomrod 10 hours ago

                                          Simon, what is your local GPU setup? (No doubt you've covered this, but I'm not sure where to dig up).

                                          • simonw 10 hours ago

                                            MacBook Pro M2 with 64GB of RAM. That's why I tend to be limited to Ollama and MLX - stuff that requires NVIDIA doesn't work for me locally.

                                            • Elucalidavah 8 hours ago

                                              > MacBook Pro M2 with 64GB of RAM

                                              Are there non-mac options with similar capabilities?

                                              • simonw 8 hours ago

                                                Yes, but I don't really know anything about those. https://www.reddit.com/r/LocalLLaMA/ is full of people running models on PCs with NVIDIA cards.

                                                The unique benefit of an Apple Silicon Mac at the moment is that the 64GB of RAM is available to both the GPU and the CPU at once. With other hardware you usually need dedicated separate VRAM for the GPU.

                                                • dwood_dev 4 hours ago

                                                  Anything with the Radeon 8060S/Ryzen AI Max+ 395. One of the popular MiniPC Chinese brands has them for preorder[0] with shipping starting May 7th. Framework also has them, but shipping Q3.

                                                  0: https://www.gmktec.com/products/prepaid-deposit-amd-ryzen™-a...

                                                  • danans an hour ago

                                                    Nvidia Orin AGX if a desktop form factor works for you.

                                                    • _neil 6 hours ago

                                                      It’s not out yet, but the upcoming Framework desktop [0] is supposed to have a similar unified memory setup.

                                                      [0] https://frame.work/desktop

                                                • nico 10 hours ago

                                                  Been super impressed with local models on mac. Love that the gemma models have 128k token context input size. However, outputs are usually pretty short

                                                  Any tips on generating long output? Like multiple pages of a document, a story, a play or even a book?

                                                  • simonw 10 hours ago

                                                    The tool you are using may set a default max output size without you realizing. Ollama has a num_ctx that defaults to 2048 for example: https://github.com/ollama/ollama/blob/main/docs/faq.md#how-c...
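
A sketch of setting these limits explicitly through Ollama's HTTP API rather than relying on defaults. As far as I know, `num_ctx` sets the context window (prompt plus output) and `num_predict` caps the number of generated tokens. The values 8192 and 4096 are arbitrary examples, and the helper names are hypothetical.

```python
import json
import urllib.request


def build_generate_request(model: str, prompt: str,
                           num_ctx: int = 8192, num_predict: int = 4096) -> dict:
    """Build a body for Ollama's /api/generate with explicit size options."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {
            "num_ctx": num_ctx,          # context window, in tokens
            "num_predict": num_predict,  # cap on generated tokens
        },
    }


def generate(body: dict, host: str = "http://localhost:11434") -> str:
    """Send the request to a locally running Ollama server (server must be up)."""
    request = urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["response"]
```

For example, `generate(build_generate_request("gemma3:27b-it-qat", "Write a long story."))` would request up to 4096 output tokens inside an 8192-token window.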

                                                    • nico 10 hours ago

                                                      Been playing with that, but doesn’t seem to have much effect. It works very well to limit output to smaller bits, like setting it to 100-200. But above 2-4k the output seems to never get longer than about 1 page

                                                      Might try using the models with mlx instead of ollama to see if that makes a difference

                                                      Any tips on prompting to get longer outputs?

                                                      Also, does the model context size determine max output size? Are the two related or are they independent characteristics of the model?

                                                      • simonw 9 hours ago

                                                        Interestingly the Gemma 3 docs say: https://ai.google.dev/gemma/docs/core/model_card_3#:~:text=T...

                                                        > Total output context up to 128K tokens for the 4B, 12B, and 27B sizes, and 32K tokens for the 1B size per request, subtracting the request input tokens

                                                        I don't know how to get it to output anything that length though.

                                                        • nico 9 hours ago

                                                          Thank you for the insights and useful links

                                                          Will keep experimenting, will also try mistral3.1

                                                          edit: just tried mistral3.1 and the quality of the output is very good, at least compared to the other models I tried (llama2:7b-chat, llama2:latest, gemma3:12b, qwq and deepseek-r1:14b)

                                                          Doing some research, because of their training sets, it seems like most models are not trained on producing long outputs so even if they technically could, they won’t. Might require developing my own training dataset and then doing some fine tuning. Apparently the models and ollama have some safeguards against rambling and repetition

                                                          • Gracana 5 hours ago

                                                            You can probably find some long-form tuned models on HF. I've had decent results with QwQ-32B (which I can run on my desktop) and Mistral Large (which I have to run on my server). Generating and refining an outline before writing the whole piece can help, and you can also split the piece up into multiple outputs (working a paragraph or two at a time, for instance). So far I've found it to be a tough process, with mixed results.
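
The outline-then-expand approach above can be sketched roughly as follows. `generate` stands for any function that sends a prompt to a model and returns its text. The prompts, the 500-character recap, and the name `write_long_form` are all assumptions made for illustration.

```python
from typing import Callable


def write_long_form(generate: Callable[[str], str], topic: str, sections: int = 5) -> str:
    """Ask for an outline first, then expand one outline item per call."""
    outline = generate(f"Write a {sections}-item outline for: {topic}. One item per line.")
    items = [line.strip() for line in outline.splitlines() if line.strip()][:sections]
    parts: list[str] = []
    for item in items:
        # Give the model the tail of the previous section so the text stays continuous.
        recap = parts[-1][-500:] if parts else "(start of the piece)"
        parts.append(generate(
            f"Topic: {topic}\nOutline item: {item}\n"
            f"Previous text ends with: {recap}\n"
            "Write this section in full, continuing from the previous text."
        ))
    return "\n\n".join(parts)
```

Because each call only has to produce one section, the total length is bounded by the number of outline items rather than by how long the model is willing to write in one response.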

                                                            • nico 3 hours ago

                                                              Thank you, will try out your suggestions

                                                              Have you used something like a director model to supervise the output? If so, could you comment on the effectiveness of it and potentially any tips?

                                                              • Gracana 31 minutes ago

                                                                Nope, sounds neat though. There's so much to keep up with in this space.

                                                    • tootie an hour ago

                                                      I'm using 12b and getting seriously verbose answers. It's squeezed into 8GB and takes its sweet time but answers are really solid.

                                                      • Casteil 9 hours ago

                                                        This is basically the opposite of what I've experienced - at least compared to another recent entry like IBM's Granite 3.3.

                                                        By comparison, Gemma3's output (both 12b and 27b) seems to typically be more long/verbose, but not problematically so.

                                                        • nico 7 hours ago

                                                          I agree with you. The outputs are usually good, it’s just that for the use case I have now (writing several pages of long dialogs), the output is not as long as I’d want it, and definitely not as long as it’s supposedly capable of doing

                                                      • littlestymaar 7 hours ago

                                                        > and it only uses ~22GB (via Ollama) or ~15GB (MLX)

                                                        Why is the memory use different? Are you using different context size in both set-ups?

                                                    • Havoc 5 minutes ago

                                                      Definitely my current fav. Also interesting that for many questions the response is very similar to the gemini series. Must be sharing training datasets pretty directly.

                                                      • Samin100 5 hours ago

                                                        I have a few private “vibe check” questions and the 4 bit QAT 27B model got them all correct. I’m kind of shocked at the information density locked in just 13 GB of weights. If anyone at Deepmind is reading this — Gemma 3 27B is the single most impressive open source model I have ever used. Well done!

                                                        • diggan 11 hours ago

                                                          First graph is a comparison of the "Elo Score" while using "native" BF16 precision in various models, second graph is comparing VRAM usage between native BF16 precision and their QAT models, but since this method is about doing quantization while also maintaining quality, isn't the obvious graph of comparing the quality between BF16 and QAT missing? The text doesn't seem to talk about it either, yet it's basically the topic of the blog post.

                                                          • croemer 11 hours ago

                                                            Indeed, the one thing I was looking for was Elo/performance of the quantized models, not how good the base model is. Showing how much memory is saved by quantization in a figure is a bit of an insult to the intelligence of the reader.

                                                            • nithril 11 hours ago

                                                              In addition, the "Massive VRAM Savings" graph states what looks like a tautology: reducing from 16 bits to 4 bits unsurprisingly leads to a 4x reduction in memory usage.

                                                              • claiir 7 hours ago

                                                                Yea they mention a “perplexity drop” relative to naive quantization, but that’s meaningless to me.

                                                                > We reduce the perplexity drop by 54% (using llama.cpp perplexity evaluation) when quantizing down to Q4_0.

                                                                Wish they showed benchmarks / added quantized versions to the arena! :>

                                                              • mark_l_watson 6 hours ago

                                                                Indeed!! I have swapped out qwen2.5 for gemma3:27b-it-qat using Ollama for routine work on my 32G memory Mac.

                                                                gemma3:27b-it-qat with open-codex, running locally, is just amazingly useful, not only for Python dev, but for Haskell and Common Lisp also.

                                                                I still like Gemini 2.5 Pro and o3 for brainstorming or working on difficult problems, but for routine work it (simply) makes me feel good to have everything open source/weights running on my own system.

                                                                When I bought my 32G Mac a year ago, I didn't expect to be this happy running gemma3:27b-it-qat with open-codex locally.

                                                                • trebligdivad 10 hours ago

                                                                  It seems pretty impressive - I'm running it on my CPU (16 core AMD 3950x) and it's very very impressive at translation, and the image description is very impressive as well. I'm getting about 2.3 tokens/s on it (compared to under 1/s on the Calme-3.2 I was previously using). It does tend to be a bit chatty unless you tell it not to be; for pretty much everything it'll give you a 'breakdown' unless you tell it not to - so for translation my prompt is 'Translate the input to English, only output the translation' to stop it giving a breakdown of the input language.

                                                                  • Havoc 16 minutes ago

                                                                    The upcoming qwen3 series is supposed to be MoE...likely to give better tk/s on CPU

                                                                    • simonw 10 hours ago

                                                                      What are you using to run it? I haven't got image input working yet myself.

                                                                      • trebligdivad 10 hours ago

                                                                        I'm using llama.cpp - built last night from head; to do image stuff you have to run a separate client they provide, with something like:

                                                                        ./build/bin/llama-gemma3-cli -m /discs/fast/ai/gemma-3-27b-it-q4_0.gguf --mmproj /discs/fast/ai/mmproj-model-f16-27B.gguf -p "Describe this image." --image ~/Downloads/surprise.png

                                                                        Note the 2nd gguf in there - I'm not sure, but I think that's for encoding the image.

                                                                        • terhechte 9 hours ago

                                                                          Image input has been working with LM Studio for quite some time

                                                                      • mythz 11 hours ago

                                                                        The speed gains are real, after downloading latest QAT gemma3:27b eval perf is now 1.47x faster on ollama, up from 13.72 to 20.11 tok/s (on A4000's).

                                                                        • behnamoh 11 hours ago

                                                                          This is what local LLMs need—being treated like first-class citizens by the companies that make them.

                                                                          That said, the first graph is misleading about the number of H100s required to run DeepSeek r1 at FP16. The model is FP8.

                                                                          • mmoskal 6 hours ago

                                                                            Also ~no one runs H100s at home, i.e. at batch size 1. What matters is throughput. With 37B active parameters and a massive deployment, throughput (per GPU) should be similar to Gemma.

                                                                            • freeamz 9 hours ago

                                                                              so what is the real comparison against DeepSeek r1 ? Would be good to know which is actually more cost efficient and open (reproducible build) to run locally.

                                                                              • behnamoh 9 hours ago

                                                                                half the amount of those dots is what it takes. but also, why compare a 27B model with a +600B? that doesn't make sense.

                                                                          • mekpro 9 hours ago

                                                                            Gemma 3 is way way better than Llama 4. I think Meta will start to lose its position in LLM mindshare. Another weakness of Llama 4 is its model size that is too large (even though it can run fast with MoE), which greatly limits the applicable users to a small percentage of enthusiasts who have enough GPU VRAM. Meanwhile, Gemma 3 is widely usable across all hardware sizes.

                                                                            • porphyra 7 hours ago

                                                                              It is funny that Microsoft had been peddling "AI PCs" and Apple had been peddling "made for Apple Intelligence" for a while now, when in fact usable models for consumer GPUs are only barely starting to be a thing on extremely high end GPUs like the 3090.

                                                                              • icedrift 4 hours ago

                                                                                Capable local models have been usable on Macs for a while now thanks to their unified memory.

                                                                                • ivape 6 hours ago

                                                                                  This is why the "AI hardware cycle is hype" crowd is so wrong. We're not even close, we're basically at the ColecoVision/Atari stage of hardware here. It's going to be quite a thing when everyone gets a SNES/Genesis.

                                                                                  • dragonwriter 3 hours ago

                                                                                    AI PCs aren't about running the kind of models that take a 3090-class GPU, or even running on GPU at all, but systems where the local end is running something like Phi-3.5-vision-instruct, on system RAM using a CPU with an integrated NPU, which is why the AI PC requirements specify an NPU, a certain amount of processing capacity, and a minimum amount of DDR5/LPDDR5 system RAM.

                                                                                    • NorwegianDude 4 hours ago

                                                                                      A 3090 is not an extremely high end GPU. It's a consumer GPU launched in 2020, and even in price and compute it's around a mid-range consumer GPU these days.

                                                                                      The high end consumer card from Nvidia is the RTX 5090, and the professional version of the card is the RTX PRO 6000.

                                                                                      • dragonwriter 3 hours ago

                                                                                        For model usability as a binary yes/no, pretty much the only dimension that matters is VRAM, and at 24GB the 3090 is still high end for consumer NVidia GPUs. Yes, the 5090 (and only the 5090) is above it, at 32GB, but 24GB is way ahead of the mid-range.

                                                                                        • NorwegianDude 2 hours ago

                                                                                          24 GB of VRAM is a large amount of VRAM on a consumer GPU, that I totally agree with you on. But it's definitely not an extremely high end GPU these days. It is suitable, yes, but not high end. The high end alternative for a consumer GPU would be the RTX 5090, but that is only available for €3000 now, while used 3090s are around €650.

                                                                                        • zapnuk 3 hours ago

                                                                                          A 3090 still costs 1800€. That's not mid-range by a long shot.

                                                                                          The 5070 or 5070ti are mid range. They cost 650/900€.

                                                                                          • sentimentscan 2 hours ago

                                                                                            A year ago, I bought a brand-new EVGA hybrid-cooled 3090 Ti for 700 euros. I'm still astonished at how good of a decision it was, especially considering the scarcity of 24GB cards available for a similar price. For pure gaming, many cards perform better, but they mostly come with 12 to 16GB of VRAM.

                                                                                            • NorwegianDude 3 hours ago

                                                                                              3090s are no longer produced, that's why new ones are so expensive. At least here, used 3090s are around €650, and a RTX 5070 is around €625.

                                                                                              It's definitely not extremely high end any more; the price is (at least here) the same as the new mid range consumer cards.

                                                                                              I guess the price can vary by location, but €1800 for a 3090 is crazy, that's more than the new price in 2020.

                                                                                        • emrah 13 hours ago
                                                                                          • jinay 11 hours ago

                                                                                            Make sure you're using the "-it-qat" suffixed models like "gemma3:27b-it-qat"

                                                                                          • Der_Einzige 11 hours ago

                                                                                            How many times do I have to say this? Ollama, llamacpp, and many other projects are slower than vLLM/sglang. vLLM is a much superior inference engine and is fully supported by the only LLM frontends that matter (sillytavern).

                                                                                            The community getting obsessed with Ollama has done huge damage to the field, as it's inefficient compared to vLLM. Many people can get far more tok/s than they think they could if only they knew the right tools.

                                                                                            • Zambyte 11 hours ago

                                                                                              The significant convenience benefits outweigh the higher TPS that vLLM offers in the context of my single machine homelab GPU server. If I was hosting it for something more critical than just myself and a few friends chatting with it, sure. Being able to just paste a model name into Open WebUI and run it is important to me though.

                                                                                              It is important to know about both to decide between the two for your use case though.

                                                                                              • Der_Einzige 8 hours ago

                                                                                                Running any HF model on vllm is as simple as pasting a model name into one command in your terminal.

                                                                                                • Zambyte 6 hours ago

                                                                                                  What command is it? Because that was not at all my experience.

                                                                                                  • Der_Einzige 2 hours ago

                                                                                                    vllm serve… Hugging Face gives run instructions for every model with vLLM on its website.

                                                                                                    • Zambyte 2 hours ago

                                                                                                      How do I serve multiple models? I can pick from dozens of models that I have downloaded through Open WebUI.

                                                                                              • ach9l 11 hours ago

                                                                                                instead of ranting, maybe explain how to make a qat q4 work with images in vllm, afaik it is not yet possible

                                                                                                • simonw 11 hours ago

                                                                                                  Last I looked vLLM didn't work on a Mac.

                                                                                                  • mitjam 9 hours ago

                                                                                                    Afaik vllm is for concurrent serving with batched inference for higher throughput, not single-user inference. I doubt inference throughput is higher with single prompts at a time than Ollama. Update: this is a good Intro to continuous batching in llm inference: https://www.anyscale.com/blog/continuous-batching-llm-infere...

                                                                                                    • Der_Einzige 8 hours ago

                                                                                                      It is much faster on single prompts than ollama. 3X is not unheard of

                                                                                                  • oezi 11 hours ago

                                                                                                    Why is sillytavern the only LLM frontend which matters?

                                                                                                    • GordonS 10 hours ago

                                                                                                      I tried sillytavern a few weeks ago... wow, that is an "interesting" UI! I blundered around for a while, couldn't figure out how to do anything useful... and then installed LM Studio instead.

                                                                                                      • imtringued 9 hours ago

                                                                                                        I personally thought the lorebook feature was quite neat and then quickly gave up on it because I couldn't get it to trigger, ever.

                                                                                                        Whatever those keyword things are, they certainly don't seem to be doing any form of RAG.

                                                                                                      • Der_Einzige 8 hours ago

                                                                                                        It supports more sampler and other settings than anyone else.

                                                                                                      • janderson215 11 hours ago

                                                                                                        I did not know this, so thank you. I read a blog post a while back that encouraged using Ollama and never mentioned vLLM. Do you recommend reading any particular resource?

                                                                                                        • oezi 11 hours ago

                                                                                                          Somebody in this thread mentioned 20.x tok/s on ollama. What are you seeing in vLLM?

                                                                                                          • Zambyte 11 hours ago

                                                                                                            FWIW I'm getting 29 TPS on Ollama on my 7900 XTX with the 27b qat. You can't really compare inference engine to inference engine without keeping the hardware and model fixed.

                                                                                                            Unfortunately Ollama and vLLM are therefore incomparable at the moment, because vLLM does not support these models yet.

                                                                                                            https://github.com/vllm-project/vllm/issues/16856

                                                                                                          • m00dy 11 hours ago

                                                                                                            Ollama is definitely not for production loads but vLLM is.

                                                                                                        • 999900000999 9 hours ago

                                                                                                          Assuming this can match Claude's latest, and full-time usage (as in you have a system that's constantly running code without any user input), you'd probably save 600 to 700 a month. A 4090 is only 2K and you'll see an ROI within 90 days.

                                                                                                          I can imagine this will serve to drive prices for hosted llms lower.

                                                                                                          At this level any company that produces even a nominal amount of code should be running LLMs on prem (AWS if you're on the cloud).
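The payback claim above is simple arithmetic. A minimal sketch, using only the figures quoted in the comment (2K hardware cost, 600 to 700 saved per month) and assuming a 30-day month; electricity, the rest of the machine, and maintenance are left out.

```python
def payback_days(hardware_cost: float, monthly_savings: float,
                 days_per_month: int = 30) -> float:
    """Days until cumulative savings cover the hardware cost."""
    return hardware_cost / monthly_savings * days_per_month

# Figures from the comment: a 2K GPU and 600-700 saved per month.
for savings in (600, 700):
    print(f"{savings}/month -> about {round(payback_days(2000, savings))} days")
```

With those inputs the break-even point lands between roughly 86 and 100 days, so "within 90 days" holds only at the upper end of the savings range.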

                                                                                                          • rafaelmn 4 hours ago

                                                                                                            I'd say using a Mac studio with M4 Max and 128 GB RAM will get you way further than 4090 in context size and model size. Cheaper than 2x4090 and less power while being a great overall machine.

                                                                                                            I think these consumer GPUs are way too expensive for the amount of memory they pack - and that's intentional price discrimination. Also the builds are gimmicky. It's just not set up for AI models, and the versions that are cost 20k.

                                                                                                            AMD has that 128GB RAM strix halo chip but even with soldered ram the bandwidth there is very limited, half of M4 Max, which is half of 4090.

                                                                                                            I think this generation of hardware and local models is not there yet - would wait for M5/M6 release.

                                                                                                            • tootie an hour ago

                                                                                                              There's certainly room to grow but I'm running Gemma 12b on a 4060 (8GB VRAM) which I bought for gaming and it's a tad slow but still gives excellent results. And it certainly seems software is outpacing hardware right now. The target is making a good enough model that can run on a phone.

                                                                                                          • miki123211 9 hours ago

                                                                                                            What would be the best way to deploy this if you're maximizing for GPU utilization in a multi-user (API) scenario? Structured output support would be a big plus.

                                                                                                            We're working with a GPU-poor organization with very strict data residency requirements, and these models might be exactly what we need.

                                                                                                            I would normally say VLLM, but the blog post notably does not mention VLLM support.

                                                                                                          • umajho 11 hours ago

                                                                                                            I am currently using the Q4_K_M quantized version of gemma-3-27b-it locally. I previously assumed that a 27B model with image input support wouldn't be very high quality, but after actually using it, the generated responses feel better than those from my previously used DeepSeek-R1-Distill-Qwen-32B (Q4_K_M), and its recognition of images is also stronger than I expected. (I thought the model could only roughly understand the concepts in the image, but I didn't expect it to be able to recognize text within the image.)

                                                                                                            Since this article publishes the optimized Q4 quantized version, it would be great if it included more comparisons between the new version and my currently used unoptimized Q4 version (such as benchmark scores).

                                                                                                            (I deliberately wrote this reply in Chinese and had gemma-3-27b-it Q4_K_M translate it into English.)

                                                                                                            • piyh 6 hours ago

                                                                                                              Meta Maverick is crying in the shower getting so handily beat by a model with 15x fewer params

                                                                                                              • holografix 12 hours ago

                                                                                                                Could 16gb vram be enough for the 27b QAT version?

                                                                                                                • jffry 11 hours ago

                                                                                                                  With `ollama run gemma3:27b-it-qat "What is blue"`, GPU memory usage is just a hair over 20GB, so no, probably not without a nerfed context window

                                                                                                                  • woadwarrior01 11 hours ago

                                                                                                                    Indeed, the default context length in ollama is a mere 2048 tokens.
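For anyone who wants a larger window than the default, Ollama's REST API accepts a `num_ctx` option per request. The sketch below only builds the request; the helper names (`build_generate_request`, `generate`) are made up for illustration, and it assumes an Ollama server on the default local port with the model already pulled.

```python
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"

def build_generate_request(model: str, prompt: str,
                           num_ctx: int = 8192) -> urllib.request.Request:
    """Build a non-streaming generate request with an explicit context size."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx},  # context window in tokens
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def generate(model: str, prompt: str, num_ctx: int = 8192) -> str:
    """Send the request and return the generated text (needs a live server)."""
    request = build_generate_request(model, prompt, num_ctx)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]

# Example, only meaningful with a running server:
# print(generate("gemma3:27b-it-qat", "What is blue"))
```

A larger `num_ctx` increases memory use, so on a card that is already near its limit this trades VRAM for context.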

                                                                                                                  • hskalin 11 hours ago

                                                                                                                    With ollama you could offload a few layers to cpu if they don't fit in the VRAM. This will cost some performance of course but it's much better than the alternative (everything on cpu)

                                                                                                                    • senko 9 hours ago

                                                                                                                      I'm doing that with a 12GB card, ollama supports it out of the box.

                                                                                                                      For some reason, it only uses around 7GB of VRAM, probably due to how the layers are scheduled, maybe I could tweak something there, but didn't bother just for testing.

                                                                                                                      Obviously, perf depends on CPU, GPU and RAM, but on my machine (3060 + i5-13500) it's around 2 t/s.

                                                                                                                    • halflings 11 hours ago

                                                                                                                      That's what the chart says yes. 14.1GB VRAM usage for the 27B model.

                                                                                                                      • erichocean 11 hours ago

                                                                                                                        That's the VRAM required just to load the model weights.

                                                                                                                        To actually use a model, you need a context window. Realistically, you'll want a 20GB GPU or larger, depending on how many tokens you need.

                                                                                                                        • oezi 11 hours ago

                                                                                                                          I didn't realize that the context would require so much memory. Is this the KV cache? It would seem like a big advantage if this memory requirement could be reduced.
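A back-of-the-envelope estimate of the KV cache shows why context is expensive. This is a sketch under assumptions: the architecture numbers (62 layers, 16 KV heads, head dimension 128) are taken to be Gemma 3 27B's and should be checked against the model config, the cache is stored in 16-bit precision, and every layer is treated as full attention with no sliding-window savings.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Bytes needed to store keys and values for `tokens` positions."""
    # Factor 2 covers the separate K and V tensors in each layer.
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens

GIB = 1024 ** 3

# Assumed Gemma 3 27B shape: 62 layers, 16 KV heads, head_dim 128.
for ctx in (2048, 8192, 131072):
    size = kv_cache_bytes(ctx, layers=62, kv_heads=16, head_dim=128)
    print(f"{ctx:>7} tokens: {size / GIB:.2f} GiB")
```

Under these assumptions each token costs about half a megabyte, so the cache grows linearly with context and can end up larger than the 4-bit weights at long context lengths. Quantizing the cache or using sliding-window attention reduces it.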

                                                                                                                      • parched99 10 hours ago

                                                                                                                        I am only able to get the Gemma-3-27b-it-qat-Q4_0.gguf (15.6GB) to run with a 100 token context size on a 5070 ti (16GB) using llamacpp.

                                                                                                                        Prompt Tokens: 10

                                                                                                                        Time: 229.089 ms

                                                                                                                        Speed: 43.7 t/s

                                                                                                                        Generation Tokens: 41

                                                                                                                        Time: 959.412 ms

                                                                                                                        Speed: 42.7 t/s
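The reported speeds follow directly from the token counts and timings above. A minimal sketch that recomputes them (the function name is just for illustration):

```python
def tokens_per_second(tokens: int, elapsed_ms: float) -> float:
    """Throughput in tokens per second from a token count and a time in ms."""
    return tokens / (elapsed_ms / 1000.0)

# Numbers copied from the llama.cpp output quoted above.
prompt_tps = tokens_per_second(10, 229.089)
generation_tps = tokens_per_second(41, 959.412)
print(f"prompt: {prompt_tps:.1f} t/s, generation: {generation_tps:.1f} t/s")
```

With only 10 prompt tokens and 41 generated tokens, these figures are a very short sample, so they say little about sustained throughput at larger context sizes.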

                                                                                                                        • idonotknowwhy an hour ago

                                                                                                                          I didn't realise the 5070 is slower than the 3090. Thanks.

                                                                                                                          If you want a bit more context, try -ctk q8_0 -ctv q8_0 (from memory so look it up) to quantize the KV cache.

                                                                                                                          Also an imatrix GGUF like IQ4_XS might be smaller with better quality.

                                                                                                                          • tbocek 5 hours ago

                                                                                                                            This is probably due to this: https://github.com/ggml-org/llama.cpp/issues/12637. This GitHub issue is about interleaved sliding window attention (iSWA) not available in llama.cpp for Gemma 3. This could reduce the memory requirements a lot. They mentioned for a certain scenario, going from 62GB to 10GB.

                                                                                                                            • parched99 5 hours ago

                                                                                                                              Resolving that issue would help reduce (not eliminate) the size of the context. The model will still only just barely fit in 16 GB, which is what the parent comment asked.

                                                                                                                              Best to have two or more low-end, 16GB GPUs for a total of 32GB VRAM to run most of the better local models.

                                                                                                                            • floridianfisher 6 hours ago

                                                                                                                              Try one of the smaller versions. 27b is too big for your gpu

                                                                                                                              • parched99 5 hours ago

                                                                                                                                I'm aware. I was addressing the question being asked.

                                                                                                                          • justanotheratom 11 hours ago

                                                                                                                            Anyone packaged one of these in an iPhone App? I am sure it is doable, but I am curious what tokens/sec is possible these days. I would love to ship "private" AI Apps if we can get reasonable tokens/sec.

                                                                                                                            • zamadatix 10 hours ago

                                                                                                                              There are many such apps, e.g. Mollama, Enclave AI or PrivateLLM or dozens of others, but you could tell me it runs at 1,000,000 tokens/second on an iPhone and I wouldn't care because the largest model version you're going to be able to load is Gemma 3 4B q4 (12 B won't fit in 8 GB with the OS + you still need context) and it's just not worth the time to use.

                                                                                                                              That said, if you really care, it generates faster than reading speed (on an A18 based model at least).
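The "12B won't fit in 8 GB" point can be checked with a quick size estimate. This is a sketch assuming about 4.5 bits per weight (a typical figure for a 4-bit block format that stores a 16-bit scale per 32 weights) and nominal parameter counts of 4B and 12B; real files differ somewhat because some tensors are kept at higher precision.

```python
def quantized_weights_gb(params_billion: float,
                         bits_per_weight: float = 4.5) -> float:
    """Approximate size of quantized weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8

PHONE_RAM_GB = 8  # total device memory, shared with the OS and other apps

for name, params in (("4B", 4), ("12B", 12)):
    size = quantized_weights_gb(params)
    print(f"{name}: ~{size:.2f} GB of weights on a {PHONE_RAM_GB} GB device")
```

Under these assumptions the 12B weights alone take most of the 8 GB, before the OS, the app, and the context are accounted for, while the 4B model leaves several gigabytes free.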

                                                                                                                              • woodson 9 hours ago

                                                                                                                                Some of these small models still have their uses, e.g. for summarization. Don’t expect them to fully replace ChatGPT.

                                                                                                                                • zamadatix 8 hours ago

                                                                                                                                  The use case is more "I'm willing to have really bad answers that have extremely high rates of making things up" than based on the application. The same goes for summarization, it's not like it does it well like a large model would.

                                                                                                                              • Alifatisk 11 hours ago

                                                                                                                                If you ever ship a private AI app, don't forget to implement the export functionality, please!

                                                                                                                                • idonotknowwhy 2 minutes ago

                                                                                                                                  You mean conversations? Just the jsonl of the standard hf dataset format to import into other systems?

                                                                                                                                • nico 10 hours ago

                                                                                                                                  What kind of functionality do you need from the model?

                                                                                                                                  For basic conversation and RAG, you can use tinyllama or qwen-2.5-0.5b, both of which run on a raspberry pi at around 5-20 tokens per second

                                                                                                                                  • nolist_policy 8 hours ago

                                                                                                                                    FWIW, I can run Gemma-3-12b-it-qat on my Galaxy Fold 4 with 12GB RAM at around 1.5 tokens/s. I use plain llama.cpp with Termux.

                                                                                                                                    • Casteil 7 hours ago

                                                                                                                                      Does this turn your phone into a personal space heater too?

                                                                                                                                  • CyberShadow 10 hours ago

                                                                                                                                    How does it compare to CodeGemma for programming tasks?

                                                                                                                                    • jarbus 11 hours ago

                                                                                                                                      Very excited to see these kinds of techniques, I think getting a 30B level reasoning model usable on consumer hardware is going to be a game changer, especially if it uses less power.

                                                                                                                                      • apples_oranges 11 hours ago

                                                                                                                                        Deepseek does reasoning on my home Linux pc but not sure how power hungry it is

                                                                                                                                        • gcr 11 hours ago

                                                                                                                                          what variant? I’d considered DeepSeek far too large for any consumer GPUs

                                                                                                                                          • scosman 11 hours ago

                                                                                                                                            Some people run Deepseek on CPU. 37B active params - it isn't fast but it's passable.

                                                                                                                                            • danielbln 11 hours ago

                                                                                                                                              Actual deepseek or some qwen/llama reasoning fine-tune?

                                                                                                                                              • scosman 9 hours ago

                                                                                                                                                Actual Deepseek. 500gb of memory and a threadripper works. Not a standard PC spec, but a common ish home brew setup for single user Deepseek.

                                                                                                                                      • Alifatisk 11 hours ago

                                                                                                                                        Apart from being lighter than the other models, is there anything the Gemma model is specifically good at, or better at than the other models?

                                                                                                                                        • Zambyte 10 hours ago

                                                                                                                                          I have found Gemma models are able to produce useful information about more niche subjects that other models like Mistral Small cannot. The trade-off is that Gemma never really says "I don't know" where other models will; it produces false information instead.

                                                                                                                                          For example, if I ask Mistral Small who I am by name, it will say there is no known notable figure by that name before the knowledge cutoff. Gemma 3 will say I am a well known <random profession> and make up facts. On the other hand, I have asked both about a local organization in my area that I am involved with, and Gemma 3 could produce useful and factual information, where Mistral Small said it did not know.

                                                                                                                                          • nico 10 hours ago

                                                                                                                                            They are multimodal. Havent tried the QAT one yet. But the gemma3s released a few weeks ago are pretty good at processing images and telling you details about what’s in them

                                                                                                                                            • itake 10 hours ago

                                                                                                                                              Google claims to have better multi-language support, due to tokenizer improvements.

                                                                                                                                            • cheriot 4 hours ago

                                                                                                                                              Is there already a Helium for GPUs?

                                                                                                                                              • wtcactus 11 hours ago

                                                                                                                                                They keep mentioning the RTX 3090 (with 24 GB VRAM), but the model is only 14.1 GB.

                                                                                                                                                Shouldn’t it fit a 5060 Ti 16GB, for instance?

                                                                                                                                                • oktoberpaard 11 hours ago

                                                                                                                                                  With a 128K context length and 8 bit KV cache, the 27b model occupies 22 GiB on my system. With a smaller context length you should be able to fit it on a 16 GiB GPU.
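A rough way to see why the context length matters so much is to estimate the two big memory consumers separately. The sketch below is only back-of-the-envelope arithmetic: the layer count, KV head count and head dimension are hypothetical placeholder values, not Gemma 3's real configuration, and it ignores sliding-window attention, quantization scale overhead and runtime buffers, so actual usage will differ.

```python
GIB = 1024 ** 3


def weight_bytes(n_params, bits_per_weight):
    """Approximate size of the weights alone, ignoring quantization scales."""
    return n_params * bits_per_weight / 8


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_value):
    """Approximate KV cache size for a plain full-attention transformer.

    The factor of 2 covers one key and one value vector per token,
    per layer, per KV head.
    """
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value


if __name__ == "__main__":
    # 27B parameters at 4 bits per weight (decimal GB, like the download size).
    print(f"weights: {weight_bytes(27_000_000_000, 4) / 1e9:.1f} GB")

    # Hypothetical architecture, chosen only to illustrate the scaling.
    arch = dict(n_layers=48, n_kv_heads=8, head_dim=128)
    for context_len in (8_192, 131_072):
        size = kv_cache_bytes(context_len=context_len, bytes_per_value=1, **arch)
        print(f"KV cache at {context_len} tokens, 8-bit: {size / GIB:.2f} GiB")
```

The cache grows linearly with context length, which is why the same model can fit a 16 GB card at a short context and still need a 24 GB card at 128K.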

                                                                                                                                                  • jsnell 11 hours ago

                                                                                                                                                    Memory is needed for more than just the parameters, e.g. the KV cache.

                                                                                                                                                    • cubefox 11 hours ago

                                                                                                                                                      KV = key-value

                                                                                                                                                    • Havoc an hour ago

                                                                                                                                                      Just checked - 19 gigs with 8k context @ q8 kv. Plus another 2.5-ish or so for OS etc.

                                                                                                                                                      ...so yeah 3090

                                                                                                                                                    • punnerud 3 hours ago

                                                                                                                                                      Just tested the 27B, and it’s not very good at following instructions and is very limited on more complex code problems.

                                                                                                                                                      Ask it to map one JSON with a lot of plain text into a new structure, and it fails every time.

                                                                                                                                                      Ask it to generate SVG, and it’s very simple and almost too dumb.

                                                                                                                                                      Nice that it doesn’t need that huge amount of RAM, and performs OK on smaller languages from my initial tests.

                                                                                                                                                      • btbuildem 11 hours ago

                                                                                                                                                        Is 27B the largest QAT Gemma 3? Given these size reductions, it would be amazing to have the 70B!

                                                                                                                                                        • arnaudsm 11 hours ago

                                                                                                                                                          The original Gemma 3 does not have a 70B version.

                                                                                                                                                        • noodletheworld 11 hours ago

                                                                                                                                                          ?

                                                                                                                                                          Am I missing something?

                                                                                                                                                          These have been out for a while; if you follow the HF link you can see, for example, the 27b quant has been downloaded from HF 64,000 times over the last 10 days.

                                                                                                                                                          Is there something more to this, or is just a follow up blog post?

                                                                                                                                                          (is it just that ollama finally has partial (no images right?) support? Or something else?)

                                                                                                                                                          • deepsquirrelnet 11 hours ago

                                                                                                                                                            QAT (“quantization aware training”) means they had it quantized to 4 bits during training rather than after training in full or half precision. It’s supposedly higher quality, but unfortunately they don’t show any comparisons between QAT and post-training quantization.
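To make the distinction concrete, here is a minimal sketch of the rounding step that quantization-aware training typically simulates in the forward pass (often called "fake quantization"). It is an illustration under simple assumptions (symmetric per-tensor scaling, plain Python lists), not the Gemma team's actual recipe, and the function name `fake_quantize` is made up for this example.

```python
def fake_quantize(weights, bits=4):
    """Snap weights onto a signed integer grid, then map them back to floats.

    With 4 bits the grid has 16 levels (-8..7). Post-training quantization
    applies this once to a finished model. QAT applies it during training,
    so the remaining weights can adapt to the rounding error.
    """
    qmax = 2 ** (bits - 1) - 1                          # 7 for 4 bits
    # One shared scale for the whole tensor; fall back to 1.0 if all zeros.
    scale = max(abs(w) for w in weights) / qmax or 1.0
    # Round to the nearest grid point and clamp to the representable range.
    levels = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return [q * scale for q in levels]


if __name__ == "__main__":
    original = [0.7, -0.35, 0.12, 0.0]
    for before, after in zip(original, fake_quantize(original)):
        print(f"{before:+.3f} -> {after:+.3f} (error {abs(before - after):.3f})")
```

The training loop itself is omitted here; in practice the rounding is non-differentiable, so frameworks usually pass gradients through it unchanged (a straight-through estimator).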

                                                                                                                                                            • noodletheworld 11 hours ago

                                                                                                                                                              I understand that, but the QAT models [1] are not new uploads.

                                                                                                                                                              How is this more significant now than when they were uploaded 2 weeks ago?

                                                                                                                                                              Are we expecting new models? I don’t understand the timing. This post feels like it’s two weeks late.

                                                                                                                                                              [1] - https://huggingface.co/collections/google/gemma-3-qat-67ee61...

                                                                                                                                                              • simonw 11 hours ago

                                                                                                                                                                The official announcement of the QAT models happened on Friday 18th, two days ago. It looks like they uploaded them to HF in advance of that announcement: https://developers.googleblog.com/en/gemma-3-quantized-aware...

                                                                                                                                                                The partnership with Ollama and MLX and LM Studio and llama.cpp was revealed in that announcement, which made the models a lot easier for people to use.

                                                                                                                                                                • llmguy 11 hours ago

                                                                                                                                                                  8 days is closer to 1 week than 2. And it’s a blog post, nobody owes you realtime updates.

                                                                                                                                                                  • noodletheworld 11 hours ago

                                                                                                                                                                    https://huggingface.co/google/gemma-3-27b-it-qat-q4_0-gguf/t...

                                                                                                                                                                    > 17 days ago

                                                                                                                                                                    Anywaaay...

                                                                                                                                                                    I'm literally asking, quite honestly, if this is just an 'after the fact' update literally weeks later, that they uploaded a bunch of models, or if there is something more significant about this I'm missing.

                                                                                                                                                                    • osanseviero 10 hours ago

                                                                                                                                                                      Hi! Omar from the Gemma team here.

                                                                                                                                                                      Last time we only released the quantized GGUFs. Only llama.cpp users could use it (+ Ollama, but without vision).

                                                                                                                                                                      Now, we released the unquantized checkpoints, so anyone can quantize them themselves and use them in their favorite tools, including Ollama with vision, MLX, LM Studio, etc. MLX folks also found that the model worked decently with 3 bits compared to naive 3-bit, so by releasing the unquantized checkpoints we allow further experimentation and research.

                                                                                                                                                                      TL;DR. One was a release in a specific format/tool, we followed-up with a full release of artifacts that enable the community to do much more.

                                                                                                                                                                      • oezi 10 hours ago

                                                                                                                                                                        Hey Omar, is there any chance that Gemma 3 might get a speech (ASR/AST/TTS) release?

                                                                                                                                                                      • timcobb 11 hours ago

                                                                                                                                                                        Probably the former... I see your confusion but it's really only a couple weeks at most. The news cycle is strong in you, grasshopper :)

                                                                                                                                                                • xnx 11 hours ago

                                                                                                                                                                  The linked blog post was 2 days ago

                                                                                                                                                                • XCSme 10 hours ago

                                                                                                                                                                  So how does 27b-it-qat (18GB) compare to 27b-it-q4_K_M (17GB)?

                                                                                                                                                                  • perching_aix 10 hours ago

                                                                                                                                                                    This is my first time trying to locally host a model - gave both the 12B and 27B QAT models a shot.

                                                                                                                                                                    I was both impressed and disappointed. Setup was piss easy, and the models are great conversationalists. I have a 12 gig card available and the 12B model ran very nice and swift.

                                                                                                                                                                    However, they're seemingly terrible at actually assisting with stuff. Tried something very basic: asked for a powershell one liner to get the native blocksize of my disks. Ended up hallucinating fields, then telling me to go off into the deep end, first elevating to admin, then using WMI, then bringing up IOCTL. Pretty unfortunate. Not sure I'll be able to put it to actual meaningful use as a result.

                                                                                                                                                                    • HachiWari8 2 hours ago

                                                                                                                                                                      I tried the 27B QAT model and it hallucinates like crazy. When I ask it for information about some made up person, restaurant, place name, etc., it never says "I don't know about that" and instead seems eager to just make up details. The larger local models like the older Llama 3.3 70B seem better at this, but are also too big to fit on a 24GB GPU.

                                                                                                                                                                      • terhechte 9 hours ago

                                                                                                                                                                      Local models, due to their size more than big cloud models, favor popular languages rather than more niche ones. They work fantastically for JavaScript, Python, and Bash but are much worse at less popular things like Clojure, Nim or Haskell. Powershell is probably on the less popular side compared to JS or Bash.

                                                                                                                                                                        If this is your main use case you can always try to fine tune a model. I maintain a small llm bench of different programming languages and the performance difference between say Python and Rust on some smaller models is up to 70%

                                                                                                                                                                        • perching_aix 9 hours ago

                                                                                                                                                                          How accessible and viable is model fine-tuning? I'm not in the loop at all unfortunately.

                                                                                                                                                                        • parched99 9 hours ago

                                                                                                                                                                          I think Powershell is a bad test. I've noticed all local models have trouble providing accurate responses to Powershell-related prompts. Strangely, even Microsoft's model, Phi 4, is bad at answering these questions without careful prompting. Though, MS can't even provide accurate PS docs.

                                                                                                                                                                          My best guess is that there's not enough discussion/development related to Powershell in training data.

                                                                                                                                                                          • fragmede 5 hours ago

                                                                                                                                                                            Which, like, you'd think Microsoft has an entire team there who's purpose would be to generate good PowerShell for it to train on.

                                                                                                                                                                        • briandear 8 hours ago

                                                                                                                                                                          The normal Gemma models seem to work fine on Apple silicon with Metal. Am I missing something?

                                                                                                                                                                          • simonw 8 hours ago

                                                                                                                                                                            These new special editions of those models claim to work better with less memory.

                                                                                                                                                                          • api 9 hours ago

                                                                                                                                                                            When I see 32B or 70B models performing similarly to 200+B models, I don’t know what to make of this. Either the latter contains more breadth of information but we have managed to distill latent capabilities to be similar, the larger models are just less efficient, or the tests are not very good.

                                                                                                                                                                            • simonw 9 hours ago

                                                                                                                                                                            It makes intuitive sense to me that this would be possible, because LLMs are still mostly opaque black boxes. I expect you could drop a whole bunch of the weights without having a huge impact on quality - maybe you end up mostly ditching the parts that are derived from shitposts on Reddit but keep the bits from Arxiv for example.

                                                                                                                                                                              (That's a massive simplification of how any of this works, but it's how I think about it at a high level.)

                                                                                                                                                                              • retinaros 8 hours ago

                                                                                                                                                                                its just bs benchmarks. they are all cheating at this point feeding the data in the training set. doesnt mean the llm arent becoming better but when they all lie...

                                                                                                                                                                              • rob_c 11 hours ago

                                                                                                                                                                                Given how long between this being released and this community picking up on it... Lol

                                                                                                                                                                                • GaunterODimm 7 hours ago

                                                                                                                                                                                  2days :/...

                                                                                                                                                                                  • rob_c 7 hours ago

                                                                                                                                                                                    Given I know people running gemma3 on local devices for over almost a month now this is either a very slow news day or evidence of finger missing the pulse... https://blog.google/technology/developers/gemma-3/

                                                                                                                                                                                    • simonw 7 hours ago

                                                                                                                                                                                      This is new. These are new QAT (Quantization-Aware Training) models released by the Gemma team.

                                                                                                                                                                                      • rob_c 7 hours ago

                                                                                                                                                                                        There's nothing more than an iteration on the topic, gemma3 was smashing local results a month ago and made no waves as it dropped...

                                                                                                                                                                                        • simonw 6 hours ago

                                                                                                                                                                                          Quoting the linked story:

                                                                                                                                                                                          > Last month, we launched Gemma 3, our latest generation of open models. Delivering state-of-the-art performance, Gemma 3 quickly established itself as a leading model capable of running on a single high-end GPU like the NVIDIA H100 using its native BFloat16 (BF16) precision.

                                                                                                                                                                                          > To make Gemma 3 even more accessible, we are announcing new versions optimized with Quantization-Aware Training (QAT) that dramatically reduces memory requirements while maintaining high quality.

                                                                                                                                                                                          The thing that's new, and that is clearly resonating with people, is the "To make Gemma 3 even more accessible..." bit.

                                                                                                                                                                                          • rob_c 6 hours ago

                                                                                                                                                                                            As I've said in my lectures on how to perform 1bit training of QAT systems to build classifiers...

                                                                                                                                                                                            "An iteration on a theme".

                                                                                                                                                                                          Once the network design is proven to work yes it's an impressive technical achievement, but as I've said given I've known people in multiple research institutes and companies using Gemma3 for a month mostly saying they're surprised it's not getting noticed... This is just enabling more users but the non-QAT version will almost always perform better...

                                                                                                                                                                                            • simonw 6 hours ago

                                                                                                                                                                                              Sounds like you're excited to see Gemma 3 get the recognition it deserves on Hacker News then.

                                                                                                                                                                                              • rob_c 6 hours ago

                                                                                                                                                                                                No just pointing out the flooding obvious as usual and collecting down votes for it

                                                                                                                                                                                                • fragmede 4 hours ago

                                                                                                                                                                                                  Speaking for myself, my downvotes are not because of the content of your arguments, but because your tone is consistently condescending and dismissive. Comments like “just pointing out the flooding obvious” come off as smug and combative rather than constructive.

                                                                                                                                                                                                  HN works best when people engage in good faith, stay curious, and try to move the conversation forward. That kind of tone — even when technically accurate — discourages others from participating and derails meaningful discussion.

                                                                                                                                                                                                  If you’re getting downvotes regularly, maybe it's worth considering how your comments are landing with others, not just whether they’re “right.”