• LTL_FTC 15 hours ago

    It sounds like you don't need immediate LLM responses and can batch process your data nightly? Have you considered running a local LLM? You may not need to pay for API calls at all. Today's local models are quite good. I started off on CPU and even that was fine for my pipelines.
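
    If it helps, here's a rough sketch of the kind of nightly CPU pipeline I mean, using llama-cpp-python with a small quantized GGUF model (the model path, categories, and prompts are just placeholders, not anything from your setup):

        from llama_cpp import Llama

        # Small quantized model; runs fine on CPU, just slowly.
        llm = Llama(model_path="models/qwen2.5-3b-instruct-q4_k_m.gguf", n_ctx=4096, verbose=False)

        def classify(feedback: str) -> str:
            out = llm.create_chat_completion(
                messages=[
                    {"role": "system", "content": "Classify the feedback as bug, feature_request, or other."},
                    {"role": "user", "content": feedback},
                ],
                max_tokens=16,
                temperature=0.0,
            )
            return out["choices"][0]["message"]["content"].strip()

        # Nightly batch: just loop over the queue; latency doesn't matter.
        for item in ["App crashes when I export to PDF", "Love the new dark mode"]:
            print(classify(item))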

    • ok_orco 4 hours ago

      I haven't thought about that, but really want to dig in more now. Any places you recommend starting?

      • kreetx 12 hours ago

        Though I haven't done any extensive testing, I personally could easily get by with current local models. The only reason I don't is that the hosted ones all have free tiers.

        • ydu1a2fovb 9 hours ago

          Can you suggest any good LLMs for CPU?

          • R_D_Olivaw 6 hours ago

            Following.

          • queenkjuul 13 hours ago

            Agreed, I'm pretty amazed at what I'm able to do locally just with an AMD 6700XT and 32GB of RAM. It's slow, but if you've got all night...

          • 44za12 15 hours ago

            This is the way. I actually mapped out the decision tree for this exact process and more here:

            https://github.com/NehmeAILabs/llm-sanity-checks

            • homeonthemtn 9 hours ago

              That's interesting. Is there any kind of mapping to these respective models somewhere?

              • 44za12 9 hours ago

                Yes, I included a 'Model Selection Cheat Sheet' in the README (scroll down a bit).

                I map them by task type:

                Tiny (<3B): Gemma 3 1B (could try 4B as well), Phi-4-mini (good for classification).
                Small (8B-17B): Qwen 3 8B, Llama 4 Scout (good for RAG/extraction).
                Frontier: GPT-5, Llama 4 Maverick, GLM, Kimi.

                Is that what you meant?

            • gandalfar 15 hours ago

              Consider using z.ai as a model provider to further lower your costs.

              • ok_orco 4 hours ago

                Will take a look!

                • tehlike 15 hours ago

                  This is what I was going to suggest too.

                  • viraptor 15 hours ago

                    Or MiniMax - the M2.1 release didn't make a big splash in the news, but it's really capable.

                    • DANmode 15 hours ago

                      Do they, or any other providers, do anything to improve on the often-chronicled variability of quality/effort from the two major services, e.g. during peak hours?

                    • deepsummer 13 hours ago

                      As much as I like the Claude models, they are expensive. I wouldn't use them to process large volumes of data. Gemini 2.5 Flash-Lite is $0.10 per million tokens. Grok 4.1 Fast is really good and only $0.20. They will work just as well for most simple tasks.

                      • joshribakoff 15 hours ago
                        • dezgeg 15 hours ago

                          Are you also setting the proper prompt cache control attributes? I think the Anthropic API still doesn't do it automatically.
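
                          For anyone else reading, a rough sketch of what setting it looks like, assuming the big shared system prompt is the part worth caching (the model name and prompts here are placeholders):

                              import anthropic

                              client = anthropic.Anthropic()

                              LONG_SHARED_INSTRUCTIONS = "..."  # the large reusable prefix sent with every item

                              response = client.messages.create(
                                  model="claude-3-5-haiku-latest",  # placeholder model
                                  max_tokens=256,
                                  system=[
                                      {
                                          "type": "text",
                                          "text": LONG_SHARED_INSTRUCTIONS,
                                          # Opt this block into prompt caching; it isn't cached by default.
                                          "cache_control": {"type": "ephemeral"},
                                      }
                                  ],
                                  messages=[{"role": "user", "content": "one feedback item here"}],
                              )
                              print(response.content[0].text)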

                          • arthurcolle 21 hours ago

                            Can you discuss a bit more of the architecture?

                            • ok_orco 21 hours ago

                              Pretty straightforward. Sources dump into a queue throughout the day, regex filters the obvious junk ("lol", "thanks", bot messages never hit the LLM), then everything gets batched overnight through Anthropic's Batch API for classification. Feedback gets clustered against existing pain points or creates new ones.

                              Most of the cost savings came from not sending stuff to the LLM that didn't need to go there, plus the batch API is half the price of real-time calls.
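
                              In case it's useful, the shape of it is roughly this (simplified; the regexes, model name, and prompt are illustrative placeholders, not the exact ones I use):

                                  import re
                                  import anthropic

                                  # Placeholder junk filters -- the real list is longer and tuned to our sources.
                                  JUNK_PATTERNS = [
                                      re.compile(r"^\s*(lol|thanks|thx|\+1)\s*[.!]*\s*$", re.IGNORECASE),
                                      re.compile(r"\[bot\]", re.IGNORECASE),
                                  ]

                                  def is_junk(text: str) -> bool:
                                      return any(p.search(text) for p in JUNK_PATTERNS)

                                  client = anthropic.Anthropic()

                                  def submit_nightly_batch(queued_items: list[dict]) -> str:
                                      """Build one batch request per non-junk item and submit to the Batch API."""
                                      requests = [
                                          {
                                              "custom_id": item["id"],
                                              "params": {
                                                  "model": "claude-3-5-haiku-latest",  # placeholder model
                                                  "max_tokens": 256,
                                                  "messages": [{
                                                      "role": "user",
                                                      "content": "Classify this feedback into a pain point category:\n\n" + item["text"],
                                                  }],
                                              },
                                          }
                                          for item in queued_items
                                          if not is_junk(item["text"])
                                      ]
                                      batch = client.messages.batches.create(requests=requests)
                                      return batch.id  # poll this later; overnight turnaround is fine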

                            • DeathArrow 14 hours ago

                              You can also try using cheaper models like GLM, DeepSeek, or Qwen, at least partially.