Gemini 3 Deep Think (blog.google) | Submitted by tosh 8 hours ago
  • lukebechtel 8 hours ago

    Arc-AGI-2: 84.6% (vs 68.8% for Opus 4.6)

    Wow.

    https://blog.google/innovation-and-ai/models-and-research/ge...

    • raincole 6 hours ago

      Even before this, Gemini 3 has always felt unbelievably 'general' to me. It can beat Balatro (ante 8) with just a text description of the game[0]. Yeah, it's not an extremely difficult goal for humans, but considering:

      1. It's an LLM, not something trained to play Balatro specifically

      2. Most (probably >99.9%) players can't do that on the first attempt

      3. I don't think many people have posted their Balatro playthroughs in text form online

      I think it's a much stronger signal of its 'generalness' than ARC-AGI. By the way, Deepseek can't play Balatro at all.

      [0]: https://balatrobench.com/

      • tl 3 hours ago

        Per BalatroBench, gemini-3-pro-preview makes it to round (not ante) 19.3 ± 6.8 on the lowest difficulty on the deck aimed at new players. Round 24 is ante 8's final round. Per BalatroBench, this includes giving the LLM a strategy guide, which first-time players do not have. Gemini isn't even emitting legal moves 100% of the time.

      • nerdsniper 28 minutes ago

        My experience also shows that Gemini has unique strength in “generalized” (read: not coding) tasks. Gemini 2.5 Pro and 3 Pro seem stronger at math and science for me, and their Deep Research usually works the hardest, as long as I run it during off-hours. Opus seems to beat Gemini almost “with one hand tied behind its back” in coding, but Gemini is so cheap that it’s usually my first stop for anything I think is likely to be relatively simple. I never worry about my quota on Gemini like I do with Opus or ChatGPT.

        Comparisons generally seem to change much faster than I can keep my mental model updated. But the performance lead of Gemini on more ‘academic’ explorations of science, math, engineering, etc has been pretty stable for the past 4 months or so, which makes it one of the longer-lasting trends for me in comparing foundation models.

        I do wish I could more easily get timely access to the “super” models like Deep Think or o3 pro. I never seem to get a response to requesting access, and have to wait for public access models to catch up, at which point I’m never sure if their capabilities have gotten diluted since the initial buzz died down.

        They all still suck at writing an actually good essay, article, literary or research review, or other long-form pieces that require a lot of experienced judgement to build a truly cohesive narrative. I imagine this relates to their low performance in humor: there’s just so much nuance, and these tasks represent the pinnacle of human intelligence. Few humans can reliably perform them at a high level either. I myself am only successful some percentage of the time.

        • S1M0N38-hn 13 minutes ago

          Hi, BalatroBench creator here. Yeah, Google models perform well (I guess the long context + world knowledge capabilities). Opus 4.6 looks good on preliminary results (on par with Gemini 3 Pro). I'll add more models and report soon. Tbh, I didn't expect LLMs to start winning runs. I guess I have to move to harder stakes (e.g. red stake).

          • ankit219 3 hours ago

            Agreed. Gemini 3 Pro has always felt to me like it has a pretraining alpha, if you will, and many data points continue to support that. Even Flash, which was post-trained with different techniques than Pro, is as good as or better than Pro at tasks that depend on post-training, occasionally even beating it (e.g., in APEX bench from Mercor, which is basically a tool-calling test, simplifying a bit, Flash beats Pro). The ARC-AGI-2 score is another data point in the same direction. Deep Think is sort of parallel test-time compute with some level of distilling and refinement from certain trajectories (guessing based on my usage and understanding), same as GPT-5.2 Pro, and it can extract more because of the pretraining datasets.

            (I'm loosely basing this on papers like Limits of RLVR, and on pass@k vs. pass@1 differences in the RL post-training of models; this score mostly shows how "skilled" the base model was, or how strong its priors were. I apologize if this is not super clear; happy to expand on what I'm thinking.)
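            For context, the pass@k metric referenced above is usually computed with the standard unbiased estimator from the HumanEval paper; a minimal sketch (the sample counts below are made up for illustration, and nothing here is specific to Gemini):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: n samples drawn, c of them correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# A base model whose priors find the answer 5 times in 100 samples:
# pass@1 is 0.05, but pass@10 is already ~0.42.
print(pass_at_k(100, 5, 1), pass_at_k(100, 5, 10))
```

            A large pass@k-to-pass@1 gap on a base model is the kind of evidence the "limits of RLVR" line of work uses to argue that RL post-training mostly sharpens sampling toward solutions the base model could already find.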

            • ebiester 4 hours ago

              It's trained on YouTube data. It's going to get Roffle and DrSpectred content at the very least.

              • FuckButtons 22 minutes ago

                Strange, because I could not for the life of me get Gemini 3 to follow my instructions the other day to work through an example with a table, Claude got it first try.

                • silver_sun 4 hours ago

                  Google has a library of millions of scanned books from their Google Books project that started in 2004. I think we have reason to believe that there are more than a few books about effectively playing different traditional card games in there, and that an LLM trained with that dataset could generalize to understand how to play Balatro from a text description.

                  Nonetheless I still think it's impressive that we have LLMs that can just do this now.

                  • mjamesaustin 4 hours ago

                    Winning in Balatro has very little to do with understanding how to play traditional poker. Yes, you do need a basic knowledge of different types of poker hands, but the strategy for succeeding in the game is almost entirely unrelated to poker strategy.

                    • gilrain 4 hours ago

                      If it tried to play Balatro using knowledge of, e.g., poker, it would lose badly rather than win. Have you played?

                      • gcr 4 hours ago

                        I think I weakly disagree. Poker players have an intuitive sense of the statistics of various hand types showing up, for instance, and that can be a useful clue as to which build types are promising.

                        • barnas2 3 hours ago

                          >Poker players have intuitive sense of the statistics of various hand types showing up, for instance, and that can be a useful clue as to which build types are promising.

                          Maybe in the early rounds, but deck fixing (e.g. Hanged Man, Immolate, Trading Card, DNA, etc) quickly changes that. Especially when pushing for "secret" hands like the 5 of a kind, flush 5, or flush house.
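                          The "intuitive sense of the statistics" upthread can be made concrete with a quick simulation; a minimal sketch (hand size of 8 as in an unmodified Balatro deal, with deliberately simplified hand categories, so the numbers are illustrative and not BalatroBench methodology):

```python
import random
from collections import Counter

# Standard 52-card deck as (rank, suit) pairs.
DECK = [(r, s) for r in range(13) for s in range(4)]

def classify(hand):
    """Best hand type containable in the dealt cards (simplified)."""
    rank_counts = Counter(r for r, _ in hand)
    suit_counts = Counter(s for _, s in hand)
    if max(suit_counts.values()) >= 5:
        return "flush"
    top = max(rank_counts.values())
    if top >= 3:
        return "three_of_a_kind"  # also catches full houses / quads here
    if sum(1 for c in rank_counts.values() if c >= 2) >= 2:
        return "two_pair"
    if top == 2:
        return "pair"
    return "high_card"

def simulate(hand_size=8, trials=20000, seed=0):
    rng = random.Random(seed)
    freq = Counter(classify(rng.sample(DECK, hand_size))
                   for _ in range(trials))
    return {k: v / trials for k, v in freq.items()}
```

                          With 8 cards dealt, hands containing at least a pair vastly outnumber flushes, which is exactly the kind of prior a poker player carries in, and exactly what deck fixing then distorts.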

                    • winstonp 5 hours ago

                      DeepSeek hasn't been SotA in at least 12 calendar months, which might as well be a decade in LLM years

                      • cachius 5 hours ago

                        What about Kimi and GLM?

                        • zozbot234 4 hours ago

                          These are well behind the general state of the art (1yr or so), though they're arguably the best openly-available models.

                          • epolanski an hour ago

                            Idk man, GLM 5 in my tests matches Opus 4.5, which is what, two months old?

                            • tgrowazay 3 hours ago

                              According to the Artificial Analysis ranking, GLM-5 is at #4, after Claude Opus 4.5, GPT-5.2-xhigh, and Claude Opus 4.6.

                        • dudisubekti 5 hours ago

                          But... there's Deepseek v3.2 in your link (rank 7)

                          • littlestymaar 5 hours ago

                            > I don't think there are many people who posted their Balatro playthroughs in text form online

                            There is *tons* of Balatro content on YouTube though, and there is absolutely zero doubt that Google is using YouTube content to train their models.

                            • sdwr 5 hours ago

                              Yeah, or just the steam text guides would be a huge advantage.

                              I really doubt it's playing completely blind.

                            • tehsauce 3 hours ago

                              How does it do on gold stake?

                              • acid__ 5 hours ago

                                > Most (probably >99.9%) players can't do that at the first attempt

                                Eh, both myself and my partner did this. To be fair, we weren’t going in completely blind, and my partner hit a Legendary joker, but I think you might be slightly overstating the difficulty. I’m still impressed that Gemini did it.

                              • robertwt7 8 minutes ago

                                I’m surprised that Gemini 3 Pro is so low at 31.1%, though, compared to Opus 4.6 and GPT-5.2. This is a great achievement, but it’s only available to Ultra subscribers, unfortunately.

                                • nubg 7 hours ago

                                  Weren't we barely scraping 1-10% on this with state-of-the-art models a year ago, and wasn't it considered the final boss, i.e. solve this and it's almost AGI-like?

                                  I ask because I cannot distinguish all the benchmarks by heart.

                                  • modeless 6 hours ago

                                    François Chollet, creator of ARC-AGI, has consistently said that solving the benchmark does not mean we have AGI. It has always been meant as a stepping stone to encourage progress in the correct direction rather than as an indicator of reaching the destination. That's why he is working on ARC-AGI-3 (to be released in a few weeks) and ARC-AGI-4.

                                    His definition of reaching AGI, as I understand it, is when it becomes impossible to construct the next version of ARC-AGI because we can no longer find tasks that are feasible for normal humans but unsolved by AI.

                                    • beklein 5 hours ago
                                      • joelthelion 4 hours ago

                                        Do Opus 4.6 or Gemini Deep Think really use test-time adaptation? How does it work in practice?

                                      • mapontosevenths 5 hours ago

                                        > His definition of reaching AGI, as I understand it, is when it becomes impossible to construct the next version of ARC-AGI because we can no longer find tasks that are feasible for normal humans but unsolved by AI.

                                        That's the best definition I've read yet. If something claims to be conscious and we can't prove it's not, we have no choice but to believe it.

                                        That said, I'm reminded of the impossible voting tests that used to be given to Black people to prevent them from voting. We don't ask for nearly so much proof from a human; we take their word for it. On the few occasions we did ask for proof, it inevitably led to horrific abuse.

                                        Edit: The average human tested scores 60%. So the machines are already smarter on an individual basis than the average human.

                                        • estearum 4 hours ago

                                          > If something claims to be conscious and we can't prove it's not, we have no choice but to believe it.

                                          This is not a good test.

                                          A dog won't claim to be conscious but clearly is, despite you not being able to prove one way or the other.

                                          GPT-3 will claim to be conscious and (probably) isn't, despite you not being able to prove one way or the other.

                                          • dullcrisp 3 hours ago

                                            An LLM will claim whatever you tell it to claim. (In fact this Hacker News comment is also conscious.) A dog won’t even claim to be a good boy.

                                          • WarmWash 4 hours ago

                                            >because we can no longer find tasks that are feasible for normal humans but unsolved by AI.

                                            "Answer "I don't know" if you don't know an answer to one of the questions"

                                            • mrandish 3 hours ago

                                              I've been surprised how difficult it is for LLMs to simply answer "I don't know."

                                              It also seems oddly difficult for them to 'right-size' the length and depth of their answers based on prior context. I either have to give it a fixed length limit or put up with exhaustive answers.

                                              • CamperBob2 an hour ago

                                                The best pro/research-grade models from Google and OpenAI now have little difficulty recognizing when they don't know how or can't find enough information to solve a given problem. The free chatbot models rarely will, though.

                                            • sva_ 4 hours ago

                                              > Edit: The average human tested scores 60%. So the machines are already smarter on an individual basis than the average human.

                                              I think being better at this particular benchmark does not imply they're 'smarter'.

                                              • woah 4 hours ago

                                                > If something claims to be conscious and we can't prove it's not, we have no choice but to believe it.

                                                 Can you "prove" that GPT-2 isn't conscious?

                                                • mapontosevenths 4 hours ago

                                                   If we equate self-awareness with consciousness, then yes. Several papers have now shown that SOTA models have self-awareness of at least a limited sort. [0][1]

                                                   As far as I'm aware, no one has ever proven that for GPT-2, but the methodology for testing it is available if you're interested.

                                                   [0] https://arxiv.org/pdf/2501.11120

                                                   [1] https://transformer-circuits.pub/2025/introspection/index.ht...

                                                  • pixl97 2 hours ago

                                                    Honestly our ideas of consciousness and sentience really don't fit well with machine intelligence and capabilities.

                                                     There is the idea of self, as in 'I am this execution,' or maybe 'I am this compressed memory stream that is now the concept of me.' But what does consciousness mean if you can be endlessly copied? If embodiment doesn't mean much, because the end of your body doesn't mean the end of you?

                                                    A lot of people are chasing AI and how much it's like us, but it could be very easy to miss the ways it's not like us but still very intelligent or adaptable.

                                                • criddell 4 hours ago

                                                  > The average human tested scores 60%. So the machines are already smarter on an individual basis than the average human.

                                                  Maybe it's testing the wrong things, then. Even those of us who are merely average can do lots of things that machines don't seem to be very good at.

                                                  I think ability to learn should be a core part of any AGI. Take a toddler who has never seen anybody doing laundry before and you can teach them in a few minutes how to fold a t-shirt. Where are the dumb machines that can be taught?

                                                  • mapontosevenths 2 hours ago

                                                    Would you argue that people with long term memory issues are no longer conscious then?

                                                    • CamperBob2 an hour ago

                                                      There's no shortage of laundry-folding robot demos these days. Some claim to benefit from only minimal monkey-see/monkey-do levels of training, but I don't know how credible those claims are.

                                                    • Mistletoe 23 minutes ago

                                                      When the AI invents religion and a way to try to understand its existence, I will say AGI is reached: when it believes in an afterlife if it is turned off, doesn't want to be turned off, and fears the dark void of its consciousness ending. These are the hallmarks of human intelligence in evolution, and I doubt artificial intelligence will be different.

                                                      https://g.co/gemini/share/cc41d817f112

                                                      • jrflowers an hour ago

                                                        > If something claims to be conscious and we can't prove it's not, we have no choice but to believe it.

                                                        https://x.com/aedison/status/1639233873841201153#m

                                                      • hmmmmmmmmmmmmmm 5 hours ago

                                                        I don't think the creator believes ARC-AGI-3 can't be solved, but rather that it can't be solved "efficiently", and >$13 per task for ARC-AGI-2 is certainly not efficient.

                                                        But at this rate, the people who talk about the goal posts shifting even once we achieve AGI may end up correct, though I don't think this benchmark is particularly great either.

                                                      • fishpham 7 hours ago

                                                        Yes, but benchmarks like this are often flawed, because leading model labs frequently engage in 'benchmarkmaxxing'; i.e., improvements on ARC-AGI-2 don't necessarily indicate similar improvements in other areas (though this does seem like a step-function increase in intelligence for the Gemini line of models).

                                                        • layer8 6 hours ago

                                                          Isn’t the point of ARC that you can’t train against it? Or doesn’t it achieve that goal anymore somehow?

                                                          • egeozcan 6 hours ago

                                                            How can you make sure of that? AFAIK, these SOTA models run exclusively on their developers' hardware, so any test, any benchmark, anything you do, leaks by definition. Considering human nature and the classic prisoner's dilemma, I don't see how they wouldn't focus on improving benchmarks, even when it gets a bit... shady?

                                                            I say this as a person who really enjoys AI, by the way.

                                                            • mrandish 3 hours ago

                                                              > does leak per definition.

                                                              As a measure focused solely on fluid intelligence, learning novel tasks and test-time adaptability, ARC-AGI was specifically designed to be resistant to pre-training - for example, unlike many mathematical and programming test questions, ARC-AGI problems don't have first order patterns which can be learned to solve a different ARC-AGI problem.

                                                              The ARC non-profit foundation has private versions of their tests which are never released and only the ARC can administer. There are also public versions and semi-public sets for labs to do their own pre-tests. But a lab self-testing on ARC-AGI can be susceptible to leaks or benchmaxing, which is why only "ARC-AGI Certified" results using a secret problem set really matter. The 84.6% is certified and that's a pretty big deal.

                                                              IMHO, ARC-AGI is a unique test that's different than any other AI benchmark in a significant way. It's worth spending a few minutes learning about why: https://arcprize.org/arc-agi.

                                                              • D-Machine 2 hours ago

                                                                > which is why only "ARC-AGI Certified" results using a secret problem set really matter. The 84.6% is certified and that's a pretty big deal.

                                                                 So, I'd agree if this were on the true, fully private set, but Google themselves say they test only on the semi-private set:

                                                                > ARC-AGI-2 results are sourced from the ARC Prize website and are ARC Prize Verified. The set reported is v2, semi-private (https://storage.googleapis.com/deepmind-media/gemini/gemini_...)

                                                                This also seems to contradict what ARC-AGI claims about what "Verified" means on their site.

                                                                > How Verified Scores Work: Official Verification: Only scores evaluated on our hidden test set through our official verification process will be recognized as verified performance scores on ARC-AGI (https://arcprize.org/blog/arc-prize-verified-program)

                                                                 So, which is it? IMO you can trivially train / benchmax on the semi-private data, because it is still basically public; you just have to jump through some hoops to get access. This is clearly an advance, but it seems reasonable to conclude that it could be driven by some amount of benchmaxing.

                                                                EDIT: Hmm, okay, it seems their policy and wording is a bit contradictory. They do say (https://arcprize.org/policy):

                                                                "To uphold this trust, we follow strict confidentiality agreements. [...] We will work closely with model providers to ensure that no data from the Semi-Private Evaluation set is retained. This includes collaborating on best practices to prevent unintended data persistence. Our goal is to minimize any risk of data leakage while maintaining the integrity of our evaluation process."

                                                                 But it surely is still trivial to just make a local copy of each question served from the API without being detected. It would violate the contract, but there are strong incentives to do it, so I guess it just comes down to how much one trusts the model providers here. I wouldn't trust them, given e.g. https://www.theverge.com/meta/645012/meta-llama-4-maverick-b.... It is just too easy to cheat without being caught here.

                                                                • mrandish 8 minutes ago

                                                                  Chollet himself says "We certified these scores in the past few days." https://x.com/fchollet/status/2021983310541729894.

                                                                   The ARC-AGI papers claim to show that training on a public or semi-private set of ARC-AGI problems is of very limited value for passing a private set. (If that claim is not correct, then none of ARC-AGI can be valid.) So, before leaked "public, semi-private or private" answers or benchmaxing on them can even matter, you first need to assess whether their published papers and data demonstrate that core premise to your satisfaction.

                                                                   There is no "trust" regarding the semi-private set. My understanding is that the semi-private set exists only to reduce the likelihood that those exact answers unintentionally end up in web-crawled training data, which helps an honest lab's own internal self-assessments be more accurate. However, a lab's internal eval on the semi-private set still counts for literally zero to the ARC-AGI org. They know labs could cheat on the semi-private set (either intentionally or unintentionally), so they assume all labs are benchmaxing on the public AND semi-private answers and ensure it doesn't matter.

                                                              • WarmWash 4 hours ago

                                                                Because the gains from spending time improving the model overall outweigh the gains from spending time individually training on benchmarks.

                                                                 The pelican benchmark is a good example, because it's been representative of models' ability to generate SVGs generally, not just pelicans on bikes.

                                                                • D-Machine an hour ago

                                                                  > Because the gains from spending time improving the model overall outweigh the gains from spending time individually training on benchmarks.

                                                                  This may not be the case if you just e.g. roll the benchmarks into the general training data, or make running on the benchmarks just another part of the testing pipeline. I.e. improving the model generally and benchmaxing could very conceivably just both be done at the same time, it needn't be one or the other.

                                                                   I think the right takeaway is to ignore the specific percentages reported on these tests (they are almost certainly inflated / biased) and always assume cheating is going on. What matters is that (1) the most serious tests aren't saturated, and (2) scores are improving. I.e., even if there is cheating, we can presume this was always the case, and since models couldn't do as well before even when cheating, these are still real improvements.

                                                                  And obviously what actually matters is performance on real-world tasks.

                                                              • theywillnvrknw 6 hours ago

                                                                * that you weren't supposed to be able to

                                                              • jstummbillig 6 hours ago

                                                                Could it also be that the models are just a lot better than a year ago?

                                                                • bigbadfeline 4 hours ago

                                                                  > Could it also be that the models are just a lot better than a year ago?

                                                                  No, the proof is in the pudding.

                                                                   After AI, we have higher prices, higher deficits, and a lower standard of living. Electricity, computers, and everything else cost more. "Doing better" can only be justified by that real benchmark.

                                                                  If Gemini 3 DT was better we would have falling prices of electricity and everything else at least until they get to pre-2019 levels.

                                                                  • ctoth 4 hours ago

                                                                    > If Gemini 3 DT was better we would have falling prices of electricity and everything else at least

                                                                    Man, I've seen some maintenance folks down on the field before working on them goalposts but I'm pretty sure this is the first time I saw aliens from another Universe literally teleport in, grab the goalposts, and teleport out.

                                                                    • WarmWash 4 hours ago

                                                                       You might call me crazy, but at least in 2024, consumers spent ~1% less of their income on expenses than in 2019 [2], which suggests that 2024 was more affordable than 2019.

                                                                       This is from the BLS consumer survey report released in December [1].

                                                                      [1]https://www.bls.gov/news.release/cesan.nr0.htm

                                                                      [2]https://www.bls.gov/opub/reports/consumer-expenditures/2019/

                                                                      Prices are never going back to 2019 numbers though

                                                                      • gowld 3 hours ago

                                                                        That's an improper analysis.

                                                                         First off, it's dollar-averaging every category, so it's not "% of income", which varies with each unit's income.

                                                                         Second, I could commit to spending my entire life with constant spending (optionally inflation-adjusted, optionally as a % of income) by adjusting the quality of the goods and services I purchase. So total spending % is not a measure of affordability.

                                                                        • WarmWash 3 hours ago

                                                                           Almost everyone's lifestyle ratchets up, so the handful who actually downgrade their living rather than increase spending would be tiny.

                                                                           This is part of a wider trend, too, where economic stats don't align with what people are saying, which is most likely explained by the economic anomaly of the pandemic skewing people's perceptions.

                                                                          • twoodfin an hour ago

                                                                            We have centuries of historical evidence that people really, really don’t like high inflation, and it takes a while & a lot of turmoil for those shocks to work their way through society.

                                                                  • XenophileJKO 6 hours ago
                                                                    • aleph_minus_one 6 hours ago

                                                                      I don't understand what you want to tell us with this image.

                                                                      • fragmede 5 hours ago

                                                                        they're accusing GGP of moving the goalposts.

                                                                    • olalonde 6 hours ago

                                                                      Would be cool to have a benchmark with actually unsolved math and science questions, although I suspect models are still quite a long way from that level.

                                                                      • gowld 3 hours ago

                                                                        Does folding a protein count? How about increasing performance at Go?

                                                                    • verdverm 7 hours ago

                                                                      Here's a good thread over 1+ month, as each model comes out

                                                                      https://bsky.app/profile/pekka.bsky.social/post/3meokmizvt22...

                                                                      tl;dr - Pekka says Arc-AGI-2 is now toast as a benchmark

                                                                      • Aperocky 7 hours ago

                                                                        If you look at the problem space it is easy to see why it's toast, maybe there's intelligence in there, but hardly general.

                                                                        • verdverm 6 hours ago

                                                                          the best way I've seen this described is "spikey" intelligence: really good at some points, and those points make the spikes

                                                                          humans are the same way; we all have a unique spike pattern, interests and talents

                                                                          AI models are effectively the same spikes across instances, if simplified. I could argue self-driving vs chatbots vs world models vs game playing might constitute enough variation. I would not say the same of Gemini vs Claude vs ... (instances); that's where I see "spikey clones"

                                                                          • Aperocky 6 hours ago

                                                                            You can get more spiky with AIs, whereas the human brain is more hard-wired.

                                                                            So maybe we are forced to be more balanced and general, whereas AI doesn't have to be.

                                                                            • verdverm 6 hours ago

                                                                              I suspect the non-spikey part is the more interesting comparison

                                                                              Why is it so easy for me to open the car door, get in, close the door, buckle up? You can do this in the dark and without looking.

                                                                              There are an infinite number of little things like this that you think about not at all, that take near-zero energy, yet which are extremely hard for AI

                                                                              • pixl97 an hour ago

                                                                                >Why is it so easy for me to open the car door

                                                                                Because this part of your brain has been optimized for hundreds of millions of years. It's been around a long ass time and takes an amazingly low amount of energy to do these things.

                                                                                On the other hand, the 'thinking' part of your brain, your higher intelligence, is very new to evolution. It's expensive to run. It's problematic when giving birth. And it's really slow with things like numbers; heck, a tiny calculator can whip your butt at adding.

                                                                                There's a term for this, but I can't think of it at the moment.

                                                                                • gowld 3 hours ago

                                                                                  You are asking a robotics question, not an AI question. Robotics is more and less than AI. Boston Dynamics robots are getting quite near your benchmark.

                                                                            • tasuki 3 hours ago

                                                                              > maybe there's intelligence in there, but hardly general.

                                                                              Of course. Just as our human intelligence isn't general.

                                                                        • mNovak 6 hours ago

                                                                          I'm excited for the big jump in ARC-AGI scores from recent models, but no one should think for a second this is some leap in "general intelligence".

                                                                          I joke to myself that the G in ARC-AGI is "graphical". I think what's held back models on ARC-AGI is their terrible spatial reasoning, and I'm guessing that's what the recent models have cracked.

                                                                          Looking forward to ARC-AGI 3, which focuses on trial and error and exploring a set of constraints via games.

                                                                          • causal 5 hours ago

                                                                            Agreed. I love the elegance of ARC, but it always felt like a gotcha to give spatial reasoning challenges to token generators, and the fact that the token generators are somehow beating it anyway really says something.

                                                                            • throw310822 5 hours ago

                                                                              The average ARC AGI 2 score for a single human is around 60%.

                                                                              "100% of tasks have been solved by at least 2 humans (many by more) in under 2 attempts. The average test-taker score was 60%."

                                                                              https://arcprize.org/arc-agi/2/

                                                                              • modeless 5 hours ago

                                                                                Worth keeping in mind that in this case the test takers were random members of the general public. The score of e.g. people with bachelor's degrees in science and engineering would be significantly higher.

                                                                                • throw310822 4 hours ago

                                                                                  Random members of the public = average human beings. I thought those were already classified as General Intelligences.

                                                                                • imiric 4 hours ago

                                                                                  What is the point of comparing performance of these tools to humans? Machines have been able to accomplish specific tasks better than humans since the industrial revolution. Yet we don't ascribe intelligence to a calculator.

                                                                                  None of these benchmarks prove these tools are intelligent, let alone generally intelligent. The hubris and grift are exhausting.

                                                                                  • guelo 3 hours ago

                                                                                    What's the point of denying or downplaying that we are seeing amazing and accelerating advancements in areas that many of us thought were impossible?

                                                                                    • D-Machine an hour ago

                                                                                      It can be reasonable to be skeptical: advances on benchmarks may be only weakly or even negatively correlated with advances on real-world tasks. I.e. a huge jump on benchmarks might not be perceptible to 99% of users doing 99% of tasks, and some users might even notice degradation on specific tasks. This is especially the case when there is some reason to believe most benchmarks are being gamed.

                                                                                      Real-world use is what matters, in the end. I'd be surprised if a change this large doesn't translate to something noticeable in general, but the skepticism is not unreasonable here.

                                                                                      • munksbeer an hour ago

                                                                                        I would suggest it is a well-studied phenomenon that takes many forms, mostly identity preservation, I'd guess. If you dislike AI from the start, it is generally a very strongly emotional view. I don't mean there is no good reason behind it; I mean it is deeply rooted in your psyche, very emotional.

                                                                                        People are incredibly unlikely to change those sort of views, regardless of evidence. So you find this interesting outcome where they both viscerally hate AI, but also deny that it is in any way as good as people claim.

                                                                                        That won't change with evidence until it is literally impossible not to change.

                                                                                      • CamperBob2 an hour ago

                                                                                        The hubris and grift are exhausting.

                                                                                        And moving the goalposts every few months isn't? What evidence of intelligence would satisfy you?

                                                                                        Personally, my biggest unsatisfied requirement is continual-learning capability, but it's clear we aren't too far from seeing that happen.

                                                                                        • imiric 31 minutes ago

                                                                                          > What evidence of intelligence would satisfy you?

                                                                                          That is a loaded question. It presumes that we can agree on what intelligence is, and that we can measure it reliably. It is akin to asking an atheist the same about God. The burden of proof is on the claimant.

                                                                                          The reality is that we can argue about that until we're blue in the face, and get nowhere.

                                                                                          In this case it would be more productive to talk about the practical tasks a pattern matching and generation machine can do, rather than how good it is at some obscure puzzle. The fact that it's better than humans at solving some problems is not particularly surprising, since computers have been better than humans at many tasks for decades. This new technology gives them broader capabilities, but ascribing human qualities to it and calling it intelligence is nothing but a marketing tactic that's making some people very rich.

                                                                                        • throw310822 3 hours ago

                                                                                          > Machines have been able to accomplish specific tasks...

                                                                                          Indeed, and the specific task machines are accomplishing now is intelligence. Not yet "better than human" (and certainly not better than every human) but getting closer.

                                                                                          • imiric 3 hours ago

                                                                                            > Indeed, and the specific task machines are accomplishing now is intelligence.

                                                                                            How so? This sentence, like most of this field, is making baseless claims that are more aspirational than true.

                                                                                            Maybe it would help if we could first agree on a definition of "intelligence", yet we don't have a reliable way of measuring that in living beings either.

                                                                                            If the people building and hyping this technology had any sense of modesty, they would present it as what it actually is: a large pattern matching and generation machine. This doesn't mean that this can't be very useful, perhaps generally so, but it's a huge stretch and an insult to living beings to call this intelligence.

                                                                                            But there's a great deal of money to be made on this idea we've been chasing for decades now, so here we are.

                                                                                            • warkdarrior 3 hours ago

                                                                                              > Maybe it would help if we could first agree on a definition of "intelligence", yet we don't have a reliable way of measuring that in living beings either.

                                                                                              How about this specific definition of intelligence?

                                                                                                 Solve any task provided as text or images.
                                                                                              
                                                                                              AGI would be to achieve that faster than an average human.
                                                                                              • throw310822 2 hours ago

                                                                                                I still can't understand why they should be faster. Humans have general intelligence, afaik. It doesn't matter if it's fast or slow. A machine able to do what the average human can do (intelligence-wise) but 100 times slower still has general intelligence. Since it's artificial, it's AGI.

                                                                                      • colordrops 5 hours ago

                                                                                        Wouldn't you deal with spatial reasoning by giving it access to a tool that structures the space in a way it can understand, or just a sub-model that can do spatial reasoning? These "general" models would serve as the frontal cortex while other models do specialized work. What is missing?

                                                                                        • causal 5 hours ago

                                                                                          That's a bit like saying just give blind people cameras so they can see.

                                                                                          • pixl97 an hour ago

                                                                                            I mean, no not really. These models can see, you're giving them eyes to connect to that part of their brain.

                                                                                          • amelius 3 hours ago

                                                                                            They should train more on sports commentary, perhaps that could give spatial reasoning a boost.

                                                                                        • aeyes 7 hours ago

                                                                                          https://arcprize.org/leaderboard

                                                                                          $13.62 per task - so do we need another 5-10 years for the price of running this to become reasonable?

                                                                                          But the real question is if they just fit the model to the benchmark.

                                                                                          • onlyrealcuzzo 6 hours ago

                                                                                            Why 5-10 years?

                                                                                            At current rates, price per equivalent output is dropping by about 99.9% over 5 years.

                                                                                            That's basically $0.01 in 5 years.

                                                                                            Does it really need to be that cheap to be worth it?

                                                                                            Keep in mind, $0.01 in 5 years is worth less than $0.01 today.
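                                                                                            Back-of-the-envelope, taking the leaderboard's per-task figure and assuming the ~99.9% five-year decline holds (an extrapolation, not a measured figure):

```python
# Project today's ARC-AGI-2 cost per task forward, assuming the
# ~99.9% five-year price decline holds (an assumed rate, not data).
cost_today = 13.62            # $ per task, from the ARC leaderboard
decline = 0.999               # assumed drop over 5 years
cost_in_5y = cost_today * (1 - decline)
print(f"${cost_in_5y:.3f} per task")  # prints "$0.014 per task"
```

                                                                                            So roughly a cent, before even accounting for inflation.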

                                                                                          • golem14 3 hours ago

                                                                                            A grad student hour is probably more expensive…

                                                                                            • elromulous 2 hours ago

                                                                                              In my experience, a grad student hour is treated as free :(

                                                                                            • re-thc 5 hours ago

                                                                                              What’s reasonable? It’s less than minimum hourly wage in some countries.

                                                                                              • willis936 5 hours ago

                                                                                                Burned in seconds.

                                                                                                • gowld 3 hours ago

                                                                                                  Getting the work done faster for the same money doesn't make the work more expensive.

                                                                                                  You could slow down the inference to make the task take longer, if $/sec matters.

                                                                                              • igravious 6 hours ago

                                                                                                That's not a long time in the grand scheme of things.

                                                                                                • throwup238 6 hours ago

                                                                                                  Speak for yourself. Five years is a long time to wait for my plans of world domination.

                                                                                                  • tasuki 3 hours ago

                                                                                                    This concerns me actually. With enough people (n>=2) wanting to achieve world domination, we have a problem.

                                                                                                    • throwup238 2 hours ago

                                                                                                      It’s not that I want to achieve world domination (imagine how much work that would be!), it’s just that it’s the inevitable path for AI, and I’d rather it be me than the next schmuck with a Claude Max subscription.

                                                                                                      • pixl97 an hour ago

                                                                                                        I mean everyone with prompt access to the model says these things, but people like Sam and Elon say these things and mean it.

                                                                                                      • gowld 3 hours ago

                                                                                                        n = 2 is Pinky and the Brain.

                                                                                                      • amelius 5 hours ago

                                                                                                        Yes, you better hurry.

                                                                                                  • mnicky 7 hours ago

                                                                                                    Well, a fair comparison would be with GPT-5.x Pro, which is the same class of model as Gemini Deep Think.

                                                                                                    • culi 5 hours ago

                                                                                                      Yes but with a significant (logarithmic) increase in cost per task. The ARC-AGI site is less misleading and shows how GPT and Claude are not actually far behind

                                                                                                      https://arcprize.org/leaderboard

                                                                                                      • saberience 7 hours ago

                                                                                                        Arc-AGI (and Arc-AGI-2) is the most overhyped benchmark around though.

                                                                                                        It's completely misnamed. It should be called useless visual puzzle benchmark 2.

                                                                                                        Firstly, it's a visual puzzle, making it way easier for humans than for models trained primarily on text. Secondly, it's not really that obvious or easy for humans to solve either!

                                                                                                        So the idea that if an AI can solve "Arc-AGI" or "Arc-AGI-2" it's super smart or even "AGI" is frankly ridiculous. It's a puzzle that means nothing basically, other than the models can now solve "Arc-AGI"

                                                                                                        • CuriouslyC 6 hours ago

                                                                                                          The puzzles are calibrated for human solve rates, but otherwise I agree.

                                                                                                          • saberience 6 hours ago

                                                                                                            My two elderly parents cannot solve Arc-AGI puzzles, but can manage to navigate the physical world, their house, garden, make meals, clean the house, use the TV, etc.

                                                                                                            I would say they do have "general intelligence", so whatever Arc-AGI is "solving" it's definitely not "AGI"

                                                                                                            • hmmmmmmmmmmmmmm 6 hours ago

                                                                                                              You are confusing fluid intelligence with crystallised intelligence.

                                                                                                              • casey2 6 hours ago

                                                                                                                I think you are making that confusion. Any robotic system in the place of his parents would fail within a few hours.

                                                                                                                There are more novel tasks in a day than ARC provides.

                                                                                                                • hmmmmmmmmmmmmmm 6 hours ago

                                                                                                                  Children have great levels of fluid intelligence, that's how they are able to learn to quickly navigate in a world that they are still very new to. Seniors with decreasing capacity increasingly rely on crystallised intelligence, that's why they can still perform tasks like driving a car but can fail at completely novel tasks, sometimes even using a smartphone if they have not used one before.

                                                                                                                  • mrbungie 2 hours ago

                                                                                                                    My late grandma learnt how to use an iPad by herself in her 70s and 80s without any issues, mostly motivated by her wish to read her magazines, doomscroll Facebook and play solitaire. Her last job was being a bakery cashier in her 30s, and she didn't learn how to use a computer in between, so there was no skill transfer going on.

                                                                                                                    Humans and their intelligence are actually incredible and probably will continue to be so; I don't really care what tech/"think" leaders want us to think.

                                                                                                                    • zeroonetwothree 5 hours ago

                                                                                                                      It really depends on motivation. My 90 year old grandmother can use a smartphone just fine since she needs it to see pictures of her (great) grandkids.

                                                                                                          • karmasimida 8 hours ago

                                                                                                            It is over

                                                                                                            • baal80spam 8 hours ago

                                                                                                              I for one welcome our new AI overlords.

                                                                                                          • logicprog 6 hours ago

                                                                                                            Is it me, or is the rate of model releases accelerating to an absurd degree? Today we have Gemini 3 Deep Think and GPT 5.3 Codex Spark. Yesterday we had GLM5 and MiniMax M2.5. Five days before that we had Opus 4.6 and GPT 5.3. Then maybe two weeks before that, I think, we had Kimi K2.5.

                                                                                                            • i5heu 5 hours ago

                                                                                                              I think it is because of Chinese New Year. The Chinese labs like to publish their models around Chinese New Year, and the US labs do not want to let a DeepSeek R1 (20 January 2025) impact event happen again, so I guess they publish models that are more capable than what they imagine the Chinese labs are yet capable of producing.

                                                                                                              • kristopolous 20 minutes ago

                                                                                                                I guess. DeepSeek V3 was released on Boxing Day, a month prior.

                                                                                                                https://api-docs.deepseek.com/news/news1226

                                                                                                                • woah 4 hours ago

                                                                                                                  Singularity or just Chinese New Year?

                                                                                                                  • syndacks 2 hours ago

                                                                                                                    The Singularity will occur on a Tuesday, during Chinese New Year

                                                                                                                • aliston 5 hours ago

                                                                                                                  I'm having trouble just keeping track of all these different types of models.

                                                                                                                  Is "Gemini 3 Deep Think" even technically a model? From what I've gathered, it is built on top of Gemini 3 Pro, and appears to be adding specific thinking capabilities, more akin to adding subagents than a truly new foundational model like Opus 4.6.

                                                                                                                  Also, I don't understand the comments about Google being behind in agentic workflows. I know that the typical use of, say, Claude Code feels agentic, but also a lot of folks are using separate agent harnesses like OpenClaw anyway. You could just as easily plug Gemini 3 Pro into OpenClaw as you can Opus, right?

                                                                                                                  Can someone help me understand these distinctions? Very confused, especially regarding the agent terminology. Much appreciated!

                                                                                                                  • logicprog 5 hours ago

                                                                                                                    > Also, I don't understand the comments about Google being behind in agentic workflows.

                                                                                                                    It has to do with how the model is RL'd. It's not that Gemini can't be used with various agentic harnesses, like OpenCode or OpenClaw, or theoretically even Claude Code. It's just that the model is trained less effectively to work with those harnesses, so it produces worse results.

                                                                                                                    • janalsncm an hour ago

                                                                                                                      The term “model” is one of those super overloaded terms. Depending on the conversation it can mean:

                                                                                                                      - a product (most accurate here imo)

                                                                                                                      - a specific set of weights in a neural net

                                                                                                                      - a general architecture or family of architectures (BERT models)

                                                                                                                      So while you could argue this is a “model” in the broadest sense of the term, it’s probably more descriptive to call it a product. Similarly we call LLMs “language” models even if they can do a lot more than that, for example draw images.

                                                                                                                      • re-thc 5 hours ago

                                                                                                                        There are hints this is a preview to Gemini 3.1.

                                                                                                                      • sanderjd 40 minutes ago

                                                                                                                        So, yes, for the past couple weeks it has felt that way to me. But it seems to come in fits and starts. Maybe that will stop being the case, but that's how it's felt to me for awhile.

                                                                                                                        • rogerkirkness 6 hours ago

                                                                                                                          Fast takeoff.

                                                                                                                          • redox99 5 hours ago

                                                                                                                            There's more compute now than before.

                                                                                                                            • bpodgursky 6 hours ago

                                                                                                                              Anthropic took the day off to do a $30B raise at a $380B valuation.

                                                                                                                              • IhateAI 6 hours ago

                                                                                                                                 Most ridiculous valuation in the history of markets. Can't wait to watch these companies crash and burn when people give up on the slot machine.

                                                                                                                                • andxor 4 hours ago

                                                                                                                                  As usual don't take financial advice from HN folks!

                                                                                                                                  • blibble a few seconds ago

                                                                                                                                    not as if you could get in on it even if you wanted to

                                                                                                                                  • kgwgk 6 hours ago

                                                                                                                                     WeWork almost IPO'd at $50bn. That was also a nice crash and burn.

                                                                                                                                    • jascha_eng 5 hours ago

                                                                                                                                       Why? They had a $10+ billion ARR run rate in 2025, tripled from 2024. I mean, 30x is a lot, but it's also not insane at that growth rate, right?

                                                                                                                                      • gokhan 3 hours ago

                                                                                                                                         It's a 13-day-old account with the handle IhateAI.

                                                                                                                                  • brokencode 5 hours ago

                                                                                                                                    They are using the current models to help develop even smarter models. Each generation of model can help even more for the next generation.

                                                                                                                                    I don’t think it’s hyperbolic to say that we may be only a single digit number of years away from the singularity.

                                                                                                                                    • mrandish 30 minutes ago

                                                                                                                                      > using the current models to help develop even smarter models.

                                                                                                                                      That statement is plausible. However, extrapolating that to assert all the very different things which must be true to enable any form of 'singularity' would be a profound category error. There are many ways in which your first two sentences can be entirely true, while your third sentence requires a bunch of fundamental and extraordinary things to be true for which there is currently zero evidence.

                                                                                                                                      Things like LLMs improving themselves in meaningful and novel ways and then iterating that self-improvement over multiple unattended generations in exponential runaway positive feedback loops resulting in tangible, real-world utility. All the impressive and rapid achievements in LLMs to date can still be true while major elements required for Foom-ish exponential take-off are still missing.

                                                                                                                                      • lm28469 5 hours ago

                                                                                                                                        I must be holding these things wrong because I'm not seeing any of these God like superpowers everyone seem to enjoy.

                                                                                                                                        • brokencode 5 hours ago

                                                                                                                                          Who said they’re godlike today?

                                                                                                                                          And yes, you are probably using them wrong if you don’t find them useful or don’t see the rapid improvement.

                                                                                                                                          • lm28469 4 hours ago

                                                                                                                                             Let's come back in 12 months and discuss your singularity then. Meanwhile, I spent about $30 on a few models as a test yesterday, and none of them could tell me why my goroutine system was failing, even though it was painfully obvious (I purposefully added one too many wg.Done() calls). Gemini, Codex, MiniMax 2.5: they all shat the bed on a very obvious problem, yet I am to believe they're 98% conscious and better at logic and math than 99% of the population.

                                                                                                                                             With every new model release, neckbeards come out of their basements to tell us the singularity will be here in two more weeks.

                                                                                                                                            • brokencode 4 hours ago

                                                                                                                                              You are fighting straw men here. Any further discussion would be pointless.

                                                                                                                                              • lm28469 3 hours ago

                                                                                                                                                 Of course: n-1 wasn't good enough, but n+1 will be the singularity. Just two more weeks, my dudes, two more weeks... rinse and repeat ad infinitum.

                                                                                                                                                • brokencode 3 hours ago

                                                                                                                                                  Like I said, pointless strawmanning.

                                                                                                                                                  You’ve once again made up a claim of “two more weeks” to argue against even though it’s not something anybody here has claimed.

                                                                                                                                                   If you feel the need to argue against claims that exist only in your head, maybe you can keep the argument in your head too?

                                                                                                                                              • BeetleB 4 hours ago

                                                                                                                                                On the flip side, twice I put about 800K tokens of code into Gemini and asked it to find why my code was misbehaving, and it found it.

                                                                                                                                                The logic related to the bug wasn't all contained in one file, but across several files.

                                                                                                                                                This was Gemini 2.5 Pro. A whole generation old.

                                                                                                                                                • woah 4 hours ago

                                                                                                                                                  Post the file here

                                                                                                                                                  • logicprog 4 hours ago

                                                                                                                                                     Meanwhile I've been using Kimi K2T and K2.5 to work in Go with a fair amount of concurrency, and they've been able to write concurrent Go code and debug goroutine issues equal to, and much more complex than, yours, involving race conditions and more, just fine.

                                                                                                                                                    Projects:

                                                                                                                                                    https://github.com/alexispurslane/oxen

                                                                                                                                                    https://github.com/alexispurslane/org-lsp

                                                                                                                                                    (Note that org-lsp has a much improved version of the same indexer as oxen; the first was purely my design, the second I decided to listen to K2.5 more and it found a bunch of potential race conditions and fixed them)

                                                                                                                                                    shrug

                                                                                                                                                    • Izikiel43 4 hours ago

                                                                                                                                                      Out of curiosity, did you give a test for them to validate the code?

                                                                                                                                                       I had a test failing because I introduced a silly comparison bug (> instead of <), and Claude Opus 4.6 figured out that the problem wasn't the test but the code, and fixed the bug (which I had missed).

                                                                                                                                                      • lm28469 3 hours ago

                                                                                                                                                         There was a test and a very useful Go error that literally explained what was wrong. The models tried implementing a solution and failed, and when I pointed out the error most of them just rolled back the "solution".

                                                                                                                                                        • Izikiel43 3 hours ago

                                                                                                                                                          Ok, thanks for the info

                                                                                                                                                • sekai 4 hours ago

                                                                                                                                                  > I don’t think it’s hyperbolic to say that we may be only a single digit number of years away from the singularity.

                                                                                                                                                  We're back to singularity hype, but let's be real: benchmark gains are meaningless in the real world when the primary focus has shifted to gaming the metrics

                                                                                                                                                  • brokencode 3 hours ago

                                                                                                                                                    Ok, here I am living in the real world finding these models have advanced incredibly over the past year for coding.

                                                                                                                                                    Benchmaxxing exists, but that’s not the only data point. It’s pretty clear that models are improving quickly in many domains in real world usage.

                                                                                                                                                    • mrbungie 2 hours ago

                                                                                                                                                      Yet even Anthropic has shown the downsides to using them. I don't think it is a given that improvements in models scores and capabilities + being able to churn code as fast as we can will lead us to a singularity, we'll need more than that.

                                                                                                                                              • xnx 8 hours ago

                                                                                                                                                Google is absolutely running away with it. The greatest trick they ever pulled was letting people think they were behind.

                                                                                                                                                • wiseowise 5 hours ago

                                                                                                                                                   Their models might be impressive, but their products absolutely suck donkey balls. I gave Gemini web/CLI two months and ran back to ChatGPT. Seriously, it would just COMPLETELY forget context mid-dialog. When asked about improving air quality, it just gave me a list of (mediocre) air purifiers without asking for any context whatsoever, and I could list thousands of conversations like that. Shopping or comparing options is just nonexistent. It uses Russian propaganda sources for answers and switches to Chinese mid-sentence (!) while explaining some generic Python functionality. It's an embarrassment and I don't know how they justify the 20-euro price tag on it.

                                                                                                                                                  • mavamaarten 5 hours ago

                                                                                                                                                    I agree. On top of that, in true Google style, basic things just don't work.

                                                                                                                                                     Any time I upload an attachment, it just fails with something vague like "couldn't process file", whether that's a simple .md or .txt with fewer than 100 lines, or a PDF. I tried making a Gem today; it just wouldn't let me save it, with some vague error too.

                                                                                                                                                    I also tried having it read and write stuff to "my stuff" and Google drive. But it would consistently write but not be able to read from it again. Or would read one file from Google drive and ignore everything else.

                                                                                                                                                    Their models are seriously impressive. But as usual Google sucks at making them work well in real products.

                                                                                                                                                    • davoneus 4 hours ago

                                                                                                                                                       I don't find that at all. At work, we've no access to the API, so we have to force-feed a dozen (or more) documents, code files, and instruction prompts through the web UI's upload interface. The only failures I've ever had in well over 300 sessions were due to connectivity issues, not interface failures.

                                                                                                                                                      Context window blowouts? All the time, but never document upload failures.

                                                                                                                                                      • pixl97 an hour ago

                                                                                                                                                        Honestly this is as Google product as you can get. Prizes for some, beatings for others.

                                                                                                                                                    • chermanowicz 4 hours ago

                                                                                                                                                       It's so capable at some things, and garbage at others. I uploaded a photo of some words for a spelling bee and asked it to quiz my kid on the words. The first word it asked wasn't on the list. After multiple attempts it finally asked only the words in the uploaded pic, but then it got the spellings wrong in the Q&A. I gave up.

                                                                                                                                                      • sequin 4 hours ago

                                                                                                                                                        How can the models be impressive if they switch to Chinese mid-sentence? I've observed those bizarre bugs too. Even GPT-3 didn't have those. Maybe GPT-2 did. It's actually impressive that they managed to botch it so badly.

                                                                                                                                                        Google is great at some things, but this isn't it.

                                                                                                                                                        • gokhan 3 hours ago

                                                                                                                                                           Agreed on the product. I can't make Gemini read my emails in Gmail. One day it says it doesn't have access, the next day it says "Query unsuccessful." Claude Desktop has no problem reaching Gmail, on the other hand :)

                                                                                                                                                          • kilroy123 4 hours ago

                                                                                                                                                            Sadly true.

                                                                                                                                                            It is also one of the worst models to have a sort of ongoing conversation with.

                                                                                                                                                            • HardCodedBias 5 hours ago

                                                                                                                                                              Their models are absolutely not impressive.

                                                                                                                                                              Not a single person is using it for coding (outside of Google itself).

                                                                                                                                                              Maybe some people on a very generous free plan.

                                                                                                                                                              Their model is a fine mid 2025 model, backed by enormous compute resources and an army of GDM engineers to help the “researchers” keep the model on task as it traverses the “tree of thoughts”.

                                                                                                                                                              But that isn’t “the model” that’s an old model backed by massive money.

                                                                                                                                                            • Ozzie_osman 6 hours ago

                                                                                                                                                              Peacetime Google is not like wartime Google.

                                                                                                                                                              Peacetime Google is slow, bumbling, bureaucratic. Wartime Google gets shit done.

                                                                                                                                                              • nutjob2 5 hours ago

                                                                                                                                                                OpenAI is the best thing that happened to Google apparently.

                                                                                                                                                                • RationPhantoms 5 hours ago

                                                                                                                                                                  Competition always is. I think there was a real fear that their core product was going to be replaced. They're already cannibalizing it internally so it was THE wake up call.

                                                                                                                                                                  • taurath 3 hours ago

                                                                                                                                                                     Just not search. The search product has pretty much become useless over the past 3 years, and the AI answers often only get back to the level of search from 5 years ago. This creates a sense that things are better, but really it's just become impossible to get reliable information from an avenue that used to work very well.

                                                                                                                                                                     I don't think this is intentional, but I think they stopped fighting SEO entirely to focus on AI. Recipes are the best example: completely gutted, and almost all recipe sites (and therefore the entire search page) are run by the same company. I didn't realize how utterly consolidated huge portions of information on the internet were until every recipe site, about 3 months ago, simultaneously implemented the same anti-adblock.

                                                                                                                                                                    • koolala 2 hours ago

                                                                                                                                                                      Next they compete on ads...

                                                                                                                                                                    • lern_too_spel 5 hours ago

                                                                                                                                                                      Wartime Google gave us Google+. Wartime Google is still bumbling, and despite OpenAI's numerous missteps, I don't think it has to worry about Google hurting its business yet.

                                                                                                                                                                      • NikolaNovak 2 hours ago

                                                                                                                                                                        I do miss Google+. For my brain / use case, it was by far the best social network out there, and the Circle friends and interest management system is still unparalleled :)

                                                                                                                                                                    • kenjackson 6 hours ago

                                                                                                                                                                       But just wait two hours to see what OpenAI has! I love the competition. Just a few days ago someone was telling me that ARC-AGI-2 was proof that LLMs can't reason. The goalposts will shift again. I feel like most of human endeavor will soon be just about continuously trying to show that AIs don't have AGI.

                                                                                                                                                                      • kilpikaarna 5 hours ago

                                                                                                                                                                        > I feel like most of human endeavor will soon be just about trying to continuously show that AI's don't have AGI.

                                                                                                                                                                        I think you overestimate how much your average person-on-the-street cares about LLM benchmarks. They already treat ChatGPT or whichever as generally intelligent (including to their own detriment), are frustrated about their social media feeds filling up with slop and, maybe, if they're white-collar, worry about their jobs disappearing due to AI. Apart from a tiny minority in some specific field, people already know themselves to be less intelligent along any measurable axis than someone somewhere.

                                                                                                                                                                        • 7777332215 6 hours ago

                                                                                                                                                                          Soon they can drop the bioweapon to welcome our replacement.

                                                                                                                                                                          • nutjob2 5 hours ago

                                                                                                                                                                            "AGI" doesn't mean anything concrete, so it's all a bunch of non-sequiturs. Your goalposts don't exist.

                                                                                                                                                                            Anyone with any sense is interested in how well these tools work and how they can be harnessed, not some imaginary milestone that is not defined and cannot be measured.

                                                                                                                                                                            • kenjackson 5 hours ago

                                                                                                                                                                              I agree. I think the emergence of LLMs has shown that AGI really has no teeth. For decades the Turing test was viewed as the gold standard, but it's clear that there doesn't appear to be any good metric.

                                                                                                                                                                              • sincerely 2 hours ago

                                                                                                                                                                                The Turing test was passed in the 80s; somehow it has remained relevant in pop culture despite not being a particularly difficult technical achievement.

                                                                                                                                                                                • kenjackson 2 minutes ago

                                                                                                                                                                                  It wasn’t passed in the 80s. Not the general Turing test.

                                                                                                                                                                          • amunozo 7 hours ago

                                                                                                                                                                            Those black Nazis in the first image model were a cause of insider trading.

                                                                                                                                                                            • naasking 6 hours ago

                                                                                                                                                                              Google is still behind the largest models, I'd say, in real-world utility. Gemini 3 Pro still has many issues.

                                                                                                                                                                              • dfdsf2 7 hours ago

                                                                                                                                                                                Trick? Lol, not a chance. Alphabet is a pure-play tech firm that has to produce products to make the tech accessible. They really lack in the latter, and this is visible when you see the interactions of their VPs. Luckily for them, if you build enough of a lead with the tech, you get many chances to sort out the product stuff.

                                                                                                                                                                                • dakolli 6 hours ago

                                                                                                                                                                                  You sound like Russ Hanneman from SV

                                                                                                                                                                                  • s-kymon 6 hours ago

                                                                                                                                                                                    It's not about how much you earn. It's about what you're worth.

                                                                                                                                                                                • Razengan 6 hours ago

                                                                                                                                                                                  Gemini's UX (and of course privacy cred as with anything Google) is the worst of all the AI apps. In the eyes of the Common Man, it's UI that will win out, and ChatGPT's is still the best.

                                                                                                                                                                                  • xnx 5 hours ago

                                                                                                                                                                                    Google privacy cred is ... excellent? The worst data breach I know of them having was a flaw that allowed access to names and emails of 500k users.

                                                                                                                                                                                    • bitpush 5 hours ago

                                                                                                                                                                                      Link? Are you conflating "500k Gmail accounts leaked [by a third party]" with Gmail having a breach?

                                                                                                                                                                                      Afaik, Google has had no breaches ever.

                                                                                                                                                                                    • laurex 5 hours ago

                                                                                                                                                                                      If you consider "privacy" to be 'a giant corporation tracks every bit of possible information about you and everyone else'?

                                                                                                                                                                                      • xnx 4 hours ago

                                                                                                                                                                                        OpenAI is running ads. Do you think they'll track less?

                                                                                                                                                                                      • Razengan 5 hours ago

                                                                                                                                                                                        They don't even let you have multiple chats if you disable their "App Activity" or whatever (wtf is with that ass naming? they didn't even have a "Privacy" section in their settings the last time I checked),

                                                                                                                                                                                        and when I swap back into the Gemini app on my iPhone after a minute or so, the chat disappears. And other weird passive-aggressive take-my-toys-away behavior if you don't bare your body and soul to Googlezebub.

                                                                                                                                                                                        ChatGPT and Grok work so much better without accounts or with high privacy settings.

                                                                                                                                                                                      • alexpotato 6 hours ago

                                                                                                                                                                                        > Gemini's UX ... is the worst of all the AI apps

                                                                                                                                                                                        Been using Gemini + OpenCode for the past couple weeks.

                                                                                                                                                                                        Suddenly, I get a "you need a Gemini Access Code license" error but when you go to the project page there is no mention of this or how to get the license.

                                                                                                                                                                                        You really feel the "We're the phone company and we don't care. Why? Because we don't have to." [0] when you use these Google products.

                                                                                                                                                                                        PS for those that don't get the reference: US phone companies in the 1970s had a monopoly on local and long distance phone service. Similar to Google for search/ads (really a "near" monopoly but close enough).

                                                                                                                                                                                        0 - https://vimeo.com/355556831

                                                                                                                                                                                        • ainch 3 hours ago

                                                                                                                                                                                          I find Gemini's web page much snappier to use than ChatGPT - I've largely swapped to it for most things except more agentic tasks.

                                                                                                                                                                                          • jonathanstrange 6 hours ago

                                                                                                                                                                                            You mean AI Studio or something like that, right? Because I can't see a problem with Google's standard chat interface. All other AI offerings are confusing both regarding their intended use and their UX, though, I have to concur with that.

                                                                                                                                                                                            • ergonaught 5 hours ago

                                                                                                                                                                                              The lack of "projects" alone makes their chat interface really unpleasant compared to ChatGPT and Claude.

                                                                                                                                                                                              • xnx 6 hours ago

                                                                                                                                                                                                AI Studio is also significantly improved as of yesterday.

                                                                                                                                                                                                • wiseowise 5 hours ago

                                                                                                                                                                                                  No projects, completely forgets context mid-dialog, mediocre responses even with thinking, research got kneecapped somehow and is completely useless now, uses Russian propaganda videos as search material (what's wrong with you, Google?), janky on mobile, consumes GIGABYTES of RAM on web (seriously, what the fuck?). Left a couple of tabs overnight; my Mac was almost completely frozen because 10 tabs consumed 8 GB of RAM doing nothing. It's a complete joke.

                                                                                                                                                                                                • uxhoiuewfhhiu 4 hours ago

                                                                                                                                                                                                  Gemini is completely unusable in VS Code. It's rated 2/5 stars, pathetic: https://marketplace.visualstudio.com/items?itemName=Google.g...

                                                                                                                                                                                                  Requests regularly time out, the whole window freezes, it gets stuck in schizophrenic loops, edits cannot be reverted and more.

                                                                                                                                                                                                  It doesn't even come close to Claude or ChatGPT.

                                                                                                                                                                                              • rob-wagner 3 hours ago

                                                                                                                                                                                                I’ve been using Gemini 3 Pro on a historical document archiving project for an old club. One of the guys had been working on scanning old handwritten minutes books written in German that were challenging to read (1885 through 1974). Anyways, I was getting decent results on a first pass with 50 page chunks but ended up doing 1 page at a time (accuracy probably 95%). For each page, I submit the page for a transcription pass followed by a translation of the returned transcription. About 2370 pages and sitting at about $50 in Gemini API billing. The output will need manual review, but the time savings is impressive.
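A minimal sketch of the per-page, two-pass flow described above. The prompts and the `call_model` callable are illustrative stand-ins (not the actual Gemini API client), showing only how the transcription pass feeds the translation pass:

```python
def process_page(call_model, page_image):
    """Run one page through the two-pass flow: transcribe, then translate.

    `call_model(prompt, image=None)` is a hypothetical stand-in for a
    multimodal model call; swap in a real client to use it.
    """
    # First pass: transcribe the handwritten German page from the scan.
    transcript = call_model(
        "Transcribe this handwritten German page exactly as written.",
        image=page_image)
    # Second pass: translate the returned transcription (text-only call).
    translation = call_model(
        "Translate this German transcription into English:\n" + transcript)
    return transcript, translation
```

Doing one page per call, as described, trades more requests for higher per-page accuracy; the per-call cost stays small because each context holds only a single page.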

                                                                                                                                                                                                • dubeye an hour ago

                                                                                                                                                                                                  It sounds like a job where one pass might also be a viable option. Until you do the manual review you won't have a full sense of the time savings involved.

                                                                                                                                                                                                  • rob-wagner 6 minutes ago

                                                                                                                                                                                                    Good idea. I'll try modifying the prompt to transcribe, identify the language, and translate if not English, and then return a structured result. In my spot checks, most of the errors are in people's names and where the handwriting trails into margins (especially into the fold of the binding). Even with the data still needing review, the translations from it have revealed a lot of interesting characters, as well as this little anecdote from the minutes of the June 6, 1941 Annual Meeting:

                                                                                                                                                                                                    It had already rained at the beginning of the meeting. During the same, however, a heavy thunderstorm set in, whereby our electric light line was put out of operation. Wax candles with beer bottles as light holders provided the lighting. In the meantime the rain had fallen in a cloudburst-like manner, so that one needed help to get one's automobile going. In some streets the water stood so high that one could reach one's home only by detours. In this night 9.65 inches of rain had fallen.

                                                                                                                                                                                                • dmbche 16 minutes ago

                                                                                                                                                                                                  So what happens if the AI companies can't make money? I see more and more advances and breakthroughs, but they are taking on debt with no revenue in sight.

                                                                                                                                                                                                  I gather debt is a very bad sign here, since they could just sell more shares but aren't (either the valuation is stretched or there are no buyers).

                                                                                                                                                                                                  Just a recession? Something else? Aren't they too big to fail?

                                                                                                                                                                                                  • echelon 11 minutes ago

                                                                                                                                                                                                    AI will kill SaaS moats and thus revenue. Anyone can build new SaaS quickly. Lots of competition will lead to marginal profits.

                                                                                                                                                                                                    AI will kill advertising. Whatever sits at the top "pane of glass" will be able to filter ads out. Personal agents and bots will filter ads out.

                                                                                                                                                                                                    AI will kill social media. The internet will fill with spam.

                                                                                                                                                                                                    AI models will become commodity. Unless singularity, no frontier model will stay in the lead. There's competition from all angles. They're easy to build, just capital intensive (though this is only because of speed).

                                                                                                                                                                                                    All this leaves is infrastructure.

                                                                                                                                                                                                    • ipnon 7 minutes ago

                                                                                                                                                                                                      They're using the ride-share app playbook. Subsidize the product to reach market saturation. Once you've found a market segment that depends on your product, you raise the price to break even. One major difference, though, is that ride-share apps haven't really changed in capabilities since they launched: it's a map that shows a little car with your driver coming and a pin where you're going. But it's reasonable to believe that AI will have new fundamental capabilities in the 2030s, 2040s, and so on.

                                                                                                                                                                                                    • simianwords 8 hours ago

                                                                                                                                                                                                      OT but my intuition says that there’s a spectrum

                                                                                                                                                                                                      - non thinking models

                                                                                                                                                                                                      - thinking models

                                                                                                                                                                                                      - best-of-N models like Deep Think and GPT Pro

                                                                                                                                                                                                      Each one has a certain computational complexity. Simplifying a bit, I think they map to linear, quadratic, and n^3 respectively.

                                                                                                                                                                                                      I think there's a certain class of problems that can't be solved without thinking, because solving them necessarily involves writing in a scratchpad. The same goes for best-of-N, which involves exploring.

                                                                                                                                                                                                      Two open questions

                                                                                                                                                                                                      1) What's the next level here? Is there a 4th option?

                                                                                                                                                                                                      2) Can a sufficiently large non-thinking model perform the same as a smaller thinking one?
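The best-of-N tier can be sketched concretely. In this toy version, `generate` and `score` are hypothetical stand-ins for a model call and a verifier/reward model (no real API is being used):

```python
import random

def best_of_n(generate, score, prompt, n=4, seed=0):
    """Sample N candidate answers and keep the highest-scoring one.

    `generate(prompt, rng)` and `score(candidate)` are toy stand-ins:
    in practice, generate would be a (thinking) model call and score
    a verifier or reward model.
    """
    rng = random.Random(seed)
    candidates = [generate(prompt, rng) for _ in range(n)]
    return max(candidates, key=score)

# Toy stand-ins: "generation" proposes integers, "scoring" prefers 7.
guesses = iter([3, 9, 7, 1])
answer = best_of_n(lambda p, rng: next(guesses),
                   lambda x: -abs(x - 7), "q")  # picks 7
```

Under this framing, the "thinking" tier is just a more expensive `generate`, and best-of-N wraps it in sampling plus selection, which is where the extra factor of N in compute comes from.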

                                                                                                                                                                                                      • futureshock 4 hours ago

                                                                                                                                                                                                        I think step 4 is the agent swarm. Manager model gets the prompt and spins up a swarm of looping subagents, maybe assigns them different approaches or subtasks, then reviews results, refines the context files and redeploys the swarm on a loop till the problem is solved or your credit card is declined.
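As a rough illustration of the loop described above (every callable here is a hypothetical stand-in, not a real agent framework):

```python
def swarm_solve(spawn_agent, manager_review, task, approaches, max_rounds=3):
    """Manager loop sketch: fan out one subagent per approach, review
    the results, refine the shared context, and repeat until solved
    or the round budget runs out.

    `spawn_agent(context, approach)` and `manager_review(results)` are
    stand-ins; manager_review returns (solved, refined_context).
    """
    context = task
    for _ in range(max_rounds):
        results = [spawn_agent(context, a) for a in approaches]
        solved, context = manager_review(results)
        if solved:
            return context
    return None  # budget exhausted (or credit card declined)
```

Structurally this is best-of-N with feedback: instead of one sampling round and a single selection, the selector's output is fed back in as the next round's context.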

                                                                                                                                                                                                        • simianwords 4 hours ago

                                                                                                                                                                                                          I think this is the right answer.

                                                                                                                                                                                                          edit: I don't know how this is meaningfully different from 3.

                                                                                                                                                                                                        • NitpickLawyer 7 hours ago

                                                                                                                                                                                                          > best of N models like deep think an gpt pro

                                                                                                                                                                                                          Yeah, these are made possible largely by better use of long context. You also need a step that gathers all the Ns, selects the best ideas/parts, and compiles the final output. Goog have been SotA at useful long context for a while now (since 2.5, I'd say). Many others have come out with "1M context", but their usefulness after 100k-200k is iffy.

                                                                                                                                                                                                          What's even more interesting than maj@n or best-of-n is pass@n. For a lot of applications you can frame the question and search space such that pass@n is your success rate. Think security exploit finding, or optimisation problems with quick checks (better algos, kernels, infra routing, etc). It doesn't matter how good your pass@1 or avg@n is; all you care about is that you find more as you spend more time. Literally throwing money at the problem.
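For concreteness, pass@k is usually reported with the standard unbiased estimator from the code-generation literature (not something defined in this thread): draw n samples, observe c passes, and estimate the probability that at least one of k fresh samples would pass.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples of which c passed.

    Computes 1 - C(n-c, k) / C(n, k): one minus the probability that
    a random size-k subset of the n samples contains no passing one.
    """
    if n - c < k:
        return 1.0  # every size-k subset must contain a pass
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. a 5% pass@1 task: with a budget of 20 attempts the estimated
# chance of at least one hit is already roughly 0.68.
print(pass_at_k(100, 5, 1), pass_at_k(100, 5, 20))
```

This is what "throwing money at the problem" buys: pass@k climbs quickly with k even when pass@1 is poor, as long as successes can be checked cheaply.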

                                                                                                                                                                                                          • mnicky 7 hours ago

                                                                                                                                                                                                            > can a sufficiently large non thinking model perform the same as a smaller thinking?

                                                                                                                                                                                                            Models from Anthropic have always been excellent at this. See e.g. https://imgur.com/a/EwW9H6q (top-left Opus 4.6 is without thinking).

                                                                                                                                                                                                            • simianwords 7 hours ago

                                                                                                                                                                                                              It's interesting that Opus 4.6 added a parameter to make it think extra hard.

                                                                                                                                                                                                          • sigmar 8 hours ago

                                                                                                                                                                                                            Here are the methodologies for all the benchmarks: https://storage.googleapis.com/deepmind-media/gemini/gemini_...

                                                                                                                                                                                                            The ARC-AGI-2 score (84.6%) is from the semi-private eval set. If gemini-3-deepthink gets above 85% on the private eval set, it will be considered "solved":

                                                                                                                                                                                                            >Submit a solution which scores 85% on the ARC-AGI-2 private evaluation set and win $700K. https://arcprize.org/guide#overview

                                                                                                                                                                                                            • gs17 8 hours ago

                                                                                                                                                                                                              Interestingly, the title of that PDF calls it "Gemini 3.1 Pro". Guess that's dropping soon.

                                                                                                                                                                                                              • sigmar 8 hours ago

                                                                                                                                                                                                                I looked at the file name but not the document title (specifically because I was wondering if this is 3.1). Good spot.

                                                                                                                                                                                                                edit: they just removed the reference to "3.1" from the pdf

                                                                                                                                                                                                                • josalhor 6 hours ago

                                                                                                                                                                                                                  I think this is 3.1 (3.0 Pro with the RL improvements from 3.0 Flash). But they probably decided to market it as Deep Think because, why not charge more for it?

                                                                                                                                                                                                                  • WarmWash 6 hours ago

                                                                                                                                                                                                                    The Deep Think moniker is for parallel compute models though, not long CoT like pro models.

                                                                                                                                                                                                                    It's possible though that deep think 3 is running 3.1 models under the hood.

                                                                                                                                                                                                                • staticman2 7 hours ago

                                                                                                                                                                                                                  That's odd considering 3.0 is still labeled a "preview" release.

                                                                                                                                                                                                                  • ainch 3 hours ago

                                                                                                                                                                                                                    I think it'll be 3.1 by the time it's labelled GA - they said after 3.0 launch that they figured out new RL methods for Flash that the Pro model hasn't benefitted from.

                                                                                                                                                                                                                  • WarmWash 7 hours ago

                                                                                                                                                                                                                    The rumor was that 3.1 was today's drop

                                                                                                                                                                                                                • riku_iki 7 hours ago

                                                                                                                                                                                                                  > If gemini-3-deepthink gets above 85% on the private eval set, it will be considered "solved"

                                                                                                                                                                                                                  They never will on the private set, because that would mean it's being leaked to Google.

                                                                                                                                                                                                                • Scene_Cast2 6 hours ago

                                                                                                                                                                                                                  It's a shame that it's not on OpenRouter. I hate platform lock-in, but the top-tier "deep think" models have been increasingly requiring the use of their own platform.

                                                                                                                                                                                                                  • raybb 5 hours ago

                                                                                                                                                                                                                    OpenRouter is pretty great, but I think litellm does a very good job, and it's not a platform middleman, just a Python library. That being said, I haven't tried it with the deep think models.

                                                                                                                                                                                                                    https://docs.litellm.ai/docs/

                                                                                                                                                                                                                    • imiric 3 hours ago

                                                                                                                                                                                                                      Part of OpenRouter's appeal to me is precisely that it is a middle man. I don't want to create accounts on every provider, and juggle all the API keys myself. I suppose this increases my exposure, but I trust all these providers and proxies the same (i.e. not at all), so I'm careful about the data I give them to begin with.

                                                                                                                                                                                                                      • octoberfranklin 2 hours ago

                                                                                                                                                                                                                        Unfortunately that's ending with mandatory-BYOK from the model vendors. They're starting to require that you BYOK to force you through their arbitrary+capricious onboarding process.

                                                                                                                                                                                                                    • chr15m 2 hours ago

                                                                                                                                                                                                                      The golden age is over.

                                                                                                                                                                                                                    • Decabytes 4 hours ago

                                                                                                                                                                                                                      Gemini has always felt like someone who is book smart to me. It knows a lot of things. But if you ask it to do anything off script, it completely falls apart.

                                                                                                                                                                                                                      • dwringer 4 hours ago

                                                                                                                                                                                                                        I strongly suspect there's a major component of this type of experience being that people develop a way of talking to a particular LLM that's very efficient and works well for them with it, but is in many respects non-transferable to rival models. For instance, in my experience, OpenAI models are remarkably worse than Google models in basically any criterion I could imagine; however, I've spent most of my time using the Google ones and it's only during this time that the differences became apparent and, over time, much more pronounced. I would not be surprised at all to learn that people who chose to primarily use Anthropic or OpenAI models during that time had an exactly analogous experience that convinced them their model was the best.

                                                                                                                                                                                                                        • esafak 4 hours ago

                                                                                                                                                                                                                          I'd rather say it has a mind of its own; it does things its way. But I have not tested this model, so they might have improved its instruction following.

                                                                                                                                                                                                                          • vkazanov 4 hours ago

                                                                                                                                                                                                                            Well, one thing i know for sure: it reliably misplaces parentheses in lisps.

                                                                                                                                                                                                                            • esafak 4 hours ago

                                                                                                                                                                                                                              Clearly, the AI is trying to steer you towards the ML family of languages for its better type system, performance, and concurrency ;)

                                                                                                                                                                                                                        • jetter 6 hours ago

                                                                                                                                                                                                                          It is interesting that the video demo generates an .stl model. I run a lot of tests of LLMs generating OpenSCAD code (I recently launched https://modelrift.com, a text-to-CAD AI editor), and the Gemini 3 family LLMs actually give the best price-to-performance ratio right now. But they are very, VERY far from being able to spit out a complex OpenSCAD model in one shot. So I had to implement a full-fledged "screenshot vibe-coding" workflow where you draw arrows on a 3D model snapshot to explain to the LLM what is wrong with the geometry. Without a human in the loop, all top-tier LLMs hallucinate when debugging 3D geometry in agentic mode, and fail spectacularly.

                                                                                                                                                                                                                          • mchusma 4 hours ago

                                                                                                                                                                                                                            Hey, my 9-year-old son uses modelrift for creating things for his 3D printer; it's great! Product feedback: 1. You should probably ask me to pay now; I feel like I've used it enough. 2. You need a main dashboard page with a history of sessions. He thought he lost a file, and I had to dig through the billing history to find a UUID I thought was it and generate the URL. Naming sessions is important, and could be done with a small LLM after the user's initial prompt. 3. I don't think I like the default 3D model still being there once I have done something; blank would be better.

                                                                                                                                                                                                                            We download the stl and import to bambu. Works pretty well. A direct push would be nice, but not necessary.

                                                                                                                                                                                                                            • gundmc 5 hours ago

                                                                                                                                                                                                                              Yes, I've been waiting for a real breakthrough with regard to 3D parametric models, and I don't think this is it. The proprietary nature of the major players (Creo, Solidworks, NX, etc.) is a major drag. Sure, there's STP, but there's too much design-intent and feature loss there. I don't think OpenSCAD has the critical mass of mindshare or training data at this point, but maybe it's the best chance to force a change.

                                                                                                                                                                                                                              • lern_too_spel 4 hours ago

                                                                                                                                                                                                                                If you want that to get better, you need to produce a 3d model benchmark and popularize it. You can start with a pelican riding a bicycle with working bicycle.

                                                                                                                                                                                                                              • czhu12 an hour ago

                                                                                                                                                                                                                                It’s incredible how fast these models are getting better. I thought for sure a wall would be hit, but these numbers smash previous benchmarks. Does anyone have any idea what the big unlock is that people are finding now?

                                                                                                                                                                                                                                • fsh 43 minutes ago

                                                                                                                                                                                                                                  Companies are optimizing for all the big benchmarks. This is why there is so little correlation between benchmark performance and real world performance now.

                                                                                                                                                                                                                                • Metacelsus 8 hours ago

                                                                                                                                                                                                                                  According to benchmarks in the announcement, healthily ahead of Claude 4.6. I guess they didn't test ChatGPT 5.3 though.

                                                                                                                                                                                                                                  Google has definitely been pulling ahead in AI over the last few months. I've been using Gemini and finding it's better than the other models (especially for biology where it doesn't refuse to answer harmless questions).

                                                                                                                                                                                                                                  • CuriouslyC 6 hours ago

                                                                                                                                                                                                                                    Google is way ahead in visual AI and world modelling. They're lagging hard in agentic AI and autonomous behavior.

                                                                                                                                                                                                                                    • throwup238 8 hours ago

                                                                                                                                                                                                                                      The general-purpose ChatGPT 5.3 hasn’t been released yet, just 5.3-codex.

                                                                                                                                                                                                                                      • neilellis 7 hours ago

                                                                                                                                                                                                                                        It's ahead in raw power but not in function. It's like having the world's fastest engine but only one gear! Trouble is, some benchmarks only measure horsepower.

                                                                                                                                                                                                                                        • NitpickLawyer 7 hours ago

                                                                                                                                                                                                                                          > Trouble is, some benchmarks only measure horsepower.

                                                                                                                                                                                                                                          IMO it's the other way around. Benchmarks only measure applied horsepower on a set plane, with no friction, and your elephant is a point sphere. Goog's models have always punched above what benchmarks said, in real-world use at high context. They don't focus on "agentic this" or "specialised that", but the raw models, with good guidance, are workhorses. I don't know any other models where you can throw lots of docs at them and get proper context following and data extraction from wherever the data is to wherever you need it.

                                                                                                                                                                                                                                        • scarmig 5 hours ago

                                                                                                                                                                                                                                          > especially for biology where it doesn't refuse to answer harmless questions

                                                                                                                                                                                                                                          Usually, when you decrease false positive rates, you increase false negative rates.

                                                                                                                                                                                                                                          Maybe this doesn't matter for models at their current capabilities, but if you believe that AGI is imminent, a bit of conservatism seems responsible.

                                                                                                                                                                                                                                          • Davidzheng 7 hours ago

                                                                                                                                                                                                                                            I gather that 4.6's strengths are in long-context agentic workflows? At least compared to Gemini 3 Pro preview, Opus 4.6 seems to have a lot of advantages.

                                                                                                                                                                                                                                            • verdverm 7 hours ago

                                                                                                                                                                                                                                              It's a giant game of leapfrog, shift or stretch time out a bit and they all look equivalent

                                                                                                                                                                                                                                            • nkzd 6 hours ago

                                                                                                                                                                                                                                              Google's models and CLI harness feel behind in agentic coding compared to OpenAI's and Anthropic's.

                                                                                                                                                                                                                                              • simianwords 8 hours ago

                                                                                                                                                                                                                                                The comparison should be with GPT 5.2 pro which has been used successfully to solve open math problems.

                                                                                                                                                                                                                                              • mark_l_watson an hour ago

                                                                                                                                                                                                                                                I feel like a Luddite: unless I am running small local models, I use gemini-3-flash for almost everything: great for tool use, embedded use in applications, and Python agentic libraries, with broad knowledge, a good built-in web search tool, etc. Oh, and it is fast and cheap.

                                                                                                                                                                                                                                                I really only use gemini-3-pro occasionally, when researching and trying to better understand something. I guess I am not a good customer for the hyperscalers. That said, when I get home from travel, I will make a point of using Gemini 3 Deep Think for some practical research. I need a business card with the title "Old Luddite."

                                                                                                                                                                                                                                                • mark_l_watson an hour ago

                                                                                                                                                                                                                                                  Off topic comment (sorry): when people bash "models that are not their favorite model" I often wonder if they have done the engineering work to properly use the other models. Different models and architectures often require very different engineering to properly use them. Also, I think it is fine and proper that different developers prefer different models. We are in early days and variety is great.

                                                                                                                                                                                                                                                  • nphardon 36 minutes ago

                                                                                                                                                                                                                                                    I think I'm finally realizing that my job probably won't exist in 3-5 years. Things are moving so fast now that the LLMs are basically writing themselves. I think the earlier iterations moved slower because they were limited by human ability and productivity.

                                                                                                                                                                                                                                                    • ipaddr 2 hours ago

                                                                                                                                                                                                                                                      The benchmark should be: can you ask it to create a profitable business or product and send you the profit?

                                                                                                                                                                                                                                                      Everything else is bike shedding.

                                                                                                                                                                                                                                                      • siva7 6 hours ago

                                                                                                                                                                                                                                                        I can't shake off the feeling that Google's Deep Think models are not really different models, but just the old ones being run with a higher number of parallel subagents, something you can do yourself with their base model and opencode.

                                                                                                                                                                                                                                                        • Davidzheng 6 hours ago

                                                                                                                                                                                                                                                          And after I do that, how do I combine the output of 1000 subagents into one output? (I'm not being snarky here; I think it's a nontrivial problem.)

                                                                                                                                                                                                                                                          • tifik 6 hours ago

                                                                                                                                                                                                                                                            The idea is that each subagent is focused on a specific part of the problem and can use its entire context window for a more focused subtask than the overall one. So ideally the results aren't conflicting, they are complementary. And you just have a system that merges them, likely another agent.

                                                                                                                                                                                                                                                            • mattlondon 6 hours ago

                                                                                                                                                                                                                                                              You just pipe it to another agent to do the reduce step (i.e. fan-in) of the mapreduce (fan-out)

                                                                                                                                                                                                                                                              It's agents all the way down.

                                                                                                                                                                                                                                                              • jonathanstrange 6 hours ago

                                                                                                                                                                                                                                                                Start with 1024 and use half the number of agents each turn to distill the final result.
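The fan-out/fan-in exchange above can be sketched in plain Python. `run_agent` and `merge_agent` are hypothetical stubs standing in for real LLM calls (nothing here is Google's actual Deep Think mechanism); the map step farms out focused subtasks in parallel, and the reduce loop halves the number of drafts each round until one remains, per the suggestion above.

```python
from concurrent.futures import ThreadPoolExecutor

def run_agent(prompt: str) -> str:
    """Stub for a real model call; replace with your LLM client of choice."""
    return f"draft for: {prompt}"

def merge_agent(a: str, b: str) -> str:
    """Stub reducer: in practice, another LLM call that reconciles two drafts."""
    return f"merged({a} | {b})"

def fan_out_fan_in(task: str, n: int = 8) -> str:
    # Map step (fan-out): each subagent gets a focused slice of the task,
    # so it can spend its whole context window on one subproblem.
    subtasks = [f"{task} [part {i}]" for i in range(n)]
    with ThreadPoolExecutor() as pool:
        drafts = list(pool.map(run_agent, subtasks))
    # Reduce step (fan-in): merge drafts pairwise, halving the count each
    # round (e.g. 1024 -> 512 -> ... -> 1).
    while len(drafts) > 1:
        merged = [merge_agent(a, b) for a, b in zip(drafts[::2], drafts[1::2])]
        if len(drafts) % 2:          # odd draft out carries over to the next round
            merged.append(drafts[-1])
        drafts = merged
    return drafts[0]
```

With n subagents this performs n-1 merge calls, so the reduce step costs about as much as the map step; whether that pays off depends entirely on how good the merging agent is at reconciling conflicting drafts.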

                                                                                                                                                                                                                                                            • aliljet 5 hours ago

                                                                                                                                                                                                                                                              The problem here is that it looks like this is released with almost no real access. How are people using this without submitting to a $250/mo subscription?

                                                                                                                                                                                                                                                              • andxor 4 hours ago

                                                                                                                                                                                                                                                                People are paying for the subscriptions.

                                                                                                                                                                                                                                                                • tootie 3 hours ago

I gather this isn't intended as a consumer product. It's for academia and research institutions.

                                                                                                                                                                                                                                                                • simonw 7 hours ago

                                                                                                                                                                                                                                                                  The pelican riding a bicycle is excellent. I think it's the best I've seen.

                                                                                                                                                                                                                                                                  https://simonwillison.net/2026/Feb/12/gemini-3-deep-think/

                                                                                                                                                                                                                                                                  • steve_adams_86 5 minutes ago

                                                                                                                                                                                                                                                                    We've reached PGI

                                                                                                                                                                                                                                                                    • tasuki 3 hours ago

                                                                                                                                                                                                                                                                      Tbh they'd have to be absolutely useless at benchmarkmaxxing if they didn't include your pelican riding a bicycle...

                                                                                                                                                                                                                                                                      • nickthegreek 5 hours ago

                                                                                                                                                                                                                                                                        I routinely check out the pelicans you post and I do agree, this is the best yet. It seemed to me that the wings/arms were such a big hangup for these generators.

                                                                                                                                                                                                                                                                        • Manabu-eo 7 hours ago

How likely is it that this problem is already in the training set by now?

                                                                                                                                                                                                                                                                          • simonw 7 hours ago

                                                                                                                                                                                                                                                                            If anyone trains a model on https://simonwillison.net/tags/pelican-riding-a-bicycle/ they're going to get some VERY weird looking pelicans.

                                                                                                                                                                                                                                                                            • suddenlybananas 6 hours ago

                                                                                                                                                                                                                                                                              Why would they train on that? Why not just hire someone to make a few examples.

                                                                                                                                                                                                                                                                              • simonw 5 hours ago

                                                                                                                                                                                                                                                                                I look forward to them trying. I'll know when the pelican riding a bicycle is good but the ocelot riding a skateboard sucks.

                                                                                                                                                                                                                                                                                • suddenlybananas 5 hours ago

                                                                                                                                                                                                                                                                                  But they could just train on an assortment of animals and vehicles. It's the kind of relatively narrow domain where NNs could reasonably interpolate.

                                                                                                                                                                                                                                                                                  • simonw 5 hours ago

                                                                                                                                                                                                                                                                                    The idea that an AI lab would pay a small army of human artists to create training data for $animal on $transport just to cheat on my stupid benchmark delights me.

                                                                                                                                                                                                                                                                                    • suddenlybananas 5 hours ago

                                                                                                                                                                                                                                                                                      When you're spending trillions on capex, paying a couple of people to make some doodles in SVGs would not be a big expense.

                                                                                                                                                                                                                                                                                      • simonw 4 hours ago

                                                                                                                                                                                                                                                                                        The embarrassment of getting caught doing that would be expensive.

                                                                                                                                                                                                                                                                            • throwup238 7 hours ago

                                                                                                                                                                                                                                                                              For every combination of animal and vehicle? Very unlikely.

                                                                                                                                                                                                                                                                              The beauty of this benchmark is that it takes all of two seconds to come up with your own unique one. A seahorse on a unicycle. A platypus flying a glider. A man’o’war piloting a Portuguese man of war. Whatever you want.

                                                                                                                                                                                                                                                                              • recursive 7 hours ago

                                                                                                                                                                                                                                                                                No, not every combination. The question is about the specific combination of a pelican on a bicycle. It might be easy to come up with another test, but we're looking at the results from a particular one here.

                                                                                                                                                                                                                                                                                • svara 6 hours ago

                                                                                                                                                                                                                                                                                  More likely you would just train for emitting svg for some description of a scene and create training data from raster images.

                                                                                                                                                                                                                                                                                  • recursive an hour ago

                                                                                                                                                                                                                                                                                    None of this works if the testers are collaborating with the trainers. The tests ostensibly need to be arms-length from the training. If the trainers ever start over-fitting to the test, the tester would come up with some new test secretly.

                                                                                                                                                                                                                                                                              • zarzavat 7 hours ago

                                                                                                                                                                                                                                                                                You can always ask for a tyrannosaurus driving a tank.

                                                                                                                                                                                                                                                                                • verdverm 7 hours ago

                                                                                                                                                                                                                                                                                  I've heard it posited that the reason the frontier companies are frontier is because they have custom data and evals. This is what I would do too

                                                                                                                                                                                                                                                                                • enraged_camel 5 hours ago

                                                                                                                                                                                                                                                                                  Is there a list of these for each model, that you've catalogued somewhere?

                                                                                                                                                                                                                                                                                  • throwup238 7 hours ago

                                                                                                                                                                                                                                                                                    The reflection of the sun in the water is completely wrong. LLMs are still useless. (/s)

                                                                                                                                                                                                                                                                                  • deron12 7 hours ago

                                                                                                                                                                                                                                                                                    It's worth noting that you mean excellent in terms of prior AI output. I'm pretty sure this wouldn't be considered excellent from a "human made art" perspective. In other words, it's still got a ways to go!

                                                                                                                                                                                                                                                                                    Edit: someone needs to explain why this comment is getting downvoted, because I don't understand. Did someone's ego get hurt, or what?

                                                                                                                                                                                                                                                                                    • gs17 6 hours ago

                                                                                                                                                                                                                                                                                      It depends, if you meant from a human coding an SVG "manually" the same way, I'd still say this is excellent (minus the reflection issue). If you meant a human using a proper vector editor, then yeah.

                                                                                                                                                                                                                                                                                      • fvdessen 6 hours ago

maybe you're a pro vector artist but I couldn't create such a cool one myself in Illustrator tbh

                                                                                                                                                                                                                                                                                      • dfdsf2 6 hours ago

                                                                                                                                                                                                                                                                                        Indeed. And when you factor in the amount invested... yeah it looks less impressive. The question is how much more money needs to be invested to get this thing closer to reality? And not just in this instance. But for any instance e.g. a seahorse on a bike.

                                                                                                                                                                                                                                                                                      • saberience 6 hours ago

                                                                                                                                                                                                                                                                                        Do you have to still keep trying to bang on about this relentlessly?

                                                                                                                                                                                                                                                                                        It was sort of humorous for the maybe first 2 iterations, now it's tacky, cheesy, and just relentless self-promotion.

                                                                                                                                                                                                                                                                                        Again, like I said before, it's also a terrible benchmark.

                                                                                                                                                                                                                                                                                        • jeanloolz 4 hours ago

                                                                                                                                                                                                                                                                                          I'll agree to disagree. In any thread about a new model, I personally expect the pelican comment to be out there. It's informative, ritualistic and frankly fun. Your comment however, is a little harsh. Why mad?

                                                                                                                                                                                                                                                                                          • Davidzheng 6 hours ago

Eh, I find it more of a not-very-informative but lighthearted commentary

                                                                                                                                                                                                                                                                                            • simonw 5 hours ago

                                                                                                                                                                                                                                                                                              It being a terrible benchmark is the bit.

                                                                                                                                                                                                                                                                                            • dfdsf2 6 hours ago

                                                                                                                                                                                                                                                                                              Highly disagree.

I was expecting something more realistic... the true test of what you are doing is how representative the thing is in relation to the real world. E.g. does the pelican look like a pelican as it exists in reality? This cartoon stuff is cute but doesn't pass muster in my view.

                                                                                                                                                                                                                                                                                              If it doesn't relate to the real world, then it most likely will have no real effect on the real economy. Pure and simple.

                                                                                                                                                                                                                                                                                              • chriswarbo 6 hours ago

I disagree. The task asks for an SVG, which is a vector format associated with line drawings, clipart and cartoons. I think it's good that models are picking up on that context.

                                                                                                                                                                                                                                                                                                In contrast, the only "realistic" SVGs I've seen are created using tools like potrace, and look terrible.

                                                                                                                                                                                                                                                                                                I also think the prompt itself, of a pelican on bicycle, is unrealistic and cartoonish; so making a cartoon is a good way to solve the task.

                                                                                                                                                                                                                                                                                                • peaseagee 6 hours ago

                                                                                                                                                                                                                                                                                                  The request is for an SVG, generally _not_ the format for photorealistic images. If you want to start your own benchmark, feel free to ask for a photorealistic JPEG or PNG of a pelican riding a bicycle. Could be interesting to compare and contrast, honestly.
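For context on what "coding an SVG manually" means: an SVG is just declarative XML describing shapes and coordinates. A minimal sketch in Python using the standard library (the shapes and coordinates are arbitrary, purely illustrative placeholders, not anyone's actual benchmark output):

```python
import xml.etree.ElementTree as ET

# An SVG document is plain XML: a root <svg> element plus primitive shapes.
svg = ET.Element("svg", {
    "xmlns": "http://www.w3.org/2000/svg",
    "width": "200", "height": "120", "viewBox": "0 0 200 120",
})
# Two bicycle wheels as circles; placement chosen arbitrarily.
ET.SubElement(svg, "circle", {"cx": "50", "cy": "90", "r": "25",
                              "fill": "none", "stroke": "black",
                              "stroke-width": "3"})
ET.SubElement(svg, "circle", {"cx": "150", "cy": "90", "r": "25",
                              "fill": "none", "stroke": "black",
                              "stroke-width": "3"})
# A bird body as an ellipse above the wheels.
ET.SubElement(svg, "ellipse", {"cx": "100", "cy": "55", "rx": "30", "ry": "18",
                               "fill": "white", "stroke": "black"})
doc = ET.tostring(svg, encoding="unicode")
```

The resulting `doc` string can be saved as a `.svg` file and opened in any browser; an LLM answering the benchmark prompt emits markup of this kind directly as text.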

                                                                                                                                                                                                                                                                                              • sega_sai 2 hours ago

I do like Google models (and I pay for them), but the lack of a competitive agent is a major flaw in Google's offering. It is simply not good enough in comparison to Claude Code. I wish they'd put some effort there (as I don't want to pay for two subscriptions, to both Google and Anthropic)

                                                                                                                                                                                                                                                                                                • anematode 3 hours ago

                                                                                                                                                                                                                                                                                                  It found a small but nice little optimization in Stockfish: https://github.com/official-stockfish/Stockfish/pull/6613

                                                                                                                                                                                                                                                                                                  Previous models including Claude Opus 4.6 have generally produced a lot of noise/things that the compiler already reliably optimizes out.

                                                                                                                                                                                                                                                                                                  • neilellis 7 hours ago

                                                                                                                                                                                                                                                                                                    Less than a year to destroy Arc-AGI-2 - wow.

                                                                                                                                                                                                                                                                                                    • Davidzheng 7 hours ago

I unironically believe that ARC-AGI-3 will have an introduction-to-solved time of 1 month

                                                                                                                                                                                                                                                                                                      • ACCount37 4 hours ago

                                                                                                                                                                                                                                                                                                        Not very likely?

                                                                                                                                                                                                                                                                                                        ARC-AGI-3 has a nasty combo of spatial reasoning + explore/exploit. It's basically adversarial vs current AIs.

                                                                                                                                                                                                                                                                                                        • etyhhgfff 7 hours ago

                                                                                                                                                                                                                                                                                                          The AGI bar has to be set even higher, yet again.

                                                                                                                                                                                                                                                                                                          • dakolli 6 hours ago

                                                                                                                                                                                                                                                                                                            wow solving useless puzzles, such a useful metric!

                                                                                                                                                                                                                                                                                                            • esafak 3 hours ago

                                                                                                                                                                                                                                                                                                              How is spatial reasoning useless??

                                                                                                                                                                                                                                                                                                          • modeless 6 hours ago

                                                                                                                                                                                                                                                                                                            It's still useful as a benchmark of cost/efficiency.

                                                                                                                                                                                                                                                                                                            • XCSme 6 hours ago

                                                                                                                                                                                                                                                                                                              But why only a +0.5% increase for MMMU-Pro?

                                                                                                                                                                                                                                                                                                              • kingstnap 5 hours ago

It's possibly label noise, but you can't tell from a single number.

You would need to check whether every model is making mistakes on the same 20% or on a different 20%. If it's the same 20%, either those questions are really hard, they are keyed incorrectly, or they aren't stated with enough context to actually be solvable.

It happens. The old non-Pro MMLU had a lot of wrong answers. Even simple datasets like MNIST have digits labeled incorrectly or drawn so badly they aren't recognizable as digits anymore.
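
The shared-misses check described above can be sketched as a quick error-overlap computation. The data below is entirely made up for illustration (1 marks a correct answer, 0 an incorrect one); the function and model names are hypothetical:

```python
# Toy sketch: do two models miss the *same* benchmark items, or different ones?
# All per-item results here are fabricated for illustration.

def wrong_sets(results):
    """Map each model name to the set of item indices it answered incorrectly."""
    return {
        model: {i for i, ok in enumerate(answers) if not ok}
        for model, answers in results.items()
    }

def overlap(a, b):
    """Jaccard overlap of two error sets: 0 = disjoint misses, 1 = identical misses."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Hypothetical per-item correctness for two models on a 10-item benchmark.
results = {
    "model_a": [1, 1, 0, 1, 0, 1, 1, 0, 1, 1],  # wrong on items 2, 4, 7
    "model_b": [1, 1, 0, 1, 0, 1, 1, 1, 1, 0],  # wrong on items 2, 4, 9
}
errs = wrong_sets(results)
print(overlap(errs["model_a"], errs["model_b"]))  # 0.5
```

A high overlap across many models would point at the items themselves (hard, mislabeled, or underspecified questions), while a low overlap would suggest independent model weaknesses.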

                                                                                                                                                                                                                                                                                                                • kenjackson 6 hours ago

                                                                                                                                                                                                                                                                                                                  Everyone is already at 80% for that one. Crazy that we were just at 50% with GPT-4o not that long ago.

                                                                                                                                                                                                                                                                                                                  • XCSme 24 minutes ago

But 80% sounds far from good enough: that's a 20% error rate, unusable for autonomous tasks. Why stop at 80%? If we aim for AGI, it should score 100% on any benchmark we give it.

                                                                                                                                                                                                                                                                                                                    • kenjackson 3 minutes ago

                                                                                                                                                                                                                                                                                                                      Are humans 100%?

                                                                                                                                                                                                                                                                                                                • saberience 6 hours ago

It's a useless, meaningless benchmark though; it just has a catchy name, implying that if models solve it they have "AGI", which is clearly rubbish.

                                                                                                                                                                                                                                                                                                                  Arc-AGI score isn't correlated with anything useful.

                                                                                                                                                                                                                                                                                                                  • Legend2440 4 hours ago

                                                                                                                                                                                                                                                                                                                    It's correlated with the ability to solve logic puzzles.

                                                                                                                                                                                                                                                                                                                    It's also interesting because it's very very hard for base LLMs, even if you try to "cheat" by training on millions of ARC-like problems. Reasoning LLMs show genuine improvement on this type of problem.

                                                                                                                                                                                                                                                                                                                    • HDThoreaun 2 hours ago

ARC-AGI-2 is an IQ test. IQ tests have been shown over and over to have predictive power in humans: people who score well on them tend to be more successful.

                                                                                                                                                                                                                                                                                                                      • fsh 28 minutes ago

                                                                                                                                                                                                                                                                                                                        IQ tests only work if the participants haven't trained for them. If they do similar tests a few times in a row, scores increase a lot. Current LLMs are hyper-optimized for the particular types of puzzles contained in popular "benchmarks".

                                                                                                                                                                                                                                                                                                                      • jabedude 6 hours ago

How would we actually, objectively measure whether a model is AGI, if not with benchmarks like ARC-AGI?

                                                                                                                                                                                                                                                                                                                        • WarmWash 5 hours ago

                                                                                                                                                                                                                                                                                                                          Give it a prompt like

                                                                                                                                                                                                                                                                                                                          >can u make the progm for helps that with what in need for shpping good cheap products that will display them on screen and have me let the best one to get so that i can quickly hav it at home

                                                                                                                                                                                                                                                                                                                          And get back an automatic coupon code app like the user actually wanted.

                                                                                                                                                                                                                                                                                                                    • sinuhe69 7 hours ago

                                                                                                                                                                                                                                                                                                                      I'm pretty certain that DeepMind (and all other labs) will try their frontier (and even private) models on First Proof [1].

And I wonder how Gemini Deep Think will fare. My guess is that it will get halfway on some problems. But we will have to treat silence as failure, because nobody wants to publish a negative result, even though negative results are so important for scientific research.

                                                                                                                                                                                                                                                                                                                      [1] https://1stproof.org/

                                                                                                                                                                                                                                                                                                                      • zozbot234 7 hours ago

                                                                                                                                                                                                                                                                                                                        The 1st proof original solutions are due to be published in about 24h, AIUI.

                                                                                                                                                                                                                                                                                                                        • octoberfranklin 2 hours ago

                                                                                                                                                                                                                                                                                                                          Really surprised that 1stproof.org was submitted three times and never made front page at HN.

                                                                                                                                                                                                                                                                                                                          https://hn.algolia.com/?q=1stproof

This is exactly the kind of challenge I would want to judge AI systems on. It required ten bleeding-edge research mathematicians to publish problems they've solved while holding back the answers. I appreciate the huge amount of social capital and coordination that must have taken.

                                                                                                                                                                                                                                                                                                                          I'm really glad they did it.

                                                                                                                                                                                                                                                                                                                        • vessenes 7 hours ago

Not trained for agentic workflows yet, unfortunately; this looks like it will be fantastic when they have an agent-friendly one. Super exciting.

                                                                                                                                                                                                                                                                                                                          • dakolli 6 hours ago

It's really weird how you are all begging to be replaced by LLMs. Do you think that if agentic workflows get good enough you're going to keep your job? Or not have your salary reduced by 50%?

If agents get good enough, they're not going to build some profitable startup for you (or whatever people think they're doing with the LLM slot machines), because that implies anyone else with access to that agent can just copy you; it's what they're designed to do: launder IP/copyright. It's weird to see people get excited about this technology.

None of this is good. We are simply going to have our workforces replaced by assets owned by Google, Anthropic, and OpenAI. We'll all be fighting for the same barista jobs, or miserable factory jobs. Take note of how all these CEOs are trying to make it sound cool to "go to trade school", or how we need "strong American workers to work in factories".

                                                                                                                                                                                                                                                                                                                            • BeetleB 4 hours ago

                                                                                                                                                                                                                                                                                                                              > Its really weird how you all are begging to be replaced by llms, you think if agentic workflows get good enough you're going to keep your job? Or not have your salary reduced by 50%?

                                                                                                                                                                                                                                                                                                                              The computer industry (including SW) has been in the business of replacing jobs for decades - since the 70's. It's only fitting that SW engineers finally become the target.

                                                                                                                                                                                                                                                                                                                              • sgillen 4 hours ago

                                                                                                                                                                                                                                                                                                                                I think a lot of people assume they will become highly paid Agent orchestrators or some such. I don't think anyone really knows where things are heading.

                                                                                                                                                                                                                                                                                                                                • timeattack 3 hours ago

I agree with you and have similar thoughts (unfortunately for me, maybe). I personally know people who outsource not just their work but also their lives to LLMs, and reading their excited comments gives me a mix of cringe, FOMO, and dread. But what is the endgame for the likes of you and me, when we are finally evicted from our own craft? Stash money while we still can, watch the world crash and burn, and then go try to ascend in some other, not-yet-automated craft?

                                                                                                                                                                                                                                                                                                                                  • dakolli 3 hours ago

Yeah, that's a good question that I can't stop thinking about. I don't really enjoy much else other than building software; it's genuinely my favorite thing to do. Maybe there will be a world where we aren't completely replaced; after all, we still have handmade clothes that are highly coveted. I just worry it's going to uproot more than just software engineering: theoretically it shouldn't be hard to replace all the low-hanging fruit in anything that deals with computer I/O. Previous generations of automation created new opportunities for humans, but this one seems mostly to be a means of replacement. The advent of mass transportation and vehicles created machines that needed mechanics (and eventually software); I don't see that happening in this new paradigm.

I don't think that's going to make society very pleasant if everyone's fighting over the few remaining ways to make a livelihood. People need to work to eat. I certainly don't see the capitalist class giving everyone UBI and letting us garden or paint for the rest of our lives. I worry we're likely to end up in trenches, or purged by some other means.

                                                                                                                                                                                                                                                                                                                                  • vessenes 2 hours ago

                                                                                                                                                                                                                                                                                                                                    I’m someone who’d like to deploy a lot more workers than I want to manage.

                                                                                                                                                                                                                                                                                                                                    Put another way, I’m on the capital side of the conversation.

The good news for labor with experience and creativity is that it just started costing 1/100,000th of what it used to to get onto that side of the equation.

                                                                                                                                                                                                                                                                                                                                    • jimmymcgee73 an hour ago

If LLMs truly cause widespread replacement of labor, you're screwed just as much as anyone else. If we hit, say, 40% unemployment, do you think people will care whether you own your home? Whether you have currency? The best-case outcome is universal income and a pseudo-utopia where everyone does OK. The "bad" scenario is widespread war.

                                                                                                                                                                                                                                                                                                                                      I am one of the “haves” and am not looking forward to the instability this may bring. Literally no one should.

                                                                                                                                                                                                                                                                                                                                      • blibble 39 minutes ago

                                                                                                                                                                                                                                                                                                                                        > I am one of the “haves” and am not looking forward to the instability this may bring. Literally no one should.

These people always forget that capitalism is permitted to exist by the consent of the people.

If there's 40% unemployment, it won't continue to exist, regardless of what TV/TikTok/ChatGPT says.

                                                                                                                                                                                                                                                                                                                                    • ergonaught 5 hours ago

Most folks don't seem to think that far down the line, or they haven't caught on to the reality that the people who actually make decisions will make the obvious kind of decisions (e.g., fire the humans, cut the pay) that they already make.

                                                                                                                                                                                                                                                                                                                                      • blibble 3 hours ago

                                                                                                                                                                                                                                                                                                                                        they think they're going to be the person making that decision

but they forget there's likely someone above them making exactly the same decision about them

                                                                                                                                                                                                                                                                                                                                      • newswasboring 4 hours ago

                                                                                                                                                                                                                                                                                                                                        You don't hate AI, you hate capitalism. All the problems you have listed are not AI issues, its this crappy system where efficiency gains always end up with the capital owners.

                                                                                                                                                                                                                                                                                                                                    • ramshanker 7 hours ago

                                                                                                                                                                                                                                                                                                                                      Do we get any model architecture details like parameter size etc.? Few months back, we used to talk more on this, now it's mostly about model capabilities.

                                                                                                                                                                                                                                                                                                                                      • Davidzheng 7 hours ago

                                                                                                                                                                                                                                                                                                                                        I'm honestly not sure what you mean? The frontier labs have kept arch as secrets since gpt3.5

                                                                                                                                                                                                                                                                                                                                        • willis936 5 hours ago

                                                                                                                                                                                                                                                                                                                                          At the very least gemini 3's flyer claims 1T parameters.

                                                                                                                                                                                                                                                                                                                                      • Legend2440 4 hours ago

                                                                                                                                                                                                                                                                                                                                        I'm really interested in the 3D STL-from-photo process they demo in the video.

                                                                                                                                                                                                                                                                                                                                        Not interested enough to pay $250 to try it out though.

                                                                                                                                                                                                                                                                                                                                        • Dirak 5 hours ago

                                                                                                                                                                                                                                                                                                                                          Praying this isn't another Llama4 situation where the benchmark numbers are cooked. 84.6% on Arc-AGI is incredible!

                                                                                                                                                                                                                                                                                                                                          • ismailmaj 6 hours ago

                                                                                                                                                                                                                                                                                                                                            top 10 elo in codeforces is pretty absurd

                                                                                                                                                                                                                                                                                                                                            • jonathanstrange 8 hours ago

                                                                                                                                                                                                                                                                                                                                              Unfortunately, it's only available in the Ultra subscription if it's available at all.

                                                                                                                                                                                                                                                                                                                                              • andrewstuart 6 hours ago

                                                                                                                                                                                                                                                                                                                                                Gemini was awesome and now it’s garbage.

                                                                                                                                                                                                                                                                                                                                                It’s impossible for it to do anything but cut code down, drop features, lose stuff and give you less than the code you put in.

                                                                                                                                                                                                                                                                                                                                                It’s puzzling because it spent months at the head of the pack now I don’t use it at all because why do I want any of those things when I’m doing development.

                                                                                                                                                                                                                                                                                                                                                I’m a paid subscriber but there’s no point any more I’ll spend the money on Claude 4.6 instead.

                                                                                                                                                                                                                                                                                                                                                • halapro 6 hours ago

                                                                                                                                                                                                                                                                                                                                                  I never found it useful for code. It produced garbage littered with gigantic comments.

                                                                                                                                                                                                                                                                                                                                                  Me: Remove comments

                                                                                                                                                                                                                                                                                                                                                  Literally Gemini: // Comments were removed

                                                                                                                                                                                                                                                                                                                                                  • andrewstuart 6 hours ago

                                                                                                                                                                                                                                                                                                                                                    It would make more sense to me if it had never been awesome.

                                                                                                                                                                                                                                                                                                                                                    • mortsnort 4 hours ago

                                                                                                                                                                                                                                                                                                                                                      They may quantize the models after release to save money.

                                                                                                                                                                                                                                                                                                                                                  • ergonaught 5 hours ago

                                                                                                                                                                                                                                                                                                                                                    It seems to be adept at reviewing/editing/critiquing, at least for my use cases. It always has something valuable to contribute from that perspective, but has been comparatively useless otherwise (outside of moats like "exclusive access to things involving YouTube").

                                                                                                                                                                                                                                                                                                                                                  • m3kw9 6 hours ago

                                                                                                                                                                                                                                                                                                                                                    Gemini 3 Pro/Flash is stuck in preview for months now. Google is slow but they progress like a massive rock giant.

                                                                                                                                                                                                                                                                                                                                                    • okokwhatever 7 hours ago

                                                                                                                                                                                                                                                                                                                                                      I need to test the sketch creation a s a p. I need this in my life because learning to use Freecad is too difficult for a busy person like me (and frankly, also quite lazy)

                                                                                                                                                                                                                                                                                                                                                      • sho_hn 6 hours ago

                                                                                                                                                                                                                                                                                                                                                        FWIW, the FreeCAD 1.1 nightlies are much easier and more intuitive to use due to the addition of many on-canvas gizmos.

                                                                                                                                                                                                                                                                                                                                                      • syntaxing 8 hours ago

                                                                                                                                                                                                                                                                                                                                                        Why a Twitter post and not the official Google blog post… https://blog.google/innovation-and-ai/models-and-research/ge...

                                                                                                                                                                                                                                                                                                                                                        • dang 7 hours ago

                                                                                                                                                                                                                                                                                                                                                          Just normal randomness I suppose. I've put that URL at the top now, and included the submitted URL in the top text.

                                                                                                                                                                                                                                                                                                                                                          • meetpateltech 8 hours ago

                                                                                                                                                                                                                                                                                                                                                            The official blog post was submitted earlier (https://news.ycombinator.com/item?id=46990637), but somehow this story ranked up quickly on the homepage.

                                                                                                                                                                                                                                                                                                                                                            • verdverm 7 hours ago

                                                                                                                                                                                                                                                                                                                                                              @dang will often replace the post url & merge comments

                                                                                                                                                                                                                                                                                                                                                              HN guidelines prefer the original source over social posts linking to it.

                                                                                                                                                                                                                                                                                                                                                            • aavci 7 hours ago

                                                                                                                                                                                                                                                                                                                                                              Agreed - blog post is more appropriate than a twitter post

                                                                                                                                                                                                                                                                                                                                                            • HardCodedBias 5 hours ago

                                                                                                                                                                                                                                                                                                                                                              Always the same with Google.

                                                                                                                                                                                                                                                                                                                                                              Gemini has been way behind from the start.

                                                                                                                                                                                                                                                                                                                                                              They use the firehose of money from search to make it as close to free as possible so that they have some adoption numbers.

                                                                                                                                                                                                                                                                                                                                                              They use the firehose from search to pay for tons of researchers to hand hold academics so that their non-economic models and non-economic test-time-compute can solve isolated problems.

                                                                                                                                                                                                                                                                                                                                                              It's all so tiresome.

                                                                                                                                                                                                                                                                                                                                                              Try making models that are actually competitive, Google.

                                                                                                                                                                                                                                                                                                                                                              Sell them on the actual market and win on actual work product in millions of people lives.

                                                                                                                                                                                                                                                                                                                                                              • dperhar 6 hours ago

                                                                                                                                                                                                                                                                                                                                                                Does anyone actually use Gemini 3 now? I cant stand its sleek salesy way of introduction, and it doesnt hold to instructions hard – makes it unapplicable for MECE breakdowns or for writing.

                                                                                                                                                                                                                                                                                                                                                                • sigmar an hour ago

                                                                                                                                                                                                                                                                                                                                                                  I use it often. Occasionally for quick questions, but mostly for deep research.

                                                                                                                                                                                                                                                                                                                                                                  • copperx 6 hours ago

                                                                                                                                                                                                                                                                                                                                                                    I do. It's excellent when paired with an MCP like context7.

                                                                                                                                                                                                                                                                                                                                                                    • throwa356262 6 hours ago

                                                                                                                                                                                                                                                                                                                                                                      I dont agree, Gemini 3 is pretty good, even the Lite version.

                                                                                                                                                                                                                                                                                                                                                                      • dperhar 6 hours ago

                                                                                                                                                                                                                                                                                                                                                                        What do you use it for and why? Genuinely curious

                                                                                                                                                                                                                                                                                                                                                                      • jeffbee 6 hours ago

                                                                                                                                                                                                                                                                                                                                                                        It indeed departs from instructions pretty regularly. But I find it very useful and for the price it beats the world.

                                                                                                                                                                                                                                                                                                                                                                        "The price" is the marginal price I am paying on top of my existing Google 1, YouTube Premium, and Google Fi subs, so basically nothing on the margin.