I just wrote a reply to a comment talking about the AI tells this writing has, but it got flagged so my comment disappeared when I hit post. I'll rephrase out of spite:
My first thought upon reading this was that an LLM had been instructed to add a pithy meme joke to each paragraph. They don't make sense in context, and while some terminally online people do speak in memes, those people aren't quoting doge in 2025.
There's also a sense of incoherence in the whole piece. For instance, this section:
"- after: 22 million videos + 1 million images (now we're talking)
they basically hoovered up everything: something-something v2, kinetics, howto100m, and a billion youtube videos"
Was it a billion vids or 22m? It turns out the latter sentence is just rephrasing the list of sources in a cool casual way, and the last one is called YT-Temporal-1B. That's a billion frames of video, not a billion videos.
Also, the author of the blog "Ksagar Atharva" doesn't appear anywhere in the list of authors on the linked FB research paper with Yann LeCun as a co-author. Unless the blog author is using a heavily modified pseudonym.
The research is very real but the blog post appears to be very fake.
It's someone explaining the research as a blog essay right? Which is very commonly done. We=humanity
Exactly. It's very obvious what "we" is referring to here.
I'm using eigenrobot's (X user) prompt for ChatGPT and the style is very recognizable. Everything lowercase, the tone, zoomer abbreviations, esoteric style of jokes.
Yeah, obviously LLM written. They tried to be unique by removing capitals.
I don’t know, 400k people are listening to the White House streaming lo-fi hip hop on X right now with cutesy videos of Trump on one side and his executive orders streaming on the other at 4am. I think there’s plenty of people quoting doge in 2025.
If you’re in the US, you likely work with them and they have learned to studiously avoid talking about politics except in vagaries to avoid conflict.
>those people aren't quoting doge in 2025
Could you explain what this means? Is this article quoting doge?
There was a clear attempt at the doge meme format, yes:
> very scientific. much engineering.
Emphasis on attempt because you're supposed to use words with grammatically incorrect modifiers, and the first one doesn't. (Even the second one doesn't seem entirely incorrect to me? I'm not a native speaker though.) "many scientific, so engineering" for example would have worked.
I assume they, or most likely their LLM, tried too hard to follow the most popular sequence (very, much, wow) and failed at it.
You'd think it would be easy to write "very engineering, much scientific". LLMs work in mysterious ways.
"Much engineering was required" Archaic but still used a bit in articles or to give a certain vibe.
> some terminally online people do speak in memes, those people aren't quoting doge in 2025.
You may be surprised to find out how incorrect this is.
I can think of two popular conservative sites off the top of my head that are likely to quote Doge. I read all news in order not to be an insufferable ideologue. So again, off the top of my head: NotTheBee (I think affiliated with BabylonBee (the conservative The Onion)) and Twitchy. Among YouTubers, I think Asmongold, and I’m sure others like Steven Crowder, who is himself in a famous meme.
That said… yea, you are probably right.
Aren't those sites primarily Russian bots tho?
Isn’t that just a synonym for conservative?
Not conservative but I used to love the meme before it was co-opted by musk, so I will occasionally use it as a "haha now you feel OLD" without thinking of its modern connotations.
Also I think it’s somehow important to not let fascism steal our cultural heritage, even if it’s just a meme.
In my country, far righters are displaying the country’s flag everywhere. Now you can’t display a French flag without being thought of as a far-right person. That’s honestly insufferable.
I know it’s less important with doge but still : before being a crypto it was just a picture of an overly innocent and enthusiastic dog. And even when it became a little crypto, it was totally assumed that it was a meme coin and wasn’t meant for speculation, the idea was that 1DOGE = 1DOGE only and people gifted them to other people who made nice contributions on the internet.
Musk broke all of this when he started using it to do gigantic pumps and dumps using his own visibility on Twitter.
We don’t have to let fascism steal all the popular symbols / memes, because they will steal them anyway.
Lets see you try to recover the swastika from fascism ;)
Pure vision will never be enough because it does not contain information about the physical feedback like pressure and touch, or the strength required to perform a task.
For example, so that you don't crush a human when doing massage (but still need to press hard), or apply the right amount of force (and finesse?) to skin a fish fillet without cutting the skin itself.
Practically in the near term, it's hard to sample from failure examples with videos on Youtube, such as when food spills out of the pot accidentally. Studying simple tasks through the happy path makes it hard to get the robot to figure out how to do something until it succeeds, which can appear even in relatively simple jobs like shuffling garbage.
With that said, I suppose a robot can be made to practice in real life after learning something from vision.
> Pure vision will never be enough because it does not contain information about the physical feedback like pressure and touch, or the strength required to perform a task.
I'm not sure that's necessarily true for a lot of tasks.
A good way to measure this in your head is this:
"If you were given remote control of two robot arms, and just one camera to look through, how many different tasks do you think you could complete successfully?"
When you start thinking about it, you realize there are a lot of things you could do with just the arms and one camera, because you as a human have really good intuition about the world.
It therefore follows that robots should be able to learn with just RGB images too! Counterexamples would be things like grabbing an egg without crushing, perhaps. Though I suspect that could also be done with just vision.
> It therefore follows that robots should be able to learn with just RGB images too!
I don't see how that follows. Humans have trained by experimenting with actually manipulating things, not just by vision. It's not clear at all that someone who had gained intuition about the world exclusively by looking at it would have any success with mechanical arms.
counterpoint: think about all the tasks you could do with your hands and arms while your eyes are closed. i think it's really a lot of stuff, considering blind people can do the vast majority of things sighted people can do, and i suspect anything you could do with your eyes closed would be extremely difficult to do with a camera feed as the literal only sensory input
Humans did not accumulate that intuition just using images. In the example you gave, you subconsciously augment the image information with a lifetime of interacting with the world using all the other senses.
I think you vastly underestimate how difficult the task you are proposing would be without depth or pressure indication, even for a super intelligence like humans.
Simple concept: pick up a glass and pour its contents into a vertical hole the approximate size of your mouth. Think of all the failure modes that can be triggered in this trivial example you do multiple times a day. Doing the same from a single camera feed with no other indicators would take you hours to master, and you already are a super intelligent being.
The point is how much non-vision sensors, versus pure vision, help humans be humans. Don't you think LLMs already proved that generalizability doesn't come from multi-modality but from scaling a single modality itself? And JEPA is for sure designed to do a better job at that than an LLM. So I have no doubt that raw scaling plus an RL boost will kick in highly predictable and specific robotic movements.
A routine gesture I've done everyday for almost all my life: getting a glass out of the shelves and into my left hand. It seems like a no brainer, I open the cabinet with my left hand, take the glass with my right hand, throw the glass from my right hand to the left hand while closing the cabinet with my shoulder. Put the glass under the faucet with left hand, open the faucet with the right hand.
I have done this 3-second gesture, and variations of it, my whole life basically, and never noticed I was throwing the glass from one hand to the other without any visual feedback.
If I have to pour water into my mouth, you can bet it's going all over my shirt. That's not how we drink.
Except this is the absolute most common thing humans do, and my argument is not that it will spill water all over but rather that it will shatter numerous glasses, knock them over, etc., all before it has even picked up the glass.
The same process will be repeated many times trying to move the glass to its “face”, and then when any variable changes (plastic vs glass, size, shape, location) all bets are off, purely because there just plainly isn't enough information.
Humans have innate knowledge that help them interact with the world and can learn from physical interaction for the rest. RGB images aren't enough.
Video games have shown that we can control pretty darn well characters in virtual worlds where we have not experienced their physics. We just look at a 2D monitor and using a joystick/keyboard we manage to figure it out.
a game has very limited physics. like the buttons you press are pre-tuned to perform certain actions and you arent dealing with continuous nearly infinite possibilities with large ranges of motion, pressure, speed etc. like think about how difficult the game QWOP is because you mostly just have visual feedback
Yeah, but we already have a conception of what physics should be prior to that, which helps us enormously. It's not like game designers are coming up with stuff that intentionally breaks our naïve physics.
[dead]
If the robot already knows the happy path, the training difficulty falls sharply, at least if it can continue after a recovery.
The tasks you do to recover from the failure is often different from the happy path. For example, the happy path of dumping garbage is carrying a garbage bag to a collection bin. The non-happy path is that the bin is overflowing and you have to put the bag on the ground, or if the bag leaks and you need to move to a new bag, or if the bag breaks entirely and you have to pick up the trash again.
But yeah, I think a better way to put it is that sampling the happy path would indeed make the failure case easier, but sampling just happy paths is far from sufficient for completing even some of the simplest human tasks that involve failure.
On humans, you can generally see the force they apply by looking at strain.
The error margins will be huge, and for small enough force (like the skinning part or handling fine mechanical stuff) there's basically almost zero signal.
"why didn't we think of this sooner?", asks the article. Not sure who the "we" is supposed to be, but the robotics community has definitely thought of this before. https://robo-affordances.github.io/ from 2023 is one pretty relevant example that comes to mind, but I have recollections of similar ideas going back to at least 2016 or so (many of which are cited in the V-JEPA2 paper). If you think data-driven approaches are a good idea for manipulation, then the idea of trying to use Youtube as a source of data (an extremely popular data source in computer vision for the past decade) isn't exactly a huge leap. Of course, the "how" is the hard part, for all sorts of reasons. And the "how" is what makes this paper (and prior research in the area) interesting.
I definitely saw somebody at Actuate last year talking about supplementing training videos for VLA with Youtube, but I think they actually found that "any" video of the real world helped give a better physics "understanding" to the model.
I didn't understand a single word of this post, or what was supposed to be solved, and had to stop reading.
Was this actually written by a human being? If so, the author(s) suffer from severe language communication problems. It doesn't seem grounded in reality, at least not in my personal experience with robotics. But here's my real world take:
Robotics is going to be partially solved when ROS/ROS2 becomes effectively exterminated and completely replaced by a sane robotics framework.
I seriously urge the authors to use ROS/ROS2. Show us by implementing your solution with ROS, pushing it to a repository, and letting others verify what you solved, maybe? Suffer a bit with the framework and then write a real hands-on post about real robotics, and not just wander through fancy incomprehensible stuff that probably no one will ever do.
Then we can maybe start talking about robotics.
This article contains so many falsehoods and history rewrites that it's pretty painful to read.
I was unable to make it through the article (now we're talking).
No you didn't, and I don't even need to click on the link to know it
This was a bit hard to read. It would be good to have a narrative structure and more clear explanation of concepts.
> This was a bit hard to read.
This writing style is prominent on Twitter and niche Discords. It's funny how much I've come to be able to cut right through it, but if you haven't seen much of it it's really hard to parse. That's by design, too. The vibe of this writing style is to project an air of confidence so strong that the author doesn't care if you get it or not. It's a sort of humblebrag where the writing is supposed to flex the author's understanding of the subject while also not caring if you get it or not.
As others have already covered, there's also some heavy stretching of the truth and rewriting of history going on in this post. That's also common of the extreme bravado in this style of semi-impenetrable writing: The vagueness and ambiguities allow the author to make grandiose claims but then wiggle out of them later if someone is astute enough to catch on.
For example: The blog post is written as “We…” but is the author part of the team? Or is he using “we” meaning society in general?
What's the point in writing something while "not caring" if the reader understands or not? Seems like a false confidence or false bravado to me; it reads like an attempt to project an impression, and not really an attempt to communicate.
Basically: If you understand the topic well, you’re not the target audience.
This is a type of information arbitrage where someone samples something intellectual without fully understanding it, then writes about it for a less technical audience. Their goal is to appear to be the expert on the topic, which translates into clout, social media follows, and eventually they hope job opportunities.
The primary goal of the writing isn’t to get you to understand the topic clearly, because that would diminish the sense that the author is more knowledgeable than you. The goal is to sound guru-like while making the topic feel impenetrably complex for you, while appearing playfully casual for the author.
I guess "bullshitting as a career" isn't going away any time soon.
This style of writing is very effective at convincing people in their impressionable years of a narrative or viewpoint, often one that is hard to defend with more traditional writing styles.
I hope I'm wrong, but this looks like an effort to normalize such writing style. As this happens, intelligent discourse and rhetoric become harder.
Very intentional. Their response would be: “if you need narrative structure and clear explanation of concepts, yngmi”.
It would also be good if the perspective of the article would stay put. This "we" and "they" thing was at best confusing and at worst possibly a way to get more clicks or pretend the author had something to do with the work.
I do not know and do not care much about robotics per se, but I wish LLMs were better with spatial reasoning. If the new insight helps with that - great!
I dabbled a bit in geolocation with LLMs recently. It is still surprising to me how good they are at finding the general area where a picture was taken. Give one a photo of a random street corner on this earth and it will likely not only tell you the correct city or town but most often even the correct quarter.
On the other hand, if you ask it for a birds eye view of a green, a brown and a white house on the north side of a one-way street (running west to east) east of an intersection running north to south, it may or may not get it right. If you want it to add an arrow going in the direction of the one-way street, it certainly has no clue at all and the result is 50/50.
Dr Fei-Fei Li talks about this as the LWM (Large World Model) during this interview: https://youtu.be/fQGu016AlVo and with https://www.worldlabs.ai/
IMO, VideoMimic is a better proof-of-concept
Looks like it was trained on Shaolin Drunken Fist videos. Does it look drunk because of the videos or because there's a discrepancy between videos and it not accounting for gravity and physics in general?
My guess would be lack of actuators. For instance, this robot looks like it has an ankle that can only go up and down, but not roll like a human's. Also, I wonder if there's a center of gravity issue, as it almost always appears to be leaning backwards to even out.
I think it's still pretty impressive in its recoveries, even though there's an unnaturally large number of them necessary. About 8 seconds into the video on the homepage, it almost misses and ends up slipping off the second step. I've eaten shit at missing a couple inch curb, though I don't think "graceful" has ever been used as a descriptor for me. So the fact that it just recovers and keeps going without issue is impressive to me.
> So the fact that it just recovers and keeps going without issue is impressive to me.
I'm pretty sure that's just a matter of reaction speed, and of it maintaining a constant focus/vigilance on its movement that you'd usually not reserve outside of some sports, or situations pre-identified as dangerous, like concentrating on balance and not getting into a position that overstresses your joints when you know it's icy.
> right now, you need to show the robot pictures of what you want it to do. want it to "clean the kitchen"? better have a photo of a clean kitchen handy.
What about using Flux Kontext (or Controlnets) to turn the messy kitchen into a clean kitchen?
Sure thing, let me just put the fridge in the washing machine.
So video gen models basically can be extrapolated to control robotics ? How long until Veo3 robots take over?
Solving robotics is some claim.
Spoiler: not solved
Betteridge not only applies to headlines with questions but it also works quite well with Twitter style headlines.
Indeed, the robotics edge-case problem space complexity balloons far faster than most assume.
Physics informed training is a real methodology (simple introduction to the subject: https://www.youtube.com/@Eigensteve/videos ).
However, the slop article is 80% nonsense. =3
Solved??? Where?
Yeah, wake me up when they have a robot that can wash, peel, cut fruit and vegetables; unwrap, cut, cook meat; measure salt and spices; whip cream; knead and shape dough; and clean up the resulting mess from all of these. Then they will have "solved" part of robotics.
i thought all the cool data driven robotics stuff was like reinforcement learning from sensors that track moving effectors in the real world with online retraining that mimics the sensorimotor experimentation that is observed during the developmental phases of real neurobiological systems?
so you just kinda let it run for a while and it bumps and squirms around until it stands up or whatever.
seems also the future for real ai?
Someone watched 'Devs' ?
if you havent - highly recommended.
Not sure why people love this show. Really terrible writing.
Love Alex Garland but the characters ruin the show.
Do you have a link or a less generic search term?
It’s a TV show made by Alex Garland https://m.imdb.com/title/tt8134186/ It’s pretty good sci-fi IMHO
[flagged]
Do we have a “let me ChatGPT that for you..” site yet?
Extremely oversold article.
> the core insight: predict in representation space, not pixels
We've been doing this since 2014? Not only that, others have been doing it at a similar scale. e.g. Nvidia's world foundation models (although those are generative).
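For anyone unfamiliar with the distinction, here's a toy illustration of predicting in representation space instead of pixel space (this is not V-JEPA's actual architecture; the frozen random-projection "encoder" is just a stand-in for a learned one):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "frames": ten flattened 8x8 grayscale images.
frames = rng.random((10, 64))

# A frozen toy encoder: random linear projection into a 16-dim latent space.
W_enc = rng.standard_normal((64, 16)) / 8.0

def encode(x):
    return x @ W_enc

# Pixel-space objective: compare a predicted next frame to the actual pixels.
def pixel_loss(pred_frame, next_frame):
    return float(np.mean((pred_frame - next_frame) ** 2))

# Representation-space objective: compare *embeddings*, so the model is never
# penalized for pixel-level detail (texture, noise) it cannot hope to predict.
def latent_loss(pred_frame, next_frame):
    return float(np.mean((encode(pred_frame) - encode(next_frame)) ** 2))

# A trivial "predictor" that just copies the current frame forward.
pred = frames[:-1]
nxt = frames[1:]
p = pixel_loss(pred, nxt)
l = latent_loss(pred, nxt)
```

Same data, same predictor; only the space in which the error is measured changes, which is the whole point of the quote.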
> zero-shot generalization (aka the money shot)
This is easily beaten by flow-matching imitation learning models like what Pi has.
> accidentally solved robotics
They're doing 65% success on very simple tasks.
The research is good. This article however misses a lot of other work in the literature. I would recommend you don't read it as an authoritative source.
Does YouTube allow massive scraping like this in their ToS?
Per HiQ vs. LinkedIn, it doesn't matter what their ToS says if the scraper didn't have to agree to the ToS to scrape the data. YouTube will serve videos to someone who isn't logged in. So if you've never agreed to YouTube's ToS, you can scrape the videos. If YT forced everyone to log in before they could watch a video, then anyone who wants to scrape videos would have had to agree to the ToS at some point.
My "lawyer" (gpt4o) claims that since YouTube is merely a non-exclusive licensee of the user content uploaded to their service, even if they have such restrictions in their ToS (they do), they likely would not hold up in court, citing [0]. Something about that non-exclusivity meaning they cannot constrain the copyright further on their own terms. Which I guess makes sense?
And since scraping of publicly available data is not illegal (in the US, according to the aforementioned "lawyer"), it seems like it's okay?
Not legal advice.
[0] https://www.skadden.com/insights/publications/2024/05/distri...
I don't think they can legally prevent it
They don't, and neither do I allow my site (whose content I found on Gemini) to be scraped.
Probably not.
Who cares at this point? No one is stopping ML sets from being primarily pirated. The current power is effectively dismantling copyright for AI related work.
> Who cares at this point
Anyone who has a shred of integrity. I'm not a fan of overreaching copyright laws, but they've been strictly enforced for years now. Decades, even. They've ruined many lives, like how they killed Aaron Swartz.
But now, suddenly, violating copyright is totally okay and carries no consequences whatsoever because the billionaires decided that's how they can get richer now?
If you want to even try to pretend you don't live in a plutocracy and that the rule of law matters at all these developments should concern you.
> … like how they killed Aaron Swartz.
I can’t imagine why you’d let the FBI off the hook
> If you want to even try to pretend you don't live in a plutocracy and that the rule of law matters at all
Can't even pretend anymore, this season jumped the shark
Aaron Swartz died of suicide, not copyright.
His death was a tragedy but it wasn't done to him.
There's an English phrase "hounded to death", meaning that someone was pursued and hassled until they died. It doesn't specify the cause of death, but I think the assumption would be suicide, since you can't actually die of fatigue.
I think that's what was done to Aaron Swartz.
Many people have dealt with the law, with copyright infringement, even with gross amounts of it, and had the book thrown at them, and survived the experience.
Swartz was ill. It is a tragedy he did not survive the experience, and indeed, trial is very stressful. But he was no more hounded than any defendant who comes under federal scrutiny and has to defend themselves in a court of law via the trial system. Kevin Mitnick spent a year in prison (first incarceration) and survived it. Swartz was offered six months and committed suicide.
I don't know how much we should change of the system to protect the Aaron Swartzs of the world; that's the mother of all Chesterton's Fences.
Many people get (for example) pneumonia and recover. Some people get pneumonia and die. The people who died of pneumonia died because of pneumonia. The fact that other people survived it doesn't mean that they didn't die of it.
Saying that we should not work on cures for pneumonia because it's a Chesterton Fence is obviously, blatantly, illogical. Saying that we should change the system so that government officials working for moneyed interests can't hound someone to death is similarly illogical.
Maybe someone should throw you in prison for a year on some BS made-up charges to see how well you survive it. We can use it as a data point for your argument.
> The current power is effectively dismantling copyright for AI related work.
Out of the loop apparently, could you elaborate? By "the current power" I take you mean the current US administration?
Trump fired the head of the copyright office:
https://www.heise.de/en/news/After-criticism-of-AI-training-...
The "Big Beautiful Bill" contains a clause that prohibits state "AI" legislation.
Trump has a "Crypto and AI czar" who is very active in promoting "AI" on his YouTube propaganda outlet. The same czar also promoted, pre-election of course, accelerated peace with Russia and then stopped talking about the subject altogether.
Oh wow okay, genuinely missed these. Thanks.
What ToS
Friendly unit conversion man at your service: 114 years.
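The arithmetic checks out if you assume an average of roughly an hour per video (the one-hour figure is my assumption, not from the thread):

```python
# Back-of-envelope: 1,000,000 videos at an assumed average of ~1 hour each.
total_minutes = 1_000_000 * 60
years = total_minutes / (60 * 24 * 365)  # minutes in a non-leap year
print(round(years, 1))  # → 114.2
```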
How much is that in football fields?
If you accept 30 years as the average lifespan of an nfl stadium, 3.8
Good catch. Approximately 9,192,631 Turkish decibels.
Fun fact: the International Bureau of Weights and Measures in Paris is the owner of a perfect 0 dB noise floor enclosed in a perfect titanium sphere (with some sheep's wool filling to avoid reflections). There is a small door on the side over which microphone capsules can be inserted for calibration.
(/joke)
So a half zoom meeting... or 1/3 Teams one.
I genuinely wish there was a cost estimation feature built into them. It doesn't even have to be remotely close to the true cost; if it's anything like the meetings I attend, there will be enough people and it will go on long enough to make up for it.
I worked as a consultant and started billing at my normal hourly rate for meetings. You would be surprised how fast the company's desire for my participation in them decreased.
Why would you do anything but that? You want to just chat with me forever the rate is the rate.
>camera pose sensitivity
>the model is basically a diva about camera positioning. move the camera 10 degrees and suddenly it thinks left is right and up is down.
Reminds me that several years ago Tesla finally had to start explicitly extracting a 3D model from the net. Similarly, I expect here that it will get pipelined: one model extracts/builds the 3D, and the other is actually the "robot" working in that 3D. Each one alone can be trained much better and more efficiently, with much better transfer and generalization, than a large monolithic model working from 2D video. In the pipeline approach, it is very easy to generate synthetic input 3D data that better covers the interesting scenario space for the "robot" model.
And, for example, you can't just, without significant training, feed the large monolithic model a lidar point cloud instead of videos. Whereas in a pipelined approach, you just switch the 3D-generating input model.
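A minimal sketch of that pipeline idea (every function body here is a made-up placeholder; the point is only that the policy stage consumes the shared 3D representation and never touches the raw sensor):

```python
import numpy as np

rng = np.random.default_rng(1)

# Stage 1a (hypothetical): "3D from video" front-end. Stands in for a
# depth / structure-from-motion model; here it just emits a random cloud.
def video_to_points(frames):
    return rng.random((100, 3))

# Stage 1b (hypothetical): lidar front-end, which already emits points.
def lidar_to_points(scan):
    return np.asarray(scan, dtype=float).reshape(-1, 3)

# Stage 2: the "robot" policy sees only the shared (N, 3) point cloud, so
# swapping the front-end requires no retraining of this stage.
def policy(points):
    centroid = points.mean(axis=0)  # toy policy: head toward the centroid
    return centroid / (np.linalg.norm(centroid) + 1e-9)

a = policy(video_to_points(None))
b = policy(lidar_to_points(rng.random((100, 3))))
```

Either front-end yields the same interface, which is the transfer argument the comment is making.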
This is interesting for generalized problems ("make me a sandwich") but not useful for most real world functions ("perform x within y space at z cost/speed"). I think the number of people on the humanoid bandwagon trying to implement generalized applications is staggering right now. The physics tells you they will never be as fast as purpose-built devices, nor as small, nor as cheap. That's not to say there's zero value there, but really we're - uh - grasping at straws...
The value is in the generalisation.
For a single example: in any factory, watch how humans are added as ad-hoc machines wherever a problem occurs. Machine N outputting faster than machine N+1 can accept? Have a human stack, and destack, the product between them. No matter the size, shape, or (within reason) the weight of the product. But most importantly: the process can begin within seconds of the problem occurring. No need for a programmer, developer, or maintenance worker to get involved. Just a clear order from the shift manager.
A general purpose robot with physical interfaces similar to a human would be very valuable for such environments. If it had the software to be as easy to instruct as a human.
As the vendor you can sell it with the promise that awesomeness is coming "just around the corner" with the next software update.
You can also seek investment without committing to an actual concrete business model.
I wonder if a generalized machine would have an advantage from scale, and then putting all the specialized stuff into software. We have seen this play out before.
analogy: a CPU is more expensive, more complicated, more energy demanding than custom made circuitry, in most cases.
Well, there’s a middle ground, kinda. Using more specialized hardware (ex: cobots) but deploy state-of-art Physical AI (ML/Computer Vision) on them. We’re building one such startup at ko-br (https://ko-br.com/) :))
Quite a few startups in your space. Many deployed with customers. Good luck finding a USP!
Very good point! This area faces a similar misalignment of goals in that it tries to be a generic fit-all solution, a problem that is rampant with today's LLMs.
We made a sandwich, but it cost 10x more than a human would and was slower. It might slowly become faster and more efficient, but by the time it gets really good at the task, the skill simply isn't transferable unless the model can genuinely make the leap into other domains the way humans naturally do.
I'm afraid this is where the barrier between general intelligence and human intelligence lies. With enough of these geospatial motor-skill databases, we might get something that mimics humans very well but still runs into problems at the edge, and this last-mile problem really is a hindrance in so many domains where we come close but never complete.
I wonder if this will change with some sort of shift in computing, as well as in how we interface with digital systems (without mouse or keyboard); then we might be able to close that 'last mile' gap.
Note that the username here is a Korean derogatory term for Chinese people.
It's an interesting comment, it has the same "compliment the OP, elaborate, raise a further question" format I've seen used by apparently LLM-generated spam accounts on HN. But, the second paragraph is so incoherently structured that I have a hard time thinking an LLM produces it.
Put tiny cams on the robot arms and let it control them. They can be flimsy, for safety. If it is sure about what is happening, say nothing; if it is 70-99% sure, have it guess what is going on; if less than 70%, have it ask what is going on.
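The comment's thresholds, written out as a toy decision rule (the 0.99 cutoff for "sure" is my reading of it, not something the comment specifies):

```python
def respond(confidence: float, guess: str) -> str:
    # Thresholds from the comment: sure -> stay silent, 70-99% -> guess aloud,
    # under 70% -> ask what is going on.
    if confidence > 0.99:
        return ""  # sure: say nothing
    if confidence >= 0.70:
        return f"I think {guess} is happening?"
    return "What is going on?"

print(respond(1.0, "pouring"))   # sure: silent
print(respond(0.85, "pouring"))  # guess
print(respond(0.4, "pouring"))   # ask
```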
I wonder how much language does this model understand. If we pan across text will it fill in sensible next word? How good will it be?
My mom said I was throwing away my life watching YouTube all day and clearly I just haven’t been watching YouTube enough. 1 million YouTube videos here I come!
[dead]
[flagged]
[flagged]
I have never seen "ngmi" before, I wonder in which subculture it is common
It's the second most common four-letter acronym in crypto hype threads right after hfsp.
The Urban Dictionary definition is hilarious, opens with "HFSP is an acronym used typically in the crypto community against non-belivers".
Hasn't defined the term yet and I know I'm in for a hell of a ride.
very popular on tech twitter. Right up there with "we're back" and "we're so back"
Saw it a lot in the Ivy League hacker subculture 15 years ago when I was there
not sure, but my college friend group uses it occasionally
> gen z douchebag
Hello there! As a fellow gen-z douchebag, the article looks authentic, albeit a bit slim on Discord screencaps. Will be fun(?) to be proven wrong though.
I don't know. I'm not the expert, but if you've ever tried to do a backflip or anything where your toes are above your head, then you'll know that spatial awareness goes well beyond vision. Or if you throw a frisbee for a dog to catch: they don't actually look at it while running; they look, predict the position, then move in. Veni, vidi, vici. So any model that "learns physics" just through vision seems flawed from the start. What's your thought there?