World Models: Will They Be the Next Big Thing in AI?
Merve Noyan of Hugging Face discusses the potential of world models in AI, highlighting their ability to compress and understand environments for applications such as autonomous driving and robotics. She also touches on vision-language-action models and their role in enabling AI to interact with the world through natural language and actions.
Key takeaways
- World models compress the world so that AI can understand environments.
- Vision-language-action models let AI take actions based on natural language.
- The trend is toward smaller, local AI models that can run on phones and tablets.
Potential applications
- More effective autonomous-driving systems with a better understanding of road conditions
- Robots able to operate in complex and unpredictable environments
- Modeling and optimizing production processes in real time
Most current implementations of world models require significant computational resources. Moving to local models on phones and tablets is an ambitious but, for now, distant goal.
Video description
This year I'm very bullish on a few things that are intertwined: the generalization of vision to the world, and taking actions. There's something called world models, and world models are a bit like a compression of the world. That's roughly what Yann LeCun says too: you cannot really learn a world through text tokens. You can see this trend with Genie 3 by DeepMind. If you have seen it, it takes in an action input and the current scene, and then it outputs the continuation of the 3D world you are in. There's also World Labs, the company by Fei-Fei Li; you have probably heard of it. She is trying to solve a similar problem, and I think Yann LeCun's new lab is doing a similar thing as well. Basically, they want to compress the world and have models model the world in a spatial way, with awareness of the surroundings, and that unlocks a lot of things.

First and foremost, one of my favorite parts of AI is that I really want AI to do very boring things out in the world, not creative things. It's a bit of a weird take, maybe, but if you want autonomous driving, or robotics that is more out in the world, you need to be able to compress noisy environments. Before machine learning took over, we had physics simulators and the like, but they are too isolated. World models, in a way, let you compress the world: because you have a ton of images, 3D data, and so on, you can compress all of the observations (that's what the state you are in is called, observations) in order to take an action. Observations can include sensor data, although mostly people are using images these days. You actually need to be able to model the world very well, and I am very bullish on that for this year, seeing the work of Google DeepMind, Fei-Fei Li's company, and Yann LeCun's company. Last year, for instance, they released V-JEPA 2, I think, a model that does this through videos. NVIDIA does a similar thing: they have a platform called Cosmos, I think, and world models to go with it.

At Hugging Face we also do something similar. It's not a world model per se, but a vision-language-action model. Rather than an action being, say, your computer's keyboard keys going up and down, these vision-language-action models take actions from natural language. Think of Reachy Mini, for instance: it sees something, you talk to it in natural language, you ask it to do something, and then it takes an action. These are extensions of vision-language models, by the way; the difference is just that they take actions.
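To make the two interfaces she contrasts concrete, here is a minimal Python sketch. All names in it (`WorldModel`, `VisionLanguageActionModel`, `imagine_rollout`) and the placeholder types are hypothetical illustrations, not the API of Genie 3, Cosmos, or any Hugging Face library: a world model rolls the scene forward given an action, while a vision-language-action model maps what it sees plus a natural-language request to an action.

```python
from typing import Protocol

# Hypothetical placeholder types; real systems use image tensors,
# latent states, and structured action spaces instead.
Observation = bytes   # e.g. an encoded camera frame
Action = str          # e.g. "move forward" or a joint command

class WorldModel(Protocol):
    """Compresses the world: predicts the next observation, i.e. 'the
    continuation of the 3D world', from the current scene and an action."""
    def step(self, obs: Observation, action: Action) -> Observation:
        ...

class VisionLanguageActionModel(Protocol):
    """Maps an observation plus a natural-language instruction to an action."""
    def act(self, obs: Observation, instruction: str) -> Action:
        ...

def imagine_rollout(wm: WorldModel, vla: VisionLanguageActionModel,
                    obs: Observation, instruction: str,
                    horizon: int) -> list[Action]:
    # Plan in imagination: the VLA picks actions, and the world model
    # simulates their consequences instead of executing them for real.
    actions: list[Action] = []
    for _ in range(horizon):
        action = vla.act(obs, instruction)
        obs = wm.step(obs, action)
        actions.append(action)
    return actions
```

The point of the composition is the one made above: a good world model lets you rehearse actions in noisy environments (driving, robotics) without an isolated hand-built physics simulator.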
Last year, for one of the first implementations of this, I remember a lab reached out to us. They were using one of the very early vision-language models, called PaliGemma. PaliGemma had been released by Google Brain, back when it was Google Brain and not Google DeepMind, and what they built was basically PaliGemma extended for action. So vision-language models really do extend to robotics and such, and I was very much looking forward to that. When you think of action, it also goes into agency. Once you enable very broad generalization into vision-language, or into the world, you get a lot of possibilities we don't know about yet. We were predicting last year that these things would happen, and they actually happened.

>> Very fast, actually.

>> Yeah, it all happened very fast. So overall I'm very bullish on those.

The second thing I'm very bullish about is very agentic reasoning models going onto phones and interacting with apps. If you had recorded this video back in December or so, it would have made more sense; I already had this prediction from last year, from another podcast with a friend of mine, but by now it has gone very mainstream with OpenClaw and the like. I also predicted that we would have models that operate over screenshots on the phone: the model looks at a screenshot and decides where to click. Last year an 8B model like that was released by a Chinese lab; I think it was MiniCPM, but I'm not very sure. It was obviously sending requests to a server, though, because it's very hard to run 8B models reliably on our phones. This year I would expect these models to go local. Models could get smaller, but we could also see some hardware leaps, like phones with more AI-oriented accelerators inside. So I would like the models to go local not only on our computers but also on phones and tablets. These are, I think, my predictions for this year.
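As a rough illustration of the screenshot-driven loop she describes, here is a minimal Python sketch. Every name in it (`capture_screenshot`, `tap`, `ScreenModel`) is a hypothetical stand-in, not a real platform or library API.

```python
from dataclasses import dataclass

@dataclass
class ClickTarget:
    x: int
    y: int
    done: bool  # the model signals that the task is finished

def capture_screenshot() -> bytes:
    """Hypothetical stand-in for the platform's screen-capture API."""
    raise NotImplementedError

def tap(x: int, y: int) -> None:
    """Hypothetical stand-in for the platform's UI-automation tap call."""
    raise NotImplementedError

class ScreenModel:
    """Placeholder for a small vision-language model that grounds clicks."""
    def predict(self, screenshot: bytes, task: str) -> ClickTarget:
        raise NotImplementedError

def run_screen_agent(model: ScreenModel, task: str, max_steps: int = 20) -> None:
    # Observe-decide-act loop: look at the screen, pick a point, tap it.
    for _ in range(max_steps):
        screenshot = capture_screenshot()
        target = model.predict(screenshot, task)
        if target.done:          # model decided the task is complete
            break
        tap(target.x, target.y)
```

The open question raised above is where the `predict` step runs: today it typically round-trips to a server, and the prediction is that smaller models or better on-device accelerators will let the whole loop run locally on a phone or tablet.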




