YouTube catalog
How AI Vision Evolved
🛠 How-to
en

The Evolution of AI Vision: from Vision Transformers to large models and their saturation

Hugging Face · 8 days ago · Apr 6, 2026 · Impact 5/10
AI Analysis

Hugging Face reviewed the trajectory of AI vision, from Vision Transformers to LLaVA. Architectural breakthroughs have slowed, but the focus has shifted to alignment and to improving existing models, which opens new business opportunities in tasks that require joint understanding of images and text.

Key points

  • The shift from CNNs to Vision Transformers accelerated scaling and transfer learning
  • Models such as LLaVA became key to multimodal AI
  • The market has reached a plateau in architectural innovation, but new optimization opportunities have emerged
Opportunities

  • Optimizing existing models for specific business tasks
  • Integrating multimodal AI into existing workflows
  • Using alignment techniques to improve model performance

Nuances

Saturation in architectural innovation does not mean development has stopped. It is a signal to move from experimentation to the practical application of existing models and their optimization for specific business tasks.

Video description

I think it all happened very fast, especially recently. I remember when I was first following vision, and I think it basically all started with the Vision Transformers. Before that we were working with convolutional neural networks, and with Vision Transformers I felt like they were more scalable, and there's also the fact that you can do a lot of transfer learning with them. So with Vision Transformers, vision had its own birth moment, >> right? >> And then we were able to transfer to many tasks, and people started combining different architectures and adapting Vision Transformers to many different tasks. So object detection, image segmentation, and image classification especially had a huge leap, especially for the labs that were training things. And then I think vision had its own GPT moment with LLaVA. >> So LLaVA, if you don't know, is an architecture. Before LLaVA there were many other models like it. LLaVA is a model that can take in an image and a text prompt and output text, >> so it's like an LLM, but one that can take images as input, basically. We had many models before, like Flamingo, you name it, but they were mostly closed. And then there was, for instance, IDEFICS by Hugging Face: the first IDEFICS was the first open implementation of Flamingo. Then LLaVA came, and LLaVA is way more scalable and easier to train, because to train a LLaVA (sorry, I think I went a bit low level, but bear with me) you just take an image encoder from a model like CLIP, plus an LLM, and then train a projection layer in between. So you're basically trying to connect the image to text in a way, and then you do another round of training. It's a bit like how you would pre-train an LLM and then do instruction fine-tuning, but with image and text pairs, when you think about it.
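The LLaVA-style recipe described above (frozen CLIP encoder, frozen LLM, a small trained projection in between) can be sketched at the shape level. This is a minimal illustration, not the actual implementation: the dimensions are illustrative assumptions (CLIP ViT-L patch embeddings of size 768, an LLM hidden size of 4096), random arrays stand in for the frozen encoders, and the projection is shown as a single linear map even though later LLaVA versions use a small MLP.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumptions, not exact model configs):
# CLIP-like patch embedding size vs. LLM token embedding size.
VISION_DIM, LLM_DIM = 768, 4096
NUM_PATCHES, NUM_TEXT_TOKENS = 576, 32

# Stand-ins for the frozen pieces: patch embeddings from the image
# encoder, and token embeddings from the LLM's embedding table.
patch_embeddings = rng.normal(size=(NUM_PATCHES, VISION_DIM))
text_embeddings = rng.normal(size=(NUM_TEXT_TOKENS, LLM_DIM))

# The only newly trained piece in the first stage: a projection that
# maps vision features into the LLM's embedding space.
W_proj = rng.normal(size=(VISION_DIM, LLM_DIM)) * 0.02
projected_patches = patch_embeddings @ W_proj  # (576, 4096)

# The LLM then consumes image "tokens" and text tokens as one sequence.
llm_input = np.concatenate([projected_patches, text_embeddings], axis=0)
print(llm_input.shape)  # (608, 4096)
```

Only `W_proj` would receive gradients in the alignment stage; the later instruction-tuning round then unfreezes more of the LLM.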
And everybody started training LLaVA. And from there people started doing a lot of improvements on the LLaVA architecture, you know, adapting it to video by feeding multiple frames, or to interleaved image and text. You can say, okay, you see this image, then you insert the image inside, then you see another image, and so on; this is called interleaved image and text. And it all came to the point where we are a bit saturated at the moment, in the sense that we don't really do a lot of architectural improvements anymore. We came to the Qwen models by Alibaba, which are very capable models. These days everybody is just continuously training Qwen models or building stuff on top of Qwen models. So we don't really have architectural improvements anymore, also because many of the labs are releasing models in different sizes, you know, from 1B up to super big mixture-of-experts models. I feel like I am not as excited to read research papers anymore. [laughter] Maybe one thing that is exciting is that we have more alignment techniques, like reinforcement learning from human feedback and all of that stuff, coming up for multimodal things. Those papers are more exciting, I should admit. But overall we have reached a point in vision where we are just improving: we have established benchmarks, just like LLMs, and we are trying to improve document understanding, video understanding, agentic vision, reasoning. And, you know, there was this recent Kimi model that was able to look at a few images of a web page and generate the website or brand or whatever. >> Yeah, that was >> like even just writing the website code from looking at a few images. >> It's an open model, available on inference providers, by the way. >> Yeah, correct.
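The interleaved image-and-text idea mentioned above can be made concrete as a data structure. This is a schematic example only: the field names mirror common VLM chat-template layouts, but exact formats vary by model, and the file names are made up.

```python
# One schematic interleaved sample: text and images alternate in order,
# rather than all images coming first (hypothetical field names/files).
sample = [
    {"type": "text", "text": "You see this image:"},
    {"type": "image", "image": "photo_1.jpg"},
    {"type": "text", "text": "and then another one:"},
    {"type": "image", "image": "photo_2.jpg"},
    {"type": "text", "text": "What changed between them?"},
]

# A VLM processor would expand each image entry into patch tokens and
# splice them between the surrounding text tokens, preserving order.
modalities = [part["type"] for part in sample]
print(modalities)  # ['text', 'image', 'text', 'image', 'text']
```

The point is that order is preserved end to end, so the model can ground "this image" and "another one" to the right picture.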
So yeah, I think vision just evolved to a point where many things are tied to VLMs now, right? >> And there are not a lot of big labs doing vanilla vision anymore. And when I say vanilla vision, I mean everything except vision language models, basically. >> Okay. This also includes models like the Segment Anything Model by Meta, especially the latest one: even though it can take a text prompt, it's not an established vision language model, as it cannot take long prompts and do instruction following and stuff. Sorry, I kind of yap a lot. >> That's great. I think I could yap for >> You're actually giving us a master class on vision evolution right now. So basically, to sum up: we used to have CNNs, and then we had Vision Transformers, which was a birth moment >> for image understanding tasks like segmentation, detection, and image classification; then we had the GPT moment with LLaVA; and now we are a bit more saturated, with large models and large pre-training doing everything.