YouTube catalog
Intro to Mixture of Experts | Aritra Roy Gosthipaty | HF Podcast #2
🛠 How-to
en

A Hugging Face expert explains what Mixture of Experts (MoE) is

Hugging Face · about 23 hours ago · 13 Apr 2026 · Impact 6/10
AI Analysis

Hugging Face developer advocate Aritra Roy Gosthipaty explains what Mixture of Experts (MoE) models are, highlighting their efficiency and the growing trend of their adoption in the LLM world. He discusses the advantages of MoEs, such as faster inference and lower compute costs, and also covers their limitations and potential future trends.

Key takeaways

  • Mixture of Experts (MoE) models selectively activate subsets of parameters, improving efficiency and inference speed compared to dense architectures.
  • Models such as DeepSeek and Mistral popularized MoEs, demonstrating their ability to achieve high performance with relatively small activation spaces.
  • While MoEs are becoming increasingly popular for foundation models, smaller dense models remain relevant for edge devices and applications that require fast processing.
Opportunities

  • Reduce compute costs for large language models by 50%+
  • Increase inference speed to improve user experience
  • Build more capable models without a significant increase in cost

Nuances

Most companies focus on building models, but the infrastructure needed to use them efficiently is often underestimated. Without proper infrastructure, even the most capable MoEs cannot reach their potential.

Video description

Good morning everyone. How's it going? Today we have Aritra joining us. He's from the developer advocacy team, a developer advocate here at Hugging Face, and he works very closely with Transformers. He's a machine learning expert and he's here to share some secrets about MoEs and life at Hugging Face. So, welcome Aritra. Thank you very much for joining us. Namaskar. That is how I say hello to everybody in my native tongue. I am Aritra, a developer advocate at Hugging Face, as Alejandro has already mentioned, and I work with the Transformers team. I love working with smarter people, hence I am here in this room and on the Hugging Face team. If you have questions, I'd love to answer them, because my entire journey is too long to recount here. Awesome. Yeah, sure. Tell me a little bit about how you got to Hugging Face. How did you first hear about Hugging Face and how did you start working with it? Okay, so this was around 2021-ish, when I got to know what Transformers as an architecture did, because I was way late to the AI space. It was supposed to be 2017, but for me it was 2021. That's when I understood what Transformers is. I wanted to implement it, and then I stumbled upon this repository that everybody globally uses at this point: the Hugging Face/Transformers repository. Back then I was also working a little bit with TensorFlow and Keras, and I contributed a bunch of TensorFlow models to Hugging Face Transformers. That is when I got to know a lot of contributors and a lot of the people who were maintaining Transformers at the time, and that was the ladder to me first becoming a Hugging Face fellow. So, I am one of those people who came from the Hugging Face fellowship.
And then later I asked whether I could join the team, because I loved what they were doing and how easy it was to communicate, and here I am. So, you literally just asked them, "Hey, can I join Hugging Face?" And probably they found me good enough, you know. Awesome. Well, that's a great story. I actually didn't know that. So cool. When was that? I've been at Hugging Face for around a year and two months. Awesome. Okay. Oh, so it was already huge and you were like, "Hey, can I join?" And they said, "Yeah, sure, please join." Those were the words: "Can I join?" There you go. Awesome. So cool. And what are you currently working on at Hugging Face? So, at this point the Transformers team is doing a huge load of things that make using Transformers better. What I try to do in my role is advocate for what the Transformers team is doing. And I've seen that MoEs, mixtures of experts, are a big thing. They are hyped, and for the very right reasons. The Transformers team is incorporating a lot of things to make MoEs first-class citizens of Hugging Face in general: the Hub, the repositories that we have, and so on. So right now I am preparing a blog post with all of my colleagues, and when I say all of my colleagues, I mean all of my colleagues on the team, to chart out the entire thing so that it's easier for the community to interact with MoEs and the Transformers repository. That's great. And talking about MoEs in particular, now that you're writing these blog posts, maybe you can give us a quick sneak peek or a quick introduction to MoEs and tell us what they are and why they're so important. Okay. So, a mixture of experts is a term that has been around for quite some time. If I'm not wrong, it was Shazeer et al. who made it huge back in 2018-19. Please do write in the comments if I'm wrong. So, that's one.
What that means is that if you have a dense architecture, you can sparsify it. You now have subsets of parameters out of the dense architecture, and you can sparsely activate them, which makes it very interesting, because now the entire architecture does not have to be activated to generate tokens. You just have certain experts which activate. This is huge for efficiency, huge for inference speed, and so on. What I'm also trying to get to is that this is a whole new paradigm. With Transformers, there are a bunch of downstream applications that use Transformers as a backend: vLLM uses Transformers as a backend, SGLang, llama.cpp, and so on. If we make MoEs first-class citizens of Transformers, all the downstream applications can use our backend, which is already there and already good, and make sense of all the efficiencies that we have in Transformers. So it lifts the whole ecosystem at once. Okay. And when was the first time that MoEs started to become this mainstream and this popular in the industry? For me, it was DeepSeek, which was a "Hey, we need to watch out, what is happening here?" moment. That is also how I see the first time the gap between the closed models and the open models really came down, and people took notice of this entire new paradigm, and things just blew up from there. So, that was with DeepSeek. And I'm probably also thinking about Mistral, right? Mhm. That was probably also one of the first times I heard MoEs being talked about everywhere, right? Yeah. What was different that they tried at the time? Why did it become so popular all of a sudden? If you talk about Mistral, that was the first time people really took notice of MoEs in general.
It was the sheer number of parameters that Mistral had, how easy it was to run inference, and the efficiencies you got from a model that was big, but not that big. When we talk about mixture of experts, we essentially talk about how huge the model is, but also about how few activations there are in MoEs. So, if you have, say, 1 trillion parameters, you'll probably activate something like 20% every time a token is generated. That kind of blew people's minds: you need a very small activation space rather than a huge model. Right. And that was probably the tipping point, the shift. Okay. What are the most popular MoE models that we currently have? Last week we had Qwen, we had MiniMax, we had Z.ai, we had Moonshot, and everybody was giving us very, very good MoE models. So, there's a bunch. I just listed what was on the top of my head, because I am currently writing about it, but there are a bunch of good MoE models open on the Hub. People can just go in and use them, or use our inference providers to get generations. And do you think this is it, at least for a while, the MoEs? Or are we looking at a new architecture that is maybe gaining traction or becoming more popular in recent times? Okay, so this is a hot take from me. It's not the final word, but it's definitely a refreshing scene that we are seeing inside the LLM world. MoEs are very difficult to train, is what we understand. We also recently collaborated with Unsloth, who has given us great kernels, great efficiencies with training. So, that's good. But I also see that a lot of people in their open works not only talk about how good the model is, but also about how it was trained, which is very important at this point if you want to really train huge models. They are huge models anyway.
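The activation arithmetic Aritra describes can be sketched with a few lines. The 1-trillion-parameter and 20% figures below come from the conversation and are purely illustrative, not the numbers of any particular model:

```python
def moe_active_params(total_params: int, active_fraction: float) -> float:
    """Parameters actually exercised per generated token in a sparse MoE."""
    return total_params * active_fraction

# Illustrative: a 1T-parameter MoE activating ~20% of its weights per token.
total = 1_000_000_000_000
active = moe_active_params(total, 0.20)
print(f"{active / 1e9:.0f}B parameters active out of {total / 1e12:.0f}T total")
# → 200B parameters active out of 1T total
```

The point of the sketch: per-token compute scales with the active fraction, which is why a trillion-parameter MoE can still be cheap to run per token.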
So, it's definitely here to stay for a bit. Awesome. Good to hear that. How can somebody who's new to MoEs start learning more about them? What would be the right path to take if you're just a beginner? If you're just a beginner, I would suggest there are a bunch of blog posts. I won't bias you toward our Hugging Face blog posts; there are a bunch of good writers who write about MoEs. There are a bunch of papers out there which tell you what MoEs are and how to use them. But with my blog post, see what I did there? With my blog post, what I'm trying to do is not only talk about how Transformers and MoEs work hand in hand, but also give a pretty good foundation if somebody is absolutely new and wants to walk the journey that I am currently on. Awesome. That's also the pretty cool thing about the community we have right here: they can literally reach out to you and ask you questions about your blog post and about their learning path in general. Another question I had about MoEs: does that basically mean that dense models are gone? We're not going to be seeing more of them? Not really. That's a good question, because I am startled at this point in time. I don't see dense models going away. MoEs are just so big they can model your data set, which is also vast, very quickly, or so we know from open reports. But when you have something which has to work on the edge, I think smaller dense models, distilled from MoEs, are still the way to go. So dense models are definitely not going anywhere. It's just that if you are going for a foundation model, people usually use MoEs rather than dense models, because it's just more efficient that way. A bunch of open models have already proved it, and that is where the research is at this point in time.
And what would be the current use case for dense models then? I think edge and small dense models are really big. We also saw, I think it was yesterday or the day before, Cohere release Tiny Aya, which was very small and dense. So on edge devices, if you want to run very quick models, dense is the way to go. Right. And just to make sure everyone is up to speed, can you talk very quickly about the difference between dense models and MoE models, and what that actually means? Okay, so a dense model is an architecture where, strictly in the context of large language models, if you generate a token, every one of those parameters has to be activated. But when you talk about mixture of experts, the feed-forward networks are split into subsets of smaller feed-forward networks, and for each token being generated, only something like 20% of the experts are activated. So 80% of our compute budget is saved. Okay. So basically that means we have faster models that are more affordable, and even more powerful than regular dense models when they are super big. Yeah. Okay. So you told me that dense models are going to be super usable for edge and smaller models. But that kind of makes it sound like scaling for dense models is done. Is that true? At this point in time, it feels like it. If you consider a big MoE, the number of parameters that it has, if you had to directly translate that into a dense model, you would probably need like X times more compute budget just to train it. So, to reiterate: if you are talking about foundational large language models, MoEs are the way to go, to pre-train, to fine-tune, and so on.
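The dense-versus-MoE distinction above can be made concrete with a toy sketch: a router scores a pool of expert feed-forward functions and only the top-k run for each token, so compute scales with k rather than with the number of experts. Everything here — the expert count, the dimensions, the dot-product router — is illustrative and not the Transformers implementation:

```python
import math
import random

random.seed(0)

DIM, NUM_EXPERTS, TOP_K = 4, 8, 2

# Each "expert" is a tiny feed-forward function; here just a random
# linear map so the sketch stays dependency-free.
experts = [
    [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(DIM)]
    for _ in range(NUM_EXPERTS)
]
# The router assigns one score per expert via a dot product with the token.
router = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_EXPERTS)]

def matvec(w, x):
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(v - m) for v in xs]
    s = sum(exps)
    return [e / s for e in exps]

def moe_layer(x):
    """Route token x to only the TOP_K highest-scoring experts."""
    scores = [sum(wi * xi for wi, xi in zip(row, x)) for row in router]
    top = sorted(range(NUM_EXPERTS), key=lambda i: scores[i], reverse=True)[:TOP_K]
    gate = softmax([scores[i] for i in top])  # renormalize over chosen experts
    out = [0.0] * DIM
    for g, i in zip(gate, top):
        for d, v in enumerate(matvec(experts[i], x)):
            out[d] += g * v  # gated sum of the k expert outputs
    return out, top  # only k of NUM_EXPERTS were evaluated

y, used = moe_layer([1.0, -0.5, 0.3, 0.7])
print(f"experts used: {used} ({TOP_K} of {NUM_EXPERTS})")
```

With TOP_K = 2 of 8 experts, only a quarter of the expert parameters are touched per token, which is the "80% of the compute budget saved" idea from the conversation in miniature.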
But if you want to distill them into dense models, I think dense models are here to stay as well. Having said that, I also think that the infrastructure for training a dense model is so evolved, while it will take a lot of time to get better at training MoEs. All right. So even for the person who is watching our video, right? If they want to train a foundational model, they have compute, and they don't really know where to start, I would suggest going dense first, because there are a bunch of resources to get you started. Right. But if you want to take your time, build a foundational model or even distill from it, and do it efficiently, MoEs are the way to go. There you go. Do you have a favorite moment in MoE history? As I said, DeepSeek. DeepSeek pushed me to learn MoEs from the ground up, and so I did. Right. Awesome. Well, there you go for MoEs. That's very interesting. How do you see the future of models involving MoEs? What do you think is coming up in the LLM world? Do you see any trends? The trend at this point is an uptick in MoEs, that's for sure. Most of the big research labs love MoEs, and they are releasing them every week now. Having said that, I also see a trend toward smaller models, which work very well on edge devices, and toward deploying models that give very good throughput very quickly. That's one. I also have this idea that if MoEs are here to stay, the infra has to go hand in hand. So if you are watching out for job applications or things like that, I think it's better to understand MoEs and also infra, because people are definitely going to need more infra engineers to scale this entire topic up. Right. How can someone specialize in infra for this, though? I think you need to understand the architectures very well, and then we have, see what I did there again?
We have a blog post, which is really more like a book, that talks about scaling LLMs, and it covers a bunch of topics. Oh yeah, I've seen that one. It's huge. Yeah. It is. And it's open access, so anybody can read it, and it's really well made. It's not the only place to be, but it definitely takes you from zero to, like, 10, and if you want to reach 50, it will guide you. I sometimes have this question: when you have new models coming out every few months, or every few weeks really, it kind of sounds like the training and the architecture are very similar behind all of these models. In your opinion, or from your expert insight, what makes one model better than another? What do people do differently during training, for example? That's a good question. From the papers I read and whatever models I have access to with proper open reports, I think the moat is the data set. The better the data, and this is pretty obvious to anybody in or out of the ML field, the better the models. There is also this notion of data engines, where you build your data set with another model. So there's this question of whether the model I'm training is modeling some other model's engine. The idea of synthetic data sets is growing, and so is the question of how you can leverage them in your foundational models. That, I think, builds a big moat between which models are good and which are not. That's actually very interesting, because I remember reading a research paper at some point, I think it was last year, essentially saying that the more we use synthetic data to train our models, the more the models converge into a less creative model, and you start getting diminishing returns the more synthetic data you put into them. Is that true, or have we gone past that? How does that go? Okay.
So I think this question is more suitable for our science team, because they do this on a day-to-day basis, but, and this is going on a tangent, I have this idea: say you want a foundational model which is text-only. Text is something humans have created. It has rules, it has grammar. What I mean by that is that you know this is a space which can be contained inside a model. But once you talk about images, about world models, the things we see, we capture them with a camera, which is lossy. Then we try to make sense of whatever we see around us through a model, so the model takes our lossy representation of what we see; it's an abstraction of things that are actually happening. Having said that, synthetic data is again an abstraction of what actual text would be, which leads to diminishing returns when training on it. But we still use it. We do, because synthetic data is very constrained. You can build instructions out of it. You can do a bunch of things with it that internet-scale data does not give you; even to grab or scrape that data, you need to filter it, you need to prep the entire data set. Which, again, has a beautiful blog post by our science team, the FineWeb blog post. It tells you what they did to scrape the internet, and the data is also hosted on the Hub. I'll definitely bring them over and ask them lots of questions about this, for sure. Okay, so do you currently think that MoEs have a weak spot? I do. MoEs are huge. Being huge also means they cannot be run locally. The parameters are so big they need to rest somewhere to then get activated or not. So even if only 20% of the activations are working for an inference, you need 100% of the model sitting on your device. Right. Which is insane, because you would not have such huge GPUs running locally all the time. Right.
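Aritra's weak-spot point, that all parameters must be resident even when only a fraction fire per token, is easy to quantify. A rough sketch, assuming 16-bit weights (2 bytes per parameter) and ignoring KV cache and activation memory; the 1T / 20% numbers are the illustrative figures from the conversation:

```python
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for model weights alone. Every parameter
    must be loaded, even if only a fraction is activated per token."""
    return num_params * bytes_per_param / 1e9

# A hypothetical 1T-parameter MoE at ~20% activation: sparsity saves
# per-token compute, but not resident memory.
total_params = 1e12
print(f"resident weights: {weight_memory_gb(total_params):.0f} GB")
print(f"weights actually used per token: {weight_memory_gb(0.2 * total_params):.0f} GB")
# → resident weights: 2000 GB
# → weights actually used per token: 400 GB
```

This is why, as the conversation notes, big MoEs live on server GPUs while distilled dense models serve the edge: sparsity cuts FLOPs per token, not the memory footprint.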
So maybe does that mean we may start seeing smaller infrastructures, where you have an actual router that routes between models, rather than an MoE LLM, inside local or small machines like a computer or a phone? That is exactly why, if you have an MoE, people distill it into smaller dense models which can then run on a local system. There you go. Okay. Also, have you started using these coding tools? I have. And I have a very spicy take on using them. It's not only agents; I would also go back to the usage of LLMs in general, because they are the backbone of agents. I take pride in certain things that I do, like writing, reading, and also researching. I see that using LLMs as an assistant can be very blurry, in the sense that you tend to lean on the assistant so much that your own creativity is burdened. You become, or should I say I become, a little lazy about doing certain things which I should have done. That directly translates to me coding and using assistants. There are certain things that I would never want an assistant to code. For example, for this blog post, I did a very thorough PR in which we gained a lot of efficiency in loading MoEs inside Transformers. I really did that all myself. Then, when I had a benchmark to run, I ran the benchmark and the code was mine. But the results were just a CSV, and I had to portray that CSV as a plot in the blog post, because readers might not like a CSV in a blog post. So I just dumped it into Claude Code and said, "Hey, give me a plot." And it did quite well. So the entire essence of what I'm trying to say is that you as a person need to understand where you want to draw the line. And once you draw the line, you have to be very careful to notice that this line will always exist in your mind. Once you draw the line and you cross it and use the tool just once, the line shifts and becomes blurry, is what I think.
So I will use agents, I will use LLMs, but for very specific tasks which I want them to do, and I'll be happy with the productivity I get. Right. You mentioned that leaning on an LLM or these kinds of agents for tasks you would otherwise be able to do yourself: do you think that can atrophy your creativity, or even your skills? It's absolutely my opinion that it does. Having said that, it also means that the productivity one gets with agents versus without is tenfold. So if you get productive, good on you. But if you get productive and you think that every other task can be done with it, I don't think that's true. Not every task should be done by agents. There have to be lines, because at some point you would say, "Hey, it's just me. I don't even know this for sure. Let me research it a bit." And that research again goes to another agent, and so on. It's a never-ending loop. I mean, this is of course changing super fast, right? And I have seen my own opinion shift very quickly about this. I remember people would ask me this question, say, six months ago, and I would say, "Never commit code that you did not read. That's impossible. Do not do it. Even if it's just for a side project, you're going to end up with a lot of technical debt and your project is going to go bankrupt immediately." Now I'm more and more starting to use these tools, and yeah, I completely see what you mean: you lean on these LLMs and these agents, and in the end, you stop exercising the same mental muscles for these particular skills, for coding. And eventually, if you don't exercise those muscles, they will just weaken, and you won't be able to carry as much weight as you used to.
However, do you think that's a bad thing, or do you think we may just not need coding anymore, so it's not that much of a problem? My take would be: once we had cars, we did not commute on our legs that often. But you eventually got into running, or you walk a little, to stay healthy. So if you have agents and you lean on them too much, you will need certain other activities to exercise your brain and still be healthy, is what I think. I would also say this to people who don't know coding at all, or who are starting with coding, or really with any task that agents can do, coding being one of them, so we are focusing on coding: if you're starting, I would rather not have you talk to LLMs and agents that often. They can really grow your understanding once you are at a certain level, but it's very difficult to pinpoint when you should use LLMs or agents, because every person's learning capabilities are different. So it depends on the individual when they want to use agents and when they don't. But be very sure, be very certain, about whether you are leaning on them so much that you are losing your creativity, because I very intensely believe that everyone has that little voice in their head saying, "I should not have done that. This could have been me doing it." Right. Yeah, and I completely understand what you mean. I sometimes think to myself that I wish I had become better at coding when I could. I don't see myself, for example, learning the intricacies of some new library or some new language now. Right now, I could of course spend the time to learn it, but it does not really feel like it's going to have much of an impact if I do, right? Makes sense, yeah. But the point that you bring up for beginners is very interesting and very important.
I'm not sure what I would be doing if I were a beginner at this point. Yep. Do you have any other advice for beginners who want to get into coding, and also into building AI applications or even training these models? I think the mindset of a beginner should change from "hey, everybody's using agents and they're way ahead of me" to being okay with where you are, because the agents are not going anywhere. If you take your time, fail a lot, and then succeed a little bit, that adrenaline, that positivity in your mind, is going to take you places. Having said that, even I don't properly know what I would do if I were a beginner at this moment, with Codex and Claude Code and ChatGPT and so on, models helping me out with my education, with my coding skills, and so on. Would I not have used them? I would have. But it's very easy to fall into the trap of thinking they're making you better, where you're actually getting things done quickly, but are you actually getting things done? That is a question you have to constantly ask yourself. And you don't think that you can get this mental stimulation from just creating the architecture for these applications? I mean, I don't really know if you need to know the architecture, or if you can just ask it, "Hey, create this app," and it will create it for you. Yeah, I think with every generation that you get, generation as in token generation from LLMs, or agents doing stuff, you have to have the mentality that, hey, this might be wrong. My issue with these agents is that the shift in people's mental model goes from "hey, this might be wrong" to "this is perfect, let me do something else on top of it." So, if you read through the generations with the notion that this may be wrong, that I need to recheck it, or that I need to do a bunch of other research myself to make sure I thoroughly understand it, good for you. And that is how a beginner should work with agents.
But if you are building AI products and you want to get things going very quickly, with good formatting and so on, be my guest, use agents all you want. There you go. Okay, yeah, I think I can agree with that. An interesting question now, also related to coding and agents: let's try to give a prediction of what coding is going to look like, not even in five years, just in a year. What do you think coding is going to look like in a year? Building applications, building even models. Coding, to me, seems to be becoming prompting very thoroughly. It also means that people would really need to know the platform or framework. So, say, for example, you work with PyTorch, right? If you don't know the nitty-gritty details of PyTorch, you might not be able to prompt a model to build something you think is the best. The gap between not knowing and prompting versus knowing and prompting is going to be the differentiator in coding good models, or in making models code better. Yeah, I can completely agree with that. To me, it's just crazy how some people can build so fast with these agents, while other people are like, okay, I built this app, and they still built it much faster than if they had not used an agent, but the other person in the same time built like ten apps, and they were crazy good. It probably comes down to knowledge of the architecture and knowledge of what you're actually prompting, right? Yep. Okay, that makes sense. Okay, let's very quickly talk about one last question, about your GPT moments. What would you say are your GPT moments? You already mentioned DeepSeek, but in general, throughout your life, throughout your career, what would you say are those, say, three to five GPT moments that changed the way you see things? I mean, ChatGPT is probably one, I don't know. Yep. ChatGPT is definitely up there.
GPT moments, I'm in a spot. ChatGPT was definitely up there. You could not only type things and expect a completion, but the completions were damn good. Having said that, I don't really know. I think ChatGPT is one. The second best would be DeepSeek, you mentioned. DeepSeek is definitely up there, but it's not as powerful a moment to me, because ChatGPT was something else. DeepSeek was just the closing of the gap between closed and open. So, that's good. I'd have to really think it through. To me, you know what, and it's probably not a GPT moment for most people, because I was not part of the AI world at the time, I was just building websites, but AlphaZero was crazy for me. I don't know if it was that big in the AI ecosystem. Was it at the time? Because I remember I was actually playing chess at the time, I was training. And it was crazy how, all of a sudden, everything I thought was the answer to the problems changed. I don't know if you've trained at chess before, but you work through these puzzles, and at some point you see the solution to the puzzle, and the solution was computed by Stockfish, and that was the solution, the right answer, right? And all of a sudden, you have this AlphaZero that came over and just destroyed Stockfish. I was like, "Whoa." Stockfish is the engine that... Yeah, Stockfish is the engine that used to be the standard for solving chess problems, and, I don't know if professionals saw it that way, but when I was training at chess, I saw it as the right answer, not just the best answer we could find so far: the right answer, period. To me, that was one of the GPT moments. But let me ask you this. Given that there are models that perform way better than chess players, would you advise beginners who are learning chess or training for chess to stop training? Yeah, for sure not, but I mean, I think it's probably a bit different.
I mean, I don't play chess anymore. I quit because it was taking too much time. But you quit because the models were too good, no? [laughter] No, but in the end, you're playing against yourself with the puzzles, right? But yeah, it was definitely a huge thing for me to realize that we did not have the right solution, just one solution. Yeah, that would be one of mine. But there you go. Okay. Well, thank you very much, Aritra. This was pretty fun. This was pretty fun, yeah. Yeah, I actually learned a lot about LLMs, thanks to you. Thanks a lot, and I hope that our viewers did, too. And thank you very much for your time. Thanks for having me here. Hopefully see you some other time as well, probably remotely, but there you go. See you. Thanks a lot. Have a good one.