DeepSeek unveils a revolutionary Engram architecture for LLMs
DeepSeek has introduced Engram, a new transformer component that provides conditional memory via scalable lookup, letting the model quickly retrieve static language patterns and cut redundant computation. The approach complements the mixture-of-experts mechanism by allocating part of the sparse budget to a memory pathway, improving efficiency and results on factual-knowledge tasks. Experiments show loss reductions of up to 0.8% and the ability to scale with minimal impact on inference speed.
Key takeaways
- Engram adds a conditional-memory block to the transformer for fast retrieval of multi-token patterns.
- It uses hash lookup and contextual gating to integrate static knowledge without extra per-token compute.
- Allocating 20-25% of the sparse budget to Engram reduces loss by ~0.8% in a 110B model.
- Scaling the Engram table to 13B additional parameters yields linear improvement, especially on math and algorithmic tasks.
- Inference overhead is low (~2-3%) thanks to deterministic hashing and asynchronous prefetching.
🟢 Opportunities: integrate Engram-optimized models into products that need fast access to factual knowledge (assistant bots, legal analysis, financial reporting). 🔴 Threats: dependence on a single architecture vendor increases the risk of vendor lock-in and may require new tools for debugging and monitoring the memory.
While the emphasis is on efficiency, the authors do not show how Engram affects creativity and the generation of novel content, leaving open the question of a potential loss of output diversity.
Video description
When I thought I was finally used to how good DeepSeek is at publishing research, my mind has yet again been blown away by how insanely well-crafted this new paper is. Like, if you want to do good LLM architectural research or just good ablation studies, you should just skip my video and straight up read the paper, because there's just no way for me to do the paper complete justice. They wasted no time proving their idea works, with methods so rigorous that it makes any other architecture-related work published previously look like they were only doing toy experiments. And it's not just the attention to detail, but their desire to remove all the potential confounding variables that may affect the research results, and drawing proofs mechanistically to show that this new LLM component doesn't just work, but also makes sense. A complete research masterclass. Whoever came up with these experiments, I mean, no wonder they work at DeepSeek. So for today's video, let's take a look at DeepSeek's new paper called "Conditional Memory via Scalable Lookup: A New Axis of Sparsity for LLMs", where they propose this new component called Engram that is definitely going to be the second fundamental architectural upgrade after MHC, which I already have a video on, and I highly recommend you watch that video, because even though MHC is a brand new idea, this Engram paper has already implemented MHC under the name multi-branch architecture and shows that it's just too good not to use. But before we dive into it, if you ever built something that relies on real-time information from the web, things like market trackers, alert systems, live research tools, or even feeding fresh data into an AI model, you'll probably discover pretty quickly that the hardest part isn't building the tool. It's actually getting the data: captchas, rate limits, dynamic web pages. It's never consistent with most of them, making it harder for you to automate this process every day.
And this is exactly where today's sponsor, SerpApi, comes in. It provides clean, structured search results from Google, YouTube, and other search engines, all delivered in a simple JSON format. I have done plenty of legal scraping for projects in the past, and honestly, having a service handle that entire layer is amazingly convenient, because all the messy infrastructure, proxies, blocking, parsing, is handled behind the scenes, which means you can focus on actually building the product. It's also incredibly useful if you're working in AI or machine learning, since you can easily collect things like titles, images, links, and even academic research data through their Google Scholar API. I wish I had this when I built my top most-cited research paper list last year, which would have saved me so much time. So, if you want to build projects that rely on live search data from Google, YouTube, and even more, you should definitely check them out using the link down in the description. And thank you, SerpApi, for sponsoring this video. Anyways, with DeepSeek V4 soon to be upon us, the amount of technical research that they are releasing in advance definitely signals that V4 will contain a lot of new incredible insights and ideas that the V4 paper will not be able to contain, or will actually prove at scale. And you can already see this trend in the DeepSeek V3.2 paper, where they published the key innovation, DeepSeek Sparse Attention, in a paper two months earlier. So, if nothing goes wrong, we are definitely going to see Engram and MHC scaled up really soon. But contrary to MHC, Engram is not using a lot of fancy names in its paper. Yet, it feels impossible to understand what they mean just by reading the paper title. Like, some sort of conditional memory that's like a lookup table in a transformer? How does that even work? So essentially, beside the attention and the feed-forward network, they just added a new block called Engram as the third key component of the transformer.
In a vanilla transformer, we are very familiar with the intuition that the attention block functions as a way to connect semantic meaning across tokens, for example, how strongly one token is related to another. And the feed-forward network block is where the model actually processes what attention delivers to it, turning the relational signals into meaningful features. Kind of like a memory that lives in the weights, containing patterns it has seen during training and converting them into useful internal representations. But these two blocks within a transformer layer are pretty much perfect. Why even bother to add a third block? Well, the key intuition here is that even if the knowledge is in the weights, a transformer still has to reconstruct its meaning at runtime. If the knowledge is represented in one token, it doesn't make that much of a difference. But when the knowledge is spread across a few tokens, like a common phrase, boilerplate syntax, or even formatting patterns, the model ends up needing to spend some compute to build up a representation of that multi-token phrase before it can be used downstream. And this process gets kind of redundant, because the model has to redo that feature construction every time the same multi-token pattern shows up. Let's say there's a sentence: this photo contains Diana, Princess of Wales. In a vanilla transformer, when the model arrives at "Wales", it doesn't instantly register the meaning of "Diana, Princess of Wales". Instead, it gradually composes the meaning, from Wales being a country in the United Kingdom to the actual person, Diana, first wife of Prince Charles, all through the layers of attention plus feed-forward network transformations, which eventually creates a feature that will reliably behave like a single entity later down the transformer layers.
With DeepSeek's proposed Engram, however, when the model gets the token "Wales", it will instantly recognize that this is a common phrase referring to a famous historical person, directly output that person's representation, and fuse it back into the main information stream, skipping the compute effort of puzzling together that this multi-token phrase behaves like a single well-known entity. Which brings us to the major idea that DeepSeek is pushing in this paper: conditional compute and conditional memory. Conditional compute is the short name for the mixture of experts that we are all decently familiar with. You basically have a large pool of expert parameters for the feed-forward network block, and for any given token you only activate a small subset. The model spends compute selectively where it matters. But on the other hand, conditional memory is the Engram that DeepSeek is proposing. Instead of activating experts, you selectively retrieve stored representations from a big lookup table based on the local multi-token pattern. The lookup table can be huge, but it stays cheap and constant-time because it's a hashed lookup. And in hindsight, these two seem very similar, because they both save compute by only touching a subset of parameters each step. But DeepSeek is saying here that there's a major distinction between the two, big enough that they should be treated as two different axes of sparsity. Most importantly, conditional compute is about which expert weights you run, and conditional memory is about which piece of stored information you fetch. So the expert weights help with context-dependent computation, things that genuinely require context-aware processing like reasoning, while Engram helps with fast recall of mostly static patterns that otherwise would have gotten reconstructed redundantly across layers. And by cleanly separating these two ideas, a new way to optimize LLMs has emerged.
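The distinction described above, which experts to *run* versus which stored vector to *fetch*, can be sketched in a few lines. This is a conceptual toy, not DeepSeek's code; the names, sizes, and scoring are all made up for illustration:

```python
# Conditional compute: pick which experts RUN (still costs FLOPs per token).
# Conditional memory: pick which stored vector to FETCH (constant-time lookup).

def moe_route(scores, k=2):
    """Top-k expert selection: the chosen experts still have to be executed."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])[:k]

def memory_fetch(table, slot):
    """Hash-addressed fetch: no matter how large `table` is, this is O(1)."""
    return table[slot]

experts = moe_route([0.1, 0.9, 0.3, 0.7])              # indices of experts to run
vec = memory_fetch([[0.0] * 4 for _ in range(1000)], slot=42)  # one stored vector
assert experts == [1, 3]
```

The point of the contrast: growing the expert pool changes how much is *computed* per token, while growing the memory table only changes how much is *stored*, which is why the paper treats them as separate axes of sparsity.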
So now you don't have to dump your entire sparse parameter budget into experts and pray they learn everything. You can allocate some of that budget to a memory pathway that's explicitly designed to handle the static lookup part of language, and leave the experts to fully do the thinking. In that framing, Engram doesn't feel out of place anymore. It starts to look like a missing component that should have always been there. But how does Engram actually work at the technical level? Because it seems a bit elusive trying to fit this seemingly delicate idea into the model. Well, there have already been other n-gram-related modeling approaches, but none of them really come close to how DeepSeek applied the idea, as most of them sat at the front of the pipeline, either as features for a classifier or as a way to build an input representation, rather than being something that lives inside the transformer as a key component. So, if you step back, this is a very bold move by DeepSeek. Intuitively, you can think of it as a module that repeatedly does four things. Step one: it watches the token stream as the model reads left to right, and at every position it looks at the local tail of what it has just seen. So, in Engram's case, where the n-gram sizes are two and three, it looks at both the last two and the last three tokens, regardless of whether they form a phrase or not. Step two: it uses that short phrase as an address into a giant memory. And since you cannot store a one-to-one entry for every possible phrase, Engram uses a hashing approach where the phrase gets mapped into a few slots in a big table. It then pulls out the vectors sitting in those slots. If the multi-token pattern is actually a common phrase, then this vector will contain a rich representation. But if it's meaningless, it will just be noisy.
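Steps one and two can be sketched as a deterministic n-gram hash lookup. A minimal illustration, not DeepSeek's actual implementation; the table size, hash function, and vector width here are assumptions:

```python
import hashlib

# Hypothetical sizes -- the real table in the paper holds billions of parameters.
TABLE_SIZE = 1 << 20   # number of slots in the memory table
DIM = 8                # width of each stored vector

# The table itself: one trainable vector per slot (zeros as a stand-in).
table = [[0.0] * DIM for _ in range(TABLE_SIZE)]

def ngram_slot(tokens, n):
    """Deterministically map the last n token IDs to a slot index."""
    tail = tokens[-n:]
    key = ",".join(str(t) for t in tail).encode()
    digest = hashlib.blake2b(key, digest_size=8).digest()
    return int.from_bytes(digest, "big") % TABLE_SIZE

def engram_lookup(tokens, orders=(2, 3)):
    """Steps 1-2: for each n-gram order, fetch the vector stored at its slot."""
    return [table[ngram_slot(tokens, n)] for n in orders if len(tokens) >= n]

# The same token tail always hits the same slot, so during training the same
# patterns repeatedly deposit gradient into the same vectors.
seq = [101, 7, 42, 9]
assert ngram_slot(seq, 2) == ngram_slot([5, 5, 42, 9], 2)  # same last-2 tokens
vecs = engram_lookup(seq)
```

The determinism is the whole trick: because the address depends only on the token IDs, frequent phrases keep landing in the same slots, which is what lets those slots accumulate meaningful representations (and, later in the video, what makes prefetching possible).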
Step three: the model takes the hidden state at that layer, which contains the contextual information, and a contextualized gating mechanism uses it as a filter to see if the retrieved pattern actually makes sense. So if the retrieved vector is useless, or if it is meaningful but does not actually fit the context, the gating will make the feature signal very weak. And in step four, the feature gets fused back into the main stream, and the transformer continues as normal. But now it begs the question: how does the model learn to do all this? Interestingly, Engram is trained end to end, which means the model just picks it up from how often certain multi-token patterns show up. Since the lookup is deterministic, these patterns are repeatedly mapped into the same slots in the table. And when these slots receive consistent gradient updates during training, the vectors stored within them start to become genuinely meaningful, so when retrieved during inference they aren't complete noise. The contextualized gating mechanism also learns a very simple yet powerful behavior: when the retrieved vector consistently helps in the context, the gate opens more often, and when it's noise, or when it clashes with the current context, the gate learns to shut. So we end up with frequent visits to the common patterns making the vectors residing there better, and the better vectors give the gate a reason to selectively rely on them, and all of this is learned by the model implicitly. Amazing, right? Another important fact is that Engram isn't applied in every transformer layer. Instead, they treat Engram as something you insert at specific layers, because where you inject the multi-token features matters a lot. And you don't need too many Engram blocks, because if you think about it, injecting the meaning once is probably enough.
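Steps three and four, the contextual gate and the fusion back into the residual stream, can be sketched like this. A plain-Python toy with made-up dimensions; the real gate uses learned projections trained end to end, while a raw dot product stands in here:

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_fuse(hidden, retrieved):
    """Step 3: the hidden state scores how relevant the retrieved vector is.
    Step 4: the gated retrieved vector is added back into the residual stream."""
    gate = sigmoid(dot(hidden, retrieved))      # in (0, 1): open vs. shut
    return [h + gate * r for h, r in zip(hidden, retrieved)], gate

# A retrieved vector that agrees with the context opens the gate...
fused, g_open = gated_fuse([1.0, 2.0], [1.0, 2.0])
# ...while one that clashes with the context pushes the gate toward shut.
_, g_shut = gated_fuse([1.0, 2.0], [-1.0, -2.0])
assert g_open > 0.9 and g_shut < 0.1
```

When the gate is near zero the fused output is essentially the original hidden state, which is exactly the "weak signal" behavior described for noisy or out-of-context lookups.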
And logically, adding Engram in early layers should be the best, because that's where the transformer is still trying to puzzle together the multi-token meaning, and having Engram there can potentially offload that work before the model burns through a bunch of layers just to do the exact same thing. But if you put it too early, the contextualized gating mechanism may have signals that are too weak to be even useful. Like in layer 1, the hidden state doesn't have much information yet, so it doesn't have enough global context to decide whether the retrieved memory is actually relevant or not. So through their ablation study, they found that having Engram at layer 2 is the best, as that gave the lowest loss out of all the layers they tried. But interestingly, they found that placing Engram at layers 2 and 6 performs even better after their more complicated sweep. This design seemingly balances the benefits of early intervention with better late-stage contextualized gating, as the hidden state in layer 6 has better contextual information. So, the baseline for figure 4 is actually Engram being inserted at two layers instead of just one. And if you look to the right where it says ablation variations, you can see how much the performance degrades when each mechanism they introduced to make Engram better is taken out, with the largest degradation being not using the multi-branch architecture, which is the MHC that DeepSeek recently introduced. Can you believe that it's making this much impact already? This is an interesting one too, because Engram plus MHC creates a multi-view of the same lookup that can be filtered differently, where each branch can decide in its own way how much to trust the retrieved memory. And without MHC there would only be one branch, which apparently performs a lot worse.
Aside from that, they also use token compression, which basically means mapping raw token IDs into canonical IDs, like making words all lowercase or folding them into semantically equivalent forms. Removing the gating mechanism also creates a large impact, because the lookup can be noisy, and this just shows how important the contextualized gating mechanism is. While everything I just talked about is a good way to interpret the ablation studies, the DeepSeek researchers still went the extra mile to prove that it's not just a convenient interpretation, but something you can actually see happening inside the network. Here they use a probe called logit lens to observe the model mechanistically, where at every layer you take the model's hidden state and ask: if I force the model to predict the next token right now, what would it predict? Then you compare that layer's early guess to the model's final guess at the last layer, using KL divergence as a measurement. So lower KL indicates that a layer is already close to the final answer. And what they found is that Engram models get close to their final prediction earlier in the network, especially in the early layers. As you can see, the lower KL appears sooner, which matches the claim that Engram injects useful local patterns early, so the backbone doesn't have to spend as many layers slowly assembling those features, leaving later layers more room for the harder context-dependent work. But the researchers weren't done. They wanted to double down on the interpretation. So, they also did a representation-level test using a CKA similarity heat map. How it works is that it takes the hidden states from every layer of Engram and every layer of the baseline and measures how similar their representation geometry is. If a layer in Engram is doing about the same kind of work as a baseline layer at the same depth, you would expect the brightest region on the heat map to sit near the diagonal.
But what they observe is an upward shift off the diagonal, because for many layers an Engram layer looks more similar to a deeper baseline layer, with Engram layer 5 aligning best with around baseline layer 12. This kind of implies that Engram's shallow layers are functionally closer to deeper layers of the baseline, which can be interpreted as Engram increasing the model's effective depth without actually adding layers. Anyways, now that Engram's effectiveness is actually proven, practicality-wise, how should it be utilized? To start off, we have two directions that we could look into: what's the best balance of Engram and MoE under a fixed budget, and how far can we scale Engram if budget is not a concern? For a fixed budget, they did a sweep under the same amount of compute FLOPs. And right off the bat, a pure MoE model is the most inefficient, with the highest loss. As you allocate more to Engram, thereby reducing the total number of routed experts while keeping the active compute per token the same, they found that the best point is when roughly 20 to 25% of the sparse capacity is moved into Engram, improving the loss by around 0.8% for a 110-billion-parameter model. As to how far you can scale it, a fascinating thing about Engram is that making the Engram table larger only increases stored parameters, not per-token compute, because Engram only performs a constant-time hash lookup and a fixed-size fusion. So it doesn't really matter how big the table is. By just scaling the hash table, they see the loss keep improving as the table grows, up to an extra 13 billion parameters, which is the largest size they have tested in this paper. And the trend seems to scale linearly while improving the model's math and coding capabilities. Not only that, when you take the Engram component out of a trained model, you can see how much the performance deteriorates, especially for tasks related to factual knowledge and algorithmic thinking, dropping by up to 56%.
Tasks like reading comprehension retain more of their performance, which pretty much shows that Engram is actually carrying a lot of weight for the stored-knowledge domain rather than just being a minor helper. So, if Engram is this good, there must be some tradeoffs, right? Maybe inference gets slower or more expensive? Well, actually, since the lookup addresses within Engram are deterministic, once the token sequence is known, the hash indices for the lookup are all fixed. So the system can prefetch the needed features asynchronously: while layer 1 is doing its pass, the prefetching can already be started, so that when layer 2 needs Engram, the features are pretty much ready and can be transferred in over PCIe. In their benchmark, with a 100-billion-parameter Engram table entirely offloaded to host CPU DRAM, the throughput penalty is actually really tiny: a 1.9% slowdown on a 4-billion-parameter model and a 2.8% slowdown on an 8-billion-parameter model. So with how insane Engram is, the next step is definitely seeing it in action at a large scale, which DeepSeek V4 should implement. So hopefully by now you understand the hype that I have for V4, because with this many new innovations it is definitely about to be a crazy model drop. And if you liked how I explained the AI concepts today, you should definitely check out my latest project, intuitive.academy, where it contains intuitive explanations of modern LLMs, including Engram, from the ground up. This is a series where I'll break down AI topics intuitively, because I genuinely think anyone can understand them no matter how difficult they may seem. It currently has over 24 chapters, and I plan to add a new one at least once a month. So for those who want to get into AI or LLMs, this should be the perfect place for you to dive into the technical parts without being intimidated by crazy-looking maths. And right now I'm also putting out an early-bird discount, so you can use the code early for 40% off a yearly plan.
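Going back to the prefetch trick from the benchmark above: because the hash address is deterministic, the fetch can start before the layer that needs it. A toy sketch, with a Python thread standing in for the asynchronous PCIe transfer and a dict standing in for the offloaded table (all names hypothetical):

```python
import threading
import time

# Toy memory table; the real Engram table lives in host DRAM and streams over PCIe.
TABLE = {i: [float(i)] * 4 for i in range(1000)}

def hash_index(tokens):
    """Deterministic address: known as soon as the token sequence is known."""
    return sum(tokens) % len(TABLE)

def run_layer():
    """Stand-in for the layer-1 forward pass that the prefetch overlaps with."""
    time.sleep(0.01)

prefetched = {}

def prefetch(tokens):
    # Fetch the Engram vector in the background while layer 1 computes.
    prefetched["vec"] = TABLE[hash_index(tokens)]

tokens = [7, 42, 9]
t = threading.Thread(target=prefetch, args=(tokens,))
t.start()          # lookup starts immediately: the address is already fixed
run_layer()        # ...meanwhile "layer 1" does its pass
t.join()           # by the time "layer 2" needs it, the vector is ready
assert prefetched["vec"] == TABLE[hash_index(tokens)]
```

This overlap is why the reported slowdown stays in the low single digits even with the table fully CPU-offloaded: the transfer hides behind compute that was happening anyway.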
And thank you guys for watching. A big shout out to Spam Match, Chris Ladoo, Degan, Robert Zaviasa, Marcelo Ferraria, Poof, and Enu DX Research Group, Alex Midwest Maker, and many others who support me through Patreon or YouTube. Follow me on Twitter if you haven't, and I'll see you in the next one.
![DeepSeek's Insane Architecture Breakthrough [Engram Explained]](/_next/image?url=https%3A%2F%2Fimg.youtube.com%2Fvi%2FxUlX6jvwVfM%2Fhqdefault.jpg&w=3840&q=75)



