Autoencoders, VAE, VQ-VAE: an explanation of the key concepts
The video explains the evolution of autoencoders, variational autoencoders (VAE), and vector-quantized variational autoencoders (VQ-VAE), which are key building blocks of text-to-image generation models such as DALL·E. It covers the definition, purpose, and mechanism of each concept and highlights their differences and advantages.
Key takeaways
- Autoencoders compress and reconstruct input signals using an encoder-decoder architecture.
- VAEs compress images into a probabilistic, structured latent space, which allows images to be generated by sampling from a continuous distribution.
- VQ-VAEs use a discrete latent representation with a learnable codebook, avoiding posterior collapse and improving generalization.
- Understanding VQ-VAEs enables more effective use of discrete representations in generative models.
- VAEs can be used to build controllable generative models whose latent space can be manipulated.
- Autoencoders can be applied to data compression and dimensionality reduction without significant loss of information.
Although the video covers the core concepts, it does not go into the latest advances and variations of these architectures. Practical application may require additional research and experimentation.
Video description
Greetings, fellow learners. In this video, we are going to talk about the core of DALL·E: autoencoders. This is going to be the first in a two-part series on DALL·E, where DALL·E takes some text and generates an image related to that text. One core component of this is a discrete variational autoencoder. So before explaining DALL·E as a system, I thought it would be nice to understand the progression of ideas that made DALL·E possible: specifically, the evolution from autoencoders for compression, to variational autoencoders that introduced the generation component, and then vector-quantized variational autoencoders that added discretization. It is a type of discrete variational autoencoder, the VQ-VAE, that is actually used in DALL·E, which we'll look at in the next video. So let's take a look at these three concepts: the what, the why, and the how.

Let's start with autoencoders, with a brief one-line explanation of what they are. An autoencoder is essentially a neural network that is trained to compress and reproduce an input signal. It has an encoder-decoder architecture, and each half could be, for example, feed-forward layers, convolutional networks, or, in modern architectures, transformer layers. We take an input and pass it to the encoder to compress it into a vector, the latent code. This is then passed into a decoder that reconstructs the original image. So why do we have autoencoders? It is to learn a compressed representation of the input image via backpropagation without a supervisor. By supervisor, I mean there is no explicit label for the data: the label is the original image itself, used for reconstruction, and the network learns via backpropagation of errors. So how are autoencoders trained, and how are they run at inference time?
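To make the encoder-decoder structure described above concrete, here is a minimal NumPy sketch of an autoencoder's forward pass. This is not from the video; the layer sizes and the random, untrained weights are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: a flattened 8x8 "image" compressed to a 4-dim latent code.
input_dim, latent_dim = 64, 4

# Untrained weights; in practice these are learned via backpropagation.
W_enc = rng.normal(scale=0.1, size=(latent_dim, input_dim))
W_dec = rng.normal(scale=0.1, size=(input_dim, latent_dim))

def encode(x):
    # Compress the input into the latent code (the bottleneck vector).
    return np.tanh(W_enc @ x)

def decode(z):
    # Reconstruct the input from the latent code.
    return W_dec @ z

x = rng.random(input_dim)   # a fake input "image"
z = encode(x)               # latent code, much smaller than the input
x_hat = decode(z)           # reconstruction, same shape as the input

print(z.shape, x_hat.shape)  # (4,) (64,)
```

The key point is the shape change: 64 numbers in, 4 numbers in the middle, 64 numbers out. Training would adjust `W_enc` and `W_dec` so that `x_hat` matches `x`.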
Well, they're trained via backpropagation to minimize a reconstruction loss. This reconstruction loss could be a mean squared error for color images, where we compare pixel-by-pixel values, or a binary cross-entropy loss for binary, black-and-white images. That's the core of how autoencoders are trained and how they work.

Let's now move on to the second concept: VAEs. Variational autoencoders are neural networks trained to compress images into a latent space, just like autoencoders, but that latent space is probabilistic and structured. Let's see why and how that works. Why do we have variational autoencoders? Say we train an autoencoder on dog images. When we want to do generation, we just need the decoder to produce the image, but we need to sample the latent vector from somewhere. What should it actually be? If you sample random values for this vector from some arbitrary distribution, you'll just get gibberish. But what if there were a specific distribution we could sample from so that the latent values always produce some image? Imagine the entire embedding space of every possible vector we could input to the decoder, and suppose there is one small region, call it the yellow puddle, where any vector we sample and pass to the decoder will always generate at least some relevant image. Sample a different point in this region and the decoder generates another image; sample another point and we get yet another image. Variational autoencoders do this, at a high level, and that's why we have them. So, succinctly stated: randomly sampling a vector to input to the decoder can give us garbage output.
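The two reconstruction losses just mentioned can be written in a few lines of NumPy. This is a sketch with made-up toy values, not code from the video:

```python
import numpy as np

def mse_loss(x, x_hat):
    # Mean squared error: average of per-pixel squared differences,
    # suitable for real-valued (e.g. color) images.
    return np.mean((x - x_hat) ** 2)

def bce_loss(x, x_hat, eps=1e-7):
    # Binary cross-entropy: for binary / black-and-white pixels in [0, 1].
    x_hat = np.clip(x_hat, eps, 1 - eps)  # avoid log(0)
    return -np.mean(x * np.log(x_hat) + (1 - x) * np.log(1 - x_hat))

x = np.array([0.0, 1.0, 1.0, 0.0])      # original (binary) pixel values
x_hat = np.array([0.1, 0.9, 0.8, 0.2])  # imperfect reconstruction

print(mse_loss(x, x_hat))  # 0.025: small because the reconstruction is close
print(bce_loss(x, x_hat))  # also small; both shrink as x_hat approaches x
```

Both losses are zero only when the reconstruction matches the original exactly, which is precisely what backpropagation drives the network toward.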
Variational autoencoders solve this by allowing us to sample the vector from a continuous, fixed distribution. That distribution tends to be a Gaussian, but in principle it can be anything. Now that we understand the what and the why of variational autoencoders, let's look at how the forward and backward passes are done. Variational autoencoders have three main components: an encoder, a sampling layer, and a decoder. We first take an input image and pass it to the encoder, say a convolutional network. The idea is that this encoder outputs the parameters of the distribution from which the latent vector will come. To define a probability distribution, you typically just need its parameters: for a Gaussian, the mean and the standard deviation. If you know those two values, you know exactly what the Gaussian looks like. Because we assume a Gaussian distribution, the encoder outputs a mean and a standard deviation, and from those we can construct the entire Gaussian. From this Gaussian we sample a latent vector, which is passed into the decoder, and the decoder reconstructs the original image. That's how this works, at least at a high level. One issue, though: we are going to compute some loss here, which we'll talk about very shortly.
This loss has to be backpropagated through the network, since these are neural networks that learn via the backpropagation procedure, so we need to be able to compute gradients for every component of the network. Through the decoder this is fine: we can always compute how each parameter value changed the loss. For example, we can compute ∂L/∂z quite easily through backpropagation. But then we get to the sampling layer, which is tricky. Sampling is a stochastic procedure, so it is very difficult to see how changing the mean or the variance will affect the loss of the network. How do we deal with this so that the encoder can learn effectively? We use something called the reparameterization trick. Instead of sampling the latent vector directly from the normal distribution, we write it as z = μ + σ ⊙ ε, where ⊙ is the Hadamard product, an element-wise multiplication of two vectors. This works because it still approximates a sample from the distribution: intuitively, the sample is centered at the mean, we add some variation scaled by σ, and ε is sampled from a standard Gaussian distribution. You can imagine that z will be a value around μ + σ or μ − σ, maybe somewhere in between, maybe slightly outside that band, which reflects the original distribution itself. It is quite literally a reparameterization of the original normal distribution. And apart from still approximating the distribution, this now lets us easily compute the gradients we need.
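The reparameterization trick described above is a one-liner in code. Here is a NumPy sketch (the μ and σ values are made-up toy numbers; in a real VAE they come from the encoder):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_latent(mu, sigma, rng):
    # Reparameterization trick: z = mu + sigma * epsilon, epsilon ~ N(0, I).
    # All the randomness lives in epsilon, so gradients can flow through
    # mu and sigma, which are deterministic outputs of the encoder.
    epsilon = rng.standard_normal(mu.shape)
    return mu + sigma * epsilon  # element-wise (Hadamard) product

mu = np.array([0.5, -1.0, 2.0])     # mean output by the encoder
sigma = np.array([0.1, 0.2, 0.05])  # standard deviation output by the encoder

z = sample_latent(mu, sigma, rng)
print(z)  # one sample: values near mu, spread by sigma
```

Averaging many such samples recovers μ, confirming that this really is a sample from N(μ, σ²) even though the sampling operation itself now sits outside the learnable path.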
For backpropagation we want to know how changing μ affects the loss. Through the chain rule, it is how z affects the loss times how μ affects z. If you write that out, ∂z/∂μ is actually 1, based on the equation z = μ + σ ⊙ ε: just take the derivative with respect to μ. So ∂L/∂μ is the same thing as ∂L/∂z; the gradient just passes through. Similarly for σ, the standard deviation: ∂L/∂σ is how changing z changes the loss times how changing σ changes z. Taking the derivative of z with respect to σ gives ε, which is constant for a specific forward and backward pass, so you end up with ε ⊙ ∂L/∂z. Effectively, we now have gradients that can pass through z to μ and to σ (or σ² here as well), and once we have gradients there, we can easily backpropagate through the encoder, so the encoder can learn effectively. That's why we do the reparameterization trick and why it is so helpful here. One thing to note: I write σ² in one place and σ in another. That is intentional; I didn't want to clutter the diagram too much. In many cases you'll actually see the encoder output log σ² for numerical stability, because that value can range from negative infinity to positive infinity without being constrained to the positive reals as σ is. But these logarithm and squaring operations are all differentiable at the end of the day.
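The two gradients described above follow directly from the reparameterized form. Written out:

```latex
z = \mu + \sigma \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)

\frac{\partial z}{\partial \mu} = 1
\quad\Rightarrow\quad
\frac{\partial L}{\partial \mu}
  = \frac{\partial L}{\partial z}\,\frac{\partial z}{\partial \mu}
  = \frac{\partial L}{\partial z}

\frac{\partial z}{\partial \sigma} = \epsilon
\quad\Rightarrow\quad
\frac{\partial L}{\partial \sigma}
  = \frac{\partial L}{\partial z}\,\frac{\partial z}{\partial \sigma}
  = \epsilon \odot \frac{\partial L}{\partial z}
```

Both right-hand sides are deterministic given the ε drawn in the forward pass, which is exactly what makes the encoder trainable.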
So we can still use the chain rule to get how changing z affects the loss and, downstream, how log σ² affects the loss. It's all differentiable at the end of the day; I just wanted to make that a clear note. Now let's look at the actual loss components, of which there are two. One is the reconstruction loss, the same as we saw in the autoencoder case, which makes sure the reconstruction is similar to the original image. Then there is the KL-divergence loss, which ensures that the encoder's latent distribution stays close to the prior, a standard Gaussian. We made the assumption that at generation time we will sample from a standard Gaussian; that is our prior. So we want μ and σ², the parameters of our posterior distribution, to be pushed toward the standard Gaussian: μ toward zero and σ² toward one. They won't land exactly there, because they are also tugged by the reconstruction loss: we have to reconstruct the image correctly while keeping the latent space close to the standard Gaussian. That's the intuition of how this works. We get a loss that can be learned via backpropagation, and we learn the parameters of the encoder and decoder. For inference, once the model is trained, we take just the trained decoder (drawn in green because it's trained) and simply sample from a standard Gaussian.
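The two loss components can be sketched in NumPy. The KL term below is the standard closed-form expression for KL(N(μ, σ²) ‖ N(0, 1)); the reconstruction term is illustrated with MSE, and the toy values are made up:

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    # Closed-form KL( N(mu, sigma^2) || N(0, 1) ), summed over latent dims.
    # Uses log_var = log(sigma^2), as encoders often output for stability.
    return -0.5 * np.sum(1 + log_var - mu**2 - np.exp(log_var))

def vae_loss(x, x_hat, mu, log_var, beta=1.0):
    # Total VAE loss: reconstruction term plus (optionally weighted) KL term.
    recon = np.mean((x - x_hat) ** 2)
    return recon + beta * kl_to_standard_normal(mu, log_var)

mu = np.zeros(4)
log_var = np.zeros(4)  # sigma^2 = 1 in every dimension
# When the posterior exactly equals the prior, the KL term is zero:
print(kl_to_standard_normal(mu, log_var))  # 0.0
# Shifting the mean away from zero makes the KL term grow:
print(kl_to_standard_normal(np.ones(4), log_var))  # 2.0
```

The tug-of-war in the video is visible here: the KL term is minimized by μ = 0, σ² = 1, while the reconstruction term pulls the posterior away from that point whenever it helps reconstruction.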
We get a vector z, pass it to the decoder to generate an image, and the cool thing is that because the distribution is continuous, we can sample from any part of it and still get some relevant image. Varying the latent vector slightly also lets us control what kind of images are generated. For the grid of faces shown here, each face was produced by varying z ever so slightly, moving from frowning faces to neutral faces to smiling faces, and we can control this effectively.

On to vector-quantized variational autoencoders. These are VAE-style models with a discrete latent representation. That was the one-line definition of the what; let's see how they work. We have an original image and pass it to an encoder, say a convolutional network with a stack of convolution, activation, and pooling layers, which eventually transforms it into a tensor. Say that tensor has shape 32 × 32 × D, where D could be something like 512. The way I like to interpret this, for now, is as a grid of continuous latent vectors: 32 × 32 = 1,024 vectors, each D-dimensional, and each of these D-dimensional vectors represents some region of the image. We also have something called a learnable codebook of embeddings: these are also D-dimensional vectors, say 512 dimensions each, and suppose there are around 8,000 of them. These are all learnable vectors. For each encoder output vector, we want to find its nearest neighbor in the codebook.
Say the nearest neighbor for one vector is the codebook entry called E2. That entry number is recorded in a 32 × 32 grid of indices: this position gets entry number 2, while the last one gets entry number 3, which was the nearest neighbor for that vector. We then use the actual codebook vector as the input to the decoder. So we still have a 32 × 32 × D tensor, but each D-dimensional vector is now a codebook vector rather than the continuous encoder output. We pass these discrete vectors into the decoder, and the decoder reconstructs the image. As part of the VQ-VAE, there are three major terms that contribute to the loss function. One is the reconstruction loss, which we've already covered for the autoencoder and the variational autoencoder. The second is the codebook loss, which pushes the selected code, this E2, toward the corresponding output of the encoder. The third is the commitment loss, which pushes the corresponding output of the encoder toward the selected codebook vector, treating the codebook as fixed. You can imagine these two losses are symbiotic: they keep each other in check. We add these losses together, or take a weighted sum, to form the final loss, and the entire network is learned via backpropagation. One thing to note: when you backpropagate this loss, it flows through the decoder just fine, since the decoder is just a convolutional network or similar, so the gradients propagate well up to the quantized vectors.
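The nearest-neighbor lookup and the codebook and commitment losses described above can be sketched as follows. This is a toy NumPy version with tiny, made-up sizes (a 2 × 2 grid and 8 codes rather than 32 × 32 and ~8,000):

```python
import numpy as np

rng = np.random.default_rng(0)

K, D = 8, 4                          # 8 codebook entries, 4-dim vectors (toy)
codebook = rng.normal(size=(K, D))   # learnable codebook embeddings
z_e = rng.normal(size=(2, 2, D))     # encoder output: a tiny grid of vectors

# Nearest-neighbor lookup: for each grid vector, the closest codebook entry.
dists = np.linalg.norm(z_e[..., None, :] - codebook, axis=-1)  # (2, 2, K)
indices = np.argmin(dists, axis=-1)  # (2, 2) grid of codebook entry ids
z_q = codebook[indices]              # quantized vectors, shape (2, 2, D)

# Codebook loss: pulls the selected codes toward the encoder outputs.
codebook_loss = np.mean((z_q - z_e) ** 2)
# Commitment loss: pulls the encoder outputs toward the selected codes.
commitment_loss = np.mean((z_e - z_q) ** 2)

print(indices.shape, z_q.shape)  # (2, 2) (2, 2, 4)
```

Numerically the two loss terms are identical; in an autograd framework they differ by where the stop-gradient is placed, which controls whether the gradient updates the codebook (codebook loss) or the encoder (commitment loss). The `indices` grid is also exactly the discrete representation that later becomes a token sequence.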
But from the discrete codebook vectors back to the continuous encoder outputs, how do we actually get the gradients to propagate through that section? After all, if you vary an encoder output vector ever so slightly, in most positions the same codebook entry will still be selected, so the codebook entry won't change and neither will the downstream loss. The gradient is zero almost everywhere, and the encoder will not learn. And at some point in between, a tiny change will flip the nearest codebook vector, and the gradient will suddenly spike. This pattern of flat gradients punctuated by spikes is not useful for learning the parameters of the encoder. To combat this issue, we use something called straight-through estimation: whatever gradient we compute for the quantized vectors is simply copied and propagated to the encoder outputs, as if they were the same. This is more heuristic than principled, but it makes practical sense: we do, at the end of the day, want these vectors to be similar to each other, so it makes sense for their gradients to match as well. I hope it's clear how this entire network can now learn effectively.

Now that we've talked about the what and the how of VQ-VAEs, let's talk about the why. A primary advantage of VQ-VAEs over the original vanilla variational autoencoder concerns the idea of posterior collapse. Let me first clarify some terminology. In the vanilla variational autoencoder, the encoder outputs the parameters of a distribution given some image, so we have the ideas of a posterior distribution and a prior distribution.
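Straight-through estimation amounts to a small trick in how the quantized vector is written. Here is a framework-agnostic sketch, where `stop_gradient` is just a value copy standing in for an autograd stop-gradient such as PyTorch's `.detach()` (the toy vectors are made up):

```python
import numpy as np

def stop_gradient(x):
    # Stand-in for an autograd stop-gradient: the value passes through,
    # but no gradient would flow back through this term.
    return np.array(x, copy=True)

z_e = np.array([0.2, -0.4, 1.1])  # continuous encoder output (toy values)
z_q = np.array([0.0, -0.5, 1.0])  # its nearest codebook vector (toy values)

# Straight-through estimator: the forward value equals z_q, but because the
# difference is wrapped in stop_gradient, d(z)/d(z_e) = 1, so whatever
# gradient the decoder sends to z is copied straight through to z_e.
z = z_e + stop_gradient(z_q - z_e)

print(z)  # identical to z_q in the forward pass
```

The decoder therefore sees the discrete code, while the encoder receives a usable gradient instead of the flat-then-spiking one described above.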
A prior distribution is a distribution before seeing any data. A posterior distribution, which is what the encoder outputs, is a distribution conditioned on the input data: given some input, the encoder produces the parameters that construct this posterior distribution, and the latent vector is sampled from it. With that clarified: during training, the decoder should use z, this latent vector, to reconstruct the image. But if the decoder is very expressive and powerful, it might actually learn to reconstruct the image without relying much on z. Because of this, the reconstruction loss can be minimized with very little dependence on what z actually is. Normally the reconstruction loss pulls against the KL divergence; with only a weak dependence on z, the encoder doesn't need to worry and becomes more flexible with its outputs μ and σ², that is, with the posterior distribution, so it can focus on minimizing the KL divergence alone. And the best way to minimize the KL divergence, since it pushes the posterior toward a standard Gaussian, is to literally push μ to zero and σ² to one. The output posterior distribution is then pushed all the way onto the prior Gaussian, and hence there is a posterior collapse. During the generation phase, this can mean that even if you vary z quite meaningfully, it doesn't create a meaningful change in the generated images.
Now, this is just an example showing images that were somewhat recovered from posterior collapse, but I hope it paints the picture: we might get generated images that look like average faces, but we can't meaningfully control them by changing z, and that becomes a problem. Vector-quantized variational autoencoders don't even have this concept of posterior collapse, because the encoder no longer outputs the parameters of a distribution. There are other issues, like codebook collapse, but that is separate from posterior collapse. A second reason to use vector-quantized variational autoencoders is their discrete representation. With the learnable codebook, say 8,000 vectors of 512 dimensions, this information has to be reused across all training images: no matter how many training images there are, all of that representation has to be captured within this fixed set of codes. This forcing function enables better generalization, as opposed to variational autoencoders, whose continuous vectors are very flexible but may tend toward memorizing specific high-variation pixel signals, for example. And of course, discrete codebook vectors are more computationally efficient and easier to store. The third reason, which is a nice-to-have right now and segues well into the DALL·E discussion, is compatibility with sequence models. Say we have a trained encoder from a vector-quantized VAE. We pass in an image and get a tensor of continuous vectors, which we can snap to discrete codebook vectors.
These become 1,024 image tokens, which is what will be used in DALL·E, and which we'll look at in more detail in the next video. This works out well because a lot of models today work with sequences of tokens. The approach was originally created for text, but now we can do the same for images and hence leverage these very powerful transformer architectures when you have a significant amount of data.

Quiz time. Have you been paying attention? Let's quiz you to find out. What advantages do vector-quantized VAEs have over vanilla VAEs? A: they avoid posterior collapse. B: they use a discrete latent space. C: they always train faster than VAEs. D: you don't need an encoder network. I'll give you a few seconds to answer this question, and note that multiple options may be correct. The correct options are A and B. Did you get them right? Please comment your reasoning down below and let's have a discussion. And at this point, if you think I deserve it, please consider giving this video a like, because it will help me out a lot. That's going to do it for quiz time.

Before we go, let's generate a summary. In this video, we took a look at the what, why, and how of autoencoders, variational autoencoders, and vector-quantized variational autoencoders. It is a discretized version of the VAE, similar to this vector-quantized variational autoencoder, that is effectively used in DALL·E as a core component for image tokenization. I hope all of this makes sense. If you want more reading material, I'll put all the reference papers down in the description below, and I highly encourage you to go through them. I hope you can use this as a supplement to whatever you're learning when tackling these concepts. They're very math-heavy.
They're kind of difficult, but if you stare at them long enough, I'm sure you'll get it. So, thank you so much, and I'll see you in the next video.




