YouTube catalog
Intro to Storage Buckets
🛠 How-to
en

Hugging Face Storage Buckets: affordable storage for AI projects

Hugging Face · 7 days ago · Apr 7, 2026 · Impact 6/10
AI Analysis

Hugging Face has introduced Storage Buckets, a new repository type for AI workflows that offers transparent per-terabyte pricing with a built-in CDN and deduplication. It is positioned as a more reliable and affordable alternative to Amazon S3 buckets, especially for machine learning datasets, checkpoints, and model artifacts.

Key points

  • Storage Buckets offer per-terabyte pricing with a built-in CDN and deduplication
  • Designed for machine learning workflows, including datasets, checkpoints, and model artifacts
  • Hugging Face CLI integration makes it easy to create and manage buckets

Benefits

  • Up to 50% lower storage costs for machine learning data compared to S3
  • Built-in CDN provides fast data access from anywhere in the world
  • Hugging Face CLI integration simplifies workflow automation

Nuances

Xet-based deduplication saves significant space and speeds up transfers, especially for the large binary files common in machine learning. This can be a decisive factor for teams working with large volumes of data.

Video description

Good morning everyone. How's it going today? Today we're going to be taking a look at Hugging Face's Storage Buckets, a wonderful development released by the team just a couple of weeks ago, and it is great. It's the first time that the Hub has a new kind of repository: usually you have models, datasets, and spaces, and now you also have buckets. So the first thing we're going to see today is what storage buckets actually are and why they matter, how they are a drop-in replacement for your Amazon S3 buckets, and how they are also much more reliable and affordable, with predictable pricing. We're also going to see why they are so good thanks to deduplication with Xet, a quick comparison with regular repositories, and some quick examples of where you could use this. So without any further ado, let's get right into it. All right, let's talk very quickly about what storage buckets are. Storage buckets are a very straightforward storage unit for your AI workflows, but not only for AI workflows, as we're going to see in just a moment; they are very versatile. They offer per-terabyte pricing, which is very transparent and comes with built-in CDN and deduplication speedups, something you do not get on S3 buckets, for example. There are no Git constraints, and they are designed for machine learning workflows such as datasets, checkpoints, and model artifacts, but not only that, as we're going to see in just a moment. And here we have an example of how it would work. As you can see, it is available in the Hugging Face CLI. So, in order to create a bucket, you would just do hf buckets create, then the name of your organization, and then the ID of your bucket. And there you go, it's ready for you to use.
And you're going to be able to access it as though it were inside your own file system, just by using hf://buckets/ and then the ID of your bucket. So you can do hf sync, which is basically like rsync, or like the sync command for an S3 bucket, and just sync your checkpoints to the bucket right here, and it will work much, much faster because it deduplicates everything with Xet. Now, what does it actually mean to use Xet for deduplication? Well, let me tell you a little bit about how Xet works in the Hugging Face storage system, as the backend for pretty much every repository. When you're uploading a repository to Hugging Face, you very often have very heavy files: model checkpoints or other machine learning artifacts, which are usually binary files. Now, if you wanted to version-control those, then whenever you changed a little bit of a binary file, Git or your version control system would have to re-track the entire file, even if it's huge and even if only 5% of it changed. That's how Git works, and that's also how S3 uploads work. So if you just change 5% of your model, for example, which is the usual thing when you retrain: suppose you have your model checkpoint, you retrain it, and after retraining only about 5% of the weights actually change. Well, if you try to re-upload or sync that file with S3 or with regular Git, it will basically re-upload the whole thing. Whereas with Xet, if you have a large binary file, Xet chunks it into different chunks that make sense and then only uploads the chunks that actually changed. And that is how it works with Xet.
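The chunk-level deduplication described above can be sketched with a toy content-addressed store. This is not Xet's actual algorithm (Xet uses content-defined chunking; fixed-size chunks stand in for it here, and all names are my own), just an illustration of why only the changed chunks of a binary file need to be re-uploaded:

```python
import hashlib

def chunk(data: bytes, size: int = 4) -> list[bytes]:
    # Fixed-size chunking as a stand-in for Xet's content-defined chunking.
    return [data[i:i + size] for i in range(0, len(data), size)]

def upload(data: bytes, store: dict[str, bytes]) -> int:
    """Store only chunks whose hash is not already present; return bytes sent."""
    sent = 0
    for c in chunk(data):
        h = hashlib.sha256(c).hexdigest()
        if h not in store:
            store[h] = c
            sent += len(c)
    return sent

store: dict[str, bytes] = {}
v1 = b"AAAABBBBCCCCDDDD"   # initial "checkpoint"
v2 = b"AAAABBBBXXXXDDDD"   # retrained: one 4-byte chunk changed
first = upload(v1, store)  # all 16 bytes go up
second = upload(v2, store) # only the changed chunk goes up
print(first, second)       # 16 4
```

Re-uploading v2 over plain S3 or Git would transfer all 16 bytes again; here only the 4 changed bytes move, which is exactly the speedup the video attributes to Xet at checkpoint scale.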
That is why in Hugging Face repositories, whenever you have a binary file such as a model checkpoint, it is uploaded with Xet, and that comes natively in Hugging Face buckets. As for pricing, it is also incredibly affordable: as you can see, it is about $8 to $12 per terabyte per month, which is way more affordable than Amazon S3 buckets, and it is of course very predictable. You only have that fee to pay, and everything else is dealt with. So you have egress at no extra fee and a CDN at no extra cost, which is not the case for S3 buckets. Every bucket comes with a built-in CDN, which is very generous and makes it easy to access your files from anywhere in the world: just choose a CDN location that is close to you, and it is pre-warmed for you. It works great, and egress is included up to a generous 8:1 ratio of your total storage at no extra cost, which is crazy. Now, this is also going to be very useful for your agents, because it works directly within the CLI, which means your agent can use it just by calling the bash tool. Here we have an example: the agent creates a bucket and syncs experiment outputs. As you can see, what the model did was basically call hf buckets create, create the bucket, and then sync all the output of its experiments, and there you go: all the experiments are already in your bucket. Let's take a very quick look at the quickstart so that you can start creating your own bucket. All right, so let's actually create a bucket. There are multiple ways to create one: you can use the CLI, you can use the graphical user interface, and you can also create them programmatically with Python.
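A back-of-the-envelope cost sketch using the numbers quoted in the video (roughly $8–12 per terabyte per month, egress included up to an 8:1 ratio of stored data). The function names and the $10 midpoint are my own assumptions, not an official pricing API:

```python
def monthly_cost(storage_tb: float, price_per_tb: float = 10.0) -> float:
    """Flat per-terabyte fee; the video quotes roughly $8-12/TB/month."""
    return storage_tb * price_per_tb

def egress_allowance_tb(storage_tb: float, ratio: float = 8.0) -> float:
    """Egress included up to an 8:1 ratio of total storage, per the video."""
    return storage_tb * ratio

# e.g. a 5 TB bucket of checkpoints:
print(monthly_cost(5))         # 50.0 at the assumed $10/TB midpoint
print(egress_allowance_tb(5))  # 40.0 TB of included egress
```

The predictability claim is just this linearity: one term, no separate request, egress, or CDN line items to estimate, unlike a typical S3 bill.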
If you want to use the graphical user interface, all you have to do is go right here, call it demo, for example, choose where you want your CDN to be, for example US East, make it public, and hit create. And there you go: now my bucket is created, and I can upload files right here, which is also very useful. But if you want to create one with the CLI, that's also possible. All you have to do is make sure that you have the Hugging Face CLI installed. In my case, I have it installed and up to date. If you do not have the latest version, just paste this right here, and it will reinstall the Hugging Face CLI so that you have the latest version. Once that is done, you can create a new bucket right here. So here you have the new buckets command; you have all the subcommands available, and you can, for example, create a new bucket. I'm going to call it demo-cli, like this. And there you go: now I have my demo-cli bucket right here. I can list the buckets I have available: as you can see, there's the demo that I created and the demo-cli that I created. And if I want to take a look at them, I just go to my profile, and I have my two buckets right there. Now, let's suppose I want to upload something to my demo bucket. Let's sync something small, say the Hugging Face skills folder for now. As you can see, all I have to do to access the bucket is to target it as though it were inside my file system, using hf://buckets/ and then the ID of my bucket, and it will just sync the whole thing right there, as you are seeing right now. And there you go: it synced all of my files, as you can see right here.
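The sync step above works rsync-style: only new or changed files move. A minimal, hypothetical model of that logic (not the real hf sync implementation; all names here are my own) compares local file hashes against a remote manifest:

```python
import hashlib
from pathlib import Path

def digest(p: Path) -> str:
    # Content hash used to decide whether a file changed.
    return hashlib.sha256(p.read_bytes()).hexdigest()

def plan_sync(local: Path, remote_manifest: dict[str, str]) -> list[str]:
    """Return relative paths that would be (re-)uploaded: files that are
    new, or whose content hash differs from the remote manifest."""
    to_upload = []
    for p in sorted(local.rglob("*")):
        if p.is_file():
            rel = str(p.relative_to(local))
            if remote_manifest.get(rel) != digest(p):
                to_upload.append(rel)
    return to_upload
```

Running this against a directory where the manifest already knows one file's hash would plan an upload only for the other files; Xet then applies the same idea again at chunk level inside each file.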
And now if I go into my buckets, I will see that my demo-cli bucket has all those files right here, and here I have my Xet hashes. So there you go: that is basically how you can use buckets. You can now instruct your AI agents to use these buckets, you can of course use them from the CLI or the graphical user interface, and, very importantly, you can also use them programmatically with Python. Let's just go to the documentation, and you'll see that everything I showed you with the GUI and the CLI you can also do with Python, using the huggingface_hub package. Just import create_bucket and create your bucket like we did; same thing for syncing, for managing files, for uploading files. It's very straightforward. So now your agents have access to this new set of tools, giving them persistent storage for their memory and their training runs. And even if you're not using an agent, if you just want to store your checkpoints somewhere easily reachable that looks like your file system but actually lives in Hugging Face storage buckets, this is probably the way to go. Let me know if you have any questions about this and how to use it. It has been a pleasure as usual, and I will see you next time.