YouTube catalog
"But OpenClaw is expensive..."
🔴 News

Local LLMs: how to save on AI with Nvidia RTX and DGX Spark

Matthew Berman · about 19 hours ago · Apr 13, 2026 · Impact 6/10
AI Analysis

The creator shows how to run local LLMs on Nvidia RTX GPUs or the DGX Spark to cut AI costs. A hybrid architecture lets companies save up to 90% of their AI spend by handling simpler tasks locally and sending complex ones to the cloud.

Key points

  • Using local LLMs to reduce AI costs
  • Hybrid architecture: cloud for complex tasks, local models for simple ones
  • Nvidia RTX GPUs and the DGX Spark for running models locally
Benefits

  • Up to 90% lower AI costs for routine tasks
  • Full control over data and improved privacy
  • Ability to customize models for specific business needs

Caveats

Local models require significant compute, especially the larger ones. It is important to match the model to the available hardware and to the task at hand.

Video description

OpenClaw is expensive and that's a problem. I've seen people spend upwards of $10,000 a month just using OpenClaw. I wanted to fix it. It takes a lot of money to process everything in the cloud. What if you could offload some of that to open-source models running locally? In this video, I'm going to show you how to do that using Nvidia's RTX GPUs or the DGX Spark. The nice thing about this is that you can do this even on RTX GPUs you're not using right now. So your old gaming laptop, your desktop, any of these can be part of your OpenClaw setup running these models. And by the way, this video is sponsored by Nvidia.

What we're going to do in this video is talk first about why you want to offload and use local models. Then we're going to talk about the actual hardware you need to power it. Then I'm going to teach you about my hybrid architecture approach, which is super powerful, so stay tuned for that. Then I'm going to walk you through real use cases that I'm using local models for in my actual production OpenClaw environment, and I'll show you how local compares to fully hosted models and give you some price estimates as well. So let's get into it.

If you're watching this video and you're gaining value from it, I highly suggest you like and subscribe to the channel. And so if you're watching this video and you're gaining value from it, go ahead and give me a like and subscribe, and let's see if this works. So, did you catch the difference? One of those was free, the other costs money. They are both Whisper models, but one of them runs locally and the other one runs in the cloud hosted by OpenAI. Check out the cost difference.

For most use cases, you don't actually need a frontier model. Local open-source models are incredible for 90% of use cases. You can cut costs, increase security, increase your privacy, and it is more personalized than if you were to go with only hosted models. The core concepts of this video work with any Nvidia RTX hardware, or if you have a DGX Spark they will also work on that, and that's what I'm going to be using today. And you don't need the latest, most expensive RTX hardware either. You can use local models on older RTX GPUs like the 30 series or the 40 series. The only real trade-off is model size. If you have a bunch of VRAM, you can fit bigger models and take on more sophisticated use cases. If you don't have as much VRAM, you just can't take on those cutting-edge use cases locally. But that's okay: again, the majority of use cases you can run with pretty average hardware.

The real point here is that you don't want to give these fully hosted frontier models like Opus 4.6 and GPT 5.4 these really heavy but more simplistic use cases. You're just churning through tokens and spending a ton of money, and it's not necessary in the least. You want to reserve those frontier models for the absolute cutting-edge use cases. The way we're going to get this running on our local machine, my DGX Spark, but you can get this running on any RTX machine, is LM Studio. It is what I recommend because it is by far the simplest to use. It comes with its own interface and it also determines which model can best fit on your machine. It just makes everything really simple. What I'm going to be teaching you today is called a hybrid architecture. We're going to be using both frontier models hosted in the cloud and open-source models hosted locally. So, let me show you what that looks like.
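As a rough sketch of how a locally served model gets called in practice: LM Studio exposes an OpenAI-compatible HTTP server (by default on localhost:1234), so any OpenAI-style client can talk to it. The model id and prompt below are placeholders; use whatever LM Studio lists for the model you actually loaded.

```python
# Minimal sketch: querying a model served by LM Studio's local,
# OpenAI-compatible server. URL/port are LM Studio's defaults and the
# model id is a placeholder; adjust both to your own setup.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:1234/v1",  # LM Studio's local server
    api_key="lm-studio",                  # any placeholder string works locally
)

response = client.chat.completions.create(
    model="qwen-3.5-35b-a3b",  # hypothetical id; use the name LM Studio shows
    messages=[{"role": "user", "content": "Classify this email: 'Invoice attached, due Friday.'"}],
)
print(response.choices[0].message.content)
```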
We're going to have two different parts of our system. Some of it is going to be served by our cloud models, our hosted models, things like Opus 4.6 and GPT 5.4. These are frontier models that are way too large to be hosted locally; plus, they don't even offer the open weights, so you can't host them locally. Then we have a number of fantastic models that we're going to be hosting on our RTX PC and on our DGX Spark. Those models can include things like Qwen, Llama, GLM. There are a ton of really powerful open-source models, and of course the most recently released one, provided by Nvidia: Nemotron.

For our most complicated use cases, we are going to use our Opus 4.6 and GPT 5.4 models. Things like coding, any and all coding, especially building the actual OpenClaw system or your agentic workflows: those should be done most of the time with cloud models. You want the best possible coding models to write your code for you. The same goes for planning: anything where you're doing complex planning and then delegating out to other models should be done by the best possible model you can get your hands on.

Now, here's the cool thing: everything else can be done with local models. Things like embeddings can be done locally. This is something almost any computer, regardless of how much VRAM you have, can do really easily. Embeddings just means taking a bunch of text data and making it easily searchable by a large language model. And remember that by doing things locally, these embeddings are also kept private, a key advantage of local versus cloud. As I showed you earlier, transcriptions can easily be done locally. And the other side of transcription, voice generation, can be done very easily locally; there are a number of very powerful text-to-voice models out there. PDF extraction is very easy for a local model to do. Anything with classification is very easy for these models: any relatively small Qwen or Nemotron model can easily handle classification. You can chat with these local models, and they do have personalities and they're really good at chat. So if you're not doing coding or orchestration planning, then you can use your local model as your chat model within OpenClaw or any agentic system.

And again, these models are getting better by the day, so the number of use cases that you'll be able to run with a local model keeps increasing. They are getting better at tool calling, at code writing, at agentic flows. All of these models are just getting better by the day. So you're going to keep finding use cases that you can offload to a local model, saving you money, increasing privacy, and everything else we talked about earlier. You might not use local models for coding today, but soon enough you will. These open-source models are getting smaller and better.

But how do you actually think about when to use an open-source model versus a fully hosted model like ChatGPT? And how do you know which use cases can actually be offloaded? Well, I do things in a very simple process. The first step is to experiment. That's when you're experimenting with different workflows, different automations, trying to figure out what's working. And at that point, the only thing you want to use is a frontier model. During the experimentation step, you're going to be figuring things out.
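A minimal sketch of that split, assuming both endpoints speak the OpenAI chat API. The model ids, URL, and task labels are illustrative, not the exact routes used in the video.

```python
# Hybrid routing sketch: coding/planning go to a hosted frontier model,
# everything else goes to the local endpoint. All names are assumptions.
from openai import OpenAI

LOCAL = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")
CLOUD = OpenAI()  # reads OPENAI_API_KEY from the environment

ROUTES = {
    "coding":         (CLOUD, "gpt-5.4"),            # hypothetical frontier id
    "planning":       (CLOUD, "gpt-5.4"),
    "classification": (LOCAL, "qwen-3.5-35b-a3b"),    # hypothetical local id
    "summarization":  (LOCAL, "qwen-3.5-35b-a3b"),
    "chat":           (LOCAL, "qwen-3.5-35b-a3b"),
}

def run(task_type: str, prompt: str) -> str:
    # Unknown task types fall back to the cheap local chat model.
    client, model = ROUTES.get(task_type, ROUTES["chat"])
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

print(run("classification", "Label as spam or not: 'You won a free cruise!'"))
```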
And that means you're testing different workflows, making sure they work, making sure the data is in the right place and formatted correctly, making sure your integrations are working correctly. That is all part of the experimentation phase. After experimentation, you want to productionize it: make it repeatable so it works with confidence. During the productionizing phase you are still using a frontier model, but this is where you start to look and think, okay, I could probably replace this one part with a local model, and you start figuring out which parts can be offloaded for the scale phase. This is the beginning of the transition to local models. At this point, you're going to start looking for opportunities to offload to a local model, testing edge cases, using it on real production data, and so on, just making sure it's very repeatable and you have confidence in it. If you think about a company, this is the phase in which you tell your employee to write down all of the processes so they can train the new guy joining the team. And then, after productionizing it, you want to scale it, and that is when you transition to a local model. You find use cases that you do repeatedly and you look for local models that can do them just as well as those expensive frontier models.

All right, so next, what does the actual architecture look like? I'm going to diagram mine, and yours might be a little different, so I'll diagram a few different versions of this. My OpenClaw system sits on a MacBook. From there, I have a number of different RTX and Nvidia powered machines that provide the GPU. So, let's say over here I have my 5090 machine and over here I have the DGX Spark, and they are both essentially acting as just a GPU that I SSH into. You can think of SSH-ing as attaching this external GPU to whatever computer you're SSH-ing from. SSH is kind of just like visiting a website: it just sends information back and forth, but it allows you to control the machine from any other machine. So in this instance, I have my MacBook controlling my 5090 machine and my DGX Spark, but each is basically acting as if it were just a GPU. The models live on these different devices, they get served to the MacBook, and the MacBook is where OpenClaw is hosted. That's how the connection works.

But you don't have to do it that way. Let's say you want to run everything on a single machine. That's totally fine. So you have your PC here, this is running your OpenClaw, and this is, let's say, a 5090, and all of your local models will be hosted here. You have the cloud right there, and it can always call a frontier model like Opus or GPT. The cool thing about OpenClaw is that you can do this from anywhere. So now I have my phone, and from my phone I can use Telegram, which talks to my OpenClaw instance; from the OpenClaw instance it will use the 5090 to power those local models, and of course, if you need to, you can always call that frontier model in the cloud.

So, how do you actually know how to SSH in? Well, you don't really need to know how. You can simply ask OpenClaw to do it for you. If you're on the same local network as the machine you want to SSH into, you simply say, "What machines are on my local network that I can SSH into?" And the only things you need to SSH into a machine are your username, your password, and your IP address.
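For the remote-GPU variant, here is a hedged sketch of how the MacBook might reach a model server running on the 5090 box or the Spark. The IP, user, and port are placeholders, and it assumes an OpenAI-compatible server (such as LM Studio's) is already listening on the GPU machine.

```python
# Reaching a model server on another machine on your LAN.
# Either point the client at the box's IP directly, or forward the port
# over SSH first, e.g. (placeholders):
#
#   ssh -N -L 1234:localhost:1234 user@192.168.1.42
#
from openai import OpenAI

# Option A: talk to the GPU machine directly over the local network
spark = OpenAI(base_url="http://192.168.1.42:1234/v1", api_key="lm-studio")

# Option B: after the SSH tunnel above, the remote server looks local
tunneled = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

# Sanity check: list which models the remote box is serving
print(spark.models.list())
```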
The IP address can be found simply by asking OpenClaw what machines are on my local network that I can SSH into. I'm not going to show it because these are my actual IP addresses, but you can see it listed out the different devices on my network, and then you can connect to one. And again, you don't really need to know how to do it; you just tell OpenClaw to do it.

Next, we're going to identify the use cases that are ripe for moving onto local models. I actually use Cursor for a lot of my OpenClaw building, so I already have a pretty complex set of model routes. Let me show you those. Here are the use cases that I've already identified as likely candidates for offloading to a local model. And the reason I know that is because I've already moved them from Opus 4.6, which is the best model that Anthropic offers, to Sonnet 4.6, which is more of a workhorse model; it's not as good as Opus. So I've already shown that these use cases can be done with less capable models, and here they are: notification classifier, company news relevance, CRM context extraction. All of these can easily be offloaded to a local model.

So on the Spark, in LM Studio, I've downloaded a Qwen 3.5 35-billion-parameter model with 3 billion active parameters. We select that and let's just test it. Now, it is a thinking model. It is also a vision model. We can easily turn off thinking, but we'll leave it on for now. And you can see it is just flying through those tokens, completely free, running on that DGX Spark. There we go: 65 tokens per second, plenty for all of the use cases I mentioned.

The first thing I'm going to do is hop into Cursor, which is where I build OpenClaw, and simply tell it to add that model to the config that lists all the available models. I'm simply going to say, let's add the Spark Qwen 3.5 35B-A3B model to OpenClaw as an available model we can use, and then we hit enter. I'm basically saying: SSH into that Spark, make sure you know how to reach the Spark and which model to use, and just add it to the config in OpenClaw. Then we're going to test it, make sure it works, and then plug it into one of our use cases. And by the way, this is the cool part: you don't actually need to know how to set any of this up. I'm using Cursor, but you can easily just go through Telegram and OpenClaw; OpenClaw will know how to do these things. You don't need to actually code anything. You simply type in natural language and it'll know how to do it for you.

So, here you can see in the model routing JSON we have the Qwen 3.5 Spark, and it pointed it right there for us. Beautiful. Okay, here we go: it added the Spark-hosted Qwen model. Perfect. And it routes it perfectly. Great. And it actually did our live smoke test, which is wonderful, and we got a result. Next, let's plug it into OpenClaw Telegram and see if it works. So, I have this OS test channel. I'm going to grab the actual channel ID, make the chat model for this channel the Qwen model we just set up, paste the channel ID in there, and it will set it up for me. First, let's just see which model is loaded. We do /status and we can see right here Spark Qwen, Qwen 3.5 35B. Perfect. We can see the context window is at 256K. Perfect. Now, let's type hello. All right. And so, it looks like we have our Qwen model working. This is so exciting. Check this out. I did /status, and we can see right here Spark/Qwen, Qwen 3.5. Excellent.
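The actual OpenClaw routing JSON isn't shown in the video, but the idea of a per-use-case model assignment might look roughly like this hypothetical sketch:

```python
# Hypothetical model-routing table; the real OpenClaw config will differ,
# this only illustrates per-use-case routing with a frontier fallback.
MODEL_ROUTES = {
    "notification_classifier": "spark-qwen/qwen-3.5-35b-a3b",
    "company_news_relevance":  "spark-qwen/qwen-3.5-35b-a3b",
    "crm_context_extraction":  "spark-qwen/qwen-3.5-35b-a3b",
    "coding":                  "anthropic/claude-opus-4.6",   # stays on a frontier model
    "planning":                "anthropic/claude-opus-4.6",
}

def model_for(use_case: str) -> str:
    # Use cases that haven't been offloaded yet fall back to the frontier model.
    return MODEL_ROUTES.get(use_case, "anthropic/claude-opus-4.6")

print(model_for("crm_context_extraction"))
```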
I started a new session and we can see it's still there. Although the default is Sonnet, we are currently using the Qwen model, and I can say, tell me a 100-word story, and let's see what it says. It should be relatively quick, too. And there, look how fast that was. It was almost instant. That's actually super impressive. Asking Sonnet to write a hundred-word story, hit the cloud, and come back with the response takes about 5 to 8 seconds in total. That took just a couple of seconds. Now I'm going to do a thousand-word story between Qwen 3.5 running locally and Sonnet 4.6 running in the cloud. Here it is. Timed. Let's go. Unreal.

And because the Spark has 128 gigs of unified memory, you can also fit much larger models on it, like Nemotron 3 Super 12B or Qwen 3.5 122B. These won't be as fast, but they're significantly more capable, which makes them ideal for tasks where quality matters more than speed. It's important to match your model to your hardware. A 30-billion-parameter model like Nemotron runs great on an RTX 5090, while the full 120-billion-parameter version fits comfortably on the Spark. So you can pick the right balance of speed and capability for whatever you're running. Remember, you can run this on other RTX hardware; it's just about right-sizing the right model to the right use case on the right hardware. I've actually found that the 30-billion-parameter range is perfect. You really don't need much more than that. It is the perfect balance of size and quality, and it fits on many consumer-grade GPUs: the 5090, probably the 4090, and you can definitely put it on the DGX Spark. Plus, there are different quantizations you can use. Specifically, I've been using Gemma 4 lately, the Nemotron family of models, and Qwen. Those are really the three main ones I've been using, for everything from extraction to classification, summarization, and text to speech. These are the use cases that I'm using local models for.

So, the first thing I'm going to do is show you my knowledge-base use case. Right now I'm using Sonnet 4.6 for it and it costs money: I drop a bunch of links in here and it definitely uses a lot of my daily and weekly quota with Sonnet. That's not necessary. I can use the Qwen model completely locally, completely free, no limit, and it'll work just as well. Okay, so we have now replaced the knowledge-base article ingestor, which is a use case that I use: I drop in articles and tweets and videos and it ingests all of it, embeds it, puts it in this big database, and I can always reference it and recall it later. Now it is powered by Qwen. So I will drop a link, hit enter, and now this should be using Qwen. It's going to do tool calls, go scrape the article, and ingest it into our database. The cool thing is the embeddings piece was already done by a local embeddings model, so we don't even need to switch anything there. So we're scraping the article, and then Qwen will summarize it for us to put into our database.

All right, so for example, we just did the knowledge-base summarization task and we offloaded it to Qwen. Now let me show you the difference. For the cost, we were paying, let's say, $12 to $20 a month. I am paying about $200 a month for the subscription and I only get a limited quota, so it's an estimate of about $12 to $20 a month in actual usage. Now, it is completely free. Everything I'm asking it, all of the articles I have saved, it is all completely local.
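A quick back-of-the-envelope way to do that "right-sizing": estimate weight memory as parameter count times bytes per weight at your chosen quantization. This ignores KV cache and runtime overhead, so treat the results as lower bounds; the hardware figures in the comments are illustrative.

```python
# Rough weight-memory estimate: parameters x bits per weight / 8.
def est_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

print(est_memory_gb(30, 4))    # ~15 GB: fits a 32 GB RTX 5090 with headroom
print(est_memory_gb(120, 4))   # ~60 GB: needs something like the Spark's 128 GB unified memory
print(est_memory_gb(30, 16))   # ~60 GB: the same 30B model unquantized no longer fits a 5090
```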
I'm already embedding it all locally, and now all of my questions and the answers stay local as well. Here's another use case: we replaced our CRM functionality, which used a frontier model, with Qwen. Now I can ask questions of my CRM, which I custom-built with OpenClaw, using that Qwen model. So I simply said, without showing the sponsor name because I don't want to dox them, summarize the last conversation we had with the last sponsor I talked to. And here it is: a summary of the emails and the video transcripts from the last time I spoke to this company. Perfect, there it is. So once again, $12 to $20 a month for the frontier model, but it wasn't necessary. Now, with Qwen, it is free. And here's the important part: all of that is stored locally, and previously, as soon as I started asking questions, it would have to hit a frontier cloud model and all of my data would be shared with them. Now it stays completely local. Nothing leaves my office.

So everything I've shown you today, optimizing for cost, for privacy and security, and for personalization, these are not just hobbyist hacks. The workflow I showed you, experimenting with frontier models, then productionizing, and finally scaling it up with local models, is very real. In fact, Nvidia believes in it so much that they just released the third version of their own open-source model, Nemotron. So they are going very big on open source. In fact, Nvidia also announced their own enterprise version of OpenClaw, called Neoclaw. This is Nvidia's extreme co-design in action. They're building the hardware, they're building the software to control the hardware, and they're also building and releasing, completely free, the open-source models for the world to use.

Now, after 10 billion tokens spent getting my OpenClaw to where it is today, the lesson is clear. Don't make the same mistake I did: offload your use cases to local models as often as you can. With your Nvidia RTX GPU or the DGX Spark, you can easily run local models to power your agentic workflows. Cheaper, more private, more customized, and just awesome knowing it's running completely on your own device. We took a look at a bunch of different use cases today, and if you offloaded those, just like I have, to local models, you could be saving hundreds of dollars a month in token quota or token costs. So you can either pay $300 a month using fully hosted models, or you could probably pay about $3 a month in just electricity costs running these models locally. The future is hybrid: the most complex use cases go to the cloud, and everything else runs locally. If you enjoyed this video, please consider giving a like and subscribe.
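For a sense of where an electricity figure like that might come from, here is a small worked estimate with assumed numbers (average draw, active hours per day, price per kWh); swap in your own values.

```python
# Monthly electricity estimate for a local inference box; all inputs are assumptions.
avg_watts = 300        # rough average draw under bursty inference
hours_per_day = 4
price_per_kwh = 0.15   # USD

monthly_kwh = avg_watts / 1000 * hours_per_day * 30
print(round(monthly_kwh * price_per_kwh, 2))  # ~5.4 USD/month under these assumptions
```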