Tutorial: SN50™ for Agentic AI Infrastructure | Future of Data and AI | Agentic AI Conference
Welcome back. Our next session is on solving agentic AI's infrastructure crisis: powering agentic inference with SambaNova. This tutorial demonstrates going from single-agent configurations to fully distributed systems. You'll use Google ADK, the A2A protocol, and Cloud Run to create and implement a scalable multi-agent architecture that mimics actual production processes: first creating and coordinating specialized agents with the Google ADK and the A2A protocol, then deploying a production-ready, scalable multi-agent system on Cloud Run. Kwasi Ankomah will lead this tutorial, so I'll hand it over to Kwasi to get started. >> Hey Rebecca. Just to set the scene: there may be a slight discrepancy with the abstract, because I don't think we'll be using Google ADK. I'll be talking through the infrastructure crisis, taking you through SambaNova, and covering some of the apps we've built. We can share some of the repos, but we won't be going into Google ADK in this one, so that's just to manage people's expectations. >> All good. >> So let me share my screen. Okay, perfect. Can everyone see that? I'll go with yes. Hi everyone, my name is Kwasi Ankomah. I'm a director of AI solutions engineering here at SambaNova. SambaNova is a full-stack AI inference company: we try to solve the agentic compute problem by serving agents in a way that is scalable, with compute that is efficient and, compared to other chip providers, very fast. What I'm going to talk about today is solving the infrastructure crisis for agents as we scale.
I think we've heard from everyone that agents run continuously now, which means they consume many more tokens, and therefore much more compute, so you need to scale that compute efficiently if you really want to scale agents, rather than simply adding infrastructure as you grow, which is not always possible. A little background on who we are. SambaNova was founded in 2017 out of Stanford; our co-founders are Stanford professors Kunle Olukotun and Christopher Ré, with Rodrigo Liang as our CEO and Luan as our chairman. We've raised well over a billion dollars, have around 400 employees, and are based in Palo Alto and London, with offices in Tokyo and Australia as well. So who are we, what do we do, and why does it matter for what we're going to talk about? We're a full-stack AI inference platform, which means we have our own custom chip, our own integrated rack-level system, and a runtime that lets us bundle open-source models, and we expose this through three products: our cloud, which I'll show you, SambaStack, and SambaManaged. What I'll go through here are some of the key reasons we think we're good at serving agents at scale. We tend to run at lower power consumption, which lets you scale well in terms of how much you can fit inside a data center. We also have very fast compute: we can run the larger models, the DeepSeeks and other 500-billion-plus-parameter models, at very high speed on one rack, or one node as we call it.
We also support custom checkpoints, via a concept called BYOC (bring your own checkpoint), for architectures we already support. So if you've fine-tuned your model elsewhere for agentic workflows, you can bring it onto our system as well. Before I move on, I'll talk to a few things that have happened recently. We've seen a huge rise in open-source usage, and I think everyone will have seen the announcement from Anthropic that they're changing their pricing model. Someone made a really interesting point that the old pricing model was based on a user interacting with a web portal: maybe you have two hours of heavy usage, but for the other 22 hours of the day the user isn't actually generating tokens or using your compute. With things like OpenClaw and always-on agents, you now have agents consuming tokens 24/7, and that requires a different sort of thinking, especially if you want to serve a lot of users at a particular throughput and token speed, which agents of course need, since many of them handle latency-sensitive work. You're also seeing a signal from the market that people are willing to pay for faster compute. You've seen it with Nvidia and the purchase of Groq, and the system they announced at GTC that tries to do prefill and decode on different architectures, one being GPU and the other custom silicon. We've also seen the signal in cost increases: Anthropic now charges roughly 6x to use fast mode. That sends a signal that fast compute is here to stay and that people are willing to pay for it.
So, what I'm going to do now is show you what that looks like. Before I go into this, I'm going to quickly go into the dashboard. If you go to cloud.sambanova.ai, you can follow along. To give you some material: we have AI starter kits with live code where we've built applications using SambaNova, so feel free to go and run the things I'm going to show you yourself. We also publish all of our code, including our Helm charts, so you can deploy all of this into your own Kubernetes. And we have an integrations page; I'll show you the documentation here. Within our integrations we also have a GitHub repo with lots of example code that lets you build with all of these integrations on SambaNova, so feel free to use that. As I said, I'll talk a little more about the chip and then go into some of the use cases and applications we're seeing, to show you what this looks like in real life. Going back to this screen, the key thing I want to get across is the speed of the inference. Go to our playground, and again you can sign up; I think we give $5 of free credit. If we take a model and say, "Write a binary search tree," the key thing I'm trying to show is that we're running an 8B model at around 1,100 tokens per second.
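As an aside from the editor: the playground sits on an OpenAI-compatible API, so a standard chat-completions request is all you need to reproduce the speed test above. This is a minimal sketch; the model name is an illustrative placeholder (check the cloud.sambanova.ai model list for real identifiers), and no request is actually sent here. It only shows the payload shape and the tokens-per-second arithmetic the playground displays.

```python
import json

# Hedged sketch: build an OpenAI-style chat-completions request body and
# compute decode speed the way the playground reports it. The model name
# below is a placeholder, not a confirmed catalog entry.

def build_request(model: str, prompt: str) -> dict:
    """Standard OpenAI-style chat-completions request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,  # stream the response to watch decode speed directly
    }

def tokens_per_second(completion_tokens: int, elapsed_s: float) -> float:
    """Decode speed as displayed in the playground."""
    return completion_tokens / elapsed_s

payload = build_request("Llama-3.1-8B-Instruct", "Write a binary search tree.")
print(json.dumps(payload, indent=2))
print(tokens_per_second(550, 0.5))  # 550 tokens in 0.5 s -> 1100.0 tok/s
```

Because the endpoint is OpenAI compatible, any OpenAI-style client can send this payload unchanged by pointing its base URL at the SambaNova cloud.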
So it's a huge speed increase over GPU, and you can do this for other models like GPT-OSS, which we also run a lot quicker than most of the GPUs in the market. You can see here we've got 769 tokens per second, so again, super fast inference on some of the larger models, which makes a big difference when it comes to agentic workflows. You can go and play with that at your leisure, and it's all OpenAI compatible, so it will integrate into any application you've got at the moment. And if you have any questions while I'm going through this, please shout and I'll try to answer them. Now I'm going to talk a little about what matters to agents in terms of metrics and how workloads are served. If you think about what's happening under the hood, you've got two key pieces: the client and the server. And there are three key metrics the agent cares about. First, end-to-end latency: from when I click the button or make a request, how long does it take the agent to complete its trace or task? Second, total cost per task: if I'm calling agent X and agent X makes 200 calls over its span, how much does that cost me? If you're paying per token, that can be a per-token cost; if you run your own infrastructure, you work that number out from your total cost of ownership. And third, the accuracy of the agent: if I'm running an insurance claim validation workflow, and we're working with clients doing similar things, this is really important.
How accurate was the agent in completing its task, in terms of getting the claims right, calling the right tools, and so on? Those are the key things. Now, what you've got here is how quickly the number of tokens scales: you've got the user request, maybe one call, then a plan, then the tool calls. Managing context is a huge part of this: how do we make sure we manage the context of our agents? If you think about what's going on under the hood of a planner and what matters for inference providers, it's basically that you want profitable margins, flexible infrastructure, and energy efficiency. What we've got here is what we call an agentic planner: you may have coding, embedding, something task-specific, and then a final output. And what you're tasked with is an unpredictable workload. That can be challenging for cloud providers and end users alike: if you're managing a compute budget for your team, how do you manage it properly and map the workloads of your agents? So here at SambaNova we've launched a new chip, which we call the SN50, and a new rack, the SambaRack SN50, which we're shipping in Q4. What we're saying is that we have a system custom-built to run these large models, one that gives you fast mode on frontier models at a fraction of the cost. Before I go into the underlying bits of the chip, I want to show you how that looks in practice.
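The three metrics above (end-to-end latency, total cost per task, and accuracy) can all be computed from an agent's trace. A toy sketch from the editor, with made-up per-million-token prices and a made-up 200-call trace, purely to show how the numbers compose:

```python
from dataclasses import dataclass

# Toy illustration of the three agent metrics from the talk. All prices
# and trace numbers are invented for the example.

@dataclass
class Call:
    latency_s: float
    input_tokens: int
    output_tokens: int

def trace_metrics(calls, in_price_per_mtok, out_price_per_mtok, correct, total):
    latency = sum(c.latency_s for c in calls)              # end-to-end latency
    cost = sum(c.input_tokens * in_price_per_mtok +
               c.output_tokens * out_price_per_mtok
               for c in calls) / 1e6                       # total cost per task
    accuracy = correct / total                             # task-level accuracy
    return latency, cost, accuracy

# e.g. an agent that makes 200 calls over its span
calls = [Call(0.4, 2000, 500)] * 200
lat, cost, acc = trace_metrics(calls, 0.10, 0.40, correct=188, total=200)
print(round(lat, 2), round(cost, 2), acc)  # 80.0 0.08 0.94
```

The point of the sketch: latency and cost both scale linearly with the number of calls in the trace, which is why per-call inference speed compounds so heavily for agents.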
So here we're using one of our apps. We've got a few; this one we call SambaChat. It's a fork of one of the popular chat apps that we've modified: we've included our own runtime and done some work under the hood to make it run the way we want, including on-prem. What I'm going to do here is run something: I'm going to enable our code interpreter and include this data set. The model I'm using is MiniMax, and again, we've published this code, so you can deploy it yourself with Helm charts. What I really want to show you is what the difference means when you actually run it. This is MiniMax running a totally autonomous data analysis pipeline, and the bit I'm trying to show you is the speed it's able to run at: reasoning, code generation, and execution all running really quickly. One thing we think is important here is that a lot of the speed you're seeing comes from different places. You've got the infrastructure on the CPU side, how fast your sandbox is, how fast your web search is, and so on. But here it's MiniMax doing all of this. What I fed it was basically a mini Kaggle competition.
I gave it a data set and said, go and find me the best model; for those of us who started off in data science, the stuff you used to do manually with pandas. And I told it to write a report. In less than two minutes, MiniMax has made all of those calls, generated a ton of tokens, and given me the output. The reason I think this is super important is that as you scale agents there are two things you really care about: as I said, end-to-end latency, and how fast the agent can work. What we do at SambaNova is make sure our token speeds are high, so that when you chain these agentic calls you can run coding agents a lot faster than we see on GPUs. A lot of our clients have been coming to us for this, but we wanted to show you what it means in real life. So here it's generated a report. If I click here, you can see it's gone through and done the EDA, generated the artifacts, and produced a really good report, eight pages, all in less than a minute and a half. The key unlock is having inference that is super fast. One of the key reasons we talk about the difference fast inference makes in real life is when you run thousands of these agentic workflows: say I've got a coding agent running autonomously, an analytics agent, or an agent doing visual analysis.
Having these calls come back quickly makes a big difference: it compresses a workflow that maybe took 30 or 40 minutes into something that takes seven or eight, and that's a huge unlock in terms of intelligence, because intelligence is really a function of how fast the model can think, since it can generate more reasoning tokens. I just wanted to give you a flavor of what this means in real life. And again, we've got these reference architectures, which is what my team and I work on, so you can deploy this on your own infrastructure and try it with SambaNova yourself. I'm going to go back now to set the scene, but before that I'll stop to see if there are any questions. I can't see any at the moment, so I'll continue. What we're giving here is a snapshot of what the SN50 allows us to do. On the hardware side, we've increased the compute over our previous chip, increased the network bandwidth, and increased throughput as well. We also have this thing called agentic caching, which I'll show you very shortly, and something called cloud scale. Now, what's really interesting, if you go to our next slide, is the bit we think is really important for agents going forward, and in general where we think compute is going; we've had this validated by what happened at GTC and the recent acquisitions. The general consensus is that there are things GPUs are really good at and things other custom providers are really good at, but both have drawbacks in terms of latency and scalability.
If you start on the far left-hand side, you've got the GPUs, B200s here, and of course they are very good at batching requests. Imagine I'm trying to batch a ton of loss-adjustment documents for processing: great, if I have no latency budget I can batch all those requests together, and GPUs are very good at that; they support very high batch sizes. However, if you're running an app that's interactive with a user, as lots of agents are, your latency is not going to be acceptable: you drop to something like 40 to 70 tokens per second, which in modern AI starts to look really slow, and users become disenchanted. On the far right-hand side you've got the super fast providers, someone like Cerebras for instance, who can reach very fast inference speeds, but there the problem is that scale is very hard: if I want to scale to support, say, a thousand users and get the throughput up, I need to drastically scale the amount of infrastructure I have to support it. So you have two extremes: on one side I get great batching but not the latency budget agentic applications need, and on the other I get the latency budget but have to massively scale my hardware to reach the throughput I need. What we've defined for ourselves is something we call the Goldilocks zone, where we deliver an acceptable number of tokens per second per user.
Essentially, we can hold throughput such that each user sees around 300 to 400 tokens per second, but on a much smaller hardware footprint, which makes it much more scalable: you can scale to many more users without standing up a huge number of nodes. That's what we think is going to happen with agentic inference, and you're already seeing a lot of it in the market: systems being built for disaggregated inference, which is the idea of doing the prefill on, say, a GPU and the decode on, say, an RDU, which is our chip. So serving workloads is starting to get specialized, and we think our chip is really well placed given where agents are going. We're seeing lots of demand from people who have tried to scale agents, and when they've tried to take them past a thousand or ten thousand users, they've run into either this hardware problem or this latency problem. This next chart is the same picture, but using GPT-OSS and the models I showed you, and you can see that to reach the same tokens per second and the same throughput, a plain GPU approach is far more expensive: you really have to scale the number of GPUs to get to that required roughly 600 tokens per second, versus using the RDU architecture. That's where the savings come from. The next thing I want to talk about, which I think is really important for agentic models, is something we call agentic caching.
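The Goldilocks trade-off described above can be sketched numerically. This is the editor's back-of-envelope model, and every aggregate throughput figure in it is an illustrative placeholder rather than a measured number for any specific chip: it only shows how per-user speed and node count trade off against batch size.

```python
import math

# Crude model of the batching trade-off: aggregate decode throughput is
# shared evenly across the batch, and node count scales with users/batch.
# All throughput figures below are invented for illustration.

def per_user_tps(aggregate_tps: float, batch: int) -> float:
    return aggregate_tps / batch

def nodes_for(users: int, batch: int) -> int:
    """Nodes needed to serve a user population at a given batch size per node."""
    return math.ceil(users / batch)

# Big-batch GPU node: high utilization, low per-user speed
print(per_user_tps(4000, 64), nodes_for(1000, 64))   # 62.5 tok/s/user, 16 nodes
# Ultra-fast low-batch node: great per-user speed, poor scaling
print(per_user_tps(2000, 4), nodes_for(1000, 4))     # 500.0 tok/s/user, 250 nodes
# "Goldilocks": 300+ tok/s per user while keeping the batch size high
print(per_user_tps(20000, 64), nodes_for(1000, 64))  # 312.5 tok/s/user, 16 nodes
```

The middle row is the scaling problem the talk describes: holding 500 tokens per second per user at batch size 4 costs 250 nodes for a thousand users, while a node that sustains high aggregate throughput at a high batch size serves them from 16.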
Again, I'll jump into a demo to show you a little of what I mean. Oh, okay, I'm actually seeing all of the questions coming through now; sorry, I didn't notice there was a Q&A button. "Do you offer an on-premise infrastructure solution?" Yes, we do. Our products are available in the cloud, or we have something called SambaStack, which you can put into your own data center, so we do have that working as well. "Wondering what the stack for this code interpreter is." So, the code interpreter is basically us. If you think about what a code interpreter is doing, it's basically just a runtime: some sort of x86 machine that's running. We actually built our own code interpreter runtime for security; we have customers who will not allow their code to go to third-party providers who do sandboxes. So we have our own sandbox runtime, isolated per user and per thread. It doesn't use LangChain or anything else; in this instance we're using LangGraph to orchestrate the agents, but as far as the agents are concerned, the sandbox is just a tool. That tool calls into the runtime, and we let the runtime expose very common sandbox tools, like upload file and run code, but in a really trusted environment. And we destroy the sandboxes; they're ephemeral, like all sandboxes should be. They persist for the thread, and once that thread is done, the sandbox goes away. Okay, so I can rerun that code and tell you what it was doing.
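The sandbox lifecycle described above (the interpreter as a plain tool, one ephemeral runtime per thread, destroyed when the thread ends) can be sketched as follows. This is the editor's toy illustration: it shells out to a local Python subprocess purely for demonstration, whereas the real system described in the talk uses its own isolated in-house x86 runtime, not a subprocess.

```python
import subprocess
import sys

# Toy sketch of a per-thread ephemeral sandbox tool. The subprocess here
# stands in for the real isolated runtime and provides no actual isolation.

class ThreadSandbox:
    """Ephemeral sandbox: one per conversation thread, gone when it closes."""
    def __init__(self, thread_id: str):
        self.thread_id = thread_id
        self.closed = False

    def run_code(self, code: str) -> str:
        assert not self.closed, "sandbox already destroyed"
        result = subprocess.run([sys.executable, "-c", code],
                                capture_output=True, text=True, timeout=30)
        return result.stdout

    def close(self) -> None:
        self.closed = True  # real impl: tear down the isolated runtime

_sandboxes: dict[str, ThreadSandbox] = {}

def get_sandbox(thread_id: str) -> ThreadSandbox:
    # Lazily create exactly one sandbox per thread.
    return _sandboxes.setdefault(thread_id, ThreadSandbox(thread_id))

sb = get_sandbox("thread-1")
print(sb.run_code("print(2 + 2)"))  # -> 4
sb.close()
```

In an orchestrator like LangGraph, `run_code` would simply be registered as a tool; the agent never sees the runtime, only the tool interface.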
So, the code I shared was doing analysis: EDA and model selection, like you would have done with scikit-learn and pandas, but all automatically. It went step by step, reasoning, finding out exactly what was happening, fixing its errors, and then producing a report. Happy to go through it a little more. "Is the SambaNova platform for building an AI factory?" Yes, our platform is built to handle enterprise AI beyond agents as well. We expose OpenAI-compatible endpoints, and what you do with those endpoints is completely up to you; our customers do all sorts of things. Of course, we see a huge shift toward agentic work, so that's one of the things we support, and the reason we talk about agents is that our hardware works particularly well with them, just because of the number of calls agents make and what that implies. Okay, so I'll talk about agentic caching now, and maybe come back to that question. "What metric do you recommend to measure the unit cost for faster inference?" That's a good question. What I would recommend is tokens generated per user, because that's the thing a user actually sees. You can run inference super fast, but how many users could it support? What is the throughput per user? I think that's a really good way of measuring it. Okay, I'm just going through some of these questions; I can review them as well.
Robert Wilcox asks, "Do you have a GitHub of the project you created for building models?" I don't quite understand the part about building models: we use open-source models, we don't build the models ourselves. But if you're talking about the app, yes, we can share some of the GitHub repos we use there as well. "Does the SN50 get provisioned through one of the cloud providers?" Not yet, but we are working with some of the cloud providers; we're making it easy to use our compute through your existing cloud infrastructure. We're making good progress there, so hang tight. Okay, thank you for all the questions. I'll go back to this and return to the questions in a second. So, agentic caching. What is agentic caching? There are two things to think about in your agentic workflow. The first is that if you call an agentic flow, it's likely you're using more than one model: I may use one model for the planner, a specific audio model, maybe a multimodal model. Now, what that usually means on GPU infrastructure is that I need to stand up a different node to serve each of the models. You can of course load and unload different models on a GPU, but the time to do it is very long.
What we call agentic caching is the fact that, because of our three-tier memory, we have a lot of off-chip memory, DDR in addition to HBM, that we can use to host these models and swap them in super fast. Why that's important comes back to the whole point here, scaling through the infrastructure crisis: if I can collapse the footprint of my models onto a smaller hardware footprint, I can use them quicker and get more inference for my buck, because I can swap models really fast. Take a Llama 70B: uncached on a GPU it's around 8 seconds, though in practice you'll likely have the model at least cached. If we compare cached GPU to the SN40/SN50, there's a huge difference: 1.5 seconds versus 0.2 seconds. Why this becomes super important is that if I'm doing agentic workflows and I want to swap models, a slower model-swapping time compounds. In the workflow I just showed you, if I were changing between loads of different models, I'd have 1.5 seconds added each time, and over, say, 70 calls across five or six models, that really starts to add up. On Llama 8B the effect is not quite as pronounced, because it's a smaller model, but the biggest case is DeepSeek: we're one of the few providers that runs the current DeepSeek models at full precision. When we swap it in on the SN40/SN50 from cache, it's 0.5 seconds, whereas a GPU is 6 seconds. That's where you start to see a really big difference in model-switching time, and by being able to swap models quickly, we can run workflows super quickly as well.
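The compounding claim above is simple arithmetic, using the cached swap times quoted in the talk for a Llama 70B checkpoint (roughly 1.5 s on a GPU versus roughly 0.2 s on the RDU); the 70-swap trace is the talk's own illustrative figure:

```python
# Arithmetic behind the swap-time compounding claim. Swap times are the
# cached figures quoted in the talk; the 70-call trace is illustrative.

def added_swap_overhead(n_swaps: int, swap_s: float) -> float:
    """Total time spent purely on model switching across one trace."""
    return n_swaps * swap_s

gpu = added_swap_overhead(70, 1.5)  # cached GPU swap time per the slide
rdu = added_swap_overhead(70, 0.2)  # cached SN40/SN50 swap time
print(gpu, round(rdu, 1), round(gpu - rdu, 1))  # 105.0 14.0 91.0
```

So before any decode speedup, a trace with 70 model switches carries about a minute and a half of pure switching overhead on the slower swap path.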
What I'm going to do now is show you what I mean by that. Okay, I think my screen sharing just stopped; I'm not sure why, give me two seconds while I find out what happened. Okay, we've got it back, so I'll share my screen again. Apologies about that. What we've got here is another application. I'm going to share some sound, and then ask a question, something like: analyze Apple's financials for me. The reason I'm running this is to show you another app we made; it's called Agents, it's on chat.sambanova.ai, and my team and I built it. On the right-hand side you're seeing a load of agents run: a risk assessment agent, a fundamentals agent, and so on. What's really cool, and what I'm trying to show you, is what model swapping means. We used three different models here: MiniMax, Llama 3.3 70B, and Llama 4 Maverick, and they all made different numbers of calls. What we mean by hot-swapping models with the agentic cache is that you can run all of this on one node, on the same infrastructure, rather than routing to a different node per model. To show why that matters, look underneath at what's happening: MiniMax started, then we switched to Llama 3.3, then Llama 3.3 made some parallel tool calls, and finally Maverick did the news aggregation.
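The per-agent model routing visible in that demo can be sketched as a small routing table. The agent names and model identifiers below mirror what the demo shows (MiniMax, Llama 3.3 70B, Llama 4 Maverick), but the table itself is the editor's illustration, not SambaNova's actual configuration:

```python
# Illustrative routing table: each specialized agent pins a model, and
# the serving layer hot-swaps checkpoints on one node instead of
# dedicating a node per model. Names are illustrative placeholders.
AGENT_MODELS = {
    "planner": "MiniMax",
    "risk_assessment": "Llama-3.3-70B",
    "fundamentals": "Llama-3.3-70B",
    "news_aggregation": "Llama-4-Maverick",
}

def models_needed(trace: list[str]) -> list[str]:
    """Distinct checkpoints the node must hot-swap, in first-use order."""
    seen: list[str] = []
    for agent in trace:
        model = AGENT_MODELS[agent]
        if model not in seen:
            seen.append(model)
    return seen

trace = ["planner", "risk_assessment", "fundamentals", "news_aggregation"]
print(models_needed(trace))  # three checkpoints served from one node
```

On a node-per-model GPU layout this trace would touch three nodes; with fast checkpoint swapping, the same three models are served from one.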
You can see what happens when you're switching models: those small delays, once you compound them over thousands of tool calls, thousands of users, and thousands of traces, add up to a significant difference. The easiest way I can show this is by showing what happens behind the hood as these models get called, and the fact that we can switch them in super fast makes a huge difference. So that's a nice way of looking at the model swapping and agentic cache side of things. Okay, let me go back to this, and a little more about the chips. We're able to scale to over 256 accelerators with a multi-terabyte-per-second proprietary interconnect, which lets you scale up to roughly 10-trillion-parameter models, with huge context lengths as well. We all know context length is another big differentiator: Opus has a 1-million-token context and that's a big advantage, and when Gemini had the longest context, it was a huge deciding factor for many people who needed that long context. This is a product comparison summary: where we think the SN40 sits, which is the chip we have now and the chip these demos are running against. The SN50 will be significantly faster, with only a small increase in power draw. For those of you into the hardware, we run our SN40s at around 10 to 12 kW, compared to around 120 kW for the equivalent GPU deployment, so an order-of-magnitude saving in power, and I think the SN50 will be around 20 to 25 kW.
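The power figures quoted above translate directly into data-center density. A back-of-envelope sketch from the editor, using the talk's own numbers (roughly 12 kW per SN40 rack versus roughly 120 kW for an equivalent GPU deployment); the 1 MW envelope is an arbitrary illustration:

```python
# Density math from the power figures quoted in the talk. The 1 MW
# budget is an arbitrary example envelope.

def racks_per_mw(kw_per_rack: float, budget_mw: float = 1.0) -> int:
    """How many racks fit in a fixed data-center power envelope."""
    return int(budget_mw * 1000 // kw_per_rack)

print(racks_per_mw(12))   # ~83 SN40-class racks in a 1 MW envelope
print(racks_per_mw(120))  # ~8 equivalent GPU deployments in the same envelope
```

This is the "how much you can fit inside a data center" point from earlier in the talk: an order of magnitude less power per rack means an order of magnitude more racks per fixed power budget.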
And again, we're very good at running the big models, and we think our power efficiency is strong. What I was trying to show with the other providers is that each is good at some things. Cerebras, for example, is excellent at raw inference speed, but falls down elsewhere: power efficiency in terms of tokens per watt, kilowatts per rack, and agentic caching — the ability to swap models in quickly. So this comparison gives you an idea of where we think we play well. And I want to be clear: there are some workloads where we might not be the right provider. If all you're optimizing for is speed and you don't care about efficiency — you need ultra-fast and you're willing to build billions of dollars of data centers — that's one thing. And if batching and throughput are your key metrics and you're not latency sensitive, then the other side of that graph might suit you better. But what we're seeing with this agentic push is that lots of people need both the speed and the throughput to serve their users. A little about who we work with: we have three products, as I said — SambaManaged, SambaStack, and SambaCloud. We have a big sovereign AI play at the moment, where customers take our racks and either use our cloud white-labeled or run it themselves, and we also work with neoclouds who use us.
It's designed for companies who need to scale up their inference very quickly, at the fastest speeds. With that, I'm watching the time — I want to leave room for questions, so let me go into the Q&A. There are a lot of questions here. First: which GPU is it — NVIDIA or AMD? Neither. We make our own chips: it's an RDU, a Reconfigurable Dataflow Unit. SambaNova builds its own custom silicon — our own chip, our own runtime, our own compiler — so we're an alternative to GPUs, and we think we have very compelling inference speeds. We don't use GPUs. Next: can you share more about platform support for agents, or about security? We concentrate on the hardware side, and we have lots of integrations into the surrounding platforms — observability tools like Langfuse, MCP, frameworks like ADK and LangGraph. My team builds reference architectures, and we can share those links; a lot of the demos I showed are shared openly, and we show customers how to use the platforms. So we support agents through integrations. We're not a solutions company — we give you reference architectures that we think you can use on our software and hardware. Next: when hot-swapping models, how do concurrent requests from different users work — are they queued? Great question. Yes, they are.
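One way to picture queued concurrent requests across hot-swapped models is a per-model queue that gets drained whenever its model is resident. This is a sketch under that assumption, not SambaNova's actual scheduler; the model names and API are illustrative.

```python
# Illustrative per-model request queueing (not the real scheduler).
from collections import defaultdict, deque

class ModelRouter:
    """One FIFO queue per model; a queue is drained when its model is swapped in."""
    def __init__(self) -> None:
        self.queues: dict[str, deque] = defaultdict(deque)

    def submit(self, user: str, model: str, prompt: str) -> None:
        self.queues[model].append((user, prompt))

    def drain(self, model: str) -> list[tuple[str, str]]:
        """Called when `model` becomes resident: serve everyone waiting on it."""
        served = []
        while self.queues[model]:
            served.append(self.queues[model].popleft())
        return served

router = ModelRouter()
router.submit("alice", "Llama-3.3-70B", "summarize ...")
router.submit("bob",   "DeepSeek-R1",   "plan ...")
router.submit("carol", "Llama-3.3-70B", "extract ...")
batch = router.drain("Llama-3.3-70B")  # alice and carol served in one residency
```

Batching all waiting requests for a model into one residency window is what keeps fast swapping compatible with many concurrent users.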
We have work running in the background, like a router, managing the queue: based on the scheduling algorithm, we swap models and make sure each request gets put into the best queue for it — that it goes to the node we think can best serve that traffic. Next: how do you decide which model to use first, and which to switch to over time? I think this is really interesting. If you're doing reasoning tasks, you want the stronger models. Let me give you an example — it's easier if I go back to the demo and into full screen. I think everyone can see my screen. So, when do you know which model to use? It very much depends on the task. If you're doing something reasoning-heavy, I'd always recommend your reasoning model; for the smaller tasks, you can then hand off to another model. For instance, if I write something like, "Get me the latest stories in AI as of April, and create a PDF with the stories and source links, please" — when I hit go, the model that first handles this, the one that does the routing, needs to be your most powerful model. You need a very capable model to do all the routing.
Now, the agents underneath it — maybe your coding model, or the one doing the actual web search — don't need to be as powerful, because they have a very clear task. You can set them off as sub-agents, and you don't need to waste tokens there or pay for an expensive model. My general rule: the more reasoning-heavy the step — the planners — the more parameters and intelligence you want; the more task-specific the step, the further you can come down to smaller parameter sizes and more specialist models. I hope that helped. Next: what breaks first when agent workloads scale — latency, cost, or accuracy? Honestly, given the quality of the models we've got, the two that break first are latency and cost, and those are what we're trying to solve at SambaNova. Latency first: if you try to scale and you want all the intelligence, you just have to wait — it takes a long time. That's why fast modes were developed in the first place, and why the moves by NVIDIA and Groq made sense: they knew they needed something to get the latency right. Then, once you fix the latency, the next thing that breaks is cost — you fixed the latency, but at what cost? Do I now have to scale up to 70 nodes because I need the tokens per second I showed on that graph? Accuracy, depending on the model and the guardrails, is of course always a concern, but we're getting very good in certain domains — if you have good architecture, good guardrails, and a well-trained, well-directed model. For me, the two that break first are latency and cost, for sure.
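The rule of thumb above — biggest model for planning and routing, smaller specialists for narrow subtasks — can be written down as a tiny lookup. A sketch, with illustrative model names (not a prescribed mapping):

```python
# Task-type to model mapping per the "reasoning-heavy => big model" rule.
# Model names are illustrative stand-ins.
def pick_model(task_type: str) -> str:
    reasoning_heavy = {"planning", "routing", "reasoning"}
    if task_type in reasoning_heavy:
        return "DeepSeek-R1-671B"       # strongest model for the top-level planner
    return {
        "web_search": "Llama-3.1-8B",   # clear, narrow task: small and fast
        "coding":     "Llama-3.3-70B",  # mid-size specialist
    }.get(task_type, "Llama-3.3-70B")   # reasonable mid-size default
```

With hot swapping, the cost of using several models like this is low, so the mapping can be as fine-grained as the workflow needs.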
Next, on inference economics: how do you measure the real cost per agent task across many tool calls? It's not simple. Use an observability platform — we lean open source and use Langfuse; LangSmith from LangChain is another, and there are many out there. In the back end, on a per-token basis, you can see how many tokens you used. For tool calls, you then have to do some napkin math: if it's a managed service, what does each call cost you — what's the API cost? If it's more like infrastructure — for instance, we have to run our own sandboxes for private workloads, and as a client you'd have to do the same — then you have to factor in the cost of that infrastructure too. So: model cost, plus API cost for tool calls, plus the cost of the infrastructure running those tool calls. If you're running a RAG database, you might have to spin up an EC2 instance to put Postgres on, and so on. Next: why does high tokens per second matter so much for agentic workflows? Very good question. It matters because you're making so many calls. When we first started with LLMs, you'd make a single call — "go and do this for me" — and it came back; you could live with a speed decrease. But now, as I've shown in these examples — the one with the Netflix dataset probably had about 35 LLM calls in it — once that compounds, it gets really slow.
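The napkin math just described — model tokens plus tool-call fees plus amortized infrastructure — as a sketch. All the prices and counts below are illustrative assumptions, not real quotes:

```python
# Cost per agent task = token cost + tool-call API fees + amortized infra.
def task_cost(tokens_in: int, tokens_out: int,
              price_in_per_m: float, price_out_per_m: float,
              tool_calls: int = 0, tool_call_fee: float = 0.0,
              infra_per_task: float = 0.0) -> float:
    model = tokens_in * price_in_per_m / 1e6 + tokens_out * price_out_per_m / 1e6
    return model + tool_calls * tool_call_fee + infra_per_task

# Assumed prices: $0.60/M input, $1.20/M output tokens; 35 tool calls at
# $0.002 each (a managed search API, say); plus a slice of an EC2 box
# running Postgres for RAG, amortized to $0.01 per task.
cost = task_cost(200_000, 50_000, 0.60, 1.20,
                 tool_calls=35, tool_call_fee=0.002, infra_per_task=0.01)
```

The point is that for a 35-call agentic task, the tool-call and infrastructure terms are no longer rounding errors next to the token bill.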
So tokens per second really starts to matter: if that workflow takes 10 to 12 minutes on a GPU, which is what we normally see, we can collapse it down to 1 to 2 minutes — an order of magnitude faster at getting your work done. It matters because more tokens are being generated in the first place, due to reasoning, and because you're making more LLM calls. Next: how does your cache reduce model-switching time in multi-agent systems? I went through this a bit. Because we have this agentic cache with low switching times, in a multi-agent system like the one I showed, we can swap out models in milliseconds, so switching doesn't add much to your total end-to-end latency. Happy to go deeper, but the nuts and bolts are: we have three-tier memory, and we can move models very quickly between those tiers, which lets us reduce the switching time. Next: how do you support thousands of concurrent agentic users without huge infrastructure growth? The answer builds on the last couple. One, I can swap models very quickly. Two — let me go back to that graph — I can serve each user at higher tokens per second on a lower infrastructure footprint. So to scale, I have to add less, because I can support more in terms of throughput. That's the big SN50 unlock.
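The "three-tier memory" description above suggests a promotion/eviction scheme for model weights. The sketch below shows the idea with two tiers (the third, bulk tier is omitted for brevity); tier names, capacities, and the LRU policy are all assumptions on my part, not SambaNova's actual design.

```python
# Illustrative tiered storage for model weights: a tiny "hot" tier holds
# resident models; evicted models fall back to a slower "warm" tier.
class TieredModelStore:
    def __init__(self, hot_capacity: int = 2) -> None:
        self.hot_capacity = hot_capacity
        self.hot: list[str] = []   # fastest tier, LRU order, most recent last
        self.warm: set[str] = set()  # everything not currently resident

    def load(self, model: str) -> None:
        self.warm.add(model)

    def activate(self, model: str) -> list[str]:
        """Promote a model to the hot tier, evicting the least-recent if full."""
        if model in self.hot:
            self.hot.remove(model)             # already resident: refresh recency
        else:
            self.warm.discard(model)
            if len(self.hot) >= self.hot_capacity:
                self.warm.add(self.hot.pop(0))  # evict least-recent to warm tier
        self.hot.append(model)
        return self.hot

store = TieredModelStore(hot_capacity=2)
for m in ("MiniMax", "Llama-3.3-70B", "Maverick"):
    store.load(m)
store.activate("MiniMax")
store.activate("Llama-3.3-70B")
hot = store.activate("Maverick")   # MiniMax falls back to the warm tier
```

A swap that only promotes weights from a nearby tier, rather than reloading from scratch, is what makes millisecond-scale switching plausible.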
I'm happy to go deeper, but that really is the unlock: because I can keep the infrastructure footprint compact — high tokens per second, low power, agentic cache — I can scale out further. Next, on architecture: what's the difference between GPU batching and our RDU approach? I don't think there's much difference in the batching itself; it's more about the hardware. GPUs are very good at batching, but the problem is that per-user latency suffers as batch size grows. So it's not a different batching approach so much as the fact that we can run at a high enough batch size while still delivering really good tokens per second per user. That's the key difference — happy to get into it more. Next, use cases: where do we see people using SambaNova? A couple of examples. If you're latency sensitive, people use SambaNova. We work with a company called Hume, a voice provider; they run their models on our RDUs because of the very low latency. So if you have a latency-critical budget, we're very good on tokens. The other one: many data center providers come to us because they have floor space but can't support the power needs of the more power-hungry systems — over 120 to 130 kilowatts per rack. They say: we've got the space and want to run compute, but we don't have a lot of power.
So that's another real reason people come to us — a good use case on the hardware side. In terms of big application use cases: code generation is a huge one. I just showed you an example — we integrate with big open-source tools like OpenCode, and you can use MiniMax and get an excellent coding experience at a fraction of the cost. That's a big unlock: it lets people either run more complex agentic workflows quicker, or run more of them and support more users. So coding is really taking off; then voice and anything latency sensitive; then constrained power; and the last one is security. Someone asked at the top of the meeting: can you bring the racks on premises? Yes. Some of our key customers are US national labs, who are of course very security conscious, and all their racks are on-prem. If you're security conscious and your data can't leave, we're a really good fit, because everything I've shown you can be delivered into a data center on premises or at a colocation. Those are the areas where we see a natural fit for SambaNova. Next: the PDF demo was lightning fast — can you share it? Sure, we can share the repo for that. Follow-up: does it choose the models autonomously? No — good question. You can have routers that say, "this looks like this kind of query, route it to that model," so you could do things like that.
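A router of the kind just mentioned can be sketched as a query classifier in front of the model choice. Keyword matching below stands in for a real classifier (which might itself be a small model), and the model names are illustrative:

```python
# Illustrative query router: "this looks like this kind of query,
# route it to that model." Keywords stand in for a real classifier.
def route(query: str) -> str:
    q = query.lower()
    if any(k in q for k in ("prove", "plan", "analyze", "why")):
        return "DeepSeek-R1-671B"   # reasoning-heavy -> strongest model
    if any(k in q for k in ("search", "latest", "news")):
        return "Llama-3.1-8B"       # narrow retrieval task -> small, fast model
    return "Llama-3.3-70B"          # sensible mid-size default
```

In the demo the assignment was predetermined by hand; a router like this trades that control for automatic selection per query.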
But what I'm finding is that when people design an agentic application, they generally know pretty well which model they want for which step. You can also put routers in, but for this example I chose the models myself — it was predetermined. I specifically mixed my more powerful models for the reasoning with my cheaper, smaller models for the task-heavy work. I don't need DeepSeek's 671B model to go and do a web search; I can use a Llama 8B or 70B model — far quicker, far cheaper, far more efficient. But you can add routers if you want that. Next: what are your thoughts on context caching? That's a whole other subject — prompt caching and the like. We're currently working on prompt caching, as a lot of people are, and it's something we'll look to bring into the platform in the future; it will make a big difference, especially for coding solutions. Okay, I think I've answered most of the questions, and we're coming up on time, so I'm going to stop sharing my screen. I think that was it — it's 12:00. >> Thank you so much. Thank you for walking through the SambaNova platform, going through all the details, and answering our audience questions. It was very productive — appreciate you going through that.