Hugging Face's open audio models: speech synthesis and transcription for business
Hugging Face announced open audio models: Mistral's Vauxra 4B (speech synthesis) and Cohere's transcription model. This will speed up audio processing for business, from content generation to analyzing support calls.
Key takeaways
- Mistral AI released Vauxra 4B, a state-of-the-art text-to-speech model.
- Cohere released a speech-recognition model that works in many languages and is very fast.
- Hugging Face introduced Storage Buckets, HF Mount, and HF Jobs for scalable audio processing.
Cohere's transcription is faster than the Whisper API, but it requires integration with Hugging Face infrastructure.
The models are open, but the infrastructure (Storage Buckets, HF Mount, HF Jobs) ties you to Hugging Face. This may limit flexibility for teams that want to use their own compute.
Video description
Wow, it's been an amazing week for open audio models: models that speak, models that transcribe. Just today, Mistral released Vauxra 4B, a state-of-the-art text-to-speech model. You can try it in this Space built by Mistral. Let's type in some text ("What's an agentic harness? What does it mean?") and pick Curious Jane as our speaker. This model is really fast, and expressive. So that's Vauxra TTS; try it out on Hugging Face.

Next, I want to give some props to Cohere. Yesterday they released an amazing model for speech recognition, meaning the opposite direction: turning speech into text. It's a state-of-the-art model. It works in many languages, it's really fast, and it's only two billion parameters, which means it can run basically anywhere. In fact, the transformers.js team built a demo of transcription running in the browser, meaning no cloud required: everything works inside my Chrome browser. You can pick the language, record some audio, and the transcript is there almost instantly. The model is super fast, so it's great if you want to use it on a massive amount of video content or recordings. The license is super permissive: Apache 2.0.

So how do you put that into production at large scale? Daniel built UV scripts that use all the latest features from Hugging Face (Storage Buckets, HF Mount, and HF Jobs) to do everything in two lines in your terminal: one line with HF Jobs to run the script that downloads everything into a storage bucket, and a second line to do the actual transcription. You can find the actual scripts there and modify them.
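The "two lines in your terminal" workflow described above might look roughly like the following sketch. The script names, flags, and bucket name here are assumptions for illustration, not the actual commands from Daniel's UV scripts; the commands are only echoed, since really running them requires a Hugging Face account and compute.

```shell
# Illustrative sketch only: script names, flags, and the bucket name are
# hypothetical, not the actual commands from the UV scripts in the video.
# The commands are echoed rather than executed, because running them
# needs Hugging Face credentials and on-demand compute.
BUCKET="my-audio-bucket"

# Line 1: run the download script on HF compute, writing into the bucket.
echo hf jobs uv run download_audio.py --bucket "$BUCKET"

# Line 2: run the transcription script over everything in the bucket.
echo hf jobs uv run transcribe_bucket.py --bucket "$BUCKET"
```

The point of the pattern is that the heavy lifting (downloading and transcribing) happens on Hugging Face's compute, with the storage bucket as the shared state between the two jobs.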
The second line runs the other script, which uses Cohere's transcription model to transcribe everything in your storage bucket, and it does this using HF Mount and HF Jobs. You can find the results of the script in the storage bucket, and it ran really quickly.

So I mentioned Storage Buckets, HF Mount, and HF Jobs. What are they? A storage bucket is a new kind of repository available on Hugging Face, alongside your models, your datasets, and your Spaces. It is basically Hugging Face's take on what AI-native storage should look like. It's fast: you only transfer the bits you are going to use, thanks to deduplication, and you can keep the data close to where it will be processed, thanks to the CDN. It's perfect for research teams that want to scale training and for deploying agentic applications.

Next, HF Mount. It's an open-source project that lets you take those storage buckets and mount them on your local machine as if they were a thumb drive. You don't need to download everything: it uses streaming in the background, so it's a great way to work with massive datasets without clogging up your hard drive or transferring all the data.

Last, HF Jobs. It's a way to use on-demand Hugging Face compute to run scripts, like the UV scripts I mentioned earlier, or to do any kind of training or other job. Many GPU and CPU instances are available, and you can launch everything directly with one line in your terminal using the HF Jobs client. So, yes: an incredible week for open audio models. Find them on Hugging Face.
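The claim that storage buckets "only transfer the bits you use thanks to deduplication" can be illustrated with a toy content-addressed store: files are split into chunks, each chunk is keyed by its hash, and a chunk the store already holds is never uploaded again. This is a minimal sketch of the general idea, not Hugging Face's actual implementation (real systems chunk at kilobyte scale, not 4 bytes).

```python
# Toy sketch of content-addressed deduplication: only chunks the store
# has never seen are "transferred". Not Hugging Face's implementation.
import hashlib

CHUNK_SIZE = 4  # tiny for demonstration; real systems use far larger chunks

def upload(store: dict, data: bytes) -> int:
    """Add data to the store chunk by chunk; return bytes actually transferred."""
    transferred = 0
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        key = hashlib.sha256(chunk).hexdigest()
        if key not in store:  # only chunks the store lacks cross the wire
            store[key] = chunk
            transferred += len(chunk)
    return transferred

store = {}
first = upload(store, b"aaaabbbbcccc")   # all three chunks are new
second = upload(store, b"aaaabbbbdddd")  # two chunks already stored
print(first, second)  # 12 4
```

The second upload shares two of its three chunks with the first, so only 4 of its 12 bytes are transferred; at dataset scale, this is why re-uploading a slightly changed file is cheap.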




