YouTube catalog
New Tests Reveal The Truth About China’s AI Progress...
🔴 News

The real state of Chinese AI: new tests reveal a lag

The AI Grid · 8 days ago · Apr 6, 2026 · Impact 6/10
AI Analysis

A new analysis of the ARC AGI 2, Pencil Puzzle Bench, and FrontierMath benchmarks found Chinese AI models lagging Western ones by roughly 8 months. This points to a significant gap in AI development, despite claims of rapid Chinese progress.

Key points

  • Chinese AI models trail Western ones by roughly 8 months in novel reasoning and problem solving.
  • Benchmarks such as Pencil Puzzle Bench, which test logical reasoning, show Chinese models lagging far behind American ones.
  • SWE-rebench reveals that Chinese models rely on benchmark-specific optimization rather than generalizable intelligence.
Opportunities

  • An opportunity for Western companies to keep their AI lead by investing in research and development.
  • A need for more objective and comprehensive benchmarks for evaluating AI models.
  • An incentive for Chinese companies to focus on fundamental reasoning skills rather than benchmark optimization.

Nuances

Benchmarks can be incomplete or biased, so it is important to consider the context and the specific tasks a model is built for. Moreover, a focus on LLMs may overlook other areas of AI where China may hold an advantage.

Video description

So, this is a video I've wanted to make for a while: what is the truth about China's AI progress, and does it actually live up to the hype? Recent results say otherwise, so let's dive into it. I saw a tweet from the ARC Prize that said "international models on the ARC AGI 2 semi-private". Now, if you don't know what ARC AGI 2 is, it's a test specifically built around things you can't fake: you cannot brute force it with more data or distill it from other models. If you are taking ARC AGI 2, you need genuinely novel problem solving. That's why these scores are so interesting. Chinese labs, if you haven't been paying attention, have seemed to be catching up with Western labs. But look at what happens on ARC AGI 2, a benchmark that only tests novel reasoning, meaning you cannot prepare for the test in any way; it just tests the fundamental reasoning of the model. You can see that Kimi K2, MiniMax M2.5, GLM 5, and DeepSeek 3.2 all score below the July 2025 frontier models, meaning the current state-of-the-art Chinese models are comparable to what Western labs released 8 months ago. That July 2025 frontier line is brutal: ARC Prize is saying that today's best Chinese models can't even beat what Western labs were doing eight months ago. That's not a gap; that's a generation behind.

Now, for those of you thinking "this is just clickbait, one bad benchmark," trust me, the deeper we get into this video, the more you will see. Remember, ARC AGI 2 specifically tests reasoning you cannot fake: you can't brute force it with more data, you can't distill it, and it requires genuinely novel problem solving. And remember, there are plenty of claims flying around in AI right now about whether Chinese companies are behind us or right behind us, so this is going to be super interesting.

I also saw a tweet about a brand new benchmark whose paper was published on March the 2nd, 2026, literally the same day as the tweet. Pencil Puzzle Bench evaluates LLM reasoning through pencil puzzles, a family of constraint satisfaction problems closely related to NP-complete problems, with deterministic step-level verification; basically, it can't be gamed. The first result was o3 at 3% in early 2025, followed by an explosion in capability with the GPT 5 family. So they're testing a new capability frontier, not one that existed before, and models released before 2024 literally score 0%, meaning there's no training data on these specific puzzle combinations: you either can reason through the constraint steps or you can't. And the chart couldn't be clearer: United States closed models dominate entirely. GPT 5.2 gets 56%, Claude Opus 4.6 gets 36.7%, and Gemini 3.1 Pro gets 33%. I know the chart isn't the highest quality, but just listen to what I'm saying. Then, when we look at the Chinese models, they fall off a cliff: Kimi K2 gets 6%, MiniMax gets 3.3%, DeepSeek gets 2%, Qwen 3.5 gets 0.7%, and GLM 5 gets 0.7%. It's pretty crazy. Think about it.
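To make "deterministic step-level verification" concrete, here is a minimal sketch of how a pencil-puzzle-style grader could work. This is a hypothetical toy, not Pencil Puzzle Bench's actual code, and the names (verify_steps, the toy constraints) are made up for illustration: each step a model proposes is replayed against the puzzle's constraints, so a solution either satisfies them or it doesn't, with nothing to memorize and no partial credit for sounding plausible.

    # Minimal sketch of deterministic step-level verification (hypothetical,
    # not the real Pencil Puzzle Bench harness). Each proposed step is
    # replayed and checked against every constraint; one violation fails it.

    def verify_steps(num_cells, constraints, steps):
        """Replay a model's proposed (cell, value) steps one at a time."""
        assignment = {}
        for i, (cell, value) in enumerate(steps):
            if cell in assignment:
                return False, f"step {i}: cell {cell} assigned twice"
            assignment[cell] = value
            for constraint in constraints:
                if not constraint(assignment):
                    return False, f"step {i}: violated at {cell}={value}"
        if len(assignment) == num_cells:
            return True, "solved"
        return False, "incomplete"

    # Toy puzzle: fill a 4-cell row with values 1-4, no repeats.
    in_range   = lambda a: all(1 <= v <= 4 for v in a.values())
    no_repeats = lambda a: len(set(a.values())) == len(a)

    ok, msg = verify_steps(4, [in_range, no_repeats],
                           [(0, 1), (1, 2), (2, 2)])  # repeats the 2
    print(ok, msg)  # -> False step 2: violated at 2=2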
This is one of the most damning charts for LLMs, because it's brand new, independently run, and tests something completely different from all the other benchmarks: HLE, Humanity's Last Exam, tests knowledge breadth; FrontierMath tests mathematical depth; Pencil Puzzle Bench tests pure multi-step logical constraint reasoning, with no knowledge required at all, just the ability to hold multiple rules in your head simultaneously and reason forward step by step. A key finding is that GPT 5.2 improves 81 times from no reasoning to maximum effort, meaning United States frontier models are genuinely engaging with reasoning on these problems, and maybe, just maybe, Chinese open-source models aren't closing that gap regardless of effort. Three completely different benchmarks, three different methodologies, and we're getting the same gap. You've seen ARC AGI 2, where there's a complete drop-off, and remember, that was tested recently; now we're looking at Pencil Puzzle Bench and seeing that same massive drop-off for those models. Someone on Reddit even said this test is designed to expose benchmark gaming.

And guys, if we look at FrontierMath, the same story exists, and it's the same story as HLE. All current models score over 90% on GSM8K, MATH, and the other standard benchmarks, largely due to data contamination: the models are trained on data that closely resembles those tests, so those benchmarks stopped measuring anything real. FrontierMath, by contrast, was designed to introduce exceptionally challenging mathematical problems covering most major branches of modern mathematics, from computationally intensive problems in number theory and real analysis to abstract questions in algebraic geometry and category theory. What makes it genuinely hard? Solving a typical FrontierMath problem requires multiple hours of effort from a researcher in the relevant branch of mathematics, and for the upper end of questions, multiple days. This isn't Olympiad stuff; these are open research problems. And remember, these cannot be gamed: FrontierMath addresses data contamination by featuring exclusively new, previously unpublished problems. Every question was written specifically for this benchmark and has never appeared online, in textbooks, or in any other training corpus. The problems are also guess-proof, meaning they're nearly impossible to solve without doing the actual mathematical work. The models are given full access to Python, so they can run code, test hypotheses, and iterate, and they still can't solve them. And look at the data: look at where Kimi K2 appears, literally after Gemini 2.5 Deep Think, after o4-mini-high, after Gemini 3. Kimi, GLM, and DeepSeek are all basically clustered at the very bottom; the only way you can actually see them is if I flip to the next slide, where you'll see where GLM 5 and Kimi actually sit. These models score around 2% to 3%, which of course isn't great.
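Since the transcript says a FrontierMath-style setup gives models a Python tool and then checks a single exact answer, here is a rough sketch of what that kind of harness could look like. Everything here is an assumption for illustration (run_model_code and grade are invented names, not Epoch AI's actual evaluation code): the model may write and run code and iterate on the output, but the final grade is one exact-match check, which is what makes guessing nearly useless.

    # Hypothetical sketch of a FrontierMath-style loop (not the real
    # harness): the model may execute Python it writes, but is graded only
    # on an exact final answer, so there is no partial credit to farm.
    import os
    import subprocess
    import sys
    import tempfile

    def run_model_code(code, timeout=30):
        """Run model-written Python in a subprocess, return its stdout."""
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(code)
            path = f.name
        try:
            result = subprocess.run([sys.executable, path],
                                    capture_output=True, text=True,
                                    timeout=timeout)
            return result.stdout.strip()
        finally:
            os.unlink(path)

    def grade(final_answer, reference):
        # Guess-proof grading: one exact value, right or wrong.
        return final_answer == reference

    # The model "tests a hypothesis" by computing, then submits the result.
    answer = run_model_code("print(sum(i * i for i in range(1, 11)))")
    print(grade(answer, "385"))  # -> True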
And of course, there's Humanity's Last Exam, another benchmark of 2,500 of the toughest subject-diverse multimodal questions, designed to be the last academic exam of its kind, testing depth and breadth of knowledge across subject domains. And remember, this is another ungameable benchmark: they have an additional private held-out set of Humanity's Last Exam questions, and searchable questions were removed by a specific procedure. And here's the thing that was quite concerning: Kimi K2 reported 50% on this benchmark, but Artificial Analysis found it was actually 29.4%, roughly a 21-point inflation, and the boost from adding tool use jumped the score significantly. Now take a look at what Jensen Huang recently said about China being behind, and this is why I say these statements are so interesting: on one hand, you've got people saying China is not behind, that China is right behind us, but on the other hand, when we look at just LLMs, remember, this is just an LLM thing, you can see that they actually are behind on benchmarks that aren't gamed.

>> Like, how far behind do you think China is?
>> Uh, China is not behind anybody.
>> Ahead of you?
>> China's right behind us. I mean, we're very, very close. But remember, this is a long-term, this is an infinite race. You know, in the world of life there's no two-minute warning, no end of the quarter. There's no such thing. And so we're going to compete for a long time. And just remember that this is a country with great will and great technical capabilities. 50% of the world's AI researchers are Chinese, and so this is an industry that we will have to compete in.

So when Nvidia's Jensen Huang says that China will win the AI race with the US, maybe there's more to the story. In 2025 he told the Financial Times that China is going to win the AI race, and then he put out a statement later saying that China is nanoseconds behind America. Essentially, the real argument is that China is subsidizing energy costs to encourage local companies to use Chinese chips, while US states are considering waves of new AI regulations. And he did state that around 50% of the world's AI researchers are in China and that the majority of leading open-source models are being created there. And apparently the larger argument is that the chip export bans are basically counterproductive: if Chinese companies lose access to Nvidia GPUs, it forces them to build homegrown alternatives, they become less reliant on Nvidia's ecosystem, and they're just going to be more innovative. And take a listen to what Sam Altman actually says about China, and then I'm going to show you guys even more data.

>> The progress of Chinese tech companies across the entire stack, and not just in AI but in many fields, is remarkable. The reason I'm pushing back on "underestimated" is that it feels like every conversation I have is, "Oh, China's beating us, what do we do about it?" So I think people are aware of what's happening there. But yes, Chinese progress is amazingly fast.
>> Are they near the frontier? Are they near you?
>> Uh, in some areas yes, and in some areas no. It's not quite a unilateral thing like that.
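As a quick aside, the roughly 21-point inflation figure from the HLE discussion above is easy to sanity check; the two inputs below are just the numbers quoted in the video.

    # Sanity check of the HLE inflation figure quoted above.
    claimed  = 50.0   # Kimi K2's self-reported HLE score (per the video)
    measured = 29.4   # Artificial Analysis's independent measurement
    print(f"inflation: {claimed - measured:.1f} points")  # -> 20.6 points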
Now, if you guys thought it was just one benchmark or two benchmarks, take a look at this, because this one was very surprising; before recording this video, I did not know it was a thing. SWE-bench is essentially an evaluation framework designed to test how well models can solve real-world software engineering problems. Instead of asking a model to write a simple isolated function, which is what all the older benchmarks did, SWE-bench actually drops it into a complex code base to see if it can fix a real bug. It's built from thousands of real GitHub issues, and it's currently the gold standard for testing AI software engineers; that's what people use now. Currently, you can see the self-reported numbers: MiniMax is apparently scoring around 75 on SWE-bench, right up there with Claude 4 Opus, Gemini 3 Flash, and Claude Opus 4.6, and Kimi K2.5 high reasoning is around those GPT 5 models and Claude 4.5 Sonnet. However, this is where I started to say, okay, maybe there clearly is something off about Chinese open-source models when it comes to benchmarks. Because if you look at SWE-rebench, a new, continuously evolving benchmark introduced in May 2025 by Nebius AI's R&D team, it addresses contamination issues in popular coding benchmarks like SWE-bench by using fresh, decontaminated GitHub tasks that models haven't seen during training, which makes it harder to game through overfitting or data leakage. On the original SWE-bench, Chinese models apparently matched Western frontier models, which is what we just saw. But on SWE-rebench, their scores actually plummet, indicating reliance on benchmark-specific optimizations rather than generalizable intelligence. So when we look at SWE-rebench, where do we see Chinese open-source models? At places 11 and 17, whereas models from Western labs sit far higher up the chart. So this is what you have to ask yourself: if you are using certain open-source models, they may not have the same performance as Opus 4.5 or GPT 5.4 or GPT 5.3 Codex, and you should take that into account. Remember, every single company, not just the Chinese open-source companies, is incentivized to inflate its benchmarks to get more attention, more downloads, more publicity. So please do bear that in mind. Always run your own benchmarks, run your own tests, and see how well a model performs for your specific use case; see the sketch below.
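Here is a minimal sketch of what "run your own tests" can look like in practice, using the OpenAI-compatible chat API that many model gateways expose. The base_url, api_key, model names, and the one toy task are placeholders to swap for your own workload; this is an illustration, not a rigorous eval harness.

    # Minimal personal-eval sketch: score several models on YOUR tasks
    # instead of trusting self-reported benchmark numbers. The base_url,
    # api_key, model names, and task below are placeholders.
    from openai import OpenAI

    client = OpenAI(base_url="https://your-gateway.example/v1", api_key="...")

    tasks = [
        {"prompt": "Reverse the string 'benchmark' and reply with only the result.",
         "check": lambda reply: "kramhcneb" in reply.lower()},
        # ...add tasks drawn from your real workload here
    ]

    def score(model):
        passed = 0
        for task in tasks:
            reply = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": task["prompt"]}],
            ).choices[0].message.content
            if task["check"](reply):
                passed += 1
        return passed / len(tasks)

    for model in ["open-model-a", "frontier-model-b"]:  # placeholder names
        print(model, f"{score(model):.0%}")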