TL;DR: On February 16, 2026, Alibaba released Qwen3.5-397B-A17B: a 397B-parameter MoE model that activates only 17B per forward pass, scores 88.4 on GPQA Diamond, and runs at $0.60 per million input tokens, roughly 8x cheaper than Claude Opus 4.6.[1] Seven days later, Anthropic published a landmark distillation attack report naming three Chinese AI labs for industrial-scale theft.[2] Alibaba was not among them. That absence is not a coincidence. It is the entire story.
On February 23, 2026, Anthropic named three Chinese AI labs (DeepSeek, Moonshot AI, and MiniMax) for running coordinated campaigns to extract Claude's capabilities through 16 million fraudulent API exchanges.[2] The industry parsed the names that were included. Almost nobody asked about the names that were left out.
Alibaba was not named. ByteDance was not named. Baidu was not named. Tencent was not named.
These are not small actors. Together they control more users, more compute, more revenue, and more AI deployment than the three companies Anthropic did name. And they are conspicuously, structurally, predictably absent from a report about labs that needed to steal training data because they could not generate it themselves.
Qwen3.5 is the proof of concept. Frontier-tier benchmarks. Open-source Apache 2.0 licensing. Eight times cheaper than Claude at the API level. Built by a company that processes more commercial transactions annually than Amazon, eBay, and Etsy combined. They did not need to distill from anyone. They could not afford the reputational risk even if they had wanted to. And most importantly: they had better options.
The Absent Name
Anthropic's distillation report named every major pure-play Chinese AI startup. It named none of the Chinese Big Tech AI divisions. Alibaba, ByteDance, Baidu, Tencent, and Xiaomi, all running frontier AI programs, are absent. The companies that were caught are all data-poor relative to their frontier ambitions. The companies that were not caught are data-rich by construction. This is not a coincidence. It is the operating logic of the distillation problem.
What Qwen3.5 Actually Is
Before the strategic analysis, the technical reality deserves attention, because Qwen3.5 is genuinely impressive in ways that get buried under the geopolitical noise.
Released on February 16, 2026, Qwen3.5-397B-A17B is the first open-weight model in the new Qwen3.5 series.[1] It is a native vision-language model built on a hybrid architecture that fuses linear attention via Gated Delta Networks with a sparse mixture-of-experts design. The architecture matters because it achieves something that was considered difficult eighteen months ago: 397 billion total parameters with only 17 billion activated per forward pass. That is a 95% reduction in active compute relative to total capacity without proportional capability loss.
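The arithmetic behind that efficiency claim is worth making concrete. A back-of-envelope sketch using the published parameter counts (the helper name is illustrative, not part of any official API):

```python
# Back-of-envelope check on Qwen3.5's sparse-MoE compute savings,
# using the 397B-total / 17B-active figures quoted above.

def active_fraction(total_params_b: float, active_params_b: float) -> float:
    """Fraction of total parameters activated per forward pass."""
    return active_params_b / total_params_b

frac = active_fraction(397.0, 17.0)
print(f"Active per token: {frac:.1%} of total capacity")
print(f"Reduction in active compute: {(1 - frac) * 100:.1f}%")
# Active per token: 4.3% of total capacity
# Reduction in active compute: 95.7%
```

The point of the sparse-MoE design is that per-token inference cost scales with the 17B active parameters, not the 397B total, which is what makes the $0.60/M pricing possible.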
This is not a capability demo. Qwen3.5 is already deployed across Alibaba's product suite. The model supports 201 languages and dialects, up from 119 in the previous generation, reflecting Alibaba's global commercial footprint. The hosted Qwen3.5-Plus version includes a default 1 million token context window and built-in tool use with adaptive agent capabilities.[3]
The architecture is genuinely novel. Most frontier models bolt on vision as a second stage. Qwen3.5 processes text, images up to 1344×1344 resolution, and 60-second video clips from the first pretraining stage. The multimodal capability is architectural, not cosmetic.
Qwen3.5 by the Numbers
| Stat | Detail |
|---|---|
| 397B total / 17B active | Sparse MoE; only 17B parameters activate per forward pass (95% reduction in active compute) |
| 88.4 GPQA Diamond | Top open-source score on the GPQA leaderboard as of February 2026. Graduate-level knowledge benchmark; ahead of GPT-5.2 Pro on this metric |
| 80%+ coding | Real-world software engineering tasks; approaches Claude Opus 4.6 |
| $0.60/M input | Per million input tokens. Claude Opus 4.6 is $5.00/M, GPT-5.2 is $1.75/M |
| 201 languages | Up from 119 in Qwen3, reflecting Alibaba's global commercial footprint |
The Benchmarks: Where Qwen3.5 Actually Sits
Self-reported benchmarks from Chinese AI labs require caveats. Alibaba claims Qwen3.5 outperforms GPT-5.2, Claude Opus 4.6, and Gemini 3 Pro on roughly 80% of evaluated benchmark categories.[3] CNBC noted that it could not independently verify those claims.[3] That is the standard honest caveat.
Here is what independent leaderboards show, which is less dramatic but still telling.
Frontier vs Efficiency: Qwen3.5 Across the Stack
| Model | Vendor | GPQA Diamond | API input price |
|---|---|---|---|
| Qwen3.5-397B-A17B | Alibaba / Qwen | 88.4 | $0.60/M |
| Qwen3.5-35B-A3B | Alibaba / Qwen | 84.2 | $0.12/M |
| Claude Sonnet 4.5 | Anthropic | 83.4 | $3.00/M |
| Claude Opus 4.6 | Anthropic | 91.3 | $5.00/M |

GPQA Diamond from the llm-stats.com leaderboard. Coding and reasoning figures from Alibaba and Anthropic published evaluations. Sonnet 4.5 data from the Anthropic model card. Scores may vary by methodology and version.
On the GPQA leaderboard (the most reliably independently verified benchmark), Qwen3.5-397B-A17B at 88.4% sits at rank 9 globally, tied with Grok-4 Heavy, below Claude Opus 4.6 (91.3%) and GPT-5.2 (92.4%) but ahead of GPT-5.1, ChatGPT-4o, DeepSeek V3.2, and every other open-weight model.[4] That is solidly frontier-tier, not frontier-adjacent.
The 35B-A3B comparison is the more interesting efficiency story. At only 3B active parameters, Qwen3.5-35B-A3B scores 84.2 on GPQA Diamond versus Claude Sonnet 4.5's 83.4. It scores 85.3 on MMLU-Pro versus Sonnet 4.5's 85.0. Those are effectively equal on general reasoning at roughly 25x lower active compute, and at a fraction of the API cost ($0.12/M input vs $3.00/M for Sonnet 4.5).[14] The 35B trails on coding: Sonnet 4.5 scores 77.2 on SWE-bench Verified versus the 35B's 69.2. But for reasoning-heavy workloads, the efficiency gap is striking.
The pricing context makes the flagship benchmark positioning sharper too. At $0.60/M input, Qwen3.5-397B-A17B is 8.3x cheaper than Claude Opus 4.6 for roughly 3 percentage points less GPQA performance. For most enterprise deployments, that trade-off is not a trade-off at all. It is an obvious operational decision. The open-weight version under Apache 2.0 makes the self-hosted case even more compelling: no API billing at all.
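The procurement math above is simple enough to sketch. A hedged example (the rates are the per-million-input-token prices quoted in this article; real bills also depend on output-token rates, caching, and volume discounts):

```python
# Rough monthly input-token cost comparison at the published API rates.
# Prices are USD per 1M input tokens, as cited in this article.

PRICES_PER_M_INPUT = {
    "Qwen3.5-397B-A17B": 0.60,
    "GPT-5.2": 1.75,
    "Claude Opus 4.6": 5.00,
}

def monthly_input_cost(model: str, tokens_per_month: int) -> float:
    """Input-token spend for a month, ignoring output tokens and discounts."""
    return PRICES_PER_M_INPUT[model] * tokens_per_month / 1_000_000

tokens = 10_000_000_000  # a mid-size agentic workload: 10B input tokens/month
for model in PRICES_PER_M_INPUT:
    print(f"{model}: ${monthly_input_cost(model, tokens):,.0f}/month")

ratio = PRICES_PER_M_INPUT["Claude Opus 4.6"] / PRICES_PER_M_INPUT["Qwen3.5-397B-A17B"]
print(f"Opus-to-Qwen input price ratio: {ratio:.1f}x")
```

At that volume the input-side gap alone is $6,000 versus $50,000 a month, which is why the 8.3x figure reads as a procurement argument rather than a benchmark footnote.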
The Missing Names in Anthropic's Report
Let's be clear about what Anthropic's February 23 distillation report actually tells us about the structure of the Chinese AI ecosystem.
The three named labs, DeepSeek, Moonshot AI (Kimi), and MiniMax, share a defining characteristic: they are pure-play AI research companies.[2] They were founded with the explicit goal of building frontier AI models. They do not operate e-commerce platforms, social media networks, enterprise software ecosystems, or payment systems. They generate revenue through AI products and API access. Their training data comes from the open web, public datasets, licensed sources, and, as Anthropic documented, from fraudulently extracted competitor outputs.
The companies not named, including Alibaba, ByteDance, Baidu, and Tencent, share a different defining characteristic. They are some of the largest tech companies on Earth, and they generate proprietary training data as a byproduct of their core businesses at scales that are simply unavailable to a pure-play AI startup.
This is not speculation. It is structural.
Why Some Labs Needed to Distill and Others Didn't
| Lab | Named in Distillation Report | Proprietary Data Ecosystem |
|---|---|---|
| DeepSeek | Yes: 150K extractions, chain-of-thought targeting | Pure-play AI research lab. Web data only. No proprietary product data. |
| Moonshot AI (Kimi) | Yes: 3.4M exchanges, senior staff directly involved | Pure-play AI startup. Consumer chatbot. No diversified data moat. |
| MiniMax | Yes: 13M+ exchanges, largest single campaign | Pure-play AI startup. Consumer products. No enterprise data ecosystem. |
| Alibaba / Qwen | Not named | Taobao/Tmall (world's largest e-commerce), Alibaba Cloud, Ant Financial, DingTalk enterprise comms, Youku video. Billions of proprietary transactions. |
| ByteDance / Seed2.0 | Not named | TikTok/Douyin (1B+ users, world's largest behavioral ranking dataset), Toutiao news, Lark enterprise. Unmatched behavioral signal. |
| Baidu / ERNIE | Not named | China's dominant search engine. 25+ years of Chinese-language web index. Maps, cloud, Apollo autonomous driving data. |
The pattern is striking enough to name directly. Every lab that was caught is a pure-play AI company. Every major Chinese Big Tech AI division, labs that collectively deploy more compute than the three named companies combined, is absent. This is not coincidence and it is not necessarily evidence of virtue. It is evidence of data sufficiency.
The Alibaba Data Moat: What Qwen Actually Trained On
Alibaba's data position is one of the most underappreciated structural advantages in global AI development. This is not about having "more data." It is about having the right kinds of data that frontier models demonstrably need.
Taobao and Tmall process over $1.7 trillion in gross merchandise volume annually. Every product listing, every customer review, every seller description, every search query, every purchasing decision generates structured, high-quality natural language data in commercial contexts. This is fundamentally different from scraped web text. It is purposeful, transactional, real-world language with ground-truth behavioral outcomes attached.
Alibaba Cloud serves millions of enterprise customers across China and Southeast Asia. The workloads flowing through those systems (document processing, code generation, data analysis, customer service automation) are exactly the agentic use cases that Moonshot and MiniMax felt they had to distill from Claude to acquire.[2] Alibaba generates the same data organically through production deployments.
Ant Financial's payment and credit infrastructure handles more transactions than Visa and Mastercard combined in many quarters. The financial reasoning, risk assessment, and document understanding capabilities that represent frontier AI's most valuable enterprise applications are generated organically by Alibaba every day.
DingTalk, Alibaba's enterprise communication platform, has over 500 million registered users. Workplace documents, meeting transcripts, project management workflows, technical specifications: structured professional language at industrial scale, owned by Alibaba.
The labs Anthropic caught were trying to acquire in weeks what Alibaba built organically over two decades. The distillation attack is a symptom of data poverty. Qwen3.5 is evidence of what data wealth produces.
None of this data required stealing from a competitor. None of it required fraudulent API accounts or proxy networks. It required building a diversified tech company over twenty years and being smart enough to route that data asset into an AI training pipeline.
The ByteDance Parallel: Seedance and the Same Story
This is not an Alibaba-specific phenomenon. We covered ByteDance's Seed2.0 release in detail: a full-stack AI ecosystem spanning frontier LLMs, vision, coding, and the Seedance video model that went viral.[5]
Seed2.0 was also absent from Anthropic's distillation report. The reason is structurally identical to the Alibaba case.
ByteDance operates TikTok and Douyin, collectively the most sophisticated content recommendation systems ever built and the platform that first demonstrated that large-scale behavioral AI could outperform human editorial judgment. The training signal from a billion users selecting and rejecting content in real time is categorically different from anything available to a pure-play AI startup. It is rich, diverse, multilingual, multimodal, and continuously updated.
Toutiao, ByteDance's news aggregation platform, has been training content understanding models for a decade. Lark, their enterprise product, runs on the same infrastructure. ByteDance's game division, its e-commerce expansion, its music platform: all of it generates proprietary behavioral data that flows into Seed2.0 training.
ByteDance didn't need to distill from Claude. They had better data than Claude's trainers had when they built Claude.
The pattern is now two data points, which is enough to name it: Chinese Big Tech has built data moats that make distillation structurally unnecessary. Chinese AI pure-play startups have not. Anthropic's report named exactly the companies without data moats.
What The Industry Actually Looks Like From This Angle
The distillation report framed the problem as Chinese labs stealing from American labs. That framing is accurate at the level of the specific ToS violations. MiniMax, Moonshot, and DeepSeek ran fraudulent operations that violated Anthropic's terms and circumvented regional access controls.[2]
But the framing obscures a more uncomfortable structural reality: the frontier AI race has two different competitive dynamics running simultaneously.
The first is the well-understood competition between American frontier labs. OpenAI, Anthropic, and Google are competing on research talent, compute, and proprietary training data. Their advantage is years of accumulated RLHF, Constitutional AI, and preference data from hundreds of millions of human users.
The second is less discussed: Chinese Big Tech is winning the proprietary data competition on a different axis entirely. Alibaba and ByteDance have access to commercial and behavioral data at scales and qualities that American AI labs do not. Neither Google nor Anthropic has anything like Taobao's transactional database. Neither has TikTok's behavioral ranking signal. The data moat runs in both directions.
Why Qwen Trains Natively in 201 Languages
Qwen3.5 supports 201 languages and dialects.[6] This is not a technical achievement in isolation. It reflects the geographic distribution of Alibaba's actual commercial operations. Lazada (Southeast Asia), Daraz (South Asia), AliExpress (global), Alibaba Cloud (Asia-Pacific). Alibaba processes real commercial transactions in dozens of languages daily. The multilingual capability is a byproduct of having genuine multilingual users, not a training objective pursued artificially.
Qwen3.5's native multimodal capabilities follow the same logic. Alibaba processes product images, merchant video, user-generated visual content, and document scans at industrial scale. Native vision-language training from pretraining rather than multimodal fine-tuning is what happens when your pretraining data is inherently multimodal because your business is.
Distillation as a Structural Response to Data Poverty
Here's the angle nobody in the distillation coverage has fully articulated: the labs Anthropic caught were not primarily trying to save money on training. They were trying to solve a structural problem that money alone cannot solve.
Training data diversity, quality, and domain coverage are the binding constraints on frontier model capability for any lab that is not Google, Meta, Alibaba, or ByteDance. Pure-play AI startups, including, to a significant degree, Anthropic itself, cannot generate the breadth of real-world behavioral data that comes from operating diversified technology products at scale.
Anthropic's own data situation is instructive. Their training pipeline relies heavily on human feedback from contractors, licensed datasets, and web-crawled text. Reddit sued Anthropic in 2025 for scraping over 100,000 posts and comments without permission to fine-tune Claude.[7] The Stanford Alpaca project trained on 52,000 outputs from OpenAI's text-davinci-003 for under $500 in 2023 and was celebrated across the research community, including by researchers who later joined Anthropic and built on that work.[8] The knowledge transfer that Anthropic is now calling an "attack" is the same process that built much of the open-source AI ecosystem those researchers came from.
This is not to excuse the fraudulent accounts or the ToS violations. Those are real legal and ethical problems. It is to say that the technique of acquiring knowledge about a model's outputs to improve your own model is industry standard. The question of where the knowledge ultimately comes from is more complicated when the companies that have it either acquired it through their own advantages (Alibaba's data moat) or built on a research commons that everybody contributed to and everybody benefited from.
The Data Moat Thesis: A Timeline
Key milestones in development
| Date | Milestone | Significance |
|---|---|---|
| 2003–2010 | Alibaba builds China's e-commerce infrastructure | Taobao launches (2003), Alipay spins out (2004), Tmall launches (2008). Billions of commercial transactions in Chinese begin accumulating. |
| 2012 | ByteDance founded | Toutiao's algorithm-first content model demonstrates behavioral data's value. TikTok/Douyin follows, building the world's largest behavioral ranking dataset. |
| 2023 | Stanford Alpaca: distillation celebrated | 52K text-davinci-003 outputs, under $500, permissively released. The research community celebrates distillation as AI democratization. Future frontier lab researchers build on this work. |
| Apr 2025 | Qwen3 launches | Alibaba releases Qwen3 family trained on 36 trillion tokens across 119 languages. Trained on a decade of commercial data. No distillation campaign needed. |
| Feb 16, 2026 | Qwen3.5 releases | 88.4 GPQA Diamond. $0.60/M input. Apache 2.0 open-weight. Built on Alibaba's proprietary data moat. One week before the distillation report. |
| Feb 23, 2026 | Anthropic's distillation report | Three pure-play Chinese AI startups named. Zero Chinese Big Tech companies named. The data moat thesis is validated by absence. |
The Open-Source Play: What Alibaba Is Actually Doing
Qwen3.5 is not a flex. It is a strategy.
Alibaba has released over 100 open-weight models under Apache 2.0 and similar licenses, which have been downloaded more than 40 million times globally.[9] This is not altruism. It is the same playbook ByteDance used with Seed2.0's open-weight releases: flood the ecosystem with capable, permissively licensed models that developers adopt, build on, and extend. Make Qwen-based deployments the default for cost-conscious enterprise development. Build a community that generates feedback, fine-tunes, and extends your models. All of this flows back into your training pipeline.
The open-source strategy also creates a powerful competitive moat in the Asian enterprise market. Local deployments of Qwen3.5 inside Chinese corporations are not subject to Anthropic's regional access restrictions. They do not generate API data that flows to American companies. They run on Alibaba Cloud infrastructure. The open-weight release is a Trojan horse for cloud infrastructure adoption.
Who Qwen3.5 Actually Disrupts
Anthropic's API Business
An 8.3x price gap on comparable general capability is not a rounding error. Enterprise customers doing cost sensitivity analysis will run the numbers and some will switch. The gap is particularly acute for high-volume agentic workloads.
Pure-Play Chinese AI Startups
MiniMax, Moonshot, and DeepSeek are now under reputational and regulatory pressure from the distillation report while competing against Alibaba, a company that can undercut them on price, outcompete them on data, and absorb losses indefinitely.
American AI Policy Debate
The distillation narrative frames China's AI progress as dependent on stolen American IP. Qwen3.5 complicates that narrative. Alibaba's progress is not theft. It is decades of proprietary data accumulation that American labs simply do not have.
Open-Source AI Community
Qwen3.5 is the strongest argument yet that open-weight frontier AI is viable and strategic. Apache 2.0 licensing at 88.4 GPQA Diamond resets expectations for what 'open-source AI' means in practice.
The Uncomfortable Symmetry
Let's state the uncomfortable truth plainly, because it doesn't fit neatly into either the American or Chinese narrative.
Anthropic named the distillation attackers correctly. MiniMax, Moonshot, and DeepSeek ran fraudulent operations. The evidence is credible, the scale was industrial, and the censorship use case DeepSeek ran against Claude crosses a qualitative line that pure capability extraction does not.[2]
And Alibaba built a frontier AI model without doing any of that. Not because they are more ethical. Because they had better tools. The data moat is the ethical shortcut that was never framed as a shortcut because it was built through legitimate business operations.
The AI labs most vocally opposed to distillation are, without exception, labs that lack proprietary data moats and therefore depend on restricted model outputs and licensed data as their competitive training resource. Anthropic's position is strategically coherent: they are protecting the thing they need that they cannot generate at Alibaba's scale. That does not make them wrong about the specific violations. It does mean the policy position should be understood as strategic, not purely principled.
Qwen3.5 is a benchmark-setting frontier model built by a company that processes more commercial data in a day than Anthropic has in its entire history. That is the context Anthropic's distillation report did not provide. This article is providing it.
What the Next Eighteen Months Will Show
Qwen3.5 is the first model in the Qwen3.5 series, released as a single open-weight flagship. The next models in the series (likely a Qwen3.5-Coder, Qwen3.5-Math, and Qwen3.5-VL) will extend the capability coverage. Each one will be built on the same proprietary data moat. Each one will be open-weight. And none of them will be distilling from anyone. Alibaba's competitive advantage does not require it.
What to Actually Take Away From Qwen3.5
Qwen3.5-397B-A17B is a legitimate frontier model. 88.4 GPQA Diamond puts it at rank 9 globally, tied with Grok-4 Heavy, ahead of every other open-source model on the leaderboard.
The 8.3x price gap vs Claude Opus 4.6 for comparable general capability is an enterprise procurement argument, not a benchmark footnote.
Alibaba was absent from Anthropic's distillation report because it has a data moat that makes distillation structurally unnecessary, not because it was undetected.
The same pattern holds for ByteDance (Seed2.0), Baidu (ERNIE), and Tencent. Every Chinese Big Tech AI division with a proprietary data moat was absent from the report.
The companies Anthropic caught are all pure-play AI startups without diversified data advantages, structurally motivated to distill because they have no better option.
Distillation is a symptom of data poverty. Qwen3.5 is evidence of what two decades of commercial data accumulation produces when you finally point it at a transformer.
Last updated: February 26, 2026
Sources & References
Key sources and references used in this article
| # | Source | Date |
|---|---|---|
| 1 | Qwen3.5: Towards Native Multimodal Agents | Feb 16, 2026 |
| 2 | Detecting and preventing distillation attacks | Feb 23, 2026 |
| 3 | Alibaba unveils Qwen3.5 as China's chatbot race shifts to AI agents | Feb 17, 2026 |
| 4 | GPQA Diamond Leaderboard | Feb 2026 |
| 5 | ByteDance Seed2.0: The Full-Stack AI Empire Behind Seedance | Feb 14, 2026 |
| 6 | Qwen3.5: Features, Access, and Benchmarks | Feb 16, 2026 |
| 7 | Reddit Files Lawsuit Against Anthropic Over Alleged Unauthorized Data Scraping | Jun 2025 |
| 8 | Alpaca: A Strong, Replicable Instruction-Following Model | Mar 13, 2023 |
| 9 | Qwen | Feb 2026 |
| 10 | Anthropic Catches Three Chinese AI Labs Stealing Claude | Feb 25, 2026 |
| 11 | MiniMax M2.5: Frontier AI at 20x Less Than Claude Opus | Feb 12, 2026 |
| 12 | Qwen3.5: 397B MoE Benchmarks, Pricing and Complete Guide | Feb 16, 2026 |
| 13 | Open Source LLM Leaderboard 2026: Rankings, Benchmarks | Feb 24, 2026 |
| 14 | Claude Sonnet 4.5 Model Card | Sep 29, 2025 |




