# LLM.txt - TwELL: The Sparse Kernel Bet That Makes LLMs 20.5% Faster

## Article Metadata
- **Title**: TwELL: The Sparse Kernel Bet That Makes LLMs 20.5% Faster
- **URL**: https://www.llmrumors.com/news/sakana-ai-twell-sparse-transformer-llm-gpu-kernels
- **Publication Date**: May 29, 2026
- **Reading Time**: 14 min read
- **Tags**: Sakana AI, NVIDIA, TwELL, Sparse LLMs, GPU Kernels, Inference Optimization, AI Infrastructure, Transformer Architecture
- **Slug**: sakana-ai-twell-sparse-transformer-llm-gpu-kernels

## Summary
Sakana AI and NVIDIA's TwELL shows why sparse LLMs were not blocked by theory. They were blocked by GPU execution economics.

## Key Topics
- Sakana AI
- NVIDIA
- TwELL
- Sparse LLMs
- GPU Kernels
- Inference Optimization
- AI Infrastructure
- Transformer Architecture

## Content Structure
This article from LLM Rumors covers:
- Industry comparison and competitive analysis
- Data acquisition and training methodologies
- Financial analysis and cost breakdown
- Comprehensive source documentation and references

## Full Content Preview
TL;DR: Sakana AI and NVIDIA's TwELL is not just another sparse-model trick, it is a kernel-level argument that LLM efficiency now lives inside data layout. Their 2B sparse model runs 20.5% faster in forward execution, trains 21.9% faster, and cuts inference energy from 7.85 mJ/token to 6.51 mJ/token, while mean task accuracy moves from 49.1% to 48.8%.[4]

The uncomfortable truth is that sparsity did not fail because the math was wrong. It failed because GPUs made the sparse path expensive.

Sakana AI announced TwELL on May 9, 2026, with NVIDIA as collaborator, open-source code, Hugging Face checkpoints, and an ICML 2026 presentation slot.[1][5]

The headline sounds technical. Tile-wise ELLPACK. CUDA kernels. Hybrid sparse formats. But the business meaning is blunt: the next cost war in AI may be fought below the model API, inside the feedforward block.
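
As a quick sanity check on the TL;DR figures, the relative changes can be worked out directly. The sketch below uses only the energy and accuracy numbers quoted above; the variable names are illustrative, and the per-million-token extrapolation assumes energy scales linearly with token count, which the article does not state.

```python
# Figures quoted in the TL;DR; variable names are illustrative only.
dense_mj_per_token = 7.85
sparse_mj_per_token = 6.51
dense_accuracy = 49.1   # mean task accuracy, percent
sparse_accuracy = 48.8

# Relative energy saving and absolute accuracy change.
energy_saving_pct = (dense_mj_per_token - sparse_mj_per_token) / dense_mj_per_token * 100
accuracy_drop_points = dense_accuracy - sparse_accuracy

# Assumption: energy scales linearly with tokens (not stated in the article).
tokens = 1_000_000
joules_saved = (dense_mj_per_token - sparse_mj_per_token) * tokens / 1000  # mJ -> J

print(f"energy saving: {energy_saving_pct:.1f}%")
print(f"accuracy drop: {accuracy_drop_points:.1f} points")
print(f"joules saved per million tokens: {joules_saved:.0f}")
```

On these numbers the trade is roughly a 17% cut in inference energy for a 0.3-point drop in mean task accuracy.
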
While competitors chase bigger context windows, larger dense models, and more routing experts, Sakana and NVIDIA are asking a more uncomfortable question: why are we paying GPUs to multiply by activations the model did not need in the first place?

Feedforward layers account for over two-thirds of parameters and over 80% of total FLOPs in larger transformer models, according to the paper's framing of prior work.[4] TwELL targets that exact cost center. If sparse feedforward execution becomes practical, inference efficiency stops being only a decoding problem and becomes a full-stack hardware-software design problem.

### The Real Story: Sparse Models Needed A GPU Deal

Let's be clear: sparsity was never the new idea. Neural networks have always contained wasted compute. ReLU activations go to zero. Gated feedforward blocks naturally silence dimensions. Prior papers have shown activation sparsity in transformers for years.

The missing piece was not proof that many activations are zero. The missing piece was a way to make GPUs benefit from those zeros.

Modern accelerators are extraordinary at regular dense matrix multiplication. They want big predictable tiles, coalesced memory access, shared-memory reuse, and work that maps cleanly onto warps and Tensor Cores. Unstructured sparsity gives them the opposite: irregular indices, extra bookkeeping, branchy execution, and sparse conversion steps that can erase the theoretical savings.

That is the paradox TwELL attacks. Making a model do less math can make it run slower if the hardware has to work harder to find the math it can skip. Sakana's launch post phrases it plainly: do not force the GPU to adapt to sparsity. Reshape the sparsity to fit the GPU.[1] That is the entire thesis.

The target is the transformer feedforward block. In modern gated MLPs, the model expands each token into a larger hidden dimension, gates that hidden space, multiplies the up and gate paths together, then projects back down.
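
The gated feedforward computation described above can be sketched in a few lines. This is a toy illustration, not the TwELL kernel or its storage format: it assumes a ReLU gate, uses hand-picked weights so that exactly half the hidden dimensions are silenced, and all names (`ffn_dense`, `ffn_sparse`, `w_gate`, `w_up`, `w_down`) are hypothetical. It shows only that skipping hidden dimensions whose gate is zero leaves the output unchanged.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def gate_activations(x, w_gate):
    # Assumed ReLU gate: negative pre-activations become exact zeros.
    return [max(0.0, dot(row, x)) for row in w_gate]

def ffn_dense(x, w_gate, w_up, w_down):
    # Dense reference: every hidden dimension is computed and projected.
    gate = gate_activations(x, w_gate)
    hidden = [g * dot(row, x) for g, row in zip(gate, w_up)]
    return [dot(row, hidden) for row in w_down]

def ffn_sparse(x, w_gate, w_up, w_down):
    # Skip the up-projection and down-projection work wherever the gate is zero.
    gate = gate_activations(x, w_gate)
    active = [j for j, g in enumerate(gate) if g > 0.0]
    hidden = {j: gate[j] * dot(w_up[j], x) for j in active}
    out = [sum(row[j] * hidden[j] for j in active) for row in w_down]
    return out, active

# Toy token and weights (illustrative values only).
x = [1.0, 2.0, 0.5, 1.5]
n_hidden = 8
# Even rows give a positive gate pre-activation, odd rows a negative one.
w_gate = [[(0.1 if j % 2 == 0 else -0.1) * (j + 1)] * len(x) for j in range(n_hidden)]
w_up = [[0.01 * (j - k) for k in range(len(x))] for j in range(n_hidden)]
w_down = [[0.02 * (i + j) - 0.1 for j in range(n_hidden)] for i in range(len(x))]

dense_out = ffn_dense(x, w_gate, w_up, w_down)
sparse_out, active = ffn_sparse(x, w_gate, w_up, w_down)

print("active hidden dimensions:", active)
print("outputs match:", all(math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-12)
                            for a, b in zip(dense_out, sparse_out)))
```

In plain Python the skipped work is an obvious win. On a GPU the index gathering in `ffn_sparse` is exactly the irregular access pattern the article says can erase the savings, which is the gap a tile-aware layout is meant to close.
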
The gate activation decides which hidden dimensions matter for that token.[2][4] TwELL turns those sparse gate activations into a tile-aware representation that CUDA kernels can actually use.

### The Mechanism: Tile-wise ELLPACK And The Hybrid Escape Hatch

TwELL stands for Tile-wise ELLPACK. Classic ELLPACK packs sparse data row by row. TwELL instead partitions columns into horizontal tiles, then stores nonzero values and their indices inside each tile.[4] That sounds like a data-structure detail. It is actually the whole product.

Here's the genius: TwELL chooses the tile geometry so the gate activation can be converted into sparse storage inside the same tiled matrix multiplication kernel that computes it. The paper describes this as setting the TwELL horizontal tile size to match the matmul tiling dimension, which lets the kernel materialize sparse output in the epilogue inste...

[Content continues - full article available at source URL]

## Citation Format
**APA Style**: LLM Rumors. (2026). TwELL: The Sparse Kernel Bet That Makes LLMs 20.5% Faster. Retrieved from https://www.llmrumors.com/news/sakana-ai-twell-sparse-transformer-llm-gpu-kernels

**Chicago Style**: LLM Rumors. "TwELL: The Sparse Kernel Bet That Makes LLMs 20.5% Faster." Accessed May 29, 2026. https://www.llmrumors.com/news/sakana-ai-twell-sparse-transformer-llm-gpu-kernels.

## Machine-Readable Tags
#LLMRumors #AI #Technology #SakanaAI #NVIDIA #TwELL #SparseLLMs #GPUKernels #InferenceOptimization #AIInfrastructure #TransformerArchitecture

## Content Analysis
- **Word Count**: ~1,753
- **Article Type**: News Analysis
- **Source Reliability**: High (Original Reporting)
- **Technical Depth**: Medium
- **Target Audience**: AI Professionals, Researchers, Industry Observers

## Related Context
This article is part of LLM Rumors' coverage of AI industry developments, focusing on data practices, legal implications, and technological advances in large language models.

---

Generated automatically for LLM consumption
Last updated: 2026-05-29T17:11:05.166Z
Source: LLM Rumors (https://www.llmrumors.com/news/sakana-ai-twell-sparse-transformer-llm-gpu-kernels)