# TwELL: The Sparse Kernel Bet That Makes LLMs 20.5% Faster

**Plutonous** | May 29, 2026 | 14 min read



Tags: Sakana AI, NVIDIA, TwELL, Sparse LLMs, GPU Kernels, Inference Optimization, AI Infrastructure, Transformer Architecture

---

**TL;DR:** Sakana AI and NVIDIA's TwELL is not just another sparse-model trick. It is a kernel-level argument that LLM efficiency now lives inside data layout. Their 2B sparse model runs **20.5% faster** in forward execution, trains **21.9% faster**, and cuts inference energy from **7.85 mJ/token** to **6.51 mJ/token**, while mean task accuracy moves from **49.1%** to **48.8%**.<sup><a href="#source-4">[4]</a></sup> The uncomfortable truth is that sparsity did not fail because the math was wrong. It failed because GPUs made the sparse path expensive.

Sakana AI announced TwELL on May 9, 2026, with NVIDIA as collaborator, open-source code, Hugging Face checkpoints, and an ICML 2026 presentation slot.<sup><a href="#source-1">[1]</a></sup><sup><a href="#source-5">[5]</a></sup> The headline sounds technical. Tile-wise ELLPACK. CUDA kernels. Hybrid sparse formats. But the business meaning is blunt: the next cost war in AI may be fought below the model API, inside the feedforward block.

While competitors chase bigger context windows, larger dense models, and more routing experts, Sakana and NVIDIA are asking a more uncomfortable question: why are we paying GPUs to multiply by activations the model did not need in the first place?

> **Why This Matters Now**
>
> Feedforward layers account for over two-thirds of parameters and over 80% of total FLOPs in larger transformer models, according to the paper's framing of prior work.<sup><a href="#source-4">[4]</a></sup> TwELL targets that exact cost center. If sparse feedforward execution becomes practical, inference efficiency stops being only a decoding problem and becomes a full-stack hardware-software design problem.


## The Real Story: Sparse Models Needed A GPU Deal

Let's be clear: sparsity was never the new idea. Neural networks have always contained wasted compute. ReLU activations go to zero. Gated feedforward blocks naturally silence dimensions. Prior papers have shown activation sparsity in transformers for years.

The missing piece was not proof that many activations are zero. The missing piece was a way to make GPUs benefit from those zeros.

Modern accelerators are extraordinary at regular dense matrix multiplication. They want big predictable tiles, coalesced memory access, shared-memory reuse, and work that maps cleanly onto warps and Tensor Cores. Unstructured sparsity gives them the opposite: irregular indices, extra bookkeeping, branchy execution, and sparse conversion steps that can erase the theoretical savings.

That is the paradox TwELL attacks. Making a model do less math can make it run slower if the hardware has to work harder to find the math it can skip.

> "Sparsity did not need a better slogan. It needed a better contract with the GPU."


Sakana's launch post phrases it plainly: do not force the GPU to adapt to sparsity. Reshape the sparsity to fit the GPU.<sup><a href="#source-1">[1]</a></sup> That is the entire thesis.

The target is the transformer feedforward block. In modern gated MLPs, the model expands each token into a larger hidden dimension, gates that hidden space, multiplies the up and gate paths together, then projects back down. The gate activation decides which hidden dimensions matter for that token.<sup><a href="#source-2">[2]</a></sup><sup><a href="#source-4">[4]</a></sup>
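
To make that data flow concrete, here is a minimal pure-Python sketch of a gated feedforward block for a single token. It assumes a ReLU gate (the activation the article's ablation associates with sparsity), and the names `gated_ffn`, `W_gate`, `W_up`, and `W_down` are illustrative rather than taken from the TwELL code. With random weights roughly half the gate entries land at zero; the trained sparse models discussed later are far sparser than that.

```python
import random

def relu(x):
    return x if x > 0.0 else 0.0

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def gated_ffn(x, W_gate, W_up, W_down):
    """Gated feedforward block for one token.

    W_gate and W_up are (d_ff x d_model); W_down is (d_model x d_ff).
    Returns the output vector and the gated hidden vector.
    """
    gate = [relu(g) for g in matvec(W_gate, x)]   # zeros mark silenced dimensions
    up = matvec(W_up, x)
    hidden = [g * u for g, u in zip(gate, up)]    # zero wherever the gate is zero
    return matvec(W_down, hidden), hidden

def rand_matrix(rows, cols):
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
d_model, d_ff = 8, 32                             # toy sizes, not the paper's
x = [random.gauss(0.0, 1.0) for _ in range(d_model)]
W_gate = rand_matrix(d_ff, d_model)
W_up = rand_matrix(d_ff, d_model)
W_down = rand_matrix(d_model, d_ff)

y, hidden = gated_ffn(x, W_gate, W_up, W_down)
active = sum(1 for h in hidden if h != 0.0)
print(f"{active} of {d_ff} hidden dimensions are active for this token")
```

Every hidden dimension whose gate is zero contributes nothing to the down projection, which is the work a sparse kernel would like to skip.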

TwELL turns those sparse gate activations into a tile-aware representation that CUDA kernels can actually use.


## The Mechanism: Tile-wise ELLPACK And The Hybrid Escape Hatch

TwELL stands for Tile-wise ELLPACK. Classic ELLPACK packs sparse data row by row. TwELL instead partitions columns into horizontal tiles, then stores nonzero values and their indices inside each tile.<sup><a href="#source-4">[4]</a></sup>

That sounds like a data-structure detail. It is actually the whole product.

Here's the genius: TwELL chooses the tile geometry so the gate activation can be converted into sparse storage inside the same tiled matrix multiplication kernel that computes it. The paper describes this as setting the TwELL horizontal tile size to match the matmul tiling dimension, which lets the kernel materialize sparse output in the epilogue instead of launching a separate conversion pass.<sup><a href="#source-4">[4]</a></sup>

Translation: fewer extra kernel launches, fewer DRAM reads, less synchronization, less bookkeeping.
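
A rough picture of a tile-wise ELL layout, as a pure-Python sketch. The exact TwELL memory layout is not spelled out here, so several details are assumptions made for illustration: a fixed number of slots per tile (`slots_per_tile`), global column indices, `-1` padding, and the function names themselves. The real format is written by a fused GPU kernel epilogue, not by a CPU loop.

```python
def pack_tilewise_ell(dense, tile_width, slots_per_tile):
    """Pack each row of `dense` into per-tile ELL slots.

    For row r and column tile t:
      counts[r][t] is the number of nonzeros found in that tile
      values[r][t] holds those values, padded with 0.0
      cols[r][t]   holds their global column indices, padded with -1
    Raises ValueError if a tile holds more nonzeros than slots_per_tile.
    """
    n_cols = len(dense[0])
    n_tiles = (n_cols + tile_width - 1) // tile_width
    counts, values, cols = [], [], []
    for row in dense:
        row_counts, row_values, row_cols = [], [], []
        for t in range(n_tiles):
            start = t * tile_width
            tile = row[start:start + tile_width]
            nonzeros = [(c, v) for c, v in enumerate(tile, start) if v != 0.0]
            if len(nonzeros) > slots_per_tile:
                raise ValueError("tile has more nonzeros than slots")
            pad = slots_per_tile - len(nonzeros)
            row_counts.append(len(nonzeros))
            row_values.append([v for _, v in nonzeros] + [0.0] * pad)
            row_cols.append([c for c, _ in nonzeros] + [-1] * pad)
        counts.append(row_counts)
        values.append(row_values)
        cols.append(row_cols)
    return counts, values, cols

def unpack_tilewise_ell(counts, values, cols, n_cols):
    """Rebuild the dense matrix, reading only the first `count` slots per tile."""
    dense = []
    for row_counts, row_values, row_cols in zip(counts, values, cols):
        row = [0.0] * n_cols
        for n, tile_values, tile_cols in zip(row_counts, row_values, row_cols):
            for k in range(n):
                row[tile_cols[k]] = tile_values[k]
        dense.append(row)
    return dense

# Two token rows, eight hidden columns, split into two tiles of width four.
acts = [
    [0.0, 1.5, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.5, 0.0, 0.0, 0.0],
]
counts, values, cols = pack_tilewise_ell(acts, tile_width=4, slots_per_tile=2)
restored = unpack_tilewise_ell(counts, values, cols, n_cols=8)
```

Because each tile is packed independently, the packing for a tile needs nothing beyond the values the matmul just produced for that tile, which is what makes the fused-epilogue idea plausible.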

The compact variant described in Sakana's technical blog packs count metadata, 16-bit values, and 16-bit column indices into a single 32-bit matrix layout.<sup><a href="#source-2">[2]</a></sup> The inference kernel then uses one warp of 32 threads per cooperative thread array, handles one row at a time, loads the token activation into registers, and avoids materializing both the gate activation and the intermediate up-projection product as dense matrices.<sup><a href="#source-2">[2]</a></sup>
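
The payoff of skipping the dense intermediates can be sketched on the CPU. The example below is an illustration under stated assumptions, not the CUDA kernel: it takes the list of active gate entries for one token, computes the up projection only for those hidden units, and accumulates directly into the output. The name `sparse_ffn_token` and the `W_down_T` layout (one row per hidden unit) are hypothetical choices made for readability.

```python
def relu(x):
    return x if x > 0.0 else 0.0

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def sparse_ffn_token(x, active, W_up, W_down_T):
    """Feedforward output for one token using only active hidden units.

    active:   list of (hidden_index, gate_value) pairs with gate_value != 0
    W_up:     (d_ff x d_model), one row per hidden unit
    W_down_T: (d_ff x d_model), row j is hidden unit j's output contribution
    """
    out = [0.0] * len(x)
    for j, gate_value in active:
        h = gate_value * dot(W_up[j], x)      # up projection only where needed
        for i, w in enumerate(W_down_T[j]):
            out[i] += h * w                   # accumulate, no dense hidden vector
    return out

def dense_ffn_token(x, W_gate, W_up, W_down_T):
    """Reference path that materializes every hidden unit."""
    hidden = [relu(dot(g, x)) * dot(u, x) for g, u in zip(W_gate, W_up)]
    return [sum(hidden[j] * W_down_T[j][i] for j in range(len(hidden)))
            for i in range(len(x))]

x = [1.0, 2.0, -1.0]
W_gate = [[1.0, 0.0, 0.0], [0.0, -1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 1.0]]
W_up = [[1.0, 1.0, 0.0], [0.0, 1.0, 1.0], [1.0, 0.0, 1.0], [0.0, 0.0, 1.0]]
W_down_T = [[1.0, 0.0, 1.0], [1.0, 1.0, 1.0], [0.0, 1.0, 0.0], [0.0, 1.0, 1.0]]

# In a TwELL-style pipeline this list would come from the packed gate output.
gate = [relu(dot(g, x)) for g in W_gate]
active = [(j, g) for j, g in enumerate(gate) if g != 0.0]

sparse_out = sparse_ffn_token(x, active, W_up, W_down_T)
dense_out = dense_ffn_token(x, W_gate, W_up, W_down_T)
```

Here only two of the four hidden units are active, so the sparse path does half the up-projection and down-projection work while matching the dense reference.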


What's often overlooked is the dense backup path. The paper's hybrid training format routes ordinary sparse rows into compact ELL storage, but sends rare high-activity rows into a dense backup matrix. The authors say a cap of **128** elements per row and dense backup rows equal to **one-eighth** of the token batch are robust above L1 = 1.5e-5.<sup><a href="#source-4">[4]</a></sup>
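
The routing decision can be sketched in a few lines. This is a simplified illustration, not the authors' implementation: the function name `route_rows` is hypothetical, and raising an error when the backup fills up is an assumption made here because the overflow behavior is not described above. The defaults mirror the reported cap of 128 elements per row and a backup sized at one-eighth of the token batch.

```python
def route_rows(rows, ell_cap=128, backup_divisor=8):
    """Split activation rows between compact ELL storage and a dense backup.

    Rows with at most `ell_cap` nonzeros are stored as (column, value) pairs.
    Busier rows are copied whole into the dense backup, which can hold
    len(rows) // backup_divisor rows.
    """
    backup_capacity = len(rows) // backup_divisor
    ell_rows, backup_rows = {}, {}
    for r, row in enumerate(rows):
        nonzeros = [(c, v) for c, v in enumerate(row) if v != 0.0]
        if len(nonzeros) <= ell_cap:
            ell_rows[r] = nonzeros
        else:
            if len(backup_rows) >= backup_capacity:
                # Assumption for this sketch: treat overflow as an error.
                raise OverflowError("dense backup is full")
            backup_rows[r] = list(row)
    return ell_rows, backup_rows

# Toy batch: 8 token rows, 6 hidden columns, with a cap of 2 for illustration.
rows = [[0.0] * 6 for _ in range(8)]
rows[0][1] = 1.0
rows[3] = [1.0, 2.0, 0.0, 3.0, 4.0, 0.0]    # the outlier token
rows[5][2] = 0.5
rows[5][4] = 0.25

ell_rows, backup_rows = route_rows(rows, ell_cap=2)
```

In this toy batch the one high-activity row lands in the backup and the other seven stay in compact storage, so the common case stays cheap without truncating the outlier.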

That is not a hack. That is the part that makes the system credible. Real token activations are not uniformly sparse. Some tokens carry a lot more information. A sparse format that cannot handle the outliers is a demo, not infrastructure.


## The Numbers: The 2B Result Is Good, The Table Is Messier

The 2B row is the headline because it is the cleanest strategic story. Forward execution throughput rises from **87.8** to **106 input tokens/ms**. Energy drops from **7.85** to **6.51 mJ/token**. Training throughput rises from **22.4** to **27.3 input tokens/ms**. Accuracy barely moves, from **49.1%** to **48.8%**.<sup><a href="#source-4">[4]</a></sup>
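
As a sanity check, the relative changes can be recomputed from the rounded 2B figures quoted above. The helper `pct_change` is just illustrative arithmetic. The forward number comes out near 20.7% rather than the reported 20.5%, a gap that is consistent with the throughput figure having been rounded to 106.

```python
def pct_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0

forward = pct_change(87.8, 106.0)     # forward throughput, input tokens/ms
energy = pct_change(7.85, 6.51)       # inference energy, mJ/token
training = pct_change(22.4, 27.3)     # training throughput, input tokens/ms
accuracy_delta = 48.8 - 49.1          # mean task accuracy, percentage points

print(f"forward:  {forward:+.1f}%")
print(f"energy:   {energy:+.1f}%")
print(f"training: {training:+.1f}%")
print(f"accuracy: {accuracy_delta:+.1f} points")
```

The energy line works out to roughly a 17% reduction per token, a figure the headline numbers imply but do not state directly.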

But the whole table matters because it shows both promise and caveat.


The 0.5B model is not a clean training win. Training throughput falls **1.5%**, from 97.3 to 95.9 input tokens/ms, even as forward inference improves **17.0%**.<sup><a href="#source-4">[4]</a></sup> The 2B model has the opposite caveat: it delivers the largest speedups, but peak memory rises from **46.7 GB** to **57.1 GB**, because the sparse run fits a larger micro-batch in the reported training configuration.<sup><a href="#source-4">[4]</a></sup>

This is why the correct takeaway is not "sparse always wins." The correct takeaway is sharper: sparse execution becomes increasingly interesting when model scale, batch shape, activation distribution, and kernel design line up.

**39 to 24**: the average number of active elements falls as the models scale up.


That scale effect is the reason the paper is more interesting than a small-model optimization demo. The sparse path gets cleaner as the model gets larger in the evaluated range.<sup><a href="#source-4">[4]</a></sup>

The sparsity sweep adds more context. The 1.5B model tests eight L1 coefficients: **0**, **6e-6**, **1e-5**, **1.5e-5**, **2e-5**, **3e-5**, **6e-5**, and **1e-4**.<sup><a href="#source-4">[4]</a></sup> At L1 = 1e-4, the model averages fewer than one activated neuron. That is too sparse to treat casually. Up to L1 = 3e-5, the authors report essentially no downstream task drop and final cross-entropy within 2% of the unregularized baseline.<sup><a href="#source-4">[4]</a></sup>

The ReLU versus SiLU ablation is also revealing. The non-sparse SiLU model scores **47.1%** mean accuracy, higher than the non-sparse ReLU model's **46.4%**, but SiLU keeps all **5,632** hidden activations nonzero and is slightly slower. The sparse ReLU model at L1 = 2e-5 uses only **29** nonzeros on average and gets **17.9%** faster forward execution.<sup><a href="#source-4">[4]</a></sup>

That's the trade. SiLU is a little better in the dense baseline. ReLU plus L1 opens the sparse execution door.

## The Business Implication: Kernel Ownership Becomes A Moat

The real story isn't that Sakana compressed a 2B model. The real story is that Sakana and NVIDIA are arguing for a new axis of AI infrastructure competition: who can make the model's wasted compute visible to the hardware?

This is not the same as MoE. MoE routes tokens through selected experts. TwELL exploits sparse activations inside a more conventional transformer feedforward block. MoE changes the architecture and serving stack. TwELL changes the activation behavior, sparse format, and CUDA execution path.

That distinction is commercially important. A lab can look at TwELL and imagine a path that does not require rebuilding the entire model family around expert routing. But it does require owning enough of the kernel stack to matter.


Here's the genius from a business standpoint: TwELL creates a moat that cannot be copied by changing a prompt, swapping a model endpoint, or publishing a benchmark table. It lives in the boring layer. CUDA kernels. Sparse packing. Training infrastructure. Benchmark scripts. Hardware-specific assumptions.

That is exactly why it matters.

## The Catch: Not A Drop-In Speed Button

There is a reason this is not already everywhere.

The public GitHub repo expects **CUDA 12.8+**, says the custom kernels are designed for **H100 GPUs**, and provides a benchmark path with **500** reps and **5** warmup reps for inference testing.<sup><a href="#source-5">[5]</a></sup> The repo roadmap marks sparse model training code and TwELL inference kernels as complete, but efficient TwELL training kernels are still listed as unfinished in the public package.<sup><a href="#source-5">[5]</a></sup>

That does not invalidate the paper. It does mean the release is closer to a serious research codebase than a drop-in production library.

The Hugging Face artifacts deserve the same caution. The SparseLM pages expose BF16 safetensors, Apache-2.0 metadata, and a `llama_sparse_relu` tag, but the model cards are sparse and the models are not deployed through a hosted Inference Provider at time of inspection.<sup><a href="#source-6">[6]</a></sup><sup><a href="#source-8">[8]</a></sup> SparseLM1.5B also reports roughly **1.64B** BF16 parameters through the model API, while the page metadata can present confusing rounded size labels.<sup><a href="#source-8">[8]</a></sup>

The methodology is real, but bounded. Main experiments use FineWeb, max sequence length **2,048**, **1,048,576** tokens per step, AdamW, a **1e-3** learning rate, **0.1** weight decay, **600** warmup steps, BF16 compute, GPT-2 tokenizer, and one node of **eight H100 PCIe GPUs** unless otherwise specified.<sup><a href="#source-4">[4]</a></sup> Training budgets are **10.49B**, **20.97B**, **31.46B**, and **41.94B** tokens for the 0.5B, 1B, 1.5B, and 2B models respectively.<sup><a href="#source-4">[4]</a></sup>


> **The Key Constraint**
>
> TwELL is strongest where the deployment controls model architecture, sparse activation behavior, CUDA kernels, and hardware. If you are locked into generic dense inference paths, sparse activations are mostly a theoretical asset. The money appears when model design and kernel design are treated as one system.


## The Bigger Picture: Sparsity Moves From Compression To Infrastructure

The old sparsity story was about making models smaller. The new sparsity story is about making compute selective without making the GPU miserable.

That reframing is the point.

What's often overlooked is that the AI industry has spent the last two years treating inference cost as a pricing problem, a batching problem, or a model-distillation problem. TwELL says it is also a data-layout problem. The hidden activations inside a transformer are not just math. They are a scheduling problem for a very expensive machine.

While competitors optimize the visible product layer, Sakana and NVIDIA are optimizing the path between an activation becoming zero and the GPU actually saving work. That is less glamorous than a new chatbot. It may be more durable.


The real story isn't that sparse LLMs are suddenly solved. The real story is that Sakana and NVIDIA have shown where the fight moves next. Not just bigger models. Not just cheaper tokens. The next infrastructure edge is making the hardware stop paying for activations the model already decided to ignore.

That is the sparse kernel bet.


*Last updated: May 29, 2026*

---

*Source: [LLM Rumors](https://www.llmrumors.com/news/sakana-ai-twell-sparse-transformer-llm-gpu-kernels)*
