TwELL Sparse LLMs: Sakana AI's 20.5% GPU Speedup

TL;DR: Sakana AI and NVIDIA's TwELL is not just another sparse-model trick; it is a kernel-level argument that LLM efficiency now lives inside data layout. Their 2B sparse model runs 20.5% faster in forward execution, trains 21.9% faster, and cuts inference energy from 7.85 mJ/token to 6.51 mJ/token, while mean task accuracy moves from 49.1% to 48.8%.^[4] The uncomfortable truth is that sparsity did not fail because the math was wrong. It failed because GPUs made the sparse path expensive.

Sakana AI announced TwELL on May 9, 2026, with NVIDIA as collaborator, open-source code, Hugging Face checkpoints, and an ICML 2026 presentation slot.^[1]^[5] The headline sounds technical. Tile-wise ELLPACK. CUDA kernels. Hybrid sparse formats. But the business meaning is blunt: the next cost war in AI may be fought below the model API, inside the feedforward block.

While competitors chase bigger context windows, larger dense models, and more routing experts, Sakana and NVIDIA are asking a more uncomfortable question: why are we paying GPUs to multiply by activations the model did not need in the first place?


Why This Matters Now

Feedforward layers account for over two-thirds of parameters and over 80% of total FLOPs in larger transformer models, according to the paper's framing of prior work.^[4] TwELL targets that exact cost center. If sparse feedforward execution becomes practical, inference efficiency stops being only a decoding problem and becomes a full-stack hardware-software design problem.


The Real Story: Sparse Models Needed A GPU Deal

Let's be clear: sparsity was never the new idea. Neural networks have always contained wasted compute. ReLU activations go to zero. Gated feedforward blocks naturally silence dimensions. Prior papers have shown activation sparsity in transformers for years.

The missing piece was not proof that many activations are zero. The missing piece was a way to make GPUs benefit from those zeros.

Modern accelerators are extraordinary at regular dense matrix multiplication. They want big predictable tiles, coalesced memory access, shared-memory reuse, and work that maps cleanly onto warps and Tensor Cores. Unstructured sparsity gives them the opposite: irregular indices, extra bookkeeping, branchy execution, and sparse conversion steps that can erase the theoretical savings.

That is the paradox TwELL attacks. Making a model do less math can make it run slower if the hardware has to work harder to find the math it can skip.

Sakana's launch post phrases it plainly: do not force the GPU to adapt to sparsity. Reshape the sparsity to fit the GPU.^[1] That is the entire thesis.

The target is the transformer feedforward block. In modern gated MLPs, the model expands each token into a larger hidden dimension, gates that hidden space, multiplies the up and gate paths together, then projects back down. The gate activation decides which hidden dimensions matter for that token.^[2]^[4]
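To make the gated block concrete, here is a minimal pure-Python sketch of the computation described above, with a ReLU gate. It is an illustration under stated assumptions, not the authors' kernel: the function names, the tiny dimensions, and the random weights are all hypothetical. The second function shows the point of the whole exercise: once the gate is zero for a hidden unit, neither the up projection nor the down projection needs to touch that unit.

```python
import random

def relu(v):
    return v if v > 0.0 else 0.0

def matvec(W, x):
    # W is a list of rows; returns the matrix-vector product W @ x.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def dense_gated_ffn(x, W_gate, W_up, W_down):
    gate = [relu(g) for g in matvec(W_gate, x)]   # hidden-sized, zeros where ReLU clips
    up = matvec(W_up, x)                          # hidden-sized
    hidden = [g * u for g, u in zip(gate, up)]    # elementwise gate * up
    return matvec(W_down, hidden), gate           # project back to model width

def sparse_gated_ffn(x, W_gate, W_up, W_down):
    gate = [relu(g) for g in matvec(W_gate, x)]
    active = [j for j, g in enumerate(gate) if g != 0.0]
    out = [0.0] * len(W_down)
    for j in active:
        # Only active hidden units pay for the up and down projections.
        h = gate[j] * sum(w * xi for w, xi in zip(W_up[j], x))
        for i in range(len(W_down)):
            out[i] += W_down[i][j] * h
    return out, active

# Hypothetical toy sizes; real models use thousands of hidden units.
random.seed(0)
d_model, d_hidden = 4, 16
x = [random.gauss(0, 1) for _ in range(d_model)]
W_gate = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(d_hidden)]
W_up = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(d_hidden)]
W_down = [[random.gauss(0, 1) for _ in range(d_hidden)] for _ in range(d_model)]

dense_out, gate = dense_gated_ffn(x, W_gate, W_up, W_down)
sparse_out, active = sparse_gated_ffn(x, W_gate, W_up, W_down)
print(len(active), "of", d_hidden, "hidden units active")
```

With random weights roughly half the gate survives ReLU; the L1-regularized models discussed later push that count far lower, which is where the skipped work starts to matter.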

TwELL turns those sparse gate activations into a tile-aware representation that CUDA kernels can actually use.

Illustration: sparse matrix blocks transforming into organized tile-wise bands. TwELL's core idea is not merely skipping zero activations. It is packing sparse activations into tile-wise structures that fit the execution pattern of modern GPU kernels.

The Mechanism: Tile-wise ELLPACK And The Hybrid Escape Hatch

TwELL stands for Tile-wise ELLPACK. Classic ELLPACK packs sparse data row by row. TwELL instead partitions columns into horizontal tiles, then stores nonzero values and their indices inside each tile.^[4]

That sounds like a data-structure detail. It is actually the whole product.

Here's the genius: TwELL chooses the tile geometry so the gate activation can be converted into sparse storage inside the same tiled matrix multiplication kernel that computes it. The paper describes this as setting the TwELL horizontal tile size to match the matmul tiling dimension, which lets the kernel materialize sparse output in the epilogue instead of launching a separate conversion pass.^[4]

Translation: fewer extra kernel launches, fewer DRAM reads, less synchronization, less bookkeeping.
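A rough way to picture the format is to pack a small matrix by hand. The sketch below splits the columns into fixed-width tiles and, inside each tile, stores each row's nonzeros in padded ELL-style arrays with a per-row count. The dictionary layout, the -1 padding index, and the choice to size each tile by its busiest row are assumptions made for illustration; the paper's actual memory layout and the fused epilogue are not reproduced here.

```python
def pack_tilewise_ell(dense, tile_width):
    """Pack a dense matrix (list of rows) into per-column-tile ELL storage."""
    n_cols = len(dense[0])
    tiles = []
    for start in range(0, n_cols, tile_width):
        # Collect (local_column, value) pairs per row for this column tile.
        per_row = [
            [(k, v) for k, v in enumerate(row[start:start + tile_width]) if v != 0.0]
            for row in dense
        ]
        slots = max(len(p) for p in per_row)  # ELL width for this tile
        tiles.append({
            "start": start,
            "counts": [len(p) for p in per_row],
            # Pad short rows so every row in the tile has the same number of slots.
            "cols": [[k for k, _ in p] + [-1] * (slots - len(p)) for p in per_row],
            "vals": [[v for _, v in p] + [0.0] * (slots - len(p)) for p in per_row],
        })
    return tiles

def unpack_tilewise_ell(tiles, n_rows, n_cols):
    """Rebuild the dense matrix, reading only the first `count` slots per row."""
    dense = [[0.0] * n_cols for _ in range(n_rows)]
    for tile in tiles:
        for r in range(n_rows):
            for s in range(tile["counts"][r]):
                dense[r][tile["start"] + tile["cols"][r][s]] = tile["vals"][r][s]
    return dense

example = [
    [0.0, 2.0, 0.0, 0.0, 0.0, 0.0, 3.0, 4.0],
    [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0],
]
packed = pack_tilewise_ell(example, tile_width=4)
restored = unpack_tilewise_ell(packed, n_rows=2, n_cols=8)
```

Because each tile is packed independently, a kernel that already produces its output one tile at a time could in principle emit this structure as it goes, which is the property the paragraph above is pointing at.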

The compact variant described in Sakana's technical blog packs count metadata, 16-bit values, and 16-bit column indices into a single 32-bit matrix layout.^[2] The inference kernel then uses one warp of 32 threads per cooperative thread array, handles one row at a time, loads the token activation into registers, and avoids materializing both the gate activation and the intermediate up-projection product as dense matrices.^[2]
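The idea of sharing one 32-bit word between a 16-bit value and a 16-bit column index can be sketched in a few lines. Everything specific here is an assumption: the source says only that the values and indices are 16 bits each, so the use of bfloat16 by truncation, the choice of which half holds the value, and the helper names are hypothetical, and the count metadata is left out entirely.

```python
import struct

def f32_to_bf16_bits(x):
    # Keep the upper 16 bits of the float32 encoding (truncation, no rounding).
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return bits >> 16

def bf16_bits_to_f32(b):
    # Re-expand 16 stored bits into a float32 by zero-filling the low half.
    (x,) = struct.unpack("<f", struct.pack("<I", (b & 0xFFFF) << 16))
    return x

def pack_entry(value, col):
    # Assumed layout: value bits in the high half, column index in the low half.
    if not 0 <= col < (1 << 16):
        raise ValueError("column index must fit in 16 bits")
    return (f32_to_bf16_bits(value) << 16) | col

def unpack_entry(word):
    return bf16_bits_to_f32(word >> 16), word & 0xFFFF

word = pack_entry(1.5, 5631)
value, col = unpack_entry(word)
```

A 16-bit index is enough for the 5,632-wide hidden layer mentioned later in the article, which is one reason a layout like this is plausible at these model sizes.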

What's often overlooked is the dense backup path. The paper's hybrid training format routes ordinary sparse rows into compact ELL storage, but sends rare high-activity rows into a dense backup matrix. The authors say a cap of 128 elements per row and dense backup rows equal to one-eighth of the token batch are robust above L1 = 1.5e-5.^[4]

That is not a hack. That is the part that makes the system credible. Real token activations are not uniformly sparse. Some tokens carry a lot more information. A sparse format that cannot handle the outliers is a demo, not infrastructure.
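The routing rule can be sketched as a simple per-row decision. The cap of 128 and the one-eighth backup budget come from the figures quoted above; the function name, the return types, and the decision to raise an error when the backup is full are assumptions, since the source does not say what happens on overflow.

```python
def route_rows(activations, ell_cap=128, backup_divisor=8):
    """Split rows into compact ELL entries and a bounded dense backup."""
    backup_capacity = len(activations) // backup_divisor
    ell_rows, backup_rows = {}, {}
    for r, row in enumerate(activations):
        nonzeros = [(c, v) for c, v in enumerate(row) if v != 0.0]
        if len(nonzeros) <= ell_cap:
            ell_rows[r] = nonzeros          # ordinary sparse row: store (col, value) pairs
        elif len(backup_rows) < backup_capacity:
            backup_rows[r] = list(row)      # heavy row: keep it dense
        else:
            # Hypothetical policy for this sketch only.
            raise OverflowError("dense backup is full")
    return ell_rows, backup_rows

def make_row(width, nnz):
    return [1.0] * nnz + [0.0] * (width - nnz)

# 16 tokens: 14 ordinary rows and 2 outliers, so the backup budget (16 // 8 = 2) is just enough.
batch = [make_row(200, 3) for _ in range(14)] + [make_row(200, 150) for _ in range(2)]
ell_rows, backup_rows = route_rows(batch)
```

The design choice worth noticing is that the expensive dense path has a fixed size, so the worst case is bounded regardless of how unevenly the activations are distributed.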

Illustration: cutaway of a sparse feedforward engine with most neuron columns dim and a few active channels glowing. The paper's sparse feedforward story depends on uneven computation: most channels stay quiet, while a small active subset carries the token-specific work.

The Numbers: The 2B Result Is Good, The Table Is Messier

The 2B row is the headline because it is the cleanest strategic story. Moving from the dense model to the sparse one, forward throughput rises from 87.8 to 106 input tokens/ms, energy drops from 7.85 to 6.51 mJ/token, and training throughput rises from 22.4 to 27.3 input tokens/ms. Accuracy barely moves, from 49.1% to 48.8%.^[4]

But the whole table matters because it shows both promise and caveat.

Sparse vs Dense Across Model Scales

Metric (dense to sparse)	0.5B	1B	1.5B	2B
Training tokens (approx.)	10B	20B	30B	40B
Mean task accuracy (%)	40.4 to 40.4	44.6 to 44.7	46.4 to 46.2	49.1 to 48.8
Forward throughput (input tokens/ms)	410 to 480	185 to 219	119 to 141	87.8 to 106
Energy per token (mJ)	1.63 to 1.43	3.71 to 3.17	5.73 to 4.87	7.85 to 6.51
Training throughput (input tokens/ms)	97.3 to 95.9	48.6 to 52.1	31.8 to 35.5	22.4 to 27.3
Peak memory (GB)	26.2 to 21.2	44.5 to 33.1	62.8 to 45.1	46.7 to 57.1

The 0.5B model is not a clean training win. Training throughput falls 1.5%, from 97.3 to 95.9 input tokens/ms, even as forward inference improves 17.0%.^[4] The 2B model has the opposite caveat: it delivers the largest speedups, but peak memory rises from 46.7 GB to 57.1 GB, because the sparse run fits a larger micro-batch in the reported training configuration.^[4]
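The percentages are easy to sanity-check from the table. The short script below recomputes relative changes from the rounded figures quoted in this article. Recomputing this way gives values close to, but not always identical to, the headline numbers (for example, about 20.7% rather than 20.5% for 2B forward throughput), presumably because the paper works from unrounded measurements; that explanation is an assumption.

```python
def pct_change(dense, sparse):
    """Relative change from the dense baseline to the sparse model, in percent."""
    return 100.0 * (sparse - dense) / dense

# (dense, sparse) pairs copied from the table above.
rows = {
    "2B forward throughput": (87.8, 106.0),
    "2B training throughput": (22.4, 27.3),
    "2B energy per token": (7.85, 6.51),
    "2B peak memory": (46.7, 57.1),
    "0.5B forward throughput": (410.0, 480.0),
    "0.5B training throughput": (97.3, 95.9),
}

changes = {name: pct_change(d, s) for name, (d, s) in rows.items()}
for name, value in changes.items():
    print(f"{name}: {value:+.2f}%")
```

Read the signs with the metric in mind: a negative change is good for energy, bad for throughput, and the positive change on 2B peak memory is the caveat discussed above.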

This is why the correct takeaway is not "sparse always wins." The correct takeaway is sharper: sparse execution becomes increasingly interesting when model scale, batch shape, activation distribution, and kernel design line up.

That scale effect is the reason the paper is more interesting than a small-model optimization demo. The sparse path gets cleaner as the model gets larger in the evaluated range.^[4]

The sparsity sweep adds more context. The 1.5B model tests eight L1 coefficients: 0, 6e-6, 1e-5, 1.5e-5, 2e-5, 3e-5, 6e-5, and 1e-4.^[4] At L1 = 1e-4, the model averages fewer than one activated neuron. That is too sparse to treat casually. Up to L1 = 3e-5, the authors report essentially no downstream task drop and final cross-entropy within 2% of the unregularized baseline.^[4]

The ReLU versus SiLU ablation is also revealing. The non-sparse SiLU model scores 47.1% mean accuracy, higher than the non-sparse ReLU model's 46.4%, but SiLU keeps all 5,632 hidden activations nonzero and is slightly slower. The sparse ReLU model at L1 = 2e-5 uses only 29 nonzeros on average and gets 17.9% faster forward execution.^[4]

That's the trade. SiLU is a little better in the dense baseline. ReLU plus L1 opens the sparse execution door.
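How an L1 coefficient turns into sparsity pressure can be shown with a toy loss. The sketch below adds a penalty proportional to the summed magnitude of the ReLU gate activations, averaged over tokens. The exact form the authors use (which activations are penalized, and whether the penalty is summed or averaged) is not given in this article, so this formulation and every function name are assumptions.

```python
def relu(v):
    return v if v > 0.0 else 0.0

def l1_gate_penalty(gate_pre_activations, l1_coeff):
    """Mean over tokens of the summed absolute gate activations, scaled by l1_coeff."""
    per_token = [sum(abs(relu(g)) for g in token) for token in gate_pre_activations]
    return l1_coeff * sum(per_token) / len(per_token)

def mean_nonzeros(gate_pre_activations):
    """Average number of hidden units that survive the ReLU gate per token."""
    counts = [sum(1 for g in token if relu(g) != 0.0) for token in gate_pre_activations]
    return sum(counts) / len(counts)

def regularized_loss(cross_entropy, gate_pre_activations, l1_coeff):
    return cross_entropy + l1_gate_penalty(gate_pre_activations, l1_coeff)

# Two toy tokens with three hidden units each (hypothetical values).
pre = [[1.0, -2.0, 0.5], [-1.0, -1.0, 3.0]]
penalty = l1_gate_penalty(pre, l1_coeff=2e-5)
loss = regularized_loss(2.0, pre, l1_coeff=2e-5)
avg_active = mean_nonzeros(pre)
```

SiLU has no hard zero, so a penalty like this can shrink its activations without ever producing an exact zero to skip. That is the practical reason the ReLU gate matters here.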

The Business Implication: Kernel Ownership Becomes A Moat

The real story isn't that Sakana compressed a 2B model. The real story is that Sakana and NVIDIA are arguing for a new axis of AI infrastructure competition: who can make the model's wasted compute visible to the hardware?

This is not the same as MoE. MoE routes tokens through selected experts. TwELL exploits sparse activations inside a more conventional transformer feedforward block. MoE changes the architecture and serving stack. TwELL changes the activation behavior, sparse format, and CUDA execution path.

That distinction is commercially important. A lab can look at TwELL and imagine a path that does not require rebuilding the entire model family around expert routing. But it does require owning enough of the kernel stack to matter.

Illustration: dense compute traffic compared with clean sparse lanes into data centers. The business case is not that every sparse token is free. It is that the dense path starts to look like traffic, while the sparse path becomes a controlled infrastructure lane.

Here's the genius from a business standpoint: TwELL creates a moat that cannot be copied by changing a prompt, swapping a model endpoint, or publishing a benchmark table. It lives in the boring layer. CUDA kernels. Sparse packing. Training infrastructure. Benchmark scripts. Hardware-specific assumptions.

That is exactly why it matters.

The Catch: Not A Drop-In Speed Button

There is a reason this is not already everywhere.

The public GitHub repo expects CUDA 12.8+, says the custom kernels are designed for H100 GPUs, and provides a benchmark path with 500 reps and 5 warmup reps for inference testing.^[5] The repo roadmap marks sparse model training code and TwELL inference kernels as complete, but efficient TwELL training kernels are still listed as unfinished in the public package.^[5]

That does not invalidate the paper. It does mean the release is closer to a serious research codebase than a drop-in production library.
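The warmup-then-measure pattern behind those rep counts looks roughly like this. It is a generic CPU-side timing harness, not the repo's benchmark script: the function name and return fields are hypothetical, and a real GPU benchmark would also need to synchronize the device before reading the clock.

```python
import statistics
import time

def benchmark(fn, reps=500, warmup=5):
    """Time fn() over `reps` measured calls after `warmup` unmeasured calls."""
    for _ in range(warmup):
        fn()  # warm caches and lazy initialization; results discarded
    samples = []
    for _ in range(reps):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return {
        "reps": len(samples),
        "mean_s": statistics.mean(samples),
        "median_s": statistics.median(samples),
    }

calls = {"n": 0}

def toy_workload():
    # Stand-in for a forward pass.
    calls["n"] += 1
    return sum(i * i for i in range(1000))

result = benchmark(toy_workload, reps=20, warmup=5)
```

Reporting the median alongside the mean is a small hedge against the occasional slow iteration skewing the result.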

The Hugging Face artifacts deserve the same caution. The SparseLM pages expose BF16 safetensors, Apache-2.0 metadata, and a llama_sparse_relu tag, but the model cards are sparse and the models are not deployed through a hosted Inference Provider at time of inspection.^[6]^[8] SparseLM1.5B also reports roughly 1.64B BF16 parameters through the model API, while the page metadata can present confusing rounded size labels.^[8]

The methodology is real, but bounded. Main experiments use FineWeb, max sequence length 2,048, 1,048,576 tokens per step, AdamW, a 1e-3 learning rate, 0.1 weight decay, 600 warmup steps, BF16 compute, GPT-2 tokenizer, and one node of eight H100 PCIe GPUs unless otherwise specified.^[4] Model scales run 10.49B, 20.97B, 31.46B, and 41.94B tokens for 0.5B, 1B, 1.5B, and 2B respectively.^[4]
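The token budgets line up neatly with the batch size. The arithmetic below assumes round step counts of 10,000, 20,000, 30,000, and 40,000; those step counts are an inference from the totals, not something the source states. Under that assumption, each step of 1,048,576 tokens holds 512 full-length sequences, and the totals reproduce the quoted 10.49B to 41.94B figures, which also explains why the comparison table rounds them to 10B through 40B.

```python
TOKENS_PER_STEP = 1_048_576   # from the setup described above (2**20)
MAX_SEQ_LEN = 2_048           # from the setup described above

# Full-length sequences that fit in one optimizer step.
sequences_per_step = TOKENS_PER_STEP // MAX_SEQ_LEN

# Assumed step counts per model scale (not stated in the source).
ASSUMED_STEPS = {"0.5B": 10_000, "1B": 20_000, "1.5B": 30_000, "2B": 40_000}

budgets = {scale: steps * TOKENS_PER_STEP for scale, steps in ASSUMED_STEPS.items()}
for scale, tokens in budgets.items():
    print(f"{scale}: {tokens / 1e9:.2f}B tokens")
```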


The Key Constraint

TwELL is strongest where the deployment controls model architecture, sparse activation behavior, CUDA kernels, and hardware. If you are locked into generic dense inference paths, sparse activations are mostly a theoretical asset. The money appears when model design and kernel design are treated as one system.

The Bigger Picture: Sparsity Moves From Compression To Infrastructure

The old sparsity story was about making models smaller. The new sparsity story is about making compute selective without making the GPU miserable.

That reframing is the point.

What's often overlooked is that the AI industry has spent the last two years treating inference cost as a pricing problem, a batching problem, or a model-distillation problem. TwELL says it is also a data-layout problem. The hidden activations inside a transformer are not just math. They are a scheduling problem for a very expensive machine.

While competitors optimize the visible product layer, Sakana and NVIDIA are optimizing the path between an activation becoming zero and the GPU actually saving work. That is less glamorous than a new chatbot. It may be more durable.

The real story isn't that sparse LLMs are suddenly solved. The real story is that Sakana and NVIDIA have shown where the fight moves next. Not just bigger models. Not just cheaper tokens. The next infrastructure edge is making the hardware stop paying for activations the model already decided to ignore.

That is the sparse kernel bet.

Sources & References

Key sources and references used in this article

#	Source	Outlet	Date	Key Takeaway
1	Sparser, Faster, Lighter Transformer Language Models	Sakana AI	May 9, 2026	Official announcement for the Sakana AI and NVIDIA collaboration, ICML 2026 framing, and links to paper, blog, and code.
2	Sparser, Faster, Lighter Transformer Language Models	Sakana AI Technical Blog	May 2026	Accessible technical explanation of feedforward sparsity, TwELL packing, fused inference kernels, and hybrid training format.
3	Sparser, Faster, Lighter Transformer Language Models	arXiv (Edoardo Cetin, Stefano Peluchetti, Emilio Castillo, Akira Naruse, Mana Murakami, Llion Jones)	Mar 24, 2026	arXiv record with author list, submission date, revision date, abstract, and subject metadata.
4	Sparser, Faster, Lighter Transformer Language Models PDF	arXiv (Sakana AI and NVIDIA)	May 8, 2026	Primary source for Table 1 performance values, Table 2 hyperparameters, sparse format definitions, ablations, and limitations.
5	SakanaAI/sparser-faster-llms	GitHub (Sakana AI)	2026	Reference implementation with sparse training code, TwELL inference kernels, H100 CUDA requirements, benchmarking scripts, and roadmap.
6	SakanaAI/SparseLM0.5B	Hugging Face (Sakana AI)	2026	Released 0.5B-class SparseLM checkpoint with BF16 safetensors and Apache-2.0 model metadata.
7	SakanaAI/SparseLM1B	Hugging Face (Sakana AI)	2026	Released 1B-class SparseLM checkpoint for reproducing the sparse model family.
8	SakanaAI/SparseLM1.5B	Hugging Face (Sakana AI)	2026	Released 1.5B-class SparseLM checkpoint referenced by the repo's benchmark examples.
9	SakanaAI/SparseLM2B	Hugging Face (Sakana AI)	2026	Released 2B-class SparseLM checkpoint corresponding to the largest model scale in the paper's main comparison.
10	NVIDIA Hopper Tuning Guide	NVIDIA Developer Documentation	2026	Official background on Hopper GPU features relevant to tiled execution, shared memory, and Tensor Memory Accelerator behavior.


Last updated: May 29, 2026
