TwELL: The Sparse Kernel Bet That Makes LLMs 20.5% Faster

LLM Rumors · 14 min read

TL;DR: Sakana AI and NVIDIA's TwELL is not just another sparse-model trick; it is a kernel-level argument that LLM efficiency now lives inside data layout. Their 2B sparse model runs 20.5% faster in forward execution, trains 21.9% faster, and cuts inference energy from 7.85 mJ/token to 6.51 mJ/token, while mean task accuracy moves from 49.1% to 48.8%.[4] The uncomfortable truth is that sparsity did not fail because the math was wrong. It failed because GPUs made the sparse path expensive.

Sakana AI announced TwELL on May 9, 2026, with NVIDIA as collaborator, open-source code, Hugging Face checkpoints, and an ICML 2026 presentation slot.[1][5] The headline sounds technical. Tile-wise ELLPACK. CUDA kernels. Hybrid sparse formats. But the business meaning is blunt: the next cost war in AI may be fought below the model API, inside the feedforward block.

While competitors chase bigger context windows, larger dense models, and more routing experts, Sakana and NVIDIA are asking a more uncomfortable question: why are we paying GPUs to multiply by activations the model did not need in the first place?

Why This Matters Now

Feedforward layers account for over two-thirds of parameters and over 80% of total FLOPs in larger transformer models, according to the paper's framing of prior work.[4] TwELL targets that exact cost center. If sparse feedforward execution becomes practical, inference efficiency stops being only a decoding problem and becomes a full-stack hardware-software design problem.

TwELL by the Numbers

The headline figures from the 2B sparse model, trained with an L1 coefficient of 2e-5, compared with the dense 2B baseline.

Forward execution: 106 input tokens/ms for sparse 2B inference, versus 87.8 for dense (20.5% faster).

Training step: 27.3 input tokens/ms for sparse 2B training, versus 22.4 for dense (21.9% faster).

Energy per token: 6.51 mJ/token for sparse inference, versus 7.85 mJ/token for dense (17.0% lower).

Accuracy delta: -0.3 percentage points in mean task accuracy, 49.1% dense to 48.8% sparse (near flat).

Source: Sparser, Faster, Lighter Transformer Language Models, Table 1.
LLMRumors.com

The Real Story: Sparse Models Needed A GPU Deal

Let's be clear: sparsity was never the new idea. Neural networks have always contained wasted compute. ReLU activations go to zero. Gated feedforward blocks naturally silence dimensions. Prior papers have shown activation sparsity in transformers for years.

The missing piece was not proof that many activations are zero. The missing piece was a way to make GPUs benefit from those zeros.

Modern accelerators are extraordinary at regular dense matrix multiplication. They want big predictable tiles, coalesced memory access, shared-memory reuse, and work that maps cleanly onto warps and Tensor Cores. Unstructured sparsity gives them the opposite: irregular indices, extra bookkeeping, branchy execution, and sparse conversion steps that can erase the theoretical savings.

That is the paradox TwELL attacks. Making a model do less math can make it run slower if the hardware has to work harder to find the math it can skip.

Sparsity did not need a better slogan. It needed a better contract with the GPU.

LLM Rumors/Analysis

Sakana's launch post phrases it plainly: do not force the GPU to adapt to sparsity. Reshape the sparsity to fit the GPU.[1] That is the entire thesis.

The target is the transformer feedforward block. In modern gated MLPs, the model expands each token into a larger hidden dimension, gates that hidden space, multiplies the up and gate paths together, then projects back down. The gate activation decides which hidden dimensions matter for that token.[2][4]
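The block described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's code: the function names, the tiny 2-by-4 weight matrices, and the input vector are all made up for the example, and a ReLU gate is assumed, as in the sparse recipe.

```python
def matvec(x, W):
    """Multiply row vector x (length d) by matrix W (d rows, n columns)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]


def gated_ffn(x, W_gate, W_up, W_down):
    """Gated feedforward block with a ReLU gate (illustrative names)."""
    gate = [max(0.0, g) for g in matvec(x, W_gate)]  # which hidden dims matter
    up = matvec(x, W_up)                             # candidate hidden values
    hidden = [g * u for g, u in zip(gate, up)]       # gate times up path
    return matvec(hidden, W_down), gate              # project back down


# Toy weights, chosen only so the arithmetic is easy to follow.
x = [1.0, -2.0]
W_gate = [[1.0, -1.0, 0.5, -0.5],
          [1.0, 0.5, -1.0, 0.25]]
W_up = [[0.5, 1.0, -1.0, 2.0],
        [0.0, 1.0, 1.0, -1.0]]
W_down = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 2.0]]

y, gate = gated_ffn(x, W_gate, W_up, W_down)
active = [j for j, g in enumerate(gate) if g != 0.0]
print("active hidden dims:", active, "output:", y)
```

In this toy case only one of the four hidden dimensions survives the gate, so three quarters of the up and down work multiplies by zero. That wasted work is what the rest of the article is about.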

TwELL turns those sparse gate activations into a tile-aware representation that CUDA kernels can actually use.

TwELL's core idea is not merely skipping zero activations. It is packing sparse activations into tile-wise structures that fit the execution pattern of modern GPU kernels.

The Mechanism: Tile-wise ELLPACK And The Hybrid Escape Hatch

TwELL stands for Tile-wise ELLPACK. Classic ELLPACK packs sparse data row by row. TwELL instead partitions columns into horizontal tiles, then stores nonzero values and their indices inside each tile.[4]

That sounds like a data-structure detail. It is actually the whole product.

Here's the genius: TwELL chooses the tile geometry so the gate activation can be converted into sparse storage inside the same tiled matrix multiplication kernel that computes it. The paper describes this as setting the TwELL horizontal tile size to match the matmul tiling dimension, which lets the kernel materialize sparse output in the epilogue instead of launching a separate conversion pass.[4]

Translation: fewer extra kernel launches, fewer DRAM reads, less synchronization, less bookkeeping.
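To make the layout idea concrete, here is a hedged sketch of tile-wise packing in plain Python. It is not the paper's kernel or its exact memory layout: the function names, the fixed `slots` budget per tile, and the `-1` padding index are assumptions made for illustration. It only shows the general shape of the idea, which is that each row is split into column tiles and each tile keeps a count plus its own nonzero values and column indices.

```python
def pack_tiles(dense, tile_width, slots):
    """Pack a dense activation matrix tile by tile.

    For every row and every column tile of width `tile_width`, store the
    nonzero count plus padded value and column-index slots.
    """
    counts, values, cols = [], [], []
    for row in dense:
        row_counts, row_values, row_cols = [], [], []
        for start in range(0, len(row), tile_width):
            tile = row[start:start + tile_width]
            nonzeros = [(start + k, v) for k, v in enumerate(tile) if v != 0.0]
            if len(nonzeros) > slots:
                raise ValueError("tile holds more nonzeros than slots")
            padding = slots - len(nonzeros)
            row_counts.append(len(nonzeros))
            row_values.append([v for _, v in nonzeros] + [0.0] * padding)
            row_cols.append([c for c, _ in nonzeros] + [-1] * padding)
        counts.append(row_counts)
        values.append(row_values)
        cols.append(row_cols)
    return counts, values, cols


def unpack_tiles(counts, values, cols, width):
    """Rebuild the dense matrix, reading only the first `count` slots per tile."""
    dense = []
    for row_counts, row_values, row_cols in zip(counts, values, cols):
        row = [0.0] * width
        for n, tile_values, tile_cols in zip(row_counts, row_values, row_cols):
            for slot in range(n):
                row[tile_cols[slot]] = tile_values[slot]
        dense.append(row)
    return dense


# Two token rows, eight hidden columns, tiles of four columns each.
gate = [
    [0.0, 2.5, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 3.0, 0.0, 0.0, 0.5],
]
counts, values, cols = pack_tiles(gate, tile_width=4, slots=2)
restored = unpack_tiles(counts, values, cols, width=8)
```

Because every tile is packed independently, a matmul kernel that already produces its output one tile at a time could, in principle, write this structure as it finishes each tile. That is the property the paper exploits in the epilogue.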

The compact variant described in Sakana's technical blog packs count metadata, 16-bit values, and 16-bit column indices into a single 32-bit matrix layout.[2] The inference kernel then uses one warp of 32 threads per cooperative thread array, handles one row at a time, loads the token activation into registers, and avoids materializing both the gate activation and the intermediate up-projection product as dense matrices.[2]
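As a rough illustration of fitting a 16-bit value and a 16-bit column index into one 32-bit word, the sketch below uses Python's `struct` module. Several things here are assumptions rather than details from the blog: the values are treated as bfloat16 obtained by truncating a float32, the value sits in the high half and the index in the low half, and the count metadata is left out entirely. The function names are hypothetical.

```python
import struct


def float_to_bf16_bits(value):
    """Truncate a float32 to its upper 16 bits (bfloat16, no rounding)."""
    (bits,) = struct.unpack("<I", struct.pack("<f", value))
    return bits >> 16


def bf16_bits_to_float(bits16):
    """Expand 16 bfloat16 bits back to a Python float."""
    (value,) = struct.unpack("<f", struct.pack("<I", bits16 << 16))
    return value


def pack_entry(value, col):
    """Assumed layout: value bits in the high half, column index in the low half."""
    if not 0 <= col < (1 << 16):
        raise ValueError("column index must fit in 16 bits")
    return (float_to_bf16_bits(value) << 16) | col


def unpack_entry(word):
    """Split a 32-bit word back into (value, column index)."""
    return bf16_bits_to_float(word >> 16), word & 0xFFFF


word = pack_entry(2.5, 5631)
value, col = unpack_entry(word)
```

The values 2.5 and -7.5 survive the round trip exactly because they are representable in bfloat16; arbitrary floats would lose low-order mantissa bits under truncation.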

How TwELL Turns Sparsity Into Execution

The paper's stack has three layers: induce sparsity, pack it where the GPU already works, and use a separate hybrid strategy for training.

1

Sparse gate

ReLU and a small L1 activation penalty make most feedforward gate activations zero.

Use ReLU in the gate

The sparse recipe keeps the transformer shape familiar while changing the gate activation behavior.

Add L1 pressure

The recommended L1 = 2e-5 setting preserves quality while driving active neurons down sharply.

Challenges:
  • Avoid dead-neuron collapse
  • Keep downstream accuracy flat
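The sparse-gate step can be sketched as an extra term in the training loss. This is a minimal sketch under stated assumptions: the function names are hypothetical, and averaging the absolute gate activations before scaling by the coefficient is a guess at the normalization, which the article does not specify.

```python
def l1_activation_penalty(gate_acts, coeff):
    """Coefficient times the mean absolute gate activation.

    gate_acts: list of rows (one per token) of post-ReLU gate values.
    The mean normalization is an assumption, not taken from the paper.
    """
    total = sum(abs(a) for row in gate_acts for a in row)
    count = sum(len(row) for row in gate_acts)
    return coeff * total / count


def total_loss(cross_entropy, gate_acts, coeff=2e-5):
    """Language-modeling loss plus the L1 pressure on gate activations."""
    return cross_entropy + l1_activation_penalty(gate_acts, coeff)


gate_acts = [[0.0, 2.0], [0.0, 0.0]]
loss = total_loss(3.0, gate_acts)
```

Since ReLU outputs are already non-negative, the penalty simply pushes surviving activations toward zero, and a larger coefficient pushes harder.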
2

TwELL inference

Sparse gate activations are packed into tile-wise ELLPACK during the matmul kernel itself.

Pack in the epilogue

The gate matmul emits TwELL directly instead of writing a dense matrix and converting later.

Fuse up and down

A CUDA kernel traverses only active gate entries and avoids dense materialization of intermediate products.

Challenges:
  • Preserve memory locality
  • Hide uneven token activation counts
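The "traverse only active gate entries" idea can be checked against a dense reference in plain Python. This is an illustrative sketch with made-up names and toy weights, not the CUDA kernel: for each active hidden index it touches one column of the up projection and one row of the down projection, and never builds the full hidden vector.

```python
def active_gate_entries(x, W_gate):
    """Return (hidden index, gate value) pairs that survive the ReLU gate."""
    entries = []
    for j in range(len(W_gate[0])):
        pre = sum(x[i] * W_gate[i][j] for i in range(len(x)))
        if pre > 0.0:  # ReLU keeps only positive pre-activations
            entries.append((j, pre))
    return entries


def sparse_ffn(x, entries, W_up, W_down):
    """Accumulate the output from active entries only."""
    y = [0.0] * len(W_down[0])
    for j, g in entries:
        u = sum(x[i] * W_up[i][j] for i in range(len(x)))  # one up column
        h = g * u
        for k in range(len(y)):
            y[k] += h * W_down[j][k]                       # one down row
    return y


def dense_ffn(x, W_gate, W_up, W_down):
    """Dense reference that materializes every hidden value."""
    hidden = []
    for j in range(len(W_gate[0])):
        g = max(0.0, sum(x[i] * W_gate[i][j] for i in range(len(x))))
        u = sum(x[i] * W_up[i][j] for i in range(len(x)))
        hidden.append(g * u)
    return [sum(hidden[j] * W_down[j][k] for j in range(len(hidden)))
            for k in range(len(W_down[0]))]


x = [1.0, -2.0]
W_gate = [[1.0, -1.0, 0.5, -0.5],
          [1.0, 0.5, -1.0, 0.25]]
W_up = [[0.5, 1.0, -1.0, 2.0],
        [0.0, 1.0, 1.0, -1.0]]
W_down = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 2.0]]

entries = active_gate_entries(x, W_gate)
y_sparse = sparse_ffn(x, entries, W_up, W_down)
y_dense = dense_ffn(x, W_gate, W_up, W_down)
```

Both paths give the same output, but the sparse path does up and down work for one hidden index instead of four. On a GPU the hard part is doing this without losing memory locality, which is what the tile-wise format is for.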
3

Hybrid training

Training converts TwELL into a capped sparse format plus a dense backup matrix for rare heavy rows.

Route normal rows

Most rows fit into compact ELL storage with a practical cap of 128 elements.

Route heavy rows

Overflow rows go to a dense backup matrix so the format does not break on high-activity tokens.

Challenges:
  • Static allocation
  • Backpropagation memory traffic

What's often overlooked is the dense backup path. The paper's hybrid training format routes ordinary sparse rows into compact ELL storage, but sends rare high-activity rows into a dense backup matrix. The authors say a cap of 128 elements per row and dense backup rows equal to one-eighth of the token batch are robust above L1 = 1.5e-5.[4]
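A hedged sketch of that routing rule follows, using the cap of 128 and the one-eighth backup budget quoted above. Everything else is an assumption for illustration: the function name, the dictionary-based storage, and raising an error when the backup is full (the article does not say what the real implementation does in that case).

```python
def route_rows(rows, hidden, ell_cap=128, backup_divisor=8):
    """Split token rows between capped sparse storage and a dense backup.

    rows: one list of (column, value) nonzeros per token.
    Rows within the cap stay sparse; heavier rows are expanded to dense
    and placed in a backup with room for len(rows) // backup_divisor rows.
    """
    backup_capacity = len(rows) // backup_divisor
    ell, backup = {}, {}
    for r, nonzeros in enumerate(rows):
        if len(nonzeros) <= ell_cap:
            ell[r] = list(nonzeros)
        elif len(backup) < backup_capacity:
            dense_row = [0.0] * hidden
            for col, value in nonzeros:
                dense_row[col] = value
            backup[r] = dense_row
        else:
            # Overflow handling is an assumption made for this sketch.
            raise OverflowError("dense backup is full")
    return ell, backup


# Eight tokens, one of which activates 200 of 256 hidden units.
batch = [[(3, 1.0)] for _ in range(8)]
batch[5] = [(c, 1.0) for c in range(200)]
ell, backup = route_rows(batch, hidden=256)
```

Both structures have sizes known before the batch is seen, which is what makes static allocation possible: the sparse part is bounded by the cap, the dense part by the backup budget.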

That is not a hack. That is the part that makes the system credible. Real token activations are not uniformly sparse. Some tokens carry a lot more information. A sparse format that cannot handle the outliers is a demo, not infrastructure.

The paper's sparse feedforward story depends on uneven computation: most channels stay quiet, while a small active subset carries the token-specific work.

The Numbers: The 2B Result Is Good, The Table Is Messier

The 2B row is the headline because it is the cleanest strategic story. Sparse forward execution rises from 87.8 to 106 input tokens/ms. Energy drops from 7.85 to 6.51 mJ/token. Training throughput rises from 22.4 to 27.3 input tokens/ms. Accuracy barely moves, from 49.1% to 48.8%.[4]

But the whole table matters because it shows both promise and caveat.

Sparse vs Dense Across Model Scales

Metric (dense to sparse) | 0.5B | 1B | 1.5B | 2B
Training tokens | 10B | 20B | 30B | 40B
Mean accuracy | 40.4% to 40.4% | 44.6% to 44.7% | 46.4% to 46.2% | 49.1% to 48.8%
Forward execution (input tokens/ms) | 410 to 480 | 185 to 219 | 119 to 141 | 87.8 to 106
Energy per token (mJ) | 1.63 to 1.43 | 3.71 to 3.17 | 5.73 to 4.87 | 7.85 to 6.51
Training step (input tokens/ms) | 97.3 to 95.9 | 48.6 to 52.1 | 31.8 to 35.5 | 22.4 to 27.3
Peak memory (GB) | 26.2 to 21.2 | 44.5 to 33.1 | 62.8 to 45.1 | 46.7 to 57.1

The 0.5B model is not a clean training win. Training throughput falls 1.5%, from 97.3 to 95.9 input tokens/ms, even as forward inference improves 17.0%.[4] The 2B model has the opposite caveat: it delivers the largest speedups, but peak memory rises from 46.7 GB to 57.1 GB, because the sparse run fits a larger micro-batch in the reported training configuration.[4]
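For readers who want to check the percentages, the short script below recomputes them from the rounded table values. The reported figures were presumably derived from unrounded measurements, so recomputing from the table can differ from the quoted percentages by a few tenths of a point. The helper name and dictionary keys are just labels for this sketch.

```python
def pct_change(dense, sparse):
    """Relative change from the dense value to the sparse value, in percent."""
    return (sparse - dense) / dense * 100.0


# (dense, sparse) pairs copied from the 2B column of the table.
rows_2b = {
    "forward tokens/ms": (87.8, 106.0),
    "training tokens/ms": (22.4, 27.3),
    "energy mJ/token": (7.85, 6.51),
}
changes = {name: pct_change(d, s) for name, (d, s) in rows_2b.items()}

# The 0.5B training row, where the sparse model is slower.
train_05b = pct_change(97.3, 95.9)

for name, change in changes.items():
    print(f"2B {name}: {change:+.1f}%")
print(f"0.5B training tokens/ms: {train_05b:+.1f}%")
```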

This is why the correct takeaway is not "sparse always wins." The correct takeaway is sharper: sparse execution becomes increasingly interesting when model scale, batch shape, activation distribution, and kernel design line up.

39 to 24
Average active elements fall with scale

At L1 = 2e-5, average nonzero activations fall from 39 in the 0.5B model to 24 in the 2B model, making the sparse path more attractive at larger scale.

That scale effect is the reason the paper is more interesting than a small-model optimization demo. The sparse path gets cleaner as the model gets larger in the evaluated range.[4]

The sparsity sweep adds more context. The 1.5B model tests eight L1 coefficients: 0, 6e-6, 1e-5, 1.5e-5, 2e-5, 3e-5, 6e-5, and 1e-4.[4] At L1 = 1e-4, the model averages fewer than one activated neuron per token. That is too sparse to treat casually. Up to L1 = 3e-5, the authors report essentially no downstream task drop and final cross-entropy within 2% of the unregularized baseline.[4]

The ReLU versus SiLU ablation is also revealing. The non-sparse SiLU model scores 47.1% mean accuracy, higher than the non-sparse ReLU model's 46.4%, but SiLU keeps all 5,632 hidden activations nonzero and is slightly slower. The sparse ReLU model at L1 = 2e-5 uses only 29 nonzeros on average and gets 17.9% faster forward execution.[4]

That's the trade. SiLU is a little better in the dense baseline. ReLU plus L1 opens the sparse execution door.
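A back-of-envelope count shows how large the theoretical saving is compared with the measured one. This is not from the paper. It assumes the gate projection still runs as a dense matmul while the up and down projections touch only active hidden units, and it counts work in "columns" (one hidden unit against the model dimension), so the model dimension cancels out. Attention, memory traffic, and kernel overheads are ignored.

```python
HIDDEN = 5632   # hidden activations per token in the 1.5B model
ACTIVE = 29     # average nonzeros at L1 = 2e-5

# Share of hidden units that are active for an average token.
active_fraction = ACTIVE / HIDDEN

# Dense block: gate, up, and down each touch every hidden unit.
dense_units = 3 * HIDDEN

# Sparse block under the stated assumption: dense gate, sparse up and down.
sparse_units = HIDDEN + 2 * ACTIVE

remaining = sparse_units / dense_units
print(f"active fraction: {active_fraction:.2%}")
print(f"feedforward multiply-adds remaining: {remaining:.1%}")
```

Under these assumptions roughly a third of the feedforward arithmetic remains, yet the measured forward gain is 17.9%. The gap between those two numbers is the article's point: the arithmetic saving is the easy part, and turning it into wall-clock time is the kernel problem.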

The Business Implication: Kernel Ownership Becomes A Moat

The real story isn't that Sakana compressed a 2B model. The real story is that Sakana and NVIDIA are arguing for a new axis of AI infrastructure competition: who can make the model's wasted compute visible to the hardware?

This is not the same as MoE. MoE routes tokens through selected experts. TwELL exploits sparse activations inside a more conventional transformer feedforward block. MoE changes the architecture and serving stack. TwELL changes the activation behavior, sparse format, and CUDA execution path.

That distinction is commercially important. A lab can look at TwELL and imagine a path that does not require rebuilding the entire model family around expert routing. But it does require owning enough of the kernel stack to matter.

The business case is not that every sparse token is free. It is that the dense path starts to look like traffic, while the sparse path becomes a controlled infrastructure lane.

Who Should Care

TwELL is not a consumer feature. It is a signal to teams that pay the largest compute bills.

Frontier labs

Sparse feedforward execution becomes a model-design option, not just a compression afterthought.

  • Potentially lower inference energy
  • Training wins at larger scale
  • Requires sparse-aware architecture choices

Cloud providers

A 17.0% energy-per-token reduction at 2B scale is not a pricing revolution, but across fleet-scale serving it is real money.

  • Better H100 utilization
  • Pressure for sparse kernel support
  • New benchmark surface beyond dense throughput

GPU vendors

The work says irregular computation is not useless if the software stack makes it regular enough.

  • Tile-aware sparse formats
  • Kernel fusion
  • Hardware-software co-design

Enterprise buyers

Do not expect SaaS prices to collapse tomorrow. Do expect model providers to compete harder on hidden infrastructure efficiency.

  • Cost improvements start upstream
  • Hosted APIs may not expose the mechanism
  • Self-hosters need kernel maturity

Here's the genius from a business standpoint: TwELL creates a moat that cannot be copied by changing a prompt, swapping a model endpoint, or publishing a benchmark table. It lives in the boring layer. CUDA kernels. Sparse packing. Training infrastructure. Benchmark scripts. Hardware-specific assumptions.

That is exactly why it matters.

The Catch: Not A Drop-In Speed Button

There is a reason this is not already everywhere.

The public GitHub repo expects CUDA 12.8+, says the custom kernels are designed for H100 GPUs, and provides a benchmark path with 500 reps and 5 warmup reps for inference testing.[5] The repo roadmap marks sparse model training code and TwELL inference kernels as complete, but efficient TwELL training kernels are still listed as unfinished in the public package.[5]
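The warmup-then-measure pattern behind those numbers looks roughly like the generic harness below. This is not the repo's benchmark script; the function name is hypothetical and only the defaults of 500 timed reps and 5 warmup reps mirror the figures quoted above.

```python
import time


def benchmark(fn, reps=500, warmup=5):
    """Return the mean seconds per call after discarding warmup calls."""
    for _ in range(warmup):
        fn()  # warm caches and lazy initialization; not timed
    start = time.perf_counter()
    for _ in range(reps):
        fn()
    elapsed = time.perf_counter() - start
    return elapsed / reps


# Stand-in workload; a real run would call the model's forward pass.
calls = []
mean_seconds = benchmark(lambda: calls.append(1))
print(f"{len(calls)} calls, {mean_seconds * 1e6:.2f} microseconds per timed call")
```

When timing GPU work, the device has to be synchronized before each clock read, because kernel launches return before the kernel finishes. A CPU-only harness like this one would otherwise measure launch overhead rather than execution time.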

That does not invalidate the paper. It does mean the release is closer to a serious research codebase than a drop-in production library.

The Hugging Face artifacts deserve the same caution. The SparseLM pages expose BF16 safetensors, Apache-2.0 metadata, and a llama_sparse_relu tag, but the model cards are sparse and the models are not deployed through a hosted Inference Provider at time of inspection.[6][8] SparseLM1.5B also reports roughly 1.64B BF16 parameters through the model API, while the page metadata can present confusing rounded size labels.[8]

The methodology is real, but bounded. Main experiments use FineWeb, max sequence length 2,048, 1,048,576 tokens per step, AdamW, a 1e-3 learning rate, 0.1 weight decay, 600 warmup steps, BF16 compute, GPT-2 tokenizer, and one node of eight H100 PCIe GPUs unless otherwise specified.[4] Model scales run 10.49B, 20.97B, 31.46B, and 41.94B tokens for 0.5B, 1B, 1.5B, and 2B respectively.[4]
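The token budgets line up neatly with the batch size. The arithmetic below suggests that each model scale corresponds to a round number of optimizer steps; the step counts of 10,000 to 40,000 are an inference from the quoted figures, not something stated in this article.

```python
TOKENS_PER_STEP = 1_048_576
MAX_SEQ_LEN = 2_048

# Full-length sequences that fit in one optimizer step.
sequences_per_step = TOKENS_PER_STEP // MAX_SEQ_LEN

# Reported training tokens in billions, keyed by model scale.
reported_billions = {"0.5B": 10.49, "1B": 20.97, "1.5B": 31.46, "2B": 41.94}

# Assumed step counts (inferred, not quoted from the paper).
assumed_steps = {"0.5B": 10_000, "1B": 20_000, "1.5B": 30_000, "2B": 40_000}

implied_billions = {
    scale: round(steps * TOKENS_PER_STEP / 1e9, 2)
    for scale, steps in assumed_steps.items()
}
for scale, billions in implied_billions.items():
    print(f"{scale}: {assumed_steps[scale]} steps -> {billions}B tokens")
```

This also explains why the table's training-token row reads 10B, 20B, 30B, and 40B: those are the same budgets rounded to the nearest ten billion.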

How To Read The Claim

The right reading is aggressive but not naive.

1.

Treat the kernel as the product

The sparse model matters, but the breakthrough is the execution path that turns sparse activations into real GPU throughput.

Tip: Ask whether a deployment can run the custom sparse kernels, not just whether the model is sparse.
2.

Do not extrapolate straight to frontier scale

The experiments are meaningful billion-parameter runs, but they are not proof that every 100B-plus production model gets the same curve.

Tip: Watch for replication at longer context, larger batch regimes, larger models, and more serving frameworks.
3.

Separate inference readiness from training readiness

The public repo exposes inference benchmarking and sparse training code, while efficient TwELL training kernels remain on the roadmap.

Tip: Use the paper for the systems thesis and the repo for specialist reproduction, not as a one-command infra migration.
4.

Watch the memory line

The 2B row reports higher peak memory even while training throughput improves, so the operational trade is not one-dimensional.

Tip: Benchmark with your own micro-batch, sequence length, and memory budget before calling it a universal win.

The Key Constraint

TwELL is strongest where the deployment controls model architecture, sparse activation behavior, CUDA kernels, and hardware. If you are locked into generic dense inference paths, sparse activations are mostly a theoretical asset. The money appears when model design and kernel design are treated as one system.

The Bigger Picture: Sparsity Moves From Compression To Infrastructure

The old sparsity story was about making models smaller. The new sparsity story is about making compute selective without making the GPU miserable.

That reframing is the point.

What's often overlooked is that the AI industry has spent the last two years treating inference cost as a pricing problem, a batching problem, or a model-distillation problem. TwELL says it is also a data-layout problem. The hidden activations inside a transformer are not just math. They are a scheduling problem for a very expensive machine.

While competitors optimize the visible product layer, Sakana and NVIDIA are optimizing the path between an activation becoming zero and the GPU actually saving work. That is less glamorous than a new chatbot. It may be more durable.

Bottom Line

1

TwELL is not just sparse modeling. It is sparse execution, which is the harder and more commercially meaningful problem.

2

The strongest result is the 2B row: 20.5% faster forward execution, 21.9% faster training, and 17.0% lower energy per token with nearly flat mean accuracy.

3

The messy parts matter: 0.5B training slows, 2B peak memory rises, and the public training-kernel path is not fully packaged yet.

4

The strategic implication is that kernel ownership becomes an AI infrastructure moat. The labs that control the model and the execution path get leverage the API-only labs do not.

The real story isn't that sparse LLMs are suddenly solved. The real story is that Sakana and NVIDIA have shown where the fight moves next. Not just bigger models. Not just cheaper tokens. The next infrastructure edge is making the hardware stop paying for activations the model already decided to ignore.

That is the sparse kernel bet.

Sources & References

Key sources and references used in this article

[1] Sparser, Faster, Lighter Transformer Language Models. Sakana AI. May 9, 2026. Official announcement for the Sakana AI and NVIDIA collaboration, ICML 2026 framing, and links to paper, blog, and code.

[2] Sparser, Faster, Lighter Transformer Language Models. Sakana AI Technical Blog. May 2026. Accessible technical explanation of feedforward sparsity, TwELL packing, fused inference kernels, and hybrid training format.

[3] Sparser, Faster, Lighter Transformer Language Models. arXiv. Edoardo Cetin, Stefano Peluchetti, Emilio Castillo, Akira Naruse, Mana Murakami, Llion Jones. Mar 24, 2026. arXiv record with author list, submission date, revision date, abstract, and subject metadata.

[4] Sparser, Faster, Lighter Transformer Language Models (PDF). arXiv. Sakana AI and NVIDIA. May 8, 2026. Primary source for Table 1 performance values, Table 2 hyperparameters, sparse format definitions, ablations, and limitations.

[5] SakanaAI/sparser-faster-llms. GitHub. Sakana AI. 2026. Reference implementation with sparse training code, TwELL inference kernels, H100 CUDA requirements, benchmarking scripts, and roadmap.

[6] SakanaAI/SparseLM0.5B. Hugging Face. Sakana AI. 2026. Released 0.5B-class SparseLM checkpoint with BF16 safetensors and Apache-2.0 model metadata.

[7] SakanaAI/SparseLM1B. Hugging Face. Sakana AI. 2026. Released 1B-class SparseLM checkpoint for reproducing the sparse model family.

[8] SakanaAI/SparseLM1.5B. Hugging Face. Sakana AI. 2026. Released 1.5B-class SparseLM checkpoint referenced by the repo's benchmark examples.

[9] SakanaAI/SparseLM2B. Hugging Face. Sakana AI. 2026. Released 2B-class SparseLM checkpoint corresponding to the largest model scale in the paper's main comparison.

[10] NVIDIA Hopper Tuning Guide. NVIDIA Developer Documentation. 2026. Official background on Hopper GPU features relevant to tiled execution, shared memory, and Tensor Memory Accelerator behavior.

Last updated: May 29, 2026