TL;DR: Sakana AI and NVIDIA's TwELL is not just another sparse-model trick; it is a kernel-level argument that LLM efficiency now lives inside data layout. Their 2B sparse model runs 20.5% faster in forward execution, trains 21.9% faster, and cuts inference energy from 7.85 mJ/token to 6.51 mJ/token, while mean task accuracy moves from 49.1% to 48.8%.[4] The uncomfortable truth is that sparsity did not fail because the math was wrong. It failed because GPUs made the sparse path expensive.
Sakana AI announced TwELL on May 9, 2026, with NVIDIA as collaborator, open-source code, Hugging Face checkpoints, and an ICML 2026 presentation slot.[1][5] The headline sounds technical. Tile-wise ELLPACK. CUDA kernels. Hybrid sparse formats. But the business meaning is blunt: the next cost war in AI may be fought below the model API, inside the feedforward block.
While competitors chase bigger context windows, larger dense models, and more routing experts, Sakana and NVIDIA are asking a more uncomfortable question: why are we paying GPUs to multiply by activations the model did not need in the first place?
Why This Matters Now
Feedforward layers account for over two-thirds of parameters and over 80% of total FLOPs in larger transformer models, according to the paper's framing of prior work.[4] TwELL targets that exact cost center. If sparse feedforward execution becomes practical, inference efficiency stops being only a decoding problem and becomes a full-stack hardware-software design problem.
TwELL by the Numbers
The headline figures from the 2B sparse model at L1 = 2e-5, compared with the dense 2B baseline.
- 106 input tokens/ms for sparse 2B inference, versus 87.8 for dense.
- 27.3 input tokens/ms for sparse 2B training, versus 22.4 for dense.
- 6.51 mJ/token for sparse inference, versus 7.85 mJ/token for dense.
- A 0.3 percentage-point drop in mean task accuracy, from 49.1% dense to 48.8% sparse.
The Real Story: Sparse Models Needed A GPU Deal
Let's be clear: sparsity was never the new idea. Neural networks have always contained wasted compute. ReLU activations go to zero. Gated feedforward blocks naturally silence dimensions. Prior papers have shown activation sparsity in transformers for years.
The missing piece was not proof that many activations are zero. The missing piece was a way to make GPUs benefit from those zeros.
Modern accelerators are extraordinary at regular dense matrix multiplication. They want big predictable tiles, coalesced memory access, shared-memory reuse, and work that maps cleanly onto warps and Tensor Cores. Unstructured sparsity gives them the opposite: irregular indices, extra bookkeeping, branchy execution, and sparse conversion steps that can erase the theoretical savings.
That is the paradox TwELL attacks. Making a model do less math can make it run slower if the hardware has to work harder to find the math it can skip.
Sparsity did not need a better slogan. It needed a better contract with the GPU.
Sakana's launch post phrases it plainly: do not force the GPU to adapt to sparsity. Reshape the sparsity to fit the GPU.[1] That is the entire thesis.
The target is the transformer feedforward block. In modern gated MLPs, the model expands each token into a larger hidden dimension, gates that hidden space, multiplies the up and gate paths together, then projects back down. The gate activation decides which hidden dimensions matter for that token.[2][4]
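To make the block concrete, here is a minimal pure-Python sketch of a ReLU-gated feedforward layer for a single token. The weights, sizes, and function names are made up for illustration and are not taken from the paper or the repo; the point is only to show how a zero gate wipes out an entire hidden dimension.

```python
def relu(v):
    return [max(0.0, a) for a in v]

def matvec(x, w):
    """Multiply row vector x (length d_in) by matrix w (d_in x d_out)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

def gated_ffn(x, w_gate, w_up, w_down):
    gate = relu(matvec(x, w_gate))               # decides which hidden dims are active
    up = matvec(x, w_up)                         # candidate hidden values
    hidden = [g * u for g, u in zip(gate, up)]   # zero wherever the gate is zero
    return matvec(hidden, w_down), gate

# Toy sizes (hypothetical): model dim 2, hidden dim 4.
x = [1.0, -1.0]
w_gate = [[1.0, -1.0, 0.5, -2.0],
          [0.0, 1.0, 1.0, 1.0]]
w_up = [[1.0, 1.0, 1.0, 1.0],
        [0.5, 0.5, 0.5, 0.5]]
w_down = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 2.0]]

y, gate = gated_ffn(x, w_gate, w_up, w_down)
active = [j for j, g in enumerate(gate) if g > 0.0]
```

In this toy case only one of the four hidden dimensions survives the gate, yet the dense code still computes all four up-projections and all four rows of the down-projection. That wasted work is what a sparse execution path tries to skip.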
TwELL turns those sparse gate activations into a tile-aware representation that CUDA kernels can actually use.

The Mechanism: Tile-wise ELLPACK And The Hybrid Escape Hatch
TwELL stands for Tile-wise ELLPACK. Classic ELLPACK packs sparse data row by row. TwELL instead partitions columns into horizontal tiles, then stores nonzero values and their indices inside each tile.[4]
That sounds like a data-structure detail. It is actually the whole product.
Here's the genius: TwELL chooses the tile geometry so the gate activation can be converted into sparse storage inside the same tiled matrix multiplication kernel that computes it. The paper describes this as setting the TwELL horizontal tile size to match the matmul tiling dimension, which lets the kernel materialize sparse output in the epilogue instead of launching a separate conversion pass.[4]
Translation: fewer extra kernel launches, fewer DRAM reads, less synchronization, less bookkeeping.
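A rough way to picture the format is the following pure-Python sketch, which packs one activation row tile by tile. It assumes a fixed per-tile capacity, a padding index of -1, and global column indices; `tile_width` stands in for the matmul tiling dimension. All of these names and choices are illustrative assumptions, and the paper's actual memory layout may differ.

```python
def pack_twell_row(row, tile_width, capacity):
    """Pack one activation row into per-tile (count, values, columns) slots."""
    tiles = []
    for start in range(0, len(row), tile_width):
        vals, cols = [], []
        for offset, v in enumerate(row[start:start + tile_width]):
            if v != 0.0:
                vals.append(v)
                cols.append(start + offset)      # global column index
        if len(vals) > capacity:
            raise ValueError("tile overflow: raise capacity or use a fallback")
        pad = capacity - len(vals)
        tiles.append({"count": len(vals),
                      "values": vals + [0.0] * pad,
                      "columns": cols + [-1] * pad})
    return tiles

def unpack_twell_row(tiles, width):
    """Rebuild the dense row, reading only the first `count` slots of each tile."""
    row = [0.0] * width
    for t in tiles:
        for k in range(t["count"]):
            row[t["columns"][k]] = t["values"][k]
    return row

row = [0.0, 0.0, 1.5, 0.0, 0.0, 0.0, 0.0, 2.0]
tiles = pack_twell_row(row, tile_width=4, capacity=2)
restored = unpack_twell_row(tiles, len(row))
```

Because each tile is packed independently, whatever computes a tile of the gate output can also write that tile's packed slots without seeing the rest of the row. That independence is the property the epilogue trick relies on.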
The compact variant described in Sakana's technical blog packs count metadata, 16-bit values, and 16-bit column indices into a single 32-bit matrix layout.[2] The inference kernel then uses one warp of 32 threads per cooperative thread array, handles one row at a time, loads the token activation into registers, and avoids materializing both the gate activation and the intermediate up-projection product as dense matrices.[2]
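The 16-plus-16 packing can be sketched in a few lines of Python. This version assumes the 16-bit value is a bfloat16 obtained by truncating a float32, places the value in the high half of the word, and leaves out the count metadata entirely. The field order, the number format, and the function names are assumptions for illustration, not the blog's specification.

```python
import struct

def float_to_bf16_bits(x):
    """Truncate a float32 to its upper 16 bits (bfloat16, no rounding)."""
    (bits32,) = struct.unpack(">I", struct.pack(">f", x))
    return bits32 >> 16

def bf16_bits_to_float(bits16):
    (x,) = struct.unpack(">f", struct.pack(">I", bits16 << 16))
    return x

def pack_entry(value, column):
    """Pack a 16-bit value and a 16-bit column index into one 32-bit word."""
    if not 0 <= column < 1 << 16:
        raise ValueError("column index must fit in 16 bits")
    return (float_to_bf16_bits(value) << 16) | column

def unpack_entry(word):
    return bf16_bits_to_float(word >> 16), word & 0xFFFF

word = pack_entry(1.5, 4321)
value, column = unpack_entry(word)
```

The practical upside of a layout like this is that one 32-bit load brings in both the activation and the column it belongs to, so a kernel does not need a second index array.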
How TwELL Turns Sparsity Into Execution
The paper's stack has three layers: induce sparsity, pack it where the GPU already works, and use a separate hybrid strategy for training.
Sparse gate: ReLU and a small L1 activation penalty make most feedforward gate activations zero.
- Use ReLU in the gate: the sparse recipe keeps the transformer shape familiar while changing the gate activation behavior.
- Add L1 pressure: the recommended L1 = 2e-5 setting preserves quality while driving active neurons down sharply.
Challenges:
- Avoid dead-neuron collapse
- Keep downstream accuracy flat
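As a hedged sketch of what "L1 pressure" means in a training loss, the snippet below adds a coefficient times the mean absolute gate activation to the cross-entropy. The reduction (mean rather than sum, and over which layers) is an assumption made here for illustration; only the 2e-5 coefficient comes from the text.

```python
def l1_activation_penalty(gate_activations, coeff=2e-5):
    """coeff times the mean absolute gate activation (the mean is an assumption)."""
    flat = [abs(a) for row in gate_activations for a in row]
    return coeff * sum(flat) / len(flat)

def total_loss(cross_entropy, gate_activations, coeff=2e-5):
    return cross_entropy + l1_activation_penalty(gate_activations, coeff)

def active_fraction(gate_activations):
    """Share of gate activations that are nonzero."""
    flat = [a for row in gate_activations for a in row]
    return sum(1 for a in flat if a != 0.0) / len(flat)

# Two tokens, four hidden units each (toy values).
gates = [[0.0, 0.0, 3.0, 0.0],
         [0.0, 1.0, 0.0, 0.0]]
loss = total_loss(2.5, gates)
fraction = active_fraction(gates)
```

The penalty only shrinks when activations move toward zero, and ReLU turns "toward zero" into "exactly zero", which is what a sparse format needs.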
TwELL inference: sparse gate activations are packed into tile-wise ELLPACK during the matmul kernel itself.
- Pack in the epilogue: the gate matmul emits TwELL directly instead of writing a dense matrix and converting later.
- Fuse up and down: a CUDA kernel traverses only active gate entries and avoids dense materialization of intermediate products.
Challenges:
- Preserve memory locality
- Hide uneven token activation counts
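The fused traversal can be illustrated on the CPU with a small Python sketch. It takes one token's active gate entries as (hidden index, gate value) pairs and touches only the matching column of the up-projection and row of the down-projection. This is a scalar illustration of the idea under assumed names and shapes, not the CUDA kernel.

```python
def sparse_ffn_row(x, active, w_up, w_down):
    """active: list of (hidden_index, gate_value) pairs for one token."""
    d_model = len(w_down[0])
    y = [0.0] * d_model
    for j, g in active:
        # Up-projection is computed only for the active hidden unit j.
        u = sum(x[i] * w_up[i][j] for i in range(len(x)))
        h = g * u
        # Accumulate straight into the output; no dense hidden vector is built.
        for k in range(d_model):
            y[k] += h * w_down[j][k]
    return y

x = [1.0, 2.0]
w_up = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]        # d_model x d_hidden
w_down = [[1.0, 1.0], [2.0, 0.0], [0.0, 3.0]]    # d_hidden x d_model
active = [(2, 0.5)]                              # only hidden unit 2 fired
y = sparse_ffn_row(x, active, w_up, w_down)
```

The work scales with the number of active entries rather than the hidden width, which is why uneven activation counts across tokens become a scheduling concern.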
Hybrid training: training converts TwELL into a capped sparse format plus a dense backup matrix for rare heavy rows.
- Route normal rows: most rows fit into compact ELL storage with a practical cap of 128 elements.
- Route heavy rows: overflow rows go to a dense backup matrix so the format does not break on high-activity tokens.
Challenges:
- Static allocation
- Backpropagation memory traffic
What's often overlooked is the dense backup path. The paper's hybrid training format routes ordinary sparse rows into compact ELL storage, but sends rare high-activity rows into a dense backup matrix. The authors say a cap of 128 elements per row and dense backup rows equal to one-eighth of the token batch are robust above L1 = 1.5e-5.[4]
That is not a hack. That is the part that makes the system credible. Real token activations are not uniformly sparse. Some tokens carry a lot more information. A sparse format that cannot handle the outliers is a demo, not infrastructure.
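The routing rule can be sketched as follows. The cap of 128 and the one-eighth backup budget come from the text; the padding convention, the dict used for the backup, and the choice to raise when the backup fills are assumptions made for this illustration.

```python
def to_hybrid(rows, cap=128, max_backup_rows=None):
    """Split activation rows into capped ELL storage plus a dense backup."""
    if max_backup_rows is None:
        max_backup_rows = max(1, len(rows) // 8)     # one-eighth of the batch
    ell_vals, ell_cols, backup = [], [], {}
    for r, row in enumerate(rows):
        nz = [(c, v) for c, v in enumerate(row) if v != 0.0]
        if len(nz) <= cap:
            pad = cap - len(nz)
            ell_cols.append([c for c, _ in nz] + [-1] * pad)
            ell_vals.append([v for _, v in nz] + [0.0] * pad)
        else:
            if len(backup) >= max_backup_rows:
                raise RuntimeError("dense backup is full")
            backup[r] = list(row)                    # heavy row stays dense
            ell_cols.append([-1] * cap)              # empty ELL slot for this row
            ell_vals.append([0.0] * cap)
    return ell_vals, ell_cols, backup

# Eight toy rows; the last one exceeds a cap of 2 and lands in the backup.
rows = [[0.0, 1.0, 0.0, 0.0] for _ in range(7)] + [[1.0, 2.0, 3.0, 0.0]]
vals, cols, backup = to_hybrid(rows, cap=2)
```

Both buffers have sizes known before the batch arrives, which is what makes static allocation possible even though per-token sparsity is not known in advance.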

The Numbers: The 2B Result Is Good, The Table Is Messier
The 2B row is the headline because it is the cleanest strategic story. Sparse forward execution rises from 87.8 to 106 input tokens/ms. Energy drops from 7.85 to 6.51 mJ/token. Training throughput rises from 22.4 to 27.3 input tokens/ms. Accuracy barely moves, from 49.1% to 48.8%.[4]
But the whole table matters because it shows both promise and caveat.
Sparse vs Dense Across Model Scales
| Metric (dense to sparse) | 0.5B | 1B | 1.5B | 2B |
|---|---|---|---|---|
| Training tokens | 10.49B | 20.97B | 31.46B | 41.94B |
| Mean task accuracy | 40.4% to 40.4% | 44.6% to 44.7% | 46.4% to 46.2% | 49.1% to 48.8% |
| Forward execution (input tokens/ms) | 410 to 480 | 185 to 219 | 119 to 141 | 87.8 to 106 |
| Energy per token | 1.63 to 1.43 mJ | 3.71 to 3.17 mJ | 5.73 to 4.87 mJ | 7.85 to 6.51 mJ |
| Training throughput (input tokens/ms) | 97.3 to 95.9 | 48.6 to 52.1 | 31.8 to 35.5 | 22.4 to 27.3 |
| Peak memory | 26.2 to 21.2 GB | 44.5 to 33.1 GB | 62.8 to 45.1 GB | 46.7 to 57.1 GB |
The 0.5B model is not a clean training win. Training throughput falls 1.5%, from 97.3 to 95.9 input tokens/ms, even as forward inference improves 17.0%.[4] The 2B model has the opposite caveat: it delivers the largest speedups, but peak memory rises from 46.7 GB to 57.1 GB, because the sparse run fits a larger micro-batch in the reported training configuration.[4]
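For readers who want to check the direction and rough size of each change, the relative deltas can be recomputed from the table values with a few lines of Python. The recomputed figures can land a few tenths of a percentage point away from the quoted ones, plausibly because the published table is rounded; that explanation is an assumption, not something the paper states.

```python
# (dense, sparse) pairs copied from the comparison table.
table = {
    "0.5B": {"forward": (410, 480), "train": (97.3, 95.9), "energy": (1.63, 1.43)},
    "1B":   {"forward": (185, 219), "train": (48.6, 52.1), "energy": (3.71, 3.17)},
    "1.5B": {"forward": (119, 141), "train": (31.8, 35.5), "energy": (5.73, 4.87)},
    "2B":   {"forward": (87.8, 106), "train": (22.4, 27.3), "energy": (7.85, 6.51)},
}

def pct_change(dense, sparse):
    """Signed percentage change from dense to sparse."""
    return 100.0 * (sparse - dense) / dense

changes = {
    scale: {metric: round(pct_change(*pair), 1) for metric, pair in row.items()}
    for scale, row in table.items()
}

for scale, row in changes.items():
    print(scale, row)
```

Positive numbers mean higher throughput for the forward and train columns, and negative numbers mean lower energy per token for the energy column. The 0.5B training entry is the only throughput figure that comes out negative.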
This is why the correct takeaway is not "sparse always wins." The correct takeaway is sharper: sparse execution becomes increasingly interesting when model scale, batch shape, activation distribution, and kernel design line up.
At L1 = 2e-5, average nonzero activations fall from 39 in the 0.5B model to 24 in the 2B model, making the sparse path more attractive at larger scale.
That scale effect is the reason the paper is more interesting than a small-model optimization demo. The sparse path gets cleaner as the model gets larger in the evaluated range.[4]
The sparsity sweep adds more context. The 1.5B model tests eight L1 coefficients: 0, 6e-6, 1e-5, 1.5e-5, 2e-5, 3e-5, 6e-5, and 1e-4.[4] At L1 = 1e-4, the model averages fewer than one activated neuron. That is too sparse to treat casually. Up to L1 = 3e-5, the authors report essentially no downstream task drop and final cross-entropy within 2% of the unregularized baseline.[4]
The ReLU versus SiLU ablation is also revealing. The non-sparse SiLU model scores 47.1% mean accuracy, higher than the non-sparse ReLU model's 46.4%, but SiLU keeps all 5,632 hidden activations nonzero and is slightly slower. The sparse ReLU model at L1 = 2e-5 uses only 29 nonzeros on average and gets 17.9% faster forward execution.[4]
That's the trade. SiLU is a little better in the dense baseline. ReLU plus L1 opens the sparse execution door.
The Business Implication: Kernel Ownership Becomes A Moat
The real story isn't that Sakana compressed a 2B model. The real story is that Sakana and NVIDIA are arguing for a new axis of AI infrastructure competition: who can make the model's wasted compute visible to the hardware?
This is not the same as MoE. MoE routes tokens through selected experts. TwELL exploits sparse activations inside a more conventional transformer feedforward block. MoE changes the architecture and serving stack. TwELL changes the activation behavior, sparse format, and CUDA execution path.
That distinction is commercially important. A lab can look at TwELL and imagine a path that does not require rebuilding the entire model family around expert routing. But it does require owning enough of the kernel stack to matter.

Who Should Care
TwELL is not a consumer feature. It is a signal to teams that pay the largest compute bills.
- Frontier labs: sparse feedforward execution becomes a model-design option, not just a compression afterthought.
- Cloud providers: a 17.0% energy-per-token reduction at 2B scale is not a pricing revolution, but across fleet-scale serving it is real money.
- GPU vendors: the work says irregular computation is not useless if the software stack makes it regular enough.
- Enterprise buyers: do not expect SaaS prices to collapse tomorrow. Do expect model providers to compete harder on hidden infrastructure efficiency.
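To put the per-token energy figures in fleet terms, here is a back-of-the-envelope conversion in Python. The token volume is a made-up number chosen for illustration, and the calculation ignores everything outside the measured per-token figure, so treat the result as a scale check rather than a forecast.

```python
def fleet_energy_kwh(mj_per_token, tokens):
    """Convert a per-token energy cost in millijoules to kWh for a token count."""
    joules = mj_per_token * 1e-3 * tokens
    return joules / 3.6e6                # 1 kWh = 3.6e6 J

TOKENS = 1e12                            # hypothetical volume, not from the paper
dense_kwh = fleet_energy_kwh(7.85, TOKENS)
sparse_kwh = fleet_energy_kwh(6.51, TOKENS)
saved_kwh = dense_kwh - sparse_kwh
```

At a trillion tokens the 2B figures work out to a difference of a few hundred kWh. The saving only turns into meaningful money at much larger volumes or much larger models, and whether the ratio holds at those scales is untested.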
Here's the genius from a business standpoint: TwELL creates a moat that cannot be copied by changing a prompt, swapping a model endpoint, or publishing a benchmark table. It lives in the boring layer. CUDA kernels. Sparse packing. Training infrastructure. Benchmark scripts. Hardware-specific assumptions.
That is exactly why it matters.
The Catch: Not A Drop-In Speed Button
There is a reason this is not already everywhere.
The public GitHub repo expects CUDA 12.8+, says the custom kernels are designed for H100 GPUs, and provides a benchmark path with 500 reps and 5 warmup reps for inference testing.[5] The repo roadmap marks sparse model training code and TwELL inference kernels as complete, but efficient TwELL training kernels are still listed as unfinished in the public package.[5]
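The warmup-then-measure pattern behind those settings looks roughly like the sketch below. This is a generic CPU timing harness using the same 500 and 5 counts, not the repo's benchmark script; a real GPU benchmark would also need to synchronize the device before reading the clock. The function names are hypothetical.

```python
import time

def benchmark(fn, reps=500, warmup=5):
    """Return mean seconds per call, discarding the warmup calls."""
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(reps):
        fn()
    return (time.perf_counter() - start) / reps

def tokens_per_ms(tokens_per_call, seconds_per_call):
    """Convert a per-call latency into the throughput unit used in the table."""
    return tokens_per_call / (seconds_per_call * 1e3)

# Stand-in workload; replace with a model forward pass in practice.
calls = []
mean_seconds = benchmark(lambda: calls.append(sum(range(1000))), reps=20, warmup=2)
```

Warmup matters because first calls often pay one-time costs such as compilation, cache fills, and allocator growth, none of which belong in a steady-state throughput number.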
That does not invalidate the paper. It does mean the release is closer to a serious research codebase than a drop-in production library.
The Hugging Face artifacts deserve the same caution. The SparseLM pages expose BF16 safetensors, Apache-2.0 metadata, and a llama_sparse_relu tag, but the model cards are thin and the models are not deployed through a hosted Inference Provider at the time of inspection.[6][8] SparseLM1.5B also reports roughly 1.64B BF16 parameters through the model API, while the page metadata can present confusing rounded size labels.[8]
The methodology is real, but bounded. Main experiments use FineWeb, max sequence length 2,048, 1,048,576 tokens per step, AdamW, a 1e-3 learning rate, 0.1 weight decay, 600 warmup steps, BF16 compute, GPT-2 tokenizer, and one node of eight H100 PCIe GPUs unless otherwise specified.[4] Model scales run 10.49B, 20.97B, 31.46B, and 41.94B tokens for 0.5B, 1B, 1.5B, and 2B respectively.[4]
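The token budgets line up neatly with the stated step size, which a short Python check makes visible. The step counts below are inferred from the arithmetic and are not quoted from the paper, so treat them as an assumption.

```python
TOKENS_PER_STEP = 1_048_576          # from the stated setup (2**20)
SEQ_LEN = 2_048

sequences_per_step = TOKENS_PER_STEP // SEQ_LEN   # 512 sequences per step

# Inferred step counts (assumption): 10k steps per 0.5B of model scale.
assumed_steps = {"0.5B": 10_000, "1B": 20_000, "1.5B": 30_000, "2B": 40_000}
tokens_billions = {
    scale: round(steps * TOKENS_PER_STEP / 1e9, 2)
    for scale, steps in assumed_steps.items()
}
```

Under that assumption the totals round to 10.49B, 20.97B, 31.46B, and 41.94B, matching the figures above. Training length therefore grows in proportion to model size across the four scales.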
How To Read The Claim
The right reading is aggressive but not naive.
- Treat the kernel as the product: the sparse model matters, but the breakthrough is the execution path that turns sparse activations into real GPU throughput.
- Do not extrapolate straight to frontier scale: the experiments are meaningful billion-parameter runs, but they are not proof that every 100B-plus production model gets the same curve.
- Separate inference readiness from training readiness: the public repo exposes inference benchmarking and sparse training code, while efficient TwELL training kernels remain on the roadmap.
- Watch the memory line: the 2B row reports higher peak memory even while training throughput improves, so the operational trade is not one-dimensional.
The Key Constraint
TwELL is strongest where the deployment controls model architecture, sparse activation behavior, CUDA kernels, and hardware. If you are locked into generic dense inference paths, sparse activations are mostly a theoretical asset. The money appears when model design and kernel design are treated as one system.
The Bigger Picture: Sparsity Moves From Compression To Infrastructure
The old sparsity story was about making models smaller. The new sparsity story is about making compute selective without making the GPU miserable.
That reframing is the point.
What's often overlooked is that the AI industry has spent the last two years treating inference cost as a pricing problem, a batching problem, or a model-distillation problem. TwELL says it is also a data-layout problem. The hidden activations inside a transformer are not just math. They are a scheduling problem for a very expensive machine.
While competitors optimize the visible product layer, Sakana and NVIDIA are optimizing the path between an activation becoming zero and the GPU actually saving work. That is less glamorous than a new chatbot. It may be more durable.
Bottom Line
TwELL is not just sparse modeling. It is sparse execution, which is the harder and more commercially meaningful problem.
The strongest result is the 2B row: 20.5% faster forward execution, 21.9% faster training, and 17.0% lower energy per token with nearly flat mean accuracy.
The messy parts matter: 0.5B training slows, 2B peak memory rises, and the public training-kernel path is not fully packaged yet.
The strategic implication is that kernel ownership becomes an AI infrastructure moat. The labs that control the model and the execution path get leverage the API-only labs do not.
The real story isn't that sparse LLMs are suddenly solved. The real story is that Sakana and NVIDIA have shown where the fight moves next. Not just bigger models. Not just cheaper tokens. The next infrastructure edge is making the hardware stop paying for activations the model already decided to ignore.
That is the sparse kernel bet.
Sources & References
Key sources and references used in this article
| # | Source | Outlet | Date | Key Takeaway |
|---|---|---|---|---|
| 1 | Sparser, Faster, Lighter Transformer Language Models | Sakana AI | May 9, 2026 | Official announcement for the Sakana AI and NVIDIA collaboration, ICML 2026 framing, and links to paper, blog, and code. |
| 2 | Sparser, Faster, Lighter Transformer Language Models | Sakana AI Technical Blog | May 2026 | Accessible technical explanation of feedforward sparsity, TwELL packing, fused inference kernels, and hybrid training format. |
| 3 | Sparser, Faster, Lighter Transformer Language Models | arXiv (Edoardo Cetin, Stefano Peluchetti, Emilio Castillo, Akira Naruse, Mana Murakami, Llion Jones) | Mar 24, 2026 | arXiv record with author list, submission date, revision date, abstract, and subject metadata. |
| 4 | Sparser, Faster, Lighter Transformer Language Models PDF | arXiv (Sakana AI and NVIDIA) | May 8, 2026 | Primary source for Table 1 performance values, Table 2 hyperparameters, sparse format definitions, ablations, and limitations. |
| 5 | SakanaAI/sparser-faster-llms | GitHub (Sakana AI) | 2026 | Reference implementation with sparse training code, TwELL inference kernels, H100 CUDA requirements, benchmarking scripts, and roadmap. |
| 6 | SakanaAI/SparseLM0.5B | Hugging Face (Sakana AI) | 2026 | Released 0.5B-class SparseLM checkpoint with BF16 safetensors and Apache-2.0 model metadata. |
| 7 | SakanaAI/SparseLM1B | Hugging Face (Sakana AI) | 2026 | Released 1B-class SparseLM checkpoint for reproducing the sparse model family. |
| 8 | SakanaAI/SparseLM1.5B | Hugging Face (Sakana AI) | 2026 | Released 1.5B-class SparseLM checkpoint referenced by the repo's benchmark examples. |
| 9 | SakanaAI/SparseLM2B | Hugging Face (Sakana AI) | 2026 | Released 2B-class SparseLM checkpoint corresponding to the largest model scale in the paper's main comparison. |
| 10 | NVIDIA Hopper Tuning Guide | NVIDIA Developer Documentation | 2026 | Official background on Hopper GPU features relevant to tiled execution, shared memory, and Tensor Memory Accelerator behavior. |
Last updated: May 29, 2026




