DeML OS - 2026-03-31

DeML OS Daily: Latest Frontier Analysis

2026-03-31 (Tue)

📄 Paper

Route Experts by Sequence, not by Token https://arxiv.org/abs/2511.06494

Tiansheng Wen · Routing · Sparsity

Wen et al. propose SeqTopK, which moves the expert routing budget from the token level to the sequence level so that experts are allocated dynamically, end to end. This improves model performance while the overall budget stays the same.

Notes

Standard TopK routing assigns the same fixed number of experts to every token, ignoring how much complexity varies from token to token.
SeqTopK moves the expert budget from the token level to the sequence level, so experts are allocated dynamically, end to end.
It requires minimal code changes, adds less than 1% overhead, and is fully compatible with pretrained MoE models.
It outperforms TopK and prior parameter-free adaptive methods on math, coding, law, and writing tasks.
The gains are substantially larger under higher sparsity (up to 16.9%).
Overall, it is a simple, efficient, and scalable routing strategy, well suited to the extreme sparsity of next-generation LLMs.
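
To make the contrast in the notes concrete, here is a minimal NumPy sketch. It assumes that sequence-level routing can be approximated by keeping the T*K highest router scores across the whole sequence; the paper's actual implementation may differ. The function names and the score matrix are hypothetical and chosen only for illustration.

```python
import numpy as np

def topk_mask(scores: np.ndarray, k: int) -> np.ndarray:
    """Token-level TopK: every token keeps its k highest-scoring experts."""
    num_tokens, num_experts = scores.shape
    # Column indices of the k largest scores in each row.
    idx = np.argpartition(scores, num_experts - k, axis=1)[:, num_experts - k:]
    mask = np.zeros(scores.shape, dtype=bool)
    np.put_along_axis(mask, idx, True, axis=1)
    return mask

def seq_topk_mask(scores: np.ndarray, k: int) -> np.ndarray:
    """Sequence-level sketch: keep the T*k largest scores over all tokens.

    Assumption: one shared budget of T*k (token, expert) pairs.
    In this simplified version a token may end up with zero experts.
    """
    num_tokens, num_experts = scores.shape
    budget = num_tokens * k
    flat = scores.ravel()
    # Flat indices of the `budget` largest scores in the whole sequence.
    idx = np.argpartition(flat, flat.size - budget)[flat.size - budget:]
    mask = np.zeros(flat.size, dtype=bool)
    mask[idx] = True
    return mask.reshape(scores.shape)

# Hypothetical router scores: 4 tokens (rows) x 4 experts (columns).
scores = np.array([
    [0.90, 0.80, 0.70, 0.10],  # several experts score high
    [0.60, 0.05, 0.02, 0.01],  # one expert dominates
    [0.50, 0.04, 0.03, 0.02],  # one expert dominates
    [0.85, 0.75, 0.20, 0.10],
])

per_token = topk_mask(scores, k=2).sum(axis=1)     # experts per token, TopK
per_seq = seq_topk_mask(scores, k=2).sum(axis=1)   # experts per token, sketch
print(per_token.tolist(), per_seq.tolist())
```

With these scores, TopK gives every token exactly 2 experts, while the sequence-level sketch gives the four tokens 3, 1, 1, and 3 experts. Both use 8 expert slots in total, so the overall budget is unchanged.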

Collected by @icerdesign

DeML OS Q & A

Deep Dive 💬

😇

What is the main improvement of SeqTopK?

SeqTopK shifts the expert selection budget from per-token to per-sequence level, allowing dynamic allocation of more experts to complex tokens and fewer to easy ones, while keeping the total budget constant.
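
The budget statement in this answer can be written as a constraint. The notation is introduced here for illustration and is not taken from the paper: T is the sequence length, K is the per-token expert count, and S_t is the set of experts selected for token t.

```latex
% TopK fixes the count per token; SeqTopK (as described above) fixes only the sequence total.
\text{TopK:}\quad |S_t| = K \quad \forall\, t \in \{1, \dots, T\}
\qquad\qquad
\text{SeqTopK:}\quad \sum_{t=1}^{T} |S_t| = T \cdot K
```

Under the second constraint the individual |S_t| may differ from token to token, which is what lets harder tokens receive more experts while the total stays at T·K.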

😎

😊

What advantages does SeqTopK have over traditional TopK routing?

SeqTopK allocates experts dynamically according to token difficulty, so the same compute budget is put to better use and model performance improves. It is simple to implement, adds negligible overhead (under 1%), is compatible with pretrained MoE models, and shows especially large gains under high sparsity.

😎

🤓

Why does SeqTopK show larger performance gains under higher sparsity?

Under higher sparsity, each token can activate fewer experts (a smaller K), so the resource constraint is tighter. SeqTopK's sequence-level flexibility then matters more: it can direct scarce expert capacity to the tokens that need it most (the hardest ones) rather than spreading it evenly through fixed assignment, which makes better use of the limited compute budget.

😎

Prompted by @icerdesign