DeML OS Daily — Latest Frontier Analysis
Explore Frontier
04.01
2026
Wed
📄
Paper
MoE-Sieve: Routing-Guided LoRA for Efficient MoE Fine-Tuning https://arxiv.org/abs/2603.24044
Andrea Manzoni MoE LoRA

Notes

DeML OS Q & A
Deep Dive 💬
04.01
2026
Wed
😇
How does MoE-Sieve decide which experts to apply LoRA to?
It profiles the routing frequency (how many tokens each expert handles) on a small calibration set, then selects the top-k most frequently routed experts per layer for LoRA fine-tuning.
😎
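The selection step described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function name, the `routing_records` shape (layer index mapped to one expert id per routed token from the calibration set), and the per-layer top-k policy are assumptions for the sake of the example.

```python
from collections import Counter

def select_lora_experts(routing_records, k):
    """Pick the top-k most frequently routed experts per layer.

    routing_records: dict mapping layer index -> list of expert ids,
    one entry per (token, chosen expert) routing decision observed
    on a small calibration set.
    """
    selected = {}
    for layer, expert_ids in routing_records.items():
        counts = Counter(expert_ids)
        # Experts that handle the most calibration tokens get LoRA adapters;
        # the rest are left frozen.
        selected[layer] = [e for e, _ in counts.most_common(k)]
    return selected
```

Only the experts returned here would receive LoRA adapters; all other expert weights stay frozen during fine-tuning.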
😊
Why does random expert selection perform worse than MoE-Sieve?
Random selection ignores the routing signal and can pick rarely activated 'cold' experts. Adapting these experts injects gradient noise into training without a corresponding gain in task accuracy.
😎
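A small simulation makes the contrast concrete. Everything here is a hypothetical illustration (the skewed weights, expert count, and sample size are invented, not from the paper): under a skewed routing distribution, a frequency-guided pick recovers the hot experts, while a routing-agnostic random pick of 2 out of 8 experts lands on both hot experts only 1/28 of the time.

```python
import random
from collections import Counter

# Simulated routing decisions for one layer: a skewed distribution where
# experts 0 and 1 are 'hot' and the remaining six are rarely routed 'cold'.
rng = random.Random(0)
num_experts, k = 8, 2
weights = [10, 8, 1, 1, 1, 1, 1, 1]  # hot experts dominate token traffic
tokens = rng.choices(range(num_experts), weights=weights, k=1000)

# Frequency-guided selection: count routed tokens, keep the top-k experts.
hot = {e for e, _ in Counter(tokens).most_common(k)}

# Routing-agnostic baseline: sample k experts uniformly at random.
random_pick = set(rng.sample(range(num_experts), k))
```

With these weights the two hot experts together receive roughly 75% of the tokens, so the frequency-guided pick is essentially always `{0, 1}`, whereas the random baseline usually spends at least one adapter on a cold expert.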
🤓
What is the implication of the non-monotonic relationship between expert routing skew and seed-to-seed variance mentioned in the paper?
It suggests that adapting 'cold' experts when routing skew is high increases seed-to-seed variance, i.e. training instability, supporting the hypothesis that tuning cold experts introduces harmful gradient noise. This is consistent with MoE-Sieve remaining stable while reducing the number of trainable parameters.
😎