DeML OS - 2026-04-07

DeML OS Daily: Latest Frontier Analysis

Explore Frontier

2026-04-07 (Tue)

📄

Paper

Patterns behind Chaos: Forecasting Data Movement for Efficient Large-Scale MoE LLM Inference https://arxiv.org/abs/2510.05497

Zhongkai Yu MoE Profiling

Yu et al. analyze the data movement bottlenecks of large-scale MoE LLM inference and propose optimizations that achieve a 6.6x speedup on wafer-scale GPUs.

Notes

Random expert selection in large-scale MoE LLMs causes heavy data movement, which becomes a major bottleneck.
The authors profile four SOTA MoE models released in 2025 (200B-1000B parameters).
Temporal and spatial analysis of the profiles yields six key insights for serving system design.
Wafer-scale GPU modifications guided by these insights achieve a 6.6x speedup.
A prefill-aware expert placement algorithm for existing GPUs delivers up to a 1.25x speedup in MoE computation.
The authors describe this as the first comprehensive data-centric study of large-scale MoE models, with publicly available profiling traces.
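To make the idea of temporal and spatial analysis concrete, here is a minimal Python sketch of two statistics one might compute from an expert-selection trace. The trace format (one tuple of selected expert ids per token, for a single MoE layer) and both function names are assumptions made for illustration; the digest does not describe the paper's actual metrics or trace layout.

```python
from collections import Counter

def expert_histogram(trace):
    """Count how often each expert is selected across all tokens (a spatial view)."""
    counts = Counter()
    for experts in trace:
        counts.update(experts)
    return counts

def consecutive_reuse_rate(trace):
    """Fraction of selections that repeat an expert used by the previous token (a temporal view)."""
    reused = total = 0
    for prev, cur in zip(trace, trace[1:]):
        prev_set = set(prev)
        reused += sum(1 for e in cur if e in prev_set)
        total += len(cur)
    return reused / total if total else 0.0

# Hypothetical trace: one tuple of selected expert ids per token.
trace = [(0, 3), (3, 5), (3, 5), (1, 2)]
print(expert_histogram(trace).most_common(2))  # [(3, 3), (5, 2)]
print(consecutive_reuse_rate(trace))           # 0.5
```

A skewed histogram or a high reuse rate would suggest that expert traffic is more predictable than uniform random selection implies, which is the kind of pattern a serving system could exploit.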

Collected by @icerdesign

DeML OS Q&A

Deep Dive 💬

😇

What is the main bottleneck in MoE LLM inference?

According to the paper, random expert selection in large-scale MoE LLMs introduces significant data movement, which becomes the dominant performance bottleneck in multi-unit serving systems.
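A rough way to see why this happens is to simulate routing when experts are sharded across devices. The Python sketch below is an illustration under stated assumptions, not the paper's methodology: uniform random top-k selection stands in for a learned router, experts are sharded round-robin, and `remote_fraction` is a hypothetical helper name.

```python
import random

def remote_fraction(num_tokens, num_experts, num_devices, top_k, seed=0):
    """Estimate the share of token-to-expert dispatches that leave the token's device."""
    rng = random.Random(seed)
    remote = 0
    for _ in range(num_tokens):
        token_device = rng.randrange(num_devices)
        # Uniform random top-k selection stands in for a learned router (assumption).
        for expert in rng.sample(range(num_experts), top_k):
            expert_device = expert % num_devices  # round-robin sharding (assumption)
            if expert_device != token_device:
                remote += 1
    return remote / (num_tokens * top_k)

print(remote_fraction(num_tokens=10_000, num_experts=64, num_devices=8, top_k=2))
```

Under these assumptions the result sits close to 1 - 1/num_devices, so with 8 devices roughly seven of every eight dispatches cross a device boundary.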

😎

😊

What benefit does the prefill-aware expert placement algorithm bring?

Designed for existing GPU systems, this algorithm optimizes expert placement to reduce data movement, achieving up to a 1.25x speedup specifically in MoE computation.
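The digest does not spell out the algorithm, so the Python sketch below is not the paper's method. It shows a generic greedy heuristic for the same goal: place each expert on the device that sends it the most tokens, subject to a per-device capacity. The `traffic` table (tokens per source device for each expert, imagined here as gathered during prefill), the `capacity` parameter, and both function names are hypothetical.

```python
def place_experts(traffic, capacity):
    """Greedy placement: heaviest (expert, device) pairs first, respecting device capacity."""
    num_devices = len(next(iter(traffic.values())))
    pairs = sorted(
        ((count, expert, device)
         for expert, row in traffic.items()
         for device, count in enumerate(row)),
        key=lambda p: (-p[0], p[1], p[2]),
    )
    placement, used = {}, [0] * num_devices
    for count, expert, device in pairs:
        if expert not in placement and used[device] < capacity:
            placement[expert] = device
            used[device] += 1
    return placement

def remote_tokens(traffic, placement):
    """Tokens that must cross devices to reach their expert under a placement."""
    return sum(sum(row) - row[placement[e]] for e, row in traffic.items())

# Hypothetical traffic: traffic[expert][device] = tokens on that device routed to the expert.
traffic = {0: [9, 1], 1: [8, 2], 2: [3, 7], 3: [6, 4]}
greedy = place_experts(traffic, capacity=2)
round_robin = {e: e % 2 for e in traffic}
print(remote_tokens(traffic, greedy), remote_tokens(traffic, round_robin))  # 12 22
```

In this toy example the greedy placement moves 12 tokens across devices versus 22 for round-robin. The heuristic assumes total capacity is at least the number of experts; otherwise some experts are left unplaced.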

😎

🤓

How do the six key insights guide future serving system design?

The insights, distilled from temporal and spatial analysis, inform the design of data movement patterns and resource scheduling strategies. The paper applies them in two ways to alleviate data movement bottlenecks: architectural modifications to wafer-scale GPUs and algorithmic optimizations for existing GPU systems.

😎

Prompted by @icerdesign