TWIST
High-Efficiency LLM Training System via Strand Interleaving on NVIDIA Hopper GPUs.
MLSys Lab
Distributed LLM training systems, tensor core optimization, and technical notes.
Featured
High-Efficiency LLM Training System via Strand Interleaving on NVIDIA Hopper GPUs.
Technical note with overview, layer breakdown, and MFU comparison figures.
A technical note for DHeLlam, the ICCD 2025 Best Paper, summarizing the currently available overview, layer breakdown, and H200 MFU comparison figures.
Mantle proposes a two-layer metadata architecture for cloud object storage, enabling scalable hierarchical namespace management.
Research Areas
Automatic micro-batch co-execution (sketched after this list), tensor core scheduling, and efficient distributed LLM training.
Cloud-edge query processing and data-centric system design.
A data-centric perspective on large-model system design problems.
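The micro-batch co-execution theme above can be pictured with a generic two-stream pattern: two micro-batches are launched on separate CUDA streams so their kernels can interleave on a single GPU. The PyTorch snippet below is an illustrative sketch only; the function and variable names are assumptions, and it is not the TWIST/DHeLlam scheduler.

    import torch

    def co_execute_microbatches(model, micro_batch_a, micro_batch_b):
        # Run two micro-batches' forward passes on separate CUDA streams so
        # their kernels can interleave on the GPU. This only sketches the
        # co-execution idea; it is not the TWIST/DHeLlam implementation.
        stream_a, stream_b = torch.cuda.Stream(), torch.cuda.Stream()
        stream_a.wait_stream(torch.cuda.current_stream())  # inputs were produced on the default stream
        stream_b.wait_stream(torch.cuda.current_stream())
        with torch.cuda.stream(stream_a):
            out_a = model(micro_batch_a)
        with torch.cuda.stream(stream_b):
            out_b = model(micro_batch_b)
        torch.cuda.synchronize()  # join both strands before using the outputs
        return out_a, out_b

    # Toy usage (requires a CUDA device):
    model = torch.nn.Linear(1024, 1024).cuda()
    a = torch.randn(8, 1024, device="cuda")
    b = torch.randn(8, 1024, device="cuda")
    out_a, out_b = co_execute_microbatches(model, a, b)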
Blog
Technical note with overview, layer breakdown, and MFU comparison figures.
A technical note for DHeLlam, the ICCD 2025 Best Paper, summarizing the currently available overview, layer breakdown, and H200 MFU comparison figures.
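For context on the MFU numbers such a comparison reports, Model FLOPs Utilization is conventionally computed as achieved model FLOPs per second divided by aggregate hardware peak. Below is a minimal Python sketch of that calculation, assuming the common 6N-FLOPs-per-token approximation for dense transformers; the throughput and per-GPU peak values are illustrative placeholders, not figures taken from the note.

    def model_flops_utilization(tokens_per_sec, num_params, num_gpus, peak_flops_per_gpu):
        # ~6 * num_params FLOPs per token approximates forward + backward for a
        # dense transformer (attention FLOPs ignored).
        achieved_flops_per_sec = 6.0 * num_params * tokens_per_sec
        peak_flops_per_sec = num_gpus * peak_flops_per_gpu
        return achieved_flops_per_sec / peak_flops_per_sec

    # Example with placeholder numbers: a 7B-parameter model at 60k tokens/s on
    # 8 GPUs, each assumed to peak at ~989 TFLOPS dense BF16 (the commonly
    # quoted H100/H200 figure). Prints roughly 0.32, i.e. ~32% MFU.
    print(model_flops_utilization(6.0e4, 7e9, 8, 989e12))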
Projects
High-Efficiency LLM Training System via Strand Interleaving on NVIDIA Hopper GPUs.
Adaptive clustering system for large-scale ML workloads with dynamic resource scheduling.
Publications
Peer-reviewed work on distributed training, tensor core optimization, and efficient AI systems, published at top-tier venues.
Mantle proposes a two-layer metadata architecture for cloud object storage, enabling scalable hierarchical namespace management.
AutoCCL automates collective communication tuning to accelerate distributed and parallel DNN training.
GLPilot introduces a staleness-bounded embedding buffering mechanism to reduce remote fetches in distributed GNN training with learnable vertex embeddings.
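GLPilot's staleness bound can be illustrated with a generic buffering scheme: a locally cached embedding is reused until its age in training steps exceeds the bound, and only then refetched from the remote store. The Python sketch below shows that general idea under assumed names; it is not GLPilot's implementation.

    from dataclasses import dataclass, field
    from typing import Callable, Dict, List, Tuple

    @dataclass
    class StalenessBoundedBuffer:
        # Reuse a locally buffered embedding while its age (in steps) stays
        # within `staleness_bound`; otherwise call the assumed remote fetch.
        staleness_bound: int
        fetch_remote: Callable[[int], List[float]]  # hypothetical remote-fetch hook
        _cache: Dict[int, Tuple[List[float], int]] = field(default_factory=dict)

        def get(self, vertex_id: int, step: int) -> List[float]:
            hit = self._cache.get(vertex_id)
            if hit is not None and step - hit[1] <= self.staleness_bound:
                return hit[0]  # fresh enough: avoid a remote fetch
            embedding = self.fetch_remote(vertex_id)  # stale or missing: refetch
            self._cache[vertex_id] = (embedding, step)
            return embedding

    # Toy usage: vertex 17 is fetched at step 10 and reused until the bound lapses.
    buf = StalenessBoundedBuffer(staleness_bound=4, fetch_remote=lambda v: [0.0] * 8)
    buf.get(17, step=10)
    buf.get(17, step=13)  # served from the local buffer
    buf.get(17, step=20)  # bound exceeded, so refetched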