TWIST
High-Efficiency LLM Training System via Strand Interleaving on NVIDIA Hopper GPUs.
MLSys Lab
Distributed LLM training systems, tensor core optimization, and technical notes.
Featured
High-Efficiency LLM Training System via Strand Interleaving on NVIDIA Hopper GPUs.
Technical note with overview, layer breakdown, and MFU comparison figures.
A technical note for DHeLlam, the ICCD 2025 Best Paper, summarizing the currently available overview, layer breakdown, and H200 MFU comparison figures.
Mantle proposes a two-layer metadata architecture for cloud object storage, enabling scalable hierarchical namespace management.
Research Areas
Automatic micro-batch co-execution (sketched after this list), tensor core scheduling, and efficient distributed LLM training.
Cloud-edge query processing and data-centric system design.
A data-centric perspective on large-model system design problems.
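The micro-batch co-execution theme above can be pictured with a generic two-stream pattern: two micro-batches are launched on separate CUDA streams so their kernels can interleave on a single GPU. The PyTorch snippet below is an illustrative sketch only; the function and variable names are assumptions, and it is not the TWIST/DHeLlam scheduler.

    import torch

    def co_execute_microbatches(model, micro_batch_a, micro_batch_b):
        # Run two micro-batches' forward passes on separate CUDA streams so
        # their kernels can interleave on the GPU. This only sketches the
        # co-execution idea; it is not the TWIST/DHeLlam implementation.
        stream_a, stream_b = torch.cuda.Stream(), torch.cuda.Stream()
        stream_a.wait_stream(torch.cuda.current_stream())  # inputs were produced on the default stream
        stream_b.wait_stream(torch.cuda.current_stream())
        with torch.cuda.stream(stream_a):
            out_a = model(micro_batch_a)
        with torch.cuda.stream(stream_b):
            out_b = model(micro_batch_b)
        torch.cuda.synchronize()  # join both strands before using the outputs
        return out_a, out_b

    # Toy usage (requires a CUDA device):
    model = torch.nn.Linear(1024, 1024).cuda()
    a = torch.randn(8, 1024, device="cuda")
    b = torch.randn(8, 1024, device="cuda")
    out_a, out_b = co_execute_microbatches(model, a, b)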
Blog
Technical note with overview, layer breakdown, and MFU comparison figures.
A technical note for DHeLlam, the ICCD 2025 Best Paper, summarizing the currently available overview, layer breakdown, and H200 MFU comparison figures.
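For context on the MFU numbers such a comparison reports, Model FLOPs Utilization is conventionally computed as achieved model FLOPs per second divided by aggregate hardware peak. Below is a minimal Python sketch of that calculation, assuming the common 6N-FLOPs-per-token approximation for dense transformers; the throughput and per-GPU peak values are illustrative placeholders, not figures taken from the note.

    def model_flops_utilization(tokens_per_sec, num_params, num_gpus, peak_flops_per_gpu):
        # ~6 * num_params FLOPs per token approximates forward + backward for a
        # dense transformer (attention FLOPs ignored).
        achieved_flops_per_sec = 6.0 * num_params * tokens_per_sec
        peak_flops_per_sec = num_gpus * peak_flops_per_gpu
        return achieved_flops_per_sec / peak_flops_per_sec

    # Example with placeholder numbers: a 7B-parameter model at 60k tokens/s on
    # 8 GPUs, each assumed to peak at ~989 TFLOPS dense BF16 (the commonly
    # quoted H100/H200 figure). Prints roughly 0.32, i.e. ~32% MFU.
    print(model_flops_utilization(6.0e4, 7e9, 8, 989e12))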
Projects
High-Efficiency LLM Training System via Strand Interleaving on NVIDIA Hopper GPUs.
Adaptive clustering system for large-scale ML workloads with dynamic resource scheduling.
Publications
Peer-reviewed work on distributed training, tensor core optimization, and efficient AI systems, published at top-tier venues.
Mantle proposes a two-layer metadata architecture for cloud object storage, enabling scalable hierarchical namespace management.
AutoCCL automates collective communication tuning to accelerate distributed and parallel DNN training.
GLPilot introduces a staleness-bounded embedding buffering mechanism to reduce remote fetches in distributed GNN training with learnable vertex embeddings.
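GLPilot's staleness bound can be illustrated with a generic buffering scheme: a locally cached embedding is reused until its age in training steps exceeds the bound, and only then refetched from the remote store. The Python sketch below shows that general idea under assumed names; it is not GLPilot's implementation.

    from dataclasses import dataclass, field
    from typing import Callable, Dict, List, Tuple

    @dataclass
    class StalenessBoundedBuffer:
        # Reuse a locally buffered embedding while its age (in steps) stays
        # within `staleness_bound`; otherwise call the assumed remote fetch.
        staleness_bound: int
        fetch_remote: Callable[[int], List[float]]  # hypothetical remote-fetch hook
        _cache: Dict[int, Tuple[List[float], int]] = field(default_factory=dict)

        def get(self, vertex_id: int, step: int) -> List[float]:
            hit = self._cache.get(vertex_id)
            if hit is not None and step - hit[1] <= self.staleness_bound:
                return hit[0]  # fresh enough: avoid a remote fetch
            embedding = self.fetch_remote(vertex_id)  # stale or missing: refetch
            self._cache[vertex_id] = (embedding, step)
            return embedding

    # Toy usage: vertex 17 is fetched at step 10 and reused until the bound lapses.
    buf = StalenessBoundedBuffer(staleness_bound=4, fetch_remote=lambda v: [0.0] * 8)
    buf.get(17, step=10)
    buf.get(17, step=13)  # served from the local buffer
    buf.get(17, step=20)  # bound exceeded, so refetched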