Featured

Latest from the lab.

Training · Active

TWIST

High-Efficiency LLM Training System via Strand Interleaving on NVIDIA Hopper GPUs.

llm · training · distributed-systems · hopper
Blog

TWIST

Technical note with overview, layer breakdown, and MFU comparison figures.

A technical note for DHeLlam, the ICCD 2025 Best Paper, summarizing the currently available overview, layer breakdown, and H200 MFU comparison figures.

TWIST · technical note · systems · DHeLlam
DHeLlam overview figure used in the TWIST note.

Research Areas

What we work on.

parallel computing / distributed systems / AI systems

Parallel and Distributed Intelligent Computing

Automatic micro-batch co-execution, tensor core scheduling, and efficient distributed LLM training.

storage / data systems / infrastructure

Storage Systems

Cloud-edge query processing and data-centric system design.

data-centric / systems / large models

Data-Centric System Design

A data-centric perspective on system design problems for large models.

Blog

Technical writeups.

All posts

Projects

Research projects.

View all projects
Systems · Active

Adacluster

Adaptive clustering system for large-scale ML workloads with dynamic resource scheduling.

ml-systems · scheduling · resource-management

Publications

Research publications from the lab, including award-winning work.

Peer-reviewed work on distributed training, tensor core optimization, and efficient AI systems published at top-tier venues.

View all publications
IEEE TPDS 2025 · February 2025

GLPilot: Efficient Distributed GNN Training With Learnable Embeddings

Chengru Yang, Chaoyi Ruan, Chengjie Tang, Ping Gong, Shiyi Wang, Xiang Song, Cheng Li

GLPilot introduces a staleness-bounded embedding buffering mechanism to reduce remote fetches in distributed GNN training with learnable vertex embeddings.

GNN · distributed training · graph learning · embeddings
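The idea behind staleness-bounded embedding buffering can be sketched as follows: cache remote vertex embeddings locally and refetch only when the cached copy is older than a staleness bound. This is an illustrative toy, not GLPilot's actual implementation — all names (`EmbeddingBuffer`, `fetch_remote`, `staleness_bound`) are assumptions made for the example.

```python
# Hypothetical sketch of staleness-bounded embedding buffering, loosely
# inspired by the GLPilot description above. All names are illustrative
# assumptions, not the paper's actual API.

class EmbeddingBuffer:
    """Caches remote vertex embeddings, refetching only when the cached
    copy is more than `staleness_bound` training steps old."""

    def __init__(self, fetch_remote, staleness_bound):
        self.fetch_remote = fetch_remote        # callable: vertex_id -> embedding
        self.staleness_bound = staleness_bound  # max tolerated age, in steps
        self.cache = {}                         # vertex_id -> (embedding, step fetched)
        self.remote_fetches = 0                 # counts network round trips

    def get(self, vertex_id, step):
        entry = self.cache.get(vertex_id)
        if entry is not None and step - entry[1] <= self.staleness_bound:
            return entry[0]                     # fresh enough: serve from cache
        emb = self.fetch_remote(vertex_id)      # stale or missing: refetch
        self.remote_fetches += 1
        self.cache[vertex_id] = (emb, step)
        return emb


# Toy usage: with bound 2, reads within 2 steps of the last fetch hit the cache.
buf = EmbeddingBuffer(fetch_remote=lambda v: [float(v)] * 4, staleness_bound=2)
buf.get(7, step=0)   # remote fetch
buf.get(7, step=1)   # cached (age 1)
buf.get(7, step=2)   # cached (age 2)
buf.get(7, step=5)   # stale (age 3) -> remote fetch
print(buf.remote_fetches)  # 2
```

Trading a bounded amount of staleness for locality is what cuts the remote-fetch volume: only reads whose cached entry has aged past the bound pay a network round trip.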