Mantle: Efficient Hierarchical Metadata Management for Cloud Object Storage Services
Mantle proposes a two-layer metadata architecture for cloud object storage, enabling scalable hierarchical namespace management.
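A minimal Python sketch of the two-layer idea, assuming a hypothetical upper directory layer that resolves hierarchical paths and a lower flat layer that stores per-object metadata under opaque keys; the names and structure are illustrative, not Mantle's actual design.

```python
# Illustrative two-layer namespace (not Mantle's actual design): the upper
# layer resolves hierarchical paths, the lower layer is a flat id -> metadata
# store, so object puts/gets are independent of directory depth.

class TwoLayerNamespace:
    def __init__(self):
        self.dirs = {"/": {}}   # upper layer: parent path -> {name -> inode id}
        self.objects = {}       # lower layer: inode id -> object metadata
        self.next_id = 0

    def create(self, path, meta):
        parent, _, name = path.rpartition("/")
        self.next_id += 1
        self.dirs.setdefault(parent or "/", {})[name] = self.next_id
        self.objects[self.next_id] = meta   # flat put into the lower layer
        return self.next_id

    def lookup(self, path):
        parent, _, name = path.rpartition("/")
        inode = self.dirs.get(parent or "/", {}).get(name)
        return self.objects.get(inode)

ns = TwoLayerNamespace()
ns.create("/photos/2024", {"type": "dir"})
ns.create("/photos/2024/cat.jpg", {"size": 123})
print(ns.lookup("/photos/2024/cat.jpg"))   # {'size': 123}
```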
MLSys @ USTC
Research on distributed training, tensor core optimization, and efficient AI systems.
AutoCCL automates collective communication tuning to accelerate distributed and parallel DNN training.
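A toy sketch of the black-box tuning loop idea: search a space of communication knobs and keep the fastest configuration. The knob names and simulated cost surface are invented so the example runs standalone; AutoCCL's actual search strategy is not shown here.

```python
import itertools, random

def run_allreduce(chunk_kb, n_channels):
    """Stand-in for launching a collective with given knobs and timing it.
    A real tuner would invoke the communication library; here we simulate
    a cost surface so the sketch is runnable."""
    return abs(chunk_kb - 512) / 512 + abs(n_channels - 8) / 8 + random.random() * 0.01

# Grid search over hypothetical knobs, keeping the fastest configuration.
best = min(
    itertools.product([128, 256, 512, 1024], [2, 4, 8, 16]),
    key=lambda cfg: run_allreduce(*cfg),
)
print("best (chunk_kb, n_channels):", best)
```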
GLPilot introduces a staleness-bounded embedding buffering mechanism to reduce remote fetches in distributed GNN training with learnable vertex embeddings.
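A runnable sketch of staleness-bounded buffering, assuming a hypothetical per-vertex age check: a buffered embedding is reused while its age in training steps stays within a bound, otherwise it is refetched from the remote store.

```python
class StalenessBoundedBuffer:
    """Illustrative cache for remote vertex embeddings: a buffered copy is
    reused only while its age (in update steps) stays within `bound`."""

    def __init__(self, bound, fetch_fn):
        self.bound = bound
        self.fetch = fetch_fn          # remote fetch, e.g. an RPC in practice
        self.buf = {}                  # vertex id -> (step fetched, embedding)
        self.step = 0

    def tick(self):                    # call once per training step
        self.step += 1

    def get(self, vid):
        hit = self.buf.get(vid)
        if hit and self.step - hit[0] <= self.bound:
            return hit[1]              # fresh enough: no remote round trip
        emb = self.fetch(vid)          # too stale or missing: refetch
        self.buf[vid] = (self.step, emb)
        return emb

buf = StalenessBoundedBuffer(bound=3, fetch_fn=lambda v: [0.0] * 4)
print(buf.get(42))
```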
Perseus is a scalable HTAP database that enforces strong consistency for both transactions and analytical queries across geo-distributed deployments.
ICCD 2025 Best Paper. DHeLlam introduces automatic micro-batch co-execution to reduce communication bottlenecks in distributed LLM training.
VPTQ leverages vector quantization with second-order optimization to achieve extreme low-bit (1–4 bit) compression of LLMs while maintaining high accuracy.
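A toy instance of the underlying vector-quantization step: weight sub-vectors are encoded as indices into a k-means codebook (here 4-bit indices over 2-wide sub-vectors, i.e. 2 bits per weight). VPTQ's second-order optimization is omitted; this only illustrates the encoding.

```python
import numpy as np

def vq_compress(w, dim=2, k=16, iters=10):
    """Toy vector quantization: split a weight matrix into `dim`-sized
    sub-vectors and encode each as the index of its nearest centroid,
    i.e. log2(k) bits per sub-vector (here 4 bits / 2 weights)."""
    v = w.reshape(-1, dim)
    codebook = v[np.random.choice(len(v), k, replace=False)].copy()
    for _ in range(iters):                       # plain k-means
        idx = np.argmin(((v[:, None] - codebook[None]) ** 2).sum(-1), axis=1)
        for c in range(k):
            if (idx == c).any():
                codebook[c] = v[idx == c].mean(0)
    return codebook, idx

w = np.random.randn(64, 64).astype(np.float32)
codebook, idx = vq_compress(w)
w_hat = codebook[idx].reshape(w.shape)           # decompression
print("reconstruction MSE:", float(((w - w_hat) ** 2).mean()))
```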
PolyBase addresses data affinity misalignment in geo-replicated databases by enabling row-level Paxos-group re-assignment, significantly reducing wide-area round trips.
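A sketch of the re-assignment idea under an invented API: track which region accesses a row most and move the row to that region's Paxos group, so subsequent commits stay local. PolyBase's actual migration protocol is not modeled.

```python
from collections import Counter

# Hypothetical row-to-Paxos-group placement: each region hosts one group,
# and a row moves to the group of the region issuing most of its accesses.

GROUP_OF_REGION = {"us": 0, "eu": 1, "asia": 2}

class Placement:
    def __init__(self):
        self.group_of_row = {}
        self.accesses = {}           # row -> Counter of accessing regions

    def record(self, row, region):
        self.accesses.setdefault(row, Counter())[region] += 1
        dominant = self.accesses[row].most_common(1)[0][0]
        target = GROUP_OF_REGION[dominant]
        if self.group_of_row.get(row) != target:
            self.group_of_row[row] = target   # stands in for a group migration

p = Placement()
for _ in range(5):
    p.record("user:7", "eu")
print(p.group_of_row["user:7"])    # 1 -> now served by the EU group
```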
nnScaler generates efficient parallelization plans for DNN training via three primitives expressive enough to capture the model transformation and spatiotemporal scheduling of any parallel plan.
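A schematic sketch of the three-primitive idea with hypothetical names and data structures: one primitive transforms (partitions) an operator, one assigns the pieces to devices (spatial scheduling), and one orders their execution (temporal scheduling).

```python
# Hypothetical rendering of a three-primitive plan description; the
# function names and structures here are illustrative only.

plan = []

def op_trans(op, axis, parts):
    """Transformation: partition one operator into sub-operators."""
    return [f"{op}[{axis}={i}/{parts}]" for i in range(parts)]

def op_assign(sub_ops, devices):
    """Spatial scheduling: map each sub-operator to a device."""
    plan.extend(zip(sub_ops, devices))

def op_order(steps):
    """Temporal scheduling: fix per-device execution order."""
    return sorted(steps, key=lambda s: (s[1], s[0]))

shards = op_trans("matmul1", axis="row", parts=2)
op_assign(shards, devices=[0, 1])
print(op_order(plan))
```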
gSampler is a general GPU-based graph sampling framework whose extract-compute-select-finalize (ECSF) programming model unifies 15 popular graph sampling algorithms with high efficiency.
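A CPU-side toy of an extract-compute-select-finalize pipeline for uniform neighbor sampling; gSampler executes such stages as optimized GPU kernels, which this sketch makes no attempt to model.

```python
import random

graph = {0: [1, 2, 3], 1: [0, 3], 2: [0], 3: [0, 1]}

def sample(seeds, fanout):
    frontier = {s: graph[s] for s in seeds}                     # Extract: candidate edges
    weights = {s: [1.0] * len(n) for s, n in frontier.items()}  # Compute: uniform weights
    picked = {s: random.choices(n, weights[s], k=min(fanout, len(n)))
              for s, n in frontier.items()}                     # Select: fanout neighbors
    return [(s, d) for s, ns in picked.items() for d in ns]     # Finalize: edge list

print(sample(seeds=[0, 3], fanout=2))
```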
SPFresh supports efficient in-place vector updates for billion-scale ANNS through LIRE, a lightweight incremental rebalancing protocol.
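A minimal sketch in the spirit of incremental rebalancing, with invented thresholds: an insert appends to the nearest posting, and only an oversized posting is split and its own vectors reassigned, rather than rebuilding the whole index.

```python
import numpy as np

LIMIT = 4
centroids = [np.zeros(2), np.ones(2) * 5]
postings = [[], []]

def insert(v):
    i = int(np.argmin([np.linalg.norm(v - c) for c in centroids]))
    postings[i].append(v)
    if len(postings[i]) > LIMIT:                 # local split, not a rebuild
        vecs = postings[i]
        a, b = vecs[0], vecs[-1]                 # crude 2-means seeding
        centroids[i] = a
        centroids.append(b)
        postings[i] = [x for x in vecs if np.linalg.norm(x - a) <= np.linalg.norm(x - b)]
        postings.append([x for x in vecs if np.linalg.norm(x - a) > np.linalg.norm(x - b)])

for _ in range(10):
    insert(np.random.randn(2))
print([len(p) for p in postings])
```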
FrozenHot improves cache scalability on modern multi-core hardware by splitting the cache into a frozen hot segment, served without per-access synchronization, and a dynamic segment managed as usual.
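A sketch of that split, assuming a simple two-segment design: lookups in the frozen segment are plain reads with no lock or LRU bookkeeping, while the dynamic segment keeps ordinary synchronized behavior; the periodic rebuild of the frozen segment is omitted.

```python
import threading

class FrozenHotCache:
    def __init__(self, hot_items):
        self.frozen = dict(hot_items)       # immutable between rebuilds
        self.dynamic = {}
        self.lock = threading.Lock()

    def get(self, key):
        v = self.frozen.get(key)            # hot path: no synchronization
        if v is not None:
            return v
        with self.lock:                     # cold path: locked dynamic segment
            return self.dynamic.get(key)

    def put(self, key, value):
        with self.lock:
            self.dynamic[key] = value

cache = FrozenHotCache(hot_items={"popular": 1})
print(cache.get("popular"))
```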
CFS builds a scalable, fully POSIX-compliant distributed file system by pruning critical sections to reduce locking overhead.
MPress enables billion-scale DNN model training on a single multi-GPU server by exploiting inter-operator parallelism to save GPU memory.
Lunule proposes an imbalance factor model for accurate metadata load balancing in CephFS, enabling agile and efficient rebalancing.
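An illustrative imbalance check (not Lunule's actual model): measure how far the most-loaded metadata server deviates from the mean and migrate only past a threshold, avoiding the needless migrations a naive balancer triggers.

```python
def imbalance_factor(loads):
    """Relative deviation of the most-loaded server from the mean."""
    mean = sum(loads) / len(loads)
    return max(abs(l - mean) for l in loads) / mean if mean else 0.0

loads = [120, 100, 95, 410]                # ops/s per metadata server
if imbalance_factor(loads) > 0.5:          # threshold is illustrative
    print("rebalance: migrate subtrees off server", loads.index(max(loads)))
```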
This paper introduces Tensor Homomorphic Compression (THC), which enables direct aggregation of compressed gradients, accelerating data-parallel DNN training.
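The minimal instance of the homomorphic property, assuming a uniform quantizer with a scale shared across workers: integer codes can be summed directly and dequantized once, so the aggregator never decompresses per-worker gradients. THC's actual encoding is more involved.

```python
import numpy as np

SCALE = 0.01
def encode(g):  return np.round(g / SCALE).astype(np.int32)
def decode(q):  return q.astype(np.float32) * SCALE

grads = [np.random.randn(8).astype(np.float32) for _ in range(4)]
agg = decode(sum(encode(g) for g in grads))       # aggregate compressed codes
print(np.max(np.abs(agg - sum(grads))))           # small quantization error
```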
AutoGR automates geo-replication deployment, freeing programmers from manual consistency annotations while preserving application semantics.
SpanDB adapts RocksDB to utilize high-speed NVMe SSDs selectively for WAL and top LSM-tree levels, achieving significant throughput and latency improvements.
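A sketch of the placement rule, with made-up mount points and cutoff level: the WAL and the hottest top levels go to the fast NVMe device, deeper levels to the capacity disk.

```python
FAST_DEV, SLOW_DEV = "/mnt/nvme", "/mnt/capacity_ssd"   # illustrative mounts
FAST_LEVELS = 2                                         # illustrative cutoff

def device_for(item):
    if item == "wal":
        return FAST_DEV                 # latency-critical log writes
    kind, level = item                  # ("sstable", LSM level)
    return FAST_DEV if level < FAST_LEVELS else SLOW_DEV

print(device_for("wal"))                   # /mnt/nvme
print(device_for(("sstable", 0)))          # /mnt/nvme
print(device_for(("sstable", 4)))          # /mnt/capacity_ssd
```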
PaGraph accelerates GNN training on large graphs by caching frequently accessed graph data in GPU memory and using computation-aware graph partitioning.
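A sketch of degree-ordered static feature caching, with the GPU cache simulated by a host-side dict: features of the highest-degree vertices are pinned, and a minibatch gather falls back to host memory only on misses. The computation-aware partitioner is not modeled.

```python
import numpy as np

features = np.random.randn(1000, 16).astype(np.float32)   # host copy
degree = np.random.randint(1, 100, size=1000)
hot = np.argsort(-degree)[:100]                           # cache top 10%
gpu_cache = {int(v): features[v] for v in hot}            # stand-in for GPU memory

def gather(batch):
    return np.stack([gpu_cache.get(int(v), features[v]) for v in batch])

batch = np.random.randint(0, 1000, size=32)
hits = sum(int(v) in gpu_cache for v in batch)
print(f"{hits}/{len(batch)} feature reads served from cache")
```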
ElasticBF dynamically adjusts per-SSTable Bloom filter size based on access frequency to improve read performance in LSM-tree-based KV stores.
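A back-of-envelope for the sizing trade-off: with an optimal hash count, a Bloom filter's false-positive rate is roughly 0.6185 raised to the bits-per-key, so spending extra bits on frequently read SSTables cuts wasted disk reads where it matters. The frequencies and bit budgets below are illustrative.

```python
def fpr(bits_per_key):
    """Approximate Bloom-filter false-positive rate at optimal hash count."""
    return 0.6185 ** bits_per_key

for name, freq, bits in [("hot sstable", 0.70, 8), ("cold sstable", 0.05, 2)]:
    print(f"{name}: {bits} bits/key -> FPR {fpr(bits):.3f} "
          f"(weighted by access freq {freq}: {freq * fpr(bits):.4f})")
```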
This paper presents PoR consistency, a novel fine-grained consistency definition that generalizes the trade-off between performance and coordination in geo-replicated systems.