Mantle: Efficient Hierarchical Metadata Management for Cloud Object Storage Services
Mantle proposes a two-layer metadata architecture for cloud object storage, enabling scalable hierarchical namespace management.
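A minimal Python sketch of the two-layer idea, assuming a hypothetical upper directory layer that resolves hierarchical paths and a lower flat layer that stores per-object metadata under opaque keys; the names and structure are illustrative, not Mantle's actual design.

```python
# Illustrative two-layer namespace (not Mantle's actual design): the upper
# layer resolves hierarchical paths, the lower layer is a flat id -> metadata
# store, so object puts/gets are independent of directory depth.

class TwoLayerNamespace:
    def __init__(self):
        self.dirs = {"/": {}}   # upper layer: parent path -> {name -> inode id}
        self.objects = {}       # lower layer: inode id -> object metadata
        self.next_id = 0

    def create(self, path, meta):
        parent, _, name = path.rpartition("/")
        self.next_id += 1
        self.dirs.setdefault(parent or "/", {})[name] = self.next_id
        self.objects[self.next_id] = meta   # flat put into the lower layer
        return self.next_id

    def lookup(self, path):
        parent, _, name = path.rpartition("/")
        inode = self.dirs.get(parent or "/", {}).get(name)
        return self.objects.get(inode)

ns = TwoLayerNamespace()
ns.create("/photos/2024", {"type": "dir"})
ns.create("/photos/2024/cat.jpg", {"size": 123})
print(ns.lookup("/photos/2024/cat.jpg"))   # {'size': 123}
```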
MLSys @ USTC
Research on distributed training, tensor core optimization, and efficient AI systems.
AutoCCL automates collective communication tuning to accelerate distributed and parallel DNN training.
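A toy sketch of the black-box tuning loop idea: search a space of communication knobs and keep the fastest configuration. The knob names and simulated cost surface are invented so the example runs standalone; AutoCCL's actual search strategy is not shown here.

```python
import itertools, random

def run_allreduce(chunk_kb, n_channels):
    """Stand-in for launching a collective with given knobs and timing it.
    A real tuner would invoke the communication library; here we simulate
    a cost surface so the sketch is runnable."""
    return abs(chunk_kb - 512) / 512 + abs(n_channels - 8) / 8 + random.random() * 0.01

# Grid search over hypothetical knobs, keeping the fastest configuration.
best = min(
    itertools.product([128, 256, 512, 1024], [2, 4, 8, 16]),
    key=lambda cfg: run_allreduce(*cfg),
)
print("best (chunk_kb, n_channels):", best)
```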
GLPilot introduces a staleness-bounded embedding buffering mechanism to reduce remote fetches in distributed GNN training with learnable vertex embeddings.
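A runnable sketch of staleness-bounded buffering, assuming a hypothetical per-vertex age check: a buffered embedding is reused while its age in training steps stays within a bound, otherwise it is refetched from the remote store.

```python
class StalenessBoundedBuffer:
    """Illustrative cache for remote vertex embeddings: a buffered copy is
    reused only while its age (in update steps) stays within `bound`."""

    def __init__(self, bound, fetch_fn):
        self.bound = bound
        self.fetch = fetch_fn          # remote fetch, e.g. an RPC in practice
        self.buf = {}                  # vertex id -> (step fetched, embedding)
        self.step = 0

    def tick(self):                    # call once per training step
        self.step += 1

    def get(self, vid):
        hit = self.buf.get(vid)
        if hit and self.step - hit[0] <= self.bound:
            return hit[1]              # fresh enough: no remote round trip
        emb = self.fetch(vid)          # too stale or missing: refetch
        self.buf[vid] = (self.step, emb)
        return emb

buf = StalenessBoundedBuffer(bound=3, fetch_fn=lambda v: [0.0] * 4)
print(buf.get(42))
```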
Perseus is a scalable HTAP database that enforces strong consistency for both transactions and analytical queries across geo-distributed deployments.
ICCD 2025 Best Paper. DHeLlam introduces automatic micro-batch co-execution to reduce communication bottlenecks in distributed LLM training.
VPTQ leverages vector quantization with second-order optimization to achieve extreme low-bit (1–4 bit) compression of LLMs while maintaining high accuracy.
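A toy instance of the underlying vector-quantization step: weight sub-vectors are encoded as indices into a k-means codebook (here 4-bit indices over 2-wide sub-vectors, i.e. 2 bits per weight). VPTQ's second-order optimization is omitted; this only illustrates the encoding.

```python
import numpy as np

def vq_compress(w, dim=2, k=16, iters=10):
    """Toy vector quantization: split a weight matrix into `dim`-sized
    sub-vectors and encode each as the index of its nearest centroid,
    i.e. log2(k) bits per sub-vector (here 4 bits / 2 weights)."""
    v = w.reshape(-1, dim)
    codebook = v[np.random.choice(len(v), k, replace=False)].copy()
    for _ in range(iters):                       # plain k-means
        idx = np.argmin(((v[:, None] - codebook[None]) ** 2).sum(-1), axis=1)
        for c in range(k):
            if (idx == c).any():
                codebook[c] = v[idx == c].mean(0)
    return codebook, idx

w = np.random.randn(64, 64).astype(np.float32)
codebook, idx = vq_compress(w)
w_hat = codebook[idx].reshape(w.shape)           # decompression
print("reconstruction MSE:", float(((w - w_hat) ** 2).mean()))
```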
PolyBase addresses data affinity misalignment in geo-replicated databases by enabling row-level Paxos-group re-assignment, significantly reducing wide-area round trips.
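A sketch of the re-assignment idea under an invented API: track which region accesses a row most and move the row to that region's Paxos group, so subsequent commits stay local. PolyBase's actual migration protocol is not modeled.

```python
from collections import Counter

# Hypothetical row-to-Paxos-group placement: each region hosts one group,
# and a row moves to the group of the region issuing most of its accesses.

GROUP_OF_REGION = {"us": 0, "eu": 1, "asia": 2}

class Placement:
    def __init__(self):
        self.group_of_row = {}
        self.accesses = {}           # row -> Counter of accessing regions

    def record(self, row, region):
        self.accesses.setdefault(row, Counter())[region] += 1
        dominant = self.accesses[row].most_common(1)[0][0]
        target = GROUP_OF_REGION[dominant]
        if self.group_of_row.get(row) != target:
            self.group_of_row[row] = target   # stands in for a group migration

p = Placement()
for _ in range(5):
    p.record("user:7", "eu")
print(p.group_of_row["user:7"])    # 1 -> now served by the EU group
```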
nnScaler generates efficient parallelization plans for DNN training via three primitives expressive enough to capture the model transformation and spatiotemporal scheduling of any parallel plan.
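A schematic sketch of the three-primitive idea with hypothetical names and data structures: one primitive transforms (partitions) an operator, one assigns the pieces to devices (spatial scheduling), and one orders their execution (temporal scheduling).

```python
# Hypothetical rendering of a three-primitive plan description; the
# function names and structures here are illustrative only.

plan = []

def op_trans(op, axis, parts):
    """Transformation: partition one operator into sub-operators."""
    return [f"{op}[{axis}={i}/{parts}]" for i in range(parts)]

def op_assign(sub_ops, devices):
    """Spatial scheduling: map each sub-operator to a device."""
    plan.extend(zip(sub_ops, devices))

def op_order(steps):
    """Temporal scheduling: fix per-device execution order."""
    return sorted(steps, key=lambda s: (s[1], s[0]))

shards = op_trans("matmul1", axis="row", parts=2)
op_assign(shards, devices=[0, 1])
print(op_order(plan))
```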
gSampler is a general GPU-based graph sampling framework whose extract-compute-select-finalize (ECSF) programming model unifies 15 popular graph sampling algorithms with high efficiency.
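A CPU-side toy of an extract-compute-select-finalize pipeline for uniform neighbor sampling; gSampler executes such stages as optimized GPU kernels, which this sketch makes no attempt to model.

```python
import random

graph = {0: [1, 2, 3], 1: [0, 3], 2: [0], 3: [0, 1]}

def sample(seeds, fanout):
    frontier = {s: graph[s] for s in seeds}                     # Extract: candidate edges
    weights = {s: [1.0] * len(n) for s, n in frontier.items()}  # Compute: uniform weights
    picked = {s: random.choices(n, weights[s], k=min(fanout, len(n)))
              for s, n in frontier.items()}                     # Select: fanout neighbors
    return [(s, d) for s, ns in picked.items() for d in ns]     # Finalize: edge list

print(sample(seeds=[0, 3], fanout=2))
```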
SPFresh supports efficient in-place vector updates for billion-scale ANNS through LIRE, a lightweight incremental rebalancing protocol.
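A minimal sketch in the spirit of incremental rebalancing, with invented thresholds: an insert appends to the nearest posting, and only an oversized posting is split and its own vectors reassigned, rather than rebuilding the whole index.

```python
import numpy as np

LIMIT = 4
centroids = [np.zeros(2), np.ones(2) * 5]
postings = [[], []]

def insert(v):
    i = int(np.argmin([np.linalg.norm(v - c) for c in centroids]))
    postings[i].append(v)
    if len(postings[i]) > LIMIT:                 # local split, not a rebuild
        vecs = postings[i]
        a, b = vecs[0], vecs[-1]                 # crude 2-means seeding
        centroids[i] = a
        centroids.append(b)
        postings[i] = [x for x in vecs if np.linalg.norm(x - a) <= np.linalg.norm(x - b)]
        postings.append([x for x in vecs if np.linalg.norm(x - a) > np.linalg.norm(x - b)])

for _ in range(10):
    insert(np.random.randn(2))
print([len(p) for p in postings])
```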
FrozenHot improves cache scalability on modern multi-core hardware by splitting the cache into a frozen hot segment, served without per-access synchronization, and a dynamic segment managed as usual.
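A sketch of that split, assuming a simple two-segment design: lookups in the frozen segment are plain reads with no lock or LRU bookkeeping, while the dynamic segment keeps ordinary synchronized behavior; the periodic rebuild of the frozen segment is omitted.

```python
import threading

class FrozenHotCache:
    def __init__(self, hot_items):
        self.frozen = dict(hot_items)       # immutable between rebuilds
        self.dynamic = {}
        self.lock = threading.Lock()

    def get(self, key):
        v = self.frozen.get(key)            # hot path: no synchronization
        if v is not None:
            return v
        with self.lock:                     # cold path: locked dynamic segment
            return self.dynamic.get(key)

    def put(self, key, value):
        with self.lock:
            self.dynamic[key] = value

cache = FrozenHotCache(hot_items={"popular": 1})
print(cache.get("popular"))
```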
CFS builds a scalable, fully POSIX-compliant distributed file system by pruning critical sections to reduce locking overhead.
MPress enables billion-scale DNN model training on a single multi-GPU server by exploiting inter-operator parallelism to save GPU memory.
Lunule proposes an imbalance factor model for accurate metadata load balancing in CephFS, enabling agile and efficient rebalancing.
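An illustrative imbalance check (not Lunule's actual model): measure how far the most-loaded metadata server deviates from the mean and migrate only past a threshold, avoiding the needless migrations a naive balancer triggers.

```python
def imbalance_factor(loads):
    """Relative deviation of the most-loaded server from the mean."""
    mean = sum(loads) / len(loads)
    return max(abs(l - mean) for l in loads) / mean if mean else 0.0

loads = [120, 100, 95, 410]                # ops/s per metadata server
if imbalance_factor(loads) > 0.5:          # threshold is illustrative
    print("rebalance: migrate subtrees off server", loads.index(max(loads)))
```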
This paper introduces Tensor Homomorphic Compression (THC), which enables direct aggregation of compressed gradients, accelerating data-parallel DNN training.
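The minimal instance of the homomorphic property, assuming a uniform quantizer with a scale shared across workers: integer codes can be summed directly and dequantized once, so the aggregator never decompresses per-worker gradients. THC's actual encoding is more involved.

```python
import numpy as np

SCALE = 0.01
def encode(g):  return np.round(g / SCALE).astype(np.int32)
def decode(q):  return q.astype(np.float32) * SCALE

grads = [np.random.randn(8).astype(np.float32) for _ in range(4)]
agg = decode(sum(encode(g) for g in grads))       # aggregate compressed codes
print(np.max(np.abs(agg - sum(grads))))           # small quantization error
```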
AutoGR automates geo-replication deployment, freeing programmers from manual consistency annotations while preserving application semantics.
SpanDB adapts RocksDB to utilize high-speed NVMe SSDs selectively for WAL and top LSM-tree levels, achieving significant throughput and latency improvements.
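A sketch of the placement rule, with made-up mount points and cutoff level: the WAL and the hottest top levels go to the fast NVMe device, deeper levels to the capacity disk.

```python
FAST_DEV, SLOW_DEV = "/mnt/nvme", "/mnt/capacity_ssd"   # illustrative mounts
FAST_LEVELS = 2                                         # illustrative cutoff

def device_for(item):
    if item == "wal":
        return FAST_DEV                 # latency-critical log writes
    kind, level = item                  # ("sstable", LSM level)
    return FAST_DEV if level < FAST_LEVELS else SLOW_DEV

print(device_for("wal"))                   # /mnt/nvme
print(device_for(("sstable", 0)))          # /mnt/nvme
print(device_for(("sstable", 4)))          # /mnt/capacity_ssd
```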
PaGraph accelerates GNN training on large graphs by caching frequently accessed graph data in GPU memory and using computation-aware graph partitioning.
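A sketch of degree-ordered static feature caching, with the GPU cache simulated by a host-side dict: features of the highest-degree vertices are pinned, and a minibatch gather falls back to host memory only on misses. The computation-aware partitioner is not modeled.

```python
import numpy as np

features = np.random.randn(1000, 16).astype(np.float32)   # host copy
degree = np.random.randint(1, 100, size=1000)
hot = np.argsort(-degree)[:100]                           # cache top 10%
gpu_cache = {int(v): features[v] for v in hot}            # stand-in for GPU memory

def gather(batch):
    return np.stack([gpu_cache.get(int(v), features[v]) for v in batch])

batch = np.random.randint(0, 1000, size=32)
hits = sum(int(v) in gpu_cache for v in batch)
print(f"{hits}/{len(batch)} feature reads served from cache")
```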
ElasticBF dynamically adjusts per-SSTable Bloom filter size based on access frequency to improve read performance in LSM-tree-based KV stores.
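A back-of-envelope for the sizing trade-off: with an optimal hash count, a Bloom filter's false-positive rate is roughly 0.6185 raised to the bits-per-key, so spending extra bits on frequently read SSTables cuts wasted disk reads where it matters. The frequencies and bit budgets below are illustrative.

```python
def fpr(bits_per_key):
    """Approximate Bloom-filter false-positive rate at optimal hash count."""
    return 0.6185 ** bits_per_key

for name, freq, bits in [("hot sstable", 0.70, 8), ("cold sstable", 0.05, 2)]:
    print(f"{name}: {bits} bits/key -> FPR {fpr(bits):.3f} "
          f"(weighted by access freq {freq}: {freq * fpr(bits):.4f})")
```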
This paper presents PoR consistency, a novel fine-grained consistency definition that generalizes the trade-off between performance and coordination in geo-replicated systems.