Reducing the GPU Memory Bottleneck with Lossless Compression for ML
Aditya K Kamath, Arvind Krishnamurthy, Marco Canini, Simon Peter. Published in 21st ACM European Conference on Computer Systems (EuroSys), 2026.
In this paper, we examine the use of compressed data and on-the-fly decompression to feed ML models that must transfer data from CPU memory to the GPU during execution. We propose Invariant Bit Packing (IBP), a lossless compression scheme whose lightweight decompression reduces PCIe transfer latency. IBP achieves, on average, 74% faster GNN training, 180% faster DLRM embedding lookup, and 25% faster LLM inference. Read more...
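To give a flavor of the general idea (this is an illustrative sketch of generic bit packing, not the paper's actual IBP algorithm): when all values in a block share the same high-order bits, those invariant bits can be stored once, and only the few varying low-order bits need to be packed and transferred.

```python
import numpy as np

# Hypothetical sketch of bit packing over a block of uint32 values
# whose high-order bits are identical ("invariant"). Not the paper's
# actual IBP implementation.

def pack(values: np.ndarray):
    """Pack uint32 values sharing a common high-bit prefix."""
    xor = np.bitwise_or.reduce(values ^ values[0])
    width = int(xor).bit_length() or 1        # bits that actually vary
    base = (int(values[0]) >> width) << width  # shared invariant prefix
    mask = (1 << width) - 1
    # Concatenate the varying low bits of all values into one integer.
    packed = 0
    for v in values:
        packed = (packed << width) | (int(v) & mask)
    return base, width, packed, len(values)

def unpack(base, width, packed, n):
    """Reverse pack(): recover the original uint32 array losslessly."""
    mask = (1 << width) - 1
    out = np.empty(n, dtype=np.uint32)
    for i in range(n - 1, -1, -1):
        out[i] = base | (packed & mask)
        packed >>= width
    return out
```

For a block of four values like `[1000, 1003, 1001, 1002]`, only 2 bits per value vary, so the payload shrinks from 128 bits to 8 bits plus a small per-block header, and decompression is a mask-and-or per value.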
