Ph.D. Defense: Mitigating the Impact of Data Movement in Memory-Intensive Applications

Aditya K Kamath

Presented at University of Washington, 2026

This is the talk I gave for my Ph.D. defense. I gave similar versions of it for research job interviews. I make heavy use of animations in my talks, and the slide preview may not render them correctly. For the full experience, download the slides and open them in PowerPoint.

Modern memory-intensive workloads move terabytes of data through hierarchies of interconnects whose bandwidth and latency have failed to scale with compute. As a result, there is no longer a single memory wall between compute and memory, but layers of connections, each a potential bottleneck. Caches, memory bus width, and inter-device interconnects all affect the performance that applications observe.

Overcoming these limitations requires identifying key application characteristics that enable bottleneck mitigation. I identify and organize mitigations into three types: (1) avoiding data movement across certain layers by eliding extraneous or redundant data movement, (2) hiding data movement by overlapping transfers with complementary work, and (3) reducing the amount of data moved by exploiting structure present within the data itself. This characterization identifies the application properties that enable intervention and directly informs how each mitigation should be designed.

I discuss three systems that demonstrate these concepts. Each targets a different bottleneck component and takes advantage of an application property that enables the mitigation. (MC)^2 targets CPU memory access latency, exploiting the observation that many workloads copy data that they never fully access. It extends the CPU memory controller to lazily track requested copies. POD-Attention targets the GPU memory bandwidth bottleneck in LLM inference, taking advantage of the complementary resource profiles of the two phases of LLM inference, prefill and decode. It fuses compute-bound prefill and memory-bound decode so that they execute concurrently: the prefill operations occupy the compute units while decode operations saturate the memory pipeline, hiding memory stalls behind concurrent compute. Invariant Bit Packing targets the PCIe interconnect bandwidth bottleneck between CPU and GPU memory that occurs in large-scale ML frameworks, taking advantage of low-entropy bit patterns present across ML tensors. It strips these bits before PCIe transfer and reconstructs the original tensors within the GPU, reducing data transfer volume without loss of information.
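To make the third idea concrete, here is a minimal CPU-only sketch of stripping bits that never change across a tensor and restoring them afterwards. It is an illustration under stated assumptions, not the Invariant Bit Packing implementation: it assumes "invariant" means a bit position that holds the same value in every 32-bit element, and the helper names (`find_invariant_mask`, `pack`, `unpack`) are hypothetical. The real system reconstructs tensors on the GPU; here both sides run in plain Python.

```python
import struct

WIDTH = 32  # assumption: elements are 32-bit words (e.g. float32 bit patterns)

def float_to_bits(x: float) -> int:
    """Reinterpret a float32 value as its 32-bit pattern."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def find_invariant_mask(words):
    """Return (mask, value): mask has a 1 wherever all words agree,
    and value holds the shared bits at those positions."""
    full = (1 << WIDTH) - 1
    all_and, all_or = full, 0
    for w in words:
        all_and &= w
        all_or |= w
    # Invariant bits are 1 in every word (all_and) or 0 in every word (~all_or).
    mask = (all_and | ~all_or) & full
    return mask, all_and & mask

def variant_positions(mask):
    """Bit positions that differ across words and therefore must be sent."""
    return [i for i in range(WIDTH) if not (mask >> i) & 1]

def pack(words, mask):
    """Keep only the variant bits of each word, packed densely into bytes."""
    pos = variant_positions(mask)
    stream, nbits = 0, 0
    for w in words:
        for p in pos:
            stream |= ((w >> p) & 1) << nbits
            nbits += 1
    return stream.to_bytes((nbits + 7) // 8, "little")

def unpack(payload, count, mask, value):
    """Rebuild the original words from the payload plus the shared bits."""
    pos = variant_positions(mask)
    stream = int.from_bytes(payload, "little")
    words, cursor = [], 0
    for _ in range(count):
        w = value  # start from the invariant bits
        for p in pos:
            w |= ((stream >> cursor) & 1) << p
            cursor += 1
        words.append(w)
    return words

# Hypothetical data: float32 values in [0.5, 1) share sign and exponent bits.
words = [float_to_bits(x) for x in (0.5, 0.75, 0.625, 0.875)]
mask, value = find_invariant_mask(words)
payload = pack(words, mask)
restored = unpack(payload, len(words), mask, value)
assert restored == words  # lossless round trip
```

In this sketch the sender would transmit `payload` together with `mask`, `value`, and the element count, so the saving depends on how many bit positions are actually invariant in the data.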

These systems achieve substantial benefits across a wide variety of workloads, including databases, inter-process communication, large language models, graph neural networks, and deep learning recommendation models.
