Reducing the GPU Memory Bottleneck with Lossless Compression for ML
Aditya K Kamath, Arvind Krishnamurthy, Marco Canini, Simon Peter. Published in 21st ACM European Conference on Computer Systems (EuroSys), 2026.
In this paper, we examine the use of compressed data and on-the-fly decompression to feed ML models that must transfer data from CPU memory to the GPU during execution. We propose Invariant Bit Packing (IBP), a lossless compression scheme whose lightweight decompression reduces PCIe transfer latency. IBP achieves, on average, 74% faster GNN training, 180% faster DLRM embedding lookup, and 25% faster LLM inference. Read more...
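To give a flavor of the general idea (this is an illustrative sketch of generic bit packing, not the paper's actual IBP algorithm): when all values in a block share the same high-order bits, those invariant bits can be stored once, and only the few varying low-order bits need to be packed and transferred.

```python
import numpy as np

# Hypothetical sketch of bit packing over a block of uint32 values
# whose high-order bits are identical ("invariant"). Not the paper's
# actual IBP implementation.

def pack(values: np.ndarray):
    """Pack uint32 values sharing a common high-bit prefix."""
    xor = np.bitwise_or.reduce(values ^ values[0])
    width = int(xor).bit_length() or 1        # bits that actually vary
    base = (int(values[0]) >> width) << width  # shared invariant prefix
    mask = (1 << width) - 1
    # Concatenate the varying low bits of all values into one integer.
    packed = 0
    for v in values:
        packed = (packed << width) | (int(v) & mask)
    return base, width, packed, len(values)

def unpack(base, width, packed, n):
    """Reverse pack(): recover the original uint32 array losslessly."""
    mask = (1 << width) - 1
    out = np.empty(n, dtype=np.uint32)
    for i in range(n - 1, -1, -1):
        out[i] = base | (packed & mask)
        packed >>= width
    return out
```

For a block of four values like `[1000, 1003, 1001, 1002]`, only 2 bits per value vary, so the payload shrinks from 128 bits to 8 bits plus a small per-block header, and decompression is a mask-and-or per value.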
