This article was accepted at ACL 2024
Large language models (LLMs) are central to modern natural language processing, delivering exceptional performance across a wide range of tasks. However, their substantial computational and memory requirements pose challenges, especially for devices with limited DRAM capacity. This paper addresses the challenge of efficiently running LLMs that exceed the available DRAM capacity by storing the model parameters in flash memory and loading them into DRAM on demand. Our method involves constructing an inference cost model that accounts for the characteristics of flash memory, guiding us to optimize in two critical areas: reducing the volume of data transferred from flash and reading data in larger, more contiguous chunks. Within this hardware-informed framework, we introduce two principal techniques. First, “windowing” strategically reduces data transfer by reusing previously activated neurons; second, “row-column bundling,” tailored to the sequential data access strengths of flash memory, increases the size of the data chunks read from flash memory. Together, these methods enable running models up to twice the size of the available DRAM, with a 4–5x and 20–25x increase in inference speed over naive loading approaches on CPU and GPU, respectively. Our integration of sparsity awareness, context-adaptive loading, and a hardware-oriented design paves the way for efficient LLM inference on memory-constrained devices.
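As a rough illustration of the two ideas named above, the following NumPy sketch emulates flash storage with a memory-mapped file, stores each FFN neuron's up-projection row and down-projection column contiguously (row-column bundling), and keeps only the neurons active in a sliding window of recent tokens resident in a DRAM cache (windowing). This is a toy built on assumptions, not the paper's implementation: the file name `ffn_bundles.bin`, the stand-in predictor `predict_active_neurons`, the dimensions, and the eviction policy are all invented for the example.

```python
# Minimal sketch of windowing + row-column bundling (illustrative only).
import numpy as np

d_model, d_ffn, window_size = 64, 256, 4            # toy dimensions (assumed)
rng = np.random.default_rng(0)

# Row-column bundling: for FFN neuron i, store the i-th row of the up-projection
# and the i-th column of the down-projection contiguously, so a single
# sequential read from "flash" fetches both halves of that neuron.
up = rng.standard_normal((d_ffn, d_model)).astype(np.float32)
down = rng.standard_normal((d_model, d_ffn)).astype(np.float32)
bundles = np.concatenate([up, down.T], axis=1)      # shape (d_ffn, 2 * d_model)
bundles.tofile("ffn_bundles.bin")                   # stands in for flash storage
flash = np.memmap("ffn_bundles.bin", dtype=np.float32,
                  shape=(d_ffn, 2 * d_model), mode="r")

dram_cache = {}                                     # neuron id -> bundle held in DRAM
recent_active = []                                  # sliding window of active-neuron sets

def predict_active_neurons(x):
    """Stand-in for a sparsity predictor: pick the neurons assumed to fire."""
    return set(np.argsort(up @ x)[-32:].tolist())

def ffn_forward(x):
    active = predict_active_neurons(x)

    # Windowing: only neurons not already cached for the recent window are
    # transferred from flash; the rest are reused directly from DRAM.
    for i in sorted(active):
        if i not in dram_cache:
            dram_cache[i] = np.array(flash[i])      # one contiguous bundle read

    # Evict neurons that fall outside the sliding window of recent tokens.
    recent_active.append(active)
    if len(recent_active) > window_size:
        stale = recent_active.pop(0)
        still_needed = set().union(*recent_active)
        for i in stale - still_needed:
            dram_cache.pop(i, None)

    # Sparse FFN: compute with only the cached (predicted-active) neurons.
    y = np.zeros(d_model, dtype=np.float32)
    for i in active:
        row, col = dram_cache[i][:d_model], dram_cache[i][d_model:]
        y += col * max(row @ x, 0.0)                # ReLU gating on the up-projection
    return y

print(ffn_forward(rng.standard_normal(d_model).astype(np.float32)).shape)
```

In this toy, the DRAM cache only ever holds the bundles needed for the last few tokens, and each cache miss costs one contiguous read of a bundled row-column pair rather than two scattered reads, which is the access pattern the abstract argues flash memory rewards.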