This article was accepted at ACL 2024
Large language models (LLMs) are central to modern natural language processing, delivering exceptional performance across a wide range of tasks. However, their substantial computational and memory requirements pose challenges, especially for devices with limited DRAM capacity. This paper addresses the challenge of efficiently running LLMs that exceed the available DRAM capacity by storing the model parameters in flash memory and loading them into DRAM on demand. Our method involves constructing an inference cost model that accounts for the characteristics of flash memory, guiding us to optimize in two critical areas: reducing the volume of data transferred from flash and reading data in larger, more contiguous chunks. Within this hardware-informed framework, we introduce two principal techniques. First, “windowing” strategically reduces data transfer by reusing previously activated neurons; second, “row-column bundling,” tailored to the sequential data access strengths of flash memory, increases the size of the data chunks read from flash memory. Together, these methods enable running models up to twice the size of the available DRAM, with a 4–5x and 20–25x increase in inference speed over naive loading approaches on CPU and GPU, respectively. Our integration of sparsity awareness, context-adaptive loading, and a hardware-oriented design paves the way for efficient LLM inference on memory-constrained devices.
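As a rough illustration of the two ideas named above, the following NumPy sketch emulates flash storage with a memory-mapped file, stores each FFN neuron's up-projection row and down-projection column contiguously (row-column bundling), and keeps only the neurons active in a sliding window of recent tokens resident in a DRAM cache (windowing). This is a toy built on assumptions, not the paper's implementation: the file name `ffn_bundles.bin`, the stand-in predictor `predict_active_neurons`, the dimensions, and the eviction policy are all invented for the example.

```python
# Minimal sketch of windowing + row-column bundling (illustrative only).
import numpy as np

d_model, d_ffn, window_size = 64, 256, 4            # toy dimensions (assumed)
rng = np.random.default_rng(0)

# Row-column bundling: for FFN neuron i, store the i-th row of the up-projection
# and the i-th column of the down-projection contiguously, so a single
# sequential read from "flash" fetches both halves of that neuron.
up = rng.standard_normal((d_ffn, d_model)).astype(np.float32)
down = rng.standard_normal((d_model, d_ffn)).astype(np.float32)
bundles = np.concatenate([up, down.T], axis=1)      # shape (d_ffn, 2 * d_model)
bundles.tofile("ffn_bundles.bin")                   # stands in for flash storage
flash = np.memmap("ffn_bundles.bin", dtype=np.float32,
                  shape=(d_ffn, 2 * d_model), mode="r")

dram_cache = {}                                     # neuron id -> bundle held in DRAM
recent_active = []                                  # sliding window of active-neuron sets

def predict_active_neurons(x):
    """Stand-in for a sparsity predictor: pick the neurons assumed to fire."""
    return set(np.argsort(up @ x)[-32:].tolist())

def ffn_forward(x):
    active = predict_active_neurons(x)

    # Windowing: only neurons not already cached for the recent window are
    # transferred from flash; the rest are reused directly from DRAM.
    for i in sorted(active):
        if i not in dram_cache:
            dram_cache[i] = np.array(flash[i])      # one contiguous bundle read

    # Evict neurons that fall outside the sliding window of recent tokens.
    recent_active.append(active)
    if len(recent_active) > window_size:
        stale = recent_active.pop(0)
        still_needed = set().union(*recent_active)
        for i in stale - still_needed:
            dram_cache.pop(i, None)

    # Sparse FFN: compute with only the cached (predicted-active) neurons.
    y = np.zeros(d_model, dtype=np.float32)
    for i in active:
        row, col = dram_cache[i][:d_model], dram_cache[i][d_model:]
        y += col * max(row @ x, 0.0)                # ReLU gating on the up-projection
    return y

print(ffn_forward(rng.standard_normal(d_model).astype(np.float32)).shape)
```

In this toy, the DRAM cache only ever holds the bundles needed for the last few tokens, and each cache miss costs one contiguous read of a bundled row-column pair rather than two scattered reads, which is the access pattern the abstract argues flash memory rewards.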