This paper was accepted to the Efficient Natural Language and Speech Processing (ENLSP) Workshop at NeurIPS 2024.
Tensor parallelism provides an effective way to increase server large language model (LLM) inference efficiency despite adding an additional communication cost. However, as server LLMs continue to scale in size, they will need to be distributed across more devices, magnifying the communication cost. One way to approach this problem is with quantization, but current methods for LLMs tend to avoid quantizing the features that tensor parallelism needs to communicate. Taking advantage of consistent outliers in the communicated features, we introduce a quantization method that reduces communicated values from 16 bits to 4.2 bits on average while preserving nearly all of the original performance. For instance, our method maintains around 98.0% and 99.5% of the original performance of Gemma 2 27B and Llama 2 13B, respectively, averaged over all the tasks we evaluate.
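To make the general idea concrete, the sketch below illustrates one common form of outlier-aware low-bit quantization, not the authors' specific method: the bulk of an activation vector is quantized to 4 bits while a small fraction of large-magnitude outlier entries is kept at 16 bits (plus an index), so the average cost per communicated value lands slightly above 4 bits. The function name `quantize_with_outliers`, the 1% outlier fraction, and the bit accounting are illustrative assumptions only.

```python
import numpy as np

def quantize_with_outliers(x, outlier_frac=0.01, bits=4):
    """Quantize a feature vector to `bits` bits, keeping the largest-magnitude
    `outlier_frac` fraction of entries in full precision.

    Returns the dequantized vector and the average bits spent per value
    (low-bit payload + 16-bit outliers + index overhead)."""
    x = x.astype(np.float32)
    n = x.size
    k = max(1, int(round(outlier_frac * n)))

    # Mark outlier positions by magnitude; these are sent unquantized.
    outlier_idx = np.argpartition(np.abs(x), -k)[-k:]
    mask = np.zeros(n, dtype=bool)
    mask[outlier_idx] = True

    # Symmetric uniform quantization of the remaining (inlier) values.
    inliers = x[~mask]
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(inliers)) / qmax if inliers.size else 1.0
    q = np.clip(np.round(inliers / scale), -qmax - 1, qmax)

    # Reconstruct the full vector: dequantized inliers plus exact outliers.
    out = np.empty_like(x)
    out[~mask] = q * scale
    out[mask] = x[mask]

    # Rough bit accounting: inliers at `bits` bits, outliers at 16 bits plus an index.
    index_bits = int(np.ceil(np.log2(n)))
    avg_bits = ((n - k) * bits + k * (16 + index_bits)) / n
    return out, avg_bits


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Heavy-tailed activations: mostly small values with a few large outliers.
    x = rng.standard_normal(4096)
    x[rng.choice(4096, 8, replace=False)] *= 50
    xq, avg_bits = quantize_with_outliers(x, outlier_frac=0.01, bits=4)
    print(f"avg bits/value: {avg_bits:.2f}, max abs error: {np.abs(x - xq).max():.4f}")
```

With roughly 1% of entries kept at full precision, the average cost in this toy setup works out to a little over 4 bits per value, which is the kind of accounting behind a figure such as 4.2 bits; the paper's exact scheme and bit budget may differ.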