Large language models (LLMs) have transformed natural language processing, but face significant challenges in widespread deployment due to their high runtime cost. In this paper, we present SeedLM, a novel post-training compression method that uses seeds of a pseudo-random generator to encode and compress model weights. Specifically, for each block of weights, we find a seed that is fed into a Linear Feedback Shift Register (LFSR) during inference to efficiently generate a random matrix. This matrix is then linearly combined with compressed coefficients to reconstruct the weight block. SeedLM reduces memory access and takes advantage of idle compute cycles during inference, effectively speeding up memory-bound tasks by trading compute for fewer memory accesses. Unlike state-of-the-art methods that depend on calibration data, our approach is data-free and generalizes well across diverse tasks. Our experiments with Llama 3 70B, which is particularly challenging to compress, show zero-shot accuracy retention at 4- and 3-bit compression that is on par with or better than state-of-the-art methods, while maintaining performance comparable to FP16 baselines. In addition, FPGA-based tests show that 4-bit SeedLM, as model size increases, approaches a 4x speed-up over an FP16 Llama 2/3 baseline.
† Meta
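To make the reconstruction step concrete, below is a minimal sketch of the idea described in the abstract: a seed drives an LFSR to expand into a pseudo-random matrix, which is linearly combined with a few stored coefficients to approximate a weight block. The 16-bit register width and tap positions, the ±1 matrix entries, the block size, the number of coefficients, the least-squares fit, and the exhaustive seed search are all illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def lfsr_bits(seed: int, n_bits: int, taps=(16, 14, 13, 11), width: int = 16):
    """Generate a pseudo-random bit stream from a Fibonacci LFSR.

    The register width and tap positions are illustrative choices
    (a maximal-length 16-bit configuration), not the paper's exact setup.
    """
    state = seed & ((1 << width) - 1)
    assert state != 0, "LFSR seed must be non-zero"
    out = np.empty(n_bits, dtype=np.uint8)
    for i in range(n_bits):
        # XOR the tapped bits to form the feedback bit.
        fb = 0
        for t in taps:
            fb ^= (state >> (t - 1)) & 1
        out[i] = state & 1                     # emit the low bit
        state = (state >> 1) | (fb << (width - 1))
    return out

def random_matrix(seed: int, rows: int, cols: int) -> np.ndarray:
    """Expand a seed into a {-1, +1} pseudo-random matrix, one sign bit per entry."""
    bits = lfsr_bits(seed, rows * cols)
    return (2.0 * bits - 1.0).reshape(rows, cols)

def compress_block(w: np.ndarray, seeds, k: int):
    """For each candidate seed, least-squares fit k coefficients so that
    U @ c approximates the flattened weight block; keep the best seed."""
    w_flat = w.reshape(-1, 1)
    best = None
    for s in seeds:
        U = random_matrix(s, w_flat.shape[0], k)
        c, *_ = np.linalg.lstsq(U, w_flat, rcond=None)
        err = np.linalg.norm(w_flat - U @ c)
        if best is None or err < best[0]:
            best = (err, s, c)
    return best[1], best[2]                    # (seed, coefficients)

def decompress_block(seed: int, coeffs: np.ndarray, shape) -> np.ndarray:
    """Regenerate the random matrix from the seed and recombine it with the coefficients."""
    U = random_matrix(seed, int(np.prod(shape)), coeffs.shape[0])
    return (U @ coeffs).reshape(shape)

# Example: approximate an 8x8 weight block with 3 coefficients over 256 candidate seeds.
rng = np.random.default_rng(0)
block = rng.standard_normal((8, 8)).astype(np.float32)
seed, coeffs = compress_block(block, seeds=range(1, 257), k=3)
approx = decompress_block(seed, coeffs, block.shape)
print("relative error:", np.linalg.norm(block - approx) / np.linalg.norm(block))
```

At inference time, only the seed and the small coefficient vector per block need to be read from memory; the random matrix is regenerated on the fly, which is what lets the method trade otherwise idle compute cycles for reduced memory traffic.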