Methods such as Chain-of-Thought (CoT) prompting have improved reasoning by breaking complex problems into sequential sub-steps. More recent advances, such as o1-like thinking modes, introduce capabilities including trial-and-error, backtracking, correction, and iteration to improve model performance on difficult problems. However, these improvements come with substantial computational costs. The increased token generation creates significant memory overhead due to the limitations of the Transformer architecture, where the complexity of the attention mechanism grows quadratically with context length while KV cache storage grows linearly. For example, when Qwen32B's context length reaches 10,000 tokens, the KV cache consumes memory comparable to the model itself.
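To see why long reasoning traces become a memory problem, the sketch below gives a rough KV-cache size estimate. The layer count, head count, and head dimension are illustrative assumptions, not Qwen32B's actual configuration.

```python
# Back-of-the-envelope KV-cache estimate (a hedged sketch, not the model's real config).
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # Each layer stores a key tensor and a value tensor of shape [seq_len, num_kv_heads, head_dim].
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem

# Illustrative numbers only: 64 layers, 40 KV heads of dim 128, fp16, 10,000 cached tokens.
print(kv_cache_bytes(64, 40, 128, 10_000) / 1e9, "GB")
```

The cache grows linearly with the number of cached tokens, so every extra reasoning step adds memory that must be kept for the rest of the generation.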
Current approaches to accelerating LLM inference fall into three main categories: model quantization, generating fewer tokens, and reducing the KV cache. Model quantization covers both parameter quantization and KV cache quantization. Within the KV cache reduction category, pruning-based selection in a discrete space and merging-based compression in a continuous space emerge as the key strategies. Pruning-based strategies apply eviction policies to retain only the important tokens during inference, while merging-based strategies introduce anchor tokens into which historically important information is compressed. The key difference is that pruning-based methods are training-free but must apply the eviction policy for every generated token, whereas merging-based methods require model training.
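As a rough illustration of the pruning-based family (in the spirit of H2O, not its exact algorithm), the sketch below keeps only the cached tokens with the highest accumulated attention mass. The function name and tensor shapes are assumptions made for illustration.

```python
import torch

def evict_kv(keys, values, attn_scores, budget):
    """Toy pruning-style eviction: keep the `budget` cached tokens that have
    received the most accumulated attention, drop the rest.
    keys/values: [num_heads, seq_len, head_dim]; attn_scores: [seq_len]."""
    k = min(budget, attn_scores.numel())
    keep = torch.topk(attn_scores, k=k).indices.sort().values  # preserve original order
    return keys[:, keep], values[:, keep]
```

Because the policy runs on every decoding step, it is training-free but adds per-token overhead, which is the trade-off discussed above.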
Researchers from Zhejiang University, Ant Group, and the Zhejiang University – Ant Group Joint Laboratory have proposed LightThinker, which allows LLMs to dynamically compress intermediate thoughts during reasoning. Inspired by human cognition, LightThinker compresses verbose reasoning steps into compact representations and discards the original reasoning chains, significantly reducing the number of tokens stored in the context window. The researchers also introduce the Dependency (Dep) metric to quantify compression effectiveness by measuring how much generation depends on historical tokens. In addition, LightThinker reduces peak memory usage and inference time while maintaining competitive accuracy, offering a promising direction for improving LLM efficiency in complex reasoning tasks.
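A minimal sketch of how a Dependency-style score could be tallied; the exact formulation is defined in the paper, and the helper below only captures the intuition of counting how many historical tokens each generation step can attend to.

```python
def dependency(context_lengths):
    """Hedged Dep-style count: sum, over every generated token, the number of
    preceding tokens it is allowed to attend to. `context_lengths[i]` is the
    visible-context length at generation step i."""
    return sum(context_lengths)

# Example: standard CoT keeps the full history visible at each step, while a
# compressed run exposes far fewer tokens, so its dependency count is much lower.
print(dependency([100, 101, 102]), "vs", dependency([20, 21, 22]))
```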
The LightThinker approach is evaluated on the Qwen2.5-7B and Llama3-8B models. The researchers performed full-parameter instruction tuning on the Bespoke-Stratos-17k dataset, with the resulting model designated Vanilla. Five baselines were implemented for comparison: two training-free acceleration methods (H2O and SepLLM), one training-based method (AnLLM), and CoT prompting applied to both the instruct models and R1-Distill. Evaluation spans four datasets (GSM8K, MMLU, GPQA, and BBH), measuring both effectiveness and efficiency (via inference time, peak token count, and the Dependency metric). The implementation offers two compression granularities: token-level compression (converting every 6 tokens into 2) and thought-level compression (using "\n\n" as a delimiter to segment thoughts), sketched below.
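The snippet below sketches how the two granularities might carve up a reasoning trace during data preparation. The `<c>` placeholder and the helper names are hypothetical; the actual compression into compact cache representations is learned by the model during training, not performed by this code.

```python
def insert_compression_slots(tokens, chunk=6, num_cache=2, cache_tok="<c>"):
    # Token-level variant: after every `chunk` content tokens, reserve `num_cache`
    # placeholder tokens whose learned states would stand in for the chunk's KV entries.
    out = []
    for i in range(0, len(tokens), chunk):
        out.extend(tokens[i:i + chunk])
        out.extend([cache_tok] * num_cache)
    return out

def segment_thoughts(reasoning_text, delimiter="\n\n"):
    # Thought-level variant: a "thought" is whatever sits between delimiters.
    return [t for t in reasoning_text.split(delimiter) if t.strip()]
```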
The evaluation results across the four metrics, both models, and all datasets reveal several significant findings. Distill-R1 consistently underperforms CoT across the datasets, with the performance gap attributed to repetition issues caused by greedy decoding. H2O largely preserves model performance while reducing memory usage, validating its greedy eviction policy for long-text generation. However, H2O substantially increases inference time (by 51% for Qwen and 72% for Llama) because its eviction policy adds overhead for every generated token. In addition, LightThinker matches H2O's performance at similar compression rates while reducing inference time, with a 52% reduction for Qwen and 41% for Llama.

In this paper, the researchers introduced LightThinker, a novel approach for improving LLM efficiency in complex reasoning tasks through dynamic compression of intermediate thoughts during generation. By training models to learn when and how to compress verbose reasoning steps into compact representations, LightThinker significantly reduces memory overhead and computational cost while maintaining competitive accuracy. However, several limitations remain: compatibility with parameter-efficient fine-tuning methods such as LoRA or QLoRA is unexplored, the potential benefits of larger training datasets are unknown, and performance degradation is noticeable in Llama-series models trained on small datasets with next-token prediction.
Check out the Paper. All credit for this research goes to the researchers of this project.

Sajad Ansari is a final-year undergraduate at IIT Kharagpur. As a technology enthusiast, he delves into the practical applications of AI, with a focus on understanding the impact of AI technologies and their real-world implications. He aims to articulate complex AI concepts in a clear and accessible manner.