Pretrained on trillion-token corpora, large language models (LLMs) have made notable gains in performance (Touvron et al., 2023a; Geng & Liu, 2023). However, whether such data scale also benefits traditional n-gram language models (LMs) remains underexplored. This paper from the University of Washington and the Allen Institute for AI examines the relevance of n-gram LMs in the era of neural LLMs and introduces innovations to modernize them.
The authors affirm the continued usefulness of n-gram LMs, both for text analysis and for improving neural LLMs. To that end, they modernize traditional n-gram LMs by scaling the training data to an unprecedented 1.4 trillion tokens, rivaling the size of major open-source text corpora (Together, 2023; Soldaini et al., 2023); this is the largest n-gram LM to date. Moving beyond the historical restriction to small n (e.g., n ≤ 5), the authors highlight the advantages of larger n. Figure 1 illustrates the improved predictive ability of n-gram LMs as n grows, challenging the conventional limit. They therefore introduce the ∞-gram LM, with unbounded n, which uses a backoff variant (Jurafsky & Martin, 2000) to improve accuracy.
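To make the backoff idea concrete, here is a minimal Python sketch (the names are ours, not the paper's API): the ∞-gram probability of a token is the count ratio taken at the longest suffix of the context that still occurs in the corpus. The naive `count_occurrences` scan stands in for the suffix-array lookups described next.

```python
def count_occurrences(corpus, pattern):
    # Naive linear scan over a token list; the real engine answers this
    # query with suffix-array binary search instead.
    n, m = len(corpus), len(pattern)
    return sum(corpus[i:i + m] == pattern for i in range(n - m + 1))

def infinigram_prob(corpus, context, token):
    # Back off to the longest suffix of `context` that occurs in the
    # corpus (the "effective n"), then use a plain count ratio.
    for start in range(len(context)):
        suffix = context[start:]
        denom = count_occurrences(corpus, suffix)
        if denom > 0:
            return count_occurrences(corpus, suffix + [token]) / denom
    # Empty context: fall back to unigram frequency.
    return count_occurrences(corpus, [token]) / len(corpus)

corpus = "the cat sat on the mat the cat sat".split()
print(infinigram_prob(corpus, "on the".split(), "mat"))        # 1.0
print(infinigram_prob(corpus, "big shiny the".split(), "cat"))  # backs off to "the": 2/3
```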
The ∞-gram LM is powered by a suffix array, which replaces impractically large n-gram count tables. This implementation, called the infini-gram engine, is remarkably efficient, requiring only 7 bytes of storage per token. The suffix array over 1.4 trillion tokens was built on an 80-core CPU node in less than three days, and it serves low-latency, resource-frugal queries, answering n-gram counts in under 20 milliseconds. The engine also makes disk-resident indexes an integral part of inference.
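As a toy illustration of why a suffix array suffices (in-memory and word-level here; the actual engine indexes token IDs on disk at trillion-token scale): all suffixes beginning with a given n-gram occupy one contiguous block of the sorted array, so counting reduces to two binary searches. The `key` argument to `bisect` requires Python 3.10+.

```python
import bisect

def build_suffix_array(tokens):
    # Start positions of all suffixes, sorted lexicographically.
    # Quadratic for clarity; large-scale builds use specialized algorithms.
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])

def ngram_count(tokens, sa, pattern):
    # Every suffix starting with `pattern` is contiguous in `sa`, so two
    # binary searches bound the block; its width is the occurrence count.
    key = lambda i: tokens[i:i + len(pattern)]
    lo = bisect.bisect_left(sa, pattern, key=key)
    hi = bisect.bisect_right(sa, pattern, key=key)
    return hi - lo  # occurrence positions, if needed, are sa[lo:hi]

corpus = "the cat sat on the mat the cat sat".split()
sa = build_suffix_array(corpus)
print(ngram_count(corpus, sa, ["the", "cat"]))  # 2
```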
The ∞-gram LM, a conceptual extension of the n-gram LM, uses backoff to improve predictive accuracy. Because ∞-gram estimates are sparse, they are interpolated with neural LMs, which substantially reduces perplexity. The paper details the query types supported by infini-gram and reports impressive latency benchmarks in Table 1.
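A rough sketch of the interpolation step, with all names ours: the fixed mixing weight `lam` is a simplification we introduce for illustration, whereas the paper tunes how much weight the sparse ∞-gram estimate receives.

```python
def interpolate(p_neural, p_infgram, lam=0.5):
    # Linearly mix two next-token distributions (dicts over the vocabulary).
    # `lam` is a hypothetical hyperparameter, not a value from the paper.
    return {tok: lam * p_infgram.get(tok, 0.0) + (1 - lam) * p
            for tok, p in p_neural.items()}
```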
Starting from the suffix-array implementation, the article describes efficient methods for n-gram counting, occurrence-position retrieval, and document identification. Sharding reduces latency roughly in proportion to the number of shards, and optimizations such as reusing earlier search results and performing the binary search directly on disk further speed up ∞-gram computation.
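Sharding can be sketched on top of the `ngram_count` helper above (structure and names here are ours): each shard carries its own suffix array, shards are queried in parallel, and per-shard counts are summed, so wall-clock latency tracks one shard's binary search rather than the full corpus.

```python
from concurrent.futures import ThreadPoolExecutor

def sharded_count(shards, pattern):
    # `shards` is a list of (tokens, suffix_array) pairs; counts are
    # additive across shards, so aggregation is a simple sum.
    with ThreadPoolExecutor() as pool:
        return sum(pool.map(lambda s: ngram_count(s[0], s[1], pattern),
                            shards))
```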
Applying infini-gram to various neural LMs, including GPT-2, GPT-Neo, LLaMA-2, and SILO, yields consistent perplexity improvements (Table 2). The article highlights the importance of data diversity and shows that ∞-gram effectively complements neural LMs across different model families.
∞-gram analyses also shed light on human-written and machine-generated text. Notably, ∞-gram achieves high accuracy in predicting the next token given prefixes of human-written documents. The article establishes a positive correlation between neural LMs' probabilities and ∞-gram's estimates, suggesting that ∞-gram can help improve LM performance on human-written text.
The article concludes with a forward-looking perspective and presents preliminary applications of the infini-gram engine, ranging from understanding text corpora to mitigating copyright infringement. The authors anticipate further insightful analyses and innovative applications powered by infini-gram.
Vineet Kumar is a Consulting Intern at MarktechPost. He is currently pursuing his bachelor's degree from the Indian Institute of Technology (IIT), Kanpur. He is a machine learning enthusiast, passionate about research and the latest advances in Deep Learning, Computer Vision, and related fields.