Large language models (LLMs) have become essential tools for tasks such as question answering (QA) and text summarization. These models excel at processing long and complex texts, with context windows exceeding 100,000 tokens. As LLMs take on more of these context-intensive tasks, ensuring their reliability and accuracy becomes increasingly urgent. Users rely on LLMs to sift through large amounts of information and provide concise, correct answers. However, many models suffer from the “hallucination” problem, generating information that is not supported by the provided text. This limitation significantly undermines user trust, as the absence of specific, verifiable citations makes it difficult to confirm the accuracy of the answers.
A major challenge for long-context LLMs is their inability to provide detailed citations linked directly to specific parts of the text. Users often struggle to trust LLM-generated answers because the models either provide no citations at all or offer citations that refer broadly to entire sections of the text rather than pointing to the exact passages that support the answer. This lack of specificity means that even when the answer is accurate, the user must manually search through large chunks of text to verify it. A system that offers accurate sentence-level citations is therefore crucial to improving the verifiability and reliability of long-context LLMs.
Existing citation methods, while somewhat effective, still have limitations. Some models employ chunk-level citation techniques, where broad sections of text are referenced. While useful for reducing the number of searches users must perform, these chunk-based methods are not fine-grained enough for precise verification. Other approaches include retrieval-augmented generation (RAG) and post-processing systems, where citations are added after the answer is generated. However, their multi-step pipelines tend to degrade answer quality and slow response times. In addition, the citations these systems produce are often too broad, making them ineffective for users trying to locate specific supporting information within large documents.
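To make that coarseness concrete, here is a minimal sketch of post-hoc, chunk-level citation attribution of the kind such pipelines use. It assumes the sentence-transformers library and cosine similarity as a relevance proxy; the chunk size, model name, and function are illustrative, not taken from the paper.

```python
# Minimal sketch of post-hoc, chunk-level citation attribution.
# Assumptions (not from the paper): sentence-transformers is available
# and embedding cosine similarity is a good-enough relevance proxy.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def cite_post_hoc(answer_sentences, document, chunk_size=128):
    # Split the source document into coarse, fixed-size word chunks.
    words = document.split()
    chunks = [" ".join(words[i:i + chunk_size])
              for i in range(0, len(words), chunk_size)]
    chunk_emb = model.encode(chunks, convert_to_tensor=True)

    citations = []
    for sent in answer_sentences:
        sent_emb = model.encode(sent, convert_to_tensor=True)
        scores = util.cos_sim(sent_emb, chunk_emb)[0]
        best = int(scores.argmax())
        # The whole chunk is cited, even if only one sentence
        # inside it actually supports the answer -- this is why
        # post-hoc citations end up too broad to verify quickly.
        citations.append((sent, best, chunks[best]))
    return citations
```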
Researchers from Tsinghua University and Zhipu AI presented a novel approach to address these limitations, a method called CoF (Coarse to Fine). CoF is designed to generate highly detailed sentence-level citations, improving the accuracy and usability of LLM-generated responses. The research team proposed this system as a solution to the problem of broad and imprecise citations, offering a refined approach that links citations to specific sentences rather than lengthy sections of text. To evaluate the performance of LLMs on long-context question answering with citations (LQAC), they also developed LongBench-Cite, an automated benchmark that measures how well LLMs generate citations from large text corpora. LongBench-Cite revealed significant room for improvement in current models, as many of the citations generated by LLMs were irrelevant or applied too broadly. To test the effectiveness of the new approach, the team built LongCite-45k, a dataset consisting of 44,600 QA pairs with detailed and accurate citations. This dataset enables LLMs to train on tasks requiring precise citations, addressing a critical gap in current long-context QA models.
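The article does not show the dataset's exact format, but a sentence-level-cited QA pair in a dataset like LongCite-45k could plausibly look like the following; all field names here are hypothetical illustrations, not the actual schema.

```python
# Hypothetical layout of one LongCite-45k-style training record.
# Field names are assumptions for illustration, not the real schema.
record = {
    "context": "…the full long document, with sentences numbered…",
    "question": "When was the system first deployed?",
    "answer": [
        {
            "statement": "It was first deployed in 2019.",
            # Indices of the exact context sentences supporting this
            # statement -- sentence-level rather than section-level.
            "cited_sentence_ids": [412, 413],
        },
    ],
}
```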
The CoF system works through a sequence of stages designed to refine citation accuracy, sketched in code below. The process begins with the LLM generating a query and its corresponding response from the long input text. This initial step ensures the model operates with a fully contextualized understanding of the document. Next, CoF retrieves relevant fragments of 128 tokens each from the original document and links them to the response as coarse-grained citations. Finally, the system refines these citations by identifying and extracting the specific sentences within each fragment that directly support the response, and responses lacking sufficient citation support are filtered out. This multi-stage approach allows CoF to produce responses with accurate sentence-level citations, significantly improving citation accuracy and user confidence.
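A condensed sketch of those stages, assuming injected callables: llm_generate, retrieve_fragments, and extract_supporting_sentences stand in for the paper's actual prompt-based steps, and the filtering threshold is an assumption.

```python
FRAGMENT_TOKENS = 128  # coarse retrieval granularity from the paper

def cof_pipeline(document, llm_generate, retrieve_fragments,
                 extract_supporting_sentences, min_support=1):
    # Stage 1: generate a query and answer from the full document,
    # so the model works with complete context.
    query = llm_generate(f"Ask a question answerable from:\n{document}")
    answer = llm_generate(f"{document}\n\nQuestion: {query}\nAnswer:")

    # Stage 2: coarse-grained citation -- link the answer to
    # 128-token fragments retrieved from the original document.
    fragments = retrieve_fragments(document, answer, size=FRAGMENT_TOKENS)

    # Stage 3: fine-grained refinement -- keep only the sentences
    # inside each fragment that directly support the answer.
    citations = []
    for frag in fragments:
        citations.extend(extract_supporting_sentences(frag, answer))

    # Stage 4: filter out responses without enough citation support.
    if len(citations) < min_support:
        return None
    return {"query": query, "answer": answer, "citations": citations}
```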
The research demonstrates that the CoF-trained models, LongCite-8B and LongCite-9B, outperform existing proprietary models such as GPT-4 in citation quality and granularity. Specifically, LongCite-8B and LongCite-9B achieved 6.4% and 3.6% improvements over GPT-4 in citation F1 score, a metric measuring citation accuracy. The average citation length of the LongCite models was also noticeably shorter than that of the proprietary models, highlighting the granularity of the CoF approach: LongCite-8B, for example, generated citations averaging 86 tokens, compared to GPT-4's average of 169 tokens. This granularity makes it easier for users to locate the specific text supporting the model's answers. The CoF system also reduces hallucinations by encouraging models to use all available context more consistently, ensuring that responses are better grounded in the original text.
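Citation F1 is, presumably, the usual harmonic mean of citation precision and recall. The sketch below scores exact set matches over cited sentence IDs for clarity; the benchmark itself judges whether citations support a statement with a model rather than by exact matching.

```python
def citation_f1(predicted, gold):
    """F1 over cited sentence IDs: precision is the fraction of
    predicted citations that are correct, recall the fraction of
    gold citations that were recovered."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)  # true positives
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

# Citing sentences {3, 4, 9} when {4, 9, 10, 11} are correct:
# precision = 2/3, recall = 2/4, so F1 ≈ 0.571.
print(citation_f1({3, 4, 9}, {4, 9, 10, 11}))
```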
In conclusion, this research represents a fundamental advance in long-context LLMs by addressing a long-standing problem with citation accuracy. The introduction of LongBench-Cite to assess citation performance, combined with the CoF pipeline and the LongCite-45k dataset, marks a significant step toward more reliable and verifiable LLM-generated answers. By focusing on sentence-level citations rather than broad chunks of text, the researchers have enabled LLMs to produce more accurate and trustworthy answers. The improvements in the LongCite-8B and LongCite-9B models demonstrate the effectiveness of this approach, as these models outperform even the most advanced proprietary systems in citation accuracy. This advancement improves the performance of long-context QA systems and contributes to the broader goal of making LLMs more reliable tools for information retrieval and question answering.
Take a look at the Paper. All credit for this research goes to the researchers of this project.
Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary engineer and entrepreneur, Asif is committed to harnessing the potential of AI for social good. His most recent initiative is the launch of an AI media platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is technically sound and easily understandable to a wide audience. The platform has over 2 million monthly views, illustrating its popularity among readers.