Large language models (LLMs) have gained significant attention for their versatility, but their veracity remains a critical concern. Studies have shown that LLMs can produce non-factual, hallucinated, or outdated information, undermining their trustworthiness. Current evaluation methods such as fact checking and fact QA face several challenges: fact checking struggles to assess the veracity of freely generated content, while fact QA is hard to scale because annotation is costly. Both approaches also risk data contamination from pre-training corpora crawled from the web. Furthermore, LLMs often respond inconsistently to the same fact when it is phrased in different ways, a challenge that existing evaluation datasets are not equipped to address.
Existing attempts to assess LLMs' knowledge primarily rely on specific datasets, but these face challenges such as data leakage, static content, and limited metrics. Knowledge graphs (KGs) offer advantages in customization, knowledge evolution, and reduced test-set leakage. Methods such as LAMA and LPAQA use KGs for assessment but struggle with unnatural question formats and are impractical for large KGs. KaRR overcomes some of these issues but remains inefficient for large graphs and lacks generalizability. Current approaches also prioritize accuracy over reliability and do not address LLMs' inconsistent responses to the same fact. Moreover, no existing work visualizes LLMs' knowledge with KGs, leaving an opportunity for improvement. These limitations highlight the need for more comprehensive and efficient methods to assess and understand LLMs' knowledge retention and accuracy.
Apple researchers presented KGLENS, an innovative knowledge investigation framework developed to measure the knowledge alignment between a KG and an LLM and to identify the LLM's knowledge blind spots. The framework employs a Thompson sampling-inspired method with a parameterized knowledge graph (PKG) to investigate LLMs efficiently. KGLENS features a graph-guided question generator that converts KG edges into natural language with GPT-4, producing two types of questions (Yes/No judgment questions and open-ended Wh questions) to reduce answer ambiguity. Human evaluation shows that 97.7% of the generated questions are sensible to annotators.
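To make the generator concrete, here is a minimal sketch of how a KG edge might be turned into a question via GPT-4; the prompt wording, the rule for choosing between Yes/No and Wh questions, and the use of the `openai` Python client are illustrative assumptions rather than the paper's released implementation.

```python
# Illustrative sketch of a graph-guided question generator (assumed prompts, not KGLENS's exact ones).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_question(subject: str, relation: str, obj: str, yes_no: bool) -> str:
    """Convert a KG edge (subject, relation, object) into a natural-language question."""
    if yes_no:
        # Judgment-style question: the correct answer is "Yes" or "No".
        instruction = (
            f"Write a Yes/No question asking whether the {relation} of "
            f"{subject} is {obj}. Return only the question."
        )
    else:
        # Generation-style question: the correct answer is the object entity.
        instruction = (
            f"Write a question asking for the {relation} of {subject}, "
            f"so that the answer is {obj}. Return only the question."
        )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": instruction}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()
```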
KGLENS investigates an LLM's knowledge efficiently using a method inspired by Thompson sampling over a PKG. The framework initializes a PKG in which each edge is augmented with a beta distribution indicating the LLM's potential deficiency on that edge. It then samples edges according to these probabilities, generates questions from the sampled edges, and examines the LLM through a question-answering task. The PKG is updated based on the results, and the process repeats until convergence. The framework's graph-guided question generator converts the sampled KG edges into natural-language questions using GPT-4, creating two types of questions: Yes/No questions for judgment and Wh questions for generation, with the question type controlled by the graph structure. Entity aliases are included to reduce ambiguity.
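A minimal sketch of this sampling-and-update loop is shown below, building on the `generate_question` sketch above; the example edges, the batch size, and the `quiz_llm` stub that stands in for querying and verifying the target model are assumptions for illustration, not the authors' code.

```python
import random

# Parameterized KG: each edge carries Beta(alpha, beta) parameters whose mean
# tracks the estimated probability that the LLM fails on that edge.
pkg = {
    ("Paris", "capital of", "France"): {"alpha": 1.0, "beta": 1.0},
    ("Ada Lovelace", "field of work", "mathematics"): {"alpha": 1.0, "beta": 1.0},
}

def quiz_llm(question: str, expected: str) -> bool:
    """Stub: query the target LLM and verify its answer (see the verifier sketch below)."""
    return random.random() < 0.5

def run_round(batch_size: int = 1) -> None:
    # 1. Thompson sampling: draw a failure probability from each edge's Beta
    #    distribution and examine the edges with the highest draws first.
    ranked = sorted(
        pkg.items(),
        key=lambda kv: random.betavariate(kv[1]["alpha"], kv[1]["beta"]),
        reverse=True,
    )
    for (subject, relation, obj), params in ranked[:batch_size]:
        # 2. Generate a question from the edge and quiz the LLM.
        question = generate_question(subject, relation, obj, yes_no=True)
        correct = quiz_llm(question, expected=obj)
        # 3. Update the edge's Beta parameters: failures raise the estimated
        #    deficiency, successes lower it. Rounds repeat until convergence.
        if correct:
            params["beta"] += 1.0
        else:
            params["alpha"] += 1.0
```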
To verify responses, KGLENS instructs the LLMs to produce specific response formats and employs GPT-4 to check the accuracy of their answers to the quiz questions. The framework's efficiency is evaluated with various sampling methods, demonstrating its effectiveness in identifying LLMs' knowledge blind spots across a range of topics and relationships.
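As a rough illustration of that verification step, a GPT-4 judge could be asked to compare the target model's free-form answer against the gold entity and its aliases; the prompt below and the yes/no parsing are assumptions made for the sketch, not the paper's exact protocol.

```python
from openai import OpenAI

client = OpenAI()

def verify_answer(question: str, model_answer: str, gold: str, aliases: list[str]) -> bool:
    """Ask a GPT-4 judge whether the target model's answer matches the gold entity."""
    prompt = (
        f"Question: {question}\n"
        f"Model answer: {model_answer}\n"
        f"Correct entity: {gold}"
        + (f" (also known as: {', '.join(aliases)})" if aliases else "")
        + "\nDoes the model answer refer to the correct entity? Answer 'yes' or 'no'."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().lower().startswith("yes")
```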
Evaluation of KGLENS on several LLMs reveals that the GPT-4 family consistently outperforms other models. GPT-4, GPT-4o, and GPT-4-turbo show comparable performance, with GPT-4o being more cautious about personal information. There is a significant gap between GPT-3.5-turbo and GPT-4; GPT-3.5-turbo sometimes performs worse than legacy models because of its conservative answering behavior. Legacy models such as Babbage-002 and Davinci-002 show only slight improvement over random guessing, highlighting the progress made by recent LLMs. The evaluation provides insights into different error types and model behaviors, showing how differently LLMs handle various knowledge domains and difficulty levels.
KGLENS presents an efficient method for evaluating factual knowledge in LLMs using a Thompson sampling-inspired approach with parameterized knowledge graphs. The framework outperforms existing methods in revealing knowledge blind spots and scales across multiple domains. Human evaluation confirms its effectiveness, with an accuracy of 95.7%. KGLENS and its assessed KGs will be made available to the research community, fostering collaboration. For businesses, the tool can help build more reliable AI systems, improving user experiences and strengthening model knowledge. KGLENS represents a significant step toward more accurate and reliable AI applications.
Check out the Paper. All credit for this research goes to the researchers of this project. Also, don't forget to follow us on Twitter and join our Telegram Channel and LinkedIn Group. If you like our work, you will love our newsletter.
Asjad is a consulting intern at Marktechpost. He is pursuing a Bachelor's in Mechanical Engineering at the Indian Institute of Technology, Kharagpur. Asjad is a Machine Learning and Deep Learning enthusiast who is always researching applications of Machine Learning in the healthcare domain.