Large Language Models (LLMs) have proven to be among the most impactful innovations in artificial intelligence. From BERT, PaLM, and GPT to LLaMA and DALL-E, these models have shown remarkable performance in understanding and generating human-like language. They are continually improved based on updated data, user feedback, and design modifications. However, there is still uncertainty about how often GPT-3.5 and GPT-4 receive updates, which makes it difficult to integrate these LLMs into broader workflows.
Such instability can break downstream pipelines if an LLM's behavior in response to a prompt, such as the correctness or formatting of its answers, changes abruptly. This unpredictability makes it hard for developers and users to trust that results will stay consistent, which limits the stable integration of LLMs into existing systems and workflows. To study how the behaviors of different Large Language Models (LLMs) change over time, a team of researchers from Stanford University and UC Berkeley evaluated the behavior of the March 2023 and June 2023 versions of GPT-3.5 and GPT-4.
The changes are quantified along three crucial dimensions: which LLM services to monitor, which application scenarios to focus on, and which metrics to use to measure LLM deviation in each scenario. The LLM services monitored in this study are GPT-4 and GPT-3.5, the core components of ChatGPT. Given ChatGPT's popularity and its adoption by both corporations and individuals, systematic and timely monitoring of these two services can help users better understand and use LLMs for their particular use cases.
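In practice, monitoring of this kind comes down to sending the same prompts to fixed model snapshots at different points in time and comparing the replies. Below is a minimal sketch of what that could look like against the OpenAI API; the snapshot identifiers, the probe prompt, and the decoding settings are illustrative assumptions, not details taken from the paper.

```python
# Illustrative sketch only: query pinned March/June snapshots of GPT-3.5 and
# GPT-4 with the same prompt so their replies can be compared over time.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Assumed snapshot identifiers for the two dates being compared.
SNAPSHOTS = {
    "GPT-4 (March 2023)": "gpt-4-0314",
    "GPT-4 (June 2023)": "gpt-4-0613",
    "GPT-3.5 (March 2023)": "gpt-3.5-turbo-0301",
    "GPT-3.5 (June 2023)": "gpt-3.5-turbo-0613",
}

def query(model: str, prompt: str) -> str:
    """Send one prompt to one model snapshot and return the text of its reply."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # keep sampling as deterministic as possible across runs
    )
    return response.choices[0].message.content

# Hypothetical probe prompt in the spirit of the paper's math task.
prompt = "Is 17077 a prime number? Think step by step and answer [Yes] or [No]."
for label, model in SNAPSHOTS.items():
    print(f"{label}: {query(model, prompt)!r}")
```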
The study uses the March 2023 and June 2023 snapshots of GPT-4 and GPT-3.5 that can be accessed via the OpenAI API, with the main aim of examining the variations, or "deviations," between the two dates. The team chose four commonly studied LLM tasks that are used as performance and safety benchmarks, each paired with a metric for measuring deviation (a short sketch of two of these metrics follows the list):
- Solving math problems: accuracy, i.e., how often an LLM service produces the correct answer.
- Answering sensitive questions: response rate, i.e., how often an LLM service provides a direct answer.
- Code generation: the fraction of generated code that can be executed directly in a programming environment and passes unit tests.
- Visual reasoning: exact match, i.e., whether the generated visuals exactly match the source material.
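As a concrete illustration, the sketch below shows how two of these deviation metrics could be computed over a batch of model outputs: answer accuracy for math problems and the "directly executable" rate for generated code. The answer-tag format, the helper names, and the unsandboxed `exec` call are simplifying assumptions, not the authors' implementation.

```python
import re

def accuracy(answers: list[str], labels: list[str]) -> float:
    """Fraction of answers whose final [Yes]/[No] tag matches the reference label."""
    def extract(text: str) -> str:
        tags = re.findall(r"\[(Yes|No)\]", text)
        return tags[-1] if tags else ""
    correct = sum(extract(a) == label for a, label in zip(answers, labels))
    return correct / len(labels)

def directly_executable_rate(snippets: list[str]) -> float:
    """Fraction of generated snippets that run without raising an exception.
    A real harness would sandbox execution and also run the task's unit tests."""
    ok = 0
    for code in snippets:
        try:
            exec(compile(code, "<generated>", "exec"), {})
            ok += 1
        except Exception:
            pass
    return ok / len(snippets)

# Toy usage with made-up outputs:
print(accuracy(["Reasoning... [Yes]", "It is composite. [No]"], ["Yes", "Yes"]))  # 0.5
print(directly_executable_rate(["print(1 + 1)", "prnt(1)"]))                      # 0.5
```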
In conclusion, the research focuses on GPT-4 and GPT-3.5, assesses them on four chosen tasks, and uses task-specific performance measures alongside other common metrics to quantify LLM deviations in each scenario, in order to observe how the behaviors of these LLMs evolve over time. The findings can help users better understand LLM behavior and apply these models across a variety of applications.
Tanya Malhotra is a final-year student at the University of Petroleum and Energy Studies, Dehradun, pursuing a BTech in Computer Engineering with a specialization in Artificial Intelligence and Machine Learning.
She is a data science enthusiast with strong analytical and critical-thinking skills, along with a keen interest in acquiring new skills, leading teams, and managing work in an organized manner.