Large language models (LLMs) trained on vast amounts of text data show remarkable abilities across a wide range of tasks, learned simply by predicting the next token. These tasks include marketing, reading comprehension, and medical analysis. As rapid progress renders traditional benchmarks obsolete, distinguishing deep understanding from superficial memorization becomes a challenge. Accurately assessing LLMs' true reasoning abilities therefore requires tests of their capacity to generalize beyond the training data.
LLMs often operate at a level of coherence previously thought achievable only through human cognition (Gemini Team; OpenAI), and they have demonstrated broad applicability in chat interfaces and many other contexts. The predominant traditional method for evaluating an AI system is to measure how well it performs on fixed benchmarks for specific tasks. However, it is plausible that a significant portion of these benchmark successes stems from superficial memorization of task solutions and a shallow grasp of the patterns in the training data.
Researchers from MIT and elsewhere presented their work in two studies. In Study 1, they use an aggregation ("wisdom of the crowd") approach, combining twelve LLMs to predict the outcomes of 31 binary questions, and compare these aggregated LLM predictions against 925 human forecasters from a three-month forecasting tournament. The results indicate that the LLM crowd outperforms a no-information benchmark and matches the performance of the human crowd. Study 2 then explores improving LLM predictions by exposing models to human crowd forecasts, focusing on GPT-4 and Claude 2.
In Study 1, the researchers collected predictions from twelve diverse LLMs, including GPT-4 and Claude 2. They compared the LLM predictions on 31 binary questions against the 925 human forecasters from the three-month tournament and found the two crowds statistically equivalent. In Study 2, they focused exclusively on GPT-4 and Claude 2, using a within-model design that collects pre- and post-intervention predictions for each question. They investigated how the models update their forecasts when the prompt is extended to include human prediction estimates from a real-world forecasting tournament.
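To make the Study 1 setup concrete, here is a minimal sketch of how such a "silicon crowd" could be aggregated and scored, assuming the crowd forecast is the per-question median and accuracy is measured with the Brier score; the simulated arrays and variable names are illustrative assumptions, not the paper's actual data or code.

```python
import numpy as np

# Hypothetical sketch of the "LLM crowd" aggregation in Study 1.
# Shapes, values, and names are illustrative, not from the paper.

rng = np.random.default_rng(0)

n_models, n_questions = 12, 31
# forecasts[i, j]: model i's probability that question j resolves "yes"
forecasts = rng.uniform(0.3, 0.9, size=(n_models, n_questions))
# outcomes[j]: 1 if question j resolved positively, else 0
outcomes = rng.integers(0, 2, size=n_questions)

# Aggregate the silicon crowd with the per-question median forecast
crowd_forecast = np.median(forecasts, axis=0)

def brier_score(p, y):
    """Mean squared error between probabilistic forecasts and outcomes (lower is better)."""
    return np.mean((p - y) ** 2)

print("LLM crowd Brier score:", brier_score(crowd_forecast, outcomes))
# A no-information benchmark always guesses 50%
print("No-information Brier :", brier_score(np.full(n_questions, 0.5), outcomes))
```

Using the median rather than the mean makes the aggregate robust to a single model's extreme forecast, which is one common choice when pooling crowd predictions.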
In Study 1, they collected 1,007 forecasts from the 12 LLMs and observed predictions predominantly above the 50% midpoint: the mean forecast of the LLM crowd significantly exceeded 50%, while only 45% of the questions resolved positively, indicating a bias toward predicting positive outcomes. In Study 2, 186 initial and updated GPT-4 and Claude 2 forecasts were analyzed across the 31 questions. Exposure to the human crowd's forecasts significantly improved model accuracy and narrowed prediction intervals, with the size of each adjustment correlated with how far the initial forecast deviated from the human benchmark.
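The Study 2 analysis can be sketched the same way: compare a model's initial forecast with its revised forecast after seeing the human crowd estimate, and check whether larger revisions occur when the model started farther from the human benchmark. Everything below is simulated for illustration; the update rule is an assumption, not the models' actual behavior.

```python
import numpy as np

# Illustrative sketch of the Study 2 pre/post comparison.
# All arrays are simulated; none of these values come from the paper.

rng = np.random.default_rng(1)

n_questions = 31
human_crowd = rng.uniform(0.2, 0.8, size=n_questions)  # human median per question
initial = rng.uniform(0.3, 0.9, size=n_questions)      # model's first forecast
# Assume the update moves the model partway toward the human benchmark
updated = initial + 0.5 * (human_crowd - initial)

outcomes = rng.integers(0, 2, size=n_questions)

brier = lambda p, y: np.mean((p - y) ** 2)
print("Brier before exposure:", brier(initial, outcomes))
print("Brier after exposure :", brier(updated, outcomes))

# The size of each revision should track how far the model started
# from the human benchmark (the correlation the study reports).
adjustment = np.abs(updated - initial)
deviation = np.abs(initial - human_crowd)
print("corr(adjustment, deviation):", np.corrcoef(adjustment, deviation)[0, 1])
```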
In conclusion, the researchers from MIT and elsewhere have presented their study of LLM ensemble predictions. It demonstrates that when LLMs are aggregated into a collective, they can rival crowd-based human methods in probabilistic forecasting. While previous research has shown that individual LLMs underperform in some forecasting contexts, combining simpler models into crowds can close the gap. This approach offers practical benefits for real-world applications, potentially equipping decision-makers with accurate political, economic, and technological forecasts and paving the way for broader societal use of LLM predictions.
Review the Paper. All credit for this research goes to the researchers of this project.
Asjad is an intern consultant at Marktechpost. He is pursuing a B.Tech in Mechanical Engineering at the Indian Institute of Technology, Kharagpur. Asjad is a machine learning and deep learning enthusiast who is always researching applications of machine learning in healthcare.