Compared to their supervised counterparts, which can be trained with millions of labeled examples, large language models (LLMs) such as GPT-3 and PaLM have shown impressive performance on a variety of natural language tasks, even in zero-shot settings. However, using LLMs to solve the fundamental problem of text ranking has produced mixed results. Existing approaches often perform markedly worse than fine-tuned baseline rankers. The only exception is a new approach built on the massive, black-box, commercial GPT-4 system.
The researchers argue that relying on such black-box systems is not ideal for academic researchers due to significant cost constraints and access limitations, though they do recognize the value of such explorations in demonstrating the capability of LLMs on ranking tasks. In this work, they first explain why LLMs struggle with ranking problems under the pointwise and listwise formulations of existing approaches. Pointwise techniques require LLMs to produce calibrated prediction probabilities before sorting, which is known to be extremely challenging, and generation-only LLM APIs (such as GPT-4's) do not expose such probabilities at all.
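For contrast, here is a minimal sketch of what pointwise scoring typically looks like, assuming a hypothetical `llm_token_logprobs` scoring API that exposes per-token log-probabilities; the prompt wording is illustrative, not the paper's.

```python
import math

def pointwise_relevance(llm_token_logprobs, query: str, doc: str) -> float:
    """Score a single document by the model's probability of answering 'Yes'.

    `llm_token_logprobs` is a hypothetical scoring API that returns a dict
    mapping candidate next tokens to log-probabilities. Generation-only APIs
    (e.g., GPT-4's) do not expose these values, and even when they are
    available, the probabilities must be well calibrated across documents
    for the resulting scores to yield a meaningful ranking.
    """
    prompt = (
        f"Passage: {doc}\n"
        f"Query: {query}\n"
        "Does the passage answer the query? Answer Yes or No:"
    )
    logprobs = llm_token_logprobs(prompt, candidates=["Yes", "No"])
    # Normalize over the two candidates to get P('Yes').
    p_yes = math.exp(logprobs["Yes"])
    p_no = math.exp(logprobs["No"])
    return p_yes / (p_yes + p_no)
```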
For listwise techniques, LLMs frequently produce inconsistent or nonsensical outputs, even given instructions that seem perfectly clear to humans; ranking metrics can drop by more than 50% when merely the order of the input documents changes. Empirically, the researchers find that listwise ranking prompts from prior work yield results that make no sense at all on moderately sized LLMs. These findings suggest that current, widely used LLMs do not fully grasp ranking tasks, possibly because their pre-training and fine-tuning procedures lack ranking awareness. To significantly reduce task complexity for LLMs and sidestep the calibration issue, Google Research researchers propose the Pairwise Ranking Prompting (PRP) paradigm, which uses a query together with a pair of documents as the prompt for ranking. PRP is based on a simple prompt architecture and supports both LLM generation and scoring APIs by default.
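To make the paradigm concrete, here is a minimal sketch of a PRP-style comparison. The prompt template approximates the idea described in the paper rather than quoting it, and `llm_generate` stands in for any text-generation LLM API.

```python
# Minimal sketch of Pairwise Ranking Prompting (PRP).
# NOTE: `llm_generate` is a hypothetical stand-in for any text-generation
# LLM API, and the template approximates the paper's prompt, not its
# exact wording.

PRP_TEMPLATE = (
    "Given a query: {query}, which of the following two passages is more "
    "relevant to the query?\n\n"
    "Passage A: {doc_a}\n\n"
    "Passage B: {doc_b}\n\n"
    "Output Passage A or Passage B:"
)

def prp_compare(llm_generate, query: str, doc_a: str, doc_b: str) -> str:
    """Ask the LLM which of two documents is more relevant to the query.

    To reduce sensitivity to the order in which the pair is presented,
    the comparison is run in both orders; a disagreement is treated as a tie.
    """
    out_ab = llm_generate(PRP_TEMPLATE.format(query=query, doc_a=doc_a, doc_b=doc_b))
    out_ba = llm_generate(PRP_TEMPLATE.format(query=query, doc_a=doc_b, doc_b=doc_a))

    a_wins_first = "passage a" in out_ab.lower()
    a_wins_second = "passage b" in out_ba.lower()  # positions were swapped

    if a_wins_first and a_wins_second:
        return "A"
    if not a_wins_first and not a_wins_second:
        return "B"
    return "tie"
```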
They discuss several PRP variants to address efficiency concerns. The PRP results are the first in the literature to achieve state-of-the-art ranking performance using moderately sized, open-source LLMs on standard benchmark datasets. On TREC-DL2020, PRP based on the 20B-parameter FLAN-UL2 model outperforms the previous best method in the literature, built on the commercial, black-box GPT-4 with an estimated 50x larger model size, by more than 5% on NDCG@1. On TREC-DL2019, PRP outperforms existing solutions such as the 175B-parameter InstructGPT by more than 10% on virtually all ranking measures, trailing the GPT-4 solution only on the NDCG@5 and NDCG@10 metrics. They also present competitive results with 3B- and 13B-parameter FLAN-T5 models to illustrate the effectiveness and applicability of PRP.
They also review additional benefits of PRP, such as its support for both LLM scoring and generation APIs and its insensitivity to input ordering. In summary, this work makes three contributions:
• Demonstrate, for the first time, that pairwise ranking prompting works well for zero-shot ranking with LLMs. The findings are based on moderately sized, open-source LLMs, in contrast to existing systems that employ considerably larger, commercial, black-box models.
• Produce state-of-the-art ranking performance using simple prompting and scoring mechanisms. This finding will make future research in this area more accessible.
• Examine several efficiency improvements and demonstrate strong empirical performance while achieving linear complexity (see the sketch after this list).
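As a rough illustration of how a linear-complexity variant can work, the sketch below runs k sliding-window passes of adjacent pairwise comparisons, each pass resembling one step of bubble sort. It reuses the hypothetical `prp_compare` helper from the earlier sketch; the details are an assumption based on the general description above, not the paper's exact algorithm.

```python
def prp_sliding_k(llm_generate, query, docs, k=10):
    """Sliding-window PRP: k backward passes of adjacent pairwise comparisons.

    Each pass costs O(N) LLM comparisons, so ranking the top-k documents
    costs O(k * N) comparisons in total, i.e., linear in N for fixed k.
    Assumes `docs` arrives in some initial order (e.g., from BM25 retrieval).
    """
    docs = list(docs)
    for _ in range(k):
        # One backward pass bubbles the most relevant remaining
        # document toward the front of the list.
        for i in range(len(docs) - 1, 0, -1):
            winner = prp_compare(llm_generate, query, docs[i - 1], docs[i])
            if winner == "B":  # the later document is more relevant
                docs[i - 1], docs[i] = docs[i], docs[i - 1]
    return docs
```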
Check out the Paper.
Aneesh Tickoo is a consulting intern at MarktechPost. She is currently pursuing her bachelor's degree in Information Science and Artificial Intelligence at the Indian Institute of Technology (IIT), Bhilai. She spends most of her time working on projects aimed at harnessing the power of machine learning. Her research interest is image processing, and she is passionate about building solutions around it. She loves connecting with people and collaborating on interesting projects.