Large language models (LLMs) have advanced rapidly on coding-related work, with code editing emerging as an important focus. LLMs built specifically for coding tasks are applied to a variety of activities, including code optimization and repair. They are becoming increasingly popular as programming tools, yet most evaluation methods concentrate on code generation, overlooking the crucial role that code editing plays in software development.
In recent research, a team of researchers from the Multimodal Art Projection Research Community, the University of Waterloo, HKUST, the University of Manchester, Tongji University, and the Vector Institute has introduced CodeEditorBench, an evaluation framework designed to assess how effectively LLMs handle a range of code editing tasks, such as requirement changes, debugging, translation, and polishing.
Unlike other benchmarks that focus primarily on code generation, CodeEditorBench emphasizes real-world applications and pragmatic elements of software development. The team curated a variety of coding scenarios and challenges from five different sources, covering a wide spectrum of programming languages, difficulty levels, and editing tasks. This ensures that the evaluation reflects the variety and complexity of challenges encountered in real coding environments.
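To make the setup concrete, here is a minimal sketch of what a single debugging-style item and a functional-correctness check might look like. The field names, helper function, and example task are illustrative assumptions, not CodeEditorBench's actual schema or harness.

```python
import subprocess
import tempfile
import textwrap
from dataclasses import dataclass

# Hypothetical sketch of a code-editing benchmark item; field names are
# assumptions for illustration, not CodeEditorBench's actual data format.
@dataclass
class EditTask:
    task_type: str      # "debug", "translate", "polish", or "requirement_change"
    language: str       # e.g. "python"
    source_code: str    # code to be edited by the model
    instruction: str    # natural-language description of the required edit
    tests: str          # unit tests used to judge functional correctness

def run_tests(candidate_code: str, tests: str) -> bool:
    """Run the model's edited code against the task's unit tests in a subprocess."""
    program = candidate_code + "\n\n" + tests
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program)
        path = f.name
    result = subprocess.run(["python", path], capture_output=True, timeout=10)
    return result.returncode == 0  # True only if every assertion passed

# Example debugging item: the off-by-one bug is what the model should fix.
task = EditTask(
    task_type="debug",
    language="python",
    source_code=textwrap.dedent("""
        def sum_to_n(n):
            return sum(range(n))  # bug: misses n itself
    """),
    instruction="Fix the function so it returns the sum of 1..n inclusive.",
    tests="assert sum_to_n(3) == 6\nassert sum_to_n(1) == 1",
)
```

In this framing, a model's edited code for the task would simply be passed to `run_tests` to decide whether the edit is functionally correct.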
The team's evaluation, which covered 19 different LLMs, revealed some intriguing trends. Within the CodeEditorBench framework, closed-source models, specifically Gemini-Ultra and GPT-4, outperformed open-source models. This underscores the importance of model architecture and training data in determining performance, particularly across different prompt sensitivities and problem categories.
The team has summarized its main contributions as follows.
- CodeEditorBench aims to provide a consistent approach to evaluating LLMs, and the framework includes tools for further analysis, training, and visualization. To promote further research into LLM capabilities, the team has stated that all evaluation data will be openly accessible, and additional evaluation metrics will be added in the future to make the assessment more comprehensive. (A minimal sketch of such an evaluation loop appears after this list.)
- Another objective is to map the current state of LLMs. OpenCI-DS-33B is the most effective openly available base model, followed by OpenCI-DS-6.7B and DS-33B-INST. Models such as Gemini, GPT, and GLM that are not publicly accessible tend to perform better than those that are, although OpenCI-DS-33B and DS-33B-INST, two instruction-tuned models with over 30 billion parameters, narrow this performance gap.
- CodeEditorBench also draws attention to the shortcomings of LLMs, especially in polishing code and adapting it to changed requirements. Although GPT-4 performs admirably in three of the four categories, its code polishing capabilities are notably lacking. Similarly, Gemini-Ultra struggles with code requirement changes. The team has highlighted these limitations so that they can be addressed in LLM training and development.
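Below is a minimal sketch of how an evaluation loop over such tasks might aggregate per-category pass rates. The `generate_edit` callable is a placeholder assumption for any model call; the paper's exact harness and metrics are not reproduced here.

```python
from collections import defaultdict

def evaluate(tasks, generate_edit):
    """Compute the fraction of tasks whose edited code passes its unit tests,
    broken down by task category (debug, translate, polish, requirement_change).

    `generate_edit(task)` is a placeholder for any model call that returns the
    edited source code as a string; it is an assumption, not part of the benchmark.
    """
    passed = defaultdict(int)
    total = defaultdict(int)
    for task in tasks:
        candidate = generate_edit(task)
        total[task.task_type] += 1
        if run_tests(candidate, task.tests):  # reuses run_tests from the sketch above
            passed[task.task_type] += 1
    return {t: passed[t] / total[t] for t in total}
```

A per-category breakdown like this is what makes the kind of weakness noted above (e.g., strong debugging but weak polishing) visible in the first place.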
In conclusion, the main objective of CodeEditorBench is to drive advancements in LLMs by providing a robust platform to comprehensively evaluate code editing capabilities.
Check out the Paper, Project, and GitHub. All credit for this research goes to the researchers of this project.
Tanya Malhotra is a final-year student at the University of Petroleum and Energy Studies, Dehradun, pursuing a BTech in Computer Engineering with a specialization in Artificial Intelligence and Machine Learning.
She is a Data Science enthusiast with strong analytical and critical thinking skills, along with a keen interest in acquiring new skills, leading groups, and managing work in an organized manner.