Large language models (LLMs) are increasingly deployed as powerful linguistic agents capable of performing a wide range of programming-related tasks. Despite these impressive advances, a considerable gulf still separates the capabilities these models demonstrate in static experimental settings from the ever-changing demands of real programming scenarios.
Standard code generation benchmarks test how well LLMs can generate new code from scratch. Real-world programming, however, rarely requires writing every component from the ground up.
When writing code for real-world applications, relying on existing, publicly available libraries is common practice. These libraries offer robust, battle-tested solutions to a wide range of problems. Code LLMs should therefore be evaluated not only on producing functions from scratch, but also on their ability to invoke code from open-source libraries with the correct use of parameters.
A new study by Yale University, Nanjing University, and Peking University presents ML-BENCH, a realistic and comprehensive benchmark for evaluating LLMs' ability to understand user instructions, navigate GitHub repositories, and produce executable code. ML-BENCH provides high-quality ground-truth code that satisfies the requirements of each instruction. The benchmark comprises 9,444 examples spanning 130 tasks across 14 popular machine learning GitHub repositories.
The researchers use Pass@k and Parameter Hit Precision as evaluation metrics. With these, they assess GPT-3.5-16k, GPT-4-32k, Claude 2, and CodeLlama in the ML-BENCH setting, which poses new challenges for LLMs. The empirical results show that the GPT models and Claude 2 outperform CodeLlama by a large margin. Although GPT-4 delivers a significant performance gain over the other LLMs, it still completes only 39.73% of the tasks in the experiments; the other well-known LLMs suffer from hallucinations and perform poorly. The findings suggest that LLMs need to do more than simply write code; they must also comprehend lengthy documentation. The key technical contribution is ML-AGENT, an autonomous language agent designed to address the deficiencies uncovered by the error analysis. Such agents can understand human language and instructions, generate efficient code, and carry out difficult tasks.
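For readers unfamiliar with the metrics: Pass@k is commonly computed with the unbiased estimator popularized by the HumanEval benchmark, which asks how likely it is that at least one of k sampled generations passes the task's tests; Parameter Hit Precision, roughly, measures whether generated calls use the correct arguments. The sketch below shows the standard Pass@k estimator; it is illustrative and not necessarily the exact implementation used in the ML-BENCH paper.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimator.

    n: number of generations sampled per task
    c: number of those generations that pass the tests
    k: budget of samples considered

    Returns 1 - C(n - c, k) / C(n, k), computed as a stable product.
    """
    if n - c < k:
        # Every possible k-subset contains at least one correct sample.
        return 1.0
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 10 generations per task, 3 of which pass
print(pass_at_k(n=10, c=3, k=1))  # 0.30
print(pass_at_k(n=10, c=3, k=5))  # ~0.92
```

Averaging this quantity over all tasks yields the benchmark-level Pass@k score.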
ML-BENCH and ML-AGENT represent a significant advance in the state of the art of automated machine learning workflows, and the researchers hope this work will be of interest to both researchers and practitioners.
Review the Paper and Project page. All credit for this research goes to the researchers of this project. Also, don't forget to join our 33k+ ML SubReddit, 41k+ Facebook community, Discord channel, and email newsletter, where we share the latest AI research news, interesting AI projects, and more.
If you like our work, you’ll love our newsletter.
Dhanshree Shenwai is a Computer Science Engineer with solid experience in FinTech companies covering Finance, Cards & Payments, and Banking, and a keen interest in AI applications. She is excited to explore new technologies and advancements in today's evolving world that make life easier for everyone.