Code-change reviews are a critical part of the software development process at scale, taking a significant amount of code author and code reviewer time. As part of this process, the reviewer inspects the proposed code and asks the author for code changes through comments written in natural language. At Google, we see millions of reviewer comments per year, and authors require an average of ~60 minutes of active shepherding time between sending changes for review and finally submitting the change. In our measurements, the active work time that the code author must spend addressing reviewer comments grows almost linearly with the number of comments. However, with machine learning (ML), we have an opportunity to automate and streamline the code review process, for example by proposing code changes based on the text of a comment.
Today, we describe applying recent advances of large sequence models in a real-world setting to automatically resolve code review comments in the day-to-day development workflow at Google (publication forthcoming). As of today, code change authors at Google address a substantial amount of reviewer comments by applying an ML-suggested edit. We expect that to reduce time spent on code reviews by hundreds of thousands of hours annually at Google scale. Unsolicited, very positive feedback highlights that the impact of ML-suggested code edits increases Googlers’ productivity and allows them to focus on more creative and complex tasks.
Code Edit Prediction
We start by training a model that predicts the code edits needed to address reviewer feedback. The model is pretrained on various coding tasks and related developer activities (for example, renaming a variable, fixing a broken build, editing a file). It is then fine-tuned for this specific task with reviewed code changes, reviewer comments, and edits the author made to address those comments.
An example of an ML-suggested edit of refactorings that are spread within the code. |
Google uses a monorepo, a single repository for all of its software artifacts, which allows our training dataset to include all unrestricted code used to build Google’s most recent software, as well as previous versions.
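To make the fine-tuning data concrete, the sketch below shows how a single training example might be represented as a (model input, target edit) pair. The field names and text format here are purely illustrative assumptions, not the production schema.

```python
from dataclasses import dataclass


@dataclass
class ReviewCommentExample:
    """One hypothetical fine-tuning example: reviewer feedback plus the author's fix."""
    file_path: str
    code_before: str       # the reviewed code the comment refers to
    reviewer_comment: str  # natural-language feedback from the reviewer
    code_after: str        # the edit the author actually made to address the comment


def to_model_io(example: ReviewCommentExample) -> tuple[str, str]:
    """Flatten an example into a (model input, target edit) text pair."""
    model_input = (
        f"FILE: {example.file_path}\n"
        f"CODE:\n{example.code_before}\n"
        f"COMMENT: {example.reviewer_comment}"
    )
    return model_input, example.code_after


# Illustrative usage with a toy example.
example = ReviewCommentExample(
    file_path="app/metrics.py",
    code_before="def avg(xs):\n    return sum(xs) / len(xs)",
    reviewer_comment="Please guard against empty input.",
    code_after="def avg(xs):\n    return sum(xs) / len(xs) if xs else 0.0",
)
print(to_model_io(example)[0])
```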
To improve the quality of the model, we iterated on the training dataset. For example, we compared model performance for datasets with a single reviewer comment per file against datasets with multiple comments per file, and experimented with classifiers to clean up the training data based on a small, curated dataset, choosing the model with the best offline precision and recall metrics.
Serving infrastructure and user experience
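As a minimal sketch of what such an offline comparison could look like, the function below computes precision and recall for one candidate model, assuming each prediction is a (suggested edit, confidence) pair and a suggestion counts as correct only if it exactly matches the edit the author actually made. This is a simplification for illustration, not the evaluation pipeline described in the post.

```python
def offline_metrics(predictions, ground_truth, threshold):
    """predictions: dict comment_id -> (suggested_edit, confidence)
    ground_truth: dict comment_id -> edit the author actually made
    Only suggestions whose confidence clears the threshold are counted as shown."""
    shown = {cid: edit for cid, (edit, conf) in predictions.items() if conf >= threshold}
    correct = sum(1 for cid, edit in shown.items() if ground_truth.get(cid) == edit)
    precision = correct / len(shown) if shown else 0.0
    recall = correct / len(ground_truth) if ground_truth else 0.0
    return precision, recall
```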
We design and implement the feature on top of the trained model, focusing on the overall user experience and developer efficiency. As part of this, we explore different user experience (UX) alternatives through a series of user studies. We then refine the feature based on insights from an internal beta (i.e., a test of the feature in development) including user feedback (for example, a “Was this helpful?” button next to the suggested edit).
The final model was calibrated for a target precision of 50%. That is, we tuned the model and the suggestion filtering so that 50% of the suggested edits on our test dataset are correct. In general, increasing the target precision reduces the number of suggested edits shown, and decreasing the target precision leads to more incorrect suggested edits. Incorrect suggested edits cost developers time and reduce developer trust in the feature. We found that a target precision of 50% provides a good balance.
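One simple way to calibrate for a target precision is to pick the lowest confidence threshold that still meets the target on a held-out test set, since a lower threshold keeps more suggestions. The sketch below illustrates that idea under simplified assumptions; it is not the actual calibration procedure used in production.

```python
def calibrate_threshold(scored_suggestions, target_precision=0.5):
    """scored_suggestions: list of (confidence, is_correct) pairs from a held-out test set.
    Returns the lowest confidence threshold whose precision meets the target,
    i.e., the qualifying threshold that keeps the most suggested edits."""
    for threshold in sorted({conf for conf, _ in scored_suggestions}):
        kept = [ok for conf, ok in scored_suggestions if conf >= threshold]
        if not kept:
            continue
        precision = sum(kept) / len(kept)
        if precision >= target_precision:
            return threshold
    return None  # no threshold reaches the target precision
```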
At a high level, for each new reviewer comment, we generate the model input in the same format used for training, query the model, and generate the suggested code edit. If the model is confident in its prediction and a few additional heuristics are satisfied, we send the suggested edit to downstream systems. The downstream systems, i.e., the code review tool and the integrated development environment (IDE), expose the suggested edits to the user and log user interactions, such as preview and apply events. A dedicated pipeline collects these logs and generates aggregate insights, for example the overall acceptance rates reported in this blog post.
The developer interacts with the ML-suggested edits in the code review tool and the IDE. Based on insights from the user studies, the integration into the code review tool is best suited for a streamlined review experience. The IDE integration provides additional functionality and supports 3-way merging of the ML-suggested edits (left in the figure below), in case of conflicting local changes on top of the reviewed code state (right), into the merge result (middle).
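The sketch below illustrates this serving flow for a single reviewer comment. All helper names, the input format, and the confidence threshold are hypothetical stand-ins for the real systems, intended only to show the shape of the pipeline.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Suggestion:
    edit: str
    confidence: float


def handle_reviewer_comment(
    comment_text: str,
    code_state: str,
    predict: Callable[[str], Suggestion],        # wraps the trained model
    deliver: Callable[[str, Suggestion], None],  # pushes the edit to the code review tool / IDE
    log_event: Callable[[str], None],            # feeds the aggregated-metrics pipeline
    min_confidence: float = 0.9,                 # illustrative threshold, not the production value
    heuristics: tuple[Callable[[Suggestion], bool], ...] = (),
) -> Optional[Suggestion]:
    """Illustrative serving flow for one reviewer comment (helper names are hypothetical)."""
    model_input = f"CODE:\n{code_state}\nCOMMENT: {comment_text}"  # same shape as training inputs
    suggestion = predict(model_input)
    if suggestion.confidence < min_confidence:
        return None                              # model is not confident enough; show nothing
    if not all(check(suggestion) for check in heuristics):
        return None                              # dropped by an additional heuristic
    deliver(comment_text, suggestion)            # downstream systems surface the edit to the user
    log_event("suggestion_shown")                # preview and apply events are logged the same way
    return suggestion
```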
3-way merge UX in IDE. |
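Conceptually, this is a standard 3-way merge between the reviewed base, the developer's local changes, and the ML-suggested edit. The sketch below illustrates the idea with `git merge-file`, purely as one possible tool; it is not how the IDE integration is actually implemented.

```python
import os
import subprocess
import tempfile


def three_way_merge(local: str, base: str, suggested: str) -> str:
    """Merge an ML-suggested edit into locally modified code, where both versions
    derive from the same reviewed base. Uses `git merge-file -p` for illustration."""
    paths = []
    try:
        for content in (local, base, suggested):
            f = tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False)
            f.write(content)
            f.close()
            paths.append(f.name)
        # git merge-file <current> <base> <other>; -p writes the merge result to stdout
        result = subprocess.run(
            ["git", "merge-file", "-p", *paths],
            capture_output=True, text=True,
        )
        return result.stdout  # contains conflict markers if the edits overlap
    finally:
        for p in paths:
            os.unlink(p)
```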
Results
Offline evaluations indicate that the model addresses 52% of comments with a target precision of 50%. The online metrics of the beta and the full internal launch confirm these offline metrics, i.e., we see model suggestions above our target model confidence for around 50% of all relevant reviewer comments. Code authors apply 40-50% of all previewed suggested edits.
We used the “not helpful” feedback during the beta to identify recurring patterns of model failures. We implemented serving-time heuristics to filter these out, thereby reducing the number of incorrect predictions shown. With these changes, we traded quantity for quality and saw a higher real-world acceptance rate.
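For instance, a serving-time heuristic might drop suggestions that exhibit a known failure pattern, such as an edit that changes nothing or one that is implausibly large for a localized comment. The rules below are hypothetical examples of this kind of filter, not the actual heuristics used.

```python
def passes_serving_heuristics(original_code: str, suggested_edit: str) -> bool:
    """Illustrative serving-time filters for recurring failure patterns (hypothetical rules)."""
    if suggested_edit.strip() == original_code.strip():
        return False                      # "no-op" edits: nothing actually changes
    if len(suggested_edit) > 10 * max(len(original_code), 1):
        return False                      # implausibly large rewrite for a localized comment
    if suggested_edit.count("{") != suggested_edit.count("}"):
        return False                      # obviously unbalanced braces
    return True
```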
Code review tool UX. The suggestion is shown as part of the comment and can be previewed, applied, and rated as helpful or not helpful. |
Our beta launch showed a discoverability challenge: code authors only previewed ~20% of all generated suggested edits. We modified the UX and introduced a prominent “Show ML Edit” button (see the figure above) next to the reviewer comment, leading to an overall ~40% preview rate at launch. We additionally found that suggested edits in the code review tool are often not applicable due to conflicting changes that the author made during the review process. We addressed this with a button in the code review tool that opens the IDE in a merge view for the suggested edit. We now see that more than 70% of these are applied in the code review tool and fewer than 30% are applied in the IDE. All these changes allowed us to increase the overall fraction of reviewer comments that are addressed with an ML-suggested edit by a factor of 2 from beta to the full internal launch. At Google scale, these results help automate the resolution of hundreds of thousands of comments each year.
Suggestion filtering funnel. |
We see ML-suggested edits addressing a wide range of reviewer comments in production. This includes simple localized refactorings and refactorings that are spread within the code, as shown in the examples throughout this blog post. The feature addresses longer and less formally worded comments that require code generation, refactorings, and imports.
Example of a suggested edit for a longer and less formally worded comment that requires code generation, refactorings, and imports. |
The model can also respond to complex comments and produce extensive code edits (shown below). The generated test case follows the existing unit test pattern, while changing the details as described in the comment. Additionally, the edit suggests a comprehensive name for the test that reflects the test semantics.
Example of the model’s ability to respond to complex feedback and produce extensive code edits. |
Conclusion and future work
In this post, we introduced an ML-assistance feature to reduce the time spent on code review-related changes. Currently, a substantial amount of all actionable code review comments in supported languages are addressed with an applied ML-suggested edit at Google. A 12-week A/B experiment across all Google developers will further measure the impact of the feature on overall developer productivity.
We are working on improvements across the stack. This includes increasing model quality and recall, and building a more streamlined developer experience with improved discoverability throughout the review process. As part of this, we are investigating the option of showing suggested edits to the reviewer while they draft comments, and expanding the feature in the IDE to enable code change authors to get suggested code edits for natural-language commands.
Acknowledgements
This is the work of many people on the team at Google Core Systems & Experiences, Google Research, and DeepMind. We would like to specifically thank Peter Choy for bringing the collaboration together, and all our team members for their key contributions and helpful advice, including Marcus Revaj, Gabriela Surita, Maxim Tabachnyk, Jacob Austin, Nimesh Ghelani, Dan Zheng, Peter Josling, Mariana Stariolo, Chris Gorgolewski, Sascha Varkevisser, Katja Grünwedel, Alberto Elizondo, Tobias Welp, Paige Bailey, Pierre-Antoine Manzagol, Pascal Lamblin, Chenjie Gu, Petros Maniatis, Henryk Michalewski, Sara Wiltberger, Ambar Murillo, Satish Chandra, Madhura Dudhgaonkar, Niranjan Tulpule, Zoubin Ghahramani, Juanjo Carin, Danny Tarlow, Kevin Villela, Stoyan Nikolov, David Tattersall, Boris Bokowski, Kathy Nix, Mehdi Ghissassi, Luis C. Cobo, Yujia Li, David Choi, Kristóf Molnár, Vahid Meimand, Amit Patel, Brett Wiltshire, Laurent Le Brun, Mingpan Guo, Hermann Loose, Jonas Mattes, and Savinee Dancs. Thanks to John Guiyard for creating the graphics in this post.