Code-change reviews are a critical part of the software development process at scale, taking a significant amount of code author and code reviewer time. As part of this process, the reviewer inspects the proposed code and asks the author for code changes through comments written in natural language. At Google, we see millions of reviewer comments per year, and authors require an average of ~60 minutes of active shepherding time between sending changes for review and finally submitting the change. In our measurements, the active work time that the code author must spend addressing reviewer comments grows almost linearly with the number of comments. However, with machine learning (ML), we have an opportunity to automate and streamline the code review process, for example by proposing code changes based on the text of a comment.
Today, we describe applying recent advances of large sequence models in a real-world setting to automatically resolve code review comments in the day-to-day development workflow at Google (publication forthcoming). As of today, code change authors at Google address a substantial amount of reviewer comments by applying an ML-suggested edit. We expect that to reduce time spent on code reviews by hundreds of thousands of hours annually at Google scale. Unsolicited, very positive feedback highlights that the impact of ML-suggested code edits increases Googlers’ productivity and allows them to focus on more creative and complex tasks.
Code Edit Prediction
We start by training a model that predicts the code edits needed to address reviewer feedback. The model is pretrained on various coding tasks and related developer activities (for example, renaming a variable, fixing a broken build, editing a file). It is then fine-tuned for this specific task with reviewed code changes, reviewer comments, and edits the author made to address those comments.
An example of an ML-suggested edit of refactorings that are spread within the code. |
Google uses a monorepo, a single repository for all of its software artifacts, which allows our training dataset to include all unrestricted code used to build Google’s most recent software, as well as previous versions.
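To make the fine-tuning data concrete, the sketch below shows how a single training example might be represented as a (model input, target edit) pair. The field names and text format here are purely illustrative assumptions, not the production schema.

```python
from dataclasses import dataclass


@dataclass
class ReviewCommentExample:
    """One hypothetical fine-tuning example: reviewer feedback plus the author's fix."""
    file_path: str
    code_before: str       # the reviewed code the comment refers to
    reviewer_comment: str  # natural-language feedback from the reviewer
    code_after: str        # the edit the author actually made to address the comment


def to_model_io(example: ReviewCommentExample) -> tuple[str, str]:
    """Flatten an example into a (model input, target edit) text pair."""
    model_input = (
        f"FILE: {example.file_path}\n"
        f"CODE:\n{example.code_before}\n"
        f"COMMENT: {example.reviewer_comment}"
    )
    return model_input, example.code_after


# Illustrative usage with a toy example.
example = ReviewCommentExample(
    file_path="app/metrics.py",
    code_before="def avg(xs):\n    return sum(xs) / len(xs)",
    reviewer_comment="Please guard against empty input.",
    code_after="def avg(xs):\n    return sum(xs) / len(xs) if xs else 0.0",
)
print(to_model_io(example)[0])
```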
To improve the quality of the model, we iterated on the training dataset. For example, we compared model performance for datasets with a single reviewer comment per file against datasets with multiple comments per file, and experimented with classifiers to clean up the training data based on a small, curated dataset, choosing the model with the best offline precision and recall metrics.
Serving infrastructure and user experience
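As a minimal sketch of what such an offline comparison could look like, the function below computes precision and recall for one candidate model, assuming each prediction is a (suggested edit, confidence) pair and a suggestion counts as correct only if it exactly matches the edit the author actually made. This is a simplification for illustration, not the evaluation pipeline described in the post.

```python
def offline_metrics(predictions, ground_truth, threshold):
    """predictions: dict comment_id -> (suggested_edit, confidence)
    ground_truth: dict comment_id -> edit the author actually made
    Only suggestions whose confidence clears the threshold are counted as shown."""
    shown = {cid: edit for cid, (edit, conf) in predictions.items() if conf >= threshold}
    correct = sum(1 for cid, edit in shown.items() if ground_truth.get(cid) == edit)
    precision = correct / len(shown) if shown else 0.0
    recall = correct / len(ground_truth) if ground_truth else 0.0
    return precision, recall
```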
We design and implement the feature on top of the trained model, focusing on the overall user experience and developer efficiency. As part of this, we explore different user experience (UX) alternatives through a series of user studies. We then refine the feature based on insights from an internal beta (i.e., a test of the feature in development) including user feedback (for example, a “Was this helpful?” button next to the suggested edit).
The final model was calibrated for a target precision of 50%. That is, we tuned the model and the suggestion filtering so that 50% of the suggested edits on our test dataset are correct. In general, increasing the target precision reduces the number of suggested edits shown, and decreasing the target precision leads to more incorrect suggested edits. Incorrect suggested edits cost developers time and reduce developer trust in the feature. We found that a target precision of 50% provides a good balance.
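One simple way to calibrate for a target precision is to pick the lowest confidence threshold that still meets the target on a held-out test set, since a lower threshold keeps more suggestions. The sketch below illustrates that idea under simplified assumptions; it is not the actual calibration procedure used in production.

```python
def calibrate_threshold(scored_suggestions, target_precision=0.5):
    """scored_suggestions: list of (confidence, is_correct) pairs from a held-out test set.
    Returns the lowest confidence threshold whose precision meets the target,
    i.e., the qualifying threshold that keeps the most suggested edits."""
    for threshold in sorted({conf for conf, _ in scored_suggestions}):
        kept = [ok for conf, ok in scored_suggestions if conf >= threshold]
        if not kept:
            continue
        precision = sum(kept) / len(kept)
        if precision >= target_precision:
            return threshold
    return None  # no threshold reaches the target precision
```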
At a high level, for each new reviewer comment, we generate the model input in the same format used for training, query the model, and generate the suggested code edit. If the model is confident in its prediction and a few additional heuristics are satisfied, we send the suggested edit to downstream systems. The downstream systems, i.e., the code review tool and the integrated development environment (IDE), expose the suggested edits to the user and log user interactions, such as preview and apply events. A dedicated pipeline collects these logs and generates aggregate insights, for example the overall acceptance rates reported in this blog post.
The developer interacts with the ML-suggested edits in the code review tool and the IDE. Based on insights from the user studies, the integration into the code review tool is best suited for a streamlined review experience. The IDE integration provides additional functionality and supports 3-way merging of the ML-suggested edits (left in the figure below), in case of conflicting local changes on top of the reviewed code state (right), into the merge result (middle).
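The sketch below illustrates this serving flow for a single reviewer comment. All helper names, the input format, and the confidence threshold are hypothetical stand-ins for the real systems, intended only to show the shape of the pipeline.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Suggestion:
    edit: str
    confidence: float


def handle_reviewer_comment(
    comment_text: str,
    code_state: str,
    predict: Callable[[str], Suggestion],        # wraps the trained model
    deliver: Callable[[str, Suggestion], None],  # pushes the edit to the code review tool / IDE
    log_event: Callable[[str], None],            # feeds the aggregated-metrics pipeline
    min_confidence: float = 0.9,                 # illustrative threshold, not the production value
    heuristics: tuple[Callable[[Suggestion], bool], ...] = (),
) -> Optional[Suggestion]:
    """Illustrative serving flow for one reviewer comment (helper names are hypothetical)."""
    model_input = f"CODE:\n{code_state}\nCOMMENT: {comment_text}"  # same shape as training inputs
    suggestion = predict(model_input)
    if suggestion.confidence < min_confidence:
        return None                              # model is not confident enough; show nothing
    if not all(check(suggestion) for check in heuristics):
        return None                              # dropped by an additional heuristic
    deliver(comment_text, suggestion)            # downstream systems surface the edit to the user
    log_event("suggestion_shown")                # preview and apply events are logged the same way
    return suggestion
```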
3-way merge UX in IDE. |
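Conceptually, this is a standard 3-way merge between the reviewed base, the developer's local changes, and the ML-suggested edit. The sketch below illustrates the idea with `git merge-file`, purely as one possible tool; it is not how the IDE integration is actually implemented.

```python
import os
import subprocess
import tempfile


def three_way_merge(local: str, base: str, suggested: str) -> str:
    """Merge an ML-suggested edit into locally modified code, where both versions
    derive from the same reviewed base. Uses `git merge-file -p` for illustration."""
    paths = []
    try:
        for content in (local, base, suggested):
            f = tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False)
            f.write(content)
            f.close()
            paths.append(f.name)
        # git merge-file <current> <base> <other>; -p writes the merge result to stdout
        result = subprocess.run(
            ["git", "merge-file", "-p", *paths],
            capture_output=True, text=True,
        )
        return result.stdout  # contains conflict markers if the edits overlap
    finally:
        for p in paths:
            os.unlink(p)
```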
Results
Offline evaluations indicate that the model addresses 52% of comments with a target precision of 50%. The online metrics of the beta and the full internal launch confirm these offline metrics, i.e., we see model suggestions above our target model confidence for around 50% of all relevant reviewer comments. Code authors apply 40-50% of all previewed suggested edits.
We used the “not helpful” feedback during the beta to identify recurring patterns of model failures. We implemented serving-time heuristics to filter these out, thereby reducing the number of incorrect predictions shown. With these changes, we traded quantity for quality and saw a higher real-world acceptance rate.
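For instance, a serving-time heuristic might drop suggestions that exhibit a known failure pattern, such as an edit that changes nothing or one that is implausibly large for a localized comment. The rules below are hypothetical examples of this kind of filter, not the actual heuristics used.

```python
def passes_serving_heuristics(original_code: str, suggested_edit: str) -> bool:
    """Illustrative serving-time filters for recurring failure patterns (hypothetical rules)."""
    if suggested_edit.strip() == original_code.strip():
        return False                      # "no-op" edits: nothing actually changes
    if len(suggested_edit) > 10 * max(len(original_code), 1):
        return False                      # implausibly large rewrite for a localized comment
    if suggested_edit.count("{") != suggested_edit.count("}"):
        return False                      # obviously unbalanced braces
    return True
```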
Code review tool UX. The suggestion is shown as part of the comment and can be previewed, applied, and rated as helpful or not helpful. |
Our beta launch showed a discoverability challenge: code authors only previewed ~20% of all generated suggested edits. We modified the UX and introduced a prominent “Show ML Edit” button (see the figure above) next to the reviewer comment, leading to an overall ~40% preview rate at launch. We additionally found that suggested edits in the code review tool are often not applicable due to conflicting changes that the author made during the review process. We addressed this with a button in the code review tool that opens the IDE in a merge view for the suggested edit. We now see that more than 70% of these are applied in the code review tool and fewer than 30% are applied in the IDE. All these changes allowed us to increase the overall fraction of reviewer comments that are addressed with an ML-suggested edit by a factor of 2 from beta to the full internal launch. At Google scale, these results help automate the resolution of hundreds of thousands of comments each year.
Suggestion filtering funnel. |
We see ML-suggested edits addressing a wide range of reviewer comments in production. This includes simple localized refactorings and refactorings that are spread within the code, as shown in the examples throughout this blog post. The feature addresses longer and less formally worded comments that require code generation, refactorings, and imports.
Example of a suggested edit for a longer and less formally worded comment that requires code generation, refactorings, and imports. |
The model can also respond to complex comments and produce extensive code edits (shown below). The generated test case follows the existing unit test pattern, while changing the details as described in the comment. Additionally, the edit suggests a comprehensive name for the test that reflects the test semantics.
Example of the model’s ability to respond to complex feedback and produce extensive code edits. |
Conclusion and future work
In this post, we introduced an ML-assistance feature to reduce the time spent on code review-related changes. Currently, a substantial amount of all actionable code review comments in supported languages are addressed with an applied ML-suggested edit at Google. A 12-week A/B experiment across all Google developers will further measure the impact of the feature on overall developer productivity.
We are working on improvements across the stack. This includes increasing model quality and recall, and building a more streamlined developer experience with improved discoverability throughout the review process. As part of this, we are investigating the option of showing suggested edits to the reviewer while they draft comments, and expanding the feature in the IDE to enable code change authors to get suggested code edits for natural-language commands.
Acknowledgements
This is the work of many people on the team at Google Core Systems & Experiences, Google Research, and DeepMind. We would like to specifically thank Peter Choy for bringing the collaboration together, and all our team members for their key contributions and helpful advice, including Marcus Revaj, Gabriela Surita, Maxim Tabachnyk, Jacob Austin, Nimesh Ghelani, Dan Zheng, Peter Josling, Mariana Stariolo, Chris Gorgolewski, Sascha Varkevisser, Katja Grünwedel, Alberto Elizondo, Tobias Welp, Paige Bailey, Pierre-Antoine Manzagol, Pascal Lamblin, Chenjie Gu, Petros Maniatis, Henryk Michalewski, Sara Wiltberger, Ambar Murillo, Satish Chandra, Madhura Dudhgaonkar, Niranjan Tulpule, Zoubin Ghahramani, Juanjo Carin, Danny Tarlow, Kevin Villela, Stoyan Nikolov, David Tattersall, Boris Bokowski, Kathy Nix, Mehdi Ghissassi, Luis C. Cobo, Yujia Li, David Choi, Kristóf Molnár, Vahid Meimand, Amit Patel, Brett Wiltshire, Laurent Le Brun, Mingpan Guo, Hermann Loose, Jonas Mattes, and Savinee Dancs. Thanks to John Guiyard for creating the graphics in this post.