The results of today’s neural networks in fields as diverse as language, mathematics, and vision are remarkable. These networks, however, typically employ elaborate structures that are resource-intensive to run. When resources are limited, as on wearables and smartphones, delivering such models to users can be impractical. Pruning a pre-trained network lowers its inference cost by deleting some of its weights while ensuring that the loss in utility is negligible. Each weight in a typical neural network specifies the connection between two neurons; after pruning, the input passes through a smaller subset of connections, reducing the computation required.
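To make the idea concrete, here is a minimal NumPy sketch of the simplest pruning scheme, magnitude pruning, which keeps only the largest-magnitude weights of a layer. This toy example only illustrates what retaining a fraction of the weights means; it is not CHITA’s method.

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, keep_fraction: float) -> np.ndarray:
    """Zero out all but the largest-magnitude weights (toy illustration, not CHITA)."""
    flat = weights.flatten()
    k = max(1, int(keep_fraction * flat.size))   # number of weights to keep
    threshold = np.sort(np.abs(flat))[-k]        # k-th largest magnitude
    mask = np.abs(weights) >= threshold          # keep only the largest entries
    return weights * mask

# A dense 4x4 layer pruned so that only ~20% of its connections survive.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4))
W_sparse = magnitude_prune(W, keep_fraction=0.2)
print(np.count_nonzero(W_sparse), "of", W.size, "weights kept")
```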
The CHITA (Combinatorial Hessian-free Iterative Thresholding Algorithm) framework, developed by a group of researchers from MIT and Google, is an efficient optimization-based approach to large-scale network pruning. The method builds on previous work that approximates the loss function with a local quadratic function based on the second-order Hessian. In contrast to earlier efforts, the researchers exploit a simple but crucial insight that lets them solve the optimization problem without computing or storing the Hessian matrix (hence the “Hessian-free” in the CHITA moniker), allowing the method to scale to very large networks.
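In generic notation (not taken verbatim from the paper), the local quadratic approximation that such Hessian-based pruning methods build on looks as follows, where \(\bar{w}\) denotes the pre-trained weights and \(H\) the Hessian (or an approximation such as the empirical Fisher):

```latex
L(w) \;\approx\; L(\bar{w})
      \;+\; \nabla L(\bar{w})^{\top} (w - \bar{w})
      \;+\; \tfrac{1}{2}\, (w - \bar{w})^{\top} H \, (w - \bar{w})
```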
To solve the regression reformulation efficiently, they propose a new method that uses active-set strategies, an improved stepsize selection scheme, and other techniques to speed up convergence on the chosen support. Compared with the iterative hard thresholding techniques widely employed in the sparse-learning literature, the proposed method yields substantial gains. The framework can sparsify networks with as many as 4.2M parameters, cutting them down to roughly 20% of their weights.
The following is a summary of the contributions:
Based on local quadratic approximations of the loss function, the researchers present CHITA, an optimization framework for network pruning.
They propose a constrained sparse regression reformulation to eliminate the memory overhead associated with storing a large, dense Hessian.
CHITA relies heavily on a novel IHT-based method to obtain high-quality solutions to the sparse regression problem. By exploiting the problem’s structure, the researchers introduce techniques that speed up convergence and boost pruning performance, such as a novel and effective stepsize selection strategy and rapid updates of the weights on the support. Compared with standard network pruning algorithms, this can improve runtime by a factor of up to a thousand.
The researchers also demonstrate improvements across a range of models and data sets.
A computationally efficient pruning formulation
Various pruning candidates can be derived by preserving only some of the weights of the original network. Let k denote the user-specified number of weights to retain. Among all potential pruning candidates (i.e., subsets of weights with only k weights kept), the candidate with the smallest loss is chosen. This is a natural formulation of pruning as a best-subset selection (BSS) problem.
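In generic notation, the BSS formulation described above reads: minimize the loss subject to keeping at most k nonzero weights.

```latex
\min_{w} \; L(w) \quad \text{subject to} \quad \|w\|_{0} \le k
```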
CHITA avoids explicitly computing the Hessian matrix while still using all of its information by employing a reformulated version of the pruning problem (BSS with a quadratic loss). This is made possible by the fact that the empirical Fisher information matrix is low-rank. The reformulation can be viewed as a sparse linear regression problem in which the weights of the network play the role of regression coefficients.
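A minimal NumPy sketch of this low-rank trick (the variable names are illustrative, not the paper’s): if the Hessian is approximated by the empirical Fisher matrix H ≈ (1/n) AᵀA, where the rows of A are the n per-sample gradients, then the quadratic term wᵀHw equals (1/n)‖Aw‖², so it can be evaluated directly from A without ever forming the p×p Hessian.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 2000                      # n samples, p parameters (p can be in the millions)
A = rng.normal(size=(n, p))           # per-sample gradients stacked as rows
w = rng.normal(size=p)                # a candidate weight perturbation

# Naive approach: build the p x p empirical Fisher explicitly -- O(p^2) memory.
H = A.T @ A / n
quad_explicit = w @ H @ w

# Hessian-free approach: use the low-rank structure H = (1/n) A^T A,
# so w^T H w = (1/n) ||A w||^2 -- only O(n * p) memory, never forms H.
quad_hessian_free = np.sum((A @ w) ** 2) / n

assert np.allclose(quad_explicit, quad_hessian_free)
```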
Scalable optimization algorithms
CHITA transforms pruning into a linear regression problem under the sparsity constraint that at most k of the regression coefficients can be nonzero. The researchers adapt the popular iterative hard thresholding (IHT) technique to solve it. At each IHT iteration, a gradient descent update is followed by zeroing out all regression coefficients outside the Top-k (i.e., the k coefficients with the largest magnitude). By jointly optimizing over the weights and iteratively exploring potential pruning candidates, IHT usually delivers a good solution.
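Below is a stripped-down NumPy sketch of vanilla IHT for a sparse regression problem, with a plain fixed stepsize. CHITA’s contributions (a better stepsize selection scheme, active-set strategies, and fast updates of the weights on the support) are refinements on top of this basic loop and are not shown here.

```python
import numpy as np

def iht_sparse_regression(A, b, k, step=None, iters=200):
    """Basic iterative hard thresholding for min ||b - A w||^2 s.t. ||w||_0 <= k.
    A plain-vanilla sketch; CHITA adds a tuned stepsize, active-set updates, etc."""
    n, p = A.shape
    if step is None:
        # Conservative fixed stepsize: 1 / largest eigenvalue of A^T A.
        step = 1.0 / np.linalg.norm(A, 2) ** 2
    w = np.zeros(p)
    for _ in range(iters):
        grad = A.T @ (A @ w - b)                 # gradient of the least-squares loss
        w = w - step * grad                      # gradient descent step
        support = np.argsort(np.abs(w))[-k:]     # indices of the k largest magnitudes
        mask = np.zeros(p, dtype=bool)
        mask[support] = True
        w[~mask] = 0.0                           # hard-threshold everything else
    return w

# Recover a 5-sparse coefficient vector from noisy linear measurements.
rng = np.random.default_rng(1)
n, p, k = 80, 200, 5
A = rng.normal(size=(n, p))
w_true = np.zeros(p)
w_true[rng.choice(p, k, replace=False)] = rng.normal(size=k)
b = A @ w_true + 0.01 * rng.normal(size=n)
w_hat = iht_sparse_regression(A, b, k)
print("relative error:", np.linalg.norm(w_hat - w_true) / np.linalg.norm(w_true))
```

The fixed Lipschitz stepsize used above is conservative and can make convergence slow, which is one reason a better stepsize selection scheme, as highlighted among CHITA’s contributions, matters in practice.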
In conclusion, the researchers have presented CHITA, a network pruning framework based on a novel Hessian-free constrained regression formulation and combinatorial optimization techniques. The single-stage approach significantly improves runtime and memory utilization while attaining results comparable to previous methods. The multi-stage strategy, which builds on the single-stage methodology, can further increase model accuracy. The researchers have also shown that state-of-the-art accuracy for sparse networks can be achieved by incorporating these pruning techniques into existing gradual pruning frameworks.
Check out the Paper and the Google Blog. All credit for this research goes to the researchers on this project.