Machine learning models need to encode long-form text for a variety of natural language processing tasks, such as summarizing or answering questions about long documents. Processing long texts with a Transformer model is computationally expensive, since attention cost grows quadratically with input length and the feedforward and projection layers must be applied to every input token. Several "Efficient Transformer" approaches have been proposed in recent years that reduce the cost of the attention mechanism over long inputs. However, the feedforward and projection layers, particularly for larger models, carry the bulk of the computational load and can make processing long inputs intractable. This work presents COLT5, a new family of models that builds on LONGT5 with architectural improvements to both the attention and the feedforward layers, enabling fast processing of long inputs.
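To make the cost argument concrete, here is a rough back-of-the-envelope sketch in Python. The formulas are standard Transformer FLOP estimates, and the model dimension and 4x feedforward expansion are illustrative assumptions, not figures from the paper:

```python
# Rough per-layer FLOP estimates (illustrative formulas; d_model and the
# 4x feedforward expansion are assumptions chosen for this example).
def per_layer_flops(n, d_model=4096):
    projections = 4 * n * d_model ** 2             # Q, K, V, output matmuls
    attn = 2 * n ** 2 * d_model                    # scores + weighted sum
    feedforward = 2 * n * d_model * (4 * d_model)  # two matmuls, 4x expansion
    return attn, projections + feedforward

for n in (4096, 16384, 65536):
    attn, token_wise = per_layer_flops(n)
    print(f"n={n:>5}: attention {attn:.2e}, projection+FFN {token_wise:.2e}")
```

Under these toy formulas, attention only overtakes the per-token projection and feedforward cost around n of roughly six times d_model, which illustrates why reducing attention cost alone is not enough for large models.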
The foundation of COLT5 is the insight that some tokens are more important than others, and that by devoting more compute to important tokens, higher quality can be achieved at lower cost. Specifically, COLT5 splits each feedforward layer and each attention layer into a light branch that is applied to all tokens and a heavy branch that is applied only to a set of important tokens, selected specifically for that input and component. The light feedforward branch has a lower hidden dimension than standard LONGT5, while the heavy feedforward branch has a higher one. Moreover, the fraction of important tokens shrinks with document length, allowing tractable processing of long texts.
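A minimal numpy sketch of such a conditional feedforward layer follows. The toy dimensions, the linear router, and the simple non-negative gate are illustrative assumptions; the paper's actual routing and normalization details differ:

```python
import numpy as np

rng = np.random.default_rng(0)

def ffn(x, w_in, w_out):
    # Standard two-matmul feedforward block with a ReLU in between.
    return np.maximum(x @ w_in, 0.0) @ w_out

# Toy sizes: n tokens of width d; the hidden dimensions and k are
# illustrative, not the model's actual hyperparameters.
n, d, d_light, d_heavy, k = 16, 8, 4, 32, 4

x = rng.normal(size=(n, d))
w_router = rng.normal(size=(d,))
light = (rng.normal(size=(d, d_light)), rng.normal(size=(d_light, d)))
heavy = (rng.normal(size=(d, d_heavy)), rng.normal(size=(d_heavy, d)))

# Light branch: every token goes through the cheap feedforward.
out = ffn(x, *light)

# Router: score all tokens, send only the top-k through the heavy branch,
# weighting the heavy output by a simple non-negative gate.
scores = x @ w_router
top = np.argsort(scores)[-k:]
gate = np.clip(scores[top], 0.0, None)
out[top] += gate[:, None] * ffn(x[top], *heavy)
```

Because only k of the n tokens pass through the wide heavy branch, the heavy compute stays bounded even as the input grows.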
An overview of the COLT5 conditional computation mechanism is shown in Figure 1. COLT5 makes two further modifications to the LONGT5 architecture. The light attention branch has fewer heads and applies local attention, while the heavy attention branch performs full attention over a separately selected set of important tokens. COLT5 also employs multi-query cross-attention, which dramatically speeds up inference. In addition, COLT5 uses the UL2 pre-training objective, which the authors show enables in-context learning over long inputs.
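The attention side can be sketched the same way: every token gets cheap local attention, and a routed subset additionally receives full attention. In this simplified single-head numpy sketch, the same routed set serves as both queries and keys/values for the heavy branch, which is a simplification of the paper's separate query and key-value routing:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(a):
    e = np.exp(a - a.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attend(q, k, v):
    # Scaled dot-product attention.
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

n, d, window, n_routed = 16, 8, 4, 4
x = rng.normal(size=(n, d))
w_router = rng.normal(size=(d,))

# Light branch: every token attends to a local window around itself
# (single-head here for brevity; the light branch uses fewer heads).
out = np.zeros_like(x)
for i in range(n):
    lo, hi = max(0, i - window), min(n, i + window + 1)
    out[i] = attend(x[i:i + 1], x[lo:hi], x[lo:hi])[0]

# Heavy branch: routed tokens additionally get full attention over the
# routed set, scaled by a non-negative router gate.
scores = x @ w_router
top = np.argsort(scores)[-n_routed:]
gate = np.clip(scores[top], 0.0, None)
out[top] += gate[:, None] * attend(x[top], x[top], x[top])
```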
Researchers at Google Research propose COLT5, a new model for long inputs that uses conditional computation for higher quality and faster processing. They show that COLT5 outperforms LONGT5 on the arXiv summarization and TriviaQA question-answering datasets, improving over LONGT5 and reaching SOTA on the SCROLLS benchmark. With less-than-linear scaling of the number of routed "focus" tokens, COLT5 greatly improves both quality and speed on tasks with long inputs. COLT5 also performs substantially faster finetuning and inference at the same or better model quality. The light feedforward and attention layers in COLT5 apply to the entire input, while the heavy branches act only on a set of important tokens chosen by a learned router. They demonstrate that COLT5 outperforms LONGT5 on various long-input datasets at every speed and can effectively and efficiently make use of extremely long inputs, up to 64k tokens.