Transformer-based causal language models generate text one token at a time. To produce the (K + 1)-th token, the model takes the previous K tokens and computes K intermediate vectors in each hidden layer, one per input token. Each vector is the output of a module (self-attention plus feed-forward sublayers) that operates on the output vectors of the previous layer. However involved this procedure is, it obeys a peculiar constraint: the number of operations available to determine the next token is limited by the number of tokens seen so far.
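To make that setup concrete, here is a minimal sketch of greedy autoregressive decoding in Python, using GPT-2 from Hugging Face Transformers purely as a stand-in (the model choice, prompt, and token count are illustrative, not from the paper): at each step the model sees K tokens, computes K vectors per layer, and only the last one is used to predict token K + 1.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

# Start from a K-token prompt; each forward pass sees all tokens so far.
input_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(5):  # generate five new tokens, one per forward pass
        logits = model(input_ids).logits            # shape: (1, K, vocab_size)
        next_id = logits[:, -1, :].argmax(dim=-1)   # prediction for token K + 1
        input_ids = torch.cat([input_ids, next_id.unsqueeze(-1)], dim=-1)

print(tokenizer.decode(input_ids[0]))
```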
A recent study by Carnegie Mellon University and Google investigates a strategy for relaxing this constraint: appending dummy tokens to the input of a decoder-only model in order to delay its output. Concretely, the authors choose a (learnable) pause token and append it to the input one or more times in a sequence. The model's response is then extracted only after the last pause token has been seen, and the outputs at the pause positions are ignored.
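As a rough illustration of the idea, and not the authors' exact training recipe, the sketch below appends a hypothetical `<pause>` special token to a prompt several times and only reads the model's output after the last pause. The token name, the count, and the use of GPT-2 are assumptions made for this example; in the paper the pause embedding is learned during pre-training and/or fine-tuning rather than randomly initialized at inference time.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

NUM_PAUSES = 10  # illustrative count; the paper treats this as a tunable choice

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Register the pause token and give it a (randomly initialized) embedding;
# in the paper this embedding is learned during pre-training / fine-tuning.
tokenizer.add_special_tokens({"additional_special_tokens": ["<pause>"]})
model.resize_token_embeddings(len(tokenizer))

prompt = "Q: Which planet is known as the Red Planet?\nA:" + " <pause>" * NUM_PAUSES
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# Outputs at the pause positions are ignored; the answer is read only from
# tokens generated after the final pause.
with torch.no_grad():
    out = model.generate(input_ids, max_new_tokens=20, do_sample=False,
                         pad_token_id=tokenizer.eos_token_id)

answer = tokenizer.decode(out[0, input_ids.shape[1]:], skip_special_tokens=True)
print(answer)
```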
Importantly, the researchers consider inserting such delays not only at inference time but also during downstream fine-tuning and pre-training. It is not obvious in advance what effect this seemingly small adjustment will have. The delay creates a potentially "wider" computational pathway that the Transformer could exploit. A more mundane outcome would be for the model to simply ignore the delay-inducing tokens and behave as before; after all, neither the tokens themselves nor the handful of new parameters introduced by embedding a single token carry any additional information from the training data. Worse, these meaningless tokens could obscure useful signals and weaken the model.
The team conducted an empirical evaluation to understand the outcome of introducing (appended) delays across all stages of training and inference. They examine pause training on decoder-only models of 1B and 130M parameters, initially pre-trained on C4 (Raffel et al., 2019) and then fine-tuned on nine downstream tasks covering extractive question answering, reasoning, general understanding, and fact recall. Most notably, the method raises the exact-match score of the 1B model by 18% on the SQuAD extractive question-answering task. They likewise observe an 8% gain on the CommonSenseQA general-understanding task and a 1% accuracy gain on the GSM8k reasoning task over the standard model's accuracy of 7.5%.
On the other hand, when pause tokens are introduced only during fine-tuning (on top of a standard pre-trained model), improvements appear in only a minority of cases. The team also performed several key ablations, including:
- Finding that appending pause tokens is generally better than prepending them.
- Finding that there is an optimal number of pause tokens for each downstream task (see the sketch after this list).
- Finding that reducing the number of inference-time pause tokens leads to a gradual degradation in performance.
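The last two ablations suggest a simple experimental loop: sweep the number of inference-time pause tokens and keep the task-specific optimum. The sketch below shows that loop in Python; `evaluate_with_pauses` is a hypothetical placeholder for a real downstream evaluation and is not part of the paper's code.

```python
# Hypothetical helper: run the pause-trained model with `num_pauses` appended
# pause tokens and return a task metric (e.g. exact match on SQuAD). The name
# and body are placeholders, not taken from the paper.
def evaluate_with_pauses(num_pauses: int) -> float:
    return 0.0  # replace with a real evaluation loop

# Sweep a handful of candidate counts and keep the best-performing one.
candidate_counts = [0, 1, 2, 5, 10, 20, 50]
scores = {n: evaluate_with_pauses(n) for n in candidate_counts}

best = max(scores, key=scores.get)
print(f"best number of pause tokens: {best} (score = {scores[best]:.3f})")
```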
The team believes the natural next step is to find ways to make such delays helpful directly on top of a standard pre-trained model. They envision several new directions of theoretical and applied research opening up thanks to their work, which extends the paradigm of delayed next-token prediction.