Scaling neural networks has been a major trend in recent years. Depth has been scaled up dramatically for exponential expressiveness, yielding a series of powerful deep networks, while the hidden dimension has been expanded efficiently with sparse MoE models and model-parallelism techniques. As the last atomic dimension of a neural network, sequence length should likewise be as long as possible. Removing the sequence-length constraint brings several benefits. First, it gives models a large memory and receptive field, allowing them to interact with people and the outside environment. Second, longer contexts contain more complex causal chains and reasoning paths, which models can exploit in the training data.
Short dependencies, in contrast, carry more spurious correlations, which is harmful to generalization. Third, it enables exploring the limits of in-context learning, which could be a paradigm shift for many-shot learning, since an extremely long context may help models mitigate catastrophic forgetting. The main difficulty in scaling sequence length is finding the right balance between computational complexity and model expressiveness.
One line of work extends sequence length with RNN-style models. However, their sequential nature limits parallelization during training, which is crucial for modeling long sequences. More recently, state space models have found favor in sequence modeling: they can operate as a CNN during training and switch to an efficient RNN at test time. They perform well on long-range benchmarks such as Long Range Arena, but at regular lengths they fall short of Transformers, primarily due to limited model expressiveness. Another approach to scaling sequence length is to reduce the cost of Transformers, i.e., the quadratic complexity of self-attention. Applying sliding windows or convolution modules over attention is a simple way to make the complexity nearly linear; however, this sacrifices the memory of the earliest tokens, forgetting the prompts at the very start of the sequence. Sparse attention reduces computation by sparsifying the attention matrix while preserving the ability to recall distant tokens, attaining, for example, a time complexity of O(N√N·d) with a fixed sparse pattern.
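To make the near-linear cost concrete, here is a minimal sliding-window attention sketch in NumPy. It is an illustration rather than any library's implementation; the window size w and the array shapes are assumptions for the example. Each query attends only to a fixed window of preceding keys, so total work is O(N·w·d) instead of O(N²·d):

```python
# Minimal sliding-window (local) attention sketch -- illustrative only.
# Each token attends to at most w predecessors plus itself, so the cost
# grows linearly in sequence length N instead of quadratically.
import numpy as np

def sliding_window_attention(q, k, v, w=4):
    """q, k, v: (N, d) arrays; w: one-sided window size (assumed)."""
    n, d = q.shape
    out = np.zeros_like(v)
    for i in range(n):
        lo = max(0, i - w)                      # window start (causal)
        scores = q[i] @ k[lo:i + 1].T / np.sqrt(d)
        probs = np.exp(scores - scores.max())   # numerically stable softmax
        probs /= probs.sum()
        out[i] = probs @ v[lo:i + 1]
    return out

n, d = 16, 8
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
print(sliding_window_attention(q, k, v).shape)  # (16, 8)
```

The linear cost comes at the price noted above: once a token falls outside every later window, its content can no longer be attended to directly.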
Heuristic patterns as well as learnable patterns work well for sparse attention. Low-rank attention, kernel-based techniques, downsampling strategies, recurrent models, and retrieval-based techniques round out the most effective Transformer-based designs; despite this, none has scaled to one billion tokens. Researchers at Microsoft Research have now done so with LONGNET, which replaces the attention of conventional Transformers with a new component called dilated attention; in their study, they successfully scaled the sequence length to one billion tokens. The main design principle is that the attention allocation decreases exponentially as the distance between tokens grows. They show that this achieves a logarithmic dependency between tokens and linear computational complexity.
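The following is a hedged sketch of the dilated-attention idea as described at a high level: the sequence is split into segments, and within each segment only every r-th token participates in attention, so the attended set shrinks as the dilation rate grows. The function name, segment length, and dilation rate are illustrative assumptions, not LONGNET's actual API:

```python
# Dilated attention sketch (illustrative, not the LONGNET implementation).
# Within each segment of length `segment`, only every r-th position is
# kept, so per-segment attention cost drops by roughly a factor of r^2.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dilated_attention(q, k, v, segment=8, r=2):
    """q, k, v: (N, d) arrays; segment: segment length; r: dilation rate."""
    n, d = q.shape
    out = np.zeros_like(v)
    for s in range(0, n, segment):
        idx = np.arange(s, min(s + segment, n))[::r]  # keep every r-th token
        scores = q[idx] @ k[idx].T / np.sqrt(d)
        out[idx] = softmax(scores) @ v[idx]           # unselected rows stay zero
    return out
```

In the full method, several (segment length, dilation rate) configurations are mixed, with shifted offsets so every position is covered: nearby tokens get dense attention while distant tokens get progressively sparser attention, which is what yields the exponential decay of attention allocation with distance.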
This resolves the conflict between making every token accessible and the finite attention resources available. In implementation, LONGNET's dilated attention can be converted into dense Transformer attention, so it supports standard Transformer optimizations (such as kernel fusion, quantization, and distributed training) without issue. Exploiting the linear complexity, LONGNET can partition training across nodes, breaking the constraints of both computation and memory. Whereas the standard Transformer suffers from quadratic complexity, this allows the sequence length to be scaled efficiently to 1B tokens with nearly constant runtime.
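The scaling claim can be sanity-checked with back-of-the-envelope arithmetic. The snippet below is an assumption-laden illustration, not a measurement: it contrasts the N²·d cost of dense attention with an N·w·d cost for an attention pattern whose per-token budget w is fixed (the values of d and w are made up for the example):

```python
# Rough FLOP comparison: dense (quadratic) vs. fixed-budget (linear) attention.
d, w = 64, 4096                            # assumed head dim and token budget
for n in [2**20, 2**25, 2**30]:            # up to ~1B tokens
    dense = n * n * d                      # quadratic in sequence length
    linear = n * w * d                     # linear in sequence length
    print(f"N={n:>13,}  dense/linear ratio = {dense / linear:,.0f}x")
```

At N = 2^30 (about one billion tokens) the ratio is N/w ≈ 262,144×, which is why quadratic attention is infeasible at this scale while a linear-cost mechanism remains tractable.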
Check out the Paper and GitHub link for more details.
Aneesh Tickoo is a consulting intern at MarktechPost. He is currently pursuing his bachelor's degree in Information Science and Artificial Intelligence at the Indian Institute of Technology (IIT), Bhilai. He spends most of his time working on projects aimed at harnessing the power of machine learning. His research interest is image processing, and he is passionate about building solutions around it. He loves connecting with people and collaborating on interesting projects.