Reasoning efficiently over long sequences is a major challenge in machine learning. Convolutions have recently emerged as a critical primitive for sequence modeling, supporting state-of-the-art performance in language modeling, time-series analysis, computer vision, DNA modeling, and more. Despite these impressive quality results and additional advantages such as improved stability and better scaling with sequence length, convolutional sequence models remain significantly slower than Transformers.
One main cause is poor hardware support. Convolutions for sequence modeling frequently employ filters as long as the input sequence, in contrast to the short filters used in classical convolutions for vision. The Fast Fourier Transform (FFT) convolution algorithm computes the convolution between an input u and a convolution kernel k by mapping both to frequency space, multiplying them elementwise, and mapping the result back with an inverse FFT.
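As a minimal illustration (not the paper's implementation; the names u, k, and fft_conv are placeholders), a causal FFT convolution can be sketched in a few lines of NumPy:

```python
import numpy as np

def fft_conv(u, k):
    """Causal convolution of input u with kernel k via the FFT.

    Zero-padding to length 2N prevents the circular wrap-around of the
    plain FFT from leaking future positions into past ones.
    """
    n = len(u)
    u_f = np.fft.fft(u, 2 * n)   # input mapped to frequency space
    k_f = np.fft.fft(k, 2 * n)   # kernel mapped to frequency space
    y = np.fft.ifft(u_f * k_f)   # elementwise product, then inverse FFT
    return y.real[:n]            # keep the causal part

u = np.random.randn(1024)
k = np.random.randn(1024)        # filter as long as the input
assert np.allclose(fft_conv(u, k), np.convolve(u, k)[:1024])
```

This is what makes long filters affordable at all: the FFT route costs O(N log N) instead of the O(N²) of direct convolution.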
Despite being asymptotically efficient, the FFT convolution algorithm achieves poor wall-clock performance on contemporary accelerators. By contrast, years of systems progress have allowed Transformers to approach the limits of current accelerators, with end-to-end FLOP utilization exceeding 72% when using FlashAttention-v2.
To deliver longer-context capabilities, new research from Stanford University investigates how to optimize the FFT convolution algorithm for contemporary accelerators. The researchers argue that, just as systems advances like FlashAttention led to better models and new attention algorithms, optimizing the FFT convolution will yield new and better algorithms and raise the quality of convolutional sequence models.
FFT convolution is easy to optimize for short sequences. It is common practice to reuse kernel filters across multiple batches, allowing the FFT of the filter to be precomputed once and reused. FFT convolution is therefore embarrassingly parallel across batches and filters, and kernel fusion allows intermediate convolution outputs to be cached in SRAM or registers.
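A sketch of that reuse, building on the fft_conv example above (again with assumed, placeholder names), precomputes the filter's FFT once and applies it to every sequence in a batch:

```python
import numpy as np

def make_batched_fft_conv(k, n):
    """Precompute the filter FFT once; reuse it for every batch."""
    k_f = np.fft.fft(k, 2 * n)            # done once, outside the hot loop

    def conv(u_batch):                    # u_batch: (batch, n)
        u_f = np.fft.fft(u_batch, 2 * n, axis=-1)
        return np.fft.ifft(u_f * k_f, axis=-1).real[..., :n]

    return conv

conv = make_batched_fft_conv(np.random.randn(1024), 1024)
y = conv(np.random.randn(8, 1024))        # one filter, eight sequences
```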
However, the team highlights two major obstacles that appear as sequence length grows:
- First, FFT convolutions make poor use of the specialized matrix-matrix multiply units (such as Tensor Cores) on current accelerators.
- Second, kernel fusion fails because the sequences grow too large to fit in SRAM, forcing expensive I/O operations. Padding operations for causality and conversions between real-valued inputs/outputs and complex-valued FFT intermediates can further increase these I/O costs.
In response, the researchers offer FlashFFTConv, a novel algorithm that employs a Monarch decomposition of the FFT to optimize FFT convolution for long sequences. An order-p Monarch decomposition rewrites the FFT as a series of p matrix-matrix multiply operations, which map efficiently onto hardware. Higher values of p incur lower FLOP costs because the matrices are smaller, but require more I/O to move intermediate results, so there is a trade-off.
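To make the idea concrete, here is a minimal order-2 sketch under assumed names, using dense DFT matrices for clarity (the paper's fused GPU kernel is far more optimized): a length-N FFT with N = N1·N2 becomes two matrix multiplies separated by pointwise twiddle factors.

```python
import numpy as np

def dft_matrix(n):
    """Dense DFT matrix F[j, k] = exp(-2*pi*i * j*k / n)."""
    j, k = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    return np.exp(-2j * np.pi * j * k / n)

def monarch_fft2(u, N1, N2):
    """Order-2 decomposition: a length N1*N2 FFT as two matmuls."""
    N = N1 * N2
    F1, F2 = dft_matrix(N1), dft_matrix(N2)
    X = u.reshape(N1, N2)                      # view the signal as a matrix
    A = F1 @ X                                 # matmul 1: size-N1 DFTs
    k1 = np.arange(N1)[:, None]
    n2 = np.arange(N2)[None, :]
    A = A * np.exp(-2j * np.pi * k1 * n2 / N)  # pointwise twiddle factors
    C = A @ F2.T                               # matmul 2: size-N2 DFTs
    return C.T.reshape(N)                      # restore the output ordering

u = np.random.randn(4096).astype(np.complex128)
assert np.allclose(monarch_fft2(u, 64, 64), np.fft.fft(u))
```

Because each stage is now a matrix multiply, it can run on tensor cores; an order-3 decomposition would break F1 and F2 down the same way, trading fewer FLOPs for extra I/O.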
The study demonstrates how to choose p to balance FLOP cost against I/O cost on a GPU using a simple cost model based on sequence length. Besides enabling kernel fusion at longer sequence lengths, this decomposition reduces the amount of the sequence that must be held in SRAM, so FlashFFTConv handles sequence lengths from 256 to 4 million. By using a real-valued FFT algorithm and skipping portions of the matrix-multiply operations when the input is zero-padded, FlashFFTConv can cut the cost of the FFT operation by as much as half.

Last but not least, the matrix view of FFT convolution provides a simple interface for two architectural modifications: partial convolutions, which learn with a convolution kernel shorter than the input sequence, and frequency-sparse convolutions, which zero out sections of the convolution kernel in frequency space. Both can be implemented simply by skipping sections of the matrix decomposition, reducing memory usage and wall-clock time, and can be viewed as convolutional analogues of sparse/approximate attention in Transformers.
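Functionally, a frequency-sparse convolution just masks the kernel's frequency response; the dense sketch below (hypothetical names, reusing the fft_conv setup from earlier) reproduces that behavior, though it does not capture the skipped matrix-multiply tiles that make it cheap in FlashFFTConv.

```python
import numpy as np

def freq_sparse_conv(u, k, keep):
    """Convolve u with k after zeroing all but `keep` kernel frequencies.

    In FlashFFTConv the zeroed blocks let whole matrix-multiply tiles be
    skipped; this dense version only shows the functional effect.
    """
    n = len(u)
    k_f = np.fft.fft(k, 2 * n)
    mask = np.zeros(2 * n)
    mask[:keep] = 1.0                  # DC and low positive frequencies
    mask[2 * n - keep + 1:] = 1.0      # matching negative frequencies,
                                       # so the output stays real-valued
    u_f = np.fft.fft(u, 2 * n)
    return np.fft.ifft(u_f * (k_f * mask)).real[:n]

y = freq_sparse_conv(np.random.randn(1024), np.random.randn(1024), keep=128)
```

A partial convolution is even simpler in this view: truncate k to fewer taps than the sequence length before taking its FFT, and the corresponding blocks of the matrix decomposition drop out.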
Researchers demonstrate that FlashFFTConv accelerates FFT convolution, resulting in longer, more efficient, and better quality sequence models.
- FlashFFTConv improves the quality of convolutional sequence models through increased efficiency: for the same compute budget, FlashFFTConv allows Hyena-GPT-s to achieve 2.3 points better perplexity and allows M2-BERT-base to achieve up to a 3.3-point higher average GLUE score, a performance gain equivalent to doubling the model parameters.
- FlashFFTConv improves convolution efficiency by up to 7.93× and delivers up to 5.60× memory savings compared to PyTorch, and this efficiency holds over four orders of magnitude of sequence length. Thanks to lower FLOP costs, FlashFFTConv is faster end-to-end in wall-clock time than FlashAttention-v2 for sequence lengths of 2K and above, and it achieves up to 62.3% end-to-end FLOP utilization, only 10% less than FlashAttention-v2.
- Longer-sequence models become possible with FlashFFTConv. It has produced the only model capable of solving the Long Range Arena's Path-512 task (sequence length 256K) for high-resolution image classification, and the first model to embed the longest human genes (up to 2.3 million base pairs) at single-nucleotide resolution; using partial convolutions, it extends HyenaDNA to a sequence length of 4M.
The team hopes that FlashFFTConv will pave the way for broader use of convolutional sequence models and that the lessons learned will lead to more resource-efficient computing architectures.
Review the Paper, GitHub, and blog article. All credit for this research goes to the researchers of this project.
Dhanshree Shenwai is a Computer Science Engineer with solid experience in FinTech companies spanning Finance, Cards & Payments, and Banking, and a keen interest in AI applications. She is excited to explore new technologies and advancements in today's evolving world that make life easier for everyone.