Million-byte streams are common, as music, image, and video files are often several megabytes in size. However, because of the quadratic cost of self-attention and, more significantly, the cost of large per-position feedforward networks, large transformer decoders (LLMs) typically use only a few thousand tokens of context. This significantly narrows the range of tasks to which LLMs can be applied. Meta researchers present MEGABYTE, a novel method for modeling long byte sequences. Byte streams are divided into fixed-size patches that play a role roughly analogous to tokens.
The model has three components (sketched in code below):
(1) A local module, a small autoregressive model that predicts bytes within a patch.
(2) A patch embedder, which simply encodes a patch by losslessly concatenating the embeddings of each byte.
(3) A global module, a large autoregressive transformer that inputs and generates patch representations.
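Below is a minimal PyTorch sketch of how these three components might fit together. The module sizes, names, and layer choices are illustrative assumptions rather than the authors' implementation, and the input shifting that a real autoregressive model needs (offsetting the global input by one patch and the local input by one byte) is omitted for brevity.

```python
# Minimal MEGABYTE-style sketch; sizes and structure are assumptions for illustration.
import torch
import torch.nn as nn

PATCH_SIZE = 8        # P: bytes per patch (assumed)
VOCAB = 256           # one token per possible byte value
D_GLOBAL, D_LOCAL = 512, 128

class MegabyteSketch(nn.Module):
    def __init__(self):
        super().__init__()
        self.byte_embed = nn.Embedding(VOCAB, D_LOCAL)
        # (2) Patch embedder: concatenate the byte embeddings of each patch.
        self.patch_proj = nn.Linear(PATCH_SIZE * D_LOCAL, D_GLOBAL)
        # (3) Global module: a large autoregressive transformer over patches.
        self.global_model = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(D_GLOBAL, nhead=8, batch_first=True),
            num_layers=4)
        # (1) Local module: a small model that predicts bytes within a patch.
        self.local_model = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(D_LOCAL, nhead=4, batch_first=True),
            num_layers=2)
        self.global_to_local = nn.Linear(D_GLOBAL, PATCH_SIZE * D_LOCAL)
        self.head = nn.Linear(D_LOCAL, VOCAB)

    def forward(self, bytes_in):                       # (batch, num_patches * P)
        B, T = bytes_in.shape                          # T must be a multiple of P
        K = T // PATCH_SIZE                            # number of patches
        emb = self.byte_embed(bytes_in)                # (B, T, D_LOCAL)
        patches = emb.view(B, K, PATCH_SIZE * D_LOCAL)
        causal = torch.triu(torch.full((K, K), float("-inf")), diagonal=1)
        g = self.global_model(self.patch_proj(patches), mask=causal)
        # Broadcast each patch representation back to its bytes, then run the
        # small local model independently (and in parallel) over every patch.
        local_in = self.global_to_local(g).view(B * K, PATCH_SIZE, D_LOCAL)
        local_causal = torch.triu(
            torch.full((PATCH_SIZE, PATCH_SIZE), float("-inf")), diagonal=1)
        h = self.local_model(local_in + emb.view(B * K, PATCH_SIZE, D_LOCAL),
                             mask=local_causal)
        return self.head(h).view(B, T, VOCAB)          # per-byte logits
```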
Importantly, most byte predictions are straightforward for many tasks (such as completing a word given its first few letters), eliminating the need for massive per-byte networks and allowing considerably smaller models for intra-patch modeling. For long-sequence modeling, the MEGABYTE architecture offers three key advantages over transformers. Sub-quadratic self-attention: the vast majority of research on long-sequence models has been devoted to reducing the quadratic cost of self-attention. MEGABYTE splits long sequences into two shorter ones, a global sequence of patches and local sequences within each patch, and with near-optimal patch sizes the self-attention cost drops to O(N^(4/3)), which remains manageable for long sequences. Per-patch feedforward layers: by using large feedforward layers per patch rather than per position, MEGABYTE allows much larger and more expressive models at the same cost. In GPT-3-sized models, more than 98% of the FLOPS are spent computing position-wise feedforward layers.
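To see where the O(N^(4/3)) figure comes from, a quick back-of-the-envelope calculation (the byte count N below is an assumed example, and attention cost is taken to scale with sequence length squared): the global module attends over N/P patches while each local module attends over P bytes, and choosing P near N^(1/3) balances the two terms.

```python
# Rough self-attention cost comparison; N and the cost model are illustrative.
N = 1_000_000                 # bytes in the sequence (assumed)
P = round(N ** (1 / 3))       # patch size near the optimum P ~ N^(1/3)
K = N // P                    # number of patches

full_attention = N ** 2                     # standard transformer: O(N^2)
megabyte_attention = K ** 2 + K * P ** 2    # global over patches + local per patch

print(f"{full_attention:.2e} vs {megabyte_attention:.2e}")    # ~1e12 vs ~2e8
print(f"theoretical O(N^(4/3)) scale: {N ** (4 / 3):.2e}")    # ~1e8
```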
Parallelism in decoding: transformers must perform all computations serially during generation, since the input at each time step is the output of the previous one. By producing patch representations in parallel, MEGABYTE allows greater parallelism during generation. With patch size P, MEGABYTE can use a layer with mP parameters once per patch for the same cost as a baseline transformer using a layer with m parameters P times. For example, with the same training compute, a MEGABYTE model with 1.5B parameters can generate sequences 40% faster than a standard 350M-parameter transformer, while also improving perplexity.
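The parameter-for-FLOP trade can be checked with simple arithmetic. In the sketch below (the values of m and P are illustrative assumptions), applying one layer with m*P parameters once per patch costs the same as applying an m-parameter layer at each of the P positions inside the patch, so the per-patch layer can be P times wider at no extra cost.

```python
# Back-of-the-envelope FLOP count for the per-patch feedforward claim;
# the parameter count m and patch size P are illustrative assumptions.
m = 1_000_000   # parameters in a baseline per-position feedforward layer
P = 8           # patch size

# Baseline transformer: an m-parameter layer applied at each of P positions
# (roughly 2 FLOPs per parameter per application).
baseline_flops = 2 * m * P
# MEGABYTE: a single, P-times-larger (m*P parameter) layer applied once per patch.
megabyte_flops = 2 * (m * P) * 1

assert baseline_flops == megabyte_flops   # same cost, P-times-wider layer
```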
Together, these improvements allow us to scale to long sequences, increase generation speed during deployment, and train much larger, better-performing models with the same computational budget. This is where MEGABYTE stands in stark contrast to existing autoregressive models, which typically involve some form of tokenization that maps byte sequences onto larger discrete tokens. Tokenization complicates preprocessing, multimodal modeling, and transfer to new domains, while hiding useful structure from the model. It also means that most cutting-edge models are not truly end to end. The most popular tokenization methods require language-specific heuristics or lose information.
Therefore, replacing tokenization with high-performance, efficient byte models would have several benefits. The researchers conduct extensive experiments with both strong baselines and MEGABYTE. To focus their comparisons entirely on model architecture rather than training resources, which are known to benefit all models, they use a fixed computational budget and dataset across all models. They find that MEGABYTE enables byte-level models to achieve state-of-the-art perplexities for density estimation on ImageNet, to perform competitively with subword models in long-context language modeling, and to support audio modeling from raw audio data. These findings show that autoregressive sequence modeling without tokenization is scalable.