It has been said that information theory and machine learning are “two sides of the same coin” because of their close relationship. One particularly elegant connection is the fundamental equivalence between probabilistic models of data and lossless compression. The key result is the source coding theorem, which states that the expected message length in bits of an optimal entropy encoder equals the negative log2 probability that the statistical model assigns to the data. In other words, reducing the number of bits needed per message is equivalent to increasing the log2 probability. Lossless compression with a probabilistic model can be achieved with techniques such as Huffman coding, arithmetic coding, and asymmetric numeral systems.
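As a rough illustration of this relationship (a minimal sketch; the toy probabilities below are invented purely for the example), the ideal code length of a message is the negative log2 of the probability the model assigns to it, summed over the conditional probabilities of its symbols:

```python
import math

# Toy next-symbol probabilities assigned by some probabilistic model P
# to each symbol of a 4-symbol message, given its preceding context.
# These numbers are made up for illustration only.
symbol_probs = [0.5, 0.25, 0.125, 0.125]

# Ideal code length in bits: -log2 of the joint probability of the message,
# i.e. the sum of -log2 of each conditional probability.
ideal_bits = sum(-math.log2(p) for p in symbol_probs)
print(ideal_bits)  # 1 + 2 + 3 + 3 = 9 bits

# A model that assigns higher probability to the same message needs fewer bits:
better_probs = [0.9, 0.8, 0.7, 0.9]
print(sum(-math.log2(p) for p in better_probs))  # ≈ 1.1 bits
```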
Figure 1 | Arithmetic encoding of the sequence ‘AIXI’ with a probabilistic (language) model P (both in blue) produces the binary code ‘0101001’ (in green). Arithmetic coding compresses the data by assigning each symbol an interval whose width depends on the probability given by P. It progressively narrows these intervals to produce the compressed bits that replace the original message. During decoding, arithmetic coding initializes an interval based on the incoming compressed bits and iteratively matches intervals to symbols, using the probabilities given by P, to reconstruct the original message.
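The interval-narrowing idea behind arithmetic coding can be sketched in a few lines of Python. This is a simplified, exact-arithmetic toy that ignores the bit-level renormalization a real coder needs, and the static probabilities for ‘A’, ‘I’, ‘X’ are assumptions made for the example:

```python
from fractions import Fraction

# Toy static model: symbol probabilities (assumed for illustration).
model = {"A": Fraction(1, 2), "I": Fraction(1, 4), "X": Fraction(1, 4)}

def cumulative(model):
    """Map each symbol to its cumulative-probability interval [low, high)."""
    intervals, low = {}, Fraction(0)
    for sym, p in model.items():
        intervals[sym] = (low, low + p)
        low += p
    return intervals

def encode(message, model):
    """Narrow [0, 1) once per symbol; any number inside the final interval encodes the message."""
    low, high = Fraction(0), Fraction(1)
    intervals = cumulative(model)
    for sym in message:
        span = high - low
        sym_low, sym_high = intervals[sym]
        low, high = low + span * sym_low, low + span * sym_high
    return low, high

low, high = encode("AIXI", model)
print(float(low), float(high))  # any binary fraction inside this interval is a valid code
```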
Since arithmetic coding is known to be optimal in terms of coding length, the overall compression performance depends on the capabilities of the probabilistic model (Fig. 1). Furthermore, large pre-trained Transformers, also known as foundation models, have recently demonstrated excellent performance across a variety of prediction tasks and are therefore attractive candidates for use with arithmetic coding. Transformer-based compression with arithmetic coding has produced state-of-the-art results in both online and offline settings. The offline setting considered in this work involves training the model on an external dataset before using it to compress a (possibly different) data stream. In the online setting, a pseudo-randomly initialized model is trained directly on the data stream being compressed. Offline compression therefore uses a fixed set of model parameters and relies on in-context adaptation.
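A minimal sketch of the offline setting is shown below. It assumes a hypothetical `model.next_byte_probs(context)` helper that returns a frozen, pretrained model’s conditional distribution over the next byte; in practice these probabilities would drive an arithmetic coder, which achieves essentially this code length:

```python
import math

def offline_code_length(data: bytes, model, context_len: int) -> float:
    """Ideal compressed size, in bits, of `data` under a frozen pretrained model.

    `model.next_byte_probs(context)` is a hypothetical helper returning a
    length-256 list of probabilities for the next byte given the context.
    An arithmetic coder driven by the same probabilities would match this
    length to within a few bits.
    """
    total_bits = 0.0
    for i, byte in enumerate(data):
        context = data[max(0, i - context_len):i]  # in-context conditioning only
        probs = model.next_byte_probs(context)     # model parameters stay fixed
        total_bits += -math.log2(probs[byte])
    return total_bits
```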
Transformers are well suited to offline compression because they have demonstrated excellent in-context learning capabilities. As this work describes, Transformers that are trained to compress effectively must therefore have strong in-context learning skills. Context length, a critical limiting factor in offline compression, determines the maximum number of bytes a model can compress at once. Transformers are computationally expensive and can only compress a small amount of data at a time (a token typically encodes 2 or 3 bytes). Since many difficult prediction tasks (such as algorithmic reasoning or long-term memory) require long contexts, extending the context length of these models is an important problem that is receiving growing attention. The in-context compression view sheds light on how current foundation models fall short. Researchers from Google DeepMind and Meta AI & Inria advocate using compression to study the prediction problem and to evaluate how well large (foundation) models compress data.
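Because the context window bounds how many bytes the model can condition on, long streams have to be split into independent windows before compression. The sketch below uses an illustrative chunk size; the usable window in practice depends on the model’s context length and on how many bytes each token covers:

```python
def chunk_stream(data: bytes, chunk_size: int = 2048):
    """Split a long byte stream into windows the model can compress independently.

    2048 is an illustrative value only, not a claim about any particular model.
    """
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

chunks = chunk_stream(b"\x00" * 10_000)
print(len(chunks), len(chunks[0]))  # 5 chunks; the last one is shorter than 2048 bytes
```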
They make the following contributions:
• They conduct an empirical study of the lossless compression capabilities of foundation models. To do this, they examine how arithmetic coding turns predictive models into compressors and highlight the connection between the two fields.
• They demonstrate that foundation models with in-context learning capabilities, trained primarily on text, are general-purpose compressors. For example, Chinchilla 70B compresses ImageNet patches to 43.4% of their raw size and LibriSpeech samples to 16.4%, outperforming domain-specific compressors such as PNG (58.5%) and FLAC (30.3%), respectively (see the short sketch after this list for how such rates are computed).
• They present a new perspective on scaling laws by showing that scaling is not a magic bullet: the size of the dataset sets a strict upper limit on the model size that still pays off in terms of compression performance.
• They employ compressors as generative models and use the prediction–compression equivalence to visually represent the performance of the underlying compressor.
• They show that tokenization, which can be viewed as a form of pre-compression, does not, on average, improve compression performance. Instead, it allows models to pack more information into their context and is typically used to improve prediction performance.
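To make the percentages quoted in the second bullet concrete, here is a minimal sketch of how a compression rate is computed; the byte counts below are invented for the example, and lower is better:

```python
def compression_rate(compressed_bytes: int, raw_bytes: int) -> float:
    """Compressed size as a fraction of the raw size; lower is better."""
    return compressed_bytes / raw_bytes

# Invented example: compressing a 1,000,000-byte chunk of image data.
print(f"{compression_rate(434_000, 1_000_000):.1%}")  # 43.4%, a Chinchilla-style rate
print(f"{compression_rate(585_000, 1_000_000):.1%}")  # 58.5%, a PNG-style rate
```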
Aneesh Tickoo is a consulting intern at MarktechPost. She is currently pursuing her bachelor’s degree in Data Science and Artificial Intelligence at the Indian Institute of Technology (IIT), Bhilai. She spends most of her time working on projects aimed at harnessing the power of machine learning. Her research interest is image processing, and she is passionate about creating solutions around it. She loves connecting with people and collaborating on interesting projects.