Language models (LMs) face a persistent challenge in self-supervised learning: representation degeneration. Pretrained LMs such as BERT and GPT-2 are known to produce representations with low angular variability (anisotropy) and outlier dimensions. Architecturally, an LM consists of a neural network that maps sequences of tokens to contextual representations, followed by a language modeling head, typically a linear layer with parameter matrix W, that turns each representation into a probability distribution over the next token. The current trend is to scale up generative pretraining in the style of GPT-2, despite concerns about power consumption and hardware limits. Yet evaluation of the Pythia model suite reveals performance saturation in the later phases of pretraining when small models are trained on large corpora.
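To make the role of the linear head concrete, here is a minimal sketch of an LM head in Python. The vocabulary size and hidden dimension are illustrative placeholders, not values from the paper; the point is that the logits over any set of contexts live in a subspace of dimension at most d, the hidden size.

```python
import numpy as np

# Hypothetical sizes for illustration only: vocabulary V, hidden dimension d.
V, d = 50_000, 512

rng = np.random.default_rng(0)
W = rng.normal(size=(V, d)) / np.sqrt(d)   # linear LM head, shape (V, d)
h = rng.normal(size=d)                     # contextual representation of one position

logits = W @ h                             # one logit per vocabulary item
probs = np.exp(logits - logits.max())
probs /= probs.sum()                       # softmax over the vocabulary

# Over any batch of contexts H, the logit matrix is H @ W.T, so its rank is
# at most d. When d is much smaller than the rank of the target distribution,
# this is the softmax bottleneck.
print(probs.shape, probs.sum())            # (50000,) 1.0
```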
Pythia models, trained on 300 billion tokens of the Pile, exhibit late-training performance drops on the LAMBADA dataset for their smaller variants. Scaling laws predict that training compact models on very large corpora is inefficient, yet recent efforts deliberately train smaller language models on large datasets to reduce inference costs. The softmax bottleneck describes the limitation that arises when a model's hidden dimension is too small to express the target distribution. Representation degeneration in pretrained models produces singular value distributions with low entropy, which harms language modeling. Related work connects scaling laws with data dimensionality and uses singular value decomposition (SVD) to analyze the performance limits of linear classifiers.
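As a rough illustration of the spectral analyses mentioned above, the sketch below computes the Shannon entropy of a weight matrix's normalized singular values with NumPy. The matrix shapes and the rank-32 "degenerate" head are hypothetical stand-ins, not measurements from the paper.

```python
import numpy as np

def singular_value_entropy(W: np.ndarray) -> float:
    """Shannon entropy of the normalized singular value distribution of W.

    A spectrally saturated head concentrates mass on a few singular values,
    which lowers this entropy.
    """
    s = np.linalg.svd(W, compute_uv=False)
    p = s / s.sum()
    return float(-(p * np.log(p + 1e-12)).sum())

rng = np.random.default_rng(0)
healthy = rng.normal(size=(5_000, 512))                   # full-spectrum random head
degenerate = rng.normal(size=(5_000, 32)) @ rng.normal(size=(32, 512))  # rank-32 head
print(singular_value_entropy(healthy))     # higher entropy
print(singular_value_entropy(degenerate))  # lower entropy
```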
Researchers from Inria Paris and Sorbonne University conduct a comprehensive study to analyze the correlation between saturation and representation degeneration, particularly in small model language modeling. They showed that a linear language modeling head can be a performance bottleneck for architectures with small hidden dimensions. This bottleneck arises from a mismatch between the hidden dimension of the smallest models and the high rank of the target contextual probability distribution, which affects performance through the softmax bottleneck phenomenon.
The researchers investigated performance saturation in Pythia models of various sizes, confirming saturation in models of up to 410M parameters. Loss saturation manifests as an increase in in-domain loss during the advanced stages of training. A scaling law fit to the data points of models with more than 410 million parameters yields optimal parameters A = 119.09 and α = 0.246. The final checkpoints of the smaller models underperform this extrapolation by about 8% on average, while their best checkpoints fall short by about 4%, due in part to incomplete learning rate cooldown.
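A power-law fit of this kind can be reproduced in a few lines with SciPy. In the sketch below, the parameter counts and losses are synthetic data generated from the reported fit (not the actual Pythia measurements), purely to illustrate the fitting and extrapolation procedure.

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, A, alpha):
    # L(N) = A * N^(-alpha), the form fit to models above 410M parameters
    return A * n ** (-alpha)

# Synthetic data points generated from the reported fit (A = 119.09,
# alpha = 0.246) plus small noise; real Pythia losses are not reproduced here.
rng = np.random.default_rng(0)
n_params = np.array([4.1e8, 1.0e9, 1.4e9, 2.8e9, 6.9e9, 1.2e10])
losses = power_law(n_params, 119.09, 0.246) * (1 + 0.01 * rng.normal(size=n_params.size))

(A_hat, alpha_hat), _ = curve_fit(power_law, n_params, losses, p0=(100.0, 0.25))
print(f"A = {A_hat:.2f}, alpha = {alpha_hat:.3f}")

# Extrapolating the fit below 410M predicts the loss a small model "should"
# reach; the paper finds final checkpoints land about 8% above this value.
print("predicted loss at 160M params:", power_law(1.6e8, A_hat, alpha_hat))
```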
The key contributions of this research are the following:
- Characterizing the performance saturation of small language models by evaluating and extrapolating scaling laws.
- Identifying a concurrent degeneration of representations in smaller models, in particular rank saturation of the LM prediction head.
- Empirically verifying the high rank of the target contextual distribution and the substantial performance impact of a low-rank linear head (a rank-estimation sketch follows this list).
- Theoretically quantifying the performance limitation induced by low-rank LM heads.
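The sketch below illustrates one plausible way to probe the rank of the target contextual distribution, as referenced in the list above: stack the next-token distributions predicted by a reference model over many contexts and inspect the singular values of the resulting matrix. The model choice (GPT-2 via Hugging Face transformers) and the tiny context set are illustrative assumptions; a meaningful estimate would need thousands of contexts so the rank is not trivially bounded by the sample size.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative reference model; the paper's exact setup may differ.
name = "gpt2"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()

# Toy context set; in practice use thousands of contexts.
contexts = ["The capital of France is", "She opened the door and", "In 1969, humans first"]
probs = []
with torch.no_grad():
    for c in contexts:
        ids = tok(c, return_tensors="pt").input_ids
        logits = model(ids).logits[0, -1]          # next-token logits for this context
        probs.append(torch.softmax(logits, dim=-1))

P = torch.stack(probs)                             # (num_contexts, vocab_size)
S = torch.linalg.svdvals(P)                        # spectrum of the distribution matrix
print(S / S.max())                                 # slow decay suggests high rank
```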
Anisotropy, a common form of representation degeneration in small language models, is a reduction of the angular variability of representations. Measuring anisotropy as the average cosine similarity between hidden states shows that it is widespread across layers and models. In Pythia models, a correlation is observed between last-layer anisotropy and performance saturation. The singular value distributions of the language modeling heads reveal patterns of spectral saturation that co-occur with performance saturation. The theoretical analysis then establishes a formal link between the dimensionality of the contextual distribution and the performance bottleneck induced by low-rank heads.
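The anisotropy measure mentioned above, average cosine similarity, can be sketched as follows. The toy batches are synthetic and only illustrate how a dominant shared direction drives the score toward 1.

```python
import torch

def average_cosine_similarity(H: torch.Tensor) -> torch.Tensor:
    """Mean pairwise cosine similarity of a (n, d) batch of hidden states.

    Values near 0 indicate isotropic (well-spread) representations; values
    close to 1 indicate anisotropy, i.e., low angular variability.
    """
    Hn = torch.nn.functional.normalize(H, dim=-1)
    sims = Hn @ Hn.T                           # (n, n) cosine similarity matrix
    n = H.shape[0]
    off_diag = sims.sum() - sims.diagonal().sum()
    return off_diag / (n * (n - 1))

# Toy check: random vectors are near-isotropic; adding a dominant common
# direction makes the batch anisotropic.
rng = torch.Generator().manual_seed(0)
iso = torch.randn(256, 512, generator=rng)
aniso = iso + 5.0 * torch.ones(1, 512)
print(average_cosine_similarity(iso))      # close to 0
print(average_cosine_similarity(aniso))    # close to 1
```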
In conclusion, this research investigates performance saturation in small language models, which arises from the difficulty of mapping low-dimensional output representations to high-rank contextual probability distributions through a linear language modeling head. The paper establishes a theoretical link between this performance gap and the spectral properties of contextual probability distributions, and the empirical results confirm that this mapping has relatively high rank. Experiments reveal significant performance drops when the hidden dimension of the LM head falls below 1000. The analysis correlates saturation with last-layer anisotropy and spectral saturation of small models' LM heads, improving our understanding of the impact of the softmax bottleneck in language modeling.
Asjad is a consulting intern at Marktechpost. He is pursuing a B.Tech in Mechanical Engineering at the Indian Institute of Technology, Kharagpur. Asjad is a machine learning and deep learning enthusiast who is always researching applications of machine learning in healthcare.