Large Language Models (LLMs) hold great promise in artificial intelligence. However, despite being trained on large datasets covering multiple languages and domains, their ability to understand and generate text is sometimes overstated. LLM applications across many domains have had limited impact on improving human-computer interaction or enabling innovative solutions, in part because the deep layers of LLMs contribute little and, if removed, barely affect performance. This underutilization of deep layers points to an inefficiency within these models.
Prior work has shown that the deeper layers of LLMs contribute little to their performance. Although used to stabilize training, techniques such as Pre-LN and Post-LN exhibit significant limitations. Pre-LN reduces the magnitude of gradients in deeper layers, limiting their effectiveness, while Post-LN causes gradients to vanish in earlier layers. Despite efforts to address these issues through dynamic linear combinations and adaptive model initialization, these techniques do not fully optimize LLM performance.
To address this issue, researchers from Dalian University of Technology, the University of Surrey, Eindhoven University of Technology, and the University of Oxford proposed Mix-LN, a normalization technique that combines the strengths of Pre-LN and Post-LN within the same model. Mix-LN applies Post-LN to the earlier layers and Pre-LN to the deeper layers to ensure smoother gradients. This approach allows both shallow and deep layers to contribute effectively to training. The researchers tested the hypothesis that the deeper layers of LLMs are inefficient because of Pre-LN. The main difference between the Post-LN and Pre-LN architectures is the placement of layer normalization (LN): in Post-LN, LN is applied after the residual addition, while in Pre-LN it is applied before.
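To make the distinction concrete, here is a minimal PyTorch sketch of a Post-LN block, a Pre-LN block, and the Mix-LN idea of assigning Post-LN to the earliest layers and Pre-LN to the rest. The module structure and the α-based split below are simplified assumptions for illustration, not the authors' implementation.

```python
import torch.nn as nn

class PostLNBlock(nn.Module):
    """Post-LN: LayerNorm is applied after the residual addition."""
    def __init__(self, d_model, n_heads, d_ff):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):
        x = self.ln1(x + self.attn(x, x, x, need_weights=False)[0])  # normalize after the residual sum
        x = self.ln2(x + self.ff(x))
        return x

class PreLNBlock(nn.Module):
    """Pre-LN: LayerNorm is applied to each sub-layer's input, before the residual addition."""
    def __init__(self, d_model, n_heads, d_ff):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):
        h = self.ln1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # normalize before the sub-layer
        x = x + self.ff(self.ln2(x))
        return x

def build_mix_ln_stack(n_layers, d_model, n_heads, d_ff, alpha=0.25):
    """Mix-LN idea: the first alpha fraction of layers use Post-LN, the remainder use Pre-LN."""
    n_post = int(alpha * n_layers)
    return nn.ModuleList(
        [PostLNBlock(d_model, n_heads, d_ff) for _ in range(n_post)]
        + [PreLNBlock(d_model, n_heads, d_ff) for _ in range(n_layers - n_post)]
    )
```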
The researchers compared Pre-LN and Post-LN models across open-weight LLMs and small-scale in-house models, using metrics such as angular distance and the performance degradation caused by layer pruning to assess layer effectiveness. In BERT-Large (Post-LN), the earlier layers were less effective than the deeper ones. In LLaMA2-7B (Pre-LN), the deeper layers were less effective, and pruning them had minimal impact on performance. The researchers observed similar trends in LLaMA-130M, where Pre-LN layers were less effective at deeper levels and Post-LN maintained better performance in the deeper layers. These results suggested that Pre-LN is what causes the inefficiency of the deeper layers.
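As a rough illustration of the angular-distance analysis, the sketch below compares each layer's input and output hidden states using Hugging Face transformers; near-zero distances indicate a layer that barely transforms its input. The exact metric definition used here is an assumption for illustration and may differ in detail from the paper's.

```python
import torch
from transformers import AutoModel, AutoTokenizer

def layer_angular_distances(model_name, text):
    """Angular distance between consecutive hidden states: small values suggest a layer
    barely changes its input, i.e. it contributes little (the symptom reported for deep
    Pre-LN layers)."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name, output_hidden_states=True)
    with torch.no_grad():
        out = model(**tok(text, return_tensors="pt"))
    hs = out.hidden_states  # tuple: embedding output + one tensor per layer, each (1, seq, dim)
    dists = []
    for h_in, h_out in zip(hs[:-1], hs[1:]):
        cos = torch.nn.functional.cosine_similarity(h_in, h_out, dim=-1)  # (1, seq)
        ang = torch.arccos(cos.clamp(-1, 1)) / torch.pi                   # normalized angular distance per token
        dists.append(ang.mean().item())
    return dists
```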
The optimal ratio α of Post-LN layers for Mix-LN was determined through experiments with LLaMA-1B on the C4 dataset. The best performance occurred at α = 0.25, where perplexity was lowest; at other ratios, performance declined but still remained above that recorded with Pre-LN alone. Mix-LN also supported a wider range of representations and maintained healthier gradient norms, allowing the deeper layers to contribute effectively. Overall, Mix-LN achieved significantly lower perplexity scores, outperforming other normalization methods.
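One way to observe the gradient-norm behavior described above is to aggregate gradient norms per layer after a backward pass. The helper below is a hypothetical sketch that assumes LLaMA-style parameter names containing "layers.&lt;index&gt;"; it is not part of the paper's code.

```python
def per_layer_grad_norms(model):
    """Collect each transformer layer's gradient norm after loss.backward().
    Under Pre-LN, deeper layers tend to show much smaller norms; Mix-LN is reported
    to keep them healthier so deep layers keep learning."""
    sq_sums = {}
    for name, param in model.named_parameters():
        if param.grad is None or ".layers." not in name:
            continue
        # assumes parameters are named like "model.layers.<idx>...." (typical for LLaMA-style models)
        idx = int(name.split(".layers.")[1].split(".")[0])
        sq_sums[idx] = sq_sums.get(idx, 0.0) + param.grad.norm().item() ** 2
    return {idx: total ** 0.5 for idx, total in sorted(sq_sums.items())}
```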
In conclusion, the researchers identified the inefficiency that Pre-LN causes in the deep layers of large language models (LLMs) and proposed Mix-LN as a solution. Experiments showed that Mix-LN outperformed both Pre-LN and Post-LN, improving model performance during pre-training and fine-tuning without increasing model size. This work can serve as a foundation for future research on deep model training, advancing model efficiency and capability.
Check out the Paper. All credit for this research goes to the researchers of this project. Also, don't forget to follow us on <a target="_blank" href="https://twitter.com/Marktechpost">Twitter</a> and join our Telegram Channel and LinkedIn Group. Don't forget to join our 60k+ ML SubReddit.
Divyesh is a Consulting Intern at Marktechpost. He is pursuing a BTech in Agricultural and Food Engineering from the Indian Institute of Technology Kharagpur. He is a data science and machine learning enthusiast who wants to integrate these leading technologies into agriculture and solve its challenges.