As language models grow ever larger, so do their vocabularies. This has shifted the memory footprint of LLM training disproportionately onto a single layer: the cross-entropy in the loss computation. Cross-entropy builds a logit matrix with an entry for every pair of input token and vocabulary item and, for small models, consumes an order of magnitude more memory than the rest of the LLM combined. We propose Cut Cross-Entropy (CCE), a method that computes the cross-entropy loss without materializing the logits for all tokens in global memory. Instead, CCE computes only the logit for the correct token and evaluates the log-sum-exp over all logits on the fly. We implement a custom kernel that performs the matrix multiplications and the log-sum-exp reduction over the vocabulary in flash memory, making the global memory consumption of the cross-entropy computation negligible. This has a dramatic effect. Taking the Gemma 2 (2B) model as an example, CCE reduces the memory footprint of the loss computation from 24 GB to 1 MB, and the total training-time memory consumption of the classifier head from 28 GB to 1 GB. To improve the throughput of CCE, we exploit the inherent sparsity of softmax and propose skipping elements of the gradient computation that make a negligible (i.e., below numerical precision) contribution to the gradient. Experiments show that this dramatic reduction in memory consumption is achieved without sacrificing training speed or convergence.
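To illustrate the core identity behind CCE, the following PyTorch sketch computes the per-token loss as the log-sum-exp over the vocabulary minus the correct-token logit, streaming the reduction over vocabulary chunks so the full tokens-by-vocabulary logit matrix is never held in memory at once. This is only a minimal illustration of the forward-pass structure, not the paper's fused kernel; the function and parameter names (chunked_cross_entropy, chunk_size) are illustrative, and the chunked matmul here still writes small per-chunk logit blocks to global memory, whereas the custom kernel keeps them in flash memory and also exploits softmax sparsity in the backward pass.

```python
import torch

def chunked_cross_entropy(embeddings: torch.Tensor,
                          classifier: torch.Tensor,
                          targets: torch.Tensor,
                          chunk_size: int = 4096) -> torch.Tensor:
    """Cross-entropy without a full logit matrix (illustrative sketch).

    embeddings: (N, D) hidden states of the input tokens
    classifier: (V, D) classifier-head weights
    targets:    (N,)   indices of the correct tokens
    Returns the mean cross-entropy loss over the N tokens.
    """
    n = embeddings.shape[0]
    # Logit of the correct token only: one dot product per input token.
    correct_logit = (embeddings * classifier[targets]).sum(dim=-1)          # (N,)
    # Running log-sum-exp over the vocabulary, accumulated chunk by chunk,
    # so only an (N, chunk_size) block of logits exists at any time.
    lse = torch.full((n,), float("-inf"), device=embeddings.device,
                     dtype=embeddings.dtype)
    for start in range(0, classifier.shape[0], chunk_size):
        block = embeddings @ classifier[start:start + chunk_size].T         # (N, chunk)
        lse = torch.logaddexp(lse, torch.logsumexp(block, dim=-1))
    # Cross-entropy: -log softmax(correct) = lse - correct_logit.
    return (lse - correct_logit).mean()
```

As a usage note, calling this with `embeddings = torch.randn(8, 64)`, `classifier = torch.randn(32000, 64)`, and random `targets` gives the same value (up to floating-point error) as `torch.nn.functional.cross_entropy(embeddings @ classifier.T, targets)`, while peak activation memory scales with the chunk size rather than the vocabulary size.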