As language models grow ever larger, so do their vocabularies. This has shifted the memory footprint of LLM training disproportionately onto a single layer: the cross-entropy in the loss computation. Cross-entropy builds a logit matrix with an entry for every pair of input token and vocabulary item and, for small models, consumes an order of magnitude more memory than the rest of the LLM combined. We propose Cut Cross-Entropy (CCE), a method that computes the cross-entropy loss without materializing the logits for all tokens in global memory. Instead, CCE computes only the logit for the correct token and evaluates the log-sum-exp over all logits on the fly. We implement a custom kernel that performs the matrix multiplications and the log-sum-exp reduction over the vocabulary in flash memory, making the global memory consumption of the cross-entropy computation negligible. This has a dramatic effect. Taking the Gemma 2 (2B) model as an example, CCE reduces the memory footprint of the loss computation from 24 GB to 1 MB, and the total training-time memory consumption of the classifier head from 28 GB to 1 GB. To improve the throughput of CCE, we exploit the inherent sparsity of softmax and propose skipping elements of the gradient computation that make a negligible (i.e., below numerical precision) contribution to the gradient. Experiments show that this dramatic reduction in memory consumption is achieved without sacrificing training speed or convergence.
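To illustrate the core identity behind CCE, the following PyTorch sketch computes the per-token loss as the log-sum-exp over the vocabulary minus the correct-token logit, streaming the reduction over vocabulary chunks so the full tokens-by-vocabulary logit matrix is never held in memory at once. This is only a minimal illustration of the forward-pass structure, not the paper's fused kernel; the function and parameter names (chunked_cross_entropy, chunk_size) are illustrative, and the chunked matmul here still writes small per-chunk logit blocks to global memory, whereas the custom kernel keeps them in flash memory and also exploits softmax sparsity in the backward pass.

```python
import torch

def chunked_cross_entropy(embeddings: torch.Tensor,
                          classifier: torch.Tensor,
                          targets: torch.Tensor,
                          chunk_size: int = 4096) -> torch.Tensor:
    """Cross-entropy without a full logit matrix (illustrative sketch).

    embeddings: (N, D) hidden states of the input tokens
    classifier: (V, D) classifier-head weights
    targets:    (N,)   indices of the correct tokens
    Returns the mean cross-entropy loss over the N tokens.
    """
    n = embeddings.shape[0]
    # Logit of the correct token only: one dot product per input token.
    correct_logit = (embeddings * classifier[targets]).sum(dim=-1)          # (N,)
    # Running log-sum-exp over the vocabulary, accumulated chunk by chunk,
    # so only an (N, chunk_size) block of logits exists at any time.
    lse = torch.full((n,), float("-inf"), device=embeddings.device,
                     dtype=embeddings.dtype)
    for start in range(0, classifier.shape[0], chunk_size):
        block = embeddings @ classifier[start:start + chunk_size].T         # (N, chunk)
        lse = torch.logaddexp(lse, torch.logsumexp(block, dim=-1))
    # Cross-entropy: -log softmax(correct) = lse - correct_logit.
    return (lse - correct_logit).mean()
```

As a usage note, calling this with `embeddings = torch.randn(8, 64)`, `classifier = torch.randn(32000, 64)`, and random `targets` gives the same value (up to floating-point error) as `torch.nn.functional.cross_entropy(embeddings @ classifier.T, targets)`, while peak activation memory scales with the chunk size rather than the vocabulary size.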