In the rapidly evolving world of natural language processing (NLP), there is strong demand for coherent and controllable text generation, as discussed in the work Toward Controlled Generation of Text. Traditional autoregressive models like GPT, which have long been the industry standard, have inherent limitations that can manifest as repetitive, low-quality output, as shown in The Curious Case of Neural Text Degeneration. This is largely due to a phenomenon known as “exposure bias,” described in Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks: a mismatch between the way these models are trained and the way they are used at inference, which often leads to the accumulation of errors during text generation.
To address these challenges, we want to draw attention to the latent text diffusion model we introduced in fall 2023. The model combines non-autoregressive latent semantic diffusion with autoregressive decoding to overcome the obstacles its predecessors faced. With this research, we hope to improve the experience of users who benefit from more diversified and controlled text generation. By adopting a latent diffusion approach (as discussed in High-Resolution Image Synthesis with Latent Diffusion Models and Latent Diffusion for Language Generation), PLANNER mitigates the computational overhead typically associated with similar models while offering better diversity and cohesion, and it reduces repetition in the generated text, particularly in blocks of text and longer paragraphs, which have traditionally posed a challenge for text generation models.
Our model, PLANNER, extends these benefits to various text generation tasks, such as semantic generation, text completion, and summarization, with comprehensive evaluations of fluency, diversity, and repetition mitigation.
In stage 1 of Figure 1, a variational paragraph embedder encodes paragraphs into a series of latent codes. The encoder E and the decoder D form a bidirectional mapping between the discrete data space and the latent code space. Paragraph embeddings z are extracted by taking the first k hidden-state vectors of dimension h from the final layer of E; these are fed into the initial positions of the decoder, which is trained to reconstruct the original text x. BOS and EOS denote the “beginning of sentence” and “end of sentence” tokens, respectively.
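The sketch below illustrates this paragraph-embedding idea at a high level, assuming a BERT-style encoder from Hugging Face Transformers. The class and argument names (ParagraphEmbedder, num_latents) are illustrative assumptions rather than names from the PLANNER code, and the variational regularization of the latent space is omitted for brevity.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class ParagraphEmbedder(nn.Module):
    """Minimal sketch of stage 1: map a paragraph to k latent codes of dimension h."""
    def __init__(self, encoder_name: str = "bert-base-uncased", num_latents: int = 16):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(encoder_name)
        self.encoder = AutoModel.from_pretrained(encoder_name)  # plays the role of E
        self.num_latents = num_latents                          # k latent codes per paragraph

    def encode(self, paragraphs: list[str]) -> torch.Tensor:
        """Return latent codes z of shape (batch, k, h)."""
        batch = self.tokenizer(paragraphs, padding="max_length", truncation=True,
                               max_length=128, return_tensors="pt")
        hidden = self.encoder(**batch).last_hidden_state        # (batch, seq_len, h)
        # Take the first k hidden-state vectors of the final layer as z;
        # a decoder D would then be trained to reconstruct x from these codes.
        return hidden[:, : self.num_latents, :]

embedder = ParagraphEmbedder()
z = embedder.encode(["The hotel was quiet, clean, and close to the beach."])
print(z.shape)  # torch.Size([1, 16, 768])
```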
In stage 2 of Figure 1, these latent codes z are used to train a transformer-based latent diffusion model (as discussed in the paper Scalable Diffusion Models with Transformers), so that at inference time it can generate new latent codes step by step, simulating the evolution of text from coarse to fine. Finally, in stage 3, the decoder D translates these evolving latent codes into coherent text.
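The following sketch shows the general shape of such a coarse-to-fine inference loop, assuming a generic denoising network and a simple DDPM-style update rule. The names sample_latents and denoiser, and the noise schedule, are illustrative assumptions, not PLANNER's actual interfaces or hyperparameters.

```python
import torch

@torch.no_grad()
def sample_latents(denoiser, shape=(1, 16, 768), num_steps=1000):
    """Coarse-to-fine sampling sketch: start from Gaussian noise and
    iteratively denoise to obtain latent paragraph codes z_0."""
    # Simple linear noise schedule; the actual schedule may differ.
    betas = torch.linspace(1e-4, 0.02, num_steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    z = torch.randn(shape)                          # z_T ~ N(0, I)
    for t in reversed(range(num_steps)):
        eps_hat = denoiser(z, t)                    # predicted noise at step t
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        mean = (z - coef * eps_hat) / torch.sqrt(alphas[t])
        noise = torch.randn_like(z) if t > 0 else torch.zeros_like(z)
        z = mean + torch.sqrt(betas[t]) * noise     # one step closer to z_0
    return z                                        # final latents, handed to decoder D

# Usage with a dummy denoiser standing in for the trained diffusion transformer:
z0 = sample_latents(lambda z, t: torch.zeros_like(z), num_steps=50)
print(z0.shape)  # torch.Size([1, 16, 768])
```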
Our PLANNER latent diffusion model takes the conditioning signal as plain text, such as the preceding context or the document to be summarized. We apply a conditional feature encoder τ to this input and use the hidden states of its last layer as y. We feed y, together with the time-step embedding t, into the latent diffusion model through two channels: cross-attention and adaptive layer normalization. The goal of our research is to use existing text, such as an email or a document to summarize, to help generate longer texts that are both cohesive and readable. The examples in the following two figures are taken from a public data set of hotel reviews.
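Below is a minimal sketch of a transformer block that injects the conditioning in these two ways, written in PyTorch. The module and variable names (ConditionedBlock, time_mlp) and the dimensions are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class ConditionedBlock(nn.Module):
    """One diffusion-transformer block conditioned on text features y and
    time step t via cross-attention and adaptive layer norm (AdaLN)."""
    def __init__(self, dim: int = 768, heads: int = 12):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        # AdaLN: predict per-channel scale and shift from the time embedding.
        self.time_mlp = nn.Linear(dim, 2 * dim)

    def forward(self, z, y, t_emb):
        # Adaptive layer norm driven by the diffusion time-step embedding t.
        scale, shift = self.time_mlp(t_emb).chunk(2, dim=-1)
        h = self.norm(z) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
        h = z + self.self_attn(h, h, h, need_weights=False)[0]
        # Cross-attention: the latent codes attend to the conditioning features y.
        h = h + self.cross_attn(self.norm(h), y, y, need_weights=False)[0]
        return h + self.mlp(self.norm(h))

block = ConditionedBlock()
z = torch.randn(1, 16, 768)      # noisy latent codes
y = torch.randn(1, 64, 768)      # hidden states from the conditional encoder τ
t_emb = torch.randn(1, 768)      # embedding of the diffusion time step
print(block(z, y, t_emb).shape)  # torch.Size([1, 16, 768])
```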
Figure 2 compares two language models: a fine-tuned GPT-2 large model and our method. It shows how each model handles a prompt designed to evaluate its ability to generate diversified text from a repetitive signal. We selected GPT-2 because it was the most relevant model at the time of this research. The fine-tuned GPT-2 large model was initialized from GPT-2 large, which has 774 million parameters. As for publicly available versions of GPT-2, OpenAI has released several model sizes, including a large version that researchers and developers can access. However, the particular fine-tuned version used in our paper, PLANNER: Generating Diversified Paragraph via Latent Language Diffusion Model, may include proprietary data set adjustments and may not be directly available.
- FT stands for fine-tuning, the process of taking a pre-trained model and further training it on a new data set to specialize its knowledge.
- Greedy decoding is a method in which, at each step of text generation, the model chooses the word with the highest probability.
- Nucleus (top-p) sampling is a technique in which the model samples from the smallest set of likely words whose cumulative probability exceeds p, allowing for greater randomness and potential creativity in its output, as discussed in the paper The Curious Case of Neural Text Degeneration; see the sketch after this list.
- Generating 512 outputs refers to the number of times the model generates text to test its capabilities. In this context, it means that the model was run on the prompt 512 times for evaluation.
- N-grams are sequences of N tokens.
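The sketch below contrasts greedy decoding with nucleus (top-p) sampling on a short continuation, assuming Hugging Face's publicly available GPT-2 checkpoint. It is a simplified illustration of the decoding strategies named above, not the evaluation code from the paper.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
prompt = "The hotel was horrible."
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# Greedy decoding: always pick the single most likely next token.
greedy = model.generate(input_ids, max_new_tokens=40, do_sample=False,
                        pad_token_id=tokenizer.eos_token_id)

# Nucleus (top-p) sampling: sample from the smallest set of tokens whose
# cumulative probability exceeds p, which yields more diverse continuations.
nucleus = model.generate(input_ids, max_new_tokens=40, do_sample=True, top_p=0.95,
                         pad_token_id=tokenizer.eos_token_id)

print(tokenizer.decode(greedy[0], skip_special_tokens=True))
print(tokenizer.decode(nucleus[0], skip_special_tokens=True))
```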
The percentage numbers in the n-gram columns indicate how frequently each n-gram occurs within the text generated by a given method. A lower maximum percentage suggests a greater variety of distinct n-grams, which is generally desirable for producing text that is less repetitive and more diverse.
“More diversified” means that the generated word sequences (n-grams) are more varied and less repetitive than those produced by other methods or models. This diversification generally indicates higher-quality text generation, which is more likely to yield useful and novel content for users.
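As a rough illustration of how such a statistic can be computed, the sketch below measures the share of the single most frequent n-gram among all n-grams in a passage. It assumes simple whitespace tokenization and is not the exact metric implementation used in the paper.

```python
from collections import Counter

def max_ngram_percentage(text: str, n: int = 4) -> float:
    """Return the share (in %) of the most frequent n-gram among all n-grams
    in the text; lower values indicate more diverse, less repetitive output."""
    tokens = text.split()                      # naive whitespace tokenization
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    return 100.0 * counts.most_common(1)[0][1] / len(ngrams)

repetitive = "horrible hotel " * 20
diverse = "the staff were friendly, the room was clean, and breakfast was great"
print(max_ngram_percentage(repetitive, n=2))   # high: one bigram dominates
print(max_ngram_percentage(diverse, n=2))      # low: bigrams rarely repeat
```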
Finally, we see cumulative errors in traditional autoregressive models such as GPT-2, where the model gets stuck in a loop and produces repetitive or unhelpful output. In the given context, the repeated phrase “horrible hotel” in the text generated by GPT-2 is an example of such a cumulative error.
Figure 3 illustrates the gradual evolution of the generated text over a series of 10 steps. The model starts with initial rough predictions (shown in Figure 3 as step 1, the initial state) and progresses through repeated denoising steps that remove noise and refine the text.
The reader should imagine this scenario not as a snapshot of text entered or prompted by an iPhone user, but as a systematic process by which a language model refines an initially vague or broad expression into more detailed and specific review text. In step 1, the text is a rough suggestion of what the user might want to express: it is terse and lacks detail. As the steps proceed, the model refines the text, introducing more specific descriptions, sentiments, and sophisticated language. By step 10, the final state, the generated text resembles a carefully written review one might expect from an experienced reviewer who pays close attention to the various aspects of their hotel stay.
Thus, Figure 3 shows how PLANNER's generation progresses from coarse to fine, giving readers a step-by-step visualization of how the text is iteratively enhanced to improve readability, specificity, and overall quality. The scenario begins with a minimal outline of positive sentiment and, over time, becomes a fully developed review, with vivid details emerging at each subsequent step.
Conclusion
The PLANNER model represents an advance in the pursuit of improved natural language generation. By addressing the challenge of cumulative errors in traditional autoregressive models, our model leverages latent semantic diffusion to generate fluent, controlled, and diversified text.
Acknowledgments
Many people contributed to this work, including Richard Bai, Ronan Collobert, Zhe Gan, David Grangier, Edouard Grave, Tatiana Likhomanenko, Barry Theobald, Yinfei Yang, and Yizhe Zhang.
Apple Resources
Xu, Jin, Xiaojiang Liu, Jianhao Yan, Deng Cai, Huayang Li, and Jian Li. 2022. “Learning to Break the Loop: Analyzing and Mitigating Repetitions for Neural Text Generation.” (link)
Zhang, Yizhe, Jiatao Gu, Zhuofeng Wu, Shuangfei Zhai, Josh Susskind, and Navdeep Jaitly. 2023. “PLANNER: Generating Diversified Paragraph via Latent Language Diffusion Model.” (link)
External references
Bengio, Samy, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. 2015. “Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks.” (link)
Holtzman, Ari, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. 2020. “The Curious Case of Neural Text Degeneration.” (link)
Hu, Zhiting, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, and Eric P. Xing. 2017. “Toward Controlled Generation of Text.” (link)
Keskar, Nitish Shirish, Bryan McCann, Lav R. Varshney, Caiming Xiong, and Richard Socher. 2019. “CTRL: A Conditional Transformer Language Model for Controllable Generation.” (link)
Lovelace, Justin, Varsha Kishore, Chao Wan, Eliot Shekhtman, and Kilian Q. Weinberger. 2023. “Latent Diffusion for Language Generation.” (link: https://doi.org/10.48550/arXiv.2212.09462)
Peebles, William, and Saining Xie. 2022. “Scalable Diffusion Models with Transformers.” (link)
Rombach, Robin, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2022. “High-Resolution Image Synthesis with Latent Diffusion Models.” (link)