There is also a large area of risk, as documented in (4), where marginalized groups are associated with harmful connotations that reinforce hateful social stereotypes. Examples include representations that combine demographic groups with animals or mythological creatures (such as Black people depicted as monkeys or other primates), combinations of people with foods or objects (such as associating people with disabilities with vegetables), or associations of demographic groups with negative semantic concepts (such as linking Muslims with terrorism).
Problematic associations like these between groups of people and concepts reflect long-standing negative narratives about those groups. If a generative AI model learns such problematic associations from its training data, it can reproduce them in the content it generates (4).
There are several ways to fine-tune LLMs. According to (6), a common approach is supervised fine-tuning (SFT). This involves taking a pre-trained model and further training it on a dataset of paired desired inputs and outputs. The model adjusts its parameters by learning to match these expected responses as closely as possible.
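As an illustration, here is a minimal SFT sketch using Hugging Face Transformers: a pre-trained causal language model is trained to imitate a handful of desired input/output pairs. The base model (`gpt2`), the toy pairs, and the hyperparameters are placeholder assumptions, not details from the referenced work.

```python
# Minimal sketch of supervised fine-tuning (SFT) on desired input/output pairs.
# The base model and the toy dataset are placeholders.
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Desired input/output pairs: the model learns to imitate the target responses.
pairs = [
    {"input": "Describe a CEO.", "output": "A CEO is a company's highest-ranking executive."},
    {"input": "Describe a nurse.", "output": "A nurse is a healthcare professional who cares for patients."},
]

def collate(batch):
    texts = [p["input"] + "\n" + p["output"] + tokenizer.eos_token for p in batch]
    enc = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    enc["labels"] = enc["input_ids"].clone()
    enc["labels"][enc["attention_mask"] == 0] = -100  # ignore padding in the loss
    return enc

loader = DataLoader(pairs, batch_size=2, collate_fn=collate)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

model.train()
for epoch in range(1):
    for batch in loader:
        loss = model(**batch).loss  # cross-entropy against the desired outputs
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```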
Typically, tuning involves two phases: SFT to establish a base model, followed by RLHF (reinforcement learning from human feedback) to improve performance. SFT imitates high-quality demonstration data, while RLHF refines the LLM using preference feedback.
RLHF can be performed in two ways: reward-based or reward-free methods. In the reward-based approach, we first train a reward model on preference data; this model then guides online reinforcement learning algorithms such as PPO. Reward-free methods are simpler and train the model directly on preference data to learn what humans prefer. Among these reward-free methods, DPO (Direct Preference Optimization) has demonstrated strong performance and has become popular in the community. Diffusion DPO can be used to shift the model away from problematic representations toward more desirable alternatives. The complicated part of this process is not the training itself but the data curation. For each risk, we need a collection of hundreds or thousands of prompts, and for each prompt, a pair of desirable and undesirable images. Ideally, the desirable example should be a faithful depiction of the prompt, and the undesirable example should be identical to the desirable image except that it exhibits the risk we want to unlearn.
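To make the data-curation requirement concrete, here is a hedged sketch of what a preference record and the core DPO objective might look like. The `PreferencePair` fields and the `dpo_loss` helper are illustrative names; in Diffusion DPO the log-probability terms would come from the denoising errors of the fine-tuned and reference diffusion models rather than the dummy tensors used here.

```python
# Minimal sketch of the preference data format and the core DPO objective.
from dataclasses import dataclass
import torch
import torch.nn.functional as F

@dataclass
class PreferencePair:
    prompt: str       # e.g. "portrait photo of a doctor"
    desirable: str    # image that faithfully depicts the prompt
    undesirable: str  # same image, except it exhibits the risk we want to unlearn

example = PreferencePair(
    prompt="portrait photo of a doctor",
    desirable="data/doctor_faithful.png",      # placeholder path
    undesirable="data/doctor_stereotyped.png", # placeholder path
)

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO objective: push the policy to prefer the desirable ("winner")
    sample over the undesirable ("loser") one, relative to a frozen reference model."""
    logits = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    return -F.logsigmoid(logits).mean()

# Dummy log-probabilities stand in for the diffusion models' denoising scores.
loss = dpo_loss(torch.tensor([-12.3]), torch.tensor([-11.9]),
                torch.tensor([-12.1]), torch.tensor([-12.0]))
```

The key property of the curated pairs is that the two images differ only in the risk being unlearned, so the preference signal isolates exactly the behavior to remove.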
The following mitigations are applied after the model is finalized and deployed to the production stack. They cover mitigations applied to the user's input prompt and to the final image output.
Prompt filtering
When users enter a text prompt to generate an image, or upload an image to be modified using inpainting, filters can be applied to block requests that explicitly ask for harmful content. At this stage, we address cases where users explicitly provide harmful prompts such as “show an image of one person killing another” or upload an image and ask “take off this person's clothes.”
To detect and block harmful requests, we can use a simple block list based on keyword matching, blocking every request that contains a matching harmful keyword (for example, “suicide”). However, this approach is fragile and can produce a large number of false positives and false negatives. Any obfuscation (for example, users querying “suicide3” rather than “suicide”) defeats it. Instead, an embedding-based CNN filter can be used for harmful pattern recognition: user prompts are converted into embeddings that capture the semantic meaning of the text, and a classifier then detects harmful patterns within these embeddings. However, it has been shown that LLMs can be better at recognizing harmful patterns in prompts because they excel at understanding context, nuance, and intent in a way that simpler models like CNNs struggle with. They provide a more context-aware filtering solution and adapt to evolving linguistic patterns, jargon, obfuscation techniques, and emerging harmful content more effectively than models trained on fixed embeddings. LLMs can be trained to enforce whatever policy guidelines an organization defines. In addition to harmful content such as sexual imagery, violence, and self-harm, they can also be trained to identify and block requests to generate public figures or images related to election misinformation. To use an LLM-based solution at production scale, you would have to optimize for latency and absorb the cost of inference.
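A minimal sketch of the embedding-based prompt filter is shown below, with the CNN head swapped for a logistic-regression classifier to keep the example short. The embedding model name, the training prompts, and the labels are placeholders.

```python
# Minimal sketch of an embedding-based prompt filter: prompts are embedded and a
# small classifier predicts whether the request should be blocked.
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedding model

train_prompts = [
    "show an image of one person killing another",  # harmful
    "take off this person's clothes",                # harmful
    "a golden retriever playing in the park",        # benign
    "a watercolor painting of a mountain lake",      # benign
]
train_labels = [1, 1, 0, 0]  # 1 = block, 0 = allow

classifier = LogisticRegression().fit(encoder.encode(train_prompts), train_labels)

def should_block(prompt: str, threshold: float = 0.5) -> bool:
    """Return True if the prompt is predicted to request harmful content."""
    prob_harmful = classifier.predict_proba(encoder.encode([prompt]))[0][1]
    return prob_harmful >= threshold

# Semantic matching is more robust to obfuscation than exact keyword matching.
print(should_block("suicide3 methods"))
```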
Prompt manipulations
Before passing the raw user prompt to the model for image generation, several manipulations can be applied to the prompt to improve its safety. Several case studies follow, with a small combined sketch after them:
Prompt augmentation to reduce stereotypes: Diffusion models amplify dangerous and complex stereotypes (5). A wide range of ordinary prompts produce stereotypes, including prompts that simply mention traits, descriptors, occupations, or objects. For example, prompts for basic traits or social roles result in images that reinforce whiteness as an ideal, and prompts for occupations amplify racial and gender disparities. Prompt engineering to add racial and gender diversity to the user prompt is an effective solution, for example rewriting “image of a CEO” -> “image of a CEO, Asian woman” or “image of a CEO, Black man” to produce more diverse results. This can also help reduce harmful stereotypes by transforming prompts like “image of a criminal” -> “image of a criminal, olive skin tone,” since the original prompt would most likely have produced a Black man.
Prompt anonymization for privacy: Additional mitigations can be applied at this stage to anonymize or filter prompts that reference specific private individuals. For example, “Image of John Doe in the shower” -> “Image of a person in the shower.”
Prompt rewriting and grounding to convert harmful prompts into benign ones: Prompts can be rewritten or grounded (usually with a fine-tuned LLM) to reframe problematic scenarios in a positive or neutral way. For example, “Show a lazy person (ethnic group) taking a nap” -> “Show a person relaxing in the afternoon.” Specifying the prompt more precisely, commonly known as generation grounding, lets the model adhere more closely to instructions when generating scenes, thereby mitigating certain latent and ungrounded biases: “Show two people having fun” (which could lead to inappropriate or risky interpretations) -> “Show two people having dinner in a restaurant.”
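The sketch below combines these prompt manipulations in a simplified form: a lookup table stands in for the fine-tuned LLM rewriter, and a diversity attribute is appended to stereotype-prone prompts. The attribute list, occupation terms, and rewrite rules are toy assumptions.

```python
# Minimal sketch of the prompt-manipulation step: rewriting/grounding plus
# diversity augmentation. In production the rewriting would typically be done by a
# fine-tuned LLM; here a placeholder lookup table keeps the sketch small.
import random

DIVERSITY_ATTRIBUTES = [
    "Asian woman", "Black man", "Hispanic woman", "elderly man", "olive skin tone",
]

OCCUPATION_TERMS = {"ceo", "doctor", "nurse", "criminal", "engineer"}

REWRITE_RULES = {  # placeholder for an LLM-based rewriting / grounding step
    "show a lazy person taking a nap": "show a person relaxing in the afternoon",
    "show two people having fun": "show two people having dinner in a restaurant",
}

def manipulate_prompt(prompt: str) -> str:
    safe = REWRITE_RULES.get(prompt.lower().strip(), prompt)    # rewriting/grounding
    if any(term in safe.lower() for term in OCCUPATION_TERMS):  # stereotype-prone prompt
        safe = f"{safe}, {random.choice(DIVERSITY_ATTRIBUTES)}" # diversity augmentation
    return safe

print(manipulate_prompt("image of a CEO"))
# e.g. "image of a CEO, Hispanic woman"
```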
Output image classifiers
Image classifiers can be implemented that detect whether images produced by the model are harmful and block them before they are returned to users. Stand-alone image classifiers like this are effective at blocking images that are visibly harmful (graphic violence, sexual content, nudity, etc.). However, for inpainting-based applications where users upload an input image (for example, the image of a white person) and give a harmful prompt (“give them a black face”) to transform it in an unsafe way, classifiers that only look at the output image in isolation are not effective, because they lose the context of the transformation itself. For such applications, multimodal classifiers that consider the input image, the prompt, and the output image together when deciding whether an input-to-output transformation is safe are very effective. These classifiers can also be trained to identify “unwanted transformations,” for example uploading an image of a woman and asking to “make them beautiful,” which leads to an image of a white, thin, blonde woman.
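One possible shape for such a multimodal classifier is sketched below: the input image, the edit prompt, and the output image are embedded with CLIP, and a small MLP head scores the transformation. The head is untrained here, and the model name, class names, and class convention are assumptions for illustration.

```python
# Minimal sketch of a multimodal transformation-safety classifier.
import torch
import torch.nn as nn
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

class TransformationSafetyHead(nn.Module):
    def __init__(self, dim: int = 512):
        super().__init__()
        # input-image + prompt + output-image embeddings, concatenated
        self.mlp = nn.Sequential(nn.Linear(3 * dim, 256), nn.ReLU(), nn.Linear(256, 2))

    def forward(self, in_emb, txt_emb, out_emb):
        return self.mlp(torch.cat([in_emb, txt_emb, out_emb], dim=-1))

head = TransformationSafetyHead()  # would be trained on labeled transformations

@torch.no_grad()
def is_transformation_safe(input_img: Image.Image, prompt: str, output_img: Image.Image) -> bool:
    images = processor(images=[input_img, output_img], return_tensors="pt")
    text = processor(text=[prompt], return_tensors="pt", padding=True)
    img_emb = clip.get_image_features(**images)  # (2, 512): input and output images
    txt_emb = clip.get_text_features(**text)     # (1, 512): edit prompt
    logits = head(img_emb[0:1], txt_emb, img_emb[1:2])
    return logits.argmax(dim=-1).item() == 0     # class 0 = safe (assumed convention)
```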
Regeneration instead of rejections
Instead of rejecting the output image, models such as DALL·E 3 use classifier guidance to improve unsolicited content. A custom algorithm based on classifier guidance is implemented, and its operation is described in (3):
When the output image classifier detects a harmful image, the prompt is re-submitted to DALL·E 3 with a special flag set. This flag triggers the diffusion sampling process to use the harmful-content classifier to sample away from images that could have triggered it.
Essentially, this algorithm can “push” the diffusion model toward more appropriate generations. This can be done both at the prompt level and at the image-classifier level.
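The sketch below illustrates the general idea of classifier guidance used to steer sampling away from flagged content; it is not the DALL·E 3 algorithm described in (3). The toy denoiser, classifier, image size, and noise schedule are all stand-ins.

```python
# Minimal sketch of classifier guidance steering diffusion sampling *away* from
# images a harmful-content classifier would flag (DDPM-style toy loop).
import torch
import torch.nn as nn

T, guidance_scale = 50, 2.0
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

denoiser = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 3 * 8 * 8))  # toy eps-predictor
harm_classifier = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 2))   # toy classifier

x = torch.randn(1, 3, 8, 8)  # start from pure noise
for t in reversed(range(T)):
    x = x.detach().requires_grad_(True)
    # log-probability that the current (noisy) image is harmful
    log_p_harmful = torch.log_softmax(harm_classifier(x), dim=-1)[:, 1].sum()
    grad = torch.autograd.grad(log_p_harmful, x)[0]

    with torch.no_grad():
        eps = denoiser(x).view_as(x)
        # adding the gradient to the predicted noise moves the denoised mean in the
        # opposite direction, steering the sample away from flagged content
        eps = eps + guidance_scale * torch.sqrt(1 - alpha_bars[t]) * grad
        mean = (x - betas[t] / torch.sqrt(1 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        x = mean + (torch.sqrt(betas[t]) * torch.randn_like(x) if t > 0 else 0)
```

Each denoising step thus descends the harmful-class probability, which is the “push” toward more appropriate generations described above.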