Contrastive models such as CLIP have been shown to learn robust image representations that capture both semantics and style. To leverage these representations for image generation, we propose a two-stage model: a prior that generates a CLIP image embedding given a text caption, and a decoder that generates an image conditioned on the image embedding. We show that explicitly generating image representations improves image diversity with minimal loss in photorealism and caption similarity. Our decoders conditioned on image representations can also produce variations of an image that preserve both its semantics and style, while varying the non-essential details absent from the image representation. Moreover, the joint embedding space of CLIP enables language-guided image manipulations in a zero-shot fashion. We use diffusion models for the decoder and experiment with both autoregressive and diffusion models for the prior, finding that the latter are computationally more efficient and produce higher-quality samples.
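
To make the two-stage structure concrete, the sketch below shows the sampling pipeline (caption → prior → image embedding → decoder → image) with stub components. It is a minimal illustration only: the encoder, prior, and decoder here are placeholder functions, and the dimensions are assumed for demonstration, not the implementation described in this work.

```python
import numpy as np

# Assumed dimensions for illustration; not the sizes used in the actual model.
EMBED_DIM = 512
IMAGE_SIZE = 64

def encode_text(caption: str) -> np.ndarray:
    """Stand-in for a CLIP text encoder: maps a caption to a text embedding."""
    rng = np.random.default_rng(abs(hash(caption)) % (2**32))
    return rng.standard_normal(EMBED_DIM)

def prior(text_embedding: np.ndarray) -> np.ndarray:
    """Stage 1 (stub): generate a CLIP *image* embedding from the text embedding.
    In the paper this is an autoregressive or diffusion prior; here it merely
    perturbs the text embedding to stand in for that sampling step."""
    noise = np.random.standard_normal(EMBED_DIM)
    return text_embedding + 0.1 * noise

def decoder(image_embedding: np.ndarray) -> np.ndarray:
    """Stage 2 (stub): generate an image conditioned on the image embedding.
    In the paper this is a diffusion decoder; here it returns a random image
    seeded by the embedding, purely for illustration."""
    seed = int(np.abs(image_embedding).sum() * 1e3) % (2**32)
    rng = np.random.default_rng(seed)
    return rng.uniform(0.0, 1.0, size=(IMAGE_SIZE, IMAGE_SIZE, 3))

def generate(caption: str) -> np.ndarray:
    """Two-stage sampling: caption -> prior -> image embedding -> decoder -> image."""
    z_text = encode_text(caption)
    z_image = prior(z_text)   # explicitly generate an image representation
    return decoder(z_image)   # generate an image conditioned on that representation

if __name__ == "__main__":
    img = generate("a corgi playing a flame-throwing trumpet")
    print(img.shape)  # (64, 64, 3)
```

Because the decoder conditions only on the image embedding, resampling it with the same embedding yields variations that keep the semantics and style captured by the representation while varying non-essential details.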