Pre-training robust vision or multimodal foundation models (e.g., CLIP) relies on large-scale data sets that can be noisy, potentially misaligned, and long-tailed. Previous work has shown promising results in augmenting data sets with synthetically generated samples. However, these methods support only domain-specific, ad hoc use cases (e.g., images or text alone, but not both) and offer limited data diversity because they lack fine-grained control over the synthesis process. In this paper, we design a controllable image-text synthesis pipeline, CtrlSynth, for robust and data-efficient multimodal learning. The key idea is to decompose the visual semantics of an image into basic elements, apply user-specified control policies (e.g., delete, add, or replace operations), and recompose the elements to synthesize images or texts. The decompose-and-recompose design in CtrlSynth lets users control data synthesis at a fine granularity by defining custom control policies that manipulate the basic elements. CtrlSynth leverages pre-trained foundation models, such as large language models or diffusion models, to reason over and recompose the basic elements, so that synthetic samples are natural and diverse in composition. CtrlSynth is a modular, training-free, closed-loop framework that readily supports different pre-trained models. In extensive experiments on 31 data sets spanning different vision and vision-language tasks, we demonstrate that CtrlSynth substantially improves the zero-shot classification, image-text retrieval, and compositional reasoning performance of CLIP models.
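To make the decompose-control-recompose idea concrete, below is a minimal, hypothetical Python sketch of the pipeline loop. The data structure `VisualElements`, the policy format, and the function names are illustrative assumptions rather than the paper's implementation, and the language-model and diffusion-model recomposers are stubbed out as plain callables.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical container for the basic elements decomposed from an image
# (objects, attributes, relations); the schema is an assumption for illustration.
@dataclass
class VisualElements:
    objects: List[str]      # e.g., ["dog", "frisbee"]
    attributes: List[str]   # e.g., ["brown"]
    relations: List[str]    # e.g., ["catching"]

def apply_policy(elements: VisualElements, op: str, field: str, value: str) -> VisualElements:
    """Apply one user-specified control policy (add / delete / replace) to one element field."""
    items = list(getattr(elements, field))
    if op == "add":
        items.append(value)
    elif op == "delete":
        items = [x for x in items if x != value]
    elif op == "replace":
        old, new = value.split("->")
        items = [new if x == old else x for x in items]
    else:
        raise ValueError(f"unknown op: {op}")
    return VisualElements(**{**elements.__dict__, field: items})

def synthesize(
    elements: VisualElements,
    policies: List[Tuple[str, str, str]],
    recompose_text: Callable[[VisualElements], str],  # stand-in for an LLM-based recomposer
    recompose_image: Callable[[str], object],         # stand-in for a text-to-image model
):
    """Decompose -> apply control policies -> recompose into a synthetic (caption, image) pair."""
    for op, field, value in policies:
        elements = apply_policy(elements, op, field, value)
    caption = recompose_text(elements)
    image = recompose_image(caption)
    return caption, image

# Example usage: replace the object "dog" with "cat" and add the attribute "snowy".
if __name__ == "__main__":
    elems = VisualElements(objects=["dog", "frisbee"], attributes=["brown"], relations=["catching"])
    caption, _ = synthesize(
        elems,
        policies=[("replace", "objects", "dog->cat"), ("add", "attributes", "snowy")],
        recompose_text=lambda e: f"A {' and '.join(e.attributes)} {e.objects[0]} {e.relations[0]} a {e.objects[1]}",
        recompose_image=lambda prompt: None,  # placeholder for a diffusion-model call
    )
    print(caption)  # "A brown and snowy cat catching a frisbee"
```

In a real instantiation of the framework, the two recomposer callables would wrap a pre-trained large language model and a diffusion model, which is what allows the recomposed samples to remain natural rather than template-like.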