We introduce MIA-Bench, a new benchmark designed to evaluate multimodal large language models (MLLMs) on their ability to strictly adhere to complex instructions. Our benchmark comprises a diverse set of 400 image-cue pairs, each crafted to test a model's compliance with layered instructions in generating accurate responses that satisfy specific requested patterns. Evaluation of a wide range of state-of-the-art MLLMs reveals significant variation in performance, highlighting areas for improvement in instruction fidelity. Furthermore, we create additional training data and explore supervised fine-tuning to improve the models' ability to strictly follow instructions without compromising performance on other tasks. We hope that this benchmark will not only serve as a tool for measuring MLLMs' adherence to instructions, but also guide future developments in MLLM training methods.
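To make the notion of "layered instructions" concrete, the sketch below shows what a single benchmark item and an adherence check could look like. This is a minimal illustration under assumed conventions: the field names (`image`, `prompt`, `sub_instructions`), the regex-based checks, and the averaging scheme are hypothetical and are not the benchmark's actual data format or evaluation protocol.

```python
# Illustrative sketch only: a hypothetical MIA-Bench-style item and adherence check.
# All field names, patterns, and the scoring scheme are assumptions for exposition.
import re

example_item = {
    "image": "images/market_scene.jpg",  # hypothetical image path
    "prompt": ("Describe the image in exactly three sentences, "
               "mention at least two colors, and end with a question."),
    "sub_instructions": [  # layered constraints the response must satisfy
        {"name": "three_sentences", "pattern": r"^(?:[^.!?]*[.!?]){3}\s*$"},
        {"name": "ends_with_question", "pattern": r"\?\s*$"},
    ],
}

def adherence_score(response: str, item: dict) -> float:
    """Return the fraction of layered sub-instructions the response satisfies."""
    checks = [bool(re.search(s["pattern"], response.strip(), re.DOTALL))
              for s in item["sub_instructions"]]
    return sum(checks) / len(checks)

response = ("A vendor sells red apples. A child holds a blue balloon. "
            "Shall we buy some?")
print(adherence_score(response, example_item))  # 1.0: both constraints hold
```

In practice, grading instruction adherence may rely on stronger judges than simple pattern matching; the sketch is only meant to convey how a single image-cue pair can bundle several independently checkable constraints.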