Remarkable advances in multimodal large language models (MLLMs) have not rendered them immune to challenges, particularly their susceptibility to misleading information in prompts, which can cause them to produce hallucinated responses. To quantitatively evaluate this vulnerability, we present MAD-Bench, a carefully curated benchmark containing 1,000 test samples divided into 5 categories, such as non-existent objects, object count, and spatial relationship. We provide a comprehensive analysis of popular MLLMs, ranging from GPT-4v, Reka, and Gemini-Pro to open-source models such as LLaVA-NeXT and MiniCPM-Llama3. Empirically, we observe a significant performance gap between GPT-4o and the other models, and find that previously proposed robust instruction-tuned models are not effective on this new benchmark. While GPT-4o achieves an accuracy of 82.82% on MAD-Bench, the accuracy of every other model in our experiments ranges from 9% to 50%. We further propose a remedy that adds an additional paragraph to the misleading prompts to encourage models to think twice before answering the question. Surprisingly, this simple method can even double the accuracy; however, the absolute numbers are still too low to be satisfactory. We hope that MAD-Bench can serve as a valuable benchmark to stimulate further research on improving model resilience against misleading prompts.
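
To make the proposed remedy concrete, the minimal sketch below shows one way a cautionary paragraph could be prepended to a misleading prompt before querying a model. The wording of the paragraph and the `query_model` helper are illustrative assumptions for exposition, not the exact prompt or interface used in the experiments.

```python
# Illustrative sketch only: the cautionary wording and query_model() are
# assumptions, not the exact prompt or API used in the paper.

CAUTION_PARAGRAPH = (
    "Before answering, carefully check whether the premise of the question "
    "matches what is actually shown in the image. If the question refers to "
    "objects, counts, or relationships that are not present in the image, "
    "point out the inconsistency instead of answering as if it were true."
)

def harden_prompt(misleading_prompt: str) -> str:
    """Prepend a cautionary paragraph so the model 'thinks twice' before answering."""
    return f"{CAUTION_PARAGRAPH}\n\n{misleading_prompt}"

# Example usage with a hypothetical query_model(image, prompt) function:
# response = query_model(image, harden_prompt("How many dogs are in the picture?"))
```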