Behavioral testing in NLP enables fine-grained evaluation of systems by examining their linguistic capabilities through the analysis of input-output behavior. Unfortunately, existing work on behavioral testing in machine translation (MT) is currently limited to largely handcrafted tests covering a narrow range of capabilities and languages. To address this limitation, we propose to use large language models (LLMs) to generate a diverse set of source sentences tailored to test the behavior of MT models in a range of situations. We can then verify whether the MT model exhibits the expected behavior by matching its output against candidate sets that are also generated by an LLM. Our approach aims to make behavioral testing of MT systems practical while requiring only minimal human effort. In our experiments, we applied our proposed evaluation framework to assess multiple available MT systems, revealing that while overall pass rates follow the trends observable from traditional accuracy-based metrics, our method was able to uncover several important differences and potential errors that go unnoticed when relying solely on accuracy.
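To make the evaluation loop described above concrete, the sketch below shows one possible way to compute a pass rate for a behavioral test: an LLM is prompted to produce source sentences exercising a given capability and, for each one, a set of acceptable candidate translations; a test case passes if the MT output matches one of the candidates. The function names (`query_llm`, `translate`), the prompts, and the exact-match rule are hypothetical placeholders, not the paper's implementation.

```python
# Sketch of LLM-driven behavioral testing of an MT system.
# `query_llm` and `translate` are hypothetical stand-ins for an LLM API
# call and the MT system under test; exact matching against any candidate
# is just one simple way to check the expected behavior.

from typing import Callable, List


def generate_test_cases(query_llm: Callable[[str], List[str]],
                        capability: str, n: int) -> List[str]:
    """Ask the LLM for n source sentences that exercise a given capability."""
    prompt = (f"Write {n} English sentences that test the following "
              f"translation capability: {capability}. One per line.")
    return query_llm(prompt)[:n]


def generate_candidates(query_llm: Callable[[str], List[str]],
                        source: str, target_lang: str) -> List[str]:
    """Ask the LLM for a set of acceptable reference translations."""
    prompt = (f"Give several acceptable {target_lang} translations of: "
              f"{source}. One per line.")
    return query_llm(prompt)


def pass_rate(translate: Callable[[str], str],
              query_llm: Callable[[str], List[str]],
              capability: str, target_lang: str, n: int = 20) -> float:
    """Fraction of generated test cases whose MT output matches a candidate."""
    sources = generate_test_cases(query_llm, capability, n)
    passed = 0
    for src in sources:
        candidates = generate_candidates(query_llm, src, target_lang)
        hypothesis = translate(src).strip()
        # The test case passes if the MT output matches any LLM-generated candidate.
        if any(hypothesis == c.strip() for c in candidates):
            passed += 1
    return passed / max(len(sources), 1)
```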