Large language models (LLMs) have made substantial progress in recent months, outperforming state-of-the-art benchmarks in many domains. This article investigates the behavior of LLMs with respect to gender stereotypes, a well-known obstacle for previous models. We propose a simple paradigm to test for the presence of gender bias, building on but differentiating from WinoBias, a commonly used gender bias dataset that is likely to be included in the training data of current LLMs. We tested four recently published LLMs and showed that they express biased assumptions about men and women, specifically those aligned with people’s perceptions, rather than those based on facts. In addition, we study the explanations that the models give for their choices. In addition to explanations that rely explicitly on stereotypes, we find that a significant proportion of explanations are factually inaccurate and likely obscure the true reason behind the models’ choices. This highlights a key property of these models: LLMs are trained on imbalanced data sets; as such, even with learning reinforced with human feedback, they tend to reflect those imbalances in us. As with other types of social bias, we suggest that LLMs should be carefully evaluated to ensure that they treat minority individuals and communities equitably.