Recent advances in large language models (LLMs) have sparked interest in their formal reasoning capabilities, particularly in mathematics. The GSM8K benchmark is widely used to assess the mathematical reasoning of models on grade-school-level questions. While the performance of LLMs on GSM8K has improved significantly in recent years, it remains unclear whether their mathematical reasoning capabilities have genuinely advanced, raising questions about the reliability of the reported metrics. To address these concerns, we conduct a large-scale study on several state-of-the-art open and closed models. To overcome the limitations of existing evaluations, we introduce GSM-Symbolic, an improved benchmark built from symbolic templates that enable the generation of a diverse set of questions. GSM-Symbolic enables more controllable evaluations, providing key insights and more reliable metrics for measuring the reasoning capabilities of models. Our findings reveal that LLMs exhibit notable variance when responding to different instantiations of the same question. Specifically, the performance of all models declines when only the numerical values in the question are altered in the GSM-Symbolic benchmark. Furthermore, we investigate the fragility of mathematical reasoning in these models and show that their performance deteriorates significantly as the number of clauses in a question increases. We hypothesize that this decline is because current LLMs cannot perform genuine logical reasoning; instead, they replicate reasoning steps from their training data. Adding a single clause that seems relevant to the question causes significant performance drops (up to 65%) across all state-of-the-art models, even though the clause does not contribute to the reasoning chain needed for the final answer. Overall, our work offers a more nuanced understanding of the capabilities and limitations of LLMs in mathematical reasoning.
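
To make the symbolic-template mechanism concrete, below is a minimal Python sketch of how such a template might generate many instantiations of one underlying question. The template text, variable names, and sampling ranges here are illustrative assumptions, not the benchmark's actual implementation; the point is only that each draw changes surface values (names, numbers) while the ground-truth answer is derived from the same underlying relation.

```python
import random

# Hypothetical symbolic template with placeholders for a name and two
# numeric values; the correct answer is determined by the template's
# underlying relation (here, a simple sum).
TEMPLATE = (
    "When {name} watches her nephew, she gets out a variety of toys. "
    "The bag of building blocks has {x} blocks in it. The bin of "
    "stuffed animals holds {y} stuffed animals. "
    "How many toys are there in total?"
)

def sample_instance(rng: random.Random) -> tuple[str, int]:
    """Draw one concrete question/answer pair from the symbolic template."""
    name = rng.choice(["Sophie", "Liam", "Ava"])  # surface variation
    x = rng.randint(10, 100)                      # assumed sampling range
    y = rng.randint(10, 100)
    question = TEMPLATE.format(name=name, x=x, y=y)
    answer = x + y  # ground truth follows from the template, not the model
    return question, answer

rng = random.Random(0)
for _ in range(3):
    question, answer = sample_instance(rng)
    print(question, "->", answer)
```

Because every instantiation shares the same reasoning chain, comparing a model's accuracy across draws isolates sensitivity to surface changes (such as the numerical values) from the underlying reasoning task.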