In this article, we train end-to-end automatic speech recognition (ASR) models using federated learning (FL) and examine the fundamental considerations that can be instrumental in minimizing the word-error-rate gap between models trained with FL and their centralized counterparts. Specifically, we study the effect of (i) adaptive optimizers, (ii) loss characteristics via altering the Connectionist Temporal Classification (CTC) weight, (iii) model initialization through seed start, (iv) carrying over modeling setup from centralized training to FL, e.g., pre-layer or post-layer normalization, and (v) FL-specific hyperparameters, such as the number of local epochs, the client sampling size, and the learning rate scheduler, specifically for ASR under heterogeneous data distributions. We shed light on why some optimizers perform better than others by inducing smoothness. We also summarize the applicability of algorithms and trends, and propose best practices from prior work in FL (in general) toward end-to-end ASR models.
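To make the FL-specific hyperparameters above concrete, the following is a minimal, self-contained sketch of one federated training loop with an adaptive server-side optimizer (an Adam-style update on the averaged client delta), a sampled client cohort, multiple local epochs, and a fixed initialization seed. It uses a toy least-squares objective rather than an actual ASR model with a CTC loss; all function names (`local_update`, `fed_adam_round`) and hyperparameter values are illustrative assumptions, not the configuration studied in this article.

```python
import numpy as np

rng = np.random.default_rng(0)  # "seed start": fix the initialization seed

def local_update(weights, data, epochs=2, lr=0.1):
    """Run a few local epochs of gradient descent on one client's data."""
    w = weights.copy()
    X, y = data
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)  # least-squares gradient
        w -= lr * grad
    return w

def fed_adam_round(global_w, clients, m, v, server_lr=0.05,
                   beta1=0.9, beta2=0.99, eps=1e-3, cohort=4):
    """One FL round: sample a cohort, average client deltas,
    then apply an Adam-style adaptive update on the server."""
    sampled = rng.choice(len(clients), size=cohort, replace=False)
    deltas = [local_update(global_w, clients[i]) - global_w for i in sampled]
    delta = np.mean(deltas, axis=0)  # server-side pseudo-gradient
    m = beta1 * m + (1 - beta1) * delta
    v = beta2 * v + (1 - beta2) * delta ** 2
    return global_w + server_lr * m / (np.sqrt(v) + eps), m, v

# Heterogeneous clients: each has its own shifted target, simulating non-IID data.
d = 5
clients = []
for _ in range(10):
    X = rng.normal(size=(32, d))
    w_true = np.ones(d) + 0.3 * rng.normal(size=d)  # client-specific shift
    clients.append((X, X @ w_true))

def global_loss(w):
    return float(np.mean([0.5 * np.mean((X @ w - y) ** 2) for X, y in clients]))

w = np.zeros(d)
m, v = np.zeros(d), np.zeros(d)
loss_before = global_loss(w)
for _ in range(30):
    w, m, v = fed_adam_round(w, clients, m, v)
loss_after = global_loss(w)
```

The sketch also illustrates why the cohort size and the number of local epochs interact: more local epochs let each client drift further toward its own (heterogeneous) optimum, so the averaged delta the server optimizer sees becomes a noisier pseudo-gradient.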