While federated learning (FL) has recently emerged as a promising approach to training machine learning models, its application to automatic speech recognition (ASR) has so far been limited to preliminary explorations. Furthermore, FL does not inherently guarantee user privacy and requires the use of differential privacy (DP) for strong privacy guarantees. However, we are not aware of any prior work on applying DP to FL for ASR. In this paper, we aim to bridge this research gap by formulating an ASR benchmark for FL with DP and establishing the first baselines. First, we extend existing research on FL for ASR by exploring different aspects of recent end-to-end transformer models: architecture design, seed models, data heterogeneity, domain shift, and the impact of cohort size. With a practical number of central aggregations, we are able to train FL models that are near-optimal even with heterogeneous data, a seed model from another domain, or no pre-trained seed model. Second, we apply DP to FL for ASR, which is non-trivial since DP noise severely affects model training, especially for large transformer models, due to highly imbalanced gradients in the attention block. We counteract the adverse effect of DP noise by reviving layer-wise clipping and explain why its effect is more evident in our case than in previous work. Surprisingly, we achieve user-level (7.2, 10⁻⁹)-DP (resp. (4.5, 10⁻⁹)-DP) with only a 1.3% (resp. 4.6%) absolute drop in word error rate when extrapolating to a high (resp. low) population scale for FL with DP in ASR.
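To make the layer-wise clipping idea concrete, below is a minimal, hedged sketch (not the authors' implementation) of one central aggregation round of DP federated averaging: each client's model update is clipped layer by layer to a fixed L2 bound, the clipped updates are summed, and Gaussian noise calibrated to the per-layer clipping bound is added before averaging. All names and constants (clip_per_layer, dp_fedavg_round, LAYER_CLIP_NORM, NOISE_MULTIPLIER) are illustrative assumptions; translating the noise multiplier into a concrete (ε, δ) guarantee requires a separate privacy accountant, which is omitted here.

```python
import numpy as np

# Assumed hyperparameters, for illustration only.
LAYER_CLIP_NORM = 0.01    # per-layer L2 clipping bound C
NOISE_MULTIPLIER = 1.0    # Gaussian noise multiplier sigma (sets the privacy budget)


def clip_per_layer(update, clip_norm=LAYER_CLIP_NORM):
    """Clip each layer's update independently to an L2 norm of at most clip_norm.

    `update` is a dict mapping layer names to numpy arrays (the client's model delta).
    Layer-wise clipping keeps layers with small gradients (e.g. outside the attention
    block) from being drowned out by a single global clipping bound.
    """
    clipped = {}
    for name, delta in update.items():
        norm = np.linalg.norm(delta)
        scale = min(1.0, clip_norm / (norm + 1e-12))
        clipped[name] = delta * scale
    return clipped


def dp_fedavg_round(global_weights, client_updates,
                    clip_norm=LAYER_CLIP_NORM, noise_multiplier=NOISE_MULTIPLIER):
    """One central aggregation round with user-level DP noise.

    `client_updates` is a list of per-client update dicts from the current cohort.
    Because each layer's per-user contribution is bounded by clip_norm, adding
    Gaussian noise with std noise_multiplier * clip_norm per layer instantiates
    the Gaussian mechanism layer-wise.
    """
    cohort_size = len(client_updates)
    clipped_updates = [clip_per_layer(u, clip_norm) for u in client_updates]

    new_weights = {}
    for name, w in global_weights.items():
        summed = sum(u[name] for u in clipped_updates)
        noise = np.random.normal(0.0, noise_multiplier * clip_norm, size=w.shape)
        new_weights[name] = w + (summed + noise) / cohort_size
    return new_weights
```

As a usage sketch, the server would call dp_fedavg_round once per central aggregation with the cohort's locally computed updates; larger cohorts dilute the fixed noise per round, which is one reason extrapolating to a higher population scale yields a smaller word-error-rate drop.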