*=Equal contribution
Preserving training dynamics across batch sizes is an important tool for practical machine learning, as it enables the trade-off between batch size and wall-clock time. This trade-off is typically enabled by a scaling rule; for example, in stochastic gradient descent, the learning rate should be scaled linearly with the batch size. Another important machine learning tool is the EMA model, a functional copy of a target model whose parameters move towards those of its target model according to an Exponential Moving Average (EMA) at a rate parameterized by a momentum hyperparameter. This EMA model can improve the robustness and generalization of supervised learning, stabilize pseudo-labeling, and provide a learning signal for Self-Supervised Learning (SSL). Prior work has not considered the optimization of the EMA model when scaling, leading to different training dynamics across batch sizes and lower model performance. In this work, we provide a scaling rule for optimization in the presence of an EMA model and demonstrate the rule's validity across a range of architectures, optimizers, and data modalities. We also show the rule's validity in settings where the EMA model contributes to the optimization of the target model, enabling us to train EMA-based SSL and pseudo-labeling methods at small and large batch sizes. For SSL, we enable training of BYOL up to a batch size of 24,576 without sacrificing performance, a 6x wall-clock time reduction under idealized hardware settings.
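To make the objects discussed above concrete, the minimal sketch below shows an EMA parameter update and a batch-size-dependent rescaling of the hyperparameters. Only the linear learning-rate rule for SGD is stated in this abstract; the momentum adjustment shown (`rho ** kappa`), the function names, and the toy model are illustrative assumptions, not a statement of the paper's scaling rule.

```python
import copy
import torch

def scale_hyperparameters(lr, rho, base_batch, new_batch):
    """Rescale SGD learning rate and EMA momentum when the batch size changes.

    The linear learning-rate rule is the one stated in the abstract; the
    momentum adjustment rho ** kappa is an assumption for illustration.
    """
    kappa = new_batch / base_batch      # batch-size scaling factor
    scaled_lr = kappa * lr              # linear scaling rule for SGD
    scaled_rho = rho ** kappa           # assumed momentum adjustment
    return scaled_lr, scaled_rho

@torch.no_grad()
def ema_update(ema_model, target_model, rho):
    """Move the EMA model's parameters towards the target model's parameters."""
    for p_ema, p in zip(ema_model.parameters(), target_model.parameters()):
        p_ema.mul_(rho).add_(p, alpha=1.0 - rho)

# Usage: the EMA model starts as a functional copy of the target model.
target = torch.nn.Linear(8, 2)
ema = copy.deepcopy(target)
lr, rho = scale_hyperparameters(lr=0.1, rho=0.999, base_batch=256, new_batch=1024)
ema_update(ema, target, rho)
```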