This work studies the use of attention masking in transformer transducer based speech recognition to build a single configurable model for different deployment scenarios. We present a comprehensive set of experiments comparing fixed masking, where the same attention mask is applied at every frame, with chunked masking, where the attention mask for each frame is determined by chunk boundaries, in terms of recognition accuracy and latency. We then explore the use of variable masking, where the attention masks are sampled from a target distribution at training time, to build models that can operate in different configurations. Finally, we investigate how a single configurable model can be used to perform both first-pass streaming recognition and second-pass acoustic rescoring. Experiments show that chunked masking achieves a better accuracy vs. latency trade-off compared to fixed masking, both with and without FastEmit. We also show that variable masking improves accuracy by up to 8% relative in the acoustic rescoring scenario.
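To make the masking variants concrete, below is a minimal PyTorch-style sketch of how fixed and chunked attention masks could be constructed; the function names, parameters (e.g. `chunk_size`, `left_chunks`), and shapes are illustrative assumptions, not the paper's implementation.

```python
import torch


def fixed_attention_mask(num_frames: int, left_context: int, right_context: int) -> torch.Tensor:
    """Fixed masking: every frame attends to the same window of left/right frames.

    Returns a boolean (T, T) mask where entry [q, k] is True if query frame q
    may attend to key frame k.
    """
    frame_idx = torch.arange(num_frames)
    offset = frame_idx.unsqueeze(0) - frame_idx.unsqueeze(1)  # [q, k] = k - q
    return (offset >= -left_context) & (offset <= right_context)


def chunked_attention_mask(num_frames: int, chunk_size: int, left_chunks: int) -> torch.Tensor:
    """Chunked masking: the mask for each frame is determined by chunk boundaries.

    A frame attends to all frames in its own chunk (full look-ahead within the
    chunk) plus a limited number of preceding chunks as left context.
    """
    chunk_idx = torch.arange(num_frames) // chunk_size       # chunk id of every frame
    q_chunk = chunk_idx.unsqueeze(1)                          # (T, 1) query chunks
    k_chunk = chunk_idx.unsqueeze(0)                          # (1, T) key chunks
    return (k_chunk <= q_chunk) & (k_chunk >= q_chunk - left_chunks)


# Example: a fully non-causal mask (all True) corresponds to the second-pass
# rescoring setting, while small chunks give low-latency streaming behaviour.
streaming_mask = chunked_attention_mask(num_frames=16, chunk_size=4, left_chunks=2)
```

Variable masking, as described above, could then be realised during training by sampling the mask configuration (e.g. `chunk_size` and `left_chunks`, or a fully unmasked setting) from a target distribution for each batch, so that one model covers the different deployment scenarios; this sampling step is likewise only sketched here, not taken from the paper.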