Self-training has been shown to be helpful in addressing data paucity in many domains, including vision, speech, and language. Specifically, self-training, or pseudo-labeling, labels unsupervised data and adds it to the training set. In this work, we investigate and use pseudo-labeling for a recently proposed novel setup: joint transcription and translation of speech, which suffers from an absence of sufficient parallel data resources. We show that under such data-scarce conditions, unlabeled data can differ significantly in domain from the supervised data, resulting in degraded pseudo-label quality. We investigate two categories of remedies that require no additional supervision and target the domain mismatch: pseudo-label filtering and data augmentation. We show that analyzing and processing pseudo-labels in this way yields additional gains on top of the standard pseudo-labeling setup, for a total improvement of up to 0.4% absolute WER and 2.1 BLEU points for En-De and 0.6% absolute WER and 2.2 BLEU points for En-Zh.
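The pseudo-labeling loop described above can be sketched as follows. This is a minimal illustrative example, not the paper's actual system: the toy `predict` function and the confidence threshold are assumptions standing in for a trained speech model, and confidence filtering is one simple instance of the pseudo-label filtering remedy mentioned in the abstract.

```python
def predict(x):
    """Toy stand-in model: label is the sign of x, confidence grows with |x|.
    A real setup would use a model trained on the supervised seed data."""
    label = 1 if x >= 0 else 0
    confidence = min(1.0, abs(x))
    return label, confidence

def pseudo_label(unlabeled, threshold=0.8):
    """Label unsupervised examples, keeping only pseudo-labels whose
    confidence clears the threshold -- a simple filter against the
    noise introduced by domain mismatch."""
    kept = []
    for x in unlabeled:
        label, conf = predict(x)
        if conf >= threshold:
            kept.append((x, label))
    return kept

labeled = [(1.0, 1), (-1.0, 0)]        # supervised seed data
unlabeled = [0.95, -0.2, -1.3, 0.5]    # unsupervised pool
# Filtered pseudo-labels are added to the training set for retraining.
augmented = labeled + pseudo_label(unlabeled)
```

Here the low-confidence examples (-0.2 and 0.5) are discarded rather than added to the training set, trading pool size for pseudo-label quality.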