*=Authors contributed equally
Machine learning (ML) models can fail in unexpected ways in the real world, but not all model failures are created equal. With limited time and resources, ML practitioners are forced to prioritize their model debugging and improvement efforts. Through interviews with 13 ML professionals at Apple, we found that practitioners build small, targeted test suites to estimate a bug's nature, scope, and impact on users. Building on this insight in a case study with machine translation models, we developed Angler, an interactive visual analysis tool that helps practitioners prioritize model improvements. In a user study with 7 machine translation experts, we used Angler to understand prioritization practices when the input space is infinite and reliable signals of model quality are expensive to obtain. Our study revealed that participants formulated more interesting and user-focused hypotheses for prioritization by combining quantitative analysis of summary statistics with qualitative evaluation of the data through reading sentences.