Multilingual language models (LMs) such as mBERT, XLM-R, mT5, and mBART have had notable success in enabling natural language tasks in low-resource languages via cross-lingual transfer from high-resource ones. In this paper, we aim to better understand how such models, specifically mT5, transfer cross-lingual linguistic and semantic knowledge, even though no explicit cross-lingual cues are provided during pre-training. Rather, only unannotated texts from each language are presented to the model, separately and independently of one another, and the model appears to implicitly learn the connections between languages. This raises several questions that motivate our study: Are the learned connections equally strong for every language pair? Which properties of the source and target languages affect the strength of cross-lingual transfer? Can we quantify the impact of those properties on transfer between languages?
In our research, we analyze a pre-trained mT5 model to discover the attributes of the connections between languages that the model has learned. Through a statistical interpretation framework covering more than 90 language pairs across three tasks, we show that cross-lingual transfer performance can be modeled by a small set of linguistic and data-derived features. These observations allow us to interpret the multilingual understanding of the mT5 model; based on them, one can choose the most promising source language for a task and anticipate its training data requirements. A key finding of this work is that similarity of syntax, morphology, and phonology is a good predictor of cross-lingual transfer, much more so than the mere lexical similarity of languages. For a given language, we can predict zero-shot performance, which then increases on a logarithmic scale with the number of few-shot target-language data points.
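To make the modeling idea concrete, the sketch below fits a simple linear regression of the form performance ≈ β0 + β1·sim_syntax + β2·sim_morphology + β3·sim_phonology + β4·log(1 + n_target). The feature names, toy values, and the choice of ordinary least squares are illustrative assumptions, not the exact specification or data used in the paper.

```python
# Illustrative sketch (not the paper's exact model): regress transfer
# performance on hypothetical language-similarity features plus the log of
# the number of few-shot target-language examples.
import numpy as np

# Toy rows, one per (source, target, data-size) configuration, with
# hypothetical syntactic / morphological / phonological similarities in [0, 1].
sim_features = np.array([
    [0.82, 0.74, 0.69],   # e.g. a closely related pair
    [0.82, 0.74, 0.69],
    [0.55, 0.61, 0.48],
    [0.55, 0.61, 0.48],
    [0.31, 0.42, 0.37],   # e.g. a distant pair
    [0.31, 0.42, 0.37],
])
n_fewshot = np.array([0, 1000, 0, 1000, 0, 1000])            # 0 = zero-shot
performance = np.array([71.2, 75.8, 63.5, 70.1, 52.4, 60.9])  # made-up scores

# Design matrix: intercept, similarity features, and log(1 + n) so that the
# zero-shot case (n = 0) is handled without a singularity.
X = np.column_stack([
    np.ones(len(performance)),
    sim_features,
    np.log1p(n_fewshot),
])

# Ordinary least squares; the coefficients quantify how much each property
# (and additional target-language data) contributes to transfer performance.
coef, *_ = np.linalg.lstsq(X, performance, rcond=None)
print("intercept, syntax, morphology, phonology, log-data coefficients:", coef)

# Predicted score for a new hypothetical pair with 500 target-language examples.
x_new = np.array([1.0, 0.60, 0.50, 0.55, np.log1p(500)])
print("predicted performance:", x_new @ coef)
```

Under this toy specification, a positive coefficient on the log term corresponds to the reported logarithmic growth of performance as few-shot target-language data is added, while the similarity coefficients play the role of the linguistic predictors discussed above.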