Multilingual neural models encode languages into a high-dimensional space. However, numerous works have shown that monolingual neural models frequently do not make use of every dimension. Isotropy is a measure of how the variance of a distribution is spread across dimensions. In general, the field assumes that more isotropic models perform better. Yet numerous papers, which we discussed in class, argue both for and against this assumption. Moreover, most of these works are monolingual, and little has been done in the multilingual setting. In this homework assignment, you will measure isotropy using different methods on various pre-trained multilingual models.
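We define isotropy following the IsoScore paper. A distribution is isotropic if its variance is uniformly distributed across all dimensions, that is, if its covariance matrix is proportional to the identity matrix.
A distribution is anisotropic when the variance is dominated by a single dimension.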
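As a quick illustration of the contrast between isotropic and anisotropic distributions, here is a minimal sketch that reports how much of the total variance falls on the largest principal component. The function name `top_variance_share` and the synthetic point clouds are assumptions made for this example and are not part of either paper's code.

```python
import numpy as np


def top_variance_share(points):
    """Fraction of the total variance that lies along the largest principal component.

    points: array of shape (n_samples, n_dims).
    """
    cov = np.cov(points, rowvar=False)      # (n_dims, n_dims) covariance matrix
    eigvals = np.linalg.eigvalsh(cov)       # eigenvalues in ascending order
    return float(eigvals[-1] / eigvals.sum())


rng = np.random.default_rng(0)
# Synthetic isotropic cloud: equal variance in all four dimensions.
isotropic = rng.normal(size=(5000, 4))
# Synthetic anisotropic cloud: the first dimension dominates the variance.
anisotropic = isotropic * np.array([10.0, 0.1, 0.1, 0.1])

share_iso = top_variance_share(isotropic)
share_aniso = top_variance_share(anisotropic)
```

For an isotropic cloud in n dimensions the share should sit near 1/n, while for a cloud dominated by one dimension it approaches 1.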
We first look at isotropy in mBERT as defined in Rajaee and Pilehvar (2022). This paper looks at two measures of isotropy: cosine similarity and principal components. In general, the results of this paper suggest that mBERT is highly anisotropic despite not having a single outlier dimension. Broadly, BERT is found to be more isotropic than mBERT.
We start with the definitions of isotropy from this paper, which relies on two previous works:
Formal definitions are found in Section 2.1 of Rajaee and Pilehvar (2022). For the first part of this assignment, you will replicate part of the experiments from this paper. You are free to use the code provided by the authors. Calculate the isotropy on the provided English and Spanish data in the repo.
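To make the two measures more concrete, here is a rough sketch under stated assumptions. It is not the authors' code. The cosine measure is taken to be the average cosine similarity over randomly sampled pairs of embeddings (values near 0 suggest isotropy). The principal-component measure is taken to be the ratio of the smallest to the largest value of a partition function Z(c) evaluated over the eigenvectors of the Gram matrix (values near 1 suggest isotropy). Check both against Section 2.1 before relying on them. The function names, the number of sampled pairs, and the choice to score each eigenvector with both signs are assumptions of this sketch.

```python
import numpy as np


def mean_cosine_similarity(emb, n_pairs=1000, seed=0):
    """Average cosine similarity over randomly sampled pairs of rows of emb."""
    rng = np.random.default_rng(seed)
    n = emb.shape[0]
    i = rng.integers(0, n, size=n_pairs)
    j = rng.integers(0, n, size=n_pairs)
    keep = i != j                            # drop pairs that compare a row with itself
    a, b = emb[i[keep]], emb[j[keep]]
    cos = np.sum(a * b, axis=1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    )
    return float(cos.mean())


def partition_isotropy(emb):
    """min Z(c) / max Z(c), with Z(c) = sum_w exp(c . w) over eigenvector directions c."""
    # Eigenvectors of the (n_dims, n_dims) matrix emb^T emb, stored as columns.
    _, eigvecs = np.linalg.eigh(emb.T @ emb)
    proj = emb @ eigvecs                     # projection of every row on every eigenvector
    # An eigenvector's sign is arbitrary, so this sketch scores both c and -c.
    z = np.concatenate([np.exp(proj).sum(axis=0), np.exp(-proj).sum(axis=0)])
    return float(z.min() / z.max())


rng = np.random.default_rng(0)
centered = 0.1 * rng.normal(size=(2000, 8))  # synthetic, roughly isotropic
shifted = centered + 1.0                     # synthetic, shares one dominant direction
```

On the synthetic data, `centered` should give a mean cosine near 0 and a partition ratio near 1, while `shifted` should give a mean cosine near 1 and a partition ratio near 0. Note that `np.exp` can overflow on embeddings with large norms, so real model outputs may need rescaling first.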
Choose another language that is in mBERT, but was not evaluated in the paper. You will need to create a CSV file with roughly 3,000-5,000 lines in the language. Make sure to report all sources, preprocessing steps, and other information needed to replicate results.
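One way to build the file reproducibly is sketched below. The single `text` column, the default of 4,000 lines, and the function name `write_sentence_csv` are all assumptions for illustration. Match the column layout of the provided English and Spanish files rather than this one.

```python
import csv
import random


def write_sentence_csv(sentences, path, n_lines=4000, seed=0):
    """Write a deduplicated, seeded random sample of sentences to a one-column CSV.

    Returns the number of sentences written (excluding the header row).
    """
    cleaned = [s.strip() for s in sentences if s.strip()]   # drop empty lines
    unique = list(dict.fromkeys(cleaned))                   # deduplicate, keep order
    random.Random(seed).shuffle(unique)                     # fixed seed for replication
    rows = unique[:n_lines]
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["text"])                           # hypothetical header name
        writer.writerows([s] for s in rows)
    return len(rows)
```

Recording the seed, the sample size, and each cleaning step in your report covers most of what someone would need to rebuild the same file.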
One of the main arguments for investigating isotropy is that more isotropically distributed models perform better. If so, we might expect the converse to hold as well: a better-performing model should be more isotropic. Since XLM-R generally performs better than mBERT on many downstream tasks, we would hypothesize that it has more isotropic distributions. Rerun the previous experiments, replacing mBERT with XLM-R. Remember to change the tokenizer as well.
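A minimal sketch of switching the model and tokenizer together with the Hugging Face `transformers` library follows. The checkpoint names (`bert-base-multilingual-cased`, `xlm-roberta-base`), the use of the last hidden layer, and the helper name `embed_sentences` are assumptions. Use whichever checkpoints and layers the provided code actually uses.

```python
# Assumed checkpoints; confirm against the code you are replicating.
MODEL_NAMES = {
    "mbert": "bert-base-multilingual-cased",
    "xlmr": "xlm-roberta-base",
}


def embed_sentences(sentences, model_key):
    """Return token-level embeddings of shape (n_tokens, hidden_size) as a NumPy array."""
    # Imported here so the heavy dependencies load only when embeddings are requested.
    import torch
    from transformers import AutoModel, AutoTokenizer

    name = MODEL_NAMES[model_key]
    # Loading both from the same name keeps the tokenizer and the model in sync.
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModel.from_pretrained(name)
    model.eval()

    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state           # (batch, seq_len, hidden_size)

    # Keep real tokens only (special tokens included); padding positions are dropped.
    token_mask = batch["attention_mask"].bool()
    return hidden[token_mask].numpy()
```

Calling `embed_sentences(lines, "mbert")` and `embed_sentences(lines, "xlmr")` on the same lines then yields two point clouds that can be scored with the same isotropy measures.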
Rudman et al. (2022) propose IsoScore and argue that most published work on isotropy does not use the proper mathematical definition. They run experiments on GPT, GPT-2, BERT, and DistilBERT, but do not run any multilingual experiments. For the rest of this assignment, you will rectify this. Fortunately for you, they provide a toolkit you can pip install. Their repo is available here. Calculate the IsoScore on the three languages you have been using above. Make sure to report the dimension you are evaluating on, since mBERT and XLM-R have different dimensions and IsoScore depends on this. You can make use of this code to generate embeddings.
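For intuition about what the toolkit computes, here is a from-scratch NumPy sketch of the IsoScore steps as I understand them from the paper: PCA reorientation, the variance along each reoriented axis, normalization, the isotropy defect, and rescaling to the range 0 to 1. It is an assumption-laden approximation, not the official implementation, and the function name `isoscore_sketch` is hypothetical. Use the pip-installable toolkit for any numbers you report.

```python
import numpy as np


def isoscore_sketch(points):
    """Approximate IsoScore for points of shape (n_samples, n_dims)."""
    n_samples, n = points.shape
    if n < 2 or n_samples < n:
        raise ValueError("need n_dims >= 2 and at least n_dims samples")

    # Step 1: reorient the centered data along its principal components.
    centered = points - points.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    reoriented = centered @ vt.T

    # Step 2: variance along each principal axis (diagonal of the covariance matrix).
    variances = reoriented.var(axis=0, ddof=1)

    # Step 3: normalize so a perfectly isotropic cloud maps to the all-ones vector.
    normalized = np.sqrt(n) * variances / np.linalg.norm(variances)

    # Step 4: isotropy defect, the scaled distance from the all-ones vector.
    defect = np.linalg.norm(normalized - 1.0) / np.sqrt(2 * (n - np.sqrt(n)))

    # Step 5: estimated fraction of dimensions that are uniformly utilized.
    used_fraction = (n - defect**2 * (n - np.sqrt(n))) ** 2 / n**2

    # Step 6: rescale so the score runs from 0 (anisotropic) to 1 (isotropic).
    return float((n * used_fraction - 1) / (n - 1))


rng = np.random.default_rng(0)
isotropic_cloud = rng.normal(size=(10000, 3))                    # synthetic
anisotropic_cloud = isotropic_cloud * np.array([10.0, 0.01, 0.01])  # synthetic
```

The dimension `n` appears in every step after the reorientation, which is why the dimension belongs next to each reported score.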
Please write about 500 words on the assignment. Include a discussion of how and why you see differences between IsoScore and the other methods. Which do you think is a better measure of isotropy? Is isotropy actually worth measuring in multilingual settings? Discuss what you think would be a useful takeaway from this line of work for the field. Propose a potential improvement, either to the scoring itself or to how these scores are used (this is very broad and has no single correct answer).