XLM-V is a multilingual masked language model based on the XLM-R (XLM-RoBERTa) architecture with a 1M token vocabulary. It is trained on 2.5TB of filtered CommonCrawl data in 100 languages. XLM-V outperforms XLM-R on every multilingual task we tested (XNLI, MLQA, TyDiQA, XQuAD, WikiAnn), with outsized gains on low-resource language tasks (MasakhaNER, AmericasNLI). [Paper] [Download (8.3G)] [Instructions]

Our model is also available in Hugging Face Transformers.
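
As a rough illustration of using the Transformers release, the sketch below runs a fill-mask query. The Hub identifier `facebook/xlm-v-base` and the `<mask>` token (the XLM-R convention) are assumptions not stated in this README, and `fill_mask` is a hypothetical helper name; check the model page before relying on them.

```python
# Assumed Hub identifier; not stated in this README, verify before use.
MODEL_ID = "facebook/xlm-v-base"


def fill_mask(text, model_id=MODEL_ID, top_k=5):
    """Return (token, score) pairs for the single mask token in `text`.

    Hypothetical helper for illustration. Calling it downloads the
    model weights on first use.
    """
    # Imported lazily so that defining this helper does not require
    # transformers to be installed.
    from transformers import pipeline

    unmasker = pipeline("fill-mask", model=model_id)
    results = unmasker(text, top_k=top_k)
    return [(r["token_str"], r["score"]) for r in results]
```

Calling `fill_mask("Paris is the <mask> of France.")` should return the five highest-scoring candidate tokens with their probabilities, assuming the input contains exactly one mask token.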