Facebook goes massive: the culmination of years of work in massive multilingual language translation models



By:
Eric De Grasse 

Chief Technology Officer

23 October 2020 (Paris, France) – As machine translation research output soars again before the (virtual) academic conference season, Facebook is introducing M2M-100, a multilingual neural machine translation (NMT) model designed to avoid English as the intermediary (or pivot) language between source and target languages.

NOTE: I prepared a brief for clients covering this soaring output in multilingual machine translation research. It is now publicly available and can be accessed by clicking here.

These so-called massive multilingual models are important in machine translation because of the promise of efficiency gains and the option to use transfer learning.

For Facebook, the release of the model was important enough to proactively inform media outlets and corporate IT departments, an unusual move for the company. But given that Facebook is now competing with Google to displace the “usual suspects” in machine translation work in corporate IT and legal departments, maybe not so unusual.

Facebook has already open-sourced the model, training, and evaluation setup. An October 19, 2020 Facebook AI blog post described the new model as the “culmination of years of Facebook AI’s foundational work in machine translation.”
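For readers who want to try the open-sourced model, here is a minimal sketch using the Hugging Face `transformers` port of M2M-100 (the `facebook/m2m100_418M` checkpoint). Facebook’s own release ships with fairseq, so this port, the checkpoint name, and the example sentence are illustrative assumptions rather than the exact setup described in the blog post:

```python
# Minimal sketch: direct French -> German translation with M2M-100,
# with no English pivot. Assumes the Hugging Face transformers port
# ("facebook/m2m100_418M"); Facebook's own release is in fairseq.
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")

tokenizer.src_lang = "fr"  # tell the tokenizer the source language
encoded = tokenizer("La vie est belle.", return_tensors="pt")

# Force the decoder to start with the German language token so the
# model translates fr -> de directly, without routing through English.
generated = model.generate(
    **encoded,
    forced_bos_token_id=tokenizer.get_lang_id("de"),
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```

Swapping `src_lang` and the forced target token is all it takes to change the language direction, which is the point of a single many-to-many model.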

Facebook’s LASER, a library for calculating multilingual sentence embeddings, has been available on GitHub since 2018. In a November 2019 paper, Facebook researchers detailed how they mined LASER for billions of parallel sentences, a significant portion of which were not aligned with English, and then open-sourced the resulting mined data, referred to as CCMatrix. According to an October 19, 2020 comment on GitHub by lead author Holger Schwenk, the corpus now contains 10.8 billion sentences for about 80 languages, and the team plans to “release scripts to reproduce the data this week.”
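The mining step scores candidate sentence pairs in LASER’s shared embedding space using margin-based cosine similarity. The sketch below illustrates that scoring idea in plain NumPy; the function names, toy vectors, and neighbour count are assumptions for illustration, not Facebook’s actual pipeline, which runs LASER embeddings through nearest-neighbour search at billion-sentence scale:

```python
# Illustrative sketch of margin-based scoring for bitext mining in a
# shared multilingual embedding space (the idea behind CCMatrix).
# All names, toy data, and k are assumptions for illustration.
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def margin_score(x, y, x_neighbors, y_neighbors, k=4):
    """Ratio-margin score: cos(x, y) normalised by the average
    similarity of each vector to its k nearest neighbours."""
    avg_x = np.mean([cosine(x, n) for n in x_neighbors[:k]])
    avg_y = np.mean([cosine(y, n) for n in y_neighbors[:k]])
    return cosine(x, y) / ((avg_x + avg_y) / 2)

rng = np.random.default_rng(0)
dim = 8
x = rng.normal(size=dim)             # embedding of a source sentence
y = x + 0.05 * rng.normal(size=dim)  # close vector: likely a translation
neighbors = [rng.normal(size=dim) for _ in range(4)]  # unrelated sentences

# A pair is kept as mined bitext when its margin score clears a threshold.
print(margin_score(x, y, neighbors, neighbors))
```

The margin normalisation is what lets mining work without English: a pair only scores well if the two sentences are much closer to each other than to anything else, in any language pair.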

Trained on 2,200 language directions and touted as being able to translate “between any pair of 100 languages without relying on English data,” M2M-100 is notable for its scale. But it also reports better BLEU scores than English-centric multilingual models.
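BLEU compares system output against reference translations by n-gram overlap, and is the metric behind the comparison above. A quick sketch of scoring a hypothesis with the sacreBLEU library, the tool commonly used to report MT results (the example sentences are invented for illustration):

```python
# Quick sketch: computing a corpus BLEU score with sacreBLEU.
# The hypothesis and reference sentences are invented for illustration.
import sacrebleu

hypotheses = ["The cat sits on the mat."]
references = [["The cat is sitting on the mat."]]  # one reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.1f}")
```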

And a Facebook AI researcher told me that Facebook also conducted human evaluation across 21 non-English directions, in which specialist judges rated accuracy and fluency and provided written feedback on translation issues. This is critically important, she said, as the company moves into legal translation work, translating different source languages into English and back into the original language.
