Why is a Really Really Good Language Identification Tool Important when Training AI?
Lingsoft has its roots in academia, as it was established in 1986 by two professors, Fred Karlsson and Kimmo Koskeniemi, from the University of Helsinki. Since then, collaborating with Finnish and other Nordic research organisations has always been an important part of Lingsoft’s strategy for enabling language technology solutions with state-of-the-art performance for the Nordic languages.
This blog post gives an example of how the methods proposed in the Microservices at Your Service project facilitated the integration of Finnish state-of-the-art research into Lingsoft’s commercial machine translation solution.
Solved: Finnish to Swedish Machine Translation Occasionally Translated into English. How and Why?
Machine translation is nowadays an important part of the translation process for language service providers such as Lingsoft. Lingsoft’s neural machine translation models are based on open source software and trained on several hundred million words for each language pair. At the moment, Lingsoft uses these models to translate about two million words per month, which contributes to significant savings in our translation production.
Machine translation is not always good enough on its own, so Lingsoft always assigns a professional (human) translator to review and correct the machine translation before the translation is delivered to the client. This process is called machine translation post-editing.
The machine translation is intended to help the translators, at least more often than not, and if the machine translation is unexpectedly poor, the translators often report this to Lingsoft. In most cases, there is no quick fix for the specific one-off bad translation reported. One just has to follow established state-of-the-art processes and trust that progress in the machine translation field will bring better and better machine translation and fewer and fewer bad eggs. However, one has to keep a lookout for fixable patterns, and in the case of Finnish-Swedish translation, a pattern of bad quality started to emerge from the translator reports: the Finnish to Swedish machine translation occasionally translated into English instead of Swedish.
After some investigation, we identified the root cause: the Finnish-Swedish translation model had been trained on too many Finnish-English texts. How did this happen?
The Finnish-Swedish machine translation model was trained on over 10 million segments (about 100 million words) of presumably Finnish-Swedish translations (a segment is often a sentence or a few words, e.g. a headline). The majority of this data comes from translation memories. Although these are supposed to be clean records of, in our case, Finnish-Swedish translations, in practice they often contain some translations into other languages as well. Therefore, each segment had been run through a language identification (LID) tool to remove pairs that were not Finnish-Swedish translations. However, the translation-into-the-wrong-language problem suggested that too many Finnish-English segments had remained in the supposedly Finnish-Swedish data. In other words, the LID performance had been insufficient.
Removing “foreign” material from a collection of Finnish-Swedish translations is not a trivial problem. Many word forms are valid words in both English and Swedish, e.g. “in”, “I”, “and”, “taxi”, “bus”, and of course many person, place and organisation names, although they don’t necessarily have the same meaning in both languages. Additionally, short sections of foreign material, especially in English, are an expected part of normal Finnish/Swedish sentences and translation jobs. If there is a short section of English in the Finnish source sentence, then this part is often kept in English in the Swedish translation as well. For example, there might be names of English organisations without official Finnish/Swedish translations, or short phrases or citations in English. It is important that the NMT learns to handle such material, and we therefore want to retain it for training. What we want to remove are Finnish sentences that are clearly translated into English (or another language) instead of Swedish.
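The filtering step described above can be sketched roughly as follows. This is a minimal illustration, not Lingsoft's actual preprocessing code: the function names, the three-letter language codes and the toy_identify stand-in (which merely counts a handful of marker words) are all assumptions made for the example. A real pipeline would plug in a proper LID tool in place of toy_identify.

```python
def filter_pairs(pairs, identify, source_lang="fin", target_lang="swe"):
    """Keep only pairs whose sides are identified as the expected languages.

    `identify` is any callable mapping a text to a language code.
    """
    kept = []
    for source, target in pairs:
        if identify(source) == source_lang and identify(target) == target_lang:
            kept.append((source, target))
    return kept


def toy_identify(text):
    """Toy stand-in for a real LID tool (assumption, for demonstration only).

    Scores each language by how many of its marker words occur in the text
    and returns the best-scoring language code.
    """
    markers = {
        "fin": {"on", "ja", "tämä"},
        "swe": {"är", "och", "det"},
        "eng": {"is", "the", "this"},
    }
    words = set(text.lower().split())
    scores = {lang: len(words & vocab) for lang, vocab in markers.items()}
    return max(scores, key=scores.get)


pairs = [
    # A genuine Finnish-Swedish pair: should be kept.
    ("Tämä on talo", "Det här är ett hus"),
    # Finnish translated into English: should be removed.
    ("Tämä on talo", "This is a house"),
    # Swedish target with a short embedded English name: should be kept.
    ("Tämä on Microservices at Your Service hanke",
     "Det är projektet Microservices at Your Service"),
]

kept = filter_pairs(pairs, toy_identify)
for source, target in kept:
    print(source, "|", target)
```

Because the decision is made on the dominant language of each whole segment, the third pair survives even though its target contains an English name, while the pair with a fully English target is dropped.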
The most common metrics for measuring how good a language identification tool is are precision and recall. Precision measures the proportion of the segments that the tool predicts to be Swedish that are actually in Swedish. Recall measures the proportion of all Swedish segments that the tool predicts to be Swedish. The best tools have both high precision and high recall. With numbers as large as 10 million segments or 100 million words, even a single percentage point of error amounts to a lot of words. For example, the difference between 99.9% and 98.9% precision means that there are potentially 1 million more words in English instead of Swedish in the 98.9% case than in the 99.9% case. Even though there would still be about 99 million Swedish words, one million English words are certainly sufficient for a neural machine translation tool to learn to sometimes translate into English, especially if these words are not random LID errors but rather stem from one specific customer.
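The arithmetic behind this comparison can be reproduced in a few lines. The sketch below makes the same simplification as the text, namely that all of the roughly 100 million words were accepted by the tool as Swedish, so that the share of wrong-language words is simply one minus the precision. The function names are chosen for this example only.

```python
def precision(true_positives, false_positives):
    """Share of segments predicted as Swedish that really are Swedish."""
    return true_positives / (true_positives + false_positives)


def recall(true_positives, false_negatives):
    """Share of all truly Swedish segments that the tool predicted as Swedish."""
    return true_positives / (true_positives + false_negatives)


def wrong_language_words(kept_words, precision_value):
    """Expected number of non-Swedish words among the words kept as Swedish."""
    return round(kept_words * (1 - precision_value))


# Simplifying assumption: every word in the training data was kept as Swedish.
KEPT_WORDS = 100_000_000

old_noise = wrong_language_words(KEPT_WORDS, 0.989)  # 98.9% precision
new_noise = wrong_language_words(KEPT_WORDS, 0.999)  # 99.9% precision
extra_noise = old_noise - new_noise

print(f"98.9% precision: {old_noise:,} wrong-language words")
print(f"99.9% precision: {new_noise:,} wrong-language words")
print(f"difference:      {extra_noise:,} words")
```

Under these assumptions, the one percentage point gap in precision corresponds to one million additional wrong-language words in the training data.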
As part of the Tandem Industry Academia project with Lingsoft and the University of Helsinki, funded by the Finnish Research Impact Foundation, Tommi Jauhiainen conducted research into improving language identification. When evaluated on a common test set, the new LID tool reached the previously mentioned 99.9% precision (and 99.6% recall), whereas the old tool, itself quite good, had 98.9% precision (and 96.4% recall). Thus, the new LID tool, named HeLI-OTS by the researcher, would most likely be better at removing the previously problematic Finnish-English translations.
Bridging the Gap Between Research Tool and Commercial Application
The Microservices at Your Service project was conveniently timed in parallel with the LID research project. The Microservices at Your Service project proposed that adding an API to a research tool and packaging it as a Docker image results in a tool that is easy to share, test and integrate. As part of the project, Lingsoft took on the task of adding a REST API to the HeLI-OTS LID tool and packaging it as an easily distributable Docker image. The HeLI-OTS tool was implemented in Java, whereas Lingsoft’s language technology tools and platform were mainly written in C and Python. Having HeLI-OTS in a Docker image helped us avoid the problem of maintaining compatibility between the environment supporting HeLI-OTS and the rest of our tools and platform. In addition to making the tool available to Lingsoft, it was also put on display in the European Language Grid (which has the ambitious goal of becoming the one-stop shop for European language technology).
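To illustrate why this setup decouples the environments, here is a rough sketch of how a Python platform could talk to a containerized LID tool over HTTP without knowing anything about the Java runtime inside the container. The URL, the port, the request body with a "text" field and the reply with a "language" field are all hypothetical and chosen for this example; they are not the documented interface of the HeLI-OTS service.

```python
import json
import urllib.request

# Hypothetical address of the running container (assumption, not from the source).
LID_URL = "http://localhost:8080/process"


def build_request(text, url=LID_URL):
    """Build a POST request carrying the text as JSON (assumed payload shape)."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_language(response_body):
    """Extract the language code from an assumed {"language": "..."} reply."""
    return json.loads(response_body)["language"]


def identify(text, url=LID_URL):
    """Send one segment to the service and return the predicted language code."""
    with urllib.request.urlopen(build_request(text, url), timeout=10) as response:
        return parse_language(response.read().decode("utf-8"))
```

With a service of this kind running locally, a call such as identify("Det här är ett hus") would return whatever language code the service reports. The calling code depends only on the HTTP contract, so the tool inside the container can use any language or library versions it needs.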
In conclusion, for our LID case, the API and Docker gave the research partner, the University of Helsinki, a simple way of retaining intellectual property rights (IPR) while facilitating sharing and, most importantly, re-use of the tool by other researchers and developers, thereby maximising the potential impact of the research. The industry partner, Lingsoft, got access to a great state-of-the-art tool that was simple to integrate with minimal impact on other tool dependencies.
As a result, once the new dockerized LID tool was used in the text preprocessing, the retrained Finnish-Swedish machine translation model produced far fewer English translations.