In the first part of our three-part blog series on Machine Translation, we explained that the choice of Machine Translation software – the so-called "Engine" – can be a decisive factor in how successfully Machine Translation is integrated into the overall translation process. In this second part, we'll explain the characteristics of Generic and Trainable Engines and the differences between them. We'll also address decision factors such as quality, training data and costs – and, of course, data protection.
Generic Engines are trained with publicly available data from a wide variety of sources. Common generic translation services include Google Translate, DeepL and Amazon Translate. These providers offer both browser-based translation tools – Google Translate is probably the best known of them – and interfaces (APIs) for direct integration into CAT (Computer-Assisted Translation) tools. Thanks to the wealth of data that goes into the training of these neural engines, the resulting translations read very naturally. However, this also makes it more difficult to spot errors without examining the texts thoroughly. Because of their broad training, Generic Engines can be used in a wide variety of highly specialised areas. However, since neither the training data nor the algorithms underlying the Engines can be controlled, there's no guarantee that technical content will be translated correctly and in line with the subject area.* This becomes particularly evident in translations involving technical and company-specific terminology and style specifications. A sentence that's been translated correctly in one place may receive a completely different translation elsewhere in the same text.
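To illustrate what such an integration looks like in practice, here's a minimal sketch of a call to a generic engine's interface, using DeepL's public REST API as an example. The endpoint and parameters follow DeepL's documented v2 API; the key is a placeholder and the example sentence is our own, so treat this as an illustration rather than a reference implementation.

```python
# Minimal sketch (not production code) of calling a generic engine's API.
# Endpoint and parameters follow DeepL's documented v2 REST API; the key
# below is a placeholder you'd replace with your own.
import requests

API_URL = "https://api-free.deepl.com/v2/translate"
API_KEY = "your-auth-key"  # placeholder

def translate(text: str, target_lang: str = "DE") -> str:
    """Send one segment to the engine and return the raw machine translation."""
    response = requests.post(
        API_URL,
        data={"auth_key": API_KEY, "text": text, "target_lang": target_lang},
        timeout=10,
    )
    response.raise_for_status()
    return response.json()["translations"][0]["text"]

print(translate("The valve must be replaced every six months."))
```

A CAT tool does essentially the same thing behind the scenes, segment by segment – which is also why the terminology inconsistencies described above can surface anywhere in the output.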
Trainable Engines (also called Individualised Engines) are trained with customer-specific data. This data comes from existing Translation Memories (TMs) and from bilingual or multilingual terminology databases. The training is carried out by the vendors of the Machine Translation software and doesn't have to be performed by the customers themselves. Individual training helps the engine to translate technical and customer-specific terms more accurately and to ensure consistency in recurring texts – similar to an exact match or a context match. In addition, continuous training and adjustment allow errors to be corrected in an even more targeted way. The more the Individualised Engine is trained, the higher the quality of the translation results. To sum up, Trainable Engines are particularly suitable for jobs with recurring text portions and customer-specific specifications, as well as for low-context text blocks and word lists. Furthermore, Trainable Engines are characterised by better handling of tags (i.e. markup elements that provide additional information such as formatting).
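The customer-specific data usually reaches the vendor as TMX files, the standard exchange format for Translation Memories. As a rough illustration, here's a hedged sketch of how bilingual segment pairs could be pulled out of a TMX file for training. It assumes a well-formed TMX file with plain-text segments and one English and one German variant per translation unit; the file name is hypothetical.

```python
# Sketch: extract (source, target) segment pairs from a TMX file so they
# can be handed to an MT vendor as training data. Assumes plain-text
# segments (no inline markup) and one variant per language per unit.
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def read_tmx(path: str, src: str = "en", tgt: str = "de"):
    """Yield (source, target) segment pairs from a TMX Translation Memory."""
    tree = ET.parse(path)
    for tu in tree.iter("tu"):
        segs = {}
        for tuv in tu.iter("tuv"):
            # TMX 1.4 uses xml:lang; older files may use a plain lang attribute
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None and seg.text:
                segs[lang[:2]] = seg.text.strip()
        if src in segs and tgt in segs:
            yield segs[src], segs[tgt]

pairs = list(read_tmx("memory.tmx"))  # hypothetical file name
print(f"{len(pairs)} segment pairs extracted for training")
```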
The quality of the training data is crucial for the successful use of Trainable Engines. We therefore recommend thoroughly checking the segments from the Translation Memory before they're used for training. In addition, larger amounts of data are required to train Individualised Engines – ideally at least 30,000 segments or 300,000 words. Trainable Engines are therefore better suited to customers with high translation volumes. It should also be noted that integrating a Trainable Engine and maintaining it on an ongoing basis involves higher costs.
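What such a pre-training check can look like is sketched below. The specific checks (empty segments, exact duplicates, implausible length ratios) and the ratio threshold are our own illustrative assumptions, not a vendor requirement; only the 30,000-segment minimum comes from the figures above.

```python
# Sketch of a pre-training sanity check for TM segment pairs.
# The checks and the max_ratio value are illustrative assumptions.
def filter_segments(pairs, max_ratio: float = 2.5, min_segments: int = 30_000):
    """Drop empty, duplicate and implausibly mismatched segment pairs."""
    seen, clean = set(), []
    for src, tgt in pairs:
        if not src or not tgt:
            continue  # an empty side is unusable for training
        if (src, tgt) in seen:
            continue  # exact duplicates add nothing
        ratio = max(len(src), len(tgt)) / min(len(src), len(tgt))
        if ratio > max_ratio:
            continue  # extreme length mismatch suggests a misalignment
        seen.add((src, tgt))
        clean.append((src, tgt))
    if len(clean) < min_segments:
        print(f"Warning: only {len(clean)} usable segments "
              f"(ideally at least {min_segments})")
    return clean
```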
While Trainable Engines use the data only to train the specific Engine, there's widespread scepticism about how the vendors of generic systems process user data. When using openly accessible Generic Engines, it's unclear what happens to the data. We therefore advise against having sensitive data translated with browser-based translation tools. However, company versions of Generic Engines are also available, in which the protection of sensitive data is assured. DeepL, for example, claims to provide two-factor data protection through end-to-end encryption and the immediate deletion of data after translation.
When deciding whether to implement Machine Translation – whether Generic or Trainable – we should bear in mind that the machine's output is merely raw data. If the translations are to be published rather than used for purely informational purposes, post-editing by trained, native-speaker Translators is still necessary. In the third part of our blog series on Machine Translation, we'll show you what good post-editing looks like and whether, post-editing included, machine translation actually delivers the eagerly anticipated time and cost savings when all's said and done.
* Since May 2020, DeepL has offered a Glossary function within both the DeepL Translator (https://www.deepl.com/en/translator) and the Pro version. By including terminology in the Glossary, consistent machine translations can be guaranteed, according to DeepL. It’s currently not known whether the DeepL Glossary function will also be available for CAT tools.
Source: https://www.deepl.com/blog/20200506.html (last update: 06 May 2020)