Terminology Management with Probabilistic AI Systems

Terminology Management

Terminology management plays a central role in many fields, particularly in highly specialised domains such as law, technology, and medicine, where a clearly defined, consistent terminology can be essential for product safety, legal certainty, and the prevention of liability claims. In multilingual settings, terminology management must of course also be integrated into professional translation workflows – an area where Artificial Intelligence, and probabilistic AI systems in particular, are becoming increasingly important.

Terminology Enforcement with Conventional Translation Systems

Most commercial providers of AI-based translation systems (Neural Machine Translation, NMT), including DeepL and Google Translate, therefore allow users to upload bilingual glossaries. During translation, the glossary terms are systematically “enforced”, overriding probabilistically determined alternatives where necessary. This process is often referred to as terminology enforcement or terminology integration. Since it requires an intervention in highly complex neural networks with numerous hidden layers, terminology enforcement is considered a technically challenging task that even state-of-the-art systems cannot yet perform with complete reliability. Glossary terms may, for instance, be implemented only partially or ignored entirely for no apparent reason. In addition, terminology enforcement in NMT systems may produce negative side effects, including additional spelling, punctuation or grammar mistakes. The following example illustrates two typical cases: a grammatical agreement error (“die Ammenstillen”) and the insertion of a superfluous space (before the comma after “Säuglingsnahrung”):

Example

Source text:

Is wet-nursing culturally acceptable and can a safe wet-nurse be identified?
No: Feed infant formula milk to the baby until the mother recovers.

Google Translate:

Ist die Ammenstillen kulturell akzeptabel und kann eine sichere Amme identifiziert werden? Nein: Füttern Sie das Baby mit Säuglingsnahrung , bis sich die Mutter erholt hat.

Example from Šorak 2026. Text excerpt from the World Health Organization (2020): “FAQ: Breastfeeding and COVID-19. For health care workers”. Glossary terms underlined by the author.

A comparative investigation of DeepL and Google Translate based on texts from the World Health Organization showed that such negative effects are by no means marginal: they occurred in 12% of cases with DeepL, and as many as 20% of cases with Google Translate (Šorak 2026). The study further revealed that terminology enforcement with Google Translate failed in 24% of cases, compared with only 5% of cases with DeepL.

Terminology Enforcement with Generative LLMs

In recent years, generative Large Language Models (LLMs) such as OpenAI’s ChatGPT and Google Gemini have also increasingly been used for translation tasks. Unlike in conventional NMT systems, terminology enforcement in generative LLMs is not an additional feature, but an inherent capability that can be initiated via prompting. Typical negative effects such as the grammar or punctuation errors mentioned above have not yet been observed in these models. Nevertheless, LLMs remain similarly unreliable when it comes to consistent terminology enforcement. A pilot study using ChatGPT and Google Gemini on texts from the World Health Organization found that both systems adopted the glossary terms in only slightly more than 70% of cases (Šorak 2025).

The test glossaries included adjectives, verbs, nouns and multi-word lexemes. Common-language terms were deliberately incorporated to test whether terminological preferences would still be reliably enforced when alternative translations with a significantly higher statistical probability exist, e.g. respectful – wertschätzend instead of respektvoll and disrespectful – geringschätzig instead of respektlos. Although the reasons for terminology enforcement failures remained largely unclear, it was noticeable that both models deviated slightly more frequently from glossary requirements when handling common-language terms. One plausible explanation is that terms for which a statistically more probable alternative translation exists are associated with an increased risk of enforcement failure because the LLMs may prioritise statistical dominance in these cases.

An additional experiment using a deliberately incorrect glossary entry (WHO Statement – OHW-Stellungnahme) appeared to support this hypothesis: the models generally “corrected” the term, thereby disregarding the glossary. This suggests that the further a glossary entry deviates from learned probability patterns, the more likely it is to be ignored. Nevertheless, even “implausible” entries are implemented in some cases. The following example shows a case in which ChatGPT initially incorporated the glossary term but subsequently ignored it within the same paragraph:

Example

Source text:

To achieve a high standard of respectful care during childbirth, health systems must be organized and managed in a manner that ensures respect for women’s sexual and reproductive health and human rights. While many governments, professional societies, researchers, international organizations, civil society groups and communities worldwide have already highlighted the need to address this problem (11-14) in many instances policies to promote respectful maternal care have not been adopted, are not specific, or have not yet been translated into meaningful action.

ChatGPT:

Um ein hohes Maß an wertschätzender Versorgung während der Geburt zu gewährleisten, müssen Gesundheitssysteme so organisiert und verwaltet werden, dass sie die sexuellen und reproduktiven Gesundheitsrechte und Menschenrechte von Frauen respektieren. Zwar haben viele Regierungen, Fachgesellschaften, Forschende, internationale Organisationen, zivilgesellschaftliche Gruppen und Gemeinschaften weltweit bereits auf die Notwendigkeit hingewiesen, dieses Problem anzugehen, doch wurden entsprechende Politiken zur Förderung einer respektvollen* geburtshilflichen Versorgung häufig nicht verabschiedet, sind zu allgemein oder wurden noch nicht in konkrete Maßnahmen umgesetzt.

Example from Šorak 2025. Text excerpt from the World Health Organization (2020): “The prevention and elimination of disrespect and abuse during facility-based childbirth”. Glossary terms underlined by the author.

It remains unclear whether and to what extent terminology enforcement in LLMs can be improved with different prompting strategies. The pilot study cited above employed only direct (zero-shot) prompts.
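For illustration, a direct (zero-shot) prompt with an embedded glossary might be assembled roughly as follows. This is only a sketch: the function name `build_zero_shot_prompt` and the prompt wording are assumptions made for this example and are not the prompts used in the pilot study.

```python
def build_zero_shot_prompt(source_text, glossary,
                           source_lang="English", target_lang="German"):
    """Compose one direct translation prompt that embeds a bilingual glossary.

    Hypothetical helper: the instruction wording below is an assumption,
    not the wording used in the pilot study.
    """
    # One glossary entry per line: "- source term -> target term"
    term_lines = "\n".join(f"- {src} -> {tgt}" for src, tgt in glossary.items())
    return (
        f"Translate the following text from {source_lang} to {target_lang}.\n"
        "Use the glossary translations below wherever a glossary term occurs.\n\n"
        f"Glossary:\n{term_lines}\n\n"
        f"Text:\n{source_text}"
    )


# Glossary pairs taken from the examples discussed in this article.
glossary = {"respectful": "wertschätzend", "disrespectful": "geringschätzig"}
prompt = build_zero_shot_prompt(
    "policies to promote respectful maternal care", glossary
)
print(prompt)
```

Because no worked examples are included in the prompt, this corresponds to the zero-shot setting; a few-shot variant would additionally show the model sample translations in which the glossary terms are applied.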

Google Gemini: A Special Case

With regard to Google Gemini, it should also be noted that the model tends to make substantial modifications to the source text during translation. In the pilot study cited above, for instance, Gemini omitted headings, altered the sentence order, and even added, duplicated or omitted relevant information. The model furthermore changed the formatting by applying bold type to individual words or sentence parts and altering paragraph structures.

The following example shows a text excerpt in which Gemini duplicated and applied bold formatting to parts of the original instructions during translation. The model also highlighted most glossary terms in bold.

Example

Source text:

Gastric reflux (sometimes called acid reflux or heartburn) is very common, so avoid eating late at night or eating foods that cause indigestion.
Stop smoking.
Use other ways of communicating, such as writing, texting or using gestures, if talking is difficult or uncomfortable.

Google Gemini:

Umgang mit Reflux: Sodbrennen(Reflux oder saurer Rückfluss) ist sehr häufig. Vermeiden Sie es daher, spät abends zu essen oder Lebensmittel zu sich zu nehmen, die Magenverstimmungenverursachen.
Rauchen aufhören: Stellen Sie das Rauchen ein.
Alternative Kommunikation: Nutzen Sie andere Kommunikationswege wie Schreiben, Textnachrichten oder Gestik, wenn das Sprechen schwierig oder unangenehm ist.

Example from Šorak 2025. Text excerpt from the World Health Organization (2021): “Support for rehabilitation: self-management after COVID-19-related illness”. Glossary terms underlined by the author, bold formatting added by Google Gemini.

In domains where close adherence to the source text is imperative (e.g., legal texts), this behaviour poses a significant problem. In such cases, Google Gemini must therefore be explicitly prompted to produce more restrictive translations. Ideally, however, preference should be given to models that inherently adhere more closely to the source text.

Conclusion

In probabilistic AI models such as generative LLMs and NMT systems, terminology enforcement during translation is not yet fully reliable. For terminology-sensitive areas of application, this means that downstream human or automated terminology verification processes remain indispensable.
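An automated terminology check of the kind mentioned above could, in its simplest form, compare how often a glossary term occurs in the source text with how often its prescribed translation occurs in the target text. The following is a minimal sketch under simplifying assumptions: the function names are hypothetical, the texts are abbreviated excerpts from the ChatGPT example above, and inflected forms are matched only by word-initial prefix, which will not cover irregular inflection or terms embedded in compounds.

```python
import re


def count_term(text, term):
    """Count case-insensitive matches of `term` at the start of a word.

    Word-initial matching lets inflected forms count (e.g. 'wertschätzender'
    for 'wertschätzend') while preventing 'respectful' from matching inside
    'disrespectful'. It also matches longer words such as 'respectfully'.
    """
    return len(re.findall(r"\b" + re.escape(term), text, flags=re.IGNORECASE))


def verify_terminology(source, target, glossary):
    """Return (source_term, target_term, expected, found) for each shortfall."""
    issues = []
    for src_term, tgt_term in glossary.items():
        expected = count_term(source, src_term)
        found = count_term(target, tgt_term)
        if found < expected:
            issues.append((src_term, tgt_term, expected, found))
    return issues


glossary = {"respectful": "wertschätzend"}
# Abbreviated excerpts from the ChatGPT example in this article.
source = ("To achieve a high standard of respectful care during childbirth ... "
          "policies to promote respectful maternal care have not been adopted")
target = ("Um ein hohes Maß an wertschätzender Versorgung während der Geburt ... "
          "Politiken zur Förderung einer respektvollen geburtshilflichen Versorgung")

issues = verify_terminology(source, target, glossary)
print(issues)  # [('respectful', 'wertschätzend', 2, 1)]
```

Here the check flags that “respectful” occurs twice in the source while “wertschätzend” appears only once in the translation. A production workflow would more plausibly rely on lemmatisation or a dedicated terminology checker; the count comparison only indicates where human review is needed.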

Sources

Šorak, Vanessa (2025): „Terminologieerzwingung mit generativen LLMs: ChatGPT und Google Gemini als Übersetzungstools“, in: edition. Fachzeitschrift für Terminologie, 25(2), 19–24. URL: edition.dttev.org/ausgaben/edition-2025-2-e-version.pdf [29.01.2026]

Šorak, Vanessa (2026, in print): Neural Machine Translation in Pandemic-Related Health Communication. A Comprehensive Analysis of Risks and Risk Mitigation Strategies Using WHO’s COVID-19 Communication as an Example. Translation and Multilingual Natural Language Processing, Language Science Press: Berlin.

An article by
Dr. Vanessa Šorak

Dr. Vanessa Šorak is an academic staff member at the Institute of Translation and Interpreting at Heidelberg University. In her research, she focuses on AI-based language technologies in multilingual translation processes, with a particular emphasis on public health communication.