Beyond Sentence Pairs
Most translation infrastructure still thinks in spreadsheets.
A document is turned into flat sequences of text. A heading, a table cell, a footnote caption. They all become rows in a spreadsheet. Original structure, visual context, the relationship between a paragraph and the heading above it; all of that is set aside before translation even begins.
This was a reasonable tradeoff in a workflow where the translator was a human. Context could be reestablished by someone comparing segments against a PDF, mentally reconnecting the threads of intra- and intertextual meaning. But in an AI-first setting, this is the wrong abstraction.
LLM-based translation systems can work across entire documents. They can use local and global context, track tone and terminology across sections, and unlock capabilities that only emerge when translation is grounded in meaning across the whole document or corpus. But most translation infrastructure is still built as if the sentence, or at most the segment, were the natural unit of work. The result is a mismatch between what the model can do and what the system allows it to see.
Flattening documents into sentence pairs does enable translation memory as a mechanism for stability and control: the same source sentence will reliably produce the same target sentence. And there is real value here since one of the main limitations of LLM-based long-context translation is that consistency over time is hard, if not impossible, to guarantee. A model may produce a strong document-level translation in context yet still make slightly different choices in a later run, a revised version of the document, or elsewhere in the corpus. So, while long context unlocks better and more context-sensitive translation, it does not by itself provide durable consistency.
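The stability property described above is easy to state precisely. A minimal sketch (all names hypothetical, not any real TM product's API): translation memory is, at its core, an exact-match store, so the same source sentence deterministically yields the same approved target, no matter what a model would do on a given run.

```python
class TranslationMemory:
    """Exact-match store: source sentence -> approved target."""

    def __init__(self):
        self._entries = {}

    def store(self, source: str, target: str) -> None:
        self._entries[source] = target

    def lookup(self, source: str):
        # Deterministic by construction: no model in the loop.
        return self._entries.get(source)

tm = TranslationMemory()
tm.store("Press the green button.", "Drücken Sie die grüne Taste.")

# Later runs, revised documents, other files in the corpus:
# the same source reliably yields the same target.
assert tm.lookup("Press the green button.") == "Drücken Sie die grüne Taste."
# Anything the memory has not seen falls through to the model.
assert tm.lookup("Press the red button.") is None
```

The determinism comes from the lookup itself, which is exactly why the guarantee evaporates once the unit of work is larger than the stored key.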
The dominant response to this tension has been to keep the flat-sequence model and wrap AI around it. This pipeline is popular because it does not challenge established ideas about how translation is done and because it maps cleanly onto the existing technology stack. But it does not unlock the possibilities of operating at the level of corpus, document, and sentence at once. And it keeps human experts locked in the segment grid.
This abstraction has become a bottleneck.
The problem is not whether a model can produce a good translation. It is whether the system around the model preserves the structure and context that make good translations possible.
That is the problem we set out to solve in Translator HUBB.
Gunnar: A Semantic Core for Document Translation
Gunnar is the processing engine at the heart of HUBB’s translation pipeline. It takes source documents and produces a stable semantic model before translation begins.
That model captures what each part of the document is: headings, paragraphs, tables, lists, formatting, images, links, protected placeholders. But also how parts relate to each other. Every part gets an identity, and that identity persists across document revisions, translation iterations, and format changes.
This gives us two things at once: a way to preserve consistency across iterations, and a semantically rich basis for long-context translation. In other words: it lets us keep meaning in place.
A semantic layer like this will become foundational for AI-first translation systems. Not because document structure is nice to have, but because without it, every long-context model is translating in the dark, limited by the old abstractions. And systems that discard structure before translation begins will fall further behind.
What a Semantic Model Makes Possible for LLM Translation
Sentence-pair translation does not need document structure. Long-context LLM translation does.
When Gunnar ingests a document, it does so at multiple levels simultaneously: paragraph, sentence, and optionally phrase. These layers coexist. This means we can assemble translation units that carry their context with them: a sentence knows which paragraph it belongs to, which heading it falls under, where it sits in a table or list.
When that context reaches the language model, the model is not translating an orphaned string. It is translating a passage within a structure it can see. The heading above informs tone. Sibling paragraphs inform consistency. Table position informs whether something is a label, a value, or an instruction.
This is the practical payoff of having a semantic document model upstream of translation: the model gets to translate language inside the structure that gives it meaning.
What Changes for the Human Validator
When AI produces translations of entire documents in context, the role of the human validator shifts. This also changes what the editing experience needs to be.
In a segment grid, every table cell is just a row. The structure that gives the text meaning is exactly what the editing environment throws away.
A semantic document model changes where the validator’s attention goes. Because Gunnar captures document structure upstream, the translation can be navigated and reviewed inside that structure and not beside it. Table cells appear in their rows and columns, with headers visible. Internal references are navigable: the validator can click through and confirm that a translation is correct in relation to what it points to. The context is not reconstructed. It was never lost.
The point is not to build a prettier editor. It is that an AI-first translation workflow demands a different kind of human interface, one built around leverage points rather than failure points, and a semantic document model is what makes that interface possible.
The Real Shift
If AI translation is going to move beyond sentence-by-sentence automation, it needs a different substrate. And a different substrate requires a system that preserves structure, identity, and context from the start.
That is the shift Translator HUBB is built around. Gunnar is not an add-on to the old pipeline. It is the semantic core that makes corpus-level translation, validation, and consistency possible.