# Translate a book written in LaTeX from Slovenian into English

With permission of the author, we will demonstrate how to translate the book [Euclidean Plane Geometry](https://sites.google.com/site/projektivna/), written by Milan Mitrović, from Slovenian into English without modifying any of the LaTeX commands.

To achieve this, we will first split the book into chunks, each roughly a page long, then translate each chunk into English, and finally stitch them back together.

## 1. Read in the data

In [19]:
from openai import OpenAI
import os
from transformers import GPT2Tokenizer

client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY", "<your OpenAI API key if you didn't set as an env var>"))

# OpenAI GPT-2 tokenizer is the same as GPT-3 tokenizer
# we use it to count the number of tokens in the text
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

with open("data/geometry_slovenian.tex", "r") as f:
    text = f.read()

len(text)  # length of the text in characters

1485565

### 1.1 Count the tokens in each chunk

In [20]:
chunks = text.split('\n\n')
ntokens = []
for chunk in chunks:
    ntokens.append(len(tokenizer.encode(chunk)))
max(ntokens)

Token indices sequence length is longer than the specified maximum sequence length for this model (1327 > 1024). Running this sequence through the model will result in indexing errors


1473

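It turns out that a double newline is a good separator in this case, since splitting on it does not break the flow of the text. Also, no individual chunk is larger than 1500 tokens. The model we will use is gpt-3.5-turbo (the default in `translate_chunk` below), whose context window is at least 4096 tokens, so we don't need to worry about breaking the chunks down further. The tokenizer warning above can be ignored here, since we only count tokens and never run the GPT-2 model. Note also that gpt-3.5-turbo uses a different tokenizer than GPT-2, so these counts are estimates rather than exact values.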
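The budget reasoning can be written out as a small check: the chunk, the fixed part of the prompt, and the completion all have to fit in the context window. This is a minimal sketch, not part of the notebook. `fits_context` is a hypothetical helper, and the 100-token `prompt_overhead` is an assumed allowance for the instruction and the sample command. The 1473-token chunk and the 4096-token limit come from the text above, and 1500 is the `max_tokens` value that `translate_chunk` uses later.

```python
def fits_context(chunk_tokens, completion_budget, context_limit, prompt_overhead=100):
    """Return True if prompt plus completion fits within the context window.

    prompt_overhead is an assumed allowance for the instruction and the
    sample command; it is not a measured value.
    """
    return chunk_tokens + prompt_overhead + completion_budget <= context_limit


# largest chunk found above, with the completion budget used by translate_chunk
print(fits_context(chunk_tokens=1473, completion_budget=1500, context_limit=4096))
```

Under these assumptions even the largest single chunk leaves room to spare, which is why no further splitting is needed.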

We will group the shorter chunks into batches of around 1000 tokens, which makes each translated piece more coherent and reduces the number of breaks within the text.

In [21]:
def group_chunks(chunks, ntokens, max_len=1000, hard_max_len=3000):
    """
    Group very short chunks, to form approximately page long chunks.
    """
    batches = []
    cur_batch = ""
    cur_tokens = 0
    
    # iterate over chunks, and group the short ones together
    for chunk, ntoken in zip(chunks, ntokens):
        # discard chunks that exceed hard max length
        if ntoken > hard_max_len:
            print(f"Warning: Chunk discarded for being too long ({ntoken} tokens > {hard_max_len} token limit). Preview: '{chunk[:50]}...'")
            continue

        # if room in current batch, add new chunk
        if cur_tokens + 1 + ntoken <= max_len:
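            # skip the separator when the batch is still empty, so no batch starts with blank lines
            cur_batch += ("\n\n" if cur_batch else "") + chunk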
            cur_tokens += 1 + ntoken  # adds 1 token for the two newlines
        # otherwise, record the batch and start a new one
        else:
            batches.append(cur_batch)
            cur_batch = chunk
            cur_tokens = ntoken
            
    if cur_batch:  # add the last batch if it's not empty
        batches.append(cur_batch)
        
    return batches


chunks = group_chunks(chunks, ntokens)
len(chunks)

869
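To see what the greedy grouping does, here is a toy run with made-up token counts. The function is a compact restatement of the same idea and not the notebook's `group_chunks`: it omits the `hard_max_len` discard step, and the name `group_by_budget` and the toy data are hypothetical.

```python
def group_by_budget(chunks, ntokens, max_len):
    """Greedily pack consecutive chunks into batches of at most max_len tokens."""
    batches, cur, cur_tokens = [], [], 0
    for chunk, n in zip(chunks, ntokens):
        # one extra token accounts for the blank-line separator between chunks
        extra = n + (1 if cur else 0)
        if cur and cur_tokens + extra > max_len:
            # no room left: close the current batch and start a new one
            batches.append("\n\n".join(cur))
            cur, cur_tokens = [chunk], n
        else:
            cur.append(chunk)
            cur_tokens += extra
    if cur:  # flush the last batch
        batches.append("\n\n".join(cur))
    return batches


toy_chunks = ["a", "b", "c", "d"]
toy_tokens = [400, 500, 300, 900]  # made-up token counts
batches = group_by_budget(toy_chunks, toy_tokens, max_len=1000)
print(batches)  # ['a\n\nb', 'c', 'd']
```

The first two chunks share a batch (400 + 1 + 500 tokens), while the third cannot join them without exceeding the limit, so it starts a new batch.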

Including a sample command in both its untranslated and translated form, where only the chapter name is translated, helps the model produce more consistent results.

The format of the prompt sent to the model consists of:
1. A high-level instruction to translate only the text, but not the commands, into the desired language.
2. A sample untranslated command, where only the content of the chapter name needs to be translated.
3. The chunk of text to be translated.
4. The translated version of the sample command from 2, which shows the model how the translation should begin.

The expected output is the translated chunk of text.

In [40]:
def translate_chunk(chunk, model='gpt-3.5-turbo',
                    dest_language='English',
                    # raw strings keep the LaTeX backslashes from being read as Python escapes
                    sample_translation=(r"\poglavje{Osnove Geometrije} \label{osn9Geom}", r"\poglavje{The basics of Geometry} \label{osn9Geom}")
                    ):
    prompt = f'''Translate only the text from the following LaTeX document into {dest_language}. Leave all LaTeX commands unchanged
    
"""
{sample_translation[0]}
{chunk}"""

{sample_translation[1]}
'''
    response = client.chat.completions.create(
        messages=[{"role": "user", "content":prompt}],
        model=model,
        temperature=0,
        top_p=1,
        max_tokens=1500,
    )
    result = response.choices[0].message.content.strip()
    result = result.replace('"""', '')  # remove the triple quotes we used to delimit the text
    return result

print(translate_chunk(chunks[800], model='gpt-3.5-turbo', dest_language='English'))

Let $\mathcal{I}=\mathcal{S}_{AB} \circ\mathcal{S}_{CA}
    \circ\mathcal{S}_{BC}$. By  \ref{izoZrcdrsprq} is
    $\mathcal{I}$ a mirror reflection. Let $A_1$, $B_1$ and $C_1$ be in order the center points of the lines $BC$, $AC$ and $AB$ of the triangle $ABC$.
    Because it is a right triangle is $\mathcal{I}(A_1C_1)=A_1C_1$, which
    means that the line $A_1C_1$ is of this mirror reflection. It is not
    difficult to prove that for the point $A'_1=\mathcal{I}(A_1)$ (both
    lie on the axis $A_1C_1$) is
    $\overrightarrow{A_1A'_1}=3\overrightarrow{A_1C_1}$, so
    $\mathcal{I}=\mathcal{G}_{3\overrightarrow{A_1C_1}}$.

\item  \res{Given are the points $A$ and $B$ on the same side of the line
$p$.
Draw the line  $XY$, which lies on the line $p$ and is consistent
with the given line $l$, so that the sum
$|AX|+|XY|+|YB|$ is minimal.}

Let $A'=\mathcal{G}_{\overrightarrow{MN}}(A)$ (where $M,N\in
p$ and $MN\cong l$). The point $Y$ is obtained as the intersection of the lines $p$
and $

We can see that, for this chunk, the model translates only the text and leaves the LaTeX commands intact.
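Inspecting one chunk by eye does not scale to the whole book, so a simple automated check may help. The sketch below compares the LaTeX commands in a source string with those in its translation. It is an illustration under stated assumptions, not part of the notebook: `latex_commands` and `commands_preserved` are hypothetical helpers, and the regular expression only matches control words (a backslash followed by letters), so it ignores control symbols and does not compare command arguments.

```python
import re
from collections import Counter

# matches control words: a backslash followed by one or more letters
COMMAND_RE = re.compile(r"\\[A-Za-z]+")


def latex_commands(text):
    """Count how often each control-word command occurs in a LaTeX string."""
    return Counter(COMMAND_RE.findall(text))


def commands_preserved(source, translation):
    """Return True if both strings contain the same commands the same number of times."""
    return latex_commands(source) == latex_commands(translation)


source = r"\poglavje{Osnove Geometrije} \label{osn9Geom}"
good = r"\poglavje{The basics of Geometry} \label{osn9Geom}"
bad = r"\chapter{The basics of Geometry} \label{osn9Geom}"  # command was altered

print(commands_preserved(source, good))  # True
print(commands_preserved(source, bad))   # False
```

A check like this would also flag a translation that was cut off before the end of the chunk, since the truncated output would contain fewer commands than the source.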

Let's now translate all the chunks in the book - this will take 2-3 hours, as we're processing requests sequentially.
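Since each chunk is translated independently, the requests could in principle be issued concurrently instead. The following is a minimal sketch using the standard library's `ThreadPoolExecutor`. `translate_all` and `fake_translate` are hypothetical names, and `fake_translate` is only a stand-in for a function like `translate_chunk` so that the example runs without an API key. Real API usage is subject to rate limits, which this sketch does not handle.

```python
from concurrent.futures import ThreadPoolExecutor


def translate_all(chunks, translate_fn, max_workers=4):
    """Apply translate_fn to every chunk using a thread pool.

    Executor.map yields results in the order of the inputs, so the
    translated chunks can be joined back together directly.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(translate_fn, chunks))


def fake_translate(chunk):
    # stand-in for a real API call, used only to make the sketch runnable
    return chunk.upper()


print(translate_all(["a", "b", "c"], fake_translate))  # ['A', 'B', 'C']
```

The notebook itself keeps the simple sequential loop shown next.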

In [39]:
dest_language = "English"

translated_chunks = []
for i, chunk in enumerate(chunks):
    print(str(i+1) + " / " + str(len(chunks)))
    # translate each chunk
    translated_chunks.append(translate_chunk(chunk, model='gpt-3.5-turbo', dest_language=dest_language))

# join the chunks together
result = '\n\n'.join(translated_chunks)

# save the final result
with open(f"data/geometry_{dest_language}.tex", "w") as f:
    f.write(result)

0 / 869
1 / 869
2 / 869
...
110 / 869
