translation Archives - Michael's Programming Bytes

Let’s get stuff set up!

Before we dive into the juicy Tesseract translations, let’s first get our packages installed and modules imported on the IDE:

!pip install pytesseract
!pip install googletrans

import pytesseract
import numpy as np
from PIL import Image
from googletrans import Translator

Now, unlike our previous Tesseract scenarios, we’ll need to pip install an additional package this time: googletrans (pip install googletrans), an open-source library that connects to Google Translate’s API. Why is this package necessary? Tesseract certainly has its capabilities when it comes to reading text from standard-font images, even if it couldn’t quite grasp the text in OCR Scenario 2: How Well Can Tesseract Read Photos?, OCR Scenario 3: How Well Can Tesseract Read Documents? and OCR Scenario 4: How Well Can Tesseract Read My Handwriting?. One thing Tesseract cannot do, though, is translate text from one language to another. Tesseract handles the reading; googletrans handles the translating.

In this post, I’ll test Tesseract in conjunction with googletrans to see not only how well Tesseract can read foreign-language text but also how well googletrans can translate it. I’ll try the combination on three images, one each in Spanish, French, and German, and see how each image’s text is translated to English.

Leyendo el texto en Español (reading the Spanish text)

In our first Tesseract translation, we’ll attempt to read the following phrase from an image and translate it from Spanish to English:

This phrase simply reads Tomorrow is Friday in English, but let’s see if our Tesseract/googletrans combination can pick up on the English translation.

First, we get the text that Tesseract read from the image:

testImage = 'spanish text.png'
testImageNP = np.array(Image.open(testImage))
testImageTEXT = pytesseract.image_to_string(testImageNP)
print(testImageTEXT)

Manana es
viernes

Next, we run a googletrans translation and translate the text from Spanish to English:

translator = Translator()
translation = await translator.translate(testImageTEXT, src='es', dest='en')
print(translation.text)

Tomorrow is
friday

As you can see, the googletrans Translator object worked its magic here with the translate() method, which takes three parameters: the text extracted by Tesseract, the text’s original language (Spanish, or es) and the language you want to translate into (English, or en). The translated text is correct: the image’s text did read Tomorrow is friday in English. Personally, I’m amazed it managed to get the correct translation even though Tesseract didn’t pick up the tilde (~) over the ñ (the enye) when it read the text.
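The missing ñ is probably down to the fact that image_to_string() uses Tesseract’s English model when no language is given. pytesseract accepts a lang argument, but it expects Tesseract’s three-letter codes (spa, fra, deu) rather than the two-letter codes googletrans uses. Here is a minimal sketch, assuming the matching Tesseract language data is installed; the lookup table and the helper names tesseract_lang and read_text are my own illustration and are not part of either library:

```python
# Hypothetical lookup from googletrans-style codes to Tesseract language codes.
TESSERACT_LANGS = {"en": "eng", "es": "spa", "fr": "fra", "de": "deu"}


def tesseract_lang(code):
    """Return the Tesseract code for a two-letter code, defaulting to English."""
    return TESSERACT_LANGS.get(code.lower(), "eng")


def read_text(image_path, code):
    """OCR an image with a language hint (needs pytesseract, Pillow and the language data)."""
    # Imported here so the lookup above can be used without the OCR stack installed.
    import pytesseract
    from PIL import Image

    return pytesseract.image_to_string(Image.open(image_path), lang=tesseract_lang(code))
```

Calling read_text('spanish text.png', 'es') would then ask Tesseract to use its Spanish model, which should give it a better chance with accented characters.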

Now, you may be wondering why I added the await keyword in front of the translator.translate() method call, and here’s where I’ll introduce a new Python concept. See, translator.translate() is what’s known as an asynchronous function: calling it returns a coroutine object rather than the finished translation, which lets a program do other work while the request to the Google Translate API is pending. Without await, the translation variable would hold that coroutine object, which has no text attribute, so translation.text would raise an error instead of returning the translated text. To get around this, we add the await keyword in front of translator.translate() before reading translation.text. The await keyword makes the program wait for the translation request to complete before subsequent code is executed.
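To see the difference await makes without touching the network, here is a small sketch that uses a made-up stand-in (fake_translate and FakeTranslation are hypothetical, not googletrans names). Notebooks such as Colab already run an event loop, which is why a bare await works there; in a regular .py script the call has to be driven by asyncio.run() instead:

```python
import asyncio


class FakeTranslation:
    """Stand-in for a translation result; only carries a .text attribute."""

    def __init__(self, text):
        self.text = text


async def fake_translate(text):
    """Hypothetical stand-in for an async translate call (no network involved)."""
    await asyncio.sleep(0)  # pretend to wait on an API request
    return FakeTranslation(text.upper())


async def demo():
    pending = fake_translate("hola")  # no await: this is only a coroutine object
    has_text = hasattr(pending, "text")  # False, so pending.text would raise AttributeError
    result = await pending  # await runs the coroutine and hands back the real result
    return has_text, result.text


# In a plain script there is no running event loop, so asyncio.run() drives the coroutine.
has_text, translated = asyncio.run(demo())
print(has_text, translated)  # False HOLA
```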

Since the src and dest parameters require language codes for the method to work properly, here’s Google Translate’s handy-dandy list of reference codes: https://developers.google.com/workspace/admin/directory/v1/languages.

Auto-detection…how might that work?

Granted, the googletrans package did a good job of translating the text above from Spanish to English, but I want to see if the translator.translate() method can auto-detect that the text is in Spanish and still translate it to English:

translator = Translator()
translation = await translator.translate(testImageTEXT, dest='en')
print(translation.text)

Tomorrow is
friday

In this example, I only specified that I want to translate the text to English without mentioning that the original text is in Spanish. Despite the small change, I still get the same desired translation: Tomorrow is friday.

I’ve noticed that when I use Google Translate, it often does a good job of auto-detecting the text’s language, though like any AI translation tool, it can also mis-detect the source language at times.
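Since auto-detection can guess wrong, one option is to try it first and fall back to an explicit source language when the guess looks off. The sketch below assumes the result object exposes the detected language as a src attribute alongside text (my understanding of googletrans, so treat it as an assumption). translate_with_check and fake_translate are hypothetical names, and the stand-in deliberately "detects" Portuguese so the fallback path runs without a network call:

```python
import asyncio
from types import SimpleNamespace


async def translate_with_check(translate, text, expected_src, dest="en"):
    """Try auto-detection first; retry with an explicit source if the guess differs.

    `translate` is any awaitable callable taking (text, dest=..., src=...) and
    returning an object with .src and .text (an assumed result shape).
    """
    first = await translate(text, dest=dest)
    if first.src.lower() == expected_src.lower():
        return first
    return await translate(text, dest=dest, src=expected_src)


async def fake_translate(text, dest="en", src="auto"):
    """Stand-in that 'detects' Portuguese unless a source language is given."""
    detected = "pt" if src == "auto" else src
    return SimpleNamespace(src=detected, dest=dest, text=f"[{detected}->{dest}] {text}")


result = asyncio.run(translate_with_check(fake_translate, "Manana es viernes", "es"))
print(result.src)  # es
```

In a notebook, the same helper could be awaited directly with translator.translate passed in as the first argument.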

Traduisons ce texte français (Let’s translate this French text)

For my next scenario, we’re going to see how well the Tesseract/googletrans conjunction can translate the following French text:

Just as we did with the Spanish text image, let’s first read the text using Tesseract:

testImage = 'french text.png'
testImageNP = np.array(Image.open(testImage))
testImageTEXT = pytesseract.image_to_string(testImageNP)
print(testImageTEXT)

Joyeux
anniversaire a tol

OK, so a small misreading here (tol instead of the French pronoun toi), but pretty accurate otherwise. Perhaps Tesseract thought the lowercase i in toi was a lowercase l? Let’s see how this affects the French-to-English translation:

translator = Translator()
translation = await translator.translate(testImageTEXT, src='fr', dest='en')
print(translation.text)

Happy
birthday to you

Interestingly, even with the slight Tesseract misread of the French text, we still got the correct English translation of Happy birthday to you.

Deutsche Textübersetzung (German text translation)

Last but not least, we’ll see the Tesseract/googletrans conjunction’s capabilities on German-to-English text translation. Here’s the German text we’ll try to translate to English:

Now just as we did with the Spanish text and French text images, let’s first extract the German text from this image with Tesseract:

testImage = 'german text.png'
testImageNP = np.array(Image.open(testImage))
testImageTEXT = pytesseract.image_to_string(testImageNP)
print(testImageTEXT)

Ich liebe
Programmieren
wirklich.

Let’s see what the resulting English translation is!

translator = Translator()
translation = await translator.translate(testImageTEXT, src='de', dest='en')
print(translation.text)

I love
Programming
really.

OK, so the actual phrase I put into Google Translate was “I really love programming” and the German translation was “Ich liebe Programmieren wirklich”. Fair enough, right? However, the German-to-English translation of this phrase read “I love programming really”. How is this possible?

The translation quirk is possible because of the adverb in this case: wirklich (German for really). See, unlike English adverbs, German adverbs tend to be more flexible about where they’re placed in a sentence. In English, “I love programming really” doesn’t sound quite right, but in German, “Ich liebe Programmieren wirklich”, which places the adverb “really” after the thing it’s emphasizing (“love programming”), is a more common way to use adverbs. And that is my linguistic fun fact for this post!
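Word order aside, the OCR output here is split over three lines, and the translation came back split the same way. It may be worth joining the lines into a single sentence before translating so the translator sees the phrase as a whole; whether that changes the result for this particular phrase is untested here. A minimal sketch (the helper name flatten_ocr_text is my own):

```python
def flatten_ocr_text(text):
    """Collapse line breaks and repeated spaces from OCR output into single spaces."""
    return " ".join(text.split())


flattened = flatten_ocr_text("Ich liebe\nProgrammieren\nwirklich.\n")
print(flattened)  # Ich liebe Programmieren wirklich.
```

The flattened string could then be passed to translator.translate() in place of the raw Tesseract output.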

The Colab notebook can be found in my GitHub at this link: https://github.com/mfletcher2021/blogcode/blob/main/Tesseract_Translation.ipynb

Thanks for reading,

Michael


OCR Scenario 5: Tesseract Translation
