OCR Scenario 2: How Well Can Tesseract Read Photos?

Hello everyone,

Michael here, and in today’s post, we’ll see how well OCR and PyTesseract can read text from photos!

Here’s the photo we will be reading from:

This is a photo of a banner at the Nashville Farmers’ Market, taken by me on August 29, 2025. I figured this would be a good example for testing how well OCR can read text from photos, as this banner contains elements in different colors, fonts, text sizes, and text alignments (you might not notice it at first glance, but the Nashville in the Nashville Farmers’ Market logo in the bottom right-hand corner of this banner sits on a small yellow background).

Let’s begin!

But first, the setup!

Before we dive right into text extraction, let’s read the image into the IDE and install & import any necessary packages. First, if you don’t already have these modules installed, run the following commands in either your IDE or CLI:

!pip install pytesseract
!pip install opencv-python

Next, let’s import the following modules:

import pytesseract
import numpy as np
from PIL import Image

And now, let’s read the image!

Now that we’ve got all the necessary modules installed and imported, let’s read the image into the IDE:

testImage = 'farmers market sign.jpg'
testImageNP = np.array(Image.open(testImage))
testImageTEXT = pytesseract.image_to_string(testImageNP)
print(testImageTEXT)

Output: [no text read from image]

Unlike the 7 years image I used in the previous lesson, no text was picked up by PyTesseract from this image. Why could that be? I have a few theories as to why no text was read in this case:

  • There’s a lot going on in the background of the image (cars, pavilions, etc.)
  • PyTesseract might not recognize the fonts used on the banner, since none of them are standard computer fonts
  • Some of the elements on the banner (specifically the Nashville Farmers’ Market logo in the bottom right-hand corner) don’t have horizontally aligned text, and/or the text is too small for PyTesseract to read.

Can we solve this issue? Let’s explore one possible method: image thresholding.

A little bit about thresholding

First of all, I figured we could try image thresholding for two reasons: it might help PyTesseract read at least some of the banner text, and it’s a new concept I haven’t yet covered on this blog, so I can teach you all something new in the process.

Now, as for image thresholding: it’s the process of converting a grayscale image into a two-color image using a specific pixel intensity threshold (more on that later). The two colors in the thresholded image are usually black and white, which emphasizes the contrast between the different elements of the image.
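To make the idea concrete, here’s a minimal NumPy sketch of that conversion on a toy 2x3 array (the 127 threshold is just an example value, the midpoint of the 0-255 intensity range):

```python
import numpy as np

# A toy 2x3 "grayscale image" with intensities from 0 to 255
gray = np.array([[12, 127, 128],
                 [200, 50, 255]], dtype=np.uint8)

# Binary thresholding by hand: pixels above the threshold become
# white (255); pixels at or below it become black (0)
threshold = 127
binary = np.where(gray > threshold, 255, 0).astype(np.uint8)

print(binary)
# [[  0   0 255]
#  [255   0 255]]
```

Note the pixel that sits exactly at 127: it’s not above the threshold, so it goes to black.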

And now, let’s try some thresholding!

Now that we know a little bit about what image thresholding is, let’s try it on the banner image to see if we can extract at least some text from it.

First, let’s read the image into the IDE using cv2.imread() and convert it to grayscale (thresholding only works on grayscale images):

import cv2
from google.colab.patches import cv2_imshow

banner = cv2.imread('farmers market sign.jpg')
banner = cv2.cvtColor(banner, cv2.COLOR_BGR2GRAY)
cv2_imshow(banner)

As you can see, we now have a grayscale image of the banner that can be processed for thresholding.

The thresholding of the image

Here’s how we threshold the image using a type of thresholding called binary thresholding:

ret, thresh = cv2.threshold(banner, 127, 255, cv2.THRESH_BINARY)
cv2_imshow(thresh)

The cv2.threshold() method takes four parameters: the grayscale image, the pixel intensity threshold to apply, the pixel value to assign to pixels that clear the threshold, and the thresholding method to use. In this case, I’m using cv2.THRESH_BINARY.

Now, what is the significance of the numbers 127 and 255? 127 is the threshold value: any pixel with an intensity at or below this threshold is set to black (intensity 0), while any pixel with an intensity above it is set to white (intensity 255). While 127 isn’t a required threshold value, it’s a sensible default because it sits at the midpoint between the lowest and highest pixel intensities (0 and 255, respectively), which makes it a handy starting point for establishing black-and-white contrast. 255, on the other hand, is the intensity value assigned to any pixel that clears the 127 threshold. As I mentioned earlier, white pixels have an intensity of 255, so pixels above the threshold turn white, while pixels at or below it are set to an intensity of 0 and turn black.

  • A little bit about the ret value in the code: it holds the threshold that was actually applied to the image. Since we’re doing simple binary thresholding, ret simply echoes back the threshold value we specified (127). For more advanced thresholding methods, ret will contain the calculated optimal threshold.

And now the big question…will Tesseract read any text with the new image?

Now that we’ve worked OpenCV’s thresholding magic onto the image, let’s see if PyTesseract picks up any text from the image:

bannerTEXT = pytesseract.image_to_string(thresh)
print(bannerTEXT)

a>
FU aba tee
RKET

Using the PyTesseract image_to_string() method on the thresholded image, the only real improvement is that any text was read at all. Even after thresholding, PyTesseract’s output doesn’t come close to what’s actually on the banner (although it did, surprisingly, pick up the RKET from the logo).

All in all, this goes to show that even with some good image preprocessing methods, PyTesseract still has its limits. I still have several other scenarios that I will test with PyTesseract, so stay tuned for more!

Here’s the GitHub link to the Colab notebook used for this tutorial (you will need to upload the images to the IDE again, which can easily be done by copying the images from this post, saving them to your local drive, and re-uploading them to the notebook): https://github.com/mfletcher2021/blogcode/blob/main/OCR_photo_text_extraction.ipynb.

Thanks for reading,

Michael

How To Use OCR Bounding Boxes

Hello everyone,

Michael here, and today’s post will be a lesson on how to use bounding boxes in OCR.

You’ll recall that in my 7th-anniversary post, The Seven-Year Coding Wonder, I gave an introduction to Python OCR with the Tesseract package. Now, I’ll show you how to make bounding boxes, which you can use in your OCR analyses.

But first, what are bounding boxes?

That’s a very good question. Simply put, a bounding box is a rectangular region that denotes the location of a specific object (be it text or something else) within a given space.

For instance, let’s take this restaurant sign. The rectangle I drew around the COME ON IN part of the sign would serve as a bounding box.

In this case, the red rectangular bounding box would denote the location of the COME ON IN text.

You can use bounding boxes to find anything in an image, like other text, other icons on the sign, and even the shadow the sign casts on the sidewalk.
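In code, a bounding box usually travels as four numbers: the top-left corner plus a width and a height. Here’s a quick sketch (with made-up numbers) of turning those into the two corner points a drawing routine typically wants:

```python
# A bounding box as (left, top, width, height) -- these values are
# made-up examples, not measurements from a real image
left, top, width, height = 100, 50, 200, 40

# The two opposite corners of the rectangle
top_left = (left, top)
bottom_right = (left + width, top + height)

print(top_left, bottom_right)  # (100, 50) (300, 90)
```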

Bounding boxes, tesseract style!

Now that we’ve explained what bounding boxes are, it’s time to test them out on an image with Tesseract!

Here’s the image we’ll test our bounding boxes on:

Now, how do we get our bounding boxes? Here’s how:

  • Keep in mind, I will continue from where I left off in my 7-year anniversary post, so if you want to know how to read the image and print its text in the IDE, here’s the post you should read: The Seven-Year Coding Wonder.

First, install the OpenCV package:

!pip install opencv-python

Next, run pytesseract’s image_to_data() method on the image and print out the resulting dictionary:

sevenYears = pytesseract.image_to_data(testImageNP, output_type=pytesseract.Output.DICT)
print(sevenYears)

{'level': [1, 2, 3, 4, 5, 5, 5, 5, 4, 5, 5], 'page_num': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'block_num': [0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'par_num': [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'line_num': [0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2], 'word_num': [0, 0, 0, 0, 1, 2, 3, 4, 0, 1, 2], 'left': [0, 528, 528, 571, 571, 1069, 1371, 1618, 528, 528, 1297], 'top': [0, 502, 502, 502, 504, 529, 502, 504, 690, 690, 692], 'width': [2129, 1205, 1205, 1124, 452, 248, 200, 77, 1205, 714, 436], 'height': [1399, 313, 313, 125, 97, 98, 99, 95, 125, 99, 123], 'conf': [-1, -1, -1, -1, 96, 96, 96, 96, -1, 96, 96], 'text': ['', '', '', '', 'Thank', 'you', 'for', '7', '', 'wonderful', 'years!']}

Now, what does all of this juicy data mean? Let’s dissect it key-by-key:

  • level: The element level in the Tesseract output (1 indicates page, 2 indicates block, 3 indicates paragraph, 4 indicates line, and 5 indicates word)
  • page_num: The page number in the document where the element was found; granted, this is just a one-page image we’re working with, so this information isn’t terribly useful (though if we were working with a PDF or multi-page document, it would be very helpful)
  • block_num: Which chunk of connected text (paragraph, column, etc.) an element belongs to (this runs on a 0-index system, so 0 indicates the first chunk)
  • par_num: The paragraph number that a block element belongs to (also 0-indexed)
  • line_num: The line number within a paragraph (also 0-indexed)
  • word_num: The word number within a line (also 0-indexed)
  • left & top: The X-coordinate of the left edge and the Y-coordinate of the top edge of the bounding box, respectively
  • width & height: The width & height of the bounding box, in pixels
  • conf: The OCR confidence (from 0-100, with 100 being an exact match) that the word in the bounding box was detected correctly. A conf of -1 means the element has no confidence value because it’s not a word
  • text: The actual text in the bounding box

Wow, that’s a lot of information to dissect! Another thing to note about the above output-not all of it is relevant. Let’s clean up the output to only display information related to the words in the image:

import pandas as pd

sevenYearsDataFrame = pd.DataFrame(sevenYears)
sevenYearsWords = sevenYearsDataFrame[sevenYearsDataFrame['level'] == 5]
print(sevenYearsWords)

    level  page_num  block_num  par_num  line_num  word_num  left  top  width  \
4       5         1          1        1         1         1   571  504    452   
5       5         1          1        1         1         2  1069  529    248   
6       5         1          1        1         1         3  1371  502    200   
7       5         1          1        1         1         4  1618  504     77   
9       5         1          1        1         2         1   528  690    714   
10      5         1          1        1         2         2  1297  692    436   

    height  conf       text  
4       97    96      Thank  
5       98    96        you  
6       99    96        for  
7       95    96          7  
9       99    96  wonderful  
10     123    96     years!  

Granted, it’s not necessary to convert the image dictionary into a dataframe, but I chose to do so since dataframes are quite versatile and easy to filter. As you can see here, we have all the same metrics we got before, just for the words (which is what we really wanted).
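If you’d rather not bring in pandas at all, the same level == 5 filter works on the raw dictionary with a list comprehension (abridged here to two keys from the output shown earlier):

```python
# Two keys from the image_to_data() dictionary printed earlier;
# level 5 entries are individual words
data = {
    'level': [1, 2, 3, 4, 5, 5, 5, 5, 4, 5, 5],
    'text': ['', '', '', '', 'Thank', 'you', 'for', '7', '',
             'wonderful', 'years!'],
}

# Keep only the word-level text
words = [t for lvl, t in zip(data['level'], data['text']) if lvl == 5]

print(words)  # ['Thank', 'you', 'for', '7', 'wonderful', 'years!']
```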

And now, let’s see some bounding boxes!

Now that we know how to find all the information about an image’s bounding boxes, let’s figure out how to display them on the image. Granted, the pytesseract library won’t actually draw the boxes onto the images. However, we can use another familiar library to help us out here-OpenCV (which I did a series on in late 2023).

First, let’s install the opencv-python module onto our IDE if it’s not already there:

!pip install opencv-python

  • Remember, there’s no need for the exclamation point at the front if you’re running this command on a CLI.

Next, let’s read the image into the IDE:

import cv2
from google.colab.patches import cv2_imshow

sevenYearsTestImage = cv2.imread('7 years.png', cv2.IMREAD_COLOR)
cv2_imshow(sevenYearsTestImage)
cv2.waitKey(0)

After installing the OpenCV module, we read the image into the IDE using the cv2.imread() method. The cv2.IMREAD_COLOR flag ensures we read and display the image in its standard color format.

  • You may be wondering why we’re reading the image into the IDE again, especially after reading it in for pytesseract. We need to read it again because pytesseract only gave us back text and metadata, not an image we can draw on. We need the actual image pixels in order to draw the bounding boxes.
  • If you’re not using Google Colab as your IDE, there’s no need to include this line: from google.colab.patches import cv2_imshow. The reason Google Colab makes you include it is that the standard cv2.imshow() method crashes Colab notebooks, so think of this line as Google Colab’s fix for the problem. It’s annoying, I know, but it’s just one of those IDE quirks.

Drawing the bounding boxes

Now that we’ve read the image into the IDE, it’s time for the best part-drawing the bounding boxes onto the image. Here’s how we can do that:

sevenYearsWords = sevenYearsWords.reset_index(drop=True)

howManyBoxes = len(sevenYearsWords['text'])

for i in range(howManyBoxes):
  (x, y, w, h) = (sevenYearsWords['left'][i], sevenYearsWords['top'][i], sevenYearsWords['width'][i], sevenYearsWords['height'][i])
  sevenYearsTestImage = cv2.rectangle(sevenYearsTestImage, (x, y), (x + w, y + h), (255, 0, 0), 3)

cv2_imshow(sevenYearsTestImage)

As you can see, we now have our blue bounding boxes around each text element in the image. The process worked like a charm: each text element is captured perfectly inside its bounding box (then again, it helped that every word had a 96 OCR confidence score, which ensured high detection accuracy).

How did we get these perfectly blue bounding boxes?

  • I first reset the index on the sevenYearsWords dataframe because when I first ran this code, I got an indexing error. Since sevenYearsWords is essentially a subset of the larger sevenYearsDataFrame (the one with all elements, not just words), it kept the row labels of the original dataframe, so I used reset_index() to renumber its rows starting at 0.
  • Keep reset_index() in mind whenever you’re working with dataframes generated as subsets of larger dataframes.
  • howManyBoxes tells the loop how many bounding boxes need to be drawn; normally, you need as many bounding boxes as you have text elements.
  • The loop iterates through the elements and draws a bounding box on each one using the cv2.rectangle() method. The parameters for this method are: the image to draw the bounding boxes on, the top-left (x, y) corner of each box, the bottom-right (x + w, y + h) corner of each box, the BGR color tuple for the boxes, and the thickness of the boxes in pixels (I went with 3-px-thick blue boxes).
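As a side note, pandas’ itertuples() would sidestep the indexing error entirely, since it walks the rows without looking up index labels. A small sketch using two of the boxes from the earlier output:

```python
import pandas as pd

# A stand-in for sevenYearsWords with two boxes (coordinates taken
# from the dataframe shown earlier in this post)
words = pd.DataFrame({
    'left': [571, 1069], 'top': [504, 529],
    'width': [452, 248], 'height': [97, 98],
})

# itertuples() iterates rows positionally, so no reset_index() is
# needed; int() keeps the corner values as plain Python ints
boxes = []
for r in words.itertuples():
    x, y, w, h = int(r.left), int(r.top), int(r.width), int(r.height)
    boxes.append((x, y, x + w, y + h))

print(boxes)  # [(571, 504, 1023, 601), (1069, 529, 1317, 627)]
```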

Come find the code on my GitHub: https://github.com/mfletcher2021/blogcode/blob/main/OCR_bounding_boxes.ipynb.

Thanks for reading!

Michael