Text recognition (OCR) with Tesseract and OpenCV
In this tutorial we’re going to see how to use Tesseract to recognize text from an image.
Tesseract is one of the most popular OCR (optical character recognition) engines. It is open source, and its development has been sponsored by Google since 2006.
In this specific tutorial we will see:
- How to install Tesseract (on Windows, Mac or Linux)
- Read Text from an image
- Tune tesseract to improve the text recognition
1. Install Tesseract to work with Python and OpenCV
Before proceeding with the installation of Tesseract, it’s important to understand all the tools that we are going to use and the purpose of each of them.
These are the tools that we need:
- Python and OpenCV: we will use the Python programming language and OpenCV to load the image and do some image preprocessing (for example, remove the areas where there is no text, remove some noise, or apply an image filter to make the text more readable).
- Tesseract: it’s the OCR engine, the core of the actual text recognition. It takes the image and gives us the text in return.
- Pytesseract: it’s the Tesseract binding for Python. With this library we can use the Tesseract engine from Python with just a few lines of code.
1.1 Install Python and OpenCV
First of all, let’s make sure that you have Python and OpenCV installed. If not, you can follow this guide to install OpenCV and Python on Windows.
1.2 Install Tesseract
Next, it’s time to install Tesseract.
If you’re on Windows, go to this page https://github.com/UB-Mannheim/tesseract/wiki, then download and install the 64-bit version of Tesseract.
If you want to follow the installation process step by step, you can watch it in my video tutorial above.
On Linux you can simply open the terminal and run the following commands:
    sudo apt install tesseract-ocr
    sudo apt install libtesseract-dev
If the commands don’t work, you can refer to the Tesseract documentation for more instructions: https://tesseract-ocr.github.io/tessdoc/Home.html.
To install Tesseract on Mac, use this command (it assumes that MacPorts is installed):
sudo port install tesseract
If this command doesn’t work, check this page for more instructions: https://tesseract-ocr.github.io/tessdoc/Home.html
1.3 Install PyTesseract
Pytesseract is an essential library if we want to use Tesseract with Python. It can be installed like any other Python library, using the pip command.
Run one of the following commands in your terminal, depending on how pip is named on your system:
    pip install pytesseract
    pip3 install pytesseract
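To confirm that the Python side of the installation worked, a quick check like the one below can help. It is only a sketch: the helper name `missing_packages` is made up for this example, and it only looks for the packages by their import names (`cv2`, `numpy`, `pytesseract`) without checking the Tesseract engine itself.

```python
import importlib.util

# Import names of the packages used in this tutorial
# (OpenCV is installed as "opencv-python" but imported as "cv2").
REQUIRED = ["cv2", "numpy", "pytesseract"]


def missing_packages(names=REQUIRED):
    """Return the import names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages()
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are installed.")
```

If a package is reported as missing, it was probably installed for a different Python interpreter than the one running the script.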
2. Read text from an image
Now that the installation is done, it’s time to try Tesseract on an image.
I have this digitized page of a book (“The Big Sleep”). Let’s try extracting the text from this image.
First of all, we import the libraries OpenCV, NumPy and Pytesseract.
Then, on the last line of the snippet, you need to tell Pytesseract where the Tesseract engine is installed. The configuration below is fine if you’re using Windows. If you’re on Mac or Linux, refer to the official documentation to see how to set it up.
    import cv2
    import numpy as np
    import pytesseract

    pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
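On Mac or Linux, Pytesseract’s default command is simply `tesseract`, so usually nothing has to be set when the engine is on the PATH. As a minimal sketch of locating the engine in a cross-platform way, the snippet below uses only the standard library. The helper name `find_tesseract` is hypothetical, and the fallback is just the Windows path used above.

```python
import os
import shutil

# Install location used in this tutorial for Windows.
WINDOWS_DEFAULT = r"C:\Program Files\Tesseract-OCR\tesseract.exe"


def find_tesseract(fallback=WINDOWS_DEFAULT):
    """Return the path of the tesseract executable, or None if not found."""
    # shutil.which searches the PATH, like typing "tesseract" in a terminal.
    on_path = shutil.which("tesseract")
    if on_path:
        return on_path
    # Otherwise fall back to a known location, if the file actually exists.
    if os.path.isfile(fallback):
        return fallback
    return None


path = find_tesseract()
print("Tesseract found at:", path)
# If a path was found, it can be assigned to
# pytesseract.pytesseract.tesseract_cmd, as in the snippet above.
```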
Then we simply load the image and extract the text using Pytesseract.
    img = cv2.imread("bigsleep.jpg")
    text = pytesseract.image_to_string(img)
    print(text)
And we get the output:
THE BIG SLEEP by Raymond Chandler It was about eleven o'clock in the morning, mid October, with the sun not shining and a look of hard wet rain in the clearness of the foothills. I was wearing my powder-blue suit, with dark blue shirt, tie and display handkerchief, black brogues, black wool socks with dark blue clocks on them. I was neat, clean, shaved and sober, and I didn’t care who knew it. I was everything the well-dressed private detective ought to be. T was calling on four million dollars. ‘The main hallway of the Sternwood place was two stories high. Over the entrance doors, which would have let in a troop of Indian elephants, there ‘was a broad stained-glass panel showing a knight in dark armor rescuing
3. Tune tesseract to improve the text recognition
In section 2 we saw how easy it is to run Tesseract from Python. The engine performed really well and the text recognition was almost perfect.
To be fair, it wasn’t that big of a challenge, as the text was really clear.
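“Almost perfect” can also be measured. One possible way, sketched here with Python’s standard `difflib` module, is to compare a line of the OCR output with the same line typed by hand. The OCR line comes from the output above (Tesseract read “I” as “T”), while the reference line and the helper name `similarity` are my own additions for this example.

```python
from difflib import SequenceMatcher


def similarity(ocr_text, reference):
    """Return a ratio between 0 and 1; 1 means the two strings are identical."""
    return SequenceMatcher(None, ocr_text, reference).ratio()


ocr_line = "T was calling on four million dollars."
reference = "I was calling on four million dollars."

score = similarity(ocr_line, reference)
print(f"Similarity: {score:.3f}")
```

A score close to 1 means that only a few characters differ, which is what we expect for a clean scan like this one.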
Now we’re going to make the text recognition more challenging by giving Tesseract a photo (instead of a scanned book page) where the text is not exactly horizontal but slightly sloped, where the lighting changes across different parts of the page, and where there are non-text elements that should not be there.
How will Tesseract perform with such images?
This is the image we’re going to test.
With the same Python code we used for the previous image, we get a terrible result by default. The text is almost not recognized at all.
To fix this we can do what is called “image preprocessing”: operations applied before the text detection that improve the image as much as possible and make the text clearer.
There are a lot of operations we can do to improve the image before giving it to the OCR engine. Below I’m just going to give you a few ideas.
- Convert image to grayscale: we only need the text, we don’t care about colors. The text is simpler to identify on a grayscale image.
- Change the size of the image: in my experience Tesseract doesn’t work so well when the images are too big, so as a maximum resolution we can use the one at which the text is still clear to our eyes.
- Remove noise: text that looks clear to us may turn out to be noisy when we zoom in and look at the image pixel by pixel. We can reduce this noise, for example, by blurring the image.
- Convert image to black and white: by converting the image to black and white, we set the background to all white and the text to black. This helps when the lighting changes across the picture.
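For the resize idea, one option is to compute the scale factor from the image size instead of hard-coding it. The sketch below is plain Python. The function name `scale_factor` and the 1600-pixel limit are assumptions chosen for illustration, not values recommended by Tesseract.

```python
def scale_factor(width, height, max_side=1600):
    """Return the factor that shrinks the longer side to max_side.

    Images that are already small enough get a factor of 1.0,
    so they are never enlarged.
    """
    longest = max(width, height)
    if longest <= max_side:
        return 1.0
    return max_side / longest


# Example: a 4000x3000 photo is scaled by 0.4, giving 1600x1200.
factor = scale_factor(4000, 3000)
print(factor)
# The result can be passed to cv2.resize(img, None, fx=factor, fy=factor).
```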
For this specific image, I’m going to apply a few of them. First I resize the image to make it smaller, then I convert it to grayscale, and finally I apply a threshold, using the adaptive threshold method, to convert it to black and white.
Here is the code used to do that:
    # 1. Load the image
    img = cv2.imread("book_page.jpg")

    # 2. Resize the image
    img = cv2.resize(img, None, fx=0.5, fy=0.5)

    # 3. Convert image to grayscale
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    # 4. Convert image to black and white (using adaptive threshold)
    adaptive_threshold = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 85, 11)
And this is how the image looks after the preprocessing operations:
As you can see, after the preprocessing the text is much easier to read, and that makes a huge difference for the OCR engine.
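In addition to the image preprocessing operations, we can tune Tesseract.
Tesseract has several page segmentation modes (PSM) that we can select manually. The most commonly used ones are numbered 0 to 10 (recent versions add modes 11 to 13 for sparse text and raw lines):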
0 = Orientation and script detection (OSD) only.
1 = Automatic page segmentation with OSD.
2 = Automatic page segmentation, but no OSD, or OCR
3 = Fully automatic page segmentation, but no OSD. (Default)
4 = Assume a single column of text of variable sizes.
5 = Assume a single uniform block of vertically aligned text.
6 = Assume a single uniform block of text.
7 = Treat the image as a single text line.
8 = Treat the image as a single word.
9 = Treat the image as a single word in a circle.
10 = Treat the image as a single character.
We can choose which mode to use by adding a configuration value to the python code.
After “--psm” you need to put the number of the page segmentation mode you want to use (0 to 10 for the modes listed above).
    config = "--psm 3"
    text = pytesseract.image_to_string(adaptive_threshold, config=config)
    print(text)
I won’t go deeper with the explanation, as the goal of this post is to give you an idea of how OCR works. The important point is that, depending on the text you have (a full page, a single line or a single word), you can improve the results by changing the PSM setting.
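As a small sketch of this idea, the helper below maps a rough description of the layout to a PSM value and builds the config string. The layout names, the `PSM_BY_LAYOUT` table and the `psm_config` function are hypothetical; they are only starting points, and the best mode for a given image still has to be found by trying.

```python
# Suggested starting points for common layouts (an assumption for
# illustration; the best mode still has to be found by experiment).
PSM_BY_LAYOUT = {
    "page": 3,        # fully automatic page segmentation (default)
    "block": 6,       # single uniform block of text
    "line": 7,        # single text line
    "word": 8,        # single word
    "character": 10,  # single character
}


def psm_config(layout):
    """Build the config string for Pytesseract from a layout name."""
    if layout not in PSM_BY_LAYOUT:
        raise ValueError(f"Unknown layout: {layout!r}")
    return f"--psm {PSM_BY_LAYOUT[layout]}"


print(psm_config("line"))
# The string can be passed as
# pytesseract.image_to_string(image, config=psm_config("line")).
```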
If you want a simpler way to recognize text, you can always use an external service, without installing your own engine.
I did a video about an OCR api service, which in my opinion worked really well. You can read this post if you want to know more about it: https://pysource.com/2019/10/14/ocr-text-recognition-with-python-and-api-ocr-space/