Tesserocr or Tesseract?

Smriti Bajaj
2 min read · Feb 7, 2021


Did I mistype it? Isn’t it tesseract-ocr? Well, nope.

tesserocr is a Python library that wraps Tesseract's C++ API directly, which is why it is known for giving access to the engine's more advanced features. Then what are tesseract-ocr and pytesseract? tesseract-ocr is the OCR engine itself, together with its command-line tool, and pytesseract is another Python library that wraps that CLI. In this story, we'll learn how to install both wrappers and how to use each of them.
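To make "wrapping the CLI" concrete, here is a rough sketch of what a CLI wrapper boils down to: build a tesseract command and read back its output. This is only an illustration, not pytesseract's actual implementation; the function names are made up for this post and sample1.jpg is a placeholder file name.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, lang="eng"):
    """Build the argument list for a one-off tesseract CLI call.

    Passing "stdout" as the output base makes tesseract print the text
    instead of writing it to a .txt file.
    """
    return ["tesseract", image_path, "stdout", "-l", lang]


def ocr_via_cli(image_path, lang="eng"):
    """Run the CLI and return the recognized text, or None if tesseract is not on PATH."""
    if shutil.which("tesseract") is None:
        return None
    result = subprocess.run(
        build_tesseract_command(image_path, lang),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


# Only builds the command, so this line runs even without tesseract installed.
print(build_tesseract_command("sample1.jpg", "eng+san"))
```

tesserocr skips the subprocess step entirely and talks to the C++ library in-process, which is where the extra control comes from.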

tesserocr

To install tesserocr on Windows, I'd advise using Anaconda or Miniconda.

  1. Create a conda env with Python 3.6: conda create -n myenv python=3.6
  2. Activate it: conda activate myenv
  3. In the environment, run: conda install -c conda-forge tesserocr==2.5.1
  4. Execute the following code:
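from tesserocr import PyTessBaseAPI, RIL

# path must point to the tessdata folder that holds eng.traineddata
with PyTessBaseAPI(lang='eng', path='\\Library\\bin\\tessdata') as api:
    for img in ['sample1.jpg', 'sample2.jpg']:
        api.SetImageFile(img)
        # Full text recognized in the image
        print(api.GetUTF8Text())
        # One entry per detected text line
        boxes = api.GetComponentImages(RIL.TEXTLINE, False)
        print(len(boxes))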
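The boxes returned by GetComponentImages can do more than be counted. As a small sketch, each entry is a tuple whose second item is a dict with x, y, w and h keys, so you can restrict recognition to one line at a time with SetRectangle and read a confidence score with MeanTextConf. The helper name ocr_text_lines is my own, and the level is passed in by the caller so the helper itself does not import tesserocr.

```python
def ocr_text_lines(api, image_path, level):
    """Return (box, text, confidence) for every component found at `level`.

    `api` is expected to be an open tesserocr.PyTessBaseAPI instance and
    `level` a tesserocr.RIL value such as RIL.TEXTLINE.
    """
    api.SetImageFile(image_path)
    results = []
    # Each entry is (image, box, block_id, paragraph_id).
    for _image, box, _block_id, _paragraph_id in api.GetComponentImages(level, True):
        # Limit recognition to this component's bounding box.
        api.SetRectangle(box["x"], box["y"], box["w"], box["h"])
        results.append((box, api.GetUTF8Text().strip(), api.MeanTextConf()))
    return results


# Usage (needs tesserocr and a real image; sample1.jpg is a placeholder):
# from tesserocr import PyTessBaseAPI, RIL
# with PyTessBaseAPI(lang="eng") as api:
#     for box, text, conf in ocr_text_lines(api, "sample1.jpg", RIL.TEXTLINE):
#         print(box, conf, text)
```

Here the second argument to GetComponentImages is True, which asks for text components only; the snippet above passes False, so the counts may differ between the two.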

Note: lang can combine multiple languages, e.g. 'eng+san', and path must point to a tessdata folder containing the matching files (eng.traineddata, san.traineddata, etc.).
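A missing .traineddata file is an easy thing to trip over, so it can help to check the folder before building the lang string. Below is a minimal sketch using only the standard library; build_lang is a hypothetical helper of mine, not part of tesserocr.

```python
from pathlib import Path


def build_lang(codes, tessdata_dir):
    """Join language codes as 'eng+san' after checking each model file exists."""
    tessdata = Path(tessdata_dir)
    missing = [c for c in codes if not (tessdata / f"{c}.traineddata").is_file()]
    if missing:
        raise FileNotFoundError(
            f"No .traineddata in {tessdata} for: {', '.join(missing)}"
        )
    return "+".join(codes)


# Usage (the folder is whatever you pass as path to PyTessBaseAPI):
# lang = build_lang(["eng", "san"], tessdata_dir)
# with PyTessBaseAPI(lang=lang, path=tessdata_dir) as api: ...
```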

On Ubuntu, you can install the prerequisites straight from your terminal with sudo apt-get install tesseract-ocr libtesseract-dev libleptonica-dev pkg-config, and then install the wrapper itself with pip install tesserocr.
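Before running OCR, a quick sanity check can confirm the install worked. This sketch relies on two module-level helpers that tesserocr provides, tesseract_version() and get_languages(); the wrapper function tesserocr_status is my own naming, and it returns None when tesserocr cannot be imported.

```python
def tesserocr_status():
    """Return version and language info, or None if tesserocr is not importable."""
    try:
        import tesserocr
    except ImportError:
        return None
    # get_languages() returns the tessdata path and the list of installed languages.
    tessdata_path, languages = tesserocr.get_languages()
    return {
        "version": tesserocr.tesseract_version(),
        "tessdata_path": tessdata_path,
        "languages": languages,
    }


print(tesserocr_status())
```

If the language list comes back empty, the tessdata folder is probably not where Tesseract is looking.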

Hurraaayy!!! We are all set to OCR our images!
Photo by Andre Hunter on Unsplash


pytesseract

If you'd rather call a few simple functions than work with the API, pytesseract comes to the rescue. Installation is straightforward.

On Ubuntu, tesseract-ocr has to be installed first:

sudo apt install tesseract-ocr
sudo apt install libtesseract-dev
pip install pytesseract

On Windows:

  1. Download Tesseract from https://sourceforge.net/projects/tesseract-ocr/
  2. Set the tessdata path in the TESSDATA_PREFIX environment variable, or pass it in code, much as the path argument did for tesserocr (with pytesseract, this goes through the config argument, e.g. --tessdata-dir).
  3. Make sure your tessdata folder contains the data file for every language you want to OCR. For English, that is eng.traineddata, available from https://github.com/tesseract-ocr/tessdata
  4. Now run pip install pytesseract in your command shell.
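If you'd rather not touch the system settings, the same configuration can be done from Python at the top of your script. This is a sketch under assumptions: configure_tesseract is a hypothetical helper, and the Windows paths in the usage comment are only examples of where an installer might put things, so replace them with your own. pytesseract.pytesseract.tesseract_cmd is the attribute pytesseract reads to locate the executable.

```python
import os


def configure_tesseract(tessdata_dir, tesseract_exe=None):
    """Point Tesseract at a tessdata folder and, optionally, at its executable."""
    # Child processes started later (such as the tesseract CLI) inherit this.
    os.environ["TESSDATA_PREFIX"] = str(tessdata_dir)
    if tesseract_exe is not None:
        import pytesseract  # only needed when overriding the executable path

        pytesseract.pytesseract.tesseract_cmd = str(tesseract_exe)


# Usage (example paths, adjust to your machine):
# configure_tesseract(
#     r"C:\Program Files\Tesseract-OCR\tessdata",
#     r"C:\Program Files\Tesseract-OCR\tesseract.exe",
# )
```

Depending on the Tesseract version, TESSDATA_PREFIX may need to be the tessdata folder itself or its parent, so if languages are not found, try the other one.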

If you use Anaconda/Miniconda, run conda install -c conda-forge pytesseract instead.
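Once pytesseract is installed, a minimal call might look like the sketch below. image_to_string with its lang and config arguments is part of pytesseract; build_config and ocr_image are helper names I made up, and sample1.jpg is a placeholder. The imports sit inside ocr_image so that build_config can be tried without pytesseract or Pillow installed.

```python
def build_config(tessdata_dir=None, psm=None):
    """Assemble extra tesseract CLI flags for pytesseract's config argument."""
    parts = []
    if tessdata_dir is not None:
        parts.append(f'--tessdata-dir "{tessdata_dir}"')
    if psm is not None:
        parts.append(f"--psm {psm}")
    return " ".join(parts)


def ocr_image(image_path, lang="eng", tessdata_dir=None, psm=None):
    """Return the text pytesseract recognizes in the image at image_path."""
    import pytesseract
    from PIL import Image

    with Image.open(image_path) as image:
        return pytesseract.image_to_string(
            image, lang=lang, config=build_config(tessdata_dir, psm)
        )


# Usage (needs pytesseract, Pillow, the tesseract binary and a real image):
# print(ocr_image("sample1.jpg", lang="eng+san"))
```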

For code usage, refer to https://github.com/BajajSmriti/Opencv-OCR/blob/master/SanskritTextDetection_OCR.ipynb

If you run into any errors, you can refer to https://tesseract-ocr.github.io/tessdoc/Compiling.html, and I'm open to discussing them.

If you liked the content, please clap. It’s a really good exercise :)

Thanks for reading! ❤ #CodeEveryday
