How to Use OCR in Linux – Extract Text From PDF Image

gImageReader is a graphical GTK front end to tesseract-ocr, a free software optical character recognition (OCR) engine.



Tesseract is a raw OCR engine, with no document layout analysis, no output formatting and no graphical user interface (GUI). If you want a GUI version of Tesseract you can install YAGF, the front end program for Tesseract OCR.

gImageReader extracts text from an image or PDF file. It can open multipage PDF files and images in common formats, lets you select columns or other parts of the document, sends the selected area to Tesseract for recognition, and can spell-check the output.

Install Tesseract OCR in Linux

With the discontinuation of downloads at code.google.com, new Tesseract source downloads are posted to Google Drive. Other download folders will be set up as new files are uploaded, and the original Downloads page will go away.

Warning: add the PPA, install the latest Tesseract, and then disable the PPA again, because it contains a lot of bleeding-edge packages!

Add the PPA and install Tesseract OCR 3.0 SVN:
sudo add-apt-repository ppa:alex-p/notesalexp
sudo apt-get update
sudo apt-get install tesseract-ocr tesseract-ocr-eng
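
After the install it is worth checking that the binary is actually on your PATH. The snippet below is a minimal sketch, not part of the original instructions: check_tesseract is a hypothetical helper name, and it assumes a Tesseract release recent enough to understand --list-langs.

```shell
# Report whether the tesseract binary is on PATH and, if so, what it supports.
# check_tesseract is a hypothetical helper name used only for this example.
check_tesseract() {
    if ! command -v tesseract >/dev/null 2>&1; then
        echo "tesseract: not installed"
        return 1
    fi
    # Some releases print the version to stderr, so merge both streams.
    tesseract --version 2>&1 | head -n 1
    # List the language packs this installation can use.
    tesseract --list-langs 2>&1
}

check_tesseract || true
```

If the language list does not contain "eng", the tesseract-ocr-eng package was probably not installed.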

Alternatively, Linux installer files are available at sourceforge.net/projects/gimagereader/


You can install some extra languages from this PPA, such as Bulgarian, Catalan, Czech, Danish, German, Greek, Finnish, Indonesian, Hungarian, Italian, Dutch, Polish, Romanian, Spanish and so on. Simply search for “tesseract-ocr” in Synaptic and you should easily find all these packages – install the ones you’ll need later on.
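
If you prefer the terminal to Synaptic, the language packs can be installed with apt-get as well. This is only a sketch: install_tesseract_langs is a hypothetical helper, and it assumes the packages follow the tesseract-ocr-<code> naming used on Debian and Ubuntu, where <code> is the three-letter language code (deu for German, spa for Spanish, and so on).

```shell
# Install Tesseract language packs by three-letter language code.
# install_tesseract_langs is a hypothetical helper; the package naming
# scheme tesseract-ocr-<code> is an assumption about your distribution.
install_tesseract_langs() {
    if [ "$#" -eq 0 ]; then
        echo "usage: install_tesseract_langs CODE..." >&2
        return 2
    fi
    pkgs=""
    for code in "$@"; do
        pkgs="$pkgs tesseract-ocr-$code"
    done
    # $pkgs is left unquoted on purpose so each package is its own argument.
    sudo apt-get install $pkgs
}

# Example: install_tesseract_langs deu spa
```

Running apt-cache search tesseract-ocr first shows which language packages your repositories actually offer.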


Now you must disable the PPA: press ALT + F2 and enter:
gksu software-properties-gtk

Then, on the “Other Software” tab look for the line(s) that says “ppa.launchpad.net/alex-p/notesalexp” and either disable it or delete it.
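
On systems where gksu is not installed, the same cleanup can probably be done from a terminal. The following is a sketch under assumptions: disable_ppa is a hypothetical helper name, and it relies on the --remove option of add-apt-repository. Note that this deletes the PPA entry rather than merely unticking it.

```shell
# Remove a PPA from the software sources and refresh the package lists.
# disable_ppa is a hypothetical helper name used only for this example.
disable_ppa() {
    sudo add-apt-repository --remove "$1" && sudo apt-get update
}

# Example: disable_ppa ppa:alex-p/notesalexp
```

Packages already installed from the PPA stay installed; they just stop receiving updates from it.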

gImageReader

gImageReader is available for Linux and Windows; .deb, .rpm and .exe packages are available for download.

To use gImageReader, open the PDF or image you want to extract text from. Click “Recognize all” to process the whole page, or draw a selection with the mouse and click “Recognize selection” to extract only part of the document.

If you’ve installed the Tesseract OCR language package for the language of the PDF or image you’re opening, gImageReader will detect the language automatically.
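
For batch jobs you may want to skip the GUI and call Tesseract directly. The sketch below is an assumption-laden example, not something gImageReader does for you: ocr_pdf is a hypothetical helper, and it assumes pdftoppm (from the poppler-utils package) is installed to turn each PDF page into a 300 dpi PNG before Tesseract reads it.

```shell
# OCR every page of a PDF: rasterize with pdftoppm, then run tesseract per page.
# ocr_pdf is a hypothetical helper; pdftoppm (poppler-utils) is an assumed dependency.
ocr_pdf() {
    pdf="$1"
    lang="${2:-eng}"              # language code, defaults to English
    workdir=$(mktemp -d) || return 1
    # Writes page-1.png, page-2.png, ... (zero-padded for longer documents).
    pdftoppm -r 300 -png "$pdf" "$workdir/page" || return 1
    for img in "$workdir"/page-*.png; do
        # tesseract appends .txt to the output base name it is given.
        tesseract "$img" "${img%.png}" -l "$lang" || return 1
    done
    # Print the recognized text of all pages in order, then clean up.
    cat "$workdir"/page-*.txt
    rm -rf "$workdir"
}

# Example: ocr_pdf scan.pdf deu > scan.txt
```

The resolution of 300 dpi is a common starting point for scanned text; lower values run faster but usually recognize less accurately.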