Perform OCR on an image of typed text

You can use tesseract, a command-line OCR program, to convert an image of typed text, like this one, into plain text (txt):

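The following terminal command performs OCR on the example image. The stdout argument sends the recognized text to standard output, which the shell then redirects to output.txt; --psm 12 selects the sparse-text page segmentation mode with orientation and script detection, and --dpi 70 tells tesseract the resolution of the image: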

tesseract image.png stdout --psm 12 --dpi 70 > output.txt

which writes the following text to output.txt:

You can also run tesseract on a scanned document like this one, this time sending the output to the terminal instead of a file:

which would output the following text:

For more information about the command-line options, run tesseract --help or man tesseract.
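As a minimal sketch of exploring those options, the snippet below calls a few informational flags that recent tesseract releases (4.x and later, which is assumed here because the command above uses the --psm spelling) provide. The function name show_tesseract_info is hypothetical and only serves to group the calls behind a check that tesseract is installed.

```shell
#!/bin/sh
# show_tesseract_info is a hypothetical helper name, not part of tesseract.
# It prints version details, the page segmentation modes, and the
# languages installed on this machine.
show_tesseract_info() {
    if ! command -v tesseract >/dev/null 2>&1; then
        echo "tesseract not found on PATH" >&2
        return 1
    fi
    tesseract --version      # engine and library versions
    tesseract --help-psm     # meaning of each --psm value
    tesseract --list-langs   # language data available for -l
}

show_tesseract_info || echo "Install tesseract to see its options." >&2
```

The --help-psm listing is the quickest way to check what a value such as 12 means before using it on your own images.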

NOTES:

tesseract works on Linux, macOS, and Windows.
The default language for tesseract is English, but it can recognize more than 100 languages.
The default output format for tesseract is plain text, but it can also create a searchable PDF.
You can also apply OCR to TIFF images (including multipage TIFFs) but not to PDF documents.
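The notes above might translate into commands such as the following. This is a sketch only: the file names (image.png, scan.tiff, output) are placeholders, the fra code assumes the French language data is installed, and run is a hypothetical wrapper that prints a command instead of executing it when tesseract or the input file is missing.

```shell
#!/bin/sh
# run: hypothetical helper. It executes the command when tesseract is
# installed and the input image ($2) exists; otherwise it only prints
# the command, so the script also works as a dry run.
run() {
    if command -v tesseract >/dev/null 2>&1 && [ -f "$2" ]; then
        "$@"
    else
        echo "$*"
    fi
}

# Recognize French text; -l takes a language code from --list-langs.
run tesseract image.png output -l fra

# Write a searchable PDF (output.pdf) instead of plain text.
run tesseract image.png output pdf

# OCR the pages of a multipage TIFF; the text goes to output.txt.
run tesseract scan.tiff output
```

In each call the second argument is the output base name, to which tesseract appends the extension (.txt or .pdf) itself.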

References

Image sources:

Wikipedia contributors. "Optical character recognition." Wikipedia, The Free Encyclopedia. Wikipedia, The Free Encyclopedia, 27 May 2021. Web. 31 May 2021.

First two paragraphs from the Wikipedia article.

Andrews' Book & Job Printing Office. "The Macon directory for 1860, containing the names of the inhabitants, a business directory and an appendix of much useful information." Washington Memorial Library (Macon, Ga.). 1860, http://dlg.galileo.usg.edu/do:zgy_mcd_dir-macon1860.
