Perform OCR on an image of typed text
You can use tesseract, a command-line OCR program, to convert an image of typed text, like this one, to plain text (txt):
Here is the terminal command to perform OCR on the example image:
tesseract image.png stdout --psm 12 --dpi 70 > output.txt
which will output the following text:
You could also apply the tesseract command to a scanned document like this one, this time directing the output to the terminal:
which would output the following text:
For more information about the various command-line options, use tesseract --help or man tesseract.
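As a rough sketch of the terminal variant mentioned earlier: dropping the redirect (> output.txt) leaves the recognized text on the terminal. The helper name ocr_to_terminal and the file name scan.png are assumptions made for illustration; only the tesseract arguments themselves come from the command shown above. The --dpi value is omitted here because it depends on the scan.

```shell
# Hypothetical helper: OCR an image and print the recognized text to the terminal.
# $1 = image file (for example scan.png, an assumed file name)
ocr_to_terminal() {
    # "stdout" as the output base makes tesseract write to standard output;
    # --psm 12 selects the "sparse text with OSD" page segmentation mode.
    tesseract "$1" stdout --psm 12
}

# Usage:
#   ocr_to_terminal scan.png
```

Running tesseract --help-psm lists the other page segmentation modes, which may suit a dense scanned page better than mode 12.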
NOTES:
- tesseract works on Linux, macOS, and Windows.
- The default language for tesseract is English, but it can recognize more than 100 languages.
- The default output format for tesseract is text, but it can also create a searchable PDF.
- You can also apply OCR to TIFF images (including multipage TIFF), but not to PDF documents.
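The notes above on languages and PDF output can be sketched as follows. The helper names (ocr_list_langs, ocr_lang, ocr_to_pdf) and the file names in the usage comments are hypothetical; the tesseract options they wrap (--list-langs, -l, and the pdf output config) are standard. A language such as deu (German) only works if its language data is installed.

```shell
# Hypothetical helpers wrapping standard tesseract options.

# List the language data installed on this machine.
ocr_list_langs() {
    tesseract --list-langs
}

# OCR an image in a given language and print the text to the terminal.
# $1 = image file, $2 = language code (for example deu for German)
ocr_lang() {
    tesseract "$1" stdout -l "$2"
}

# Create a searchable PDF instead of plain text.
# $1 = image file (a multipage TIFF also works), $2 = output base name
# The result is written to "$2.pdf".
ocr_to_pdf() {
    tesseract "$1" "$2" pdf
}

# Usage (assumed file names):
#   ocr_list_langs
#   ocr_lang letter.png deu
#   ocr_to_pdf pages.tiff result
```

Since PDF documents are not accepted as input, they would need to be converted to images with a separate tool before being passed to tesseract.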
References
Image sources:
- Wikipedia contributors. "Optical character recognition." Wikipedia, The Free Encyclopedia. Wikipedia, The Free Encyclopedia, 27 May 2021. Web. 31 May 2021.
- First two paragraphs from the Wikipedia article.
- Andrews' Book & Job Printing Office. "The Macon directory for 1860, containing the names of the inhabitants, a business directory and an appendix of much useful information." Washington Memorial Library (Macon, Ga.). 1860, http://dlg.galileo.usg.edu/do:zgy_mcd_dir-macon1860.