Posts

Showing posts from 2021

Perform OCR on an image of typed text

You can use tesseract, an OCR command-line program, to convert an image of typed text to plain text (txt).

Here is the terminal command to perform OCR on the example image:

    tesseract image.png stdout --psm 12 --dpi 70 > output.txt

This writes the recognized text to output.txt. You can also run tesseract on a scanned document and, by leaving out the redirection, print the recognized text to the terminal.

For more information about the various command-line options, use tesseract --help or man tesseract.

NOTES:
- tesseract works on Linux, macOS, and Windows.
- The default language for tesseract is English, but it can recognize more than 100 languages.
- The default output format for tesseract is text, but it can also create a searchable pdf output.
- You can also apply OCR to tiff images (including multipage tiff), but not to pdf documents.

References
Image sources: Wikipedia contributors. "Optical character recognition." Wikipe
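As a rough sketch, the tesseract call quoted in this post can also be scripted from Python with the standard subprocess module. The helper names build_tesseract_command and ocr_to_text are hypothetical and introduced here only for illustration, and the sketch assumes tesseract is installed and on your PATH. It uses only the arguments that appear in the post (stdout, --psm, --dpi).

```python
import subprocess


def build_tesseract_command(image_path, psm=12, dpi=70):
    """Return the argument list for the tesseract call quoted in the post.

    Hypothetical helper: the defaults mirror the post's example values.
    """
    return ["tesseract", image_path, "stdout", "--psm", str(psm), "--dpi", str(dpi)]


def ocr_to_text(image_path, psm=12, dpi=70):
    """Run tesseract and return the recognized text.

    Assumes the tesseract executable is available on PATH.
    """
    result = subprocess.run(
        build_tesseract_command(image_path, psm, dpi),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if tesseract fails
    )
    return result.stdout


# Show the command that would be executed for the post's example image.
print(" ".join(build_tesseract_command("image.png")))
```

Calling ocr_to_text("image.png") returns the text as a string instead of redirecting it to output.txt, which can be convenient when the text needs further processing.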

Convert e-books to txt

If you want to convert e-books to plain text (txt) on Linux or other Unix-like systems, here are some command-line utilities that you can use, along with their terminal commands:

File type   Text conversion utility   Example
.djvu       1. djvutxt                djvutxt input.djvu output.txt
            2. ebook-convert          ebook-convert input.djvu output.txt
.epub       1. epub2txt               epub2txt input.epub > output.txt
            2. ebook-convert          ebook-convert input.epub output.txt
            3. unzip                  unzip -c input.epub > output.txt
.doc        1. catdoc                 catdoc input.doc > output.txt
            2. textutil (macOS)       textutil -convert txt input.doc -output output.txt
            3. ebook-convert          ebook-convert input.doc output.txt
.pdf        1. pdftotext              pdftotext input.pdf output.txt
            2. ebook-convert          ebook-convert input.pdf output.txt

NOTES: There is a caveat for unzip: the generated output file will also include HTML tag
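If you convert many files, choosing the utility by file extension can be automated. The following is a minimal sketch that only builds the shell command string for the first-listed utility of each file type above; it does not run anything. The names CONVERTERS and conversion_command are hypothetical, and naming the output after the input file (input.pdf to input.txt) is an assumption of this sketch.

```python
import shlex
from pathlib import Path

# First-listed utility for each file type in the post.
# The boolean says whether the tool prints to stdout (so the shell
# command needs a "> output" redirection) or takes an output file argument.
CONVERTERS = {
    ".djvu": ("djvutxt", False),
    ".epub": ("epub2txt", True),
    ".doc": ("catdoc", True),
    ".pdf": ("pdftotext", False),
}


def conversion_command(input_path):
    """Return a shell command string that converts input_path to .txt."""
    path = Path(input_path)
    suffix = path.suffix.lower()
    if suffix not in CONVERTERS:
        raise ValueError(f"no converter listed for {suffix!r}")
    tool, writes_to_stdout = CONVERTERS[suffix]
    source = shlex.quote(str(path))
    target = shlex.quote(str(path.with_suffix(".txt")))
    if writes_to_stdout:
        return f"{tool} {source} > {target}"
    return f"{tool} {source} {target}"


for name in ("input.djvu", "input.epub", "input.doc", "input.pdf"):
    print(conversion_command(name))
```

Returning the command as a string keeps the sketch free of side effects; the string can be inspected first and then pasted into a terminal.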

Tip: search for a string that spans multiple lines

The search tip below applies to any application that lets you use regular expressions when searching for strings within a file, such as PyCharm, which allows you to search your source code with regex.

Let's say we want to search for the string "turned into a democracy" in a passage of text. The difficulty is that the string sometimes spans multiple lines, and you want to match it no matter where the newline(s) fall within it.

If we use the simple search query "turned into a democracy" without any regex tokens, we only match the occurrences that sit entirely on one line (only the first occurrence in the example), as shown in a regex101.com demo.

To match all occurrences of the string no matter how many lines it spans, the following regex will do the trick: "turned\s+into\s+a\s+democracy". We replaced each space between the words with \s+, which matches one or more whitespace characters (including newlines), as shown in a regex101.com demo.

You can also try it out with PyCharm or any other
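The same comparison can be reproduced with Python's re module. This is only an illustrative sketch: the sample text is made up for the example (it is not the text used in the post), and phrase_to_pattern is a hypothetical helper that builds the \s+ regex from a plain phrase.

```python
import re


def phrase_to_pattern(phrase):
    """Join the words of a phrase with \\s+ so any whitespace run matches."""
    return r"\s+".join(re.escape(word) for word in phrase.split())


# Made-up sample text: the phrase appears three times,
# but only the first occurrence sits on a single line.
sample = (
    "The kingdom turned into a democracy.\n"
    "Later the colony turned into\n"
    "a democracy, and the empire turned\n"
    "into a\n"
    "democracy as well.\n"
)

phrase = "turned into a democracy"
pattern = phrase_to_pattern(phrase)

literal_hits = re.findall(phrase, sample)    # plain spaces only
flexible_hits = re.findall(pattern, sample)  # spaces or newlines

print(pattern)
print(len(literal_hits), len(flexible_hits))
```

The literal query finds one occurrence in this sample, while the \s+ version finds all three.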

UnicodeEncodeError: 'ascii' codec can't encode character

UnicodeEncodeError: 'ascii' codec can't encode character '\U0001f6d1' in position 2: ordinal not in range(128)

This Python error can occur when your terminal has problems displaying non-ASCII characters while running a Python script. It is mainly a problem with the locale settings used by your terminal. I provide two solutions to resolve it:

- Solution #1: change your locale settings (best solution)
- Solution #2: export PYTHONIOENCODING=utf8 (temporary solution)

Solution #1: change your locale settings (best solution)

The best solution is to fix your locale settings, since the fix is permanent and you don’t have to change any Python code. Append the following to ~/.bashrc or ~/.bash_profile:

    export LANG="en_US.UTF-8"
    export LANGUAGE="en_US:en"

You should provide your own UTF-8 based locale settings. The example uses the English (US) locale with the UTF-8 encoding. The locale -a  comma
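To see where the error comes from, the following small sketch checks which encoding Python picked for standard output and whether the character from the error message can be encoded with it. The helper can_encode is hypothetical and exists only for this illustration; the sketch deliberately avoids printing the character itself so that it runs even in a misconfigured terminal.

```python
import sys

# The character from the error message (U+1F6D1).
stop_sign = "\U0001f6d1"


def can_encode(text, encoding):
    """Return True if text can be encoded with the given codec."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Python derives this from the locale settings (or PYTHONIOENCODING).
print("stdout encoding:", sys.stdout.encoding)
print("encodable as ascii:", can_encode(stop_sign, "ascii"))
print("encodable as utf-8:", can_encode(stop_sign, "utf-8"))
```

If the reported stdout encoding is an ASCII variant, printing the character would raise the UnicodeEncodeError described above.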