Programming and technology blog

Posts

Showing posts from 2021

Perform OCR on an image of typed text

- May 30, 2021

You can use tesseract, an OCR command-line program, to convert an image of typed text to plain text (txt).

Here is the terminal command to perform OCR on an example image (image.png) and save the recognized text to output.txt:

    tesseract image.png stdout --psm 12 --dpi 70 > output.txt

You could also apply the same tesseract command to a scanned document; without the redirection (> output.txt), the recognized text is printed to the terminal.

For more information about the various command-line options, use tesseract --help or man tesseract.

NOTES:

- tesseract works on Linux, macOS, and Windows.
- The default language for tesseract is English, but it can recognize more than 100 languages.
- The default output format for tesseract is text, but it can also create a searchable PDF.
- You can also apply OCR to TIFF images (including multipage TIFF), but not to PDF documents.

References

Image sources: Wikipedia contributors. "Optical character recognition." Wikipe
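
If you would rather call tesseract from a script than from the shell, the command from this post can be wrapped with Python's subprocess module. This is only a minimal sketch: the function names build_tesseract_command and ocr_to_text are made up for illustration, and it assumes the tesseract executable is installed and on your PATH.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, psm=12, dpi=70):
    """Return the argument list for the tesseract call used in this post."""
    # "stdout" tells tesseract to print the recognized text instead of
    # writing it to a file of its own.
    return ["tesseract", str(image_path), "stdout",
            "--psm", str(psm), "--dpi", str(dpi)]


def ocr_to_text(image_path, psm=12, dpi=70):
    """Run tesseract on an image and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    result = subprocess.run(
        build_tesseract_command(image_path, psm, dpi),
        capture_output=True, text=True, check=True,
    )
    return result.stdout


if __name__ == "__main__":
    image = Path("image.png")  # hypothetical input file
    if image.exists() and shutil.which("tesseract"):
        # Same effect as the shell redirection "> output.txt".
        Path("output.txt").write_text(ocr_to_text(image), encoding="utf-8")
```

Keeping the command construction in its own function makes it easy to check or tweak the options (for example a different --psm value) without running OCR.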

Convert e-books to txt

- May 30, 2021

If you want to convert e-books to plain text (txt) on Linux or other Unix-like systems, here are some command-line utilities that you can use, along with their terminal commands:

| File type | Text conversion utility | Example |
|---|---|---|
| .djvu | djvutxt | djvutxt input.djvu output.txt |
| .djvu | ebook-convert | ebook-convert input.djvu output.txt |
| .epub | epub2txt | epub2txt input.epub > output.txt |
| .epub | ebook-convert | ebook-convert input.epub output.txt |
| .epub | unzip | unzip -c input.epub > output.txt |
| .doc | catdoc | catdoc input.doc > output.txt |
| .doc | textutil (macOS) | textutil -convert txt input.doc -output output.txt |
| .doc | ebook-convert | ebook-convert input.doc output.txt |
| .pdf | pdftotext | pdftotext input.pdf output.txt |
| .pdf | ebook-convert | ebook-convert input.pdf output.txt |

NOTES:

- There is a caveat for unzip: the generated output file will also include HTML tag
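
Regarding the unzip caveat: an EPUB is a zip archive of (X)HTML files, so one way to get text without the markup is to read the archive and strip the tags yourself. Here is a rough sketch using only Python's standard library. The names TextExtractor, html_to_text, and epub_to_text are made up for this example, and it assumes the content files end in .xhtml, .html, or .htm and are UTF-8 encoded. It reads files in archive order, which is not necessarily the book's reading order.

```python
import io
import zipfile
from html.parser import HTMLParser

# Closing one of these tags ends a line in the extracted text.
BLOCK_TAGS = {"p", "div", "li", "tr", "title",
              "h1", "h2", "h3", "h4", "h5", "h6"}


class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1
        elif tag == "br":
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def html_to_text(markup):
    """Strip tags from one HTML document and tidy up the whitespace."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    lines = (" ".join(line.split()) for line in "".join(parser.parts).splitlines())
    return "\n".join(line for line in lines if line)


def epub_to_text(epub_file):
    """Return the tag-free text of every (X)HTML file inside an EPUB."""
    chunks = []
    with zipfile.ZipFile(epub_file) as archive:
        for name in archive.namelist():
            if name.lower().endswith((".xhtml", ".html", ".htm")):
                markup = archive.read(name).decode("utf-8", errors="replace")
                chunks.append(html_to_text(markup))
    return "\n\n".join(chunk for chunk in chunks if chunk)


if __name__ == "__main__":
    # Tiny in-memory archive standing in for a real EPUB (made-up content).
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("chapter1.xhtml",
                         "<html><body><h1>Title</h1><p>Hello <b>world</b></p></body></html>")
    print(epub_to_text(buffer))
```

For a real book you would call epub_to_text("input.epub") and write the result to output.txt.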

Tip: search for a string that spans multiple lines

- May 30, 2021

The search tip below applies to any application that lets you use regular expressions when searching within a file, such as PyCharm, which allows you to search your source code with regex.

Let's say we want to search for the string "turned into a democracy" in a text. The difficulty is that the string sometimes spans multiple lines, and you want to match it no matter where the newline(s) fall within it.

If we use the plain search query "turned into a democracy", without any regex tokens, we will only match occurrences that sit entirely on one line (in the example, only the first occurrence). You can check this on regex101.com.

To match all occurrences of the string no matter how many lines it spans, the following regex will do the trick: "turned\s+into\s+a\s+democracy". We replaced each space between the words with \s+, which matches one or more whitespace characters, including newlines. You can check this on regex101.com as well.

You can also try it out with PyCharm or any other
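
To see the difference outside of an editor, here is a small Python sketch using the re module. The sample text is made up for illustration (it is not the text from the original demo), and phrase_to_pattern is a hypothetical helper that builds the \s+ version of any phrase.

```python
import re

# Made-up sample text: the phrase appears three times,
# and two of those occurrences are broken across lines.
text = (
    "The country turned into a democracy in stages.\n"
    "Later, the neighboring state also turned\n"
    "into a democracy, and a third one turned into\n"
    "a\n"
    "democracy after that.\n"
)


def phrase_to_pattern(phrase):
    """Join the words of a phrase with \\s+ so any whitespace run matches."""
    return r"\s+".join(re.escape(word) for word in phrase.split())


# Plain query: only finds the occurrence that sits on a single line.
literal_matches = re.findall(r"turned into a democracy", text)

# \s+ query: finds every occurrence, wherever the newlines fall.
flexible_matches = re.findall(phrase_to_pattern("turned into a democracy"), text)

print(len(literal_matches))   # 1
print(len(flexible_matches))  # 3
```

The same pattern string works in any regex-enabled search box, since \s is supported by practically every regex flavor.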

UnicodeEncodeError: 'ascii' codec can't encode character

- May 30, 2021

UnicodeEncodeError: 'ascii' codec can't encode character '\U0001f6d1' in position 2: ordinal not in range(128)

This Python error might happen if you have problems displaying non-ASCII characters in your terminal when running a Python script. It is mainly caused by the locale settings used by your terminal. I provide two solutions to resolve it:

- Solution #1: change your locale settings (best solution)
- Solution #2: export PYTHONIOENCODING=utf8 (temporary solution)

Solution #1: change your locale settings (best solution)

The best solution is to fix your locale settings, since the fix is permanent and you don't have to change any Python code. Append the following lines to ~/.bashrc or ~/.bash_profile (note that there must be no spaces around the = sign):

    export LANG="en_US.UTF-8"
    export LANGUAGE="en_US:en"

You should provide your own UTF-8 based locale settings. The example uses the English (US) locale with the UTF-8 encoding. The locale -a comma
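
If you want to check what your terminal is doing before touching any settings, a short Python script can help. This is just a diagnostic sketch: the helper can_encode is a made-up name, and the script only reports the encoding Python picked for stdout and whether the character from the error message fits into it.

```python
import sys

STOP_SIGN = "\U0001f6d1"  # the character from the error message


def can_encode(text, encoding):
    """Return True if text can be encoded with the given codec."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    # Only ASCII is printed here, so the script itself cannot trigger the error.
    print("stdout encoding:", sys.stdout.encoding)
    print("ascii can encode it:", can_encode(STOP_SIGN, "ascii"))
    print("utf-8 can encode it:", can_encode(STOP_SIGN, "utf-8"))
```

If the first line does not report a UTF-8 encoding, printing such characters is likely to fail, and the locale settings are worth a look.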