Programming and technology blog

Posts

Perform OCR on an image of typed text

- May 30, 2021

You can use tesseract, an OCR command-line program, to convert an image of typed text to plain text (txt).

Here is the terminal command to perform OCR on the example image:

    tesseract image.png stdout --psm 12 --dpi 70 > output.txt

which writes the recognized text to output.txt.

You could also apply the tesseract command to a scanned document, this time directing the output to the terminal instead of a file.

For more information about the various command-line options, use tesseract --help or man tesseract.

NOTES:
- tesseract works on Linux, macOS, and Windows.
- The default language for tesseract is English, but it can recognize more than 100 languages.
- The default output format for tesseract is text, but it can also create a searchable PDF.
- You can also apply OCR on tiff images (multipage tiff) but not pdf doc...
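The tesseract call above can also be driven from a script. The following is a minimal sketch, assuming tesseract is installed and on your PATH; the function names build_tesseract_command and ocr_image are hypothetical helpers, not part of tesseract.

```python
import subprocess


def build_tesseract_command(image_path, psm=12, dpi=70):
    """Build the argument list for: tesseract <image> stdout --psm <psm> --dpi <dpi>."""
    return [
        "tesseract", str(image_path), "stdout",
        "--psm", str(psm),
        "--dpi", str(dpi),
    ]


def ocr_image(image_path, psm=12, dpi=70):
    """Run tesseract on an image and return the recognized text.

    Assumes the tesseract executable is available on PATH.
    """
    result = subprocess.run(
        build_tesseract_command(image_path, psm, dpi),
        capture_output=True,  # collect stdout instead of printing it
        text=True,            # decode bytes to str
        check=True,           # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout
```

Calling ocr_image("image.png") should return the same text that the shell command redirects to output.txt.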

Convert e-books to txt

- May 30, 2021

If you want to convert e-books to plain text (txt) on Linux or other Unix-like systems, here are some command-line utilities that you can use, along with their terminal commands:

.djvu
- djvutxt: djvutxt input.djvu output.txt
- ebook-convert: ebook-convert input.djvu output.txt

.epub
- epub2txt: epub2txt input.epub > output.txt
- ebook-convert: ebook-convert input.epub output.txt
- unzip: unzip -c input.epub > output.txt

.doc
- catdoc: catdoc input.doc > output.txt
- textutil (macOS): textutil -convert txt input.doc -output output.txt
- ebook-convert: ebook-convert input.doc output.txt

.pdf
- pdftotext: pdftotext input.pdf output.txt
- ebook-convert: ebook-convert input.pdf output.txt

NOTES: There is a caveat for unzip: the generated...
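If you convert many files, picking the utility by file extension can be scripted. This is a rough sketch that only covers the utilities above that take an explicit output path (djvutxt, pdftotext, ebook-convert); the CONVERTERS mapping and the build_convert_command helper are hypothetical names chosen for illustration, and the chosen tool for each extension is just one of the listed options.

```python
from pathlib import Path

# One utility per extension, limited to tools invoked as: <tool> <input> <output>
CONVERTERS = {
    ".djvu": "djvutxt",
    ".pdf": "pdftotext",
    ".epub": "ebook-convert",
    ".doc": "ebook-convert",
}


def build_convert_command(input_path, output_path=None):
    """Return the argument list that converts input_path to plain text."""
    src = Path(input_path)
    tool = CONVERTERS.get(src.suffix.lower())
    if tool is None:
        raise ValueError(f"No converter configured for {src.suffix!r}")
    # Default output: same name as the input, with a .txt extension
    dst = Path(output_path) if output_path else src.with_suffix(".txt")
    return [tool, str(src), str(dst)]
```

The returned list can be passed to subprocess.run(..., check=True), provided the corresponding utility is installed.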

Tip: search for a string that spans multiple lines

- May 30, 2021

The search tip below applies to any application that lets you use regular expressions when searching for strings within a file, such as PyCharm: PyCharm allows you to search your source code with regex.

Let's say we want to search for the string "turned into a democracy" in the following text: The difficulty is that the string sometimes spans multiple lines, and you want to match it no matter where the newline(s) fall within it.

If we use the plain search query "turned into a democracy", without any regex tokens, we will only match the first occurrence of the string, as shown in the following regex101.com demo:

To match all occurrences of the string no matter how many lines it spans, the following regex will do the trick: "turned\s+into\s+a\s+democracy". We replaced each space between the words with \s+, which matches one or more whitespace characters (including newlines), as shown in the following regex101.com demo:

You can also try it out with PyCharm or any other ...
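The same idea can be checked quickly with Python's re module. This is a small sketch; the sample text is made up for illustration (it is not the text used in the post), and phrase_to_pattern is a hypothetical helper name.

```python
import re

# Made-up sample text: the phrase appears three times,
# twice broken across lines at different places.
TEXT = (
    "Rome was not turned into a democracy overnight.\n"
    "Later the city was turned\n"
    "into a democracy by reform.\n"
    "Finally it turned into\n"
    "a\n"
    "democracy again."
)


def phrase_to_pattern(phrase):
    """Join the words of a phrase with \\s+ so any whitespace run matches."""
    return r"\s+".join(re.escape(word) for word in phrase.split())


# Plain search: only matches when the phrase sits on a single line
literal_matches = re.findall(re.escape("turned into a democracy"), TEXT)

# Whitespace-tolerant search: matches across newlines too
pattern = phrase_to_pattern("turned into a democracy")
regex_matches = re.findall(pattern, TEXT)

print(len(literal_matches), len(regex_matches))
```

Because \s matches spaces, tabs, and newlines, no special regex flag is needed for the pattern to span lines.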

UnicodeEncodeError: 'ascii' codec can't encode character

- May 30, 2021

UnicodeEncodeError: 'ascii' codec can't encode character '\U0001f6d1' in position 2: ordinal not in range(128)

This Python error might happen if you are having problems displaying non-ASCII characters in your terminal when running a Python script. It is mainly caused by the locale settings used by your terminal. I provide two solutions to resolve it:

- Solution #1: change your locale settings (best solution)
- Solution #2: export PYTHONIOENCODING=utf8 (temporary solution)

Solution #1: change your locale settings (best solution)

The best solution is to fix your locale settings, since the change is permanent and you don't have to change any Python code. Append the following to ~/.bashrc or ~/.bash_profile:

    export LANG="en_US.UTF-8"
    export LANGUAGE="en_US:en"

You should provide your own UTF-8 based locale settings. The example uses the English (US) locale with the UTF-8 encoding. The locale -a comma...
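To check whether your terminal is affected, you can inspect the encoding Python picked for standard output and test whether a character can be encoded with it. This is a small diagnostic sketch; can_encode is a hypothetical helper name.

```python
import sys

STOP_SIGN = "\U0001f6d1"  # the character from the error message above


def can_encode(text, encoding):
    """Return True if text can be encoded with the given codec."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


# Encoding Python chose for stdout, derived from the locale
# (or from PYTHONIOENCODING when that variable is set)
print("stdout encoding:", sys.stdout.encoding)

print("ascii can encode it:", can_encode(STOP_SIGN, "ascii"))
print("utf-8 can encode it:", can_encode(STOP_SIGN, "utf-8"))
```

If the reported stdout encoding is an ASCII variant, printing the character would raise the UnicodeEncodeError shown above.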

NLP in action: train a Naive Bayes model on movie reviews

- October 23, 2020

In the book Natural Language Processing in Action, section 2.3.2 Naive Bayes, we train a multinomial Naive Bayes classifier on movie reviews using scikit-learn's MultinomialNB.

NOTE: I am using numpy 1.19.1, pandas 1.1.3, and scikit-learn 0.23.2.

However, we get a ValueError when rescaling the predicted probabilities to the [-4, 4] range as done in the book:

    nb = MultinomialNB()
    nb = nb.fit(df_bows, movies.sentiment > 0)
    movies['predicted_sentiment'] = nb.predict_proba(df_bows) * 8 - 4

    ValueError: Wrong number of items passed 2, placement implies 1

The reason is that nb.predict_proba() returns a numpy array with two columns, and we are trying to assign it to a single column of the Pandas table movies (which I believe you could do in previous Pandas versions; I am using Pandas version 1.1.3):

    array([[1.86060657e-01, 8.13939343e-01], [1.19745717e-05, 9.99988025e-01]...
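The shape mismatch can be seen with numpy alone. The sketch below reuses the two rows of predict_proba output shown above and keeps only the second column before rescaling; this is one possible workaround under the assumption that the second column holds the probability of the positive class (scikit-learn orders columns by sorted class labels, so False comes before True). The helper name to_sentiment_scale is hypothetical.

```python
import numpy as np

# First two rows of the predict_proba output shown above:
# column 0 = P(sentiment <= 0), column 1 = P(sentiment > 0)
proba = np.array([
    [1.86060657e-01, 8.13939343e-01],
    [1.19745717e-05, 9.99988025e-01],
])


def to_sentiment_scale(proba):
    """Keep the positive-class column and rescale it from [0, 1] to [-4, 4]."""
    positive = np.asarray(proba)[:, 1]  # shape (n_samples,), fits a single column
    return positive * 8 - 4


scores = to_sentiment_scale(proba)
print(proba.shape, scores.shape)
print(scores)
```

A one-dimensional array like scores can be assigned to a single DataFrame column, for example movies['predicted_sentiment'].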

Tips on publishing your Python package to PyPI

- September 17, 2020

If you want to publish your first Python package to PyPI, here are some tips that might help you avoid some pitfalls when releasing your code to the world.

Table of contents
1. Unsupported reST directives and other limitations
   1.1 PyPI limitations
   1.2 GitHub limitations
   1.3 Limitations from all
2. Versioning your project
3. Validate your documentation before uploading
4. Make a release
5. Publishing first to TestPyPI
6. Mistakes found in a published README
7. Remove a release from PyPI
8. Conclusion
9. Resources
10. Notes

1. Unsupported reST directives and other limitations

The README.rst [1] that you painstakingly wrote for readthedocs.org [2] might not render correctly on PyPI or on GitHub [3]. Here are some of the limitations that you need to look for carefully on each website on which you might ...

Using pynput on macOS

- September 17, 2020

While working on my personal project Darth-Vader-RPi, I used the Python package pynput to monitor the keyboard in order to simulate push buttons on a Raspberry Pi. However, some keyboard keys were not detected on macOS without running the script with sudo (after adding PYTHONPATH to /etc/sudoers).

The keyboard keys that didn't need sudo were the following:
- alt keys
- cmd keys
- ctrl keys
- media buttons for play, pause, volume up/down/mute
- shift keys

The other keyboard keys (all the alphanumeric keys and some special keys such as backspace and right) required the Python script to run with sudo.

The pynput documentation explains the modifications you should apply to your application if you want to make it run on Linux, macOS, or Windows.

Image credit: "Apple Key" by Antijingoist is licensed under CC BY-NC-ND 2.0