Posts

Perform OCR on an image of typed text

You can use tesseract, an OCR command-line program, to convert an image of typed text, like this one, to plain text (txt):

Here is the terminal command to perform OCR on the example image:

    tesseract image.png stdout --psm 12 --dpi 70 > output.txt

which will output the following text:

You could also apply the tesseract command (but directing the output to the terminal) to a scanned document like this one:

which would output the following text:

For more information about the various command-line options, use tesseract --help or man tesseract.

NOTES:

- tesseract works on Linux, macOS, and Windows.
- The default language for tesseract is English, but it can recognize more than 100 languages.
- The default output format for tesseract is text, but it can also create a searchable pdf output.
- You can also apply OCR on tiff images (multipage tiff) but not pdf documents.

References

Image sources: Wikipedia contributors. "Optical character recognition." Wikipe
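The tesseract command from this post can also be scripted. Here is a minimal Python sketch, assuming tesseract is installed and available on the PATH. The helper names build_tesseract_command and ocr_image are hypothetical; the flags are the ones used in the terminal command above.

```python
import subprocess


def build_tesseract_command(image_path, psm=12, dpi=70):
    """Return the argument list for the tesseract call used in the post."""
    # "stdout" tells tesseract to print the recognized text instead of writing a file.
    return ["tesseract", image_path, "stdout", "--psm", str(psm), "--dpi", str(dpi)]


def ocr_image(image_path, psm=12, dpi=70):
    """Run tesseract on an image and return the recognized text (hypothetical helper)."""
    result = subprocess.run(
        build_tesseract_command(image_path, psm, dpi),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if tesseract exits with an error
    )
    return result.stdout


if __name__ == "__main__":
    # Only show the command; calling ocr_image() requires tesseract and a real image.
    print(build_tesseract_command("image.png"))
```

Calling ocr_image("image.png") would then return the same text that the shell command redirects to output.txt.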

Convert e-books to txt

If you want to convert e-books to plain text (txt) on Linux or other Unix-like systems, here are some command-line utilities that you can use along with their terminal commands:

File type  Utility           Example
.djvu      djvutxt           djvutxt input.djvu output.txt
.djvu      ebook-convert     ebook-convert input.djvu output.txt
.epub      epub2txt          epub2txt input.epub > output.txt
.epub      ebook-convert     ebook-convert input.epub output.txt
.epub      unzip             unzip -c input.epub > output.txt
.doc       catdoc            catdoc input.doc > output.txt
.doc       textutil (macOS)  textutil -convert txt input.doc -output output.txt
.doc       ebook-convert     ebook-convert input.doc output.txt
.pdf       pdftotext         pdftotext input.pdf output.txt
.pdf       ebook-convert     ebook-convert input.pdf output.txt

NOTES:

There is a caveat for unzip: the generated output file will also include HTML tag

Tip: search for a string that spans multiple lines

The search tip below applies to any application that lets you use regular expressions when searching for strings within a file, such as PyCharm:

PyCharm allows you to search your source code with regex

Let's say we want to search for the string "turned into a democracy" in the following text:

The difficulty is that the string sometimes spans multiple lines, and you want to match it no matter where the newline(s) fall within it. If we use the plain search query "turned into a democracy", without any regex tokens, we only match the first occurrence of the string, as shown in the following regex101.com demo:

To match all occurrences of the string no matter how many lines it spans, the following regex will do the trick: "turned\s+into\s+a\s+democracy". We replaced each space between the words with \s+, which matches one or more whitespace characters (newlines included), as shown in the following regex101.com demo:

You can also try it out with PyCharm or any other
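The difference between the two search queries can be reproduced with Python's re module. This is a small sketch; the sample text is made up for illustration and is not the text used in the post.

```python
import re

# Made-up sample: the second occurrence is split across two lines.
text = (
    "The country turned into a democracy.\n"
    "Years later, the neighboring state also turned\n"
    "into a democracy."
)

# Literal spaces only match when the whole phrase sits on one line.
literal = re.findall(r"turned into a democracy", text)

# \s+ matches one or more whitespace characters, including newlines.
flexible = re.findall(r"turned\s+into\s+a\s+democracy", text)

print(len(literal))   # 1
print(len(flexible))  # 2
```

No flag such as re.DOTALL or re.MULTILINE is needed here, because \s already includes the newline character.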

UnicodeEncodeError: 'ascii' codec can't encode character

UnicodeEncodeError: 'ascii' codec can't encode character '\U0001f6d1' in position 2: ordinal not in range(128)

This Python error might happen if your terminal has problems displaying non-ASCII characters when running a Python script. It is mainly a problem with the locale settings used by your terminal. I provide two solutions to resolve it:

- Solution #1: change your locale settings (best solution)
- Solution #2: export PYTHONIOENCODING=utf8 (temporary solution)

Solution #1: change your locale settings (best solution)

The best solution consists of fixing your locale settings since it is permanent and you don’t have to change any Python code. Append the following to ~/.bashrc or ~/.bash_profile:

    export LANG="en_US.UTF-8"
    export LANGUAGE="en_US:en"

You should provide your own UTF-8 based locale settings. The example uses the English (US) locale with the encoding UTF-8. The locale -a  comma
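To check which encoding Python picked up from your terminal, a quick diagnostic like the following can help. It is only a sketch: the helper can_encode is a hypothetical name, and the character is the one from the error message above.

```python
import locale
import sys

STOP_SIGN = "\U0001f6d1"  # the character from the error message

# Encodings that Python derived from the environment (locale settings, PYTHONIOENCODING).
print("stdout encoding:", sys.stdout.encoding)
print("preferred encoding:", locale.getpreferredencoding())


def can_encode(text, encoding):
    """Return True if text can be encoded with the given codec (hypothetical helper)."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


print(can_encode(STOP_SIGN, "ascii"))  # False
print(can_encode(STOP_SIGN, "utf-8"))  # True
```

If the reported stdout encoding is an ASCII variant, printing the character fails for the same reason that can_encode returns False for "ascii".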

NLP in action: train a Naive Bayes model on movie reviews

In the book Natural Language Processing in Action, section 2.3.2 Naive Bayes, we train a multinomial Naive Bayes classifier on movie reviews using scikit-learn's MultinomialNB.

NOTE: I am using numpy 1.19.1, pandas 1.1.3, and scikit-learn 0.23.2.

However, we get a ValueError when transforming the predicted probabilities into the [-4, 4] range as done in the book:

    nb = MultinomialNB()
    nb = nb.fit(df_bows, movies.sentiment > 0)
    movies['predicted_sentiment'] = nb.predict_proba(df_bows) * 8 - 4

    ValueError: Wrong number of items passed 2, placement implies 1

The reason is that nb.predict_proba() returns a numpy array with two columns and we are trying to assign it to a single column of the pandas DataFrame movies (which I believe you could do in previous pandas versions; I am using pandas version 1.1.3):

    array([[1.86060657e-01, 8.13939343e-01],
           [1.19745717e-05, 9.99988025e-01],
           [9.569
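The excerpt stops before a fix is shown, so the following is only one possible workaround and an assumption on my part, not necessarily the one used in the post: keep a single column of the probability array before scaling it. With boolean labels, scikit-learn orders the columns as [False, True], so column 1 holds the probability of a positive review. The sketch uses only numpy and the first two rows printed above.

```python
import numpy as np

# First two rows of the predict_proba() output shown in the post:
# column 0 is P(negative), column 1 is P(positive).
proba = np.array([
    [1.86060657e-01, 8.13939343e-01],
    [1.19745717e-05, 9.99988025e-01],
])

# Keep one column so the result is 1-D and fits a single DataFrame column,
# then map the [0, 1] probabilities onto the book's [-4, 4] range.
predicted_sentiment = proba[:, 1] * 8 - 4

print(predicted_sentiment.shape)  # (2,)
```

Under the same assumption, the failing line would become movies['predicted_sentiment'] = nb.predict_proba(df_bows)[:, 1] * 8 - 4.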

Tips on publishing your Python package to PyPI

If you want to publish your first Python package to PyPI, here are some tips that might help you avoid some pitfalls when releasing your code to the world.

Table of contents

1. Unsupported reST directives and other limitations
   1.1 PyPI limitations
   1.2 GitHub limitations
   1.3 Limitations from all
2. Versioning your project
3. Validate your documentation before uploading
4. Make a release
5. Publishing first to TestPyPI
6. Mistakes found in a published README
7. Remove a release from PyPI
8. Conclusion
9. Resources
10. Notes

1. Unsupported reST directives and other limitations

The README.rst [1] that you painstakingly wrote for readthedocs.org [2] might not render correctly on PyPI, nor on GitHub [3]. Here are some of the limitations that you need to look out for on each website where you might publish your documentation.

IMPORTANT: I am only referring to the reST markup language.

1.1 PyPI limitations

Man

Using pynput on macOS

While working on my personal project Darth-Vader-RPi, I used the Python package pynput to monitor the keyboard in order to simulate push buttons on a Raspberry Pi. However, some keyboard keys were not detected on macOS without running the script with sudo (after adding PYTHONPATH to /etc/sudoers).

The keyboard keys that didn't need sudo were the following:

- alt keys
- cmd keys
- ctrl keys
- media buttons for play, pause, volume up/down/mute
- shift keys

The other keyboard keys (all the alphanumeric keys and some special keys such as backspace and right) required the Python script to run with sudo.

The pynput documentation explains the modifications you should apply to your application if you want to make it run on Linux, macOS, or Windows.

Image credit: "Apple Key" by Antijingoist is licensed under CC BY-NC-ND 2.0