NLP in action: train a Naive Bayes model on movie reviews




In the book Natural Language Processing in Action, section 2.3.2 Naive Bayes, we train a multinomial Naive Bayes classifier on movie reviews using scikit-learn's MultinomialNB. 

NOTE: I am using numpy 1.19.1, pandas 1.1.3, and scikit-learn 0.23.2

However, rescaling the predicted probabilities to the [-4, 4] range, as done in the book, raises a ValueError:
nb = MultinomialNB()
nb = nb.fit(df_bows, movies.sentiment > 0)
movies['predicted_sentiment'] = nb.predict_proba(df_bows) * 8 - 4

ValueError: Wrong number of items passed 2, placement implies 1

The reason is that nb.predict_proba() returns a 2D numpy array with two columns, one per class, and we are trying to assign it to a single column of the pandas DataFrame movies (which I believe was possible in earlier pandas versions):
array([[1.86060657e-01, 8.13939343e-01],
       [1.19745717e-05, 9.99988025e-01],
       [9.56997000e-01, 4.30029997e-02],
       ...,
       [6.85181156e-01, 3.14818844e-01],
       [1.37651091e-03, 9.98623489e-01],
       [9.99744277e-01, 2.55722778e-04]])

The solution is to extract only the second column (corresponding to the positive class) from the 2D array of predicted probabilities, as proposed by one of the authors in the book's discussion forum:
movies['predicted_sentiment'] = nb.predict_proba(df_bows)[:, 1] * 8 - 4
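
As a minimal, self-contained sketch of the column selection above, the following uses a tiny made-up bag-of-words matrix (the names X, y, toy, and pos_col are hypothetical and not from the book). Rather than assuming the positive class is always the second column, it looks the column up in nb.classes_:

```python
import numpy as np
import pandas as pd
from sklearn.naive_bayes import MultinomialNB

# Toy word-count matrix (4 documents, 3 words); illustrative data only.
X = np.array([[2, 0, 1],
              [3, 0, 0],
              [0, 2, 1],
              [0, 3, 0]])
y = np.array([True, True, False, False])  # True = positive sentiment

nb = MultinomialNB().fit(X, y)
proba = nb.predict_proba(X)  # 2D array, shape (n_samples, n_classes)

# Columns are ordered like nb.classes_, which is sorted: [False, True].
pos_col = list(nb.classes_).index(True)

toy = pd.DataFrame({'sentiment': [3.0, 2.5, -2.0, -3.5]})
# Selecting one column gives a 1D array that fits a single DataFrame column.
toy['predicted_sentiment'] = proba[:, pos_col] * 8 - 4
```

With boolean targets the positive class ends up in column 1, which is why the [:, 1] slice works here.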

You can now compute your model's mean absolute error (MAE):
movies['error'] = (movies.predicted_sentiment - movies.sentiment).abs()
print(movies.error.mean().round(1))
Output:
1.9

However, this MAE differs from the book's value of 2.4, although the classification accuracy is the same (93%). The reason is that the book's MAE of 2.4 is computed from predictions converted to exactly -4 or 4, not rescaled to the [-4, 4] range as the book's code suggests.

Thus, we need to replace the previous line, which rescales the predicted probabilities, with the following line, which instead converts the False and True predictions to -4 and 4, respectively:
movies['predicted_sentiment'] = np.where(nb.predict(df_bows), 4, -4)
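
To make the difference between the two ways of computing the MAE concrete, here is a small sketch with hand-picked probabilities and ratings (all values are invented for illustration and do not come from the movie-review data). It also shows that the boolean array should be passed to np.where directly: an identity test such as `arr is True` compares the whole array object with the singleton True, so it evaluates to a single False instead of an element-wise mask.

```python
import numpy as np
import pandas as pd

# Hypothetical P(positive) values and true ratings in [-4, 4].
p_positive = np.array([0.9, 0.6, 0.2, 0.4])
toy = pd.DataFrame({'sentiment': [3.0, 1.0, -2.0, 1.5]})

# Soft predictions: scale the probabilities linearly onto [-4, 4].
soft = p_positive * 8 - 4
# Hard predictions: threshold at 0.5, then map True/False to 4/-4.
is_positive = p_positive > 0.5
hard = np.where(is_positive, 4, -4)

mae_soft = (soft - toy.sentiment).abs().mean()
mae_hard = (hard - toy.sentiment).abs().mean()

# Accuracy depends only on the sign, so both variants agree.
truth = toy.sentiment > 0
acc_soft = ((soft > 0) == truth).mean()
acc_hard = ((hard > 0) == truth).mean()

# Pitfall: `is` tests object identity, so this is one False, not a mask.
identity_check = is_positive is True
```

On this toy data the hard ±4 predictions give a larger MAE than the rescaled probabilities while the accuracy stays identical, which matches the direction of the gap described above (2.4 versus 1.9 at the same accuracy).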

