
Benchmarking bag-of-words sentiment models on the IMDb corpus
From token counts to emotional cues: a fast route to reliable review sentiment.
Sentiment classifiers were implemented on the IMDb dataset (50k reviews), comparing Multinomial Naive Bayes and Logistic Regression. Vocabulary size, stopword removal, and stemming were varied to balance accuracy against overfitting. Naive Bayes with stopwords retained and a 1k-token vocabulary achieved 82.6% test accuracy, while Logistic Regression reached 85.4% with minimal tuning.
Reviews are tokenised with NLTK: lowercased, punctuation stripped, with optional stopword removal and optional Porter stemming. Vocabularies of up to 10k tokens are evaluated; the sweet spot is 1k tokens with stopwords kept, which avoids overfitting while keeping features informative. Multinomial Naive Bayes learns class priors and word likelihoods from relative frequencies, while Logistic Regression is trained with gradient descent (lr=0.0023, 1000 iterations) and no regularisation.

Accuracy-vs-vocabulary curves show diminishing returns and a widening train-test gap beyond 5k tokens. Word-importance analysis ranks “superb”, “fantastic” and “waste” as the most decisive markers, and inspection of the worst errors highlights ambiguous reviews (sarcasm, plot summaries).
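As a minimal sketch of the preprocessing step, the snippet below lowercases, strips punctuation, and optionally removes stopwords and applies Porter stemming with standard NLTK components; the function and argument names are illustrative, not taken from the project.

```python
# Preprocessing sketch using NLTK's word_tokenize, stopwords corpus and
# PorterStemmer; requires nltk.download("punkt") and nltk.download("stopwords").
# Names and defaults are illustrative assumptions.
import string

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

STOPWORDS = set(stopwords.words("english"))
STEMMER = PorterStemmer()
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def preprocess(review, remove_stopwords=False, stem=False):
    """Lowercase, strip punctuation, optionally drop stopwords and stem."""
    tokens = word_tokenize(review.lower().translate(PUNCT_TABLE))
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    if stem:
        tokens = [STEMMER.stem(t) for t in tokens]
    return tokens


# The reported sweet spot keeps stopwords, so remove_stopwords stays False.
print(preprocess("A superb film, not a waste of time!", stem=True))
```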
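For the Naive Bayes model, a sketch of fitting class priors and word likelihoods from relative frequencies over a bag-of-words count matrix follows; the count-matrix representation and the Laplace-smoothing constant are assumptions, not details from the project.

```python
# Multinomial Naive Bayes from relative frequencies over a (n_docs, vocab)
# count matrix X with labels y. Laplace smoothing (alpha) is an assumption;
# the project may use unsmoothed relative frequencies.
import numpy as np


def fit_multinomial_nb(X, y, alpha=1.0):
    classes = np.unique(y)
    log_prior = np.log(np.array([(y == c).mean() for c in classes]))
    log_likelihood = []
    for c in classes:
        counts = X[y == c].sum(axis=0) + alpha   # smoothed word counts per class
        log_likelihood.append(np.log(counts / counts.sum()))
    return log_prior, np.vstack(log_likelihood)


def predict_nb(X, log_prior, log_likelihood):
    # Log-posterior up to a constant: log P(c) + sum_w count(w) * log P(w | c)
    return np.argmax(X @ log_likelihood.T + log_prior, axis=1)
```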
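Similarly, a sketch of the logistic-regression training loop with the quoted hyperparameters (lr=0.0023, 1000 iterations, no regularisation); the full-batch update, zero initialisation and separate bias term are assumptions.

```python
# Unregularised logistic regression trained by gradient descent; full-batch
# updates, zero init and a separate bias term are assumptions.
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def train_logreg(X, y, lr=0.0023, n_iters=1000):
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(n_iters):
        p = sigmoid(X @ w + b)          # predicted P(positive review)
        w -= lr * (X.T @ (p - y) / n)   # gradient of mean cross-entropy w.r.t. w
        b -= lr * (p - y).mean()        # ... and w.r.t. the bias
    return w, b


def predict_logreg(X, w, b):
    return (sigmoid(X @ w + b) >= 0.5).astype(int)
```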