Sentiment Analysis on Social Media
NLP


Benchmarking bag-of-words sentiment models on the IMDb corpus

From token counts to emotional cues: a fast route to reliable review sentiment.


Project Information

Course
Machine Learning
Authors
Andrea Alberti
Date
February 2024
Pages
5

Technologies

Python, Scikit-learn, NLTK, NumPy, Pandas

Abstract

Sentiment classifiers were implemented on the IMDb dataset (50k reviews), comparing Multinomial Naive Bayes and Logistic Regression. Vocabulary size, stopword removal and stemming were studied to balance accuracy against overfitting. Naive Bayes with stopwords kept and a 1k-token vocabulary achieved 82.6% test accuracy, while Logistic Regression reached 85.4% with minimal tuning.

About

Reviews are tokenised with NLTK (lowercased, punctuation stripped, with stopword removal and Porter stemming as optional steps). Vocabularies of up to 10k tokens are evaluated; the sweet spot is 1k tokens with stopwords kept, which avoids overfitting while keeping the features informative. Multinomial Naive Bayes learns class priors and word likelihoods from relative frequencies, while Logistic Regression is trained with gradient descent (lr = 0.0023, 1000 iterations) and no regularisation. Accuracy-vs-vocabulary curves reveal diminishing returns and a widening train-test gap beyond 5k tokens. Word-importance analysis ranks “superb”, “fantastic” and “waste” as the most decisive markers, and inspection of the worst errors highlights ambiguous reviews (sarcasm, plot summaries).
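
As a rough illustration of this pipeline, the sketch below wires the same steps together with NLTK and scikit-learn. The `tokenize` and `train_and_eval` helpers are hypothetical names chosen here for brevity, and the scikit-learn estimators stand in for the report's own implementations; in particular, the project's Logistic Regression uses plain gradient descent (lr = 0.0023, 1000 iterations, no regularisation), which scikit-learn's defaults only approximate.

```python
# Minimal sketch of the pipeline described above: NLTK tokenisation,
# a capped bag-of-words vocabulary, and the two classifiers.
# These are illustrative stand-ins, not the project's own code.
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB

for pkg in ("punkt", "punkt_tab", "stopwords"):
    nltk.download(pkg, quiet=True)

STOPWORDS = set(stopwords.words("english"))
STEMMER = PorterStemmer()


def tokenize(text, remove_stopwords=False, stem=False):
    """Lowercase, keep alphabetic tokens, optionally drop stopwords / stem."""
    tokens = [t for t in nltk.word_tokenize(text.lower()) if t.isalpha()]
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    if stem:
        tokens = [STEMMER.stem(t) for t in tokens]
    return tokens


def train_and_eval(train_texts, y_train, test_texts, y_test, vocab_size=1000):
    """Fit both classifiers on a vocab_size bag-of-words and report accuracy."""
    vec = CountVectorizer(tokenizer=tokenize, max_features=vocab_size)
    X_train = vec.fit_transform(train_texts)
    X_test = vec.transform(test_texts)

    # Note: the report trains Logistic Regression with gradient descent and
    # no regularisation; scikit-learn's solver and L2 default differ slightly.
    for name, model in (("MultinomialNB", MultinomialNB()),
                        ("LogisticRegression", LogisticRegression(max_iter=1000))):
        model.fit(X_train, y_train)
        print(f"{name}: "
              f"train={accuracy_score(y_train, model.predict(X_train)):.3f} "
              f"test={accuracy_score(y_test, model.predict(X_test)):.3f}")
    return vec
```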

Key Results

85.4%
Best Test Accuracy
82.6%
MNB Accuracy
85.4%
LR Accuracy
1,000 tokens
Best Vocabulary Size

Key Findings

  • Multinomial Naive Bayes with stopwords and a 1,000-token vocabulary achieved 82.6% test accuracy (82.0% train).
  • Removing stopwords reduced test accuracy to 81.6% (−1.0 pp), while applying Porter stemming yielded 81.7%.
  • Logistic Regression (tol 1e-4, lr 0.0023) reached 85.4% test and 86.7% train accuracy without regularisation.
  • High-impact tokens included “superb”, “wonderful”, “fantastic” (Δlog up to +1.737) and “waste”, “pointless”, “worst” (Δlog down to −2.601).
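
The Δlog scores behind the last finding can be read off a fitted Multinomial Naive Bayes model as the gap between a token's per-class log-likelihoods. The helper below is a hedged sketch assuming fitted scikit-learn `MultinomialNB` and `CountVectorizer` objects like those in the earlier snippet; the exact scores quoted above come from the report's own implementation.

```python
# Rank vocabulary tokens by how strongly they pull a review toward the
# positive or negative class under a fitted Multinomial Naive Bayes model.
import numpy as np


def word_importance(mnb, vectorizer, top_k=10):
    """Rank tokens by log P(token | positive) - log P(token | negative)."""
    vocab = np.array(vectorizer.get_feature_names_out())
    # feature_log_prob_ has shape (n_classes, n_features); this assumes
    # mnb.classes_ orders [negative, positive] -- check before trusting signs.
    delta_log = mnb.feature_log_prob_[1] - mnb.feature_log_prob_[0]
    order = np.argsort(delta_log)
    positive_markers = list(zip(vocab[order[-top_k:]][::-1],
                                delta_log[order[-top_k:]][::-1]))
    negative_markers = list(zip(vocab[order[:top_k]], delta_log[order[:top_k]]))
    return positive_markers, negative_markers
```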

Methodology

IMDb reviews were tokenised (lowercased, punctuation removed), with stopword removal and Porter stemming evaluated as optional variants. Vocabularies from 500 to 10k tokens were tested by training Multinomial Naive Bayes and Logistic Regression (lr = 0.0023, 1000 iterations) and comparing train and test accuracy to measure overfitting.
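
A minimal sketch of that vocabulary sweep is shown below. For brevity it uses CountVectorizer's default tokenisation rather than the NLTK pipeline, and `vocab_sweep` is a hypothetical helper, not the project's code; the texts and labels are assumed to be lists of strings and 0/1 labels from the IMDb split.

```python
# Illustrative accuracy-vs-vocabulary sweep: train on increasing vocabulary
# sizes and track how the train-test gap widens at larger vocabularies.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB


def vocab_sweep(train_texts, y_train, test_texts, y_test,
                sizes=(500, 1000, 2000, 5000, 10000)):
    """Return (vocab_size, train_acc, test_acc, gap) for each vocabulary size."""
    results = []
    for vocab_size in sizes:
        vec = CountVectorizer(max_features=vocab_size)
        X_tr = vec.fit_transform(train_texts)
        X_te = vec.transform(test_texts)
        clf = MultinomialNB().fit(X_tr, y_train)
        train_acc = accuracy_score(y_train, clf.predict(X_tr))
        test_acc = accuracy_score(y_test, clf.predict(X_te))
        # A growing train-test gap at large vocabularies signals overfitting.
        results.append((vocab_size, train_acc, test_acc, train_acc - test_acc))
    return results
```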