Clickbait Detection in News Headlines
NLP


High-precision clickbait screening with tunable false-positive control

Keep the headlines your readers love while filtering the bait with explainable, bias-tuned models.


Project Information

Course: Machine Learning
Author: Andrea Alberti
Date: February 2024
Pages: 6

Technologies

Python · Scikit-learn · NumPy · Pandas · Matplotlib

Abstract

Benchmarked Multinomial Naive Bayes and Logistic Regression on 32k balanced news headlines for clickbait detection. Two deployment targets were explored: maximum accuracy (97.12% test accuracy with stopwords retained and an 8k-token vocabulary) and zero false positives (0% FPR at 84% accuracy and 68% TPR). A detailed error analysis highlights the most impactful tokens and the trade-offs introduced by bias calibration.

About

Headlines are lowercased, stripped of punctuation (numbers retained) and vectorised via Bag-of-Words over vocabularies of up to 12k tokens, with and without stopwords. Both Multinomial Naive Bayes and Logistic Regression are trained under two regimes: accuracy-oriented (cross-validated, full feature space) and FPR-oriented (sweeping the bias term from −8 to 8 and selecting the optimal prior). Keeping stopwords proves critical: removing them drops accuracy by more than 2%. Error inspection surfaces ambiguous cases (e.g., concise factual headlines mislabelled as bait) and reveals the most discriminative tokens. The final deliverable offers a toggle: 97% accuracy for general moderation, or 0% FPR (84% accuracy, 68% TPR) when false positives must be eliminated.
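
The pipeline above can be summarised in a short sketch. This is a minimal, illustrative version, not the project's exact code: the `headlines.csv` file and the `headline`/`clickbait` column names are hypothetical, while the 8k vocabulary and the stopword toggle come from the write-up.

```python
# Minimal sketch of the preprocessing/training pipeline described above.
# File and column names are assumptions; only the 8k vocabulary size and
# the stopword toggle are taken from the write-up.
import re

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import MultinomialNB

df = pd.read_csv("headlines.csv")  # hypothetical file name

def clean(text: str) -> str:
    """Lowercase and strip punctuation while retaining digits."""
    return re.sub(r"[^a-z0-9\s]", " ", text.lower())

X_text = df["headline"].map(clean)
y = df["clickbait"]

# Bag-of-Words over the top 8k tokens; pass stop_words="english" to
# reproduce the (worse-performing) stopword-removal variant.
vectorizer = CountVectorizer(max_features=8000, stop_words=None)
X = vectorizer.fit_transform(X_text)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Accuracy-oriented regime: cross-validate both models on the full
# feature space under identical splits.
for model in (MultinomialNB(), LogisticRegression(max_iter=1000)):
    scores = cross_val_score(model, X_train, y_train, cv=5)
    print(type(model).__name__, "CV accuracy:", scores.mean())
```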

Key Results

Accuracy: 97.12%
Best FPR: 0.0%
Accuracy at 0% FPR: 84.00%
Vocabulary: 8,000 words

Key Findings

  • A controlled 32k-headline dataset was preprocessed into Bag-of-Words vocabularies (2k–12k tokens) to compare Multinomial Naive Bayes and Logistic Regression under identical splits.
  • Stopword retention consistently improved validation accuracy for both classifiers, motivating the 8k-token model that reached 97.12% test accuracy.
  • Bias sweeping between −8 and 8 enabled a 0% FPR operating point (84% accuracy, 68% TPR), giving stakeholders a tunable moderation lever.
  • Inspecting per-token log-probability deltas surfaced classic bait patterns (“2015”, “things”, “guess”) versus neutral news markers (“kills”, “iraq”), guiding editorial audits; a sketch of this inspection follows the list.
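
The token inspection amounts to ranking words by the gap between their class-conditional log-probabilities in the fitted Naive Bayes model. A hedged sketch, continuing from the pipeline snippet above and assuming class 1 encodes clickbait:

```python
# Token "log delta" inspection: for a fitted MultinomialNB, the gap
# between the two classes' per-token log-probabilities ranks the most
# discriminative words. Reuses `vectorizer`, `X_train`, `y_train` from
# the pipeline sketch; the class-1-is-clickbait encoding is assumed.
import numpy as np
from sklearn.naive_bayes import MultinomialNB

nb = MultinomialNB().fit(X_train, y_train)

tokens = np.array(vectorizer.get_feature_names_out())
# log P(token | clickbait) - log P(token | news)
log_delta = nb.feature_log_prob_[1] - nb.feature_log_prob_[0]

order = np.argsort(log_delta)
print("Most news-like tokens:     ", tokens[order[:10]])
print("Most clickbait-like tokens:", tokens[order[-10:]])
```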

Methodology

Balanced headlines (32k) were cleaned and vectorised with Bag-of-Words vocabularies (2k–12k tokens), with and without stopwords. Multinomial Naive Bayes and Logistic Regression were then trained on identical splits, and bias values (−8 to 8) were swept to minimise FPR while inspecting the most impactful tokens.
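
The bias sweep can be pictured as shifting the model's decision threshold in log-odds space: a negative offset makes the classifier more reluctant to flag a headline, trading recall for a lower FPR. The sketch below continues from the earlier snippets and is an assumption about the implementation; only the −8 to 8 offset grid comes from the write-up.

```python
# Illustrative bias sweep over the fitted `nb` model from the snippets
# above: shift the decision margin by an offset b in [-8, 8], record
# FPR/TPR at each setting, and keep the zero-FPR point with the highest
# TPR. Implementation details are assumptions.
import numpy as np
from sklearn.metrics import confusion_matrix

scores = nb.predict_log_proba(X_test)  # log P(class | headline)
margin = scores[:, 1] - scores[:, 0]   # decision margin before biasing

best = None
for b in np.linspace(-8, 8, 161):  # step of 0.1 across the grid
    y_pred = (margin + b > 0).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
    fpr, tpr = fp / (fp + tn), tp / (tp + fn)
    if fpr == 0.0 and (best is None or tpr > best[1]):
        best = (b, tpr)

print("zero-FPR operating point (offset, TPR):", best)
```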