
High-precision clickbait screening with tunable false-positive control
Keep the headlines your readers love while filtering the bait with explainable, bias-tuned models.
Benchmarked Multinomial Naive Bayes and Logistic Regression on 32k balanced news headlines for clickbait detection. Two deployment targets were explored: maximum accuracy (97.12% test accuracy with stopwords retained and an 8k vocabulary) and zero false positives (0% FPR at 84% accuracy and 68% TPR). Detailed error analysis highlights the most impactful tokens and the trade-offs introduced by bias calibration.
Headlines are lowercased, stripped of punctuation (numbers retained) and vectorised via Bag-of-Words over vocabularies of up to 12k tokens, with and without stopwords. Both Multinomial Naive Bayes and Logistic Regression are trained under two regimes: accuracy-oriented (cross-validated, full feature space) and FPR-oriented (sweeping a bias term from −8 to 8 and selecting the optimal prior). Keeping stopwords proves critical: removing them drops accuracy by more than 2%. Error inspection surfaces ambiguous cases (e.g., concise factual headlines mislabelled as bait) and reveals the most discriminative tokens. The final deliverable offers a toggle: 97% accuracy for general moderation, or 0% FPR (84% accuracy, 68% TPR) when false positives must be eliminated.
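The FPR-oriented regime can be sketched as follows. This is a minimal illustration, not the project's actual code: the toy headlines, the `eval_bias` helper and the choice of shifting the Logistic Regression decision score are assumptions standing in for the real 32k-headline pipeline; the vectoriser settings mirror the description above (lowercasing, numbers kept, stopwords retained, capped vocabulary).

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# Toy stand-in corpus; the real project uses ~32k balanced headlines.
headlines = [
    "you won't believe what this cat did next",
    "10 tricks doctors don't want you to know",
    "this one weird trick will change your life",
    "what happened next will shock you",
    "government announces new budget for 2024",
    "stock markets close higher after fed decision",
    "scientists publish study on climate trends",
    "city council approves new transit plan",
]
labels = np.array([1, 1, 1, 1, 0, 0, 0, 0])  # 1 = clickbait

# Lowercase, keep word and number tokens, retain stopwords
# (removing them costs >2% accuracy), cap the vocabulary size.
vec = CountVectorizer(lowercase=True, token_pattern=r"[a-z0-9']+",
                      stop_words=None, max_features=8000)
X = vec.fit_transform(headlines)

clf = LogisticRegression(max_iter=1000).fit(X, labels)

def eval_bias(delta):
    """FPR and TPR after shifting the decision score by `delta`."""
    preds = (clf.decision_function(X) + delta > 0).astype(int)
    tn, fp, fn, tp = confusion_matrix(labels, preds, labels=[0, 1]).ravel()
    return fp / (fp + tn), tp / (tp + fn)

# Sweep the bias from -8 to 8; among all settings with zero false
# positives, keep the one that preserves the highest TPR.
best = None
for delta in np.linspace(-8, 8, 161):
    fpr, tpr = eval_bias(delta)
    if fpr == 0.0 and (best is None or tpr > best[1]):
        best = (delta, tpr)
```

A sufficiently negative bias always drives FPR to zero (everything is labelled non-clickbait), so the sweep is guaranteed to find a candidate; the selection step then recovers as much TPR as the zero-FPR constraint allows, which is the trade-off behind the 0% FPR / 68% TPR operating point.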