
From raw Hadoop pipelines to interpretable helpfulness predictions
Correlate the conversation, embed the words, and surface the reviews readers trust.
Analyzed ~3M Amazon book reviews end-to-end with a big data stack (HDFS, Spark, MongoDB) to explain and predict perceived helpfulness. Hypothesis testing quantified the roles of review length, sentiment, and star rating, while Word2Vec embeddings fed Random Forest, SVR, and MLP regressors for score prediction. The best Random Forest model achieved an MSE of 0.0259 (RMSE 0.1609, R² 0.253).
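A minimal, stdlib-only sketch of how the three reported metrics relate (MSE, RMSE, and R²). The helpfulness scores below are invented for illustration, not the project's data:

```python
import math

def regression_metrics(y_true, y_pred):
    """Compute MSE, RMSE, and R-squared for a set of regression predictions."""
    n = len(y_true)
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n
    rmse = math.sqrt(mse)  # RMSE is just the square root of MSE
    mean_t = sum(y_true) / n
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)  # total variance * n
    ss_res = mse * n                                 # residual sum of squares
    r2 = 1 - ss_res / ss_tot  # fraction of variance explained
    return mse, rmse, r2

# Hypothetical helpfulness scores in [0, 1] (illustrative only)
y_true = [0.9, 0.4, 0.7, 0.2, 0.6]
y_pred = [0.8, 0.5, 0.6, 0.3, 0.7]
mse, rmse, r2 = regression_metrics(y_true, y_pred)
```

Note that the reported figures are internally consistent: √0.0259 ≈ 0.1609.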
The workflow begins with HDFS storage and MapReduce joins that merge book metadata with ratings. Spark notebooks clean the corpus, tokenize reviews, and test six hypotheses (length, sentiment, rating influence, user bias, category, and publisher scale). Local analyses run in MongoDB, while distributed replicas validate the results in Spark using Spearman correlations, ANOVA, and Naive Bayes sentiment tagging. For prediction, reviews are embedded with Gensim Word2Vec (30D and 150D); Random Forest, SVR (RBF kernel), and MLP regressors are tuned via GridSearchCV. The best Random Forest balances bias and variance (MSE 0.0259) and surfaces the main helpfulness drivers. Visual dashboards highlight top publishers and category focus, and confirm that longer, positive, highly rated reviews correlate with helpfulness only up to about 400 words.
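As a hedged illustration of the hypothesis tests, here is a stdlib-only Spearman correlation between review length and helpfulness. The numbers are invented; the project ran the equivalent at scale in MongoDB and Spark:

```python
def _ranks(values):
    """Average 1-based ranks, with ties sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the tied positions, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = _ranks(x), _ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Invented example: word counts vs. helpfulness ratios, with
# gains tapering past roughly 400 words as the dashboards suggest
lengths = [50, 120, 300, 400, 800]
helpfulness = [0.2, 0.45, 0.7, 0.8, 0.75]
rho = spearman(lengths, helpfulness)
```

Spearman is used here (as in the project) because it captures monotone relationships without assuming linearity, which suits a bounded helpfulness ratio.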
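The embedding step can be sketched without Gensim: each review becomes the mean of its tokens' word vectors. The 3-dimensional toy vectors below stand in for the project's trained 30D/150D Word2Vec embeddings:

```python
# Toy 3D "embeddings" standing in for trained Word2Vec vectors (illustrative)
toy_vectors = {
    "great": [0.9, 0.1, 0.0],
    "plot": [0.2, 0.8, 0.1],
    "boring": [-0.7, 0.3, 0.2],
}

def embed_review(tokens, vectors, dim=3):
    """Mean-pool the vectors of in-vocabulary tokens into one review vector."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim  # fully out-of-vocabulary review -> zero vector
    return [sum(col) / len(known) for col in zip(*known)]

review = ["great", "plot", "unseen_word"]  # unknown tokens are skipped
vec = embed_review(review, toy_vectors)
```

These fixed-length review vectors form the feature matrix that the Random Forest, SVR, and MLP regressors consume.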