
From raw Hadoop pipelines to interpretable helpfulness predictions
Correlate the conversation, embed the words, and surface the reviews readers trust.
Analyzed ~3M Amazon book reviews end-to-end with a big data stack (HDFS, Spark, MongoDB) to explain and predict perceived helpfulness. Hypothesis testing quantified the roles of review length, sentiment, and star rating, while Word2Vec embeddings fed Random Forest, SVR, and MLP regressors for score prediction. The best Random Forest model achieved an MSE of 0.0259 (RMSE 0.1609, R² 0.253).
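A minimal, stdlib-only sketch of how the three reported metrics relate (MSE, RMSE, and R²). The helpfulness scores below are invented for illustration, not the project's data:

```python
import math

def regression_metrics(y_true, y_pred):
    """Compute MSE, RMSE, and R-squared for a set of regression predictions."""
    n = len(y_true)
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n
    rmse = math.sqrt(mse)  # RMSE is just the square root of MSE
    mean_t = sum(y_true) / n
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)  # total variance * n
    ss_res = mse * n                                 # residual sum of squares
    r2 = 1 - ss_res / ss_tot  # fraction of variance explained
    return mse, rmse, r2

# Hypothetical helpfulness scores in [0, 1] (illustrative only)
y_true = [0.9, 0.4, 0.7, 0.2, 0.6]
y_pred = [0.8, 0.5, 0.6, 0.3, 0.7]
mse, rmse, r2 = regression_metrics(y_true, y_pred)
```

Note that the reported figures are internally consistent: √0.0259 ≈ 0.1609.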
The workflow begins with HDFS storage and MapReduce joins that merge book metadata with ratings. Spark notebooks clean the corpus, tokenize reviews, and test six hypotheses (length, sentiment, rating influence, user bias, category, and publisher scale). Local analyses run in MongoDB, while distributed replicas validate the results in Spark using Spearman correlations, ANOVA, and Naive Bayes sentiment tagging. For prediction, reviews are embedded with Gensim Word2Vec (30D and 150D); Random Forest, SVR (RBF kernel), and MLP regressors are tuned via GridSearchCV. The best Random Forest balances bias and variance (MSE 0.0259) and surfaces the main helpfulness drivers. Visual dashboards highlight top publishers and category focus, and confirm that longer, positive, highly rated reviews correlate with helpfulness only up to about 400 words.
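As a hedged illustration of the hypothesis tests, here is a stdlib-only Spearman correlation between review length and helpfulness. The numbers are invented; the project ran the equivalent at scale in MongoDB and Spark:

```python
def _ranks(values):
    """Average 1-based ranks, with ties sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the tied positions, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = _ranks(x), _ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Invented example: word counts vs. helpfulness ratios, with
# gains tapering past roughly 400 words as the dashboards suggest
lengths = [50, 120, 300, 400, 800]
helpfulness = [0.2, 0.45, 0.7, 0.8, 0.75]
rho = spearman(lengths, helpfulness)
```

Spearman is used here (as in the project) because it captures monotone relationships without assuming linearity, which suits a bounded helpfulness ratio.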
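The embedding step can be sketched without Gensim: each review becomes the mean of its tokens' word vectors. The 3-dimensional toy vectors below stand in for the project's trained 30D/150D Word2Vec embeddings:

```python
# Toy 3D "embeddings" standing in for trained Word2Vec vectors (illustrative)
toy_vectors = {
    "great": [0.9, 0.1, 0.0],
    "plot": [0.2, 0.8, 0.1],
    "boring": [-0.7, 0.3, 0.2],
}

def embed_review(tokens, vectors, dim=3):
    """Mean-pool the vectors of in-vocabulary tokens into one review vector."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim  # fully out-of-vocabulary review -> zero vector
    return [sum(col) / len(known) for col in zip(*known)]

review = ["great", "plot", "unseen_word"]  # unknown tokens are skipped
vec = embed_review(review, toy_vectors)
```

These fixed-length review vectors form the feature matrix that the Random Forest, SVR, and MLP regressors consume.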