Today we will focus on the statistical analysis of the dataset and on testing various algorithms against it.


Before that, for readers who are new here, I would like to briefly restate the goals of my project (I encourage you to visit my first post for deeper insights):

  • Efficiently predict the helpfulness of a review.
  • Rank the reviews according to their content, rather than just their ‘Helpful Votes’.

In Machine Learning, ‘HAVING THE DATA’ and ‘KNOWING THE DATA’ are two different things. Applying various algorithms just to get results, without really knowing the data, won’t improve those results. The pivotal thing that ‘KNOWING THE DATA’ decides is the set of features on which we train the model. For this purpose, I carried out the statistical analysis of the dataset described below:

Analysis is done on:

  1. Truncated Dataset: Reviews that have at least 5 votes.
  2. Original Dataset.

Both follow almost the same trend. The truncation step is sketched below.
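As a reference, here is a minimal sketch of the truncation step in pandas. The column names helpful_votes and total_votes and the file name reviews.json are placeholders, not necessarily what the original dataset uses:

```python
import pandas as pd

# Load the raw reviews (file name is a placeholder).
reviews = pd.read_json("reviews.json", lines=True)

# Truncated dataset: keep only reviews that received at least 5 votes.
# 'helpful_votes' and 'total_votes' are placeholder column names.
truncated = reviews[reviews["total_votes"] >= 5].copy()

# Helpfulness percentage = helpful votes / total votes.
truncated["helpfulness"] = truncated["helpful_votes"] / truncated["total_votes"]
```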


Statistical Analysis:

  • Most reviews have high overall ratings.

[Figure 1: Histogram of overall ratings (truncated data)]

  • Most reviews have a high helpfulness percentage.

[Figure 2: Distribution of helpfulness (truncated data)]

  • The class distribution (helpful vs. not-helpful reviews).

[Figure 3: Histogram of helpfulness (truncated data)]

  • As overall ratings increase, helpfulness increases.

[Figure 4: Overall ratings vs. helpfulness (truncated data)]

  • Word clouds for negative and positive reviews, respectively.

[Figure 5: Word cloud for negative reviews (truncated data)]

[Figure 6: Word cloud for positive reviews (truncated data)]

  • Very few reviews have a high word count. Average number of words per helpful review: 69.

[Figure 7: Histogram of number of words (truncated data)]

  • The polarity of a review increases with its overall rating.

[Figure 8: Variance of polarity with respect to overall ratings (truncated data)]

  • Helpful reviews tend to contain more sentences.

[Figure 9: Number of sentences vs. helpfulness (truncated data)]
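Most of these plots can be reproduced with standard pandas and matplotlib calls. A minimal sketch for two of them, assuming the truncated DataFrame from the earlier sketch and a hypothetical ‘overall’ column holding the star rating:

```python
import matplotlib.pyplot as plt

# Histogram of overall ratings (first plot above).
truncated["overall"].hist(bins=5)
plt.xlabel("Overall rating")
plt.ylabel("Number of reviews")
plt.show()

# Mean helpfulness for each overall rating (ratings vs. helpfulness plot).
truncated.groupby("overall")["helpfulness"].mean().plot(kind="bar")
plt.xlabel("Overall rating")
plt.ylabel("Mean helpfulness")
plt.show()
```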


Testing various Algorithms:

Based on this analysis, the following features were chosen (a feature-extraction sketch follows the list):

  • Overall product rating.
  • Lateness of the review.
  • Review text length (number of words & number of sentences per review).
  • Term frequencies of review texts.
  • Review Polarity ∈ [-1, 1].
  • Review Subjectivity ∈ [0, 1].

The data was split into a training set (80%) and a testing set (20%).
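Below is a rough sketch of how these features could be assembled with TextBlob and scikit-learn. The column names reviewText and helpful_label are placeholders, the lateness feature is omitted, and this is not necessarily the exact pipeline used in the project:

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from textblob import TextBlob

# Review bodies ('reviewText' is a placeholder column name).
texts = truncated["reviewText"].fillna("")

# Polarity in [-1, 1] and subjectivity in [0, 1] via TextBlob.
polarity = texts.apply(lambda t: TextBlob(t).sentiment.polarity)
subjectivity = texts.apply(lambda t: TextBlob(t).sentiment.subjectivity)

# Length features: number of words and sentences per review.
n_words = texts.apply(lambda t: len(t.split()))
n_sentences = texts.apply(lambda t: len(TextBlob(t).sentences))

# Term frequencies of the review texts.
term_freq = CountVectorizer(max_features=5000).fit_transform(texts)

# Combine the dense features with the sparse term-frequency matrix.
dense = np.column_stack([truncated["overall"], polarity, subjectivity,
                         n_words, n_sentences])
X = hstack([csr_matrix(dense), term_freq])
y = truncated["helpful_label"]  # placeholder binary target (helpful or not)

# 80% training / 20% testing split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
```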

Models used:

  • Logistic Regression.
  • Decision Tree Classifier.
  • Support Vector Machines (with linear and RBF kernel functions).
Classification Model        Accuracy (%)    Precision (%)
Logistic Regression         83.56           83.56
SVM (Linear Kernel)         79.2            76.8
SVM (RBF Kernel)            87.8            88.3
Decision Tree Classifier    83.7            85.55

As seen above, the Support Vector Machine with an RBF kernel yields the highest accuracy and precision scores on this dataset.
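A minimal sketch of how such a comparison could be run with scikit-learn, reusing the train/test split from the feature-extraction sketch above (the scores depend on the exact preprocessing and hyperparameters, so they will not reproduce the table exactly):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# The four classifiers compared above.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM (Linear Kernel)": SVC(kernel="linear"),
    "SVM (RBF Kernel)": SVC(kernel="rbf"),
    "Decision Tree Classifier": DecisionTreeClassifier(random_state=42),
}

# Fit each model on the training split and score it on the held-out split.
for name, model in models.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    print(f"{name}: accuracy {accuracy_score(y_test, preds):.2%}, "
          f"precision {precision_score(y_test, preds):.2%}")
```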



In the next post, we will actually rank the reviews of a particular product using an interesting concept called Entropy. Stay tuned to learn more about it.
