Recently, I was looking for an interesting data set to hone my classification machine learning skills. I wanted to solve a problem that could have some real-world business impact, and I wanted the data set to be LARGE. Well, I got my wish when I stumbled upon this data collected by Prof. Julian McAuley.

Amazon Data Link

This data set spans Amazon review data from May 1996 to July 2014 and contains 83 million reviews! It's an incredible data set with a wealth of information. This post will detail how to read in the data and do some basic processing and exploration, so let's get started!


Before starting, let's discuss the main issue this project aims to solve:

Millions of users post content online, expressing their opinions on all manner of goods, services, topics, and social affairs. With the growing popularity of e-commerce, online retailers such as Amazon, Flipkart, and eBay increasingly rely on user-supplied product reviews to crowd-source the online shopping experience, providing information to consumers and feedback to manufacturers. Because such reviews on popular sites are numerous, diverse, and of widely varying quality, this project aims to develop a machine learning approach that automatically predicts whether a given product review is 'helpful' or 'unhelpful', and then assesses and ranks the helpfulness of reviews.

This project uses the Amazon product review platform. After a consumer purchases a product, Amazon prompts him or her to write a review and rate the product on a star-rating scale of one to five stars. To differentiate reviews by helpfulness, Amazon has implemented an interface that lets customers vote on whether a particular review was helpful or unhelpful. The fraction of customers who deemed the review helpful is displayed with the review, and Amazon uses these ratings to rank the reviews, displaying the most helpful ones on the product's front page. The drawback is that recently written reviews are at a disadvantage, since fewer people have had a chance to vote on their helpfulness. Reviews with few votes cannot be ranked effectively and will not gain visibility until they have accumulated enough votes, which can take some time.

This project therefore aims to assess the helpfulness of reviews automatically, without waiting for users to vote over the course of time. Users would then get the most relevant, helpful, and up-to-date reviews possible, with no delay before more helpful reviews are displayed. Such automatic classification would also help root out poorly written reviews that lack information helpful to other consumers. In short, the project tries to minimize the time a customer spends researching whether a particular product is worth buying, by ranking the reviews so that the customer can decide after reading only the first few vital reviews. As the number of online retailers collecting customer feedback grows day by day, this approach has broad real-world application. The urge to understand how multiple classification algorithms actually behave on a dataset, along with their advantages and limitations, motivated me to work on this idea.


Consider a product:

[Image: Nikon camera product listing]

Now, consider two sample reviews of this product:

[Image: older review with many helpful votes]

[Image: recent review with few helpful votes]

The first review is dated May 5, 2015 and contains hardly any information, yet 48 people found it helpful. The second review is dated November 11, 2016 and is far more informative, yet only 1 person has found it helpful. This is the issue we need to address.


With that thought, let's get started.

First, what kind of data are we dealing with? It is user review data for Amazon products, and an example of the web-page version of this data is below:

[Image: sample Amazon review as shown on the product page]

As you can see, the review starts by stating how many people found it helpful (56 of 63 in this case), followed by the star rating out of five, a summary, the user name, the date, and finally the actual review text.

Ok, so we have JSON data for 83 million reviews; what do we do with it? Well, what I did was download just one category's data. In Python, the data can be read as follows:

import json

import pandas as pd

def parse(path):
    """Yield one review record per line of the raw file."""
    with open(path, 'r') as g:
        for line in g:
            # json.loads works on the strict-JSON version of the files;
            # if your copy uses Python-style literals, ast.literal_eval(line)
            # is a safe alternative -- avoid eval() on untrusted data
            yield json.loads(line)

reviews = [review for review in parse(reviews_data)]
df_reviews = pd.DataFrame(reviews)
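If the load worked, a quick peek should confirm the familiar review fields (the exact row count depends, of course, on which category you downloaded):

print(df_reviews.shape)
print(df_reviews.columns.tolist())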

Now, a sample row in the dataframe looks like:

{
  "reviewerID": "A2SUAM1J3GNN3B",
  "asin": "0000013714",
  "reviewerName": "J. McDonald",
  "helpful": [2, 3],
  "reviewText": "I bought this for my husband who plays the piano.
                 He is having a wonderful time playing these old
                 hymns. The music is at times hard to read because
                 we think the book was published for singing from
                 more than playing from. Great purchase though!",
  "overall": 5.0,
  "summary": "Heavenly Highway Hymns",
  "unixReviewTime": 1252800000,
  "reviewTime": "09 13, 2009"
}

where the feature descriptions are:

  • reviewerID – ID of the reviewer, e.g. A2SUAM1J3GNN3B
  • asin – ID of the product, e.g. 0000013714
  • reviewerName – name of the reviewer
  • helpful – helpfulness rating of the review, e.g. 2/3
  • reviewText – text of the review
  • overall – rating of the product
  • summary – summary of the review
  • unixReviewTime – time of the review (unix time)
  • reviewTime – time of the review (raw)

Ok, great, now we have the reviews data in a pandas dataframe, but what do we do with it? Well, there are a few things I did first. You'll notice that the 'helpful' column contains values that look like '[56, 63]': the first value represents the number of helpful votes, the second the overall votes. This data is useful to us, but not in this form, so let's create some additional columns! I also cast the 'unixReviewTime' column to a datetime object, which helps us later; a minimal version of that cast is below.
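The cast is a one-liner, since pandas can interpret Unix epoch seconds directly (the 'review_date' column name here is just my choice):

# convert the Unix epoch seconds into a proper datetime column
df_reviews['review_date'] = pd.to_datetime(df_reviews['unixReviewTime'], unit='s')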

"""This function extracts information out of the tuple 'helpful' 
column so that we can start to create some other features"""

def creating_basic_features(): 
    df_reviews['helpful_votes'] = df_reviews.helpful.apply(lambda x: x[0])
    df_reviews['overall_votes'] = df_reviews.helpful.apply(lambda x: x[1])
    df_reviews['percent_helpful'] = round((df_reviews['helpful_votes'] / df_reviews['overall_votes']) * 100)
    df_reviews['review_helpful'] = np.where((df_reviews.percent_helpful > 60) & (df_reviews.overall_votes > 5), 1, 0)

You'll notice that I also created a 'percent_helpful' column, which just divides the number of helpful votes by the overall votes, and a binary 'review_helpful' column which states whether the review is helpful or not. For now that label uses arbitrary cutoffs (more than 60% helpful out of more than 5 overall votes); I'm still working on a fairer way to decide whether a review is helpful, but this metric serves us well for the moment.
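Before doing anything fancier, it's worth checking how these cutoffs actually split the data; a quick sketch:

creating_basic_features()
# fraction of reviews labeled helpful (1) vs. not (0) under the current cutoffs
print(df_reviews['review_helpful'].value_counts(normalize=True))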

Next, it was time to start doing some basic analysis on the review text. For this, I have chosen to use a package called TextBlob. I'll show the code below, then explain what each feature is doing:

# if you haven't installed it yet, just run 'pip install textblob'
from textblob import TextBlob

def create_textblob_features():
    # sentiment scores: polarity is in [-1, 1], subjectivity in [0, 1]
    df_reviews['polarity'] = df_reviews['reviewText'].apply(lambda x: TextBlob(x).sentiment.polarity)
    df_reviews['subjectivity'] = df_reviews['reviewText'].apply(lambda x: TextBlob(x).sentiment.subjectivity)
    # simple length features
    df_reviews['len_words'] = df_reviews['reviewText'].apply(lambda x: len(TextBlob(x).words))
    df_reviews['len_sentences'] = df_reviews['reviewText'].apply(lambda x: len(TextBlob(x).sentences))
    df_reviews['words_per_sentence'] = df_reviews.len_words / df_reviews.len_sentences
    # ratio of unique words to total words (a rough lexical-diversity measure)
    df_reviews['sentence_complexity'] = df_reviews.reviewText.apply(lambda x: len(set(TextBlob(x).words)) / len(TextBlob(x).words))

Ok, so what I'm doing with all of the above is using TextBlob to extract information from each review. The features I hypothesized would be interesting are the polarity of the review, its subjectivity, the number of words, the number of sentences, words per sentence, and sentence complexity (the fraction of distinct words in the review).
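If you're curious what TextBlob returns on a single piece of text, here's a toy example (the sentence is made up, not from the dataset):

from textblob import TextBlob

blob = TextBlob("Great camera, easy to use. The battery life could be better.")
print(blob.sentiment.polarity)       # leans positive, so a value above 0
print(blob.sentiment.subjectivity)   # how opinionated the text is, from 0 to 1
print(len(blob.words), len(blob.sentences))  # word and sentence counts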

All the code is on my GitHub account; I encourage you to play around with all the variables and find your own stories to tell!

That's it for now. In my next post, I will share some of the insights from the statistical analysis I have done on this dataset. I will also discuss how I actually plan to classify a review as helpful or not using some popular machine learning techniques such as random forests, SVMs, and decision tree classifiers.
