Facebook held its fourth Kaggle recruiting competition a year ago. I decided to work on the dataset because the feature engineering part intrigued me. The goal was to predict whether a bidder is a human or a robot based on their history of bids on an online auction platform.
The administrators gave participants three datasets: a training set containing bidder IDs and the target (human or robot), the corresponding test set that was to be scored, and a third dataset containing 7.6 million bids on different auctions.
What I liked was that participants had to create their own features in order to compete. For me, feature engineering is the most fun part of a modelling project because you need to be creative. It was not like other challenges, in which all the columns of the dataset are anonymized and the only way to improve model performance is to optimize the machine learning part.
That is why I considered that this competition represented two challenges: feature engineering and model optimization. In this post, I focus only on the feature engineering part.
In many ways, feature design for bot detection is very similar to what needs to be done for fraud detection. After all, to create smart features you have to put yourself in the bot designer / fraudster’s shoes. And that’s exactly what I do when I’m working on these types of projects.
About the dataset:
The bidder_id is the unique identifier of the bidder that we have to score (human or robot). The bids dataset provides information on the auction (auction is the identifier and merchandise the category of product sold), on the conditions of the bid (device, country of IP address, anonymised IP, referer URL), and an anonymised time. Note that there is no price information. This is because each bid adds the exact same amount of money every time.
So what could distinguish a human from a robot? Here is the list of the features I decided to use.
My first intuition was that, from a physical point of view, a robot is able to take part in more auctions than a human. Hence, the first feature is simply the total number of auctions.
The second intuition is that a human will use only a small number of devices and IP addresses, whereas a robot can use many more. So I created features on the number of distinct devices, countries, IPs and URLs used by a bidder. I also used counts on combinations of these.
For example, the total number of distinct (device, IP) pairs per bidder. This intuition can be confirmed by looking at the data. Other features could be the main country used and the main merchandise bought.
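The counting features described above can be sketched in a few lines of plain Python. This is only an illustration, not the code used in the competition: the function name distinct_counts, the feature names, and the representation of a bid as a dict with the keys bidder_id, auction, device and ip are assumptions made for the example.

```python
from collections import defaultdict


def distinct_counts(bids):
    """Count distinct auctions, devices, IPs and (device, IP) pairs per bidder.

    `bids` is an iterable of dicts with the (assumed) keys
    bidder_id, auction, device and ip.
    """
    seen = defaultdict(lambda: defaultdict(set))
    for bid in bids:
        per_bidder = seen[bid["bidder_id"]]
        per_bidder["n_auctions"].add(bid["auction"])
        per_bidder["n_devices"].add(bid["device"])
        per_bidder["n_ips"].add(bid["ip"])
        # Combination feature: distinct (device, IP) pairs.
        per_bidder["n_device_ip"].add((bid["device"], bid["ip"]))
    # Replace each set by its size to obtain the feature values.
    return {
        bidder: {name: len(values) for name, values in sets.items()}
        for bidder, sets in seen.items()
    }
```

The same pattern extends to country and URL, or to any other combination, by adding one more set per bidder.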
Who cares about time obfuscation?
There was a big discussion on the forum about time obfuscation. I chose to play without trying to break it because the information on time order is already important.
To understand why, we need to think again like a robot designer. What would be the goal of a robot? How and why would a robot enter an auction? Well, I see two main reasons (apart from trolling) to have a bidding robot.
- The first is to win the auction.
- The second is to make the price artificially higher.
In this setting, the winner of an auction is simply its last bidder, which is information I could easily retrieve. On the other hand, if I wanted to raise the price, I would not win, but I would probably finish second, bid on myself several times, or make a new bid every time someone else bids.
I also had the intuition that a robot would be more likely to be the first to bid on an auction or to bid after the first person has entered. So all this boils down to the following features:
- percentage of times first, second, second-to-last, and last in an auction,
- percentage of bids on oneself,
- and percentage of bids made right after another bidder’s bid.
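Here is a minimal sketch of how some of these order-based features could be computed. It rests on assumptions of mine: a bid is a (bidder_id, auction, time) tuple, "first" and "last" refer to the bidder who placed the first or last bid of the auction, and a bid "on oneself" is a bid whose previous bid in the same auction came from the same bidder. The second and second-to-last positions follow the same pattern and are left out for brevity.

```python
from collections import defaultdict


def position_features(bids):
    """Per-bidder share of auctions opened or closed, and share of self-bids.

    `bids` is a list of (bidder_id, auction, time) tuples (assumed layout).
    Only the order of `time` within an auction is used.
    """
    by_auction = defaultdict(list)
    for bidder, auction, time in bids:
        by_auction[auction].append((time, bidder))

    counts = defaultdict(lambda: defaultdict(int))
    for rows in by_auction.values():
        rows.sort(key=lambda row: row[0])  # stable: ties keep input order
        order = [bidder for _, bidder in rows]
        for bidder in set(order):
            counts[bidder]["auctions"] += 1
        counts[order[0]]["first"] += 1
        counts[order[-1]]["last"] += 1
        previous = None
        for bidder in order:
            counts[bidder]["bids"] += 1
            if bidder == previous:
                counts[bidder]["self_bids"] += 1
            previous = bidder

    return {
        bidder: {
            # Rescaled by the number of auctions the bidder entered.
            "pct_first": c["first"] / c["auctions"],
            "pct_last": c["last"] / c["auctions"],
            # Rescaled by the number of bids the bidder placed.
            "pct_self_bids": c["self_bids"] / c["bids"],
        }
        for bidder, c in counts.items()
    }
```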
On average, humans (coded as 0) are much less likely than bots to start an auction or to finish one (rescaled by the number of auctions they entered).
We can also create features similar to the counts of distinct devices, IPs, etc. A human may have several IPs when travelling, but will probably use one for several auctions and then another one, rather than changing for every bid. So a feature could be the number or percentage of times the IP changes between consecutive bids.
But we can go even further. To help me create features, every time I ran a model I used the predictions on a hold-out set to understand what was not well predicted. I wanted to understand how and why a human bidder could have a high probability of being predicted as a bot.
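As an illustration, the share of IP changes for one bidder could be computed as follows. The function name ip_change_rate and the (time, ip) tuple layout are assumptions for this sketch; only the order of the times matters, so the obfuscation is not an issue.

```python
def ip_change_rate(bids):
    """Share of consecutive bid pairs in which the IP differs.

    `bids` is a list of (time, ip) tuples for a single bidder (assumed layout).
    """
    ips = [ip for _, ip in sorted(bids, key=lambda bid: bid[0])]
    if len(ips) < 2:
        return 0.0  # no consecutive pair to compare
    changes = sum(1 for prev, cur in zip(ips, ips[1:]) if prev != cur)
    return changes / (len(ips) - 1)
```

A bidder who uses one IP for a while and then switches gets a low value, while a bidder who rotates IPs on every bid gets a value close to 1.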
I did the following:
- give a rank to each auction from 1 to K based on the starting time (or finishing time) for the user.
- get the entire sequence of auctions for the user (ex: 1,2,1,1,1,2,3,3,2,3,4,4)
- compare this sequence with the ordered one via the correlation between the two vectors or by counting the number of inversions needed to sort the sequence via merge sort (ok, ok, I’m slightly proud of this one).
This is interesting because a high correlation (or low merge inversion value) is more likely for humans since they are not able to do pure random bids. This can be verified on the graph below.
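The inversion count can be obtained as a by-product of merge sort. The sketch below is one standard way to write it and is not necessarily the implementation used for the competition; the function name count_inversions is an assumption. An inversion is a pair of positions i < j whose values are in the wrong order, so equal ranks do not count.

```python
def count_inversions(sequence):
    """Number of pairs i < j with sequence[i] > sequence[j], via merge sort."""

    def sort_and_count(items):
        if len(items) <= 1:
            return list(items), 0
        mid = len(items) // 2
        left, inv_left = sort_and_count(items[:mid])
        right, inv_right = sort_and_count(items[mid:])
        merged, inversions = [], inv_left + inv_right
        i = j = 0
        while i < len(left) and j < len(right):
            if left[i] <= right[j]:
                merged.append(left[i])
                i += 1
            else:
                # right[j] is smaller than every remaining element of left.
                merged.append(right[j])
                j += 1
                inversions += len(left) - i
        merged.extend(left[i:])
        merged.extend(right[j:])
        return merged, inversions

    return sort_and_count(list(sequence))[1]


# The example sequence from the text.
print(count_inversions([1, 2, 1, 1, 1, 2, 3, 3, 2, 3, 4, 4]))  # → 5
```

An already ordered sequence gives 0, and the value grows as the bidder jumps back and forth between auctions. Dividing by the number of pairs, n(n-1)/2, would make sequences of different lengths comparable.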
Following the same idea, it is also possible to compute the number of active auctions at any time for a user (the number of auctions in which the user still acts later) and to derive quantiles of this measure.
Finally, we can create features on the time variable itself, for example the number of times a bidder made several bids at the same time. Now, if we assume that inside an auction the difference between two times is proportional to the real difference (that is, the obfuscated time is a linear function of the real time), then we can add more features. As a bot could be timed to make a bid every X seconds, we can check if there is a mode in the distribution of time differences.
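One possible reading of "active auctions", sketched below, is this: at each bid of a user, count the auctions the user has already entered and will bid on again later. This definition, the function name active_auction_counts and the input (the user's sequence of auction identifiers in time order) are assumptions for the example.

```python
from itertools import accumulate


def active_auction_counts(auction_sequence):
    """For each bid position k, count auctions already entered by the user
    that receive another bid from the user after position k."""
    first, last = {}, {}
    for k, auction in enumerate(auction_sequence):
        first.setdefault(auction, k)
        last[auction] = k
    # An auction is active on positions first..last-1: +1 when it opens,
    # -1 at the user's last bid in it. The running sum gives the count.
    delta = [0] * (len(auction_sequence) + 1)
    for auction in first:
        delta[first[auction]] += 1
        delta[last[auction]] -= 1
    return list(accumulate(delta[:-1]))
```

Summary statistics of the returned list (for instance with statistics.quantiles or max) can then be used as features.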
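Both time-based ideas can be sketched with a Counter. The function names and the input (the list of obfuscated times of one bidder, restricted to a single auction for the gap feature) are assumptions, and the gap feature only makes sense under the linearity hypothesis stated above.

```python
from collections import Counter


def simultaneous_bid_count(times):
    """Number of distinct timestamps at which the bidder placed several bids."""
    return sum(1 for count in Counter(times).values() if count > 1)


def dominant_gap(times):
    """Most frequent gap between consecutive bid times, and its share.

    Returns (None, 0.0) when there are fewer than two bids.
    """
    ordered = sorted(times)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    if not gaps:
        return None, 0.0
    gap, frequency = Counter(gaps).most_common(1)[0]
    return gap, frequency / len(gaps)
```

A share close to 1 for the dominant gap suggests bids placed at a fixed interval, which is what a timed bot would produce.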
So in the end, simply having the information of time order enables us to create numerous interesting features.
By then, I had more than enough features and I was confident I had captured most of the information contained in the bids dataset. I tried to capture features on what a bot would be designed to do as well as how it would be implemented. I then used gradient boosted trees to generate the predictions and finished with an area under the ROC curve of 0.91393, just in the top 4%, probably because I overfitted the training set even though I used 10-fold cross-validation.