Modeling quality of suggestions

Ankit Gyawali

4/30/2020


Section 1: Introduction

The goal of this project is to identify relevant factors that can be used to classify “good” & “bad” suggestions posted on a human resources company’s platform. We compare the performance of various models to identify the best algorithm for this prediction task and evaluate whether the model can be deployed to production. Finally, we discuss and recommend other attributes that might be more valuable for future classification tasks.

We organize the report as follows, presenting our findings in the last section:

  • Introduction
  • Data Cleanup and Transformation
  • Data Shape Exploration
  • Data Analysis/Modelling
    • Predictor selection - predictors that can classify suggestion
    • Modelling & Selection based on classification performance/CV
  • Conclusion

Section 2: Data Cleanup and Transformation

First we load our data.

Note: For localhost: https://drive.google.com/open?id=1k_0HMD99lPkqOOUfQcFSq5c6lUVmd6eL
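
A minimal sketch of the load step, assuming the file from the link above has been saved locally as suggestions.csv (a placeholder name):

# Sketch: load the exported CSV; "suggestions.csv" is a placeholder filename.
hrdata <- read.csv("suggestions.csv", stringsAsFactors = FALSE)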

Renaming columns with snake case:
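
A sketch of the renaming step; the janitor package is an assumption about how the renaming was done.

library(janitor)   # assumption: clean_names() standardizes headers to snake_case

hrdata <- clean_names(hrdata)
names(hrdata)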

##  [1] "recommended"          "suggestion_id"        "responses"            "views"                "upvotes"              "downvotes"            "author_id"            "author_profile_age"  
##  [9] "author_total_posts"   "author_posts_per_day"

From the above we can see some pairs of columns whose interactions we might want to use later on:

  • upvotes & downvotes
  • response & views
  • profile age & total posts

The third interaction is already present in our data set as author_posts_per_day; we capture the first two interactions as follows. NA & infinite values are replaced by 0 or 1, as appropriate, when calculating these interactions.
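
A sketch of this step; the exact ratio definitions (responses/views for responsiveness, upvotes over total votes for popularity) and the 0/1 replacements are assumptions consistent with how the terms are discussed later.

# Assumed definitions, consistent with the later discussion of these terms.
hrdata$responsiveness <- hrdata$responses / hrdata$views
hrdata$popularity <- hrdata$upvotes / (hrdata$upvotes + hrdata$downvotes)

# Replace NA & infinite values caused by zero denominators; the 0-vs-1
# choice per column is illustrative.
hrdata$responsiveness[!is.finite(hrdata$responsiveness)] <- 0
hrdata$popularity[!is.finite(hrdata$popularity)] <- 1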

Converting the “recommended” column of interest to a factor:
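
A one-line sketch of the conversion:

# Treat the response as a categorical label rather than a numeric/string.
hrdata$recommended <- as.factor(hrdata$recommended)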

We divide our data into training and test sets with approximately 90% & 10% of our total samples, respectively. We also clone the training and test sets into separate variables before dropping primary keys that are irrelevant to our analysis, like author_id & suggestion_id.
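
A sketch of the split; the seed value is arbitrary.

set.seed(1)
train.idx <- sample(seq_len(nrow(hrdata)), size = floor(0.9 * nrow(hrdata)))
hrdata.train.full <- hrdata[train.idx, ]    # clones that keep the primary keys
hrdata.test.full  <- hrdata[-train.idx, ]
hrdata.train <- subset(hrdata.train.full, select = -c(author_id, suggestion_id))
hrdata.test  <- subset(hrdata.test.full,  select = -c(author_id, suggestion_id))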

Section 3: Data Shape Exploration

The following outputs a basic statistics table for our data set, grouped by whether or not the post was recommended.
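
A sketch of the chunk, assuming the recommended levels are coded "1"/"0" and knitr's kable() renders the tables:

library(knitr)

vars <- c("responses", "views", "upvotes", "downvotes",
          "author_profile_age", "author_total_posts", "author_posts_per_day")
kable(summary(hrdata[hrdata$recommended == 1, vars]), caption = "Recommended posts")
kable(summary(hrdata[hrdata$recommended == 0, vars]), caption = "Not recommended posts")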

Recommended posts

            responses   views   upvotes   downvotes   author_profile_age   author_total_posts   author_posts_per_day
  Min.           0.00       0       0.0       0.000                  231                   10                   0.00
  1st Qu.       34.00     814      45.0       1.000                 1011                  711                   0.50
  Median        48.00    1468      69.5       4.000                 1214                 1330                   1.20
  Mean          60.51    3076     103.9       7.641                 1173                 2121                   1.88
  3rd Qu.       69.00    2834     115.0      10.000                 1341                 2834                   2.90
  Max.         959.00   63243    2607.0     199.000                 1623                 6920                   7.10

Not recommended posts

            responses   views   upvotes   downvotes   author_profile_age   author_total_posts   author_posts_per_day
  Min.           0.00       0      0.00       0.000                    4                  1.0                 0.0000
  1st Qu.        3.00      92      0.00       0.000                  773                 54.0                 0.1000
  Median         7.00     173      2.00       1.000                 1027                275.0                 0.3000
  Mean          13.63     430     11.65       3.619                 1017                697.8                 0.6857
  3rd Qu.       15.00     365      7.00       5.000                 1291                834.0                 0.9000
  Max.         487.00   30072   1149.00     201.000                 1624               9992.0                20.1000

We see that, consistently, responses, views & upvotes are much higher on recommended posts than on non-recommended posts.

Similarly, the means & medians suggest that author profile age does not matter much; however, posting frequency tends to be higher on recommended posts, which implies that authors whose posts are recommended more often also tend to post more frequently.

We will now further visualize these data points as boxplots, trimming the top 10% & bottom 10% of values so that the boxplot scale is zoomed in.

Boxplots of various variables

boxplot.responses <- ggplot(hrdata, aes(recommended, responses,  fill=recommended), na.rm = T) + geom_boxplot(na.rm = T, outlier.shape = NA) + ggtitle("Responses by suggestion group") + scale_y_continuous(limits = quantile(hrdata$responses, c(0.1, 0.9)))
boxplot.views <- ggplot(hrdata, aes(recommended, views,  fill=recommended), na.rm = T) + geom_boxplot(na.rm = T, outlier.shape = NA) + ggtitle("Views by suggestion group") + scale_y_continuous(limits = quantile(hrdata$views, c(0.1, 0.9)))
boxplot.responsiveness <- ggplot(hrdata, aes(recommended, responsiveness,  fill=recommended), na.rm = T) + geom_boxplot(na.rm = T, outlier.shape = NA) + ggtitle("Responsiveness by suggestion group") + scale_y_continuous(limits = quantile(hrdata$responsiveness, c(0.1, 0.9)))

boxplot.upvotes <- ggplot(hrdata, aes(recommended, upvotes,  fill=recommended), na.rm = T) + geom_boxplot(na.rm = T, outlier.shape = NA) + ggtitle("Upvotes by suggestion group") + scale_y_continuous(limits = quantile(hrdata$upvotes, c(0.1, 0.9)))
boxplot.downvotes <- ggplot(hrdata, aes(recommended, downvotes,  fill=recommended), na.rm = T) + geom_boxplot(na.rm = T, outlier.shape = NA) + ggtitle("Downvotes by suggestion group") + scale_y_continuous(limits = quantile(hrdata$downvotes, c(0.1, 0.9)))
boxplot.popularity <- ggplot(hrdata, aes(recommended, popularity,  fill=recommended), na.rm = T) + geom_boxplot(na.rm = T, outlier.shape = NA) + ggtitle("Popularity by suggestion group") + scale_y_continuous(limits = quantile(hrdata$popularity, c(0.1, 0.9)))

boxplot.author_profile_age <- ggplot(hrdata, aes(recommended, author_profile_age,  fill=recommended), na.rm = T) + geom_boxplot(na.rm = T, outlier.shape = NA) + ggtitle("Author profile age by suggestion group") + scale_y_continuous(limits = quantile(hrdata$author_profile_age, c(0.1, 0.9)))
boxplot.author_total_posts <- ggplot(hrdata, aes(recommended, author_total_posts,  fill=recommended), na.rm = T) + geom_boxplot(na.rm = T, outlier.shape = NA) + ggtitle("Author total posts by suggestion group") + scale_y_continuous(limits = quantile(hrdata$author_total_posts, c(0.1, 0.9)))
boxplot.author_posts_per_day <-  ggplot(hrdata, aes(recommended, author_posts_per_day,  fill=recommended), na.rm = T) + geom_boxplot(na.rm = T, outlier.shape = NA) + ggtitle("Author posts per day by suggestion group") + scale_y_continuous(limits = quantile(hrdata$author_posts_per_day, c(0.1, 0.9)))

grid.arrange(boxplot.responses, boxplot.views, boxplot.responsiveness, boxplot.upvotes, boxplot.downvotes, boxplot.popularity, boxplot.author_profile_age, boxplot.author_total_posts,  boxplot.author_posts_per_day,  ncol=2)

These boxplots confirm the picture from the two kable tables earlier and the trends we pointed out. The means appear closest for profile age, as previously mentioned, suggesting that the age of a profile might not be as big a factor when it comes to the quality of posts.

We also see that the “responsiveness” interaction term we introduced does not show any major differences between the groups. However, the “popularity” boxplot shows that recommended posts almost always have a popularity close to 1, whereas non-recommended posts tended lower, at around 0.6.

Correlation matrix

We can expect the interaction terms to be somewhat correlated with their components. Modelling the real world, we can also expect that upvotes & downvotes could be correlated. We present a correlation matrix of our main predictors below.
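
A sketch of the matrix computation over the numeric predictors:

# Correlation matrix of the numeric columns of the training set.
hrdata.cor <- cor(hrdata.train[, sapply(hrdata.train, is.numeric)])
round(hrdata.cor, 2)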

And the following is a correlation-matrix heatmap using the corrplot library.
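
A sketch of the heatmap; the specific corrplot options shown are assumptions.

library(corrplot)

corrplot(hrdata.cor, method = "color", type = "upper", tl.col = "black")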

The correlation matrix confirms our initial intuitions about how these variables behave in the real world.

  • Responses & views and upvotes & downvotes show higher correlation with each other.
    • We introduced interaction terms for both responses/views (responsiveness) & upvotes/downvotes (popularity) earlier. We will see how they perform in later sections.
  • Other terms were not correlated with each other besides the interaction terms, which will be considered when we pick predictors for our models.

Section 4: Data Analysis/Modelling

This section is further divided into the following subsections to guide our analysis.

  1. Predictor selection - predictors that can classify suggestion
  2. Modelling & Selection based on classification performance/CV

We drive both subsections with the initial queries from our report assignment & the purpose of the modelling, in order to draw conclusions.

Predictor selection - predictors that can classify suggestion

We employ various predictor selection techniques to identify any predictors that can be used for our models.

Appearance of various predictors on regsubset model(s)

The following presents a sorted histogram of the predictors most frequently picked by the regsubsets model using the exhaustive method, across various subset sizes.
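
A sketch of this step; predictor.analysis() is an assumed helper (reused later in the report) that tallies how often each predictor appears across the best model of each subset size, and nvmax.size is assumed to cover all nine candidate predictors.

library(leaps)
library(ggplot2)

# Assumed helper: count how often each predictor appears across the best
# model of each size reported by regsubsets().
predictor.analysis <- function(model) {
  picked <- summary(model)$which[, -1]   # drop the intercept column
  data.frame(name = colnames(picked), count = colSums(picked))
}

nvmax.size <- 9   # assumption: up to all nine candidate predictors

# regsubsets() fits linear models, so the factor response is coerced to
# numeric purely for screening purposes.
hrdata.train.screen <- transform(hrdata.train, recommended = as.numeric(recommended))
hrdata.exhaustive.model <- regsubsets(recommended ~ ., data = hrdata.train.screen,
                                      nbest = 1, nvmax = nvmax.size,
                                      method = "exhaustive", really.big = TRUE)
hrdata.exhaustive.analysis <- predictor.analysis(hrdata.exhaustive.model)
ggplot(hrdata.exhaustive.analysis) +
  geom_bar(aes(x = reorder(name, -count), y = count, fill = name), stat = "identity") +
  ggtitle("Predictor frequency across regsubsets subset sizes") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))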

We can see that some predictors like responses, downvotes, views, author_total_posts & upvotes are consistently picked by the method and outperform other predictors like responsiveness, author_posts_per_day & author_profile_age.

Predictors selected by Lasso

We can get a rough estimate of whether the predictors picked by regsubsets are good by checking whether the same predictors are picked by the lasso. Below is the plot:
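
A sketch of the lasso check, assuming the glmnet package:

library(glmnet)

x <- model.matrix(recommended ~ ., data = hrdata.train)[, -1]
y <- hrdata.train$recommended
hrdata.lasso <- cv.glmnet(x, y, alpha = 1, family = "binomial")
plot(hrdata.lasso)
# Nonzero rows of the coefficient vector are the predictors the lasso kept.
coef(hrdata.lasso, s = "lambda.min")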

The selected predictors generally match those produced by regsubsets.

Predictors selected by tree based method - random forest

Further confirming the predictors with the default number of predictors tried at each split in a random forest model:
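
A sketch, assuming the randomForest package with mtry left at its classification default (sqrt(p)):

library(randomForest)

hrdata.rf.default <- randomForest(recommended ~ ., data = hrdata.train, importance = TRUE)
varImpPlot(hrdata.rf.default, main = "Predictor importance - random forest")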

Importance of “age” of employee profile for good prediction

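A sketch of the density plot discussed below:

library(ggplot2)

ggplot(hrdata, aes(x = author_profile_age, fill = recommended)) +
  geom_density(alpha = 0.4) +
  ggtitle("Author profile age density by suggestion group")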

We see from the density plot that there is a slightly higher density of good recommendations from profiles with longer ages; at earlier profile ages, the differences between the groups were roughly proportional.

The first row of the above plot matrix shows error rates using LDA & QDA for author_profile_age against the other predictors. The error rates for “age” seem comparable to those of the other predictors.
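
The plot matrix referenced above could be produced roughly as follows (a sketch assuming the klaR package; the exact predictor list is an assumption):

library(klaR)

partimat(recommended ~ author_profile_age + responses + views + upvotes,
         data = hrdata.train, method = "lda")
partimat(recommended ~ author_profile_age + responses + views + upvotes,
         data = hrdata.train, method = "qda")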

Below are scatter plots of the same using the loess method.

hrdata.responses.glmplot <-  ggplot(hrdata, aes(x=author_profile_age, y=responses, color=recommended)) + geom_point() + geom_smooth(method=loess, se=FALSE, fullrange=TRUE)
hrdata.views.glmplot <- ggplot(hrdata, aes(x=author_profile_age, y=views, color=recommended)) + geom_point() + geom_smooth(method=loess, se=FALSE, fullrange=TRUE)
hrdata.responsiveness.glmplot <- ggplot(hrdata, aes(x=author_profile_age, y=responsiveness, color=recommended)) + geom_point() + geom_smooth(method=loess, se=FALSE, fullrange=TRUE)

hrdata.upvotes.glmplot <- ggplot(hrdata, aes(x=author_profile_age, y=upvotes, color=recommended)) + geom_point() + geom_smooth(method=loess, se=FALSE, fullrange=TRUE)
hrdata.downvotes.glmplot <- ggplot(hrdata, aes(x=author_profile_age, y=downvotes, color=recommended)) + geom_point() + geom_smooth(method=loess, se=FALSE, fullrange=TRUE)
hrdata.popularity.glmplot <- ggplot(hrdata, aes(x=author_profile_age, y=popularity, color=recommended)) + geom_point() + geom_smooth(method=loess, se=FALSE, fullrange=TRUE)

hrdata.author_total_posts.glmplot <- ggplot(hrdata, aes(x=author_profile_age, y=author_total_posts, color=recommended)) + geom_point() + geom_smooth(method=loess, se=FALSE, fullrange=TRUE)
hrdata.author_posts_per_day.glmplot <- ggplot(hrdata, aes(x=author_profile_age, y=author_posts_per_day, color=recommended)) + geom_point() + geom_smooth(method=loess, se=FALSE, fullrange=TRUE)

grid.arrange(hrdata.responses.glmplot,  hrdata.views.glmplot, hrdata.responsiveness.glmplot, hrdata.upvotes.glmplot, hrdata.downvotes.glmplot, hrdata.popularity.glmplot, hrdata.author_total_posts.glmplot, hrdata.author_posts_per_day.glmplot, ncol=2)

From the various visualizations above, we can infer that author_profile_age is roughly as important a predictor as the others.

Identification/Prediction of qualities of employees who make “good” suggestions

The following subsection specifically aims to characterize the qualities of the subset of employees who make good suggestions more often than others.

We start by creating a version of our dataset grouped by author_id.
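
A sketch of the grouping step, assuming dplyr; the definition of success_rate (the share of an author's posts that were recommended) and the "1" level coding of recommended are assumptions.

library(dplyr)

hrdata.train.group.author.rec <- hrdata.train.full %>%
  group_by(author_id) %>%
  summarise(across(c(responses, views, upvotes, downvotes, responsiveness,
                     popularity, author_profile_age, author_total_posts,
                     author_posts_per_day), mean),
            success_rate = mean(recommended == 1)) %>%   # assumes "1" = recommended
  ungroup() %>%
  select(-author_id)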

The following shows how the total number of posts an author has made compares against how frequently their posts are recommended. The relationship is not linear, suggesting that dimensions other than raw post count carry underlying qualities that drive the frequency of recommendation.
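
A sketch of the plot described above:

library(ggplot2)

ggplot(hrdata.train.group.author.rec, aes(x = author_total_posts, y = success_rate)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "loess", se = FALSE) +
  ggtitle("Author total posts vs. recommendation success rate")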


We perform repeated regsubsets against this grouped data set, using success_rate & the total # of recommendations made by each author as the response variables, to identify which predictors trend upward; we use these to infer conclusions about the qualities of authors who get more recommendations.

# success_rate
hrdata.grouped.exhaustive.model <- regsubsets(success_rate ~ ., data = hrdata.train.group.author.rec, nbest = 1, nvmax = nvmax.size,  method = "exhaustive", really.big = TRUE)
hrdata.grouped.exhaustive.model.analysis <- predictor.analysis(hrdata.grouped.exhaustive.model)
hrdata.exhaustive.model.plot <- ggplot(data = hrdata.grouped.exhaustive.model.analysis) + geom_bar(mapping = aes(x = name, y = count, fill = name), position = "dodge", stat = "identity") + ggtitle("Success_rate - grouped by author_id") + theme(axis.text = element_text(size = 12), axis.text.x = element_text(angle = 45, hjust = 1))

# success_rate without author_profile_age & author_posts_per_day

hrdata.groupedrec.exhaustive.model <- regsubsets(success_rate ~ popularity + views + downvotes + upvotes + responsiveness, data = hrdata.train.group.author.rec, nbest = 1, nvmax = nvmax.size,  method = "exhaustive", really.big = TRUE)
hrdata.groupedrec.exhaustive.model.analysis <- predictor.analysis(hrdata.groupedrec.exhaustive.model)
hrdata.exhaustive.rec.model.plot <- ggplot(data = hrdata.groupedrec.exhaustive.model.analysis) + geom_bar(mapping = aes(x = name, y = count, fill = name), position = "dodge", stat = "identity") + ggtitle("Success_rate - grouped by author, discounting profile age & # posts") + theme(axis.text = element_text(size = 12), axis.text.x = element_text(angle = 45, hjust = 1))


grid.arrange(hrdata.exhaustive.model.plot, hrdata.exhaustive.rec.model.plot, ncol = 2)

When grouped by author, the author's posting frequency definitely comes into play. Beyond that, we see responses & upvotes become factors in an author's success.

Having an older profile and making more posts that attract upvotes & responses make a suggestion more likely to get recommended, which falls in line with expectations.

Modelling & Selection based on classification performance/CV

In this section we build the final models and validate them, comparing ROC curves to find a model worth presenting.

Allowing a large tree to grow:
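
A sketch, assuming the rpart package; a deliberately small cp lets the tree grow large.

library(rpart)

hrdata.tree.full <- rpart(recommended ~ ., data = hrdata.train,
                          method = "class", cp = 0.0001)
plotcp(hrdata.tree.full)   # cross-validated error vs. complexity parameter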

Pruning back the tree using a smaller complexity parameter:
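
A sketch of the pruning step; selecting the cp that minimizes cross-validated error is one common choice.

cp.best <- hrdata.tree.full$cptable[which.min(hrdata.tree.full$cptable[, "xerror"]), "CP"]
hrdata.tree.pruned <- prune(hrdata.tree.full, cp = cp.best)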

Next, using random forests with p, p/2 & sqrt(p) predictors tried at each node split:
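
A sketch of the three forests; p is the number of candidate predictors.

library(randomForest)

p <- ncol(hrdata.train) - 1   # predictors excluding the response
hrdata.rf.p    <- randomForest(recommended ~ ., data = hrdata.train, mtry = p)
hrdata.rf.half <- randomForest(recommended ~ ., data = hrdata.train, mtry = floor(p / 2))
hrdata.rf.sqrt <- randomForest(recommended ~ ., data = hrdata.train, mtry = floor(sqrt(p)))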

Next, drawing OOB error estimates for p, p/2 & sqrt(p):
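
A sketch: the err.rate matrix of each forest carries a running OOB estimate.

plot(hrdata.rf.p$err.rate[, "OOB"], type = "l", col = "red",
     xlab = "Number of trees", ylab = "OOB error rate")
lines(hrdata.rf.half$err.rate[, "OOB"], col = "blue")
lines(hrdata.rf.sqrt$err.rate[, "OOB"], col = "darkgreen")
legend("topright", legend = c("p", "p/2", "sqrt(p)"),
       col = c("red", "blue", "darkgreen"), lty = 1)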

The graph shows that p/2 drives the OOB error rate down considerably faster than the other per-node split sizes.

Creating ROC curve objects to be fed into the final ggroc plot for comparison:
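
A sketch, assuming the pROC package; class probabilities on the test set feed each roc() object, and the p/2 forest is used here as an example.

library(pROC)

tree.probs <- predict(hrdata.tree.pruned, hrdata.test, type = "prob")[, 2]
rf.probs   <- predict(hrdata.rf.half, hrdata.test, type = "prob")[, 2]
roc.tree   <- roc(hrdata.test$recommended, tree.probs)
roc.rf     <- roc(hrdata.test$recommended, rf.probs)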

Next, we create models using LDA, QDA & Naive Bayes.
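
A sketch, assuming MASS for lda()/qda() and e1071 for naiveBayes():

library(MASS)
library(e1071)

hrdata.lda <- lda(recommended ~ ., data = hrdata.train)
hrdata.qda <- qda(recommended ~ ., data = hrdata.train)
hrdata.nb  <- naiveBayes(recommended ~ ., data = hrdata.train)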

We use these models with the predict() function to get predictions for ROC curves on the test data set.

The following creates ROC curves for each of the three models predicted above.
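
A sketch: posterior probabilities for the positive class on the test set, turned into roc() objects.

lda.probs <- predict(hrdata.lda, hrdata.test)$posterior[, 2]
qda.probs <- predict(hrdata.qda, hrdata.test)$posterior[, 2]
nb.probs  <- predict(hrdata.nb, hrdata.test, type = "raw")[, 2]
roc.lda   <- roc(hrdata.test$recommended, lda.probs)
roc.qda   <- roc(hrdata.test$recommended, qda.probs)
roc.nb    <- roc(hrdata.test$recommended, nb.probs)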

Finally, printing a graph that compares the ROCs:
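
A sketch of the comparison plot using pROC's ggroc():

library(pROC)
library(ggplot2)

ggroc(list(tree = roc.tree, rf = roc.rf, lda = roc.lda,
           qda = roc.qda, nb = roc.nb)) +
  ggtitle("ROC comparison across models")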

Section 5: Conclusion

We had some initial questions, the gist of which we used as a framework to guide the data analysis in this report. We now answer those questions based on our analysis:

1a. Determine which combination of attributes of the suggestion (and maybe the person who wrote it) can be used to predict a ‘good’ suggestion.

Answer: We found that the following predictors, in order, were picked by the regsubsets model:

  1. responses
  2. downvotes
  3. views
  4. author_total_posts

These, in order, mattered most when predicting a “good” suggestion compared to the other predictors. We dropped the author_id & suggestion_id columns before applying the predictor selection function(s).

1b. Does number of views matter more or less than votes?

Answer: # of downvotes & responses to a post actually mattered more than views.

2a. How much does the ‘age’ of the employee matter when it comes to their ability to make a good suggestion?

Answer: The partimat plot matrices for both LDA & QDA methods showed that age paired with most of the variables gave a relatively reasonable prediction rate when separating recommended posts from bad ones. The worst-performing variable paired with age when trying to classify recommendations was responses.

2b. Are the employees with longer tenures making better suggestions than those with shorter ones?

Answer: There was indeed a higher density of good recommendations from profiles with larger age values, according to the density plot. For younger profiles, however, the distinction was less pronounced.

3a. Can the same data be used to rank employees based on their demonstrated ability to make predominantly good suggestions?

Answer: The orders of suggested predictors varied slightly when grouping the posts by author. For each author, recommended posts generally tended to come from those with higher “author_profile_age”, “author_posts_per_day” and “responsiveness” (responses/views). We did not penalize a high number of posts with few views or recommendations in this analysis, so the model could be slightly biased towards “author_posts_per_day”: an author who posts a very large number of posts could have some of them gather responses and few downvotes, making them recommended suggestions. With this analysis we could create a scoring function that builds a sorting index from higher combined values of “author_profile_age”, “author_posts_per_day” & “responsiveness”, producing a ranking of authors who are likely to post more recommended posts.
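
Such a scoring function might look like the following hypothetical sketch; the min-max rescaling and equal weighting are illustrative assumptions, not part of the analysis above.

# Hypothetical ranking index: rescale each component to [0, 1] and sum.
rank01 <- function(x) (x - min(x)) / (max(x) - min(x))
author.score <- with(hrdata.train.group.author.rec,
                     rank01(author_profile_age) + rank01(author_posts_per_day) +
                     rank01(responsiveness))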

3b. Can it be used to identify groups of employees whose suggestions could be aggregated to provide more reliable suggestions than made by the best individuals?

Answer: As per the previous question, grouping employees by larger profile age & # of posts did seem to identify the employees who could provide more reliable suggestions. If we remove the time element & # of posts from the analysis, the # of downvotes & the responsiveness interaction term (responses/views) seem to matter most. So even new profiles, as long as their posts have fewer downvotes & higher responsiveness, are more likely to produce recommended posts.

4a. Make recommendations to your IT department about better ways they could collect this data in the future. What other attributes would prove useful and why? Would it be possible to build a completely automated suggestion ranking system?

Answer: There was a high prediction rate for posts; however, we were left with fewer attributes once we dropped unnecessary predictors when analyzing posts by author_id. Information about the author, such as the employee's real-life age, position, or salary, could provide additional features that give insight into which author profiles make highly recommended suggestions, so I would suggest that the IT department collect such data.

We have employed various classification techniques & validation parameters learned in this class to answer some decision-based questions. For the final model selection, the random forest model with all p predictors tried per node split is recommended for deployment based on its prediction rate.