Modeling quality of suggestions

Ankit Gyawali

4/30/2020


Section 1: Introduction

The goal of this project is to identify relevant factors that can be used to classify “good” & “bad” suggestions posted on a human resources company’s platform. We compare the performance of various models to identify the best algorithm for this prediction task and evaluate whether the model can be deployed to production. Finally, we discuss and recommend other attributes that might be more valuable for future classification tasks.

We organize the report as follows, presenting our findings in the last section:

  • Introduction
  • Data Cleanup and Transformation
  • Data Shape Exploration
  • Data Analysis/Modelling
    • Predictor selection - predictors that can classify suggestion
    • Modelling & Selection based on classification performance/CV
  • Conclusion

Section 2: Data Cleanup and Transformation

First we load our data.

Note: For localhost: https://drive.google.com/open?id=1k_0HMD99lPkqOOUfQcFSq5c6lUVmd6eL
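
A minimal sketch of the load step, assuming the file from the link above has been saved locally as suggestions.csv (a placeholder name):

# Sketch: load the exported CSV; "suggestions.csv" is a placeholder filename.
hrdata <- read.csv("suggestions.csv", stringsAsFactors = FALSE)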

Renaming columns with snake case:
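
A sketch of the renaming step; the janitor package is an assumption about how the renaming was done.

library(janitor)   # assumption: clean_names() standardizes headers to snake_case

hrdata <- clean_names(hrdata)
names(hrdata)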

##  [1] "recommended"          "suggestion_id"        "responses"            "views"                "upvotes"              "downvotes"            "author_id"            "author_profile_age"  
##  [9] "author_total_posts"   "author_posts_per_day"

From the above we can see some pairs of columns whose interactions we might want to use later on:

  • upvotes & downvotes
  • response & views
  • profile age & total posts

The third interaction is already present in our data set as author_posts_per_day; we capture the first two interactions as follows. NA & infinite values are replaced by 0 or 1, as appropriate, when calculating these interactions.
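
A sketch of this step; the exact ratio definitions (responses/views for responsiveness, upvotes over total votes for popularity) and the 0/1 replacements are assumptions consistent with how the terms are discussed later.

# Assumed definitions, consistent with the later discussion of these terms.
hrdata$responsiveness <- hrdata$responses / hrdata$views
hrdata$popularity <- hrdata$upvotes / (hrdata$upvotes + hrdata$downvotes)

# Replace NA & infinite values caused by zero denominators; the 0-vs-1
# choice per column is illustrative.
hrdata$responsiveness[!is.finite(hrdata$responsiveness)] <- 0
hrdata$popularity[!is.finite(hrdata$popularity)] <- 1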

Converting the “recommended” column of interest to a factor:
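
A one-line sketch of the conversion:

# Treat the response as a categorical label rather than a numeric/string.
hrdata$recommended <- as.factor(hrdata$recommended)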

We divide our data into training and test sets with approximately 90% & 10% of our total samples, respectively. We also clone the training and test sets into separate variables before dropping primary keys that are irrelevant to our analysis, like author_id & suggestion_id.
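
A sketch of the split; the seed value is arbitrary.

set.seed(1)
train.idx <- sample(seq_len(nrow(hrdata)), size = floor(0.9 * nrow(hrdata)))
hrdata.train.full <- hrdata[train.idx, ]    # clones that keep the primary keys
hrdata.test.full  <- hrdata[-train.idx, ]
hrdata.train <- subset(hrdata.train.full, select = -c(author_id, suggestion_id))
hrdata.test  <- subset(hrdata.test.full,  select = -c(author_id, suggestion_id))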

Section 3: Data Shape Exploration

The following outputs a basic statistics table for our data set, grouped by whether or not the post was recommended.
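
A sketch of the chunk, assuming the recommended levels are coded "1"/"0" and knitr's kable() renders the tables:

library(knitr)

vars <- c("responses", "views", "upvotes", "downvotes",
          "author_profile_age", "author_total_posts", "author_posts_per_day")
kable(summary(hrdata[hrdata$recommended == 1, vars]), caption = "Recommended posts")
kable(summary(hrdata[hrdata$recommended == 0, vars]), caption = "Not recommended posts")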

Recommended posts

            responses   views   upvotes   downvotes   author_profile_age   author_total_posts   author_posts_per_day
  Min.           0.00       0       0.0       0.000                  231                   10                   0.00
  1st Qu.       34.00     814      45.0       1.000                 1011                  711                   0.50
  Median        48.00    1468      69.5       4.000                 1214                 1330                   1.20
  Mean          60.51    3076     103.9       7.641                 1173                 2121                   1.88
  3rd Qu.       69.00    2834     115.0      10.000                 1341                 2834                   2.90
  Max.         959.00   63243    2607.0     199.000                 1623                 6920                   7.10

Not recommended posts

            responses   views   upvotes   downvotes   author_profile_age   author_total_posts   author_posts_per_day
  Min.           0.00       0      0.00       0.000                    4                  1.0                 0.0000
  1st Qu.        3.00      92      0.00       0.000                  773                 54.0                 0.1000
  Median         7.00     173      2.00       1.000                 1027                275.0                 0.3000
  Mean          13.63     430     11.65       3.619                 1017                697.8                 0.6857
  3rd Qu.       15.00     365      7.00       5.000                 1291                834.0                 0.9000
  Max.         487.00   30072   1149.00     201.000                 1624               9992.0                20.1000

We see that, consistently, responses, views & upvotes are much higher on recommended posts than on non-recommended posts.

Similarly, the means & medians suggest that author profile age does not matter much; however, posting frequency tends to be higher on recommended posts, which implies that authors whose posts are recommended more often also tend to post more frequently.

We will now further visualize these data points as boxplots, trimming the top 10% & bottom 10% of values so that the boxplot scale is zoomed in.

Boxplots of various variables

boxplot.responses <- ggplot(hrdata, aes(recommended, responses,  fill=recommended), na.rm = T) + geom_boxplot(na.rm = T, outlier.shape = NA) + ggtitle("Responses by suggestion group") + scale_y_continuous(limits = quantile(hrdata$responses, c(0.1, 0.9)))
boxplot.views <- ggplot(hrdata, aes(recommended, views,  fill=recommended), na.rm = T) + geom_boxplot(na.rm = T, outlier.shape = NA) + ggtitle("Views by suggestion group") + scale_y_continuous(limits = quantile(hrdata$views, c(0.1, 0.9)))
boxplot.responsiveness <- ggplot(hrdata, aes(recommended, responsiveness,  fill=recommended), na.rm = T) + geom_boxplot(na.rm = T, outlier.shape = NA) + ggtitle("Responsiveness by suggestion group") + scale_y_continuous(limits = quantile(hrdata$responsiveness, c(0.1, 0.9)))

boxplot.upvotes <- ggplot(hrdata, aes(recommended, upvotes,  fill=recommended), na.rm = T) + geom_boxplot(na.rm = T, outlier.shape = NA) + ggtitle("Upvotes by suggestion group") + scale_y_continuous(limits = quantile(hrdata$upvotes, c(0.1, 0.9)))
boxplot.downvotes <- ggplot(hrdata, aes(recommended, downvotes,  fill=recommended), na.rm = T) + geom_boxplot(na.rm = T, outlier.shape = NA) + ggtitle("Downvotes by suggestion group") + scale_y_continuous(limits = quantile(hrdata$downvotes, c(0.1, 0.9)))
boxplot.popularity <- ggplot(hrdata, aes(recommended, popularity,  fill=recommended), na.rm = T) + geom_boxplot(na.rm = T, outlier.shape = NA) + ggtitle("Popularity by suggestion group") + scale_y_continuous(limits = quantile(hrdata$popularity, c(0.1, 0.9)))

boxplot.author_profile_age <- ggplot(hrdata, aes(recommended, author_profile_age,  fill=recommended), na.rm = T) + geom_boxplot(na.rm = T, outlier.shape = NA) + ggtitle("Author profile age by suggestion group") + scale_y_continuous(limits = quantile(hrdata$author_profile_age, c(0.1, 0.9)))
boxplot.author_total_posts <- ggplot(hrdata, aes(recommended, author_total_posts,  fill=recommended), na.rm = T) + geom_boxplot(na.rm = T, outlier.shape = NA) + ggtitle("Author total posts by suggestion group") + scale_y_continuous(limits = quantile(hrdata$author_total_posts, c(0.1, 0.9)))
boxplot.author_posts_per_day <-  ggplot(hrdata, aes(recommended, author_posts_per_day,  fill=recommended), na.rm = T) + geom_boxplot(na.rm = T, outlier.shape = NA) + ggtitle("Author posts per day by suggestion group") + scale_y_continuous(limits = quantile(hrdata$author_posts_per_day, c(0.1, 0.9)))

grid.arrange(boxplot.responses, boxplot.views, boxplot.responsiveness, boxplot.upvotes, boxplot.downvotes, boxplot.popularity, boxplot.author_profile_age, boxplot.author_total_posts,  boxplot.author_posts_per_day,  ncol=2)

These boxplots confirm the picture from the two kable tables earlier and the trends we pointed out. The means appear closest for profile age, as previously mentioned, suggesting that the age of a profile might not be as big a factor when it comes to the quality of posts.

We also see that the “responsiveness” interaction term we introduced does not show any major differences between the groups. However, the “popularity” boxplot shows that recommended posts almost always have a popularity close to 1, whereas non-recommended posts tended lower, at around 0.6.

Correlation matrix

We can expect the interaction terms to be somewhat correlated with their components. Modelling the real world, we can also expect that upvotes & downvotes could be correlated. We present a correlation matrix of our main predictors below.
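
A sketch of the matrix computation over the numeric predictors:

# Correlation matrix of the numeric columns of the training set.
hrdata.cor <- cor(hrdata.train[, sapply(hrdata.train, is.numeric)])
round(hrdata.cor, 2)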

And the following is a correlation-matrix heatmap using the corrplot library.
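
A sketch of the heatmap; the specific corrplot options shown are assumptions.

library(corrplot)

corrplot(hrdata.cor, method = "color", type = "upper", tl.col = "black")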

The correlation matrix confirms our initial intuitions about how these variables behave in the real world.

  • Responses & views and upvotes & downvotes show higher correlation with each other.
    • We introduced interaction terms for both responses/views (responsiveness) & upvotes/downvotes (popularity) earlier. We will see how they perform in later sections.
  • Other terms were not correlated with each other besides the interaction terms, which will be considered when we pick predictors for our models.

Section 4: Data Analysis/Modelling

This section is further divided into the following subsections to guide our analysis.

  1. Predictor selection - predictors that can classify suggestion
  2. Modelling & Selection based on classification performance/CV

We drive both subsections with the initial queries from our report assignment & the purpose of the modelling, in order to draw conclusions.

Predictor selection - predictors that can classify suggestion

We employ various predictor selection techniques to identify any predictors that can be used for our models.

Appearance of various predictors on regsubset model(s)

The following presents a sorted histogram of the predictors most frequently picked by the regsubsets model using the exhaustive method, across various subset sizes.
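
A sketch of this step; predictor.analysis() is an assumed helper (reused later in the report) that tallies how often each predictor appears across the best model of each subset size, and nvmax.size is assumed to cover all nine candidate predictors.

library(leaps)
library(ggplot2)

# Assumed helper: count how often each predictor appears across the best
# model of each size reported by regsubsets().
predictor.analysis <- function(model) {
  picked <- summary(model)$which[, -1]   # drop the intercept column
  data.frame(name = colnames(picked), count = colSums(picked))
}

nvmax.size <- 9   # assumption: up to all nine candidate predictors

# regsubsets() fits linear models, so the factor response is coerced to
# numeric purely for screening purposes.
hrdata.train.screen <- transform(hrdata.train, recommended = as.numeric(recommended))
hrdata.exhaustive.model <- regsubsets(recommended ~ ., data = hrdata.train.screen,
                                      nbest = 1, nvmax = nvmax.size,
                                      method = "exhaustive", really.big = TRUE)
hrdata.exhaustive.analysis <- predictor.analysis(hrdata.exhaustive.model)
ggplot(hrdata.exhaustive.analysis) +
  geom_bar(aes(x = reorder(name, -count), y = count, fill = name), stat = "identity") +
  ggtitle("Predictor frequency across regsubsets subset sizes") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))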

We can see that some predictors like responses, downvotes, views, author_total_posts & upvotes are consistently picked by the method and outperform other predictors like responsiveness, author_posts_per_day & author_profile_age.

Predictors selected by Lasso

We can get a rough estimate of whether the predictors picked by regsubsets are good by checking whether the same predictors are picked by the lasso. Below is the plot:
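
A sketch of the lasso check, assuming the glmnet package:

library(glmnet)

x <- model.matrix(recommended ~ ., data = hrdata.train)[, -1]
y <- hrdata.train$recommended
hrdata.lasso <- cv.glmnet(x, y, alpha = 1, family = "binomial")
plot(hrdata.lasso)
# Nonzero rows of the coefficient vector are the predictors the lasso kept.
coef(hrdata.lasso, s = "lambda.min")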

The selected predictors generally match those produced by regsubsets.

Predictors selected by tree based method - random forest

Further confirming the predictors with the default number of predictors tried at each split in a random forest model:
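
A sketch, assuming the randomForest package with mtry left at its classification default (sqrt(p)):

library(randomForest)

hrdata.rf.default <- randomForest(recommended ~ ., data = hrdata.train, importance = TRUE)
varImpPlot(hrdata.rf.default, main = "Predictor importance - random forest")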

Importance of “age” of employee profile for good prediction

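A sketch of the density plot discussed below:

library(ggplot2)

ggplot(hrdata, aes(x = author_profile_age, fill = recommended)) +
  geom_density(alpha = 0.4) +
  ggtitle("Author profile age density by suggestion group")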

We see from the density plot that there is a slightly higher density of good recommendations from profiles with longer ages; at earlier profile ages, the differences between the groups were roughly proportional.

The first row of the above plot matrix shows error rates using LDA & QDA for author_profile_age against the other predictors. The error rates for “age” seem comparable to those of the other predictors.
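
The plot matrix referenced above could be produced roughly as follows (a sketch assuming the klaR package; the exact predictor list is an assumption):

library(klaR)

partimat(recommended ~ author_profile_age + responses + views + upvotes,
         data = hrdata.train, method = "lda")
partimat(recommended ~ author_profile_age + responses + views + upvotes,
         data = hrdata.train, method = "qda")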

Below are scatter plots of the same using the loess method.

hrdata.responses.glmplot <-  ggplot(hrdata, aes(x=author_profile_age, y=responses, color=recommended)) + geom_point() + geom_smooth(method=loess, se=FALSE, fullrange=TRUE)
hrdata.views.glmplot <- ggplot(hrdata, aes(x=author_profile_age, y=views, color=recommended)) + geom_point() + geom_smooth(method=loess, se=FALSE, fullrange=TRUE)
hrdata.responsiveness.glmplot <- ggplot(hrdata, aes(x=author_profile_age, y=responsiveness, color=recommended)) + geom_point() + geom_smooth(method=loess, se=FALSE, fullrange=TRUE)

hrdata.upvotes.glmplot <- ggplot(hrdata, aes(x=author_profile_age, y=upvotes, color=recommended)) + geom_point() + geom_smooth(method=loess, se=FALSE, fullrange=TRUE)
hrdata.downvotes.glmplot <- ggplot(hrdata, aes(x=author_profile_age, y=downvotes, color=recommended)) + geom_point() + geom_smooth(method=loess, se=FALSE, fullrange=TRUE)
hrdata.popularity.glmplot <- ggplot(hrdata, aes(x=author_profile_age, y=popularity, color=recommended)) + geom_point() + geom_smooth(method=loess, se=FALSE, fullrange=TRUE)

hrdata.author_total_posts.glmplot <- ggplot(hrdata, aes(x=author_profile_age, y=author_total_posts, color=recommended)) + geom_point() + geom_smooth(method=loess, se=FALSE, fullrange=TRUE)
hrdata.author_posts_per_day.glmplot <- ggplot(hrdata, aes(x=author_profile_age, y=author_posts_per_day, color=recommended)) + geom_point() + geom_smooth(method=loess, se=FALSE, fullrange=TRUE)

grid.arrange(hrdata.responses.glmplot,  hrdata.views.glmplot, hrdata.responsiveness.glmplot, hrdata.upvotes.glmplot, hrdata.downvotes.glmplot, hrdata.popularity.glmplot, hrdata.author_total_posts.glmplot, hrdata.author_posts_per_day.glmplot, ncol=2)

From the various visualizations above, we can infer that author_profile_age is roughly as important a predictor as the others.

Identification/Prediction of qualities of employees who make “good” suggestions

The following subsection specifically aims to characterize the qualities of the subset of employees who make good suggestions more often than others.

We start by creating a version of our dataset grouped by author_id.
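
A sketch of the grouping step, assuming dplyr; the definition of success_rate (the share of an author's posts that were recommended) and the "1" level coding of recommended are assumptions.

library(dplyr)

hrdata.train.group.author.rec <- hrdata.train.full %>%
  group_by(author_id) %>%
  summarise(across(c(responses, views, upvotes, downvotes, responsiveness,
                     popularity, author_profile_age, author_total_posts,
                     author_posts_per_day), mean),
            success_rate = mean(recommended == 1)) %>%   # assumes "1" = recommended
  ungroup() %>%
  select(-author_id)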

The following shows how the total number of posts an author has made compares against how frequently their posts are recommended. The relationship is not linear, suggesting that dimensions other than raw post count carry underlying qualities that drive the frequency of recommendation.
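
A sketch of the plot described above:

library(ggplot2)

ggplot(hrdata.train.group.author.rec, aes(x = author_total_posts, y = success_rate)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "loess", se = FALSE) +
  ggtitle("Author total posts vs. recommendation success rate")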


We perform repeated regsubsets against this grouped data set, using success_rate & the total # of recommendations made by each author as the response variables, to identify which predictors trend upward; we use these to infer conclusions about the qualities of authors who get more recommendations.

# success_rate
hrdata.grouped.exhaustive.model <- regsubsets(success_rate ~ ., data = hrdata.train.group.author.rec, nbest = 1, nvmax = nvmax.size,  method = "exhaustive", really.big = TRUE)
hrdata.grouped.exhaustive.model.analysis <- predictor.analysis(hrdata.grouped.exhaustive.model)
hrdata.exhaustive.model.plot <- ggplot(data = hrdata.grouped.exhaustive.model.analysis) + geom_bar(mapping = aes(x = name, y = count, fill = name), position = "dodge", stat = "identity") + ggtitle("Success_rate - grouped by author_id") + theme(axis.text = element_text(size = 12), axis.text.x = element_text(angle = 45, hjust = 1))

# success_rate without author_profile_age & author_posts_per_day

hrdata.groupedrec.exhaustive.model <- regsubsets(success_rate ~ popularity + views + downvotes + upvotes + responsiveness, data = hrdata.train.group.author.rec, nbest = 1, nvmax = nvmax.size,  method = "exhaustive", really.big = TRUE)
hrdata.groupedrec.exhaustive.model.analysis <- predictor.analysis(hrdata.groupedrec.exhaustive.model)
hrdata.exhaustive.rec.model.plot <- ggplot(data = hrdata.groupedrec.exhaustive.model.analysis) + geom_bar(mapping = aes(x = name, y = count, fill = name), position = "dodge", stat = "identity") + ggtitle("Success_rate - grouped by author, discounting profile age & # posts") + theme(axis.text = element_text(size = 12), axis.text.x = element_text(angle = 45, hjust = 1))


grid.arrange(hrdata.exhaustive.model.plot, hrdata.exhaustive.rec.model.plot, ncol = 2)

When grouped by author, the author's posting frequency definitely comes into play. Beyond that, we see responses & upvotes become factors in an author's success.

Having an older profile and making more posts that attract upvotes & responses make a suggestion more likely to get recommended, which falls in line with expectations.

Modelling & Selection based on classification performance/CV

In this section we build the final models and validate them, comparing ROC curves to find a model worth presenting.

Allowing a large tree to grow:
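
A sketch, assuming the rpart package; a deliberately small cp lets the tree grow large.

library(rpart)

hrdata.tree.full <- rpart(recommended ~ ., data = hrdata.train,
                          method = "class", cp = 0.0001)
plotcp(hrdata.tree.full)   # cross-validated error vs. complexity parameter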

Pruning back the tree using a smaller complexity parameter:
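
A sketch of the pruning step; selecting the cp that minimizes cross-validated error is one common choice.

cp.best <- hrdata.tree.full$cptable[which.min(hrdata.tree.full$cptable[, "xerror"]), "CP"]
hrdata.tree.pruned <- prune(hrdata.tree.full, cp = cp.best)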

Next, using random forests with p, p/2 & sqrt(p) predictors tried at each node split:
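
A sketch of the three forests; p is the number of candidate predictors.

library(randomForest)

p <- ncol(hrdata.train) - 1   # predictors excluding the response
hrdata.rf.p    <- randomForest(recommended ~ ., data = hrdata.train, mtry = p)
hrdata.rf.half <- randomForest(recommended ~ ., data = hrdata.train, mtry = floor(p / 2))
hrdata.rf.sqrt <- randomForest(recommended ~ ., data = hrdata.train, mtry = floor(sqrt(p)))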

Next, drawing OOB error estimates for p, p/2 & sqrt(p):
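
A sketch: the err.rate matrix of each forest carries a running OOB estimate.

plot(hrdata.rf.p$err.rate[, "OOB"], type = "l", col = "red",
     xlab = "Number of trees", ylab = "OOB error rate")
lines(hrdata.rf.half$err.rate[, "OOB"], col = "blue")
lines(hrdata.rf.sqrt$err.rate[, "OOB"], col = "darkgreen")
legend("topright", legend = c("p", "p/2", "sqrt(p)"),
       col = c("red", "blue", "darkgreen"), lty = 1)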

The graph shows that p/2 drives the OOB error rate down considerably faster than the other per-node split sizes.

Creating ROC curve objects to be fed into the final ggroc plot for comparison:
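
A sketch, assuming the pROC package; class probabilities on the test set feed each roc() object, and the p/2 forest is used here as an example.

library(pROC)

tree.probs <- predict(hrdata.tree.pruned, hrdata.test, type = "prob")[, 2]
rf.probs   <- predict(hrdata.rf.half, hrdata.test, type = "prob")[, 2]
roc.tree   <- roc(hrdata.test$recommended, tree.probs)
roc.rf     <- roc(hrdata.test$recommended, rf.probs)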

Next, we create models using LDA, QDA & Naive Bayes.
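
A sketch, assuming MASS for lda()/qda() and e1071 for naiveBayes():

library(MASS)
library(e1071)

hrdata.lda <- lda(recommended ~ ., data = hrdata.train)
hrdata.qda <- qda(recommended ~ ., data = hrdata.train)
hrdata.nb  <- naiveBayes(recommended ~ ., data = hrdata.train)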

We use these models with the predict() function to get predictions for ROC curves on the test data set.

The following creates ROC curves for each of the three models predicted above.
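
A sketch: posterior probabilities for the positive class on the test set, turned into roc() objects.

lda.probs <- predict(hrdata.lda, hrdata.test)$posterior[, 2]
qda.probs <- predict(hrdata.qda, hrdata.test)$posterior[, 2]
nb.probs  <- predict(hrdata.nb, hrdata.test, type = "raw")[, 2]
roc.lda   <- roc(hrdata.test$recommended, lda.probs)
roc.qda   <- roc(hrdata.test$recommended, qda.probs)
roc.nb    <- roc(hrdata.test$recommended, nb.probs)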

Finally, printing a graph that compares the ROCs:
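
A sketch of the comparison plot using pROC's ggroc():

library(pROC)
library(ggplot2)

ggroc(list(tree = roc.tree, rf = roc.rf, lda = roc.lda,
           qda = roc.qda, nb = roc.nb)) +
  ggtitle("ROC comparison across models")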

Section 5: Conclusion

We had some initial questions, the gist of which we used as a framework to guide the data analysis in this report. We now answer those questions based on our analysis:

1a. Determine which combination of attributes of the suggestion (and maybe the person who wrote it) can be used to predict a ‘good’ suggestion.

Answer: We found that the following predictors, in order, were picked by the regsubsets model:

  1. responses
  2. downvotes
  3. views
  4. author_total_posts

These, in order, mattered most when predicting a “good” suggestion compared to the other predictors. We dropped the author_id & suggestion_id columns before applying the predictor selection function(s).

1b. Does number of views matter more or less than votes?

Answer: # of downvotes & responses to a post actually mattered more than views.

2a. How much does the ‘age’ of the employee matter when it comes to their ability to make a good suggestion?

Answer: The partimat plot matrices for both LDA & QDA methods showed that age paired with most of the variables gave a relatively reasonable prediction rate when separating recommended posts from bad ones. The worst-performing variable paired with age when trying to classify recommendations was responses.

2b. Are the employees with longer tenures making better suggestions than those with shorter ones?

Answer: There was indeed a higher density of good recommendations from profiles with larger age values, according to the density plot. For younger profiles, however, the distinction was less pronounced.

3a. Can the same data be used to rank employees based on their demonstrated ability to make predominantly good suggestions?

Answer: The orders of suggested predictors varied slightly when grouping the posts by author. For each author, recommended posts generally tended to come from those with higher “author_profile_age”, “author_posts_per_day” and “responsiveness” (responses/views). We did not penalize a high number of posts with few views or recommendations in this analysis, so the model could be slightly biased towards “author_posts_per_day”: an author who posts a very large number of posts could have some of them gather responses and few downvotes, making them recommended suggestions. With this analysis we could create a scoring function that builds a sorting index from higher combined values of “author_profile_age”, “author_posts_per_day” & “responsiveness”, producing a ranking of authors who are likely to post more recommended posts.
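
Such a scoring function might look like the following hypothetical sketch; the min-max rescaling and equal weighting are illustrative assumptions, not part of the analysis above.

# Hypothetical ranking index: rescale each component to [0, 1] and sum.
rank01 <- function(x) (x - min(x)) / (max(x) - min(x))
author.score <- with(hrdata.train.group.author.rec,
                     rank01(author_profile_age) + rank01(author_posts_per_day) +
                     rank01(responsiveness))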

3b. Can it be used to identify groups of employees whose suggestions could be aggregated to provide more reliable suggestions than made by the best individuals?

Answer: As per the previous question, grouping employees by larger profile age & # of posts did seem to identify the employees who could provide more reliable suggestions. If we remove the time element & # of posts from the analysis, the # of downvotes & the responsiveness interaction term (responses/views) seem to matter most. So even new profiles, as long as their posts have fewer downvotes & higher responsiveness, are more likely to produce recommended posts.

4a. Make recommendations to your IT department about better ways they could collect this data in the future. What other attributes would prove useful and why? Would it be possible to build a completely automated suggestion ranking system?

Answer: There was a high prediction rate for posts; however, we were left with fewer attributes once we dropped unnecessary predictors when analyzing posts by author_id. Information about the author, such as the employee's real-life age, position, or salary, could provide additional features that give insight into which author profiles make highly recommended suggestions, so I would suggest that the IT department collect such data.

We have employed various classification techniques & validation parameters learned in this class to answer some decision-based questions. For the final model selection, the random forest model with all p predictors tried per node split is recommended for deployment based on its prediction rate.