The goal of this project is to identify relevant factors that can be used to classify "good" and "bad" suggestions submitted at a human-resources company. We compare the performance of several models to identify the best algorithm for this prediction task and evaluate whether a model could be deployed to production. Finally, we discuss and recommend other attributes that might be more valuable for future classification tasks.
We organize the report into the sections below and present our conclusions in the final section.
library(dplyr) # Data Manip
library(ggplot2) # Plotting
library(gridExtra)
library(corrplot)
library(plotmo)
library(pROC)
library(randomForest) # Modelling
library(glmnet)
library(caret)
library(klaR)
library(MASS)
library(leaps)
library(rpart)
library(partykit)
library(prettydoc) # Theming
library(knitr)
First we load our data.
Note: For localhost: https://drive.google.com/open?id=1k_0HMD99lPkqOOUfQcFSq5c6lUVmd6eL
## Mirror #1
hrdata <- read.csv("http://www.andrew.cmu.edu/user/agyawali/suggestions.csv", header=TRUE)
# Mirror #2
# hrdata <- read.csv("https://gist.githubusercontent.com/ankitgyawali/608b857d160b3426225b65883bf13b72/raw/1ea4738d81b4465178b29e1f41b00c666eb8fc30/suggestions.csv", header=TRUE)
# Localhost
# hrdata <- read.csv("suggestions.csv", header=TRUE)
Renaming columns with snake case:
column.names <- c("recommended", "suggestion_id","responses","views","upvotes", "downvotes","author_id", "author_profile_age", "author_total_posts", "author_posts_per_day")
colnames(hrdata) <- column.names
column.names
## [1] "recommended" "suggestion_id" "responses" "views" "upvotes" "downvotes" "author_id" "author_profile_age"
## [9] "author_total_posts" "author_posts_per_day"
From the above we can see several columns that could be combined into interaction terms worth using later on.
The third such interaction is already present in our data set as author_posts_per_day; we capture the first two, popularity (based on upvotes and downvotes) and responsiveness (responses per view), as follows. NA and infinite values arising from division by zero are replaced by 0 or 1 as appropriate when calculating these interactions.
## Popularity
hrdata$popularity <- 1- hrdata$downvotes/hrdata$upvotes
hrdata$popularity[is.na(hrdata$popularity)] <- 0
hrdata$popularity[is.infinite(hrdata$popularity)] <- 1
## Response per view
hrdata$responsiveness <- hrdata$responses/hrdata$views
hrdata$responsiveness[is.na(hrdata$responsiveness)] <- 0
Converting the response column "recommended" to a factor:
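The conversion chunk itself is not echoed above; a minimal sketch of what it presumably looks like:
# Assumed reconstruction of the un-echoed chunk: treat the binary response as a categorical factor
hrdata$recommended <- as.factor(hrdata$recommended)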
We divide our data into training and test sets containing approximately 90% and 10% of the samples, respectively. We also keep a copy of each set before dropping the primary-key columns author_id and suggestion_id, which are not useful as predictors in our analysis.
# set.seed(222) # TODO: Disable this
test <- sample(1:nrow(hrdata), 1600)
hrdata.train <- hrdata[-test,]
hrdata.test <- hrdata[test,]
hrdata.train.all <- hrdata[-test,]
hrdata.test.all <- hrdata[test,]
hrdata.train <- subset(hrdata.train, select = -c(author_id, suggestion_id))
hrdata.test <- subset(hrdata.test, select = -c(author_id, suggestion_id))
The following outputs basic summary statistics for our data set, split by the two kinds of recommended posts.
summary.var.names <- c("responses", "views", "upvotes", "downvotes", "author_profile_age","author_total_posts", "author_posts_per_day")
hrdata.rec.summary <- summary(subset(hrdata, recommended == 1)[,summary.var.names])
hrdata.nrec.summary <- summary(subset(hrdata, recommended == 0)[,summary.var.names])
kable(hrdata.rec.summary, padding= 3 ,format = "html", caption = "**Recommended posts**")
| responses | views | upvotes | downvotes | author_profile_age | author_total_posts | author_posts_per_day |
|---|---|---|---|---|---|---|
| Min. : 0.00 | Min. : 0 | Min. : 0.0 | Min. : 0.000 | Min. : 231 | Min. : 10 | Min. :0.00 |
| 1st Qu.: 34.00 | 1st Qu.: 814 | 1st Qu.: 45.0 | 1st Qu.: 1.000 | 1st Qu.:1011 | 1st Qu.: 711 | 1st Qu.:0.50 |
| Median : 48.00 | Median : 1468 | Median : 69.5 | Median : 4.000 | Median :1214 | Median :1330 | Median :1.20 |
| Mean : 60.51 | Mean : 3076 | Mean : 103.9 | Mean : 7.641 | Mean :1173 | Mean :2121 | Mean :1.88 |
| 3rd Qu.: 69.00 | 3rd Qu.: 2834 | 3rd Qu.: 115.0 | 3rd Qu.: 10.000 | 3rd Qu.:1341 | 3rd Qu.:2834 | 3rd Qu.:2.90 |
| Max. :959.00 | Max. :63243 | Max. :2607.0 | Max. :199.000 | Max. :1623 | Max. :6920 | Max. :7.10 |
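The analogous summary for non-recommended posts follows; the kable call producing it is not echoed above, but presumably mirrors the one for the recommended group:
kable(hrdata.nrec.summary, padding = 3, format = "html", caption = "**Non-recommended posts**")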
| responses | views | upvotes | downvotes | author_profile_age | author_total_posts | author_posts_per_day |
|---|---|---|---|---|---|---|
| Min. : 0.00 | Min. : 0 | Min. : 0.00 | Min. : 0.000 | Min. : 4 | Min. : 1.0 | Min. : 0.0000 |
| 1st Qu.: 3.00 | 1st Qu.: 92 | 1st Qu.: 0.00 | 1st Qu.: 0.000 | 1st Qu.: 773 | 1st Qu.: 54.0 | 1st Qu.: 0.1000 |
| Median : 7.00 | Median : 173 | Median : 2.00 | Median : 1.000 | Median :1027 | Median : 275.0 | Median : 0.3000 |
| Mean : 13.63 | Mean : 430 | Mean : 11.65 | Mean : 3.619 | Mean :1017 | Mean : 697.8 | Mean : 0.6857 |
| 3rd Qu.: 15.00 | 3rd Qu.: 365 | 3rd Qu.: 7.00 | 3rd Qu.: 5.000 | 3rd Qu.:1291 | 3rd Qu.: 834.0 | 3rd Qu.: 0.9000 |
| Max. :487.00 | Max. :30072 | Max. :1149.00 | Max. :201.000 | Max. :1624 | Max. :9992.0 | Max. :20.1000 |
We see that responses, views, and upvotes are consistently much higher on recommended posts than on non-recommended posts.
Similarly, the mean and median suggest that author profile age does not matter much; however, posting frequency tends to be higher for recommended posts, which implies that authors whose posts are recommended also tend to post more frequently.
We now visualize these variables as boxplots, trimming the top 10% and bottom 10% of values so that the boxplot scale is zoomed in.
boxplot.responses <- ggplot(hrdata, aes(recommended, responses, fill=recommended), na.rm = T) + geom_boxplot(na.rm = T, outlier.shape = NA) + ggtitle("Responses by suggestion group") + scale_y_continuous(limits = quantile(hrdata$responses, c(0.1, 0.9)))
boxplot.views <- ggplot(hrdata, aes(recommended, views, fill=recommended), na.rm = T) + geom_boxplot(na.rm = T, outlier.shape = NA) + ggtitle("Views by suggestion group") + scale_y_continuous(limits = quantile(hrdata$views, c(0.1, 0.9)))
boxplot.responsiveness <- ggplot(hrdata, aes(recommended, responsiveness, fill=recommended), na.rm = T) + geom_boxplot(na.rm = T, outlier.shape = NA) + ggtitle("Responsiveness by suggestion group") + scale_y_continuous(limits = quantile(hrdata$responsiveness, c(0.1, 0.9)))
boxplot.upvotes <- ggplot(hrdata, aes(recommended, upvotes, fill=recommended), na.rm = T) + geom_boxplot(na.rm = T, outlier.shape = NA) + ggtitle("Upvotes by suggestion group") + scale_y_continuous(limits = quantile(hrdata$upvotes, c(0.1, 0.9)))
boxplot.downvotes <- ggplot(hrdata, aes(recommended, downvotes, fill=recommended), na.rm = T) + geom_boxplot(na.rm = T, outlier.shape = NA) + ggtitle("Downvotes by suggestion group") + scale_y_continuous(limits = quantile(hrdata$downvotes, c(0.1, 0.9)))
boxplot.popularity <- ggplot(hrdata, aes(recommended, popularity, fill=recommended), na.rm = T) + geom_boxplot(na.rm = T, outlier.shape = NA) + ggtitle("Popularity by suggestion group") + scale_y_continuous(limits = quantile(hrdata$popularity, c(0.1, 0.9)))
boxplot.author_profile_age <- ggplot(hrdata, aes(recommended, author_profile_age, fill=recommended), na.rm = T) + geom_boxplot(na.rm = T, outlier.shape = NA) + ggtitle("Author profile age by suggestion group") + scale_y_continuous(limits = quantile(hrdata$author_profile_age, c(0.1, 0.9)))
boxplot.author_total_posts <- ggplot(hrdata, aes(recommended, author_total_posts, fill=recommended), na.rm = T) + geom_boxplot(na.rm = T, outlier.shape = NA) + ggtitle("Author total posts by suggestion group") + scale_y_continuous(limits = quantile(hrdata$author_total_posts, c(0.1, 0.9)))
boxplot.author_posts_per_day <- ggplot(hrdata, aes(recommended, author_posts_per_day, fill=recommended), na.rm = T) + geom_boxplot(na.rm = T, outlier.shape = NA) + ggtitle("Author posts per day by suggestion group") + scale_y_continuous(limits = quantile(hrdata$author_posts_per_day, c(0.1, 0.9)))
grid.arrange(boxplot.responses, boxplot.views, boxplot.responsiveness, boxplot.upvotes, boxplot.downvotes, boxplot.popularity, boxplot.author_profile_age, boxplot.author_total_posts, boxplot.author_posts_per_day, ncol=2)
These boxplots confirm the picture from the two summary tables above and the trends we pointed out. The group means are closest for profile age, suggesting, as previously mentioned, that the age of a profile may not be a large factor in the quality of posts.
We also see that the "responsiveness" ratio we introduced does not show any major difference between groups. The "popularity" ratio, however, shows that recommended posts almost always have a popularity near 1, whereas non-recommended posts tend lower, at around 0.6.
We can expect the interaction terms to be somewhat correlated with the variables they were derived from. Modelling the real world, we might also expect upvotes and downvotes to be correlated. We present a correlation matrix of our main predictors below.
# Panel function for pairs(): prints the absolute correlation in the lower triangle, scaled by its magnitude
panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...)
{
usr <- par("usr"); on.exit(par(usr))
par(usr = c(0, 1, 0, 1))
r <- abs(cor(x, y))
txt <- format(c(r, 0.123456789), digits = digits)[1]
txt <- paste0(prefix, txt)
if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
text(0.5, 0.5, txt, cex = pmax(1, cex.cor * r))
}
pair.var.names <- c("responses", "views", "upvotes", "downvotes", "author_profile_age", "author_posts_per_day")
pairs(hrdata[,pair.var.names], lower.panel = panel.cor, main = "Correlation matrix for hrdata")
The following is a correlation heatmap produced with the corrplot library.
correlations <- cor(subset(hrdata.train, select = -c(recommended)), use = "pairwise.complete.obs")
corrplot(correlations, diag = FALSE, type = "upper", tl.cex = 1)
The correlation matrix confirms the expectations stated above: the engagement measures and the interaction terms derived from them are indeed correlated, much as we would expect from real-world data.
This section is further divided into the following subsections to guide our analysis.
Both subsections are driven by the initial questions from the report assignment and the purpose of the modelling, from which we draw our conclusions.
We employ various predictor selection techniques to identify any predictors that can be used for our models.
The following presents a sorted bar chart of the predictors most frequently picked by regsubsets using the exhaustive method, across the various subset sizes considered.
# Count, for each predictor, how many of the regsubsets subset sizes include it
predictor.analysis <- function(model) {
hrdata.predictor.regsubsets <- sapply(as.data.frame(summary(model)$which), sum)
hrdata.predictor.regsubsets <- hrdata.predictor.regsubsets[2:length(hrdata.predictor.regsubsets)] # drop the intercept column
hrdata.predictor.regsubsets.df <- data.frame("name" = names(hrdata.predictor.regsubsets), "count" = hrdata.predictor.regsubsets)
hrdata.predictor.regsubsets.df <- transform(hrdata.predictor.regsubsets.df, name = reorder(name, count))
rownames(hrdata.predictor.regsubsets.df) <- NULL
return(hrdata.predictor.regsubsets.df)
}
nvmax.size <- length(colnames(hrdata.train)) - 1
# Exhaustive
hrdata.exhaustive.model <- regsubsets(recommended ~ ., data = hrdata.train, nbest = 1, nvmax = nvmax.size, method = "exhaustive", really.big = TRUE)
hrdata.exhaustive.model.analysis <- predictor.analysis(hrdata.exhaustive.model)
hrdata.exhaustive.model.plot <- ggplot(data = hrdata.exhaustive.model.analysis) + geom_bar(mapping = aes(x = name, y = count, fill = name), position = "dodge", stat = "identity") + ggtitle("sorted predictor frequency on exhaustive model") + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + theme(axis.text=element_text(size=12), axis.text.x = element_text(angle = 45, hjust = 1))
hrdata.exhaustive.model.plot
We can see that predictors such as responses, downvotes, views, author_total_posts, and upvotes are consistently picked by this method and outperform predictors such as responsiveness, author_posts_per_day, and author_profile_age.
We can get a rough sense of whether the predictors picked by regsubsets are good choices by checking whether the lasso picks the same predictors. Below is the plot:
# Extract predictor matrix and response variable (for lasso)
hrdata.x <- as.matrix(hrdata.train[, -which(names(hrdata.train) == "recommended")])
hrdata.y <- hrdata.train$recommended
hrdata.lasso <- glmnet(data.matrix(hrdata.x), as.factor(hrdata.y), alpha=1, family="binomial")
plot_glmnet(hrdata.lasso)
The predictors selected by the lasso generally match those produced by regsubsets.
We further confirm the predictor rankings using a random forest with the default number of predictors tried at each split:
# Quick evaluation of random forest and most important variables
hrdata.rf <- randomForest(recommended ~ ., data = hrdata.train)
varImpPlot(hrdata.rf, main="Important predictors on hrdata Randomforest model", sort = T)
densityplot.author_profile_age <- ggplot(hrdata, aes(x=author_profile_age, color=recommended, fill=recommended)) + geom_density(alpha = 0.5)
densityplot.author_profile_age
We see from the density plot that there is a slightly higher density of recommended suggestions among profiles with longer ages; at earlier profile ages the two groups are roughly proportional.
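The partimat plot matrix referenced below is not echoed above. A minimal sketch of the kind of call used (klaR's partimat; the particular subset of predictors shown here is our assumption, chosen to keep the panels readable):
# Assumed reconstruction: partition plots pairing author_profile_age with other predictors
partimat(recommended ~ author_profile_age + responses + views + upvotes + downvotes, data = hrdata.train, method = "lda")
partimat(recommended ~ author_profile_age + responses + views + upvotes + downvotes, data = hrdata.train, method = "qda")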
The first row of the partimat plot matrix shows error rates for LDA and QDA when author_profile_age is paired against each of the other predictors. The error rates involving age are comparable to those of the other predictors.
Below are scatterplots of the same pairs with loess smoothing.
hrdata.responses.glmplot <- ggplot(hrdata, aes(x=author_profile_age, y=responses, color=recommended)) + geom_point() + geom_smooth(method=loess, se=FALSE, fullrange=TRUE)
hrdata.views.glmplot <- ggplot(hrdata, aes(x=author_profile_age, y=views, color=recommended)) + geom_point() + geom_smooth(method=loess, se=FALSE, fullrange=TRUE)
hrdata.responsiveness.glmplot <- ggplot(hrdata, aes(x=author_profile_age, y=responsiveness, color=recommended)) + geom_point() + geom_smooth(method=loess, se=FALSE, fullrange=TRUE)
hrdata.upvotes.glmplot <- ggplot(hrdata, aes(x=author_profile_age, y=upvotes, color=recommended)) + geom_point() + geom_smooth(method=loess, se=FALSE, fullrange=TRUE)
hrdata.downvotes.glmplot <- ggplot(hrdata, aes(x=author_profile_age, y=downvotes, color=recommended)) + geom_point() + geom_smooth(method=loess, se=FALSE, fullrange=TRUE)
hrdata.popularity.glmplot <- ggplot(hrdata, aes(x=author_profile_age, y=popularity, color=recommended)) + geom_point() + geom_smooth(method=loess, se=FALSE, fullrange=TRUE)
hrdata.author_total_posts.glmplot <- ggplot(hrdata, aes(x=author_profile_age, y=author_total_posts, color=recommended)) + geom_point() + geom_smooth(method=loess, se=FALSE, fullrange=TRUE)
hrdata.author_posts_per_day.glmplot <- ggplot(hrdata, aes(x=author_profile_age, y=author_posts_per_day, color=recommended)) + geom_point() + geom_smooth(method=loess, se=FALSE, fullrange=TRUE)
grid.arrange(hrdata.responses.glmplot, hrdata.views.glmplot, hrdata.responsiveness.glmplot, hrdata.upvotes.glmplot, hrdata.downvotes.glmplot, hrdata.popularity.glmplot, hrdata.author_total_posts.glmplot, hrdata.author_posts_per_day.glmplot, ncol=2)
From the various visualizations above we can infer that author_profile_age is roughly as important as the other predictors.
The following subsection asks whether a subset of employees consistently makes better suggestions than the rest.
We start by aggregating the training data by author_id.
### Group Training set
hrdata.train.group.author.rec <- hrdata.train.all %>% group_by(author_id) %>% summarise(recommended = sum(as.numeric(as.character(recommended))), views = mean(views), upvotes = mean(upvotes), downvotes = mean(downvotes), responses = mean(responses), author_profile_age = max(author_profile_age), author_total_posts = max(author_total_posts), author_posts_per_day = max(author_posts_per_day), popularity = mean(popularity), responsiveness = mean(responsiveness)) # as.character() before as.numeric() so the factor levels "0"/"1" are counted correctly
hrdata.train.group.author.rec$success_rate = hrdata.train.group.author.rec$recommended/hrdata.train.group.author.rec$author_total_posts
The following shows how the total number of posts an author has made compares with how frequently their posts are recommended. The relationship is not linear, suggesting there are underlying qualities beyond sheer post count that drive how often an author's suggestions are recommended.
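The plot itself is not echoed above; a minimal sketch of what it could look like on the grouped data (the exact aesthetics are our assumption):
# Assumed sketch: number of recommended posts vs. total posts per author
ggplot(hrdata.train.group.author.rec, aes(x = author_total_posts, y = recommended)) + geom_point(alpha = 0.4) + geom_smooth(method = "loess", se = FALSE) + ggtitle("Recommended posts vs. total posts per author")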
We repeat regsubsets on this grouped data set, using success_rate (and the total number of recommendations an author received) as the response variable, to identify which predictors trend upward; from these we infer conclusions about the qualities of authors who receive more recommendations.
# success_rate
hrdata.grouped.exhaustive.model <- regsubsets(success_rate ~ ., data = hrdata.train.group.author.rec, nbest = 1, nvmax = nvmax.size, method = "exhaustive", really.big = TRUE)
hrdata.grouped.exhaustive.model.analysis <- predictor.analysis(hrdata.grouped.exhaustive.model)
hrdata.exhaustive.model.plot <- ggplot(data = hrdata.grouped.exhaustive.model.analysis) + geom_bar(mapping = aes(x = name, y = count, fill = name), position = "dodge", stat = "identity") + ggtitle("Success_rate - Grouped by author_id") + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + theme(axis.text=element_text(size=12), axis.text.x = element_text(angle = 45, hjust = 1))
# Success_rate without author_profile_age & author_posts_per_day
hrdata.groupedrec.exhaustive.model <- regsubsets(success_rate ~ popularity + views + downvotes + upvotes + responsiveness, data = hrdata.train.group.author.rec, nbest = 1, nvmax = nvmax.size, method = "exhaustive", really.big = TRUE)
hrdata.groupedrec.exhaustive.model.analysis <- predictor.analysis(hrdata.groupedrec.exhaustive.model)
hrdata.exhaustive.rec.model.plot <- ggplot(data = hrdata.groupedrec.exhaustive.model.analysis) + geom_bar(mapping = aes(x = name, y = count, fill = name), position = "dodge", stat = "identity") + ggtitle("Success_rate - grouped by author discounting profile #posts or age") + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + theme(axis.text=element_text(size=12), axis.text.x = element_text(angle = 45, hjust = 1))
grid.arrange(hrdata.exhaustive.model.plot, hrdata.exhaustive.rec.model.plot, ncol = 2)
When grouped by author, the author's posting frequency clearly comes into play. Beyond that, we see that responses and upvotes become factors in an author's success.
Having an older profile and making more posts that attract upvotes and responses makes a suggestion more likely to be recommended, which falls in line with expectations.
In this section we build candidate final models and compare them on the test set using ROC curves, to arrive at something presentable.
Allowing a large tree to grow:
### Large Tree
hrdata.test.responses <- hrdata.test.all$recommended
hrdata.large.tree <- rpart(formula = recommended ~ ., data = hrdata.train, method = "class", control = rpart.control(cp = 0)) # cp = 0 so the tree grows large before pruning
plotcp(hrdata.large.tree)
# cp chosen from the plotcp output above (approximately the 1-SE rule)
minimize.cp.err <- 0.001
hrdata.large.tree.predict <- predict(hrdata.large.tree, newdata = hrdata.test.all)[,2]
hrdata.large.tree.roc <- roc(response = hrdata.test.all$recommended, predictor = hrdata.large.tree.predict , direction = "<" , levels = c(0, 1))
# plot(hrdata.large.tree.roc)
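As a side note, the pruning cp could also be derived from the cptable programmatically rather than read off the plotcp output; a small sketch under the 1-SE rule (not part of the original analysis):
# Sketch: largest cp whose cross-validated error is within one SE of the minimum
cp.table <- hrdata.large.tree$cptable
min.idx <- which.min(cp.table[, "xerror"])
err.threshold <- cp.table[min.idx, "xerror"] + cp.table[min.idx, "xstd"]
# minimize.cp.err <- max(cp.table[cp.table[, "xerror"] <= err.threshold, "CP"])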
Pruning the tree back using the complexity parameter chosen above:
### Pruned Tree
hrdata.pruned.tree <- prune(tree = hrdata.large.tree, cp = minimize.cp.err)
hrdata.pruned.tree.predict <- predict(hrdata.pruned.tree, newdata = hrdata.test.all)[,2]
hrdata.pruned.tree.roc <- roc(response = hrdata.test.responses, predictor = hrdata.pruned.tree.predict, direction = "<" , levels = c(0, 1))
# plot(hrdata.pruned.tree.roc)
# Visualize tree to see which predictors are used
hrdata.pruned.tree.party <- as.party(hrdata.pruned.tree)
plot(hrdata.pruned.tree.party, gp = gpar(fontsize = 7))
Next, we fit random forests trying p, p/2, and sqrt(p) predictors at each node split:
### Creating random forests models
hrdata.rf1 <- randomForest(recommended ~ ., data = hrdata.train, mtry = ncol(hrdata.train) - 1, importance = TRUE)            # mtry = p
hrdata.rf2 <- randomForest(recommended ~ ., data = hrdata.train, mtry = floor((ncol(hrdata.train) - 1)/2), importance = TRUE)  # mtry = p/2
hrdata.rf3 <- randomForest(recommended ~ ., data = hrdata.train, mtry = floor((ncol(hrdata.train) - 1)^0.5), importance = TRUE) # mtry = sqrt(p)
Next, we plot the out-of-bag (OOB) error estimates for p, p/2, and sqrt(p):
hrdata.train.rfplots <- ggplot() + xlab("# of trees") + ylab("Out of bag error estimates")
hrdata.train.rfplots <- hrdata.train.rfplots + geom_line( aes(x=1:length(hrdata.rf1$err.rate[,1]), y=hrdata.rf1$err.rate[,1], colour = "OOB err - P"))
hrdata.train.rfplots <- hrdata.train.rfplots + geom_line( aes(x=1:length(hrdata.rf2$err.rate[,1]), y=hrdata.rf2$err.rate[,1], colour = "OOB err - P/2"))
hrdata.train.rfplots <- hrdata.train.rfplots + geom_line( aes(x=1:length(hrdata.rf3$err.rate[,1]), y=hrdata.rf3$err.rate[,1], colour = "OOB err - P^0.5"))
hrdata.train.rfplots
The plot shows that mtry = p/2 drives the OOB error down faster than the other split sizes.
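As a quick numeric check of the final OOB errors behind the plot (a small convenience snippet, not from the original report):
# Final OOB error rate for each mtry choice
sapply(list(p = hrdata.rf1, p_half = hrdata.rf2, sqrt_p = hrdata.rf3), function(rf) tail(rf$err.rate[, "OOB"], 1))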
Creating roc curve objects to be fed into final ggroc plot for comparison:
hrdata.rf1.predict <- predict(hrdata.rf1, newdata = hrdata.test, type = "prob")[,2]
hrdata.rf1.roc <- roc(response = hrdata.test$recommended, predictor = hrdata.rf1.predict , direction = "<" , levels = c(0, 1))
hrdata.rf2.predict <- predict(hrdata.rf2, newdata = hrdata.test, type = "prob")[,2]
hrdata.rf2.roc <- roc(response = hrdata.test$recommended, predictor = hrdata.rf2.predict , direction = "<" , levels = c(0, 1))
hrdata.rf3.predict <- predict(hrdata.rf3, newdata = hrdata.test, type = "prob")[,2]
hrdata.rf3.roc <- roc(response = hrdata.test$recommended, predictor = hrdata.rf3.predict , direction = "<" , levels = c(0, 1))
Next, we create models using LDA, QDA, and Naive Bayes.
### Creating Models
# Naive Bayes
hrdata.nb <- NaiveBayes(recommended ~ . , data = hrdata.train, usekernel=TRUE)
# LDA
hrdata.lda <- lda(recommended ~., data=hrdata.train)
# QDA
hrdata.qda <- qda(recommended ~., data=hrdata.train)
We use these models with the predict() function to obtain predictions on the test set for the ROC curves.
### Predict
hrdata.nb.predict <- suppressWarnings(predict(hrdata.nb, newdata = hrdata.test, type = "prob"))
hrdata.lda.predict <- predict(hrdata.lda, newdata = hrdata.test, type = "response")
hrdata.qda.predict <- predict(hrdata.qda, newdata = hrdata.test)
The following creates ROC curves for each of the three models above. Since the tree and random-forest ROCs above use predicted probabilities, we use the posterior probability of the recommended class here as well, rather than the hard class labels.
# Use posterior probabilities of class "1" (consistent with the tree and forest ROCs above)
hrdata.nb.roc <- roc(response = hrdata.test$recommended, predictor = hrdata.nb.predict$posterior[,2] , direction = "<" , levels = c(0, 1))
hrdata.lda.roc <- roc(response = hrdata.test$recommended, predictor = hrdata.lda.predict$posterior[,2] , direction = "<" , levels = c(0, 1))
hrdata.qda.roc <- roc(response = hrdata.test$recommended, predictor = hrdata.qda.predict$posterior[,2] , direction = "<" , levels = c(0, 1))
Finally, we print a graph comparing the ROC curves:
hrdata.full.models.roc <- ggroc(list('Large Tree' = hrdata.large.tree.roc, 'Pruned Tree' = hrdata.pruned.tree.roc, 'Random Forest (P)' = hrdata.rf1.roc, 'Random Forest (P/2)' = hrdata.rf2.roc, 'Random Forest (sqrt(P))' = hrdata.rf3.roc, 'NaiveBayes' = hrdata.nb.roc, 'LDA' = hrdata.lda.roc, 'QDA' = hrdata.qda.roc), size = 1) + ggtitle("Comparison of ROC Curves")
hrdata.full.models.roc
Plotting the area under the curve (AUC) for the various models:
hrdata.auc.df <- data.frame("Type" = c("Large Tree", "Pruned Tree", "Random Forest (P)", "Random Forest (P/2)", "Random Forest (sqrt(P))", "NaiveBayes", "LDA", "QDA"), "AUC" = c(auc(hrdata.large.tree.roc), auc(hrdata.pruned.tree.roc), auc(hrdata.rf1.roc), auc(hrdata.rf2.roc), auc(hrdata.rf3.roc), auc(hrdata.nb.roc), auc(hrdata.lda.roc), auc(hrdata.qda.roc)))
ggplot(data = hrdata.auc.df) + geom_bar(mapping = aes(x = Type, y = AUC, fill = Type), position = "dodge", stat = "identity") + ggtitle("Area under curve of various models") + theme(axis.text=element_text(size=12), axis.text.x = element_text(angle = 45, hjust = 1))
From the above we can see that the random forest models perform best, at any split size (P, P/2, or sqrt(P)), compared to the other models.
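To read off the exact values behind the bar chart, the AUC table can also be printed in sorted order (a small convenience snippet, not part of the original report):
kable(hrdata.auc.df[order(-hrdata.auc.df$AUC), ], digits = 3, format = "html")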
We had some initial questions whose gist we used as a framework for the data analysis in this report. We now attempt to answer those questions based on our analysis:
1a. Determine which combination of attributes of the suggestion (and maybe the person who wrote it) can be used to predict a ‘good’ suggestion.
Answer: We found that the following predictors were most consistently picked by the regsubsets model, roughly in the order of the frequency plot above: responses, downvotes, views, author_total_posts, and upvotes.
These mattered most when predicting a "good" suggestion, compared to the other predictors. We dropped the author_id and suggestion_id columns before applying the predictor-selection functions.
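For reference, this ranking can be read directly from the frequency data frame built during predictor selection (a small convenience snippet, not from the original report):
# Predictors sorted by how often the exhaustive search selects them
hrdata.exhaustive.model.analysis[order(-hrdata.exhaustive.model.analysis$count), ]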
1b. Does number of views matter more or less than votes?
Answer: # of downvotes & responses to a post actually mattered more than views.
2a. How much does the ‘age’ of the employee matter when it comes to their ability to make a good suggestion?
Answer: The partimat plot matrices for both the LDA and QDA methods showed relatively reasonable prediction rates for age paired with most of the variables when classifying recommended posts from non-recommended ones. The worst-performing variable paired with age for classifying recommendations was responses.
2b. Are the employees with longer tenures making better suggestions than those with shorter ones?
Answer: According to the density plot, there was indeed a higher density of good recommendations from profiles with larger age values. For lower-age profiles, however, the distinction was less pronounced.
3a. Can the same data be used to rank employees based on their demonstrated ability to make predominantly good suggestions?
Answer: The order of suggested predictors varied slightly when the posts were grouped by author. For each author, recommended posts generally tended to come from those with higher author_profile_age, author_posts_per_day, and responsiveness (responses/views). We did not penalize a high number of posts with few views or recommendations in this analysis, so the model could be slightly biased towards author_posts_per_day: an author who posts a great deal may get some posts recommended simply because a few of them attract responses and few downvotes. With this analysis we could create a scoring function that builds a sorting index from higher values of author_profile_age, author_posts_per_day, and responsiveness combined, i.e. a ranking of authors who are likely to post more recommended suggestions, as sketched below.
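A minimal sketch of such a ranking, assuming equal weights on the standardized variables (the weights and the scale() normalization are our assumptions, not part of the original analysis):
# Hypothetical author ranking: equal-weight sum of standardized profile age, posting frequency and responsiveness
hrdata.author.rank <- hrdata.train.group.author.rec
hrdata.author.rank$score <- as.numeric(scale(hrdata.author.rank$author_profile_age)) + as.numeric(scale(hrdata.author.rank$author_posts_per_day)) + as.numeric(scale(hrdata.author.rank$responsiveness))
head(hrdata.author.rank[order(-hrdata.author.rank$score), c("author_id", "score", "success_rate")])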
3b. Can it be used to identify groups of employees whose suggestions could be aggregated to provide more reliable suggestions than made by the best individuals?
Answer: As in the previous question, grouping employees with larger profile ages and higher post counts did identify the employees who provide more reliable suggestions. If we remove the time element and the number of posts from the analysis, the number of downvotes and the responsiveness interaction term (responses/views) seem to matter most. So even new profiles are more likely to produce recommended posts as long as their posts attract few downvotes and high responsiveness.
4a. Make recommendations to your IT department about better ways they could collect this data in the future. What other attributes would prove useful and why? Would it be possible to build a completely automated suggestion ranking system?
Answer: The prediction rate for individual posts was high, but fewer attributes remained once we dropped the predictors that were unnecessary for analyzing posts by author_id. Additional information about authors, such as their real-life age, their position in the company, or their salary, could provide more features giving insight into which profiles make highly recommended suggestions, so we would suggest the IT department collect such data.
We have employed various classification techniques and validation measures learned in this class to answer several decision-oriented questions. For the final model, we recommend deploying the random forest with all p predictors tried at each node split, based on its prediction performance.