This notebook contains R code to accompany Chapter 4 of the book “Real-World Machine Learning”, by Henrik Brink, Joseph W. Richards, and Mark Fetherolf.
The code was contributed by Paul Adamson.
NOTE: working directory should be set to this file’s location.
NOTE: depending on your machine, you may need to adjust the parameter in the call to makeCluster() in the code that generates Figure 4.19.
As in Chapter 3, we are again interested in predicting survival, so it is useful to read the Survived variable in as a factor. For visualizing the data, it is also convenient to use the revalue function to relabel the factor levels as no and yes. The kable function used to display the tables below comes from the knitr package.
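The packages used throughout this notebook are presumably loaded in a setup chunk that is not part of this excerpt. If you run the chunks standalone, loading something like the following first should work; the exact list in the original setup chunk may differ:
# Assumed setup (not shown in the excerpt); caret loads the model packages
# (gbm, randomForest, kernlab) on demand when train() is called
library(plyr)          # revalue (loaded before dplyr to avoid masking issues)
library(dplyr)         # mutate, select, filter, %>%
library(knitr)         # kable
library(ggplot2)
library(caret)         # dummyVars, createDataPartition, train, confusionMatrix
library(doParallel)    # makeCluster, registerDoParallel
library(RColorBrewer)  # brewer.pal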
titanic <- read.csv("../data/titanic.csv",
colClasses = c(
Survived = "factor",
Name = "character",
Ticket = "character",
Cabin = "character"))
titanic$Survived <- revalue(titanic$Survived, c("0"="no", "1"="yes"))
kable(head(titanic, 5), digits=2)
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | no | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.25 | | S |
2 | yes | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38 | 1 | 0 | PC 17599 | 71.28 | C85 | C |
3 | yes | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.92 | | S |
4 | yes | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.10 | C123 | S |
5 | no | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.05 | | S |
Here, we follow the same process used for Figure 3.6 to process the data and prepare it for our model. First, we drop the variables that we do not want in the model (Cabin might actually be useful, but it is not used here). Then we use is.na to set missing Age values to -1. The mutate and select functions make it easy to take the square root of the Fare variable and then drop the original column from the dataset. We then drop rows with missing Embarked data and remove the unused "" level. Finally, we convert the factor variables to dummy variables using the dummyVars function in the caret package; to avoid perfect collinearity (the dummy variable trap), we set the fullRank parameter to TRUE. Survived.yes is then converted back to a factor variable, and its levels are relabelled 'no' and 'yes' (otherwise the train function in caret will complain later about invalid R variable names).
We then make an 80%/20% train/test split, passing the Survived factor variable to the createDataPartition function so that the overall class distribution of the data is preserved in both sets (a quick check of this is sketched after the tables below).
titanicTidy <- subset(titanic, select = -c(PassengerId, Name, Ticket, Cabin))
titanicTidy$Age[is.na(titanicTidy$Age)] <- -1
titanicTidy <- subset(titanic, select = -c(PassengerId, Name, Ticket, Cabin))
titanicTidy$Age[is.na(titanicTidy$Age)] <- -1
titanicTidy <- titanicTidy %>%
mutate(sqrtFare = sqrt(Fare)) %>%
select(-Fare)
titanicTidy <- titanicTidy %>%
filter(!(Embarked=="")) %>%
droplevels
dummies <- dummyVars(" ~ .", data = titanicTidy, fullRank = TRUE)
titanicTidyNumeric <- data.frame(predict(dummies, newdata = titanicTidy))
titanicTidyNumeric$Survived.yes <- factor(titanicTidyNumeric$Survived.yes)
titanicTidyNumeric$Survived.yes <-
revalue(titanicTidyNumeric$Survived.yes, c("0"="no", "1"="yes"))
trainIndex <- createDataPartition(titanicTidyNumeric$Survived.yes, p = .8,
list = FALSE,
times = 1)
titanicTrain <- titanicTidyNumeric[ trainIndex,]
titanicTest <- titanicTidyNumeric[-trainIndex,]
kable(head(titanicTidyNumeric, 8), digits=2, caption = "Full dataset")
Survived.yes | Pclass | Sex.male | Age | SibSp | Parch | Embarked.Q | Embarked.S | sqrtFare |
---|---|---|---|---|---|---|---|---|
no | 3 | 1 | 22 | 1 | 0 | 0 | 1 | 2.69 |
yes | 1 | 0 | 38 | 1 | 0 | 0 | 0 | 8.44 |
yes | 3 | 0 | 26 | 0 | 0 | 0 | 1 | 2.82 |
yes | 1 | 0 | 35 | 1 | 0 | 0 | 1 | 7.29 |
no | 3 | 1 | 35 | 0 | 0 | 0 | 1 | 2.84 |
no | 3 | 1 | -1 | 0 | 0 | 1 | 0 | 2.91 |
no | 1 | 1 | 54 | 0 | 0 | 0 | 1 | 7.20 |
no | 3 | 1 | 2 | 3 | 1 | 0 | 1 | 4.59 |
kable(head(titanicTrain, 5), digits=2,
caption = "Training set: used only for building the model")
Survived.yes | Pclass | Sex.male | Age | SibSp | Parch | Embarked.Q | Embarked.S | sqrtFare | |
---|---|---|---|---|---|---|---|---|---|
1 | no | 3 | 1 | 22 | 1 | 0 | 0 | 1 | 2.69 |
2 | yes | 1 | 0 | 38 | 1 | 0 | 0 | 0 | 8.44 |
4 | yes | 1 | 0 | 35 | 1 | 0 | 0 | 1 | 7.29 |
6 | no | 3 | 1 | -1 | 0 | 0 | 1 | 0 | 2.91 |
8 | no | 3 | 1 | 2 | 3 | 1 | 0 | 1 | 4.59 |
kable(head(titanicTest, 3), digits=2,
caption = "Testing set: used only for evaluating model")
Survived.yes | Pclass | Sex.male | Age | SibSp | Parch | Embarked.Q | Embarked.S | sqrtFare | |
---|---|---|---|---|---|---|---|---|---|
3 | yes | 3 | 0 | 26 | 0 | 0 | 0 | 1 | 2.82 |
5 | no | 3 | 1 | 35 | 0 | 0 | 0 | 1 | 2.84 |
7 | no | 1 | 1 | 54 | 0 | 0 | 0 | 1 | 7.20 |
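As a quick sanity check (not in the original notebook), we can confirm that createDataPartition kept the class proportions of the two splits close to each other (output not shown):
# Sanity check: class balance should be similar in the training and test sets
prop.table(table(titanicTrain$Survived.yes))
prop.table(table(titanicTest$Survived.yes))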
objControl <- trainControl(method='cv', number=3,
returnResamp='none',
summaryFunction = twoClassSummary,
classProbs = TRUE)
titanic.gbm <- train(Survived.yes~.,data=titanicTrain, method='gbm',
trControl=objControl,
metric = "ROC",
preProc = c("center", "scale"))
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.2829 nan 0.1000 0.0267
## 2 1.2449 nan 0.1000 0.0221
## 3 1.2100 nan 0.1000 0.0183
## 4 1.1789 nan 0.1000 0.0150
## 5 1.1536 nan 0.1000 0.0118
## 6 1.1309 nan 0.1000 0.0119
## 7 1.1069 nan 0.1000 0.0084
## 8 1.0878 nan 0.1000 0.0094
## 9 1.0685 nan 0.1000 0.0059
## 10 1.0511 nan 0.1000 0.0077
## 20 0.9584 nan 0.1000 0.0029
## 40 0.8959 nan 0.1000 0.0006
## 60 0.8650 nan 0.1000 -0.0007
## 80 0.8451 nan 0.1000 -0.0011
## 100 0.8313 nan 0.1000 -0.0009
## 120 0.8190 nan 0.1000 -0.0004
## 140 0.8084 nan 0.1000 -0.0011
## 150 0.8040 nan 0.1000 -0.0002
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.2611 nan 0.1000 0.0370
## 2 1.2119 nan 0.1000 0.0222
## 3 1.1597 nan 0.1000 0.0254
## 4 1.1206 nan 0.1000 0.0206
## 5 1.0899 nan 0.1000 0.0145
## 6 1.0610 nan 0.1000 0.0138
## 7 1.0398 nan 0.1000 0.0093
## 8 1.0201 nan 0.1000 0.0082
## 9 1.0005 nan 0.1000 0.0082
## 10 0.9818 nan 0.1000 0.0084
## 20 0.8941 nan 0.1000 0.0015
## 40 0.8137 nan 0.1000 -0.0015
## 60 0.7734 nan 0.1000 -0.0021
## 80 0.7347 nan 0.1000 -0.0003
## 100 0.7128 nan 0.1000 -0.0013
## 120 0.6909 nan 0.1000 -0.0009
## 140 0.6715 nan 0.1000 -0.0015
## 150 0.6638 nan 0.1000 -0.0006
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.2500 nan 0.1000 0.0379
## 2 1.1871 nan 0.1000 0.0275
## 3 1.1405 nan 0.1000 0.0234
## 4 1.0983 nan 0.1000 0.0180
## 5 1.0611 nan 0.1000 0.0151
## 6 1.0295 nan 0.1000 0.0131
## 7 1.0103 nan 0.1000 0.0071
## 8 0.9879 nan 0.1000 0.0093
## 9 0.9655 nan 0.1000 0.0088
## 10 0.9507 nan 0.1000 0.0032
## 20 0.8474 nan 0.1000 -0.0007
## 40 0.7605 nan 0.1000 -0.0008
## 60 0.7054 nan 0.1000 -0.0011
## 80 0.6673 nan 0.1000 -0.0005
## 100 0.6386 nan 0.1000 -0.0010
## 120 0.6104 nan 0.1000 -0.0023
## 140 0.5835 nan 0.1000 -0.0026
## 150 0.5735 nan 0.1000 -0.0014
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.2705 nan 0.1000 0.0317
## 2 1.2223 nan 0.1000 0.0252
## 3 1.1792 nan 0.1000 0.0201
## 4 1.1463 nan 0.1000 0.0168
## 5 1.1151 nan 0.1000 0.0131
## 6 1.0923 nan 0.1000 0.0107
## 7 1.0742 nan 0.1000 0.0082
## 8 1.0595 nan 0.1000 0.0068
## 9 1.0431 nan 0.1000 0.0056
## 10 1.0328 nan 0.1000 0.0039
## 20 0.9487 nan 0.1000 0.0021
## 40 0.8849 nan 0.1000 0.0007
## 60 0.8619 nan 0.1000 -0.0006
## 80 0.8467 nan 0.1000 -0.0012
## 100 0.8327 nan 0.1000 -0.0003
## 120 0.8207 nan 0.1000 -0.0002
## 140 0.8100 nan 0.1000 -0.0009
## 150 0.8047 nan 0.1000 -0.0015
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.2554 nan 0.1000 0.0378
## 2 1.2009 nan 0.1000 0.0263
## 3 1.1503 nan 0.1000 0.0236
## 4 1.1112 nan 0.1000 0.0199
## 5 1.0788 nan 0.1000 0.0149
## 6 1.0543 nan 0.1000 0.0122
## 7 1.0288 nan 0.1000 0.0112
## 8 1.0063 nan 0.1000 0.0068
## 9 0.9849 nan 0.1000 0.0089
## 10 0.9692 nan 0.1000 0.0055
## 20 0.8778 nan 0.1000 0.0012
## 40 0.8128 nan 0.1000 -0.0015
## 60 0.7763 nan 0.1000 -0.0002
## 80 0.7448 nan 0.1000 -0.0012
## 100 0.7239 nan 0.1000 -0.0004
## 120 0.6985 nan 0.1000 -0.0016
## 140 0.6854 nan 0.1000 -0.0024
## 150 0.6745 nan 0.1000 -0.0016
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.2562 nan 0.1000 0.0354
## 2 1.1911 nan 0.1000 0.0313
## 3 1.1429 nan 0.1000 0.0235
## 4 1.0975 nan 0.1000 0.0185
## 5 1.0635 nan 0.1000 0.0172
## 6 1.0321 nan 0.1000 0.0158
## 7 1.0064 nan 0.1000 0.0105
## 8 0.9826 nan 0.1000 0.0115
## 9 0.9648 nan 0.1000 0.0076
## 10 0.9458 nan 0.1000 0.0073
## 20 0.8404 nan 0.1000 0.0001
## 40 0.7443 nan 0.1000 -0.0004
## 60 0.6926 nan 0.1000 -0.0012
## 80 0.6523 nan 0.1000 -0.0028
## 100 0.6254 nan 0.1000 -0.0008
## 120 0.5947 nan 0.1000 -0.0013
## 140 0.5749 nan 0.1000 -0.0016
## 150 0.5636 nan 0.1000 -0.0040
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.2610 nan 0.1000 0.0315
## 2 1.2115 nan 0.1000 0.0245
## 3 1.1743 nan 0.1000 0.0195
## 4 1.1439 nan 0.1000 0.0167
## 5 1.1141 nan 0.1000 0.0138
## 6 1.0846 nan 0.1000 0.0101
## 7 1.0684 nan 0.1000 0.0066
## 8 1.0509 nan 0.1000 0.0043
## 9 1.0339 nan 0.1000 0.0075
## 10 1.0229 nan 0.1000 0.0031
## 20 0.9374 nan 0.1000 0.0027
## 40 0.8699 nan 0.1000 -0.0005
## 60 0.8372 nan 0.1000 -0.0009
## 80 0.8137 nan 0.1000 -0.0006
## 100 0.7969 nan 0.1000 -0.0010
## 120 0.7822 nan 0.1000 -0.0004
## 140 0.7700 nan 0.1000 -0.0015
## 150 0.7628 nan 0.1000 -0.0001
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.2613 nan 0.1000 0.0335
## 2 1.1967 nan 0.1000 0.0329
## 3 1.1451 nan 0.1000 0.0203
## 4 1.1046 nan 0.1000 0.0211
## 5 1.0695 nan 0.1000 0.0160
## 6 1.0352 nan 0.1000 0.0142
## 7 1.0115 nan 0.1000 0.0117
## 8 0.9934 nan 0.1000 0.0076
## 9 0.9756 nan 0.1000 0.0082
## 10 0.9605 nan 0.1000 0.0073
## 20 0.8624 nan 0.1000 0.0018
## 40 0.7784 nan 0.1000 -0.0004
## 60 0.7261 nan 0.1000 -0.0005
## 80 0.6906 nan 0.1000 -0.0011
## 100 0.6615 nan 0.1000 -0.0000
## 120 0.6430 nan 0.1000 -0.0005
## 140 0.6254 nan 0.1000 -0.0017
## 150 0.6172 nan 0.1000 -0.0009
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.2504 nan 0.1000 0.0385
## 2 1.1850 nan 0.1000 0.0348
## 3 1.1308 nan 0.1000 0.0254
## 4 1.0834 nan 0.1000 0.0223
## 5 1.0510 nan 0.1000 0.0136
## 6 1.0161 nan 0.1000 0.0147
## 7 0.9896 nan 0.1000 0.0125
## 8 0.9726 nan 0.1000 0.0065
## 9 0.9475 nan 0.1000 0.0100
## 10 0.9254 nan 0.1000 0.0068
## 20 0.8160 nan 0.1000 0.0029
## 40 0.7086 nan 0.1000 -0.0001
## 60 0.6494 nan 0.1000 -0.0012
## 80 0.6142 nan 0.1000 -0.0001
## 100 0.5797 nan 0.1000 -0.0001
## 120 0.5551 nan 0.1000 -0.0009
## 140 0.5337 nan 0.1000 -0.0014
## 150 0.5207 nan 0.1000 -0.0009
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.2481 nan 0.1000 0.0367
## 2 1.1876 nan 0.1000 0.0323
## 3 1.1344 nan 0.1000 0.0246
## 4 1.0914 nan 0.1000 0.0205
## 5 1.0531 nan 0.1000 0.0148
## 6 1.0265 nan 0.1000 0.0119
## 7 1.0007 nan 0.1000 0.0128
## 8 0.9764 nan 0.1000 0.0110
## 9 0.9584 nan 0.1000 0.0070
## 10 0.9441 nan 0.1000 0.0057
## 20 0.8450 nan 0.1000 0.0004
## 40 0.7663 nan 0.1000 -0.0001
## 50 0.7378 nan 0.1000 -0.0009
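Although the original output does not show it, the tuning combination that caret selected for the gbm model can be inspected directly:
# Inspect the selected gbm tuning parameters (output not shown)
titanic.gbm$bestTune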
titanic.gbm.pred <- predict(titanic.gbm,newdata=titanicTest)
figure4.11.df <- data.frame(head(titanicTest$Survived.yes),
head(titanic.gbm.pred))
kable(figure4.11.df,
col.names = c("Test set labels", "Predictions"))
Test set labels | Predictions |
---|---|
yes | no |
no | no |
no | no |
yes | no |
yes | yes |
no | no |
titanic.gbm.cm <- confusionMatrix(titanic.gbm.pred,titanicTest$Survived.yes)
titanic.gbm.cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 98 31
## yes 11 37
##
## Accuracy : 0.7627
## 95% CI : (0.6931, 0.8233)
## No Information Rate : 0.6158
## P-Value [Acc > NIR] : 2.393e-05
##
## Kappa : 0.4692
## Mcnemar's Test P-Value : 0.00337
##
## Sensitivity : 0.8991
## Specificity : 0.5441
## Pos Pred Value : 0.7597
## Neg Pred Value : 0.7708
## Prevalence : 0.6158
## Detection Rate : 0.5537
## Detection Prevalence : 0.7288
## Balanced Accuracy : 0.7216
##
## 'Positive' Class : no
##
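Individual statistics can also be pulled out of the confusionMatrix object programmatically, for example:
# Extract individual statistics from the caret confusionMatrix object
titanic.gbm.cm$overall["Accuracy"]
titanic.gbm.cm$byClass[c("Sensitivity", "Specificity")]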
titanic.gbm.pred.prob <- predict(object=titanic.gbm,
titanicTest, type='prob')
kable(titanic.gbm.pred.prob[15:19,],col.names = c("Died","Survived"))
Died | Survived | |
---|---|---|
15 | 0.8774734 | 0.1225266 |
16 | 0.8842786 | 0.1157214 |
17 | 0.8166056 | 0.1833944 |
18 | 0.8774734 | 0.1225266 |
19 | 0.8989089 | 0.1010911 |
titanic.ordered <- titanic.gbm.pred.prob[order(-titanic.gbm.pred.prob$yes),]
kable(titanic.ordered[titanic.ordered$yes > 0.68 & titanic.ordered$yes < 0.73,])
no | yes | |
---|---|---|
137 | 0.2851770 | 0.7148230 |
25 | 0.3000107 | 0.6999893 |
63 | 0.3190296 | 0.6809704 |
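The ordered probabilities above are useful for choosing a cutoff other than the default 0.5. As an illustration only (the 0.7 cutoff below is arbitrary, not a value from the book), the test set could be relabelled at a higher threshold and re-scored:
# Illustrative only: predict 'yes' when the survival probability meets an
# arbitrary 0.7 cutoff instead of the default 0.5
predAt0.7 <- factor(ifelse(titanic.gbm.pred.prob$yes >= 0.7, "yes", "no"),
levels = c("no", "yes"))
confusionMatrix(predAt0.7, titanicTest$Survived.yes)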
# Returns the false-positive and true-positive rates at nPoints thresholds for
# the given true and predicted labels
# trueLabels: 0=FALSE; 1=TRUE
rocCurve <- function(trueLabels, predictedProbs, nPoints=100, posClass=1){
# Allocates the threshold and ROC lists
thr <- seq(0,1,length=nPoints)
tpr <- numeric(nPoints)
fpr <- numeric(nPoints)
# Precalculates values for the positive and negative cases, used in the loop
pos <- trueLabels == posClass
neg <- !pos
nPos <- sum(pos, na.rm=TRUE)
nNeg <- sum(neg, na.rm=TRUE)
# For each threshold, calculates the rate of true and false positives
for (i in 1:length(thr)) {
t <- thr[i]
meetOrExceedThreshold <- predictedProbs >= t
tpr[i] <- sum((meetOrExceedThreshold & pos), na.rm=TRUE) / nPos
fpr[i] <- sum((meetOrExceedThreshold & neg), na.rm=TRUE) / nNeg
}
# Create data frame without duplicated fpr's to return
duplicatedFPRs <- duplicated(fpr)
df <- data.frame(fpr=fpr[!duplicatedFPRs],tpr=tpr[!duplicatedFPRs],thr=thr[!duplicatedFPRs])
return(df)
}
df<-rocCurve(revalue(titanicTest$Survived.yes, c("no" = 0, "yes" = 1)),
titanic.gbm.pred.prob$yes)
ggplot(df,aes(x=fpr,y=tpr)) +
geom_step(direction="vh") +
labs(x = "False-positive rate",
y = "True-positive rate")
auc <- function(trueLabels, predictedProbs, nPoints=100, posClass=1){
auc <- 0
df <- rocCurve(trueLabels = trueLabels,
predictedProbs = predictedProbs,
nPoints = nPoints,
posClass = posClass)
for (i in 2:length(df$fpr)) {
auc <- auc + 0.5 * (df$fpr[i-1] - df$fpr[i]) * (df$tpr[i-1] + df$tpr[i])
}
return(auc)
}
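The excerpt defines auc but does not show a call; applying it to the gbm predictions above, in the same way rocCurve was called earlier, would look like this (output not shown):
# AUC for the gbm model on the Titanic test set (output not shown)
auc(revalue(titanicTest$Survived.yes, c("no" = 0, "yes" = 1)),
titanic.gbm.pred.prob$yes)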
We can call the summary function on our model to get a data frame of variables (var) and their relative influence (rel.inf) in the model. Before plotting the data, we reorder it by the rel.inf variable.
gbmSummary <- summary(titanic.gbm)
gbmSummary <- transform(gbmSummary,
var = reorder(var,rel.inf))
ggplot(data=gbmSummary, aes(var, rel.inf)) +
geom_bar(stat="identity") +
coord_flip() +
labs(x="Relative Influence",
y="Variable")
Thanks to Longhow Lam for posting the code used in the displayMnistSamples function, which displays digits from the MNIST dataset.
mnist <- read.csv("../data/mnist_small.csv",
colClasses = c(label = "factor"))
displayMnistSamples <- function(x) {
  for(i in x){
    # reshape the 784 pixel columns of row i into a 28x28 image matrix
    y = as.matrix(mnist[i, 2:785])
    dim(y) = c(28, 28)
    # draw the digit (flipped so it is upright) and overlay its label in red
    image( y[,nrow(y):1], axes = FALSE, col = gray(0:255 / 255))
    text( 0.2, 0, mnist[i,1], cex = 3, col = 2, pos = c(3,4))
  }
}
par( mfrow = c(4,5), mai = c(0,0,0.1,0.1))
displayMnistSamples(sample(1:nrow(mnist),20))
levels(mnist$label) <- make.names(levels(mnist$label))
trainIndex <- createDataPartition(mnist$label, p = .8,
list = FALSE,
times = 1)
mnistTrain <- mnist[ trainIndex,]
mnistTest <- mnist[-trainIndex,]
cl = makeCluster(4)
registerDoParallel(cl)
str(mnistTrain$label)
## Factor w/ 10 levels "X0","X1","X2",..: 2 2 5 1 1 8 6 9 10 2 ...
myTrainingControl <- trainControl(savePredictions = TRUE,
classProbs = TRUE,
verboseIter = FALSE)
mnist.rf <- train(label~., data=mnistTrain, method='rf',
ntree = 50,
trControl = myTrainingControl,
probability=TRUE, allowParallel=TRUE)
mnist.rf.pred = predict(mnist.rf,newdata=mnistTest)
confusion.matrix <- confusionMatrix(mnist.rf.pred,mnistTest$label)
confusionDF <- data.frame(confusion.matrix$table)
confusionDF$Reference = with(confusionDF,
factor(Reference, levels = rev(levels(Reference))))
jBuPuFun <- colorRampPalette(brewer.pal(n = 9, "BuPu"))
paletteSize <- 256
jBuPuPalette <- jBuPuFun(paletteSize)
confusionPlot <- ggplot(
confusionDF, aes(x = Prediction, y = Reference, fill = Freq)) +
#theme(axis.text.x = element_text(angle = 0, hjust = 1, vjust = 0.5)) +
geom_tile() +
labs(x = "Predicted digit", y = "Actual digit") +
scale_fill_gradient2(
low = jBuPuPalette[1],
mid = jBuPuPalette[paletteSize/2],
high = jBuPuPalette[paletteSize],
midpoint = (max(confusionDF$Freq) + min(confusionDF$Freq)) / 2,
name = "") +
theme(legend.key.height = unit(2, "cm"))
confusionPlot
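The same confusion-matrix data can also be queried directly, for instance to report overall accuracy and the most frequent confusions (not part of the original output):
# Overall accuracy and the most common off-diagonal (misclassified) cells
confusion.matrix$overall["Accuracy"]
offDiag <- confusionDF[as.character(confusionDF$Prediction) !=
as.character(confusionDF$Reference), ]
head(offDiag[order(-offDiag$Freq), ], 5)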
rocDF<-NULL
aucDF<-NULL
mnist.rf.pred.prob <- predict(object=mnist.rf,
mnistTest, type='prob')
for (i in 0:9){
digitLabel = paste0("X",i)
trueLabels <- as.numeric(mnistTest$label==digitLabel)
predictedProbs <- mnist.rf.pred.prob[[digitLabel]]
rocDF <- rbind(rocDF,
data.frame(rocCurve(trueLabels = trueLabels, predictedProbs = predictedProbs),digit=i))
aucDF <- rbind(aucDF,
data.frame(auc=auc(trueLabels = trueLabels, predictedProbs = predictedProbs),digit=i))
}
rocDF[,'digit'] <- as.factor(rocDF[,'digit'])
labelVector <- c(paste0("Digit ",0:9,", AUC ",round(aucDF$auc,3)))
ggplot(rocDF[rocDF$fpr<0.2,],aes(x=fpr,y=tpr, linetype=digit, colour=digit)) +
geom_line() +
labs(x = "False-positive rate",
y = "True-positive rate") +
scale_linetype_discrete(name=NULL,
labels=labelVector,
guide = guide_legend(keywidth = 3)) +
scale_colour_discrete(name=NULL,
labels=labelVector,
guide = guide_legend(keywidth = 3)) +
theme(legend.position=c(.8, .40))
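The per-digit AUC values that feed the plot legend can also be displayed as a table (not shown in the original):
# Per-digit AUC values used in the legend above (output not shown)
kable(aucDF[, c("digit", "auc")], digits = 3)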
auto <- read.csv("../data/auto-mpg.csv",
colClasses = c(
origin = "factor"))
auto$origin <- revalue(auto$origin,
c("1\t"="USA", "2\t"="Europe", "3\t"="Asia"))
kable(head(auto,5))
mpg | cylinders | displacement | horsepower | weight | acceleration | modelyear | origin |
---|---|---|---|---|---|---|---|
18 | 8 | 307 | 130 | 3504 | 12.0 | 70 | USA |
15 | 8 | 350 | 165 | 3693 | 11.5 | 70 | USA |
18 | 8 | 318 | 150 | 3436 | 11.0 | 70 | USA |
16 | 8 | 304 | 150 | 3433 | 12.0 | 70 | USA |
17 | 8 | 302 | 140 | 3449 | 10.5 | 70 | USA |
dummies <- dummyVars(" ~ .", data = auto, fullRank = TRUE)
autoNumeric <- data.frame(predict(dummies, newdata = auto))
trainIndex <- createDataPartition(autoNumeric$mpg, p = .8,
list = FALSE,
times = 1)
autoTrain <- autoNumeric[ trainIndex,]
autoTest <- autoNumeric[-trainIndex,]
ctrl <- trainControl(method = "repeatedcv", number = 10, savePredictions = TRUE)
auto.glm <- train(mpg ~.,
data=autoTrain,
method="glm",
trControl = ctrl,
tuneLength = 5)
auto.glm.Pred <- predict(auto.glm, newdata=autoTest)
ggplot(autoTest, aes(x=mpg, y=auto.glm.Pred)) +
geom_point() +
geom_abline(slope = 1, intercept = 0) +
labs(x="MPG", y="Predicted MPG")
rmse <- function(trueValues, predictedValues){
return(sqrt(sum((trueValues - predictedValues)**2)/length(trueValues)))
}
rmse(autoTest$mpg, auto.glm.Pred)
## [1] 3.378211
r2 <- function(trueValues, predictedValues) {
meanTrueValues <- mean(trueValues)
return( 1.0 - (sum((trueValues - predictedValues)**2) / sum((trueValues - meanTrueValues)**2) ))
}
r2(autoTest$mpg, auto.glm.Pred)
## [1] 0.8029198
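As a cross-check on the hand-rolled metrics, caret's postResample function reports RMSE and R-squared (and, in recent caret versions, MAE) for the same predictions; note that caret computes R-squared as the squared correlation, so it can differ slightly from the formula above:
# Cross-check with caret's built-in regression metrics (output not shown)
postResample(pred = auto.glm.Pred, obs = autoTest$mpg)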
autoTest$residuals <- autoTest$mpg - auto.glm.Pred
ggplot(data=autoTest, aes(x=mpg, y=residuals)) +
geom_point() +
geom_abline(slope = 0, intercept = 0) +
labs(x="MPG", y="Residuals")
The getModelInfo function in the caret package can be used to find the available tuning parameters for a model. By setting summaryFunction = twoClassSummary in trainControl, we request that ROC be used to select the optimal model (the tuning combination with the largest value wins).
svmRadialModelInfo <- getModelInfo("svmRadial")
svmRadialModelInfo$svmRadial$parameters
## parameter class label
## 1 sigma numeric Sigma
## 2 C numeric Cost
svmGrid <- expand.grid(sigma = c(0.0001,0.001,0.01,0.1),
C = seq(0.1,2.1,by=0.5))
fitControl <- trainControl(method = "repeatedcv",
number = 10,
savePredictions = TRUE,
summaryFunction=twoClassSummary,
classProbs=TRUE)
svmFit <- train(Survived.yes ~ ., data = titanicTrain,
method = "svmRadial",
trControl = fitControl,
verbose = FALSE,
tuneGrid = svmGrid)
## Loading required package: kernlab
##
## Attaching package: 'kernlab'
## The following object is masked from 'package:ggplot2':
##
## alpha
## Warning in train.default(x, y, weights = w, ...): The metric "Accuracy" was
## not in the result set. ROC will be used instead.
svmFitDF <- data.frame(sigma = as.factor(svmFit$results$sigma),
C = as.factor(svmFit$results$C),
ROC = svmFit$results$ROC)
ggplot(svmFitDF, aes(x=sigma, y=C)) +
geom_tile(aes(fill=ROC))
svmFit$bestTune
## sigma C
## 14 0.01 1.6
svmGrid <- expand.grid(sigma = c(5:1 %o% 10^(-3:-1)),
C = seq(0.1,2.1,by=0.25))
fitControl <- trainControl(method = "repeatedcv",
number = 10,
savePredictions = TRUE,
summaryFunction=twoClassSummary,
classProbs=TRUE)
svmFit <- train(Survived.yes ~ ., data = titanicTrain,
method = "svmRadial",
trControl = fitControl,
verbose = FALSE,
tuneGrid = svmGrid)
## Warning in train.default(x, y, weights = w, ...): The metric "Accuracy" was
## not in the result set. ROC will be used instead.
svmFitDF <- data.frame(sigma = as.factor(svmFit$results$sigma),
C = as.factor(svmFit$results$C),
ROC = svmFit$results$ROC)
ggplot(svmFitDF, aes(x=sigma, y=C)) +
geom_tile(aes(fill=ROC))
svmFit$bestTune
## sigma C
## 121 0.4 0.85
svmFit$results[row.names(svmFit$bestTune),]
## sigma C ROC Sens Spec ROCSD SensSD
## 121 0.4 0.85 0.8638483 0.9136364 0.6691799 0.05090404 0.06759002
## SpecSD
## 121 0.08247496
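The excerpt stops at the resampling results. Evaluating the tuned SVM on the held-out test set and shutting down the parallel cluster started earlier would look roughly like this (not part of the original):
# Evaluate the tuned SVM on the held-out Titanic test set (output not shown)
svmFit.pred <- predict(svmFit, newdata = titanicTest)
confusionMatrix(svmFit.pred, titanicTest$Survived.yes)
# Release the parallel workers registered for caret earlier in the notebook
stopCluster(cl)
registerDoSEQ()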