Real-World Machine Learning

This notebook contains R code to accompany Chapter 4 of the book “Real-World Machine Learning”, by Henrik Brink, Joseph W. Richards, and Mark Fetherolf.
The code was contributed by Paul Adamson.

NOTE: working directory should be set to this file’s location.

NOTE: depending on your machine, you may need to adjust the parameter in the call to makeCluster() in the code to generate Figure 4.19

Figure 4.9 The first five rows of the Titanic Passengers dataset

As in Chapter 3, we are going to be interested in predicting survival, so again, it is useful to specify the Survived variable to be of type factor. For visualizing the data, it is also useful to use the revalue function to specify the no and yes levels for the factor variable. The kable function is built into the knitr package.

titanic <- read.csv("../data/titanic.csv", 
                    colClasses = c(
                      Survived = "factor",
                      Name = "character",
                      Ticket = "character",
                      Cabin = "character"))
titanic$Survived <- revalue(titanic$Survived, c("0"="no", "1"="yes"))
kable(head(titanic, 5), digits=2)

PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
1	no	3	Braund, Mr. Owen Harris	male	22	1	A/5 21171	7.25		S
2	yes	1	Cumings, Mrs. John Bradley (Florence Briggs Thayer)	female	38	1	PC 17599	71.28	C85	C
3	yes	3	Heikkinen, Miss. Laina	female	26	0	STON/O2. 3101282	7.92		S
4	yes	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35	1	113803	53.10	C123	S
5	no	3	Allen, Mr. William Henry	male	35	0	373450	8.05		S

Figure 4.10 Splitting the full dataset into training and testing sets

Here, we follow the same process used for Figure 3.6 to process the data and prepare it for our model. First, we get rid of the variables that we do not want in our model. (Cabin might actually be useful, but it’s not used here.) Then we use is.na to set missing age values to -1. The mutate and select functions make it easy to take square root of the Fare variable and then drop it from the dataset. We then drop rows with missing Embarked data and remove the unused level "". Finally, we convert factor variables to dummy variables using the dummyVars function in the caret package. To avoid perfect collinearity (a.k.a. the dummy variable trap), we set the fullRank parameter to TRUE. Survived.yes is then converted back to a factor variable, and its levels are changed back to ‘yes’ and ‘no’ again (otherwise the train function in caret will complain later about invalid R variable names).

We then make a 80%/20% train/test split using the Survived factor variable in the createDataPartition function to preserve the overall class distribution of the data.

titanicTidy <- subset(titanic, select = -c(PassengerId, Name, Ticket, Cabin))

titanicTidy$Age[is.na(titanicTidy$Age)] <- -1

titanicTidy <- titanicTidy %>%
  mutate(sqrtFare = sqrt(Fare)) %>%
  select(-Fare)

titanicTidy <- titanicTidy %>%
  filter(!(Embarked=="")) %>%
  droplevels

dummies <- dummyVars(" ~ .", data = titanicTidy, fullRank = TRUE)
titanicTidyNumeric <- data.frame(predict(dummies, newdata = titanicTidy))

titanicTidyNumeric$Survived.yes <- factor(titanicTidyNumeric$Survived.yes)
titanicTidyNumeric$Survived.yes <- 
  revalue(titanicTidyNumeric$Survived.yes, c("0"="no", "1"="yes"))

trainIndex <- createDataPartition(titanicTidyNumeric$Survived.yes, p = .8, 
                                  list = FALSE, 
                                  times = 1)

titanicTrain <- titanicTidyNumeric[ trainIndex,]
titanicTest  <- titanicTidyNumeric[-trainIndex,]

kable(head(titanicTidyNumeric, 8), digits=2, caption = "Full dataset")

Full dataset
Survived.yes	Pclass	Sex.male	Age	SibSp	Parch	Embarked.Q	Embarked.S	sqrtFare
no	3	1	22	1	0	0	1	2.69
yes	1	0	38	1	0	0	0	8.44
yes	3	0	26	0	0	0	1	2.82
yes	1	0	35	1	0	0	1	7.29
no	3	1	35	0	0	0	1	2.84
no	3	1	-1	0	0	1	0	2.91
no	1	1	54	0	0	0	1	7.20
no	3	1	2	3	1	0	1	4.59

kable(head(titanicTrain, 5), digits=2, 
      caption = "Training set: used only for building the model")

Training set: used only for building the model
	Survived.yes	Pclass	Sex.male	Age	SibSp	Parch	Embarked.Q	Embarked.S	sqrtFare
1	no	3	1	22	1	0	0	1	2.69
2	yes	1	0	38	1	0	0	0	8.44
4	yes	1	0	35	1	0	0	1	7.29
6	no	3	1	-1	0	0	1	0	2.91
8	no	3	1	2	3	1	0	1	4.59

kable(head(titanicTest, 3), digits=2,
      caption = "Testing set: used only for evaluating model")

Testing set: used only for evaluating model
	Survived.yes	Pclass	Sex.male	Age	Embarked.S	sqrtFare
3	yes	3	0	26	1	2.82
5	no	3	1	35	1	2.84
7	no	1	1	54	1	7.20

Figure 4.11 Comparing the testing set predictions with the actual values gives you the accuracy of the model.

objControl <- trainControl(method='cv', number=3, 
                           returnResamp='none', 
                           summaryFunction = twoClassSummary, 
                           classProbs = TRUE)
titanic.gbm <- train(Survived.yes~.,data=titanicTrain, method='gbm',
                     trControl=objControl,  
                     metric = "ROC",
                     preProc = c("center", "scale"))

## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.2829             nan     0.1000    0.0267
##      2        1.2449             nan     0.1000    0.0221
##      3        1.2100             nan     0.1000    0.0183
##      4        1.1789             nan     0.1000    0.0150
##      5        1.1536             nan     0.1000    0.0118
##      6        1.1309             nan     0.1000    0.0119
##      7        1.1069             nan     0.1000    0.0084
##      8        1.0878             nan     0.1000    0.0094
##      9        1.0685             nan     0.1000    0.0059
##     10        1.0511             nan     0.1000    0.0077
##     20        0.9584             nan     0.1000    0.0029
##     40        0.8959             nan     0.1000    0.0006
##     60        0.8650             nan     0.1000   -0.0007
##     80        0.8451             nan     0.1000   -0.0011
##    100        0.8313             nan     0.1000   -0.0009
##    120        0.8190             nan     0.1000   -0.0004
##    140        0.8084             nan     0.1000   -0.0011
##    150        0.8040             nan     0.1000   -0.0002
## 
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.2611             nan     0.1000    0.0370
##      2        1.2119             nan     0.1000    0.0222
##      3        1.1597             nan     0.1000    0.0254
##      4        1.1206             nan     0.1000    0.0206
##      5        1.0899             nan     0.1000    0.0145
##      6        1.0610             nan     0.1000    0.0138
##      7        1.0398             nan     0.1000    0.0093
##      8        1.0201             nan     0.1000    0.0082
##      9        1.0005             nan     0.1000    0.0082
##     10        0.9818             nan     0.1000    0.0084
##     20        0.8941             nan     0.1000    0.0015
##     40        0.8137             nan     0.1000   -0.0015
##     60        0.7734             nan     0.1000   -0.0021
##     80        0.7347             nan     0.1000   -0.0003
##    100        0.7128             nan     0.1000   -0.0013
##    120        0.6909             nan     0.1000   -0.0009
##    140        0.6715             nan     0.1000   -0.0015
##    150        0.6638             nan     0.1000   -0.0006
## 
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.2500             nan     0.1000    0.0379
##      2        1.1871             nan     0.1000    0.0275
##      3        1.1405             nan     0.1000    0.0234
##      4        1.0983             nan     0.1000    0.0180
##      5        1.0611             nan     0.1000    0.0151
##      6        1.0295             nan     0.1000    0.0131
##      7        1.0103             nan     0.1000    0.0071
##      8        0.9879             nan     0.1000    0.0093
##      9        0.9655             nan     0.1000    0.0088
##     10        0.9507             nan     0.1000    0.0032
##     20        0.8474             nan     0.1000   -0.0007
##     40        0.7605             nan     0.1000   -0.0008
##     60        0.7054             nan     0.1000   -0.0011
##     80        0.6673             nan     0.1000   -0.0005
##    100        0.6386             nan     0.1000   -0.0010
##    120        0.6104             nan     0.1000   -0.0023
##    140        0.5835             nan     0.1000   -0.0026
##    150        0.5735             nan     0.1000   -0.0014
## 
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.2705             nan     0.1000    0.0317
##      2        1.2223             nan     0.1000    0.0252
##      3        1.1792             nan     0.1000    0.0201
##      4        1.1463             nan     0.1000    0.0168
##      5        1.1151             nan     0.1000    0.0131
##      6        1.0923             nan     0.1000    0.0107
##      7        1.0742             nan     0.1000    0.0082
##      8        1.0595             nan     0.1000    0.0068
##      9        1.0431             nan     0.1000    0.0056
##     10        1.0328             nan     0.1000    0.0039
##     20        0.9487             nan     0.1000    0.0021
##     40        0.8849             nan     0.1000    0.0007
##     60        0.8619             nan     0.1000   -0.0006
##     80        0.8467             nan     0.1000   -0.0012
##    100        0.8327             nan     0.1000   -0.0003
##    120        0.8207             nan     0.1000   -0.0002
##    140        0.8100             nan     0.1000   -0.0009
##    150        0.8047             nan     0.1000   -0.0015
## 
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.2554             nan     0.1000    0.0378
##      2        1.2009             nan     0.1000    0.0263
##      3        1.1503             nan     0.1000    0.0236
##      4        1.1112             nan     0.1000    0.0199
##      5        1.0788             nan     0.1000    0.0149
##      6        1.0543             nan     0.1000    0.0122
##      7        1.0288             nan     0.1000    0.0112
##      8        1.0063             nan     0.1000    0.0068
##      9        0.9849             nan     0.1000    0.0089
##     10        0.9692             nan     0.1000    0.0055
##     20        0.8778             nan     0.1000    0.0012
##     40        0.8128             nan     0.1000   -0.0015
##     60        0.7763             nan     0.1000   -0.0002
##     80        0.7448             nan     0.1000   -0.0012
##    100        0.7239             nan     0.1000   -0.0004
##    120        0.6985             nan     0.1000   -0.0016
##    140        0.6854             nan     0.1000   -0.0024
##    150        0.6745             nan     0.1000   -0.0016
## 
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.2562             nan     0.1000    0.0354
##      2        1.1911             nan     0.1000    0.0313
##      3        1.1429             nan     0.1000    0.0235
##      4        1.0975             nan     0.1000    0.0185
##      5        1.0635             nan     0.1000    0.0172
##      6        1.0321             nan     0.1000    0.0158
##      7        1.0064             nan     0.1000    0.0105
##      8        0.9826             nan     0.1000    0.0115
##      9        0.9648             nan     0.1000    0.0076
##     10        0.9458             nan     0.1000    0.0073
##     20        0.8404             nan     0.1000    0.0001
##     40        0.7443             nan     0.1000   -0.0004
##     60        0.6926             nan     0.1000   -0.0012
##     80        0.6523             nan     0.1000   -0.0028
##    100        0.6254             nan     0.1000   -0.0008
##    120        0.5947             nan     0.1000   -0.0013
##    140        0.5749             nan     0.1000   -0.0016
##    150        0.5636             nan     0.1000   -0.0040
## 
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.2610             nan     0.1000    0.0315
##      2        1.2115             nan     0.1000    0.0245
##      3        1.1743             nan     0.1000    0.0195
##      4        1.1439             nan     0.1000    0.0167
##      5        1.1141             nan     0.1000    0.0138
##      6        1.0846             nan     0.1000    0.0101
##      7        1.0684             nan     0.1000    0.0066
##      8        1.0509             nan     0.1000    0.0043
##      9        1.0339             nan     0.1000    0.0075
##     10        1.0229             nan     0.1000    0.0031
##     20        0.9374             nan     0.1000    0.0027
##     40        0.8699             nan     0.1000   -0.0005
##     60        0.8372             nan     0.1000   -0.0009
##     80        0.8137             nan     0.1000   -0.0006
##    100        0.7969             nan     0.1000   -0.0010
##    120        0.7822             nan     0.1000   -0.0004
##    140        0.7700             nan     0.1000   -0.0015
##    150        0.7628             nan     0.1000   -0.0001
## 
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.2613             nan     0.1000    0.0335
##      2        1.1967             nan     0.1000    0.0329
##      3        1.1451             nan     0.1000    0.0203
##      4        1.1046             nan     0.1000    0.0211
##      5        1.0695             nan     0.1000    0.0160
##      6        1.0352             nan     0.1000    0.0142
##      7        1.0115             nan     0.1000    0.0117
##      8        0.9934             nan     0.1000    0.0076
##      9        0.9756             nan     0.1000    0.0082
##     10        0.9605             nan     0.1000    0.0073
##     20        0.8624             nan     0.1000    0.0018
##     40        0.7784             nan     0.1000   -0.0004
##     60        0.7261             nan     0.1000   -0.0005
##     80        0.6906             nan     0.1000   -0.0011
##    100        0.6615             nan     0.1000   -0.0000
##    120        0.6430             nan     0.1000   -0.0005
##    140        0.6254             nan     0.1000   -0.0017
##    150        0.6172             nan     0.1000   -0.0009
## 
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.2504             nan     0.1000    0.0385
##      2        1.1850             nan     0.1000    0.0348
##      3        1.1308             nan     0.1000    0.0254
##      4        1.0834             nan     0.1000    0.0223
##      5        1.0510             nan     0.1000    0.0136
##      6        1.0161             nan     0.1000    0.0147
##      7        0.9896             nan     0.1000    0.0125
##      8        0.9726             nan     0.1000    0.0065
##      9        0.9475             nan     0.1000    0.0100
##     10        0.9254             nan     0.1000    0.0068
##     20        0.8160             nan     0.1000    0.0029
##     40        0.7086             nan     0.1000   -0.0001
##     60        0.6494             nan     0.1000   -0.0012
##     80        0.6142             nan     0.1000   -0.0001
##    100        0.5797             nan     0.1000   -0.0001
##    120        0.5551             nan     0.1000   -0.0009
##    140        0.5337             nan     0.1000   -0.0014
##    150        0.5207             nan     0.1000   -0.0009
## 
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.2481             nan     0.1000    0.0367
##      2        1.1876             nan     0.1000    0.0323
##      3        1.1344             nan     0.1000    0.0246
##      4        1.0914             nan     0.1000    0.0205
##      5        1.0531             nan     0.1000    0.0148
##      6        1.0265             nan     0.1000    0.0119
##      7        1.0007             nan     0.1000    0.0128
##      8        0.9764             nan     0.1000    0.0110
##      9        0.9584             nan     0.1000    0.0070
##     10        0.9441             nan     0.1000    0.0057
##     20        0.8450             nan     0.1000    0.0004
##     40        0.7663             nan     0.1000   -0.0001
##     50        0.7378             nan     0.1000   -0.0009

titanic.gbm.pred <- predict(titanic.gbm,newdata=titanicTest)

figure4.11.df <- data.frame(head(titanicTest$Survived.yes),
                            head(titanic.gbm.pred))
kable(figure4.11.df,
      col.names = c("Test set labels", "Predictions"))

Test set labels	Predictions
yes	no
no	no
no	no
yes	no
yes	yes
no	no

Figure 4.13 Organizing the class-wise accuracy into a confusion matrix

titanic.gbm.cm <- confusionMatrix(titanic.gbm.pred,titanicTest$Survived.yes)
titanic.gbm.cm

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction no yes
##        no  98  31
##        yes 11  37
##                                           
##                Accuracy : 0.7627          
##                  95% CI : (0.6931, 0.8233)
##     No Information Rate : 0.6158          
##     P-Value [Acc > NIR] : 2.393e-05       
##                                           
##                   Kappa : 0.4692          
##  Mcnemar's Test P-Value : 0.00337         
##                                           
##             Sensitivity : 0.8991          
##             Specificity : 0.5441          
##          Pos Pred Value : 0.7597          
##          Neg Pred Value : 0.7708          
##              Prevalence : 0.6158          
##          Detection Rate : 0.5537          
##    Detection Prevalence : 0.7288          
##       Balanced Accuracy : 0.7216          
##                                           
##        'Positive' Class : no              
##

Figure 4.15 A subset of probabilistic predictions from the Titanic test set. After sorting the full table by decreasing survival probability, you can set a threshold and consider all rows above this threshold as survived. Note that the indices are maintained so you know which original row the instance refers to.

titanic.gbm.pred.prob <- predict(object=titanic.gbm, 
                                 titanicTest, type='prob')
kable(titanic.gbm.pred.prob[15:19,],col.names = c("Died","Survived"))

	Died	Survived
15	0.8774734	0.1225266
16	0.8842786	0.1157214
17	0.8166056	0.1833944
18	0.8774734	0.1225266
19	0.8989089	0.1010911

titanic.ordered <- titanic.gbm.pred.prob[order(-titanic.gbm.pred.prob$yes),]
kable(titanic.ordered[titanic.ordered$yes > 0.68 & titanic.ordered$yes < 0.73,])

	no	yes
137	0.2851770	0.7148230
25	0.3000107	0.6999893
63	0.3190296	0.6809704

Listing 4.3 The ROC curve

# Returns the false-positive and true-positive rates at nPoints thresholds for 
# the given true and predicted labels
# trueLabels: 0=FALSE; 1=TRUE
rocCurve <- function(trueLabels, predictedProbs, nPoints=100, posClass=1){
  # Allocates the threshold and ROC lists
  thr <- seq(0,1,length=nPoints)
  tpr <- numeric(nPoints)
  fpr <- numeric(nPoints)
  
  # Precalculates values for the positive and negative cases, used in the loop
  pos <- trueLabels == posClass
  neg <- !pos 
  nPos <- sum(pos, na.rm=TRUE)
  nNeg <- sum(neg, na.rm=TRUE)
  
  # For each threshold, calculates the rate of true and false positives
  for (i in 1:length(thr)) {
    t <- thr[i]
    meetOrExceedThreshold <- predictedProbs >= t
    tpr[i] <-  sum((meetOrExceedThreshold & pos), na.rm=TRUE) / nPos
    fpr[i] <-  sum((meetOrExceedThreshold & neg), na.rm=TRUE) / nNeg
  }
  
  # Create data frame without duplicated fpr's to return
  duplicatedFPRs <- duplicated(fpr)
  df <- data.frame(fpr=fpr[!duplicatedFPRs],tpr=tpr[!duplicatedFPRs],thr=thr[!duplicatedFPRs])
  
  return(df)
}

Figure 4.16 The ROC curve defined by calculating the confusion matrix and ROC metrics at 100 threshold points from 0 to 1. By convention, you plot the false-positive rate on the x-axis and the true-positive rate on the y-axis.

df<-rocCurve(revalue(titanicTest$Survived.yes, c("no" = 0, "yes" = 1)),
             titanic.gbm.pred.prob$yes)
ggplot(df,aes(x=fpr,y=tpr)) +
  geom_step(direction="vh") +
  labs(x = "False-positive rate",
       y = "True-positive rate")

Listing 4.4 The area under the ROC curve

auc <- function(trueLabels, predictedProbs, nPoints=100, posClass=1){
  auc <- 0
  df <- rocCurve(trueLabels = trueLabels,
                 predictedProbs = predictedProbs,
                 nPoints = nPoints,
                 posClass = posClass)
  
  for (i in 2:length(df$fpr)) {
    auc <- auc + 0.5 * (df$fpr[i-1] - df$fpr[i]) * (df$tpr[i-1] + df$tpr[i])
  }
  
  return(auc)
}

Bonus. Relative influence of variables in model.

We can call the summary function on our model to get a data frame of variables (var) and their relative influence (rel.inf) in the model. Before plotting the data, we reorder by the rel.inf variable.

gbmSummary <- summary(titanic.gbm)

gbmSummary <- transform(gbmSummary,
                        var = reorder(var,rel.inf))
ggplot(data=gbmSummary, aes(var, rel.inf)) +
  geom_bar(stat="identity") +
  coord_flip() + 
  labs(x="Relative Influence",
       y="Variable")

Figure 4.18 Handwritten digits in the MNIST dataset

Thanks to Longhow Lam for posting the code used in the displayMnistSamples function that display’s digits from the MNIST dataset.

mnist <- read.csv("../data/mnist_small.csv",
                  colClasses = c(label = "factor"))
displayMnistSamples <- function(x) {
  for(i in x){
  y = as.matrix(mnist[i, 2:785])
  dim(y) = c(28, 28)
  image( y[,nrow(y):1], axes = FALSE, col = gray(0:255 / 255))
  text( 0.2, 0, mnist[i,1], cex = 3, col = 2, pos = c(3,4))
  }
}
par( mfrow = c(4,5), mai = c(0,0,0.1,0.1))
displayMnistSamples(sample(1:length(mnist),20))

levels(mnist$label) <- make.names(levels(mnist$label))

Figure 4.19 The confusion matrix for the 10-class MNIST handwritten digit classification problem

trainIndex <- createDataPartition(mnist$label, p = .8, 
                                  list = FALSE, 
                                  times = 1)

mnistTrain <- mnist[ trainIndex,]
mnistTest  <- mnist[-trainIndex,]

cl = makeCluster(4)
registerDoParallel(cl)

str(mnistTrain$label)

##  Factor w/ 10 levels "X0","X1","X2",..: 2 2 5 1 1 8 6 9 10 2 ...

myTrainingControl <- trainControl(savePredictions = TRUE, 
                              classProbs = TRUE, 
                              verboseIter = FALSE)

mnist.rf <- train(label~., data=mnistTrain, method='rf',
                  ntree = 50,
                  trControl = myTrainingControl,
                  probability=TRUE, allowParallel=TRUE) 

mnist.rf.pred = predict(mnist.rf,newdata=mnistTest)

confusion.matrix <- confusionMatrix(mnist.rf.pred,mnistTest$label)

confusionDF <- data.frame(confusion.matrix$table)

confusionDF$Reference = with(confusionDF, 
                             factor(Reference, levels = rev(levels(Reference))))

jBuPuFun <- colorRampPalette(brewer.pal(n = 9, "BuPu"))
paletteSize <- 256
jBuPuPalette <- jBuPuFun(paletteSize)

confusionPlot <- ggplot(
  confusionDF, aes(x = Prediction, y = Reference, fill = Freq)) +
  #theme(axis.text.x = element_text(angle = 0, hjust = 1, vjust = 0.5)) +
  geom_tile() +
  labs(x = "Predicted digit", y = "Actual digit") +
  scale_fill_gradient2(
    low = jBuPuPalette[1],
    mid = jBuPuPalette[paletteSize/2],
    high = jBuPuPalette[paletteSize],
    midpoint = (max(confusionDF$Freq) + min(confusionDF$Freq)) / 2,
    name = "") +
  theme(legend.key.height = unit(2, "cm"))
confusionPlot

Figure 4.20 The ROC curves for each class of the MNIST 10-class classifier

rocDF<-NULL
aucDF<-NULL
mnist.rf.pred.prob <- predict(object=mnist.rf,
                              mnistTest, type='prob')
for (i in 0:9){
  digitLabel = paste0("X",i)
  trueLabels <- as.numeric(mnistTest$label==digitLabel)
  predictedProbs <- mnist.rf.pred.prob[[digitLabel]]
  rocDF <- rbind(rocDF, 
                 data.frame(rocCurve(trueLabels = trueLabels, predictedProbs = predictedProbs),digit=i))
  aucDF <- rbind(aucDF, 
                 data.frame(auc=auc(trueLabels = trueLabels, predictedProbs = predictedProbs),digit=i))
}
rocDF[,'digit'] <- as.factor(rocDF[,'digit'])
labelVector <- c(paste0("Digit ",0:9,", AUC ",round(aucDF$auc,3)))  
ggplot(rocDF[rocDF$fpr<0.2,],aes(x=fpr,y=tpr, linetype=digit, colour=digit)) +
  geom_line() +
  labs(x = "False-positive rate",
       y = "True-positive rate") +
  scale_linetype_discrete(name=NULL,
                          labels=labelVector, 
                          guide = guide_legend(keywidth = 3)) +
  scale_colour_discrete(name=NULL,
                        labels=labelVector,
                        guide = guide_legend(keywidth = 3)) +
  theme(legend.position=c(.8, .40))

Figure 4.21 A subset of the Auto MPG dataset

auto <- read.csv("../data/auto-mpg.csv",
                 colClasses = c(
                      origin = "factor"))

auto$origin <- revalue(auto$origin, 
                       c("1\t"="USA", "2\t"="Europe", "3\t"="Asia"))

kable(head(auto,5))

mpg	cylinders	displacement	horsepower	weight	acceleration	modelyear	origin
18	8	307	130	3504	12.0	70	USA
15	8	350	165	3693	11.5	70	USA
18	8	318	150	3436	11.0	70	USA
16	8	304	150	3433	12.0	70	USA
17	8	302	140	3449	10.5	70	USA

Figure 4.22 Scatter plot of the predicted MPG versus actual values from the testing set. The diagonal line shows the optimal model.

dummies <- dummyVars(" ~ .", data = auto, fullRank = TRUE)
autoNumeric <- data.frame(predict(dummies, newdata = auto))

trainIndex <- createDataPartition(autoNumeric$mpg, p = .8, 
                                  list = FALSE, 
                                  times = 1)

autoTrain <- autoNumeric[ trainIndex,]
autoTest  <- autoNumeric[-trainIndex,]

ctrl <- trainControl(method = "repeatedcv", number = 10, savePredictions = TRUE)

auto.glm <- train(mpg ~.,  
                 data=autoTrain, 
                 method="glm", 
                 trControl = ctrl, 
                 tuneLength = 5)

auto.glm.Pred <- predict(auto.glm, newdata=autoTest)

ggplot(autoTest, aes(x=mpg, y=auto.glm.Pred)) + 
  geom_point() + 
  geom_abline(slope = 1, intercept = 0) +
  labs(x="MPG", y="Predicted MPG")

Listing 4.5 The root-mean-square error

rmse <- function(trueValues, predictedValues){
  return(sqrt(sum((trueValues - predictedValues)**2)/length(trueValues)))
}
rmse(autoTest$mpg, auto.glm.Pred)

## [1] 3.378211

Listing 4.6 The R-squared calculation

r2 <- function(trueValues, predictedValues) {
  meanTrueValues <- mean(trueValues)
  return( 1.0 - (sum((trueValues - predictedValues)**2) / sum((trueValues - meanTrueValues)**2) ))
}
r2(autoTest$mpg, auto.glm.Pred)

## [1] 0.8029198

Figure 4.24 The residual plot from predictions on the MPG dataset.

autoTest$residuals <- autoTest$mpg - auto.glm.Pred

ggplot(data=autoTest, aes(x=mpg, y=residuals)) +
  geom_point() + 
  geom_abline(slope = 0, intercept = 0) +
  labs(x="MPG", y="Residuals")

(Substitute for Figure 4.25) Tile plots for Support Vector Machine (SVM) tuning parameter grid search

The getModelInfo function in the caret package can be used to find the available model parameters. By setting summaryFunction = twoClassSummary in trainControl, we request ROC to be used to select the optimal model using the largest value.

svmRadialModelInfo <- getModelInfo("svmRadial")
svmRadialModelInfo$svmRadial$parameters

##   parameter   class label
## 1     sigma numeric Sigma
## 2         C numeric  Cost

svmGrid <- expand.grid(sigma = c(0.0001,0.001,0.01,0.1),
                       C = seq(0.1,2.1,by=0.5)) 
                         
fitControl <- trainControl(method = "repeatedcv", 
                           number = 10, 
                           savePredictions = TRUE,
                           summaryFunction=twoClassSummary, 
                           classProbs=TRUE)

svmFit <- train(Survived.yes ~ ., data = titanicTrain,
                method = "svmRadial", 
                trControl = fitControl, 
                verbose = FALSE, 
                tuneGrid = svmGrid)

## Loading required package: kernlab

## 
## Attaching package: 'kernlab'

## The following object is masked from 'package:ggplot2':
## 
##     alpha

## Warning in train.default(x, y, weights = w, ...): The metric "Accuracy" was
## not in the result set. ROC will be used instead.

svmFitDF <- data.frame(sigma = as.factor(svmFit$results$sigma), 
                       C = as.factor(svmFit$results$C),
                       ROC = svmFit$results$ROC)

ggplot(svmFitDF, aes(x=sigma, y=C)) +
  geom_tile(aes(fill=ROC))

svmFit$bestTune

##    sigma   C
## 14  0.01 1.6

svmGrid <- expand.grid(sigma = c(5:1 %o% 10^(-3:-1)),
                       C = seq(0.1,2.1,by=0.25)) 
                         
fitControl <- trainControl(method = "repeatedcv", 
                           number = 10, 
                           savePredictions = TRUE,
                           summaryFunction=twoClassSummary, 
                           classProbs=TRUE)

svmFit <- train(Survived.yes ~ ., data = titanicTrain,
                method = "svmRadial", 
                trControl = fitControl, 
                verbose = FALSE, 
                tuneGrid = svmGrid)

## Warning in train.default(x, y, weights = w, ...): The metric "Accuracy" was
## not in the result set. ROC will be used instead.

svmFitDF <- data.frame(sigma = as.factor(svmFit$results$sigma), 
                       C = as.factor(svmFit$results$C),
                       ROC = svmFit$results$ROC)

ggplot(svmFitDF, aes(x=sigma, y=C)) +
  geom_tile(aes(fill=ROC))

svmFit$bestTune

##     sigma    C
## 121   0.4 0.85

svmFit$results[row.names(svmFit$bestTune),]

##     sigma    C       ROC      Sens      Spec      ROCSD     SensSD
## 121   0.4 0.85 0.8638483 0.9136364 0.6691799 0.05090404 0.06759002
##         SpecSD
## 121 0.08247496