This notebook contains R code to accompany Chapter 4 of the book “Real-World Machine Learning”, by Henrik Brink, Joseph W. Richards, and Mark Fetherolf.
The code was contributed by Paul Adamson.

NOTE: working directory should be set to this file’s location.

NOTE: depending on your machine, you may need to adjust the parameter in the call to makeCluster() in the code to generate Figure 4.19

Figure 4.9 The first five rows of the Titanic Passengers dataset

As in Chapter 3, we are going to be interested in predicting survival, so again, it is useful to specify the Survived variable to be of type factor. For visualizing the data, it is also useful to use the revalue function to specify the no and yes levels for the factor variable. The kable function is built into the knitr package.

titanic <- read.csv("../data/titanic.csv", 
                    colClasses = c(
                      Survived = "factor",
                      Name = "character",
                      Ticket = "character",
                      Cabin = "character"))
titanic$Survived <- revalue(titanic$Survived, c("0"="no", "1"="yes"))
kable(head(titanic, 5), digits=2)
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
1 no 3 Braund, Mr. Owen Harris male 22 1 0 A/5 21171 7.25 S
2 yes 1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1 0 PC 17599 71.28 C85 C
3 yes 3 Heikkinen, Miss. Laina female 26 0 0 STON/O2. 3101282 7.92 S
4 yes 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0 113803 53.10 C123 S
5 no 3 Allen, Mr. William Henry male 35 0 0 373450 8.05 S

Figure 4.10 Splitting the full dataset into training and testing sets

Here, we follow the same process used for Figure 3.6 to process the data and prepare it for our model. First, we get rid of the variables that we do not want in our model. (Cabin might actually be useful, but it’s not used here.) Then we use is.na to set missing age values to -1. The mutate and select functions make it easy to take square root of the Fare variable and then drop it from the dataset. We then drop rows with missing Embarked data and remove the unused level "". Finally, we convert factor variables to dummy variables using the dummyVars function in the caret package. To avoid perfect collinearity (a.k.a. the dummy variable trap), we set the fullRank parameter to TRUE. Survived.yes is then converted back to a factor variable, and its levels are changed back to ‘yes’ and ‘no’ again (otherwise the train function in caret will complain later about invalid R variable names).

We then make a 80%/20% train/test split using the Survived factor variable in the createDataPartition function to preserve the overall class distribution of the data.

titanicTidy <- subset(titanic, select = -c(PassengerId, Name, Ticket, Cabin))

titanicTidy$Age[is.na(titanicTidy$Age)] <- -1

titanicTidy <- titanicTidy %>%
  mutate(sqrtFare = sqrt(Fare)) %>%
  select(-Fare)

titanicTidy <- titanicTidy %>%
  filter(!(Embarked=="")) %>%
  droplevels

dummies <- dummyVars(" ~ .", data = titanicTidy, fullRank = TRUE)
titanicTidyNumeric <- data.frame(predict(dummies, newdata = titanicTidy))

titanicTidyNumeric$Survived.yes <- factor(titanicTidyNumeric$Survived.yes)
titanicTidyNumeric$Survived.yes <- 
  revalue(titanicTidyNumeric$Survived.yes, c("0"="no", "1"="yes"))

trainIndex <- createDataPartition(titanicTidyNumeric$Survived.yes, p = .8, 
                                  list = FALSE, 
                                  times = 1)

titanicTrain <- titanicTidyNumeric[ trainIndex,]
titanicTest  <- titanicTidyNumeric[-trainIndex,]

kable(head(titanicTidyNumeric, 8), digits=2, caption = "Full dataset")
Full dataset
Survived.yes Pclass Sex.male Age SibSp Parch Embarked.Q Embarked.S sqrtFare
no 3 1 22 1 0 0 1 2.69
yes 1 0 38 1 0 0 0 8.44
yes 3 0 26 0 0 0 1 2.82
yes 1 0 35 1 0 0 1 7.29
no 3 1 35 0 0 0 1 2.84
no 3 1 -1 0 0 1 0 2.91
no 1 1 54 0 0 0 1 7.20
no 3 1 2 3 1 0 1 4.59
kable(head(titanicTrain, 5), digits=2, 
      caption = "Training set: used only for building the model")
Training set: used only for building the model
Survived.yes Pclass Sex.male Age SibSp Parch Embarked.Q Embarked.S sqrtFare
1 no 3 1 22 1 0 0 1 2.69
2 yes 1 0 38 1 0 0 0 8.44
4 yes 1 0 35 1 0 0 1 7.29
6 no 3 1 -1 0 0 1 0 2.91
8 no 3 1 2 3 1 0 1 4.59
kable(head(titanicTest, 3), digits=2,
      caption = "Testing set: used only for evaluating model")
Testing set: used only for evaluating model
Survived.yes Pclass Sex.male Age SibSp Parch Embarked.Q Embarked.S sqrtFare
3 yes 3 0 26 0 0 0 1 2.82
5 no 3 1 35 0 0 0 1 2.84
7 no 1 1 54 0 0 0 1 7.20

Figure 4.11 Comparing the testing set predictions with the actual values gives you the accuracy of the model.

objControl <- trainControl(method='cv', number=3, 
                           returnResamp='none', 
                           summaryFunction = twoClassSummary, 
                           classProbs = TRUE)
titanic.gbm <- train(Survived.yes~.,data=titanicTrain, method='gbm',
                     trControl=objControl,  
                     metric = "ROC",
                     preProc = c("center", "scale"))
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.2829             nan     0.1000    0.0267
##      2        1.2449             nan     0.1000    0.0221
##      3        1.2100             nan     0.1000    0.0183
##      4        1.1789             nan     0.1000    0.0150
##      5        1.1536             nan     0.1000    0.0118
##      6        1.1309             nan     0.1000    0.0119
##      7        1.1069             nan     0.1000    0.0084
##      8        1.0878             nan     0.1000    0.0094
##      9        1.0685             nan     0.1000    0.0059
##     10        1.0511             nan     0.1000    0.0077
##     20        0.9584             nan     0.1000    0.0029
##     40        0.8959             nan     0.1000    0.0006
##     60        0.8650             nan     0.1000   -0.0007
##     80        0.8451             nan     0.1000   -0.0011
##    100        0.8313             nan     0.1000   -0.0009
##    120        0.8190             nan     0.1000   -0.0004
##    140        0.8084             nan     0.1000   -0.0011
##    150        0.8040             nan     0.1000   -0.0002
## 
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.2611             nan     0.1000    0.0370
##      2        1.2119             nan     0.1000    0.0222
##      3        1.1597             nan     0.1000    0.0254
##      4        1.1206             nan     0.1000    0.0206
##      5        1.0899             nan     0.1000    0.0145
##      6        1.0610             nan     0.1000    0.0138
##      7        1.0398             nan     0.1000    0.0093
##      8        1.0201             nan     0.1000    0.0082
##      9        1.0005             nan     0.1000    0.0082
##     10        0.9818             nan     0.1000    0.0084
##     20        0.8941             nan     0.1000    0.0015
##     40        0.8137             nan     0.1000   -0.0015
##     60        0.7734             nan     0.1000   -0.0021
##     80        0.7347             nan     0.1000   -0.0003
##    100        0.7128             nan     0.1000   -0.0013
##    120        0.6909             nan     0.1000   -0.0009
##    140        0.6715             nan     0.1000   -0.0015
##    150        0.6638             nan     0.1000   -0.0006
## 
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.2500             nan     0.1000    0.0379
##      2        1.1871             nan     0.1000    0.0275
##      3        1.1405             nan     0.1000    0.0234
##      4        1.0983             nan     0.1000    0.0180
##      5        1.0611             nan     0.1000    0.0151
##      6        1.0295             nan     0.1000    0.0131
##      7        1.0103             nan     0.1000    0.0071
##      8        0.9879             nan     0.1000    0.0093
##      9        0.9655             nan     0.1000    0.0088
##     10        0.9507             nan     0.1000    0.0032
##     20        0.8474             nan     0.1000   -0.0007
##     40        0.7605             nan     0.1000   -0.0008
##     60        0.7054             nan     0.1000   -0.0011
##     80        0.6673             nan     0.1000   -0.0005
##    100        0.6386             nan     0.1000   -0.0010
##    120        0.6104             nan     0.1000   -0.0023
##    140        0.5835             nan     0.1000   -0.0026
##    150        0.5735             nan     0.1000   -0.0014
## 
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.2705             nan     0.1000    0.0317
##      2        1.2223             nan     0.1000    0.0252
##      3        1.1792             nan     0.1000    0.0201
##      4        1.1463             nan     0.1000    0.0168
##      5        1.1151             nan     0.1000    0.0131
##      6        1.0923             nan     0.1000    0.0107
##      7        1.0742             nan     0.1000    0.0082
##      8        1.0595             nan     0.1000    0.0068
##      9        1.0431             nan     0.1000    0.0056
##     10        1.0328             nan     0.1000    0.0039
##     20        0.9487             nan     0.1000    0.0021
##     40        0.8849             nan     0.1000    0.0007
##     60        0.8619             nan     0.1000   -0.0006
##     80        0.8467             nan     0.1000   -0.0012
##    100        0.8327             nan     0.1000   -0.0003
##    120        0.8207             nan     0.1000   -0.0002
##    140        0.8100             nan     0.1000   -0.0009
##    150        0.8047             nan     0.1000   -0.0015
## 
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.2554             nan     0.1000    0.0378
##      2        1.2009             nan     0.1000    0.0263
##      3        1.1503             nan     0.1000    0.0236
##      4        1.1112             nan     0.1000    0.0199
##      5        1.0788             nan     0.1000    0.0149
##      6        1.0543             nan     0.1000    0.0122
##      7        1.0288             nan     0.1000    0.0112
##      8        1.0063             nan     0.1000    0.0068
##      9        0.9849             nan     0.1000    0.0089
##     10        0.9692             nan     0.1000    0.0055
##     20        0.8778             nan     0.1000    0.0012
##     40        0.8128             nan     0.1000   -0.0015
##     60        0.7763             nan     0.1000   -0.0002
##     80        0.7448             nan     0.1000   -0.0012
##    100        0.7239             nan     0.1000   -0.0004
##    120        0.6985             nan     0.1000   -0.0016
##    140        0.6854             nan     0.1000   -0.0024
##    150        0.6745             nan     0.1000   -0.0016
## 
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.2562             nan     0.1000    0.0354
##      2        1.1911             nan     0.1000    0.0313
##      3        1.1429             nan     0.1000    0.0235
##      4        1.0975             nan     0.1000    0.0185
##      5        1.0635             nan     0.1000    0.0172
##      6        1.0321             nan     0.1000    0.0158
##      7        1.0064             nan     0.1000    0.0105
##      8        0.9826             nan     0.1000    0.0115
##      9        0.9648             nan     0.1000    0.0076
##     10        0.9458             nan     0.1000    0.0073
##     20        0.8404             nan     0.1000    0.0001
##     40        0.7443             nan     0.1000   -0.0004
##     60        0.6926             nan     0.1000   -0.0012
##     80        0.6523             nan     0.1000   -0.0028
##    100        0.6254             nan     0.1000   -0.0008
##    120        0.5947             nan     0.1000   -0.0013
##    140        0.5749             nan     0.1000   -0.0016
##    150        0.5636             nan     0.1000   -0.0040
## 
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.2610             nan     0.1000    0.0315
##      2        1.2115             nan     0.1000    0.0245
##      3        1.1743             nan     0.1000    0.0195
##      4        1.1439             nan     0.1000    0.0167
##      5        1.1141             nan     0.1000    0.0138
##      6        1.0846             nan     0.1000    0.0101
##      7        1.0684             nan     0.1000    0.0066
##      8        1.0509             nan     0.1000    0.0043
##      9        1.0339             nan     0.1000    0.0075
##     10        1.0229             nan     0.1000    0.0031
##     20        0.9374             nan     0.1000    0.0027
##     40        0.8699             nan     0.1000   -0.0005
##     60        0.8372             nan     0.1000   -0.0009
##     80        0.8137             nan     0.1000   -0.0006
##    100        0.7969             nan     0.1000   -0.0010
##    120        0.7822             nan     0.1000   -0.0004
##    140        0.7700             nan     0.1000   -0.0015
##    150        0.7628             nan     0.1000   -0.0001
## 
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.2613             nan     0.1000    0.0335
##      2        1.1967             nan     0.1000    0.0329
##      3        1.1451             nan     0.1000    0.0203
##      4        1.1046             nan     0.1000    0.0211
##      5        1.0695             nan     0.1000    0.0160
##      6        1.0352             nan     0.1000    0.0142
##      7        1.0115             nan     0.1000    0.0117
##      8        0.9934             nan     0.1000    0.0076
##      9        0.9756             nan     0.1000    0.0082
##     10        0.9605             nan     0.1000    0.0073
##     20        0.8624             nan     0.1000    0.0018
##     40        0.7784             nan     0.1000   -0.0004
##     60        0.7261             nan     0.1000   -0.0005
##     80        0.6906             nan     0.1000   -0.0011
##    100        0.6615             nan     0.1000   -0.0000
##    120        0.6430             nan     0.1000   -0.0005
##    140        0.6254             nan     0.1000   -0.0017
##    150        0.6172             nan     0.1000   -0.0009
## 
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.2504             nan     0.1000    0.0385
##      2        1.1850             nan     0.1000    0.0348
##      3        1.1308             nan     0.1000    0.0254
##      4        1.0834             nan     0.1000    0.0223
##      5        1.0510             nan     0.1000    0.0136
##      6        1.0161             nan     0.1000    0.0147
##      7        0.9896             nan     0.1000    0.0125
##      8        0.9726             nan     0.1000    0.0065
##      9        0.9475             nan     0.1000    0.0100
##     10        0.9254             nan     0.1000    0.0068
##     20        0.8160             nan     0.1000    0.0029
##     40        0.7086             nan     0.1000   -0.0001
##     60        0.6494             nan     0.1000   -0.0012
##     80        0.6142             nan     0.1000   -0.0001
##    100        0.5797             nan     0.1000   -0.0001
##    120        0.5551             nan     0.1000   -0.0009
##    140        0.5337             nan     0.1000   -0.0014
##    150        0.5207             nan     0.1000   -0.0009
## 
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.2481             nan     0.1000    0.0367
##      2        1.1876             nan     0.1000    0.0323
##      3        1.1344             nan     0.1000    0.0246
##      4        1.0914             nan     0.1000    0.0205
##      5        1.0531             nan     0.1000    0.0148
##      6        1.0265             nan     0.1000    0.0119
##      7        1.0007             nan     0.1000    0.0128
##      8        0.9764             nan     0.1000    0.0110
##      9        0.9584             nan     0.1000    0.0070
##     10        0.9441             nan     0.1000    0.0057
##     20        0.8450             nan     0.1000    0.0004
##     40        0.7663             nan     0.1000   -0.0001
##     50        0.7378             nan     0.1000   -0.0009
titanic.gbm.pred <- predict(titanic.gbm,newdata=titanicTest)

figure4.11.df <- data.frame(head(titanicTest$Survived.yes),
                            head(titanic.gbm.pred))
kable(figure4.11.df,
      col.names = c("Test set labels", "Predictions"))
Test set labels Predictions
yes no
no no
no no
yes no
yes yes
no no

Figure 4.13 Organizing the class-wise accuracy into a confusion matrix

titanic.gbm.cm <- confusionMatrix(titanic.gbm.pred,titanicTest$Survived.yes)
titanic.gbm.cm
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction no yes
##        no  98  31
##        yes 11  37
##                                           
##                Accuracy : 0.7627          
##                  95% CI : (0.6931, 0.8233)
##     No Information Rate : 0.6158          
##     P-Value [Acc > NIR] : 2.393e-05       
##                                           
##                   Kappa : 0.4692          
##  Mcnemar's Test P-Value : 0.00337         
##                                           
##             Sensitivity : 0.8991          
##             Specificity : 0.5441          
##          Pos Pred Value : 0.7597          
##          Neg Pred Value : 0.7708          
##              Prevalence : 0.6158          
##          Detection Rate : 0.5537          
##    Detection Prevalence : 0.7288          
##       Balanced Accuracy : 0.7216          
##                                           
##        'Positive' Class : no              
## 

Figure 4.15 A subset of probabilistic predictions from the Titanic test set. After sorting the full table by decreasing survival probability, you can set a threshold and consider all rows above this threshold as survived. Note that the indices are maintained so you know which original row the instance refers to.

titanic.gbm.pred.prob <- predict(object=titanic.gbm, 
                                 titanicTest, type='prob')
kable(titanic.gbm.pred.prob[15:19,],col.names = c("Died","Survived"))
Died Survived
15 0.8774734 0.1225266
16 0.8842786 0.1157214
17 0.8166056 0.1833944
18 0.8774734 0.1225266
19 0.8989089 0.1010911
titanic.ordered <- titanic.gbm.pred.prob[order(-titanic.gbm.pred.prob$yes),]
kable(titanic.ordered[titanic.ordered$yes > 0.68 & titanic.ordered$yes < 0.73,])
no yes
137 0.2851770 0.7148230
25 0.3000107 0.6999893
63 0.3190296 0.6809704

Listing 4.3 The ROC curve

# Returns the false-positive and true-positive rates at nPoints thresholds for 
# the given true and predicted labels
# trueLabels: 0=FALSE; 1=TRUE
rocCurve <- function(trueLabels, predictedProbs, nPoints=100, posClass=1){
  # Allocates the threshold and ROC lists
  thr <- seq(0,1,length=nPoints)
  tpr <- numeric(nPoints)
  fpr <- numeric(nPoints)
  
  # Precalculates values for the positive and negative cases, used in the loop
  pos <- trueLabels == posClass
  neg <- !pos 
  nPos <- sum(pos, na.rm=TRUE)
  nNeg <- sum(neg, na.rm=TRUE)
  
  # For each threshold, calculates the rate of true and false positives
  for (i in 1:length(thr)) {
    t <- thr[i]
    meetOrExceedThreshold <- predictedProbs >= t
    tpr[i] <-  sum((meetOrExceedThreshold & pos), na.rm=TRUE) / nPos
    fpr[i] <-  sum((meetOrExceedThreshold & neg), na.rm=TRUE) / nNeg
  }
  
  # Create data frame without duplicated fpr's to return
  duplicatedFPRs <- duplicated(fpr)
  df <- data.frame(fpr=fpr[!duplicatedFPRs],tpr=tpr[!duplicatedFPRs],thr=thr[!duplicatedFPRs])
  
  return(df)
}

Figure 4.16 The ROC curve defined by calculating the confusion matrix and ROC metrics at 100 threshold points from 0 to 1. By convention, you plot the false-positive rate on the x-axis and the true-positive rate on the y-axis.

df<-rocCurve(revalue(titanicTest$Survived.yes, c("no" = 0, "yes" = 1)),
             titanic.gbm.pred.prob$yes)
ggplot(df,aes(x=fpr,y=tpr)) +
  geom_step(direction="vh") +
  labs(x = "False-positive rate",
       y = "True-positive rate")

Listing 4.4 The area under the ROC curve

auc <- function(trueLabels, predictedProbs, nPoints=100, posClass=1){
  auc <- 0
  df <- rocCurve(trueLabels = trueLabels,
                 predictedProbs = predictedProbs,
                 nPoints = nPoints,
                 posClass = posClass)
  
  for (i in 2:length(df$fpr)) {
    auc <- auc + 0.5 * (df$fpr[i-1] - df$fpr[i]) * (df$tpr[i-1] + df$tpr[i])
  }
  
  return(auc)
}

Bonus. Relative influence of variables in model.

We can call the summary function on our model to get a data frame of variables (var) and their relative influence (rel.inf) in the model. Before plotting the data, we reorder by the rel.inf variable.

gbmSummary <- summary(titanic.gbm)

gbmSummary <- transform(gbmSummary,
                        var = reorder(var,rel.inf))
ggplot(data=gbmSummary, aes(var, rel.inf)) +
  geom_bar(stat="identity") +
  coord_flip() + 
  labs(x="Relative Influence",
       y="Variable")

Figure 4.18 Handwritten digits in the MNIST dataset

Thanks to Longhow Lam for posting the code used in the displayMnistSamples function that display’s digits from the MNIST dataset.

mnist <- read.csv("../data/mnist_small.csv",
                  colClasses = c(label = "factor"))
displayMnistSamples <- function(x) {
  for(i in x){
  y = as.matrix(mnist[i, 2:785])
  dim(y) = c(28, 28)
  image( y[,nrow(y):1], axes = FALSE, col = gray(0:255 / 255))
  text( 0.2, 0, mnist[i,1], cex = 3, col = 2, pos = c(3,4))
  }
}
par( mfrow = c(4,5), mai = c(0,0,0.1,0.1))
displayMnistSamples(sample(1:length(mnist),20))