This file contains R code to accompany Chapter 3 of the book “Real-World Machine Learning”, by Henrik Brink, Joseph W. Richards, and Mark Fetherolf. The code was contributed by Paul Adamson.

NOTE: working directory should be set to this file’s location.

Figure 3.4 A subset of the Titanic Passengers dataset

We are interested in predicting survival, so it is useful to read the Survived variable in as a factor. For visualizing the data, it is also convenient to use the revalue function (from the plyr package) to relabel the factor levels as no and yes. The kable function, provided by the knitr package, renders a data frame as a table.

# packages used throughout this file (load plyr before dplyr)
library(plyr);    library(dplyr)
library(knitr);   library(caret)
library(vcd);     library(kknn)
library(ggplot2); library(gridExtra)
library(grid)

titanic <- read.csv("../data/titanic.csv", 
                    colClasses = c(
                      Survived = "factor",
                      Name = "character",
                      Ticket = "character",
                      Cabin = "character"))
titanic$Survived <- revalue(titanic$Survived, c("0"="no", "1"="yes"))
kable(head(titanic, 6), digits=2)
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | no | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.25 | | S |
| 2 | yes | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38 | 1 | 0 | PC 17599 | 71.28 | C85 | C |
| 3 | yes | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.92 | | S |
| 4 | yes | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.10 | C123 | S |
| 5 | no | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.05 | | S |
| 6 | no | 3 | Moran, Mr. James | male | NA | 0 | 0 | 330877 | 8.46 | | Q |

Figure 3.5 Mosaic plot for Titanic data: Gender vs. survival

The “Visualizing Categorical Data” package (vcd) provides an excellent set of functions for exploring categorical data, including mosaic plots.

mosaic(
  ~ Sex + Survived,
  data = titanic, 
  main = "Mosaic plot for Titanic data: Gender vs. survival",
  shade = TRUE,
  split_vertical = TRUE,
  labeling_args = list(
    set_varnames = c(
      Survived = "Survived?",
      Sex = "Gender")))
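The counts that the mosaic plot visualizes can also be inspected directly with a cross-tabulation. A minimal base-R sketch, using a small synthetic stand-in for the titanic data frame (the shape of the result is what matters here, not the numbers):

```r
# synthetic stand-in with the same two factor columns as the titanic data
toy <- data.frame(
  Sex      = factor(c("male", "male", "female", "female", "female", "male")),
  Survived = factor(c("no", "no", "yes", "yes", "no", "yes"))
)

# raw counts underlying a Sex-vs-Survived mosaic
counts <- with(toy, table(Sex, Survived))
print(counts)

# proportions within each gender (rows sum to 1)
print(prop.table(counts, margin = 1))
```

The areas of the mosaic tiles are proportional to these cell counts, which is why the gender imbalance in survival is visible at a glance.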

Figure 3.6 Processed data

First, we drop the variables that we do not want in our model (Cabin might actually be useful, but it is not used here). Then we use is.na to set missing Age values to -1. The mutate and select functions from dplyr make it easy to take the square root of the Fare variable and then drop the original column from the dataset. We then drop rows with missing Embarked data and remove the now-unused level "". Finally, we convert the factor variables to dummy variables using the dummyVars function in the caret package. To avoid perfect collinearity (a.k.a. the dummy variable trap), we set the fullRank parameter to TRUE. Survived.yes is then converted back to a factor variable.

titanicTidy <- subset(titanic, select = -c(PassengerId, Name, Ticket, Cabin))

titanicTidy$Age[is.na(titanicTidy$Age)] <- -1

titanicTidy <- titanicTidy %>%
  mutate(sqrtFare = sqrt(Fare)) %>%
  select(-Fare)

titanicTidy <- titanicTidy %>%
  filter(!(Embarked=="")) %>%
  droplevels

dummies <- dummyVars(" ~ .", data = titanicTidy, fullRank = TRUE)
titanicTidyNumeric <- data.frame(predict(dummies, newdata = titanicTidy))

titanicTidyNumeric$Survived.yes <- factor(titanicTidyNumeric$Survived.yes)
kable(head(titanicTidyNumeric))
| Survived.yes | Pclass | Sex.male | Age | SibSp | Parch | Embarked.Q | Embarked.S | sqrtFare |
|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 1 | 22 | 1 | 0 | 0 | 1 | 2.692582 |
| 1 | 1 | 0 | 38 | 1 | 0 | 0 | 0 | 8.442944 |
| 1 | 3 | 0 | 26 | 0 | 0 | 0 | 1 | 2.815138 |
| 1 | 1 | 0 | 35 | 1 | 0 | 0 | 1 | 7.286975 |
| 0 | 3 | 1 | 35 | 0 | 0 | 0 | 1 | 2.837252 |
| 0 | 3 | 1 | -1 | 0 | 0 | 1 | 0 | 2.908316 |
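The effect of fullRank = TRUE can be illustrated with base R's model.matrix, which applies the same treatment contrasts. A hedged sketch using a hypothetical three-level factor (not the actual titanic columns):

```r
# hypothetical factor with the same three levels as Embarked
port <- factor(c("C", "Q", "S", "S", "C"))

# full-rank encoding: one level (C) becomes the baseline, so only
# two dummy columns are produced alongside the intercept
full_rank <- model.matrix(~ port)
print(colnames(full_rank))  # "(Intercept)" "portQ" "portS"

# one-dummy-per-level encoding: the three columns always sum to 1,
# which is perfectly collinear with the intercept -- the "trap"
one_hot <- model.matrix(~ port - 1)
print(rowSums(one_hot))     # all 1
```

This is why the processed table above has only Embarked.Q and Embarked.S: a row with zeros in both columns is implicitly Embarked = C.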

Figure 3.10 Four randomly chosen handwritten digits from the MNIST database

Thanks to Longhow Lam for posting the code used in the displayMnistSamples function, which displays digits from the MNIST dataset.

mnist <- read.csv("../data/mnist_small.csv",
                  colClasses = c(label = "factor"))
displayMnistSamples <- function(x) {
  for (i in x) {
    # each row stores one 28x28 image in columns 2:785
    y <- as.matrix(mnist[i, 2:785])
    dim(y) <- c(28, 28)
    # reverse the columns so the digit is drawn right-side up
    image(y[, nrow(y):1], axes = FALSE, col = gray(0:255 / 255))
    # overlay the true label from column 1
    text(0.2, 0, mnist[i, 1], cex = 3, col = 2, pos = c(3, 4))
  }
}
par(mfrow = c(1, 4), mai = c(0, 0, 0, 0.1))
# sample rows, not columns: length(mnist) would count the 786 columns
displayMnistSamples(sample(1:nrow(mnist), 4))

Figure 3.11 Table of predicted probabilities from a k-nearest neighbors classifier, as applied to the MNIST dataset

trainIndex <- createDataPartition(mnist$label, p = .8, 
                                  list = FALSE, 
                                  times = 1)
mnistTrain <- mnist[ trainIndex,]
mnistTest  <- mnist[-trainIndex,]

mnist.kknn <- kknn(label~., mnistTrain, mnistTest, distance = 1,
                   kernel = "triangular")

confusionMatrix(fitted(mnist.kknn), mnistTest$label)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1  2  3  4  5  6  7  8  9
##          0 20  0  1  0  0  0  1  0  0  0
##          1  0 19  2  0  0  1  0  0  2  1
##          2  0  0 20  0  0  0  0  0  0  0
##          3  0  0  0 16  0  0  0  0  3  0
##          4  0  0  0  0 12  0  1  0  1  2
##          5  0  0  0  1  0 14  2  0  0  0
##          6  1  0  0  0  1  1 15  0  1  0
##          7  0  0  0  1  0  0  0 18  0  3
##          8  0  0  1  0  0  0  0  0 11  0
##          9  0  0  0  0  7  1  0  3  0 13
## 
## Overall Statistics
##                                          
##                Accuracy : 0.8061         
##                  95% CI : (0.7437, 0.859)
##     No Information Rate : 0.1224         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.7844         
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity            0.9524  1.00000   0.8333  0.88889  0.60000  0.82353
## Specificity            0.9886  0.96610   1.0000  0.98315  0.97727  0.98324
## Pos Pred Value         0.9091  0.76000   1.0000  0.84211  0.75000  0.82353
## Neg Pred Value         0.9943  1.00000   0.9773  0.98870  0.95556  0.98324
## Prevalence             0.1071  0.09694   0.1224  0.09184  0.10204  0.08673
## Detection Rate         0.1020  0.09694   0.1020  0.08163  0.06122  0.07143
## Detection Prevalence   0.1122  0.12755   0.1020  0.09694  0.08163  0.08673
## Balanced Accuracy      0.9705  0.98305   0.9167  0.93602  0.78864  0.90338
##                      Class: 6 Class: 7 Class: 8 Class: 9
## Sensitivity           0.78947  0.85714  0.61111  0.68421
## Specificity           0.97740  0.97714  0.99438  0.93785
## Pos Pred Value        0.78947  0.81818  0.91667  0.54167
## Neg Pred Value        0.97740  0.98276  0.96196  0.96512
## Prevalence            0.09694  0.10714  0.09184  0.09694
## Detection Rate        0.07653  0.09184  0.05612  0.06633
## Detection Prevalence  0.09694  0.11224  0.06122  0.12245
## Balanced Accuracy     0.88344  0.91714  0.80275  0.81103
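The overall accuracy reported above is simply the trace of the confusion matrix divided by the total number of test cases. A quick sketch with a small made-up 3-class matrix (not the MNIST results above):

```r
# rows = predicted class, columns = reference class
cm <- matrix(c(20,  1,  0,
                2, 18,  1,
                0,  1, 17),
             nrow = 3, byrow = TRUE)

# correct predictions lie on the diagonal
accuracy <- sum(diag(cm)) / sum(cm)
print(accuracy)  # 55/60 = 0.9166667
```

Applying the same formula to the 10x10 matrix printed above (158 correct out of 196) reproduces the reported accuracy of 0.8061.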
rowsInTable <- 1:10
prob <- as.data.frame(mnist.kknn$prob[rowsInTable,])
mnistResultsDF <- data.frame(mnistTest$label[rowsInTable],
                             mnist.kknn$fit[rowsInTable],
                             prob)


kable(mnistResultsDF, digits=2,
      col.names=c("actual","fit",0:9))
| actual | fit | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 0 | 1.00 | 0.00 | 0.00 | 0 | 0.00 | 0 | 0.00 | 0.00 | 0.00 |
| 7 | 7 | 0 | 0.00 | 0.00 | 0.08 | 0 | 0.00 | 0 | 0.67 | 0.00 | 0.25 |
| 3 | 7 | 0 | 0.00 | 0.00 | 0.08 | 0 | 0.00 | 0 | 0.48 | 0.31 | 0.13 |
| 3 | 3 | 0 | 0.00 | 0.17 | 0.83 | 0 | 0.00 | 0 | 0.00 | 0.00 | 0.00 |
| 8 | 8 | 0 | 0.00 | 0.00 | 0.00 | 0 | 0.00 | 0 | 0.00 | 0.98 | 0.02 |
| 1 | 1 | 0 | 1.00 | 0.00 | 0.00 | 0 | 0.00 | 0 | 0.00 | 0.00 | 0.00 |
| 1 | 1 | 0 | 0.92 | 0.00 | 0.00 | 0 | 0.00 | 0 | 0.08 | 0.00 | 0.00 |
| 0 | 0 | 1 | 0.00 | 0.00 | 0.00 | 0 | 0.00 | 0 | 0.00 | 0.00 | 0.00 |
| 8 | 8 | 0 | 0.00 | 0.00 | 0.00 | 0 | 0.03 | 0 | 0.00 | 0.97 | 0.00 |
| 1 | 1 | 0 | 1.00 | 0.00 | 0.00 | 0 | 0.00 | 0 | 0.00 | 0.00 | 0.00 |
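The fitted class in this table is just the column with the highest predicted probability in each row. A minimal sketch with a hypothetical probability matrix (each row sums to 1; the values echo the shape of the table above but are not taken from the model):

```r
# hypothetical per-class probabilities for three test digits, classes 0-9
prob <- rbind(c(0, 1.00, 0, 0,    0, 0, 0, 0,    0,    0),
              c(0, 0,    0, 0.08, 0, 0, 0, 0.67, 0,    0.25),
              c(0, 0,    0, 0,    0, 0, 0, 0,    0.98, 0.02))
colnames(prob) <- 0:9

# pick the most probable class label for each row
predicted <- colnames(prob)[max.col(prob)]
print(predicted)  # "1" "7" "8"
```

Rows like the third one in the kable output above (actual 3, fit 7) show how a confident-looking argmax can still be wrong when the probability mass is split across several classes.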

Figure 3.13 Small subset of the Auto MPG data

auto <- read.csv("../data/auto-mpg.csv",
                 colClasses = c(
                      origin = "factor"))

# the numeric origin codes in the raw file carry trailing tab characters,
# so the names passed to revalue must include them
auto$origin <- revalue(auto$origin, 
                       c("1\t"="USA", "2\t"="Europe", "3\t"="Asia"))

kable(head(auto,5))
| mpg | cylinders | displacement | horsepower | weight | acceleration | modelyear | origin |
|---|---|---|---|---|---|---|---|
| 18 | 8 | 307 | 130 | 3504 | 12.0 | 70 | USA |
| 15 | 8 | 350 | 165 | 3693 | 11.5 | 70 | USA |
| 18 | 8 | 318 | 150 | 3436 | 11.0 | 70 | USA |
| 16 | 8 | 304 | 150 | 3433 | 12.0 | 70 | USA |
| 17 | 8 | 302 | 140 | 3449 | 10.5 | 70 | USA |

Figure 3.14 Scatter plots of Vehicle Weight and Model Year versus MPG

# restore the default graphics parameters saved earlier
# (e.g. .pardefault <- par(no.readonly = TRUE) before the first plot)
par(.pardefault)
p1<-ggplot(auto, aes(weight, mpg)) + 
  geom_point() +
  labs(y = "Miles per gallon",
       x = "Vehicle weight")
p2<-ggplot(auto, aes(modelyear, mpg)) + 
  geom_point() +
  theme(axis.title.y=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks.y=element_blank()) +
  labs(x = "Model year")
grid.arrange(p1,p2,ncol=2, 
             top=textGrob("Scatterplots for MPG data",
                          gp=gpar(fontsize=14,font=8)))

Figure 3.15 The Auto MPG data after expanding the categorical Origin column

Note that the row numbering differs between Python and R by 1 (Python starts numbering rows at 0, and R starts at 1).

dummies <- dummyVars(" ~ .", data = auto, fullRank = TRUE)
autoNumeric <- data.frame(predict(dummies, newdata = auto))

kable(tail(autoNumeric,5))
|     | mpg | cylinders | displacement | horsepower | weight | acceleration | modelyear | origin.Europe | origin.Asia |
|---|---|---|---|---|---|---|---|---|---|
| 388 | 27 | 4 | 140 | 86 | 2790 | 15.6 | 82 | 0 | 0 |
| 389 | 44 | 4 | 97 | 52 | 2130 | 24.6 | 82 | 1 | 0 |
| 390 | 32 | 4 | 135 | 84 | 2295 | 11.6 | 82 | 0 | 0 |
| 391 | 28 | 4 | 120 | 79 | 2625 | 18.6 | 82 | 0 | 0 |
| 392 | 31 | 4 | 119 | 82 | 2720 | 19.4 | 82 | 0 | 0 |

Figure 3.16 Comparing MPG predictions on a held-out testing set to actual values

trainIndex <- createDataPartition(autoNumeric$mpg, p = .8, 
                                  list = FALSE, 
                                  times = 1)

autoTrain <- autoNumeric[ trainIndex,]
autoTest  <- autoNumeric[-trainIndex,]

lmFit <- train(mpg ~ ., data = autoTrain,
               method = "lm")
lmPred <- predict(lmFit, newdata = autoTest)

rowsInTable <- 1:5
kable(data.frame("Origin.Europe" = autoTest$origin.Europe[rowsInTable],
                 "Origin.Asia" = autoTest$origin.Asia[rowsInTable],
                 "MPG" = autoTest$mpg[rowsInTable],
                 "Predicted MPG" = lmPred[rowsInTable]))
|    | Origin.Europe | Origin.Asia | MPG | Predicted.MPG |
|---|---|---|---|---|
| 1 | 0 | 0 | 18 | 14.89092 |
| 4 | 0 | 0 | 16 | 14.92356 |
| 6 | 0 | 0 | 15 | 10.74239 |
| 7 | 0 | 0 | 14 | 10.74370 |
| 10 | 0 | 0 | 15 | 13.14265 |
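A natural way to summarize how close these predictions are to the actual values is the root-mean-square error. A base-R sketch using just the five rows shown above (the full test set would of course give a different number):

```r
actual    <- c(18, 16, 15, 14, 15)
predicted <- c(14.89092, 14.92356, 10.74239, 10.74370, 13.14265)

# root-mean-square error over the five rows shown above
rmse <- sqrt(mean((predicted - actual)^2))
print(round(rmse, 2))  # about 2.93
```

The same computation applied to lmPred and autoTest$mpg gives a single-number summary to compare against the random forest model fitted below.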

Figure 3.17 A scatter plot of the actual versus predicted values on the held-out test set. The diagonal line shows the perfect regressor. The closer all of the predictions are to this line, the better the model.

ggplot(autoTest, aes(x=mpg, y=lmPred)) + 
  geom_point() + 
  geom_abline(slope = 1, intercept = 0) +
  labs(x="MPG", y="Predicted MPG")

Figure 3.18 Table of actual versus predicted MPG values for the nonlinear random forest regression model

rfFit <- train(mpg ~ ., data = autoTrain,
               method = "rf")
rfPred <- predict(rfFit, newdata = autoTest)

kable(data.frame("Origin.Europe" = autoTest$origin.Europe[rowsInTable],
                 "Origin.Asia" = autoTest$origin.Asia[rowsInTable],
                 "MPG" = autoTest$mpg[rowsInTable],
                 "Predicted MPG" = rfPred[rowsInTable]))
|    | Origin.Europe | Origin.Asia | MPG | Predicted.MPG |
|---|---|---|---|---|
| 1 | 0 | 0 | 18 | 15.80731 |
| 4 | 0 | 0 | 16 | 16.32332 |
| 6 | 0 | 0 | 15 | 13.44830 |
| 7 | 0 | 0 | 14 | 13.53114 |
| 10 | 0 | 0 | 15 | 13.77313 |

Figure 3.19 Comparison of MPG data versus predicted values for the nonlinear random forest regression model

ggplot(autoTest, aes(x=mpg, y=rfPred)) + 
  geom_point() + 
  geom_abline(slope = 1, intercept = 0) +
  labs(x="MPG", y="Predicted MPG")