This file contains R code to accompany Chapter 3 of the book “Real-World Machine Learning”, by Henrik Brink, Joseph W. Richards, and Mark Fetherolf. The code was contributed by Paul Adamson.

NOTE: working directory should be set to this file’s location.

Figure 3.4 A subset of the Titanic Passengers dataset

We are going to be interested in predicting survival, so it is useful to read the Survived variable in as a factor. For visualizing the data, it is also convenient to use the revalue function (from the plyr package) to map the factor levels 0 and 1 to no and yes. The kable function, from the knitr package, renders the data frame as a table.

library(plyr)   # for revalue
library(knitr)  # for kable

titanic <- read.csv("../data/titanic.csv", 
                    colClasses = c(
                      Survived = "factor",
                      Name = "character",
                      Ticket = "character",
                      Cabin = "character"))
titanic$Survived <- revalue(titanic$Survived, c("0"="no", "1"="yes"))
kable(head(titanic, 6), digits=2)
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | no | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.25 | | S |
| 2 | yes | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38 | 1 | 0 | PC 17599 | 71.28 | C85 | C |
| 3 | yes | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.92 | | S |
| 4 | yes | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.10 | C123 | S |
| 5 | no | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.05 | | S |
| 6 | no | 3 | Moran, Mr. James | male | NA | 0 | 0 | 330877 | 8.46 | | Q |
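
As a quick sanity check (an addition, not in the original code), we can confirm that the factor recoding took effect:

# Survived should now be a factor with levels "no" and "yes"
levels(titanic$Survived)
table(titanic$Survived)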

Figure 3.5 Mosaic plot for Titanic data: Gender vs. survival

The “Visualizing Categorical Data” package (vcd) provides an excellent set of functions for exploring categorical data, including mosaic plots.

library(vcd)  # for mosaic

mosaic(
  ~ Sex + Survived,
  data = titanic, 
  main = "Mosaic plot for Titanic data: Gender vs. survival",
  shade = TRUE,
  split_vertical = TRUE,
  labeling_args = list(
    set_varnames = c(
      Survived = "Survived?",
      Sex = "Gender")))
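
The shading produced by shade = TRUE encodes Pearson residuals from a test of independence between the two variables. As a hedged companion check (not part of the original code), the test itself can be run directly:

# chi-squared test of independence between gender and survival; the
# Pearson residuals behind this test drive the mosaic plot's shading
chisq.test(table(titanic$Sex, titanic$Survived))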

Figure 3.6 Processed data

First, we get rid of the variables that we do not want in our model. (Cabin might actually be useful, but it's not used here.) Then we use is.na to set missing age values to -1. The mutate and select functions from the dplyr package make it easy to take the square root of the Fare variable and then drop the original column. We then drop rows with missing Embarked data and remove the now-unused level "". Finally, we convert the factor variables to dummy variables using the dummyVars function in the caret package. To avoid perfect collinearity (a.k.a. the dummy variable trap), we set the fullRank parameter to TRUE. Survived.yes is then converted back to a factor variable.

library(dplyr)  # for %>%, mutate, select, and filter
library(caret)  # for dummyVars

# drop identifier-like variables that we do not want in the model
titanicTidy <- subset(titanic, select = -c(PassengerId, Name, Ticket, Cabin))

# use -1 as a sentinel value for missing ages
titanicTidy$Age[is.na(titanicTidy$Age)] <- -1

# add the square root of Fare, then drop the original column
titanicTidy <- titanicTidy %>%
  mutate(sqrtFare = sqrt(Fare)) %>%
  select(-Fare)

# drop rows with missing Embarked values, then the now-unused "" level
titanicTidy <- titanicTidy %>%
  filter(!(Embarked == "")) %>%
  droplevels()

# expand factors into numeric dummy columns; fullRank = TRUE drops one
# level per factor to avoid perfect collinearity
dummies <- dummyVars(" ~ .", data = titanicTidy, fullRank = TRUE)
titanicTidyNumeric <- data.frame(predict(dummies, newdata = titanicTidy))

titanicTidyNumeric$Survived.yes <- factor(titanicTidyNumeric$Survived.yes)
kable(head(titanicTidyNumeric))
| Survived.yes | Pclass | Sex.male | Age | SibSp | Parch | Embarked.Q | Embarked.S | sqrtFare |
|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 1 | 22 | 1 | 0 | 0 | 1 | 2.692582 |
| 1 | 1 | 0 | 38 | 1 | 0 | 0 | 0 | 8.442944 |
| 1 | 3 | 0 | 26 | 0 | 0 | 0 | 1 | 2.815138 |
| 1 | 1 | 0 | 35 | 1 | 0 | 0 | 1 | 7.286975 |
| 0 | 3 | 1 | 35 | 0 | 0 | 0 | 1 | 2.837252 |
| 0 | 3 | 1 | -1 | 0 | 0 | 1 | 0 | 2.908316 |
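
To see why fullRank = TRUE matters: with the default fullRank = FALSE, each factor expands to one indicator column per level (for Embarked: Embarked.C, Embarked.Q, and Embarked.S), and those columns always sum to one, which makes them perfectly collinear with a model intercept. A hedged sketch (not part of the original code) to inspect that expansion:

# with fullRank = FALSE, indicator columns are kept for ALL levels;
# e.g. Embarked.C + Embarked.Q + Embarked.S == 1 in every row
dummiesFull <- dummyVars(" ~ .", data = titanicTidy, fullRank = FALSE)
head(data.frame(predict(dummiesFull, newdata = titanicTidy)), 3)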

Figure 3.10 Four randomly chosen handwritten digits from the MNIST database

Thanks to Longhow Lam for posting the code used in the displayMnistSamples function, which displays digits from the MNIST dataset.

mnist <- read.csv("../data/mnist_small.csv",
                  colClasses = c(label = "factor"))

# display the digits in rows x of mnist as 28x28 grayscale images,
# annotated with their true labels (adapted from Longhow Lam)
displayMnistSamples <- function(x) {
  for (i in x) {
    y <- as.matrix(mnist[i, 2:785])  # the 784 pixel columns of row i
    dim(y) <- c(28, 28)              # reshape to a 28x28 image
    image(y[, nrow(y):1], axes = FALSE, col = gray(0:255 / 255))
    text(0.2, 0, mnist[i, 1], cex = 3, col = 2, pos = c(3, 4))
  }
}
par(mfrow = c(1, 4), mai = c(0, 0, 0, 0.1))
# sample four row indices; note nrow, not length (the length of a
# data frame is its number of columns, not rows)
displayMnistSamples(sample(1:nrow(mnist), 4))
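
As a quick, hedged sanity check on the data layout (an addition, not in the original code): each row stores the digit label in column 1 followed by the 784 flattened pixel values.

# label in column 1, then 28 * 28 = 784 pixel columns
dim(mnist)           # number of images x 785
range(mnist[, -1])   # pixel intensity range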

Figure 3.11 Table of predicted probabilities from a k-nearest neighbors classifier, as applied to the MNIST dataset

# class-stratified 80/20 split into training and test sets (caret)
trainIndex <- createDataPartition(mnist$label, p = .8, 
                                  list = FALSE, 
                                  times = 1)
mnistTrain <- mnist[ trainIndex,]
mnistTest  <- mnist[-trainIndex,]
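
Note that createDataPartition draws a random (class-stratified) sample, so the exact counts in the confusion matrix below will differ from run to run. No seed is recorded with the original output; to make the split reproducible, one could be set before calling createDataPartition, for example:

# hypothetical seed (any fixed value works), called before the split
set.seed(1234)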

library(kknn)  # weighted k-nearest neighbors

# distance = 1 selects the Manhattan (Minkowski p = 1) metric;
# the triangular kernel weights neighbors by proximity
mnist.kknn <- kknn(label~., mnistTrain, mnistTest, distance = 1,
                   kernel = "triangular")

confusionMatrix(fitted(mnist.kknn), mnistTest$label)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1  2  3  4  5  6  7  8  9
##          0 20  0  1  0  0  0  1  0  0  0
##          1  0 19  2  0  0  1  0  0  2  1
##          2  0  0 20  0  0  0  0  0  0  0
##          3  0  0  0 16  0  0  0  0  3  0
##          4  0  0  0  0 12  0  1  0  1  2
##          5  0  0  0  1  0 14  2  0  0  0
##          6  1  0  0  0  1  1 15  0  1  0
##          7  0  0  0  1  0  0  0 18  0  3
##          8  0  0  1  0  0  0  0  0 11  0
##          9  0  0  0  0  7  1  0  3  0 13
## 
## Overall Statistics
##                                          
##                Accuracy : 0.8061         
##                  95% CI : (0.7437, 0.859)
##     No Information Rate : 0.1224         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.7844         
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity            0.9524  1.00000   0.8333  0.88889  0.60000  0.82353
## Specificity            0.9886  0.96610   1.0000  0.98315  0.97727  0.98324
## Pos Pred Value         0.9091  0.76000   1.0000  0.84211  0.75000  0.82353
## Neg Pred Value         0.9943  1.00000   0.9773  0.98870  0.95556  0.98324
## Prevalence             0.1071  0.09694   0.1224  0.09184  0.10204  0.08673
## Detection Rate         0.1020  0.09694   0.1020  0.08163  0.06122  0.07143
## Detection Prevalence   0.1122  0.12755   0.1020  0.09694  0.08163  0.08673
## Balanced Accuracy      0.9705  0.98305   0.9167  0.93602  0.78864  0.90338
##                      Class: 6 Class: 7 Class: 8 Class: 9
## Sensitivity           0.78947  0.85714  0.61111  0.68421
## Specificity           0.97740  0.97714  0.99438  0.93785
## Pos Pred Value        0.78947  0.81818  0.91667  0.54167
## Neg Pred Value        0.97740  0.98276  0.96196  0.96512
## Prevalence            0.09694  0.10714  0.09184  0.09694
## Detection Rate        0.07653  0.09184  0.05612  0.06633
## Detection Prevalence  0.09694  0.11224  0.06122  0.12245
## Balanced Accuracy     0.88344  0.91714  0.80275  0.81103
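
The overall accuracy reported above is just the sum of the diagonal of the confusion matrix divided by the total number of test cases. A quick, hedged one-liner (not part of the original code) to verify it:

# correct predictions / total predictions; matches Accuracy above
cm <- table(fitted(mnist.kknn), mnistTest$label)
sum(diag(cm)) / sum(cm)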
# extract predicted class probabilities for the first ten test digits
rowsInTable <- 1:10
prob <- as.data.frame(mnist.kknn$prob[rowsInTable,])
mnistResultsDF <- data.frame(mnistTest$label[rowsInTable],
                             mnist.kknn$fit[rowsInTable],
                             prob)

kable(mnistResultsDF, digits=2,
      col.names=c("actual","fit",0:9))
| actual | fit | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 0 | 1.00 | 0.00 | 0.00 | 0 | 0.00 | 0 | 0.00 | 0.00 | 0.00 |
| 7 | 7 | 0 | 0.00 | 0.00 | 0.08 | 0 | 0.00 | 0 | 0.67 | 0.00 | 0.25 |
| 3 | 7 | 0 | 0.00 | 0.00 | 0.08 | 0 | 0.00 | 0 | 0.48 | 0.31 | 0.13 |
| 3 | 3 | 0 | 0.00 | 0.17 | 0.83 | 0 | 0.00 | 0 | 0.00 | 0.00 | 0.00 |
| 8 | 8 | 0 | 0.00 | 0.00 | 0.00 | 0 | 0.00 | 0 | 0.00 | 0.98 | 0.02 |
| 1 | 1 | 0 | 1.00 | 0.00 | 0.00 | 0 | 0.00 | 0 | 0.00 | 0.00 | 0.00 |
| 1 | 1 | 0 | 0.92 | 0.00 | 0.00 | 0 | 0.00 | 0 | 0.08 | 0.00 | 0.00 |
| 0 | 0 | 1 | 0.00 | 0.00 | 0.00 | 0 | 0.00 | 0 | 0.00 | 0.00 | 0.00 |
| 8 | 8 | 0 | 0.00 | 0.00 | 0.00 | 0 | 0.03 | 0 | 0.00 | 0.97 | 0.00 |
| 1 | 1 | 0 | 1.00 | 0.00 | 0.00 | 0 | 0.00 | 0 | 0.00 | 0.00 | 0.00 |
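
The third row of the table shows a genuine misclassification: a 3 predicted as a 7 with probability 0.48, while the true class receives only 0.08. A hedged sketch (not part of the original code) for listing every misclassified test digit:

# indices of test digits where the prediction disagrees with the label
misclassified <- which(fitted(mnist.kknn) != mnistTest$label)
head(misclassified)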

Figure 3.13 Small subset of the Auto MPG data

# the raw origin codes evidently include a trailing tab character,
# hence the "\t" in the revalue mapping keys below
auto <- read.csv("../data/auto-mpg.csv",
                 colClasses = c(
                      origin = "factor"))

auto$origin <- revalue(auto$origin, 
                       c("1\t"="USA", "2\t"="Europe", "3\t"="Asia"))

kable(head(auto,5))
| mpg | cylinders | displacement | horsepower | weight | acceleration | modelyear | origin |
|---|---|---|---|---|---|---|---|
| 18 | 8 | 307 | 130 | 3504 | 12.0 | 70 | USA |
| 15 | 8 | 350 | 165 | 3693 | 11.5 | 70 | USA |
| 18 | 8 | 318 | 150 | 3436 | 11.0 | 70 | USA |
| 16 | 8 | 304 | 150 | 3433 | 12.0 | 70 | USA |
| 17 | 8 | 302 | 140 | 3449 | 10.5 | 70 | USA |
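
As an aside, an alternative, hedged cleanup (not in the original code) is to trim the stray whitespace from the factor levels first, so the mapping keys stay plain. The autoAlt name below is a hypothetical copy used only for illustration:

# strip whitespace from the raw levels, then map the plain codes
autoAlt <- read.csv("../data/auto-mpg.csv", colClasses = c(origin = "factor"))
levels(autoAlt$origin) <- trimws(levels(autoAlt$origin))
autoAlt$origin <- revalue(autoAlt$origin, c("1"="USA", "2"="Europe", "3"="Asia"))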

Figure 3.14 Scatter plots of Vehicle Weight and Model Year versus MPG

library(ggplot2)    # scatter plots
library(gridExtra)  # grid.arrange
library(grid)       # textGrob, gpar

# reset the plotting parameters changed for the MNIST figure; this
# assumes .pardefault was saved earlier in the full script, e.g. via
# .pardefault <- par(no.readonly = TRUE)
par(.pardefault)

p1 <- ggplot(auto, aes(weight, mpg)) + 
  geom_point() +
  labs(y = "Miles per gallon",
       x = "Vehicle weight")
p2 <- ggplot(auto, aes(modelyear, mpg)) + 
  geom_point() +
  theme(axis.title.y = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank()) +
  labs(x = "Model year")
grid.arrange(p1, p2, ncol = 2, 
             top = textGrob("Scatterplots for MPG data",
                            gp = gpar(fontsize = 14, font = 8)))