This file contains R code to accompany Chapter 3 of the book “Real-World Machine Learning”, by Henrik Brink, Joseph W. Richards, and Mark Fetherolf. The code was contributed by Paul Adamson.
NOTE: working directory should be set to this file’s location.
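The code below assumes the packages used throughout this chapter are already loaded. The original setup chunk is not shown in this section, so the following is a minimal sketch of the library calls the rest of the code relies on; adjust it to match your own setup.

library(plyr)      # revalue
library(dplyr)     # mutate, select, filter
library(knitr)     # kable
library(vcd)       # mosaic
library(caret)     # dummyVars, createDataPartition, train, confusionMatrix
library(kknn)      # kknn
library(ggplot2)   # scatterplots
library(gridExtra) # grid.arrange
library(grid)      # textGrob, gpar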
We are going to be interested in predicting survival, so it is useful to specify the `Survived` variable to be of type `factor`. For visualizing the data, it is also useful to use the `revalue` function (from the `plyr` package) to specify the `no` and `yes` levels for the `factor` variable. The `kable` function is built into the `knitr` package.
# Read the Titanic data, forcing Survived to be a factor and keeping
# the free-text columns as plain character vectors
titanic <- read.csv("../data/titanic.csv",
                    colClasses = c(
                      Survived = "factor",
                      Name     = "character",
                      Ticket   = "character",
                      Cabin    = "character"))

# Relabel the factor levels so tables and plots read "no"/"yes" instead of 0/1
titanic$Survived <- revalue(titanic$Survived, c("0" = "no", "1" = "yes"))
kable(head(titanic, 6), digits = 2)
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | no | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.25 | | S |
2 | yes | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38 | 1 | 0 | PC 17599 | 71.28 | C85 | C |
3 | yes | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.92 | | S |
4 | yes | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.10 | C123 | S |
5 | no | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.05 | | S |
6 | no | 3 | Moran, Mr. James | male | NA | 0 | 0 | 330877 | 8.46 | | Q |
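As a quick sanity check that the `colClasses` and `revalue` steps did what we wanted, we can inspect the relabeled factor directly (this check is an addition here, not part of the original chapter code).

# Survived should now be a factor with levels "no" and "yes"
str(titanic$Survived)
table(titanic$Survived)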
The Visualizing Categorical Data package (`vcd`) provides an excellent set of functions for exploring categorical data, including mosaic plots.
# Mosaic plot of survival by gender; shade = TRUE colors the tiles by their
# Pearson residuals from a model of independence
mosaic(~ Sex + Survived,
       data = titanic,
       main = "Mosaic plot for Titanic data: Gender vs. survival",
       shade = TRUE,
       split_vertical = TRUE,
       labeling_args = list(
         set_varnames = c(Survived = "Survived?",
                          Sex = "Gender")))
First, we get rid of the variables that we do not want in our model. (`Cabin` might actually be useful, but it is not used here.) Then we use `is.na` to set missing `Age` values to -1. The `mutate` and `select` functions make it easy to take the square root of the `Fare` variable and then drop the original from the dataset. We then drop rows with missing `Embarked` data and remove the unused level `""`. Finally, we convert `factor` variables to dummy variables using the `dummyVars` function in the `caret` package. To avoid perfect collinearity (a.k.a. the dummy variable trap), we set the `fullRank` parameter to `TRUE`. `Survived.yes` is then converted back to a `factor` variable.
# Drop identifier and free-text columns that we will not model
titanicTidy <- subset(titanic, select = -c(PassengerId, Name, Ticket, Cabin))

# Flag missing ages with -1 rather than dropping those rows
titanicTidy$Age[is.na(titanicTidy$Age)] <- -1

# Replace Fare with its square root
titanicTidy <- titanicTidy %>%
  mutate(sqrtFare = sqrt(Fare)) %>%
  select(-Fare)

# Drop rows with missing Embarked data and remove the unused "" level
titanicTidy <- titanicTidy %>%
  filter(!(Embarked == "")) %>%
  droplevels()

# Expand factors into dummies; fullRank = TRUE avoids the dummy variable trap
dummies <- dummyVars(" ~ .", data = titanicTidy, fullRank = TRUE)
titanicTidyNumeric <- data.frame(predict(dummies, newdata = titanicTidy))
titanicTidyNumeric$Survived.yes <- factor(titanicTidyNumeric$Survived.yes)
kable(head(titanicTidyNumeric))
Survived.yes | Pclass | Sex.male | Age | SibSp | Parch | Embarked.Q | Embarked.S | sqrtFare |
---|---|---|---|---|---|---|---|---|
0 | 3 | 1 | 22 | 1 | 0 | 0 | 1 | 2.692582 |
1 | 1 | 0 | 38 | 1 | 0 | 0 | 0 | 8.442944 |
1 | 3 | 0 | 26 | 0 | 0 | 0 | 1 | 2.815138 |
1 | 1 | 0 | 35 | 1 | 0 | 0 | 1 | 7.286975 |
0 | 3 | 1 | 35 | 0 | 0 | 0 | 1 | 2.837252 |
0 | 3 | 1 | -1 | 0 | 0 | 1 | 0 | 2.908316 |
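To see what `fullRank = TRUE` buys us, we can rebuild the dummy variables with `fullRank = FALSE` and compare the column names; the extra columns are the redundant first levels that would be perfectly collinear with the ones kept above. (This comparison is a quick sketch added here, not part of the original chapter code.)

# With fullRank = FALSE every factor level gets its own indicator column
dummiesAll <- dummyVars(" ~ .", data = titanicTidy, fullRank = FALSE)
setdiff(colnames(predict(dummiesAll, newdata = titanicTidy)),
        colnames(titanicTidyNumeric))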
Thanks to Longhow Lam for posting the code used in the `displayMnistSamples` function, which displays digits from the MNIST dataset.
mnist <- read.csv("../data/mnist_small.csv",
                  colClasses = c(label = "factor"))

# Display MNIST digits: each row holds a label followed by 784 pixel values,
# which we reshape into a 28x28 image
displayMnistSamples <- function(x) {
  for (i in x) {
    y <- as.matrix(mnist[i, 2:785])
    dim(y) <- c(28, 28)
    image(y[, nrow(y):1], axes = FALSE, col = gray(0:255 / 255))
    text(0.2, 0, mnist[i, 1], cex = 3, col = 2, pos = c(3, 4))
  }
}

par(mfrow = c(1, 4), mai = c(0, 0, 0, 0.1))
# Sample four row indices (nrow(mnist), not length(mnist), which counts columns)
displayMnistSamples(sample(1:nrow(mnist), 4))
# Hold out 20% of the data for testing, stratified by digit label
trainIndex <- createDataPartition(mnist$label, p = .8,
                                  list = FALSE,
                                  times = 1)
mnistTrain <- mnist[ trainIndex, ]
mnistTest  <- mnist[-trainIndex, ]

# Weighted k-nearest-neighbors classifier on the raw pixel values
mnist.kknn <- kknn(label ~ ., mnistTrain, mnistTest, distance = 1,
                   kernel = "triangular")
confusionMatrix(fitted(mnist.kknn), mnistTest$label)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1 2 3 4 5 6 7 8 9
## 0 20 0 1 0 0 0 1 0 0 0
## 1 0 19 2 0 0 1 0 0 2 1
## 2 0 0 20 0 0 0 0 0 0 0
## 3 0 0 0 16 0 0 0 0 3 0
## 4 0 0 0 0 12 0 1 0 1 2
## 5 0 0 0 1 0 14 2 0 0 0
## 6 1 0 0 0 1 1 15 0 1 0
## 7 0 0 0 1 0 0 0 18 0 3
## 8 0 0 1 0 0 0 0 0 11 0
## 9 0 0 0 0 7 1 0 3 0 13
##
## Overall Statistics
##
## Accuracy : 0.8061
## 95% CI : (0.7437, 0.859)
## No Information Rate : 0.1224
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7844
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity 0.9524 1.00000 0.8333 0.88889 0.60000 0.82353
## Specificity 0.9886 0.96610 1.0000 0.98315 0.97727 0.98324
## Pos Pred Value 0.9091 0.76000 1.0000 0.84211 0.75000 0.82353
## Neg Pred Value 0.9943 1.00000 0.9773 0.98870 0.95556 0.98324
## Prevalence 0.1071 0.09694 0.1224 0.09184 0.10204 0.08673
## Detection Rate 0.1020 0.09694 0.1020 0.08163 0.06122 0.07143
## Detection Prevalence 0.1122 0.12755 0.1020 0.09694 0.08163 0.08673
## Balanced Accuracy 0.9705 0.98305 0.9167 0.93602 0.78864 0.90338
## Class: 6 Class: 7 Class: 8 Class: 9
## Sensitivity 0.78947 0.85714 0.61111 0.68421
## Specificity 0.97740 0.97714 0.99438 0.93785
## Pos Pred Value 0.78947 0.81818 0.91667 0.54167
## Neg Pred Value 0.97740 0.98276 0.96196 0.96512
## Prevalence 0.09694 0.10714 0.09184 0.09694
## Detection Rate 0.07653 0.09184 0.05612 0.06633
## Detection Prevalence 0.09694 0.11224 0.06122 0.12245
## Balanced Accuracy 0.88344 0.91714 0.80275 0.81103
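As a quick sanity check on the confusion matrix, the overall accuracy can also be computed directly from the fitted values (an added one-liner, not part of the original chapter code); it should agree with the Accuracy entry reported above.

# Fraction of test digits classified correctly
mean(fitted(mnist.kknn) == mnistTest$label)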
# Show the predicted class probabilities for the first ten test digits
rowsInTable <- 1:10
prob <- as.data.frame(mnist.kknn$prob[rowsInTable, ])
mnistResultsDF <- data.frame(mnistTest$label[rowsInTable],
                             mnist.kknn$fit[rowsInTable],
                             prob)
kable(mnistResultsDF, digits = 2,
      col.names = c("actual", "fit", 0:9))
actual | fit | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | 0 | 1.00 | 0.00 | 0.00 | 0 | 0.00 | 0 | 0.00 | 0.00 | 0.00 |
7 | 7 | 0 | 0.00 | 0.00 | 0.08 | 0 | 0.00 | 0 | 0.67 | 0.00 | 0.25 |
3 | 7 | 0 | 0.00 | 0.00 | 0.08 | 0 | 0.00 | 0 | 0.48 | 0.31 | 0.13 |
3 | 3 | 0 | 0.00 | 0.17 | 0.83 | 0 | 0.00 | 0 | 0.00 | 0.00 | 0.00 |
8 | 8 | 0 | 0.00 | 0.00 | 0.00 | 0 | 0.00 | 0 | 0.00 | 0.98 | 0.02 |
1 | 1 | 0 | 1.00 | 0.00 | 0.00 | 0 | 0.00 | 0 | 0.00 | 0.00 | 0.00 |
1 | 1 | 0 | 0.92 | 0.00 | 0.00 | 0 | 0.00 | 0 | 0.08 | 0.00 | 0.00 |
0 | 0 | 1 | 0.00 | 0.00 | 0.00 | 0 | 0.00 | 0 | 0.00 | 0.00 | 0.00 |
8 | 8 | 0 | 0.00 | 0.00 | 0.00 | 0 | 0.03 | 0 | 0.00 | 0.97 | 0.00 |
1 | 1 | 0 | 1.00 | 0.00 | 0.00 | 0 | 0.00 | 0 | 0.00 | 0.00 | 0.00 |
# Read the auto MPG data; origin is a coded factor (its values contain a trailing tab)
auto <- read.csv("../data/auto-mpg.csv",
                 colClasses = c(origin = "factor"))
auto$origin <- revalue(auto$origin,
                       c("1\t" = "USA", "2\t" = "Europe", "3\t" = "Asia"))
kable(head(auto, 5))
mpg | cylinders | displacement | horsepower | weight | acceleration | modelyear | origin |
---|---|---|---|---|---|---|---|
18 | 8 | 307 | 130 | 3504 | 12.0 | 70 | USA |
15 | 8 | 350 | 165 | 3693 | 11.5 | 70 | USA |
18 | 8 | 318 | 150 | 3436 | 11.0 | 70 | USA |
16 | 8 | 304 | 150 | 3433 | 12.0 | 70 | USA |
17 | 8 | 302 | 140 | 3449 | 10.5 | 70 | USA |
# Restore the default graphical parameters saved earlier as .pardefault
# (e.g. .pardefault <- par(no.readonly = TRUE) in the setup chunk)
par(.pardefault)

p1 <- ggplot(auto, aes(weight, mpg)) +
  geom_point() +
  labs(y = "Miles per gallon",
       x = "Vehicle weight")
p2 <- ggplot(auto, aes(modelyear, mpg)) +
  geom_point() +
  theme(axis.title.y = element_blank(),
        axis.text.y  = element_blank(),
        axis.ticks.y = element_blank()) +
  labs(x = "Model year")
grid.arrange(p1, p2, ncol = 2,
             top = textGrob("Scatterplots for MPG data",
                            gp = gpar(fontsize = 14, font = 8)))
Note that the row numbering differs between Python and R by 1 (Python starts row numbering at 0, and R starts at 1).
dummies <- dummyVars(" ~ .", data = auto, fullRank = TRUE)
autoNumeric <- data.frame(predict(dummies, newdata = auto))
kable(tail(autoNumeric,5))
  | mpg | cylinders | displacement | horsepower | weight | acceleration | modelyear | origin.Europe | origin.Asia |
---|---|---|---|---|---|---|---|---|---|
388 | 27 | 4 | 140 | 86 | 2790 | 15.6 | 82 | 0 | 0 |
389 | 44 | 4 | 97 | 52 | 2130 | 24.6 | 82 | 1 | 0 |
390 | 32 | 4 | 135 | 84 | 2295 | 11.6 | 82 | 0 | 0 |
391 | 28 | 4 | 120 | 79 | 2625 | 18.6 | 82 | 0 | 0 |
392 | 31 | 4 | 119 | 82 | 2720 | 19.4 | 82 | 0 | 0 |
# Hold out 20% of the data for testing
trainIndex <- createDataPartition(autoNumeric$mpg, p = .8,
                                  list = FALSE,
                                  times = 1)
autoTrain <- autoNumeric[ trainIndex, ]
autoTest  <- autoNumeric[-trainIndex, ]

# Ordinary linear regression of mpg on all remaining variables
lmFit <- train(mpg ~ ., data = autoTrain,
               method = "lm")
lmPred <- predict(lmFit, newdata = autoTest)

rowsInTable <- 1:5
kable(data.frame("Origin.Europe" = autoTest$origin.Europe[rowsInTable],
                 "Origin.Asia"   = autoTest$origin.Asia[rowsInTable],
                 "MPG"           = autoTest$mpg[rowsInTable],
                 "Predicted MPG" = lmPred[rowsInTable]))
  | Origin.Europe | Origin.Asia | MPG | Predicted.MPG |
---|---|---|---|---|
1 | 0 | 0 | 18 | 14.89092 |
4 | 0 | 0 | 16 | 14.92356 |
6 | 0 | 0 | 15 | 10.74239 |
7 | 0 | 0 | 14 | 10.74370 |
10 | 0 | 0 | 15 | 13.14265 |
# Predicted vs. actual MPG; points on the diagonal are perfect predictions
ggplot(autoTest, aes(x = mpg, y = lmPred)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0) +
  labs(x = "MPG", y = "Predicted MPG")
# Random forest regression on the same training data
rfFit <- train(mpg ~ ., data = autoTrain,
               method = "rf")
rfPred <- predict(rfFit, newdata = autoTest)
kable(data.frame("Origin.Europe" = autoTest$origin.Europe[rowsInTable],
                 "Origin.Asia"   = autoTest$origin.Asia[rowsInTable],
                 "MPG"           = autoTest$mpg[rowsInTable],
                 "Predicted MPG" = rfPred[rowsInTable]))
  | Origin.Europe | Origin.Asia | MPG | Predicted.MPG |
---|---|---|---|---|
1 | 0 | 0 | 18 | 15.80731 |
4 | 0 | 0 | 16 | 16.32332 |
6 | 0 | 0 | 15 | 13.44830 |
7 | 0 | 0 | 14 | 13.53114 |
10 | 0 | 0 | 15 | 13.77313 |
# Predicted vs. actual MPG for the random forest
ggplot(autoTest, aes(x = mpg, y = rfPred)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0) +
  labs(x = "MPG", y = "Predicted MPG")