This file contains R code to accompany Chapter 3 of the book “Real-World Machine Learning”, by Henrik Brink, Joseph W. Richards, and Mark Fetherolf. The code was contributed by Paul Adamson.
NOTE: working directory should be set to this file’s location.
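The code below assumes the following packages are installed and loaded (a minimal setup sketch inferred from the functions used, not shown in this section); it also saves the default graphics parameters so they can be restored later with par(.pardefault).
library(plyr)       # revalue
library(dplyr)      # mutate, select, filter, and the %>% pipe
library(knitr)      # kable
library(vcd)        # mosaic
library(caret)      # dummyVars, createDataPartition, train, confusionMatrix
library(kknn)       # kknn, train.kknn
library(ggplot2)    # ggplot
library(gridExtra)  # grid.arrange
library(grid)       # textGrob, gpar
.pardefault <- par(no.readonly = TRUE)  # save default graphical parameters for later restore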
We are going to be interested in predicting survival, so it is useful to specify the Survived variable to be of type factor. For visualizing the data, it is also useful to use the revalue function from the plyr package to relabel its levels as no and yes. The kable function from the knitr package formats the first few rows of the data frame as a table.
titanic <- read.csv("../data/titanic.csv",
colClasses = c(
Survived = "factor",
Name = "character",
Ticket = "character",
Cabin = "character"))
titanic$Survived <- revalue(titanic$Survived, c("0"="no", "1"="yes"))
kable(head(titanic, 6), digits=2)
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | no | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.25 | | S |
| 2 | yes | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38 | 1 | 0 | PC 17599 | 71.28 | C85 | C |
| 3 | yes | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.92 | | S |
| 4 | yes | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.10 | C123 | S |
| 5 | no | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.05 | | S |
| 6 | no | 3 | Moran, Mr. James | male | NA | 0 | 0 | 330877 | 8.46 | | Q |
The “Visualizing Categorical Data” package (vcd) provides an excellent set of functions for exploring categorical data, including mosaic plots.
mosaic(
~ Sex + Survived,
data = titanic,
main = "Mosaic plot for Titanic data: Gender vs. survival",
shade = TRUE,
split_vertical = TRUE,
labeling_args = list(
set_varnames = c(
Survived = "Survived?",
Sex = "Gender")))
First, we get rid of the variables that we do not want in our model. (Cabin might actually be useful, but it's not used here.) Then we use is.na to find missing Age values and set them to -1. The dplyr mutate and select functions make it easy to take the square root of the Fare variable and then drop the original from the dataset. We then drop rows with missing Embarked data and remove the now-unused level "". Finally, we convert the factor variables to dummy variables using the dummyVars function in the caret package. To avoid perfect collinearity (a.k.a. the dummy variable trap), we set the fullRank parameter to TRUE. Survived.yes is then converted back to a factor variable.
titanicTidy <- subset(titanic, select = -c(PassengerId, Name, Ticket, Cabin))
titanicTidy$Age[is.na(titanicTidy$Age)] <- -1
titanicTidy <- titanicTidy %>%
mutate(sqrtFare = sqrt(Fare)) %>%
select(-Fare)
titanicTidy <- titanicTidy %>%
filter(!(Embarked=="")) %>%
droplevels
dummies <- dummyVars(" ~ .", data = titanicTidy, fullRank = TRUE)
titanicTidyNumeric <- data.frame(predict(dummies, newdata = titanicTidy))
titanicTidyNumeric$Survived.yes <- factor(titanicTidyNumeric$Survived.yes)
kable(head(titanicTidyNumeric))
| Survived.yes | Pclass | Sex.male | Age | SibSp | Parch | Embarked.Q | Embarked.S | sqrtFare |
|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 1 | 22 | 1 | 0 | 0 | 1 | 2.692582 |
| 1 | 1 | 0 | 38 | 1 | 0 | 0 | 0 | 8.442944 |
| 1 | 3 | 0 | 26 | 0 | 0 | 0 | 1 | 2.815138 |
| 1 | 1 | 0 | 35 | 1 | 0 | 0 | 1 | 7.286975 |
| 0 | 3 | 1 | 35 | 0 | 0 | 0 | 1 | 2.837252 |
| 0 | 3 | 1 | -1 | 0 | 0 | 1 | 0 | 2.908316 |
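A quick sanity check (a sketch, not in the original code) confirms that everything except the target is now numeric and that no missing values remain:
str(titanicTidyNumeric)             # all predictors numeric; Survived.yes is a factor
colSums(is.na(titanicTidyNumeric))  # expect zero in every column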
Thanks to Longhow Lam for posting the code used in the displayMnistSamples function, which displays digits from the MNIST dataset.
mnist <- read.csv("../data/mnist_small.csv",
colClasses = c(label = "factor"))
displayMnistSamples <- function(x) {
  # For each row index in x, reshape the 784 pixel columns (columns 2:785)
  # into a 28 x 28 matrix, draw it in grayscale, and overlay the true label.
  for(i in x){
    y = as.matrix(mnist[i, 2:785])
    dim(y) = c(28, 28)
    image( y[,nrow(y):1], axes = FALSE, col = gray(0:255 / 255))
    text( 0.2, 0, mnist[i,1], cex = 3, col = 2, pos = c(3,4))
  }
}
par( mfrow = c(1,4), mai = c(0,0,0,0.1))
displayMnistSamples(sample(1:nrow(mnist), 4))  # sample row indices, not columns
trainIndex <- createDataPartition(mnist$label, p = .8,
list = FALSE,
times = 1)
mnistTrain <- mnist[ trainIndex,]
mnistTest <- mnist[-trainIndex,]
mnist.kknn <- kknn(label~., mnistTrain, mnistTest, distance = 1,
kernel = "triangular")
confusionMatrix(fitted(mnist.kknn),mnistTest$label)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1 2 3 4 5 6 7 8 9
## 0 20 0 1 0 0 0 1 0 0 0
## 1 0 19 2 0 0 1 0 0 2 1
## 2 0 0 20 0 0 0 0 0 0 0
## 3 0 0 0 16 0 0 0 0 3 0
## 4 0 0 0 0 12 0 1 0 1 2
## 5 0 0 0 1 0 14 2 0 0 0
## 6 1 0 0 0 1 1 15 0 1 0
## 7 0 0 0 1 0 0 0 18 0 3
## 8 0 0 1 0 0 0 0 0 11 0
## 9 0 0 0 0 7 1 0 3 0 13
##
## Overall Statistics
##
## Accuracy : 0.8061
## 95% CI : (0.7437, 0.859)
## No Information Rate : 0.1224
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7844
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity 0.9524 1.00000 0.8333 0.88889 0.60000 0.82353
## Specificity 0.9886 0.96610 1.0000 0.98315 0.97727 0.98324
## Pos Pred Value 0.9091 0.76000 1.0000 0.84211 0.75000 0.82353
## Neg Pred Value 0.9943 1.00000 0.9773 0.98870 0.95556 0.98324
## Prevalence 0.1071 0.09694 0.1224 0.09184 0.10204 0.08673
## Detection Rate 0.1020 0.09694 0.1020 0.08163 0.06122 0.07143
## Detection Prevalence 0.1122 0.12755 0.1020 0.09694 0.08163 0.08673
## Balanced Accuracy 0.9705 0.98305 0.9167 0.93602 0.78864 0.90338
## Class: 6 Class: 7 Class: 8 Class: 9
## Sensitivity 0.78947 0.85714 0.61111 0.68421
## Specificity 0.97740 0.97714 0.99438 0.93785
## Pos Pred Value 0.78947 0.81818 0.91667 0.54167
## Neg Pred Value 0.97740 0.98276 0.96196 0.96512
## Prevalence 0.09694 0.10714 0.09184 0.09694
## Detection Rate 0.07653 0.09184 0.05612 0.06633
## Detection Prevalence 0.09694 0.11224 0.06122 0.12245
## Balanced Accuracy 0.88344 0.91714 0.80275 0.81103
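The call above uses kknn's default number of neighbours and a fixed triangular kernel; the kknn package also provides train.kknn, which selects k and the kernel by leave-one-out cross-validation. A minimal tuning sketch (not part of the original code):
mnist.tune <- train.kknn(label ~ ., data = mnistTrain,
                         kmax = 15, distance = 1,
                         kernel = c("rectangular", "triangular", "gaussian"))
mnist.tune$best.parameters  # cross-validated choice of kernel and k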
rowsInTable <- 1:10
prob <- as.data.frame(mnist.kknn$prob[rowsInTable,])
mnistResultsDF <- data.frame(mnistTest$label[rowsInTable],
mnist.kknn$fit[rowsInTable],
prob)
kable(mnistResultsDF, digits=2,
col.names=c("actual","fit",0:9))
| actual | fit | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 0 | 1.00 | 0.00 | 0.00 | 0 | 0.00 | 0 | 0.00 | 0.00 | 0.00 |
| 7 | 7 | 0 | 0.00 | 0.00 | 0.08 | 0 | 0.00 | 0 | 0.67 | 0.00 | 0.25 |
| 3 | 7 | 0 | 0.00 | 0.00 | 0.08 | 0 | 0.00 | 0 | 0.48 | 0.31 | 0.13 |
| 3 | 3 | 0 | 0.00 | 0.17 | 0.83 | 0 | 0.00 | 0 | 0.00 | 0.00 | 0.00 |
| 8 | 8 | 0 | 0.00 | 0.00 | 0.00 | 0 | 0.00 | 0 | 0.00 | 0.98 | 0.02 |
| 1 | 1 | 0 | 1.00 | 0.00 | 0.00 | 0 | 0.00 | 0 | 0.00 | 0.00 | 0.00 |
| 1 | 1 | 0 | 0.92 | 0.00 | 0.00 | 0 | 0.00 | 0 | 0.08 | 0.00 | 0.00 |
| 0 | 0 | 1 | 0.00 | 0.00 | 0.00 | 0 | 0.00 | 0 | 0.00 | 0.00 | 0.00 |
| 8 | 8 | 0 | 0.00 | 0.00 | 0.00 | 0 | 0.03 | 0 | 0.00 | 0.97 | 0.00 |
| 1 | 1 | 0 | 1.00 | 0.00 | 0.00 | 0 | 0.00 | 0 | 0.00 | 0.00 | 0.00 |
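Beyond the fitted class, the probability matrix can be used to flag uncertain predictions. A small sketch (not in the original code; the 0.7 cutoff is an arbitrary illustration):
maxProb <- apply(mnist.kknn$prob, 1, max)  # highest class probability for each test digit
lowConf <- which(maxProb < 0.7)            # test cases the model is unsure about
length(lowConf)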
auto <- read.csv("../data/auto-mpg.csv",
colClasses = c(
origin = "factor"))
auto$origin <- revalue(auto$origin,  # the origin codes in the csv carry a trailing tab
                       c("1\t"="USA", "2\t"="Europe", "3\t"="Asia"))
kable(head(auto,5))
| mpg | cylinders | displacement | horsepower | weight | acceleration | modelyear | origin |
|---|---|---|---|---|---|---|---|
| 18 | 8 | 307 | 130 | 3504 | 12.0 | 70 | USA |
| 15 | 8 | 350 | 165 | 3693 | 11.5 | 70 | USA |
| 18 | 8 | 318 | 150 | 3436 | 11.0 | 70 | USA |
| 16 | 8 | 304 | 150 | 3433 | 12.0 | 70 | USA |
| 17 | 8 | 302 | 140 | 3449 | 10.5 | 70 | USA |
par(.pardefault)  # restore the default graphical parameters saved at setup
p1<-ggplot(auto, aes(weight, mpg)) +
geom_point() +
labs(y = "Miles per gallon",
x = "Vehicle weight")
p2<-ggplot(auto, aes(modelyear, mpg)) +
geom_point() +
theme(axis.title.y=element_blank(),
axis.text.y=element_blank(),
axis.ticks.y=element_blank()) +
labs(x = "Model year")
grid.arrange(p1,p2,ncol=2,
top=textGrob("Scatterplots for MPG data",
gp=gpar(fontsize=14,font=8)))
Note that row numbering differs between Python and R by 1 (Python starts row numbering at 0, and R starts at 1).
dummies <- dummyVars(" ~ .", data = auto, fullRank = TRUE)
autoNumeric <- data.frame(predict(dummies, newdata = auto))
kable(tail(autoNumeric,5))
| | mpg | cylinders | displacement | horsepower | weight | acceleration | modelyear | origin.Europe | origin.Asia |
|---|---|---|---|---|---|---|---|---|---|
| 388 | 27 | 4 | 140 | 86 | 2790 | 15.6 | 82 | 0 | 0 |
| 389 | 44 | 4 | 97 | 52 | 2130 | 24.6 | 82 | 1 | 0 |
| 390 | 32 | 4 | 135 | 84 | 2295 | 11.6 | 82 | 0 | 0 |
| 391 | 28 | 4 | 120 | 79 | 2625 | 18.6 | 82 | 0 | 0 |
| 392 | 31 | 4 | 119 | 82 | 2720 | 19.4 | 82 | 0 | 0 |
trainIndex <- createDataPartition(autoNumeric$mpg, p = .8,
list = FALSE,
times = 1)
autoTrain <- autoNumeric[ trainIndex,]
autoTest <- autoNumeric[-trainIndex,]
lmFit <- train(mpg ~ ., data = autoTrain,
method = "lm")
lmPred <- predict(lmFit, newdata = autoTest)
rowsInTable <- 1:5
kable(data.frame("Origin.Europe" = autoTest$origin.Europe[rowsInTable],
"Origin.Asia" = autoTest$origin.Asia[rowsInTable],
"MPG" = autoTest$mpg[rowsInTable],
"Predicted MPG" = lmPred[rowsInTable]))
| | Origin.Europe | Origin.Asia | MPG | Predicted.MPG |
|---|---|---|---|---|
| 1 | 0 | 0 | 18 | 14.89092 |
| 4 | 0 | 0 | 16 | 14.92356 |
| 6 | 0 | 0 | 15 | 10.74239 |
| 7 | 0 | 0 | 14 | 10.74370 |
| 10 | 0 | 0 | 15 | 13.14265 |
ggplot(autoTest, aes(x=mpg, y=lmPred)) +
geom_point() +
geom_abline(slope = 1, intercept = 0) +
labs(x="MPG", y="Predicted MPG")
rfFit <- train(mpg ~ ., data = autoTrain,
method = "rf")
rfPred <- predict(rfFit, newdata = autoTest)
kable(data.frame("Origin.Europe" = autoTest$origin.Europe[rowsInTable],
"Origin.Asia" = autoTest$origin.Asia[rowsInTable],
"MPG" = autoTest$mpg[rowsInTable],
"Predicted MPG" = rfPred[rowsInTable]))
| | Origin.Europe | Origin.Asia | MPG | Predicted.MPG |
|---|---|---|---|---|
| 1 | 0 | 0 | 18 | 15.80731 |
| 4 | 0 | 0 | 16 | 16.32332 |
| 6 | 0 | 0 | 15 | 13.44830 |
| 7 | 0 | 0 | 14 | 13.53114 |
| 10 | 0 | 0 | 15 | 13.77313 |
ggplot(autoTest, aes(x=mpg, y=rfPred)) +
geom_point() +
geom_abline(slope = 1, intercept = 0) +
labs(x="MPG", y="Predicted MPG")