This file contains R code to accompany Chapter 3 of the book “Real-World Machine Learning”, by Henrik Brink, Joseph W. Richards, and Mark Fetherolf. The code was contributed by Paul Adamson.
NOTE: working directory should be set to this file’s location.
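The code below assumes the packages used throughout this chapter are already loaded. The original setup chunk is not shown in this section, so the following is a minimal sketch of the library calls the rest of the code relies on; adjust it to match your own setup.

library(plyr)      # revalue
library(dplyr)     # mutate, select, filter
library(knitr)     # kable
library(vcd)       # mosaic
library(caret)     # dummyVars, createDataPartition, train, confusionMatrix
library(kknn)      # kknn
library(ggplot2)   # scatterplots
library(gridExtra) # grid.arrange
library(grid)      # textGrob, gpar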
We are going to be interested in predicting survival, so it is useful to specify the `Survived` variable to be of type `factor`. For visualizing the data, it is also useful to use the `revalue` function (from the `plyr` package) to specify the `no` and `yes` levels for the `factor` variable. The `kable` function is built into the `knitr` package.
# Read the Titanic data, forcing Survived to be a factor and keeping
# the free-text columns as plain character vectors
titanic <- read.csv("../data/titanic.csv",
                    colClasses = c(
                      Survived = "factor",
                      Name     = "character",
                      Ticket   = "character",
                      Cabin    = "character"))

# Relabel the factor levels so tables and plots read "no"/"yes" instead of 0/1
titanic$Survived <- revalue(titanic$Survived, c("0" = "no", "1" = "yes"))
kable(head(titanic, 6), digits = 2)
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | no | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.25 | | S |
2 | yes | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38 | 1 | 0 | PC 17599 | 71.28 | C85 | C |
3 | yes | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.92 | | S |
4 | yes | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.10 | C123 | S |
5 | no | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.05 | | S |
6 | no | 3 | Moran, Mr. James | male | NA | 0 | 0 | 330877 | 8.46 | | Q |
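As a quick sanity check that the `colClasses` and `revalue` steps did what we wanted, we can inspect the relabeled factor directly (this check is an addition here, not part of the original chapter code).

# Survived should now be a factor with levels "no" and "yes"
str(titanic$Survived)
table(titanic$Survived)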
The Visualizing Categorical Data package (`vcd`) provides an excellent set of functions for exploring categorical data, including mosaic plots.
# Mosaic plot of survival by gender; shade = TRUE colors the tiles by their
# Pearson residuals from a model of independence
mosaic(~ Sex + Survived,
       data = titanic,
       main = "Mosaic plot for Titanic data: Gender vs. survival",
       shade = TRUE,
       split_vertical = TRUE,
       labeling_args = list(
         set_varnames = c(Survived = "Survived?",
                          Sex = "Gender")))
First, we get rid of the variables that we do not want in our model. (`Cabin` might actually be useful, but it is not used here.) Then we use `is.na` to set missing `Age` values to -1. The `mutate` and `select` functions make it easy to take the square root of the `Fare` variable and then drop the original from the dataset. We then drop rows with missing `Embarked` data and remove the unused level `""`. Finally, we convert `factor` variables to dummy variables using the `dummyVars` function in the `caret` package. To avoid perfect collinearity (a.k.a. the dummy variable trap), we set the `fullRank` parameter to `TRUE`. `Survived.yes` is then converted back to a `factor` variable.
# Drop identifier and free-text columns that we will not model
titanicTidy <- subset(titanic, select = -c(PassengerId, Name, Ticket, Cabin))

# Flag missing ages with -1 rather than dropping those rows
titanicTidy$Age[is.na(titanicTidy$Age)] <- -1

# Replace Fare with its square root
titanicTidy <- titanicTidy %>%
  mutate(sqrtFare = sqrt(Fare)) %>%
  select(-Fare)

# Drop rows with missing Embarked data and remove the unused "" level
titanicTidy <- titanicTidy %>%
  filter(!(Embarked == "")) %>%
  droplevels()

# Expand factors into dummies; fullRank = TRUE avoids the dummy variable trap
dummies <- dummyVars(" ~ .", data = titanicTidy, fullRank = TRUE)
titanicTidyNumeric <- data.frame(predict(dummies, newdata = titanicTidy))
titanicTidyNumeric$Survived.yes <- factor(titanicTidyNumeric$Survived.yes)
kable(head(titanicTidyNumeric))
Survived.yes | Pclass | Sex.male | Age | SibSp | Parch | Embarked.Q | Embarked.S | sqrtFare |
---|---|---|---|---|---|---|---|---|
0 | 3 | 1 | 22 | 1 | 0 | 0 | 1 | 2.692582 |
1 | 1 | 0 | 38 | 1 | 0 | 0 | 0 | 8.442944 |
1 | 3 | 0 | 26 | 0 | 0 | 0 | 1 | 2.815138 |
1 | 1 | 0 | 35 | 1 | 0 | 0 | 1 | 7.286975 |
0 | 3 | 1 | 35 | 0 | 0 | 0 | 1 | 2.837252 |
0 | 3 | 1 | -1 | 0 | 0 | 1 | 0 | 2.908316 |
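To see what `fullRank = TRUE` buys us, we can rebuild the dummy variables with `fullRank = FALSE` and compare the column names; the extra columns are the redundant first levels that would be perfectly collinear with the ones kept above. (This comparison is a quick sketch added here, not part of the original chapter code.)

# With fullRank = FALSE every factor level gets its own indicator column
dummiesAll <- dummyVars(" ~ .", data = titanicTidy, fullRank = FALSE)
setdiff(colnames(predict(dummiesAll, newdata = titanicTidy)),
        colnames(titanicTidyNumeric))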
Thanks to Longhow Lam for posting the code used in the `displayMnistSamples` function, which displays digits from the MNIST dataset.
mnist <- read.csv("../data/mnist_small.csv",
                  colClasses = c(label = "factor"))

# Display MNIST digits: each row holds a label followed by 784 pixel values,
# which we reshape into a 28x28 image
displayMnistSamples <- function(x) {
  for (i in x) {
    y <- as.matrix(mnist[i, 2:785])
    dim(y) <- c(28, 28)
    image(y[, nrow(y):1], axes = FALSE, col = gray(0:255 / 255))
    text(0.2, 0, mnist[i, 1], cex = 3, col = 2, pos = c(3, 4))
  }
}

par(mfrow = c(1, 4), mai = c(0, 0, 0, 0.1))
# Sample four row indices (nrow(mnist), not length(mnist), which counts columns)
displayMnistSamples(sample(1:nrow(mnist), 4))
# Hold out 20% of the data for testing, stratified by digit label
trainIndex <- createDataPartition(mnist$label, p = .8,
                                  list = FALSE,
                                  times = 1)
mnistTrain <- mnist[ trainIndex, ]
mnistTest  <- mnist[-trainIndex, ]

# Weighted k-nearest-neighbors classifier on the raw pixel values
mnist.kknn <- kknn(label ~ ., mnistTrain, mnistTest, distance = 1,
                   kernel = "triangular")
confusionMatrix(fitted(mnist.kknn), mnistTest$label)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1 2 3 4 5 6 7 8 9
## 0 20 0 1 0 0 0 1 0 0 0
## 1 0 19 2 0 0 1 0 0 2 1
## 2 0 0 20 0 0 0 0 0 0 0
## 3 0 0 0 16 0 0 0 0 3 0
## 4 0 0 0 0 12 0 1 0 1 2
## 5 0 0 0 1 0 14 2 0 0 0
## 6 1 0 0 0 1 1 15 0 1 0
## 7 0 0 0 1 0 0 0 18 0 3
## 8 0 0 1 0 0 0 0 0 11 0
## 9 0 0 0 0 7 1 0 3 0 13
##
## Overall Statistics
##
## Accuracy : 0.8061
## 95% CI : (0.7437, 0.859)
## No Information Rate : 0.1224
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7844
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity 0.9524 1.00000 0.8333 0.88889 0.60000 0.82353
## Specificity 0.9886 0.96610 1.0000 0.98315 0.97727 0.98324
## Pos Pred Value 0.9091 0.76000 1.0000 0.84211 0.75000 0.82353
## Neg Pred Value 0.9943 1.00000 0.9773 0.98870 0.95556 0.98324
## Prevalence 0.1071 0.09694 0.1224 0.09184 0.10204 0.08673
## Detection Rate 0.1020 0.09694 0.1020 0.08163 0.06122 0.07143
## Detection Prevalence 0.1122 0.12755 0.1020 0.09694 0.08163 0.08673
## Balanced Accuracy 0.9705 0.98305 0.9167 0.93602 0.78864 0.90338
## Class: 6 Class: 7 Class: 8 Class: 9
## Sensitivity 0.78947 0.85714 0.61111 0.68421
## Specificity 0.97740 0.97714 0.99438 0.93785
## Pos Pred Value 0.78947 0.81818 0.91667 0.54167
## Neg Pred Value 0.97740 0.98276 0.96196 0.96512
## Prevalence 0.09694 0.10714 0.09184 0.09694
## Detection Rate 0.07653 0.09184 0.05612 0.06633
## Detection Prevalence 0.09694 0.11224 0.06122 0.12245
## Balanced Accuracy 0.88344 0.91714 0.80275 0.81103
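As a quick sanity check on the confusion matrix, the overall accuracy can also be computed directly from the fitted values (an added one-liner, not part of the original chapter code); it should agree with the Accuracy entry reported above.

# Fraction of test digits classified correctly
mean(fitted(mnist.kknn) == mnistTest$label)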
# Show the predicted class probabilities for the first ten test digits
rowsInTable <- 1:10
prob <- as.data.frame(mnist.kknn$prob[rowsInTable, ])
mnistResultsDF <- data.frame(mnistTest$label[rowsInTable],
                             mnist.kknn$fit[rowsInTable],
                             prob)
kable(mnistResultsDF, digits = 2,
      col.names = c("actual", "fit", 0:9))
actual | fit | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | 0 | 1.00 | 0.00 | 0.00 | 0 | 0.00 | 0 | 0.00 | 0.00 | 0.00 |
7 | 7 | 0 | 0.00 | 0.00 | 0.08 | 0 | 0.00 | 0 | 0.67 | 0.00 | 0.25 |
3 | 7 | 0 | 0.00 | 0.00 | 0.08 | 0 | 0.00 | 0 | 0.48 | 0.31 | 0.13 |
3 | 3 | 0 | 0.00 | 0.17 | 0.83 | 0 | 0.00 | 0 | 0.00 | 0.00 | 0.00 |
8 | 8 | 0 | 0.00 | 0.00 | 0.00 | 0 | 0.00 | 0 | 0.00 | 0.98 | 0.02 |
1 | 1 | 0 | 1.00 | 0.00 | 0.00 | 0 | 0.00 | 0 | 0.00 | 0.00 | 0.00 |
1 | 1 | 0 | 0.92 | 0.00 | 0.00 | 0 | 0.00 | 0 | 0.08 | 0.00 | 0.00 |
0 | 0 | 1 | 0.00 | 0.00 | 0.00 | 0 | 0.00 | 0 | 0.00 | 0.00 | 0.00 |
8 | 8 | 0 | 0.00 | 0.00 | 0.00 | 0 | 0.03 | 0 | 0.00 | 0.97 | 0.00 |
1 | 1 | 0 | 1.00 | 0.00 | 0.00 | 0 | 0.00 | 0 | 0.00 | 0.00 | 0.00 |
# Read the auto MPG data; origin is a coded factor (its values contain a trailing tab)
auto <- read.csv("../data/auto-mpg.csv",
                 colClasses = c(origin = "factor"))
auto$origin <- revalue(auto$origin,
                       c("1\t" = "USA", "2\t" = "Europe", "3\t" = "Asia"))
kable(head(auto, 5))
mpg | cylinders | displacement | horsepower | weight | acceleration | modelyear | origin |
---|---|---|---|---|---|---|---|
18 | 8 | 307 | 130 | 3504 | 12.0 | 70 | USA |
15 | 8 | 350 | 165 | 3693 | 11.5 | 70 | USA |
18 | 8 | 318 | 150 | 3436 | 11.0 | 70 | USA |
16 | 8 | 304 | 150 | 3433 | 12.0 | 70 | USA |
17 | 8 | 302 | 140 | 3449 | 10.5 | 70 | USA |
# Restore the default graphical parameters saved earlier as .pardefault
# (e.g. .pardefault <- par(no.readonly = TRUE) in the setup chunk)
par(.pardefault)

p1 <- ggplot(auto, aes(weight, mpg)) +
  geom_point() +
  labs(y = "Miles per gallon",
       x = "Vehicle weight")
p2 <- ggplot(auto, aes(modelyear, mpg)) +
  geom_point() +
  theme(axis.title.y = element_blank(),
        axis.text.y  = element_blank(),
        axis.ticks.y = element_blank()) +
  labs(x = "Model year")
grid.arrange(p1, p2, ncol = 2,
             top = textGrob("Scatterplots for MPG data",
                            gp = gpar(fontsize = 14, font = 8)))
Note that the row numbering differs between Python and R by 1 (Python starts row numbering at 0, and R starts at 1).
dummies <- dummyVars(" ~ .", data = auto, fullRank = TRUE)
autoNumeric <- data.frame(predict(dummies, newdata = auto))
kable(tail(autoNumeric,5))
  | mpg | cylinders | displacement | horsepower | weight | acceleration | modelyear | origin.Europe | origin.Asia |
---|---|---|---|---|---|---|---|---|---|
388 | 27 | 4 | 140 | 86 | 2790 | 15.6 | 82 | 0 | 0 |
389 | 44 | 4 | 97 | 52 | 2130 | 24.6 | 82 | 1 | 0 |
390 | 32 | 4 | 135 | 84 | 2295 | 11.6 | 82 | 0 | 0 |
391 | 28 | 4 | 120 | 79 | 2625 | 18.6 | 82 | 0 | 0 |
392 | 31 | 4 | 119 | 82 | 2720 | 19.4 | 82 | 0 | 0 |
# Hold out 20% of the data for testing
trainIndex <- createDataPartition(autoNumeric$mpg, p = .8,
                                  list = FALSE,
                                  times = 1)
autoTrain <- autoNumeric[ trainIndex, ]
autoTest  <- autoNumeric[-trainIndex, ]

# Ordinary linear regression of mpg on all remaining variables
lmFit <- train(mpg ~ ., data = autoTrain,
               method = "lm")
lmPred <- predict(lmFit, newdata = autoTest)

rowsInTable <- 1:5
kable(data.frame("Origin.Europe" = autoTest$origin.Europe[rowsInTable],
                 "Origin.Asia"   = autoTest$origin.Asia[rowsInTable],
                 "MPG"           = autoTest$mpg[rowsInTable],
                 "Predicted MPG" = lmPred[rowsInTable]))
  | Origin.Europe | Origin.Asia | MPG | Predicted.MPG |
---|---|---|---|---|
1 | 0 | 0 | 18 | 14.89092 |
4 | 0 | 0 | 16 | 14.92356 |
6 | 0 | 0 | 15 | 10.74239 |
7 | 0 | 0 | 14 | 10.74370 |
10 | 0 | 0 | 15 | 13.14265 |
# Predicted vs. actual MPG; points on the diagonal are perfect predictions
ggplot(autoTest, aes(x = mpg, y = lmPred)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0) +
  labs(x = "MPG", y = "Predicted MPG")
# Random forest regression on the same training data
rfFit <- train(mpg ~ ., data = autoTrain,
               method = "rf")
rfPred <- predict(rfFit, newdata = autoTest)
kable(data.frame("Origin.Europe" = autoTest$origin.Europe[rowsInTable],
                 "Origin.Asia"   = autoTest$origin.Asia[rowsInTable],
                 "MPG"           = autoTest$mpg[rowsInTable],
                 "Predicted MPG" = rfPred[rowsInTable]))
  | Origin.Europe | Origin.Asia | MPG | Predicted.MPG |
---|---|---|---|---|
1 | 0 | 0 | 18 | 15.80731 |
4 | 0 | 0 | 16 | 16.32332 |
6 | 0 | 0 | 15 | 13.44830 |
7 | 0 | 0 | 14 | 13.53114 |
10 | 0 | 0 | 15 | 13.77313 |
# Predicted vs. actual MPG for the random forest
ggplot(autoTest, aes(x = mpg, y = rfPred)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0) +
  labs(x = "MPG", y = "Predicted MPG")