This file contains R code to accompany Chapter 2 of the book “Real-World Machine Learning”, by Henrik Brink, Joseph W. Richards, and Mark Fetherolf. The code was contributed by Paul Adamson.
NOTE: working directory should be set to this file’s location.
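Creating dummy variables in R is straightforward with the model.matrix function. In the code below, when the personData dataframe is created, the maritalstatus variable becomes a factor with levels “married” and “single” (sorted alphabetically), because data.frame converts strings to factors by default in R versions before 4.0. In R 4.0 and later, pass stringsAsFactors = TRUE to data.frame to reproduce the output shown.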
personData <- data.frame(person = 1:2,
                         name = c("Jane Doe", "John Smith"),
                         age = c(24, 41),
                         income = c(81200, 121000),
                         maritalstatus = c("single", "married"))
personDataNew <- data.frame(personData[, 1:4],
                            model.matrix(~ maritalstatus - 1,
                                         data = personData))
str(personData)
## 'data.frame': 2 obs. of 5 variables:
## $ person : int 1 2
## $ name : Factor w/ 2 levels "Jane Doe","John Smith": 1 2
## $ age : num 24 41
## $ income : num 81200 121000
## $ maritalstatus: Factor w/ 2 levels "married","single": 2 1
In the call to model.matrix, the -1 in the model formula ensures that we create a dummy variable for each of the two marital statuses (technically, it suppresses the creation of an intercept).
model.matrix(~ maritalstatus - 1,
             data = personData)
## maritalstatusmarried maritalstatussingle
## 1 0 1
## 2 1 0
## attr(,"assign")
## [1] 1 1
## attr(,"contrasts")
## attr(,"contrasts")$maritalstatus
## [1] "contr.treatment"
The matrix of dummy variables is then joined to the original dataframe (minus the maritalstatus column) with another call to data.frame.
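To see what the -1 changes, the following minimal sketch builds the design matrix with and without the intercept. The two-row maritalstatus column mirrors personData, but it is wrapped in factor() explicitly so the result does not depend on the R version. The names status, withIntercept, and withoutIntercept are illustrative only.

```r
# Toy data mirroring the maritalstatus column of personData.
status <- data.frame(maritalstatus = factor(c("single", "married")))

# Default formula: an intercept column plus one dummy. The first level
# ("married") becomes the reference and gets no column of its own.
withIntercept <- model.matrix(~ maritalstatus, data = status)

# With "- 1": no intercept, so every level gets its own dummy column.
withoutIntercept <- model.matrix(~ maritalstatus - 1, data = status)

colnames(withIntercept)
colnames(withoutIntercept)
```

With the intercept kept, only the non-reference level appears as a dummy column, which is why the formula above removes it.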
The packages dplyr and tidyr are excellent for tidying and preprocessing data, including creating new features from existing ones. (Note: plyr supplies the revalue function used below, and it must be loaded before dplyr; otherwise plyr masks several dplyr functions, such as mutate and summarise.)
library(plyr)   # load plyr first so that dplyr's functions are not masked
library(dplyr)
library(tidyr)
titanic <- read.csv("../data/titanic.csv",
                    colClasses = c(
                      Survived = "factor",
                      Name = "character",
                      Ticket = "character",
                      Cabin = "character"))
titanic$Survived <- revalue(titanic$Survived, c("0" = "no", "1" = "yes"))
titanicNew <- titanic %>%
  # keep only the first cabin listed for each passenger
  separate(Cabin, into = "firstCabin", sep = " ", extra = "drop", remove = FALSE) %>%
  # split that cabin after its first character: letter and number
  separate(firstCabin, into = c("cabinChar", "cabinNum"), sep = 1) %>%
  rowwise() %>%
  mutate(numCabins = length(unlist(strsplit(Cabin, " "))))
str(titanicNew)
## Classes 'rowwise_df', 'tbl_df', 'tbl' and 'data.frame': 891 obs. of 15 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : Factor w/ 2 levels "no","yes": 1 2 2 2 1 1 1 1 2 2 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr "" "C85" "" "C123" ...
## $ cabinChar : chr "" "C" "" "C" ...
## $ cabinNum : chr "" "85" "" "123" ...
## $ Embarked : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
## $ numCabins : int 0 1 0 1 0 0 1 0 0 0 ...
In Listing 2.2, read.csv is used to read in the comma-separated value (CSV) data file. The colClasses argument specifies the correct class for some features. Then, the revalue function renames the levels of the Survived factor so that “0” becomes “no” and “1” becomes “yes”. The titanicNew dataframe is then created by piping together separate from tidyr and mutate from dplyr. separate does exactly what its name implies: it separates a single character column into multiple columns. mutate adds a new feature, often (as in this case) by acting on the values of another feature. Here, rowwise makes mutate evaluate its expression one row at a time, so numCabins counts the cabins listed for each passenger.
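For comparison, the same three features can be derived with base R alone. This is a sketch under stated assumptions, not the book's approach: the cabin strings are illustrative values in the format of the Cabin column, and the vector names are hypothetical.

```r
# Illustrative cabin strings: none, one cabin, three cabins.
cabin <- c("", "C85", "C23 C25 C27")

# Keep only the first cabin listed (drop everything from the first space on).
firstCabin <- sub(" .*", "", cabin)

# Split the first cabin after its first character: leading letter and number.
cabinChar <- substr(firstCabin, 1, 1)
cabinNum  <- substring(firstCabin, 2)

# Count cabins per passenger. lengths() is vectorized over the list
# returned by strsplit(), so no row-by-row evaluation is needed.
numCabins <- lengths(strsplit(cabin, " "))

data.frame(cabin, cabinChar, cabinNum, numCabins)
```

Because lengths works on the whole list at once, this version avoids the row-by-row step that the piped version relies on.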
The code below normalizes a feature using the “min-max” method, which rescales the values linearly to the range from fMin to fMax (-1 to 1 by default). As an example, the Age feature of the titanic dataframe is normalized, and a histogram of the new normalized feature is plotted with ggplot2.
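library(ggplot2)
# Rescale `data` linearly so its minimum maps to fMin and its maximum to fMax.
# Missing values are ignored when finding the range and stay NA in the result.
normalizeFeature <- function(data, fMin = -1.0, fMax = 1.0) {
  dMin <- min(data, na.rm = TRUE)
  dMax <- max(data, na.rm = TRUE)
  scaleFactor <- (fMax - fMin) / (dMax - dMin)
  fMin + (data - dMin) * scaleFactor
}
titanic$AgeNormalized <- normalizeFeature(titanic$Age)
ggplot(data = titanic, aes(AgeNormalized)) +
  geom_histogram()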
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 177 rows containing non-finite values (stat_bin).
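As a quick check of the min-max arithmetic, normalized = fMin + (x - dMin) * (fMax - fMin) / (dMax - dMin), this sketch applies it to a small made-up vector. The function minMax is a hypothetical stand-in that follows the same arithmetic as normalizeFeature, and the input values are invented for illustration.

```r
# Hypothetical stand-in with the same arithmetic as normalizeFeature.
minMax <- function(x, fMin = -1, fMax = 1) {
  dMin <- min(x, na.rm = TRUE)
  dMax <- max(x, na.rm = TRUE)
  fMin + (x - dMin) * (fMax - fMin) / (dMax - dMin)
}

ages <- c(2, 22, NA, 42)   # invented values; the NA stays NA

# Default target range [-1, 1]: minimum -> -1, midpoint -> 0, maximum -> 1.
normalized <- minMax(ages)

# The same data mapped to [0, 1] instead.
unitRange <- minMax(ages, fMin = 0, fMax = 1)

normalized
unitRange
```

The missing value is skipped when the range is computed but is preserved in the output, which is why the histogram above reports removed rows.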
The “Visualizing Categorical Data” (vcd) package provides an excellent set of functions for exploring categorical data, including mosaic plots.
library(vcd)
mosaic(
  ~ Sex + Survived,
  data = titanic,
  main = "Mosaic plot for Titanic data: Gender vs. survival",
  shade = TRUE,
  split_vertical = TRUE,
  labeling_args = list(
    set_varnames = c(
      Survived = "Survived?")))
mosaic(
  ~ Pclass + Survived,
  data = titanic,
  main = "Mosaic plot for Titanic data: Passenger Class vs. survival",
  shade = TRUE,
  split_vertical = TRUE,
  labeling_args = list(
    set_varnames = c(
      Pclass = "Passenger class",
      Survived = "Survived?")))
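A mosaic plot draws one tile per cell of a contingency table, with tile area proportional to the cell count. The sketch below tabulates a small invented sample (not the Titanic data) with base R's table and prop.table to show the kind of numbers such a plot encodes. The names toy, counts, and rowShare are illustrative.

```r
# Invented five-row sample, for illustration only.
toy <- data.frame(
  Sex      = c("female", "female", "male", "male", "male"),
  Survived = c("yes", "yes", "no", "no", "yes"))

# Cell counts: one row per Sex level, one column per Survived level.
counts <- table(Sex = toy$Sex, Survived = toy$Survived)

# Share of each outcome within each sex (every row sums to 1).
rowShare <- prop.table(counts, margin = 1)

counts
rowShare
```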
The boxplot function is provided as part of the standard graphics package in R. ggplot2 provides a much nicer version through geom_boxplot.
boxplot(Age ~ Survived,
        data = titanic,
        xlab = "Survived?",
        ylab = "Age\n(years)",
        las = 1)
ggplot(titanic, aes(Survived, Age)) +
  geom_boxplot() +
  xlab("Survived?") +
  ylab("Age\n(years)")
## Warning: Removed 177 rows containing non-finite values (stat_boxplot).
Plots can be combined in rows and columns using the mfrow graphical parameter, set via the par function. (Try help(par) to learn more.)
par(mfrow = c(1, 2))
par(mai = c(1, 1, .1, .1), las = 1)
boxplot(Fare ~ Survived,
        data = titanic,
        xlab = "Survived?",
        ylab = "Fare Amount")
boxplot(sqrt(Fare) ~ Survived,
        data = titanic,
        xlab = "Survived?",
        ylab = "sqrt\n(fare amount)")
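Because settings made with par persist until they are changed again, a common idiom is to save the old values and restore them afterwards. The sketch below uses only base graphics; the fare values are made up to show how a square-root transform pulls in a long right tail, and the names oldPar and fares are illustrative.

```r
# par() returns the previous values of the parameters it changes.
oldPar <- par(mfrow = c(1, 2), las = 1)

fares <- c(7, 8, 8, 13, 26, 31, 71, 512)   # invented, right-skewed values

# Left panel: raw values. Right panel: square-root transformed values.
boxplot(fares, ylab = "Fare amount")
boxplot(sqrt(fares), ylab = "sqrt(fare amount)")

# Restore the saved settings so later plots use a single panel again.
par(oldPar)
```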
The sm.density.compare function from the sm package is useful for comparing a set of univariate density estimates. The first argument is a vector of data, and the second argument is a vector of group labels, one for each value. The colfill variable and the legend function are used to place a legend on the plot generated by sm.density.compare.
library(sm)
par(mfrow = c(1, 1))
auto <- read.csv("../data/auto-mpg.csv",
                 colClasses = c(origin = "factor"))
# The level names matched here include a trailing tab character.
auto$origin <- revalue(auto$origin,
                       c("1\t" = "USA", "2\t" = "Europe", "3\t" = "Asia"))
sm.density.compare(auto$mpg, auto$origin,
                   xlab = "Miles per gallon",
                   ylab = "Density")
title(main = "Density plot for MPG data, by region")
# One fill colour per origin level, starting at colour 2.
colfill <- 1 + seq_along(levels(auto$origin))
legend(x = "topleft",
       inset = 0.05,
       text.width = 5,
       legend = levels(auto$origin),
       fill = colfill)
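For readers without the sm package, a rough base R analogue is sketched below. It is not how sm.density.compare works internally, and its bandwidth choices may differ. The data are simulated rather than read from auto-mpg.csv, and every variable name is illustrative.

```r
set.seed(42)   # simulated data, for illustration only
mpg    <- c(rnorm(60, mean = 20, sd = 4), rnorm(40, mean = 28, sd = 5))
origin <- factor(rep(c("USA", "Europe"), times = c(60, 40)))

# One kernel density estimate per group (default bandwidth for each).
dens <- lapply(split(mpg, origin), density)
cols <- 1 + seq_along(dens)   # colours 2, 3, ... one per group

# Empty plot sized to hold every curve, then one line per group.
xRange <- range(sapply(dens, function(d) range(d$x)))
yMax   <- max(sapply(dens, function(d) max(d$y)))
plot(NA, xlim = xRange, ylim = c(0, yMax),
     xlab = "Miles per gallon", ylab = "Density")
for (i in seq_along(dens)) lines(dens[[i]], col = cols[i])
legend("topleft", inset = 0.05, legend = names(dens), fill = cols)
```

As in the sm version, the data vector and the group labels are passed separately, and the legend colours are matched to the group order.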
It doesn’t get much simpler than the plot function in R.
par(mfrow = c(1, 2), mai = c(1, 1, .1, .1), las = 1)
plot(auto$weight, auto$mpg,
     xlab = "Vehicle weight",
     ylab = "Miles per\ngallon")
plot(auto$modelyear, auto$mpg,
     xlab = "Model year",
     ylab = "Miles per\ngallon")
Although not in the book, everyone should become familiar with facets via ggplot2.
p <- ggplot(auto, aes(mpg, weight)) +
  geom_point()
p + facet_grid(. ~ origin)