Abstract

The rwml-R Github repo is updated with R code to accompany Chapter 3 of the book “Real-World Machine Learning” by Henrik Brink, Joseph W. Richards, and Mark Fetherolf.

Survivors on the Titanic

The Titanic Passengers dataset is used to illustrate various processes used to prepare data for modeling, including conversion of factor variables to dummy variables. For example, the code to produce the following table of processed data is provided:

Survived.yes Pclass Sex.male Age SibSp Parch Embarked.Q Embarked.S sqrtFare
0 3 1 22 1 0 0 1 2.692582
1 1 0 38 1 0 0 0 8.442944
1 3 0 26 0 0 0 1 2.815138
1 1 0 35 1 0 0 1 7.286975
0 3 1 35 0 0 0 1 2.837252
0 3 1 -1 0 0 1 0 2.908316

I also go “off-script” a bit (do some things not contained in the book) and demonstrate some useful visualization, modeling, and performance measuring techniques available with the caret and AppliedPredictiveModeling packages.

MNIST database of handwritten digits

A k-nearest neighbors classifier (from the kknn package) is used to predict the numbers represented in the MNIST database of handwritten digits. Examples of the types of digits present in the dataset and the R code to display them:

Figure generated by above code

Auto MPG dataset

As an example of a linear regression analysis, the Auto MPG dataset introduced in Chapter 2 resurfaces and fuel economy is predicted from origin, year of production, and performance characteristics such as horsepower and engine displacement.

As always, feedback is welcome

As always, I’d love to hear from you if you find the project helpful or if you have any suggestions. Please leave a comment below or use the Tweet button. Also, feel free to fork the rwml-R repo and submit a pull request if you wish to contribute.

Download Fork