Practical Machine Learning course project

The goal of this project is to predict the classe variable from the remaining variables in the data set.

Load the data:

library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(e1071)

pmlraw <- read.csv("pml-training.csv")
pmltest <- read.csv("pml-testing.csv")

set.seed(4483)
inTrain <- createDataPartition(pmlraw$classe, p = 0.75)[[1]]
pmltrain <- pmlraw[inTrain,]
pmlvalid <- pmlraw[-inTrain,]

# Set up parallelism:
library(parallel)
library(doParallel)
## Loading required package: foreach
## Loading required package: iterators
cluster <- makeCluster(detectCores() - 1) # convention: leave 1 core for the OS
registerDoParallel(cluster)
fitControl <- trainControl(method = "cv",
    number = 10,
    allowParallel = TRUE)

Clean up the data: remove variables that are mostly NA. These appear to be summary statistics computed over windows of the raw sensor readings, and they are absent from the test set.
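The cleanup described above can be sketched as follows (the 90% threshold and the column-wise approach are assumptions; the original cleanup code is not shown in this report):

```r
# Identify columns that are mostly NA (assumed threshold: > 90% NA)
# and drop them from both the training and test sets.
mostlyNA <- sapply(pmlraw, function(col) mean(is.na(col)) > 0.9)
pmltrain <- pmltrain[, !mostlyNA]
pmlvalid <- pmlvalid[, !mostlyNA]
pmltest  <- pmltest[,  !mostlyNA]
```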

Exploratory Data Analysis:

The dimensions of the data:

dim(pmltrain)
## [1] 14718    55

A quick look at a few variables:

plot(pmltrain[,c("classe","roll_belt","pitch_belt","yaw_belt")])

Create a few models:

Random Forest

model.rf <- train(classe ~ ., method="rf", preProcess = c("BoxCox"), data=pmltrain, trControl=fitControl)
## Loading required package: randomForest
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin

Logistic Regression

#model.2 <- train(classe ~ ., method="logreg", preProcess = c("BoxCox"), data=pmltrain, trControl=fitControl)

AdaBoost

#model.3 <- train(classe ~ ., method="adaboost", preProcess = c("BoxCox"), data=pmltrain, trControl=fitControl)

LogitBoost:

model.logitboost <- train(classe ~ ., method="LogitBoost", preProcess = c("BoxCox"), data=pmltrain, trControl=fitControl)
## Loading required package: caTools

Parallel RF:

model.4 <- train(classe ~ ., method="parRF", preProcess = c("BoxCox"), data=pmltrain, trControl=fitControl)

stopCluster(cluster)

Now run predictions on the validation set and compare the results:
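The comparison can be sketched along these lines (the variable names such as pred.rf are assumptions; the original prediction code is not shown):

```r
# Predict on the held-out validation set and compute accuracy per model.
pred.rf         <- predict(model.rf, pmlvalid)
pred.logitboost <- predict(model.logitboost, pmlvalid)
pred.parrf      <- predict(model.4, pmlvalid)

mean(pred.rf == pmlvalid$classe)                 # random forest
sum(pred.logitboost == pmlvalid$classe, na.rm = TRUE) /
    length(pmlvalid$classe)                      # LogitBoost may return NAs
mean(pred.parrf == pmlvalid$classe)              # parallel random forest
```

Note that caret's LogitBoost can return NA for cases where the pairwise votes tie, which is why its accuracy is computed with na.rm = TRUE over the full validation-set length.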

The accuracy results are:

Random forest: 0.9991843

(Logistic regression and AdaBoost do not support more than two classes, so those models were not evaluated.)

Logit boost: 0.8199429

Parallel RF: 0.9991843

The predictions are:
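The table below was presumably produced along these lines (a sketch; type = "prob" returns the random forest's per-class vote fractions):

```r
# Class probabilities and predicted class for the 20 test cases.
probs <- predict(model.rf, pmltest, type = "prob")
preds <- predict(model.rf, pmltest)
cbind(round(probs, 3), Prediction = as.character(preds))
```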

##        A     B     C     D     E Prediction
## 1  0.020 0.910 0.046 0.014 0.010          B
## 2  0.988 0.006 0.004 0.000 0.002          A
## 3  0.092 0.864 0.024 0.006 0.014          B
## 4  0.994 0.000 0.004 0.000 0.002          A
## 5  0.996 0.000 0.002 0.000 0.002          A
## 6  0.002 0.030 0.066 0.034 0.868          E
## 7  0.000 0.000 0.022 0.968 0.010          D
## 8  0.020 0.830 0.062 0.064 0.024          B
## 9  1.000 0.000 0.000 0.000 0.000          A
## 10 0.996 0.004 0.000 0.000 0.000          A
## 11 0.046 0.776 0.114 0.028 0.036          B
## 12 0.008 0.016 0.964 0.002 0.010          C
## 13 0.000 0.994 0.000 0.004 0.002          B
## 14 1.000 0.000 0.000 0.000 0.000          A
## 15 0.000 0.002 0.000 0.002 0.996          E
## 16 0.000 0.006 0.000 0.006 0.988          E
## 17 0.960 0.000 0.000 0.000 0.040          A
## 18 0.036 0.834 0.008 0.110 0.012          B
## 19 0.012 0.978 0.000 0.008 0.002          B
## 20 0.000 0.998 0.000 0.000 0.002          B

And here is a plot of the variable importance:

varImpPlot(model.rf$finalModel)