Practical Machine Learning course project

The goal of this project is to predict the classe variable from the remaining variables in the data set.

Load the data:

library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(e1071)

pmlraw <- read.csv("pml-training.csv")
pmltest <- read.csv("pml-testing.csv")

set.seed(4483)
inTrain <- createDataPartition(pmlraw$classe, p = 0.75)[[1]]
pmltrain <- pmlraw[inTrain,]
pmlvalid <- pmlraw[-inTrain,]

# Set up parallelism:
library(parallel)
library(doParallel)
## Loading required package: foreach
## Loading required package: iterators
cluster <- makeCluster(detectCores() - 1) # convention: leave 1 core for the OS
registerDoParallel(cluster)
fitControl <- trainControl(method = "cv",
    number = 10,
    allowParallel = TRUE)

Clean up the data: remove variables that are mostly NA. These appear to be summary statistics computed over windows of the raw sensor readings, and they are absent from the test set.
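The cleanup described above can be sketched as follows (the 90% threshold and the column-wise approach are assumptions; the original cleanup code is not shown in this report):

```r
# Identify columns that are mostly NA (assumed threshold: > 90% NA)
# and drop them from both the training and test sets.
mostlyNA <- sapply(pmlraw, function(col) mean(is.na(col)) > 0.9)
pmltrain <- pmltrain[, !mostlyNA]
pmlvalid <- pmlvalid[, !mostlyNA]
pmltest  <- pmltest[,  !mostlyNA]
```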

Exploratory Data Analysis:

The dimensions of the data:

dim(pmltrain)
## [1] 14718    55

A quick look at a few variables:

plot(pmltrain[,c("classe","roll_belt","pitch_belt","yaw_belt")])

Create a few models:

Random Forest

model.rf <- train(classe ~ ., method="rf", preProcess = c("BoxCox"), data=pmltrain, trControl=fitControl)
## Loading required package: randomForest
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin

Logistic Regression

#model.2 <- train(classe ~ ., method="logreg", preProcess = c("BoxCox"), data=pmltrain, trControl=fitControl)

AdaBoost

#model.3 <- train(classe ~ ., method="adaboost", preProcess = c("BoxCox"), data=pmltrain, trControl=fitControl)

LogitBoost:

model.logitboost <- train(classe ~ ., method="LogitBoost", preProcess = c("BoxCox"), data=pmltrain, trControl=fitControl)
## Loading required package: caTools

Parallel RF:

model.4 <- train(classe ~ ., method="parRF", preProcess = c("BoxCox"), data=pmltrain, trControl=fitControl)

stopCluster(cluster)

Now run predictions on the validation set and compare the results:
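The comparison can be sketched along these lines (the variable names such as pred.rf are assumptions; the original prediction code is not shown):

```r
# Predict on the held-out validation set and compute accuracy per model.
pred.rf         <- predict(model.rf, pmlvalid)
pred.logitboost <- predict(model.logitboost, pmlvalid)
pred.parrf      <- predict(model.4, pmlvalid)

mean(pred.rf == pmlvalid$classe)                 # random forest
sum(pred.logitboost == pmlvalid$classe, na.rm = TRUE) /
    length(pmlvalid$classe)                      # LogitBoost may return NAs
mean(pred.parrf == pmlvalid$classe)              # parallel random forest
```

Note that caret's LogitBoost can return NA for cases where the pairwise votes tie, which is why its accuracy is computed with na.rm = TRUE over the full validation-set length.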

The accuracy results are:

Random forest: 0.9991843

(Logistic regression and AdaBoost do not support more than two classes, so those models were not evaluated.)

Logit boost: 0.8199429

Parallel RF: 0.9991843

The predictions are:
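The table below was presumably produced along these lines (a sketch; type = "prob" returns the random forest's per-class vote fractions):

```r
# Class probabilities and predicted class for the 20 test cases.
probs <- predict(model.rf, pmltest, type = "prob")
preds <- predict(model.rf, pmltest)
cbind(round(probs, 3), Prediction = as.character(preds))
```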

##        A     B     C     D     E Prediction
## 1  0.020 0.910 0.046 0.014 0.010          B
## 2  0.988 0.006 0.004 0.000 0.002          A
## 3  0.092 0.864 0.024 0.006 0.014          B
## 4  0.994 0.000 0.004 0.000 0.002          A
## 5  0.996 0.000 0.002 0.000 0.002          A
## 6  0.002 0.030 0.066 0.034 0.868          E
## 7  0.000 0.000 0.022 0.968 0.010          D
## 8  0.020 0.830 0.062 0.064 0.024          B
## 9  1.000 0.000 0.000 0.000 0.000          A
## 10 0.996 0.004 0.000 0.000 0.000          A
## 11 0.046 0.776 0.114 0.028 0.036          B
## 12 0.008 0.016 0.964 0.002 0.010          C
## 13 0.000 0.994 0.000 0.004 0.002          B
## 14 1.000 0.000 0.000 0.000 0.000          A
## 15 0.000 0.002 0.000 0.002 0.996          E
## 16 0.000 0.006 0.000 0.006 0.988          E
## 17 0.960 0.000 0.000 0.000 0.040          A
## 18 0.036 0.834 0.008 0.110 0.012          B
## 19 0.012 0.978 0.000 0.008 0.002          B
## 20 0.000 0.998 0.000 0.000 0.002          B

And here is a plot of the variable importance:

varImpPlot(model.rf$finalModel)