The goal of this project is to predict the `classe` variable from the remaining variables in the data set.
Load the data:
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(e1071)
pmlraw <- read.csv("pml-training.csv")
pmltest <- read.csv("pml-testing.csv")
set.seed(4483)
inTrain <- createDataPartition(pmlraw$classe, p = 0.75)[[1]]
pmltrain <- pmlraw[inTrain,]
pmlvalid <- pmlraw[-inTrain,]
# Set up parallelism:
library(parallel)
library(doParallel)
## Loading required package: foreach
## Loading required package: iterators
cluster <- makeCluster(detectCores() - 1) # convention: leave 1 core for the OS
registerDoParallel(cluster)
fitControl <- trainControl(method = "cv",
number = 10,
allowParallel = TRUE)
Clean up the data: remove variables that are mostly NA. These appear to be summary statistics rather than raw sensor readings.
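A minimal sketch of this cleaning step (the original code is not shown, so the 90% threshold and the exact bookkeeping columns dropped are assumptions; the resulting 55 columns match the `dim()` output below):

```r
# Drop columns that are mostly NA or empty strings (the summary-stat columns)
mostlyNA <- sapply(pmltrain, function(x) mean(is.na(x) | x == "") > 0.9)
pmltrain <- pmltrain[, !mostlyNA]
pmlvalid <- pmlvalid[, !mostlyNA]
# The leading bookkeeping columns (row index, user name, timestamps) are not
# sensor measurements, so drop them as well
pmltrain <- pmltrain[, -(1:5)]
pmlvalid <- pmlvalid[, -(1:5)]
```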
Exploratory Data Analysis:
The dimensions of the data:
dim(pmltrain)
## [1] 14718 55
A quick look at a few variables:
plot(pmltrain[,c("classe","roll_belt","pitch_belt","yaw_belt")])
Random Forest
model.rf <- train(classe ~ ., method="rf", preProcess = c("BoxCox"), data=pmltrain, trControl=fitControl)
## Loading required package: randomForest
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
Logic Regression
#model.2 <- train(classe ~ ., method="logreg", preProcess = c("BoxCox"), data=pmltrain, trControl=fitControl)
AdaBoost
#model.3 <- train(classe ~ ., method="adaboost", preProcess = c("BoxCox"), data=pmltrain, trControl=fitControl)
Logit boost:
model.logitboost <- train(classe ~ ., method="LogitBoost", preProcess = c("BoxCox"), data=pmltrain, trControl=fitControl)
## Loading required package: caTools
Parallel RF:
model.4 <- train(classe ~ ., method="parRF", preProcess = c("BoxCox"), data=pmltrain, trControl=fitControl)
stopCluster(cluster)
Now run predictions on the validation set and compare the results:
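The accuracy figures below were presumably computed along these lines (a sketch; the original code is not shown, and the `na.rm` handling mirrors the commented-out residue for LogitBoost, which can return NA on ties):

```r
# Random forest: overall accuracy on the held-out validation set
pred.rf <- predict(model.rf, pmlvalid)
confusionMatrix(pred.rf, pmlvalid$classe)$overall["Accuracy"]

# LogitBoost: count NA predictions (unresolved binary votes) as incorrect
pred.lb <- predict(model.logitboost, pmlvalid)
correct.lb <- pred.lb == pmlvalid$classe
sum(correct.lb, na.rm = TRUE) / length(correct.lb)
```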
Random forest: 0.9991843
(Logic regression and AdaBoost do not support more than two outcome classes, so those models could not be fit and their accuracy checks are omitted.)
Logit boost: 0.8199429
Parallel RF: 0.9991843
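The class-probability table below, for the 20 test cases, was presumably produced along these lines (the exact call is an assumption):

```r
# Random-forest class probabilities for the test cases; the final
# prediction is the most probable class in each row
probs <- predict(model.rf, pmltest, type = "prob")
probs$Prediction <- colnames(probs)[max.col(probs)]
probs
```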
## A B C D E Prediction
## 1 0.020 0.910 0.046 0.014 0.010 B
## 2 0.988 0.006 0.004 0.000 0.002 A
## 3 0.092 0.864 0.024 0.006 0.014 B
## 4 0.994 0.000 0.004 0.000 0.002 A
## 5 0.996 0.000 0.002 0.000 0.002 A
## 6 0.002 0.030 0.066 0.034 0.868 E
## 7 0.000 0.000 0.022 0.968 0.010 D
## 8 0.020 0.830 0.062 0.064 0.024 B
## 9 1.000 0.000 0.000 0.000 0.000 A
## 10 0.996 0.004 0.000 0.000 0.000 A
## 11 0.046 0.776 0.114 0.028 0.036 B
## 12 0.008 0.016 0.964 0.002 0.010 C
## 13 0.000 0.994 0.000 0.004 0.002 B
## 14 1.000 0.000 0.000 0.000 0.000 A
## 15 0.000 0.002 0.000 0.002 0.996 E
## 16 0.000 0.006 0.000 0.006 0.988 E
## 17 0.960 0.000 0.000 0.000 0.040 A
## 18 0.036 0.834 0.008 0.110 0.012 B
## 19 0.012 0.978 0.000 0.008 0.002 B
## 20 0.000 0.998 0.000 0.000 0.002 B
And here is a plot of the variable importance:
varImpPlot(model.rf$finalModel)