The goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of six participants to predict how well they perform barbell lifts.
# download the training and testing data once, then read them in
traindestfile <- "./data/pml-training.csv"
testdestfile <- "./data/pml-testing.csv"
if(!file.exists(traindestfile) || !file.exists(testdestfile)){
    download.file(url = "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv", destfile = traindestfile)
    download.file(url = "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv", destfile = testdestfile)
}
training <- read.csv(traindestfile)
testing <- read.csv(testdestfile)
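As a quick sanity check (an extra step, not part of the original analysis), we can confirm the shapes of the two data frames:

# the course data should give 19622 x 160 for training and 20 x 160 for testing
dim(training)
dim(testing)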
Exploratory Data Analysis
library(ggplot2)
g <- qplot(user_name, data = training, colour = classe)
g
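Note that qplot() is deprecated in recent releases of ggplot2; an equivalent ggplot() call (a sketch, which should behave identically) would be:

# bar chart of observations per participant, outlined by classe
g <- ggplot(training, aes(user_name, colour = classe)) + geom_bar()
g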
Get the accelerometer data
library(caret)
# keep every accelerometer-related column plus classe (column 160)
trainset <- training[, c(grep("accel", names(training)), 160)]
str(trainset)
'data.frame': 19622 obs. of 21 variables:
$ total_accel_belt : int 3 3 3 3 3 3 3 3 3 3 ...
$ var_total_accel_belt: num NA NA NA NA NA NA NA NA NA NA ...
$ accel_belt_x : int -21 -22 -20 -22 -21 -21 -22 -22 -20 -21 ...
$ accel_belt_y : int 4 4 5 3 2 4 3 4 2 4 ...
$ accel_belt_z : int 22 22 23 21 24 21 21 21 24 22 ...
$ total_accel_arm : int 34 34 34 34 34 34 34 34 34 34 ...
$ var_accel_arm : num NA NA NA NA NA NA NA NA NA NA ...
$ accel_arm_x : int -288 -290 -289 -289 -289 -289 -289 -289 -288 -288 ...
$ accel_arm_y : int 109 110 110 111 111 111 111 111 109 110 ...
$ accel_arm_z : int -123 -125 -126 -123 -123 -122 -125 -124 -122 -124 ...
$ total_accel_dumbbell: int 37 37 37 37 37 37 37 37 37 37 ...
$ var_accel_dumbbell : num NA NA NA NA NA NA NA NA NA NA ...
$ accel_dumbbell_x : int -234 -233 -232 -232 -233 -234 -232 -234 -232 -235 ...
$ accel_dumbbell_y : int 47 47 46 48 48 48 47 46 47 48 ...
$ accel_dumbbell_z : int -271 -269 -270 -269 -270 -269 -270 -272 -269 -270 ...
$ total_accel_forearm : int 36 36 36 36 36 36 36 36 36 36 ...
$ var_accel_forearm : num NA NA NA NA NA NA NA NA NA NA ...
$ accel_forearm_x : int 192 192 196 189 189 193 195 193 193 190 ...
$ accel_forearm_y : int 203 203 204 206 206 203 205 205 204 205 ...
$ accel_forearm_z : int -215 -216 -213 -214 -214 -215 -215 -213 -214 -215 ...
$ classe : Factor w/ 5 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...
We see that some variables are mostly NA, so we can ignore them and build our models on the rest of the variables. We therefore leave out the following:
- var_total_accel_belt
- var_accel_arm
- var_accel_dumbbell
- var_accel_forearm
Judging by their names, these are variance summaries of the total acceleration measurements from the various accelerometers rather than raw readings, so we can safely ignore them.
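As a quick check (not part of the original code), the per-column NA fraction confirms that the var_* columns are almost entirely NA:

# proportion of missing values in each column of trainset
round(colMeans(is.na(trainset)), 3)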
# removing the NA variables
grep("var_accel", names(trainset))
[1] 7 12 17
trainset <- trainset[, -c(grep("var_total_accel", names(trainset)), grep("var_accel", names(trainset)))]
str(trainset)
'data.frame': 19622 obs. of 17 variables:
$ total_accel_belt : int 3 3 3 3 3 3 3 3 3 3 ...
$ accel_belt_x : int -21 -22 -20 -22 -21 -21 -22 -22 -20 -21 ...
$ accel_belt_y : int 4 4 5 3 2 4 3 4 2 4 ...
$ accel_belt_z : int 22 22 23 21 24 21 21 21 24 22 ...
$ total_accel_arm : int 34 34 34 34 34 34 34 34 34 34 ...
$ accel_arm_x : int -288 -290 -289 -289 -289 -289 -289 -289 -288 -288 ...
$ accel_arm_y : int 109 110 110 111 111 111 111 111 109 110 ...
$ accel_arm_z : int -123 -125 -126 -123 -123 -122 -125 -124 -122 -124 ...
$ total_accel_dumbbell: int 37 37 37 37 37 37 37 37 37 37 ...
$ accel_dumbbell_x : int -234 -233 -232 -232 -233 -234 -232 -234 -232 -235 ...
$ accel_dumbbell_y : int 47 47 46 48 48 48 47 46 47 48 ...
$ accel_dumbbell_z : int -271 -269 -270 -269 -270 -269 -270 -272 -269 -270 ...
$ total_accel_forearm : int 36 36 36 36 36 36 36 36 36 36 ...
$ accel_forearm_x : int 192 192 196 189 189 193 195 193 193 190 ...
$ accel_forearm_y : int 203 203 204 206 206 203 205 205 204 205 ...
$ accel_forearm_z : int -215 -216 -213 -214 -214 -215 -215 -213 -214 -215 ...
$ classe : Factor w/ 5 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...
Testing a tree algorithm
set.seed(33833)
# hold out 25% of the training data for validation
inTrain <- createDataPartition(y = trainset$classe, p = 0.75, list = FALSE)
modeltrainset <- trainset[inTrain, ]
modeltestset <- trainset[-inTrain, ]
# fit a single classification tree
mf1 <- train(classe ~ ., method = "rpart", data = modeltrainset, na.action = na.pass)
confusionMatrix(predict(mf1, modeltestset), modeltestset$classe)$overall[1]
Accuracy
0.4208809
- Clearly the accuracy is quite low: at roughly 42%, the single tree is not much better than guessing.
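To see why, we can plot the fitted tree; this sketch assumes the rpart.plot package is installed (it is not used elsewhere in this report):

library(rpart.plot)
# visualize the final rpart model underlying the caret fit
rpart.plot(mf1$finalModel)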
Testing random forest
- Here we will do a principal component analysis (PCA) with a threshold of 0.8, retaining enough components to explain 80 percent of the variance.
- This is done to reduce the time the algorithm takes to run on this large dataset.
- We will also use cross-validation, specified through trainControl.
# 10-fold cross-validation; PCA components chosen to cover 80% of the variance
ctrl <- trainControl(preProcOptions = list(thresh = 0.8), method = "cv")
mf2 <- train(classe ~ ., method = "rf", data = modeltrainset, preProcess = "pca", trControl = ctrl)
confusionMatrix(predict(mf2, modeltestset), modeltestset$classe)$overall[1]
Accuracy
0.8203507
We can see that the accuracy has increased considerably by applying PCA and then training a random forest model.
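As an aside, the number of principal components retained by the 0.8 threshold can be inspected on the fitted caret object (an extra check; the output depends on the data):

# the stored preProcess object reports how many components capture 80% of the variance
mf2$preProcess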
# overlay the predicted class counts (gray) on the true class counts (red)
predict2 <- predict(mf2, modeltestset)
predictionDF <- data.frame(classe = predict2)
testdf <- data.frame(classe = modeltestset$classe)
g <- ggplot(testdf, aes(classe)) +
    geom_histogram(data = testdf, fill = "red", stat = "count", alpha = 0.6) +
    geom_histogram(data = predictionDF, stat = "count", alpha = 0.4, fill = "darkgray")
g
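Beyond overall accuracy, a per-class breakdown (an additional check, not in the original analysis) shows how the model performs on each classe:

# sensitivity and specificity for each of the five classes
confusionMatrix(predict2, modeltestset$classe)$byClass[, c("Sensitivity", "Specificity")]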
Next we will try stacking the models
- NOTE: This can lead to overfitting; since the combined model is both trained and evaluated on the same hold-out set, the reported accuracy is likely optimistic.
# stack the two models: use their predictions on the hold-out set as features
pr1 <- predict(mf1, modeltestset)
pr2 <- predict(mf2, modeltestset)
prdf <- data.frame(pr1, pr2, classe = modeltestset$classe)
combfit <- train(classe ~ ., method = "rf", data = prdf, trControl = trainControl(method = "cv"))
confusionMatrix(predict(combfit, prdf), modeltestset$classe)$overall[1]
Accuracy
0.824429
The accuracy seems to have improved only marginally.
Based on these results, we can reasonably say that the random forest model with PCA preprocessing predicts well on the held-out test data.
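Since modeltestset was held out from model training, one minus the validation accuracy gives an estimate of the out-of-sample error, roughly 18% for mf2:

# estimated out-of-sample error rate for the PCA + random forest model
1 - unname(confusionMatrix(predict(mf2, modeltestset), modeltestset$classe)$overall["Accuracy"])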
Final result for the quiz
predict(mf2, testing)
[1] B A B A A E D B A A B C C A A B A B B B
Levels: A B C D E