Machine Learning Project for Johns Hopkins Data Science Specialization via Coursera



Predicting Workout Form Using Machine Learning






knitr::opts_chunk$set(echo = TRUE)
library(ggplot2)
library(dplyr)
library(caret)
library(Hmisc)
library(GGally)
library(e1071)
load("variables.RData") ## pre-computed model objects, because fitting the models is time consuming!

## Download the WLE training and testing data
my_url <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
my_url2 <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"

download.file(my_url, "training.csv")
download.file(my_url2, "testing.csv")

testing <- read.csv("testing.csv")
training <- read.csv("training.csv")

Introduction

In this paper we analyse the Weight Lifting Exercise (WLE) dataset from Groupware@LES. The data were collected from weight lifters performing a dumbbell exercise in 5 different ways (encoded in the “classe” variable), with measurements taken from inertial sensors on the arm, the forearm, the waist, and the dumbbell.

Data Preparation

We first prepare the data for analysis by creating a training, a test, and a validation set. The data are of medium size, with roughly 20,000 observations. However, the goal of this analysis is to produce a good estimate of the out-of-sample error rate for 20 held-out samples, per the course requirements of Johns Hopkins University’s Practical Machine Learning course. Therefore, while we treat the test set produced here as a truly untouched sample of the data, we won’t be excessively concerned about our algorithm’s performance on it. The implication is that we don’t need a huge test set: we split the data into 80% training (further broken into 60% training and 20% validation) and 20% testing.

set.seed(1233)

## Hold out 20% of the labeled data as an untouched test set
fortest <- createDataPartition(training$classe, p = .2, list = FALSE)
test <- training[fortest, ]
pretrain <- training[-fortest, ]

## Split the remaining 80% into 60% training and 20% validation (0.25 * 0.8 = 0.2)
forvalidation <- createDataPartition(pretrain$classe, p = .25, list = FALSE)
train <- pretrain[-forvalidation, ]
validation <- pretrain[forvalidation, ]

Exploratory Data Analysis

We explore the pretrain object, which is our undifferentiated training set (training plus validation). It has 15,695 observations of 160 variables. From the user_name field we see that 6 individuals participated in producing the data, each performing the exercise in the 5 different forms. Some fields will not concern us, including the timestamp and “window” variables, which have little variation.
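
A quick way to confirm these facts (a minimal sketch; nearZeroVar from caret is used here as one possible check that the “window” and similar bookkeeping columns carry little information):

dim(pretrain)                 ## 15,695 rows, 160 columns
table(pretrain$user_name)     ## the 6 participants
## flag near-zero-variance columns such as new_window
nzv <- nearZeroVar(pretrain, saveMetrics = TRUE)
head(nzv[nzv$nzv, ])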

One immediate observation is that the variables for each of the sensor bands include a decomposition into roll, pitch, yaw, and acceleration, along with many other components, including directional information and descriptive statistics about that information. Most of the descriptive statistics are not tabulated for each observation (they are missing for the vast majority of rows), so these columns can be thrown away automatically (in this case, the mean, standard deviation, kurtosis, and skewness are not sufficient statistics for our class probabilities in the classification models below).
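
A minimal sketch of one automatic way to find and drop those columns, assuming the descriptive-statistic columns are exactly the ones that are missing (NA or blank) for most rows; the 90% cutoff is an illustrative choice, not taken from the original analysis:

## fraction of missing (NA or blank) values per column of pretrain
prop_missing <- sapply(pretrain, function(x) mean(is.na(x) | x == ""))
## the summary columns (kurtosis_, skewness_, avg_, stddev_, ...) are almost
## entirely missing, while the raw sensor readings are fully populated
keep_cols <- names(prop_missing)[prop_missing < 0.9]  ## illustrative cutoff
pretrain_reduced <- pretrain[, keep_cols]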

If we look at a table of our outcome variable, we notice that class A is over-represented, which may have to be dealt with by penalizing the fit on this class.
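
The table below can be reproduced along these lines (a sketch; the chunk that generated it is not echoed):

print("Number of Observations Per Workout Form")
table(pretrain$classe)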

## [1] "Number of Observations Per Workout Form"
## 
##    A    B    C    D    E 
## 4464 3037 2737 2572 2885

Here are some Very Hungry Caterpillar-esque graphs relating user and exercise form to a few of the various movements.
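
The smaller object plotted below is not constructed in the code shown in this section; a purely illustrative construction would be a reduced data frame holding classe plus a handful of the raw sensor readings, for example:

## hypothetical definition of `smaller`; the columns actually used are not shown
smaller <- pretrain %>%
  select(classe, roll_belt, pitch_belt, yaw_belt, total_accel_belt,
         roll_arm, pitch_arm, yaw_arm, total_accel_arm)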

## pairwise plots of the first group of sensor variables, colored by classe
ggpairs(data = smaller, mapping = aes(color = classe, alpha = .5), columns = c(2:5))

ggpairs(data = smaller, mapping = aes(color = classe), columns = c(6:9))