My Daily Hacks

Tree Based Models

5/16/2016

Below is the R Markdown for Tree Based Models.

---
title: "TreeBasedModels"
output: html_document
---
# Tree Based Models

## Decision Trees

```{r}
require(ISLR)
require(tree)
attach(Carseats)
hist(Sales)
High = ifelse(Sales <= 8, "No", "Yes")
Carseats = data.frame(Carseats, High)
fit.tree = tree(High ~.-Sales, data = Carseats)
summary(fit.tree)
plot(fit.tree)
text(fit.tree, pretty=0)
```

To print Detailed tree
Description of entry:
  `Node Observation Mean_Deviance (%yes, %no)`
```{r}
fit.tree
```

Lets generate test-set by spliting careats data into (250, 150)
Grow the tree on training set and evaluate the performance of test-set
```{r}
set.seed(1011)
train = sample(1:nrow(Carseats), 250)
fit.tree.1 = tree(High ~.-Sales, data = Carseats, subset = train)
summary(fit.tree.1)
plot(fit.tree.1)
text(fit.tree.1, pretty=0)
fit.tree.predict = predict(fit.tree.1, Carseats[-train,], type = "class")
with(Carseats[-train,], table(fit.tree.predict, High))
```

Lets now use the CV method to prune the tree to reduce variance.
```{r}
fit.tree.cv = cv.tree(fit.tree.1, FUN = prune.misclass)
plot(fit.tree.cv)
prune.fit.tree.1 = prune.misclass(fit.tree.1, best = 13)
plot(prune.fit.tree.1)
text(prune.fit.tree.1, pretty = 0)
prune.tree.1.predict = predict(prune.fit.tree.1, Carseats[-train,], type = "class")
with(Carseats[-train,], table(prune.tree.1.predict, High))
```

# Random Forests And Boosting

## Random forest (package : randomForest)
We shall use Boston Housing data from MASS package
Response: medv
```{r}
require(randomForest)
require(MASS)
set.seed(101)
attach(Boston)
train = sample(1:nrow(Boston), 300)
fit.RF = randomForest(medv ~., data = Boston, subset = train)
```

The MSR & % of variance Explained are based on OOB (out og bag).
No. of variables randomly chosen at each split is 4, since $p=13$ we can use
all 13 values of `mtry`.
```{r}
oob.error = double(13)
test.error = double(13)
for(mtry in 1:13){
  fit = randomForest(medv ~., data = Boston, subset = train, mtry = mtry, ntree=400)
  oob.error[mtry] = fit$mse[400]
  pred = predict(fit, Boston[-train,])
  test.error[mtry] = with(Boston[-train,], mean((medv - pred)^2))
}
matplot(1:mtry, cbind(test.error, oob.error), pch = 19, col = c("red", "blue"), type = "b", ylab = "MSE")
legend("topright", legend = c("Test", "OOB"), pch = 19, col = c("red", "blue"))
```

$`mtry` = 13$ corresponds to bagging

## Boosting (package: gbm)
Boosting tries to reduce bias unlike RF targeted at variance.
By building (`n.trees`) numerous shallow trees (`interaction.depth`)
`summary(fit.boost)` gives the variable importance graph.
```{r}
require(gbm)
fit.boost = gbm(medv ~., data = Boston[train,], distribution = "gaussian", n.trees = 10000, shrinkage = 0.01, interaction.depth = 4)
summary(fit.boost)

plot(fit.boost, i = "lstat")
plot(fit.boost, i = "rm")
```

Tuning the model parameters
```{r}
n.trees = seq(from = 100, to = 10000, by = 100)
predmat = predict(fit.boost, newdata = Boston[-train,], n.trees = n.trees)
# Column wise MSE
error = with(Boston[-train,], apply((predmat - medv)^2,2,mean))
plot(n.trees, error, pch = 19, ylab = "MSE", xlab = "#trees", main = "Boosting Test Error")
abline(h = min(test.error), col = "red")
```

0 Comments

Non-Linear Models

5/16/2016

0 Comments

Below is the R Markdown file with snippets on non-linear models.

---
title: "NonLinearModels"
output: html_document
---
# Nonlinear Models

```{r}
require(ISLR)
attach(Wage)
```

## Polynimoal Regression

keyword `poly()` generates abasis function of *orthogonal polynomial*.
```{r}
fit.poly = lm(wage ~ poly(age, 4), data = Wage)
summary(fit.poly)
```

Plot the fitted function along with SE of fit.
```{r}
age.limits = range(age)
age.grid = seq(from = age.limits[1], to = age.limits[2])
preds = predict(fit.poly, newdata = list(age = age.grid), se = T)
se.bands = cbind(preds$fit + 2 * preds$se, preds$fit - 2 * preds$se)
plot(age, wage, col="darkgrey")
lines(age.grid, preds$fit, col="blue")
matlines(age.grid, se.bands, col="blue", lty=2)
```

Use of `anova()` to test differences between multiple models
```{r}
fita = lm(wage ~ education, data = Wage)
fitb= lm(wage ~ education+age, data = Wage)
fitc = lm(wage ~ education+poly(age,2), data = Wage)
fitd = lm(wage ~ education+poly(age,3), data = Wage)
anova(fita,fitb,fitc,fitd)
```

## Polynomial Logistic Regression
Let the binary responsible variable be wage > 250K as 1 or 0.
```{r}
fit.log = glm(I(wage > 250) ~ poly(age, 3), data = Wage, family = "binomial")
summary(fit.log)
preds.log = predict(fit.log, newdata = list(age = age.grid), se = T)
se.bands.1 = preds.log$fit + cbind(fit = 0, lower = -2*preds.log$se, upper = 2*preds.log$se)
prob.bands = exp(se.bands.1)/(1+exp(se.bands.1))
plot(age, wage, col="darkgrey")
matplot(age.grid, prob.bands, col = "blue", lwd = c(2,1,1), lty = c(1,2,2),  type = "l", ylim = c(0, 0.1))
```

## Splines
Lets implement cubic spine with knots at 25,40,60
bs() gives teh basis for cubic polynomials
```{r}
require(splines)
fit.splines = lm(wage ~ bs(age, knots = c(25, 40, 60)), data = Wage)
plot(age, wage, col = "darkgray")
lines(age.grid, predict(fit.splines, list(age = age.grid)), col = "green", lwd = 2)
abline(v = c(25, 40, 60), lty = 2, col = "darkgreen" )
```

Smoothing Splines doesnot require knot selection but have smoothing parameter, which can be selected by choosing degree of freedom df

```{r}
fit.smooth = smooth.spline(age, wage, df = 16)
lines(fit.smooth, col = "red", lwd = 2)
```

Another way to choose smoothing parameters is to use LOOCV { leave one out cross validation }
```{r}
fit.smooth.loocv = smooth.spline(age, wage, cv = TRUE)
lines(fit.smooth.loocv, col = "blue", lwd = 2)
```

## GAM - Generalized Additive Models
To fit models with more than one non-linear terms we use GAMs
`gam` package.
s() in gam will tell to create a smoothing spline.
```{r}
require(gam)
fit.gam = gam(wage ~ s(age, df = 4) + s(year, df = 4) + education, data = Wage)
par(mfrow = c(1,3))
plot(fit.gam, se = TRUE)
```

Lets see if we need a nonlinear term for year

```{r}
fit.gam.1  = gam(wage ~ s(age, df = 4)+year+education, data = Wage)
anova(fit.gam, fit.gam.1, test = "Chisq")

```

0 Comments

Model Selection

4/28/2016

0 Comments

Below is the R Markdown file showing different techniques to do feature selection in R.
Covered techniques are:

Best subset selection
Forward Selection
Ridge & Lasso Regression

---
title: "ModelSelection"
output: html_document
---

```{r}
library(ISLR)
summary(Hitters)
```

Hitters is baseball dataset where we aim to predict salary of baseball player.

Summary shows that there are some NA values in Salary, lets remove them:

na.omit: Removes any row that has NA value.

```{r}
Hitters = na.omit(Hitters)
with(Hitters, sum(is.na(Salary)))
```

Best subset selection
______________________
Look for all combination of models for each specific number of features and determine the best model for each.

cp statistic: Pick the model with least cp statistic

```{r}
library(leaps)
regFit.full = regsubsets(Salary~., data = Hitters, nvmax = 19)
reg.summary = summary(regFit.full)
names(reg.summary)
plot(reg.summary$cp, xlab="Number of Variables", ylab = "cp")
best.model.num = which.min(reg.summary$cp)
coef(regFit.full, best.model.num)
```

Forward step-wise selection
______________________
Greedy Algorithm, each time it includes a new variable to set as a nested Sequence.

```{r}
regFit.fwd = regsubsets(Salary~., data = Hitters, nvmax = 19, method = "forward")
reg.summary.fwd  = summary(regFit.fwd)
plot(regFit.fwd, scale="Cp")
```

Model Selection using a validation-set
______________________
Lets make a training and a validation set, so that we can choose a good subset model.

```{r}
dim(Hitters)
set.seed(1)
train = sample(seq(263), 180, replace = FALSE)
regFit.fwd.v = regsubsets(Salary~., data = Hitters[train,], nvmax = 19, method = "forward")
reg.summary.fwd.v  = summary(regFit.fwd.v)
plot(regFit.fwd.v, scale="Cp")
```

Preparing the test data
Note: regFit.fwd.v$rss[-1] removes the RSS for null model

```{r}
errors.v = rep(NA, 19)
test.v = model.matrix(Salary~., data = Hitters[-train,])
for(i in 1:19){
  coef.i = coef(regFit.fwd.v, id = i)
  pred = test.v[, names(coef.i)]%*%coef.i
  errors.v[i] = mean((Hitters$Salary[-train] - pred)^2)
}
plot(sqrt(errors.v), ylab="RMSE", ylim=c(300, 400), pch = 19, type = "b")
points(sqrt(regFit.fwd.v$rss[-1]/180), col="blue", pch = 19, type = "b")
legend("topright", legend=c("Training", "Validation"), col=c("blue", "black"), pch = 19)
```

Model-Selection by cross-validation
______________________
We shall use 10 fold cross validation

```{r}
set.seed(11)
folds = sample(rep(1:10, length = nrow(Hitters)))
table(folds)
errors.cv = matrix(NA, 10, 19)
for(k in 1:10){
  best.fit = regsubsets(Salary~.,data=Hitters[folds!=k,],nvmax = 19, method = "forward")
  test.v = model.matrix(Salary~., data = Hitters[folds == k,])
  for(i in 1:19) {
    coef.i = coef(best.fit, id = i)
    pred = test.v[,names(coef.i)]%*%coef.i
    errors.cv[k, i] = mean((Hitters$Salary[folds == k] - pred)^2)
  }
}
rmse = sqrt(apply(errors.cv,2,mean))
plot(rmse,pch=19,type="b")
```

Ridge & Lasso Regression
______________________
Split the Hitters package to predictors (X) and Response (Y)

```{r}
library(glmnet)
x = model.matrix(Salary~.-1, data = Hitters)
y = Hitters$Salary
```

First we shalls ee ridge regression by calling glmnet with alpha = 0
cv.glmnet function which will do the cross-validation for us.
Ridge Regression keeps all predictors and try to reduce the coefficients to zero.
Ridge uses L1-Norm (Sum of Squares of coefficients). Penality is applied on it.
RSS + lambda * L-1 Norm

```{r}
fit.ridge = glmnet(x, y, alpha = 0)
plot(fit.ridge, xvar = "lambda", label = TRUE)
cv.ridge = cv.glmnet(x, y, alpha = 0)
plot(cv.ridge)
```


We shall use Lasso regression, by setting the alpha = 1. Does both Shrinkage and feature selection
Lasoo Regression uses Mean Absolute Error of coefficients
RSS + lambda * MAE
Using MAE, Lasso reduces some of its coefficients to Zero.
```{r}
fit.lasso = glmnet(x, y, alpha = 1)
plot(fit.lasso, xvar = "lambda", label = TRUE)
cv.lasso = cv.glmnet(x, y, alpha = 1)
plot(cv.lasso)
coef(cv.lasso)
```

Note that the `echo = FALSE` parameter was added to the code chunk to prevent printing of the R code that generated the plot.

0 Comments

Basic Data Analysis R

11/1/2015

0 Comments

> fileLocation <- "./dataset/Consumer_Complaints.csv"
> 
> #reading csv file
> data <- read.csv(fileLocation, header = TRUE)
> 
> #To get dimensions of the data
> dimData <- dim(data)
> print(dimData)
[1] 471871     16
> 
> #get number of rows
> rows <- nrow(data)
> print(rows)
[1] 471871
>
>#To get column names 
>colnames(data)
 [1] "Date.received"                "Product"                      "Sub.product"                 
 [4] "Issue"                        "Sub.issue"                    "Consumer.complaint.narrative"
 [7] "Company.public.response"      "Company"                      "State"                       
[10] "ZIP.code"                     "Submitted.via"                "Date.sent.to.company"        
[13] "Company.response.to.consumer" "Timely.response."             "Consumer.disputed."          
[16] "Complaint.ID" 
>
>#To get rowNames             
> rownames(data)
    [1] "1"     "2"     "3"     "4"     "5"     "6"     "7"     "8"     "9"     "10"    "11"    "12"    "13"   
   [14] "14"    "15"    "16"    "17"    "18"    "19"    "20"    "21"    "22"    "23"    "24"    "25"    "26"   
   [27] "27"    "28"    "29"    "30"    "31"    "32"    "33"    "34"    "35"    "36"    "37"    "38"    "39"   
   [40] "40"    "41"    "42" ......
>
>#To view summary of data. Output is not shown.
>summary <- summary(data)
>#print(summary)
>
>#To view quick snapshot of data structure
>struct <- str(data)
>#print(struct)
>
> #Inspect data using HEAD
> headData <- head(data)
> #print(headData)
>
> #To see the last few rows of your data, use the tail()
> tailData <- tail(data)
> #print(tailData)
>
> #get headers in table
> headers <- names(data)
> print(headers)
 [1] "Date.received"                "Product"                      "Sub.product"                 
 [4] "Issue"                        "Sub.issue"                    "Consumer.complaint.narrative"
 [7] "Company.public.response"      "Company"                      "State"                       
[10] "ZIP.code"                     "Submitted.via"                "Date.sent.to.company"        
[13] "Company.response.to.consumer" "Timely.response."             "Consumer.disputed."          
[16] "Complaint.ID"                
> 
> #check if string is in headers
> testString <- "Product"
> check <- is.na(match(testString, headers))
> print(check)
[1] FALSE
> 
> #pick a column 'Product' from the data
> productData <- data["Product"]
> print(head(productData))
           Product
1  Debt collection
2 Credit reporting
3  Money transfers
4  Debt collection
5  Debt collection
6 Credit reporting
> 
> 
> #subset data to have only first 3 columns
> #R indexes from 1, not 0. So your first column is at [1] and not [0]
> subsetData <- data[, 1:3]
> #print(subsetData)
> 
> #subset data to with columns that aren't contiguous
> #or we can use select() function to achieve the same
> subsetData1 <- data[, c(1,3,5,7)]
> #print(subsetData1)
> 
> #Filter rows where product is Mortgage
> #or we can use filter() function to achieve the same
> subsetData2 <- data[data$Product == "Mortgage", ]
> #print(subsetData2)
> 
> #slicing data using subset(dataset, row filters, col selections)
> #Also, we can use %.% to create a execution pipeline
> subsetData3 <- subset(data, Product == "Mortgage", 1:3)
> #print(subsetData3)
> 
> #check for NA values in productData
> chkProductData <- complete.cases(productData)
> summary(chkProductData)
   Mode    TRUE    NA's 
logical  471871       0 
> 
> #get only nonNA values
> productDataGood <- productData[chkProductData,]
> 
> #to get count of issues for every product kind
> countMatrix <- table(productDataGood)
> print(countMatrix)
productDataGood
Bank account or service           Consumer loan             Credit card        Credit reporting 
                  53460                   16880                   57374                   73234 
        Debt collection         Money transfers                Mortgage Other financial service 
                  83117                    2995                  166237                     352 
            Payday loan            Prepaid card            Student loan 
                   3190                    1157                   13875 
> 
#Normalize table data
> cm_state <- table(data["State"])
> total <- sum(cm_state)
> for(state in names(cm_state)){
+     cm_state[state] = cm_state[state] / total
+ }
> print(cm_state)
>
> #default mean function returns NA when there is any NA in the numerical
> #data. So use na.rm = TRUE to tell function to exclude NAs
> mean(data$numeric_column, na.rm = TRUE)
>
> 
> #computing combinations
> choose(5,2)
[1] 10
> 
> #how many ways to combine fruits into group of 2
> fruits <- c("apple", "mango", "banana", "grapes", "plum")
> combn(fruits, 2)
     [,1]    [,2]     [,3]     [,4]    [,5]     [,6]     [,7]    [,8]     [,9]     [,10]   
[1,] "apple" "apple"  "apple"  "apple" "mango"  "mango"  "mango" "banana" "banana" "grapes"
[2,] "mango" "banana" "grapes" "plum"  "banana" "grapes" "plum"  "grapes" "plum"   "plum"  
> 
>#To Inspect in deep, use describe function on psych package.
>#This would give data like MAD, Skew, kurtosis along with mean, median , std
>install.packages("psych")
>library(psych)
>describe(data)

0 Comments

Tree Based Models

Non-Linear Models

Model Selection

Basic Data Analysis R

Archives

Categories

Tree Based Models

Non-Linear Models﻿

Model Selection

Basic Data Analysis R

Archives

Categories

Non-Linear Models