Below is the R Markdown for Tree Based Models.
---
title: "TreeBasedModels"
output: html_document
---

# Tree Based Models

## Decision Trees

```{r}
require(ISLR)
require(tree)
attach(Carseats)
hist(Sales)
# Recode Sales as a binary factor: "Yes" when Sales > 8, "No" otherwise
High = as.factor(ifelse(Sales <= 8, "No", "Yes"))
Carseats = data.frame(Carseats, High)
fit.tree = tree(High ~ . - Sales, data = Carseats)
summary(fit.tree)
plot(fit.tree)
text(fit.tree, pretty = 0)
```

Printing the fit gives a detailed description of each node: the split rule, the number of observations, the deviance, and the class proportions (%yes, %no).

```{r}
fit.tree
```

Let's split the Carseats data into a training set of 250 observations and a test set of 150, grow the tree on the training set, and evaluate its performance on the test set.

```{r}
set.seed(1011)
train = sample(1:nrow(Carseats), 250)
fit.tree.1 = tree(High ~ . - Sales, data = Carseats, subset = train)
summary(fit.tree.1)
plot(fit.tree.1)
text(fit.tree.1, pretty = 0)
# Confusion matrix on the held-out test set
fit.tree.predict = predict(fit.tree.1, Carseats[-train,], type = "class")
with(Carseats[-train,], table(fit.tree.predict, High))
```

Let's now use cross-validation to prune the tree and reduce its variance.

```{r}
fit.tree.cv = cv.tree(fit.tree.1, FUN = prune.misclass)
plot(fit.tree.cv)
# Prune to the size suggested by the CV plot
prune.fit.tree.1 = prune.misclass(fit.tree.1, best = 13)
plot(prune.fit.tree.1)
text(prune.fit.tree.1, pretty = 0)
prune.tree.1.predict = predict(prune.fit.tree.1, Carseats[-train,], type = "class")
with(Carseats[-train,], table(prune.tree.1.predict, High))
```

# Random Forests And Boosting

## Random Forest (package: randomForest)

We shall use the Boston housing data from the MASS package. Response: `medv`.

```{r}
require(randomForest)
require(MASS)
set.seed(101)
attach(Boston)
train = sample(1:nrow(Boston), 300)
fit.RF = randomForest(medv ~ ., data = Boston, subset = train)
fit.RF
```

The mean of squared residuals and the % variance explained reported above are based on OOB (out-of-bag) samples. The number of variables randomly sampled at each split defaults to 4 here; since $p = 13$, we can try all 13 values of `mtry`.

```{r}
oob.error = double(13)
test.error = double(13)
# Fit a forest for each value of mtry and record OOB and test-set MSE
for(mtry in 1:13){
  fit = randomForest(medv ~ ., data = Boston, subset = train, mtry = mtry, ntree = 400)
  oob.error[mtry] = fit$mse[400]
  pred = predict(fit, Boston[-train,])
  test.error[mtry] = with(Boston[-train,], mean((medv - pred)^2))
}
matplot(1:13, cbind(test.error, oob.error), pch = 19, col = c("red", "blue"),
        type = "b", ylab = "MSE")
legend("topright", legend = c("Test", "OOB"), pch = 19, col = c("red", "blue"))
```

`mtry = 13` corresponds to bagging, since every predictor is considered at every split.

## Boosting (package: gbm)

Boosting tries to reduce bias, whereas random forests mainly target variance. It builds a large number (`n.trees`) of shallow trees, with depth controlled by `interaction.depth`. `summary(fit.boost)` gives the variable-importance plot.

```{r}
require(gbm)
fit.boost = gbm(medv ~ ., data = Boston[train,], distribution = "gaussian",
                n.trees = 10000, shrinkage = 0.01, interaction.depth = 4)
summary(fit.boost)
# Partial-dependence plots for the two most important variables
plot(fit.boost, i = "lstat")
plot(fit.boost, i = "rm")
```

Tuning the number of trees:

```{r}
n.trees = seq(from = 100, to = 10000, by = 100)
# Prediction matrix: one column per value of n.trees
predmat = predict(fit.boost, newdata = Boston[-train,], n.trees = n.trees)
# Column-wise test MSE
error = with(Boston[-train,], apply((predmat - medv)^2, 2, mean))
plot(n.trees, error, pch = 19, ylab = "MSE", xlab = "# trees",
     main = "Boosting Test Error")
# Best random-forest test error, for reference
abline(h = min(test.error), col = "red")
```
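Returning to the classification-tree pruning step above: instead of reading the best size off the CV plot, we can pick it programmatically from the `cv.tree` output. A minimal sketch, not from the original post; it assumes `fit.tree.cv` and `fit.tree.1` are still in the workspace, and it reproduces the Carseats split (as `train.cs`) because `train` was later reassigned in the random-forest section. Among sizes tied for the lowest CV misclassification count, it prefers the smallest tree.

```{r}
# Sketch (assumption: fit.tree.cv and fit.tree.1 survive from earlier chunks).
# cv.tree stores candidate tree sizes in $size and their CV error in $dev.
set.seed(1011)
train.cs = sample(1:nrow(Carseats), 250)   # reproduce the earlier split
best.size = min(fit.tree.cv$size[fit.tree.cv$dev == min(fit.tree.cv$dev)])
prune.auto = prune.misclass(fit.tree.1, best = best.size)
pred.auto = predict(prune.auto, Carseats[-train.cs,], type = "class")
with(Carseats[-train.cs,], mean(pred.auto == High))  # test-set accuracy
```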
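Similarly, the boosting test-error curve can be searched programmatically rather than by eye. A minimal sketch, not in the original post, reusing `n.trees`, `error`, `fit.boost`, and the Boston `train` index from the tuning chunk. Note that selecting `n.trees` on the test set is optimistic; in practice one would use cross-validation instead, for example `cv.folds` in `gbm()` together with `gbm.perf()`.

```{r}
# Sketch (assumption: n.trees, error, fit.boost, train exist from above).
best.n = n.trees[which.min(error)]   # tree count with lowest test MSE
pred.best = predict(fit.boost, newdata = Boston[-train,], n.trees = best.n)
with(Boston[-train,], mean((medv - pred.best)^2))  # test MSE at best.n
```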
Below is the R Markdown file with snippets on non-linear models.
---
title: "NonLinearModels"
output: html_document
---

# Nonlinear Models

```{r}
require(ISLR)
attach(Wage)
```

## Polynomial Regression

The function `poly()` generates a basis of *orthogonal polynomials*.

```{r}
fit.poly = lm(wage ~ poly(age, 4), data = Wage)
summary(fit.poly)
```

Plot the fitted function along with ±2 standard-error bands.

```{r}
age.limits = range(age)
age.grid = seq(from = age.limits[1], to = age.limits[2])
preds = predict(fit.poly, newdata = list(age = age.grid), se = TRUE)
se.bands = cbind(preds$fit + 2 * preds$se, preds$fit - 2 * preds$se)
plot(age, wage, col = "darkgrey")
lines(age.grid, preds$fit, col = "blue")
matlines(age.grid, se.bands, col = "blue", lty = 2)
```

Use `anova()` to test between nested models.

```{r}
fita = lm(wage ~ education, data = Wage)
fitb = lm(wage ~ education + age, data = Wage)
fitc = lm(wage ~ education + poly(age, 2), data = Wage)
fitd = lm(wage ~ education + poly(age, 3), data = Wage)
anova(fita, fitb, fitc, fitd)
```

## Polynomial Logistic Regression

Let the binary response variable be `wage > 250` (wage is in $1000s), coded 1 or 0.

```{r}
fit.log = glm(I(wage > 250) ~ poly(age, 3), data = Wage, family = "binomial")
summary(fit.log)
# Predictions are on the logit scale; build the SE bands there,
# then map them to probabilities with the inverse logit
preds.log = predict(fit.log, newdata = list(age = age.grid), se = TRUE)
se.bands.1 = preds.log$fit + cbind(fit = 0, lower = -2 * preds.log$se, upper = 2 * preds.log$se)
prob.bands = exp(se.bands.1) / (1 + exp(se.bands.1))
matplot(age.grid, prob.bands, col = "blue", lwd = c(2, 1, 1), lty = c(1, 2, 2),
        type = "l", ylim = c(0, 0.1))
```

## Splines

Let's fit a cubic spline with knots at 25, 40, and 60. `bs()` generates the B-spline basis, cubic by default.

```{r}
require(splines)
fit.splines = lm(wage ~ bs(age, knots = c(25, 40, 60)), data = Wage)
plot(age, wage, col = "darkgray")
lines(age.grid, predict(fit.splines, list(age = age.grid)), col = "green", lwd = 2)
abline(v = c(25, 40, 60), lty = 2, col = "darkgreen")
```

Smoothing splines do not require knot selection, but they do have a smoothing parameter, which can be specified through the effective degrees of freedom `df`.

```{r}
# Re-draw the scatterplot, since each knitr chunk opens a new graphics device
plot(age, wage, col = "darkgray")
fit.smooth = smooth.spline(age, wage, df = 16)
lines(fit.smooth, col = "red", lwd = 2)
```

Another way to choose the smoothing parameter is leave-one-out cross-validation (LOOCV).

```{r}
plot(age, wage, col = "darkgray")
fit.smooth.loocv = smooth.spline(age, wage, cv = TRUE)
lines(fit.smooth.loocv, col = "blue", lwd = 2)
```

## GAM - Generalized Additive Models

To fit models with more than one nonlinear term, we use GAMs via the `gam` package. `s()` tells `gam` to fit a smoothing spline in that variable.

```{r}
require(gam)
fit.gam = gam(wage ~ s(age, df = 4) + s(year, df = 4) + education, data = Wage)
par(mfrow = c(1, 3))
plot(fit.gam, se = TRUE)
```

Let's test whether we need a nonlinear term for `year`.

```{r}
fit.gam.1 = gam(wage ~ s(age, df = 4) + year + education, data = Wage)
anova(fit.gam, fit.gam.1, test = "Chisq")
```
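A quick check, not in the original post, of what the *orthogonal* basis buys us: refitting the same degree-4 model with a raw polynomial basis (`raw = TRUE`) changes the coefficients but not the fitted values, so the orthogonal basis only makes the individual coefficients easier to test.

```{r}
# Sketch (assumption: fit.poly and Wage survive from the chunks above).
fit.raw = lm(wage ~ poly(age, 4, raw = TRUE), data = Wage)
max(abs(fitted(fit.poly) - fitted(fit.raw)))  # ~0: identical fitted values
```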
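The GAM machinery also extends to the logistic setting used earlier. A minimal sketch, an addition not from the original post: a logistic GAM for the binary response `wage > 250`, with a smooth term in `age` and linear terms for the rest.

```{r}
# Sketch (assumption: the gam package and Wage data are loaded above).
fit.gam.log = gam(I(wage > 250) ~ s(age, df = 4) + year + education,
                  family = binomial, data = Wage)
par(mfrow = c(1, 3))
plot(fit.gam.log, se = TRUE)
```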