Below is the R Markdown for Tree Based Models.
---
title: "TreeBasedModels"
output: html_document
---

# Tree Based Models

## Decision Trees

```{r}
require(ISLR)
require(tree)
attach(Carseats)
hist(Sales)
# Recode Sales as a binary factor: "Yes" when Sales > 8, "No" otherwise
High = as.factor(ifelse(Sales <= 8, "No", "Yes"))
Carseats = data.frame(Carseats, High)
fit.tree = tree(High ~ . - Sales, data = Carseats)
summary(fit.tree)
plot(fit.tree)
text(fit.tree, pretty = 0)
```

Printing the fit gives a detailed description of each node: the split rule, the number of observations, the deviance, and the class proportions (%yes, %no).

```{r}
fit.tree
```

Let's split the Carseats data into a training set of 250 observations and a test set of 150, grow the tree on the training set, and evaluate its performance on the test set.

```{r}
set.seed(1011)
train = sample(1:nrow(Carseats), 250)
fit.tree.1 = tree(High ~ . - Sales, data = Carseats, subset = train)
summary(fit.tree.1)
plot(fit.tree.1)
text(fit.tree.1, pretty = 0)
# Confusion matrix on the held-out test set
fit.tree.predict = predict(fit.tree.1, Carseats[-train,], type = "class")
with(Carseats[-train,], table(fit.tree.predict, High))
```

Let's now use cross-validation to prune the tree and reduce its variance.

```{r}
fit.tree.cv = cv.tree(fit.tree.1, FUN = prune.misclass)
plot(fit.tree.cv)
# Prune to the size suggested by the CV plot
prune.fit.tree.1 = prune.misclass(fit.tree.1, best = 13)
plot(prune.fit.tree.1)
text(prune.fit.tree.1, pretty = 0)
prune.tree.1.predict = predict(prune.fit.tree.1, Carseats[-train,], type = "class")
with(Carseats[-train,], table(prune.tree.1.predict, High))
```

# Random Forests And Boosting

## Random Forest (package: randomForest)

We shall use the Boston housing data from the MASS package. Response: `medv`.

```{r}
require(randomForest)
require(MASS)
set.seed(101)
attach(Boston)
train = sample(1:nrow(Boston), 300)
fit.RF = randomForest(medv ~ ., data = Boston, subset = train)
fit.RF
```

The mean of squared residuals and the % variance explained reported above are based on OOB (out-of-bag) samples. The number of variables randomly sampled at each split defaults to 4 here; since $p = 13$, we can try all 13 values of `mtry`.

```{r}
oob.error = double(13)
test.error = double(13)
# Fit a forest for each value of mtry and record OOB and test-set MSE
for(mtry in 1:13){
  fit = randomForest(medv ~ ., data = Boston, subset = train, mtry = mtry, ntree = 400)
  oob.error[mtry] = fit$mse[400]
  pred = predict(fit, Boston[-train,])
  test.error[mtry] = with(Boston[-train,], mean((medv - pred)^2))
}
matplot(1:13, cbind(test.error, oob.error), pch = 19, col = c("red", "blue"),
        type = "b", ylab = "MSE")
legend("topright", legend = c("Test", "OOB"), pch = 19, col = c("red", "blue"))
```

`mtry = 13` corresponds to bagging, since every predictor is considered at every split.

## Boosting (package: gbm)

Boosting tries to reduce bias, whereas random forests mainly target variance. It builds a large number (`n.trees`) of shallow trees, with depth controlled by `interaction.depth`. `summary(fit.boost)` gives the variable-importance plot.

```{r}
require(gbm)
fit.boost = gbm(medv ~ ., data = Boston[train,], distribution = "gaussian",
                n.trees = 10000, shrinkage = 0.01, interaction.depth = 4)
summary(fit.boost)
# Partial-dependence plots for the two most important variables
plot(fit.boost, i = "lstat")
plot(fit.boost, i = "rm")
```

Tuning the number of trees:

```{r}
n.trees = seq(from = 100, to = 10000, by = 100)
# Prediction matrix: one column per value of n.trees
predmat = predict(fit.boost, newdata = Boston[-train,], n.trees = n.trees)
# Column-wise test MSE
error = with(Boston[-train,], apply((predmat - medv)^2, 2, mean))
plot(n.trees, error, pch = 19, ylab = "MSE", xlab = "# trees",
     main = "Boosting Test Error")
# Best random-forest test error, for reference
abline(h = min(test.error), col = "red")
```
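Returning to the classification-tree pruning step above: instead of reading the best size off the CV plot, we can pick it programmatically from the `cv.tree` output. A minimal sketch, not from the original post; it assumes `fit.tree.cv` and `fit.tree.1` are still in the workspace, and it reproduces the Carseats split (as `train.cs`) because `train` was later reassigned in the random-forest section. Among sizes tied for the lowest CV misclassification count, it prefers the smallest tree.

```{r}
# Sketch (assumption: fit.tree.cv and fit.tree.1 survive from earlier chunks).
# cv.tree stores candidate tree sizes in $size and their CV error in $dev.
set.seed(1011)
train.cs = sample(1:nrow(Carseats), 250)   # reproduce the earlier split
best.size = min(fit.tree.cv$size[fit.tree.cv$dev == min(fit.tree.cv$dev)])
prune.auto = prune.misclass(fit.tree.1, best = best.size)
pred.auto = predict(prune.auto, Carseats[-train.cs,], type = "class")
with(Carseats[-train.cs,], mean(pred.auto == High))  # test-set accuracy
```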
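Similarly, the boosting test-error curve can be searched programmatically rather than by eye. A minimal sketch, not in the original post, reusing `n.trees`, `error`, `fit.boost`, and the Boston `train` index from the tuning chunk. Note that selecting `n.trees` on the test set is optimistic; in practice one would use cross-validation instead, for example `cv.folds` in `gbm()` together with `gbm.perf()`.

```{r}
# Sketch (assumption: n.trees, error, fit.boost, train exist from above).
best.n = n.trees[which.min(error)]   # tree count with lowest test MSE
pred.best = predict(fit.boost, newdata = Boston[-train,], n.trees = best.n)
with(Boston[-train,], mean((medv - pred.best)^2))  # test MSE at best.n
```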
Below is the R Markdown file with snippets on non-linear models.
---
title: "NonLinearModels"
output: html_document
---

# Nonlinear Models

```{r}
require(ISLR)
attach(Wage)
```

## Polynomial Regression

The function `poly()` generates a basis of *orthogonal polynomials*.

```{r}
fit.poly = lm(wage ~ poly(age, 4), data = Wage)
summary(fit.poly)
```

Plot the fitted function along with ±2 standard-error bands.

```{r}
age.limits = range(age)
age.grid = seq(from = age.limits[1], to = age.limits[2])
preds = predict(fit.poly, newdata = list(age = age.grid), se = TRUE)
se.bands = cbind(preds$fit + 2 * preds$se, preds$fit - 2 * preds$se)
plot(age, wage, col = "darkgrey")
lines(age.grid, preds$fit, col = "blue")
matlines(age.grid, se.bands, col = "blue", lty = 2)
```

Use `anova()` to test between nested models.

```{r}
fita = lm(wage ~ education, data = Wage)
fitb = lm(wage ~ education + age, data = Wage)
fitc = lm(wage ~ education + poly(age, 2), data = Wage)
fitd = lm(wage ~ education + poly(age, 3), data = Wage)
anova(fita, fitb, fitc, fitd)
```

## Polynomial Logistic Regression

Let the binary response variable be `wage > 250` (wage is in $1000s), coded 1 or 0.

```{r}
fit.log = glm(I(wage > 250) ~ poly(age, 3), data = Wage, family = "binomial")
summary(fit.log)
# Predictions are on the logit scale; build the SE bands there,
# then map them to probabilities with the inverse logit
preds.log = predict(fit.log, newdata = list(age = age.grid), se = TRUE)
se.bands.1 = preds.log$fit + cbind(fit = 0, lower = -2 * preds.log$se, upper = 2 * preds.log$se)
prob.bands = exp(se.bands.1) / (1 + exp(se.bands.1))
matplot(age.grid, prob.bands, col = "blue", lwd = c(2, 1, 1), lty = c(1, 2, 2),
        type = "l", ylim = c(0, 0.1))
```

## Splines

Let's fit a cubic spline with knots at 25, 40, and 60. `bs()` generates the B-spline basis, cubic by default.

```{r}
require(splines)
fit.splines = lm(wage ~ bs(age, knots = c(25, 40, 60)), data = Wage)
plot(age, wage, col = "darkgray")
lines(age.grid, predict(fit.splines, list(age = age.grid)), col = "green", lwd = 2)
abline(v = c(25, 40, 60), lty = 2, col = "darkgreen")
```

Smoothing splines do not require knot selection, but they do have a smoothing parameter, which can be specified through the effective degrees of freedom `df`.

```{r}
# Re-draw the scatterplot, since each knitr chunk opens a new graphics device
plot(age, wage, col = "darkgray")
fit.smooth = smooth.spline(age, wage, df = 16)
lines(fit.smooth, col = "red", lwd = 2)
```

Another way to choose the smoothing parameter is leave-one-out cross-validation (LOOCV).

```{r}
plot(age, wage, col = "darkgray")
fit.smooth.loocv = smooth.spline(age, wage, cv = TRUE)
lines(fit.smooth.loocv, col = "blue", lwd = 2)
```

## GAM - Generalized Additive Models

To fit models with more than one nonlinear term, we use GAMs via the `gam` package. `s()` tells `gam` to fit a smoothing spline in that variable.

```{r}
require(gam)
fit.gam = gam(wage ~ s(age, df = 4) + s(year, df = 4) + education, data = Wage)
par(mfrow = c(1, 3))
plot(fit.gam, se = TRUE)
```

Let's test whether we need a nonlinear term for `year`.

```{r}
fit.gam.1 = gam(wage ~ s(age, df = 4) + year + education, data = Wage)
anova(fit.gam, fit.gam.1, test = "Chisq")
```
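A quick check, not in the original post, of what the *orthogonal* basis buys us: refitting the same degree-4 model with a raw polynomial basis (`raw = TRUE`) changes the coefficients but not the fitted values, so the orthogonal basis only makes the individual coefficients easier to test.

```{r}
# Sketch (assumption: fit.poly and Wage survive from the chunks above).
fit.raw = lm(wage ~ poly(age, 4, raw = TRUE), data = Wage)
max(abs(fitted(fit.poly) - fitted(fit.raw)))  # ~0: identical fitted values
```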
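The GAM machinery also extends to the logistic setting used earlier. A minimal sketch, an addition not from the original post: a logistic GAM for the binary response `wage > 250`, with a smooth term in `age` and linear terms for the rest.

```{r}
# Sketch (assumption: the gam package and Wage data are loaded above).
fit.gam.log = gam(I(wage > 250) ~ s(age, df = 4) + year + education,
                  family = binomial, data = Wage)
par(mfrow = c(1, 3))
plot(fit.gam.log, se = TRUE)
```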