Below is the R Markdown for tree-based models.
--- title: "TreeBasedModels" output: html_document --- # Tree Based Models ## Decision Trees ```{r} require(ISLR) require(tree) attach(Carseats) hist(Sales) High = ifelse(Sales <= 8, "No", "Yes") Carseats = data.frame(Carseats, High) fit.tree = tree(High ~.-Sales, data = Carseats) summary(fit.tree) plot(fit.tree) text(fit.tree, pretty=0) ``` To print Detailed tree Description of entry: `Node Observation Mean_Deviance (%yes, %no)` ```{r} fit.tree ``` Lets generate test-set by spliting careats data into (250, 150) Grow the tree on training set and evaluate the performance of test-set ```{r} set.seed(1011) train = sample(1:nrow(Carseats), 250) fit.tree.1 = tree(High ~.-Sales, data = Carseats, subset = train) summary(fit.tree.1) plot(fit.tree.1) text(fit.tree.1, pretty=0) fit.tree.predict = predict(fit.tree.1, Carseats[-train,], type = "class") with(Carseats[-train,], table(fit.tree.predict, High)) ``` Lets now use the CV method to prune the tree to reduce variance. ```{r} fit.tree.cv = cv.tree(fit.tree.1, FUN = prune.misclass) plot(fit.tree.cv) prune.fit.tree.1 = prune.misclass(fit.tree.1, best = 13) plot(prune.fit.tree.1) text(prune.fit.tree.1, pretty = 0) prune.tree.1.predict = predict(prune.fit.tree.1, Carseats[-train,], type = "class") with(Carseats[-train,], table(prune.tree.1.predict, High)) ``` # Random Forests And Boosting ## Random forest (package : randomForest) We shall use Boston Housing data from MASS package Response: medv ```{r} require(randomForest) require(MASS) set.seed(101) attach(Boston) train = sample(1:nrow(Boston), 300) fit.RF = randomForest(medv ~., data = Boston, subset = train) ``` The MSR & % of variance Explained are based on OOB (out og bag). No. of variables randomly chosen at each split is 4, since $p=13$ we can use all 13 values of `mtry`. ```{r} oob.error = double(13) test.error = double(13) for(mtry in 1:13){ fit = randomForest(medv ~., data = Boston, subset = train, mtry = mtry, ntree=400) oob.error[mtry] = fit$mse[400] pred = predict(fit, Boston[-train,]) test.error[mtry] = with(Boston[-train,], mean((medv - pred)^2)) } matplot(1:mtry, cbind(test.error, oob.error), pch = 19, col = c("red", "blue"), type = "b", ylab = "MSE") legend("topright", legend = c("Test", "OOB"), pch = 19, col = c("red", "blue")) ``` $`mtry` = 13$ corresponds to bagging ## Boosting (package: gbm) Boosting tries to reduce bias unlike RF targeted at variance. By building (`n.trees`) numerous shallow trees (`interaction.depth`) `summary(fit.boost)` gives the variable importance graph. ```{r} require(gbm) fit.boost = gbm(medv ~., data = Boston[train,], distribution = "gaussian", n.trees = 10000, shrinkage = 0.01, interaction.depth = 4) summary(fit.boost) plot(fit.boost, i = "lstat") plot(fit.boost, i = "rm") ``` Tuning the model parameters ```{r} n.trees = seq(from = 100, to = 10000, by = 100) predmat = predict(fit.boost, newdata = Boston[-train,], n.trees = n.trees) # Column wise MSE error = with(Boston[-train,], apply((predmat - medv)^2,2,mean)) plot(n.trees, error, pch = 19, ylab = "MSE", xlab = "#trees", main = "Boosting Test Error") abline(h = min(test.error), col = "red") ```
Below is the R Markdown file with snippets on non-linear models.
--- title: "NonLinearModels" output: html_document --- # Nonlinear Models ```{r} require(ISLR) attach(Wage) ``` ## Polynimoal Regression keyword `poly()` generates abasis function of *orthogonal polynomial*. ```{r} fit.poly = lm(wage ~ poly(age, 4), data = Wage) summary(fit.poly) ``` Plot the fitted function along with SE of fit. ```{r} age.limits = range(age) age.grid = seq(from = age.limits[1], to = age.limits[2]) preds = predict(fit.poly, newdata = list(age = age.grid), se = T) se.bands = cbind(preds$fit + 2 * preds$se, preds$fit - 2 * preds$se) plot(age, wage, col="darkgrey") lines(age.grid, preds$fit, col="blue") matlines(age.grid, se.bands, col="blue", lty=2) ``` Use of `anova()` to test differences between multiple models ```{r} fita = lm(wage ~ education, data = Wage) fitb= lm(wage ~ education+age, data = Wage) fitc = lm(wage ~ education+poly(age,2), data = Wage) fitd = lm(wage ~ education+poly(age,3), data = Wage) anova(fita,fitb,fitc,fitd) ``` ## Polynomial Logistic Regression Let the binary responsible variable be wage > 250K as 1 or 0. ```{r} fit.log = glm(I(wage > 250) ~ poly(age, 3), data = Wage, family = "binomial") summary(fit.log) preds.log = predict(fit.log, newdata = list(age = age.grid), se = T) se.bands.1 = preds.log$fit + cbind(fit = 0, lower = -2*preds.log$se, upper = 2*preds.log$se) prob.bands = exp(se.bands.1)/(1+exp(se.bands.1)) plot(age, wage, col="darkgrey") matplot(age.grid, prob.bands, col = "blue", lwd = c(2,1,1), lty = c(1,2,2), type = "l", ylim = c(0, 0.1)) ``` ## Splines Lets implement cubic spine with knots at 25,40,60 bs() gives teh basis for cubic polynomials ```{r} require(splines) fit.splines = lm(wage ~ bs(age, knots = c(25, 40, 60)), data = Wage) plot(age, wage, col = "darkgray") lines(age.grid, predict(fit.splines, list(age = age.grid)), col = "green", lwd = 2) abline(v = c(25, 40, 60), lty = 2, col = "darkgreen" ) ``` Smoothing Splines doesnot require knot selection but have smoothing parameter, which can be selected by choosing degree of freedom df ```{r} fit.smooth = smooth.spline(age, wage, df = 16) lines(fit.smooth, col = "red", lwd = 2) ``` Another way to choose smoothing parameters is to use LOOCV { leave one out cross validation } ```{r} fit.smooth.loocv = smooth.spline(age, wage, cv = TRUE) lines(fit.smooth.loocv, col = "blue", lwd = 2) ``` ## GAM - Generalized Additive Models To fit models with more than one non-linear terms we use GAMs `gam` package. s() in gam will tell to create a smoothing spline. ```{r} require(gam) fit.gam = gam(wage ~ s(age, df = 4) + s(year, df = 4) + education, data = Wage) par(mfrow = c(1,3)) plot(fit.gam, se = TRUE) ``` Lets see if we need a nonlinear term for year ```{r} fit.gam.1 = gam(wage ~ s(age, df = 4)+year+education, data = Wage) anova(fit.gam, fit.gam.1, test = "Chisq") ```
Below is the R Markdown file showing different techniques for feature selection in R.
Covered techniques are: best subset selection, forward stepwise selection, model selection with a validation set, 10-fold cross-validation, and ridge & lasso regression.
--- title: "ModelSelection" output: html_document --- ```{r} library(ISLR) summary(Hitters) ``` Hitters is baseball dataset where we aim to predict salary of baseball player. Summary shows that there are some NA values in Salary, lets remove them: na.omit: Removes any row that has NA value. ```{r} Hitters = na.omit(Hitters) with(Hitters, sum(is.na(Salary))) ``` Best subset selection ______________________ Look for all combination of models for each specific number of features and determine the best model for each. cp statistic: Pick the model with least cp statistic ```{r} library(leaps) regFit.full = regsubsets(Salary~., data = Hitters, nvmax = 19) reg.summary = summary(regFit.full) names(reg.summary) plot(reg.summary$cp, xlab="Number of Variables", ylab = "cp") best.model.num = which.min(reg.summary$cp) coef(regFit.full, best.model.num) ``` Forward step-wise selection ______________________ Greedy Algorithm, each time it includes a new variable to set as a nested Sequence. ```{r} regFit.fwd = regsubsets(Salary~., data = Hitters, nvmax = 19, method = "forward") reg.summary.fwd = summary(regFit.fwd) plot(regFit.fwd, scale="Cp") ``` Model Selection using a validation-set ______________________ Lets make a training and a validation set, so that we can choose a good subset model. ```{r} dim(Hitters) set.seed(1) train = sample(seq(263), 180, replace = FALSE) regFit.fwd.v = regsubsets(Salary~., data = Hitters[train,], nvmax = 19, method = "forward") reg.summary.fwd.v = summary(regFit.fwd.v) plot(regFit.fwd.v, scale="Cp") ``` Preparing the test data Note: regFit.fwd.v$rss[-1] removes the RSS for null model ```{r} errors.v = rep(NA, 19) test.v = model.matrix(Salary~., data = Hitters[-train,]) for(i in 1:19){ coef.i = coef(regFit.fwd.v, id = i) pred = test.v[, names(coef.i)]%*%coef.i errors.v[i] = mean((Hitters$Salary[-train] - pred)^2) } plot(sqrt(errors.v), ylab="RMSE", ylim=c(300, 400), pch = 19, type = "b") points(sqrt(regFit.fwd.v$rss[-1]/180), col="blue", pch = 19, type = "b") legend("topright", legend=c("Training", "Validation"), col=c("blue", "black"), pch = 19) ``` Model-Selection by cross-validation ______________________ We shall use 10 fold cross validation ```{r} set.seed(11) folds = sample(rep(1:10, length = nrow(Hitters))) table(folds) errors.cv = matrix(NA, 10, 19) for(k in 1:10){ best.fit = regsubsets(Salary~.,data=Hitters[folds!=k,],nvmax = 19, method = "forward") test.v = model.matrix(Salary~., data = Hitters[folds == k,]) for(i in 1:19) { coef.i = coef(best.fit, id = i) pred = test.v[,names(coef.i)]%*%coef.i errors.cv[k, i] = mean((Hitters$Salary[folds == k] - pred)^2) } } rmse = sqrt(apply(errors.cv,2,mean)) plot(rmse,pch=19,type="b") ``` Ridge & Lasso Regression ______________________ Split the Hitters package to predictors (X) and Response (Y) ```{r} library(glmnet) x = model.matrix(Salary~.-1, data = Hitters) y = Hitters$Salary ``` First we shalls ee ridge regression by calling glmnet with alpha = 0 cv.glmnet function which will do the cross-validation for us. Ridge Regression keeps all predictors and try to reduce the coefficients to zero. Ridge uses L1-Norm (Sum of Squares of coefficients). Penality is applied on it. RSS + lambda * L-1 Norm ```{r} fit.ridge = glmnet(x, y, alpha = 0) plot(fit.ridge, xvar = "lambda", label = TRUE) cv.ridge = cv.glmnet(x, y, alpha = 0) plot(cv.ridge) ``` We shall use Lasso regression, by setting the alpha = 1. 
We shall now use lasso regression, by setting `alpha = 1`. The lasso does both shrinkage and feature selection: it penalizes the sum of absolute values of the coefficients (the L1 norm),

$$\text{RSS} + \lambda \sum_{j} |\beta_j|,$$

and this L1 penalty drives some coefficients exactly to zero.

```{r}
fit.lasso = glmnet(x, y, alpha = 1)
plot(fit.lasso, xvar = "lambda", label = TRUE)
cv.lasso = cv.glmnet(x, y, alpha = 1)
plot(cv.lasso)
coef(cv.lasso)
```

Below is an R console transcript exploring the Consumer Complaints dataset.

> fileLocation <- "./dataset/Consumer_Complaints.csv"
>
> #reading the csv file
> data <- read.csv(fileLocation, header = TRUE)
>
> #get the dimensions of the data
> dimData <- dim(data)
> print(dimData)
[1] 471871     16
>
> #get the number of rows
> rows <- nrow(data)
> print(rows)
[1] 471871
>
> #get the column names
> colnames(data)
 [1] "Date.received"                "Product"                      "Sub.product"
 [4] "Issue"                        "Sub.issue"                    "Consumer.complaint.narrative"
 [7] "Company.public.response"      "Company"                      "State"
[10] "ZIP.code"                     "Submitted.via"                "Date.sent.to.company"
[13] "Company.response.to.consumer" "Timely.response."             "Consumer.disputed."
[16] "Complaint.ID"
>
> #get the row names
> rownames(data)
  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12" "13"
 [14] "14" "15" "16" "17" "18" "19" "20" "21" "22" "23" "24" "25" "26"
 [27] "27" "28" "29" "30" "31" "32" "33" "34" "35" "36" "37" "38" "39"
 [40] "40" "41" "42" ......
>
> #view a summary of the data (output not shown)
> summary <- summary(data)
> #print(summary)
>
> #view a quick snapshot of the data structure
> struct <- str(data)
> #print(struct)
>
> #inspect the first few rows with head()
> headData <- head(data)
> #print(headData)
>
> #to see the last few rows of your data, use tail()
> tailData <- tail(data)
> #print(tailData)
>
> #get the column headers
> headers <- names(data)
> print(headers)
 [1] "Date.received"                "Product"                      "Sub.product"
 [4] "Issue"                        "Sub.issue"                    "Consumer.complaint.narrative"
 [7] "Company.public.response"      "Company"                      "State"
[10] "ZIP.code"                     "Submitted.via"                "Date.sent.to.company"
[13] "Company.response.to.consumer" "Timely.response."             "Consumer.disputed."
[16] "Complaint.ID"
>
> #check whether a string is among the headers (FALSE means it was found)
> testString <- "Product"
> check <- is.na(match(testString, headers))
> print(check)
[1] FALSE
>
> #pick the 'Product' column from the data
> productData <- data["Product"]
> print(head(productData))
           Product
1  Debt collection
2 Credit reporting
3  Money transfers
4  Debt collection
5  Debt collection
6 Credit reporting
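A quick aside I'm adding (not part of the original transcript): single-bracket indexing as above keeps a one-column data frame, while `$` or double brackets return a plain vector, which is what most functions like `table()` ultimately want.

> #these return the column as an atomic vector rather than a data frame
> head(data$Product)
> head(data[["Product"]])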
> #subset the data to keep only the first 3 columns
> #R indexes from 1, not 0, so your first column is at [1], not [0]
> subsetData <- data[, 1:3]
> #print(subsetData)
>
> #subset the data with columns that aren't contiguous
> #(dplyr's select() achieves the same)
> subsetData1 <- data[, c(1,3,5,7)]
> #print(subsetData1)
>
> #filter rows where Product is Mortgage
> #(dplyr's filter() achieves the same)
> subsetData2 <- data[data$Product == "Mortgage", ]
> #print(subsetData2)
>
> #slicing data using subset(dataset, row filter, column selection)
> #also, dplyr can chain such steps into a pipeline with %>%
> subsetData3 <- subset(data, Product == "Mortgage", 1:3)
> #print(subsetData3)
>
> #check for NA values in productData
> chkProductData <- complete.cases(productData)
> summary(chkProductData)
   Mode    TRUE    NA's
logical  471871       0
>
> #keep only the non-NA values
> productDataGood <- productData[chkProductData,]
>
> #get the count of complaints for every kind of product
> countMatrix <- table(productDataGood)
> print(countMatrix)
productDataGood
Bank account or service           Consumer loan             Credit card        Credit reporting
                  53460                   16880                   57374                   73234
        Debt collection         Money transfers                Mortgage Other financial service
                  83117                    2995                  166237                     352
            Payday loan            Prepaid card            Student loan
                   3190                    1157                   13875
>
> #normalize the table counts to proportions
> cm_state <- table(data["State"])
> total <- sum(cm_state)
> for(state in names(cm_state)){
+   cm_state[state] = cm_state[state] / total
+ }
> print(cm_state)
>
> #the default mean() returns NA when the numeric data contains any NA,
> #so pass na.rm = TRUE to tell it to exclude NAs
> mean(data$numeric_column, na.rm = TRUE)
>
> #computing combinations: choose(n, k)
> choose(5, 2)
[1] 10
>
> #how many ways to combine fruits into groups of 2
> fruits <- c("apple", "mango", "banana", "grapes", "plum")
> combn(fruits, 2)
     [,1]    [,2]     [,3]     [,4]    [,5]     [,6]     [,7]    [,8]     [,9]     [,10]
[1,] "apple" "apple"  "apple"  "apple" "mango"  "mango"  "mango" "banana" "banana" "grapes"
[2,] "mango" "banana" "grapes" "plum"  "banana" "grapes" "plum"  "grapes" "plum"   "plum"
>
> #to inspect more deeply, use describe() from the psych package;
> #it reports MAD, skew, and kurtosis along with mean, median, and sd
> install.packages("psych")
> library(psych)
> describe(data)
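The normalization loop above can also be written as a one-liner; a hedged sketch I'm adding (not from the original transcript) using base R's prop.table:

> #prop.table converts a table of counts into proportions in one call
> cm_state_prop <- prop.table(table(data["State"]))
> print(cm_state_prop)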