// Spark ML decision-tree walkthrough (spark-shell script): trains a
// DecisionTreeClassifier and a DecisionTreeRegressor on the bundled
// sample_libsvm_data.txt via ML Pipelines, then prints each learned tree
// and the predictions on a held-out split.
//
// NOTE(review): this uses the pre-2.0 API surface (SQLContext plus
// mllib's MLUtils.loadLibSVMFile). On Spark 2.x+ the mllib vectors
// produced by loadLibSVMFile(...).toDF() are NOT accepted by
// ml.feature.VectorIndexer; there, load with
// spark.read.format("libsvm").load(path) instead — confirm target version.
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.{DecisionTreeClassificationModel, DecisionTreeClassifier}
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorIndexer}
import org.apache.spark.ml.regression.{DecisionTreeRegressionModel, DecisionTreeRegressor}
import org.apache.spark.mllib.util.MLUtils

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// ---- Decision Tree Classifier ----------------------------------------------

// Load the libsvm-format sample data as a DataFrame with "label"/"features".
val data = MLUtils.loadLibSVMFile(sc, "./data/sample_libsvm_data.txt").toDF()

// Map string/double labels to contiguous indices (required by the classifier).
val labelIndexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("indexedLabel")
  .fit(data)

// Translate predicted indices back to the original label values.
val indexToString = new IndexToString()
  .setInputCol("prediction")
  .setOutputCol("predictedLabel")
  .setLabels(labelIndexer.labels)

// Automatically flag features with <= 4 distinct values as categorical.
val vectorIndexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(4)
  .fit(data)

val classifier = new DecisionTreeClassifier()
  .setLabelCol("indexedLabel")
  .setFeaturesCol("indexedFeatures")

val pipeline = new Pipeline()
  .setStages(Array(labelIndexer, vectorIndexer, classifier, indexToString))

// 70/30 train/test split (random seed not fixed, so results vary per run).
val Array(train, test) = data.randomSplit(Array(0.7, 0.3))

val model = pipeline.fit(train)

// Stage 2 of the fitted pipeline is the trained tree model.
val treeModel = model.stages(2).asInstanceOf[DecisionTreeClassificationModel]
// Print the tree model (bare toDebugString only echoes in a REPL; println
// makes the script print it when submitted as a job too).
println(treeModel.toDebugString)

val prediction = model.transform(test)
prediction.show()

// ---- Decision Tree Regressor -----------------------------------------------

// Regression trains directly on the raw "label" column, so no label indexing
// (and no IndexToString) stage is needed here.
val regressor = new DecisionTreeRegressor()
  .setLabelCol("label")
  .setFeaturesCol("indexedFeatures")

val pipelineReg = new Pipeline().setStages(Array(vectorIndexer, regressor))

val modelReg = pipelineReg.fit(train)

// Stage 1 of the fitted regression pipeline is the trained tree model.
val treeModelReg = modelReg.stages(1).asInstanceOf[DecisionTreeRegressionModel]
// Print the tree model.
println(treeModelReg.toDebugString)

val predictionReg = modelReg.transform(test)
predictionReg.show()
0 Comments
Leave a Reply.
Archives
October 2016
Categories
All
|