import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.stat.test.ChiSqTestResult
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg._
import org.apache.spark.mllib.stat.{KernelDensity, Statistics, MultivariateStatisticalSummary}
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.random.RandomRDDs._

// chi-squared goodness-of-fit test of observed frequencies against a uniform distribution
val testVector = Vectors.dense(0.3, 0.1, 0.2, 0.1, 0.9, 0.05)
val goodnessOfFitResult: ChiSqTestResult = Statistics.chiSqTest(testVector)
println(s"goodnessOfFitResult p-value: ${goodnessOfFitResult.pValue} nullHypothesis: ${goodnessOfFitResult.nullHypothesis}")

// chi-squared test for independence on a contingency matrix (values are column-major)
val testMatrix = Matrices.dense(numRows = 3, numCols = 2, Array(2.0, 4.0, 1.0, 5.0, 7.0, 3.0))
val independenceResult: ChiSqTestResult = Statistics.chiSqTest(testMatrix)
println(s"independenceResult p-value: ${independenceResult.pValue} nullHypothesis: ${independenceResult.nullHypothesis}")

// chi-squared independence test of each feature against the label in an RDD[LabeledPoint];
// returns one result per feature
val testLabeledPoint = sc.parallelize(Array(
  LabeledPoint(0, Vectors.dense(0.5, 0.4)),
  LabeledPoint(1, Vectors.dense(0.3, 0.1)),
  LabeledPoint(0, Vectors.dense(0.2, 0.8))
))
val independenceResult1: Array[ChiSqTestResult] = Statistics.chiSqTest(testLabeledPoint)
independenceResult1 foreach { r =>
  println(s"independenceResult1 p-value: ${r.pValue} nullHypothesis: ${r.nullHypothesis}")
}

// Kolmogorov-Smirnov test for equality of distribution: sample against N(0, 1)
val testNormal: RDD[Double] = normalRDD(sc, size = 1000L, numPartitions = 1, seed = 90L)
val testNormalResult = Statistics.kolmogorovSmirnovTest(testNormal, "norm", 0, 1)
println(s"testNormalResult p-value: ${testNormalResult.pValue} nullHypothesis: ${testNormalResult.nullHypothesis}")

// kernel density estimate over the same sample, evaluated on a grid of points
val kd = new KernelDensity().setSample(testNormal).setBandwidth(0.1)
val densities = kd.estimate((-2.0 to 2.0 by 0.5).toArray)
println(s"Kernel densities: ${densities.mkString(", ")}")
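kolmogorovSmirnovTest also has an overload that takes a user-supplied CDF instead of the "norm" shorthand. A minimal sketch, assuming Apache Commons Math (commons-math3, which Spark already depends on) is on the classpath:

import org.apache.commons.math3.distribution.NormalDistribution

// same test as above, but with an explicit CDF function
val stdNormal = new NormalDistribution(0.0, 1.0)
val customCdfResult = Statistics.kolmogorovSmirnovTest(testNormal, (v: Double) => stdNormal.cumulativeProbability(v))
println(s"customCdfResult p-value: ${customCdfResult.pValue}")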
import org.apache.spark.mllib.linalg.distributed.IndexedRow

// randomSplit: divide an RDD into train/test/validation sets by weight
val data = sc.parallelize(1 to 10000000)
val Array(train, test, validation) = data.randomSplit(Array(0.5, 0.25, 0.25), seed = 90L)
println(s"splits train ${train.count()} test ${test.count()} validate ${validation.count()}")

// stratified sampling with sampleByKey
val rows: RDD[IndexedRow] = sc.parallelize(Array(
  IndexedRow(0L, Vectors.dense(1.0, 6.0, 2.0)),
  IndexedRow(1L, Vectors.dense(3.0, 1.0, 3.0)),
  IndexedRow(1L, Vectors.dense(4.0, 2.0, 1.0))
))
// sampling fraction per key: keep every record with label/index 0L (1.0)
// and roughly half of the records with label/index 1L (0.5)
val fractions = Map(0L -> 1.0, 1L -> 0.5)
val sample = rows
  .map { case IndexedRow(i, v) => (i, v) }
  .sampleByKey(withReplacement = false, fractions = fractions, seed = 90L)
sample.collect().foreach(println)
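sampleByKey draws each stratum in a single pass, so the realized sample sizes only approximate the requested fractions. When the per-key counts need to match the fractions closely, sampleByKeyExact can be used instead, at the cost of extra passes over the data. A minimal sketch reusing rows and fractions from above:

// sampleByKeyExact aims for exactly ceil(numItems * fraction) records per key,
// which may require scanning the RDD more than once
val exactSample = rows
  .map { case IndexedRow(i, v) => (i, v) }
  .sampleByKeyExact(withReplacement = false, fractions = fractions, seed = 90L)
exactSample.collect().foreach(println)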
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg._
import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics}
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.random.RandomRDDs._

val observations: RDD[Vector] = sc.parallelize(Array(
  Vectors.dense(1.0, 2.0, 3.0),
  Vectors.dense(4.0, 3.0, 7.0),
  Vectors.dense(5.0, 6.0, 4.0)
))
val x: RDD[Double] = sc.parallelize(Array(1.0, 2.0, 3.0))
val y: RDD[Double] = sc.parallelize(Array(3.0, 2.0, 1.0))

// column-wise summary statistics over the observation vectors
val stats: MultivariateStatisticalSummary = Statistics.colStats(observations)
println(stats.mean)
println(stats.count)

// Pearson correlation between two RDD[Double] series
val corr: Double = Statistics.corr(x, y, "pearson")
println(corr)

// pairwise Pearson correlation matrix over the columns of an RDD[Vector]
val corrMatrix: Matrix = Statistics.corr(observations, "pearson")
println(corrMatrix)

// random data generation: Poisson-distributed doubles and normal vectors
val rand1: RDD[Double] = poissonRDD(sc, mean = 1.0, size = 1000000L, numPartitions = 10)
println(s"Rand1 mean ${rand1.mean()} variance ${rand1.variance()}")

val rand2: RDD[Vector] = normalVectorRDD(sc, numRows = 1000L, numCols = 10, numPartitions = 10)
val rand2Stats = Statistics.colStats(rand2)
println(s"Rand2 mean ${rand2Stats.mean} variance ${rand2Stats.variance}")
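The same corr API also supports Spearman's rank correlation by passing "spearman" as the method name. A minimal sketch reusing x, y, and observations from above:

// Spearman rank correlation; picks up monotone but non-linear relationships
// that Pearson would understate
val spearman: Double = Statistics.corr(x, y, "spearman")
println(spearman)

val spearmanMatrix: Matrix = Statistics.corr(observations, "spearman")
println(spearmanMatrix)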
Chained filter and map operations on a Scala collection can be compressed into a single pass; otherwise each step materializes an intermediate collection, so we end up with more garbage and more time spent building the final result. This can be accomplished with the collect method, which takes a partial function (not to be confused with the Spark RDD action of the same name).
Below is the example:

val x = List((1, 2), (2, 3), (3, 4), (4, 5), (5, 6))
x.filter(_._1 % 2 == 0).map(_._2)

// the chaining above can be replaced by a single collect
x.collect { case (a, b) if a % 2 == 0 => b }
View: a view applies its transformations lazily, as one composed function, instead of materializing a series of intermediate collections.
// the two maps are fused into one traversal; nothing is computed
// until the view is forced
val shifted = (1 to 1000000).view map (_ + 5) map (_ * 2)
shifted.force // materializes the final collection in a single pass
Using lazy evaluation on collections doesn't always guarantee better performance. Lazy evaluation requires the creation of an additional closure for each transformation, and if creating those closures costs more than building the intermediate collections, the lazy version will run slower. Typically, for small collections the strict version runs faster than the lazy one.
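A rough way to check which side of this trade-off a given pipeline falls on is to time both versions. A minimal sketch with an ad hoc time helper (single-shot JVM timings are only indicative because of warm-up and GC; use a proper harness such as JMH for real measurements):

// naive single-shot timer
def time[A](label: String)(body: => A): A = {
  val start = System.nanoTime()
  val result = body
  println(s"$label: ${(System.nanoTime() - start) / 1e6} ms")
  result
}

val n = 1000
time("strict") { (1 to n).map(_ + 5).map(_ * 2) }
time("lazy") { ((1 to n).view map (_ + 5) map (_ * 2)).force }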