// Continuation of the random-data example: df1 is the range DataFrame
// created with sqlContext.range(0, 10) (see the summary-statistics snippet).
import org.apache.spark.sql.functions.{mean, rand, randn, udf}

val randomDF = df1.select("id")
  .withColumn("uniform", rand(10L))
  .withColumn("normal", randn(10L))
randomDF.show()

// Handling missing data: map part of each column to NaN with UDFs
val halfToNaN = udf[Double, Double](x => if (x > 0.5) Double.NaN else x)
val negHalfToNaN = udf[Double, Double](x => if (x < -0.5) Double.NaN else x)
val nanRandomDF = randomDF
  .withColumn("uniformNan", halfToNaN(randomDF("uniform")))
  .withColumn("normalNan", negHalfToNaN(randomDF("normal")))
  .drop("uniform")
  .withColumnRenamed("uniformNan", "uniform")
  .drop("normal")
  .withColumnRenamed("normalNan", "normal")
nanRandomDF.show()

// Drop rows that have fewer than 3 non-null values
nanRandomDF.na.drop(minNonNulls = 3).show()

// Drop rows where all of the specified columns are NaN
nanRandomDF.na.drop("all", Array("uniform", "normal")).show()

// Drop rows where any of the specified columns is NaN
nanRandomDF.na.drop("any", Array("uniform", "normal")).show()

// Fill all NaN values with 0.0
nanRandomDF.na.fill(0.0).show()

// Fill "uniform" with its column mean; fill takes a value map Map("column" -> value)
val uniformMean = nanRandomDF.filter("uniform <> 'NaN'").groupBy().agg(mean("uniform")).first()(0)
nanRandomDF.na.fill(Map("uniform" -> uniformMean)).show()

// Column names except id
val dfCols = nanRandomDF.columns.drop(1)
val dfMeans = nanRandomDF.na.drop().groupBy().agg(mean("uniform"), mean("normal")).first().toSeq
val meansMap = dfCols.zip(dfMeans).toMap
nanRandomDF.na.fill(meansMap).show()

// Replace all NaN in the uniform column with 0.0
nanRandomDF.na.replace("uniform", Map(Double.NaN -> 0.0)).show()
Summary Statistics for DataFrames
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.sql.functions.{rand, randn}

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._

// Case class backing the example DataFrame; val1/val2 match the columns
// used below (the first field name is assumed).
case class Record(name: String, val1: Int, val2: Int)

// Statistics on a DataFrame
val recordsDF = sc.parallelize(Array(
  Record("alpha", 1, 2),
  Record("beta", 3, 4),
  Record("gamma", 5, 6)
)).toDF()

// df.describe() output
val recordStats = recordsDF.describe()
recordStats.show()

// Fetching the results out of the describe() DataFrame
val stddev = recordStats.filter("summary = 'stddev'")
  .first().toSeq.toArray.drop(1)
  .map(_.toString.toDouble)
val val1Stats: Array[Double] = recordStats.select("val1")
  .map(x => x(0).toString.toDouble)
  .collect()
println(val1Stats.mkString(", "))

// Statistics using groupBy: min of column val1 and max of column val2
val grpBy1 = recordsDF.groupBy().agg(Map("val1" -> "min", "val2" -> "max"))
val grpBy1Summary = grpBy1.first().toSeq.toArray.map(_.toString.toDouble)
println(s"column names: ${grpBy1.columns.mkString(", ")}\ndata: ${grpBy1Summary.mkString(", ")}")

// More statistics: corr(), cov(), freqItems()
val recordStatFun = recordsDF.stat
recordStatFun.corr("val1", "val2")
recordStatFun.cov("val1", "val2")
recordStatFun.freqItems(Seq("val1"), 0.3)

// Sampling on DataFrames
val df = sqlContext.createDataFrame(Seq((1, 10), (2, 11), (3, 12), (4, 13))).toDF("key", "value")
df.sample(withReplacement = false, fraction = 0.4, seed = 90L)

val Array(train, test) = df.randomSplit(weights = Array(0.7, 0.3), seed = 90L)
train.show()
test.show()

val stratifiedSampling = df.stat.sampleBy("key", fractions = Map(1 -> 0.7, 2 -> 0.7), seed = 90L)
stratifiedSampling.show()

// Random data generation
val df1 = sqlContext.range(0, 10)
val newDF = df1.select("id").withColumn("uniform", rand(10L)).withColumn("normal", randn(10L))
newDF.show()
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.stat.test.ChiSqTestResult
import org.apache.spark.mllib.linalg._
import org.apache.spark.mllib.stat.{KernelDensity, MultivariateStatisticalSummary, Statistics}
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.random.RandomRDDs._

// Chi-squared goodness-of-fit test on a vector of observed frequencies
val testVector = Vectors.dense(0.3, 0.1, 0.2, 0.1, 0.9, 0.05)
val goodnessOfFitResult: ChiSqTestResult = Statistics.chiSqTest(testVector)
println(s"goodnessOfFitResult p-value: ${goodnessOfFitResult.pValue} nullHypothesis: ${goodnessOfFitResult.nullHypothesis}")

// Chi-squared test of independence on a contingency matrix
val testMatrix = Matrices.dense(numRows = 3, numCols = 2, Array(2.0, 4.0, 1.0, 5.0, 7.0, 3.0))
val independenceResult: ChiSqTestResult = Statistics.chiSqTest(testMatrix)
println(s"independenceResult p-value: ${independenceResult.pValue} nullHypothesis: ${independenceResult.nullHypothesis}")

// Chi-squared test of independence between each feature and the label
val testLabeledPoint = sc.parallelize(Array(
  LabeledPoint(0, Vectors.dense(0.5, 0.4)),
  LabeledPoint(1, Vectors.dense(0.3, 0.1)),
  LabeledPoint(0, Vectors.dense(0.2, 0.8))
))
val independenceResult1: Array[ChiSqTestResult] = Statistics.chiSqTest(testLabeledPoint)
independenceResult1 foreach { r =>
  println(s"independenceResult1 p-value: ${r.pValue} nullHypothesis: ${r.nullHypothesis}")
}

// Kolmogorov-Smirnov test for equality of distribution
val testNormal: RDD[Double] = normalRDD(sc, size = 1000L, numPartitions = 1, seed = 90L)
val testNormalResult = Statistics.kolmogorovSmirnovTest(testNormal, "norm", 0, 1)
println(s"testNormalResult p-value: ${testNormalResult.pValue} nullHypothesis: ${testNormalResult.nullHypothesis}")

// Kernel density estimate over a grid of points
val kd = new KernelDensity().setSample(testNormal).setBandwidth(0.1)
val densities = kd.estimate((-2.0 to 2.0 by 0.5).toArray)
println("Kernel densities:")
densities foreach println
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.IndexedRow
import org.apache.spark.rdd.RDD

// randomSplit
val data = sc.parallelize(1 to 10000000)
val Array(train, test, validation) = data.randomSplit(Array(0.5, 0.25, 0.25), seed = 90L)
println(s"splits train ${train.count()} test ${test.count()} validation ${validation.count()}")

// Stratified sampling
val rows: RDD[IndexedRow] = sc.parallelize(Array(
  IndexedRow(0L, Vectors.dense(1.0, 6.0, 2.0)),
  IndexedRow(1L, Vectors.dense(3.0, 1.0, 3.0)),
  IndexedRow(1L, Vectors.dense(4.0, 2.0, 1.0))
))

// Probability of picking a sample with label/index 0L is 1.0; with 1L it is 0.5
val fractions = Map(0L -> 1.0, 1L -> 0.5)
val sample = rows
  .map { case IndexedRow(i, v) => (i, v) }
  .sampleByKey(withReplacement = false, fractions = fractions, seed = 90L)
sample.collect().foreach(println)
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg._
import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics}
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.random.RandomRDDs._

val observations: RDD[Vector] = sc.parallelize(Array(
  Vectors.dense(1.0, 2.0, 3.0),
  Vectors.dense(4.0, 3.0, 7.0),
  Vectors.dense(5.0, 6.0, 4.0)
))
val x: RDD[Double] = sc.parallelize(Array(1.0, 2.0, 3.0))
val y: RDD[Double] = sc.parallelize(Array(3.0, 2.0, 1.0))

// Column-wise summary statistics
val stats: MultivariateStatisticalSummary = Statistics.colStats(observations)
println(stats.mean)
println(stats.count)

// Pearson correlation between two RDD[Double]
val corr: Double = Statistics.corr(x, y, "pearson")
println(corr)

// Pearson correlation matrix over the columns of observations
val corrMatrix: Matrix = Statistics.corr(observations, "pearson")
println(corrMatrix)

// Random data generation
val rand1: RDD[Double] = poissonRDD(sc, mean = 1.0, size = 1000000L, numPartitions = 10)
println(s"Rand1 mean ${rand1.mean()} variance ${rand1.variance()}")

val rand2: RDD[Vector] = normalVectorRDD(sc, numRows = 1000L, numCols = 10, numPartitions = 10)
val rand2Stats = Statistics.colStats(rand2)
println(s"Rand2 mean ${rand2Stats.mean} variance ${rand2Stats.variance}")
Chaining filter and map builds an intermediate collection after each step, which means more garbage and more time spent constructing the final collection. The two steps can be fused into a single pass with the collect method.
Below is an example:

val x = List((1, 2), (2, 3), (3, 4), (4, 5), (5, 6))
x.filter(_._1 % 2 == 0).map(_._2)

// The chaining above can be replaced by a single collect
x.collect { case x if x._1 % 2 == 0 => x._2 }
View: A view runs transformations as functional composition instead of as a series of intermediate collections.
(1 to 1000000).view map (_ + 5) map (_ * 2)
Using lazy evaluation on collections doesn't always guarantee better performance. Lazy evaluation requires creating an extra closure for each transformation, and if building those closures takes longer than building the intermediate collections, the lazy version runs slower. Typically, for small collections the strict version is faster than the lazy one.
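To see the trade-off in practice, both versions can be timed. The sketch below is only illustrative: the timeIt helper and the collection size are made up for this example, and a single System.nanoTime measurement is not a rigorous benchmark.

// Crude timing helper -- illustrative only, not a proper micro-benchmark.
def timeIt[A](label: String)(body: => A): A = {
  val start = System.nanoTime()
  val result = body
  println(f"$label took ${(System.nanoTime() - start) / 1e6}%.2f ms")
  result
}

val n = 1000000

// Strict: each map builds a full intermediate collection.
timeIt("strict") { ((1 to n) map (_ + 5) map (_ * 2)).sum }

// Lazy view: no intermediate collections, but an extra closure per element.
timeIt("view") { ((1 to n).view map (_ + 5) map (_ * 2)).sum }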
package com.examples.algorithms

object MergeSortAlgorithm {

  // Merge two already-sorted lists into one sorted list.
  def merge(left: List[Int], right: List[Int]): List[Int] = (left, right) match {
    case (l, Nil) => l
    case (Nil, r) => r
    case (l :: l1, r :: r1) =>
      if (l < r) l :: merge(l1, right)
      else r :: merge(left, r1)
  }

  // Split the input in half, sort each half recursively, then merge.
  def run(input: List[Int]): List[Int] = {
    val n = input.length / 2
    if (n == 0) input
    else {
      val (left, right) = input.splitAt(n)
      merge(run(left), run(right))
    }
  }
}
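For reference, a quick way to exercise the object (the input list here is arbitrary):

import com.examples.algorithms.MergeSortAlgorithm

object MergeSortExample extends App {
  val unsorted = List(5, 1, 4, 2, 8, 3)
  println(MergeSortAlgorithm.run(unsorted)) // List(1, 2, 3, 4, 5, 8)
}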