---
## 5. Hypothesis Testing
Spark currently supports Pearson's chi-squared tests, including the goodness-of-fit test and the test of independence.
First, we import the necessary packages:
```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg._              // Vector, Vectors, Matrix, Matrices
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.stat.Statistics       // chiSqTest, kolmogorovSmirnovTest
```
Next, we pick the data to analyze from the dataset; here we take the first two records of the iris dataset as v1 and v2. The input type decides which test is run: the goodness-of-fit test takes a Vector as input, while the test of independence takes a Matrix.
```scala
scala> val v1: Vector = sc.textFile("G:/spark/iris.data").map(_.split(",")).map(p => Vectors.dense(p(0).toDouble, p(1).toDouble, p(2).toDouble, p(3).toDouble)).first
v1: org.apache.spark.mllib.linalg.Vector = [5.1,3.5,1.4,0.2]
scala> val v2: Vector = sc.textFile("G:/spark/iris.data").map(_.split(",")).map(p => Vectors.dense(p(0).toDouble, p(1).toDouble, p(2).toDouble, p(3).toDouble)).take(2).last
v2: org.apache.spark.mllib.linalg.Vector = [4.9,3.0,1.4,0.2]
```
### (1) Goodness-of-Fit Test
The goodness-of-fit test checks whether the observed frequency distribution of a sample differs from a theoretical distribution. Its null hypothesis (H0) is that the observed counts in the sample follow a particular theoretical distribution. The observed counts from a multinomial experiment are compared with the expected counts under the null hypothesis to see how close they are; in other words, the sample data are used to test whether the population follows a specific distribution.
Usually the theoretical distribution in question is the uniform distribution, and that is also Spark's default. The code is as follows:
```scala
scala> val goodnessOfFitTestResult = Statistics.chiSqTest(v1)
goodnessOfFitTestResult: org.apache.spark.mllib.stat.test.ChiSqTestResult =
Chi squared test summary:
method: pearson
degrees of freedom = 3
statistic = 5.588235294117647
pValue = 0.1334553914430291
No presumption against null hypothesis: observed follows the same distribution as expected..
```
The output shows the p-value, the degrees of freedom, the test statistic, the method used, and the null hypothesis. Briefly, each item means the following:

- method: the test method; Pearson's method is used here.
- statistic: the test statistic, i.e. the evidence used to decide whether the null hypothesis can be rejected. It is computed from the sample data and summarizes the information in the sample; the larger its absolute value, the stronger the case for rejecting the null hypothesis, and conversely the smaller it is, the stronger the case for not rejecting it.
- degrees of freedom: the number of sample observations that are free to vary.
- pValue: the p-value obtained from the significance test. Conventionally P < 0.05 is regarded as significant and P < 0.01 as highly significant, meaning the probability that the observed difference is due to sampling error alone is below 0.05 or 0.01.
In practice, looking at the p-value is usually enough. Here pValue = 0.133, so the difference is not statistically significant: given the observations in v1, [5.1, 3.5, 1.4, 0.2], we cannot reject the hypothesis that they follow the expected distribution (uniform by default).
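To make the statistic concrete, here is a minimal sketch (not part of the original walkthrough) that recomputes it by hand; under the default uniform expectation, each of the four categories is expected to receive one quarter of the total mass.
```scala
// Recompute the chi-squared statistic for v1 by hand.
// Uniform expectation: each category expects sum/4 = 10.2/4 = 2.55.
val observed = v1.toArray
val expected = observed.sum / observed.length
val chi2 = observed.map(o => math.pow(o - expected, 2) / expected).sum
println(chi2)   // 5.588235294117647, matching the Spark output above
```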
### (2) Test of Independence
The chi-squared test of independence checks whether two attributes are independent of each other. One attribute forms the rows and the other forms the columns, and the test examines whether an apparently related pair of attributes is actually associated, for example temperature changes and the incidence of pneumonia.
First, we build a Matrix from v1 and v2 and then run the independence test:
```scala
scala> val mat: Matrix = Matrices.dense(2, 2, Array(v1(0), v1(1), v2(0), v2(1)))
mat: org.apache.spark.mllib.linalg.Matrix =
5.1 4.9
3.5 3.0
scala> val a = Statistics.chiSqTest(mat)
a: org.apache.spark.mllib.stat.test.ChiSqTestResult =
Chi squared test summary:
method: pearson
degrees of freedom = 1
statistic = 0.012787584067389817
pValue = 0.90996538641943
No presumption against null hypothesis: the occurrence of the outcomes is statistically independent..
```
The two attributes being tested for independence here are the record index and the attribute values. In this example pValue = 0.91, so we cannot reject the hypothesis that the record index is independent of the values. This matches the dataset itself: v1 and v2 are two records drawn from the same dataset, so which record a value comes from should have no bearing on the value.
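For reference, a minimal sketch (not from the original text) of the arithmetic behind this result: the expected cell counts are derived from the row and column marginals of the 2x2 matrix, and the statistic sums the squared deviations from them.
```scala
// Manual chi-squared statistic for the 2x2 matrix above.
val obsMat  = Array(Array(5.1, 4.9), Array(3.5, 3.0))
val total   = obsMat.flatten.sum
val rowSums = obsMat.map(_.sum)
val colSums = Array(obsMat.map(_(0)).sum, obsMat.map(_(1)).sum)
val chi2 = (for (i <- 0 to 1; j <- 0 to 1) yield {
  val e = rowSums(i) * colSums(j) / total   // expected count under independence
  math.pow(obsMat(i)(j) - e, 2) / e
}).sum
println(chi2)   // 0.012787..., matching the statistic reported above
```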
We can also treat v1 as the observed sample and v2 as the expected values and run a chi-squared test:
```scala
scala> val c1 = Statistics.chiSqTest(v1, v2)
c1: org.apache.spark.mllib.stat.test.ChiSqTestResult =
Chi squared test summary:
method: pearson
degrees of freedom = 3
statistic = 0.03717820461517941
pValue = 0.9981145601231336
No presumption against null hypothesis: observed follows the same distribution as expected..
```
Here pValue = 0.998, meaning there is no significant difference between the sample v1 and a distribution whose expected values are given by v2. Indeed, v1 = [5.1, 3.5, 1.4, 0.2] and v2 = [4.9, 3.0, 1.4, 0.2] are very close, so v1 could plausibly have been drawn from a distribution with expected values v2.
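Note that the expected vector does not have to sum to the same total as the observed vector; the statistic reported above is consistent with the expected counts being rescaled to the observed total before the comparison. A small sketch (not part of the original text) of that arithmetic:
```scala
// Sketch: rescale the expected counts (v2) to the observed total (v1.sum),
// then compute the usual chi-squared statistic.
val obsArr = v1.toArray
val expArr = v2.toArray
val scaled = expArr.map(_ * obsArr.sum / expArr.sum)
val chi2 = obsArr.zip(scaled).map { case (o, e) => math.pow(o - e, 2) / e }.sum
println(chi2)   // 0.037178..., matching Statistics.chiSqTest(v1, v2) above
```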
In the same way, labeled data can be tested for independence. Here we build LabeledPoint records (a label plus a feature vector) from the iris data:
```scala
scala> val data = sc.textFile("G:/spark/iris.data")
data: org.apache.spark.rdd.RDD[String] = G:/spark/iris.data MapPartitionsRDD[13] at textFile at <console>:44
scala> val obs = data.map { line =>
     |   val parts = line.split(',')
     |   LabeledPoint(
     |     if (parts(4) == "Iris-setosa") 0.0
     |     else if (parts(4) == "Iris-versicolor") 1.0
     |     else 2.0,
     |     Vectors.dense(parts(0).toDouble, parts(1).toDouble, parts(2).toDouble, parts(3).toDouble))
     | }
obs: org.apache.spark.rdd.RDD[org.apache.spark.mllib.regression.LabeledPoint] = MapPartitionsRDD[14] at map at <console>:46
```
Now run the independence test; it returns an array containing one chi-squared test result for each feature against the label:
```scala
scala> val featureTestResults= Statistics.chiSqTest(obs)
featureTestResults: Array[org.apache.spark.mllib.stat.test.ChiSqTestResult] =
Array(Chi squared test summary:
method: pearson
degrees of freedom = 68
statistic = 156.26666666666665
pValue = 6.6659873176888595E-9
Very strong presumption against null hypothesis: the occurrence of the outcomes
is statistically independent.., Chi squared test summary:
method: pearson
degrees of freedom = 44
statistic = 88.36446886446883
pValue = 8.303947787857702E-5
Very strong presumption against null hypothesis: the occurrence of the outcomes
is statistically independent.., Chi squared test summary:
method: pearson
degrees of freedom = 84
statistic = 271.79999999999995
pValue = 0.0
Very strong presumption against null hypothesis: the occurrence of the outcomes
is statistically independent.., Chi...
```
Each column of the feature data is tested for independence against the label column. The p-values are all very small, so for every column we can reject the hypothesis that it is independent of the label; in other words, each feature can be considered correlated with the label. We use foreach to print the full results:
```scala
scala> var i = 1
i: Int = 1
scala> featureTestResults.foreach { result =>
| println(s"Column $i:\n$result")
| i += 1
| }
Column 1:
Chi squared test summary:
method: pearson
degrees of freedom = 68
statistic = 156.26666666666665
pValue = 6.6659873176888595E-9
Very strong presumption against null hypothesis: the occurrence of the outcomes
is statistically independent..
Column 2:
Chi squared test summary:
method: pearson
degrees of freedom = 44
statistic = 88.36446886446883
pValue = 8.303947787857702E-5
Very strong presumption against null hypothesis: the occurrence of the outcomes
is statistically independent..
Column 3:
Chi squared test summary:
method: pearson
degrees of freedom = 84
statistic = 271.79999999999995
pValue = 0.0
Very strong presumption against null hypothesis: the occurrence of the outcomes
is statistically independent..
Column 4:
Chi squared test summary:
method: pearson
degrees of freedom = 42
statistic = 271.75
pValue = 0.0
Very strong presumption against null hypothesis: the occurrence of the outcomes
is statistically independent..
```
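As a side note, these per-feature results can be used for simple filter-style feature selection. The sketch below is not part of the original example, and the 0.05 threshold is an arbitrary choice:
```scala
// Keep the indices of features whose chi-squared p-value against the
// label is below 0.05; for this iris example all four features pass.
val selected = featureTestResults.zipWithIndex
  .collect { case (result, idx) if result.pValue < 0.05 => idx }
println(selected.mkString(", "))   // 0, 1, 2, 3
```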
Spark also supports the Kolmogorov-Smirnov test; the steps are shown below:
```scala
scala> val test = sc.textFile("G:/spark/iris.data").map(_.split(",")).map(p => p(0).toDouble)
test: org.apache.spark.rdd.RDD[Double] = MapPartitionsRDD[22] at map at <console>:44
// run a KS test for the sample versus a standard normal distribution
scala> val testResult = Statistics.kolmogorovSmirnovTest(test, "norm", 0, 1)
testResult: org.apache.spark.mllib.stat.test.KolmogorovSmirnovTestResult =
Kolmogorov-Smirnov test summary:
degrees of freedom = 0
statistic = 0.999991460094529
pValue = 0.0
Very strong presumption against null hypothesis: Sample follows theoretical distribution.
// perform a KS test using a cumulative distribution function of our making
scala> val myCDF: Double => Double = (p => p * 2)
myCDF: Double => Double = <function1>
scala> val testResult2 = Statistics.kolmogorovSmirnovTest(test, myCDF)
testResult2: org.apache.spark.mllib.stat.test.KolmogorovSmirnovTestResult = Kolmogorov-Smirnov test summary:
degrees of freedom = 0
statistic = 14.806666666666668
pValue = 0.0
Very strong presumption against null hypothesis: Sample follows theoretical distribution.
```
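In both runs the p-value is 0.0, so the hypothesis that the sample follows the given theoretical distribution is rejected; this is expected, since raw sepal lengths (roughly between 4 and 8) are nowhere near a standard normal N(0,1). A sketch, not part of the original text, of testing against a normal distribution fitted to the sample instead:
```scala
// Compare the sample against a normal distribution parameterized by the
// sample's own mean and standard deviation rather than N(0, 1).
val fitted = Statistics.kolmogorovSmirnovTest(test, "norm", test.mean(), test.stdev())
println(fitted)
```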
## 6. Random Data Generation
RandomRDDs is a utility class for generating RDDs of random numbers drawn from a given distribution. The RandomRDDs package currently supports the normal, Poisson and uniform distributions, and can produce RDDs of random doubles or random vectors.
The example below generates a random RDD[Double] whose values follow the standard normal distribution N(0,1), and then maps it to N(1,4).
First, import the necessary packages:
```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.random.RandomRDDs._
```
Generate 10,000,000 values from the standard normal distribution N(0,1) as an RDD[Double], distributed across 10 partitions:
```scala
scala> val u = normalRDD(sc, 10000000L, 10)
u: org.apache.spark.rdd.RDD[Double] = RandomRDD[35] at RDD at RandomRDD.scala:38
```
Map the generated random numbers to the normal distribution N(1,4):
```scala
scala> val v = u.map(x => 1.0 + 2.0 * x)
v: org.apache.spark.rdd.RDD[Double] = MapPartitionsRDD[36] at map at <console>:50
```
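As a quick sanity check (a sketch, not part of the original walkthrough), the transformed RDD should have a mean close to 1 and a standard deviation close to 2; the other distributions mentioned above are generated in the same way:
```scala
// v should be approximately N(1, 4): mean ≈ 1.0, stdev ≈ 2.0.
val stats = v.stats()
println(s"mean = ${stats.mean}, stdev = ${stats.stdev}")

// Analogous generators for the other supported distributions, e.g.:
val uni  = uniformRDD(sc, 1000000L, 10)        // Uniform(0, 1)
val pois = poissonRDD(sc, 3.0, 1000000L, 10)   // Poisson with mean 3
```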
## 7. Kernel Density Estimation
Spark MLlib provides the KernelDensity utility class for kernel density estimation, which infers an unknown probability density from known samples and is a non-parametric method. The idea is as follows: observing the known values of some quantity, if a value appears in the observations, its probability density can be considered high; values close to it also have relatively high density, while values far away have lower density. As of Spark 1.6.2 the Gaussian kernel is supported.
First, import the necessary packages:
```scala
import org.apache.spark.mllib.stat.KernelDensity
import org.apache.spark.rdd.RDD
```
We also reuse the data loaded earlier:
```scala
scala> val test = sc.textFile("G:/spark/iris.data").map(_.split(",")).map(p => p(0).toDouble)
test: org.apache.spark.rdd.RDD[Double] = MapPartitionsRDD[22] at map at <console>:44
```
Build the kernel density estimator from sample data; here we use the first iris attribute, already loaded in the hypothesis-testing section, as the sample:
```scala
scala> val kd = new KernelDensity().setSample(test).setBandwidth(3.0)
kd: org.apache.spark.mllib.stat.KernelDensity = org.apache.spark.mllib.stat.KernelDensity@26216fa3
```
setBandwidth sets the width of the Gaussian kernel. It is a smoothing parameter and can be thought of as the standard deviation of the Gaussian kernel.
With the kernel density estimator kd constructed, we can estimate the density at given points:
```scala
scala> val densities = kd.estimate(Array(-1.0, 2.0, 5.0, 5.8))
densities: Array[Double] = Array(0.011372003554433524, 0.059925911357198915, 0.12365409462424519, 0.12816280708978114)
```
This means that at the points -1.0, 2.0, 5.0 and 5.8, the estimated probability density values are 0.011372003554433524, 0.059925911357198915, 0.12365409462424519 and 0.12816280708978114 respectively.
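To make the estimate concrete, here is a minimal sketch (not from the original text) of what a Gaussian-kernel estimator computes at a single point: the average, over all sample points xi, of a normal density centered at xi with standard deviation equal to the bandwidth.
```scala
// Manual Gaussian kernel density estimate at x = 5.0 with bandwidth h = 3.0.
val h = 3.0
val x = 5.0
val n = test.count()
val manual = test.map { xi =>
  math.exp(-math.pow(x - xi, 2) / (2 * h * h)) / (h * math.sqrt(2 * math.Pi))
}.sum / n
println(manual)   // should closely match kd.estimate(Array(5.0)).head
```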
---