The work for the entire model is done in one line.

    DecisionTreeModel model = DecisionTree
        .trainClassifier(trainingData, numberOfClasses,
            categoricalFeaturesInfo, impurity, maximumTreeDepth,
            maximumBins);

Once complete, the model is evaluated with the test data, and the results are output to the console. The final code is shown here:

    package mlbook.ch13.spark;

    import java.util.HashMap;
    import java.util.Map;
    import scala.Tuple2;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.mllib.regression.LabeledPoint;
    import org.apache.spark.mllib.tree.DecisionTree;
    import org.apache.spark.mllib.tree.model.DecisionTreeModel;
    import org.apache.spark.mllib.util.MLUtils;

    public class BasicMLLibDecisionTree {
        public static void main(String[] args) {
            int numberOfClasses = 2;
            Map<Integer, Integer> categoricalFeaturesInfo = new HashMap<>();
            String impurity = "gini";
            int maximumTreeDepth = 7;
            int maximumBins = 48;

            SparkConf sparkConf = new SparkConf()
                .setAppName("BasicMLLibDecisionTree");
            JavaSparkContext jsc = new JavaSparkContext(sparkConf);

            String datapath = "/path/to/data/ch13/mllib/dtree.txt";
            JavaRDD<LabeledPoint> data = MLUtils
                .loadLibSVMFile(jsc.sc(), datapath).toJavaRDD();
            JavaRDD<LabeledPoint>[] splits = data
                .randomSplit(new double[]{0.7, 0.3});
            JavaRDD<LabeledPoint> trainingData = splits[0];
            JavaRDD<LabeledPoint> testData = splits[1];

            DecisionTreeModel model = DecisionTree
                .trainClassifier(trainingData, numberOfClasses,
                    categoricalFeaturesInfo, impurity, maximumTreeDepth,
                    maximumBins);

            JavaPairRDD<Double, Double> predictionAndLabel =
                testData.mapToPair(p -> new Tuple2<>(
                    model.predict(p.features()), p.label()));
            double predictionTestErrorValue =
                predictionAndLabel.filter(pl -> !pl._1().equals(pl._2()))
                    .count() / (double) testData.count();

            System.out.println("Prediction test error value: "
                + predictionTestErrorValue);
            System.out.println("Output classification tree:\n"
                + model.toDebugString());
        }
    }
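The loadLibSVMFile call expects the data file to be in LibSVM format: each line holds the numeric class label followed by space-separated index:value pairs for the nonzero features, with one-based, ascending indices. The two lines below are invented purely for illustration (they are not from the actual dtree.txt), shaped to echo the split on feature 406 that appears in the output later:

    0 406:9.0 512:2.5
    1 406:14.0 512:1.0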
Run Maven to compile and repackage the JAR file with the new application.

    mvn package

Using spark-submit, you can then test the application against the training data supplied.

    $ /usr/local/spark-2.4.4-bin-hadoop2.7/bin/spark-submit \
        --class "mlbook.ch13.spark.BasicMLLibDecisionTree" --master local[4] \
        target/Chapter13-1.0.0-SNAPSHOT.jar

You will see Spark start a local cluster and process the training data. Eventually you will see the output of the decision tree along with the test error value.

    19/10/24 11:28:54 INFO DAGScheduler: ResultStage 8
    (count at BasicMLLibDecisionTree.java:42) finished in 0.015 s
    19/10/24 11:28:54 INFO DAGScheduler: Job 6 finished:
    count at BasicMLLibDecisionTree.java:42, took 0.017851 s
    Prediction test error value: 0.047619047619047616
    Output classification tree:
    DecisionTreeModel classifier of depth 1 with 3 nodes
      If (feature 406 <= 12.0)
       Predict: 0.0
      Else (feature 406 > 12.0)
       Predict: 1.0
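The trained tree doesn't have to be thrown away when the job ends. MLlib models can be persisted and reloaded; the following is a minimal sketch, assuming the jsc context and model from the listing above and a writable path of /tmp/dtree-model (an assumption; use any location that doesn't already exist):

    // Persist the trained tree; the target path must not already exist.
    model.save(jsc.sc(), "/tmp/dtree-model");

    // Reload it later without retraining.
    DecisionTreeModel savedModel =
        DecisionTreeModel.load(jsc.sc(), "/tmp/dtree-model");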
Clustering

Spark includes a K-means clustering implementation. Again, the emphasis is on large-scale data: the processing steps during training can be partitioned and distributed over many machines. For this example, I will keep it simple and on the local cluster.

The training data for this example is basic doubles; these will be read and clustered into three cluster classifications. The data has been kept well separated to illustrate the clusters clearly.

    0.0 0.0 0.0
    0.1 0.1 0.1
    0.2 0.2 0.2
    5.0 5.1 5.1
    5.2 5.0 5.2
    5.0 5.3 5.4
    5.0 5.1 5.3
    9.0 9.0 9.0
    9.1 9.1 9.1
    9.2 9.2 9.2

The code is made up of the following steps. First, the test data is loaded and split by the space character. Each line is then iterated over and converted into a Vector; the resulting RDD contains a dense vector of the values from each line.

Once the number of desired clusters and the number of iterations to train the model are set, it's a case of creating the model. This is done in one line, passing in the RDD of vectors, the number of clusters required, and the iteration count.

    KMeansModel kMeansClusters = KMeans.train(parsedData.rdd(),
        numberOfClassClusters, numberOfIterations);

The final steps of the application are to report the findings of the model training. The cost value is calculated and output to the console; then the full clusters are displayed.

    package mlbook.ch13.spark;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.mllib.clustering.KMeans;
    import org.apache.spark.mllib.clustering.KMeansModel;
    import org.apache.spark.mllib.linalg.Vector;
    import org.apache.spark.mllib.linalg.Vectors;

    public class BasicMLLibKMeans {
        public static void main(String[] args) {
            SparkConf sparkConf = new SparkConf()
                .setAppName("BasicMLLibKMeans");
            JavaSparkContext sparkContext = new JavaSparkContext(sparkConf);

            JavaRDD<String> data = sparkContext
                .textFile("/path/to/data/ch13/mllib/kmeans.txt");
            JavaRDD<Vector> parsedData = data.map(s -> {
                String[] sarray = s.split(" ");
                double[] values = new double[sarray.length];
                for (int i = 0; i < sarray.length; i++) {
                    values[i] = Double.parseDouble(sarray[i]);
                }
                return Vectors.dense(values);
            });
            parsedData.cache();

            int numberOfClassClusters = 3;
            int numberOfIterations = 50;
            KMeansModel kMeansClusters = KMeans.train(parsedData.rdd(),
                numberOfClassClusters, numberOfIterations);

            double cost = kMeansClusters.computeCost(parsedData.rdd());
            System.out.println("Computed cost: " + cost);

            System.out.println("Showing cluster centres: ");
            for (Vector center : kMeansClusters.clusterCenters()) {
                System.out.println(" " + center);
            }

            sparkContext.stop();
        }
    }
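Note that KMeans.train chooses its starting centers randomly, so repeated runs may list the centers in a different order or, on less well-separated data, settle on slightly different clusters. If you need repeatable runs, the builder-style API accepts a fixed seed. A minimal sketch, reusing the parsedData, numberOfClassClusters, and numberOfIterations values from the listing:

    // Equivalent training call with a fixed seed for reproducible runs.
    KMeansModel seededModel = new KMeans()
        .setK(numberOfClassClusters)
        .setMaxIterations(numberOfIterations)
        .setSeed(1L)
        .run(parsedData.rdd());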
Again, run Maven to compile and repackage the JAR file with the new K-means application code.

    mvn package

Using spark-submit, you can then test the application against the training data supplied.

    $ /usr/local/spark-2.4.4-bin-hadoop2.7/bin/spark-submit \
        --class "mlbook.ch13.spark.BasicMLLibKMeans" --master local[4] \
        target/Chapter13-1.0.0-SNAPSHOT.jar

Spark will go through the usual startup process and then will output the computed cost and the cluster centres.

    19/10/24 11:48:56 INFO DAGScheduler: ResultStage 11
    (sum at KMeansModel.scala:105) finished in 0.018 s
    19/10/24 11:48:56 INFO DAGScheduler: Job 9 finished:
    sum at KMeansModel.scala:105, took 0.021366 s
    19/10/24 11:48:56 INFO TorrentBroadcast: Destroying Broadcast(17)
    (from destroy at KMeansModel.scala:106)
    Computed cost: 0.24749999999988637
    Showing cluster centres:
    [5.05,5.125,5.25]
    [0.1,0.1,0.1]
    [9.1,9.1,9.1]
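The trained model can also assign new observations to clusters. A minimal sketch, assuming the kMeansClusters model from the listing and a made-up point that should fall into the middle cluster:

    // predict returns the index of the nearest cluster center.
    int clusterIndex = kMeansClusters.predict(Vectors.dense(5.1, 5.1, 5.2));
    System.out.println("New point assigned to cluster: " + clusterIndex);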
Association Rules with FP-Growth

As we've previously seen, it's possible to suggest items for a customer based on historical shopping basket contents. Spark has an implementation of the FP-Growth algorithm within Spark MLlib to do basket (or any other data) analysis at volume.

First, I need some shopping basket transactions. This is just a simple text file of items; each line represents the contents of one basket checked out.

    milk tea fruit flour pencils
    tea cake biscuits beans peas hot_chocolate eggs coffee
    coffee biscuits pasta newspaper milk
    biscuits tea cake soap eggs coffee paper books
    tea
    biscuits tea cake milk paper eggs pencils

The supporting Spark code performs similar functions to the decision tree and K-means examples. The context is set up, and then the data is read and parsed into an RDD.

For FP-Growth, we need to set the minimum support and the number of partitions to create while the associations are computed. The minimum support here is 0.2, meaning an itemset must appear in at least 20 percent of the baskets to count as frequent.

As you may remember from the previous association rule example in Weka, we want to show only the associations with a minimum confidence level. For this example, it's set to 70 percent (set as a double value of 0.7 in the code).

The final step is to generate the association rules and output them to the console. We also want to see the antecedent and consequent items along with the confidence score for each rule.

    package mlbook.ch13.spark;

    import java.util.Arrays;
    import java.util.List;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.mllib.fpm.AssociationRules;
    import org.apache.spark.mllib.fpm.FPGrowth;
    import org.apache.spark.mllib.fpm.FPGrowthModel;
    import org.apache.spark.SparkConf;

    public class BasicMLLibFPGrowth {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf()
                .setAppName("BasicMLLibFPGrowth");
            JavaSparkContext sc = new JavaSparkContext(conf);

            JavaRDD<String> data = sc
                .textFile("/path/to/data/ch13/mllib/fpgrowth_items.txt");

            JavaRDD<List<String>> basketItems = data.map(line -> Arrays
                .asList(line.split(" ")));

            FPGrowth fpg = new FPGrowth()
                .setMinSupport(0.2)
                .setNumPartitions(10);
            FPGrowthModel<String> model = fpg.run(basketItems);

            for (FPGrowth.FreqItemset<String> itemset : model.freqItemsets()
                    .toJavaRDD().collect()) {
                System.out.println("(" + itemset.javaItems() + "), "
                    + itemset.freq());
            }

            double minConfidence = 0.7;
            for (AssociationRules.Rule<String> rule
                    : model.generateAssociationRules(minConfidence)
                        .toJavaRDD().collect()) {
                System.out.println(rule.javaAntecedent() + " => "
                    + rule.javaConsequent() + ", " + rule.confidence());
            }

            sc.stop();
        }
    }
For one last time, run Maven to compile and repackage the JAR file with the association rules application code.

    mvn package

Using spark-submit, you can then test the application against the shopping basket data.

    $ /usr/local/spark-2.4.4-bin-hadoop2.7/bin/spark-submit \
        --class "mlbook.ch13.spark.BasicMLLibFPGrowth" --master local[4] \
        target/Chapter13-1.0.0-SNAPSHOT.jar

Spark will start, read in the data, and compute the association rules model. Once complete, the results of the model will be output to the console.

    19/10/24 11:01:03 INFO TaskSchedulerImpl: Removed TaskSet 8.0,
    whose tasks have all completed, from pool
    19/10/24 11:01:03 INFO DAGScheduler: ResultStage 8
    (collect at BasicMLLibFPGrowth.java:32) finished in 0.144 s
    19/10/24 11:01:03 INFO DAGScheduler: Job 3 finished:
    collect at BasicMLLibFPGrowth.java:32, took 0.485325 s
    [cake, eggs, coffee] => [biscuits], 1.0
    [cake, eggs, coffee] => [tea], 1.0
    [paper, eggs, tea] => [cake], 1.0
    [paper, eggs, tea] => [biscuits], 1.0
    [eggs, coffee, tea] => [biscuits], 1.0
    [eggs, coffee, tea] => [cake], 1.0
    [coffee] => [biscuits], 1.0
    [paper, cake, biscuits] => [tea], 1.0
    [paper, cake, biscuits] => [eggs], 1.0
    [eggs, coffee] => [biscuits], 1.0
    [eggs, coffee] => [tea], 1.0
    [eggs, coffee] => [cake], 1.0
    [eggs, biscuits] => [tea], 1.0
    [eggs, biscuits] => [cake], 1.0
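Every rule in this run reports a confidence of 1.0: confidence is the fraction of baskets containing the antecedent that also contain the consequent, and in this small data set each antecedent always appears together with its consequent. If you want to narrow the output to rules about a single item, the rule RDD can be filtered before collecting. A minimal sketch, assuming the model and minConfidence values from the listing:

    // Keep only rules whose antecedent includes "eggs".
    for (AssociationRules.Rule<String> rule
            : model.generateAssociationRules(minConfidence)
                .toJavaRDD()
                .filter(r -> r.javaAntecedent().contains("eggs"))
                .collect()) {
        System.out.println(rule.javaAntecedent() + " => "
            + rule.javaConsequent() + ", " + rule.confidence());
    }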
Summary

The Spark framework has changed a lot since it was covered in the first edition of this book. The APIs are a lot easier to code with, especially for Java. If you use Python, then it's worth looking at the PySpark REPL for doing Spark work. If you are a Clojure user, then there are a number of Spark wrappers available to you as well.

Remember, this framework is for large-scale data. If you have medium or small amounts of data, then look first at the alternatives that were shown earlier in the book. If those struggle during training, then it's worth looking at Spark as the next stage of your development pipeline.
CHAPTER 14

Machine Learning with R

When you're in a room of data scientists, statisticians, and math types, you'll hear one letter crop up again and again: the letter R. R is a programming language, and it's basically command-line driven. In addition to being used interactively in the command-line shell, R code can be written as scripts and run.

Why am I telling you all this? Well, on top of the programming skills that get mentioned, you might also be asked, "Do you do R?" After this chapter, you'll hopefully have a starting point to reply, "Yes!"

Installing R

The R language comes ready to use for a number of operating systems. The download page at http://www.r-project.org has a number of mirror sites, so pick a mirror that's closest to you. From the mirror, choose the download for your operating system.

macOS

The current version of R (3.6 at the time of writing) will run on 64-bit Intel-based Macs. Download the file and open it to install it. It installs the R binaries into the /Applications folder.
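To confirm the installation worked, you can check the version from a terminal (this assumes the installer has placed the R binary on your PATH, which the standard macOS package does); R prints its version details and exits.

    $ R --version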