
Machine Learning


Description: Machine Learning: Hands-On for Developers and Technical Professionals provides hands-on instruction and fully coded working examples for the most common machine learning techniques used by developers and technical professionals. The book contains a breakdown of each ML variant, explaining how it works and how it is used within certain industries, allowing readers to incorporate the presented techniques into their own work as they follow along. A core tenet of machine learning is a strong focus on data preparation, and a full exploration of the various types of learning algorithms illustrates how the proper tools can help any developer extract information and insights from existing data. The book includes a full complement of Instructor's Materials to facilitate use in the classroom, making this resource useful for students and as a professional reference.


The work for the entire model is done in one line.

DecisionTreeModel model = DecisionTree
    .trainClassifier(trainingData, numberOfClasses,
        categoricalFeaturesInfo, impurity, maximumTreeDepth,
        maximumBins);

Once complete, the model is evaluated with the test data, and the results are output to the console. The final code is shown here:

package mlbook.ch13.spark;

import java.util.HashMap;
import java.util.Map;

import scala.Tuple2;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.mllib.tree.DecisionTree;
import org.apache.spark.mllib.tree.model.DecisionTreeModel;
import org.apache.spark.mllib.util.MLUtils;

public class BasicMLLibDecisionTree {
    public static void main(String[] args) {
        int numberOfClasses = 2;
        Map<Integer, Integer> categoricalFeaturesInfo = new HashMap<>();
        String impurity = "gini";
        int maximumTreeDepth = 7;
        int maximumBins = 48;

        SparkConf sparkConf = new SparkConf()
            .setAppName("BasicMLLibDecisionTree");
        JavaSparkContext jsc = new JavaSparkContext(sparkConf);

        String datapath = "/path/to/data/ch13/mllib/dtree.txt";
        JavaRDD<LabeledPoint> data = MLUtils
            .loadLibSVMFile(jsc.sc(), datapath).toJavaRDD();

        JavaRDD<LabeledPoint>[] splits = data
            .randomSplit(new double[]{0.7, 0.3});
        JavaRDD<LabeledPoint> trainingData = splits[0];
        JavaRDD<LabeledPoint> testData = splits[1];

        DecisionTreeModel model = DecisionTree
            .trainClassifier(trainingData, numberOfClasses,
                categoricalFeaturesInfo, impurity, maximumTreeDepth,
                maximumBins);

        JavaPairRDD<Double, Double> predictionAndLabel = testData
            .mapToPair(p -> new Tuple2<>(model.predict(p.features()),
                p.label()));

        double predictionTestErrorValue = predictionAndLabel
            .filter(pl -> !pl._1().equals(pl._2())).count()
            / (double) testData.count();

        System.out.println("Prediction test error value: "
            + predictionTestErrorValue);
        System.out.println("Output classification tree:\n"
            + model.toDebugString());
    }
}

Run Maven to compile and repackage the JAR file with the new application.

mvn package

Using spark-submit, you can then test the application against the training data supplied.

$ /usr/local/spark-2.4.4-bin-hadoop2.7/bin/spark-submit \
    --class "mlbook.ch13.spark.BasicMLLibDecisionTree" \
    --master local[4] target/Chapter13-1.0.0-SNAPSHOT.jar

You will see Spark start a local cluster and process the training data. Eventually you will see the output of the decision tree along with the test error value.

19/10/24 11:28:54 INFO DAGScheduler: ResultStage 8 (count at BasicMLLibDecisionTree.java:42) finished in 0.015 s
19/10/24 11:28:54 INFO DAGScheduler: Job 6 finished: count at BasicMLLibDecisionTree.java:42, took 0.017851 s
Prediction test error value: 0.047619047619047616
Output classification tree:
DecisionTreeModel classifier of depth 1 with 3 nodes
  If (feature 406 <= 12.0)
   Predict: 0.0
  Else (feature 406 > 12.0)
   Predict: 1.0
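The trained model does not have to be rebuilt on every run; MLlib decision tree models can be saved to disk and loaded back later. The lines below are a minimal sketch of that idea rather than code from the book: the save path is made up, and the lines would go at the end of the main method above, reusing the model and jsc variables.

// Persist the trained model (the path here is only an example).
model.save(jsc.sc(), "/path/to/models/ch13/dtree-model");

// Reload it later, perhaps in a separate application, and use it
// with predict() exactly as before.
DecisionTreeModel reloadedModel = DecisionTreeModel
    .load(jsc.sc(), "/path/to/models/ch13/dtree-model");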

Clustering

Spark includes a K-means clustering implementation. Again, the emphasis is on large-scale data; the processing steps during training can be partitioned and clustered over many machines. For this example, I will keep it simple and on the local cluster.

The training data for this example is basic doubles; these will be read and clustered into three cluster classifications. The data has been kept isolated to illustrate the clusters clearly.

0.0 0.0 0.0
0.1 0.1 0.1
0.2 0.2 0.2
5.0 5.1 5.1
5.2 5.0 5.2
5.0 5.3 5.4
5.0 5.1 5.3
9.0 9.0 9.0
9.1 9.1 9.1
9.2 9.2 9.2

The code is made up of the following steps. First, the test data is loaded and split by the space character. Each line is then iterated over and stored as a Vector; each element of the resulting RDD is a dense vector of the values from that line. Once the number of desired clusters and the number of iterations to train the model are set, it's a case of creating the model. This is done in one line, passing in the RDD of vectors, the number of clusters required, and the iteration count.

KMeansModel kMeansClusters = KMeans.train(parsedData.rdd(),
    numberOfClassClusters, numberOfIterations);

The final steps of the application are to report the findings of the model training. The cost value is calculated and output to the console; then the full clusters are displayed.

package mlbook.ch13.spark;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.clustering.KMeans;
import org.apache.spark.mllib.clustering.KMeansModel;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;

public class BasicMLLibKMeans {
    public static void main(String[] args) {
        SparkConf sparkConf = new SparkConf()
            .setAppName("BasicMLLibKMeans");
        JavaSparkContext sparkContext = new JavaSparkContext(sparkConf);

        JavaRDD<String> data = sparkContext
            .textFile("/path/to/data/ch13/mllib/kmeans.txt");
        JavaRDD<Vector> parsedData = data.map(s -> {
            String[] sarray = s.split(" ");
            double[] values = new double[sarray.length];
            for (int i = 0; i < sarray.length; i++) {
                values[i] = Double.parseDouble(sarray[i]);
            }
            return Vectors.dense(values);
        });
        parsedData.cache();

        int numberOfClassClusters = 3;
        int numberOfIterations = 50;

        KMeansModel kMeansClusters = KMeans.train(parsedData.rdd(),
            numberOfClassClusters, numberOfIterations);

        double cost = kMeansClusters.computeCost(parsedData.rdd());
        System.out.println("Computed cost: " + cost);

        System.out.println("Showing cluster centres: ");
        for (Vector center : kMeansClusters.clusterCenters()) {
            System.out.println(" " + center);
        }

        sparkContext.stop();
    }
}

Again, run Maven to compile and repackage the JAR file with the new K-means application code.

mvn package

Using spark-submit, you can then test the application against the training data supplied.

$ /usr/local/spark-2.4.4-bin-hadoop2.7/bin/spark-submit \
    --class "mlbook.ch13.spark.BasicMLLibKMeans" \
    --master local[4] target/Chapter13-1.0.0-SNAPSHOT.jar

Spark will go through the usual startup process and then will output the cost function and the clusters.

19/10/24 11:48:56 INFO DAGScheduler: ResultStage 11 (sum at KMeansModel.scala:105) finished in 0.018 s
19/10/24 11:48:56 INFO DAGScheduler: Job 9 finished: sum at KMeansModel.scala:105, took 0.021366 s
19/10/24 11:48:56 INFO TorrentBroadcast: Destroying Broadcast(17) (from destroy at KMeansModel.scala:106)
Computed cost: 0.24749999999988637
Showing cluster centres:
 [5.05,5.125,5.25]
 [0.1,0.1,0.1]
 [9.1,9.1,9.1]
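Once trained, the model can also assign a cluster to a brand-new observation through its predict method. The lines below are a minimal sketch rather than code from the book: the sample point is made up, and the lines would sit just before sparkContext.stop() in the class above.

// A made-up three-value observation, shaped like the training rows.
Vector newPoint = Vectors.dense(5.1, 5.2, 5.0);
// predict returns the index of the nearest cluster centre.
int clusterIndex = kMeansClusters.predict(newPoint);
System.out.println("Cluster index for " + newPoint + ": " + clusterIndex);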

Association Rules with FP-Growth

As we've previously seen, it's possible to suggest items for a customer based on historical shopping basket contents. Spark has an implementation of the FP-Growth algorithm within Spark MLLib to do basket (or any other data) analysis at volume.

First, I need some shopping basket transactions. This is just a simple text file of items; each line represents the contents of one basket checked out.

milk tea fruit flour pencils
tea cake biscuits beans peas hot_chocolate eggs coffee
coffee biscuits pasta newspaper
milk biscuits tea cake soap eggs coffee paper books
tea biscuits
tea cake milk paper eggs pencils

The supporting Spark code performs similar functions to the decision tree and K-means examples. The context is set up, and then the data is read and parsed into an RDD. For FP-Growth, we need to set the minimum support and the number of partitions to create while the associations are computed. As you may remember from the previous association rule example in Weka, we want to show only the associations with a minimum confidence level. For this example, it's set to 70 percent (set as a double value of 0.7 in the code). The final step is to generate the association rules and output them to the console. We also want to see the antecedent items and the consequent item, along with the confidence score for that rule.

package mlbook.ch13.spark;

import java.util.Arrays;
import java.util.List;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.fpm.AssociationRules;
import org.apache.spark.mllib.fpm.FPGrowth;
import org.apache.spark.mllib.fpm.FPGrowthModel;
import org.apache.spark.SparkConf;

public class BasicMLLibFPGrowth {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("BasicMLLibFPGrowth");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> data = sc
            .textFile("/path/to/data/ch13/mllib/fpgrowth_items.txt");
        JavaRDD<List<String>> basketItems = data.map(line -> Arrays
            .asList(line.split(" ")));

        FPGrowth fpg = new FPGrowth()
            .setMinSupport(0.2)
            .setNumPartitions(10);
        FPGrowthModel<String> model = fpg.run(basketItems);

        for (FPGrowth.FreqItemset<String> itemset : model.freqItemsets()
                .toJavaRDD().collect()) {
            System.out.println("(" + itemset.javaItems() + "), "
                + itemset.freq());
        }

        double minConfidence = 0.7;
        for (AssociationRules.Rule<String> rule : model
                .generateAssociationRules(minConfidence)
                .toJavaRDD().collect()) {
            System.out.println(rule.javaAntecedent() + " => "
                + rule.javaConsequent() + ", " + rule.confidence());
        }

        sc.stop();
    }
}

For one last time, run Maven to compile and repackage the JAR file with the association rules application code.

mvn package

Using spark-submit, you can then test the application against the shopping basket data.

$ /usr/local/spark-2.4.4-bin-hadoop2.7/bin/spark-submit \
    --class "mlbook.ch13.spark.BasicMLLibFPGrowth" \
    --master local[4] target/Chapter13-1.0.0-SNAPSHOT.jar

Spark will start, read in the data, and compute the association rules model. Once complete, the results of the model will be output to the console.

19/10/24 11:01:03 INFO TaskSchedulerImpl: Removed TaskSet 8.0, whose tasks have all completed, from pool
19/10/24 11:01:03 INFO DAGScheduler: ResultStage 8 (collect at BasicMLLibFPGrowth.java:32) finished in 0.144 s
19/10/24 11:01:03 INFO DAGScheduler: Job 3 finished: collect at BasicMLLibFPGrowth.java:32, took 0.485325 s
[cake, eggs, coffee] => [biscuits], 1.0
[cake, eggs, coffee] => [tea], 1.0
[paper, eggs, tea] => [cake], 1.0
[paper, eggs, tea] => [biscuits], 1.0
[eggs, coffee, tea] => [biscuits], 1.0
[eggs, coffee, tea] => [cake], 1.0
[coffee] => [biscuits], 1.0
[paper, cake, biscuits] => [tea], 1.0
[paper, cake, biscuits] => [eggs], 1.0
[eggs, coffee] => [biscuits], 1.0
[eggs, coffee] => [tea], 1.0
[eggs, coffee] => [cake], 1.0
[eggs, biscuits] => [tea], 1.0
[eggs, biscuits] => [cake], 1.0
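To turn these rules into suggestions at checkout time, one simple approach is to collect the rules on the driver, keep only those whose antecedent is contained in the customer's current basket, and offer the consequents. The lines below are a minimal sketch of that idea rather than code from the book: the currentBasket contents are made up, and the lines reuse the model and minConfidence variables from the class above.

// A made-up basket to generate suggestions for.
List<String> currentBasket = Arrays.asList("eggs", "coffee");

// Keep the rules whose antecedent is fully contained in the basket
// and print their consequents as suggestions.
for (AssociationRules.Rule<String> rule : model
        .generateAssociationRules(minConfidence)
        .toJavaRDD().collect()) {
    if (currentBasket.containsAll(rule.javaAntecedent())) {
        System.out.println("Suggest " + rule.javaConsequent()
            + " (confidence " + rule.confidence() + ")");
    }
}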

Summary

The Spark framework has changed a lot since it was covered in the first edition of this book. The APIs are a lot easier to code with, especially for Java. If you use Python, then it's worth looking at the PySpark REPL for doing Spark work. If you are a Clojure user, then there are a number of Spark wrappers available to you as well.

Remember, this framework is for large-scale data. If you have medium or small amounts of data, then first look at the alternatives that were shown earlier in the book. If those seem to struggle while training, then it's worth looking at Spark as the next stage of your development pipeline.

CHAPTER 14

Machine Learning with R

When you're in a room of data scientists, statisticians, and math types, you'll hear one letter crop up again and again: the letter R. R is a programming language, and it's basically command-line driven. In addition to being used in the command-line shell, R can be written in code form and run. Why am I telling you all this? Well, on top of the programming skills that get mentioned, you might also be asked, "Do you do R?" After this chapter, you'll hopefully have a starting point to reply, "Yes!"

Installing R

The R language comes ready to use for a number of operating systems. The download page at http://www.r-project.org has a number of mirror sites, so pick a mirror that's closest to you. From the mirror, choose the download for your operating system.

macOS

The current version of R (3.6 at time of writing) will run on the 64-bit Intel-based Macs. Download the file and open it to install it. It installs the R binaries into the /Applications folder.



















































































