Step 3: Calculate the conditional probabilities P(Ci | x1, x2, ..., xn) for each of the classes in our example.

Step 4: Find max_i P(Ci | x1, x2, ..., xn). In our example, the maximum probability is obtained for the class banana; therefore, the fruit which is long, sweet and yellow is classified as a banana by the Naive Bayes' algorithm. In a nutshell, a new element is assigned to the class that has the maximum conditional probability described above.

Example: Construct a Naïve Bayes' classification model for the Iris dataset in Python.

# Naive Bayes on the Iris Dataset
from csv import reader
from random import seed
from random import randrange
from math import sqrt
from math import exp
from math import pi

# Load a CSV file
def load_csv(filename):
    dataset = list()
    with open(filename, 'r') as file:
        csv_reader = reader(file)
        for row in csv_reader:
            if not row:
                continue
            dataset.append(row)
    return dataset

# Convert string column to float
def str_column_to_float(dataset, column):
    for row in dataset:
        row[column] = float(row[column].strip())

# Convert string column to integer
def str_column_to_int(dataset, column):
    class_values = [row[column] for row in dataset]
    unique = set(class_values)
    lookup = dict()
    for i, value in enumerate(unique):
        lookup[value] = i
    for row in dataset:
        row[column] = lookup[row[column]]
    return lookup

# Split a dataset into k folds
def cross_validation_split(dataset, n_folds):
    dataset_split = list()
    dataset_copy = list(dataset)
    fold_size = int(len(dataset) / n_folds)
    for _ in range(n_folds):
        fold = list()
        while len(fold) < fold_size:
            index = randrange(len(dataset_copy))
            fold.append(dataset_copy.pop(index))
        dataset_split.append(fold)
    return dataset_split

# Calculate accuracy percentage
def accuracy_metric(actual, predicted):
    correct = 0
    for i in range(len(actual)):
        if actual[i] == predicted[i]:
            correct += 1
    return correct / float(len(actual)) * 100.0

# Evaluate an algorithm using a cross validation split
def evaluate_algorithm(dataset, algorithm, n_folds, *args):
    folds = cross_validation_split(dataset, n_folds)
    scores = list()
    for fold in folds:
        train_set = list(folds)
        train_set.remove(fold)
        train_set = sum(train_set, [])
        test_set = list()
        for row in fold:
            row_copy = list(row)
            test_set.append(row_copy)
            row_copy[-1] = None
        predicted = algorithm(train_set, test_set, *args)
        actual = [row[-1] for row in fold]
        accuracy = accuracy_metric(actual, predicted)
        scores.append(accuracy)
    return scores

# Split the dataset by class values, returns a dictionary
def separate_by_class(dataset):
    separated = dict()
    for i in range(len(dataset)):
        vector = dataset[i]
        class_value = vector[-1]
        if class_value not in separated:
            separated[class_value] = list()
        separated[class_value].append(vector)
    return separated

# Calculate the mean of a list of numbers
def mean(numbers):
    return sum(numbers) / float(len(numbers))

# Calculate the standard deviation of a list of numbers
def stdev(numbers):
    avg = mean(numbers)
    variance = sum([(x - avg) ** 2 for x in numbers]) / float(len(numbers) - 1)
    return sqrt(variance)

# Calculate the mean, stdev and count for each column in a dataset
def summarize_dataset(dataset):
    summaries = [(mean(column), stdev(column), len(column)) for column in zip(*dataset)]
    del(summaries[-1])
    return summaries

# Split dataset by class then calculate statistics for each class
def summarize_by_class(dataset):
    separated = separate_by_class(dataset)
    summaries = dict()
    for class_value, rows in separated.items():
        summaries[class_value] = summarize_dataset(rows)
    return summaries

# Calculate the Gaussian probability distribution function for x
def calculate_probability(x, mean, stdev):
    exponent = exp(-((x - mean) ** 2 / (2 * stdev ** 2)))
    return (1 / (sqrt(2 * pi) * stdev)) * exponent

# Calculate the probabilities of predicting each class for a given row
def calculate_class_probabilities(summaries, row):
    total_rows = sum([summaries[label][0][2] for label in summaries])
    probabilities = dict()
    for class_value, class_summaries in summaries.items():
        probabilities[class_value] = summaries[class_value][0][2] / float(total_rows)
        for i in range(len(class_summaries)):
            mean, stdev, _ = class_summaries[i]
            probabilities[class_value] *= calculate_probability(row[i], mean, stdev)
    return probabilities

# Predict the class for a given row
def predict(summaries, row):
    probabilities = calculate_class_probabilities(summaries, row)
    best_label, best_prob = None, -1
    for class_value, probability in probabilities.items():
        if best_label is None or probability > best_prob:
            best_prob = probability
            best_label = class_value
    return best_label

# Naive Bayes Algorithm
def naive_bayes(train, test):
    summarize = summarize_by_class(train)
    predictions = list()
    for row in test:
        output = predict(summarize, row)
        predictions.append(output)
    return predictions

# Test Naive Bayes on the Iris Dataset
seed(1)
filename = 'iris.csv'
dataset = load_csv(filename)
for i in range(len(dataset[0]) - 1):
    str_column_to_float(dataset, i)
# convert class column to integers
str_column_to_int(dataset, len(dataset[0]) - 1)
# evaluate algorithm
n_folds = 5
scores = evaluate_algorithm(dataset, naive_bayes, n_folds)
print('Scores: %s' % scores)
print('Mean Accuracy: %.3f%%' % (sum(scores) / float(len(scores))))

Output:
Scores: [93.33333333333333, 96.66666666666667, 100.0, 93.33333333333333, 93.33333333333333]
Mean Accuracy: 95.333%
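For comparison, the same experiment can be reproduced with an off-the-shelf implementation. The short sketch below is not part of the original listing; it assumes the scikit-learn library is available and uses its GaussianNB classifier with the built-in Iris data and 5-fold cross-validation.

# A minimal sketch using scikit-learn (assumed to be installed) for comparison.
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)            # 150 samples, 4 features, 3 classes
model = GaussianNB()                          # Gaussian Naive Bayes classifier
scores = cross_val_score(model, X, y, cv=5)   # 5-fold cross-validation accuracy
print('Scores: %s' % scores)
print('Mean Accuracy: %.3f%%' % (100 * scores.mean()))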
6.4 APPLICATION OF NAÏVE BAYES ALGORITHM

• Text classification: It is used as a probabilistic learning method for text classification. The Naive Bayes' classifier is one of the most successful known algorithms when it comes to the classification of text documents, i.e., deciding whether a text document belongs to one or more categories (classes).
• Spam filtration: This is an example of text classification and has become a popular mechanism to distinguish spam email from legitimate email. Several modern email services implement Bayesian spam filtering. Many server-side email filters, such as DSPAM, SpamBayes, SpamAssassin, Bogofilter, and ASSP, use this technique.
• Sentiment analysis: It can be used to analyze the tone of tweets, comments, and reviews, i.e., whether they are negative, positive or neutral.
• Recommendation systems: The Naive Bayes' algorithm in combination with collaborative filtering is used to build hybrid recommendation systems, which help in predicting whether a user would like a given resource or not.

6.5 SUMMARY

• Naive Bayes' is a probabilistic machine learning algorithm that can be used in a wide variety of classification tasks.
• It is used in a wide range of applications, including spam filtering, document classification and sentiment prediction.
• It is easy and fast to predict the class of a test data set, and it also performs well in multi-class prediction. When the assumption of independence holds, a Naive Bayes classifier performs better compared to other models such as logistic regression, and it needs less training data.
• If a categorical variable has a category in the test data set which was not observed in the training data set, the model will assign it a zero probability and will be unable to make a prediction. This is often known as the zero-frequency problem.

6.6 KEYWORDS
• Conditional probability: the likelihood of an event or outcome occurring, given that another event has occurred
• Prior probability: the probability assessed before making reference to certain relevant observations
• Posterior probability: the probability distribution of an unknown quantity, conditional on the observed evidence
• Bernoulli Naïve Bayes: used for discrete data; it works on the Bernoulli distribution
• Gaussian Naïve Bayes: used when the features have continuous values

6.7 LEARNING ACTIVITY

1. Suppose we want to classify potential bank customers as good creditors or bad creditors for loan applications. We have a training dataset describing past customers using the following attributes: Marital status {married, single, divorced}, Gender {male, female}, Age {[18..30[, [30..50[, [50..65[, [65+]}, Income {[10K..25K[, [25K..50K[, [50K..65K[, [65K..100K[, [100K+]}. Use the Naïve Bayes' classifier to classify a customer as a good creditor or a bad creditor.
___________________________________________________________________________
____________________________________________________________________

2. We know that over the entire population 0.8% of people have cancer. There exists a (binary) laboratory test that represents an imperfect indicator of this disease. The test returns a correct positive result in 98% of the cases in which the disease is present, and a correct negative result in 97% of the cases where the disease is not present.
(a) Suppose we observe a patient for whom the laboratory test returns a positive result. Calculate the a posteriori probability that this patient truly suffers from cancer.
(b) Knowing that the lab test is an imperfect one, a second test (which is assumed to be independent of the former one) is conducted. Calculate the a posteriori probabilities for cancer and ¬cancer given that the second test has also returned a positive result.
___________________________________________________________________________
____________________________________________________________________

6.8 UNIT END QUESTIONS

A. Descriptive Questions

Short Questions
1. What is meant by conditional probability?
2. Compare prior probability with posterior probability.
3. What is meant by sentiment analysis?
4. What is the need for the Bernoulli Naïve Bayes' algorithm?
5. Define Naïve Bayes' theorem.

Long Questions
1. How does the conventional Naïve Bayes' algorithm differ from Gaussian Naive Bayes?
2. The Naïve Bayes' classifier is the best suited model for a recommendation system; justify.
3. Discuss the strengths and weaknesses of the Naïve Bayes algorithm.
4. Illustrate the working of the Naïve Bayes algorithm.
5. Compare the types of Naïve Bayes classifiers with their advantages.

B. Multiple Choice Questions
1. Where can the Bayes rule be used?
a. Solving queries
b. Increasing complexity
c. Decreasing complexity
d. Answering probabilistic queries

2. What is the relationship between a node and its predecessors while creating a Bayesian network?
a. Conditionally independent
b. Functionally dependent
c. Both conditionally dependent and dependent
d. Dependent

3. Which of the following is provided by a Bayesian network?
a. Complete description of the problem
b. Partial description of the domain
c. Complete description of the domain
d. All of these
4. A causal chain (for example, smoking causes cancer) gives rise to:
a. Conditional independence
b. Conditional dependence
c. Both
d. None of these

5. The Bayesian network can be used to answer any query by using:
a. Full distribution
b. Joint distribution
c. Partial distribution
d. All of these

6. The Naïve Bayes algorithm is a ________ learning algorithm.
a. Supervised
b. Reinforcement
c. Unsupervised
d. None of these

7. Applications of the Naïve Bayes algorithm are:
a. Spam filtration
b. Sentiment analysis
c. Classifying articles
d. All of these

8. A disadvantage of the Naïve Bayes' classifier is:
a. Naive Bayes' assumes that all features are independent or unrelated, so it cannot learn the relationship between features.
b. It performs well in multi-class predictions as compared to the other algorithms.
c. Naïve Bayes is one of the fast and easy ML algorithms to predict the class of
datasets.
d. It is the most popular choice for text classification problems.

9. The benefits of Naïve Bayes are:
a. Naïve Bayes is one of the fast and easy ML algorithms to predict the class of datasets.
b. It is the most popular choice for text classification problems.
c. It can be used for binary as well as multi-class classifications.
d. All of these

10. What is the number of parameters needed to represent a Naive Bayes' classifier with n Boolean variables and a Boolean label?
a. 2n + 1
b. n + 1
c. 2n
d. n

Answers
1-d, 2-a, 3-c, 4-a, 5-b, 6-a, 7-d, 8-a, 9-d, 10-a

6.9 REFERENCES

Text Books
• Peter Harrington, "Machine Learning in Action", DreamTech Press
• Ethem Alpaydin, "Introduction to Machine Learning", MIT Press
• Steven Bird, Ewan Klein and Edward Loper, "Natural Language Processing with Python", O'Reilly Media
• Stephen Marsland, "Machine Learning: An Algorithmic Perspective", CRC Press

Reference Books
• William W. Hsieh, "Machine Learning Methods in the Environmental Sciences", Cambridge University Press
• Grant S. Ingersoll, Thomas S. Morton, Andrew L. Farris, "Taming Text", Manning Publications Co.
• Margaret H. Dunham, "Data Mining: Introductory and Advanced Topics", Pearson Education
UNIT - 7: SUPPORT VECTOR MACHINES

Structure
7.0 Learning Objectives
7.1 Introduction
7.2 Maximum Margin Linear Separators
7.3 Quadratic Programming solution to finding maximum margin separators
7.4 Summary
7.5 Keywords
7.6 Learning Activity
7.7 Unit End Questions
7.8 References

7.0 LEARNING OBJECTIVES

After studying this unit, you will be able to:
• Describe the basics of the Support Vector Machine
• Illustrate the implementation of SVM
• Apply SVM for solving classification and regression problems

7.1 INTRODUCTION

A Support Vector Machine (SVM) is a supervised machine learning algorithm that can be employed for both classification and regression purposes. SVMs are more commonly used in classification problems, and as such, this is what we will focus on in this unit. SVMs are based on the idea of finding a hyperplane that best divides a dataset into two classes, as shown in the figure below.
Figure 7.1: Hyperplane in SVM

Support vectors are the data points nearest to the hyperplane, the points of a data set that, if removed, would alter the position of the dividing hyperplane. Because of this, they can be considered the critical elements of a data set.

As a simple example, for a classification task with only two features (as in the figure above), you can think of a hyperplane as a line that linearly separates and classifies a set of data. Intuitively, the further from the hyperplane our data points lie, the more confident we are that they have been correctly classified. We therefore want our data points to be as far away from the hyperplane as possible, while still being on the correct side of it. So when new testing data is added, whichever side of the hyperplane it lands on will decide the class that we assign to it.

How do we find the right hyperplane? The distance between the hyperplane and the nearest data point from either set is known as the margin. The goal is to choose a hyperplane with the greatest possible margin between the hyperplane and any point within the training set, giving a greater chance of new data being classified correctly.
Figure 7.2: Margins and Hyperplane in SVM

Data is rarely ever as clean as our simple example above. A dataset will often look more like the jumbled balls below, which represent a linearly non-separable dataset.

Figure 7.3: Linearly Non-separable Dataset

In order to classify a dataset like the one above, it is necessary to move away from a 2D view of the data to a 3D view. Explaining this is easiest with another simplified example. Imagine that our two sets of colored balls above are sitting on a sheet and this sheet is lifted suddenly, launching the balls into the air. While the balls are up in the air, you use the sheet to separate them. This 'lifting' of the balls represents the mapping of data into a higher dimension. This is known as kernelling.
Figure 7.4: 3D view of Hyperplane in SVM

Because we are now in three dimensions, our hyperplane can no longer be a line. It must now be a plane, as shown in the example above. The idea is that the data will continue to be mapped into higher and higher dimensions until a hyperplane can be formed to segregate it.

7.2 MAXIMUM MARGIN LINEAR SEPARATORS

"The support vector machine (SVM) is a supervised learning method that generates input-output mapping functions from a set of labeled training data." A Support Vector Machine (SVM) performs classification by finding the hyperplane that maximizes the margin between the two classes. The vectors (cases) that define the hyperplane are the support vectors.

Figure 7.5: Hyperplane with maximum width

Algorithm:
• Define an optimal hyperplane: maximize the margin.
• Extend the above definition to non-linearly separable problems: have a penalty term for misclassifications.
• Map the data to a high-dimensional space where it is easier to classify with linear decision surfaces: reformulate the problem so that the data is mapped implicitly to this space.

To define an optimal hyperplane we need to maximize the width of the margin (w). The important feature of SVM is that if the data is linearly separable, there is a unique global minimum value. An ideal SVM analysis should produce a hyperplane that completely separates the vectors (cases) into two non-overlapping classes. However, perfect separation may not be possible, or it may result in a model with so many cases that the model does not classify correctly. In this situation, SVM finds the hyperplane that maximizes the margin and minimizes the misclassifications.

For the maximum margin hyperplane, only examples on the margin matter (only these affect the distances). These are called support vectors. The objective of the support vector machine algorithm is to find a hyperplane in an N-dimensional space (N being the number of features) that distinctly classifies the data points.

Figure 7.6: Possible Hyperplanes

To separate the two classes of data points, there are many possible hyperplanes that could be chosen. Our objective is to find a plane that has the maximum margin, i.e., the maximum distance between data points of both classes. Maximizing the margin distance provides some reinforcement so that future data points can be classified with more confidence.

Hyperplanes: Hyperplanes are decision boundaries that help classify the data points. Data points falling on either side of the hyperplane can be attributed to different classes. Also, the dimension of the hyperplane depends upon the number of features. If the number of input
features is 2, then the hyperplane is just a line. If the number of input features is 3, then the hyperplane becomes a two-dimensional plane. It becomes difficult to imagine when the number of features exceeds 3.

Figure 7.7: Hyperplanes in 2D and 3D feature space

Support Vectors: Support vectors are data points that are closer to the hyperplane and influence the position and orientation of the hyperplane. Using these support vectors, we maximize the margin of the classifier. Deleting the support vectors will change the position of the hyperplane. These are the points that help us build our SVM. It is computationally useful if only a small fraction of the data points are support vectors, because we use the support vectors to decide which side of the separator a test case is on. In the figure below, the support vectors are indicated by the circles around them.
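To make the notion of support vectors concrete, the sketch below fits a linear SVM on a tiny, invented two-class dataset and prints the fitted support vectors; scikit-learn and the data values are assumptions made purely for illustration.

# A minimal sketch (illustrative data; scikit-learn assumed to be installed).
import numpy as np
from sklearn.svm import SVC

# Two small, linearly separable clusters of points.
X = np.array([[1.0, 2.0], [2.0, 3.0], [2.0, 2.5],
              [6.0, 7.0], [7.0, 8.0], [6.5, 7.5]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel='linear', C=1.0)    # linear SVM
clf.fit(X, y)

print(clf.support_vectors_)          # the points that define the hyperplane
print(clf.coef_, clf.intercept_)     # w and b of the separating hyperplane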
Figure 7.8: Hyperplane with small and large margin

To find the maximum margin separator, we have to solve the following optimization problem:

w · xc + b > +1 for positive cases
w · xc + b < −1 for negative cases
and ||w||² is as small as possible.

Kernels for Learning Non-Linear Functions: Linear models are nice and interpretable but have limitations: they cannot learn difficult nonlinear patterns.

Figure 7.9: Limitations of SVM

Linear models rely on linear notions of similarity and distance,

Sim(xn, xm) = xnᵀxm
Dist(xn, xm) = (xn − xm)ᵀ(xn − xm)

which would not work well if the patterns we want to learn are nonlinear.
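As a concrete illustration of learning a non-linear boundary with a kernel, the sketch below compares a linear and an RBF kernel on synthetic, non-linearly-separable data (two concentric circles). The data generator, the parameter values and the use of scikit-learn are all assumptions made for the example.

# A minimal sketch (scikit-learn assumed available; data is synthetic).
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric circles: impossible to separate with a straight line in 2D.
X, y = make_circles(n_samples=300, factor=0.4, noise=0.08, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

linear_clf = SVC(kernel='linear').fit(X_train, y_train)
rbf_clf = SVC(kernel='rbf', gamma=2.0, C=1.0).fit(X_train, y_train)

print('Linear kernel accuracy:', linear_clf.score(X_test, y_test))  # near 0.5
print('RBF kernel accuracy:   ', rbf_clf.score(X_test, y_test))     # near 1.0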
7.3 QUADRATIC PROGRAMMING SOLUTION TO FIND MAXIMUM MARGIN SEPARATORS

We can formulate our search for the maximum margin separating hyperplane as a constrained optimization problem. The objective is to maximize the margin γ under the constraints that all data points must lie on the correct side of the hyperplane:

maximize γ(w, b) over w, b    subject to    yi(wᵀxi + b) ≥ 0 for all i

If we plug in the definition of the margin, γ(w, b) = min_i |wᵀxi + b| / ||w||, we obtain:

maximize (1/||w||) · min_i |wᵀxi + b| over w, b    subject to    yi(wᵀxi + b) ≥ 0 for all i

Because the hyperplane is scale invariant, we can fix the scale of w, b any way we want. Let's be clever about it, and choose it such that

min_i |wᵀxi + b| = 1

We can add this re-scaling as an equality constraint. Then our objective becomes:

maximize 1/||w||, which is equivalent to minimize ||w||² = wᵀw

(where we made use of the fact that f(z) = z² is a monotonically increasing function for z ≥ 0 and ||w|| ≥ 0; i.e., the w that minimizes ||w|| also minimizes wᵀw). The new optimization problem becomes:

minimize wᵀw over w, b    subject to    yi(wᵀxi + b) ≥ 0 for all i    and    min_i |wᵀxi + b| = 1

These constraints are still hard to deal with; luckily, however, we can show that (for the optimal solution) they are equivalent to the much simpler formulation

minimize wᵀw over w, b    subject to    yi(wᵀxi + b) ≥ 1 for all i

(Make sure you know how to prove that the two sets of constraints are equivalent.) This new formulation is a quadratic optimization problem: the objective is quadratic, and the constraints are all linear. We can solve it efficiently with any QCQP (Quadratically Constrained Quadratic Program) solver. It has a unique solution whenever a separating hyperplane exists.
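As a small numerical illustration, the hard-margin problem derived above can be handed directly to a general-purpose convex solver. In the sketch below the six data points are invented and the cvxpy package is an assumption; it is only meant to show the shape of the problem, not a production SVM.

# A minimal sketch of solving the primal hard-margin SVM with cvxpy (assumed installed).
import cvxpy as cp
import numpy as np

# Toy linearly separable data (hypothetical): class +1 and class -1 points.
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],
              [-1.0, -1.0], [-2.0, -1.5], [-1.5, -2.5]])
y = np.array([1, 1, 1, -1, -1, -1])

w = cp.Variable(X.shape[1])
b = cp.Variable()

# minimize w'w  subject to  y_i (w'x_i + b) >= 1 for all i
objective = cp.Minimize(cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print('w =', w.value, 'b =', b.value)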
7.4 SUMMARY

• A support vector machine works by finding an optimal separation line, called a 'hyperplane', to accurately separate two or more different classes in a classification problem. The goal is to find the optimal separating hyperplane by training on the linearly separable data with the SVM algorithm.
• SVM works really well with a clear margin of separation. It is effective in high-dimensional spaces.
• SVM does not directly provide probability estimates; these are calculated using an expensive five-fold cross-validation. This is included in the related SVC method of the Python scikit-learn library.

7.5 KEYWORDS

• SVM: uses classification algorithms for two-group classification problems
• Kernels: take data as input and transform it into the required form
• Dimensionality reduction: transformation of data from a high-dimensional space into a low-dimensional space
• Margin separator: separates two classes of observations
• Hyperplane: maximizes the margin between the two classes

7.6 LEARNING ACTIVITY

1. Suppose we want to classify potential bank customers as good creditors or bad creditors for loan applications. We have a training dataset describing past customers using the following attributes: Marital status {married, single, divorced}, Gender {male, female}, Age {[18..30[, [30..50[, [50..65[, [65+]}, Income {[10K..25K[, [25K..50K[, [50K..65K[, [65K..100K[,
[100K+]}. Use the Naïve Bayes' classifier to classify a customer as a good creditor or a bad creditor.
___________________________________________________________________________
____________________________________________________________________

2. We know that over the entire population 0.8% of people have cancer. There exists a (binary) laboratory test that represents an imperfect indicator of this disease. The test returns a correct positive result in 98% of the cases in which the disease is present, and a correct negative result in 97% of the cases where the disease is not present.
(a) Suppose we observe a patient for whom the laboratory test returns a positive result. Calculate the a posteriori probability that this patient truly suffers from cancer.
(b) Knowing that the lab test is an imperfect one, a second test (which is assumed to be independent of the former one) is conducted. Calculate the a posteriori probabilities for cancer and ¬cancer given that the second test has also returned a positive result.
___________________________________________________________________________
____________________________________________________________________

7.7 UNIT END QUESTIONS

A. Descriptive Questions

Short Questions
1. How does the distance between positive and negative data points affect the width of the street and the support vector weights?
2. What would a diagram look like for a support vector machine that has overfit the data?
3. How can you tell by comparing diagrams which kernel does the best job at building a classifier for a particular set of data?
4. When describing the placement of decision boundaries using a support vector machine, what function are we maximizing in our Lagrangian formulation of the problem? What do our constraints represent?
5. The best decision boundary yields the widest street. To maximize the width of the street, we end up maximizing an equation written in terms of what variables?

Long Questions
1. Once we have found support vectors and their weights, how do we find a
classification vector?
2. Explain how SVM can be used for a classification problem.
3. Explain in detail how SVM can be used for feature selection.
4. Discuss in detail the maximum margin separator.
5. Discuss in detail (i) support vectors, (ii) hyperplanes and (iii) kernels.

B. Multiple Choice Questions
1. What do you mean by generalization error in terms of the SVM?
a. How far the hyperplane is from the support vectors
b. How accurately the SVM can predict outcomes for unseen data
c. The threshold amount of error in an SVM
d. None of these

2. The effectiveness of an SVM depends upon:
a. Selection of kernel
b. Kernel parameters
c. Soft margin parameter C
d. All of the above

3. Suppose you are dealing with a 4-class classification problem and you want to train an SVM model on the data using the one-vs-all method. How many times do we need to train our SVM model in such a case?
a. 1
b. 2
c. 3
d. 4

4. Which of the following is/are true about kernels in SVM?
(1) A kernel function maps low-dimensional data to a high-dimensional space. (2) A kernel is a similarity function.
a. 1
b. 2
c. 1 and 2
d. None of these

5. Suppose you have trained an SVM classifier with a Gaussian kernel, and it learned the following decision boundary on the training set. You suspect that the SVM is underfitting your dataset. Should you try increasing or decreasing C? Increasing or decreasing σ²?
a. It would be reasonable to try decreasing C. It would also be reasonable to try increasing σ².
b. It would be reasonable to try decreasing C. It would also be reasonable to try decreasing σ².
c. It would be reasonable to try increasing C. It would also be reasonable to try decreasing σ².
d. It would be reasonable to try increasing C. It would also be reasonable to try increasing σ².

6. Suppose you have a dataset with n = 10 features and m = 5000 examples. After training your logistic regression classifier with gradient descent, you find that it has underfit the training set and does not achieve the desired performance on the training or cross-validation sets. Which of the following might be promising steps to take? Check all that apply.
a. Increase the regularization parameter λ.
b. Reduce the number of examples in the training set.
c. Create / add new polynomial features.
d. Use an SVM with a linear kernel, without introducing new features.

7. Which of the following statements are true?
a. Suppose you are using SVMs to do multi-class classification and would like to use the one-vs-all approach. If you have K different classes, you will train K−1 different SVMs.
b. If the data are linearly separable, an SVM using a linear kernel will return the same parameters θ regardless of the chosen value of C (i.e., the resulting value of θ does not depend on C).
c. If you are training multi-class SVMs with the one-vs-all method, it is not possible to use a kernel.
d. It is important to perform feature normalization before using the Gaussian kernel.
Answers
1-b, 2-c, 3-d, 4-c, 5-c, 6-c, 7-d

7.8 REFERENCES

Text Books
• Peter Harrington, "Machine Learning in Action", DreamTech Press
• Ethem Alpaydin, "Introduction to Machine Learning", MIT Press
• Steven Bird, Ewan Klein and Edward Loper, "Natural Language Processing with Python", O'Reilly Media
• Stephen Marsland, "Machine Learning: An Algorithmic Perspective", CRC Press

Reference Books
• William W. Hsieh, "Machine Learning Methods in the Environmental Sciences", Cambridge University Press
• Grant S. Ingersoll, Thomas S. Morton, Andrew L. Farris, "Taming Text", Manning Publications Co.
• Margaret H. Dunham, "Data Mining: Introductory and Advanced Topics", Pearson Education
UNIT – 8: BAYESIAN BELIEF NETWORKS AND CLUSTERING

Structure
8.0 Learning Objectives
8.1 Introduction
8.2 Bayesian Belief Networks
8.3 Clustering
8.4 Kernels for Learning Non-Linear Functions
8.5 Artificial Neural Networks
8.6 Summary
8.7 Keywords
8.8 Learning Activity
8.9 Unit End Questions
8.10 References

8.0 LEARNING OBJECTIVES

After studying this unit, you will be able to:
• Describe the basics of Bayesian Belief Networks
• Discuss the kernels for learning non-linear functions
• Illustrate the application of Artificial Neural Networks

8.1 INTRODUCTION

Probabilistic models can be challenging to design and use. Most often, the problem is the lack of information about the domain required to fully specify the conditional dependence between random variables. Even if available, calculating the full conditional probability for an event can be impractical. A common approach to addressing this challenge is to add some simplifying assumptions, such as assuming that all random variables in the model are conditionally independent. This is a drastic assumption, although it proves useful in practice, providing the basis for the Naive Bayes classification algorithm. An alternative approach is to develop a probabilistic model of a problem with some
conditional independence assumptions. This provides an intermediate approach between a fully conditional model and a fully conditionally independent model. Bayesian belief networks are one example of a probabilistic model where some variables are conditionally independent.

Bayesian belief networks are graphical models that communicate causal information and provide a framework for describing and evaluating probabilities when we have a network of interrelated variables. We can then use the graphical models to evaluate information about external interventions and hence predict the effect of such interventions. By exploiting the dependencies and interdependencies in the graph we can develop efficient algorithms that calculate probabilities of variables in graphs which are often very large and complex. Such a facility makes this technique suitable for data mining, where we are often trying to sift through large amounts of data looking for previously undiscovered relationships. A key feature of Bayesian belief networks is that they discover and describe causality rather than merely identifying associations, as is the case in standard statistics and database technology. Such causal relationships are represented by means of DAGs (Directed Acyclic Graphs), which are also used to describe conditional independence assumptions. Such conditional independence occurs when two variables are independent, conditional on another variable.

Not every Bayesian belief network topology is suitable for classification. Only such graphs, where the class is the parent of (all) attributes and itself has no parent, can be used for this purpose. For example, in medical diagnosis problems, the diagnosis (illness) is the source of all symptoms; thus the class (diagnosis) is the parent of all attributes (symptoms).

8.2 BAYESIAN BELIEF NETWORKS

A Bayesian belief network is a type of probabilistic graphical model. A probabilistic graphical model (PGM), or simply "graphical model" for short, is a way of representing a probabilistic model with a graph structure. The nodes in the graph represent random variables and the edges that connect the nodes represent the relationships between the random variables.
• Nodes: Random variables in a graphical model.
• Edges: Relationships between random variables in a graphical model.
There are many different types of graphical models, although the two most commonly described are the Hidden Markov Model and the Bayesian Network. The Hidden Markov
Model (HMM) is a graphical model whose edges are undirected, meaning they can be traversed in both directions and the graph may contain cycles. Bayesian Networks are more restrictive: the edges of the graph are directed, meaning they can only be navigated in one direction. This means that cycles are not possible, and the structure can be more generally referred to as a directed acyclic graph (DAG).

A Bayesian Belief Network, or simply "Bayesian Network", provides a simple way of applying Bayes' Theorem to complex problems. The networks are not exactly Bayesian by definition, although given that both the probability distributions for the random variables (nodes) and the relationships between the random variables (edges) are specified subjectively, the model can be thought of as capturing the "belief" about a complex domain. Bayesian probability is the study of subjective probabilities or belief in an outcome, compared to the frequentist approach where probabilities are based purely on the past occurrence of the event. A Bayesian Network captures the joint probabilities of the events represented by the model.

Central to the Bayesian network is the notion of conditional independence. Independence refers to a random variable that is unaffected by all other variables. A dependent variable is a random variable whose probability is conditional on one or more other random variables. Conditional independence describes the relationship among multiple random variables, where a given variable may be conditionally independent of one or more other random variables. This does not mean that the variable is independent per se; instead, it is a clear definition that the variable is independent of specific other known random variables.

A probabilistic graphical model, such as a Bayesian Network, provides a way of defining a probabilistic model for a complex problem by stating all of the conditional independence assumptions for the known variables, whilst allowing the presence of unknown (latent) variables. As such, both the presence and the absence of edges in the graphical model are important in the interpretation of the model. Bayesian networks provide useful benefits as a probabilistic model. For example:
• Visualization: The model provides a direct way to visualize the structure of the model and motivate the design of new models.
• Relationships: Provides insights into the presence and absence of the relationships between random variables.
• Computations: Provides a way to structure complex probability calculations.
Designing a Bayesian Belief Network: Designing a Bayesian network requires defining at least three things:
• Random variables. What are the random variables in the problem?
• Conditional relationships. What are the conditional relationships between the variables?
• Probability distributions. What are the probability distributions for each variable?

It may be possible for an expert in the problem domain to specify some or all of these aspects in the design of the model. In many cases, the architecture or topology of the graphical model can be specified by an expert, but the probability distributions must be estimated from data from the domain. Both the probability distributions and the graph structure itself can be estimated from data, although it can be a challenging process. As such, it is common to use learning algorithms for this purpose; for example, assuming a Gaussian distribution for continuous random variables and using gradient ascent to estimate the distribution parameters.

Once a Bayesian network has been prepared for a domain, it can be used for reasoning, e.g. making decisions. Reasoning is achieved via inference with the model for a given situation. For example, the outcome for some events is known and plugged into the random variables. The model can then be used to estimate the probability of causes for the events or of possible further outcomes. Practical examples of using Bayesian networks include medicine (symptoms and diseases), bioinformatics (traits and genes), and speech recognition (utterances and time).

Example: Consider a problem with three random variables: A, B, and C. A is dependent upon B, and C is dependent upon B. We can state the conditional dependencies as follows:
A is conditionally dependent upon B, e.g. P(A|B)
C is conditionally dependent upon B, e.g. P(C|B)
We know that C and A have no effect on each other. We can also state the conditional independencies as follows:
A is conditionally independent from C: P(A|B, C)
C is conditionally independent from A: P(C|B, A)

Notice that the conditional dependence is stated in the presence of the conditional independence. That is, A is conditionally independent of C, or A is conditionally dependent upon B in the presence of C. We might also state the conditional independence of A given C as the conditional dependence of A given B, as A is unaffected by C and can be calculated from A given B alone:
P(A|C, B) = P(A|B)
We can see that B is unaffected by A and C and has no parents; we can simply state the conditional independence of B from A and C as P(B, P(A|B), P(C|B)) or P(B).

We can also write the joint probability of A and C given B, or conditioned on B, as the product of two conditional probabilities; for example:
P(A, C | B) = P(A|B) * P(C|B)
The model summarizes the joint probability of P(A, B, C), calculated as:
P(A, B, C) = P(A|B) * P(C|B) * P(B)

Figure 8.1: Example of Simple Bayesian Network

Notice that the random variables are each assigned a node, and the conditional probabilities are stated as directed connections between the nodes. Also notice that it is not possible to navigate the graph in a cycle, e.g. no loops are possible when navigating from node to node via the edges. Also notice that the graph is useful even at this point, where we don't know the probability distributions for the variables.
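To make the factorization concrete, the short sketch below evaluates P(A, B, C) = P(A|B) * P(C|B) * P(B) for binary variables; the individual probability values are made up purely for illustration.

# Hypothetical CPTs for the network B -> A, B -> C (all values are invented).
p_b = {True: 0.6, False: 0.4}           # P(B)
p_a_given_b = {True: 0.9, False: 0.2}   # P(A=True | B)
p_c_given_b = {True: 0.7, False: 0.1}   # P(C=True | B)

def joint(a, b, c):
    """Joint probability P(A=a, B=b, C=c) using the graph's factorization."""
    pa = p_a_given_b[b] if a else 1 - p_a_given_b[b]
    pc = p_c_given_b[b] if c else 1 - p_c_given_b[b]
    return pa * pc * p_b[b]

print(joint(True, True, True))   # P(A, B, C) with all three events true
# Sanity check: summing over every assignment of (A, B, C) gives 1.
total = sum(joint(a, b, c) for a in (True, False)
                           for b in (True, False)
                           for c in (True, False))
print(total)  # 1.0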
8.3 CLUSTERING

The learning of Bayesian network classifiers from data is commonly performed in a supervised manner, meaning that a training set containing examples which have been previously classified by an expert is used to generate the DAG and its CPT. In practice, this can be viewed as having a class label assigned to each example (row) of the data set. Unfortunately, in many real-world applications of machine learning to industrial problems, it is difficult to obtain a large data set with classified examples. This is generally due to the fact that a human expert is needed to manually classify each example, of which in many cases there can be thousands, making it an exhausting and time-consuming task. With this in mind, it is desirable to have an alternative way of training a classifier with data that has no class label assigned to each example.

Figure 8.2: Bayesian Clustering

This approach is known as unsupervised training, also referred to as clustering. The main goal of clustering is to find the natural groupings of the data. It is important to mention at this point that unsupervised training will usually obtain lower performance when compared to models trained in a supervised way, primarily due to the lack of the class information during the learning process. However, it is still a very valuable tool for data exploration and preliminary classification, which can be improved later on once a training set with class labels for each example is built. This task can be carried out by a human expert with the assistance of the clustering results, making it a less difficult task.

8.4 KERNELS FOR LEARNING NON-LINEAR FUNCTIONS

With what we have presented so far, data sets that are linearly separable (perhaps with a few exceptions or some noise) are well-handled. But what are we going to do if the data set just doesn't allow classification by a linear classifier? Let us look at a one-dimensional case. The top data set in the figure below is straightforwardly classified by a linear classifier, but the middle data set is
not. We instead need to be able to pick out an interval. One way to solve this problem is to map the data onto a higher dimensional space and then to use a linear classifier in the higher dimensional space. For example, the bottom part of the figure shows that a linear separator can easily classify the data if we use a quadratic function to map the data into two dimensions (a polar coordinates projection would be another possibility). The general idea is to map the original feature space to some higher-dimensional feature space where the training set is separable. Of course, we would want to do so in ways that preserve relevant dimensions of relatedness between data points, so that the resultant classifier should still generalize well.

SVMs, and also a number of other linear classifiers, provide an easy and efficient way of doing this mapping to a higher dimensional space, which is referred to as "the kernel trick". It's not really a trick: it just exploits the math that we have seen. The SVM linear classifier relies on a dot product between data point vectors. Let K(xi, xj) = xiᵀxj. Then the classifier we have seen so far is:

f(x) = sign(Σi αi yi K(xi, x) + b)
Now suppose we decide to map every data point into a higher dimensional space via some transformation Φ: x ↦ φ(x). Then the dot product xiᵀxj becomes φ(xi)ᵀφ(xj). If it turned out that this dot product (which is just a real number) could be computed simply and efficiently in terms of the original data points, then we wouldn't have to actually map x ↦ φ(x). Rather, we could simply compute the quantity K(xi, xj) = φ(xi)ᵀφ(xj), and then use the function's value in the equation above. A kernel function K is such a function that corresponds to a dot product in some expanded feature space.

The 'kernel trick' maps observations from an observation space X into an inner product space V (equipped with its natural norm), without having to compute the mapping explicitly, because the observations will gain a meaningful linear structure in the inner product space. The kernel trick allows calculations to be performed more efficiently if the classification algorithm uses the observations x only in the form of a dot product.
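The claim that a kernel silently computes a dot product in an expanded feature space can be checked numerically. The sketch below uses the simple quadratic kernel K(u, v) = (u · v)² in two dimensions, whose explicit feature map is known; the particular numbers are arbitrary.

# A small numerical check: K(u, v) = (u . v)^2 in 2D corresponds to the explicit
# feature map phi(u) = (u1^2, sqrt(2)*u1*u2, u2^2) in 3D.
from math import sqrt

def phi(u):
    u1, u2 = u
    return (u1 * u1, sqrt(2) * u1 * u2, u2 * u2)

def kernel(u, v):
    return (u[0] * v[0] + u[1] * v[1]) ** 2

u, v = (1.0, 2.0), (3.0, 0.5)
explicit = sum(a * b for a, b in zip(phi(u), phi(v)))  # dot product in 3D
implicit = kernel(u, v)                                # never leaves 2D
print(explicit, implicit)                              # both print 16.0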
what we know is that biological neural networks enable the brain to process huge amounts of information in complicated ways. The brain’s biological neural network consists of approximately 100 billion neurons, the basic processing unit of the brain. Neurons perform their functions through their massive connections to each other, called synapses. The human brain has approximately 100 trillion synapses, about 1,000 per neuron. Every function of the brain involves electrical currents and chemical reactions running across a vast number of these neurons. The core component of ANNs is artificial neurons. Each neuron receives inputs from several other neurons, multiplies them by assigned weights, adds them and passes the sum to one or more neurons. Some artificial neurons might apply an activation function to the output before passing it to the next variable. Figure 8.3: Structure of Artificial Neuron At its core, this might sound like a very trivial math operation. But when you place hundreds, thousands and millions of neurons in multiple layers and stack them up on top of each other, you’ll obtain an artificial neural network that can perform very complicated tasks, such as classifying images or recognizing speech. Artificial neural networks are composed of an input layer, which receives data from outside sources (data files, images, hardware sensors, microphone…), one or more hidden layers that process the data, and an output layer that provides one or more data points based on the function of the network. For instance, a neural network that detects persons, cars and animals will have an output layer with three nodes. A network that classifies bank transactions between safe and fraudulent will have a single output. Training Artificial Neural Network: 135 CU IDOL SELF LEARNING MATERIAL (SLM)
Training an Artificial Neural Network:

Artificial neural networks start by assigning random values to the weights of the connections between neurons. The key for the ANN to perform its task correctly and accurately is to adjust these weights to the right numbers. But finding the right weights is not very easy, especially when you're dealing with multiple layers and thousands of neurons.

This calibration is done by "training" the network with annotated examples. For instance, if you want to train the image classifier mentioned above, you provide it with multiple photos, each labeled with its corresponding class (person, car or animal). As you provide it with more and more training examples, the neural network gradually adjusts its weights to map each input to the correct outputs. Basically, what happens during training is that the network adjusts itself to glean specific patterns from the data. Again, in the case of an image classifier network, when you train the AI model with quality examples, each layer detects a specific class of features. For instance, the first layer might detect horizontal and vertical edges, and the next layers might detect corners and round shapes. Further down the network, deeper layers will start to pick out more advanced features such as faces and objects.

Figure 8.4: Training an Artificial Neural Network

When we run a new image through a well-trained neural network, the adjusted weights of the neurons will be able to extract the right features and determine with accuracy to which output class the image belongs. One of the challenges of training neural networks is to find the right
amount and quality of training examples. Also, training large AI models requires vast amounts of computing resources. To overcome this challenge, many engineers use "transfer learning", a training technique where you take a pre-trained model and fine-tune it with new, domain-specific examples. Transfer learning is especially efficient when there is already an AI model that is close to your use case.

8.6 SUMMARY

• Bayesian networks are probabilistic graphical models that explicitly capture the known conditional dependences with directed edges in a graph.
• Bayesian networks provide a useful tool to visualize the probabilistic model for a domain, review all of the relationships between the random variables, and reason about causal probabilities for scenarios given the available evidence.
• An artificial neural network (ANN) is a piece of a computing system designed to simulate the way the human brain analyzes and processes information. It is the foundation of artificial intelligence (AI) and solves problems that would prove impossible or difficult by human or statistical standards. ANNs have self-learning capabilities that enable them to produce better results as more data becomes available.

8.7 KEYWORDS

• Bayesian belief network: a graphical representation of the probabilistic relationships among the random variables in a particular set
• Clustering: the task of dividing the population or data points into a number of groups
• Kernels: take data as input and transform it into the required form
• Activation function: decides whether a neuron should be activated or not

8.8 LEARNING ACTIVITY

1. The ALARM network is a Bayesian network designed to provide an alarm message system for patients hospitalized in intensive care units (ICU). Since ALARM is commonly used as a benchmark in the literature, a synthetic data set of 5000 observations generated from this network is available from bnlearn as alarm. Create a bn object for the "true" structure of the network using the model string provided in its manual page.
___________________________________________________________________________
____________________________________________________________________

2. Perform parameter learning with the bn.fit function from bnlearn and the validated network structure. How do the maximum likelihood estimates differ from the Bayesian ones, and how do the latter vary as the imaginary sample size increases?
___________________________________________________________________________
____________________________________________________________________

3. Build a multilayer feed-forward neural network for pattern recognition. The network is trained as a character classifier for a collection of characters given as 7x5 black-and-white pixel maps. Ideally, the trained network can recognize characters it has learnt even when some of them are distorted.
___________________________________________________________________________
____________________________________________________________________

8.9 UNIT END QUESTIONS

A. Descriptive Questions

Short Questions
1. Define the artificial neuron model and the activation function.
2. What is the role of the hidden layers in a neural network?
3. How can non-linear data be used in a Bayesian belief network?
4. Differentiate classification and clustering.
5. How does the Naïve Bayes' classifier differ from a Bayesian belief network?

Long Questions
1. Explain briefly the multilayered feed-forward network.
2. Discuss briefly how Bayesian network modeling can be used for information retrieval.
3. How can an Artificial Neural Network be used to solve classification problems?
4. Differentiate between perceptron representation and perceptron training, and explain the single-layer perceptron in detail.
5. How can an Artificial Neural Network be applied in forecasting? Give a few strategies that can implement feed-forward networks.

B. Multiple Choice Questions
1. The Bayesian belief network is also known as:
a. Belief network
b. Decision network
c. Bayesian model
d. All of these

2. The generalized form of a Bayesian network that represents and solves decision problems under uncertain knowledge is known as:
a. Directed acyclic graph
b. Table of conditional probabilities
c. Influence diagram
d. None of these

3. How many components does a Bayesian network have?
a. 2
b. 3
c. 4
d. 5

4. The Bayesian network graph does not contain any cycles. Hence, it is known as a:
a. DCG
b. DAG
c. CAG
d. SAG

5. If we have variables x1, x2, x3, ..., xn, then the probabilities of the different combinations of x1, x2, x3, ..., xn are known as the:
a. Table of conditional probabilities
b. Causal component
c. Actual numbers
d. Joint probability distribution

6. What is a perceptron?
a. A single-layer feed-forward neural network with pre-processing
b. An auto-associative neural network
c. A double-layer auto-associative neural network
d. A neural network that contains feedback

7. Which of the following is true?
(i) On average, neural networks have higher computational rates than conventional computers.
(ii) Neural networks learn by example.
(iii) Neural networks mimic the way the human brain works.
a. (ii) and (iii) are true
b. (i), (ii) and (iii) are true
c. All of the mentioned are true
d. None of these

8. Which of the following is true for neural networks?
(i) The training time depends on the size of the network.
(ii) Neural networks can be simulated on a conventional computer.
(iii) Artificial neurons are identical in operation to biological ones.
a. All of the mentioned
b. (ii) is true
c. (i) and (ii) are true
d. None of these

9. What are the advantages of neural networks over conventional computers?
(i) They have the ability to learn by example.
(ii) They are more fault tolerant.
(iii) They are more suited to real-time operation due to their high 'computational' rates.
a. (i) and (ii) are true
b. (i) and (iii) are true
c. Only (i)
d. All of these

10. What is backpropagation?
a. It is another name given to the curvy function in the perceptron
b. It is the transmission of error back through the network to adjust the inputs
c. It is the transmission of error back through the network to allow weights to be adjusted so that the network can learn
d. None of these

Answers
1-d, 2-c, 3-a, 4-b, 5-d, 6-a, 7-c, 8-c, 9-d, 10-c

8.10 REFERENCES

Text Books
• Peter Harrington, "Machine Learning in Action", DreamTech Press
• Ethem Alpaydin, "Introduction to Machine Learning", MIT Press
• Steven Bird, Ewan Klein and Edward Loper, "Natural Language Processing with Python", O'Reilly Media
• Stephen Marsland, "Machine Learning: An Algorithmic Perspective", CRC Press

Reference Books
• William W. Hsieh, "Machine Learning Methods in the Environmental Sciences", Cambridge University Press
• Grant S. Ingersoll, Thomas S. Morton, Andrew L. Farris, "Taming Text", Manning Publications Co.
• Margaret H. Dunham, "Data Mining: Introductory and Advanced Topics", Pearson Education
UNIT – 9: HIDDEN MARKOV MODEL

Structure
9.0 Learning Objectives
9.1 Introduction
9.2 Hidden Markov Model
9.3 Clustering
9.4 K-means Clustering
9.5 Hierarchical Clustering
9.6 Density Based Clustering
9.7 Summary
9.8 Keywords
9.9 Learning Activity
9.10 Unit End Questions
9.11 References

9.0 LEARNING OBJECTIVES

After studying this unit, you will be able to:
• Discuss the Hidden Markov Model
• Explain the basics of clustering concepts
• Demonstrate the K-means clustering algorithm
• Illustrate the concepts of hierarchical clustering and density-based clustering
• Apply the appropriate clustering algorithm for solving real-world problems

9.1 INTRODUCTION

The Hidden Markov Model is based on augmenting the Markov chain. A Markov chain is a model that tells us something about the probabilities of sequences of random variables, called states, each of which can take on values from some set. These sets can be words, or tags, or symbols representing anything, like the weather. A Markov chain makes a very strong assumption that if we want to predict the future in the sequence, all that matters is the current state. The states before the current state have no impact on the future except via the current state. It's as if to
predict tomorrow's weather you could examine today's weather, but you weren't allowed to look at yesterday's weather.

Figure 9.1: A Markov chain for weather (a) and one for words (b), showing states and transitions

More formally, consider a sequence of state variables q1, q2, ..., qi. A Markov model embodies the Markov assumption on the probabilities of this sequence: that when predicting the future, the past doesn't matter, only the present.

9.2 HIDDEN MARKOV MODEL

A Markov chain is useful when we need to compute a probability for a sequence of observable events. In many cases, however, the events we are interested in are hidden: we don't observe them directly. For example, we don't normally observe part-of-speech tags in a text. Rather, we see words, and must infer the tags from the word sequence. We call the tags hidden because they are not observed. A hidden Markov model (HMM) allows us to talk about both observed events (like words that we see in the input) and hidden events (like part-of-speech tags) that we think of as causal factors in our probabilistic model. An HMM is specified by the following components:

Q = q1, q2, ..., qN: a set of N states
A = a11, ..., aij, ..., aNN: a transition probability matrix, each aij representing the probability of moving from state i to state j, with the probabilities out of each state summing to 1
O = o1, o2, ..., oT: a sequence of T observations, each drawn from a vocabulary V = v1, v2, ..., vM
B = bi(ot): a sequence of observation likelihoods (emission probabilities), each expressing the probability of an observation ot being generated from a state i
π = π1, π2, ..., πN: an initial probability distribution over states, πi being the probability that the Markov chain will start in state i
A first-order hidden Markov model instantiates two simplifying assumptions. First, as with a first-order Markov chain, the probability of a particular state depends only on the previous state:
Markov Assumption: P(qi | q1, ..., qi−1) = P(qi | qi−1)
Second, the probability of an output observation oi depends only on the state qi that produced the observation, and not on any other states or any other observations:
Output Independence: P(oi | q1, ..., qi, ..., qT, o1, ..., oi, ..., oT) = P(oi | qi)
The term hidden refers to the first-order Markov process behind the observations. Observations refer to the data we know and can observe. The Markov process is shown by the interaction between "Rainy" and "Sunny" in the diagram below; each of these is a hidden state.
Figure 9.2: Example Hidden Markov Model
Observations are the known data and refer to "Walk", "Shop", and "Clean" in the above diagram. In machine learning terms, the observations are our training data, and the number of hidden states is a hyperparameter of our model. Evaluation of the model will be discussed later.
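As a concrete illustration of the Rainy/Sunny model in Figure 9.2, the short sketch below encodes an HMM of this shape in Python with NumPy and evaluates the likelihood of an observation sequence using the forward algorithm. The numeric transition and emission probabilities are illustrative assumptions, since the values shown in the figure are not reproduced in the text; the arrays pi, A and B correspond to the initial distribution, transition matrix and observation probability matrix discussed next.

import numpy as np

# Hidden states Q and observation vocabulary V from Figure 9.2
states = ["Rainy", "Sunny"]               # N = 2
vocab = ["Walk", "Shop", "Clean"]         # M = 3

# Illustrative parameters (assumed values; the figure's numbers are not given in the text)
pi = np.array([0.6, 0.4])                 # initial state distribution
A = np.array([[0.7, 0.3],                 # P(next state | Rainy)
              [0.4, 0.6]])                # P(next state | Sunny); rows sum to 1
B = np.array([[0.1, 0.4, 0.5],            # P(Walk, Shop, Clean | Rainy)
              [0.6, 0.3, 0.1]])           # P(Walk, Shop, Clean | Sunny); rows sum to 1

def forward_probability(obs_idx, pi, A, B):
    """Likelihood P(O | model) of an observation sequence via the forward algorithm."""
    alpha = pi * B[:, obs_idx[0]]         # initialisation
    for t in obs_idx[1:]:
        alpha = (alpha @ A) * B[:, t]     # induction: sum over previous hidden states
    return alpha.sum()                    # termination

obs = [vocab.index(o) for o in ["Walk", "Shop", "Clean"]]
print(forward_probability(obs, pi, A, B))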
In this example, T is the length of the observation sequence (no observations have been made yet), N = 2, M = 3, Q = {"Rainy", "Sunny"} and V = {"Walk", "Shop", "Clean"}.
The state transition probabilities are the arrows pointing to each hidden state. The observation probability matrix corresponds to the blue and red arrows pointing from each hidden state to each observation. Both matrices are row stochastic, meaning each row adds up to 1; each entry gives the probability of going from one state to another, or from a state to an observation. The initial state distribution gets the model going by starting it in a hidden state. The full model, with known state transition probabilities, observation probability matrix and initial state distribution, is written as λ = (A, B, π).
Hidden Markov models are especially known for their applications in reinforcement learning and in temporal pattern recognition such as speech, handwriting and gesture recognition, part-of-speech tagging, musical score following, partial discharges and bioinformatics.
9.3 CLUSTERING
Clustering is the task of dividing the population or data points into a number of groups such that data points in the same group are more similar to each other than to data points in other groups. In simple words, the aim is to segregate groups with similar traits and assign them into clusters.
Let's understand this with an example. Suppose you are the head of a rental store and wish to understand the preferences of your customers in order to scale up your business. Is it possible for you to look at the details of each customer and devise a unique business strategy for each one of them? Definitely not. But what you can do is cluster all of your customers into, say, 10 groups based on their purchasing habits and use a separate strategy for the customers in each of these 10 groups. This is what we call clustering.
Figure 9.3: Example of Clustering
So, the goal of clustering is to determine the intrinsic grouping in a set of unlabeled data. But how do we decide what constitutes a good clustering? It can be shown that there is no absolute "best" criterion which would be independent of the final aim of the clustering. Consequently, it is the user who must supply this criterion, in such a way that the result of the clustering suits their needs. For instance, we could be interested in finding representatives for homogeneous groups (data reduction), in finding "natural clusters" and describing their unknown properties ("natural" data types), in finding useful and suitable groupings ("useful" data classes), or in finding unusual data objects (outlier detection).
Clustering is broadly classified into two types:
• Hard Clustering: Each data point either belongs to a cluster completely or not at all. In the example above, each customer is put into exactly one of the 10 groups.
• Soft Clustering: Instead of putting each data point into exactly one cluster, a probability or likelihood of that data point belonging to each cluster is assigned. In the example above, each customer is assigned a probability of belonging to each of the 10 clusters of the retail store.
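To make the hard versus soft distinction concrete, the sketch below fits a Gaussian mixture model to synthetic customer-like data; a mixture model is just one common way of obtaining soft assignments and is not prescribed by the text. The two-dimensional data and the choice of three groups are illustrative assumptions.

import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic "purchasing habit" features for 300 customers (illustrative, not from the text)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.6, size=(100, 2)) for c in ([0, 0], [4, 0], [2, 3])])

gm = GaussianMixture(n_components=3, random_state=1).fit(X)

hard_labels = gm.predict(X)        # hard clustering: exactly one group per customer
soft_probs = gm.predict_proba(X)   # soft clustering: a probability for every group

print(hard_labels[:5])
print(soft_probs[:5].round(3))     # each row sums to 1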
9.4 K-MEANS CLUSTERING
The basic idea behind k-means clustering consists of defining clusters so that the total intra-cluster variation (known as the total within-cluster variation) is minimized. There are several k-means algorithms available. The standard algorithm is the Hartigan-Wong algorithm (Hartigan and Wong 1979), which defines the total within-cluster variation as the sum of squared Euclidean distances between items and the corresponding centroid:
W(Ck) = Σ (xi − μk)², summed over all points xi in cluster Ck
where:
• xi denotes a data point belonging to the cluster Ck
• μk is the mean value of the points assigned to the cluster Ck
Each observation xi is assigned to a given cluster such that the sum-of-squares (SS) distance of the observation to its assigned cluster center μk is a minimum.
The first step when using k-means clustering is to indicate the number of clusters (k) that will be generated in the final solution. The algorithm starts by randomly selecting k objects from the data set to serve as the initial centers for the clusters. The selected objects are also known as cluster means or centroids. Next, each of the remaining objects is assigned to its closest centroid, where closest is defined using the Euclidean distance between the object and the cluster mean. This step is called the "cluster assignment step". Note that, to use correlation distance, the data are input as z-scores. After the assignment step, the algorithm computes the new mean value of each cluster. The term "centroid update" is used to describe this step. Now that the centers have been recalculated, every observation is checked again to see if it might be closer to a different cluster. All the objects are reassigned using the updated cluster means. The cluster assignment and centroid update steps are repeated iteratively until the cluster assignments stop changing (i.e., until convergence is achieved). That is, the clusters formed in the current iteration are the same as those obtained in the previous iteration.
The k-means algorithm can be summarized as follows:
1. Specify the number of clusters (k) to be created (by the analyst).
2. Randomly select k objects from the dataset as the initial cluster centers (means).
3. Assign each observation to its closest centroid, based on the Euclidean distance between the object and the centroid.
4. For each of the k clusters, update the cluster centroid by calculating the new mean value of all the data points in the cluster. The centroid of the kth cluster is a vector of length p containing the means of all variables for the observations in the kth cluster, where p is the number of variables.
5. Iteratively minimize the total within-cluster sum of squares; that is, iterate steps 3 and 4 until the cluster assignments stop changing or the maximum number of iterations is reached.
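A minimal from-scratch sketch of these five steps, written in Python with NumPy, is shown below. The synthetic blob data and the helper function name are illustrative, not taken from the text; in practice, a library implementation such as scikit-learn's KMeans would usually be preferred.

import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Plain k-means (Lloyd-style) following steps 1-5 above.
    X is an (n_samples, p) array; returns cluster labels and centroids."""
    rng = np.random.default_rng(seed)
    # Step 2: randomly pick k objects from the dataset as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 3: assign each observation to its closest centroid (Euclidean distance)
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 4: recompute each centroid as the mean of the points assigned to it
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 5: stop once the assignments (equivalently, the centroids) no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Example: three well-separated 2-D blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.5, size=(50, 2)) for c in ([0, 0], [5, 5], [0, 5])])
labels, centroids = kmeans(X, k=3)
print(centroids)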
9.5 HIERARCHICAL CLUSTERING
Hierarchical clustering is an alternative approach to k-means clustering for identifying groups in a data set. In contrast to k-means, hierarchical clustering creates a hierarchy of clusters and therefore does not require us to pre-specify the number of clusters. Furthermore, hierarchical clustering has an added advantage over k-means clustering in that its results can be easily visualized using an attractive tree-based representation called a dendrogram.
Figure 9.4: Illustrative Dendrogram
Hierarchical clustering can be divided into two main types:
1. Agglomerative clustering: Commonly referred to as AGNES (AGglomerative NESting), it works in a bottom-up manner. That is, each observation is initially considered a single-element cluster (leaf). At each step of the algorithm, the two clusters that are the most similar are combined into a new, bigger cluster (node). This procedure is iterated until all points are members of one single big cluster (root)
(see Figure 9.5). The result is a tree which can be displayed using a dendrogram.
Figure 9.5: AGNES (bottom-up) versus DIANA (top-down) clustering
2. Divisive hierarchical clustering: Commonly referred to as DIANA (DIvisive ANAlysis), it works in a top-down manner and is essentially the reverse of AGNES. It begins with the root, in which all observations are included in a single cluster. At each step of the algorithm, the current cluster is split into the two clusters that are considered most heterogeneous. The process is iterated until all observations are in their own cluster.
Similar to k-means, we measure the (dis)similarity of observations using distance measures (e.g., Euclidean distance, Manhattan distance, etc.); Euclidean distance is the most common default. However, a fundamental question in hierarchical clustering is: how do we measure the dissimilarity between two clusters of observations? A number of different cluster agglomeration methods (i.e., linkage methods) have been developed to answer this question. The most common methods are:
1. Maximum or complete linkage clustering: Computes all pairwise dissimilarities between the elements in cluster 1 and the elements in cluster 2 and considers the largest of these dissimilarities as the distance between the two clusters. It tends to produce more compact clusters.
2. Minimum or single linkage clustering: Computes all pairwise dissimilarities between the elements in cluster 1 and the elements in cluster 2 and considers the smallest of these dissimilarities as the linkage criterion. It tends to produce long, "loose" clusters.
3. Mean or average linkage clustering: Computes all pairwise dissimilarities between the elements in cluster 1 and the elements in cluster 2 and considers the average of these dissimilarities as the distance between the two clusters. It can vary in the compactness of the clusters it creates.
4. Centroid linkage clustering: Computes the dissimilarity between the centroid for cluster 1 (a mean vector of length p, one element for each variable) and the centroid for cluster 2.
5. Ward's minimum variance method: Minimizes the total within-cluster variance. At each step, the pair of clusters with the smallest between-cluster distance is merged. It tends to produce more compact clusters.
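As a rough illustration of AGNES-style agglomerative clustering and these linkage criteria, the sketch below uses SciPy's hierarchical clustering routines on synthetic two-dimensional data. The data, the choice of Ward linkage for the final cut and the target of three clusters are illustrative assumptions, not taken from the text.

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

# Illustrative 2-D data: three loose groups (not from the text)
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(c, 0.5, size=(20, 2)) for c in ([0, 0], [4, 4], [0, 4])])

# Agglomerative (AGNES-style) clustering under different linkage criteria
Z_complete = linkage(X, method="complete")  # maximum / complete linkage
Z_single = linkage(X, method="single")      # minimum / single linkage
Z_ward = linkage(X, method="ward")          # Ward's minimum variance method

# Cut the Ward tree into a flat clustering with 3 clusters
labels = fcluster(Z_ward, t=3, criterion="maxclust")
print(np.bincount(labels)[1:])              # cluster sizes (labels start at 1)

# Display the hierarchy as a dendrogram
dendrogram(Z_ward)
plt.show()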
9.6 DENSITY BASED CLUSTERING
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is the most well-known density-based clustering algorithm, first introduced in 1996 by Ester et al. Due to its importance in both theory and applications, it is one of three algorithms awarded the Test of Time Award at the KDD conference in 2014. Unlike k-means, DBSCAN does not require the number of clusters as a parameter. Rather, it infers the number of clusters from the data, and it can discover clusters of arbitrary shape (for comparison, k-means usually discovers spherical clusters). The ɛ-neighborhood is fundamental to DBSCAN, which uses it to approximate local density, so the algorithm has two parameters:
• ɛ: the radius of the neighborhood around a data point p.
• minPts: the minimum number of data points we want in a neighborhood to define a cluster.
Using these two parameters, DBSCAN categorizes the data points into three categories:
• Core Points: a data point p is a core point if its ɛ-neighborhood Nbhd(p, ɛ) contains at least minPts points, i.e., |Nbhd(p, ɛ)| ≥ minPts.
• Border Points: a data point q is a border point if Nbhd(q, ɛ) contains fewer than minPts points, but q lies in the ɛ-neighborhood of some core point p.
• Outliers (noise points): a data point o is an outlier if it is neither a core point nor a border point.
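A short usage sketch with scikit-learn's DBSCAN implementation is given below; the synthetic data and the parameter values eps=0.5 and min_samples=5 are illustrative assumptions. Points labelled -1 by the estimator are the noise points (outliers) described above.

import numpy as np
from sklearn.cluster import DBSCAN

# Illustrative data: two dense blobs plus a few scattered noise points (not from the text)
rng = np.random.default_rng(0)
blobs = np.vstack([rng.normal(c, 0.2, size=(50, 2)) for c in ([0, 0], [3, 3])])
noise = rng.uniform(-2, 5, size=(10, 2))
X = np.vstack([blobs, noise])

# eps plays the role of ɛ (neighborhood radius) and min_samples the role of minPts
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)  # label -1 marks noise
print("clusters found:", n_clusters)
print("noise points:", int(np.sum(labels == -1)))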