labelencoder_x= LabelEncoder()
x[:, 3]= labelencoder_x.fit_transform(x[:, 3])
onehotencoder= OneHotEncoder(categorical_features= [3])
x= onehotencoder.fit_transform(x).toarray()
• Now we will split the dataset into training and test set. The code for this is given below:
# Splitting the dataset into training and test set.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.2, random_state=0)
Figure 3.8: Test Set
Figure 3.9: Training Set
Step: 2- Fitting our MLR model to the Training set:
Now, we have well prepared our dataset in order to provide training, which means we will fit our regression model to the training set. It will be similar to what we did in the Simple Linear Regression model. The code for this will be:
#Fitting the MLR model to the training set:
from sklearn.linear_model import LinearRegression
regressor= LinearRegression()
regressor.fit(x_train, y_train)
Step: 3- Prediction of Test set results:
The last step for our model is checking the performance of the model. We will do it by predicting the test set result. For prediction, we will create a y_pred vector. Below is the code for it:
#Predicting the Test set result
y_pred= regressor.predict(x_test)
By executing the above lines of code, a new vector will be generated under the variable explorer option. We can test our model by comparing the predicted values with the test set values.
Figure 3.10: Predicted Output
• We can also check the score for the training dataset and the test dataset. Below is the code for it:
print('Train Score: ', regressor.score(x_train, y_train))
print('Test Score: ', regressor.score(x_test, y_test))
Output: The score is:
Train Score: 0.9501847627493607
Test Score: 0.9347068473282446
The above score tells that our model is 95% accurate with the training dataset and 93% accurate with the test dataset.
3.3 LOGISTIC REGRESSION
Logistic Regression in Machine Learning
• Logistic regression is one of the most popular Machine Learning algorithms, which comes under the Supervised Learning technique. It is used for predicting a categorical dependent variable using a given set of independent variables.
• Logistic regression predicts the output of a categorical dependent variable. Therefore the outcome must be a categorical or discrete value. It can be either Yes or No, 0 or 1, True or False, etc., but instead of giving the exact value as 0 and 1, it gives the probabilistic values which lie between 0 and 1.
• Logistic Regression is very similar to Linear Regression except in how they are used. Linear Regression is used for solving regression problems, whereas Logistic Regression is used for solving classification problems.
• In Logistic Regression, instead of fitting a regression line, we fit an "S" shaped logistic function, which predicts two maximum values (0 or 1).
• The curve from the logistic function indicates the likelihood of something, such as whether the cells are cancerous or not, or whether a mouse is obese or not based on its weight, etc.
• Logistic Regression is a significant machine learning algorithm because it has the ability to provide probabilities and classify new data using continuous and discrete datasets.
• Logistic Regression can be used to classify observations using different types of data and can easily determine the most effective variables used for the classification.
Logistic Function (Sigmoid Function):
• The sigmoid function is a mathematical function used to map the predicted values to probabilities.
• It maps any real value into another value within a range of 0 and 1.
• The value of the logistic regression must be between 0 and 1, and cannot go beyond this limit, so it forms a curve like the "S" form. The S-form curve is called the Sigmoid function or the logistic function.
• In logistic regression, we use the concept of the threshold value, which defines the probability of either 0 or 1. Values above the threshold tend towards 1, and values below the threshold tend towards 0.
Types of Logistic Regression
On the basis of the categories, Logistic Regression can be classified into three types:
• Binomial: In binomial Logistic Regression, there can be only two possible types of the dependent variable, such as 0 or 1, Pass or Fail, etc.
• Multinomial: In multinomial Logistic Regression, there can be 3 or more possible unordered types of the dependent variable, such as "cat", "dog", or "sheep".
• Ordinal: In ordinal Logistic Regression, there can be 3 or more possible ordered types of the dependent variable, such as "low", "Medium", or "High".
Consider a dataset which contains the information of various users obtained from social networking sites. There is a car-making company that has recently launched a new SUV car, and the company wants to check how many users from the dataset want to purchase the car. For this problem, we will build a Machine Learning model using the Logistic Regression algorithm. The dataset is shown in the below image. In this problem, we will predict the purchased variable (dependent variable) by using age and salary (independent variables).
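Returning to the sigmoid function described at the start of this section, here is a minimal NumPy sketch of mapping real values to probabilities and then applying a threshold. The sample z values and the 0.5 threshold are illustrative assumptions, not taken from the SUV dataset.
import numpy as np

def sigmoid(z):
    # logistic (sigmoid) function: maps any real value into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# threshold the probability at 0.5 to obtain a 0/1 class label
z = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
probs = sigmoid(z)
labels = (probs >= 0.5).astype(int)
print(probs)   # values between 0 and 1
print(labels)  # 0 below the threshold, 1 at or above it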
Steps in Logistic Regression:
To implement Logistic Regression using Python, we will use the same steps as we have done in previous topics of Regression. Below are the steps:
• Data Pre-processing step
• Fitting Logistic Regression to the Training set
• Predicting the test result
• Test accuracy of the result (Creation of Confusion matrix)
• Visualizing the test set result.
Data Pre-processing step: In this step, we will pre-process/prepare the data so that we can use it in our code efficiently. It will be the same as we have done in the Data pre-processing topic.
#Data Pre-processing Step
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd
#importing datasets
data_set= pd.read_csv('user_data.csv')
Now, we will extract the dependent and independent variables from the given dataset.
#Extracting Independent and dependent Variable
x= data_set.iloc[:, [2,3]].values
y= data_set.iloc[:, 4].values
Now we will split the dataset into a training set and test set. Below is the code for it:
# Splitting the dataset into training and test set.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.25, random_state=0)
In logistic regression, we will do feature scaling because we want accurate results of prediction. Here we will only scale the independent variables, because the dependent variable has only 0 and 1 values. Below is the code for it:
#feature Scaling
from sklearn.preprocessing import StandardScaler
st_x= StandardScaler()
x_train= st_x.fit_transform(x_train)
x_test= st_x.transform(x_test)
Fitting Logistic Regression to the Training set:
We have well prepared our dataset, and now we will train the model using the training set. For providing training or fitting the model to the training set, we will import the LogisticRegression class of the sklearn library. After importing the class, we will create a classifier object and use it to fit the model to the logistic regression. Below is the code for it:
#Fitting Logistic Regression to the training set
from sklearn.linear_model import LogisticRegression
classifier= LogisticRegression(random_state=0)
classifier.fit(x_train, y_train)
Predicting the Test Result
Our model is well trained on the training set, so we will now predict the result by using the test set data. Below is the code for it:
#Predicting the test set result
y_pred= classifier.predict(x_test)
Test Accuracy of the result
Now we will create the confusion matrix here to check the accuracy of the classification. To create it, we need to import the confusion_matrix function of the sklearn library. After importing the function, we will call it and store the result in a new variable cm. The function takes two parameters, mainly y_true (the actual values) and y_pred (the values predicted by the classifier). Below is the code for it:
#Creating the Confusion matrix
from sklearn.metrics import confusion_matrix
cm= confusion_matrix(y_test, y_pred)
Figure 3.11: Confusion Matrix
Visualizing the training set result
Finally, we will visualize the training set result. To visualize the result, we will use the ListedColormap class of the matplotlib library. Below is the code for it:
from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
alpha = 0.75, cmap = ListedColormap(('purple', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
        c = ListedColormap(('purple', 'green'))(i), label = j)
mtp.title('Logistic Regression (Training set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()
In the above code, we have imported the ListedColormap class of the Matplotlib library to create the colormap for visualizing the result. We have created two new variables x_set and y_set to replace x_train and y_train. After that, we have used the nm.meshgrid command to create a rectangular grid, which ranges from the minimum feature value minus 1 to the maximum feature value plus 1. The pixel points we have taken are of 0.01 resolution. To create a filled contour, we have used the mtp.contourf command; it will create regions of the provided colours (purple and green). In this function, we have passed classifier.predict to show the data points predicted by the classifier.
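The confusion matrix created earlier can also be summarised as a single accuracy figure. A minimal sketch, assuming cm, classifier, x_test and y_test from the code above are available:
# overall accuracy from the 2x2 confusion matrix: correct predictions / all predictions
accuracy_from_cm = (cm[0, 0] + cm[1, 1]) / cm.sum()
print('Accuracy from confusion matrix:', accuracy_from_cm)

# the same number can be obtained directly from the classifier
print('classifier.score:', classifier.score(x_test, y_test))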
Figure 3.12: Visualization of Training Set
• In the above graph, we can see that there are some Green points within the green region and Purple points within the purple region.
• All these data points are the observation points from the training set, which show the result for the purchased variable.
• This graph is made by using two independent variables, i.e., Age on the x-axis and Estimated Salary on the y-axis.
• The purple point observations are those for which purchased (the dependent variable) is probably 0, i.e., users who did not purchase the SUV car.
• The green point observations are those for which purchased (the dependent variable) is probably 1, i.e., users who purchased the SUV car.
• We can also estimate from the graph that younger users with a low salary did not purchase the car, whereas older users with a high estimated salary purchased the car.
• But there are some purple points in the green region (buying the car) and some green points in the purple region (not buying the car). These are the observations that the classifier has predicted incorrectly.
Visualizing the test set result:
Our model is well trained using the training dataset. Now, we will visualize the result for new observations (the test set). The code for the test set will remain the same as above except that here we will use x_test and y_test instead of x_train and y_train. Below is the code for it:
#Visualising the test set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_test, y_test
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
alpha = 0.75, cmap = ListedColormap(('purple', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
        c = ListedColormap(('purple', 'green'))(i), label = j)
mtp.title('Logistic Regression (Test set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()
Figure 3.13: Visualizing test set
The above graph shows the test set result. As we can see, the graph is divided into two regions (Purple and Green), and Green observations are in the green region while Purple observations are in the purple region. So we can say it is a good prediction and model. Some of the green and purple data points are in different regions; these misclassifications can be ignored here because we have already measured this error using the confusion matrix.
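As a final illustration of using the trained classifier, the sketch below predicts the purchase decision for a single hypothetical customer. The age and salary values are made-up examples, and the scaler st_x and the classifier come from the code above.
import numpy as nm

# hypothetical new customer: age 40, estimated salary 60000 (illustrative values only)
new_customer = nm.array([[40, 60000]])

# apply the same scaling that was fitted on the training data, then predict
new_customer_scaled = st_x.transform(new_customer)
prediction = classifier.predict(new_customer_scaled)
print('Will purchase the SUV' if prediction[0] == 1 else 'Will not purchase the SUV')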
3.4 SUMMARY
• Regression models are used to predict a continuous value.
• Linear regression is a linear model, e.g. a model that assumes a linear relationship between the input variables (x) and the single output variable (y).
• Linear regression is categorized into simple and multivariable regression.
• Logistic regression is a supervised learning classification algorithm used to predict the probability of a target variable.
• Logistic regression is used to obtain odds ratios in the presence of more than one explanatory variable.
• Python provides a variety of libraries for machine learning.
3.5 KEYWORDS
• Regression- allows us to predict a continuous outcome variable
• Linear regression- the predicted output is continuous and has a constant slope
• Logistic Regression- used to predict the probability of a target variable
• Simple Linear Regression- models the relationship between a dependent variable and a single independent variable
• Multivariable Linear Regression- estimates a single regression model with more than one independent (explanatory) variable
3.6 LEARNING ACTIVITY
1. Assume you are a public health researcher interested in social factors that influence heart disease. You survey 500 towns and gather data on the percentage of people in each town who smoke, the percentage of people in each town who bike to work, and the percentage of people in each town who have heart disease. What kind of regression will you use to analyze the relationship between them?
___________________________________________________________________________
____________________________________________________________________
2. Consider the Credit Card Fraud Detection problem, which is of significant importance to the
banking industry because banks each year spend hundreds of millions of dollars due to fraud. When a credit card transaction happens, the bank makes a note of several factors: for instance, the date of the transaction, amount, place, type of purchase, etc. Based on these factors, which regression model will you choose to determine whether the transaction is a fraud or not?
___________________________________________________________________________
____________________________________________________________________
3.7 UNIT END QUESTIONS
A. Descriptive Questions
Short Question
1. Define regression.
2. What is the need for the sigmoid function?
3. Compare simple and multiple linear regression.
4. List the types of logistic regression.
5. How does logistic regression differ from linear regression?
Long Question
1. Describe the library functions available in Python for machine learning applications.
2. Elaborate the working of linear regression with an example.
3. We are given an email and we need to classify whether or not it is spam. If the email is spam, we label it 1; if it is not spam, we label it 0. Compare the models and describe which model best suits a real-time application.
4. Compare the features of linear and logistic regression.
5. Describe the factors that evaluate the performance of regression models.
B. Multiple Choice Questions
1. If a Linear regression model fits perfectly, i.e., train error is zero, then _____________________
a. Test error is also always zero
b. Test error is non zero
c. Depends on test data
d. Test error is equal to Train error
2. How many coefficients do you need to estimate in a simple linear regression model?
a. 1
b. 2
c. 3
d. 4
3. In the mathematical equation of Linear Regression Y = β1 + β2X + ϵ, (β1, β2) refers to __________
a. X-Intercept, Slope
b. Slope, X-Intercept
c. Y-Intercept, Slope
d. Slope, Y-Intercept
4. Logistic regression assumes a
a. Linear relationship between continuous predictor variables and the outcome variable.
b. Linear relationship between continuous predictor variables and the logit of the outcome variable.
c. Linear relationship between continuous predictor variables.
d. Linear relationship between observations.
5. Which of the following methods do we use to best fit the data in Logistic Regression?
a. Least Square Error
b. Maximum Likelihood
c. Jaccard distance
d. Both A and B
Answers
1 – c, 2 – b, 3 – c, 4 – b, 5 – b
3.8 REFERENCES
Text Books
• Peter Harrington, "Machine Learning in Action", Dream Tech Press
• Ethem Alpaydin, "Introduction to Machine Learning", MIT Press
• Steven Bird, Ewan Klein and Edward Loper, "Natural Language Processing with Python", O'Reilly Media
• Stephen Marsland, "Machine Learning: An Algorithmic Perspective", CRC Press
Reference Books
• William W. Hsieh, "Machine Learning Methods in the Environmental Sciences", Cambridge
• Grant S. Ingersoll, Thomas S. Morton, Andrew L. Farris, "Taming Text", Manning Publication Co.
• Margaret H. Dunham, "Data Mining: Introductory and Advanced Topics", Pearson Education
UNIT - 4: LEARNING WITH CLASSIFICATION
Structure
4.0. Learning Objectives
4.1. Introduction
4.2. Rule Based Classification
4.3. Classification Using Decision Trees
4.4. Constructing a Decision Tree
4.5. Summary
4.6. Keywords
4.7. Learning Activity
4.8. Unit End Questions
4.9. References
4.0 LEARNING OBJECTIVES
After studying this unit, you will be able to:
• Describe the basics of classification and rule-based classifiers
• Illustrate classification using decision trees
• Construct a decision tree using the ID3 algorithm
• Apply decision tree classification for solving real world problems
4.1 INTRODUCTION
Classification is a process of categorizing a given set of data into classes. It can be performed on both structured and unstructured data. The process starts with predicting the class of given data points. The classes are often referred to as targets, labels or categories. Classification predictive modeling is the task of approximating the mapping function from input variables to discrete output variables. The main goal is to identify which class/category the new data will fall into.
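To make the idea concrete, here is a minimal, self-contained sketch of a classifier being trained on labelled data and then predicting the class of a new data point. The tiny dataset is an invented illustration (the values are not from the heart-disease example discussed next):
from sklearn.tree import DecisionTreeClassifier

# toy labelled data: [age, resting heart rate] -> 0 (healthy) or 1 (at risk); values are made up
X = [[35, 70], [42, 88], [58, 95], [61, 72], [47, 90], [29, 65]]
y = [0, 1, 1, 0, 1, 0]

clf = DecisionTreeClassifier()   # the classifier maps inputs to a category
clf.fit(X, y)                    # learn from the labelled training data

print(clf.predict([[50, 92]]))   # predict the class of a new, unlabelled observation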
Figure 4.1: Example for classification
Let us try to understand this with a simple example. Heart disease detection can be identified as a classification problem; this is a binary classification since there can be only two classes, i.e. has heart disease or does not have heart disease. The classifier, in this case, needs training data to understand how the given input variables are related to the class. And once the classifier is trained accurately, it can be used to detect whether heart disease is present or not for a particular patient.
Since classification is a type of supervised learning, the targets are also provided with the input data. Let us get familiar with the classification terminologies in machine learning.
Classification Terminologies in Machine Learning
Classifier – It is an algorithm that is used to map the input data to a specific category.
Classification Model – The model predicts or draws a conclusion from the input data given for training; it will predict the class or category for new data.
Feature – A feature is an individual measurable property of the phenomenon being observed.
Binary Classification – It is a type of classification with two outcomes, e.g. either true or false.
Multi-Class Classification – Classification with more than two classes; in multi-class classification each sample is assigned to one and only one label or target.
Multi-label Classification – This is a type of classification where each sample can be assigned to a set of labels or targets.
Predict the Target – For an unlabeled observation X, the predict(X) method returns the predicted label y.
Evaluate – This basically means the evaluation of the model, i.e. classification report, accuracy score, etc.
4.2 RULE BASED CLASSIFICATION
A rule-based classifier makes use of a set of IF-THEN rules for classification. Such rules are also used in class prediction algorithms to give a ranking to the rules, which is then utilized to predict the class of new cases. We can express a rule in the following form:
IF condition THEN conclusion
Let us consider a rule R1,
R1: IF age = youth AND student = yes THEN buys_computer = yes
The condition used with "IF" is called the antecedent, and the predicted class of each rule is called the consequent. The antecedent part (the condition) consists of one or more attribute tests, and these tests are logically ANDed. The consequent part consists of the class prediction.
We can also write rule R1 as follows:
R1: (age = youth) ∧ (student = yes) ⇒ (buys_computer = yes)
Properties of rule-based classifiers
Coverage: The percentage of records which satisfy the antecedent conditions of a particular rule.
The rules generated by rule-based classifiers are generally not mutually exclusive, i.e. many rules can cover the same record.
The rules generated by rule-based classifiers may not be exhaustive, i.e. there may be some records which are not covered by any of the rules.
The decision boundaries created by them are linear, but the overall decision can be much more complex than that of a decision tree because many rules may be triggered for the same record.
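A minimal sketch of how a handful of IF-THEN rules can be applied in code; the first rule mirrors the buys_computer example above, while the second rule and the default class are assumptions added for illustration only:
def rule_based_classify(record):
    # R1: IF age = youth AND student = yes THEN buys_computer = yes
    if record.get("age") == "youth" and record.get("student") == "yes":
        return "yes"
    # R2 (illustrative rule, not from the text): IF age = senior AND credit_rating = excellent THEN yes
    if record.get("age") == "senior" and record.get("credit_rating") == "excellent":
        return "yes"
    # default class when no rule fires (an assumption for this sketch)
    return "no"

print(rule_based_classify({"age": "youth", "student": "yes"}))        # -> yes (R1 fires)
print(rule_based_classify({"age": "middle_aged", "student": "no"}))   # -> no (default)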
Rule Induction Using Sequential Covering Algorithm
The Sequential Covering Algorithm can be used to extract IF-THEN rules from the training data. We do not require generating a decision tree first. In this algorithm, each rule for a given class covers many of the tuples of that class. Some of the sequential covering algorithms are AQ, CN2, and RIPPER.
As per the general strategy, the rules are learned one at a time. Each time a rule is learned, the tuples covered by the rule are removed and the process continues for the rest of the tuples. In contrast, the path to each leaf in a decision tree corresponds to a rule, so decision tree induction can be considered as learning a set of rules simultaneously.
The following is the sequential learning algorithm, where rules are learned for one class at a time. When learning a rule for a class Ci, we want the rule to cover all the tuples from class Ci only and no tuple from any other class.
Algorithm: Sequential Covering
Input: D, a data set of class-labeled tuples; Att_vals, the set of all attributes and their possible values.
Output: A set of IF-THEN rules.
Method:
Rule_set = { };  // initial set of rules learned is empty
for each class c do
    repeat
        Rule = Learn_One_Rule(D, Att_vals, c);
        remove tuples covered by Rule from D;
        Rule_set = Rule_set + Rule;  // add the new rule to the rule set
    until termination condition;
end for
return Rule_set;
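A compact Python skeleton of the same control flow; learn_one_rule here is a stub to be supplied by a real rule learner, so this is a sketch of the loop structure under that assumption rather than a full implementation:
def sequential_covering(data, classes, learn_one_rule, max_rules_per_class=10):
    """Learn IF-THEN rules one class at a time, removing covered tuples as we go."""
    rule_set = []
    for c in classes:
        remaining = list(data)
        for _ in range(max_rules_per_class):           # simplified termination condition
            rule = learn_one_rule(remaining, c)         # returns (predicate, class) or None
            if rule is None:
                break
            predicate, _ = rule
            remaining = [t for t in remaining if not predicate(t)]  # remove covered tuples
            rule_set.append(rule)                       # add the new rule to the rule set
    return rule_set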
Rule Pruning
Rule pruning is required due to the following reason: the assessment of quality is made on the original set of training data, so the rule may perform well on training data but less well on subsequent data. The rule is pruned by removing a conjunct (attribute test). The rule R is pruned if the pruned version of R has greater quality than what was assessed on an independent set of tuples.
FOIL is one of the simple and effective methods for rule pruning. For a given rule R,
FOIL_Prune(R) = (pos - neg) / (pos + neg)
where pos and neg are the number of positive and negative tuples covered by R, respectively.
4.3 CLASSIFICATION USING DECISION TREES
A decision tree builds classification or regression models in the form of a tree structure. It breaks down a dataset into smaller and smaller subsets while at the same time an associated decision tree is incrementally developed. Decision trees can handle both categorical and numerical data.
A Decision Tree consists of:
Nodes: Test for the value of a certain attribute.
Edges/Branches: Correspond to the outcome of a test and connect to the next node or leaf.
Leaf nodes: Terminal nodes that predict the outcome (represent class labels or class distribution).
4.4 CONSTRUCTING A DECISION TREE
The core algorithm for building decision trees, called ID3, was proposed by J. R. Quinlan; it employs a top-down, greedy search through the space of possible branches with no backtracking. ID3 uses Entropy and Information Gain to construct a decision tree.
Information Gain
Information gain is the measurement of the change in entropy after the segmentation of a dataset based on an attribute. It calculates how much information a feature provides us about a class. According to the value of information gain, we split the node and build the decision tree.
A decision tree algorithm always tries to maximize the value of information gain, and a node/attribute having the highest information gain is split first. It can be calculated using the below formula:
Information Gain = Entropy(S) - [(Weighted Avg) * Entropy(each feature)]
Entropy: Entropy is a metric to measure the impurity in a given attribute. It specifies the randomness in the data. For a binary-class set S it can be calculated as:
Entropy(S) = - p(yes) * log2(p(yes)) - p(no) * log2(p(no))
Steps in ID3 Algorithm
1. Calculate entropy for the dataset.
2. For each attribute/feature:
a. Calculate entropy for all its categorical values.
b. Calculate information gain for the feature.
3. Find the feature with maximum information gain.
4. Repeat it until we get the desired tree.
Let us consider the below example (the standard "play tennis" weather dataset, whose counts are used in all of the calculations that follow):
Day Outlook Temp. Humidity Wind Decision
1 Sunny Hot High Weak No
2 Sunny Hot High Strong No
3 Overcast Hot High Weak Yes
4 Rain Mild High Weak Yes
5 Rain Cool Normal Weak Yes
6 Rain Cool Normal Strong No
7 Overcast Cool Normal Strong Yes
8 Sunny Mild High Weak No
9 Sunny Cool Normal Weak Yes
10 Rain Mild Normal Weak Yes
11 Sunny Mild Normal Strong Yes
12 Overcast Mild High Strong Yes
13 Overcast Hot Normal Weak Yes
14 Rain Mild High Strong No
Here, the dataset has binary classes (yes and no), where 9 out of 14 are "yes" and 5 out of 14 are "no".
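The hand calculation that follows can be cross-checked with a short Python sketch; decisions and outlook below are simply the Decision and Outlook columns of the table above:
from math import log2
from collections import Counter

def entropy(labels):
    # H(S) = -sum(p * log2(p)) over the class proportions
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values() if n > 0)

decisions = ['No','No','Yes','Yes','Yes','No','Yes','No','Yes','Yes','Yes','Yes','Yes','No']
outlook   = ['Sunny','Sunny','Overcast','Rain','Rain','Rain','Overcast','Sunny',
             'Sunny','Rain','Sunny','Overcast','Overcast','Rain']

h_s = entropy(decisions)                               # ~0.94
weighted = sum(
    (outlook.count(v) / len(outlook)) * entropy([d for o, d in zip(outlook, decisions) if o == v])
    for v in set(outlook)
)                                                      # ~0.693
print(round(h_s, 3), round(h_s - weighted, 3))         # entropy and information gain (~0.247)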
Complete entropy of the dataset is -
H(S) = - p(yes) * log2(p(yes)) - p(no) * log2(p(no))
= - (9/14) * log2(9/14) - (5/14) * log2(5/14)
= 0.41 + 0.53
= 0.94
Calculate information gain for the attribute Outlook
Categorical values - sunny, overcast and rain
H(Outlook=sunny) = -(2/5)*log(2/5) - (3/5)*log(3/5) = 0.971
H(Outlook=rain) = -(3/5)*log(3/5) - (2/5)*log(2/5) = 0.971
H(Outlook=overcast) = -(4/4)*log(4/4) - 0 = 0
Average Entropy Information for Outlook -
I(Outlook) = p(sunny) * H(Outlook=sunny) + p(rain) * H(Outlook=rain) + p(overcast) * H(Outlook=overcast)
= (5/14)*0.971 + (5/14)*0.971 + (4/14)*0
= 0.693
Information Gain = H(S) - I(Outlook)
= 0.94 - 0.693
= 0.247
Repeat the same procedure for all the attributes.
Attribute - Temperature
Categorical values - hot, mild, cool
H(Temperature=hot) = -(2/4)*log(2/4) - (2/4)*log(2/4) = 1
H(Temperature=cool) = -(3/4)*log(3/4) - (1/4)*log(1/4) = 0.811
H(Temperature=mild) = -(4/6)*log(4/6) - (2/6)*log(2/6) = 0.9179
Average Entropy Information for Temperature -
I(Temperature) = p(hot)*H(Temperature=hot) + p(mild)*H(Temperature=mild) + p(cool)*H(Temperature=cool)
= (4/14)*1 + (6/14)*0.9179 + (4/14)*0.811
= 0.9108
Information Gain = H(S) - I(Temperature)
= 0.94 - 0.9108
= 0.0292
Attribute - Humidity
Categorical values - high, normal
H(Humidity=high) = -(3/7)*log(3/7) - (4/7)*log(4/7) = 0.983
H(Humidity=normal) = -(6/7)*log(6/7) - (1/7)*log(1/7) = 0.591
Average Entropy Information for Humidity -
I(Humidity) = p(high)*H(Humidity=high) + p(normal)*H(Humidity=normal)
= (7/14)*0.983 + (7/14)*0.591
= 0.787
Information Gain = H(S) - I(Humidity)
= 0.94 - 0.787
= 0.153
Attribute - Wind
Categorical values - weak, strong
H(Wind=weak) = -(6/8)*log(6/8) - (2/8)*log(2/8) = 0.811
H(Wind=strong) = -(3/6)*log(3/6) - (3/6)*log(3/6) = 1
Average Entropy Information for Wind -
I(Wind) = p(weak)*H(Wind=weak) + p(strong)*H(Wind=strong)
= (8/14)*0.811 + (6/14)*1
= 0.892
Information Gain = H(S) - I(Wind)
= 0.94 - 0.892
= 0.048
Among these attributes, Outlook has the highest information gain. So the decision tree is built with Outlook as the root node (branches: sunny, overcast, rain).
Now, finding the best attribute for splitting the data with Outlook=Sunny values {dataset rows = [1, 2, 8, 9, 11]}.
Complete entropy of Sunny is -
H(Sunny) = - p(yes) * log2(p(yes)) - p(no) * log2(p(no))
= - (2/5) * log2(2/5) - (3/5) * log2(3/5)
= 0.971
Attribute - Temperature
Categorical values - hot, mild, cool
H(Sunny, Temperature=hot) = -0 - (2/2)*log(2/2) = 0
H(Sunny, Temperature=cool) = -(1)*log(1) - 0 = 0
H(Sunny, Temperature=mild) = -(1/2)*log(1/2) - (1/2)*log(1/2) = 1
Average Entropy Information for Temperature -
I(Sunny, Temperature) = p(Sunny, hot)*H(Sunny, Temperature=hot) + p(Sunny, cool)*H(Sunny, Temperature=cool) + p(Sunny, mild)*H(Sunny, Temperature=mild)
= (2/5)*0 + (1/5)*0 + (2/5)*1
= 0.4
Information Gain = H(Sunny) - I(Sunny, Temperature)
= 0.971 - 0.4
= 0.571
Attribute - Humidity
Categorical values - high, normal
H(Sunny, Humidity=high) = -0 - (3/3)*log(3/3) = 0
H(Sunny, Humidity=normal) = -(2/2)*log(2/2) - 0 = 0
Average Entropy Information for Humidity -
I(Sunny, Humidity) = p(Sunny, high)*H(Sunny, Humidity=high) + p(Sunny, normal)*H(Sunny, Humidity=normal)
= (3/5)*0 + (2/5)*0
= 0
Information Gain = H(Sunny) - I(Sunny, Humidity)
= 0.971 - 0
= 0.971
Attribute - Wind
Categorical values - weak, strong
H(Sunny, Wind=weak) = -(1/3)*log(1/3) - (2/3)*log(2/3) = 0.918
H(Sunny, Wind=strong) = -(1/2)*log(1/2) - (1/2)*log(1/2) = 1
Average Entropy Information for Wind -
I(Sunny, Wind) = p(Sunny, weak)*H(Sunny, Wind=weak) + p(Sunny, strong)*H(Sunny, Wind=strong)
= (3/5)*0.918 + (2/5)*1
= 0.9508
Information Gain = H(Sunny) - I(Sunny, Wind)
= 0.971 - 0.9508
= 0.0202
Here, the attribute with maximum information gain is Humidity. So, the decision tree built so far splits the sunny branch on Humidity: when Outlook = Sunny and Humidity = High, it is a pure class of category "no", and when Outlook = Sunny and Humidity = Normal, it is again a pure class of category "yes". Therefore, we don't need to do further calculations on this branch.
Now, finding the best attribute for splitting the data with Outlook=Rain values {dataset rows = [4, 5, 6, 10, 14]}.
Complete entropy of Rain is -
H(Rain) = - p(yes) * log2(p(yes)) - p(no) * log2(p(no))
= - (3/5) * log(3/5) - (2/5) * log(2/5)
= 0.971
Attribute - Temperature
Categorical values - mild, cool
H(Rain, Temperature=cool) = -(1/2)*log(1/2) - (1/2)*log(1/2) = 1
H(Rain, Temperature=mild) = -(2/3)*log(2/3) - (1/3)*log(1/3) = 0.918
Average Entropy Information for Temperature -
I(Rain, Temperature) = p(Rain, cool)*H(Rain, Temperature=cool) + p(Rain, mild)*H(Rain, Temperature=mild)
= (2/5)*1 + (3/5)*0.918
= 0.9508
Information Gain = H(Rain) - I(Rain, Temperature)
= 0.971 - 0.9508
= 0.0202
Attribute - Wind
Categorical values - weak, strong
H(Rain, Wind=weak) = -(3/3)*log(3/3) - 0 = 0
H(Rain, Wind=strong) = 0 - (2/2)*log(2/2) = 0
Average Entropy Information for Wind -
I(Rain, Wind) = p(Rain, weak)*H(Rain, Wind=weak) + p(Rain, strong)*H(Rain, Wind=strong)
= (3/5)*0 + (2/5)*0
= 0
Information Gain = H(Rain) - I(Rain, Wind)
= 0.971 - 0
= 0.971
Here, the attribute with maximum information gain is Wind. The final decision tree is formed as shown below: Outlook is the root node; the overcast branch predicts Yes; the sunny branch splits on Humidity (High → No, Normal → Yes); and the rain branch splits on Wind (Weak → Yes, Strong → No).
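The same tree can also be reproduced programmatically. A minimal scikit-learn sketch is shown below; one-hot encoding of the categorical columns and the entropy criterion are reasonable choices here, and the printed tree may differ slightly in layout (it splits on indicator columns) even though it encodes the same decisions:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# the 14-row play-tennis dataset from the worked example above
data = pd.DataFrame({
    'Outlook':  ['Sunny','Sunny','Overcast','Rain','Rain','Rain','Overcast','Sunny','Sunny','Rain','Sunny','Overcast','Overcast','Rain'],
    'Temp':     ['Hot','Hot','Hot','Mild','Cool','Cool','Cool','Mild','Cool','Mild','Mild','Mild','Hot','Mild'],
    'Humidity': ['High','High','High','High','Normal','Normal','Normal','High','Normal','Normal','Normal','High','Normal','High'],
    'Wind':     ['Weak','Strong','Weak','Weak','Weak','Strong','Strong','Weak','Weak','Weak','Strong','Strong','Weak','Strong'],
    'Decision': ['No','No','Yes','Yes','Yes','No','Yes','No','Yes','Yes','Yes','Yes','Yes','No'],
})

X = pd.get_dummies(data.drop(columns='Decision'))   # one-hot encode the categorical features
y = data['Decision']
tree = DecisionTreeClassifier(criterion='entropy', random_state=0).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))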
4.5 SUMMARY
• A rule-based classifier makes use of a set of IF-THEN rules for classification.
• The rules generated by rule-based classifiers are generally not mutually exclusive.
• A decision tree builds classification or regression models in the form of a tree structure.
• Decision trees can handle both categorical and numerical data.
• Decision Trees are a type of Supervised Machine Learning.
• Each branch of the decision tree represents a possible decision.
• Information gain is the measurement of the change in entropy after the segmentation of a dataset based on an attribute.
• ID3 uses Entropy and Information Gain to construct a decision tree.
• If the sample is completely homogeneous the entropy is zero, and if the sample is equally divided it has an entropy of one.
4.6 KEYWORDS
• Decision Tree- a tree representation used to solve the problem
• Rule Based Tree- uses a set of declarative rules as an input for generating a decision tree
• Entropy- quantifies the amount of uncertainty (impurity) in the data
• Information Gain- reduction in entropy or surprise obtained by transforming a dataset; often used in training decision trees
4.7 LEARNING ACTIVITY
1. Imagine you are doing four things at the weekend:
• Go Shopping, Watch A Movie, Play Tennis Or Just Stay In.
What you do depends on three things:
• The Weather (Windy, Rainy or Sunny);
• How Much Money You Have (Rich or Poor); And
• Whether Your Parents Are Visiting.
The decision for tree classification is based on the below rules:
• Rule 1: If My Parents Are Visiting, We'll Go to The Cinema.
• Rule 2: If They're Not Visiting and It's Sunny, Then I'll Play Tennis.
• Rule 3: If It's Windy, And I'm Rich, Then I'll Go Shopping.
• Rule 4: If They're Not Visiting, It's Windy and I'm Poor, Then I Will Go To The Cinema.
• Rule 5: If They're Not Visiting and It's Rainy, Then I'll Stay In.
Now, construct the decision tree based on these IF-THEN rules.
___________________________________________________________________________
____________________________________________________________________
2. The decision tree created by using IF-THEN rules and the one created by the ID3 algorithm are the same. Comment.
___________________________________________________________________________
____________________________________________________________________
4.8 UNIT END QUESTIONS
A. Descriptive Questions
Short Question
1. What is a Decision tree?
2. What is classification?
3. What are the algorithms used to construct a decision tree?
4. What do the terminal nodes of a decision tree contain?
5. Define information gain.
Long Question
1. Describe the benefits of a Decision tree.
2. Compare ID3 and IF-THEN rules to generate a decision tree.
3. Can a decision tree be used for multiclass classification? Justify your answer.
4. Describe the working of the ID3 algorithm in constructing a decision tree.
5. Is it possible for a decision tree to have unused branches? Justify with an example.
B. Multiple Choice Questions
1. A _________ is a decision support tool that uses a tree-like graph or model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility.
a. Decision tree
b. Graphs
c. Trees
d. Neural Networks
2. Choose from the following that are Decision Tree nodes?
a. Decision Nodes
b. End Nodes
c. Chance Nodes
d. All the above
3. What is the biggest weakness of decision trees compared to logistic regression classifiers?
a. Decision trees are more likely to overfit the data
b. Decision trees are more likely to underfit the data
c. Decision trees do not assume independence of the input features
d. None of the mentioned
4. In decision tree algorithms, attribute selection measures are used to
a. Reduce the dimensionality
b. Select the splitting criteria which best separate the data
c. Reduce the error rate
d. Rank attributes
5. How does the decision tree reach its decision?
a. Single test
b. Two tests
c. Sequence of tests
d. No test
Answers: 1 – c, 2 – d, 3 – a, 4 – b, 5 – c
4.9 REFERENCES
Textbooks
• Peter Harrington, "Machine Learning in Action", Dream Tech Press
• Ethem Alpaydin, "Introduction to Machine Learning", MIT Press
• Steven Bird, Ewan Klein and Edward Loper, "Natural Language Processing with Python", O'Reilly Media
• Stephen Marsland, "Machine Learning: An Algorithmic Perspective", CRC Press
Reference Books
• William W. Hsieh, "Machine Learning Methods in the Environmental Sciences", Cambridge
• Grant S. Ingersoll, Thomas S. Morton, Andrew L. Farris, "Taming Text", Manning Publication Co.
• Margaret H. Dunham, "Data Mining: Introductory and Advanced Topics", Pearson Education
UNIT - 5: CLASSIFICATION AND REGRESSION TREE
Structure
5.0. Learning Objectives
5.1. Classification and Regression Trees
5.2. Advantages of CART
5.3. Overfitting and Pruning
5.4. Summary
5.5. Keywords
5.6. Learning Activity
5.7. Unit End Questions
5.8. References
5.0 LEARNING OBJECTIVES
After studying this unit, you will be able to:
• Describe the basics of CART trees
• Apply CART to perform classification and regression
• Compare ID3 and CART algorithms
• Construct decision trees for classification and regression using CART
5.1 CLASSIFICATION AND REGRESSION TREES (CART)
Decision Trees are commonly used in data mining with the objective of creating a model that predicts the value of a target (or dependent) variable based on the values of several input (or independent) variables. CART is an alternative decision tree building algorithm. It can handle both classification and regression tasks. This algorithm uses a metric named the Gini index to create decision points for classification tasks. CART also supports numerical target variables, which enables it to act as a Regression Tree that predicts continuous values.
Gini Index
The Gini index is a metric for classification tasks in CART. It is computed from the sum of the squared probabilities of each class. We can formulate it as illustrated below.
Gini = 1 – Σ (Pi)² for i = 1 to number of classes
Example
The example below uses the same 14-day weather ("play tennis") dataset as Unit 4. Suppose we want to start the decision tree by splitting on the "Outlook" feature. Then, we need to calculate the Gini index for each of its values. Outlook is a nominal feature: it can be Sunny, Overcast or Rain. The class counts for the Outlook feature are summarized below.
Outlook Yes No Number of instances
Sunny 2 3 5
Overcast 4 0 4
Rain 3 2 5
Gini(Outlook=Sunny) = 1 – (2/5)² – (3/5)² = 1 – 0.16 – 0.36 = 0.48
Gini(Outlook=Overcast) = 1 – (4/4)² – (0/4)² = 0
Gini(Outlook=Rain) = 1 – (3/5)² – (2/5)² = 1 – 0.36 – 0.16 = 0.48
Then, we will calculate the weighted sum of Gini indexes for the Outlook feature.
Gini(Outlook) = (5/14) x 0.48 + (4/14) x 0 + (5/14) x 0.48 = 0.171 + 0 + 0.171 = 0.342
Similarly, Temperature is a nominal feature and it can have 3 different values: Cool, Hot and Mild.
Temp. Yes No Number of instances
Hot 2 2 4
Cool 3 1 4
Mild 4 2 6
Gini(Temp=Hot) = 1 – (2/4)² – (2/4)² = 0.5
Gini(Temp=Cool) = 1 – (3/4)² – (1/4)² = 1 – 0.5625 – 0.0625 = 0.375
Gini(Temp=Mild) = 1 – (4/6)² – (2/6)² = 1 – 0.444 – 0.111 = 0.445
We'll calculate the weighted sum of the Gini index for the Temperature feature.
Gini(Temp) = (4/14) x 0.5 + (4/14) x 0.375 + (6/14) x 0.445 = 0.142 + 0.107 + 0.190 = 0.439
Humidity is a binary class feature. It can be high or normal.
Humidity Yes No Number of instances
High 3 4 7
Normal 6 1 7
Gini(Humidity=High) = 1 – (3/7)² – (4/7)² = 1 – 0.183 – 0.326 = 0.489
Gini(Humidity=Normal) = 1 – (6/7)² – (1/7)² = 1 – 0.734 – 0.02 = 0.244
The weighted sum for the Humidity feature is calculated next.
Gini(Humidity) = (7/14) x 0.489 + (7/14) x 0.244 = 0.367
Wind is a binary class feature similar to Humidity. It can be weak or strong.
Wind Yes No Number of instances
Weak 6 2 8
Strong 3 3 6
Gini(Wind=Weak) = 1 – (6/8)² – (2/8)² = 1 – 0.5625 – 0.0625 = 0.375
Gini(Wind=Strong) = 1 – (3/6)² – (3/6)² = 1 – 0.25 – 0.25 = 0.5
Gini(Wind) = (8/14) x 0.375 + (6/14) x 0.5 = 0.428
Among the features, Outlook has the lowest Gini index. So it is chosen as the root node, and the decision tree so far has Outlook at the top with Sunny, Overcast and Rain branches. The sub-dataset in the Overcast leaf has only "yes" decisions. This means that the Overcast leaf is over.
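The feature-selection step above can be cross-checked with a short sketch. The lists below are the Outlook column and the Yes/No decisions of the 14-day dataset, and the same helper works for Temperature, Humidity and Wind:
from collections import Counter

def gini(labels):
    # Gini = 1 - sum(p_i^2) over the class proportions
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def weighted_gini(feature_values, labels):
    total = len(labels)
    return sum(
        (feature_values.count(v) / total) * gini([l for f, l in zip(feature_values, labels) if f == v])
        for v in set(feature_values)
    )

decision = ['No','No','Yes','Yes','Yes','No','Yes','No','Yes','Yes','Yes','Yes','Yes','No']
outlook  = ['Sunny','Sunny','Overcast','Rain','Rain','Rain','Overcast','Sunny',
            'Sunny','Rain','Sunny','Overcast','Overcast','Rain']

print(round(weighted_gini(outlook, decision), 3))   # ~0.343, matching the 0.342 above up to rounding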
Apply the same principles to those sub-datasets in the following steps. Focus on the sub-dataset for the Sunny outlook. We need to find the Gini index scores for the Temperature, Humidity and Wind features respectively.
Day Outlook Temp. Humidity Wind Decision
1 Sunny Hot High Weak No
2 Sunny Hot High Strong No
8 Sunny Mild High Weak No
9 Sunny Cool Normal Weak Yes
11 Sunny Mild Normal Strong Yes
Gini of Temperature for Sunny outlook
Temperature Yes No Number of instances
Hot 0 2 2
Cool 1 0 1
Mild 1 1 2
Gini(Outlook=Sunny and Temp.=Hot) = 1 – (0/2)² – (2/2)² = 0
Gini(Outlook=Sunny and Temp.=Cool) = 1 – (1/1)² – (0/1)² = 0
Gini(Outlook=Sunny and Temp.=Mild) = 1 – (1/2)² – (1/2)² = 1 – 0.25 – 0.25 = 0.5
Gini(Outlook=Sunny and Temp.) = (2/5) x 0 + (1/5) x 0 + (2/5) x 0.5 = 0.2
Gini of Humidity for Sunny outlook
Humidity Yes No Number of instances
High 0 3 3
Normal 2 0 2
Gini(Outlook=Sunny and Humidity=High) = 1 – (0/3)² – (3/3)² = 0
Gini(Outlook=Sunny and Humidity=Normal) = 1 – (2/2)² – (0/2)² = 0
Gini(Outlook=Sunny and Humidity) = (3/5) x 0 + (2/5) x 0 = 0
Gini of Wind for Sunny outlook
Wind Yes No Number of instances
Weak 1 2 3
Strong 1 1 2
Gini(Outlook=Sunny and Wind=Weak) = 1 – (1/3)² – (2/3)² = 0.444
Gini(Outlook=Sunny and Wind=Strong) = 1 – (1/2)² – (1/2)² = 0.5
Gini(Outlook=Sunny and Wind) = (3/5) x 0.444 + (2/5) x 0.5 = 0.466
Among these, Humidity has the lowest Gini index (0), so the Sunny branch is split on Humidity. As seen, the decision is always "no" for High humidity under a Sunny outlook. On the other hand,
the decision will always be "yes" for Normal humidity under a Sunny outlook. This branch is over. Now, we need to focus on the Rain outlook.
Rain outlook
Day Outlook Temp. Humidity Wind Decision
4 Rain Mild High Weak Yes
5 Rain Cool Normal Weak Yes
6 Rain Cool Normal Strong No
10 Rain Mild Normal Weak Yes
14 Rain Mild High Strong No
We'll calculate the Gini index scores for the Temperature, Humidity and Wind features when the outlook is Rain.
Gini of Temperature for Rain outlook
Temperature Yes No Number of instances
Cool 1 1 2
Mild 2 1 3
Gini(Outlook=Rain and Temp.=Cool) = 1 – (1/2)² – (1/2)² = 0.5
Gini(Outlook=Rain and Temp.=Mild) = 1 – (2/3)² – (1/3)² = 0.444
Gini(Outlook=Rain and Temp.) = (2/5) x 0.5 + (3/5) x 0.444 = 0.466
Gini of Humidity for Rain outlook
Humidity Yes No Number of instances
High 1 1 2
Normal 2 1 3
Gini(Outlook=Rain and Humidity=High) = 1 – (1/2)² – (1/2)² = 0.5
Gini(Outlook=Rain and Humidity=Normal) = 1 – (2/3)² – (1/3)² = 0.444
Gini(Outlook=Rain and Humidity) = (2/5) x 0.5 + (3/5) x 0.444 = 0.466
Gini of Wind for Rain outlook
Wind Yes No Number of instances
Weak 3 0 3
Strong 0 2 2
Gini(Outlook=Rain and Wind=Weak) = 1 – (3/3)² – (0/3)² = 0
Gini(Outlook=Rain and Wind=Strong) = 1 – (0/2)² – (2/2)² = 0
Gini(Outlook=Rain and Wind) = (3/5) x 0 + (2/5) x 0 = 0
Among these features, Wind has the lowest Gini index, so the Rain branch is split on Wind.
As seen, decision is always yes when wind is weak. On the other hand, decision is always no if wind is strong. This means that this branch is over.The final decision tree built by CART algorithm is 5.2 ADVANTAGES OF CART ▪ CART is nonparametric and therefore does not rely on data belonging to a particular type of distribution. ▪ CART is not significantly impacted by outliers in the input variables. ▪ You can relax stopping rules to \"overgrow\" decision trees and then prune back the tree to the optimal size. This approach minimizes the probability that important structure in the data set will be overlooked by stopping too soon. ▪ CART incorporates both testing with a test data set and cross-validation to assess the goodness of fit more accurately. ▪ CART can use the same variables more than once in different parts of the tree. This 89 CU IDOL SELF LEARNING MATERIAL (SLM)
capability can uncover complex interdependencies between sets of variables. ▪ CART can be used in conjunction with other prediction methods to select the input set of variables. 5.3 OVERFITTING AND PRUNING Overfitting happens when a decision tree tries to be as perfect as possible by increasing the depth of tests and thereby reduces the error. This results in very complex trees and leads to overfitting. Overfitting reduces the predictive nature of the decision tree. The approaches to avoid overfitting of the trees include pre pruning and post pruning. Tree Pruning Pruning is the method of removing the unused branches from the decision tree. Some branches of the decision tree might represent outliers or noisy data. Tree pruning is the method to reduce the unwanted branches of the tree. This will reduce the complexity of the tree and help in effective predictive analysis. It reduces the overfitting as it removes the unimportant branches from the trees. Prepruning: In this approach, the construction of the decision tree is stopped early. It means it is decided not to further partition the branches. The last node constructed becomes the leaf node and this leaf node may hold the most frequent class among the tuples. The attribute selection measures are used to find out the weightage of the split. Threshold values are prescribed to decide which splits are regarded as useful. If the portioning of the node results in splitting by falling below threshold then the process is halted. Postpruning: This method removes the outlier branches from a fully grown tree. The unwanted branches are removed and replaced by a leaf node denoting the most frequent class label. This technique requires more computation than prepruning, however, it is more reliable. The pruned trees are more precise and compact when compared to unpruned trees but they carry a disadvantage of replication and repetition. 90 CU IDOL SELF LEARNING MATERIAL (SLM)
Repetition occurs when the same attribute is tested again and again along a branch of a tree. Replication occurs when the duplicate sub trees are present within the tree. These issues can be solved by multivariate splits. Figure 5.1 Example for pruning 5.4 SUMMARY • CART can perform both classification and regression. • Uses gini index to create decision points • Gini index is created using the formula Gini = 1 – Σ (Pi)2 for i=1 to number of classes • The attribute that has less Gini index values is chosen as decision criteria • Tree pruning is the method to reduce the unwanted branches of the tree • Post pruning removes the outlier branches from a fully grown tree 5.5 KEYWORDS • Gini Index- measures the degree or probability of a particular variable being wrongly classified when it is randomly chosen • pruning- reduces the size of decision trees by removing sections • overfitting- modeling error in statistics that occurs when a function is too closely aligned to a limited set of data points • prepruning- stopping the tree before it has completed • postpruning- pruning the tree after it has finished 91 CU IDOL SELF LEARNING MATERIAL (SLM)
5.6 LEARNING ACTIVITY 1. Consider the example problem of Loan Approval Prediction, and this is also a binary classification problem - either 'Yes' or 'No'. Each record is for one loan applicant at a famous bank. The attributes being considered are - age, job status, do they own a house or not, and their credit rating. In the real world, banks would look into many more attributes. They may even classify individuals on the basis of risk - high, medium and low. Use CART to perform the task ___________________________________________________________________________ ____________________________________________________________________ 2. ID3 and CART produces same decision tree. Comment. ___________________________________________________________________________ ____________________________________________________________________ 5.7 UNIT END QUESTIONS A.Descriptive Questions Short Question 1. Define CART. 2. What is tree pruning? 3. Compare prepruning and post pruning. 92 CU IDOL SELF LEARNING MATERIAL (SLM)
4. List the advantages of CART? 5. What does overfitting mean? Long Question 1. Describe the steps involved in CART. 2. How overfitting is avoided in CART. Give relevant example. 3. Describe how unused braches are avoided in CART algorithm. B. Multiple Choice Questions 1. CART is used for …………. a. Classification b. Regression c. Both A and B d. None of these 2. The measures developed for selecting the best split are often based on the degree of impurity of the child nodes. Which of the following is NOT an impurity measure? a. Gini b. Entropy c. Pruning d. Classification Error 3. Pruning a decision tree always 93 a. Increases the error rate b. Reduces classification accuracy c. Provides the partitions with lower entropy d. Reduces the size of the tree 4. …………………removes the outlier branches from a fully grown tree? a. Classification CU IDOL SELF LEARNING MATERIAL (SLM)
b. Prepruning c. post pruning d. All the above 5. The techniques for handling noise in decision tree learning are a. Prepruning b. post pruning c. Both and B d. None of the above Answers 1 – c, 2 – c, 3 – d, 4 – c, 5 – c 5.8 REFERENCES Textbooks • Peter Harrington “Machine Learning in Action”, Dream Tech Press • EthemAlpaydin, “Introduction to Machine Learning”, MIT Press • Steven Bird, Ewan Klein and Edward Loper, “Natural Language Processing with Python”, O’Reilly Media. • Stephen Marsland, “Machine Learning an Algorithmic Perspective” CRC Press Reference Books • William W. Hsieh, “Machine Learning Methods in the Environmental Sciences”, Cambridge • Grant S. Ingersoll, Thomas S. Morton, Andrew L. Farris, “Tamming Text”, Manning Publication Co. 94 CU IDOL SELF LEARNING MATERIAL (SLM)
UNIT - 6: NAIVE BAYES Structure 6.0. LearningObjectives 6.1. Introduction 6.2. Concepts and mechanism 6.3. Training Bayesian Belief Networks 6.4. Application of Naïve Bayes’ algorithm 6.5. Summary 6.6. Keywords 6.7. Learning Activity 6.8. Unit End Questions 6.9. References 6.0 LEARNING OBJECTIVES After studying this unit, you will be able to: • Describe the basics of Naïve Bayes’ theorem • Illustrate the implementation of Naïve Bayes’ theorem • Apply the Naïve Bayes’ algorithm for solving real world problems • Familiarize various types of naïve bayes algorithm 6.1 INTRODUCTION The Naive Bayesian classifier is based on Bayes’ theorem with the independence assumptions between predictors. Let us go through some of the simple concepts of probability that we will use. Consider the following example of tossing two coins. If we toss two coins and look at all the different possibilities, we have the sample space as:{HH, HT, TH, TT} While calculating the math on probability, we usually denote probability as P. Some of the probabilities in this event would be as follows: The probability of getting two heads = 1/4 95 CU IDOL SELF LEARNING MATERIAL (SLM)
The probability of at least one tail = 3/4 The probability of the second coin being head given the first coin is tail = 1/2 The probability of getting two heads given the first coin is a head = 1/2 The Bayes’ theorem gives us the conditional probability of event A, given that event B has occurred. In this case, the first coin toss will be B and the second coin toss A. This could be confusing because we've reversed the order of them and go from B to A instead of A to B. According to Bayes’ theorem: Let us apply Bayes’ theorem to our coin example. Here, we have two coins, and the first two probabilities of getting two heads and at least one tail are computed directly from the sample space. Now in this sample space, let A be the event that the second coin is head, and B be the event that the first coin is tails. Again, we reversed it because we want to know what the second event is going to be. We're going to focus on A, and we write that out as a probability of A given B: Probability = P(A|B) = [ P(B|A) * P(A) ] / P(B) = [ P(First coin being tail given the second coin is the head) * P(Second coin being head) ] / P(First coin being tail) = [ (1/2) * (1/2) ] / (1/2) 96 = 1/2 = 0.5 CU IDOL SELF LEARNING MATERIAL (SLM)
Bayes’ theorem calculates the conditional probability of the occurrence of an event based on prior knowledge of conditions that might be related to the event. Example: You are planning a picnic today, but the morning is cloudy. What is the chance of rain during the day? The three possibilities are: Oh no! 50% of all rainy days start off cloudy! But cloudy mornings are common (about 40% of days start cloudy) And this is usually a dry month (only 3 of 30 days tend to be rainy or 10%) We will use Rain to mean rain during the day, and Cloud to mean cloudy morning. The chance of Rain given Cloud is written P(Rain|Cloud) So let's put that in the formula: P(Rain|Cloud) = (P(Rain) P(Cloud|Rain)) / P(Cloud) P(Rain) is Probability of Rain = 10% P(Cloud|Rain) is Probability of Cloud, given that Rain happens = 50% P(Cloud) is Probability of Cloud = 40% P(Rain|Cloud) = 0.1 x 0.50.4 = .125 Or a 12.5% chance of rain. Not too bad, let's have a picnic! 6.2 CONCEPT AND MECHANISIM We will introduce the main concepts regarding Navive Bayes algorithm, by studying an example: Let’s consider the case of two colleagues that work in the same office: Alice and Bruno. And we know that: Alice comes to the office 3 days a week. Bruno comes to the office 1days a week. This will be our ‘prior’ information. We are at the office and we see passing across us someone very fast, so fast that we don’t know who the person is: Alice or Bruno. 97 CU IDOL SELF LEARNING MATERIAL (SLM)
Given the information that we have until know and assuming that they only work 4 days a week, the probabilities of the person seen to be either Alice or Bruno are: P(Alice) = 3/4 = 0.75 P(Bruno) = 1/4 = 0.25 When we saw the person passing by, we saw that he/she was wearing a red jacket. We also know the following: Alice wears red 2 times a week. Bruno wears red 3 times a week. So for every workweek, that has 5 days, we can infer the following: The probability of Alice to wear red is → P(Red|Alice) = 2/5 = 0.4 The probability of Bruno to wear red is → P(Red|Bruno) = 3/5 = 0.6 These new probabilities will be the ‘posterior’ information. Initially, we knew the probability of P(Alice) and P(Bruno), and later we inferred the probabilities of P(Red|Alice) and P(Red|Bruno). So, the real probabilities are: Types of Naïve Bayes Classifier: Multinomial Naïve Bayes: Feature vectors represent the frequencies with which certain events have been generated by a multinomial distribution. This is the event model typically 98 CU IDOL SELF LEARNING MATERIAL (SLM)
used for document classification. Bernoulli Naïve Bayes: In the multivariate Bernoulli event model, features are independent booleans (binary variables) describing inputs. Like the multinomial model, this model is popular for document classification tasks, where binary term occurrence(i.e. a word occurs in a document or not) features are used rather than term frequencies(i.e. frequency of a word in the document). Gaussian Naïve Bayes: In Gaussian Naïve Bayes, continuous values associated with each feature are assumed to be distributed according to a Gaussian distribution(Normal distribution). When plotted, it gives a bell-shaped curve which is symmetric about the mean of the feature values. Strengths and Weaknesses of Naive Bayes Classifier: The main strengths are: • Easy and quick way to predict classes, both in binary and multiclass classification problems. • In the cases that the independence assumption fits, the algorithm performs better compared to other classification models, even with less training data. • The decoupling of the class conditional feature distributions means that each distribution can be independently estimated as a one dimensional distribution. This helps with problems derived from the curse of dimensionality and improve the performance. Whereas the main disadvantages of using this method are: • Although they are pretty good classifiers, Naive Bayes’ are known to be poor estimators. So the probability that outputs from it shouldn’t be taken very seriously. • The Naive assumption of independence is very unlikely to match real-world data. • When the test data set has a feature that has not been observed in the training set, the model will assign a 0 probability to it and will be useless to make predictions. One of the main methods to avoid this drawback is the smoothing technique. 6.3 TRAINING BAYESIAN BELIEF NETWORKS So far, we learned what the Naive Bayes’ algorithm is, how the Bayes’ theorem is related to 99 CU IDOL SELF LEARNING MATERIAL (SLM)
it, and what the expression of the Bayes’ theorem for this algorithm is. Let us take a simple example to understand the functionality of the algorithm. Suppose we have a training data set consisting of 1200 fruits. The features in the data set are these: is the fruit yellow or not, is the fruit long or not, and is the fruit sweet or not. There are three different classes: mango, banana, and others. Step 1: Create a frequency table for all the features against the different classes. Name Yellow Sweet Long Total Mango 350 450 0 650 Banana 400 300 350 400 Others 50 100 50 150 Total 800 850 400 1200 What can we conclude from the above table? • Out of 1200 fruits, 650 are mangoes, 400 are bananas, and 150 are others. • 350 of the total 650 mangoes are yellow and the rest are not and so on. • 800 fruits are yellow, 850 are sweet and 400 are long from a total of 1200 fruits. Let’s say you are given with a fruit which is yellow, sweet, and long and you have to check the class to which it belongs. Step 2: Draw the likelihood table for the features against the classes. Name Yellow Sweet Long Total Mango 350/800=P(Mango|Yellow) 450/850 0/400 650/1200=P(Mango) Banana 400/800 300/850 350/400 400/1200 Others 50/800 100/850 50/400 150/1200 Total 800=P(Yellow) 850 400 1200 100 CU IDOL SELF LEARNING MATERIAL (SLM)