Survey of Classification Algorithms and Various Model Selection Methods

Vishal Sharma ([email protected])
Department of Physics, Indian Institute of Technology Delhi, Hauz Khas, New Delhi-110016, India

Abstract

This report describes in a comprehensive manner the various types of classification algorithms that already exist. I will mainly be discussing and comparing in detail the seven major types of classification algorithms. The comparison is essentially based on the objective function, assumptions, advantages and drawbacks of each.

Keywords: Naive Bayes Classifier, Support Vector Machines (SVM), Linear Classifier, k-Nearest Neighbours (k-NN), Artificial Neural Networks (ANNs), Quadratic Classifier

1. Introduction

Classification is an important tool for the analysis of statistical problems. In machine learning and statistics, classification refers to the problem of identifying whether an object belongs to a particular category based on a previously learned model. This model is learned statistically from a set of training data whose categorization is predefined; this method is known in machine learning as supervised learning. There are various classification techniques employed for learning these statistical models so as to obtain the best possible result on the test data. Each technique takes a different approach to the categorization problem and needs to be selected carefully according to the needs of the problem at hand. In this paper, I discuss some of the most common classification algorithms, various scoring schemes used to measure their performance, multi-class classification techniques and, lastly, model selection techniques.

2. Survey of Various Classification Algorithms

2.1 Linear Classifiers

Linear classifiers are a sub-type of classification algorithms that use a linear combination of the characteristics of an object to decide which category to place the object into. In general, these characteristics are referred to as feature vectors; they provide the information needed to assign the object in question to a class.
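As an illustration, a linear classifier's decision reduces to a sign test on the weighted sum of the features. The following is a minimal sketch; the weights, bias and feature values are made up purely for illustration and are not taken from the report:

```python
import numpy as np

def linear_decision(w, b, x):
    """Assign class 1 if the linear score w.x + b is positive, else class 0."""
    score = np.dot(w, x) + b
    return 1 if score > 0 else 0

# Illustrative weights, bias and feature vector (not from the paper's data).
w = np.array([0.8, -0.4, 1.5])
b = -0.2
x = np.array([1.0, 2.0, 0.5])
print(linear_decision(w, b, x))  # 1, since 0.8 - 0.8 + 0.75 - 0.2 = 0.55 > 0
```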

2.1.1 Logistic Regression

Logistic regression is a type of linear classifier. It is used to predict whether a given object lies in class '1' or class '0'. This is the most common setting for a logistic regression classifier, in which the dependent variable is binary. Logistic regression can also be used when the dependent variable takes more than two values; in those cases we use multinomial logistic regression. The basis of classification in logistic regression is the logistic function, which turns the weighted feature vector into a probability:

    h_W(X) = 1 / (1 + e^{-W^T X})                                                        (1)

In the above equation the independent variable X is the feature vector, each component of which is an independent characteristic of the object, together with a bias term. The total input into the logistic function can thus be written as the weighted sum

    z = W^T X                                                                            (2)

where W is the vector of associated weights, inclusive of the bias term, and X is the n-dimensional feature vector augmented with a constant 1:

    W = (w_n, w_{n-1}, ..., w_2, w_1, b)                                                 (3)
    X = (x_n, x_{n-1}, ..., x_2, x_1, 1)                                                 (4)

For logistic regression the objective function is the negative log-likelihood. This is the function that is optimized in order to learn the decision boundary associated with the classification. The likelihood function is defined as

    L(W; x) = Pr(Y | X; W)                                                               (5)

which is the product of the individual probabilities, since the observations are assumed to be independent of each other. The actual objective function, the negative log-likelihood, is then

    -log L(W; x) = - Σ_{i=1}^{N} log Pr(y_i | x_i; W)                                    (6)

    -log L(W; x) = - Σ_{i=1}^{N} [ y_i log h_W(x_i) + (1 - y_i) log(1 - h_W(x_i)) ]      (7)

Learning in logistic regression corresponds to minimizing the above function, which is also known as the cost function. As already mentioned, the algorithm relies on certain assumptions for ease of implementation; these are listed after the short sketch below.
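A minimal sketch of the logistic function of Equation (1) and the cost of Equation (7), assuming NumPy; the data and all names are illustrative placeholders rather than anything used in the report:

```python
import numpy as np

def sigmoid(z):
    """Logistic function of Equation (1)."""
    return 1.0 / (1.0 + np.exp(-z))

def negative_log_likelihood(W, X, y):
    """Cost of Equation (7); X carries a trailing column of ones for the bias."""
    h = sigmoid(X @ W)
    eps = 1e-12  # guard against log(0)
    return -np.sum(y * np.log(h + eps) + (1 - y) * np.log(1 - h + eps))

# Illustrative data: 4 observations, 2 features plus the bias column.
X = np.array([[0.5, 1.2, 1.0],
              [1.5, 0.3, 1.0],
              [2.0, 2.5, 1.0],
              [0.1, 0.4, 1.0]])
y = np.array([0, 1, 1, 0])
W = np.zeros(3)
print(negative_log_likelihood(W, X, y))  # 4 * log(2) ≈ 2.77 at W = 0
```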

The assumptions are the following:

• The observations are assumed to be independent of each other. This lets us write the probability of the measurements as a product of individual probabilities in the likelihood function.
• The correlation amongst the independent variables should be fairly low.
• The relationship between the log-odds and the independent variables is required to be linear.

The main advantages and disadvantages of logistic regression are the following. It does not assume a linear relationship between the independent and the dependent variable, and neither the independent nor the dependent variables need to be normally distributed. Handling non-linear effects is possible. On the other hand, it requires comparatively large amounts of training data to achieve meaningful results.

2.1.2 Naive Bayes Classifier

Naive Bayes classifiers are a family of probabilistic classifiers based on Bayes' theorem with a strong independence assumption between the features. They are popular in practice for text categorization, for instance the classification of spam versus non-spam emails, with the word frequencies serving as the feature vector. Despite their oversimplified design they find good application in many real-world problems. Naive Bayes is a conditional probability model which assigns the class C_k to a feature vector x with probability

    Pr(C_k | x_1, x_2, ..., x_n)                                                         (8)

This model calculates the conditional probability of each class k given the feature vector. If the feature vector is high-dimensional, estimating this probability directly is infeasible, so we rewrite the probability of Equation (8) using Bayes' theorem:

    Pr(C_k | x) = Pr(C_k) Pr(x | C_k) / Pr(x)                                            (9)

Since the denominator does not depend on C, and the values of the feature vector are given, it is a constant and we can focus on the numerator. The numerator can be expanded using the chain rule:

    Pr(C_k, x_1, x_2, ..., x_n) = Pr(x_1, x_2, ..., x_n, C_k)                            (10)

    Pr(C_k, x_1, ..., x_n) = Pr(x_1 | x_2, ..., x_n, C_k) Pr(x_2 | x_3, ..., x_n, C_k) ... Pr(x_n | C_k) Pr(C_k)   (11)

Under the naive independence assumption each factor reduces to Pr(x_i | C_k), and the model can finally be expressed as

    Pr(C_k | x_1, ..., x_n) ∝ Pr(C_k) Π_{i=1}^{n} Pr(x_i | C_k)                          (12)
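A minimal sketch of the decision rule implied by Equation (12) for categorical features, using log-probabilities for numerical stability; the probability tables below are illustrative assumptions, not taken from the report:

```python
import numpy as np

def naive_bayes_predict(log_prior, log_likelihood, x):
    """Pick argmax_k [ log Pr(C_k) + sum_i log Pr(x_i | C_k) ]  (Equation (12))."""
    n_classes = log_prior.shape[0]
    scores = np.array([log_prior[k] + sum(log_likelihood[k][i][x_i]
                                          for i, x_i in enumerate(x))
                       for k in range(n_classes)])
    return int(np.argmax(scores))

# Illustrative model: 2 classes, 2 binary features.
log_prior = np.log(np.array([0.6, 0.4]))
log_likelihood = np.log(np.array([  # indexed as [class][feature][feature value]
    [[0.8, 0.2], [0.7, 0.3]],
    [[0.3, 0.7], [0.4, 0.6]],
]))
print(naive_bayes_predict(log_prior, log_likelihood, x=[1, 0]))  # -> 1
```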

The objective function is the posterior probability, Pr(C_k | x), which is maximized during the process. The likelihood Pr(x | C_k) is composed first; the Bayes formula above is then used to calculate the posterior probability, and the class with the highest posterior probability is the prediction.

This technique assumes independence among the feature variables, which can be considered a disadvantage of the classifier: for much real data the independence assumption breaks down, and the results can be meaningless or untrustworthy. However, the classifier is simple and fast and performs well in multi-class classification. When the independence assumption does hold within the data, it performs better than many other classification algorithms and needs less training data.

2.2 Quadratic Classifier

As the name suggests, a quadratic classifier learns a more general, more complex decision boundary for the separation of two or more classes. It is a generalisation of the linear classifier. The decision surface is assumed to be quadratic in nature, with the predicted label y depending on the term

    x^T A x + W^T x + b                                                                  (13)

The technique used to calculate this more complex quadratic decision boundary is Quadratic Discriminant Analysis (QDA). The learning rule is simple: we model a quadratic discriminant function, given below, which needs to be maximized:

    δ_k(x) = -(1/2) log |Σ_k| - (1/2) (x - µ_k)^T Σ_k^{-1} (x - µ_k) + log π_k            (14)

In the above equation Σ_k is the class-specific covariance matrix, which is not assumed to be identical across classes. Learning is based on the optimization of this function, and the classification rule can be represented mathematically as

    G(x) = argmax_k δ_k(x)                                                               (15)

The discriminant δ_k(x) is the objective function and needs to be maximized: we determine the class k for which δ_k(x) is largest. The advantages of QDA are that it allows more flexibility and tends to fit the data well. On the other hand, it requires more parameters to be estimated: since there is a distinct covariance matrix for every class, the problem can become infeasible when there are many classes but not many data points. QDA is useful when the training set is big enough that reducing the variance of the estimates is not a concern. The classifier assumes that the observations within each class follow a Gaussian distribution; Gaussian Discriminant Analysis (GDA) is one example of this kind.
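A minimal sketch of the discriminant of Equations (14) and (15), assuming the class means, covariances and priors have already been estimated from training data; the numbers below are illustrative only:

```python
import numpy as np

def qda_discriminant(x, mu, sigma, prior):
    """delta_k(x) of Equation (14) for a single class."""
    diff = x - mu
    sign, logdet = np.linalg.slogdet(sigma)
    return -0.5 * logdet - 0.5 * diff @ np.linalg.inv(sigma) @ diff + np.log(prior)

def qda_predict(x, mus, sigmas, priors):
    """Classification rule of Equation (15): argmax_k delta_k(x)."""
    scores = [qda_discriminant(x, mu, s, p) for mu, s, p in zip(mus, sigmas, priors)]
    return int(np.argmax(scores))

# Illustrative two-class model in two dimensions.
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
sigmas = [np.eye(2), 2.0 * np.eye(2)]
priors = [0.5, 0.5]
print(qda_predict(np.array([2.5, 2.8]), mus, sigmas, priors))  # -> 1
```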

2.3 Artificial Neural Networks (ANNs)

Artificial neural networks are another approach to solving machine learning problems, akin to the way a human brain processes information. Their structure consists of artificial neurons and the corresponding weights that serve as the connections between neurons. The fundamental working principle of ANNs is to learn the relationship between the dependent and the independent variables by adjusting the weights between the neurons. The objective function in the case of a neural network is the sum-of-squares error function, which tells the network how far the output diverges from the expected result. This information about the error is used to readjust the weights so as to further reduce the error function. The aim of the learning process is to evaluate the error function at each iteration and readjust the weights in order to reach a local minimum.

A layer in a neural network consists of the neurons and their corresponding weights residing in that layer. Every neural network has three types of layers: the input layer, the hidden layers and the output layer. Each neuron first computes the weighted input W^T x and then applies an activation function to it to determine its output. Various activation functions are used, for instance the ReLU function, the signum function and, most commonly, the sigmoid function 1 / (1 + e^{-x}).

The most common type of neural network is the feed-forward neural network, in which the connections only propagate forward, between the neurons of layer n and layer n+1, with no backward connections. For learning, the error function is calculated for the current weights, and the weights are then readjusted to decrease the error, moving towards the minimum; the backpropagation algorithm is used to implement this error correction. The error for a feed-forward neural network forms the objective function that is minimized using gradient descent. The sum-of-squares error function is

    E_n = (1/2) Σ_{k=1}^{K} (y_k - t_k)^2                                                (16)

This error function satisfies the criterion for Lyapunov stability and hence is called a Lyapunov function. It is easy to optimize with a gradient descent technique for the adjustment of the weights.

The main advantages and disadvantages of neural networks are the following. ANNs are easy to use; they can fit any function well regardless of its non-linearity, and they perform well even when the input data is noisy. They find their best use in complex problems such as image recognition. On the other hand, ANNs are often used in places where simple linear regression would suffice, they require a great amount of data to be trained sufficiently well, they are essentially a black box that does not allow the details of the fit to be studied, and increasing the accuracy by a few percent may require the size of the network to be scaled up considerably.
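A minimal sketch of one forward pass through a single-hidden-layer feed-forward network and the error of Equation (16); the weights are random and illustrative, and this is not a full backpropagation implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, W2):
    """One forward pass: hidden layer then output layer, sigmoid activations."""
    hidden = sigmoid(W1 @ x)
    return sigmoid(W2 @ hidden)

def sum_of_squares_error(y, t):
    """Equation (16): half the squared distance between output y and target t."""
    return 0.5 * np.sum((y - t) ** 2)

# Illustrative network: 3 inputs, 4 hidden neurons, 2 outputs.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(2, 4))
x = np.array([0.2, -1.0, 0.5])
t = np.array([1.0, 0.0])
y = forward(x, W1, W2)
print(sum_of_squares_error(y, t))
```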

2.4 Support Vector Machine (SVM)

Support vector machines are a set of supervised learning models. The model is trained on a set of training data points belonging to either of the two classes of a binary classifier. Based on this, the SVM applies the learned model to new data points from the test sample and places them into one of the two classes; the SVM is thus a non-probabilistic binary classifier. The idea behind the classification is to use an (n-1)-dimensional hyperplane to linearly separate n-dimensional feature vectors into two classes. Since there can be infinitely many hyperplanes that separate the data points (given that they are linearly separable), we choose the hyperplane, or linear classifier, that has the maximum margin. The margin is defined as the distance from the hyperplane to the nearest data points of each class. This distance is the objective function, and the aim of the optimization is to maximize it.

SVMs possess a regularization parameter, which helps keep the optimization algorithm from overfitting the data. They can also fit the data non-linearly by using a technique called the kernel trick, and they are based on a convex optimization problem for which very efficient solution methods exist. A disadvantage is that the learned parameters correspond to a particular choice of regularization value and kernel.

2.5 k-Nearest Neighbours (k-NN)

The k-nearest neighbours algorithm is an instance-based classification technique, also referred to as lazy learning. The basic idea is very simple, although the learned model depends on k: for different values of k a different model is learned. The algorithm assigns a class to a data point based on the classes of its k nearest neighbours, picking the class that is most common among the k neighbours of the data point in question. When k = 1 the problem becomes trivial and the classification is based solely on the class of the nearest neighbour: a new data point is classified into the class of its nearest neighbouring point.
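A minimal sketch of this k-NN decision rule, assuming Euclidean distance and NumPy; the training points are illustrative:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Assign x the most common class among its k nearest training points."""
    distances = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(distances)[:k]
    votes = Counter(y_train[nearest])
    return votes.most_common(1)[0][0]

# Illustrative training data: two clusters around (0, 0) and (3, 3).
X_train = np.array([[0.0, 0.1], [0.2, -0.1], [0.1, 0.3],
                    [3.0, 3.1], [2.8, 2.9], [3.2, 3.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, np.array([2.5, 2.7]), k=3))  # -> 1
```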

3. Summary of Various Scoring Methods for Classification

3.1 Accuracy

This is the most common scoring method and essentially the most misused one. It is viable only for certain types of classification problems. It is calculated as the number of correct predictions (both correct classifications into a class and correct rejections from it) divided by the total number of predictions. It is appropriate when the problem has roughly equal numbers of observations in each class and when correctly predicting that an observation belongs to the class and correctly predicting that it does not are equally important. Such problems certainly exist, but for other types of problems this metric is not a good choice for performance evaluation.

To make this clearer, consider an example from the medical diagnosis of a disease X. Assume that about 10% of a population suffers from this disease, and that we design a faulty medical device for diagnosing the disease in a random person from this population. Our device, by virtue of being faulty, outputs negative for everyone, irrespective of whether they have the condition or not. In such a scenario, if we choose accuracy as our scoring method, we obtain an accuracy of 90% over the test population, even though the device failed completely and did not diagnose the 10% of people actually suffering from the disease. Mathematically, accuracy can be written as

    Accuracy = (TP + TN) / (TP + TN + FP + FN)                                           (17)

where TP/TN are true positives/negatives, FP/FN are false positives/negatives, and the total number of predictions is TP + TN + FP + FN.

3.2 Precision

Precision is defined as the correctly predicted members of a class divided by all the points predicted to be in that class. High precision means the model rarely places points that do not belong to a class into that class.

    Precision = TP / (TP + FP)                                                           (18)

with the abbreviations having their usual meaning.

3.3 Recall

Recall is defined as the correctly predicted members of a class divided by all the points that actually belong to that class. It tells us how many objects that truly belong to the class in question end up unclassified or classified outside that class.

    Recall = TP / (TP + FN)                                                              (19)

with the abbreviations having their usual meaning.

3.4 F1-Score

The F1 score is another metric, and one which at first sight is harder to understand intuitively. It brings in both the false positives and the false negatives to weigh the error in decision making, and is defined as the harmonic mean of precision and recall. Ideally, we would like to recover all the true positive observations that exist for a particular class while being careful to omit all those that do not belong to it. If we could do that, we would have both high precision and high recall, and consequently a high F1 score. Importantly, even if the precision is remarkably high, a low recall will dominate and necessarily bring the F1 score down, and vice versa.

    2 / F1 = 1 / Precision + 1 / Recall                                                  (20)

    F1 = 2 * Precision * Recall / (Precision + Recall)                                   (21)
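A minimal sketch of Equations (17) to (21) computed from predicted and true labels for a single positive class; the label arrays are illustrative:

```python
import numpy as np

def binary_scores(y_true, y_pred):
    """Accuracy, precision, recall and F1 for the positive class (label 1)."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    accuracy = (tp + tn) / (tp + tn + fp + fn)            # Equation (17)
    precision = tp / (tp + fp) if tp + fp else 0.0        # Equation (18)
    recall = tp / (tp + fn) if tp + fn else 0.0           # Equation (19)
    f1 = (2 * precision * recall / (precision + recall)   # Equation (21)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])
print(binary_scores(y_true, y_pred))  # accuracy, precision, recall and F1 are all 0.75 here
```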

3.5 Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC)

The receiver operating characteristic (ROC) is a graphical plot that characterizes a model's classification quality as its discrimination threshold is varied. For instance, to map a probabilistic output to either Class 1 or Class 0 we may decide that if the probability of belonging to Class 1 is greater than 0.5 the point is classified into Class 1, and otherwise into Class 0; the discrimination threshold here is 0.5. We can change this threshold and see how the true positives and the false positives change as a result. The plot over all such thresholds is the ROC curve, and the area under this curve (AUC) can be used as a metric for evaluating the model. Since the curve plots the true positive rate against the false positive rate, a large area under the curve corresponds to high efficacy. A commonly used grading of classifier quality based on the AUC is:

• 0.9 - 1.0 : Excellent
• 0.8 - 0.9 : Good
• 0.7 - 0.8 : Fair
• 0.6 - 0.7 : Poor
• 0.5 - 0.6 : Fail

4. Multi-class Classification Strategies

In general there are three kinds of multi-class classification strategies:

• Extension from binary
• Transformation to binary
• Hierarchical classification

4.1 Transformation to Binary

4.1.1 One-versus-Rest Classification

When there are more than two classes into which a data point can be classified, we can reduce the problem to a series of binary classifications, one for each class. The algorithm trains a single classifier per class, one at a time, where the data points belonging to that class are labelled Class 1 and all other points are labelled Class 0; the same method is then applied to the remaining classes to learn their respective classifiers. The implementation requires the model to output a probability-based confidence score rather than a discrete class label, because assigning discrete class labels directly in a multi-class setting can lead to meaningless results, for instance a single data point being classified into more than one class. Classification of real or test data with the learned model is performed by applying the learned classifiers of all the classes and obtaining a confidence score for each class; the data point is then assigned to the class k with the highest confidence score.
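A minimal sketch of one-versus-rest prediction, assuming each per-class scorer returns a confidence; the scorers here are simple logistic models with illustrative weights, not the classifiers trained in Section 6:

```python
import numpy as np

def one_vs_rest_predict(x, weights, biases):
    """Apply each class's binary scorer and return the class with the highest confidence."""
    scores = [1.0 / (1.0 + np.exp(-(w @ x + b)))  # logistic confidence per class
              for w, b in zip(weights, biases)]
    return int(np.argmax(scores))

# Illustrative 3-class problem with 2 features; one weight vector and bias per class.
weights = [np.array([1.0, -0.5]), np.array([-1.0, 1.0]), np.array([0.2, 0.2])]
biases = [0.0, -0.5, 0.1]
print(one_vs_rest_predict(np.array([0.5, 2.0]), weights, biases))  # -> 1
```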

4.1.2 One-versus-One Classification

The one-versus-one strategy is another technique that can be employed to learn a multi-class classification relationship. Here, all possible pairs of the k classes are learnt one by one, producing k(k-1)/2 classifiers, each corresponding to a binary classification subproblem of the original multi-class problem and learnt to differentiate one class from one other class. When classifying an unseen or test data point, all the classifiers are applied to it, and the class that receives the highest number of predictions across all k(k-1)/2 classifiers is the class assigned to the data point.

5. Model Selection Techniques

Model selection is an important aspect of any machine learning solution and ideally the first issue to be tackled. Given the plethora of machine learning algorithms to choose from, we need to select the algorithm that best suits the problem at hand before we start the analysis of the data provided. Essentially, we perform this analysis of algorithms because we wish to evaluate the prediction performance of our model. This is particularly important because the final aim is to maximize this performance for a model that gives meaningful predictions when applied to real-world data. We therefore evaluate based on the following criteria:

• Estimating the performance of the algorithm to get an idea of the quality of the results.
• Given the knowledge of the performance, improving it by tweaking different parameters and consequently selecting the best hypothesis function from the hypothesis space.
• Selecting the best-performing machine learning algorithm and further selecting the best-performing hypothesis/model from its subspace.

Our objective is to select the optimum mix of the above conditions so as to make the learned model perform with the maximum possible accuracy on future real-world data.

5.1 Hyper-parameters versus Learnable Parameters

Learnable parameters, or simply parameters, are variables that are an intrinsic property of a model. Their values are estimated from the data.

They are learned iteratively using the training data and are very much a part of the learned model; they are of utmost importance for making accurate predictions on future data. These parameters are learned by the algorithm, not set manually by the user, and are estimated using what are called optimization algorithms. A few examples include:

• The weights corresponding to the neurons in an ANN.
• The W matrix (or vector) in the case of logistic regression.
• The support vectors in the case of SVMs.

Hyper-parameters, on the other hand, are variables that are extrinsic to the model and whose values cannot be determined from the data. They are usually specified manually by the user and are tuned for a model corresponding to a particular problem. There is no general method for calculating the best possible value of a hyper-parameter; an optimum value is found by trial and error. Hyper-parameters often appear in the procedures designed to estimate the learnable parameters: for various types of models there is no analytical formula for certain quantities, and hyper-parameters are used in exactly such cases. Some examples include:

• The learning rate of the optimization techniques used by various algorithms, such as neural networks.
• The value of k in k-nearest neighbours.

6. Implementation of the Logistic Regression Algorithm

The data set provided for the implementation of this algorithm is divided into two sets, the training data and the test data. The aim is to train a classifier based on multi-class logistic regression implemented via a transformation to binary logistic regression; I followed the one-vs-all approach for the modelling of the given problem. The problem is a supervised learning problem with 10 distinct classes, ranging from 0 to 9. Using the one-vs-all approach the model learns the optimum parameters of each decision boundary one at a time. There are 10 decision boundaries to be learnt for the complete classification of the data points, and each decision boundary, with its corresponding optimized parameters (in essence, the weights), separates one class from the rest of the data points.

Classification via logistic regression is a linear classification technique, as already discussed in the preceding sections. The aim of the algorithm is to learn k distinct (n-1)-dimensional hyperplanes in the n-dimensional space in which the feature vectors are defined. Out of the infinitely many possibilities for each hyperplane, we need to choose the one defined by the set of parameters that is optimized with respect to a cost function. We say that the values of the parameters are optimized when the cost function has attained a local minimum (not necessarily the global minimum, as that cannot be ensured). In the case of binary logistic regression this cost function is the negative log-likelihood; details of the likelihood function and its optimization were discussed in Section 2.1.1.
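A hedged sketch of such a one-vs-all trainer, reusing the logistic cost of Section 2.1.1 with plain batch gradient descent; all function names, the toy data, the learning rate and the convergence threshold below are illustrative placeholders rather than the exact code used for the reported results:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_one_vs_all(X, y, n_classes, lr=1e-3, tol=1e-6, max_iter=10000):
    """Learn one weight vector per class; X already carries a bias column of ones."""
    W = np.zeros((n_classes, X.shape[1]))
    for k in range(n_classes):
        yk = (y == k).astype(float)   # class k versus the rest
        prev_cost = np.inf
        for _ in range(max_iter):
            h = sigmoid(X @ W[k])
            cost = -np.mean(yk * np.log(h + 1e-12) + (1 - yk) * np.log(1 - h + 1e-12))
            if abs(prev_cost - cost) < tol:      # convergence criterion
                break
            prev_cost = cost
            grad = X.T @ (h - yk) / len(yk)      # gradient of the negative log-likelihood
            W[k] -= lr * grad                    # weight update of Equation (22) below
    return W

def predict(W, X):
    """Assign each row of X to the class whose classifier gives the highest confidence."""
    return np.argmax(sigmoid(X @ W.T), axis=1)

# Tiny illustrative run: three well-separated clusters, bias column appended.
X = np.array([[0.0, 0.0, 1.0], [0.2, 0.1, 1.0],
              [3.0, 0.0, 1.0], [3.1, 0.2, 1.0],
              [0.0, 3.0, 1.0], [0.1, 3.2, 1.0]])
y = np.array([0, 0, 1, 1, 2, 2])
W = train_one_vs_all(X, y, n_classes=3, lr=0.5, max_iter=5000)
print(predict(W, X))
```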

Class   Test Data   Train Data
0       0.827       0.879
1       0.602       0.718
2       0.752       0.836
3       0.851       0.873
4       0.865       0.908
5       0.583       0.685
6       0.856       0.881
7       0.725       0.782
8       0.766       0.768
9       0.797       0.808

Table 1: Class-wise F1 scores on test and train data

The algorithm tries to minimize the cost function and correspondingly updates the weights W after every successive iteration, according to the equation below:

    W(k+1) = W(k) - η ∂(CostFunction)/∂W                                                 (22)

These iterations repeat until the cost function converges to a local minimum. The weights W obtained through this procedure are described as an optimum set, and the learning process is repeated until all decision boundaries are optimized. The η in the above equation is a hyper-parameter which needs manual tuning; it is a crucial aspect of the model, and tweaking its value changes the efficiency of the learned model. Too large a learning rate can drive the optimization algorithm away from the vicinity of a local minimum, resulting in divergence; too small a value can fail to converge to the local minimum and remain stuck in the loop. As a result, we need to tune the learning rate to the value that yields the best efficiency.

As discussed above, different scoring methods are the right choice for certain types of problems but can be drastically misleading for others; I believe the F1 score gives a good idea of the efficiency in this case. Table 1 therefore lists the F1 scores on both the test and the train data for each class after learning the model with the learning rate set to 10^-3 and the convergence criterion set to 10^-6. Table 2 shows the corresponding basic accuracy scores on the test and train data sets.

Class   Test Data   Train Data
0       96.46%      97.49%
1       92.62%      94.22%
2       94.45%      96.29%
3       96.94%      97.37%
4       97.37%      98.17%
5       93.45%      94.81%
6       96.77%      97.45%
7       94.63%      95.76%
8       95.54%      95.92%
9       95.57%      96.06%

Table 2: Class-wise accuracy scores on test and train data

The obvious next step is to experiment with the learning rate to obtain the value that maximizes the efficiency of the predictions. Below are the F1-score results on the test data for different values of the learning rate:

• The cost function diverged for η = 0.1.

• For η = 0.01, the cost function diverged for a few classes, while it converged for the rest.
• For η = 0.001, the scores are those already reported above (Tables 1 and 2).
• For η = 0.0001, the class-wise F1 scores on the test data are (0.764, 0.542, 0.659, 0.767, 0.740, 0.555, 0.669, 0.636, 0.631, 0.707).


