Survey of Classification Algorithms and Various Model Selection Methods

Vishal Sharma ([email protected])
Department of Physics, Indian Institute of Technology Delhi, Hauz Khas, New Delhi-110016, India

Abstract

This report describes in a comprehensive manner the various types of classification algorithms that already exist. I will mainly be discussing and comparing in detail the seven major types of classification algorithms. The comparison is essentially based on the objective function, assumptions, advantages and drawbacks of each.

Keywords: Naive Bayes Classifier, Support Vector Machines (SVM), Linear Classifier, k-Nearest Neighbours (k-NN), Artificial Neural Networks (ANNs), Quadratic Classifier

1. Introduction

Classification is an important tool for the analysis of statistical problems. In machine learning and statistics, classification refers to the problem of identifying whether an object belongs to a particular category based on a previously learned model. This model is learned statistically from a set of training data whose categorization is predefined; this method is known in machine learning as supervised learning. There are various classification techniques employed for learning these statistical models so as to obtain the best possible result on the test data. Each technique takes a different approach to the categorization problem and needs to be selected carefully according to the needs of the problem at hand. In this paper, I discuss some of the most common classification algorithms, various scoring schemes used to measure their performance, multi-class classification techniques and, lastly, model selection techniques.

2. Survey of Various Classification Algorithms

2.1 Linear Classifiers

Linear classifiers are a sub-type of classification algorithms that use a linear combination of the characteristics of an object to decide which category to place the object into. In general, these characteristics are referred to as feature vectors; they provide the information needed to assign the object in question to a class.
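As an illustration, a linear classifier's decision reduces to a sign test on the weighted sum of the features. The following is a minimal sketch; the weights, bias and feature values are made up purely for illustration and are not taken from the report:

```python
import numpy as np

def linear_decision(w, b, x):
    """Assign class 1 if the linear score w.x + b is positive, else class 0."""
    score = np.dot(w, x) + b
    return 1 if score > 0 else 0

# Illustrative weights, bias and feature vector (not from the paper's data).
w = np.array([0.8, -0.4, 1.5])
b = -0.2
x = np.array([1.0, 2.0, 0.5])
print(linear_decision(w, b, x))  # 1, since 0.8 - 0.8 + 0.75 - 0.2 = 0.55 > 0
```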

2.1.1 Logistic Regression

Logistic regression is a type of linear classifier. It is used to predict whether a given object lies in class '1' or class '0'. This is the most common setting for a logistic regression classifier, in which the dependent variable is binary. Logistic regression can also be used when the dependent variable takes more than two values; in those cases we use multinomial logistic regression. The basis of classification in logistic regression is the logistic function, which turns the weighted feature vector into a probability:

    h_W(X) = 1 / (1 + e^{-W^T X})                                                        (1)

In the above equation the independent variable X is the feature vector, each component of which is an independent characteristic of the object, together with a bias term. The total input into the logistic function can thus be written as the weighted sum

    z = W^T X                                                                            (2)

where W is the vector of associated weights, inclusive of the bias term, and X is the n-dimensional feature vector augmented with a constant 1:

    W = (w_n, w_{n-1}, ..., w_2, w_1, b)                                                 (3)
    X = (x_n, x_{n-1}, ..., x_2, x_1, 1)                                                 (4)

For logistic regression the objective function is the negative log-likelihood. This is the function that is optimized in order to learn the decision boundary associated with the classification. The likelihood function is defined as

    L(W; x) = Pr(Y | X; W)                                                               (5)

which is the product of the individual probabilities, since the observations are assumed to be independent of each other. The actual objective function, the negative log-likelihood, is then

    -log L(W; x) = - Σ_{i=1}^{N} log Pr(y_i | x_i; W)                                    (6)

    -log L(W; x) = - Σ_{i=1}^{N} [ y_i log h_W(x_i) + (1 - y_i) log(1 - h_W(x_i)) ]      (7)

Learning in logistic regression corresponds to minimizing the above function, which is also known as the cost function. As already mentioned, the algorithm relies on certain assumptions for ease of implementation; these are listed after the short sketch below.
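A minimal sketch of the logistic function of Equation (1) and the cost of Equation (7), assuming NumPy; the data and all names are illustrative placeholders rather than anything used in the report:

```python
import numpy as np

def sigmoid(z):
    """Logistic function of Equation (1)."""
    return 1.0 / (1.0 + np.exp(-z))

def negative_log_likelihood(W, X, y):
    """Cost of Equation (7); X carries a trailing column of ones for the bias."""
    h = sigmoid(X @ W)
    eps = 1e-12  # guard against log(0)
    return -np.sum(y * np.log(h + eps) + (1 - y) * np.log(1 - h + eps))

# Illustrative data: 4 observations, 2 features plus the bias column.
X = np.array([[0.5, 1.2, 1.0],
              [1.5, 0.3, 1.0],
              [2.0, 2.5, 1.0],
              [0.1, 0.4, 1.0]])
y = np.array([0, 1, 1, 0])
W = np.zeros(3)
print(negative_log_likelihood(W, X, y))  # 4 * log(2) ≈ 2.77 at W = 0
```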

The assumptions are the following:

• The observations are assumed to be independent of each other. This lets us write the probability of the measurements as a product of individual probabilities in the likelihood function.
• The correlation amongst the independent variables should be fairly low.
• The relationship between the log-odds and the independent variables is required to be linear.

The main advantages and disadvantages of logistic regression are the following. It does not assume a linear relationship between the independent and the dependent variable, and neither the independent nor the dependent variables need to be normally distributed. Handling non-linear effects is possible. On the other hand, it requires comparatively large amounts of training data to achieve meaningful results.

2.1.2 Naive Bayes Classifier

Naive Bayes classifiers are a family of probabilistic classifiers based on Bayes' theorem with a strong independence assumption between the features. They are popular in practice for text categorization, for instance the classification of spam versus non-spam emails, with the word frequencies serving as the feature vector. Despite their oversimplified design they find good application in many real-world problems. Naive Bayes is a conditional probability model which assigns the class C_k to a feature vector x with probability

    Pr(C_k | x_1, x_2, ..., x_n)                                                         (8)

This model calculates the conditional probability of each class k given the feature vector. If the feature vector is high-dimensional, estimating this probability directly is infeasible, so we rewrite the probability of Equation (8) using Bayes' theorem:

    Pr(C_k | x) = Pr(C_k) Pr(x | C_k) / Pr(x)                                            (9)

Since the denominator does not depend on C, and the values of the feature vector are given, it is a constant and we can focus on the numerator. The numerator can be expanded using the chain rule:

    Pr(C_k, x_1, x_2, ..., x_n) = Pr(x_1, x_2, ..., x_n, C_k)                            (10)

    Pr(C_k, x_1, ..., x_n) = Pr(x_1 | x_2, ..., x_n, C_k) Pr(x_2 | x_3, ..., x_n, C_k) ... Pr(x_n | C_k) Pr(C_k)   (11)

Under the naive independence assumption each factor reduces to Pr(x_i | C_k), and the model can finally be expressed as

    Pr(C_k | x_1, ..., x_n) ∝ Pr(C_k) Π_{i=1}^{n} Pr(x_i | C_k)                          (12)
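A minimal sketch of the decision rule implied by Equation (12) for categorical features, using log-probabilities for numerical stability; the probability tables below are illustrative assumptions, not taken from the report:

```python
import numpy as np

def naive_bayes_predict(log_prior, log_likelihood, x):
    """Pick argmax_k [ log Pr(C_k) + sum_i log Pr(x_i | C_k) ]  (Equation (12))."""
    n_classes = log_prior.shape[0]
    scores = np.array([log_prior[k] + sum(log_likelihood[k][i][x_i]
                                          for i, x_i in enumerate(x))
                       for k in range(n_classes)])
    return int(np.argmax(scores))

# Illustrative model: 2 classes, 2 binary features.
log_prior = np.log(np.array([0.6, 0.4]))
log_likelihood = np.log(np.array([  # indexed as [class][feature][feature value]
    [[0.8, 0.2], [0.7, 0.3]],
    [[0.3, 0.7], [0.4, 0.6]],
]))
print(naive_bayes_predict(log_prior, log_likelihood, x=[1, 0]))  # -> 1
```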

The objective function is the posterior probability, Pr(C_k | x), which is maximized during the process. The likelihood Pr(x | C_k) is composed first; the Bayes formula above is then used to calculate the posterior probability, and the class with the highest posterior probability is the prediction.

This technique assumes independence among the feature variables, which can be considered a disadvantage of the classifier: for much real data the independence assumption breaks down, and the results can be meaningless or untrustworthy. However, the classifier is simple and fast and performs well in multi-class classification. When the independence assumption does hold within the data, it performs better than many other classification algorithms and needs less training data.

2.2 Quadratic Classifier

As the name suggests, a quadratic classifier learns a more general, more complex decision boundary for the separation of two or more classes. It is a generalisation of the linear classifier. The decision surface is assumed to be quadratic in nature, with the predicted label y depending on the term

    x^T A x + W^T x + b                                                                  (13)

The technique used to calculate this more complex quadratic decision boundary is Quadratic Discriminant Analysis (QDA). The learning rule is simple: we model a quadratic discriminant function, given below, which needs to be maximized:

    δ_k(x) = -(1/2) log |Σ_k| - (1/2) (x - µ_k)^T Σ_k^{-1} (x - µ_k) + log π_k            (14)

In the above equation Σ_k is the class-specific covariance matrix, which is not assumed to be identical across classes. Learning is based on the optimization of this function, and the classification rule can be represented mathematically as

    G(x) = argmax_k δ_k(x)                                                               (15)

The discriminant δ_k(x) is the objective function and needs to be maximized: we determine the class k for which δ_k(x) is largest. The advantages of QDA are that it allows more flexibility and tends to fit the data well. On the other hand, it requires more parameters to be estimated: since there is a distinct covariance matrix for every class, the problem can become infeasible when there are many classes but not many data points. QDA is useful when the training set is big enough that reducing the variance of the estimates is not a concern. The classifier assumes that the observations within each class follow a Gaussian distribution; Gaussian Discriminant Analysis (GDA) is one example of this kind.
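A minimal sketch of the discriminant of Equations (14) and (15), assuming the class means, covariances and priors have already been estimated from training data; the numbers below are illustrative only:

```python
import numpy as np

def qda_discriminant(x, mu, sigma, prior):
    """delta_k(x) of Equation (14) for a single class."""
    diff = x - mu
    sign, logdet = np.linalg.slogdet(sigma)
    return -0.5 * logdet - 0.5 * diff @ np.linalg.inv(sigma) @ diff + np.log(prior)

def qda_predict(x, mus, sigmas, priors):
    """Classification rule of Equation (15): argmax_k delta_k(x)."""
    scores = [qda_discriminant(x, mu, s, p) for mu, s, p in zip(mus, sigmas, priors)]
    return int(np.argmax(scores))

# Illustrative two-class model in two dimensions.
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
sigmas = [np.eye(2), 2.0 * np.eye(2)]
priors = [0.5, 0.5]
print(qda_predict(np.array([2.5, 2.8]), mus, sigmas, priors))  # -> 1
```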

2.3 Artificial Neural Networks (ANNs)

Artificial neural networks are another approach to solving machine learning problems, akin to the way a human brain processes information. Their structure consists of artificial neurons and the corresponding weights that serve as the connections between neurons. The fundamental working principle of ANNs is to learn the relationship between the dependent and the independent variables by adjusting the weights between the neurons. The objective function in the case of a neural network is the sum-of-squares error function, which tells the network how far the output diverges from the expected result. This information about the error is used to readjust the weights so as to further reduce the error function. The aim of the learning process is to evaluate the error function at each iteration and readjust the weights in order to reach a local minimum.

A layer in a neural network consists of the neurons and their corresponding weights residing in that layer. Every neural network has three types of layers: the input layer, the hidden layers and the output layer. Each neuron first computes the weighted input W^T x and then applies an activation function to it to determine its output. Various activation functions are used, for instance the ReLU function, the signum function and, most commonly, the sigmoid function 1 / (1 + e^{-x}).

The most common type of neural network is the feed-forward neural network, in which the connections only propagate forward, between the neurons of layer n and layer n+1, with no backward connections. For learning, the error function is calculated for the current weights, and the weights are then readjusted to decrease the error, moving towards the minimum; the backpropagation algorithm is used to implement this error correction. The error for a feed-forward neural network forms the objective function that is minimized using gradient descent. The sum-of-squares error function is

    E_n = (1/2) Σ_{k=1}^{K} (y_k - t_k)^2                                                (16)

This error function satisfies the criterion for Lyapunov stability and hence is called a Lyapunov function. It is easy to optimize with a gradient descent technique for the adjustment of the weights.

The main advantages and disadvantages of neural networks are the following. ANNs are easy to use; they can fit any function well regardless of its non-linearity, and they perform well even when the input data is noisy. They find their best use in complex problems such as image recognition. On the other hand, ANNs are often used in places where simple linear regression would suffice, they require a great amount of data to be trained sufficiently well, they are essentially a black box that does not allow the details of the fit to be studied, and increasing the accuracy by a few percent may require the size of the network to be scaled up considerably.
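A minimal sketch of one forward pass through a single-hidden-layer feed-forward network and the error of Equation (16); the weights are random and illustrative, and this is not a full backpropagation implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, W2):
    """One forward pass: hidden layer then output layer, sigmoid activations."""
    hidden = sigmoid(W1 @ x)
    return sigmoid(W2 @ hidden)

def sum_of_squares_error(y, t):
    """Equation (16): half the squared distance between output y and target t."""
    return 0.5 * np.sum((y - t) ** 2)

# Illustrative network: 3 inputs, 4 hidden neurons, 2 outputs.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(2, 4))
x = np.array([0.2, -1.0, 0.5])
t = np.array([1.0, 0.0])
y = forward(x, W1, W2)
print(sum_of_squares_error(y, t))
```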

2.4 Support Vector Machine (SVM)

Support vector machines are a set of supervised learning models. The model is trained on a set of training data points belonging to either of the two classes of a binary classifier. Based on this, the SVM applies the learned model to new data points from the test sample and places them into one of the two classes; the SVM is thus a non-probabilistic binary classifier. The idea behind the classification is to use an (n-1)-dimensional hyperplane to linearly separate n-dimensional feature vectors into two classes. Since there can be infinitely many hyperplanes that separate the data points (given that they are linearly separable), we choose the hyperplane, or linear classifier, that has the maximum margin. The margin is defined as the distance from the hyperplane to the nearest data points of each class. This distance is the objective function, and the aim of the optimization is to maximize it.

SVMs possess a regularization parameter, which helps keep the optimization algorithm from overfitting the data. They can also fit the data non-linearly by using a technique called the kernel trick, and they are based on a convex optimization problem for which very efficient solution methods exist. A disadvantage is that the learned parameters correspond to a particular choice of regularization value and kernel.

2.5 k-Nearest Neighbours (k-NN)

The k-nearest neighbours algorithm is an instance-based classification technique, also referred to as lazy learning. The basic idea is very simple, although the learned model depends on k: for different values of k a different model is learned. The algorithm assigns a class to a data point based on the classes of its k nearest neighbours, picking the class that is most common among the k neighbours of the data point in question. When k = 1 the problem becomes trivial and the classification is based solely on the class of the nearest neighbour: a new data point is classified into the class of its nearest neighbouring point.
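A minimal sketch of this k-NN decision rule, assuming Euclidean distance and NumPy; the training points are illustrative:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Assign x the most common class among its k nearest training points."""
    distances = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(distances)[:k]
    votes = Counter(y_train[nearest])
    return votes.most_common(1)[0][0]

# Illustrative training data: two clusters around (0, 0) and (3, 3).
X_train = np.array([[0.0, 0.1], [0.2, -0.1], [0.1, 0.3],
                    [3.0, 3.1], [2.8, 2.9], [3.2, 3.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, np.array([2.5, 2.7]), k=3))  # -> 1
```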

3. Summary of Various Scoring Methods for Classification

3.1 Accuracy

This is the most common scoring method and essentially the most misused one. It is viable only for certain types of classification problems. It is calculated as the number of correct predictions (both correct classifications into a class and correct rejections from it) divided by the total number of predictions. It is appropriate when the problem has roughly equal numbers of observations in each class and when correctly predicting that an observation belongs to the class and correctly predicting that it does not are equally important. Such problems certainly exist, but for other types of problems this metric is not a good choice for performance evaluation.

To make this clearer, consider an example from the medical diagnosis of a disease X. Assume that about 10% of a population suffers from this disease, and that we design a faulty medical device for diagnosing the disease in a random person from this population. Our device, by virtue of being faulty, outputs negative for everyone, irrespective of whether they have the condition or not. In such a scenario, if we choose accuracy as our scoring method, we obtain an accuracy of 90% over the test population, even though the device failed completely and did not diagnose the 10% of people actually suffering from the disease. Mathematically, accuracy can be written as

    Accuracy = (TP + TN) / (TP + TN + FP + FN)                                           (17)

where TP/TN are true positives/negatives, FP/FN are false positives/negatives, and the total number of predictions is TP + TN + FP + FN.

3.2 Precision

Precision is defined as the correctly predicted members of a class divided by all the points predicted to be in that class. High precision means the model rarely places points that do not belong to a class into that class.

    Precision = TP / (TP + FP)                                                           (18)

with the abbreviations having their usual meaning.

3.3 Recall

Recall is defined as the correctly predicted members of a class divided by all the points that actually belong to that class. It tells us how many objects that truly belong to the class in question end up unclassified or classified outside that class.

    Recall = TP / (TP + FN)                                                              (19)

with the abbreviations having their usual meaning.

3.4 F1-Score

The F1 score is another metric, and one which at first sight is harder to understand intuitively. It brings in both the false positives and the false negatives to weigh the error in decision making, and is defined as the harmonic mean of precision and recall. Ideally, we would like to recover all the true positive observations that exist for a particular class while being careful to omit all those that do not belong to it. If we could do that, we would have both high precision and high recall, and consequently a high F1 score. Importantly, even if the precision is remarkably high, a low recall will dominate and necessarily bring the F1 score down, and vice versa.

    2 / F1 = 1 / Precision + 1 / Recall                                                  (20)

    F1 = 2 * Precision * Recall / (Precision + Recall)                                   (21)
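A minimal sketch of Equations (17) to (21) computed from predicted and true labels for a single positive class; the label arrays are illustrative:

```python
import numpy as np

def binary_scores(y_true, y_pred):
    """Accuracy, precision, recall and F1 for the positive class (label 1)."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    accuracy = (tp + tn) / (tp + tn + fp + fn)            # Equation (17)
    precision = tp / (tp + fp) if tp + fp else 0.0        # Equation (18)
    recall = tp / (tp + fn) if tp + fn else 0.0           # Equation (19)
    f1 = (2 * precision * recall / (precision + recall)   # Equation (21)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])
print(binary_scores(y_true, y_pred))  # accuracy, precision, recall and F1 are all 0.75 here
```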

3.5 Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC)

The receiver operating characteristic (ROC) is a graphical plot that characterizes a model's classification quality as its discrimination threshold is varied. For instance, to map a probabilistic output to either Class 1 or Class 0 we may decide that if the probability of belonging to Class 1 is greater than 0.5 the point is classified into Class 1, and otherwise into Class 0; the discrimination threshold here is 0.5. We can change this threshold and see how the true positives and the false positives change as a result. The plot over all such thresholds is the ROC curve, and the area under this curve (AUC) can be used as a metric for evaluating the model. Since the curve plots the true positive rate against the false positive rate, a large area under the curve corresponds to high efficacy. A commonly used grading of classifier quality based on the AUC is:

• 0.9 - 1.0 : Excellent
• 0.8 - 0.9 : Good
• 0.7 - 0.8 : Fair
• 0.6 - 0.7 : Poor
• 0.5 - 0.6 : Fail

4. Multi-class Classification Strategies

In general there are three kinds of multi-class classification strategies:

• Extension from binary
• Transformation to binary
• Hierarchical classification

4.1 Transformation to Binary

4.1.1 One-versus-Rest Classification

When there are more than two classes into which a data point can be classified, we can reduce the problem to a series of binary classifications, one for each class. The algorithm trains a single classifier per class, one at a time, where the data points belonging to that class are labelled Class 1 and all other points are labelled Class 0; the same method is then applied to the remaining classes to learn their respective classifiers. The implementation requires the model to output a probability-based confidence score rather than a discrete class label, because assigning discrete class labels directly in a multi-class setting can lead to meaningless results, for instance a single data point being classified into more than one class. Classification of real or test data with the learned model is performed by applying the learned classifiers of all the classes and obtaining a confidence score for each class; the data point is then assigned to the class k with the highest confidence score.
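A minimal sketch of one-versus-rest prediction, assuming each per-class scorer returns a confidence; the scorers here are simple logistic models with illustrative weights, not the classifiers trained in Section 6:

```python
import numpy as np

def one_vs_rest_predict(x, weights, biases):
    """Apply each class's binary scorer and return the class with the highest confidence."""
    scores = [1.0 / (1.0 + np.exp(-(w @ x + b)))  # logistic confidence per class
              for w, b in zip(weights, biases)]
    return int(np.argmax(scores))

# Illustrative 3-class problem with 2 features; one weight vector and bias per class.
weights = [np.array([1.0, -0.5]), np.array([-1.0, 1.0]), np.array([0.2, 0.2])]
biases = [0.0, -0.5, 0.1]
print(one_vs_rest_predict(np.array([0.5, 2.0]), weights, biases))  # -> 1
```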

4.1.2 One-versus-One Classification

The one-versus-one strategy is another technique that can be employed to learn a multi-class classification relationship. Here, all possible pairs of the k classes are learnt one by one, producing k(k-1)/2 classifiers, each corresponding to a binary classification subproblem of the original multi-class problem and learnt to differentiate one class from one other class. When classifying an unseen or test data point, all the classifiers are applied to it, and the class that receives the highest number of predictions across all k(k-1)/2 classifiers is the class assigned to the data point.

5. Model Selection Techniques

Model selection is an important aspect of any machine learning solution and ideally the first issue to be tackled. Given the plethora of machine learning algorithms to choose from, we need to select the algorithm that best suits the problem at hand before we start the analysis of the data provided. Essentially, we perform this analysis of algorithms because we wish to evaluate the prediction performance of our model. This is particularly important because the final aim is to maximize this performance for a model that gives meaningful predictions when applied to real-world data. We therefore evaluate based on the following criteria:

• Estimating the performance of the algorithm to get an idea of the quality of the results.
• Given the knowledge of the performance, improving it by tweaking different parameters and consequently selecting the best hypothesis function from the hypothesis space.
• Selecting the best-performing machine learning algorithm and further selecting the best-performing hypothesis/model from its subspace.

Our objective is to select the optimum mix of the above conditions so as to make the learned model perform with the maximum possible accuracy on future real-world data.

5.1 Hyper-parameters versus Learnable Parameters

Learnable parameters, or simply parameters, are variables that are an intrinsic property of a model. Their values are estimated from the data.

They are learned iteratively using the training data and are very much a part of the learned model; they are of utmost importance for making accurate predictions on future data. These parameters are learned by the algorithm, not set manually by the user, and are estimated using what are called optimization algorithms. A few examples include:

• The weights corresponding to the neurons in an ANN.
• The W matrix (or vector) in the case of logistic regression.
• The support vectors in the case of SVMs.

Hyper-parameters, on the other hand, are variables that are extrinsic to the model and whose values cannot be determined from the data. They are usually specified manually by the user and are tuned for a model corresponding to a particular problem. There is no general method for calculating the best possible value of a hyper-parameter; an optimum value is found by trial and error. Hyper-parameters often appear in the procedures designed to estimate the learnable parameters: for various types of models there is no analytical formula for certain quantities, and hyper-parameters are used in exactly such cases. Some examples include:

• The learning rate of the optimization techniques used by various algorithms, such as neural networks.
• The value of k in k-nearest neighbours.

6. Implementation of the Logistic Regression Algorithm

The data set provided for the implementation of this algorithm is divided into two sets, the training data and the test data. The aim is to train a classifier based on multi-class logistic regression implemented via a transformation to binary logistic regression; I followed the one-vs-all approach for the modelling of the given problem. The problem is a supervised learning problem with 10 distinct classes, ranging from 0 to 9. Using the one-vs-all approach the model learns the optimum parameters of each decision boundary one at a time. There are 10 decision boundaries to be learnt for the complete classification of the data points, and each decision boundary, with its corresponding optimized parameters (in essence, the weights), separates one class from the rest of the data points.

Classification via logistic regression is a linear classification technique, as already discussed in the preceding sections. The aim of the algorithm is to learn k distinct (n-1)-dimensional hyperplanes in the n-dimensional space in which the feature vectors are defined. Out of the infinitely many possibilities for each hyperplane, we need to choose the one defined by the set of parameters that is optimized with respect to a cost function. We say that the values of the parameters are optimized when the cost function has attained a local minimum (not necessarily the global minimum, as that cannot be ensured). In the case of binary logistic regression this cost function is the negative log-likelihood; details of the likelihood function and its optimization were discussed in Section 2.1.1.
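A hedged sketch of such a one-vs-all trainer, reusing the logistic cost of Section 2.1.1 with plain batch gradient descent; all function names, the toy data, the learning rate and the convergence threshold below are illustrative placeholders rather than the exact code used for the reported results:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_one_vs_all(X, y, n_classes, lr=1e-3, tol=1e-6, max_iter=10000):
    """Learn one weight vector per class; X already carries a bias column of ones."""
    W = np.zeros((n_classes, X.shape[1]))
    for k in range(n_classes):
        yk = (y == k).astype(float)   # class k versus the rest
        prev_cost = np.inf
        for _ in range(max_iter):
            h = sigmoid(X @ W[k])
            cost = -np.mean(yk * np.log(h + 1e-12) + (1 - yk) * np.log(1 - h + 1e-12))
            if abs(prev_cost - cost) < tol:      # convergence criterion
                break
            prev_cost = cost
            grad = X.T @ (h - yk) / len(yk)      # gradient of the negative log-likelihood
            W[k] -= lr * grad                    # weight update of Equation (22) below
    return W

def predict(W, X):
    """Assign each row of X to the class whose classifier gives the highest confidence."""
    return np.argmax(sigmoid(X @ W.T), axis=1)

# Tiny illustrative run: three well-separated clusters, bias column appended.
X = np.array([[0.0, 0.0, 1.0], [0.2, 0.1, 1.0],
              [3.0, 0.0, 1.0], [3.1, 0.2, 1.0],
              [0.0, 3.0, 1.0], [0.1, 3.2, 1.0]])
y = np.array([0, 0, 1, 1, 2, 2])
W = train_one_vs_all(X, y, n_classes=3, lr=0.5, max_iter=5000)
print(predict(W, X))
```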

Class   Test Data   Train Data
0       0.827       0.879
1       0.602       0.718
2       0.752       0.836
3       0.851       0.873
4       0.865       0.908
5       0.583       0.685
6       0.856       0.881
7       0.725       0.782
8       0.766       0.768
9       0.797       0.808

Table 1: Class-wise F1 scores on test and train data

The algorithm tries to minimize the cost function and correspondingly updates the weights W after every successive iteration, according to the equation below:

    W(k+1) = W(k) - η ∂(CostFunction)/∂W                                                 (22)

These iterations repeat until the cost function converges to a local minimum. The weights W obtained through this procedure are described as an optimum set, and the learning process is repeated until all decision boundaries are optimized. The η in the above equation is a hyper-parameter which needs manual tuning; it is a crucial aspect of the model, and tweaking its value changes the efficiency of the learned model. Too large a learning rate can drive the optimization algorithm away from the vicinity of a local minimum, resulting in divergence; too small a value can fail to converge to the local minimum and remain stuck in the loop. As a result, we need to tune the learning rate to the value that yields the best efficiency.

As discussed above, different scoring methods are the right choice for certain types of problems but can be drastically misleading for others; I believe the F1 score gives a good idea of the efficiency in this case. Table 1 therefore lists the F1 scores on both the test and the train data for each class after learning the model with the learning rate set to 10^-3 and the convergence criterion set to 10^-6. Table 2 shows the corresponding basic accuracy scores on the test and train data sets.

Class   Test Data   Train Data
0       96.46%      97.49%
1       92.62%      94.22%
2       94.45%      96.29%
3       96.94%      97.37%
4       97.37%      98.17%
5       93.45%      94.81%
6       96.77%      97.45%
7       94.63%      95.76%
8       95.54%      95.92%
9       95.57%      96.06%

Table 2: Class-wise accuracy scores on test and train data

The obvious next step is to experiment with the learning rate to obtain the value that maximizes the efficiency of the predictions. Below are the F1-score results on the test data for different values of the learning rate:

• The cost function diverged for η = 0.1.

• For η = 0.01, the cost function diverged for a few classes, while it converged for the rest.
• For η = 0.001, the scores are those already reported above (Tables 1 and 2).
• For η = 0.0001, the class-wise F1 scores on the test data are (0.764, 0.542, 0.659, 0.767, 0.740, 0.555, 0.669, 0.636, 0.631, 0.707).


