
MASTER OF COMPUTER APPLICATIONS INTRODUCTION TO MACHINE LEARNING MCA653


CHANDIGARH UNIVERSITY
Institute of Distance and Online Learning

Course Development Committee
Prof. (Dr.) R.S. Bawa, Pro Chancellor, Chandigarh University, Gharuan, Punjab

Advisors
Prof. (Dr.) Bharat Bhushan, Director – IGNOU
Prof. (Dr.) Majulika Srivastava, Director – CIQA, IGNOU

Programme Coordinators & Editing Team
Master of Business Administration (MBA): Coordinator – Dr. Rupali Arora
Bachelor of Business Administration (BBA): Coordinator – Dr. Simran Jewandah
Master of Computer Applications (MCA): Coordinator – Dr. Raju Kumar
Bachelor of Computer Applications (BCA): Coordinator – Dr. Manisha Malhotra
Master of Commerce (M.Com.): Coordinator – Dr. Aman Jindal
Bachelor of Commerce (B.Com.): Coordinator – Dr. Minakshi Garg
Master of Arts (Psychology): Coordinator – Dr. Samerjeet Kaur
Bachelor of Science (Travel & Tourism Management): Coordinator – Dr. Shikha Sharma
Master of Arts (English): Coordinator – Dr. Ashita Chadha
Bachelor of Arts (General): Coordinator – Ms. Neeraj Gohlan

Academic and Administrative Management
Prof. (Dr.) R. M. Bhagat, Executive Director – Sciences
Prof. (Dr.) S.S. Sehgal, Registrar
Prof. (Dr.) Manaswini Acharya, Executive Director – Liberal Arts
Prof. (Dr.) Gurpreet Singh, Director – IDOL

© No part of this publication should be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording and/or otherwise without the prior written permission of the authors and the publisher.

SLM SPECIALLY PREPARED FOR CU IDOL STUDENTS

Printed and Published by: TeamLease Edtech Limited, www.teamleaseedtech.com
Contact no.: 01133002345
For: CHANDIGARH UNIVERSITY, Institute of Distance and Online Learning

First Published in 2021

All rights reserved. No part of this book may be reproduced or transmitted, in any form or by any means, without permission in writing from Chandigarh University. Any person who does any unauthorized act in relation to this book may be liable to criminal prosecution and civil claims for damages. This book is meant for educational and learning purposes. The authors of the book has/have taken all reasonable care to ensure that the contents of the book do not violate any existing copyright or other intellectual property rights of any person in any manner whatsoever. In the event the authors has/have been unable to track any source and if any copyright has been inadvertently infringed, please notify the publisher in writing for corrective action.

CONTENTS

Unit 1: Introduction to Machine Learning I
Unit 2: Introduction to Machine Learning II
Unit 3: Introduction to Machine Learning III
Unit 4: Learning with Classification
Unit 5: Classification and Regression Tree
Unit 6: Naive Bayes
Unit 7: Support Vector Machines
Unit 8: Bayesian Belief Networks and Clustering
Unit 9: Hidden Markov Model
Unit 10: Natural Language Processing I
Unit 11: Natural Language Processing II
Unit 12: Natural Language Understanding
Unit 13: Natural Language Processing with ML and DL
Unit 14: Accessing Text Corpora
Unit 15: Regular Expressions

UNIT - 1: INTRODUCTION TO MACHINE LEARNING I

Structure
1.0 Learning Objectives
1.1 Introduction
1.2 Key Terminology
1.3 Types of Machine Learning
1.4 Issues in Machine Learning
1.5 Applications of Machine Learning
1.6 Summary
1.7 Keywords
1.8 Learning Activity
1.9 Unit End Questions
1.10 References

1.0 LEARNING OBJECTIVES

After studying this unit, you will be able to:
• Describe the basics of machine learning
• Identify the key terminologies of machine learning
• Illustrate the types of machine learning
• Describe the applications of machine learning
• Identify a relevant machine learning approach to solve a problem

1.1 INTRODUCTION

Artificial intelligence (AI) is a hot topic right now, and it's only getting bigger thanks to the success of technologies like Siri. Talking to your phone is both entertaining and useful for determining the location of the best sushi restaurant in town or determining how to get to the concert hall. As you talk to your mobile phone, it learns more about the way you talk and makes fewer mistakes in understanding your questions. The capability of your smartphone to understand and translate your unique style of communicating is an example of AI, and machine learning is a component of the technologies used to make that possible.

For example, the capability to speak to devices and have them actually do what you intend is an example of machine learning at work. Likewise, recommender systems, such as those found on Amazon, help you make purchases based on criteria such as previous product purchases or products that complement a current choice. The use of both AI and machine learning will only increase with time.

You might also have heard the terms machine learning and artificial intelligence used interchangeably. Machine learning is a component of AI, but it does not completely determine AI. Both machine learning and AI have significant engineering components. That is, you can quantify both technologies precisely based on theory (substantiated and tested explanations) rather than simply hypothesis (a suggested explanation for a phenomenon). Furthermore, both have strong scientific components, through which people test concepts and generate new ideas about how to express the thought process. Finally, machine learning has an artistic component, which a gifted scientist can excel at. AI and machine learning seem to defy logic in certain ways, and only a true artist can make them perform as intended.

The true birth of AI as we know it today began with Alan Turing's publication of "Computing Machinery and Intelligence" in 1950. In this paper, Turing explored the idea of how to determine whether machines can think. This paper led to the Imitation Game, involving three players. Player A is a computer and Player B is a human. Each must convince Player C (a human who can't see either Player A or Player B) that they are human. If Player C can't determine who is human and who isn't on a consistent basis, the computer wins.

A continuing problem with AI is too much optimism. The problem that scientists are trying to solve with AI is incredibly complex. However, the early optimism of the 1950s and 1960s led scientists to believe that the world would produce intelligent machines in as little as 20 years. After all, machines were doing all sorts of amazing things, such as playing complex games. AI currently has its greatest success in areas such as logistics, data mining, and medical diagnosis.

The main point of confusion between learning and intelligence is that people assume that simply because a machine gets better at its job (learning) it is also aware (intelligence). Nothing supports this view of machine learning. The same phenomenon occurs when people assume that a computer is purposely causing problems for them. The computer can't assign emotions and therefore acts only upon the input provided and the instructions contained within an application to process that input.

Machine learning is about making computers modify or adapt their actions (whether these actions are making predictions or controlling a robot) so that these actions get more accurate, where accuracy is measured by how well the chosen actions reflect the correct ones. Imagine that you are playing Scrabble (or some other game) against a computer. You might beat it every time in the beginning, but after lots of games it starts beating you, until finally you never win. Either you are getting worse, or the computer is learning how to win at Scrabble. Having learnt to beat you, it can go on and use the same strategies against other players, so that it doesn't start from scratch with each new player; this is a form of generalization.

It is only over the past decade or so that the inherent multi-disciplinarity of machine learning has been recognized. It merges ideas from neuroscience and biology, statistics, mathematics, and physics to make computers learn. There is a fantastic existence proof that learning is possible: the bag of water and electricity (together with a few trace chemicals) sitting between your ears.

Machine learning lies at the intersection of computer science, engineering, and statistics and often appears in other disciplines. It is a tool that can be applied to many problems. Any field that needs to interpret and act on data can benefit from machine learning techniques. Machine learning uses statistics. To most people, statistics is an esoteric subject used for companies to lie about how great their products are.

The practice of engineering is applying science to solve a problem. In engineering we're used to solving a deterministic problem, where our solution solves the problem all the time. If we're asked to write software to control a vending machine, it had better work all the time, regardless of the money entered or the buttons pressed. There are many problems where the solution isn't deterministic. That is, we don't know enough about the problem or don't have enough computing power to properly model the problem. For these problems we need statistics. For example, the motivation of humans is a problem that is currently too difficult to model. In the social sciences, being right 60% of the time is considered successful. If we can predict the way people will behave 60% of the time, we're doing well. How can this be? Shouldn't we be right all the time? If we're not right all the time, doesn't that mean we're doing something wrong?

1.2 KEY TERMINOLOGY

Artificial Intelligence

Artificial intelligence (AI) is the emulation of human intelligence in computers that are designed to think and behave like humans.

The potential of artificial intelligence to rationalise and take decisions that have the greatest chance of achieving a given target is its optimal feature.

Machine Learning

Machine learning is concerned with "how to create software programs that automatically improve with experience." Machine learning is an interdisciplinary area that uses techniques from computer science, statistics, and artificial intelligence, among other areas. The key outputs of machine learning research are algorithms that allow this automated improvement from practice, algorithms that can be applied in fields as diverse as computer vision, artificial intelligence, and data mining. Machine learning is a branch of artificial intelligence.

Classification

Classification is concerned with the creation of models that divide data into distinct groups. These models are created by supplying a series of training data with pre-labeled groups for the algorithm to learn from. The algorithm is then applied by inputting a new dataset with the class labels withheld, allowing the model to predict class membership based on what it has learned from the training set. Decision trees and support vector machines are two well-known classification techniques. Because it involves explicit class labelling, classification is a form of supervised learning.

Regression

Regression and classification are closely linked. While classification is concerned with predicting discrete groups, regression is used where the "class" to be predicted consists of continuous numerical values. Linear regression is an example of a regression technique.

Clustering

Clustering is used to analyse data that lacks pre-labeled classes, or even a class attribute at all. The principle of "maximising intra-class similarity and minimising inter-class similarity" is used to group data instances together. This means that the clustering algorithm groups instances that are very similar to one another, as opposed to ungrouped instances that are much less similar to one another. The most well-known example of a clustering algorithm is k-means clustering. As clustering does not require the pre-labelling of instance classes, it is a form of unsupervised learning, meaning that it learns by observation as opposed to learning by example.
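To make the clustering idea concrete, here is a minimal sketch using the scikit-learn library (introduced later in this material); the toy points are hypothetical values chosen only for illustration:

```python
# Minimal clustering sketch: k-means groups unlabeled 2-D points.
# The toy data below is hypothetical, chosen only for illustration.
import numpy as np
from sklearn.cluster import KMeans

# Six points with no class labels: two visually separate groups.
X = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],
              [8.0, 8.2], [8.1, 7.9], [7.9, 8.0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster assignment learned for each point
print(kmeans.cluster_centers_)  # learned cluster centres
```

Note that no labels are supplied anywhere: the algorithm infers the two groups purely from the similarity of the points, which is exactly the "learning by observation" described above.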

Association

Association is most simply explained by introducing market basket analysis, a typical task for which it is well known. Market basket analysis tries to identify associations between the various items that are chosen by a particular shopper and placed in their market basket, be it real or virtual, and assigns support and confidence measures for comparison. The value of this lies in cross-marketing and customer behaviour analysis. Association is a generalization of market basket analysis and is similar to classification, except that any attribute can be predicted in association. Apriori is the most well-known example of an association algorithm.

Decision Trees

Decision trees are top-down, recursive, divide-and-conquer classifiers. Building a decision tree generally involves two main tasks: tree induction and tree pruning. Tree induction is the task of taking a set of pre-classified instances as input, deciding which attributes are best to split on, splitting the dataset, and recursing on the resulting split datasets until all training instances are categorized. While building our tree, the goal is to split on the attributes which create the purest child nodes possible, which keeps to a minimum the number of splits that need to be made in order to classify all instances in our dataset. This purity is measured by the concept of information, which relates to how much would need to be known about a previously-unseen instance in order for it to be properly classified.

A completed decision tree model can be overly complex, contain unnecessary structure, and be difficult to interpret. Tree pruning is the process of removing that unnecessary structure from a decision tree in order to make it more efficient, more easily readable for humans, and more accurate as well. This increased accuracy is due to pruning's ability to reduce overfitting.
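As a small illustration of induction and pruning, the following sketch uses scikit-learn's decision tree on the bundled iris dataset; the entropy criterion and the pruning strength ccp_alpha=0.01 are illustrative assumptions, not prescribed values:

```python
# Decision tree sketch: induction on a labeled dataset, then pruning.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Tree induction: recursively split on the attributes giving the purest children
# (purity measured here with entropy, i.e. the information-based measure above).
full_tree = DecisionTreeClassifier(criterion="entropy",
                                   random_state=0).fit(X_train, y_train)

# Pruning (here via cost-complexity pruning) removes unnecessary structure
# and reduces overfitting; ccp_alpha=0.01 is an illustrative value.
pruned_tree = DecisionTreeClassifier(criterion="entropy", ccp_alpha=0.01,
                                     random_state=0).fit(X_train, y_train)

print(full_tree.tree_.node_count, pruned_tree.tree_.node_count)  # pruned tree is smaller
print(pruned_tree.score(X_test, y_test))  # accuracy on held-out data
```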

Support Vector Machines

SVMs are able to classify both linear and nonlinear data. SVMs work by transforming the training dataset into a higher dimension, which is then inspected for the optimal separation boundary, or boundaries, between classes. In SVMs, these boundaries are referred to as hyperplanes, which are identified by locating support vectors (the instances that most essentially define the classes) and their margins (the lines parallel to the hyperplane defined by the shortest distance between a hyperplane and its support vectors).

The grand idea behind SVMs is that, with a high enough number of dimensions, a hyperplane separating a pair of classes can always be found, thereby delineating the dataset's member classes. Repeated a sufficient number of times, enough hyperplanes can be generated to separate all classes in n-dimensional space.
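A brief sketch of this idea, assuming scikit-learn is available: an RBF-kernel SVM implicitly maps two classes that are not linearly separable in two dimensions into a higher-dimensional space where a separating hyperplane exists. The synthetic two-ring dataset is purely illustrative:

```python
# SVM sketch: an RBF kernel implicitly maps the data to a higher
# dimension, where a separating hyperplane is then sought.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings: not linearly separable in the original 2-D space.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

clf = SVC(kernel="rbf").fit(X, y)
print(clf.score(X, y))             # near-perfect separation on this toy data
print(clf.support_vectors_.shape)  # the support vectors defining the margin
```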

Neural Networks

Neural networks are algorithms inspired by the biological brain, although the extent to which they capture actual brain functionality is very controversial, and claims that they model the biological brain are obviously false. Neural networks are made up of numerous interconnected conceptualized artificial neurons, which pass data between themselves and which have associated weights that are tuned based upon the network's "experience." Neurons have activation thresholds which, if met by a combination of their associated weights and the data passed to them, cause them to fire; combinations of fired neurons result in "learning."

Deep Learning

Deep learning is a relatively new term, although it existed before the dramatic uptick in online searches of late. Enjoying a surge in research and industry, due mainly to its incredible successes in a number of different areas, deep learning is the process of applying deep neural network technologies (that is, neural network architectures with multiple hidden layers of neurons) to solve problems. Deep learning is a process, like data mining, which employs deep neural network architectures, which are particular kinds of machine learning algorithms.

Feature

With respect to a dataset, a feature represents an attribute and value combination. Color is an attribute; "color is blue" is a feature. In Excel terms, features are similar to cells. The term feature has other definitions in different contexts.

Feature Selection

Feature selection is the process of selecting relevant features from a dataset for creating a machine learning model.

Model

A model is a data structure that stores a representation of a dataset (weights and biases). Models are created/learned when you train an algorithm on a dataset.

Cross Validation

Cross-validation is a deterministic method for model building, achieved by leaving out one of k segments, or folds, of a dataset, training on the other k-1 segments, and using the remaining kth segment for testing; this process is then repeated k times, with the individual prediction error results being combined and averaged in a single, integrated model. This provides variability, with the goal of producing the most accurate predictive models possible.

Bayesian

When referring to probability, there are two major schools of thought: classical, or frequentist, probability interpretation views probabilities in terms of the frequencies of random events. In somewhat of a contrast, the Bayesian view of probability aims to quantify uncertainty, and updates a given probability as additional evidence becomes available. If these probabilities are extended to truth values and are assigned to hypotheses, we then have "learning" to varying degrees of certainty.

Inputs

An input vector is the data given as one input to the algorithm. It is written as x, with elements x_i, where i runs from 1 to the number of input dimensions, m.

Weights

The weights w_ij are the weighted connections between nodes i and j. For neural networks these weights are analogous to the synapses in the brain. They are arranged into a matrix W.

Outputs

The output vector is y, with elements y_j, where j runs from 1 to the number of output dimensions, n. We can write y(x, W) to remind ourselves that the output depends on the inputs to the algorithm and the current set of weights of the network.

Targets

The target vector t, with elements t_j, where j runs from 1 to the number of output dimensions, n, holds the extra data that we need for supervised learning, since it provides the 'correct' answers that the algorithm is learning about.

Activation Function

For neural networks, g(·) is a mathematical function that describes the firing of the neuron as a response to the weighted inputs, such as the threshold function.
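Tying several of these terms together, here is a minimal sketch of a single artificial neuron in NumPy: an input vector x, weights w, a threshold activation function g(·), and an output y. The numeric values are arbitrary assumptions for illustration:

```python
# A single artificial neuron, tying together the terms above:
# inputs x, weights w, an activation function g, and output y.
import numpy as np

def g(h):
    # Threshold activation: fire (1) if the weighted sum reaches 0.
    return np.where(h >= 0.0, 1, 0)

x = np.array([0.5, -1.0, 2.0])   # input vector (illustrative values)
w = np.array([0.8, 0.2, 0.4])    # weights, one per input dimension
b = -0.5                         # bias shifting the firing threshold

y = g(np.dot(w, x) + b)          # fires: 0.4 - 0.2 + 0.8 - 0.5 = 0.5 >= 0
print(y)                         # -> 1
```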

Error

The error is a function that computes the inaccuracies of the network as a function of the outputs y and targets t.

1.3 TYPES OF MACHINE LEARNING

"A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E." -- Tom Mitchell, Carnegie Mellon University

So if you want your program to predict, for example, traffic patterns at a busy intersection (task T), you can run it through a machine learning algorithm with data about past traffic patterns (experience E) and, if it has successfully "learned", it will then do better at predicting future traffic patterns (performance measure P).

The highly complex nature of many real-world problems, though, often means that inventing specialized algorithms that will solve them perfectly every time is impractical, if not impossible. Examples of machine learning problems include "Is this cancer?", "What is the market value of this house?", "Which of these people are good friends with each other?", "Will this rocket engine explode on take off?", "Will this person like this movie?", "Who is this?", "What did you say?", and "How do you fly this thing?". All of these problems are excellent targets for an ML project, and in fact ML has been applied to each of them with great success.

Among the different types of ML tasks, a crucial distinction is drawn between supervised and unsupervised learning:

Supervised Learning

Supervised learning is one of the most basic types of machine learning. In this type, the machine learning algorithm is trained on labeled data. Even though the data needs to be labeled accurately for this method to work, supervised learning is extremely powerful when used in the right circumstances.

In supervised learning, the ML algorithm is given a small training dataset to work with. This training dataset is a smaller part of the bigger dataset and serves to give the algorithm a basic idea of the problem, solution, and data points to be dealt with. The training dataset is also very similar to the final dataset in its characteristics and provides the algorithm with the labeled parameters required for the problem.

The algorithm finds the relationship between the parameters given, essentially establishing a cause-and-effect relationship between the variables in the dataset.

At the end of the training, the algorithm has an idea of how the data works and of the relationship between the input and the output. This solution is then deployed for use with the final dataset, which it learns from in the same way as the training dataset. This means that supervised machine learning algorithms will continue to improve even after being deployed, discovering new patterns and relationships as they train themselves on new data.

Given data in the form of examples with labels, we can feed a learning algorithm these example-label pairs one by one, allowing the algorithm to predict the label for each example and giving it feedback as to whether it predicted the right answer or not. Over time, the algorithm will learn to approximate the exact nature of the relationship between examples and their labels. When fully trained, the supervised learning algorithm will be able to observe a new, never-before-seen example and predict a good label for it. Supervised learning is often described as task-oriented because of this. It is highly focused on a singular task, feeding more and more examples to the algorithm until it can accurately perform on that task.

Although there are a variety of supervised machine learning algorithms, the most commonly used include:

• Linear regression
• Logistic regression
• Decision tree
• Random forest classification algorithm
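A small sketch of the example-label workflow described above, using logistic regression from the list; the tiny hours-studied dataset is hypothetical:

```python
# Supervised learning sketch: train on example-label pairs, then
# predict the label of a never-before-seen example.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy labeled data (illustrative): hours studied -> pass (1) / fail (0).
X_train = np.array([[1.0], [2.0], [3.0], [6.0], [7.0], [8.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression().fit(X_train, y_train)  # learn from the labels
print(model.predict([[4.5]]))        # predicted class for a new example
print(model.predict_proba([[4.5]]))  # the model's confidence in each class
```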

Unsupervised Learning

Unsupervised machine learning holds the advantage of being able to work with unlabeled data. This means that human labor is not required to make the dataset machine-readable, allowing much larger datasets to be worked on by the program.

In supervised learning, the labels allow the algorithm to find the exact nature of the relationship between any two data points. However, unsupervised learning does not have labels to work off of, resulting in the creation of hidden structures. Relationships between data points are perceived by the algorithm in an abstract manner, with no input required from human beings.

The creation of these hidden structures is what makes unsupervised learning algorithms versatile. Instead of a defined and set problem statement, unsupervised learning algorithms can adapt to the data by dynamically changing hidden structures. This offers more post-deployment development than supervised learning algorithms.

Unsupervised learning features no labels. Instead, our algorithm would be fed a lot of data and given the tools to understand the properties of the data. From there, it can learn to group, cluster, and/or organize the data in a way such that a human (or other intelligent algorithm) can come in and make sense of the newly organized data.

What makes unsupervised learning such an interesting area is that an overwhelming majority of data in this world is unlabeled. Having intelligent algorithms that can take our terabytes and terabytes of unlabeled data and make sense of it is a huge source of potential profit for many industries. That alone could help boost productivity in a number of fields.

For example, what if we had a large database of every research paper ever published, and an unsupervised learning algorithm that knew how to group them in such a way that you were always aware of the current progression within a particular domain of research? Now, you begin a research project yourself, hooking your work into this network that the algorithm can see. As you write your work up and take notes, the algorithm makes suggestions to you about related works, works you may wish to cite, and works that may even help you push that domain of research forward. With such a tool, your productivity can be greatly boosted.

Because unsupervised learning is based upon the data and its properties, we can say that unsupervised learning is data-driven. The outcomes of an unsupervised learning task are controlled by the data and the way it is formatted.

The most commonly used unsupervised machine learning algorithms are:

• Clustering
• Association
• Anomaly detection
• Neural networks
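In the spirit of the research-paper example above, the following sketch groups a handful of unlabeled text snippets by content similarity using scikit-learn; the snippets and the choice of two clusters are illustrative assumptions:

```python
# Unsupervised grouping of unlabeled documents, in the spirit of the
# research-paper example above. The snippets are hypothetical.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "gradient descent optimizes neural network weights",
    "deep neural networks learn feature representations",
    "market basket analysis finds product associations",
    "association rules reveal items bought together",
]

X = TfidfVectorizer().fit_transform(docs)  # documents -> feature vectors
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)  # similar documents receive the same cluster label
```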

Reinforcement Learning

Reinforcement learning directly takes inspiration from how human beings learn from data in their lives. It features an algorithm that improves upon itself and learns from new situations using a trial-and-error method. Favorable outputs are encouraged or 'reinforced', and non-favorable outputs are discouraged or 'punished'.

Based on the psychological concept of conditioning, reinforcement learning works by putting the algorithm in a work environment with an interpreter and a reward system. In every iteration of the algorithm, the output result is given to the interpreter, which decides whether the outcome is favorable or not. If the program finds the correct solution, the interpreter reinforces the solution by providing a reward to the algorithm. If the outcome is not favorable, the algorithm is forced to reiterate until it finds a better result. In most cases, the reward system is directly tied to the effectiveness of the result.

In typical reinforcement learning use cases, such as finding the shortest route between two points on a map, the solution is not an absolute value. Instead, it takes on a score of effectiveness, expressed as a percentage value. The higher this percentage value is, the more reward is given to the algorithm. Thus, the program is trained to give the best possible solution for the best possible reward.

Reinforcement learning is fairly different from supervised and unsupervised learning. Where we can easily see the relationship between supervised and unsupervised learning (the presence or absence of labels), the relationship to reinforcement learning is a bit murkier. Some people try to tie reinforcement learning closer to the two by describing it as a type of learning that relies on a time-dependent sequence of labels; arguably, however, that simply makes things more confusing.

Place a reinforcement learning algorithm into any environment and it will make a lot of mistakes in the beginning. So long as we provide some sort of signal to the algorithm that associates good behaviors with a positive signal and bad behaviors with a negative one, we can reinforce our algorithm to prefer good behaviors over bad ones. Over time, our learning algorithm learns to make fewer mistakes than it used to.

Reinforcement learning is very behavior-driven. It has influences from the fields of neuroscience and psychology. If you've heard of Pavlov's dog, then you may already be familiar with the idea of reinforcing an agent, albeit a biological one.

For any reinforcement learning problem, we need an agent and an environment, as well as a way to connect the two through a feedback loop. To connect the agent to the environment, we give it a set of actions that it can take that affect the environment. To connect the environment to the agent, we have it continually issue two signals to the agent: an updated state and a reward (our reinforcement signal for behaviour).
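To make the agent-environment-reward loop concrete, here is a minimal tabular Q-learning sketch (one common reinforcement learning algorithm, though not the only possibility) on a hypothetical five-cell track where the agent must learn to walk right to reach a goal; all numeric settings are illustrative:

```python
# Minimal reinforcement-learning sketch: an agent on a 1-D track of
# 5 cells learns, by reward and punishment, to walk right to the goal.
import random

n_states, actions = 5, [-1, +1]          # actions: move left / move right
Q = {(s, a): 0.0 for s in range(n_states) for a in actions}
alpha, gamma, epsilon = 0.5, 0.9, 0.2    # learning rate, discount, exploration

for episode in range(200):
    s = 0                                # start at the left end
    while s != n_states - 1:             # goal: the rightmost cell
        # Epsilon-greedy: mostly exploit what was learned, sometimes explore.
        if random.random() < epsilon:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda act: Q[(s, act)])
        s2 = min(max(s + a, 0), n_states - 1)        # environment update
        r = 1.0 if s2 == n_states - 1 else -0.1      # reward signal
        # Q-learning update: reinforce actions that lead toward reward.
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, b)] for b in actions)
                              - Q[(s, a)])
        s = s2

print(max(actions, key=lambda act: Q[(0, act)]))  # learned first move: +1 (right)
```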

1.4 ISSUES IN MACHINE LEARNING

Data quality

Machine learning systems rely on data. That data can be broadly classified into two groups: features and labels.

Features are the inputs to the ML model. For example, this could be data from sensors, customer questionnaires, website cookies or historical information. The quality of these features can be variable. For example, customers may not fill in questionnaires correctly or may omit responses. Sensors can malfunction and deliver erroneous data, and website cookies may give incomplete information about a user's precise actions on a website. The quality of datasets is important so that models can be correctly trained. Data can also be noisy, filled with unwanted information that can mislead a machine learning model into making incorrect predictions.

The outputs of ML models are labels. The sparsity of labels, where we know the inputs to a system but are unsure of what outputs have occurred, is also an issue. In such cases, it can be extremely challenging to detect the relationships between the features and the labels of a model. In many cases this can be labour intensive, as it requires human intervention to associate labels with inputs. Without accurate mapping of inputs to outputs, the model might not be able to learn the correct relationship between the inputs and outputs.

Machine learning relies on the relationships between input and output data to create generalizations that can be used to make predictions and provide recommendations for future actions. When the input data is noisy, incomplete or erroneous, it can be extremely difficult to understand why a particular output, or label, occurred.

The complexity and quality trade-off

Building robust machine learning models requires substantial computational resources to process the features and labels. Coding a complex model requires significant effort from data scientists and software engineers. Complex models can require substantial computing power to execute and can take longer to derive a usable result.

This represents a trade-off for businesses. They can choose a faster response but a potentially less accurate outcome. Or they can accept a slower response but receive a more accurate result from the model.

But these compromises aren't all bad news. The decision of whether to go for a higher-cost and more accurate model over a faster response comes down to the use case. For example, making recommendations to shoppers on a retail shopping site requires real-time responses, but can accept some unpredictability in the result. On the other hand, a stock trading system requires a more robust result. So, a model that uses more data and performs more computations is likely to deliver a better outcome when a real-time result is not needed.

As Machine Learning as a Service (MLaaS) offerings enter the market, the complexity and quality trade-offs will get greater attention. Researchers from the University of Chicago looked at the effectiveness of MLaaS and found that "they can achieve results comparable to standalone classifiers if they have sufficient insight into key decisions like classifiers and feature selection".

Sampling bias in data

Many companies use machine learning algorithms to assist them in recruitment. For example, Amazon discovered that the algorithm it used to assist with selecting candidates to work in the business was biased. Also, researchers from Princeton found that European names were favored by other systems, mimicking some human biases. The problem here isn't the model specifically. The problem is that the data used to train the model comes with its own biases.

However, when we know the data is biased, there are ways to unbias it or to reduce the weighting given to that data. The first challenge is determining whether there is inherent bias in the data. That means conducting some pre-processing. And while it may not be possible to remove all bias from the data, its impact can be minimized by injecting human knowledge. In some cases, it may also be necessary to limit the number of features in the data. For example, omitting traits such as race or gender can help limit the impact of biased data on the results from a model.

Changing expectations and concept drift

Machine learning models operate within specific contexts. For example, ML models that power recommendation engines for retailers operate at a specific time when customers are looking at certain products. However, customer needs change over time, and that means the ML model can drift away from what it was designed to deliver.

Models can decay for a number of reasons. Drift can occur when new data is introduced to the model. This is called data drift. It can also occur when our interpretation of the data changes. This is concept drift.

To accommodate this drift, you need a model that continuously updates and improves itself using the data that comes in. That means you need to keep checking the model. That requires the collection of features and labels, and reacting to changes so the model can be updated and retrained. While some aspects of the retraining can be conducted automatically, some human intervention is needed. It's critical to recognize that the deployment of a machine learning tool is not a one-off activity. Machine learning tools require regular review and update to remain relevant and continue to deliver value.

Monitoring and maintenance

Creating a model is easy. Building a model can be automatic. However, maintaining and updating the models requires a plan and resources.

Machine learning models are part of a longer pipeline that starts with the features that are used to train the model. Then there is the model itself, which is a piece of software that can require modification and updates. That model requires labels so that the results of an input can be recognized and used by the model. And there may be a disconnect between the model and the final signal in a system.

In many cases when an unexpected outcome is delivered, it's not the machine learning that has broken down but some other part of the chain. For example, a recommendation engine may have offered a product to a customer, but the connection between the sales system and the recommendation engine could be broken, and it takes time to find the bug. In this case, it would be hard to tell the model whether the recommendation was successful. Troubleshooting issues like this can be quite labor intensive.

Machine learning offers significant benefits to businesses. The abilities to predict future outcomes, to anticipate and influence customer behavior, and to support business operations are substantial. However, ML also brings challenges to businesses. By recognizing these challenges and developing strategies to address them, companies can ensure they are prepared and equipped to handle them and get the most out of machine learning technology.

1.5 APPLICATIONS OF MACHINE LEARNING

Machine learning is widely used across many fields. Below are a few applications of machine learning.

Image Recognition

Image recognition is one of the most common applications of machine learning. It is used to identify objects, persons, places, digital images, etc. A popular use case of image recognition and face detection is automatic friend tagging suggestion: Facebook provides us a feature of auto friend tagging suggestion. Whenever we upload a photo with our Facebook friends, we automatically get a tagging suggestion with names, and the technology behind this is machine learning's face detection and recognition algorithm.

Figure 1.1 Applications of Machine Learning

Speech Recognition

While using Google, we get an option of "Search by voice"; this comes under speech recognition, and it is a popular application of machine learning. Speech recognition is the process of converting voice instructions into text, and it is also known as "speech to text" or "computer speech recognition." At present, machine learning algorithms are widely used in various applications of speech recognition. Google Assistant, Siri, Cortana, and Alexa use speech recognition technology to follow voice instructions.

Traffic prediction

If we want to visit a new place, we take the help of Google Maps, which shows us the correct path with the shortest route and predicts the traffic conditions. It predicts traffic conditions such as whether traffic is cleared, slow-moving, or heavily congested with the help of two things:

o The real-time location of the vehicle from the Google Maps app and sensors
o The average time taken on past days at the same time

Everyone who uses Google Maps is helping this app become better. It takes information from the user and sends it back to its database to improve performance.

Product recommendations

Machine learning is widely used by various e-commerce and entertainment companies such as Amazon, Netflix, etc., for product recommendations to the user. Whenever we search for some product on Amazon, we start getting advertisements for the same product while surfing the internet in the same browser, and this is because of machine learning. Google understands the user's interest using various machine learning algorithms and suggests products matching the customer's interest. Similarly, when we use Netflix, we find recommendations for entertainment series, movies, etc., and this is also done with the help of machine learning.

Self-driving cars

One of the most exciting applications of machine learning is self-driving cars. Machine learning plays a significant role in self-driving cars. Tesla, the most popular car manufacturing company, is working on self-driving cars. It uses an unsupervised learning method to train the car models to detect people and objects while driving.

Email Spam and Malware Filtering

Whenever we receive a new email, it is filtered automatically as important, normal, or spam. We always receive important mail in our inbox with the important symbol and spam emails in our spam box, and the technology behind this is machine learning. Below are some spam filters used by Gmail:

• Content filter
• Header filter
• General blacklists filter
• Rules-based filters
• Permission filters

Some machine learning algorithms, such as the Multi-Layer Perceptron, decision tree, and Naïve Bayes classifier, are used for email spam filtering and malware detection.

Virtual Personal Assistant

We have various virtual personal assistants such as Google Assistant, Alexa, Cortana, and Siri. As the name suggests, they help us find information using our voice instructions. These assistants can help us in various ways just by our voice instructions, such as playing music, calling someone, opening an email, scheduling an appointment, etc. Machine learning algorithms are an important part of these virtual assistants. These assistants record our voice instructions, send them to the server on a cloud, decode them using ML algorithms, and act accordingly.

Online Fraud Detection

Machine learning is making our online transactions safe and secure by detecting fraudulent transactions. Whenever we perform an online transaction, there are various ways a fraudulent transaction can take place, such as fake accounts, fake IDs, and money being stolen in the middle of a transaction. To detect this, a feed-forward neural network helps us by checking whether a transaction is genuine or fraudulent. For each genuine transaction, the output is converted into some hash values, and these values become the input for the next round. For each genuine transaction there is a specific pattern, which changes for a fraudulent transaction; the network detects this and makes our online transactions more secure.

Stock Market trading

Machine learning is widely used in stock market trading. In the stock market, there is always a risk of ups and downs in shares, so machine learning's long short-term memory neural network is used for the prediction of stock market trends.

Medical Diagnosis

In medical science, machine learning is used to diagnose diseases. With this, medical technology is growing very fast and is able to build 3D models that can predict the exact position of lesions in the brain. It helps in finding brain tumors and other brain-related diseases easily.

Automatic Language Translation

Nowadays, if we visit a new place and are not aware of the language, it is not a problem at all, as here too machine learning helps us by converting the text into languages we know. Google's GNMT (Google Neural Machine Translation) provides this feature; it is a neural machine learning system that translates text into our familiar language, and this is known as automatic translation. The technology behind automatic translation is a sequence-to-sequence learning algorithm, which is used with image recognition and translates text from one language to another.

Video Surveillance

With ML, video surveillance systems can detect a possible crime before it happens. Unusual behaviour, like a person standing motionless for a long time watching a scene, napping on a bench, or following another individual, can alert human attendants. When this can stop a mishap and save a life, incidents like these help improve such surveillance services.

1.6 SUMMARY

• Artificial intelligence is a type of computer technology which is concerned with making machines work in an intelligent way, similar to the way that the human mind works.
• Machine learning is the ability to automatically learn and improve from experience without being explicitly programmed.
• Machine learning is categorized into supervised, unsupervised and reinforcement learning.
• Supervised learning trains the machine using labeled data.
• Unsupervised learning is a kind of machine learning where a model must look for patterns in a dataset with no labels and with minimal human supervision.

• Reinforcement learning is an area of machine learning concerned with how intelligent agents ought to take actions in an environment in order to maximize the notion of cumulative reward.

1.7 KEYWORDS

• Supervised Learning - the learning algorithm is trained on labeled data
• Unsupervised Learning - able to work with unlabeled data
• Machine Learning - software programs that automatically improve with experience
• Artificial Intelligence - computers that are designed to think and behave like humans
• Reinforcement Learning - improves upon itself and learns from new situations using a trial-and-error method

1.8 LEARNING ACTIVITY

1. Consider a scenario where your data has no labels and the objective is to identify the clusters. Which type of machine learning can be chosen?

___________________________________________________________________________

2. Suppose you have a dog that is not so well trained. Every time the dog messes up the living room, you reduce the amount of tasty food you give it (punishment), and every time it behaves well, you double the tasty snacks (reward). What will the dog eventually learn? That messing up the living room is bad. So, in order to maximize the goal of eating more tasty snacks, it will simply behave well, never to mess with the living room again. Which type of machine learning can be used here? Justify.

___________________________________________________________________________

1.9 UNIT END QUESTIONS

A. Descriptive Questions

Short Questions

1. Define Machine Learning.
2. List the types of Machine Learning.

3. What is the use of an activation function?
4. Differentiate supervised and unsupervised learning.
5. What is Reinforcement Learning?

Long Questions

1. Illustrate the key terminologies used in Machine Learning.
2. Describe the areas where Machine Learning is used.
3. Discuss how Supervised Learning works.
4. Compare supervised and unsupervised learning with a real-time example.
5. List a few applications of Machine Learning.

B. Multiple Choice Questions

1. ML is a field of AI consisting of learning algorithms that:
a. Improve their performance
b. At executing some task
c. Over time with experience
d. All of these

2. When the model is trained with the data in one single batch, it is known as:
a. Batch learning
b. Offline learning
c. Both A and B
d. None of these

3. Artificial Intelligence is about _____.
a. Playing a game on a computer
b. Making a machine intelligent
c. Programming on a machine with your own intelligence
d. Putting your intelligence in a machine

4. Who is known as the "Father of AI"?

a. Fisher Ada
b. Alan Turing
c. John McCarthy
d. Allen Newell

5. When a function is too closely fit to a limited set of data points, it is called:
a. Overfitting
b. Underfitting
c. Either A or B
d. Both A and B

Answers
1 – d, 2 – c, 3 – b, 4 – c, 5 – a

1.10 REFERENCES

Text Books
• Peter Harrington, “Machine Learning in Action”, DreamTech Press
• Ethem Alpaydin, “Introduction to Machine Learning”, MIT Press
• Steven Bird, Ewan Klein and Edward Loper, “Natural Language Processing with Python”, O’Reilly Media
• Stephen Marsland, “Machine Learning: An Algorithmic Perspective”, CRC Press

Reference Books
• William W. Hsieh, “Machine Learning Methods in the Environmental Sciences”, Cambridge
• Grant S. Ingersoll, Thomas S. Morton, Andrew L. Farris, “Taming Text”, Manning Publications Co.
• Margaret H. Dunham, “Data Mining: Introductory and Advanced Topics”, Pearson Education

UNIT - 2: INTRODUCTION TO MACHINE LEARNING II

Structure
2.0 Learning Objectives
2.1 Steps in Developing a Machine Learning Application
2.2 Summary
2.3 Keywords
2.4 Learning Activity
2.5 Unit End Questions
2.6 References

2.0 LEARNING OBJECTIVES

After studying this unit, you will be able to:
• Explain the process of developing a machine learning application
• Develop a machine learning application to solve a real-time problem

2.1 STEPS IN DEVELOPING A MACHINE LEARNING APPLICATION

Figure 2.1 Life cycle of Machine Learning Process

1. Gathering Data: Once you know precisely what you want and the equipment is in hand, you take the first real step of machine learning: gathering data. This step is crucial, as the quality and quantity of the data gathered will directly determine how good the predictive model turns out to be. The data collected is then tabulated and referred to as training data.

2. Data Preparation: After the training data is gathered, you move on to the next step of machine learning: data preparation. Here, the data is first put together and its order randomized, as the order of the data should not affect what is learned. This is also a good time to do any visualization of the data, as that can help you see whether there are any relevant relationships between the different variables and how you can take advantage of them, as well as show you whether there are any data imbalances present. The data now has to be split into two parts. The first part, used for training the model, will be the majority of the dataset; the second will be used for the evaluation of the trained model's performance. Other kinds of adjustment and manipulation, such as normalization and error correction, also take place at this step. A short sketch of this split follows below.

3. Choosing a Model: The next step in the workflow is choosing a model from among the many that researchers and data scientists have created over the years, making the choice of the right one that should get the job done. Some are well suited to image data, others to sequences (like text or music), some to numerical data, and others to text-based data.

4. Training: After the previous steps are completed, you then move on to what is often considered the bulk of machine learning, called training, where the data is used to incrementally improve the model's ability to predict. The training process involves initializing some random values for, say, A and B of our model, predicting the output with those values, comparing those predictions with the expected outputs, and then adjusting the values so that the predictions match better. This process then repeats, and each cycle of updating is called one training step.
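Here is a minimal sketch of step 2 with scikit-learn, assuming the bundled iris dataset stands in for your gathered data: the split is shuffled so the order of the data does not affect learning, and a portion is held back for the later evaluation step.

```python
# Data-preparation sketch for step 2: shuffle the data (so order does
# not affect learning) and split it into training and evaluation parts.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# shuffle=True randomizes the order; 80% goes to training, 20% is
# held back for the evaluation step later in the workflow.
X_train, X_eval, y_train, y_eval = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=42)

print(X_train.shape, X_eval.shape)  # (120, 4) (30, 4)
```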

5. Evaluation: Once training is complete, you check whether the model is good enough using this step. This is where the dataset you set aside earlier comes into play. Evaluation allows you to test the model against data that has never been seen or used for training, and is meant to be representative of how the model might perform in the real world. This is where the information learned in the previous step is put to use. When you're evaluating an algorithm, you test it to see how well it does. In the case of supervised learning, you have some known values you can use to evaluate the algorithm. In the case of unsupervised learning, you may have to use some other metrics to evaluate the success.

6. Parameter Tuning: Once the evaluation is over, any further improvement in your training may be possible by tuning the parameters. A number of parameters were implicitly assumed when the training was done. One such parameter is the learning rate, which defines how far the line is shifted during each step, based on the information from the previous training step. These values all play a role in the accuracy of the training model and in how long the training will take. For models that are more complex, initial conditions play a significant role in determining the outcome of training. Differences can be seen depending on whether a model starts off training with values initialized to zeroes versus some distribution of values, which then leads to the question of which distribution to use. Since there are many considerations at this phase of training, it is important that you define what makes a model good. These parameters are referred to as hyperparameters. The adjustment or tuning of these parameters depends on the dataset, the model, and the training process. Once you are done with these parameters and are satisfied, you can move on to the last step.

7. Prediction: Machine learning is basically using data to answer questions. So this is the final step, where you get to answer some questions. This is the point where the value of machine learning is realized. Here you can finally use your model to predict the outcome of what you want.

The above-mentioned steps take you from where you create a model to where you predict its output, and thus act as a learning path.
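The following sketch condenses steps 4 through 7 with scikit-learn; the classifier and the grid of hyperparameter values are illustrative assumptions, not a prescribed recipe:

```python
# Sketch of steps 4-7: train, evaluate, tune a hyperparameter, predict.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_eval, y_train, y_eval = train_test_split(X, y, random_state=42)

# Parameter tuning: try several hyperparameter values with cross-validation
# on the training data and keep the best-performing setting.
search = GridSearchCV(KNeighborsClassifier(),
                      param_grid={"n_neighbors": [1, 3, 5, 7]}, cv=5)
search.fit(X_train, y_train)            # training plus tuning

print(search.best_params_)              # the chosen hyperparameter value
print(search.score(X_eval, y_eval))     # evaluation on the held-out data
print(search.predict(X_eval[:3]))       # prediction on new samples
```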

2.2 SUMMARY

• Developing a machine learning application involves seven steps.
• The first real step of machine learning is gathering data.
• In data preparation, the data is loaded into a suitable place and then prepared for use in machine learning training.
• Evaluation aims to estimate the generalization accuracy of a model on future data.
• Once the evaluation is over, any further improvement in your training can be made by tuning the parameters.
• Tuning is the problem of choosing a set of optimal hyperparameters for a learning algorithm.
• A hyperparameter is a parameter whose value is used to control the learning process.
• Prediction refers to the output of an algorithm after it has been trained on a historical dataset and applied to new data, forecasting the likelihood of a particular outcome.

2.3 KEYWORDS

• Tuning - the process of maximizing a model's performance
• Prediction - a forecast
• Hyperparameters - determine the network structure
• Machine Learning - software programs that automatically improve with experience
• Deep Learning - capable of learning, unsupervised, from data that is unstructured or unlabeled

2.4 LEARNING ACTIVITY

1. The order of steps in developing a machine learning application is interchangeable. Comment.

___________________________________________________________________________

2. The data preparation step is optional. Justify.

___________________________________________________________________________

2.5 UNIT END QUESTIONS

A. Descriptive Questions

Short Questions

1. What is the need for parameter tuning?
2. How do you choose a machine learning model for a specific application?
3. List a few popular models used for classification.
4. Is it necessary to prepare data before applying a data model?
5. What are the parameters used to evaluate a model?

Long Questions

1. Describe the life cycle of designing a machine learning application.
2. Compare the models and describe which model best suits a real-time application.
3. How do you choose a model for performing machine learning?
4. Does parameter tuning improve the performance of the system? Comment.
5. Describe the optional and mandatory steps in building an application.

B. Multiple Choice Questions

1. Which of the following is a good test dataset characteristic?
a. Large enough to yield meaningful results
b. Is representative of the dataset as a whole
c. Both A and B
d. None of these

2. How do you handle missing or corrupted data in a dataset?
a. Drop missing rows or columns
b. Replace missing values with mean/median/mode
c. Assign a unique category to missing values
d. All of these

3. The factors that affect the performance of the learner system do not include:
a. Good data structures
b. Representation scheme used
c. Training scenario
d. Type of feedback

d. Type of feedback

4. In general, to have a well-defined learning problem, we must identify which of the following?
a. The class of tasks
b. The measure of performance to be improved
c. The source of experience
d. All of these

5. Which kind of learning task does recognizing "facial identities or facial expressions" correspond to?
a. Prediction
b. Recognizing patterns
c. Generating patterns
d. Recognizing anomalies

Answers
1 – c, 2 – d, 3 – a, 4 – d, 5 – b

2.6 REFERENCES

Text Books
• Peter Harrington, "Machine Learning in Action", DreamTech Press
• Ethem Alpaydin, "Introduction to Machine Learning", MIT Press
• Steven Bird, Ewan Klein and Edward Loper, "Natural Language Processing with Python", O'Reilly Media
• Stephen Marsland, "Machine Learning: An Algorithmic Perspective", CRC Press

Reference Books
• William W. Hsieh, "Machine Learning Methods in the Environmental Sciences", Cambridge University Press
• Grant S. Ingersoll, Thomas S. Morton, Andrew L. Farris, "Taming Text", Manning Publications Co.
• Margaret H. Dunham, "Data Mining: Introductory and Advanced Topics", Pearson Education

UNIT - 3: INTRODUCTION TO MACHINE LEARNING III

Structure
3.0. Learning Objectives
3.1. Python Libraries for Machine Learning
3.2. Regression
3.2.1. Linear Regression
3.2.2. Simple Linear Regression
3.2.3. Multiple Linear Regression
3.3. Logistic Regression
3.4. Summary
3.5. Keywords
3.6. Learning Activity
3.7. Unit End Questions
3.8. References

3.0 LEARNING OBJECTIVES

After studying this unit, you will be able to:
• Explain the Python libraries for machine learning
• Describe the concept of regression
• Compare linear regression and logistic regression
• Use linear and logistic regression to solve real-time problems

3.1 PYTHON LIBRARIES FOR MACHINE LEARNING

1) NumPy
NumPy is a well-known general-purpose array-processing package. An extensive collection of high-complexity mathematical functions makes NumPy powerful for processing large multi-dimensional arrays and matrices. NumPy is very helpful for handling linear algebra, Fourier transforms, and random numbers. Other libraries like TensorFlow use NumPy at the backend for manipulating tensors. A small sketch of typical NumPy usage is given below.
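The following is a minimal sketch of the array handling just described; the values and operations are illustrative, not taken from the original text.

# Illustrative NumPy usage: arrays, linear algebra, and random numbers.
import numpy as np

a = np.array([[1.0, 2.0], [3.0, 4.0]])        # a 2-D array (matrix)
b = np.random.default_rng(0).random((2, 2))   # a random 2x2 matrix

product = a @ b                 # matrix multiplication
inverse = np.linalg.inv(a)      # linear algebra: matrix inverse
spectrum = np.fft.fft(a[0])     # Fourier transform of the first row

print(product, inverse, spectrum, sep="\n")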

With NumPy, you can define arbitrary data types and easily integrate with most databases. NumPy can also serve as an efficient multi-dimensional container for generic data of any datatype. The key features of NumPy include a powerful N-dimensional array object, broadcasting functions, and out-of-the-box tools to integrate C/C++ and FORTRAN code.

2) SciPy
With machine learning growing at supersonic speed, many Python developers were creating Python libraries for machine learning, especially for scientific and analytical computing. In 2001, Travis Oliphant, Eric Jones, and Pearu Peterson decided to merge most of these bits and pieces of code and standardize them. The resulting library was then named the SciPy library. The current development of the SciPy library is supported and sponsored by an open community of developers and distributed under the free BSD license.

The SciPy library offers modules for linear algebra, image optimization, integration, interpolation, special functions, fast Fourier transforms, signal and image processing, ordinary differential equation (ODE) solving, and other computational tasks in science and analytics. The underlying data structure used by SciPy is the multi-dimensional array provided by the NumPy module. SciPy depends on NumPy for the array manipulation subroutines. The SciPy library was designed to work with NumPy arrays while providing simple and efficient numerical functions.

3) Scikit-learn
In 2007, David Cournapeau developed the Scikit-learn library as part of a Google Summer of Code project. In 2010, INRIA got involved and made the first public release in January 2010. Scikit-learn was built on top of two Python libraries – NumPy and SciPy – and has become the most popular Python machine learning library for developing machine learning algorithms. It has a wide range of supervised and unsupervised learning algorithms that work through a consistent interface in Python. The library can also be used for data mining and data analysis. The main machine learning functions that the Scikit-learn library can handle are classification, regression, clustering, dimensionality reduction, model selection, and preprocessing.

4) Theano
Theano is a Python machine learning library that can act as an optimizing compiler for

evaluating and manipulating mathematical expressions and matrix calculations. Built on NumPy, Theano exhibits tight integration with NumPy and has a very similar interface. Theano can work on the Graphics Processing Unit (GPU) as well as the CPU.

Working on a GPU architecture yields faster results. Theano can perform data-intensive computations up to 140x faster on a GPU than on a CPU. Theano can automatically avoid errors and bugs when dealing with logarithmic and exponential functions. Theano has built-in tools for unit testing and validation, thereby avoiding bugs and problems.

5) TensorFlow
TensorFlow was developed for Google's internal use by the Google Brain team. Its first release came in November 2015 under Apache License 2.0. TensorFlow is a popular computational framework for creating machine learning models. TensorFlow supports a variety of toolkits for constructing models at varying levels of abstraction. TensorFlow exposes very stable Python and C++ APIs. It can expose backward-compatible APIs for other languages too, but they might be unstable.

TensorFlow has a flexible architecture with which it can run on a variety of computational platforms: CPUs, GPUs, and TPUs. TPU stands for Tensor Processing Unit, a hardware chip built around TensorFlow for machine learning and artificial intelligence.

6) Keras
Keras had over 200,000 users as of November 2017. Keras is an open-source library used for neural networks and machine learning. Keras can run on top of TensorFlow, Theano, Microsoft Cognitive Toolkit, R, or PlaidML. Keras can also run efficiently on CPU and GPU.

Keras works with neural-network building blocks like layers, objectives, activation functions, and optimizers. Keras also has a bunch of features to work on images and text that come in handy when writing deep neural network code. Apart from the standard neural network, Keras supports convolutional and recurrent neural networks.

7) PyTorch
PyTorch has a range of tools and libraries that support computer vision, machine learning, and natural language processing. The PyTorch library is open-source and is based on the Torch library. The most significant advantage of the PyTorch library is its ease of learning and

using. PyTorch can smoothly integrate with the Python data science stack, including NumPy. You will hardly make out a difference between NumPy and PyTorch.

PyTorch also allows developers to perform computations on tensors. PyTorch has a robust framework to build computational graphs on the go and even change them at runtime. Other advantages of PyTorch include multi-GPU support, simplified preprocessors, and custom data loaders.

8) Pandas
Pandas is turning out to be the most popular Python library used for data analysis, with support for fast, flexible, and expressive data structures designed to work on both "relational" and "labeled" data. Pandas today is an indispensable library for solving practical, real-world data analysis problems in Python. Pandas is highly stable, providing highly optimized performance. The backend code is purely written in C or Python.

The two main types of data structures used by pandas are:
• Series (1-dimensional)
• DataFrame (2-dimensional)

These two put together can handle the vast majority of data requirements and use cases from most sectors like science, statistics, social sciences, finance, and of course, analytics and other areas of engineering. A small sketch of both structures is given at the end of this subsection.

Pandas supports and performs well with different kinds of data, including the below:
• Tabular data with columns of heterogeneous data. For instance, consider the data coming from a SQL table or Excel spreadsheet.
• Ordered and unordered time series data. The frequency of the time series need not be fixed, unlike in other libraries and tools. Pandas is exceptionally robust in handling uneven time-series data.
• Arbitrary matrix data with a homogeneous or heterogeneous type of data in the rows and columns.
• Any other form of statistical or observational data sets. The data need not be labeled at all. Pandas data structures can process it even without labeling.
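The following is a minimal sketch of the two pandas data structures just described; the data values are illustrative.

# Illustrative pandas usage: a Series (1-D) and a DataFrame (2-D).
import pandas as pd

# A Series: one-dimensional labeled data.
salaries = pd.Series([39343, 46205, 37731], name="Salary")

# A DataFrame: two-dimensional labeled, tabular data.
df = pd.DataFrame({
    "YearsExperience": [1.1, 1.3, 1.5],
    "Salary": [39343, 46205, 37731],
})

print(salaries.mean())      # a summary statistic on a Series
print(df.describe())        # summary statistics per DataFrame column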

9) Matplotlib
Matplotlib is a data visualization library that is used for 2D plotting to produce publication-quality image plots and figures in a variety of formats. The library helps to generate histograms, plots, error charts, scatter plots, and bar charts with just a few lines of code. It provides a MATLAB-like interface and is exceptionally user-friendly. It works by using standard GUI toolkits like GTK+, wxPython, Tkinter, or Qt to provide an object-oriented API that helps programmers embed graphs and plots into their applications.

10) NLTK
The Natural Language Toolkit (NLTK) is a Python library for natural language processing. NLTK is a popular library for processing human language. NLTK comes with a simple interface and a wide variety of lexical resources like WordNet, Word2Vec, FrameNet, and many others. A small sketch of NLTK usage is given below.
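The following is a minimal sketch of NLTK's interface; the sentence is illustrative, and the WordNet resource is assumed to be downloaded first.

# Illustrative NLTK usage: tokenization and a WordNet lookup.
import nltk
nltk.download("wordnet")   # WordNet lexical database (one-time download)

from nltk.tokenize import wordpunct_tokenize
from nltk.corpus import wordnet

# wordpunct_tokenize is regex-based and needs no extra download.
tokens = wordpunct_tokenize("Machine learning uses data to answer questions.")
print(tokens)

# Look up the first few senses of the word "learning" in WordNet.
for synset in wordnet.synsets("learning")[:3]:
    print(synset.name(), "-", synset.definition())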

3.2 REGRESSION

Regression models are used to predict a continuous value. Predicting the price of a house given its features, such as size, etc., is one of the common examples of regression. It is a supervised technique.

3.2.1 Linear Regression
Linear Regression is a supervised machine learning algorithm where the predicted output is continuous and has a constant slope. It is used to predict values within a continuous range (e.g. sales, price) rather than trying to classify them into categories (e.g. cat, dog). There are two main types:

• Simple regression: If a single independent variable is used to predict the value of a numerical dependent variable, then such a Linear Regression algorithm is called Simple Linear Regression.
• Multivariable regression: If more than one independent variable is used to predict the value of a numerical dependent variable, then such a Linear Regression algorithm is called Multiple Linear Regression.

Linear regression is one of the easiest and most popular machine learning algorithms. It is a statistical method that is used for predictive analysis. Linear regression makes predictions for continuous/real or numeric variables such as sales, salary, age, product price, etc.

The linear regression algorithm shows a linear relationship between a dependent (y) variable and one or more independent (x) variables, hence the name linear regression. Since linear regression shows a linear relationship, it finds how the value of the dependent variable changes according to the value of the independent variable.

The linear regression model provides a sloped straight line representing the relationship between the variables. Consider the below image:

Fig 3.1 Linear Regression

Mathematically, we can represent a linear regression as: y = a0 + a1x + ε

Here,
Y = Dependent Variable (Target Variable)
X = Independent Variable (Predictor Variable)
a0 = intercept of the line (gives an additional degree of freedom)
a1 = Linear regression coefficient (scale factor to each input value)
ε = random error

Finding the best fit line
When working with linear regression, our main goal is to find the best fit line, which means that the error between the predicted values and actual values should be minimized. The best fit line will have the least error. A small sketch of fitting such a line is given below.
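As a quick illustration of finding a best-fit line, the following sketch uses NumPy's least-squares polynomial fit; the data points are made-up values.

# Fit a straight line y = a1*x + a0 to made-up data by least squares.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

a1, a0 = np.polyfit(x, y, deg=1)   # slope and intercept of the best-fit line
print(f"best fit line: y = {a1:.3f}*x + {a0:.3f}")

# The fitted line minimizes the sum of squared errors (the residuals).
residuals = y - (a1 * x + a0)
print("sum of squared errors:", np.sum(residuals**2))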

The different values for the weights or coefficients of the line (a0, a1) give different lines of regression, so we need to calculate the best values for a0 and a1 to find the best fit line; to calculate this we use a cost function.

Cost function
The different values for the weights or coefficients of the line (a0, a1) give different lines of regression, and the cost function is used to estimate the values of the coefficients for the best fit line. The cost function optimizes the regression coefficients or weights. It measures how a linear regression model is performing. We can use the cost function to find the accuracy of the mapping function, which maps the input variable to the output variable. This mapping function is also known as the Hypothesis function.

For Linear Regression, we use the Mean Squared Error (MSE) cost function, which is the average of the squared errors between the predicted values and actual values. For the above linear equation, MSE can be calculated as:

MSE = (1/N) Σ (Yi − (a1xi + a0))²

Where,
N = Total number of observations
Yi = Actual value
(a1xi + a0) = Predicted value

Residuals: The distance between the actual value and the predicted value is called the residual. If the observed points are far from the regression line, the residuals will be high, and so the cost function will be high. If the scatter points are close to the regression line, the residuals will be small, and hence so will the cost function.

Gradient Descent:
Gradient descent is used to minimize the MSE by calculating the gradient of the cost function. A small sketch of gradient descent is given below.
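The following is a minimal sketch of gradient descent for the line y = a1x + a0, minimizing the MSE above; the data, learning rate, and iteration count are illustrative assumptions.

# Gradient descent on the MSE for a simple linear model y = a1*x + a0.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

a0, a1 = 0.0, 0.0        # coefficients initialized to zero
lr = 0.01                # learning rate (an assumed value)

for _ in range(5000):    # iteration count (an assumed value)
    error = (a1 * x + a0) - y
    # Partial derivatives of the MSE with respect to a0 and a1.
    grad_a0 = 2 * error.mean()
    grad_a1 = 2 * (error * x).mean()
    a0 -= lr * grad_a0
    a1 -= lr * grad_a1

print(f"learned line: y = {a1:.3f}*x + {a0:.3f}")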

A regression model uses gradient descent to update the coefficients of the line by reducing the cost function. It is done by randomly selecting values for the coefficients and then iteratively updating these values to reach the minimum of the cost function.

Model Performance:
The goodness of fit determines how well the line of regression fits the set of observations. The process of finding the best model out of various models is called optimization. It can be achieved by the below method:

R-squared method:
R-squared is a statistical method that determines the goodness of fit. It measures the strength of the relationship between the dependent and independent variables on a scale of 0-100%. A high value of R-squared indicates less difference between the predicted values and actual values, and hence represents a good model. It is also called the coefficient of determination, or the coefficient of multiple determination for multiple regression. It can be calculated from the below formula:

R-squared = Explained Variation / Total Variation

Assumptions of Linear Regression
Below are some important assumptions of Linear Regression. These are some formal checks to make while building a Linear Regression model, which ensure the best possible result from the given dataset.

Linear relationship between the features and target: Linear regression assumes a linear relationship between the dependent and independent variables.

Small or no multicollinearity between the features: Multicollinearity means high correlation between the independent variables. Due to multicollinearity, it may be difficult to find the true relationship between the predictors and the target variable. Or we can say, it is difficult to determine which predictor variable is affecting the target variable and which is not. So, the model assumes either little or no

multicollinearity between the features or independent variables.

Homoscedasticity Assumption: Homoscedasticity is a situation in which the error term is the same for all values of the independent variables. With homoscedasticity, there should be no clear pattern of distribution of the data in the scatter plot.

Normal distribution of error terms: Linear regression assumes that the error terms should follow the normal distribution pattern. If the error terms are not normally distributed, then confidence intervals will become either too wide or too narrow, which may cause difficulty in finding the coefficients. This can be checked using a q-q plot. If the plot shows a straight line without any deviation, it means the errors are normally distributed.

No autocorrelations: The linear regression model assumes no autocorrelation in the error terms. If there is any correlation in the error terms, it will drastically reduce the accuracy of the model. Autocorrelation usually occurs if there is a dependency between the residual errors.

3.2.2 Simple Linear Regression
Simple Linear Regression is a type of regression algorithm that models the relationship between a dependent variable and a single independent variable. The relationship shown by a Simple Linear Regression model is linear, or a sloped straight line, hence it is called Simple Linear Regression. The key point in Simple Linear Regression is that the dependent variable must be a continuous/real value. However, the independent variable can be measured on continuous or categorical values. The Simple Linear Regression algorithm has mainly two objectives:

➢ Model the relationship between the two variables, such as the relationship between income and expenditure, experience and salary, etc.
➢ Forecast new observations, such as weather forecasting according to temperature, the revenue of a company according to the investments in a year, etc.

The Simple Linear Regression model can be represented using the below equation:

y = a0 + a1x + ε

Where,
a0 is the intercept of the regression line (can be obtained by putting x = 0)
a1 is the slope of the regression line, which tells whether the line is increasing or decreasing
ε is the error term (for a good model it will be negligible)

Consider a dataset that has two variables: salary (dependent variable) and experience (independent variable). The goals of this problem are:
➢ To find out if there is any correlation between these two variables.
➢ To find the best fit line for the dataset.
➢ To determine how the dependent variable changes as the independent variable changes.

We will create a Simple Linear Regression model to find out the best fitting line for representing the relationship between these two variables. To implement the Simple Linear Regression model in machine learning using Python, we need to follow the below steps:

Step-1: Data Pre-processing
The first step for creating the Simple Linear Regression model is data pre-processing. We have already done it earlier in this tutorial, but there will be some changes, which are given in the below steps. First, we will import the three important libraries, which will help us in loading the dataset, plotting the graphs, and creating the Simple Linear Regression model.

import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

Next, we will load the dataset into our code:

data_set= pd.read_csv('Salary_Data.csv')
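Before moving on, it can help to sanity-check what was loaded. The following quick sketch continues from the snippet above and uses standard pandas calls; the expectation of 30 rows follows from the dataset described later in this example.

# Quick inspection of the loaded dataset (continues the code above).
print(data_set.shape)       # (rows, columns) -- 30 observations expected here
print(data_set.head())      # the first five rows
print(data_set.describe())  # summary statistics per column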

Fig 3.2 Dataset with two attributes

After that, we need to extract the dependent and independent variables from the given dataset. The independent variable is years of experience, and the dependent variable is salary. Below is the code for it:

x= data_set.iloc[:, :-1].values
y= data_set.iloc[:, 1].values

In the above lines of code, for the x variable, we have taken -1 as the value since we want to remove the last column from the dataset. For the y variable, we have taken 1 as a parameter, since we want to extract the second column and indexing starts from zero.

Figure 3.3 Dependent and independent variables in the dataset

In the above output image, we can see that the X (independent) variable and Y (dependent) variable have been extracted from the given dataset.

Next, we will split both variables into a test set and a training set. We have 30 observations, so we will take 20 observations for the training set and 10 observations for the test set. We are splitting our dataset so that we can train our model using the training dataset and then test the model using the test dataset. The code for this is given below:

# Splitting the dataset into training and test set.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 1/3, random_state=0)

By executing the above code, we will get the x_train, x_test, y_train, and y_test datasets.

Figure 3.4 Training Dataset

Figure 3.5 Testing Dataset

Step-2: Fitting the Simple Linear Regression to the Training Set
Now the second step is to fit our model to the training dataset. To do so, we will import the LinearRegression class of the linear_model library from scikit-learn. After importing the class, we are going to create an object of the class named regressor. The code for this is given below:

#Fitting the Simple Linear Regression model to the training dataset
from sklearn.linear_model import LinearRegression
regressor= LinearRegression()
regressor.fit(x_train, y_train)

In the above code, we have used the fit() method to fit our Simple Linear Regression object to the training set. In the fit() function, we have passed x_train and y_train, which are our training datasets for the independent and dependent variables. We have fitted our regressor object to the training set so that the model can easily learn the correlations between the predictor and target variables. After executing the above lines of code, we will get the below output.

Output:
Out[7]: LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

Step-3: Prediction of test set result
Our model is now trained on the relationship between the dependent variable (salary) and the independent variable (experience). So now, the model is ready to predict the output for new observations. In this step, we will provide the test dataset (new observations) to the model to check whether it can predict the correct output or not. We will create prediction vectors y_pred and x_pred, which will contain the predictions for the test dataset and the training set, respectively.

#Prediction of Test and Training set result
y_pred= regressor.predict(x_test)
x_pred= regressor.predict(x_train)

On executing the above lines of code, two variables named y_pred and x_pred will be generated in the variable explorer, containing salary predictions for the test set and training set.

Step-4: Visualizing the Training set results
Now in this step, we will visualize the training set result. To do so, we will use the scatter() function of the pyplot library, which we have already imported in the pre-processing step. The scatter() function will create a scatter plot of the observations. On the x-axis, we will plot the years of experience of the employees, and on the y-axis, the salary of the employees. In the function, we will pass the real values of the training set, which means the years of experience x_train, the training set of salaries y_train, and the color of the observations. Here we are taking green for the observations, but it can be any color as per choice.

Now, we need to plot the regression line, so for this, we will use the plot() function of the pyplot library. In this function, we will pass the years of experience for the training set, the predicted salary for the training set x_pred, and the color of the line.

Next, we will give a title to the plot. Here, we will use the title() function of the pyplot library and pass the name "Salary vs Experience (Training Dataset)". After that, we will assign labels to the x-axis and y-axis using the xlabel() and ylabel() functions. Finally, we will display all of the above in a graph using show(). The code is given below:

mtp.scatter(x_train, y_train, color="green")
mtp.plot(x_train, x_pred, color="red")
mtp.title("Salary vs Experience (Training Dataset)")
mtp.xlabel("Years of Experience")
mtp.ylabel("Salary (In Rupees)")
mtp.show()

Figure 3.6: Predictions of Training Dataset

In the above plot, we can see the real observations as green dots, and the predicted values are covered by the red regression line. The regression line shows a correlation between the dependent and independent variables.

The goodness of fit of the line can be observed by calculating the difference between the actual values and the predicted values. As we can see in the above plot, most of the observations are close to the regression line, hence our model is good for the training set.

Step-5: Visualizing the Test set results
In the previous step, we visualized the performance of our model on the training set. Now, we will do the same for the test set. The complete code will remain the same as the above code, except that in this, we will use x_test and y_test instead of x_train and y_train.

Here we are also changing the color of the observations and regression line to differentiate between the two plots, but it is optional.

#visualizing the Test set results
mtp.scatter(x_test, y_test, color="blue")
mtp.plot(x_train, x_pred, color="red")
mtp.title("Salary vs Experience (Test Dataset)")
mtp.xlabel("Years of Experience")
mtp.ylabel("Salary (In Rupees)")
mtp.show()

Figure 3.7: Predictions of Test Dataset

In the above plot, the observations are given in blue, and the prediction is given by the red regression line. As we can see, most of the observations are close to the regression line; hence we can say our Simple Linear Regression model is a good model and able to make good predictions. Beyond visual inspection, the fit can also be quantified numerically, as the short sketch below shows.
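The following is a minimal sketch of scoring the model with scikit-learn's metrics module; it continues the regressor, x_test, y_test, and y_pred variables from the example above.

# Quantify the model fit on the test set (continues the example above).
from sklearn.metrics import mean_squared_error, r2_score

print("R-squared:", r2_score(y_test, y_pred))         # closer to 1 is better
print("MSE:", mean_squared_error(y_test, y_pred))     # lower is better
print("Slope a1:", regressor.coef_)                   # learned coefficient
print("Intercept a0:", regressor.intercept_)          # learned intercept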

3.2.3 Multiple Linear Regression
Multiple Linear Regression is an extension of Simple Linear Regression, as it takes more than one predictor variable to predict the response variable. Multiple Linear Regression is one of the important regression algorithms, which models the linear relationship between a single dependent continuous variable and more than one independent variable.

Some key points about MLR:
• For MLR, the dependent or target variable (Y) must be continuous/real, but the predictor or independent variables may be of continuous or categorical form.
• Each feature variable must model a linear relationship with the dependent variable.
• MLR tries to fit a regression line through a multidimensional space of data points.

Multiple linear regression formula
The formula for a multiple linear regression is:

y = B0 + B1X1 + B2X2 + … + BnXn + e

Where,
• y is the predicted value of the dependent variable
• B0 is the y-intercept (the value of y when all other parameters are set to 0)
• B1X1 is the regression coefficient (B1) of the first independent variable (X1) (the effect that increasing the value of the independent variable has on the predicted y value)
• BnXn is the regression coefficient (Bn) of the last independent variable (Xn)
• e is the model error (i.e., how much variation there is in our estimate of y)

To find the best-fit line for each independent variable, multiple linear regression calculates three things:
• The regression coefficients that lead to the smallest overall model error.
• The t-statistic of the overall model.
• The associated p-value (how likely it is that the t-statistic would have occurred by chance if the null hypothesis of no relationship between the independent and dependent variables were true).

It then calculates the t-statistic and p-value for each regression coefficient in the model.

Consider a dataset of 50 start-up companies. This dataset contains five main pieces of information: R&D Spend, Administration Spend, Marketing Spend, State, and Profit for a financial year. Our goal is to create a model that can easily determine which company has the maximum profit, and which factor most affects the profit of a company. Since we need to find the Profit, it is the dependent variable, and the other four variables are the independent variables. Below are the main steps of deploying the MLR model:

1. Data Pre-processing Steps
2. Fitting the MLR model to the training set
3. Predicting the result of the test set

Step-1: Data Pre-processing Step
The very first step is data pre-processing, which we have already discussed in this tutorial. This process contains the below steps:

• Importing libraries: First, we will import the libraries which will help in building the model. Below is the code for it:

# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

• Importing the dataset: Now we will import the dataset (50_CompList), which contains all the variables. Below is the code for it:

#importing datasets
data_set= pd.read_csv('50_CompList.csv')

• Extracting dependent and independent variables:

#Extracting Independent and dependent Variable
x= data_set.iloc[:, :-1].values
y= data_set.iloc[:, 4].values

Encoding Dummy Variables: As the categorical variable (State) cannot be directly applied to the model, we will encode it. To encode the categorical variable into numbers, we will use the LabelEncoder class. But that alone is not sufficient, because the encoding still carries some relational order, which may create a wrong model. So, in order to remove this problem, we will use OneHotEncoder, which will create the dummy variables. Below is the code for it:

#Categorical data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
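The excerpt breaks off after this import. The following is one possible sketch of how the encoding might continue; note that recent scikit-learn versions removed OneHotEncoder's old categorical_features argument, so a ColumnTransformer is used here instead, and the column index 3 (State) follows the dataset layout above.

# One possible continuation of the encoding step (a sketch, not the original code).
from sklearn.compose import ColumnTransformer

# One-hot encode column 3 (State); pass the other columns through unchanged.
# sparse_output=False keeps the result a dense array (the argument was named
# `sparse` in scikit-learn versions before 1.2).
ct = ColumnTransformer(
    transformers=[("state", OneHotEncoder(sparse_output=False), [3])],
    remainder="passthrough",
)
x = ct.fit_transform(x)

# Avoid the dummy variable trap by dropping one of the dummy columns.
x = x[:, 1:]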

