
• Border Points: A data point q is a border point if Nbhd(q, ɛ) contains fewer than minPts data points, but q is reachable from some core point p.
• Outlier: A data point o is an outlier if it is neither a core point nor a border point. Essentially, this is the “other” class.

Core Points: Core points are the foundations of our clusters and are based on the density approximation discussed in the previous section. We use the same ɛ to compute the neighborhood for each point, so the volume of all the neighborhoods is the same. However, the number of other points in each neighborhood is what differs. Recall that we can think of the number of data points in a neighborhood as its mass. The volume of each neighborhood is constant and the mass of each neighborhood is variable, so by putting a threshold on the minimum amount of mass needed to be a core point, we are essentially setting a minimum density threshold. Therefore, core points are data points that satisfy a minimum density requirement. Our clusters are built around our core points (hence the “core” part), so by adjusting the minPts parameter we can fine-tune how dense our clusters’ cores must be.

Border Points: Border points are the points in our clusters that are not core points. The definition above for border points uses the term density-reachable, which has not been defined yet, but the concept is simple. To explain it, let’s revisit our neighborhood example with ɛ = 0.15 and consider a point r that lies outside of point p’s neighborhood.

Outliers: Finally, we get to our “other” class. Outliers are points that are neither core points nor close enough to a cluster to be density-reachable from a core point. Outliers are not assigned to any cluster and, depending on the context, may be considered anomalous points.

The density-based clustering algorithm (DBSCAN) can be summarized as follows; a minimal code sketch appears after the step list.
1. For each point \(x_i\), compute the distance between \(x_i\) and the other points. Find all neighbor points within distance eps of the starting point \(x_i\). Each point with a neighbor count greater than or equal to MinPts is marked as a core point or visited.
2. For each core point, if it is not already assigned to a cluster, create a new cluster. Find recursively all its density-connected points and assign them to the same cluster as the core point.
3. Iterate through the remaining unvisited points in the dataset.
4. Those points that do not belong to any cluster are treated as outliers or noise.
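The same procedure is implemented in scikit-learn. The sketch below is illustrative only: the half-moon data and the eps and min_samples values are arbitrary choices, not part of the material above.

```
# A minimal DBSCAN sketch with scikit-learn; eps and min_samples play the roles of
# the ɛ and minPts parameters described above (the values here are illustrative).
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaved half-moons: a shape k-means handles poorly but DBSCAN handles well.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=1)

labels = DBSCAN(eps=0.15, min_samples=5).fit_predict(X)   # label -1 marks outliers (noise)

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters found:", n_clusters, "| noise points:", int(np.sum(labels == -1)))
```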

9.7 SUMMARY
• The Hidden Markov Model (HMM) is a relatively simple way to model sequential data. A hidden Markov model implies that the Markov model underlying the data is hidden or unknown to you. More specifically, you only know observational data and not information about the states.
• A Markov chain is useful when we need to compute a probability for a sequence of observable events.
• Clustering is the task of dividing the population or data points into a number of groups such that data points in the same group are more similar to other data points in the same group than to those in other groups.
• The k-means algorithm identifies k centroids and then allocates every data point to the nearest cluster, while keeping the clusters as compact as possible.
• Hierarchical clustering is an alternative approach to k-means clustering for identifying groups in a data set.
• An artificial neural network (ANN) is a computing system designed to simulate the way the human brain analyzes and processes information. It is a foundation of artificial intelligence (AI) and solves problems that would prove impossible or difficult by human or statistical standards.

9.8 KEYWORDS
• Markov Model – used to model pseudo-randomly changing systems
• Dendrogram – shows the hierarchical relationship between objects
• AGNES – constructs a hierarchy of clusterings agglomeratively (bottom-up)
• DBSCAN – separates clusters of high density from clusters of low density
• DIANA – constructs the hierarchy in the inverse, divisive (top-down) order

9.9 LEARNING ACTIVITY
1. Read in the Titanic train.csv dataset from Kaggle.com (you might have to sign up first). Turn the Sex column into a dummy variable (1 if male, 0 otherwise) and Pclass into a dummy variable for the most common class, 3. Using four columns – Sex, SibSp, Parch and Fare – apply the k-means algorithm to get 4 clusters and use nstart = 20. Remember to set the seed to 1 so the results are comparable. Note that these four variables have no special meaning to the problem and that dummy variables in k-means are probably not a good idea in general – we are just playing around.
___________________________________________________________________________
____________________________________________________________________
2. Use the k-means algorithm and Euclidean distance to cluster the following 8 examples into 3 clusters: A1=(2,10), A2=(2,5), A3=(8,4), A4=(5,8), A5=(7,5), A6=(6,4), A7=(1,2), A8=(4,9).
___________________________________________________________________________
____________________________________________________________________
3. The table below is a distance matrix for 6 objects. Among hierarchical clustering, density-based clustering and k-means clustering, which clustering method is most likely to produce the following results at k = 2?
___________________________________________________________________________
____________________________________________________________________

9.10 UNIT END QUESTIONS
A. Descriptive Questions

Short Questions
1. List a few typical applications of clustering.
2. Let (x1, x2, …, xn) be the elements of a cluster. Write the equations to find the centroid and radius of the cluster.
3. What is an outlier? Mention its applications.
4. Express the average intra-cluster distance in terms of cluster features.
5. What is the density-based clustering method? How does the DBSCAN algorithm work?

Long Questions
1. What do you mean by hierarchical clustering? How is it represented? What are the differences between single-link and complete-link algorithms?
2. Discuss the parameters used to evaluate the results of a clustering method.
3. Compare the k-means and k-medoid algorithms.
4. Discuss briefly the hierarchical clustering method.
5. Explain in detail the density-based clustering algorithm.

B. Multiple Choice Questions
1. Which algorithm is used for solving temporal probabilistic reasoning?
a. Hill-climbing search
b. Hidden Markov model
c. Depth-first search
d. Breadth-first search
2. Where is the Hidden Markov Model used?
a. Speech recognition
b. Understanding of the real world
c. Both speech recognition and understanding of the real world
d. None of these
3. How is the state of the process described in an HMM?
a. Literal
b. Single random variable
c. Single discrete random variable

d. None of the mentioned
4. Where is the additional variable added in an HMM?
a. Temporal model
b. Reality model
c. Probability model
d. All of the mentioned
5. Which of the following is finally produced by hierarchical clustering?
a. final estimate of cluster centroids
b. tree showing how close things are to each other
c. assignment of each point to clusters
d. all of the mentioned
6. Which of the following is required by k-means clustering?
a. defined distance metric
b. number of clusters
c. initial guess as to cluster centroids
d. all of the mentioned
7. Point out the wrong statement.
a. k-means clustering is a method of vector quantization
b. k-means clustering aims to partition n observations into k clusters
c. k-nearest neighbor is the same as k-means
d. None of these
8. Which of the following can act as possible termination conditions in k-means?
1. A fixed number of iterations.
2. Assignment of observations to clusters does not change between iterations, except for cases with a bad local minimum.

3. Centroids do not change between successive iterations.
4. Terminate when RSS falls below a threshold.
a. 1, 3 and 4
b. 1, 2 and 3
c. 1, 2 and 4
d. All of these
9. After performing k-means clustering analysis on a dataset, you observed the following dendrogram. Which of the following conclusions can be drawn from the dendrogram?
a. There were 28 data points in the clustering analysis
b. The best number of clusters for the analyzed data points is 4
c. The proximity function used is average-link clustering
d. The above dendrogram interpretation is not possible for k-means clustering analysis

Answers
1 – b, 2 – a, 3 – c, 4 – a, 5 – b, 6 – d, 7 – c, 8 – d, 9 – d

9.11 REFERENCES
Textbooks
• Peter Harrington, “Machine Learning in Action”, DreamTech Press

• Ethem Alpaydin, “Introduction to Machine Learning”, MIT Press
• Steven Bird, Ewan Klein and Edward Loper, “Natural Language Processing with Python”, O’Reilly Media
• Stephen Marsland, “Machine Learning: An Algorithmic Perspective”, CRC Press
Reference Books
• William W. Hsieh, “Machine Learning Methods in the Environmental Sciences”, Cambridge University Press
• Grant S. Ingersoll, Thomas S. Morton, Andrew L. Farris, “Taming Text”, Manning Publications Co.
• Margaret H. Dunham, “Data Mining: Introductory and Advanced Topics”, Pearson Education

UNIT - 10: NATURAL LANGUAGE PROCESSING I

Structure
10.0 Learning Objectives
10.1 Introduction
10.2 Components of Dimensionality Reduction
10.2.1 Feature Selection
10.2.2 Feature Extraction
10.2.3 Importance of Dimension Reduction
10.2.4 Advantages of Dimension Reduction
10.3 Dimensionality Reduction Techniques
10.4 Principal Component Analysis
10.5 Summary
10.6 Keywords
10.7 Learning Activity
10.8 Unit End Questions
10.9 References

10.0 LEARNING OBJECTIVES
After studying this unit, you will be able to:
• Describe the basics of natural language processing
• Identify the key terminologies of natural language processing
• Illustrate the types of dimensionality reduction techniques
• Describe the use of principal component analysis

10.1 INTRODUCTION
Dimensionality reduction refers to techniques for reducing the number of input variables in training data.
Definition: When dealing with high dimensional data, it is often useful to reduce the dimensionality by projecting the data to a lower dimensional subspace which captures the “essence” of the data.

This is called dimensionality reduction. High dimensionality means hundreds, thousands, or even millions of input variables. Fewer input dimensions often mean correspondingly fewer parameters or a simpler structure in the machine learning model, referred to as degrees of freedom. A model with too many degrees of freedom is likely to overfit the training dataset and therefore may not perform well on new data. It is desirable to have simple models that generalize well, and in turn, input data with few input variables. This is particularly true for linear models, where the number of inputs and the degrees of freedom of the model are often closely related.

Motivation:
• When dealing with real problems and real data, we often deal with high dimensional data whose dimensionality can go up to millions.
• In its original high dimensional structure, the data represents itself; sometimes, however, we need to reduce its dimensionality.
• Often we reduce the dimensionality for visualization, although that is not always the reason.

10.2 COMPONENTS OF DIMENSIONALITY REDUCTION
There are two components of dimensionality reduction:

10.2.1 Feature Selection
In feature selection, we find k of the d dimensions that give us the most information and discard the other (d − k) dimensions. That is, we find a subset of the original set of variables that can be used to model the problem. It usually involves three ways:
• Filter
• Wrapper
• Embedded

10.2.2 Feature Extraction
In feature extraction, we find a new set of k dimensions that are combinations of the original d dimensions. A minimal sketch contrasting the two components is shown below.
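The following sketch contrasts the two components on a toy dataset with scikit-learn; the choice of k = 2 and the F-test scoring function are illustrative assumptions, not prescriptions from the text.

```
# Feature selection keeps k of the original columns; feature extraction builds k new
# columns as combinations of all d originals. k = 2 is chosen only for illustration.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)            # d = 4 original dimensions

# Feature selection (filter style): keep the 2 columns scoring highest on an ANOVA F-test.
X_selected = SelectKBest(score_func=f_classif, k=2).fit_transform(X, y)

# Feature extraction: build 2 new dimensions as linear combinations of all 4 columns.
X_extracted = PCA(n_components=2).fit_transform(X)

print(X_selected.shape, X_extracted.shape)   # both (150, 2), but with different meanings
```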

10.2.3 Importance of Dimensionality Reduction
Why is dimension reduction important in machine learning predictive modeling? The problem of an unwanted increase in dimension is closely tied to the practice of measuring and recording data at a far more granular level than was done in the past. This is in no way to suggest that it is a recent problem, but it has started gaining more importance lately due to a surge in data. There has been a tremendous increase in the way sensors are being used in industry. These sensors continuously record data and store it for analysis at a later point. In the way data gets captured, there can be a lot of redundancy.

10.2.4 Advantages of Dimensionality Reduction
• Dimensionality reduction helps in data compression, and hence reduces the required storage space.
• It reduces computation time and speeds up repeated computations.
• It helps remove redundant features, if any. For example, there is no point in storing a value in two different units (meters and inches).
• With fewer dimensions there is less computing, and lower dimensionality also allows the use of algorithms unfit for a large number of dimensions.
• It takes care of multicollinearity, which improves model performance.
• Reducing the dimensions of data to 2D or 3D may allow us to plot and visualize it precisely, so patterns can be observed more clearly. Figure 10.1 below shows how 3D data is converted into 2D: first the 2D plane is identified, then the points are represented on the two new axes z1 and z2.

Figure 10.1 Dimension Reduction
• It is also helpful in noise removal, and as a result we can improve the performance of models.

10.2.5 Disadvantages of Dimensionality Reduction
• Basically, it may lead to some amount of data loss.
• PCA tends to find linear correlations between variables, which is sometimes undesirable.
• PCA fails in cases where mean and covariance are not enough to define datasets.
• We may not know how many principal components to keep; in practice, some rules of thumb are applied.

10.3 DIMENSIONALITY REDUCTION TECHNIQUES

Fig 10.2 Dimensionality Reduction Techniques
• Missing Values Ratio. Data columns with too many missing values are unlikely to carry much useful information. Thus, data columns with a ratio of missing values greater than a given threshold can be removed. The higher the threshold, the more aggressive the reduction.
• Low Variance Filter. Similar to the previous technique, data columns with little change in the data carry little information. Thus, all data columns with a variance lower than a given threshold can be removed. Notice that the variance depends on the column range, and therefore normalization is required before applying this technique.
• High Correlation Filter. Data columns with very similar trends are also likely to carry very similar information, and only one of them will suffice for classification. Here we calculate the Pearson product-moment correlation coefficient between numeric columns and the Pearson chi-square value between nominal columns. For the final classification, we retain only one column of each pair of columns whose pairwise correlation exceeds a given threshold. Notice that correlation depends on the column range, and therefore normalization is required before applying this technique. A rough sketch of these three filters is given below.
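The sketch below applies the three filters with pandas. It assumes all columns are numeric, and the thresholds (50% missing, 0.01 variance, 0.9 correlation) are illustrative values, not recommendations from the text.

```
# Rough pandas/NumPy sketch of the missing-values-ratio, low-variance and
# high-correlation filters; thresholds and the numeric-only assumption are illustrative.
import numpy as np
import pandas as pd

def filter_columns(df, max_missing=0.5, min_variance=1e-2, max_correlation=0.9):
    # 1. Missing Values Ratio: drop columns with too large a fraction of missing values.
    df = df.loc[:, df.isna().mean() <= max_missing]

    # 2. Low Variance Filter: normalize columns to [0, 1], then drop near-constant ones.
    normed = (df - df.min()) / (df.max() - df.min())
    df = df.loc[:, normed.var() >= min_variance]

    # 3. High Correlation Filter: keep only one column of each highly correlated pair.
    corr = df.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [c for c in upper.columns if (upper[c] > max_correlation).any()]
    return df.drop(columns=to_drop)
```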

• Random Forests/Ensemble Trees. Decision tree ensembles, often called random forests, are useful for column selection in addition to being effective classifiers. Here we generate a large and carefully constructed set of trees to predict the target classes and then use each column’s usage statistics to find the most informative subset of columns. We generate a large set (2,000) of very shallow trees (two levels), and each tree is trained on a small fraction (three columns) of the total number of columns. If a column is often selected as the best split, it is very likely to be an informative column that we should keep. For all columns, we calculate a score as the number of times that the column was selected for the split, divided by the number of times in which it was a candidate. The most predictive columns are those with the highest scores.
• Principal Component Analysis (PCA). Principal component analysis (PCA) is a statistical procedure that orthogonally transforms the original n numeric dimensions of a dataset into a new set of n dimensions called principal components. As a result of the transformation, the first principal component has the largest possible variance; each succeeding principal component has the highest possible variance under the constraint that it is orthogonal to (i.e., uncorrelated with) the preceding principal components. Keeping only the first m < n principal components reduces the data dimensionality while retaining most of the data information, i.e., the variation in the data. Notice that the PCA transformation is sensitive to the relative scaling of the original columns, and therefore the data need to be normalized before applying PCA. Also notice that the new coordinates (PCs) are not real, system-produced variables anymore; applying PCA to your dataset loses its interpretability. If interpretability of the results is important for your analysis, PCA is not the transformation you should apply.
• Backward Feature Elimination. In this technique, at a given iteration, the selected classification algorithm is trained on n input columns. Then we remove one input column at a time and train the same model on n−1 columns. The input column whose removal has produced the smallest increase in the error rate is removed, leaving us with n−1 input columns. The classification is then repeated using n−2 columns, and so on. Each iteration k produces a model trained on n−k columns and an error rate e(k). By selecting the maximum tolerable error rate, we define the smallest number of columns necessary to reach that classification performance with the selected machine learning algorithm.
• Forward Feature Construction. This is the inverse process to backward feature elimination. We start with one column only, progressively adding one column at a time, i.e., the column that produces the highest increase in performance. Both algorithms, backward feature elimination and forward feature construction, are quite expensive in terms of time and computation. They are practical only when applied to a dataset with an already relatively low number of input columns.
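As a rough illustration of these ideas, the sketch below scores columns with a forest of shallow trees and then runs a recursive backward elimination with scikit-learn. It approximates, rather than reproduces, the exact procedures described above, and the parameter values are arbitrary.

```
# Illustrative sketch: tree-based column scoring and recursive backward elimination.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

X, y = load_breast_cancer(return_X_y=True)

# Many shallow trees, each split choosing among only a few candidate columns.
forest = RandomForestClassifier(n_estimators=2000, max_depth=2,
                                max_features=3, random_state=0).fit(X, y)
scores = forest.feature_importances_            # higher score = more informative column
print("most informative column index:", scores.argmax())

# Backward-elimination flavour: repeatedly drop the least useful column until 10 remain.
rfe = RFE(estimator=RandomForestClassifier(n_estimators=100, random_state=0),
          n_features_to_select=10).fit(X, y)
print("columns kept:", int(rfe.support_.sum()))
```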

Three More Techniques for Data Dimensionality Reduction
Let’s look at three further techniques: linear discriminant analysis (LDA), the neural autoencoder and t-distributed stochastic neighbour embedding (t-SNE).

Linear Discriminant Analysis (LDA)
A number m of linear combinations (discriminant functions) of the n input features, with m < n, are produced to be uncorrelated and to maximize class separation. These discriminant functions become the new basis for the dataset. All numeric columns in the dataset are projected onto these linear discriminant functions, effectively moving the dataset from the n-dimensional to the m-dimensional space. In order to apply the LDA technique for dimensionality reduction, the target column has to be selected first. The maximum number of reduced dimensions m is the number of classes in the target column minus one or, if smaller, the number of numeric columns in the data. Notice that linear discriminant analysis assumes that the target classes follow a multivariate normal distribution with the same variance but with a different mean for each class.

Autoencoder
An autoencoder is a neural network with as many n output units as input units, at least one hidden layer with m units where m < n, and trained with the backpropagation algorithm to reproduce the input vector onto the output layer. It reduces the numeric columns in the data by using the output of the hidden layer to represent the input vector. The first part of the autoencoder — from the input layer to the hidden layer of m units — is called the encoder. It compresses the n dimensions of the input dataset into an m-dimensional space. The second part of the autoencoder — from the hidden layer to the output layer — is known as the decoder. The decoder expands the data vector from an m-dimensional space into the original n-dimensional dataset and brings the data back to their original values.

t-distributed Stochastic Neighbor Embedding (t-SNE)
This technique reduces the n numeric columns in the dataset to fewer dimensions m (m < n) based on nonlinear local relationships among the data points. Specifically, it models each high-dimensional object by a two- or three-dimensional point in such a way that similar objects are modelled by nearby points and dissimilar objects are modelled by distant points in the new lower dimensional space. In the first step, the data points are modelled through a multivariate normal distribution of the numeric columns. In the second step, this distribution is replaced by a lower dimensional t-distribution, which follows the original multivariate normal distribution as closely as possible. The t-distribution gives the probability of picking another point in the dataset as a neighbour to the current point in the lower dimensional space. The perplexity parameter controls the density of the data as the “effective number of neighbours for any point.” The greater the value of the perplexity, the more global structure is considered in the data. The t-SNE technique works only on the current dataset; it is not possible to export the model to apply it to new data. A brief scikit-learn sketch of LDA and t-SNE follows.
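The following sketch runs both transformations on the digits dataset with scikit-learn; the parameter values (n_components, perplexity) are example choices only.

```
# Illustrative LDA and t-SNE reductions of the 64-dimensional digits dataset.
from sklearn.datasets import load_digits
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)            # 64 input dimensions, 10 classes

# LDA is supervised: at most (number of classes - 1) = 9 output dimensions.
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)

# t-SNE is unsupervised and nonlinear; it embeds only this dataset and, as noted
# above, cannot be exported and applied to new data.
X_tsne = TSNE(n_components=2, perplexity=30.0, random_state=0).fit_transform(X)

print(X_lda.shape, X_tsne.shape)               # (1797, 2) (1797, 2)
```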

10.4 PRINCIPAL COMPONENT ANALYSIS
Principal Component Analysis, or PCA, is a dimensionality-reduction method that is often used to reduce the dimensionality of large data sets by transforming a large set of variables into a smaller one that still contains most of the information in the large set. Reducing the number of variables of a data set naturally comes at the expense of accuracy, but the trick in dimensionality reduction is to trade a little accuracy for simplicity, because smaller data sets are easier to explore and visualize, and they make analyzing data much easier and faster for machine learning algorithms without extraneous variables to process. The idea of PCA is simple: reduce the number of variables of a data set while preserving as much information as possible.

In projection methods, we are interested in finding a mapping from the inputs in the original d-dimensional space to a new (k < d)-dimensional space, with minimum loss of information. The projection of x on the direction of w is
\[ z = w^T x \]
Principal components analysis (PCA) is an unsupervised method in that it does not use the output information; the criterion to be maximized is the variance. The principal component is \(w_1\) such that the sample, after projection onto \(w_1\), is most spread out, so that the difference between the sample points becomes most apparent. For a unique solution, and to make the direction the important factor, we require \(\|w_1\| = 1\). We know from equation 5.14 that if \(z_1 = w_1^T x\) with \(\mathrm{Cov}(x) = \Sigma\), then
\[ \mathrm{Var}(z_1) = w_1^T \Sigma w_1 \]

We seek \(w_1\) such that \(\mathrm{Var}(z_1)\) is maximized subject to the constraint that \(w_1^T w_1 = 1\). Writing this as a Lagrange problem, we have
\[ \max_{w_1} \; w_1^T \Sigma w_1 - \alpha (w_1^T w_1 - 1) \]
Taking the derivative with respect to \(w_1\) and setting it equal to 0, we have
\[ 2 \Sigma w_1 - 2 \alpha w_1 = 0, \quad \text{and therefore} \quad \Sigma w_1 = \alpha w_1 \]
which holds if \(w_1\) is an eigenvector of \(\Sigma\) and \(\alpha\) the corresponding eigenvalue. Because we want to maximize
\[ w_1^T \Sigma w_1 = \alpha w_1^T w_1 = \alpha \]
we choose the eigenvector with the largest eigenvalue for the variance to be maximum. Therefore the principal component is the eigenvector of the covariance matrix of the input sample with the largest eigenvalue, \(\lambda_1 = \alpha\).

The second principal component, \(w_2\), should also maximize variance, be of unit length, and be orthogonal to \(w_1\). This latter requirement is so that after projection \(z_2 = w_2^T x\) is uncorrelated with \(z_1\). For the second principal component, we have
\[ \max_{w_2} \; w_2^T \Sigma w_2 - \alpha (w_2^T w_2 - 1) - \beta (w_2^T w_1 - 0) \]
Taking the derivative with respect to \(w_2\) and setting it equal to 0, we have
\[ 2 \Sigma w_2 - 2 \alpha w_2 - \beta w_1 = 0 \]
Premultiplying by \(w_1^T\), we get
\[ 2 w_1^T \Sigma w_2 - 2 \alpha w_1^T w_2 - \beta w_1^T w_1 = 0 \]
Note that \(w_1^T w_2 = 0\), and that \(w_1^T \Sigma w_2\) is a scalar, equal to its transpose \(w_2^T \Sigma w_1\), where, because \(w_1\) is the leading eigenvector of \(\Sigma\), \(\Sigma w_1 = \lambda_1 w_1\). Therefore

\[ w_1^T \Sigma w_2 = w_2^T \Sigma w_1 = \lambda_1 w_2^T w_1 = 0 \]
Then \(\beta = 0\), and equation 6.8 reduces to
\[ \Sigma w_2 = \alpha w_2 \]
which implies that \(w_2\) should be the eigenvector of \(\Sigma\) with the second largest eigenvalue, \(\lambda_2 = \alpha\). Similarly, we can show that the other dimensions are given by the eigenvectors with decreasing eigenvalues.

Because \(\Sigma\) is symmetric, for two different eigenvalues the eigenvectors are orthogonal. If \(\Sigma\) is positive definite (\(x^T \Sigma x > 0\) for all nonnull \(x\)), then all its eigenvalues are positive. If \(\Sigma\) is singular, then its rank, the effective dimensionality, is \(k\) with \(k < d\), and \(\lambda_i,\ i = k+1, \ldots, d\) are 0. The k eigenvectors with nonzero eigenvalues are the dimensions of the reduced space. The first eigenvector (the one with the largest eigenvalue), \(w_1\), namely the principal component, explains the largest part of the variance; the second explains the second largest; and so on.

We define
\[ z = W^T (x - m) \]
where the k columns of W are the k leading eigenvectors of S, the estimator of \(\Sigma\). We subtract the sample mean m from x before projection to center the data on the origin. After this linear transformation, we get to a k-dimensional space whose dimensions are the eigenvectors, and the variances over these new dimensions are equal to the eigenvalues (see Figure 10.3). To normalize variances, we can divide by the square roots of the eigenvalues.

Figure 10.3 Principal Component Analysis
Principal components analysis centers the sample and then rotates the axes to line up with the directions of highest variance. If the variance on \(z_2\) is too small, it can be ignored and we have dimensionality reduction from two dimensions to one. A small NumPy sketch of this projection is given below.
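The projection \(z = W^T(x - m)\) can be computed directly from an eigendecomposition of the sample covariance matrix. The sketch below is a minimal illustration; the synthetic data and the choice k = 2 are assumptions made only for the example.

```
# Minimal NumPy sketch of PCA via eigendecomposition, following z = W^T (x - m) above.
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[0, 0, 0],
                            cov=[[3, 2, 0], [2, 2, 0], [0, 0, 0.1]], size=500)

m = X.mean(axis=0)                      # sample mean
S = np.cov(X, rowvar=False)             # sample covariance matrix S (estimator of Sigma)

eigvals, eigvecs = np.linalg.eigh(S)    # eigh: S is symmetric; eigenvalues come ascending
order = np.argsort(eigvals)[::-1]       # reorder into descending eigenvalue order
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

k = 2
W = eigvecs[:, :k]                      # k leading eigenvectors as columns of W
Z = (X - m) @ W                         # projected data z, shape (500, k)

print("variances of z:", Z.var(axis=0, ddof=1))   # approximately the top eigenvalues
print("top eigenvalues:", eigvals[:k])
```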

Let us see another derivation: we want to find a matrix W such that when we have \(z = W^T x\) (assume without loss of generality that x are already centered), we will get \(\mathrm{Cov}(z) = D\), where D is any diagonal matrix; that is, we would like to get uncorrelated \(z_i\). If we form a (d × d) matrix C whose ith column is the normalized eigenvector \(c_i\) of S, then \(C^T C = I\) and
\[ S = S C C^T = S (c_1, c_2, \ldots, c_d) C^T = (S c_1, S c_2, \ldots, S c_d) C^T = (\lambda_1 c_1, \lambda_2 c_2, \ldots, \lambda_d c_d) C^T = \lambda_1 c_1 c_1^T + \cdots + \lambda_d c_d c_d^T = C D C^T \]
where D is a diagonal matrix whose diagonal elements are the eigenvalues \(\lambda_1, \ldots, \lambda_d\). This is called the spectral decomposition of S. Since C is orthogonal and \(C C^T = C^T C = I\), we can multiply on the left by \(C^T\) and on the right by C to obtain
\[ C^T S C = D \]
We know that if \(z = W^T x\), then \(\mathrm{Cov}(z) = W^T S W\), which we would like to be equal to a diagonal matrix. Then from equation 6.11, we see that we can set W = C.

Let us see an example to get some intuition (Rencher 1995): assume we are given a class of students with grades on five courses and we want to order these students. That is, we want to project the data onto one dimension such that the difference between the data points becomes most apparent. We can use PCA. The eigenvector with the highest eigenvalue is the direction that has the highest variance, that is, the direction along which the students are most spread out. This works better than taking the average because we take into account correlations and differences in variances.

In practice, even if all eigenvalues are greater than 0, if |S| is small, remembering that \(|S| = \prod_{i=1}^{d} \lambda_i\), we understand that some eigenvalues have little contribution to the variance and may be discarded. Then, we take into account the leading k components that explain more than, for example, 90 percent of the variance. When the \(\lambda_i\) are sorted in descending order,

the proportion of variance explained by the k principal components is
\[ \frac{\lambda_1 + \lambda_2 + \cdots + \lambda_k}{\lambda_1 + \lambda_2 + \cdots + \lambda_k + \cdots + \lambda_d} \]
If the dimensions are highly correlated, there will be a small number of eigenvectors with large eigenvalues, k will be much smaller than d, and a large reduction in dimensionality may be attained. This is typically the case in many image and speech processing tasks where nearby inputs (in space or time) are highly correlated. If the dimensions are not correlated, k will be as large as d and there is no gain through PCA.

The scree graph is the plot of variance explained as a function of the number of eigenvectors kept (see Figure 10.4). By visually analyzing it, one can also decide on k: at the “elbow,” adding another eigenvector does not significantly increase the variance explained. Another possibility is to ignore the eigenvectors whose eigenvalues are less than the average input variance. Given that \(\sum_i \lambda_i = \sum_i s_i^2\) (equal to the trace of S, denoted as tr(S)), the average eigenvalue is equal to the average input variance. When we keep only the eigenvectors with eigenvalues greater than the average eigenvalue, we keep only those that have variance higher than the average input variance.

If the variances of the original \(x_i\) dimensions vary considerably, they affect the direction of the principal components more than the correlations do, so a common procedure is to preprocess the data so that each dimension has mean 0 and unit variance before using PCA. Alternatively, one may use the eigenvectors of the correlation matrix, R, instead of the covariance matrix, S, for the correlations to be effective rather than the individual variances.

PCA explains variance and is sensitive to outliers: a few points distant from the center can have a large effect on the variances and thus the eigenvectors. Robust estimation methods allow calculating parameters in the presence of outliers. A simple method is to calculate the Mahalanobis distance of the data points, discarding the isolated data points that are far away. A short sketch of the proportion-of-variance criterion is shown below.
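The following sketch computes the proportion of variance explained with scikit-learn and picks the smallest k crossing a 90 percent cutoff; both the dataset and the cutoff are only the illustrative choices mentioned in the text.

```
# Proportion-of-variance criterion: choose the smallest k explaining >= 90% of variance.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)

pca = PCA().fit(X)                                  # keep all components for now
ratios = pca.explained_variance_ratio_              # lambda_i / sum_j lambda_j
cumulative = np.cumsum(ratios)

k = int(np.searchsorted(cumulative, 0.90)) + 1      # smallest k with cumulative >= 0.90
print(f"{k} of {X.shape[1]} components explain {cumulative[k - 1]:.1%} of the variance")

# Plotting (k, cumulative) gives the proportion-of-variance curve of Figure 10.4(b).
```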

Figure 10.4 (a) Scree graph. (b) Proportion of variance explained, shown for the Optdigits dataset from the UCI Repository.

If the first two principal components explain a large percentage of the variance, we can do visual analysis: we can plot the data in this two-dimensional space and search visually for structure, groups, outliers, normality, and so forth. A brief plotting sketch is given below; it produces a picture of the kind shown in Figure 10.5.
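This is a minimal matplotlib sketch of that visual analysis; the digits dataset and the use of class labels purely for coloring are illustrative assumptions.

```
# Project onto the first two principal components and scatter-plot them.
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, y = load_digits(return_X_y=True)
Z = PCA(n_components=2).fit_transform(X)

plt.scatter(Z[:, 0], Z[:, 1], c=y, cmap="tab10", s=10)
plt.xlabel("z1 (first principal component)")
plt.ylabel("z2 (second principal component)")
plt.show()
```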

Figure 10.5 Principal components in 2D space

This plot gives a better pictorial description of the sample than a plot of any two of the original variables. By looking at the dimensions of the principal components, we can also try to recover meaningful underlying variables that describe the data. For example, in image applications where the inputs are images, the eigenvectors can also be displayed as images and can be seen as templates for important features; they are typically named “eigenfaces,” “eigendigits,” and so forth (Turk and Pentland 1991). When d is large, calculating, storing, and processing S may be tedious. It is possible to calculate the eigenvectors and eigenvalues directly from data without explicitly calculating the covariance matrix (Chatfield and Collins 1980).

We know from equation 5.15 that if \(x \sim \mathcal{N}_d(\mu, \Sigma)\), then after projection \(W^T x \sim \mathcal{N}_k(W^T \mu, W^T \Sigma W)\). If the sample contains d-variate normals, then it projects to k-variate normals, allowing us to do parametric discrimination in this hopefully much lower dimensional space. Because the \(z_j\) are uncorrelated, the new covariance matrices will be diagonal, and if they are normalized to have unit variance, Euclidean distance can be used in this new space, leading to a simple classifier.

Instance \(x^t\) is projected to the z-space as
\[ z^t = W^T (x^t - \mu) \]
When W is an orthogonal matrix such that \(W W^T = I\), it can be backprojected to the original space as
\[ \hat{x}^t = W z^t + \mu \]
\(\hat{x}^t\) is the reconstruction of \(x^t\) from its representation in the z-space. It is known that among all orthogonal linear projections, PCA minimizes the reconstruction error, which is the distance between the instance and its reconstruction from the lower dimensional space:
\[ \sum_t \| \hat{x}^t - x^t \|^2 \]
The reconstruction error depends on how many of the leading components are taken into account. In a visual recognition application—for example, face recognition—displaying \(\hat{x}^t\) allows a visual check for information loss during PCA.

PCA is unsupervised and does not use output information; it is a one-group procedure. However, in the case of classification, there are multiple groups. The Karhunen-Loève expansion

allows using class information; for example, instead of using the covariance matrix of the whole sample, we can estimate separate class covariance matrices, take their average (weighted by the priors) as the covariance matrix, and use its eigenvectors. In common principal components (Flury 1988), we assume that the principal components are the same for each class whereas the variances of these components differ for different classes:
\[ S_i = C D_i C^T \]
This allows pooling data and is a regularization method whose complexity is less than that of a common covariance matrix for all classes, while still allowing differentiation of \(S_i\). A related approach is flexible discriminant analysis (Hastie, Tibshirani, and Buja 1994), which does a linear projection to a lower-dimensional space where all features are uncorrelated and then uses a minimum distance classifier.

10.5 SUMMARY
• Dimensionality reduction refers to techniques for reducing the number of input variables in training data.
• A feature selection method selects a subset of the original variables; feature extraction derives useful features from the existing data.
• Dimensionality reduction helps in data compression and hence reduces storage space.
• Principal Component Analysis, or PCA, is a dimensionality-reduction method that is often used to reduce the dimensionality of large data sets.
• Decision tree ensembles, often called random forests, are useful for column selection in addition to being effective classifiers.
• Principal components analysis (PCA) is an unsupervised method.

10.6 KEYWORDS
• Dimensionality Reduction – reducing the number of input variables in a dataset.
• Principal Component Analysis – used to reduce the dimensionality of large data sets.
• Feature Selection – selecting those features which contribute most to the prediction variable.

• Feature Extraction – constructing combinations of the variables that still describe the data well.
• Neural Autoencoder – a network that learns to copy its input to its output.

10.7 LEARNING ACTIVITY
1. Use PCA to analyze the following 3-variate dataset with 10 observations. Each observation consists of 3 measurements on a wafer: thickness, horizontal displacement, and vertical displacement.
___________________________________________________________________________
____________________________________________________________________
2. Assume we have the dataset below, which has 4 features and a total of 5 training examples. Perform dimension reduction using PCA.
___________________________________________________________________________
____________________________________________________________________

10.8 UNIT END QUESTIONS
A. Descriptive Questions

Short Questions
1. Define dimensionality reduction.
2. List the types of dimensionality reduction techniques.
3. What is principal component analysis?
4. Define the missing values ratio.
5. List the advantages and disadvantages of dimensionality reduction.

Long Questions
1. Describe the components of PCA.
2. Elaborate on various dimension reduction strategies.
3. Compare the pros and cons of dimension reduction.
4. Illustrate the working of PCA with a relevant example.
5. How is feature selection used to perform dimension reduction?

B. Multiple Choice Questions
1. Which of the following algorithms cannot be used for reducing the dimensionality of data?
a. t-SNE
b. PCA
c. LDA
d. None of these
2. Which of the following comparison(s) are true about PCA and LDA?
a. Both LDA and PCA are linear transformation techniques
b. LDA is supervised whereas PCA is unsupervised
c. PCA can be trapped in the local minima problem
d. Both B and C
3. PCA maximizes the variance of the data, whereas LDA maximizes the separation between different classes.
a. 1 and 2
b. 2 and 3
c. 1 and 3
d. 1, 2 and 3

4. What will happen when the eigenvalues are roughly equal?
a. PCA will perform outstandingly
b. PCA will perform badly
c. Can’t say
d. None of these
5. Under which condition do SVD and PCA produce the same projection result?
a. When the data has zero median
b. When the data has zero mean
c. Both are always the same
d. None of these
6. The most popularly used dimensionality reduction algorithm is Principal Component Analysis (PCA). Which of the following is/are true about PCA?
1. PCA is an unsupervised method
2. It searches for the directions in which the data have the largest variance
3. Maximum number of principal components <= number of features
4. All principal components are orthogonal to each other
a. 1 and 2
b. 1 and 3
c. 2 and 3
d. All of these

Answers
1 – d, 2 – d, 3 – b, 4 – b, 5 – b, 6 – d

10.9 REFERENCES
Textbooks
• Peter Harrington, “Machine Learning in Action”, DreamTech Press
• Ethem Alpaydin, “Introduction to Machine Learning”, MIT Press
• Steven Bird, Ewan Klein and Edward Loper, “Natural Language Processing with Python”, O’Reilly Media

• Stephen Marsland, “Machine Learning: An Algorithmic Perspective”, CRC Press
Reference Books
• William W. Hsieh, “Machine Learning Methods in the Environmental Sciences”, Cambridge University Press
• Grant S. Ingersoll, Thomas S. Morton, Andrew L. Farris, “Taming Text”, Manning Publications Co.
• Margaret H. Dunham, “Data Mining: Introductory and Advanced Topics”, Pearson Education

UNIT - 11: NATURAL LANGUAGE PROCESSING II

Structure
11.0 Learning Objectives
11.1 Introduction
11.2 Key Terminology
11.3 Natural Language Processing
11.4 Feature Engineering on Text Data
11.5 Summary
11.6 Keywords
11.7 Learning Activity
11.8 Unit End Questions
11.9 References

11.0 LEARNING OBJECTIVES
After studying this unit, you will be able to:
• Discuss the fundamentals of NLP
• Identify the key terminologies of NLP
• Illustrate the types of feature engineering on text data

11.1 INTRODUCTION
Natural language processing (NLP), as the title suggests, involves a sort of processing to do with language or linguistics. NLP primarily comprises two major functionalities: the first is “Human to Machine Translation” (Natural Language Understanding), and the second is “Machine to Human Translation” (Natural Language Generation). This unit covers what NLP is, the history of NLP and different NLP techniques for finding inferences, mainly from sentiment data.

Natural language processing (NLP) is an intersection of Artificial Intelligence, Computer Science and Linguistics. The end goal of this technology is for computers to understand the content, nuances and sentiment of a document.

With NLP we can extract the information and insights contained in a document and then organize them into their respective categories. For example, whenever a user searches something on the Google search engine, Google’s algorithm shows all the relevant documents, blogs and articles using NLP techniques.

11.2 KEY TERMINOLOGY
Natural Language Processing
Natural language processing (NLP) is an intersection of Artificial Intelligence, Computer Science and Linguistics. The end goal of this technology is for computers to understand the content, nuances and sentiment of a document.
Tokenization
Tokenization is basically the splitting of the whole text into a list of tokens; the tokens can be anything such as words, sentences, characters, numbers, punctuation, etc.
Stemming
Stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form – generally a written form of the word.
Lemmatization
Lemmatization deals with the lemma of a word; it involves reducing the word form after understanding the part of speech (POS) or context of the word in the document.
Natural Language Generation
Natural language generation (NLG) is a technique that uses raw structured data and converts it into plain English (or any other) language. NLG makes data understandable to all by producing reports that are mainly data-driven, like stock-market and financial reports, meeting memos, reports on product requirements, etc.
Sentiment Analysis
Sentiment analysis is used to find whether the opinions expressed in any document, sentence, text, social media post or review are positive, negative, or neutral; this is also called finding the polarity of the text.
Syntactical Parsing
Syntactical parsing involves the analysis of words in the sentence for grammar and their arrangement in a manner that shows the relationships among the words. Dependency grammar and part of speech tags are the important attributes of text syntax.

11.3 NATURAL LANGUAGE PROCESSING
Natural language processing (NLP), as the title suggests, involves a sort of processing to do with language or linguistics. NLP primarily comprises two major functionalities: the first is “Human to Machine Translation” (Natural Language Understanding), and the second is “Machine to Human Translation” (Natural Language Generation). This section covers what NLP is, the history of NLP and different NLP techniques for finding inferences, mainly from sentiment data.

What is NLP?
Natural language processing (NLP) is an intersection of Artificial Intelligence, Computer Science and Linguistics. The end goal of this technology is for computers to understand the content, nuances and sentiment of a document. With NLP we can extract the information and insights contained in a document and then organize them into their respective categories. For example, whenever a user searches something on the Google search engine, Google’s algorithm shows all the relevant documents, blogs and articles using NLP techniques.

HISTORY OF NLP
The history of NLP goes back to 1950, when Alan Turing published an article titled “Computing Machinery and Intelligence”, which is also known as the “Turing test”. In that article the question “Can machines think?” was considered. Since this question contains ambiguous words such as “machines” and “think”, Turing suggested replacing it with another, closely related question expressed in unambiguous words. In the 1960s, some natural language processing systems were developed, such as SHRDLU, alongside the work of Chomsky and others on formal language theory and generative syntax. Up to the 1980s, natural language processing evolved with the introduction of machine learning algorithms for language processing. Later, in the 2000s, a massive amount of audio and textual data became available to everyone.

Techniques of natural language processing
1. Named Entity Recognition (NER)
2. Tokenization
3. Stemming and Lemmatization
4. Bag of Words
5. Natural Language Generation
6. Sentiment Analysis
7. Sentence Segmentation

1. NAMED ENTITY RECOGNITION (NER)
This technique is one of the most popular and advantageous techniques in semantic analysis; semantics is something conveyed by the text. Under this technique, the algorithm takes a phrase or paragraph as input and identifies all the nouns or names present in that input. There are many popular use cases of this algorithm; below are some of the everyday ones:
1. News Categorization: The algorithm automatically scans news articles and extracts all sorts of information, such as individuals, companies, organizations, people, celebrity names and places, from the article. Using this algorithm we can easily classify news content into different categories.
2. Efficient Search Engine: The named entity recognition algorithm is applied to all the articles, results and news to extract relevant tags and store them separately. This speeds up the searching process and makes for an efficient search engine.
3. Customer Support: You must have read thousands of feedback posts concerning heavy-traffic areas published on Twitter on a daily basis. If a named entity recognition API is used, then we can easily pull out all the keywords (or tags) to inform the concerned traffic police departments.
A small NER sketch is given below.
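This is a minimal spaCy sketch of named entity recognition on the example sentence used later in this unit; it assumes the small English model has been downloaded first (python -m spacy download en_core_web_sm).

```
# Minimal NER sketch with spaCy; requires the en_core_web_sm model to be installed.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Sergey Brin, the manager of Google Inc. is walking in the streets of New York.")

for ent in doc.ents:
    print(ent.text, "->", ent.label_)   # e.g. Sergey Brin -> PERSON, Google Inc. -> ORG
```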

2. TOKENIZATION
Tokenization is the splitting of the whole text into a list of tokens; the tokens can be anything such as words, sentences, characters, numbers, punctuation, etc. Tokenization has two main advantages: one is to reduce search to a significant degree, and the second is to make effective use of storage space. The process of mapping sentences from characters to strings and strings into words forms the initial, basic steps of any NLP problem, because to understand any text or document we need to understand the meaning of the text by interpreting the words and sentences present in it. Tokenization is an integral part of any Information Retrieval (IR) system; it not only involves the preprocessing of text but also generates the tokens that are used in the indexing/ranking process. There are various tokenization techniques available, among which Porter’s algorithm is one of the most prominent related techniques.

3. STEMMING AND LEMMATIZATION
The amount of data and information on the web has been at an all-time high for the past couple of years. This huge amount of data and information demands the necessary tools and techniques to extract inferences with ease. “Stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form – generally a written form of the word.” For example, stemming basically cuts off suffixes: after applying stemming to the word “playing”, it becomes “play”, and “asked” becomes “ask”.

Figure 11.1 Stemming vs Lemmatization

Lemmatization usually refers to doing things with the proper use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma. In simple words, lemmatization deals with the lemma of a word; it involves reducing the word form after understanding the part of speech (POS) or context of the word in the document. A short NLTK sketch contrasting the two follows.
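The sketch below contrasts a Porter stem with a WordNet lemma using NLTK; the word list is made up, and the wordnet/omw-1.4 resources must be downloaded once before the lemmatizer can run.

```
# Stemming vs. lemmatization with NLTK.
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)
nltk.download("omw-1.4", quiet=True)

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["playing", "asked", "studies", "running"]:
    print(word,
          "| stem:", stemmer.stem(word),
          "| lemma:", lemmatizer.lemmatize(word, pos="v"))   # pos="v" treats the word as a verb
```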

4. BAG OF WORDS
The bag-of-words technique is used to preprocess text and to extract all the features from a text document for use in machine learning modelling. It is a representation of text that describes the occurrence of the known words within a corpus (document). It is called a “bag” because of its mechanism: it is only concerned with whether known words occur in the document, not with the location of the words. Let’s take an example to understand bag-of-words in more detail. Below, we take 2 text documents:
“Neha was angry on Sunil and he was angry on Ramesh.”
“Neha love animals.”
We treat both documents as different entities and make a list of all the words present in both documents, except punctuation:
“Neha”, “was”, “angry”, “on”, “Sunil”, “and”, “he”, “Ramesh”, “love”, “animals”
Then we turn these documents into vectors (creating numbers from text is called vectorization in ML) for further modelling. The representation of “Neha was angry on Sunil and he was angry on Ramesh” in vector form is [1,1,1,1,1,1,1,1,0,0], and “Neha love animals” has the vector form [1,0,0,0,0,0,0,0,1,1]. So, the bag-of-words technique is mainly used for feature generation from text data. A small scikit-learn sketch of this vectorization follows.
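The same vectorization can be done with scikit-learn’s CountVectorizer. Note that the vocabulary it builds is lowercased and sorted alphabetically, so the column order differs from the hand-worked list above; binary=True records only presence or absence, as in the example.

```
# Bag-of-words vectorization of the two example sentences.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["Neha was angry on Sunil and he was angry on Ramesh.",
        "Neha love animals."]

vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())   # the shared vocabulary (alphabetical order)
print(X.toarray())                          # one presence/absence vector per document
```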

5. NATURAL LANGUAGE GENERATION
Natural language generation (NLG) is a technique that uses raw structured data and converts it into plain English (or any other) language; we also call it data storytelling. This technique is very helpful in organizations where a large amount of data is used: it converts structured data into natural language for a better understanding of patterns or for detailed insights into a business. It can be viewed as the opposite of Natural Language Understanding (NLU), explained above. NLG makes data understandable to all by producing reports that are mainly data-driven, like stock-market and financial reports, meeting memos, reports on product requirements, etc. There are several stages in any NLG system:
1. Content Determination: Deciding the main content to be represented in the text or the information to be provided in the text.
2. Document Clustering: Deciding the overall structure of the information to convey.
3. Aggregation: Merging sentences to improve sentence understanding and readability.
4. Lexical Choice: Choosing appropriate words to convey the meaning of the sentence more clearly.
5. Referring Expression Generation: Creating references to identify the main objects and regions of the text properly.
6. Realization: Creating and optimizing text that follows all the norms of grammar (syntax, morphology, orthography).

6. SENTIMENT ANALYSIS
This is one of the most common natural language processing techniques. With sentiment analysis, we can understand the emotion or feeling of a written text. Sentiment analysis is also known as emotion AI or opinion mining. The basic task of sentiment analysis is to find whether the opinions expressed in any document, sentence, text, social media post or review are positive, negative, or neutral; this is also called finding the polarity of the text. A minimal polarity-scoring sketch is given after Figure 11.2.

Figure 11.2 Analysis understanding
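This is a minimal NLTK/VADER sketch of polarity scoring; the example sentences are made up, and the vader_lexicon resource is downloaded on first use.

```
# Polarity scoring with NLTK's VADER sentiment analyzer.
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)
sia = SentimentIntensityAnalyzer()

for text in ["I absolutely love this product!",
             "The delivery was late and the packaging was damaged.",
             "The parcel arrived on Tuesday."]:
    score = sia.polarity_scores(text)["compound"]   # compound score in [-1, +1]
    label = "positive" if score > 0.05 else "negative" if score < -0.05 else "neutral"
    print(f"{label:8s} ({score:+.2f})  {text}")
```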

Sentiment analysis usually works best on subjective text data rather than objective text data. Generally, objective text data consists of statements or facts that do not represent any emotion or feeling. On the other hand, subjective text is usually written by humans expressing emotions and feelings. For example, Twitter is filled with sentiments: users express their reactions or opinions on every topic wherever possible. To access users’ tweets in a real-time scenario, there is a powerful Python library called Tweepy.

7. SENTENCE SEGMENTATION
The most fundamental task of this technique is to divide all the text into meaningful sentences or phrases. This task involves identifying sentence boundaries between words in text documents. We all know that almost all languages have punctuation marks that are present at sentence boundaries, so sentence segmentation is also referred to as sentence boundary detection, sentence boundary disambiguation or sentence boundary recognition.

11.4 FEATURE ENGINEERING ON TEXT DATA
To analyze preprocessed data, it needs to be converted into features. Depending upon the usage, text features can be constructed using assorted techniques: syntactical parsing, entities / n-grams / word-based features, statistical features, and word embeddings. Read on to understand these techniques in detail.

11.4.1 Syntactic Parsing
Syntactical parsing involves the analysis of words in the sentence for grammar and their arrangement in a manner that shows the relationships among the words. Dependency grammar and part of speech tags are the important attributes of text syntax.

Dependency Trees – Sentences are composed of words sewed together. The relationship among the words in a sentence is determined by the basic dependency grammar. Dependency grammar is a class of syntactic text analysis that deals with (labeled) asymmetrical binary relations between two lexical items (words). Every relation can be represented in the form of a triplet (relation, governor, dependent). For example, consider the sentence “Bills on ports and immigration were submitted by Senator Brownback, Republican of Kansas.” The relationship among the words can be observed in the form of a tree representation as shown:

Figure 11.3 Dependency Tree

The tree shows that “submitted” is the root word of this sentence and is linked by two sub-trees (the subject and object subtrees). Each subtree is itself a dependency tree, with relations such as (“Bills” <-> “ports” <by> “preposition” relation) and (“ports” <-> “immigration” <by> “conjunction” relation). This type of tree, when parsed recursively in a top-down manner, gives grammar relation triplets as output, which can be used as features for many NLP problems like entity-wise sentiment analysis, actor and entity identification, and text classification. The Python wrapper StanfordCoreNLP (by the Stanford NLP Group; commercial licensing available) and NLTK dependency grammars can be used to generate dependency trees.

Part of speech tagging – Apart from the grammar relations, every word in a sentence is also associated with a part of speech (POS) tag (nouns, verbs, adjectives, adverbs, etc.). The POS tags define the usage and function of a word in the sentence. A list of all possible POS tags is defined by the University of Pennsylvania (the Penn Treebank tag set). The following code using NLTK performs POS tagging annotation on input text (NLTK provides several implementations; the default one is the perceptron tagger, and the required tokenizer and tagger resources must be downloaded once via nltk.download()).
```

from nltk import word_tokenize, pos_tag

text = "I am learning Natural Language Processing on Analytics Vidhya"
tokens = word_tokenize(text)
print(pos_tag(tokens))

>>> [('I', 'PRP'), ('am', 'VBP'), ('learning', 'VBG'), ('Natural', 'NNP'), ('Language', 'NNP'),
     ('Processing', 'NNP'), ('on', 'IN'), ('Analytics', 'NNP'), ('Vidhya', 'NNP')]
```
Part of speech tagging is used for many important purposes in NLP:

A. Word sense disambiguation: Some words have multiple meanings according to their usage. For example, in the two sentences below:
I. “Please book my flight for Delhi”
II. “I am going to read this book in the flight”
“Book” is used in different contexts, and the part of speech tag differs between the two cases. In sentence I, the word “book” is used as a verb, while in sentence II it is used as a noun.

B. Improving word-based features: A learning model can learn different contexts of a word when words are used as features; however, if the part of speech tag is linked with them, the context is preserved, thus making stronger features. For example:
Sentence – “book my flight, I will read this book”
Tokens – (“book”, 2), (“my”, 1), (“flight”, 1), (“I”, 1), (“will”, 1), (“read”, 1), (“this”, 1)
Tokens with POS – (“book_VB”, 1), (“my_PRP$”, 1), (“flight_NN”, 1), (“I_PRP”, 1), (“will_MD”, 1), (“read_VB”, 1), (“this_DT”, 1), (“book_NN”, 1)

C. Normalization and lemmatization: POS tags are the basis of the lemmatization process for converting a word to its base form (lemma).

D. Efficient stopword removal: POS tags are also useful in the efficient removal of stopwords. For example, there are some tags which always define the low-frequency / less important words of a language, such as (IN – “within”, “upon”, “except”), (CD – “one”, “two”, “hundred”), (MD – “may”, “must”, etc.).

11.4.2 Entity Extraction (Entities as Features)
Entities are defined as the most important chunks of a sentence: noun phrases, verb phrases or both. Entity detection algorithms are generally ensemble models of rule-based parsing, dictionary lookups, POS tagging and dependency parsing. The applicability of entity detection can be seen in automated chat bots, content analyzers and consumer insights. Topic modelling and named entity recognition are the two key entity detection methods in NLP.

A. Named Entity Recognition (NER)
The process of detecting named entities such as person names, location names and company names from text is called NER. For example:
Sentence – Sergey Brin, the manager of Google Inc. is walking in the streets of New York.
Named Entities – (“person”: “Sergey Brin”), (“org”: “Google Inc.”), (“location”: “New York”)
A typical NER model consists of three blocks:
Noun phrase identification: This step deals with extracting all the noun phrases from a text using dependency parsing and part of speech tagging.
Phrase classification: This is the classification step in which all the extracted noun phrases are classified into their respective categories (locations, names, etc.). The Google Maps API provides a good path to disambiguate locations; then, open databases such as DBpedia and Wikipedia can be used to identify person names or company names. Apart from this, one can curate lookup tables and dictionaries by combining information from different sources.
Entity disambiguation: Sometimes entities are misclassified, hence creating a validation layer on top of the results is useful. Knowledge graphs can be exploited for this purpose; popular knowledge graphs are Google Knowledge Graph, IBM Watson and Wikipedia.

B. Topic Modeling

B. Topic Modeling

Topic modeling is the process of automatically identifying the topics present in a text corpus; it derives the hidden patterns among the words in the corpus in an unsupervised manner. Topics are defined as "a repeating pattern of co-occurring terms in a corpus". A good topic model results in "health", "doctor", "patient", "hospital" for the topic Healthcare, and "farm", "crops", "wheat" for the topic Farming.

Latent Dirichlet Allocation (LDA) is the most popular topic modelling technique. Following is the code to implement topic modeling using LDA in Python.

```
doc1 = "Sugar is bad to consume. My sister likes to have sugar, but not my father."
doc2 = "My father spends a lot of time driving my sister around to dance practice."
doc3 = "Doctors suggest that driving may cause increased stress and blood pressure."
doc_complete = [doc1, doc2, doc3]
doc_clean = [doc.split() for doc in doc_complete]

import gensim
from gensim import corpora

# Creating the term dictionary of our corpus, where every unique term is assigned an index.
dictionary = corpora.Dictionary(doc_clean)

# Converting the list of documents (corpus) into a Document-Term Matrix using the dictionary prepared above.
doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_clean]

# Creating the object for the LDA model using the gensim library
Lda = gensim.models.ldamodel.LdaModel

# Running and training the LDA model on the document-term matrix
ldamodel = Lda(doc_term_matrix, num_topics=3, id2word=dictionary, passes=50)

# Results
print(ldamodel.print_topics())
```

C. N-Grams as Features

A combination of N words together is called an N-gram. N-grams (N > 1) are generally more informative than single words (unigrams) as features, and bigrams (N = 2) are considered the most important of them. The following code generates the bigrams of a text.

```
def generate_ngrams(text, n):
    words = text.split()
    output = []
    for i in range(len(words) - n + 1):
        output.append(words[i:i + n])
    return output

>>> generate_ngrams('this is a sample text', 2)
# [['this', 'is'], ['is', 'a'], ['a', 'sample'], ['sample', 'text']]
```

11.4.3 Statistical Features

Text data can also be quantified directly into numbers using several techniques described in this section.

A. Term Frequency – Inverse Document Frequency (TF-IDF)

TF-IDF is a weighted model commonly used for information retrieval problems. It aims to convert text documents into vector models on the basis of the occurrence of words in the documents, without considering the exact ordering. For example, let us say there is a dataset of N text documents. For any document "D", TF and IDF are defined as follows.

Term Frequency (TF) – TF for a term "t" is defined as the count of the term "t" in a document "D".

Inverse Document Frequency (IDF) – IDF for a term is defined as the logarithm of the ratio of the total number of documents available in the corpus to the number of documents containing the term "t".

TF-IDF – The TF-IDF score gives the relative importance of a term in a corpus (list of documents): TF-IDF(t, D) = TF(t, D) × log(N / df(t)), where N is the total number of documents and df(t) is the number of documents containing the term "t".

Following is the code using Python's scikit-learn package to convert a text into TF-IDF vectors:

```
from sklearn.feature_extraction.text import TfidfVectorizer

obj = TfidfVectorizer()
corpus = ['This is sample document.', 'another random document.', 'third sample document text']
X = obj.fit_transform(corpus)
print(X)

>>> (0, 1) 0.345205016865
    (0, 4) ... 0.444514311537
    (2, 1) 0.345205016865
    (2, 4) 0.444514311537
```

The model creates a vocabulary dictionary and assigns an index to each word. Each row in the output contains a tuple (i, j) and the tf-idf value of the word at index j in document i.

B. Count / Density / Readability Features

Count or density based features can also be used in models and analysis. These features might seem trivial but show a great impact in learning models. Some of the features are: word count, sentence count, punctuation counts and industry-specific word counts. Other measures include readability measures such as syllable counts, the SMOG index and Flesch reading ease. Refer to the Textstat library to create such features.

11.4.4 Word Embedding (Text Vectors)

Word embedding is the modern way of representing words as vectors. The aim of word embedding is to redefine high dimensional word features as low dimensional feature vectors while preserving the contextual similarity in the corpus. Word embeddings are widely used in deep learning models such as Convolutional Neural Networks and Recurrent Neural Networks.

Word2Vec and GloVe are the two popular models for creating word embeddings of a text. These models take a text corpus as input and produce word vectors as output. The Word2Vec model is composed of a preprocessing module, a shallow neural network model called Continuous Bag of Words and another shallow neural network model called Skip-gram. It first constructs a vocabulary from the training corpus and then learns the word embedding representations; these embeddings are widely used across other NLP problems as well. The following code using the gensim package prepares word embeddings as vectors (the access syntax below follows the gensim 4.x API, where the vectors live under model.wv).

```
from gensim.models import Word2Vec

sentences = [['data', 'science'], ['vidhya', 'science', 'data', 'analytics'],
             ['machine', 'learning'], ['deep', 'learning']]

# train the model on your corpus
model = Word2Vec(sentences, min_count=1)

print(model.wv.similarity('data', 'science'))
>>> 0.11222489293

print(model.wv['learning'])
>>> array([ 0.00459356,  0.00303564, -0.00467622,  0.00209638, ...])
```

Word vectors can be used as feature vectors for an ML model, to measure text similarity using cosine similarity, and for word clustering and text classification. A small cosine-similarity sketch is given below.
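As an illustration of the cosine-similarity use mentioned above, the sketch below compares two word vectors from a freshly trained Word2Vec model and checks the result against gensim's built-in similarity method. The tiny toy corpus and the vector_size value are assumptions for demonstration only (gensim 4.x syntax), so the similarity numbers themselves are not meaningful.

```
# Minimal sketch: cosine similarity between two word vectors (toy corpus, gensim 4.x).
import numpy as np
from gensim.models import Word2Vec

sentences = [['data', 'science'], ['vidhya', 'science', 'data', 'analytics'],
             ['machine', 'learning'], ['deep', 'learning']]
model = Word2Vec(sentences, min_count=1, vector_size=50)

v1, v2 = model.wv['data'], model.wv['science']
cosine = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

print(cosine)                                   # cosine similarity computed by hand
print(model.wv.similarity('data', 'science'))   # gensim's built-in equivalent
```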

11.5 SUMMARY

• Natural Language Processing (NLP) is the part of AI that studies how machines interact with human language.
• NLP primarily comprises two major functionalities: the first is "Human to Machine Translation" (Natural Language Understanding), and the second is "Machine to Human Translation" (Natural Language Generation).
• Tokenization is essentially splitting a phrase, sentence, paragraph, or an entire text document into smaller units, such as individual words or terms.
• Stemming and lemmatization both generate the root form of inflected words.
• Stemming just removes or stems the last few characters of a word, often leading to incorrect meanings and spellings. Lemmatization considers the context and converts the word to its meaningful base form, which is called a lemma.
• Syntactical parsing involves the analysis of words in a sentence for grammar and their arrangement in a manner that shows the relationships among the words.
• Entities are defined as the most important chunks of a sentence – noun phrases, verb phrases or both.

11.6 KEYWORDS

• Stemming – refers to a crude heuristic process that chops off the ends of words
• Sentiment Analysis – used to determine whether data is positive, negative or neutral
• Feature Engineering – using domain knowledge to extract features from raw data
• Natural Language Generation – transforms data into plain-English content
• Named Entity Recognition – probably the first step towards information extraction; it seeks to locate and classify named entities in text

11.7 LEARNING ACTIVITY

1. Assume you wish to buy a smartphone on Amazon. How do you finalize a smartphone? Use sentiment analysis to decide on the product based on customer reviews, which helps in recognizing the primary issues with the products (if there are any). Some products have thousands of reviews on Amazon while others only have a few hundred.

___________________________________________________________________________
___________________________________________________________________________

2. Bag of words plays a major role in performing sentiment analysis. Comment.

___________________________________________________________________________
___________________________________________________________________________

11.8 UNIT END QUESTIONS

A. Descriptive Questions

Short Questions
1. What is NLP?
2. List the stages of Natural Language Generation.
3. How is the bag of words model used in NLP?
4. Compare stemming and lemmatization.
5. What is the use of dependency trees?

Long Questions
1. Discuss the techniques used for Natural Language Processing.
2. Illustrate various feature engineering approaches on text data.
3. How are dependency trees used to perform syntactic parsing? Give an example.
4. Compare the techniques used for converting text into numerical features.
5. Describe how features are extracted from a sentence, with an example.

B. Multiple Choice Questions

1. What is the main challenge of NLP?
a. Handling Ambiguity of Sentences
b. Handling Tokenization
c. Handling POS-Tagging
d. Stemming

2. What is Machine Translation?
a. Converts one human language to another

b. Converts human language to machine language
c. Converts any human language to English
d. Converts machine language to human language

3. Many words have more than one meaning; we have to select the meaning which makes the most sense in context. This can be resolved by ____________
a. Fuzzy Logic
b. Word Sense Disambiguation
c. Shallow Semantic Analysis
d. All of these

4. Given a sound clip of a person or people speaking, determine the textual representation of the speech.
a. Text-to-speech
b. Speech-to-text
c. All of the mentioned
d. None of these

5. In linguistic morphology, _____________ is the process for reducing inflected words to their root form.
a. Rooting
b. Stemming
c. Text-Proofing
d. Sentiment Analysis

Answers
1 – a, 2 – a, 3 – b, 4 – b, 5 – b

11.9 REFERENCES

Textbooks
• Peter Harrington, "Machine Learning in Action", Dream Tech Press

• Ethem Alpaydin, "Introduction to Machine Learning", MIT Press
• Steven Bird, Ewan Klein and Edward Loper, "Natural Language Processing with Python", O'Reilly Media
• Stephen Marsland, "Machine Learning: An Algorithmic Perspective", CRC Press

Reference Books
• William W. Hsieh, "Machine Learning Methods in the Environmental Sciences", Cambridge University Press
• Grant S. Ingersoll, Thomas S. Morton, Andrew L. Farris, "Taming Text", Manning Publications Co.
• Margaret H. Dunham, "Data Mining: Introductory and Advanced Topics", Pearson Education

UNIT - 12: NATURAL LANGUAGE UNDERSTANDING

Structure
12.0 Learning Objectives
12.1 Natural Language Understanding
12.2 Applications of Natural Language Understanding
12.3 Relation Extraction
12.3.1 Rule-based Relation Extraction
12.3.2 Weakly Supervised Relation Extraction
12.3.3 Supervised Relation Extraction
12.3.4 Distantly Supervised Relation Extraction
12.3.5 Unsupervised Relation Extraction
12.4 Natural Language Generation
12.4.1 Applications of Natural Language Generation (NLG)
12.4.2 Components of a Generator
12.5 Natural Language Processing Libraries
12.6 Summary
12.7 Keywords
12.8 Learning Activity
12.9 Unit End Questions
12.10 References

12.0 LEARNING OBJECTIVES

After studying this unit, you will be able to:
• Familiarize yourself with the fundamentals of NLU techniques
• Illustrate the types of Natural Language Generation
• Use NLP libraries for various problems

12.1 NATURAL LANGUAGE UNDERSTANDING

Natural language understanding is a subset of natural language processing, which uses syntactic and semantic analysis of text and speech to determine the meaning of a sentence.

Syntax refers to the grammatical structure of a sentence, while semantics alludes to its intended meaning. NLU also establishes a relevant ontology: a data structure which specifies the relationships between words and phrases. While humans naturally do this in conversation, the combination of these analyses is required for a machine to understand the intended meaning of different texts.

Figure 12.1 Components of NLP

Figure 12.2 NLP to NLU Conversion

The understanding of natural language is based on the following three techniques:
• Syntax: understands the grammar of the text
• Semantics: understands the actual meaning of the text
• Pragmatics: understands what the text is trying to communicate

12.2 APPLICATIONS OF NATURAL LANGUAGE UNDERSTANDING (NLU)

The main use of NLU is to read, understand, process, and create speech- and chat-enabled business bots that can interact with users just like a real human would, without any supervision. Popular applications include sentiment detection and profanity filtering, among others. It can also be applied to gather news, categorize and archive text, and analyse content. API.ai, acquired by Google, provides tools for speech recognition and NLU. NLU mainly scrapes through unstructured language information and converts it into data, preferably structured data, which can be processed and analysed for the desired results.

The important techniques for NLU are:
• Relation Extraction
• Semantic Parsing
• Question Answering
• Sentiment Analysis
• Dialogue Agents
• Summarization
• Paraphrase & NL Interface

12.3 RELATION EXTRACTION

Relation Extraction (RE) is the task of extracting semantic relationships from text, which usually occur between two or more entities. These relations can be of different types. For example, "Paris is in France" states an "is in" relationship from Paris to France. This can be denoted using the triple (Paris, is in, France).

Information Extraction (IE) is the field of extracting structured information from natural language text. It underlies various NLP tasks, such as creating knowledge graphs, question-answering systems, text summarization, etc. Relation extraction is itself a subfield of IE.

There are five different methods of doing Relation Extraction:
1. Rule-based RE
2. Weakly Supervised RE
3. Supervised RE

4. Distantly Supervised RE
5. Unsupervised RE

We will go through all of them at a high level and discuss some pros and cons for each one.

12.3.1 Rule-based Relation Extraction

Many instances of relations can be identified through hand-crafted patterns, looking for triples (X, α, Y) where X and Y are entities and α is the sequence of words in between. For the "Paris is in France" example, α = "is in". This could be extracted with a regular expression (a minimal sketch is given at the end of this subsection).

Figure 12.3 Named entities in a sentence

Figure 12.4 Part-of-speech tags in a sentence

Looking only at keyword matches will retrieve many false positives. We can mitigate this by filtering on named entities, only retrieving (CITY, is in, COUNTRY). We can also take into account the part-of-speech (POS) tags to remove additional false positives. These are examples of word-sequence patterns, because the rule specifies a pattern following the order of the text. Unfortunately, these types of rules fall apart for longer-range patterns and sequences with greater variety. E.g. "Fred and Mary got married" cannot successfully be handled by a word-sequence pattern.

Figure 12.5 Dependency paths in a sentence

Instead, we can make use of dependency paths in the sentences, knowing which word has a grammatical dependency on which other word. This can greatly increase the coverage of a rule without extra effort.
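To make the word-sequence idea concrete, here is a minimal sketch that pulls (X, is in, Y) triples out of raw text with a plain regular expression. The capitalised-word pattern is only a stand-in for the CITY/COUNTRY entity filter described above; a real system would use a proper NER step instead, so treat this purely as an illustration.

```
# Minimal sketch of a rule-based "is in" relation extractor.
# The capitalised-word pattern is a stand-in for real named-entity recognition.
import re

text = "Paris is in France. Berlin is in Germany. The cat is in the box."
pattern = re.compile(r"\b([A-Z][a-z]+) is in ([A-Z][a-z]+)\b")

triples = [(x, "is in", y) for x, y in pattern.findall(text)]
print(triples)
# [('Paris', 'is in', 'France'), ('Berlin', 'is in', 'Germany')]
```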

We can also transform the sentences before applying the rule. E.g. "The cake was baked by Harry" or "The cake which Harry baked" can be transformed into "Harry baked the cake". We are then changing the order to work with our "linear rule", while also removing redundant modifying words in between.

Pros
• Humans can create patterns which tend to have high precision
• Can be tailored to specific domains

Cons
• Human-made patterns are still often low-recall (too much variety in languages)
• A lot of manual work is needed to create all possible rules
• Rules have to be created for every relation type

12.3.2 Weakly Supervised Relation Extraction

The idea here is to start out with a set of hand-crafted rules and automatically find new ones from the unlabeled text data, through an iterative process (bootstrapping). Alternatively, one can start out with a set of seed tuples describing entities with a specific relation. E.g. seed = {(ORG:IBM, LOC:Armonk), (ORG:Microsoft, LOC:Redmond)} states entities having the relation "based in".

Figure 12.6 Extracting relations from large plain-text collections (Agichtein, Eugene, and Luis Gravano. "Snowball: Extracting relations from large plain-text collections." Proceedings of the Fifth ACM Conference on Digital Libraries. ACM, 2000.)

Snowball is a fairly old example of an algorithm which does this:

1. Start with a set of seed tuples (or extract a seed set from the unlabeled text with a few hand-crafted rules).

