its closest seed that does not violate any of the constraints implied by the assignments that have already been executed. In the event that the data point cannot be assigned to any cluster in a consistent way, the algorithm terminates. In this case, the clusters in the last iteration where a feasible solution was found are reported. In some cases, no feasible solution may be found even in the first iteration, depending on the choice of seeds. Therefore, the constrained k-means approach may be executed multiple times, and the best solution over these executions is reported. Numerous other methods are available in the literature, both in terms of the kinds of constraints that are specified, and in terms of the solution methodology. The bibliographic notes contain pointers to many of these methods.

7.6 Human and Visually Supervised Clustering

The previous section discussed ways of incorporating supervision in the input data in the form of constraints or labels. A different way of incorporating supervision is to use direct feedback from the user during the clustering process, based on an understandable summary of the clusters. The core idea is that semantically meaningful clusters are often difficult to isolate using fully automated methods in which rigid mathematical formalizations are used as the only criteria. The utility of clusters is based on their application-specific usability, which is often semantically interpretable. In such cases, human involvement is necessary to incorporate the intuitive and semantically meaningful aspects during the cluster discovery process. Clustering is a problem that requires both the computational power of a computer and the intuitive understanding of a human. Therefore, a natural solution is to divide the clustering task in such a way that each entity performs the task that it is most well suited to. In the interactive approach, the computer performs the computationally intensive analysis, which is leveraged to provide the user with an intuitively understandable summary of the clustering structure. The user utilizes this summary to provide feedback about the key choices that should be made by a clustering algorithm. The result of this cooperative technique is a system that can perform the task of clustering better than either a human or a computer. There are two natural ways of providing feedback during the clustering process:

1. Semantic feedback as an intermediate process in standard clustering algorithms: Such an approach is relevant in domains where the objects are semantically interpretable (e.g., documents or images), and the user provides feedback at specific stages in a clustering algorithm when critical choices are made. For example, in a k-means algorithm, a user may choose to drop some clusters during each iteration and manually specify new seeds reflecting uncovered segments of the data (a minimal sketch of this feedback loop appears below).

2. Visual feedback in algorithms specifically designed for human–computer interaction: In many high-dimensional data sets, the number of attributes is very large, and it is difficult to associate direct semantic interpretability with the objects. In such cases, the user must be provided visual representations of the clustering structure of the data in different subsets of attributes. The user may leverage these representations to provide feedback to the clustering process. This approach can be viewed as an interactive version of projected clustering methods.
In the following section, each of these different types of algorithms will be addressed in detail.
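The semantic-feedback loop of the first approach can be sketched as follows. This is a minimal illustration under simplifying assumptions rather than a reference implementation: the review_clusters callback, which returns the cluster indices the user discards together with the replacement seeds the user specifies, is a hypothetical placeholder for the actual interaction.

```python
import numpy as np

def interactive_kmeans(X, k, review_clusters, n_iter=10, seed=0):
    """Sketch of k-means with per-iteration semantic feedback from a user.

    review_clusters(X, labels, seeds) is a hypothetical callback returning
    (drop, new_seeds): indices of clusters the user discards, and an array of
    manually specified replacement seeds for uncovered segments of the data.
    """
    rng = np.random.default_rng(seed)
    seeds = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assignment step: each point goes to its closest seed.
        dists = np.linalg.norm(X[:, None, :] - seeds[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: recompute each seed as the centroid of its cluster.
        for j in range(k):
            if np.any(labels == j):
                seeds[j] = X[labels == j].mean(axis=0)
        # Semantic feedback: the user inspects cluster representatives and may
        # drop clusters or supply new seeds before the next iteration.
        drop, new_seeds = review_clusters(X, labels, seeds)
        for j, new_seed in zip(drop, new_seeds):
            seeds[j] = new_seed
    return labels, seeds
```

A dropped cluster could equally well be replaced with a randomly chosen seed, as in the related modification described in Sect. 7.6.1.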

7.6.1 Modifications of Existing Clustering Algorithms

Most clustering algorithms use a number of key decision steps in which choices need to be made, such as the choice of merges in a hierarchical clustering algorithm, or the resolution of close ties in the assignment of data points to clusters. When these choices are made on the basis of stringent and predefined clustering criteria, the resulting clusters may not reflect the natural structure of clusters in the data. Therefore, the goal in this kind of approach is to present the user with a small number of alternatives corresponding to critical choices in the clustering process. Some examples of simple modifications of existing clustering algorithms are as follows:

1. Modifications to k-means and related methods: A number of critical decision points in the k-means algorithm can be utilized to improve the clustering process. For example, after each iteration, representative data points from each cluster can be presented to the user. The user may choose to manually discard either clusters with very few data points or clusters that are closely related to others. The corresponding seeds may be dropped and replaced with randomly chosen seeds in each iteration. Such an approach works well when the representative data points presented to the user have clear semantic interpretability. This is true in many domains such as image data or document data.

2. Modifications to hierarchical methods: In bottom-up hierarchical algorithms, the clusters are successively merged by selecting the closest pair for merging. The key issue here is that if a bottom-up algorithm makes an error in the merging process, the merging decision is final, resulting in a lower-quality clustering. Therefore, one way of reducing such mistakes is to present the user with the top-ranking choices for the merge, corresponding to a small number of different pairs of clusters. These choices can be made by the user on the basis of semantic interpretability.

It is important to point out that the key steps at which a user may provide feedback depend on the level of semantic interpretability of the objects in the underlying clusters. In some cases, such semantic interpretability may not be available.

7.6.2 Visual Clustering

Visual clustering is particularly helpful in scenarios, such as high-dimensional data, where the semantic interpretability of the individual objects is low. In such cases, it is useful to visualize lower dimensional projections of the data to determine subspaces in which the data are clustered. The ability to discover such lower dimensional projections is based on a combination of the computational ability of a computer and the intuitive feedback of the user. IPCLUS is an approach that combines interactive projected clustering methods with visualization methods derived from density-based methods.

One challenge with high-dimensional clustering is that the density, distribution, and shapes of the clusters may be quite different in different data localities and subspaces. Furthermore, it may not be easy to decide the optimum density threshold at which to separate out the clusters in any particular subspace. This is a problem even for full-dimensional clustering algorithms, where any particular density threshold (see the discussion of Fig. 6.14 in Chap. 6) may either merge clusters or completely miss clusters.
While subspace clustering methods such as CLIQUE address these issues by reporting a huge number of overlapping clusters, projected clustering methods such as PROCLUS address these issues by making hard decisions about how the data should be most appropriately summarized. Clearly, such decisions can be made more effectively by interactive user exploration of these alternative views and creating a final consensus from these different views. The advantage of involving the user is the greater intuition available in terms of the quality of feedback provided to the clustering process. The result of this cooperative technique is a system that can perform the clustering task better than either a human or a computer.

The idea behind the Interactive Projected CLUStering algorithm (IPCLUS) is to provide the user with a set of meaningful visualizations in lower dimensional projections together with the ability to decide how to separate the clusters. The overall algorithm is illustrated in Fig. 7.6.

  Algorithm IPCLUS(Data Set: D, Polarization Points: k)
  begin
    while not (termination criterion) do
    begin
      Randomly sample k points Y1 ... Yk from D;
      Compute 2-dimensional subspace E polarized around Y1 ... Yk;
      Generate density profile in E and present to user;
      Record membership statistics of clusters based on
        user-specified density-based feedback;
    end;
    return consensus clusters from membership statistics;
  end

  Figure 7.6: The IPCLUS algorithm

The interactive projected clustering algorithm works in a series of iterations; in each, a projection is determined in which there are distinct sets of points that can be clearly distinguished from one another. Such projections are referred to as well polarized. In a well-polarized projection, it is easier for the user to clearly distinguish a set of clusters from the rest of the data. Examples of the data density distribution of a well-polarized projection and a poorly polarized projection are illustrated in Fig. 7.7a and b, respectively.

  Figure 7.7: Polarizations of different quality and flexibility of user interaction

These polarized projections are determined by randomly selecting a set of k records from the database that are referred to as the polarization anchors. The number of polarization anchors k is one of the inputs to the algorithm. A 2-dimensional subspace of the data is determined such that the data are clustered around each of these polarization anchors. Specifically, a 2-dimensional subspace is selected so that the mean square radius of assignments of data points to the polarization points as anchors is minimized. Different projections are repeatedly determined with different sampled anchors, for which the user can provide feedback. A consensus clustering is then determined from the different clusterings generated by the user over multiple subspace views of the data.

The polarization subspaces can be determined either in axis-parallel subspaces or arbitrary subspaces, although the former provides greater interpretability. The overall approach for determining polarization subspaces starts with the full dimensionality and iteratively reduces the dimensionality of the current subspace until a 2-dimensional subspace is obtained. This is achieved by iteratively assigning data points to their closest subspace-specific anchor points in each iteration, while discarding the most noisy (highest variance) dimensions about the polarization points in each iteration. The dimensionality is reduced by a constant factor of 2 in each iteration. Thus, this is a k-medoids type approach, which reduces the dimensionality of the subspaces for distance computation, but not the seeds, in each iteration. This typically results in the discovery of a 2-dimensional subspace that is highly clustered around the polarization anchors. Of course, if the polarization anchors are poorly sampled, then this will result in poorly separated clusters. Nevertheless, repeated sampling of polarization points ensures that good subspaces will be selected in at least a few iterations.
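The subspace-polarization step can be sketched as follows. This is a simplified, axis-parallel reading of the description above rather than the published IPCLUS procedure; in particular, using the mean squared deviation about the assigned anchors as the criterion for discarding dimensions is an assumption.

```python
import numpy as np

def polarized_subspace(X, anchors, target_dim=2):
    """Sketch: find an axis-parallel subspace in which the data are tightly
    clustered around the sampled polarization anchors.

    X has shape (n, d) and anchors has shape (k, d). In each pass, points are
    assigned to their closest anchor within the current subspace, and the
    noisier half of the remaining dimensions (largest mean squared deviation
    about the anchors) is discarded, until target_dim dimensions remain.
    """
    dims = np.arange(X.shape[1])
    while len(dims) > target_dim:
        Xs, As = X[:, dims], anchors[:, dims]
        # Assignment step: closest anchor in the current subspace.
        assign = np.linalg.norm(Xs[:, None, :] - As[None, :, :], axis=2).argmin(axis=1)
        # Per-dimension noise of points about their assigned anchors.
        noise = ((Xs - As[assign]) ** 2).mean(axis=0)
        # Keep the least noisy half of the dimensions (at least target_dim).
        keep = max(target_dim, len(dims) // 2)
        dims = dims[np.argsort(noise)[:keep]]
    return dims  # indices of the retained 2-dimensional subspace
```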

After the projection subspace has been found, kernel density estimation techniques can be used to determine the data density at each point in a 2-dimensional grid of values in the relevant subspace. The density values at these grid points are used to create a surface plot. Examples of such plots are illustrated in Fig. 7.7. Because clusters correspond to dense regions in the data, they are represented by peaks in the density profile. To actually separate out the clusters, the user can visually specify density value thresholds that correspond to noise levels at which clusters can be separated from one another. Specifically, a cluster may be defined to be a connected region in the space with density above a certain noise threshold τ that is specified by the user. This cluster may be of arbitrary shape, and the points inside it can be determined. Note that when the density distribution varies significantly with locality, different numbers, shapes, and sizes of clusters will be discovered by different density thresholds. Examples of density thresholding are illustrated in Fig. 7.7c and 7.7d, where clusters of different numbers and shapes are discovered at different thresholds.
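A minimal sketch of the density-profile and thresholding step, assuming a Gaussian kernel density estimate on a regular grid and simple connected-component labeling; the noise threshold tau stands in for the value the user would specify visually, and the grid resolution is arbitrary.

```python
import numpy as np
from scipy.stats import gaussian_kde
from scipy.ndimage import label

def density_clusters_2d(points_2d, tau, grid_size=80):
    """Sketch: estimate a 2-D density profile and extract the connected
    regions whose density exceeds a user-chosen noise threshold tau.

    points_2d has shape (n, 2): the data projected into the polarized subspace.
    Returns a grid of cluster ids (0 = below threshold) and the grid axes.
    """
    kde = gaussian_kde(points_2d.T)                      # kernel density estimate
    xs = np.linspace(points_2d[:, 0].min(), points_2d[:, 0].max(), grid_size)
    ys = np.linspace(points_2d[:, 1].min(), points_2d[:, 1].max(), grid_size)
    gx, gy = np.meshgrid(xs, ys)
    density = kde(np.vstack([gx.ravel(), gy.ravel()])).reshape(gx.shape)
    # A cluster is a connected region of the grid with density above tau.
    cluster_ids, n_clusters = label(density > tau)
    return cluster_ids, n_clusters, (xs, ys)
```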

It is in this step that the user intuition is very helpful, both in terms of deciding which polarized projections are most relevant, and in terms of deciding what density thresholds to specify. If desired, the user may discard a projection altogether or specify multiple thresholds in the same projection to discover clusters of different density in different localities. The specification of the density threshold τ need not be done directly by value. The density separator hyperplane can be visually superposed on the density profile with the help of a graphical interface.

Each feedback of the user results in the generation of connected sets of points within the density contours. These sets of points can be viewed as one or more binary "transactions" drawn on the "item" space of data points. The key is to determine the consensus clusters from these newly created transactions that encode user feedback. While the problem of finding consensus clusters from multiple clusterings will be discussed in detail in the next section, a very simple way of doing this is to use either frequent pattern mining (to find overlapping clusters) or a second level of clustering on the transactions to generate nonoverlapping clusters. Because this new set of transactions encodes the user preferences, the quality of the clusters found with such an approach will typically be quite high.

7.7 Cluster Ensembles

The previous section illustrated how different views of the data can lead to different solutions to the clustering problem. This notion is closely related to the concept of multiview clustering or ensemble clustering, which studies this issue from a broader perspective. It is evident from the discussion in this chapter and the previous one that clustering is an unsupervised problem with many alternative solutions. In spite of the availability of a large number of validation criteria, the ability to truly test the quality of a clustering algorithm remains elusive. The goal in ensemble clustering is to combine the results of many clustering models to create a more robust clustering. The idea is that no single model or criterion truly captures the optimal clustering, but an ensemble of models will provide a more robust solution. Most ensemble models use the following two steps to generate the clustering solution:

1. Generate k different clusterings with the use of different models or data selection mechanisms. These represent the different ensemble components.

2. Combine the different results into a single and more robust clustering.

The following section provides a brief overview of the different ways in which the alternative clusterings can be constructed.

7.7.1 Selecting Different Ensemble Components

The different ensemble components can be selected in a wide variety of ways. They can be either model-based or data-selection-based. In model-based ensembles, the different components of the ensemble reflect different models, such as the use of different clustering models, different settings of the same model, or different clusterings provided by different runs of the same randomized algorithm. Some examples follow:

1. The different components can be a variety of models such as partitioning methods, hierarchical methods, and density-based methods. The qualitative differences between the models will be data set-specific.

2. The different components can correspond to different settings of the same algorithm. An example is the use of different initializations for algorithms such as k-means or EM, the use of different mixture models for EM, or the use of different parameter settings of the same algorithm, such as the choice of the density threshold in DBSCAN. An ensemble approach is useful because the optimum choice of parameter settings is also hard to determine in an unsupervised problem such as clustering.

3. The different components could be obtained from a single algorithm. For example, a 2-means clustering applied to the 1-dimensional embedding obtained from spectral clustering will yield a different clustering solution for each eigenvector. Therefore, the smallest k nontrivial eigenvectors will provide k different solutions that are often quite different as a result of the orthogonality of the eigenvectors.

A second way of selecting the different components of the ensemble is with the use of data selection. Data selection can be performed in two different ways:

1. Point selection: Different subsets of the data are selected, either via random sampling, or by systematic selection for the clustering process.

2. Dimension selection: Different subsets of dimensions are selected to perform the clustering. An example is the IPCLUS method discussed in the previous section.

After the individual ensemble components have been constructed, it is often a challenge to combine the results from these different components to create a consensus clustering.

7.7.2 Combining Different Ensemble Components

After the different clustering solutions have been obtained, it is desired to create a robust consensus from the different solutions. In the following section, some simple methods are described that use the base clusterings as input for the generation of the final set of clusters.

7.7.2.1 Hypergraph Partitioning Algorithm

Each object in the data is represented by a vertex. A cluster in any of the ensemble components is represented as a hyperedge. A hyperedge is a generalization of the notion of an edge, because it connects more than two nodes in the form of a complete clique. Any off-the-shelf hypergraph clustering algorithm such as HMETIS [302] can be used to determine the optimal partitioning. Constraints are added to ensure a balanced partitioning. One major challenge with hypergraph partitioning is that a hyperedge can be "broken" by a partitioning in many different ways, not all of which are qualitatively equivalent. Most hypergraph partitioning algorithms use a constant penalty for breaking a hyperedge. This can sometimes be undesirable from a qualitative perspective.

7.7.2.2 Meta-clustering Algorithm

This is also a graph-based approach, except that vertices are associated with each cluster in the ensemble components. For example, if there are k_1 ... k_r different clusters in each of the r ensemble components, then a total of k_1 + ... + k_r vertices will be created. Each vertex therefore represents a set of data objects. An edge is added between a pair of vertices if the Jaccard coefficient between the corresponding object sets is nonzero. The weight of the edge is equal to the Jaccard coefficient. This is therefore an r-partite graph because there are no edges between two vertices from the same ensemble component. A graph partitioning algorithm is applied to this graph to create the desired number of clusters. Each data point has r different instances corresponding to the different ensemble components. The distribution of the membership of different instances of the data point to the meta-partitions can be used to determine its meta-cluster membership, or soft assignment probability. Balancing constraints may be added to the meta-clustering phase to ensure that the resulting clusters are balanced.
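A minimal sketch of the graph construction used by this meta-clustering approach is given below; only the Jaccard-weighted graph is built here, and the final graph partitioning step is assumed to be supplied by any off-the-shelf partitioner.

```python
import numpy as np

def meta_cluster_graph(clusterings):
    """Sketch: build the Jaccard-weighted graph over clusters of an ensemble.

    clusterings is a list of r label arrays, one per ensemble component, each
    of length n. Every cluster in every component becomes a vertex; an edge
    connects two vertices from different components whenever their member
    sets overlap, weighted by the Jaccard coefficient.
    """
    # One vertex per (component, cluster label), with its member set.
    vertices, members = [], []
    for comp, labels in enumerate(clusterings):
        for c in np.unique(labels):
            vertices.append((comp, c))
            members.append(set(np.flatnonzero(labels == c)))
    # Weighted edges between vertices of different components (r-partite graph).
    edges = {}
    for i in range(len(vertices)):
        for j in range(i + 1, len(vertices)):
            if vertices[i][0] == vertices[j][0]:
                continue                       # same component: no edge
            inter = len(members[i] & members[j])
            if inter > 0:
                union = len(members[i] | members[j])
                edges[(i, j)] = inter / union  # Jaccard coefficient
    return vertices, edges  # feed to any graph partitioner to get meta-clusters
```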

7.8 Putting Clustering to Work: Applications

Clustering can be considered a specific type of data summarization where the summaries of the data points are constructed on the basis of similarity. Because summarization is a first step to many data mining applications, such summaries can be widely useful. This section will discuss the many applications of data clustering.

7.8.1 Applications to Other Data Mining Problems

Clustering is intimately related to other data mining problems and is used as a first summarization step in these cases. In particular, it is used quite often for the data mining problems of outlier analysis and classification. These specific applications are discussed below.

7.8.1.1 Data Summarization

Although many forms of data summarization, such as sampling, histograms, and wavelets, are available for different kinds of data, clustering is the only natural form of summarization based on the notion of similarity. Because the notion of similarity is fundamental to many data mining applications, such summaries are very useful for similarity-based applications. Specific applications include recommendation analysis methods, such as collaborative filtering. This application is discussed later in this chapter, and in Chap. 18 on Web mining.

7.8.1.2 Outlier Analysis

Outliers are defined as data points that are generated by a different mechanism than the normal data points. This can be viewed as a complementary problem to clustering, where the goal is to determine groups of closely related data points generated by the same mechanism. Therefore, outliers may be defined as data points that do not lie in any particular cluster. This is of course a simplistic abstraction but is nevertheless a powerful principle as a starting point. Sections 8.3 and 8.4 of Chap. 8 discuss how many algorithms for outlier analysis can be viewed as variations of clustering algorithms.

7.8.1.3 Classification

Many forms of clustering are used to improve the accuracy of classification methods. For example, nearest-neighbor classifiers report the class label of the closest set of training data points to a given test instance. Clustering can help speed up this process by replacing the data points with the centroids of fine-grained clusters belonging to a particular class. In addition, semisupervised methods can also be used to perform categorization in many domains such as text. The bibliographic notes contain pointers to these methods.
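The centroid-based speed-up of nearest-neighbor classification can be sketched as follows; scikit-learn's k-means is assumed as the fine-grained clustering subroutine, and the number of clusters per class is an illustrative parameter.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_centroid_index(X_train, y_train, clusters_per_class=10, seed=0):
    """Sketch: replace the training points of each class with the centroids of
    fine-grained clusters, so nearest-neighbor search scans far fewer points."""
    centroids, labels = [], []
    for cls in np.unique(y_train):
        Xc = X_train[y_train == cls]
        k = min(clusters_per_class, len(Xc))
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(Xc)
        centroids.append(km.cluster_centers_)
        labels.extend([cls] * k)
    return np.vstack(centroids), np.array(labels)

def classify(x, centroids, centroid_labels):
    # 1-nearest-neighbor over the compressed set of centroids.
    return centroid_labels[np.linalg.norm(centroids - x, axis=1).argmin()]
```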

7.8.1.4 Dimensionality Reduction

Clustering methods, such as nonnegative matrix factorization, are related to the problem of dimensionality reduction. In fact, the dual output of this algorithm is a set of concepts, together with a set of clusters. Another related approach is probabilistic latent semantic indexing, which is discussed in Chap. 13 on mining text data. These methods show the intimate relationship between clustering and dimensionality reduction, and that common solutions can be exploited by both problems.

7.8.1.5 Similarity Search and Indexing

A hierarchical clustering structure such as the CF-Tree can sometimes be used as an index, at least from a heuristic perspective. For any given target record, only the branches of the tree that are closest to the relevant clusters are searched, and the most relevant data points are returned. This can be useful in many scenarios where it is not practical to build exact indexes with guaranteed accuracy.

7.8.2 Customer Segmentation and Collaborative Filtering

In customer segmentation applications, similar customers are grouped together on the basis of the similarity of their profiles or other actions at a given site. Such segmentation methods are very useful in cases where the data analysis is naturally focused on similar segments of the data. A specific example is the case of collaborative filtering applications, in which ratings are provided by different customers based on their items of interest. Similar customers are grouped together, and recommendations are made to the customers in a cluster on the basis of the distribution of ratings in a particular group.

7.8.3 Text Applications

Many Web portals need to organize the material at their Web sites on the basis of similarity in content. Text clustering methods can be useful for the organization and browsing of text documents. Hierarchical clustering methods can be used to organize the documents in an exploration-friendly tree structure. Many hierarchical directories in Web sites are constructed with a combination of user labeling and semisupervised clustering methods. The semantic insights provided by hierarchical cluster organizations are very useful in many applications.

7.8.4 Multimedia Applications

With the increasing proliferation of electronic forms of multimedia data, such as images, photos, and music, numerous methods have been designed in the literature for finding clusters in such scenarios. Clusters of such multimedia data also provide the user with the ability to search for relevant objects in social media Web sites containing this kind of data. This is because heuristic indexes can be constructed with the use of clustering methods. Such indexes are useful for effective retrieval.

7.8.5 Temporal and Sequence Applications

Many forms of temporal data, such as time-series data and Web logs, can be clustered for effective analysis. For example, clusters of sequences in a Web log provide insights about the normal patterns of users.

This can be used to reorganize the site, or to optimize its structure. In some cases, such information about normal patterns can be used to discover anomalies that do not conform to the normal patterns of interaction. A related domain is that of biological sequence data, where clusters of sequences are related to their underlying biological properties.

7.8.6 Social Network Analysis

Clustering methods are useful for finding related communities of users in social-networking Web sites. This problem is known as community detection. Community detection has a wide variety of other applications in network science, such as anomaly detection, classification, influence analysis, and link prediction. These applications are discussed in detail in Chap. 19 on social network analysis.

7.9 Summary

This chapter discusses a number of advanced scenarios for cluster analysis. These scenarios include the clustering of advanced data types such as categorical data, large-scale data, and high-dimensional data. Many traditional clustering algorithms can be modified to work with categorical data by making changes to specific criteria, such as the similarity function or mixture model. Scalable algorithms require a change in algorithm design to reduce the number of passes over the data. High-dimensional data is the most difficult case because of the presence of many irrelevant features in the underlying data.

Because clustering algorithms yield many alternative solutions, supervision can help guide the cluster discovery process. This supervision can either be in the form of background knowledge or user interaction. In some cases, the alternative clusterings can be combined to create a consensus clustering that is more robust than the solution obtained from a single model.

7.10 Bibliographic Notes

The problem of clustering categorical data is closely related to that of finding suitable similarity measures [104, 182], because many clustering algorithms use similarity measures as a subroutine. The k-modes algorithm and a fuzzy version of it may be found in [135, 278]. Popular categorical clustering algorithms include ROCK [238], CACTUS [220], LIMBO [75], and STIRR [229]. The three scalable clustering algorithms discussed in this book are CLARANS [407], BIRCH [549], and CURE [239]. The high-dimensional clustering algorithms discussed in this chapter include CLIQUE [58], PROCLUS [19], and ORCLUS [22]. Detailed surveys on many different types of categorical, scalable, and high-dimensional clustering algorithms may be found in [32].

Methods for semisupervised clustering with the use of seeding, constraints, metric learning, probabilistic learning, and graph-based learning are discussed in [80, 81, 94, 329]. The IPCLUS method presented in this chapter was first presented in [43]. Two other tools that are able to discover clusters by visualizing lower dimensional subspaces include HD-Eye [268] and RNavGraph [502]. The cluster ensemble framework was first proposed in [479]. The hypergraph partitioning algorithm HMETIS, which is used in ensemble clustering, was proposed in [302]. Subsequently, the utility of the method has also been demonstrated for high-dimensional data [205].

7.11 Exercises

1. Implement the k-modes algorithm. Download the KDD CUP 1999 Network Intrusion Data Set [213] from the UCI Machine Learning Repository, and apply the algorithm to the categorical attributes of the data set. Compute the cluster purity with respect to class labels.

2. Repeat the previous exercise with an implementation of the ROCK algorithm.

3. What changes would be required to the BIRCH algorithm to implement it with the use of the Mahalanobis distance to compute distances between data points and centroids? The diameter of a cluster is computed as its RMS Mahalanobis radius.

4. Discuss the connection between high-dimensional clustering algorithms, such as PROCLUS and ORCLUS, and wrapper models for feature selection.

5. Show how to create an implementation of the cluster feature vector that allows the incremental computation of the covariance matrix of the cluster. Use this to create an incremental and scalable version of the Mahalanobis k-means algorithm.

6. Implement the k-means algorithm, with an option of selecting any of the points from the original data as seeds. Apply the approach to the quantitative attributes of the data set in Exercise 1, and select one data point from each class as a seed. Compute the cluster purity with respect to an implementation that uses random seeds.

7. Describe an automated way to determine whether a set of "must-link" and "cannot-link" constraints is consistent.

Chapter 8

Outlier Analysis

"You are unique, and if that is not fulfilled, then something has been lost."—Martha Graham

8.1 Introduction

An outlier is a data point that is very different from most of the remaining data. Hawkins formally defined the notion of an outlier as follows:

"An outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism."

Outliers can be viewed as a complementary concept to that of clusters. While clustering attempts to determine groups of data points that are similar, outliers are individual data points that are different from the remaining data. Outliers are also referred to as abnormalities, discordants, deviants, or anomalies in the data mining and statistics literature. Outliers have numerous applications in many data mining scenarios:

1. Data cleaning: Outliers often represent noise in the data. This noise may arise as a result of errors in the data collection process. Outlier detection methods are, therefore, useful for removing such noise.

2. Credit card fraud: Unusual patterns of credit card activity may often be a result of fraud. Because such patterns are much rarer than the normal patterns, they can be detected as outliers.

3. Network intrusion detection: The traffic on many networks can be considered as a stream of multidimensional records. Outliers are often defined as unusual records in this stream or unusual changes in the underlying trends.

Most outlier detection methods create a model of normal patterns. Examples of such models include clustering, distance-based quantification, or dimensionality reduction. Outliers are defined as data points that do not naturally fit within this normal model. The "outlierness" of a data point is quantified by a numeric value, known as the outlier score. Consequently, most outlier detection algorithms produce an output that can be one of two types:

1. Real-valued outlier score: Such a score quantifies the tendency for a data point to be considered an outlier. Higher values of the score make it more (or, in some cases, less) likely that a given data point is an outlier. Some algorithms may even output a probability value quantifying the likelihood that a given data point is an outlier.

2. Binary label: A binary value is output, indicating whether or not a data point is an outlier. This type of output contains less information than the first one because a threshold can be imposed on the outlier scores to convert them into binary labels. However, the reverse is not possible. Therefore, outlier scores are more general than binary labels. Nevertheless, a binary score is required as the end result in most applications because it provides a crisp decision.

The generation of an outlier score requires the construction of a model of the normal patterns. In some cases, a model may be designed to produce specialized types of outliers based on a very restrictive model of normal patterns. Examples of such outliers are extreme values, and they are useful only for certain specific types of applications. In the following, some of the key models for outlier analysis are summarized. These will be discussed in more detail in later sections.

1. Extreme values: A data point is an extreme value, if it lies at one of the two ends of a probability distribution. Extreme values can also be defined equivalently for multidimensional data by using a multivariate probability distribution, instead of a univariate one. These are very specialized types of outliers but are useful in general outlier analysis because of their utility in converting scores to labels.

2. Clustering models: Clustering is considered a complementary problem to outlier analysis. The former problem looks for data points that occur together in a group, whereas the latter problem looks for data points that are isolated from groups. In fact, many clustering models determine outliers as a side-product of the algorithm. It is also possible to optimize clustering models to specifically detect outliers.

3. Distance-based models: In these cases, the k-nearest neighbor distribution of a data point is analyzed to determine whether it is an outlier. Intuitively, a data point is an outlier, if its k-nearest neighbor distance is much larger than that of other data points. Distance-based models can be considered a more fine-grained and instance-centered version of clustering models.

4. Density-based models: In these models, the local density of a data point is used to define its outlier score. Density-based models are intimately connected to distance-based models because the local density at a given data point is low only when its distance to its nearest neighbors is large.

5. Probabilistic models: Probabilistic algorithms for clustering are discussed in Chap. 6. Because outlier analysis can be considered a complementary problem to clustering, it is natural to use probabilistic models for outlier analysis as well.
The steps are almost analogous to those of clustering algorithms, except that the EM algorithm is used for clustering, and the probabilistic fit values are used to quantify the outlier scores of data points (instead of distance values).

6. Information-theoretic models: These models share an interesting relationship with other models. Most of the other methods fix the model of normal patterns and then quantify outliers in terms of deviations from the model. On the other hand, information-theoretic methods constrain the maximum deviation allowed from the normal model and then examine the difference in space requirements for constructing a model with or without a specific data point. If the difference is large, then this point is reported as an outlier.

In the following sections, these different types of models will be discussed in detail. Representative algorithms from each of these classes of algorithms will also be introduced. It should be pointed out that this chapter defines outlier analysis as an unsupervised problem in which previous examples of anomalies and normal data points are not available. The supervised scenario, in which examples of previous anomalies are available, is a special case of the classification problem. That case will be discussed in detail in Chap. 11.

This chapter is organized as follows. Section 8.2 discusses methods for extreme value analysis. Probabilistic methods are introduced in Sect. 8.3. These can be viewed as modifications of EM-clustering methods that leverage the connections between the clustering and outlier analysis problems for detecting outliers. This issue is discussed more formally in Sect. 8.4. Distance-based models for outlier detection are discussed in Sect. 8.5. Density-based models are discussed in Sect. 8.6. Information-theoretic models are addressed in Sect. 8.7. The problem of outlier validity is discussed in Sect. 8.8. A summary is given in Sect. 8.9.

8.2 Extreme Value Analysis

Extreme value analysis is a very specific kind of outlier analysis where the data points at the outskirts of the data are reported as outliers. Such outliers correspond to the statistical tails of probability distributions. Statistical tails are more naturally defined for 1-dimensional distributions, although an analogous concept can be defined for the multidimensional case.

It is important to understand that extreme values are very specialized types of outliers; in other words, all extreme values are outliers, but the reverse may not be true. The traditional definition of outliers is based on Hawkins's definition of generative probabilities. For example, consider the 1-dimensional data set corresponding to {1, 3, 3, 3, 50, 97, 97, 97, 100}. Here, the values 1 and 100 may be considered extreme values. The value 50 is the mean of the data set and is therefore not an extreme value. However, it is the most isolated point in the data set and should, therefore, be considered an outlier from a generative perspective. A similar argument applies to the case of multivariate data, where the extreme values lie in the multivariate tail area of the distribution. It is more challenging to formally define the concept of multivariate tails, although the basic concept is analogous to that of univariate tails. Consider the example illustrated in Fig. 8.1. Here, data point A may be considered an extreme value, and also an outlier. Data point B is also isolated, and should, therefore, be considered an outlier; yet it cannot be considered a multivariate extreme value.
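The distinction can be made concrete on the small 1-dimensional data set above. The following sketch computes standardized deviations from the mean (the Z-values formalized in Sect. 8.2.1) and the distance of each value to its nearest neighbor: 1 and 100 have the largest standardized deviations (about -1.1 and +1.1), while 50 scores near 0 yet lies 47 away from its closest neighbor, making it by far the most isolated point.

```python
import numpy as np

x = np.array([1, 3, 3, 3, 50, 97, 97, 97, 100], dtype=float)

# Standardized deviation from the mean: about -1.1 for 1 and +1.1 for 100,
# but roughly 0 for 50, so 50 is not an extreme value.
z = (x - x.mean()) / x.std()

# Distance to the nearest other value: 47 for the point 50, versus at most 3
# for every other point, so 50 is the most isolated point (an outlier in
# Hawkins's generative sense).
gaps = np.abs(x[:, None] - x[None, :])
np.fill_diagonal(gaps, np.inf)
nearest = gaps.min(axis=1)

print(np.round(z, 2))
print(nearest)
```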
Extreme value analysis has important applications in its own right, and, therefore, plays an integral role in outlier analysis. An example of an important application of extreme value analysis is that of converting outlier scores to binary labels by identifying those outlier scores that are extreme values. Multivariate extreme value analysis is often useful in multicriteria outlier-detection algorithms, where it can be utilized to unify multiple outlier scores into a single value, and also to generate a binary label as the output. For example, consider a meteorological application where outlier scores of spatial regions have been generated on the basis of analyzing their temperature and pressure variables independently. These evidence values need to be unified into a single outlier score for the spatial region, or into a binary label. Multivariate extreme value analysis is very useful in these scenarios. In the following discussion, methods for univariate and multivariate extreme value analysis will be discussed.

  Figure 8.1: Multivariate extreme values

8.2.1 Univariate Extreme Value Analysis

Univariate extreme value analysis is intimately related to the notion of statistical tail confidence tests. Typically, statistical tail confidence tests assume that the 1-dimensional data are described by a specific distribution. These methods attempt to determine the fraction of the objects expected to be more extreme than the data point, based on these distribution assumptions. This quantification provides a level of confidence about whether or not a specific data point is an extreme value.

How is the "tail" of a distribution defined? For distributions that are not symmetric, it is often meaningful to talk about an upper tail and a lower tail, which may not have the same probability. The upper tail is defined as all extreme values larger than a particular threshold, and the lower tail is defined as all extreme values lower than a particular threshold. Consider the density distribution f_X(x). In general, the tail may be defined as the two extreme regions of the distribution for which f_X(x) ≤ θ, for some user-defined threshold θ. Examples of the lower tail and the upper tail for symmetric and asymmetric distributions are illustrated in Fig. 8.2a and b, respectively. As evident from Fig. 8.2b, the area in the upper tail and the lower tail of an asymmetric distribution may not be the same. Furthermore, some regions in the interior of the distribution of Fig. 8.2b have density below the density threshold θ, but are not extreme values because they do not lie in the tail of the distribution. The data points in this region may be considered outliers, but not extreme values. The areas inside the upper tail or lower tail in Fig. 8.2a and b represent the cumulative probability of these extreme regions. In symmetric probability distributions, the tail is defined in terms of this area, rather than a density threshold. However, the concept of density threshold is the defining characteristic of the tail, especially in the case of asymmetric univariate or multivariate distributions. Some asymmetric distributions, such as an exponential distribution, may not even have a tail at one end of the distribution.

  Figure 8.2: Tails of (a) a symmetric and (b) an asymmetric distribution

A model distribution is selected for quantifying the tail probability. The most commonly used model is the normal distribution. The density function f_X(x) of the normal distribution with mean μ and standard deviation σ is defined as follows:

  f_X(x) = \frac{1}{\sigma \cdot \sqrt{2 \cdot \pi}} \cdot e^{-\frac{(x - \mu)^2}{2 \cdot \sigma^2}}    (8.1)

A standard normal distribution is one in which the mean is 0, and the standard deviation σ is 1. In some application scenarios, the mean μ and standard deviation σ of the distribution may be known through prior domain knowledge. Alternatively, when a large number of data samples is available, the mean and standard deviation may be estimated very accurately. These can be used to compute the Z-value for a random variable. The Z-number z_i of an observed value x_i can be computed as follows:

  z_i = (x_i - \mu) / \sigma    (8.2)

Large positive values of z_i correspond to the upper tail, whereas large negative values correspond to the lower tail. The normal distribution can be expressed directly in terms of the Z-number because it corresponds to a scaled and translated random variable with a mean of 0 and a standard deviation of 1. The normal distribution of Eq. 8.1 can be written directly in terms of the Z-number, with the use of a standard normal distribution, as follows:

  f_X(z_i) = \frac{1}{\sigma \cdot \sqrt{2 \cdot \pi}} \cdot e^{-\frac{z_i^2}{2}}    (8.3)

This implies that the cumulative normal distribution may be used to determine the area of the tail that is larger than z_i. As a rule of thumb, if the absolute values of the Z-number are greater than 3, the corresponding data points are considered extreme values. At this threshold, the cumulative area inside the tail can be shown to be less than 0.01% for the normal distribution. When a smaller number n of data samples is available for estimating the mean μ and the standard deviation σ, the aforementioned methodology can be used with a minor modification. The value of z_i is computed as before, and the Student's t-distribution with n degrees of freedom is used to quantify the cumulative distribution in the tail instead of the normal distribution. Note that, when n is large, the t-distribution converges to the normal distribution.
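A minimal sketch of this univariate test, under the assumption that the sample is large enough for the normal approximation (for small samples, the tail area would instead be computed from the t-distribution):

```python
import numpy as np

def extreme_values_zscore(x, threshold=3.0):
    """Sketch: flag values of a 1-dimensional sample whose |Z| exceeds a
    threshold (3 by the rule of thumb above). The mean and standard deviation
    are estimated from the data, so a reasonably large sample is assumed."""
    z = (x - x.mean()) / x.std()
    return z, np.abs(z) > threshold   # Z-values and a mask of extreme values
```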

8.2.2 Multivariate Extreme Values

Strictly speaking, tails are defined for univariate distributions. However, just as the univariate tails are defined as extreme regions with probability density less than a particular threshold, an analogous concept can also be defined for multivariate distributions. The concept is more complex than the univariate case and is defined for unimodal probability distributions with a single peak. As in the previous case, a multivariate Gaussian model is used, and the corresponding parameters are estimated in a data-driven manner. The implicit modeling assumption of multivariate extreme value analysis is that all data points are located in a probability distribution with a single peak (i.e., a single Gaussian cluster), and data points in all directions that are as far away as possible from the center of the cluster should be considered extreme values.

Let μ be the d-dimensional mean vector of a d-dimensional data set, and Σ be its d × d covariance matrix. Thus, the (i, j)th entry of the covariance matrix is equal to the covariance between the dimensions i and j. These represent the estimated parameters of the multivariate Gaussian distribution. Then, the probability distribution f(X) for a d-dimensional data point X can be defined as follows:

  f(X) = \frac{1}{\sqrt{|\Sigma|} \cdot (2 \cdot \pi)^{d/2}} \cdot e^{-\frac{1}{2} \cdot (X - \mu) \Sigma^{-1} (X - \mu)^T}    (8.4)

The value of |Σ| denotes the determinant of the covariance matrix. The term in the exponent is half the square of the Mahalanobis distance between the data point X and the mean μ of the data. In other words, if Maha(X, μ, Σ) represents the Mahalanobis distance between X and μ, with respect to the covariance matrix Σ, then the probability density function of the normal distribution is as follows:

  f(X) = \frac{1}{\sqrt{|\Sigma|} \cdot (2 \cdot \pi)^{d/2}} \cdot e^{-\frac{1}{2} \cdot Maha(X, \mu, \Sigma)^2}    (8.5)

For the probability density to fall below a particular threshold, the Mahalanobis distance needs to be larger than a particular threshold. Thus, the Mahalanobis distance to the mean of the data can be used as an extreme-value score. The relevant extreme values are defined by the multidimensional region of the data for which the Mahalanobis distance to the mean is larger than a particular threshold. This region is illustrated in Fig. 8.3b. Therefore, the extreme value score for a data point can be reported as the Mahalanobis distance between that data point and the mean. Larger values imply more extreme behavior.

  Figure 8.3: Multivariate extreme values

In some cases, one might want a more intuitive probability measure. Correspondingly, the extreme value probability of a data point X is defined by the cumulative probability of the multidimensional region for which the Mahalanobis distance to the mean μ of the data is greater than that between X and μ. How can one estimate this cumulative probability? As discussed in Chap. 3, the Mahalanobis distance is similar to the Euclidean distance, except that it standardizes the data along uncorrelated directions. For example, if the axis system of the data were to be rotated to the principal directions (shown in Fig. 8.3), then the transformed coordinates in this new axis system would have no interattribute correlations (i.e., a diagonal covariance matrix). The Mahalanobis distance is simply equal to the Euclidean distance in such a transformed (axes-rotated) data set after dividing each of the transformed coordinate values by the standard deviation along its direction. This approach provides a neat way to model the probability distribution of the Mahalanobis distance, and it also provides a concrete estimate of the cumulative probability in the multivariate tail.

Because of the scaling by the standard deviation, each of the independent components of the Mahalanobis distances along the principal correlation directions can be modeled as a 1-dimensional standard normal distribution with mean 0 and variance 1. The sum of the squares of d variables, drawn independently from standard normal distributions, will result in a variable drawn from a χ2 distribution with d degrees of freedom. Therefore, the cumulative probability of the region of the χ2 distribution with d degrees of freedom, for which the value is greater than Maha(X, μ, Σ)^2, can be reported as the extreme value probability of X. Smaller values of the probability imply a greater likelihood of being an extreme value.

Intuitively, this approach models the data distribution along the various uncorrelated directions as statistically independent normal distributions and standardizes them so as to provide each such direction equal importance in the outlier score. In Fig. 8.3a, data point B can be more reasonably considered a multivariate extreme value than data point A, on the basis of the natural correlations in the data. On the other hand, the data point B is closer to the centroid of the data (than data point A) on the basis of the Euclidean distance but not on the basis of the Mahalanobis distance. This shows the utility of the Mahalanobis distance in using the underlying statistical distribution of the data to infer the outlier behavior of the data points more effectively.
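A minimal sketch of this multivariate test; the χ2 conversion follows the discussion above, and scipy is assumed to be available for the tail probability.

```python
import numpy as np
from scipy.stats import chi2

def multivariate_extreme_scores(X):
    """Sketch: Mahalanobis-based extreme value scores and probabilities.

    X has shape (n, d). Returns the Mahalanobis distance of each point to the
    mean (larger = more extreme) and the upper-tail chi-square probability of
    its squared distance (smaller = more extreme)."""
    mu = X.mean(axis=0)
    sigma_inv = np.linalg.inv(np.cov(X, rowvar=False))
    diff = X - mu
    maha_sq = np.einsum('ij,jk,ik->i', diff, sigma_inv, diff)
    # Squared Mahalanobis distances of d-dimensional Gaussian data follow a
    # chi-square distribution with d degrees of freedom.
    tail_prob = chi2.sf(maha_sq, df=X.shape[1])
    return np.sqrt(maha_sq), tail_prob
```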
8.2.3 Depth-Based Methods

Depth-based methods are based on the general principle that the convex hull of a set of data points represents the Pareto-optimal extremes of this set. A depth-based algorithm proceeds in an iterative fashion, where during the k-th iteration, all points at the corners of the convex hull of the data set are removed. The index of the iteration k also provides an outlier score, where smaller values indicate a greater tendency for a data point to be an outlier. These steps are repeated until the data set is empty. The outlier score may be converted to a binary label by reporting all data points with depth at most r as outliers.

  Algorithm FindDepthOutliers(Data Set: D, Score Threshold: r)
  begin
    k = 1;
    repeat
      Find set S of corners of convex hull of D;
      Assign depth k to points in S;
      D = D - S;
      k = k + 1;
    until (D is empty);
    Report points with depth at most r as outliers;
  end

  Figure 8.4: Depth-based methods

The value of r may itself need to be determined by univariate extreme value analysis. The steps of the depth-based approach are illustrated in Fig. 8.4. A pictorial illustration of the depth-based method is illustrated in Fig. 8.5. The process can be viewed as analogous to peeling the different layers of an onion (as shown in Fig. 8.5b), where the outermost layers define the outliers.

  Figure 8.5: Depth-based outlier detection

Depth-based methods try to achieve the same goal as the multivariate method of the previous section, but they generally tend to be less effective, both in terms of quality and computational efficiency. From a qualitative perspective, depth-based methods do not normalize for the characteristics of the statistical data distribution, as is the case for multivariate methods based on the Mahalanobis distance. All data points at the corners of a convex hull are treated equally. This is clearly not desirable, and the scores of many data points are indistinguishable because of ties. Furthermore, the fraction of data points at the corners of the convex hull generally increases with dimensionality. For very high dimensionality, it may not be uncommon for the majority of the data points to be located at the corners of the outermost convex hull. As a result, it is no longer possible to distinguish the outlier scores of different data points. The computational complexity of convex-hull methods also increases significantly with dimensionality. The combination of qualitative and computational issues associated with this method makes it a poor alternative to the multivariate methods based on the Mahalanobis distance.
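A minimal sketch of the peeling procedure, assuming scipy's convex hull routine and enough non-degenerate points in each iteration for a hull to exist:

```python
import numpy as np
from scipy.spatial import ConvexHull

def depth_outliers(X, r):
    """Sketch: assign each point its convex-hull peeling depth and report
    points with depth at most r as outliers (smaller depth = more outlying)."""
    depth = np.zeros(len(X), dtype=int)
    remaining = np.arange(len(X))
    k = 1
    while len(remaining) > X.shape[1]:          # need at least d+1 points
        hull = ConvexHull(X[remaining])
        corners = remaining[hull.vertices]      # corners of the current hull
        depth[corners] = k
        remaining = np.setdiff1d(remaining, corners)
        k += 1
    depth[remaining] = k                        # leftover interior points
    return depth, depth <= r
```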

8.3 Probabilistic Models

Probabilistic models are based on a generalization of the multivariate extreme value analysis methods discussed in Sect. 8.2.2. The Mahalanobis distance-based multivariate extreme value analysis method can be viewed as a Gaussian mixture model with a single component in the mixture. By generalizing this model to multiple mixture components, it is possible to determine general outliers, rather than multivariate extreme values. This idea is intimately related to the EM clustering algorithm discussed in Sect. 6.5 of Chap. 6. At an intuitive level, data points that do not naturally fit any cluster in the probabilistic sense may be reported as outliers. The reader is referred to Sect. 6.5 of Chap. 6 for a more detailed discussion of the EM algorithm, though a brief outline is provided here for convenience.

The broad principle of a mixture-based generative model is to assume that the data were generated from a mixture of k distributions with the probability distributions G_1 ... G_k based on the following process:

1. Select a mixture component with prior probability α_i, where i ∈ {1 ... k}. Assume that the rth one is selected.

2. Generate a data point from G_r.

This generative model will be denoted by M, and it generates each point in the data set D. The data set D is used to estimate the parameters of the model. Although it is natural to use Gaussians to represent each component of the mixture, other models may be used if desired. This flexibility is very useful for applying the approach to different data types. For example, in a categorical data set, a categorical probability distribution may be used for each mixture component instead of the Gaussian distribution. After the parameters of the model have been estimated, outliers are defined as those data points in D that are highly unlikely to be generated by this model. Note that such an assumption exactly reflects Hawkins's definition of outliers, as stated at the beginning of this chapter.

Next, we discuss the estimation of the various parameters of the model, such as the estimation of the different values of α_i and the parameters of the different distributions G_r. The objective function of this estimation process is to ensure that the full data D has the maximum likelihood fit to the generative model. Assume that the density function of G_i is given by f^i(·). The probability (density function) of the data point X_j being generated by the model is given by the following:

  f^{point}(X_j | M) = \sum_{i=1}^{k} \alpha_i \cdot f^{i}(X_j)    (8.6)

Note that the density value f^{point}(X_j | M) provides an estimate of the outlier score of the data point. Data points that are outliers will naturally have low fit values. Examples of the relationship of the fit values to the outlier scores are illustrated in Fig. 8.6. Data points A and B will typically have very low fit to the mixture model and will be considered outliers because the data points A and B do not naturally belong to any of the mixture components. Data point C will have high fit to the mixture model and will, therefore, not be considered an outlier.

  Figure 8.6: Likelihood fit values versus outlier scores

The parameters of the model M are estimated using a maximum likelihood criterion, which is discussed below. For data set D containing n data points, denoted by X_1 ... X_n, the probability density of the data set being generated by model M is the product of the various point-specific probability densities:

  f^{data}(D | M) = \prod_{j=1}^{n} f^{point}(X_j | M)    (8.7)

The log-likelihood fit L(D | M) of the data set D with respect to M is the logarithm of the aforementioned expression and can be (more conveniently) represented as a sum of values over the different data points:

  L(D | M) = \log \left( \prod_{j=1}^{n} f^{point}(X_j | M) \right) = \sum_{j=1}^{n} \log \left( \sum_{i=1}^{k} \alpha_i \cdot f^{i}(X_j) \right)    (8.8)

This log-likelihood fit needs to be optimized to determine the model parameters. This objective function maximizes the fit of the data points to the generative model. For this purpose, the EM algorithm discussed in Sect. 6.5 of Chap. 6 is used. After the parameters of the model have been determined, the value of f^{point}(X_j | M) (or its logarithm) may be reported as the outlier score.

The major advantage of such mixture models is that the mixture components can also incorporate domain knowledge about the shape of each individual mixture component. For example, if it is known that the data points in a particular cluster are correlated in a certain way, then this fact can be incorporated in the mixture model by fixing the appropriate parameters of the covariance matrix and learning the remaining parameters. On the other hand, when the available data is limited, mixture models may overfit the data. This will cause data points that are truly outliers to be missed.
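A minimal sketch of this mixture-based scoring, using scikit-learn's EM-based Gaussian mixture as an assumed stand-in for the EM procedure of Chap. 6:

```python
from sklearn.mixture import GaussianMixture

def mixture_outlier_scores(X, k=3, seed=0):
    """Sketch: fit a k-component Gaussian mixture with EM and report the
    per-point log-likelihood fit; lower fit means a stronger outlier."""
    gmm = GaussianMixture(n_components=k, covariance_type='full',
                          random_state=seed).fit(X)
    log_fit = gmm.score_samples(X)    # log f_point(X_j | M)
    return -log_fit                   # larger score = more outlier-like
```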

points outside the clusters. The detection of outliers as a side-product of clustering methods is, however, not an appropriate approach because clustering algorithms are not optimized for outlier detection. Data points on the boundary regions of a cluster may also be considered weak outliers but are rarely useful in most application-specific scenarios.

Figure 8.7: Small isolated groups of anomalies

Clustering models do have some advantages as well. Outliers often tend to occur in small clusters of their own. This is because the anomaly in the generating process may be repeated a few times. As a result, a small group of related outliers may be created. An example of a small set of isolated outliers is illustrated in Fig. 8.7. As will be discussed later, clustering methods are generally robust to such scenarios because such groups often do not have the critical mass required to form clusters of their own.

A simple way of defining the outlier score of a data point is to first cluster the data set and then use the raw distance of the data point to its closest cluster centroid. One can, however, do better when the clusters are elongated or have varying density over the data set. As discussed in Chap. 3, the local data distribution often distorts the distances, and, therefore, it is not optimal to use the raw distance. This broader principle is used in multivariate extreme value analysis, where the global Mahalanobis distance defines outlier scores. In this case, the local Mahalanobis distance can be used with respect to the centroid of the closest cluster.

Consider a data set in which k clusters have been discovered with the use of a clustering algorithm. Assume that the rth cluster in d-dimensional space has a corresponding d-dimensional mean vector μr, and a d × d covariance matrix Σr. The (i, j)th entry of this matrix is the covariance between the dimensions i and j in that cluster. Then, the Mahalanobis distance Maha(X, μr, Σr) between a data point X and cluster centroid μr is defined as follows:

    Maha(X, \mu_r, \Sigma_r) = \sqrt{ (X - \mu_r) \, \Sigma_r^{-1} \, (X - \mu_r)^T }        (8.9)

This distance is reported as the outlier score. Larger values of the outlier score indicate a greater outlier tendency. After the outlier score has been determined, univariate extreme value analysis may be used to convert the scores to binary labels. The justification for using the Mahalanobis distance is exactly analogous to the case of extreme value analysis of multivariate distances, as discussed in Sect. 8.2. The only difference is that the local cluster-specific Mahalanobis distances are more relevant to determination of

248 CHAPTER 8. OUTLIER ANALYSIS general outliers, whereas global Mahalanobis distances are more relevant to determination of specific types of outliers, such as extreme values. The use of the local Mahalanobis distance also has an interesting connection to the likelihood fit criterion of EM algorithm where the (squared) Mahalanobis distance occurs in the exponent of each Gaussian mixture. Thus, the sum of the inverse exponentiated Mahalanobis distances of a data point to different mixture component means (cluster means) are used to determine the outlier score in the EM algorithm. Such a score can be viewed as a soft version of the score determined by hard clustering algorithms. Clustering methods are based on global analysis. Therefore, small, closely related groups of data points will not form their own clusters in most cases. For example, the four isolated points in Fig. 8.7 will not typically be considered a cluster. Most clustering algorithms require a minimum critical mass for a set of data points to be considered a cluster. As a result, these points will have a high outlier score. This means that clustering methods are able to detect these small and closely related groups of data points meaningfully and report them as outliers. This is not the case for some of the density-based methods that are based purely on local analysis. The major problem with clustering algorithms is that they are sometimes not able to properly distinguish between a data point that is ambient noise and a data point that is a truly isolated anomaly. Clearly, the latter is a much stronger anomaly than the former. Both these types of points will not reside in a cluster. Therefore, the distance to the closest cluster centroid will often not be very representative of their local distribution (or instance-specific distribution). In these cases, distance-based methods are more effective. 8.5 Distance-Based Outlier Detection Because outliers are defined as data points that are far away from the “crowded regions” (or clusters) in the data, a natural and instance-specific way of defining an outlier is as follows: The distance-based outlier score of an object O is its distance to its kth nearest neighbor. The aforementioned definition, which uses the k-nearest neighbor distance, is the most common one. Other variations of this definition are sometimes used, such as the average distance to the k-nearest neighbors. The value of k is a user-defined parameter. Selecting a value of k larger than 1 helps identify isolated groups of outliers. For example, in Fig. 8.7, as long as k is fixed to any value larger than 3, all data points within the small groups of closely related points will have a high outlier score. Note that the target data point, for which the outlier score is computed, is itself not included among its k-nearest neighbors. This is done to avoid scenarios where a 1-nearest neighbor method will always yield an outlier score of 0. Distance-based methods typically use a finer granularity of analysis than clustering methods and can therefore distinguish between ambient noise and truly isolated anomalies. This is because ambient noise will typically have a lower k-nearest neighbor distance than a truly isolated anomaly. This distinction is lost in clustering methods where the distance to the closest cluster centroid does not accurately reflect the instance-specific isolation of the underlying data point. The price of this better granularity is higher computational complexity. 
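As an illustration of the definition above, the following Python sketch computes the exact k-nearest neighbor outlier scores by brute force. The function name, the use of NumPy, and the choice of the Euclidean distance are illustrative assumptions rather than part of the original description:

    import numpy as np

    def knn_outlier_scores(D, k):
        # D: (n, d) array of data points; the score of a point is the
        # distance to its k-th nearest neighbor, excluding the point itself.
        diffs = D[:, None, :] - D[None, :, :]
        dist = np.sqrt((diffs ** 2).sum(axis=-1))
        np.fill_diagonal(dist, np.inf)          # a point is not its own neighbor
        return np.sort(dist, axis=1)[:, k - 1]  # k-th smallest distance per row

Both the time and the memory of this sketch grow quadratically with the number of points, which is exactly the bottleneck that the index structures and pruning methods discussed next are designed to alleviate.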
Consider a data set D containing n data points. The determination of the k-nearest neighbor distance requires O(n) time for each data point, when a sequential scan is used. Therefore, the

determination of the outlier scores of all data points may require O(n²) time. This is clearly not a feasible option for very large data sets. Therefore, a variety of methods are used to speed up the computation:

1. Index structures: Index structures can be used to determine kth nearest neighbor distances efficiently. This is, however, not an option if the data is high dimensional. In such cases, the effectiveness of index structures tends to degrade.

2. Pruning tricks: In most applications, the outlier scores of all the data points are not required. It may suffice to return binary labels for the top-r outliers, together with their scores. The outlier scores of the remaining data points are irrelevant. In such cases, it may be possible to terminate a k-nearest neighbor sequential scan for an outlier candidate when its current upper bound estimate on the k-nearest neighbor distance value falls below the rth best outlier score found so far. This is because such a candidate is guaranteed to be not among the top-r outliers. This methodology is referred to as the “early termination trick,” and it is described in detail later in this section. In some cases, it is also possible to combine the pruning approach with index structures.

8.5.1 Pruning Methods

Pruning methods are used only for the case where the top-r ranked outliers need to be returned, and the outlier scores of the remaining data points are irrelevant. Thus, pruning methods can be used only for the binary-decision version of the algorithm. The basic idea in pruning methods is to reduce the time required for the k-nearest neighbor distance computations by quickly ruling out data points that are obviously nonoutliers, even with approximate computation.

8.5.1.1 Sampling Methods

The first step is to pick a sample S of size s ≪ n from the data D, and compute all pairwise distances between the data points in sample S and those in database D. There are a total of n · s such pairs. This process requires O(n · s) ≪ O(n²) distance computations. Thus, for each of the sampled points in S, the k-nearest neighbor distance is already known exactly. The top rth ranked outlier in sample S is determined, where r is the number of outliers to be returned. The score of the rth ranked outlier provides a lower bound¹ L on the rth ranked outlier score over the entire data set D. For the data points in D − S, only an upper bound V^k(X) on the k-nearest neighbor distance is known. This upper bound is equal to the k-nearest neighbor distance of each point in D − S to the sample S ⊂ D. However, if this upper bound V^k(X) is no larger than the lower bound L already determined, then such a data point X ∈ D − S can be excluded from further consideration as a top-r outlier. Typically, this will result in the removal of a large number of outlier candidates from D − S immediately, as long as the underlying data set is clustered well. This is because most of the data points in clusters will be removed, as long as at least one point from each cluster is included in the sample S, and at least r points in S are located in somewhat sparse regions. This can often be achieved with modest values of the sample size s in real-world data sets. After removing these data points from D − S, the remaining set of points is R ⊆ D − S. The k-nearest neighbor approach can be applied to a much smaller set of candidates R.

¹Note that higher k-nearest neighbor distances indicate greater outlierness.
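A minimal sketch of this sampling phase is shown below, assuming Euclidean distances, NumPy, and s ≥ max(k, r). The variable names (L for the lower bound, V for the upper bounds) mirror the notation of this section, but the function itself is illustrative and not part of the original description:

    import numpy as np

    def sampling_prune(D, k, r, s, seed=0):
        # Phase 1 of the pruning approach: exact scores for a sample S,
        # a lower bound L, and upper bounds V^k(X) for the points in D - S.
        n = D.shape[0]
        rng = np.random.default_rng(seed)
        sample_idx = rng.choice(n, size=s, replace=False)
        rest_idx = np.setdiff1d(np.arange(n), sample_idx)

        # All distances between D and the sample S: O(n * s) computations.
        diffs = D[:, None, :] - D[sample_idx][None, :, :]
        dist_to_S = np.sqrt((diffs ** 2).sum(axis=-1))        # shape (n, s)

        # Exact k-nearest neighbor distances of the sampled points over all of D.
        d_S_all = dist_to_S.T.copy()                          # shape (s, n)
        d_S_all[np.arange(s), sample_idx] = np.inf            # remove self-distances
        exact_knn_S = np.sort(d_S_all, axis=1)[:, k - 1]

        # L = r-th best (largest) outlier score within the sample.
        L = np.sort(exact_knn_S)[-r]

        # Upper bounds V^k(X) for X in D - S: k-th nearest distance to S only.
        V = np.sort(dist_to_S[rest_idx], axis=1)[:, k - 1]

        # Candidates R that may still be among the top-r outliers.
        keep = V > L
        return sample_idx, rest_idx[keep], V[keep], L

The surviving candidates in R, together with their upper bounds and the bound L, are exactly the quantities needed by the nested-loop early-termination phase described next.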

The top-r ranked outliers in R ∪ S are returned as the final output. Depending on the level of pruning already achieved, this can result in a very significant reduction in computational time, especially when |R ∪ S| ≪ |D|.

8.5.1.2 Early Termination Trick with Nested Loops

The approach discussed in the previous section can be improved even further by speeding up the second phase of computing the k-nearest neighbor distances of each data point in R. The idea is that the computation of the k-nearest neighbor distance of any data point X ∈ R need not be followed through to termination once it has been determined that X cannot possibly be among the top-r outliers. In such cases, the scan of the database D for computation of the k-nearest neighbor of X can be terminated early.

Note that one already has an estimate (upper bound) V^k(X) of the k-nearest neighbor distance of every X ∈ R, based on distances to sample S. Furthermore, the k-nearest neighbor distance of the rth best outlier in S provides a lower bound on the “cut-off” required to make it to the top-r outliers. This lower bound is denoted by L. The estimate V^k(X) of the k-nearest neighbor distance of X is further tightened (reduced) as the database D − S is scanned, and the distance of X is computed to each point in D − S. Because this running estimate V^k(X) is always an upper bound on the true k-nearest neighbor distance of X, the process of determining the k-nearest neighbor of X can be terminated as soon as V^k(X) falls below the known lower bound L on the top-r outlier distance. This is referred to as early termination and provides significant computational savings. Then, the next data point in R can be processed. In cases where early termination is not achieved, the data point X will almost² always be among the top-r (current) outliers. Therefore, in this case, the lower bound L can be tightened (increased) as well, to the new rth best outlier score. This will result in even better pruning when the next data point from R is processed to determine its k-nearest neighbor distance value.

To maximize the benefits of pruning, the data points in R should not be processed in arbitrary order. Rather, they should be processed in decreasing order of the initially sampled estimate V^k(·) of the k-nearest neighbor distances (based on S). This ensures that the outliers in R are found early on, and the global bound L is tightened as fast as possible for even better pruning. Furthermore, in the inner loop, the data points Y in D − S can be ordered in the opposite direction, based on increasing value of V^k(Y). Doing so ensures that the k-nearest neighbor distances are updated as fast as possible, and the advantage of early termination is maximized. The nested loop approach can also be implemented without the first phase³ of sampling, but such an approach will not have the advantage of proper ordering of the data points processed. Starting with an initial lower bound L on the rth best outlier score obtained from the sampling phase, the nested loop is executed as follows:

²We say “almost,” because the very last distance computation for X may bring V^k(X) below L. This scenario is unusual, but might occasionally occur.
³Most descriptions in the literature omit the first phase of sampling, which is very important for efficiency maximization. A number of implementations in time-series analysis [306] do order the data points more carefully but not with sampling.

    for each X ∈ R do begin
        for each Y ∈ D − S do begin
            update the current k-nearest neighbor distance estimate V^k(X)
                by computing the distance of Y to X;
            if V^k(X) ≤ L then terminate inner loop;
        endfor
        if V^k(X) > L then
            include X in the current r best outliers and
            update L to the new rth best outlier score;
    endfor

Note that the k-nearest neighbors of a data point X do not include the data point itself. Therefore, care must be taken in the nested loop structure to ignore the trivial cases where X = Y while updating k-nearest neighbor distances.

8.5.2 Local Distance Correction Methods

Section 3.2.1.8 of Chap. 3 provides a detailed discussion of the impact of the local data distribution on distance computation. In particular, it is shown that straightforward measures, such as the Euclidean distance, do not reflect the intrinsic distances between data points when the density and shape of the clusters vary significantly with data locality. This principle was also used in Sect. 8.4 to justify the use of the local Mahalanobis distance for measuring the distances to cluster centroids, rather than the Euclidean distance. One of the earliest methods that recognized this principle in the context of varying data density was the Local Outlier Factor (LOF) method. A formal justification is based on the generative principles of data sets, but only an intuitive understanding will be provided here. It should be pointed out that the use of the Mahalanobis distance (instead of the Euclidean distance) for multivariate extreme value analysis (Sect. 8.2.2) is also based on generative principles of the likelihood of a data point conforming to the statistical properties of the underlying distribution. The main difference is that the analysis was global in that case, whereas the analysis is local in this case. The reader is also advised to revisit Sect. 3.2.1.8 of Chap. 3 for a discussion of the impact of data distributions on intrinsic distances between data points.

To motivate the principles of local distance correction in the context of outlier analysis, two examples will be used. One of these examples illustrates the impact of varying local distribution density, whereas the other example illustrates the impact of varying local cluster shape. Both these aspects can be addressed with different kinds of local normalization of distance computations.

In Fig. 8.8a, two different clusters have been shown, one of which is much sparser than the other. In this case, both data points A and B are clearly outliers. While the outlier B will be easily detected by most distance-based algorithms, a challenge arises in the detection of outlier A. This is because the nearest neighbor distance of many data points in the sparser cluster is at least as large as the nearest neighbor distance of outlier A. As a result, depending on the distance threshold used, a k-nearest neighbor algorithm will either falsely report portions of the sparse cluster, or will completely miss outlier A. Simply speaking, the ranking of the outliers by distance-based algorithms is an incorrect one. This is because the true distance of points in cluster A should be computed in a normalized way, based on its local data distribution. This aspect is relevant to the discussion in Sect. 3.2.1.8 of Chap. 3 on the impact of local data distributions on distance function design, and it is important for many distance-based data mining problems. The key

252 CHAPTER 8. OUTLIER ANALYSIS Figure 8.8: Impact of local variations in data distribution on distance-based outlier detection issue here is the generative principle, that data point A is much less likely to be generated by its closest (tightly knit) cluster than many slightly isolated data points belonging to the relatively diffuse cluster are likely to be generated by their cluster. Hawkins’s definition of outliers, stated at the beginning of this chapter, was formulated on the basis of generative principles. It should be pointed out that the probabilistic EM algorithm of Sect. 8.3 does a much better job at recognizing these generative differences. However, the probabilistic EM method is often not used practically because of overfitting issues with smaller data sets. The LOF approach is the first method that recognized the importance of incorporating these generative principles in nonparametric distance-based algorithms. This point can be emphasized further by examining clusters of different local shape and orientation in Fig. 8.8b. In this case, a distance-based algorithm will report one of the data points along the long axis of one of the elongated clusters, as the strongest outlier, if the 1-nearest neighbor distance is used. This data point is far more likely to be generated by its closest cluster, than the outlier marked by “X.” However, the latter has a smaller 1-nearest neighbor distance. Therefore, the significant problem with distance-based algorithms is that they do not account for the local generative behavior of the underlying data. In this section, two methods will be discussed for addressing this issue. One of them is LOF, and the other is a direct generalization of the global Mahalanobis method for extreme value analysis. The first method can adjust for the generative variations illustrated in Fig. 8.8a, and the second method can adjust for the generative variations illustrated in Fig. 8.8b. 8.5.2.1 Local Outlier Factor (LOF) The Local Outlier Factor (LOF) approach adjusts for local variations in cluster density by normalizing distances with the average point-specific distances in a data locality. It is often understood popularly as a density-based approach, although, in practice, it is a (normalized) distance-based approach where the normalization factor corresponds to the average local data density. This normalization is the key to addressing the challenges posed by the scenario of Fig. 8.8a. For a given data point X, let V k(X) be the distance to its k-nearest neighbor, and let Lk(X) be the set of points within the k-nearest neighbor distance of X. The set Lk(X) will

typically contain k points, but may sometimes contain more than k points because of ties in the k-nearest neighbor distance. Then, the reachability distance R_k(X, Y) of object X with respect to Y is defined as the maximum of the distance Dist(X, Y) between the pair (X, Y) and the k-nearest neighbor distance of Y.

    R_k(X, Y) = \max\{ Dist(X, Y), V^k(Y) \}        (8.10)

The reachability distance is not symmetric between X and Y. Intuitively, when Y is in a dense region and the distance between X and Y is large, the reachability distance of X with respect to it is equal to the true distance Dist(X, Y). On the other hand, when the distances between X and Y are small, then the reachability distance is smoothed out by the k-nearest neighbor distance of Y. The larger the value of k, the greater the smoothing. Correspondingly, the reachability distances with respect to different points will also become more similar. The reason for using this smoothing is that it makes the intermediate distance computations more stable. This is especially important when the distances between X and Y are small, because small raw distances are subject to greater statistical fluctuations. At a conceptual level, it is possible to define a version of LOF directly in terms of raw distances, rather than reachability distances. However, such a version would be missing the stability provided by smoothing.

The average reachability distance AR_k(X) of data point X with respect to its neighborhood L_k(X) is defined as the average of its reachability distances to all objects in its neighborhood.

    AR_k(X) = \mathrm{MEAN}_{Y \in L_k(X)} R_k(X, Y)        (8.11)

Here, the MEAN function simply represents the mean value over the entire set L_k(X). The Local Outlier Factor LOF_k(X) is then equal to the mean ratio of AR_k(X) to the corresponding values of all points in the k-neighborhood of X.

    LOF_k(X) = \mathrm{MEAN}_{Y \in L_k(X)} \frac{AR_k(X)}{AR_k(Y)}        (8.12)

The use of distance ratios in the definition ensures that the local distance behavior is well accounted for in this definition. As a result, the LOF values for the objects in a cluster are often close to 1 when the data points in the cluster are homogeneously distributed. For example, in the case of Fig. 8.8a, the LOF values of data points in both clusters will be quite close to 1, even though the densities of the two clusters are different. On the other hand, the LOF values of both the outlying points will be much higher because they will be computed in terms of the ratios to the average neighbor reachability distances. In practice, the maximum value of LOF_k(X) over a range of different values of k is used as the outlier score to determine the best size of the neighborhood.

One observation about the LOF method is that while it is popularly understood in the literature as a density-based approach, it can be more simply understood as a relative distance-based approach with smoothing. The smoothing is really a refinement to make distance computations more stable. The basic LOF method will work reasonably well on many data sets, even if the raw distances are used instead of reachability distances for the aforementioned computations of Eq. 8.11. The LOF method, therefore, has the ability to adjust well to regions of varying density because of this relative normalization in the denominator of each term of Eq. 8.12. In the original presentation of the LOF algorithm (see bibliographic notes), the LOF is defined in terms of a density variable.
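Before turning to that density-based presentation, the distance-based view of Eqs. 8.10–8.12 can be summarized in a short sketch. The brute-force distance matrix, the treatment of ties (ignored here), and the function name are simplifying assumptions; production implementations in standard machine learning libraries differ in such details:

    import numpy as np

    def lof_scores(D, k):
        # Sketch of LOF following Eqs. 8.10-8.12; ties in the k-NN distance
        # are ignored, so L_k(X) always contains exactly k points here.
        n = D.shape[0]
        dist = np.sqrt(((D[:, None, :] - D[None, :, :]) ** 2).sum(axis=-1))
        np.fill_diagonal(dist, np.inf)

        order = np.argsort(dist, axis=1)
        Lk = order[:, :k]                            # neighborhoods L_k(X)
        Vk = dist[np.arange(n), order[:, k - 1]]     # k-NN distances V^k(X)

        # Reachability distances R_k(X, Y) = max{Dist(X, Y), V^k(Y)}   (Eq. 8.10)
        reach = np.maximum(dist, Vk[None, :])

        # Average reachability distances AR_k(X) over L_k(X)           (Eq. 8.11)
        AR = np.array([reach[i, Lk[i]].mean() for i in range(n)])

        # LOF_k(X) = mean of AR_k(X) / AR_k(Y) over Y in L_k(X)        (Eq. 8.12)
        return np.array([(AR[i] / AR[Lk[i]]).mean() for i in range(n)])

In practice, this computation would be repeated over several values of k, and the maximum score of each point retained, as discussed above.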
The density variable is loosely defined as the inverse of the

254 CHAPTER 8. OUTLIER ANALYSIS average of the smoothed reachability distances. This is, of course, not a precise definition of density. Density is traditionally defined in terms of the number of data points within a specified area or volume. This book provides exactly the same definition of LOF but presents it slightly differently by omitting the intermediate density variable. This is done both for simplicity, and for a definition of LOF directly in terms of (normalized) distances. The real connection of LOF to data density lies in its insightful ability to adjust to varying data density with the use of relative distances. Therefore, this book has classified this approach as a (normalized) distance-based method, rather than as a density-based method. 8.5.2.2 Instance-Specific Mahalanobis Distance The instance-specific Mahalanobis distance is designed for adjusting to varying shapes of the distributions in the locality of a particular data point, as illustrated in Fig. 8.8b. The Mahalanobis distance is directly related to shape of the data distribution, although it is tra- ditionally used in the global sense. Of course, it is also possible to use the local Mahalanobis distance by using the covariance structure of the neighborhood of a data point. The problem here is that the neighborhood of a data point is hard to define with the Euclidean distance when the shape of the neighborhood cluster is not spherical. For exam- ple, the use of the Euclidean distance to a data point is biased toward capturing the circular region around that point, rather than an elongated cluster. To address this issue, an agglom- erative approach is used to determine the k-neighborhood Lk(X) of a data point X. First, data point X is added to Lk(X). Then, data points are iteratively added to Lk(X) that have the smallest distance to their closest point in Lk(X). This approach can be viewed as a special case of single-linkage hierarchical clustering methods, where singleton points are merged with clusters. Single-linkage methods are well-known for creating clusters of arbitrary shape. Such an approach tends to “grow” the neighborhood with the same shape as the cluster. The mean μk(X) and covariance matrix Σk(X) of the neighborhood Lk(X) are computed. Then, the instance-specific Mahalanobis score LM ahak(X) of a data point X provides its outlier score. This score is defined as the Mahalanobis distance of X to the mean μk(X) of data points in Lk(X). LM ahak(X) = M aha(X, μk(X), Σk(X)) (8.13) The only difference between this computation and that of the global Mahalanobis distance for extreme value analysis is that the local neighborhood set Lk(X) is used as the “relevant” data for comparison in the former. While the clustering approach of Sect. 8.4 does use a Mahalanobis metric on the local neighborhood, the computation is subtly different in this case. In the case of clustering-based outlier detection, a preprocessing approach predefines a limited number of clusters as the universe of possible neighborhoods. In this case, the neighborhood is constructed in an instance-specific way. Different points will have slightly different neighborhoods, and they may not neatly correspond to a predefined cluster. This additional granularity allows more refined analysis. At a conceptual level, this approach computes whether data point X can be regarded as an extreme value with respect to its local cluster. 
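A possible sketch of this instance-specific computation is given below. The agglomerative (single-linkage style) neighborhood growth follows the description above, with the neighborhood taken to contain X itself plus k grown neighbors; the small ridge term added to the covariance matrix is an implementation assumption to keep it invertible for small neighborhoods, and is not part of the original description:

    import numpy as np

    def instance_specific_mahalanobis(D, k, eps=1e-6):
        # Outlier scores based on Eq. 8.13 with agglomeratively grown neighborhoods.
        n, d = D.shape
        dist = np.sqrt(((D[:, None, :] - D[None, :, :]) ** 2).sum(axis=-1))
        scores = np.empty(n)
        for i in range(n):
            members = [i]
            candidates = [j for j in range(n) if j != i]
            while len(members) < k + 1 and candidates:
                # Single-linkage growth: add the candidate closest to any member.
                sub = dist[np.ix_(members, candidates)]
                j = int(np.argmin(sub.min(axis=0)))
                members.append(candidates.pop(j))
            Lk = D[members]                                   # neighborhood L_k(X)
            mu = Lk.mean(axis=0)
            Sigma = np.cov(Lk, rowvar=False) + eps * np.eye(d)
            diff = D[i] - mu
            scores[i] = np.sqrt(diff @ np.linalg.inv(Sigma) @ diff)
        return scores

Because the covariance matrix is estimated from only k + 1 points, k should be chosen comfortably larger than the dimensionality d for the local estimate to be meaningful.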
As in the case of the LOF method, the approach can be applied for different values of k, and the highest outlier score for each data point can be reported. If this approach is applied to the example of Fig. 8.8b, the method will correctly deter- mine the outlier because of the local Mahalanobis normalization with the appropriate (local) covariance matrix for each data point. No distance normalizations are necessary for vary- ing data density (scenario of Fig. 8.8a) because the Mahalanobis distance already performs these local normalizations under the covers. Therefore, such a method can be used for the

8.6. DENSITY-BASED METHODS 255 scenario of Fig. 8.8a as well. The reader is referred to the bibliographic notes for variations of LOF that use the concept of varying local cluster shapes with agglomerative neighborhood computation. 8.6 Density-Based Methods Density-based methods are loosely based on similar principles as density-based clustering. The idea is to determine sparse regions in the underlying data in order to report out- liers. Correspondingly, histogram-based, grid-based, or kernel density-based methods can be used. Histograms can be viewed as 1-dimensional special cases of grid-based methods. These methods have not, however, found significant popularity because of their difficulty in adjusting to variations of density in different data localities. The definition of density also becomes more challenging with increasing dimensionality. Nevertheless, these meth- ods are used more frequently in the univariate case because of their natural probabilistic interpretation. 8.6.1 Histogram- and Grid-Based Techniques Histograms are simple and easy to construct for univariate data, and are therefore used quite frequently in many application domains. In this case, the data is discretized into bins, and the frequency of each bin is estimated. Data points that lie in bins with very low frequency are reported as outliers. If a continuous outlier score is desired, then the number of other data points in the bin for data point X is reported as the outlier score for X. Therefore, the count for a bin does not include the point itself in order to minimize overfitting for smaller bin widths or smaller number of data points. In other words, the outlier score for each data point is one less than its bin count. In the context of multivariate data, a natural generalization is the use of a grid-structure. Each dimension is partitioned into p equi-width ranges. As in the previous case, the number of points in a particular grid region is reported as the outlier score. Data points that have density less than τ in any particular grid region are reported as outliers. The appropriate value of τ can be determined by using univariate extreme value analysis. The major challenge with histogram-based techniques is that it is often hard to determine the optimal histogram width well. Histograms that are too wide, or too narrow, will not model the frequency distribution well. These are similar issues to those encountered with the use of grid-structures for clustering. When the bins are too narrow, the normal data points falling in these bins will be declared outliers. On the other hand, when the bins are too wide, anomalous data points and high-density regions may be merged into a single bin. Therefore, such anomalous data points may not be declared outliers. A second issue with the use of histogram techniques is that they are too local in nature, and often do not take the global characteristics of the data into account. For example, for the case of Fig. 8.7, a multivariate grid-based approach may not be able to classify an isolated group of data points as outliers, unless the resolution of the grid structure is calibrated carefully. This is because the density of the grid only depends on the data points inside it, and an isolated group of points may create an artificially dense grid cell when the granularity of representation is high. Furthermore, when the density distribution varies significantly with data locality, grid-based methods may find it difficult to normalize for local variations in density.
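The following sketch illustrates the grid-based score described above for multivariate data, with p equi-width ranges per dimension. The dictionary-based cell counting and the function name are implementation choices made for the illustration:

    import numpy as np
    from collections import Counter

    def grid_outlier_scores(D, p):
        # Score of a point = number of *other* points in its grid cell;
        # lower values indicate a greater tendency to be an outlier.
        mins, maxs = D.min(axis=0), D.max(axis=0)
        widths = np.where(maxs > mins, maxs - mins, 1.0)
        cells = np.minimum(((D - mins) / widths * p).astype(int), p - 1)
        keys = [tuple(row) for row in cells]
        counts = Counter(keys)
        return np.array([counts[key] - 1 for key in keys])

A binary label can then be obtained by flagging the points whose cells contain fewer than τ points, with τ selected by univariate extreme value analysis as described above.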

Finally, histogram methods do not work very well in high dimensionality because of the sparsity of the grid structure with increasing dimensionality, unless the outlier score is computed with respect to carefully chosen lower dimensional projections. For example, a d-dimensional space will contain at least 2^d grid-cells, and, therefore, the number of data points expected to populate each cell reduces exponentially with increasing dimensionality. These problems with grid-based methods are well known, and are also frequently encountered in the context of other data mining applications such as clustering.

8.6.2 Kernel Density Estimation

Kernel density estimation methods are similar to histogram techniques in terms of building density profiles, though the major difference is that a smoother version of the density profile is constructed. In kernel density estimation, a continuous estimate of the density is generated at a given point. The value of the density at a given point is estimated as the sum of the smoothed values of kernel functions K_h(·) associated with each point in the data set. Each kernel function is associated with a kernel width h that determines the level of smoothing created by the function. The kernel estimation f(X), based on n data points of dimensionality d and kernel function K_h(·), is defined as follows:

    f(X) = \frac{1}{n} \cdot \sum_{i=1}^{n} K_h(X - X_i)        (8.14)

Thus, each discrete point X_i in the data set is replaced by a continuous function K_h(·) that peaks at X_i and has a variance determined by the smoothing parameter h. An example of such a distribution is the Gaussian kernel with width h.

    K_h(X - X_i) = \left( \frac{1}{\sqrt{2\pi} \cdot h} \right)^{d} \cdot e^{-||X - X_i||^2 / (2h^2)}        (8.15)

The estimation error is defined by the kernel width h, which is chosen in a data-driven manner. It has been shown that for most smooth functions K_h(·), when the number of data points goes to infinity, the estimate asymptotically converges to the true density value, provided that the width h is chosen appropriately. The density at each data point is computed without including the point itself in the density computation. The value of the density is reported as the outlier score. Low values of the density indicate a greater tendency to be an outlier.

Density-based methods face challenges similar to those of histogram- and grid-based techniques. In particular, the use of a global kernel width h to estimate density may not work very well in cases where there are wide variations in local density, such as those in Figs. 8.7 and 8.8. This is because of the myopic nature of density-based methods, in which the variations in the density distribution are not well accounted for. Nevertheless, kernel-density-based methods can be better generalized to data with local variations, especially if the bandwidth is chosen locally. As in the case of grid-based methods, these techniques are not very effective for higher dimensionality. The reason is that the accuracy of the density estimation approach degrades with increasing dimensionality.

8.7 Information-Theoretic Models

Outliers are data points that do not naturally fit the remaining data distribution. Therefore, if a data set were to be somehow compressed with the use of the “normal” patterns in the

8.7. INFORMATION-THEORETIC MODELS 257 data distribution, the outliers would increase the minimum code length required to describe it. For example, consider the following two strings: ABABABABABABABABABABABABABABABABAB ABABACABABABABABABABABABABABABABAB The second string is of the same length as the first and is different at only a single position containing the unique symbol C. The first string can be described concisely as “AB 17 times.” However, the second string has a single position corresponding to the symbol C. Therefore, the second string can no longer be described as concisely. In other words, the presence of the symbol C somewhere in the string increases its minimum description length. It is also easy to see that this symbol corresponds to an outlier. Information-theoretic models are based on this general principle because they measure the increase in model size required to describe the data as concisely as possible. Information-theoretic models can be viewed as almost equivalent to conventional deviation-based models, except that the outlier score is defined by the model size for a fixed deviation, rather than the deviation for a fixed model. In conventional models, out- liers are always defined on the basis of a “summary” model of normal patterns. When a data point deviates significantly from the estimations of the summary model, then this deviation value is reported as the outlier score. Clearly, a trade-off exists between the size of the summary model and the level of deviation. For example, if a clustering model is used, then a larger number of cluster centroids (model size) will result in lowering the maximum deviation of any data point (including the outlier) from its nearest centroid. Therefore, in conventional models, the same clustering is used to compute deviation values (scores) for the different data points. A slightly different way of computing the outlier score is to fix the maximum allowed deviation (instead of the number of cluster centroids) and compute the number of cluster centroids required to achieve the same level of deviation, with and without a particular data point. It is this increase that is reported as the outlier score in the information-theoretic version of the same model. The idea here is that each point can be estimated by its closest cluster centroid, and the cluster centroids serve as a “code-book” in terms of which the data is compressed in a lossy way. Information-theoretic models can be viewed as a complementary version of conventional models where a different aspect of the space-deviation trade-off curve is examined. Virtu- ally every conventional model can be converted into an information-theoretic version by examining the bi-criteria space-deviation trade-off in terms of space rather than deviation. The bibliographic notes will also provide specific examples of each of the cases below: 1. The probabilistic model of Sect. 8.3 models the normal patterns in terms of genera- tive model parameters such as the mixture means and covariance matrices. The space required by the model is defined by its complexity (e.g., number of mixture com- ponents), and the deviation corresponds to the probabilistic fit. In an information- theoretic version of the model, the complementary approach is to examine the size of the model required to achieve a fixed level of fit. 2. A clustering or density-based summarization model describes a data set in terms of cluster descriptions, histograms or other summarized representations. 
The granularity of these representations (number of cluster centroids, or histogram bins) controls the space, whereas the error in approximating the data point with a central element of the cluster (bin) defines the deviation. In conventional models, the size of the model (number of bins or clusters) is fixed, whereas in the information-theoretic version, the

258 CHAPTER 8. OUTLIER ANALYSIS maximum allowed deviation is fixed, and the required model size is reported as the outlier score. 3. A frequent pattern mining model describes the data in terms of an underlying code- book of frequent patterns. The larger the size of the code-book (by using frequent pat- terns of lower support), the more accurately the data can be described. These models are particularly popular, and some pointers are provided in the bibliographic notes. All these models represent the data approximately in terms of individual condensed compo- nents representing aggregate trends. In general, outliers increase the length of the descrip- tion in terms of these condensed components to achieve the same level of approximation. For example, a data set with outliers will require a larger number of mixture parame- ters, clusters, or frequent patterns to achieve the same level of approximation. Therefore, in information-theoretic methods, the components of these summary models are loosely referred to as “code books.” Outliers are defined as data points whose removal results in the largest decrease in description length for the same error. The actual construction of the coding is often heuristic, and is not very different from the summary models used in conven- tional outlier analysis. In some cases, the description length for a data set can be estimated without explicitly constructing a code book, or building a summary model. An example is that of the entropy of a data set, or the Kolmogorov complexity of a string. Readers are referred to the bibliographic notes for examples of such methods. While information-theoretic models are approximately equivalent to conventional models in that they explore the same trade-off in a slightly different way, they do have an advantage in some cases. These are cases where an accurate summary model of the data is hard to explicitly construct, and measures such as the entropy or Kolmogorov complexity can be used to estimate the compressed space requirements of the data set indirectly. In such cases, information-theoretic methods can be useful. In cases where the summary models can be explicitly constructed, it is better to use conventional models because the outlier scores are directly optimized to point-specific deviations rather than the more blunt measure of differential space impact. The bibliographic notes provide specific examples of some of the aforementioned methods. 8.8 Outlier Validity As in the case of clustering models, it is desirable to determine the validity of outliers determined by a particular algorithm. Although the relationship between clustering and outlier analysis is complementary, the measures for outlier validity cannot easily be designed in a similar complementary way. In fact, validity analysis is much harder in outlier detection than data clustering. The reasons for this are discussed in the next section. 8.8.1 Methodological Challenges As in the case of data clustering, outlier analysis is an unsupervised problem. Unsupervised problems are hard to validate because of the lack of external criteria, unless such criteria are synthetically generated, or some rare aspects of real data sets are used as proxies. Therefore, a natural question arises, as to whether internal criteria can be defined for outlier validation, as is the case for data clustering. However, internal criteria are rarely used in outlier analysis. 
While such criteria are well- known to be flawed even in the context of data clustering, these flaws become significant

8.8. OUTLIER VALIDITY 259 enough to make these criteria unusable for outlier analysis. The reader is advised to refer to Sect. 6.9.1 of Chap. 6 for a discussion of the challenges of internal cluster validity. Most of these challenges are related to the fact that cluster validity criteria are derived from the objective function criteria of clustering algorithms. Therefore, a particular validity measure will favor (or overfit) a clustering algorithm using a similar objective function criterion. These problems become magnified in outlier analysis because of the small sample solution space. A model only needs to be correct on a few outlier data points to be considered a good model. Therefore, the overfitting of internal validity criteria, which is significant even in clustering, becomes even more problematic in outlier analysis. As a specific example, if one used the k-nearest neighbor distance as an internal validity measure, then a pure distance-based outlier detector will always outperform a locally normalized detector such as LOF. This is, of course, not consistent with known experience in real settings, where LOF usually provides more meaningful results. One can try to reduce the overfitting effect by designing a validity measure which is different from the outlier detection models being compared. However, this is not a satisfactory solution because significant uncertainty always remains about the impact of hidden interrelationships between such measures and outlier detection models. The main problem with internal measures is that the relative bias in evaluation of various algorithms is consistently present, even when the data set is varied. A biased selection of internal measures can easily be abused in algorithm benchmarking. Internal measures are almost never used in outlier analysis, although they are often used in clustering evaluation. Even in clustering, the use of internal validity measures is question- able in spite of its wider acceptance. Therefore, most of the validity measures used for outlier analysis are based on external measures such as the Receiver Operating Characteristic curve. 8.8.2 Receiver Operating Characteristic Outlier detection algorithms are typically evaluated with the use of external measures where the known outlier labels from a synthetic data set or the rare class labels from a real data set are used as the ground-truth. This ground-truth is compared systematically with the outlier score to generate the final output. While such rare classes may not always reflect all the natural outliers in the data, the results are usually reasonably representative of algorithm quality, when evaluated over many data sets. In outlier detection models, a threshold is typically used on the outlier score to generate a binary label. If the threshold is picked too restrictively to minimize the number of declared outliers then the algorithm will miss true outlier points (false-negatives). On the other hand, if the threshold is chosen in a more relaxed way, this will lead to too many false-positives. This leads to a trade-off between the false-positives and false-negatives. The problem is that the “correct” threshold to use is never known exactly in a real scenario. However, this entire trade-off curve can be generated, and various algorithms can be compared over the entire trade-off curve. One example of such a curve is the Receiver Operating Characteristic (ROC) curve. For any given threshold t on the outlier score, the declared outlier set is denoted by S(t). 
As t changes, the size of S(t) changes as well. Let G represent the true set (ground-truth set) of outliers in the data set. The true-positive rate, which is also referred to as the recall, is defined as the percentage of ground-truth outliers that have been reported as outliers at threshold t.

    TPR(t) = \mathrm{Recall}(t) = 100 \cdot \frac{|S(t) \cap G|}{|G|}

Table 8.1: ROC construction with rank of ground-truth outliers

    Algorithm          | Rank of ground-truth outliers
    -------------------+------------------------------
    Algorithm A        | 1, 5, 8, 15, 20
    Algorithm B        | 3, 7, 11, 13, 15
    Random Algorithm   | 17, 36, 45, 59, 66
    Perfect Oracle     | 1, 2, 3, 4, 5

The false-positive rate FPR(t) is the percentage of the falsely reported positives out of the ground-truth negatives. Therefore, for a data set D with ground-truth positives G, this measure is defined as follows:

    FPR(t) = 100 \cdot \frac{|S(t) - G|}{|D - G|}        (8.16)

The ROC curve is defined by plotting the FPR(t) on the X-axis, and TPR(t) on the Y-axis for varying values of t. Note that the end points of the ROC curve are always at (0, 0) and (100, 100), and a random method is expected to exhibit performance along the diagonal line connecting these points. The lift obtained above this diagonal line provides an idea of the accuracy of the approach. The area under the ROC curve provides a concrete quantitative evaluation of the effectiveness of a particular method.

To illustrate the insights gained from these different graphical representations, consider an example of a data set with 100 points from which 5 points are outliers. Two algorithms, A and B, are applied to this data set that rank all data points from 1 to 100, with lower rank representing greater propensity to be an outlier. Thus, the true-positive rate and false-positive rate values can be generated by determining the ranks of the 5 ground-truth outlier points. In Table 8.1, some hypothetical ranks for the five ground-truth outliers have been illustrated for the different algorithms. In addition, the ranks of the ground-truth outliers for a random algorithm have been indicated. The random algorithm outputs a random outlier score for each data point. Similarly, the ranks for a “perfect oracle” algorithm, which ranks the correct top 5 points as outliers, have also been illustrated in the table. The corresponding ROC curves are illustrated in Fig. 8.9.

What do these curves really tell us? For cases in which one curve strictly dominates another, it is clear that the algorithm for the former curve is superior. For example, it is immediately evident that the oracle algorithm is superior to all the other algorithms, and the random algorithm is inferior to all the other algorithms. On the other hand, algorithms A and B show domination at different parts of the ROC curve. In such cases, it is hard to say that one algorithm is strictly superior. From Table 8.1, it is clear that Algorithm A ranks three of the correct ground-truth outliers very highly, but the remaining two outliers are ranked poorly. In the case of Algorithm B, the highest ranked outliers are not as well ranked as in the case of Algorithm A, though all five outliers are determined much earlier in terms of the rank threshold. Correspondingly, Algorithm A dominates on the earlier part of the ROC curve, whereas Algorithm B dominates on the later part. Some practitioners use the area under the ROC curve as a proxy for the overall effectiveness of the algorithm, though such a measure should be used very carefully because all parts of the ROC curve may not be equally important for different applications.
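The construction of such curves can be reproduced with a short script. The rank-based interface below, the helper names, and the trapezoidal area computation are illustrative assumptions rather than a standard implementation:

    import numpy as np

    def roc_from_ranks(outlier_ranks, n):
        # TPR(t) and FPR(t), in percent, for thresholds t = 0 ... n, given the
        # ranks (1 = most outlying) assigned to the ground-truth outliers.
        ranks = np.sort(np.asarray(outlier_ranks))
        g = len(ranks)                                         # |G|
        tpr, fpr = [0.0], [0.0]
        for t in range(1, n + 1):
            tp = int(np.searchsorted(ranks, t, side="right"))  # |S(t) ∩ G|
            fp = t - tp                                        # |S(t) - G|
            tpr.append(100.0 * tp / g)
            fpr.append(100.0 * fp / (n - g))
        return np.array(fpr), np.array(tpr)

    def roc_auc(fpr, tpr):
        # Area under the curve by the trapezoidal rule, rescaled to [0, 100].
        return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0)) / 100.0

    # Curves for the hypothetical ranks of Table 8.1 (n = 100 points, 5 outliers):
    for name, ranks in [("A", [1, 5, 8, 15, 20]), ("B", [3, 7, 11, 13, 15]),
                        ("Random", [17, 36, 45, 59, 66]), ("Oracle", [1, 2, 3, 4, 5])]:
        fpr, tpr = roc_from_ranks(ranks, n=100)
        print(name, round(roc_auc(fpr, tpr), 1))

Plotting tpr against fpr for each algorithm reproduces the qualitative behavior described above, whereas the single AUC number hides where on the curve each algorithm dominates, which is exactly the caution raised in the preceding paragraph.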

8.9. SUMMARY 261 Figure 8.9: Receiver operating characteristic 8.8.3 Common Mistakes A common mistake in benchmarking outlier analysis applications is that the area under the ROC curve is used repeatedly to tune the parameters of the outlier analysis algorithm. Note that such an approach implicitly uses the ground-truth labels for model construction, and it is, therefore, no longer an unsupervised algorithm. For problems such as clustering and outlier detection, it is not acceptable to use external labels in any way to tune the algorithm. In the particular case of outlier analysis, the accuracy can be drastically overestimated with such a tuning approach because the relative scores of a small number of outlier points have a very large influence on the ROC curve. 8.9 Summary The problem of outlier analysis is an important one because of its applicability to a variety of problem domains. The common models in outlier detection include probabilistic models, clustering models, distance-based models, density-based models, and information-theoretic models. Among these, distance models are the most popular, but are computationally more expensive. A number of speed-up tricks have been proposed to make these models much faster. Local variations of distance-based models generally tend to be more effective because of their sensitivity to the generative aspects of the underlying data. Information-theoretic models are closely related to conventional models, and explore a different aspect of the space-deviation trade-off than conventional models. Outlier validation is a difficult problem because of the challenges associated with the unsupervised nature of outlier detection, and the small sample-space problem. Typically, external validation criteria are used. The effectiveness of an outlier analysis algorithm is quantified with the use of the receiver operating characteristic curve that shows the trade- off between the false-positives and false-negatives for different thresholds on the outlier score. The area under this curve provides a quantitative evaluation of the outlier detection algorithm.

262 CHAPTER 8. OUTLIER ANALYSIS 8.10 Bibliographic Notes A number of books and surveys have been written on the problem of outlier analysis. The classical books [89, 259] in this area have mostly been written from the perspective of the statistics community. Most of these books were written before the wider adoption of database technology and are therefore not written from a computational perspective. More recently, this problem has been studied extensively by the computer science community. These works consider practical aspects of outlier detection corresponding to the cases where the data may be very large, or may have very high dimensionality. A recent book [5] also studies this area from the perspective of the computer science community. Numerous surveys have also been written that discuss the concept of outliers from different points of view, methodologies, or data types [61, 84, 131, 378, 380]. Among these, the survey by Chandola et al. [131] is the most recent and, arguably, the most comprehensive. It is an excellent review that covers the work on outlier detection quite broadly from the perspective of multiple communities. The Z-value test is used commonly in the statistical literature, and numerous extensions, such as the t-value test are available [118]. While this test makes the normal distribution assumption for large data sets, it has been used fairly extensively as a good heuristic even for data distributions that do not satisfy the normal distribution assumption. A variety of distance-based methods for outlier detection are proposed in [319, 436], and distance-correction methods for outlier detection are proposed in [109]. The determination of arbitrarily-shape clusters in the context of the LOF algorithm is explored in [487]. The agglomerative algorithm for discovering arbitrarily shaped neighborhoods, in the section on instance-specific Mahalanobis distance, is based on that approach. However, this method uses a connectivity-outlier factor, rather than the instance-specific Mahalanobis distance. The use of the Mahalanobis distance as a model for outlier detection was proposed in [468], though these methods are global, rather than local. A graph-based algorithm for local outlier detection is discussed in [257]. The ORCLUS algorithm also shows how to determine outliers in the presence of arbitrarily shaped clusters [22]. Methods for interpreting distance-based outliers were first proposed in [320]. A variety of information-theoretic methods for outlier detection are discussed in [68, 102, 160, 340, 472]. Many of these different models can be viewed in a complementary way to traditional models. For example, the work in [102] explores probabilistic methods in the context of information-theoretic models. The works in [68, 472] use code books of frequent-patterns for the modeling process. The connection between frequent patterns and compression has been explored in [470]. The use of measures such as entropy and Kolmogorov complexity for outlier analysis is explored in [340, 305]. The concept of coding complexity is explored in [129] in the context of set-based sequences. Evaluation methods for outlier analysis are essentially identical to the techniques used in information retrieval for understanding precision-recall trade-offs, or in classification for ROC curve analysis. A detailed discussion may be found in [204]. 8.11 Exercises 1. Suppose a particular random variable has mean 3 and standard deviation 2. Compute the Z-number for the values -1, 3. and 9. 
Which of these values can be considered the most extreme value?

8.11. EXERCISES 263 2. Define the Mahalanobis-based extreme value measure when the d dimensions are sta- tistically independent of one another, in terms of the dimension-specific standard deviations σ1 . . . σd. 3. Consider the four 2-dimensional data points (0, 0), (0, 1), (1, 0), and (100, 100). Plot them using mathematical software such as MATLAB. Which data point visually seems like an extreme value? Which data point is reported by the Mahalanobis measure as the strongest extreme value? Which data points are reported by depth-based measure? 4. Implement the EM algorithm for clustering, and use it to implement a computation of the probabilistic outlier scores. 5. Implement the Mahalanobis k-means algorithm, and use it to implement a compu- tation of the outlier score in terms of the local Mahalanobis distance to the closest cluster centroids. 6. Discuss the connection between the algorithms implemented in Exercises 4 and 5. 7. Discuss the advantages and disadvantages of clustering models over distance-based models. 8. Implement a naive distance-based outlier detection algorithm with no pruning. 9. What is the effect of the parameter k in k-nearest neighbor outlier detection? When do small values of k work well and when do larger values of k work well? 10. Design an outlier detection approach with the use of the NMF method of Chap. 6. 11. Discuss the relative effectiveness of pruning of distance-based algorithms in data sets which are (a) uniformly distributed, and (b) highly clustered with modest ambient noise and outliers. 12. Implement the LOF-algorithm for outlier detection. 13. Consider the set of 1-dimensional data points {1, 2, 2, 2, 2, 2, 6, 8, 10, 12, 14}. What are the data point(s) with the highest outlier score for a distance-based algorithm, using k = 2? What are the data points with highest outlier score using the LOF algorithm? Why the difference? 14. Implement the instance-specific Mahalanobis method for outlier detection. 15. Given a set of ground-truth labels, and outlier scores, implement a computer program to compute the ROC curve for a set of data points. 16. Use the objective function criteria of various outlier detection algorithms to design corresponding internal validity measures. Discuss the bias in these measures towards favoring specific algorithms. 17. Suppose that you construct a directed k-nearest neighbor graph from a data set. How can you use the degrees of the nodes to obtain an outlier score? What characteristics does this algorithm share with LOF?

Chapter 9 Outlier Analysis: Advanced Concepts “If everyone is thinking alike, then somebody isn’t thinking.”—George S. Patton 9.1 Introduction Many scenarios for outlier analysis cannot be addressed with the use of the techniques discussed in the previous chapter. For example, the data type has a critical impact on the outlier detection algorithm. In order to use an outlier detection algorithm on categorical data, it may be necessary to change the distance function or the family of distributions used in expectation–maximization (EM) algorithms. In many cases, these changes are exactly analogous to those required in the context of the clustering problem. Other cases are more challenging. For example, when the data is very high dimensional, it is often difficult to apply outlier analysis because of the masking behavior of the noisy and irrelevant dimensions. In such cases, a new class of methods, referred to as subspace methods, needs to be used. In these methods, the outlier analysis is performed in lower dimensional projections of the data. In many cases, it is hard to discover these projections, and therefore results from multiple subspaces may need to be combined for better robustness. The combination of results from multiple models is more generally referred to as ensemble analysis. Ensemble analysis is also used for other data mining problems such as clustering and classification. In principle, ensemble analysis in outlier detection is analogous to that in data clustering or classification. However, in the case of outlier detection, ensemble analysis is especially challenging. This chapter will study the following three classes of challenging problems in outlier analysis: 1. Outlier detection in categorical data: Because outlier models use notions such as near- est neighbor computation and clustering, these models need to be adjusted to the data type at hand. This chapter will address the changes required to handle categorical data types. C. C. Aggarwal, Data Mining: The Textbook, DOI 10.1007/978-3-319-14142-8 9 265 c Springer International Publishing Switzerland 2015

2. High-dimensional data: This is a very challenging scenario for outlier detection because of the "curse-of-dimensionality." Many of the attributes are irrelevant and contribute to the errors in model construction. A common approach to address these issues is that of subspace outlier detection.

3. Outlier ensembles: In many cases, the robustness of an outlier detection algorithm can be improved with ensemble analysis. This chapter will study the fundamental principles of ensemble analysis for outlier detection.

Outlier analysis has numerous applications in a very wide variety of domains such as data cleaning, fraud detection, financial markets, intrusion detection, and law enforcement. This chapter will also study some of the more common applications of outlier analysis.

This chapter is organized as follows: Section 9.2 discusses outlier detection models for categorical data. The difficult case of high-dimensional data is discussed in Sect. 9.3. Outlier ensembles are studied in Sect. 9.4. A variety of applications of outlier detection are discussed in Sect. 9.5. Section 9.6 provides the summary.

9.2 Outlier Detection with Categorical Data

As in the case of other problems in data mining, the type of the underlying data has a significant impact on the specifics of the algorithm used for solving it. Outlier analysis is no exception. However, in the case of outlier detection, the changes required are relatively minor because, unlike clustering, many of the outlier detection algorithms (such as distance-based algorithms) use very simple definitions of outliers. These definitions can often be modified to work with categorical data with small modifications. In this section, some of the models discussed in the previous chapter will be revisited for categorical data.

9.2.1 Probabilistic Models

Probabilistic models can be modified easily to work with categorical data. A probabilistic model represents the data as a mixture of cluster components. Therefore, each component of the mixture needs to reflect a set of discrete attributes rather than numerical attributes. In other words, a generative mixture model of categorical data needs to be designed. Data points that do not fit this mixture model are reported as outliers.

The k components of the mixture model are denoted by G1 . . . Gk. The generative process uses the following two steps to generate each point in the d-dimensional data set D:

1. Select a mixture component with prior probability αi, where i ∈ {1 . . . k}.

2. If the rth component of the mixture was selected in the first step, then generate a data point from Gr.

The values of αi denote the prior probabilities. An example of a model for the mixture component is one in which the jth value of the ith attribute is generated by cluster m with probability pijm. The set of all model parameters is collectively denoted by the notation Θ.

Consider a data point X containing the attribute value indices j1 . . . jd, where the rth attribute takes on the value jr. Then, the value of the generative probability gm,Θ(X) of a data point from cluster m is given by the following expression:

g_{m,\Theta}(X) = \prod_{r=1}^{d} p_{r j_r m}.    (9.1)
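As an illustration, the following Python sketch shows how the generative probability of Eq. 9.1 and the resulting fit-based outlier score might be computed once the parameters pijm and αm have been estimated; the array names and shapes are illustrative assumptions, and the estimation itself is performed with the EM procedure described next.

```python
import numpy as np

def generative_probability(x, m, p):
    # Eq. 9.1: product over attributes of p[m][i][x[i]], where p[m][i][j] is the
    # (already estimated) probability that attribute i takes value j in cluster m
    return np.prod([p[m][i][x[i]] for i in range(len(x))])

def fit_score(x, alpha, p):
    # Sum of the component-wise fits alpha[m] * g_{m,Theta}(x); low values
    # indicate poorly fitting points, i.e., likely outliers (discussed below)
    return sum(alpha[m] * generative_probability(x, m, p) for m in range(len(alpha)))

# Hypothetical example: k = 2 clusters, d = 2 attributes, 3 values per attribute
alpha = [0.6, 0.4]                     # prior probabilities
p = np.full((2, 2, 3), 1.0 / 3.0)      # uniform parameters, for illustration only
print(fit_score([0, 2], alpha, p))     # fit of the point with value indices (0, 2)
```

In practice, the parameters would also be smoothed (e.g., with the Laplacian smoothing discussed below) to avoid zero probabilities.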

The fit probability of the data point to the rth component is given by αr · gr,Θ(X). Therefore, the sum of the fits over all components is given by

\sum_{r=1}^{k} \alpha_r \cdot g_{r,\Theta}(X).

This fit represents the likelihood of the data point being generated by the model. Therefore, it is used as the outlier score. However, in order to compute this fit value, one needs to estimate the parameters Θ. This is achieved with the EM algorithm.

The assignment (or posterior) probability P(Gm|X, Θ) for the mth cluster may be estimated as follows:

P(\mathcal{G}_m | X, \Theta) = \frac{\alpha_m \cdot g_{m,\Theta}(X)}{\sum_{r=1}^{k} \alpha_r \cdot g_{r,\Theta}(X)}.    (9.2)

This step provides a soft assignment probability of the data point to a cluster, and it corresponds to the E-step.

The soft-assignment probability is used to estimate the probability pijm. While estimating the parameters for cluster m, the weight of a data point is assumed to be equal to its assignment probability P(Gm|X, Θ) to cluster m. The value αm is estimated to be the average assignment probability to cluster m over all data points. For each cluster m, the weighted number wijm of records for which the ith attribute takes on the jth possible discrete value in cluster m is estimated. The value of wijm is estimated as the sum of the posterior probabilities P(Gm|X, Θ) for all records X in which the ith attribute takes on the jth value. Then, the value of the probability pijm may be estimated as follows:

p_{ijm} = \frac{w_{ijm}}{\sum_{X \in \mathcal{D}} P(\mathcal{G}_m | X, \Theta)}.    (9.3)

When the number of data points is small, the estimation of Eq. 9.3 can be difficult to perform in practice. In such cases, some of the attribute values may not appear in a cluster (or wijm ≈ 0). This situation can lead to poor parameter estimation, or overfitting. The Laplacian smoothing method is commonly used to address such ill-conditioned probabilities. Let mi be the number of distinct attribute values of categorical attribute i. In Laplacian smoothing, a small value β is added to the numerator, and mi · β is added to the denominator of Eq. 9.3. Here, β is a parameter that controls the level of smoothing. This form of smoothing is also sometimes applied in the estimation of the prior probabilities αi when the data sets are very small. This completes the description of the M-step. As in the case of numerical data, the E- and M-steps are iterated to convergence. The maximum likelihood fit value is reported as the outlier score.

9.2.2 Clustering and Distance-Based Methods

Most of the clustering- and distance-based methods can be generalized easily from numerical to categorical data. The two main modifications required are as follows:

1. Categorical data requires specialized clustering methods that are typically different from those of numerical data. This is discussed in detail in Chap. 7. Any of these models can be used to create the initial set of clusters. If a distance- or similarity-based clustering algorithm is used, then the same distance or similarity function should be used to compute the distance (similarity) of the candidate point to the cluster centroids.

2. The choice of the similarity function is important in the context of categorical data, whether a centroid-based algorithm is used or a raw, distance-based algorithm is used. The distance functions in Sect. 3.2.2 of Chap. 3 can be very useful for this task. The pruning tricks for distance-based algorithms are agnostic to the choice of distance
function and can therefore be generalized to this case. Many of the local methods, such as LOF, can be generalized to this case as well with the use of this modified definition of distances.

Therefore, clustering and distance-based methods can be generalized to the scenario of categorical data with relatively modest modifications.

9.2.3 Binary and Set-Valued Data

Binary data are a special kind of categorical data, which occur quite frequently in many real scenarios. The chapters on frequent pattern mining were based on this kind of data. Furthermore, both categorical and numerical data can always be converted to binary data. One common characteristic of this domain is that, while the number of attributes is large, the number of nonzero attribute values in a typical transaction is small.

Frequent pattern mining is used as a subroutine for the problem of outlier detection in these cases. The basic idea is that frequent patterns are much less likely to occur in outlier transactions. Therefore, one possible measure is to use the sum of all the supports of frequent patterns occurring in a particular transaction. The total sum is normalized by dividing with the number of frequent patterns. This provides an outlier score for the transaction. Strictly speaking, the normalization can be omitted from the final score, because it is the same across all transactions.

Let D be a transaction database containing the transactions denoted by T1 . . . TN. Let s(X, D) represent the support of itemset X in D. Therefore, if FPS(D, sm) represents the set of frequent patterns in the database D at minimum support level sm, then the frequent pattern outlier factor FPOF(Ti) of a transaction Ti ∈ D at minimum support sm is defined as follows:

FPOF(T_i) = \frac{\sum_{X \in FPS(\mathcal{D}, s_m),\, X \subseteq T_i} s(X, \mathcal{D})}{|FPS(\mathcal{D}, s_m)|}.    (9.4)

Intuitively, a transaction containing a large number of frequent patterns with high support will have a high value of FPOF(Ti). Such a transaction is unlikely to be an outlier because it reflects the major patterns in the data. Therefore, lower scores indicate a greater propensity to be an outlier.

Such an approach is analogous to using nonmembership of data points in clusters to define outliers, rather than determining the deviation or sparsity level of the transactions in a more direct way. The problem with this approach is that it may not be able to distinguish between truly isolated data points and ambient noise. This is because neither of these kinds of data points will be likely to contain many frequent patterns. As a result, such an approach may sometimes not be able to effectively determine the strongest anomalies in the data.

9.3 High-Dimensional Outlier Detection

High-dimensional outlier detection can be particularly challenging because of the varying importance of the different attributes with data locality. The idea is that the causality of an anomaly can be typically perceived in only a small subset of the dimensions. The remaining dimensions are irrelevant and only add noise to the anomaly-detection process. Furthermore, different subsets of dimensions may be relevant to different anomalies. As a result, full-dimensional analysis often does not properly expose the outliers in high-dimensional data.

This concept is best understood with a motivating example. In Fig. 9.1, four different 2-dimensional views of a hypothetical data set have been illustrated. Each of these views
corresponds to a disjoint set of dimensions. As a result, these views look very different from one another. Data point A is exposed as an outlier in the first view of the data set, whereas point B is exposed as an outlier in the fourth view of the data set. However, data points A and B are not exposed as outliers in the second and third views of the data set. These views are therefore not very useful from the perspective of measuring the outlierness of either A or B. Furthermore, from the perspective of any specific data point (e.g., point A), three of the four views are irrelevant. Therefore, the outliers are lost in the random distributions within these views when the distance measurements are performed in full dimensionality. In many scenarios, the proportion of irrelevant views (features) may increase with dimensionality. In such cases, outliers are lost in low-dimensional subspaces of the data because of irrelevant attributes.

Figure 9.1: Impact of irrelevant attributes on outlier analysis [four scatter plots of disjoint feature pairs: (a) View 1, point A is an outlier; (b) View 2, no outliers; (c) View 3, no outliers; (d) View 4, point B is an outlier]

The physical interpretation of this situation is quite clear in many application-specific scenarios. For example, consider a credit card fraud application in which different features such as a customer's purchase location, frequency of purchase, and size of purchase are being tracked over time. In the case of one particular customer, the anomaly may be tracked by examining the location and purchase frequency attributes. In another anomalous customer,
the size and timing of the purchase may be relevant. Therefore, all the features are useful from a global perspective, but only a small subset of features is useful from a local perspective.

The major problem here is that the dilution effects of the vast number of "normally noisy" dimensions will make the detection of outliers difficult. In other words, outliers are lost in low-dimensional subspaces when full-dimensional analysis is used, because of the masking and dilution effects of the noise in full-dimensional computations. A similar problem is also discussed in Chap. 7 in the context of data clustering.

In the case of data clustering, this problem is solved by defining subspace-specific clusters, or projected clusters. This approach also provides a natural path for outlier analysis in high dimensions. In other words, an outlier can now be defined by associating it with one or more subspaces that are specific to that outlier. While there is a clear analogy between the problems of subspace clustering and subspace outlier detection, the difficulty levels of the two problems are not even remotely similar.

Clustering is, after all, about the determination of frequent groups of data points, whereas outliers are about the determination of rare groups of data points. As a rule, statistical learning methods find it much easier to determine frequent characteristics than rare characteristics of a data set. This problem is further magnified in high dimensionality. The number of possible subspaces of a d-dimensional data point is 2^d. Of these, only a small fraction will expose the outlier behavior of individual data points. In the case of clustering, dense subspaces can be easily determined by aggregate statistical analysis of the data points. This is not true of outlier detection, where the subspaces need to be explicitly explored in a way that is specific to the individual data points.

An effective outlier detection method would need to search the data points and dimensions in an integrated way to reveal the most relevant outliers. This is because different subsets of dimensions may be relevant to different outliers, as is evident from the example of Fig. 9.1. The integration of point and subspace exploration leads to a further expansion in the number of possibilities that need to be examined for outlier analysis. This chapter will explore two methods for subspace exploration, though many other methods are pointed out in the bibliographic notes. These methods are as follows:

1. Grid-based rare subspace exploration: In this case, rare subspaces of the data are explored after discretizing the data into a grid-like structure.

2. Random subspace sampling: In this case, subspaces of the data are sampled to discover the most relevant outliers.

The following subsections will discuss each of these methods in detail.

9.3.1 Grid-Based Rare Subspace Exploration

Projected outliers are determined by finding localized regions of the data in low-dimensional space that have abnormally low density. Because this is a density-based approach, a grid-based technique is the method of choice. To do so, nonempty grid-based subspaces need to be identified whose density is very low. This is complementary to the definitions that were used for subspace clustering methods such as CLIQUE. In those cases, frequent subspaces were reported.
It should be pointed out that the determination of frequent subspaces is a much simpler problem than the determination of rare subspaces, simply because there are many more rare subspaces than there are dense ones in a typical data set. This results in a combinatorial explosion of the number of possibilities, and level-wise algorithms, such as those used in CLIQUE, are no longer practical avenues for finding rare subspaces. The first step in all these models is to determine a proper statistical definition of rare lower dimensional projections.

9.3.1.1 Modeling Abnormal Lower Dimensional Projections

An abnormal lower dimensional projection is one in which the density of the data is exceptionally lower than the average. The concept of Z-number, introduced in the previous chapter, comes in very handy in this respect.

The first step is to discretize the data. Each attribute of the data is divided into p ranges. These ranges are constructed on an equi-depth basis. In other words, each range contains a fraction f = 1/p of the records. When k different intervals from different dimensions are selected, it creates a grid cell of dimensionality k. The expected fraction of the data records in this grid cell is equal to f^k, if the attributes are statistically independent. In practice, the data are not statistically independent, and, therefore, the distribution of the points in the cube differs significantly from the expected value. Deviations that correspond to rare regions in the data are of particular interest.

Let D be a database with n records and dimensionality d. Under the independence assumption mentioned earlier, the presence or absence of any point in a k-dimensional cube is a Bernoulli random variable with probability f^k. The expected number and standard deviation of the points in a k-dimensional cube are given by n \cdot f^k and \sqrt{n \cdot f^k \cdot (1 - f^k)}, respectively. When the value of n is large, the number of data points in a cube is a random variable that is approximated by a normal distribution, with the aforementioned mean and standard deviation.

Let R represent a cube in k-dimensional space, and nR represent the number of data points inside this cube. Then, the sparsity coefficient S(R) of the cube R can be computed as follows:

S(R) = \frac{n_R - n \cdot f^k}{\sqrt{n \cdot f^k \cdot (1 - f^k)}}.    (9.5)

A negative value of the sparsity coefficient indicates that the presence of data points in the cube is significantly lower than expected. Because nR is assumed to fit a normal distribution, the normal distribution can be used to quantify the probabilistic level of significance of its deviation. This is only a heuristic approximation because the assumption of a normal distribution is generally not true in practice.

9.3.1.2 Grid Search for Subspace Outliers

As discussed earlier, level-wise algorithms are not practical for rare subspace discovery. Another challenge is that lower dimensional projections provide no information about the statistical behavior of combinations of dimensions, especially in the context of subspace analysis. For example, consider a data set containing student scores in an examination. Each attribute represents a score in a particular subject. The scores in some of the subjects are likely to be highly correlated. For example, a student scoring well in a course on probability theory would likely also score well in a course on statistics. However, it would be extremely uncommon to find a student who scored well in one, but not the other. The problem here is that the individual dimensions provide no information about the combination of the dimensions. The rare nature of outliers makes such unpredictable scenarios common. This lack of predictability about the behavior of combinations of dimensions necessitates the use of evolutionary (genetic) algorithms to explore the search space.

Genetic algorithms mimic the process of biological evolution to solve optimization problems.
Evolution is, after all, nature’s great optimization experiment in which only the fittest individuals survive, thereby leading to more “optimal” organisms. Correspondingly, every
solution to an optimization problem can be disguised as an individual in an evolutionary system. The measure of fitness of this "individual" is equal to the objective function value of the corresponding solution. The competing population with an individual is the group of other solutions to the optimization problem. In general, fitter organisms are more likely to survive and multiply in the population. This corresponds to the selection operator. The other two operators that are commonly used are crossover and mutation. Each feasible solution is encoded in the form of a string and is considered the chromosome representation of the solution. This is also referred to as encoding.

Thus, each string is a solution that is associated with a particular objective function value. In genetic algorithms, this objective function value is also referred to as the fitness function. The idea here is that the selection operator should favor strings with better fitness (objective function value). This is similar to hill-climbing algorithms, except that genetic algorithms work with a population of solutions instead of a single one. Furthermore, instead of only checking the neighborhood (as in hill-climbing methods), genetic algorithms use crossover and mutation operators to explore a more complex concept of a neighborhood.

Therefore, evolutionary algorithms are set up to repeat the process of selection, crossover, and mutation to improve the fitness (objective) function value. As the process of evolution progresses, all the individuals in the population typically improve in fitness and also become more similar to each other. The convergence of a particular position in the string is defined as the stage at which a predefined fraction of the population has the same value for that gene. The population is said to have converged when all positions in the string representation have converged.

So, how does all this map to finding rare patterns? The relevant localized subspace patterns can be easily represented as strings of length d, where d denotes the dimensionality of the data. Each position in the string represents the index of an equi-depth range. Therefore, each position in the string can take on any value from 1 through p, where p is the granularity of the discretization. It can also take on the value ∗ ("don't care"), which indicates that the dimension is not included in the subspace corresponding to that string. The fitness of a string is based on the sparsity coefficient discussed earlier. Highly negative values of the objective function (sparsity coefficient) indicate greater fitness because one is searching for sparse subspaces. The only caveat is that the subspaces need to be nonempty to be considered fit. This is because empty subspaces are not useful for finding anomalous data points.

Consider a 4-dimensional problem in which the data have been discretized into ten ranges. Then, the string will be of length 4, and each position can take on one of 11 possible values (including the "∗"). Therefore, there are a total of 11^4 strings, each of which corresponds to a subspace. For example, the string ∗2∗6 is a 2-dimensional subspace on the second and fourth dimensions.

The evolutionary algorithm uses the dimensionality of the projection k as an input parameter. Therefore, for a d-dimensional data set, the string of length d will contain k specified positions and (d − k) "don't care" positions.
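To make the encoding concrete, the following sketch shows one plausible way of evaluating the sparsity coefficient of Eq. 9.5 for a candidate string such as ∗2∗6 over data that has already been discretized into equi-depth ranges. The integer-array representation of the discretized data and the restriction to single-digit range indices (p ≤ 9) are simplifications for illustration, not part of the method itself.

```python
import numpy as np

def sparsity_coefficient(ranges, candidate, p):
    # ranges: n x d integer array; ranges[t, i] is the equi-depth range index
    # (1..p) of attribute i for record t. candidate: string such as "*2*6",
    # where '*' marks a dimension that is not included in the subspace.
    n, d = ranges.shape
    dims = [i for i, c in enumerate(candidate) if c != '*']
    k = len(dims)
    in_cell = np.all([ranges[:, i] == int(candidate[i]) for i in dims], axis=0)
    n_R = int(np.sum(in_cell))            # number of points inside the grid cell
    f = 1.0 / p
    expected = n * f ** k                 # expected count under independence
    std_dev = np.sqrt(n * f ** k * (1 - f ** k))
    return (n_R - expected) / std_dev     # Eq. 9.5; very negative => sparse cell

# Hypothetical example: 1000 records, 4 attributes, p = 9 ranges per attribute
rng = np.random.default_rng(0)
ranges = rng.integers(1, 10, size=(1000, 4))
print(sparsity_coefficient(ranges, "*2*6", p=9))
```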
The fitness for the corresponding solution may be computed using the sparsity coefficient discussed earlier. The evolutionary search technique starts with a population of Q random solutions and iteratively uses the processes of selection, crossover, and mutation to perform a combination of hill climbing, solution recombination, and random search over the space of possible projections. The process is continued until the population converges, based on the criterion discussed above. At each stage of the algorithm, the m best projection solutions (most negative sparsity coefficients) are kept track of. At the end of the algorithm, these solutions are reported as the best projections in the data. The following operators are defined for selection, crossover, and mutation:

1. Selection: This step achieves the hill climbing required by the method, though quite differently from traditional hill-climbing methods. The copies of a solution are replicated by ordering them by rank and biasing them in the population in favor of higher-ranked solutions. This is referred to as rank selection. This results in a bias in favor of more optimal solutions.

2. Crossover: The crossover technique is key to the success of the algorithm because it implicitly defines the subspace exploration process. One solution is to use a uniform two-point crossover to create the recombinant children strings. It can be viewed as the process of combining the characteristics of two solutions to create two new recombinant solutions. Traditional hill-climbing methods only test an adjacent solution for a single string. The recombinant crossover approach examines a more complex neighborhood by combining the characteristics of two different strings to yield two new neighborhood points. The two-point crossover mechanism works by determining a point in the string at random, called the crossover point, and exchanging the segments to the right of this point. This is equivalent to creating two new subspaces, by sampling subspaces from both solutions and combining them. However, such a blind recombination process may create poor solutions too often. Therefore, an optimized crossover mechanism is defined. In this mechanism, it is guaranteed that both children solutions correspond to a feasible k-dimensional projection, and the children typically have high fitness values. This is achieved by examining a subset of the different possibilities for recombination and picking the best among them.

3. Mutation: In this case, random positions in the string are flipped with a predefined mutation probability. Care must be taken to ensure that the dimensionality of the projection does not change after the flipping process. This is an exact analogue of traditional hill climbing, except that it is done to a population of solutions to ensure robustness. Thus, while genetic algorithms try to achieve the same goals as hill climbing, they do so in a different way to achieve better solutions.

At termination, the algorithm is followed by a postprocessing phase. In the postprocessing phase, all data points containing the abnormal projections are reported by the algorithm as the outliers. The approach also provides the relevant projections that provide the causality (or intensional knowledge) for the outlier behavior of a data point. Thus, this approach also has a high degree of interpretability in terms of providing the reasoning for why a data point should be considered an outlier.

9.3.2 Random Subspace Sampling

The determination of the rare subspaces is a very difficult task, as is evident from the unusual genetic algorithm design discussed in the previous section. A different way of addressing the same problem is to explore many possible subspaces and examine whether at least one of them contains outliers. One such well-known approach is feature bagging. The broad approach is to repeatedly apply the following two steps:

1. Randomly select between d/2 and d features from the underlying data set in iteration t to create a data set Dt in the tth iteration.

2. Apply the outlier detection algorithm Ot on the data set Dt to create score vectors St.
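A minimal sketch of this loop is shown below, using scikit-learn's LocalOutlierFactor as the base detector Ot and a simple cumulative sum of the scores as the combination function; the library choice, the number of iterations, and the neighborhood size are illustrative assumptions rather than part of the method's definition.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

def feature_bagging_scores(X, n_iter=20, seed=0):
    # X: n x d numerical data matrix; returns one combined outlier score per point
    rng = np.random.default_rng(seed)
    n, d = X.shape
    total = np.zeros(n)
    for _ in range(n_iter):
        # Step 1: randomly select between d/2 and d features
        r = rng.integers(d // 2, d + 1)
        dims = rng.choice(d, size=r, replace=False)
        # Step 2: apply the base detector to the projected data
        lof = LocalOutlierFactor(n_neighbors=10)
        lof.fit(X[:, dims])
        total += -lof.negative_outlier_factor_   # larger => more outlier-like
    return total                                 # cumulative-sum combination

# Hypothetical usage: random data with one planted outlier at index 0
X = np.random.default_rng(1).normal(size=(300, 8))
X[0] += 6.0
print(np.argsort(-feature_bagging_scores(X))[:5])
```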

Virtually any algorithm Ot can be used to determine the outlier score, as long as the scores across different instantiations are comparable. From this perspective, the LOF algorithm is the ideal algorithm to use because of its normalized scores. At the end of the process, the outlier scores from the different algorithms need to be combined. Two distinct methods are used to combine the different subspaces:

1. Breadth-first approach: The ranking of the data points returned by the different algorithms is used for combination purposes. The top-ranked outliers over all the different algorithm executions are ranked first, followed by the second-ranked outliers (with repetitions removed), and so on. Minor variations could exist because of tie-breaking between the outliers within a particular rank. This is equivalent to using the best rank for each data point over all executions as the final outlier score.

2. Cumulative sum approach: The outlier scores over the different algorithm executions are summed up. The top-ranked outliers are reported on this basis.

At first sight, it seems that the random subspace sampling approach does not attempt to optimize the discovery of subspaces for finding rare instances at all. However, the idea here is that it is often hard to discover the rare subspaces anyway, even with the use of heuristic optimization methods. The robustness resulting from multiple subspace sampling is clearly a very desirable quality of the approach. Such methods are common in high-dimensional analysis where multiple subspaces are sampled for greater robustness. This is related to the field of ensemble analysis, which will be discussed in the next section.

9.4 Outlier Ensembles

Several algorithms in outlier analysis, such as the high-dimensional methods discussed in the previous section, combine the scores from different executions of outlier detection algorithms. These algorithms can be viewed as different forms of ensemble analysis. Some examples are enumerated below:

1. Parameter tuning in LOF: Parameter tuning in the LOF algorithm (cf. Sect. 8.5.2.1 of Chap. 8) can be viewed as a form of ensemble analysis. This is because the algorithm is executed over different values of the neighborhood size k, and the highest LOF score over each data point is selected. As a result, a more robust ensemble model is constructed. In fact, many parameter-tuning algorithms in outlier analysis can be viewed as ensemble methods.

2. Random subspace sampling: The random subspace sampling method applies the approach to multiple random subspaces of the data to determine the outlier scores as a combination function of the original scores. Even the evolutionary high-dimensional outlier detection algorithm can be shown to be an ensemble with a maximization combination function.

Ensemble analysis is a popular approach in many data mining problems such as clustering, classification, and outlier detection. On the other hand, ensemble analysis is not quite as well studied in the context of outlier analysis. Furthermore, ensemble analysis is particularly important in the context of outlier analysis because of the rare nature of outliers, and a corresponding possibility of overfitting. A typical outlier ensemble contains a number of different components:

1. Model components: These are the individual methodologies or algorithms that are integrated to create an ensemble. For example, a random subspace sampling method combines many LOF algorithms that are each applied to different subspace projections.

2. Normalization: Different methods may create outlier scores on very different scales. In some cases, the scores may be in ascending order. In others, the scores may be in descending order. In such cases, normalization is important for meaningfully combining the scores, so that the scores from different components are roughly comparable.

3. Model combination: The combination process refers to the approach used to integrate the outlier scores from different components. For example, in random subspace sampling, the cumulative outlier score over different ensemble components is reported. Other combination functions include the use of the maximum score, or the highest rank of the score among all ensemble components. It is assumed that higher ranks imply greater propensity to be an outlier. Therefore, the highest rank is similar to the maximum outlier score, except that the rank is used instead of the raw score. The highest-rank combination function is also used by random subspace sampling.

Outlier ensemble methods can be categorized into different types, depending on the dependencies among different components and the process of selecting a specific model. The following section will study some of the common methods.

9.4.1 Categorization by Component Independence

This categorization examines whether or not the components are developed independently.

1. In sequential ensembles, a given algorithm or set of algorithms is applied sequentially, so that future applications of the algorithm are influenced by previous applications. This influence may be realized in terms of either modifications of the base data for analysis or in terms of the specific choices of the algorithms. The final result is either a weighted combination of the results from the different executions or the final result of the last application of the outlier algorithm component. The typical scenario for sequential ensembles is that of model refinement, where successive iterations continue to improve a particular base model.

2. In independent ensembles, different algorithms, or different instantiations of the same algorithm, are independently applied to either the complete data or portions of the data. In other words, the various ensemble components are independent of the results of each other's executions.

In this section, both types of ensembles will be studied in detail.

9.4.1.1 Sequential Ensembles

In sequential ensembles, one or more outlier detection algorithms are applied sequentially to either all or portions of the data. The idea is that the result of a particular algorithmic execution may provide insights that may help refine future executions. Thus, depending upon the approach, either the data set or the algorithm may be changed in sequential executions. For example, consider the case where a clustering model is created for outlier detection. Because outliers interfere with the robust cluster generation, one possibility would be to apply the method to a successively refined data set after removing the obvious outliers through the insights gained in earlier iterations of the ensemble. Typically, the quality of the
outliers in later iterations will be better. This also facilitates a more robust outlier detection model. Thus, the sequential nature of the approach is used for successive refinement. If desired, this approach can either be applied for a fixed number of times or used to converge to a more robust solution. The broad framework of a sequential ensemble approach is provided in Fig. 9.2.

Algorithm SequentialEnsemble(Data Set: D, Base Algorithms: A1 . . . Ar)
begin
  j = 1;
  repeat
    Pick an algorithm Qj ∈ {A1 . . . Ar} based on results from past executions;
    Create a new data set fj(D) from D based on results from past executions;
    Apply Qj to fj(D);
    j = j + 1;
  until(termination);
  return outliers based on combinations of results from previous executions;
end

Figure 9.2: Sequential ensemble framework

It is instructive to examine the execution of the algorithm in Fig. 9.2 in some detail. In each iteration, a successively refined algorithm may be used on a refined data set, based on the results from previous executions. The function fj(·) is used to create a refinement of the data, which could correspond to data subset selection, attribute-subset selection, or a generic data transformation method. The generality of the aforementioned description ensures that many natural variations of the method can be explored with the use of this ensemble. For example, while the algorithm of Fig. 9.2 assumes that many different algorithms A1 . . . Ar are available, it is possible to select only one of them and use it on successive modifications of the data. Sequential ensembles are often hard to use effectively in outlier analysis because of the lack of available ground truth in interpreting the intermediate results. In many cases, the distribution of the outlier scores is used as a proxy for these insights.

9.4.1.2 Independent Ensembles

In independent ensembles, different instantiations of the algorithm or different portions of the data are executed independently for outlier analysis. Alternatively, the same algorithm may be applied, but with either a different initialization, parameter set, or even random seed in the case of a randomized algorithm. The LOF method, the high-dimensional evolutionary exploration method, and the random subspace sampling method discussed earlier are all examples of independent ensembles. Independent ensembles are more common in outlier analysis than sequential ensembles. In this case, the combination function requires careful normalization, especially if the different components of the ensemble are heterogeneous. A general-purpose description of independent ensemble algorithms is provided in the pseudocode description of Fig. 9.3.

Algorithm IndependentEnsemble(Data Set: D, Base Algorithms: A1 . . . Ar)
begin
  j = 1;
  repeat
    Pick an algorithm Qj ∈ {A1 . . . Ar};
    Create a new data set fj(D) from D;
    Apply Qj to fj(D);
    j = j + 1;
  until(termination);
  return outliers based on combination of results from previous executions;
end

Figure 9.3: Independent ensemble framework

The broad principle of independent ensembles is that different ways of looking at the same problem provide more robust results that are not dependent on specific artifacts of a particular algorithm or data set. Independent ensembles are used commonly for parameter tuning of outlier detection algorithms. Another application is that of exploring outlier scores over multiple subspaces, and then providing the best result.

9.4.2 Categorization by Constituent Components

A second way of categorizing ensemble analysis algorithms is on the basis of their constituent components. In general, these two ways of categorization are orthogonal to one another, and an ensemble algorithm may be any of the four combinations created by these two forms of categorization. Consider the case of parameter tuning in LOF and the case of subspace sampling in the feature bagging method. In the first case, each model is an application of the LOF model with a different parameter choice. Therefore, each component can itself be viewed as an outlier analysis model. On the other hand, in the random subspace method, the same algorithm is applied to a different selection (projection) of the data. In principle, it is possible to create an ensemble with both types of components, though this is rarely done in practice. Therefore, the categorization by constituent components leads to either model-centered ensembles or data-centered ensembles.

9.4.2.1 Model-Centered Ensembles

Model-centered ensembles combine the outlier scores from different models built on the same data set. The example of parameter tuning for LOF can be considered a model-centered ensemble. Thus, it is an independent ensemble based on one form of categorization, and a model-centered ensemble based on another.

One advantage of using LOF in an ensemble algorithm is that the scores are roughly comparable to one another. This may not be true for an arbitrary algorithm. For example, if the raw k-nearest neighbor distance were used, the parameter-tuning ensemble would always favor larger values of k when using a combination function that picks the maximum

