The intuition is that it is better to contract heavy edges because they are less likely to be part of an optimal partitioning.

3. Heavy clique matching: The contraction of densely connected sets of nodes in the graph will maximize the number of collapsed edges. This method tracks the weight vi of node i, which corresponds to the number of contracted nodes it represents. Furthermore, the notation si denotes the sum of the weights of the collapsed edges at node i (or its precursors) in previous contraction phases. Note that if the contracted node i represents a clique in the original graph, then si will approach vi · (vi − 1)/2. Because it is desirable to contract dense components, one must try to ensure that the value of si resulting from the contraction approaches its upper limit. This is achieved by computing the edge density μij ∈ (0, 1) of edge (i, j):

μij = 2 · (si + sj + wij) / [(vi + vj) · (vi + vj − 1)]        (19.23)

When nodes across high-density edges are contracted, they typically correspond to cliques in the original graph G = G0, if it was unweighted. Even for weighted graphs, the use of high edge density is generally quite effective. The nodes of the graph are visited in random order. For each node, its highest density unmatched neighbor is selected for matching. Unlike heavy edge matching, the heavy clique matching approach is not myopic to the contractions that have occurred in previous phases of the algorithm.

The multilevel scheme is effective because of its hierarchical approach, where the early clustering of coarsened graphs ensures a good initial global structure to the bisection. In other words, key components of the graph are assigned to the appropriate partitions early on, in the form of coarsened nodes. This partition is then successively improved in refinement phases. Such an approach avoids local optima more effectively because of its “big picture” approach to clustering.

19.3.4 Spectral Clustering

It is assumed that the nodes are unweighted, though the edge (i, j) is associated with the weight wij. The n × n matrix of weights is denoted by W. The spectral method uses a graph embedding approach, so that the local clustering structure of the network is preserved by the embedding of the nodes into multidimensional space. The idea is to create a multidimensional representation of the graph so that a standard k-means algorithm can be used on the transformed representation.

The simpler problem of mapping the nodes onto a 1-dimensional space will be discussed first. The generalization to the k-dimensional case is relatively straightforward. We would like to map the nodes in N into a set of 1-dimensional real values y1 . . . yn on a line, so that the distances between these points reflect the connectivity among the nodes. Therefore, it is undesirable for nodes that are connected with high-weight edges to be mapped onto distant points on this line. This can be achieved by determining values of yi, for which the following objective function O is minimized:

O = Σ_{i=1}^{n} Σ_{j=1}^{n} wij · (yi − yj)²        (19.24)

This objective function penalizes the distances between yi and yj with weight proportional to wij. Therefore, when wij is very large, the data points yi and yj will be more likely to
be closer to one another in the embedded space. The objective function O can be rewritten in terms of the Laplacian matrix L of the weight matrix W. The Laplacian matrix L is defined as Λ − W, where Λ is a diagonal matrix satisfying Λii = Σ_{j=1}^{n} wij. Let the n-dimensional column vector of embedded values be denoted by y = (y1 . . . yn)T. It can be shown, after some algebraic rearrangement of Eq. 19.24, that the objective function O can be rewritten in terms of the Laplacian matrix:

O = 2yT Ly        (19.25)

The matrix L is positive semidefinite with nonnegative eigenvalues because the sum-of-squares objective function O is always nonnegative. We need to incorporate a scaling constraint to ensure that the trivial value of yi = 0 for all i is not selected by the optimization solution. A possible scaling constraint is as follows:

yT Λy = 1        (19.26)

The matrix Λ is incorporated in the constraint of Eq. 19.26 to achieve normalization, so that the resulting clusters are more balanced across partitions. If Λ is not used in the constraint, the result is referred to as unnormalized spectral clustering. In practice, the effect of this normalization is that low-degree nodes tend to clearly “pick sides” with either large positive or large negative values of yi, whereas very high-degree nodes, which might also be hub nodes, will be embedded closer to central regions near the origin (see Exercise 7). Note that each diagonal entry Λii, which is the sum of the weights of the edges incident on node i, can be viewed as the local density of the network at node i. It can also be shown that incorporating Λ in the constraint approximates an unnormalized embedding in which edge weights wij = wji have been divided by the geometric average √(Λii · Λjj) of the local densities at their end points (see Exercise 8). As discussed in Chap. 3, normalizing distance or similarity values with local densities is often helpful in obtaining high-quality results that are more accurate in their local context.

This constrained optimization formulation can be solved by setting the gradient of its Lagrangian relaxation yT Ly − λ(yT Λy − 1) to 0. It can be shown that the resulting optimization condition is Λ−1Ly = λy, where λ is the Lagrangian parameter. In other words, y is an eigenvector of Λ−1L and λ is an eigenvalue. Furthermore, this optimization condition can be used to easily show that the objective function O = 2yT Ly evaluates to twice the eigenvalue λ for an eigenvector y satisfying this condition. Therefore, among the alternative eigenvector solutions for y, the optimal solution is the eigenvector with the smallest nontrivial eigenvalue of the normalized Laplacian Λ−1L. The smallest eigenvalue of Λ−1L is always 0, and it corresponds to the trivial solution where the node embedding y is proportional to (1, 1, . . . 1)T. Such a trivial 1-dimensional embedding corresponds to mapping every node to the same point. This trivial eigenvector is noninformative. Therefore, it can be discarded, and it is not used in the analysis. The second smallest eigenvector then provides an optimal solution that is more informative.

This model can be generalized to a k-dimensional embedding by setting up an analogous optimization formulation with its decision variables as an n × k matrix Y with k column vectors Y = [y1 . . . yk] representing each dimension of the embedding. This optimization formulation minimizes the trace of the k × k matrix Y T LY subject to the normalization constraints Y T ΛY = I.
Because of the presence of Λ in the constraint, the columns of Y will not necessarily be orthogonal. The optimal solutions for these k column vectors can be shown to be proportional to the successive directions corresponding to the (not necessarily orthogonal) right eigenvectors of the asymmetric matrix Λ−1L with increasing eigenvalues. After discarding the first trivial eigenvector e1 with eigenvalue λ1 = 0, this results in a set
of k eigenvectors e2, e3 . . . ek+1, with corresponding eigenvalues λ2 ≤ λ3 . . . ≤ λk+1. Because k eigenvectors were selected, this approach creates an n × k matrix Dk = Y, corresponding to a k-dimensional embedding of each of the n nodes. Note that even though the normalization constraint Y T ΛY = I will not result in columns of Dk having an L2-norm of 1, each column (eigenvector) of Dk is scaled to an L2-norm of 1 as a post-processing4 step. Because of this column scaling, the n × k matrix Dk does not exactly reflect the original optimization formulation in terms of Y. The resulting k-dimensional embedding preserves the clustering structure of the nodes because the optimization formulation of Y tries to minimize distances between highly connected nodes. Therefore, any multidimensional clustering algorithm discussed in Chap. 6, such as k-means, can be applied to this embedded representation to generate the clusters on the nodes. This formulation is also sometimes referred to as the random walk version of spectral clustering because of an interpretation in terms of random walks. It is noteworthy that the small eigenvectors of the normalized Laplacian Λ−1L are the same as the large eigenvectors of the stochastic transition matrix Λ−1W (see Exercise 15).

An equivalent way of setting up the spectral clustering model is to use the related vector of decision variables z = √Λ y in the optimization formulation of Eqs. 19.25 and 19.26. This related version is referred to as the symmetric version of the spectral clustering model, although it is different from the random walk version only in terms of the scaling of decision variables. By setting z = √Λ y, it can be shown that the earlier formulation is equivalent to optimizing zT Λ−1/2LΛ−1/2z subject to zT z = 1. We determine the smallest k (orthogonal) eigenvectors of the symmetric normalized Laplacian Λ−1/2LΛ−1/2, excluding the first. Each eigenvector of this matrix can also be (proportionally) obtained by pre-multiplying the aforementioned solution Y of the random walk formulation with the diagonal matrix √Λ. This relationship also reflects the relationship between z and y. The eigenvalues are the same in both cases. For example, the first eigenvector with eigenvalue 0 will no longer be a vector of 1s, but its entries will be proportional to (√Λ11 . . . √Λnn)T. Because of this differential scaling of various nodes, high-degree nodes will tend to have larger (absolute) coordinate values in the symmetric version. By selecting the smallest k eigenvectors, one can generate an n × k multidimensional representation Dk of the entire set of n nodes. Just as the random walk version scales each column of Dk to unit norm in the final step, the symmetric version scales each row of Dk to unit norm. The final step of row scaling is a heuristic enhancement to adjust for the differential scaling of nodes with various degrees, and it is not a part of the optimization formulation. Interestingly, even if the rows of the random walk solution Y had been scaled to unit norm (instead of scaling the columns to unit norm), exactly the same solution would be obtained as that obtained by scaling the rows of the symmetric solution Z to unit norm (see Exercise 13).

Although the two different ways of performing spectral clustering are equivalent in terms of the optimization problem solved, there are differences in terms of the heuristic scaling adjustments. The scaling relationships are illustrated in Fig. 19.8.
[Figure 19.8: Scaling relationships between random walk and symmetric versions of spectral clustering. Random walk version: minimize trace(Y T LY) subject to Y T ΛY = I; neither the rows nor the columns of the matrix Y have unit norm. Symmetric version: minimize trace(ZT Λ−1/2LΛ−1/2Z) subject to ZT Z = I; the rows of the matrix Z do not have unit norm.]

It is evident from Fig. 19.8 that the main practical difference between the two methods is regulated only by the heuristic scaling used in the final phase, rather than their respective optimization models. Because of the scaling variations, the clusters obtained in the two cases will not be exactly the same. The relative quality will depend on the data set at hand. These optimization problems can also be understood as relaxations of integer-programming formulations of balanced minimum-cut problems. However, the minimum-cut explanation does not intuitively generalize to the relaxed version of the problem because eigenvectors have both positive and negative components.

4 In practice, the unit eigenvectors of Λ−1L can be directly computed, and therefore an explicit post-processing step is not required.
19.3.4.1 Important Observations and Intuitions

A few observations are noteworthy about the relationships between spectral clustering, PageRank, and eigenvector analysis:

1. Normalized random walk Laplacian: The smallest right eigenvectors of Λ−1L = Λ−1(Λ − W) = I − P are used for the random walk embedding, where P is the stochastic transition matrix of the graph. The smallest right eigenvectors of I − P are the same as the largest right eigenvectors of P. The largest right eigenvector of P has eigenvalue 1. It is noteworthy that the largest left eigenvector of P, which also has eigenvalue 1, yields the PageRank of the graph. Therefore, both the left and the right eigenvectors of the stochastic transition matrix P yield different insights about the network.

2. Normalized symmetric Laplacian: The smallest eigenvectors of the symmetric Laplacian Λ−1/2(Λ − W)Λ−1/2 are the same as the largest eigenvectors of the symmetric matrix Λ−1/2W Λ−1/2. The matrix Λ−1/2W Λ−1/2 can be viewed as a normalized and sparsified similarity matrix of the graph. Most forms of nonlinear embeddings such as SVD, Kernel PCA, and ISOMAP are extracted as large eigenvectors of similarity matrices (cf. Table 2.3 of Chap. 2). It is the choice of the similarity matrix that regulates the varying properties of these different embeddings.

3. Goal of normalization: Spectral clustering is more effective when the unnormalized Laplacian L is normalized with the node-degree matrix Λ. While it is possible to explain this behavior with a cut-interpretation of spectral clustering, the intuition does not generalize easily to continuous embeddings with both positive and negative eigenvector components. A simpler way of understanding normalization is by examining the similarity matrix Λ−1/2W Λ−1/2 whose large eigenvectors yield the normalized spectral embedding. In this matrix, the edge similarities are normalized by the geometric mean of the node degrees at their end points. This can be viewed as a normalization of the edge similarities with a local measure of the network density. As discussed in Chap. 3, normalizing similarity and distance functions with local density is helpful even in the case of multidimensional data mining applications. One of the most well-known algorithms for outlier analysis in multidimensional data, referred to as LOF, also uses this principle. Normalization will yield more balanced clusters in networks with widely varying density over the network.
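To make the preceding discussion concrete, a minimal sketch of the normalized (symmetric) spectral embedding followed by k-means is shown below. It assumes that NumPy, SciPy, and scikit-learn are available and that the graph has no isolated nodes; the function name and the toy weight matrix are purely illustrative, and a large sparse network would instead use sparse eigensolvers.

import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

def spectral_embedding(W, k):
    # Node degrees form the diagonal of Lambda; assumes no zero-degree nodes.
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    # Normalized similarity matrix Lambda^{-1/2} W Lambda^{-1/2}.
    S = (W * d_inv_sqrt[:, None]) * d_inv_sqrt[None, :]
    L_sym = np.eye(W.shape[0]) - S          # symmetric normalized Laplacian
    # Eigenvectors in ascending order of eigenvalue; discard the first (trivial) one.
    _, eigvecs = eigh(L_sym)
    D_k = eigvecs[:, 1:k + 1]
    # Heuristic row scaling used by the symmetric version.
    return D_k / np.maximum(np.linalg.norm(D_k, axis=1, keepdims=True), 1e-12)

# Toy example (hypothetical): two triangles joined by a single bridge edge.
W = np.array([[0., 1., 1., 0., 0., 0.],
              [1., 0., 1., 0., 0., 0.],
              [1., 1., 0., 1., 0., 0.],
              [0., 0., 1., 0., 1., 1.],
              [0., 0., 0., 1., 0., 1.],
              [0., 0., 0., 1., 1., 0.]])
clusters = KMeans(n_clusters=2, n_init=10).fit_predict(spectral_embedding(W, 2))

Any other multidimensional clustering method from Chap. 6 could be substituted for k-means in the final step.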
19.4 Collective Classification

In many social networking applications, labels may be associated with nodes. For example, consider the case of a social networking application, where it is desirable to determine all individuals interested in golf. The labels of a small number of actors may already be available. It is desirable to use the available labels to perform the classification of nodes for which the label is not known.

The solution to this model is crucially dependent on the notion of homophily. Because nodes with similar properties are usually connected, it is reasonable to assume that this is also true of node labels. A simple solution to this problem is to examine the k labeled nodes in the proximity of a given node and report the majority label. This approach is, in fact, the network analog of a nearest neighbor classifier. However, such an approach is generally not possible in collective classification because of the sparsity of node labels. An example of a network is illustrated in Fig. 19.9, in which the two classes are labeled A and B. The remaining nodes are unlabeled. For the test node in Fig. 19.9, it is evident that it is generally closer to instances of A in the network structure, but there is no labeled node directly connected to the test instance. Thus, it is evident that one must not only use the direct connections to labeled nodes, but also use the indirect connections through unlabeled nodes. Thus, collective classification in networks is always performed in a transductive semisupervised setting, where the test instances and training instances are classified jointly. In fact, as discussed in Sect. 11.6.3 of Chap. 11, collective classification methods can be used for semisupervised classification of any data type by transforming the data into a similarity graph. Thus, the collective classification problem is important not only from the perspective of social network analysis, but also for semisupervised classification of any data type.

[Figure 19.9: Label sparsity issues in collective classification]

19.4.1 Iterative Classification Algorithm

The Iterative Classification Algorithm (ICA) is one of the earliest collective classification algorithms in the literature and has been applied to a wide variety of data domains. The algorithm has the capability to use content associated with the nodes for classification. This is important because many social networks have text content associated with the nodes in the form of user posts. Furthermore, in cases where this framework is used5 for semisupervised classification of relational data with similarity graphs, the relational features continue to be available at the nodes for more effective classification.

5 cf. Sect. 11.6.3 of Chap. 11.
Algorithm ICA(Graph G = (N, A), Weights: [wij], Node Class Labels: C,
              Base Classifier: A, Number of Iterations: T)
begin
  repeat
    Extract link features at each node with current training data;
    Train classifier A using both link and content features of current
      training data and predict labels of test nodes;
    Make (predicted) labels of most “certain” nt/T test nodes final, and
      add these nodes to training data, while removing them from test data;
  until T iterations;
end

Figure 19.10: The iterative classification algorithm (ICA)

Consider the (undirected) network G = (N, A) with class labels drawn from {1 . . . k}. Each edge (i, j) ∈ A is associated with the weight wij. Furthermore, the content Xi is available at the node i in the form of a multidimensional feature vector. The total number of nodes is denoted by n, of which nt nodes are unlabeled test nodes.

An important step of the ICA algorithm is to derive a set of link features in addition to the available content features in Xi. The most important link features correspond to the distribution of the classes in the immediate neighborhood of the node. Therefore, a feature is generated for each class, containing the fraction of its adjacent nodes belonging to that class. For each node i, its adjacent node j is weighted by wij for computing its credit to the relevant class. It is also possible, in principle, to derive other link features based on structural properties of the graph such as the degree of the node, PageRank values, number of closed triangles involving the node, or connectivity features. Such link features can be derived on the basis of an application-specific understanding of the network data set.

The basic ICA is structured as a meta-algorithm. A base classifier A is leveraged within an iterative framework. Many different base classifiers have been used in different implementations, such as the naive Bayes classifier, the logistic regression classifier, and a neighborhood voting classifier. The main requirement is that these classifiers should be able to output a numeric score that quantifies the likelihood of a node belonging to a particular class. While the framework is independent of the specific choice of classifier, the use of the naive Bayes classifier is particularly common because of the interpretation of its numeric score as a probability. Therefore, the following discussion will assume that the algorithm A is instantiated to the naive Bayes classifier.

The link and content features are used to train the naive Bayes classifier. For many nodes, it is difficult to robustly estimate important class-specific features, such as the fractional presence of the different classes in their neighborhood. This is a direct result of label sparsity, and it makes the class predictions of such nodes unreliable. Therefore, an iterative approach is used for augmenting the training data set. In each iteration, nt/T (test) node labels are made “certain” by the approach, where T is a user-defined parameter controlling the maximum number of iterations. The test nodes, for which the Bayes classifier exhibits the highest class membership probabilities, are selected to be made final. These labeled test
nodes can then be added to the training data, and the classifier is retrained by extracting the link features again with the augmented training data set. The approach is repeated until the labels of all nodes have been made final. Because the labels of nt/T nodes are finalized in each iteration, the entire process terminates in exactly T iterations. The overall pseudocode is illustrated in Fig. 19.10.

One advantage of the ICA is that it can seamlessly use content and structure in the classification process. The classifier can automatically select the most relevant features using off-the-shelf feature selection algorithms discussed in Chap. 10. This approach also has the merit that it is not strongly dependent on the notion of homophily, and can, therefore, be used for domains beyond social network analysis. Consider an adversarial relationship network in which nodes connected by links might have different labels. In such cases, the ICA algorithm will automatically learn the correct importance of adjacent class distributions, and therefore it will yield accurate results. This property is not true of most of the other collective classification methods, which are explicitly dependent on the notion of homophily. On the other hand, the errors made in the earlier phases of iterative classification can propagate and multiply in later phases because of augmented training examples with incorrect labels. This can increase the cumulative error in noisy training data sets.

19.4.2 Label Propagation with Random Walks

The label propagation method directly uses random walks on the undirected network structure G = (N, A). The weight of edge (i, j) is denoted by wij = wji. To classify an unlabeled node i, a random walk is executed starting at node i and terminated at the first labeled node encountered. The class at which the random walk has the highest probability of termination is reported as the predicted label of node i. The intuition for this approach is that the walk is more likely to terminate at labeled nodes in the proximity of node i. Therefore, when many nodes of a particular class are located in its proximity, then the node i is more likely to be labeled with that class. An important assumption is that the graph needs to be label connected. In other words, every unlabeled node needs to be able to reach a labeled node in the random walk. For undirected graphs G = (N, A), this means that every connected component of the graph needs to contain at least one labeled node. In the following discussion, it will be assumed that the graph G = (N, A) is undirected and label-connected.

[Figure 19.11: Creating directed transition graphs from the undirected graph of Fig. 19.9. (a) No absorbing states: the strongly connected network has a unique PageRank vector. (b) With absorbing states: labeled nodes eventually trap all random walks, and there is no unique PageRank vector.]

The first step is to model the random walks in such a way that they always terminate at their first arrival at labeled nodes. This can be achieved by removing outgoing edges from labeled nodes and replacing them with self-loops. Furthermore, to use a random walk approach, we need to convert the undirected graph G = (N, A) into a directed graph G′ = (N, A′) with an n × n transition matrix P = [pij]:
1. For each undirected edge (i, j) ∈ A, directed edges (i, j) and (j, i) are added to A′ between the corresponding nodes. The transition probability pij of edge (i, j) is defined as follows:

pij = wij / Σ_{k=1}^{n} wik        (19.27)

The transition probability pji of edge (j, i) is defined as follows:

pji = wji / Σ_{k=1}^{n} wjk        (19.28)

For example, the directed transition graph created from the undirected graph of Fig. 19.9 is illustrated in Fig. 19.11a.

2. All outgoing edges from labeled nodes are removed from the graph G′ constructed in the previous step and replaced with a self-loop of transition probability 1. Such nodes are referred to as absorbing nodes because they trap the random walk after an incoming transition. An example of the final transition graph is illustrated in Fig. 19.11b. Therefore, for each absorbing node i, the ith row of P is replaced with the ith row of the identity matrix.

Assume that the final n × n transition matrix is denoted by P = [pij]. For any absorbing node i, the value of pik is 1 only when i = k, and 0 otherwise. The transition matrix P does not have a unique steady-state probability distribution (or, PageRank vector), because of the presence of absorbing6 components. The steady-state probability distribution is dependent on the starting state of the random walk. For example, a random walk starting at test node X in Fig. 19.11b will always eventually end at label A, whereas a walk starting with node Y might end at either label A or B. It is noteworthy that the PageRank computation of Sect. 18.4.1 in Chap. 18 ensures unique steady-state probabilities by using teleportation to implicitly create a strongly connected transition graph. Interestingly, the modifications that create absorbing nodes have exactly the opposite effect because the steady-state probability distribution depends on the starting state. Strong connectivity of the transition graph is required to ensure a unique steady-state distribution.

However, if the starting state is fixed, then each node does have a steady-state probability distribution. For any given starting node i, the steady-state probability distribution has positive values only at labeled nodes. This is because a random walk will eventually reach an absorbing node in a label-connected graph, and it will never emerge from that node. Therefore, if one can estimate the steady-state probability distribution for starting node i, then the probability values of the labeled nodes in each class can be aggregated. The class with the highest probability is reported as the relevant label of the node i.

How can the steady-state probability be computed for a particular starting node i? Let π(t) represent the n-dimensional (row) probability vector after t steps, starting with a particular initial state π(0). When the starting state is node i, the value of π(0) is 1 for the ith component in this vector, and 0 otherwise. Then, we have:

π(t) = π(t−1)P        (19.29)

6 In other words, the underlying Markov chain is not strongly connected, and therefore not ergodic. See the description of the PageRank algorithm in Chap. 18.
By recursively applying the aforementioned condition t times, and then setting t = ∞, it is possible to show the following:

π(t) = π(0)P t        (19.30)
π(∞) = π(0)P ∞        (19.31)

How can the steady-state transition matrix P ∞ be computed? A key observation is that the largest magnitude of the eigenvalues of a stochastic matrix is always 1 (see Exercise 7 of Chap. 18). Therefore, P may be expressed as follows:

P = V ΔV −1        (19.32)

Here, V is an n × n matrix, whose columns contain the eigenvectors, and Δ is a diagonal matrix containing the eigenvalues, all of which have magnitude no larger than 1. Note that stochastic matrices with absorbing components will have an eigenvector with unit eigenvalue for each absorbing component. Then, by multiplying P with itself (t − 1) times, we get:

P t = V ΔtV −1        (19.33)

In the limit where t approaches infinity, Δt will contain diagonal values of only 0 or 1. Any eigenvalue in the original matrix Δ with magnitude less than 1 will approach 0 in Δ∞. In other words, Δ∞ can be computed easily from Δ. Therefore, if V has been computed, then P ∞ can be computed easily as well. A further optimization is that the steady-state transition matrix P ∞ can be efficiently computed by determining only the l leading eigenvectors of P, where l is the number of labeled (absorbing) nodes. Refer to the bibliographic notes for more details of this optimization.

After P ∞ has been computed, it is relatively straightforward to compute the n-dimensional node probability vector π(∞) that results from starting the random walk at node i. When the starting state is (unlabeled) node i, the n-dimensional vector for the starting state π(0) contains 1 in the ith component, and 0 otherwise. According to our earlier discussion, one can compute π(∞) = π(0)P ∞. Note that π(∞) will contain positive probabilities only for the labeled nodes, which are also absorbing states. By summing up the probabilities in π(∞) of labeled nodes belonging to each class, one can obtain the probability of each class for unlabeled node i. The class with the maximum probability is reported as the relevant label.

There is, however, a simpler way of computing the class probability distributions of all unlabeled nodes in one shot, rather than having to explicitly compute P ∞, and then trying different starting vectors for π(0). For each class c ∈ {1 . . . k}, let Nc ⊆ N be the set of labeled nodes belonging to that class. In order for unlabeled node i to belong to class c, a walk starting at node i must end at a node in Nc. The probability of this event is given by Σ_{j∈Nc} [P ∞]ij. Let Yc be a column vector with n entries such that the jth entry is 1, if node j belongs to class c, and 0, otherwise. Then, it is easy to see that the ith entry of the column vector Zc = P ∞Yc is equivalent to Σ_{j∈Nc} [P ∞]ij, which is the sum of the probabilities of a walk starting at unlabeled node i terminating at various nodes belonging to class c. Therefore, we need to compute Zc for each class c ∈ {1 . . . k}. Let Y be an n × k matrix for which the cth column is Yc. Similarly, let Z be an n × k matrix for which the cth column is Zc. Then Z can be obtained with simple matrix multiplication between P ∞ and Y.

Z = P ∞Y        (19.34)
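As an illustration of this computation, the sketch below constructs the absorbing transition matrix P and obtains Z by repeated multiplication with P, which converges to P ∞Y; this anticipates the iterative form discussed in the next subsection. It assumes NumPy, a label-connected graph with no zero-degree nodes, and an integer array labels in which unlabeled nodes are marked with −1; all names are illustrative.

import numpy as np

def label_propagation(W, labels, n_classes, n_iter=1000, tol=1e-9):
    # Row-stochastic transition matrix of the undirected graph (Eqs. 19.27 and 19.28).
    P = W / W.sum(axis=1, keepdims=True)
    # Make labeled nodes absorbing by replacing their rows with identity rows.
    idx = np.where(labels >= 0)[0]
    P[idx] = 0.0
    P[idx, idx] = 1.0
    # Binary label matrix Y; unlabeled nodes have all-zero rows.
    Y = np.zeros((W.shape[0], n_classes))
    Y[idx, labels[idx]] = 1.0
    # Iterate Z <- P Z, which converges to P^infinity Y (Eq. 19.34).
    Z = Y.copy()
    for _ in range(n_iter):
        Z_next = P @ Z
        if np.abs(Z_next - Z).max() < tol:
            return Z_next
        Z = Z_next
    return Z

# The predicted class of each unlabeled node i is the index of the largest entry
# in the ith row of the returned matrix Z.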
The class with the maximum probability in Z for unlabeled node (row) i may be reported as its class label. This approach is also referred to as the rendezvous approach to label propagation.

We make a few important observations. If the ith row of P is absorbing, then it is the same as the ith row of the identity matrix. Therefore, premultiplying Y with P for any number of times will not change the ith row of Y. In other words, rows of Z that correspond to labeled nodes will be fixed to the corresponding rows of Y. Therefore, predictions of labeled nodes are fixed to their training labels. For unlabeled nodes, the rows of Z will always sum to 1 in label-connected networks. This is because the sum of the values in row i of Z is equal to the probability that a random walk starting at node i reaches an absorbing state. In label-connected networks, every random walk will eventually reach an absorbing state.

19.4.2.1 Iterative Label Propagation: The Spectral Interpretation

Equation 19.34 suggests a simple iterative approach for computing the label probabilities in Z, rather than computing P ∞. One can initialize Z(0) = Y and then repeatedly use the following update for increasing values of the iteration index t.

Z(t+1) = P Z(t)        (19.35)

It is easy to see that Z(∞) is the same as the value of Z in Eq. 19.34. For a labeled (absorbing) node i, the ith row of Z will always be unaffected by the update because the ith row of P is the same as that of the identity matrix. The label-propagation update is executed to convergence. In practice, a relatively small number of iterations are required to reach convergence.

The label propagation update can be rearranged to show that the final solution Z will satisfy the following relationship at convergence:

(I − P)Z = 0        (19.36)

Note that I − P is simply the normalized (random walk) Laplacian of the adjacency matrix of the network G with absorbing states. Furthermore, each column of Z is an eigenvector of this Laplacian with eigenvalue 0. In unsupervised spectral clustering, the first eigenvector with eigenvalue 0 is discarded because it is not informative. However, in collective classification, there are additional eigenvectors of (I − P) with eigenvalue 0 because of the presence of absorbing states. Each class-specific column of Z contains a different eigenvector with eigenvalue 0. In fact, the label propagation solution can also be derived with an optimization formulation similar to spectral clustering on the original undirected graph G. In this case, the optimization formulation uses a similar objective function as spectral clustering with the additional constraint that the embedded values of all labeled nodes are fixed to 1 for a particular (say, the cth) class and they are fixed to 0 for the remaining classes. The embedded values of only the unlabeled nodes are unconstrained decision variables. The solution to each such optimization problem can be shown to be an eigenvector of (I − P), with eigenvalue 0. Iterative label propagation converges to these eigenvectors.

19.4.3 Supervised Spectral Methods

Spectral methods can be used in two different ways for collective classification of graphs. The first method directly transforms the graph to multidimensional data to facilitate the use of a multidimensional classifier such as a k-nearest neighbor classifier. The embedding
approach is identical to that used in spectral clustering except that the class information is incorporated within the embedding. The second method directly learns an n × k class probability matrix Z with an optimization formulation related to spectral clustering. This class probability matrix Z is similar to that derived in label propagation. Interestingly, the second method is also closely related to label propagation.

19.4.3.1 Supervised Feature Generation with Spectral Embedding

Let G = (N, A) be the undirected graph with weight matrix W. The approach consists of the following steps, the first of which is to augment G with class-based supervision:

1. Add an edge with weight μ between each pair of nodes with the same label in G. If an edge already exists between a pair of such nodes, then the two edges are consolidated by adding μ to the weight of the existing edge. The resulting graph is denoted by G+. The parameter μ controls the level of supervision from existing labels.

2. Use the spectral embedding approach of Sect. 19.3.4 to generate an r-dimensional embedding of the augmented graph G+.

3. Apply any multidimensional classifier, such as a nearest neighbor classifier, on the embedded data.

The value of μ may be tuned with the use of cross-validation. Note that this approach does not directly learn the class probabilities. Rather, it creates a feature representation that implicitly incorporates both the homophily effects and the existing label information. This feature representation is sensitive to both network locality and label distribution. Therefore, it can be used to design an effective multidimensional classifier.

19.4.3.2 Graph Regularization Approach

The graph regularization approach learns the labels of the nodes directly with an optimization formulation related to spectral clustering. Let Z be an n × k matrix of optimization variables, in which the (i, c)th entry denotes the propensity of node i to belong to label c. When the (i, c)th entry is large, it indicates that node i is more likely to belong to label c. Therefore, for the ith row of Z, the index of the largest of the k entries provides a prediction of the class label of node i. The column vector Zc denotes the cth column of Z for c ∈ {1 . . . k}. Furthermore, Y is an n × k binary matrix containing the label information. If the ith node is labeled, then exactly one entry in the ith row of Y is 1, corresponding to the relevant class label. Other entries are 0. For unlabeled nodes, all entries in the corresponding row of Y are 0. The cth column of Y is denoted by the column vector Yc.

This approach directly uses the weighted matrix W of an undirected graph G = (N, A) (e.g., Fig. 19.9) rather than a directed transition graph. The variables in the matrix Z are derived with an optimization formulation related to spectral clustering. Each n-dimensional vector Zc is viewed as a 1-dimensional embedding of the n nodes. The goal of this optimization formulation is twofold, which is reflected in the two additive terms of the objective function:

1. Smoothness (homophily) objective: For each class c ∈ {1 . . . k}, the nodes connected with high-weight edges should be mapped to similar values in Zc. This goal is identical to the unsupervised objective function in spectral clustering. In this case, the symmetric Laplacian Ls is used because of its better convergence properties:

Ls = I − Λ−1/2W Λ−1/2        (19.37)
Here, Λ is a diagonal matrix in which the ith diagonal entry contains the sum of the ith row entries of the n × n weight matrix W. For brevity, we denote the normalized weight matrix by S = Λ−1/2W Λ−1/2. Therefore, the smoothness term Os in the objective function may be written as follows:

Os = Σ_{c=1}^{k} ZcT LsZc = Σ_{c=1}^{k} ZcT (I − S)Zc        (19.38)

This term is referred to as the smoothness term because it ensures that the predicted label propensities Z vary smoothly along edges, especially if the weights of the edges are large. This term can also be viewed as a local consistency term.

2. Label-fitting objective: Because the embedding Z is designed to mimic Y as closely as possible, the value of ||Zc − Yc||2 should be as small as possible for each class c. Note that unlabeled nodes are included within ||Zc − Yc||2, and for those nodes, this term serves as a regularizer. The goal of a regularizer is to avoid7 ill-conditioned solutions and overfitting in optimization models.

Of = Σ_{c=1}^{k} ||Yc − Zc||2        (19.39)

This term can also be viewed as a global consistency term.

The overall objective function may be constructed as O = Os + μOf, where μ defines the weight of the label-fitting term. The parameter μ reflects the trade-off between the two criteria. Therefore, the overall objective function may be written as follows:

O = Σ_{c=1}^{k} ZcT (I − S)Zc + μ Σ_{c=1}^{k} ||Yc − Zc||2        (19.40)

To optimize this objective function, one must use the partial derivative with respect to the different decision variables in Zc and set it to zero. This yields the following condition:

(I − S)Zc + μ(Zc − Yc) = 0   ∀c ∈ {1 . . . k}        (19.41)

Because this condition is true for each class c ∈ {1 . . . k}, one can write the aforementioned condition in matrix form as well:

(I − S)Z + μ(Z − Y) = 0        (19.42)

One can rearrange this optimization condition as follows:

Z = [1/(1 + μ)] SZ + [μ/(1 + μ)] Y        (19.43)

The goal is to determine a solution Z of this optimization condition. This can be achieved iteratively by initializing Z(0) = Y and then iteratively updating Z(t+1) from Z(t) as follows:

Z(t+1) = [1/(1 + μ)] S Z(t) + [μ/(1 + μ)] Y        (19.44)

7 In this case, the regularizer ensures that no single entry in Zc for unlabeled nodes is excessively large.
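A minimal sketch of this iteration is shown below. It assumes NumPy and a graph without isolated nodes; the function and variable names are illustrative only.

import numpy as np

def graph_regularization(W, Y, mu, n_iter=200):
    # Normalized weight matrix S = Lambda^{-1/2} W Lambda^{-1/2}.
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    S = (W * d_inv_sqrt[:, None]) * d_inv_sqrt[None, :]
    # Iterate Eq. 19.44 starting from Z(0) = Y.
    Z = Y.astype(float).copy()
    for _ in range(n_iter):
        Z = (S @ Z) / (1.0 + mu) + (mu / (1.0 + mu)) * Y
    return Z

# The predicted label of node i is the column index of the largest entry in the
# ith row of Z; labeled rows are biased toward, but not fixed to, their rows in Y.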
This solution is iterated to convergence. It can be shown that the approach converges to the following solution:

Z(∞) = [μ/(1 + μ)] · [I + S/(1 + μ) + (S/(1 + μ))² + . . .] Y = [μ/(1 + μ)] · [I − S/(1 + μ)]−1 Y        (19.45)

Intuitively, the matrix [I − S/(1 + μ)]−1 = I + S/(1 + μ) + (S/(1 + μ))² + . . . is an n × n matrix of pairwise weighted Katz coefficients (cf. Definition 19.5.4) between nodes. In other words, the propensity of node i to belong to class j is predicted as a sum of its weighted Katz coefficients with respect to labeled nodes of class j. Because the Katz measure predicts links (cf. Sect. 19.5) between nodes, this approach illustrates the connections between collective classification and link prediction. It is possible to learn the optimal value of μ with the use of cross-validation.

It is noteworthy that, unlike the aforementioned label propagation algorithm with absorbing states, this approach only biases Z with the labels, and it does not constrain the rows in Z to be the same as the corresponding rows of Y for labeled nodes. In fact, the matrix Z can provide a label prediction for an already labeled node that is different from its original training label. Such cases are likely to occur when the original labeling in the training data is error-prone and noisy. The regularization approach is, therefore, more flexible and robust for networks containing noisy and error-prone training labels.

19.4.3.3 Connections with Random Walk Methods

Even though the graph regularization approach is derived using spectral methods, it is also related to random walk methods. The n × k matrix-based update of Eq. 19.44 can be decomposed into k different vector-based update equations, one for each n-dimensional column Zc of Z:

Zc = [1/(1 + μ)] SZc + [μ/(1 + μ)] Yc   ∀c ∈ {1 . . . k}        (19.46)

Each of these update equations is algebraically similar to a personalized PageRank equation, where S/(1 + μ) replaces the transition matrix and the restart probability is μ/(1 + μ) at labeled nodes belonging to a particular class c. The vector Yc is analogous to the personalized restart vector for class c multiplied with the number of training nodes in class c. Similarly, the vector Zc is analogous to the personalized PageRank vector of class c multiplied with the number of training nodes in class c. Therefore, the class-specific Eq. 19.46 can be viewed as a personalized PageRank equation, scaled in proportion to the prior probability of class c. Of course, the symmetric matrix S is not truly a stochastic transition matrix because its columns do not sum to 1. Therefore, the results cannot formally be viewed as personalized PageRank probabilities.

Nevertheless, this algebraic similarity to personalized PageRank suggests the possibility of a closely related family of random walk methods, similar to label propagation. For example, instead of using a nonstochastic matrix S derived from spectral clustering, one might use a stochastic transition matrix P. In other words, Eqs. 19.27 and 19.28 are used to derive P = Λ−1W. However, one difference from the transition matrix P used in label-propagation methods is that the network structure is not altered to create absorbing states. In other words, the directed transition graph of Fig. 19.11a is used, rather than that of Fig. 19.11b, to derive P. Replacing S with P in Eq. 19.46 leads to a variant of the label propagation
update (cf. Eq. 19.35) in which labeled nodes are no longer constrained to be predicted to their original label. Replacing S with P T in Eq. 19.46 leads to the (class-prior scaled) personalized PageRank equations. This is equivalent to executing the personalized PageRank algorithm k times, where the personalization vector for the cth execution restarts at labeled nodes belonging to the cth class. Each class-specific personalized PageRank probability is multiplied with the prior probability of that class, or, equivalently, the number of labeled training nodes in that class. For each node, the class index that yields the highest (prior-scaled) personalized PageRank probability is reported. The performance of these alternative methods is dependent on the data set at hand.

19.5 Link Prediction

In many social networks, it is desirable to predict future links between pairs of nodes in the network. For example, commercial social networks, such as Facebook, often recommend users as potential friends. In general, both structure and content similarity may be used to predict links between pairs of nodes. These criteria are discussed below:

• Structural measures: Structural measures typically use the principle of triadic closure to make predictions. The idea is that two nodes that share similar nodes in their neighborhoods are more likely to become connected in the future, if they are not already connected.

• Content-based measures: In these cases, the principle of homophily is used to make predictions. The idea is that nodes that have similar content are more likely to become linked. For example, in a bibliographic network containing scientific co-author relations, a node containing the keyword “data mining” is more likely to be connected to another node containing the keyword “machine learning.”

While content-based measures have been shown to have potential in enhancing link prediction, the results are rather sensitive to the network at hand. For example, in a network such as Twitter, where the content is in the form of short and noisy tweets with many nonstandard acronyms, content-based measures are not particularly effective. Furthermore, while structural connectivity usually implies content-based homophily, the reverse is not always true. Therefore, the use of content similarity has mixed results in different network domains. On the other hand, structural measures are almost always effective in different types of networks. This is because triadic closure is ubiquitous across different network domains and has more direct applicability to link prediction.

19.5.1 Neighborhood-Based Measures

Neighborhood-based measures use the number of common neighbors between a pair of nodes i and j, in different ways, to quantify the likelihood of a link between them in the future. For example, in Fig. 19.12a, Alice and Bob share four common neighbors. Therefore, it is reasonable to conjecture that a link might eventually form between them. In addition to their common neighbors, they also have their own disjoint sets of neighbors. There are different ways of normalizing neighborhood-based measures to account for the number and relative importance of different neighbors. These are discussed below.
Definition 19.5.1 (Common Neighbor Measure) The common-neighbor measure between nodes i and j is equal to the number of common neighbors between nodes i and j. In other words, if Si is the neighbor set of node i, and Sj is the neighbor set of node j, the common-neighbor measure is defined as follows:

CommonNeighbors(i, j) = |Si ∩ Sj|        (19.47)

The major weakness of the common-neighbor measure is that it does not account for the relative number of common neighbors between them as compared to the number of other connections. In the example of Fig. 19.12a, Alice and Bob each have a relatively small node degree. Consider a different case in which Alice and Bob are either spammers or very popular public figures who were connected to a large number of other actors. In such a case, Alice and Bob might easily have many neighbors in common, just by chance. The Jaccard measure is designed to normalize for varying degree distributions.

Definition 19.5.2 (Jaccard Measure) The Jaccard-based link prediction measure between nodes i and j is equal to the Jaccard coefficient between their neighbor sets Si and Sj, respectively.

JaccardPredict(i, j) = |Si ∩ Sj| / |Si ∪ Sj|        (19.48)

The Jaccard measure between Alice and Bob in Fig. 19.12a is 4/9. If the degrees of either Alice or Bob were to increase, it would result in a lower Jaccard coefficient between them. This kind of normalization is important, because of the power-law degree distributions of nodes.

The Jaccard measure adjusts much better to the variations in the degrees of the nodes between which the link prediction is measured. However, it does not adjust well to the degrees of their intermediate neighbors. For example, in Fig. 19.12a, the common neighbors of Alice and Bob are Jack, John, Jill, and Mary. However, all these common neighbors could be very popular public figures with very high degrees. Therefore, these nodes are statistically more likely to occur as common neighbors of many pairs of nodes. This makes them less important in the link prediction measure. The Adamic–Adar measure is designed to account for the varying importance of the different common neighbors. It can be viewed as a weighted version of the common-neighbor measure, where the weight of a common neighbor is a decreasing function of its node degree. The typical function used in the case of the Adamic–Adar measure is the inverse logarithm. In this case, the weight of the common neighbor with index k is set to 1/log(|Sk|), where Sk is the neighbor set of node k.

Definition 19.5.3 (Adamic–Adar Measure) The Adamic–Adar measure between nodes i and j is equal to the weighted number of common neighbors between nodes i and j. The weight of node k is defined as 1/log(|Sk|).

AdamicAdar(i, j) = Σ_{k∈Si∩Sj} 1/log(|Sk|)        (19.49)

The base of the logarithm does not matter in the previous definition, as long as it is chosen consistently for all pairs of nodes. In Fig. 19.12a, the Adamic–Adar measure between Alice and Bob is 1/log(4) + 1/log(2) + 1/log(2) + 1/log(4) = 3/log(2).
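A minimal sketch of these three measures is shown below, using a dictionary that maps each node to its set of neighbors. The toy graph and function names are purely illustrative; only unweighted neighborhoods are considered here.

import math

def common_neighbors(adj, i, j):
    # Eq. 19.47: number of common neighbors.
    return len(adj[i] & adj[j])

def jaccard_predict(adj, i, j):
    # Eq. 19.48: common neighbors normalized by the size of the union.
    union = adj[i] | adj[j]
    return len(adj[i] & adj[j]) / len(union) if union else 0.0

def adamic_adar(adj, i, j):
    # Eq. 19.49: common neighbors weighted by 1/log(degree); degree-1 neighbors
    # are skipped here as a practical guard, because 1/log(1) is undefined.
    return sum(1.0 / math.log(len(adj[k]))
               for k in adj[i] & adj[j] if len(adj[k]) > 1)

# Hypothetical 5-node graph: score the nonadjacent pair (1, 3).
adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1, 3}, 3: {0, 2, 4}, 4: {3}}
print(common_neighbors(adj, 1, 3), jaccard_predict(adj, 1, 3), adamic_adar(adj, 1, 3))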
[Figure 19.12: Examples of varying effectiveness of different link-prediction measures. (a) Many common neighbors between Alice and Bob. (b) Many indirect connections between Alice and Bob.]

19.5.2 Katz Measure

While the neighborhood-based measures provide a robust estimation of the likelihood of a link forming between a pair of nodes, they are not quite as effective when the number of shared neighbors between a pair of nodes is small. For example, in the case of Fig. 19.12b, Alice and Bob share one neighbor in common. Alice and Jim also share one neighbor in common. Therefore, neighborhood-based measures have difficulty in distinguishing between different pairwise prediction strengths in these cases. Nevertheless, there also seems to be a significant indirect connectivity in these cases through longer paths. In such cases, walk-based measures are more appropriate. A particular walk-based measure that is used commonly to measure the link-prediction strength is the Katz measure.

Definition 19.5.4 (Katz Measure) Let nij(t) be the number of walks of length t between nodes i and j. Then, for a user-defined parameter β < 1, the Katz measure between nodes i and j is defined as follows:

Katz(i, j) = Σ_{t=1}^{∞} βt · nij(t)        (19.50)

The value of β is a discount factor that de-emphasizes walks of longer length. For small enough values of β, the infinite summation of Eq. 19.50 will converge. If A is the symmetric adjacency matrix of an undirected network, then the n × n pairwise Katz coefficient matrix K can be computed as follows:

K = Σ_{i=1}^{∞} (βA)i = (I − βA)−1 − I        (19.51)

The eigenvalues of Ak are the kth powers of the eigenvalues of A (cf. Eq. 19.33). The value of β should always be selected to be smaller than the inverse of the largest eigenvalue of A to ensure convergence of the infinite summation. A weighted version of the measure can be computed by replacing A with the weight matrix of the graph.
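A minimal sketch of Eq. 19.51 is shown below, assuming NumPy; the function name and the small adjacency matrix are illustrative, and a weighted graph would simply pass its weight matrix in place of A.

import numpy as np

def katz_matrix(A, beta):
    # Pairwise Katz coefficients K = (I - beta*A)^{-1} - I (Eq. 19.51).
    # beta must be smaller than 1 / (largest eigenvalue of A) for convergence.
    largest = np.max(np.abs(np.linalg.eigvalsh(A)))
    assert beta < 1.0 / largest, "beta is too large for the series to converge"
    n = A.shape[0]
    return np.linalg.inv(np.eye(n) - beta * A) - np.eye(n)

# Hypothetical 4-node path graph; larger entries of K indicate stronger
# predicted links between the corresponding node pairs.
A = np.array([[0., 1., 0., 0.],
              [1., 0., 1., 0.],
              [0., 1., 0., 1.],
              [0., 0., 1., 0.]])
K = katz_matrix(A, beta=0.2)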
The Katz measure often provides prediction results of excellent quality. It is noteworthy that the sum of the Katz coefficients of a node i with respect to other nodes is referred to as its Katz centrality. Other mechanisms for measuring centrality, such as closeness and PageRank, are also used for link prediction in a modified form. The reason for this connection between centrality and link-prediction measures is that highly central nodes have the propensity to form links with many nodes.

19.5.3 Random Walk-Based Measures

Random walk-based measures are a different way of defining connectivity between pairs of nodes. Two such measures are PageRank and SimRank. Because these methods are described in detail in Sect. 18.4.1.2 of Chap. 18, they will not be discussed in detail here.

The first way of computing the similarity between nodes i and j is with the use of the personalized PageRank of node j, where the restart is performed at node i. The idea is that if j is in the structural proximity of i, it will have a very high personalized PageRank measure, when the restart is performed at node i. This is indicative of higher link prediction strength between nodes i and j. The personalized PageRank is an asymmetric measure between nodes i and j. Because the discussion in this section is for the case of undirected graphs, one can use the average of the values of PersonalizedPageRank(i, j) and PersonalizedPageRank(j, i). Another possibility is the SimRank measure that is already a symmetric measure. This measure computes an inverse function of the walk length required by two random surfers moving backwards to meet at the same point. The corresponding value is reported as the link prediction measure. Readers are advised to refer to Sect. 18.4.1.2 of Chap. 18 for details of the SimRank computation.

19.5.4 Link Prediction as a Classification Problem

The aforementioned measures are unsupervised heuristics. For a given network, one of these measures might be more effective, whereas another might be more effective for a different network. How can one resolve this dilemma and select the measures that are most effective for a given network?

The link prediction problem can be viewed as a classification problem by treating the presence or absence of a link between a pair of nodes as a binary class indicator. Thus, a multidimensional data record can be extracted for each pair of nodes. The features of this multidimensional record include all the different neighborhood-based, Katz-based, or walk-based similarities between nodes. In addition, a number of other preferential-attachment features, such as the node degrees of each node in the pair, are used. Thus, for each node pair, a multidimensional data record is constructed. The result is a positive-unlabeled classification problem, where node pairs with edges are the positive examples, and the remaining pairs are unlabeled examples. The unlabeled examples can be approximately treated as negative examples for training purposes. Because there are too many negative example pairs in large and sparse networks, only a sample of the negative examples is used. Therefore, the supervised link prediction algorithm works as follows:

1. Training phase: Generate a multidimensional data set containing one data record for each pair of nodes with an edge between them, and a sample of data records from pairs of nodes without edges between them.
The features correspond to extracted similarity and structural features between node pairs. The class label is the presence or absence of an edge between the pair. Construct a training model on the data.
2. Testing phase: Convert each test node pair to a multidimensional record. Use any conventional multidimensional classifier to make label predictions.

The logistic regression method of Sect. 10.6 in Chap. 10 is a common choice for the base classifier. Cost-sensitive versions of various classifiers are commonly used because of the imbalanced nature of the underlying classification problem.

One advantage of this approach is that content features can be used in a seamless way. For example, the content similarity between a pair of nodes can be used. The classifier will automatically learn the relevance of these features in the training process. Furthermore, unlike many link prediction methods, the approach can also handle directed networks by extracting features in an asymmetric way. For example, instead of using node degrees, one might use indegrees and outdegrees as features. Random walk features can also be defined in an asymmetric way on directed networks, such as computing the PageRank of node j with restart at node i, and vice versa. In general, the supervised model is more flexible because of its ability to learn relationships between links and features of various types.

19.5.5 Link Prediction as a Missing-Value Estimation Problem

Section 18.5.3 of Chap. 18 discusses how link prediction can be applied to user-item graphs for recommendations. In general, both the recommendation problem and the link prediction problem may be viewed as instances of missing value estimation on matrices of different types. Recommendation algorithms are applied to user-item utility matrices, whereas link prediction algorithms are applied to incomplete adjacency matrices. All the 1s in the matrix correspond to edges. Only a small random sample of the remaining entries are set to 0, and the other entries are assumed to be unspecified. Any of the missing-value estimation methods discussed in Sect. 18.5 of Chap. 18 may be used to estimate the values of the missing entries. Among this class of methods, matrix factorization techniques are the most commonly used. One advantage of using these methods is that the specified matrix does not need to be symmetric. In other words, the approach can also be used for directed graphs. Refer to the bibliographic notes.

19.5.6 Discussion

The different measures have been shown to have varying levels of effectiveness over different data sets. The advantage of neighborhood-based measures is that they can be computed efficiently for very large data sets. Furthermore, they perform almost as well as the other unsupervised measures. Nevertheless, random walk-based and Katz-based measures are particularly useful for very sparse networks, in which the number of common neighbors cannot be robustly measured. Although supervision provides better accuracy, it is computationally expensive. However, supervision provides the greatest adaptability across various domains of social networks, and available side information such as content features. In recent years, content has also been used to enhance link prediction. While content can significantly improve link prediction, it is important to point out that structural measures are far more powerful. This is because structural measures directly use the triadic properties of real networks. The triadic property of networks is true across virtually all data domains.
On the other hand, content-based measures are based on “reverse homophily,” where similar or link-correlated content is leveraged for predicting links. The effectiveness of this is highly network domain-specific. Therefore, content-based measures are often used in a helping role for link prediction and are rarely used in isolation for the prediction process.
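Before moving on to influence analysis, the supervised approach described in the preceding subsections can be made concrete with a short sketch. The following is a minimal illustration, not part of the original text: it extracts a few neighborhood-based features for node pairs and trains a logistic regression classifier on labeled pairs. The specific feature set, the adjacency representation, and the use of scikit-learn are illustrative assumptions; the class_weight parameter is one way of obtaining the cost-sensitive behavior mentioned above.

```python
from sklearn.linear_model import LogisticRegression

def pair_features(adj, i, j):
    """Simple structural features for a node pair (i, j)."""
    ni, nj = adj[i], adj[j]
    common = ni & nj
    union = ni | nj
    return [
        len(common),                                   # common neighbors
        len(common) / len(union) if union else 0.0,    # Jaccard coefficient
        len(ni) * len(nj),                             # preferential attachment
    ]

# adj: node -> set of neighbors; train_pairs: list of (i, j, label) with
# label = 1 if the edge exists (or later appears) and 0 otherwise.
def train_link_predictor(adj, train_pairs):
    X = [pair_features(adj, i, j) for i, j, _ in train_pairs]
    y = [label for _, _, label in train_pairs]
    # class_weight="balanced" compensates for the rarity of positive pairs
    return LogisticRegression(class_weight="balanced").fit(X, y)

adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
model = train_link_predictor(adj, [(0, 1, 1), (1, 2, 1), (0, 3, 0), (1, 3, 0)])
print(model.predict_proba([pair_features(adj, 0, 2)])[:, 1])  # link likelihood
```

Content similarities or asymmetric features for directed networks can be appended to the same feature vector without changing the rest of the pipeline.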
19.6 Social Influence Analysis

All social interactions result in varying levels of influence between individuals. In traditional social interactions, this is sometimes referred to as "word-of-mouth" influence. This general principle is also true for online social networks. For example, when an actor tweets a message on Twitter, the followers of the actor are exposed to the message. The followers may often retweet the message in the network. This results in the spread of information, ideas, and opinions in the social network. Many companies view this kind of information spread as a valuable advertising channel. By tweeting a popular message to the right participants, millions of dollars worth of advertising can be generated, if the message spreads through the social network as a cascade. An example [532] is the famous Oreo Superbowl tweet on February 3, 2013. The power went out during the Superbowl game between the San Francisco 49ers and the Baltimore Ravens. Oreo used this opportunity to tweet the following message, along with a picture of an Oreo cookie, during the 34-minute interruption: "Power out? No problem. You can still dunk in the dark." Viewers loved Oreo's message and retweeted it thousands of times. Oreo was thus able to generate millions of dollars worth of advertising at zero cost, apparently with a higher impact than the paid television advertisements during the Superbowl.

Different actors have different abilities to influence their peers in the social network. The two most common factors that regulate the influence of an actor are as follows:

1. Their centrality within the social network structure is a crucial factor in their influence level. For example, actors with high levels of centrality are more likely to be influential. In directed networks, actors with high prestige are more likely to be influential. These measures are discussed in Sect. 19.2.

2. The edges in the network are often associated with weights that depend on the likelihood that the corresponding pair of actors can influence each other. Depending on the diffusion model used, these weights can sometimes be directly interpreted as influence propagation probabilities. Several factors may determine these probabilities. For example, a well-known individual may have higher influence than lesser-known individuals. Similarly, two individuals who have been friends for a long time are more likely to influence one another. It is often assumed that the influence propagation probabilities are already available for analytical purposes, although a few recent methods show how to estimate these probabilities in a data-driven way.

The precise impact of the aforementioned factors is quantified with the use of an influence propagation model. These are also referred to as diffusion models. The main goal of such models is to determine a set of seed nodes in the network, at which the dissemination of information maximizes influence. Therefore, the influence maximization problem is as follows:

Definition 19.6.1 (Influence Maximization) Given a social network G = (N, A), determine a set S of k seed nodes such that influencing S maximizes the overall spread of influence in the network.

The value of k can be viewed as a budget on the number of seed nodes that one is allowed to initially influence. This is quite consistent with real-life models, where advertisers are faced with budgets on initial advertising capacity.
The goal of social influence analysis is to extend this initial advertising capacity with word-of-mouth methods.
Each model or heuristic quantifies the influence level of a set of nodes S with the use of a function denoted by f(·). This function maps subsets of nodes to real numbers representing influence values. Therefore, after a model has been chosen for quantifying the influence f(S) of a given set S, the optimization problem is that of determining the set S that maximizes f(S). An interesting property of a very large number of influence analysis models is that the optimized function f(S) is submodular.

What does submodularity mean? It is a mathematical way of representing the natural law of diminishing returns, as applied to sets. In other words, if S ⊆ T, then the additional influence obtained by adding an individual to set T cannot be larger than the additional influence of adding the same individual to set S. Thus, the incremental influence of the same individual diminishes as larger supersets of cohorts are available as seeds. Submodularity is formally defined as follows:

Definition 19.6.2 (Submodularity) A function f(·) is said to be submodular, if for any pair of sets S, T satisfying S ⊆ T, and any set element e, the following is true:

f(S ∪ {e}) − f(S) ≥ f(T ∪ {e}) − f(T)        (19.52)

Virtually all natural models for quantifying influence turn out to be submodular. Submodularity is algorithmically convenient because a very efficient greedy optimization algorithm exists for maximizing submodular functions, as long as f(S) can be evaluated for a given value of S. This algorithm starts by setting S = {} and incrementally adds the node to S that increases the value of f(S) as much as possible. This procedure is repeated until the set S contains the required number of influencers k. The approximation level of this heuristic is based on a well-known classical result on the optimization of submodular functions.

Lemma 19.6.1 The greedy algorithm for maximizing submodular functions provides a solution whose objective function value is at least a fraction (e − 1)/e of the optimal value. Here, e is the base of the natural logarithm.

Thus, these results show that it is possible to optimize f(S) effectively, as long as an appropriate submodular influence function f(S) can be defined for a given set of nodes S. Two common approaches for defining the influence function f(S) of a set of nodes S are the Linear Threshold Model and the Independent Cascade Model. Both of these diffusion models were proposed in one of the earliest works on social influence analysis. The general operational assumption in these diffusion models is that nodes are either in an active or an inactive state. Intuitively, an active node is one that has already been influenced by the set of desired behaviors. Once a node moves to an active state, it never deactivates. Depending on the model, an active node may trigger the activation of neighboring nodes either a single time, or over longer periods. Nodes are successively activated until no more nodes are activated in a given iteration. The value of f(S) is evaluated as the total number of activated nodes at termination.

19.6.1 Linear Threshold Model

In this model, the algorithm initially starts with an active set of seed nodes S and iteratively increases the number of active nodes based on the influence of neighboring active nodes. Active nodes are allowed to influence their neighbors over multiple iterations throughout the execution of the algorithm until no more nodes can be activated.
The influence of neighboring nodes is quantified with the use of a linear function of the edge-specific weights b_ij. For each node i in the network G = (N, A), the following is assumed to be true:

\sum_{j:(i,j) \in A} b_{ij} \leq 1        (19.53)

Each node i is associated with a random threshold θ_i ∼ U[0, 1], which is fixed up front and stays constant over the course of the algorithm. The total influence I(i) of the active neighbors of node i on it, at a given time instant, is computed as the sum of the weights b_ij of all active neighbors of i:

I(i) = \sum_{j:(i,j) \in A,\; j \text{ is active}} b_{ij}        (19.54)

The node i becomes active in a step when I(i) ≥ θ_i. This process is repeated until no further nodes can be activated. The total influence f(S) may be measured as the number of nodes activated by a given seed set S. The influence f(S) of a given seed set S is typically computed with simulation methods.

19.6.2 Independent Cascade Model

In the aforementioned linear threshold model, once a node becomes active, it has multiple chances to influence its neighbors. The random variable θ_i was associated with a node, in the form of a threshold. On the other hand, in the independent cascade model, after a node becomes active, it obtains only a single chance to activate its neighbors, with propagation probabilities associated with the edges. The propagation probability associated with an edge is denoted by p_ij. In each iteration, only the newly active nodes are allowed to influence their neighbors that have not already been activated. For a given node j, each of the edges (i, j) joining it to its newly active neighbors i flips a coin independently with success probability p_ij. If the coin toss for edge (i, j) results in a success, then the node j is activated. If node j is activated, it will get a single chance in the next iteration to influence its neighbors. In the event that no nodes are newly activated in an iteration, the algorithm terminates. The influence function value is equal to the number of active nodes at termination. Because nodes are allowed to influence their neighbors only once over the course of the algorithm, a coin is tossed for each edge at most once over the course of the algorithm.

19.6.3 Influence Function Evaluation

Both the linear threshold model and the independent cascade model are designed to compute the influence function f(S) with the use of a model. The estimation of f(S) is typically accomplished with simulation.

For example, consider the case of the linear threshold model. For a given seed node set S, one can use a random number generator to set the thresholds at the nodes. After the thresholds have been set, the active nodes can be labeled using any deterministic graph-search algorithm starting from the seed nodes in S and progressively activating nodes when the threshold condition is satisfied. The computation can be repeated over different sets of randomly generated thresholds, and the results may be averaged to obtain more robust estimates.

In the independent cascade model, a different simulation may be used. A coin with probability p_ij may be flipped for each edge. The edge is designated as live if the coin toss was a success. It can be shown that a node will eventually be activated by the independent cascade model when a path of live edges exists from at least one node in S to it. This can be used to estimate the size of the (final) active set by simulation. The computation is repeated over different runs and the results are averaged.
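As an illustration, the following minimal sketch (not part of the original text) estimates f(S) for the independent cascade model by Monte Carlo simulation and plugs the estimate into the greedy seed-selection procedure of Lemma 19.6.1. The graph representation `succ` (node → dictionary of successors with propagation probabilities) and the function names are assumptions made only for this example; a practical implementation would use many more simulation runs and lazy evaluation of marginal gains.

```python
import random

def simulate_ic(succ, seeds, num_runs=100, rng=random):
    """Estimate f(S) under the independent cascade model by averaging
    the final number of active nodes over several simulation runs."""
    total = 0
    for _ in range(num_runs):
        active = set(seeds)
        frontier = list(seeds)            # newly activated nodes
        while frontier:
            next_frontier = []
            for i in frontier:
                for j, p_ij in succ.get(i, {}).items():
                    # each edge is coin-flipped at most once per run
                    if j not in active and rng.random() <= p_ij:
                        active.add(j)
                        next_frontier.append(j)
            frontier = next_frontier
        total += len(active)
    return total / num_runs

def greedy_seed_selection(succ, k, num_runs=100):
    """Greedy maximization of the (submodular) influence function f(S)."""
    seeds = set()
    for _ in range(k):
        current = simulate_ic(succ, seeds, num_runs) if seeds else 0.0
        best_node, best_gain = None, -1.0
        for v in succ:
            if v in seeds:
                continue
            gain = simulate_ic(succ, seeds | {v}, num_runs) - current
            if gain > best_gain:
                best_node, best_gain = v, gain
        seeds.add(best_node)
    return seeds

# Example: a small directed graph with edge propagation probabilities
graph = {'a': {'b': 0.4, 'c': 0.3}, 'b': {'d': 0.5},
         'c': {'d': 0.2}, 'd': {}}
print(greedy_seed_selection(graph, k=2))
```

The number of simulation runs controls the trade-off between the accuracy of the estimate of f(S) and the computational cost of each greedy step.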
The proof that the linear threshold model and the independent cascade model are submodular optimization problems can be found in pointers included in the bibliographic notes. However, this property is not specific to these models. Submodularity is a very natural consequence of the law of diminishing returns, as applied to the incremental impact of individual influence in larger groups. As a result, most reasonable models for influence analysis will satisfy submodularity.

19.7 Summary

Social networks have become increasingly popular in recent years, because of their ability to connect geographically and culturally diverse participants. A significant amount of data is created because of the actions of social network participants. Much of this data is structural, in the form of relationships between different individuals.

Social network structures exhibit a number of typical properties, because of the natural dynamics of their formation. The most important similarity-based properties include triadic closure and homophily. Typically, social networks are formed by preferential attachment, and they exhibit power-law degree distributions.

The problem of clustering social networks is challenging because of the presence of hub nodes and the natural tendency of social networks to cluster into a single large group. Therefore, most community detection algorithms have built-in mechanisms to ensure that the underlying clusters are balanced. Clustering methods are also sometimes referred to as graph partitioning. One of the earliest clustering methods was the Kernighan–Lin method, which uses an iterative approach for clustering. Nodes are repeatedly exchanged between partitions to iteratively improve the value of the objective function. The Girvan–Newman algorithm uses notions of betweenness centrality to generate clusters. The METIS algorithm generates an efficient partition by using coarsening and then creating the partitions on the coarsened representation. The spectral method uses multidimensional embeddings to generate the clusters.

In collective classification, the goal is to infer labels at the remaining vertices from the pre-existing labels at a subset of the vertices. This is a problem that has dual applicability to social network analysis and semisupervised learning. Multidimensional data sets can be transformed into similarity graphs to apply collective classification methods. The most common methods used for collective classification include iterative methods, random walk-based label propagation methods, and spectral methods.

In the link prediction problem, the goal is to predict the links from the currently available structure and content in the network. Structural measures are generally much more effective for link prediction than content-based measures. The structural methods use local clustering measures, such as the Jaccard measure or personalized PageRank values, for making predictions. Supervised methods are able to discriminatively determine the most relevant features for link prediction.

Social networks are often used for influencing individuals using "word-of-mouth" techniques. Typically, centrally located actors are more influential in the network.
Diffusion models are used to characterize the flow of information in social networks. Two examples of such models include the linear threshold model and the independent cascade model.
19.8. BIBLIOGRAPHIC NOTES 659 19.8 Bibliographic Notes Social network analysis has been studied extensively in the context of the field of sociol- ogy [508], though more recent work has focused on online social networks [6, 192, 532]. A detailed discussion on proximity and centrality measures may be found in [6, 192, 508, 532]. The dynamics of social network formation may be found in the excellent survey paper [69]. The derivation of the power-law with the use of the scale-free model is provided in [70]. A detailed study of the power-law in the context of the Internet topology is provided in [201]. A study of graph densification and shrinking diameters is provided in [342]. Other random graph models such as the Erdos–Renyi model and the Watts–Strogatz small-world model are discussed in [196, 509]. A detailed survey on community detection methods may be found in [212]. The minimum cut problem is polynomially solvable for some special cases. For example, the unweighted 2- way cut problem is polynomially solvable without balancing constraints [299]. The original Kernighan–Lin algorithm is presented in [312]. The enhancements to the Kernighan–Lin algorithm was discussed in [206, 301]. The Girvan–Newman algorithm discussed in this chapter is adapted from [230]. The METIS algorithm is presented in [301]. The normalized cut method for spectral clustering was discussed in this chapter [466]. The normalized sym- metric version was proposed in [405]. More details on spectral graph theory and clustering methods may be found in [152, 371]. This chapter uses the Laplacian eigenmap interpre- tation [90] of spectral clustering, rather than the more commonly used cut interpretation, because of its comprehensive explanation of the non-integer and possibly negative eigenvec- tor components. ICA has been presented in the context of many different data domains, such as docu- ment data [128], and relational data [404]. Several base classifiers have been used within this framework, such as logistic regression [370] and a weighted voting classifier [373]. The discussion in this chapter is based on [404]. The iterative label propagation method was proposed in [554], and the absorbing random walk interpretation is adapted from [78]. The iterative label propagation approach [554] was originally proposed with a spectral inter- pretation although the random walk interpretation is also briefly discussed in the same work. Most random walk methods can also be formulated as supervised versions of spectral embeddings [530, 551, 554]. The regularization framework for collective classification is dis- cussed in [551]. Collective classification of directed graphs is discussed in [552]. A method for incorporating content within the random walk framework is discussed in [44]. Detailed surveys on node classification methods may be found in [93, 368]. A toolkit for collective classification may be found in [427]. The link-prediction problem for social networks was proposed in [353]. The measures discussed in this chapter are based on this work. Since then, a significant amount of work has been done on incorporating content into the link prediction process. Methods that use content for link prediction may be found in [49, 64, 354, 484, 489]. The merits of supervised methods are discussed in [354], and matrix factorization methods are discussed in [383]. Recently, it has been shown how to use link prediction across multiple networks in [428]. A survey on link-prediction methods for social network analysis may be found in [63]. 
The problem of influence analysis in social networks was proposed in [304]. The linear threshold and independent cascade models are presented in this work. The degree-discount heuristic was proposed in [142]. A discussion of the submodularity property may be found in [403]. Other recent models for influence analysis in social networks are discussed in [45, 143, 144, 362, 488]. One of the main problems in social influence models is a difficulty in learning the influence propagation probabilities, though there has been some recent focus
660 CHAPTER 19. SOCIAL NETWORK ANALYSIS on this issue [235]. Recent work has also shown how influence analysis can be performed directly from the social stream [234, 482]. A survey on models and algorithms for social influence analysis may be found in [483]. 19.9 Exercises 1. For the figure in Example 19.1a, compute the highest-degree centrality, closeness cen- trality and betweenness centrality. The nodes that take on these highest values are already marked in the figure. 2. Implement the algorithms for determining the degree centrality, closeness centrality, and betweenness centrality. 3. Implement the Kernighan–Lin algorithm. 4. Why is the balancing constraint more important in community detection algorithms, as compared to multidimensional clustering algorithms? What would the uncon- strained minimum 2-way cut look like in a typical real network? 5. Consider a variation of Girvan–Newman algorithm in which edges are randomly dis- connected from a network, as opposed to those with high betweenness centrality. Explain the negative impact of this change on the algorithm. Can you make minor changes to the disconnection criterion to ameliorate this impact? 6. Write an integer-programming formulation for the minimum 2-way cut problem, so that the cut is balanced in terms of the number of nodes. 7. For the random walk formulation of spectral clustering algorithm, show why the fol- lowing are true: (a) All nontrivial eigenvectors y have both positive and negative components. (b) Provide an intuitive explanation why the normalization factor Λ in the constraint yT Λy = 1, increases the propensity of low-degree nodes to be embedded away from the origin, and the high-degree nodes to be embedded near the origin. 8. Suppose that all edge weights wij are discounted by the geometric mean of the weighted node-degrees at their endpoints. Write an unnormalized formulation of spec- tral clustering in terms of these normalized weights for discovering a 1-dimensional embedding. What effect would the weight normalization have on the embedding? Describe the algebraic similarities and differences of this formulation from the sym- metric normalized formulation of spectral clustering. Discuss why the resulting eigen- vectors will often be heuristically similar to those obtained with the symmetric for- mulation of spectral clustering. 9. Explain the relationship between the random walk label propagation and the graph regularization algorithm. 10. Discuss the connections between the link-prediction problem and network clustering. 11. Create a link prediction measure that can perform the degree normalizations per- formed both by the Jaccard measure and the Adamic–Adar measure.
12. Implement the linear threshold and independent cascade model for influence analysis.

13. The chapter provides a 1-dimensional formulation for the symmetric version using the column vector z. Set up a generalized formulation for the symmetric version using an n × k matrix Z.
(a) Let Y be the decision variables for the random walk formulation discussed in the chapter. Show that Z = √Λ Y.
(b) Show that the unit-norm scaled rows of Y and Z are the same.

14. It is well known that a symmetric matrix always has real eigenvalues. Use this result to show that the stochastic transition matrix of an undirected graph always has real eigenvalues.

15. Show that if (y, λ) is an eigenvector–eigenvalue pair of the normalized Laplacian Λ⁻¹(Λ − W), then (y, 1 − λ) is an eigenvector–eigenvalue pair of the normalized weight matrix Λ⁻¹W. Here, Λ is a diagonal matrix containing the sum of each row of the weighted adjacency matrix W.
Chapter 20

Privacy-Preserving Data Mining

"Civilization is the progress toward a society of privacy. The savage's whole existence is public, ruled by the laws of his tribe. Civilization is the process of setting man free from men."—Ayn Rand

20.1 Introduction

A significant amount of application data is of a personal nature. These kinds of data sets may contain sensitive information about an individual, such as his or her financial status, political beliefs, sexual orientation, and medical history. Knowledge of such personal information can compromise the privacy of individuals. Therefore, it is crucial to design data collection, dissemination, and mining techniques so that individuals are assured of their privacy. Privacy-preservation methods can generally be executed at different steps of the data mining process:

1. Data collection and publication: The privacy-driven modification of a data set may be done either at data collection time or at data publication time. In anonymous data collection, a modified version of the data is collected using a software plugin within the collection platform. Therefore, the contributors of the data are assured that their data is not available even to the entity collecting the data. The implicit assumption in the collection-oriented model is that the data collector is not trusted, and therefore the privacy must be preserved at collection time. In anonymous data publication, the entire data set is available to a trusted entity, who has usually collected the data in the normal course of business. An example is a hospital that has collected data about its patients. Eventually, the entity may wish to release or publish the data to one or more third parties for data analysis. For example, a hospital may want to use the data to study the long-term impact of various treatment alternatives. A real-world example is the Netflix prize data set [559], in which the anonymized movie ratings of users were published to advance studies on collaborative filtering algorithms. During data publication, identifying or sensitive attribute values need either to be removed or to be specified approximately to preserve privacy. Generally, such publication algorithms
can control the level of privacy much better than collection algorithms, because of their access to the entire data set on a trusted server.

2. Output privacy of data mining algorithms: Privacy can also be violated by the output of data mining algorithms. For example, consider a scenario where a user is allowed to determine association patterns, or otherwise query the data through a Web service, but is not provided access to the data set. In such a case, the output of the data mining and query processing algorithms provides valuable information, some of which may be private. In some applications, organizations may wish to share their data in a private way, so that only patterns in the shared data may be mined, but the statistics of the local databases are not revealed to the participants. This problem is referred to as distributed privacy preservation.

In general, most forms of privacy-preserving data mining reduce the representation accuracy of the data in order to preserve privacy. This accuracy reduction is performed in a variety of ways, such as data distortion, approximation (generalization), suppression, attribute value swapping, or microaggregation. Clearly, since the data is no longer specified exactly, this will have a detrimental impact on the quality of the data mining results. The effectiveness of the released data for mining applications is often quantified explicitly, and is referred to as its utility. A natural trade-off exists between privacy and utility. For example, in a case where data values are suppressed, one might simply choose to suppress all entries. While such a solution provides perfect privacy, it offers no utility. This observation is also true for privacy-preserving publication algorithms in which noise is added to the data. When a greater amount of noise is added, a higher level of privacy is achieved, but utility is reduced. The goal of privacy-preservation methods is to maximize utility at a fixed level of privacy.

This chapter is organized as follows. Methods for privacy-preserving data collection are addressed in Sect. 20.2. Section 20.3 addresses the problem of privacy-preserving data publishing. This section includes several models such as the k-anonymity model, the ℓ-diversity model, and the t-closeness model. The problem of output privacy is addressed in Sect. 20.4. Methods for distributed and cryptographic privacy are discussed in Sect. 20.5. A summary is given in Sect. 20.6.

20.2 Privacy During Data Collection

The randomization method is designed for privacy-preservation at data collection time. The implicit assumption is that the data collector is not trusted, and therefore the privacy must be preserved at data collection time. The basic idea of the approach is to allow users to enter the data through a software platform that is able to add random perturbations to the data. This approach is one of the most conservative models for ensuring data privacy, because the original data records are never stored on any single server. The random perturbations are added using a publicly available distribution. Examples of commonly used perturbing distributions include the uniform and the Gaussian distributions. In other words, the probability distribution used to perturb the data is specified together with the data set if and when the data collector releases the data for public use. This additional distribution information is needed to use the data effectively in the context of data mining algorithms.
The basic idea is to reconstruct the distribution of the original data, by “subtracting out” the noise distribution. This aggregate distribution is then used for mining purposes. The overall approach is as follows:
20.2. PRIVACY DURING DATA COLLECTION 665 1. Privacy-preserving data collection: In this step, random noise is added to the data while collecting data from users, with the use of a software plugin. The collected data is publicly released along with the probability distribution function (and parameters) used to add the random noise. 2. Distribution reconstruction: The aggregate distribution of the original data are recon- structed, by “subtracting out” the noise. Thus, at the end of this step, we will have a histogram representing the approximate probability distribution of the data values. 3. Data mining: Data mining methods are applied to the reconstructed distributions. It is important to note that the last step of the process requires the design of data mining algorithms that work with probability distributions of sets of data records, rather than individual records. Thus, one disadvantage of this approach is that it requires the redesign of data mining algorithms. Nevertheless, the approach can be made to work because many data mining problems such as clustering and classification require only the probability distribution modeling of either the whole data set, or segments (e.g., different classes) of the data. 20.2.1 Reconstructing Aggregate Distributions The reconstruction of the aggregate distribution of the original data is the key step in the randomization method. Consider the case where the original data values x1 . . . xn are drawn from the probability distribution X. For each original data value xi, a perturbation yi is added by the software data collection tool, to yield the perturbed value zi. The perturbation yi is drawn from the probability distribution Y, and is independent of X. It is assumed that this distribution is known publicly. Furthermore, the probability distribution of the final set of perturbed values is assumed to be Z. Therefore, the original distribution X, the added perturbation Y and the final aggregate distribution Z are related as follows: Z=X+Y X=Z−Y Thus, the probability distribution of X can be reconstructed, if the distributions of Y and Z are known explicitly. The probability distribution of Y is assumed to be publicly available, while discrete samples of Z are available in terms of z1 . . . zn. These discrete samples are sufficient to reconstruct Z using a variety of methods, such as kernel density estimation. Then, the distribution for X can be reconstructed using the relationship shown above. The main problem with this approach emerges when the probability distribution of the perturbation Y has a large variance and the number n of discrete samples of Z is small. In such a case, the distribution of Z also has a large variance, and it cannot be accurately estimated with a small number of samples. Therefore, a second approach is to directly estimate the distribution of X from the discrete samples of Z and the known distribution of Y. Let fX and FX be the probability density and cumulative distributions functions of X. These functions need to be estimated with the observed values z1 . . . zn. Let fˆX and FˆX be the corresponding estimated probability density and cumulative distribution functions for X. The key here is to use the Bayes formula with the use of observed values for Z. Consider a simplified scenario in which only a single observed value z1 is available. This can be used
to estimate the cumulative distribution function F̂_X(a) at any value of the random variable X = a. The Bayes theorem yields the following:

\hat{F}_X(a) = \frac{\int_{w=-\infty}^{a} f_X(w \mid X + Y = z_1)\, dw}{\int_{w=-\infty}^{\infty} f_X(w \mid X + Y = z_1)\, dw}        (20.1)

The conditional in the aforementioned equation corresponds to the fact that the sum of the data values and perturbed values is equal to z_1. Note that the expression f_X(w | X + Y = z_1) can be expressed in terms of the unconditional densities of X and Y, as follows:

f_X(w \mid X + Y = z_1) = f_Y(z_1 - w) \cdot f_X(w)        (20.2)

This expression uses the fact that the perturbation Y is independent of X. By substituting the aforementioned expression for f_X(w | X + Y = z_1) in the right-hand side of Eq. 20.1, the following expression is obtained for the cumulative distribution of X:

\hat{F}_X(a) = \frac{\int_{w=-\infty}^{a} f_Y(z_1 - w) \cdot f_X(w)\, dw}{\int_{w=-\infty}^{\infty} f_Y(z_1 - w) \cdot f_X(w)\, dw}        (20.3)

The expression for F̂_X(a) was derived using a single observation z_1, and needs to be generalized to the case of n different observations z_1 . . . z_n. This can be achieved by averaging the previous expression over the n different values:

\hat{F}_X(a) = \frac{1}{n} \cdot \sum_{i=1}^{n} \frac{\int_{w=-\infty}^{a} f_Y(z_i - w) \cdot f_X(w)\, dw}{\int_{w=-\infty}^{\infty} f_Y(z_i - w) \cdot f_X(w)\, dw}        (20.4)

The corresponding density distribution can be obtained by differentiating F̂_X(a). This differentiation results in the removal of the integral sign from the numerator and the corresponding instantiation of w to a. Since the denominator is a constant, it remains unaffected by the differentiation. Therefore, the following is true:

\hat{f}_X(a) = \frac{1}{n} \cdot \sum_{i=1}^{n} \frac{f_Y(z_i - a) \cdot f_X(a)}{\int_{w=-\infty}^{\infty} f_Y(z_i - w) \cdot f_X(w)\, dw}        (20.5)

The aforementioned equation contains the density function f_X(·) on both sides. This circularity can be resolved naturally with the use of an iterative approach. The iterative approach initializes the estimate of the distribution f_X(·) to the uniform distribution. Subsequently, the estimate of this distribution is continuously updated as follows:

Set \hat{f}_X(\cdot) to be the uniform distribution;
repeat
    Update \hat{f}_X(a) = \frac{1}{n} \cdot \sum_{i=1}^{n} \frac{f_Y(z_i - a) \cdot \hat{f}_X(a)}{\int_{w=-\infty}^{\infty} f_Y(z_i - w) \cdot \hat{f}_X(w)\, dw}
until convergence

So far, it has been described how to compute f_X(a) for a particular value of a. In order to generalize this approach, the idea is to discretize the range of the random variable X into k intervals, denoted by [l_1, u_1] . . . [l_k, u_k]. It is assumed that the density distribution is uniform over the discretized intervals. For each such interval [l_i, u_i], the density distribution is evaluated at the midpoint a = (l_i + u_i)/2 of the interval. Thus, in each iteration, k different values of a are used. The algorithm is terminated when the distribution does not change significantly over successive steps of the algorithm. A variety of methods can be used to compare the two distributions, such as the χ² test. The simplest approach is to examine the average change in the density values at the midpoints of the density distribution over successive iterations. While this algorithm is known to perform effectively in practice, it is not proven to converge to an optimal solution. An expectation maximization (EM) method was proposed in a later work [28], which provably converges to the optimal solution.
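The iterative scheme above can be implemented in a few lines. The following sketch, which is not part of the original text, discretizes the range of X into intervals, evaluates the density at the interval midpoints, and applies the update of Eq. 20.5 until convergence; the function names, the convergence tolerance, and the choice of Gaussian data perturbed with uniform noise are illustrative assumptions only.

```python
import numpy as np

def reconstruct_distribution(z, noise_pdf, bins, num_iters=100):
    """Iteratively estimate the density of the original data X at the
    bin midpoints, given perturbed observations z = x + y and the known
    noise density f_Y (the update of Eq. 20.5 applied at the midpoints)."""
    mids = 0.5 * (bins[:-1] + bins[1:])        # midpoints a of the k intervals
    width = np.diff(bins)                      # interval widths for integration
    f_hat = np.full(len(mids), 1.0 / (bins[-1] - bins[0]))   # uniform start
    for _ in range(num_iters):
        new_f = np.zeros_like(f_hat)
        for z_i in z:
            fy = noise_pdf(z_i - mids)           # f_Y(z_i - a) at all midpoints
            denom = np.sum(fy * f_hat * width)   # approximates the integral
            if denom > 0:
                new_f += fy * f_hat / denom
        new_f /= len(z)
        if np.max(np.abs(new_f - f_hat)) < 1e-6:   # convergence check
            f_hat = new_f
            break
        f_hat = new_f
    return mids, f_hat

# Example: original data from a Gaussian, perturbed with uniform noise
rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=1.0, size=2000)      # original (never released)
y = rng.uniform(-2.0, 2.0, size=2000)              # publicly known noise
z = x + y                                          # what the collector sees
uniform_pdf = lambda t: ((t >= -2.0) & (t <= 2.0)) / 4.0
mids, f_hat = reconstruct_distribution(z, uniform_pdf, np.linspace(0, 10, 41))
```

Only the perturbed values z and the noise distribution are needed; the reconstructed histogram (mids, f_hat) is the aggregate distribution that downstream mining algorithms consume.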
20.3. PRIVACY-PRESERVING DATA PUBLISHING 667 change significantly over successive steps of the algorithm. A variety of methods can be used to compare the two distributions such as the χ2 test. The simplest approach is to examine the average change in the density values, at the midpoints of the density distribution over successive iterations. While this algorithm is known to perform effectively in practice, it is not proven to be a optimally convergent solution. An expectation maximization (EM) method was proposed in a later work [28], which provably converges to the optimal solution. 20.2.2 Leveraging Aggregate Distributions for Data Mining The aggregate distributions determined by the algorithm can be leveraged for a variety of data mining problems, such as clustering, classification, and collaborative filtering. This is because each of these data mining problems can be implemented with aggregate statistics of the data, rather than the original data records. In the case of the classification problem, the probability distributions of each of the classes can be reconstructed from the data. These distributions can then be used directly in the context of a naive Bayes classifier, as discussed in Chap. 10. Other classifiers, such as decision trees, can also be modified to work with aggregate distributions. The key is to use the aggregate distributions in order to design the split criterion of the decision tree. The bibliographic notes contain pointers to data mining algorithms that use the randomization method. The approach cannot be used effectively for data mining problems such as outlier detection that are dependent on individual data record values, rather than aggregate values. In general, outlier analysis is a difficult problem for most private data sets because of the tendency of outliers to reveal private information. 20.3 Privacy-Preserving Data Publishing Privacy-preserving data publishing is distinct from privacy-preserving data collection, because it is assumed that all the records are already available to a trusted party, who might be the current owner of the data. This party then wants to release (or publish) this data for analysis. For example, a hospital might wish to release anonymized records about patients to study the effectiveness of various treatment alternatives. This form of data release is quite useful, because virtually any data mining algorithm can be used on the released data. To determine sensitive information about an individual, there are two main pieces of information that an attacker (or adversary) must possess. 1. Who does this data record pertain to? While a straightforward way to determine the identity is to use the identifying attribute (e.g., Social Security Number), such attributes are usually stripped from the data before release. As will be discussed later, these straightforward methods of sanitization are usually not sufficient, because attackers may use other attributes, such as the age and ZIP code, to make linkage attacks. 2. In addition to identifying attributes, data records also contain sensitive attributes that most individuals do not wish to share with others. For example, when a hospital releases medical data, the records might contain sensitive disease-related attributes. Different attributes in a data set may play different roles in either facilitating identification or facilitating sensitive information release. There are three main types of attributes:
1. Explicit identifiers: These are attributes that explicitly identify an individual. For example, the Social Security Number (SSN) of an individual can be considered an explicit identifier. Because this attribute is almost always removed in the data sanitization process, it is not relevant to the study of privacy algorithms.

2. Pseudo-identifier or quasi-identifier (QID): These are attributes that do not explicitly identify an individual in isolation, but can nevertheless be used in combination to identify an individual by joining them with publicly available information, such as voter registration rolls. This kind of attack is referred to as a linkage attack. Examples of such attributes include the age and ZIP code. Strictly speaking, quasi-identifiers refer to the specific combination of attributes used to make a linkage attack, rather than the individual attributes.

3. Sensitive attribute: These are attributes that are considered private by most individuals. For example, in a medical data set, individuals would not like information about their diseases to be known publicly. In fact, many laws in the USA, such as the Health Insurance Portability and Accountability Act (HIPAA), explicitly forbid the release of such information, especially when the sensitive attributes can be linked back to specific individuals.

Table 20.1: Example of a data table

    SSN            Age   ZIP Code   Disease
    012-345-6789   24    10598      HIV
    823-627-9231   37    90210      Hepatitis C
    987-654-3210   26    10547      HIV
    382-827-8264   38    90345      Hepatitis C
    847-872-7276   36    89119      Diabetes
    422-061-0089   25    02139      HIV

Most of the discussion in this chapter will be restricted to quasi-identifiers and sensitive attributes. To illustrate the significance of these attribute types, an example will be used. In Table 20.1, the medical records of a set of individuals are illustrated. The SSN attribute is an explicit identifier that can be utilized to identify an individual directly. Such directly identifying information will almost always be removed from a data set before release. However, the impact of attributes such as the age and the ZIP code on identification is quite significant. While these attributes do not directly identify an individual, they provide very useful hints when combined with other publicly available information. For example, it is possible for a small geographic region, such as a ZIP code, to contain only one individual of a specific gender, race, and date of birth. When combined with publicly available voter registration rolls, one might be able to identify an individual from these attributes. Such a combination of publicly available attributes is referred to as a quasi-identifier.

Table 20.2: Example of a snapshot of fictitious voter registration rolls

    Name         Age   ZIP Code
    Mary A.      38    90345
    John S.      36    89119
    Ann L.       31    02139
    Jack M.      57    10562
    Joy M.       26    10547
    Victor B.    46    90345
    Peter P.     25    02139
    Diana X.     24    10598
    William W.   37    90210
    Sue G.       26    10547

To understand the power of quasi-identifiers, consider a snapshot of the voter registration rolls illustrated in Table 20.2. Even in cases where the SSN is removed from Table 20.1 before release, it is possible to join the two tables with the use of the age and ZIP code attributes. This will provide a list of the possible matches for each data record. For example, Joy and Sue are the only two individuals in the voter registration rolls matching an individual with HIV in the medical release of Table 20.1. Therefore, one can tell with 50% certainty that Joy and Sue have HIV. This is not desirable, especially when an adversary has other background medical information about Joy or Sue to further narrow down the possibilities. Similarly, William is the only individual in the voter registration rolls who matches an individual with hepatitis C in the medical release. In cases where only one data record in the voter registration rolls matches the particular combination of age and ZIP code, sensitive medical conditions about that individual may be fully compromised. This approach is referred to as a linkage attack.
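The linkage attack described above amounts to a join on the quasi-identifier columns. The following is a minimal sketch, not part of the original text, that reproduces the attack on (reconstructed) versions of Tables 20.1 and 20.2; the use of pandas and the column names are illustrative assumptions.

```python
import pandas as pd

# Hypothetical recreation of the two tables (SSNs already stripped).
medical = pd.DataFrame({
    "Age": [24, 37, 26, 38, 36, 25],
    "ZIP": ["10598", "90210", "10547", "90345", "89119", "02139"],
    "Disease": ["HIV", "Hepatitis C", "HIV", "Hepatitis C", "Diabetes", "HIV"],
})
voters = pd.DataFrame({
    "Name": ["Mary A.", "John S.", "Ann L.", "Jack M.", "Joy M.",
             "Victor B.", "Peter P.", "Diana X.", "William W.", "Sue G."],
    "Age": [38, 36, 31, 57, 26, 46, 25, 24, 37, 26],
    "ZIP": ["90345", "89119", "02139", "10562", "10547",
            "90345", "02139", "10598", "90210", "10547"],
})

# The linkage attack: join on the quasi-identifier (Age, ZIP).
linked = voters.merge(medical, on=["Age", "ZIP"])
print(linked)   # e.g., William W. is linked to the Hepatitis C record
```

Whenever a quasi-identifier combination is unique in the voter rolls, the corresponding sensitive value is fully disclosed by this join.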
Most anonymization algorithms focus on preventing identity disclosure, rather than explicitly hiding the sensitive attributes. Thus, only the attributes that can be combined to construct quasi-identifiers are changed or specified approximately in the data release, whereas sensitive attributes are released in their exact form.

Many privacy-preserving data publishing algorithms assume that the quasi-identifiers are drawn out of a set of attributes that are not sensitive, because they can only be used by an adversary by performing joins with (nonsensitive) publicly available information. This assumption may, however, not always be reasonable when an adversary has (sensitive) background information about a target at hand. Adversaries are often familiar with their targets, and they can be assumed to have background knowledge about at least a subset of the sensitive attributes. In a medical application with multiple disease attributes, knowledge about a subset of these attributes may reveal the identity of the subject of the record. Similarly, in a movie collaborative filtering application, where anonymized ratings are released, it may be possible to obtain information about a particular user's ratings on a subset of movies, through personal interaction or other rating sources. If this combination is unique to the individual, then the other ratings of the individual are compromised as well. Thus, sensitive attributes also need to be perturbed when background knowledge is available. Much of the work in the privacy literature assumes a rigid distinction between the role of publicly available attributes (from which the quasi-identifiers are constructed) and that of the sensitive attributes. In other words, sensitive attributes are not perturbed because it is assumed that revealing them does not incur the risk of a linkage attack with publicly available information. There are, however, a few algorithms that do not make this distinction. Such algorithms generally provide better privacy protection in the presence of background information.

In this section, several models for group-based anonymization, such as k-anonymity, ℓ-diversity, and t-closeness, will be introduced. While the recent models, such as ℓ-diversity, have certain advantages over the k-anonymity model, a good understanding of k-anonymity
670 CHAPTER 20. PRIVACY-PRESERVING DATA MINING is crucial in any study of privacy-preserving data publishing. This is because the basic framework for most of the group-based anonymization models was first proposed in the context of the k-anonymity model. Furthermore, many algorithms for other models, such as -diversity, build upon algorithms for k-anonymization. 20.3.1 The k-Anonymity Model The k-anonymity model is one of the oldest ones for data anonymization, and it is credited with the understanding of the concept of quasi-identifiers and their impact on data privacy. The basic idea in k-anonymization methods is to allow release of the sensitive attributes, while distorting only the attributes which are available through public sources of informa- tion. Thus, even though the sensitive attributes have been released, they cannot be linked to an individual through publicly available records. Before discussing the anonymization algorithms, some of the most common techniques for data distortion will be discussed. 1. Suppression: In this approach, some of the attribute values are suppressed. Depending on the algorithm used, the suppression can be done in a variety of ways. For example, one might omit some of the age or ZIP code attribute values from a few selected data records in Table 20.1. Alternatively, one might completely omit the entire record for a specific individual (row suppression) or the age attribute from all individuals (column suppression). Row suppression is often utilized to remove outlier records because such records are difficult to anonymize. Column suppression is commonly used to remove highly identifying attributes, or explicit identifiers, such as the SSN. 2. Generalization: In the case of generalization, the attributes are specified approxi- mately in terms of a particular range. For example, instead of specifying Age = 26 and Location (ZIP Code) = 10547 for one of the entries of Table 20.1, one might generalize it to Age ∈ [25, 30] and Location (State) = New York. By specifying the attributes approximately, it becomes more difficult for an adversary to perform linkage attacks. While numeric data can be generalized to specific ranges, the generalization of categorical data is somewhat more complicated. Typically, a generalization hierarchy of the categorical attribute values needs to be provided, for use in the anonymization process. For example, a ZIP code may be generalized to a city, which in turn may be generalized to a state, and so on. There is no unique way of specifying a domain hierar- chy. Typically, it needs to be semantically meaningful, and it is specified by a domain expert as a part of the input to the anonymization process. An example of a general- ization taxonomy of categorical attributes for the location attribute of Table 20.1 is provided in Fig. 20.1. This hierarchy of attribute values has a tree structure, and is referred to as a value generalization hierarchy. The notations A0 . . . A3 and Z0 . . . Z4 in Fig. 20.1 denote the domain generalizations at different levels of granularity. The corresponding domain generalization hierarchies are also illustrated in the Fig. 20.1 by the single path between Z0 . . . Z4 and A0 . . . A4. 3. Synthetic data generation: In this case, a synthetic data set is generated that mimics the statistical properties of the original data, at the group level. Such an approach can provide better privacy, because it is more difficult to map synthetic data records to particular groups of records. 
On the other hand, the data records are no longer truthful because they are synthetically generated. 4. Specification as probabilistic and uncertain databases: In this case, one might specify an individual data record as a probability distribution function. This is different from the
aggregate distribution approach of randomization because the probability distribution is data-record specific, and is designed to ensure k-anonymity. While this approach has not been studied intensively, it has the potential to allow the use of recent advances in the field of probabilistic databases for anonymization.

Figure 20.1: A value- and corresponding domain-generalization hierarchy for the age and ZIP code attributes (age levels A0–A3, generalizing ages to successively wider ranges; ZIP code levels Z0–Z3, generalizing ZIP codes to states, regions, and finally ALL)

Among the aforementioned methods, the generalization and suppression methods are most commonly used for anonymization. Therefore, most of the discussion in this section will be focused on these methods. First, the notion of k-anonymity will be defined.

Definition 20.3.1 (k-anonymity) A data set is said to be k-anonymized, if the attributes of each record in the anonymized data set cannot be distinguished from at least (k − 1) other data records. This group of indistinguishable data records is also referred to as an equivalence class.

Table 20.3: Example of a 3-anonymized version of Table 20.1

    Row Index   Age        ZIP Code          Disease
    1           [20, 30]   Northeastern US   HIV
    2           [30, 40]   Western US        Hepatitis C
    3           [20, 30]   Northeastern US   HIV
    4           [30, 40]   Western US        Hepatitis C
    5           [30, 40]   Western US        Diabetes
    6           [20, 30]   Northeastern US   HIV

To understand how generalization and suppression can be used for anonymization, consider the data set in Table 20.1. An example of a 3-anonymized version of this table is illustrated in Table 20.3. The SSN has been fully suppressed with column-wise suppression and replaced with an anonymized row index. Such explicit identifiers are almost always fully suppressed in anonymization. The two publicly available attributes corresponding to the age and ZIP code are now generalized and specified approximately. The subjects of the row indices 1, 3, and 6 can no longer be distinguished by using linkage attacks because their publicly available attributes are identical. Similarly, the publicly available attributes of row indices 2, 4, and 5 are identical. Thus, this table contains two equivalence classes containing three records each, and the data records cannot be distinguished from one another within these equivalence classes. In other words, an adversary can no longer match the identification of individual data records with voter registration rolls exactly. If any matching is found, then it is guaranteed that at least k = 3 records in the data set will match any particular individual in the voter registration roll.
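As a sanity check, k-anonymity of a generalized table can be verified by grouping on the quasi-identifier columns and confirming that every equivalence class contains at least k rows. The following short sketch is not part of the original text; the column names follow the reconstructed Table 20.3, and pandas is an illustrative choice of tool.

```python
import pandas as pd

def is_k_anonymous(df, quasi_identifiers, k):
    """Check that every combination of quasi-identifier values
    (i.e., every equivalence class) occurs in at least k records."""
    return df.groupby(quasi_identifiers).size().min() >= k

anonymized = pd.DataFrame({
    "Age": ["[20, 30]", "[30, 40]", "[20, 30]",
            "[30, 40]", "[30, 40]", "[20, 30]"],
    "ZIP": ["Northeastern US", "Western US", "Northeastern US",
            "Western US", "Western US", "Northeastern US"],
    "Disease": ["HIV", "Hepatitis C", "HIV", "Hepatitis C", "Diabetes", "HIV"],
})
print(is_k_anonymous(anonymized, ["Age", "ZIP"], k=3))   # True for Table 20.3
```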
The ZIP code is generalized with the use of the prespecified value generalization hierarchy of Fig. 20.1. The generation of a domain generalization hierarchy for a categorical attribute can be done in several ways, and depends on the skill of the analyst responsible for the privacy modifications. An alternate example of a domain generalization hierarchy for the ZIP code attribute is illustrated in Fig. 20.2.

Figure 20.2: An alternate value- and corresponding domain-generalization hierarchy for the ZIP code attribute (ZIP codes generalize digit by digit, e.g., 10547 → 105** → 10*** → 1**** → ALL)

A value generalization hierarchy on the continuous attributes does not require any special domain knowledge because it can be directly created by the analyst, using the actual distribution of the continuous values in the underlying data. This requires a simple hierarchical discretization of the continuous attributes.

The goal of the privacy-preservation algorithms is to replace the original values in the data (numeric or discrete) with one of the discrete values illustrated in the taxonomy trees of Fig. 20.1. Thus, the data is recoded in terms of a new set of discrete values. In most cases, the numeric attributes do retain their ordering, because the corresponding ranges are ordered. Different algorithms use different rules in the recoding process. These different ways of recoding attributes may be distinguished as follows:

• Global versus local recoding: In global recoding, a given attribute value is always replaced with the same discrete counterpart from the domain generalization hierarchy over all data records. Consider the aforementioned example of Fig. 20.1, in which a ZIP code can be generalized either to a state or to a region. In global recoding, the particular ZIP code value of 10547 needs to be consistently replaced by either Northeastern US or New York over all the data records. However, for a different ZIP code such as 90210, a different level of the hierarchy may be selected than for the 10547 value, as long as it is done consistently for a particular data value (e.g., 10547 or 90210) across all data records. In local recoding, different data records may use different generalizations for the same data value. For example, one data record might use Northeastern US, whereas another data record might use New York for 10547. While local recoding might seem to be better optimized, because of its greater flexibility, it does lose a different kind of information. In particular, because the same ZIP code might map to different values, such as New York and Northeastern US, the similarity computation
20.3. PRIVACY-PRESERVING DATA PUBLISHING 673 between the resulting data records may be less accurate. Most of the current privacy schemes use global recoding. • Full-domain generalization: Full-domain generalization is a special case of global recoding. In this approach, all values of a particular attribute are generalized to the same level of the taxonomy. For example, a ZIP code might be generalized to its state for all instances of the attribute. In other words, if the ZIP code 10547 is generalized to New York, then the ZIP code 90210 must be generalized to California. The various hierarchical alternatives for full-domain generalization of the age attribute are denoted by A0, A1, A2, and A3 in Fig. 20.1. The possible full-domain generalization levels of the ZIP code are denoted by Z0, Z1, Z2, and Z3. In this case, Z3 represents the high- est level of generalization (column suppression), and Z0 represents the original values of the ZIP code attribute. Thus, once it is decided that the anonymization algorithm should use Z2 for the ZIP code attribute, then every instance of the ZIP code attribute (Z0) in the data set is replaced with its generalized value in Z2. This is the reason that the approach is referred to as full-domain generalization, as the entire domain of data values for a particular attribute is generalized to the same level of the hierarchy. Full-domain generalization is the most common approach used in privacy-preserving data publishing. Full-domain generalization is intuitively appealing because it ensures that the different values of an attribute have the same level of granularity throughout the data set. The earliest methods, such as Samarati’s original algorithm, and Incognito, were all full-domain generalization algorithms. 20.3.1.1 Samarati’s Algorithm Samarati’s algorithm was first proposed in the context of the definition of k-anonymity. Samarati’s original AG-TS (Attribute Generalization and Tuple Suppression) algorithm for k-anonymity provides the basic domain generalization framework, which is the basis for group-based anonymization. It has already been discussed, how the domain generalization of a single attribute can be represented as a path. For example, the path from Z0 to Z3 in Fig. 20.1 represents the generalization of the ZIP code attribute. The notion of domain generalization can also be defined for combinations of attributes. However, in the case of attribute combinations, the relationships are no longer expressed as a path, but as a special kind of directed acyclic graph, known as a lattice. In this case, each node specifies a (full- domain) generalization level for the different attributes For example, < A1, Z2 > denotes the domain generalization level of age to A1 and ZIP code to Z2. In other words, every data record is generalized to the level < A1, Z2 >. Note that < A1, Z2 > also represents the generalization level of the (anonymized) Table 20.3 based on the domain-generalization hierarchies specified in Fig. 20.1. Thus, each node in the lattice specifies a possible level of full-domain generalization, in terms of which the original data is represented. The edges in this graph represent the direct generalization relationships among these tuples of domains. A directed path in the lattice, from lower to higher levels, represents a sequence of generalizations. Conversely, a lower-level node is a specialization of a higher-level node. 
For example, the node < A1, Z1 > is a direct specialization of either < A1, Z2 > or < A2, Z1 >, because a single attribute in either can be specialized once to immediately yield < A1, Z1 >. An example of the domain generalization hierarchy for the age and ZIP code combination is illustrated in Fig. 20.3a. The goal of the full-domain anonymization algorithm is to discover the node < Ai, Zj > in
674 CHAPTER 20. PRIVACY-PRESERVING DATA MINING

Figure 20.3: Domain generalization hierarchies over combinations of attributes. (a) 2-attribute lattice; (b) k-anonymous portion

this tuple-based domain generalization hierarchy that preserves k-anonymity with the least amount of generalization. After such a node < Ai, Zj > has been discovered, the privacy algorithm generalizes all ages to the level Ai and all ZIP codes to the level Zj. In practice, some of the tuples may need to be suppressed in order to prevent undesirably high levels of generalization. This is because these may represent outlier tuples that cannot be incorporated in any group without significantly increasing the generalization level. For example, an individual with an age of 125 may need to be suppressed because of the outlier value of this attribute. Therefore, one of the parameters to the algorithm is a threshold MaxSup, which specifies the maximum number of tuples that can be suppressed. The goal is therefore to discover a node that is as low as possible in the lattice of Fig. 20.3a, such that k-anonymity is satisfied after suppressing at most MaxSup tuples. The height of a node in the lattice is defined as its path distance in the lattice from the most specific level of representation. In the example of Fig. 20.3, the height of node < Zi, Aj > is (i + j). A minimally generalized node may be defined as a node for which the height is as small as possible. Therefore, in this example, one way of determining minimal generalizations is to discover a k-anonymizable node < Zi, Aj >, such that the height (i + j) is as small as possible. When there are d attributes, the generalization levels of a particular combination of attributes may be denoted by < Qi1 . . . Qid >, and the sum i1 + . . . + id over all attributes represents the height of that particular combination of generalizations. It is easy to see that any specialization of a node < Qi1 . . . Qid > that does not satisfy k-anonymity will also not satisfy k-anonymity. Similarly, any generalization of a node satisfying k-anonymity will also satisfy k-anonymity. Therefore, the subgraph of the lattice satisfying k-anonymity and the subgraph violating k-anonymity are both connected subgraphs, and a border can be constructed between them. An example of such a border1 is illustrated in Fig. 20.3b, and the corresponding minimal generalizations are illustrated in the same figure. Note that the minimal generalization is not unique, and that two possible minimal generalizations < Z2, A2 > and < Z1, A3 > are possible in this example. The reason for using minimally generalized nodes is to maximize the utility of the data for analytical algorithms. Other

1This border is for illustration purposes only, and does not correspond to any data set in this chapter.
20.3. PRIVACY-PRESERVING DATA PUBLISHING 675

more refined definitions can be used for quantifying utility that use the distribution of the attribute values more explicitly. The bibliographic notes contain pointers to some of these definitions. Samarati's algorithm uses a simple binary search over the lattice of domain generalization tuples. Let [0, hmax] represent the range of heights of the lattice. It is then checked whether any of the generalizations at level hmax/2 satisfies the k-anonymity constraint. If this is indeed the case, then the height hmax/4 is checked. Otherwise, the height 3 · hmax/4 is checked. This approach is repeated until the lowest height at which a k-anonymous solution exists is found. All the corresponding domain generalizations are reported, and any of these can be used for transforming the data. An important step in Samarati's algorithm is the process of using the original database to check whether a particular node in the lattice satisfies k-anonymity. However, a discussion of this step is omitted here, because similar steps are discussed below in the context of the Incognito algorithm.

20.3.1.2 Incognito

The lattice of Fig. 20.3 shares a number of conceptual similarities with the lattice of frequent itemset mining algorithms, as discussed in Chap. 4. Therefore, some of the anonymization algorithms for discovering full-domain generalization also have similar characteristics to those of frequent itemset mining algorithms. The Incognito algorithm leverages a number of principles from frequent pattern mining to efficiently discover the k-anonymous portion of the lattice. An important observation is that the size of the lattice is exponentially related to the number of quasi-identifiers. This can lead to increasing computational complexity in many practical scenarios. While it has been shown by Meyerson and Williams [385] that optimal k-anonymization is NP-hard, it is possible to reduce the computational burden by careful exploration of the lattice. The Incognito algorithm is based on the observation that the k-anonymity of a subset of generalized attributes is a necessary (but not sufficient) condition for the k-anonymity of a superset of attributes with matching generalization levels of the common elements. Henceforth, this property will be referred to as attribute subset closure. This property is a specific case of the generalization property, which states that any generalization of a k-anonymous node in the lattice will always be k-anonymous. These properties can be used to both generate candidates and prune the search process in a manner that is similar to the Apriori algorithm for frequent itemset mining. Therefore, nodes that are not k-anonymous with respect to a set of attributes can be discarded, together with their specializations in the lattice hierarchy. Furthermore, generalizations of subsets of attributes that do satisfy the k-anonymity constraint do not need to be checked because they are guaranteed to be k-anonymous. The Incognito approach uses a levelwise approach, in which the following steps are repeated iteratively, until the k-anonymous sublattice containing all d attributes has been constructed. The set Fi denotes the set of all sublattices on i attributes that satisfy k-anonymity. The algorithm starts by initializing F1 to the portions of the single-attribute domain generalization hierarchies satisfying k-anonymity. This is quite simple, because single-attribute hierarchies are paths.
Thus, F1 is simply the top portion of the path, such that each generalized attribute value contains at least k tuples. Subsequently, as in frequent pattern mining, the algorithm repeatedly generates candidate sublattices in Ci+1 by joining sublattices in Fi that have exactly (i − 1) attributes in common. The process of joining two sublattices will be described later. Note that Ci+1 is a set of candidate sublattices on (i + 1) attributes. Each of these sublattices is then pruned of some of its nodes, using an
676 CHAPTER 20. PRIVACY-PRESERVING DATA MINING Apriori-style approach. Specifically, nodes of sublattices in Ci+1 whose generalizations are not k-anonymous in Fi can be pruned. This step will also be described in detail later. After candidate generation and pruning, the portion of each sublattice that satisfies k-anonymity is retained by checking the constituent nodes against the base data records. Thus, each sublattice in Ci+1 reduces further in size. At this point, the set Ci+1 has been transformed to the set Fi+1. Thus, the following steps are repeated for increasing values of the index i: 1. Generate Ci+1, the set of candidate sublattices on (i + 1) attributes. This is achieved by joining all pairs of k-anonymous sublattices in Fi that share (i − 1) attributes. The details of a join between a pair of sublattices will be described later. 2. Prune the nodes from each sublattice in Ci+1 that cannot possibly satisfy k-anonymity by using the attribute subset closure property with respect to the set of k-anonymous combinations in Fi. The details of how the nodes may be pruned from a sublattice, will be described later. 3. Check each node in each (already pruned) sublattice of Ci+1 against the base data, and remove those that do not satisfy k-anonymity. A node does not need to be checked, if one of its specializations already satisfies k-anonymity. This step transforms the set of candidate sublattices Ci+1 to the set of k-anonymous sublattices Fi+1 by removing the anonymity-violating sublattices. If there are a total of d attributes, then the set Fd will contain a single sublattice of nodes satisfying k-anonymity. The nodes with the smallest height in this sublattice are reported. Note that the detailed implementation of the Incognito algorithm uses a slightly different approach for actually tracking the sublattices, by tracking the lattice nodes and edges in separate tables. The i-dimensional tables containing the generalization levels of lattice nodes of Fi are joined on their (i − 1) common attributes to create the (i + 1)-dimensional tables containing the nodes of Ci+1. Subsequently, the lattice edges are added between the generated nodes based on the hierarchy relationships. Nevertheless, the simpler logical description provided here matches the Incognito algorithm. Next, the details of the join and pruning operations will be discussed with the use of an example. In this case, three attributes will be used for greater clarity. As discussed earlier, let Ar and Zr represent different generalization levels of the age and ZIP code attributes, for varying values of the index r. Let Pr represent the generalization levels of an additional attribute corresponding to the profession. Higher values of the index r indicate a greater level of generalization. Consider the scenario where all three k-anonymous two-attribute sublattices on these three attributes are already available in F2. It is possible to use any pair of sublattices from these three possibilities, in order to perform the join. This will result in a candidate sublattice on all three attributes. Consider the case, where the sublattices on (ZIP code, Age) and (ZIP code, Profession) are joined. The nodes in the new candidate sublattice will now have three attributes (ZIP code, Profession, Age) instead of two. The nodes for the new candidate sublattice are constructed by joining the nodes of the two k-anonymous sublattices. A pair of nodes < Zr, Aj > and < Zs, Pl > will be joined, if and only if r = s. 
In other words, the generalization level of the ZIP code attribute needs to be the same in both cases. This will result in the new node < Zr, Pl, Aj >. In general, for pairs of nodes with k attributes, a join will be successfully executed, if and only if (a) they share (k − 1) attributes, and (b) the generalization levels of the (k − 1) common attributes are the same. An example of a join with two k-anonymous sublattices is illustrated in Fig. 20.4a.
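The node-level join rule just described can be made concrete with a small sketch. The following Python fragment is purely illustrative; the dictionary representation of lattice nodes and the function name join_nodes are assumptions made for this example and are not part of the original Incognito implementation.

# Illustrative sketch of the Incognito-style node join described above.
# A lattice node is represented as a dictionary mapping an attribute name to
# its generalization level, e.g., {"zip": 2, "age": 1} for the node <Z2, A1>.
def join_nodes(node_a, node_b):
    # Two nodes with the same number of attributes are joinable only if they
    # share all but one attribute and agree on the generalization levels of
    # the shared attributes. Returns the merged node, or None otherwise.
    if len(node_a) != len(node_b):
        return None
    shared = set(node_a) & set(node_b)
    if len(shared) != len(node_a) - 1:
        return None
    if any(node_a[attr] != node_b[attr] for attr in shared):
        return None
    merged = dict(node_a)
    merged.update(node_b)
    return merged

# Joining <Z2, A1> with <Z2, P3> yields the three-attribute candidate <Z2, P3, A1>.
print(join_nodes({"zip": 2, "age": 1}, {"zip": 2, "prof": 3}))
# Joining <Z2, A1> with <Z1, P3> fails because the ZIP code levels differ.
print(join_nodes({"zip": 2, "age": 1}, {"zip": 1, "prof": 3}))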
20.3. PRIVACY-PRESERVING DATA PUBLISHING 677

Figure 20.4: Incognito joins and pruning. (a) Incognito join between two sub-lattices; (b) Incognito pruning

In the previous example, the sublattice for the profession–age combination was not used for the join. However, it is still useful for pruning. This is because, if a node < Pi, Aj > is not present in this sublattice, then any node of the form < Zm, Pi, Aj > will also not be k-anonymous. Therefore, such nodes can be removed from the constructed candidate sublattice together with their specializations. An example of a pruning step on the candidate sublattice is illustrated in Fig. 20.4b. This pruning is based on the attribute-subset closure property, and it is reminiscent of Apriori pruning in frequent itemset mining. As in the case of frequent itemset mining, all k-attribute subsets of each candidate (k + 1)-sublattice in Ck+1 need to be checked. If a node violates the closure property in any of these checks, then it is pruned. Finally, the generated nodes in Ck+1 need to be checked against the original database to determine whether they satisfy k-anonymity. For example, in order to determine whether < Z1, A1 > satisfies k-anonymity based on the value generalization in Fig. 20.1, one needs to determine the number of individuals satisfying each of the pairs of conditions, such as (ZIP code ∈ NY, 0 < Age ≤ 10), (ZIP code ∈ NY, 10 < Age ≤ 20), (ZIP code ∈ MA, 0 < Age ≤ 10), and so on. Therefore, for each node in the lattice,
678 CHAPTER 20. PRIVACY-PRESERVING DATA MINING

a vector of frequency values needs to be computed. This vector is also referred to as a frequency vector or frequency set. The process of frequency vector computation can be expensive because the original database may need to be scanned to determine the number of tuples satisfying these conditions. However, several strategies can be used to reduce the burden of computation. For example, if the frequency vector of < Z1, A1 > has already been computed, one can use roll-up to directly compute the frequency vector of the generalization < Z2, A1 > without actually scanning the database. This is because the frequency of the set (ZIP code ∈ Northeastern US, 0 < Age ≤ 10) is the sum of the frequencies of (ZIP code ∈ NY, 0 < Age ≤ 10), (ZIP code ∈ NJ, 0 < Age ≤ 10), (ZIP code ∈ MA, 0 < Age ≤ 10), and so on. The simplest approach is to use a breadth-first strategy on the lattice of each set of (k + 1) attributes, by determining the frequency vectors of specific (lower-level) nodes in the lattice before determining the frequency vectors of more general (higher-level) nodes. The frequency vectors of higher-level nodes can be computed efficiently from those of lower-level nodes by using the roll-up property. Note that a separate breadth-first search needs to be performed for each subset of (k + 1) attributes in Ck+1 to compute its frequency vectors. Furthermore, once a node has been identified by the breadth-first search to be k-anonymous, its generalizations in the lattice are guaranteed to be k-anonymous. Therefore, they are automatically marked as k-anonymous and are not explicitly checked. The original algorithm also supports a number of other optimizations, referred to as Incognito super-roots and Bottom-up precomputation. The bibliographic notes contain pointers to these methods.

20.3.1.3 Mondrian Multidimensional k-Anonymity

One of the disadvantages of the methods discussed so far is that the domain generalization hierarchies for various attributes are constructed independently as a preprocessing step. Thus, after the hierarchical discretization (domain generalization) for a numeric attribute has been fixed by the preprocessing step, it is utilized by the anonymization algorithm. This rigidity in the anonymization process creates inefficiencies in data representation when the various data attributes are correlated in multidimensional space. For example, the salary distribution for older individuals may be different from that of younger individuals. A preprocessed domain generalization hierarchy is unable to adjust to such attribute correlations in the data set. In general, the best trade-offs between privacy and utility are achieved when the multidimensional relationships among data points are leveraged in the anonymization process. In other words, the attribute ranges for each attribute in a data point X should be generated in a dynamic way, depending on the specific multidimensional locality of X. The Mondrian method generates multidimensional rectangular regions containing at least k data points. This is achieved by recursively dividing the bounding boxes with axis-parallel cuts, until the regions cannot be divided any further without creating regions with fewer than k data points. This approach is not very different from the methodology used by many traditional index structures, such as kd-trees. An example of the partitioning induced by the Mondrian algorithm is illustrated in Fig. 20.5. In this case, a 5-anonymous partitioning is illustrated.
Thus, each group contains at least five data points. It is easy to see that the same attribute value is represented by different ranges in different portions of the data, in order to account for the varying density of different regions. It is this flexibility that gives Mondrian a more compact representation of the anonymized groups than the other methods. The Mondrian algorithm dynamically maintains the set ℬ of multidimensional generalizations that satisfy k-anonymity and cover the data set. The Mondrian algorithm starts with a rectangular box B of all the data points. This represents the generalization
20.3. PRIVACY-PRESERVING DATA PUBLISHING 679

Figure 20.5: A sample 5-anonymous Mondrian multidimensional partitioning (salary versus age)

of the entire data set to a single multidimensional region, and therefore trivially satisfies k-anonymity. The algorithm therefore starts by initializing ℬ = {B}. The algorithm repeatedly uses the following steps:

1. Select a rectangular region R ∈ ℬ containing at least 2 · k data points, such that a valid split into a pair of k-anonymous subsets exists.

2. Split the rectangular region R along any of the dimensions with an axis-parallel split, so that each of R1 and R2 contains at least k data points.

3. Update ℬ ⇐ ℬ ∪ {R1, R2} − {R}.

This iterative process is repeated, until the rectangular regions cannot be split any further without violating k-anonymity. There is some flexibility in the choice of the dimension for performing the split. A natural heuristic is to split the longest dimension of the selected rectangular region. After the dimension has been selected, the split should be performed so that the data points are partitioned as evenly as possible. In the absence of ties on attribute values, the data points can be divided almost equally into the two regions. The rectangular regions in ℬ define the equivalence classes that are utilized for k-anonymization. If each numeric attribute value is unique, it can be shown that every region will contain at most 2 · k − 1 data points. However, if there are ties among attribute values, and tied values need to be assigned to the same partition, then an upper bound of m + 2d · (k − 1) can be shown on the number of data points in each partition. Here m is the number of identical copies of any data record. On the other hand, if ties on an attribute value can be flexibly assigned to any partition, then the maximum number of points in any rectangular partition at the end of the process will be 2 · k − 1. The reader is referred to the bibliographic notes for the pointer to the proof of this bound. After the data has been divided into rectangular regions, the following approaches can be used for reporting the anonymized data points:

1. The averages along each dimension may be reported for each anonymized equivalence set.

2. The multidimensional bounding box of the data points may be reported.

The Mondrian algorithm has been shown to be more effective than the Incognito algorithm, because of the greater flexibility provided by the multidimensional approach to partitioning.
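To make the partitioning loop concrete, the following Python sketch implements a simplified version of the recursive splitting described above: it selects the widest dimension of a region, performs a median split, and stops when a region can no longer be divided into two parts of size at least k. The function and variable names are assumptions for illustration, and tie handling is deliberately simplified (tied values may fall on either side of a cut, corresponding to the flexible tie-assignment variant discussed above).

import numpy as np

def mondrian_partition(points, k):
    # points: an (n x d) array of numeric quasi-identifier values.
    # Returns a list of index arrays, one per k-anonymous group.
    groups = []

    def split(region):
        # A region with fewer than 2*k points cannot be split without
        # violating k-anonymity, so it becomes a final equivalence class.
        if len(region) < 2 * k:
            groups.append(region)
            return
        sub = points[region]
        # Heuristic from the text: split along the dimension with the widest range.
        dim = int(np.argmax(sub.max(axis=0) - sub.min(axis=0)))
        order = region[np.argsort(sub[:, dim])]
        mid = len(order) // 2          # median split: both sides retain >= k points
        split(order[:mid])
        split(order[mid:])

    split(np.arange(len(points)))
    return groups

# Example: a 5-anonymous partitioning of 40 synthetic (age, salary) records.
rng = np.random.default_rng(0)
data = np.column_stack([rng.integers(20, 80, 40), rng.normal(50000, 15000, 40)])
print([len(g) for g in mondrian_partition(data, k=5)])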
680 CHAPTER 20. PRIVACY-PRESERVING DATA MINING

The Mondrian approach is naturally designed for numeric attributes with an ordering on the values. However, the approach can also be generalized to categorical attributes by designing appropriate split rules for the attributes.

20.3.1.4 Synthetic Data Generation: Condensation-Based Approach

The condensation-based approach generates synthetic data that matches the original data distribution, while maintaining k-anonymity. This means that k synthetic records are generated for each group of k records, by using the statistics of that group. The overall condensation approach may be described as follows:

1. Use any clustering approach to partition the data into groups of data records, such that each group contains at least k data records. Denote the number of created groups by m.

2. Compute the mean and covariance matrix for each group of data records. For a d-dimensional data set, the covariance matrix of a group represents the d × d covariances between pairs of attributes.

3. Compute the eigenvectors and eigenvalues of each covariance matrix. It is evident from the discussion of Principal Component Analysis (PCA) in Chap. 2 that the eigenvectors define a group-specific axis system, along which the data records are uncorrelated. The variance of the data along each eigenvector is equal to the corresponding eigenvalue. The synthetic data set to be generated is modeled as a mixture of m clusters, where the mean of each cluster is the mean of the corresponding group of original data records.

4. Generate synthetic data records for each of the m clusters. For each cluster, the number and mean of the synthetic records match those of its base group. Data records are generated independently along the eigenvectors, with variance equal to the corresponding eigenvalues. The uniform distribution is typically used for synthetic data generation, because it is assumed that the data distribution does not change significantly within the small locality defined by a group. While the uniform distribution is a local approximation, the global distribution of the generated records generally matches the original data quite well.

The approach can also be generalized to data streams, by maintaining group statistics incrementally. The idea here is that group sizes are allowed to vary between k and 2 · k − 1. Whenever a group reaches the size of 2 · k, it is split into two groups. The details of group splitting will be discussed later. To maintain the covariance statistics incrementally in the streaming scenario, an approach similar to the cluster-feature vector of CluStream (see Chap. 7) is used. The only difference is that the product-wise sum statistics are also maintained incrementally. For any pair of attributes i and j, the value of Sum(i, j) is equal to the sum of the product of attribute values i and j over the different data points. This can be easily maintained incrementally in a data stream. Then, for a set of r ∈ [k, 2 · k − 1] data points in a group, the covariance between attributes i and j may be estimated as follows:

Covariance(i, j) = Sum(i, j)/r − Mean(i) · Mean(j)    (20.6)
Covariance(i, j) = Sum(i, j)/r − Sum(i) · Sum(j)/r²    (20.7)
20.3. PRIVACY-PRESERVING DATA PUBLISHING 681

All the statistics in the aforementioned equation are additive, and can easily be maintained incrementally in the stream setting. It remains to be explained how the groups are split once the group sizes reach 2 · k. It is assumed that each group of size 2 · k is split into two groups of size k along the longest eigenvector. The reason for choosing the longest eigenvector is to ensure the compactness of the newly created groups. The splitting of groups can be a challenge, because the original data records are not available in the streaming scenario to recalculate the statistics of each of the split groups. Therefore, an approximation (i.e., modeling assumption) is needed. The condensation approach works with the modeling assumption that the data records of a group are independently distributed along each eigenvector according to a uniform distribution. For group sizes that are much smaller than the number of points in the data set, this is not an unreasonable assumption. This is because density distributions do not change drastically over small regions of the data. This modeling assumption of a uniform distribution is used to re-calculate the new means of each of the child groups of equal size k. This is because the range of the uniform distribution along the longest eigenvector can be approximated from its variance (eigenvalue), based on the modeling assumption. Note that the variance of a uniform distribution is one twelfth the square of its range. Therefore, if λmax is the largest eigenvalue, then the range R of the uniform distribution is computed as follows:

R = √(12 · λmax)    (20.8)

This range R is then split into two equal parts to create the two new group means. Thus, the two new group means are at a distance of R/4 from the old group mean in opposite directions along the longest eigenvector. The newly created groups are assumed to have the same eigenvectors as the parent group, because the splitting is performed along an uncorrelated direction. Therefore, the directions of correlation are not assumed to change after splitting. The largest eigenvalue of the original (parent) group is replaced by an eigenvalue in each of the child groups, which is one fourth² the original value. Thus, if P is the d × d matrix with orthonormal columns containing the eigenvectors, and Σ is the diagonal matrix of eigenvalues (after adjustment of the largest eigenvalue), then the covariance matrix of the newly created split groups can be computed as follows:

C = P Σ P^T    (20.9)

This relationship is based on the standard PCA diagonalization discussed in Chap. 2. Note that the covariance matrices of both the split groups are the same. The covariance matrices and newly generated group means can be used to back-calculate the sum of pairwise attribute products of each group according to Eq. 20.6. Thus, as more data points arrive, these product values can continue to be updated incrementally. The condensation-based approach is one of the few methods that can be applied to data streams with a relatively low risk of disclosure, because of its approach of using synthetic data. It is often difficult for an adversary to know which group of k synthetic records was generated from a particular base group of original records. In the case of generalization-based anonymization, it is relatively easy to identify groups of related data records, representing equivalence classes. Thus, synthetic data sets provide some additional privacy protection.
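The group-splitting procedure described above can be summarized in a short sketch. The following Python fragment is a simplified illustration under the stated uniform-distribution assumption; the function name split_group and the choice of inputs (the additive count, sum, and product-sum statistics) are assumptions made for this example.

import numpy as np

def split_group(n, linear_sum, product_sum):
    # n           : number of records in the parent group (assumed to be 2*k)
    # linear_sum  : length-d array of per-attribute sums, Sum(i)
    # product_sum : (d x d) array of pairwise product sums, Sum(i, j)
    # Returns (mean1, cov1), (mean2, cov2) for the two child groups.
    mean = linear_sum / n
    cov = product_sum / n - np.outer(mean, mean)            # Eq. 20.6
    eigvals, eigvecs = np.linalg.eigh(cov)                  # cov = P diag(eigvals) P^T
    i_max = int(np.argmax(eigvals))
    v = eigvecs[:, i_max]                                   # longest eigenvector
    R = np.sqrt(12.0 * eigvals[i_max])                      # range of the uniform model (Eq. 20.8)
    mean1 = mean - (R / 4.0) * v                            # child means at distance R/4
    mean2 = mean + (R / 4.0) * v                            # on either side of the old mean
    child_vals = eigvals.copy()
    child_vals[i_max] /= 4.0                                # largest eigenvalue shrinks by a factor of 4
    child_cov = eigvecs @ np.diag(child_vals) @ eigvecs.T   # Eq. 20.9: C = P Sigma P^T
    return (mean1, child_cov), (mean2, child_cov)

The per-attribute sums and product sums of each child group can then be back-calculated from these means and covariance matrices, as discussed above, so that incremental maintenance can continue as new data points arrive.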
Note that it is possible to generate larger data sets using this approach if needed. For example, for each group of k records, one might generate α · k synthetic data records, 2Splitting a uniform distribution into two equal parts reduces its variance by a factor of 4.
682 CHAPTER 20. PRIVACY-PRESERVING DATA MINING

using the statistics of that group. This scales up the size of the data by a factor of α, and further reduces the mapping between the generated data and the original data. Furthermore, additional noise can be incorporated during synthetic data generation to ensure greater protection. These additional options do come at a price. The truthfulness of the published data is lost. The published data records are synthetic and therefore do not map onto any particular individual. In many aggregation- or modeling-based applications, this is not necessarily an issue, because the aggregate properties of the data are retained. In some medical data handling scenarios, legal restrictions may prohibit release of downgraded data, when there is a direct mapping between individuals and data records, even at a group level. The condensation approach provides a solution in some of these scenarios, because the released data records are synthetic, and are generally difficult to map onto specific groups. The condensation approach shares a number of conceptual similarities with the Mondrian approach, except that it allows the use of any constrained clustering algorithm, rather than rectangular partitions constructed with single-dimensional cuts. The utility of the resulting anonymization depends on the effectiveness of the clustering. Single-dimensional cuts will not be able to construct high-quality clusters with increasing dimensionality. Furthermore, unlike Mondrian, synthetic data is generated to achieve greater anonymity. The condensation approach does not distinguish between publicly available attributes (used in combination to construct quasi-identifiers) and sensitive attributes, and applies the approach to all the attributes. As will be evident from the subsequent discussion on the dimensionality curse in Sect. 20.3.4, the distinction between quasi-identifier and sensitive attributes is more fluid than is often assumed in the literature on data privacy. Because it is not possible to know the level of background knowledge available to adversaries about the sensitive attributes, all attributes should be perturbed. When the sensitive attributes are released without any perturbation, they become immediately available for identification attacks, as long as background knowledge is available. For example, a number of privacy attacks on data sets, such as the Netflix data set [402], have been performed using attributes that would normally not have been considered publicly available. This work [402] also makes the argument that such strong distinctions between publicly available and sensitive attributes are dangerous to make in real-world settings where the data and background knowledge available to the public continue to increase over time.

20.3.2 The ℓ-Diversity Model

While the k-anonymity model provides the basic framework for privacy-preserving data publishing, there are scenarios in which it can lead to inadvertent sensitive attribute disclosure. Consider the 3-anonymized table illustrated in Table 20.3. In this case, the row indices 1, 3, and 6 are in the same anonymized group, and cannot be distinguished from one another. However, all three individuals have the value of "HIV" on the sensitive attribute. Therefore, even though the identity of the specific individual from this group cannot be inferred, it can be inferred that any individual in this group has HIV.
Therefore, if a voter registration roll is used to join this group to three unique individuals, then it can be inferred that all three of them have HIV. This represents a breach of sensitive attribute information about each of these three individuals. In other words, while the k-anonymity model prevents identity disclosure, it does not prevent attribute disclosure. The main reason for this breach is that the sensitive information is not diverse enough within the anonymized groups. Since the goal of privacy-preserving data publishing is to prevent the revelation of sensitive information, a model that does not use the sensitive
20.3. PRIVACY-PRESERVING DATA PUBLISHING 683

attribute values within the group formation process cannot achieve this goal. The ℓ-diversity model is designed to ensure that the sensitive attributes within an equivalence class are sufficiently diverse.

Definition 20.3.2 (ℓ-diversity Principle) An equivalence class is said to be ℓ-diverse, if it contains at least ℓ "well-represented" values for the sensitive attribute. An anonymized table is said to be ℓ-diverse, if each equivalence class in it is ℓ-diverse.

It is important to note that the notion of "well represented" can be instantiated in several different ways. Therefore, the aforementioned definition provides the basic principle behind this approach, but cannot be considered a hard definition. There are several ways in which the notion of "well-represented" can be instantiated. These correspond to the notions of entropy ℓ-diversity and recursive ℓ-diversity. These definitions are described below.

Definition 20.3.3 (Entropy ℓ-diversity) Let p1 . . . pr be the fractions of the data records belonging to the different values of the sensitive attribute in an equivalence class. The equivalence class is said to be entropy ℓ-diverse, if the entropy of its sensitive attribute value distribution is at least log(ℓ):

− Σ_{i=1}^{r} pi · log(pi) ≥ log(ℓ)    (20.10)

An anonymized table is said to satisfy entropy ℓ-diversity, if each equivalence class in it satisfies entropy ℓ-diversity.

It can be shown that the sensitive attributes in an equivalence class must have at least ℓ distinct values for the table to be ℓ-diverse (see Exercise 7). Therefore, any ℓ-diverse group has at least ℓ elements, and is ℓ-anonymous as well. One problem with this definition of ℓ-diversity is that it may be too restrictive in many settings, especially when the distributions of the sensitive attribute values are uneven. The entropy of a table can be shown to be at least equal to the minimum entropy of the constituent equivalence classes into which it is partitioned (see Exercise 8). Therefore, to ensure ℓ-diversity of each equivalence class, the sensitive attribute distribution in the entire table must also be ℓ-diverse. This is a restrictive assumption in many settings, because most real distributions of sensitive attributes are very skewed. For example, in a medical application, the sensitive (disease) attribute is likely to have uneven frequencies between normal individuals and various diseases. Greater attribute skew reduces the (global) entropy ℓ-diversity of the sensitive-attribute distribution across the entire table. When this global ℓ-diversity is less than ℓ, it is no longer possible to create a globally ℓ-diverse partition without suppressing many data records. Therefore, a more relaxed notion of recursive (c, ℓ)-diversity has been proposed. The basic goal of the definition is to ensure that the most frequent attribute value in an equivalence class does not dominate the less frequent sensitive values in it. An additional parameter c is used to control the relative frequency of the different values of the sensitive attribute within an equivalence class.

Definition 20.3.4 (Recursive (c, ℓ)-diversity) Let p1 . . . pr be the fractions of the data records belonging to the r different values of the sensitive attribute in an equivalence class, such that p1 ≥ p2 ≥ . . . ≥ pr. The equivalence class satisfies recursive (c, ℓ)-diversity, if the following is true:

p1 < c · Σ_{i=ℓ}^{r} pi    (20.11)
684 CHAPTER 20. PRIVACY-PRESERVING DATA MINING

An anonymized table is said to satisfy recursive (c, ℓ)-diversity, if each equivalence class in it satisfies recursive (c, ℓ)-diversity.

The idea is that the least frequent tail of the sensitive attribute values must contain sufficient cumulative frequency compared to the most frequent sensitive attribute value. The value of r has to be at least ℓ for the right-hand side of the aforementioned relationship to be non-zero. A key property of ℓ-diversity is that any generalization of an ℓ-diverse table is also ℓ-diverse. This is true for both definitions of ℓ-diversity.

Lemma 20.3.1 (Entropy ℓ-diversity monotonicity) If a table is entropy ℓ-diverse, then any generalization of the table is entropy ℓ-diverse as well.

Lemma 20.3.2 (Recursive (c, ℓ)-diversity monotonicity) If a table is recursive (c, ℓ)-diverse, then any generalization of the table is recursive (c, ℓ)-diverse as well.

The reader is advised to work out Exercises 9(a) and (b), which are related to these results. Thus, ℓ-diversity exhibits the same monotonicity property exhibited by k-anonymity algorithms. This implies that the algorithms for k-anonymity can be easily generalized to ℓ-diversity by making minor modifications. For example, both Samarati's algorithm and the Incognito algorithm can be adapted to the ℓ-diversity definition. The only change to any k-anonymity algorithm is as follows. Every time a table is tested for k-anonymity, it is now tested for ℓ-diversity instead. Therefore, algorithmic development of ℓ-diverse anonymization methods is typically executed by simply adapting existing k-anonymization algorithms.

20.3.3 The t-closeness Model

While the ℓ-diversity model is effective in preventing direct inference of sensitive attributes, it does not fully prevent the gain of some knowledge by an adversary. The primary reason for this is that ℓ-diversity does not account for the distribution of the sensitive attribute values in the original table. For example, the entropy of a set of sensitive attribute values with relative frequencies p1 . . . pr will take on the maximum value when p1 = p2 = . . . = pr = 1/r. Unfortunately, this can often represent a serious breach of privacy, when there is a significant skew in the original distribution of sensitive attribute values. Consider the example of a medical database of HIV tests, where the sensitive value takes on the two values of "HIV" or "normal," with relative proportions of 1 : 99. In this case, a group with an equal distribution of HIV and normal patients will have the highest entropy, based on the ℓ-diversity definition. Unfortunately, such a distribution is highly revealing when the distribution of the sensitive values in the original data is taken into account. Sensitive values are usually distributed in a skewed way across most real data sets. In the medical example discussed above, it is already known that only 1 % of the patients in the entire data set have HIV. Thus, the equal distribution of HIV-infected and normal patients within a group provides a significant information gain to the adversary. The adversary now knows that this small group of patients has a much higher expected chance of having HIV than the base population. In this context, a notion of Bayes optimal privacy exists, which ensures that the additional posterior information gained after release of information is as small as possible. Unfortunately, the notion of Bayes optimal privacy is practically and computationally difficult to implement.
The t-closeness model may be viewed as a practical and heuristic approach that attempts to achieve goals similar to those of Bayes optimal privacy. This is achieved by using distance functions between distributions. Informally, the goal is to create an
20.3. PRIVACY-PRESERVING DATA PUBLISHING 685

anonymization, such that the distance between the sensitive attribute distributions of each anonymized group and the base data is bounded by a user-defined threshold.

Definition 20.3.5 (t-closeness Principle) Let P = (p1 . . . pr) be a vector representing the fraction of the data records belonging to the r different values of the sensitive attribute in an equivalence class. Let Q = (q1 . . . qr) be the corresponding fractional distribution in the full data set. Then, the equivalence class is said to satisfy t-closeness, if the following is true, for an appropriately chosen distance function Dist(·, ·):

Dist(P, Q) ≤ t    (20.12)

An anonymized table is said to satisfy t-closeness, if all equivalence classes in it satisfy t-closeness.

The previous definition does not specify any particular distance function. There are many different ways to instantiate the distance function, depending on application-specific goals. Two common instantiations of the distance function are as follows:

1. Variational distance: This is simply equal to half the Manhattan distance between the two distribution vectors:

Dist(P, Q) = Σ_{i=1}^{r} |pi − qi| / 2    (20.13)

2. Kullback-Leibler (KL) distance: This is an information-theoretic measure that computes the difference between the cross-entropy of (P, Q), and the entropy of P:

Dist(P, Q) = Σ_{i=1}^{r} (pi · log(pi) − pi · log(qi))    (20.14)

Note that the entropy of the first distribution is − Σ_{i=1}^{r} pi · log(pi), whereas the cross-entropy is − Σ_{i=1}^{r} pi · log(qi).

While these are the two most common distance measures used, other distance measures can be used in the context of different application-specific goals. For example, one may wish to prevent scenarios in which a particular equivalence class contains semantically related sensitive attribute values. Consider the scenario where a particular equivalence class contains diseases such as gastric ulcer, gastritis, and stomach cancer. In such cases, if a group contains only these diseases, then it provides significant information about the sensitive attribute of that group. The t-closeness method prevents this scenario by changing the distance measure, and taking the distance between different values of the sensitive attribute into account in the distance-computation process. In particular, the earth mover's distance can be used effectively for this scenario. The earth mover's distance (EMD) is defined in terms of the "work" (or cost) required to transform one distribution to the other, if we allow sensitive attribute values in the original data to be flipped. Obviously, it requires less "work" to flip a sensitive value to a semantically similar value. Formally, let dij be the amount of "work" required to transform the ith sensitive value to the jth sensitive value, and let fij be the fraction of data records which are flipped from attribute value i to attribute value j. The values of dij are provided by a domain expert. Note that there are many different ways to flip the distribution (p1 . . . pr) to the distribution (q1 . . . qr), and it is desired to use the least cost sequence of flips to
686 CHAPTER 20. PRIVACY-PRESERVING DATA MINING

compute the distance between P and Q. For example, one would rather flip "gastric ulcer" to "gastritis" than flip "HIV" to "gastritis," because the former is likely to have lower cost. Therefore, fij is a variable in a linear programming optimization problem, which is constructed to minimize the overall cost of flips. For a table with r distinct sensitive attribute values, the cost of flips is given by Σ_{i=1}^{r} Σ_{j=1}^{r} fij · dij. The earth mover's distance may be posed as the optimization problem that minimizes this objective function subject to constraints on the aggregate flips involving each sensitive attribute value. The constraints ensure that the aggregate flips do transform the distribution P to Q:

Dist(P, Q) = Minimize Σ_{i=1}^{r} Σ_{j=1}^{r} fij · dij
subject to:
pi − Σ_{j=1}^{r} fij + Σ_{j=1}^{r} fji = qi    ∀i ∈ {1 . . . r}
fij ≥ 0    ∀i, j ∈ {1 . . . r}

The earth mover's distance has certain properties that simplify the computation of generalizations satisfying t-closeness.

Lemma 20.3.3 Let E1 and E2 be two equivalence classes, and let P1, P2 be their sensitive attribute distributions. Let P be the distribution of E1 ∪ E2, and Q be the global distribution of the full data set. Then, it can be shown that:

Dist(P, Q) ≤ (|E1| / (|E1| + |E2|)) · Dist(P1, Q) + (|E2| / (|E1| + |E2|)) · Dist(P2, Q)    (20.15)

This lemma is a result of the fact that the optimal objective function of a linear programming formulation is convex, and P can be expressed as a convex linear combination of P1 and P2 with coefficients |E1| / (|E1| + |E2|) and |E2| / (|E1| + |E2|), respectively. This convexity result also implies the following:

Dist(P, Q) ≤ max{Dist(P1, Q), Dist(P2, Q)}

Therefore, when two equivalence classes satisfying t-closeness are merged, the merged equivalence class will also satisfy t-closeness. This implies the monotonicity property for t-closeness.

Lemma 20.3.4 (t-closeness monotonicity) If a table satisfies t-closeness, then any generalization of the table satisfies t-closeness as well.

The proof of this lemma follows from the fact that the generalization A of any table B contains equivalence classes that are the union of equivalence classes in B. If each equivalence class in B already satisfies t-closeness, then the corresponding union of these equivalence classes must satisfy t-closeness. Therefore, the generalized table must also satisfy t-closeness. This monotonicity property implies that all existing algorithms for k-anonymity can be directly used for t-closeness. The k-anonymity test is replaced with a test for t-closeness.
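As a concrete illustration of the two basic distance instantiations and the t-closeness check, consider the following Python sketch. The function names are assumptions made for this example; the earth mover's distance is omitted because it requires solving the linear program above with a general-purpose solver.

import numpy as np

def variational_distance(p, q):
    # Half the Manhattan distance between two distributions (Eq. 20.13).
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return 0.5 * float(np.abs(p - q).sum())

def kl_distance(p, q):
    # Kullback-Leibler distance between P and Q (Eq. 20.14).
    # Assumes q_i > 0 wherever p_i > 0; terms with p_i = 0 contribute nothing.
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * (np.log(p[mask]) - np.log(q[mask]))))

def satisfies_t_closeness(group_distributions, global_distribution, t, dist=variational_distance):
    # The t-closeness principle: every equivalence class must be within
    # distance t of the global sensitive-attribute distribution (Eq. 20.12).
    return all(dist(p, global_distribution) <= t for p in group_distributions)

# Example: a skewed global distribution (1% HIV) versus a perfectly balanced group.
global_dist = [0.01, 0.99]
balanced_group = [0.5, 0.5]
print(variational_distance(balanced_group, global_dist))            # 0.49
print(round(kl_distance(balanced_group, global_dist), 3))           # about 1.61
print(satisfies_t_closeness([balanced_group], global_dist, t=0.2))  # False

The example mirrors the HIV scenario discussed above: the balanced group is far from the skewed base distribution under either measure, so it fails the t-closeness test even though it is highly entropy-diverse.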
20.3. PRIVACY-PRESERVING DATA PUBLISHING 687

20.3.4 The Curse of Dimensionality

As discussed at various places in this book, the curse of dimensionality causes challenges for many data mining problems. Privacy preservation is also one of the problems affected by the curse of dimensionality. There are two primary ways in which the curse of dimensionality impacts the effectiveness of anonymization algorithms:

1. Computational challenges: It has been shown [385] that optimal k-anonymization is NP-hard. This implies that with increasing dimensionality, it becomes more difficult to perform privacy preservation. The NP-hardness result also applies to the ℓ-diversity and t-closeness models, using a very similar argument.

2. Qualitative challenges: The qualitative challenges to privacy preservation are even more fundamental. Recently, it has been shown that it may be difficult to perform effective privacy preservation without losing the utility of the anonymized data records. This is an even more fundamental challenge, because it makes the privacy-preservation process less practical. The discussion of this section will be centered on this issue.

In the following, a discussion of the qualitative impact of the dimensionality curse on group-based anonymization methods will be provided. While a formal mathematical proof [10] is beyond the scope of this book, an intuitive version of the argument is presented. To understand why the curse of dimensionality increases the likelihood of breaches, one only needs to understand the well-known notion of high-dimensional data sparsity. For ease in understanding, consider the case of numeric attributes. A generalized representation of a table can be considered a rectangular region in d-dimensional space, where d is the number of quasi-identifiers. Let Fi ∈ (0, 1) be the fraction of the range of dimension i covered by a particular generalization. For the anonymized data set to be useful, the value of Fi should be as small as possible. However, the fractional volume of the space covered by a generalization with fractional domain ranges of F1 . . . Fd is given by Π_{i=1}^{d} Fi. This fraction converges to 0 exponentially fast with increasing dimensionality d. As a result, the fraction of data points within the volume also reduces rapidly, especially if the correlations among the different dimensions are weak. For large enough values of d, it will be difficult to create d-dimensional regions containing at least k data points, unless the values of Fi are chosen to be close to 1. In such cases, any value of an attribute is generalized to almost the entire range of values. Such a highly generalized data set therefore loses its utility for data mining purposes. This general principle has also been shown to be true for other privacy models, such as perturbation and ℓ-diversity. The bibliographic notes contain pointers to some of these theoretical results. A real-world example of this scenario is the Netflix Prize data set, in which Netflix released ratings of individuals [559] for movies to facilitate the study of collaborative filtering algorithms. Many other sources of data could be found, such as the Internet Movie Database (IMDb), from which the ratings information could be matched with the Netflix prize data set. It was shown that the identity of users could be breached with a very high level of accuracy, as the number of ratings (specified dimensionality) increased [402]. Eventually, Netflix retracted the data set.
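The exponential shrinkage of the fractional volume discussed at the beginning of this section can be illustrated with a back-of-the-envelope calculation. The following Python snippet assumes, purely for illustration, n = 1,000,000 records distributed roughly uniformly and independently over the quasi-identifier ranges, and asks how large the per-dimension fraction F must be for a generalized box to be expected to contain at least k = 10 records.

# If each of the d quasi-identifier ranges is generalized to a fraction F of its
# domain, the box covers roughly F**d of the space, and is expected to contain
# about n * F**d records. Requiring at least k records gives F >= (k / n)**(1 / d).
n, k = 1_000_000, 10
for d in (2, 5, 10, 20, 50):
    required_f = (k / n) ** (1.0 / d)
    print(f"d = {d:2d}: each attribute range must cover at least {100 * required_f:5.1f}% of its domain")
# With 50 quasi-identifiers, each range must cover roughly 79% of its domain,
# so almost all attribute-level information is generalized away.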