

Data Mining: Concepts, Models, Methods, and Algorithms


Description: The revised and updated third edition of Data Mining contains in one volume an introduction to a systematic approach to the analysis of large data sets that integrates results from disciplines such as statistics, artificial intelligence, databases, pattern recognition, and computer visualization. Advances in deep learning technology have opened an entire new spectrum of applications. The author explains the basic concepts, models, and methodologies that have been developed in recent years.

This new edition introduces and expands on many topics, as well as providing revised sections on software tools and data mining applications. Additional changes include an updated list of references for further study and an extended list of problems and questions that relate to each chapter. This third edition presents new and expanded information that:

• Explores big data and cloud computing
• Examines deep learning
• Includes information on CNN




vector a. The reader can easily check the previous relation. If, for example, the vectors a and b are a = (1, 2, 3) and b = (2, 1, 2), then the Euclidean distance may be calculated from d² = (a • a) + (b • b) − 2(a • b) = 14 + 9 − 2 × 10 = 3. While the products (a • a) and (b • b) can be computed locally and each result is a single value, the core challenge is to develop an algorithm for distributed inner-product computation (a • b). A simple, communication-efficient randomized technique for computing this inner product between two vectors observed at two different sites may consist of the following steps:

1. Vectors a and b are given on two sites A and B, respectively. Site A sends site B a random-number-generator seed. [That is only one passed message.]
2. Both sites A and B cooperatively generate a random matrix R with dimensions k × m, where k ≪ m. Each entry in the matrix R is generated independently and identically from some fixed distribution with mean zero and finite variance.
3. Based on the matrix R, sites A and B compute their own local matrix products: â = R a and b̂ = R b. The dimension of the new local vectors â and b̂ is k, which is significantly lower than the initial length m.
4. Site A sends the resulting vector â to site B. [That represents k passed messages.]
5. Site B computes the approximate inner product (a • b) ≈ (âᵀ • b̂)/k.

So, instead of sending an m-dimensional vector to the other site, the algorithm sends only a (k + 1)-dimensional vector, where k ≪ m (k is a user-defined parameter). The inner product of the vectors can still be estimated accurately with a lower communication load, as the sketch below illustrates.

In the DDM literature, one of two assumptions is commonly adopted as to how data are distributed across sites: (1) homogeneously or horizontally partitioned or (2) heterogeneously or vertically partitioned. Both viewpoints assume that the data tables at each distributed site are partitions of a single global table. It is important to stress that the global table viewpoint is strictly conceptual; it is not necessarily assumed that such a table was physically realized and partitioned to form the tables at each site. In the homogeneous case, the global table is horizontally partitioned. The tables at each site are subsets of the global table; they have exactly the same attributes. Figure 12.30a illustrates the homogeneously distributed case using an example of weather data where both tables use the same three attributes. In the heterogeneous case, the table is vertically partitioned, and each site contains a subset of the columns. That means sites do not have the same attributes. However, samples at each site are assumed to contain a unique identifier to facilitate matching, and Figure 12.30b illustrates this case: the tables at the distributed sites have different attributes, and samples are linked through a unique identifier, Patient ID.

DDM technology supports different data-mining tasks, including classification, prediction, clustering, market basket analysis, and outlier detection. A solution for each of these tasks may be implemented with a variety of DDM algorithms.
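Below is a minimal sketch of the randomized inner-product estimation described in the five steps above. It assumes both sites share the same seed and uses a standard normal distribution for the entries of R (any zero-mean, finite-variance distribution would do); the vectors and parameter values are made up purely for illustration.

```python
import numpy as np

def local_projection(x, k, seed):
    """Project an m-dimensional local vector down to k dimensions using
    a random matrix R generated from the shared seed."""
    rng = np.random.default_rng(seed)        # same seed -> same R at both sites
    R = rng.standard_normal((k, len(x)))     # k x m, zero mean, unit variance
    return R @ x                             # k-dimensional local sketch

m, k, seed = 10_000, 100, 42                 # k << m
rng = np.random.default_rng(0)
a = rng.random(m)                            # vector held at site A
b = rng.random(m)                            # vector held at site B

a_hat = local_projection(a, k, seed)         # computed at site A, sent to B (k values)
b_hat = local_projection(b, k, seed)         # computed at site B

approx = (a_hat @ b_hat) / k                 # approximate inner product, computed at B
print(approx, a @ b)                         # estimate vs. exact value
```

Only the seed and the k-dimensional vector â cross the network, so the communication cost drops from m values to k + 1, at the price of an approximation error that shrinks as k grows.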

Figure 12.30. (a) Horizontally vs. (b) vertically partitioned data.

For example, distributed Apriori has several versions for frequent-itemset generation in a distributed transactional database. They usually require multiple synchronization and communication steps. Most of these implementations assume that the platforms are homogeneous and, therefore, that the data sets are partitioned evenly among the sites. However, in practice, both the data sets and the processing platforms are more likely to be heterogeneous, running multiple and different systems and tools. This leads to unbalanced data-set distributions and workloads, causing additional implementation problems.

One recent trend is online mining technology used for monitoring in distributed sensor networks. This is because deployments of large-scale distributed sensor networks are now possible owing to hardware advances and increasing software support. Online data mining, also called data-stream mining, is concerned with extracting patterns, detecting outliers, or developing dynamic models of a system's behavior from continuous data streams such as those generated by sensor networks. Because of the massive amount of data and the speed at which the data are generated, many data-mining applications in sensor networks require in-network processing such as aggregation to reduce sample size and communication overhead. Online data mining in sensor networks offers many additional challenges, including:

• limited communication bandwidth,
• constraints on local computing resources,
• limited power supply,
• need for fault tolerance, and
• asynchronous nature of the network.

Obviously, data-mining systems have evolved in a short period of time from stand-alone programs characterized by single algorithms with little support for the entire knowledge-discovery process to integrated systems incorporating several mining algorithms, multiple users, communications, and various heterogeneous data formats and distributed data sources.

Although many DDM algorithms have been developed and deployed in a variety of applications, the trend will be illustrated in this book with only one example of a distributed clustering algorithm.

12.4.1 Distributed DBSCAN Clustering

Distributed clustering assumes that the samples to be clustered reside on different sites. Instead of transmitting all samples to a central site where one of the standard clustering algorithms could be applied to analyze the data locally, the data are clustered independently on the distributed local sites. Then, in a subsequent step, the central site tries to establish a global clustering model based on the downloaded local models, i.e., summarized representatives of the local data. Distributed clustering is carried out on two different levels, the local level and the global level (Figure 12.31). On the local level, all sites carry out a clustering independently of each other. Communication with the central site and determination of the global model should reflect an optimum trade-off between the complexity and the accuracy of the algorithm.

Local models consist of a set of representatives for each locally found cluster. A representative is a good approximation for the samples residing on the corresponding local site. The local model is transferred to a central site, where the local models are merged to form a global model. The representation of local models should be simple enough that communications are not overloaded; at the same time, the local model should be informative enough to support a high-quality approximate global clustering.

Figure 12.31. System architecture for distributed clustering.

The global model is created by analyzing and integrating the local representatives. The resulting global clustering is sent back, at the end of the process, to all local sites.

This global distributed framework may be specified more precisely when we implement a specific clustering algorithm. The density-based clustering algorithm DBSCAN is a good candidate because it is robust to outliers, is easy to implement, supports clusters of different shapes, and allows incremental online implementation. The main steps of the algorithm are explained in Chapter 9, and the same process is applied locally. To find local clusters, DBSCAN starts with an arbitrary core object p that is not yet clustered and retrieves all objects density-reachable from p. The retrieval of density-reachable objects is performed in iterations until all local samples are analyzed. After having clustered the data locally, we need a small number of representatives that will describe the local clustering result accurately. For determining suitable representatives of the clusters, the concept of specific core points is introduced.

Let C be a local cluster with respect to the given DBSCAN parameters ε and MinPts. Furthermore, let CorC ⊆ C be the set of core points belonging to this cluster. Then ScorC ⊆ C is called a complete set of specific core points of C iff the following conditions are true:

• ScorC ⊆ CorC
• ∀ si, sj ∈ ScorC: si ∉ Neighborhoodε(sj)
• ∀ c ∈ CorC, ∃ s ∈ ScorC: c ∈ Neighborhoodε(s)

The set ScorC consists of a very small number of specific core points that describe the cluster C. For example, in Figure 12.32a, sites 2 and 3 have only one specific core point each, while site 1, because of the cluster shape, has two specific core points. To further simplify the representation of local clusters, the number of specific core points, |ScorC| = K, is used as an input parameter for a further local "clustering step" with an adapted version of K-means. For each cluster C found by DBSCAN, K-means uses the ScorC points as starting points. The result is K = |ScorC| subclusters and centroids within C.

Each local model LocalModelk consists of a set of mk pairs: a representative r (a complete specific core point) and an εr radius value. The number mk of pairs transmitted from each site k is determined by the number of clusters Ci found on site k. Each of these pairs (r, εr) represents a subset of samples that are all located in the corresponding local cluster. Obviously, we have to check whether it is possible to merge two or more of these clusters, found on different sites, together. That is the main task of the global modeling part. To find such a global model, the algorithm continues with the density-based clustering algorithm DBSCAN again, but only for the representatives collected from the local models. Because of the characteristics of these representative points, the parameter MinPtsglobal is set to 2, and the radius εglobal should generally be set close to 2εlocal.
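A small sketch of this global modeling step, under simplifying assumptions: each local model is reduced to a list of representative points, and with MinPtsglobal = 2, running DBSCAN over the representatives amounts to linking every pair of representatives that lie within εglobal of each other and taking connected components. The coordinates below are made up and only loosely follow the spirit of Figure 12.32.

```python
import math
from itertools import combinations

def merge_representatives(reps, eps_global):
    """Group local representatives into global clusters.

    reps: list of (site_id, (x, y)) pairs collected from all local models.
    With MinPts_global = 2, DBSCAN over the representatives reduces to
    linking every pair of representatives closer than eps_global and
    taking connected components; isolated representatives stay unmerged.
    """
    n = len(reps)
    labels = list(range(n))                    # each representative starts in its own group

    def find(i):                               # union-find with path compression
        while labels[i] != i:
            labels[i] = labels[labels[i]]
            i = labels[i]
        return i

    for i, j in combinations(range(n), 2):
        (_, p), (_, q) = reps[i], reps[j]
        if math.dist(p, q) <= eps_global:
            labels[find(i)] = find(j)          # merge the two groups

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(reps[i])
    return list(clusters.values())

# Hypothetical representatives from three sites
eps_local = 1.0
reps = [("site1", (0.0, 0.0)), ("site1", (1.5, 0.0)),
        ("site2", (3.0, 0.2)), ("site3", (4.4, 0.0))]
print(merge_representatives(reps, eps_global=2 * eps_local))   # one merged global cluster
print(merge_representatives(reps, eps_global=eps_local))       # representatives stay separate
```

With εglobal = 2εlocal the four representatives merge into one global cluster, while εglobal = εlocal leaves them separated, which mirrors the discussion of Figure 12.32c that follows.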

Figure 12.32. Distributed DBSCAN clustering (Januzaj 2003). (a) Local clusters. (b) Local representatives. (c) Global model with εglobal = 2εlocal.

In Figure 12.32, an example of distributed DBSCAN for εglobal = 2εlocal is depicted. In Figure 12.32a the independently detected clusters on sites 1, 2, and 3 are represented. The cluster on site 1 is represented using K-means by two representatives, R1 and R2, whereas the clusters on site 2 and site 3 are each represented by only one representative, as shown in Figure 12.32b. Figure 12.32c illustrates that all four local clusters from the different sites are merged into one large cluster. This integration is obtained by using an εglobal parameter equal to 2εlocal. Figure 12.32c also makes clear that εglobal = εlocal is insufficient to detect this global cluster. When the final global model is obtained, it is distributed to the local sites. This model makes corrections compared with the previously found local models. For example, in the local clustering, some points may have been left as outliers, but with the global model they may be integrated into modified clusters.

12.5 CORRELATION DOES NOT IMPLY CAUSALITY!

An associational concept is any relationship that can be defined in terms of a frequency-based joint distribution of observed variables, while a causal concept is any relationship that cannot be defined from the distribution alone. Even simple examples show that the associational criterion is neither necessary nor sufficient for confirming causality. For example, data mining might determine that males with income between $50,000 and $65,000 who subscribe to certain magazines are likely purchasers of a product you want to sell. While you can take advantage of this pattern, say, by aiming your marketing at people who fit it, you should not assume that any of these factors (income, type of magazine) cause them to buy your product. The predictive relationships found via data mining are not necessarily causes of an action or behavior.

The research questions that motivate many studies in the health, social, and behavioral sciences are not statistical but causal in nature. For example, what is the efficacy of a given drug in a given population, or what fraction of past crimes could have been avoided by a given policy? The central target of such studies is the determination of cause–effect relationships among variables of interest, for example, treatments–diseases or policies–crime, as preconditions–outcomes relationships. In order to express causal assumptions mathematically, certain extensions are required in the standard mathematical language of statistics, and these extensions are not generally emphasized in the mainstream literature and education.

The aim of standard statistical analysis, typified by regression and other estimation techniques, is to infer parameters of a distribution from samples drawn from that distribution. With the help of such parameters, one can infer associations among variables or estimate the likelihood of past and future events. These tasks are managed well by standard statistical analysis so long as experimental conditions remain the same. Causal analysis goes one step further; its aim is to infer aspects of the data-generation process. Associations characterize static conditions, while causal analysis deals with changing conditions. There is nothing in the joint distribution of symptoms and diseases to tell us that curing the former would or would not cure the latter.

Drawing an analogy to visual perception, the information contained in a probability function is analogous to a geometrical description of a three-dimensional object; it is sufficient for predicting how that object will be viewed from any angle outside the object, but it is insufficient for predicting how the object will be deformed if manipulated and squeezed by external forces. The additional information needed for making such predictions, such as the object's resilience or elasticity, is analogous to the information that causal assumptions provide. These considerations imply that the slogan "correlation does not imply causation" can be translated into a useful principle: one cannot substantiate causal claims from associations alone, even at the population level. Behind every causal conclusion, there must lie some causal assumption that is not testable in observational studies.

Any mathematical approach to causal analysis must acquire new notation for expressing causal assumptions and causal claims. To illustrate, the syntax of probability calculus does not permit us to express the simple fact that "symptoms do not cause diseases," let alone draw mathematical conclusions from such facts. All we can say is that two events are dependent—meaning that if we find one, we can expect to encounter the other—but we cannot distinguish statistical dependence, quantified by the conditional probability P(disease | symptom), from causal dependence, for which we have no expression in standard probability calculus. The symbolic representation of the relation "symptoms cause disease" is distinct from the symbolic representation of "symptoms are associated with disease." The need to adopt a new notation, foreign to the province of probability theory, has been traumatic to most persons trained in statistics, partly because the adoption of a new language is difficult in general and partly because statisticians—this author included—have been accustomed to assuming that all phenomena, processes, thoughts, and modes of inference can be captured in the powerful language of probability theory. The formalization of causality requires new mathematical machinery for cause–effect analysis and a formal foundation for counterfactual analysis, including concepts such as "path diagrams," "controlled distributions," causal structures, and causal models.

12.5.1 Bayesian Networks

One of the powerful aspects of graphical models is that a specific graph can make probabilistic statements for a broad class of distributions. In order to motivate the use of directed graphs to describe probability distributions, consider first an arbitrary joint distribution p(a, b, c) over three variables a, b, and c. By application of the product rule of probability, we can write the joint distribution in the form

p(a, b, c) = p(c | a, b) p(a, b) = p(c | a, b) p(b | a) p(a)

We now represent the right-hand side of the equation in terms of a simple graphical model as follows. First, we introduce a node for each of the random variables a, b, and c and associate each node with the corresponding conditional distribution on the right-hand side of the equation. Then, for each conditional distribution, we add directed links (arrows) to the graph from the nodes corresponding to the variables on which the distribution is conditioned. Thus, for the factor p(c | a, b), there will be links from nodes a and b to node c, whereas for the factor p(a) there will be no incoming links, as presented in Figure 12.33a. If there is a link going from node a to node b, then we say that node a is the parent of node b, and that node b is the child of node a. For given K variables, we can again represent a joint probability distribution as a directed graph having K nodes, one for each conditional distribution, with each node having incoming links from all lower-numbered nodes. We say that this graph is fully connected because there is a link between every pair of nodes.
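A short numerical check of the factorization p(a, b, c) = p(c | a, b) p(b | a) p(a) above, using a made-up joint distribution over three binary variables (the table values are purely illustrative):

```python
from itertools import product

# Hypothetical joint distribution p(a, b, c) over binary variables; values sum to 1.
p = {(0, 0, 0): 0.10, (0, 0, 1): 0.05, (0, 1, 0): 0.15, (0, 1, 1): 0.10,
     (1, 0, 0): 0.05, (1, 0, 1): 0.20, (1, 1, 0): 0.05, (1, 1, 1): 0.30}

p_a  = lambda a: sum(p[a, b, c] for b, c in product((0, 1), repeat=2))
p_ab = lambda a, b: sum(p[a, b, c] for c in (0, 1))
p_b_given_a  = lambda b, a: p_ab(a, b) / p_a(a)
p_c_given_ab = lambda c, a, b: p[a, b, c] / p_ab(a, b)

# p(a,b,c) = p(c|a,b) * p(b|a) * p(a) holds for every assignment.
for a, b, c in product((0, 1), repeat=3):
    reconstructed = p_c_given_ab(c, a, b) * p_b_given_a(b, a) * p_a(a)
    assert abs(reconstructed - p[a, b, c]) < 1e-12
print("factorization verified")
```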

Figure 12.33. A directed graphical model representing the joint probability distribution over a set of variables. (a) Fully connected. (b) Partially connected.

Consider now the graph shown in Figure 12.33b, which is not a fully connected graph because, for instance, there is no link from x1 to x2 or from x3 to x7. We may transform this graph into the corresponding representation of the joint probability distribution, written in terms of the product of a set of conditional distributions, one for each node in the graph. The joint distribution of all seven variables is given by

p(x1, x2, …, x7) = p(x1) p(x2) p(x3) p(x4 | x1, x2, x3) p(x5 | x1, x3) p(x6 | x4) p(x7 | x4, x5)

Any joint distribution can be represented by a corresponding graphical model. It is the absence of links in the graph that conveys interesting information about the properties of the class of distributions that the graph represents. We can interpret such models as expressing the processes by which the observed data arose, and in many situations we may draw conclusions about new samples from a given probability distribution. The directed graphs that we are considering are subject to an important restriction, namely, that there must be no directed cycles; in other words, there are no closed paths within the graph such that we can move from node to node along links, following the direction of the arrows, and end up back at the starting node. Such graphs are also called directed acyclic graphs (DAGs).

An important concept for probability distributions over multiple variables is that of conditional independence. Consider three variables a, b, and c, and suppose that the conditional distribution of a, given b and c, is such that it does not depend on the value of b, so that

p(a | b, c) = p(a | c)

We say that a is conditionally independent of b given c. This can be extended in a slightly different way if we consider the joint distribution of a and b conditioned on c, which we can write in the form

p(a, b | c) = p(a | b, c) p(b | c) = p(a | c) p(b | c)

Figure 12.34. Joint probability distributions showing different dependencies between variables a, b, and c.

The joint distribution of a and b, conditioned on c, may thus be factorized into the product of the marginal distribution of a and the marginal distribution of b (again, both conditioned on c). This says that the variables a and b are statistically independent given c. This independence is presented in graphical form in Figure 12.34a. The other typical joint distributions may also be interpreted graphically. The distribution for Figure 12.34b represents the case

p(b, c | a) = p(c | a) p(b | c)

while for Figure 12.34c the probability p(c | a, b) is given under the assumption that the variables a and b are independent, p(a, b) = p(a) p(b).

In general, graphical models may capture the causal processes by which the observed data were generated. For this reason, such models are often called generative models. We could make the previous models in Figure 12.33 generative by introducing a suitable prior distribution p(x) for all input variables (these are the variables—nodes—without input links). For the case in Figure 12.33a, this is variable a, and for the case in Figure 12.33b, these are variables x1, x2, and x3. In practice, producing synthetic observations from a generative model can prove informative in understanding the form of the probability distribution represented by that model.

This preliminary analysis of joint probability distributions brings us to the concept of Bayesian networks (BNs), which are also called in the literature belief networks or probabilistic networks. The nodes in a BN represent variables of interest (e.g., the temperature of a device, the gender of a patient, the price of a product, the occurrence of an event), and the links represent dependencies among the variables. Each node has states, a set of probable values for the variable. For example, the weather could be cloudy or sunny, an enemy battalion could be near or far, symptoms of a disease are present or not present, and the garbage disposal is working or not working. Nodes are connected with an arrow to show causality, also indicating the direction of influence; these arrows are called edges. The dependencies are quantified by conditional probabilities for each node given its parents in the network. Figure 12.35 presents some BN architectures, initially without probability distributions. In general, we can formally describe a BN as a graph in which the following holds:

Figure 12.35. Two examples of Bayesian network architectures.

1. A set of random variables makes up the nodes of the network.
2. A set of directed links connects pairs of nodes. The intuitive meaning of an arrow from node X to node Y is that X has a direct influence on Y.
3. Each node has a conditional probability table (CPT) that quantifies the effects that the parents have on the node. The parents of a node X are all those nodes that have arrows pointing to X.
4. The graph has no directed cycles (hence it is a DAG).

Each node in the BN corresponds to a random variable X and has a probability distribution of the variable, P(X). If there is a directed arc from node X to node Y, this indicates that X has a direct influence on Y. The influence is specified by the conditional probability P(Y | X). Nodes and arcs define the structure of the BN, and the probabilities are the parameters of the structure.

We turn now to the problem of inference in graphical models, in which some of the nodes in a graph are clamped to observed values, and we wish to compute the posterior distributions of one or more subsets of other nodes. The network supports the computation of the probabilities of any subset of variables given evidence about any other subset. We can exploit the graphical structure both to find efficient algorithms for inference and to make the structure of those algorithms transparent. Specifically, many inference algorithms can be expressed in terms of the propagation of local probabilities around the graph. A BN can be considered a probabilistic graph in which the probabilistic knowledge is represented by the topology of the network and the conditional probabilities at each node. The main purpose of building knowledge on probabilities is to use it for inference, i.e., computing the answer for particular cases about the domain.

For example, we may assume that rain causes the grass to get wet. The causal graph in Figure 12.36 shows the cause–effect relation between these variables, including the corresponding probabilities. If P(Rain) = P(R) = 0.4 is given, that also means P(¬R) = 0.6. Also, note that the sum of the presented conditional probabilities is not equal to 1; the relations that do hold are P(W | R) + P(¬W | R) = 1 and P(W | ¬R) + P(¬W | ¬R) = 1, not the sum of the given probabilities.

Figure 12.36. Simple causal graph: Rain → Wet grass, with p(R) = 0.4, p(W | R) = 0.9, and p(W | ¬R) = 0.2.

In these expressions, R means "Rain" and W means "Wet grass." Based on the given BN, we may check the probability of "Wet grass":

P(W) = P(W | R) P(R) + P(W | ¬R) P(¬R) = 0.9 × 0.4 + 0.2 × 0.6 = 0.48, or 48%

The Bayesian rule allows us to invert the dependencies and obtain probabilities of parents in the graph based on the probabilities of children. That can be useful in many applications, such as determining the probability of a diagnosis based on symptoms. For example, based on the BN in Figure 12.36, we may determine the conditional probability P(Rain | Wet grass) = P(R | W). We know that

P(R, W) = P(W | R) P(R) = P(R | W) P(W)

and therefore

P(R | W) = P(W | R) P(R) / P(W) = P(W | R) P(R) / [P(W | R) P(R) + P(W | ¬R) P(¬R)] = (0.9 × 0.4) / (0.9 × 0.4 + 0.2 × 0.6) = 0.75

Let us now consider a more complex problem and the more complex BN represented in Figure 12.37. In this case we have three nodes connected serially, often called a head-to-tail connection of three events. An additional event, "Cloudy," with yes and no values, is now included as a variable at the beginning of the network. The R node blocks the path from C to W, which separates them; if the R node is removed, there is no path from C to W. Therefore, the relation between the conditional probabilities in the graph is given as P(C, R, W) = P(C) P(R | C) P(W | R). Based on the BN in Figure 12.37, it is possible to determine and use "forward" and "backward" conditional probabilities as we did for the previous BN. We start with

P(W | C) = P(W | R) P(R | C) + P(W | ¬R) P(¬R | C) = 0.9 × 0.8 + 0.2 × 0.2 = 0.76

Then, we may use the Bayesian rule for the inverted conditional probability:

P(C | W) = P(W | C) P(C) / P(W) = 0.65 (P(W) requires a detailed computation)
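The two calculations for the simple Rain → Wet grass network can be reproduced directly from the numbers in Figure 12.36; a minimal sketch:

```python
# Parameters of the two-node network in Figure 12.36.
p_R = 0.4             # P(Rain)
p_W_given_R = 0.9     # P(Wet grass | Rain)
p_W_given_notR = 0.2  # P(Wet grass | no Rain)

# Forward inference: total probability of wet grass.
p_W = p_W_given_R * p_R + p_W_given_notR * (1 - p_R)
print(p_W)                         # 0.48

# Backward inference with Bayes' rule: probability of rain given wet grass.
p_R_given_W = p_W_given_R * p_R / p_W
print(round(p_R_given_W, 2))       # 0.75
```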

Figure 12.37. An extended causal graph: Cloudy → Rain → Wet grass, with p(C) = 0.4, p(R | C) = 0.8, p(W | R) = 0.9, and p(W | ¬R) = 0.2.

Figure 12.38. Four-node architecture of a Bayesian network: Cloudy → Sprinkler, Cloudy → Rain, and Sprinkler, Rain → Wet grass, with CPTs p(C = T) = 0.5; p(S = T | C = F) = 0.5 and p(S = T | C = T) = 0.1; p(R = T | C = F) = 0.2 and p(R = T | C = T) = 0.8; and p(W = T | S, R) equal to 0.0, 0.9, 0.9, and 0.99 for (S, R) = (F, F), (T, F), (F, T), and (T, T), respectively.

More complex connections may be analyzed in a BN. Figure 12.38 shows the graph structure and the assumed input parameters. The parameters of a graphical model are represented by the conditional probability distributions in the form of CPTs for each node given its parents. The simplest form of a formalized distribution, a CPT, is suitable when the nodes are discrete valued. All nodes in Figure 12.38 are represented with a discrete set of states and corresponding CPTs. For example, the "Sprinkler" node (S) may be "on" or "off," and it is represented in the table with T and F values. The Sprinkler CPT is generated using the input discrete values of the node "Cloudy," C. Many algorithms for BN analysis may be expressed in terms of the propagation of probabilities through the graph.
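A brute-force inference sketch over the four-node network of Figure 12.38, using the CPT values listed above and summing over the full joint distribution. This enumeration is practical only for tiny networks; real BN engines propagate probabilities through the graph much more efficiently.

```python
from itertools import product

# CPTs taken from Figure 12.38 (True = T, False = F).
p_C = {False: 0.5, True: 0.5}
p_S_given_C = {False: {False: 0.5, True: 0.5},    # row C = F of the Sprinkler CPT
               True:  {False: 0.9, True: 0.1}}    # row C = T
p_R_given_C = {False: {False: 0.8, True: 0.2},
               True:  {False: 0.2, True: 0.8}}
p_W_given_SR = {(False, False): {False: 1.00, True: 0.00},
                (True,  False): {False: 0.10, True: 0.90},
                (False, True):  {False: 0.10, True: 0.90},
                (True,  True):  {False: 0.01, True: 0.99}}

def joint(c, s, r, w):
    """P(C, S, R, W) = P(C) P(S|C) P(R|C) P(W|S,R)."""
    return p_C[c] * p_S_given_C[c][s] * p_R_given_C[c][r] * p_W_given_SR[s, r][w]

def prob(condition):
    """Probability of any event, by summing the joint over all 16 assignments."""
    return sum(joint(c, s, r, w)
               for c, s, r, w in product([False, True], repeat=4)
               if condition(c, s, r, w))

p_w = prob(lambda c, s, r, w: w)                        # P(W = T)
p_s_given_w = prob(lambda c, s, r, w: s and w) / p_w    # P(S = T | W = T)
p_r_given_w = prob(lambda c, s, r, w: r and w) / p_w    # P(R = T | W = T)
print(round(p_w, 4), round(p_s_given_w, 2), round(p_r_given_w, 2))   # 0.6471 0.43 0.71
```

Given that the grass is wet, rain is the more probable explanation under these CPTs.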

All probabilistic models, no matter how refined and accurate, Bayesian networks included, describe a distribution over possible observed events but say nothing about what will happen if a certain intervention occurs. For example, what if I turn on the sprinkler? What effect does that have on the season, or on the connection between wetness and slipperiness? A causal network is a BN with the added property that the parents of each node are its direct causes. In such a network, the result of an intervention is obvious: the sprinkler node is set to "on," and the causal link between the season and the sprinkler is removed. All other causal links and conditional probabilities remain intact. This added property endows the causal network with the capability of representing and responding to external or spontaneous changes. For example, to represent a disabled sprinkler in the story of Figure 12.38, we simply delete from the network all links incident to the node Sprinkler. To represent the policy of turning the sprinkler off if it rains, we simply add a link between Rain and Sprinkler. Such changes would require much greater remodeling effort if the network were not constructed along the causal direction. This remodeling flexibility may well be cited as the ingredient that manages novel situations instantaneously, without requiring training or adaptation of the model.

12.6 PRIVACY, SECURITY, AND LEGAL ASPECTS OF DATA MINING

An important lesson of the Industrial Revolution was that the introduction of new technologies can have a profound effect on our ethical principles. In today's Information Revolution we strive to adapt our ethics to diverse concepts in cyberspace. The recent emergence of very large databases, and their associated data-mining tools, presents us with yet another set of ethical challenges to consider. The rapid dissemination of data-mining technologies calls for an urgent examination of their social impact. It should be clear that data mining itself is not socially problematic; ethical challenges arise when it is executed over data of a personal nature. For example, the mining of manufacturing data is unlikely to lead to any consequences of a personally objectionable nature. However, mining clickstream data obtained from Web users initiates a variety of ethical and social dilemmas. Perhaps the most significant of these is the invasion of privacy, but it is not the only one.

Thanks to the proliferation of digital technologies and networks such as the Internet, and to tremendous advances in the capacity of storage devices with parallel decreases in their cost and physical size, many private records are linked and shared more widely and stored far longer than ever before, often without the individual consumer's knowledge or consent. As more everyday activities move online, digital records contain more detailed information about individuals' behavior. Merchants' records no longer cover only what individuals buy and how they pay for their purchases. Instead, those data include every detail of what we look at, the books we read, the movies we watch, the music we listen to, the games we play, and the places we visit. The robustness of these records is difficult to overestimate and is not limited to settings involving commercial transactions. More and more computers track every moment of most employees' days. Email and voice mail are stored digitally; even the content of telephone conversations may be recorded. Digital time clocks and entry keys record physical movements. Computers store work product, text messages, and Internet browsing records, often in keystroke-by-keystroke detail, and thereby monitor employee behavior.

The ubiquitous nature of data collection, analysis, and observation is not limited to the workplace. Digital devices for paying tolls, computer diagnostic equipment in car engines, and global positioning services that are increasingly common in passenger vehicles record every mile driven. Cellular telephones and personal digital assistants record not only call and appointment information but also location, and they transmit this information to service providers. ISPs record online activities, digital cable and satellite systems record what we watch and when, alarm systems record when we enter and leave our homes, and all of these data are held by third parties. Information on our browsing habits is available to both the employer and the ISP. If an employee buys an airline ticket through an online travel service, such as Travelocity or Expedia, the information concerning that transaction will be available to the employer, the ISP, the travel service, the airline, and the provider of the payment mechanism, at a minimum. All indications are that this is just the beginning. Broadband Internet access into homes has not only increased the personal activities we now engage in online but has also created new and successful markets for remote computer backup and online photo, email, and music storage services. With Voice over IP telephone service, digital phone calls are becoming indistinguishable from digital documents: both can be stored and accessed remotely. Global positioning technologies are appearing in more and more products, and radio-frequency identification tags are beginning to be used to identify high-end consumer goods, pets, and even people.

Many individuals are unaware of the extent of the personal data stored, analyzed, and used by government institutions, private corporations, and research labs. It is usually only when things go wrong that individuals exercise their rights to obtain these data and seek to eliminate or correct them. Most of those whose records are accessed through data mining do not know it is happening and may never find out, because nothing incriminating is signaled. But we still know that data mining allows companies to accumulate and analyze vast amounts of information about us, sufficient perhaps to create, with the help of data mining, what some have called personality or psychological "mosaics" of the subjects. One result of the entry into the information age is that faceless bureaucrats (in companies, in government, everywhere) will be able to compile dossiers on anyone and everyone, for any reason or for no reason at all. The possibility, even if slim, that this information could somehow be used to our detriment or simply revealed to others can create a chilling effect on all these activities.

Data, and the information derived from those data using data mining, are an extremely valuable resource for any organization. Every data-mining professional is aware of this, but few concentrate on the impact that data mining could have on privacy and on the laws surrounding the privacy of personal data. A recent survey showed that data-mining professionals "prefer to focus on the advantages of web-data mining instead of discussing the possible dangers." These professionals argued that Web data mining does not threaten privacy. One might wonder why professionals are not aware of, or concerned over, the possible misuse of their work and the possible harm it might cause to individuals and society.
Part of the reason some professionals are not concerned about the possible misuse of their work and the possible harm it might cause might lie in the explanation that "they are primarily technical professionals and somebody else should take care of these social and legal aspects." But sensible regulation of data mining depends on an understanding of its many variants and its potential harms. Therefore, technical professionals have to be part of the team, often leading it, that will try to solve privacy challenges.

The key ethical issues in mining personal data are that people generally:

1. are not aware that their personal information is being gathered,
2. do not know to what use the data will be put, and/or
3. have not consented to such collection or data use.

In order to alleviate concerns about data privacy, a number of techniques have recently been proposed for performing data-mining tasks in a privacy-preserving way. These techniques for privacy-preserving data mining are drawn from a wide array of related topics such as cryptography and information hiding. Most privacy-preserving data-mining methods apply a transformation that reduces the effectiveness of the underlying data when it is applied to data-mining methods or algorithms. In fact, there is a natural trade-off between privacy and accuracy, though this trade-off is affected by the particular algorithm that is used for privacy preservation (a toy numerical illustration appears below). The key directions in the field of privacy-preserving data mining include:

• Privacy-preserving data publishing: These techniques study different transformation methods associated with privacy. They concentrate on how the perturbed data can be used in conjunction with classical data-mining methods.
• Changing the results of data-mining applications to preserve privacy: These techniques concentrate on the privacy of data-mining results, where some results are modified in order to preserve privacy. A classic example of such techniques is association-rule hiding, in which some of the association rules are suppressed in order to preserve privacy.
• Cryptographic methods for distributed privacy: If the data are distributed across multiple sites, a variety of cryptographic protocols may be used to communicate among the different sites so that secure function computation is possible without revealing sensitive information.

Recent research trends propose that issues of privacy protection, currently viewed in terms of data access, be reconceptualized in terms of data use. From a technology perspective, this requires supplementing legal and technical mechanisms for access control with new mechanisms for transparency and accountability of the data used in the data-mining process. Current technical solutions to the impact of data mining on privacy have generally focused on limiting access to data at the point of collection or storage. Most effort has been put into the application of cryptographic and statistical techniques to construct finely tuned access-limiting mechanisms.
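As a toy illustration of the privacy–accuracy trade-off behind perturbation-based (data publishing) approaches, the sketch below adds random noise to a made-up numeric attribute before release; the attribute, values, and noise scale are all hypothetical.

```python
import random

random.seed(7)

# Original sensitive attribute (e.g., salaries) held by individuals.
salaries = [random.gauss(60_000, 12_000) for _ in range(10_000)]

# Additive-noise perturbation: each value is released only after random
# noise is added, so no single released value reveals the true one.
noise_scale = 20_000
released = [x + random.gauss(0, noise_scale) for x in salaries]

# Aggregate statistics remain usable: the released mean stays close to the
# true mean, while individual records are strongly obscured.
true_mean = sum(salaries) / len(salaries)
released_mean = sum(released) / len(released)
print(round(true_mean), round(released_mean))

# The trade-off: a larger noise_scale gives more privacy but less accurate
# estimates reconstructed from the released data.
```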

Even if privacy-preserving data-mining techniques prove to be practical, they are unlikely to provide sufficient public assurance that data-mining inferences conform to legal restrictions. While privacy-preserving data-mining techniques are certainly necessary in some contexts, they do not provide sufficient privacy protection without transparency and accountability. In the long run, access restriction alone is enough neither to protect privacy nor to ensure reliable conclusions, and the best example of these challenges is the Web and Web-mining technology. As we leave the well-bounded world of enterprise databases and enter the open, unbounded world of the Web, data users need a new class of tools to verify that the results they see are based on data that come from trustworthy sources and are used according to agreed-upon institutional and legal requirements.

The implications of data mining on digital social networks such as Facebook, WhatsApp, or Twitter may be enormous. Unless it is part of a public record designed for consumption by everyone, or describes an activity observed by strangers, stored information is rarely known outside our families, much less outside our social networks. An expectation that such information, and potential derivatives of it, will remain "private" on the Internet is no longer a reasonable assumption from the social-network perspective. One of the major contributors to these controversies is the absence of clear legal standards. Thirty years ago, the lack of relevant law was understandable: the technologies were new, their capacity was largely unknown, and the types of legal issues they might raise were novel. Today, it is inexplicable and threatens to undermine both privacy and security. Hence, we must develop technical, legal, and policy foundations for transparency and accountability of large-scale mining across distributed heterogeneous data sources. Policy awareness is a property of the Semantic Web, still in development, that should provide users with accessible and understandable views of the policies associated with resources.

The following issues related to privacy concerns may assist in individual privacy protection during a data-mining process and should be a part of the best data-mining practices:

• Whether there is a clear description of a program's collection of personal information, including how the collected information will serve the program's purpose. In other words, be transparent early on about a data-mining project's purpose. Clearly state up front the business benefits that will be achieved by data mining. Provide notice of the combining of information from different sources. Companies such as Walmart or Kroger store much of their business and customer data in large warehouses. Their customers are not told the extent of the information that is accumulated about them, how long it will be kept, the uses to which the data will be put, or the other users with which the data will be shared.
• Whether information collected for one purpose will then be used for additional secondary purposes in the future. Ensure that any new purpose of a project is consistent with the project's original purpose. Maintain oversight of the data-mining project and create audit requirements.

• Whether privacy protections are built into systems at the early developmental stage. Build in privacy considerations up front, and bring in all stakeholders at the beginning, including privacy advocates, to get input from them. Ensure the accuracy of data entry.
• What type of action will be taken on the basis of information discovered through a data-mining process. Where appropriate, anonymize personal information. Limit the actions that may be taken as a result of unverified findings from data mining.
• Whether there is an adequate feedback system for individuals to review and correct their personal information that is collected and maintained, in order to avoid "false positives" in a data-mining program. Determine whether an individual should have a choice in the collection of information. Provide notice to individuals about the use of their personal information. Create a system through which individuals can ensure that any incorrect personal information is corrected.
• Whether there are proper disposal procedures for collected and derived personal information that has been determined to be irrelevant.

Some observers suggest that the privacy issues presented by data mining will be resolved by technologies, not by law or policy. But even the best technological solutions will still require a legal framework in which to operate, and the absence of that framework may not only slow their development and deployment but make them entirely unworkable. Although there is no explicit right to privacy of personal data in the Constitution, legislation and court decisions on privacy are usually based on parts of the First, Fourth, Fifth, and Fourteenth Amendments. Except for healthcare and financial organizations and data collected from children, there is no law that governs the collection and use of personal data by commercial enterprises. Therefore, it is essentially up to each organization to decide how it will use the personal data it has accumulated on its customers. In early March 2005, hackers stole the personal information of 32,000 people from the databases of LexisNexis. The stolen data included Social Security numbers and financial information. Although the CEO of LexisNexis claimed that the information they collect is governed by the US Fair Credit Reporting Act, members of Congress disagreed. As a result of this and other large-scale identity thefts in recent years, Congress is considering new laws specifying what personal data a company can collect and share. For example, Congress is considering a law to prohibit almost all sales of Social Security numbers.

At the same time, especially since 9/11, government agencies have been eager to experiment with the data-mining process as a way of nabbing criminals and terrorists. Although details of their operation often remain unknown, a number of such programs have come to light since 2001. The DOJ, through the FBI, has been collecting telephone logs, banking records, and other personal information regarding thousands of Americans, not only in connection with counterterrorism efforts but also in furtherance of ordinary law enforcement. A 2004 report by the Government Accountability Office (GAO) found 42 federal departments—including every cabinet-level agency that responded to the GAO's survey—engaged in, or planning to engage in, 122 data-mining efforts involving personal information (U.S. General Accounting Office, Data Mining: Federal Efforts Cover a Wide Range of Uses [GAO-04-548], May 2004, pp. 27–64). Recently the US government recognized that sensible regulation of data mining depends on understanding its many variants and its potential harms, and many of these data-mining programs are being reevaluated. In the United Kingdom, the problem is being addressed more comprehensively by the Foundation for Information Policy Research, an independent organization examining the interaction between IT and society, with the goals of identifying technical developments with significant social impact, commissioning research into public policy alternatives, and promoting public understanding and dialogue between technologists and policy makers in the United Kingdom and Europe. It combines IT researchers with people interested in social impacts and uses a strong media presence to disseminate its arguments and educate the public.

There is one additional legal challenge related specifically to data mining. Today's privacy laws and guidelines, where they exist, protect data that are explicit, confidential, and exchanged between databases. However, there is no legal or normative protection for data that are implicit, nonconfidential, and not exchanged. Data mining can reveal sensitive information that is derived from nonsensitive data and metadata through the inference process. The information gathered in data mining usually consists of implicit patterns, models, or outliers in the data, and the application of privacy regulations primarily written for traditional explicit data is questionable. In addition to data privacy issues, data mining raises other social concerns. For example, some researchers argue that data mining and the use of consumer profiles in some companies can actually exclude groups of customers from full participation in the marketplace and limit their access to information.

As data mining increasingly affects decisions in domains protected by antidiscrimination law, there is much interest in algorithmically measuring and ensuring fairness in the field. What does it mean for a data-mining model to be fair or nondiscriminatory? The answer is not simple, but the basic idea is that models are trained with data that reflect society's biases and that algorithms can amplify these biases. Although reliance on data and quantitative measures can help quantify and eliminate some existing biases, recent research has warned that data-mining algorithms can also introduce totally new biases or perpetuate existing ones. Recently, social science research and numerous media outlets, including the highly respected journal Nature, have pointed out plenty of anecdotal evidence that decision-making by data-mining algorithms may unintentionally discriminate against people. For example, if one naively trains a model to filter resumes and find the most qualified candidates for certain jobs based on prior-hires data, even if the algorithm is explicitly instructed to ignore "protected attributes" like race or gender, the results can turn out to be race- or gender-biased. It turns out that race and gender are correlated with other "unprotected" information, such as names, which the naive algorithm can use.
When Harvard Professor Latanya Sweeney put her name into a search engine, she was delivered an ad saying "Latanya Sweeney, Arrested?" that offered a background check for a fee. The result of that background check was that Dr. Sweeney had no arrest record, and the initial triggering of this ad was obviously deeply unfair and discriminatory to Dr. Sweeney. Searches on people whose first names indicated they were more likely to be black, like Latanya, were much more likely to produce this "Arrested?" ad. If potential employers put Dr. Sweeney's name into a search engine, they might write her off immediately upon seeing the ad.

A vetted methodology for avoiding discrimination against protected attributes in data mining is lacking. An initial idea was based on the naive approach of requiring that the data-mining algorithm ignore all protected attributes such as race, color, religion, gender, disability, or family status. However, this idea of "fairness through unawareness" is ineffective, and such approaches are sometimes a source of discrimination themselves. There is an entirely new field emerging at the intersection of computer science, law, and social studies. In the long run it is necessary to develop a fundamental scientific understanding for ensuring transparency and accountability in the use of data mining in society. Within this goal, guaranteeing the fairness of algorithms is one of the key issues. Current approaches to fair data mining typically focus on interventions at the data-preparation, model-learning, or postprocessing (tuning) stages. While fair data-mining algorithms have already been proposed, insight into the underlying mechanisms of how such discrimination happens, from a computational perspective, is not yet scientifically established. We need to develop a theoretical understanding of how algorithms may become discriminatory and establish fundamental machine-learning principles for prevention. The state of the art in fair data mining should advance from heuristic repairing to proactive and theoretically supported prevention.

Good privacy and fairness protection not only can help to build support for data mining and other tools to enhance security, but it can also contribute to making those tools more effective. Still, as the data-mining and machine-learning industry makes significant progress in performing various tasks and actions in everyday life, questions are raised regarding ethics, responsibilities, and human engagement in these real-time applications. Instead of the highly publicized and hotly debated application of self-driving cars, let us present an illustrative example of a smaller, locally developed application. Such applications are also very important, although they do not always receive enough public attention. These new applications often raise the question of a balance between the attractiveness and usefulness of collected data and intelligent decisions on the one hand and privacy, legality, and even fairness issues on the other.

An illustrative example is a high school in Hangzhou City, Zhejiang Province, located on the eastern coast of China, which has employed facial recognition technology to monitor students' attentiveness in the classroom. Three cameras at the front of the classroom scan students' faces every 30 seconds, analyzing their facial expressions to detect different moods—surprised, sad, antipathy, angry, happy, afraid, or neutral. The recordings are stored and averaged during each class.
The system, called the "smart classroom behavior management system," also analyzes students' actions during class, categorized into reading, listening, writing, standing up, raising hands, or leaning on the desk. As a result of this monitoring system, the teacher's screen displays a list of student names deemed "not paying attention." Similar systems based on facial recognition technology have been used in schools to serve meals to students in school cafeterias or to pay for items at local stores.

As technology designers, we should provide an information infrastructure that helps society to be more certain that data-mining power is used only in legally approved ways and that data that may give rise to consequences for individuals are based on inferences derived from accurate, approved, fair, and legally available data. Future data-mining solutions reconciling any social issues must not only be applicable to the ever-changing technological environment but also be flexible with regard to specific contexts and disputes.

12.7 CLOUD COMPUTING BASED ON HADOOP AND MAP/REDUCE

Modern information societies are defined by vast repositories of data, both public and private, and a new generation of applications must be able to scale up to big data frameworks. Current technological advances in storage, communication, and computation have enabled cost-effective capture of big data in a timely manner. Big data finds a solution for more computing power in cloud computing platforms. As the National Institute of Standards and Technology defines it, "Cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction." The growth of cloud computing enables businesses to turn to new service-based IT solutions, overcoming the challenge of big investments in IT infrastructure. The cloud may be considered a marketplace, where the storage and computing capabilities of a cloud can be leased as necessary.

Cloud computing offers enterprises and users high scalability, high availability, and high reliability. It can improve resource-utilization efficiency and can reduce the cost of business information construction, investment, and maintenance. As the public cloud services from Amazon, Google, Microsoft, and other specialized companies become more sophisticated and better developed, more and more companies are migrating toward the cloud computing platform.

The core of the cloud computing model is a new parallel computing architecture, often called cluster computing. It is organized as a large number of computing nodes (standard processing modules with processor, central memory, and disk), which are stored together on racks, usually 8–64 per rack. The nodes on a single rack are connected by a gigabit Ethernet network. There can be many racks with multiple nodes, and the racks are connected at another level of the network through a special high-speed network or switch. Figure 12.39 highlights the main components of the architecture of a large-scale parallel computing system.

Figure 12.39. Computing nodes are organized into racks, and racks are interconnected by a switch.

These new computing facilities have given rise to a new generation of programming systems, which take advantage of the power of parallelism and at the same time avoid the reliability problems that arise when the computing hardware consists of thousands of independent components. A series of Google products opened the door to massive data storage, and cloud computing became the de facto standard in the big data field, with Google as a technology leader. While Google's technology was not open source, Yahoo was the first to offer an open-source software framework designed for supporting data-intensive applications, called Hadoop. Instead of relying on expensive proprietary hardware to store and process the data, Hadoop enables distributed processing of large amounts of data on large clusters of commodity servers. Hadoop has many advantages compared with other previously developed architectures, and it is particularly suitable for big data management and analysis because of:

• High scalability: Hadoop allows hardware infrastructure to be scaled up and down with no need to change data formats. The system will automatically redistribute data and computation jobs to accommodate hardware changes.
• Cost efficiency: Hadoop brings massively parallel computation to commodity servers, leading to a sizeable decrease in cost per terabyte of storage, which makes massively parallel computation affordable for the ever-growing volume of big data.

• Fault tolerance: Missing data and computation failures are common in big data analytics. Hadoop can recover from data and computation failures caused by node breakdown or network congestion.

The foundation of Hadoop is the Hadoop Distributed File System (HDFS). This module is responsible for storing and retrieving data, and it is designed and optimized to deal with large amounts of data. Files are composed of chunks of about 64 megabytes, and each chunk is replicated several times on different compute nodes or racks, which is the mechanism that protects HDFS from failures. HDFS is the only mandatory module of the Hadoop framework, and all other components are optional. The second important component, which represents a core module of Hadoop and is essential for big data applications on the cloud, is MapReduce.

MapReduce, a powerful programming framework, enables the automatic parallelization and distribution of computation applications on large clusters of commodity machines. It offers a style of solving specific problems when big data is available. One of the most significant advantages of MapReduce is that it provides abstractions that hide many implementation details from the programmer. MapReduce can be viewed as the first breakthrough in the new abstractions that allow us to organize computations, not over individual machines, but over entire clusters. Additionally, MapReduce provides great fault tolerance, which is important when working with large data sets. The core idea of MapReduce is first to divide massive data into small chunks and then to process these chunks in parallel and in a distributed manner to generate intermediate results. By aggregating all the intermediate results, the final result is derived.

MapReduce codifies this generic "recipe" for processing large data sets into two stages. In the first, so-called Map function, a user-specified computation is applied over all input records in a data set. It takes a collection of input objects and turns each into zero or more key–value pairs; keys are not necessarily unique. These Map operations occur in parallel and yield intermediate output that is then aggregated by the second function, called Reduce. Reduce tasks are also parallel, and they combine the elements on each key–value list by applying the function written by the user. The results produced by all the Reduce tasks form the aggregated output of the map–reduce process. This MapReduce programming model has achieved massive scalability in real-world applications across hundreds or thousands of servers within a Hadoop cluster.

An illustrative example of a MapReduce process can explain the advantages of the framework. Suppose that there is a large collection of text documents in a database and the task is to count the number of times each distinct word occurs in the document corpus. Figure 12.40 illustrates the basic steps in the MapReduce process to perform the required task. To simplify explanations, we chose to process a database with only three documents, and the assumption is that there are three servers for Map functions and two servers for Reduce functions. Of course, the scalability of the solution makes it possible to extend the problem to millions of documents with hundreds or even thousands of Map and Reduce servers in the cloud, while the example only captures the essence of the MapReduce framework.

Figure 12.40. MapReduce process for the word-count problem: (a) Map input, the three documents "The weather is good," "Today is good," and "Good weather is good"; (b) Map output as (word, 1) pairs; (c) Reduce input grouped by word; and (d) Reduce output with the final counts (the, 1), (is, 3), (weather, 2), (today, 1), (good, 4). Based on Jeff Dean & Sanjay Ghemawat slides [Google Inc.].

In this simple example, because there are only three documents and three Map servers, each document (Figure 12.40a) is assigned to a separate server, and the Map function decomposes the document text into a set of words (Figure 12.40b). The final result of the Map phase is an integrated list of key–value pairs for all documents; in our case, it is a list of all words with their counts in each document (Figure 12.40c). The results of the Map phase are then distributed between Reduce servers; in our example, the sorted list of words is split between two Reduce servers. The Reduce phase accumulates the counts of words that repeat in different documents, and the Reduce aggregation produces the final word-count results (Figure 12.40d). The final output is written to the distributed file system, one file per reducer, where each file will contain roughly the same number of words.

Although a two-stage MapReduce processing structure may appear to be very restrictive, many interesting algorithms can be expressed quite concisely, especially if one decomposes complex algorithms into a sequence of MapReduce jobs. An illustrative example is Google's PageRank algorithm. It is important to realize, however, that many algorithms cannot be easily expressed as a single MapReduce job.
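The word-count data flow of Figure 12.40 can also be expressed in a few lines of code. The following Python sketch is only a sequential, in-memory illustration of the Map, shuffle, and Reduce steps; the function names and the explicit shuffle are simplifications introduced here and are not part of any particular Hadoop API.

```python
from collections import defaultdict

def map_phase(document):
    """Map: turn one document into a list of (word, 1) key-value pairs."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(mapped_pairs):
    """Group intermediate pairs by key (done automatically by the framework on a real cluster)."""
    groups = defaultdict(list)
    for word, count in mapped_pairs:
        groups[word].append(count)
    return groups

def reduce_phase(word, counts):
    """Reduce: aggregate the list of counts for one key into a single total."""
    return word, sum(counts)

documents = ["The weather is good", "Today is good", "Good weather is good"]

# Map step: each document could be processed by a different worker in parallel.
intermediate = []
for doc in documents:
    intermediate.extend(map_phase(doc))

# Shuffle and Reduce steps: each key (word) can be reduced independently.
final_counts = dict(reduce_phase(w, c) for w, c in shuffle(intermediate).items())
print(final_counts)   # {'the': 1, 'weather': 2, 'is': 3, 'good': 4, 'today': 1}
```

In a real Hadoop job the map and reduce functions run on separate servers and the framework performs the shuffle by key; the sequential loop above only mimics that behavior, but it produces exactly the counts shown in Figure 12.40d.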

Consider, for example, what would happen if the task were to find the mean value of a large set of numerical values. The Mapper would compute the mean of an arbitrary subset of values associated with the same key, and the Reducer would compute the mean of those (mean) values. As a concrete example, we know that

Mean(1, 2, 3, 4, 5) ≠ Mean[Mean(1, 2), Mean(3, 4, 5)]

which means that this simple implementation of the MapReduce framework for the problem is not applicable. A correct job must instead pass partial sums together with the corresponding counts through the Map and Reduce stages and compute the mean only at the very end.

The availability of cloud-based solutions has dramatically lowered the cost of storage and has enabled new business models that support on-demand, pay-for-use, and scalable IT services over the Internet. Clouds provide services at different levels of resource use:

• The Software as a Service (SaaS) model enables a software provider to license a software application to be used and purchased on demand. The application can be accessed over the network from various clients (Web browser, mobile phone, etc.); it requires no client installation, just a browser or other client device and network connectivity.
• The Platform as a Service (PaaS) model provides developers with a service covering complete software development lifecycle management, from planning and design, through application building and deployment, to testing and maintenance.
• Infrastructure as a Service (IaaS) focuses on enabling technologies. The cloud consumers directly use infrastructure components (storage, network, firewalls, etc.) that are provided by the cloud provider.

The trend of making everything-as-a-service has affected many disciplines, including machine learning and data mining. Machine Learning as a Service (MLaaS) is an integrated framework of automated and semiautomated cloud platforms that covers most of the infrastructure required during the data-mining process, including data preprocessing, model training, model evaluation, and application of the machine-learning model in real-world environments. The obtained prediction results can be bridged with other IT infrastructure in the company. Information governance, security, and system management support each processing phase to ensure that regulations and policies are followed for all data used or derived in the data-mining process. Compliance is tracked to ensure controls over delivering expected results.

Amazon Machine Learning services, Microsoft Azure Machine Learning, and Google Cloud AI are three initial and leading frameworks offering cloud MLaaS services that allow fast model development and deployment with little data-mining experience. Tools for classification, regression, clustering, anomaly detection, ranking, and recommender systems are available in all three frameworks with different levels of automation and graphical interfaces. These platforms additionally include APIs, which can perform text, speech, and image recognition, topic

extraction, voice verification, sentiment analysis, etc. Besides fully integrated platforms, you can use high-level APIs for performing specific machine-learning tasks in the cloud environment. These are services with already trained models into which you can feed your data and get results. The APIs do not require machine-learning expertise at all. Currently, the APIs from the main vendors include applications such as text recognition, text translation, textual analysis, sentiment analysis, facial recognition, image annotation, and video recognition.

12.8 REINFORCEMENT LEARNING

Reinforcement learning (RL) is a class of machine-learning algorithms that aims at allowing an agent to learn how to behave in an environment. An RL agent has the ability to act, and this interaction with the environment enables the agent to learn by trial and error. Each action in the environment influences the agent's future state. The success of an agent's action is measured by a scalar reward signal, which represents the only feedback from the environment. In general, the main goal of the RL process is the selection of appropriate actions that will maximize reward over repeated agent–environment interactions. The agent has to explore the environment by performing actions and analyzing the consequences. The effects on the environment are observed through rewards for each of the agent's actions, and that feedback is the core of the machine-learning process. The goal of the agent is to perform a set of actions that will maximize the reward signal in the long run.

Model-free RL algorithms do not rely on the availability of a perfect model. Instead, they rely on interaction with the environment and learn step by step through these interactions. Because the model is unknown, the learner has to try out different actions to see their results and to evaluate potential rewards. Each step in the agent–environment interaction generates a learning sample. These samples are used to assign a value in accordance with the immediate reward and also to estimate the value of the next state.

RL may be formalized as a framework defining the interaction between a learning agent and its environment in terms of (1) states, (2) actions, and (3) rewards. The main components and their interactions are given in Figure 12.41.

Figure 12.41. Main components in the reinforcement learning process (agent, environment, state, action, and reward).

The framework is a

simplified representation of essential features and main dynamics of the RL process. The agent, to be involved in the learning process, must have a single goal or alternative goals relating to the state of the environment, and it must be able to learn from its own experience. Beyond the agent and the environment, three additional elements participate in RL activities: a policy, a reward function, and a value function.

A policy defines the learning agent's way of behaving at a given time. It is a mapping function from perceived states of the environment to actions to be taken when the agent is in those states. A reward function highlights the main goal in an RL problem; it specifies a reward for an agent being in a given state or doing some specific action in a state. The environment sends the RL agent a single number as a reward for every step in each learning sample. While the reward signal indicates what is good in an immediate sense, a value function specifies what is good as a solution in the long-run learning process. The agent's main objective is to maximize the total reward it receives over multiple learning samples. One of the core RL challenges, which is not essential in other kinds of machine learning, is the trade-off between exploration and exploitation. The agent has to exploit what it already knows in order to obtain the highest reward, but it also has to explore unknown paths using new samples in order to make better action selections in the future.

One of the basic and most popular RL methods is the Q-learning algorithm. The basic idea in Q-learning is to incrementally estimate Q-values for all new actions, based on feedback expressed through rewards and the agent's previous Q-value function. Q-learning is exploration insensitive: the learning process will converge to the optimal policy regardless of the exploration policy being followed. The only required assumption is that each state–action pair is visited an infinite number of times or, in practice, a sufficiently large number of times.

Basic ideas and components of a Q-learning algorithm will be illustrated through a relatively simple example. Suppose that there is an environment that represents a building with five rooms connected by doors. The layout of the framework is given in Figure 12.42a, where the rooms are numbered 0 to 4, while number 5 represents the outside space. Notice that the doors from rooms 1 and 4 lead into the "room" 5 (outside), while the other doors are between regular rooms. We may call each room, including the outside, a "state" of the environment, and the agent's movement from one room to another an "action." In the graph formalization in Figure 12.42b, a "state" is depicted as a node in a graph, while "actions" are represented by the arrows between states. The structure of the building is represented by the corresponding graph, where an action connects two states only if there is a direct door between the corresponding rooms. For illustration of the Q-learning process, the agent may be initially located in any room, and it should learn how to move from that room to reach the goal: the space 5 outside the building. To set this "room" 5 as a goal, it is necessary to associate a reward value with each door, that is, with each action in the graph. Assume that the doors from rooms 1 and 4 leading immediately to the goal state 5 carry the highest instant reward of 100; this includes the feedback loop from "room" 5 to itself, with the same reward of 100.
In Q-learning,

the goal is to reach the state with the highest reward, so that if the agent arrives at the goal, it should remain there forever.

Figure 12.42. A framework for illustration of the Q-learning process. (a) Five-room building. (b) Graph representation of the building. From: http://mnemstudio.org/path-finding-q-learning-tutorial.htm

In the reward matrix R in Figure 12.43a, the value −1 shows that there is no link (door) between the corresponding rooms (states). The reward value of 0 is assigned to all existing doors that do not lead to the goal, the outside state 5:

                 Action
                  0     1     2     3     4     5
        State 0 [ -1    -1    -1    -1     0    -1 ]
              1 [ -1    -1    -1     0    -1   100 ]
    R =       2 [ -1    -1    -1     0    -1    -1 ]
              3 [ -1     0     0    -1     0    -1 ]
              4 [  0    -1    -1     0    -1   100 ]
              5 [ -1     0    -1    -1     0   100 ]

Figure 12.43. Reward function R for the five-room building environment. (a) Tabular representation of the R matrix. (b) Graph representation of the R matrix.

The reward matrix may be translated into the graph in Figure 12.43b, where only existing links/actions with the corresponding rewards 0 or 100 are specified. The agent starts its learning activities, and at the beginning the assumption is that the agent knows nothing; therefore, the matrix Q is initialized to zero. The main transition rule of Q-learning is represented by two simple formulas:

Qlearned(state, action) = R(state, action) + γ × max[Qold(next state, all actions)]

Qnew(state, action) = (1 − α) × Qold(state, action) + α × Qlearned(state, action)

where α is the learning rate (0 ≤ α ≤ 1) and γ is the discount factor (0 ≤ γ ≤ 1).

These formulas connect the calculation of new Q-values in the matrix with the initial R matrix of rewards and the agent's current exploration, defined by the current state and the current action. The agent learns without a teacher, based on the experience gained with every new action. The exploration process is analyzed through episodes, where each episode consists of the agent moving from an initial state to the goal state, and it is equivalent to one training session. Each time the agent arrives at the goal state, the learning process goes on to the next episode. More training applied in the Q methodology gives a more optimized matrix Q.

Let us start with a simple learning process assuming that the learning parameter is α = 1, the discount parameter is γ = 0.8, and the initial state is randomly selected as room 1. In this case Qnew = Qlearned. There are two possible actions for the current state 1: (1) go to state 3 or (2) go to state 5. The agent does not have any initial knowledge, and it randomly selects going to room 5 as the action. In that case the correction in the Q matrix is calculated as follows:

Qnew(1, 5) = R(1, 5) + 0.8 × max[Q(5, 1), Q(5, 4), Q(5, 5)] = 100 + 0.8 × 0 = 100

The initial R(1, 5) value is taken from the R matrix, while all alternative Q-values for actions from state 5 toward states 1, 4, and 5 are equal to 0 in the current Q matrix. Because the agent arrived at the goal state 5, this episode is finished, and the optimized Q matrix has only one new value, Q(1, 5) = 100; all the other values stay 0.

For the next episode, the agent starts with a randomly chosen initial state 3. Based on the matrix R, there are three possible actions: go to state 1, 2, or 4. Without additional knowledge, the agent randomly selects state 1 as the action. The correction in the Q matrix will be calculated as

Qnew(3, 1) = R(3, 1) + 0.8 × max[Q(1, 3), Q(1, 5)] = 0 + 0.8 × max(0, 100) = 80

This time the values for R(3, 1) and Q(1, 3) are both 0, while Q(1, 5) = 100 is the value already learned in the previous episode. The matrix Q becomes the matrix with two corrected values, as given in Figure 12.44a. With each new episode, new values are introduced or previous values are modified in the Q matrix. With a larger number of learning episodes and frequent changes in the matrix Q, the values in the matrix may become very high. The matrix Q can be normalized and converted to percentages by dividing all nonzero entries by the highest number in the matrix. If the reader practices the calculation of the Q matrix for several randomly selected episodes, the obtained result will be similar to the normalized matrix in Figure 12.44b.

(a) Q matrix after two episodes (only Q(1, 5) = 100 and Q(3, 1) = 80 differ from zero):

          0     1     2     3     4     5
    0 [   0     0     0     0     0     0 ]
    1 [   0     0     0     0     0   100 ]
    2 [   0     0     0     0     0     0 ]
    3 [   0    80     0     0     0     0 ]
    4 [   0     0     0     0     0     0 ]
    5 [   0     0     0     0     0     0 ]

(b) Normalized Q matrix after the problem has been learned:

          0     1     2     3     4     5
    0 [   0     0     0     0    80     0 ]
    1 [   0     0     0    64     0   100 ]
    2 [   0     0     0    64     0     0 ]
    3 [   0    80    51     0    80     0 ]
    4 [  64     0     0    64     0   100 ]
    5 [   0    80     0     0    80   100 ]

Figure 12.44. Dynamic changes of the Q matrix with each new episode. (a) Q matrix after two episodes. (b) Q matrix after the problem has already been learned.

Once the matrix Q gets close enough to a state of convergence, the agent has learned the most optimal paths to the goal state for the given environment. The state of convergence may be recognized when the matrix Q does not change significantly with any additional episodes. The agent has learned enough, and it behaves in the environment by tracing the best sequences of states; this is essentially a simple process of following the actions with the highest Q-values at each state.

The RL approach became widely known with the AlphaGo system. The ancient Chinese game of Go has challenged AI researchers for many decades. Methods that

achieve human-level skills in other games, such as poker and chess, have not been successful in producing strong Go programs. Recently, a Google DeepMind team developed a program called AlphaGo that broke this barrier, learning the game of Go by combining deep learning neural networks and RL. The system beat the best human players in the world in high-profile matches. The RL approach thus shows the ability to build more advanced learning systems, whose application areas are much wider than games.

A promising and highly publicized application of RL technology is the humanlike behavior of automatic agents in self-driving cars. Today's driverless vehicles are becoming more and more sophisticated; still, they often falter in complex situations that involve interacting with human drivers. A number of industrial-robot makers are testing the RL approach as a way to train their machines to perform new tasks without manual programming. Applications of RL in high-dimensional control problems, including robotics, have been the subject of research in academia and industry, while startups are beginning to use RL to build a new generation of products for industrial robotics. Google worked with the DeepMind group on using RL to make its data centers more energy efficient. It is difficult to figure out how all the elements in a data center will affect energy usage, but an RL algorithm can learn from collated data and experiment in simulation to suggest, say, how and when to operate the cooling systems.

In the financial sector there are several solutions leveraging RL for evaluating trading strategies. It is turning out that RL represents a robust methodology for training systems to optimize financial objectives. It has primary applications in stock market trading, where a Q-learning algorithm is able to learn an optimal trading strategy with one simple instruction: maximize the value of the user's portfolio. Also, personalizing Web services such as the delivery of news articles or advertisements is one approach to increasing users' satisfaction with a Web site or to increasing the yield of a marketing campaign. This is a natural domain for RL. An RL system can improve a recommendation policy by making adjustments in response to user feedback. One way to obtain user feedback is by means of Web site satisfaction surveys, but for acquiring feedback in real time, it is common to monitor user clicks as indicators of interest in a link.
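The five-room example can also be traced in code. The following Python sketch is a minimal illustration rather than a prescribed implementation: the reward matrix is the one from Figure 12.43, the parameters α = 1 and γ = 0.8 follow the hand-computed episodes above, while the episode loop, the purely random exploration policy, and the variable names are illustrative choices.

```python
import random

# Reward matrix R from Figure 12.43: -1 marks a missing door, 0 an ordinary door,
# and 100 a door leading into the goal state 5.
R = [
    [-1, -1, -1, -1,  0,  -1],
    [-1, -1, -1,  0, -1, 100],
    [-1, -1, -1,  0, -1,  -1],
    [-1,  0,  0, -1,  0,  -1],
    [ 0, -1, -1,  0, -1, 100],
    [-1,  0, -1, -1,  0, 100],
]
GOAL, GAMMA, ALPHA = 5, 0.8, 1.0       # goal state, discount factor, learning rate (as in the text)
Q = [[0.0] * 6 for _ in range(6)]      # Q matrix initialized to zero

def q_update(state, action):
    # Q_new = (1 - alpha) * Q_old + alpha * (R + gamma * max over actions from the next state)
    next_state = action                # taking an action means moving into that room
    allowed = [a for a in range(6) if R[next_state][a] >= 0]
    learned = R[state][action] + GAMMA * max(Q[next_state][a] for a in allowed)
    Q[state][action] = (1 - ALPHA) * Q[state][action] + ALPHA * learned

for episode in range(1000):            # one episode: random start, walk until a move reaches the goal
    state = random.randint(0, 5)
    while True:
        action = random.choice([a for a in range(6) if R[state][a] >= 0])
        q_update(state, action)
        if action == GOAL:
            break
        state = action

top = max(max(row) for row in Q)       # normalize to percentages, as in Figure 12.44b
print([[round(100 * q / top) for q in row] for row in Q])
```

With α = 1 the update reduces to Qnew = Qlearned, exactly as in the hand-computed episodes, and after enough random episodes the printed normalized matrix typically reproduces the values of Figure 12.44b (100, 80, 64, and 51, up to rounding).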

In spite of all the initial successes with RL, there are warnings from experienced users and researchers that the approach, to be successful, requires a huge amount of data. While some solutions may be obtained with a large number of simulations, if available, researchers are aware that RL becomes less effective when dealing with highly complicated problems with high-dimensional state and action spaces. Some recognize potential solutions in combining different machine-learning approaches, such as the integration of RL with deep learning techniques.

12.9 REVIEW QUESTIONS AND PROBLEMS

1. What are the benefits of modeling social networks with a graph structure? What kind of graph would you use in this case?

2. For the given undirected graph G (nodes 1–5, with the edges as drawn in the original figure):

   1   2
     3
   4   5

   (a) Compute degree and variability parameters of the graph.
   (b) Find the adjacency matrix for the graph G.
   (c) Determine the binary code(G) for the graph.
   (d) Find the closeness parameter for each node of the graph.
   (e) What is the betweenness measure for the node #2?

3. For the graph given in Problem #2, find partial betweenness centrality using the modified graph starting with node #5.

4. Give real-world examples for traditional analyses of temporal data (i.e., trends, cycles, seasonal patterns, outliers).

5. Given the temporal sequence S = {1 2 3 2 4 6 7 5 3 1 0 2}:
   (a) Find the PAA for four sections of the sequence.
   (b) Determine SAX values for the solution in (a) if (i) α = 3 and (ii) α = 4.
   (c) Find the PAA for three sections of the sequence.
   (d) Determine SAX values for the solution in (c) if (i) α = 3 and (ii) α = 4.

6. Given the sequence S = {A B C B A A B A B C B A B A B B C B A C C}:
   (a) Find the longest subsequence with frequency ≥ 3.
   (b) Construct a finite state automaton (FSA) for the subsequence found in (a).

7. Find the normalized contiguity matrix for the following table of US cities:

   Minneapolis   Chicago      New York
   Nashville     Louisville   Charlotte

   Assume that only neighboring cities (vertical and horizontal) in the table are close.

8. For the Bayesian network in Figure 12.38, determine:
   (a) P(C, R, W)
   (b) P( C, S, W)

9. Review the latest articles on privacy-preserving data mining that are available on the Internet. Discuss the trends in the field.

10. What are the largest sources of unintended personal data on the Internet? How can the awareness of Web users be increased regarding their personal data that are available on the Web for a variety of data-mining activities?

11. Discuss an implementation of transparency and accountability mechanisms in a data-mining process. Illustrate your ideas with examples of real-world data-mining applications.

12. Give examples of data-mining applications where you would use a distributed-data-mining approach. Explain the reasons.

13. Develop and explain all steps of a MapReduce process for an inverted file index. Inputs are text documents. The output should be a list of words used in the documents, where each word is connected with the list of keys representing each document containing the word.

14. Find the set of 2-shingles for the "documents" (or simple text) in (a) and (b):
    (a) ABRACADABRA
    (b) BRICABRAC
    (c) How many 2-shingles do the two texts have in common?

15. For the following graph:

       C -- D -- E
      / \
     A   B
      \ /
       F -- G -- H

    compute the betweenness of every edge.

16. The following is a description of the causal relationship between storms, the behavior of burglars and cats, and house alarms: Stormy nights are rare. Burglary is also rare, and if it is a stormy night, burglars are likely to stay at home (burglars do not like going out in storms). Cats do not like storms either, and if there is a storm, they like to go inside. The alarm on your house is designed to be triggered if a burglar breaks into your house, but sometimes it can be set off by your cat coming into the house, and sometimes it might not be triggered even if a burglar breaks in (it could be faulty or the burglar might be very good). Define the topology of a Bayesian network that encodes these causal relationships.

17. Consider the time series (−3, −1, 0, 3, 5, 7, ∗). Here, a missing entry at the end is denoted by ∗.
    (a) What would be the estimated value of the missing entry using linear interpolation on a window of size 3?
    (b) What would be the estimated value if the fourth value in the time series is missing and the calculation is based on the same window size?

12.10 REFERENCES FOR FURTHER STUDY

1. Aggarwal C. C., Yu P. S., Privacy-Preserving Data Mining: Models and Algorithms, Springer, Berlin, 2008.
The book proposes a number of techniques to perform the data-mining tasks in a privacy-preserving way. These techniques generally fall into the following categories: data modification techniques, cryptographic methods and protocols for data sharing, statistical techniques for disclosure and inference control, query auditing methods, randomization, and perturbation-based techniques. This edited volume contains surveys by distinguished researchers in the privacy field. Each survey includes the key research content as well as future research directions. Privacy-Preserving Data Mining: Models and Algorithms is designed for researchers, professors, and advanced-level students in computer science and is also suitable for industry practitioners.

2. Mitsa T., Temporal Data Mining, Chapman & Hall/CRC Press, 2010.
From basic data-mining concepts to state-of-the-art advances, Temporal Data Mining covers the theory of this subject as well as its application in a variety of fields. It discusses the incorporation of temporality in databases as well as temporal data representation, similarity computation, data classification, clustering, pattern discovery, and prediction. The book also explores the use of temporal data mining in medicine and biomedical informatics, business and industrial applications, Web usage mining, and spatiotemporal data mining. Along with various state-of-the-art

462 ADVANCES IN DATA MINING algorithms, each chapter includes detailed references and short descriptions of rel- evant algorithms and techniques described in other references. In the appendices, the author explains how data mining fits the overall goal of an organization and how these data can be interpreted for the purpose of characterizing a population. She also provides programs written in the Java language that implement some of the algo- rithms presented in the first chapter. 3. Zeitouni K., A Survey of Spatial Data Mining Methods: Databases and Statistics Point of View, In “Data warehousing and web engineering”, Becker S., editor, IRM Press, 2002. This chapter reviews the data-mining methods that are combined with geographic information systems (GIS) for carrying out spatial analysis of geographic data. We will first look at data-mining functions as applied to such data and then highlight their specificity compared with their application to classical data. We will go on to describe the research that is currently going on in this area, pointing out that there are two approaches: the first comes from learning on spatial databases, while the second is based on spatial statistics. We will conclude by discussing the main dif- ferences between these two approaches and the elements they have in common. 4. Chakrabarti D., Faloutsos C., Graph Mining: Laws, Generators, and Algorithms, ACM Computing Surveys, Vol. 38, March 2006, pp. 1–69. How does the Web look? How could we tell an abnormal social network from a normal one? These and similar questions are important in many fields where the data can intuitively be cast as a graph; examples range from computer networks to sociology to biology and many more. Indeed, any M: N relation in database termi- nology can be represented as a graph. A lot of these questions boil down to the following: “How can we generate synthetic but realistic graphs?” To answer this, we must first understand what patterns are common in real-world graphs and can thus be considered a mark of normality/realism. This survey gives an overview of the incredible variety of work that has been done on these problems. One of our main contributions is the integration of points of view from physics, mathematics, sociology, and computer science. Further, we briefly describe recent advances on some related and interesting graph problems. 5. Pearl J., Causality: Models, Reasoning and Inference, 2nd edition, Cambridge University Press, 2009. This book fulfills a long-standing need for a rigorous yet accessible treatise on the mathematics of causal inference. Judea Pearl has done a masterful job of describing the most important approaches and displaying their underlying logical unity. The book deserves to be read by all scientists who use non-experimental data to study causation and would serve well as a graduate or advanced undergraduate course text. The book should prove invaluable to researchers in artificial intelligence, sta- tistics, economics, epidemiology, and philosophy and, indeed, all those interested in the fundamental notion of causality. It may well prove to be one of the most influential books of the next decade.

REFERENCES FOR FURTHER STUDY 463 6. Gosavi A., Simulation-Based Optimization: Parametric Optimization Techniques and Reinforcement Learning, 2nd edition, Springer, New York, NY, 2014. The book introduces the evolving area of static and dynamic simulation-based optimization. Covered in detail are model-free optimization techniques— especially designed for those discrete-event stochastic systems that can be simu- lated but whose analytical models are difficult to find in closed mathematical forms. It includes an in-depth consideration of dynamic simulation optimization via temporal differences and reinforcement learning: Q-learning, SARSA, and R-SMART algorithms, and policy search, via API, Q-P-learning, actor-critics, and learning automata. A special examination is given to neural-network-based function approximation for reinforcement learning, semi-Markov decision pro- cesses (SMDPs), finite-horizon problems, two time scales, case studies for indus- trial tasks, computer codes (placed online), and convergence proofs, via Banach fixed point theory and ordinary differential equations. The book is themed around three areas in separate sets of chapters—static simulation optimization, reinforce- ment learning, and convergence analysis. 7. Talia D., Data Analysis in the Cloud: Models, Techniques and Applications, Else- vier Inc., 2015. The book introduces and discusses models, methods, techniques, and systems to analyze the large number of digital data sources available on the Internet using the computing and storage facilities of the cloud. Coverage includes scalable data- mining and knowledge-discovery techniques together with cloud computing con- cepts, models, and systems. Specific sections focus on map–reduce and NoSQL models. The book also includes techniques for conducting high-performance dis- tributed analysis of large data on clouds. Finally, the book examines research trends such as big data pervasive computing, data-intensive exascale computing, and massive social network analysis.



13

GENETIC ALGORITHMS

Chapter Objectives

• Identify effective algorithms for approximate solutions of optimization problems described with large data sets.
• Compare basic principles and concepts of natural evolution and simulated evolution expressed through genetic algorithms.
• Describe the main steps of a genetic algorithm with illustrative examples.
• Explain standard and nonstandard genetic operators as a mechanism for improving solutions.
• Discuss the schema concept with don't care values and its application to approximate optimization.
• Apply a genetic algorithm to the traveling salesman problem and to the optimization of classification rules as examples of hard optimizations.

466 GENETIC ALGORITHMS There is a large class of interesting problems for which no reasonably fast algorithms have been developed. Many of these problems are optimization problems that arise frequently in applications. The fundamental approach to optimization is to formulate a single standard of measurement—a cost function—that summarizes the perfor- mance or value of a decision and iteratively improves this performance by selecting from among the available alternatives. Most classical methods of optimization gener- ate a deterministic sequence of trial solutions based on the gradient or higher-order statistics of the cost function. In general, any abstract task to be accomplished can be thought of as solving a problem, which can be perceived as a search through a space of potential solutions. Since we are looking for “the best” solution, we can view this task as an optimization process. For small data spaces, classical exhaustive search methods usually suffice; for large spaces, special techniques must be employed. Under regular conditions, the techniques can be shown to generate sequences that asymptotically converge to optimal solutions, and in certain cases they converge exponentially fast. But the methods often fail to perform adequately when random perturbations are imposed on the function that is optimized. Further, locally optimal solutions often prove insufficient in real-world situations. Despite such problems, which we call hard-optimization problems, it is often possible to find an effective algorithm whose solution is approximately optimal. One of the approaches is based on genetic algorithms (GAs), which are developed on the principles of natural evolution. Natural evolution is a population-based optimization process. Simulating this process on a computer results in stochastic-optimization techniques that can often outperform classical methods of optimization when applied to difficult real-world pro- blems. The problems that the biological species have solved are typified by chaos, chance, temporality, and nonlinear interactivity. These are the characteristics of the problems that have proved to be especially intractable to classical methods of optimi- zation. Therefore, the main avenue of research in simulated evolution is a genetic algorithm, which is a new iterative optimization method that emphasizes some facets of natural evolution. GAs approximate an optimal solution to the problem at hand; they are by nature stochastic algorithms whose search methods model some natural phenomena such as genetic inheritance and the Darwinian strife for survival. 13.1 FUNDAMENTALS OF GENETIC ALGORITHMS GAs are derivative-free stochastic-optimization methods based loosely on the con- cepts of natural selection and evolutionary processes. They were first proposed and investigated by John Holland at the University of Michigan in 1975. The basic idea of GAs was revealed by a number of biologists when they used computers to perform simulations of natural genetic systems. In these systems, one or more chromosomes combine to form the total genetic prescription for the construction and operation of some organism. The chromosomes are composed of genes, which may take a number of values called allele values. The position of a gene (its locus) is identified separately

from the gene's function. Thus, we can talk of a particular gene, e.g., an animal's eye-color gene, with its locus at position 10 and its allele value as blue eyes.

Before going into details of the applications of GAs in the following sections, let us understand their basic principles and components. GAs encode each point in a parameter or solution space into a binary-bit string called a chromosome. These points in an n-dimensional space do not represent samples in the terms that we defined them at the beginning of this book. While samples in other data-mining methodologies are data sets given in advance for training and testing, sets of n-dimensional points in GAs are a part of a GA, and they are produced iteratively in the optimization process. Each point or binary string represents a potential solution to the problem that is to be solved. In GAs, the decision variables of an optimization problem are coded by a structure of one or more strings, which are analogous to chromosomes in natural genetic systems. The coding strings are composed of features that are analogous to genes. Features are located in different positions in the string, where each feature has its own position (locus) and a definite allele value, which complies with the proposed coding method. The string structures in the chromosomes go through different operations similar to the natural evolution process to produce better alternative solutions. The quality of new chromosomes is estimated based on the "fitness" value, which can be considered as the objective function for the optimization problem. The basic relations between concepts in natural evolution and GAs are given in Table 13.1.

TABLE 13.1. Basic Concepts in Genetic Algorithms

Concept in Natural Evolution     Concept in Genetic Algorithms
Chromosome                       String
Gene                             Features in the string
Locus                            Position in the string
Allele                           Position value (usually 0 or 1)
Genotype                         String structure
Phenotype                        Set of characteristics (features)

Instead of a single point, GAs usually keep a set of points as a population, which is then evolved repeatedly toward a better overall fitness value. In each generation, the GA constructs a new population using genetic operators such as crossover and mutation. Members with higher fitness values are more likely to survive and participate in mating or crossover operations.

As a general-purpose optimization tool, GAs are moving out of academia and finding significant applications in many other venues. Typical situations where GAs are particularly useful are difficult optimization cases for which analytical methods do not work well. GAs have been quite successfully applied to optimization problems like wire routing, scheduling, adaptive control, game playing, transportation problems, traveling salesman problems (TSP), database query optimization, machine learning, etc. During the last decades, the significance of optimization has grown even further because many important large-scale combinatorial-optimization problems and

highly constrained engineering problems can only be solved approximately. GAs aim at such complex problems. They belong to the class of probabilistic algorithms, yet they are very different from random algorithms, as they combine elements of directed and stochastic search. Another important property of genetics-based search methods is that they maintain a population of potential solutions while all other methods process a single point of the search space. Because of these characteristics, GAs are more robust than existing directed-search methods. GAs are popular because they do not depend on functional derivatives, and they have the following characteristics:

1. GAs are parallel-search procedures that can be implemented on parallel-processing machines for massively speeding up their operations.
2. GAs are applicable to both continuous- and discrete-optimization problems.
3. GAs are stochastic and less likely to get trapped in local minima, which inevitably are present in any practical optimization application.
4. GAs' flexibility facilitates both structure and parameter identification in complex models.

The GA theory provides some explanation why, for a given problem formulation, we may obtain convergence to the sought optimal point. Unfortunately, practical applications do not always follow the theory, the main reasons being:

1. The coding of the problem often moves the GA to operate in a different space than that of the problem itself.
2. There are practical limits on the hypothetically unlimited number of iterations (generations in the GA).
3. There is a limit on the hypothetically unlimited population size.

One of the implications of these observations is the inability of GAs, under certain conditions, to find the optimal solution or even an approximation to the optimal solution; such failures are usually caused by premature convergence to a local optimum. Do not forget that this problem is common not only to GAs but also to other optimization algorithms and other data-mining techniques.

13.2 OPTIMIZATION USING GENETIC ALGORITHMS

Let us note first that, without any loss of generality, we can assume that all optimization problems can be analyzed as maximization problems only. If the optimization problem is to minimize a function f(x), this is equivalent to maximizing a function g(x) = −f(x). Moreover, we may assume that the objective function f(x) takes positive values in its domain. Otherwise, we can translate the function for some positive constant C so that it will be always positive; i.e.,

max f*(x) = max [f(x) + C]

If each variable xi, with real values, is coded as a binary string of length m, then the relation between the initial value and the coded information is

xi = a + decimal(binary-string_i) × (b − a) / (2^m − 1)

where the variable xi can take values from a domain Di = [a, b] and m is the smallest integer such that the binary code has the required precision. For example, suppose the variable x is given on the domain [10, 20] and coded as a binary string of length 3 with the code 100. While the range of codes is between 000 and 111, the question is: what is the real value of the coded variable x? For this example m = 3, and the corresponding precision is

(b − a) / (2^m − 1) = (20 − 10) / (2^3 − 1) = 10/7 = 1.42

and that is the difference between two successive xi values that could be tested as candidates for the extreme. Finally, the attribute with the code 100 has the decimal value

x = 10 + decimal(100) × 1.42 = 10 + 4 × 1.42 = 15.68

Each chromosome as a potential solution is represented by a concatenation of binary codes for all features in the problem to be optimized. Its total length m is a sum of the features' code lengths mi:

m = m1 + m2 + … + mk

where k is the number of features or input variables for the problem at hand. With these basic principles of code construction introduced, it is possible to explain the main steps of a GA.

13.2.1 Encoding Schemes and Initialization

A GA starts with designing a representation of a solution for the given problem. A solution here means any value that is a candidate for a correct solution and that can be evaluated. For example, suppose we want to maximize the function y = 5 − (x − 1)^2. Then x = 2 is a potential solution, x = 2.5 is another solution, and x = 1 is the correct solution of the problem that maximizes y. The representation of each solution for a GA is up to the designer. It depends on what each solution looks like and which solution form will be convenient for applying a GA. The most common representation of a solution is as a

string of characters, i.e., a string of codes for feature representation, where the characters belong to a fixed alphabet. The larger the alphabet, the more information can be represented by each character in the string. Therefore, fewer elements in a string are necessary to encode specific amounts of information. However, in most real-world applications, GAs usually use a binary-coding scheme.

The encoding process transforms points in a feature space into a bit string representation. For instance, a point (11, 6, 9) in a three-dimensional feature space, with ranges [0, 15] for each dimension, can be represented as a concatenated binary string:

(11, 6, 9) → 101101101001

in which each feature's decimal value is encoded as a gene composed of four bits using binary coding. Other encoding schemes, such as Gray coding, can also be used, and, when necessary, arrangements can be made for encoding negative, floating-point, or discrete-value numbers. Encoding schemes provide a way of translating problem-specific knowledge directly into the GA framework. This process plays a key role in determining a GA's performance. Moreover, genetic operators can and should be designed along with the encoding scheme used for a specific application.

A set of all feature values encoded into a bit string represents one chromosome. In GAs we are manipulating not a single chromosome but a set of chromosomes called a population. To initialize a population, we can simply set some pop-size number of chromosomes randomly. The size of the population is also one of the most important choices faced by any user of GAs and may be critical in many applications: will we reach the approximate solution at all, and if yes, how fast? If the population size is too small, the GA may converge too quickly, and maybe to a solution that is only a local optimum; if it is too large, the GA may waste computational resources, and the waiting time for an improvement might be too long.

13.2.2 Fitness Evaluation

The next step, after creating a population, is to calculate the fitness value of each member in the population because each chromosome is a candidate for an optimal solution. For a maximization problem, the fitness value fi of the ith member is usually the objective function evaluated at this member (or the point in parameter space). The fitness of a solution is a measure that can be used to compare solutions to determine which is better. The fitness values may be determined from complex analytical formulas and simulation models or by referring to observations from experiments or real-life problem settings. GAs will work correctly if fitness values are determined appropriately, keeping in mind that the selection of the objective function is highly subjective and problem dependent.

We usually need fitness values that are positive, so some kind of scaling and/or translation of data may become necessary if the objective function is not strictly positive. Another approach is to use the rankings of members in a population as their

fitness values. The advantage of this approach is that the objective function does not need to be accurate, as long as it can provide the correct ranking information.

13.2.3 Selection

In this phase, we have to create a new population from the current generation. The selection operation determines which parent chromosomes participate in producing offspring for the next generation. Usually, members are selected for mating with a selection probability proportional to their fitness values. The most common way to implement this method is to set the selection probability pi equal to

pi = fi / (f1 + f2 + … + fn)

where n is the population size and fi is a fitness value for the ith chromosome. The effect of this selection method is to allow members with above-average values to reproduce and replace members with below-average fitness values.

For the selection process (selection of a new population with respect to the probability distribution based on fitness values), a roulette wheel with slots sized according to fitness for each chromosome is used. We construct such a roulette wheel as follows:

1. Calculate the fitness value f(vi) for each chromosome vi.
2. Find the total fitness of the population:

   F = f(v1) + f(v2) + … + f(vpop-size)

3. Calculate the probability of a selection pi for each chromosome vi:

   pi = f(vi) / F

4. Calculate a cumulative probability qi after each chromosome vi is included:

   qi = p1 + p2 + … + pi

   where q increases from 0 to a maximum of 1. The value 1 shows that all chromosomes from the population are included in the cumulative probability.

The selection process is based on spinning the roulette wheel pop-size times. Each time we select a single chromosome for a new population. An implementation could repeat steps 1 and 2 pop-size times:

1. Generate a random number r from the range [0, 1].
2. If r < q1, then select the first chromosome v1; otherwise select the ith chromosome vi such that qi−1 < r ≤ qi.
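The roulette wheel procedure just described can be expressed directly in code. The following Python sketch is a minimal illustration; the helper name, the guard against floating-point round-off, and the example fitness values are placeholders rather than part of a specific application.

```python
import random

def roulette_selection(population, fitness_values):
    """Select pop-size chromosomes with probability proportional to fitness."""
    total = sum(fitness_values)                            # F = sum of f(v_i)
    probabilities = [f / total for f in fitness_values]    # p_i
    cumulative = []                                        # q_i = p_1 + ... + p_i
    running = 0.0
    for p in probabilities:
        running += p
        cumulative.append(running)

    selected = []
    for _ in range(len(population)):                       # spin the wheel pop-size times
        r = random.random()                                # random number in [0, 1)
        for chromosome, q in zip(population, cumulative):
            if r <= q:                                     # first i such that r <= q_i
                selected.append(chromosome)
                break
        else:
            selected.append(population[-1])                # guard against round-off in the last q_i

    return selected

# Illustrative usage with four chromosomes and arbitrary positive fitness values:
population = ["01101", "11000", "01000", "10011"]
fitness_values = [169, 576, 64, 361]
print(roulette_selection(population, fitness_values))
```

Because the draws are independent, chromosomes with large fitness values may appear several times in the selected list, which is exactly the behavior expected from fitness-proportional selection.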

Obviously, some chromosomes may be selected more than once. That is in accordance with the theory. The GA performs a multidirectional search by maintaining a population of potential solutions and encourages good solutions. The population undergoes a simulated evolution: in each generation the relatively "good" solutions reproduce, while the relatively "bad" solutions die. To distinguish between different solutions, we use an objective or evaluation function, which plays the role of an environment.

13.2.4 Crossover

The strength of GAs arises from the structured information exchange of crossover combinations of highly fit individuals. So what we need is a crossover-like operator that would exploit important similarities between chromosomes. The probability of crossover PC is the parameter that defines the expected number of chromosomes, PC × pop-size, which undergo the crossover operation. We define the chromosomes for crossover in the current population using the following iterative procedure. Steps 1 and 2 have to be repeated for all chromosomes:

1. Generate a random number r from the range [0, 1].
2. If r < PC, select the given chromosome for crossover.

If PC is set to 1, all chromosomes in the population will be included in the crossover operation; if PC = 0.5, only half of the population will perform crossover, and the other half will be included in the new population directly, without changes.

To exploit the potential of the current gene pool, we use crossover operators to generate new chromosomes that will retain the good features from the previous generation. Crossover is usually applied to selected pairs of parents. One-point crossover is the most basic crossover operator, where a crossover point on the genetic code is selected at random and two parent chromosomes are interchanged at this point. In two-point crossover, two points are selected, and the part of the chromosome string between these two points is then swapped to generate two children of the new generation. Examples of one- and two-point crossover are shown in Figure 13.1.

We can define an n-point crossover similarly, where the parts of strings between points 1 and 2, 3 and 4, and finally n − 1 and n are swapped. The effect of crossover is similar to that of mating in the natural evolutionary process, in which parents pass segments of their own chromosomes on to their children. Therefore, some children are able to outperform their parents if they get "good" genes or genetic traits from their parents.

13.2.5 Mutation

Crossover exploits existing gene potentials, but if the population does not contain all the encoded information needed to solve a particular problem, no amount of gene mixing can produce a satisfactory solution. For this reason, a mutation operator capable of

spontaneously generating new chromosomes is included. The most common way of implementing mutation is to flip a bit with a probability equal to a very low given mutation rate (MR). A mutation operator can prevent any single bit from converging to a value throughout the entire population and, more importantly, it can prevent the population from converging and stagnating at any local optima. The MR is usually kept low so that good chromosomes obtained from crossover are not lost. If the MR is high (for example, above 0.1), GA performance will approach that of a primitive random search. Figure 13.2 provides an example of mutation.

Figure 13.1. Crossover operators. (a) One-point crossover, with the selected point after the fifth position in the string: the strings 10011110 and 10110010 produce 10010010 and 10111110. (b) Two-point crossover, with the selected points after the second and fifth positions in the strings: the parents 10011110 and 10110010 produce the children 10110110 and 10011010.

Figure 13.2. Mutation operator: the randomly selected bit in the sixth position is flipped, so the chromosome 10011010 becomes 10011110 after mutation.

In the natural evolutionary process, selection, crossover, and mutation all occur simultaneously to generate offspring. Here we split them into consecutive phases to facilitate implementation of and experimentation with GAs. Note that this section only gives a general description of the basics of GAs. Detailed implementations of GAs vary considerably, but the main phases and the iterative process remain.

At the end of this section, we can summarize that the major components of GAs include encoding schemas, fitness evaluation, parent selection, and application of crossover operators and mutation operators. These phases are performed iteratively, as represented in Figure 13.3.

It is relatively easy to keep track of the best individual chromosomes in the evolution process. It is customary in GA implementations to store "the best ever" individual at a separate location. In that way, the algorithm can report the best value found during the whole process, not just the best value in the final population.
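The crossover and mutation operators described in this section can be sketched compactly in Python. The helper names are illustrative, the crossover points and mutation rate are passed in as parameters, and the bit strings in the usage example are the ones used in Figure 13.1b.

```python
import random

def one_point_crossover(parent1, parent2, point):
    """Swap the tails of two parent strings after the given position."""
    child1 = parent1[:point] + parent2[point:]
    child2 = parent2[:point] + parent1[point:]
    return child1, child2

def two_point_crossover(parent1, parent2, point1, point2):
    """Swap the middle segments of two parent strings between the two positions."""
    child1 = parent1[:point1] + parent2[point1:point2] + parent1[point2:]
    child2 = parent2[:point1] + parent1[point1:point2] + parent2[point2:]
    return child1, child2

def mutate(chromosome, mutation_rate=0.01):
    """Flip each bit independently with a small probability (the mutation rate MR)."""
    bits = [('1' if bit == '0' else '0') if random.random() < mutation_rate else bit
            for bit in chromosome]
    return ''.join(bits)

# Example usage on 8-bit strings:
print(two_point_crossover("10011110", "10110010", 2, 5))   # -> ('10110110', '10011010')
print(one_point_crossover("10011110", "10110010", 5))
print(mutate("10011010", mutation_rate=0.1))
```

The two-point call reproduces the children shown in Figure 13.1b, while the mutation call flips, on average, less than one bit per string at the chosen rate.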

Figure 13.3. Major phases of a genetic algorithm: encoding schemata, fitness evaluation, a test for the end of the algorithm (halt if satisfied), parent selection, crossover operators, and mutation operators, with the loop returning to fitness evaluation.

Optimization under constraints is also a class of problems for which GAs are an appropriate solution. The constraint-handling techniques for GAs can be grouped into different categories. One typical way of dealing with GA candidates that violate the constraints is to generate potential solutions without considering the constraints and then to penalize them by decreasing the "goodness" of the evaluation function. In other words, a constrained problem is transformed into an unconstrained one by associating a penalty with all constraint violations. These penalties are included in the function evaluation, and there are different kinds of implementations. Some penalty functions assign a constant as a penalty measure. Other penalty functions depend on the degree of violation: the larger the violation, the greater the penalty. The growth of the penalty function can be logarithmic, linear, quadratic, exponential, etc., depending upon the size of the violation. Several implementations of GA optimization under constraints are given in the texts recommended for further study (Section 13.9).

13.3 A SIMPLE ILLUSTRATION OF A GENETIC ALGORITHM

To apply a GA to a particular problem, we have to define or select the following five components:

1. A genetic representation or encoding schema for potential solutions to the problem.
2. A way to create an initial population of potential solutions.
3. An evaluation function that plays the role of the environment, rating solutions in terms of their "fitness."
4. Genetic operators that alter the composition of offspring.
5. Values for the various parameters that the GA uses (population size, rate of applied operators, etc.).
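A compact way to see how these five components fit into the iterative loop of Figure 13.3 is the following generic Python skeleton. It is only a sketch under arbitrary assumptions: the parameter values, the bit-string representation, the toy fitness function, and the details of selection, crossover, and mutation are illustrative choices, not a prescribed implementation.

```python
import random

def run_ga(fitness, chromosome_length, pop_size=20, pc=0.7, mr=0.01, generations=50):
    """Generic GA loop: initialize, evaluate, select, apply crossover and mutation, repeat."""
    population = [''.join(random.choice('01') for _ in range(chromosome_length))
                  for _ in range(pop_size)]                 # random initial population
    best = max(population, key=fitness)                     # keep "the best ever" individual

    for _ in range(generations):
        scores = [fitness(ch) for ch in population]         # fitness evaluation
        population = random.choices(population, weights=scores, k=pop_size)  # roulette selection
        for i in range(0, pop_size - 1, 2):                 # one-point crossover on consecutive pairs
            if random.random() < pc:
                point = random.randint(1, chromosome_length - 1)
                a, b = population[i], population[i + 1]
                population[i], population[i + 1] = a[:point] + b[point:], b[:point] + a[point:]
        population = [''.join(b if random.random() > mr else ('1' if b == '0' else '0')
                              for b in ch) for ch in population]             # bit-flip mutation
        best = max(population + [best], key=fitness)
    return best

# Toy usage: maximize the number of 1-bits in a 16-bit string (fitness kept strictly positive).
print(run_ga(lambda ch: 1 + ch.count('1'), chromosome_length=16))
```

The encoding, the fitness function, and the operator parameters are exactly the items a designer must choose for each new problem; the loop itself rarely changes.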

We discuss the main features of GAs by presenting a simple example. Suppose that the problem is the optimization of a simple function of one variable. The function is defined as

f(x) = x^2

The task is to find x from the range [0, 31] that maximizes the function f(x). We selected this problem because it is relatively easy to optimize the function f(x) analytically, to compare the results of the analytic optimization with a GA, and to find the approximate optimal solution.

13.3.1 Representation

The first step in the GA is to represent the solution alternative (a value for the input feature) in a coded-string format. Typically, the string is a series of features with their values; each feature’s value can be coded with one value from a set of discrete values called the allele set. The allele set is defined according to the needs of the problem, and finding the appropriate coding method is a part of the art of using GAs. The coding method must be minimal but completely expressive. We will use a binary vector as a chromosome to represent real values of the single variable x. The length of the vector depends on the required precision, which, in this example, is selected as 1. Therefore, we need a minimum five-bit code (string) to accommodate the range with the required precision:

(b − a) / (2^m − 1) ≤ Required precision
(31 − 0) / (2^m − 1) ≤ 1
2^m ≥ 32
m ≥ 5

For this example, the mapping from a real number to a binary code is defined by the relation (because a = 0)

Code = binary(x_decimal)

The opposite mapping, from the binary code to the real value of the argument, is also unique:

x = decimal(Code_binary)

and it will be used only for checking the intermediate results of optimization. For example, if we want to transform the value x = 11 into a binary string, the corresponding code will be 01011. On the other hand, the code 11001 represents the decimal value x = 25.
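The bit-length calculation and the two mappings can be checked with a few lines of Python; this is our own sketch (the function names are illustrative), not code from the text.

```python
import math

def required_bits(a, b, precision):
    # Smallest m such that (b - a) / (2**m - 1) <= precision.
    return math.ceil(math.log2((b - a) / precision + 1))

def encode(x, m):
    # Real value -> binary code of length m (here a = 0, so the code is just x in binary).
    return format(x, "0{}b".format(m))

def decode(code):
    # Binary code -> real value of the argument.
    return int(code, 2)

m = required_bits(0, 31, 1)                 # -> 5
print(m, encode(11, m), decode("11001"))    # -> 5 01011 25
```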

13.3.2 Initial Population

The initialization process is very simple: we randomly create a population of chromosomes (binary codes) of the given length. Suppose that we decide that the parameter for the number of strings in the population is equal to four. Then one possible randomly selected population of chromosomes is

CR1 = 01101
CR2 = 11000
CR3 = 01000
CR4 = 10011

13.3.3 Evaluation

The evaluation function for binary vectors representing chromosomes is equivalent to the initial function f(x), where the given chromosome represents the binary code for the real value x. As noted earlier, the evaluation function plays the role of the environment, rating potential solutions in terms of their fitness. For our example, the four chromosomes CR1 to CR4 correspond to the following values of the input variable x:

x1 (CR1) = 13
x2 (CR2) = 24
x3 (CR3) = 8
x4 (CR4) = 19

Consequently, the evaluation function would rate them as follows:

f(x1) = 169
f(x2) = 576
f(x3) = 64
f(x4) = 361

The results of evaluating the initially generated chromosomes may be given in tabular form, as represented in Table 13.2. The expected-reproduction column shows “the evaluated quality” of the chromosomes in the initial population. Chromosomes CR2 and CR4 are more likely to be reproduced in the next generation than CR1 and CR3.
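The decoding and evaluation steps above can be reproduced with a short Python sketch (our own illustration, not code from the text):

```python
population = ["01101", "11000", "01000", "10011"]   # CR1 ... CR4

def f(x):
    return x ** 2

for i, code in enumerate(population, start=1):
    x = int(code, 2)                                 # decode the binary chromosome
    print("CR{}: x = {:2d}, f(x) = {}".format(i, x, f(x)))
# prints f(x) values 169, 576, 64, and 361, as in Table 13.2
```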

13.3.4 Alternation

In the alternation phase, the new population is selected based on the population evaluated in the previous iteration. Clearly, the chromosome CR2 in our example is the best of the four chromosomes, since its evaluation returns the highest value. In the alternation phase, an individual may be selected depending on its objective-function value or fitness value. For maximization problems, the higher the individual’s fitness, the more probable it is that the individual will be selected for the next generation. There are different schemes that can be used in the selection process. In the simple GA proposed earlier, we use the roulette wheel selection technique, in which an individual is selected randomly depending on a computed probability of selection for each individual. The probability of selection is computed by dividing the individual’s fitness value by the sum of the fitness values of the corresponding population; these values are represented in column 5 of Table 13.2. In the next step we design the roulette wheel, which is, for our problem, graphically represented in Figure 13.4.

TABLE 13.2. Evaluation of the Initial Population

CRi      Code    x    f(x)   f(x)/Σf(x)   Expected Reproduction: f(x)/fav
1        01101   13   169    0.14         0.58
2        11000   24   576    0.49         1.97
3        01000   8    64     0.06         0.22
4        10011   19   361    0.31         1.23
Sum                   1170   1.00         4.00
Average               293    0.25         1.00
Max                   576    0.49         1.97

[Figure 13.4. Roulette wheel for selection of the next population: CR1 occupies 14% of the wheel, CR2 49%, CR3 6%, and CR4 31%.]
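The selection probabilities in column 5 of Table 13.2 and the roulette wheel spins themselves can be sketched in a few lines of Python (our own illustration; the random seed is arbitrary and only makes the draw repeatable):

```python
import random

population = ["01101", "11000", "01000", "10011"]        # CR1 ... CR4
fitnesses = [int(code, 2) ** 2 for code in population]   # 169, 576, 64, 361
total = sum(fitnesses)                                    # 1170

probabilities = [fit / total for fit in fitnesses]
print([round(p, 2) for p in probabilities])
# -> [0.14, 0.49, 0.05, 0.31] (Table 13.2 lists 0.06 for CR3 because of rounding)

# Spin the roulette wheel four times to pick the next population.
random.seed(0)   # arbitrary seed, only to make the example repeatable
next_population = random.choices(population, weights=probabilities, k=4)
print(next_population)
```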

Using the roulette wheel, we can select chromosomes for the next population. Suppose that the randomly selected chromosomes for the next generation are CR1, CR2, CR2, and CR4 (the selection is in accordance with the expected reproduction—column 6 in Table 13.2). In the next step, these four chromosomes undergo the genetic operations: crossover and mutation.

13.3.5 Genetic Operators

Crossover is not necessarily applied to all pairs of selected individuals. A choice is made depending on a specified probability called the crossover probability (PC), which is typically between 0.5 and 1. If crossover is not applied (PC = 0), the offspring are simply duplicates of the parents. For the process of crossover, it is necessary to determine the percentage of the population that will undergo crossover. For our particular problem, we use the following parameters of the GA:

1. Population size, pop-size = 4 (this parameter was already used).
2. Probability of crossover, PC = 1.
3. Probability of mutation, PM = 0.001 (this parameter will be used in the mutation operation).

A value of 1 for the probability of crossover translates into 100% crossover—all chromosomes will be included in the crossover operation. The second set of parameters in this phase of a GA concerns the random selection of parents for crossover and the positions in the strings where the crossover will be performed. Suppose that the randomly selected pairs are CR1–CR2 and CR2–CR4, and that the crossover point is after the third position in the strings for both pairs. Then the selected strings

First pair:
CR1 = 01101
CR2 = 11000

Second pair:
CR2 = 11000
CR4 = 10011

will become, after crossover, a new population:

CR1′ = 01100
CR2′ = 11001
CR3′ = 11011
CR4′ = 10000

The second operator that can be applied in every iteration of a GA is mutation. For our example, the mutation operator has a probability of 0.1%, which means that, on average, only one of every 1000 processed bits will be flipped. Because we transformed only 20 bits (one population of 4 × 5 bits is transformed into another), the probability that a mutation will occur is very small. Therefore, we can assume that the strings CR1′ to CR4′ will stay unchanged with respect to the mutation operation in this first iteration. It is expected that only one bit will be changed for every 50 iterations. That was the final processing step in the first iteration of the GA. The results, in the form of the new population CR1′ to CR4′, are used in the next iteration, which starts again with an evaluation process.

13.3.6 Evaluation (Second Iteration)

The process of evaluation is repeated for the new population, and the results are given in Table 13.3. The process of optimization with additional iterations of the GA can be continued in accordance with Figure 13.3. We will stop here with the presentation of computational steps for our example and give some additional analyses of the results useful for a deeper understanding of a GA.

Although the search techniques used in the GA are based on many random parameters, they are capable of achieving a better solution by exploiting the best alternatives in each population. A comparison of the sums, average, and max values from Tables 13.2 and 13.3

Σf(x)1 = 1170        Σf(x)2 = 1754
Average1 = 293       Average2 = 439
Max1 = 576           Max2 = 729

shows that the new, second population is approaching closer to the maximum of the function f(x). The best result obtained from the evaluation of chromosomes in the first two iterations is the chromosome CR3′ = 11011, and it corresponds to the feature value x = 27 (theoretically, it is known that the maximum of f(x) is for x = 31, where f(x) = 961).
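The crossover of the two selected pairs and the evaluation of the resulting population can be verified with a short Python sketch (our own illustration, consistent with the numbers used in the example):

```python
def one_point_crossover(p1, p2, point):
    # Exchange the substrings that follow the crossover point.
    return p1[:point] + p2[point:], p2[:point] + p1[point:]

def f(code):
    return int(code, 2) ** 2

cr1, cr2, cr4 = "01101", "11000", "10011"

# Pairs CR1-CR2 and CR2-CR4, both cut after the third position.
new_population = [*one_point_crossover(cr1, cr2, 3),
                  *one_point_crossover(cr2, cr4, 3)]
print(new_population)                       # -> ['01100', '11001', '11011', '10000']

values = [f(code) for code in new_population]
print(values)                               # -> [144, 625, 729, 256]
print(sum(values), sum(values) / 4, max(values))
# -> 1754 438.5 729 (438.5 rounds to the average of 439 reported above)
```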

