Analytics in a Big Data World: The Essential Guide to Data Science and Its Applications, by Bart Baesens

neglected before. However, including a second type of node increases the complexity of the analysis. If a network consists of two types of nodes, we call the network a bipartite graph, or bigraph. For example, in an author–paper network there are two types of nodes: authors and papers. This is illustrated in Figure 6.10.

[Figure 6.10: Representation of a Small Author–Paper Network. Four authors (Louis, Peter, Tina, Monique) are linked to three papers: a text mining paper, an SNA paper, and a fraud detection paper.]

Mathematically, a bipartite graph is represented by a matrix M with n rows and m columns. The rows refer to the type-one nodes, while the columns specify the type-two nodes. The corresponding matrix of Figure 6.10 is given in Figure 6.11.

[Figure 6.11: Mathematical Representation of the Author–Paper Network. Rows correspond to the authors (Louis, Peter, Tina, Monique) and columns to the papers (TM, SNA, FD); an entry is 1 if the author contributed to the paper and – otherwise.]
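To make the matrix representation concrete, the following minimal Python sketch (using numpy) stores an author–paper incidence matrix and projects it onto a unipartite co-authorship network; since the exact 1/– entries of Figure 6.11 are not legible here, the values below are illustrative assumptions.

    import numpy as np

    # Illustrative incidence matrix M: rows = authors, columns = papers.
    # The 0/1 entries are assumed for illustration, not taken from Figure 6.11.
    authors = ["Louis", "Peter", "Tina", "Monique"]
    papers = ["TM", "SNA", "FD"]
    M = np.array([
        [1, 1, 0],   # Louis
        [0, 1, 1],   # Peter
        [1, 1, 1],   # Tina
        [0, 0, 1],   # Monique
    ])

    # Projecting the bipartite graph onto the authors yields a unipartite
    # co-authorship network: entry (i, j) counts the papers written together.
    coauthorship = M @ M.T
    np.fill_diagonal(coauthorship, 0)  # drop self-links

    for i, a in enumerate(authors):
        for j, b in enumerate(authors):
            if j > i and coauthorship[i, j] > 0:
                print(f"{a} - {b}: {coauthorship[i, j]} joint paper(s)")

This projection is exactly what the next paragraph refers to when the link weight counts the number of papers written together.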

While the weight of the links in the unipartite graph was used to represent how frequently both nodes were associated with the same object (e.g., the number of papers written together), the bipartite graph allows one to include additional information in the link weight, such as recency, intensity, and information exchange. For example, in the author–paper network, instead of using a binary link (0/1, or writer/nonwriter) to specify the relationship between authors and papers, the link weight can now represent the contribution of each author to the paper. When analyzing the influence of one node on another, the link weights should reflect the recency of the relationship: authors will have much less influence on each other if they wrote a paper together several years ago than if they had written the paper only yesterday.

CHAPTER 7

Analytics: Putting It All to Work

In Chapter 1, we discussed the following key requirements of analytical models:

■ Business relevance
■ Statistical performance
■ Interpretability and justifiability
■ Operational efficiency
■ Economical cost
■ Regulatory compliance

When only considering statistical performance as the key objective, analytical techniques such as neural networks, SVMs, and random forests are among the most powerful. However, when interpretability and justifiability are the goal, then logistic regression and decision trees should be considered. Obviously, the ideal mix of these requirements largely depends on the setting in which analytics is to be used. For example, in fraud detection and in response and/or retention modeling, interpretability and justifiability are less of an issue. Hence, it is common to see techniques such as neural networks, SVMs, and/or random forests applied in these settings. In domains such as credit risk modeling and medical diagnosis, comprehensibility is a key requirement. Techniques such as logistic regression and decision trees are very popular here.

Neural networks and/or SVMs can also be applied if they are complemented with white box explanation facilities using, for example, rule extraction and/or two-stage models, as explained in Chapter 3.

BACKTESTING ANALYTICAL MODELS

Backtesting is an important model monitoring activity that aims at comparing ex-ante made predictions with ex-post observed numbers.1 For example, consider the example in Table 7.1 of a churn prediction model. The purpose here is to decide whether the observed churn rates differ significantly from the estimated probabilities of churn.

Table 7.1 Backtesting a Churn Prediction Model

Cluster | Estimated Probability of Churn | No. of Customers | No. of Observed Churners | Observed Churn Rate
A | 2% | 1,000 | 30 | 3%
B | 4% | 2,000 | 120 | 6%
C | 10% | 4,000 | 500 | 12.5%
D | 30% | 2,000 | 750 | 37.5%

During model development, one typically performs out-of-sample validation. This means that the training set and test set basically stem from the same underlying time period. Backtesting is done using an out-of-sample/out-of-time data set, as illustrated in Figure 7.1. Out-of-universe validation refers to testing the model on another population. An example of this could be a model developed on European customers that is being validated on American customers.

Many challenges arise during backtesting. Different reasons could be behind the differences between the predicted and observed churn rates reported in Table 7.1. A first reason could be sample variation. This is the variation due to the fact that the predictions are typically based on a limited sample. Suppose one only considers sample variation and the churn rate for a cluster is 1 percent, and one wants to be 95 percent confident that the actual churn rate is no more than 20 basis points off from that estimate.

[Figure 7.1: Out-of-Sample versus Out-of-Sample/Out-of-Time Validation. Training and test sets are positioned along two dimensions, time and population (universe): out of sample, out of sample/out of time, out of universe, and out of universe/out of time.]

The number of observations needed would be:

n = (1.96 √(P(1 − P)) / 0.002)² ≈ 9,500

When dealing with large data sets, this number can be easily obtained. However, for smaller data sets (as is typically the case in credit risk modeling), a lower number of observations might be available, hereby inflating the standard errors and making the uncertainty on the predictions bigger. External effects could also be a reason for the difference between predicted and observed churn rates. A typical example here is the impact of macroeconomic up- or downturns. Finally, internal effects could also play a role. Examples here are a strategy change or a merger and/or acquisition. Both have an impact on the composition of the data samples and, as such, also on the observed churn rates.

When backtesting analytical models, one often adopts a traffic light indicator approach to encode the outcome of a performance metric or test statistic. A green traffic light means that the model predicts well and no changes are needed. A yellow light indicates an early warning that a potential problem may arise soon. An orange light is a more severe warning that a problem is very likely to arise.
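As a quick check of the sample-size calculation above, the following short Python sketch reproduces the roughly 9,500 observations needed for a 1 percent churn rate, a 20-basis-point tolerance, and 95 percent confidence.

    import math

    P = 0.01          # assumed true churn rate of the cluster
    margin = 0.002    # 20 basis points
    z = 1.96          # 95 percent confidence

    # Normal-approximation sample size, as in the formula above.
    n = (z * math.sqrt(P * (1 - P)) / margin) ** 2
    print(math.ceil(n))  # 9506, i.e., roughly the 9,500 quoted in the text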

A red light then indicates a serious problem that needs immediate attention and action. Depending on the implementation, more or fewer traffic lights can be adopted.

Backtesting Classification Models

When backtesting classification models, one should first clearly state whether the goal of the classification model is scoring/ranking or providing well-calibrated posterior class probabilities. In response and/or retention modeling, one is typically interested in scoring/ranking customers, whereas in credit risk modeling, well-calibrated probabilities are needed. When the model purpose is scoring, backtesting should check both data stability and model ranking. When the model is aimed at providing well-calibrated probabilities, the calibration itself should also be backtested.

When validating data stability, one should check whether internal or external environmental changes will impact the classification model. Examples of external environmental changes are new developments in the economic, political, or legal environment; changes in commercial law; or new bankruptcy procedures. Examples of internal environmental changes are changes of business strategy, exploration of new market segments, or changes in organizational structure. A two-step approach can be suggested as follows:

1. Check whether the population on which the model is currently being used is similar to the population that was used to develop the model.
2. If differences occur in step 1, verify the stability of the individual variables.

For step 1, a system stability index (SSI) can be calculated as follows:

SSI = Σ(i=1..k) (observed_i − expected_i) · ln(observed_i / expected_i)

This is illustrated in Table 7.2. Note that the system stability index is also referred to as the deviation index. It is identical to the information value measure discussed in Chapter 2 for variable screening.
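A minimal sketch of the SSI calculation; the expected and observed fractions below are the distributions of Table 7.2, shown next, so the result reproduces the table's total of 0.0605.

    import numpy as np

    def ssi(expected, observed):
        # System stability index between the expected (training) and
        # observed (actual) population distributions, given as fractions.
        expected = np.asarray(expected, dtype=float)
        observed = np.asarray(observed, dtype=float)
        return float(np.sum((observed - expected) * np.log(observed / expected)))

    expected = [0.06, 0.10, 0.09, 0.12, 0.12, 0.08, 0.07, 0.08, 0.12, 0.16]
    observed = [0.07, 0.08, 0.07, 0.09, 0.11, 0.11, 0.10, 0.12, 0.11, 0.14]
    print(round(ssi(expected, observed), 4))  # 0.0605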

Table 7.2 Calculating the System Stability Index (SSI)

Score Range | Expected (Training) % | Observed (Actual) % | SSI
0–169 | 6% | 7% | 0.0015
170–179 | 10% | 8% | 0.0045
180–189 | 9% | 7% | 0.0050
190–199 | 12% | 9% | 0.0086
200–209 | 12% | 11% | 0.0009
210–219 | 8% | 11% | 0.0096
220–229 | 7% | 10% | 0.0107
230–239 | 8% | 12% | 0.0162
240–249 | 12% | 11% | 0.0009
250+ | 16% | 14% | 0.0027
Total | 100% | 100% | 0.0605

A rule of thumb can be defined as follows:

■ SSI < 0.10: no significant shift (green traffic light)
■ 0.10 ≤ SSI < 0.25: moderate shift (yellow traffic light)
■ SSI ≥ 0.25: significant shift (red traffic light)

It is also recommended to monitor the SSI through time, as illustrated in Table 7.3. When population instability has been diagnosed, one can then verify the stability of the individual variables. Again, a system stability index can be calculated at the variable level, as illustrated in Table 7.4. Note also that histograms and/or t-tests can be used for this purpose.

Backtesting model ranking verifies whether high (low) scores are assigned to good (bad) customers. Ranking is then typically used in combination with profit measures to decide on the desired action (e.g., who to mail in a direct mailing campaign). Performance measures commonly adopted here have been discussed in Chapter 3: ROC, CAP, lift, and/or Kolmogorov-Smirnov curves. In terms of area under the ROC curve, one can adopt the traffic light indicator approach given in Table 7.5. Note that an AUC bigger than 0.95 can be regarded as too good to be true and might be a sign that something has gone wrong in the setup of the model (e.g., information about the dependent variable was used in one of the independent variables).

[Table 7.3: Monitoring the SSI through Time. For each score range, the expected (training) percentage is compared with the observed (actual) percentages at time t and at t + 1. The bottom rows report the SSI versus the expected distribution (0.0605 at t and 0.0494 at t + 1) and the SSI versus t − 1 (0.0260 at t + 1).]

Table 7.4 Calculating the SSI for Individual Variables

Income | Expected (Training) % | Observed (Actual) % at t | Observed (Actual) % at t + 1
0–1,000 | 16% | 18% | 10%
1,001–2,000 | 23% | 25% | 12%
2,001–3,000 | 22% | 20% | 20%
3,001–4,000 | 19% | 17% | 25%
4,001–5,000 | 15% | 12% | 20%
5,000+ | 5% | 8% | 13%
SSI Reference | | 0.029 | 0.208
SSI t − 1 | | | 0.238

Years client | Expected (Training) % | Observed (Actual) % at t | Observed (Actual) % at t + 1
Unknown client | 15% | 10% | 5%
0–2 years | 20% | 25% | 15%
2–5 years | 25% | 30% | 40%
5–10 years | 30% | 30% | 20%
10+ years | 10% | 5% | 20%
SSI Reference | | 0.075 | 0.304
SSI t − 1 | | | 0.362

Table 7.5 Traffic Light Coding of AUC

Area under the ROC Curve | Quality
0 < AUC ≤ 0.5 | No discrimination
0.5 < AUC ≤ 0.7 | Poor discrimination
0.7 < AUC ≤ 0.8 | Acceptable discrimination
0.8 < AUC ≤ 0.9 | Excellent discrimination
0.9 < AUC ≤ 1 | Exceptional discrimination

One can then monitor the AUC or accuracy ratio (AR) through time using a report as depicted in Table 7.6. A rule of thumb that could be applied here is that a decrease of less than 5% in terms of AR is considered green (normal script), between 5% and 10% yellow (boldface), and more than 10% red (boldface and underlined).

Table 7.6 Monitoring Accuracy Ratio (AR) through Time

 | AR | Number of Observations | Number of Defaulters
AR model | 0.85 | 5,866 | 105
AR 2012 | 0.81 | 5,677 | 97
AR 2011 | 0.80 | 5,462 | 108
AR 2010 | 0.83 | 5,234 | 111
AR 2009 | 0.79 | 5,260 | 123
AR 2008 | 0.79 | 5,365 | 113
AR 2007 | 0.75 | 5,354 | 120
AR 2006 | 0.82 | 5,306 | 119
AR 2005 | 0.78 | 4,970 | 98
AR 2004 | 0.80 | 4,501 | 62
AR 2003 | 0.83 | 3,983 | 60
Average AR | 0.8 | 5,179.8 | 101.5

For backtesting probability calibration, one can first use the Brier score, defined as follows:

Brier score = (1/n) Σ(i=1..n) (P̂_i − θ_i)²

whereby n is the number of customers, P̂_i the calibrated probability for customer i, and θ_i is 1 if the event of interest (e.g., churn, fraud, default) took place and 0 otherwise. The Brier score always varies between 0 and 1, and lower values indicate better calibration.

Another very popular test for measuring calibration performance is the binomial test. The binomial test assumes an experiment with only two outcomes (e.g., heads or tails), whereby the experiment is repeated multiple times and the individual outcomes are independent. Although the last assumption is not always nicely fulfilled because of, for example, social network effects, the binomial test is often used as a heuristic for calibration. It works as follows:

H0: The estimated probability of the event (e.g., churn, fraud, default), P̂, equals the true probability P.
HA: The estimated probability of the event, P̂, is bigger/smaller/not equal to the true probability.

Note that the estimated probability P̂ is typically the probability within a particular customer segment or pool. Depending on the analytical technique, the pool can be obtained in various ways. It could be a leaf node of a decision tree, or a clustered range output from a logistic regression. Assuming a right-tailed test and given a significance level α (e.g., α = 99%), H0 is rejected if the number of events is greater than or equal to k*, which is obtained as follows:

k* = min{ k | Σ(i=k..n) C(n, i) · P̂^i · (1 − P̂)^(n−i) ≤ 1 − α }

For large n, with nP̂ > 5 and n(1 − P̂) > 5, the binomial distribution can be approximated by a normal distribution N(nP̂, nP̂(1 − P̂)). Hence, one obtains:

P( z ≤ (k* − nP̂) / √(nP̂(1 − P̂)) ) = α

with z a standard normally distributed variable. The critical value k* can then be obtained as follows:

k* = nP̂ + N⁻¹(α) √(nP̂(1 − P̂))

with N⁻¹(α) the inverse cumulative standard normal distribution. In terms of a critical event rate, p*, one then has:

p* = P̂ + N⁻¹(α) √(P̂(1 − P̂)/n)

H0 can then be rejected at significance level α if the observed event rate is higher than p*. Remember that the binomial test assumes that all observations are independent. If the observations are correlated, then the binomial test has a higher probability of erroneously rejecting H0 (a type I error), which is why it is often used as an early warning system. It can be coded using traffic lights as follows:

■ Green (normal font): no statistical difference at 90 percent
■ Yellow (italics): statistical difference at 90 percent but not at 95 percent
■ Orange (boldface): statistical difference at 95 percent but not at 99 percent
■ Red (boldface and underlined): statistical difference at 99 percent

Table 7.7 shows an example of using the binomial test for backtesting calibrated probabilities of default (PDs) against observed default rates (DRs). It can be seen that from 2001 onwards, the calibration is no longer satisfactory.

The Hosmer-Lemeshow test is a closely related test that tests calibrated versus observed event rates across multiple segments/pools simultaneously. It also assumes independence of the events, and the test statistic is defined as follows:

χ²(k) = Σ(i=1..k) (n_i P̂_i − θ_i)² / (n_i P̂_i (1 − P̂_i))

whereby n_i is the number of observations in pool i, P̂_i is the estimated probability of the event for pool i, and θ_i is the number of observed events. The test statistic follows a chi-squared distribution with k degrees of freedom. It can be coded using traffic lights in a similar way as for the binomial test.
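A minimal sketch of both calibration tests, using scipy for the normal and chi-squared distributions; the inputs are taken from Table 7.1 purely as an illustration.

    import numpy as np
    from scipy.stats import norm, chi2

    def critical_event_rate(p_hat, n, alpha=0.99):
        # Critical event rate p* from the normal approximation to the binomial
        # test: H0 (the calibrated probability p_hat is correct) is rejected
        # when the observed event rate in the pool exceeds p*.
        return p_hat + norm.ppf(alpha) * np.sqrt(p_hat * (1 - p_hat) / n)

    def hosmer_lemeshow(n_i, p_hat_i, theta_i):
        # Hosmer-Lemeshow statistic over k pools with n_i observations,
        # calibrated probabilities p_hat_i, and observed event counts theta_i.
        n_i, p_hat_i, theta_i = (np.asarray(v, dtype=float) for v in (n_i, p_hat_i, theta_i))
        stat = np.sum((n_i * p_hat_i - theta_i) ** 2 / (n_i * p_hat_i * (1 - p_hat_i)))
        return stat, chi2.sf(stat, df=len(n_i))  # k degrees of freedom, as in the text

    # Cluster A of Table 7.1: 2% estimated churn, 1,000 customers, 3% observed churn.
    print(round(critical_event_rate(0.02, 1000), 4))   # ~0.0303, so 3% is not rejected at 99%

    # All four clusters of Table 7.1 tested jointly.
    stat, p_value = hosmer_lemeshow([1000, 2000, 4000, 2000],
                                    [0.02, 0.04, 0.10, 0.30],
                                    [30, 120, 500, 750])
    print(round(stat, 1), p_value)  # a large statistic and tiny p-value flag a calibration problem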

[Table 7.7: The Binomial Test for Backtesting PDs versus DRs. For each rating grade (Baa1, Baa2, Baa3, Ba1, Ba2, Ba3, B1, B2, B3, Caa-C) and for the portfolio average, the calibrated PD is compared with the observed default rates per year from 1993 to 2002 and with the average DR, and each cell is traffic-light coded according to the binomial test.]

Backtesting Regression Models

In backtesting regression models, one can also make a distinction between model ranking and model calibration. When predicting CLV, one might especially be interested in model ranking, since it is typically hard to accurately quantify CLV. However, in the majority of cases, the aim is model calibration. To check data stability, one could first consider a system stability index (SSI), as discussed before, applied to the categorized output; t-tests and/or histograms can also be used here. For ranking, one could create a scatter plot and summarize it into a Pearson correlation coefficient (see Chapter 3). For calibration, one can calculate the R-squared, mean squared error (MSE), or mean absolute deviation (MAD), as also discussed in Chapter 3. Table 7.8 gives an example of a table that can be used to monitor the MSE.

[Table 7.8: Template for Monitoring Model Calibration Using MSE. Rows: MSE model, MSE year t, MSE year t + 1, MSE year t + 2, ..., average MSE period 1, average MSE period 2; columns: MSE, number of observations, number of events, and traffic light.]
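A minimal sketch of the ranking and calibration diagnostics just mentioned; the predicted and realized values are made-up numbers used only for illustration.

    import numpy as np

    def regression_backtest(y_true, y_pred):
        # Calibration (MSE, MAD) and ranking (Pearson correlation) diagnostics
        # for a regression model such as a CLV or LGD model.
        y_true = np.asarray(y_true, dtype=float)
        y_pred = np.asarray(y_pred, dtype=float)
        return {
            "MSE": float(np.mean((y_true - y_pred) ** 2)),
            "MAD": float(np.mean(np.abs(y_true - y_pred))),
            "Pearson": float(np.corrcoef(y_true, y_pred)[0, 1]),
        }

    # Hypothetical realized versus predicted values on an out-of-time sample.
    print(regression_backtest([120, 80, 250, 40, 300], [100, 95, 230, 60, 280]))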

Backtesting Clustering Models

When backtesting clustering models, one can first check the data stability by comparing the number of observations per cluster during model design with the number observed now, and calculate a system stability index (SSI) across all clusters. One can also measure how the distance/proximity measures have changed on new observations by creating histograms of distances per cluster and comparing the histograms of the model design data with those of new data. The distances can then be statistically tested using, for example, a t-test. One can also statistically compare the intracluster similarity with the intercluster similarity using an F-test to see whether reclustering is needed.

Developing a Backtesting Framework

In order to set up a backtesting framework, one needs to decide on the following:

■ Diagnose backtesting needs
■ Work out backtesting activities
■ Design a timetable for backtesting activities
■ Specify tests and analyses to be performed
■ Define actions to be taken in response to findings
■ Identify why/what/who/how/when

All of the above should be described in a backtesting policy. Figure 7.2 presents an example of a digital dashboard application that could be developed for backtesting classification models. Note also that qualitative checks are included that are based on a judgment made by one or more business experts. These subjective evaluations are considered to be very important.

Once a backtesting framework has been developed, it should be complemented with an action plan. This plan will specify what to do in response to each finding of the backtesting exercise. Figure 7.3 gives an example of this. If the model calibration is okay, one can continue to use the model. If not, one needs to verify the model discrimination or ranking. If this is okay, then the solution might be to simply recalibrate the probabilities upward or downward using a scaling factor. If not, the next step is to check the data stability. If the data stability is still okay, one may consider tweaking the model. Note that this is, however, not that straightforward and will often boil down to reestimating the model (as is the case when the data stability is not okay).

[Figure 7.2: A Backtesting Digital Dashboard for Classification Models. The dashboard is organized into three levels, each with quantitative and qualitative checks coded green/yellow/red. Level 2 (calibration): binomial, Hosmer-Lemeshow, Vasicek, and normal tests and the portfolio distribution (not significant at the 95% level / significant at 95% but not at 99% / significant at the 99% level), plus qualitative checks on the direction of the difference (correct, overestimation, underestimation) and on portfolio stability (minor, moderate, or major migrations). Level 1 (discrimination): AR difference with a reference model (< 5% / between 5% and 10% / > 10%), AUC difference with a reference model (< 2.5% / between 2.5% and 5% / > 5%), and model significance (p-value < 0.01 / between 0.01 and 0.10 / > 0.10), plus qualitative checks on preprocessing of missing values and outliers (considered / partially considered / ignored), coefficient signs (all as expected / minor exceptions / major exceptions), number of overrides (minor / moderate / major), and documentation (sufficient / minor issues / major issues). Level 0 (data): SSI of the current versus the training sample (SSI < 0.10 / 0.10 < SSI < 0.25 / SSI > 0.25), SSI at the attribute level, and t-tests at the attribute level (p-value > 0.10 / between 0.10 and 0.01 / < 0.01), plus qualitative checks via characteristic analysis and attribute histograms (no change or shift / moderate change or shift / major change or shift).]

[Figure 7.3: Example Backtesting Action Plan. If model calibration is okay, continue using the model; if not, check model discrimination. If discrimination is okay, recalibrate the model; if not, check data stability. If the data are still stable, tweak the model; if not, reestimate the model.]

BENCHMARKING

The idea of benchmarking is to compare the output and performance of the analytical model with a reference model or benchmark. This is needed as an extra validity check to make sure that the current analytical model is the optimal one to be used. The benchmark can be externally or internally developed. A popular example of an external benchmark in credit risk modeling is the FICO score. This is a credit score that ranges between 300 and 850 and is provided by Experian, Equifax, and TransUnion in the United States. It is often used as a benchmark to compare application and/or behavioral credit scoring models. A closely related score is the VantageScore, also available in the United States. Credit rating agencies (e.g., Moody's, S&P, and Fitch) could also be considered as benchmarking partners. These agencies typically provide information on credit ratings and default probabilities that is very useful in a credit risk modeling context.

Note that although external benchmarking may seem appealing at first sight, one should be aware of potential problems, for example, the unknown quality of the external benchmark, different underlying data samples and/or methodologies, different target definitions, and legal constraints. One should also be vigilant for cherry-picking, whereby the external benchmark is selected so as to correspond as closely as possible to the internal model.

The benchmark can also be internally developed, either statistically or expert based. For example, one could benchmark a logistic regression model against a neural network benchmark to see whether there are any significant nonlinearities in the data. If it turns out that this is indeed the case, then nonlinear transformations and/or interaction terms can be added to the logistic regression model to come as close as possible to the neural network performance. An expert-based benchmark is a qualitative model based on expert experience and/or common sense. An example of this could be an expert committee ranking a set of small- and medium-sized enterprises (SMEs) in terms of default risk by merely inspecting their balance sheet and financial statement information in an expert-based, subjective way.

When benchmarking, one commonly adopts a champion-challenger approach. The current analytical model serves as the champion and the benchmark as the challenger. The purpose of the challenger is to find the weaknesses of the champion and to beat it. Once the benchmark outperforms the champion, one could consider making it the new champion, and the old champion then becomes the new benchmark. The purpose of this approach is to continuously challenge the current model so as to continuously perfect it.

Popular agreement statistics for benchmarking are Spearman's rank order correlation, Kendall's τ, and the Goodman-Kruskal γ. Spearman's rank order correlation measures the degree to which a monotonic relationship exists between the scores or ratings provided by an internal scoring system and those from a benchmark. It starts by assigning rank 1 to the lowest score, rank 2 to the second lowest score, and so on. In case of tied scores, the average rank is taken. Spearman's rank order correlation is then computed as follows:

ρ_s = 1 − 6 Σ(i=1..n) d_i² / (n(n² − 1))

whereby n is the number of observations and d_i the difference between the ranks of the scores. Spearman's rank order correlation always ranges between −1 (perfect disagreement) and +1 (perfect agreement).
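As a minimal sketch, Spearman's rank order correlation can be computed directly from the formula, using average ranks for ties; the inputs are the internal credit scores and FICO scores of the worked example in Table 7.9 below.

    import numpy as np
    from scipy.stats import rankdata

    def spearman_rho(internal_scores, benchmark_scores):
        # Rank both score lists (tied scores get the average rank) and apply
        # rho_s = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)).
        r1, r2 = rankdata(internal_scores), rankdata(benchmark_scores)
        d = r1 - r2
        n = len(d)
        return 1 - 6 * np.sum(d ** 2) / (n * (n ** 2 - 1))

    internal = [20, 35, 15, 25, 20]   # internal credit scores (Table 7.9)
    fico = [680, 580, 640, 720, 700]  # FICO benchmark scores (Table 7.9)
    print(round(spearman_rho(internal, fico), 3))  # -0.025, as in the text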

Kendall's τ works by first calculating the concordant and discordant pairs of observations. Two observations are said to be concordant if the observation that has a higher score assigned by the internal model also has a higher score assigned by the benchmark. If there is disagreement in the scores, then the pair is said to be discordant. If the pair is neither concordant nor discordant, it is tied, meaning the two observations have identical scores assigned by the internal model, or by the benchmark, or by both. Kendall's τ is then calculated as follows:

τ = (A − B) / (n(n − 1)/2)

whereby n is the number of observations, A the number of concordant pairs, and B the number of discordant pairs. Note that the denominator gives all possible pairs for n observations. Kendall's τ is 1 for perfect agreement and −1 for perfect disagreement.

Kendall's τ basically looks at all possible pairs of observations. The Goodman-Kruskal γ will only consider the untied pairs (i.e., either concordant or discordant), as follows:

γ = (A − B) / (A + B)

The Goodman-Kruskal γ is +1 if there are no discordant pairs (perfect agreement), −1 if there are no concordant pairs (perfect disagreement), and 0 if there are equal numbers of concordant and discordant pairs. For example, consider the example in Table 7.9.

Table 7.9 Example for Calculating Agreement Statistics

Customer | Internal Credit Score | Internal Rank | FICO Score | External Rank | d_i | d_i²
C1 | 20 | 2.5 | 680 | 3 | −0.5 | 0.25
C2 | 35 | 5 | 580 | 1 | 4 | 16
C3 | 15 | 1 | 640 | 2 | −1 | 1
C4 | 25 | 4 | 720 | 5 | −1 | 1
C5 | 20 | 2.5 | 700 | 4 | −1.5 | 2.25
Sum of d_i² | | | | | | 20.5

Spearman's rank order correlation then becomes −0.025. The concordant pairs are as follows: C1,C3; C1,C4; C3,C4; C3,C5; and C4,C5. The discordant pairs are: C1,C2; C2,C3; C2,C4; and C2,C5. The pair C1,C5 is a tie. Kendall's τ thus becomes (5 − 4)/10, or 0.1, and the Goodman-Kruskal γ becomes (5 − 4)/(5 + 4), or 0.11.

In case of disagreement between the current analytical model and the benchmark, it becomes interesting to see which is the best model overall, or whether there are certain segments of observations where either the internal model or the benchmark proves to be superior. Based on this analysis, it can be decided to further perfect the current analytical model or simply proceed with the benchmark as the new model.
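The same pair of score lists can be used to count concordant, discordant, and tied pairs and hence to reproduce the Kendall's τ and Goodman-Kruskal γ values of the worked example; a minimal sketch follows.

    from itertools import combinations

    def concordance_counts(internal, benchmark):
        # Count concordant (A), discordant (B), and tied pairs over all pairs.
        A = B = ties = 0
        for (x1, y1), (x2, y2) in combinations(zip(internal, benchmark), 2):
            s = (x1 - x2) * (y1 - y2)
            if s > 0:
                A += 1
            elif s < 0:
                B += 1
            else:
                ties += 1
        return A, B, ties

    internal = [20, 35, 15, 25, 20]   # internal credit scores (Table 7.9)
    fico = [680, 580, 640, 720, 700]  # FICO benchmark scores (Table 7.9)
    A, B, ties = concordance_counts(internal, fico)
    n = len(internal)
    tau = (A - B) / (n * (n - 1) / 2)  # Kendall's tau over all possible pairs
    gamma = (A - B) / (A + B)          # Goodman-Kruskal gamma over untied pairs
    print(A, B, ties, round(tau, 2), round(gamma, 2))  # 5 4 1 0.1 0.11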

DATA QUALITY

Corporate information systems consist of many databases linked by real-time and batch data feeds.2 The databases are continuously updated, as are the applications performing the data exchange. This dynamism has a negative impact on data quality (DQ), which is very disadvantageous since DQ determines the value of the data to the analytical technique. Information and communication technology can be used to further improve the intrinsic value of data. Hence, high-quality data in combination with good technology gives added value, whereas poor-quality data with good technology is a big problem (remember the garbage in, garbage out idea discussed in Chapter 2). Decisions made based on bad data can create high losses for companies.

Poor DQ impacts organizations in many ways. At the operational level, it has an impact on customer satisfaction, increases operational expenses, and will lead to lowered employee job satisfaction. Similarly, at the strategic level, it affects the quality of the (analytical) decision making process.3 Poor DQ is often experienced in everyday life. For example, the mistaken delivery of a letter is often associated with malfunctioning postal services.

However, one of the causes of this mistaken delivery can be an error in the address. Similarly, two similar emails sent to the same recipient can be an indication of a duplication error. Moreover, the magnitude of DQ problems is continuously growing following the exponential increase in the size of databases. This certainly qualifies DQ management as one of the most important business challenges in today's information-based economy.

Data quality is often defined as "fitness for use," which implies the relative nature of the concept.4 Data with quality for one use may not be appropriate for another use. For instance, the extent to which data is required to be complete for accounting tasks may not be required for analytical sales prediction tasks. More generally, data that are of acceptable quality in one decision context may be perceived to be of poor quality in another decision context, even by the same individual. This is mainly because DQ is a multidimensional concept in which each dimension represents a single aspect or construct of data items and comprises both objective and subjective aspects. Some aspects are independent, while others depend on the type of task and/or the experience of the data user. Therefore, it is useful to define DQ in terms of its dimensions. Table 7.10 shows the different DQ dimensions, their categories, and their definitions.5

Accuracy indicates whether the data stored are the correct values. For example, if my birth date is February 27, 1975, then for a database that expects dates in U.S. format, 02/27/1975 is the correct value. However, for a database that expects a European representation, the date 02/27/1975 is incorrect; instead, 27/02/1975 is the correct value.6

Another interesting dimension concerns the completeness of data. The completeness dimension can be considered from different perspectives. Schema completeness refers to the extent to which entities and attributes are not lacking from the schema. Column completeness verifies whether a column of a table has missing values or not. Finally, population completeness refers to the degree to which members of the population are not present. As an example, population completeness is depicted in Table 7.11.7

Table 7.10 Data Quality Dimensions

Category | Dimension | Definition: The Extent to Which . . .
Intrinsic | Accuracy | Data are regarded as correct
Intrinsic | Believability | Data are accepted or regarded as true, real, and credible
Intrinsic | Objectivity | Data are unbiased and impartial
Intrinsic | Reputation | Data are trusted or highly regarded in terms of their source and content
Contextual | Value-added | Data are beneficial and provide advantages for their use
Contextual | Completeness | Data values are present
Contextual | Relevancy | Data are applicable and useful for the task at hand
Contextual | Appropriate amount of data | The quantity or volume of available data is appropriate
Representational | Interpretability | Data are in appropriate language and units and the data definitions are clear
Representational | Ease of understanding | Data are clear without ambiguity and easily comprehended
Accessibility | Accessibility | Data are available or easily and quickly retrieved
Accessibility | Security | Access to data can be restricted and hence kept secure

Table 7.11 Population Completeness

ID | Name | Surname | Birth Date | Email
1 | Monica | Smith | 04/10/1978 | [email protected]
2 | Yuki | Tusnoda | 04/03/1968 | Null (a)
3 | Rose | David | 02/01/1937 | Null (b)
4 | John | Edward | 14/12/1955 | Null (c)

(a) Not existing. (b) Existing but unknown. (c) Not known if existing.

Tuple 2: Since the person represented by tuple 2 has no email address, we can say that the tuple is complete.
Tuple 3: Since the person represented by tuple 3 has an email address, but its value is not known, we can say that the tuple is incomplete.
Tuple 4: Since we do not know whether the person represented by tuple 4 has an email address at all, we cannot say whether the tuple is incomplete.

The next data quality dimension is believability, which is the extent to which data are regarded as true and credible. Accessibility refers to how easily the data can be located and retrieved. From a decision making viewpoint, it is important that the data can be accessed and delivered on time, so as to not needlessly delay important decisions.

The dimension of consistency can be considered from various perspectives. A first example is the presence of redundant data (e.g., name, address) in multiple data sources. Another perspective is the consistency between related data attributes: for example, city name and zip code should correspond. Yet another consistency perspective concerns the data format used. For example, gender can be encoded as male/female, M/F, or 0/1. It is of key importance that a uniform coding scheme is adopted so as to have a consistent corporate-wide data representation. The timeliness dimension reflects how up-to-date the data are with respect to the task for which they are used.

There are different causes of DQ problems, such as:

■ Multiple data sources: Multiple sources of the same data may produce duplicates (a consistency problem).
■ Subjective judgment: Subjective judgment can create data bias (an objectivity problem).
■ Limited computing facilities: A lack of sufficient computing facilities limits data access (an accessibility problem).
■ Size of data: Big data can give high response times (an accessibility problem).

Data quality can be improved through a total data quality management program. It consists of four phases, as shown in Figure 7.4.8

[Figure 7.4: Data Quality Management Program. The program cycles through four phases: Define (identify the important DQ dimensions), Assess (measure the DQ level using the important DQ dimensions), Analyze (investigate DQ problems and analyze their major causes), and Improve (suggest improvement actions).]

SOFTWARE

Different types of software can be used for doing analytics. A first distinction can be made between open source and commercial software. Popular open source analytical workbenches are RapidMiner (formerly Yale), R, and Weka. Especially the latter has gained in importance and usage nowadays. In the commercial area, SAS, SPSS, Matlab, and Microsoft are well-known vendors of analytical software. Many of these vendors actually provide analytical solutions targeted at specific industries (e.g., churn prediction in telco, fraud detection in insurance) and hereby provide full coverage of the whole range of analytical activities needed in the specific business setting. Table 7.12 presents an overview of a KDnuggets poll asking about software used in 2012 and 2013.

Based on Table 7.12, it can be concluded that RapidMiner and R, two open source software solutions, are the most popular tools for analytics. The distinction between open source and commercial is getting more and more difficult to make, since vendors like RapidMiner have also started providing commercial versions of their software.

[Table 7.12: Results of a KDnuggets Poll on Software Tools Used for Analytics in 2012 and 2013. For each tool, the poll reports the number of voters, the percentage using it alone, and the percentage of users in 2013 versus 2012, with free/open source tools distinguished from commercial tools. The most used tools include the Rapid-I RapidMiner/RapidAnalytics free edition (737 voters), R (704), Excel (527), Weka/Pentaho (269), Python with numpy/scipy/pandas/iPython (250), the Rapid-I RapidAnalytics/RapidMiner Commercial Edition (225), SAS (202), MATLAB (186), StatSoft Statistica (170), IBM SPSS Statistics (164), Microsoft SQL Server (131), Tableau (118), IBM SPSS Modeler (114), KNIME free edition (110), SAS Enterprise Miner (110), Rattle (84), JMP (77), Orange (67), other free analytics/data mining software (64), and Gnu Octave (54). Source: www.kdnuggets.com/polls/2013/analytics-big-data-mining-data-science-software.html]

In addition, Microsoft Excel is still quite popular for doing analytics. The average number of tools used was 3.

PRIVACY

The introduction of new technology, such as data analytics, brings new privacy concerns. Privacy issues can arise in two ways.9 First, data about individuals can be collected without these individuals being aware of it. Second, people may be aware that data is collected about them, but have no say in how the data is being used. Furthermore, it is important to note that data analytics brings extra concerns regarding privacy as compared to simple data collection and data retrieval from databases. Data analytics entails the use of massive amounts of data, possibly combined from several sources, including the Internet, to mine for hidden patterns. Hence, this technology allows for the discovery of previously unknown relationships without the customer and the company being able to anticipate this knowledge. Consider an example in which three independent pieces of information about a certain customer lead to the customer being classified as a long-term credit risk, whereas the individual pieces of information would never have led to this conclusion. It is exactly this kind of discovery of hidden patterns that forms an additional threat to citizens' privacy.

Moreover, previous work has shown that it is possible to construct partial profiles of a person by crawling the web for small amounts of nonsensitive information that is publicly available; often this information is voluntarily published by individuals through social networking sites.10 Taken alone, the individual pieces of nonsensitive information are not harmful to one's privacy. However, when all the information is aggregated into a partial profile, it can be used for criminal activities, such as stalking, kidnapping, identity theft, phishing, and scams, or for direct marketing by legitimate companies. It is again important to note that this use of data is not anticipated by citizens; hence privacy issues arise.

As illustrated by the previous examples, data analytics is more than just data collection and information retrieval from vast databases. This is recognized by the definition of data mining in several government reports.

For example, the U.S. Government Accountability Office11 defined data mining as:

the application of database technology and techniques—such as statistical analysis and modeling—to uncover hidden patterns and subtle relationships in data and to infer rules that allow for the prediction of future results.

In the August 2006 Survey of DHS Data Mining Activities, the Department of Homeland Security (DHS) Office of the Inspector General (OIG) defined data mining as:12

the process of knowledge discovery, predictive modeling, and analytics. Traditionally, this involves the discovery of patterns and relationships from structured databases of historical occurrences.

Several other definitions have been given, and generally these definitions imply the discovery of hidden patterns and the possibility of making predictions. Thus, simply summarizing historical data is not considered data mining.

There are several regulations in place to protect an individual's privacy. The Fair Information Practice Principles (FIPPs), which were stated in a report of the U.S. Department of Health, Education and Welfare in 1973,13 have served as the main inspiration for the Privacy Act of 1974. In 1980, the Organization for Economic Cooperation and Development (OECD) defined its "Guidelines on the Protection of Privacy and Transborder Flows of Personal Data." The following basic principles are defined to safeguard privacy:14

■ Collection limitation principle: Data collection should be done lawfully and with the knowledge and consent of the data subject.
■ Data quality principle: The data should be relevant for the purpose they are collected for, accurate, complete, and up-to-date.
■ Purpose specification principle: The purposes of the data should be specified before data collection and the use should be limited to these purposes.
■ Use limitation principle: The data should not be used for purposes other than those specified, nor should they be disclosed to other parties without the consent of the data subject (or by the authority of law).

■ Safety safeguards principle: The data should be protected against risks of loss, unauthorized access, use, modification, or disclosure.
■ Openness principle: There should be a policy of openness about the developments, practices, and policies with respect to personal data.
■ Individual participation principle: An individual has the right to obtain confirmation of whether data exist about him or her, to receive the data, to challenge data relating to him or her, and to have it erased or completed should the challenge be successful.
■ Accountability principle: A data controller can be held accountable for compliance with the above principles.

These guidelines are widely accepted, have been endorsed by the U.S. Department of Commerce, and are the foundation of privacy laws in many other countries (e.g., Australia, Belgium).

Given the increasing importance and awareness of privacy in the context of analytics, more and more research is being conducted on privacy preserving data mining algorithms. The parties that are typically involved are the record owner, the data publisher, and the data recipient.15 A data publisher can be untrusted, in which case the collection of records needs to be done anonymously. When the data publisher is trusted, the record owners are willing to share their information with the data publisher, but not necessarily with third parties, and it is necessary to anonymize the data. This can be further complicated when the data publisher is a nonexpert, in the sense that he or she is not aware that (and how) the data recipient can mine the data.

The privacy of an individual is breached when an attacker can learn anything extra about a record owner, possibly in the presence of background knowledge from other sources.16 Consider an example in which explicit identifiers are removed from a data set, but a combination of a number of variables (e.g., age, zip code, gender) serves as a quasi-identifier (QID). This means that it is possible to link the record owner, by means of the QID, to a record owner in another data set.

Figure 7.5 Example of Generalization and Suppression to Anonymize Data

Original data:
Zip Code | Age | Gender
83661 | 26 | M
83659 | 23 | M
83645 | 58 | F

Anonymized data:
Zip Code | Age | Gender
836** | 2* | M
836** | 2* | M
836** | 5* | F

To preserve privacy, there should be several records in the data set with the same QID. There are several classes of methods to anonymize data.17 A first class of methods is generalization and suppression. These methods remove information from the quasi-identifiers until the records are no longer individually identifiable, as illustrated in Figure 7.5. Another group of techniques consists of anatomization and permutation, which group and shuffle sensitive values within a QID group in order to remove the relationship between the QID and the sensitive attributes. Perturbation methods change the data by adding noise, swapping values, creating synthetic data, and so forth, based on the statistical properties of the real data.18
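A minimal sketch of generalization and suppression in the spirit of Figure 7.5, using pandas; the resulting quasi-identifier groups can then be checked for k-anonymity. The chosen generalization rules (three-digit zip prefix, age decade) are illustrative assumptions.

    import pandas as pd

    df = pd.DataFrame({
        "zip_code": ["83661", "83659", "83645"],
        "age": [26, 23, 58],
        "gender": ["M", "M", "F"],
    })

    anonymized = pd.DataFrame({
        "zip_code": df["zip_code"].str[:3] + "**",       # suppress the last two digits
        "age": (df["age"] // 10).astype(str) + "*",      # generalize age to its decade
        "gender": df["gender"],
    })
    print(anonymized)

    # k-anonymity check: every quasi-identifier combination should occur at least k times.
    print(anonymized.groupby(["zip_code", "age", "gender"]).size())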

MODEL DESIGN AND DOCUMENTATION

Some example questions that need to be answered from a model design perspective are:

■ When was the model designed, and by whom?
■ What is the perimeter of the model (e.g., counterparty types, geographical region, industry sectors)?
■ What are the strengths and weaknesses of the model?
■ What data were used to build the model? How was the sample constructed? What is the time horizon of the sample?
■ Is human judgment used, and how?

It is important that all of this is appropriately documented. In fact, all steps of the model development and monitoring process should be adequately documented. The documentation should be transparent and comprehensive. It is advised to use document management systems with appropriate versioning facilities to keep track of the different versions of the documents. An ambitious goal here is to aim for a documentation test, which verifies whether a newly hired analytical team could use the existing documentation to continue development or production of the existing analytical model(s).

CORPORATE GOVERNANCE

From a corporate governance perspective, it is also important that the ownership of the analytical models is clearly claimed. A good practice here is to develop model boards that take full responsibility for one or more analytical models in terms of their functioning, interpretation, and follow-up. It is also of key importance that the board of directors and senior management are involved in the implementation and monitoring processes of the analytical models developed. Of course, one cannot expect them to know all the underlying technical details, but they should be responsible for sound governance of the analytical models. Without appropriate management support, analytical models are doomed to fail. Hence, the board and senior management should have a general understanding of the analytical models. They should demonstrate active involvement on an ongoing basis, assign clear responsibilities, and put into place organizational procedures and policies that will allow the proper and sound implementation and monitoring of the analytical models. The outcome of the monitoring and backtesting exercise must be communicated to senior management and, if needed, accompanied by an appropriate (strategic) response. Given the strategic importance of analytical models nowadays, there is a strong need to add a Chief Analytics Officer (CAO) to the board of directors to oversee analytical model development, implementation, and monitoring.

NOTES

1. E. Lima, C. Mues, and B. Baesens, "Monitoring and Backtesting Churn Models," Expert Systems with Applications 38, no. 1 (2010): 975–982; G. Castermans et al., "An Overview and Framework for PD Backtesting and Benchmarking," special issue, Journal of the Operational Research Society 61 (2010): 359–373.

2. H. T. Moges et al., "A Multidimensional Analysis of Data Quality for Credit Risk Management: New Insights and Challenges," Information and Management 50, no. 1 (2014): 43–58.
3. A. Maydanchik, Data Quality Assessment (Bradley Beach, NJ: Technics Publications, 2007), 20–21.
4. R. Y. Wang and D. M. Strong, "Beyond Accuracy: What Data Quality Means to Data Consumers," Journal of Management Information Systems 12, no. 4 (1996): 5–33.
5. Ibid.
6. Y. W. Lee, L. L. Pipino, J. D. Funk, and R. Y. Wang, Journey to Data Quality (London: MIT Press, 2006), 67–108.
7. C. Batini and M. Scannapieco, Data Quality: Concepts, Methodologies and Techniques (New York: Springer, 2006), 20–50.
8. G. Shankaranarayanan, M. Ziad, and R. Y. Wang, "Managing Data Quality in Dynamic Decision Environments: An Information Product Approach," Journal of Database Management 14, no. 4 (2003): 14–32.
9. H. T. Tavani, "Informational Privacy, Data Mining, and the Internet," Ethics and Information Technology 1, no. 2 (1999): 137–145.
10. M. Pontual et al., "The Privacy in the Time of the Internet: Secrecy vs Transparency," in Proceedings of the Second ACM Conference on Data and Application Security and Privacy (New York: ACM, 2012), 133–140.
11. U.S. General Accounting Office (GAO), "Data Mining: Federal Efforts Cover a Wide Range of Uses," GAO-04-548 (May 2004), www.gao.gov/new.items/d04548.pdf.
12. U.S. Department of Homeland Security, Survey of DHS Data Mining Activities, August 2006.
13. The report is entitled "Records, Computers and the Rights of Citizens."
14. The documentation can be found at www.oecd.org/internet/ieconomy/oecdguidelinesontheprotectionofprivacyandtransborderflowsofpersonaldata.htm.
15. B. Fung et al., "Privacy-Preserving Data Publishing: A Survey of Recent Developments," ACM Computing Surveys (CSUR) 42, no. 4 (2010): 14.
16. T. Dalenius, "Finding a Needle in a Haystack—or Identifying Anonymous Census Records," Journal of Official Statistics 2, no. 3 (1986): 329–336.
17. B. Fung et al., "Privacy-Preserving Data Publishing: A Survey of Recent Developments," ACM Computing Surveys (CSUR) 42, no. 4 (2010): 14.
18. For more details about the specific techniques, the reader is referred to overview papers such as J. Wang et al., "A Survey on Privacy Preserving Data Mining," in First International Workshop on Database Technology and Applications (Washington, DC: IEEE, 2009), 111–114; and B. Fung et al., "Privacy-Preserving Data Publishing: A Survey of Recent Developments," ACM Computing Surveys (CSUR) 42, no. 4 (2010): 14.

CHAPTER 8

Example Applications

Analytics is hot and is being applied in a wide variety of settings. Without claiming to be exhaustive, in this chapter we will briefly zoom into some key application areas. Some of them have been around for quite some time, whereas others are more recent.

CREDIT RISK MODELING

The introduction of compliance guidelines such as Basel II/Basel III has reinforced the interest in credit scorecards. Different types of analytical models will be built in a credit risk setting.1 A first example is application scorecards: models that score credit applications based on their creditworthiness. They are typically constructed by taking two snapshots of information: application and credit bureau information at loan origination, and default status information 12 or 18 months ahead. This is illustrated in Figure 8.1. Table 8.1 provides an example of an application scorecard. Logistic regression is a very popular application scorecard construction technique due to its simplicity and good performance.2 For the scorecard in Table 8.1, the following logistic regression with WOE coding was used:

P(Customer = good | age, employment, salary) = 1 / (1 + e^−(β0 + β1·WOE_age + β2·WOE_employment + β3·WOE_salary))

[Figure 8.1: Constructing a Data Set for Application Scoring. At snapshot 1 (t0, loan origination), application data (age, income, marital status, savings amount, ...) and credit bureau data (bureau score, delinquency history, number of bureau checks, number of outstanding credits, ...) are collected; at snapshot 2 (t18), the good or bad payer status is observed.]

Typically, the model will then be re-expressed in terms of the log odds, as follows:

log( P(Customer = good | age, employment, salary) / P(Customer = bad | age, employment, salary) ) = β0 + β1·WOE_age + β2·WOE_employment + β3·WOE_salary

One then commonly applies a scorecard scaling by calculating a score as a linear function of the log odds, as follows:

Score = offset + factor * log(odds)

Table 8.1 Example Application Scorecard

Characteristic Name | Attribute | Points
Age 1 | Up to 26 | 100
Age 2 | 26–35 | 120
Age 3 | 35–37 | 185
Age 4 | 37+ | 225
Employment status 1 | Employed | 90
Employment status 2 | Unemployed | 180
Salary 1 | Up to 500 | 120
Salary 2 | 501–1,000 | 140
Salary 3 | 1,001–1,500 | 160
Salary 4 | 1,501–2,000 | 200
Salary 5 | 2,001+ | 240

Assume that we want a score of 600 for odds of 50:1, and a score of 620 for odds of 100:1. This gives the following:

600 = offset + factor * log(50)
620 = offset + factor * log(100)

The offset and factor then become:

factor = 20 / ln(2)
offset = 600 − factor * ln(50)

Once these values are known, the score becomes:

Score = (Σ(i=1..N) WOE_i · β_i + β0) * factor + offset
Score = (Σ(i=1..N) (WOE_i · β_i + β0/N)) * factor + offset
Score = Σ(i=1..N) ((WOE_i · β_i + β0/N) * factor + offset/N)

Hence, the points for each attribute are calculated by multiplying the weight of evidence of the attribute with the regression coefficient of the characteristic, then adding a fraction of the regression intercept, multiplying the result by the factor, and finally adding a fraction of the offset.

In addition to application scorecards, behavioral scorecards are also typically constructed. These are analytical models that are used to score the default behavior of an existing portfolio of customers. On top of the application characteristics, behavioral characteristics, such as trends in account balance or bureau score, delinquency history, credit limit increases/decreases, and address changes, can also be used. Because behavioral scorecards have more data available than application scorecards, their performance (e.g., measured using AUC) will be higher. Next to debt provisioning, behavioral scorecards can also be used for marketing (e.g., up/down/cross-selling) and/or proactive debt collection. Figure 8.2 gives an example of how a data set for behavioral scoring is typically constructed.
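The following minimal sketch turns the worked example above into code: it derives the factor and offset from the two anchor points (600 at odds 50:1, 620 at odds 100:1) and allocates points per attribute; the WOE values and regression coefficients are illustrative assumptions, not the ones behind Table 8.1.

    import math

    # Scaling constants: score 600 at odds 50:1 and 620 at odds 100:1.
    factor = 20 / math.log(2)             # ~28.85 points to double the odds
    offset = 600 - factor * math.log(50)  # ~487.12

    def attribute_points(woe, beta, intercept, n_characteristics):
        # Points for one attribute: WOE times the characteristic's coefficient,
        # plus an equal share of the intercept, scaled and shifted as above.
        return (woe * beta + intercept / n_characteristics) * factor + offset / n_characteristics

    # Hypothetical WOEs and logistic regression coefficients for one applicant
    # (age, employment status, salary).
    woes = [0.35, -0.10, 0.60]
    betas = [0.8, 0.5, 1.1]
    b0 = 0.2
    points = [attribute_points(w, b, b0, len(woes)) for w, b in zip(woes, betas)]
    print([round(p) for p in points], round(sum(points)))  # per-attribute points and total score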

[Figure 8.2: Constructing a Data Set for Behavioral Scoring. Behavioral information, such as the evolution of the checking account balance and the bureau score over months 0 through 12, the number of products purchased, the number of home address changes, and the delinquency history across all credits, is measured up to the observation point, after which the good/bad status is observed (timeline snapshots at t0, t12, and t24).]

Both application and behavioral scorecards are then used to calculate the probability of default (PD) for a portfolio of customers. This is done by first segmenting the scores into risk ratings and then calculating a historically observed default rate for each rating, which is then used to project the probability of default (PD) for (typically) the upcoming year. Figure 8.3 gives an example of how credit risk models are commonly applied in many bank settings.3

[Figure 8.3: Three Level Credit Risk Model.]
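A minimal sketch of the rating-based PD calculation just described: scores are binned into ratings and the historically observed default rate per rating serves as the PD estimate. The scores, cut-offs, and default flags below are illustrative assumptions.

    import pandas as pd

    data = pd.DataFrame({
        "score": [610, 585, 640, 700, 520, 660, 595, 710, 545, 630],
        "defaulted": [0, 1, 0, 0, 1, 0, 0, 0, 1, 0],
    })

    # Segment the scores into risk ratings using (assumed) cut-offs.
    data["rating"] = pd.cut(data["score"], bins=[0, 580, 640, 1000], labels=["C", "B", "A"])

    # Historically observed default rate per rating = projected PD for the coming year.
    pd_per_rating = data.groupby("rating", observed=False)["defaulted"].mean()
    print(pd_per_rating)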

Other measures that need to be calculated in credit risk modeling are the loss given default (LGD) and exposure at default (EAD). LGD measures the economic loss expressed as a percentage of the outstanding loan amount and is typically estimated using linear regression or regression trees. EAD represents the outstanding balance for on-balance sheet items (e.g., mortgages, installment loans). For off-balance sheet items (e.g., credit cards, credit lines), the EAD is typically calculated as follows: EAD = DRAWN + CCF * (LIMIT − DRAWN), whereby DRAWN represents the already drawn balance, LIMIT the credit limit, and CCF the credit conversion factor, which is expressed as a fraction between 0 and 1 (e.g., with DRAWN = 6,000, LIMIT = 10,000, and CCF = 0.3, the EAD equals 6,000 + 0.3 * 4,000 = 7,200). The CCF is typically modeled using either averages, linear regression, or regression trees. Once the PD, LGD, and EAD have been estimated, they are input into a capital requirements formula provided in the Basel II/III accord, which calculates the amount of capital needed to protect against unexpected losses.

FRAUD DETECTION

Fraud detection comes in many flavors. Typical examples for which fraud detection is relevant are credit card fraud, insurance claim fraud, money laundering, tax evasion, product warranty fraud, and click fraud. A first important challenge in fraud detection concerns the labeling of the transactions as fraudulent or not. A high suspicion does not mean absolute certainty, although it is often used to do the labeling. Alternatively, if available, one may also rely on court judgments to make the decision. Supervised, unsupervised, and social network learning can all be used for fraud detection.

In supervised learning, a labeled data set with fraud transactions is available. A common problem here is the skewness of the data set, because typically only a few transactions will be fraudulent. Hence, a decision tree already starts from a very pure root node (say, 99 percent nonfraudulent/1 percent fraudulent) and one may not be able to find any meaningful splits to further reduce the impurity. Similarly, other analytical techniques may have a tendency to simply predict the majority class by labeling each transaction as nonfraudulent. Common schemes to deal with this are over- and undersampling.

In oversampling, the fraudulent transactions in the training data set (not the test data set!) are replicated to increase their importance. In undersampling, nonfraudulent transactions are removed from the training data set (not the test data set!) to increase the weight and importance of the fraudulent transactions. Both procedures help the analytical technique find a discriminating pattern between fraudulent and nonfraudulent transactions, while the test set remains untouched throughout. However, if an analytical technique is built using under- or oversampling, the predictions it produces on the test data set may be biased and need to be adjusted. One way to adjust the predictions is as follows:4

\[
p(C_i \mid x) = \frac{\dfrac{p(C_i)}{p_t(C_i)}\, p_t(C_i \mid x)}{\sum_{j=1}^{m} \dfrac{p(C_j)}{p_t(C_j)}\, p_t(C_j \mid x)}
\]

whereby Ci represents the target class (e.g., C1 is fraudulent and C2 is nonfraudulent), pt(Ci | x) represents the probability estimated on the over- or undersampled training data set, pt(Ci) is the prior probability of class Ci on the over- or undersampled training data set, and p(Ci) represents the original priors (e.g., 99/1 percent). The denominator is introduced to make sure that the probabilities sum to one over all classes.
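A minimal sketch of this adjustment, assuming a model trained on a 50/50 resampled set while the true priors are 1 percent fraud and 99 percent nonfraud:

```python
import numpy as np

def adjust_posteriors(p_resampled, priors_resampled, priors_true):
    """Correct class probabilities estimated on an over-/undersampled training set
    back to the original class priors."""
    weights = np.asarray(priors_true) / np.asarray(priors_resampled)
    unnormalized = p_resampled * weights                           # numerator of the formula above
    return unnormalized / unnormalized.sum(axis=1, keepdims=True)  # renormalize over classes

# Columns: [fraud, nonfraud]; two hypothetical transactions scored by the model.
p_resampled = np.array([[0.70, 0.30],
                        [0.20, 0.80]])
adjusted = adjust_posteriors(p_resampled,
                             priors_resampled=[0.50, 0.50],
                             priors_true=[0.01, 0.99])
print(adjusted)  # fraud probabilities shrink sharply once the true 1% prior is applied
```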

Unsupervised learning can also be used to detect clusters of outlying transactions. The idea here is to build, for example, a SOM and look for cells containing only a few observations, which might potentially indicate anomalies requiring further inspection and attention.

Finally, social network analysis can also prove handy for fraud detection. Although fraud may be hard to detect based on the available variables, it is often very useful to analyze the relationships between fraudsters. Rather than a standalone phenomenon, fraud is often a carefully organized crime. Exploiting relational information provides some interesting insights into criminal patterns and activities. Figure 8.4 illustrates a fraud network. Note that this network is constructed around node 1 (in the center of the figure). Light gray nodes in the network are legitimate; dark gray nodes are fraudulent. The network visualization gives a good impression of the difference in network structure between legitimate and fraudulent nodes.

Figure 8.4 Fraud Network. Light Gray Nodes Refer to Legitimate Individuals, While Dark Gray Nodes Represent Fraud

While legitimate nodes only sparsely connect to each other, fraudulent nodes are characterized by a dense structure, with many links between all the members. Such structures have been investigated by Van Vlasselaer, Meskens, Van Dromme, and Baesens5 and are called spider constructions in the domain of social security fraud. The name is derived from their appearance: the fraudulent constructions look like a dense web in which all nodes are closely connected to each other. Based on the egonet concept discussed earlier, both local and network variables are constructed to characterize each node. Local variables define the node of interest using only individual characteristics, independent of its surrounding neighbors. Network variables depend on the network structure and include:

■ Fraudulent degree. In the network domain, the first-order degree refers to the number of immediate contacts a node has. The n-degree defines the number of nodes the surveyed node can reach in at most n hops. Instead of calculating the overall degree, one can make a distinction based on the label of each of the surrounding nodes. For the fraud domain, this means that the fraudulent first-order degree corresponds to counting the number of direct fraudulent neighbors.

■ Triangles. A triangle in a network is defined as a structure in which three nodes of the network are connected to each other. Especially triangles containing at least two fraudulent nodes are a good indicator of potential suspicious activities of the third node. Nodes that are involved in many suspicious triangles have a higher probability of committing fraud themselves.

■ Cliques. A clique is an extension of a triangle. Newman (2010) defines a clique as the maximal subset of the vertices in an undirected network such that every member of the set is connected by an edge to every other. While fraudulent triangles appear regularly in a network, fraudulent k-cliques (with k > 3) will appear less often. However, such cliques are extremely precise indicators of future fraud.

Although network variables as such can be very useful in detecting potential future fraud, these characteristics can also be converted into aggregated variables characterizing each node (e.g., total number of triangles/cliques, average degree weight, average triangle/clique weight). Afterward, these network variables should be enriched with the local variables discussed before. Using all the available attributes, standard learning techniques such as logistic regression, random forests, and neural networks are able to estimate future fraud based on both network-related and personal information. Such a combined approach exploits all potential information and returns the relevance, in terms of variable weight, of each characteristic.
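To illustrate, a small sketch computing the fraudulent first-order degree and the number of suspicious triangles on a toy network; the adjacency lists and fraud labels are made up for the example:

```python
from itertools import combinations

# Toy undirected network as adjacency sets, plus known fraud labels.
adjacency = {1: {2, 3, 4}, 2: {1, 3}, 3: {1, 2}, 4: {1}}
is_fraud = {1: False, 2: True, 3: True, 4: False}

def fraudulent_degree(node):
    """Fraudulent first-order degree: number of direct neighbors labeled as fraud."""
    return sum(is_fraud[n] for n in adjacency[node])

def suspicious_triangles(node):
    """Triangles through the node containing at least two fraudulent members."""
    count = 0
    for a, b in combinations(adjacency[node], 2):
        if b in adjacency[a]:  # a and b are also connected: (node, a, b) is a triangle
            if is_fraud[node] + is_fraud[a] + is_fraud[b] >= 2:
                count += 1
    return count

features = {n: (fraudulent_degree(n), suspicious_triangles(n)) for n in adjacency}
print(features)  # node 1 has two fraudulent neighbors and one suspicious triangle
```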

NET LIFT RESPONSE MODELING

In response modeling, the focus lies on deepening or recovering customer relationships, or on new customer acquisition, by means of targeted or win-back campaigns. The campaign can be a mail catalog, email, coupon, or an A/B or multivariate test. The purpose is to identify the customers most likely to respond based on the following information:

■ Demographic variables (e.g., age, gender, marital status)
■ Relationship variables (e.g., length of relationship, number of products purchased)
■ Social network information
■ RFM variables

RFM has been popularized by Cullinan6 as follows:

■ Recency: Time frame (days, weeks, months) since last purchase
■ Frequency: Number of purchases within a given time frame
■ Monetary: Dollar value of purchases

Each of these constructs can be operationalized in various ways; for example, one can consider the minimum/maximum/average/most recent monetary value of purchases. The constructs can be used separately or combined into an RFM score by either independent or dependent sorting. For the former (see Figure 8.5), the customer database is sorted into independent quintiles based on recency, frequency, and monetary value (e.g., recency quintile 1 contains the 20 percent of customers whose last purchase is longest ago). The final RFM score (e.g., 325) can then be used as a predictor for the response model.

Figure 8.5 Constructing an RFM Score (Independent Sorting): recency, frequency, and monetary quintile scores (1–5) are assigned independently.
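A short sketch of the independent sorting, using a hypothetical customer summary table; recency is reversed so that a score of 5 denotes the most recent buyers:

```python
import pandas as pd

# Hypothetical customer summary: days since last purchase, purchase count, total spend.
df = pd.DataFrame({
    "recency_days": [5, 40, 200, 12, 90, 365, 30, 7, 150, 60],
    "frequency":    [12, 3, 1, 8, 2, 1, 5, 20, 2, 4],
    "monetary":     [540, 120, 30, 410, 80, 25, 220, 900, 60, 150],
})

# Independent sorting: each dimension is cut into quintiles on its own.
df["R"] = pd.qcut(df["recency_days"], 5, labels=[5, 4, 3, 2, 1])
# Rank first to break ties, so qcut always finds five distinct bins.
df["F"] = pd.qcut(df["frequency"].rank(method="first"), 5, labels=[1, 2, 3, 4, 5])
df["M"] = pd.qcut(df["monetary"].rank(method="first"), 5, labels=[1, 2, 3, 4, 5])

# Concatenate into a single RFM code (e.g., "325") usable as a model predictor.
df["RFM"] = df["R"].astype(str) + df["F"].astype(str) + df["M"].astype(str)
```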

For dependent sorting, the customer database is first sorted into quintiles based on recency (see Figure 8.6). Each recency quintile is then further divided into frequency quintiles and then into monetary quintiles. This again yields an RFM score (e.g., 335) that can be used as a predictor for a response model.

Figure 8.6 Constructing an RFM Score (Dependent Sorting): recency quintiles are split into frequency quintiles (e.g., 31–35), which are in turn split into monetary quintiles (e.g., 331–335).

A first approach to response modeling is to split the previous campaign population into a test group and a control group (see Figure 8.7). The test group receives the marketing campaign, and a model is built on a training subset and evaluated on a holdout subset. Traditionally, the impact of such a marketing campaign is measured by comparing the purchase rate of the test group against the purchase rate of the control group. If the purchase rate of the test group exceeds the purchase rate of the control group, the marketing campaign is said to be effective. Although such methods concentrate on maximizing the gross purchase rate (i.e., the purchase rate of the test group minus the purchase rate of the control group), they do not differentiate between different customers and therefore ignore the net or incremental impact of the campaign.

In general, three types of customers can be distinguished. First, there are those people who would never buy the product, whether they are exposed to a marketing offer or not. Targeting these people would not make any sense because they won't buy the product anyway. A second group of customers is those who always buy the product. Targeting these people will cause a profit loss because they will always buy the product; therefore, offering them a marketing incentive (e.g., a discount) will reduce the profit margin. A last category of customers is the so-called swing clients. These customers will not buy the product spontaneously, but need to be motivated to take action.

Figure 8.7 Gross Lift Response Modeling (the previous campaign data are split into a control group and a test group; the test group is further split into training and holdout data on which the model is built and evaluated)

Because they are still undecided on whether to buy the product, a marketing campaign is especially effective for these people. Focusing on only these customers will maximize the true impact of the marketing campaign and is the goal of net lift modeling. Net lift modeling tries to measure the true impact through the incremental purchases, that is, purchases that are only attributable to the campaign and that would not have been made otherwise.7 Net lift modeling aims at finding a model such that the difference between the test group purchase rate and the control group purchase rate is maximized, so as to identify the swing clients (see Figure 8.8).

Figure 8.8 Net Lift Response Modeling

By implementing this methodology, marketers not only optimize the true business objective of maximizing profit, but also gain better insight into the different customer segments. In the test and control groups, the target will then be observed as indicated in Figure 8.9.

Figure 8.9 Observed Target in Net Lift Modeling (test group: Y = 1 for self-selectors and converted swing clients, Y = 0 for no purchase; control group: Y = 1 for self-selectors, Y = 0 for swing clients and no purchase)

One could then build a difference score model, as follows:

■ Build a logistic regression model estimating the probability of purchase given the marketing message, P(purchase | test).
■ Build a logistic regression model estimating the probability of purchase given control, P(purchase | control).
■ Incremental score = P(purchase | test) − P(purchase | control).

To further understand the impact of the predictors, one can then regress the incremental lift scores on the original data. Another option is to build only one logistic regression model with an additional binary predictor specifying whether an observation belongs to the control or test group; the model can then also include all possible interaction terms with this binary variable.
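A minimal sketch of the difference score approach, using randomly generated stand-in data in place of real campaign records:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in data: X_* are customer attributes, y_* the observed purchase flags.
rng = np.random.default_rng(0)
X_test_grp, y_test_grp = rng.normal(size=(500, 4)), rng.integers(0, 2, 500)
X_ctrl_grp, y_ctrl_grp = rng.normal(size=(500, 4)), rng.integers(0, 2, 500)

# Two separate purchase models: one for the treated (test) group, one for the control group.
model_test = LogisticRegression().fit(X_test_grp, y_test_grp)
model_ctrl = LogisticRegression().fit(X_ctrl_grp, y_ctrl_grp)

# Incremental score for new customers: P(purchase | test) - P(purchase | control).
X_new = rng.normal(size=(10, 4))
incremental = (model_test.predict_proba(X_new)[:, 1]
               - model_ctrl.predict_proba(X_new)[:, 1])

# Customers with the highest incremental scores are the likely swing clients to target.
ranked = np.argsort(-incremental)
```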

CHURN PREDICTION

Customer churn, also called attrition or defection, is the loss of customers. In saturated markets, there are limited opportunities to attract new customers, so retaining existing customers is essential to profitability and stability. It is estimated that attracting a new customer costs five to six times more than retaining an existing one.8 Established customers are more profitable due to the lower cost to serve them. In addition, brand loyalty developed over time makes them less likely to churn. Satisfied customers also serve as word-of-mouth advertisement, referring new customers to the company.

Research on customer churn can take two perspectives: the overall company level and the individual customer level. Identifying the determinants of churn, or the reasons why customers may churn, can give insight into company-level initiatives that may reduce the issues that lead to higher churn. One such study9 surveyed the Korean mobile telephone market and found that service attributes such as call quality and tariff level are negatively correlated with churn in that market. Naturally, if it is possible to improve call quality, fewer customers would be expected to churn. The results of this and similar studies indicate that management must focus on the quality of the attributes that are most important to customers.10 However, continually improving in these areas may not always be feasible due to cost or other limitations.

As a complementary approach, switching the focus to the individual customer level can yield high returns for a relatively low investment. Churn prediction models can be used to identify individual customers who are likely to churn and to attempt to prevent them from leaving the company. These models assign each customer an expected probability of churn. It is then relatively straightforward to offer the customers with the greatest probability a discount or other promotion to encourage them to extend their contract or keep their account active. In the following sections, several techniques and approaches to churn prediction are discussed.

Churn Prediction Models

Many well-known and less common models have been applied to churn prediction, including decision trees, logistic regression, support vector machines, Bayesian networks, survival analysis, self-organizing maps, and relational classifiers, among others. Both accuracy and comprehensibility are crucial for the decision-making process, so careful consideration is needed when choosing a technique.

Accurate predictions are perhaps the most apparent goal, but learning the reasons, or at least the indicators, for churn is also invaluable to the company. Understanding why a model makes the predictions it does serves several purposes. Comprehensibility allows domain experts to evaluate the model and ensure that it is intuitively correct; in this way, it can be verified or confirmed by the business. More comprehensible models also offer insight into the correlation between customer attributes and the propensity to churn,11 allowing management to address the factors leading to churn in addition to targeting the customers before they decide to churn. Finally, understandable and intuitive models may be more easily adopted within a company. If managers are accustomed to making decisions based on their own experience and knowledge, they will be more inclined to trust predictions made by a model that is not only comprehensible but also in line with their own reasoning.

Logistic regression is a statistical classification model that is often used for churn prediction, either as a model in its own right or as a benchmark for other models. Its coefficients indicate the correlation between the customer attributes and the probability of churn. It is a well understood and accepted model in both research and practice, is easy to interpret, and provides good results when compared with other methods; it has even been shown to outperform more complex methods in many cases. Decision trees can also be used for churn prediction and likewise offer interpretability and robustness. Neural networks and support vector machines have also been applied to churn prediction; however, these methods are seen as black boxes, offering little insight into how the predictions are made. Survival analysis offers the interpretability of logistic regression in the form of hazard ratios, which can be interpreted similarly to odds ratios in logistic regression. In addition, the target of interest is a time-to-event rather than a binary variable, so it is possible to make predictions about how long a customer will remain active before churning. Relational classifiers can also be used for churn prediction. Homophily in networks is based on the idea that similar individuals are more likely to interact, from which it is expected that individuals who are connected in a network will behave similarly. In churn prediction, if customers are linked with churners, they may also be likely to churn.
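To illustrate the interpretability argument, a small sketch that fits a logistic regression churn model on made-up data and reads its coefficients as odds ratios; the feature names and data-generating process are invented for the example:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up customer attributes and churn labels, purely for illustration.
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 3))           # e.g., tenure, monthly spend, support calls
y = (0.8 * X[:, 2] - 0.5 * X[:, 0] + rng.normal(size=1000) > 0).astype(int)

model = LogisticRegression().fit(X, y)

# exp(coefficient) is the odds ratio: how much the churn odds change per unit
# increase of the (standardized) attribute, holding the others constant.
feature_names = ["tenure", "monthly_spend", "support_calls"]
odds_ratios = dict(zip(feature_names, np.exp(model.coef_[0])))
print(odds_ratios)  # support_calls > 1 (raises churn odds), tenure < 1 (lowers them)
```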

Social network features can also be used in a traditional classifier like logistic regression or survival analysis. In order to do this, measures of connectedness can be extracted from the network and used as input features for the other model.12

Churn Prediction Process

Regardless of the particular technique, churn prediction modeling follows a standard classification process, as illustrated in Figure 8.10. The first step is to define churn for the particular situation. This may be naturally present in the data: contract termination, service cancellation, or nonrenewal. In other settings, it will not be so clear: a customer no longer shops at the store or website, or a customer stops purchasing credits. In these cases, the analyst or researcher must choose a definition of churn that makes sense in the context. One common solution is to select an appropriate length of time of inactivity on the account. In the previous examples, a number of days or months without a purchase might define churn. Of course, a customer may not buy something within that time frame but still return at a later date. Setting too short a time period may lead to nonchurn customers being targeted as potential churners; too long a period may mean churning customers are not identified in a timely manner. In most cases, a shorter time period may be preferable if the cost of the intervention campaign is much lower than the cost of a lost customer. After defining churn, the original set of customers should be labeled according to their true churn status.

Figure 8.10 The Churn Prediction Process (define churn, train the model on the training set, evaluate its performance on the test set, then score unknown data and contact the predicted churners with a retention campaign while leaving the others without a campaign)
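A small sketch of the labeling step, assuming a transaction log and a 90-day inactivity definition of churn; both the data and the threshold are illustrative:

```python
import pandas as pd

# Hypothetical transaction log.
transactions = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3, 3],
    "purchase_date": pd.to_datetime([
        "2023-01-05", "2023-08-20", "2023-02-10",
        "2023-05-01", "2023-07-15", "2023-08-30",
    ]),
})

snapshot_date = pd.Timestamp("2023-09-30")
inactivity_threshold = pd.Timedelta(days=90)

# A customer is labeled as churned when the time since the last purchase
# exceeds the chosen inactivity window at the snapshot date.
last_purchase = transactions.groupby("customer_id")["purchase_date"].max()
churn_label = ((snapshot_date - last_purchase) > inactivity_threshold).astype(int)
print(churn_label)  # customer 2 churned (last purchase in February); 1 and 3 did not
```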

The data set is split for validation, and the customer attributes from the training set can be used to train the selected model. The customer attributes from the test set are then used to compare the model's results with the actual churn label. This allows for an evaluation of the model performance. The model may also be evaluated by domain experts to gauge whether the predictive attributes seem in line with business knowledge. If the performance is acceptable, the attributes of current customers can be entered into the model to predict their churn class. A group of customers with the highest predicted churn probability can then be contacted with the retention campaign. Other customers, who are less likely to churn, are not contacted with the promotion.

RECOMMENDER SYSTEMS

People are influenced by recommendations in their daily decisions. Salesmen try to sell us the products we like, restaurants are evaluated and rated, and so on. Recommender systems can support us in our online commercial activities by suggesting specific items from a wide range of options. A considerable number of different techniques are available to build a recommender system, of which the following are the most important: collaborative filtering, content-based filtering, demographic filtering, knowledge-based filtering, and hybrid filtering. Case studies presenting these techniques have greatly multiplied in recent years; many of them deal with movies,13 tourism,14 and restaurants.15

In this section, the five main techniques are introduced, followed by some of their advantages and disadvantages. Some other issues concerning recommender systems are then briefly discussed.

Collaborative Filtering

Collaborative filtering, also called social filtering, is the approach most often associated with recommender systems. The main idea is to recommend items based on the opinions of other users. A distinction can be made between user-based collaborative filtering and item-based collaborative filtering.

In the case of user-based collaborative filtering, items will be recommended to a user based on how similar users rated these items. When opting for item-based collaborative filtering, items will be recommended to a user based on how this user rated similar items. One way to calculate the similarity between users or items is to use a user-item matrix that contains information on which user bought which item. Any similarity measure (e.g., Pearson correlation or cosine similarity) can then be used to create a similarity matrix.

To build a collaborative recommender system, ratings are required. These ratings form the link between a user and an item.16 A distinction can be made between three types of ratings. A scalar rating can be a number or an ordinal rating. A binary rating consists of two possibilities, such as good or bad. Finally, unary ratings indicate that a user has had an interaction with an item, such as a click on an item or a purchase.17 Two methods for the collection of ratings can be distinguished. Explicit ratings are obtained by requesting a user to rate a certain item. Implicit ratings are obtained by associating a rating with a certain action, such as buying an item.18

Typically, neighborhood-based algorithms are applied, in which the following three steps can be distinguished.19 First, a similarity measure is used to calculate the similarity between users (in the case of a user-based algorithm) or items (in the case of an item-based algorithm). Second, a subset of users or items is selected that functions as the neighborhood of the active user or item. Third, the algorithm predicts a rating based on the active user's or item's neighborhood, typically giving the highest weight to the most similar neighbors.

As is often the case with analytics, different techniques can be used to solve the same problem, each with its respective advantages and disadvantages. Three main advantages of collaborative recommender systems can be identified. First, collaborative filtering does not restrict the type of items to be recommended: it is enough to construct a matrix linking items to users to start the recommendation. A second advantage, linked to the first, is that it manages to deliver recommendations to a user even when it is difficult to find out which specific feature of an item makes it interesting to the user, or when there is no easy way to extract such a feature automatically. A third advantage has to do with novelty or serendipity: collaborative filtering is believed to recommend more unexpected items (that are equally valuable) than content-based techniques.20
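A minimal sketch of the user-based neighborhood approach on a toy rating matrix; the matrix, the similarity choice, and the neighborhood size are all illustrative:

```python
import numpy as np

# Toy user-item rating matrix (rows = users, columns = items, 0 = not rated).
R = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

def cosine_sim(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def predict_user_based(R, user, item, k=2):
    """Predict a rating as the similarity-weighted average of the k most
    similar users who actually rated the item."""
    sims = np.array([cosine_sim(R[user], R[other]) if other != user else -1.0
                     for other in range(R.shape[0])])
    neighborhood = [u for u in np.argsort(-sims) if R[u, item] > 0][:k]
    weights = sims[neighborhood]
    return float(weights @ R[neighborhood, item] / weights.sum())

print(predict_user_based(R, user=0, item=2))  # low prediction: similar users disliked item 2
```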

Although collaborative filtering methods are the most commonly used techniques because of their power, some disadvantages or weak points should be noted. First, sparse data can be a problem for such a technique: a critical mass of ratings is necessary in order to build meaningful similarity matrices. In cases in which the items are not frequently bought by the users (e.g., recommending mobile phones or apartments), it may be difficult to obtain representative neighborhoods, lowering the power of the technique. A second disadvantage is known as the cold start problem: new items cannot easily be recommended because they have not been rated yet, and new users cannot easily receive recommendations because they have not yet rated items. Some minor disadvantages are, for example, the fact that items purchased a long time ago may have a substantial impact if few items have been rated, which may lead to wrong conclusions in a changing environment. Privacy could also be a problem, because collaborative filtering needs data on users to give recommendations, and trust issues may arise because a user cannot question the recommendation.

Content-Based Filtering

Content-based recommender systems recommend items based on two information sources: features of products and ratings given by users. Different kinds of data can be encountered, requiring different strategies to obtain usable input. In the case of structured data, each item consists of the same attributes and the possible values for these attributes are known; it is then straightforward to apply content-based approaches. When only unstructured data are available, such as text, different techniques have to be used in order to learn the user profiles. Because no standard attributes and values are available, typical problems arise, such as synonyms and polysemous words. Free text can then be translated into more structured data by using a selection of free text terms as attributes. Techniques like TF-IDF (term frequency/inverse document frequency) can then be used to assign weights to the different terms of an item. Sometimes data are semistructured, consisting of some attributes with restricted values and some free text. One approach to deal with this kind of data is to convert the text into structured data.21
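A short sketch of the TF-IDF weighting step on a few made-up item descriptions, using scikit-learn's vectorizer:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up free-text item descriptions.
items = [
    "romantic comedy set in Paris",
    "action thriller with car chases",
    "romantic drama about a painter in Paris",
]

vectorizer = TfidfVectorizer()
item_vectors = vectorizer.fit_transform(items)  # one weighted term vector per item

# Terms that are frequent within an item but rare across items get high weights.
print(vectorizer.get_feature_names_out())
print(item_vectors.toarray().round(2))
```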

When items can be represented in a usable way, machine learning techniques are applied to learn a user profile. Typically, a classification algorithm is invoked for each user, based on his or her ratings of items and their attributes. This allows the recommender system to predict whether a user will like an item with a specific representation. As with collaborative filtering methods, explicit or implicit ratings are required. When explicit ratings are considered, the ratings are directly used for the classification task, whereas implicit ratings can be obtained from the item-user interactions.

The classification problem mentioned above can be implemented using a large number of different machine learning techniques. Some examples are logistic regression, neural networks, decision trees, association rules, and Bayesian networks. Nearest neighbor methods can also be used to determine the labeled items that are most similar to a new unlabeled item, in order to label this new item based on the labels of the nearest neighbors. Concerning the similarity metric used in nearest neighbor methods, Euclidean distance is often used when data are structured, whereas cosine similarity may prove its use when the vector space model is applied. Other approaches are linear classifiers, support vector machines, and Naïve Bayes.22

A first advantage of content-based recommender systems is that there is no cold start problem for new items: new items (which have not received ratings before) can be recommended, which was not the case in a collaborative filtering approach. Second, items can also be recommended to users with unique preferences. A third important advantage is the possibility of giving the user an explanation for his or her recommendations, for example by displaying a list of the features that led to an item being recommended. A fourth advantage is that only the ratings of the active user are used to build the profile, which is not the case for collaborative recommender systems.23

Concerning the disadvantages, a first limitation is that content-based techniques are only suitable if the right data are available: it is necessary to have enough information about the items to determine whether a user would like an item or not. The cold start problem for new users forms a second limitation, and old ratings can also influence the recommendation too much. Finally, over-specialization can be a problem, because such techniques will focus on items similar to the previously bought items.
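Building on the TF-IDF representation above, here is a sketch of the cosine-similarity, nearest-neighbor flavor of content-based recommendation; the catalog, the liked items, and the profile construction are all illustrative choices:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Illustrative catalog and the items the active user has liked so far.
catalog = [
    "romantic comedy set in Paris",
    "action thriller with car chases",
    "romantic drama about a painter in Paris",
    "comedy about a chef in Paris",
]
liked = [0, 2]  # indices of positively rated items

vectors = TfidfVectorizer().fit_transform(catalog)
profile = np.asarray(vectors[liked].mean(axis=0))  # user profile = mean of liked item vectors

scores = cosine_similarity(profile, vectors).ravel()
scores[liked] = -1.0                               # do not re-recommend consumed items
print(catalog[int(scores.argmax())])               # the most similar unseen item
```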

Demographic Filtering

Demographic filtering recommends items based on demographic information about the user. The main challenge is to obtain the data. This can be done explicitly by asking users for information such as age, gender, address, and so on. If this approach is not possible, analytical techniques can be used to extract information from the interactions of the users with the system. A user profile can then be built and used to recommend items.24

The main advantage of demographic recommender systems is that there is not always a need for a history of user ratings of the type required in collaborative and content-based approaches. Segments can be used in combination with user-item interactions in order to obtain a high-level recommender system. Some disadvantages are the cold start problem for new users and new items, as well as the difficulty of capturing the data, which is highly dependent on the participation of the users.

Knowledge-Based Filtering

Compared with collaborative filtering and content-based recommender systems, it is more difficult to briefly summarize the characteristics of knowledge-based recommender systems. The main difference with regard to the other techniques resides in the data sources used. With this approach, additional inputs consisting of constraints or requirements are provided to the recommender system, typically by allowing a dialog between the user and the system. Knowledge-based recommender systems can be divided into two main categories: constraint-based recommenders and case-based recommenders. Constraint-based recommenders are systems meeting a set of constraints imposed by both users and the item domain. A model of the customer requirements, the product properties, and other constraints that limit the possible requirements is first constructed and formalized. Any technique can then be used and will have to meet the requirements, or at least

