Books

Data Science for Business

By Foster Provost and Tom Fawcett
Revisited May 16, 2025 at 4:56 AM

Highlights

  • If you look too hard at a set of data, you will find something — but it might not generalize beyond the data you’re looking at. This is referred to as overfitting a dataset.
  • In many business analytics projects, we want to find “correlations” between a particular variable describing an individual and other variables.
  • Classification and class probability estimation attempt to predict, for each individual in a population, which of a (small) set of classes this individual belongs to.
  • A scoring model applied to an individual produces, instead of a class prediction, a score representing the probability (or some other quantification of likelihood) that that individual belongs to each class.
  • Regression (“value estimation”) attempts to estimate or predict, for each individual, the numerical value of some variable for that individual.
  • Similarity matching attempts to identify similar individuals based on data known about them.
  • Clustering attempts to group individuals in a population together by their similarity, but not driven by any specific purpose.
  • Clustering is useful in preliminary domain exploration to see which natural groups exist because these groups in turn may suggest other data mining tasks or approaches.
  • Co-occurrence grouping (also known as frequent itemset mining, association rule discovery, and market-basket analysis) attempts to find associations between entities based on transactions involving them.
  • The result of co-occurrence grouping is a description of items that occur together. These descriptions usually include statistics on the frequency of the co-occurrence and an estimate of how surprising it is.
  • Profiling (also known as behavior description) attempts to characterize the typical behavior of an individual, group, or population. An example profiling question would be: “What is the typical cell phone usage of this customer segment?”
  • Link prediction attempts to predict connections between data items, usually by suggesting that a link should exist, and possibly also estimating the strength of the link.
  • Data reduction attempts to take a large set of data and replace it with a smaller set of data that contains much of the important information in the larger set.
  • Causal modeling attempts to help us understand what events or actions actually influence others.
  • Both experimental and observational methods for causal modeling generally can be viewed as “counterfactual” analysis: they attempt to understand what would be the difference between the situations — which cannot both happen — where the “treatment” event (e.g., showing an advertisement to a particular individual) were to happen, and were not to happen.
  • When undertaking causal modeling, a business needs to weigh the trade-off of increasing investment to reduce the assumptions made, versus deciding that the conclusions are good enough given the assumptions.
  • If a specific target can be provided, the problem can be phrased as a supervised one. Supervised tasks require different techniques than unsupervised tasks do, and the results often are much more useful. A supervised technique is given a specific purpose for the grouping — predicting the target.
  • Clustering, an unsupervised task, produces groupings based on similarities, but there is no guarantee that these similarities are meaningful or will be useful for any particular purpose.
  • The value for the target variable for an individual is often called the individual’s label, emphasizing that often (not always) one must incur expense to actively label the data.
  • Classification, regression, and causal modeling generally are solved with supervised methods. Similarity matching, link prediction, and data reduction could be either. Clustering, co-occurrence grouping, and profiling generally are unsupervised.
  • A vital part in the early stages of the data mining process is (i) to decide whether the line of attack will be supervised or unsupervised, and (ii) if supervised, to produce a precise definition of a target variable. This variable must be a specific quantity that will be the focus of the data mining (and for which we can obtain values for some example data).
  • It is important to understand the strengths and limitations of the data because rarely is there an exact match with the problem. Historical data often are collected for purposes unrelated to the current business problem, or for no explicit purpose at all.
  • It is also common for the costs of data to vary. Some data will be available virtually for free while others will require effort to obtain. Some data may be purchased. Still other data simply won’t exist and will require entire ancillary projects to arrange their collection. A critical part of the data understanding phase is estimating the costs and benefits of each data source and deciding whether further investment is merited.
  • In data understanding we need to dig beneath the surface to uncover the structure of the business problem and the data that are available, and then match them to one or more data mining tasks for which we may have substantial science and technology to apply.
  • It is not unusual for a business problem to contain several data mining tasks, often of different types, and combining their solutions will be necessary (see Chapter 11).
  • One very general and important concern during data preparation is to beware of “leaks” (Kaufman et al. 2012). A leak is a situation where a variable collected in historical data gives information on the target variable — information that appears in historical data but is not actually available when the decision has to be made. As an example, when predicting whether at a particular point in time a website visitor would end her session or continue surfing to another page, the variable “total number of webpages visited in the session” is predictive. However, the total number of webpages visited in the session would not be known until after the session was over (Kohavi et al., 2000) — at which point one would know the value for the target variable!
  • To facilitate such qualitative assessment, the data scientist must think about the comprehensibility of the model to stakeholders (not just to the data scientists).
  • Firms with sophisticated data science teams wisely build testbed environments that mirror production data as closely as possible, in order to get the most realistic evaluations before taking the risk of deployment.
  • Increasingly, the data mining techniques themselves are deployed. For example, for targeting online advertisements, systems are deployed that automatically build (and test) models in production when a new advertising campaign is presented. Two main reasons for deploying the data mining system itself rather than the models produced by a data mining system are (i) the world may change faster than the data science team can adapt, as with fraud and intrusion detection, and (ii) a business has too many modeling tasks for their data science team to manually curate each model individually.
  • Software managers might look at the CRISP data mining cycle (Figure 2-2) and think it looks comfortably similar to a software development cycle, so they should be right at home managing an analytics project the same way. This can be a mistake because data mining is an exploratory undertaking closer to research and development than it is to engineering.
  • Summary statistics should be chosen with close attention to the business problem to be solved (one of the fundamental principles we will present later), and also with attention to the distribution of the data they are summarizing.
  • Generally speaking, a model is a simplified representation of reality created to serve a purpose.
  • Supervised learning is model creation where the model describes a relationship between a set of selected variables (attributes or features) and a predefined variable called the target variable.
  • The creation of models from data is known as model induction. Induction is a term from philosophy that refers to generalizing from specific cases to general rules (or laws, or truths).
  • When the world presents you with very large sets of attributes, it may be (extremely) useful to harken back to this early idea and to select a subset of informative attributes. Doing so can substantially reduce the size of an unwieldy dataset, and as we will see, often will improve the accuracy of the resultant model.
  • So let’s ask ourselves: which of the attributes would be best to segment these people into groups, in a way that will distinguish write-offs from non-write-offs? Technically, we would like the resulting groups to be as pure as possible. By pure we mean homogeneous with respect to the target variable. If every member of a group has the same value for the target, then the group is pure. If there is at least one member of the group that has a different value for the target variable than the rest of the group, then the group is impure.
  • Fortunately, for classification problems we can address all the issues by creating a formula that evaluates how well each attribute splits a set of examples into segments, with respect to a chosen target variable.
  • The most common splitting criterion is called information gain, and it is based on a purity measure called entropy.
  • Fortunately, with entropy to measure how disordered any set is, we can define information gain (IG) to measure how much an attribute improves (decreases) entropy over the whole segmentation it creates. (See the entropy and information gain sketch after the highlights.)
  • Strictly speaking, information gain measures the change in entropy due to any amount of new information being added; here, in the context of supervised segmentation, we consider the information gained by splitting the set on all values of a single attribute.
  • We have not discussed what exactly to do if the attribute is numeric. Numeric variables can be “discretized” by choosing a split point (or many split points) and then treating the result as a categorical attribute. For example, Income could be divided into two or more ranges. Information gain can be applied to evaluate the segmentation created by this discretization of the numeric attribute. We still are left with the question of how to choose the split point(s) for the numeric attribute. Conceptually, we can try all reasonable split points, and choose the one that gives the highest information gain.
  • A natural measure of impurity for numeric values is variance. If the set has all the same values for the numeric target variable, then the set is pure and the variance is zero.
  • We also can rank a set of attributes by their informativeness, in particular by their information gain. This can be used simply to understand the data better. It can be used to help predict the target. Or it can be used to reduce the size of the data to be analyzed, by selecting a subset of attributes in cases where we can not or do not want to process the entire dataset.
  • Classification trees often are used as predictive models — “tree structured models.” In use, when presented with an example for which we do not know its classification, we can predict its classification by finding the corresponding segment and using the class value at the leaf.
  • There are many techniques to induce a supervised segmentation from a dataset. One of the most popular is to create a tree-structured model (tree induction). These techniques are popular because tree models are easy to understand, and because the induction procedures are elegant (simple to describe) and easy to use.
  • In summary, the procedure of classification tree induction is a recursive process of divide and conquer, where the goal at each step is to select an attribute to partition the current group into subgroups that are as pure as possible with respect to the target variable.
  • It may be difficult to compare very different families of models just by examining their form (e.g., a mathematical formula versus a set of rules) or the algorithms that generate them. Often it is easier to compare them based on how they partition the instance space.
  • The lines separating the regions are known as decision lines (in two dimensions) or more generally decision surfaces or decision boundaries.
  • In higher dimensions, since each node of a classification tree tests one variable it may be thought of as “fixing” that one dimension of a decision boundary; therefore, for a problem of n variables, each node of a classification tree imposes an (n–1)-dimensional “hyperplane” decision boundary on the instance space.
  • You will often see the term hyperplane used in data mining literature to refer to the general separating surface, whatever it may be.
  • Recall that the tree induction procedure subdivides the instance space into regions of class purity (low entropy). If we are satisfied to assign the same class probability to every member of the segment corresponding to a tree leaf, we can use instance counts at each leaf to compute a class probability estimate.
  • Instead of simply computing the frequency, we would often use a “smoothed” version of the frequency-based estimate, known as the Laplace correction, the purpose of which is to moderate the influence of leaves with only a few instances. The equation for binary class probability estimation becomes p(c) = (n + 1) / (n + m + 2), where n is the number of examples in the leaf belonging to class c, and m is the number of examples not belonging to class c. (See the short numeric sketch after the highlights.)
  • Nodes in a classification tree depend on the instances above them in the tree. Therefore, except for the root node, features in a classification tree are not evaluated on the entire set of instances. The information gain of a feature depends on the set of instances against which it is evaluated, so the ranking of features for some internal node may not be the same as the global ranking.
  • The data miner specifies the form of the model and the attributes; the goal of the data mining is to tune the parameters so that the model fits the data as well as possible. This general approach is called parameter learning or parametric modeling.
  • Figure 4-3. The dataset of Figure 4-2 with a single linear split. This is called a linear classifier and is essentially a weighted sum of the values for the various attributes, as we will describe next.
  • Equation 4-1 (the classification function) is called a linear discriminant because it discriminates between the classes, and the function of the decision boundary is a linear combination — a weighted sum — of the attributes.
  • Linear functions are one of the workhorses of data science; now we finally come to the data mining. We now have a parameterized model: the weights of the linear function (wᵢ) are the parameters. The data mining is going to “fit” this parameterized model to a particular dataset — meaning specifically, to find a good set of weights on the features.
  • Our general procedure will be to define an objective function that represents our goal, and can be calculated for a particular set of weights and a particular set of data. We will then find the optimal value for the weights by maximizing or minimizing the objective function.
  • SVMs choose based on a simple, elegant idea: instead of thinking about separating with a line, first fit the fattest bar between the classes. This is shown by the parallel dashed lines in Figure 4-8. The SVM’s objective function incorporates the idea that a wider bar is better. Then once the widest bar is found, the linear discriminant will be the center line through the bar (the solid middle line in Figure 4-8).
  • A loss function determines how much penalty should be assigned to an instance based on the error in the model’s predicted value — in our present context, based on its distance from the separation boundary.
  • Squared error loss usually is used for numeric value prediction (regression), rather than classification.
  • The linear function estimates this numeric target value using Equation 4-2, and of course the training data have the actual target value. Therefore, an intuitive notion of the fit of the model is: how far away are the estimated values from the true values on the training data?
  • Standard linear regression procedures instead minimize the sum or mean of the squares of these errors — which gives the procedure its common name “least squares” regression.
  • For least squares regression a serious drawback is that it is very sensitive to the data: erroneous or otherwise outlying data points can severely skew the resultant linear function.
  • This is exactly a logistic regression model: the same linear function f(x) that we’ve examined throughout the chapter is used as a measure of the log-odds of the “event” of interest. (See the logistic regression sketch after the highlights.)
  • For a consumer c, our model may estimate the probability of responding to the offer to be quite small, say 0.02. In the data, we see that the person indeed does respond. That does not mean that this consumer’s probability of responding actually was 1.0, nor that the model incurred a large error on this example. The consumer’s probability may indeed have been around p(c responds) = 0.02, which actually is a high probability of response for many campaigns, and the consumer just happened to respond this time.
  • A decision tree, if it is not too large, may be considerably more understandable to someone without a strong statistics or mathematics background.
  • The tradeoff is that as we increase the amount of flexibility we have to fit the data, we increase the chance that we fit the data too well. The model can fit details of its particular training set rather than finding patterns or models that apply more generally.
  • We have also introduced two criteria by which models can be evaluated: the predictive performance of a model and its intelligibility.
  • Generalization is the property of a model or modeling process, whereby the model applies to data that were not used to build the model.
  • Overfitting is the tendency of data mining procedures to tailor models to the training data, at the expense of generalization to previously unseen data points.
  • What we need to do is to “hold out” some data for which we know the value of the target variable, but which will not be used to build the model.
  • When the model is not allowed to be complex enough, it is not very accurate. As the models get too complex, they look very accurate on the training data, but in fact are overfitting — the training accuracy diverges from the holdout (generalization) accuracy.
  • Why does performance degrade? The short answer is that as a model gets more complex it is allowed to pick up harmful spurious correlations.
  • Cross-validation begins by splitting a labeled dataset into k partitions called folds. Typically, k will be five or ten.
  • Cross-validation then iterates training and testing k times, in a particular way. As depicted in the bottom pane of Figure 5-9, in each iteration of the cross-validation, a different fold is chosen as the test data. In this iteration, the other k–1 folds are combined to form the training data. So, in each iteration we have (k–1)/k of the data used for training and 1/k used for testing. (See the cross-validation sketch after the highlights.)
  • Learning curves usually have a characteristic shape. They are steep initially as the modeling procedure finds the most apparent regularities in the dataset. Then as the modeling procedure is allowed to train on larger and larger datasets, it finds more accurate models. However, the marginal advantage of having more data decreases, so the learning curve becomes less steep. In some cases, the curve flattens out completely because the procedure can no longer improve accuracy even with more training data.
  • A learning curve shows the generalization performance — the performance only on testing data, plotted against the amount of training data used. A fitting graph shows the generalization performance as well as the performance on the training data, but plotted against model complexity. Fitting graphs generally are shown for a fixed amount of training data.
  • To avoid overfitting, we control the complexity of the models induced from the data.
  • So, for stopping tree growth, an alternative to setting a fixed size for the leaves is to conduct a hypothesis test at every leaf to determine whether the observed difference in (say) information gain could have been due to chance. If the hypothesis test concludes that it was likely not due to chance, then the split is accepted and the tree growing continues.
  • The second strategy for reducing overfitting is to “prune” an overly large tree. Pruning means to cut off leaves and branches, replacing them with leaves. There are many ways to do this, and the interested reader can look into the data mining literature for details. One general idea is to estimate whether replacing a set of leaves or a branch with a leaf would reduce accuracy. If not, then go ahead and prune. The process can be iterated on progressive subtrees until any removal or replacement would reduce accuracy.
  • We can take the training set and split it again into a training subset and a testing subset. Then we can build models on this training subset and pick the best model based on this testing subset. Let’s call the former the sub-training set and the latter the validation set for clarity. The validation set is separate from the final test set, on which we are never going to make any modeling decisions. This procedure is often called nested holdout testing.
  • For example, sequential forward selection (SFS) of features uses a nested holdout procedure to first pick the best individual feature, by looking at all models built using just one feature. After choosing a first feature, SFS tests all models that add a second feature to this first chosen feature. The best pair is then selected. Next the same procedure is done for three, then four, and so on. When adding a feature does not improve classification accuracy on the validation data, the SFS process stops. (See the sequential forward selection sketch after the highlights.)
  • (There is a similar procedure called sequential backward elimination of features. As you might guess, it works by starting with all features and discarding features one at a time. It continues to discard features as long as there is no performance loss.)
  • Complexity control via regularization works by adding to this objective function a penalty for complexity, so that the learning procedure optimizes something like argmax over w of [ fit(x, w) − λ · penalty(w) ].
  • If we incorporate the L2-norm penalty into standard least-squares linear regression, we get the statistical procedure called ridge regression.
  • If instead we use the sum of the absolute values (rather than the squares), known as the L1-norm, we get a procedure known as the lasso (Hastie et al., 2009). More generally, this is called L1-regularization.
  • For reasons that are quite technical, L1-regularization ends up zeroing out many coefficients. Since these coefficients are the multiplicative weights on the features, L1-regularization effectively performs an automatic form of feature selection. (See the ridge and lasso sketch after the highlights.)
  • Specifically, linear support vector machine learning is almost equivalent to the L2-regularized logistic regression just discussed; the only difference is that a support vector machine uses hinge loss instead of likelihood in its optimization.
  • The accuracy starts off low when the model is simple, increases as complexity increases, flattens out, then starts to decrease again as overfitting sets in.
  • Equation 6-1. General Euclidean distance: d(A, B) = √((a₁ − b₁)² + (a₂ − b₂)² + … + (aₙ − bₙ)²), where A and B are instances and aᵢ, bᵢ are their attribute values.
  • To use similarity for predictive modeling, the basic procedure is beautifully simple: given a new example whose target variable we want to predict, we scan through all the training examples and choose several that are the most similar to the new example. Then we predict the new example’s target value, based on the nearest neighbors’ (known) target values. (See the nearest-neighbor sketch after the highlights.)
  • There is no simple answer to how many neighbors should be used. Odd numbers are convenient for breaking ties for majority vote classification with two-class problems.
  • Weighted scoring has a nice consequence in that it reduces the importance of deciding how many neighbors to use. Because the contribution of each neighbor is moderated by its distance, the influence of neighbors naturally drops off the farther they are from the instance.
  • More generally, irregular concept boundaries are characteristic of all nearest-neighbor classifiers, because they do not impose any particular geometric form on the classifier. Instead, they form boundaries in instance space tailored to the specific data used for training.
  • Thus, in terms of overfitting and its avoidance, the k in a k-NN classifier is a complexity parameter.
  • If a stakeholder asks “What did your system learn from the data about my customers? On what basis does it make its decisions?” there may be no easy answer because there is no explicit model. Strictly speaking, the nearest-neighbor “model” consists of the entire case set (the database), the distance function, and the combining function.
  • Feature selection can be done manually by the data miner, using background knowledge as to what attributes are relevant. This is one of the main ways in which a data mining team injects domain knowledge into the data mining process.
  • Some applications require extremely fast predictions; for example, in online advertisement targeting, decisions may need to be made in a few tens of milliseconds. For such applications, a nearest neighbor method may be impractical.
  • Clustering is another application of our fundamental notion of similarity. The basic idea is that we want to find groups of objects (consumers, businesses, whiskeys, etc.), where the objects within groups are similar, but the objects in different groups are not so similar.
  • So far we have discussed distance between instances. For hierarchical clustering, we need a distance function between clusters, considering individual instances to be the smallest clusters. This is sometimes called the linkage function. (See the hierarchical clustering sketch below.)
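
Code sketches

The sketches below are not from the book; they are minimal illustrations of a few of the techniques highlighted above, written in Python with made-up data and, where noted, scikit-learn or SciPy assumed as dependencies.

First, entropy and information gain as a splitting criterion, including the idea of discretizing a numeric attribute by trying candidate split points. The Income values and the 0/1 write-off labels are invented for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels: -sum(p * log2(p)) over the classes."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(parent_labels, child_label_sets):
    """Decrease in entropy from splitting the parent into the given children,
    weighting each child's entropy by its share of the instances."""
    total = len(parent_labels)
    weighted_child_entropy = sum(
        (len(child) / total) * entropy(child) for child in child_label_sets
    )
    return entropy(parent_labels) - weighted_child_entropy

# Discretize a numeric attribute by trying candidate split points and keeping
# the one that yields the highest information gain.
incomes = [20, 35, 50, 75, 90, 120]
labels  = [1,  1,  0,  0,  1,  0]   # 1 = write-off, 0 = no write-off (made up)

best = max(
    (information_gain(labels,
                      [[y for x, y in zip(incomes, labels) if x <= split],
                       [y for x, y in zip(incomes, labels) if x > split]]), split)
    for split in incomes[:-1]
)
print("best information gain %.3f at Income <= %d" % best)
```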
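The Laplace correction for binary class probability estimation is short enough to state directly in code; n and m are as defined in the highlight above.

```python
def laplace_estimate(n, m):
    """Smoothed probability that a leaf's instances belong to class c:
    n = instances in the leaf of class c, m = instances not of class c."""
    return (n + 1) / (n + m + 2)

# A leaf with 2 positives and 0 negatives: the raw frequency says 1.0,
# but the smoothed estimate is more cautious; more instances, more confidence.
print(laplace_estimate(2, 0))    # 0.75
print(laplace_estimate(20, 0))   # ~0.95
```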
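Next, a parameterized linear model fit to data, in the spirit of the linear discriminant and logistic regression highlights. This is a sketch using scikit-learn and synthetic two-attribute data, not the book's examples; the learned weights define the linear function f(x), which logistic regression treats as the log-odds of the class.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic two-attribute data; the class depends (noisily) on a weighted sum.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (1.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

model = LogisticRegression()
model.fit(X, y)                     # "fitting" the parameterized model: find good weights

w0, w = model.intercept_[0], model.coef_[0]
print("learned weights:", w0, w)

# f(x) = w0 + w1*x1 + w2*x2 is used as the log-odds of the class of interest.
x_new = np.array([0.5, -1.0])
f_x = w0 + w @ x_new
print("f(x) =", f_x, " p(class 1 | x) =", 1 / (1 + np.exp(-f_x)))
print("sklearn agrees:", model.predict_proba(x_new.reshape(1, -1))[0, 1])
```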
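Cross-validation with k = 5 folds, as described above: each fold takes one turn as the test data while the other k-1 folds train the model. A sketch with synthetic data and a small decision tree as the model.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

scores = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=1).split(X):
    # In each iteration, (k-1)/k of the data trains the model and 1/k tests it.
    model = DecisionTreeClassifier(max_depth=3).fit(X[train_idx], y[train_idx])
    scores.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))

print("fold accuracies:", np.round(scores, 3))
print("mean holdout accuracy:", round(float(np.mean(scores)), 3))
```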
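Sequential forward selection using a nested holdout (validation) set, as described above. A greedy sketch rather than the book's implementation: it keeps adding the single best feature until validation accuracy stops improving.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(400, 6))
y = (X[:, 0] - X[:, 2] > 0).astype(int)       # only features 0 and 2 are informative

# Nested holdout: split the training data again into a sub-training and a validation set.
X_sub, X_val, y_sub, y_val = train_test_split(X, y, test_size=0.3, random_state=2)

def validation_accuracy(features):
    model = LogisticRegression().fit(X_sub[:, features], y_sub)
    return model.score(X_val[:, features], y_val)

selected, best_score = [], 0.0
while len(selected) < X.shape[1]:
    # Try adding each remaining feature and keep the one that helps most.
    candidates = [f for f in range(X.shape[1]) if f not in selected]
    score, feature = max((validation_accuracy(selected + [f]), f) for f in candidates)
    if score <= best_score:                   # stop when adding a feature no longer helps
        break
    selected, best_score = selected + [feature], score

print("selected features:", selected, " validation accuracy:", best_score)
```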
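Ridge (L2) and lasso (L1) regression compared with plain least squares. A sketch on synthetic data where only two of ten features matter; the alpha values are arbitrary, but the lasso's exact zeros illustrate the automatic feature selection mentioned above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 10))
# Only the first two features actually drive the target; the rest are noise.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

ols   = LinearRegression().fit(X, y)     # plain least squares
ridge = Ridge(alpha=10.0).fit(X, y)      # L2 penalty shrinks coefficients toward zero
lasso = Lasso(alpha=0.1).fit(X, y)       # L1 penalty zeroes many of them out entirely

print("least squares:", np.round(ols.coef_, 2))
print("ridge:        ", np.round(ridge.coef_, 2))
print("lasso:        ", np.round(lasso.coef_, 2))   # note the exact zeros
```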
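Euclidean distance (Equation 6-1) and a weighted-vote nearest-neighbor prediction, as described above. A plain NumPy sketch; the tiny dataset and the inverse-distance weighting are illustrative choices, not the book's.

```python
import numpy as np

def euclidean(a, b):
    """Equation 6-1: square root of the sum of squared attribute differences."""
    return np.sqrt(np.sum((np.asarray(a, float) - np.asarray(b, float)) ** 2))

def knn_predict(X_train, y_train, x_new, k=3):
    """Weighted-vote k-NN: closer neighbors contribute more to the prediction."""
    distances = np.array([euclidean(x, x_new) for x in X_train])
    nearest = np.argsort(distances)[:k]
    weights = 1.0 / (distances[nearest] + 1e-9)   # inverse-distance weighting
    votes = {}
    for idx, w in zip(nearest, weights):
        votes[y_train[idx]] = votes.get(y_train[idx], 0.0) + w
    return max(votes, key=votes.get)

# Tiny illustrative dataset: two attributes, two classes.
X_train = [(1, 1), (2, 1), (8, 9), (9, 8), (1, 2)]
y_train = ["no", "no", "yes", "yes", "no"]
print(knn_predict(X_train, y_train, (7, 8), k=3))   # -> "yes"
```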
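Finally, hierarchical clustering with an explicit linkage function, using SciPy. "Average" linkage, one common choice for the distance between clusters, is the mean pairwise Euclidean distance between their members; the two-blob data is made up.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(4)
# Two obvious groups of points in two dimensions.
points = np.vstack([rng.normal(loc=0.0, size=(10, 2)),
                    rng.normal(loc=5.0, size=(10, 2))])

# The linkage function defines the distance between clusters.
Z = linkage(points, method="average", metric="euclidean")

# Cut the resulting dendrogram into two clusters.
print(fcluster(Z, t=2, criterion="maxclust"))
```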