During weeks 4, 5 and 6, the Metis Data Science Bootcamp zoomed in on the following technologies:
- SQL and MySQL in the cloud (0.5 day);
- Supervised learning with scikit-learn and statsmodels (1.5 weeks);
- Interactive visualization with matplotlib, interactive widgets in IPython, mpld3 and especially D3.js (1.5 weeks).
The remainder of this article reports on the project and business context in which these technologies were applied.
A bank wants to run a targeted marketing campaign in order to sell a term deposit product.
Given a certain (but unspecified) campaign budget:
- Which customers should it target?
- In what order?
- Is it wise to spend the complete campaign budget?
The bank has a database of 41K past customer contact records, with an indication of successful and failed sales of this product.
Using the historic customer database, build a model to:
- predict which customers are (more) likely to buy the term deposit product
- rank customers by their propensity to buy the product
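As a rough illustration of these two goals, here is a minimal sketch with scikit-learn. The data is synthetic (generated with `make_classification`) because the bank's actual features and preprocessing are not described here, and logistic regression is simply used as an example classifier.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the bank's contact records (assumption: the real
# features are not shown in this article). Roughly 10% positives.
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Column 1 of predict_proba holds the estimated probability of class 1 ("buys").
propensity = model.predict_proba(X_test)[:, 1]

# Indices of the test customers, most probable buyer first.
ranking = np.argsort(-propensity)
```

The `ranking` array is what the curves further down rely on: every one of them walks through the customers in this order.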
To evaluate model performance, produce for each machine learning algorithm:
Data preparation, feature selection and model building
How well do our models perform? And much more importantly: how do they contribute to answering the business questions?
Let's start by looking at the ROC curve.
ROC curve: between a rock and a hard place
Hint: move the mouse over the graph, to highlight individual line points
Inside the False Positive Rate intervals [0-0.07] and [0.54-1.00] (horizontal axis), the logistic regression model scores best in terms of True Positive Rate (vertical axis). In between these two intervals, KNN (with k = 30) is the winning model. But all in all, the three algorithms play in the same league.
The selection of a specific FPR/TPR combination (and of a corresponding best algorithm) is for our customer to make. In business terms, this means making a trade-off between:
- Reaching sufficiently interested customers, which will bring in revenue
- In the process, annoying too many uninterested ones, and incurring the cost of making these contacts
On its own, the ROC curve does not tell us how to set the optimal threshold. Therefore, let's try to formulate the trade-off in more operational terms, by means of a lift curve.
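For readers who want to reproduce the points on a ROC curve, the sketch below uses scikit-learn's `roc_curve`. The labels and scores are toy values made up for illustration, not the bank data. Each FPR/TPR pair corresponds to one probability threshold, which is exactly the choice the curve leaves open.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Toy labels (1 = bought) and model scores; purely illustrative values.
y_true = np.array([0, 0, 1, 0, 1, 0, 0, 1, 0, 0])
y_score = np.array([0.1, 0.3, 0.8, 0.2, 0.6, 0.4, 0.05, 0.35, 0.5, 0.15])

# One (FPR, TPR) point per candidate threshold.
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# Area under the curve: a single-number summary of ranking quality.
auc = roc_auc_score(y_true, y_score)

for f, t, thr in zip(fpr, tpr, thresholds):
    print(f"threshold={thr:.2f}  FPR={f:.2f}  TPR={t:.2f}")
```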
Lift curve: trained model vs. baseline performance
From our reference dataset, we know that only a fraction (<10%) of our potential customers will accept an offer to buy the term deposit product. The question, of course, is who these customers are. For each customer in the test set, our trained models return an estimated probability that they will buy the product. This probability allows us to rank our customers accordingly, which is key to making smart(er) decisions.
But first, for the sake of the argument, let's assume we only know the distribution of yes/no answers in the training data set, and nothing more. The (near-perfect) diagonal line from (0,0) to (100, 100) in the graph below represents such a baseline case. The model behind it was built by randomly guessing who might accept or reject the offer, according to this yes/no distribution.
So, in the baseline case, if we were to contact 20% of the customers (x axis), we can expect to hit approximately 20% of all customers who would actually buy the product (y axis). This is approximately true for any other percentage point along this (near-perfect) diagonal.
Luckily, we do have more information, so we can do better! For example, hover over data point (20.0, 63.2) on the upper KNN line. This data point means that by merely selecting the top-ranked one fifth of our customers, we can expect to hit nearly two thirds (63.2%) of all customers who would actually buy the product!
Each table that appears when hovering over the data points also mentions cumulative lift. This measure is simply defined as the ratio of the y over the x value in that point. For the example data point at (20.0, 63.2), lift is equal to the trained model yield (63.2 percent) divided by the (theoretical) baseline yield at the same x value, so lift is 63.2 / 20.0 = 3.16. In the steep left parts of the trained model curves, cumulative lift rises sharply. On the KNN curve, it reaches its maximum value (4.13) at x=13.0.
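The definition above can be sketched in a few lines of plain Python. The function name and the toy labels and scores are assumptions made for illustration; only the final check uses the (20.0, 63.2) data point from the graph.

```python
def cumulative_gain_and_lift(y_true, scores, percent_contacted):
    """Return (gain in %, cumulative lift) after contacting the top-ranked share.

    percent_contacted must be greater than 0.
    """
    # Rank customers by descending score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_contacted = round(len(order) * percent_contacted / 100)
    buyers_hit = sum(y_true[i] for i in order[:n_contacted])
    gain = 100 * buyers_hit / sum(y_true)
    # Lift = trained model yield divided by the baseline yield (the x value).
    return gain, gain / percent_contacted


# Toy example: 10 customers, 2 of whom buy (illustrative values only).
y_true = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
gain, lift = cumulative_gain_and_lift(y_true, scores, 20)

# The data point from the graph: 63.2 / 20.0.
example_lift = round(63.2 / 20.0, 2)
```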
So based on the lift graph alone, the bank could decide to contact the top-ranked 13% of its customers, and then stop.
Alternatively, it could also decide to only stop contacting customers when the local lift (i.e. the lift in the percentile immediately preceding the point) starts dropping below 1, i.e. performing worse than the (theoretical) baseline. On the KNN curve, local lift fluctuates around 1 in the x interval [19.0-32.0], so this is not such a clear-cut decision.
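A minimal sketch of this alternative stopping rule follows. The gain curve points are hypothetical round numbers, not values read from the graph, and `local_lift` is a helper name introduced here.

```python
def local_lift(x, y):
    """Local lift per step: increase in gain divided by increase in contacts."""
    return [(y[k] - y[k - 1]) / (x[k] - x[k - 1]) for k in range(1, len(x))]


# Hypothetical gain curve: (percent contacted, percent of buyers reached).
x = [0, 10, 20, 30, 40]
y = [0, 40, 60, 68, 75]

lifts = local_lift(x, y)

# Stop at the last point before local lift first drops below 1.
stop_x = next(x[k] for k, value in enumerate(lifts) if value < 1)
```

With real curves the local lift is noisy, as the KNN example shows, so in practice one would probably smooth it or require several consecutive steps below 1 before stopping.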
Now, despite all these niceties, there's still one ingredient missing: money!
Profit curve: knowing when to stop the campaign
Each customer contact costs money, but also carries a potential reward in the form of future revenue. Let's bring in two new variables:
- the average cost per contact. This cost is incurred irrespective of whether the contact leads to a sale or not.
- the average revenue per successful contact. An unsuccessful contact brings in zero revenue, by definition.
Even though we have to limit ourselves to averages, these variables do allow us to improve on our rather abstract lift curve. The default profit curve below assumes that each contact (successful or not) costs $10 on average, while the average successful one carries $50 in revenue.
In this default configuration, the KNN profit curve peaks at x=14, bringing in $6,840 of cumulative profit. Compare this to the cumulative loss of $2,410 if we simply contacted customers in random order. As a matter of fact, with the current cost/revenue configuration the baseline case stays in the red throughout.
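The profit curve itself is simple bookkeeping: every contact costs money, and only the successful ones bring in revenue. Here is a minimal sketch using the default $10 cost and $50 revenue; the ranked outcomes are a toy list, not the test set.

```python
def cumulative_profit(outcomes_ranked, cost_per_contact, revenue_per_sale):
    """Cumulative profit after each contact, in ranking order.

    outcomes_ranked: 1 if the contact buys, 0 otherwise, most probable buyer first.
    """
    profits = []
    total = 0
    for bought in outcomes_ranked:
        # Cost is always incurred; revenue only on a successful contact.
        total += bought * revenue_per_sale - cost_per_contact
        profits.append(total)
    return profits


# Toy ranking of 10 customers (illustrative values only).
outcomes = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
curve = cumulative_profit(outcomes, cost_per_contact=10, revenue_per_sale=50)

# Number of contacts at which cumulative profit peaks.
best_n = curve.index(max(curve)) + 1
```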
Now increase the average revenue per successful contact to $90, using the second slider on top of the graph above, and watch closely. With the profit per successful contact raised to $90-$10=$80, the profit curve takes on a different shape. Cumulative profit now peaks at $17,530, but to get there we have to contact 20% of all customers. Also, if we were to contact all (100%) customers, we would more or less break even, with a cumulative loss of only $330.
Finally, increase both sliders by $10, so their respective values become $20 and $100. While this has no effect on the profit per successful contact ($100-$20 = $80 = $90-$10), the loss per failed contact doubles from $10 to $20. As a result, the trained model profit curves now cross the break-even line again (around x=40%), while the baseline curve stays permanently in the red again.
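One way to make sense of the three slider configurations is to look at the break-even success rate. If a fraction p of contacts succeeds, the expected profit per contact is p times revenue minus cost, so contacting is profitable on average only when p exceeds cost divided by revenue. The helper below is introduced for illustration and is not part of the interactive graph.

```python
def break_even_rate(cost_per_contact, revenue_per_sale):
    """Minimum success rate at which an average contact pays for itself."""
    return cost_per_contact / revenue_per_sale


# The three (cost, revenue) configurations discussed above.
configurations = [(10, 50), (10, 90), (20, 100)]
rates = {cfg: break_even_rate(*cfg) for cfg in configurations}

for (cost, revenue), rate in rates.items():
    print(f"cost=${cost}, revenue=${revenue}: break-even rate = {rate:.1%}")
```

The first and third configurations share the same 20% break-even rate, which is consistent with the baseline curve behaving alike in both, while the second needs only about 11%.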
The bottom line is that any specific configuration of cost and revenue greatly influences the shape of the curves, the profitability intervals, and therefore the decision boundaries.
Profit heat map: impressionistic view on individual and cumulative profit contributions
How much does each individual customer contribute to cumulative profit? In the heatmap below, each cell represents one customer from the test set. Customers are displayed in the ranking order defined by the model, with the most probable buyer first. Reading order is left-to-right, top-down - the same as for written English.
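The layout just described can be sketched as follows. The function name, the column count and the toy outcomes are assumptions for illustration; the actual heatmap is rendered with D3.js.

```python
def profit_grid(outcomes_ranked, cost, revenue, n_cols, cumulative=False):
    """Arrange per-customer profit values row by row, in ranking order."""
    values, total = [], 0
    for bought in outcomes_ranked:
        contribution = bought * revenue - cost
        total += contribution
        values.append(total if cumulative else contribution)
    # Left-to-right, top-down: cut the ranked list into rows of n_cols cells.
    return [values[i:i + n_cols] for i in range(0, len(values), n_cols)]


# Toy ranking of 8 customers, shown 4 per row (illustrative values only).
outcomes = [1, 1, 0, 1, 0, 0, 0, 0]
individual = profit_grid(outcomes, cost=10, revenue=50, n_cols=4)
running = profit_grid(outcomes, cost=10, revenue=50, n_cols=4, cumulative=True)
```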
In the baseline case, with default cost and revenue values, the heatmap progressively takes on deeper shades of red. Since the baseline lists customers in random order, there are more negative than positive contributions, no matter which region of the heatmap we look at. This is aptly demonstrated if you select profit: "contribution per individual customer".
Now select a non-baseline model, and see how the random distribution of individual positive profit contributions transforms into a more ordered view: the positive profit contributions are indeed pushed towards the beginning (top-left corner) of the heatmap. This is the power of the learned models at work.
Finally, also select different configurations of the cost and revenue parameters, and see how this influences the heatmap. For example, try the baseline model with cost 10 and revenue 90, as we did in the profit curve before. This configuration confirms that the cumulative profit remains near the break-even line.
Customers must be targeted in descending order of their probability to buy the product. Running the trained models on unseen customer data will provide such a ranking.
To enable an informed decision about when to halt the campaign, the bank must first provide information on cost and revenue per (successful) contact.
- Without such information, we advise stopping when maximum cumulative lift is achieved, or when the budget runs out, whichever comes first.
- If we do have such information, we advise stopping when maximum profit is achieved, or when the budget runs out, whichever comes first.
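The second rule can be sketched as below. Two assumptions are made here that the article does not state: the budget only has to cover contact costs, and the ranked outcomes come from held-out data, since real outcomes are not known before the campaign runs.

```python
def contacts_to_make(outcomes_ranked, cost, revenue, budget):
    """Return (number of contacts, cumulative profit) at the best stopping point.

    Stops at maximum cumulative profit or when the budget runs out,
    whichever comes first.
    """
    # Assumption: the budget pays for contacts only.
    affordable = min(len(outcomes_ranked), int(budget // cost))
    best_n, best_profit, total = 0, 0, 0
    for n, bought in enumerate(outcomes_ranked[:affordable], start=1):
        total += bought * revenue - cost
        if total > best_profit:
            best_n, best_profit = n, total
    return best_n, best_profit


# Toy ranking (illustrative values only), default cost and revenue.
outcomes = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
unconstrained = contacts_to_make(outcomes, cost=10, revenue=50, budget=1000)
constrained = contacts_to_make(outcomes, cost=10, revenue=50, budget=30)
```

With a generous budget the rule stops at the profit peak; with a budget of $30 only three contacts are affordable, and the best stopping point within that limit is after the second one.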
Having set the cost and revenue variables, define the optimization criterion (e.g. maximum cumulative profit). Then consult the profit curve and heatmap to identify which and how many customers to contact, most probable buyers first.
- Foster Provost & Tom Fawcett: Data Science for Business. What you need to know about data mining and data-analytic thinking. O'Reilly, 2013.
Chapter 8: "Visualizing Model Performance" introduces the ROC, Lift and Profit curves.
- NVD3. Re-usable charts for D3.js
- Heatmap example