# Elementary Statistics for Direct Marketing

This post looks at some elementary statistics for direct marketing. Most of the techniques are direct applications of topics learned in a first-year statistics course, hence the "elementary". I'll start off by covering some background and terminology on direct marketing and then introduce some of the statistical inference techniques that are commonly used. As usual, I'll mix in some theory where appropriate to build some intuition.

#### Direct Marketing

Direct marketing is a form of advertising that sends communications directly to potential customers through a wide variety of media such as (snail) mail, e-mail, text messages, and websites, among others. The distinguishing feature of direct marketing is that a business has a database of potential customers and sends a direct (usually personally addressed) message to these customers (or a segment of them) with a specific outcome in mind. Some familiar examples include:

- Signing up for a rewards plan at a retailer and receiving emails about offers periodically at the store (e.g. "Buy $75 get $25 off").
- A (snail) mail offer for a "discounted" magazine subscription (usually they get your address from another magazine you've subscribed to).
- A telemarketer calls you (at what seems to always be an inconvenient time) to get you to buy their product.

The interesting thing about direct marketing campaigns (as opposed to mass market campaigns) is that you can (usually) individually track the outcome of each message that you send out. This fine-grained tracking allows you to scientifically (read: statistically) measure the effectiveness of a direct marketing campaign and learn what works and what doesn't.

In the following sections, I'll discuss direct marketing campaigns with respect to retail direct marketing campaigns because that's what I'm most familiar with, but the general idea should apply in other domains as well.

##### Direct Marketing Campaigns

Imagine you are a retailer wanting to increase sales. With very broad strokes, there are three ways to go about it:

- Get more (unique) customers to come in the door (or onto your website).
- Get each customer to spend more money on each visit.
- Get each customer to come back more often.

Obviously there are many ways to do this and it is a very complex subject, so let's stick with one of the most popular ways: by providing an offer to the customer. The offer might entice a new customer to walk in, stretch an existing customer to spend a bit more, or even cause a previous customer to come back earlier than she would have without the offer.

The simplest type of offer is a mass-promotion where the offer is available to everyone. For example: "$2 off a bag of milk". The drawback of giving it to everyone is that some people would have bought the milk without the mass-offer (resulting in a missed opportunity for full-priced revenue), while for others, $2 may not be enough of a bargain to buy. Enter direct marketing.

Now suppose you have a database of all (or a large segment) of your customers. You might have their email address, phone number, age, purchase history, address, and any number of other individual attributes. Now you can target customers in a more fine grained way. You might only want to give that $2 milk deal to the price-sensitive shoppers so that they don't go over to your competitor while also trying to cross-sell them other complementary products. Or you might want to target your high-end customers by upselling them with an offer for a luxury version of products they previously bought. In the limit, you can customize the offer on a one-to-one basis for true personalization.

There are many machine learning and statistical techniques to build models to target and predict how customers will respond to the various combinations of offers. I'll cover that in another post. For now, let's stick with something more mundane (but perhaps more important): after sending out a direct marketing campaign, how do I know if it worked?

##### Important Metrics

The most important aspect of running a direct marketing campaign is to make sure it is achieving your business objective, which usually relates to one of the following:

- Incremental revenue
- Incremental active rate (similar to a conversion rate)
- Return On Investment (ROI)

We'll focus on the first one, talk a bit about the second one, and leave the last one out for now.

Let's try to translate our business objective into something that we can measure. Increasing incremental revenue in a direct marketing campaign means that by running this marketing campaign, we will have made more money than by not running it. Incremental active rate means that we will have convinced more (unique) customers to walk in the door than we would have without providing this offer to them. Lastly, ROI is simply the efficiency of the campaign dollars; if you have negative ROI, you're basically losing money by running this campaign.

Average revenue per customer (or spend per customer), active rate and ROI can be measured across campaigns using some simple ratios:

The difficult question is how to measure the effectiveness of the offer,
because you can't give the offer to a person *and* withhold it from them
at the same time. This is a well-known problem in the medical field,
where it arises when measuring the effectiveness of treatments, and it has
been solved for decades.

##### Control, Control, Control

The experimental setup involves dividing your population (e.g. customers whom you send the offer to) into two randomly selected groups: treatment and control. The treatment group receives the direct marketing offer (or drug in the case of medical trials) and the control group will receive the business-as-usual placebo. Depending on what you are trying to measure the business-as-usual treatment might be nothing (measuring the overall effectiveness of the offer), a generic offer (measuring the A vs. B effectiveness of your offer), or some variation that allows you to compare one treatment to another.

The randomized control group allows us to "control" for confounding variables.
That is, for hidden biases that might be introduced when running the experiment [1]
such as a holiday or perhaps a competitor's sale.
It also allows us to make causal statements about the relationship between two
variables. Instead of just saying treatment A is correlated or associated with
a sales increase, we can say treatment A *causes* a sales increase.
Randomized control groups are the primary method in which we can verify causality
between two events.

A few important points when practically designing an experiment in this scenario [2]:

- A randomly selected control group is needed in order to make a causal statement.
- The samples (i.e. customers) are independent and measured on an individual basis (not just the total revenue but the revenue for each customer too).
- Sample sizes have to be large enough (for both groups) in order to reach a statistically significant conclusion [3].

We'll cover more on these topics below.
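To make the random split concrete, here is a minimal sketch of assigning customers to treatment and control. The function name `assign_groups`, the `control_fraction` parameter, and the integer customer IDs are all hypothetical choices for illustration, not part of any particular marketing system.

```python
import random


def assign_groups(customer_ids, control_fraction=0.2, seed=None):
    """Randomly split customers into (treatment, control) lists.

    Shuffling before the split means assignment does not depend on
    any customer attribute, which is what makes the control group
    "randomized".
    """
    if not 0 < control_fraction < 1:
        raise ValueError("control_fraction must be strictly between 0 and 1")
    rng = random.Random(seed)
    shuffled = list(customer_ids)
    rng.shuffle(shuffled)
    n_control = round(len(shuffled) * control_fraction)
    return shuffled[n_control:], shuffled[:n_control]


# Hypothetical customer base of 20,000 IDs with a 25% control group.
treatment, control = assign_groups(range(20_000), control_fraction=0.25, seed=42)
print(len(treatment), len(control))
```

Fixing the seed is only there to make the split reproducible for auditing; any source of randomness that ignores customer attributes would serve the same purpose.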

#### Elementary Statistics

Now that we have a high level understanding of how direct marketing campaigns work, let's try to work out some of the math.

Let's imagine we have \(n\) i.i.d. variables \(Y_1, Y_2, \ldots, Y_n\) representing our outcome variables, i.e. \(Y_i\) is customer \(i\)'s total spend at your store during the campaign period. Also denote the binary treatment variable for the \(i^{th}\) customer as \(X_i=1\) for "treated" (given the offer) and \(X_i=0\) for "not treated" (i.e. control group or not given the offer). Note: we'll represent random variables with capital letters and their corresponding values after they have been observed with lower case letters [4].

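To not bury the lede, a good point estimator for measuring campaign
effectiveness is just the difference of spend per customer (SPC), or
**lift**, between the treatment and control groups (i.e. the difference of
their means):

\begin{align*} \hat{\text{lift}} &= \widehat{SPC}_{\text{T}} - \widehat{SPC}_{\text{C}} \\ &= \frac{1}{n_T}\sum_{i=1}^{n} y_i x_i - \frac{1}{n_C}\sum_{i=1}^{n} y_i (1 - x_i) \tag{2} \end{align*}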

where the \(\hat{}\) symbol represents an estimate, \(n_T = \Sigma_{i=1}^{n} x_i\) and \(n_C = \Sigma_{i=1}^{n} (1-x_i)\). All those equations boil down to basically just taking the difference of spend per customer between treatment and control. In some of the more math heavy sections below, you'll see why we introduced all this notation. First, let's see why our control groups have to be randomized.
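As a small illustration of this estimator, the sketch below computes the difference in spend per customer from a list of observed spends `y` and treatment flags `x`. The function name `estimate_lift` and the six data points are made up for the example.

```python
def estimate_lift(y, x):
    """Difference in spend per customer between treatment and control.

    y: observed spend for each customer
    x: 1 if the customer was treated (got the offer), 0 otherwise
    """
    n_t = sum(x)            # number of treated customers
    n_c = len(x) - n_t      # number of control customers
    if n_t == 0 or n_c == 0:
        raise ValueError("both groups must contain at least one customer")
    spc_t = sum(yi * xi for yi, xi in zip(y, x)) / n_t
    spc_c = sum(yi * (1 - xi) for yi, xi in zip(y, x)) / n_c
    return spc_t - spc_c


# Hypothetical spends: three treated customers, three control customers.
y = [50.0, 0.0, 30.0, 20.0, 0.0, 10.0]
x = [1, 1, 1, 0, 0, 0]
print(round(estimate_lift(y, x), 2))
```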

##### Causal Inference [5]

As I mentioned before, we can never simultaneously give a promotional offer
*and* not give it at the same time (unless we had access to some kind of parallel universe).
However, the ideal measurement for campaign effectiveness is exactly this
quantity! Let's define this more precisely.

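Introduce two variables \(C_{X=0}, C_{X=1}\) as potential outcome variables. \(C_{X=0}\) is the outcome if we did not treat customer \(i\) and \(C_{X=1}\) is the outcome if we did treat customer \(i\). Therefore:

\begin{align*} Y = \begin{cases} C_{X=0} & \text{if } X = 0 \\ C_{X=1} & \text{if } X = 1 \end{cases} \tag{3} \end{align*}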

Or more concisely, \(Y=C_X\). (Note: we can never observe both \(C_{X=0}\) and \(C_{X=1}\) at the same time.)

The effect we actually want to measure is the difference in expected
value of these two variables, called the **average causal effect**:

\begin{align*} \text{lift} = E(C_{X=1}) - E(C_{X=0}) \tag{4} \end{align*}

In other words, we want to find the SPC of sending *everyone* an offer minus
the SPC of sending *no one* an offer.
Equation 2 looks similar but actually measures something different called
**association** (denoted by \(A\)):

\begin{align*} A = E(Y|X=1) - E(Y|X=0) \tag{5} \end{align*}

It's a widely known fact that association does not equal causation. Let's take a look at a small example why.

Example 1

Here we can calculate the lift and association:

The lift is zero because the treatment has no effect: look at the hypothetical \(C_X\) variables, they are the same regardless of whether or not the treatment was applied. The association on the other hand is clearly positive at $10.

Coming up with other examples where association and lift have opposite signs is not too difficult. The reason why we got such different values for \(A\) and \(\text{lift}\) is that \(C_{X=0}, C_{X=1}\) are not independent of \(X\). That is, the treatment is not independent of the customer. In the above example, we put all the high value customers in the treatment group while putting the low value ones in the control group. We want to ensure any confounding variables (like whether or not a customer is a high spender) are spread out proportionally across both treatment and control. The next theorem states this idea.
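The effect of biased assignment can be reproduced with a few made-up numbers. The sketch below uses hypothetical potential outcomes (these are illustrative values, not the data from Example 1) in which the offer changes nothing, yet all the high spenders are placed in the treatment group.

```python
from statistics import mean

# Hypothetical potential outcomes: the offer has no effect on anyone,
# so C_{X=0} and C_{X=1} are identical for every customer.
c0 = [10, 10, 10, 30, 30, 30]   # spend if NOT treated
c1 = [10, 10, 10, 30, 30, 30]   # spend if treated

# Biased assignment: only the high spenders receive the offer.
x = [0, 0, 0, 1, 1, 1]

# We only ever observe one potential outcome per customer: Y = C_X.
y = [c1[i] if x[i] == 1 else c0[i] for i in range(len(x))]

# True average causal effect (needs both potential outcomes).
lift = mean(c1) - mean(c0)

# Association (uses only what we can observe).
association = (mean(yi for yi, xi in zip(y, x) if xi == 1)
               - mean(yi for yi, xi in zip(y, x) if xi == 0))

print(lift, association)
```

The lift is zero by construction, while the association simply reflects the gap between high and low spenders.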

**Theorem**

If we randomly assign subjects to treatment and control such that \(P(X=0) > 0\) and \(P(X=1) > 0\), then \(A=\text{lift}\).

**Proof**-
Since X is randomly assigned, X is independent of \(C_{X=1}, C_{X=0}\), so:

\begin{align*} \text{lift} &= E(C_{X=1}) - E(C_{X=0}) \\ &= E(C_{X=1}|X=1) - E(C_{X=0}|X=0) && \text{since } X \text{ is independent of } C_X \\ &= E(Y|X=1) - E(Y|X=0) && \text{since } Y = C_X \\ &= A \tag{7} \end{align*}

Using Theorem 1, we can see that by assigning random control groups, our use of difference in SPC or association (Equation 2) is identical to the actual causal effect, i.e. lift. However, if we don't have random assignments (and have some kind of bias in assignment towards treatment or control), then there is no guarantee that the association we computed in Equation 2 has anything to do with lift, as we saw in Example 1.
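A quick simulation can make the theorem tangible. The sketch below assumes a hypothetical population of high and low spenders for whom the offer has no effect (true lift of zero) and assigns treatment by a fair coin flip; the function name, spend values, and seed are arbitrary choices for illustration.

```python
import random
from statistics import mean


def simulate_association(n=20_000, seed=0):
    """Association under random assignment when the true lift is zero."""
    rng = random.Random(seed)
    # Each customer is a low (10) or high (30) spender regardless of the offer.
    spend = [rng.choice([10.0, 30.0]) for _ in range(n)]
    # Random assignment: a coin flip that ignores the customer's spend level.
    x = [1 if rng.random() < 0.5 else 0 for _ in range(n)]
    treated = [s for s, xi in zip(spend, x) if xi == 1]
    control = [s for s, xi in zip(spend, x) if xi == 0]
    return mean(treated) - mean(control)


association = simulate_association()
print(f"association = {association:.3f} (true lift is 0)")
```

With random assignment the high and low spenders end up spread roughly evenly across both groups, so the association should land close to the true lift rather than reflecting who was chosen for treatment.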

##### Central Limit Theorem and Confidence Intervals

From here on out to simplify our notation, let's define \(U_i = Y_i X_i\) and \(V_i = Y_i (1 - X_i)\) to represent samples from the treatment and control respectively. Let their respective mean and variance be represented by \(\mu_U\), \(\mu_V\) and \(\sigma^2_U\), \(\sigma^2_V\). This is done just for convenience so we don't have to keep writing our equations with \(X_i\) in them.

In general, the distributions of customer spend, \(U_i\) and \(V_i\), do not take any familiar form of distribution that we know. However, using the results of the central limit theorem (CLT), we know that the sample mean of each population (treatment \(\bar{U}\) and control \(\bar{V}\)) can be approximated by a normal distribution (if our samples are large enough):

\begin{align*} \bar{U} \approx N\left(\mu_U, \frac{\sigma^2_U}{n_U}\right), \qquad \bar{V} \approx N\left(\mu_V, \frac{\sigma^2_V}{n_V}\right) \tag{8} \end{align*}

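Usually \(n\) is quite large (\(n>10,000\)) so the normal approximation is quite good. Similarly, using the strong law of large numbers, the sample variance (denoted by \(s^2\)) is a pretty good approximation of the actual variance when we have large \(n\):

\begin{align*} s^2_U = \frac{1}{n_U - 1}\sum_{i=1}^{n_U} (u_i - \bar{u})^2 \approx \sigma^2_U, \qquad s^2_V = \frac{1}{n_V - 1}\sum_{i=1}^{n_V} (v_i - \bar{v})^2 \approx \sigma^2_V \tag{9} \end{align*}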

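Using Equations 8 and 9, we get our approximation of the sample mean for large \(n\):

\begin{align*} \bar{U} \approx N\left(\mu_U, \frac{s^2_U}{n_U}\right), \qquad \bar{V} \approx N\left(\mu_V, \frac{s^2_V}{n_V}\right) \tag{10} \end{align*}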

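Knowing that the difference of two independent normal random variables is itself normal (with mean equal to the difference of the means, and variance equal to the sum of variances), our lift is:

\begin{align*} \text{lift} = \bar{U} - \bar{V} \approx N\left(\mu_U - \mu_V, \frac{s^2_U}{n_U} + \frac{s^2_V}{n_V}\right) \tag{11} \end{align*}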

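Now that our lift is simply a normal random variable whose mean and variance we know how to estimate, we can get a \(1 - \alpha\) two-sided confidence interval. Since lift is approximately normal, we know that \(Z=\frac{\text{lift} - \mu_{\text{lift}}}{\sigma_{\text{lift}}}\) has a standard normal distribution:

\begin{align*} P(-z_{\alpha/2} \leq Z \leq z_{\alpha/2}) &= 1 - \alpha \\ P(\text{lift} - z_{\alpha/2}\sigma_{\text{lift}} \leq \mu_{\text{lift}} \leq \text{lift} + z_{\alpha/2}\sigma_{\text{lift}}) &= 1 - \alpha \tag{12} \end{align*}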

That is, the true lift (\(\mu_{\text{lift}}\)) lies in the interval \(({\text{lift}} - z_{\alpha/2}{{\sigma}_{\text{lift}}}, {\text{lift}} + z_{\alpha/2}{{\sigma}_{\text{lift}}})\) with frequency \(1-\alpha\). Plugging in our estimates of \(\hat{\text{lift}}=\bar{u}-\bar{v} = SPC_{\text{T}} - SPC_\text{C}\) and \(\hat{\sigma}_{\text{lift}}=\sqrt{\frac{s^2_U}{n_U} + \frac{s^2_V}{n_V}}\) and looking up the appropriate Z-score, we can compute our \(1-\alpha\) confidence interval:

\begin{align*} (\bar{u} - \bar{v}) \pm z_{\alpha/2}\sqrt{\frac{s^2_U}{n_U} + \frac{s^2_V}{n_V}} \tag{13} \end{align*}
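The interval described above can be computed with nothing but the standard library. This is a minimal sketch, assuming the per-customer spends for each group are available as plain lists; the function name and the toy data are invented for illustration, and real campaigns would have far more customers for the normal approximation to be trustworthy.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def lift_confidence_interval(treatment, control, alpha=0.05):
    """Two-sided (1 - alpha) confidence interval for lift in spend per customer."""
    z = NormalDist().inv_cdf(1 - alpha / 2)       # e.g. about 1.96 for alpha = 0.05
    lift = mean(treatment) - mean(control)        # difference in SPC
    # statistics.variance is the sample variance (divides by n - 1).
    std_err = sqrt(variance(treatment) / len(treatment)
                   + variance(control) / len(control))
    return lift - z * std_err, lift + z * std_err


# Hypothetical spends, repeated to get 100 customers per group.
treatment_spend = [12.0, 8.0] * 50
control_spend = [9.0, 7.0] * 50
low, high = lift_confidence_interval(treatment_spend, control_spend)
print(f"lift is between {low:.2f} and {high:.2f} with 95% confidence")
```

If the interval excludes zero, the campaign's lift is statistically distinguishable from no effect at the chosen confidence level.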

##### Activation Rate and Binomial Outcome Variables

All the math above equally applies to when your outcome variable is not a real-valued number but a binary outcome such as activation or conversion. In that case, each customer can only have two outcomes: \(1\) (shops or converts) and \(0\) (doesn't shop or convert). There are a few caveats though.

A good rule of thumb for when you can use a normal to approximate a binomial is (from Wackerly et al.):

\begin{align*} n \geq 9 \frac{\max(p, 1-p)}{\min(p, 1-p)} \tag{14} \end{align*}

So for \(p=0.01\), \(n \geq 9 \frac{0.99}{0.01} = 891\), meaning you want your treatment and control groups to be at least 900. Most likely you will want bigger values of \(n\) to control the error rate (see below).
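The rule of thumb is easy to wrap in a helper. The sketch below assumes the rule takes the form "nine times the larger of \(p\) and \(1-p\) divided by the smaller", which matches the \(p=0.01\) calculation above; the function name is hypothetical.

```python
def min_n_for_normal_approx(p):
    """Smallest group size for which a normal approximation to a
    binomial with success probability p is reasonable (rule of thumb)."""
    if not 0 < p < 1:
        raise ValueError("p must be strictly between 0 and 1")
    q = 1 - p
    return 9 * max(p, q) / min(p, q)


for rate in (0.01, 0.05, 0.5):
    print(rate, round(min_n_for_normal_approx(rate)))
```

Rare events (small \(p\)) need much larger groups than balanced ones, which is why low conversion rates push sample sizes up.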

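The standard estimators for the mean and variance of a binomial proportion would be used, where \(Y\) is the number of successes (or \(1\)'s) in the sample, and \(n\) is the total number of samples:

\begin{align*} \hat{p} = \frac{Y}{n}, \qquad \hat{\sigma}^2_{\hat{p}} = \frac{\hat{p}(1 - \hat{p})}{n} \tag{15} \end{align*}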

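Using our lift notation above, we would get:

\begin{align*} \hat{\text{lift}} = \hat{p}_U - \hat{p}_V, \qquad \hat{\sigma}_{\text{lift}} = \sqrt{\frac{\hat{p}_U(1 - \hat{p}_U)}{n_U} + \frac{\hat{p}_V(1 - \hat{p}_V)}{n_V}} \tag{16} \end{align*}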

Plugging these two values into the equations from the previous section will give us a good approximation of the lift with respect to the activation rate.
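For a binary outcome the same interval can be computed from four counts. This is a sketch under the assumption that each group's variance is estimated as \(\hat{p}(1-\hat{p})/n\); the function name and the campaign counts are made up.

```python
from math import sqrt
from statistics import NormalDist


def activation_lift_ci(conv_t, n_t, conv_c, n_c, alpha=0.05):
    """Two-sided (1 - alpha) confidence interval for the lift in
    activation rate (treatment rate minus control rate)."""
    p_t = conv_t / n_t                      # treatment activation rate
    p_c = conv_c / n_c                      # control activation rate
    std_err = sqrt(p_t * (1 - p_t) / n_t + p_c * (1 - p_c) / n_c)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    lift = p_t - p_c
    return lift - z * std_err, lift + z * std_err


# Hypothetical campaign: 600 of 10,000 treated customers shopped,
# versus 500 of 10,000 in the control group.
low, high = activation_lift_ci(600, 10_000, 500, 10_000)
print(f"activation lift between {low:.4f} and {high:.4f}")
```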

#### Selecting a Sample Size

The above sections are about finding a confidence interval *after* you have all
your observations. What if you want to ensure that you can detect a statistically
significant result (if there is one)? The only thing you can (usually) do a priori is pick the
sample size.

There are two main ways to select a sample size: (i) using an error bound, and (ii) using the hypothesis testing framework. Let's take a look at both.

##### Sample Size using an Error Bound

In this method, we'll be using our confidence interval from Equation 13. We can see that our true mean is bounded within \(\pm z_{\alpha/2}\sqrt{\frac{\sigma^2_U}{n_U} + \frac{\sigma^2_V}{n_V}}\) of our estimate of lift. Limiting this quantity to a specific value (\(B\)) and solving for \(n\), we can compute our desired sample size. (Set \(n_U = n\) and \(n_V = c n\) for some constant \(c\) to make our computation a bit simpler.)

\begin{align*} z_{\alpha/2}\sqrt{\frac{\sigma^2_U}{n} + \frac{\sigma^2_V}{cn}} &= B \\ n &= \frac{z^2_{\alpha/2}\left(\sigma^2_U + \frac{\sigma^2_V}{c}\right)}{B^2} \tag{17} \end{align*}

The only caveat here is finding an estimate for \(\sigma\): we're doing this before we have any observations, so we can't use samples to estimate it. Instead, you could use a known sample variance from a similar past experiment. Another quick and dirty estimate relies on the fact that the range of typical values usually spans about \(4\sigma\), so \(\sigma \approx \text{range}/4\). Either approach will provide a decent estimate. Let's take a look at an example.

Example 2

From a past experiment, we know that customers usually spend between $20 and $80. We can send \(20,000\) flyers to customers and want an error bound on spend per customer of $0.50 with 95% confidence. How many people should we allocate for treatment and control?

First, find an approximate standard deviation:

\begin{align*} \sigma \approx \frac{80 - 20}{4} = 15 \tag{18} \end{align*}

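Using this estimate (assuming that it is valid for both control and treatment) and Equation 17 (with \(1 - \alpha = 0.95\) so \(z_{0.05/2}\approx 1.96\)), we get:

\begin{align*} n = \frac{1.96^2\left(15^2 + \frac{15^2}{c}\right)}{0.5^2} \approx 3457.44\left(1 + \frac{1}{c}\right) \tag{19} \end{align*}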

We want to allocate as small a control group as possible so we can maximize revenue (assuming our promotion has positive lift). Knowing that \(n(1 + c) = 20,000\) (since treatment and control add up to this number), we can solve for \(n\) in Equation 19 (using the quadratic formula):

\begin{align*} n(20000 - n) &= 3457.44 \times 20000 \\ n^2 - 20000n + 69148800 &= 0 \\ n &\approx 15554 \text{ or } 4446 \tag{20} \end{align*}

The two solutions correspond to \(n\) being the larger or smaller number because we did not make any assumptions about which one is larger (\(n\) or \(cn\)). Thus, we should pick the treatment group to be approximately 15550 and control to be 4450. Contrast this with setting \(c=1\) in Equation 19, which would yield \(n=6915\), a slightly larger control group than is necessary.
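The arithmetic in Example 2 can be checked with a short script. This is a sketch, not a general-purpose tool: the function names are invented, the z-score is hard-coded to the rounded 1.96 used in the text, and the split solver assumes both groups share the same \(\sigma\).

```python
from math import sqrt


def n_for_error_bound(sigma_u, sigma_v, bound, c=1.0, z=1.96):
    """Size n of the first group so that the confidence interval
    half-width is at most `bound`, when the other group has size c * n."""
    return z ** 2 * (sigma_u ** 2 + sigma_v ** 2 / c) / bound ** 2


def split_fixed_total(k, total):
    """Solve n = k * (1 + 1/c) together with n * (1 + c) = total.

    Eliminating c gives n^2 - total * n + k * total = 0, whose two roots
    are the sizes of the larger and smaller group.
    """
    disc = total ** 2 - 4 * k * total
    if disc < 0:
        raise ValueError("total is too small to achieve this error bound")
    root = sqrt(disc)
    return (total + root) / 2, (total - root) / 2


sigma = (80 - 20) / 4                       # range / 4 rule of thumb
k = 1.96 ** 2 * sigma ** 2 / 0.5 ** 2       # common factor z^2 * sigma^2 / B^2
larger, smaller = split_fixed_total(k, 20_000)
print(f"treatment ~ {larger:.0f}, control ~ {smaller:.0f}")
print(f"equal split needs n ~ {n_for_error_bound(sigma, sigma, 0.5):.0f} per group")
```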

##### Sample Size using Hypothesis Testing and Statistical Power

Another method to pick sample size is to use a hypothesis testing framework along with statistical power. To conduct this procedure, we need a few things:

- \(\alpha\): the false positive rate (or how often we incorrectly detect something is true when it's not). This is usually set at \(0.01\) or \(0.05\) in most scientific experiments.
- Power (denoted by \(1 - \beta\), where \(\beta\) is the false negative rate): How often we are able to conclude that the alternative hypothesis is true when it is. A common value is \(0.80\).
- Minimum detectable effect size (denoted by \(\Delta\)): The minimum effect size (i.e. lift) we want to be able to detect. For example, we may choose a $0.50 SPC as the minimum detectable effect size.

The basic idea is that we first establish a test or "rule" using the hypothesis testing framework (and a given \(\alpha\)) to decide when we accept and when we reject the null hypothesis for a given sample. Next, we use this rule, with the minimum detectable effect size as our alternative hypothesis, along with the statistical power to compute the required \(n\). Let's take a look in detail.

First, we determine our statistical test. Since we're dealing with large sample sizes, we've already said that we can approximate things using a normal distribution. The uniformly most powerful test in this case (which you can derive using the Neyman-Pearson Lemma) is given by:

Translating that into our problem with the outcome variable being lift and the null hypothesis being \(\mu_{\text{lift}}=0\), we get:

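Now that we have established our test, \(\bar{U} - \bar{V} > z_{\alpha/2}\sigma_{\text{lift}}\) with \(\sigma_{\text{lift}} = \sqrt{\frac{\sigma^2_U}{n} + \frac{\sigma^2_V}{cn}}\) (taking \(n_U = n\) and \(n_V = cn\) as before), we can see how often we will correctly identify the alternative hypothesis (\(\mu_{\text{lift}} = \Delta\)) to be true when it is true. Here \(z_{1-\beta}\) denotes the value with \(P(Z \leq z_{1-\beta}) = 1 - \beta\):

\begin{align*} 1 - \beta &= P\left(\bar{U} - \bar{V} > z_{\alpha/2}\sigma_{\text{lift}} \,\middle|\, \mu_{\text{lift}} = \Delta\right) \\ &= P\left(Z > z_{\alpha/2} - \frac{\Delta}{\sigma_{\text{lift}}}\right) \\ \Rightarrow \quad z_{1-\beta} &= \frac{\Delta}{\sigma_{\text{lift}}} - z_{\alpha/2} \\ \Rightarrow \quad n &= \frac{(z_{\alpha/2} + z_{1-\beta})^2\left(\sigma^2_U + \frac{\sigma^2_V}{c}\right)}{\Delta^2} \tag{23} \end{align*}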

Let's take a look at an example of how this works.

Example 3

Using the numbers from Example 2, how large of a sample size do we require if \(1-\beta=0.80\) and now we can send \(30,000\) flyers?

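Using Equation 23 and estimating \(\sigma \approx 15\), we have (where \(z_{1-\beta}\approx 0.84\)):

\begin{align*} n = \frac{(1.96 + 0.84)^2\left(15^2 + \frac{15^2}{c}\right)}{0.5^2} = 7056\left(1 + \frac{1}{c}\right) \tag{24} \end{align*}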

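With the added constraint \(n(1 + c) = 30000\), we solve for \(c\):

\begin{align*} n^2 - 30000n + 7056 \times 30000 &= 0 \\ n &\approx 18650 \text{ or } 11350 \\ c &= \frac{30000}{18650} - 1 \approx 0.61 \tag{25} \end{align*}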

So we should send approximately 11350 flyers to the control group and 18650 to the treatment group. Contrast this with setting \(c=1\) resulting in \(n=14112\).
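The numbers in Example 3 can be reproduced the same way. In this sketch the function name is hypothetical, and the z-scores default to the rounded values 1.96 and 0.84 used in the text; exact quantiles would shift the results slightly.

```python
from math import sqrt


def n_for_power(sigma_u, sigma_v, delta, c=1.0, z_alpha=1.96, z_power=0.84):
    """Size n of the first group needed to detect a lift of `delta`
    with the given significance and power, when the other group has size c * n."""
    return (z_alpha + z_power) ** 2 * (sigma_u ** 2 + sigma_v ** 2 / c) / delta ** 2


sigma, delta, total = 15.0, 0.5, 30_000

# Equal split (c = 1): size required for each group.
print(f"equal split needs n ~ {n_for_power(sigma, sigma, delta):.0f} per group")

# Unequal split with a fixed total: n = k * (1 + 1/c) and n * (1 + c) = total
# reduce to the quadratic n^2 - total * n + k * total = 0.
k = (1.96 + 0.84) ** 2 * sigma ** 2 / delta ** 2
root = sqrt(total ** 2 - 4 * k * total)
treatment_n, control_n = (total + root) / 2, (total - root) / 2
print(f"treatment ~ {treatment_n:.0f}, control ~ {control_n:.0f}")
```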

##### Binomial Outcome Variables

One last point, the above calculations for sample size are equally valid when the outcome is binary (e.g. for conversion or activation rate). The big difference is how we estimate the standard deviation/variance. Since we're dealing with a binary outcome, we can model the total number of customers who convert as a binomial random variable (\(Y\)) and estimate the variance (and standard deviation) according to Equation 16:

With Equation 26, we need to estimate \(p\). Usually this can be estimated based on prior campaign that you ran where you have a ballpark of the previous conversion rate. Putting it all together with the estimate \(\hat{p}\) (for both \(U\) and \(V\)):

For example, you might expect the baseline conversion rate to be approximately 5% (\(\hat{p}=0.05\)). In that case, we can easily solve for \(n\) as before.
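As a rough illustration of the binary case, the sketch below plugs the Bernoulli variance \(\hat{p}(1-\hat{p})\) into the error-bound approach. The choice of the error-bound method, the bound of half a percentage point, and the equal split (\(c=1\)) are all assumptions made for this example.

```python
def n_for_rate_error_bound(p_hat, bound, c=1.0, z=1.96):
    """Size n of the first group so that the confidence interval for the
    lift in conversion rate has half-width at most `bound`.

    Assumes both groups convert at roughly p_hat, so each customer's
    outcome has variance p_hat * (1 - p_hat).
    """
    variance = p_hat * (1 - p_hat)
    return z ** 2 * (variance + variance / c) / bound ** 2


# Baseline conversion of about 5%, hypothetical bound of 0.5 percentage points.
n = n_for_rate_error_bound(0.05, 0.005)
print(f"need about {n:.0f} customers in each group")
```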

#### Conclusion

Whew! This post was a lot longer than I expected. "Elementary" is such a misleading word because in some cases it's obvious and in others exceedingly complex. The reason why statistics is often inaccessible is that the derivations and details of "elementary" statistics are sometimes a bit complex (even though the actual procedure is simple). Hopefully this primer will help put both direct marketing and elementary statistics in perspective while giving some intuition on both subjects.

#### References and Further Reading

- Wikipedia: Direct marketing, Central Limit Theorem, Sample Mean, Sample Variance, Law of Large Numbers.
- All of Statistics: A Concise Course in Statistical Inference by Wasserman.
- Mathematical Statistics with Applications by Wackerly, Mendenhall and Scheaffer.
- Data Mining Techniques: For Marketing, Sales, and Customer Relationship Management by Linoff.
- How Not To Run An A/B Test, Evan Miller.

[1] | There's an interesting story in The Emperor of All Maladies about how several large scale trials testing the effectiveness of mammograms (low-dose x-ray imaging to detect breast cancer) were undone by implicit selection biases. In one of the Canadian studies, nurses would write down trial patients' names in a notebook where the first line corresponded to the treatment group, the second to the control group, the third to treatment and so forth. The nurses administering the trial subtly biased the results by feeding patients who they thought were more in need to the treatment group. A compassionate gesture but, statistically, a failed experiment. Without the benefit of a truly randomized trial, they could no longer analyze the effectiveness of the treatment in isolation from confounding variables. |

[2] | The topic of design of randomized control trials can actually be quite complex. In this explanation, we're assuming relatively large population sizes (10s of thousands), which allows us to make a lot of simplifying assumptions. When dealing with small sample sizes, the variation of the samples can be quite large making it much more important to design your experiment properly. For larger population sizes, we usually "hide" behind the central limit theorem and normal approximations. |

[3] | Many marketers are tempted to not take a control group. Their reasoning is something along the lines of "but I'm missing out on sales!", to which you should respond "how do you know that?" It's quite possible the offer could have absolutely no meaningful effect on your customers (e.g. targeting the wrong product to people who don't want it) and possibly a negative effect (e.g. the discount may be too high with not enough people coming in)! Just because you're giving people a discount doesn't mean it always increases sales. Further, even if you have an increase, you don't know how large the increase is, i.e. the effect size. If two different offers boosted sales by 1% vs. 10%, you should know that! This is how you can test and learn to improve your overall business. |

[4] | For example, \(X\) and \(Y\) represent random variables which we can manipulate and analyze before we have actually observed any values for them. That means any analysis we apply to them doesn't depend on the actual numbers we observe. After we observe some samples, we'll have explicit values for them like \(x=1\) and \(y=48\), where we use lower case to distinguish these realizations of the random variables. |

[5] | This section was primarily based on All of Statistics, Chapter 16. It has a great explanation of how randomized control groups work. Check it out if my quick explanation glosses over too many things. |