A Primer on Statistical Inference and Hypothesis Testing
This post is about some fundamental concepts in classical (or frequentist) statistics: inference and hypothesis testing. A while back, I came to the realization that I didn't have a good intuition of these concepts (at least not to my liking) beyond the mechanical nature of applying them. What was missing was how they related to a probabilistic view of the subject. This bothered me since having a good intuition about a subject is probably the most useful (and fun!) part of learning a subject. So this post is a result of my reeducation on these topics. Enjoy!
Statistical Models and Inference
A Couple of Big Ideas
To start from the beginning, there are two big ideas that underlie much of classical statistics. The first big idea is that all data (or observations as statisticians like to say) have a "true" probability distribution [1]. Of course, it is almost never possible to precisely define it because the real world rarely fits so nicely into the distributions we learn in stats class. However, the implications of this idea is that the "true" distribution and its parameters are fixed (i.e. not random) albeit unknown. The randomness comes in when you sample from this "true" distribution from which each datum is randomly drawn.
The second big idea is that statistical inference [2] (or as computer scientists call it "learning" [3]) basically boils down to estimating this distribution directly by computing the distribution or density function [4], or indirectly by estimating derived metrics such as the mean or median of the distribution. A typical question we might ask is:
Give a sample \(X_1, X_2, \ldots, X_n\) drawn from some (unknown) distribution \(F\), how do we estimate \(F\) (or some properties of \(F\))?
Of course there are variations to this question depending on the precise problem such as regression but by and large it comes down to finding things about \(F\) (or its derived properties).
Models, models, models
Now that we have those two big ideas out of the way, let's define a (statistical) model:
A statistical model \(\mathfrak{F}\) is a set of distributions (or densities or regression functions).
The idea here is that we want to define a subset of all possible distributions (or densities or regression functions) that closely approximates the "true" distribution (whether or not \(\mathfrak{F}\) actually contains \(F\) [5]). One of the first tasks in inferential procedures is selecting the correct model. The model is an assumption about your data, picking the wrong one will lead to invalid conclusions.
By far, the most common type of model is a parametric model, which defines \(\mathfrak{F}\) using a finite number of parameters. For example, if we assume that the data comes from a Normal distribution, we would use the parametric model for a Normal distribution:
Here we use the notation \(f(x; \mu, \sigma)\) to denote a density function of \(x\) parameterized by \(\mu\) and \(\sigma\). Similarly, when we have data of the form \((X_i, Y_i)\) and we want to learn regression function \(r(x) = E(YX)\), we could define a model for \(\mathfrak{F}\) to be all functions of \(x\), \(r(x)\), that are straight lines. This gives us a linear regression model.
The other type of model is a nonparametric model. Here the number of parameters is not finite or fixed by the model, instead the model is defined by the input data. In essence, the parameters are determined by the training data (not the model). For example, a histogram can be thought of as a simple nonparametric model that estimates a probability distribution because the data determines the shape of the histogram. Another example would be a knearest neighbour algorithm that can classify a new observation solely based on its knearest neighbours from training data. The surface defined by the classification function is not predefined rather it is determined solely by the training data (and hyper parameter \(k\)). You can contrast this with a logistic regression as a classifier, which has a (relatively) more rigid structure regardless of how well the data matches.
Although, it sounds appealing to let the "data define the model", nonparametric models typically requires a much larger sample size to draw a similar conclusion compared to parametric methods. This makes sense intuitively since parametric methods have the advantage of having the extra model assumptions, so making conclusions should be easier all else being equal. Of course, you must be careful picking the right parametric model or else it will lead you to incorrect conclusions.
Types of Statistical Inference
For the most part, statistical inference problems can be broken into three different types of problems [6]: point estimation, confidence intervals, and hypothesis testing. I'll briefly describe the former two and focus on the latter in the next section.
Point estimates aim to find the single "best guess" for a particular quantity of interest. The quantity could be the parameter of a model, a CDF/PDF, or a regression/prediction function. Formally:
For \(n\) independent and identically distributed (IID) observations, \(X_1, \ldots, X_n\), from some distribution \(F\) with parameter(s) \(\theta\), a point estimator \(\widehat{\theta}_n\) of parameter \(\theta\) is some function of \(X_1, \ldots, X_n\):
\begin{equation*} \widehat{\theta}_n = g(X_1, \ldots, X_n). \tag{2} \end{equation*}
For example, if our desired quantity is the expected value of the "true" distribution \(F\), we might use the sample mean of our data as our "best guess" (or estimate). Similarly, for a regression problem with a linear model, we are finding a "point" estimate for the regression function \(r\), which are just the coefficients for the covariates (or features) that minimize the mean squared error. From what I've seen, many "machine learning" techniques fall in this category where you typically will aim to find a maximum likelihood estimate or related measure that is your "best guess" (or estimate) based on the data.
The next category of inference problems are confidence intervals (or sets). The basic idea here is that instead of finding a single "best guess" for a parameter, we try to find an interval that "traps" the actual value of the parameter (remember the observations have a "true" distribution) with a particular frequency. Let's take a look at the formal definition first and then try to interpret it:
A \(1\alpha\) confidence interval for parameter \(\theta\) is an interval \(C_n(a,b)\) where \(a=a(X_1, \ldots, X_N)\) and \(b=b(X_1, \ldots, X_N)\) are functions such that
\begin{equation*} P(\theta \in C_n) \geq 1  \alpha. \tag{3} \end{equation*}
This basically says that our interval \((a,b)\) "traps" the true value of \(\theta\) with probability \(1  \alpha\) . Now the confusing part is that this does not say anything directly about the probability of \(\theta\) occurring because \(\theta\) is fixed (from the "true" distribution) and instead it is \(C_n\) that is the random variable [7]. So this is more a statement about how "right" we were in picking \(C_n\).
Another way to think about it is this: suppose we set \(\alpha = 0.05\) (a 95% confidence interval), for every confidence interval, for every statistical inference problem we ever compute from now until eternity. These problems will, of course, cover a wide range of statistical problems with many different "true" distributions and sample observations. Since we set a 95% confidence interval for all our problems, we would expect that the respective "true" \(\theta\) in each case to be "trapped" in our confidence interval 95% of the time. That is, our confidence interval will be "correct" 95% of the time in that the "true" value of \(\theta\) is contained within it  a kind of longrun frequency guarantee. Note this is different from saying that on any one experiment we "trapped" \(\theta\) with a 95% probability. After we have a realized confidence interval (i.e. fixed values for our confidence interval based on observed values), the "true" parameter \(\theta\) either lies in it or it doesn't.
In some ways confidence intervals give us more context then a single point estimate. For example, if we're looking at the response of a marketing campaign versus a control group, the difference in response or incremental lift is a key performance indicator. We could just compute the difference in the sample mean of the two populations to get a point estimate for the lift, which might show a positive result say 1%. However, if we computed a 95% confidence interval we might get \((0.015, 0.0155)\) which overlaps with 0, implying that our 1% lift may not be statistically significant.
Conceptually, point estimates and confidence intervals are not that hard to understand. The complexity comes in when you have to actually pick an estimator that has nice properties (like minimizing bias and variance) in the case of Equation 2, or picking an interval such that Equation 3 is satisfied. Thankfully, many smart mathematicians and statisticians have figured out estimators and confidence intervals for many common situations so we rarely need to derive things from scratch. Instead, we can pick the most appropriate technique for the problem at hand.
Hypothesis Testing
A Digression
I'm a huge fan of hypothesis testing as a general concept (not necessarily statistical) because it's such a powerful framework for learning. One of the biggest advantages is it sets you up to "disprove your bestloved ideas" as Charlie Munger puts it, not to mention the hundreds of years its been used as part of the scientific method. There is a huge advantage to having a mental framework that allows you to disprove your hardest won ideas, a proverbial "empty your cup" type situation where you can begin to learn after you have let go of your some of your past (hopefully, incorrect) beliefs. I mean that's what science is all about right?
Statistical Hypothesis Testing
Statistical hypothesis testing is probably one of the earliest concepts learn in a statistics course. Null hypotheses, Student's ttest, pvalues these terms get thrown around a lot without explaining their underlying probabilistic basis [9]. When I first learned statistics it was definitely more biased towards a mechanical view of hypothesis testing, rather than an intuitive understanding. Here's my attempt to explain it a bit more precisely while hopefully adding some colour to give some intuition.
Following the scientific method, we make a hypothesis, run an experiment and see if our observations match the prediction from our hypothesis. However in certain cases, the cause and effect is not so clear like it is with laws of nature. For example, when you conduct a doubleslit experiment to determine the dual nature of light, the result of the experiment is clear. But when you're determining if a new drug helps cure a disease, you usually randomly divide a population into a treatment group which gets the drug, and a control group which receives a placebo. If we look at the various scenarios of what can happen, we can see why it's not so clear cut:

If at least one person in the treatment group doesn't get better, does it mean the drug isn't effective?
Not necessarily, the drug could still be quite effective but for some other random reason, the person could have not responded to the drug by pure chance.

If more people in the treatment group get better than the control, does it mean the drug is effective?
Not necessarily, what if only the treatment group has only 1 person who got better versus control. In this case, probably not, it could be due to another random factor. How about 10? 1000? Now it starts to get unclear.
You can start to see why we need to apply some mathematics to these situations in order to see if the effect is significant. In particular, we apply statistical hypothesis testing when we want to determine if an observed effect is really there or just happening by purely chance (i.e. other random factors).
The high level setup for this procedure is to first come up with a null hypothesis (denoted by \(H_0\)), that usually denotes the "no effect" scenario, or our default position. We then try to see how likely the data is generated in this situation. If it's unlikely then we say we reject the null hypothesis and accept the alternate hypothesis, which just means something other than the null hypothesis must true. Otherwise, we have no evidence to reject the null hypothesis and we continue to believe it to be true (since it's our default position).
A good analogy is that of a legal trial, the defendant is innocent until proven guilty. Likewise, we assume the null hypothesis is true from the start, and only when we reject it do we say it is false. This is not unlike how science works where we have established models that are assumed to be true until later proven otherwise. Now that we have a conceptual understanding of this process, let's look at some details.
Rejection Regions and Types of Errors
A critical point when conducting statistical hypothesis testing is determining your null hypothesis. The first step of this process is picking an appropriate statistical model \(\mathfrak{F}\). If your model is illformed for your problem, the results of hypothesis testing will be invalid. Next, we partition the parameter space of \(\mathfrak{F}\) into two disjoint sets \(\Theta_0\) and \(\Theta_1\), and define our hypotheses as:
where \(H_0\) is our null hypothesis and \(H_1\) is our alternative hypothesis. So we must first pick a good statistical model then define an appropriate null hypothesis. For example, we might pick a normal distribution as our statistical model and our null hypothesis is that the mean of the distribution is less than or equal to zero (\(\mu \leq 0\)) .
Now let's suppose our data are represented by the random variable \(X\) with range \(\chi\) (all possible values of \(X\)). Our goal is to define a rejection region on \(\chi\) such that:
We want to define \(R\) such that when \(H_0\) is true, we have a high probability of retaining \(H_0\) and when \(H_0\) is false, we have a high chance probability of rejecting it. Of course, our definition of \(R\) should be heavily influenced by some estimate of \(\theta\) based on our data \(X\). If we picked a good \(R\), when we actually observe our data then it should be quite simple to check if \(X \in R\) and be correct quite often.
Another way to view this is in terms of the errors we could make. If we reject \(H_0\) when it's actually true, we've committed a Type I Error or false positive (whose probability is denoted by \(\alpha\)). If we retain \(H_0\) when it's actually false, we've committed a Type II Error or false negative (whose probability is denoted by \(\beta\)). Here's a summary:
Cases  Retain Null  Reject Null 

\(H_0\) true  Correct  Type I Error (\(\alpha\)) 
\(H_0\) false  Type II Error (\(\beta\))  Correct 
So it makes sense that we want choose \(R\) to maximize the "Correct" diagonals or alternatively minimize "Error" diagonals in the above table. To throw another wrench in the mix, we usually refer to the bottom right cell as the power, which is the probability of correctly rejecting the null hypothesis when it is false (i.e. the alternative hypothesis is true). The tricky part is that trying to minimize both \(\alpha\) and \(\beta\) results in conflicting goals [10], which makes picking a good rejection region \(R\) highly nontrivial.
Test Statistics
Practically, we rarely explicitly pick a rejection region in terms of the range of the data (\(\chi\)). It's usually much more convenient to pick a rejection region in terms of a function of \(X\) that produces a single number summarizing the data called a test statistic (which we denote as \(T\)). This test statistic usually relates in some way to the estimate of the "true" parameter \(\theta\) and is usually more convenient to use than a direct estimation. Thus, our expression for rejection region usually ends up looking something like this:
The value \(c\) is called the critical value which determines whether or not we retain or reject our null hypothesis. So now the problem of hypothesis testing comes down to picking an appropriate test statistic \(T\) and an appropriate value \(c\) to minimize our error rates (\(\alpha\) and \(\beta\)).
As mentioned above, minimizing \(\alpha\) and \(\beta\) are usually in conflict, so what happens is we fix the level for \(\alpha\) (usually values like \(0.05\) or \(0.01\)), and find an appropriate \(T\) and \(c\) so that \(\beta\) is minimized (alternatively power is maximized). Computing (and proving) that a test statistic has the highest power for a given an \(alpha\) is quite complex so I won't mention much more of it here. Most of the time though you won't have to actually come up with \(T\) yourself since many common situations have already been worked out. The usual procedure usually ends up being something along the lines of:
 Define your null hypothesis (and the appropriate statistical model of your data).
 Pick an appropriate \(\alpha\), e.g. \(0.05\).
 Look up and compute the appropriate test statistic for your hypothesis/model e.g. Z statistic.
 Look up (or compute) the critical value \(c\) based on \(\alpha\) e.g. \(Z > 1.96\).
 Retain/reject the null hypothesis based on the computed test statistic and critical value.
pvalues and such
Of course, just giving a retain/reject null hypothesis type answer isn't very informative. Instead, we might want to give the smallest \(\alpha\) that rejects the null hypothesis which is called a pvalue:
A pvalue is basically a measure of evidence against \(H_0\). The smaller the pvalue, the more evidence we have that \(H_0\) is false. Researchers usually use this scale for pvalues:
pvalue  Evidence 

\(<0.01\)  very strong evidence against \(H_0\) 
\(0.010.05\)  strong evidence against \(H_0\) 
\(0.050.10\)  weak evidence against \(H_0\) 
\(>0.10\)  little or no evidence against \(H_0\) 
Two important misconceptions about pvalues:
 Nowhere in the above table do we say we have evidence for \(H_0\). A pvalue says nothing about evidence in favour of \(H_0\). A large pvalue could mean that \(H_0\) is true, or our test didn't have enough power.
 A pvalue is not the probability that the null hypothesis is true (e.g. \(\text{pvalue} \neq P(H_0  Data)\)).
A common way of stating what a pvalue is (taken from All of Statistics):
The pvalue is the probability (under \(H_0\)) of observing a value of the test statistic the same as or more extreme than what was actually observed.
Admittedly, this does not exactly line up with how we have looked at \(\alpha\) in terms of rejection regions, however, rest assured the definitions do match up if you went through the derivations of the test statistic and critical values. Personally, I don't find the above definition all that helpful because most people will conflate it with \(P(H_0data)\) just because both mention the word "probability".
The way I like to think of it is simply a measure of evidence against \(H_0\) (but not for \(H_0\)) according to the table above with no mention of probability. In this way, we can remember the point of hypothesis testing is primarily a procedure to help us prove our default or null hypothesis false. Thinking this way helps to remember that the null hypothesis is our default stance and the test's aim is prove it false [11].
Conclusion
While writing this post, I had to dig through the probabilistic foundations for these techniques and it can get really deep! I just scratched the surface, enough to satisfy my intellectual curiosity and intuition (for now). Hopefully, this post (and some of the references below) will help you along the way too.
References and Further Reading
 All of Statistics: A Concise Course in Statistical Inference by Larry Wasserman.
 Hypothesis Testing, Paninski, Intro. Math. Stats., December 6, 2005.
 Wikipedia: Statistical Model, Statistical Inference, Non parametric Statistics, Statistical Hypothesis Testing, Statistical Power, Sufficient Statistic, Null Hypothesis.
[1]  Taking note that no model can truly represent reality leading to the aphorism: All models are wrong. 
[2]  Inferential statistics is in contrast to descriptive statistics, which only tries to describe the sample or observations  not estimate a probability distribution. Examples of this are measures of central tendency (like mean or median), or measures of variability (such as standard deviation or min/max values). Note that although the mean of a sample is a descriptive statistic, it is also an estimate for the expected value of a given distribution, thus used in statistical inference. Similarly for the other descriptive statistics. 
[3]  There is a great chart in All of Statistics that shows the difference between statistics and computer science/data mining terminology on page xi of the preface. It's very illuminating to contrast the two especially since terms like estimation, learning, covariates, hypothesis are thrown around very casually in their respective literature. I come more from a computer science/data mining and learned most of my stats afterwards so it's great to see all these terms with their definitions in one place. 
[4]  Might be obvious but let's state it explicitly: distribution refers to the cumulative distribution function (CDF), and density refers to the probability density function (PDF). 
[5]  In fact, most of the time \(\mathfrak{F}\) will not contain \(F\) since as we mentioned above, the "true" distribution is probably much more complex than any model we could come up with. 
[6]  This categorization is given in All of Statistics, Section 6.3: Fundamental Concepts in Inference. I've found it quite a good way to think about statistics from a high level. 
[7]  An important note outlined in All of Statistics about \(\theta\), point estimators and confidence intervals is that \(\theta\) is fixed. Recall, that our data is drawn from a "true" distribution that has (theoretically) exact parameters. So there is a single fixed, albeit unknown, value of \(\theta\). The randomness comes in through our observations. Each observation, \(X_i\), is drawn (randomly) from the "true" distribution so by definition a random variable. This means our point estimators \(\widehat{\theta}_n\) and confidence intervals \(C_n\) are also random variables since they are functions of random variables. This can all be a little confusing, so here's another way to think about it: Say we have a "true" distribution, and we're going to draw \(n\) samples from it. Ahead of time, we don't know what the values of those observations are going to be but we know they will follow the "true" distribution. Thus, the \(n\) samples are \(n\) random variables, each distributed according to the "true" distribution. We can then take those \(n\) variables and combine them into a function (e.g. a point estimator like a mean) to get a estimator. This estimator, before we know the actual values of the \(n\) variables, will also be a random variable. However, what usually happens is that the values of the \(n\) samples are actually observed, so we plugs these realizations into our point estimator (i.e. the function of the \(n\) observations) to get a point estimate  a deterministic value. One reason we make this distinction is so that we can compute properties of our point estimator like bias and variance. So long story short, the point estimator is a random variable where after having realized values of the observations, we can use it to get a single fixed number called a point estimate. 
[8]  Interestingly, it's very difficult to prove something to be true, whereas much easier to prove it false. The reason is that many useful statements we want to prove are universally quantified (think of statements that use the word "all"). An example made famous by Nassim Nicholas Taleb is the "black swan" problem. It's almost impossible to prove the statement "all swans are white" because you'd literally have to check the colour every single swan. However, it's quite easy to prove it false by finding a single counterexample: a single black swan. That's why the scientific method and hypothesis testing is such a good framework. Knowing that it's difficult to prove things universally true, it sets itself up to weed out poor models of reality by allowing a systematic way of finding counterexamples (at least that's one way of looking at it). 
[9]  It's probably fair that when learning elementary hypothesis testing that you don't learn about the probabilistic interpretation. For most students, they will never have to use hypothesis testing beyond rote application of standard tests. However from an understanding perspective, I find this rather unappealing. I at least like to have an intuition about how a method works rather than just a mechanical process thus this blog post. 
[10]  Think about a procedure that always rejects the null hypothesis i.e. a rejection consisting of the entire space. In this case, our \(\alpha = 1\) but \(\beta=0\) because we are always correctly rejecting the null hypothesis when it is false. Similarly if \(\beta = 1\). Of course, this choice of rejection region is absolutely useless so we want to pick something a bit smarter. 
[11]  An important point about hypothesis testing is that it's proving our null hypothesis is false. For example, our null hypothesis might be that the drug had no effect. If we correctly reject it, our test or pvalue says nothing about the absolute effectiveness of the drug; all it says it that it has some effect. It could have minimal or negligible effect but still technically have "statistical significance". We should remember to use the right tool for the right job and not be prone to "man with a hammer"syndrome. In our example here, we should be examining the effect size (the difference in the population means), perhaps with confidence intervals along with using our knowledge of the situation to determine if the results we're seeing are useful and practically significant. 