It used to bother me (not any more) to hear people casually drop the phrase ‘‘sufficient statistics’’ in their talks, usually to describe some ‘‘observations’’ or ‘‘samples’’ that they feel are ‘‘sufficient’’ for making certain decisions. The thing is, the word ‘‘sufficient’’ comes with a very rigorous definition, one that may not even hold in those frameworks. For example, a lot of traffic control algorithms take measured vehicle stream data and learn some embeddings as input; a downstream task could be, say, variable speed limits for certain highway segments, and people might call those embeddings ‘‘sufficient statistics’’. Umm, … truth is I don’t even know where to start with this.
Anyhow, in case you are wondering, let me give you a pedantic-as-hell definition of a sufficient statistic. Given a probability space $(\Omega, \mathcal{F}, \mathbb{P})$, where the parameter space $\Omega$ labels all the possible statistical models $\{P_\theta\}$, we call a function, often denoted $T$, that maps the data $X \sim P_\theta(\cdot) \in \Delta (\mathcal{X})$ to some ‘‘conclusion’’ a statistic. E.g., from the Boston housing prices data we can compute the average price, and that is a statistic; we can also compute the empirical variance, and that’s also a statistic. To be more general, we restrict the data to live in a measurable space $(\mathcal{X}, \mathcal{B})$ and the statistical outcome to live in a measurable space $(\mathcal{T}, \mathcal{C})$.
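Just to make the ‘‘statistic = function of the data’’ picture concrete, here is a tiny Python sketch (with made-up numbers, not the actual Boston housing data):

```python
import numpy as np

# hypothetical housing prices (in $1000s); any array-valued data set works here
prices = np.array([250.0, 310.5, 189.9, 402.0, 275.3])

# two statistics: each one maps the observed data x to a point in R
T_mean = np.mean(prices)           # the average price
T_var = np.var(prices, ddof=1)     # the empirical (sample) variance

print(T_mean, T_var)
```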
Definition 1.
Let $(\mathcal{T} , \mathcal{C})$ be a measurable space such that the $\sigma$-field $\mathcal{C}$ contains all singletons. A measurable mapping $T : \mathcal{X} \to \mathcal{T}$ is called a statistic.
Usually we can think of $\mathcal{X}$ / $\mathcal{T}$ as subspaces of $\mathbb{R}^d$ and $\mathcal{B}$ / $\mathcal{C}$ as their Borel $\sigma$-algebras. Suppose the distribution $P_\Theta$ has a density $f_\Theta$ w.r.t. a measure $\nu$, and so does the distribution of $T = T(X)$. The idea is that $T$ should capture everything about the $\Theta$-data-generating process: with $t = T(x)$, the conditional probability $$ f_{X|T, \Theta} (x|t, \theta) = \frac{f_{X, T | \Theta} (x, t | \theta)}{f_{T| \Theta} ( t|\theta)} = \frac{f_{X | \Theta} (x | \theta)}{f_{T| \Theta} ( t|\theta)} $$ remains the same for all $\Theta = \theta \in \Omega$. In plain words, no matter which statistical model $P_\Theta$ generated the data, knowing the likelihood of the data ($f_{X | \Theta} (x | \theta)$) is equivalent to knowing the likelihood of the calculated statistic ($f_{T| \Theta} ( t|\theta)$); beyond that we don’t even care what the data looks like, since the statistic $t$ is sufficient. Simple, right? To claim that some quantities are sufficient statistics, one does not need to write out a rigorous definition like the one below, but one should at least discuss the ratio of conditional probabilities above, because in some cases it is simply not true.
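To see the ratio behave as advertised, here is a toy sketch (an example of my own choosing, not anything canonical): take $X = (X_1, \dots, X_n)$ i.i.d. Bernoulli($\theta$) and $T(x) = \sum_i x_i$. The ratio $f_{X\mid\Theta}(x\mid\theta) / f_{T\mid\Theta}(T(x)\mid\theta)$ comes out as $1/\binom{n}{T(x)}$, the same value for every $\theta$, which is exactly the sufficiency condition above.

```python
from math import comb

def ratio(x, theta):
    """f_{X|Theta}(x|theta) / f_{T|Theta}(T(x)|theta) for i.i.d. Bernoulli data."""
    n, t = len(x), sum(x)
    f_x_given_theta = theta**t * (1 - theta)**(n - t)               # likelihood of the full sample
    f_t_given_theta = comb(n, t) * theta**t * (1 - theta)**(n - t)  # Binomial(n, theta) pmf at t
    return f_x_given_theta / f_t_given_theta

x = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]    # one arbitrary sample path, T(x) = 6
for theta in (0.1, 0.3, 0.5, 0.9):
    print(theta, ratio(x, theta))     # always 1 / comb(10, 6), independent of theta
```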
Definition 2.
Suppose there exist versions of conditional distributions $\mu_{X|\Theta,T} (\cdot \mid \theta, t)$ and a function $r : \mathcal{B} \times \mathcal{T} \to [0, 1]$ such that
- $r(\cdot, t)$ is a probability on $\mathcal{B}$ for each $t \in \mathcal{T} $,
- $r(B, \cdot)$ is measurable for each $B \in \mathcal{B}$, and
- for each $\theta \in \Omega$ and $B \in \mathcal{B}$, $\mu_{X|\Theta,T} (B \mid \theta, t) = r(B, t)$, for $\mu_{T \mid \Theta}(\cdot \mid \theta)$-a.e. $t$.
Then $T$ is called a sufficient statistic for $\Theta$ (in the classical sense).
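Definition 2 is really a statement about the conditional law of the data given the statistic, so a crude Monte Carlo check is possible. The sketch below (again my own toy example, reusing the Bernoulli setup from before) estimates $r(\cdot \mid t)$ for two different values of $\theta$ and shows that, conditional on $T(X) = t$, the data is (approximately) uniformly distributed over the sequences with $t$ ones, regardless of $\theta$:

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)

def conditional_law(theta, n=4, t=2, n_sim=200_000):
    """Monte Carlo estimate of P(X = x | T(X) = t) under i.i.d. Bernoulli(theta)."""
    draws = rng.binomial(1, theta, size=(n_sim, n))
    kept = draws[draws.sum(axis=1) == t]            # condition on the statistic
    counts = Counter(tuple(map(int, row)) for row in kept)
    return {seq: c / len(kept) for seq, c in sorted(counts.items())}

# the estimated conditional law r(. | t) barely changes with theta:
print(conditional_law(0.2))
print(conditional_law(0.7))
```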
Notice that we haven’t really discussed whether our setting is Frequentist or Bayesian yet, in the sense that we haven’t said whether there is a prior measure on $\Omega$. Still, this definition is considered the Frequentist version by default.
Now let’s look at the Bayesian setting, where we have a prior $\mu_\Theta(\cdot) \in \Delta(\Omega)$.
Definition 3.
A statistic $T$ is called a sufficient statistic for the parameter $\Theta$ (in the Bayesian sense) if, for every prior $\mu_\Theta$, there exist versions of the posterior distributions $\mu_{\Theta|X}$ and $\mu_{\Theta|T}$ such that, for every $A \in \mathcal{F}$, we have $$ \mu_{\Theta|X} (A | x) = \mu_{\Theta|T} (A|T(x)) \quad\quad \mu_X\text{-a.s.} $$ where $\mu_X$ is the marginal distribution of $X$.
When there are densities, the equality looks like this: $$ \begin{aligned} \mu_{\Theta \mid X}(A \mid x) & =\int_{A} f_{\Theta \mid X}(\theta \mid x) \mu_{\Theta}(d \theta) , \\ \mu_{\Theta \mid T}(A \mid t) & =\int_{A} f_{\Theta \mid T}(\theta \mid t) \mu_{\Theta}(d \theta) . \end{aligned} $$ Therefore, for any $x$ in the support of $\mu_X$, it must hold that $ f_{\Theta \mid X}(\theta \mid x) = f_{\Theta \mid T}(\theta \mid t) $ for $\mu_\Theta$-a.e. $\theta$, which collapses to the condition in the Frequentist setting.
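A quick numerical sanity check of the Bayesian definition (a sketch with a conjugate pair of my choosing): put a Beta($a, b$) prior on $\Theta$, observe i.i.d. Bernoulli data, and compare the posterior computed from the full likelihood with the posterior computed from the Binomial likelihood of $T(x) = \sum_i x_i$ alone. The two grids of posterior values coincide:

```python
import numpy as np
from scipy.stats import beta, binom

a, b = 2.0, 3.0                          # Beta prior hyperparameters (arbitrary choice)
x = np.array([1, 0, 1, 1, 0, 1, 0, 1])   # observed Bernoulli sample
n, t = len(x), x.sum()

thetas = np.linspace(1e-4, 1 - 1e-4, 1001)
prior = beta.pdf(thetas, a, b)

# posterior from the full data: prior * product of Bernoulli likelihoods
post_full = prior * thetas**t * (1 - thetas)**(n - t)
post_full /= post_full.sum()

# posterior from the statistic alone: prior * Binomial(n, theta) pmf at t
post_T = prior * binom.pmf(t, n, thetas)
post_T /= post_T.sum()

print(np.max(np.abs(post_full - post_T)))   # ~0: the two posteriors agree
```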
Now, suppose someone gives you a parameterized family of statistical models $\{P_\theta : \theta \in \Omega\}$, all with densities w.r.t. some measure $\nu$. How do we find a sufficient statistic? Or, how do we check whether a given statistic is sufficient? The following theorem helps.
Factorization Theorem
$T$ is a sufficient statistic if and only if there exist functions $h$ and $g$ such that $$ f_{X|\Theta} (x| \theta ) = h(x) \, g(\theta , T(x)). $$
Proof
Sufficiency: $$ \begin{aligned} \frac{d \mu_{\Theta \mid X}}{d \mu_{\Theta}}(\theta \mid x) & =\frac{f_{X \mid \Theta}(x \mid \theta)}{\int_{\Omega} f_{X \mid \Theta}(x \mid \theta') \mu_{\Theta}(d \theta')} \\ & =\frac{h(x) g(\theta, T(x))}{\int_{\Omega} h(x) g(\theta', T(x)) \mu_{\Theta}(d \theta')} \\ & =\frac{g(\theta, T(x))}{\int_{\Omega} g(\theta', T(x)) \mu_{\Theta}(d \theta')}\end{aligned} $$ hence the posterior depends on $x$ only through $T(x)$;
Necessity: writing the posterior density w.r.t. the prior (so that $f_{X|\Theta}(x|\theta) = f_{\Theta|X}(\theta|x) f_X(x)$) and using sufficiency in the last step, $$ f_{X | \Theta}(x | \theta)= f_{\Theta \mid X}(\theta \mid x) f_{X}(x)= \underbrace{f_{X}(x)}_{h(x)} \underbrace{f_{\Theta \mid T}(\theta \mid T(x))}_{g(\theta, T(x))} . $$ $\square$
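To see the factorization in action, here is a sketch with a model of my own choosing: for $X_1, \dots, X_n$ i.i.d. $N(\theta, 1)$, one admissible split is $h(x) = (2\pi)^{-n/2} e^{-\frac{1}{2}\sum_i x_i^2}$ and $g(\theta, T(x)) = e^{\theta T(x) - n\theta^2/2}$ with $T(x) = \sum_i x_i$, and the product reproduces the joint density for every $\theta$:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
n = 6
x = rng.normal(1.5, 1.0, size=n)           # a sample from N(1.5, 1)

def joint_density(x, theta):
    return np.prod(norm.pdf(x, loc=theta, scale=1.0))

def h(x):                                  # depends on the data only
    return (2 * np.pi) ** (-len(x) / 2) * np.exp(-0.5 * np.sum(x**2))

def g(theta, T):                           # depends on theta and T(x) only
    return np.exp(theta * T - n * theta**2 / 2)

for theta in (0.0, 1.5, -2.0):
    print(theta, joint_density(x, theta), h(x) * g(theta, np.sum(x)))  # columns agree
```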
A sufficient statistic always exists, trivially, since $T(x) = x$ works; the interesting case is when $T$ actually compresses the data, and exponential families are the canonical example: if we put $T(x) = (t_i(x))_{i=1}^d$, the density already comes factored, $$ f_{X \mid \Theta}(x \mid \theta) = \underbrace{h(x)}_{h(x)} \underbrace{c(\theta) \exp \bigg( \sum_{i=1}^{d} \theta_{i} t_{i}(x) \bigg)}_{g(\theta, T(x))}, $$ so $T$ is sufficient by the factorization theorem.
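For instance (another sketch, with the Poisson family as my example): a Poisson($\lambda$) pmf fits this template with natural parameter $\theta = \log\lambda$, $t(x) = x$, $h(x) = 1/x!$, and $c(\theta) = e^{-e^{\theta}}$, so $T(x) = x$ (or the sum of the $x_i$'s over a sample) is a sufficient statistic for $\lambda$:

```python
from math import factorial, exp, log
from scipy.stats import poisson

lam = 3.7
theta = log(lam)                    # natural parameter

def expfam_pmf(x, theta):
    h = 1.0 / factorial(x)          # h(x)
    c = exp(-exp(theta))            # c(theta) = exp(-lambda)
    return h * c * exp(theta * x)   # h(x) * c(theta) * exp(theta * t(x)), with t(x) = x

for x in range(6):
    print(x, poisson.pmf(x, lam), expfam_pmf(x, theta))   # the two pmfs agree
```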
To summarize, a statistic is a function; when we say it is sufficient, we have to tell people for which (statistical) model and which parameter it is sufficient, even though sometimes this is intuitive and does not need all the fuss. (Statistics is not pure math anyway.)