
Statistical Inference

10 Comprehensive Sections, Questions and Solutions

Fundamental Concepts of Statistical Inference

Statistical inference draws conclusions about a population from a random sample. This module covers:

  • Random samples, statistics, and sampling distributions
  • Parametric and nonparametric models
  • Point estimation (MLE, Fisher Information)
  • Confidence intervals and hypothesis tests
  • Simulation and the bootstrap
  • Simple linear regression

We first review the probability distributions that form the backbone of classical inference (normal, \(t\), \(\chi^2\), and \(F\)), explaining why they arise and how degrees of freedom are determined.

1. Core Distributions for Inference (under Normality)

Statistical inference using the distributions below is generally valid only if one of two conditions is met:

  1. Exact Normality: The underlying population is normally distributed.
  2. Large Samples: The sample size is large enough (typically \(n \ge 30\)) that the Central Limit Theorem (CLT) allows us to approximate the sampling distribution as normal, regardless of the population’s true shape.

The Normal Distribution (\(N(\mu, \sigma^2)\))

The foundation of “Normal Theory” inference and the primary target of the Central Limit Theorem.

  • The Sampling Distribution of \(\bar{X}\): The sample mean satisfies \(\bar{X} \sim N(\mu, \sigma^2/n)\). This relationship holds under two distinct scenarios:
    • Case A (Exact): If the underlying population is Normal, \(\bar{X}\) is perfectly normal for any sample size \(n\).
    • Case B (Asymptotic): If the underlying population is not Normal, the CLT ensures \(\bar{X}\) is approximately normal, provided the sample size is large (typically \(n \ge 30\)).
  • The \(Z\)-Test: When the population variance (\(\sigma^2\)) is known, we “standardize” the mean to the Standard Normal distribution (\(Z \sim N(0,1)\)). We then use \(z_\alpha\) percentiles to determine rejection regions for hypothesis tests and to calculate large-sample confidence intervals.

The Chi-Square Distribution (\(\chi^2_\nu\))

The distribution of “squared errors,” essential for inference about variances.

  • Connection with Normal distribution: If \(Z_1,\dots,Z_\nu \stackrel{\text{i.i.d.}}{\sim} N(0,1)\), then the sum of their squares follows this distribution: \[\sum_{i=1}^{\nu} Z_i^2 \sim \chi^2_\nu\]
  • Connection with variance: For a sample of size \(n\), we use \(\nu = n-1\) degrees of freedom because calculating \(\bar{X}\) removes one dimension of variation. Thus, \(\frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}\).
  • Properties: \(\mathbb{E}[\chi^2_\nu] = \nu\) and \(\operatorname{Var}(\chi^2_\nu) = 2\nu\). As \(\nu \to \infty\), it converges to a Normal distribution.

Student’s \(t\) Distribution (\(t_\nu\))

The “small sample” alternative used when the population standard deviation (\(\sigma\)) is unknown and must be estimated by \(s\).

  • Definition: Formed by the ratio of a standard Normal variable to the square root of an independent Chi-Square variable divided by its degrees of freedom: \[T = \frac{Z}{\sqrt{V/\nu}} \sim t_\nu\] where \(Z \sim N(0,1)\) and \(V \sim \chi^2_\nu\) are independent.
  • A reality check: Because \(s\) is an estimate, the distribution “stretches” to account for extra uncertainty, resulting in heavier tails than the Normal curve.
  • Convergence: As sample size grows (\(\nu \to \infty\)), the \(t\) distribution becomes identical to the Standard Normal (\(Z\)).
  • Variance: \(\operatorname{Var}(t_\nu) = \frac{\nu}{\nu-2}\) (for \(\nu > 2\)).

Snedecor’s \(F\) Distribution (\(F_{\nu_1, \nu_2}\))

The tool for comparing ratios of variances, primarily used in ANOVA and regression significance.

  • Definition: The ratio of two independent Chi-Square variables, each scaled by their respective degrees of freedom: \[F = \frac{U/\nu_1}{V/\nu_2} \sim F_{\nu_1, \nu_2}\]
  • Connection to \(t\): If \(T \sim t_\nu\), then \(T^2 \sim F_{1, \nu}\). A two-sided \(t\)-test is equivalent to an \(F\)-test for a single coefficient.
  • Reciprocal Property: If \(W \sim F_{m,n}\), then \(1/W \sim F_{n,m}\). This allows the use of upper-percentile tables to find lower-percentile values.

Practical Application: Percentiles and Rejection Regions

For any continuous distribution, the upper \(\alpha\) percentile (\(q_\alpha\)) is the value such that \(P(X > q_\alpha) = \alpha\). These percentiles are the critical values used to construct confidence intervals and define rejection regions:

| Distribution | Notation | Typical Use Case |
|---|---|---|
| Normal | \(z_\alpha\) | \(Z\)-tests, large-sample proportions |
| Student’s \(t\) | \(t_{\alpha, \nu}\) | One-sample or two-sample means (unknown \(\sigma\)) |
| Chi-Square | \(\chi^2_{\alpha, \nu}\) | Tests for a single variance, goodness-of-fit |
| \(F\) | \(f_{\alpha, \nu_1, \nu_2}\) | Comparing two variances, ANOVA, regression |

Note on Two-Sided Tests: We typically split the significance level \(\alpha\) equally into both tails, using \(\alpha/2\) and \(1-\alpha/2\) percentiles to determine the thresholds for statistical significance.
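As a quick sanity check, these critical values can be computed numerically. A minimal sketch using scipy.stats (the \(\alpha\) level and degrees of freedom are illustrative choices, not values from this module):

```python
from scipy import stats

alpha = 0.05

# The upper-alpha percentile q_alpha satisfies P(X > q_alpha) = alpha,
# i.e. it is the (1 - alpha) quantile, obtained via the inverse CDF (ppf).
z_crit = stats.norm.ppf(1 - alpha)               # z_{0.05} ~ 1.645
t_crit = stats.t.ppf(1 - alpha, df=12)           # t_{0.05, 12}
chi2_crit = stats.chi2.ppf(1 - alpha, df=12)     # chi^2_{0.05, 12}
f_crit = stats.f.ppf(1 - alpha, dfn=4, dfd=20)   # f_{0.05, 4, 20}

# Two-sided tests split alpha across both tails:
z_two_sided = stats.norm.ppf(1 - alpha / 2)      # z_{0.025} ~ 1.96
print(z_crit, t_crit, chi2_crit, f_crit, z_two_sided)
```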

2. Random Samples and Statistics

2.1 What is a statistic

A random sample \(X_1, X_2, \dots, X_n\) consists of independent, identically distributed (i.i.d.) random variables from a population distribution, denoted \(\mathbb{F}\) (here \(\mathbb{F}\) is just a letter for the population’s distribution; it has nothing to do with the \(F\)-test or \(F\)-distribution).

  • Statistic: Any function of the sample data, \(T = t(X_1, \dots, X_n)\). Common examples include the sample mean (\(\bar{X}\)) and sample variance (\(S^2\)).
  • Sampling Distribution: Because the data are random, the statistic itself is a random variable. The probability distribution of a statistic is known as its sampling distribution.

2.2 Sampling Distributions

The sampling distribution of a statistic describes how the statistic varies from sample to sample. It is key to quantifying uncertainty. For many statistics the exact sampling distribution is intractable, but for normal populations several exact distributions exist, and for large samples the Central Limit Theorem (CLT) provides universal approximations.

2.2.1 Exact Distributions (The Normal Case)

If the population is strictly normal, \(X_i \sim N(\mu, \sigma^2)\), we get precise mathematical results:

  • Sample Mean: \(\bar{X} \sim N(\mu, \sigma^2/n)\). Note that the variance shrinks as \(n\) increases.
  • Sample Variance: \(\frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}\).
  • Independence: Crucially, for normal populations, \(\bar{X}\) and \(S^2\) are independent. This independence is what allows us to construct the \(t\)-statistic (\(T = Z / \sqrt{V/\nu}\), a standard normal over the root of an independent scaled Chi-Square).

Why \(n-1\) degrees of freedom? To calculate \(S^2\), we must first calculate \(\bar{X}\). This imposes a constraint: the deviations \((X_i - \bar{X})\) must sum to zero. This “consumes” one piece of independent information, leaving us with \(n-1\) degrees of freedom.

These facts, together with the definitions of the \(t\) and \(F\) families, generate many widely used exact tests and intervals.

2.2.2 Asymptotic Distributions (The Large Sample Case)

When the population is not normal, we rely on limit theorems as \(n \to \infty\).

The Central Limit Theorem (CLT)

Regardless of the population shape \(\mathbb{F}\) (as long as it has a finite variance), the sample mean converges to a normal distribution: \[\sqrt{n} \left( \frac{\bar{X} - \mu}{\sigma} \right) \xrightarrow{d} N(0,1)\]

The Delta Method

What if we aren’t interested in \(\bar{X}\), but a function of it, like \(\ln(\bar{X})\) or \(1/\bar{X}\)? The Delta Method allows us to approximate the distribution of a differentiable function \(g(\bar{X})\): \[\sqrt{n}\bigl(g(\bar{X}) - g(\mu)\bigr) \xrightarrow{d} N\bigl(0, [g'(\mu)]^2 \sigma^2 \bigr)\]

These approximations justify many large‑sample confidence intervals and tests, including those for quantiles, odds, and variances.
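A small simulation makes the Delta Method concrete. The sketch below is ours for illustration: an Exponential population with mean \(\mu\) (so \(\sigma = \mu\)) and \(g(x) = \ln x\), comparing the empirical spread of \(g(\bar{X})\) with the predicted \(|g'(\mu)|\,\sigma/\sqrt{n}\):

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 200, 20_000
mu = 2.0                        # Exponential(mean=mu) has sd = mu

# Simulate many sample means, then apply g(x) = ln(x)
xbars = rng.exponential(scale=mu, size=(reps, n)).mean(axis=1)
g_vals = np.log(xbars)

# Delta Method: SD(g(Xbar)) ~ |g'(mu)| * sigma / sqrt(n), with g'(x) = 1/x
predicted_sd = (1 / mu) * mu / np.sqrt(n)    # = 1/sqrt(n) here

print(f"empirical SD:    {g_vals.std(ddof=1):.4f}")
print(f"delta-method SD: {predicted_sd:.4f}")
```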

3. Parametric and Nonparametric Models

  • Parametric Models: These assume the population distribution belongs to a specific family defined by a finite set of parameters (e.g., \(N(\mu, \sigma^2)\)). Statistical inference focuses on estimating these parameters to describe the entire distribution.
  • Nonparametric Models: These make no assumptions about the functional form of the distribution; the distribution \(\mathbb{F}\) itself is the unknown quantity. The Empirical CDF, \(\mathbb{F}_n(x) = \frac{1}{n}\sum I(X_i \le x)\), serves as the fundamental estimator, while techniques like the bootstrap offer robust ways to assess uncertainty without parametric constraints.
  • The Bootstrap: A resampling technique used to estimate the variability of a statistic. By repeatedly sampling from the observed data with replacement, it constructs an empirical sampling distribution. This allows for the calculation of confidence intervals and standard errors without assuming a specific parametric form for the underlying population.

4. Point Estimation

4.1 Overview

Point Estimation provides a single “best guess” for a parameter \(\theta\). Common methodologies include:

  • Method of Moments (MoM): Estimates are found by equating sample moments to their corresponding population moments and solving the resulting system of equations for \(\theta\).
  • Maximum Likelihood Estimation (MLE): Estimates \(\hat{\theta}\) by finding the value that maximizes the likelihood function \(L(\theta \mid \mathbf{x})\). MLEs are favored for being asymptotically efficient and asymptotically normally distributed.
  • Plug-in Principle (Empirical CDF): For any parameter \(\theta\) that is a summary characteristic of the distribution (such as the mean, variance, or quantiles), we estimate it by calculating that same characteristic directly from the observed data (\(\mathbb{F}_n\)).
    • Why the “Plug-in” name? You are taking the theoretical formula for the population and “plugging in” your sample data where the population distribution used to be.
    • Example: Since the population mean is the expected value of the true distribution \(\mathbb{F}\), the plug-in estimate is simply the mean of the sample data.

4.2 Desirable Properties of an estimator

  • Unbiasedness: \(\mathbb{E}[\hat{\theta}] = \theta\) (no systematic error).
  • Consistency: \(\hat{\theta} \xrightarrow{P} \theta\) (converges to the truth as \(n \to \infty\)).
  • Efficiency: Achieving the smallest possible variance among a specific class of estimators—most commonly among all unbiased estimators (known as the MVUE).

It is worth noting that sometimes a biased estimator is actually “better” because it has a much smaller variance than any unbiased one. This is why we sometimes look at Mean Squared Error (MSE), which combines both: \[\text{MSE} = \text{Bias}^2 + \text{Variance}\]
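The trade-off is easy to see numerically. A minimal simulation (our own illustrative setup: normal data, comparing the unbiased \(S^2\) with divisor \(n-1\) against the biased MLE with divisor \(n\)):

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps, sigma2 = 10, 100_000, 4.0

x = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
s2_unbiased = x.var(axis=1, ddof=1)   # divisor n-1: unbiased
s2_mle = x.var(axis=1, ddof=0)        # divisor n: biased low, smaller variance

for name, est in [("unbiased (n-1)", s2_unbiased), ("MLE (n)", s2_mle)]:
    bias = est.mean() - sigma2
    mse = np.mean((est - sigma2) ** 2)      # = Bias^2 + Variance
    print(f"{name:15s} bias = {bias:+.3f}, MSE = {mse:.3f}")
```

For normal data the biased MLE typically shows the smaller MSE, exactly the phenomenon described above.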

4.3 MLE, Fisher Information, and Asymptotic Inference

This section serves as a refresher for constructing MLE and asymptotic confidence intervals.

Step 1: Constructing the Likelihood

The process begins by translating the experimental data into a mathematical function of the parameter.

  1. The Data: We start with a random sample \(X_1, X_2, \dots, X_n\) that are independent and identically distributed (i.i.d.) from a population with density \(f(x \mid \theta)\).
  2. The Likelihood (\(L\)): Under the i.i.d. assumption, the joint probability of the data is the product of individual densities: \[L(\theta) = \prod_{i=1}^n f(x_i \mid \theta)\] Important: We perform our initial “Regularity Pre-Check” on \(L(\theta)\) before moving to the logarithm.

Step 2: Verification of Regularity Conditions (The Pre-Check)

Before proceeding to calculate Fisher Information or Wald Intervals, we must verify that \(L(\theta)\) is “well-behaved.” Note: Always check these on \(L(\theta)\), not \(\ell(\theta)\).

A. Positivity and Differentiability

  • Positivity (\(L > 0\)): The likelihood must be strictly positive within the parameter space. If \(L=0\) in certain regions, it creates “holes” or discontinuities that break asymptotic theory.
  • Differentiability: \(L(\theta)\) must be smooth and differentiable (at least twice) with respect to \(\theta\). If \(L(\theta)\) has “kinks” or sharp points (like the Laplace/Double Exponential distribution), the standard Taylor expansion used for Fisher Information is invalid.

B. Boundary Behavior (The Limit Check)

To ensure the Maximum Likelihood Estimator (MLE) is an interior maximum (and thus approximately Normal), we check the limits at the boundaries of the parameter space \((a, b)\): \[\lim_{\theta \to a^+} L(\theta) = 0 \quad \text{and} \quad \lim_{\theta \to b^-} L(\theta) = 0\]

  • Why? If the likelihood is highest at the very edge of the possible values (like in a Uniform or truncated distribution), the derivative \(\ell'(\theta)\) may not be zero at the maximum, and the resulting estimator will not have a symmetric “Bell Curve” distribution.

C. Fixed Support & The Interchange Property

  • Fixed Support: The range of \(X\) cannot depend on \(\theta\). (e.g., \(X \sim \text{Uniform}(0, \theta)\) is a “Regularity violation”).
  • Interchange Property: The structure of \(L\) must allow the swapping of integration and differentiation. This ensures that the Expected Score is zero: \(E[\ell'(\theta)] = 0\).

Step 3: Optimization and Finding the MLE

Once the likelihood \(L(\theta)\) passes the regularity pre-check, we transition to the Log-Likelihood \(\ell(\theta)\) to find the estimator.

  1. Define the Log-Likelihood: \(\ell(\theta) = \ln L(\theta) = \sum_{i=1}^{n} \ln f(x_i \mid \theta)\). We shift to the log-scale because it transforms the product of i.i.d. densities into a sum, simplifying differentiation.
  2. Solve the Score Equation: Find the candidate MLE (\(\hat{\theta}\)) by setting the first derivative (the Score) to zero: \[\ell'(\theta) = \frac{d}{d\theta} \ell(\theta) = 0\]
  3. Verify the Maximum (Second Derivative Test): Confirm that \(\hat{\theta}\) is a maximum rather than a minimum or inflection point by ensuring the curvature is concave down: \[\ell''(\hat{\theta}) < 0\]
  4. Apply Invariance (Optional): If the goal is to estimate a transformation \(\gamma = g(\theta)\), the MLE is simply \(\hat{\gamma} = g(\hat{\theta})\).

Step 4: Calculate Fisher Information (\(I_n\))

Fisher Information quantifies the “sharpness” of the log-likelihood peak. A sharper peak implies less uncertainty about the parameter’s true value.

  • Expected Information (Theoretical): The average curvature across all possible samples. \[I_n(\theta) = -E[\ell''(\theta)]\]
  • Observed Information: The realized curvature of this sample’s log-likelihood, \(J(\theta) = -\ell''(\theta)\), often used in place of \(I_n\) when the expectation is hard to compute.

Step 5: Asymptotic Distribution & Standard Error

Under the regularity conditions, the MLE is Asymptotically Normal. As \(n\) increases, \(\hat{\theta}\) converges in distribution: \[\hat{\theta} \xrightarrow{d} N\left(\theta, \frac{1}{I_n(\theta)}\right)\]

To perform inference, we calculate the Standard Error (SE) by taking the square root of the estimated variance (plugging in our MLE): \[\text{SE}(\hat{\theta}) \approx \sqrt{\frac{1}{I_n(\hat{\theta})}} \quad \text{or} \quad \sqrt{\frac{1}{J(\hat{\theta})}}\]

Step 6: Construct the asymptotic confidence interval

The Wald interval is the standard asymptotic confidence interval for large samples (\(n \ge 30\)): \[\hat{\theta} \pm z_{\alpha/2} \cdot \text{SE}(\hat{\theta})\]

⚠️ Usage Note: This interval assumes the likelihood is symmetric. For small samples or parameters near physical boundaries (e.g., \(p\) near 0), the Wald interval may produce “impossible” values or have poor coverage.

Step 7: The Delta Method (Transformed Parameters)

If you require inference for a smooth transformation \(\gamma = g(\theta)\), the Delta Method provides the Asymptotic Variance (AVar): \[\operatorname{AVar}(g(\hat{\theta})) \approx [g'(\theta)]^2 \cdot \frac{1}{I_n(\theta)}\]

Plug-in Standard Error for \(g(\hat{\theta})\): \[\text{SE}(g(\hat{\theta})) \approx |g'(\hat{\theta})| \cdot \sqrt{\frac{1}{I_n(\hat{\theta})}}\]
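As a worked illustration of Steps 1–7, take the Exponential model \(f(x \mid \theta) = \theta e^{-\theta x}\) (our own choice, with simulated data). The score equation gives \(\hat{\theta} = 1/\bar{X}\), the Fisher Information is \(I_n(\theta) = n/\theta^2\), and the Delta Method handles the transformed parameter \(g(\theta) = 1/\theta\) (the mean):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.exponential(scale=1 / 1.5, size=80)   # simulated data, true theta = 1.5
n = len(x)

# Step 3: solve the score equation l'(theta) = n/theta - sum(x) = 0
theta_hat = 1 / x.mean()

# Step 4: Fisher Information I_n(theta) = n / theta^2  (since l'' = -n/theta^2)
I_n = n / theta_hat**2

# Steps 5-6: standard error and the 95% Wald interval
se = np.sqrt(1 / I_n)                         # = theta_hat / sqrt(n)
z = stats.norm.ppf(0.975)
print(f"theta_hat = {theta_hat:.3f}, "
      f"Wald CI = ({theta_hat - z * se:.3f}, {theta_hat + z * se:.3f})")

# Step 7: Delta Method for the mean, g(theta) = 1/theta, g'(theta) = -1/theta^2
mean_hat = 1 / theta_hat
se_mean = abs(-1 / theta_hat**2) * se         # = xbar / sqrt(n)
print(f"mean_hat  = {mean_hat:.3f}, "
      f"Wald CI = ({mean_hat - z * se_mean:.3f}, {mean_hat + z * se_mean:.3f})")
```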

5. Confidence Intervals

A \(100(1-\alpha)\%\) Confidence Interval (CI) is a random interval \([L, U]\) constructed such that, in repeated sampling, \((1-\alpha)\) of such intervals will contain the true parameter \(\theta\).

Important Distinction: The probability \(1-\alpha\) describes the reliability of the method, not the specific result. Once data is observed, we say we are “\(100(1-\alpha)\) % confident” the interval contains \(\theta\).

5.1 Construction Strategies

  • Exact Pivot: Uses a “pivotal quantity”—a function of the data and \(\theta\) whose distribution is known and does not depend on \(\theta\).
    • Example: For a normal mean with unknown variance, \(T = \frac{\bar{X}-\mu}{S/\sqrt{n}} \sim t_{n-1}\) leads to the interval \(\bar{X} \pm t_{\alpha/2, n-1} \frac{S}{\sqrt{n}}\).
  • Asymptotic Interval: Relies on the Central Limit Theorem (CLT) for large \(n\). Under regularity conditions, estimators like the MLE are approximately normal, allowing for intervals like \(\hat{\theta} \pm z_{\alpha/2} \cdot \text{SE}(\hat{\theta})\).
  • Bootstrap Intervals: Avoids distributional assumptions by using computer-intensive resampling to approximate the sampling distribution of the estimator directly.

5.2 Key Concept: Duality

There is a fundamental duality with hypothesis tests: A \(100(1-\alpha)\%\) CI consists of all values \(\theta_0\) for which the null hypothesis \(H_0: \theta = \theta_0\) would not be rejected at significance level \(\alpha\).

5.3 An illustration on constructing an Exact CI

1. Identify the Pivotal Quantity

A pivotal quantity \(Q\) is a function of the data and the unknown parameter \(\theta\) such that its distribution is known and independent of \(\theta\).

For a sample \(X_1, \dots, X_n \sim N(\mu, \sigma^2)\), the pivot for the variance is: \[Q = \frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}\] Notice that while \(Q\) contains \(\sigma^2\), the resulting distribution (\(\chi^2\)) does not depend on the value of \(\sigma^2\) at all.

The reason this works is that the Chi-Square distribution is naturally defined as a ratio that includes \(\sigma^2\). Because we know exactly how that ratio behaves, we start with the ratio, find the confidence interval for \(\chi^2_{n-1}\) and then isolate \(\sigma^2\) algebraically to find the exact confidence interval for \(\sigma^2\).

In summary:

  1. The Ratio: We know that \((n-1)S^2\) divided by \(\sigma^2\) follows a Chi-Square distribution.
  2. The Known Shape: We know the boundaries (percentiles) of that Chi-Square shape.
  3. The Swap: Since \(\sigma^2\) is part of that known ratio, we can “un-ratio” the formula to trap \(\sigma^2\) between those boundaries.

2. Set the Probability Statement

We find two “critical values” from the known distribution that capture \((1-\alpha)\) of the probability. For the Chi-Square distribution (which is asymmetric), we place \(\alpha/2\) in each tail.

\[P\left( \chi^2_{1-\alpha/2, \, n-1} \le \frac{(n-1)S^2}{\sigma^2} \le \chi^2_{\alpha/2, \, n-1} \right) = 1 - \alpha\]

Notation Note: \(\chi^2_{\alpha/2}\) usually refers to the upper tail area. Because the Chi-Square is not symmetric, the lower critical value (\(\chi^2_{1-\alpha/2}\)) is not just the negative of the upper value; it is a separate positive number near zero.

3. Isolate the Parameter (\(\sigma^2\))

To find the interval for \(\sigma^2\), we must algebraically rearrange the inequalities inside the probability statement:

  1. Reciprocate all terms (this flips the inequality signs): \[\frac{1}{\chi^2_{1-\alpha/2}} \ge \frac{\sigma^2}{(n-1)S^2} \ge \frac{1}{\chi^2_{\alpha/2}}\]
  2. Multiply by \((n-1)S^2\) to isolate \(\sigma^2\): \[\frac{(n-1)S^2}{\chi^2_{\alpha/2, \, n-1}} \le \sigma^2 \le \frac{(n-1)S^2}{\chi^2_{1-\alpha/2, \, n-1}}\]

4. The Final Exact Interval

The resulting \(100(1-\alpha)\%\) Confidence Interval for \(\sigma^2\) is: \[\left( \frac{(n-1)S^2}{\chi^2_{\alpha/2, \, n-1}}, \;\; \frac{(n-1)S^2}{\chi^2_{1-\alpha/2, \, n-1}} \right)\]
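A quick numeric check of this interval (the sample values below are made up for illustration):

```python
import numpy as np
from scipy import stats

x = np.array([4.1, 5.3, 3.8, 6.0, 5.1, 4.7, 5.9, 4.4])   # illustrative data
n, s2, alpha = len(x), x.var(ddof=1), 0.05

# Upper-tail critical values in the notation above:
chi2_hi = stats.chi2.ppf(1 - alpha / 2, df=n - 1)  # chi^2_{alpha/2, n-1}
chi2_lo = stats.chi2.ppf(alpha / 2, df=n - 1)      # chi^2_{1-alpha/2, n-1}

ci = ((n - 1) * s2 / chi2_hi, (n - 1) * s2 / chi2_lo)
print(f"S^2 = {s2:.3f}, 95% CI for sigma^2 = ({ci[0]:.3f}, {ci[1]:.3f})")
```

Note that \(S^2\) sits closer to the lower bound, the asymmetry discussed below.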

Critical Observations for this Example:

  • Asymmetry: Unlike the \(Z\) or \(t\) intervals, this interval is not centered at \(S^2\). The “distance” from the sample variance to the lower bound is different from the distance to the upper bound.
  • Relationship to \(n\): As \(n \to \infty\), the Chi-Square distribution becomes more symmetric (approximately normal), and the “skewness” of this interval disappears.
  • Sensitivity: This interval is an exact result of the Normality assumption. If the population is even slightly non-normal, the actual coverage of this interval can be far less than \((1-\alpha)\), even for large samples.

6. Hypothesis Tests

A hypothesis test formally assesses whether the observed data provide sufficient evidence to reject a null hypothesis \(H_0\) in favor of an alternative \(H_1\).

1. Core Components

  • Test Statistic (\(T\)): A function of the data whose distribution is known under \(H_0\) (the Null Distribution).
  • Expected Baseline (\(\eta\)): The expected value of the statistic under the null, \(\eta = E[T \mid H_0]\).
  • Rejection Region: The set of values for \(T\) so unlikely under \(H_0\) that we instead favor \(H_1\). This region is sized to control the Type I Error (\(\alpha\)).

2. The Fisherian p-value Procedure

To handle any test statistic \(T\) (standard or unseen), we define the p-value based on the direction of support for \(H_1\) relative to the baseline \(\eta\).

Given an observed value \(t\):

| If \(H_1\) implies… | Direction of Evidence | p-value Calculation |
|---|---|---|
| \(E[T] > \eta\) | Upper-tail | \(P(T \ge t \mid H_0)\) |
| \(E[T] < \eta\) | Lower-tail | \(P(T \le t \mid H_0)\) |
| \(E[T] \neq \eta\) | Two-tailed | \(P(\lvert T - \eta\rvert \ge \lvert t - \eta\rvert \mid H_0)\) |

The Logic: The p-value is the probability of observing a result at least as contradictory to \(H_0\) (in the direction of \(H_1\)) as the one actually obtained.

Strategy for “Unseen” Statistics

  1. Assume \(H_0\): Determine the expected value \(\eta\) and the distribution of \(T\). Calculate \(\eta = E[T \mid H_0]\).
  2. Check \(H_1\): Identify whether the alternative suggests \(T\) should be larger, smaller, or simply “different” from \(\eta\).
  3. Find the Tail: Integrate from the observed \(t\) toward the direction of \(H_1\).
  4. Conclude: If p-value \(< \alpha\), the deviation from \(\eta\) is too large to be attributed to chance.
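A minimal sketch of this recipe on a concrete “unseen” statistic (all numbers are illustrative): suppose under \(H_0\) the statistic \(T \sim \text{Binomial}(20, 0.5)\), so \(\eta = 10\), and we observe \(t = 15\):

```python
from scipy import stats

n, eta, t_obs = 20, 10, 15
null = stats.binom(n, 0.5)         # null distribution of T

p_upper = null.sf(t_obs - 1)       # P(T >= 15), for H1: E[T] > eta
p_lower = null.cdf(t_obs)          # P(T <= 15), for H1: E[T] < eta

# Two-tailed: P(|T - eta| >= |t - eta|) = P(T >= 15) + P(T <= 5)
d = abs(t_obs - eta)
p_two = null.sf(eta + d - 1) + null.cdf(eta - d)

print(p_upper, p_lower, p_two)
```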

7. Simulation and the Bootstrap

When analytical sampling distributions are intractable or distributional assumptions are too risky, simulation provides a numerical approximation. The Bootstrap is the primary framework for this, treating the observed data as a mini-population.

7.1 The Nonparametric Bootstrap

The Nonparametric Bootstrap uses the Empirical CDF (\(\mathbb{F}_n\)) as a surrogate for the true distribution \(\mathbb{F}\).

The Algorithm:

  1. Resample: Draw \(B\) bootstrap samples \(\mathbf{x}^{*b} = \{x_1^{*b}, \dots, x_n^{*b}\}\) from the original data with replacement.
  2. Compute: For each sample, calculate the bootstrap replicate of the statistic: \(\hat{\theta}^{*b} = T(\mathbf{x}^{*b})\).
  3. Aggregate: The collection \(\{\hat{\theta}_1^*, \dots, \hat{\theta}_B^*\}\) serves as an empirical approximation of the sampling distribution of \(\hat{\theta}\).

Bootstrap Variance & Standard Error

The estimated variance of \(\hat{\theta}\) is simply the sample variance of the bootstrap replicates:

\[\widehat{\operatorname{Var}}_{\text{boot}}(\hat{\theta}) = \frac{1}{B-1}\sum_{b=1}^B (\hat{\theta}_b^* - \bar{\hat{\theta}}^*)^2 \implies \widehat{SE}[\hat{\theta}] = \sqrt{\widehat{\operatorname{Var}}_{\text{boot}}}\]

Bootstrap Confidence Intervals

Let \(\hat{\theta}^*_{(q)}\) denote the \(q\)-th quantile of the bootstrap replicates.

  • Percentile Interval: Uses the distribution of \(\hat{\theta}^*\) directly.

    \[[\hat{\theta}^*_{(\alpha/2)}, \;\; \hat{\theta}^*_{(1-\alpha/2)}]\]

  • Basic (Pivotal) Interval: Corrects for bias by looking at the distribution of the error \((\hat{\theta}^* - \hat{\theta})\).

    \[[2\hat{\theta} - \hat{\theta}^*_{(1-\alpha/2)}, \;\; 2\hat{\theta} - \hat{\theta}^*_{(\alpha/2)}]\]

  • Bootstrap-\(t\): Studentizes the statistic using a nested bootstrap to estimate \(SE^*\) for every replicate. It is second-order accurate (converges faster to the true coverage) but computationally expensive.

Limitations: The bootstrap is highly effective for “smooth” statistics (means, regression coefficients) but fails for extreme order statistics (e.g., the sample maximum) or data with infinite variance.
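A minimal sketch of the algorithm and the first two intervals above (the statistic, data, and \(B\) are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.lognormal(mean=0.0, sigma=1.0, size=60)   # illustrative skewed data
B, alpha = 5_000, 0.05
theta_hat = np.median(x)                          # the statistic of interest

# 1-2. Resample with replacement and recompute the statistic B times
idx = rng.integers(0, len(x), size=(B, len(x)))
reps = np.median(x[idx], axis=1)

# 3. Aggregate: bootstrap SE and confidence intervals
se_boot = reps.std(ddof=1)
lo, hi = np.quantile(reps, [alpha / 2, 1 - alpha / 2])
percentile_ci = (lo, hi)
basic_ci = (2 * theta_hat - hi, 2 * theta_hat - lo)

print(f"SE = {se_boot:.3f}")
print(f"percentile CI = {percentile_ci}, basic CI = {basic_ci}")
```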

7.2 Parametric Bootstrap

If a specific parametric model \(F_{\psi}\) is trusted, we estimate the parameters \(\hat{\psi}\) (e.g., via MLE) and then simulate new samples from \(F_{\hat{\psi}}\) rather than resampling the raw data.

  • Advantage: More efficient (lower variance) if the model is correct.
  • Risk: Highly sensitive to model misspecification; if the assumed family is wrong, the simulation is biased.

Summary: The Simulation Logic

  • Nonparametric: Population \(\approx\) Data (\(\mathbb{F}_n\)) \(\rightarrow\) Resample with replacement.
  • Parametric: Population \(\approx\) Model (\(F_{\hat{\psi}}\)) \(\rightarrow\) Generate new data from the model.
  • Goal: Replace complex calculus/integration with the Law of Large Numbers via Monte Carlo repetition.

7.3 Bootstrap Hypothesis Testing

In a traditional test, we use a theoretical null distribution (like \(t\) or \(F\)). In a bootstrap test, we build the null distribution ourselves by forcing the data to obey \(H_0\).

The goal is to calculate a p-value: \(P(T^* \ge t_{obs} \mid H_0)\). Since our raw data might not satisfy \(H_0\), we must transform it before resampling.

1. The Null-Transformation Logic

To test \(H_0: \mu = \mu_0\), we cannot simply resample the original data \(X\) (because the mean of the bootstrap population \(\mathbb{F}_n\) is \(\bar{x}\), not \(\mu_0\)). We must shift the data so that the mean of our “bootstrap population” is exactly \(\mu_0\): \[\tilde{X}_i = X_i - \bar{x} + \mu_0\] Now, the new dataset \(\tilde{X}\) has a mean of \(\mu_0\) but retains the variance and shape of our original sample.

The “Null-Transformation” (shifting/scaling the data) is applied to ensure the simulation actually represents the “Null World”, i.e. obeys \(H_0\).

2. The Procedure

  1. Transform: Create the null-compliant dataset \(\tilde{X}\) (as shown above).
  2. Resample: Draw \(B\) bootstrap samples \(\mathbf{x}^{*b}\) from \(\tilde{X}\) with replacement.
  3. Compute: Calculate the test statistic for each: \(t^{*b} = T(\mathbf{x}^{*b})\).
  4. p-value: Calculate the proportion of bootstrap statistics that are as extreme as our original observed statistic \(t_{obs}\): \[\text{p-value}_{\text{boot}} = \frac{1}{B} \sum_{b=1}^B I(t^{*b} \ge t_{obs})\]
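A minimal sketch of this procedure for a one-sample mean with \(H_1: \mu > \mu_0\) (the data, \(\mu_0\), and \(B\) are illustrative; the test statistic is the sample mean itself):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.exponential(scale=1.2, size=40)   # illustrative skewed data
mu0, B = 1.0, 10_000

t_obs = x.mean()
x_null = x - x.mean() + mu0               # step 1: shift so the null world holds

# Steps 2-3: resample the null-compliant data and recompute the statistic
idx = rng.integers(0, len(x), size=(B, len(x)))
t_star = x_null[idx].mean(axis=1)

# Step 4: upper-tail bootstrap p-value
p_boot = np.mean(t_star >= t_obs)
print(f"t_obs = {t_obs:.3f}, bootstrap p-value = {p_boot:.4f}")
```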

Comparison of Test Strategies

| Test Type | Source of Null Distribution | Best Used When… |
|---|---|---|
| Parametric (\(t\), \(F\)) | Mathematical theory | Data is Normal or \(n\) is large. |
| Nonparametric (Sign, \(U\)) | Combinatorics / ranks | Data is skewed/ordinal; no outliers. |
| Bootstrap Test | Resampling the ECDF | Distribution is “weird” and \(n\) is small/medium. |
| Permutation Test | Shuffling labels | Testing exchangeability (\(H_0\): groups are identical). |

Strategy for Exam Questions

If you are asked to “Design a bootstrap test for the ratio of two variances”:

  1. State the Null: \(H_0: \sigma_1^2 / \sigma_2^2 = 1\).
  2. Transform: Scale the datasets so they have equal variances (e.g., divide each group by its own standard deviation).
  3. Resample: Draw bootstrap samples from these scaled groups.
  4. Calculate: Find the ratio \(s^{*2}_1 / s^{*2}_2\) for each iteration.
  5. p-value: See how often the simulated ratio exceeds your original observed ratio.
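A compact sketch of that exam recipe (two illustrative samples, resampling each scaled group independently):

```python
import numpy as np

rng = np.random.default_rng(5)
x1 = rng.normal(0, 2.0, size=30)          # illustrative group 1
x2 = rng.normal(0, 1.5, size=35)          # illustrative group 2
B = 10_000

r_obs = x1.var(ddof=1) / x2.var(ddof=1)   # observed variance ratio

# Step 2: scale each group to unit variance so H0 (equal variances) holds
z1, z2 = x1 / x1.std(ddof=1), x2 / x2.std(ddof=1)

# Steps 3-4: resample the scaled groups and recompute the ratio
r_star = np.empty(B)
for b in range(B):
    b1 = rng.choice(z1, size=len(z1), replace=True)
    b2 = rng.choice(z2, size=len(z2), replace=True)
    r_star[b] = b1.var(ddof=1) / b2.var(ddof=1)

# Step 5: how often does the simulated ratio exceed the observed one?
p_boot = np.mean(r_star >= r_obs)
print(f"ratio = {r_obs:.3f}, bootstrap p-value = {p_boot:.4f}")
```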

8. Parametric Tests

This section covers Parametric Tests, which assume a specific distributional form (typically Normal) and make inference about parameters such as means and variances.

A key idea is the role of the Central Limit Theorem:

  • It provides robustness for mean-based tests
  • But offers limited protection for variance-based tests

8.1 Overview of parametric tests

| Test Target | Statistic | Null Dist. | df | Key Assumptions |
|---|---|---|---|---|
| One-sample mean | \(T = \frac{\bar{X}-\mu_0}{S/\sqrt{n}}\) | \(t\) | \(n-1\) | Normality (or large \(n\)) |
| Two-sample mean (equal var) | \(T = \frac{\bar{X}_1-\bar{X}_2}{S_p\sqrt{1/n_1+1/n_2}}\) | \(t\) | \(n_1+n_2-2\) | Normality, homoscedasticity |
| Two-sample mean (Welch) | \(T = \frac{\bar{X}_1-\bar{X}_2}{\sqrt{S_1^2/n_1+S_2^2/n_2}}\) | approx. \(t\) | \(\hat{\nu}\) (Satterthwaite) | Normality |
| One-sample variance | \(V = \frac{(n-1)S^2}{\sigma_0^2}\) | \(\chi^2\) | \(n-1\) | Strict normality |
| Two-sample variance | \(F = S_1^2/S_2^2\) | \(F\) | \(n_1-1,\ n_2-1\) | Strict normality |
| One-way ANOVA | \(F = \frac{MS_{\text{Bet}}}{MS_{\text{With}}}\) | \(F\) | \(k-1,\ N-k\) | Normality, homoscedasticity |

Clarifying “Normality” vs “Strict Normality”

The distinction between normality and strict normality mainly reflects how much protection you get from the Central Limit Theorem.

Normality (means-based tests: t, ANOVA)

  • Small samples: The population (or residuals) should be approximately normal.
  • Large samples: The assumption becomes less critical because the CLT ensures the sampling distribution of the mean is approximately normal (assuming finite variance).
  • Robustness: These tests are generally robust to mild non-normality (e.g., moderate skewness or kurtosis), especially when sample sizes are balanced.

👉 Practical interpretation:

“Normality is helpful, but moderate deviations usually don’t invalidate the test—especially when \(n\) is not small.”

Strict Normality (variance-based tests: \(\chi^2\), F for variances)

  • These tests rely directly on results that hold exactly only under normal data (e.g., sums of squared normal variables).
  • The CLT does not directly stabilize variance statistics in the same way it does for means.
  • Sensitivity: Skewness and heavy tails can significantly distort the null distribution, affecting p-values and error rates—even for moderately large samples.

👉 Practical interpretation:

“These tests are much more sensitive to non-normality; noticeable departures (especially heavy tails or skewness) can make results unreliable.”

Theoretical Justification

These tests rely on the fundamental properties of normal samples:

  1. Independence: For normal populations, \(\bar{X}\) and \(S^2\) are independent.
  2. Distributional Ratios:
    • \(\chi^2\) arises from the sum of squared standard normals.
    • \(t\) is the ratio of a standard normal to the square root of a \(\chi^2\) (divided by df).
    • \(F\) is the ratio of two independent \(\chi^2\) variables (each divided by their respective df).

8.2 One-Sample t-Test

Purpose: Tests the population mean.

  • \(H_0: \mu = \mu_0\)
  • Assumptions: Independent observations; population approximately normal (or large \(n\))

Test Statistic

\[ T = \frac{\bar{X} - \mu_0}{S / \sqrt{n}} \]

Null Distribution

\[ T \sim t_{n-1} \]

Insight

  • For small samples, normality matters
  • For large samples, the CLT ensures approximate validity

Note: When the population variance \(\sigma^2\) is known, we can use a \(z\)-test instead. The \(z\)-statistic is \(Z = \frac{\bar{X} - \mu_0}{\sigma / \sqrt{n}} \sim N(0,1)\) under \(H_0\).
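A minimal sketch with scipy (the sample and \(\mu_0\) are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
x = rng.normal(loc=5.3, scale=1.0, size=25)    # illustrative sample

# Two-sided one-sample t-test of H0: mu = 5; T is compared to t_{24}
res = stats.ttest_1samp(x, popmean=5.0)
print(f"T = {res.statistic:.3f}, p = {res.pvalue:.4f}")
```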

8.3 Two-Sample t-Test (Equal Variance)

Purpose: Compares means of two independent populations assuming equal variances.

  • \(H_0: \mu_1 = \mu_2\)
  • Assumptions:
    • Independence
    • Normality
    • Homoscedasticity (\(\sigma_1^2 = \sigma_2^2\))

Test Statistic

\[ T = \frac{\bar{X}_1 - \bar{X}_2}{S_p \sqrt{1/n_1 + 1/n_2}} \]

where \[ S_p^2 = \frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2} \]

Null Distribution

\[ T \sim t_{n_1 + n_2 - 2} \]

8.4 Welch Two-Sample t-Test (FYI)

Purpose: Compares means without assuming equal variances.

  • \(H_0: \mu_1 = \mu_2\)
  • Assumptions: Independence; approximate normality (or large samples)

Test Statistic

\[ T = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{S_1^2/n_1 + S_2^2/n_2}} \]

Welch-Satterthwaite Effective Degrees of Freedom (\(\hat{\nu}\)):

\[\hat{\nu} \approx \frac{\left( \frac{S_1^2}{n_1} + \frac{S_2^2}{n_2} \right)^2}{\frac{(S_1^2/n_1)^2}{n_1-1} + \frac{(S_2^2/n_2)^2}{n_2-1}}\]

Null Distribution

\[ T \approx t_{\hat{\nu}} \]

Insight

  • Default choice in practice
  • Robust to unequal variances and sample sizes
  • Used when variances may be unequal (\(\sigma_1^2 \neq \sigma_2^2\)); the Welch–Satterthwaite formula approximates the distribution of the combined variance estimate \(S_1^2/n_1 + S_2^2/n_2\) by a scaled \(\chi^2\), yielding the effective degrees of freedom \(\hat{\nu}\).
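Both two-sample tests are available through scipy's ttest_ind; the equal_var flag switches between the pooled test (8.3) and Welch's test (8.4). The data below are illustrative:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
x1 = rng.normal(10.0, 1.0, size=20)    # illustrative samples with
x2 = rng.normal(10.8, 2.5, size=35)    # clearly unequal spreads

pooled = stats.ttest_ind(x1, x2, equal_var=True)    # Section 8.3
welch = stats.ttest_ind(x1, x2, equal_var=False)    # Section 8.4
print(f"pooled: T = {pooled.statistic:.3f}, p = {pooled.pvalue:.4f}")
print(f"welch:  T = {welch.statistic:.3f}, p = {welch.pvalue:.4f}")
```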

8.5 One-Sample Variance Test

Purpose: Tests whether a population variance equals a specified value.

  • \(H_0: \sigma^2 = \sigma_0^2\)
  • Assumptions: Strict normality

Test Statistic

\[ V = \frac{(n-1)S^2}{\sigma_0^2} \]

Null Distribution

\[ V \sim \chi^2_{n-1} \]

Insight

  • Exact only under normal data
  • Sensitive to skewness and heavy tails

8.6 Two-Sample Variance Test (F-Test)

Purpose: Compares variances of two populations.

  • \(H_0: \sigma_1^2 = \sigma_2^2\)

  • Assumptions:

    • Independence
    • Strict normality

Test Statistic

\[ F = \frac{S_1^2}{S_2^2} \]

Null Distribution

\[ F \sim F_{n_1-1, n_2-1} \]

Insight

  • Highly sensitive to non-normality
  • Even moderate skewness can distort results
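scipy has no built-in two-sample variance \(F\)-test, but it takes only a few lines by hand. A sketch with illustrative data, using a two-sided p-value (double the smaller tail):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
x1 = rng.normal(0, 1.3, size=16)   # illustrative samples
x2 = rng.normal(0, 1.0, size=21)

F = x1.var(ddof=1) / x2.var(ddof=1)
dfn, dfd = len(x1) - 1, len(x2) - 1

# Two-sided p-value: twice the smaller of the two tail probabilities
p = 2 * min(stats.f.sf(F, dfn, dfd), stats.f.cdf(F, dfn, dfd))
print(f"F = {F:.3f}, p = {min(p, 1.0):.4f}")
```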

8.7 One-Way ANOVA (Analysis of Variance)

Purpose: Tests equality of means across \(k\) groups.

  • \(H_0: \mu_1 = \mu_2 = \dots = \mu_k\)
  • Assumptions:
    • Independence
    • Normality (within groups)
    • Homoscedasticity

Test Statistic

\[F = \frac{MS_{\text{Between}}}{MS_{\text{Within}}}\] where

  • \(MS_{Between}\): Measures variation between group means and the grand mean.
  • \(MS_{Within}\): Measures variation within each group (the “error”).

Null Distribution

\[F \sim F_{k-1, N-k}\]

Insight

  • Logic: If the \(F\)-ratio is significantly \(> 1\), the group means are more spread out than random chance (within-group noise) would explain.
  • Extension of the two-sample t-test
  • Fairly robust to mild non-normality
  • Sensitive to unequal variances when group sizes are very different
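A minimal sketch with scipy.stats.f_oneway (three illustrative groups; here \(k = 3\), \(N = 36\), so the null distribution is \(F_{2, 33}\)):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(9)
g1 = rng.normal(5.0, 1.0, size=12)   # illustrative groups with a common
g2 = rng.normal(5.5, 1.0, size=12)   # variance but shifted means
g3 = rng.normal(6.2, 1.0, size=12)

res = stats.f_oneway(g1, g2, g3)     # F = MS_Between / MS_Within
print(f"F = {res.statistic:.3f}, p = {res.pvalue:.4f}")
```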

9. Nonparametric Tests

This section focuses on Nonparametric Tests, which do not assume a specific functional form (like Normality) for the underlying population. Instead, they rely on the relative ranks of the data or categorical counts.

9.1. The Sign Test

Purpose: Used to test the median (\(M\)) of a continuous distribution.

  • \(H_0: M = M_0\)
  • Assumptions: Independent draws from a continuous distribution.
  • Key Insight: No assumption of symmetry or normality is required.

Test Statistic and Exact Distribution

The test statistic \(S^+\) counts how many observations fall above the hypothesized median:

\[S^+ = \sum_{i=1}^n I(X_i > M_0)\]

Under \(H_0\), any observation has a \(0.5\) probability of being above the median. Therefore, the null distribution is exactly Binomial:

\[S^+ \sim \text{Binomial}(n, 0.5)\]

The p-value is calculated using binomial tail probabilities. For large \(n\), we use the normal approximation: \(Z = \frac{S^+ - n/2}{\sqrt{n/4}} \xrightarrow{d} N(0,1)\).
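A minimal sketch of the exact test via the Binomial null (data and \(M_0\) are illustrative; binomtest requires scipy ≥ 1.7):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(10)
x = rng.exponential(scale=2.0, size=30)   # illustrative skewed data
m0 = 1.2                                  # hypothesized median

s_plus = int(np.sum(x > m0))              # count of observations above M0
res = stats.binomtest(s_plus, n=len(x), p=0.5, alternative="greater")
print(f"S+ = {s_plus}, exact p-value = {res.pvalue:.4f}")
```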

9.2. Mann–Whitney \(U\) (Wilcoxon Rank-Sum)

Purpose: Compares two independent samples (\(X_1, \dots, X_m\) and \(Y_1, \dots, Y_n\)) to see if one is “stochastically larger” than the other.

  • \(H_0: F_X = F_Y\) (Distributions are identical).
  • Assumptions: Independent observations and continuous distributions (no ties).

The \(U\) Statistic and Ranks

The \(U\) statistic counts the number of pairs where \(X_i > Y_j\):

\[U_X = \sum_{i=1}^m \sum_{j=1}^n I(X_i > Y_j)\]

Relationship to Ranks: If \(W_X\) is the sum of ranks for the \(X\) sample in the combined dataset:

\[U_X = W_X - \frac{m(m+1)}{2}\]

Moments and Large-Sample Distribution

Under \(H_0\), the mean and variance are:

\[\eta = E[U_X] = \frac{mn}{2}, \quad \operatorname{Var}(U_X) = \frac{mn(m+n+1)}{12}\] For large \(m, n\), we use the Fisherian approach with the normal approximation:

\[Z = \frac{U_X - \eta}{\sqrt{\operatorname{Var}(U_X)}} \xrightarrow{d} N(0,1)\]
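A minimal sketch with scipy (illustrative samples; scipy reports the \(U\) statistic for the first sample and chooses an exact or normal-approximate p-value automatically):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
x = rng.normal(0.0, 1.0, size=15)    # illustrative samples
y = rng.normal(0.6, 1.0, size=18)    # shifted upward

res = stats.mannwhitneyu(x, y, alternative="two-sided")
print(f"U = {res.statistic:.1f}, p = {res.pvalue:.4f}")
```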

9.3 Chi-Square Goodness-of-Fit Test

Purpose: Tests if observed categorical counts (\(O_i\)) match expected counts (\(E_i\)) from a specified distribution.

  • Assumptions: Data follow a multinomial distribution; expected counts \(E_i \ge 5\) for the approximation to hold.

Pearson’s Test Statistic

The statistic measures the squared “distance” between observed and expected frequencies: \[X^2 = \sum_{i=1}^k \frac{(O_i - E_i)^2}{E_i}\] Under \(H_0\), \(X^2 \xrightarrow{d} \chi^2_{df}\).

Determining Degrees of Freedom (\(df\))

The \(df\) depends on how much information is “used up” to calculate the expected counts:

  1. Fully Specified: If all \(p_i\) are known, \(df = k - 1\) (the \(-1\) accounts for the constraint \(\sum O_i = n\)).
  2. Estimated Parameters: If you must estimate \(r\) parameters (e.g., estimating \(\lambda\) for a Poisson fit) using MLE from the binned data: \[df = k - 1 - r\]

The Fisherian Insight: Estimating \(r\) parameters forces the expected values to “mimic” the observed data in \(r\) dimensions, reducing the remaining “freedom” for the data to vary independently.
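A minimal sketch with scipy.stats.chisquare for a fully specified null (illustrative fair-die counts, so \(df = k - 1 = 5\)):

```python
from scipy import stats

observed = [18, 22, 16, 25, 19, 20]    # illustrative die rolls, n = 120
expected = [120 / 6] * 6               # H0: p_i = 1/6 for each face

res = stats.chisquare(f_obs=observed, f_exp=expected)   # df = k - 1 = 5
print(f"X^2 = {res.statistic:.3f}, p = {res.pvalue:.4f}")

# If r parameters were estimated to build f_exp, pass ddof=r
# so that scipy uses df = k - 1 - r.
```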

Confidence vs. Prediction Intervals at \(X = X_h\)

Both intervals are centered at the point estimate \(\hat{Y}_h = \hat{\beta}_0 + \hat{\beta}_1 X_h\), but they differ in their “width” due to the underlying uncertainty they attempt to capture.

1. Confidence Interval (for the Mean Response \(\mu_{Y|X_h}\)): Used when estimating the expected (average) value of \(Y\) for all individuals with a specific \(X_h\).

  • Standard Error: \(SE(\hat{Y}_h) = \sqrt{MSE \left[ \frac{1}{n} + \frac{(X_h - \bar{X})^2}{S_{XX}} \right]}\)
  • The Interval: \[\hat{Y}_h \pm t_{\alpha/2, n-2} \cdot SE(\hat{Y}_h)\]

2. Prediction Interval (for a New Individual \(Y_{h(\text{new})}\)): Used when predicting the value for a single, specific future observation at \(X_h\).

  • Standard Error: \(SE(\text{pred}) = \sqrt{MSE \left[ \mathbf{1} + \frac{1}{n} + \frac{(X_h - \bar{X})^2}{S_{XX}} \right]}\)
  • The Interval: \[\hat{Y}_h \pm t_{\alpha/2, n-2} \cdot SE(\text{pred})\]
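A sketch computing both intervals directly from these formulas with plain numpy (the data and \(X_h = 5\) are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(12)
x = np.linspace(0, 10, 30)
y = 2.0 + 0.8 * x + rng.normal(0, 1.0, size=30)   # illustrative data
n, xh = len(x), 5.0

# Least-squares fit; MSE uses the n - 2 error degrees of freedom
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
mse = np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2)
sxx = np.sum((x - x.mean()) ** 2)

yh = b0 + b1 * xh
t_crit = stats.t.ppf(0.975, df=n - 2)
se_mean = np.sqrt(mse * (1 / n + (xh - x.mean()) ** 2 / sxx))      # CI
se_pred = np.sqrt(mse * (1 + 1 / n + (xh - x.mean()) ** 2 / sxx))  # PI

print(f"CI: ({yh - t_crit * se_mean:.2f}, {yh + t_crit * se_mean:.2f})")
print(f"PI: ({yh - t_crit * se_pred:.2f}, {yh + t_crit * se_pred:.2f})")
```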

10. Putting It All Together

The logic of classical inference rests on:

  1. Random sampling ensures the data are a representative i.i.d. draw from \(\mathbb{F}\).
  2. Statistics summarise the data; their sampling distributions are derived either exactly (using normal theory and the \(t,\chi^2,F\) families) or approximately (via the CLT and delta method).
  3. Point estimates are computed by plug‑in, MLE, or other principles.
  4. Confidence intervals quantify the uncertainty of estimates using pivots (exact or asymptotic) or resampling methods.
  5. Hypothesis tests evaluate specific claims by comparing a test statistic to its null distribution, controlling Type I error via known percentiles.
  6. When the model is uncertain, the bootstrap replaces analytical derivations with computation, making inference possible for a vast range of problems without exact distributional assumptions.

Question 1: Asymptotic Approximations and the Chi-Square Distribution

Let \(X_1, \dots, X_n\) be a random sample from a \(N(0, \sigma^2)\) distribution.

The MLE for \(\sigma^2\) is \(\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^n X_i^2\).

Let \(Y_n = \sum_{i=1}^n X_i^2 / \sigma^2\), so that \(Y_n \sim \chi^2_n\).

  1. Use the Central Limit Theorem (CLT) to find the asymptotic distribution of \(Y_n\) as \(n \to \infty\).
  2. Find the Fisher Information for \(\sigma^2\), \(I(\sigma^2)\), and state the asymptotic variance of the MLE \(\hat{\sigma}^2\). Verify that this asymptotic variance matches the variance derived from the CLT result in Part 1.
  3. For large \(n\), calculating exact percentiles of the \(\chi^2\) distribution can be computationally intensive. Using the asymptotic distribution of \(Y_n\) from Part 1, derive an algebraic expression for the approximate upper \(\alpha\) percentile of the chi-square distribution, \(\chi^2_{\alpha, n}\) (defined such that \(P(Y_n > \chi^2_{\alpha, n}) = \alpha\)), in terms of \(n\) and the upper \(\alpha\) percentile of the standard normal distribution, \(z_\alpha\).

Answer 1

Part 1: Use the Central Limit Theorem (CLT) to find the asymptotic distribution of \(Y_n\) as \(n \to \infty\).

We can view \(Y_n \sim \chi^2_n\) as the sum of \(n\) independent identically distributed \(\chi^2_1\) random variables. Let \(Z_i \sim \chi^2_1\). Then \(Y_n = \sum_{i=1}^n Z_i\). We know that for \(Z_i \sim \chi^2_1\):

\[E[Z_i] = 1, \quad \text{Var}(Z_i) = 2\] Therefore, \(E[Y_n] = n\) and \(\text{Var}(Y_n) = 2n\). By the Central Limit Theorem, as \(n \to \infty\):

\[\frac{Y_n - n}{\sqrt{2n}} \xrightarrow{d} N(0,1)\] Equivalently, we can write the asymptotic distribution as \(Y_n \dot\sim N(n, 2n)\).

Part 2: Find the Fisher Information for \(\sigma^2\), \(I(\sigma^2)\), and state the asymptotic variance of the MLE \(\hat{\sigma}^2\). Verify that this asymptotic variance matches the variance derived from the CLT result in Part 1.

The probability density function of \(X \sim N(0, \sigma^2)\) is \(f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-x^2/(2\sigma^2)}\). Let \(\theta = \sigma^2\). The log-likelihood for a single observation is: \[\ln f(x) = -\frac{1}{2}\ln(2\pi) - \frac{1}{2}\ln(\theta) - \frac{x^2}{2\theta}\] Taking the first and second derivatives with respect to \(\theta\): \[\frac{\partial \ln f}{\partial \theta} = -\frac{1}{2\theta} + \frac{x^2}{2\theta^2}\] \[\frac{\partial^2 \ln f}{\partial \theta^2} = \frac{1}{2\theta^2} - \frac{x^2}{\theta^3}\] The Fisher Information for a single observation is the negative expected value of the second derivative:

\[I_1(\theta) = -E\left[\frac{\partial^2 \ln f}{\partial \theta^2}\right] = -\frac{1}{2\theta^2} + \frac{E[X^2]}{\theta^3}\] Since \(E[X^2] = \text{Var}(X) = \sigma^2 = \theta\):

\[I_1(\theta) = -\frac{1}{2\theta^2} + \frac{\theta}{\theta^3} = \frac{1}{2\theta^2} = \frac{1}{2\sigma^4}\] For the full sample of size \(n\), the total Fisher Information is \(I(\sigma^2) = n I_1(\sigma^2) = \frac{n}{2\sigma^4}\).
By the properties of MLEs, the asymptotic variance of \(\hat{\sigma}^2\) is the inverse of the Fisher Information:

\[\text{Var}(\hat{\sigma}^2) = \frac{1}{I(\sigma^2)} = \frac{2\sigma^4}{n}\]

Verification via CLT:

From Part 1, \(Y_n \dot\sim N(n, 2n)\). Since \(Y_n = \frac{n\hat{\sigma}^2}{\sigma^2}\), we can rearrange to get \(\hat{\sigma}^2 = \frac{\sigma^2}{n} Y_n\). Using the properties of normal distributions, the variance of \(\hat{\sigma}^2\) is: \[\text{Var}(\hat{\sigma}^2) = \left(\frac{\sigma^2}{n}\right)^2 \text{Var}(Y_n) = \frac{\sigma^4}{n^2} (2n) = \frac{2\sigma^4}{n}\] This perfectly matches the asymptotic variance derived from the Fisher Information.

Part 3: Using the asymptotic distribution of \(Y_n\) from Part 1, derive an algebraic expression for the approximate upper \(\alpha\) percentile of the chi-square distribution, \(\chi^2_{\alpha, n}\), in terms of \(n\) and the upper \(\alpha\) percentile of the standard normal distribution, \(z_\alpha\).

We want to find an approximation for \(\chi^2_{\alpha, n}\) such that \(P(Y_n > \chi^2_{\alpha, n}) = \alpha\).

Using the asymptotic distribution from Part 1, we standardize \(Y_n\): \[P\left(\frac{Y_n - n}{\sqrt{2n}} > \frac{\chi^2_{\alpha, n} - n}{\sqrt{2n}}\right) \approx \alpha\] Because \(\frac{Y_n - n}{\sqrt{2n}} \xrightarrow{d} Z \sim N(0,1)\), the right-hand side of the inequality must approximate the upper \(\alpha\) percentile of the standard normal distribution, \(z_\alpha\) (where \(P(Z > z_\alpha) = \alpha\)):

\[\frac{\chi^2_{\alpha, n} - n}{\sqrt{2n}} \approx z_\alpha\] Solving for \(\chi^2_{\alpha, n}\):

\[\chi^2_{\alpha, n} - n \approx z_\alpha \sqrt{2n}\]

\[\chi^2_{\alpha, n} \approx n + z_\alpha \sqrt{2n}\]
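A quick numerical check of this approximation against the exact percentile (the choices \(\alpha = 0.05\), \(n = 200\) are illustrative):

```python
import numpy as np
from scipy import stats

alpha, n = 0.05, 200
z = stats.norm.ppf(1 - alpha)              # z_alpha

approx = n + z * np.sqrt(2 * n)            # n + z_alpha * sqrt(2n)
exact = stats.chi2.ppf(1 - alpha, df=n)    # exact chi^2_{alpha, n}
print(f"approx = {approx:.2f}, exact = {exact:.2f}")
```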

Question 2: Exact Distributions and Properties of the \(t\)-Statistic

Let \(X_1, X_2, \dots, X_n \stackrel{iid}{\sim} N(\mu, \sigma^2)\) for \(n > 3\).

We define the sample mean \(\bar{X} = \frac{1}{n}\sum_{i=1}^n X_i\) and the sample variance \(S^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X})^2\).

Let \(T = \frac{\sqrt{n}(\bar{X} - \mu)}{S}\).

  1. State the distributions of \(\bar{X}\) and \(\frac{(n-1)S^2}{\sigma^2}\), and state the key relationship between \(\bar{X}\) and \(S^2\) that is required to construct the \(t\)-statistic.
  2. Based on the formal definition of Student’s \(t\)-distribution, prove that \(T \sim t_{n-1}\).
  3. Given that \(\operatorname{Var}(t_\nu) = \frac{\nu}{\nu - 2}\) for \(\nu > 2\). For \(T \sim t_{n-1}\) as derived in Part 2, apply this formula to state the exact variance of \(T\). Then, taking the limit as \(n \to \infty\), show that \(\operatorname{Var}(T) \to 1\), and explain why this is consistent with the convergence of the \(t\)-distribution to the standard normal \(N(0,1)\).

Answer 2

Part 1: State the distributions of \(\bar{X}\) and \(\frac{(n-1)S^2}{\sigma^2}\), and state the key relationship between \(\bar{X}\) and \(S^2\) that is required to construct the \(t\)-statistic.

Given a random sample from \(N(\mu, \sigma^2)\), the sampling distributions are:

  1. \(\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)\)
  2. \(V = \frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}\)

Key relationship: \(\bar{X}\) and \(S^2\) are statistically independent (By Cochran’s Theorem)

Part 2: Based on the formal definition of Student’s \(t\)-distribution, prove that \(T \sim t_{n-1}\).

The formal definition of a random variable following Student’s \(t\)-distribution with \(\nu\) degrees of freedom is \(T = \frac{Z}{\sqrt{V/\nu}}\), where \(Z \sim N(0,1)\), \(V \sim \chi^2_\nu\), and \(Z\) and \(V\) are independent. Let us manipulate our statistic \(T\):

\[T = \frac{\sqrt{n}(\bar{X} - \mu)}{S} = \frac{\frac{\bar{X} - \mu}{\sigma/\sqrt{n}}}{S/\sigma}\] The numerator is \(Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}\). Since \(\bar{X} \sim N(\mu, \sigma^2/n)\), it follows that \(Z \sim N(0,1)\).

The denominator can be rewritten in terms of \(V = \frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}\): \[S/\sigma = \sqrt{\frac{S^2}{\sigma^2}} = \sqrt{\frac{S^2}{\sigma^2} \times \frac{n-1}{n-1}} = \sqrt{\frac{V}{n-1}}\]

Thus, \(T = \frac{Z}{\sqrt{V/(n-1)}}\).

Since \(\bar{X}\) and \(S^2\) are independent, \(Z\) and \(V\) are independent. This satisfies the exact definition of a \(t\)-distribution with \(\nu = n-1\) degrees of freedom. Therefore, \(T \sim t_{n-1}\).

Part 3: Apply this formula to state the exact variance of \(T\) and explain why this is consistent with the convergence of the \(t\)-distribution to the standard normal \(N(0,1)\)

Setting \(\nu = n-1\):

\[\operatorname{Var}(T) = \frac{n-1}{n-3} \quad (n > 3)\]

As \(n \to \infty\):

\[\frac{n-1}{n-3} = \frac{1 - 1/n}{1 - 3/n} \to \frac{1}{1} = 1\]

This is consistent with the convergence of \(t_\nu \to N(0,1)\) as \(\nu \to \infty\), since \(\operatorname{Var}(N(0,1)) = 1\). The variance of the \(t\)-distribution exceeds 1 for any finite \(n\) precisely because \(S\) is an estimate of \(\sigma\) — the extra uncertainty “inflates” the tails and hence the variance. As \(n\) grows, \(S \xrightarrow{p} \sigma\) and this inflation disappears.

Question 3: The \(F\)-Distribution and Reciprocal Percentiles

Let \(W \sim F_{m, n}\). Let \(f_{\alpha, m, n}\) denote the upper \(\alpha\) percentile of the \(F\)-distribution, such that \(P(W > f_{\alpha, m, n}) = \alpha\).

  1. Prove the reciprocal property of the \(F\)-distribution: \(\frac{1}{W} \sim F_{n, m}\). Hint: In your proof, explicitly justify the step where you take reciprocals of the inequality.
  2. Using the result from Part 1, formally prove that \(f_{1-\alpha, m, n} = \frac{1}{f_{\alpha, n, m}}\).
  3. Suppose \(S_1^2\) and \(S_2^2\) are sample variances from two independent normal populations with equal variances, based on sample sizes \(n_1 = 10\) and \(n_2 = 16\). Given the upper percentiles \(f_{0.05, 9, 15} = 2.59\) and \(f_{0.05, 15, 9} = 3.01\), evaluate the exact probability: \[P\left(\frac{S_1^2}{S_2^2} < 0.3322\right)\]

Answer 3

Part 1: Prove the reciprocal property of the \(F\)-distribution: \(\frac{1}{W} \sim F_{n, m}\).

By definition, a random variable \(W\) follows an \(F\)-distribution with \((m, n)\) degrees of freedom if it can be expressed as: \[W = \frac{U/m}{V/n}\] where \(U \sim \chi^2_m\) and \(V \sim \chi^2_n\) are independent chi-square random variables.

To ensure \(\frac{1}{W}\) is well-defined, we observe that \(U = \sum_{i=1}^m Z_i^2\) and \(V = \sum_{j=1}^n Z_j^2\).

Since \(Z \sim N(0,1)\) is a continuous random variable, \(P(Z = 0) = 0\). Consequently, the probability that a chi-square variable equals zero is the probability that all its constituent normal variables are zero simultaneously:

\[P(U = 0) = 0 \quad \text{and} \quad P(V = 0) = 0\] Thus, both the numerator and denominator are strictly positive with probability 1. It follows that \(P(W > 0) = 1\), making the reciprocal \(\frac{1}{W}\) defined and positive for almost all outcomes.

Consider the reciprocal of \(W\): \[\frac{1}{W} = \frac{1}{\left( \frac{U/m}{V/n} \right)} = \frac{V/n}{U/m}\] By the definition of the \(F\)-distribution:

  • The numerator is a chi-square variable (\(V\)) divided by its degrees of freedom (\(n\)).
  • The denominator is an independent chi-square variable (\(U\)) divided by its degrees of freedom (\(m\)).

This structure perfectly matches the definition of an \(F\)-distribution where the degrees of freedom are swapped.

Thus,

\[\frac{1}{W} \sim F_{n, m}\]

Note: This property is why \(F\)-tables often only provide values for the upper tail. If you need the lower-tail critical value \(f_{1-\alpha, m, n}\), you can simply calculate: \[f_{1-\alpha, m, n} = \frac{1}{f_{\alpha, n, m}}\]

Part 2: Using the result from Part 1, formally prove that \(f_{1-\alpha, m, n} = \frac{1}{f_{\alpha, n, m}}\).

Let \(W \sim F_{m, n}\). By the definition of the upper-\(\alpha\) critical value (where \(\alpha\) is the area in the right tail):

\[P(W \le f_{1-\alpha, m, n}) = \alpha\] (Note: This implies \(f_{1-\alpha}\) is the value that leaves \(\alpha\) in the left tail and \(1-\alpha\) in the right tail).

Because \(P(W > 0) = 1\) (from Part 1), we can take the reciprocal of the terms inside the probability without loss of generality. When we take the reciprocal of both sides of an inequality involving positive terms, the inequality sign flips: \[P\left( \frac{1}{W} \ge \frac{1}{f_{1-\alpha, m, n}} \right) = \alpha\]

Let \(W^* = \frac{1}{W}\). From Part 1, we know \(W^* \sim F_{n, m}\). Substituting \(W^*\): \[P\left( W^* \ge \frac{1}{f_{1-\alpha, m, n}} \right) = \alpha\] Due to the continuity of the \(F\)-distribution, \(P(W^* \ge k) = P(W^* > k)\). Thus: \[P\left( W^* > \frac{1}{f_{1-\alpha, m, n}} \right) = \alpha\]

By definition, the value \(k\) such that \(P(W^* > k) = \alpha\) for an \(F_{n, m}\) distribution is denoted as \(f_{\alpha, n, m}\). Comparing the terms, we must have: \[f_{\alpha, n, m} = \frac{1}{f_{1-\alpha, m, n}}\]

Conclusion: Rearranging the terms yields the final result: \[f_{1-\alpha, m, n} = \frac{1}{f_{\alpha, n, m}}\]

Why this matters in practice: If you are doing a two-tailed \(F\)-test at \(\alpha = 0.05\) (so \(0.025\) in each tail) and your table only gives upper-tail values:

  1. Find the upper critical value: \(f_{0.025, m, n}\).
  2. Find the lower critical value: \(\frac{1}{f_{0.025, n, m}}\).

Part 3: Suppose \(S_1^2\) and \(S_2^2\) are sample variances from two independent normal populations with equal variances, based on sample sizes \(n_1 = 10\) and \(n_2 = 16\). Evaluate the exact probability: \(P\left(\frac{S_1^2}{S_2^2} < 0.3322\right)\)

Since the samples are from normal populations with equal variances (\(\sigma_1^2 = \sigma_2^2\)), the ratio of the sample variances follows an \(F\)-distribution:

\[\frac{S_1^2}{S_2^2} = \frac{S_1^2 / \sigma_1^2}{S_2^2 / \sigma_2^2} \sim F_{n_1-1, n_2-1} = F_{9, 15}\] We want to evaluate: \[P\left(\frac{S_1^2}{S_2^2} < 0.3322\right) = P(F_{9, 15} < 0.3322)\] Notice that \(0.3322 \approx \frac{1}{3.01}\). We are given \(f_{0.05, 15, 9} = 3.01\). Using the reciprocal rule from Part 2:

\[f_{0.95, 9, 15} = \frac{1}{f_{0.05, 15, 9}} = \frac{1}{3.01} \approx 0.3322\] This means that \(0.3322\) is the upper \(0.95\) critical value of \(F_{9,15}\), i.e., its 5th percentile.

\[P(F_{9, 15} < f_{0.95, 9, 15}) = 1 - 0.95 = 0.05\]

Thus, the exact probability is 0.05.

Question 4: Transformations and the CLT

We define the following:

  • \(U_1, U_2, \dots, U_n \stackrel{iid}{\sim} \text{Uniform}(0, 1)\)
  • \(Y_i = -2 \ln(U_i)\) for \(i = 1, \dots, n\), and
  • \(S_n = \sum_{i=1}^n Y_i\).
  1. Given that \(Y_i = -2\ln(U_i) \sim \chi^2_2\), use the additive property of the chi-square distribution to identify the exact distribution of \(S_n = \sum_{i=1}^n Y_i\), and state its mean and variance.
  2. For \(n = 30\), we wish to approximate the probability \(P(S_{30} > 70)\). Use the Central Limit Theorem to find this approximation. Leave your answer in terms of the standard normal cumulative distribution function, \(\Phi(\cdot)\).

Answer 4:

Part 1: Given that \(Y_i = -2\ln(U_i) \sim \chi^2_2\), use the additive property of the chi-square distribution to identify the exact distribution of \(S_n = \sum_{i=1}^n Y_i\), and state its mean and variance.

Since \(Y_1, \dots, Y_n \stackrel{iid}{\sim} \chi^2_2\) and chi-square random variables are additive (i.e., the sum of independent \(\chi^2\) variables is also \(\chi^2\) with degrees of freedom equal to the sum), we have: \[S_n = \sum_{i=1}^n Y_i \sim \chi^2_{2n}\]

Using the known properties \(E[\chi^2_\nu] = \nu\) and \(\operatorname{Var}(\chi^2_\nu) = 2\nu\): \[E[S_n] = 2n, \qquad \operatorname{Var}(S_n) = 4n\]

Part 2: For \(n = 30\), we wish to approximate the probability \(P(S_{30} > 70)\).

We want to approximate \(P(S_{30} > 70)\). From Part 1, we know \(S_{30} \sim \chi^2_{60}\). The exact mean and variance of \(S_{30}\) are: \[E[S_{30}] = 60, \quad \text{Var}(S_{30}) = 2(60) = 120\] For large \(n\), the Central Limit Theorem states that the sum \(S_n\) is approximately normally distributed. Thus, \(S_{30} \approx N(60, 120)\). We want \(P(S_{30} > 70)\). Standardizing this probability: \[P\left(S_{30} > 70\right) = P\left(\frac{S_{30} - 60}{\sqrt{120}} > \frac{70 - 60}{\sqrt{120}}\right)\] \[\approx P\left(Z > \frac{10}{\sqrt{120}}\right)\] Where \(Z \sim N(0,1)\). Using the symmetry of the normal distribution, \(P(Z > z) = 1 - \Phi(z)\). \[P(S_{30} > 70) \approx 1 - \Phi\left(\frac{10}{\sqrt{120}}\right) = 1 - \Phi\left(\sqrt{\frac{100}{120}}\right) = 1 - \Phi\left(\sqrt{\frac{5}{6}}\right)\] The final answer is \(1 - \Phi\left(\sqrt{5/6}\right)\).
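A simulation cross-check of both parts (the number of replications \(B\) is illustrative), comparing the Monte Carlo tail, the exact \(\chi^2_{60}\) tail, and the CLT value:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(13)
B, n = 200_000, 30

u = rng.uniform(size=(B, n))
s30 = (-2 * np.log(u)).sum(axis=1)    # S_30 = sum of -2 ln(U_i) ~ chi^2_60

print(f"simulated P(S_30 > 70) = {np.mean(s30 > 70):.4f}")
print(f"exact chi^2_60 tail    = {stats.chi2.sf(70, df=60):.4f}")
print(f"CLT approximation      = {stats.norm.sf(10 / np.sqrt(120)):.4f}")
```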

Question 5: CLT Application, Percentiles, and Confidence Intervals

Let \(X_1, X_2, \dots, X_n\) be a random sample from a Uniform\((0, \theta)\) distribution, where \(\theta > 0\) is an unknown parameter.

  1. Find the exact mean \(\mu\) and variance \(\sigma^2\) of this distribution.
  2. Using the Central Limit Theorem, derive the asymptotic distribution of \(\sqrt{n}(\bar{X} - \mu)\).
  3. Construct an approximate \(95\%\) confidence interval for \(\theta\) using the CLT. Specifically, use the \(2.5\%\) and \(97.5\%\) percentiles of the relevant asymptotic distribution.
  4. Briefly explain why a standard \(t\)-distribution confidence interval is technically inappropriate here, even though we are estimating a mean.

Answer 5

Part 1: Find the exact mean \(\mu\) and variance \(\sigma^2\) of this distribution. For \(X \sim \text{Uniform}(0, \theta)\), the probability density function is \(f(x) = 1/\theta\) for \(0 < x < \theta\). \[ \mu = E[X] = \int_0^\theta \frac{x}{\theta} dx = \left[ \frac{x^2}{2\theta} \right]_0^\theta = \frac{\theta}{2} \] \[ E[X^2] = \int_0^\theta \frac{x^2}{\theta} dx = \left[ \frac{x^3}{3\theta} \right]_0^\theta = \frac{\theta^2}{3} \] \[ \sigma^2 = \text{Var}(X) = E[X^2] - (E[X])^2 = \frac{\theta^2}{3} - \frac{\theta^2}{4} = \frac{\theta^2}{12} \]

Part 2: Using the Central Limit Theorem, derive the asymptotic distribution of \(\sqrt{n}(\bar{X} - \mu)\). By the Central Limit Theorem, for large \(n\): \[ \sqrt{n}(\bar{X} - \mu) \xrightarrow{d} N(0, \sigma^2) \] Substituting \(\mu = \theta/2\) and \(\sigma^2 = \theta^2/12\), we get: \[ \sqrt{n}\left(\bar{X} - \frac{\theta}{2}\right) \xrightarrow{d} N\left(0, \frac{\theta^2}{12}\right) \]

Part 3: Construct an approximate \(95\%\) confidence interval for \(\theta\) using the CLT. Specifically, use the \(2.5\%\) and \(97.5\%\) percentiles of the relevant asymptotic distribution. We want to find bounds \(L\) and \(U\) such that \(P(L \leq \theta \leq U) \approx 0.95\). From the CLT approximation, we can write: \[ P\left( -z_{0.025} \leq \frac{\sqrt{n}(\bar{X} - \theta/2)}{\sqrt{\theta^2/12}} \leq z_{0.025} \right) \approx 0.95 \] where \(z_{0.025} = 1.96\) is the \(97.5\%\) percentile of the standard normal distribution (and \(-1.96\) is the \(2.5\%\) percentile). \[ P\left( -1.96 \leq \frac{\sqrt{12n}(\bar{X} - \theta/2)}{\theta} \leq 1.96 \right) \approx 0.95 \] Multiplying by \(\theta\) and rearranging to isolate \(\theta\): \[ P\left( -1.96\theta \leq \sqrt{12n}(\bar{X} - \theta/2) \leq 1.96\theta \right) \] \[ P\left( \theta \left( \frac{\sqrt{12n}}{2} - 1.96 \right) \leq \sqrt{12n}\bar{X} \leq \theta \left( \frac{\sqrt{12n}}{2} + 1.96 \right) \right) \] Let \(c = \sqrt{3n}\). The interval is formed by inverting the inequalities: \[ \theta \geq \frac{\sqrt{12n}\bar{X}}{c + 1.96} \quad \text{and} \quad \theta \leq \frac{\sqrt{12n}\bar{X}}{c - 1.96} \] Thus, the approximate 95% CI is: \[ \left( \frac{2\sqrt{3n}\bar{X}}{\sqrt{3n} + 1.96}, \frac{2\sqrt{3n}\bar{X}}{\sqrt{3n} - 1.96} \right) \]

Part 4: Briefly explain why a standard \(t\)-distribution confidence interval is technically inappropriate here, even though we are estimating a mean.

The exact \(t\)-distribution result requires sampling from a normal population: it is built from \(\bar{X}\) being exactly normal and \((n-1)S^2/\sigma^2\) being an independent \(\chi^2_{n-1}\) variable. Here the underlying population is Uniform, so neither fact holds for finite \(n\). Moreover, the parameter \(\theta\) appears in both the mean (\(\theta/2\)) and the standard deviation (\(\theta/\sqrt{12}\)), so the interval must be obtained by the algebraic inversion above rather than the usual \(\bar{X} \pm t_{\alpha/2,\,n-1}\, S/\sqrt{n}\) template; a standard \(t\)-interval would ignore the fact that the variance is tied to the very parameter being estimated.

Question 6: The Fisherian P-value Procedure

A city’s emergency call centre monitors the hourly arrival rate of calls. Let \(X_1, \dots, X_n \stackrel{iid}{\sim} \text{Poisson}(\theta)\), where \(\theta\) is the true mean number of calls per hour. The historical baseline is \(\theta_0 = 2\) calls per hour.

Define the test statistic \(T = \sum_{i=1}^n X_i\).

You are given that for \(X \sim \text{Poisson}(\theta)\): \(E[X] = \theta\) and \(\text{Var}(X) = \theta\).

A recent sample of \(n = 50\) hours yields \(\sum x_i = 120\) observed calls.

  1. State the exact distribution of \(T = \sum_{i=1}^n X_i\) under \(H_0\), justifying your answer using a standard property of the Poisson distribution. Hence write down the exact p-value for testing \(H_0: \theta = 2\) against \(H_a: \theta > 2\) as a summation expression. Explain why this expression is computationally intractable by hand, and state the numerically evaluated value. Compare this to the CLT approximation you will derive in part 3.
  2. Under \(H_0: \theta = \theta_0\), compute \(\eta = E[T \mid H_0]\) and \(\text{Var}(T \mid H_0)\). Hence write down the approximate null distribution of \(T\) for large \(n\), citing the theorem you use.
  3. Apply the Fisherian p-value procedure to test \(H_0: \theta = 2\) against \(H_a: \theta > 2\). Explicitly state each step: identify \(\eta\), determine the direction of evidence, and calculate the p-value. Conclude at \(\alpha = 0.05\).
  4. Now test \(H_0: \theta = 2\) against \(H_a: \theta < 2\) using the same observed data. Without recalculating \(Z\), state the p-value and explain why it is very large.
  5. Now test \(H_0: \theta = 2\) against \(H_a: \theta \neq 2\). State the p-value and conclude at \(\alpha = 0.05\).
  6. A colleague instead uses \(T^* = \bar{X}\) as their test statistic. Without any new calculations, state the p-values for the three hypotheses in Parts 3–5. Explain in one sentence why they are identical to those from \(T\).

Hint: You may use: \(\Phi(2.0) = 0.9772\)

Answer 6

Part 1: State the exact distribution of \(T = \sum_{i=1}^n X_i\) under \(H_0\). Hence write down the exact p-value for testing \(H_0: \theta = 2\) against \(H_a: \theta > 2\) as a summation expression.

Exact distribution of \(T\) under \(H_0\):

Since \(X_1, \dots, X_n \stackrel{iid}{\sim} \text{Poisson}(\theta_0)\) and independent Poisson random variables are additive — that is, the sum of independent \(\text{Poisson}(\theta)\) variables is \(\text{Poisson}\) with rate equal to the sum of the individual rates — we have the exact null distribution:

\[T = \sum_{i=1}^{50} X_i \sim \text{Poisson}(n\theta_0) = \text{Poisson}(100)\]

This is an exact result requiring no approximation.

Exact p-value:

For \(H_a: \theta > 2\), the evidence lies in the upper tail. The exact p-value is:

\[p_{\text{exact}} = P(T \ge 120 \mid T \sim \text{Poisson}(100)) = \sum_{t=120}^{\infty} \frac{e^{-100} \cdot 100^t}{t!}\]

Why it is intractable by hand: The tail is an infinite series (and even the complement \(1 - P(T \le 119)\) still requires 120 terms), each term involving \(100^t / t!\), a ratio of astronomically large numbers. There is no closed form. Numerical evaluation gives:

\[p_{\text{exact}} \approx 0.028\]

Comparison with CLT: The CLT approximation (derived in part 3) gives \(p \approx 0.023\). The CLT underestimates slightly because the Poisson(100), while well-approximated by a normal, retains mild right-skew — the exact upper tail is marginally heavier than the normal predicts. The discrepancy is small here because \(n = 50\) is reasonably large, but it motivates why the exact distribution should always be the first consideration when it is tractable.

Note on continuity correction: Since \(T\) is discrete and the normal is continuous, a refined CLT approximation uses \(P(T \ge 120) \approx P(Z \ge \frac{119.5 - 100}{10}) = P(Z \ge 1.95) = 1 - \Phi(1.95) \approx 0.026\), which is noticeably closer to the exact value of 0.028.
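
Numerically, the "intractable" sum is a one-liner via the Poisson survival function. A sketch assuming scipy; note that \(P(T \ge 120) = P(T > 119)\):

```python
# Exact Poisson tail versus the plain and continuity-corrected CLT values.
from scipy import stats

exact = stats.poisson.sf(119, mu=100)      # P(T >= 120), ~0.028
clt   = stats.norm.sf((120 - 100) / 10)    # plain CLT, ~0.023
cc    = stats.norm.sf((119.5 - 100) / 10)  # continuity-corrected, ~0.026

print(f"exact = {exact:.4f}, clt = {clt:.4f}, cc = {cc:.4f}")
```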

Part 2: Write down the approximate null distribution of \(T\) for large \(n\)

Since \(X_1, \dots, X_n \stackrel{iid}{\sim} \text{Poisson}(\theta)\), the sum \(T = \sum_{i=1}^n X_i\) is also a sum of i.i.d. random variables with:

\[\eta = E[T \mid H_0] = n\theta_0 = 50 \times 2 = 100\] \[\text{Var}(T \mid H_0) = n\theta_0 = 50 \times 2 = 100\]

By the Central Limit Theorem, since the \(X_i\) are i.i.d. with finite mean and finite variance, for large \(n\):

\[T \;\dot{\sim}\; N(100,\ 100) \quad \text{under } H_0\]

The standardised statistic is:

\[Z = \frac{T - \eta}{\sqrt{\text{Var}(T \mid H_0)}} = \frac{120 - 100}{\sqrt{100}} = \frac{20}{10} = 2.0\]

Part 3: Apply the Fisherian p-value procedure to test \(H_0: \theta = 2\) against \(H_a: \theta > 2\)

Following the Fisherian procedure:

Step 1 — Find \(\eta\): \(\eta = E[T \mid H_0] = 100\)

Step 2 — Direction from \(H_a\): If \(\theta > \theta_0\) then calls are arriving faster than baseline, so \(T\) should be larger than \(\eta\). Evidence points to the upper tail.

Step 3 — Integrate the tail:

\[p = P(T \ge 120 \mid H_0) \approx P(Z \ge 2.0) = 1 - \Phi(2.0) = 1 - 0.9772 = 0.0228\]

Step 4 — Conclude: Since \(p = 0.023 < 0.05\), we reject \(H_0\). There is significant evidence that the true call rate exceeds 2 per hour.

Part 4: Test \(H_0: \theta = 2\) against \(H_a: \theta < 2\) using the same observed data.

Direction from \(H_a\): If \(\theta < \theta_0\), then \(T\) should be smaller than \(\eta\). Evidence points to the lower tail.

\[p = P(T \le 120 \mid H_0) \approx P(Z \le 2.0) = \Phi(2.0) = 0.9772\]

The p-value is very large because the observed sum \(T = 120\) lies well above \(\eta = 100\): the data point in the direction opposite to this alternative. Under \(H_0\), nearly 98% of possible outcomes are at least as small as the one observed, so the data offer essentially no support for \(\theta < 2\).

Conclude: \(p = 0.977 \gg 0.05\). We fail to reject \(H_0\). The data provide no evidence of a rate below 2.

Part 5: Test \(H_0: \theta = 2\) against \(H_a: \theta \neq 2\)

Direction from \(H_a\): \(H_a\) claims only that \(\theta\) differs from \(\theta_0\) in either direction. Evidence is extreme in either tail.

\[p = P(|T - \eta| \ge |120 - 100| \mid H_0) = P(|Z| \ge 2.0) = 2 \times P(Z \ge 2.0) = 2 \times 0.0228 = 0.0456\]

Conclude: Since \(p = 0.046 < 0.05\), we reject \(H_0\) at the 5% level. There is significant evidence that the call rate differs from the baseline of 2 per hour. Note this is a borderline result — at \(\alpha = 0.01\) we would not reject.

Part 6: A colleague instead uses \(T^* = \bar{X}\) as their test statistic. Without any new calculations, state the p-values for the three hypotheses in Parts 3–5. Explain in one sentence why they are identical to those from \(T\).

The p-values are identical for all three alternatives:

| Alternative | p-value from \(T\) | p-value from \(T^*\) |
|---|---|---|
| \(H_a: \theta > 2\) | 0.0228 | 0.0228 |
| \(H_a: \theta < 2\) | 0.9772 | 0.9772 |
| \(H_a: \theta \neq 2\) | 0.0456 | 0.0456 |

Why: \(T^* = \bar{X} = T/n\) is a strictly monotone increasing transformation of \(T\). Any probability statement about \(T\) exceeding a threshold is equivalent to the same statement about \(T^*\) exceeding the corresponding scaled threshold — the tail areas are preserved exactly. Formally, the standardised statistic is identical:

\[Z^* = \frac{\bar{X} - \theta_0}{\sqrt{\theta_0/n}} = \frac{T/n - \theta_0}{\sqrt{\theta_0/n}} = \frac{T - n\theta_0}{\sqrt{n\theta_0}} = Z = 2.0\]

The choice of statistic does not affect the p-value when both statistics are monotone functions of each other.

Question 7: Point Estimation, Pivotal Quantities, and the t-Distribution

Let \(X_1, X_2, \dots, X_n\) be a random sample from a \(N(\mu, \sigma^2)\) distribution where both \(\mu\) and \(\sigma^2\) are unknown.

  1. State the Maximum Likelihood Estimator (MLE) for \(\mu\) when \(\sigma^2\) is unknown, and separately state the plug-in estimator for \(\mu\). Show that both methods yield the same estimator, and confirm that it is unbiased.
  2. Show that the statistic \(T = \frac{\bar{X} - \mu}{S / \sqrt{n}}\) (where \(S^2\) is the unbiased sample variance) is a pivotal quantity.
  3. Rigorously state the exact distribution of \(T\) and explain the role of the independence between \(\bar{X}\) and \(S^2\) in deriving this distribution.
  4. Construct a \(100(1-\alpha)\%\) confidence interval for \(\mu\). Discuss what happens to the width of this interval as \(n \to \infty\) and mathematically relate it back to the standard normal distribution.

Answer 7

Part 1: State the Maximum Likelihood Estimator (MLE) for \(\mu\) when \(\sigma^2\) is unknown, and separately state the plug-in estimator for \(\mu\). Show that both methods yield the same estimator, and confirm that it is unbiased.

MLE: The log-likelihood is \(\ell(\mu, \sigma^2) = -\frac{n}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum(X_i - \mu)^2\). Differentiating with respect to \(\mu\) and setting to zero gives: \[\frac{\partial \ell}{\partial \mu} = \frac{1}{\sigma^2}\sum(X_i - \mu) = 0 \implies \hat{\mu}_{MLE} = \bar{X}\]

Plug-in: Since the population mean is \(\mu = E_F[X]\), the plug-in principle replaces \(F\) with \(\mathbb{F}_n\), giving \(\hat{\mu}_{plug-in} = \bar{X}\).

Unbiasedness: \(E[\bar{X}] = \frac{1}{n}\sum E[X_i] = \mu\).

Part 2: Show that the statistic \(T = \frac{\bar{X} - \mu}{S / \sqrt{n}}\) is a pivotal quantity.

A pivotal quantity is a function of both the data and the parameter whose distribution does not depend on any unknown parameters. Writing \[ T = \frac{\bar{X} - \mu}{S / \sqrt{n}} = \frac{(\bar{X} - \mu)/(\sigma/\sqrt{n})}{S/\sigma}, \] the numerator is exactly \(N(0,1)\), while the denominator \(S/\sigma = \sqrt{\chi^2_{n-1}/(n-1)}\) has a distribution free of both \(\mu\) and \(\sigma\). The unknown \(\sigma\) cancels in the ratio and the shift \(\mu\) is standardized away, so (as shown in the next part) the distribution of \(T\) depends only on \(n\) through the degrees of freedom, making \(T\) a pivotal quantity.

Part 3: Rigorously state the exact distribution of \(T\) and explain the role of the independence between \(\bar{X}\) and \(S^2\) in deriving this distribution.

The exact distribution of \(T\) is the Student’s t-distribution with \(n-1\) degrees of freedom, denoted \(t_{n-1}\). To rigorously derive this, we rely on three facts for normal samples:

  1. \(\bar{X} \sim N(\mu, \sigma^2/n)\), which implies \(Z = \frac{\bar{X}-\mu}{\sigma/\sqrt{n}} \sim N(0,1)\).
  2. \(V = \frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}\).
  3. Independence: \(\bar{X}\) and \(S^2\) are independent, a special property of normal samples. This is what allows us to form the ratio of a \(N(0,1)\) variable and an independent \(\sqrt{\chi^2_{n-1}/(n-1)}\), which is precisely the definition of the \(t\)-distribution. Because \(Z\) and \(V\) are independent, we can form their ratio: \[ T = \frac{Z}{\sqrt{V / (n-1)}} = \frac{\frac{\bar{X}-\mu}{\sigma/\sqrt{n}}}{\sqrt{\frac{(n-1)S^2}{\sigma^2(n-1)}}} = \frac{\bar{X} - \mu}{S / \sqrt{n}} \sim t_{n-1} \] Without independence, \(Z\) and \(V\) could be correlated, and their ratio would not follow the \(t\)-distribution.

Part 4: Construct a \(100(1-\alpha)\%\) confidence interval for \(\mu\). Discuss what happens to the width of this interval as \(n \to \infty\) and mathematically relate it back to the standard normal distribution.

Let \(t_{\alpha/2, n-1}\) be the upper \(\alpha/2\) percentile of the \(t_{n-1}\) distribution. \[ P\left( -t_{\alpha/2, n-1} \leq \frac{\bar{X} - \mu}{S / \sqrt{n}} \leq t_{\alpha/2, n-1} \right) = 1 - \alpha \] The \(100(1-\alpha)\%\) CI for \(\mu\) is: \[ \left( \bar{X} - t_{\alpha/2, n-1} \frac{S}{\sqrt{n}}, \quad \bar{X} + t_{\alpha/2, n-1} \frac{S}{\sqrt{n}} \right) \] As \(n \to \infty\), the half-width \(t_{\alpha/2, n-1}\, S/\sqrt{n}\) behaves like \(z_{\alpha/2}\, \sigma/\sqrt{n}\) and shrinks to zero at rate \(1/\sqrt{n}\). Two limits drive this: by the Law of Large Numbers, \(S^2 \xrightarrow{P} \sigma^2\), and as the degrees of freedom \(\nu \to \infty\), the \(t_\nu\) distribution converges to the standard normal \(N(0,1)\), so \(t_{\alpha/2, n-1} \to z_{\alpha/2}\). For large \(n\) the interval is therefore indistinguishable from the \(z\)-interval \(\bar{X} \pm z_{\alpha/2}\, S/\sqrt{n}\).
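
The convergence \(t_{\alpha/2, n-1} \to z_{\alpha/2}\) is easy to watch numerically. A small sketch, assuming scipy:

```python
# Upper 2.5% quantile of t_nu marching toward the normal value 1.96.
from scipy import stats

for df in (5, 10, 30, 100, 1000):
    print(f"t(0.975, df={df:>4}) = {stats.t.ppf(0.975, df):.4f}")
print(f"z(0.975)          = {stats.norm.ppf(0.975):.4f}")  # 1.9600
```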

Question 8: Hypothesis Testing, Chi-Square Distribution, and Bootstrap

Suppose you are given a random sample \(X_1, \dots, X_n\) from a distribution with unknown variance \(\sigma^2\). You wish to test \(H_0: \sigma^2 = \sigma_0^2\) versus \(H_1: \sigma^2 \neq \sigma_0^2\).

  1. Assuming the population is normally distributed, state the test statistic and rejection region using the \(\chi^2\) distribution.
  2. Now assume the population is not normally distributed. Explain why the test in part 1 is no longer valid.

Answer 8

Part 1. Assuming the population is normally distributed, state the test statistic and rejection region using the \(\chi^2\) distribution.

Under normality, the test statistic is: \[ \chi^2 = \frac{(n-1)S^2}{\sigma_0^2} \] Under \(H_0\), this exactly follows a \(\chi^2_{n-1}\) distribution. The rejection region for a two-sided test at level \(\alpha\) is: \[ \text{Reject } H_0 \text{ if } \frac{(n-1)S^2}{\sigma_0^2} < \chi^2_{1-\alpha/2, n-1} \quad \text{or} \quad \frac{(n-1)S^2}{\sigma_0^2} > \chi^2_{\alpha/2, n-1} \] (Note: \(\chi^2_{1-\alpha/2}\) is the lower tail percentile, and \(\chi^2_{\alpha/2}\) is the upper tail percentile).

Part 2: Now assume the population is not normally distributed. Explain why the test in part 1 is no longer valid.

The derivation that \((n-1)S^2/\sigma^2 \sim \chi^2_{n-1}\) relies strictly on the underlying random variables being independent normal. If the population is not normal, the sampling distribution of \(S^2\) is not a scaled chi-square. The actual Type I error rate of the test will likely deviate significantly from the nominal \(\alpha\) level, leading to an invalid test.
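
This breakdown is easy to demonstrate by simulation. The sketch below estimates the actual Type I error of the \(\chi^2\) variance test when sampling from an Exponential(1) population, for which \(\sigma^2 = 1\), so \(H_0: \sigma^2 = 1\) is true; the sample size and replication count are illustrative choices:

```python
# Actual rejection rate of the chi-square variance test under a skewed
# (exponential) population when H0 is in fact true.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, reps, alpha, sigma2_0 = 30, 20_000, 0.05, 1.0
lo = stats.chi2.ppf(alpha / 2, df=n - 1)
hi = stats.chi2.ppf(1 - alpha / 2, df=n - 1)

rejects = 0
for _ in range(reps):
    s2 = rng.exponential(1.0, size=n).var(ddof=1)
    v = (n - 1) * s2 / sigma2_0
    rejects += (v < lo) or (v > hi)

print(f"actual Type I error ≈ {rejects / reps:.3f} (nominal {alpha})")
```

For an exponential population the empirical rejection rate typically lands several times above the nominal 5%, which is precisely the invalidity described above.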

Question 9: Simulation, CLT, and Nonparametric Models

Let \(X_1, \dots, X_n\) be a random sample from an unknown, continuous probability distribution \(F\) (a nonparametric model). Let \(\eta\) be the population median of \(F\). Let \(\hat{\eta}\) be the sample median.

  1. The bootstrap is generally the preferred method for constructing confidence intervals for the sample median. Briefly explain why a CLT-based interval for the median is more difficult to use in practice than a CLT-based interval for the mean, without needing to state the exact variance formula.
  2. Given the difficulty identified in Part 1, explain in one sentence why the nonparametric bootstrap percentile interval sidesteps this problem entirely.
  3. Suppose \(n = 25\) (a small sample). Explain why relying on the asymptotic CLT distribution might be unreliable.
  4. Describe a detailed nonparametric bootstrap simulation procedure to construct a \(95\%\) confidence interval for \(\eta\) that does not rely on the CLT formula from part 1. Explicitly state how the percentiles of the bootstrap distribution are used.

Answer 9

Part 1: Briefly explain why a CLT-based interval for the median is more difficult to use in practice than a CLT-based interval for the mean, without needing to state the exact variance formula.

For the sample mean, the asymptotic variance \(\sigma^2/n\) involves only the population variance \(\sigma^2\), which can be consistently estimated by \(S^2\). For the sample median, the asymptotic variance depends on \(f(\eta)\) — the value of the population density at the true median. This quantity is not a simple moment of the distribution and cannot be estimated directly from data without additional techniques (such as kernel density estimation), which introduce their own assumptions and variability. The CLT interval for the median is therefore impractical to construct reliably, especially in small samples, motivating the bootstrap approach.

Part 2: Given the difficulty identified in Part 1, explain in one sentence why the nonparametric bootstrap percentile interval sidesteps this problem entirely.

The bootstrap directly simulates the sampling distribution of \(\hat{\eta}\) from the data itself, so it never requires estimating \(f(\eta)\) or any other density quantity — the interval boundaries emerge purely from the empirical quantiles of the bootstrap replicates.

Part 3: Suppose \(n = 25\) (a small sample). Explain why relying on the asymptotic CLT distribution might be unreliable.

For \(n=25\), how closely the sampling distribution of \(\hat{\eta}\) approaches normality depends heavily on the shape of the underlying distribution \(F\). If \(F\) is skewed or heavy-tailed, \(n=25\) may be far too small for the CLT to give a good approximation. Furthermore, the density value \(f(\eta)\) is notoriously difficult to estimate accurately from only 25 data points; small errors in estimating \(f(\eta)\) directly distort the width of the confidence interval, making it highly unstable.

Part 4: Describe a detailed nonparametric bootstrap simulation procedure to construct a \(95\%\) confidence interval for \(\eta\) that does not rely on the CLT formula from part 1.

To avoid the CLT entirely, we use the bootstrap to empirically estimate the sampling distribution of \(\hat{\eta}\):

  1. From the original sample \(x_1, \dots, x_{25}\), draw a new sample of size 25 with replacement. This is bootstrap sample 1.
  2. Calculate the sample median of bootstrap sample 1, denoted \(\hat{\eta}_1^*\).
  3. Repeat steps 1 and 2 a large number of times (e.g., \(B = 10,000\)) to obtain the bootstrap distribution: \(\hat{\eta}_1^*, \hat{\eta}_2^*, \dots, \hat{\eta}_{10000}^*\).
  4. Sort the \(10,000\) bootstrap medians in ascending order.
  5. Find the \(2.5\%\) and \(97.5\%\) empirical percentiles of this sorted list.
    • The \(2.5\%\) percentile is the value at position \(0.025 \times (10000 + 1) \approx 250\).
    • The \(97.5\%\) percentile is the value at position \(0.975 \times (10000 + 1) \approx 9751\).
  6. The Bootstrap Percentile Confidence Interval is simply: \[ \left[ \hat{\eta}^*_{(0.025)}, \hat{\eta}^*_{(0.975)} \right] \] This method directly uses the percentiles of the simulated bootstrap distribution to capture the true variability of the median without requiring normality, formulas for standard errors, or estimation of the density \(f(\eta)\).
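
The six steps translate almost line for line into code. A minimal sketch, where the data array is a hypothetical placeholder standing in for the observed sample of 25:

```python
# Bootstrap percentile interval for the population median.
import numpy as np

rng = np.random.default_rng(42)
x = rng.exponential(scale=2.0, size=25)   # placeholder for the real data

B = 10_000
boot_medians = np.array([
    np.median(rng.choice(x, size=x.size, replace=True)) for _ in range(B)
])

lo, hi = np.percentile(boot_medians, [2.5, 97.5])
print(f"95% bootstrap percentile CI for the median: ({lo:.3f}, {hi:.3f})")
```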

Question 10: CLT-Based Inference for a Non-Normal Population

Let \(X_1, X_2, \dots, X_n \stackrel{iid}{\sim} \text{Exponential}(\theta)\), with pdf:

\[f(x \mid \theta) = \theta e^{-\theta x}, \quad x > 0, \quad \theta > 0\]

You are given that the population mean is \(\mu = \frac{1}{\theta}\) and the population variance is \(\sigma^2 = \frac{1}{\theta^2}\). (These are provided — no derivation required.)

A random sample of \(n = 36\) observations yields \(\bar{x} = 2.5\) and \(S = 2.4\).

  1. State the approximate sampling distribution of \(\bar{X}\) for large \(n\), citing the relevant theorem. Write down the approximate standard error, replacing any unknown parameters with their sample equivalents.

  2. Construct the test statistic for \(H_0: \mu = 2.0\) against \(H_a: \mu \neq 2.0\). Calculate the approximate p-value and state your conclusion at \(\alpha = 0.05\).

  3. Construct an approximate 95% confidence interval for \(\mu\). State one limitation of this interval in the current setting.

  4. Describe how you would construct a nonparametric bootstrap percentile interval for \(\mu\) instead. Explain why it may be preferred over the Wald interval here.

  5. A colleague suggests applying the one-sample \(\chi^2\) variance test (\(V = \frac{(n-1)S^2}{\sigma_0^2}\)) to test \(H_0: \sigma^2 = 4.0\). Explain why this test is far less reliable than the mean-based test in Part 2, even with \(n = 36\).

Answer 10

Part 1: State the approximate sampling distribution of \(\bar{X}\) for large \(n\), citing the relevant theorem.

The Central Limit Theorem states that for any i.i.d. population with finite mean \(\mu\) and finite variance \(\sigma^2\), regardless of the population’s shape:

\[\sqrt{n}\left(\frac{\bar{X} - \mu}{\sigma}\right) \xrightarrow{d} N(0,1) \quad \text{as } n \to \infty\]

Therefore, for large \(n\) (here \(n = 36 \ge 30\)):

\[\bar{X} \;\dot{\sim}\; N\!\left(\mu,\; \frac{\sigma^2}{n}\right)\]

Since \(\sigma\) is unknown, we substitute \(S\). The approximate standard error is:

\[\widehat{\text{SE}}(\bar{X}) = \frac{S}{\sqrt{n}} = \frac{2.4}{\sqrt{36}} = \frac{2.4}{6} = 0.4\]

Note: The exponential distribution is right-skewed, so this is an asymptotic approximation — not an exact result. The CLT justifies it for \(n \ge 30\).

Part 2: Construct the test statistic for \(H_0: \mu = 2.0\) against \(H_a: \mu \neq 2.0\). Calculate the approximate p-value and state your conclusion at \(\alpha = 0.05\)

Test statistic:

\[T^* = \frac{\bar{X} - \mu_0}{S/\sqrt{n}} = \frac{2.5 - 2.0}{0.4} = \frac{0.5}{0.4} = 1.25\]

For large \(n\), \(T^* \;\dot{\sim}\; N(0,1)\) under \(H_0\) (since \(S \xrightarrow{p} \sigma\) by consistency, so Slutsky’s theorem ensures the approximation holds).

Two-tailed p-value:

\[p = 2 \cdot P(Z > 1.25) = 2 \cdot (1 - \Phi(1.25)) = 2 \cdot (1 - 0.8944) = 2 \times 0.1056 \approx 0.211\]

Conclusion: Since \(p \approx 0.211 > 0.05\), we fail to reject \(H_0\). The data are consistent with a population mean of 2.0.

Part 3: Construct an approximate 95% confidence interval for \(\mu\)

The asymptotic (Wald) interval is:

\[\bar{x} \pm z_{\alpha/2} \cdot \widehat{\text{SE}} = 2.5 \pm 1.96 \times 0.4 = 2.5 \pm 0.784\]

\[\boxed{(1.716,\;\; 3.284)}\]

Limitation: The Wald interval assumes the sampling distribution of \(\bar{X}\) is approximately symmetric around the estimate. For a skewed population like the Exponential with moderate \(n\), the true sampling distribution may still carry noticeable right-skew, meaning the interval’s actual coverage can fall below the nominal 95%. The Wald interval may also produce poor coverage near parameter boundaries.

Part 4: Describe how you would construct a nonparametric bootstrap percentile interval for \(\mu\) instead. Explain why it may be preferred over the Wald interval here.

Procedure

  1. Resample: Draw \(B\) bootstrap samples \(\mathbf{x}^{*b} = \{x_1^{*b}, \dots, x_{36}^{*b}\}\) from the observed data with replacement, each of size \(n = 36\).
  2. Compute: For each resample, calculate \(\bar{x}^{*b}\).
  3. Aggregate: Collect the \(B\) replicates \(\{\bar{x}^{*1}, \dots, \bar{x}^{*B}\}\) to form the empirical sampling distribution.
  4. Interval: Read off the 2.5th and 97.5th percentiles of the replicates:

\[\left[\bar{x}^*_{(0.025)},\;\; \bar{x}^*_{(0.975)}\right]\]

Why it may be preferred: The bootstrap makes no distributional assumptions — it lets the observed data’s shape (including its skewness) determine the interval’s boundaries directly. For a skewed population like the Exponential, it will naturally produce an asymmetric interval that better reflects the true sampling distribution, whereas the Wald interval forces symmetry and can therefore undercover on one tail.

Part 5: Explain why this test is far less reliable than the mean-based test in Part 2, even with \(n = 36\).

The one-sample variance test \(V = \frac{(n-1)S^2}{\sigma_0^2} \sim \chi^2_{n-1}\) rests on a strict normality assumption, for two reasons outlined in the notes (Section 8):

1. The distributional result is exact only under normality. The \(\chi^2\) distribution arises because, for a normal population, the quantity \((n-1)S^2/\sigma^2\) is exactly a sum of squared standard normals. The Exponential population does not produce squared-normal deviations, so the null distribution of \(V\) is not \(\chi^2_{n-1}\).

2. The CLT does not rescue variance-based tests. The CLT stabilises the mean \(\bar{X}\) toward normality for large \(n\), giving mean-based tests robustness. But the sample variance \(S^2\) is a function of squared deviations; its sampling distribution is sensitive to the skewness and kurtosis of the underlying population in a way the CLT does not correct at moderate \(n\). The notes explicitly state: “skewness and heavy tails can significantly distort the null distribution [of variance tests], affecting p-values and error rates — even for moderately large samples.”

Consequence here: The Exponential distribution is substantially right-skewed. Using the \(\chi^2\) test with \(n = 36\) would produce unreliable p-values and incorrect Type I error rates. A bootstrap test (Section 7.3) — scaling the data to satisfy \(H_0: \sigma^2 = 4.0\) before resampling — would be the appropriate alternative.
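
For concreteness, one common construction of such a bootstrap variance test is sketched below: rescale the data so the sample variance equals \(\sigma_0^2 = 4.0\) (so \(H_0\) holds exactly in the resampling world), then ask how unusual the observed variance is. The data array and the two-sided p-value construction are illustrative assumptions, not a prescribed recipe:

```python
# Bootstrap test of H0: sigma^2 = 4.0 without any normality assumption.
import numpy as np

rng = np.random.default_rng(7)
x = rng.exponential(scale=2.5, size=36)   # placeholder for the real data
sigma2_0 = 4.0

s2_obs = x.var(ddof=1)
x_null = x * np.sqrt(sigma2_0 / s2_obs)   # sample variance is now exactly 4.0

B = 10_000
s2_boot = np.array([
    rng.choice(x_null, size=x_null.size, replace=True).var(ddof=1)
    for _ in range(B)
])

# Two-sided p-value: fraction of bootstrap variances at least as far from
# sigma2_0 as the observed variance.
p = np.mean(np.abs(s2_boot - sigma2_0) >= abs(s2_obs - sigma2_0))
print(f"bootstrap p-value ≈ {p:.3f}")
```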

Question 11: Two-Sample t-Test

A study compares exam scores between two teaching methods.

  • Group A (new method): \(n_1 = 15\), \(\bar{x}_1 = 78.0\), \(S_1 = 6.0\)
  • Group B (traditional): \(n_2 = 11\), \(\bar{x}_2 = 71.0\), \(S_2 = 7.0\)

Assume both populations are approximately normal.

  1. Test \(H_0: \mu_1 = \mu_2\) against \(H_a: \mu_1 \neq \mu_2\) at \(\alpha = 0.05\) using the equal-variance two-sample \(t\)-test. Clearly show the pooled variance \(S_p^2\), the test statistic \(T\), the degrees of freedom, and your conclusion. [Given: \(t_{0.025,\, 24} = 2.064\)]

  2. Construct a 95% confidence interval for \(\mu_1 - \mu_2\) using the same pooled approach. Verify that your interval is consistent with the conclusion in Part 1.

  3. The \(F\)-test for equal variances gives \(F = S_1^2 / S_2^2 = 36/49 \approx 0.735\). Given \(f_{0.025,\, 14,\, 10} = 3.80\) and \(f_{0.975,\, 14,\, 10} = 1/f_{0.025,\, 10,\, 14} \approx 0.272\), assess whether the homoscedasticity assumption appears reasonable.

  4. Explain why the \(F\)-test in Part 3 depends more heavily on the normality assumption than the \(t\)-test in Part 1.

Answer 11

Part 1: Test \(H_0: \mu_1 = \mu_2\) against \(H_a: \mu_1 \neq \mu_2\) at \(\alpha = 0.05\) using the equal-variance two-sample \(t\)-test.

Pooled variance: \[S_p^2 = \frac{(n_1-1)S_1^2 + (n_2-1)S_2^2}{n_1+n_2-2} = \frac{14 \times 36 + 10 \times 49}{24} = \frac{504 + 490}{24} = \frac{994}{24} \approx 41.42\] \[S_p \approx 6.44\]

Test statistic: \[T = \frac{\bar{x}_1 - \bar{x}_2}{S_p\sqrt{1/n_1 + 1/n_2}} = \frac{78.0 - 71.0}{6.44\sqrt{1/15 + 1/11}} = \frac{7.0}{6.44 \times \sqrt{0.1576}} = \frac{7.0}{6.44 \times 0.397} = \frac{7.0}{2.557} \approx 2.74\]

Degrees of freedom: \(n_1 + n_2 - 2 = 24\)

Decision: \(|T| = 2.74 > t_{0.025,\, 24} = 2.064\), so we reject \(H_0\) at \(\alpha = 0.05\). There is sufficient evidence that the two teaching methods produce different mean scores.
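
As a cross-check, scipy can run this pooled test directly from the summary statistics. A minimal sketch:

```python
# Two-sample pooled t-test from summary statistics.
from scipy.stats import ttest_ind_from_stats

t, p = ttest_ind_from_stats(mean1=78.0, std1=6.0, nobs1=15,
                            mean2=71.0, std2=7.0, nobs2=11,
                            equal_var=True)
print(f"T = {t:.2f}, p = {p:.4f}")  # T ≈ 2.74, p ≈ 0.011 < 0.05
```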

Part 2: Construct a 95% confidence interval for \(\mu_1 - \mu_2\) using the same pooled approach. Verify that your interval is consistent with the conclusion in Part 1.

\[(\bar{x}_1 - \bar{x}_2) \pm t_{0.025,\,24} \cdot S_p\sqrt{1/n_1+1/n_2} = 7.0 \pm 2.064 \times 2.557 = 7.0 \pm 5.28\] \[\Rightarrow (1.72,\ 12.28)\]

The interval does not contain 0, consistent with rejecting \(H_0\) in Part 1.

Part 3: The \(F\)-test for equal variances gives \(F = S_1^2 / S_2^2 = 36/49 \approx 0.735\). Given \(f_{0.025,\, 14,\, 10} = 3.80\) and \(f_{0.975,\, 14,\, 10} = 1/f_{0.025,\, 10,\, 14} \approx 0.272\), assess whether the homoscedasticity assumption appears reasonable.

Under \(H_0: \sigma_1^2 = \sigma_2^2\), the rejection region for the two-sided \(F\)-test is \(F < 0.272\) or \(F > 3.80\).

Our observed \(F \approx 0.735\) falls between these bounds, so we fail to reject equal variances. The homoscedasticity assumption appears reasonable.

Part 4: Explain why the \(F\)-test in Part 3 depends more heavily on the normality assumption than the \(t\)-test in Part 1.

The \(F\)-test for variances requires strict normality — the result \((n-1)S^2/\sigma^2 \sim \chi^2_{n-1}\) holds exactly only under normality, and the CLT does not stabilise variance-based statistics the way it does for means. The \(t\)-test in (a) only requires approximate normality (or large \(n\)), since the CLT ensures the sampling distribution of \(\bar{X}\) converges to normal regardless of the population shape.

Question 12: Mann–Whitney \(U\) Test

Two independent groups of patients are given the following pain scores (lower = less pain):

  • Group X (Treatment): \(25,\ 30,\ 35,\ 40\) \((m = 4)\)
  • Group Y (Placebo): \(10,\ 15,\ 20,\ 22\) \((n = 4)\)

  1. Rank all 8 observations jointly from smallest to largest. Compute \(W_X\) (the rank sum for Group X) and then calculate \(U_X\).

  2. State the mean \(\eta\) and variance of \(U_X\) under \(H_0: F_X = F_Y\). Compute the standardised test statistic \(Z\) and the approximate two-sided p-value using the normal approximation. At \(\alpha = 0.05\), state your conclusion in context.

  3. Compare the assumptions of the Mann–Whitney \(U\) test with those of the two-sample \(t\)-test. Identify one scenario where the Mann–Whitney test would be strongly preferred.

Answer 12

Part 1: Rank all 8 observations jointly from smallest to largest. Compute \(W_X\) (the rank sum for Group X) and then calculate \(U_X\).

Joint ranking:

| Value | Group | Rank |
|---|---|---|
| 10 | Y | 1 |
| 15 | Y | 2 |
| 20 | Y | 3 |
| 22 | Y | 4 |
| 25 | X | 5 |
| 30 | X | 6 |
| 35 | X | 7 |
| 40 | X | 8 |

Rank sum for Group X: \[W_X = 5 + 6 + 7 + 8 = 26\]

\(U_X\) statistic: \[U_X = W_X - \frac{m(m+1)}{2} = 26 - \frac{4 \times 5}{2} = 26 - 10 = 16\]

Part 2: State the mean \(\eta\) and variance of \(U_X\) under \(H_0: F_X = F_Y\). Compute the standardised test statistic \(Z\) and the approximate two-sided p-value using the normal approximation. At \(\alpha = 0.05\), state your conclusion in context.

Under \(H_0\): \[\eta = E[U_X] = \frac{mn}{2} = \frac{4 \times 4}{2} = 8\] \[\operatorname{Var}(U_X) = \frac{mn(m+n+1)}{12} = \frac{4 \times 4 \times 9}{12} = \frac{144}{12} = 12\]

Standardised statistic: \[Z = \frac{U_X - \eta}{\sqrt{\operatorname{Var}(U_X)}} = \frac{16 - 8}{\sqrt{12}} = \frac{8}{3.464} \approx 2.31\]

Two-sided p-value: \[p = 2 \times P(Z > 2.31) = 2 \times (1 - \Phi(2.31)) = 2 \times (1 - 0.9896) = 2 \times 0.0104 \approx 0.021\]

Since \(p \approx 0.021 < 0.05\), we reject \(H_0\). There is significant evidence that the treatment group’s pain scores are different from (specifically, higher than) the placebo group’s — suggesting the treatment did not reduce pain relative to placebo in this sample.
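
As a cross-check, with samples this small (\(m = n = 4\)) the exact null distribution of \(U\) is available and arguably preferable to the normal approximation. A sketch assuming a reasonably recent scipy (for the `method='exact'` option):

```python
# Exact Mann-Whitney U test for the pain-score data.
from scipy.stats import mannwhitneyu

x = [25, 30, 35, 40]   # Group X (Treatment)
y = [10, 15, 20, 22]   # Group Y (Placebo)

res = mannwhitneyu(x, y, alternative='two-sided', method='exact')
print(res.statistic, res.pvalue)   # U = 16, exact p = 2/70 ≈ 0.029
```

Since \(U_X = 16\) is the largest achievable value, the exact two-sided p-value is \(2 \times 1/70 \approx 0.029\); the conclusion at \(\alpha = 0.05\) is the same as under the normal approximation.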

Part 3: Compare the assumptions of the Mann–Whitney \(U\) test with those of the two-sample \(t\)-test. Identify one scenario where the Mann–Whitney test would be strongly preferred.

The Mann–Whitney \(U\) test requires only that observations are independent and come from continuous distributions (no ties). It makes no assumption about the shape of \(F_X\) or \(F_Y\).

The two-sample \(t\)-test additionally requires approximate normality (or large \(n\)) within each group, and the equal-variance version also requires homoscedasticity.

The Mann–Whitney test would be strongly preferred when sample sizes are small and the data are visibly skewed or ordinal in nature — for example, pain scores, Likert-scale responses, or reaction times — where normality cannot be assumed and the \(t\)-test’s CLT protection is insufficient.

Question 13: Simple Linear Regression — MLE, Fisher Information, and Prediction

A researcher models the relationship between study hours (\(X\)) and exam scores (\(Y\)):

\[Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i, \quad \varepsilon_i \stackrel{iid}{\sim} N(0, \sigma^2), \quad i = 1, \dots, n\]

The following summary statistics are from \(n = 20\) students:

\[\bar{X} = 5,\quad \bar{Y} = 12,\quad S_{XX} = \sum(X_i-\bar{X})^2 = 100,\quad S_{XY} = \sum(X_i-\bar{X})(Y_i-\bar{Y}) = 250,\quad SSE = 180\]

  1. Write down the log-likelihood \(\ell(\beta_0, \beta_1, \sigma^2)\). Treating \(\sigma^2\) as known, derive \(\hat{\beta}_0\) and \(\hat{\beta}_1\) by solving the score equations.

  2. Compute numerical values for \(\hat{\beta}_0\), \(\hat{\beta}_1\), and the MLE \(\hat{\sigma}^2 = SSE/n\). Explain in one sentence why \(\hat{\sigma}^2\) is biased, and state the unbiased estimator \(MSE\).

  3. Treating \(\beta_0\) and \(\beta_1\) as known, derive the Fisher Information \(I_n(\sigma^2)\). Hence state the asymptotic variance of \(\hat{\sigma}^2\) and verify it is consistent with the exact result that \(SSE/\sigma^2 \sim \chi^2_{n-2}\).

  4. It can be shown that \(\hat{\beta}_1 \sim N(\beta_1,\ \sigma^2/S_{XX})\) exactly under normality. Using this, construct an exact \(95\%\) confidence interval for \(\beta_1\), replacing \(\sigma^2\) with \(MSE\). Verify the interval from the MLE asymptotic framework would give the same structure.

  5. A new student studied for \(X_h = 7\) hours. Compute the \(95\%\) confidence interval for the mean score \(\mu_{Y \mid X_h = 7}\) and the \(95\%\) prediction interval for this individual student’s score. Explain clearly why the prediction interval is wider.

Given: \(t_{0.025,\,18} = 2.101\)

Answer 13

Part 1: Write down the log-likelihood \(\ell(\beta_0, \beta_1, \sigma^2)\). Treating \(\sigma^2\) as known, derive \(\hat{\beta}_0\) and \(\hat{\beta}_1\) by solving the score equations.

The model states that \(Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i\), where \(\varepsilon_i \sim N(0, \sigma^2)\). Since \(X_i\) is treated as a fixed constant in regression, \(Y_i\) is a linear transformation of a normal variable. Therefore:

\[Y_i \sim N(\mu_i, \sigma^2) \quad \text{where} \quad \mu_i = \beta_0 + \beta_1 X_i\] The PDF for a single observation \(Y_i\) is: \[f(y_i; \beta_0, \beta_1, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(y_i - (\beta_0 + \beta_1 x_i))^2}{2\sigma^2} \right)\]

Since the errors \(\varepsilon_i\) are independent and identically distributed (iid), the observations \(Y_i\) are independent. The joint likelihood is the product of the individual PDFs: \[L(\beta_0, \beta_1, \sigma^2) = \prod_{i=1}^n f(y_i; \beta_0, \beta_1, \sigma^2)\] Substituting the PDF: \[L = \left( \frac{1}{2\pi\sigma^2} \right)^{n/2} \prod_{i=1}^n \exp\left( -\frac{(y_i - \beta_0 - \beta_1 x_i)^2}{2\sigma^2} \right)\] Using exponent rules (\(e^a \cdot e^b = e^{a+b}\)): \[L(\beta_0, \beta_1, \sigma^2) = (2\pi\sigma^2)^{-n/2} \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2 \right)\]

We take the natural logarithm (\(\ln\)) to simplify the maximization process. Recall that \(\ln(ab) = \ln a + \ln b\) and \(\ln(a^b) = b \ln a\): \[\ell = \ln \left[ (2\pi\sigma^2)^{-n/2} \right] + \ln \left[ \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2 \right) \right]\] Applying the log properties: \[\ell(\beta_0, \beta_1, \sigma^2) = -\frac{n}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2\] Expanding \(\ln(2\pi\sigma^2)\) further as \(\ln(2\pi) + \ln(\sigma^2)\) gives the final form used in the question: \[\boxed{\ell(\beta_0, \beta_1, \sigma^2) = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln(\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2}\]

Solving the Score Equations (Step-by-Step)

To find the MLEs, we take partial derivatives with respect to the parameters and set them to zero.

\[\frac{\partial \ell}{\partial \beta_0} = -\frac{1}{2\sigma^2} \sum_{i=1}^n 2(y_i - \beta_0 - \beta_1 x_i)(-1) = 0\] \[\frac{1}{\sigma^2} \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i) = 0\] Summing the terms individually: \[\sum y_i - n\beta_0 - \beta_1 \sum x_i = 0 \implies n\bar{y} - n\beta_0 - n\beta_1\bar{x} = 0\] Dividing by \(n\): \[\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}\]

\[\frac{\partial \ell}{\partial \beta_1} = -\frac{1}{2\sigma^2} \sum_{i=1}^n 2(y_i - \beta_0 - \beta_1 x_i)(-x_i) = 0\] \[\sum x_i(y_i - \beta_0 - \beta_1 x_i) = 0\] Substitute \(\beta_0 = \bar{y} - \beta_1\bar{x}\): \[\sum x_i(y_i - (\bar{y} - \beta_1\bar{x}) - \beta_1 x_i) = 0\] \[\sum x_i((y_i - \bar{y}) - \beta_1(x_i - \bar{x})) = 0\] By the properties of sums (\(\sum x_i(y_i - \bar{y}) = S_{XY}\) and \(\sum x_i(x_i - \bar{x}) = S_{XX}\)): \[S_{XY} - \beta_1 S_{XX} = 0 \implies \hat{\beta}_1 = \frac{S_{XY}}{S_{XX}}\]

Note: The MLE and OLS estimates are identical. This is a direct consequence of the normal error assumption — maximising the log-likelihood under normality is the same as minimising sum of squared residuals.

Part 2: Compute numerical values for \(\hat{\beta}_0\), \(\hat{\beta}_1\), and the MLE \(\hat{\sigma}^2 = SSE/n\). Explain in one sentence why \(\hat{\sigma}^2\) is biased, and state the unbiased estimator \(MSE\).

\[\hat{\beta}_1 = \frac{250}{100} = 2.5 \qquad \hat{\beta}_0 = 12 - 2.5(5) = -0.5\]

\[\hat{\sigma}^2_{MLE} = \frac{SSE}{n} = \frac{180}{20} = 9\]

Bias: \(\hat{\sigma}^2_{MLE}\) divides by \(n\) rather than \(n-2\). Fitting \(\beta_0\) and \(\beta_1\) consumes 2 degrees of freedom, so only \(n-2\) independent pieces of information remain for estimating \(\sigma^2\).

\[MSE = S^2 = \frac{SSE}{n-2} = \frac{180}{18} = 10\]

Part 3: Derive the Fisher Information \(I_n(\sigma^2)\). Hence state the asymptotic variance of \(\hat{\sigma}^2\) and verify it is consistent with the exact result that \(SSE/\sigma^2 \sim \chi^2_{n-2}\).

With \(\beta_0\), \(\beta_1\) treated as known, the residuals \(e_i = Y_i - \beta_0 - \beta_1 X_i \stackrel{iid}{\sim} N(0,\sigma^2)\). The problem reduces to estimating the variance of a normal — structurally identical to Q1. Let \(\theta = \sigma^2\):

\[\ell(\theta) = -\frac{n}{2}\ln\theta - \frac{1}{2\theta}\sum e_i^2 + \text{const}\]

\[\frac{\partial \ell}{\partial \theta} = -\frac{n}{2\theta} + \frac{\sum e_i^2}{2\theta^2} \qquad \frac{\partial^2 \ell}{\partial \theta^2} = \frac{n}{2\theta^2} - \frac{\sum e_i^2}{\theta^3}\]

Taking the negative expectation, and using \(E\left[\sum e_i^2\right] = n\sigma^2 = n\theta\):

\[I_n(\sigma^2) = -E\left[\frac{\partial^2 \ell}{\partial \theta^2}\right] = -\frac{n}{2\sigma^4} + \frac{n\sigma^2}{\sigma^6} = \frac{n}{2\sigma^4}\]

Asymptotic variance: \(\text{AVar}(\hat{\sigma}^2) = \frac{1}{I_n(\sigma^2)} = \frac{2\sigma^4}{n}\)

Verification via exact result: The exact result \(SSE/\sigma^2 \sim \chi^2_{n-2}\) gives \(\text{Var}(SSE/\sigma^2) = 2(n-2)\), so:

\[\text{Var}(\hat{\sigma}^2_{MLE}) = \text{Var}\!\left(\frac{SSE}{n}\right) = \frac{\sigma^4}{n^2}\cdot 2(n-2) = \frac{2\sigma^4(n-2)}{n^2} \xrightarrow{n\to\infty} \frac{2\sigma^4}{n} \checkmark\]

The asymptotic variance matches the exact variance to leading order, consistent with the MLE being asymptotically efficient.

Note that when deriving \(I_n(\sigma^2)\), we treat \(\beta_0\) and \(\beta_1\) as known. This turns the problem into finding the Fisher Information for a normal variance estimate.

  • The score with respect to \(\theta = \sigma^2\) can be written as \(\frac{\partial \ell}{\partial \theta} = \frac{n}{2\theta^2}\left(\frac{\sum e_i^2}{n} - \theta\right)\), which vanishes at \(\hat{\theta} = \sum e_i^2 / n\).
  • The negative expected second derivative measures the “curvature” of the log-likelihood at its peak. Higher Fisher Information means a sharper peak, which leads to a smaller asymptotic variance (\(1/I_n\)).

Part 4: Construct an exact \(95\%\) confidence interval for \(\beta_1\), replacing \(\sigma^2\) with \(MSE\). Verify the interval from the MLE asymptotic framework would give the same structure.

Since \(\sigma^2\) is unknown, replace with \(MSE\):

\[SE(\hat{\beta}_1) = \sqrt{\frac{MSE}{S_{XX}}} = \sqrt{\frac{10}{100}} = \sqrt{0.1} \approx 0.316\]

The exact pivot follows from the independence of \(\hat{\beta}_1\) and \(MSE\) under normality:

\[T = \frac{\hat{\beta}_1 - \beta_1}{SE(\hat{\beta}_1)} \sim t_{n-2} = t_{18}\]

\[95\%\ \text{CI}:\ \hat{\beta}_1 \pm t_{0.025,\,18} \times SE(\hat{\beta}_1) = 2.5 \pm 2.101 \times 0.316 = 2.5 \pm 0.664\]

\[\Rightarrow \boxed{(1.836,\ 3.164)}\]

Since 0 is not in the interval, there is significant evidence of a positive linear relationship at \(\alpha = 0.05\).

Connection to MLE asymptotic framework: The MLE asymptotic theory gives \(\hat{\beta}_1 \pm z_{\alpha/2} \cdot SE(\hat{\beta}_1)\), using \(z\) instead of \(t\). For finite samples the \(t\) interval is exact under normality and preferred. As \(n \to \infty\), \(t_{n-2} \to N(0,1)\) and the two intervals coincide — the asymptotic framework is the large-sample limit of the exact result.

Part 5: A new student studied for \(X_h = 7\) hours. Compute the \(95\%\) confidence interval for the mean score \(\mu_{Y \mid X_h = 7}\) and the \(95\%\) prediction interval for this individual student’s score. Explain clearly why the prediction interval is wider.

Point estimate: \[\hat{Y}_7 = -0.5 + 2.5(7) = 17.0\]

Confidence interval for the mean response \(\mu_{Y \mid X_h = 7}\):

\[SE(\hat{Y}_7) = \sqrt{MSE\left[\frac{1}{n} + \frac{(X_h - \bar{X})^2}{S_{XX}}\right]} = \sqrt{10\left[\frac{1}{20} + \frac{(7-5)^2}{100}\right]} = \sqrt{10 \times 0.09} = \sqrt{0.9} \approx 0.949\]

\[95\%\ \text{CI}:\ 17.0 \pm 2.101 \times 0.949 = 17.0 \pm 1.994 \approx (15.0,\ 19.0)\]

Prediction interval for a new individual’s score:

\[SE(\text{pred}) = \sqrt{MSE\left[1 + \frac{1}{n} + \frac{(X_h - \bar{X})^2}{S_{XX}}\right]} = \sqrt{10 \times 1.09} = \sqrt{10.9} \approx 3.302\]

\[95\%\ \text{PI}:\ 17.0 \pm 2.101 \times 3.302 = 17.0 \pm 6.937 \approx (10.1,\ 23.9)\]

Why the PI is wider: Both intervals are centred at the same point estimate \(\hat{Y}_7 = 17.0\). The CI captures only the uncertainty in estimating the mean response — how precisely we know where the regression line sits at \(X_h = 7\). The PI must additionally account for the irreducible individual-level randomness \(\sigma^2\): even if \(\beta_0\) and \(\beta_1\) were known exactly, a single new student’s score would still vary randomly around the line. This is the “\(+1\)” inside the square root — it never shrinks no matter how large \(n\) becomes, so the PI can never collapse to a point the way a CI does.
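
All of the numerical results in Parts 2 through 5 can be reproduced from the summary statistics alone. A minimal sketch, using the given \(t_{0.025,\,18} = 2.101\):

```python
# Regression estimates, CI for the mean response, and PI at X_h = 7,
# computed from the summary statistics in the question.
import numpy as np

n, xbar, ybar = 20, 5.0, 12.0
sxx, sxy, sse = 100.0, 250.0, 180.0
t_crit = 2.101                      # t_{0.025, 18}, as given

b1 = sxy / sxx                      # 2.5
b0 = ybar - b1 * xbar               # -0.5
mse = sse / (n - 2)                 # 10.0

xh = 7.0
yhat = b0 + b1 * xh                 # 17.0
se_mean = np.sqrt(mse * (1 / n + (xh - xbar) ** 2 / sxx))      # ~0.949
se_pred = np.sqrt(mse * (1 + 1 / n + (xh - xbar) ** 2 / sxx))  # ~3.302

print(f"CI: ({yhat - t_crit * se_mean:.1f}, {yhat + t_crit * se_mean:.1f})")
print(f"PI: ({yhat - t_crit * se_pred:.1f}, {yhat + t_crit * se_pred:.1f})")
```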