Hypothesis Testing in Python
Introduction to Hypothesis Testing
Hypothesis testing is a fundamental concept in statistics that helps us make decisions about populations based on sample data. Let's explore this concept through both theoretical understanding and practical implementation in Python.
Understanding A/B Testing
A/B testing is a practical application of hypothesis testing commonly used in business decisions. Consider the real-world example of Electronic Arts (EA) and SimCity 5:
- EA wanted to increase pre-orders of their game
- They tested different advertising scenarios
- Users were split into control and treatment groups
- The results showed that the treatment group (shown no ad) generated 43.4% more purchases than the control group (shown the ad)
This raises an important question: Was this result statistically significant, or just due to chance? This is where hypothesis testing comes in.
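To make that question concrete, here is a minimal sketch of a two-proportion z-test for an A/B result like EA's. The group sizes and purchase counts below are made up for illustration (EA never published the raw numbers; these were chosen so the no-ad group shows roughly a 43% lift):
import numpy as np
from scipy.stats import norm

# Hypothetical counts -- EA's actual sample sizes were not published
conv_treatment, n_treatment = 1320, 10000   # no-ad group
conv_control, n_control = 920, 10000        # with-ad group

p_treatment = conv_treatment / n_treatment
p_control = conv_control / n_control

# Pooled proportion under H0: both groups share one purchase rate
p_pooled = (conv_treatment + conv_control) / (n_treatment + n_control)
std_error = np.sqrt(p_pooled * (1 - p_pooled) * (1 / n_treatment + 1 / n_control))

# One-sided test: is the no-ad purchase rate higher?
z_score = (p_treatment - p_control) / std_error
p_value = 1 - norm.cdf(z_score)
print(f"z = {z_score:.3f}, p = {p_value:.4f}")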
Working with Sample Data
Let's start with loading and examining our data:
import pandas as pd
import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt
# Load the Stack Overflow Developer Survey data
stack_overflow = pd.read_csv('stack_overflow_data.csv')
# Example of examining the data
print(stack_overflow.head())
Bootstrapping for Hypothesis Testing
Bootstrapping approximates the sampling distribution of a statistic by repeatedly resampling the observed data with replacement. Here's how to implement it:
def generate_bootstrap_distribution(data, column, statistic_func, n_bootstraps=5000):
    """
    Generate a bootstrap distribution for a given statistic.

    Parameters:
    -----------
    data : pandas DataFrame
        The dataset to sample from
    column : str
        The column name to calculate the statistic on
    statistic_func : function
        The function to calculate the statistic (e.g., np.mean)
    n_bootstraps : int
        Number of bootstrap samples to generate

    Returns:
    --------
    list
        Bootstrap distribution of the statistic
    """
    boot_distn = []
    for _ in range(n_bootstraps):
        # Resample with replacement, keeping the original sample size
        sample = data.sample(frac=1, replace=True)
        # Calculate and store the statistic
        boot_distn.append(statistic_func(sample[column]))
    return boot_distn
# Example usage
so_boot_distn = generate_bootstrap_distribution(
    stack_overflow,
    'converted_comp',
    np.mean
)
# Visualize the bootstrap distribution
plt.figure(figsize=(10, 6))
plt.hist(so_boot_distn, bins=50, edgecolor='black')
plt.title('Bootstrap Distribution of Mean Compensation')
plt.xlabel('Mean Compensation')
plt.ylabel('Frequency')
plt.show()
Z-Scores and Hypothesis Testing
Understanding Z-Scores
Z-scores are standardized values that tell us how many standard deviations an observation is from the mean. The formula is:
z = (sample statistic - hypothesized parameter value) / standard error
def calculate_z_score(sample_stat, hypoth_value, std_error):
    """
    Calculate the z-score for a hypothesis test.
    """
    return (sample_stat - hypoth_value) / std_error
# Example: Testing mean compensation
mean_comp_samp = stack_overflow['converted_comp'].mean()
mean_comp_hyp = 110000
# The standard deviation of the bootstrap distribution estimates the standard error
std_error = np.std(so_boot_distn, ddof=1)
z_score = calculate_z_score(mean_comp_samp, mean_comp_hyp, std_error)
print(f"Z-score: {z_score:.3f}")
P-Values and Statistical Significance
Understanding P-Values
P-values represent the probability of obtaining a result at least as extreme as the observed result, assuming the null hypothesis is true.
def calculate_p_value(z_score, alternative='two-sided'):
    """
    Calculate p-value for a given z-score.

    Parameters:
    -----------
    z_score : float
        The calculated z-score
    alternative : str
        Type of test ('two-sided', 'greater', 'less')

    Returns:
    --------
    float
        The p-value
    """
    if alternative == 'two-sided':
        return 2 * (1 - norm.cdf(abs(z_score)))
    elif alternative == 'greater':
        return 1 - norm.cdf(z_score)
    else:  # alternative == 'less'
        return norm.cdf(z_score)
# Example usage
p_value = calculate_p_value(z_score, alternative='greater')
print(f"P-value: {p_value:.4f}")
Statistical Significance
A result is considered statistically significant if the p-value is less than or equal to the chosen significance level (α).
Common significance levels are:
- α = 0.05 (5% significance level)
- α = 0.01 (1% significance level)
- α = 0.10 (10% significance level)
def test_hypothesis(p_value, alpha=0.05):
    """
    Make a decision about the hypothesis test.

    Parameters:
    -----------
    p_value : float
        The calculated p-value
    alpha : float
        The significance level

    Returns:
    --------
    str
        The decision and interpretation
    """
    if p_value <= alpha:
        return (f"Reject the null hypothesis (p={p_value:.4f} ≤ α={alpha}). "
                "There is sufficient evidence to support the alternative hypothesis.")
    else:
        return (f"Fail to reject the null hypothesis (p={p_value:.4f} > α={alpha}). "
                "There is insufficient evidence to support the alternative hypothesis.")
# Example usage
alpha = 0.05
decision = test_hypothesis(p_value, alpha)
print(decision)
Confidence Intervals
Confidence intervals provide a range of plausible values for the population parameter:
def calculate_confidence_interval(boot_distn, confidence_level=0.95):
    """
    Calculate confidence interval from a bootstrap distribution.
    """
    lower_percentile = (1 - confidence_level) / 2
    upper_percentile = 1 - lower_percentile
    lower = np.quantile(boot_distn, lower_percentile)
    upper = np.quantile(boot_distn, upper_percentile)
    return lower, upper
# Example usage
lower, upper = calculate_confidence_interval(so_boot_distn)
print(f"95% Confidence Interval: ({lower:.2f}, {upper:.2f})")
Types of Errors in Hypothesis Testing
There are two types of errors that can occur in hypothesis testing (a simulation estimating both error rates is sketched after this list):
- Type I Error (False Positive):
  - Rejecting H₀ when it's actually true
  - Probability = α (the significance level)
- Type II Error (False Negative):
  - Failing to reject H₀ when it's actually false
  - Probability = β (depends on sample size and effect size)
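A quick way to build intuition for α and β is simulation. This sketch draws many samples under a true and a false null hypothesis and counts how often a z-test errs; all the numbers (mean 100, shift to 105, σ = 15, n = 50) are illustrative assumptions, not values from the survey data:
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)
alpha, n, n_sims = 0.05, 50, 10000
z_crit = norm.ppf(1 - alpha / 2)  # two-sided critical value

def rejection_rate(true_mean, hypoth_mean=100, sigma=15):
    """Fraction of simulated z-tests that reject H0: mean == hypoth_mean."""
    rejections = 0
    for _ in range(n_sims):
        sample = rng.normal(true_mean, sigma, n)
        z = (sample.mean() - hypoth_mean) / (sample.std(ddof=1) / np.sqrt(n))
        rejections += abs(z) > z_crit
    return rejections / n_sims

# When H0 is true, the rejection rate estimates alpha (Type I error rate)
print(f"Type I error rate:  {rejection_rate(true_mean=100):.3f}")
# When H0 is false, the non-rejection rate estimates beta (Type II error rate)
print(f"Type II error rate: {1 - rejection_rate(true_mean=105):.3f}")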
Best Practices for Hypothesis Testing
- Always state your hypotheses clearly before conducting the test
- Choose your significance level (α) before collecting data
- Consider the practical significance, not just statistical significance
- Report confidence intervals along with p-values
- Be aware of multiple testing problems (a Bonferroni correction sketch follows this list)
- Use appropriate sample sizes
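On the multiple testing point: running many tests at α = 0.05 inflates the chance of at least one false positive. A minimal sketch of the Bonferroni correction, using made-up p-values, looks like this:
# Bonferroni correction: with m tests, compare each p-value to alpha / m
p_values = [0.004, 0.020, 0.049, 0.410]   # hypothetical p-values from four tests
alpha = 0.05
m = len(p_values)

for i, p in enumerate(p_values, start=1):
    verdict = "significant" if p <= alpha / m else "not significant"
    print(f"Test {i}: p = {p:.3f} -> {verdict} at corrected alpha = {alpha / m:.4f}")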
Example: Complete Hypothesis Test
Let's put it all together with a complete example:
def complete_hypothesis_test(data, column, hypothesis_value,
                             alternative='two-sided', alpha=0.05):
    """
    Conduct a complete hypothesis test.
    """
    # Calculate sample statistic
    sample_stat = data[column].mean()
    # Generate bootstrap distribution
    boot_distn = generate_bootstrap_distribution(data, column, np.mean)
    # Calculate standard error
    std_error = np.std(boot_distn, ddof=1)
    # Calculate z-score
    z_score = calculate_z_score(sample_stat, hypothesis_value, std_error)
    # Calculate p-value
    p_value = calculate_p_value(z_score, alternative)
    # Calculate confidence interval
    ci_lower, ci_upper = calculate_confidence_interval(boot_distn)
    # Make decision
    decision = test_hypothesis(p_value, alpha)
    return {
        'sample_statistic': sample_stat,
        'z_score': z_score,
        'p_value': p_value,
        'confidence_interval': (ci_lower, ci_upper),
        'decision': decision
    }
# Example usage
results = complete_hypothesis_test(
    stack_overflow,
    'converted_comp',
    110000,
    alternative='greater',
    alpha=0.05
)
# Print results in a formatted way
print("Hypothesis Test Results")
print("=====================")
print(f"Sample Statistic: {results['sample_statistic']:.2f}")
print(f"Z-score: {results['z_score']:.3f}")
print(f"P-value: {results['p_value']:.4f}")
print(f"95% Confidence Interval: ({results['confidence_interval'][0]:.2f}, "
f"{results['confidence_interval'][1]:.2f})")
print("\nDecision:")
print(results['decision'])
This comprehensive guide provides both the theoretical foundation and practical implementation of hypothesis testing in Python. The code examples are designed to be reusable and adaptable for different hypothesis testing scenarios.