Chapter 3: Data Distribution
Normal Distribution
Overview
The normal distribution, also known as the Gaussian distribution, is characterized by:
- Symmetrical bell-shaped curve
- Total area under the curve equals 1
- The tails approach the horizontal axis but never reach zero
- Defined by two parameters: mean (μ) and standard deviation (σ)
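The properties listed above can be checked numerically. The following is a minimal sketch, assuming the women's-heights parameters used later in this chapter (mean 161 cm, standard deviation 7 cm). The use of `scipy.integrate.quad` and the integration range of ten standard deviations are illustrative choices, not part of the original material.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

mu, sigma = 161, 7  # assumed example parameters (heights in cm)

# Integrate the density over mu +/- 10 sigma; the area outside is negligible
area, _ = quad(norm.pdf, mu - 10 * sigma, mu + 10 * sigma, args=(mu, sigma))

# Symmetry: the density is the same at equal distances on either side of the mean
left = norm.pdf(mu - 5, mu, sigma)
right = norm.pdf(mu + 5, mu, sigma)

# Far from the mean the density is tiny but still strictly positive
far_tail = norm.pdf(mu + 20 * sigma, mu, sigma)

print(round(area, 6), np.isclose(left, right), far_tail > 0)
```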
Key Properties
Standard Normal Distribution
- Mean = 0
- Standard deviation = 1
Areas under the Normal Distribution
- About 68% of data falls within 1 standard deviation of the mean
- About 95% of data falls within 2 standard deviations of the mean
- About 99.7% of data falls within 3 standard deviations of the mean
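These percentages can be reproduced from the standard normal CDF. The sketch below is illustrative only: it computes the area within 1, 2, and 3 standard deviations, then shows how a z-score maps a value from any normal distribution onto the standard normal. The height parameters (161 cm, 7 cm) are borrowed from the example later in this chapter.

```python
from scipy.stats import norm

# Probability of falling within k standard deviations of the mean,
# computed on the standard normal distribution (mean 0, standard deviation 1)
within = {k: norm.cdf(k) - norm.cdf(-k) for k in (1, 2, 3)}
for k, p in within.items():
    print(k, round(p, 4))

# Standardizing: a z-score expresses a value in standard deviations from the mean
mu, sigma = 161, 7          # assumed example parameters (heights in cm)
z = (154 - mu) / sigma      # 154 cm is one standard deviation below the mean
same_prob = norm.cdf(z)     # matches norm.cdf(154, mu, sigma)
```

The exact areas are roughly 68.27%, 95.45%, and 99.73%, so the figures in the rule are rounded.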
Working with Normal Distributions in Python
from scipy.stats import norm
# Example using women's heights
# Mean = 161 cm, Standard deviation = 7 cm
# Calculate probability of being shorter than 154 cm
prob_shorter = norm.cdf(154, 161, 7) # Returns 0.158655 (about 16%)
# Calculate probability of being taller than 154 cm
prob_taller = 1 - norm.cdf(154, 161, 7) # Returns 0.841345 (about 84%)
# Calculate probability of height between 154-157 cm
prob_between = norm.cdf(157, 161, 7) - norm.cdf(154, 161, 7) # Returns 0.1252
# Find height threshold where 90% of women are shorter
height_90th = norm.ppf(0.9, 161, 7) # Returns 169.97086
# Generate random heights
random_heights = norm.rvs(161, 7, size=10)
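A few related `scipy.stats.norm` methods can complement the calls above. This is a hedged sketch rather than part of the original notes: `sf` gives the upper-tail probability directly, `ppf` inverts `cdf`, and the `random_state` argument (with an arbitrary seed of 42 here) makes random draws repeatable.

```python
from scipy.stats import norm

mu, sigma = 161, 7  # same example parameters as above (heights in cm)

# sf (survival function) is the upper-tail probability, i.e. 1 - cdf
prob_taller = norm.sf(154, mu, sigma)

# ppf inverts cdf: feeding a cdf value back into ppf recovers the height
recovered = norm.ppf(norm.cdf(154, mu, sigma), mu, sigma)

# random_state makes the random draw repeatable (the seed 42 is arbitrary)
heights_a = norm.rvs(mu, sigma, size=10, random_state=42)
heights_b = norm.rvs(mu, sigma, size=10, random_state=42)

print(prob_taller, recovered, (heights_a == heights_b).all())
```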
Central Limit Theorem (CLT)
Overview
The Central Limit Theorem states that the sampling distribution of a statistic, such as the sample mean, becomes closer to the normal distribution as the sample size increases.
Requirements
- Samples should be random and independent
Implementation Example
import pandas as pd
import numpy as np
# Create a die
die = pd.Series([1, 2, 3, 4, 5, 6])
# Function to generate sample means
def generate_sample_means(n_samples, sample_size):
    sample_means = []
    for i in range(n_samples):
        sample_means.append(np.mean(die.sample(sample_size, replace=True)))
    return sample_means
# Generate different numbers of sample means
sample_means_100 = generate_sample_means(100, 5)
sample_means_1000 = generate_sample_means(1000, 5)
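One way to see the theorem at work is to compare the spread of the sample means with the value the theory predicts, the population standard deviation divided by the square root of the sample size. The sketch below is an assumption-laden variant of the example above: it uses NumPy's random generator instead of pandas for speed, and the function name `sample_means`, the seed, and the sample sizes 5 and 30 are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # seed chosen arbitrarily for repeatability
faces = np.arange(1, 7)              # the six faces of a fair die

def sample_means(n_samples, sample_size):
    # Roll the die sample_size times per sample and average each sample
    rolls = rng.choice(faces, size=(n_samples, sample_size), replace=True)
    return rolls.mean(axis=1)

population_sd = faces.std()  # standard deviation of a single roll

for sample_size in (5, 30):
    means = sample_means(10_000, sample_size)
    predicted_sd = population_sd / np.sqrt(sample_size)
    # The mean of the sample means should sit near 3.5,
    # and their spread should sit near the predicted value
    print(sample_size, round(means.mean(), 2), round(means.std(), 2),
          round(predicted_sd, 2))
```

Plotting a histogram of `means` for each sample size would show the bell shape tightening around 3.5 as the sample size grows.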
Poisson Distribution
Overview
The Poisson distribution models the number of events occurring over a fixed period of time when the events happen at a known average rate but at completely random moments.
Key Concepts
- Lambda (λ) represents the average number of events per time interval
- The distribution peaks at or just below lambda (the most likely count is the largest integer not exceeding λ)
- Applicable to various scenarios such as:
  - Animal shelter adoptions
  - Restaurant customer arrivals
  - Earthquake occurrences
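The role of lambda becomes clearer when the probability is computed by hand. The Poisson probability of exactly k events is λ^k · e^(−λ) / k!. The sketch below compares that formula with SciPy and locates the peak for a non-integer rate. The helper name `poisson_pmf_by_hand` and the rate 8.5 are hypothetical, chosen only for illustration.

```python
import math
import numpy as np
from scipy.stats import poisson

def poisson_pmf_by_hand(k, lam):
    # P(X = k) = lam**k * exp(-lam) / k!
    return lam ** k * math.exp(-lam) / math.factorial(k)

lam = 8  # assumed example rate: average events per interval
by_hand = poisson_pmf_by_hand(5, lam)
from_scipy = poisson.pmf(5, lam)

# Locate the most likely count for a non-integer rate of 8.5
counts = np.arange(0, 40)
peak = counts[np.argmax(poisson.pmf(counts, 8.5))]

print(by_hand, from_scipy, peak)
```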
Python Implementation
from scipy.stats import poisson
# Example: Average adoptions per week = 8
lambda_param = 8
# Probability of exactly 5 adoptions
prob_exact = poisson.pmf(5, lambda_param) # Returns 0.09160366
# Probability of 5 or fewer adoptions
prob_less_equal = poisson.cdf(5, lambda_param) # Returns 0.1912361
# Probability of more than 5 adoptions
prob_greater = 1 - poisson.cdf(5, lambda_param) # Returns 0.8087639
# Generate random samples
random_samples = poisson.rvs(lambda_param, size=10)
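The relationship between `pmf` and `cdf` in the example above can be made explicit: the CDF at 5 is the sum of the PMF over the counts 0 through 5. A short sketch, assuming the same rate of 8 adoptions per week; `poisson.sf` is shown as an equivalent way to get the upper tail.

```python
import numpy as np
from scipy.stats import poisson

lambda_param = 8

# cdf(5) accumulates the pmf over the counts 0, 1, 2, 3, 4, 5
summed = poisson.pmf(np.arange(0, 6), lambda_param).sum()
direct = poisson.cdf(5, lambda_param)

# sf(5) is the upper tail P(X > 5), the same quantity as 1 - cdf(5)
upper_tail = poisson.sf(5, lambda_param)

print(summed, direct, upper_tail)
```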
Exponential Distribution
Overview
The exponential distribution models the waiting time between consecutive Poisson events.
Key Properties
- Uses the same lambda (rate) as the Poisson distribution
- Continuous distribution (time)
- Expected value = 1/λ
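SciPy parameterizes the exponential distribution by `scale`, which is the reciprocal of the rate, so a rate has to be converted before use. The sketch below, using an assumed rate of 0.5 events per minute, checks the expected value and compares SciPy's CDF with the closed form P(T ≤ t) = 1 − e^(−λt).

```python
import math
from scipy.stats import expon

lambda_rate = 0.5           # assumed example rate: events per minute
scale = 1 / lambda_rate     # SciPy's expon takes scale = 1 / rate

# Expected waiting time equals 1 / lambda
mean_wait = expon.mean(scale=scale)

# Closed-form CDF of the exponential distribution: 1 - exp(-lambda * t)
t = 1
by_hand = 1 - math.exp(-lambda_rate * t)
from_scipy = expon.cdf(t, scale=scale)

print(mean_wait, by_hand, from_scipy)
```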
Python Implementation
from scipy.stats import expon
# Example: Average 0.5 customer service tickets per minute
lambda_rate = 0.5
scale = 1/lambda_rate # scale = 2
# Probability of waiting less than 1 minute
prob_less_1min = expon.cdf(1, scale=scale) # Returns 0.393469 (about 39%)
# Probability of waiting more than 4 minutes
prob_more_4min = 1 - expon.cdf(4, scale=scale) # Returns 0.135335 (about 14%)
# Probability of waiting between 1 and 4 minutes
prob_between = expon.cdf(4, scale=scale) - expon.cdf(1, scale=scale) # Returns 0.471195 (about 47%)
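The link between the two distributions can be illustrated by simulation: if the gaps between events are exponential with rate λ, the number of events per unit of time should behave like a Poisson variable with mean and variance both close to λ. This is a rough sketch under that assumption, using NumPy's generator with an arbitrary seed and an arbitrary number of simulated events.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed for repeatability
lambda_rate = 0.5                    # tickets per minute, as in the example above

# Draw waiting times between tickets and accumulate them into arrival times
gaps = rng.exponential(scale=1 / lambda_rate, size=100_000)
arrival_times = np.cumsum(gaps)

# Count arrivals in each whole minute, dropping the final partial minute
n_minutes = int(arrival_times[-1])
counts = np.bincount(arrival_times.astype(int), minlength=n_minutes)[:n_minutes]

# For a Poisson count, the mean and the variance are both equal to lambda
mean_count = counts.mean()
var_count = counts.var()
print(round(mean_count, 3), round(var_count, 3))
```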
Additional Distributions
Student's t-Distribution
- Similar shape to normal distribution
- Has degrees of freedom (df) parameter
- Lower df = thicker tails
- Higher df = closer to normal distribution
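The effect of the degrees of freedom on the tails can be seen by comparing upper-tail probabilities. The sketch below uses `scipy.stats.t`. The cutoff of 2 and the df values 2, 30, and 1000 are arbitrary illustrative choices.

```python
from scipy.stats import norm, t

# Probability of exceeding 2 under t-distributions with increasing df
tail_df2 = t.sf(2, df=2)
tail_df30 = t.sf(2, df=30)
tail_df1000 = t.sf(2, df=1000)

# The same upper-tail probability under the standard normal distribution
tail_normal = norm.sf(2)

# Tail probabilities shrink toward the normal value as df grows
print(tail_df2, tail_df30, tail_df1000, tail_normal)
```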
Log-Normal Distribution
- Variable whose logarithm is normally distributed
- Common applications:
  - Chess game lengths
  - Adult blood pressure
  - Hospital admissions during epidemics
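The defining property can be checked directly: the probability that a log-normal variable is below x equals the probability that a normal variable is below log(x). The sketch below assumes hypothetical parameters (mean 1.0 and standard deviation 0.5 on the log scale) and relies on SciPy's `lognorm` convention of `s` for the log-scale standard deviation and `scale = exp(mean)`.

```python
import numpy as np
from scipy.stats import lognorm, norm

mu, sigma = 1.0, 0.5  # hypothetical mean and standard deviation of log(X)

# SciPy's lognorm uses s = sigma and scale = exp(mu)
x = 4.0
p_lognormal = lognorm.cdf(x, s=sigma, scale=np.exp(mu))
p_via_log = norm.cdf(np.log(x), mu, sigma)

# Exponentiating normal draws gives log-normal draws;
# taking logs of those draws recovers the normal parameters
rng = np.random.default_rng(seed=0)  # arbitrary seed
samples = np.exp(rng.normal(mu, sigma, size=10_000))
log_mean = np.log(samples).mean()
log_sd = np.log(samples).std()

print(p_lognormal, p_via_log, round(log_mean, 2), round(log_sd, 2))
```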
References
- Content developed by Maggie Matsui for DataCamp
- Distribution code examples use SciPy's stats module; the Central Limit Theorem example uses pandas and NumPy
- Visualizations can be created using matplotlib or seaborn (not shown in examples)