Chapter 4: Correlation and Experiment Design

Correlation

Correlation Coefficient

Definition: The correlation coefficient quantifies the linear relationship between two variables. It is a number between -1 and 1, where:
1: Perfect positive correlation (as one variable increases, the other also increases).
-1: Perfect negative correlation (as one variable increases, the other decreases).
0: No linear relationship.

Code Example:

import pandas as pd

# Sample data
data = pd.DataFrame({
    'sleep_total': [12.1, 17.0, 14.4, 14.9, 4.0],
    'sleep_rem': [8.2, 12.5, 11.0, 9.5, 2.0]
})

# Compute the correlation coefficient
correlation = data['sleep_total'].corr(data['sleep_rem'])
print(f"Correlation between total sleep and REM sleep: {correlation}")

Visualizing Relationships

Scatter Plot: A scatter plot helps visualize the relationship between two variables.

Code Example:

import seaborn as sns
import matplotlib.pyplot as plt

# Scatter plot of total sleep vs. REM sleep
sns.scatterplot(x="sleep_total", y="sleep_rem", data=data)
plt.show()

This scatter plot will visually represent the correlation between total sleep and REM sleep.

Adding a Trendline

Trendline: A trendline can help visually quantify the linear relationship between two variables.

Code Example:

sns.lmplot(x="sleep_total", y="sleep_rem", data=data, ci=None)
plt.show()

This will add a linear regression trendline to the scatter plot.

Caveats of Correlation

Non-linear Relationships

Correlation only measures linear relationships. If the relationship between variables is non-linear, correlation may not be meaningful.

Code Example:

# Example of a non-linear relationship
import numpy as np
import seaborn as sns

x = np.linspace(-10, 10, 100)
y = np.sin(x)

sns.scatterplot(x=x, y=y)
plt.show()

# Correlation coefficient for non-linear data
corr_non_linear = np.corrcoef(x, y)[0, 1]
print(f"Correlation for non-linear data: {corr_non_linear}")

In this case, the correlation coefficient may be low despite the clear non-linear relationship.

Design of Experiments

Controlled Experiments

Definition: In controlled experiments, participants are randomly assigned to either a treatment or a control group. This helps to establish a cause-and-effect relationship.

Example: An experiment testing the effect of an advertisement on product sales.

Treatment: Exposure to the advertisement.
Response: Number of products purchased.

Randomized Controlled Trials (RCT)

Definition: RCTs are considered the gold standard for experiments. In an RCT, participants are randomly assigned to treatment and control groups, ensuring that any observed effects are due to the treatment itself.

Observational Studies

Definition: In observational studies, participants are not randomly assigned. Instead, researchers observe the subjects and try to infer relationships based on the data.
These studies can establish association but not causation.

Longitudinal vs. Cross-Sectional Studies

Longitudinal Study: Follows participants over time to observe how an exposure or treatment affects an outcome.
Cross-Sectional Study: Collects data at a single point in time, which may be quicker but can lead to confounding variables.

These expanded sections provide a more detailed explanation of the statistical concepts, their formulas, and Python code examples for practical understanding. Let me know if you need any more specific examples or clarifications!