Decision Tree Learning Algorithm
1. Introduction
The decision tree learning algorithm follows a:
Top-down, greedy search strategy using recursive divide-and-conquer
This approach is used in classical algorithms such as:
- ID3
- C4.5
- CART
2. High-Level Intuition
A decision tree is built by:
- Starting with the entire dataset
- Selecting the best feature to split
- Dividing the dataset into subsets
- Repeating the process recursively
👉 Goal: Create subsets that are as pure as possible (i.e., contain similar class labels)
3. Core Algorithmic Steps
Step 1: Initialization
- Start with the root node
- It represents the entire training dataset
Step 2: Evaluate Attributes
- For each feature (attribute), compute how well it separates the data
- Use statistical measures like:
- Entropy
- Information Gain
- Gini Index
Step 3: Select the Best Attribute
- Choose the attribute that best splits the data
- Typically, this is the one with maximum Information Gain
Step 4: Create Branches
- For the selected attribute:
- Create a branch for each possible value
- Each branch represents a subset of the data
Step 5: Recursive Partitioning
- Split the dataset based on the selected attribute
- Repeat the same process for each subset
4. Mathematical Selection Criteria
To choose the best split, we measure impurity.
4.1 Entropy
Entropy measures the randomness (impurity) in the dataset.
Where:
- ( \(p_+\) ): proportion of positive class
- ( \(p_-\) ): proportion of negative class
👉 Lower entropy = more pure dataset
4.2 Information Gain
Information Gain measures the reduction in entropy after a split.
Where:
- ( S ): dataset
- ( A ): attribute
- ( \(S_v\) ): subset where attribute ( A = v )
👉 Choose attribute with highest Information Gain
4.3 Gini Index
Used mainly in CART algorithm.
Where:
- ( \(p_i\) ): probability of class ( i )
👉 Lower Gini = better split
5. Stopping Conditions
The recursive splitting stops when:
5.1 Pure Node
- All samples belong to the same class
- Entropy = 0
👉 Create a leaf node
5.2 No Remaining Attributes
- No features left to split
👉 Use majority voting to assign class
5.3 Empty Dataset
- No samples reach a branch
👉 Assign the most common class from the parent node
6. Key Characteristics
6.1 Greedy Nature
- Chooses the best split at each step
- Does not reconsider previous decisions
6.2 Interpretability
-
Can be visualized as:
-
A tree structure
- A set of IF-THEN rules
Example:
IF Outlook = Sunny AND Humidity = High → No
ELSE IF Outlook = Rain AND Wind = Weak → Yes
6.3 Invariance to Scaling
- Decision trees do not require feature scaling
- Normalization or standardization is unnecessary
7. Summary
| Component | Description |
|---|---|
| Strategy | Greedy, top-down |
| Process | Recursive partitioning |
| Goal | Maximize purity |
| Measures | Entropy, Information Gain, Gini |
| Output | Tree or rule-based model |
8. Conclusion
Decision trees are:
- Simple and interpretable
- Powerful for classification and regression
- Built using a recursive, greedy splitting strategy
👉 They form the foundation for advanced models like:
- Random Forests
- Gradient Boosted Trees