# Decision Tree Algorithm

## Overview
A **Decision Tree** is a supervised machine learning algorithm used for both classification and regression tasks.
It works by recursively splitting the dataset into smaller subsets based on feature values until a stopping criterion is met (for example, a maximum depth is reached or a subset contains only one class).

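To make the recursion concrete, here is a minimal illustrative sketch. The `build_tree` function, its naive median-based split rule, and the dict-based node layout are assumptions made for this example, not the repo's actual implementation; a real tree picks each split using the impurity measures described below.

```python
from collections import Counter

def build_tree(X, y, depth=0, max_depth=2):
    """Recursively split the data until a stopping criterion is met."""
    if depth == max_depth or len(set(y)) == 1:      # stop: depth limit or pure node
        return Counter(y).most_common(1)[0][0]      # leaf = majority class
    t = sorted(x[0] for x in X)[len(X) // 2]        # naive rule: split at the median
    left = [(x, lbl) for x, lbl in zip(X, y) if x[0] <= t]
    right = [(x, lbl) for x, lbl in zip(X, y) if x[0] > t]
    if not left or not right:                       # split separated nothing: make a leaf
        return Counter(y).most_common(1)[0][0]
    return {"split at": t,
            "left": build_tree(*zip(*left), depth + 1, max_depth),
            "right": build_tree(*zip(*right), depth + 1, max_depth)}

print(build_tree([[1], [2], [3], [4], [5]], [0, 0, 1, 1, 1]))
# {'split at': 3, 'left': {'split at': 2, 'left': 0, 'right': 1}, 'right': 1}
```
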
---

## Mathematical Concept

Decision Trees use measures like **Entropy**, **Information Gain**, and **Gini Impurity** to decide where to split.

### 1. Entropy
Entropy measures the amount of uncertainty (impurity) in a dataset:

```
H(S) = - Σ p(x) log₂ p(x)
```

Where:
- `p(x)` = proportion of examples in `S` belonging to class `x`
- Lower entropy = a purer dataset (a subset with a single class has entropy 0)

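As a quick numeric check, here is a minimal sketch of the formula in Python; the `entropy` helper and the use of `collections.Counter` are choices made for this example, not part of any particular library:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy H(S) = -Σ p(x) log2 p(x) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

print(entropy([0, 0, 1, 1]))  # 1.0  (maximum impurity for two classes)
print(entropy([1, 1, 1, 1]))  # -0.0 (a pure subset has zero entropy)
```
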
---

### 2. Information Gain
Information Gain measures the reduction in entropy achieved by splitting on an attribute:

```
IG(S, A) = H(S) - Σᵥ ( |Sv| / |S| ) * H(Sv)
```

Where:
- `S` = dataset
- `A` = attribute (feature)
- `Sv` = subset of `S` for which `A` takes value `v`, with the sum running over all values of `A`

The split with the **highest information gain** is chosen.

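Here is a small sketch of that selection step, reusing an `entropy` helper like the one above; all names are illustrative. For a numeric feature, the candidate splits are thresholds between adjacent values:

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right):
    """IG = H(parent) minus the size-weighted entropy of the two children."""
    n = len(parent)
    return entropy(parent) - (len(left) / n) * entropy(left) - (len(right) / n) * entropy(right)

X = [1, 2, 3, 4, 5]          # one numeric feature
y = [0, 0, 1, 1, 1]          # class labels

# Evaluate a candidate threshold between each pair of adjacent feature values.
for t in [1.5, 2.5, 3.5, 4.5]:
    left = [lbl for x, lbl in zip(X, y) if x <= t]
    right = [lbl for x, lbl in zip(X, y) if x > t]
    print(t, round(information_gain(y, left, right), 4))
# 1.5 0.3219
# 2.5 0.971   <- highest gain: x <= 2.5 separates the two classes perfectly
# 3.5 0.42
# 4.5 0.171
```
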
---

### 3. Gini Index
The Gini index is an alternative to entropy for measuring impurity:

```
Gini(S) = 1 - Σ p(i)²
```

Where:
- `p(i)` = proportion of examples in `S` belonging to class `i`

A pure dataset has Gini = 0; for two classes, maximum impurity is Gini = 0.5.

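A matching sketch of the Gini computation; as before, the `gini` helper and use of `Counter` are illustrative choices for this example:

```python
from collections import Counter

def gini(labels):
    """Gini impurity, Gini(S) = 1 - Σ p(i)^2."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

print(gini([0, 0, 0, 0]))  # 0.0 (pure subset)
print(gini([0, 0, 1, 1]))  # 0.5 (maximum impurity for two classes)
```
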
---

## Practical Use Cases
- **Business**: Predicting customer churn
- **Finance**: Credit scoring / loan approval
- **Healthcare**: Diagnosing diseases based on symptoms
- **Cybersecurity**: Spam / phishing detection

---

## Advantages
- Simple to understand and visualize
- Handles both numerical and categorical data
- Requires little preprocessing (no normalization or scaling needed)

---

## Limitations
- Prone to overfitting (mitigated by pruning or by ensembles such as Random Forests)
- Small changes in the data can produce a very different tree (instability)

---

## Example Usage

```python
from machine_learning.decision_tree import DecisionTree

# Toy dataset: one numeric feature, binary labels
X = [[1], [2], [3], [4], [5]]
y = [0, 0, 1, 1, 1]

# Train a shallow tree (max_depth=2 limits overfitting on five samples)
tree = DecisionTree(max_depth=2)
tree.fit(X, y)

# Predict the class of a new sample
print(tree.predict([[2]]))  # Output: [0]
```