Decision Tree
Table of contents
- 0. Performance
- 1. Various Indicators of Impurity
- 2. Information Gain $IG(\cdot)$
- 3. Some Algorithms Used in Decision Trees
- 4. ID3 Algorithm
- 5. Feature Importance
- Reference
0. Performance
On its own, a decision tree does not generalize well.
Low bias and high variance lead to overfitting.
Algorithm | Bias | Variance |
---|---|---|
Decision Tree | low | high |
Linear Regression | high | low |
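A quick way to see this high-variance behavior is to compare the train and test accuracy of a fully grown tree. A minimal sketch with scikit-learn (the dataset choice is an illustrative assumption, not from the original notes):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Illustrative dataset; any tabular classification data works.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A fully grown tree (no depth limit) memorizes the training data almost perfectly...
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("train accuracy:", tree.score(X_train, y_train))  # typically ~1.0
print("test  accuracy:", tree.score(X_test, y_test))    # noticeably lower -> high variance
```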
1. Various Indicators of Impurity
(0) Notation
- $D_n$ : dataset of node $n$
- $D_p, D_j$ : dataset of a parent node and its j-th child node
- $N_n$ : the number of samples in a node $n$
- $N_p, N_j$ : the number of samples in a parent node and its j-th child node
- $f$ : feature to use for split
- $I(\cdot)$ : indicator of impurity
- $p(k \vert D_n)$ : relative frequency (proportion) of the $k$-th class in dataset $D_n$
(1) Definition entropy $I_H$
\[I_H(D_n) = \sum_{k=1}^K \Big[ - p(k \vert D_n) \cdot \log p(k \vert D_n) \Big]\]

where $K$ is the number of classes
(2) Definition Gini impurity $I_G$
\[\begin{align*} I_G(D_n) &= \sum_{k=1}^K p(k \vert D_n) \big( 1 - p(k \vert D_n) \big) \\ &= 1 - \sum_{k=1}^K p(k \vert D_n)^2 \end{align*}\]

The probability of mislabeling an element drawn from the dataset when its label is guessed at random according to the class distribution.
(3) Definition classification error $I_E$
\[I_E(D_n) = 1 - \underset{k}{\text{max}} \lbrace p(k \vert D_n) \rbrace\]

A criterion well suited to pruning, but it is not sensitive to changes in $p$, so it is rarely used for growing the tree.
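The three impurity measures can be computed directly from the class proportions of a node. A minimal NumPy sketch (the function names are my own):

```python
import numpy as np

def class_proportions(y):
    """p(k | D_n): relative frequency of each class in the node's labels y."""
    _, counts = np.unique(y, return_counts=True)
    return counts / counts.sum()

def entropy(y):
    p = class_proportions(y)
    return -np.sum(p * np.log2(p))    # I_H(D_n); log base 2 is a common choice

def gini(y):
    p = class_proportions(y)
    return 1.0 - np.sum(p ** 2)       # I_G(D_n)

def classification_error(y):
    p = class_proportions(y)
    return 1.0 - p.max()              # I_E(D_n)

y_node = np.array([0, 0, 0, 1, 1, 2])  # toy node labels
print(entropy(y_node), gini(y_node), classification_error(y_node))
```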
2. Information Gain $IG(\cdot)$
Definition information gain $IG(\cdot)$
\[\begin{align*} IG(D_p, f) &= (\text{impurity of parent node}) - (\text{weighted impurity of child nodes}) \\ &= I(D_p) - \sum_{j=1}^{m} \frac{N_j}{N_p} I(D_j) \end{align*}\]

where $m$ is the number of child nodes produced by splitting on feature $f$
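A small sketch of how $IG$ is evaluated for a candidate split, here with Gini impurity as $I(\cdot)$ (the helper names and toy data are illustrative assumptions):

```python
import numpy as np

def gini(y):
    """I_G: Gini impurity of a node's labels."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def information_gain(y_parent, child_label_arrays, impurity=gini):
    """IG(D_p, f) = I(D_p) - sum_j (N_j / N_p) * I(D_j)."""
    n_parent = len(y_parent)
    weighted = sum(len(y_c) / n_parent * impurity(y_c) for y_c in child_label_arrays)
    return impurity(y_parent) - weighted

# Toy example: a parent node split into two children
parent = np.array([0, 0, 0, 1, 1, 2])
left, right = np.array([0, 0, 0]), np.array([1, 1, 2])
print(information_gain(parent, [left, right]))
```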
3. Some Algorithms Used in Decision Trees
- ID3 → (Iterative Dichotomiser 3)
- C4.5 → (successor of ID3)
- CART → (Classification And Regression Trees)
- CHAID → (Chi-square Automatic Interaction Detection; performs multi-level splits when computing classification trees)
- MARS → (Multivariate Adaptive Regression Splines)
4. ID3 Algorithm
- It begins with the original dataset $D$ as the root node.
- On each iteration, the algorithm goes through every unused attribute of the set $D$ and calculates the impurity ($I$) and information gain ($IG$) of that attribute.
- It then selects the attribute with the smallest entropy (equivalently, the largest information gain).
- The set $D$ is then split by the selected attribute to produce subsets of the data (the child nodes).
- The algorithm recurses on each subset, considering only attributes never selected before (a minimal sketch follows this list).
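A minimal recursive sketch of this procedure for categorical features, using entropy-based information gain (the data encoding and helper names are my own assumptions; a real implementation would add more stopping criteria and handle unseen values):

```python
import numpy as np

def entropy(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def id3(X, y, attributes):
    """X: 2-D array of categorical features, y: labels, attributes: unused column indices."""
    # Stop if the node is pure or no attributes remain; predict the majority class.
    if len(np.unique(y)) == 1 or not attributes:
        values, counts = np.unique(y, return_counts=True)
        return values[np.argmax(counts)]
    # Pick the attribute with the largest information gain (smallest weighted child entropy).
    def gain(a):
        children = [y[X[:, a] == v] for v in np.unique(X[:, a])]
        return entropy(y) - sum(len(c) / len(y) * entropy(c) for c in children)
    best = max(attributes, key=gain)
    remaining = [a for a in attributes if a != best]
    # Split on the best attribute and recurse on each child subset.
    return {best: {v: id3(X[X[:, best] == v], y[X[:, best] == v], remaining)
                   for v in np.unique(X[:, best])}}

# Toy usage: two categorical features encoded as small integers
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 1, 1])
print(id3(X, y, attributes=[0, 1]))  # nested dict: {attribute: {value: subtree or label}}
```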
5. Feature Importance
Definition node importance $NI_p$ of a node $p$
\[\begin{align*} NI_p &= \frac{N_p}{N_t} I(D_p) - \sum_{j=1}^{m} \frac{N_j}{N_t} I(D_j) \\ &= \frac{N_p}{N_t} IG(D_p, f) \end{align*}\]

where $N_t$ is the number of samples in the total dataset and the sum runs over the child nodes of $p$
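As a quick worked example, $NI_p$ can be evaluated straight from the counts and impurities in the formula (all numbers below are made up for illustration):

```python
def node_importance(n_parent, parent_impurity, child_counts, child_impurities, n_total):
    """NI_p = (N_p / N_t) * I(D_p) - sum_j (N_j / N_t) * I(D_j)."""
    return (n_parent / n_total) * parent_impurity - sum(
        (n_j / n_total) * i_j for n_j, i_j in zip(child_counts, child_impurities)
    )

# Hypothetical node: 100 of 200 total samples, Gini 0.5, split into 60/40 children
print(node_importance(100, 0.5, [60, 40], [0.2, 0.1], n_total=200))  # 0.25 - 0.08 = 0.17
```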
Definition feature importance $FI_f$ for a feature $f$
Mean Decrease in Impurity (MDI)
\[\begin{align*} FI_f &= \frac{\sum\limits_{j\text{: nodes that split on feature } f} NI_j}{\sum\limits_{i\text{: all nodes}} NI_i} \\ &= \frac{\sum NI \text{ of the nodes splitting on feature } f}{\sum NI \text{ of all nodes}} \end{align*}\]
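scikit-learn's `DecisionTreeClassifier` exposes this normalized quantity as `feature_importances_`; a minimal sketch (the dataset choice is an illustrative assumption):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# feature_importances_ sums the normalized node importances of the nodes
# that split on each feature, i.e. FI_f as defined above; the entries sum to 1.
for name, importance in zip(load_iris().feature_names, tree.feature_importances_):
    print(f"{name}: {importance:.3f}")
```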
Reference
- codeit, Machine Learning > Decision Trees and Ensemble Methods (course, in Korean)
- Sebastian Raschka et al., Machine Learning Textbook with Python, scikit-learn, and TensorFlow (3rd rev. ed.), Gilbut
- KDnuggets, "Decision Tree Algorithm, Explained"