Parametric vs. Non-Parametric Models
Introduction
So far, we have used linear regression for regression problems and logistic regression for classification problems. Decision trees can handle both kinds of problems.
Machine learning can be described as the construction of a function, f, that maps input variables (x) to output variables (y): y = f(x).
The job of an ML Engineer is to find the best mathematical model for the function.
Unlike linear or logistic regression, decision tree algorithms make no assumptions about the form of the mapping function.
PARAMETRIC MODELS
- These models make strong assumptions about the form of the mapping function.
- Very simple and interpretable.
- The set of parameters does not depend upon the amount of data.
- A learning model that summarizes data with a set of parameters of fixed size
(independent of the number of training examples) is called a parametric model.
Source: (Book) Artificial Intelligence: A Modern Approach.
Working Steps of Parametric Models:
- Select a form for the function.
- Learn the coefficient/parameters for the function from the training data.
Benefits:
- Simpler: Easier to understand and interpret results.
- Fast: Very fast to learn from data.
- Less Data: Do not require as much training data and can work well even if the fit to the data is not perfect.
Limitations:
- Constrained: Highly constrained to the specified form.
- Limited Complexity: More suited to simpler problems.
- Poor Fit: Unlikely to match the exact underlying mapping function.
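To make the "fixed set of parameters" point concrete, here is a small sketch using sklearn's `LinearRegression` on synthetic data (the data and feature count are illustrative assumptions, not from this chapter): however many training examples we feed it, the model always learns the same number of parameters.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

for n_samples in (100, 10_000):
    X = rng.normal(size=(n_samples, 3))  # 3 input features
    y = X @ np.array([1.5, -2.0, 0.7]) + rng.normal(scale=0.1, size=n_samples)
    model = LinearRegression().fit(X, y)
    # One coefficient per feature plus an intercept: 4 parameters either way.
    print(n_samples, model.coef_.size + 1)
```

A non-parametric model such as a decision tree, by contrast, grows more nodes as more data (and more structure) becomes available.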
NON-PARAMETRIC MODELS
- These algorithms do not make strong assumptions about the form of the mapping function.
- They are free to form any functional form from the data.
- Non-parametric methods are good when you have a lot of data and no prior knowledge and
when you don’t want to worry too much about choosing the right features.
- They fit the training data as closely as possible when constructing the mapping function, while maintaining some flexibility to generalize to unseen data.
Source: (Book) Artificial Intelligence: A Modern Approach.
Benefits:
- Flexibility: Capable of fitting a large number of functional forms.
- Power: No assumptions or weak assumptions about the underlying function.
- Performance: Can result in higher performance models for prediction.
Limitations:
- More Data: Require a lot more training data to estimate the mapping function.
- Slower: Slower to train as they often have far more parameters.
- Overfitting: Higher risk of overfitting the training data and harder to explain why specific predictions are made.
| Aspect | Parametric Models | Non-Parametric Models |
|---|---|---|
| Assumptions | Make strong assumptions about the data distribution. | Do not assume a specific data distribution. |
| Flexibility | Limited flexibility due to predefined assumptions. | Highly flexible, capable of capturing complex patterns. |
| Data Requirement | Requires less data to train effectively. | Needs a large amount of training data. |
| Interpretability | Easy to interpret and explain. | Harder to interpret due to complex structures. |
| Training Time | Faster to train. | Slower to train due to high complexity. |
| Overfitting | Lower risk of overfitting. | Higher risk of overfitting. |
| Example Models | Linear Regression, Logistic Regression, Naive Bayes. | Decision Trees, k-NN, Random Forest, Neural Networks. |
Working of Decision Trees
Decision Trees
- Every time we split the dataset based on some condition/decision, we build up a decision tree.
- The tree aims to separate the positive class from the negative class through these splits.
- In an ideal case, we find a feature that can split the dataset into pure nodes.
- A pure node is one in which all the data points exhibit the desired behavior or belong to the same class.
- The objective of decision trees is to have pure nodes.
- A decision tree with pure nodes is good at segregating data into respective classes.
- A decision tree can have multiple splits.
- Purity of nodes helps to decide how splitting should be done.
Terminology
- Root Node: Represents the entire population or dataset.
- Splitting: Process of dividing a node into two or more sub-nodes.
- Decision Node: When a sub-node is divided into further sub-nodes, the original sub-node is called a decision node.
- Leaf/Terminal Node: A node that does not split any further.
- Branch/Sub-tree: A subsection of the entire tree.
- Parent Node: A node that is divided into sub-nodes.
- Child Node: A sub-node derived from a parent node.
- Depth of the Tree: The length of the longest path from the root node to the leaf node.
- The root node has a depth of 0.
Types of Decision Trees
- Classification Decision Tree:
- Used when the target variable is categorical in nature (e.g., Yes/No, Spam/Not Spam).
- It classifies data points into predefined categories based on feature conditions.
- The decision boundaries formed by classification trees are often stepwise and may not always be smooth.
- Common use cases include spam detection, loan approval, and medical diagnosis.
- Regression Decision Tree:
- Used when the target variable is continuous in nature (e.g., predicting house prices, temperature, or sales revenue).
- Instead of predicting distinct categories, it predicts real-valued outputs by computing the mean of observations in a region.
- The splits in the tree aim to minimize variance within each partition of data.
- Common use cases include stock price forecasting, customer spending predictions, and demand estimation.
- Key Differences:
- Classification trees predict labels, whereas regression trees predict numeric values.
- Loss functions differ: classification trees often use Gini impurity or entropy, while regression trees use mean squared error (MSE).
- Regression trees can model complex relationships but are prone to overfitting, requiring pruning or regularization.
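The contrast between the two tree types can be sketched on toy data (a hypothetical one-feature example, not the chapter's churn dataset): the classification tree predicts labels, while the regression tree predicts the mean of the observations in each leaf.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])

# Categorical target -> classification tree (Gini/entropy splits).
y_cls = np.array([0, 0, 0, 1, 1, 1])
clf = DecisionTreeClassifier(max_depth=1).fit(X, y_cls)
print(clf.predict([[2.5], [10.5]]))  # [0 1] -- predicted class labels

# Continuous target -> regression tree (variance/MSE splits);
# each leaf predicts the mean of its training observations.
y_reg = np.array([1.1, 1.9, 3.2, 10.2, 11.1, 11.9])
reg = DecisionTreeRegressor(max_depth=1).fit(X, y_reg)
print(reg.predict([[2.5], [10.5]]))  # leaf means, roughly [2.07, 11.07]
```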
Splitting Criteria of Tree Nodes
Gini Impurity
Gini impurity measures the impurity of nodes.
\(\text{Gini impurity} = 1 - \text{Gini}\)
Gini is the probability that two elements chosen randomly from a population are of the same class.
- For a pure node, \( \text{Gini} = 1 \), so its Gini impurity is 0.
- Gini impurity ranges from 0 (pure node) up to \( 1 - 1/k \) for \( k \) classes (0.5 for a binary problem).
Properties of Gini Impurity:
- Helps in deciding the best split
- The lower the Gini impurity, the higher the purity and homogeneity of the node
- Works only with categorical target variables
- Performs binary splits only
Calculating Gini Impurity:
- Calculate Gini impurity for sub-nodes: \[ \text{Gini impurity} = 1 - \sum_{i=1}^{n} p_i^2 \] where \( p_i \) is the proportion of data points in the node belonging to class \( i \) (so \( p_i^2 \) is the probability that two randomly chosen points both belong to class \( i \)).
- Calculate the Gini impurity for the split: \[ \text{Weighted Gini impurity} = \sum_{\text{child nodes}} \frac{n_{\text{child}}}{n_{\text{parent}}} \times \text{Gini impurity}_{\text{child}} \] The split producing the minimum weighted Gini impurity is selected.
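The two steps above can be sketched in Python (a minimal illustration of the formulas, assuming nodes are represented as simple lists of class labels):

```python
import numpy as np

def gini_impurity(labels):
    """1 - sum(p_i^2) over the class proportions in a node."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def weighted_gini(children):
    """Weighted average Gini impurity over the child nodes of a split."""
    total = sum(len(c) for c in children)
    return sum(len(c) / total * gini_impurity(c) for c in children)

pure = [1, 1, 1, 1]
mixed = [0, 0, 1, 1]
print(gini_impurity(pure))           # 0.0 -- a pure node has zero impurity
print(gini_impurity(mixed))          # 0.5 -- maximum impurity for two classes
print(weighted_gini([pure, mixed]))  # 0.25 -- each child weighted by its size
```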
Information Gain
Information gain is the information needed to describe the parent node minus the information needed to describe the children nodes.
More homogeneous nodes result in higher information gain.
\(\text{IG} = \text{Entropy(parent)} - \text{Weighted Entropy(children)}\)
Entropy Calculation:
\[ \text{Entropy} = -P_1 \log P_1 - P_2 \log P_2 - \dots - P_n \log P_n \] where \( P_1, P_2, \dots, P_n \) are the proportions of each class in the node. The logarithm is commonly taken base 2 (entropy in bits); some texts use base \( n \) so that entropy is normalized to the range [0, 1].
Steps to Calculate Entropy for a Split:
- Calculate entropy of parent nodes
- Calculate entropy of each child node
- Compute the weighted average entropy of the split
- If the weighted entropy of the child node is greater than the parent node, ignore the split
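The steps above can be sketched as follows (a minimal illustration assuming log base 2, so entropy is measured in bits, and nodes represented as lists of class labels):

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of the class proportions in a node, in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, children):
    """Entropy of the parent minus the weighted entropy of the children."""
    total = len(parent)
    weighted = sum(len(c) / total * entropy(c) for c in children)
    return entropy(parent) - weighted

parent = [0, 0, 1, 1]
perfect_split = [[0, 0], [1, 1]]
print(entropy(parent))                          # 1.0 -- maximally mixed node
print(information_gain(parent, perfect_split))  # 1.0 -- split removes all uncertainty
```

A split whose children have higher weighted entropy than the parent yields negative information gain and would be ignored, matching the last step above.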
Decision Tree Regressor
Since information gain and Gini impurity apply only to categorical targets, regression trees use variance reduction to choose splits when the target variable is continuous.
Variance Calculation:
\[ \text{Variance} = \frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n} \] where \( x_i \) is a data point, \( \mu \) is the mean, and \( n \) is the number of samples.
Steps to Calculate Variance for a Split:
- Calculate variance of each child node using the standard formula.
- Calculate the variance of the split as the weighted average variance of each child split.
Note: Applicable only to continuous target variables.
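These steps can be sketched on a toy target (an illustrative example, not the chapter's data): the split with the lowest weighted child variance, i.e. the largest variance reduction relative to the parent, is chosen.

```python
import numpy as np

def weighted_variance(children):
    """Weighted average variance over the child nodes of a split."""
    total = sum(len(c) for c in children)
    return sum(len(c) / total * np.var(c) for c in children)

parent = np.array([1.0, 2.0, 10.0, 11.0])
split = [np.array([1.0, 2.0]), np.array([10.0, 11.0])]

print(np.var(parent))            # 20.5 -- variance before splitting
print(weighted_variance(split))  # 0.25 -- large reduction, so a good split
```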
Implementation
Initiation
The data we are using is the same as in the logistic regression modelling, and the target variable is still the ‘churn’ variable. We use the DecisionTreeClassifier class from the tree module of the sklearn library.
from sklearn.tree import DecisionTreeClassifier as DTC
# class_weight='balanced' reweights classes inversely to their frequency
classifier = DTC(class_weight='balanced')
# classifier = DTC()
classifier.fit(X_train, y_train)
predicted_values = classifier.predict(X_train)
predicted_values[:30]
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0,
0, 0, 0, 0, 1, 0, 1, 0], dtype=int64)
It is clearly visible from the classification reports that the model achieves a perfect score of 1 on the training data, but the metrics drop significantly on the test data. This is due to overfitting: the classifier tries to learn every data point in the training set and consequently performs poorly on the test set.
#classification report
from sklearn.metrics import classification_report
print(classification_report(y_train, predicted_values))
| | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| 0 | 1.00 | 1.00 | 1.00 | 14234 |
| 1 | 1.00 | 1.00 | 1.00 | 3419 |
| Accuracy | | | 1.00 | 17653 |
| Macro Avg | 1.00 | 1.00 | 1.00 | 17653 |
| Weighted Avg | 1.00 | 1.00 | 1.00 | 17653 |
predicted_values = classifier.predict(X_test)
print(classification_report(y_test, predicted_values))
| | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| 0 | 0.86 | 0.87 | 0.86 | 3559 |
| 1 | 0.42 | 0.40 | 0.41 | 855 |
| Accuracy | | | 0.78 | 4414 |
| Macro Avg | 0.64 | 0.63 | 0.64 | 4414 |
| Weighted Avg | 0.77 | 0.78 | 0.77 | 4414 |
Because the classifier effectively memorizes the training data, it scores perfectly on the training classification report but significantly lower on the test report.
If a model performs very well on the training data but its performance drops significantly on the test data, it is an overfitting (or memorizing) model. If a model is not learning and performs consistently poorly on both the train and test sets, it is an underfitting model. A model that shows roughly equal results on both sets is a best-fit model.
Why Did the Decision Tree Overfit While the Linear Model Didn’t?
The primary reason for this difference lies in the nature of the models:
- Linear models are parametric models, meaning they have a fixed and finite set of parameters to learn from the data.
- Decision trees are non-parametric models, meaning their complexity grows with the data; left unconstrained, a tree keeps splitting until it fits the training data almost perfectly.
Visualizing Overfitting with Graphviz
To understand the depth and complexity of our decision tree, we can use the Graphviz module in Python to visualize its structure.
Visualization
Step 1: Generating the Tree Structure
- First, we generate the tree structure using the classifier instance.
- We use the `export_graphviz` function to create an intermediary file containing information about the nodes and edges of the tree.
Step 2: Creating a Visual Representation
- Using the Graphviz module, we convert this file into a PNG image.
- This process creates a tree visualization that can be easily analyzed.
Understanding the Output
- The full decision tree is extremely large and difficult to interpret without zooming in.
- By analyzing the leaf nodes, we can conclude that the tree memorizes the dataset, leading to an overfitted model.
from sklearn.tree import export_graphviz
# Step 1: dump the fitted tree's structure to an intermediary file
export_graphviz(decision_tree=classifier,
                out_file='tree_viz',
                max_depth=None,
                feature_names=X.columns,
                label='none', impurity=False)
from graphviz import render
# Step 2: render the intermediary file to a PNG image
render(filepath='tree_viz',
       format='png',
       engine='neato')
This produces the image file `tree_viz.png`.
HyperParameter Tuning
These methods are used to prevent overfitting in decision trees. Pruning is the technique of preventing a decision tree from overfitting by tuning its characteristics. To analyse the effect of different decision tree hyperparameters on the train and test scores, we created two functions, as shown in the figure; the effect function plots the train-score and test-score curves against the hyperparameter's values.
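The original figure with the two helper functions is not reproduced here, but a plausible sketch of what `score` and `effect` could look like follows (the function names, signatures, and synthetic stand-in data are all assumptions, not the chapter's actual code):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data; the chapter itself uses the churn dataset.
X, y = make_classification(n_samples=2000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

def score(param_name, value):
    """Fit a tree with one hyperparameter setting; return (train, test) accuracy."""
    clf = DecisionTreeClassifier(random_state=0, **{param_name: value})
    clf.fit(X_train, y_train)
    return clf.score(X_train, y_train), clf.score(X_test, y_test)

def effect(param_name, values):
    """Plot train vs. test accuracy across a range of hyperparameter values."""
    train_scores, test_scores = zip(*(score(param_name, v) for v in values))
    plt.plot(values, train_scores, label='train')
    plt.plot(values, test_scores, label='test')
    plt.xlabel(param_name)
    plt.ylabel('accuracy')
    plt.legend()
    plt.show()

effect('max_depth', range(1, 15))
```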
Max Depth
Applying the score-effect function to the max_depth parameter of the decision tree, we can see that as max depth increases, the train score rises significantly while the test score eventually falls. Note that the train and test scores both increase together up to a point; beyond that point the tree starts overfitting, which leads to a continued increase in the train score and a drastic decrease in the test score.
Min Sample Split
Min-sample-split tells the decision tree the minimum number of observations a node must contain in order to be split (default = 2). For small values the train and test scores diverge widely; for large values they track each other closely; for very large values both scores dip, indicating underfitting. Optimal range: 950-1000.
Max Leaf
Max leaf nodes specifies the maximum number of terminal nodes allowed. For very low values the tree underfits; for very high values it overfits. According to this plot, overfitting starts beyond 25.
Min Sample Leaf
Min-sample-leaf denotes the minimum number of sample points that must remain in each leaf after a split. As the value increases, the model escapes overfitting over a small range of values, after which it starts to underfit.
Macro parameters like max-depth prevent the tree from growing at a macro level, while the other three parameters fine-tune the decision tree at a micro level. A good practice is to first find the best range at the macro level, so the tree does not overfit, and then fine-tune the other three parameters.
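One standard way to search these macro and micro hyperparameters jointly is sklearn's `GridSearchCV`; the grid values and synthetic stand-in data below are illustrative assumptions, not the chapter's actual settings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data; the chapter itself uses the churn dataset.
X, y = make_classification(n_samples=2000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

param_grid = {
    'max_depth': [3, 5, 8],              # macro: cap overall tree growth first
    'min_samples_split': [2, 200, 950],  # micro: fine-tune node splitting
    'max_leaf_nodes': [10, 25, 50],      # micro: cap terminal nodes
}
search = GridSearchCV(DecisionTreeClassifier(class_weight='balanced'),
                      param_grid, cv=5, scoring='f1')
search.fit(X_train, y_train)
print(search.best_params_)
print(round(search.best_estimator_.score(X_test, y_test), 3))
```

Cross-validation inside the grid search plays the same role as the train/test score curves above: it flags hyperparameter settings that only look good on the training data.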
