
Decision Tree Algorithms in Scikit-Learn

1. Types of Decision Tree Algorithms

A decision tree is a robust non-parametric supervised learning technique for classification and regression. It predicts target variable values using decision rules derived from data features. Key components include the root node, where the data is first split; internal decision nodes; and leaves, which hold the final outputs.

There are several types of decision tree algorithms, as briefly described below:

  • ID3 (Iterative Dichotomiser 3): At each node, this algorithm greedily selects the categorical feature that yields the largest information gain for a categorical target. The tree is grown to its maximum size, and a pruning step then improves its performance on unseen data. The result is a multiway tree.
  • C4.5: The successor to ID3, this algorithm removes the restriction to categorical features by dynamically defining discrete attributes that partition continuous values into intervals. It also converts the trained tree into sets of IF-THEN rules and evaluates each rule's accuracy to determine the order in which the rules should be applied.
  • C5.0: This algorithm uses less memory, builds smaller rulesets, and is more accurate than C4.5.
  • CART (classification and regression trees): This algorithm generates binary splits by choosing, at each node, the feature and threshold that yield the greatest reduction in impurity, measured by the Gini index; lower Gini values indicate greater homogeneity, with 0 denoting a pure node. Unlike C4.5, it does not compute rule sets, but it does support numerical target variables (regression). Scikit-Learn uses an optimized version of CART.
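To make the splitting criterion concrete, here is a minimal sketch (plain Python, with hypothetical labels) of how Gini impurity is computed and how CART scores a candidate binary split by the weighted impurity of the two children:

```python
from collections import Counter

def gini(labels):
    # Gini impurity: 0.0 for a pure node, approaching 1.0 for a uniform mix.
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def weighted_gini(left, right):
    # Weighted impurity of a binary split; CART prefers the split minimizing this.
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

pure = ['Man'] * 4                         # gini = 0.0 (pure node)
mixed = ['Man', 'Woman', 'Man', 'Woman']   # gini = 0.5 (50/50 mix)
print(gini(pure), gini(mixed))
print(weighted_gini(pure, mixed))          # 0.5 * 0.0 + 0.5 * 0.5 = 0.25
```

At each node, CART evaluates every feature/threshold candidate with this weighted score and keeps the split with the lowest value.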
1.1. Classification With Decision Tree

The decision variables are categorical in this case. The example below builds a classifier that predicts 'Man' or 'Woman' from a dataset with two features, height and length of hair, and 20 samples.

from sklearn import tree

data = [[176, 15], [131, 32], [166, 6], [128, 32], [112, 38], [169, 9], [171, 36], [116, 25], [196, 25], [196, 38], [126, 40], [197, 20], [150, 25], [140, 32], [136, 35], [165, 19], [175, 32], [136, 35], [174, 65], [141, 28]]
target = ['Man', 'Woman', 'Man', 'Woman', 'Man', 'Woman', 'Woman', 'Woman', 'Man', 'Woman', 'Man', 'Woman', 'Woman', 'Man', 'Woman', 'Woman', 'Man', 'Woman', 'Woman', 'Man']
data_feature_names = ['height', 'length of hair']

classifier = tree.DecisionTreeClassifier()
classifier = classifier.fit(data, target)
prediction = classifier.predict([[135, 29]])
print(prediction)
prediction = classifier.predict_proba([[135, 29]])
print(prediction)

Output

['Woman']
[[0. 1.]]
1.2. Regression With Decision Tree

The decision variables are continuous in this case. As the following example shows, in this regression setting the fit() method takes floating-point target values.

from sklearn import tree

data = [[2, 3], [1, 4]]
target = [1.2, 0.5]

regressor = tree.DecisionTreeRegressor()
regressor = regressor.fit(data, target)
print(regressor.predict([[3, 1]]))

Output

[1.2]
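A regression tree predicts the mean target value of the training samples that fall into each leaf. The sketch below, with hypothetical 1-D data, restricts the tree to a single split (max_depth = 1), so each prediction is simply the mean of one of the two leaves:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical 1-D data: target is roughly x squared with a little noise.
X = np.array([[1], [2], [3], [4], [5], [6]], dtype=float)
y = np.array([1.0, 4.1, 8.9, 16.2, 24.8, 36.1])

# With max_depth = 1 the tree has exactly two leaves; each new sample is
# assigned the mean target of its leaf's training points.
stump = DecisionTreeRegressor(max_depth = 1).fit(X, y)
print(stump.predict([[1.5], [5.5]]))
```

Limiting max_depth this way is also the simplest guard against overfitting: a fully grown tree would memorize every training target.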
2. Randomized Decision Tree Algorithms

Decision trees often overfit when trained on an entire dataset, so random forests train multiple trees on different subsamples of the data. Scikit-Learn includes two algorithms based on randomized decision trees: random forest and extra-trees.

2.1. Random Forest

Random forest constructs each decision tree from a bootstrap sample of the training set and considers a random subset of features when splitting each node. The ensemble then combines the individual trees' predictions: Scikit-Learn averages the trees' probabilistic predictions for classification and their outputs for regression, so both tasks are supported.
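The bootstrap step can be illustrated with a minimal sketch (hypothetical index data): each tree's sample is drawn with replacement, so on average only about 63% of the distinct training rows appear in any one tree's sample, leaving the rest "out of bag".

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
# Bootstrap: draw n row indices with replacement, as each tree's sample.
indices = rng.integers(0, n, size=n)
unique_fraction = len(np.unique(indices)) / n
print(round(unique_fraction, 3))   # close to 1 - 1/e, i.e. about 0.632
```

The out-of-bag rows are what Scikit-Learn uses for the optional oob_score estimate in its forest estimators.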

2.1.1. Classification With Random Forest

This classifier builds random forests for classification using two key parameters, max_features and n_estimators. The max_features parameter sets the size of the random feature subset considered at each split; if set to None, all features are considered. The n_estimators parameter sets the number of trees in the forest; more trees generally improve results but increase computation time. The following examples show how to build a random forest classifier and check its accuracy on a synthetic dataset and on the iris dataset, respectively.

Example 1 with the random dataset:

from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

data, target = make_blobs(n_samples = 5000, n_features = 10, centers = 500, random_state = 0)

classifier = RandomForestClassifier(n_estimators = 20, max_depth = None, min_samples_split = 5, random_state = 0)
scores = cross_val_score(classifier, data, target, cv = 5)
print(scores.mean())

Output

0.9986
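As a quick aside on n_estimators, the usual pattern (accuracy rises with more trees, then plateaus) can be sketched on a small hypothetical task; make_classification is used here instead of make_blobs purely for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical synthetic task, only to illustrate the trade-off.
data, target = make_classification(n_samples = 500, n_features = 10,
                                   n_informative = 5, random_state = 0)

scores = {}
for n in (1, 10, 100):
    classifier = RandomForestClassifier(n_estimators = n, random_state = 0)
    scores[n] = cross_val_score(classifier, data, target, cv = 5).mean()
    print(n, round(scores[n], 3))
```

Beyond the plateau, extra trees only add computation time, which is why n_estimators is usually tuned jointly with a time budget.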

Example 2 with the iris dataset:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

path = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
headers = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = pd.read_csv(path, names = headers)
data = dataset.iloc[:, :-1].values
target = dataset.iloc[:, 4].values

data_train, data_test, target_train, target_test = train_test_split(data, target, test_size = 0.30)

classifier = RandomForestClassifier(n_estimators = 200)
classifier.fit(data_train, target_train)

target_prediction = classifier.predict(data_test)
result1 = confusion_matrix(target_test, target_prediction)
print("The confusion matrix is:\n", result1, "\n")
result2 = classification_report(target_test, target_prediction)
print("The classification report is:\n", result2)
result3 = accuracy_score(target_test, target_prediction)
print("The accuracy is:", result3)

Output

The confusion matrix is:
 [[12  0  0]
 [ 0 19  1]
 [ 0  0 13]]

The classification report is:
                  precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        12
Iris-versicolor       1.00      0.95      0.97        20
 Iris-virginica       0.93      1.00      0.96        13

       accuracy                           0.98        45
      macro avg       0.98      0.98      0.98        45
   weighted avg       0.98      0.98      0.98        45

The accuracy is: 0.9777777777777777
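A fitted random forest also exposes feature_importances_, which is often the main reason to train one. The sketch below uses Scikit-Learn's bundled iris loader rather than the UCI download above:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
classifier = RandomForestClassifier(n_estimators = 200, random_state = 0)
classifier.fit(iris.data, iris.target)

# feature_importances_ sums to 1.0; larger values mean the feature was
# used more often (and more effectively) for splitting across the forest.
for name, importance in zip(iris.feature_names, classifier.feature_importances_):
    print(f"{name}: {importance:.3f}")
```

On iris, the two petal measurements typically dominate the importances, matching the near-perfect accuracy seen in the example above.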
2.1.2. Regression With Random Forest

Scikit-Learn’s RandomForestRegressor takes similar parameters to the classifier. The example below shows how to build a random forest regressor and use it to predict values for new samples.

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

data, target = make_regression(n_features = 10, n_informative = 2, random_state = 0, shuffle = False)

regressor = RandomForestRegressor(max_depth = 20, random_state = 0, n_estimators = 100)
regressor.fit(data, target)
print(regressor.predict([[0, 2, 1, 0, 1, 2, 1, 1, 2, 2]]))

Output

[98.76936898]
2.2. Extra-Tree

Extra-tree methods select split thresholds at random instead of searching for the best one, which reduces the model's variance at the cost of a slight increase in bias. Whether this trade-off helps overall performance depends on the task.
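The practical effect of this trade-off can be sketched by scoring both ensemble types on the same hypothetical synthetic task; which one wins is problem-dependent:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from sklearn.model_selection import cross_val_score

# Hypothetical synthetic task shared by both models.
data, target = make_regression(n_samples = 300, n_features = 10,
                               n_informative = 2, random_state = 0)

results = {}
for Model in (RandomForestRegressor, ExtraTreesRegressor):
    model = Model(n_estimators = 100, random_state = 0)
    results[Model.__name__] = cross_val_score(model, data, target, cv = 5).mean()
    print(Model.__name__, round(results[Model.__name__], 3))
```

Because extra-trees skip the exhaustive threshold search, they are also typically faster to train than random forests at equal n_estimators.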

2.2.1. Classification With Extra-Tree

The extra-tree classifier in Scikit-Learn creates classifiers similarly to the random forest classifier but differs in how the trees are constructed. The following examples show how to build an extra-tree classifier and check its accuracy on a synthetic dataset and on the Pima Indians diabetes dataset, respectively.

Example 1 with the random dataset:

from sklearn.datasets import make_blobs
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score

data, target = make_blobs(n_samples = 5000, n_features = 10, centers = 500, random_state = 0)

classifier = ExtraTreesClassifier(n_estimators = 20, max_depth = None, min_samples_split = 5, random_state = 0)
scores = cross_val_score(classifier, data, target, cv = 5)
print(scores.mean())

Output

0.9991999999999999

Example 2 with the Pima-Indian dataset:

from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score

path = "pima-indians-diabetes.csv"
headers = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
# header = 0 replaces the file's own header row with the names above,
# so every remaining row is parsed as numeric data.
dataset = read_csv(path, names = headers, header = 0)
data = dataset.values[:, 0:8]
target = dataset.values[:, 8]

kfold = KFold(n_splits = 10)
classifier = ExtraTreesClassifier(n_estimators = 150, max_features = 5)
scores = cross_val_score(classifier, data, target, cv = kfold)
print(scores.mean())

Output

0.770745044429255

The “pima-indians-diabetes.csv” dataset can be downloaded using the following link:

https://github.com/npradaschnor/Pima-Indians-Diabetes-Dataset/blob/master/diabetes.csv
2.2.2. Regression With Extra-Tree

Scikit-Learn’s extra-tree regressor takes the same parameters as the extra-tree classifier. The example below applies this regressor to the same data used for the random forest regressor.

from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor

data, target = make_regression(n_features = 10, n_informative = 2, random_state = 0, shuffle = False)

regressor = ExtraTreesRegressor(max_depth = 20, random_state = 0, n_estimators = 100)
regressor.fit(data, target)
print(regressor.predict([[0, 2, 1, 0, 1, 2, 1, 1, 2, 2]]))

Output

[86.99469325]
