1. Types of Decision Tree Algorithms
A decision tree is a robust non-parametric supervised learning technique for classification and regression. It predicts the value of a target variable by applying decision rules derived from the data features. Its key components are the root node, where the data are first split, the internal decision nodes, and the leaves, which hold the final outputs.
There are different types of decision tree algorithms, as briefly described below:
- ID3 (Iterative Dichotomiser 3): For each node, this algorithm selects the categorical feature that yields the largest information gain for a categorical target. Trees are grown to their maximum size, and a pruning step is then applied to improve performance on unseen data. The result is a multiway tree.
- C4.5: The successor to ID3, this algorithm removes the restriction that features must be categorical by dynamically defining a discrete attribute that partitions continuous values into a set of intervals. It converts the trained tree into sets of IF-THEN rules and evaluates the accuracy of each rule to determine the order in which they should be applied.
- C5.0: This algorithm uses less memory, builds smaller rulesets, and is more accurate than C4.5.
- CART (Classification and Regression Trees): This algorithm generates binary splits using, at each node, the feature and threshold that yield the largest information gain, as measured by the Gini index; lower Gini impurity indicates greater homogeneity. Unlike C4.5, it does not compute rule sets, but it does support numerical target variables (regression). A small sketch of the Gini computation follows this list.
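To make the Gini criterion concrete, here is a minimal sketch in plain Python (the helper function and sample labels are illustrative, not part of Scikit-Learn) of how a candidate split is scored:
from collections import Counter
def gini_impurity(labels):
    # Gini impurity: 1 minus the sum of squared class proportions.
    # 0.0 means a perfectly homogeneous (pure) node.
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())
print(gini_impurity(['Man', 'Man', 'Man']))             # pure node: 0.0
print(gini_impurity(['Man', 'Woman', 'Man', 'Woman']))  # evenly mixed: 0.5
# CART scores a candidate binary split by the size-weighted impurity of the
# two children and keeps the feature/threshold pair that lowers it the most.
left, right = ['Man', 'Man'], ['Woman', 'Woman', 'Man']
n = len(left) + len(right)
print((len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right))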
1.1. Classification With Decision Tree
The decision variables are categorical in this case. The example below shows how to build a classifier that predicts man or woman from a dataset of 20 samples with two features, height and length of hair.
from sklearn import tree
# Toy dataset: 20 samples with two features (height, length of hair).
data = [[176, 15], [131, 32], [166, 6], [128, 32], [112, 38], [169, 9], [171, 36], [116, 25], [196, 25], [196, 38], [126, 40], [197, 20], [150, 25], [140, 32], [136, 35], [165, 19], [175, 32], [136, 35], [174, 65], [141, 28]]
target = ['Man', 'Woman', 'Man', 'Woman', 'Man', 'Woman', 'Woman', 'Woman', 'Man', 'Woman', 'Man', 'Woman', 'Woman', 'Man', 'Woman', 'Woman', 'Man', 'Woman', 'Woman', 'Man']
data_feature_names = ['height', 'length of hair']
# Fit a decision tree classifier on the full dataset.
classifier = tree.DecisionTreeClassifier()
classifier = classifier.fit(data, target)
# Predict the class of a new, unseen sample.
prediction = classifier.predict([[135, 29]])
print(prediction)
# Class-membership probabilities for the same sample (classes ordered alphabetically: Man, Woman).
prediction = classifier.predict_proba([[135, 29]])
print(prediction)
Output
['Woman']
[[0. 1.]]
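Note that the data_feature_names list defined above is not used during fitting, but it can label the splits when inspecting the learned rules. As a small follow-up, Scikit-Learn's export_text helper (available since version 0.21) prints the fitted tree in readable form:
from sklearn.tree import export_text
# Continue from the fitted classifier above; label splits with the feature names.
print(export_text(classifier, feature_names = data_feature_names))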
1.2. Regression With Decision Tree
The decision variables are continuous in this case. As the following example shows, in the regression setting the fit() method takes floating-point values for the target.
from sklearn import tree
# A minimal regression example: two samples, two features each, float targets.
data = [[2, 3], [1, 4]]
target = [1.2, 0.5]
regressor = tree.DecisionTreeRegressor()
regressor = regressor.fit(data, target)
# Predict the target for a new sample.
print(regressor.predict([[3, 1]]))
Output
[1.2]
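Because the tree above is fitted on only two samples, it has just two leaves, and every prediction falls back to one of the two training targets (here 1.2). A slightly larger sketch with hypothetical one-dimensional data makes this piecewise-constant behaviour easier to see:
from sklearn import tree
# Hypothetical 1-D data: the target roughly follows y = x / 10.
data = [[1], [2], [3], [6], [7], [8]]
target = [0.1, 0.2, 0.3, 0.6, 0.7, 0.8]
# max_depth = 1 allows a single split, so the tree has exactly two leaves.
regressor = tree.DecisionTreeRegressor(max_depth = 1)
regressor = regressor.fit(data, target)
# Every input is mapped to the mean target of the leaf it lands in,
# so this should print roughly [0.2 0.2 0.7].
print(regressor.predict([[0], [4], [9]]))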
2. Randomized Decision Tree Algorithms
A single decision tree often overfits when trained on an entire dataset, so random forests instead train multiple trees on different subsamples of the data. Scikit-Learn includes two algorithms based on randomized decision trees: random forest and extra-tree.
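A quick way to see the effect is to cross-validate a single tree and a forest on the same data. The sketch below uses make_classification to generate an illustrative dataset (not one of the datasets used in the examples of this section); the exact scores depend on the generated data, but the forest typically scores higher because averaging many trees reduces variance:
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
# A synthetic problem with 20 features, only 5 of which are informative.
data, target = make_classification(n_samples = 1000, n_features = 20, n_informative = 5, random_state = 0)
single_tree = DecisionTreeClassifier(random_state = 0)
forest = RandomForestClassifier(n_estimators = 100, random_state = 0)
print(cross_val_score(single_tree, data, target, cv = 5).mean())
print(cross_val_score(forest, data, target, cv = 5).mean())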
2.1. Random Forest
Random forest builds each decision tree from a bootstrap sample of the training set and, at each node, searches for the best split within a random subset of the features. Every tree produces a prediction, and Scikit-Learn's ensemble combines them by averaging the trees' probabilistic predictions (rather than simple majority voting); this works for both classification and regression tasks.
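The fitted trees are exposed through the ensemble's estimators_ attribute, so the averaging can be reproduced by hand. A minimal sketch (the toy data here is illustrative only):
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
data, target = make_blobs(n_samples = 100, centers = 2, random_state = 0)
forest = RandomForestClassifier(n_estimators = 10, random_state = 0).fit(data, target)
# Average the class-probability estimates of the individual trees by hand...
manual = np.mean([est.predict_proba(data[:5]) for est in forest.estimators_], axis = 0)
# ...and confirm it matches what the ensemble's predict_proba reports.
print(np.allclose(manual, forest.predict_proba(data[:5])))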
2.1.1. Classification With Random Forest
This classifier builds random forests using two key parameters, max_features and n_estimators. The max_features parameter controls the size of the random feature subset considered when splitting a node; if set to None, all features are considered. The n_estimators parameter specifies the number of trees in the forest; more trees generally improve the results but also increase the computation time. The following examples show how to build a random forest classifier and check its accuracy on a randomly generated dataset and on the iris dataset, respectively.
Example 1 with the random dataset:
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
# Generate a synthetic dataset: 5000 samples, 10 features, 500 cluster centers.
data, target = make_blobs(n_samples = 5000, n_features = 10, centers = 500, random_state = 0)
classifier = RandomForestClassifier(n_estimators = 20, max_depth = None, min_samples_split = 5, random_state = 0)
# Estimate accuracy with 5-fold cross-validation.
scores = cross_val_score(classifier, data, target, cv = 5)
print(scores.mean())
Output
0.9986
Example 2 with the iris dataset:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
# Load the iris dataset from the UCI repository (the file has no header row).
path = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
headers = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = pd.read_csv(path, names = headers)
# The first four columns are the features; the fifth is the class label.
data = dataset.iloc[:, :-1].values
target = dataset.iloc[:, 4].values
# Hold out 30% of the samples for testing; no random_state is fixed, so results vary between runs.
data_train, data_test, target_train, target_test = train_test_split(data, target, test_size = 0.30)
classifier = RandomForestClassifier(n_estimators = 200)
classifier.fit(data_train, target_train)
target_prediction = classifier.predict(data_test)
result1 = confusion_matrix(target_test, target_prediction)
print("The confusion matrix is:\n", result1, "\n")
result2 = classification_report(target_test, target_prediction)
print("The classification report is:\n", result2)
result3 = accuracy_score(target_test, target_prediction)
print("The accuracy is:", result3)
Output
The confusion matrix is:
[[12  0  0]
 [ 0 19  1]
 [ 0  0 13]]

The classification report is:
                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        12
Iris-versicolor       1.00      0.95      0.97        20
 Iris-virginica       0.93      1.00      0.96        13

       accuracy                           0.98        45
      macro avg       0.98      0.98      0.98        45
   weighted avg       0.98      0.98      0.98        45
The accuracy is: 0.9777777777777777
2.1.2. Regression With Random Forest
Scikit-Learn’s random forest regressor takes largely the same parameters as the classifier. The example below shows how to build a random forest regressor and use it to predict values for new samples.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
# Generate a synthetic regression dataset with 10 features, 2 of them informative.
data, target = make_regression(n_features = 10, n_informative = 2, random_state = 0, shuffle = False)
regressor = RandomForestRegressor(max_depth = 20, random_state = 0, n_estimators = 100)
regressor.fit(data, target)
# Predict the target for a new sample.
print(regressor.predict([[0, 2, 1, 0, 1, 2, 1, 1, 2, 2]]))
Output
[98.76936898]
2.2. Extra-Tree
Extra-tree methods select split thresholds at random rather than searching for the best one, which helps to reduce the model's variance at the cost of a slight increase in bias. This trade-off can influence the overall performance of the model.
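As a rough empirical check of this trade-off, the two ensembles can be cross-validated on the same synthetic data. This sketch uses make_classification (illustrative data, not from the examples below), and the fold-to-fold spread of scores is only a crude proxy for model variance:
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.model_selection import cross_val_score
data, target = make_classification(n_samples = 1000, n_features = 20, n_informative = 5, random_state = 0)
for Model in (RandomForestClassifier, ExtraTreesClassifier):
    scores = cross_val_score(Model(n_estimators = 100, random_state = 0), data, target, cv = 5)
    # Report the mean score and its spread across the five folds.
    print(Model.__name__, scores.mean(), scores.std())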
2.2.1. Classification With Extra-Tree
The extra-tree classifier in Scikit-Learn builds classifiers in much the same way as the random forest classifier but differs in how the individual trees are constructed. The following examples show how to build an extra-tree classifier and check its accuracy on the random dataset and on the Pima-Indian dataset, respectively.
Example 1 with the random dataset:
from sklearn.datasets import make_blobs
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score
# The same synthetic dataset as in the random forest example above.
data, target = make_blobs(n_samples = 5000, n_features = 10, centers = 500, random_state = 0)
classifier = ExtraTreesClassifier(n_estimators = 20, max_depth = None, min_samples_split = 5, random_state = 0)
scores = cross_val_score(classifier, data, target, cv = 5)
print(scores.mean())
Output
0.9991999999999999
Example 2 with the Pima-Indian dataset:
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score
path = "pima-indians-diabetes.csv"
headers = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
# The CSV file ships with a header row, so skip it and apply our own column names.
dataset = read_csv(path, names = headers, skiprows = 1)
# The first eight columns are the features; the last column is the class label.
data = dataset.values[:, 0:8]
target = dataset.values[:, 8]
# Evaluate with 10-fold cross-validation.
kfold = KFold(n_splits = 10)
classifier = ExtraTreesClassifier(n_estimators = 150, max_features = 5)
scores = cross_val_score(classifier, data, target, cv = kfold)
print(scores.mean())
Output
0.770745044429255
The “pima-indians-diabetes.csv” dataset can be downloaded using the following link:
https://github.com/npradaschnor/Pima-Indians-Diabetes-Dataset/blob/master/diabetes.csv
2.2.2. Regression With Extra-Tree
Scikit-Learn’s extra-tree regressor takes the same parameters as the extra-tree classifier. The example below applies this regressor to the same data that was used for the random forest regressor.
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor
# The same synthetic regression dataset as in the random forest example.
data, target = make_regression(n_features = 10, n_informative = 2, random_state = 0, shuffle = False)
regressor = ExtraTreesRegressor(max_depth = 20, random_state = 0, n_estimators = 100)
regressor.fit(data, target)
print(regressor.predict([[0, 2, 1, 0, 1, 2, 1, 1, 2, 2]]))
Output
[86.99469325]
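Note that, for the exact same input, the extra-tree regressor predicts a noticeably different value (about 87.0) than the random forest regressor did above (about 98.8). The fully random choice of split thresholds changes the structure of the individual trees, which is the bias-variance trade-off described at the start of this section.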
