
How is the Modeling Process Done in Scikit-Learn?


The modeling process in scikit-learn consists of several steps, including loading the dataset, splitting it into training and testing sets, training a model, persisting the trained model, and preprocessing the data. Each of these steps is described below.

1. Loading the Dataset

Scikit-learn includes small example datasets, such as iris and digits for classification and diabetes for regression (the Boston house prices dataset has been removed from recent versions of the library). The following example loads the iris dataset and displays its feature names, target names, and the first few rows of data.

from sklearn.datasets import load_iris

iris_dataset = load_iris()
data = iris_dataset.data
target = iris_dataset.target

feature_names = iris_dataset.feature_names
target_names = iris_dataset.target_names

print("Feature names:", feature_names)
print("Target names:", target_names)
print("\nFirst 5 rows of data:\n", data[:5])

Output

Feature names: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Target names: ['setosa' 'versicolor' 'virginica']

First 5 rows of data:
 [[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]
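
The digits dataset mentioned above can be loaded in the same way; a minimal sketch (the values in the comments are for the bundled dataset):

from sklearn.datasets import load_digits

digits = load_digits()

# Each digit image is an 8x8 grid of pixel intensities, flattened into 64 features
print(digits.data.shape)     # (1797, 64)
print(digits.target_names)   # [0 1 2 3 4 5 6 7 8 9]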
2. Splitting the Dataset

The following example loads the iris dataset, splits it with train_test_split, and displays the shapes of the resulting training and testing arrays. The data is split in an 80:20 ratio, meaning 80% of the samples are used for training and 20% for testing.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
data = iris.data
target = iris.target

data_train, data_test, target_train, target_test = train_test_split(
   data, target, test_size=0.2, random_state=1
)

print(data_train.shape)
print(data_test.shape)
print(target_train.shape)
print(target_test.shape)

Output

(120, 4)
(30, 4)
(120,)
(30,)
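
If the class proportions should be preserved in both subsets, train_test_split also accepts a stratify argument; a brief sketch of this variant:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()

# Passing the labels to stratify keeps the three iris classes
# in roughly equal proportions in the training and testing subsets
data_train, data_test, target_train, target_test = train_test_split(
   iris.data, iris.target, test_size=0.2, random_state=1, stratify=iris.target
)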
3. Training the Model

The example below trains a K-Nearest Neighbors (KNN) classifier on the iris dataset, evaluates its accuracy on the test split, and uses it to predict the species of two new samples.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

iris = load_iris()
data = iris.data
target = iris.target

data_train, data_test, target_train, target_test = train_test_split(
   data, target, test_size=0.2, random_state=1
)

classifier = KNeighborsClassifier(n_neighbors=5)
classifier.fit(data_train, target_train)
target_pred = classifier.predict(data_test)

# Comparing actual (target_test) and predicted (target_pred) responses for accuracy
print("Accuracy of the model:", metrics.accuracy_score(target_test, target_pred))

# Predicting outcomes with the model based on the provided sample data
sample_data = [[4, 4, 1, 3], [3, 2, 5, 4]]
preds = classifier.predict(sample_data)
pred_species = [iris.target_names[p] for p in preds]
print("Predictions:", pred_species)

Output

Accuracy of the model: 1.0
Predictions: [np.str_('setosa'), np.str_('virginica')]
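
The accuracy depends on the choice of n_neighbors; a small sketch that compares a few values on the same split (the exact accuracies may differ slightly between scikit-learn versions):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

iris = load_iris()
data_train, data_test, target_train, target_test = train_test_split(
   iris.data, iris.target, test_size=0.2, random_state=1
)

# Comparing test accuracy for several values of n_neighbors
for k in (1, 3, 5, 7):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(data_train, target_train)
    print("k =", k, "accuracy:", metrics.accuracy_score(target_test, knn.predict(data_test)))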
4. Persisting the Model

The trained classifier can be saved to disk with joblib (installed as a scikit-learn dependency) for future use:

import joblib
joblib.dump(classifier, 'iris_classifier.joblib')

Loading the trained model:

classifier = joblib.load('iris_classifier.joblib')
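
As a quick check that the reloaded model behaves like the original, it can be used for prediction directly; a minimal sketch (the sample below is the first row of the iris data, so the expected prediction is class 0, 'setosa'):

import joblib

loaded_classifier = joblib.load('iris_classifier.joblib')

# Sepal length/width and petal length/width (cm) of the first iris sample
sample = [[5.1, 3.5, 1.4, 0.2]]
print(loaded_classifier.predict(sample))   # [0], i.e. 'setosa'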
5. Preprocessing the Data

Preprocessing transforms raw data into a form that machine learning algorithms can use effectively, and it is usually applied before model training. The most common techniques are described below.
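
These preprocessing steps are typically combined with an estimator. As a brief sketch (the Pipeline class is not covered elsewhere in this article), a scaler can be chained with the KNN classifier from the previous section:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
data_train, data_test, target_train, target_test = train_test_split(
   iris.data, iris.target, test_size=0.2, random_state=1
)

# Chaining scaling and classification so both are applied consistently
# to the training and testing data
model = Pipeline([
   ("scaler", StandardScaler()),
   ("knn", KNeighborsClassifier(n_neighbors=5)),
])
model.fit(data_train, target_train)
print("Accuracy:", model.score(data_test, target_test))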

5.1. Binarisation

This preprocessing technique converts numerical values into Boolean values. In the following example, a threshold of 0.9 is used: values greater than the threshold are converted to 1, while values less than or equal to it are converted to 0.

import numpy as np
from sklearn import preprocessing

raw_data = np.array(
   [[3.6, 0.9, -2.7],
   [10.5, -2.2, -1.3],
   [-1.1, 1.5, -1.1],
   [-1.1, 1.2, 1.3]]
)

binary_data = preprocessing.Binarizer(threshold=0.9).transform(raw_data)
print("\nBinary data:\n", binary_data)

Output

Binary data:
 [[1. 0. 0.]
 [1. 0. 0.]
 [0. 1. 0.]
 [0. 1. 1.]]
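
The same result can be reproduced with a plain NumPy comparison, which also makes the threshold rule explicit (values strictly greater than 0.9 become 1):

import numpy as np

raw_data = np.array(
   [[3.6, 0.9, -2.7],
   [10.5, -2.2, -1.3],
   [-1.1, 1.5, -1.1],
   [-1.1, 1.2, 1.3]]
)

# Values strictly greater than the threshold become 1, the rest become 0
print((raw_data > 0.9).astype(float))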
5.2. Mean Removal

This technique removes the mean from each feature so that every feature is centred on zero; preprocessing.scale also scales each feature to unit variance, as the example below shows.

import numpy as np
from sklearn import preprocessing

raw_data = np.array(
   [[3.6, 0.9, -2.7],
   [10.5, -2.2, -1.3],
   [-1.1, 1.5, -1.1],
   [-1.1, 1.2, 1.3]]
)

# Displaying the mean and standard deviation of each feature
print("Mean:", raw_data.mean(axis=0))
print("Standard deviation:", raw_data.std(axis=0))

# Standardizing the data: the mean is removed and each feature is scaled to unit variance
scaled_data = preprocessing.scale(raw_data)
print("Mean (removed):", scaled_data.mean(axis=0))
print("Standard deviation (removed):", scaled_data.std(axis=0))

Output

Mean: [ 2.975  0.35  -0.95 ]
Standard deviation: [4.74940786 1.48744748 1.43788038]
Mean (removed): [ 0.00000000e+00  0.00000000e+00 -1.11022302e-16]
Standard deviation (removed): [1. 1. 1.]
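
The same transformation is available through the StandardScaler estimator, which stores the learned mean and standard deviation so the identical scaling can later be applied to new data; a minimal sketch:

import numpy as np
from sklearn.preprocessing import StandardScaler

raw_data = np.array(
   [[3.6, 0.9, -2.7],
   [10.5, -2.2, -1.3],
   [-1.1, 1.5, -1.1],
   [-1.1, 1.2, 1.3]]
)

# fit() learns the per-column mean and standard deviation;
# transform() applies the same centring and scaling to any data
scaler = StandardScaler().fit(raw_data)
print(scaler.mean_)
print(scaler.transform(raw_data).mean(axis=0))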
5.3. Scaling

Scaling the feature vectors is important so that features with large numeric ranges do not artificially dominate features with small ranges. The following example rescales each feature to the range 0 to 1 with MinMaxScaler.

import numpy as np
from sklearn import preprocessing

raw_data = np.array(
   [[3.6, 0.9, -2.7],
   [10.5, -2.2, -1.3],
   [-1.1, 1.5, -1.1],
   [-1.1, 1.2, 1.3]]
)

min_max_scaler = preprocessing.MinMaxScaler(feature_range=(0, 1))
scaled_data = min_max_scaler.fit_transform(raw_data)
print("Scaled data:\n", scaled_data)

Output

Scaled data:
 [[0.40517241 0.83783784 0.        ]
 [1.         0.         0.35      ]
 [0.         1.         0.4       ]
 [0.         0.91891892 1.        ]]
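
The scaled values follow the min-max formula (x - column_min) / (column_max - column_min), which can be verified directly in NumPy:

import numpy as np

raw_data = np.array(
   [[3.6, 0.9, -2.7],
   [10.5, -2.2, -1.3],
   [-1.1, 1.5, -1.1],
   [-1.1, 1.2, 1.3]]
)

# Manual min-max scaling: subtract the column minimum and divide by the column range
col_min = raw_data.min(axis=0)
col_max = raw_data.max(axis=0)
print((raw_data - col_min) / (col_max - col_min))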
5.4. Normalisation

Normalization rescales each sample (row) of the feature matrix so that the rows can be compared on a common scale. Two common types of normalization are described below.

L1 Normalisation

This type, also called Least Absolute Deviations, modifies the values so that the sum of the absolute values in each row equals 1. Applying L1 normalization to the input data is shown in the following example.

import numpy as np
from sklearn import preprocessing

raw_data = np.array(
   [[3.6, 0.9, -2.7],
   [10.5, -2.2, -1.3],
   [-1.1, 1.5, -1.1],
   [-1.1, 1.2, 1.3]]
)

normalized_data = preprocessing.normalize(raw_data, norm='l1')
print("Normalized data:\n", normalized_data)

Output

Normalized data:
 [[ 0.5         0.125      -0.375     ]
 [ 0.75       -0.15714286 -0.09285714]
 [-0.2972973   0.40540541 -0.2972973 ]
 [-0.30555556  0.33333333  0.36111111]]
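
The defining property can be checked directly: after L1 normalization, the absolute values in every row sum to 1.

import numpy as np
from sklearn import preprocessing

raw_data = np.array(
   [[3.6, 0.9, -2.7],
   [10.5, -2.2, -1.3],
   [-1.1, 1.5, -1.1],
   [-1.1, 1.2, 1.3]]
)

# Each row's absolute values sum to 1 after L1 normalization
normalized_data = preprocessing.normalize(raw_data, norm='l1')
print(np.abs(normalized_data).sum(axis=1))   # [1. 1. 1. 1.]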
L2 Normalisation

This type, also called Least Squares, modifies the values so that the sum of squares in each row equals 1, i.e. each row has unit Euclidean length. Applying L2 normalization to the input data is illustrated in the example below.

import numpy as np
from sklearn import preprocessing

raw_data = np.array(
   [[3.6, 0.9, -2.7],
   [10.5, -2.2, -1.3],
   [-1.1, 1.5, -1.1],
   [-1.1, 1.2, 1.3]]
)

normalized_data = preprocessing.normalize(raw_data, norm='l2')
print("Normalized data:\n", normalized_data)

Output

Normalized data:
 [[ 0.78446454  0.19611614 -0.58834841]
 [ 0.97163928 -0.20358156 -0.1202982 ]
 [-0.50901929  0.69411722 -0.50901929]
 [-0.5280169   0.57601843  0.62401997]]
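
Similarly, after L2 normalization each row has a Euclidean norm of 1, which can be verified as follows:

import numpy as np
from sklearn import preprocessing

raw_data = np.array(
   [[3.6, 0.9, -2.7],
   [10.5, -2.2, -1.3],
   [-1.1, 1.5, -1.1],
   [-1.1, 1.2, 1.3]]
)

# Each row's Euclidean norm equals 1 after L2 normalization
normalized_data = preprocessing.normalize(raw_data, norm='l2')
print(np.sqrt((normalized_data ** 2).sum(axis=1)))   # [1. 1. 1. 1.]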
