Dimensionality Reduction Using PCA in Scikit-Learn

Dimensionality reduction compresses a dataset by projecting it onto a smaller set of informative features. Principal Component Analysis (PCA) is one of the most widely used algorithms for this task, and Scikit-Learn provides it in several variants, described below.

Exact PCA

PCA reduces dimensionality linearly by applying Singular Value Decomposition (SVD) to centered input data. Scikit-Learn's PCA class works as a transformer: it learns n components during fitting, after which new data can be projected onto those components. The following example uses this class to find the top five principal components of the Pima Indians Diabetes dataset.

from pandas import read_csv
from sklearn.decomposition import PCA

path = "pima-indians-diabetes.csv"
headers = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataset = read_csv(path, names=headers)
values = dataset.values
# Skip the file's original header row and cast the values to floats
data = values[1:, 0:8].astype(float)
target = values[1:, 8]

pca = PCA(n_components=5)
fit = pca.fit(data)
print("The explained variance is: %s\n" % fit.explained_variance_ratio_)
print(fit.components_)

Output

The explained variance is: [0.88854663 0.06159078 0.02579012 0.01308614 0.00744094]

[[-2.02176587e-03  9.78115765e-02  1.60930503e-02  6.07566861e-02
   9.93110844e-01  1.40108085e-02  5.37167919e-04 -3.56474430e-03]
 [ 2.26488861e-02  9.72210040e-01  1.41909330e-01 -5.78614699e-02
  -9.46266913e-02  4.69729766e-02  8.16804621e-04  1.40168181e-01]
 [ 2.24649003e-02 -1.43428710e-01  9.22467192e-01  3.07013055e-01
  -2.09773019e-02  1.32444542e-01  6.39983017e-04  1.25454310e-01]
 [-4.90459604e-02  1.19830016e-01 -2.62742788e-01  8.84369380e-01
  -6.55503615e-02  1.92801728e-01  2.69908637e-03 -3.01024330e-01]
 [ 1.51612874e-01 -8.79407680e-02 -2.32165009e-01  2.59973487e-01
  -1.72312241e-04  2.14744823e-02  1.64080684e-03  9.20504903e-01]]

The “pima-indians-diabetes.csv” dataset can be downloaded using the following link:

https://github.com/npradaschnor/Pima-Indians-Diabetes-Dataset/blob/master/diabetes.csv
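Once fitted, the PCA object can also project data onto the learned components with transform, and map it back with inverse_transform. A minimal sketch on synthetic random data (standing in for the eight Pima features, so the snippet is self-contained):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))  # synthetic stand-in for the 8 feature columns

pca = PCA(n_components=5)
pca.fit(X)

X_reduced = pca.transform(X)                # project onto the 5 components
print(X_reduced.shape)                      # (100, 5)

X_restored = pca.inverse_transform(X_reduced)  # approximate reconstruction
print(X_restored.shape)                        # (100, 8)
```

The reconstruction is only approximate: the three discarded components carry some variance that cannot be recovered.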
Incremental PCA

Incremental Principal Component Analysis (IPCA) addresses a key limitation of PCA by allowing out-of-core processing, so datasets too large to fit in memory can be handled in batches. As with PCA, the data is centered (but not scaled) per feature before the SVD is applied. The example below shows how to apply this class to the digits dataset.

from sklearn.datasets import load_digits
from sklearn.decomposition import IncrementalPCA

data, _ = load_digits(return_X_y=True)

transformer = IncrementalPCA(n_components=10, batch_size=100)
# partial_fit consumes one batch at a time (useful when streaming from disk);
# the fit_transform call below refits from scratch on the full dataset,
# processing it internally in chunks of batch_size.
transformer.partial_fit(data[:100, :])
transformed_data = transformer.fit_transform(data)
print(transformed_data.shape)

Output

(1797, 10)
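The real out-of-core pattern is to call partial_fit once per chunk, as if each chunk were read from disk. A short sketch, here simulating the chunks with numpy.array_split on the in-memory digits data:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import IncrementalPCA

data, _ = load_digits(return_X_y=True)

ipca = IncrementalPCA(n_components=10)
# Feed the data one chunk at a time; each chunk must contain
# at least n_components samples.
for batch in np.array_split(data, 18):
    ipca.partial_fit(batch)

reduced = ipca.transform(data)
print(reduced.shape)  # (1797, 10)
```

In a genuine out-of-core setting, each iteration would instead load one chunk from disk (for example with pandas' chunksize option) before passing it to partial_fit.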
Kernel PCA

Kernel PCA extends PCA to non-linear dimensionality reduction using kernels, and Scikit-Learn's implementation supports both transform and inverse_transform. The following example applies this algorithm to the digits dataset using the sigmoid kernel.

from sklearn.datasets import load_digits
from sklearn.decomposition import KernelPCA

data, _ = load_digits(return_X_y=True)

transformer = KernelPCA(n_components=10, kernel='sigmoid')
transformed_data = transformer.fit_transform(data)
print(transformed_data.shape)

Output

(1797, 10)
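Note that inverse_transform is only available when the pre-image mapping is learned during fitting, which is disabled by default. A minimal sketch enabling it via the fit_inverse_transform parameter:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import KernelPCA

data, _ = load_digits(return_X_y=True)

# fit_inverse_transform=True learns an approximate mapping back to the
# original space (it is off by default because it adds training cost).
kpca = KernelPCA(n_components=10, kernel='sigmoid', fit_inverse_transform=True)
reduced = kpca.fit_transform(data)
restored = kpca.inverse_transform(reduced)
print(reduced.shape, restored.shape)  # (1797, 10) (1797, 64)
```

Because kernel PCA operates in an implicit feature space, the inverse mapping is learned by regression and the reconstruction is approximate.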
PCA Using Randomized SVD

PCA with a randomized SVD solver approximates the decomposition by truncating the components with the smallest singular values, which speeds up computation on large datasets while preserving most of the variance. The example below applies this solver to find the top eight principal components of the Pima Indians Diabetes dataset.

from pandas import read_csv
from sklearn.decomposition import PCA

path = "pima-indians-diabetes.csv"
headers = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataset = read_csv(path, names=headers)
values = dataset.values
# Skip the file's original header row and cast the values to floats
data = values[1:, 0:8].astype(float)
target = values[1:, 8]

pca = PCA(n_components=8, svd_solver='randomized')
fit = pca.fit(data)
print("The explained variance is: %s\n" % fit.explained_variance_ratio_)
print(fit.components_)

Output

The explained variance is: [8.88546635e-01 6.15907837e-02 2.57901189e-02 1.30861374e-02
 7.44093864e-03 3.02614919e-03 5.12444875e-04 6.79264301e-06]

[[-2.02176587e-03  9.78115765e-02  1.60930503e-02  6.07566861e-02
   9.93110844e-01  1.40108085e-02  5.37167919e-04 -3.56474430e-03]
 [ 2.26488861e-02  9.72210040e-01  1.41909330e-01 -5.78614699e-02
  -9.46266913e-02  4.69729766e-02  8.16804621e-04  1.40168181e-01]
 [ 2.24649003e-02 -1.43428710e-01  9.22467192e-01  3.07013055e-01
  -2.09773019e-02  1.32444542e-01  6.39983017e-04  1.25454310e-01]
 [-4.90459604e-02  1.19830016e-01 -2.62742788e-01  8.84369380e-01
  -6.55503615e-02  1.92801728e-01  2.69908637e-03 -3.01024330e-01]
 [ 1.51612874e-01 -8.79407680e-02 -2.32165009e-01  2.59973487e-01
  -1.72312241e-04  2.14744823e-02  1.64080684e-03  9.20504903e-01]
 [ 5.04730888e-03 -5.07391813e-02 -7.56365525e-02 -2.21363068e-01
   6.13326472e-03  9.70776708e-01  2.02903702e-03  1.51133239e-02]
 [ 9.86672995e-01  8.83426114e-04 -1.22975947e-03 -3.76444746e-04
   1.42307394e-03 -2.73046214e-03 -6.34402965e-03 -1.62555343e-01]
 [ 6.10123250e-03 -8.25459539e-04  5.20865450e-04 -2.54871909e-03
  -2.68965921e-04 -2.67341863e-03  9.99972146e-01 -1.95271966e-03]]
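Rather than fixing n_components up front, PCA can also choose the smallest number of components that retains a target fraction of the variance, by passing a float between 0 and 1 as n_components. A short sketch on the digits dataset (used here instead of the Pima data so the snippet is self-contained):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

data, _ = load_digits(return_X_y=True)

# A float n_components keeps the fewest components whose cumulative
# explained variance reaches that fraction; this mode requires the
# 'full' (exact) SVD solver.
pca = PCA(n_components=0.95, svd_solver='full')
reduced = pca.fit_transform(data)
print(reduced.shape[1], pca.explained_variance_ratio_.sum())
```

This is a convenient way to pick the dimensionality from the data itself instead of guessing a fixed number of components.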
