Motive behind Data pre-processing

The motive behind data processing or mining is to obtain data and process it into models for analysis; insights obtained from these models support the decision-making process. Before modelling, incomplete data, as well as data that does not contribute towards the analysis, is pre-processed using:

  • Normalization
  • Dimensionality reduction

This notebook describes:

  • How PCA is implemented manually
  • How PCA can boost classification performance while using only about half of the dimensions of the input features

Principal Component Analysis

Dimensionality reduction is a data transformation technique used to reduce multidimensional data sets to a lower number of dimensions for further analysis. Its goal is to extract the important information from the data set.

Techniques for Dimensionality Reduction

  • Elimination : Drop variables that correlate only weakly with the target variable
    • This technique achieves results, but the dropped dimensions' effect on the target variable [however minimal] goes completely unaccounted for
  • Extraction : Analyze the variables and EXTRACT NEW variables [dimensions] from them. Insights from all variables are preserved, and the variance in the target variable can be fully described by these new dimensions (a short sketch contrasting both approaches follows this list).
    • This is the methodology behind PCA : decompose the data set as a function of the variance in the data
    • PCA represents the extracted information as a set of new orthogonal variables called principal components, and displays the pattern of similarity of the observations and of the variables as points in maps [Abdi & Williams, 2010]
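A minimal sketch contrasting the two approaches on a generic DataFrame df with a target column target (both names are placeholders, not objects from this notebook):

from sklearn.decomposition import PCA

def eliminate_low_corr(df, target, threshold=0.1):
    # Elimination: keep only the features whose absolute correlation
    # with the target exceeds the threshold; everything else is dropped
    corr = df.corr()[target].drop(target).abs()
    return df[corr[corr > threshold].index]

def extract_components(df, target, n_components=2):
    # Extraction: build new orthogonal variables (principal components)
    # from ALL original features, so no feature is discarded outright
    features = df.drop(columns=[target])
    return PCA(n_components=n_components).fit_transform(features)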

Import necessary libraries

import warnings
import pandas as pd
import numpy as np
from numpy import mean
from numpy import std
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import  train_test_split
from sklearn.linear_model import LogisticRegression

%matplotlib inline
warnings.filterwarnings('ignore')

Read red wine data

# references: 
# https://stackoverflow.com/questions/32400867/pandas-read-csv-from-url
# https://svaderia.github.io/articles/downloading-and-unzipping-a-zipfile/
import io
import requests
url="https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
s=requests.get(url).content
df_red_wine=pd.read_csv(io.StringIO(s.decode('utf-8')), sep=";")
print("shape of red wine data: ", df_red_wine.shape, "\n")
df_red_wine.head()
shape of red wine data:  (1599, 12) 
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
0 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5
1 7.8 0.88 0.00 2.6 0.098 25.0 67.0 0.9968 3.20 0.68 9.8 5
2 7.8 0.76 0.04 2.3 0.092 15.0 54.0 0.9970 3.26 0.65 9.8 5
3 11.2 0.28 0.56 1.9 0.075 17.0 60.0 0.9980 3.16 0.58 9.8 6
4 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5
df_red_wine['quality'].describe()
count    1599.000000
mean        5.636023
std         0.807569
min         3.000000
25%         5.000000
50%         6.000000
75%         6.000000
max         8.000000
Name: quality, dtype: float64

Convert to a classification problem

df_red_wine['quality'] = [1 if quality >= 6 else 0 for quality in df_red_wine['quality']]

Distribution of data

A right-skewed, roughly normal distribution is the general tendency of the data.

sns.set_palette("Set2")
for col in df_red_wine.columns:
    g = sns.FacetGrid(df_red_wine, col="quality")
    g.map(sns.histplot, col)

Normalization of input features

  • The target feature ‘quality’ is omitted to form the set of input features X
  • PCA “extracts”, i.e. converts, the input features into principal components that maximize variance
    • Standardisation, i.e. z-score normalisation, is performed on the data; it scales each feature to N(0, 1)
    • Since the majority of the data is roughly normally distributed, standardization is a good choice
X = df_red_wine.drop(['quality'], axis=1)
X_std = StandardScaler().fit_transform(X.values.astype('float64'))
X = pd.DataFrame(X_std, index=X.index, columns=X.columns)
X.head()
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol
0 -0.528360 0.961877 -1.391472 -0.453218 -0.243707 -0.466193 -0.379133 0.558274 1.288643 -0.579207 -0.960246
1 -0.298547 1.967442 -1.391472 0.043416 0.223875 0.872638 0.624363 0.028261 -0.719933 0.128950 -0.584777
2 -0.298547 1.297065 -1.186070 -0.169427 0.096353 -0.083669 0.229047 0.134264 -0.331177 -0.048089 -0.584777
3 1.654856 -1.384443 1.484154 -0.453218 -0.264960 0.107592 0.411500 0.664277 -0.979104 -0.461180 -0.584777
4 -0.528360 0.961877 -1.391472 -0.453218 -0.243707 -0.466193 -0.379133 0.558274 1.288643 -0.579207 -0.960246
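As a quick sanity check, the same scaled values can be reproduced by hand; this assumes the population standard deviation (ddof=0), which is what both NumPy's default and StandardScaler use:

# Manual z-score normalization: subtract each column's mean and divide by its
# population standard deviation; this should match StandardScaler's output
vals = df_red_wine.drop(['quality'], axis=1).values.astype('float64')
manual_std = (vals - vals.mean(axis=0)) / vals.std(axis=0)
print(np.allclose(manual_std, X_std))  # expected: True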

Statistics of normalized Input features

  • As observed, all features have mean ≈ 0 and standard deviation ≈ 1
  • The min and max still differ from feature to feature, since they reflect how far the extreme datapoints lie from the mean; scaling to N(0, 1) does not compensate for outliers
X.describe()
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol
count 1.599000e+03 1.599000e+03 1.599000e+03 1.599000e+03 1.599000e+03 1.599000e+03 1.599000e+03 1.599000e+03 1.599000e+03 1.599000e+03 1.599000e+03
mean 3.554936e-16 1.733031e-16 -8.887339e-17 -1.244227e-16 3.910429e-16 -6.221137e-17 4.443669e-17 2.364032e-14 2.861723e-15 6.754377e-16 1.066481e-16
std 1.000313e+00 1.000313e+00 1.000313e+00 1.000313e+00 1.000313e+00 1.000313e+00 1.000313e+00 1.000313e+00 1.000313e+00 1.000313e+00 1.000313e+00
min -2.137045e+00 -2.278280e+00 -1.391472e+00 -1.162696e+00 -1.603945e+00 -1.422500e+00 -1.230584e+00 -3.538731e+00 -3.700401e+00 -1.936507e+00 -1.898919e+00
25% -7.007187e-01 -7.699311e-01 -9.293181e-01 -4.532184e-01 -3.712290e-01 -8.487156e-01 -7.440403e-01 -6.077557e-01 -6.551405e-01 -6.382196e-01 -8.663789e-01
50% -2.410944e-01 -4.368911e-02 -5.636026e-02 -2.403750e-01 -1.799455e-01 -1.793002e-01 -2.574968e-01 1.760083e-03 -7.212705e-03 -2.251281e-01 -2.093081e-01
75% 5.057952e-01 6.266881e-01 7.652471e-01 4.341614e-02 5.384542e-02 4.901152e-01 4.723184e-01 5.768249e-01 5.759223e-01 4.240158e-01 6.354971e-01
max 4.355149e+00 5.877976e+00 3.743574e+00 9.195681e+00 1.112703e+01 5.367284e+00 7.375154e+00 3.680055e+00 4.528282e+00 7.918677e+00 4.202453e+00



PCA calculation : Steps

  • Calculating the covariance matrix
  • Calculating the eigenvalues and eigenvector
  • Forming Principal Components
  • Projection into the new feature space
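Each step is worked through cell by cell below. For reference, here is a compact sketch that bundles all four steps into one helper function (the function name and arguments are illustrative, not part of the notebook):

def pca_manual(X_std, n_components):
    # 1. Covariance matrix of the standardized features (variables as rows for np.cov)
    cov = np.cov(X_std.T)
    # 2. Eigen-decomposition of the (symmetric) covariance matrix
    eigvals, eigvecs = np.linalg.eig(cov)
    # 3. Keep the eigenvectors belonging to the largest eigenvalues
    order = np.argsort(eigvals)[::-1][:n_components]
    components = eigvecs[:, order]
    # 4. Project the data into the new feature space
    return X_std @ components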

Covariance matrix

Dimensions {X1, X2, …, Xn} = {fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, …, pH, sulphates, alcohol}

Cov(X) is given by

$$
\operatorname{Cov}(\mathbf{X}) =
\begin{bmatrix}
\operatorname{E}[(X_{1}-\operatorname{E}[X_{1}])(X_{1}-\operatorname{E}[X_{1}])] & \operatorname{E}[(X_{1}-\operatorname{E}[X_{1}])(X_{2}-\operatorname{E}[X_{2}])] & \cdots & \operatorname{E}[(X_{1}-\operatorname{E}[X_{1}])(X_{n}-\operatorname{E}[X_{n}])] \\
\operatorname{E}[(X_{2}-\operatorname{E}[X_{2}])(X_{1}-\operatorname{E}[X_{1}])] & \operatorname{E}[(X_{2}-\operatorname{E}[X_{2}])(X_{2}-\operatorname{E}[X_{2}])] & \cdots & \operatorname{E}[(X_{2}-\operatorname{E}[X_{2}])(X_{n}-\operatorname{E}[X_{n}])] \\
\vdots & \vdots & \ddots & \vdots \\
\operatorname{E}[(X_{n}-\operatorname{E}[X_{n}])(X_{1}-\operatorname{E}[X_{1}])] & \operatorname{E}[(X_{n}-\operatorname{E}[X_{n}])(X_{2}-\operatorname{E}[X_{2}])] & \cdots & \operatorname{E}[(X_{n}-\operatorname{E}[X_{n}])(X_{n}-\operatorname{E}[X_{n}])]
\end{bmatrix}
$$

cov_matrix = np.cov(X.T)
cov_matrix[:3]
array([[ 1.00062578, -0.25629118,  0.67212377,  0.11484855,  0.09376383,
        -0.15389043, -0.11325227,  0.66846534, -0.68340559,  0.18312019,
        -0.06170686],
       [-0.25629118,  1.00062578, -0.55284143,  0.00191908,  0.06133613,
        -0.0105104 ,  0.07651786,  0.02204002,  0.23508431, -0.26115001,
        -0.20241462],
       [ 0.67212377, -0.55284143,  1.00062578,  0.14366701,  0.20395046,
        -0.06101629,  0.03555526,  0.36517555, -0.54224326,  0.31296577,
         0.10997202]])
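As a quick check, np.cov agrees with the formula above computed directly; note that np.cov normalizes by N-1, which is why the diagonal reads 1.0006 rather than exactly 1:

# Manual covariance: centered.T @ centered / (N - 1); the features are already
# standardized, so the centering step is essentially a no-op here
n_rows = X.shape[0]
centered = X.values - X.values.mean(axis=0)
manual_cov = centered.T @ centered / (n_rows - 1)
print(np.allclose(manual_cov, cov_matrix))  # expected: True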

Eigenvalues and Eigenvectors

eigenvalues, eigenvectors = np.linalg.eig(cov_matrix)
for val in eigenvalues:
    print(val)
3.1010718226728273
1.9271148896585149
1.5515137913334218
1.2139917499341308
0.9598923792754817
0.059595582455006985
0.18144664164085156
0.34485778773040704
0.42322137844374963
0.5841565453623766
0.6600210359988645
for val in eigenvectors.T:
    print(val)
[ 0.48931422 -0.23858436  0.46363166  0.14610715  0.21224658 -0.03615752
  0.02357485  0.39535301 -0.43851962  0.24292133 -0.11323206]
[-0.11050274  0.27493048 -0.15179136  0.27208024  0.14805156  0.51356681
  0.56948696  0.23357549  0.00671079 -0.03755392 -0.38618096]
[-0.12330157 -0.44996253  0.23824707  0.10128338 -0.09261383  0.42879287
  0.3224145  -0.33887135  0.05769735  0.27978615  0.47167322]
[-0.22961737  0.07895978 -0.07941826 -0.37279256  0.66619476 -0.04353782
 -0.03457712 -0.17449976 -0.00378775  0.55087236 -0.12218109]
[-0.08261366  0.21873452 -0.05857268  0.73214429  0.2465009  -0.15915198
 -0.22246456  0.15707671  0.26752977  0.22596222  0.35068141]
[-0.63969145 -0.0023886   0.0709103  -0.18402996 -0.05306532  0.05142086
 -0.0687016   0.5673319  -0.3407109  -0.06955538  0.31452591]
[-0.24952314  0.36592473  0.62167708  0.09287208 -0.21767112  0.24848326
 -0.37075027 -0.23999012 -0.0109696   0.11232046 -0.3030145 ]
[ 0.19402091 -0.1291103  -0.38144967  0.00752295  0.11133867  0.63540522
 -0.59211589  0.02071868 -0.16774589 -0.05836706  0.03760311]
[-0.17759545 -0.07877531 -0.37751558  0.29984469 -0.35700936 -0.2047805
  0.01903597 -0.23922267 -0.56139075  0.37460432 -0.21762556]
[-0.35022736 -0.5337351   0.10549701  0.29066341  0.37041337 -0.11659611
 -0.09366237 -0.17048116 -0.02513762 -0.44746911 -0.3276509 ]
[ 0.10147858  0.41144893  0.06959338  0.04915555  0.30433857 -0.01400021
  0.13630755 -0.3911523  -0.52211645 -0.38126343  0.36164504]
eigen_map = list(zip(eigenvalues, eigenvectors.T))
eigen_map.sort(key=lambda x: x[0], reverse=True)
sorted_eigenvalues = [pair[0] for pair in eigen_map]
sorted_eigenvectors = [pair[1] for pair in eigen_map]
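A side note: because the covariance matrix is symmetric, np.linalg.eigh could be used instead of np.linalg.eig. It guarantees real eigenvalues and returns them already sorted in ascending order, so the explicit sort above reduces to a reversal (a minimal sketch; eigenvector signs may differ from np.linalg.eig, which does not affect PCA):

# eigh is specialized for symmetric/Hermitian matrices; eigenvalues come back
# in ascending order, so reversing gives the descending order used above
eigvals_h, eigvecs_h = np.linalg.eigh(cov_matrix)
sorted_eigenvalues_h = eigvals_h[::-1]
sorted_eigenvectors_h = eigvecs_h[:, ::-1].T  # rows are eigenvectors, largest eigenvalue first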

Formation of Principal Components

sorted_eigenvalues
[3.1010718226728273,
 1.9271148896585149,
 1.5515137913334218,
 1.2139917499341308,
 0.9598923792754817,
 0.6600210359988645,
 0.5841565453623766,
 0.42322137844374963,
 0.34485778773040704,
 0.18144664164085156,
 0.059595582455006985]
pd.DataFrame(sorted_eigenvectors, columns=df_red_wine.drop(['quality'], axis=1).columns)
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol
0 0.489314 -0.238584 0.463632 0.146107 0.212247 -0.036158 0.023575 0.395353 -0.438520 0.242921 -0.113232
1 -0.110503 0.274930 -0.151791 0.272080 0.148052 0.513567 0.569487 0.233575 0.006711 -0.037554 -0.386181
2 -0.123302 -0.449963 0.238247 0.101283 -0.092614 0.428793 0.322415 -0.338871 0.057697 0.279786 0.471673
3 -0.229617 0.078960 -0.079418 -0.372793 0.666195 -0.043538 -0.034577 -0.174500 -0.003788 0.550872 -0.122181
4 -0.082614 0.218735 -0.058573 0.732144 0.246501 -0.159152 -0.222465 0.157077 0.267530 0.225962 0.350681
5 0.101479 0.411449 0.069593 0.049156 0.304339 -0.014000 0.136308 -0.391152 -0.522116 -0.381263 0.361645
6 -0.350227 -0.533735 0.105497 0.290663 0.370413 -0.116596 -0.093662 -0.170481 -0.025138 -0.447469 -0.327651
7 -0.177595 -0.078775 -0.377516 0.299845 -0.357009 -0.204781 0.019036 -0.239223 -0.561391 0.374604 -0.217626
8 0.194021 -0.129110 -0.381450 0.007523 0.111339 0.635405 -0.592116 0.020719 -0.167746 -0.058367 0.037603
9 -0.249523 0.365925 0.621677 0.092872 -0.217671 0.248483 -0.370750 -0.239990 -0.010970 0.112320 -0.303015
10 -0.639691 -0.002389 0.070910 -0.184030 -0.053065 0.051421 -0.068702 0.567332 -0.340711 -0.069555 0.314526

Explained Variance : choice of Principal Components

eigenvalue_sum = sum(eigenvalues)
var_exp = [(v / eigenvalue_sum)*100 for v in sorted_eigenvalues]
cum_var_exp = np.cumsum(var_exp)
cum_var_exp
array([ 28.17393128,  45.68220118,  59.77805108,  70.80743772,
        79.52827474,  85.52471351,  90.83190641,  94.67696732,
        97.81007747,  99.4585608 , 100.        ])
dims = len(df_red_wine.drop(['quality'], axis=1).columns)
fig, ax = plt.subplots()

ax.plot(range(1, dims + 1), cum_var_exp, '-o')

plt.xlabel('Number of Components')
plt.ylabel('Percent of Variance Explained')

plt.show()

It is noted that 6 eigenvectors describe more than 85% of the variance in the input features.
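The same cutoff can be read off programmatically from the cumulative explained variance computed above (a small check):

# Smallest number of components whose cumulative explained variance reaches 85%
n_components = int(np.argmax(cum_var_exp >= 85)) + 1
print(n_components)  # 6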

ev1 = sorted_eigenvectors[0]
ev2 = sorted_eigenvectors[1]
ev3 = sorted_eigenvectors[2]
ev4 = sorted_eigenvectors[3]
ev5 = sorted_eigenvectors[4]
ev6 = sorted_eigenvectors[5]
eigen_matrix = np.hstack((ev1.reshape(dims,1),
                          ev2.reshape(dims,1),
                          ev3.reshape(dims,1),
                          ev4.reshape(dims,1),
                          ev5.reshape(dims,1),
                          ev6.reshape(dims,1)))
eigen_matrix[:3]
array([[ 0.48931422, -0.11050274, -0.12330157, -0.22961737, -0.08261366,
         0.10147858],
       [-0.23858436,  0.27493048, -0.44996253,  0.07895978,  0.21873452,
         0.41144893],
       [ 0.46363166, -0.15179136,  0.23824707, -0.07941826, -0.05857268,
         0.06959338]])
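Equivalently, the projection matrix can be assembled in a single call (a small alternative sketch):

# Stack the top six eigenvectors as the columns of an (11 x 6) projection matrix
eigen_matrix_alt = np.column_stack(sorted_eigenvectors[:6])
print(np.allclose(eigen_matrix_alt, eigen_matrix))  # expected: True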
Y = X.dot(eigen_matrix).join(df_red_wine['quality'])
Y.head()
0 1 2 3 4 5 quality
0 -1.619530 0.450950 -1.774454 0.043740 0.067014 -0.913921 0
1 -0.799170 1.856553 -0.911690 0.548066 -0.018392 0.929714 0
2 -0.748479 0.882039 -1.171394 0.411021 -0.043531 0.401473 0
3 2.357673 -0.269976 0.243489 -0.928450 -1.499149 -0.131017 1
4 -1.619530 0.450950 -1.774454 0.043740 0.067014 -0.913921 0

Distribution of principal components against each other for Good (1) and Bad (0) quality wine

sns.pairplot(data=Y, hue="quality", kind="scatter")

PCA with scikit-learn library

from sklearn.decomposition import PCA
pca = PCA(n_components=0.85)
Y_sklearn = pca.fit_transform(X)
fig, ax = plt.subplots()

ax.plot(range(1, 7), np.cumsum(pca.explained_variance_), '-o')

plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance')

plt.show()

np.cumsum(pca.explained_variance_)
array([3.10107182, 5.02818671, 6.5797005 , 7.79369225, 8.75358463,
       9.41360567])
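Note that explained_variance_ holds the raw eigenvalues. To reproduce the percentage curve from the manual calculation, explained_variance_ratio_ is the more direct attribute (a small sketch):

# Cumulative percentage of variance explained by the six retained components;
# this should track the first six entries of cum_var_exp computed manually
print(np.cumsum(pca.explained_variance_ratio_) * 100)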
Y_sklearn_plt = pd.DataFrame(Y_sklearn, 
                             index=Y.index, 
                             columns=Y.columns[:-1]).join(df_red_wine['quality'])
sns.pairplot(data=Y_sklearn_plt, hue="quality")



Performance of classifier : with and without PCA

Without PCA

y = df_red_wine['quality'].values
# split dataset for training and testing, and use a logistic regression as classifier
X_train, X_test, y_train, y_test = train_test_split(df_red_wine.drop('quality', axis=1), y, test_size=0.25)
classifier = LogisticRegression(random_state= 0)
classifier.fit(X_train, y_train)
LogisticRegression(random_state=0)
accuracy = classifier.score(X_test, y_test)
accuracy
0.7125

With PCA

X_train, X_test, y_train, y_test = train_test_split(Y_sklearn, y, test_size=0.3)
classifier_with_pca = LogisticRegression(random_state=0)
classifier_with_pca.fit(X_train, y_train)
LogisticRegression(random_state=0)
accuracy_with_pca = classifier_with_pca.score(X_test, y_test)
accuracy_with_pca
0.7458333333333333
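The two accuracies above come from different random splits (test_size 0.25 vs 0.3, no fixed seed), so part of the gap is noise. A more even-handed comparison, sketched below under the assumption that identical folds should be used for both models, wraps the preprocessing and classifier in a Pipeline and scores each variant with cross-validation:

from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

X_raw = df_red_wine.drop('quality', axis=1)

# Same folds, same scaler and classifier; the only difference is the PCA step
for name, steps in [('without PCA', [('scale', StandardScaler()),
                                     ('clf', LogisticRegression(random_state=0))]),
                    ('with PCA',    [('scale', StandardScaler()),
                                     ('pca', PCA(n_components=0.85)),
                                     ('clf', LogisticRegression(random_state=0))])]:
    scores = cross_val_score(Pipeline(steps), X_raw, y, cv=5)
    print(f"{name}: {mean(scores):.3f} +/- {std(scores):.3f}")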

Resources

  • https://www.sciencedirect.com/science/article/pii/B9780080448947013038
  • H. Abdi and L. J. Williams, “Principal component analysis,” Wiley Interdisciplinary Reviews. Computational Statistics, vol. 2, (4), pp. 433-459, 2010.