Motive behind Data pre-processing

The motive behind data processing or mining is to obtain data and process it into models for analysis; insights obtained from these models support the decision-making process. Before modelling, incomplete data, as well as data that does not contribute towards the analysis, is pre-processed using:

  • Normalization
  • Dimensionality reduction

This notebook describes:

  • How PCA is implemented manually
  • How PCA can boost classification performance while using only about half of the dimensions of the input features

Principal Component Analysis

Dimensionality reduction is a data transformation technique used to reduce multidimensional data sets to a lower number of dimensions for further analysis. Its goal is to extract the important information from the data set.

Techniques for Dimensionality Reduction

  • Elimination : Drop variables that correlate only weakly with the target variable
    • This technique achieves results, but the dropped dimensions' effect on the target variable [however minimal] goes completely unaccounted for
  • Extraction : Analyze the variables and EXTRACT NEW variables [dimensions] from them. Insights from all variables are preserved, and the variance in the target variable can be fully described by these new dimensions (a short sketch contrasting both approaches follows this list).
    • This is the methodology behind PCA : decompose the data set as a function of the variance in the data
    • PCA represents the extracted information as a set of new orthogonal variables called principal components, and displays the pattern of similarity of the observations and of the variables as points in maps [Abdi & Williams, 2010]
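A minimal sketch contrasting the two approaches on a generic DataFrame df with a target column target (both names are placeholders, not objects from this notebook):

from sklearn.decomposition import PCA

def eliminate_low_corr(df, target, threshold=0.1):
    # Elimination: keep only the features whose absolute correlation
    # with the target exceeds the threshold; everything else is dropped
    corr = df.corr()[target].drop(target).abs()
    return df[corr[corr > threshold].index]

def extract_components(df, target, n_components=2):
    # Extraction: build new orthogonal variables (principal components)
    # from ALL original features, so no feature is discarded outright
    features = df.drop(columns=[target])
    return PCA(n_components=n_components).fit_transform(features)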

Import necessary libraries

import warnings
import pandas as pd
import numpy as np
from numpy import mean
from numpy import std
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import  train_test_split
from sklearn.linear_model import LogisticRegression

%matplotlib inline
warnings.filterwarnings('ignore')

Read red wine data

# references: 
# https://stackoverflow.com/questions/32400867/pandas-read-csv-from-url
# https://svaderia.github.io/articles/downloading-and-unzipping-a-zipfile/
import io
import requests
url="https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
s=requests.get(url).content
df_red_wine=pd.read_csv(io.StringIO(s.decode('utf-8')), sep=";")
print("shape of red wine data: ", df_red_wine.shape, "\n")
df_red_wine.head()
shape of red wine data:  (1599, 12) 
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
0 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5
1 7.8 0.88 0.00 2.6 0.098 25.0 67.0 0.9968 3.20 0.68 9.8 5
2 7.8 0.76 0.04 2.3 0.092 15.0 54.0 0.9970 3.26 0.65 9.8 5
3 11.2 0.28 0.56 1.9 0.075 17.0 60.0 0.9980 3.16 0.58 9.8 6
4 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5
df_red_wine['quality'].describe()
count    1599.000000
mean        5.636023
std         0.807569
min         3.000000
25%         5.000000
50%         6.000000
75%         6.000000
max         8.000000
Name: quality, dtype: float64

Convert to a classification problem

df_red_wine['quality'] = [1 if quality >= 6 else 0 for quality in df_red_wine['quality']]

Distribution of data

A right-skewed, roughly normal distribution is the general tendency of the data.

sns.set_palette("Set2")
for col in df_red_wine.columns:
    g = sns.FacetGrid(df_red_wine, col="quality")
    g.map(sns.histplot, col)

Normalization of input features

  • The target feature ‘quality’ is omitted to form the set of input features X
  • PCA “extracts”, i.e. converts, the input features into principal components that maximize variance
    • Standardisation, i.e. z-score normalisation, is performed on the data; it scales each feature to N(0, 1)
    • Since the majority of the data is roughly normally distributed, standardization is a good choice
X = df_red_wine.drop(['quality'], axis=1)
X_std = StandardScaler().fit_transform(X.values.astype('float64'))
X = pd.DataFrame(X_std, index=X.index, columns=X.columns)
X.head()
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol
0 -0.528360 0.961877 -1.391472 -0.453218 -0.243707 -0.466193 -0.379133 0.558274 1.288643 -0.579207 -0.960246
1 -0.298547 1.967442 -1.391472 0.043416 0.223875 0.872638 0.624363 0.028261 -0.719933 0.128950 -0.584777
2 -0.298547 1.297065 -1.186070 -0.169427 0.096353 -0.083669 0.229047 0.134264 -0.331177 -0.048089 -0.584777
3 1.654856 -1.384443 1.484154 -0.453218 -0.264960 0.107592 0.411500 0.664277 -0.979104 -0.461180 -0.584777
4 -0.528360 0.961877 -1.391472 -0.453218 -0.243707 -0.466193 -0.379133 0.558274 1.288643 -0.579207 -0.960246
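As a quick sanity check, the same scaled values can be reproduced by hand; this assumes the population standard deviation (ddof=0), which is what both NumPy's default and StandardScaler use:

# Manual z-score normalization: subtract each column's mean and divide by its
# population standard deviation; this should match StandardScaler's output
vals = df_red_wine.drop(['quality'], axis=1).values.astype('float64')
manual_std = (vals - vals.mean(axis=0)) / vals.std(axis=0)
print(np.allclose(manual_std, X_std))  # expected: True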

Statistics of normalized Input features

  • As observed, all features have mean ≈ 0 and standard deviation ≈ 1
  • The min and max still differ from feature to feature, since they reflect how far the extreme datapoints lie from the mean; scaling to N(0, 1) does not compensate for outliers
X.describe()
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol
count 1.599000e+03 1.599000e+03 1.599000e+03 1.599000e+03 1.599000e+03 1.599000e+03 1.599000e+03 1.599000e+03 1.599000e+03 1.599000e+03 1.599000e+03
mean 3.554936e-16 1.733031e-16 -8.887339e-17 -1.244227e-16 3.910429e-16 -6.221137e-17 4.443669e-17 2.364032e-14 2.861723e-15 6.754377e-16 1.066481e-16
std 1.000313e+00 1.000313e+00 1.000313e+00 1.000313e+00 1.000313e+00 1.000313e+00 1.000313e+00 1.000313e+00 1.000313e+00 1.000313e+00 1.000313e+00
min -2.137045e+00 -2.278280e+00 -1.391472e+00 -1.162696e+00 -1.603945e+00 -1.422500e+00 -1.230584e+00 -3.538731e+00 -3.700401e+00 -1.936507e+00 -1.898919e+00
25% -7.007187e-01 -7.699311e-01 -9.293181e-01 -4.532184e-01 -3.712290e-01 -8.487156e-01 -7.440403e-01 -6.077557e-01 -6.551405e-01 -6.382196e-01 -8.663789e-01
50% -2.410944e-01 -4.368911e-02 -5.636026e-02 -2.403750e-01 -1.799455e-01 -1.793002e-01 -2.574968e-01 1.760083e-03 -7.212705e-03 -2.251281e-01 -2.093081e-01
75% 5.057952e-01 6.266881e-01 7.652471e-01 4.341614e-02 5.384542e-02 4.901152e-01 4.723184e-01 5.768249e-01 5.759223e-01 4.240158e-01 6.354971e-01
max 4.355149e+00 5.877976e+00 3.743574e+00 9.195681e+00 1.112703e+01 5.367284e+00 7.375154e+00 3.680055e+00 4.528282e+00 7.918677e+00 4.202453e+00



PCA calculation : Steps

  • Calculating the covariance matrix
  • Calculating the eigenvalues and eigenvector
  • Forming Principal Components
  • Projection into the new feature space
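Each step is worked through cell by cell below. For reference, here is a compact sketch that bundles all four steps into one helper function (the function name and arguments are illustrative, not part of the notebook):

def pca_manual(X_std, n_components):
    # 1. Covariance matrix of the standardized features (variables as rows for np.cov)
    cov = np.cov(X_std.T)
    # 2. Eigen-decomposition of the (symmetric) covariance matrix
    eigvals, eigvecs = np.linalg.eig(cov)
    # 3. Keep the eigenvectors belonging to the largest eigenvalues
    order = np.argsort(eigvals)[::-1][:n_components]
    components = eigvecs[:, order]
    # 4. Project the data into the new feature space
    return X_std @ components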

Covariance matrix

Dimensions {X1, X2, …, Xn} = {fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, …, pH, sulphates, alcohol}

Cov(X) is given by

$$
\operatorname{Cov}(\mathbf{X}) =
\begin{bmatrix}
\operatorname{E}[(X_{1}-\operatorname{E}[X_{1}])(X_{1}-\operatorname{E}[X_{1}])] & \operatorname{E}[(X_{1}-\operatorname{E}[X_{1}])(X_{2}-\operatorname{E}[X_{2}])] & \cdots & \operatorname{E}[(X_{1}-\operatorname{E}[X_{1}])(X_{n}-\operatorname{E}[X_{n}])] \\
\operatorname{E}[(X_{2}-\operatorname{E}[X_{2}])(X_{1}-\operatorname{E}[X_{1}])] & \operatorname{E}[(X_{2}-\operatorname{E}[X_{2}])(X_{2}-\operatorname{E}[X_{2}])] & \cdots & \operatorname{E}[(X_{2}-\operatorname{E}[X_{2}])(X_{n}-\operatorname{E}[X_{n}])] \\
\vdots & \vdots & \ddots & \vdots \\
\operatorname{E}[(X_{n}-\operatorname{E}[X_{n}])(X_{1}-\operatorname{E}[X_{1}])] & \operatorname{E}[(X_{n}-\operatorname{E}[X_{n}])(X_{2}-\operatorname{E}[X_{2}])] & \cdots & \operatorname{E}[(X_{n}-\operatorname{E}[X_{n}])(X_{n}-\operatorname{E}[X_{n}])]
\end{bmatrix}
$$

cov_matrix = np.cov(X.T)
cov_matrix[:3]
array([[ 1.00062578, -0.25629118,  0.67212377,  0.11484855,  0.09376383,
        -0.15389043, -0.11325227,  0.66846534, -0.68340559,  0.18312019,
        -0.06170686],
       [-0.25629118,  1.00062578, -0.55284143,  0.00191908,  0.06133613,
        -0.0105104 ,  0.07651786,  0.02204002,  0.23508431, -0.26115001,
        -0.20241462],
       [ 0.67212377, -0.55284143,  1.00062578,  0.14366701,  0.20395046,
        -0.06101629,  0.03555526,  0.36517555, -0.54224326,  0.31296577,
         0.10997202]])
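As a quick check, np.cov agrees with the formula above computed directly; note that np.cov normalizes by N-1, which is why the diagonal reads 1.0006 rather than exactly 1:

# Manual covariance: centered.T @ centered / (N - 1); the features are already
# standardized, so the centering step is essentially a no-op here
n_rows = X.shape[0]
centered = X.values - X.values.mean(axis=0)
manual_cov = centered.T @ centered / (n_rows - 1)
print(np.allclose(manual_cov, cov_matrix))  # expected: True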

Eigenvalues and Eigenvectors

eigenvalues, eigenvectors = np.linalg.eig(cov_matrix)
for val in eigenvalues:
    print(val)
3.1010718226728273
1.9271148896585149
1.5515137913334218
1.2139917499341308
0.9598923792754817
0.059595582455006985
0.18144664164085156
0.34485778773040704
0.42322137844374963
0.5841565453623766
0.6600210359988645
for val in eigenvectors.T:
    print(val)
[ 0.48931422 -0.23858436  0.46363166  0.14610715  0.21224658 -0.03615752
  0.02357485  0.39535301 -0.43851962  0.24292133 -0.11323206]
[-0.11050274  0.27493048 -0.15179136  0.27208024  0.14805156  0.51356681
  0.56948696  0.23357549  0.00671079 -0.03755392 -0.38618096]
[-0.12330157 -0.44996253  0.23824707  0.10128338 -0.09261383  0.42879287
  0.3224145  -0.33887135  0.05769735  0.27978615  0.47167322]
[-0.22961737  0.07895978 -0.07941826 -0.37279256  0.66619476 -0.04353782
 -0.03457712 -0.17449976 -0.00378775  0.55087236 -0.12218109]
[-0.08261366  0.21873452 -0.05857268  0.73214429  0.2465009  -0.15915198
 -0.22246456  0.15707671  0.26752977  0.22596222  0.35068141]
[-0.63969145 -0.0023886   0.0709103  -0.18402996 -0.05306532  0.05142086
 -0.0687016   0.5673319  -0.3407109  -0.06955538  0.31452591]
[-0.24952314  0.36592473  0.62167708  0.09287208 -0.21767112  0.24848326
 -0.37075027 -0.23999012 -0.0109696   0.11232046 -0.3030145 ]
[ 0.19402091 -0.1291103  -0.38144967  0.00752295  0.11133867  0.63540522
 -0.59211589  0.02071868 -0.16774589 -0.05836706  0.03760311]
[-0.17759545 -0.07877531 -0.37751558  0.29984469 -0.35700936 -0.2047805
  0.01903597 -0.23922267 -0.56139075  0.37460432 -0.21762556]
[-0.35022736 -0.5337351   0.10549701  0.29066341  0.37041337 -0.11659611
 -0.09366237 -0.17048116 -0.02513762 -0.44746911 -0.3276509 ]
[ 0.10147858  0.41144893  0.06959338  0.04915555  0.30433857 -0.01400021
  0.13630755 -0.3911523  -0.52211645 -0.38126343  0.36164504]
eigen_map = list(zip(eigenvalues, eigenvectors.T))
eigen_map.sort(key=lambda x: x[0], reverse=True)
sorted_eigenvalues = [pair[0] for pair in eigen_map]
sorted_eigenvectors = [pair[1] for pair in eigen_map]
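A side note: because the covariance matrix is symmetric, np.linalg.eigh could be used instead of np.linalg.eig. It guarantees real eigenvalues and returns them already sorted in ascending order, so the explicit sort above reduces to a reversal (a minimal sketch; eigenvector signs may differ from np.linalg.eig, which does not affect PCA):

# eigh is specialized for symmetric/Hermitian matrices; eigenvalues come back
# in ascending order, so reversing gives the descending order used above
eigvals_h, eigvecs_h = np.linalg.eigh(cov_matrix)
sorted_eigenvalues_h = eigvals_h[::-1]
sorted_eigenvectors_h = eigvecs_h[:, ::-1].T  # rows are eigenvectors, largest eigenvalue first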

Formation of Principal Components

sorted_eigenvalues
[3.1010718226728273,
 1.9271148896585149,
 1.5515137913334218,
 1.2139917499341308,
 0.9598923792754817,
 0.6600210359988645,
 0.5841565453623766,
 0.42322137844374963,
 0.34485778773040704,
 0.18144664164085156,
 0.059595582455006985]
pd.DataFrame(sorted_eigenvectors, columns=df_red_wine.drop(['quality'], axis=1).columns)
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol
0 0.489314 -0.238584 0.463632 0.146107 0.212247 -0.036158 0.023575 0.395353 -0.438520 0.242921 -0.113232
1 -0.110503 0.274930 -0.151791 0.272080 0.148052 0.513567 0.569487 0.233575 0.006711 -0.037554 -0.386181
2 -0.123302 -0.449963 0.238247 0.101283 -0.092614 0.428793 0.322415 -0.338871 0.057697 0.279786 0.471673
3 -0.229617 0.078960 -0.079418 -0.372793 0.666195 -0.043538 -0.034577 -0.174500 -0.003788 0.550872 -0.122181
4 -0.082614 0.218735 -0.058573 0.732144 0.246501 -0.159152 -0.222465 0.157077 0.267530 0.225962 0.350681
5 0.101479 0.411449 0.069593 0.049156 0.304339 -0.014000 0.136308 -0.391152 -0.522116 -0.381263 0.361645
6 -0.350227 -0.533735 0.105497 0.290663 0.370413 -0.116596 -0.093662 -0.170481 -0.025138 -0.447469 -0.327651
7 -0.177595 -0.078775 -0.377516 0.299845 -0.357009 -0.204781 0.019036 -0.239223 -0.561391 0.374604 -0.217626
8 0.194021 -0.129110 -0.381450 0.007523 0.111339 0.635405 -0.592116 0.020719 -0.167746 -0.058367 0.037603
9 -0.249523 0.365925 0.621677 0.092872 -0.217671 0.248483 -0.370750 -0.239990 -0.010970 0.112320 -0.303015
10 -0.639691 -0.002389 0.070910 -0.184030 -0.053065 0.051421 -0.068702 0.567332 -0.340711 -0.069555 0.314526

Explained Variance : choice of Principal Components

eigenvalue_sum = sum(eigenvalues)
var_exp = [(v / eigenvalue_sum)*100 for v in sorted_eigenvalues]
cum_var_exp = np.cumsum(var_exp)
cum_var_exp
array([ 28.17393128,  45.68220118,  59.77805108,  70.80743772,
        79.52827474,  85.52471351,  90.83190641,  94.67696732,
        97.81007747,  99.4585608 , 100.        ])
dims = len(df_red_wine.drop(['quality'], axis=1).columns)
fig, ax = plt.subplots()

ax.plot(range(1, dims + 1), cum_var_exp, '-o')

plt.xlabel('Number of Components')
plt.ylabel('Percent of Variance Explained')

plt.show()

It is noted that 6 eigenvectors describe more than 85% of the variance in the input features.
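The same cutoff can be read off programmatically from the cumulative explained variance computed above (a small check):

# Smallest number of components whose cumulative explained variance reaches 85%
n_components = int(np.argmax(cum_var_exp >= 85)) + 1
print(n_components)  # 6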

ev1 = sorted_eigenvectors[0]
ev2 = sorted_eigenvectors[1]
ev3 = sorted_eigenvectors[2]
ev4 = sorted_eigenvectors[3]
ev5 = sorted_eigenvectors[4]
ev6 = sorted_eigenvectors[5]
eigen_matrix = np.hstack((ev1.reshape(dims,1),
                          ev2.reshape(dims,1),
                          ev3.reshape(dims,1),
                          ev4.reshape(dims,1),
                          ev5.reshape(dims,1),
                          ev6.reshape(dims,1)))
eigen_matrix[:3]
array([[ 0.48931422, -0.11050274, -0.12330157, -0.22961737, -0.08261366,
         0.10147858],
       [-0.23858436,  0.27493048, -0.44996253,  0.07895978,  0.21873452,
         0.41144893],
       [ 0.46363166, -0.15179136,  0.23824707, -0.07941826, -0.05857268,
         0.06959338]])
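Equivalently, the projection matrix can be assembled in a single call (a small alternative sketch):

# Stack the top six eigenvectors as the columns of an (11 x 6) projection matrix
eigen_matrix_alt = np.column_stack(sorted_eigenvectors[:6])
print(np.allclose(eigen_matrix_alt, eigen_matrix))  # expected: True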
Y = X.dot(eigen_matrix).join(df_red_wine['quality'])
Y.head()
0 1 2 3 4 5 quality
0 -1.619530 0.450950 -1.774454 0.043740 0.067014 -0.913921 0
1 -0.799170 1.856553 -0.911690 0.548066 -0.018392 0.929714 0
2 -0.748479 0.882039 -1.171394 0.411021 -0.043531 0.401473 0
3 2.357673 -0.269976 0.243489 -0.928450 -1.499149 -0.131017 1
4 -1.619530 0.450950 -1.774454 0.043740 0.067014 -0.913921 0

Distribution of principal components against each other for Good (1) and Bad (0) quality wine

sns.pairplot(data=Y, hue="quality", kind="scatter")

PCA with scikit-learn library

from sklearn.decomposition import PCA
pca = PCA(n_components=0.85)
Y_sklearn = pca.fit_transform(X)
fig, ax = plt.subplots()

ax.plot(range(1, 7), np.cumsum(pca.explained_variance_), '-o')

plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance')

plt.show()

np.cumsum(pca.explained_variance_)
array([3.10107182, 5.02818671, 6.5797005 , 7.79369225, 8.75358463,
       9.41360567])
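Note that explained_variance_ holds the raw eigenvalues. To reproduce the percentage curve from the manual calculation, explained_variance_ratio_ is the more direct attribute (a small sketch):

# Cumulative percentage of variance explained by the six retained components;
# this should track the first six entries of cum_var_exp computed manually
print(np.cumsum(pca.explained_variance_ratio_) * 100)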
Y_sklearn_plt = pd.DataFrame(Y_sklearn, 
                             index=Y.index, 
                             columns=Y.columns[:-1]).join(df_red_wine['quality'])
sns.pairplot(data=Y_sklearn_plt, hue="quality")



Performance of classifier : with and without PCA

Without PCA

y = df_red_wine['quality'].values
# split dataset for training and testing, and use a logistic regression as classifier
X_train, X_test, y_train, y_test = train_test_split(df_red_wine.drop('quality', axis=1), y, test_size=0.25)
classifier = LogisticRegression(random_state= 0)
classifier.fit(X_train, y_train)
LogisticRegression(random_state=0)
accuracy = classifier.score(X_test, y_test)
accuracy
0.7125

With PCA

X_train, X_test, y_train, y_test = train_test_split(Y_sklearn, y, test_size=0.3)
classifier_with_pca = LogisticRegression(random_state=0)
classifier_with_pca.fit(X_train, y_train)
LogisticRegression(random_state=0)
accuracy_with_pca = classifier_with_pca.score(X_test, y_test)
accuracy_with_pca
0.7458333333333333
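The two accuracies above come from different random splits (test_size 0.25 vs 0.3, no fixed seed), so part of the gap is noise. A more even-handed comparison, sketched below under the assumption that identical folds should be used for both models, wraps the preprocessing and classifier in a Pipeline and scores each variant with cross-validation:

from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

X_raw = df_red_wine.drop('quality', axis=1)

# Same folds, same scaler and classifier; the only difference is the PCA step
for name, steps in [('without PCA', [('scale', StandardScaler()),
                                     ('clf', LogisticRegression(random_state=0))]),
                    ('with PCA',    [('scale', StandardScaler()),
                                     ('pca', PCA(n_components=0.85)),
                                     ('clf', LogisticRegression(random_state=0))])]:
    scores = cross_val_score(Pipeline(steps), X_raw, y, cv=5)
    print(f"{name}: {mean(scores):.3f} +/- {std(scores):.3f}")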

Resources

  • https://www.sciencedirect.com/science/article/pii/B9780080448947013038
  • H. Abdi and L. J. Williams, “Principal component analysis,” Wiley Interdisciplinary Reviews. Computational Statistics, vol. 2, (4), pp. 433-459, 2010.