
Commit ea70b47

PCA implementation
1 parent 0ee534e commit ea70b47

1 file changed

Lines changed: 205 additions & 0 deletions

File tree

machine_learning/pca.py

@@ -0,0 +1,205 @@
# Author: https://github.com/satori1995
"""
Principal Component Analysis (PCA) is primarily used for data dimensionality reduction.
Its working principle is to map high-dimensional data to a lower-dimensional
space through a linear transformation while preserving the main information in the data.

For example, if a sample has n features and n is very large, the training time
will be severely impacted.
However, in real-world situations, not all of these n features play a decisive role
in determining the sample's classification - some features have negligible impact.
In such cases, PCA can be used for dimensionality reduction.

Note: At this point, you might think that PCA simply filters features by
selecting the most important ones. But that's not the case.
What PCA actually does is create new features (principal components).
These principal components are linear combinations of the original features,
are orthogonal to each other (uncorrelated),
and are arranged in descending order according to the amount of variance
they capture in the data.
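For instance (illustrative numbers only): with two standardized features x1 and x2,
the first principal component could be the new feature 0.8 * x1 + 0.6 * x2.
The weight vector (0.8, 0.6) has unit length, and every principal component is
such a weighted mix of all the original features.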

Based on practical requirements, we only need the first k principal components
to achieve the desired effect.
As for the exact value of k, we can set an explained variance ratio
threshold (such as 95%) and select the minimum number of principal components
that can explain this proportion of variance.
In any case, k must be less than n.
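As an illustration with made-up numbers: if five components had explained variance
ratios of 0.60, 0.25, 0.10, 0.03 and 0.02, a 95% threshold would keep the first
three components, because 0.60 + 0.25 + 0.10 = 0.95 already reaches the target.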

- The first few principal components usually capture most of the variability
  in the data.
  In many real-world datasets, the first 10-20% of principal components can explain
  more than 90% of the data variance.
- The later principal components often capture only small amounts of data variability,
  with some primarily reflecting noise rather than useful information.

Therefore, PCA is more powerful than simple feature selection because it can create
new features (principal components) that are more effective than the original features
while preserving the data structure.
Each principal component is a weighted combination of all original features,
just with different weights on each feature.
This enables PCA to preserve the main patterns and structures of the original data
while reducing dimensionality.

Now you should understand what PCA is. Its characteristics are as follows:
- Comprehensive information utilization: Each principal component integrates
  information from all original features, just with different weights assigned.
  This means that even if a feature's contribution is small, the useful information
  it contains can still be preserved in the principal components.
- Uncorrelatedness: Through an orthogonal transformation, principal components
  are uncorrelated with each other, eliminating redundant information and
  multicollinearity problems that may exist between original features.
- Noise filtering: Low-variance principal components often represent noise.
  Discarding these components can improve the signal-to-noise ratio.
- Data structure preservation: Although the dimensionality is reduced,
  the main data structures and patterns are still preserved by
  the first few high-variance principal components.
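(The demo in main() at the bottom of this file shows the scale of this effect:
the 64-feature digits dataset is reduced to 28 principal components while still
explaining 95% of the variance.)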

In subsequent calculations and modeling,
we use these principal components (the newly created features).
Since the first k principal components already contain sufficient information
from the original features (where k < n), we achieve dimensionality reduction.
The benefits include faster training speed and lower memory consumption.

So principal components are new features created by linearly
combining the original features.
These new features synthesize the information from the original features
but represent the data in a more effective and streamlined way.
Moreover, each principal component is orthogonal to the others - they extract
information from the original features in mutually independent directions.
The benefit of this approach is that it avoids information redundancy, with each
principal component providing information that other principal components
do not contain.

These principal components are then arranged in descending order according to
the amount of data variance they capture:
- The first principal component is the new feature with the maximum variance.
- The second principal component is the new feature with maximum variance in
  the direction orthogonal to the first principal component.
- And so on...
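More formally: for centered data X, the first principal direction w1 maximizes the
variance of the projection X @ w subject to ||w|| = 1, and each subsequent direction
maximizes the same quantity while remaining orthogonal to all previously chosen
directions.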

For example, if a sample has 100 features, 100 principal components will be created.
However, in practical scenarios, the first 10 principal components may already retain
95% of the information from the original features.
If this level of accuracy is acceptable, then we can select only the first
10 principal components.
This reduces the data from 100 dimensions to 10 dimensions, resulting in a
significant speed improvement during training.
"""

import doctest

import numpy as np


class PCA:
    def __init__(self, n_components=None):
        """
        Parameters
        ----------
        n_components : int or float, default=None
            if n_components >= 1, the number of components to keep.

            if n_components < 1, select the number of components such that the
            amount of variance that needs to be explained is greater than the
            percentage specified by n_components.

            if n_components is None, keep min(n_samples, n_features) components.
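
        Examples
        --------
        A minimal illustrative run on synthetic data (only shapes and counts are
        checked, because the projected values depend on the random draw):

        >>> rng = np.random.default_rng(0)
        >>> samples = rng.normal(size=(10, 3))
        >>> pca = PCA(n_components=2)
        >>> pca.fit(samples)
        >>> pca.transform(samples).shape
        (10, 2)
        >>> len(pca.explained_variance_ratio_)
        2
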
        """
        self.n_components = n_components
        self.mean_ = None
        self.components_ = None
        self.explained_variance_ = None
        self.explained_variance_ratio_ = None

    def fit(self, x: np.ndarray):
        # Zero-mean: remember the per-feature mean so that transform() can center
        # new data with the statistics learned here
        self.mean_ = np.mean(x, axis=0)
        x = x - self.mean_
        # Calculate eigenvalues and eigenvectors of x @ x.T.
        # The eigenvalues measure the amount of variance that the corresponding
        # principal components can explain; the eigenvectors are used below to
        # recover those principal components.
        eigenvalues, eigenvectors = np.linalg.eig(x @ x.T)
        # order by eigenvalues, largest first
        sorted_indices = np.argsort(eigenvalues)[::-1]
        eigenvalues = eigenvalues[sorted_indices]
        eigenvectors = eigenvectors[:, sorted_indices]

        # Matrix SVD, solve U, Σ, V
        u = eigenvectors
        singular_values = np.sqrt(np.where(eigenvalues > 0, eigenvalues, 0))
        sigma = np.diag(singular_values)
        sigma_inv = np.linalg.pinv(sigma)
        v = x.T @ u @ sigma_inv
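        # Why this recovers the principal axes: for centered x with SVD
        # x = u @ sigma @ v.T, we have x @ x.T = u @ sigma**2 @ u.T, so the
        # eigenvectors computed above are the left singular vectors u and the
        # eigenvalues are the squared singular values. The right singular
        # vectors (the principal axes) then follow as v = x.T @ u @ sigma_inv.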

        component_array = np.real(v.T[np.argsort(singular_values)[::-1]])
        explained_variance_array = np.real(
            np.sort(singular_values**2 / (len(x) - 1))[::-1]
        )
        explained_variance_ratio_array = explained_variance_array / np.sum(
            explained_variance_array
        )

        if self.n_components is None:
            n_components = min(x.shape)
        elif self.n_components >= 1:
            n_components = self.n_components
        elif self.n_components < 1:
            current_explained_variance_ratio = 0
            i = 0
            for i in range(len(explained_variance_ratio_array)):
                current_explained_variance_ratio += explained_variance_ratio_array[i]
                if current_explained_variance_ratio >= self.n_components:
                    break
            n_components = i + 1
        else:
            raise ValueError("n_components must be a number or None")

        self.components_ = component_array[:n_components]
        self.explained_variance_ = explained_variance_array[:n_components]
        self.explained_variance_ratio_ = explained_variance_ratio_array[:n_components]

    def transform(self, x: np.ndarray) -> np.ndarray:
        # Center the data with the mean learned during fit, then project it
        # onto the selected principal components
        return (x - self.mean_) @ self.components_.T


def main():
    import time

    from sklearn.datasets import load_digits
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    data = load_digits()
    x = data.data
    y = data.target
    x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=520)
    print(x_train.shape, y_train.shape)  # (1347, 64) (1347,)

    knn_clf = KNeighborsClassifier()
    # Use all 64 original features
    knn_clf.fit(x_train, y_train)
    start = time.perf_counter()
    print("score:", knn_clf.score(x_test, y_test))  # score: 0.9822222222222222
    end = time.perf_counter()
    print("costs:", end - start)  # costs: 0.13106690000131493

    # Dimensionality reduction: keep enough components to explain 95% of the variance
    pca = PCA(n_components=0.95)
    pca.fit(x_train)
    x_train_reduction = pca.transform(x_train)
    x_test_reduction = pca.transform(x_test)
    print(x_train_reduction.shape)  # (1347, 28)
    print("n_features:", x_train_reduction.shape[1])  # n_features: 28
    knn_clf = KNeighborsClassifier()
    knn_clf.fit(x_train_reduction, y_train)
    start = time.perf_counter()
    print(
        "score:", knn_clf.score(x_test_reduction, y_test)
    )  # score: 0.9888888888888889
    end = time.perf_counter()
    print("costs:", end - start)  # costs: 0.010363900000811554
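
    # Optional sanity check: scikit-learn's PCA accepts the same fractional
    # n_components, so it can be used to cross-check how many components a
    # 95% variance threshold keeps (counts may differ by one because of how
    # the threshold comparison is implemented).
    from sklearn.decomposition import PCA as SklearnPCA

    sk_pca = SklearnPCA(n_components=0.95)
    sk_pca.fit(x_train)
    print("sklearn n_components:", sk_pca.n_components_)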


if __name__ == "__main__":
    doctest.testmod()
    main()
