This blog post is an addendum to a 3 post miniseries1.
Here were present 2 notebooks.
- 02a—SVD-with-pytorch-optimizer-SGD.ipynb
Is a dropin replacement of the stochastic gradient + momentum method shown earlier2, but with using the inbuilt pytorch sgd optimiser.
- 02b—SVD-with-pytorch-optimizer-adam.ipynb
Uses the inbuild pytorch adam optimizer – rather than the sgd optimiser. As known in the literature, the adam optimiser shows better results3.
So here’s the notebook using the inbuilt pytorch sgd optimiser.
Outline
Here's the general outline:
Given a matrix M, with dimensions (m x n), we want to decompose it in the following way:
(i) M = U @ S @ V.T
where
- @ is the matrix multiplication operator
- U has dimensions (m x n)
- S is diagonal and has dimensions (n x n)
- V has dimensions (n x n), V.T is the transpose of V
Question
We want to calculate U, S and V using gradient descent. How can this be done? (Allow yourself to think about this for 5-15 secs)
New Section¶
New Section¶
Answer
Ok so here is how:
Let us use gradient descent with respect to the approximate_inverse to gradually improve the approximate inverse of a matrix.
0) Guess: We guess the answer M_inverse, let's call this guess approximate_inverse. The first guess is random.
1) Improve: We improve approximate_inverse by nudging it in the right direction.
Step 0) Guess is only done once, Step 1) Improve is done many times
How do we know we're moving our approximation in the right direction? We must have a sense if one candidate approximate_inverse_1 is better than another candidate approximate_inverse_2. Put differently, given a candidate for the inverse, we must be able to anser: How far are we off? Are we completely lost? The degree of "being lost" is expressed via a loss function. For our loss function we choose a very common metric the mean squared error, or mse. In this case it measures the distance at each matrix element of our estimate Y_hat and Y_true (the identity matrix).
Further we must decide on what is called the learning rate. The learning rate is one of the most important parameters controlling the gradient descent method. One common way of visualising gradient descent is by comparing the loss function to some sort of hilly landscape. The current combination of parameters (in this case eg. our random inverse matrix) corresponds to a position in this landscape. We would like to get to a valley in this landscape. One way to imagine this is to put a "ball" at the current position and let it roll. This ball moves downhill and can also pickup some momentum.
Code¶
import torch
# set printing options
torch.set_printoptions(sci_mode=False)
Some helper functions¶
# define the a matrix-elementwise metric, the mean square error
def mse(y_hat, y_true): return ((y_hat-y_true)**2).mean()
def improve(U, diag_sqrt, V, M, optimizer:torch.optim, losses:list):
# the given matrix
y_true = M
# estimate the original matrix M
diag = diag_sqrt * diag_sqrt
# this enforces that the diagonal values are >=0
S = torch.diag(input=diag)
y_hat = U @ S @ V.T
# loss = "degree of being lost"
loss1 = mse(y_hat, y_true) #
# further constraints
# 2) orthonomality U.T @ U = I_U
I_U = torch.eye(U.shape[1])
loss2 = mse(U.T @ U, I_U)
# 3) orthonomality U.T @ U = I_U
I_V = torch.eye(V.shape[0])
loss3 = mse(V @ V.T, I_V)
# add up the loss with some scaling factors
# ... to improve convergence
loss = loss1 + loss2/100 + loss3/100
losses.append(loss.detach().numpy())
optimizer.zero_grad()
# calculate loss
loss.backward()
with torch.no_grad():
# displace by learning_rate * derivatives
optimizer.step()
# (i)
return None
def calculate_svd(M, learning_rate=0.9, momentum=0.9):
torch.manual_seed(314)
# Step 0) Guess
U = torch.rand(size=M.shape, requires_grad=True)
diag_sqrt = torch.rand(size=[M.shape[1]], requires_grad=True)
V = torch.rand(size=(M.shape[1], M.shape[1]), requires_grad=True)
losses = []
optimizer = torch.optim.SGD([U, diag_sqrt, V], lr=learning_rate, momentum=momentum)
for t in range(20000):
# Step 1) Improve
improve(U, diag_sqrt, V, M, optimizer, losses)
diag = diag_sqrt*diag_sqrt
# sort diagonal values in descending order
diag_sorted, order = diag.sort(descending=True)
diag = diag[order]
U = U[:,order]
V = V[:,order]
return U, diag, V, losses
Calculate SVD of random matrix¶
torch.manual_seed(314)
M1 = torch.rand(size=(5,4)) # torch.diag(torch.tensor([1.,2.,3.,4.]))
U1, diag1, V1, losses_1 = calculate_svd(M1, learning_rate=0.9, momentum=0.9)
Inspect results¶
Compare with constraints
S1 = torch.diag(input=diag1)
U1 @S1 @V1.T - M1, U1.T @ U1, V1 @ V1.T
Compare with torch implementation¶
U, S, V = torch.svd(M1)
U , '',U1
S, '', torch.diag(S1)
V, '' ,V1
Looking good!
Plot convergence speed for random matrix¶
import pandas as pd
import numpy as np
import plotly.express as px
df = pd.DataFrame(
{
'losses_1': losses_1,
})
fig = px.line(df, y='losses_1', log_y=True)
fig.update_traces(name='Random Matrix 1, Momentum=0.9', showlegend = True)
fig.update_layout(title=\
'<b>How fast does the gradient algorithm converge?</b>',
xaxis_title='Iterations',
yaxis_title='Losses'
)
fig