SVD with pytorch optimizer
This blog post is part of a three-post miniseries.
Today's post covers singular value decomposition (SVD) with a PyTorch optimizer.
The point of the entire miniseries is to reproduce matrix operations such as the matrix inverse and SVD using PyTorch's automatic differentiation capability.
These algorithms are already implemented in PyTorch itself and in other libraries such as scikit-learn. However, we will solve the problem in a general way using gradient descent. We hope that this provides an understanding of the power of the gradient method in general and of the capabilities of PyTorch in particular.
To avoid reader fatigue, we present the material in 3 posts:
- An introductory section: "pytorch – playing with tensors" demonstrates some basic tensor usage [1]. This notebook also shows how to calculate various derivatives.
- A main section: "pytorch – matrix inverse with pytorch optimizer" shows how to calculate the matrix inverse [2] using gradient descent.
- An advanced section: "SVD with pytorch optimizer" shows how to do singular value decomposition [3, 4] with gradient descent.
Some background information on gradient descent can be found here [5, 6]. A post with a similar, albeit slightly more mathematical, character can be found here [7]. Some more advanced material can be found here [8].
Requirements
Hardware
- A computer with at least 4 GB RAM
Software
- The computer can run Linux, macOS or Windows
- Familiarity with Python and basic linear algebra
Let’s get started
The code of this post is provided in a Jupyter notebook on GitHub:
Remark: the following part of the post is written directly in a Jupyter notebook. It is displayed via a very nice WordPress plugin, nbconvert [9].
Outline
Here's the general outline:
Given a matrix M, with dimensions (m x n), we want to decompose it in the following way:
(i) M = U @ S @ V.T
where
- @ is the matrix multiplication operator
- U has dimensions (m x n)
- S is diagonal and has dimensions (n x n)
- V has dimensions (n x n), V.T is the transpose of V
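For m >= n, the shapes listed above correspond to what is often called the reduced (or thin) SVD. As a quick sanity check of equation (i), the following sketch uses PyTorch's built-in `torch.linalg.svd` with `full_matrices=False`. The matrix size (5 x 4) and the seed are arbitrary choices for illustration. Note that `torch.linalg.svd` returns the transpose of V (named `Vh` here) rather than V.

```python
import torch

torch.manual_seed(0)          # arbitrary seed, for reproducibility only
M = torch.rand(size=(5, 4))   # m = 5, n = 4

# reduced SVD: U is (m x n), S_values has length n, Vh is (n x n)
U, S_values, Vh = torch.linalg.svd(M, full_matrices=False)
S = torch.diag(S_values)      # build the diagonal (n x n) matrix
V = Vh.T                      # torch.linalg.svd returns the transpose of V

print(U.shape, S.shape, V.shape)
# largest absolute deviation between U @ S @ V.T and M (should be tiny)
print((U @ S @ V.T - M).abs().max())
```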
Question
We want to calculate U, S and V using gradient descent. How can this be done? (Allow yourself to think about this for 5-15 secs)
Answer
Ok so here is how:
Let us use gradient descent with respect to U, S and V to gradually improve an approximate decomposition of the matrix M.
0) Guess: We guess the factors U, S and V. The first guess is random.
1) Improve: We improve the guess by nudging U, S and V in the right direction.
Step 0) Guess is done only once; step 1) Improve is done many times.
How do we know we're moving our approximation in the right direction? We must have a sense of whether one candidate decomposition is better than another. Put differently, given candidates for U, S and V, we must be able to answer: How far are we off? Are we completely lost? The degree of "being lost" is expressed via a loss function. For our loss function we choose a very common metric, the mean squared error, or mse. In this case it measures the squared distance, averaged over all matrix elements, between our estimate Y_hat = U @ S @ V.T and Y_true (the given matrix M). Because many different factorizations reproduce M, the code below adds two further loss terms that penalize deviations of U.T @ U and V @ V.T from the identity matrix.
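To make the metric concrete, here is a tiny worked example of the mean squared error on two 2 x 2 matrices. The numbers are made up for illustration; the `mse` function has the same form as the helper used in the code further down.

```python
import torch

def mse(y_hat, y_true):
    # mean of the element-wise squared differences
    return ((y_hat - y_true) ** 2).mean()

y_true = torch.tensor([[1., 2.], [3., 4.]])
y_hat = torch.tensor([[1., 2.], [3., 6.]])  # one element is off by 2

# squared differences are 0, 0, 0, 4; their mean is 4 / 4 = 1
print(mse(y_hat, y_true))  # tensor(1.)
```

A perfect estimate gives a loss of exactly zero, so a smaller loss means a better candidate.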
Further, we must decide on what is called the learning rate. The learning rate is one of the most important parameters controlling the gradient descent method: it sets how large each step is. One common way of visualising gradient descent is to compare the loss function to some sort of hilly landscape. The current combination of parameters (in this case e.g. our random guesses for U, S and V) corresponds to a position in this landscape. We would like to get to a valley in this landscape. One way to imagine this is to put a "ball" at the current position and let it roll. This ball moves downhill and can also pick up some momentum. The momentum parameter sets how much of the previous speed is kept from one step to the next.
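The rolling-ball picture can be sketched in a few lines of plain Python. This is a toy illustration only: the function f(x) = (x - 3)**2, the name `descend` and the parameter values are assumptions chosen to keep the example small, not part of the notebook.

```python
def descend(learning_rate=0.1, momentum=0.9, steps=500):
    """Minimize f(x) = (x - 3)**2 with gradient descent plus momentum."""
    x = 0.0          # starting position of the "ball"
    velocity = 0.0   # accumulated, decayed gradients
    for _ in range(steps):
        gradient = 2.0 * (x - 3.0)                 # derivative of f at x
        velocity = momentum * velocity + gradient  # keep part of the old speed
        x = x - learning_rate * velocity           # move downhill
    return x

x_final = descend()
print(x_final)  # close to the minimum at x = 3
```

With `momentum=0.0` the velocity is just the current gradient, which gives plain gradient descent.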
Code
import torch
# set printing options
torch.set_printoptions(sci_mode=False)
Some helper functions
# define a matrix-elementwise metric, the mean squared error
def mse(y_hat, y_true):
    return ((y_hat - y_true) ** 2).mean()

def improve(U, diag_sqrt, V, M, learning_rate, momentum, losses):
    # the given matrix
    y_true = M
    # squaring enforces that the diagonal values are >= 0
    diag = diag_sqrt * diag_sqrt
    S = torch.diag(input=diag)
    # estimate the original matrix M
    y_hat = U @ S @ V.T
    # loss = "degree of being lost"
    # 1) reconstruction: U @ S @ V.T = M
    loss1 = mse(y_hat, y_true)
    # further constraints
    # 2) orthonormality: U.T @ U = I_U
    I_U = torch.eye(U.shape[1])
    loss2 = mse(U.T @ U, I_U)
    # 3) orthonormality: V @ V.T = I_V
    I_V = torch.eye(V.shape[0])
    loss3 = mse(V @ V.T, I_V)
    # add up the losses with some scaling factors
    # ... to improve convergence
    loss = loss1 + loss2/100 + loss3/100
    losses.append(loss.item())
    # calculate the gradients; backward() adds them to the existing .grad values
    loss.backward()
    with torch.no_grad():
        # displace by learning_rate * derivatives
        # (i)
        U -= learning_rate * U.grad
        diag_sqrt -= learning_rate * diag_sqrt.grad
        V -= learning_rate * V.grad
        # remark: (i) is an in-place update; it could equivalently be
        # written with the in-place method sub_, see (ii).
        # (ii)
        # U.sub_(learning_rate * U.grad)
        # momentum
        # (iii)
        U.grad *= momentum  # reduce speed due to friction
        diag_sqrt.grad *= momentum  # reduce speed due to friction
        V.grad *= momentum  # reduce speed due to friction
        # remark: for the case of momentum == 0 we could alternatively write
        # (iv)
        # U.grad.zero_()
    return None
def calculate_svd(M, learning_rate=0.9, momentum=0.9):
    torch.manual_seed(314)
    # Step 0) Guess
    U = torch.rand(size=M.shape, requires_grad=True)
    diag_sqrt = torch.rand(size=[M.shape[1]], requires_grad=True)
    V = torch.rand(size=(M.shape[1], M.shape[1]), requires_grad=True)
    losses = []
    for t in range(20000):
        # Step 1) Improve
        improve(U, diag_sqrt, V, M, learning_rate, momentum, losses)
    diag = diag_sqrt*diag_sqrt
    # sort diagonal values in descending order
    diag_sorted, order = diag.sort(descending=True)
    diag = diag[order]
    U = U[:,order]
    V = V[:,order]
    return U, diag, V, losses
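The manual update steps (i) and (iii) in `improve` follow the same rule as gradient descent with momentum, so the loop can also be written with PyTorch's built-in `torch.optim.SGD`. The following is a minimal sketch under stated assumptions, not the notebook's code: the function names `svd_loss` and `svd_with_sgd`, the more conservative learning rate of 0.1 and the 2000 steps are choices made here for illustration.

```python
import torch

def mse(y_hat, y_true):
    return ((y_hat - y_true) ** 2).mean()

def svd_loss(U, diag_sqrt, V, M):
    # same three loss terms as in improve()
    S = torch.diag(diag_sqrt * diag_sqrt)
    loss1 = mse(U @ S @ V.T, M)                        # reconstruction
    loss2 = mse(U.T @ U, torch.eye(U.shape[1]))        # orthonormality of U
    loss3 = mse(V @ V.T, torch.eye(V.shape[0]))        # orthonormality of V
    return loss1 + loss2 / 100 + loss3 / 100

def svd_with_sgd(M, learning_rate=0.1, momentum=0.9, steps=2000):
    torch.manual_seed(314)
    # Step 0) Guess
    U = torch.rand(size=M.shape, requires_grad=True)
    diag_sqrt = torch.rand(size=[M.shape[1]], requires_grad=True)
    V = torch.rand(size=(M.shape[1], M.shape[1]), requires_grad=True)
    optimizer = torch.optim.SGD([U, diag_sqrt, V],
                                lr=learning_rate, momentum=momentum)
    losses = []
    for _ in range(steps):
        # Step 1) Improve
        optimizer.zero_grad()   # the optimizer keeps its own momentum buffer
        loss = svd_loss(U, diag_sqrt, V, M)
        loss.backward()
        optimizer.step()
        losses.append(loss.item())
    # sort diagonal values in descending order
    diag, order = (diag_sqrt * diag_sqrt).detach().sort(descending=True)
    return U.detach()[:, order], diag, V.detach()[:, order], losses

torch.manual_seed(0)
M_demo = torch.rand(size=(5, 4))
_, _, _, demo_losses = svd_with_sgd(M_demo)
print(demo_losses[0], demo_losses[-1])  # first and last loss value
```

Unlike the manual version, the gradients are zeroed on every step here, because the optimizer stores the decayed sum of past gradients internally.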
Calculate SVD of random matrix
torch.manual_seed(314)
M1 = torch.rand(size=(5,4)) # torch.diag(torch.tensor([1.,2.,3.,4.]))
U1, diag1, V1, losses_1 = calculate_svd(M1, learning_rate=0.9, momentum=0.9)
Inspect results
Compare with constraints
S1 = torch.diag(input=diag1)
# should be close to: the zero matrix, the identity, the identity
U1 @ S1 @ V1.T - M1, U1.T @ U1, V1 @ V1.T
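Reading three printed matrices by eye works for small cases. As an alternative, the three conditions can be condensed into one number each. The helper below is a hypothetical addition (the name `svd_residuals` is not from the notebook); to keep the example self-contained it is demonstrated on factors from the built-in `torch.linalg.svd`.

```python
import torch

def svd_residuals(U, diag, V, M):
    """Largest absolute violation of each of the three conditions."""
    S = torch.diag(diag)
    reconstruction = (U @ S @ V.T - M).abs().max().item()
    orthonormal_U = (U.T @ U - torch.eye(U.shape[1])).abs().max().item()
    orthonormal_V = (V @ V.T - torch.eye(V.shape[0])).abs().max().item()
    return reconstruction, orthonormal_U, orthonormal_V

# demonstration with factors from the built-in routine
torch.manual_seed(0)
M = torch.rand(size=(5, 4))
U, S_values, Vh = torch.linalg.svd(M, full_matrices=False)
residuals = svd_residuals(U, S_values, Vh.T, M)
print(residuals)  # three small numbers
```

With the notebook's variables, the call would be `svd_residuals(U1, diag1, V1, M1)`.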
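Compare with torch implementation
# note: torch.svd returns V (not V.T); it is deprecated in favor of
# torch.linalg.svd, which returns the transpose of V instead
U, S, V = torch.svd(M1)
U, '', U1
S, '', torch.diag(S1)
V, '', V1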
Looking good!
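When comparing column by column, keep in mind that singular vectors are only determined up to sign: flipping the sign of a column of U together with the matching column of V leaves U @ S @ V.T unchanged. Two correct decompositions can therefore differ by such sign flips (and, if singular values repeat, by more than that). The sketch below illustrates the effect and one way to align signs before comparing. The helper name `align_signs` is hypothetical, and the factors come from `torch.linalg.svd` to keep the example self-contained.

```python
import torch

def align_signs(U, V, U_ref):
    """Flip column pairs of (U, V) so each column of U points like in U_ref."""
    # sign of the dot product between matching columns (+1 or -1)
    signs = torch.sign((U * U_ref).sum(dim=0))
    signs[signs == 0] = 1.0  # leave orthogonal columns untouched
    return U * signs, V * signs

torch.manual_seed(0)
M = torch.rand(size=(5, 4))
U_ref, S_values, Vh = torch.linalg.svd(M, full_matrices=False)
V_ref = Vh.T
S = torch.diag(S_values)

# flip the sign of the second column pair: still a valid decomposition
flip = torch.tensor([1., -1., 1., 1.])
U_flipped, V_flipped = U_ref * flip, V_ref * flip
print(torch.allclose(U_flipped @ S @ V_flipped.T, M, atol=1e-5))  # True

# after aligning, the factors match the reference again
U_aligned, V_aligned = align_signs(U_flipped, V_flipped, U_ref)
print(torch.allclose(U_aligned, U_ref), torch.allclose(V_aligned, V_ref))  # True True
```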
Plot convergence speed for random matrix
import pandas as pd
import numpy as np
import plotly.express as px
df = pd.DataFrame({'losses_1': losses_1})
fig = px.line(df, y='losses_1', log_y=True)
fig.update_traces(name='Random Matrix 1, Momentum=0.9', showlegend=True)
fig.update_layout(
    title='<b>How fast does the gradient algorithm converge?</b>',
    xaxis_title='Iterations',
    yaxis_title='Losses',
)
fig
Calculate SVD of some other random matrices
torch.manual_seed(315)
M2 = torch.rand(size=(4,4)) # torch.diag(torch.tensor([1.,2.,3.,4.]))
_, _, _, losses_2 = calculate_svd(M2, learning_rate=0.9, momentum=0.9)
torch.manual_seed(316)
M3 = torch.rand(size=(7,4)) # torch.diag(torch.tensor([1.,2.,3.,4.]))
_, _, _, losses_3 = calculate_svd(M3, learning_rate=0.9, momentum=0.9)
torch.manual_seed(317)
M4 = torch.rand(size=(3,4)).T # torch.diag(torch.tensor([1.,2.,3.,4.]))
_, _, _, losses_4 = calculate_svd(M4, learning_rate=0.9, momentum=0.9)
#####################################################################
torch.manual_seed(317)
M5 = torch.rand(size=(4,3)) # torch.diag(torch.tensor([1.,2.,3.,4.]))
_, _, _, losses_5 = calculate_svd(M5, learning_rate=0.9, momentum=0.9)
import pandas as pd
import numpy as np
import plotly.express as px
df = pd.DataFrame({
    'losses_1': losses_1,
    'losses_2': losses_2,
    'losses_3': losses_3,
    'losses_4': losses_4,
    'losses_5': losses_5,
})
fig = px.line(df, y='losses_1', log_y=True)
fig.update_traces(name='Random Matrix 1, Momentum=0.9', showlegend=True)
fig.add_scatter(y=df['losses_2'], name='Random Matrix 2, Momentum=0.9')
fig.add_scatter(y=df['losses_3'], name='Random Matrix 3, Momentum=0.9')
fig.add_scatter(y=df['losses_4'], name='Random Matrix 4, Momentum=0.9')
fig.add_scatter(y=df['losses_5'], name='Random Matrix 5, Momentum=0.9')
fig.update_layout(
    title='<b>How fast does the gradient algorithm converge?</b>'
          '<br>Answer: It depends',
    xaxis_title='Iterations',
    yaxis_title='Losses',
)
fig