Matrix operations with pytorch – optimizer – part 3

SVD with pytorch optimizer

This blog post is part of a 3 post miniseries. 

Today’s post in particular covers the topic SVD with pytorch optimizer.

The point of the entire miniseries is to reproduce matrix operations such as matrix inverse and svd using pytorch’s automatic differentiation capability.

These algorithms are already implemented in pytorch itself and other libraries such as scikit-learn. However, we will solve this problem in a general way using gradient descent. We hope that this will provide an understanding of the power of the gradient method in general and the capabilities of pytorch in particular.

To avoid reader fatigue, we present the material in 3 posts:

  • A introductory section: pytorch – playing with tensors demonstrates some basic tensor usage​1​. This notebook also shows how to calculate various derivatives.
  • A main section: pytorch – matrix inverse with pytorch optimizer shows how to calculate the matrix inverse​2​ using gradient descent.
  • An advanced section: SVD with pytorch optimizer shows how to do singular value decomposition​3,4​ with gradient descent.

Some background information on gradient descent can be found here​5,6​.A post with a similar albeit slightly more mathematical character can be found here​7​. Some more advanced material can be found here​8​.

Requirements

Hardware

  • A computer with at least 4 GB Ram

Software

  • The computer can run on Linux, MacOS or Windows

Wetware

  • Familiarity with Python and basic linear algebra

Let’s get  started

The code of this post is provided in a jupyter notebook on github:

https://github.com/hfwittmann/matrix-operations-with-pytorch/blob/master/Matrix-operations-with-pytorch-optimizer/02—SVD-with-pytorch-optimizer.ipynb

Remark: the following part of the post is directly written in a Jupyter notebook. It is displayed via a very nice wordpress plugin nbconvert​9​.

Open In Colab

Outline

Here's the general outline:

Given a matrix M, with dimensions (m x n), we want to decompose it in the following way:

(i) M = U @ S @ V.T

where

  • @ is the matrix multiplication operator
  • U has dimensions (m x n)
  • S is diagonal and has dimensions (n x n)
  • V has dimensions (n x n), V.T is the transpose of V

Question

We want to calculate U, S and V using gradient descent. How can this be done? (Allow yourself to think about this for 5-15 secs)

Answer

Ok so here is how:

Let us use gradient descent with respect to the approximate_inverse to gradually improve the approximate inverse of a matrix.

0) Guess: We guess the answer M_inverse, let's call this guess approximate_inverse. The first guess is random.

1) Improve: We improve approximate_inverse by nudging it in the right direction.

Step 0) Guess is only done once, Step 1) Improve is done many times

How do we know we're moving our approximation in the right direction? We must have a sense if one candidate approximate_inverse_1 is better than another candidate approximate_inverse_2. Put differently, given a candidate for the inverse, we must be able to anser: How far are we off? Are we completely lost? The degree of "being lost" is expressed via a loss function. For our loss function we choose a very common metric the mean squared error, or mse. In this case it measures the distance at each matrix element of our estimate Y_hat and Y_true (the identity matrix).

Further we must decide on what is called the learning rate. The learning rate is one of the most important parameters controlling the gradient descent method. One common way of visualising gradient descent is by comparing the loss function to some sort of hilly landscape. The current combination of parameters (in this case eg. our random inverse matrix) corresponds to a position in this landscape. We would like to get to a valley in this landscape. One way to imagine this is to put a "ball" at the current position and let it roll. This ball moves downhill and can also pickup some momentum.

Code

In [0]:
import torch

# set printing options
torch.set_printoptions(sci_mode=False)

Some helper functions

In [0]:
# define the a matrix-elementwise metric, the mean square error 
def mse(y_hat, y_true): return ((y_hat-y_true)**2).mean()
In [0]:
def improve(U, diag_sqrt, V, M, learning_rate, momentum, losses):

  # the given matrix
  y_true = M

  # estimate the original matrix M
  diag = diag_sqrt * diag_sqrt 
  # this enforces that the diagonal values are >=0
  
  S = torch.diag(input=diag)

  y_hat =  U @ S @ V.T
  
  # loss = "degree of being lost"
  loss1 = mse(y_hat, y_true) # 

  # further constraints
  # 2) orthonomality U.T @ U = I_U
  I_U = torch.eye(U.shape[1])
  loss2 = mse(U.T @ U, I_U)

  # 3) orthonomality U.T @ U = I_U
  I_V = torch.eye(V.shape[0])
  loss3 = mse(V @ V.T, I_V)

  # add up the loss with some scaling factors 
  # ... to improve convergence
  loss = loss1 + loss2/100 + loss3/100


  losses.append(loss.detach().numpy())


  # calculate loss
  loss.backward()

  with torch.no_grad():
    # displace by learning_rate * derivatives
    
    # (i)

    U -= learning_rate * U.grad
    diag_sqrt -= learning_rate * diag_sqrt.grad
    V -= learning_rate * V.grad

    # remark: instead of (i) we could alternatively use sub_ and write (ii),
    #         this is computationally more efficient but harder to read.
    # (ii)
    # approximate_inverse.sub_(learning_rate * approximate_inverse.grad)
    

    # momentum
    # (iii)
    U.grad *= momentum # reduce speed due to friction
    diag_sqrt.grad *= momentum # reduce speed due to friction
    V.grad *= momentum # reduce speed due to friction

    # remark for the case of momentum == 0 we could alternatively write
    # (iv)
    # approximate_inverse.grad.zero_()


  return None

def calculate_svd(M, learning_rate=0.9, momentum=0.9):

  torch.manual_seed(314)
  # Step 0) Guess
  U = torch.rand(size=M.shape, requires_grad=True)
  diag_sqrt = torch.rand(size=[M.shape[1]], requires_grad=True)
  V = torch.rand(size=(M.shape[1], M.shape[1]), requires_grad=True)
  losses = []
  

  for t in range(20000): 
    # Step 1) Improve
    improve(U, diag_sqrt, V,  M, learning_rate, momentum, losses)
  

  diag = diag_sqrt*diag_sqrt

  # sort diagonal values in descending order
  diag_sorted, order = diag.sort(descending=True)
  diag = diag[order]
  U = U[:,order]
  V = V[:,order]

  return U, diag, V, losses

Calculate SVD of random matrix

In [0]:
torch.manual_seed(314)
M1 = torch.rand(size=(5,4)) # torch.diag(torch.tensor([1.,2.,3.,4.]))

U1, diag1, V1, losses_1 = calculate_svd(M1, learning_rate=0.9, momentum=0.9)

Inspect results

Compare with constraints

In [0]:
S1 = torch.diag(input=diag1)
U1 @S1 @V1.T - M1, U1.T @ U1, V1 @ V1.T
Out[0]:
(tensor([[     0.0000,     -0.0000,      0.0000,     -0.0000],
         [    -0.0000,      0.0000,     -0.0000,      0.0000],
         [     0.0000,      0.0000,     -0.0000,      0.0000],
         [    -0.0000,      0.0000,      0.0000,     -0.0000],
         [     0.0000,      0.0000,      0.0000,     -0.0000]],
        grad_fn=<SubBackward0>),
 tensor([[     1.0000,     -0.0000,     -0.0000,     -0.0000],
         [    -0.0000,      1.0000,      0.0000,      0.0000],
         [    -0.0000,      0.0000,      1.0000,     -0.0000],
         [    -0.0000,      0.0000,     -0.0000,      1.0000]],
        grad_fn=<MmBackward>),
 tensor([[     1.0000,     -0.0000,      0.0000,      0.0000],
         [    -0.0000,      1.0000,     -0.0000,     -0.0000],
         [     0.0000,     -0.0000,      1.0000,      0.0000],
         [     0.0000,     -0.0000,      0.0000,      1.0000]],
        grad_fn=<MmBackward>))

Compare with torch implementation

In [0]:
U, S, V = torch.svd(M1)
In [0]:
U , '',U1
Out[0]:
(tensor([[-0.5076, -0.2959, -0.5596, -0.1478],
         [-0.4468, -0.5972,  0.6504, -0.1059],
         [-0.4043,  0.7161,  0.3923, -0.2353],
         [-0.3839,  0.1188, -0.0145,  0.9138],
         [-0.4815,  0.1701, -0.3314, -0.2768]]),
 '',
 tensor([[ 0.5076, -0.2959,  0.5596, -0.1478],
         [ 0.4468, -0.5972, -0.6504, -0.1059],
         [ 0.4043,  0.7161, -0.3923, -0.2353],
         [ 0.3839,  0.1188,  0.0145,  0.9138],
         [ 0.4815,  0.1701,  0.3314, -0.2768]], grad_fn=<IndexBackward>))
In [0]:
S, '',  torch.diag(S1)
Out[0]:
(tensor([2.3398, 0.6873, 0.2592, 0.1177]),
 '',
 tensor([2.3398, 0.6873, 0.2592, 0.1177], grad_fn=<DiagBackward>))
In [0]:
V, '' ,V1
Out[0]:
(tensor([[-0.4633, -0.8784,  0.0512,  0.1056],
         [-0.4603,  0.1387, -0.7021, -0.5253],
         [-0.6461,  0.4191,  0.0297,  0.6372],
         [-0.3949,  0.1831,  0.7096, -0.5540]]),
 '',
 tensor([[ 0.4633, -0.8784, -0.0512,  0.1056],
         [ 0.4603,  0.1387,  0.7021, -0.5253],
         [ 0.6461,  0.4191, -0.0297,  0.6372],
         [ 0.3949,  0.1831, -0.7096, -0.5540]], grad_fn=<IndexBackward>))

Looking good!

Plot convergence speed for random matrix

In [0]:
import pandas as pd
import numpy as np


import plotly.express as px
df = pd.DataFrame(
    {
     'losses_1': losses_1,
     })
fig = px.line(df, y='losses_1', log_y=True)
fig.update_traces(name='Random Matrix 1, Momentum=0.9', showlegend = True)

fig.update_layout(title=\
                  '<b>How fast does the gradient algorithm converge?</b>',
                  xaxis_title='Iterations',
                  yaxis_title='Losses'
                  )
fig
In [0]:
 

Calculate SVD of some other random matrices

In [0]:
torch.manual_seed(315)

M2 = torch.rand(size=(4,4)) # torch.diag(torch.tensor([1.,2.,3.,4.]))

_, _, _, losses_2 = calculate_svd(M2, learning_rate=0.9, momentum=0.9)
In [0]:
torch.manual_seed(316)

M3 = torch.rand(size=(7,4)) # torch.diag(torch.tensor([1.,2.,3.,4.]))

_, _, _, losses_3 = calculate_svd(M3, learning_rate=0.9, momentum=0.9)
In [0]:
torch.manual_seed(317)

M4 = torch.rand(size=(3,4)).T # torch.diag(torch.tensor([1.,2.,3.,4.]))

_, _, _, losses_4 = calculate_svd(M4, learning_rate=0.9, momentum=0.9)
In [0]:
#####################################################################
torch.manual_seed(317)

M5 = torch.rand(size=(4,3)) # torch.diag(torch.tensor([1.,2.,3.,4.]))

_, _, _, losses_5 = calculate_svd(M5, learning_rate=0.9, momentum=0.9)
In [0]:
import pandas as pd
import numpy as np


import plotly.express as px
df = pd.DataFrame(
    {
     'losses_1': losses_1,
     'losses_2' : losses_2,
     'losses_3':losses_3,
     'losses_4': losses_4,
     'losses_5':losses_5
     })
fig = px.line(df, y='losses_1', log_y=True)
fig.update_traces(name='Random Matrix 1, Momentum=0.9', showlegend = True)
fig.add_scatter(y=df['losses_2'], name='Random Matrix 2,  Momentum=0.9')
fig.add_scatter(y=df['losses_3'], name='Random Matrix 3,  Momentum=0.9')
fig.add_scatter(y=df['losses_4'], name='Random Matrix 4,  Momentum=0.9')
fig.add_scatter(y=df['losses_5'], name='Random Matrix 5,  Momentum=0.9')

fig.update_layout(title=\
                  '<b>How fast does the gradient algorithm converge?</b>' + \
                  '<br>Answer:It depends',
                  xaxis_title='Iterations',
                  yaxis_title='Losses'
                  )
fig