Matrix operations with PyTorch – optimizer – addendum

This blog post is an addendum to a three-post miniseries [1].

Here we present two notebooks.

  • 02a—SVD-with-pytorch-optimizer-SGD.ipynb

This is a drop-in replacement for the stochastic gradient + momentum method shown earlier [2], but it uses the built-in PyTorch SGD optimizer.

  • 02b—SVD-with-pytorch-optimizer-adam.ipynb

This uses the built-in PyTorch Adam optimizer rather than the SGD optimizer. As reported in the literature, Adam often shows better results [3].
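
For reference, the difference between the two notebooks boils down to which built-in optimizer object is constructed for the same three parameters. Here is a minimal, illustrative sketch of the swap; the tensors are stand-ins for the U, diag_sqrt and V defined further below, and the Adam learning rate is only a placeholder, not necessarily the value used in notebook 02b:

import torch

# stand-ins for the parameters U, diag_sqrt, V used later in this post
U = torch.rand(5, 4, requires_grad=True)
diag_sqrt = torch.rand(4, requires_grad=True)
V = torch.rand(4, 4, requires_grad=True)

# SGD with momentum (this notebook, 02a)
optimizer = torch.optim.SGD([U, diag_sqrt, V], lr=0.9, momentum=0.9)

# Adam (notebook 02b); lr=0.01 is only illustrative
optimizer = torch.optim.Adam([U, diag_sqrt, V], lr=0.01)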

So here’s the notebook using the built-in PyTorch SGD optimizer.

Open In Colab

Outline

Here's the general outline:

Given a matrix M, with dimensions (m x n), we want to decompose it in the following way:

(i) M = U @ S @ V.T

where

  • @ is the matrix multiplication operator
  • U has dimensions (m x n)
  • S is diagonal and has dimensions (n x n)
  • V has dimensions (n x n), V.T is the transpose of V
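
As a quick sanity check of these shape conventions, here is a small sketch using the built-in torch.svd (the same reference implementation we compare against later). Note that torch.svd returns the singular values as a vector, so we wrap them in torch.diag to rebuild S:

import torch

m, n = 5, 4
M = torch.rand(m, n)

U, S_vec, V = torch.svd(M)             # reduced SVD
print(U.shape, S_vec.shape, V.shape)   # torch.Size([5, 4]) torch.Size([4]) torch.Size([4, 4])

# reconstruction: M ≈ U @ S @ V.T with S = diag(S_vec)
print(torch.allclose(U @ torch.diag(S_vec) @ V.T, M, atol=1e-5))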

Question

We want to calculate U, S and V using gradient descent. How can this be done? (Allow yourself to think about this for 5-15 secs)


Answer

Ok so here is how:

We use gradient descent with respect to U, S and V to gradually improve an approximate decomposition of the matrix M.

0) Guess: We guess the factors U, S and V. The first guess is random.

1) Improve: We improve the guess by nudging U, S and V in the right direction.

Step 0) Guess is only done once; Step 1) Improve is done many times.

How do we know we're moving our approximation in the right direction? We must have a sense of whether one candidate decomposition is better than another. Put differently, given candidates for U, S and V, we must be able to answer: How far are we off? Are we completely lost? The degree of "being lost" is expressed via a loss function. For our loss function we choose a very common metric, the mean squared error, or mse. In this case it measures, element by element, the distance between our estimate y_hat = U @ S @ V.T and y_true = M (the given matrix).
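
To make this concrete, here is a tiny, self-contained illustration of the mse metric on two hand-picked 2x2 matrices (the same mse helper is defined in the code section below):

import torch

def mse(y_hat, y_true): return ((y_hat - y_true)**2).mean()

y_true = torch.tensor([[1., 0.],
                       [0., 1.]])
y_hat  = torch.tensor([[0.9, 0.1],
                       [0.2, 1.1]])

# mean of the squared element-wise differences:
# (0.01 + 0.01 + 0.04 + 0.01) / 4 = 0.0175
print(mse(y_hat, y_true))   # tensor(0.0175)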

Further, we must decide on what is called the learning rate. The learning rate is one of the most important parameters controlling the gradient descent method. One common way of visualising gradient descent is to compare the loss function to some sort of hilly landscape. The current combination of parameters (in this case, our current guesses for U, S and V) corresponds to a position in this landscape, and we would like to get to a valley. One way to imagine this is to put a "ball" at the current position and let it roll: the ball moves downhill and can also pick up some momentum.
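
Before the full SVD code, here is a minimal, self-contained toy sketch of the guess-and-improve loop with the same SGD-plus-momentum optimizer: we fit a matrix X so that A @ X matches a target B. All numbers (A, B, the learning rate, the iteration count) are only illustrative:

import torch

torch.manual_seed(0)

A = torch.tensor([[2., 0.],
                  [0., 4.]])
B = torch.tensor([[1., 2.],
                  [3., 4.]])

# Step 0) Guess: a random candidate, done once
X = torch.rand(2, 2, requires_grad=True)

optimizer = torch.optim.SGD([X], lr=0.1, momentum=0.9)

# Step 1) Improve: done many times
for _ in range(500):
    loss = ((A @ X - B)**2).mean()   # mse, the "degree of being lost"
    optimizer.zero_grad()
    loss.backward()                  # gradients of the loss w.r.t. X
    optimizer.step()                 # nudge X downhill, with momentum

print(X)   # close to [[0.50, 1.00], [0.75, 1.00]], since A @ X = B there

The calculate_svd code below follows exactly this pattern, just with three parameters (U, diag_sqrt, V) and a few extra terms in the loss.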

Code

In [0]:
import torch

# set printing options
torch.set_printoptions(sci_mode=False)

Some helper functions

In [0]:
# define an element-wise matrix metric, the mean squared error
def mse(y_hat, y_true): return ((y_hat-y_true)**2).mean()
In [0]:
def improve(U, diag_sqrt, V, M, optimizer: torch.optim.Optimizer, losses: list):

  # the given matrix
  y_true = M

  # estimate the original matrix M
  diag = diag_sqrt * diag_sqrt 
  # this enforces that the diagonal values are >=0
  
  S = torch.diag(input=diag)

  y_hat =  U @ S @ V.T
  
  # loss = "degree of being lost"
  loss1 = mse(y_hat, y_true)

  # further constraints:
  # 2) orthonormality U.T @ U = I_U
  I_U = torch.eye(U.shape[1])
  loss2 = mse(U.T @ U, I_U)

  # 3) orthonormality V @ V.T = I_V
  I_V = torch.eye(V.shape[0])
  loss3 = mse(V @ V.T, I_V)

  # add up the losses with some scaling factors
  # ... to improve convergence
  loss = loss1 + loss2/100 + loss3/100


  losses.append(loss.detach().numpy())

  # reset the gradients from the previous iteration
  optimizer.zero_grad()

  # backpropagation: calculate the gradients of the loss
  # with respect to U, diag_sqrt and V
  loss.backward()

  # displace the parameters by learning_rate * gradients (plus momentum)
  optimizer.step()

  return None

def calculate_svd(M, learning_rate=0.9, momentum=0.9):

  torch.manual_seed(314)
  # Step 0) Guess
  U = torch.rand(size=M.shape, requires_grad=True)
  diag_sqrt = torch.rand(size=[M.shape[1]], requires_grad=True)
  V = torch.rand(size=(M.shape[1], M.shape[1]), requires_grad=True)
  losses = []
  optimizer = torch.optim.SGD([U, diag_sqrt, V], lr=learning_rate, momentum=momentum)
  

  for t in range(20000): 
    # Step 1) Improve
    improve(U, diag_sqrt, V,  M, optimizer, losses)
  

  diag = diag_sqrt*diag_sqrt

  # sort the singular values in descending order
  # and reorder the columns of U and V to match
  diag, order = diag.sort(descending=True)
  U = U[:,order]
  V = V[:,order]

  return U, diag, V, losses

Calculate SVD of random matrix

In [0]:
torch.manual_seed(314)
M1 = torch.rand(size=(5,4)) # torch.diag(torch.tensor([1.,2.,3.,4.]))

U1, diag1, V1, losses_1 = calculate_svd(M1, learning_rate=0.9, momentum=0.9)

Inspect results

Compare with constraints

In [0]:
S1 = torch.diag(input=diag1)
U1 @ S1 @ V1.T - M1, U1.T @ U1, V1 @ V1.T
Out[0]:
(tensor([[     0.0000,     -0.0000,      0.0000,     -0.0000],
         [    -0.0000,      0.0000,     -0.0000,      0.0000],
         [    -0.0000,      0.0000,     -0.0000,      0.0000],
         [    -0.0000,      0.0000,      0.0000,     -0.0000],
         [     0.0000,      0.0000,      0.0000,      0.0000]],
        grad_fn=<SubBackward0>),
 tensor([[     1.0000,     -0.0000,     -0.0000,     -0.0000],
         [    -0.0000,      1.0000,      0.0000,      0.0000],
         [    -0.0000,      0.0000,      1.0000,     -0.0000],
         [    -0.0000,      0.0000,     -0.0000,      1.0000]],
        grad_fn=<MmBackward>),
 tensor([[     1.0000,     -0.0000,      0.0000,      0.0000],
         [    -0.0000,      1.0000,     -0.0000,      0.0000],
         [     0.0000,     -0.0000,      1.0000,      0.0000],
         [     0.0000,      0.0000,      0.0000,      1.0000]],
        grad_fn=<MmBackward>))

Compare with torch implementation

In [0]:
U, S, V = torch.svd(M1)
In [0]:
U, '', U1
Out[0]:
(tensor([[-0.5076, -0.2959, -0.5596, -0.1478],
         [-0.4468, -0.5972,  0.6504, -0.1059],
         [-0.4043,  0.7161,  0.3923, -0.2353],
         [-0.3839,  0.1188, -0.0145,  0.9138],
         [-0.4815,  0.1701, -0.3314, -0.2768]]),
 '',
 tensor([[ 0.5076, -0.2959,  0.5596, -0.1478],
         [ 0.4468, -0.5972, -0.6504, -0.1059],
         [ 0.4043,  0.7161, -0.3923, -0.2353],
         [ 0.3839,  0.1188,  0.0145,  0.9138],
         [ 0.4815,  0.1701,  0.3314, -0.2768]], grad_fn=<IndexBackward>))
In [0]:
S, '', torch.diag(S1)
Out[0]:
(tensor([2.3398, 0.6873, 0.2592, 0.1177]),
 '',
 tensor([2.3398, 0.6873, 0.2592, 0.1177], grad_fn=<DiagBackward>))
In [0]:
V, '', V1
Out[0]:
(tensor([[-0.4633, -0.8784,  0.0512,  0.1056],
         [-0.4603,  0.1387, -0.7021, -0.5253],
         [-0.6461,  0.4191,  0.0297,  0.6372],
         [-0.3949,  0.1831,  0.7096, -0.5540]]),
 '',
 tensor([[ 0.4633, -0.8784, -0.0512,  0.1056],
         [ 0.4603,  0.1387,  0.7021, -0.5253],
         [ 0.6461,  0.4191, -0.0297,  0.6372],
         [ 0.3949,  0.1831, -0.7096, -0.5540]], grad_fn=<IndexBackward>))

Looking good! The remaining sign differences in some columns of U1 and V1 are expected: the SVD is only unique up to flipping the sign of a column of U together with the matching column of V, which leaves U @ S @ V.T unchanged.

Plot convergence speed for random matrix

In [0]:
import pandas as pd
import numpy as np


import plotly.express as px
df = pd.DataFrame(
    {
     'losses_1': losses_1,
     })
fig = px.line(df, y='losses_1', log_y=True)
fig.update_traces(name='Random Matrix 1, Momentum=0.9', showlegend = True)

fig.update_layout(title=\
                  '<b>How fast does the gradient algorithm converge?</b>',
                  xaxis_title='Iterations',
                  yaxis_title='Losses'
                  )
fig

Calculate SVD of some other random matrices

In [0]:
torch.manual_seed(315)

M2 = torch.rand(size=(4,4)) # torch.diag(torch.tensor([1.,2.,3.,4.]))

_, _, _, losses_2 = calculate_svd(M2, learning_rate=0.9, momentum=0.9)
In [0]:
torch.manual_seed(316)

M3 = torch.rand(size=(7,4)) # torch.diag(torch.tensor([1.,2.,3.,4.]))

_, _, _, losses_3 = calculate_svd(M3, learning_rate=0.9, momentum=0.9)
In [0]:
torch.manual_seed(317)

M4 = torch.rand(size=(3,4)).T # torch.diag(torch.tensor([1.,2.,3.,4.]))

_, _, _, losses_4 = calculate_svd(M4, learning_rate=0.9, momentum=0.9)
In [0]:
torch.manual_seed(317)

M5 = torch.rand(size=(4,3)) # torch.diag(torch.tensor([1.,2.,3.,4.]))

_, _, _, losses_5 = calculate_svd(M5, learning_rate=0.9, momentum=0.9)
In [0]:
import pandas as pd
import numpy as np


import plotly.express as px
df = pd.DataFrame(
    {
     'losses_1': losses_1,
     'losses_2' : losses_2,
     'losses_3':losses_3,
     'losses_4': losses_4,
     'losses_5':losses_5
     })
fig = px.line(df, y='losses_1', log_y=True)
fig.update_traces(name='Random Matrix 1, Momentum=0.9', showlegend = True)
fig.add_scatter(y=df['losses_2'], name='Random Matrix 2,  Momentum=0.9')
fig.add_scatter(y=df['losses_3'], name='Random Matrix 3,  Momentum=0.9')
fig.add_scatter(y=df['losses_4'], name='Random Matrix 4,  Momentum=0.9')
fig.add_scatter(y=df['losses_5'], name='Random Matrix 5,  Momentum=0.9')

fig.update_layout(title=\
                  '<b>How fast does the gradient algorithm converge?</b>' + \
'<br>Answer: It depends',
                  xaxis_title='Iterations',
                  yaxis_title='Losses'
                  )
fig