Matrix operations with pytorch – optimizer – part 2

pytorch – matrix inverse with pytorch optimizer

This blog post is part of a 3-post miniseries. 

Today’s post in particular covers the topic pytorch – matrix inverse with pytorch optimizer.

The point of the entire miniseries is to reproduce matrix operations such as the matrix inverse and SVD using pytorch’s automatic differentiation capability.

These algorithms are already implemented in pytorch itself and other libraries such as scikit-learn. However, we will solve this problem in a general way using gradient descent. We hope that this will provide an understanding of the power of the gradient method in general and the capabilities of pytorch in particular.

To avoid reader fatigue, we present the material in 3 posts:

  • An introductory section: pytorch – playing with tensors demonstrates some basic tensor usage​1​. This notebook also shows how to calculate various derivatives.
  • A main section: pytorch – matrix inverse with pytorch optimizer shows how to calculate the matrix inverse​2​ using gradient descent.
  • An advanced section: SVD with pytorch optimizer shows how to do singular value decomposition​3​ with gradient descent​4​.

Some background information on gradient descent can be found here​5,6​. A post with a similar, albeit slightly more mathematical, character can be found here​7​. Some more advanced material can be found here​8​.

Requirements

Hardware

  • A computer with at least 4 GB RAM

Software

  • The computer can run Linux, macOS, or Windows

Wetware

  • Familiarity with Python and basic linear algebra

Let’s get started

The code of this post is provided in a jupyter notebook on github:

https://github.com/hfwittmann/matrix-operations-with-pytorch/blob/master/Matrix-operations-with-pytorch-optimizer/01—Matrix-inverse-with-pytorch-optimizer.ipynb

Remark: the following part of the post is written directly in a Jupyter notebook. It is displayed via a very nice WordPress plugin, nbconvert​9​.

Outline

Here's the general outline:

Given a square matrix M, we want to calculate its inverse. That is to say, given M we seek M_inverse in the following equation:

(i) M @ M_inverse = I

where

  • @ is the matrix multiplication operator
  • I is the identity matrix
  • the dimensions of M, M_inverse and I are all : n x n
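
To make equation (i) concrete before we build anything ourselves, here is a minimal sketch (not part of the notebook below) that uses pytorch's built-in routine torch.inverse, the very routine we will later reproduce with gradient descent:

import torch

torch.manual_seed(0)
M = torch.rand((3, 3))        # a random square matrix
M_inverse = torch.inverse(M)  # pytorch's built-in inverse

# the product should be numerically close to the identity matrix I
print(M @ M_inverse)
print(torch.eye(3))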

Question

We want to calculate M_inverse using gradient descent. How can this be done? (Allow yourself to think about this for 5-15 secs)

Answer

Ok so here is how:

Let us use gradient descent with respect to a candidate approximate_inverse to gradually improve our approximation of the matrix inverse.

0) Guess: We guess the answer M_inverse; let's call this guess approximate_inverse. The first guess is random.

1) Improve: We improve approximate_inverse by nudging it in the right direction.

Step 0) Guess is done only once; Step 1) Improve is done many times.

How do we know we're moving our approximation in the right direction? We must have a sense of whether one candidate approximate_inverse_1 is better than another candidate approximate_inverse_2. Put differently, given a candidate for the inverse, we must be able to answer: How far are we off? Are we completely lost? The degree of "being lost" is expressed via a loss function. For our loss function we choose a very common metric, the mean squared error (MSE). In this case it measures the elementwise distance between our estimate y_hat and y_true (the identity matrix).
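
As a small numerical illustration (a sketch, not part of the notebook): for a 2 x 2 candidate product y_hat and the identity as y_true, the MSE is simply the average of the squared elementwise differences.

import torch

y_hat = torch.tensor([[0.9, 0.1],
                      [0.2, 1.1]])
y_true = torch.eye(2)

mse = ((y_hat - y_true) ** 2).mean()
print(mse)  # (0.01 + 0.01 + 0.04 + 0.01) / 4 = 0.0175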

Further, we must decide on what is called the learning rate. The learning rate is one of the most important parameters controlling the gradient descent method. One common way of visualising gradient descent is to compare the loss function to some sort of hilly landscape. The current combination of parameters (in this case, our candidate inverse matrix) corresponds to a position in this landscape, and we would like to reach a valley. One way to imagine this is to put a "ball" at the current position and let it roll: the ball moves downhill and can also pick up some momentum.
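
To make the roles of the learning rate and momentum concrete before the full code, here is a minimal one-dimensional sketch of the same update pattern used below (an illustration, not the notebook's code): we minimise f(x) = x² and, instead of zeroing the gradient after each step, we scale it down so that it acts as a decayed "velocity".

import torch

learning_rate, momentum = 0.1, 0.9
x = torch.tensor(5.0, requires_grad=True)

for _ in range(200):
    loss = x ** 2
    loss.backward()                  # gradients accumulate into x.grad
    with torch.no_grad():
        x -= learning_rate * x.grad  # step downhill
        x.grad *= momentum           # keep a decayed "velocity" (friction)

print(x)  # converges towards 0, the minimum of x**2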

Code

In [0]:
import torch

torch.set_printoptions(sci_mode=False)
In [2]:
M = torch.rand((4,4))
M
Out[2]:
tensor([[0.7839, 0.7145, 0.5376, 0.3232],
        [0.2862, 0.6781, 0.3990, 0.8421],
        [0.8651, 0.9477, 0.6445, 0.0973],
        [0.5755, 0.7770, 0.7933, 0.0029]])
In [3]:
M.shape[0] == M.shape[1]
Out[3]:
True
In [0]:
# define a matrix-elementwise metric, the mean squared error
def mse(y_hat, y_true): return ((y_hat-y_true)**2).mean()

def improve(approximate_inverse, M, learning_rate, momentum, losses):

  # identity matrix
  y_true = torch.eye(n=len(M))

  # estimate of identity matrix
  # here we have various choices a) b) c)
  
  # a)
  y_hat = approximate_inverse @ M

  ## b)
  #y_hat = M @ approximate_inverse

  ## c)
  #y_hat = (approximate_inverse @ M + M @ approximate_inverse)/2


  
  # loss = "degree of being lost"
  loss = mse(y_hat, y_true)
  losses.append(loss.detach().numpy())

  # backpropagate: calculate gradients of the loss w.r.t. approximate_inverse
  loss.backward()

  with torch.no_grad():
    # displace by learning_rate * derivatives
    
    # (i)
    approximate_inverse -= learning_rate * approximate_inverse.grad

    # remark: instead of (i) we could alternatively use sub_ and write (ii),
    #         this is computationally more efficient but harder to read.
    # (ii)
    # approximate_inverse.sub_(learning_rate * approximate_inverse.grad)
    

    # momentum: gradients accumulate across backward() calls, so scaling
    # the stored gradient keeps a decayed "velocity" for the next step
    # (iii)
    approximate_inverse.grad *= momentum # reduce speed due to friction

    # remark: in the case momentum == 0 we could alternatively write
    # (iv)
    # approximate_inverse.grad.zero_()

  return None

def calculate_inverse(M, learning_rate=0.9, momentum=0.9):

  isSquare = M.shape[0] == M.shape[1]
  assert isSquare, 'M should be square'

  torch.manual_seed(314)
  # Step 0) Guess
  approximate_inverse = torch.rand(size=M.shape, requires_grad=True)
  losses = []
  

  for t in range(10000): 
    # Step 1) Improve
    improve(approximate_inverse, M, learning_rate, momentum, losses)
  
  return approximate_inverse, losses
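
Before running the notebook's two experiments, here is a quick sanity check one could add (hypothetical, not part of the original notebook): apply calculate_inverse to a small 2 x 2 diagonal matrix whose inverse is known in closed form.

# hypothetical sanity check: diag(1, 2) has the known inverse diag(1, 0.5)
M_test = torch.diag(torch.tensor([1., 2.]))
inverse_test, _ = calculate_inverse(M_test)

print(inverse_test)           # should be close to [[1, 0], [0, 0.5]]
print(inverse_test @ M_test)  # should be close to the identity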

Calculate inverse of random matrix

In [0]:
torch.manual_seed(314)
M1 = torch.rand(size=(4,4)) # torch.diag(torch.tensor([1.,2.,3.,4.]))

approximate_inverse_1, losses_1 = calculate_inverse(M1)

Calculate inverse of diagonal matrix

In [0]:
torch.manual_seed(314)

M2 = torch.diag(torch.tensor([1.,2.,3.,4.])) # torch.rand(size=(4,4)) # 

approximate_inverse_2, losses_2 = calculate_inverse(M2)
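
For the diagonal matrix M2 the exact inverse is known in closed form: it is again diagonal, with the reciprocals 1, 1/2, 1/3 and 1/4 on the diagonal. A quick check one could run here (not part of the original notebook):

exact_inverse_2 = torch.diag(torch.tensor([1., 1/2, 1/3, 1/4]))
print((approximate_inverse_2 - exact_inverse_2).abs().max())  # should be very small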

Inspect results

Let's inspect whether the approximate_inverse_1 is a good approximation the inverse of M1

In [7]:
approximate_inverse_1 @ M1
Out[7]:
tensor([[     1.0000,      0.0000,     -0.0000,      0.0000],
        [     0.0000,      1.0000,      0.0000,     -0.0000],
        [    -0.0000,      0.0000,      1.0000,      0.0000],
        [     0.0000,     -0.0000,      0.0000,      1.0000]],
       grad_fn=<MmBackward>)

Let's inspect whether the approximate_inverse_2 is a good approximation the inverse of M2

In [8]:
approximate_inverse_2 @ M2
Out[8]:
tensor([[     1.0000,      0.0000,     -0.0000,      0.0000],
        [     0.0000,      1.0000,      0.0000,     -0.0000],
        [    -0.0000,      0.0000,      1.0000,      0.0000],
        [     0.0000,     -0.0000,      0.0000,      1.0000]],
       grad_fn=<MmBackward>)

Looking good!
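
As an additional cross-check (a sketch, not in the original notebook), we can compare the gradient-descent result directly against pytorch's built-in inverse:

print(torch.allclose(approximate_inverse_1, torch.inverse(M1), atol=1e-4))
# expected to print True once training has converged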

Compare convergence speed for simple diagonal matrix and general random matrix

In [9]:
import pandas as pd
import numpy as np


import plotly.express as px
df = pd.DataFrame(
    {
     'losses_1':np.array(losses_1).flatten(),
     'losses_2':np.array(losses_2).flatten()
     })
fig = px.line(df, y='losses_1', log_y=True)
fig.update_traces(name='Random Matrix, Momentum=0.9', showlegend = True)
fig.add_scatter(y=df['losses_2'], name='Simple Diagonal Matrix, Momentum=0.9')

fig.update_layout(title=\
                  '<b>How fast does the gradient algorithm converge?</b>' + \
                  '<br>Answer: It depends, for the Simple Diagonal Matrix it' + \
                  ' converges faster',
                  xaxis_title='Iterations',
                  yaxis_title='Losses'
                  )
fig

  1. Matrix operations with pytorch – optimizer – part 1. Arthought. https://arthought.com/matrix-operations-with-pytorch-optimizer-part-1/. Published February 7, 2020. Accessed February 7, 2020.
  2. Invertible matrix. Wikipedia. https://en.wikipedia.org/wiki/Invertible_matrix. Published February 6, 2020. Accessed February 6, 2020.
  3. Singular value decomposition. Wikipedia. https://en.wikipedia.org/wiki/Singular_value_decomposition. Published February 6, 2020. Accessed February 6, 2020.
  4. Matrix operations with pytorch – optimizer – part 3. Arthought. https://arthought.com/. Published February 9, 2020. Accessed February 9, 2020.
  5. Gradient descent. Wikipedia. https://en.wikipedia.org/wiki/Gradient_descent. Published February 8, 2020. Accessed February 8, 2020.
  6. Gradient descent. ML glossary. https://ml-cheatsheet.readthedocs.io/en/latest/gradient_descent.html. Published February 8, 2020. Accessed February 8, 2020.
  7. PyTorch Basics: Solving the Ax=b matrix equation with gradient descent. Bytepawn. http://bytepawn.com/pytorch-basics-solving-the-axb-matrix-equation-with-gradient-descent.html. Published February 8, 2019. Accessed February 8, 2020.
  8. PyTorch for Scientific Computing – Quantum Mechanics Example Part 3) Code Optimizations – Batched Matrix Operations, Cholesky Decomposition and Inverse. Puget Systems. https://www.pugetsystems.com/. Published August 31, 2018. Accessed February 8, 2020.
  9. Challis A. PHP: nbconvert – A WordPress plugin for Jupyter notebooks. https://www.andrewchallis.co.uk/. Published May 1, 2019. Accessed February 6, 2020.