Loss surface with multiple valleys

This post is a follow-up to 1.

We start off with an eye-catching plot, representing the functioning of an optimiser using the stochastic gradient method. The plot is explained in more detail further below.

Visualisation of a loss surface with multiple minima. The surface is shown in gray; an example path taken by the optimiser is shown in color.

We had previously explored a visualisation of gradient-based optimisation. To this end we plotted an optimisation path against the background of its corresponding loss surface.

Remark: the loss surface is the loss as a function of the fitting parameters. It describes how badly the model fits the data for each choice of parameters.

Following the logic of a tutorial from tensorflow 2, in the previous post we looked at a very simple optimisation, maybe one of the simplest of all: linear regression. As data points we used a straight line with mild noise (a little noise, but not too much). As fitting parameters we used the offset and the slope. As a loss function we used a quadratic function, which is very typical. It turns out that this produces a very well-behaved loss surface with a single valley, which the optimiser has no problem finding.
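The setup described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the code of the tutorial or of the earlier post: the true slope and offset, the noise level, the learning rate and the function names are all assumptions, and for simplicity it uses full-batch gradient descent rather than a stochastic variant.

```python
import random

def make_noisy_line(n=100, slope=3.0, offset=2.0, noise=0.1, seed=0):
    """Sample points from a straight line with mild Gaussian noise (assumed values)."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [slope * x + offset + rng.gauss(0.0, noise) for x in xs]
    return xs, ys

def mse_loss(w, b, xs, ys):
    """Quadratic loss: mean squared residual of the line y = w * x + b."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def mse_gradient(w, b, xs, ys):
    """Analytic gradient of the quadratic loss with respect to w and b."""
    n = len(xs)
    dw = sum(2.0 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    db = sum(2.0 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    return dw, db

def gradient_descent(xs, ys, w=0.0, b=0.0, learning_rate=0.1, steps=500):
    """Return the optimisation path as a list of (w, b, loss) tuples."""
    path = [(w, b, mse_loss(w, b, xs, ys))]
    for _ in range(steps):
        dw, db = mse_gradient(w, b, xs, ys)
        w -= learning_rate * dw
        b -= learning_rate * db
        path.append((w, b, mse_loss(w, b, xs, ys)))
    return path

xs, ys = make_noisy_line()
path = gradient_descent(xs, ys)
w_fit, b_fit, final_loss = path[-1]
```

The list `path` is what one would draw on top of the loss surface: each entry is a point (w, b) together with its height on the surface.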

In this post, for simplicity, we use only one optimisation parameter, so the loss surface is one-dimensional, i.e. a curve. The fitting parameter is the straight line’s angle. Despite this simplicity, we construct a surface that is reasonably hard for a gradient-based optimiser and can in fact derail it.

The primary purpose of this post is to construct, from artificial data, a loss surface that is not well behaved and in particular has many minima, and to visualise it.
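To see why several minima are a problem for a gradient-based optimiser, consider the following toy example. The loss function here is purely illustrative and is not the surface constructed in the post; the parameter name `theta`, the learning rate and the starting points are assumptions chosen so that the effect is easy to see.

```python
import math

def loss(theta):
    """Illustrative one-dimensional loss: a shallow bowl with ripples on top."""
    return 0.05 * theta ** 2 + 1.0 - math.cos(3.0 * theta)

def loss_gradient(theta):
    """Analytic derivative of the illustrative loss."""
    return 0.1 * theta + 3.0 * math.sin(3.0 * theta)

def descend(theta, learning_rate=0.05, steps=1000):
    """Plain gradient descent starting from theta."""
    for _ in range(steps):
        theta -= learning_rate * loss_gradient(theta)
    return theta

good = descend(0.3)   # starts in the valley of the global minimum at theta = 0
stuck = descend(2.0)  # starts in a neighbouring valley and never leaves it
```

Both runs converge, but only the first one reaches the global minimum. The second settles in a local minimum with a higher loss, because the gradient only carries information about the valley the optimiser is currently in.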

Continue reading “Loss surface with multiple valleys”

Comparison of a very simple regression in tensorflow and keras

In this short post we perform a comparative analysis of a very simple regression problem in tensorflow and keras.

We start off with an eye-catching plot, representing the functioning of an optimizer using the stochastic gradient method. The plot is explained in more detail further below.

A rotatable 3D version of the loss function of the regression problem. For the hosting we use the free service by plotly. The black line is the path taken by the optimiser. W and b are the slope and offset parameters of the model. A full view can be found here: https://plot.ly/~hfwittmann/5/loss-function/#/

The focus is on the first principles of gradient descent. We replicate the results of 1,2. The post uses a Gradient Tape, which in turn makes use of automatic differentiation 3,4.
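The idea behind automatic differentiation can be shown with a toy example in plain Python. Note the assumptions: a Gradient Tape records operations and differentiates them in reverse mode, whereas the sketch below uses forward-mode dual numbers, a simpler variant. The class `Dual` and the helper names are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Dual:
    """A number together with its derivative with respect to one chosen input."""
    value: float
    deriv: float = 0.0

    def __add__(self, other):
        other = _lift(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)
    __radd__ = __add__

    def __sub__(self, other):
        other = _lift(other)
        return Dual(self.value - other.value, self.deriv - other.deriv)

    def __rsub__(self, other):
        return _lift(other) - self

    def __mul__(self, other):
        other = _lift(other)
        # Product rule: (u * v)' = u' * v + u * v'
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)
    __rmul__ = __mul__

def _lift(x):
    """Treat plain numbers as constants, i.e. with derivative zero."""
    return x if isinstance(x, Dual) else Dual(float(x), 0.0)

def squared_error(w, b, x, y):
    """Quadratic loss of the line w * x + b for a single data point."""
    residual = w * x + b - y
    return residual * residual

# Seed deriv=1.0 on the parameter we differentiate with respect to.
dloss_dw = squared_error(Dual(1.0, 1.0), Dual(0.0), 2.0, 5.0).deriv
dloss_db = squared_error(Dual(1.0), Dual(0.0, 1.0), 2.0, 5.0).deriv
```

The derivatives are exact, not finite-difference approximations: for w = 1, b = 0 and the data point (2, 5) they agree with the hand-derived expressions 2 (w x + b - y) x and 2 (w x + b - y).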

In the original implementation in 1, the training and testing data are not separate. The motivation behind the original version is doubtless to keep things as simple as possible and to omit everything unimportant. We feel, however, that the lack of a training/testing split might be confusing. Therefore we use a train/test split in the notebooks covered in this post.
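A train/test split can be sketched as follows. This is a generic illustration in plain Python and not the code from the notebooks; the 80/20 ratio, the fixed seed and the function name `train_test_split` are assumptions.

```python
import random

def train_test_split(xs, ys, test_fraction=0.2, seed=0):
    """Shuffle the indices once and cut them into a test part and a training part."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(test_fraction * len(xs)))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    train = ([xs[i] for i in train_idx], [ys[i] for i in train_idx])
    test = ([xs[i] for i in test_idx], [ys[i] for i in test_idx])
    return train, test

# Noise-free example data on a straight line (assumed values).
xs = [i / 10 for i in range(50)]
ys = [3.0 * x + 2.0 for x in xs]
(train_x, train_y), (test_x, test_y) = train_test_split(xs, ys)
```

The model is then fitted on the training part only, and the loss on the test part shows how well the fit carries over to data it has not seen.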

Custom_training_basics_standard_optimizer

Here we present first a “split-variation” of the original version, where the training and testing are in fact split.

We add two more notebooks that replicate the split-variation:

  • A tensorflow-based replication with a standard optimizer
  • A tensorflow/keras implementation.

Please note that all three notebooks are self-contained. Moreover, the results are exactly the same across the notebooks.

As usual the code/notebooks can be found on github:

https://github.com/hfwittmann/comparison-tensorflow-keras

Continue reading “Comparison of a very simple regression in tensorflow and keras”

Dash simple deployment with docker

The previous post was already about dash. So why return to the subject? In some ways I got carried away by the possibilities of dash. I therefore included some concepts that are nice in themselves but introduce a level of complexity that is not necessary for your first deployment. This post is aimed at people who found the previous post too demanding and want a more gentle introduction to deploying dash with docker.
Additionally, there is going to be a stronger focus on infrastructure, in particular on how to add ssl encryption and a basic login.

So here’s an image of what we’re going to build: 

Here’s the full-width view : http://dashsimple4.arthought.com

The login credentials are:
user
word

As you can see already, this is a very simple page.

The code can be found in the following github repository: https://github.com/hfwittmann/dash

The dashboard can be found under this link: https://dashsimple4.arthought.com/

Continue reading “Dash simple deployment with docker”

Dash for timeseries

Dash is an amazing dashboarding framework. If you’re looking for an easy-to-set-up dashboarding framework that produces amazing plots that wow your audience, chances are that this is your perfect fit. It is also friendly to your CPU: the solution I will show here runs on the same 5€/month digital ocean instance as the WordPress installation hosting the article you’re reading.

Dash is fully open source, produced by the makers of plotly. So you can host it yourself, at no cost, or alternatively you can buy a service subscription from plotly; the price list can be found under this link.

In this post we focus on doing things ourselves, so there is no cost, apart from time spent. And, of course, you need a server, running a standard operating system, and some knowledge.

So let’s work on the latter, and try and augment our knowledge.

The code can be found in the following github repository: https://github.com/hfwittmann/dash

The dashboard can be found under this link: https://dashdax.arthought.com/

Continue reading “Dash for timeseries”

Knime – Multivariate time series

Intro:

Knime is a very powerful machine learning tool, particularly suitable for the management of complicated workflows as well as rapid prototyping.

It has recently become yet more useful with the arrival of easy-to-use Python nodes. This is true because sometimes the set of nodes – which is large – still may not provide the exact functionality that you need. On the other hand, Python is flexible enough to do anything easily. Therefore the marriage of the two is very powerful.
This post is about transferring the three previous time series prediction posts into a knime workflow.
This provides a good overview of the data flow and thus makes it easier to manage.
The workflow code is available on github:

Continue reading “Knime – Multivariate time series”

Multivariate Time Series Forecasting with Neural Networks (3) – multivariate signal noise mixtures

In this follow-up post we apply the methods we developed previously to a different dataset. In this third post we mix the previous two datasets.

So, on the one hand, we have noise signals, on the other hand we have innovators and followers.

Continue reading “Multivariate Time Series Forecasting with Neural Networks (3) – multivariate signal noise mixtures”

Multivariate Time Series Forecasting with Neural Networks (1)

In this post we present the results of a competition between various forecasting techniques applied to multivariate time series. The forecasting techniques we use are some neural networks, and also – as a benchmark – arima.

In particular the neural networks we considered are long short term memory (lstm) networks, and dense networks.

The winner in this setting is lstm, followed by dense neural networks, followed by arima.

Of course, arima is typically applied to univariate time series, where it works extremely well. For arima we adopt the approach of treating the multivariate time series as a collection of many univariate time series. As stated, arima is not the main focus of this post and is used only as a benchmark.

To test these forecasting techniques we use random time series. We distinguish between innovator time series and follower time series. Innovator time series are composed of random innovations and can therefore not be forecast. Follower time series are functions of lagged innovator time series and can therefore in principle be forecast.

It turns out that our dense network can only forecast simple functions of innovator time series. For instance, the sum of two lagged innovator series can be forecast by our dense network. A product, however, is already more difficult for a dense network.

In contrast the lstm network deals with products easily and also with more complicated functions.
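The distinction between innovator and follower series can be made concrete with a small generator. This is a sketch under assumptions and not the data-generation code of the post: the Gaussian innovations, the lag of one step, the series length and all names are chosen for illustration only.

```python
import random

def make_series(n_steps=200, lag=1, seed=0):
    """Create two innovator series and two follower series derived from them."""
    rng = random.Random(seed)
    # Innovators: pure random innovations, so they cannot be forecast.
    innovator_a = [rng.gauss(0.0, 1.0) for _ in range(n_steps)]
    innovator_b = [rng.gauss(0.0, 1.0) for _ in range(n_steps)]
    # Followers: functions of lagged innovators, so they can in principle be forecast.
    # The first `lag` entries have no lagged input and are left at 0.0.
    follower_sum = [0.0] * n_steps
    follower_product = [0.0] * n_steps
    for t in range(lag, n_steps):
        follower_sum[t] = innovator_a[t - lag] + innovator_b[t - lag]
        follower_product[t] = innovator_a[t - lag] * innovator_b[t - lag]
    return innovator_a, innovator_b, follower_sum, follower_product

a, b, f_sum, f_prod = make_series()
```

A forecaster that sees the innovators up to time t - 1 has everything it needs to predict both followers at time t. The sum is a linear function of its inputs, whereas the product is not, which is one plausible reason why it is the harder target.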

Continue reading “Multivariate Time Series Forecasting with Neural Networks (1)”

Game of Nim, Reinforcement Learning

Intro

In this post we report success in using reinforcement learning to learn the game of nim. We had previously cited two theses (ERIK JÄRLEBERG (2011) and PAUL GRAHAM & WILLIAM LORD (2015)) that used Q-learning to learn the game of nim. However, in this setting, the scaling issues with Q-learning are much more severe than with value-learning. In this post we use a value-based approach with a table. Because the value-based approach is much more efficient than Q-learning, no function approximation is needed, up to reasonable heap sizes.

Continue reading “Game of Nim, Reinforcement Learning”

Game of Nim, Supervised Learning

There are entire theses devoted to reinforcement learning of the game of nim, in particular those of ERIK JÄRLEBERG (2011) and PAUL GRAHAM & WILLIAM LORD (2015).

Those two were successful in training a reinforcement-based agent to play the game of nim with a high percentage of accurate moves. However, they used lookup tables as their evaluation functions, which leads to scalability problems. Further, there is no particular advantage in using Q-learning as opposed to a Value-based approach. This is due to the fact that the “environment’s response” to a particular action (“take b beans from heap h”) is entirely known, and particularly simple. This is different, for example, from games where the rules are not stated explicitly and must be learned by the agent, as is the case in the video. In the game of nim the rules are stated explicitly. Indeed, if the action “take b beans from heap h” is possible, i.e. there are at least b beans on heap h, then the update rule is:

heapSize(h) -> heapSize(h) – b

In other words, the size of heap h is reduced by the number of beans taken away from it. Therefore, as stated, there is no advantage in using Q-learning over a Value-based approach. The curse of dimensionality, however, is worse for the Q-learning setup: for a heap vector (h0, h1, …, hn-1) there is one Value but, without paying special attention to duplicates, ~ h0 * h1 * … * hn-1 actions, and therefore ~ h0 * h1 * … * hn-1 Q-values. Therefore we will use a Value-based approach.
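The update rule, and the reason why a Value function suffices, can be sketched as follows. The function names, the representation of a position as a tuple of heap sizes and the convention that a value is given from the point of view of the player to move are assumptions made for this illustration.

```python
def apply_move(heaps, h, b):
    """Take b beans from heap h: heapSize(h) -> heapSize(h) - b."""
    if not 0 <= h < len(heaps):
        raise ValueError("no such heap")
    if not 1 <= b <= heaps[h]:
        raise ValueError("illegal number of beans")
    return heaps[:h] + (heaps[h] - b,) + heaps[h + 1:]

def legal_moves(heaps):
    """All (heap, beans) pairs with at least one and at most heapSize(heap) beans."""
    return [(h, b) for h, size in enumerate(heaps) for b in range(1, size + 1)]

def greedy_move(heaps, value_table, default=0.0):
    """Choose the move whose resulting position has the lowest value for the opponent."""
    return min(legal_moves(heaps),
               key=lambda move: value_table.get(apply_move(heaps, *move), default))

# Hypothetical values of the three positions reachable from (1, 2).
values = {(0, 2): 1.0, (1, 1): 0.0, (1, 0): 1.0}
best = greedy_move((1, 2), values)
```

Because the position after a move is known exactly, the agent can look up the Value of that position directly; a separate Q-value per pair of position and action adds nothing.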

In other words, we want to use a neural network approximation for the evaluation function. A priori it is, however, by no means clear that this type of function approximation will work. Yet the game of Nim is in a sense easy, as there is a complete solution of the game. We can use this solution to our advantage by using it to estimate whether the mentioned network approximation is likely to work.

A simple way is to use supervised learning as a test.

The simplest case is to test a classification of a position as winning or losing.
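Labels for such a classification can be generated from the complete solution. The sketch below assumes the normal play convention, in which the player who takes the last bean wins; under that convention a position is losing for the player to move exactly when the bitwise XOR of the heap sizes (the nim-sum) is zero. The function names and the small heap sizes are assumptions for illustration.

```python
from functools import reduce
from itertools import product
from operator import xor

def nim_sum(heaps):
    """Bitwise XOR of all heap sizes."""
    return reduce(xor, heaps, 0)

def is_winning(heaps):
    """True if the player to move can force a win (normal play assumed)."""
    return nim_sum(heaps) != 0

def labelled_positions(max_heap_size, n_heaps):
    """Every position up to the given heap size, paired with its winning/losing label."""
    return [(heaps, is_winning(heaps))
            for heaps in product(range(max_heap_size + 1), repeat=n_heaps)]

dataset = labelled_positions(max_heap_size=3, n_heaps=2)
```

Pairs of this kind, a position as input and a winning/losing flag as target, are what a supervised classifier would be trained on.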

Continue reading “Game of Nim, Supervised Learning”