Weight Update Rule for Boltzmann Machines

Prerequisites

Description

This equation updates the weights of a Boltzmann machine during training. Its purpose is to minimize the Kullback-Leibler divergence between the target probability distribution \(\htmlId{tooltip-probDistribution}{P}_{\text{target}}\) and the model's distribution \(\htmlId{tooltip-probDistribution}{P}_{\htmlId{tooltip-weightMatrix}{\mathbf{W}}}\). This gradient descent-based rule adjusts each weight iteratively, reducing the discrepancy between how often units \(\htmlId{tooltip-iter}{i}\) and \(\htmlId{tooltip-2iter}{j}\) agree on the training data and how often they agree when the model runs freely.

Equation

\[\htmlId{tooltip-connectionWeight}{w}_{\htmlId{tooltip-iter}{i} \htmlId{tooltip-2iter}{j}}(\htmlId{tooltip-wholeNumber}{n} + 1) = \htmlId{tooltip-connectionWeight}{w}_{\htmlId{tooltip-iter}{i} \htmlId{tooltip-2iter}{j}}(\htmlId{tooltip-wholeNumber}{n}) + \htmlId{tooltip-learningRate}{\mu}(\htmlId{tooltip-averageProb}{p_{ij}} - \htmlId{tooltip-2averageProb}{q_{ij}})\]
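
To make the rule concrete, here is a minimal Python sketch of a single weight update (the function name and the default learning rate are illustrative choices, not part of the source):

```python
def update_weight(w_ij, p_ij, q_ij, mu=0.1):
    """One step of the Boltzmann machine weight update rule.

    w_ij : current weight w_ij(n)
    p_ij : average probability that units i and j agree on the training data
    q_ij : average probability that units i and j agree in confabulation mode
    mu   : learning rate
    Returns w_ij(n + 1).
    """
    return w_ij + mu * (p_ij - q_ij)
```

If \(\htmlId{tooltip-averageProb}{p_{ij}}\) exceeds \(\htmlId{tooltip-2averageProb}{q_{ij}}\) the weight grows, making units \(\htmlId{tooltip-iter}{i}\) and \(\htmlId{tooltip-2iter}{j}\) more likely to agree; if \(\htmlId{tooltip-2averageProb}{q_{ij}}\) exceeds \(\htmlId{tooltip-averageProb}{p_{ij}}\), it shrinks.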

Symbols Used

\(j\)

This is a secondary symbol for an iterator, a variable that changes value to refer to a sequence of elements.

\(i\)

This is the symbol for an iterator, a variable that changes value to refer to a sequence of elements.

\(w\)

This symbol describes the connection strength between two units in a Boltzmann machine.

\(\mu\)

This is the symbol representing the learning rate.

\(n\)

This symbol represents any given whole number, \( n \in \htmlId{tooltip-setOfWholeNumbers}{\mathbb{W}}\).

\(q_{ij}\)

This symbol describes the average probability, in a Boltzmann machine running in confabulation mode, that two units, \(\htmlId{tooltip-microstate}{\mathbf{s}}_{\htmlId{tooltip-iter}{i}}\) and \(\htmlId{tooltip-microstate}{\mathbf{s}}_{\htmlId{tooltip-2iter}{j}}\), are the same.

\(p_{ij}\)

This symbol describes the average probability, over a set of training samples, that two units, \(\htmlId{tooltip-microstate}{\mathbf{s}}_{\htmlId{tooltip-iter}{i}}\) and \(\htmlId{tooltip-microstate}{\mathbf{s}}_{\htmlId{tooltip-2iter}{j}}\), are the same.
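
Both \(\htmlId{tooltip-averageProb}{p_{ij}}\) and \(\htmlId{tooltip-2averageProb}{q_{ij}}\) are averages over sampled network states. The sketch below shows one possible way to estimate them, assuming binary unit states collected row-wise in a NumPy array (the function name is illustrative):

```python
import numpy as np

def agreement_probabilities(states):
    """Estimate, for every pair (i, j), the probability that units i and j
    take the same value, given sampled binary states of shape
    (num_samples, num_units)."""
    states = np.asarray(states)
    # agreement[s, i, j] is 1 when unit i and unit j agree in sample s
    agreement = (states[:, :, None] == states[:, None, :]).astype(float)
    return agreement.mean(axis=0)  # average over samples -> (num_units, num_units)
```

Applied to states sampled with the visible units clamped to training data, this estimates \(\htmlId{tooltip-averageProb}{p_{ij}}\); applied to states sampled in confabulation mode, it estimates \(\htmlId{tooltip-2averageProb}{q_{ij}}\).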

Derivation

Let us begin by considering the rule for gradient descent:

\[\htmlId{tooltip-modelParameters}{\theta} \leftarrow \htmlId{tooltip-modelParameters}{\theta} - \htmlId{tooltip-learningRate}{\mu} \htmlId{tooltip-vectorGradient}{\nabla} \htmlId{tooltip-risk}{R} (\htmlId{tooltip-modelParameters}{\theta})\]

In the notation we are using in this equation, this is equivalent to:

\[\htmlId{tooltip-connectionWeight}{w}_{\htmlId{tooltip-iter}{i} \htmlId{tooltip-2iter}{j}}(\htmlId{tooltip-wholeNumber}{n} + 1) = \htmlId{tooltip-connectionWeight}{w}_{\htmlId{tooltip-iter}{i} \htmlId{tooltip-2iter}{j}}(\htmlId{tooltip-wholeNumber}{n}) - \htmlId{tooltip-learningRate}{\mu} \frac{\delta \htmlId{tooltip-risk}{R}(\htmlId{tooltip-weightMatrix}{\mathbf{W}})}{\delta \htmlId{tooltip-connectionWeight}{w}_{\htmlId{tooltip-iter}{i} \htmlId{tooltip-2iter}{j}}}\]

since the model parameters \( \htmlId{tooltip-modelParameters}{\theta} \) correspond to the weights \(\htmlId{tooltip-connectionWeight}{w}_{\htmlId{tooltip-iter}{i} \htmlId{tooltip-2iter}{j}}(\htmlId{tooltip-wholeNumber}{n})\) at iteration \(\htmlId{tooltip-wholeNumber}{n}\), and the vector gradient \(\htmlId{tooltip-vectorGradient}{\nabla} \htmlId{tooltip-risk}{R}\) reduces, for each individual weight, to the derivative of \(\htmlId{tooltip-risk}{R}\) with respect to that weight.

For the risk \( \htmlId{tooltip-risk}{R} \) we use the Kullback-Leibler divergence between the target distribution and the model's distribution. Its derivative with respect to a single weight \(\htmlId{tooltip-connectionWeight}{w}_{\htmlId{tooltip-iter}{i} \htmlId{tooltip-2iter}{j}}\) is:

\[\frac{\delta KL(\htmlId{tooltip-probDistribution}{P}_{target}(\htmlId{tooltip-microstate}{\mathbf{s}}),\htmlId{tooltip-probDistribution}{P}_{\htmlId{tooltip-weightMatrix}{\mathbf{W}}}(\htmlId{tooltip-microstate}{\mathbf{s}}))}{\delta \htmlId{tooltip-connectionWeight}{w}_{\htmlId{tooltip-iter}{i} \htmlId{tooltip-2iter}{j}}} = - \frac{1}{\htmlId{tooltip-temperature}{T}}(\htmlId{tooltip-averageProb}{p_{ij}} - \htmlId{tooltip-2averageProb}{q_{ij}})\]

Using our gradient descent rule with the gradient of the Kullback-Leibler loss function, we get:

\[\htmlId{tooltip-connectionWeight}{w}_{\htmlId{tooltip-iter}{i} \htmlId{tooltip-2iter}{j}}(\htmlId{tooltip-wholeNumber}{n} + 1) = \htmlId{tooltip-connectionWeight}{w}_{\htmlId{tooltip-iter}{i} \htmlId{tooltip-2iter}{j}}(\htmlId{tooltip-wholeNumber}{n}) - \htmlId{tooltip-learningRate}{\mu} (\frac{\delta KL(\htmlId{tooltip-probDistribution}{P}_{target}(\htmlId{tooltip-microstate}{\mathbf{s}}),\htmlId{tooltip-probDistribution}{P}_{\htmlId{tooltip-weightMatrix}{\mathbf{W}}}(\htmlId{tooltip-microstate}{\mathbf{s}}))}{\delta \htmlId{tooltip-connectionWeight}{w}_{\htmlId{tooltip-iter}{i} \htmlId{tooltip-2iter}{j}}} )\]

By substituting in the right-hand side of the equation for the gradient of our Kullback-Leibler loss, we get:

\[\htmlId{tooltip-connectionWeight}{w}_{\htmlId{tooltip-iter}{i} \htmlId{tooltip-2iter}{j}}(\htmlId{tooltip-wholeNumber}{n} + 1) = \htmlId{tooltip-connectionWeight}{w}_{\htmlId{tooltip-iter}{i} \htmlId{tooltip-2iter}{j}}(\htmlId{tooltip-wholeNumber}{n}) - \htmlId{tooltip-learningRate}{\mu}(- \frac{1}{\htmlId{tooltip-temperature}{T}}(\htmlId{tooltip-averageProb}{p_{ij}} - \htmlId{tooltip-2averageProb}{q_{ij}}))\]

We can now simplify the double negative:

\[\htmlId{tooltip-connectionWeight}{w}_{\htmlId{tooltip-iter}{i} \htmlId{tooltip-2iter}{j}}(\htmlId{tooltip-wholeNumber}{n} + 1) = \htmlId{tooltip-connectionWeight}{w}_{\htmlId{tooltip-iter}{i} \htmlId{tooltip-2iter}{j}}(\htmlId{tooltip-wholeNumber}{n}) + \htmlId{tooltip-learningRate}{\mu}( \frac{1}{\htmlId{tooltip-temperature}{T}}(\htmlId{tooltip-averageProb}{p_{ij}} - \htmlId{tooltip-2averageProb}{q_{ij}}))\]

Finally, we can absorb the \(\frac{1}{\htmlId{tooltip-temperature}{T}}\) factor into the learning rate (replacing \( \htmlId{tooltip-learningRate}{\mu} \) with \(\htmlId{tooltip-learningRate}{\mu} \cdot \frac{1}{\htmlId{tooltip-temperature}{T}}\), which we again write as \( \htmlId{tooltip-learningRate}{\mu} \)). This gives us:

\[\htmlId{tooltip-connectionWeight}{w}_{\htmlId{tooltip-iter}{i} \htmlId{tooltip-2iter}{j}}(\htmlId{tooltip-wholeNumber}{n} + 1) = \htmlId{tooltip-connectionWeight}{w}_{\htmlId{tooltip-iter}{i} \htmlId{tooltip-2iter}{j}}(\htmlId{tooltip-wholeNumber}{n}) + \htmlId{tooltip-learningRate}{\mu}(\htmlId{tooltip-averageProb}{p_{ij}} - \htmlId{tooltip-2averageProb}{q_{ij}})\]

as required.
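
Putting the derivation together, one full training step updates every weight at once. The sketch below is illustrative rather than definitive: it reuses agreement_probabilities from the earlier sketch, assumes a symmetric weight matrix with zero diagonal (as is standard for Boltzmann machines), and absorbs the \(\frac{1}{\htmlId{tooltip-temperature}{T}}\) factor into the learning rate as in the final step above.

```python
import numpy as np

def training_step(W, clamped_states, free_states, mu=0.1):
    """One gradient step on the full weight matrix W.

    W              : (n_units, n_units) symmetric weight matrix, zero diagonal
    clamped_states : binary samples drawn with the visible units clamped to training data
    free_states    : binary samples drawn in confabulation (free-running) mode
    mu             : learning rate, with any 1/T factor already absorbed
    """
    p = agreement_probabilities(clamped_states)  # p_ij: clamped (training) phase
    q = agreement_probabilities(free_states)     # q_ij: confabulation phase
    W_new = W + mu * (p - q)
    np.fill_diagonal(W_new, 0.0)                 # no self-connections
    return W_new
```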

Example

Let us now work through an example for a single weight update, using the equation:

\[\htmlId{tooltip-connectionWeight}{w}_{\htmlId{tooltip-iter}{i} \htmlId{tooltip-2iter}{j}}(\htmlId{tooltip-wholeNumber}{n} + 1) = \htmlId{tooltip-connectionWeight}{w}_{\htmlId{tooltip-iter}{i} \htmlId{tooltip-2iter}{j}}(\htmlId{tooltip-wholeNumber}{n}) + \htmlId{tooltip-learningRate}{\mu}(\htmlId{tooltip-averageProb}{p_{ij}} - \htmlId{tooltip-2averageProb}{q_{ij}})\]

We will say that:

\[\htmlId{tooltip-connectionWeight}{w}_{\htmlId{tooltip-iter}{i} \htmlId{tooltip-2iter}{j}}(\htmlId{tooltip-wholeNumber}{n}) = 0.5, \qquad \htmlId{tooltip-learningRate}{\mu} = 0.1, \qquad \htmlId{tooltip-averageProb}{p_{ij}} = 0.7, \qquad \htmlId{tooltip-2averageProb}{q_{ij}} = 0.4\]

Substituting these values in, we find:

\[\begin{align*}\htmlId{tooltip-connectionWeight}{w}_{\htmlId{tooltip-iter}{i} \htmlId{tooltip-2iter}{j}}(\htmlId{tooltip-wholeNumber}{n} + 1) &= 0.5 + 0.1(0.7 - 0.4)\\&= 0.5 + 0.1(0.3)\\&= 0.5 + 0.03\\&= 0.53\end{align*}\]

So the updated weight is \(\htmlId{tooltip-connectionWeight}{w}_{\htmlId{tooltip-iter}{i} \htmlId{tooltip-2iter}{j}}(\htmlId{tooltip-wholeNumber}{n} + 1) = 0.53\).
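
The arithmetic can be checked directly, for example in Python:

```python
w_n, mu, p_ij, q_ij = 0.5, 0.1, 0.7, 0.4
w_next = w_n + mu * (p_ij - q_ij)
print(w_next)  # 0.53 (up to floating-point rounding)
```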

References

  1. Jaeger, H. (n.d.). Neural Networks (AI) (WBAI028-05) Lecture Notes BSc program in Artificial Intelligence. Retrieved April 27, 2024, from https://www.ai.rug.nl/minds/uploads/LN_NN_RUG.pdf
