The equation is used to update the weights of a Boltzmann machine during training. Its purpose is to minimize the Kullback-Leibler divergence between the target probability distribution \(\htmlId{tooltip-probDistribution}{P}_{\text{target}}\) and the model's distribution \(\htmlId{tooltip-probDistribution}{P}_{\htmlId{tooltip-weightMatrix}{\mathbf{W}}}\). This gradient-descent-based rule adjusts the weights iteratively, reducing the discrepancy between the joint statistics of units \(\htmlId{tooltip-iter}{i}\) and \(\htmlId{tooltip-2iter}{j}\) measured on the training samples and those measured in confabulation mode.
\(j\) | This is a secondary symbol for an iterator, a variable that changes value to refer to a sequence of elements. |
\(i\) | This is the symbol for an iterator, a variable that changes value to refer to a sequence of elements. |
\(w\) | This symbol describes the connection strength between two units in a Boltzmann machine. |
\(\mu\) | This is the symbol representing the learning rate. |
\(n\) | This symbol represents any given whole number, \( n \in \htmlId{tooltip-setOfWholeNumbers}{\mathbb{W}}\). |
\(q_{ij}\) | This symbol describes the average probability, in a Boltzmann machine running in confabulation mode, that two units, \(\htmlId{tooltip-microstate}{\mathbf{s}}_{\htmlId{tooltip-iter}{i}}\) and \(\htmlId{tooltip-microstate}{\mathbf{s}}_{\htmlId{tooltip-2iter}{j}}\), are in the same state. |
\(p_{ij}\) | This symbol describes the average probability, over a set of training samples, that two units, \(\htmlId{tooltip-microstate}{\mathbf{s}}_{\htmlId{tooltip-iter}{i}}\) and \(\htmlId{tooltip-microstate}{\mathbf{s}}_{\htmlId{tooltip-2iter}{j}}\), are in the same state. |
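Before starting the derivation, it may help to see how \(p_{ij}\) and \(q_{ij}\) could be estimated in practice. Below is a minimal Python sketch, assuming sampled binary unit states are available as NumPy arrays; the array names, shapes, and random placeholder data are our own illustrative assumptions, not part of the original text.

```python
import numpy as np

def same_state_fraction(samples: np.ndarray, i: int, j: int) -> float:
    """Fraction of sampled states in which units i and j take the same value.

    samples: array of shape (num_samples, num_units) holding binary unit
    states (e.g. 0/1), one row per sampled network state.
    """
    return float(np.mean(samples[:, i] == samples[:, j]))

# Hypothetical data: 1000 sampled binary states of a 5-unit machine.
rng = np.random.default_rng(0)
clamped_samples = rng.integers(0, 2, size=(1000, 5))  # training samples
free_samples = rng.integers(0, 2, size=(1000, 5))     # confabulation mode

p_ij = same_state_fraction(clamped_samples, i=0, j=1)  # estimate of p_ij
q_ij = same_state_fraction(free_samples, i=0, j=1)     # estimate of q_ij
```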
Let us begin by considering the rule for gradient descent:
\[\htmlId{tooltip-modelParameters}{\theta} \leftarrow \htmlId{tooltip-modelParameters}{\theta} - \htmlId{tooltip-learningRate}{\mu} \htmlId{tooltip-vectorGradient}{\nabla} \htmlId{tooltip-risk}{R} (\htmlId{tooltip-modelParameters}{\theta})\]
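As a quick illustration of this rule, here is a single update step in Python. The quadratic risk \(R(\theta) = \theta^2\) and the particular values of \(\theta\) and \(\mu\) are placeholder assumptions chosen only to make the sketch runnable.

```python
def gradient_descent_step(theta: float, mu: float, grad_R: float) -> float:
    """One step of gradient descent: theta <- theta - mu * grad R(theta)."""
    return theta - mu * grad_R

# Toy risk R(theta) = theta**2, whose gradient is 2 * theta.
theta, mu = 1.0, 0.1
theta = gradient_descent_step(theta, mu, grad_R=2 * theta)  # theta is now 0.8
```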
In the notation we are using, the gradient descent rule is equivalent to:
\[\htmlId{tooltip-connectionWeight}{w}_{\htmlId{tooltip-iter}{i} \htmlId{tooltip-2iter}{j}}(\htmlId{tooltip-wholeNumber}{n} + 1) = \htmlId{tooltip-connectionWeight}{w}_{\htmlId{tooltip-iter}{i} \htmlId{tooltip-2iter}{j}}(\htmlId{tooltip-wholeNumber}{n}) - \htmlId{tooltip-learningRate}{\mu} \htmlId{tooltip-vectorGradient}{\nabla} \htmlId{tooltip-risk}{R}(\htmlId{tooltip-weightMatrix}{\mathbf{W}})\]
as the model parameters (\( \htmlId{tooltip-modelParameters}{\theta} \)) correspond to a weight at some iteration (\(\htmlId{tooltip-connectionWeight}{w}_{\htmlId{tooltip-iter}{i} \htmlId{tooltip-2iter}{j}}(\htmlId{tooltip-wholeNumber}{n})\)), and the relevant component of the gradient is the partial derivative with respect to that weight.
For the gradient of our risk (\( \htmlId{tooltip-risk}{R} \)), we can use the Kullback-Leibler loss function, whose derivative with respect to a single weight is:
\[\frac{\partial KL(\htmlId{tooltip-probDistribution}{P}_{\text{target}}(\htmlId{tooltip-microstate}{\mathbf{s}}),\htmlId{tooltip-probDistribution}{P}_{\htmlId{tooltip-weightMatrix}{\mathbf{W}}}(\htmlId{tooltip-microstate}{\mathbf{s}}))}{\partial \htmlId{tooltip-connectionWeight}{w}_{\htmlId{tooltip-iter}{i} \htmlId{tooltip-2iter}{j}}} = - \frac{1}{\htmlId{tooltip-temperature}{T}}(\htmlId{tooltip-averageProb}{p_{ij}} - \htmlId{tooltip-2averageProb}{q_{ij}})\]
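In code, this gradient is a one-line computation once \(p_{ij}\), \(q_{ij}\), and the temperature \(T\) are known. A minimal sketch, where the numeric values are placeholders rather than anything from the original:

```python
def kl_gradient(p_ij: float, q_ij: float, T: float) -> float:
    """Gradient of the KL divergence w.r.t. w_ij: -(1/T) * (p_ij - q_ij)."""
    return -(p_ij - q_ij) / T

kl_gradient(p_ij=0.7, q_ij=0.4, T=1.0)  # approximately -0.3
```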
Using our gradient descent rule with the gradient of the Kullback-Leibler loss function, we get:
\[\htmlId{tooltip-connectionWeight}{w}_{\htmlId{tooltip-iter}{i} \htmlId{tooltip-2iter}{j}}(\htmlId{tooltip-wholeNumber}{n} + 1) = \htmlId{tooltip-connectionWeight}{w}_{\htmlId{tooltip-iter}{i} \htmlId{tooltip-2iter}{j}}(\htmlId{tooltip-wholeNumber}{n}) - \htmlId{tooltip-learningRate}{\mu} \left(\frac{\partial KL(\htmlId{tooltip-probDistribution}{P}_{\text{target}}(\htmlId{tooltip-microstate}{\mathbf{s}}),\htmlId{tooltip-probDistribution}{P}_{\htmlId{tooltip-weightMatrix}{\mathbf{W}}}(\htmlId{tooltip-microstate}{\mathbf{s}}))}{\partial \htmlId{tooltip-connectionWeight}{w}_{\htmlId{tooltip-iter}{i} \htmlId{tooltip-2iter}{j}}}\right)\]
By substituting in the right-hand side of the equation for the gradient of our Kullback-Leibler loss, we get:
\[\htmlId{tooltip-connectionWeight}{w}_{\htmlId{tooltip-iter}{i} \htmlId{tooltip-2iter}{j}}(\htmlId{tooltip-wholeNumber}{n} + 1) = \htmlId{tooltip-connectionWeight}{w}_{\htmlId{tooltip-iter}{i} \htmlId{tooltip-2iter}{j}}(\htmlId{tooltip-wholeNumber}{n}) - \htmlId{tooltip-learningRate}{\mu}(- \frac{1}{\htmlId{tooltip-temperature}{T}}(\htmlId{tooltip-averageProb}{p_{ij}} - \htmlId{tooltip-2averageProb}{q_{ij}}))\]
We can now simplify the double negative:
\[\htmlId{tooltip-connectionWeight}{w}_{\htmlId{tooltip-iter}{i} \htmlId{tooltip-2iter}{j}}(\htmlId{tooltip-wholeNumber}{n} + 1) = \htmlId{tooltip-connectionWeight}{w}_{\htmlId{tooltip-iter}{i} \htmlId{tooltip-2iter}{j}}(\htmlId{tooltip-wholeNumber}{n}) + \htmlId{tooltip-learningRate}{\mu}( \frac{1}{\htmlId{tooltip-temperature}{T}}(\htmlId{tooltip-averageProb}{p_{ij}} - \htmlId{tooltip-2averageProb}{q_{ij}}))\]
Finally, we can absorb the \(\frac{1}{\htmlId{tooltip-temperature}{T}}\) term into the learning rate (replacing \( \htmlId{tooltip-learningRate}{\mu} \) with \(\htmlId{tooltip-learningRate}{\mu} \cdot \frac{1}{\htmlId{tooltip-temperature}{T}}\)). This gives us:
\[\htmlId{tooltip-connectionWeight}{w}_{\htmlId{tooltip-iter}{i} \htmlId{tooltip-2iter}{j}}(\htmlId{tooltip-wholeNumber}{n} + 1) = \htmlId{tooltip-connectionWeight}{w}_{\htmlId{tooltip-iter}{i} \htmlId{tooltip-2iter}{j}}(\htmlId{tooltip-wholeNumber}{n}) + \htmlId{tooltip-learningRate}{\mu}(\htmlId{tooltip-averageProb}{p_{ij}} - \htmlId{tooltip-2averageProb}{q_{ij}})\]
as required.
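To make the result concrete, here is one possible vectorized form of the final rule, applying the update to every weight at once. This is a sketch under our own assumptions (NumPy, with matrices \(P\) and \(Q\) holding the pairwise averages); it is not code from the original derivation.

```python
import numpy as np

def update_weights(W: np.ndarray, P: np.ndarray, Q: np.ndarray, mu: float) -> np.ndarray:
    """Apply w_ij(n+1) = w_ij(n) + mu * (p_ij - q_ij) to every weight.

    W: (num_units, num_units) weight matrix at step n.
    P: matrix of p_ij values estimated from the training samples.
    Q: matrix of q_ij values estimated in confabulation mode.
    """
    return W + mu * (P - Q)

# Hypothetical usage: a 3-unit machine with uniform pairwise averages.
W = np.zeros((3, 3))
P = np.full((3, 3), 0.7)
Q = np.full((3, 3), 0.4)
W = update_weights(W, P, Q, mu=0.1)  # every entry becomes roughly 0.03
```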
Let us now work through an example for a single weight update, using the equation:
\[\htmlId{tooltip-connectionWeight}{w}_{\htmlId{tooltip-iter}{i} \htmlId{tooltip-2iter}{j}}(\htmlId{tooltip-wholeNumber}{n} + 1) = \htmlId{tooltip-connectionWeight}{w}_{\htmlId{tooltip-iter}{i} \htmlId{tooltip-2iter}{j}}(\htmlId{tooltip-wholeNumber}{n}) + \htmlId{tooltip-learningRate}{\mu}(\htmlId{tooltip-averageProb}{p_{ij}} - \htmlId{tooltip-2averageProb}{q_{ij}})\]
We will say that \(\htmlId{tooltip-connectionWeight}{w}_{\htmlId{tooltip-iter}{i} \htmlId{tooltip-2iter}{j}}(\htmlId{tooltip-wholeNumber}{n}) = 0.5\), \(\htmlId{tooltip-learningRate}{\mu} = 0.1\), \(\htmlId{tooltip-averageProb}{p_{ij}} = 0.7\), and \(\htmlId{tooltip-2averageProb}{q_{ij}} = 0.4\).
Substituting these values in, we find:
\[\begin{align*}\htmlId{tooltip-connectionWeight}{w}_{\htmlId{tooltip-iter}{i} \htmlId{tooltip-2iter}{j}}(\htmlId{tooltip-wholeNumber}{n} + 1) &= 0.5 + 0.1(0.7 -0.4)\\&= 0.5 + 0.1(0.3)\\&= 0.5 +0.03\\&= 0.53\end{align*}\]
So the updated weight is \(\htmlId{tooltip-connectionWeight}{w}_{\htmlId{tooltip-iter}{i} \htmlId{tooltip-2iter}{j}}(\htmlId{tooltip-wholeNumber}{n} + 1) = 0.53\).
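The same arithmetic can be checked in a couple of lines of Python, reusing the example's values:

```python
w_n, mu, p_ij, q_ij = 0.5, 0.1, 0.7, 0.4
w_next = w_n + mu * (p_ij - q_ij)
print(round(w_next, 10))  # 0.53 (rounded to suppress floating-point noise)
```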