Risk Minimization for MLPs

Prerequisites

Description

Like any machine learning model, a multi-layer perceptron (MLP) has an associated risk that we seek to minimize. Since, as before, the joint distribution of inputs and outputs is generally unknown, the risk cannot be evaluated directly, so the empirical risk over a finite dataset is minimized instead.

Equation

\[\htmlId{tooltip-weightVector}{\theta}_\text{opt} = \argmin_{\htmlId{tooltip-weightVector}{\theta} \in \htmlId{tooltip-parameterSpace}{\Theta}} \frac{1}{N} \sum_{i=1}^{N} \htmlId{tooltip-loss}{L}(\htmlId{tooltip-network}{\mathcal{N}}_{\htmlId{tooltip-weightVector}{\theta}}(\htmlId{tooltip-input}{u}_i), \htmlId{tooltip-groundTruth}{y}_i)\]
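
As an illustration of the objective above, the following minimal sketch evaluates the empirical risk of a small MLP on synthetic data. It assumes PyTorch purely for convenience; the architecture, the synthetic data, and the choice of mean squared error as the loss \( \htmlId{tooltip-loss}{L} \) are arbitrary examples, not part of the definition.

```python
# Minimal sketch (assumes PyTorch): evaluate the empirical risk
# (1/N) * sum_i L(N_theta(u_i), y_i) of a small MLP N_theta on a dataset.
import torch
import torch.nn as nn

# A small multi-layer perceptron N_theta; the architecture is arbitrary.
mlp = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 1))

# Synthetic data standing in for the samples (u_i, y_i), i = 1..N.
N = 128
u = torch.randn(N, 4)
y = torch.randn(N, 1)

# Loss function L; mean squared error is used here as an example.
loss_fn = nn.MSELoss(reduction="mean")  # the mean implements the 1/N factor

with torch.no_grad():
    empirical_risk = loss_fn(mlp(u), y)
print(float(empirical_risk))
```

The equation above asks for the weights \( \htmlId{tooltip-weightVector}{\theta} \) that make exactly this quantity as small as possible.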

Symbols Used

\(\mathcal{N}\)

This is the symbol used for a function approximator, typically a neural network.

\(y\)

This symbol stands for the ground truth of a sample. In supervised learning this is often paired with the corresponding input.

\(\Theta\)

This is the symbol for the set of all possible model parameters \( \htmlId{tooltip-weightVector}{\theta} \).

\(\theta\)

This is the symbol we use for model weights/parameters.

\(L\)

This is the symbol for a loss function. It is a function that quantifies how far a model's prediction deviates from the ground truth it should produce.

\(u\)

This symbol denotes the input of a model.

Derivation

  1. Consider the empirical risk of some model \(\htmlId{tooltip-model}{h}\):
    \[\htmlId{tooltip-risk}{R}^{\text{emp}}(\htmlId{tooltip-model}{h}) = \frac{1}{N} \sum_{i=1}^{N} \htmlId{tooltip-loss}{L}(\htmlId{tooltip-model}{h}(\htmlId{tooltip-input}{u}_i), \htmlId{tooltip-groundTruth}{y}_i)\]
  2. Minimizing the empirical risk over all models \( \htmlId{tooltip-model}{h} \) in the hypothesis space \( \htmlId{tooltip-hypothesisSpace}{\mathcal{H}} \) yields the optimal model \( \htmlId{tooltip-optimalModel}{\hat{f}} \):
    \[\htmlId{tooltip-optimalModel}{\hat{f}} = \htmlId{tooltip-model}{h}_\text{opt} = \argmin_{\htmlId{tooltip-model}{h} \in \htmlId{tooltip-hypothesisSpace}{\mathcal{H}}} \frac{1}{N} \sum_{i=1}^{N} \htmlId{tooltip-loss}{L}(\htmlId{tooltip-model}{h}(\htmlId{tooltip-input}{u}_i), \htmlId{tooltip-groundTruth}{y}_i)\]
  3. Consider a model in the form of a neural network \( \htmlId{tooltip-network}{\mathcal{N}}_{\htmlId{tooltip-weightVector}{\theta}} \) parametrized by the weights/parameters \( \htmlId{tooltip-weightVector}{\theta} \).
  4. Finding the optimal model corresponds to finding the optimal weights \( \htmlId{tooltip-weightVector}{\theta}_\text{opt} \).
  5. Replacing \( \htmlId{tooltip-model}{h} \) with \( \htmlId{tooltip-network}{\mathcal{N}}_{\htmlId{tooltip-weightVector}{\theta}} \), and the optimization target \( \htmlId{tooltip-optimalModel}{\hat{f}} \) with the optimal weights \( \htmlId{tooltip-weightVector}{\theta}_\text{opt} \), we get:
    \[ \htmlId{tooltip-weightVector}{\theta}_\text{opt} = \argmin_{\htmlId{tooltip-weightVector}{\theta} \in \htmlId{tooltip-parameterSpace}{\Theta}} \frac{1}{N} \sum_{i=1}^{N} \htmlId{tooltip-loss}{L}(\htmlId{tooltip-network}{\mathcal{N}}_{\htmlId{tooltip-weightVector}{\theta}}(\htmlId{tooltip-input}{u}_i), \htmlId{tooltip-groundTruth}{y}_i) \]
    as required.
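
The derivation defines \( \htmlId{tooltip-weightVector}{\theta}_\text{opt} \), but in practice the minimization over \( \htmlId{tooltip-parameterSpace}{\Theta} \) is almost never available in closed form. One common approach is to approximate it with (stochastic) gradient descent on the empirical risk, as in the sketch below. It again assumes PyTorch; the architecture, learning rate, and number of steps are arbitrary illustrations rather than part of the derivation.

```python
# Minimal sketch (assumes PyTorch): approximate theta_opt by running plain
# gradient descent on the empirical risk of an MLP N_theta.
import torch
import torch.nn as nn

mlp = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 1))  # N_theta
loss_fn = nn.MSELoss()                                   # loss L (example choice)
optimizer = torch.optim.SGD(mlp.parameters(), lr=1e-2)   # updates theta

# Synthetic samples (u_i, y_i); in practice these come from the training set.
u = torch.randn(128, 4)
y = torch.randn(128, 1)

for step in range(500):
    optimizer.zero_grad()
    risk = loss_fn(mlp(u), y)   # empirical risk (1/N) * sum_i L(N_theta(u_i), y_i)
    risk.backward()             # gradient of the empirical risk w.r.t. theta
    optimizer.step()            # gradient-descent update of theta

# mlp.parameters() now holds an approximation of theta_opt.
```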

Note: Other terms, such as a regularization term, can be added to the formulation in a similar way: see Loss Minimization with Regularization.
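
As an illustration only (the weighting factor \( \lambda \) and regularizer \( \Omega \) are not defined on this page and are shown purely as an example), such an extension might take the form:

\[\htmlId{tooltip-weightVector}{\theta}_\text{opt} = \argmin_{\htmlId{tooltip-weightVector}{\theta} \in \htmlId{tooltip-parameterSpace}{\Theta}} \left( \frac{1}{N} \sum_{i=1}^{N} \htmlId{tooltip-loss}{L}(\htmlId{tooltip-network}{\mathcal{N}}_{\htmlId{tooltip-weightVector}{\theta}}(\htmlId{tooltip-input}{u}_i), \htmlId{tooltip-groundTruth}{y}_i) + \lambda \, \Omega(\htmlId{tooltip-weightVector}{\theta}) \right)\]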
