We want our model to fit the training data well (minimize training loss)

We want our model's performance to generalize to unseen data

- If our model does not fit the training data, we say it is underfitting

- There are two possible reasons for underfitting:
  - limited **learnability** (vanishing gradients, bad loss geometry)
  - limited **expressivity** (no set of parameters can express the function)

How can we improve **learnability**?

- change the model (e.g. introduce skip connections)
- change the initial conditions (parameter initializations)
- introduce stochasticity
- use normalization layers
- normalize the data
- change the objective function so that it optimizes not only for the task but also for learnability
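As an illustration of the first point (changing the model), here is a minimal sketch of a skip connection built with the Keras functional API; the layer sizes are arbitrary placeholders:

```python
import tensorflow as tf

# Hypothetical block: two dense layers plus a skip connection,
# so gradients can also flow around the nonlinear path.
inputs = tf.keras.Input(shape=(64,))
h = tf.keras.layers.Dense(64, activation="relu")(inputs)
h = tf.keras.layers.Dense(64)(h)
h = tf.keras.layers.Add()([h, inputs])  # the skip connection
outputs = tf.keras.layers.Activation("relu")(h)
model = tf.keras.Model(inputs, outputs)
```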

How can we improve **expressivity**?

- add more parameters to the model
- alternatively, re-use existing parameters in smart ways (e.g. convolutions)
- use a model with a more appropriate inductive bias
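Parameter re-use can be made concrete by comparing parameter counts on the same input; the sketch below (shapes chosen only for illustration) contrasts a dense layer, which learns one weight per input-output pair, with a convolution, which slides a small shared kernel over all spatial positions:

```python
import tensorflow as tf

# Dense: every input pixel gets its own weight per output unit.
dense = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28, 1)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(32),
])

# Conv2D: one 3x3 kernel per output channel, re-used everywhere.
conv = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28, 1)),
    tf.keras.layers.Conv2D(32, 3),
])

print(dense.count_params())  # 784*32 + 32 = 25120
print(conv.count_params())   # 3*3*1*32 + 32 = 320
```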

How do we know which of the two is the problem?

- We can't generally tell whether underfitting is caused by learnability or expressivity issues.

What to do about it?

- We can run **diagnostics** on our model to test for learning/optimization issues
  - analyze gradients and activations
  - inspect stability of training

- We can implement **best practices** to ensure learnability
  - sensible weight initialization
  - sensible learning rate and batch size
  - (adaptive) gradient clipping
  - normalization layers
  - skip connections
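A simple gradient diagnostic can be sketched with `tf.GradientTape`: run one forward/backward pass and inspect per-variable gradient norms. The model and data below are toy placeholders; vanishing or exploding gradients would show up as extreme norms:

```python
import tensorflow as tf

# Toy model and batch, purely for illustration.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(8,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1),
])
x = tf.random.normal((32, 8))
y = tf.random.normal((32, 1))

with tf.GradientTape() as tape:
    loss = tf.reduce_mean(tf.square(model(x) - y))
grads = tape.gradient(loss, model.trainable_variables)

# Norms near zero (or huge) hint at learnability problems.
for var, g in zip(model.trainable_variables, grads):
    print(var.name, float(tf.norm(g)))
```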

**Let us assume we already managed to fit the training data very well (or well enough)**

We speak of overfitting when our model performs badly on unseen data.

There are two kinds of unseen data:

- data from the training distribution (interpolation)
- data that is not likely to come from the training distribution (extrapolation)

Models are at least expected to handle **interpolation**, but often they fail at even that (overfitting!)

What happens if the model overfits the training data?

- model learns to predict based on the particular noise and spurious correlations seen in the training data

When does overfitting happen?

- very expressive models that have the capacity to memorize the training data
  - perfect memorization of training data leads to a loss of zero
- a loss landscape in which we can easily end up in such optima

What can we do about it?

- have simpler models with fewer parameters or stronger inductive biases
- change the loss landscape together with the optimizer


- Idea: the total Frobenius norm (Euclidean for matrices) is a proxy for effective model size

*Figure note: total Frobenius norm values are linearly scaled to match the min-max range of the validation loss.*
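The total Frobenius norm from the idea above can be computed directly from a model's trainable weights; a minimal sketch on a toy model:

```python
import tensorflow as tf

# Toy model, purely for illustration.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(8),
    tf.keras.layers.Dense(2),
])

# Sum of per-tensor Frobenius norms over all trainable weights.
total_norm = sum(float(tf.norm(w)) for w in model.trainable_variables)
print(total_norm)
```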

What can we do to prevent the Frobenius norm from becoming **too** large?

- add penalties to the loss function that push parameter magnitudes toward zero
- L1 loss: $L = \sum_i \lvert w_i \rvert$
- L2 loss: $L = \sum_i {w_i}^2$

The *kernel_regularizer* argument of tf.keras.layers.Layer objects adds a penalty of the specified strength on the layer's kernel weights.

```
tf.keras.layers.Conv2D(
    32, 3,
    activation="relu",
    kernel_regularizer=tf.keras.regularizers.L2(0.001),
)
```

Regularization losses are automatically written to self.losses in the tf.keras.Model class.
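A quick hedged check of this behavior: build a model with a regularized layer and inspect `model.losses` (the model below is a toy placeholder):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(16,)),
    tf.keras.layers.Dense(
        8, kernel_regularizer=tf.keras.regularizers.L2(0.001)
    ),
])
_ = model(tf.zeros((1, 16)))  # run once so the model is built

# One penalty term per regularized layer.
print(model.losses)
```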

- training data
  - get more and more diverse training data whenever possible

- data augmentation (adding random transformations under which the model predictions should be invariant)
  - done in the data pipeline with the map method and a callable augmentation model
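This pattern can be sketched with a small callable augmentation model applied via `Dataset.map`; the specific transformations and shapes below are illustrative choices:

```python
import tensorflow as tf

# Illustrative augmentation model: random flips and small rotations,
# transformations under which the labels should not change.
augment = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal"),
    tf.keras.layers.RandomRotation(0.05),
])

images = tf.random.uniform((10, 32, 32, 3))
labels = tf.zeros((10,), dtype=tf.int32)
ds = tf.data.Dataset.from_tensor_slices((images, labels)).batch(2)

# Apply augmentation in the input pipeline, training mode only.
ds = ds.map(lambda x, y: (augment(x, training=True), y))
```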

- randomly dropping units or feature maps during training ("dropout")
  - e.g. tf.keras.layers.Dropout(0.5)

- other forms of **adding stochasticity to the training**
  - smaller batch sizes
  - e.g. BatchNormalization layers (which primarily, and tremendously, help with learnability)
  - ...

- label smoothing
  - do not learn categorization with hard labels but with soft probability distributions
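In Keras this is available via the `label_smoothing` argument of the cross-entropy losses; a minimal sketch (smoothing strength and class probabilities chosen for illustration):

```python
import tensorflow as tf

# With smoothing 0.1 and 4 classes, the hard target [0, 1, 0, 0]
# becomes the soft target [0.025, 0.925, 0.025, 0.025] inside the loss.
loss_fn = tf.keras.losses.CategoricalCrossentropy(label_smoothing=0.1)

y_true = tf.constant([[0.0, 1.0, 0.0, 0.0]])
y_pred = tf.constant([[0.1, 0.7, 0.1, 0.1]])
print(float(loss_fn(y_true, y_pred)))
```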

- early stopping
  - do not continue learning when the validation loss is consistently increasing
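Early stopping is built into Keras as a callback; a hedged sketch (the patience value is an illustrative choice):

```python
import tensorflow as tf

# Stop when validation loss has not improved for 5 epochs,
# and restore the best weights seen so far.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",
    patience=5,
    restore_best_weights=True,
)
# Then pass it to training:
# model.fit(..., validation_data=..., callbacks=[early_stop])
```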

- overparameterization as regularization
  - very strong overparameterization can have a regularizing effect
  - this phenomenon is called double descent
    - first the generalization performance improves as the model complexity increases
    - then it suddenly decreases
    - then (for very large models) it increases again
  - the reason for this is the smoother geometry of high-dimensional loss landscapes

- choice of optimizer
  - similar in spirit to using massive overparameterization
  - introduces an inductive bias into the optimization to favor smooth, broad local optima over sharp optima

- overfitting is a massive problem, especially when training data is limited
- training models with more (diverse) data is always preferable
- there are many options for regularization that we can and should readily use
- most of these methods also have an effect on learnability and expressivity
- learnability, expressivity and generalization all interact in complex ways
- sharpness in loss landscapes is undesirable