This section discusses temperature scaling in the context of deep learning.
Let’s consider a hypothetical set of logits before applying the softmax function. For simplicity, let’s take a vector with three logits: \((2, 5, 8)\). The softmax function is defined as:
\[ P_i = \frac{e^{z_i / \tau}}{\sum_{j=1}^{N} e^{z_j / \tau}} \]
where:

- \(z_i\) is the logit for class \(i\),
- \(N\) is the number of classes,
- \(\tau\) is the temperature.
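Before looking at concrete numbers, note the two limiting cases, which follow directly from the formula above:

\[ \lim_{\tau \to \infty} P_i = \frac{e^{0}}{\sum_{j=1}^{N} e^{0}} = \frac{1}{N}, \qquad \lim_{\tau \to 0^{+}} P_i = \begin{cases} 1 & \text{if } z_i = \max_j z_j, \\ 0 & \text{otherwise,} \end{cases} \]

so a very high temperature drives the distribution toward uniform, while a very low temperature drives it toward a one-hot (argmax) distribution. For the logits \((2, 5, 8)\), dividing by \(\tau = 5\) yields the scaled logits \((0.4, 1.0, 1.6)\), whereas \(\tau = 0.5\) yields \((4, 10, 16)\).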
Let’s examine how different temperatures affect the resulting softmax probabilities:
High Temperature (Smoothed Distribution):
- Temperature (\(\tau\)) is high, let’s say \(\tau = 5\).
- Softmax Probabilities: \(P_1, P_2, P_3\) where \(P_i\) is the probability for class \(i\).
```python
import torch

logits = [2, 5, 8]
tau = 5.0  # high temperature
softmax_probs = torch.nn.functional.softmax(torch.tensor(logits) / tau, dim=-1)
print(softmax_probs.numpy())
```
The output will be a set of probabilities where the distribution is “smoothed” due to the high temperature (approximately):

[0.1628, 0.2967, 0.5405]
Notice that the probabilities are more evenly distributed among the classes.
Low Temperature (Sharp Distribution):
- Temperature (\(\tau\)) is low, let’s say \(\tau = 0.5\).
- Softmax Probabilities: \(P_1, P_2, P_3\)
```python
tau = 0.5  # low temperature
softmax_probs = torch.nn.functional.softmax(torch.tensor(logits) / tau, dim=-1)
print(softmax_probs.numpy())
```
The output will be a set of probabilities where the distribution is “sharpened” due to the low temperature (approximately):

[0.0000061, 0.0024726, 0.9975212]
Notice that the probability for the class with the highest logit (class 3, with logit 8) is now much higher than the others.
```python
# pairwise similarities between ima_feat and text_feat -> softmax -> scaling
# scaling: self.logit_scale = self.clip_model.logit_scale.exp().detach(), i.e. e^(logit_scale)
# ^ this gives the softmax a temperature term in its formula
#   - high tau -> smoothed probs
#   - low tau  -> peaked probs
```
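The note above refers to CLIP-style scoring, where the exponentiated, learned logit_scale plays the role of \(1/\tau\): multiplying the cosine similarities by a large logit_scale is the same as using a small temperature. Below is a minimal sketch of that idea; ima_feat, text_feat, and the value of logit_scale are placeholder assumptions for illustration, not CLIP's actual implementation.

```python
import torch

torch.manual_seed(0)

# Placeholder encoder outputs: 4 image features and 3 text features, L2-normalized
ima_feat = torch.nn.functional.normalize(torch.randn(4, 512), dim=-1)
text_feat = torch.nn.functional.normalize(torch.randn(3, 512), dim=-1)

# Exponentiated learned scale (assumed value); acts as 1/tau, i.e. tau = 0.01 here
logit_scale = torch.tensor(100.0)

# Pairwise cosine similarities, scaled, then softmax over the text candidates
logits = logit_scale * ima_feat @ text_feat.T   # shape (4, 3)
probs = logits.softmax(dim=-1)                  # sharp distribution per image
print(probs)
```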
In these examples, adjusting the temperature trades off exploration (a smoother distribution) against exploitation (a sharper one). Higher temperatures push the probabilities toward a uniform distribution, while lower temperatures concentrate the mass on the most confident prediction. In practice the temperature is either set as a hyperparameter or learned (as with CLIP's logit_scale above) to control how peaked, and therefore how uncertain, the model's output distribution is; the short sweep below makes this concrete.
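The sketch below applies the same softmax to the same logits at several temperatures and prints how the distribution moves from nearly one-hot to nearly uniform; the specific \(\tau\) values are arbitrary choices for illustration.

```python
import torch

logits = torch.tensor([2.0, 5.0, 8.0])

# From very sharp (low tau) to very smooth (high tau)
for tau in [0.1, 0.5, 1.0, 5.0, 50.0]:
    probs = torch.nn.functional.softmax(logits / tau, dim=-1)
    print(f"tau={tau:5.1f} -> {probs.numpy().round(4)}")
```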