Once upon a time, I was trying to train a speaker recognition model on the TIMIT dataset. I used AlexNet since I wanted to try this with a smaller model first, with a softmax layer at the end. The inputs were spectrograms of the voices of different people, and the labels were the speaker IDs. I used PyTorch's MSELoss as the loss function. I left the model to train for hours, but to no avail. I was wondering why.

I checked the output from the model (the output from the softmax). For all the inputs I tried, the elements of the output array were all equal to each other. This was really annoying. It seemed that the model had not learned anything at all. So I set out to investigate. This article contains some of my findings about the softmax function. First, let's examine the softmax function.

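For an input vector x with n components, the softmax of the i-th component is

$$\mathrm{softmax}(x)_i = \frac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}}$$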

The equation above shows the softmax function of a vector x. As we can see, the softmax function contains exponential terms. The result of the exponential function can get very large as the input grows, so for sufficiently large inputs, overflow errors can occur! We therefore need to make sure the input does not get large enough to cause this. Here, by input I mean the input to the softmax function. So I tried to find when the exponential term gives an overflow error. The largest value without overflow turned out to be 709 (at least on my machine).

import numpy as np
value = 709
sm = np.exp(value)   # np.exp(710) overflows and returns inf with a warning

Note that this value could change from machine to machine and from library to library.

Next, I set out to explore how softmax behaves for large and small inputs. I created input arrays sampled from normal distributions with varying mean values, and then plotted the statistics after taking the softmax. The size of each input array was chosen to be 1000.

The code I used to do this is as follows (in Python):

import numpy as np
from numpy.random import normal

means_list = []
max_list = []
sd_list = []
x_axis = []
sm_list = []
for mean in range(0, 40000):
    mean = mean / 100                   # mean values from 0.00 to 399.99
    sd = mean / 10                      # standard deviation scales with the mean
    feature = normal(mean, sd, 1000)    # sample a 1000-element input vector
    sm = np.exp(feature) / np.sum(np.exp(feature))   # softmax of the vector
    sm_list.append(sm)
    means_list.append(sm.mean())
    max_list.append(sm.max())
    sd_list.append(sm.std())
    x_axis.append(mean)

The following figure shows the mean value of the softmax output plotted against the mean value of the input feature. As expected, it is constant: since the softmax outputs always sum to 1, their mean over 1000 elements is always 0.001. Well, that is good so far.

Now let’s plot the max value of softmax vs the mean value of the input feature vector.

We can see that for very small inputs, every element of the softmax output is 0.001 (I had to print the array values to see this). The input array had 1000 elements. It seems that for small inputs, softmax divides the output probability equally (1/1000) among the components, even though the elements of the input feature array are not equal.

Further insight can be gained by looking at the plot of the standard deviation of the softmax values against the mean value of the input feature vector.

The SD reaches 0 (meaning no variation among the probability values) when the input is small. Well, this is not good. Let's look at a numerical example.

feature = np.array([1.0, 5.0, 6.0, 2.0]) * 1e-8
sm = np.exp(feature) / np.sum(np.exp(feature))
print(sm)
# >> [0.24999999 0.25 0.25000001 0.25 ]

As we can see, inputs on the scale of 1e-8 cause softmax to output nearly identical values, making them useless. Well, this was what happened to my model.

And again, something awful happens when the inputs are very large. The max value of the softmax reaches 1, which means the other values must be close to 0. It can be seen that the SD value also plateaus. Now for large values,

feature = np.array([1.0, 5.0, 6.0, 2.0]) * 100
sm = np.exp(feature) / np.sum(np.exp(feature))
print(sm)
# >> [7.12457641e-218 3.72007598e-044 1.00000000e+000 1.91516960e-174]

Only the 3rd element is 1 in the softmax output. The others are almost zero.

We can see that softmax does not represent the input distribution well when the inputs are too large or too small. So if our model produces values in these ranges before the softmax, the model will not learn anything, because the softmax output is useless.

What can we do about this?

Scaling the input

One solution to these problems is to rescale the inputs before we send them to softmax.

# shift by the max and divide by the range, mapping the features into [-1, 0]
feature = (feature - feature.max()) / (feature.max() - feature.min())

After doing this, the plot of the max value of the softmax against the mean value of the input features looked like the one below.

It looks like those awkward values at very small and very large input values are gone now, which is good.

Using log-softmax

Sometimes taking the log of the softmax can make the operation more stable. The equation for log-softmax is simply the log of the softmax (obviously!). But doing this has some important implications.

On closer inspection, we can see that log-softmax can be converted to the following form.

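For the same vector x with n components:

$$\log \mathrm{softmax}(x)_i = \log \frac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}} = x_i - \log \sum_{j=1}^{n} e^{x_j}$$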

This simplifies things a lot. The second term on the right-hand side of the equation can be computed with the method commonly called the log-sum-exp trick. This prevents overflow and underflow errors, making log-softmax more stable than the bare softmax. Most libraries that calculate log-softmax use this log-sum-exp trick. For more about this, read this article:

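As a rough sketch of the trick (the helper name log_softmax_lse below is my own, not from any particular library), the idea is to subtract the maximum element before exponentiating; the shift keeps np.exp from overflowing and cancels out in the final result:

import numpy as np

def log_softmax_lse(x):
    # log-sum-exp trick: log(sum(exp(x))) = m + log(sum(exp(x - m))), with m = x.max()
    m = x.max()
    return x - (m + np.log(np.sum(np.exp(x - m))))

feature = np.array([1.0, 5.0, 6.0, 2.0]) * 100
print(log_softmax_lse(feature))   # finite values, no overflow warnings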

Let's take a look at the variation of the mean value of log-softmax as we change the mean value of its input.

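The statistics in the following plots can be reproduced by repeating the earlier sweep with a numerically stable log-softmax, roughly as follows (a sketch assuming the same 1000-element normal inputs as before and using scipy.special.logsumexp):

import numpy as np
from numpy.random import normal
from scipy.special import logsumexp

lsm_means, lsm_max, lsm_sd, x_axis = [], [], [], []
for mean in range(0, 40000):
    mean = mean / 100
    feature = normal(mean, mean / 10, 1000)   # same inputs as in the softmax sweep
    lsm = feature - logsumexp(feature)        # stable log-softmax
    lsm_means.append(lsm.mean())
    lsm_max.append(lsm.max())
    lsm_sd.append(lsm.std())
    x_axis.append(mean)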

Next, the max value of log-softmax.

日志softmax的下一个最大值

Then the standard deviation.

然后标准偏差

It looks like the standard deviation increases when we increase the mean input feature value. The standard deviation does not reach zero as the input feature mean increases, unlike in the earlier problematic cases.

In fact, in my experiments I saw that with log-softmax the model trained faster than with the min-max scaling. If you are using PyTorch, CrossEntropyLoss can be used, since it already contains log-softmax.

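A minimal sketch of that setup (the batch size of 8 and the class count of 630 TIMIT speakers are just illustrative assumptions): the model outputs raw logits with no softmax layer, and nn.CrossEntropyLoss applies log-softmax internally.

import torch
import torch.nn as nn

num_speakers = 630                        # illustrative class count
logits = torch.randn(8, num_speakers)     # raw model outputs, no softmax layer
labels = torch.randint(0, num_speakers, (8,))

criterion = nn.CrossEntropyLoss()         # log-softmax + negative log-likelihood
loss = criterion(logits, labels)
print(loss.item())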

For my Jupyter Notebook (with plots and all) go to

Takeaway

Sometimes softmax can be numerically unstable (giving overflow or underflow errors) or useless (all the outputs are the same, or weird). So if your model is reluctant to learn anything, it could be due to this. In that case, solve the problem with something like log-softmax. But some solutions may not work for a particular application, so we may have to experiment a bit.

Translated from: https://medium/swlh/are-you-messing-with-me-softmax-84397b19f399
