250908-2

Author:


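dropout: a regularization technique that randomly disables a fraction of neurons (or, in attention layers, attention weights) during training, so they temporarily contribute nothing to the output;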
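A minimal sketch of this behavior using `torch.nn.Dropout`. The tensor size, the drop probability `p=0.5`, and the seed are assumptions chosen only for illustration. In training mode PyTorch zeroes each element with probability `p` and scales the survivors by `1 / (1 - p)`; in evaluation mode the layer passes the input through unchanged.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)            # illustrative seed, only for repeatability

drop = nn.Dropout(p=0.5)        # p is the probability of zeroing an element
x = torch.ones(2, 8)            # illustrative input

drop.train()                    # training mode: dropout is active
y_train = drop(x)               # each entry is either 0 or 1 / (1 - p) = 2.0

drop.eval()                     # evaluation mode: dropout is a no-op
y_eval = drop(x)                # identical to x

print(y_train)
print(y_eval)
```

The rescaling of surviving elements keeps the expected value of each activation the same between training and evaluation.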

dim: the index of a nesting level in the data; dim=0 is the outermost level, dim=1 the next level in, and so on;

shape: the number of elements at each nesting level, listed from the outermost level (dim=0) inward;
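The relationship between dim and shape can be sketched with a small tensor. The sizes 2, 3, and 4 are assumptions picked for illustration. Reducing along a given dim removes that level of nesting from the shape.

```python
import torch

# A tensor with three nesting levels: 2 blocks, each with 3 rows of 4 values.
t = torch.zeros(2, 3, 4)

shape = t.shape                      # sizes per level: dim=0 -> 2, dim=1 -> 3, dim=2 -> 4
n_dims = t.dim()                     # number of nesting levels

sum0 = t.sum(dim=0).shape            # outermost level collapsed
sum1 = t.sum(dim=1).shape            # second level collapsed
sum_last = t.sum(dim=-1).shape       # dim=-1 refers to the innermost level

print(shape, n_dims, sum0, sum1, sum_last)
```

Negative indices count from the innermost level, so for this tensor `dim=-1` and `dim=2` refer to the same level.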

nn.Linear(): applies a linear transformation to the input data, y = xA^T + b, where x is the input, A is the weight matrix, and b is the bias vector.

With layer = nn.Linear(in_features=10, out_features=5, bias=True), A is layer.weight, with layer.weight.shape = torch.Size([5, 10]), and b is layer.bias, with layer.bias.shape = torch.Size([5]).
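A short sketch that checks these shapes and the formula y = xA^T + b directly. The batch size of 3 and the random input are assumptions for illustration; the layer arguments are the ones from the note.

```python
import torch
import torch.nn as nn

layer = nn.Linear(in_features=10, out_features=5, bias=True)

weight_shape = layer.weight.shape    # A: (out_features, in_features)
bias_shape = layer.bias.shape        # b: (out_features,)

x = torch.randn(3, 10)               # illustrative batch of 3 inputs, 10 features each
y = layer(x)                         # output has 5 features per input

# Recompute the same result by hand: y = x A^T + b
manual = x @ layer.weight.T + layer.bias
matches = torch.allclose(y, manual, atol=1e-6)

print(weight_shape, bias_shape, y.shape, matches)
```

The weight is stored as (out_features, in_features), which is why the formula uses the transpose A^T rather than A.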
