Commentary on the Neural Network Problems from the 2024 Alibaba Global Mathematics Competition Finals (Applied and Computational Mathematics)

\( \newcommand{\trsp}{\mathrm{T}} \newcommand{\spr}{\mathbb{R}} \newcommand{\Tr}{\mathrm{Tr}} \newcommand{\expe}[2]{\mathbb{E}_{ #1}\begin{bmatrix}#2\end{bmatrix}} \newcommand{\Var}{\mathrm{Var}} \newcommand\defeq{\equiv} \newcommand{\prob}[1]{\mathbb{P}\begin{bmatrix}#1\end{bmatrix}} \newcommand\rvw{\boldsymbol{w}} \newcommand\rvx{\boldsymbol{x}} \)

Problem 2

Suppose $F(x;w)$ is a deep neural network with scalar output, where $x$ is the input and $w$ denotes the weights. Assume $F$ is continuously differentiable in $w$ and is overparametrized for the training data $\{x_j,y_j\}_{j=1}^m$: that is, there exists $w^*$ such that $F(x_j;w^*)=y_j$ for all $j$. To study the local optimization dynamics of training the network near $w^*$, we consider the linearized network $\widetilde{F}(x;w)=F(x;w^*)+(w-w^*)^\trsp\nabla F(x;w^*)$, with loss function:
\begin{equation}
\text{Loss}(w) = \frac{1}{2m}\sum_{j=1}^m (y_j-\widetilde{F}(x_j;w))^2.
\end{equation}
Let $\eta$ denote the learning rate. Gradient descent is $w_{i+1} = w_i - \eta\nabla\text{Loss}(w_i)$, while stochastic gradient descent is $w_{i+1} = w_i - \eta(\nabla\text{Loss}(w_i) + \epsilon_i)$, where the noise term $\epsilon_i$ satisfies $\expe{}{\epsilon_i}=0$ and $\expe{}{\epsilon_i\epsilon_i^\trsp}=M(w_i)/b$, with $b$ the mini-batch size. Assume the covariance matrix $M$ is aligned with
\begin{equation}
\Sigma = \frac1m \sum_{j=1}^m\nabla F(x_j;w^*)\nabla F(x_j;w^*)^\trsp,
\end{equation}

in the following sense:
\begin{equation}
\frac{\Tr(M(w)\Sigma)}{2\text{Loss}(w)|\Sigma|_F^2}\geq \delta,
\end{equation}

for some $\delta >0$ and all $w$. Here $|\cdot|_F$ denotes the Frobenius norm.

  1. For gradient descent, show that if the spectral norm of $\Sigma$ satisfies:
    \begin{equation}
    |\Sigma|_2 \leq \frac{2}{\eta},
    \end{equation}
    then gradient descent is locally stable (i.e., $\text{Loss}(w_i)$ is bounded for all $i$). (Note that this implies a dimension-dependent bound $|\Sigma|_F \leq \frac{2\sqrt{d}}{\eta}$, where $d$ is the dimension of $w$.)
  2. For stochastic gradient descent, show that if $\expe{}{\text{Loss}(w_i)}$ is bounded for all $i$, then the following dimension-independent inequality must hold:
    \begin{equation}
    |\Sigma|_F \leq \frac{\sqrt{b/\delta}}{\eta}.
    \end{equation}

Proof of Problem 2

  1. Note that the loss function can be rewritten as:
    \begin{aligned}
    \text{Loss}(w) &= \frac{1}{2m}\sum_{j=1}^m (y_j-\widetilde{F}(x_j;w))^2\\
    & = \frac{1}{2m}\sum_{j=1}^m (y_j-F(x_j;w^*)-(w-w^*)^\trsp\nabla F(x_j;w^*))^2\\
    & = \frac{1}{2m}\sum_{j=1}^m ((w-w^*)^\trsp\nabla F(x_j;w^*))^2.
    \end{aligned}

It follows immediately that:
\begin{equation}
\nabla\text{Loss}(w) = \frac{1}{2m}\sum_{j=1}^m 2\nabla F(x_j;w^*)\nabla F(x_j;w^*)^\trsp(w-w^*) = \Sigma(w-w^*).
\end{equation}
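This identity is easy to check numerically; below is a minimal Python sketch (my own illustration) in which synthetic random vectors `g_j` stand in for $\nabla F(x_j;w^*)$, and a central finite-difference gradient of the loss is compared against $\Sigma(w-w^*)$.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 30, 8
G = rng.normal(size=(m, d))                 # rows g_j stand in for grad F(x_j; w*)
Sigma = G.T @ G / m
w_star = rng.normal(size=d)
w = rng.normal(size=d)

def loss(w):
    """Loss(w) = (1/2m) * sum_j ((w - w*)^T g_j)^2 for the linearized network."""
    return 0.5 / m * np.sum((G @ (w - w_star)) ** 2)

# Central finite differences vs. the closed form Sigma (w - w*)
h = 1e-6
fd_grad = np.array([(loss(w + h * e) - loss(w - h * e)) / (2 * h) for e in np.eye(d)])
print(np.allclose(fd_grad, Sigma @ (w - w_star), atol=1e-6))   # expect True
```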
For gradient descent:
\begin{aligned}
w_{i+1} - w^* &= w_i - \eta\nabla\text{Loss}(w_i) -w^* = w_i - \eta\Sigma(w_i-w^*) -w^*\\
& = (I-\eta\Sigma)(w_i-w^*),
\end{aligned}
where $I$ is the identity matrix. Note that $\Sigma$ is symmetric positive semidefinite (it is the empirical Fisher/Gram matrix of the gradients), so under the condition $|\Sigma|_2 \leq \frac{2}{\eta}$ every eigenvalue of $\Sigma$ lies in $[0, 2/\eta]$, every eigenvalue of $I-\eta\Sigma$ lies in $[-1,1]$, and hence:
\begin{equation}
|I-\eta\Sigma|_2 \leq 1.
\end{equation}
Iterating this gives:
\begin{equation}
|w_{i+1} - w^*|_2 \leq |w_{i} - w^*|_2 \leq \cdots \leq |w_0 - w^*|_2.
\end{equation}
Consequently, for all $i \in\mathbb{N}$:
\begin{aligned}
\text{Loss}(w_i) &= \frac{1}{2m}\sum_{j=1}^m ((w_i-w^*)^\trsp\nabla F(x_j;w^*))^2\\
& = \frac12 (w_i-w^*)^\trsp\Sigma(w_i-w^*) \leq \frac12 |w_i-w^*|_2^2\,|\Sigma|_F\\
& \leq \frac12 \cdot\frac{2\sqrt{d}}{\eta}\,|w_0-w^*|_2^2 = \frac{\sqrt{d}}{\eta}|w_0-w^*|_2^2,
\end{aligned}
where $d$ is the dimension of $w$. Hence gradient descent is locally stable.
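As a numerical illustration of the stability threshold, the following minimal Python sketch runs gradient descent on the quadratic loss $\frac12(w-w^*)^\trsp\Sigma(w-w^*)$ for a synthetic $\Sigma$ rescaled to different spectral norms; the loss stays bounded when $|\Sigma|_2\leq 2/\eta$ and blows up once the threshold is crossed. All sizes and constants here (`m`, `d`, `eta`) are arbitrary choices for illustration, not values from the problem.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, eta = 50, 20, 0.1
G = rng.normal(size=(m, d))                     # synthetic stand-ins for grad F(x_j; w*)
Sigma = G.T @ G / m
Sigma /= np.linalg.eigvalsh(Sigma).max()        # normalize so that |Sigma|_2 = 1

def max_loss(spectral_norm, steps=200):
    """Run GD on Loss(w) = 0.5 (w-w*)^T S (w-w*) with S rescaled to the given |S|_2."""
    S = spectral_norm * Sigma
    z = rng.normal(size=d)                      # z_0 = w_0 - w*
    worst = 0.0
    for _ in range(steps):
        worst = max(worst, 0.5 * z @ S @ z)
        z = z - eta * (S @ z)                   # z_{i+1} = (I - eta S) z_i
    return worst

for s in [1.0 / eta, 1.9 / eta, 2.1 / eta]:     # spectral norms below and above 2/eta
    print(f"|Sigma|_2 = {s:5.1f} (eta*|Sigma|_2 = {eta * s:.2f}): "
          f"max Loss over 200 steps = {max_loss(s):.3e}")
```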

  2. Write $z_i = w_i - w^*$ and examine the expected loss under stochastic gradient descent:
    \begin{aligned}
    \expe{}{\text{Loss}(w_{i+1})} &= \expe{}{\frac12 z_{i+1}^\trsp\Sigma z_{i+1}} = \expe{}{\frac12 (z_i -\eta\nabla\text{Loss}(w_i)-\eta \varepsilon_i )^\trsp\Sigma (z_i -\eta\nabla\text{Loss}(w_i)-\eta \varepsilon_i )}\\
    & = \expe{}{\frac12 z_i^\trsp\Sigma z_i - \eta z_i^\trsp\Sigma(\nabla\text{Loss}(w_i) +\varepsilon_i) + \frac12\eta^2(\nabla\text{Loss}(w_i) +\varepsilon_i)^\trsp\Sigma(\nabla\text{Loss}(w_i) +\varepsilon_i)}\\
    & = \expe{}{\frac12 z_i^\trsp\Sigma z_i - \eta z_i^\trsp\Sigma^2z_i - \underbrace{\eta z_i^\trsp\Sigma\varepsilon_i}_{\expe{}{\varepsilon_i}=0} + \frac12\eta^2(\Sigma z_i+\varepsilon_i)^\trsp\Sigma(\Sigma z_i+\varepsilon_i)}\\
    & = \expe{}{\frac12 z_i^\trsp\Sigma z_i - \eta z_i^\trsp\Sigma^2z_i} + \frac12\eta^2\expe{}{z_i^\trsp\Sigma^3z_i + \varepsilon_i^\trsp\Sigma\varepsilon_i + \underbrace{2\varepsilon_i^\trsp\Sigma^2z_i}_{\expe{}{\varepsilon_i}=0}}\\
    &\overset{*}{=} \expe{}{\frac12 z_i^\trsp\Sigma z_i - \eta z_i^\trsp\Sigma^2z_i + \frac12\eta^2z_i^\trsp\Sigma^3z_i} + \frac12\eta^2\expe{}{\Tr(\Var[\varepsilon_i]\Sigma)}\\
    & \overset{\Var[\varepsilon_i] = M(w_i)/b}{=} \expe{}{\frac12 z_i^\trsp\Sigma z_i - \eta z_i^\trsp\Sigma^2z_i + \frac12\eta^2z_i^\trsp\Sigma^3z_i} + \frac{\eta^2}{2b}\expe{}{\Tr(M(w_i)\Sigma)}\\
    & \defeq \expe{}{s(z_i)\text{Loss}(w_{i})} + \frac{\eta^2}{2b}\expe{}{\Tr(M(w_i)\Sigma)}
    \end{aligned}

Here the step marked $*$ uses the identity $\expe{}{x^\trsp\Sigma x}=\Tr(\Sigma\Var[x])+\expe{}{x}^\trsp\Sigma\expe{}{x}$ (applied to $\varepsilon_i$ conditionally on $w_i$), and $s(z) \defeq 1 -2\eta\frac{z^\trsp\Sigma^2z}{z^\trsp\Sigma z} + \eta^2\frac{z^\trsp\Sigma^3z}{z^\trsp\Sigma z}$.

Now apply a Rayleigh-quotient normalization: let
\begin{equation}
x = \frac{\Sigma^{\frac12}z}{|\Sigma^{\frac12}z|_2},
\end{equation}
Denote the eigenvalues of $\Sigma$ by $\lambda_j \geq 0$ and write $x = \sum_j c_j v_j$ in an orthonormal eigenbasis $\{v_j\}$ of $\Sigma$, so that $\sum_j c_j^2 = |x|_2^2 = 1$. Then:
\begin{aligned}
s(z) & = 1 -2\eta x^\trsp\Sigma x+\eta^2x^\trsp\Sigma^2 x \\
& = 1 + \sum_j c_j^2(-2\eta\lambda_j + \eta^2\lambda_j^2) \\
& = \sum_j c_j^2(1-\eta\lambda_j)^2 \geq 0.
\end{aligned}
Therefore:
\begin{aligned}
\expe{}{\text{Loss}(w_{i+1})} & = \expe{}{s(z_i)\text{Loss}(w_{i})} + \frac{\eta^2}{2b}\expe{}{\Tr(M(w_i)\Sigma)}\\
&\geq \frac{\eta^2}{2b}\expe{}{\Tr(M(w_i)\Sigma)}\\
&\overset{\text{alignment property}}{\geq } \frac{\eta^2}{2b}\cdot 2\delta|\Sigma|_F^2\,\expe{}{\text{Loss}(w_{i})} = \frac{\eta^2\delta}{b}|\Sigma|_F^2\,\expe{}{\text{Loss}(w_{i})}.
\end{aligned}
Iterating this bound gives $\expe{}{\text{Loss}(w_{i})} \geq \left(\frac{\eta^2\delta}{b}|\Sigma|_F^2\right)^{i}\text{Loss}(w_0)$. If $\frac{\eta^2\delta}{b}|\Sigma|_F^2 > 1$, the expected loss would grow geometrically whenever $\text{Loss}(w_0)>0$, contradicting the assumption that $\expe{}{\text{Loss}(w_i)}$ is bounded for all $i$. Hence:
\begin{aligned}
\frac{\eta^2\delta}{b} |\Sigma|_F^2 &\leq 1\\
|\Sigma|_F &\leq \frac{\sqrt{b/\delta}}{\eta}.
\end{aligned}
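The necessity of this bound can also be seen numerically. In the sketch below I assume a specific noise model, $M(w)=2\,\text{Loss}(w)\,\Sigma$, which satisfies the alignment condition with $\delta=1$ (this choice is mine, not given by the problem), so the bound becomes $|\Sigma|_F\leq\sqrt{b}/\eta$. With $\Sigma$ scaled above the threshold the averaged loss grows rapidly, exactly as the recursion predicts; for the smaller scale chosen below it stays bounded in this particular example (the theorem itself only asserts the necessary direction).

```python
import numpy as np

rng = np.random.default_rng(1)
m, d, eta, b = 50, 20, 0.1, 1
G = rng.normal(size=(m, d))
Sigma0 = G.T @ G / m
Sigma0 /= np.linalg.norm(Sigma0)                # normalize so that |Sigma0|_F = 1

def mean_losses(fro_norm, steps=60, runs=2000):
    """Average Loss(w_i) over SGD runs, assuming the noise model M(w) = 2 Loss(w) Sigma."""
    S = fro_norm * Sigma0                       # |S|_F = fro_norm
    evals, evecs = np.linalg.eigh(S)
    sqrt_S = evecs @ np.diag(np.sqrt(np.clip(evals, 0.0, None))) @ evecs.T
    Z = rng.normal(size=(runs, d))              # rows are z_0 = w_0 - w*, one per run
    out = []
    for i in range(steps + 1):
        losses = 0.5 * np.einsum('rj,jk,rk->r', Z, S, Z)
        if i % 20 == 0:
            out.append(losses.mean())
        xi = rng.normal(size=(runs, d))
        eps = np.sqrt(2.0 * losses / b)[:, None] * (xi @ sqrt_S)   # eps_i ~ N(0, 2 Loss S / b)
        Z = Z - eta * (Z @ S) - eta * eps       # z_{i+1} = (I - eta S) z_i - eta eps_i
    return out

for fro in [0.3 / eta, 1.3 / eta]:              # below vs. above sqrt(b/delta)/eta = 1/eta
    vals = ", ".join(f"{v:.3g}" for v in mean_losses(fro))
    print(f"eta*|Sigma|_F = {eta * fro:.1f}: E[Loss] at steps 0/20/40/60 = {vals}")
```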

Problem 6

Studying the scaling laws of large models is essential for reducing their training cost: how does the final test loss change with the number of training steps and the model size? In this problem we study scaling laws for training linear models.

  1. In this part, we consider learning a one-dimensional linear model with gradient descent.
  • Define the data distribution $\mathcal{D}$ over $\spr^2$: each data point is a pair $(x,y)$ of input and output, with $x\sim\mathcal{N}(0,1)$ and $y\sim\mathcal{N}(3x,1)$.
  • Learn the linear model $f_{w}(x)=w\cdot x$, where $w,x\in\spr$, by gradient descent. Initialize $w_0=0$ and iterate: at each step, sample $(x_t,y_t)$ from $\mathcal{D}$, then update $w_t$ via $w_{t+1}\gets w_t - \eta\nabla\ell_t(w_t)$, where $\ell_t = \frac12(f_w(x_t)-y_t)^2$ is the squared loss and $\eta>0$ is the learning rate.

Suppose the learning rate satisfies $\eta \in (0,\frac13]$. Compute the expected test loss after $T\geq 0$ iterations:
\begin{equation}
\bar{\mathcal{L}}_{\eta,T} = \expe{w_T}{\expe{(x,y)\sim\mathcal{D}}{\frac12(f_{w_T}(x)-y)^2}}
\end{equation}

  2. Now, still in the setting of part 1, suppose the learning rate $\eta$ is tuned optimally. Find a function $g(T)$ such that the following holds as $T\to +\infty$:
    \begin{equation}
    \left|\inf_{\eta\in (0,\frac13]} \bar{\mathcal{L}}_{\eta,T} -g(T) \right|=O\left(\frac{(\log T)^2}{T^2}\right).
    \end{equation}
  3. A frequently observed empirical phenomenon is that the pretraining of large language models roughly follows the Chinchilla scaling law:
    \begin{equation}
    \bar{\mathcal{L}}_{N,T} \approx \frac{A}{N^\alpha}+\frac{B}{T^\beta} +C,
    \end{equation}
    where $\bar{\mathcal{L}}_{N,T}$ is the expected test loss of a model with $N$ parameters after $T$ training steps, and $A,B,\alpha,\beta,C$ are constants. We now give an example of training a multi-dimensional linear model that obeys a similar scaling law.
  • Fix $a>0$, $b\geq 1$. Each data point $(x_{\bullet},y)$ consists of an input and an output, where the input $x_{\bullet}$ is an infinite-dimensional vector (which can be viewed as a sequence) and the output satisfies $y\in\spr$. Define the data distribution $\mathcal{D}$ as follows. First, sample $k$ from a Zipf distribution, $\prob{k=i}\propto i^{-(a+1)}$ $(i\geq 1)$, and let $j=\lceil k^b \rceil$. Then sample the $j$-th coordinate $x_j$ of $x_{\bullet}$ from $\mathcal{N}(0,1)$ and set all other coordinates to $0$. Finally, $y\sim\mathcal{N}(3x_j,1)$. The resulting distribution of $(x_{\bullet},y)$ is the data distribution $\mathcal{D}$.
  • We study a linear model that only attends to the first $N$ input coordinates. Define $\phi_N(x_{\bullet})=(x_1,\cdots,x_N)$. The linear model has parameters $\rvw\in\spr^N$ and output $f_{\rvw}(x_{\bullet})=\langle \rvw,\phi_N(x_{\bullet}) \rangle$.
  • We learn this linear model with gradient descent: initialize $\rvw_0 = \boldsymbol{0}$ and iterate. At each step, sample $(x_{t,\bullet},y_t)$ from $\mathcal{D}$, then update $\rvw_t$ via $\rvw_{t+1}\gets \rvw_t - \eta\nabla\ell_t(\rvw_t)$, where $\ell_t = \frac12(f_\rvw(x_{t,\bullet})-y_t)^2$.

Let $\bar{\mathcal{L}}_{\eta,N,T} = \expe{\rvw_T}{\expe{(\rvx,y)\sim\mathcal{D}}{\frac12(f_{\rvw_T}(\rvx)-y)^2}}$ be the expected test loss after training the $N$-parameter linear model for $T\geq 0$ steps with learning rate $\eta \in (0,\frac13]$.

Find $\alpha,\beta,C$ such that there exists $\gamma >0$ with the following property: for every $c>0$, when $T = N^{c+o(1)}$ and $N$ is sufficiently large,
\begin{equation}
\epsilon(N,T)=\frac{\inf_{\eta \in (0,\frac13]}\bar{\mathcal{L}}_{\eta,N,T} -C}{\frac{1}{N^\alpha}+\frac{1}{T^\beta}}, (\log N + \log T)^{-\gamma} \leq \epsilon(N,T) \leq (\log N + \log T)^{\gamma}.
\end{equation}
That is, $\inf_{\eta \in (0,\frac13]}\bar{\mathcal{L}}_{\eta,N,T} = \widetilde{\Theta}(N^{-\alpha}+T^{-\beta})+C$, where $\widetilde{\Theta}$ hides arbitrary polynomial factors in $\log N$ and $\log T$.
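Before turning to the solution, a small Python sketch of the data-generating process in part 3 may help make the setup concrete. The truncation `max_k` of the Zipf support and all parameter values here are implementation conveniences of mine, not part of the problem; the sampler returns the sparse representation (active coordinate $j$, its value $x_j$, and $y$).

```python
import numpy as np

def sample_batch(a, b, n, max_k=10_000, seed=0):
    """Sample n points from D: P(k=i) ~ i^{-(a+1)}, j = ceil(k^b),
    x_j ~ N(0,1), all other coordinates of x are 0, and y ~ N(3 x_j, 1)."""
    rng = np.random.default_rng(seed)
    ks = np.arange(1, max_k + 1)                      # truncated Zipf support (illustration only)
    probs = ks.astype(float) ** (-(a + 1.0))
    probs /= probs.sum()
    k = rng.choice(ks, size=n, p=probs)
    j = np.ceil(k.astype(float) ** b).astype(int)     # index of the single nonzero coordinate
    xj = rng.normal(size=n)
    y = rng.normal(loc=3.0 * xj, scale=1.0)
    return j, xj, y

j, xj, y = sample_batch(a=1.0, b=2.0, n=5)
print(list(zip(j.tolist(), np.round(xj, 2).tolist(), np.round(y, 2).tolist())))
```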

Proof of Problem 6

Note that $y=3x+\varepsilon$ with $\varepsilon\sim\mathcal{N}(0,1)$ independent of $x$, so:
\begin{aligned}
\bar{\mathcal{L}}_{\eta,T} &= \expe{w_T}{\expe{(x,y)\sim\mathcal{D}}{\frac12(f_{w_T}(x)-y)^2}}\\
& = \expe{w_T}{\expe{x\sim\mathcal{N}(0,1),\varepsilon\sim\mathcal{N}(0,1)}{\frac12(w_Tx-(3x+\varepsilon))^2}}\\
& = \expe{w_T}{\frac12(w_T-3)^2\expe{x\sim\mathcal{N}(0,1)}{x^2}+\frac12\expe{\varepsilon\sim\mathcal{N}(0,1)}{\varepsilon^2}}\\
& = \frac12 \expe{w_T}{(w_T-3)^2}+\frac12.
\end{aligned}
The cross term vanishes because $x$ and $\varepsilon$ are independent with zero mean. The problem therefore reduces to computing the first and second moments of $w_T$. Consider the gradient-descent update:
\begin{equation}
w_{t+1} = w_t - \eta\nabla\ell_t(w_t) = w_t - \eta(w_tx_t^2-x_ty_t).
\end{equation}
Taking expectations, the first moment satisfies:
\begin{aligned}
\expe{}{w_{t+1}} &= \expe{}{w_t} - \eta (\expe{}{w_t}\expe{}{x_t^2}-\expe{}{x_ty_t})\\
& = \expe{}{w_t} - \eta \expe{}{w_t} + 3\eta \\
& = (1-\eta)\expe{}{w_t} + 3\eta.
\end{aligned}
That is, $\expe{}{w_{t+1}} - 3 = (1-\eta)\expe{}{w_t} + 3\eta - 3 = (1-\eta)(\expe{}{w_t} - 3)$, and unrolling the recursion gives $\expe{}{w_T} = (1-\eta)^T(\expe{}{w_0}-3)+3 = 3\left(1-(1-\eta)^T\right)$.

Similarly, for the second moment:
\begin{aligned}
\expe{}{w_{t+1}^2} &= \expe{}{w_t^2} - 2\eta (\expe{}{w_t^2}\expe{}{x_t^2}-\expe{}{w_t}\expe{}{x_ty_t}) \\
\,\,\, &+ \eta^2\left(\expe{}{w_t^2}\expe{}{x_t^4} -2 \expe{}{w_t}\expe{}{x_t^3y_t} + \expe{}{x_t^2y_t^2}\right)\\
& = \expe{}{w_t^2} - 2\eta(\expe{}{w_t^2}-3\expe{}{w_t}) + \eta^2(3\expe{}{w_t^2} -18 \expe{}{w_t} + 28)\\
& = (1-2\eta+3\eta^2)\expe{}{w_t^2} - (18\eta^2-6\eta)\expe{}{w_t} + 28\eta^2
\end{aligned}
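The recursion above uses the Gaussian moments $\expe{}{x_t^2}=1$, $\expe{}{x_ty_t}=3$, $\expe{}{x_t^4}=3$, $\expe{}{x_t^3y_t}=9$ and $\expe{}{x_t^2y_t^2}=28$ (with $y_t=3x_t+\varepsilon_t$). A quick Monte-Carlo sanity check in Python:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2_000_000
x = rng.normal(size=n)
y = 3.0 * x + rng.normal(size=n)                 # y ~ N(3x, 1)

# Expected values: E[x^2]=1, E[xy]=3, E[x^4]=3, E[x^3 y]=9, E[x^2 y^2]=28
for name, val in [("E[x^2]", (x**2).mean()), ("E[x y]", (x * y).mean()),
                  ("E[x^4]", (x**4).mean()), ("E[x^3 y]", (x**3 * y).mean()),
                  ("E[x^2 y^2]", (x**2 * y**2).mean())]:
    print(f"{name:12s} ~ {val:.3f}")
```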
Substituting the first-moment recursion, the combination $\expe{}{w_{t+1}^2} - 6\expe{}{w_{t+1}}$ satisfies:
\begin{aligned}
\expe{}{w_{t+1}^2} - 6\expe{}{w_{t+1}} & = (1-2\eta+3\eta^2)\expe{}{w_t^2} - (18\eta^2-6\eta)\expe{}{w_t} + 28\eta^2 - 6(1-\eta)\expe{}{w_t} - 18\eta\\
& = (1-2\eta+3\eta^2)(\expe{}{w_t^2}-6\expe{}{w_t}) + 28\eta^2 - 18\eta.
\end{aligned}
Write $u_t \defeq \expe{}{w_{t}^2} - 6\expe{}{w_{t}}$, so the above reads $u_{t+1} = (1-2\eta+3\eta^2)u_t + 28\eta^2 - 18\eta$, an affine recursion whose fixed point is $s \defeq \frac{28\eta -18}{2-3\eta}$. Since $w_0=0$ gives $u_0=0$, unrolling yields $u_t = s\left(1-(1-2\eta+3\eta^2)^{t}\right)$, and therefore:
\begin{aligned}
\expe{w_t}{(w_t-3)^2} &= \expe{}{w_{t}^2} - 6\expe{}{w_{t}} + 9 \\
& = (1-2\eta+3\eta^2)^{t}(u_0 - s) + s + 9\\
& = s\left(1-(1-2\eta+3\eta^2)^{t}\right) + 9.
\end{aligned}
Substituting back into the expression for the test loss:
\begin{aligned}
\bar{\mathcal{L}}_{\eta,T} &= \frac12 \expe{w_T}{(w_T-3)^2}+\frac12\\
& = \frac{s}{2}\left(1-(1-2\eta+3\eta^2)^{T}\right) + 5\\
& = \frac{(14\eta -9)\left(1-(1-2\eta+3\eta^2)^{T}\right)}{2-3\eta}+5.
\end{aligned}
(As a check, $T=0$ gives $\bar{\mathcal{L}}_{\eta,0}=5$, the loss of $w=0$, while for small $\eta$ and $T\to\infty$ the loss approaches the Bayes-optimal value $\frac12$.)
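As a further sanity check on this closed form, the short Python sketch below simulates the one-dimensional training process directly (each step draws a fresh sample, as in the problem statement) and compares a Monte-Carlo estimate of $\bar{\mathcal{L}}_{\eta,T}$ with the formula; the helper names and parameter values are mine.

```python
import numpy as np

def closed_form(eta, T):
    """Expected test loss from the formula derived above."""
    q = 1.0 - 2.0 * eta + 3.0 * eta**2
    s = (28.0 * eta - 18.0) / (2.0 - 3.0 * eta)
    return 0.5 * s * (1.0 - q**T) + 5.0

def simulate(eta, T, runs=200_000, seed=3):
    """Monte-Carlo estimate of E[ 0.5 (w_T x - y)^2 ] = 0.5 E[(w_T - 3)^2] + 0.5."""
    rng = np.random.default_rng(seed)
    w = np.zeros(runs)
    for _ in range(T):
        x = rng.normal(size=runs)
        y = 3.0 * x + rng.normal(size=runs)
        w -= eta * (w * x - y) * x               # gradient of 0.5 (w x - y)^2 in w
    return (0.5 * (w - 3.0)**2).mean() + 0.5

for eta, T in [(1/3, 10), (0.1, 50), (0.05, 200)]:
    print(f"eta={eta:.3f}, T={T:3d}: formula={closed_form(eta, T):.4f}, "
          f"simulation={simulate(eta, T):.4f}")
```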

…The remaining parts to be continued…