
Reference:
Slides of EE4C03, TUD
Hayes, M. H., Statistical Digital Signal Processing and Modeling

Content

    • Steepest Descent Algorithm
      • Stability / Convergence
      • Convergence rate
      • Mean-square error
    • LMS Algorithm
      • Convergence in the mean
      • Convergence in the mean-square
    • Normalized LMS
    • Application: Noise Cancellation
    • Recursive Least Squares

In the previous five chapters, we considered a variety of problems, including signal modeling, Wiener filtering, and spectrum estimation. In each case we made the important assumption that the signals being analyzed were stationary. Unfortunately, the signals that arise in almost every application are nonstationary, so the approaches and techniques considered thus far are not directly appropriate. In this chapter we therefore develop adaptive methods that do not rely on the stationarity assumption at the outset.

To motivate the approach that we will be considering in this chapter, let us consider the Wiener filtering problem within the context of nonstationary processes:
$$\hat d(n)=\sum_{k=0}^p w(k)x(n-k)$$
with the Wiener-Hopf equations
$$\mathbf R_x \mathbf w=\mathbf r_{dx}$$
However, if $d(n)$ and $x(n)$ are nonstationary, then the filter coefficients that minimize $E\{|e(n)|^2\}$ will depend on $n$, and the filter will be shift-varying, i.e.,
$$\hat d(n)=\sum_{k=0}^p w_n(k)x(n-k)\tag{1}$$
where $w_n(k)$ is the value of the $k$th filter coefficient at time $n$. Using vector notation, this estimate may be expressed as
$$\hat d(n)=\mathbf w_n^T \mathbf x(n)\tag{2}$$
where
$$\mathbf w_n=[w_n(0),w_n(1),\cdots,w_n(p)]^T\tag{3}$$
is the vector of filter coefficients at time $n$, and
$$\mathbf x(n)=[x(n),x(n-1),\cdots, x(n-p)]^T\tag{4}$$
The Wiener-Hopf equations then become
$$\mathbf R_x (n) \mathbf w_n=\mathbf r_{dx} (n)\tag{5}$$
where
$$\mathbf{R}_{x}(n)=\left[\begin{array}{cccc} E\{x(n) x^{*}(n)\} & E\{x(n-1) x^{*}(n)\} & \cdots & E\{x(n-p) x^{*}(n)\} \\ E\{x(n) x^{*}(n-1)\} & E\{x(n-1) x^{*}(n-1)\} & \cdots & E\{x(n-p) x^{*}(n-1)\} \\ \vdots & \vdots & & \vdots \\ E\{x(n) x^{*}(n-p)\} & E\{x(n-1) x^{*}(n-p)\} & \cdots & E\{x(n-p) x^{*}(n-p)\} \end{array}\right]\tag{6}$$
is a $(p+1)\times(p+1)$ Hermitian matrix of autocorrelations and
$$\mathbf{r}_{dx}(n)=\left[E\{d(n) x^{*}(n)\}, E\{d(n) x^{*}(n-1)\}, \cdots, E\{d(n) x^{*}(n-p)\}\right]^{T}\tag{7}$$
Solving $(5)$ at each time $n$ would be impractical in most real-time implementations. However, the problem may be simplified considerably if we relax the requirement that $\mathbf w_n$ minimize the mean-square error at each time $n$ and consider, instead, a coefficient update equation of the form
$$\mathbf w_{n+1}=\mathbf w_n+\Delta \mathbf w_n\tag{8}$$
where $\Delta \mathbf w_n$ is a correction that is applied to the filter coefficients $\mathbf w_n$ at time $n$ to form a new set of coefficients, $\mathbf w_{n+1}$, at time $n+1$.

Even for stationary processes, we may still prefer to implement a time-invariant Wiener filter using $(8)$ when $p$ is large or $\mathbf R_x$ is ill-conditioned. In addition, solving the Wiener-Hopf equations requires that $r_x(k)$ and $r_{dx}(k)$ be known. Since these ensemble averages are typically unknown, they must be estimated from measurements of the processes, and this estimation is built into the adaptive filter.

Steepest Descent Algorithm

Let $\mathbf w_n$ be an estimate of the vector that minimizes the mean-square error $\xi(n)$ at time $n$. At time $n+1$ a new estimate is formed by adding a correction to $\mathbf w_n$ that is designed to bring $\mathbf w_n$ closer to the desired solution. The correction involves taking a step of size $\mu$ in the direction of maximum descent down the quadratic error surface.

As shown in Fig. 9.4b, for any vector $\mathbf w$, the gradient is orthogonal to the line that is tangent to the contour of constant error at $\mathbf w$. Since the gradient vector points in the direction of steepest ascent, the direction of steepest descent is the negative gradient direction. Thus, the update equation for $\mathbf w_n$ is
$$\mathbf w_{n+1}=\mathbf w_n-\mu \nabla \xi(n) \tag{SD.1}$$
The step size $\mu$ controls the rate at which the weight vector moves down the quadratic surface and must be a positive number.

Let us now evaluate the gradient vector $\nabla \xi (n)$. Assuming that $\mathbf w$ is complex, the gradient is the derivative of $E\{|e(n)|^2\}$ with respect to $\mathbf w^*$:
$$\nabla \xi(n)=\nabla E\{|e(n)|^2\}=E\{e(n)\nabla e^*(n) \}$$
Since $e(n)=d(n)-\mathbf w^T \mathbf x(n)$, we have
$$\nabla \xi(n)=-E\{e(n)\mathbf x^*(n) \}\tag{SD.2}$$
Thus, $(SD.1)$ becomes
$$\mathbf w_{n+1}=\mathbf w_n+\mu E\{e(n)\mathbf x^*(n) \} \tag{SD.3}$$
To see how this steepest descent update equation for $\mathbf w_n$ performs, let us consider what happens in the case of stationary processes. If $x(n)$ and $d(n)$ are jointly WSS, then
$$\begin{aligned} \mathbf w_{n+1}&=\mathbf w_n+\mu E\{[d(n)-\mathbf w_n^T\mathbf x(n)]\mathbf x^*(n) \}\\ &=\mathbf w_n+\mu(\mathbf r_{dx}-\mathbf R_x\mathbf w_n) \end{aligned}\tag{SD.4}$$
Note that if $\mathbf w_n$ is the solution to the Wiener-Hopf equations, $\mathbf w_n=\mathbf R_x^{-1}\mathbf r_{dx}$, then the correction term is zero and $\mathbf w_{n+1}=\mathbf w_n$ for all $n$.

The steepest descent algorithm may be summarized as follows:

  1. Initialize the steepest descent algorithm with an initial estimate, $\mathbf{w}_{0}$, of the optimum weight vector $\mathbf{w}$.
  2. Evaluate the gradient of $\xi(n)$ at the current estimate, $\mathbf{w}_{n}$, of the optimum weight vector.
  3. Update the estimate at time $n$ by adding a correction formed by taking a step of size $\mu$ in the negative gradient direction:
$$\mathbf{w}_{n+1}=\mathbf w_n+\mu(\mathbf r_{dx}-\mathbf R_x\mathbf w_n)$$
  4. Go back to step (2) and repeat the process (a sketch of this iteration follows below).
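To make the iteration concrete, here is a minimal Python sketch of steepest descent for the stationary case, assuming that $\mathbf R_x$ and $\mathbf r_{dx}$ are known exactly; the function name and the toy 2-tap example are illustrative, not part of the original text.

```python
import numpy as np

def steepest_descent(Rx, rdx, mu, n_iter=200, w0=None):
    """Iterate w_{n+1} = w_n + mu * (rdx - Rx @ w_n)  (step 3 above)."""
    w = np.zeros(len(rdx), dtype=complex) if w0 is None else np.array(w0, dtype=complex)
    for _ in range(n_iter):
        w = w + mu * (rdx - Rx @ w)      # step of size mu in the negative gradient direction
    return w

# toy 2-tap example: eigenvalues of Rx are 0.5 and 1.5, so we need 0 < mu < 2/1.5
Rx = np.array([[1.0, 0.5], [0.5, 1.0]])
rdx = np.array([0.7, 0.3])
w_sd = steepest_descent(Rx, rdx, mu=0.5)
print(w_sd.real, np.linalg.solve(Rx, rdx))   # converges to the Wiener solution Rx^{-1} rdx
```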

Stability / Convergence

Of greater interest, however, is how the weights evolve in time, beginning with an arbitrary initial weight vector $\mathbf w_0$. The following property defines what is required for $\mathbf w_n$ to converge to $\mathbf w$.

For jointly wide-sense stationary processes $d(n)$ and $x(n)$, the steepest descent adaptive filter converges to the solution of the Wiener-Hopf equations,
$$\lim_{n\to \infty} \mathbf w_n=\mathbf R_x^{-1}\mathbf r_{dx}$$
if the step size satisfies the condition
$$0<\mu<\frac{2}{\lambda_{max}}\tag{SD.5}$$
where $\lambda_{max}$ is the maximum eigenvalue of the autocorrelation matrix $\mathbf R_x$.

To prove this, we first define the weight error vector
$$\mathbf c_n=\mathbf w_n-\mathbf w \tag{SD.6}$$
and show that $\lim_{n\to \infty}\mathbf c_n=\mathbf 0$. Applying $(SD.4)$ and the Wiener-Hopf equations $\mathbf R_x \mathbf w=\mathbf r_{dx}$ gives
$$\begin{aligned} \mathbf c_{n+1}&=\mathbf w_{n+1}-\mathbf w=(\mathbf I-\mu \mathbf R_x)\mathbf w_n+\mu \mathbf R_x \mathbf w-\mathbf w\\ &=[\mathbf I-\mu \mathbf R_x](\mathbf w_n-\mathbf w)=[\mathbf I-\mu \mathbf R_x]\mathbf c_n \end{aligned}\tag{SD.7}$$
Since $\mathbf R_x=\mathbf R_x^H$ and $\mathbf R_x \succeq 0$, it can be factored as
$$\mathbf R_x=\mathbf V \boldsymbol {\Lambda} \mathbf V^H$$
with $\mathbf V \mathbf V^H=\mathbf I$ and $\lambda_k\ge0$. Incorporating this factorization into $(SD.7)$ leads to
$$\mathbf c_{n+1}=[\mathbf I-\mu \mathbf V \boldsymbol {\Lambda} \mathbf V^H]\mathbf c_n=\mathbf V(\mathbf I-\mu \boldsymbol \Lambda)\mathbf V^H\mathbf c_n\tag{SD.8}$$
If we define
$$\mathbf u_n=\mathbf V^H \mathbf c_n \tag{SD.9}$$
then $(SD.8)$ becomes
$$\mathbf u_{n+1}=(\mathbf I-\mu \boldsymbol \Lambda)\mathbf u_n\tag{SD.10}$$

Starting from the initial vector $\mathbf u_0=\mathbf V^H\mathbf c_0$, it follows that
$$\mathbf u_{n}=(\mathbf I-\mu \boldsymbol \Lambda)^n\mathbf u_0\tag{SD.11}$$
In order for $\mathbf w_n$ to converge to $\mathbf w$ it is necessary that the weight error vector $\mathbf c_n$ converge to zero and, therefore, that $\mathbf u_n = \mathbf V^H \mathbf c_n$ converge to zero. This will occur for any $\mathbf u_0$ if and only if
$$|1-\mu \lambda_k|<1;\quad k=0,1,\cdots,p \tag{SD.12}$$
which places the following restriction on the step size $\mu$:
$$0<\mu<\frac{2}{\lambda_{max}}\tag{SD.5}$$
as was to be shown.


Convergence rate

From $(SD.11)$, we can see that each entry of $\mathbf u_n$ converges at a rate determined by $|1-\mu \lambda_i|$. If $1-\mu \lambda_{max}>0$, then the slowest mode is determined by $\lambda_{min}$.

Define a time constant $\tau$ such that
$$(1-\mu \lambda_{min})^\tau=\frac{1}{\rm e} \tag{SD.13}$$
For sufficiently small $\mu$,
$$\tau=\frac{-1}{\ln (1-\mu \lambda_{min})}\approx \frac{1}{\mu \lambda_{min}}\tag{SD.14}$$
If $\mu=1/\lambda_{max}$, then
$$\tau \approx \frac{\lambda_{max}}{\lambda_{min}}=\mathrm{cond}(\mathbf R_x)\tag{SD.15}$$
If the eigenvalues of $\mathbf R_x$ are widely spread, convergence will be slow.
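As a quick numerical illustration (the eigenvalues below are assumed values, not taken from the text), the time constant in $(SD.15)$ equals the condition number of $\mathbf R_x$ when $\mu=1/\lambda_{max}$:

```python
import numpy as np

lam = np.array([0.1, 0.4, 1.0])        # assumed eigenvalues of Rx
mu = 1.0 / lam.max()                    # mu = 1/lambda_max
tau = 1.0 / (mu * lam.min())            # slowest time constant, (SD.14)
print(tau, lam.max() / lam.min())       # both equal 10: roughly 10 iterations per factor of e
```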


Mean-square error

For wide-sense stationary processes the minimum mean-square error is
$$\xi_{min}=r_d(0)-\mathbf r_{dx}^H\mathbf w \tag{SD.16}$$
For an arbitrary weight vector $\mathbf w_n$, the mean-square error is
$$\begin{aligned} \xi(n)&=E\{|d(n)-\mathbf w_n^T\mathbf x(n)|^2\}=r_d(0)-\mathbf r_{dx}^H \mathbf w_n-\mathbf w_n^H \mathbf r_{dx}+\mathbf w_n ^H \mathbf R_x \mathbf w_n\\ &=r_d(0)-\mathbf r_{dx}^H (\mathbf w+\mathbf c_n)-(\mathbf w+\mathbf c_n)^H \mathbf r_{dx}+(\mathbf w+\mathbf c_n) ^H \mathbf R_x (\mathbf w+\mathbf c_n)\\ &=r_d(0)-\mathbf r_{dx}^H \mathbf w+\mathbf c_n^H\mathbf R_x\mathbf c_n\\ &=\xi_{min}+\mathbf c_n^H\mathbf R_x \mathbf c_n\\ &=\xi_{min}+\mathbf u_n^H\boldsymbol \Lambda \mathbf u_n\\ &=\xi_{min}+\sum_{k=0}^p \lambda_k(1-\mu \lambda_k)^{2n}|u_0(k)|^2 \end{aligned}\tag{SD.17}$$
where the third line follows because the cross terms cancel when $\mathbf R_x \mathbf w=\mathbf r_{dx}$. Thus, if the step size $\mu$ satisfies the condition for convergence given in $(SD.5)$, then $\xi (n)$ decays exponentially to $\xi_{min}$.

LMS Algorithm

In the previous section, we developed the weight-vector update equation
$$\mathbf{w}_{n+1}=\mathbf w_n-\mu \nabla \xi(n)=\mathbf w_n+\mu(\mathbf r_{dx}-\mathbf R_x\mathbf w_n)$$
In practice, $\mathbf R_x$ and $\mathbf r_{dx}$ are not perfectly known and have to be estimated from the data.

The Least Mean Square (LMS) algorithm uses simple instantaneous estimates based only on the current samples,
$$\hat {\mathbf R}_x=\mathbf x^*(n)\mathbf x^T(n),\quad \hat{\mathbf r}_{dx}=\mathbf x^*(n)d(n)\tag{LMS.1}$$
The resulting instantaneous gradient estimate is
$$\hat {\nabla} \xi(\mathbf w)=\hat{\mathbf R}_x\mathbf w_n-\hat{\mathbf r}_{dx}=\mathbf x^*(n)\mathbf x^T(n)\mathbf w_n-\mathbf x^*(n)d(n)=-\mathbf x^*(n)e(n)\tag{LMS.2}$$
The LMS algorithm (usually initialized with $\mathbf w_0=\mathbf 0$) is then
$$\begin{aligned} y(n) &:=\mathbf{w}_n^T \mathbf{x}(n) \\ e(n) &:=d(n)-y(n) \\ {\mathbf{w}}_{n+1} &:={\mathbf{w}}_n+\mu e(n) \mathbf{x}^{*}(n) \end{aligned}\tag{LMS.3}$$
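A minimal Python sketch of $(LMS.3)$ is given below; the function name, the zero-padding of $\mathbf x(n)$ for $n<p$, and the choice to return the whole weight history are implementation conventions, not prescribed by the text.

```python
import numpy as np

def lms(x, d, p, mu, w0=None):
    """LMS adaptive filter (LMS.3): returns the weight history W and the error signal e."""
    N = len(x)
    w = np.zeros(p + 1, dtype=complex) if w0 is None else np.array(w0, dtype=complex)
    W = np.zeros((N, p + 1), dtype=complex)
    e = np.zeros(N, dtype=complex)
    for n in range(N):
        # data vector x(n) = [x(n), x(n-1), ..., x(n-p)]^T, zero-padded for n < p
        xn = np.array([x[n - k] if n - k >= 0 else 0.0 for k in range(p + 1)])
        y = w @ xn                       # y(n) = w_n^T x(n)
        e[n] = d[n] - y                  # e(n) = d(n) - y(n)
        w = w + mu * e[n] * np.conj(xn)  # w_{n+1} = w_n + mu e(n) x*(n)
        W[n] = w
    return W, e
```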


Convergence in the mean

Since $\mathbf w_n$ is a vector of random variables, the convergence of the LMS algorithm must be considered within a statistical framework. We therefore begin by assuming that $x(n)$ and $d(n)$ are jointly wide-sense stationary processes.

In addition, assume that the data $\mathbf x(n)$ and the LMS weight vector $\mathbf w_n$ are statistically independent (the standard independence assumption).

For jointly wide-sense stationary processes $d(n)$ and $x(n)$, the LMS algorithm converges in the mean, i.e.,
$$\lim_{n\to \infty} E\{ \mathbf w_n\}=\mathbf R_x^{-1}\mathbf r_{dx}\tag{LMS.4}$$
if
$$0<\mu<\frac{2}{\lambda_{max}}\tag{LMS.5}$$
where $\lambda_{max}$ is the maximum eigenvalue of the autocorrelation matrix $\mathbf R_x$.

The LMS coefficient update equation is
$$\mathbf w_{n+1}=\mathbf w_n+\mu[d(n)-\mathbf w_n^T\mathbf x(n)]\mathbf x^*(n)$$
Taking the expected value, and using the independence assumption in the last step, we have
$$\begin{aligned} E\{\mathbf w_{n+1}\}&=E\{\mathbf w_{n}\}+\mu E\{d(n)\mathbf x^*(n)\}-\mu E\{\mathbf w_{n}^T\mathbf x(n)\mathbf x^*(n)\}\\ &=E\{\mathbf w_{n}\}+\mu E\{d(n)\mathbf x^*(n)\}-\mu E\{\mathbf x^*(n) \mathbf x^T(n)\mathbf w_n\}\\ &=(\mathbf I-\mu \mathbf R_x)E\{\mathbf w_{n}\}+\mu \mathbf r_{dx} \end{aligned}$$
Defining the weight error vector
$$\mathbf c_n=\mathbf w_n-\mathbf w\tag{LMS.6}$$
we then have
$$\begin{aligned} E\{\mathbf c_{n+1}\} &=(\mathbf I-\mu \mathbf R_x)E\{\mathbf w_{n}\}+\mu \mathbf R_x \mathbf w-\mathbf w\\ &=(\mathbf I-\mu \mathbf R_x)E\{\mathbf c_{n}\}\\ &=\mathbf V(\mathbf I-\mu \boldsymbol \Lambda)\mathbf V^HE\{\mathbf c_{n}\} \end{aligned}\tag{LMS.7}$$
As before, define
$$\mathbf u_n=\mathbf V^H\mathbf c_n\tag{LMS.8}$$
It follows from $(LMS.7)$ and $(LMS.8)$ that
$$E\{\mathbf u_{n}\}=(\mathbf I-\mu \boldsymbol \Lambda)^n \mathbf u_0 \tag{LMS.9}$$
In order for $E\{\mathbf w_n\}$ to converge to $\mathbf w$ it is necessary that
$$|1-\mu \lambda_k|<1\Longrightarrow0<\mu<\frac{2}{\lambda_{max}}$$
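This statement can also be checked empirically by averaging the weight trajectories of the lms() sketch above over many independent runs; the data model below (a fixed 3-tap filter plus white observation noise) is a hypothetical example, for which the Wiener solution equals the filter $h$.

```python
import numpy as np

rng = np.random.default_rng(0)
h = np.array([1.0, -0.5, 0.25])          # "true" filter; for white x, Rx^{-1} r_dx = h
p, mu, N, trials = 2, 0.01, 1000, 100
W_avg = np.zeros((N, p + 1))
for _ in range(trials):
    x = rng.standard_normal(N)
    d = np.convolve(x, h)[:N] + 0.1 * rng.standard_normal(N)
    W, _ = lms(x, d, p, mu)              # LMS sketch from the previous section
    W_avg += W.real / trials
print(W_avg[-1])                          # the averaged weights approach h
```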


Convergence in the mean-square

The ensemble-average value of the LMS mean-square error is
$$\begin{aligned} \xi(n)&=E\{|e(n)|^2\}=E\{|d(n)-\mathbf w_n^T\mathbf x(n)|^2\}\\ &=E\{|d(n)-(\mathbf c_n+\mathbf w)^T\mathbf x(n)|^2\}\\ &=E\{|d(n)-\mathbf w^T\mathbf x(n)|^2\}+E\{\mathbf c_n^H\mathbf x^*(n)\mathbf x^T(n)\mathbf c_n\}\\ &\triangleq \xi_{min}+\xi_{ex}(n) \end{aligned}\tag{LMS.10}$$
where $\xi_{min}$ is the Wiener error and $\xi_{ex}$ is the excess error. Under the independence assumption,
$$\begin{aligned} \xi_{ex}(n)&=E\{\mathbf c_n^H\mathbf x^*(n)\mathbf x^T(n)\mathbf c_n\}\\ &=E\{\mathrm{tr} [\mathbf x^*(n)\mathbf x^T(n)\mathbf c_n \mathbf c_n^H]\}\\ &= \mathrm{tr}(\mathbf R_x \mathbf K_n) \end{aligned}\tag{LMS.11}$$
where $\mathbf K_n \triangleq E\{\mathbf c_n \mathbf c_n^H\}$.

The mean-square error $\xi (n)$ converges to a steady-state value of
$$\xi (\infty)=\xi_{min}+\xi _{ex}(\infty)=\xi_{min}\frac{1}{1-\mu \sum_{k=0}^p\frac{\lambda_k}{2-\mu \lambda_k}}\tag{LMS.12}$$
and the LMS algorithm is said to converge in the mean-square if and only if the step size $\mu$ satisfies the following two conditions:
$$0<\mu<\frac{2}{\lambda_{max}}\tag{LMS.13}$$
$$\mu \sum_{k=0}^p\frac{\lambda_k}{2-\mu \lambda_k}<1\tag{LMS.14}$$
where the $\lambda_k$ are the eigenvalues of $\mathbf R_x$.

Note that $(LMS.13)$ is the condition required for the LMS algorithm to converge in the mean, and $(LMS.14)$ guarantees that $\xi(\infty)$ is positive.

If $\mu \ll 2/\lambda_{max}$, as is typically the case, then $\mu \lambda_k \ll 2$ and $(LMS.14)$ may be simplified to
$$\frac{1}{2}\mu \sum_{k=0}^p \lambda_k=\frac{1}{2}\mu\, \mathrm{tr}(\mathbf R_x)<1 \tag{LMS.15}$$
From $(LMS.12)$, we have
$$\xi(\infty)=\xi_{min}+\xi _{ex}(\infty)\approx \xi_{min}\frac{1}{1-\frac{1}{2}\mu\, \mathrm{tr}(\mathbf R_x)}\tag{LMS.16}$$
In summary, if
$$0<\mu<\frac{2}{\mathrm{tr}(\mathbf R_x)}\tag{LMS.17}$$
then the LMS algorithm converges in the mean square, and
$$\xi_{ex}(\infty) \approx \mu \xi_{min} \frac{\frac{1}{2}\mathrm{tr}(\mathbf R_x)}{1-\frac{1}{2}\mu\,\mathrm{tr}(\mathbf R_x)}\approx \frac{1}{2}\mu\, \xi_{min}\, \mathrm{tr}(\mathbf R_x) \tag{LMS.18}$$
Thus, for small $\mu$, the excess mean-square error is proportional to the step size $\mu$.
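The following snippet shows how these bounds are typically used to pick a step size; the filter order, input power, and $\xi_{min}$ are assumed values chosen only for illustration.

```python
# Assumed example: 10-tap filter (p = 9), unit input power, xi_min = 0.01.
p, sigma_x2, xi_min = 9, 1.0, 0.01
tr_Rx = (p + 1) * sigma_x2           # tr(Rx) = (p+1) E{|x(n)|^2} for a stationary input
mu_max = 2.0 / tr_Rx                 # mean-square convergence bound (LMS.17)
mu = 0.1 * mu_max                    # a conservative choice
xi_ex = 0.5 * mu * xi_min * tr_Rx    # predicted excess MSE (LMS.18), proportional to mu
print(mu_max, xi_ex)                 # 0.2 and 0.001, i.e. about a 10% misadjustment
```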

Normalized LMS

For stationary processes, the LMS algorithm converges in the mean if $0<\mu<2/\lambda_{max}$, and converges in the mean-square if $0<\mu<2/\mathrm{tr}(\mathbf R_x)$. However, since $\mathbf R_x$ is generally unknown, either $\lambda_{max}$ or $\mathrm{tr}(\mathbf R_x)$ must be estimated in order to use these bounds. One way around this difficulty is to use the fact that, for stationary processes, $\mathrm{tr}(\mathbf R_x) = (p + 1)E\{|x(n) |^2\}$. Therefore, the condition for mean-square convergence may be replaced with
$$0<\mu<\frac{2}{(p+1)E\{|x(n) |^2\}}\tag{NLMS.1}$$
where $E\{|x(n) |^2\}$ is the power in the process $x(n)$. This power may be estimated using a time average such as
$$\hat E\{ | x(n) |^2 \}=\frac{1}{p+1}\sum_{k=0}^p |x(n-k)|^2 \tag{NLMS.2}$$
which leads to the following bound on the step size for mean-square convergence:
$$0<\mu<\frac{2}{\mathbf x^H(n)\mathbf x(n)}\tag{NLMS.3}$$
Therefore, a convenient way to incorporate this bound into the LMS filter is to use a (time-varying) step size of the form
$$\mu(n)=\frac{\beta}{\mathbf x^H(n)\mathbf x(n)}=\frac{\beta}{\|\mathbf x (n)\|^2}\tag{NLMS.4}$$
where $\beta$ is a normalized step size with $0<\beta<2$.

Replacing $\mu$ in the LMS weight vector update equation with $\mu(n)$ leads to the Normalized LMS (NLMS) algorithm,
$$\mathbf w_{n+1}=\mathbf w_n+\beta \frac{\mathbf x^*(n)}{\|\mathbf x(n)\|^2}e(n)\tag{NLMS.5}$$
Compared with the LMS algorithm, the normalized LMS algorithm requires additional computation to evaluate the normalization term $\|\mathbf x(n)\|^2$. However, if this term is evaluated recursively as
$$\|\mathbf x(n+1)\|^2=\|\mathbf x(n)\|^2+|x(n+1)|^2-|x(n-p)|^2\tag{NLMS.6}$$
then the extra computation involves only two squaring operations, one addition, and one subtraction.
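A minimal NLMS sketch follows; the small constant eps added to the denominator is a common implementation safeguard against division by zero and is not part of $(NLMS.5)$.

```python
import numpy as np

def nlms(x, d, p, beta, eps=1e-8, w0=None):
    """Normalized LMS (NLMS.5) with normalized step size 0 < beta < 2."""
    N = len(x)
    w = np.zeros(p + 1, dtype=complex) if w0 is None else np.array(w0, dtype=complex)
    e = np.zeros(N, dtype=complex)
    for n in range(N):
        xn = np.array([x[n - k] if n - k >= 0 else 0.0 for k in range(p + 1)])
        e[n] = d[n] - w @ xn                                   # e(n) = d(n) - w_n^T x(n)
        w = w + beta * e[n] * np.conj(xn) / (np.vdot(xn, xn).real + eps)
    return w, e
```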

Application: Noise Cancellation

An important issue in the implementation of an adaptive filter, apparent from Fig. 9.1, is the requirement that the error signal $e(n)$ be available to the adaptive algorithm. Since it is $e(n)$ that allows the filter to measure its performance and determine how the filter coefficients should be modified, without $e(n)$ the filter would not be able to adapt.

In some applications, acquiring $d(n)$ is straightforward, which makes the evaluation of $e(n)$ easy. An example is the problem of system identification: a plant produces an output $d(n)$ in response to a known input $x(n)$, and the goal is to develop a model for the plant, $W_n(z)$, that produces a response $\hat d(n)$ that is as close as possible to $d(n)$. To account for observation noise in the measurement of the plant output, an additive noise source $v(n)$ is included in the model.

However, in other applications the acquisition of $d(n)$ is not straightforward. Consider, for example, the problem of interference cancellation, in which a signal $d(n)$ is observed in the presence of an interfering signal $v(n)$,
$$x(n) = d(n) + v(n)$$
Since $d(n)$ is unknown, the error sequence cannot be generated directly. However, the system described below will generate a sequence that may be used by the adaptive filter to estimate $d(n)$.

Specifically, suppose that $d(n)$ and $v(n)$ are real-valued, uncorrelated, zero-mean processes. In addition, suppose that $d(n)$ is a narrowband process and that $v(n)$ is a broadband process with an autocorrelation sequence that is approximately zero for lags $k\ge k_0$. The adaptive filter operates on the delayed observation $x(n-k_0)$, so its output is $y(n)=\sum_{k=0}^p w_n(k)x(n-k_0-k)$, and the error used for adaptation is $e(n)=x(n)-y(n)$.

Ideally we would minimize $E\{|d(n)-y(n)|^2\}$, but the system can only form $E\{|d(n)+v(n)-y(n)|^2 \}$. Fortunately, under the assumptions above, we have
$$\begin{aligned} E\{|e(n)|^2\}=&E\{|d(n)+v(n)-y(n)|^2 \}\\ =&E\{|d(n)-y(n)|^2 \}+E\{|v(n)|^2 \}+2\Re\{ E [v(n)(d(n)-y(n))^* ]\} \end{aligned}$$
Note that, since $v(n)$ and $d(n)$ are uncorrelated and zero mean,
$$\begin{aligned} E [v(n)(d(n)-y(n))^* ]&=-E\{ v(n)y^*(n) \}=-E\Big\{v(n) \sum_{k=0}^p w_n^*(k)x^*(n-k_0-k)\Big\}\\ &=-\sum_{k=0}^p w^*_n(k)\big[E\{v(n)d^*(n-k_0-k)\}+E\{v(n)v^*(n-k_0-k)\}\big]\\ &=0 \end{aligned}$$
where the first expectation in the sum vanishes because $d(n)$ and $v(n)$ are uncorrelated, and the second vanishes because the autocorrelation of $v(n)$ is approximately zero for lags $k\ge k_0$. Therefore,
$$E\{|e(n)|^2\}=E\{|v(n)|^2\}+E\{|d(n)-y(n)|^2\}$$
Minimizing $E\{|e(n)|^2\}$ is therefore equivalent to minimizing $E\{|d(n)-y(n)|^2\}$, and the output of the adaptive filter is the minimum mean-square estimate of $d(n)$.
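Below is a small simulation sketch of this noise canceller, built on the lms() function given earlier; the sinusoid frequency, noise level, delay $k_0$, filter order, and step size are all assumed values chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
N, k0, p, mu = 5000, 5, 20, 0.005
n = np.arange(N)
d = np.sin(2 * np.pi * 0.01 * n)                 # narrowband desired signal
v = 0.5 * rng.standard_normal(N)                 # broadband interference
x = d + v                                        # observed signal x(n) = d(n) + v(n)
x_del = np.concatenate([np.zeros(k0), x[:-k0]])  # adaptive filter input x(n - k0)
_, e = lms(x_del, x, p, mu)                      # the "desired" signal for adaptation is x(n)
y = x - e.real                                   # filter output y(n), the estimate of d(n)
print(np.mean((y[N // 2:] - d[N // 2:]) ** 2), np.mean(v ** 2))  # output error vs. noise power
```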

Recursive Least Squares

In each of the adaptive filtering methods discussed so far, we have considered gradient descent algorithms for the minimization of the mean-square error
$$\xi(n)=E\{|e(n)|^2\}\tag{RLS.1}$$
The difficulty with these methods is that they all require knowledge of the autocorrelation of the input process, $E\{x(n)x^*(n-k)\}$, and the cross-correlation between the input and the desired output, $E\{d(n)x^*(n-k)\}$. When this statistical information is unknown, we are forced to estimate these statistics from the data. In the LMS adaptive filter, these ensemble averages are estimated using instantaneous values,
$$\hat E\{ e(n)x^*(n-k) \}=e(n)x^*(n-k)\tag{RLS.2}$$
An alternative is to consider error measures that do not include expectations and that may be computed directly from the data, for example the least squares error
$$\varepsilon(n)=\sum_{i=0}^n |e(i)|^2\tag{RLS.3}$$
Another way to motivate this choice is to consider the Wiener receiver
$$\mathbf w_0=\mathbf R_x^{-1}\mathbf r_{dx}$$
with $\mathbf R_x=E\{\mathbf x^*(n)\mathbf x^T(n)\}$ and $\mathbf r_{dx}=E\{\mathbf x^*(n)d(n)\}$.

With finite data, all expectations are estimated from the available samples:
$$\begin{aligned} &\hat {\mathbf R}_x=\frac{1}{N}\sum_{n=0}^{N-1} \mathbf x^*(n)\mathbf x^T(n)=\frac{1}{N}\mathbf X^* \mathbf X^T\\ &\hat {\mathbf r}_{dx}=\frac{1}{N}\sum_{n=0}^{N-1} \mathbf x^*(n)d(n)=\frac{1}{N}\mathbf X^* \mathbf d^T \end{aligned}$$
Therefore,
$$\hat {\mathbf w}_0=\hat{\mathbf R}_x^{-1}\hat{\mathbf r}_{dx}=(\mathbf X^* \mathbf X^T)^{-1}\mathbf X^* \mathbf d^T\tag{RLS.4}$$
which is the optimal finite-sample solution for the cost function
$$\hat \xi (\mathbf w)=\frac{1}{N}\sum_{n=0}^{N-1} |\mathbf w^T \mathbf x(n) -d(n)|^2=\frac{1}{N}\|\mathbf w^T \mathbf X-\mathbf d\|^2=\frac{1}{N} \sum_{n=0}^{N-1} |e(n)|^2 \tag{RLS.5}$$
Therefore, it is reasonable to measure the error as in $(RLS.3)$.
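For comparison with the recursion derived next, the batch solution $(RLS.4)$ can be written directly; here $\mathbf X$ is assumed to hold the data vectors $\mathbf x(n)$ as columns and $\mathbf d$ the corresponding desired samples.

```python
import numpy as np

def batch_ls(X, d):
    """Batch least-squares weights (RLS.4): w = (X* X^T)^{-1} X* d^T."""
    Phi = np.conj(X) @ X.T          # (p+1) x (p+1) matrix X* X^T
    theta = np.conj(X) @ d          # (p+1) vector X* d^T
    return np.linalg.solve(Phi, theta)
```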


Suppose at time $n$ we know
$$\mathbf X_n:=[\mathbf x(0),\cdots,\mathbf x(n)], \quad \mathbf d_n:=[d(0),\cdots,d(n)]\tag{RLS.6}$$
The solution that minimizes $\|\mathbf w^T \mathbf X_n-\mathbf d_n\|^2$ is
$$\hat {\mathbf w}_n=(\mathbf X_n^* \mathbf X_n^T)^{-1}\mathbf X_n^* \mathbf d_n^T =: \boldsymbol{\Phi}_n^{-1} \boldsymbol{\theta}_n \tag{RLS.7}$$
where
$$\boldsymbol{\Phi}_{n}:=\mathbf{X}_{n}^{*} \mathbf{X}_{n}^{\mathrm{T}}=\sum_{i=0}^{n} \mathbf{x}^{*}(i) \mathbf{x}^{\mathrm{T}}(i), \quad \boldsymbol{\theta}_{n}:=\mathbf{X}_{n}^{*} \mathbf{d}_{n}^{\mathrm{T}}=\sum_{i=0}^{n} \mathbf{x}^{*}(i) d(i) \tag{RLS.8}$$
To update $\boldsymbol{\Phi}_{n}^{-1}$, we can apply the matrix inversion lemma:
$$\boldsymbol{\Phi}_{n+1}^{-1}=\left(\boldsymbol{\Phi}_{n}+\mathbf{x}^{*}(n+1) \mathbf{x}^{\mathrm{T}}(n+1)\right)^{-1}=\boldsymbol{\Phi}_{n}^{-1}-\frac{\boldsymbol{\Phi}_{n}^{-1} \mathbf{x}^{*}(n+1) \mathbf{x}^{\mathrm{T}}(n+1) \boldsymbol{\Phi}_{n}^{-1}}{1+\mathbf{x}^{\mathrm{T}}(n+1) \boldsymbol{\Phi}_{n}^{-1} \mathbf{x}^{*}(n+1)}\tag{RLS.9}$$
The update of $\boldsymbol{\theta}_{n}$ is straightforward:
$$\boldsymbol{\theta}_{n+1}=\boldsymbol{\theta}_{n}+\mathbf x^*(n+1)d(n+1)\tag{RLS.10}$$

Recursive Least Squares (RLS) algorithm:
Initialization: $\boldsymbol{\theta}_{0}=\mathbf 0$ and $\mathbf{P}_{0}=\delta^{-1} \mathbf{I}$, where $\delta$ is a very small positive constant.

$$\begin{aligned} \mathbf{P}_{n+1} &:=\mathbf{P}_{n}-\frac{\mathbf{P}_{n} \mathbf{x}^{*}(n+1) \mathbf{x}^{\mathrm{T}}(n+1) \mathbf{P}_{n}}{1+\mathbf{x}^{\mathrm{T}}(n+1) \mathbf{P}_{n} \mathbf{x}^{*}(n+1)} \\ \boldsymbol{\theta}_{n+1} &:=\boldsymbol{\theta}_{n}+\mathbf{x}^{*}(n+1) d(n+1) \\ \hat{\mathbf{w}}_{n+1} &:=\mathbf{P}_{n+1} \boldsymbol{\theta}_{n+1} \end{aligned}$$
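A minimal Python sketch of these three updates is shown below; the zero-padding of $\mathbf x(n)$ and the default value of $\delta$ are implementation choices.

```python
import numpy as np

def rls(x, d, p, delta=1e-3):
    """Recursive least squares: P_n tracks Phi_n^{-1} via the matrix inversion lemma."""
    N = len(x)
    P = np.eye(p + 1, dtype=complex) / delta        # P_0 = delta^{-1} I
    theta = np.zeros(p + 1, dtype=complex)          # theta_0 = 0
    w = np.zeros(p + 1, dtype=complex)
    for n in range(N):
        xn = np.array([x[n - k] if n - k >= 0 else 0.0 for k in range(p + 1)])
        Px = P @ np.conj(xn)                              # P_n x*(n)
        P = P - np.outer(Px, xn @ P) / (1.0 + xn @ Px)    # (RLS.9)
        theta = theta + np.conj(xn) * d[n]                # (RLS.10)
        w = P @ theta                                     # w_n = Phi_n^{-1} theta_n
    return w
```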


Finite horizon: For adaptive purposes, we want an effective window of data.

  • Sliding window: compute $\boldsymbol{\Phi}_n$ and $\boldsymbol{\theta}_n$ from only the last $L$ samples,
$$\begin{aligned} \boldsymbol{\Phi}_{n+1}^{-1}&=\left(\boldsymbol{\Phi}_{n}+\mathbf{x}^{*}(n+1) \mathbf{x}^{\mathrm{T}}(n+1)-\mathbf{x}^{*}(n-L) \mathbf{x}^{\mathrm{T}}(n-L)\right)^{-1}\\ \boldsymbol{\theta}_{n+1}&=\boldsymbol{\theta}_{n}+\mathbf x^*(n+1)d(n+1)-\mathbf x^*(n-L)d(n-L) \end{aligned}\tag{RLS.11}$$
    This requires applying the matrix inversion lemma twice, which roughly doubles the complexity, and the last $L$ data samples must be kept in memory.

  • Exponential window: scale down $\boldsymbol{\Phi}_n$ and $\boldsymbol{\theta}_n$ by a factor $\lambda \approx 1$,
$$\begin{aligned} \boldsymbol{\Phi}_{n+1}^{-1}&=\left(\lambda\boldsymbol{\Phi}_{n}+\mathbf{x}^{*}(n+1) \mathbf{x}^{\mathrm{T}}(n+1)\right)^{-1}\\ \boldsymbol{\theta}_{n+1}&=\lambda \boldsymbol{\theta}_{n}+\mathbf x^*(n+1)d(n+1) \end{aligned}\tag{RLS.12}$$
    This corresponds to the exponentially weighted data matrices
$$\begin{aligned} \mathbf{X}_{n+1}&=[\mathbf{x}(n+1) \quad \lambda^{1 / 2} \mathbf{x}(n)\quad \lambda \mathbf{x}(n-1) \quad \lambda^{3 / 2} \mathbf{x}(n-2) \quad \cdots] \\ \mathbf{d}_{n+1}&=[d(n+1)\quad \lambda^{1 / 2} d(n) \quad \lambda d(n-1) \quad \lambda^{3 / 2} d(n-2) \quad \cdots] \end{aligned}\tag{RLS.13}$$
    Exponentially weighted RLS algorithm (a Python sketch follows the list):
$$\begin{aligned} \mathbf{P}_{n+1} &:=\lambda^{-1} \mathbf{P}_{n}-\lambda^{-2} \frac{\mathbf{P}_{n} \mathbf{x}^{*}(n+1) \mathbf{x}^{\mathrm{T}}(n+1) \mathbf{P}_{n}}{1+\lambda^{-1} \mathbf{x}^{\mathrm{T}}(n+1) \mathbf{P}_{n} \mathbf{x}^{*}(n+1)} \\ \boldsymbol{\theta}_{n+1} &:=\lambda \boldsymbol{\theta}_{n}+\mathbf{x}^{*}(n+1) d(n+1) \\ \hat{\mathbf{w}}_{n+1} &:=\mathbf{P}_{n+1} \boldsymbol{\theta}_{n+1} \end{aligned}$$
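A sketch of the exponentially weighted recursion, written in the same style as the rls() sketch above (the default forgetting factor and $\delta$ are assumed values):

```python
import numpy as np

def ewrls(x, d, p, lam=0.99, delta=1e-3):
    """Exponentially weighted RLS with forgetting factor lam close to 1."""
    N = len(x)
    P = np.eye(p + 1, dtype=complex) / delta
    theta = np.zeros(p + 1, dtype=complex)
    w = np.zeros(p + 1, dtype=complex)
    for n in range(N):
        xn = np.array([x[n - k] if n - k >= 0 else 0.0 for k in range(p + 1)])
        Px = P @ np.conj(xn)
        # P_{n+1} = lam^{-1} P_n - lam^{-2} P_n x* x^T P_n / (1 + lam^{-1} x^T P_n x*)
        P = P / lam - np.outer(Px, xn @ P) / (lam ** 2 + lam * (xn @ Px))
        theta = lam * theta + np.conj(xn) * d[n]
        w = P @ theta
    return w
```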
