log(n) factor in the uniform convergence rate of smoothing estimators


Given i.i.d. data $\{(X_i,Y_i)\}_{i=1}^n$, the nonparametric “quantile regression” estimator at level $\tau \in (0,1)$ is the minimizer of the following empirical loss function

$\hat q_\tau(x) = \arg\min_q \sum_{i=1}^n \rho_\tau(Y_i - q) K\big(\frac{X_i-x}{h}\big)$,

where $\rho_\tau(u) = (\tau-\mathbf{1}(u \leq 0))u$ is the check loss, $h \to 0$ is a bandwidth, and $K \in L_1(\mathbb R)$ is a symmetric kernel with compact support. This framework is discussed in Qu and Yoon (2015), Journal of Econometrics, 185, 1-19.
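
A minimal sketch of this estimator in Python (assuming an Epanechnikov kernel; `local_quantile` and the toy data are my own names, not from Qu and Yoon). It uses the fact that the minimizer of the kernel-weighted check loss is a weighted $\tau$-quantile of the $Y_i$ with weights $K\big(\frac{X_i-x}{h}\big)$:

```python
import numpy as np

def epanechnikov(u):
    """Symmetric kernel with compact support [-1, 1]."""
    return 0.75 * (1.0 - u**2) * (np.abs(u) <= 1)

def local_quantile(x, tau, X, Y, h):
    """Local-constant kernel quantile regression estimate at a point x.

    The minimizer of sum_i rho_tau(Y_i - q) * K((X_i - x)/h) over q is a
    weighted tau-quantile of the Y_i with weights w_i = K((X_i - x)/h).
    """
    w = epanechnikov((X - x) / h)
    if w.sum() == 0:
        return np.nan                      # no observations in the local window
    order = np.argsort(Y)
    cum_w = np.cumsum(w[order])
    idx = np.searchsorted(cum_w, tau * w.sum())  # first Y_(j) whose cumulative weight reaches tau
    return Y[order][min(idx, len(Y) - 1)]

# toy usage: conditional median of Y given X = 0.5
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, 2000)
Y = np.sin(2 * np.pi * X) + rng.standard_normal(2000)
print(local_quantile(0.5, 0.5, X, Y, h=0.1))
```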

The sup convergence rate in the $x$ (covariate) direction usually involves a log factor,

\(\sup_{x}\lvert\hat q_{\tau}(x)-q_{\tau}(x)\rvert = O(\sqrt{\log(n)}\, a_n)\) at fixed $\tau$, where $a_n$ denotes the corresponding pointwise rate.

But when taking the sup in the $\tau$ direction, the $\sqrt{\log(n)}$ factor does not have to be there:

$\sup_\tau \lvert\hat q_\tau(x)-q_\tau(x)\rvert = O(a_n)$, at fixed $x$.

This has nontrivial meaning in statistical theory:

The reason why we have to pay a price for non-parametric estimation ‘in the $x$ direction’ is that the estimators for different values of $x$ are asymptotically independent. Thus the sup over $x$ behaves like the maximum over independent normals, which grows at rate $\sqrt{\log n}$.

In the ‘$\tau$ direction’, the estimators are not asymptotically independent and we have process convergence.

This is like estimating the density vs estimating the distribution function. If we estimate the distribution function, the scaled estimator converges as a process (classical Donsker theorem) and there is no price to pay for uniformity.

If we estimate the density, the estimator is ‘local’ and estimators at different points are asymptotically independent. This gives the additional $\sqrt{\log n}$ factor (intuitively, $\sqrt{\log n}$ is the order of the maximum of $n$ independent normals).

Does this make things a bit clearer?
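
As a quick numerical sanity check on the “maximum of independent normals” heuristic (a toy simulation of my own, not part of the original argument): the maximum of $n$ i.i.d. standard normals sits near $\sqrt{2\log n}$.

```python
import numpy as np

# maximum of n i.i.d. N(0,1) draws versus the sqrt(2 log n) benchmark
rng = np.random.default_rng(0)
for n in [10**2, 10**4, 10**6]:
    print(n, round(rng.standard_normal(n).max(), 2), round(np.sqrt(2 * np.log(n)), 2))
```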


Remarks:

(1) To understand the “asymptotic independence” for local density estimation, we may illustrate with a kernel density estimator. Let $\hat f(x) = (nh)^{-1} \sum_{i=1}^n K\big(\frac{X_i-x}{h}\big)$, where $h \to 0$ and $K \in L_1(\mathbb R)$ is symmetric with compact support. Since $h \to 0$, the data $\{X_i\}$ used to construct $\hat f(x)$ are eventually only those near $x$. If we take $x_1 \neq x_2$, then asymptotically ($h \to 0$, $n \to \infty$) there is no overlap between the localized data sets used to construct $\hat f(x_1)$ and $\hat f(x_2)$, so the two are asymptotically independent.
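
A toy Monte Carlo illustration of this remark (my own sketch, with a uniform kernel for simplicity): once the two estimation windows around $x_1$ and $x_2$ no longer overlap, the correlation between $\hat f(x_1)$ and $\hat f(x_2)$ across replications shrinks towards zero as $h \to 0$.

```python
import numpy as np

def kde_at(x, data, h):
    """Kernel density estimate at a point, using the uniform kernel K(u) = 1/2 on [-1, 1]."""
    return np.mean(np.abs(data - x) <= h) / (2 * h)

rng = np.random.default_rng(0)
x1, x2, n, reps = 0.4, 0.6, 2000, 500
for h in [0.3, 0.1, 0.02]:   # windows overlap for h = 0.3, are disjoint for the smaller h
    f1, f2 = np.empty(reps), np.empty(reps)
    for r in range(reps):
        data = rng.uniform(0, 1, n)
        f1[r], f2[r] = kde_at(x1, data, h), kde_at(x2, data, h)
    print(h, round(np.corrcoef(f1, f2)[0, 1], 3))
```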

(2) [Cont’d from 1.] The asymptotic independence of smoothing estimators is somewhat counter-intuitive, because we usually assume that the true function lies in a Hölder ball, which means the function is continuously differentiable with bounded derivatives up to a certain order. Since the function is continuous and even differentiable, the estimate at a given point should use information from neighboring points. But the smoothing estimator does exactly the opposite, localizing the estimation and treating the estimator at each point independently.

(3) [Cont’d from 1.] Simultaneous confidence bands for $\hat f(x)$ in the sense of Bickel and Rosenblatt (1973): this approach essentially brings the dependence in the $x$ direction back into the estimator. This is done via strong approximation, which approximates the difference $\sqrt{nh}\big(\hat f(x)-f(x)\big)$ by a Gaussian process $G(x)$ that is continuous in $x$ (the bias is made negligible by choosing a smaller $h$). Then $\sup_x \lvert G(x)\rvert$, after proper centering and scaling, converges to a certain type of extreme value distribution.
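
Schematically (suppressing the exact constants, which depend on the kernel and on $h$), the resulting limit is of Gumbel type,

$\Pr\Big( c_n\big(\sup_x \lvert G(x)\rvert - d_n\big) \le z \Big) \to \exp\big(-2e^{-z}\big)$, with $c_n,\, d_n \asymp \sqrt{2\log(1/h)}$,

so the sup over $x$ is itself of order $\sqrt{\log(1/h)}$, i.e. $\sqrt{\log n}$ for polynomial bandwidths, consistent with the rate above.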

(4) Let the empirical distribution function be $\hat F(x) = n^{-1} \sum_{i=1}^n \mathbf 1(X_i \leq x)$. It is known that $\sqrt{n}\,(\hat F(x)-F(x)) \Rightarrow B(x)$, where $B(x)$ is a Gaussian process (a Brownian bridge composed with $F$). Such weak convergence results are detailed in van der Vaart and Wellner (1996).
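
A toy simulation of my own, just to see the Donsker statement numerically: $\sqrt{n}\,\sup_x\lvert\hat F(x)-F(x)\rvert$ stays of the same order as $n$ grows (its limit is the Kolmogorov distribution), with no $\sqrt{\log n}$ inflation.

```python
import numpy as np

# median of sqrt(n) * sup_x |F_n(x) - x| for Uniform(0,1) data, across replications
rng = np.random.default_rng(0)
for n in [10**3, 10**4, 10**5]:
    stats = []
    for _ in range(100):
        X = np.sort(rng.uniform(0, 1, n))
        grid = np.arange(1, n + 1) / n
        ks = max(np.max(grid - X), np.max(X - (grid - 1 / n)))  # two-sided KS distance
        stats.append(np.sqrt(n) * ks)
    print(n, round(float(np.median(stats)), 3))
```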

(5) Suppose we interpolate a set of quantile estimators on a grid of $\tau$ values, namely $\{\hat q_{\tau_1}(x),\ldots,\hat q_{\tau_m}(x)\}$ for fixed $x$, where possibly $m \to \infty$ with $n$. Denoting the interpolated estimator by $\tilde q_\tau(x)$, it has the same convergence rate as its pointwise counterparts (see my paper with Stanislav Volgushev and Guang Cheng; we were actually focusing on sieve estimators there, but the argument is basically localization). Heuristically, this holds because the estimators $\{\hat q_{\tau_1}(x),\ldots,\hat q_{\tau_m}(x)\}$ are not asymptotically independent, and the interpolated estimator benefits from this dependence. Thus, the interpolation can be done without any additional cost.
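
A toy sketch of the interpolation (the crude uniform-kernel localization and all names here are my own simplifications, not the sieve construction from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
n, h, x0 = 5000, 0.1, 0.5
X = rng.uniform(0, 1, n)
Y = np.sin(2 * np.pi * X) + rng.standard_normal(n)

# pointwise quantile estimates on a coarse tau grid, using only data local to x0
local_Y = Y[np.abs(X - x0) <= h]
tau_grid = np.linspace(0.1, 0.9, 9)
q_grid = np.quantile(local_Y, tau_grid)

# linear interpolation across the grid gives a process in tau at no extra cost
def q_tilde(tau):
    return np.interp(tau, tau_grid, q_grid)

print(q_tilde(0.25), q_tilde(0.37))
```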

Conjectures (or overgeneralizations; no guarantee of accuracy)

(1) Can we come up with a criterion for when an additional factor $\sqrt{\log(n)}$ is necessary? For now, smoothing over i.i.d. data yields the $\log(n)$ factor in the sup error, while smoothing over (informatively) dependent data is free from the $\log(n)$ factor. Examples of the latter case are quantile regression in the $\tau$ direction and varying coefficient models; a reference is Zhu, Li and Kong (2012). Moreover, to construct a “simultaneous confidence band” (SCB), in the latter case it is equivalent to showing “weak convergence” to a Gaussian process, whereas in the former case one needs to go through the construction of Bickel and Rosenblatt (1973), which requires “strong approximation”.

(2) Can we improve the smoothing estimator by incorporating neighboring information? If we can somehow “reduce” the smoothing and account for the continuity of the true function, say, by constructing $\hat f(x_1)$ and $\hat f(x_2)$ from data sets that overlap to a certain degree, maybe we can improve the convergence rate and get rid of the $\log(n)$ term.