By Bolun Dai | Feb 27th 2021
This blog post is derived from the notes I took in the Mathematics of Deep Learning course taught by Joan Bruna at New York University.
A supervised learning problem usually consists of four components: the input domain $\mathcal{X} \subseteq \mathbb{R}^d$, the data distribution $\nu$ over $\mathcal{X}$, the target function $f^*: \mathcal{X} \to \mathbb{R}$, and the loss function $\ell$. The input domain $\mathcal{X}$ is the space the inputs live in, and the data distribution $\nu$ describes how likely we are to see each input; that is, the data are sampled from the data distribution, $x \sim \nu$. Here the loss $\ell(\hat{y}, y)$ measures how far a prediction $\hat{y}$ is from the target value $y$. The goal of supervised learning is to predict the target function $f^*$ well on average over $\nu$, i.e., to find a function $f$ with small population risk

$$R(f) = \mathbb{E}_{x \sim \nu}\big[\ell\big(f(x), f^*(x)\big)\big].$$
Since we cannot know exactly how well an algorithm will work in practice (the true “risk”), because we don’t know the true distribution of data that the algorithm will work on, we can instead measure its performance on a known set of training data (the “empirical” risk) (Wikipedia). Given training samples $\{(x_i, y_i)\}_{i=1}^{n}$ with $x_i \sim \nu$ and $y_i = f^*(x_i)$, the empirical risk is defined as

$$\hat{R}_n(f) = \frac{1}{n} \sum_{i=1}^{n} \ell\big(f(x_i), y_i\big).$$
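To make the two notions concrete, here is a small numpy sketch of my own (the Gaussian input distribution, the tanh target, and the squared loss are all assumptions made just for this example): it computes the empirical risk of a fixed predictor on a small training set and compares it to a Monte Carlo estimate of the population risk.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed setup for illustration: d-dimensional Gaussian inputs,
# a fixed target function f_star, and the squared loss.
d = 5
theta_star = rng.normal(size=d)
f_star = lambda X: np.tanh(X @ theta_star)      # target function f*
loss = lambda y_hat, y: (y_hat - y) ** 2        # squared loss

def empirical_risk(f, X, y):
    """Average loss of f over a finite sample: (1/n) * sum loss(f(x_i), y_i)."""
    return loss(f(X), y).mean()

# A candidate predictor f (here: a linear model with slightly wrong weights).
f_hat = lambda X: X @ (theta_star + 0.3)

# Small training set -> empirical risk R_n(f).
X_train = rng.normal(size=(50, d))
R_n = empirical_risk(f_hat, X_train, f_star(X_train))

# Huge sample -> Monte Carlo proxy for the population risk R(f).
X_big = rng.normal(size=(500_000, d))
R = empirical_risk(f_hat, X_big, f_star(X_big))

print(f"empirical risk (n=50): {R_n:.4f}")
print(f"population risk (approx.): {R:.4f}")
```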
Instead of looking at any function whatsoever, we restrict the search to a hypothesis class $\mathcal{F}$, where $(\mathcal{F}, \|\cdot\|)$ is a normed space. By saying it is a normed space we are saying a measurement of length exists; in this case the “length” measurement denotes the complexity of the function. Thus, we can define the ball of radius $\delta$,

$$\mathcal{F}_\delta = \{f \in \mathcal{F} : \|f\| \le \delta\},$$

where $\delta > 0$ controls how complex the functions are allowed to be. Since it is a convex set, we can then look for functions only within this ball.
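In practice, one common way to enforce such a ball constraint is to project the model parameters back into the ball whenever they leave it (this is the projection step of projected gradient descent). The sketch below is my own illustration, assuming the Euclidean norm of a weight vector stands in for the function norm; the radius and the weights are made up.

```python
import numpy as np

def project_onto_ball(w, delta):
    """Project w onto the set {w : ||w||_2 <= delta} (a convex set)."""
    norm = np.linalg.norm(w)
    if norm <= delta:
        return w                     # already inside the ball
    return w * (delta / norm)        # rescale onto the boundary

delta = 1.0
w = np.array([3.0, -4.0])            # norm 5, outside the ball
w_proj = project_onto_ball(w, delta)
print(w_proj, np.linalg.norm(w_proj))  # lies on the boundary: norm == 1.0
```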
Now the principle of structural risk minimization can be introduced. When fitting a function to a dataset, overfitting can occur; this usually happens when the learned function has high complexity. When overfitting occurs, the learned model performs very well on the training set but generally performs badly on the test set. Therefore, to avoid overfitting, we would like the learned model to have low complexity while also minimizing the training error. These two objectives can be combined in several ways. The first is the constrained form

$$\min_{f \in \mathcal{F}} \hat{R}_n(f) \quad \text{s.t.} \quad \|f\| \le \delta,$$

where the empirical risk is minimized while restricting the complexity of the learned model. We can also write it in the penalized form

$$\min_{f \in \mathcal{F}} \hat{R}_n(f) + \lambda \|f\|,$$

which penalizes models with large complexity, or the interpolating form

$$\min_{f \in \mathcal{F}} \|f\| \quad \text{s.t.} \quad \hat{R}_n(f) = 0.$$

Note that the interpolating form makes sense only when the data has no false labels.
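For linear models with the squared loss, and the Euclidean norm of the weights playing the role of the complexity measure, the penalized and interpolating forms have closed-form solutions, which makes a compact illustration possible. This is only a sketch under those assumptions, on synthetic data, not the general setting of the notes.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 10, 30                          # fewer samples than parameters
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d)             # clean (noiseless) linear labels

# Penalized form: min_w (1/n)*||Xw - y||^2 + lam*||w||^2   (ridge regression).
lam = 0.5
w_pen = np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

# Interpolating form: min_w ||w||  s.t.  Xw = y.  With n < d the system is
# underdetermined and the minimum-norm solution is given by the pseudo-inverse.
w_int = np.linalg.pinv(X) @ y

print("penalized:     norm =", np.linalg.norm(w_pen),
      " train error =", np.mean((X @ w_pen - y) ** 2))
print("interpolating: norm =", np.linalg.norm(w_int),
      " train error =", np.mean((X @ w_int - y) ** 2))
```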
Suppose using the constrained form we found a function $\hat{f} \in \mathcal{F}_\delta$ satisfying

$$\hat{R}_n(\hat{f}) \le \min_{f \in \mathcal{F}_\delta} \hat{R}_n(f) + \epsilon.$$

This is saying the empirical loss of the candidate function $\hat{f}$ is at most $\epsilon$ worse than the best empirical loss achievable within the ball; the slack $\epsilon$ is the optimization error. We now see why the quantity we actually care about, the excess risk $R(\hat{f}) - R(f^*)$, splits into several distinct sources of error.
Assume we have, in addition to the target $f^*$ and our candidate $\hat{f}$, the best function within the ball, $f_\delta = \arg\min_{f \in \mathcal{F}_\delta} R(f)$. Two situations can then occur, depending on whether the target $f^*$ itself lies inside the ball $\mathcal{F}_\delta$ or not. For the case where $f^*$ falls outside the ball, we have a gap between $f_\delta$ and $f^*$ that no amount of data or optimization can close, which is denoted by the right figure. For the case where $f^*$ falls inside the ball, we have $f_\delta = f^*$ and that gap disappears, which is denoted by the left figure. In both cases the excess risk of $\hat{f}$ can be bounded in the same way. Thus we have

$$R(\hat{f}) - R(f^*) \;\le\; \underbrace{R(f_\delta) - R(f^*)}_{\epsilon_{\text{app}}} \;+\; \underbrace{2 \sup_{f \in \mathcal{F}_\delta} \big|R(f) - \hat{R}_n(f)\big|}_{2\,\epsilon_{\text{stat}}} \;+\; \underbrace{\epsilon}_{\epsilon_{\text{opt}}}.$$
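One standard way to see where this bound comes from is to telescope the excess risk through the empirical risks of $\hat{f}$ and of the best-in-ball function $f_\delta$:

$$
\begin{aligned}
R(\hat{f}) - R(f^*) &= \big[R(\hat{f}) - \hat{R}_n(\hat{f})\big] + \big[\hat{R}_n(\hat{f}) - \hat{R}_n(f_\delta)\big] + \big[\hat{R}_n(f_\delta) - R(f_\delta)\big] + \big[R(f_\delta) - R(f^*)\big] \\
&\le \epsilon_{\text{stat}} + \epsilon_{\text{opt}} + \epsilon_{\text{stat}} + \epsilon_{\text{app}}.
\end{aligned}
$$

The first and third brackets are bounded by the uniform deviation $\epsilon_{\text{stat}}$ because both $\hat{f}$ and $f_\delta$ lie in $\mathcal{F}_\delta$, and the second bracket is at most $\epsilon$ by the definition of $\hat{f}$.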
Now let’s take a look at each of the three errors: the approximation error $\epsilon_{\text{app}}$, the statistical error $\epsilon_{\text{stat}}$, and the optimization error $\epsilon_{\text{opt}}$. The first term is the approximation error of the set of functions $\mathcal{F}_\delta$ with respect to the target,

$$\epsilon_{\text{app}} = \inf_{f \in \mathcal{F}_\delta} R(f) - R(f^*),$$

which decreases as the radius $\delta$ grows, since a larger ball contains functions that are closer to $f^*$. The second term is the statistical error,

$$\epsilon_{\text{stat}} = \sup_{f \in \mathcal{F}_\delta} \big|R(f) - \hat{R}_n(f)\big|,$$

where the supremum is taken over the whole ball because we do not know beforehand which function the learning algorithm will return. From the law of large numbers we know that when the number of samples $n$ goes to infinity, the empirical risk of any fixed function converges to its population risk, so the statistical error shrinks as $n$ grows. The last term, the optimization error $\epsilon_{\text{opt}} = \epsilon$, measures how far our optimizer's output is from the best empirical fit within the ball.
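To see the effect of the sample size concretely, here is a small simulation of my own (a fixed linear predictor, squared loss, and Gaussian inputs are all assumptions for the example) that measures the gap between the empirical and the population risk of one fixed function as $n$ grows; the gap shrinks roughly like $1/\sqrt{n}$.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 5
theta_star = rng.normal(size=d)
theta_hat = theta_star + 0.3     # a fixed (slightly wrong) linear predictor

# For standard Gaussian inputs and squared loss, the population risk of this
# predictor is exactly E[<x, theta_hat - theta_star>^2] = ||theta_hat - theta_star||^2.
R = np.sum((theta_hat - theta_star) ** 2)

def risk_gap(n, n_trials=200):
    """Average |R_n(f) - R(f)| over repeated draws of n training samples."""
    gaps = []
    for _ in range(n_trials):
        X = rng.normal(size=(n, d))
        R_n = np.mean((X @ theta_hat - X @ theta_star) ** 2)
        gaps.append(abs(R_n - R))
    return np.mean(gaps)

for n in [10, 100, 1000, 10000]:
    print(f"n = {n:>5}: E|R_n - R| ~ {risk_gap(n):.4f}")
```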
This section will discuss how the approximation, statistical, and optimization errors are affected by the dimension $d$ of the input domain, i.e., the curse of dimensionality.
First, let’s talk about the statistical curse. Assume we observe $n$ i.i.d. samples $\{(x_i, y_i)\}_{i=1}^{n}$ with $x_i \sim \nu$, $x_i \in \mathbb{R}^d$ and $y_i = f^*(x_i)$. Suppose the target function is linear, $f^*(x) = \langle \theta^*, x \rangle$ for some $\theta^* \in \mathbb{R}^d$. Now the function is parametrized by a $d$-dimensional vector, so its complexity grows only linearly with the dimension. From our knowledge of linear algebra we know that as long as we observe $d$ linearly independent inputs, the system of equations $y_i = \langle \theta, x_i \rangle$ pins down $\theta^*$ exactly, so on the order of $d$ samples are enough.
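A quick numerical check of this claim (my own sketch, with made-up Gaussian inputs): with exactly $d$ linearly independent samples of a noiseless linear target, solving the resulting $d \times d$ linear system recovers the parameter vector to machine precision.

```python
import numpy as np

rng = np.random.default_rng(3)
d = 20
theta_star = rng.normal(size=d)

# Exactly d samples; Gaussian inputs are linearly independent with probability 1.
X = rng.normal(size=(d, d))
y = X @ theta_star                 # noiseless linear target f*(x) = <theta*, x>

theta_hat = np.linalg.solve(X, y)  # solve the d x d linear system
print("recovery error:", np.linalg.norm(theta_hat - theta_star))  # numerically zero
```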
What about functions that are only locally linear? A function that is locally linear can be defined as one that is well approximated by a linear function around every point, $f(x) \approx f(x_0) + \langle \nabla f(x_0), x - x_0 \rangle$ for $x$ close to $x_0$, which is the same as being differentiable. For this class of functions a good candidate for determining the complexity is something like how wildly the function is allowed to vary: we don’t want the function to be very curly. To describe this mathematically we can use the Lipschitz norm

$$\|f\|_{\text{Lip}} = \sup_{x \neq x'} \frac{|f(x) - f(x')|}{\|x - x'\|} + \sup_{x} |f(x)|.$$
I am not sure why the second term is there, but that does not interfere with the intuition provided. Just to restate the goal: we want to know how many samples are needed before the statistical error over the Lipschitz ball,

$$\epsilon_{\text{stat}} = \sup_{\|f\|_{\text{Lip}} \le \delta} \big|R(f) - \hat{R}_n(f)\big|,$$

becomes small, and assume the $n$ samples are drawn i.i.d. from the data distribution $\nu$. If we take the expectation of the square on both sides, the resulting quantity has the same representation as the Wasserstein distance between the real data distribution $\nu$ and the empirical distribution $\hat{\nu}_n = \frac{1}{n}\sum_{i=1}^{n}\delta_{x_i}$: by Kantorovich–Rubinstein duality, the supremum of $|\mathbb{E}_{\nu} f - \mathbb{E}_{\hat{\nu}_n} f|$ over functions with Lipschitz constant at most one is exactly $W_1(\nu, \hat{\nu}_n)$, and this distance is known to decay only like $n^{-1/d}$. Also, using maximum discrepancy (which we did not go into in detail), one can establish that the lower bound is also of order $n^{-1/d}$, so reaching an error of $\epsilon$ requires roughly $n \approx \epsilon^{-d}$ samples.
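The $n^{-1/d}$ rate is easy to feel numerically. The sketch below is my own, with data assumed uniform on the unit cube: it measures the typical distance from a fresh query point to its nearest training sample, which is what controls how well a Lipschitz function can be pinned down from the samples; this distance shrinks only like $n^{-1/d}$, so keeping it fixed requires a number of samples that grows exponentially with $d$.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(4)

def mean_nn_distance(n, d, n_queries=1000):
    """Average distance from a random query in [0,1]^d to its nearest of n samples."""
    samples = rng.uniform(size=(n, d))
    queries = rng.uniform(size=(n_queries, d))
    dist, _ = cKDTree(samples).query(queries)   # nearest-neighbour distances
    return dist.mean()

for d in [1, 2, 5, 10]:
    row = "  ".join(f"n={n:>5}: {mean_nn_distance(n, d):.3f}" for n in [100, 1000, 10000])
    print(f"d = {d:>2} | {row}")
```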
To summarize, for linear functions we would only need a small number of samples, but this set of functions can be too restrictive, while for Lipschitz functions we would need exponentially more samples, which is simply too many. Therefore, the majority of the functions discussed in this set of notes will lie somewhere between these two categories.
Now let’s talk about the curse of dimensionality in approximation. To state the results up front: for smooth functions we can get a good approximation with small neural networks, whereas for functions that are not very smooth we would need an infinite number of neurons.
For the curse of dimensionality in optimization, we would also need to grid the domain, which requires a number of operations that is exponential in the dimension. One set of functions that breaks this curse is convex functions.
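To make the contrast concrete, here is a sketch of my own on an assumed strongly convex quadratic objective: a grid with spacing $\epsilon$ over the unit cube would need on the order of $(1/\epsilon)^d$ points, which is astronomical already for $d = 100$, while plain gradient descent gets close to the minimizer without ever enumerating the domain.

```python
import numpy as np

rng = np.random.default_rng(5)
d, eps = 100, 0.1

# Gridding [0,1]^d with spacing eps requires on the order of (1/eps)^d points.
print(f"grid points needed in d={d}: ~{(1.0 / eps) ** d:.1e}")   # ~1e100

# A strongly convex objective: f(w) = 0.5*||A w - b||^2 + 0.5*mu*||w||^2.
A = rng.normal(size=(d, d)) / np.sqrt(d)
b = rng.normal(size=d)
mu = 0.1
grad = lambda w: A.T @ (A @ w - b) + mu * w

# Plain gradient descent: the dimension only affects the per-step cost.
w = np.zeros(d)
step = 0.2
for _ in range(1000):
    w -= step * grad(w)

w_opt = np.linalg.solve(A.T @ A + mu * np.eye(d), A.T @ b)   # exact minimizer
print("distance to the true minimizer:", np.linalg.norm(w - w_opt))
```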