# Lecture 11 – Overfitting

ANNOUNCER: The following program is brought to you by Caltech.

YASER ABU-MOSTAFA: Welcome back. Last time, we introduced neural

networks, and we started with multilayer perceptrons, and the idea

is to combine perceptrons using logical operations like OR’s and AND’s, in

order to be able to implement more sophisticated boundaries than the

simple linear boundary of a perceptron. And we took a final example, where we

were trying to implement a circle boundary in this case, and

we realized that we can actually do this– at least approximate it– if we have a sufficient

number of perceptrons. And we convinced ourselves that combining

perceptrons in a layered fashion will be able to implement

more interesting functionalities. And then we faced the simple problem

that, even for a single perceptron, when the data is not linearly separable,

the optimization– finding the boundary based on data– is a pretty difficult optimization

problem. It’s combinatorial optimization. And therefore, it is next to hopeless

to try to do that for a network of perceptrons. And therefore, we introduced neural

networks that came in as a way of having a nice algorithm for multilayer

perceptrons, by simply softening the threshold.

Instead of having it just go from

-1 to +1, it would go from -1 to +1 gradually

using a sigmoid function, in this case, the tanh. And if the signal that is

given by this amount– the usual signal that goes

into the perceptron– is large, large negative, or large positive, the tanh approximates -1 or +1. So we get the decision

function we want. And if s is very small, this is almost linear– tanh(s) is approximately s. The most important aspect about it

is that it’s differentiable– it’s a smooth function, and therefore the dependency

of the error in the output on the parameters w_ij will be a well-behaved

function, for which we can apply things like gradient descent. And the neural network

looks like this. It starts with the input, followed by

a bunch of hidden layers, followed by the output layer.
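The soft threshold just described is easy to see numerically. A minimal sketch (the specific test values below are illustrative choices, not from the lecture):

```python
import numpy as np

# The hard threshold of a perceptron versus the soft threshold of a neuron.
def hard_threshold(s):
    return np.sign(s)          # jumps from -1 to +1

def soft_threshold(s):
    return np.tanh(s)          # goes from -1 to +1 gradually, and is differentiable

large = soft_threshold(10.0)   # large positive signal: tanh approximates +1
small = soft_threshold(0.01)   # small signal: tanh is almost linear, ~ s itself
```

For small s the output is essentially s, and for large |s| it saturates at the decision values, which is exactly the behavior the lecture describes.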

And we spent some time trying to argue

about the function of the hidden layers, and how they transform the

inputs into a particularly useful nonlinear transformation, as far as

implementing the output is concerned, and the question of interpretation. And then we introduced the

backpropagation algorithm, which is applying stochastic gradient descent

to neural networks. Very simply, it decides on the direction

along every coordinate in the w space, using the very simple

rule of gradient descent. And in this case, you only

need two quantities. One of them is x_i, that was implemented

using this formula, the forward formula, so to speak, going from

layer l minus 1 to layer l. And then there is another quantity

that we defined, which was called delta, that is computed backwards. You start from layer l, and then

go to layer l minus 1. And the formula is strikingly similar

to the formula in the forward thing, but instead of the nonlinearity

being applied, you multiply by something. And once you get all the delta’s and x’s

by a forward and a backward run, then you simply can decide on the move

in every weight, according to this very simple formula that involves

the x’s and the delta’s.
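The forward run for the x's, the backward run for the delta's, and the per-weight update can be sketched for a single hidden layer. Everything here (layer sizes, the learning rate eta, the data point) is an illustrative assumption, not the lecture's exact setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# One backpropagation step: single hidden layer, tanh activations, squared error.
W1 = rng.normal(scale=0.5, size=(3, 4))   # input layer (3) -> hidden layer (4)
W2 = rng.normal(scale=0.5, size=(4, 1))   # hidden layer -> output (1)
eta = 0.01                                # small illustrative learning rate
x0 = rng.normal(size=3)                   # one training input
y = np.array([0.5])                       # its target output

def forward(W1, W2, x0):
    x1 = np.tanh(W1.T @ x0)               # forward formula: the x's, layer by layer
    x2 = np.tanh(W2.T @ x1)
    return x1, x2

x1, x2 = forward(W1, W2, x0)
err_before = float((x2[0] - y[0]) ** 2)

# Backward run: the delta's, from the output layer back
delta2 = 2 * (x2 - y) * (1 - x2 ** 2)     # output-layer delta
delta1 = (1 - x1 ** 2) * (W2 @ delta2)    # propagated back through W2

# Every weight moves according to the simple formula: w_ij -= eta * x_i * delta_j
W2 = W2 - eta * np.outer(x1, delta2)
W1 = W1 - eta * np.outer(x0, delta1)

_, x2_new = forward(W1, W2, x0)
err_after = float((x2_new[0] - y[0]) ** 2)
```

One forward run and one backward run give all the x's and delta's, and then every weight update is a single product, which is where the efficiency comes from.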

And the simplicity of the

backpropagation algorithm, and its efficiency, are the reasons why neural

networks have become very popular as a standard tool of implementing functions

that need machine learning in industry, for quite some time now. Today, I’m going to start

a completely new topic. It’s called overfitting, and it will

take us three full lectures to cover overfitting and the techniques

that go with it. And the techniques are very important,

because they apply to almost any machine learning problem that

you’re going to see. And they are applied on top of any

algorithm or model you use. So you can use neural networks

or linear models, et cetera. But the techniques that we’re going to

use here, which are regularization and validation, apply to all

of these models. So this is another layer of techniques

for machine learning. And overfitting is a very important topic. It is fair to say that

the ability to deal with overfitting is what separates professionals from

amateurs in machine learning.

Everybody can fit, but if you know what

overfitting is, and how to deal with it, then you have an edge that

someone who doesn’t know the fundamentals would not be

able to comprehend. So the outline today is, first, we are

going to start– what is the notion? What is overfitting? And then we are going to identify

the main culprit for overfitting, which is noise. And after observing some experiments, we

will realize that noise covers more territory than we thought. There’s another type of noise,

which we are going to call deterministic noise. It’s a novel notion that is very

important for overfitting in machine learning, and we’re going to

talk about it a little bit. And then, very briefly, I’m going to

give you a glimpse into the next two lectures by telling you how

to deal with overfitting. And then we will be ready, having diagnosed what the problem

is, to go for the cures– regularization next time, and validation

the time after that. OK. Let’s start by illustrating the

situation where overfitting occurs. So let’s say we have a simple

target function. Let’s take it to be a 2nd-order

target function, a parabola.

So my input space is the real numbers. I have only a scalar input x. And there’s a value y, and I have

this target that is 2nd-order. We are going to generate five data

points from that target, in order to learn from. This is an illustration. Let’s look at the five data points. As you see, the data points look like

they belong to the curve, but they don’t seem to belong perfectly

to the curve.

So there must be noise, right? This is a noisy case, where

the target itself– the deterministic part of the target

is a function, and then there is added noise. It’s not a lot of noise, obviously–

very small amount. Nonetheless, it will

affect the outcome. So we do have a noisy

target in this case. Now, suppose I just tell you that you have five points, which is the case you face when you learn. The target disappears; I have five points, and you want to fit them. Going back to your math, you realize,

I want to fit five points. Maybe I should use– a 4th-order

polynomial will do it, right? You have five parameters. So let’s fit it with

4th-order polynomial. This is the guy who doesn’t know

machine learning, by the way. So I say, I’m going to use

the 4th-order polynomial.
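This situation can be sketched in a few lines: a parabola target, five slightly noisy points, and a 4th-order fit with its five parameters. The particular target, noise level, and seed here are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)

f = lambda x: x ** 2                          # the 2nd-order target
X = np.linspace(-1.0, 1.0, 5)
y = f(X) + 0.05 * rng.normal(size=5)          # a very small amount of noise

c4 = np.polyfit(X, y, deg=4)                  # 5 parameters, 5 points: interpolation
E_in = np.mean((np.polyval(c4, X) - y) ** 2)  # zero, up to roundoff

# Out of sample, compare the fit to the true (noiseless) curve
X_test = np.linspace(-1.0, 1.0, 200)
E_out = np.mean((np.polyval(c4, X_test) - f(X_test)) ** 2)
```

E_in comes out numerically zero while E_out does not, because the interpolant has bent itself to reproduce the noise in the five samples.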

And what will the fit look like? Perfect fit, in sample. And you measure your quantities. The first quantity is E_in. Success! We achieved zero training error. And then when you go for the

out-of-sample, you are comparing the red curve to the blue curve, and

the news is not good. I’m not even going to calculate

it, it’s just huge. This is a familiar situation

for us, and we know what the deal is. The point I want to make here is that,

when you say overfitting, overfitting is a comparative term. It must be that one situation

is worse than another. You went further than you should. And there is a distinction between

overfitting and just bad generalization. So the reason I’m calling this

overfitting is because, if you use, let’s say, a 3rd-order polynomial, you

will not be able to achieve zero training error, in general. But you will get a better E_out. Therefore, the overfitting here happened

by using the 4th order instead of the 3rd order.

You went further. That’s the key. And that point is made even more clearly

when you talk about neural networks and overfitting

within the same model. In the case of overfitting with

3rd-order polynomial versus 4th-order polynomial, you are comparing

two models. Here, I’m going to take just neural

networks, and I’ll show you how overfitting can occur within

the same model.

So let’s say we have a neural network,

and it is fitting noisy data. That’s a typical situation. So you run your backpropagation

algorithm with several epochs, and you plot what happens to E_in,

and you get this curve. Can you see this curve at all? Let me try to magnify it, hoping

that it will become clearer. A little bit better. This is the number of epochs.

You start from an initial condition,

a random vector. And then you run stochastic gradient

descent, and evaluate the total E_in at the end of every epoch,

and you plot it. And it goes down. It doesn’t go to zero. The data is noisy. You don’t have enough parameters

to fit it perfectly. But this looks like a typical situation,

where E_in goes down. Now, because this is an experiment, you

have set aside a test set that you did not use in training. And what you are going to do, you are

going to take this test set and evaluate what happens out-of-sample. Not only at the end but as you go. Just to see, as I train, am I making

progress out-of-sample or not? You’re making

progress in the sample.

So you plot the out-of-sample,

and this is what you get. So this is estimated by a test set. Now, there are many things you can say

about this curve, and one of them is, in the beginning when you start

with a random w, even though you’re using a full

neural network when you evaluate this point, you have only one

hypothesis that does not depend on the data set. This is the random w that you got. So it’s not a surprise that E_in and E_out

are about the same value here. Because they are floating around. As you go down the road, and start

exploring the weight space by going from one iteration to the next, you’re

exploring more and more of the space of weights.

So you are getting the benefit, or the

harm, of having the full neural network model, gradually. In the beginning here, you are only

exploring a small part of the space. So if you can think of an effective

VC dimension as you go, if you can define that, then there is

an effective VC dimension that is growing with time until it gets– after you have explored the whole

space, or at least potentially explored the whole space, if you

had different data sets– then you have the effective VC

dimension, will be the total number of free parameters in the model. So the generalization error, which is

the difference between the red and green curve, is getting

worse and worse. That’s not a surprise. But there is a point, which is

an important point here, which happens around here. Let me now shrink this back, now that

you know where the curves are.

And let’s look at where

overfitting occurs. Overfitting occurs when you knock down

E_in, so you get a smaller E_in, but E_out goes up. If you look at these curves, you will

realize that this is happening around here. Now there is very little, in terms of the

difference in generalization error, before the blue line and

after the blue line. Yet I am making a specific distinction,

that crossing this boundary went into overfitting. Why is that? Because up till here, I can always

reduce the E_in, and even though E_out is following suit with

very diminishing returns, it’s still a good idea to minimize E_in. Because you are getting smaller E_out. The problems happen when you cross,

because now you think you’re doing well, you are reducing E_in, and you are

harming the performance.

That’s what needs to be taken care of. So that’s where overfitting occurs. In this situation, it might be a very

good idea to be able to detect when this happens, and simply stop at that

point and report that, instead of reporting the final hypothesis

you will get after all the iterations, right? Because in this case, you’re going to get

this E_out instead of that E_out, which is better. And indeed, the algorithm that goes with

that is called early stopping. And it will be based on validation. And although it’s based on validation,

it is a form of regularization, in the sense of putting on the brakes. So now we can see the relative

aspect of overfitting. Overfitting can happen when you compare

two things, whether the two things are two different models or two

instances within the same model. We look at this and say that if

there is overfitting, we’d better be able to detect it, to stop

earlier than we would otherwise because otherwise, we will

be harming ourselves.
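The early stopping idea just described can be sketched concretely: after every epoch of gradient descent, evaluate an estimate of E_out on a set-aside validation set, and report the weights from the best epoch rather than the last one. The target, model, and learning rate here are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

f = lambda x: np.sin(np.pi * x)                          # illustrative target
X_train = rng.uniform(-1, 1, 20)
y_train = f(X_train) + 0.2 * rng.normal(size=20)         # noisy training set
X_val = rng.uniform(-1, 1, 20)
y_val = f(X_val) + 0.2 * rng.normal(size=20)             # set-aside validation set

def features(x, d=10):
    return np.vander(x, d + 1, increasing=True)          # overparameterized model

Z_train, Z_val = features(X_train), features(X_val)
w = np.zeros(Z_train.shape[1])
eta, best_err, best_w, val_curve = 0.05, np.inf, None, []

for epoch in range(500):
    grad = 2 * Z_train.T @ (Z_train @ w - y_train) / len(y_train)
    w -= eta * grad                                      # one epoch of descent
    val_err = np.mean((Z_val @ w - y_val) ** 2)          # estimate of E_out
    val_curve.append(val_err)
    if val_err < best_err:                               # remember the best weights
        best_err, best_w = val_err, w.copy()

# Early stopping: report best_w, not the final w after all iterations.
```

The reported hypothesis is the one at the minimum of the validation curve, which is where the E_out curve in the figure bottoms out before overfitting sets in.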

So this is the main story. Now let’s look at what is overfitting

as a definition, and what is the culprit for it. Overfitting, as a criterion,

is the following. It’s fitting the data more

than is warranted. And this is a little bit strange. What would be more than is warranted? I mean, we are in machine learning. We are in the business of fitting data. So I can fit the data. I keep fitting it. But there comes a point, where

this is no longer good.

Why does this happen? What is the culprit? The culprit, in this case, is that you’re

fitting the noise. The data has noise in it, and you are

trying to look at the finite sample set that you got, and you’re

trying to get it right. In trying to get it right, you are

inadvertently fitting the noise. This is understood. I can see that this is not good.

At least, it’s not useful at all. Fitting the noise, there’s no pattern to

detect in the noise, so fitting the noise cannot possibly

help me out-of-sample. However, if it was only just useless,

we would be OK. We wouldn’t be having this lecture. Because you think, I give

the data, the data has the signal and the noise. I cannot distinguish between them.

I just get x and get y.

y has a component which is a signal, and a component which is noise, but I get just

one number. I cannot distinguish between the two. And I am fitting them. And now I’m going to fit the noise. Let’s look at it this way. I’m in the business of fitting. I cannot distinguish the two. Fitting the noise is the

cost of doing business. If it’s just useless, I wasted some

effort, but nothing bad happened. The problem is

that it’s harmful. It’s not a question of being useless,

and that’s a big difference. Because machine learning

is machine learning. If you fit the noise in-sample, the

learning algorithm gets a pattern. It imagines a pattern and extrapolates

that out-of-sample. So based on the noise, it gives you something

out-of-sample and tells you this is the pattern in the data– which it isn’t. And that will worsen your

out-of-sample because it’s taking you away from the correct solution. So you can think of the learning

algorithm in this case, when detecting a pattern that doesn’t exist, the learning algorithm is hallucinating.

Oh, there’s a great pattern, and this is

what it looks like, and it reports it, and eventually, obviously that

imaginary thing ends up hurting the performance. So let’s look at a case study. The main reason for the case study,

is that we now vaguely understand that noise is the problem, so let’s see how the noise affects the situation. Can we get overfitting without noise? What is the deal? So I’m going to give you

a specific case.

I’m going to start with

a 10th-order target. 10th-order target means

10th-order polynomial. I’m always working on

the real numbers. The input is a scalar, and I’m

defining polynomials based on that, I’m going to take

10th-order target. The 10th-order target, one

of them looks like this. You choose the coefficient somehow,

and you get something like that. A fairly elaborate thing.

And then you generate data, and the data

will be noisy because we want to investigate the impact of

noise on overfitting. Let’s say I’m going to generate

15 data points in this case. So this is what you get. Let’s look at these points. The noise here is not as trivial

as it was last time. There’s a difference.

These are not lying on the curve. So there is a noise that is

contributing to that. Now the other guy, which is a 50th order, is noiseless. That is, I’m going to generate

a 50th-order polynomial, so it’s much more elaborate than the

blue curve here, but I’m not going to add noise to it. I’m going to generate also 15 points

from this guy, but the 15 points, as you will see, perfectly

lie on the curve. This is all of them here. So this is the data, this

is the target, and the data lies on the target.
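The two data sets can be sketched as follows. The coefficients are drawn at random here purely for illustration (later in the lecture the targets are constructed more carefully, via Legendre polynomials), and the noise level is an illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(3)

f10 = np.polynomial.Polynomial(rng.normal(size=11))   # a 10th-order target
f50 = np.polynomial.Polynomial(rng.normal(size=51))   # a 50th-order target

X = np.sort(rng.uniform(-1, 1, 15))                   # 15 data points each

y_noisy = f10(X) + 0.3 * rng.normal(size=15)          # noisy low-order target
y_clean = f50(X)                                      # noiseless: points lie on the curve
```

The first data set scatters around its curve because of the added noise; the second lies exactly on its (much more elaborate) curve.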

These are two interesting cases. One of them is a simple

target, so to speak. Added noise, that makes it complicated. This one is complicated

differently. It’s a high-order target, to begin

with, but there is no noise. These are the two cases that I’m

going to try to investigate overfitting. We are going to have two different

fits for each target. We are in the business of overfitting. We have to have comparative models. So I’m going to have two models

to fit every case. And see if I get overfitting

here, and I get it here. This is the first guy

that we saw before. The simple target with noise. And this guy is the other one, which is

the complex target without noise. 10th-order, 50th-order. We’ll just refer to them as a noisy

low-order target and a noiseless high-order target. This is what we want to learn. Now, what are we going to learn? We’re going to learn with two models. One of them is the same thing–

we have a 2nd-order polynomial that we’re going to use to

fit.

That’s our model. And we’re going to have

a 10th-order polynomial. These are the two guys that

we are going to use. Here’s what happens with

the 2nd-order fit. You have the data points, and you fit

them, and it’s not surprising. For the 2nd order, it’s a simple

curve, and it tries to find a compromise. Here we are

applying mean squared error, so this is what you get. Now, let’s analyze the performance

of this fellow. What I’m going to list here, as you see,

I’m going to say, what is the in-sample error, what is the out-of-sample

error, for the 2nd order which is already here, and the 10th order,

which I haven’t shown yet.

The in-sample error

in this case is 0.05. This is a number. It depends

on the scale. It’s some number. When you get the out-of-sample version,

not surprisingly, it’s bigger, because this one fits the data. The other one is out-of-sample,

so it’s going to be bigger. But the difference is not dramatic, and

this is the performance you get. Now let’s apply the 10th-order fit. You already foresee what

a problem can exist here.

The red curve sees the data, tries to

fit that, and uses all the degrees of freedom it has– it has 11 of them–

and then it gets this guy. And when you look at the in-sample

error, obviously the in-sample error must be smaller than

the in-sample error here. You have more to fit with, and you fit it better, so you get a smaller in-sample error.

And what is out-of-sample error? Just terrible. So this is patently

a case of overfitting. When you went from 2nd order to

10th order, the in-sample error indeed went down. The out-of-sample error went up. Way up. So you say, this confirms

what we have said before. We are fitting the noise. And you can see here that you’re

fitting the noise. You can see the red curve is trying to

go for these guys, and you know that these guys are off the target. Therefore, the red curve is bending

particularly, to capture something that is noise.

So this is the case. Here it’s a little bit strange because

here we don’t have any noise. And we also have the same models.

We’re going to take the same two models. We have 2nd order and 10th

order, fitting here. Let’s see how they perform here. Well, this is the 2nd-order fit. Again, that’s what you expect

from a 2nd-order fit.

And you look at the in-sample error and

out-of-sample error, and they are OK– ballpark fine. You get some error, and the other

one is bigger than it. Now we go for the 10th order, which

is the interesting one. This is the 10th order. You need to remember that the 10th

order is fitting a 50th order. So it doesn’t have enough parameters to fit the target perfectly, even if we had all the glory of the target function in front of us. But we don’t have all the glory

of the target function. We have only 15 points. So it does as good a job as

possible for fitting. And when we look at the in-sample error,

definitely the in-sample error is smaller than here. Because we have more. It’s extremely small. It did it, really, well. And then when you go for

the out-of-sample. Oh, no! You see, this is a squared error. So these guys, when you go down

and when you go up, kill you.
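The noiseless case can be sketched the same way: 15 exact points from a high-order target, fit with the two models. The random coefficients are illustrative; going from 2nd to 10th order must drive E_in down, and the lecture's point is that E_out can still blow up even with no noise added:

```python
import numpy as np

rng = np.random.default_rng(4)

f = np.polynomial.Polynomial(rng.normal(size=51))   # 50th-order target
X = rng.uniform(-1, 1, 15)
y = f(X)                                            # no noise: data on the curve
X_test = np.linspace(-1, 1, 500)

def errors(deg):
    c = np.polyfit(X, y, deg)
    E_in = np.mean((np.polyval(c, X) - y) ** 2)
    E_out = np.mean((np.polyval(c, X_test) - f(X_test)) ** 2)   # squared error
    return E_in, E_out

E_in2, E_out2 = errors(2)       # the 2nd-order fit
E_in10, E_out10 = errors(10)    # the 10th-order fit
```

Because the out-of-sample measure is a squared error, the wild excursions of the 10th-order fit between the data points are what get punished.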

And indeed they did. So this is overfitting galore. And now you ask yourself, you just

told us about noise and not noise. This is noiseless, right? Why did we get overfitting here? We will find out that the reason

we are getting overfitting here is because actually, this guy has noise. But it’s not your usual noise. It’s another type of noise. And getting that notion down is very

important to understand the situations in practice, where you are going

to get overfitting. You could be facing a completely

noiseless, in the conventional sense, situation, and yet there is overfitting

because you are fitting another type of noise. So let’s look at the irony

in this example. Here is the first example– the noisy simple target. So you are learning a 10th-order target,

and the target is noisy. And I’m not showing the target here, I’m

showing the data points together with the two fits. Now let’s say that I tell you that

the target is 10th order, and you have two learners. One of them is O, and

one of them is R– O for overfitting, and R is for

restricted, as it turns out.

And you tell them, guys, I’m not going

to tell you what the target is, because if I tell you what

the target is, this is no longer machine learning. But let me help you out a little bit. The target is a 10th-order polynomial. And I’m going to give you 15 points. Choose your model. Fair enough? The information given does not

depend on the data set, so it’s a fair thing. The first learner says, I know

that the target is 10th order. Why not pick a 10th-order model? Sounds like a good idea.

And they do this, and they get the red

curve, and they cry and cry and cry! The other guy said, oh,

it’s the 10th-order model? Who cares? How many points do you have? 15. OK, 15. I am going to take a 2nd order, and I am

pushing my luck because 2nd order is 3 parameters, I have

15 points, the ratio is 5. Someone told us a rule of thumb

that it should be 10.

I’m flirting with danger. But I cannot use a line when you are

telling me the thing is 10th order, so let me try my luck with 2nd. That’s what you do. And they win. So it’s a rather interesting irony,

because there is a thought in people’s mind that you try to get as much

information about the target function, and put it in the hypothesis set. In some sense, this is true,

for certain properties. But if you are matching the complexity,

here is the guy who took the 10th-order target, and decided

to put the information all too well in the hypothesis– I’m taking a 10th-order hypothesis set– lost. So again, we know all too well now.

The question is, do you match the data resources, rather than the

target complexity.

There will be other properties

of the target function, that we will take to heart. Symmetry and whatnot, there are a bunch

of hints that we can take. But the question of complexity is not

one of the things that you just apply the general idea of let me match

the target function. That’s not the case. In this case, you are looking at

generalization issues, and you know that generalization issues depend

on the size and the quality of the data set. Now, the example that I just gave you, we have seen it before, when we introduced learning curves– if you remember what those were. Those were plots of how E_in and E_out change with the number of examples. And I gave you

something, and I told you that this is an actual situation we’ll see later,

and this is the situation.

So this is the case where you take the

2nd-order polynomial model, H_2, and the inevitable error, which is the black

line, comes now not only from the limitations of the model–

an inability for a 2nd order to replicate a 10th order, which is the

target in this case– but also because there is noise added. Therefore, there’s an amount

of error that is inevitable because of the noise. But the model is very limited. The generalization is not bad,

which is the difference between the two curves. If you have more examples, the two

curves will converge, as they always do, but they converge to the inevitable

amount of error, which is dictated by the fact that you’re using

such a simple model in this case. And when we looked at the other case,

also introduced in this case– this was the 10th-order fellow.

So the 10th-order fellow is– you can

fit a lot, so the in-sample error is always smaller than here. That is understood. The out-of-sample error starts by

being terrible because you are overfitting. And then it goes down, and it converges

to something better because that carries the ability

of H_10 to approximate a 10th order, which should be

perfect, except that we have noise. So all of this is due to

the noise added to the examples. And the gray area is the interesting

part for us.

Because in the gray area, the in-sample

error for the more complex model is smaller. It’s smaller always, but we

are observing it in this case. And the out-of-sample error is bigger. That’s what defines the gray area. Therefore in this gray area,

very specifically, overfitting is happening. If you move from the simpler model to

the bigger model, you get better in-sample error and worse

out-of-sample error. Now we realize that this guy is not going

to lose forever. The guy who chose the correct complexity is

not going to lose forever. They lost only because the number of examples was inadequate. If the number of examples is adequate,

they will win handily. Like here– if you look here, you end

up with an out-of-sample error far better than you would ever get here. But now I have enough examples,

to be able to do that. Now, we understand overfitting. We understand that overfitting will not happen for all numbers of examples; but for a small number of examples, where you cannot pin down the function, you suffer from the usual bad generalization that we saw.
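The learning curves themselves can be estimated by simulation: average E_in and E_out over many independent data sets, for a fixed model, at several sample sizes. All settings here (target, noise level, number of trials) are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)

f = np.polynomial.Polynomial(rng.normal(size=11))   # a 10th-order target
X_test = np.linspace(-1, 1, 200)
y_test = f(X_test)

def avg_errors(deg, N, trials=50, sigma=0.3):
    """Average E_in and E_out over many random noisy data sets of size N."""
    E_in = E_out = 0.0
    for _ in range(trials):
        X = rng.uniform(-1, 1, N)
        y = f(X) + sigma * rng.normal(size=N)
        c = np.polyfit(X, y, deg)
        E_in += np.mean((np.polyval(c, X) - y) ** 2)
        E_out += np.mean((np.polyval(c, X_test) - y_test) ** 2)
    return E_in / trials, E_out / trials

# Tabulate the learning curve for the simple model H_2 at a few sample sizes
curve = {N: avg_errors(2, N) for N in (5, 10, 20, 40)}
```

Running the same loop for degree 10 and overlaying the two tables reproduces the qualitative picture: a gray region at small N where the bigger model has smaller E_in but bigger E_out.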

Now, we notice that we get overfitting

even without noise, and we want to pin it down a little bit. So let’s look at this case. This is the case of the 50th-order

target, the higher-order target that doesn’t have any noise– conventional noise, at least. And these are the two fits. And there’s still an irony, because

here are the two learners. The first guy chose the 10th order, the

second guy chose the 2nd order. And the idea here is the following. You told me that the target

now doesn’t have noise. Right? That means I don’t worry

about overfitting. Wrong. But we’ll know why. So given the choices, I’m going to try

to get close to the 50th order, because I have a better chance.

If I choose the 10th order, and someone

else chooses the 2nd order, I’m closer to the 50th, so I think

I will perform better. At least that’s the concept. So you do this, and you know that there

is no noise, so you decide on this idea, and again you

get a bad performance. And you ask yourself,

this is not my day. I tried everything, and I seem

to be making a wise choice, and I’m always losing. And why is this the case,

when there is no noise? And then you ask, is there

no noise? And that will lead us to define

that there is an actual noise in this case, and we’ll analyze it and

understand what it is about. So I will take these two examples, and

then make a very elaborate experiment.

And I will show you the results

of that experiment. I will encourage you, if you

are interested in the subject, to simulate this experiment. All the parameters are given. And it will give you a very good feel

for overfitting, because now we are going to look at the figure, and do not doubt in our mind that overfitting will occur whenever you

encounter a real problem. And therefore, you have to be careful. It’s not like I constructed a particular

funny case. No, if you average over a huge

number of experiments, you will find that overfitting occurs in the

majority of the cases. So let’s look at the detailed

experiment. I’m going to study the impact of two

things– the noise level, which I already conceptually convinced myself

that it’s related to overfitting, and the target complexity, just because

it does seem to be related.

Not sure why, but it seems like when I took a complex target, albeit noiseless, I still got overfitting, so let me see what the target complexity does. We are going to take, as

a general target function– I’m going to describe what it is, and

I’m going to add noise to it. The noise is a function of x. So I’m just getting it generically, and

as always, we have independence from one x to another. Even though the

parameters of the noise distribution depend on x– I can have different

noise for different points in the space– the realization of epsilon is

independent from one x to another. That is always the assumption.

When we have different data points, they are independent.

So this is the thing, and I’m going to

measure the level of noise by the energy in that noise, and we’re going

to call it sigma squared. I’m taking the expected value

of epsilon to be 0. If there were an expected value, I would

put it in the target, so I will remain with 0. And then there’s fluctuation around it,

and the fluctuation either could be big, large sigma

squared, or small.

And I’m quantifying it

with sigma squared. No particular distribution is needed. You can say Gaussian, and indeed I applied Gaussian

in the experiment. But for the statement, you just

need the energy of that. Now let’s write it down. I want to make the target function

more complex, at will. So I’m going to make it

a higher-order polynomial.

Now I have another parameter, pretty

much like the sigma squared. I have another parameter which is

capital Q, the order of the polynomial. I’m calling it Q_f, because it describes

the target complexity of f, just to remember that

it’s related to f. And what do I do, I define a polynomial,

which is the sum of coefficients times a power of x, from q equals 0 to Q,

so it’s indeed a Qth-order polynomial, and I add the noise here. Now, to run the experiment

right, I’m going to normalize this quantity, such that the energy

here is always 1. And the reason I do that is

because I want the sigma squared to mean something. The signal-to-noise ratio is

always what means something.
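This setup, y = sum over q of alpha_q x^q plus epsilon(x), with the signal normalized to unit energy so that sigma squared alone fixes the signal-to-noise ratio, can be sketched as follows. The Gaussian choice for epsilon and the particular target are illustrative; only the noise energy matters for the statement:

```python
import numpy as np

rng = np.random.default_rng(6)

X = rng.uniform(-1, 1, 10000)                        # many sample points
raw = np.polynomial.Polynomial(rng.normal(size=11))  # an un-normalized target
signal = raw(X)
signal = signal / np.sqrt(np.mean(signal ** 2))      # normalize signal energy to 1

sigma2 = 0.25                                        # the noise energy
# epsilon has mean 0 and variance sigma^2, independent from one x to another
y = signal + np.sqrt(sigma2) * rng.normal(size=X.size)
```

With the signal at energy 1, sigma squared directly measures how noisy the target is, which is what the experiment varies.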

So if I normalize the signal to energy

1, then I can say sigma squared is the amount of noise. And if you look at this, it is not

easy to generate interesting polynomials using this formula. Because if you pick these guys at

random– let’s say independent coefficients at random, to generate a general target, these guys are the powers of x. So you start with the x, and then the

parabola, and then the 3rd order, and then the 4th order, and

then the 5th order. Very, very boring guys. One of them is doing this way, and the

other one is doing this way, and they get steeper and steeper.

So if you combine them with random

coefficients, you will almost always get something that looks this way or

something that looks this way. And the other guys don’t play a role,

because this one dominates. The way to get interesting guys here

is, instead of generating the alpha_q’s here as random, you go for

a standard set of polynomials, which are called Legendre polynomials. Legendre polynomials are just

polynomials with specific coefficients. There is nothing mysterious about them,

except that the choice of the coefficients is such that, from one

order to the next, they’re orthogonal to each other.

So it’s like harmonics in

a sinusoidal expansion. If you take the 1st-order Legendre,

then the 2nd, the 3rd, and the 4th, and you take the inner product,

you see they are 0. They are orthogonal to each other, and

you normalize them to get energy 1. Because of this, if you have

a combination of Legendre’s with random coefficients, then you get

something interesting. All of a sudden, you get the shape. And when you are done, it

is just a polynomial. All you do, you collect the guys that

happen to be the coefficients of x, the coefficients of x squared,

coefficients of x cubed, and these will be your alpha’s.
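As a concrete sketch (my own code, not the course’s– the function names and the use of numpy are assumptions), the target generation just described can look like this:

```python
import numpy as np
from numpy.polynomial import legendre

def make_target(Q_f, rng):
    # Combine Legendre polynomials L_0 .. L_{Q_f} with independent
    # standard-normal coefficients, then normalize so the signal
    # energy E_x[f(x)^2] is 1 for x uniform on [-1, 1].
    a = rng.standard_normal(Q_f + 1)
    # Orthogonality gives E_x[L_q(x)^2] = 1/(2q + 1) on [-1, 1],
    # so the energy of the combination is the sum of a_q^2 / (2q + 1).
    energy = np.sum(a**2 / (2 * np.arange(Q_f + 1) + 1))
    a /= np.sqrt(energy)
    return legendre.Legendre(a)   # when done, it is just a polynomial

def make_dataset(f, N, sigma2, rng):
    # N noisy examples: y = f(x) + epsilon, epsilon ~ N(0, sigma2).
    x = rng.uniform(-1.0, 1.0, N)
    y = f(x) + np.sqrt(sigma2) * rng.standard_normal(N)
    return x, y
```

With the signal normalized to energy 1, sigma squared really is the noise level relative to the signal, which is the point of the normalization.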

Nothing changed in the fact

that I’m generating a polynomial. I just was generating the alpha’s in

a very elaborate way, to make sure that I get interesting targets. That’s all there is to it. As far as we are concerned, we generated

guys that have this form and happened to be interesting–

representative of different functionalities. So in this case we have the noise

level. That’s one parameter that affects overfitting.

We have potentially– the target complexity seems to

be affecting overfitting. At least we are conjecturing

that it is. And the final guy that affects

overfitting is the number of data points. If I give you more data points, you are

less susceptible to overfitting. Now I’d like to understand the

dependency between these. And if we go back to the experiment we

had, this is just one instance of those, where the target complexity

here is 10. I use the 10th-order polynomial,

so Q_f is 10. The noise is whatever the distance

between the points and the curve is. That’s what captures sigma squared. And the data size here is 15. I have 15 data points. So this is one instance. I’m generating at will random

instances of that, to see if the observation of

overfitting persists. Now, how am I going

to measure overfitting? I’m going to define an overfit

measure, which is a pretty simple one. We’re fitting a data set

from x_1, y_1 to x_N, y_N.

And we are using two models, our usual two models. Nothing changed. We either use 2nd-order polynomials

or the 10th-order polynomials. And if going from the 2nd-order

polynomial to the 10th-order polynomial gets us in trouble,

then we are overfitting. And we would like to quantify that. When you compare the out-of-sample

errors of the two models, you have a final hypothesis from H_2,

and this is the fit– the green curve that you have seen.

Another final hypothesis from the

other model is the red curve– the wiggly guy. If you want to define an overfit

measure based on the two, what you do is get the out-of-sample error for

the more complex guy, minus the out-of-sample error

for the simple guy. Why is this an overfit measure? Because if the more complex guy is

worse, it means its out-of-sample error is bigger, and you

get a positive number, a large positive if the overfitting

is terrible. And if this is negative, it means that

actually, the more complex guy is doing better, so you are not overfitting. Zero means that they are the same.
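A minimal sketch of this measure (my own illustration– numpy’s polyfit is an assumed stand-in for whatever fitting routine the experiment actually used):

```python
import numpy as np

def fit_poly(x, y, degree):
    # Least-squares polynomial fit; returns a callable hypothesis g.
    return np.poly1d(np.polyfit(x, y, degree))

def overfit_measure(f, x, y, x_test):
    # E_out(g10) - E_out(g2), with E_out estimated against the
    # noiseless target f on a dense test grid.  Positive means the
    # 10th-order model is overfitting relative to the 2nd-order one.
    g2 = fit_poly(x, y, 2)
    g10 = fit_poly(x, y, 10)
    e_out = lambda g: np.mean((g(x_test) - f(x_test)) ** 2)
    return e_out(g10) - e_out(g2)
```

Run this over many freshly generated targets and data sets, and the average sign and size of the measure tell you how severe overfitting is for a given combination of N, sigma squared, and Q_f.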

So now I have a number in my mind that

measures the level of overfitting in any particular setting. And if you apply this to, again, the

same case we had before, you look at here, and the out-of-sample error

for the red is terrible. The out-of-sample error of green

is nothing to be proud of, but better. And the overfit measure in this case

will be positive, so we have overfitting. Now let’s look at the result of

running this for tens of millions of iterations. Not epochs– complete runs. Generate the target, generate

the data set, fit both, and look at the overfit measure.

Repeat 10 million times, for

all kinds of parameters. So you get a pattern for

what is going on. This is what you get. First, the impact of sigma squared. I’m going to have a plot

in which you get N, the number of examples, and the level of noise,

sigma squared. And on the plot, I’m going to give

a color depending on the intensity of the overfit. That intensity will depend on

the number of points and the level of the noise that you have. And this is what you get. First, let’s look at the

color convention. So 0 is green. If you get redder, there’s

more overfitting. If you get bluer, there

is less overfitting. Now I looked at the number

of examples, and I picked an interesting range.

If you look, this range is 80,

100, and 120 points. So what happens below that, at 40? All of them are dark red. Terrible overfitting. And if you go beyond that, you

have enough examples now not to overfit, so it’s almost all blue. So I’m just giving you the

transition part of it. You look at it. There is a noise level. As I increase the noise level,

overfitting worsens. Why is that? Because if I pick any number

of examples, let’s say 100. If I had 100, and it had that little

noise, I’d be doing fine. Doing fine in terms of

not overfitting. And as I go, I get into the red

region, and then I get deeply into the red region. So this tells me, indeed,

that overfitting worsens with sigma squared. By the way, for all of the targets here,

I picked a fixed complexity. 20. 20th-order polynomial. I fixed it because I just wanted

a number, and I wanted only to relate the noise to the overfitting.

So that’s what I’m doing here. When I change the complexity,

this will be the other plot. For this guy, we get something that

is nice, and it’s really according to what we expect. As you increase the number of points,

the overfitting goes down. As you increase the level of noise,

the overfitting goes up. That is what we expect. Now let’s go for the impact of Q_f

because that was the mysterious part. There was no noise and we

were getting overfitting. Is this going to persist? What is the deal? This is what you get. So here, we fixed the level of noise. We fixed it at sigma

squared equals 0.1. Now we are increasing the target

complexity, from trivial to 100th-order polynomial. That’s a pretty serious guy. We are plotting the same range for

the number of points, from 80, 100, and 120. That’s where it happens. And you can see that overfitting

occurs significantly. And it worsens also with

the target complexity. Because let’s say, you

look at this guy. If you look at this guy, you are here

in the green, and it gets red, and then it gets darker red.

Not as pronounced as in this case. But you do get the overfitting effect

by increasing the target complexity. And when the number of examples is

bigger, then there’s less overfitting, as you expect it to be. But if you go high enough– I guess it’s getting lighter blue,

green, yellow. Eventually, it will get to red. And if you look at these two guys, the

main observation is that the red region is serious. Overfitting is real and here to stay,

and we have to deal with it. It’s not like an individual

case there.

Now, there are two things you can

derive from these two figures. The first thing is that there seems

to be another factor, other than conventional noise– let’s call it

conventional noise for the moment– that affects overfitting. And we want to characterize that. That is the first thing we derive. The second thing we derive is

a nice logo for the course! That’s where it came from. So now let’s look at noise, and

look at the impact of noise. You can notice that noise is

between quotation marks here because now we’re going to expand our horizon

about what constitutes noise. Here are the two guys. And in the first case, we are going

now to call it stochastic noise. Noise is stochastic, but obviously,

we are calling it stochastic because the other guy will

not be stochastic. And there’s absolutely

nothing to add here.

This is what we expect. We’re just calling it a name. Now we are going to call whatever effect

that is done by having a more complex target here, we are going

also to call it noise. But it is going to be called

deterministic noise. Because there is nothing

stochastic about it. There’s a particular target function. I just cannot capture it, so

it looks like noise to me. And we would like to understand what

deterministic noise is about. However, if you look at it, and now you

speak in terms of stochastic noise and deterministic noise, you would

like to see what affects overfitting. So, we put it in a box. First observation: if I have more points, I

have less overfitting. If you move from here to

here, things get bluer. If you move from here to here,

things get bluer. I have less overfitting. Second thing: if I increase the stochastic noise– increase the energy in the

stochastic noise– the overfitting goes up. Indeed, if I go from here to

here, things get redder.

And finally, with deterministic noise,

which is vaguely associated in my mind with the increase of target complexity,

I also increase the overfitting. If I go from here to here,

I am getting redder. Albeit I have to travel further, and

it’s a bit more subtle, the direction is that I get more

overfitting as I get more deterministic noise, whatever

that might be. So now, let’s spend some time just

analyzing what deterministic noise is, and why it affects overfitting

the way it does. Let’s start with the definition. What is it? It will be noisy. If I tell you what is

the stochastic noise, you will say, here’s my target, and

there is something on top of it. That is what I call stochastic noise. So the deterministic noise will be

the same thing, except that it captures something deterministic. It’s the part of the target that your

hypothesis set cannot capture. So let’s look at the picture. Here is the picture. This is your target, the blue guy. You take a hypothesis set that– let’s

say simple, and you look for the guy that best approximates f.

Not in the learning sense. You try very hard to find

the best possible approximation. You’re still not going to get f,

because your hypothesis set is limited, but the best guy will be

sitting there, and it will fail to pick a certain part of the target. And that is the part we are labeling

the deterministic noise. And if you think from an operational
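Here is a small sketch of that definition (my own construction: the target is a bare 10th-order Legendre polynomial, and the best H_2 hypothesis is approximated by a least-squares fit to the target itself on a dense grid– no data set, no learning involved):

```python
import numpy as np
from numpy.polynomial import legendre

xs = np.linspace(-1.0, 1.0, 2001)   # dense grid standing in for all of x
f = legendre.Legendre.basis(10)     # a wiggly 10th-order target

# Best H_2 approximation h* of f itself (not of any data set):
h_star = np.poly1d(np.polyfit(xs, f(xs), 2))

det_noise = f(xs) - h_star(xs)      # the part H_2 cannot capture
det_energy = np.mean(det_noise**2)  # its energy, the sigma^2 analogue
```

Since L_10 is orthogonal to every polynomial of degree 2 or less, h* here is essentially zero and the whole target is deterministic noise for H_2; a richer hypothesis set would leave a smaller residual.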

point of view, if you are that hypothesis, noise is all the same. It’s something I cannot capture. Whether I couldn’t capture it, because

there’s nothing to capture– as in stochastic noise– or I couldn’t capture it, because I’m

limited in capturing, and this I have to consider as out of my league. Both of them are noise, as

far as I’m concerned. Something I cannot deal with. This is how we define it. And then we ask, why are

we calling it noise? It’s a little bit of

a philosophical issue.

But let’s say that you have

a young sibling– your kid brother– has just learned fractions. So they used to have just

1, 2, 3, 4, 5, 6. They’re not even into negative numbers,

and they learn fractions, and now they’re very excited. They realize that there’s more

to numbers than just 1, 2, 3. So you are the big brother. You are a big Caltech guy. So you must know more about numbers. They come to ask you, tell

me more about numbers. Now, in your mind, you probably

can explain to them negative numbers a little bit by deficiency. Real numbers, just intuitively

continuous. You are not going to tell them about limits,

or anything like that. They’re too young for that. But you probably are not going to tell

them about complex numbers, are you? Because their hypothesis set is so

limited that complex numbers, for them, would be complete noise. The problem with explaining something

that people cannot capture is that they will create a pattern

that doesn’t exist. And then you tell them complex numbers,

and they really can’t comprehend it, but they got the notion.

So now it’s the noise. They fit the

noise, and they tell you, is 7.34521 a complex number? Because in their minds– they just got on a tangent. So you’re better off just

killing that part. And giving them a simple thing that they

can learn, because the additional part will mislead them. Mislead them, as in noise. So this is our idea, that if I have

a hypothesis set, and there is part of the target that I cannot capture,

there’s no point in trying to capture it, because when you try to capture it,

you are detecting a false pattern that you cannot extrapolate,

given your limitations.

That’s why it’s called noise. Now the main differences between

deterministic noise and stochastic noise– both of them can be

plotted, a realization– but the main differences are, the

first thing is that deterministic noise depends on your hypothesis set. For the same target function, if you

use a more sophisticated hypothesis set, the deterministic noise will be

smaller, because you were able to capture more. The stochastic

noise will be the same. Nothing can capture it, so all

hypotheses are the same. We cannot capture it, and

therefore it’s noise. The other thing is that, if I give you

a particular point x, deterministic noise is a fixed amount, which is the

difference between the value of the target at that point and the best

hypothesis approximation you have. If I gave you stochastic noise, then

you are generating this at random. And if I give you two instances

of x, the same x, the noise will change from one

occurrence to another, whereas here, it’s the same. Nonetheless, they behave the

same for machine learning, because invariably we have a given data set.

Nobody changes x’s on us and gives

us another realization of the x. We just have the x’s given to

us together with the labels. So this doesn’t make

a difference for us. And we settle on a hypothesis set. Once you settle on a hypothesis set, the

deterministic noise is as bad as the stochastic noise. It’s something that we cannot capture,

and it depends on something that we have already fixed, so effectively it doesn’t

depend on anything random.

So in a given learning situation,

they behave the same. Now, let’s see the impact

of overfitting. This is what we have seen before. This is the case where we have

increasing target complexity, so increasing deterministic noise in the

terminology we just introduced, and the number of points. Red means

overfitting, so this is how much overfitting is there. And we are looking at deterministic

noise, as it relates to the target complexity. Because the quantitative thing

we had is target complexity. We defined what a realization of

deterministic noise is, but it’s not yet clear to us what quantity we should

measure out of deterministic noise, to tell us that this is the

level of noise that results in overfitting. We have the one in the case of

stochastic noise very easily.

We just take the energy out of it. So here we realize that as you increase

the target complexity, the deterministic noise increases, and with it

the overfitting phenomenon that we observe. But you’ll notice something interesting here. It doesn’t start until you get to 10. Because this was overfitting of what? The 10th order versus the 2nd order. So if you’re going to start having

deterministic noise, you’d better go above 10, so that there is something

that you cannot approximate. This is the part where it’s there. So here, I wouldn’t say proportional, but it

increases with the target complexity, and it decreases

with N as we expect. Now for the finite N, you suffer the

same way you suffer from the stochastic noise. We have declared that deterministic noise

is the part that your hypothesis set cannot capture. So what is the problem? If I cannot capture it, it won’t

hurt me, because when I try to fit, I won’t capture it anyway.

No. You cannot capture it in its entirety. But if I give you only a finite sample,

then you only get a few points, and you may be able to capture

a little bit of the stochastic noise, or the deterministic

noise in this case. Again, if I have 10 points– if you give me a million points,

and even if there is stochastic noise, there’s nothing I can do to

capture the noise. Let me remind you of the example

we gave in linear regression.

We took linear regression and said,

let’s say that we are learning a linear function. So linear regression would

be perfect in this case. This is the target. And then we added noise to the examples,

so instead of getting the points perfectly on that line,

you get points right or left. And then we tried to use linear

regression to fit it. If you didn’t have any noise,

linear regression would be perfect in this case. Now, since there’s noise, and it doesn’t

see the line– it only sees those guys, it eats a little bit

into the noise, and therefore gets deviated from the target. And that is

why you are getting worse performance than without the noise. Now, if I have 10 points, linear

regression will have an easy time eating into that, because there

isn’t much to fit.

There are only 10 guys, and maybe

there’s some linear pattern there. If I get a million points, the chances

are I won’t be able to fit any of them at all, because they are noise all

over the place, and I cannot find a compromise using my few parameters, and

therefore I will end up not being affected by them. In the infinite case, I

cannot get anything. They are noisy, and I cannot fit them.

They are out of my ability. But the problem is that once you have

a finite sample, you’re given the unfortunate ability to be able to fit

the noise, and you will indeed fit it.
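That argument can be sketched numerically (my own illustration; the linear target, noise level, and sample sizes are assumptions): fit a line to noisy samples of a linear target and watch how far the fit drifts from the target as N grows.

```python
import numpy as np

def excess_error(N, sigma2=0.25, runs=200, seed=0):
    # Average squared deviation of the fitted line from the true
    # target f(x) = x, over many noisy data sets of size N.  This
    # measures how much the fit "ate into" the noise.
    rng = np.random.default_rng(seed)
    x_test = np.linspace(-1.0, 1.0, 201)
    total = 0.0
    for _ in range(runs):
        x = rng.uniform(-1.0, 1.0, N)
        y = x + np.sqrt(sigma2) * rng.standard_normal(N)
        g = np.poly1d(np.polyfit(x, y, 1))
        total += np.mean((g(x_test) - x_test) ** 2)
    return total / runs

small_N = excess_error(10)      # few points: the fit chases the noise
large_N = excess_error(10000)   # many points: the noise averages out
```

With 10 points the line is pulled noticeably off target; with 10,000 points the same noise level barely moves it.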

Whether it’s stochastic– where fitting it doesn’t make sense– or

deterministic, there is no point in fitting it, because you know in your

hypothesis set, there is no way to generalize out-of-sample for it. It is out of your ability. So the problem here is that for the

finite N, you get to try to fit the noise, both stochastic

and deterministic. Now, let me go quickly through

a quantitative analysis that will put deterministic noise and stochastic noise

in the same equation so that they become clear.

Remember bias-variance? That was a few lectures ago. What was that about? We had a decomposition of the expected

out-of-sample error into two terms. This is the expected value of

out-of-sample error. Remember, this is the hypothesis we get, and we

have a dependency on the data set we were given. We compare it to the target function,

and we get the expected value with respect to those. That ended up being a variance,

which tells me how far I am from the centroid within the hypothesis set, and

that means that there’s a variety of things I get based on D.

The other one is how far the

centroid is from the target, which tells me the bias of my hypothesis

set from the target. The leap of faith we had is that

this quantity, which is the average hypothesis you get over all data sets,

is about the same as the best hypothesis in the hypothesis set. So we had that. And in this case, f was noiseless

in this analysis. Now, I’d like to add noise to the

target and see how this decomposition will go, because this will give us

a very good insight into the role of stochastic noise versus

deterministic noise.

So we add noise. And we’re going to plot it red because

we want to pay attention to it, and because we are going

to get the expected values with respect to it. So now the realization is

the target plus epsilon. And I’m going to assume that the

expected value of the noise is 0. Again, if the expected value is

something else, we put that in the target, leave the part that is

pure fluctuation outside, and call that epsilon.

Now I would like to

repeat the analysis, more quickly, obviously, with the added noise. Here is the noise term. First, this is what we started with. So I’m comparing what you get in your

hypothesis, in a particular learning situation, to the target. But now the target is noisy. So the first thing is to replace

this fellow with the noisy version, which is y. I know that y has f of

x, plus the noise.

That’s what I’m comparing to. And now, because y depends on the

noise, I’m not only getting the averaging with respect to the data set,

I’m also getting the average with respect to the realization

of the noise. So I’m getting the expected value

with respect to D and epsilon– epsilon affecting y. So you expand this, and this is just rewriting it. f of x

plus epsilon is y, so I’m writing it this way. And we do the same thing we did before,

but just carry this around, until we see where it goes. So what did we do? We added and subtracted the centroid– the average hypothesis, remember– in preparation for getting squared

terms, and cross terms. And here we have the epsilon

added to the mix.

And then we write it down. And in the first case, we get the

squared, so we put these together and put them as squared. We take these two guys together

and put them as squared. And this guy by itself,

we put it as squared. We will still have cross-terms, but

these are the ones that I’m going to focus on. And then we have more cross terms

than we had before because there’s epsilon in it. But the good news is that, if you get

the expected value of the cross terms, all of them will go to 0. The ones that used to go

to 0 will go to 0. The other ones will go to 0, because the

expected value of epsilon goes to 0, and epsilon is independent of

the other random thing here, which is the data set. The data set is generated. Its noise is generated. Epsilon is generated on the test point

x, which is independent, and therefore you will get 0. So it’s very easy to argue that this is

0, and you will get the same decomposition with

this fellow added.

So let’s look at it. Well, we’ll see that two noise terms come up. This is the variance term. Let me put it. This is the bias term. And this is the added term, which

is just sigma squared, the energy of the noise. Let me just discuss this a little bit. We had the expected value with respect to D, and with respect to epsilon. And then, remember that we take the

expected value with respect to x, averaged over all the space, to

get just the bias and variance, rather than the bias of x– of your

test point.

So I did that already. So every expectation is taken with respect to

the data set, the input point, and the

realization of the noise epsilon. But I’m keeping the guys that survive,

because the other guys– epsilon doesn’t appear here, so the

term is constant with respect to it, so I take it out. Here, neither epsilon nor D appears, so I just leave it for simplicity. And here, D doesn’t appear, but epsilon

and x appear, so I do it this way. I could put a more elaborate

notation, but I just wanted to keep it simple. Now, look at this decomposition.

We have the move from your

hypothesis to the centroid, from the centroid to the target proper, and then

from the target proper to the actual output, which has

a noise aspect to it. So it’s again the same thing of trying

to approximate something and putting it in steps. Now if you look at the last quantity,

that is patently the stochastic noise. The interesting thing is that there

is another term here which is corresponding to the

deterministic noise. And that is this fellow. That’s another name for the bias. Why is that? Because our leap of faith told us

that this guy, the average, is about the same as the best hypothesis. So we are measuring how the best

hypothesis can approximate f. Well, this tells me the energy

of deterministic noise. And this is why it’s deterministic

noise.
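Written out (this matches the standard bias-variance derivation, with the lecture’s notation: g bar is the average hypothesis, and y = f + epsilon):

```latex
\mathbb{E}_{D,\varepsilon}\!\left[\big(g^{(D)}(\mathbf{x}) - y(\mathbf{x})\big)^2\right]
= \underbrace{\mathbb{E}_{D}\!\left[\big(g^{(D)}(\mathbf{x}) - \bar{g}(\mathbf{x})\big)^2\right]}_{\textbf{var}}
\;+\; \underbrace{\big(\bar{g}(\mathbf{x}) - f(\mathbf{x})\big)^2}_{\textbf{bias} \,=\, \text{deterministic noise}}
\;+\; \underbrace{\mathbb{E}_{\varepsilon}\!\left[\varepsilon(\mathbf{x})^2\right]}_{\sigma^2 \,=\, \text{stochastic noise}}
```

Taking the expectation over x as well gives the expected out-of-sample error as bias + var + sigma squared, where the bias term is the energy of the deterministic noise.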

And putting it this way gives

you the solid ground to treat them the same. Because if you increase the number of

examples, you may get a better variance. There are more examples,

so you don’t float around fitting all of them. So the red region, which used to be the

variance, shrinks and shrinks. These guys are both inevitable. There is nothing you can do about this,

and there’s nothing you can do about this given a hypothesis set.

So these are fixed. But again, in the bias-variance,

remember the approximation was an overall approximation. We took the entire target function

and the entire hypothesis. We didn’t look at particular

data points. We looked at approximation proper, and

that’s why these are inevitable. You tell me what the hypothesis set is,

well, that’s the best I can do. And this is the best I can do as far

as the noise, which is just not predicting anything in the noise.

Now, both the deterministic noise and

the stochastic noise will have a finite version of the data points, and

the algorithm will try to fit them. And that’s why this guy

gets a variety. Because depending on the particular fit of

those, you will get one or another. So these guys affect the variance, by

making the fit more susceptible to going in more places. Depending on what happens, I will go

this way and that way– not because it’s indicated by the target function

I want to learn, but just because there is noise present in the sample

that I am blindly following because I can’t distinguish noise from the signal,

and therefore I end up with more variety, and I end up with a worse

variance and overfit.

Now very briefly, I’m going

to give you a lead into the next two lectures. We understand what overfitting is, and we

understand that it’s due to noise. And we understand that noise is in the

eye of the beholder, so to speak. There is stochastic noise, but there’s

another noise that is not noise but depends on which

hypothesis looks at it. It looks like noise to some and does not

look like noise to others, and we call that deterministic noise. And we saw experimentally that

it affects overfitting. So how do we deal with overfitting? What does it mean to deal

with overfitting? We want to avoid it. We don’t want to spend more energy

fitting and get worse out-of-sample errors, whether by choice of a model

or by actually optimizing within a model as we did with

neural networks.

There are two cures. One of them is called regularization,

and that is best described as putting the brakes. So overfitting– you are going,

going, going, going, and you hurt yourself. So all I’m doing here is, I’m

just making sure that you don’t go all the way. And when you do that, I’m going

to avoid overfitting this way. The other one is called validation. What is the cure in this

case for overfitting? You check the bottom line and make

sure that you don’t overfit. It’s a different philosophy. That is, the reason I’m overfitting is

because I’m going for E_in, and I’m minimizing it, and I’m

going all the way. I say, no, wait a minute. E_in is not a very good indication

of what happens.

Maybe there’s another way to be able to

tell what is happening out of the sample, and therefore avoid

overfitting, because you can check on what is happening in the real

quantity you care about. So these are the two approaches. I’ll give you just an appetizer– a very short appetizer for putting

the brakes– the regularization part, which is the subject of the next lecture. Remember this curve? That’s what we started with. We had the five points, we had the

4th-order polynomial, we fit, and we ended up in trouble. And we can describe this as free fit, that is, fit all you can. To fit all you can, five points, I’ll

take a 4th-order polynomial, go for it, I get this, and that’s what happens.

Now, putting the brakes means that

you’re going to not allow yourself to go all the way, and you are going

to have a restrained fit. The reason I’m showing this is

because it’s fairly dramatic. You will think that I need– this curve is so incredibly bad that

you think you need to do something dramatic to avoid that. But here, what I’m going to do, I’m just

going to make you fit, and I’m going to make you fit

using a 4th-order polynomial. I’ll give you that privilege. But I’m going to prevent you from

fitting the points perfectly.

I’m going to put some

friction in it, such that you cannot get

exactly to the points. And the amount of brake I’m

going to put here is so minimal, it’s laughable. When you go for your car service, they

measure the brake, and they tell you, oh, the brake is 70%, et cetera,

and then when it gets to 40%, they tell you you need to

do something about the brake. The brakes here are about 1%. So if this was a car, you would be

braking here, and you would be stopping in Glendale! It’s like completely ridiculous. But that little amount of brake

will result in this. Dramatic. Fantastic fit. The red curve is a 4th-order polynomial,

but we didn’t allow it to fit all the way. And you can see that it’s not fitting

all the way because it is not getting the points right. It’s getting there, but not exactly.
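A rough sketch of that restrained fit (my own illustration– the five data points, the sine target, the penalty form, and the lambda value are all assumptions; regularization proper is next lecture’s subject): ridge regression on 4th-order polynomial features, once with no brake and once with a tiny one.

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(-1.0, 1.0, 5)
y = np.sin(np.pi * x) + 0.2 * rng.standard_normal(5)   # 5 noisy points

def ridge_poly(x, y, degree, lam):
    # Minimize ||Zw - y||^2 + lam * ||w||^2 over polynomial features Z.
    Z = np.vander(x, degree + 1)
    w = np.linalg.solve(Z.T @ Z + lam * np.eye(degree + 1), Z.T @ y)
    return np.poly1d(w)

free_fit = ridge_poly(x, y, 4, lam=0.0)     # interpolates all 5 points
restrained = ridge_poly(x, y, 4, lam=1e-2)  # the tiny brake
```

The free fit passes through every point and swings in between; the restrained fit misses the points slightly but stays tame, which is the dramatic effect in the slide.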

So we don’t have to do much to

prevent the overfitting. But we need to understand what

is regularization, and how to choose it, et cetera. And this we’ll talk about next time. And then the time after that, we’re

going to talk about validation, which is the other prescription. I will stop here, and we will take

questions after a short break. Let’s start the Q&A, and we’ll start

with a question in-house. STUDENT: So in the previous lecture, we

spoke about stochastic gradient descent, and we said that we should

choose point by point, and move in the direction of the gradient

of error in this point. PROFESSOR: Negative

of the gradient, yes. STUDENT: So the question is,

how important is it to choose points randomly? I mean, we can choose them just from the list– first point,

second point, and so on? PROFESSOR: Yeah.

Depending on the runs, it could be no

difference at all, or it could be a real difference. And the best way to think of

randomization in this case is that it’s an insurance policy. There’s something about the pattern

that is detrimental in a particular case. You are always safe by picking the

points at random because there’s no chance that the random thing will

have a pattern eventually if you keep doing it.

So in many cases, you just run

through examples 1 through N, 1 through N. 1 through N,

and you will be fine. In some cases, you take

a random permutation. In some cases even, you stay true to

picking the point at random, and you hope that the representation of

a point will be the same, in the long run. In my own experience, there is little

difference in a typical case. Now and then, there’s

a funny case. Therefore, you are safer using

the stochastic presentation– the random presentation

of the examples– to be able not to fall

into the trap in those cases. Yeah. There’s another question in-house. STUDENT: Hi, Professor. I have a question about slide 4. It’s about neural networks. I don’t understand– how do you draw the

out-of-sample error on that plot? PROFESSOR: OK. In general, you cannot

draw the out-of-sample error. If you could draw it, you

would just pick it. This is a case where I give you

a data set, and you decide to set aside part of the data

set for testing. So you are not involved

at all in the training.

And what you do, you go about your

training, and at the end of every epoch, when you evaluate the in-sample

error on the entire batch, which is the green curve here, you also evaluate,

for that set of weights– the frozen weights at the end of the

epoch– you evaluate that on the test set, and you get a point. And because that point is not involved

in the training, it becomes an out-of-sample point, and that

gets the red point. And you go down. Now, there’s an interesting tricky point

here, because you might decide at some point: I look at the red curve, and now I am going to stop where

the red curve is minimum. STUDENT: Yes. PROFESSOR: OK? Now at that point, the set that used to

be a test set is no longer a test set, because now it has just

been involved in a decision regarding training. It becomes slightly contaminated, and becomes

a validation set, which we’re going to talk about when we talk

about validation.

But that is the premise. STUDENT: OK. I understand. Also, can I– slide 16? PROFESSOR: Slide 16. STUDENT: I didn’t follow that. Why the

two noises are the same, for the same learning problem? PROFESSOR: They’re the same

in the sense that they are part of the outputs that I’m being given, or that I’m

trying to predict. And that part, I cannot predict

regardless of what I do. In the case of stochastic

noise, it’s obvious. There’s nothing to predict there,

so whatever I do, I miss it. In the case here, it’s particular to

the hypothesis set that I have. So I take a hypothesis set, look at a non-learning scenario– look at the target function and choose your best hypothesis. You say, this is my best hypothesis, which we called h star here. If you look at the difference

between h star and f, the difference is a part which I cannot

capture, because the best I could do is h star. So the remaining part is what I’m

referring to as deterministic noise, and it is beyond my ability

given my hypothesis set. So that’s why they are the same– the

same in the sense of unreachable as far as my resources are concerned. STUDENT: OK. In a real problem, do we know the

complexity of the target function? PROFESSOR: In general, no. We also don’t know the particulars of the

noise. We know that the problem is noisy, but we cannot identify the noise. We cannot, in most cases,

even measure the noise. So the purpose here is to understand

that, even in the case of a noiseless target in the conventional

sense, there is something that we can identify– conceptually identify– that does affect the overfitting. And even if we don’t know the

particulars of it, we will have to put guards in place to avoid overfitting.
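The idea that a noiseless target still contains something that behaves as noise can be made concrete with a tiny numerical sketch (a hypothetical example, not from the lecture): take the noiseless target f(x) = x² on [-1, 1] and a hypothesis set of lines. The best line h* cannot capture the curvature, and the leftover energy E[(h*(x) − f(x))²] is the deterministic noise.

```python
import numpy as np

# Noiseless target f(x) = x^2 on [-1, 1]; hypothesis set H = lines.
x = np.linspace(-1, 1, 100001)
f = x ** 2

# Best line h*(x) = a + b*x under (approximately) uniform input density:
b, a = np.polyfit(x, f, 1)
h_star = a + b * x

# Deterministic noise energy: the part of f that H cannot capture.
det_noise = np.mean((h_star - f) ** 2)
print(round(det_noise, 4))   # close to 4/45, about 0.0889
```

By symmetry the best line is the constant h*(x) = 1/3, and the residual energy works out to 4/45; no amount of data reduces it, because it is a property of the hypothesis set, not of the sample.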

That was the goal here,

rather than trying to– Any time you see the target function

drawn, you should immediately have an alarm bell that this is conceptual,

because you never actually see the target function in a real

learning situation. STUDENT: Oh. So, that’s why the two

noises are equal, then. Because we don’t know the target

function, so we don’t know which part is deterministic. PROFESSOR: Yeah. If I knew the target,

and if I knew the noise, then the situation would be good, but

then I don’t need machine learning.

I already have that. STUDENT: Thank you. PROFESSOR: So shall we go to the

questions from the outside? MODERATOR: Yeah.

Quick conceptual question. Is it OK to say that the deterministic

noise is the part of reality that is too complex to be modeled? PROFESSOR: It is

part of the reality– that part. And basically, our failure to model

it is what made it noise, as far as we are concerned. So obviously you can, in some sense,

model it by going to a bigger hypothesis set. The bigger hypothesis set will have

a closer h star to the target, and therefore the difference

will be small. But the situation pertains to the case

where you already chose the hypothesis set according to the prescriptions of

VC dimension, number of examples, and other considerations. And given that hypothesis set, you

already concede that even if the target is noiseless, there is part of

it that behaves as noise, as far as I’m concerned. And I will have to treat it as such when

I consider overfitting and the other considerations. MODERATOR: Also, is it fair to say that

over-training will cause overfitting? PROFESSOR: I think they

probably are synonymous.

Overfitting is relative. Over-training, if I try to give it a definition, would be relative within the same model: you have already settled on the model, and you’re training it too far. The case of neural networks

would be over-training. The case of choosing the 3rd-order

polynomial versus the 4th-order polynomial will not be

over-training, but it will be overfitting. It’s all technicalities, but

just to answer the question. MODERATOR: Practically, when

you implement these algorithms, there’s also some approximation, maybe due to floating-point numbers or something. So is this another source of error? Does it produce overfitting? Or is it– PROFESSOR: Formally speaking, yes, it’s another source. But it is so minute compared to the others that it’s never mentioned. We have another in-house question. STUDENT: A couple of lectures ago, we

spoke about the third linear model, which is logistic regression.

PROFESSOR: You said

the third linear model? STUDENT: Yes. So the question is: suppose I have data that is completely linearly separable– some points are marked -1, some are marked +1, and there is a plane which separates them. Is it true that by applying this learning model, you never get stuck in a local minimum, and you get 0 in-sample error? PROFESSOR: OK. This is a very specific question

about logistic regression. If the thing is completely clean, then

you obviously can get closer and closer to having the probability

being perfect, by having bigger and bigger weights. So there is a minimum. And again, it’s a unique minimum. Except that the minimum is

at infinity, in terms of the size of the weight. But this doesn’t bother you, because you

are going to stop at some point when the gradient is small,

according to your specification. And you can specify this

any way you want.
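A minimal sketch of this behavior (the data and parameters are assumed, not from the lecture): gradient descent on the logistic-regression cross-entropy error, on perfectly separable data. The weight norm keeps growing– the minimum is at infinity– and we stop once the gradient is smaller than the tolerance we specified.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical separable data: labels given exactly by a line through the origin.
X = rng.uniform(-1, 1, size=(50, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)

w = np.zeros(2)
lr = 1.0
norms = []                                   # track the growing weight norm
for _ in range(5000):
    s = y * (X @ w)                          # signed margins
    # gradient of E_in(w) = mean over n of ln(1 + exp(-y_n w.x_n))
    grad = -((y[:, None] * X) / (1.0 + np.exp(s))[:, None]).mean(axis=0)
    if np.linalg.norm(grad) < 1e-3:          # our stopping specification
        break
    w -= lr * grad
    norms.append(np.linalg.norm(w))

E_in = np.mean(np.log(1.0 + np.exp(-y * (X @ w))))   # cross-entropy error
```

The norm keeps increasing while E_in keeps decreasing; a perfect fit would need infinite weights, but stopping at a small gradient already gives an in-sample error close to the infimum, which is the point of the answer above.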

So the goal is not necessarily

to arrive at the minimum– which hardly ever happens, even when the minimum is not at infinity– but to get close enough, in the sense that

the value is close to the minimum, and therefore you achieve the small

error that you want. MODERATOR: Can you resolve again the apparent contradiction: when you increase the complexity of the model, you should be reducing your bias, and hence your deterministic noise. Yet here we had an example where H_10 had more error than H_2. PROFESSOR: H_10 had a total error

more than H_2. If we were doing the

approximation game, H_10 would be better. We had three terms in the bias-variance decomposition. If we were only going by these two– the noise term and the bias– then there is no question that the bigger model, H_10, would win. Because the noise term is the same for all models, and the bias will be better for H_10 than for H_2, because H_10 is closer to the target we want, and therefore we would be making smaller errors. This is not the source of the

problem of overfitting. This is just identifying terms in the

bias-variance decomposition, bias-variance-noise decomposition

in this case, that correspond to the different

types of noise.
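A small simulation, in the spirit of the lecture’s H_2-versus-H_10 experiment, illustrates the finite-sample effect being described here (the target, noise level, and sample size are assumed, not the lecture’s exact values): averaged over many noisy data sets of 15 points, the 10th-order fit ends up with the larger out-of-sample error despite its smaller bias.

```python
import warnings
import numpy as np

warnings.simplefilter("ignore")          # polyfit may warn about conditioning
rng = np.random.default_rng(2)

def f(x):                                # assumed target function
    return np.sin(np.pi * x)

x_test = np.linspace(-1, 1, 200)

def avg_out_of_sample_error(degree, trials=300, N=15, sigma=0.5):
    """Average squared test error of a degree-`degree` polynomial fit."""
    errs = []
    for _ in range(trials):
        x = rng.uniform(-1, 1, N)
        y = f(x) + sigma * rng.standard_normal(N)   # noisy finite sample
        coeffs = np.polyfit(x, y, degree)           # least-squares fit
        errs.append(np.mean((np.polyval(coeffs, x_test) - f(x_test)) ** 2))
    return float(np.mean(errs))

e2 = avg_out_of_sample_error(2)
e10 = avg_out_of_sample_error(10)
print(e2 < e10)    # the bigger model loses out of sample
```

The bias of H_10 is indeed smaller, but on 15 noisy points its fit chases the noise, and the variance term dominates the out-of-sample comparison.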

The problem of overfitting

happens here. And that happens because of

the finite-sample version of both. That is, I get N points in which there

is a contribution of noise coming from the stochastic and coming

from the deterministic. On those points, the algorithm will try

to fit that noise, even though, if it knew it was noise, it wouldn’t, because that part is out of reach. But it gets a finite sample, and it can

use its resources to try to fit part of that noise, and that is

what is causing overfitting. And that ends up being harmful, and so

harmful in the H_10 case, that the harm offsets the fact that I’m closer

to the target function.

That doesn’t help me very much. Because– the same thing we said before– let’s say there’s H_10, and the target function is sitting here. That doesn’t do me much good if my algorithm, with the distraction of the noise, leads me to go in that direction. I will be further from the target

function than another guy who, working only with the smaller model, remained within its confines and ended up being closer to the target function. It’s the variance term that results in overfitting, not these other terms– even though they contain both types of noise contributing to their value. But their value is static. It doesn’t change with N, and

it has nothing to do with the overfitting aspect. MODERATOR: In the case of polynomial

fitting, a way to avoid overfitting could be to

use piecewise linear– piecewise linear functions

around each point.

So is it a method of regularization? Or is it– PROFESSOR: OK. It depends on the number of degrees of freedom you have. You can have a piecewise linear fit which is horrible– it depends on how many pieces. If you have as many pieces as there

are points, you can see what the problem is. So the question is, what is the

VC dimension of your model? And I can take it– if it’s piecewise

linear, and I have only four parameters, then I don’t worry too much

that it’s piecewise linear. I only worry about the four

parameters aspect of it. The 10th-order polynomial was bad because

of the 11 parameters, not because of another factor. But anything you do to restrict your

model, in terms of the fitting, can be called regularization. And there are some good methods and

bad methods, but they are all regularization, in terms

of putting the brakes. MODERATOR: A practical question:

how do you usually get the profile of the out-of-sample error? Do you sacrifice points, or– PROFESSOR: OK.

This is a good question. When we talk about validation–

validation has an impact on overfitting– it’s used to deal with it. But it’s also used in model

selection in general. And because of that, it’s very tempting

to say, I’m going to use validation, and I’m going to set

aside several points. But obviously, the problem is that when

you set aside several points, you deprive yourself of a resource that

you could have used for training, to arrive at a better hypothesis. So there’s a tradeoff, and we’ll

discuss that tradeoff in very specific terms, and find ways

to go around it, like cross-validation. But this will be the subject of the

lecture on validation, coming up soon.

MODERATOR: In the example of the color

plots, here the order of the polynomial is a good indication

of the VC dimension, right? PROFESSOR: These are the plots. What is the question? MODERATOR: Here, Q_f is directly related

to the VC dimension, right? PROFESSOR: The target complexity

has nothing to do with the VC dimension. It’s the target. I’m talking about different targets. The VC dimension has to do only with

the two fellows we are using. We are using H_2 and H_10, 2nd-order

polynomials and 10th-order polynomials. So if we take the degrees of freedom

as being a VC dimension, they will have different VC dimensions. The discrepancy in the VC dimension,

given the same number of examples, is the reason why we

have a discrepancy in the out-of-sample error. But you also have a discrepancy

in the in-sample error. The case of overfitting is such that

the in-sample error is moving in one direction, and the out-of-sample

moving in another direction.

So the only relevant thing in this plot

to the VC dimension is the fact that the two models have different

VC dimensions, H_2 and H_10. MODERATOR: I guess you never

really have a measure of the target complexity, like in practice? PROFESSOR: Correct. This was an illustration. And even in the case of the illustration,

when we had an explicit definition of the target complexity, it

wasn’t completely clear how to map this into the energy of deterministic

noise, a counterpart for sigma squared here. This is completely clean. And as you can see, because of that,

the plot is very regular. Here, first, we define this in

a particular case, to be able to run an experiment.

Second, in terms of that, it’s not clear

what is– can you tell me what is the energy of the deterministic

noise here? There’s quite a bit of normalization

that was done. So when we normalize the target

to make sigma squared meaningful, we sacrifice something– the target now is sandwiched within a limited range. And therefore the amount of energy,

whatever the deterministic noise is, will be limited, regardless of

how complex the target is. So there is a compromise we had

to make, to be able to find these plots. However, the moral of the story here

is that there’s something about the target complexity that behaved in the

same way, as far as overfitting is concerned, as noise. And we identified it as

deterministic noise. We didn’t quantify it further. And it will be– It’s possible to quantify it.

You can get the energy for this

and that, and you can do it. But these are research topics. As far as we are concerned, in a real

situation, we won’t be able to identify either the stochastic noise

or the deterministic noise. We just know they’re there. We know the impact of overfitting. And we will be able to find methods

to be able to cure the overfitting, without knowing all of the

specifics that we could know about the noise involved. MODERATOR: Do you ever measure the– is there some similar kind of measure

of the complexity of the target function? Do you ever use a VC dimension

for that? PROFESSOR: Not explicitly.

One can apply it. You say what

is the model that would include the target function? And then, based on that inclusion, you can take the complexity of that model as the complexity of the target. The analysis we use is such that the

complexity of the target function doesn’t come in, in terms

of the VC analysis. But there are other approaches, other than the VC analysis, where

the target complexity matters. So I didn’t particularly spend time

trying to capture the complexity of the target function until this moment,

where the complexity of the target function could translate to something

in the bias-variance decomposition, and that has an impact on overfitting

and generalization.

MODERATOR: I think that’s it. PROFESSOR: We will see you on Thursday.
