Data Mining with Weka (4.2: Linear regression)
Hi! This is Lesson 4.2 on Linear Regression. Back in Lesson 1.3, we mentioned
the difference between a classification problem and a regression problem. A classification problem is when what you’re
trying to predict is a nominal value, whereas in a regression problem what you’re trying
to predict is a numeric value. We’ve seen examples of datasets with nominal
and numeric attributes before, but we’ve never looked at the problem of regression, of trying
to predict a numeric value as the output of a machine learning scheme. That’s what we’re doing in this [lesson],
linear regression. We’ve only had nominal classes so far, so
now we’re going to look at numeric classes. This is a classical statistical method, dating
back more than 2 centuries.
This is the kind of picture you see. You have a cloud of data points in 2 dimensions,
and we’re trying to fit a straight line to that cloud of points, looking for the best straight-line fit. In our case, though, there might be more than 2 dimensions; there might be many of them. It’s still a standard problem. Let’s just look at the 2-dimensional case
here. You can write a straight-line equation in this form: x = w0 + w1*a1 + w2*a2 + ..., where the w’s are weights. Just think about this in one dimension, where there’s only one “a”. Forget about all the terms at the end; just consider x = w0 + w1*a1.
That’s the equation of this line — it’s the
equation of a straight line — where w0 and w1 are two constants to be determined from
the data. This, of course, is going to work most naturally
with numeric attributes, because we’re multiplying these attribute values by weights. We’ll worry about nominal attributes in just
a minute. We’re going to calculate these weights, w0, w1, w2, and so on, from the training data.
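Just to make the formula concrete, here is a minimal sketch in plain Java. Nothing here is Weka-specific; the names and numbers are made up purely for illustration.

    // Straight-line (linear) model: x = w0 + w1*a1 + w2*a2 + ...
    public class LinearModelSketch {
        // w holds w0, w1, w2, ...; a holds the attribute values a1, a2, ...
        static double predict(double[] w, double[] a) {
            double x = w[0];                   // w0, the constant term
            for (int j = 0; j < a.length; j++) {
                x += w[j + 1] * a[j];          // add wj * aj for each attribute
            }
            return x;
        }

        public static void main(String[] args) {
            double[] w = {1.5, 0.5, -0.25};    // made-up weights w0, w1, w2
            double[] a = {2.0, 4.0};           // made-up attribute values a1, a2
            System.out.println(predict(w, a)); // 1.5 + 0.5*2.0 - 0.25*4.0 = 1.5
        }
    }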
Then, once we’ve calculated the weights, we’re going to predict the class value for the first training instance. The notation gets horrendous here; I know it looks pretty scary, but it’s pretty simple. We’re using this linear sum, with the weights that we’ve calculated, on the attribute values of the first training instance to get the predicted value for that instance. These w’s are just numbers that we’ve calculated from the training data, and these are the attribute values a1^(1), a2^(1), a3^(1) of the first training instance: the (1) at the top, the superscript, means it’s the first training instance, and the subscripts 1, 2, 3 mean the first, second, and third attribute. We can write this in a neat little sum form, x^(1) = the sum over j of wj * aj^(1), which looks a little bit better. Notice, by the way, that we’re defining a0, the zeroth attribute value, to be 1.
That just makes the formula work. For the first training instance, this gives us the number x^(1), the predicted value for that instance, from its particular attribute values. Then we choose the weights to minimize the squared error on the training data. Here x^(i) is the actual class value for the i-th training instance, and the linear sum is the predicted value for the i-th training instance. We take the difference between the actual and the predicted value for each instance, square it, and add them all together; that’s what we’re trying to minimize. We get the weights by minimizing this sum of squared errors. That’s a mathematical job; we don’t need to worry about the mechanics of doing it.
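Here is a small sketch, again in plain Java with made-up numbers, of the quantity being minimized: the sum of squared errors over the training instances. It only evaluates the sum for one candidate set of weights; finding the weights that make it smallest is the job Weka does for us.

    // Sum of squared errors for candidate weights w over a tiny training set.
    public class SquaredErrorSketch {
        // a[i] holds the attribute values of instance i; x[i] is its actual class value.
        static double sumSquaredErrors(double[] w, double[][] a, double[] x) {
            double sse = 0;
            for (int i = 0; i < a.length; i++) {       // for each training instance i
                double predicted = w[0];                // w0 (a0 is defined to be 1)
                for (int j = 0; j < a[i].length; j++) {
                    predicted += w[j + 1] * a[i][j];    // wj * aj for instance i
                }
                double error = x[i] - predicted;        // actual minus predicted
                sse += error * error;                   // square it and add it in
            }
            return sse;
        }

        public static void main(String[] args) {
            double[][] a = {{2.0}, {4.0}, {6.0}};       // one attribute, three instances (made up)
            double[] x  = {3.0, 5.5, 7.0};              // actual class values (made up)
            double[] w  = {1.0, 1.0};                   // candidate weights w0, w1
            System.out.println(sumSquaredErrors(w, a, x)); // prints 0.25 for these numbers
        }
    }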
It’s a standard matrix problem. It works fine if there are more instances
than attributes. You couldn’t expect this to work if you had
a huge number of attributes and not very many instances. But providing there are more instances than
attributes (and usually there are, of course) that’s going to work OK. If we did have nominal attributes, then a two-valued (binary) one can simply be converted to 0 and 1, and those numbers used; for example, false becomes 0 and true becomes 1. As for multi-valued nominal attributes, you’ll have a look at those in the activity at the end of this lesson. We’re going to open a regression dataset and
see what it does: cpu.arff. This is a regular kind of dataset. It’s got numeric attributes, and the most
important thing here is that it’s got a numeric class — we’re trying to predict a numeric
value.
We can run LinearRegression; it’s in the functions
category. We just run it, and this is the output. We’ve got the model here. The class is predicted as a linear sum; these are the weights I was talking about. It’s this weight times this attribute value, plus this weight times this attribute value, and so on, minus a constant at the end: that’s w0, the constant weight, not multiplied by any attribute value. This is the formula for computing the class. Using that formula, you can look at how successful it is on the training data. The correlation coefficient, which is a standard statistical measure, is 0.9. That’s pretty good.
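If you’d rather drive this from code than from the Explorer, something roughly like the following, using the Weka Java API, should reproduce what we just did. I’m assuming cpu.arff is in the working directory and that, as here, we evaluate on the training data.

    import weka.classifiers.Evaluation;
    import weka.classifiers.functions.LinearRegression;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class CpuLinearRegression {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("cpu.arff");   // load the dataset
            data.setClassIndex(data.numAttributes() - 1);   // the class is the last attribute

            LinearRegression lr = new LinearRegression();
            lr.buildClassifier(data);                       // compute the weights
            System.out.println(lr);                         // prints the linear model

            Evaluation eval = new Evaluation(data);
            eval.evaluateModel(lr, data);                   // evaluate on the training data
            System.out.println("Correlation: " + eval.correlationCoefficient());
            System.out.println("MAE:         " + eval.meanAbsoluteError());
            System.out.println("RMSE:        " + eval.rootMeanSquaredError());
        }
    }

The printed model and error figures should correspond to what the Explorer shows when you evaluate on the training set.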
Then there are various other error figures
here that are printed. On the slide, you can see the interpretation
of these error figures. It’s really hard to know which one to use. They all tend to produce the same sort of
picture, but I guess the exact one you should use depends on the application. There’s the mean absolute error and the root
mean squared error, which is the standard metric to use. That’s linear regression. I’m going to look at nonlinear regression
here. A “model tree” is a tree where each leaf has
one of these linear regression models. We create a tree like this, and then at each
leaf, we have a linear model, which has those coefficients. It’s like a patchwork of linear models, and
this set of 6 linear patches approximates a continuous function. There’s a method under “trees” with the rather
mysterious name of M5P. If we just run that, that produces a model
tree.
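From code, running M5P looks much the same as LinearRegression did; a rough sketch with the Weka Java API, again assuming cpu.arff is at hand:

    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.M5P;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class CpuModelTree {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("cpu.arff");
            data.setClassIndex(data.numAttributes() - 1);

            M5P tree = new M5P();
            tree.buildClassifier(data);    // builds the model tree
            System.out.println(tree);      // prints the tree and the linear model at each leaf

            Evaluation eval = new Evaluation(data);
            eval.evaluateModel(tree, data);
            System.out.println("Correlation: " + eval.correlationCoefficient());
        }
    }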
Back in the Explorer, maybe I should just visualize the tree. Now I can see the model tree, which is similar
to the one on the slide. You can see that each of these — in this
case 5 — leaves has a linear model — LM1, LM2, LM3, … And if we look back here, the
linear models are defined like this: LM1 has this linear formula; this linear formula for
LM2; and so on. We chose trees > M5P, we ran it, and we looked
at the output. We could compare these performance figures
— 92-93% correlation, mean absolute error of 30, and so on — with the ones for regular
linear regression, which got a slightly lower correlation, and a slightly higher absolute
error — in fact, I think all these error figures are slightly higher. That’s something we’ll be asking you to do
in the activity associated with this lesson. Linear regression is a well-founded, venerable
mathematical technique.
Practical problems often require non-linear
solutions. The M5P method builds trees of regression
models, with linear models at each leaf of the tree. You can read about this in the course text
in Section 4.6. Off you go now and do the activity associated
with this lesson. See you soon. Bye!