Overfitting with Linear Classifiers

by Pegah   Last Updated April 16, 2018 12:19 PM

Today our professor stated in class that "overfitting with linear classifiers is not possible". I hold that to be wrong, since even linear classifiers can be sensitive to outliers in the training set - take for instance a hard margin Support Vector Machine: One single noisy datapoint can alter which hyperplane will be used to separate datasets. Or am I wrong? Obviously, linearity will probably prevent rather from overfitting due to lower model complexity, still I do not see why overfitting should be impossible. One additional point is that when I tried to think about this problem I realized that "overfitting" does not seem to be formally defined. Why is that? Wouldn't some distance measure between training and test set performance allow for such a formalisation? Thanks

Answers 3

A linear regression / classifier can absolutely be overfit if used without proper care.

Here's a small example. Let's create two vectors, the first is simply $5000$ random coin flips:

N <- 5000
y <- rbinom(N, 1, .5)

The second vector is $5000$ observations, each randomly assigned to one of $500$ random classes:

N.classes <- 500
rand.class <- factor(sample(1:N.classes, N, replace=TRUE))

There should be no relation between our flips y and our random classes rand.class, they were determined completely independently.

Yet, if we attempt to predict the random flip with the random class using logistic regression (a linear classifier), it sure thinks there is a relationship

M <- glm(y ~ rand.class, family="binomial")
hist(coef(M), breaks=50)

enter image description here

The true value of every one of these coefficients is zero. But as you can see, we have quite a spread. This linear classifier is for sure overfit.

Note: The extremes in this histogram, where the coefficients have wandered to $-15$ and $15$, are cases where a class had either no observations with y == 1 or no values with y == 0. The actual estimated values for these coefficients are plus and minus infinity, but the logistic regression algorithm is hard coded with a bound of $15$.

"overfitting" does not seem to be formally defined. Why is that?

Overfitting may be best understood within the context of a class of models which has some complexity parameter. In this case, a model could be said to be overfit when decreasing the complexity slightly results in better expected out of sample performance.

It would be very difficult to precisely define the concept in a model independent way. A single model is just fit, you need something to compare it to for it to be over or under fit. In my example above this comparison was with the truth, but you usually don't know the truth, hence the model!

Wouldn't some distance measure between training and test set performance allow for such a formalisation?

There is such a concept, it's called the optimism. It's defined by:

$$ \omega = E_{\text{test}} - E_{\text{train}} $$

where $E$ stands for error, and each term is averaged over all possible training and testing sets for your learning algorithm.

It doesn't quite get at the essence of overfitting though, because the performance on a test set can be quite a bit worse than the train, even though a model of higher complexity decreases both.

Matthew Drury
Matthew Drury
October 06, 2015 22:28 PM

I think that overfitting refers to model complexity rather than generalization ability. I understand the quote "a linear classifier cannot be overfitted" since its complexity is small, and there is no other simpler classifier providing a better performance.

The example is linked to the generalization ability of linear classifiers (and complex ones). Even in this second part, linear classifiers usually provide less variance than complex ones, thus the "overfitting" value for linear classifiers, following this concept, is also smaller (although the empirical risk of them could be so large). atb

pepe Catro
pepe Catro
April 16, 2018 11:20 AM

In the 70-ties, experiments with pattern recognition algorithms on large datasets revealed that adding extra features did in some cases increase the test-set error rates. This is counter intuitive because one would expect that adding an extra feature always increases classifier performance, or in case that the added feature is 'white noise', its addition does not influence classifier performance at all. The effect of adding still more extra features to a classifier, eventually leading to a decrease in test-set performance became known as the peaking phenomenon [1].

Feature peaking is caused by over-generalization during learning. The extra features cause the inclusion of so many additional parameters that the classifier begins to overfit the data. Hence, the peaking point is passed.

In general, we face a bias-variance trade-off when training classifiers. The more feature-variables we use, the better will the (unknown) underlying classifier mechanism possibly be modelled by our classifier. Hence, the systematic devition between fitted model and 'truth' will lessen, i.e. a smaller bias results. On the other hand, increasing the feature space of the classifier necessarily implies the addition of parameters (those that fit the added features). Thus, the variance of the fitted classifier increases too.

So the classifier exeeding the peaking point is just one stochastic realization of a high-dimensional classification problem, and a new fit will result in a highly different parameter vector. This fact reflects the increased variance.

[1. G. V. Trunk, "A Problem of Dimensionality: A Simple Example," in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. PAMI-1, no. 3, pp. 306-307, July 1979]

Match Maker EE
Match Maker EE
April 16, 2018 12:09 PM

Related Questions

Does it make sense to combine PCA and LDA?

Updated June 03, 2016 09:08 AM