The High Bias Struggle is Real

I'm doing research this semester on machine learning and fairness. But before I can really dig into that, I need a firmer grasp of machine learning terminology. So I've been working my way through a paper called "A Few Useful Things to Know about Machine Learning" by Pedro Domingos.

Sounds sweet and gentle, right? It's not (for little ol' me at least).

Part of the problem is that I'm a bit of an island on this one. This research is for an independent study that covers a lot more than the details of machine learning algorithms. It's about the process of regulating these tools, and involves public policy, law, ethical frameworks, corporations, nongovernmental regulatory bodies, individual psychology, education, and more.

That means I'm making the most out of my interdisciplinary program. But it also means that I have to save up my big computer science questions and try to find a nice computer scientist to talk them out with me, because I often learn best when I can ask questions and interrogate responses. Until then, I'm left to my own devices. And that often involves writing out ideas to try to get them straight in my head. 

Last night, I tackled this sentence by Googling things, searching the trusty Artificial Intelligence: A Modern Approach, and asking questions of a kind software engineer:

"A linear learner has high bias, because when the frontier between two classes is not a hyperplane the learner is unable to induce it."

This is describing an area in which data scientists need to be particularly careful when designing machine learning algorithms. Here's my translation:

"A computer model that is designed to find a way to explain patterns in example data by drawing a clear-cut, straight boundary between Things of Type 1 and Things of Type 2 can be way off in some cases. For one, the patterns in the data might be messier than can be represented by a straight boundary. It might be impossible to clearly say, 'Hey, everyone on that side of the line is this thing, and every on this side of the line is that thing.' Things that fall into the same category may be more mixed together.

Applying a computer model that tries to find a straight boundary to that kind of mixed-up data will still give you a model of some sort. It will find some sort of boundary. But it won't draw the right conclusions from the examples it is fed, and it won't tell you how the real world actually works. This is one reason it is important to make sure you understand your data and are looking for the right kinds of patterns."

And here's how I got to that... (If you're a machine learning person and happen to be reading this, please let me know if I got anything wrong so I can learn!)

 

Background

First, some context. Machine learning is a process in which a computer program derives rules from a set of known data and then applies those rules to data it hasn't seen before. The rules aren't universal truths; they're based on probabilities learned from that initial data set. The process of learning the rules is called induction. And the ability to apply the rules learned from the training set to data in the wild is called generalization.
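To make that concrete, here's a minimal sketch using scikit-learn. (The library and the toy data are my own choices for illustration; none of this comes from the paper.) The fit step is the induction; the predict step on unseen inputs is the generalization:

```python
# A minimal sketch of induction and generalization using scikit-learn.
from sklearn.linear_model import LogisticRegression

# Known data: each input is (hours studied, hours slept); each output
# says whether that student passed (1) or failed (0). Made-up numbers.
X_train = [[8, 7], [6, 8], [2, 4], [1, 6], [7, 5], [3, 3]]
y_train = [1, 1, 0, 0, 1, 0]

learner = LogisticRegression()
learner.fit(X_train, y_train)    # induction: learn rules from known data

X_new = [[5, 6], [1, 2]]         # data the learner has never seen
print(learner.predict(X_new))    # generalization: apply the learned rules
```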

That's basically how humans learn some things, too. You understand basic rules about how the world works, and when you encounter new information that you haven't experienced before, you make a best guess about how to deal with it based on the things you've learned to be true.

In the case of machine learning, a programmer writes code that directs a computer to process a set of data (which could be about anything). When the program is learning the rules of its world, it is fed data as input and told what the corresponding output should be. The program includes mathematical functions that change over time, creating and refining rules as it processes more input-output pairs.
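Here's roughly what "mathematical functions that change over time" can look like in the simplest case I know of, a perceptron. This is a bare-bones sketch I wrote for illustration, not how production systems do it: each time the learner gets an input-output pair wrong, it nudges its weights (its "rules") toward the right answer.

```python
# A bare-bones perceptron: the weights are the "rules," and they get
# refined a little each time the learner processes an input-output pair.
def train_perceptron(pairs, passes=10, lr=0.1):
    w, b = [0.0, 0.0], 0.0
    for _ in range(passes):
        for (x1, x2), label in pairs:           # label is 0 or 1
            guess = 1 if w[0]*x1 + w[1]*x2 + b > 0 else 0
            error = label - guess               # -1, 0, or 1
            w[0] += lr * error * x1             # nudge the rule toward
            w[1] += lr * error * x2             # the correct answer
            b += lr * error
    return w, b

# Each pair is (input, output); the learner is told what the output is.
pairs = [((2, 3), 1), ((1, 1), 0), ((3, 4), 1), ((0, 2), 0)]
print(train_perceptron(pairs))
```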

The program or model being trained in this way is called the "learner." After it has learned all of the rules it can from its training data, the program is then run on data it has not encountered before. Its goal, like yours when you learn, is to draw conclusions about that new data based on the rules it learned.

Sometimes, the learner has been taught that data can be organized into a few categories or classes of things depending on the features it has. That kind of program is called a classifier. It is applied to new data and classifies that data based on the rules that it learned. A common example of classification in practice is a spam filter. The filter is a classifier that classifies emails as either spam or not. 
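A toy version of that spam filter might look like this, with tiny made-up emails and scikit-learn's naive Bayes classifier standing in for a real filter (real ones train on millions of messages):

```python
# A toy spam filter: a classifier trained on labeled example emails.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

emails = ["win free money now", "meeting at noon tomorrow",
          "free prize claim now", "lunch with the research team"]
labels = ["spam", "not spam", "spam", "not spam"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)   # turn words into countable features

classifier = MultinomialNB()
classifier.fit(X, labels)              # learn which words signal spam

new = vectorizer.transform(["claim your free money"])
print(classifier.predict(new))         # -> ['spam']
```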

 

Linear Learner + Hyperplane + the Frontier Between Two Classes

The phrase "a linear learner" is referring to the model that is being trained to become a classifier. It is linear because a mathematical function can be written that neatly separates the data into categories—things either falls on one side of the line or the other. The red and blue lines in the image at right are separating the data in this way.

The "hyperplane" is the term used to describe the line. It can take up two dimensions or more.

The "frontier" refers that border formed by the line/hyperplane. 

 

High Bias

Bias means that the learner is consistently learning something, but that something is not what it is supposed to be learning. In situations of high bias, the model that was created doesn't capture the patterns that matter most in the data, which means it may miss a lot when it's applied to real-world data sets (not just the training data).

High variance learners, in contrast, do a great job of accounting for all of the patterns in the training data. But that thoroughness is the problem: they overfit, picking up a lot of random noise in the training data that has nothing to do with the patterns that matter most, and so they stumble when applied in the wild.
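Here's a little experiment that shows both failure modes at once. (The synthetic curved data and the model choices are mine, purely for illustration.) A straight line underfits the curve on both halves of the data, while a 15th-degree polynomial nails the training points but tends to do much worse on the held-out half:

```python
# High bias vs. high variance on data with a curved underlying pattern.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-3, 3, 40)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(0, 0.2, 40)    # curved truth + noise
X_train, y_train = X[::2], y[::2]                 # half for training,
X_test, y_test = X[1::2], y[1::2]                 # half held out

for degree, label in [(1, "high bias (straight line)"),
                      (15, "high variance (wiggly curve)")]:
    features = PolynomialFeatures(degree)
    model = LinearRegression().fit(features.fit_transform(X_train), y_train)
    train_err = mean_squared_error(y_train, model.predict(features.transform(X_train)))
    test_err = mean_squared_error(y_test, model.predict(features.transform(X_test)))
    print(f"{label}: train error {train_err:.3f}, test error {test_err:.3f}")
```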

 

Unable to Induce It

The word induce here refers to the way these computer programs formulate rules about data: the process of induction. So saying a linear learner is "unable to induce" a frontier that isn't a hyperplane just means it can't formulate a straight-line rule that captures a curvy or tangled boundary, no matter how many examples it sees.
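The classic demonstration (my example, not the paper's) is XOR-style data, where the class depends on the two inputs disagreeing. No straight line separates these four points, so a linear learner can't even get its own training data fully right:

```python
# The classic frontier a linear learner can't induce: XOR. The true
# boundary between the classes isn't a hyperplane, so no line works.
from sklearn.linear_model import Perceptron

X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]   # class 1 only when the inputs disagree

learner = Perceptron(max_iter=1000)
learner.fit(X, y)
print(learner.score(X, y))   # stuck well below a perfect 1.0, even on
                             # the very data it was trained on
```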