The typical person interacts with Machine Learning (ML) applications dozens, or even hundreds, of times every day, often without even realizing it. Machine Learning is ubiquitous in the world today – product recommendation systems, email spam filtering, facial recognition systems, predictive text, search engine algorithms, and predictive analytics are all examples of machine learning. Machine Learning is a subfield of artificial intelligence that uses algorithms to learn from data and draw inferences from that data.
While machine learning is used in a host of valuable tools, anyone who has shared their Netflix account with a friend without creating separate profiles can tell you that it’s imperfect. An email inadvertently caught in your spam filter, navigation directions that send you through a parking lot, or autocorrect suggesting you really meant to type “ducking” are all examples of times when ML makes an error. Errors are unavoidable in the real world – sometimes, an important email is going to end up in your spam filter, no matter how good the ML algorithm is. But other errors (like the infamous “ducking” autocorrect example) are due to how the model is set up – the autocorrect model is biased to exclude certain words, regardless of how commonly they may be used.
In this article, we will discuss five common sources of bias in machine learning. By being aware of these biases, you can be a more informed consumer of ML technology. Understanding the limitations of ML can help you unlock its full potential and use it in the most effective and appropriate ways to grow your business.
Researcher Bias

Like all other types of research, machine learning is subject to the problem of researcher bias. Many people think of researchers as unbiased arbiters of the truth – and, in reality, this is what many researchers strive to be! But researchers are still human, and are subject to the same biases as anyone else. Since every machine learning model relies on original inputs from human researchers, machine learning is influenced by the same researcher biases as every other type of research.
When we talk about researcher bias, we do not mean researchers are intentionally fabricating or skewing results to push a particular agenda. Instead, researchers are biased (typically in subconscious ways) by their own backgrounds and experience – a social scientist will create a machine learning model that looks very different from one a computer scientist would create to solve the same problem. An American researcher will approach a problem differently from a Japanese researcher because of cultural differences. Numerous aspects of a researcher’s background can influence what a ML model produces.
Take the example of predictive analytics. Two researchers can be given the exact same data but are nearly certain to arrive at different model specifications. In a famous example, 29 independent research teams analyzed the same dataset to determine whether dark-skinned soccer players received more red cards than light-skinned soccer players. 20 of these teams found statistically significant relationships, and 9 found none. The results varied widely between teams (from no difference to dark-skinned players being 3 times more likely than light-skinned players to receive a red card). These research teams were all trying to find the correct answer, but biases within the teams led them to build different models, and therefore reach different conclusions.
This type of bias is difficult to address as a consumer of machine learning. An important strategy, however, is to understand (as much as possible) what goes into the algorithms behind ML products and models. Often this is proprietary information, which further obscures how models work. But when model creators are transparent about how a model is produced, consumers can understand the model's inputs and determine how those inputs can bias its outputs.
Training Data Bias
Any ML model is limited by its training data. We are unable to harness the whole of human knowledge to train ML models, so we are forced to make choices about what data to use to train these models. Bias from training data can take on a few forms, which are important to understand.
First, training data can be biased by the point in time at which it was collected. ChatGPT showcases this issue: its training data extends only to September 2021, which means it has no knowledge of anything from the past two years. An obvious implication is that it will struggle to provide information on current events, which humans can easily adjust for. A larger issue, however, is that human knowledge has changed over those two years. If you ask ChatGPT to help you determine the best MarTech stack for your company, it will give you the cutting-edge technology – from September 2021. This problem is exacerbated as the training data becomes more stale over time.
A second, and perhaps harder, issue to address is that if training data contains historical biases, your ML model will learn to replicate them. Imagine you are creating a program to screen the “best” hires for senior leadership in your company. Trained on historical data, a ML model may “learn” that, in the past, the most successful companies disproportionately hired white men into senior leadership, and use those demographics to suggest candidates. In this way, your ML model can perpetuate historical biases – those candidates were not successful because white men are more successful, but because white men were much more likely to be given the opportunity in the past. You can train your models to ignore features like demographics (or, say, gaps in resumes – which disproportionately affect women who leave the workforce temporarily to raise children), but you must be explicit about doing so to remove this bias from training data.
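As a minimal sketch of one such mitigation, sensitive fields (and known proxies for them) can be dropped from candidate records before they ever reach a model. The field names and records below are hypothetical illustrations, not a production approach:

```python
# Hypothetical candidate records for a hiring model.
candidates = [
    {"years_experience": 12, "certifications": 3, "gender": "F", "resume_gap_months": 18},
    {"years_experience": 9,  "certifications": 5, "gender": "M", "resume_gap_months": 0},
]

# Demographics plus known proxies (e.g. resume gaps) to exclude.
SENSITIVE = {"gender", "resume_gap_months"}

def strip_sensitive(record, sensitive=SENSITIVE):
    """Return a copy of the record with sensitive fields removed."""
    return {k: v for k, v in record.items() if k not in sensitive}

training_rows = [strip_sensitive(c) for c in candidates]
print(training_rows[0])  # only job-relevant features remain
```

Note that simply dropping listed fields is not sufficient on its own: other features can still correlate with the excluded ones, which is exactly the proxy problem discussed later in this article.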
Sampling Bias

Many times when data is collected or analyzed, we need to use a sample of the available data. In predictive analytics, your population refers to what you are trying to make inferences about – this can be, for example, the Canadian population, people who own pickup trucks, or businesses with annual revenue greater than $1 billion. Often, we either do not have data available for an entire population, or the entire population is too large to run analytics on efficiently, so we use a sample of that population. A sample is simply some subset of the population we care about.
The best way to take a sample that is representative of a population is to take a simple random sample of the population. In a simple random sample, each member of the population has an equal chance of being selected for the sample. Seems easy, right? So why is sampling bias such a problem when the solution is so simple? Let’s turn to an example.
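As a minimal sketch (with a made-up population of customer IDs), Python's standard library can draw a simple random sample in one line:

```python
import random

# Hypothetical population: 10,000 customer IDs.
population = list(range(10_000))

random.seed(42)  # seeded only so this sketch is reproducible

# random.sample draws without replacement, giving every member
# an equal chance of selection -- a simple random sample.
sample = random.sample(population, k=500)

print(len(sample))       # 500 members drawn
print(len(set(sample)))  # all distinct (no replacement)
```

The mechanics are easy; the hard part, as the next example shows, is having a complete and accurate list of the population to sample from in the first place.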
The website fivethirtyeight.com has carved a niche in the industry of election predictions. Using proprietary models, fivethirtyeight aggregates political polls to make predictions about the outcomes of various United States (and international) elections. fivethirtyeight and its founder, Nate Silver, rose to prominence by predicting the results of the 2008 and 2012 United States elections with astounding accuracy. However, during the 2016 election, fivethirtyeight (and nearly every other organization in the electoral predictive analytics business) was wrong – predicting Hillary Clinton to win the presidential election instead of the actual winner, Donald Trump.
How could a predictive model that was so accurate in 2008 and 2012 be so wrong in 2016? This has to do with sampling bias and presents a major difficulty for predictive analytics. Let’s think back to populations and samples – political polls are samples of a population, but they are samples of an unknown population. They are attempts to sample a population (people who voted in the 2016 election) at a point in time before anyone has voted in that election. These polls create models of likely voters in order to best approximate the population of actual voters. In 2016, previous models of likely voters “broke” – people without college degrees tended to vote at significantly higher rates than in previous elections, and these people were more likely to vote for Donald Trump.
Predictive analytics is always faced with this sampling problem – since we are trying to make a prediction about the future, we are taking a sample of an unknown population. While we can create models (based on past behaviors) that we think will do a good job of approximating this population, we can never be completely sure we are accurate.
Past Behavior Bias
Relatedly, ML models in the predictive analytics space are biased towards the assumption that past behavior is (at least mostly) useful for predicting future behavior. Many times, this is a fairly safe assumption – the people buying Diet Coke today are likely very similar to the people who will be buying Diet Coke a year from now. But without careful attention from researchers, there are circumstances where using past behavior to predict future behavior is problematic.
Let’s look at the case of airport retail vendors in the aftermath of the September 11, 2001 terrorist attacks. In the immediate wake of the attacks, business decreased for most airport retailers. This was due in part to changes in the customer base for these businesses – since individuals could no longer pass security checks without an airline ticket, “meeters and greeters” at the airport were no longer potential customers for businesses located past the security checkpoint.
However, there was another change that airport businesses were able to take advantage of. As individuals faced longer and more uncertain security procedures, they began arriving at the airport much earlier, increasing the amount of time spent past security in the terminal. Essentially, the audience for many of these retailers changed – if they were using historical data from August 2001 or earlier, these retailers would be left behind in the changing landscape.
The bias of ML models to assume past behavior is predictive of future behavior is something that consumers must be wary of. As we use these models, it is important to use our own knowledge of the world and commercial landscape to understand when past behavior should be given more or less weight in predicting the future.
Proxy Bias

Proxy bias occurs when we try to measure a characteristic of individuals using some indirect measure. For example, imagine we do not have information about an individual’s income, but we do have their monthly rent or mortgage payment. We could treat “monthly housing expense” as a proxy for “income” – a reasonable one, since people who pay more per month for housing likely earn more than those who pay less. However, some pitfalls make this proxy imperfect. Some people may have paid off their home completely, giving them a monthly housing expense of $0 (or, realistically, only their property tax payments). A monthly housing expense represents a different cost for someone living alone than for someone living with a spouse or roommates. While this is a reasonable proxy in the aggregate, at the individual level we will make some mistakes.
This problem can be exacerbated when using aggregate level data to make inferences about individuals. A considerable amount of consumer or demographic data in the US may be available only at the ZIP code level. Using this data can help us make probabilistic inferences about people (for example, their race, their income, what magazines they subscribe to), but at the individual level, these inferences can be very incorrect. This is a problem called the ecological fallacy, which says we cannot make accurate inferences about individuals using data from a higher level. That is, just because someone lives in a ZIP code where 70% of people subscribe to an outdoor magazine, we cannot safely assume that person enjoys outdoor activities.
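To make the ecological fallacy concrete, here is a toy illustration with made-up numbers: labeling every resident of a 70%-subscriber ZIP code as a subscriber is still expected to be wrong for roughly 3 in 10 of them.

```python
# Toy illustration of the ecological fallacy (all numbers hypothetical).
zip_subscription_rate = 0.70  # aggregate: 70% of this ZIP code subscribes

def infer_from_aggregate(rate, threshold=0.5):
    """Naive individual-level inference: assign everyone the majority label."""
    return rate > threshold

residents = 1_000

# The aggregate rule labels every single resident a subscriber...
predicted_subscriber = infer_from_aggregate(zip_subscription_rate)

# ...but is expected to be wrong for ~300 of the 1,000 residents.
expected_errors = round((1 - zip_subscription_rate) * residents)
print(predicted_subscriber, expected_errors)
```

The aggregate statistic is accurate about the ZIP code; the error comes from treating it as a statement about any particular individual living there.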
We must be careful to understand where data comes from in ML models. Is the data at the individual level (and therefore accurate as of the most recent update) or is it inferred from some aggregate characteristic? If it is the latter, we should be cautious about the inferences we draw from this data.
Need support with your customer data? Talk to Shift Paradigm.