Big Data as Collective Judgments
People put a lot of stock in data. Data is essentially information. It often becomes the basis for an assumption. When we think of data, we think of facts. That works in many settings. For example, an ingredient label, done truthfully, lists ingredients. If it lists flour, one would reasonably assume the product contains flour. But opinions seep into data. And some data is exclusively opinions, aggregated, ranked, and sorted.
Data is raw input. It can be structured, unstructured, or semi-structured. It is information, but at first it is unorganized information. When information based on data is set in context, it becomes knowledge. Wisdom occurs when knowledge is built up over time and based on prior experiences. Wisdom can inform decision making. AI can sort through raw data, learn from it, and facilitate or replace human decision making. Therefore, it is imperative to understand whether raw data is fact or just a numeric or symbolic opinion.
How Someone Feels
Some data does not lend itself to collection as big data. It makes sense on its own, not when collected and combined. For example, collecting data in surveys is an area where opinion seems to become fact. If many people rank something subjective on a scale of one to ten and the results are scattered, the ranking reveals little about preferences. In some cases, the number reflects a person’s mood or circumstance. If you ask someone who is very full how he feels about delicious cake, he may answer less enthusiastically than if you ask the same person when he is very hungry and daydreaming about food. If the same person answers two in the evening after dinner but ten in the early afternoon, then the timing of the survey really matters. If one person’s answer is subjective, unreliable, and varies based on time, mood, physical state, or whim, then how could a collection of a million answers, tabulated, averaged, and plotted, be meaningful at all? Sometimes, it simply is not. If the answers to questions about preferences vary less, then the information could be more useful, as in a political poll. Sometimes the answers reveal a pattern. Political polling often reveals sentiment and leads to sound conclusions about how people feel. Other times, polling is wrong because those analyzing it operate on faith and ignore the noise.
The numbers on a pain scale are subjective. One person may say eight when another says four even if they feel exactly the same. It is difficult to validate how they feel. But if many people respond in pain surveys ranging from one to ten, and the data is collected and tabulated, it starts to feel like fact. The distribution of pain levels can look relatively organized; for example, many people may report an eight after a certain surgery, leading one to conclude the number is factual. Or the results could look random, with answers spread from one to ten, even evenly distributed. That may lead to a temptation to average the numbers and conclude the pain is usually a five. When technology tabulates that way, the idea that the number itself was mere opinion gets lost. “I feel like my pain is an eight” is an opinion, not a fact. I may really feel it is an eight, and my answer is truthful, but what an eight is to me is not objectively comparable to anyone else’s eight. The definition of an eight is open to matching a different feeling in a different person. If a pain scale describes eight as terrible, or ten as the worst pain you have ever had, then one would need to know what other pain you have had: only a little, or many major surgeries, painful illnesses, etc. The pain scale is always a judgment. Studies indicate that pain questionnaires do not correlate with patient satisfaction with pain care. I am just using pain as an example: other dangerous collective judgments morph into research and look like fact.
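To make the averaging problem concrete, here is a minimal sketch in Python. The scores are hypothetical numbers I made up for illustration, not survey results, and the function name summarize is my own. It shows three very different sets of answers that all average to five.

```python
from statistics import mean, pstdev

def summarize(scores):
    """Return (average, spread) for a list of 1-10 pain scores."""
    return mean(scores), pstdev(scores)

# Hypothetical answers, invented for illustration only.
clustered = [5] * 10                        # everyone answers five
scattered = [1, 2, 3, 4, 5, 5, 6, 7, 8, 9]  # answers spread across the scale
polarized = [1] * 5 + [9] * 5               # half answer one, half answer nine

for name, scores in [("clustered", clustered),
                     ("scattered", scattered),
                     ("polarized", polarized)]:
    average, spread = summarize(scores)
    print(f"{name}: average={average}, spread={spread:.2f}")
```

The average is five in every case, yet nobody in the polarized group answered five. Even the spread treats my eight as if it were the same as your eight, which is the deeper problem.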
Data used to make predictions or to assess risk is often a compilation of many subjective judgments. For example, if data to predict child neglect includes past referrals to child services, then the judgment by a random person who is merely reporting becomes part of a metric. Let’s say a mother allows her child to play alone outside. Many mothers allow their kids to be out unattended. Neighbors or passersby may report out of fear, bias, hatred, racism, personal gripe, or annoyance with a particular person, etc. Leaving children unattended is considered a risk factor for neglect and is sometimes considered neglect itself. But the record of who does it the most is not factual. The reports act as a proxy for the action. As with many reports, it is a record of who is observed leaving their children unattended by someone who is likely to report it. The data stems from who reports it the most: mandatory reporters and the neighbors or observers. People who do the same activity in spaces where their actions are not easily observed would not be reported. In fact, they are often praised for allowing their children to explore the outdoors. There are distinctions: many would argue that the risks associated with children playing outside unattended vary by neighborhood. The reports are judgments about poverty and exposure to potential violence, and about parenting. The data is incomplete. Certain people are disproportionately likely to be subjects of a report: those in crowded inner cities, and those exposed to more mandatory reporters.
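A small sketch in Python can show how reports drift away from the behavior they are supposed to record. Every number here is a hypothetical assumption chosen for illustration, not data from any child services agency, and expected_reports is a name I made up.

```python
def expected_reports(families, behavior_pct, reported_pct):
    """Expected number of reports when behavior_pct percent of families
    do the activity and reported_pct percent of those are seen and reported."""
    return families * behavior_pct * reported_pct / 10_000

# Hypothetical assumption: both neighborhoods have identical behavior
# (30 percent of 1,000 families let children play outside unattended).
# Only the chance of being observed and reported differs.
crowded_block = expected_reports(1000, behavior_pct=30, reported_pct=20)
private_yards = expected_reports(1000, behavior_pct=30, reported_pct=2)

print(crowded_block, private_yards, crowded_block / private_yards)
```

Under these assumed numbers, the first neighborhood produces ten times as many reports for the same behavior. A model trained on the reports would learn who gets watched, not who leaves children unattended.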
When machines are trained on data that is a combination of facts and judgments, models are not perfect predictors nor are they necessarily verifiable. Predicting who will engage in neglect or abuse based on past facts could have some accuracy. Predicting based on past events sparked by someone’s judgment is different. When organizations decide which metrics to include, that itself is also a judgment. In Weapons of Math Destruction, Cathy O’Neil says “models are opinions embedded in mathematics.” In Automating Inequality, Virginia Eubanks observes that a model designed to predict neglect and abuse in Allegheny County, Pennsylvania was about 76 percent predictive, halfway between the 50-50 of a coin toss and perfect prediction. Furthermore, it is difficult to validate the accuracy because the model relies on proxies for abuse and neglect, such as judgments, custody loss, referrals, and re-referrals. And many of the metrics are traits associated with poverty. Even judgments reflect the moods of judges and then feed systems of metrics. It seems people get stuck in the system of surveillance because they resemble people who have gotten stuck in the system in the past. The relationships are not causal. The group does not speak for the individual very well. And, importantly, the data focuses on people who are exposed to more mandatory reporters or who do more of their parenting in public, and it reflects the personal judgment of the reporter. The bulk of data on child abuse and neglect centers on people whose lot in life is lower in the socioeconomic ranks. The models would not be adept at rooting out a wealthy sex trafficker, neglecter, or abuser. There is availability bias in the data.
Metrics can replace values when they are not used carefully. Sometimes metrics measure a good proxy for something. But often they do not. Liking a social media post is a judgment: it says that, in my opinion, this post is good. Or likes may be given with an ulterior motive: if I like this post, the person posting may like one of my posts. The number of likes a social media post receives is a totality of opinions, which may be sincere or insincere. When likes become a metric for the quality of posts, a feedback loop exists.
There is an incentive to post the most popular material, not the highest quality. That type of incentive structure leads to polarized news, lowers the high bar of journalism and academic writing, and encourages resharing information. It also exploits human weakness, like an inability to look away from a train wreck.
Many corporate jobs rely on personality tests, and with interconnected online job boards, multiple employers can use AI to weed people out based on algorithms that review resumes, personality tests, and online profiles and activity. Many of their rules concern opinions about who might perform well based on who has performed well in the past. Often promotions are also based on past performance rather than on an evaluation of the skills needed for the different position. The choice about which qualities matter is a judgment. The data informing the predictive model tends to depend on the bucket of people who were given the chance to succeed in the past. The feedback loop that says introverts are not right for a given position may depend primarily on introverts’ inability to be hired in the past, rather than their skills or potential to succeed. The personality tests themselves delve only into how one views oneself and are hardly objective. But creating metrics based on many thousands of personality tests and then using AI to weed through each job seeker is as much big judgment as it is big data.
In Noise: A Flaw in Human Judgment, Daniel Kahneman, Olivier Sibony, and Cass Sunstein offer many examples of judgments that suffer from noise, or variability in human judgment that is random and unpredictable. Noise is variability that is unwanted, as opposed to the variability one expects when asking for a variety of opinions. The authors suggest noise can violate rights. For example, there is noise in sentencing, judgments, child neglect, hiring, asylum, and education, etc. Metrics and big data that rely on opinion exponentially exacerbate the effect of noise in judgments. For example, when child services dismisses one family but puts another into surveillance based on a happy child services agent in one case and a miserable one in another, or based on a good weather day versus a bad one, those singular events affect the two families and represent an unfairness. But also, they are recorded. When they become part of a big data system, randomness is built in. Algorithms that police use to predict risk of crime may reflect noise. A study of the Diagnostic and Statistical Manual of Mental Disorders (DSM-5) found so much noise that it determined some diagnoses “were so unreliable as to appear useless in clinical practice” (Kahneman et al., p. 285, citing Lieblich, “High Heterogeneity”). Even small-scale collective judgments look much more factual to people than they are.
Things that can easily be measured risk becoming popular metrics despite being personal judgments. From insurance to credit reports, data is feeding predictive risk models. Some of the math really is fact; some of it is numbered, measured judgments, answers to surveys, and proxies for actual non-measured events. Consumers of anything from media to health care should consider whether they are dealing with fact or opinion. It seems to be human nature to believe things more (or question them less) when numbers are involved.
I do not intend anything here to imply that there are no facts or that truth and opinion are one and the same. This is not a post-truth world, but it is, to me, one where people should distinguish fact from opinion.