Big Data with a Grain of Salt


Big Data is everywhere, tracking click-throughs, web cookies and time stamps of our daily lives like the digital paparazzi. It can be used for good, revealing patterns in the most minuscule of details and leading to discoveries beyond our wildest imaginations. But in the wrong hands, it’s dangerous: unraveling identities, draining bank accounts or worse.

From a scientific standpoint, Big Bad Data is greedy and hungry, swallowing everything in its path to chew it up and spit it out, making something from nothing. This can lead to false correlations and, as our industry has seen, junk science. And now that we’re tallying up results from intricate tests of genetic up- and down-regulation, alterations in our body’s microbiome and neuron response patterns for product likes/dislikes, we have to be careful.

Linter and co-authors said it in their October 2016 C&T article, “Correlation does not indicate causation.” So much data can easily mislead the misguided or faint of heart. That’s why data analyst Susan Etlinger noted during a recent “TED Radio Hour” on NPR that context is crucial.1

People Make Meaning, So Think Critically

“When it comes to Big Data and the challenges of interpreting it, size isn’t everything,” said Etlinger, who explained there’s also its speed and variety of types (images, text, video and audio). “And what unites these disparate types of data is they’re created by people and they require context.”

Etlinger expanded, “Facts are vulnerable to misuse, willful or otherwise. Why? Because data doesn’t create meaning. People do. And now, with the capability to process exabytes of data at lightning speed, we have the potential to make bad decisions far more quickly, efficiently and with far greater impact than we did in the past.”

This makes it all the more important to spend time on the humanities—sociology, rhetoric, philosophy, ethics—because they give us context. “They help us become better critical thinkers,” Etlinger said, adding they also help teach us to spot confirmation biases and false correlations. “[Just] because something happens after something doesn’t mean it happened because of it.”
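Etlinger’s warning is easy to demonstrate. The sketch below (plain Python, entirely invented data) screens 1,000 streams of pure random noise against an equally random “outcome” and reports the strongest correlation it finds; with enough variables in play, an impressive-looking r turns up by chance alone.

```python
import random

def pearson(x, y):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

random.seed(1)
n = 30  # a small sample, as in many real studies

# A random "outcome" with no cause at all behind it.
target = [random.gauss(0, 1) for _ in range(n)]

# Screen 1,000 unrelated random "predictors" against it and keep the best hit.
best = max(
    abs(pearson(target, [random.gauss(0, 1) for _ in range(n)]))
    for _ in range(1000)
)
print(f"strongest spurious correlation found: r = {best:.2f}")
```

None of these predictors has any relationship to the outcome, yet the winner of the screening looks like a finding. This is exactly why “show your math” matters: knowing a correlation was the best of a thousand tries changes what it means.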

“As my high school algebra teacher used to say, ‘show your math’ because if I don’t know what steps you took, I don’t know what steps you didn’t take. And if I don’t know what questions you asked, I don’t know what questions you didn’t ask.” She added this means asking the hardest question of all: Did the data really show us this? Or does the result make us more successful?

Unregulated and Out to Get Us

This is a hard question indeed, made harder by the fact that we’re self-regulated, as nongovernmental organizations (NGOs) like to remind us. To some, self-regulated means unregulated, which to distrusting human nature means companies are free to lie to and steal from us for financial gain.

So while a product may be built on good, factual science, and business logic dictates that putting forth false claims and harmful ingredients to the very consumers we serve amounts to shooting ourselves in the foot, humans create meaning. Especially skeptical ones. And when they base that meaning on Big Bad Data, it can create something from nothing, which further fuels the skepticism. That makes it all the more important to ask: How good is our data?

Putting Big Data into Practice

Thankfully, we’re not alone. In fact, we’d be hard-pressed to find an industry that’s not collecting Big Data and trying to connect the right dots for good use. Richard Shiffrin, of Indiana University, studied2 how others are attempting to draw causal inferences from Big Data.

“The age of Big Data poses enormous challenges because collecting and storing the data are only a minimal first step.” He described working with Big Data in stages: 1) finding interesting patterns in the data; 2) explaining those patterns, e.g., with experimental manipulations of variables and additional data; and 3) using the patterns and explanations for a variety of purposes.

“Finding interesting patterns is itself a daunting task because a hallmark of Big Data is the fact that it vastly exceeds human comprehension.”

He outlined several questions that must be asked. “How does one define causality ... in ways that make sense for large recurrent interacting systems?” and “How does one judge what is a significant pattern or correlation? ... These questions and their answers are, to a large extent, a matter of statistical practice and implementation.”

Then again, traditional statistics were initially developed to deal with problems like 2×2 tables, which are nowhere near as complex as Big Data. Needless to say, statisticians are working on this, too.
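To make that scale gap concrete, here is the classic 2×2 case those methods were built for: a chi-square test of independence fits in a few lines (a minimal sketch, and the table values are invented for illustration).

```python
def chi_square_2x2(a, b, c, d):
    """Chi-square statistic for the 2x2 contingency table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Invented example: 30 consumers saw ad A, 30 saw ad B; did click rates differ?
#            clicked  didn't
#   ad A        20      10
#   ad B        10      20
chi2 = chi_square_2x2(20, 10, 10, 20)
print(f"chi-square = {chi2:.2f}")  # 6.67, above the 3.84 cutoff for p < 0.05
```

One statistic, one clean answer. Big Data, by contrast, involves thousands of interacting variables collected as a nonrandom sample, and no comparably simple formula settles what counts as a significant pattern there.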

“Most Big Data [is] formed as a nonrandom sample taken from the infinitely complex real world: Pretty much everything in the real world interacts with everything else, to at least some degree,” wrote Shiffrin. Well, that’s no help.

In the end, it seems the best thing we can do is remember our methods and data have limits. Whatever the results tell us should be reconsidered—and more than once. We don’t want to feed into Big Bad Data; we need to see it in context and serve it with a grain of salt.
