New Tool Helps Make Sense of Big Data
Ross Professor Eric Schwartz develops decision tree to make life easier for marketing analysts.
ANN ARBOR, Mich. — Everyone in marketing wants big data — that granular, customer-level purchase information merchants collect by the terabyte.
But collecting it is the easy part. Making sense of it is harder than ever, especially since many companies have stopped buying rolled-up, pre-aggregated data from third-party providers. Figuring out which statistical models to apply to different sets of data can be confusing, time-consuming, and expensive.
U-M Ross Professor Eric Schwartz is making life easier for analysts and forecasters by building an easy-to-follow decision tree that shows which models to use for a given data set. His research means marketing and sales analysts can make efficient use of complex data and build more accurate forecasts.
"In this world of big data, people feel like they're drinking from a fire hose," says Schwartz, assistant professor of marketing. "It's not easy to distill all of that data down to a few key summary statistics that are easy to compute, easy to interpret, and relevant to managers' decisions. That's why we did the up-front work for the marketing analysts to uncover those key metrics and created this decision tree. That's the real time-saver."
Schwartz's paper, "Model Selection Using Database Characteristics: Developing a Classification Tree for Longitudinal Incidence Data," was written with Eric T. Bradlow and Peter S. Fader, both marketing professors and co-directors of the Wharton Customer Analytics Initiative of the Wharton School at the University of Pennsylvania.
Here's a typical situation: A company collects data on a number of recent product launches. The analysts have data on individual customers and their purchases over time, and the goal is to project repeat purchase patterns. The problem is picking the correct model for each data set, since each has its own patterns and characteristics.
One approach is to run several different models on each data set and see which fits best. But this "bake-off" strategy is inefficient: it must be repeated for every new data set, and because it ignores what was learned from all the other data sets already seen, it offers little assurance of an accurate forecast for the next one.
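The "bake-off" idea can be sketched as follows. This is a minimal illustration, not the paper's method: the two toy candidate models (a constant mean and a linear trend) and the mean-squared-error metric are hypothetical stand-ins for the four statistical models the researchers actually compared.

```python
# Hedged sketch of a model "bake-off": fit each candidate model to a
# training window, then keep whichever forecasts the holdout period best.
# The candidate models here are illustrative toys, not the probability
# models used in the actual research.

def fit_mean(train):
    m = sum(train) / len(train)
    return lambda t: m  # always forecast the historical average

def fit_trend(train):
    n = len(train)
    xbar = (n - 1) / 2
    ybar = sum(train) / n
    slope = sum((i - xbar) * (y - ybar) for i, y in enumerate(train)) / \
            sum((i - xbar) ** 2 for i in range(n))
    intercept = ybar - slope * xbar
    return lambda t: intercept + slope * t  # linear extrapolation

def bake_off(series, n_holdout=4):
    """Return the name of the candidate with the lowest holdout error."""
    train, hold = series[:-n_holdout], series[-n_holdout:]
    candidates = {"mean": fit_mean(train), "trend": fit_trend(train)}
    def mse(model):
        preds = [model(len(train) + i) for i in range(n_holdout)]
        return sum((p - y) ** 2 for p, y in zip(preds, hold)) / n_holdout
    return min(candidates, key=lambda name: mse(candidates[name]))

# A steadily growing series should favor the trend model.
print(bake_off([10, 12, 14, 16, 18, 20, 22, 24]))  # trend
```

The key inefficiency the researchers address is visible here: the whole fitting-and-comparing loop has to be rerun from scratch for every new data set.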
Schwartz and his co-authors figured there had to be a better way: looking across a wide range of data sets to learn when to use which model. They identified 64 common data patterns and ran a series of "bake-offs," repeated multiple times, to see which of four related statistical models best fit each pattern.
A grant from Amazon Web Services allowed them to run the jobs in parallel and drastically reduce the time it would take to perform such a test — from 1,000 days down to two days.
Their method marries the two cultures of machine learning and statistical data modeling. The result is a relatively simple decision tree that allows a manager to see which general pattern his or her data set matches, and then select the right statistical tool from there.
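Such a decision tree might look like the following sketch. The database characteristics (penetration, repeat rate), the thresholds, and the model labels are all hypothetical illustrations of the idea, not the actual tree or models from the paper.

```python
# Hedged sketch of a model-selection decision tree: compute a few
# easy-to-obtain summary statistics from a data set, then follow simple
# branching rules to a recommended model. All thresholds and model
# names below are hypothetical, chosen only to illustrate the idea.

def summarize(purchases_per_customer):
    """Summary statistics from a list of per-customer purchase counts."""
    n = len(purchases_per_customer)
    buyers = [c for c in purchases_per_customer if c > 0]
    penetration = len(buyers) / n                              # share who bought at all
    repeat_rate = sum(1 for c in buyers if c > 1) / max(len(buyers), 1)
    return penetration, repeat_rate

def recommend_model(purchases_per_customer):
    penetration, repeat_rate = summarize(purchases_per_customer)
    if penetration < 0.2:
        return "model A"   # sparse trial purchasing
    if repeat_rate < 0.5:
        return "model B"   # trial-heavy, little repeat
    return "model C"       # substantial repeat purchasing

# Mostly non-buyers -> low penetration -> model A
print(recommend_model([0, 0, 0, 0, 0, 0, 0, 0, 1, 0]))  # model A
```

The point of the approach is that these lookups are cheap: once the tree is built from the up-front bake-offs, an analyst only needs a few summary statistics, not a fresh round of model fitting.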
The research showed that not only does this method save hours of time (both manual and computational), but it also reduces the error rate.
"Lately people toss around words like big data and data science, but this is what data science should be all about," Schwartz says. "It's not just summarizing or visualizing data. It's really trying to understand the deeper patterns in customer behavior data and understand when to use a particular analytical tool. That's the innovation here. Machine learning isn't commonly used in marketing. But we used it to understand the links between how you can find database characteristics that are easy to discover and use that to decide which model best fits."
Read the paper.
For more information, contact:
Terry Kosdrosky, (734) 936-2502, firstname.lastname@example.org