The Neutronian Data Quality Podcast – Episode 1

Welcome back to the Neutronian Data Quality Podcast, a show designed to dig into the topic of data quality, where we share examples of data quality issues that can arise and aspects to consider as you evaluate marketing data.
In this next episode of the Neutronian Data Quality Podcast, Renee Smith (Chief Research Officer at GutCheck), Katie Casavant (President of 0ptimus Analytics), and Lisa Abousaleh (VP of Customer Success and Marketing at Neutronian) chat about the importance of ensuring data quality throughout the entire data creation and buying process. The discussion was moderated by industry veteran Ashwini Karandikar, former Global President of Amnet, Senior Advisor to McKinsey, and member of the MediaMath board.
In this first part of their discussion, they dig into data quality considerations during the data gathering process and how the data’s ultimate purpose impacts these considerations. Click on the link below to listen to this episode or read the transcript, and stay tuned for the next episode of the Neutronian Data Quality Podcast with more insights from this discussion!

Intro: In this next episode of the Neutronian Data Quality Podcast, Renee Smith, Chief Research Officer at GutCheck, Katie Casavant, President of 0ptimus Analytics, and myself, Lisa Abousaleh, VP of Customer Success and Marketing at Neutronian, chat about the importance of ensuring data quality throughout the entire data creation and buying process.
Our discussion was moderated by industry veteran Ashwini Karandikar, former Global President of Amnet, Senior Advisor to McKinsey, and board member of MediaMath. In this first section of our discussion, we dig into data quality considerations during the data gathering process and how the data’s ultimate purpose impacts these considerations. Hope you enjoy!
Ashwini Karandikar: Hello and welcome to this wonderful podcast that is going to talk about data quality. I’m Ashwini Karandikar. I’ve been in the industry for a long time, with Dentsu, with McKinsey, et cetera, and I have the wonderful honor of moderating this session with Lisa, Katie, and Renee. I would love for each one of you to introduce yourself and perhaps talk about your connection to the overall data industry and why you think it’s important.
Renee Smith: Thanks. I’m Renee Smith. I am currently the Chief Research Officer of GutCheck. I’ve also been in the market research industry for about 20 years now.
GutCheck is an agile market research company, and we focus on trying to understand human behavior and the humanity that leads to the data that gets created. We use all sorts of data sources, and I’m excited to talk today with Lisa and Katie and you as well, because I think there are some tips and tricks that can help the buyers of data get a better understanding.
Katie Casavant: Hi, I’m Katie Casavant. I’m the President of 0ptimus Analytics, and 0ptimus Analytics is a vertically integrated artificial intelligence and machine learning data science company. We focus on innovation that brings together data science and innovative software engineering, integrating custom research with advanced data science, analytics, and predictive modeling to measure, analyze, and predict the behaviors of human beings.
It’s a great joy to work with the 0ptimus team. I’ve been in this space around data, big data, small data, for about 20 years, and there simply isn’t, I think, anything more important than data quality. I subscribe to the garbage in, garbage out philosophy, and there’s simply no way to correct downstream for quality mistakes that happen upstream. So it’s just a hugely important topic, and I’m excited to talk with this great group of ladies today.
Lisa Abousaleh: All right. Hello, I’m Lisa Abousaleh. I’m VP of Customer Success and Marketing at Neutronian. I fell into data early on in my career and have stuck with it along the way, starting with more traditional marketing research and then evolving into data specifically focused on advertising and enabling marketing technology.
And in my current role at Neutronian, what we’re really focused on is bringing clarity and transparency to marketing technology and the data that’s being leveraged for that. So we feel it’s very important for data buyers to understand the quality of the data that they’re buying, making sure that they’re leveraging high quality data and the best data for their purpose. And then conversely for data sellers to really be able to leverage their advantage when they are doing the right things and they’re investing the time and effort to produce high quality data. So data quality is something that currently I’m living and breathing every day. But very excited to talk through this topic with the group here today.
Ashwini Karandikar: That is fantastic. I love that the four of us could get together, and the sheer brainpower that each one of you is going to bring to this discussion is not only going to be fabulous for me, but I think anyone and everyone who’s listening is going to learn a lot about not just the nitty-gritty of buying data, but also how to figure out what data quality is all about. Yay! And let’s get started with the questions.
So after that fantastic introduction, I would love to get down to a few basics around – What is data gathering? How do you figure out data quality overall? And I think my first question is going to be for Renee and for Katie, given your background and your experience. I agree with Katie about garbage in, garbage out. Words matter: it matters what you say, how you say it, how you actually ask the questions. And perhaps Renee first to you. Could you share some tips around framing the correct question so you can actually get the insight that you’re looking for?
Renee Smith: Sure. Maybe we should also step back for a second and talk about quality in general. In the survey data collection world, for example, there are standard, known processes for collecting data, for making sure it’s representative, making sure it’s error free. And I think that some of those processes are less well known when it comes to non-survey data. So the reliability, the comparability, all of those same elements that we know from other data sources – the more we can think about how they play into the veracity of larger data, the better.
To answer your question more specifically: we collect survey data as seed cases when we want to work with a partner like 0ptimus for audience expansion on behalf of our clients. Our clients are big manufacturers, tech and CPG in particular, also some financial services and healthcare. What we’re finding is that thinking about the error a survey question can create is a very important first step to avoiding garbage in. What I mean by that is we know that people’s ability to recall varies with the timeframe: the longer ago something happened, the less likely they are to recall it. So keeping the recall window that you’re asking about short, have you done an activity in the last month, for example, or even the last week, can be important.
What’s also important is that when we’re doing a survey for insights, we tend to think of the list of responses as needing to be collectively exhaustive, and yet we know that, on average, people are going to choose maybe three to six items from a list. So it can actually become important to keep the list a bit shorter so that the behaviors and activities you’re trying to measure are chosen without error. Those are just a couple of initial tips that create better seed cases.
Ashwini Karandikar: That’s very helpful. Katie, just to build on that, given the work that 0ptimus does and the serious AI work that you do, how would all of this play out in your world overall?
Katie Casavant: Absolutely, just to reiterate what Renee said, it really does matter how a question is asked. Especially if the intended use case for that survey is to model selected responses and scale them out to identify a nationwide audience of people with a high probability of having the same point of view or attitude or behavior.
In our experience, survey questions that lack clarity, or are too general, or ask about very rare attitudes or behaviors, in other words, they have a very small audience, do not tend to produce high quality models. Questions that are direct, that have collectively exhaustive response values, and that are really specific about the action being surveyed tend to lead to better performing, more highly predictive, and more accurate models.
Renee Smith: As GutCheck’s teams and the 0ptimus teams have worked together, one of the things we’ve talked about, because we’ve seen it, is that if the predictive modeling, the lookalike modeling, is not getting the precision we were expecting, we’ve actually gone back to the drawing board to ask what’s wrong with the survey questions and have made improvements that actually lead to better predictions. So what Katie just said is actually based on evidence.
Katie Casavant: Yeah.
Ashwini Karandikar: I was going to continue building on the same point, and a question to Lisa would be exactly around what you’re gathering, the actual source. Given Neutronian’s focus on data verification, figuring out the quality of the actual data versus crap data, what are the various parts of the process in determining the quality? When you started working with 0ptimus, what was the process in figuring out what the data was, how it worked, et cetera?
Lisa Abousaleh: Sure. I think especially around data gathering, the first of the two main areas that come to mind from our perspective for quality would be sourcing: what type of data is being collected and from what sources? I think it’s paramount to understand that, because there are going to be different considerations for the data that’s being gathered depending on the source. So for example, if it’s behavioral data, you’d want to ensure that it’s being collected and refreshed on a regular basis so that you’re transacting off of the most up-to-date information.
And then, to Renee’s point about the timeframe, I think it’s from a recall perspective but also about the attribute itself and how likely it is to change over time. If you ask me, do I have a pet, that’s going to remain consistent. So think through how often you need to be asking someone that when you’re collecting the data, versus my opinion about a recent news event or something that’s evolving. Make sure that’s in line with your data sources, that you have that understanding of the recency and the frequency at which it’s being collected.
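(As an illustrative aside, here is a minimal sketch in Python of that recency idea: attributes that change quickly need a much shorter refresh window than stable ones. The attribute names, refresh windows, and records below are hypothetical.)

    from datetime import datetime, timedelta, timezone

    # Maximum acceptable age per attribute (illustrative values only).
    MAX_AGE = {
        "recent_purchase_intent": timedelta(days=30),  # volatile: refresh often
        "has_pet": timedelta(days=365),                # stable: refresh rarely
    }

    records = [
        {"attribute": "recent_purchase_intent",
         "collected_at": datetime(2024, 1, 5, tzinfo=timezone.utc)},
        {"attribute": "has_pet",
         "collected_at": datetime(2023, 9, 1, tzinfo=timezone.utc)},
    ]

    now = datetime(2024, 6, 1, tzinfo=timezone.utc)
    for record in records:
        age = now - record["collected_at"]
        status = "stale" if age > MAX_AGE[record["attribute"]] else "fresh"
        print(record["attribute"], status)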
The other consideration I would say is very much around compliance and consent. How is the data being collected, how was consent collected, is it transparent to the user what they’re consenting to? And I think, again, from a survey application it’s more straightforward than other data collection standpoints, because typically every time someone is entering a survey, they’re being asked to confirm that they want to take the survey, so I think it’s clear to the user in that case that data is being collected from them. But especially with the evolving privacy regulations and laws, consumers are becoming more and more sensitive about their data, how it’s being collected, how it’s being leveraged. So I think that has to become more of a consideration when you are thinking about data gathering and ensuring that the sources are taking the appropriate steps to confirm and collect consent from their users.
Ashwini Karandikar: Key point, Renee, I would love for you to speak about compliance overall in terms of active consent, active compliance, how that affects modeling. And then Katie, especially given the area that 0ptimus focuses on, having active compliance and the value of that to the data that you bring. So maybe Renee, you can first speak to it and then Katie, you can build on it.
Renee Smith: Yeah, we use a combination of approaches. There’s what’s in our privacy policy: there are always links when you initially go into a survey that say, if you’d like to see where the data are going, click here. And in each survey, we use consents that are specific to what the need is.
So as an example, we don’t have our own sample sourcing, so we’re working with third party vendors there. They may already have consent that allows them to share the data with a third party. We have internal legal counsel to make sure that everything is above board, that we’ve actually reviewed whether they can or not.
And then, many times we will decide that we’re going to collect the matching information ourselves, and then there’s an even clearer consent: in other words, we’re about to ask you for some personal information for the following purpose. We make sure we’re clear in really layman’s terms; you don’t want to be overly technical, because you want people to be really clear about what you’re going to do with it. And then we allow people to skip that if they want to.
So the interesting data quality angle there is that not everyone wants to give consent and so who are you missing? And I’ll let Katie talk a little more about how that might affect the modeling.
Katie Casavant: That’s a great segue, Renee. So when we think about how we triangulate what we know coming into that sample, there’s the question of, do you have consent, yes or no, which is fairly binary. We own our own first party data, and there’s all sorts of legal and contractual work around that to ensure that it is proper and compliant with all the nebulous and ever-changing regulations. Then of course, there’s one’s own internal operations, because if you want to have data quality, you have to have an end to end commitment. It’s not a stamp, and just having a lawyer is not enough; it’s really an end to end commitment.
But when it comes to, in the moment, in the survey, how you are managing for bias, for errors, for triangulation, for consent: for some clients we conduct the surveys ourselves, and then, as Renee has mentioned, we work with companies like GutCheck. But when we conduct surveys for modeling, we only use a sample of verified consumers, consumers that we can identify on our deterministic consumer ID graph. So not only do we know that we have their consent, we also know that they are who they say they are, that they live where they say they live, and that some of the demographic attributes they report about themselves are accurate as well.
And that’s really important, because when we match a file like our graph to, say, a file that perhaps Renee has provided, or that her sample provider has provided to us on their behalf, that match is really important. There’s a quality element there that’s just so important. There are lots of companies out there where you will see them reporting these huge match rates. I can give you a huge match rate too, but there are just a lot of bad matches in there, which means that everything downstream from that is going to be quite poor.
So we only use matched records where we have a hundred percent certainty that we have a true match. We’d rather throw out matches than rely on really fuzzy matching. We have a whole host of algorithmic AI techniques, as well as handcrafted eyeballs and hands going through record by record. That’s a way we employ our technology and our capabilities to ensure that we’re doing our part to make sure there’s no garbage in the seed data coming in, that it’s really high quality, because there is no downstream analytics, no downstream AI, no downstream machine learning magic that’s going to correct for poor quality data at the outset. It just doesn’t exist.
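(As an illustrative aside, here is a minimal Python sketch of the exact-key matching Katie describes, where anything that is not a certain match is thrown out rather than fuzzy-matched. The identifiers, field names, and data are hypothetical.)

    # Survey seed respondents, keyed by a hashed email provided with consent.
    seed_respondents = [
        {"hashed_email": "a1b2c3", "segment": "likely_buyer"},
        {"hashed_email": "d4e5f6", "segment": "likely_buyer"},
        {"hashed_email": "zzz999", "segment": "likely_buyer"},  # not on the graph
    ]

    # A deterministic consumer ID graph: hashed email -> verified profile.
    id_graph = {
        "a1b2c3": {"person_id": 101, "state": "CO"},
        "d4e5f6": {"person_id": 202, "state": "TX"},
    }

    matched, discarded = [], []
    for record in seed_respondents:
        profile = id_graph.get(record["hashed_email"])
        if profile is None:
            # No exact match: throw the record out instead of guessing.
            discarded.append(record)
        else:
            matched.append({**record, **profile})

    print(f"matched {len(matched)} of {len(seed_respondents)}; discarded {len(discarded)}")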
Renee Smith: I think the other thing that’s important to know is that when we work with a company like 0ptimus, after the modeling is done, the original seed cases are removed. People have been kind enough to provide you with some identifiers that you can use to match, with their consent, but you don’t want to create a disincentive for them to provide that information the next time by making them think they’re going to be part of this audience that will be marketed to.
So I think that’s caring for the ecosystem, caring for the people who are willing to provide the information, so that you can continue to have people provide it without feeling like they’re going to automatically become part of a marketing campaign.
Lisa Abousaleh: I think that goes into a whole topic of data governance. It’s how the data is collected and then how you are treating that data once you have it. Especially if you have PII, there have to be considerations around the appropriate policies and practices, who has access to the data internally, et cetera. Because, Renee, as you pointed out, it goes back to that initial trust, and if the consumer is providing you with information based on a trusting relationship, then the onus is on the data companies to make sure they’re handling it with care.
Katie Casavant: There’s an interesting point that I’d like to make here, because we’re talking about good quality versus poor quality seed data and building models out of that. Poor quality seed data might mean too few instances, or too low of a base rate, or not sufficient differentiation, or some level of inaccuracy and non-matching. As I said before, there’s very little that predictive modeling can do to correct for that poor seed data, but it doesn’t mean a model can’t be built.
So in these instances, we may technically be able to build a model, but that model may perform no better than random. Have we built a model? Yes, we’ve built a model, but the model is garbage. In some cases where the seed data is poor, the predictive modeling exercise effectively becomes useless because it may predict no better than random. Then you have the risk of a poor quality predictive model being used to deliver targeted communication. The most likely outcome is going to be the wrong message delivered to the wrong person and an investment with a negative ROI.
I often look at different companies’ websites, different folks who operate in this modeling space, and I’m amazed when I look at the content some of these companies put out. They provide modeling as part of their service suite, and you’ll see the words modeling and lookalike modeling and predictive modeling with no attempt to explain how they do what they do or what their philosophy is, almost as though predictive modeling is a commodity service that’s just delivered in a big black box. “Move on people, nothing to see here, no need for any questions.” And it couldn’t be any further from the truth. Renee can build and deliver pristine seed data, and then if I’ve got a black box with, I’ll be polite and say, substandard modeling, you’ve just taken good data and turned it into poor quality data.
The way we combat this is we believe fiercely in model performance transparency, and we believe in providing full disclosure on how a model that we built performed. We’ll tell you if it’s good, we’ll tell you if it’s bad, and if it’s bad, we’ll tell you if there’s anything we can do to improve the results. And we run multiple modeling algorithms, multiple modeling pipelines, simultaneously. The reason that’s important is not just because it’s fancy engineering, but because we don’t predetermine the winner. We’ll choose the best performing model based on everything that’s run and based on what the client’s specific needs are. We provide a full complement of statistical performance metrics, and a few that we’ve actually created ourselves, because we really believe that clients should always know how their model performed, which consumers are in the audience that’s created from it, and why those consumers have been included in the audience.
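(As an illustrative aside, here is a minimal Python sketch of that kind of transparent model comparison: several candidate models are scored on the same seed data and compared against a random baseline before a winner is chosen. The synthetic data and model list are purely illustrative, not 0ptimus’s actual pipeline.)

    from sklearn.datasets import make_classification
    from sklearn.dummy import DummyClassifier
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    # Stand-in for matched seed cases: X = predictor attributes, y = survey response.
    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

    candidates = {
        "random_baseline": DummyClassifier(strategy="stratified", random_state=0),
        "logistic_regression": LogisticRegression(max_iter=1000),
        "gradient_boosting": GradientBoostingClassifier(random_state=0),
    }

    # Cross-validated AUC: roughly 0.5 means the model predicts no better than random.
    scores = {
        name: cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
        for name, model in candidates.items()
    }

    for name, auc in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{name}: AUC = {auc:.3f}")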
Renee Smith: So I wanted to bring this full circle to what Lisa mentioned earlier. Model transparency is very important as well, and we’ve talked about the seed cases, but the other inputs to the model are the other big data sources being used as the predictors. And as Lisa said, if those are not fresh, if those are not up to date, if those somehow reflect a behavior or activity that changes very rapidly but the data weren’t updated with it, you can be really transparent about the model, but you’re still not going to get a good model.
Ashwini Karandikar: So that’s an excellent point. Given this excellent start that we’ve already had, I think where we should go next, which is really going to help advertisers and people listening right now, is to figure out, how do you actually know what you’re buying?
Closing: Thank you for listening to this episode of the Neutronian Data Quality Podcast. We’d love to hear your thoughts on the topics discussed today, or if you have another data quality topic that you’d like to hear about on the podcast, you can reach us on our website, neutronian.com, our LinkedIn page, or send an email to [email protected].
Stay tuned for the next episode where Renee, Katie, Ashwini and I discuss more data quality considerations in the data creation and data buying processes.