How concerned should you be with missing responses in your survey?

One of the primary concerns with sampling in general is the issue of representativeness.

That is, we don’t want to sample only happy customers or those who come from large companies instead of small companies if we’re trying to make the right decisions about our entire set of customers.

Some amount of bias is inevitable with every survey, as it is with scientific research in general. One such bias is response bias: participants must be willing to respond to your survey or participate in your study.

To minimize this bias, you should invest some effort in using different recruiting channels (such as email, in-product placement, and support calls) and in reaching different customers at different points in their customer journey.

After identifying a representative sample of customers and having them agree to take a survey, you'll want them to actually complete the survey and not drop out before finishing.

There are many things that can reduce a survey's response rate. One of them is having too many required questions. However, making all questions optional in the hope of increasing the response rate introduces a new problem: you'll almost surely end up with partial, incomplete data from several respondents. Missing responses affect the representativeness of your data.

Pairwise Deletion

Even if a respondent doesn’t answer every question, you can still analyze and summarize their responses with the questions they did answer. You’ll just have different sample sizes per question. This approach is called pairwise deletion. It’s the typical approach we use for summarizing response options and running correlations.

For example, the table below shows data from ten respondents. User 2 didn't answer the likelihood to recommend question.

| User ID | Likelihood to Recommend | Brand Attitude | Ease of Use (SUS) |
|---------|-------------------------|----------------|-------------------|
| 1       | 4                       | 4              | 80                |
| 2       |                         | 6              | 70                |
| 3       | 9                       | 6              | 70                |
| 4       | 10                      | 7              | 70                |
| 5       | 8                       | 5              | 40                |
| 6       | 7                       | 1              | 20                |
| 7       | 8                       | 5              | 60                |
| 8       | 6                       | 5              | 50                |
| 9       | 4                       | 4              | 40                |
| 10      | 5                       | 5              | 80                |

If we want to compute the correlation between likelihood to recommend and ease of use, we can use only the respondents who answered both items. User 2's data is removed, leaving nine respondents.

But if we also want to compute the correlation between brand attitude and ease of use, the non-response on the recommend item doesn't affect our ability to run a correlation on all ten respondents.

Listwise Deletion

Another approach to working with missing data is called listwise deletion. Using this approach, respondents with any missing value are removed entirely. So if we want to understand the key drivers of loyalty from, say, 20 features and functions, then by default any respondent who didn't rate their satisfaction with even one of the 20 items is excluded from the analysis entirely.

Key driver analysis uses a technique called multiple regression, which requires a response on every item or the statistical procedure won't run. This isn't much of a problem if you have a large sample size and only a few missing values. It becomes a problem when responses are precious, and excluding respondents listwise can itself introduce a new source of response bias.

Non-Response Bias

Missing data is a pain. It limits the analysis we can do, especially when we're working with multiple items. Ideally, missing values would come from simple mistakes or other random causes. That is, we don't want respondents systematically avoiding an item, which would impact the representativeness of the sample.

For example, if there's a sensitive item that asks about income, political attitudes, or some personally identifiable information that prevents a certain group of customers from responding, we should account for such behavior before drawing conclusions. Fortunately, there are some techniques we can use to determine whether our missing values are systematic or random.

1. Not Missing at Random: NMAR

To find out if we are excluding a certain segment of our respondents, we create a new binary variable that flags missing versus non-missing responses. The table below shows a simple example in which respondents were asked a question about their income and three other attitudinal questions. Missing income values are coded 1 and non-missing values 0.

| User ID | Missing | Income  | Likelihood to Recommend | Favorability toward the Brand | Likelihood to Repurchase |
|---------|---------|---------|-------------------------|-------------------------------|--------------------------|
| 1       | 1       |         | 4                       | 4                             | 5                        |
| 2       | 0       | 50k–74k | 8                       | 6                             | 7                        |
| 3       | 0       | 50k–74k | 9                       | 6                             | 7                        |
| 4       | 0       | 100k+   | 10                      | 7                             | 7                        |
| 5       | 0       | 50k–74k | 8                       | 5                             | 4                        |
| 6       | 0       | 25k–49k | 7                       | 1                             | 2                        |
| 7       | 0       | 25k–49k | 8                       | 5                             | 6                        |
| 8       | 1       |         | 6                       | 5                             | 5                        |
| 9       | 0       | 25k–49k | 4                       | 4                             | 4                        |

If we want to understand if higher income customers are more or less likely to recommend the product, having missing income values complicates things.

We can compare the mean likelihood to recommend for the 1s and 0s (missing and non-missing). If there IS a significant difference in means, we have evidence that the data is NOT missing at random. In other words, there's a pattern to the non-responses.

With something like income it’s often, although not always, the higher income customers who tend not to respond. This is especially the case when there’s a way to associate responses back to the customer. We’ve also seen non-response bias with questions about health insurance and adult TV programming. Drawing conclusions with data that is not missing at random should be done with caution.
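This comparison is sketched below using the likelihood-to-recommend ratings from the table above. In practice you'd run the t-test in a statistics package to get a p-value; here, to stay self-contained, a Welch t statistic is computed by hand with the standard library, and with only nine respondents the result is illustrative rather than meaningful.

```python
from statistics import mean, variance

# Likelihood-to-recommend ratings from the table above, split by whether
# the income question was answered (0 = answered, 1 = missing).
ltr_answered = [8, 9, 10, 8, 7, 8, 4]  # users 2-7 and 9
ltr_missing  = [4, 6]                  # users 1 and 8

def welch_t(a, b):
    """Welch's t statistic for two independent samples (unequal variances)."""
    se2 = variance(a) / len(a) + variance(b) / len(b)
    return (mean(a) - mean(b)) / se2 ** 0.5

t = welch_t(ltr_answered, ltr_missing)
print(f"mean (answered) = {mean(ltr_answered):.2f}, "
      f"mean (missing) = {mean(ltr_missing):.2f}, t = {t:.2f}")
```

A large t (compared against the critical value for the appropriate degrees of freedom) would be evidence that income non-responders differ on likelihood to recommend, i.e., that the data are not missing at random.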

2. Missing at Random: MAR

If there is no significant difference on our primary variable of interest between the missing and non-missing groups, we have evidence that our data is missing at random. So while it's still bad that we have missing data, at least there's not sufficient evidence that we're systematically missing data from a segment of our customers, at least with respect to our primary variable of interest.

3. Missing Completely at Random: MCAR

Finally, we compare the means on multiple dependent variables: likelihood to recommend, favorability toward the brand, and likelihood to repurchase for the responders and non-responders. We basically repeat the same statistical comparison we did with MAR, but with all the quantitative variables we're working with (in this case, three).

If we find NO significant differences between the responders and non-responders on ALL three variables, we have the best scenario called missing completely at random (MCAR). This gives us the most confidence that we aren’t systematically missing values from some of our respondents.
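The MCAR check just repeats the group comparison for every dependent variable. The sketch below computes the mean gap for each of the three variables from the table above; a real analysis would attach a significance test to each gap rather than eyeball the differences.

```python
from statistics import mean

# All three dependent variables from the table above, split by income
# non-response (same grouping as before).
answered = {  # users 2-7 and 9 (income provided)
    "recommend":    [8, 9, 10, 8, 7, 8, 4],
    "favorability": [6, 6, 7, 5, 1, 5, 4],
    "repurchase":   [7, 7, 7, 4, 2, 6, 4],
}
missing = {   # users 1 and 8 (income not provided)
    "recommend":    [4, 6],
    "favorability": [4, 5],
    "repurchase":   [5, 5],
}

gaps = {var: mean(answered[var]) - mean(missing[var]) for var in answered}
for var, gap in gaps.items():
    print(f"{var}: mean difference = {gap:+.2f}")
```

If none of the gaps is statistically significant, the MCAR conclusion holds; a significant gap on any one variable pushes you back toward MAR or NMAR for that variable.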

The sample data in the example table above is small, so it will be difficult to detect all but the largest differences due to missing data. In a typical survey with hundreds of responses and a few dozen missing responses, you'll have a greater ability to detect whether the non-responders differ systematically.

Understanding the impacts of missing data is the first step. The next logical step is doing something about the missing data. Substituting estimates for missing values (imputation) is a science of its own, and the topic of an upcoming blog!