Surely, popular thinking went, the larger the difference, the more people you’d need to ask to make sure it was real? It makes intuitive sense, but ignores the underlying principles of probability theory that govern such situations.
Now, before there’s a stampede for the exit, this article is not going to be heavy on mathematics, probability, statistics, or any other related esoterica. What we’re going to do is take a look at the underlying principles of probability theory—in general terms—and see how we can make use of them to understand issues such as the following:
- how many people to include in a usability test
- how to efficiently identify population norms and popular beliefs
- how to do quick-and-easy A/B test analysis
Then we’ll move on to take a look at a case study that shows why a large sample size doesn’t always guarantee accuracy in user research, when such situations can arise, and what we can do about it.
Understanding Optimal Usability Test Size
Across the usability landscape, conventional wisdom holds that you can do usability testing with just a handful of users, as characterized by the title of Jakob Nielsen’s Alertbox article from 2000, “Why You Only Need to Test with 5 Users.” Beyond that handful, you’ll see diminishing returns from each successive test session, because each additional user is likely to encounter mostly issues that earlier users have already found.
Nielsen provides the reasoning that each user—on average and in isolation—identifies approximately 31% of all usability issues in the product under evaluation. So, the first test session uncovers 31% of all issues; the second uncovers another 31%, with some overlap with the first; and so on. After five test sessions, you’ve already recorded approximately 85% of all issues, so the value of the sixth, seventh, and subsequent test sessions gets lower and lower.
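If you want to check the arithmetic, Nielsen’s claim follows from a simple at-least-one-user-hits-the-issue calculation. Here is a minimal sketch, assuming his 31% figure and treating users as independent; the numbers are illustrative rather than a law of nature:

```python
# Probability that an issue has been seen at least once after n test sessions,
# assuming each user independently uncovers about 31% of all issues.
issue_discovery_rate = 0.31

for n in range(1, 11):
    found = 1 - (1 - issue_discovery_rate) ** n
    print(f"{n:>2} users: ~{found:.0%} of issues found")
```

Running this shows roughly 84–85% of issues found at five users, with only single-digit gains for each session after that—the diminishing-returns curve Nielsen describes.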
Identifying Norms and Minority Views
There’s another way to explain this observation: Some problems are more widespread, or are experienced by more people, than others. Because we choose users at random for our usability tests, the more prominent problems are the ones that are likely to show up early and repeatedly.
In other words, as we test with more users, we learn not only what issues people experience, but also, from the overlap between the issues different users encounter, which problems are likely to be the most widespread among the target audience. So, even with a very small test base, we can be reasonably sure we’re identifying the problems that will affect the biggest proportion of the user base.
We can use the same principle to identify the new features or product changes that would appeal to the most people, much as ethnographers use it to identify population norms among cultural groups. If we ask a small, randomly selected group of people what product changes they’d like to see, the suggestions that are most popular across the entire user population are the ones most likely to come up within that small group.
But this also highlights a danger of small sample-size tests and surveys: Minority voices don’t get heard. The issues that affect small segments of the target population are less likely to show up in a small random sample of users—and so, you’re more likely to miss them.
If your user research needs to include the voices of minority segments within your overall audience, it is important to plan for this ahead of time. There are a number of different options at your disposal:
- When selecting your test or survey participants, ensure that you include at least some participants who represent each minority segment. We sometimes refer to this as a stratified sample.
- Run your tests or surveys with many more participants. This also has the advantage of reducing the overall level of error in your test data. The sketch after this list gives a rough sense of how large a simple random sample needs to be before small segments reliably appear.
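To put rough numbers on the second option, you can ask how large a simple random sample must be before at least one member of a given segment is likely to appear. This back-of-the-envelope sketch isn’t from the article; the 90% target and the segment sizes are illustrative assumptions:

```python
import math

def sample_size_for_segment(segment_share, confidence=0.90):
    """Smallest simple random sample in which at least one member of a
    segment appears with the given probability."""
    return math.ceil(math.log(1 - confidence) / math.log(1 - segment_share))

for share in (0.30, 0.10, 0.05, 0.01):
    n = sample_size_for_segment(share)
    print(f"{share:.0%} segment: about {n} participants for a 90% chance "
          "of including at least one member")
```

The smaller the segment, the faster the required sample grows—which is why deliberately recruiting from each minority segment, as in a stratified sample, is usually the cheaper route.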
Let us now return to the subject of the Mariana Da Silva quotation. Why is it that we don’t have to measure as many people if heights differ greatly between the two populations? Don’t we still need to measure a decent-sized sample, calculate averages and confidence intervals, then carry out some sort of significance test?
The short answer is: No.
If the two populations are very different in terms of their distributions of heights, it’s likely we’ll see that difference reflected very quickly in the mean and standard deviation of our test data. For example, let’s assume we’ve measured the heights of ten men from each city and found that the averages differ by 10 centimeters, or about 4 inches. That’s a large observed difference. But what can we conclude from it? Our initial response might be that it’s just an anomaly of sampling: we happened to pick taller Londoners.
However, as we measure more people from each city and the height differential keeps appearing, the likelihood that the difference is due to random chance shrinks very quickly. It just isn’t plausible that we’re randomly, yet consistently, picking abnormally tall people to measure in London, or abnormally short people in New York.
Now compare this to what happens when an observed difference is very small. With a small difference, the explanation that it’s down to random chance remains plausible for much longer—that is, until we’ve measured a much, much larger number of people. Therefore, we need much larger sample sizes before our statistical analysis can conclude that the difference is real.
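You can see both halves of this argument in a quick simulation. The sketch below isn’t from the article: it assumes men’s heights are roughly normal with a standard deviation of about 7 centimeters, uses a two-sample t-test at the conventional p &lt; 0.05 threshold, and requires NumPy and SciPy; all of the specific numbers are illustrative.

```python
# Rough simulation: how often does a two-sample t-test detect a true height
# difference between two cities, for different effect sizes and sample sizes?
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(42)

def detection_rate(true_difference_cm, n_per_city, trials=2000):
    """Fraction of simulated studies in which the difference is
    significant at p < 0.05."""
    hits = 0
    for _ in range(trials):
        london = rng.normal(175 + true_difference_cm, 7, n_per_city)
        new_york = rng.normal(175, 7, n_per_city)
        if ttest_ind(london, new_york, equal_var=False).pvalue < 0.05:
            hits += 1
    return hits / trials

for diff_cm in (10, 1):          # large versus small true difference
    for n in (10, 50, 400):      # people measured per city
        rate = detection_rate(diff_cm, n)
        print(f"true difference {diff_cm:>2} cm, n={n:>3} per city: "
              f"detected in {rate:.0%} of simulated studies")
```

With a 10-centimeter true difference, even ten people per city is usually enough for the test to flag it; with a 1-centimeter difference, hundreds of measurements per city still leave plenty of room for doubt.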