How Many Do We Need to Test? [transcript]

In this episode we’re going to be talking about sampling. How many do we need to test? “42 is the answer to the ultimate question of life, the universe and everything,” according to The Hitchhiker’s Guide to the Galaxy, a science fiction adventure. But figuring out a statistically relevant sample size is not so straightforward. Statistics and probabilities are not an intuitive subject to understand or explain. And, I know, you ask a statistician or quality or reliability engineer a question about something like sampling and you’re likely to never get a straight answer. “It depends,” is what you’ll probably hear most often, and if you do get an answer and ask, “why?” you’ll get a lot of mathematical explanations and graphical plots.

Let’s talk through a generic thought process for choosing a statistically relevant sample size. Let’s also talk about some basics that you can look into for your own knowledge growth that will help you to meet the statistic-minded folks at least halfway. Our goal today is so you can better talk through a sampling scenario with your quality and reliability engineering friends, and you can better prepare for the information that they’re going to ask about. Let’s get started after this brief introduction.

Hello and welcome to Quality During Design, the place to use quality thinking to create products others love, for less. My name is Dianna. I’m a senior-level quality professional and engineer with over 20 years of experience in manufacturing and design. Listen in and then join the conversation at 

Today we’re talking about sample sizes for testing our designs. There is a lot to consider when choosing a sample size that is statistically relevant, where we have a certain confidence that the performance of our product is meeting its requirements or specifications. There is no one perfect answer for how many to sample. It depends on what we’re measuring, how confident we want to be in the result, and a whole host of other things.

Why are there so many questions going into what sample size to choose? It’s because we want to limit our chances of making errors. We’re collecting samples, testing them, and then fitting the results of our sample to a mathematical model (like a probability distribution). We use that model to make predictions or conclusions about how the greater population is going to perform, and we make decisions from it. If we don’t test enough then we could fail to act when we should. That’s called a Type II error. In design, this could mean we missed an opportunity for improvement. If we have too much data, then we could want to act when we shouldn’t, which is a Type I error.

We can position ourselves to make the best decisions from statistical analysis by understanding statistics and by considering historical information and how we’ll collect the data. There are some rules of thumb, but they don’t apply to every test scenario. And, if we blindly use them, we may end up testing more than we really need. If we test 30 samples, we can be confident using some standard statistical tests, where we can make conclusions about our data with a confidence level (you can look up the Central Limit Theorem to find out more about this). However, if we know more about our product and its requirements, we can make a better assessment of the sample size needed. There are many cases where we don’t need as many as 30 samples, which will save on test costs: there are fewer products to test and less test time needed.

If we have a brand-new design that’s never been tested before and we want to get to know how it will perform a little better, then a good rule of thumb is to test 5. While this is in no way a sample size that is statistically relevant, it is enough data points to plot to get a sense of the data. From this limited test, we can make better assessments for what our sample size should be for our formal verification or validation testing. It’s better than a complete guess. In my experience, testing 5 as a screening-like test plus the samples we’ll need for our official, statistically relevant tests is still less than 30 samples total. In designed experiments like DOE (design of experiments) and ANOVA (analysis of variance), there are some rules of thumb that suggest that 3 samples is okay. If we’re under development and doing HALT (highly accelerated life test), then doing 1 to 2 samples might be enough.

Let’s talk about some of the things we start thinking about when someone asks, “How many samples do I need to test?” We’re about to go through that generic thought process for choosing a statistically relevant sample size.

One of the first questions is, “What is our acceptance criteria?” This is getting into what it is we’re exactly testing and what type of analysis we may need to perform with the data. Are we comparing the performance of test samples against a requirement, or are we comparing two things? Is a requirement a minimum or maximum value, or is it based on an average? What is considered a success and what is considered a failure? Do our requirements include the operating and environmental conditions of our product? Was this information considered when the requirement was developed? Are there any reliability requirements, that the design needs to function at a certain performance level after a set period? When requirements are being developed, it’s good to think about how we’re going to test this requirement to make sure that our product meets it. We may have a draft requirement, but then our quality or reliability engineer on our team may give us input like environmental conditions, reliability targets, or even confidence. And that input may be a good addition to a requirement.

Back to choosing a sample size: we have a good understanding of the requirement. Now, “Is the data going to be continuous or discrete?” Continuous data is measured and can be any value in a range. Discrete data is counted and can only be certain values. Whether our data is continuous or discrete narrows down our choices for how we would determine a sample size. There are different probability distributions for each type of data.

We’ll also wonder, “How are we expecting this product to perform for this test? Do we have any historical data?” We could use results from benchtop testing, engineering confidence testing, or testing of similar products. What we’re looking for is a hint of statistical measures that might represent the formal data we’re going to be collecting through our sample. What we learn could affect our sample size. We’ll consider the type of probability distribution this data may take and the measures of location and dispersion. Typical measures of location are things like mean and median, and for dispersion are things like range, variance, and standard deviation. With our historical data, we’ll also look at the types of failure modes seen in a test. How do the observed failure modes align with the requirement? Is there more than one failure mode?
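To make those measures concrete, here’s a minimal sketch of summarizing a handful of benchtop results before a sample size discussion. The five tensile values are made up for illustration, not data from the episode, and the variable names are only placeholders.

```python
import statistics

# Hypothetical benchtop tensile results in newtons (made-up values)
benchtop_n = [72.0, 85.5, 91.0, 78.5, 99.0]

# Measures of location
mean = statistics.mean(benchtop_n)
median = statistics.median(benchtop_n)

# Measures of dispersion
data_range = max(benchtop_n) - min(benchtop_n)
variance = statistics.variance(benchtop_n)  # sample variance (divides by n - 1)
std_dev = statistics.stdev(benchtop_n)      # sample standard deviation

print(f"mean={mean:.1f} N, median={median:.1f} N")
print(f"range={data_range:.1f} N, variance={variance:.1f}, std dev={std_dev:.1f} N")
```

Numbers like these are the “hint” of what the formal data might look like, and the standard deviation in particular feeds into later sample size choices.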

Still thinking about what we might need for a sample size: we also think about, “How confident do we need to be in the result?” We covered this in a previous episode titled The Five Aspects of Good Reliability Goals and Requirements. Here’s what I said about confidence levels in that episode: “If we don’t specify a confidence level, then we can assume a 50% confidence level. I’ve never had a team define their confidence level in anything at a 50% level. Who wants to be half confident? We can state our desired confidence level as part of our reliability requirement. The confidence level that we choose can be dependent upon customer perception, the effect on the overall function of our product, or how serious it is if our product doesn’t work, to name a few. Why do we add a confidence level? Because there’s variation in everything, both in how we make product and how we measure it. Setting a confidence level accounts for the variability we’re going to see in our test data.”
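To see how much the confidence level can move a sample size, here’s a rough sketch using one common approach for pass/fail (discrete) data: the success-run, or zero-failure, formula n = ln(1 - C) / ln(R). This formula is my illustration and wasn’t named in the episode. It assumes every sample passes, and the function name and the reliability and confidence values are only examples.

```python
import math

def success_run_sample_size(reliability: float, confidence: float) -> int:
    """Samples needed, with zero failures allowed, to demonstrate
    `reliability` at `confidence` (success-run formula; an assumed method)."""
    if not (0 < reliability < 1 and 0 < confidence < 1):
        raise ValueError("reliability and confidence must be between 0 and 1")
    return math.ceil(math.log(1 - confidence) / math.log(reliability))

# Same 90% reliability target, different confidence levels
n_half_confident = success_run_sample_size(0.90, 0.50)
n_95_confident = success_run_sample_size(0.90, 0.95)

print(n_half_confident, n_95_confident)
```

Under these assumptions, going from “half confident” to 95% confident takes the test from a handful of samples to a few dozen, which is why it pays to state the confidence level up front.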

Another question we ask to help determine a statistically relevant sample size is about difference to detect. “What is the smallest difference we want to be able to notice?” The difference to detect is our ability to detect a difference between our requirement and what we measure, or the difference between a hypothesized value and the actual value. It’s related to a Type II error (that’s the error where we fail to act when we really should or fail to improve our design when it needs it). For continuous data, the difference to detect is typically related to the number of standard deviations, but it could be just based on a choice that we make. For discrete data, the difference to detect is usually based on a proportion of the population.

To choose a difference to detect, we consider everything we know about the situation at this point, starting with how our requirement (or acceptance criteria) compares with how we’re expecting our results to look from our historical data. Are we expecting our test results to be close to our target spec? If not, then maybe we’ll choose a difference to detect that’s large. If we do expect the results to be close to the spec, then we’ll likely choose a small difference to detect. For example, if we’re testing for tensile strength with a minimum requirement of 5N, and our benchtop tests of some initial samples are indicating a performance of 70 to 100N, then we might not be at all worried about being able to detect a small difference. Our product is performing with a factor of safety that’s pretty large, so we decide that our risk of making an error is pretty low. So, we’ll choose a large difference to detect, like 4N. Choosing a large difference to detect will reduce our sample size requirements. However, if our benchtop results are indicating a performance that ranges from 6 to 10N, that’s a lot closer to our 5N minimum requirement and there’s not much room for a factor of safety. We’ll want to detect a much smaller difference. We may want to be able to detect a difference that’s less than the sample standard deviation of our historical measures. How much less? That depends on how critical this component or feature is to the performance, functionality, reliability, and safety of our product. Choosing a small difference will increase our sample size requirement.
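Here’s a rough sketch of how the difference to detect drives sample size for continuous data. It uses a textbook normal-approximation formula for a one-sided, one-sample test, n = ((z_alpha + z_beta) * sigma / delta)^2, which is my assumption and not a method named in the episode. The standard deviation of 1.5N, the 5% Type I risk, and the 90% power are hypothetical values chosen only to show the trend.

```python
import math
from statistics import NormalDist

def sample_size_for_difference(sigma: float, delta: float,
                               alpha: float = 0.05, power: float = 0.90) -> int:
    """Approximate n for a one-sided, one-sample z-test (assumed method).

    sigma: assumed standard deviation of the measurement
    delta: smallest difference we want to be able to detect
    alpha: accepted Type I error risk
    power: 1 minus the accepted Type II error risk
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(((z_alpha + z_beta) * sigma / delta) ** 2)

sigma_n = 1.5  # hypothetical standard deviation in newtons

n_large = sample_size_for_difference(sigma_n, delta=4.0)  # large difference to detect
n_small = sample_size_for_difference(sigma_n, delta=0.5)  # a third of a standard deviation

print(n_large, n_small)
```

The normal approximation understates the need when the answer is only a few samples, so treat the small result as a floor and not a plan. The point is the trend: shrinking the difference to detect from 4N to 0.5N pushes the sample size from a couple of units to dozens.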

Finally, we’ll look at the way we’re planning to test. “What is the test method? Is it validated? What is its precision and accuracy?” We may have a historical test method we want to use, but is it going to be able to detect the difference that we need to? Is its precision and accuracy good enough for our test? Depending on how our requirement is stated, a reliability engineer may recommend a different test. There are reliability life tests where we’re estimating the durability of our component. Or we could use accelerated life testing where we apply stresses on our design to bring about and accumulate damage. Different test designs will require different sample sizes.

Now, we’re equipped with the information to be able to choose a sample size that is statistically relevant. We start with understanding the acceptance criteria, knowing any historical information, choosing how confident we need to be, and ensuring the test and data collection methods match the requirement.

What are topics you could study a bit to get a better understanding of sampling? I’d recommend looking at hypothesis tests as a topic, where we’re evaluating a null hypothesis against an alternate hypothesis. This type of analysis is used in science, engineering, and testing. Getting a good handle on hypothesis testing will set you up for a better understanding of many sampling methods. Any basic statistics book would provide a good introduction to hypothesis testing.

Our goal today was to be able to better talk through a sampling scenario with our quality and reliability engineering friends and have a better appreciation of the types of things that are considered when a sample size is picked. We stepped through a generic thought process for choosing a statistically relevant sample size. And I recommended learning more about hypothesis testing.

What’s today’s insight to action? Before jumping to ask, “How many do I need to test?” – as a team, we can equip ourselves with as many answers as we can to the questions we asked today. I’ll include a list of these questions on the podcast blog. Even if we don’t have all the answers, it’s good to include quality and reliability engineers in sample size calculations, so we’ll be confident enough to approach them about our test. They may have an alternative test method like accelerated life testing. This plus knowing that each requirement needs its own sampling study (like we did today) shows that getting them involved with the iterative drafts of our test plan could be a benefit to the project.

Please visit this podcast blog and others at Subscribe to the weekly newsletter to keep in touch. If you like this podcast or have a suggestion for an upcoming episode, let me know. You can find me at, on LinkedIn, or you could leave me a voicemail at 484-341-0238. This has been a production of Denney Enterprises. Thanks for listening!