Sample_Sizer

Sample_Sizer is a *fancy* Sample Size Calculator and Power Analysis tool. It is *fancy* because:

  1. You can put limits on maximum sample size (you can't wait forever...), and solve for the effect size you can detect at that sample size.
  2. It has full peeking support [peeking means checking the results at several points during the experiment and deciding each time whether to ship].
  3. You can have multiple treatments in one test, and it'll adjust critical values accordingly.
  4. It supports both continuous and binary metrics, which isn't so fancy, but many other calculators on the internet don't support both.
  5. It has results for different significance and power levels.

Select what percentage of the "subjects" (visitors to a website, users, patients, etc) will get each treatment (non-control experience).

Enter percentages between 0 and 100. For example, 25 means 25% and 0.2 means 0.2%.

The percentages should total less than 100. The remainder of the subjects are assigned to the control group. So, for a standard A/B test, enter only one number.

You can include up to 12 treatments.
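
As a rough illustration of the allocation rule above, here is a minimal Python sketch. The function name `control_share` and its checks are made up for this example; this is not Sample_Sizer's own code.

```python
MAX_TREATMENTS = 12  # limit stated above


def control_share(treatment_percentages):
    """Return the percentage of subjects left over for the control group.

    Hypothetical helper: it only mirrors the rules described in the text.
    """
    if not 1 <= len(treatment_percentages) <= MAX_TREATMENTS:
        raise ValueError("enter between 1 and 12 treatment percentages")
    if any(p <= 0 for p in treatment_percentages):
        raise ValueError("each treatment needs a positive percentage")
    total = sum(treatment_percentages)
    if total >= 100:
        raise ValueError("treatment percentages must total less than 100")
    return 100 - total


print(control_share([50]))      # standard A/B test: 50 left for control
print(control_share([25, 25]))  # two treatments: 50 left for control
```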


Fill in the current (or baseline) metric average before this experiment, as well as the standard deviation of the metric.

TIP: If your metric is a "binary metric" (for each subject in the experiment, the metric value is either True or False) like a Conversion Rate or a Retention Rate, then the number to put for standard deviation is just: SQRT(Current Conversion Rate x (1 - Current Conversion Rate)). Clicking the button below inserts the binary standard deviation if the mean is between 0 and 1.
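
The formula in the tip can be checked by hand. A minimal Python sketch follows; the function name `binary_std` is an assumption for illustration, not the name the tool uses.

```python
import math


def binary_std(rate):
    """Standard deviation of a True/False metric whose mean is `rate`."""
    if not 0 <= rate <= 1:
        raise ValueError("rate must be between 0 and 1")
    return math.sqrt(rate * (1 - rate))


# A 10% conversion rate has a standard deviation of about 0.3.
print(round(binary_std(0.10), 4))
```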


Insert Binary Standard Deviation

Peeking

Peeking is a powerful tool to reduce average experiment duration, but we need to adjust for its effect to properly size the experiment. Peeking is when you decide in the middle of the experiment whether you will continue with the experiment or just stop and make a decision now.

Sample_Sizer assumes the peeks are equally spaced, and the final analysis counts as one of them. For example, if you have a four-week experiment and two peeks, then you will make a decision about the experiment after the second week and at the end of the fourth week.
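
To make the spacing concrete, here is a small Python sketch that lists when the peeks fall. The helper `peek_times` is hypothetical; it follows the example above, where the last peek is the end of the experiment.

```python
def peek_times(duration, n_peeks):
    """Return equally spaced decision points; the last one is the end of the experiment."""
    if n_peeks < 1:
        raise ValueError("need at least one peek (the final analysis)")
    return [duration * k / n_peeks for k in range(1, n_peeks + 1)]


print(peek_times(4, 2))  # [2.0, 4.0]: after week 2 and at the end of week 4
```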

Adjust Significance Level and Power

By default, the significance level is set to 5% and the power of the sample sizing is set to 80%. These are "standard" values, but we shouldn't be wedded to them, so you can adjust power and significance level below.
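
To get a feel for how these two settings drive sample size, the sketch below converts them into normal critical values. It assumes a two-sided z-test with no peeking and a single treatment, which may not match what Sample_Sizer does internally; `z_values` and `relative_size` are names invented for this example. Under those assumptions, the required sample size is proportional to the square of the sum of the two values.

```python
from statistics import NormalDist


def z_values(alpha=0.05, power=0.80):
    """Critical values for a two-sided z-test (an assumption, see the text above)."""
    nd = NormalDist()
    return nd.inv_cdf(1 - alpha / 2), nd.inv_cdf(power)


def relative_size(alpha, power, base_alpha=0.05, base_power=0.80):
    """How much larger the sample must be compared with the default settings."""
    new = sum(z_values(alpha, power)) ** 2
    base = sum(z_values(base_alpha, base_power)) ** 2
    return new / base


z_alpha, z_power = z_values()
print(round(z_alpha, 2), round(z_power, 2))  # roughly 1.96 and 0.84
print(relative_size(0.05, 0.90))             # factor above 1: more power costs more subjects
```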

Minimum Detectable Percentage Effect

The minimum detectable percentage effect is the percentage effect size that you want to be able to detect. Note that this is different from the effect you expect the treatment to have. It is usually smaller than the expected effect.

A value of "1" means that you want the experiment to be able to detect a 1% increase in the metric.

You can also enter negative values if you want to reduce the metric.

TIP: Suppose your metric is a conversion rate and its baseline value is 10%. If you put 1 here, the experiment will be sized to detect at least a 1% relative change from 10%, which is a move to 10.1%. If you only want to be able to detect a full percentage point move, then use 10 here, because 10% more than 10% is 11%.
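
The arithmetic in the tip, as a short Python sketch. The function name `target_value` is made up for illustration.

```python
def target_value(baseline, pct_effect):
    """Metric value after a relative change of `pct_effect` percent."""
    return baseline * (1 + pct_effect / 100)


print(round(target_value(0.10, 1), 4))    # 0.101, i.e. 10.1%
print(round(target_value(0.10, 10), 4))   # 0.11, i.e. 11%
print(round(target_value(0.10, -10), 4))  # 0.09: negative values reduce the metric
```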

Maximum Sample Size [optional]

If there is a limit to how long you can wait for experiment results (likely), specifying this option will take that into account. If the experiment cannot reach enough power to detect the minimum detectable effect within the maximum sample size, the results will tell you the minimum effect size you can be confident you'll be able to detect given your time constraints.

For example, if your minimum detectable effect implies you need 1,000,000 users, but you only have time to wait for 100,000, Sample_Sizer will return 100,000 and tell you what effect sizes you can expect to detect (the results will include the unconstrained sample size as well).

To not limit the maximum sample size, leave it at 0.
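
The capping logic can be sketched in Python as below. This is a simplified illustration, not Sample_Sizer's actual method: it assumes a two-sided z-test, one treatment with the same number of subjects as control, no peeking, and a maximum expressed per group. All function names are invented for this example.

```python
import math
from statistics import NormalDist


def z_sum(alpha=0.05, power=0.80):
    """Sum of the two-sided critical value and the power quantile."""
    nd = NormalDist()
    return nd.inv_cdf(1 - alpha / 2) + nd.inv_cdf(power)


def required_n_per_group(baseline, sd, mde_pct, alpha=0.05, power=0.80):
    """Subjects per group needed to detect a relative change of mde_pct percent."""
    delta = abs(baseline * mde_pct / 100)  # absolute effect size
    if delta == 0:
        raise ValueError("the minimum detectable effect must be non-zero")
    return math.ceil(2 * (z_sum(alpha, power) * sd / delta) ** 2)


def detectable_pct(baseline, sd, n_per_group, alpha=0.05, power=0.80):
    """Smallest relative change (in percent) detectable with n_per_group subjects."""
    delta = z_sum(alpha, power) * sd * math.sqrt(2 / n_per_group)
    return 100 * delta / abs(baseline)


def size_with_cap(baseline, sd, mde_pct, max_n=0, alpha=0.05, power=0.80):
    """Return (sample size used, detectable percent effect, unconstrained size).

    max_n = 0 means no limit, following the convention described above.
    """
    unconstrained = required_n_per_group(baseline, sd, mde_pct, alpha, power)
    if max_n > 0 and unconstrained > max_n:
        return max_n, detectable_pct(baseline, sd, max_n, alpha, power), unconstrained
    return unconstrained, abs(mde_pct), unconstrained


# 10% conversion rate (standard deviation 0.3), 1% relative effect, cap of 100,000 per group
print(size_with_cap(0.10, 0.3, 1, max_n=100_000))
```

With these inputs the unconstrained size is well above the cap, so the sketch returns the cap together with the larger effect that can still be detected at that size.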


Tool by Zach Flynn.