A/B Test Sample Size Calculator
Plan your A/B test by calculating the required sample size per variation. Enter your baseline conversion rate and minimum detectable effect to determine how many visitors you need for a statistically valid experiment.
Sample Size Quick Reference
Sample size per variation at 95% confidence, 80% power:
| Baseline Rate | 5% MDE | 10% MDE | 20% MDE |
|---|---|---|---|
| 1% | 637,008 | 163,092 | 42,691 |
| 3% | 207,936 | 53,208 | 13,911 |
| 5% | 122,121 | 31,231 | 8,155 |
| 10% | 57,760 | 14,749 | 3,839 |
| 20% | 25,580 | 6,507 | 1,680 |
* MDE = Minimum Detectable Effect (relative). Lower MDE or baseline rate requires larger samples.
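Every cell in the table follows from the standard two-sample proportion formula. A short Python sketch using only the standard library (the function name `sample_size` is my own) reproduces the grid:

```python
import math
from statistics import NormalDist

def sample_size(baseline: float, mde: float,
                confidence: float = 0.95, power: float = 0.80) -> int:
    """Visitors needed per variation for a two-sample proportion test."""
    p1, p2 = baseline, baseline * (1 + mde)  # mde is relative
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # 1.96 at 95%
    z_beta = NormalDist().inv_cdf(power)                      # 0.842 at 80%
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

for baseline in (0.01, 0.03, 0.05, 0.10, 0.20):
    row = [f"{sample_size(baseline, mde):,}" for mde in (0.05, 0.10, 0.20)]
    print(f"{baseline:.0%}: " + " | ".join(row))
```

Rounding up to the next whole visitor (`math.ceil`) matches the table values above.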
Frequently Asked Questions
How do you calculate A/B test sample size?
Sample size is calculated using the formula: n = (Zα/2 + Zβ)² × (p₁(1-p₁) + p₂(1-p₂)) / (p₂-p₁)², where p₁ is the baseline conversion rate, p₂ is the expected improved rate, Zα/2 is the z-value for your confidence level, and Zβ is the z-value for your desired power.
What is minimum detectable effect (MDE)?
MDE is the smallest relative improvement you want to be able to detect in your test. A 10% MDE on a 5% baseline means you want to detect if the variant achieves at least 5.5% (a 0.5 percentage point absolute improvement). Smaller MDEs require larger sample sizes.
What is statistical power?
Statistical power (1-β) is the probability of correctly detecting a real effect. 80% power means you have an 80% chance of detecting a true difference and a 20% chance of missing it (Type II error). Higher power requires more samples but reduces false negatives.
Why do I need so many visitors for my A/B test?
Sample size depends on your baseline rate, desired MDE, confidence level, and power. Lower baseline rates, smaller MDEs, higher confidence, and higher power all increase the required sample size. A 5% baseline with 5% relative MDE at 95% confidence and 80% power needs ~122,000 visitors per variation.
How long should I run my A/B test?
Divide your total required sample size by your daily traffic. For example, if you need 20,000 visitors total and get 2,000/day, run for at least 10 days. Also run for at least 1-2 full weeks to account for day-of-week variations in user behavior.
What confidence level and power should I use?
The standard is 95% confidence and 80% power. Use 90% confidence for faster iterations where false positives are less costly. Use 99% confidence for high-impact changes. Increase power to 90-95% when missing a real improvement would be very costly (e.g., pricing tests).
Can I reduce the required sample size?
Yes: (1) Accept a larger MDE — if you only care about big improvements, you need fewer samples. (2) Lower confidence to 90%. (3) Lower power to 70-80%. (4) Use one-tailed tests if you only care about improvements (not recommended for most cases). (5) Focus traffic on the test pages.
What happens if I stop my test early?
Stopping early when you see a significant result inflates false positive rates dramatically — a phenomenon called 'peeking.' You may conclude a variant is better when it isn't. Always commit to the pre-calculated sample size before analyzing results, or use sequential testing methods designed for continuous monitoring.
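A small Monte Carlo simulation makes the peeking effect concrete. Both arms below share the same true conversion rate, so every "significant" result is a false positive; checking at ten interim points inflates the error rate well beyond the nominal 5%. (All constants here are illustrative assumptions, and the binomial draws use a normal approximation.)

```python
import math
import random

def z_significant(c1, n1, c2, n2, z_crit=1.96):
    """Two-proportion z-test: is the observed difference significant?"""
    p_pool = (c1 + c2) / (n1 + n2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    return se > 0 and abs((c1 / n1 - c2 / n2) / se) > z_crit

def draw(n, p):
    # Normal approximation to a binomial conversion count
    return max(0, round(random.gauss(n * p, math.sqrt(n * p * (1 - p)))))

random.seed(42)
P, BATCH, PEEKS, SIMS = 0.05, 1000, 10, 500  # no true difference between arms

peeking_fp = fixed_fp = 0
for _ in range(SIMS):
    ca = cb = na = nb = 0
    flagged = False
    for _ in range(PEEKS):
        ca += draw(BATCH, P); cb += draw(BATCH, P)
        na += BATCH; nb += BATCH
        if z_significant(ca, na, cb, nb):
            flagged = True                      # a peek would have stopped here
    peeking_fp += flagged
    fixed_fp += z_significant(ca, na, cb, nb)   # analysis only at the end

print(f"False positives with peeking: {peeking_fp / SIMS:.1%}")
print(f"False positives, fixed horizon: {fixed_fp / SIMS:.1%}")
```

The fixed-horizon rate lands near the nominal 5%, while stopping at the first significant peek typically triples or quadruples it.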
Why Sample Size Matters in A/B Testing
Running an A/B test without adequate sample size is like flipping a coin three times and concluding it's unfair. Sample size determines the reliability of your test results. Too few visitors and you'll either miss real improvements (false negatives) or declare winners that don't actually exist (false positives).
Calculating sample size before running your experiment is critical because:
- It tells you how long the test needs to run
- It prevents premature stopping (which inflates false positive rates)
- It ensures you have enough statistical power to detect meaningful differences
- It helps you decide if a test is feasible given your traffic levels
Sample Size Formula
The required sample size per variation for a two-sample proportion test is:
Sample Size Per Variation:
n = (Zα/2 + Zβ)² × (p₁(1-p₁) + p₂(1-p₂)) / (p₂ - p₁)²
Where:
- n = required sample size per variation
- Zα/2 = z-value for the confidence level (e.g., 1.96 for 95%)
- Zβ = z-value for the statistical power (e.g., 0.842 for 80%)
- p₁ = baseline conversion rate
- p₂ = expected conversion rate (p₁ × (1 + MDE))
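The formula translates directly into code. A minimal sketch with the z-values for 95% confidence and 80% power hardcoded (the function name is my own):

```python
import math

def sample_size(p1: float, mde: float) -> float:
    """Per-variation n at 95% confidence / 80% power."""
    p2 = p1 * (1 + mde)                   # expected variant rate
    z_alpha, z_beta = 1.95996, 0.84162    # z for 95% (two-sided) and 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2

# Example 1 below: 3% baseline, 10% relative MDE
print(math.ceil(sample_size(0.03, 0.10)))
```

Round up to the next whole visitor, since a fraction of a visitor cannot convert.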
Sample Size Calculation Examples
Example 1: Standard E-commerce Test
Baseline conversion rate: 3%. You want to detect a 10% relative improvement (3% → 3.3%) at 95% confidence, 80% power.
p₁ = 0.03, p₂ = 0.033
Zα/2 = 1.96, Zβ = 0.842
n = (1.96 + 0.842)² × (0.03 × 0.97 + 0.033 × 0.967) / (0.003)²
n ≈ 53,208 per variation (106,416 total)
Example 2: High-converting Landing Page
Baseline: 15% conversion. Detecting 5% relative improvement at 95% confidence, 80% power.
p₁ = 0.15, p₂ = 0.1575
Absolute difference = 0.75pp
n ≈ 36,307 per variation (72,614 total)
At 10,000 visitors/day: ~8 days to complete
Example 3: Bold Change, Low Traffic
Baseline: 2%. Detecting 50% relative improvement (2% → 3%) at 95% confidence, 80% power.
p₁ = 0.02, p₂ = 0.03
Absolute difference = 1pp
n ≈ 3,823 per variation (7,646 total)
At 500 visitors/day: ~16 days
Understanding Key Parameters
Baseline Conversion Rate
Your current conversion rate before the test. Lower baseline rates require more samples because conversions are rarer events. A 1% baseline needs roughly 5x more samples than a 5% baseline for the same relative MDE.
Minimum Detectable Effect (MDE)
The smallest relative improvement you want to detect. A 10% MDE on a 5% baseline means detecting an increase to 5.5%. Smaller MDEs require exponentially more samples — halving the MDE roughly quadruples the required sample size.
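The "halving the MDE roughly quadruples the sample" rule can be checked numerically; a self-contained sketch (function name is my own):

```python
import math
from statistics import NormalDist

def sample_size(p1: float, mde: float) -> int:
    """Per-variation n at 95% confidence / 80% power."""
    p2 = p1 * (1 + mde)
    z = NormalDist().inv_cdf(0.975) + NormalDist().inv_cdf(0.80)
    return math.ceil(z ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p2 - p1) ** 2)

for mde in (0.20, 0.10, 0.05):
    print(f"{mde:.0%} MDE on a 5% baseline: {sample_size(0.05, mde):,}")
```

Each halving of the MDE multiplies the required sample by just under 4, because the MDE enters the formula squared in the denominator.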
Confidence Level (1 - α)
The probability of not making a Type I error (false positive). At 95% confidence, there's a 5% chance of declaring a winner when there's actually no difference.
Statistical Power (1 - β)
The probability of detecting a real effect. At 80% power, there's a 20% chance of missing a real improvement (Type II error / false negative). Higher power requires more samples.
| Error Type | Name | Controlled By | Consequence |
|---|---|---|---|
| Type I (α) | False Positive | Confidence Level | Ship a change that doesn't work |
| Type II (β) | False Negative | Statistical Power | Miss a real improvement |
How to Reduce Required Sample Size
- Accept a larger MDE — Only test changes you expect to have a meaningful impact. If you're only willing to ship a 20%+ improvement, use a 20% MDE.
- Lower your confidence level — Use 90% instead of 95% for non-critical experiments. This reduces sample size by ~20%.
- Accept lower power — 80% power is standard, but 70% is acceptable for screening tests. Dropping from 80% to 70% power reduces sample size by ~20%.
- Focus traffic — Run the test only on pages or segments with the highest traffic to accelerate data collection.
- Use composite metrics — Metrics with higher rates (like click-through rate vs. purchase rate) require fewer samples.
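The confidence and power levers above are easy to quantify; a sketch comparing the defaults against the relaxed settings (function name is my own):

```python
import math
from statistics import NormalDist

def sample_size(p1, mde, confidence=0.95, power=0.80):
    """Per-variation n for a two-sample proportion test."""
    p2 = p1 * (1 + mde)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2) + NormalDist().inv_cdf(power)
    return math.ceil(z ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p2 - p1) ** 2)

base = sample_size(0.05, 0.10)  # 95% confidence, 80% power
print(f"90% confidence: {sample_size(0.05, 0.10, confidence=0.90) / base:.0%} of baseline n")
print(f"70% power:      {sample_size(0.05, 0.10, power=0.70) / base:.0%} of baseline n")
```

Both relaxations cut the requirement by roughly a fifth, which is why they are worth considering for low-stakes screening tests.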
Common Pitfalls in Sample Size Planning
- Using unrealistic MDEs — A 50% improvement sounds great but is rarely achievable. Most real improvements are 5-15%. Plan accordingly.
- Forgetting about test duration — Even if you have enough total traffic, you need to run for at least 1-2 full weeks to capture day-of-week effects.
- Not accounting for multiple comparisons — Testing 5 variants against a control requires a Bonferroni correction or similar adjustment.
- Ignoring seasonality — Running a test during a seasonal peak may not generalize to other periods.
- Peeking at results — Checking significance before reaching the planned sample size dramatically increases false positive rates.
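The multiple-comparisons point can be folded into the planning step by splitting α across comparisons (Bonferroni). A sketch, assuming the simple α/k split and a `sample_size` helper of my own naming:

```python
import math
from statistics import NormalDist

def sample_size(p1, mde, alpha=0.05, power=0.80):
    """Per-variation n for a two-sample proportion test."""
    p2 = p1 * (1 + mde)
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return math.ceil(z ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p2 - p1) ** 2)

k = 5  # five variants, each compared against the control
single = sample_size(0.05, 0.10)
adjusted = sample_size(0.05, 0.10, alpha=0.05 / k)  # Bonferroni-adjusted alpha
print(f"single comparison: {single:,}; Bonferroni for {k} variants: {adjusted:,}")
```

The stricter per-comparison α raises the required sample per variation by roughly half in this example, which is why multi-variant tests are so much more expensive than simple A/B splits.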