A/B Test Sample Size Calculator

Plan your A/B test by calculating the required sample size per variation. Enter your baseline conversion rate and minimum detectable effect to determine how many visitors you need for a statistically valid experiment.

n = (Zα/2 + Zβ)² × (p₁(1-p₁) + p₂(1-p₂)) / (p₂ - p₁)²


Sample Size Quick Reference

Sample size per variation at 95% confidence, 80% power:

Baseline Rate    5% MDE     10% MDE    20% MDE
1%               637,008    163,092     42,691
3%               207,936     53,208     13,911
5%               122,121     31,231      8,155
10%               57,760     14,749      3,839
20%               25,580      6,507      1,680

* MDE = Minimum Detectable Effect (relative). Lower MDE or baseline rate requires larger samples.

Frequently Asked Questions

How do you calculate the sample size for an A/B test?

Sample size is calculated with the formula n = (Zα/2 + Zβ)² × (p₁(1-p₁) + p₂(1-p₂)) / (p₂-p₁)², where p₁ is the baseline conversion rate, p₂ is the expected conversion rate after the improvement, Zα/2 is the z-value for the chosen confidence level, and Zβ is the z-value for the desired statistical power.

What is the minimum detectable effect (MDE)?

The MDE is the smallest relative improvement you want to be able to detect. At a 5% baseline conversion rate, a 10% MDE means you want to detect whether the variation reaches at least 5.5% (an absolute lift of 0.5 percentage points). The smaller the MDE, the larger the required sample size.

What is statistical power?

Statistical power (1-β) is the probability of correctly detecting a real effect. 80% power means an 80% chance of detecting a true difference and a 20% chance of missing it (a Type II error). Higher power requires more samples but reduces false negatives.

Why do A/B tests need so many visitors?

Sample size depends on the baseline conversion rate, the desired MDE, the confidence level, and the statistical power. The lower the baseline rate, the smaller the MDE, and the higher the confidence and power, the more samples you need. With a 5% baseline rate, a 5% relative MDE, 95% confidence, and 80% power, each variation needs about 122,000 visitors.

How long should an A/B test run?

Divide the total required sample size by your daily traffic. For example, if you need 20,000 visitors and receive 2,000 per day, run the test for at least 10 days. You should also run it for at least 1-2 full weeks to account for day-of-week variation in user behavior.

What confidence level and power should you use?

The standard is 95% confidence and 80% power. For rapid iteration where false positives are cheap, use 90% confidence. For high-impact changes, use 99% confidence. When missing a real improvement is very costly (e.g., pricing tests), raise power to 90-95%.

Can you reduce the required sample size?

Yes: (1) accept a larger MDE, since you need far fewer samples if you only care about large improvements; (2) lower the confidence level to 90%; (3) lower the power to 70-80%; (4) use a one-tailed test if you only care about improvement rather than degradation (not recommended for most cases); (5) concentrate traffic on the pages under test.

What happens if you stop a test early?

Stopping a test as soon as you see a significant result dramatically inflates the false positive rate, a phenomenon known as "peeking". You may wrongly conclude that a variation is better. Always reach your pre-calculated sample size before analyzing results, or use sequential testing methods designed for continuous monitoring.

Why Sample Size Matters in A/B Testing

Running an A/B test without adequate sample size is like flipping a coin three times and concluding it's unfair. Sample size determines the reliability of your test results. Too few visitors and you'll either miss real improvements (false negatives) or declare winners that don't actually exist (false positives).

Calculating sample size before running your experiment is critical because:

  • It tells you how long the test needs to run
  • It prevents premature stopping (which inflates false positive rates)
  • It ensures you have enough statistical power to detect meaningful differences
  • It helps you decide if a test is feasible given your traffic levels

Sample Size Formula

The required sample size per variation for a two-sample proportion test is:

Sample Size Per Variation:

n = (Zα/2 + Zβ)² × (p₁(1-p₁) + p₂(1-p₂)) / (p₂ - p₁)²

Where:

  • n = required sample size per variation
  • Zα/2 = z-value for the confidence level (e.g., 1.96 for 95%)
  • Zβ = z-value for the statistical power (e.g., 0.842 for 80%)
  • p₁ = baseline conversion rate
  • p₂ = expected conversion rate (p₁ × (1 + MDE))
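As a sketch of the formula above, the helper below (a hypothetical `sample_size` function, not an official library API) uses the standard library's `NormalDist` to look up z-values and rounds up to whole visitors:

```python
from math import ceil
from statistics import NormalDist

def sample_size(p1, mde, confidence=0.95, power=0.80):
    """Required sample size per variation for a two-sided two-proportion test."""
    p2 = p1 * (1 + mde)  # expected rate after the relative improvement
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # 1.96 at 95%
    z_beta = NormalDist().inv_cdf(power)                      # 0.842 at 80%
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

print(sample_size(0.03, 0.10))  # 53208, matching the quick-reference table
```

Z-value precision and rounding conventions vary, so published calculators may differ from this sketch by a handful of visitors.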

Sample Size Calculation Examples

Example 1: Standard E-commerce Test

Baseline conversion rate: 3%. You want to detect a 10% relative improvement (3% → 3.3%) at 95% confidence, 80% power.

p₁ = 0.03, p₂ = 0.033
Zα/2 = 1.96, Zβ = 0.842
n = (1.96 + 0.842)² × (0.03 × 0.97 + 0.033 × 0.967) / (0.003)²
n ≈ 53,208 per variation (106,416 total)

Example 2: High-converting Landing Page

Baseline: 15% conversion. Detecting 5% relative improvement at 95% confidence, 80% power.

p₁ = 0.15, p₂ = 0.1575
Absolute difference = 0.75pp
n ≈ 36,307 per variation (72,614 total)
At 10,000 visitors/day: ~8 days to complete

Example 3: Bold Change, Low Traffic

Baseline: 2%. Detecting 50% relative improvement (2% → 3%) at 95% confidence, 80% power.

p₁ = 0.02, p₂ = 0.03
Absolute difference = 1pp
n ≈ 3,823 per variation (7,646 total)
At 500 visitors/day: ~16 days

Understanding Key Parameters

Baseline Conversion Rate

Your current conversion rate before the test. Lower baseline rates require more samples because conversions are rarer events. A 1% baseline needs roughly 5x more samples than a 5% baseline for the same relative MDE.

Minimum Detectable Effect (MDE)

The smallest relative improvement you want to detect. A 10% MDE on a 5% baseline means detecting an increase to 5.5%. Smaller MDEs require quadratically more samples — halving the MDE roughly quadruples the required sample size.
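That quadratic scaling can be checked numerically. The snippet below re-implements the article's formula as a self-contained helper (`n_per_variation` is our own name) and compares a 20% MDE against a 10% MDE at a 5% baseline. The ratio is close to, but not exactly, 4x because the variance term also shifts with p₂:

```python
from math import ceil
from statistics import NormalDist

def n_per_variation(p1, mde, confidence=0.95, power=0.80):
    # Same sample-size formula as in the article, repeated so this runs alone.
    z = NormalDist().inv_cdf
    p2 = p1 * (1 + mde)
    var = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z(1 - (1 - confidence) / 2) + z(power)) ** 2
                * var / (p2 - p1) ** 2)

# Halving the MDE from 20% to 10% at a 5% baseline:
print(round(n_per_variation(0.05, 0.10) / n_per_variation(0.05, 0.20), 1))  # 3.8
```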

Confidence Level (1 - α)

The probability of not making a Type I error (false positive). At 95% confidence, there's a 5% chance of declaring a winner when there's actually no difference.

Statistical Power (1 - β)

The probability of detecting a real effect. At 80% power, there's a 20% chance of missing a real improvement (Type II error / false negative). Higher power requires more samples.

Error Type    Name            Controlled By       Consequence
Type I (α)    False Positive  Confidence Level    Ship a change that doesn't work
Type II (β)   False Negative  Statistical Power   Miss a real improvement

How to Reduce Required Sample Size

  1. Accept a larger MDE — Only test changes you expect to have a meaningful impact. If you're only willing to ship a 20%+ improvement, use a 20% MDE.
  2. Lower your confidence level — Use 90% instead of 95% for non-critical experiments. This reduces sample size by ~20%.
  3. Accept lower power — 80% power is standard, but 70% is acceptable for screening tests. This reduces sample size by ~21%.
  4. Focus traffic — Run the test only on pages or segments with the highest traffic to accelerate data collection.
  5. Use composite metrics — Metrics with higher rates (like click-through rate vs. purchase rate) require fewer samples.

Common Pitfalls in Sample Size Planning

  • Using unrealistic MDEs — A 50% improvement sounds great but is rarely achievable. Most real improvements are 5-15%. Plan accordingly.
  • Forgetting about test duration — Even if you have enough total traffic, you need to run for at least 1-2 full weeks to capture day-of-week effects.
  • Not accounting for multiple comparisons — Testing 5 variants against a control requires a Bonferroni correction or similar adjustment.
  • Ignoring seasonality — Running a test during a seasonal peak may not generalize to other periods.
  • Peeking at results — Checking significance before reaching the planned sample size dramatically increases false positive rates.
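The peeking effect can be demonstrated with a small A/A simulation: both arms share the same true 5% conversion rate, so every "significant" result is a false positive. Everything here (the rate, seed, checkpoint schedule, and z-test helper) is an illustrative sketch using only the standard library:

```python
import random
from statistics import NormalDist

def p_value(x_a, n_a, x_b, n_b):
    """Two-sided two-proportion z-test p-value with a pooled standard error."""
    pooled = (x_a + x_b) / (n_a + n_b)
    se = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    if se == 0:
        return 1.0
    z = abs(x_a / n_a - x_b / n_b) / se
    return 2 * (1 - NormalDist().cdf(z))

random.seed(7)  # arbitrary seed for reproducibility
SIMS, N, STEP = 300, 2000, 200  # A/A tests: no real difference exists
peeking_fp = fixed_fp = 0
for _ in range(SIMS):
    a = [random.random() < 0.05 for _ in range(N)]
    b = [random.random() < 0.05 for _ in range(N)]
    # Peeking: declare a winner the first time p < 0.05 at any checkpoint.
    if any(p_value(sum(a[:k]), k, sum(b[:k]), k) < 0.05
           for k in range(STEP, N + 1, STEP)):
        peeking_fp += 1
    # Fixed horizon: test exactly once, at the planned sample size.
    if p_value(sum(a), N, sum(b), N) < 0.05:
        fixed_fp += 1
print(f"false positives with peeking: {peeking_fp / SIMS:.1%}, "
      f"fixed horizon: {fixed_fp / SIMS:.1%}")
```

With ten interim looks, the peeking arm's false positive rate comes out several times the nominal 5%, which is exactly why sequential methods exist for continuous monitoring.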

Related Calculators