A/B Test Calculator
Determine whether your A/B test results are statistically significant. Enter the visitor and conversion counts for your control and variant groups to get an instant statistical analysis including p-value, z-score, uplift, and statistical power.
Z = (p₂ - p₁) / √(p̂ × (1 - p̂) × (1/n₁ + 1/n₂))
Control Group (A)
Variant Group (B)
Frequently Asked Questions
What is A/B testing?
A/B testing (also called split testing) is a controlled experiment that compares two versions of a piece of content (such as a webpage, email, or ad) to determine which performs better. Version A is the control (the original) and version B is the variant (the modified version). Users are randomly assigned to the groups and their behavior (conversions, clicks, etc.) is measured.
What is statistical significance in A/B testing?
Statistical significance means the difference between the control and the variant is unlikely to have arisen by chance. Typically, a result is considered significant at 95% confidence, meaning there is less than a 5% probability that the observed difference occurred randomly. The p-value quantifies this probability.
How is the p-value of an A/B test calculated?
The p-value is computed with a two-proportion z-test. First calculate the z-score: Z = (p₂ - p₁) / √(p̂ × (1 - p̂) × (1/n₁ + 1/n₂)), where p̂ is the pooled proportion. Then convert the z-score to a two-tailed p-value using the standard normal distribution.
What confidence level should I use?
95% confidence is the industry standard for most A/B tests. Use 90% for directional decisions or fast-moving experiments where speed matters more than certainty. Use 99% for high-stakes decisions (pricing changes, major redesigns), where a false positive is very costly.
What is statistical power?
Statistical power is the probability of detecting a real effect when one exists. 80% power means that if a true difference between the variants exists, you have an 80% chance of detecting it. Low power means you are likely to miss real improvements (false negatives). Most experiments should aim for at least 80% power.
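A common way to approximate power for a two-proportion test is the normal approximation sketched below. This is an illustration only, not necessarily this calculator's exact implementation, and the function name `power` is ours:

```python
from statistics import NormalDist

def power(n1, p1, n2, p2, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test,
    assuming true conversion rates p1 (control) and p2 (variant)."""
    # Standard error of the difference under the assumed true rates
    se = (p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2) ** 0.5
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # e.g. 1.96 for alpha = 0.05
    z_effect = abs(p2 - p1) / se
    return NormalDist().cdf(z_effect - z_crit)

# A 3.0% vs 3.8% test with 10,000 visitors per arm is well powered (≈ 0.88)
print(power(10_000, 0.03, 10_000, 0.038))
```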
How long should an A/B test run?
Run the test until you reach the required sample size (use a sample size calculator to determine it). Don't stop a test early just because the results look significant; doing so inflates the false positive rate. Also run the test for at least 1-2 full business cycles (typically 1-2 weeks) to account for day-of-week effects.
What does the uplift percentage mean?
Uplift (or lift) is the relative improvement of the variant over the control: Uplift = (variant rate - control rate) / control rate × 100. For example, if the control converts at 5% and the variant at 6%, the uplift is 20%, meaning the variant performs 20% better than the control.
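The same arithmetic as a trivial helper (the name `uplift` is illustrative):

```python
def uplift(control_rate, variant_rate):
    """Relative improvement of the variant over the control, in percent."""
    return (variant_rate - control_rate) / control_rate * 100

print(round(uplift(0.05, 0.06), 2))  # 20.0 — matches the example above
```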
Can I trust A/B test results from small sample sizes?
Small samples produce unreliable results with wide confidence intervals. Even when a small sample yields a "significant" p-value, the observed effect size is often exaggerated. Make sure your sample is large enough before drawing conclusions, and use an A/B test sample size calculator to plan the experiment.
What is A/B Testing?
A/B testing (also known as split testing) is a method of comparing two versions of a webpage, email, ad, or any other content to determine which one performs better. Users are randomly divided into two groups: the control group (A) sees the original version, and the variant group (B) sees the modified version.
The key question A/B testing answers is: "Is the difference in performance between A and B real, or could it have happened by random chance?" This is where statistical significance comes in. Our calculator uses a two-proportion z-test to determine whether the observed difference is statistically significant.
A/B testing is fundamental to data-driven decision making in marketing, product development, UX design, and growth engineering. Companies like Google, Amazon, Netflix, and Booking.com run thousands of A/B tests annually to optimize their products.
Statistical Formula & How It Works
This calculator uses the two-proportion z-test to compare conversion rates between two independent groups:
Step 1: Calculate Conversion Rates
p₁ = Conversions₁ / Visitors₁
p₂ = Conversions₂ / Visitors₂
Step 2: Calculate Pooled Proportion
p̂ = (C₁ + C₂) / (n₁ + n₂)
Step 3: Calculate Z-Score
Z = (p₂ - p₁) / √(p̂ × (1 - p̂) × (1/n₁ + 1/n₂))
Step 4: Convert to P-Value (two-tailed)
p-value = 2 × (1 - Φ(|Z|))
If the p-value is less than alpha (where alpha = 1 - confidence level), the result is statistically significant. For 95% confidence, alpha = 0.05, so a p-value below 0.05 indicates significance.
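The four steps above fit in a few lines of Python. This is a minimal sketch using the standard library's `statistics.NormalDist`; the function name `ab_test` is ours, not part of the calculator:

```python
from statistics import NormalDist

def ab_test(visitors_a, conversions_a, visitors_b, conversions_b):
    """Two-proportion z-test; returns (z-score, two-tailed p-value)."""
    p1 = conversions_a / visitors_a                      # Step 1: control rate
    p2 = conversions_b / visitors_b                      #         variant rate
    pooled = (conversions_a + conversions_b) / (visitors_a + visitors_b)  # Step 2
    se = (pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b)) ** 0.5
    z = (p2 - p1) / se                                   # Step 3
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))         # Step 4: two-tailed
    return z, p_value

z, p = ab_test(10_000, 300, 10_000, 380)
print(f"z = {z:.2f}, p = {p:.4f}")  # significant at 95% confidence (p < 0.05)
```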
A/B Test Calculation Examples
Example 1: Significant Result
An e-commerce site tests a new checkout page. Control: 10,000 visitors, 300 purchases. Variant: 10,000 visitors, 380 purchases.
Control rate: 300/10,000 = 3.00%
Variant rate: 380/10,000 = 3.80%
Uplift: (3.80 - 3.00) / 3.00 = +26.67%
Pooled: 680/20,000 = 3.40%
SE = √(0.034 × 0.966 × 0.0002) = 0.00256
Z = 0.008 / 0.00256 = 3.125
P-value = 0.0018
Result: Statistically significant at 95% confidence
Example 2: Not Significant
A SaaS company tests a new pricing page. Control: 500 visitors, 25 signups. Variant: 500 visitors, 30 signups.
Control rate: 25/500 = 5.00%
Variant rate: 30/500 = 6.00%
Uplift: +20.00%
Z = 0.693
P-value = 0.488
Result: Not significant — need more data
Example 3: Variant Performs Worse
An email marketing test. Control: 5,000 recipients, 250 clicks. Variant: 5,000 recipients, 200 clicks.
Control rate: 250/5,000 = 5.00%
Variant rate: 200/5,000 = 4.00%
Uplift: -20.00%
Z = -2.412
P-value = 0.016
Result: Statistically significant — variant is worse
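As a sanity check, all three examples can be recomputed with a short script (the helper name `z_and_p` is illustrative; standard library only):

```python
from statistics import NormalDist

def z_and_p(n1, c1, n2, c2):
    """Pooled two-proportion z-test: returns (z-score, two-tailed p-value)."""
    p1, p2 = c1 / n1, c2 / n2
    pooled = (c1 + c2) / (n1 + n2)
    se = (pooled * (1 - pooled) * (1 / n1 + 1 / n2)) ** 0.5
    z = (p2 - p1) / se
    return z, 2 * (1 - NormalDist().cdf(abs(z)))

examples = {
    "Example 1 (checkout)": (10_000, 300, 10_000, 380),
    "Example 2 (pricing)":  (500, 25, 500, 30),
    "Example 3 (email)":    (5_000, 250, 5_000, 200),
}
for name, args in examples.items():
    z, p = z_and_p(*args)
    print(f"{name}: z = {z:+.3f}, p = {p:.4f}")
```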
Choosing Your Significance Level
| Confidence | Alpha (α) | Z Critical | Best For |
|---|---|---|---|
| 90% | 0.10 | 1.645 | Quick iterations, low-risk changes |
| 95% | 0.05 | 1.960 | Industry standard, most A/B tests |
| 99% | 0.01 | 2.576 | High-stakes decisions, pricing changes |
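The critical values in the table come from the inverse normal CDF. A one-liner sketch with Python's `statistics.NormalDist` (the name `z_critical` is ours):

```python
from statistics import NormalDist

def z_critical(confidence):
    """Two-tailed critical z-score for a given confidence level."""
    alpha = 1 - confidence
    return NormalDist().inv_cdf(1 - alpha / 2)

for conf in (0.90, 0.95, 0.99):
    print(f"{conf:.0%}: alpha = {1 - conf:.2f}, z critical = {z_critical(conf):.3f}")
```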
Common A/B Testing Mistakes
- Stopping tests too early — Checking results before reaching the required sample size inflates false positive rates. Commit to your sample size before starting.
- Testing too many variations — Each additional variant requires a larger sample size and increases the chance of false positives (multiple comparisons problem).
- Ignoring statistical power — Low-powered tests frequently miss real effects. Aim for at least 80% power when planning your test.
- Not running full business cycles — User behavior varies by day of week, time of day, and season. Run tests for at least 1-2 full weeks.
- Testing tiny changes on small samples — Small effects need large samples to detect. Use the Sample Size Calculator to plan ahead.
- Cherry-picking metrics — Decide which metric to track before running the test. Looking at multiple metrics after the fact increases false discoveries.
When to Use A/B Testing
- Landing page optimization — Headlines, CTAs, images, form fields, layout
- Email marketing — Subject lines, send times, content, personalization
- Pricing pages — Pricing tiers, feature display, social proof
- Ad campaigns — Ad copy, creatives, targeting, bidding strategies
- Product features — Onboarding flows, UI changes, feature placement
- Checkout flows — Form design, payment options, trust signals
A/B Testing Best Practices
- Define your hypothesis before testing — Write down what you expect to happen and why. This prevents post-hoc rationalization.
- Calculate sample size upfront — Use our A/B Test Sample Size Calculator to determine how many visitors you need before starting.
- Test one variable at a time — Changing multiple elements makes it impossible to know which change caused the effect.
- Ensure random assignment — Users should be randomly assigned to control or variant with equal probability.
- Run the full duration — Don't stop early. Don't extend the test just because results aren't significant.
- Consider practical significance — A statistically significant 0.1% improvement may not be worth the development cost. Consider the business impact.
- Document everything — Record your hypothesis, sample size calculation, test duration, and results for institutional learning.