A/B Test Calculator
Determine if your A/B test results are statistically significant. Enter visitors and conversions for both control and variant groups to get instant statistical analysis including p-value, z-score, uplift, and power.
Z = (p₂ - p₁) / √(p̂ × (1 - p̂) × (1/n₁ + 1/n₂))
Control Group (A)
Variant Group (B)
Frequently Asked Questions
What is an A/B test?
An A/B test (also called a split test) is a controlled experiment where you compare two versions of something (e.g., a webpage, email, or ad) to determine which performs better. Version A is the control (original), and Version B is the variant (modified). Users are randomly assigned to each group, and their behavior (conversions, clicks, etc.) is measured.
What is statistical significance in A/B testing?
Statistical significance means the difference between your control and variant is unlikely to be due to random chance. Typically, a result is considered significant at 95% confidence, meaning there's less than a 5% probability the observed difference happened by chance. The p-value quantifies this probability.
How do you calculate the p-value for an A/B test?
The p-value is calculated using a two-proportion z-test. First, compute the z-score: Z = (p₂ - p₁) / √(p̂ × (1 - p̂) × (1/n₁ + 1/n₂)), where p̂ is the pooled proportion. Then convert the z-score to a two-tailed p-value using the standard normal distribution.
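The z-to-p conversion described above can be sketched with Python's standard library (no third-party packages; the function name `two_tailed_p` is illustrative):

```python
from statistics import NormalDist

def two_tailed_p(z):
    """Convert a z-score to a two-tailed p-value: p = 2 * (1 - Phi(|z|))."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

print(round(two_tailed_p(1.96), 4))  # the familiar 5% boundary: 0.05
```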
What confidence level should I use?
95% confidence is the industry standard for most A/B tests. Use 90% for directional decisions or fast-paced experiments where speed matters more than certainty. Use 99% for high-stakes decisions (pricing changes, major redesigns) where a false positive would be very costly.
What is statistical power?
Statistical power is the probability of detecting a true effect when one exists. A power of 80% means if there really is a difference between your variations, you have an 80% chance of detecting it. Low power means you might miss real improvements (false negatives). Most experiments should target at least 80% power.
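As a rough sketch, power for a two-sided, two-proportion z-test can be approximated with the normal distribution. The function name and the pooled-vs-unpooled standard-error choice below are illustrative assumptions, not a prescribed method:

```python
from statistics import NormalDist

def ab_test_power(p1, p2, n1, n2, alpha=0.05):
    """Approximate power of a two-sided, two-proportion z-test (normal approximation)."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)
    # SE under the null hypothesis uses the pooled rate; SE under the alternative does not
    p_pool = (p1 * n1 + p2 * n2) / (n1 + n2)
    se0 = (p_pool * (1 - p_pool) * (1 / n1 + 1 / n2)) ** 0.5
    se1 = (p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2) ** 0.5
    return nd.cdf((abs(p2 - p1) - z_crit * se0) / se1)

# e.g. 3% vs 3.8% conversion with 10,000 visitors per arm
print(f"{ab_test_power(0.03, 0.038, 10_000, 10_000):.0%}")
```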
How long should I run an A/B test?
Run your test until you reach the required sample size (use our Sample Size Calculator to determine this). Never stop a test early just because it looks significant — this inflates false positive rates. Also run for at least 1-2 full business cycles (typically 1-2 weeks) to account for day-of-week effects.
What does the uplift percentage mean?
Uplift (or lift) is the relative improvement of the variant over the control. It's calculated as: Uplift = (Variant Rate - Control Rate) / Control Rate × 100. For example, if control converts at 5% and variant at 6%, the uplift is 20% — meaning the variant performs 20% better than the control.
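That arithmetic is a one-liner; a minimal sketch:

```python
def uplift_pct(control_rate, variant_rate):
    """Relative lift of the variant over the control, in percent."""
    return (variant_rate - control_rate) / control_rate * 100

print(round(uplift_pct(0.05, 0.06), 2))  # → 20.0
```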
Can I trust my A/B test results with a small sample size?
Small sample sizes lead to unreliable results with wide confidence intervals. Even if you see a 'significant' p-value with small samples, the observed effect size is likely exaggerated. Aim for adequate sample sizes before drawing conclusions. Use our A/B Test Sample Size Calculator to plan your experiment.
What is A/B Testing?
A/B testing (also known as split testing) is a method of comparing two versions of a webpage, email, ad, or any other content to determine which one performs better. Users are randomly divided into two groups: the control group (A) sees the original version, and the variant group (B) sees the modified version.
The key question A/B testing answers is: "Is the difference in performance between A and B real, or could it have happened by random chance?" This is where statistical significance comes in. Our calculator uses a two-proportion z-test to determine whether the observed difference is statistically significant.
A/B testing is fundamental to data-driven decision making in marketing, product development, UX design, and growth engineering. Companies like Google, Amazon, Netflix, and Booking.com run thousands of A/B tests annually to optimize their products.
Statistical Formula & How It Works
This calculator uses the two-proportion z-test to compare conversion rates between two independent groups:
Step 1: Calculate Conversion Rates
p₁ = Conversions₁ / Visitors₁
p₂ = Conversions₂ / Visitors₂
Step 2: Calculate Pooled Proportion
p̂ = (C₁ + C₂) / (n₁ + n₂)
Step 3: Calculate Z-Score
Z = (p₂ - p₁) / √(p̂ × (1 - p̂) × (1/n₁ + 1/n₂))
Step 4: Convert to P-Value (two-tailed)
p-value = 2 × (1 - Φ(|Z|))
If the p-value is less than alpha (where alpha = 1 - confidence level), the result is statistically significant. For 95% confidence, alpha = 0.05, so a p-value below 0.05 indicates significance.
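The four steps above can be sketched end to end in Python using only the standard library (the function and variable names are illustrative):

```python
from statistics import NormalDist

def ab_test(conversions_a, visitors_a, conversions_b, visitors_b, confidence=0.95):
    """Two-proportion z-test; returns (z, p_value, significant)."""
    p1 = conversions_a / visitors_a                          # Step 1: conversion rates
    p2 = conversions_b / visitors_b
    p_hat = (conversions_a + conversions_b) / (visitors_a + visitors_b)  # Step 2: pooled
    se = (p_hat * (1 - p_hat) * (1 / visitors_a + 1 / visitors_b)) ** 0.5
    z = (p2 - p1) / se                                       # Step 3: z-score
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))             # Step 4: two-tailed p-value
    return z, p_value, p_value < 1 - confidence

z, p, sig = ab_test(300, 10_000, 380, 10_000)
print(f"z = {z:.3f}, p = {p:.4f}, significant = {sig}")
```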
A/B Test Calculation Examples
Example 1: Significant Result
An e-commerce site tests a new checkout page. Control: 10,000 visitors, 300 purchases. Variant: 10,000 visitors, 380 purchases.
Control rate: 300/10,000 = 3.00%
Variant rate: 380/10,000 = 3.80%
Uplift: (3.80 - 3.00) / 3.00 = +26.67%
Pooled: 680/20,000 = 3.40%
SE = √(0.034 × 0.966 × 0.0002) = 0.00256
Z = 0.008 / 0.00256 = 3.125
P-value = 0.0018
Result: Statistically significant at 95% confidence
Example 2: Not Significant
A SaaS company tests a new pricing page. Control: 500 visitors, 25 signups. Variant: 500 visitors, 30 signups.
Control rate: 25/500 = 5.00%
Variant rate: 30/500 = 6.00%
Uplift: +20.00%
Z = 0.693
P-value = 0.488
Result: Not significant — need more data
Example 3: Variant Performs Worse
An email marketing test. Control: 5,000 recipients, 250 clicks. Variant: 5,000 recipients, 200 clicks.
Control rate: 250/5,000 = 5.00%
Variant rate: 200/5,000 = 4.00%
Uplift: -20.00%
Z = -2.412
P-value = 0.0159
Result: Statistically significant — variant is worse
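All three scenarios can be recomputed with a short script (standard library only; the labels are just for display):

```python
from statistics import NormalDist

def z_and_p(c1, n1, c2, n2):
    """Two-proportion z-test: z-score and two-tailed p-value."""
    p1, p2 = c1 / n1, c2 / n2
    p_hat = (c1 + c2) / (n1 + n2)
    se = (p_hat * (1 - p_hat) * (1 / n1 + 1 / n2)) ** 0.5
    z = (p2 - p1) / se
    return z, 2 * (1 - NormalDist().cdf(abs(z)))

for label, (c1, n1, c2, n2) in [("checkout", (300, 10_000, 380, 10_000)),
                                ("pricing", (25, 500, 30, 500)),
                                ("email", (250, 5_000, 200, 5_000))]:
    z, p = z_and_p(c1, n1, c2, n2)
    print(f"{label}: z = {z:+.3f}, p = {p:.4f}")
```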
Choosing Your Significance Level
| Confidence | Alpha (α) | Z Critical | Best For |
|---|---|---|---|
| 90% | 0.10 | 1.645 | Quick iterations, low-risk changes |
| 95% | 0.05 | 1.960 | Industry standard, most A/B tests |
| 99% | 0.01 | 2.576 | High-stakes decisions, pricing changes |
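The critical z values in the table come straight from the inverse of the standard normal CDF; a quick check in Python:

```python
from statistics import NormalDist

def z_critical(confidence):
    """Two-sided critical z-score for a given confidence level."""
    alpha = 1 - confidence
    return NormalDist().inv_cdf(1 - alpha / 2)

for conf in (0.90, 0.95, 0.99):
    print(f"{conf:.0%} -> {z_critical(conf):.3f}")
```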
Common A/B Testing Mistakes
- Stopping tests too early — Checking results before reaching the required sample size inflates false positive rates. Commit to your sample size before starting.
- Testing too many variations — Each additional variant requires a larger sample size and increases the chance of false positives (multiple comparisons problem).
- Ignoring statistical power — Low-powered tests frequently miss real effects. Aim for at least 80% power when planning your test.
- Not running full business cycles — User behavior varies by day of week, time of day, and season. Run tests for at least 1-2 full weeks.
- Testing tiny changes on small samples — Small effects need large samples to detect. Use the Sample Size Calculator to plan ahead.
- Cherry-picking metrics — Decide which metric to track before running the test. Looking at multiple metrics after the fact increases false discoveries.
When to Use A/B Testing
- Landing page optimization — Headlines, CTAs, images, form fields, layout
- Email marketing — Subject lines, send times, content, personalization
- Pricing pages — Pricing tiers, feature display, social proof
- Ad campaigns — Ad copy, creatives, targeting, bidding strategies
- Product features — Onboarding flows, UI changes, feature placement
- Checkout flows — Form design, payment options, trust signals
A/B Testing Best Practices
- Define your hypothesis before testing — Write down what you expect to happen and why. This prevents post-hoc rationalization.
- Calculate sample size upfront — Use our A/B Test Sample Size Calculator to determine how many visitors you need before starting.
- Test one variable at a time — Changing multiple elements makes it impossible to know which change caused the effect.
- Ensure random assignment — Users should be randomly assigned to control or variant with equal probability.
- Run the full duration — Don't stop early. Don't extend the test just because results aren't significant.
- Consider practical significance — A statistically significant 0.1% improvement may not be worth the development cost. Consider the business impact.
- Document everything — Record your hypothesis, sample size calculation, test duration, and results for institutional learning.