P-Value Calculator
The p-value measures the probability of obtaining a result at least as extreme as the observed data, assuming the null hypothesis is true. Enter a z-score or sample statistics (sample mean, population mean, standard deviation, sample size) to compute the p-value instantly and assess statistical significance.
p-value = 2 × (1 − Φ(|Z|))
Frequently Asked Questions
What is a p-value?
A p-value is the probability of observing the current result, or one more extreme, assuming the null hypothesis is true. The smaller the p-value, the less likely the observed data would be under the null hypothesis, providing stronger evidence for rejecting it. A p-value is not "the probability that the result is true", nor a measure of a study's importance.
What does a p-value below 0.05 mean?
p < 0.05 is the most common threshold for statistical significance (α = 0.05): if the null hypothesis were true, a result this extreme would occur less than 5% of the time. By convention, the null hypothesis is rejected when p < 0.05 and the result is called "statistically significant". But this is an arbitrary cutoff; it says nothing about the actual effect size or the practical importance of the result.
What are common misinterpretations of p-values?
Common misconceptions: (1) the p-value is the probability that the null hypothesis is true (wrong: it is the probability of the data assuming the null hypothesis is true); (2) p < 0.05 proves a real effect exists (there is still a 5% false-positive rate); (3) smaller p-values mean larger effects (the p-value reflects only significance, not effect size); (4) p > 0.05 means the null hypothesis is correct (it only means the evidence is insufficient to reject it, not that the null is proven).
How should different p-value levels be interpreted?
Conventional interpretation: p < 0.001: very strong statistical significance (***); p < 0.01: strong statistical significance (**); p < 0.05: statistical significance (*); 0.05 ≤ p < 0.10: marginal significance (accepted in some fields); p ≥ 0.10: not statistically significant. Medical clinical trials typically require p < 0.05 or stricter, while particle physics requires p < 3×10⁻⁷ (the 5σ standard).
How does statistical power relate to p-values?
Statistical power = 1 − β, the probability of correctly detecting a true effect when one exists (i.e., the ability to avoid a false negative). Power and p-values are linked: the higher the power, the more likely a significant p-value is when a real effect exists, and larger samples give higher power. Most guidelines recommend power ≥ 80%. With insufficient power, even a real effect may yield p > 0.05 simply because the sample is too small.
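As a rough illustration of this relationship, here is a minimal stdlib-Python sketch of the power of a two-sided one-sample z-test; the function name and defaults are illustrative, not part of the calculator:

```python
from statistics import NormalDist

def z_test_power(delta, sigma, n, alpha=0.05):
    """Approximate power of a two-sided one-sample z-test.
    delta: true mean shift; sigma: population std dev; n: sample size."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)     # e.g. ~1.96 for alpha = 0.05
    shift = abs(delta) * n ** 0.5 / sigma  # true shift in standard-error units
    # Under the alternative, Z ~ N(shift, 1); power is the probability
    # the statistic lands outside the rejection bounds ±z_crit
    return nd.cdf(shift - z_crit) + nd.cdf(-shift - z_crit)
```

With a 5-point shift, σ = 15, and n = 36, the power is only about 52%, which shows why a real effect of that size can easily produce p > 0.05 at this sample size.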
How do multiple comparisons affect p-values?
The multiple testing problem: when many hypothesis tests are run simultaneously, the chance of a spurious significant p-value accumulates. With 20 independent tests at α = 0.05, the expected number of false positives is 1 (0.05 × 20), and the probability of at least one is 1 − 0.95²⁰ ≈ 64%. Correction methods include the Bonferroni correction (significance threshold = 0.05 / number of tests), which is conservative, and Benjamini-Hochberg FDR control; choose the method appropriate to the research goal.
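The two corrections mentioned above can be sketched in a few lines of Python (the function names are illustrative):

```python
def bonferroni(pvals, alpha=0.05):
    """Reject H0_i when p_i < alpha / m (controls family-wise error rate)."""
    m = len(pvals)
    return [p < alpha / m for p in pvals]

def benjamini_hochberg(pvals, alpha=0.05):
    """BH step-up procedure controlling the false discovery rate:
    find the largest rank k with p_(k) <= k * alpha / m, then reject
    the k smallest p-values."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank * alpha / m:
            k = rank
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= k:
            reject[i] = True
    return reject
```

On the same set of p-values, Bonferroni typically rejects fewer hypotheses than BH, which is exactly the conservatism trade-off described above.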
How do Bayesian methods differ from frequentist p-values?
The p-value is a product of frequentist statistics and answers: "Assuming H₀ is true, how probable is data like this?" Bayesian methods instead compute a posterior probability and answer: "Having seen the data, how probable is it that H₀ (or H₁) is true?" The Bayes factor is the Bayesian alternative to the p-value as an evidence measure, with a more direct interpretation. The modern trend in statistics is to combine effect sizes, confidence intervals, and Bayesian methods rather than over-relying on p-values.
How is a p-value calculated?
P-value calculation depends on a test statistic and its theoretical distribution: t-test: compute the t statistic and look up the t distribution (tables or software); chi-square test: compute the χ² statistic against the χ² distribution; F-test (ANOVA): compute the F statistic against the F distribution; nonparametric tests (Mann-Whitney, Kruskal-Wallis): computed from ranks. With this calculator, enter the test statistic and degrees of freedom to obtain the corresponding p-value automatically.
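Assuming SciPy is available, the table lookups above reduce to one-line survival-function calls; the helper names here are illustrative:

```python
from scipy import stats  # assumed available; any library with survival functions works

def p_from_t(t_stat, df):
    """Two-tailed p-value from a t statistic with df degrees of freedom."""
    return 2 * stats.t.sf(abs(t_stat), df)

def p_from_chi2(chi2_stat, df):
    """Upper-tail p-value from a chi-square statistic."""
    return stats.chi2.sf(chi2_stat, df)

def p_from_f(f_stat, df1, df2):
    """Upper-tail p-value from an F statistic (ANOVA)."""
    return stats.f.sf(f_stat, df1, df2)
```

The survival function `sf(x)` is `1 − cdf(x)`, i.e. the upper-tail area, which is exactly what a right-tailed p-value is.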
What is a P-Value?
A p-value (probability value) is the probability of obtaining test results at least as extreme as the observed results, assuming that the null hypothesis is true. In other words, the p-value answers: "If there were no real effect, how likely is it to see data this extreme by random chance?"
A small p-value (typically < 0.05) suggests the observed data is unlikely under the null hypothesis, providing evidence to reject it. A large p-value means the data is consistent with the null hypothesis, so you fail to reject it.
P-values are fundamental in hypothesis testing across statistics, medicine, psychology, economics, and data science. They do not measure the probability that the null hypothesis is true — they measure the probability of the observed data given the null hypothesis.
P-Value Formula from Z-Score
This calculator computes p-values using the standard normal distribution (Z-test). The z-score measures how many standard errors the sample mean is from the population mean:
Z-Score from Sample Statistics
Z = (x̄ − μ₀) / (σ / √n)
x̄ = sample mean | μ₀ = population mean | σ = std dev | n = sample size
Two-Tailed P-Value
p = 2 × (1 − Φ(|Z|))
Left-Tailed P-Value
p = Φ(Z)
Right-Tailed P-Value
p = 1 − Φ(Z)
Where Φ(Z) is the cumulative distribution function (CDF) of the standard normal distribution. The calculator uses the Abramowitz & Stegun approximation (formula 7.1.26) for fast, accurate CDF evaluation.
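For reference, a minimal Python implementation of Φ(Z) via the Abramowitz & Stegun 7.1.26 erf approximation (maximum absolute error about 1.5×10⁻⁷) might look like this; a sketch, not the calculator's actual source:

```python
import math

def phi(z):
    """Standard normal CDF via the A&S 7.1.26 erf approximation:
    Phi(z) = 0.5 * (1 + erf(z / sqrt(2)))."""
    x = abs(z) / math.sqrt(2)
    t = 1.0 / (1.0 + 0.3275911 * x)
    # Degree-5 polynomial in t, evaluated in Horner form
    poly = t * (0.254829592 + t * (-0.284496736 + t * (1.421413741
           + t * (-1.453152027 + t * 1.061405429))))
    erf = 1.0 - poly * math.exp(-x * x)
    cdf = 0.5 * (1.0 + erf)
    return cdf if z >= 0 else 1.0 - cdf  # symmetry for negative z

def two_tailed_p(z):
    """p = 2 * (1 - Phi(|Z|)), as in the formula above."""
    return 2.0 * (1.0 - phi(abs(z)))
```

The polynomial only covers non-negative arguments, so negative z-scores are handled through the symmetry Φ(−z) = 1 − Φ(z).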
Common Significance Levels (α)
The significance level (alpha, α) is the threshold below which you reject the null hypothesis. Choosing α before collecting data is essential to avoid p-hacking.
| Alpha (α) | Confidence | Z Critical (Two-Tailed) | Typical Use |
|---|---|---|---|
| 0.10 | 90% | ±1.645 | Exploratory research, low-stakes decisions |
| 0.05 | 95% | ±1.960 | Industry standard, most hypothesis tests |
| 0.01 | 99% | ±2.576 | Medical trials, high-stakes research |
| 0.001 | 99.9% | ±3.291 | Physics (particle discovery), genome-wide studies |
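The critical values in the table can be reproduced with the standard normal inverse CDF; a stdlib-Python sketch (the helper name is illustrative):

```python
from statistics import NormalDist

def z_critical_two_tailed(alpha):
    """Two-tailed critical z: the cutoff leaving alpha/2 in each tail."""
    return NormalDist().inv_cdf(1 - alpha / 2)
```

For example, `z_critical_two_tailed(0.05)` recovers the familiar ±1.960 bound from the table.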
A result is statistically significant when p < α. At α = 0.05, you accept a 5% risk of a false positive (Type I error) — incorrectly rejecting a true null hypothesis.
One-Tailed vs Two-Tailed Tests
The choice between one-tailed and two-tailed tests depends on your hypothesis before seeing the data.
Two-Tailed Test
H₀: μ = μ₀ | H₁: μ ≠ μ₀
Use when you are testing whether the mean is different from the population mean in either direction. This is the most common choice and the most conservative.
Left-Tailed Test
H₀: μ ≥ μ₀ | H₁: μ < μ₀
Use when you specifically hypothesize the mean is less than the population value. The rejection region is in the left tail.
Right-Tailed Test
H₀: μ ≤ μ₀ | H₁: μ > μ₀
Use when you specifically hypothesize the mean is greater than the population value. The rejection region is in the right tail.
A two-tailed p-value is exactly double the one-tailed p-value computed in the direction of the observed z-score. Choosing a one-tailed test after seeing results that go in the predicted direction is p-hacking and inflates false-positive rates.
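The relationship between the three tails is easy to verify numerically; a stdlib-Python sketch (the helper name is illustrative):

```python
from statistics import NormalDist

def p_values(z):
    """Left-, right-, and two-tailed p-values for the same z-score."""
    cdf = NormalDist().cdf
    left = cdf(z)                  # p = Phi(Z)
    right = 1 - cdf(z)             # p = 1 - Phi(Z)
    two = 2 * (1 - cdf(abs(z)))    # p = 2 * (1 - Phi(|Z|))
    return left, right, two
```

For a positive z-score the two-tailed p-value is twice the right-tailed one; for a negative z-score it is twice the left-tailed one.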
P-Value Calculation Examples
Example 1: Two-Tailed Z-Test (Significant)
A researcher tests whether a new drug changes blood pressure. They observe z = 2.50 with α = 0.05 (two-tailed).
Z = 2.50
p = 2 × (1 − Φ(2.50)) = 2 × 0.00621 = 0.0124
p (0.0124) < α (0.05)
Result: Statistically significant — reject H₀
Example 2: From Sample Statistics
A quality test: sample mean = 105, population mean = 100, σ = 15, n = 36 (two-tailed, α = 0.05).
SE = 15 / √36 = 2.5
Z = (105 − 100) / 2.5 = 2.0
p = 2 × (1 − Φ(2.0)) ≈ 0.0455
p (0.0455) < α (0.05)
Result: Statistically significant — the sample differs from the population mean
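The arithmetic in Example 2 can be checked with a few lines of stdlib Python:

```python
from math import sqrt
from statistics import NormalDist

# Example 2 from the text: sample mean 105, population mean 100, sigma 15, n 36
se = 15 / sqrt(36)                       # standard error = 2.5
z = (105 - 100) / se                     # z = 2.0
p = 2 * (1 - NormalDist().cdf(abs(z)))   # two-tailed p ~ 0.0455
```

Since 0.0455 < 0.05, the test agrees with the conclusion above: reject H₀ at α = 0.05.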
Example 3: Right-Tailed (Not Significant)
Testing if a new teaching method improves scores. Z = 1.20, α = 0.05 (right-tailed).
Z = 1.20
p = 1 − Φ(1.20) ≈ 0.1151
p (0.1151) ≥ α (0.05)
Result: Not significant — fail to reject H₀
Common P-Value Mistakes
- Misinterpreting the p-value — A p-value is NOT the probability that the null hypothesis is true. It is the probability of the observed data assuming the null is true.
- P-hacking — Running multiple tests and only reporting the significant ones inflates the false-positive rate. Pre-register your hypothesis and correction method.
- Switching from two-tailed to one-tailed after seeing results halves the p-value and is a form of p-hacking.
- Confusing statistical with practical significance — A tiny p-value with a large sample can indicate a trivially small effect. Always check effect size.
- Ignoring assumptions — Z-tests assume the population standard deviation is known and data is approximately normal. Use a t-test for small samples with unknown σ.
- Treating p = 0.049 and p = 0.051 as categorically different — The 0.05 threshold is a convention, not a hard rule. Report actual p-values and confidence intervals.