bundles / scipy 1.17.1 / scipy / stats / _mstats_basic / pearsonr
function
scipy.stats._mstats_basic:pearsonr
Signature
def pearsonr(x, y)
Summary
Pearson correlation coefficient and p-value for testing non-correlation.
Extended Summary
The Pearson correlation coefficient [1] measures the linear relationship between two datasets. The calculation of the p-value relies on the assumption that each dataset is normally distributed. (See Kowalski [3] for a discussion of the effects of non-normality of the input on the distribution of the correlation coefficient.) Like other correlation coefficients, this one varies between -1 and +1 with 0 implying no correlation. Correlations of -1 or +1 imply an exact linear relationship.
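As a quick illustration of the endpoints of that range, the sketch below feeds exactly linear data to `mstats.pearsonr`. The slopes and intercepts are arbitrary values chosen for the example, not anything prescribed by the documentation.

```python
import numpy as np
from scipy.stats import mstats

x = np.arange(10, dtype=float)

# An increasing exact linear relationship should give r close to +1 ...
r_up, _ = mstats.pearsonr(x, 2.0 * x + 1.0)

# ... and a decreasing one should give r close to -1.
r_down, _ = mstats.pearsonr(x, -3.0 * x + 4.0)

print(r_up, r_down)
```

Only the sign of the slope matters here; rescaling or shifting either input leaves r unchanged up to floating-point rounding.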
Parameters
x: (N,) array_like
    Input array.
y: (N,) array_like
    Input array.
Returns
r: float
    Pearson's correlation coefficient.
p-value: float
    Two-tailed p-value.
Warns
scipy.stats.ConstantInputWarning
    Raised if an input is a constant array. The correlation coefficient is not defined in this case, so np.nan is returned.
scipy.stats.NearConstantInputWarning
    Raised if an input is "nearly" constant. The array x is considered nearly constant if norm(x - mean(x)) < 1e-13 * abs(mean(x)). Numerical errors in the calculation x - mean(x) in this case might result in an inaccurate calculation of r.
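The "nearly constant" criterion quoted above can be checked by hand before calling the function. The following is a minimal sketch that evaluates the stated inequality with NumPy; the helper name `is_nearly_constant` is hypothetical and is not part of SciPy, and SciPy's internal check may differ in detail.

```python
import numpy as np

def is_nearly_constant(x, rtol=1e-13):
    """Evaluate norm(x - mean(x)) < rtol * abs(mean(x)) (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    xm = x.mean()
    return bool(np.linalg.norm(x - xm) < rtol * abs(xm))

# Values near 1e6 that differ only at the 1e-9 level: the spread is tiny
# relative to the mean, so the criterion is satisfied.
tight = np.array([1e6, 1e6 + 1e-9, 1e6 + 2e-9])

# Ordinary data with a spread comparable to its mean.
spread = np.array([1.0, 2.0, 3.0])

print(is_nearly_constant(tight), is_nearly_constant(spread))
```

An exactly constant array also satisfies this inequality whenever its mean is nonzero, because the left-hand side is then zero; that case is covered by the ConstantInputWarning described above.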
Notes
The correlation coefficient is calculated as follows:

    r = \frac{\sum (x - m_x)(y - m_y)}{\sqrt{\sum (x - m_x)^2 \sum (y - m_y)^2}}

where m_x is the mean of the vector x and m_y is the mean of the vector y.
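To make the calculation concrete, the sketch below computes r as the sum of products of the deviations from the means, divided by the square root of the product of the summed squared deviations, and compares it with the value returned by `mstats.pearsonr`. The helper `pearson_r_by_hand` is a hypothetical name used only for this illustration.

```python
import numpy as np
from scipy.stats import mstats

def pearson_r_by_hand(x, y):
    """Evaluate the defining ratio for r directly (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations of x from its mean
    dy = y - y.mean()  # deviations of y from its mean
    return np.sum(dx * dy) / np.sqrt(np.sum(dx**2) * np.sum(dy**2))

x = [1, 2, 3, 4, 5]
y = [10, 9, 2.5, 6, 4]

r_manual = pearson_r_by_hand(x, y)
r_scipy, _ = mstats.pearsonr(x, y)

print(r_manual, r_scipy)
```

The two values should agree up to floating-point rounding. The direct formula is shown for clarity only; it takes no special care with nearly constant inputs.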
Under the assumption that x and y are drawn from independent normal distributions (so the population correlation coefficient is 0), the probability density function of the sample correlation coefficient r is ([1], [2]):

    f(r) = \frac{(1 - r^2)^{n/2 - 2}}{\mathrm{B}(\frac{1}{2}, \frac{n}{2} - 1)}

where n is the number of samples, and B is the beta function. This is sometimes referred to as the exact distribution of r. This is the distribution that is used in pearsonr to compute the p-value. The distribution is a beta distribution on the interval [-1, 1], with equal shape parameters a = b = n/2 - 1. In terms of SciPy's implementation of the beta distribution, the distribution of r is
dist = scipy.stats.beta(n/2 - 1, n/2 - 1, loc=-1, scale=2)
The p-value returned by pearsonr is a two-sided p-value. The p-value roughly indicates the probability of an uncorrelated system producing datasets that have a Pearson correlation at least as extreme as the one computed from these datasets. More precisely, for a given sample with correlation coefficient r, the p-value is the probability that abs(r') of a random sample x' and y' drawn from the population with zero correlation would be greater than or equal to abs(r). In terms of the object dist shown above, the p-value for a given r and length n can be computed as
p = 2*dist.cdf(-abs(r))
When n is 2, the above continuous distribution is not well-defined. One can interpret the limit of the beta distribution as the shape parameters a and b approach a = b = 0 as a discrete distribution with equal probability masses at r = 1 and r = -1. More directly, one can observe that, given the data x = [x1, x2] and y = [y1, y2], and assuming x1 != x2 and y1 != y2, the only possible values for r are 1 and -1. Because abs(r') for any sample x' and y' with length 2 will be 1, the two-sided p-value for a sample of length 2 is always 1.
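The recipe in these notes can be checked numerically. The sketch below rebuilds the beta distribution for a small sample with n = 5, evaluates the two-sided p-value from it, and compares the result with the p-value that `mstats.pearsonr` reports. The sample data are illustrative, and the variable names `p_manual` and `p_reported` are introduced only for this example.

```python
import numpy as np
from scipy import stats
from scipy.stats import mstats

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([10, 9, 2.5, 6, 4])
n = len(x)

r, p_reported = mstats.pearsonr(x, y)

# Distribution of r under zero population correlation:
# a beta distribution with a = b = n/2 - 1, shifted and scaled onto [-1, 1].
dist = stats.beta(n/2 - 1, n/2 - 1, loc=-1, scale=2)

# Two-sided p-value: twice the lower-tail probability at -|r|,
# which is valid because the distribution is symmetric about 0.
p_manual = 2 * dist.cdf(-abs(r))

print(p_manual, p_reported)
```

This only works for n > 2; for n = 2 the shape parameters would be zero, which is the degenerate case discussed above.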
Examples
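import numpy as np
from scipy import stats
from scipy.stats import mstats
mstats.pearsonr([1, 2, 3, 4, 5], [10, 9, 2.5, 6, 4])
# Linear dependence with noise: y = x + e, where e has standard deviation s.
s = 0.5
x = stats.norm.rvs(size=500)
e = stats.norm.rvs(scale=s, size=500)
y = x + e
mstats.pearsonr(x, y)
# Population correlation for this model, for comparison with the sample value.
1/np.sqrt(1 + s**2)
# y depends on x, yet the population correlation is zero by symmetry.
y = np.abs(x)
mstats.pearsonr(x, y)
# y follows x only where x is negative; the correlation is nonzero
# even though the relationship is not linear.
y = np.where(x < 0, x, 0)
mstats.pearsonr(x, y)
See also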
- kendalltau
Kendall's tau, a correlation measure for ordinal data.
- spearmanr
Spearman rank-order correlation coefficient.
Aliases
- scipy.stats._mstats_basic.pearsonr