function

scipy.cluster.vq:kmeans

source: /scipy/cluster/vq.py:332

Signature

def kmeans(obs, k_or_guess, iter=20, thresh=1e-05, check_finite=True, *, rng=None, seed=None)

Summary

Performs k-means on a set of observation vectors forming k clusters.

Extended Summary

The k-means algorithm adjusts the classification of the observations into clusters and updates the cluster centroids until the position of the centroids is stable over successive iterations. In this implementation of the algorithm, the stability of the centroids is determined by comparing the absolute value of the change in the average Euclidean distance between the observations and their corresponding centroids against a threshold. This yields a code book mapping centroids to codes and vice versa.
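The loop described above can be sketched in plain NumPy. This is an illustrative sketch only, not SciPy's actual implementation: the function name kmeans_sketch is hypothetical, and it assumes obs is a float array and guess holds the initial centroids.

```python
import numpy as np

def kmeans_sketch(obs, guess, thresh=1e-5):
    """Illustrative k-means loop (hypothetical helper, not SciPy's code)."""
    codebook = np.asarray(guess, dtype=float)
    prev_avg = np.inf
    while True:
        # Distance from every observation to every centroid, shape (M, k).
        dists = np.linalg.norm(obs[:, None, :] - codebook[None, :, :], axis=2)
        # Code of the nearest centroid for each observation.
        codes = dists.argmin(axis=1)
        # Average Euclidean distance to the assigned centroids.
        avg = dists[np.arange(len(obs)), codes].mean()
        # Move each centroid to the mean of its members; drop empty clusters.
        codebook = np.array([obs[codes == i].mean(axis=0)
                             for i in range(len(codebook))
                             if np.any(codes == i)])
        # Stop once the change in average distance is within the threshold.
        if abs(prev_avg - avg) <= thresh:
            return codebook, avg
        prev_avg = avg

obs = np.array([[0., 0.], [0., 1.], [1., 0.],
                [10., 10.], [10., 11.], [11., 10.]])
sketch_book, sketch_avg = kmeans_sketch(obs, obs[[0, 3]])
```

On this toy data the two centroids settle on the means of the two groups of three points.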

Parameters

obs : ndarray

Each row of the M by N array is an observation vector. The columns are the features seen during each observation. The features must be whitened first with the whiten function.
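As a small illustration of the whitening step, the sketch below rescales a toy observation matrix and checks it against a manual division by the per-column standard deviation. The data values are made up for the example.

```python
import numpy as np
from scipy.cluster.vq import whiten

# Three observations (rows) with two features (columns) on very different scales.
obs = np.array([[1.0, 200.0],
                [2.0, 400.0],
                [3.0, 900.0]])

whitened = whiten(obs)

# whiten divides each column by its standard deviation across observations.
manual = obs / obs.std(axis=0)
same = np.allclose(whitened, manual)

# After whitening, every feature has unit standard deviation.
unit_std = whitened.std(axis=0)
```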

k_or_guess : int or ndarray

The number of centroids to generate. A code is assigned to each centroid, which is also the row index of the centroid in the code_book matrix generated.

The initial k centroids are chosen by randomly selecting observations from the observation matrix. Alternatively, passing a k by N array specifies the initial k centroids.
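The two ways of calling kmeans with k_or_guess might look as follows. This is a minimal sketch on made-up data; the variable names are illustrative.

```python
import numpy as np
from scipy.cluster.vq import kmeans, whiten

obs = whiten(np.array([[0., 0.], [0., 1.], [1., 0.],
                       [10., 10.], [10., 11.], [11., 10.]]))

# int: two initial centroids are drawn at random from the observations.
codebook_a, distortion_a = kmeans(obs, 2)

# k by N array: each row is used as an initial centroid.
guess = obs[[0, 3]]
codebook_b, distortion_b = kmeans(obs, guess)
```

With an integer the result can vary between calls unless rng is fixed; with an explicit array the starting point is deterministic.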

iter : int, optional

The number of times to run k-means, returning the codebook with the lowest distortion. This argument is ignored if initial centroids are specified with an array for the k_or_guess parameter. This parameter does not represent the number of iterations of the k-means algorithm.
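To make the meaning of iter concrete, the sketch below imitates a few restarts by hand: each restart picks random observations as the initial guess, and the codebook with the lowest distortion is kept. It is an approximation of the described behavior under the assumption that restarts are independent, not a copy of SciPy's internals.

```python
import numpy as np
from scipy.cluster.vq import kmeans, whiten

obs = whiten(np.array([[0., 0.], [0., 1.], [1., 0.],
                       [10., 10.], [10., 11.], [11., 10.]]))

pick = np.random.default_rng(0)  # used here only to choose initial centroids
best_book, best_dist = None, np.inf
for _ in range(5):  # plays the role of iter=5
    # Choose two distinct observations as the initial centroids.
    guess = obs[pick.choice(len(obs), size=2, replace=False)]
    # With an array guess, kmeans performs a single run.
    book, dist = kmeans(obs, guess)
    if dist < best_dist:
        best_book, best_dist = book, dist
```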

thresh : float, optional

Terminates the k-means algorithm if the change in distortion since the last k-means iteration is less than or equal to thresh.

check_finite : bool, optional

Whether to check that the input matrices contain only finite numbers. Disabling may give a performance gain, but may result in problems (crashes, non-termination) if the inputs do contain infinities or NaNs. Default: True

rng : {None, int, `numpy.random.Generator`}, optional

If rng is passed by keyword, types other than numpy.random.Generator are passed to numpy.random.default_rng to instantiate a Generator. If rng is already a Generator instance, then the provided instance is used. Specify rng for repeatable function behavior.

If this argument is passed by position or seed is passed by keyword, legacy behavior for the argument seed applies:

  • If seed is None (or numpy.random), the numpy.random.RandomState singleton is used.

  • If seed is an int, a new RandomState instance is used, seeded with seed.

  • If seed is already a Generator or RandomState instance then that instance is used.

Returns

codebook : ndarray

A k by N array of k centroids. The ith centroid codebook[i] is represented with the code i. The centroids and codes generated represent the lowest distortion seen, not necessarily the globally minimal distortion. Note that the number of centroids is not necessarily the same as the k_or_guess parameter, because centroids assigned to no observations are removed during iterations.

distortion : float

The mean (non-squared) Euclidean distance between the observations passed and the centroids generated. Note that this differs from the standard definition of distortion in the context of the k-means algorithm, which is the sum of the squared distances.
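The relationship between the two return values can be checked with vq, which assigns each observation to its nearest centroid. The sketch below uses made-up, well-separated data; on such data the recomputed mean distance is expected to agree closely with the returned distortion, although exact equality is not guaranteed in general.

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq, whiten

obs = whiten(np.array([[0., 0.], [0., 1.], [1., 0.],
                       [10., 10.], [10., 11.], [11., 10.]]))
codebook, distortion = kmeans(obs, obs[[0, 3]])

# codes[i] is the row index in codebook of the centroid nearest to obs[i];
# dists[i] is the Euclidean distance to that centroid.
codes, dists = vq(obs, codebook)

mean_dist = dists.mean()       # the quantity kmeans reports as distortion
sum_sq = (dists ** 2).sum()    # the more common sum-of-squares definition
```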

Notes

For more functionality or optimal performance, you can use sklearn.cluster.KMeans. This is a benchmark result of several implementations.

Array API Standard Support

kmeans has experimental support for Python Array API Standard compatible backends in addition to NumPy. Please consider testing these features by setting an environment variable SCIPY_ARRAY_API=1 and providing CuPy, PyTorch, JAX, or Dask arrays as array arguments. The following combinations of backend and device (or other capability) are supported.

====================  ====================  ====================
Library               CPU                   GPU
====================  ====================  ====================
NumPy                 ✅                     n/a                 
CuPy                  n/a                   ⛔                   
PyTorch               ✅                     ⛔                   
JAX                   ⚠️ no JIT
Dask                  ⚠️ computes graph     n/a                 
====================  ====================  ====================

See dev-arrayapi for more information.

Examples

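import numpy as np
from scipy.cluster.vq import vq, kmeans, whiten
import matplotlib.pyplot as plt

# Nine observations with two features each.
features = np.array([[1.9, 2.3],
                     [1.5, 2.5],
                     [0.8, 0.6],
                     [0.4, 1.8],
                     [0.1, 0.1],
                     [0.2, 1.8],
                     [2.0, 0.5],
                     [0.3, 1.5],
                     [1.0, 1.0]])
# Rescale each feature before clustering.
whitened = whiten(features)

# Pass a 2 by N array to specify the two initial centroids explicitly.
book = np.array((whitened[0], whitened[2]))
kmeans(whitened, book)

# Pass an int to draw three initial centroids at random from the observations.
codes = 3
kmeans(whitened, codes)

# Create 50 data points in each of two clusters, a and b.
pts = 50
rng = np.random.default_rng()
a = rng.multivariate_normal([0, 0], [[4, 1], [1, 4]], size=pts)
b = rng.multivariate_normal([30, 10],
                            [[10, 2], [2, 1]],
                            size=pts)
features = np.concatenate((a, b))
# Whiten the combined data.
whitened = whiten(features)
# Find 2 clusters in the data.
codebook, distortion = kmeans(whitened, 2)

# Plot the whitened data and the cluster centers in red.
plt.scatter(whitened[:, 0], whitened[:, 1])
plt.scatter(codebook[:, 0], codebook[:, 1], c='r')
plt.show()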

See also

kmeans2

a different implementation of k-means clustering with more methods for generating initial centroids but without using a distortion change threshold as a stopping criterion.

whiten

must be called prior to passing an observation matrix to kmeans.
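For comparison, a call to kmeans2 might look like the sketch below. It uses the '++' initialization method as one example of the additional options for generating initial centroids; the data is made up, and the result varies between runs because no random state is fixed.

```python
import numpy as np
from scipy.cluster.vq import kmeans2, whiten

obs = whiten(np.array([[0., 0.], [0., 1.], [1., 0.],
                       [10., 10.], [10., 11.], [11., 10.]]))

# kmeans2 returns the centroids and a label (code) for every observation,
# rather than a codebook and a distortion value.
centroids, labels = kmeans2(obs, 2, minit='++')
```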

Aliases

  • scipy.cluster.vq.kmeans
