skcmeans package

Submodules

skcmeans.algorithms module

Implementations of a number of C-means algorithms.

References

[1]	J. C. Bezdek, J. Keller, R. Krishnapuram, and N. R. Pal, Fuzzy models and algorithms for pattern recognition and image processing. Kluwer Academic Publishers, 2005.

class skcmeans.algorithms.CMeans(n_clusters=2, n_init=10, max_iter=300, tol=0.0001, verbosity=0, random_state=None, eps=1e-18, **kwargs)

Bases: object

Base class for C-means algorithms.

Parameters:


n_clusters (int, optional) – The number of clusters to find.
n_init (int, optional) – The number of times to attempt convergence with new initial centroids.
max_iter (int, optional) – The number of cycles of the alternating optimization routine to run for each convergence.
tol (float, optional) – The stopping condition. Convergence is considered to have been reached when the objective function changes by less than tol.
verbosity (int, optional) – The verbosity of the instance. May be 0, 1, or 2. Note: this option is not yet implemented.
random_state (int or np.random.RandomState, optional) – The generator used for initialization. Using an integer fixes the seed.
eps (float, optional) – To avoid numerical errors, zeros are sometimes replaced with a very small number, specified here.

metric: string or function – The distance metric used. May be any of the strings specified for cdist, or a user-specified function.

initialization: function – The method used to initialize the cluster centers.

centers: np.ndarray – (n_clusters, n_features) The derived or supplied cluster centers.

memberships: np.ndarray – (n_samples, n_clusters) The derived or supplied cluster memberships.

calculate_centers(x)

calculate_memberships(x)

converge(x)

Finds cluster centers through an alternating optimization routine.

Terminates when either the number of cycles reaches max_iter or the objective function changes by less than tol.

Parameters:	x (`np.ndarray`) – (n_samples, n_features) The original data.
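
The alternating optimization loop and its two stopping conditions can be sketched in plain NumPy. This is an illustrative sketch only, not the library's implementation: the function names are hypothetical, and hard (nearest-center) membership updates are assumed for brevity.

```python
import numpy as np


def squared_distances(x, centers):
    """Squared Euclidean distance from every sample to every center."""
    return ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)


def converge_sketch(x, initial_centers, max_iter=300, tol=1e-4):
    """Alternate membership and center updates until the objective stalls."""
    centers = np.array(initial_centers, dtype=float)  # work on a copy
    previous = np.inf
    for cycle in range(1, max_iter + 1):
        # Membership step: assign each sample to its nearest center.
        labels = squared_distances(x, centers).argmin(axis=1)
        # Center step: move each center to the mean of its samples.
        for j in range(len(centers)):
            if np.any(labels == j):
                centers[j] = x[labels == j].mean(axis=0)
        # Objective: total squared distance of samples to their centers.
        d2 = squared_distances(x, centers)
        objective = d2[np.arange(len(x)), labels].sum()
        # Stop early once the objective changes by less than tol.
        if abs(previous - objective) < tol:
            break
        previous = objective
    return centers, objective, cycle


x = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
centers, objective, cycles = converge_sketch(x, x[[0, 2]])
```

On this toy data the loop stops after the second cycle, because the objective no longer changes; with max_iter=1 it would stop after the first cycle regardless.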

distances(x)

Calculates the distance between data x and the centers.

The distance, by default, is calculated according to metric, but this method should be overridden by subclasses if required.

Parameters:	x (`np.ndarray`) – (n_samples, n_features) The original data.
Returns:	(n_samples, n_clusters) Each entry (i, j) is the distance between sample i and cluster center j.
Return type:	`np.ndarray`
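
To illustrate the shape of the returned array, here is a minimal NumPy sketch of the default Euclidean case. The function name is hypothetical; the same result should be obtainable from scipy.spatial.distance.cdist(x, centers, metric='euclidean').

```python
import numpy as np


def euclidean_distances(x, centers):
    """Return an (n_samples, n_clusters) array of Euclidean distances."""
    # Broadcasting gives an (n_samples, n_clusters, n_features) difference.
    diff = x[:, np.newaxis, :] - centers[np.newaxis, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))


x = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
centers = np.array([[0.0, 0.0], [3.0, 4.0]])
d = euclidean_distances(x, centers)  # d[i, j]: sample i to center j
```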

fit(x)

Optimizes cluster centers by restarting convergence several times.

Parameters:	x (`np.ndarray`) – (n_samples, n_features) The original data.
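
The restart strategy can be sketched as follows: run the convergence routine n_init times from different random starting centers and keep the result with the lowest objective. This is a sketch under assumptions, not the library's code: the helper names are hypothetical and a simple hard C-means routine stands in for the real convergence step.

```python
import numpy as np


def run_once(x, k, rng, max_iter=100):
    """One hard C-means run from k randomly chosen samples."""
    centers = x[rng.choice(len(x), size=k, replace=False)].astype(float)
    for _ in range(max_iter):
        d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Keep the old center if a cluster ends up empty.
        new_centers = np.array([
            x[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return centers, d2.min(axis=1).sum()


def fit_sketch(x, k, n_init=10, random_state=None):
    """Restart run_once n_init times and keep the lowest objective."""
    rng = np.random.RandomState(random_state)
    best_centers, best_objective = None, np.inf
    for _ in range(n_init):
        centers, objective = run_once(x, k, rng)
        if objective < best_objective:
            best_centers, best_objective = centers, objective
    return best_centers, best_objective


x = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
best_centers, best_objective = fit_sketch(x, k=2, n_init=10, random_state=0)
```

Restarts matter because a single run can settle in a poor local minimum: on this toy data, starting from two samples on the same side yields an objective of 100 instead of 1.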

static initialization(x, k, random_state=None, eps=1e-12)

Selects initial points randomly from the data.

Parameters:


x (np.ndarray) – (n_samples, n_features) The original data.
k (int) – The number of points to select.
random_state (int or np.random.RandomState, optional) – The generator used for initialization. Using an integer fixes the seed.

Returns:

Uninitialized memberships
selection (np.ndarray) – (k, n_features) A length-k subset of the original data.
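
A minimal sketch of random selection, assuming sampling without replacement and handling of random_state as either an integer seed or a RandomState instance. The function name is hypothetical, and unlike the documented method this sketch returns only the selected points, not the uninitialized memberships.

```python
import numpy as np


def initialize_random_sketch(x, k, random_state=None):
    """Pick k distinct rows of x to serve as initial cluster centers."""
    if isinstance(random_state, np.random.RandomState):
        rng = random_state
    else:
        # An integer fixes the seed; None seeds from the operating system.
        rng = np.random.RandomState(random_state)
    indices = rng.choice(x.shape[0], size=k, replace=False)
    return x[indices]


x = np.arange(20.0).reshape(10, 2)
selection = initialize_random_sketch(x, 3, random_state=0)
```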

initialize(x)

metric = 'euclidean'

objective(x)

update(x)

Updates cluster memberships and centers in a single cycle.

If the cluster centers have not already been initialized, they are chosen according to initialization.

Parameters:	x (`np.ndarray`) – (n_samples, n_features) The original data.

class skcmeans.algorithms.Fuzzy(n_clusters=2, n_init=10, max_iter=300, tol=0.0001, verbosity=0, random_state=None, eps=1e-18, **kwargs)

Bases: skcmeans.algorithms.CMeans

Base class for fuzzy C-means clusters.

m: float – Fuzziness parameter. Higher values reduce the rate of drop-off from full membership to zero membership.

fuzzifier(memberships): Fuzzification operator. By default, for memberships $u$ this is $u^m$.

objective(x): Interpretable as the data’s weighted rotational inertia about the cluster centers. To be minimised.
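
As a rough illustration, the fuzzifier and objective might be written as below. This is a sketch assuming squared Euclidean distances in the objective (by analogy with a moment of inertia, mass times squared distance); the standalone function names and arguments are hypothetical, since the documented methods read centers and memberships from the instance.

```python
import numpy as np


def fuzzifier(memberships, m=2):
    """Raise memberships u to the power m."""
    return memberships ** m


def fuzzy_objective(x, centers, memberships, m=2):
    """Sum of fuzzified memberships times squared sample-center distances."""
    d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return float((fuzzifier(memberships, m) * d2).sum())


x = np.array([[0.0], [2.0]])
centers = np.array([[0.0], [2.0]])
# Rows are samples, columns are clusters.
memberships = np.array([[1.0, 0.0], [0.5, 0.5]])
value = fuzzy_objective(x, centers, memberships)
```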

m = 2

class skcmeans.algorithms.GustafsonKesselMixin(n_clusters=2, n_init=10, max_iter=300, tol=0.0001, verbosity=0, random_state=None, eps=1e-18, **kwargs)

Bases: skcmeans.algorithms.Fuzzy

Gives clusters ellipsoidal character.

The Gustafson-Kessel algorithm redefines the distance measurement such that clusters may adopt ellipsoidal shapes. This is achieved through updates to a covariance matrix assigned to each cluster center.

Examples

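Create an algorithm for probabilistic clustering with ellipsoidal clusters:

>>> from skcmeans.algorithms import GustafsonKesselMixin, Probabilistic
>>> class ProbabilisticGustafsonKessel(GustafsonKesselMixin, Probabilistic):
...     pass
>>> pgk = ProbabilisticGustafsonKessel()
>>> pgk.fit(x)  # x: np.ndarray of shape (n_samples, n_features)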

calculate_covariance(x)

Calculates the covariance of the data x about each cluster center.

Parameters:	x (`np.ndarray`) – (n_samples, n_features) The original data.
Returns:	(n_clusters, n_features, n_features) The covariance matrix of each cluster.
Return type:	`np.ndarray`
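
For orientation, the textbook Gustafson-Kessel quantities can be sketched in NumPy as below. This assumes the standard formulation (membership-weighted covariance per cluster, and a distance induced by the inverse covariance scaled to unit determinant); whether the library follows it term by term is not confirmed here, and the function names are hypothetical.

```python
import numpy as np


def fuzzy_covariance(x, centers, memberships, m=2):
    """Membership-weighted covariance of each cluster.

    Returns an array of shape (n_clusters, n_features, n_features).
    """
    weights = memberships ** m                  # (n_samples, n_clusters)
    diff = x[:, None, :] - centers[None, :, :]  # (n_samples, n_clusters, n_features)
    # Weighted sum of outer products (x_k - c_i)(x_k - c_i)^T per cluster.
    outer = np.einsum('kc,kci,kcj->cij', weights, diff, diff)
    return outer / weights.sum(axis=0)[:, None, None]


def gk_distances(x, centers, covariance):
    """Distances under the norm A_i = det(F_i)^(1/n) * inv(F_i)."""
    n_features = x.shape[1]
    diff = x[:, None, :] - centers[None, :, :]
    det = np.linalg.det(covariance)
    # Scaling by det^(1/n) fixes det(A_i) = 1, so only the shape adapts.
    norm = det[:, None, None] ** (1.0 / n_features) * np.linalg.inv(covariance)
    d2 = np.einsum('kci,cij,kcj->kc', diff, norm, diff)
    return np.sqrt(d2)


x = np.array([[-1.0, 0.0], [1.0, 0.0], [0.0, -2.0], [0.0, 2.0]])
centers = np.array([[0.0, 0.0]])
memberships = np.ones((4, 1))
covariance = fuzzy_covariance(x, centers, memberships)
d = gk_distances(x, centers, covariance)
```

In this example the data are stretched along the second axis, and all four points end up equidistant from the center under the adapted norm, which is the ellipsoidal behaviour described above.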

covariance = None

distances(x)

fit(x)

Optimizes cluster centers by restarting convergence several times.

Extends the default behaviour by recalculating the covariance matrix with resultant memberships and centers.

Parameters:	x (`np.ndarray`) – (n_samples, n_features) The original data.

update(x)

Single update of the cluster algorithm.

Extends the default behaviour by including a covariance calculation after updating the centers.

Parameters:	x (`np.ndarray`) – (n_samples, n_features) The original data.

class skcmeans.algorithms.Hard(n_clusters=2, n_init=10, max_iter=300, tol=0.0001, verbosity=0, random_state=None, eps=1e-18, **kwargs)

Bases: skcmeans.algorithms.CMeans

Hard C-means, equivalent to K-means clustering.

calculate_memberships(x): The membership of a sample is 1 to the closest cluster and 0 otherwise.

calculate_centers(x): New centers are calculated as the mean of the points closest to them.

objective(x): Interpretable as the data’s rotational inertia about the cluster centers. To be minimised.
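
The two update rules can be sketched as standalone NumPy functions. The names and signatures here are hypothetical (the documented methods take only x and use the instance's state), and the sketch assumes that no cluster ends up empty.

```python
import numpy as np


def hard_memberships(distances):
    """Membership 1 for the closest center of each sample, 0 otherwise."""
    memberships = np.zeros_like(distances, dtype=float)
    rows = np.arange(distances.shape[0])
    memberships[rows, distances.argmin(axis=1)] = 1.0
    return memberships


def hard_centers(x, memberships):
    """Mean of the samples assigned to each cluster (none may be empty)."""
    return memberships.T @ x / memberships.sum(axis=0)[:, None]


x = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0]])
centers = np.array([[0.0, 0.0], [10.0, 0.0]])
distances = np.sqrt(((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2))
memberships = hard_memberships(distances)
new_centers = hard_centers(x, memberships)
```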


class skcmeans.algorithms.Possibilistic(n_clusters=2, n_init=10, max_iter=300, tol=0.0001, verbosity=0, random_state=None, eps=1e-18, **kwargs)

Bases: skcmeans.algorithms.Fuzzy

Possibilistic C-means.

In the possibilistic algorithm, sample points are assigned memberships according to their relative proximity to the centers. This is controlled through a weighting assigned to each cluster center, approximately the variance of that cluster.

calculate_memberships(x): Memberships are calculated from the distance $d_{ik}$ between the sample $k$ and the cluster center $i$, and the weighting $w_i$ of each center.

\[u_{ik} = \left(1 + \left(\frac{d_{ik}}{w_i}\right)^\frac{1}{m -1} \right)^{-1}\]
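
The formula above translates directly into a vectorized expression. This is a sketch only: the function name and the example weights are hypothetical, and the arrays are laid out as (n_samples, n_clusters), so that entry [k, i] holds $u_{ik}$.

```python
import numpy as np


def possibilistic_memberships(distances, weights, m=2):
    """Evaluate u_ik = 1 / (1 + (d_ik / w_i)^(1 / (m - 1))).

    distances: (n_samples, n_clusters); weights: (n_clusters,).
    """
    return 1.0 / (1.0 + (distances / weights) ** (1.0 / (m - 1)))


distances = np.array([[0.0, 1.0], [2.0, 4.0]])
weights = np.array([2.0, 4.0])  # illustrative values only
u = possibilistic_memberships(distances, weights)
```

Unlike probabilistic memberships, the rows of u are not constrained to sum to one: each membership depends only on the distance to its own center.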

calculate_centers(x): New centers are calculated as the mean of the points closest to them, weighted by the fuzzified memberships.

\[c_i = \left. \sum_k u_{ik}^m x_k \middle/ \sum_k u_{ik}^m \right.\]
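
A minimal sketch of the weighted mean, assuming that the normalizing sum also uses the fuzzified memberships $u_{ik}^m$ so that each center is a true weighted average. The function name is hypothetical.

```python
import numpy as np


def fuzzy_centers(x, memberships, m=2):
    """Mean of the samples weighted by fuzzified memberships u^m.

    x: (n_samples, n_features); memberships: (n_samples, n_clusters).
    """
    fuzzified = memberships ** m
    # Numerator: sum_k u_ik^m x_k; denominator: sum_k u_ik^m.
    return fuzzified.T @ x / fuzzified.sum(axis=0)[:, None]


x = np.array([[0.0], [4.0]])
memberships = np.array([[1.0, 0.0], [0.5, 1.0]])
centers = fuzzy_centers(x, memberships)
```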

static initialization(x, k, random_state=None)

Selects initial points using a probabilistic clustering approximation.

Parameters:


x (np.ndarray) – (n_samples, n_features) The original data.
k (int) – The number of points to select.
random_state (int or np.random.RandomState, optional) – The generator used for initialization. Using an integer fixes the seed.

Returns:

np.ndarray – (n_samples, k) Cluster memberships
np.ndarray – (k, n_features) Cluster centers

weights(x)

class skcmeans.algorithms.Probabilistic(n_clusters=2, n_init=10, max_iter=300, tol=0.0001, verbosity=0, random_state=None, eps=1e-18, **kwargs)

Bases: skcmeans.algorithms.Fuzzy

Probabilistic C-means.

In the probabilistic algorithm, each sample point has a total membership of unity, distributed among the centers according to their relative distances. This tends to push cluster centers away from each other.

calculate_memberships(x): Memberships are calculated from the distance $d_{ik}$ between the sample $k$ and the cluster center $i$, relative to its distances $d_{jk}$ to every cluster center $j$.

\[u_{ik} = \left(\sum_j \left( \frac{d_{ik}}{d_{jk}} \right)^{\frac{2}{m - 1}} \right)^{-1}\]
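
The formula above can be evaluated with a single broadcast. This is a sketch, not the library's code: the function name is hypothetical, arrays are laid out as (n_samples, n_clusters), and clamping zero distances to eps is an assumption based on the eps parameter described for CMeans.

```python
import numpy as np


def probabilistic_memberships(distances, m=2, eps=1e-18):
    """Evaluate u_ik = 1 / sum_j (d_ik / d_jk)^(2 / (m - 1))."""
    # Replace zero distances with eps to avoid division by zero.
    d = np.fmax(distances, eps)
    # ratio[k, i, j] = d_ik / d_jk for sample k and clusters i, j.
    ratio = d[:, :, None] / d[:, None, :]
    return 1.0 / (ratio ** (2.0 / (m - 1))).sum(axis=2)


distances = np.array([[1.0, 3.0], [0.0, 1.0]])
u = probabilistic_memberships(distances)
```

Each row of u sums to one, and a sample lying exactly on a center receives essentially full membership in that cluster.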

calculate_centers(x): New centers are calculated as the mean of the points closest to them, weighted by the fuzzified memberships.

\[c_i = \left. \sum_k u_{ik}^m x_k \middle/ \sum_k u_{ik}^m \right.\]

skcmeans.initialization module

skcmeans.initialization.initialize_probabilistic(x, k, random_state=None)

Selects initial points using a probabilistic clustering approximation.

Parameters:


x (np.ndarray) – (n_samples, n_features) The original data.
k (int) – The number of points to select.
random_state (int or np.random.RandomState, optional) – The generator used for initialization. Using an integer fixes the seed.

Returns:

np.ndarray – (n_samples, k) Cluster memberships
np.ndarray – (k, n_features) Cluster centers

skcmeans.initialization.initialize_random(x, k, random_state=None, eps=1e-12)

Selects initial points randomly from the data.

Parameters:


x (np.ndarray) – (n_samples, n_features) The original data.
k (int) – The number of points to select.
random_state (int or np.random.RandomState, optional) – The generator used for initialization. Using an integer fixes the seed.

Returns:

Uninitialized memberships
selection (np.ndarray) – (k, n_features) A length-k subset of the original data.
