skcmeans.algorithms Module

Implementations of a number of C-means algorithms.

References

[1] J. C. Bezdek, J. Keller, R. Krishnapuram, and N. R. Pal, Fuzzy models and algorithms for pattern recognition and image processing. Kluwer Academic Publishers, 2005.
class skcmeans.algorithms.CMeans(n_clusters=2, n_init=10, max_iter=300, tol=0.0001, verbosity=0, random_state=None, eps=1e-18, **kwargs)

Base class for C-means algorithms.

Parameters:
  • n_clusters (int, optional) – The number of clusters to find.
  • n_init (int, optional) – The number of times to attempt convergence with new initial centroids.
  • max_iter (int, optional) – The maximum number of cycles of the alternating optimization routine to run for each convergence attempt.
  • tol (float, optional) – The stopping condition. Convergence is considered to have been reached when the objective function changes by less than tol.
  • verbosity (int, optional) –

    The verbosity of the instance. May be 0, 1, or 2.

    Note

    Very much not yet implemented.

  • random_state (int or np.random.RandomState, optional) – The generator used for initialization. Using an integer fixes the seed.
  • eps (float, optional) – To avoid numerical errors, zeros are sometimes replaced with a very small number, specified here.
metric

string or function – The distance metric used. May be any of the strings specified for cdist, or a user-specified function.

initialization

function – The method used to initialize the cluster centers.

centers

np.ndarray – (n_clusters, n_features) The derived or supplied cluster centers.

memberships

np.ndarray – (n_samples, n_clusters) The derived or supplied cluster memberships.

converge(x)

Finds cluster centers through an alternating optimization routine.

Terminates when either the number of cycles reaches max_iter or the objective function changes by less than tol.

Parameters: x (np.ndarray) – (n_samples, n_features) The original data.
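
The stopping rule can be illustrated with plain NumPy. The following is a minimal sketch and not the library's implementation: `converge_sketch` is a hypothetical helper, and it uses hard (nearest-center) assignments purely to keep the update step short.

```python
import numpy as np


def converge_sketch(x, centers, max_iter=300, tol=1e-4):
    """Alternate membership and center updates until the objective stalls.

    Hypothetical helper: uses hard (nearest-center) assignments for brevity.
    Returns the final centers, the final objective and the cycles used.
    """
    previous = np.inf
    for cycle in range(1, max_iter + 1):
        # Squared Euclidean distance from every sample to every center.
        d2 = ((x[:, np.newaxis, :] - centers[np.newaxis, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Move each center to the mean of its samples; empty clusters stay put.
        centers = np.array([
            x[labels == i].mean(axis=0) if np.any(labels == i) else centers[i]
            for i in range(len(centers))
        ])
        d2 = ((x[:, np.newaxis, :] - centers[np.newaxis, :, :]) ** 2).sum(axis=2)
        objective = d2.min(axis=1).sum()
        if abs(previous - objective) < tol:
            break  # the objective changed by less than tol
        previous = objective
    return centers, objective, cycle


x = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
centers, objective, cycles = converge_sketch(x, x[[0, 2]])
```

On this toy data the loop stops well before `max_iter`, because the second cycle leaves the objective unchanged.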
distances(x)

Calculates the distance between data x and the centers.

The distance, by default, is calculated according to metric, but this method should be overridden by subclasses if required.

Parameters: x (np.ndarray) – (n_samples, n_features) The original data.
Returns: (n_samples, n_clusters) Each entry (i, j) is the distance between sample i and cluster center j.
Return type: np.ndarray
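
To make the shape of the returned array concrete, here is a small NumPy sketch of the Euclidean case. `euclidean_distances` is a hypothetical helper written for illustration; with the default metric it should agree with `scipy.spatial.distance.cdist(x, centers)`.

```python
import numpy as np


def euclidean_distances(x, centers):
    """Return an (n_samples, n_clusters) matrix of Euclidean distances."""
    # Broadcasting gives differences of shape (n_samples, n_clusters, n_features).
    diff = x[:, np.newaxis, :] - centers[np.newaxis, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))


x = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
centers = np.array([[0.0, 0.0], [6.0, 8.0]])
d = euclidean_distances(x, centers)  # d[i, j]: sample i to center j
```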
fit(x)

Optimizes cluster centers by restarting convergence several times.

Parameters: x (np.ndarray) – (n_samples, n_features) The original data.
static initialization(x, k, random_state=None, eps=1e-12)

Selects initial points randomly from the data.

Parameters:
  • x (np.ndarray) – (n_samples, n_features) The original data.
  • k (int) – The number of points to select.
  • random_state (int or np.random.RandomState, optional) – The generator used for initialization. Using an integer fixes the seed.
Returns:

  • Uninitialized memberships
  • selection (np.ndarray) – (k, n_features) A length-k subset of the original data.
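
Random selection of initial centers might look like the sketch below. `random_selection` is a hypothetical helper, and sampling without replacement is an assumption made here. It returns only the selected points and omits the memberships placeholder.

```python
import numpy as np


def random_selection(x, k, random_state=None):
    """Pick k distinct rows of x to serve as initial centers (hypothetical)."""
    # Accept either an integer seed (or None) or an existing RandomState.
    if isinstance(random_state, np.random.RandomState):
        rng = random_state
    else:
        rng = np.random.RandomState(random_state)
    # Assumption: rows are drawn without replacement.
    indices = rng.choice(x.shape[0], size=k, replace=False)
    return x[indices]


x = np.arange(20, dtype=float).reshape(10, 2)
selection = random_selection(x, 3, random_state=0)
```

Passing the same integer seed reproduces the same selection, which is what makes restarts repeatable.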

update(x)

Updates cluster memberships and centers in a single cycle.

If the cluster centers have not already been initialized, they are chosen according to initialization.

Parameters: x (np.ndarray) – (n_samples, n_features) The original data.
class skcmeans.algorithms.Fuzzy(n_clusters=2, n_init=10, max_iter=300, tol=0.0001, verbosity=0, random_state=None, eps=1e-18, **kwargs)

Base class for fuzzy C-means clustering algorithms.

m

float – Fuzziness parameter. Higher values reduce the rate of drop-off from full membership to zero membership.

fuzzifier(memberships)

Fuzzification operator. By default, for memberships $u$ this is $u^m$.

objective(x)

Interpretable as the data’s weighted rotational inertia about the cluster centers. To be minimised.
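
As a rough illustration of the two methods above, the sketch below applies the default fuzzifier and evaluates a weighted inertia. Reading "weighted rotational inertia" as $\sum_{i,k} u_{ik}^m d_{ik}^2$ is an assumption, and the function names are hypothetical.

```python
import numpy as np


def fuzzifier(memberships, m=2.0):
    """Raise memberships to the power m (the default operator)."""
    return memberships ** m


def fuzzy_objective(distances, memberships, m=2.0):
    """Sum of squared distances weighted by the fuzzified memberships.

    Assumption: the objective is sum over i, k of u_ik^m * d_ik^2.
    """
    return np.sum(fuzzifier(memberships, m) * distances ** 2)


# Two samples and two clusters; both arrays are (n_samples, n_clusters).
distances = np.array([[1.0, 3.0], [2.0, 2.0]])
memberships = np.array([[0.9, 0.1], [0.5, 0.5]])
value = fuzzy_objective(distances, memberships)
```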

class skcmeans.algorithms.GustafsonKesselMixin(n_clusters=2, n_init=10, max_iter=300, tol=0.0001, verbosity=0, random_state=None, eps=1e-18, **kwargs)

Gives clusters ellipsoidal character.

The Gustafson-Kessel algorithm redefines the distance measurement such that clusters may adopt ellipsoidal shapes. This is achieved through updates to a covariance matrix assigned to each cluster center.

Examples

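Create an algorithm for probabilistic clustering with ellipsoidal clusters:

>>> from skcmeans.algorithms import GustafsonKesselMixin, Probabilistic
>>> class ProbabilisticGustafsonKessel(GustafsonKesselMixin, Probabilistic):
...     pass
>>> pgk = ProbabilisticGustafsonKessel()
>>> pgk.fit(x)  # x: (n_samples, n_features) data array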
calculate_covariance(x)

Calculates the covariance of the data x about each cluster center.

Parameters: x (np.ndarray) – (n_samples, n_features) The original data.
Returns: (n_clusters, n_features, n_features) The covariance matrix of each cluster.
Return type: np.ndarray
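
A per-cluster covariance of the stated shape can be sketched as follows. This uses the textbook membership-weighted (fuzzy) covariance; whether the library weights or normalises in exactly this way is an assumption, and `fuzzy_covariance` is a hypothetical name.

```python
import numpy as np


def fuzzy_covariance(x, centers, memberships, m=2.0):
    """Membership-weighted covariance of x about each center.

    x: (n_samples, n_features); centers: (n_clusters, n_features);
    memberships: (n_samples, n_clusters).
    Returns an array of shape (n_clusters, n_features, n_features).
    """
    weights = memberships ** m  # (n_samples, n_clusters)
    # Deviations of every sample from every center:
    # shape (n_samples, n_clusters, n_features).
    diff = x[:, np.newaxis, :] - centers[np.newaxis, :, :]
    # Weighted sum of outer products of the deviations, one matrix per cluster.
    outer = np.einsum('kc,kci,kcj->cij', weights, diff, diff)
    return outer / weights.sum(axis=0)[:, np.newaxis, np.newaxis]


x = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]])
centers = np.array([[1.0, 1.0]])
memberships = np.ones((4, 1))
cov = fuzzy_covariance(x, centers, memberships)
```

For four points at the corners of a square, fully assigned to a center in the middle, the result is a single identity matrix.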
fit(x)

Optimizes cluster centers by restarting convergence several times.

Extends the default behaviour by recalculating the covariance matrix with resultant memberships and centers.

Parameters: x (np.ndarray) – (n_samples, n_features) The original data.
update(x)

Single update of the cluster algorithm.

Extends the default behaviour by including a covariance calculation after updating the centers.

Parameters: x (np.ndarray) – (n_samples, n_features) The original data.
class skcmeans.algorithms.Hard(n_clusters=2, n_init=10, max_iter=300, tol=0.0001, verbosity=0, random_state=None, eps=1e-18, **kwargs)

Hard C-means, equivalent to K-means clustering.

calculate_memberships(x)

The membership of a sample is 1 to the closest cluster and 0 otherwise.

calculate_centers(x)

New centers are calculated as the mean of the points closest to them.
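
The two hard C-means updates can be written in a few lines of NumPy. The helper names are hypothetical, and the center update assumes that no cluster ends up empty.

```python
import numpy as np


def hard_memberships(distances):
    """One-hot memberships: 1 for the nearest center, 0 elsewhere."""
    n_samples, n_clusters = distances.shape
    memberships = np.zeros((n_samples, n_clusters))
    memberships[np.arange(n_samples), distances.argmin(axis=1)] = 1.0
    return memberships


def hard_centers(x, memberships):
    """Mean of the samples assigned to each cluster (assumes none is empty)."""
    return memberships.T @ x / memberships.sum(axis=0)[:, np.newaxis]


x = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 10.0]])
distances = np.array([[0.0, 14.0], [2.0, 13.0], [14.0, 0.0]])
u = hard_memberships(distances)
centers = hard_centers(x, u)
```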

objective(x)

Interpretable as the data’s rotational inertia about the cluster centers. To be minimised.

class skcmeans.algorithms.Possibilistic(n_clusters=2, n_init=10, max_iter=300, tol=0.0001, verbosity=0, random_state=None, eps=1e-18, **kwargs)

Possibilistic C-means.

In the possibilistic algorithm, sample points are assigned memberships according to their relative proximity to the centers. This is controlled through a weighting assigned to each cluster center, approximately the variance of that cluster.

calculate_memberships(x)

Memberships are calculated from the distance \(d_{ik}\) between the sample \(k\) and the cluster center \(i\), and the weighting \(w_i\) of each center.

\[u_{ik} = \left(1 + \left(\frac{d_{ik}}{w_i}\right)^\frac{1}{m -1} \right)^{-1}\]
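
The formula above translates directly to NumPy. In this sketch the arrays are laid out as (n_samples, n_clusters), so `u[k, i]` holds $u_{ik}$. The function name and the explicit `weights` argument are illustrative assumptions.

```python
import numpy as np


def possibilistic_memberships(distances, weights, m=2.0):
    """Evaluate u_ik = 1 / (1 + (d_ik / w_i) ** (1 / (m - 1))).

    distances: (n_samples, n_clusters); weights: (n_clusters,).
    """
    return 1.0 / (1.0 + (distances / weights) ** (1.0 / (m - 1.0)))


distances = np.array([[0.0, 4.0], [1.0, 1.0], [3.0, 0.5]])
weights = np.array([1.0, 2.0])
u = possibilistic_memberships(distances, weights)
```

A sample at distance $w_i$ from center $i$ receives a membership of one half, and the memberships of a sample need not sum to one across clusters.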
calculate_centers(x)

New centers are calculated as the mean of the points closest to them, weighted by the fuzzified memberships.

\[c_i = \left. \sum_k u_{ik}^m x_k \middle/ \sum_k u_{ik}^m \right.\]
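
A minimal sketch of the center update follows. It assumes the weights are normalised by the sum of the fuzzified memberships, so that each center is a proper weighted average. `fuzzy_centers` is a hypothetical name.

```python
import numpy as np


def fuzzy_centers(x, memberships, m=2.0):
    """Membership-weighted mean of the samples for each cluster.

    Assumption: normalisation uses the sum of the fuzzified memberships.
    x: (n_samples, n_features); memberships: (n_samples, n_clusters).
    """
    weights = memberships ** m  # fuzzified memberships
    return weights.T @ x / weights.sum(axis=0)[:, np.newaxis]


x = np.array([[0.0, 0.0], [4.0, 0.0]])
memberships = np.array([[1.0, 0.5], [0.0, 0.5]])
centers = fuzzy_centers(x, memberships)
```

The first center sits on the only sample it fully owns, and the second lands midway between the two samples it shares equally.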
static initialization(x, k, random_state=None)

Selects initial points using a probabilistic clustering approximation.

Parameters:
  • x (np.ndarray) – (n_samples, n_features) The original data.
  • k (int) – The number of points to select.
  • random_state (int or np.random.RandomState, optional) – The generator used for initialization. Using an integer fixes the seed.
Returns:

  • np.ndarray – (n_samples, k) Cluster memberships
  • np.ndarray – (k, n_features) Cluster centers

class skcmeans.algorithms.Probabilistic(n_clusters=2, n_init=10, max_iter=300, tol=0.0001, verbosity=0, random_state=None, eps=1e-18, **kwargs)

Probabilistic C-means.

In the probabilistic algorithm, each sample point has a total membership of unity, distributed among the centers. This tends to push cluster centers away from each other.

calculate_memberships(x)

Memberships are calculated from the distance \(d_{ik}\) between the sample \(k\) and the cluster center \(i\).

\[u_{ik} = \left(\sum_j \left( \frac{d_{ik}}{d_{jk}} \right)^{\frac{2}{m - 1}} \right)^{-1}\]
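
The membership formula can be checked numerically with the sketch below. Arrays are laid out as (n_samples, n_clusters), the function name is hypothetical, and replacing zero distances with `eps` is an assumption about how that parameter is meant to be used.

```python
import numpy as np


def probabilistic_memberships(distances, m=2.0, eps=1e-18):
    """Evaluate u_ik = 1 / sum_j (d_ik / d_jk) ** (2 / (m - 1)).

    distances: (n_samples, n_clusters). Zero distances are replaced by eps
    to avoid division by zero (an assumption, see the lead-in).
    """
    d = np.fmax(distances, eps)
    # ratio[k, i, j] = d_ik / d_jk for sample k and centers i and j.
    ratio = d[:, :, np.newaxis] / d[:, np.newaxis, :]
    return 1.0 / (ratio ** (2.0 / (m - 1.0))).sum(axis=2)


distances = np.array([[1.0, 1.0], [1.0, 3.0], [0.0, 2.0]])
u = probabilistic_memberships(distances)
```

Each row of `u` sums to one. A sample equidistant from both centers splits its membership evenly, while a sample sitting on a center is assigned almost entirely to it.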
calculate_centers(x)

New centers are calculated as the mean of the points closest to them, weighted by the fuzzified memberships.

\[c_i = \left. \sum_k u_{ik}^m x_k \middle/ \sum_k u_{ik}^m \right.\]