python实现聚类算法

sklearn包中的K-Means算法

1.  函数:sklearn.cluster.``KMeans

*class *sklearn.cluster.``KMeans(n_clusters=8init=’k-means++’n_init=10max_iter=300tol=0.0001precompute_distances=’auto’verbose=0random_state=Nonecopy_x=Truen_jobs=Nonealgorithm=’auto’)

  1. 主要参数

n_clusters:要进行的分类的个数,即上文中k值,默认是8

init:‘k-means++’, ‘random’ or an ndarray

‘k-means ++’:使用k-means++算法,默认选项

‘random’:从初始质心数据中随机选择k个观察值

第三个是数组形式的参数

n_init:使用不同的初始化运行算法的次数

max_iter  :最大迭代次数。默认300

random_state:设置某个整数使得结果固定

n_jobs: 设置并行量 (-1表示使用所有CPU)

  1. 主要属性:

cluster_centers_ :集群中心的坐标

labels_ : 每个点的标签

  1. 官网示例:
1
2
3
4
5
6
7
8
9
10
11
12
>>> from sklearn.cluster import KMeans
>>> import numpy as np
>>> X = np.array([[1, 2], [1, 4], [1, 0],
... [10, 2], [10, 4], [10, 0]])
>>> kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
>>> kmeans.labels_
array([1, 1, 1, 0, 0, 0], dtype=int32)
>>> kmeans.predict([[0, 0], [12, 3]])
array([1, 0], dtype=int32)
>>> kmeans.cluster_centers_
array([[10., 2.],
[ 1., 2.]])

5.  My code:

1
2
3
4
5
6
7
8
9
10
%matplotlib inline
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
import numpy as np
X = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])
plt.plot(X[:,0],X[:,1],'x')
kmeans = KMeans(n_clusters=2).fit(X)
print(kmeans.labels_)
print(kmeans.predict([[0, 0], [4, 4]]))
print(kmeans.cluster_centers_)

聚类算法衡量指标

1.  函数:sklearn.metrics.``silhouette_score

sklearn.metrics.``silhouette_score(Xlabelsmetric=’euclidean’sample_size=Nonerandom_state=None, **kwds)

  1. 实例:

https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_digits.html#sphx-glr-auto-examples-cluster-plot-kmeans-digits-py

sklearn包中的K-Means算法

1.  函数:sklearn.mixture.``GaussianMixture

sklearn.mixture.``GaussianMixture(n_components=1covariance_type=’full’tol=0.001reg_covar=1e-06max_iter=100n_init=1init_params=’kmeans’weights_init=Nonemeans_init=Noneprecisions_init=Nonerandom_state=Nonewarm_start=Falseverbose=0verbose_interval=10)

  1. 实例:

https://scikit-learn.org/stable/auto_examples/mixture/plot_gmm_covariances.html#sphx-glr-auto-examples-mixture-plot-gmm-covariances-py

https://scikit-learn.org/stable/auto_examples/mixture/plot_gmm_selection.html#sphx-glr-auto-examples-mixture-plot-gmm-selection-py

聚类结果可视化

参考代码:A demo of K-Means clustering on the handwritten digits data
直接plot出来点太密集了,所以考虑随机抽取5k个点来plot,参考:np.random.randint、np.random.choice、random.sample三种随机函数的用法案例
结果如下:

先聚类后PCA

直接对原始的数据进行聚类,结果如下:

Weighted k-means

第一个方法

Clustering the US population: observation-weighted k-means
github-observation_weighted_kmeans

这是直接对data加权值,有对应的代码,但是数据结构用的是它自己的project,需要改写。

第二个方法:weighted kernel kmeans

WEIGHTED KERNEL K-MEANS 加权核K均值算法理解及其实现(一)
上文中提到,[1]中使用了这个方法,这个方法最初在[2]提到。有现成的代码库实现:
tslearn.clustering.GlobalAlignmentKernelKMeans
这个是现成的工具包,若要直接使用可以设置一个最简单的y=x的kernel即可。

Reference

[1]Weighted Graph Cuts without Eigenvectors:A Multilevel Approach

[2]Kernel k-means, Spectral Clustering and Normalized Cuts

聚类算法一览

https://www.cnblogs.com/lc1217/p/6893924.html

https://www.cnblogs.com/lc1217/p/6908031.html

https://www.cnblogs.com/lc1217/p/6963687.html

GMM vs K-means

https://www.jianshu.com/p/a4d8fa39c762

https://www.jianshu.com/p/13898e68c5c6

sklearn

https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html

https://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html

https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_digits.html#sphx-glr-auto-examples-cluster-plot-kmeans-digits-py

https://scikit-learn.org/stable/modules/generated/sklearn.mixture.GaussianMixture.html

https://scikit-learn.org/stable/auto_examples/mixture/plot_gmm_selection.html#sphx-glr-auto-examples-mixture-plot-gmm-selection-py