AI with Python - Unsupervised Learning: Clustering

Unsupervised machine learning algorithms do not have any supervisor to provide any sort of guidance. That is why they are closely aligned with what some call true artificial intelligence.

In unsupervised learning, there is no correct answer and no teacher to provide guidance. Algorithms need to discover interesting patterns in the data on their own.

What is Clustering?

Basically, it is a type of unsupervised learning method and a common technique for statistical data analysis used in many fields. Clustering is mainly the task of dividing a set of observations into subsets, called clusters, in such a way that observations in the same cluster are similar in one sense and dissimilar to the observations in other clusters. In simple words, we can say that the main goal of clustering is to group the data on the basis of similarity and dissimilarity.

For example, the following diagram shows similar kinds of data grouped into different clusters − (figure omitted)

Algorithms for Clustering the Data

Following are a few common algorithms for clustering the data −

K-Means Algorithm

The K-means clustering algorithm is one of the best-known algorithms for clustering data. It assumes that the number of clusters is already known; this is also called flat clustering. It is an iterative clustering algorithm, and the steps given below need to be followed −

Step 1 − Specify the desired number of clusters, K.

Step 2 − Fix the number of clusters and randomly assign each data point to a cluster; in other words, partition the data into K initial groups.

Step 3 − Next, the centroid of each cluster should be computed.

Step 4 − As this is an iterative algorithm, we need to update the locations of the K centroids with every iteration until the centroids stop moving, i.e. reach their final positions. (K-means is only guaranteed to converge to a local optimum, not the global one.)
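
To see what these steps look like in code, here is a minimal NumPy sketch of the two alternating steps (assignment and centroid update). It is an illustration only, assuming the data X is a NumPy array and that no cluster becomes empty during the iterations; the Scikit-learn implementation used below adds smart initialization and convergence checks on top of this.


import numpy as np

def kmeans_sketch(X, K, n_iter = 10, seed = 0):
   # Step 2: pick K random data points as the initial centroids
   rng = np.random.default_rng(seed)
   centroids = X[rng.choice(len(X), size = K, replace = False)]
   for _ in range(n_iter):
      # Assignment: attach every point to its nearest centroid
      distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis = 2)
      labels = distances.argmin(axis = 1)
      # Update: move each centroid to the mean of its assigned points
      centroids = np.array([X[labels == k].mean(axis = 0) for k in range(K)])
   return labels, centroids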

The following code will help in implementing the K-means clustering algorithm in Python, using the Scikit-learn module.

Let us import the necessary packages −


import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import numpy as np
from sklearn.cluster import KMeans

The following lines of code will help in generating the two-dimensional dataset, containing four blobs, by using make_blobs from the sklearn.datasets package.


# samples_generator was removed in newer Scikit-learn versions; import directly instead
from sklearn.datasets import make_blobs

X, y_true = make_blobs(n_samples = 500, centers = 4,
            cluster_std = 0.40, random_state = 0)

We can visualize the dataset by using the following code −


plt.scatter(X[:, 0], X[:, 1], s = 50);
plt.show()

Here, we initialize kmeans as the KMeans estimator, with its required parameter for the number of clusters (n_clusters).


kmeans = KMeans(n_clusters = 4)

We need to train the K-means model with the input data.


kmeans.fit(X)
y_kmeans = kmeans.predict(X)
plt.scatter(X[:, 0], X[:, 1], c = y_kmeans, s = 50, cmap = 'viridis')

centers = kmeans.cluster_centers_

The code given below will plot the cluster centers that the model has found on top of our data −


plt.scatter(centers[:, 0], centers[:, 1], c = 'black', s = 200, alpha = 0.5);
plt.show()

Mean Shift Algorithm

It is another popular and powerful clustering algorithm used in unsupervised learning. It does not make any assumptions about the shape or number of the clusters; hence it is a non-parametric algorithm. It is also called mode-seeking or mean shift cluster analysis. Following are the basic steps of this algorithm −

  • First of all, we need to start with the data points assigned to a cluster of their own.

  • Now, the algorithm computes the centroids and updates the location of each centroid.

  • By repeating this process, we move closer to the peak of each cluster, i.e. towards the region of higher density.

  • The algorithm stops at the stage where the centroids do not move anymore; a minimal sketch of this update is given right after this list.
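
Here is a minimal sketch of that update for a single point, using a flat (uniform) kernel and an assumed bandwidth value; it is for intuition only. The Scikit-learn MeanShift estimator used below performs this update for every point and then merges the centroids that converge to the same peak.


import numpy as np

def shift_point(point, X, bandwidth = 2.0, n_iter = 20):
   # point should start at one of the data points, so that its
   # neighborhood is never empty. We climb towards the region of
   # higher density by repeatedly moving to the neighborhood mean.
   for _ in range(n_iter):
      neighbors = X[np.linalg.norm(X - point, axis = 1) < bandwidth]
      new_point = neighbors.mean(axis = 0)
      if np.allclose(new_point, point):
         break   # the centroid does not move anymore
      point = new_point
   return point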

With the help of the following code, we implement the Mean Shift clustering algorithm in Python, again using the Scikit-learn module.

Let us import the necessary packages −


import numpy as np
from sklearn.cluster import MeanShift
import matplotlib.pyplot as plt
from matplotlib import style
style.use("ggplot")

The following code will help in generating the two-dimensional dataset, containing four blobs, by using make_blobs from the sklearn.datasets package.


from sklearn.datasets import make_blobs

We can visualize the dataset with the following code −


centers = [[2,2],[4,5],[3,10]]
X, _ = make_blobs(n_samples = 500, centers = centers, cluster_std = 1)
plt.scatter(X[:,0],X[:,1])
plt.show()

Now, we need to train the Mean Shift cluster model with the input data.


ms = MeanShift()
ms.fit(X)
labels = ms.labels_
cluster_centers = ms.cluster_centers_

The following code will print the cluster centers and the estimated number of clusters for the input data −


print(cluster_centers)
n_clusters_ = len(np.unique(labels))
print("Estimated clusters:", n_clusters_)

The output is as follows −

[[ 3.23005036 3.84771893]
[ 3.02057451 9.88928991]]
Estimated clusters: 2

The code given below will plot the data points colored by cluster and mark the cluster centers that were found −


colors = 10*['r.','g.','b.','c.','k.','y.','m.']
for i in range(len(X)):
   plt.plot(X[i][0], X[i][1], colors[labels[i]], markersize = 10)
plt.scatter(cluster_centers[:,0], cluster_centers[:,1],
   marker = "x", color = 'k', s = 150, linewidths = 5, zorder = 10)
plt.show()

Measuring the Clustering Performance

Real-world data is not naturally organized into a number of distinctive clusters. For this reason, it is not easy to visualize the data and draw inferences from it. That is why we need to measure the clustering performance as well as its quality, which can be done with the help of silhouette analysis.

Silhouette Analysis

This method can be used to check the quality of clustering by measuring the distance between the clusters. Basically, it provides a way to assess parameters such as the number of clusters by computing a silhouette score. This score is a metric that measures how close each point in one cluster is to the points in the neighboring clusters.

Analysis of the Silhouette Score

The score has a range of [-1, 1]. Following is the analysis of this score −

  • Score of +1 − A score near +1 indicates that the sample is far away from the neighboring cluster.

  • Score of 0 − A score of 0 indicates that the sample is on, or very close to, the decision boundary between two neighboring clusters.

  • Score of -1 − A negative score indicates that the sample has been assigned to the wrong cluster.

Calculating the Silhouette Score

In this section, we will learn how to calculate the silhouette score.

The silhouette score can be calculated using the following formula −

$$\text{silhouette score} = \frac{p - q}{\max(p, q)}$$

Here, p is the mean distance to the points in the nearest cluster that the data point is not a part of, and q is the mean intra-cluster distance to all the points in its own cluster.
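
As a quick sanity check of the formula, the snippet below plugs in hypothetical values of p and q for a single point; the numbers are made up purely for illustration.


# Hypothetical distances for one data point
p = 2.0   # mean distance to the points of the nearest foreign cluster
q = 0.5   # mean distance to the points of its own cluster

score = (p - q) / max(p, q)
print(score)   # 0.75 -> the point sits comfortably inside its own cluster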

To find the optimal number of clusters, we need to run the clustering algorithm again, this time importing the metrics module from the sklearn package. In the following example, we will run the K-means clustering algorithm to find the optimal number of clusters −

Import the necessary packages as shown −


import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import numpy as np
from sklearn import metrics
from sklearn.cluster import KMeans

With the help of the following code, we will generate the two-dimensional dataset, containing four blobs, by using make_blobs from the sklearn.datasets package.


from sklearn.datasets import make_blobs

X, y_true = make_blobs(n_samples = 500, centers = 4, cluster_std = 0.40, random_state = 0)

Initialize the variables as shown −


scores = []
values = np.arange(2, 10)

We need to iterate the K-means model through all these values and train it with the input data.


for num_clusters in values:
   kmeans = KMeans(init = 'k-means++', n_clusters = num_clusters, n_init = 10)
   kmeans.fit(X)

Now, still inside the loop, estimate the silhouette score for the current clustering model using the Euclidean distance metric −


   score = metrics.silhouette_score(X, kmeans.labels_,
      metric = 'euclidean', sample_size = len(X))

The following lines of code, which complete the loop body, will display the number of clusters as well as the silhouette score, and store the score for later comparison −


print("\nNumber of clusters =", num_clusters)
print("Silhouette score =", score)
scores.append(score)

You will receive output like the following for each value of num_clusters; the last iteration is shown here −


Number of clusters = 9
Silhouette score = 0.340391138371

num_clusters = np.argmax(scores) + values[0]
print('\nOptimal number of clusters =', num_clusters)

Now, the output for the optimal number of clusters would be as follows −


Optimal number of clusters = 2

Finding Nearest Neighbors

If we want to build a recommender system, such as a movie recommender system, then we need to understand the concept of finding the nearest neighbors, because recommender systems utilize exactly this concept.

The concept of finding nearest neighbors may be defined as the process of finding the point closest to the input point in the given dataset. The main use of this KNN (K-nearest neighbors) algorithm is to build classification systems that classify a data point based on the proximity of the input data point to various classes.

The Python code given below helps in finding the K-nearest neighbors of a given data set −

以下给出的Python代码有助于查找给定数据集的K个近邻-

Import the necessary packages as shown below. Here, we are using the NearestNeighbors module from the sklearn package −


import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

Let us now define the input data −


A = np.array([[3.1, 2.3], [2.3, 4.2], [3.9, 3.5], [3.7, 6.4], [4.8, 1.9], 
             [8.3, 3.1], [5.2, 7.5], [4.8, 4.7], [3.5, 5.1], [4.4, 2.9],])

Now, we need to define the number of nearest neighbors, k −


k = 3

We also need to give the test data point for which the nearest neighbors are to be found −


test_data = [3.3, 2.9]

The following code can visualize and plot the input data defined above −


plt.figure()
plt.title('Input data')
plt.scatter(A[:,0], A[:,1], marker = 'o', s = 100, color = 'black')

Now, we need to build the K nearest neighbors model; the object also needs to be trained on the input data −


knn_model = NearestNeighbors(n_neighbors = k, algorithm = 'auto').fit(A)
distances, indices = knn_model.kneighbors([test_data])

Now, we can print the K nearest neighbors as follows −


print("\nK Nearest Neighbors:")
for rank, index in enumerate(indices[0][:k], start = 1):
   print(str(rank) + " is", A[index])

We can visualize the nearest neighbors along with the test data point −


plt.figure()
plt.title('Nearest neighbors')
plt.scatter(A[:, 0], A[:, 1], marker = 'o', s = 100, color = 'k')
plt.scatter(A[indices][0][:][:, 0], A[indices][0][:][:, 1],
   marker = 'o', s = 250, color = 'k', facecolors = 'none')
plt.scatter(test_data[0], test_data[1],
   marker = 'x', s = 100, color = 'k')
plt.show()

Output

K Nearest Neighbors


1 is [ 3.1 2.3]
2 is [ 3.9 3.5]
3 is [ 4.4 2.9]

K-Nearest Neighbors Classifier

A K-Nearest Neighbors (KNN) classifier is a classification model that uses the nearest neighbors algorithm to classify a given data point. We implemented the nearest neighbors algorithm in the last section; now we are going to build a KNN classifier using that algorithm.

Concept of KNN Classifier

The basic concept of K-nearest neighbor classification is to find a predefined number 'k' of training samples that are closest in distance to a new sample, which then has to be classified. The new sample gets its label from those neighbors. KNN classifiers use a fixed, user-defined constant for the number of neighbors that have to be considered. For the distance, the standard Euclidean distance is the most common choice. The KNN classifier works directly on the learned samples rather than creating rules for learning. The KNN algorithm is among the simplest of all machine learning algorithms, and it has been quite successful in a large number of classification and regression problems, for example, character recognition or image analysis.
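
The core of this idea fits in a few lines. The following is a minimal sketch, assuming the training data is a NumPy array with one row per sample; it classifies one new sample by a majority vote among its k nearest training points under Euclidean distance. The Scikit-learn classifier used in the example below adds efficient neighbor search, distance weighting options, and tie-breaking on top of this.


import numpy as np
from collections import Counter

def knn_classify(train_x, train_y, sample, k = 3):
   # Euclidean distance from the new sample to every training point
   distances = np.linalg.norm(train_x - sample, axis = 1)
   # Labels of the k training points closest to the sample
   nearest_labels = train_y[np.argsort(distances)[:k]]
   # Majority vote: the most common label among the neighbors wins
   return Counter(nearest_labels).most_common(1)[0][0]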

Example

We are building a KNN classifier to recognize digits. For this, we will use Scikit-learn's handwritten digits dataset (load_digits). We will write this code in a Jupyter notebook.

Import the necessary packages as shown below.

Here we are using the KNeighborsClassifier module from the sklearn.neighbors package −


from sklearn.datasets import load_digits
import pandas as pd
%matplotlib inline
from sklearn.neighbors import KNeighborsClassifier
import matplotlib.pyplot as plt
import numpy as np

The following code will display the image of a digit so that we can verify which image we are testing −


def Image_display(i):
   plt.imshow(digit['images'][i],cmap = 'Greys_r')
   plt.show()

Now, we need to load the handwritten digits dataset. There are 1797 images in total; we will use the first 1600 images as training samples and keep the remaining 197 for testing purposes.


digit = load_digits()
digit_d = pd.DataFrame(digit['data'][0:1600])

Now, on displaying the images we can see the output as follows −


Image_display(0)

Image of 0 is displayed as follows − (figure omitted)

Image_display(9)

Image of 9 is displayed as follows − (figure omitted)

Now, we need to create the training and testing data set and supply testing data set to the KNN classifiers.


train_x = digit['data'][:1600]
train_y = digit['target'][:1600]
KNN = KNeighborsClassifier(20)
KNN.fit(train_x,train_y)

Fitting the model produces the following output, which shows the constructor parameters of the K nearest neighbor classifier −


KNeighborsClassifier(algorithm = 'auto', leaf_size = 30, metric = 'minkowski',
   metric_params = None, n_jobs = 1, n_neighbors = 20, p = 2,
   weights = 'uniform')

We need to create the testing sample by picking any arbitrary index greater than 1600, since the first 1600 images were the training samples −


test = np.array(digit['data'][1725])
test1 = test.reshape(1,-1)
Image_display(1725)

The image at index 1725 is a 6, and it is displayed as follows − (figure omitted)

Now we will predict the test data as follows −


KNN.predict(test1)

The above code will generate the following output −


array([6])

Now, consider the following −


digit['target_names']

The above code will generate the following output −


array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

Translated from: https://www.tutorialspoint/artificial_intelligence_with_python/artificial_intelligence_with_python_unsupervised_learning_clustering.htm
