AI Movies Recommendation System with Clustering-Based K-Means Algorithm

In this article, we’ll build an artificial intelligence movie recommendation system using the k-means algorithm, which is a clustering algorithm. We’ll recommend movies to users which are most relevant to them based on their previous history. We’ll only import data where users have rated movies 4+, as we want to recommend only those movies which users like most. Throughout this article, we use the Python programming language with its associated libraries, i.e., NumPy, Pandas, Matplotlib and Scikit-Learn. Moreover, we suppose that the reader is familiar with Python and the aforementioned libraries.

Introduction to AI Movies Recommendation System

In today’s busy life, people don’t have time to search for the items they want; they would rather have them delivered to their table with little effort. So, the recommendation system has become an important tool for helping them make the right choice and for growing a product. Data is increasing day by day, and in this era of such large databases it has become difficult to find the items most relevant to our interests, because often we can’t search for an item of interest with just a title, and sometimes it is even harder than that. So, a recommendation system helps us provide the most relevant items to each individual in our database.

In this article, we’ll build a movie recommendation system. A recommendation system has become an essential part of any movie website, because an individual can’t know which movies would interest him from just a title or genre. Sometimes an individual likes action movies, but he/she will not always like every action movie. To handle this problem, many authors have provided a better way: recommend movies to user 1 from the watch list or favorite movies of another user 2 whose movie history is most similar to user 1’s. That is, if two people have the same taste, each of them will probably like the other’s favorites. Many tech giants, such as YouTube and Netflix, use such recommendation systems in their applications.

For this task, machine learning (ML) models help us a lot in building such recommendation systems based on users’ previous watch history. ML models learn from users’ watch history and categorize users into groups with the same taste. Different types of ML models have been used, such as clustering algorithms, deep learning models, etc.

K-Means Clustering Algorithm

K-means is an unsupervised machine learning algorithm which can be used to categorize data into different groups. In this article we’ll use this algorithm to categorize users based on their 4+ ratings of movies. I’ll not describe the background mathematics of this algorithm, but I’ll give a little intuition for it. If you want to understand the mathematical background, I suggest searching for it on Google; many authors have written articles on it. Since the complete mathematics behind this algorithm is handled by the Scikit-Learn library, we will only understand and apply it.
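As a quick preview of the Scikit-Learn interface we’ll rely on later in the article, here is a minimal sketch on toy 2-D points (the data is made up purely for illustration):

from sklearn.cluster import KMeans
import numpy as np

# Toy 2-D points: two obvious groups
X = np.array([[1.0, 2.0], [1.5, 1.8], [8.0, 8.0], [8.2, 7.9]])
kmeans = KMeans(n_clusters = 2, init = 'k-means++', n_init = 10, random_state = 0)
labels = kmeans.fit_predict(X)   # cluster index assigned to each point
print(labels)                    # e.g. [1 1 0 0]
print(kmeans.inertia_)           # WCSS, used later for the elbow method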

Note: The plots of data in this section are generated randomly and serve only to build intuition for the k-means algorithm.

Suppose that we have 2-dimensional data in the form (x₁, x₂), and that we have plotted it in Figure (1). Next, we want to divide this data into groups. If we take a look at the data, we can observe that it can be divided into three groups. In this plot, which is designed only for intuition, anyone can observe that we can divide the data into three groups. But sometimes we have very complex and big data, or the data is 3-dimensional, 4-dimensional, or more generally 100- or 1000-dimensional or even higher. Then it is not possible for a human to categorize such data, and we can’t even plot such high-dimensional data. Also, sometimes we don’t know the optimal number of clusters we should have for our data. So, we use clustering algorithms which can work on such big data, even with thousands of dimensions, and there are methods which can be used to find the optimal number of clusters.

Figure 2 — Scatter Plot After K-Means Clustering

Figure (2) shows a demonstration of k-means clustering. The data of Figure (1) has been categorized into three groups and is presented in Figure (2) with a unique color for each group.

A question arises: how does k-means actually categorize the data?

To categorize data into groups containing the same type of items, the k-means algorithm follows six steps. Figure (3) presents the steps which the k-means algorithm follows to categorize data.

Figure 3 — Graphical Abstract of K-Means Algorithm

Figure (3) describes the following steps of the k-means algorithm; a minimal code sketch of the same loop follows the list.

  1. Firstly, we have to select the number of clusters we want for our dataset. Later, the elbow method will be explained for selecting the optimal number of clusters.

  2. Then, we have to select k random points, called centroids, which do not necessarily come from our dataset. To avoid the random initialization trap, which can get stuck in bad clusters, we’ll use k-means++ to initialize the k centroids; it is provided by Scikit-Learn’s k-means implementation.

  3. The k-means algorithm assigns each data point to its closest centroid, which finally gives us k clusters.

  4. Each centroid is then re-centered to the position which is the actual centroid (mean) of its own cluster, becoming the new centroid.

  5. The algorithm resets all clusters and again assigns each data point to its new closest centroid.

  6. If the new clusters are the same as the previous ones OR the total number of iterations has completed, it stops and gives us the final clusters of our dataset. Otherwise, it moves again to step 4.
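To make these steps concrete, here is a minimal NumPy sketch of the same loop (it uses plain random initialization instead of k-means++, and assumes no cluster ever becomes empty):

import numpy as np

def kmeans_sketch(X, k, max_iters = 100, seed = 0):
    # Steps 1-2: choose k and initialise centroids (plain random here, not k-means++)
    rng = np.random.RandomState(seed)
    centroids = X[rng.choice(len(X), size = k, replace = False)]
    for _ in range(max_iters):                      # step 6: iteration cap
        # Steps 3/5: assign every point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis = 2)
        labels = dists.argmin(axis = 1)
        # Step 4: re-centre each centroid to the mean of its own cluster
        # (assumes no cluster ever becomes empty)
        new_centroids = np.array([X[labels == j].mean(axis = 0) for j in range(k)])
        if np.allclose(new_centroids, centroids):   # step 6: stop when stable
            break
        centroids = new_centroids
    return labels, centroids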

Elbow Method

The elbow method is a standard way to find the optimal number of clusters. For this, we need the within-cluster sum of squares (WCSS). WCSS is the sum of the squared distances of each point from its centroid, and its mathematical formula is the following:

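$$\mathrm{WCSS} = \sum_{i=1}^{K} \sum_{j=1}^{N_i} \left\lVert P_{i,j} - C_i \right\rVert^{2}$$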
where K is the total number of clusters, Nᵢ is the size of the i-th cluster (that is, the number of data points in the i-th cluster), Cᵢ is the centroid of the i-th cluster, and Pᵢ,ⱼ is the j-th data point of the i-th cluster.

So, what will we do with WCSS?

WCSS tells us how far the centroids are from their data points. As we increase the number of clusters, WCSS becomes smaller, and after some value of K it decreases only slowly; we stop there and choose that as the optimal number of clusters. I suggest searching Google for the elbow method to see clearer examples of it. Here is a figure for the intuition of the elbow method.

Figure 4 — Elbow Method Plot

A demonstration of the elbow method is shown in Figure (4). We can observe that as the number of clusters K moves from 1 to 5, the WCSS value decreases rapidly, from approximately 2500 to 400. But from cluster number 6 onward it decreases slowly. So, here we can make the judgment that 5 clusters are good for our dataset. Further, since the curve looks like an elbow, the joint of the elbow gives the optimal number of clusters, which in this case is 5. Later we’ll see that we don’t always have such a smooth curve, so in this work I describe another way to observe changes in WCSS and find the optimal number of clusters.
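To make this concrete, here is a small sketch (toy blob data, made up for illustration) that prints the WCSS for k = 1…10 so you can spot the bend numerically:

import numpy as np
from sklearn.cluster import KMeans

# Toy data: three loose groups of 2-D points (made up for illustration)
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(loc, 0.5, size = (50, 2)) for loc in ([0, 0], [5, 5], [0, 5])])

for k in range(1, 11):
    km = KMeans(n_clusters = k, init = 'k-means++', n_init = 10, random_state = 0).fit(X)
    print(k, km.inertia_)  # inertia_ is the WCSS; it drops sharply up to k=3, then slowly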

Methodology Used in this Article

In this article, we’ll build a clustering-based algorithm to categorize users into groups of the same interest using k-means. We will use data where users have rated movies 4+, on the supposition that if a user rates a movie 4+, he/she probably likes it. We downloaded The Movies Dataset, a MovieLens dataset, from Kaggle. In the following sections, we describe the whole project: Importing the Dataset -> Data Engineering -> Building the K-Means Clustering Model -> Analyzing the Optimal Number of Clusters -> Training the Model and Predicting -> Fixing Clusters -> Saving the Training -> Finally, Making Recommendations for Users. The complete movie recommendation system project can be downloaded from my GitHub repository AI Movies Recommendation System Based on K-means Clustering Algorithm. A Jupyter notebook of this article is also provided in the repository; you can download it and play with it.

URL: https://github.com/asdkazmi/AI-Movies-Recommendation-System-K-Means-Clustering
URL: https://www.kaggle.com/rounakbanik/the-movies-dataset?select=ratings.csv

Now let’s start working on the code:

Importing All Required Libraries

import pandas as pd
print('Pandas version: ', pd.__version__)
import numpy as np
print('NumPy version: ', np.__version__)
import matplotlib
print('Matplotlib version: ', matplotlib.__version__)
from matplotlib import pyplot as plt
import sklearn
print('Scikit-Learn version: ', sklearn.__version__)
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans
import pickle
print('Pickle version: ', pickle.format_version)
import sys
print('Sys version: ', sys.version[0:5])
from sys import exc_info
import ast

Out:

Pandas version:  0.25.1
NumPy version: 1.16.5
Matplotlib version: 3.1.1
Scikit-Learn version: 0.21.3
Pickle version: 4.0
Sys version: 3.7.4

Data Engineering

This section is divided into two subsections. First, we will import the data and reduce it to a sub-DataFrame, so that we can focus more on our model and see what kinds of movies a user has rated and what kind of recommendations he gets based on that. Second, we’ll perform feature engineering so that we have the data in a form which is valid for the machine learning algorithm.

Preparing Data for Model

We have downloaded the MovieLens dataset from Kaggle. First we’ll import the ratings dataset, because we want users’ ratings of movies; then we’ll filter the data to keep only 4+ ratings.

ratings = pd.read_csv('./Prepairing Data/From Data/ratings.csv', usecols = ['userId', 'movieId','rating'])
print('Shape of ratings dataset is: ',ratings.shape, '\n')
print('Max values in dataset are \n',ratings.max(), '\n')
print('Min values in dataset are \n',ratings.min(), '\n')

Out:

Shape of ratings dataset is:  (26024289, 3) 
Max values in dataset are
userId 270896.0
movieId 176275.0
rating 5.0
dtype: float64
Min values in dataset are
userId 1.0
movieId 1.0
rating 0.5
dtype: float64

Next, we’ll filter this dataset to keep only 4+ ratings.

# Filtering data for only 4+ ratings
ratings = ratings[ratings['rating'] >= 4.0]
print('Shape of ratings dataset is: ',ratings.shape, '\n')
print('Max values in dataset are \n',ratings.max(), '\n')
print('Min values in dataset are \n',ratings.min(), '\n')

Out:

Shape of ratings dataset is:  (12981742, 3) 
Max values in dataset are
userId 270896.0
movieId 176271.0
rating 5.0
dtype: float64
Min values in dataset are
userId 1.0
movieId 1.0
rating 4.0
dtype: float64

So, now the minimum rating given by users is 4.0, and the dataset has been reduced from 2.6×10⁷ rows to 1.2×10⁷, which is less than half of the original dataset. But the dataset is still large and we want to reduce it more.

For the purposes of this article, I want to work on a small dataset. So we will now take a subset of this dataset covering only the first 200 movies. Later, when we reduce it further to the first 100 users, we may be left with fewer than 200 movies that were rated by those users, and we want to work with around 100 movies.

movies_list = np.unique(ratings['movieId'])[:200]
ratings = ratings.loc[ratings['movieId'].isin(movies_list)]
print('Shape of ratings dataset is: ',ratings.shape, '\n')
print('Max values in dataset are \n',ratings.max(), '\n')
print('Min values in dataset are \n',ratings.min(), '\n')

Out:

Shape of ratings dataset is:  (776269, 3) 
Max values in dataset are
userId 270896.0
movieId 201.0
rating 5.0
dtype: float64
Min values in dataset are
userId 1.0
movieId 1.0
rating 4.0
dtype: float64

The dataset is still large, so we take another subset of the ratings by extracting them not for all users but for some, i.e., the first 100 users.

users_list = np.unique(ratings['userId'])[:100]
ratings = ratings.loc[ratings['userId'].isin(users_list)]
print('Shape of ratings dataset is: ',ratings.shape, '\n')
print('Max values in dataset are \n',ratings.max(), '\n')
print('Min values in dataset are \n',ratings.min(), '\n')
print('Total Users: ', np.unique(ratings['userId']).shape[0])
print('Total Movies which are rated by 100 users: ', np.unique(ratings['movieId']).shape[0])

Out:

Shape of ratings dataset is:  (447, 3) 
Max values in dataset are
userId 157.0
movieId 198.0
rating 5.0
dtype: float64
Min values in dataset are
userId 1.0
movieId 1.0
rating 4.0
dtype: float64
Total Users: 100
Total Movies which are rated by 100 users: 83

And finally, it’s done. We have a dataset of shape (447, 3), which includes 4+ ratings of 83 movies by 100 users. We started with 200 movies, but when we extracted only the first 100 users, it turned out that 117 of those movies were not rated by the first 100 users.

We are no longer concerned with the ratings column; we have supposed that every movie rated 4+ by a user is of interest to him/her. So, if a movie interests user 1, then that movie will also interest another user 2 of the same taste. Now we can drop this column, as every remaining movie is a favorite of the corresponding user.

users_fav_movies = ratings.loc[:, ['userId', 'movieId']]

Since we filtered the DataFrame, the index may not be in proper order anymore. Now we want to reset the index.

users_fav_movies = users_fav_movies.reset_index(drop = True)

And finally, here is our final DataFrame of the first 100 users’ favorite movies from the list of the first 200 movies. The DataFrame below is printed transposed.

users_fav_movies.T

Now, let’s save this DataFrame to a csv file locally, so that we can use it later.

users_fav_movies.to_csv('./Prepairing Data/From Data/filtered_ratings.csv')

Data Featuring

In this section, we will create the sparse matrix which we’ll use in k-means. For this, let’s define a function which returns the movies list of each user in the dataset.

def moviesListForUsers(users, users_data):
    # users = a list of user IDs
    # users_data = a DataFrame of users' favourite movies or users' watched movies
    users_movies_list = []
    for user in users:
        # Build a comma-separated string of the user's movie IDs, e.g. '1, 47'
        users_movies_list.append(str(list(users_data[users_data['userId'] == user]['movieId'])).split('[')[1].split(']')[0])
    return users_movies_list

The method moviesListForUsers returns a list which contains one string per user, holding that user’s favorite movie IDs. Later we will use CountVectorizer to extract features from these strings containing lists of movies.

Note: The method moviesListForUsers returns the list in the same order as the users list. So, to avoid any mix-up, we will keep the users list in sorted order.

In the method defined above, we need a list of users and the users_data DataFrame. users_data is the DataFrame we already have, so now let’s prepare the users list.

users = np.unique(users_fav_movies['userId'])
print(users.shape)

Out:

(100,)

Now, let’s prepare the list of movies for each user.

users_movies_list = moviesListForUsers(users, users_fav_movies)
print('Movies list for', len(users_movies_list), ' users')
print('A list of first 10 users favourite movies: \n', users_movies_list[:10])

Out:

Movies list for 100  users
A list of first 10 users favourite movies:
['147', '64, 79', '1, 47', '1, 150', '150, 165', '34', '1, 16, 17, 29, 34, 47, 50, 82, 97, 123, 125, 150, 162, 175, 176, 194', '6', '32, 50, 111, 198', '81']

Above is the list of the first 10 users’ favorite movies. The first string contains the first user’s favorite movie IDs, the second the second user’s, and so on. It looks like the 7th user’s list of favorite movies is larger than the others’.

Now, we’ll prepare a sparse matrix with one row per user and one column per movie.

If the user has watched the movie, the entry is 1; otherwise, it is 0.

Let us first define a function for the sparse matrix.

def prepSparseMatrix(list_of_str):
    # list_of_str = a list which contains strings of users' favourite movies, separated by comma ","
    # It will return the sparse matrix and the feature names on which the sparse matrix is defined,
    # i.e. the movie IDs in the same order as the columns of the sparse matrix
    cv = CountVectorizer(token_pattern = r'[^\,\ ]+', lowercase = False)
    sparseMatrix = cv.fit_transform(list_of_str)
    return sparseMatrix.toarray(), cv.get_feature_names()
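For intuition, here is how this function behaves on a toy input of two users’ favourite-movie strings (the IDs are hypothetical):

matrix, names = prepSparseMatrix(['1, 47', '47, 150'])
print(names)   # ['1', '150', '47'] -- feature names are sorted as strings
print(matrix)  # [[1 0 1]
               #  [0 1 1]]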

Now, let’s prepare the sparse matrix.

sparseMatrix, feature_names = prepSparseMatrix(users_movies_list)

Now let’s put it into a DataFrame for a clearer presentation. The format will be: the columns present the movies, and the index presents the user IDs.

df_sparseMatrix = pd.DataFrame(sparseMatrix, index = users, columns = feature_names)
df_sparseMatrix

Now, let’s verify that the matrix we defined above is exactly as we want it. We’ll check it for some users.

Let’s take a look at some users’ favorite movies lists.

first_6_users_SM = users_fav_movies[users_fav_movies['userId'].isin(users[:6])].sort_values('userId')
first_6_users_SM.T

Now, let’s check that the users with the above IDs have the value 1 in the columns of their favorite movies and 0 otherwise. Remember that in the sparse-matrix DataFrame df_sparseMatrix, the indexes are user IDs.

df_sparseMatrix.loc[np.unique(first_6_users_SM['userId']), list(map(str, np.unique(first_6_users_SM['movieId'])))]

We can observe from the above two DataFrames that our sparse matrix is correct and has its values in the proper places. As we are done with data engineering, let’s now create our machine learning clustering model with the k-means algorithm.

Clustering Model

To cluster the data, first of all we need to find the optimal number of clusters. For this purpose, we will define an object for the elbow method which contains two functions: one for running the k-means algorithm for different numbers of clusters, and another for showing the plot.

class elbowMethod():
    def __init__(self, sparseMatrix):
        self.sparseMatrix = sparseMatrix
        self.wcss = list()
        self.differences = list()
    def run(self, init, upto, max_iterations = 300):
        for i in range(init, upto + 1):
            kmeans = KMeans(n_clusters = i, init = 'k-means++', max_iter = max_iterations, n_init = 10, random_state = 0)
            kmeans.fit(self.sparseMatrix)
            self.wcss.append(kmeans.inertia_)
        self.differences = list()
        for i in range(len(self.wcss) - 1):
            self.differences.append(self.wcss[i] - self.wcss[i + 1])
    def showPlot(self, boundary = 500, upto_cluster = None):
        if upto_cluster is None:
            WCSS = self.wcss
            DIFF = self.differences
        else:
            WCSS = self.wcss[:upto_cluster]
            DIFF = self.differences[:upto_cluster - 1]
        plt.figure(figsize = (15, 6))
        plt.subplot(121).set_title('Elbow Method Graph')
        plt.plot(range(1, len(WCSS) + 1), WCSS)
        plt.grid(b = True)
        plt.subplot(122).set_title('Differences in Each Two Consecutive Clusters')
        len_differences = len(DIFF)
        X_differences = range(1, len_differences + 1)
        plt.plot(X_differences, DIFF)
        # Boundary lines at +boundary and -boundary
        plt.plot(X_differences, np.ones(len_differences) * boundary, 'r')
        plt.plot(X_differences, np.ones(len_differences) * (-boundary), 'r')
        plt.grid()
        plt.show()

Why do we write the elbow method as an object?

Since we don’t know in advance at which point we will find the elbow (i.e., the optimal number of clusters), we write it as an object so that the WCSS values are stored in an attribute and are not lost. First we may run the elbow method for cluster numbers 1–10, and when we plot it we may find that we haven’t reached the elbow joint yet and need to run it further. Then we can run the same object instance for 11–20 and so on, until we get the elbow joint. This way we save the time of re-running it from 1–20, and we don’t lose the data of the previous run.

You may observe that in the class method showPlot above, I have drawn two plots. Here I’m going to use another strategy for when we can’t observe a clear elbow: the difference between each two consecutive WCSS values. We can set a boundary for a clearer view of the change in WCSS; that is, when the changes in the WCSS value remain inside our required boundary, we say that we have found the elbow, after which the changes are small. See the plots below.

Now, first we analyze clusters 1–10 with a boundary of 10; i.e., when the changes in the WCSS value remain inside the boundary, we’ll say that we have found the elbow, after which the change is small.

Remember that the DataFrame df_sparseMatrix was only for the presentation of sparseMatrix. For the algorithm, we always use the matrix sparseMatrix itself.

Let’s first create an instance of the elbow method on our defined sparseMatrix.

elbow_method = elbowMethod(sparseMatrix)

Now, first we will run it for 1–10 clusters; i.e., k-means will first run for k=1 clusters, then for k=2, and so on up to k=10.

elbow_method.run(1, 10)
elbow_method.showPlot(boundary = 10)

Since we don’t have any clear elbow yet, and the differences are not yet inside the boundary, let’s now run it for clusters 11–30.

elbow_method.run(11, 30)
elbow_method.showPlot(boundary = 10)

What happened?

We still don’t have an elbow, but we do have the boundary in the differences graph. If we look at the differences graph, we observe that after cluster 14 the differences stay almost entirely inside the boundary. So, we will run k-means with 15 clusters, because the 14th difference is the difference between k=14 and k=15. We are now done analyzing the optimal number of clusters k; let’s move on to fitting the model and making recommendations.

Fitting Data on Model

Now let’s first create the same k-means model and run it to make predictions.

kmeans = KMeans(n_clusters=15, init = 'k-means++', max_iter = 300, n_init = 10, random_state = 0)
clusters = kmeans.fit_predict(sparseMatrix)

Now, let’s create a DataFrame where we can see each user’s cluster number.

users_cluster = pd.DataFrame(np.concatenate((users.reshape(-1,1), clusters.reshape(-1,1)), axis = 1), columns = ['userId', 'Cluster'])
users_cluster.T

Now we’ll define a function which creates a list of DataFrames, where each DataFrame contains the movieId and the Count for that movie (count: the number of users who have that movie in their favorite list). A movie with a higher count will be of more interest to the other users in the cluster who have not watched it yet. That is, we’ll create a list of the form [dataframe_for_Cluster_1, dataframe_for_Cluster_2, ..., dataframe_for_Cluster_N], where each DataFrame has the following format.
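For intuition, here is the shape of such a DataFrame (the two counts shown are the top entries we will actually see for one of the clusters below):

   movieId  Count
0        1     19
1      150      8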

The third column, Count, represents the total number of users in the cluster who have watched that particular movie. So, we sort the movies by their count in order to prioritize the movies which are most watched, and thus most favored, by the users in the cluster.

Now we want to create the list of all users’ movies in each cluster. For this, first we’ll define a method for creating the clusters’ movies.

def clustersMovies(users_cluster, users_data):
    clusters = list(users_cluster['Cluster'])
    each_cluster_movies = list()
    for i in range(len(np.unique(clusters))):
        users_list = list(users_cluster[users_cluster['Cluster'] == i]['userId'])
        users_movies_list = list()
        for user in users_list:
            users_movies_list.extend(list(users_data[users_data['userId'] == user]['movieId']))
        users_movies_counts = list()
        users_movies_counts.extend([[movie, users_movies_list.count(movie)] for movie in np.unique(users_movies_list)])
        each_cluster_movies.append(pd.DataFrame(users_movies_counts, columns=['movieId', 'Count']).sort_values(by = ['Count'], ascending = False).reset_index(drop=True))
    return each_cluster_movies

cluster_movies = clustersMovies(users_cluster, users_fav_movies)

Now, let’s take a look at one of the DataFrames in cluster_movies.

cluster_movies[1].T

We have 30 movies in cluster 1, where the movie with ID 1 is a favorite of 19 users and has the top priority, followed by the movie with ID 150, which is a favorite of 8 users.

Now, let’s see how many users we have in each cluster.

for i in range(15):
    len_users = users_cluster[users_cluster['Cluster'] == i].shape[0]
    print('Users in Cluster ' + str(i) + ' -> ', len_users)

Out:

Users in Cluster 0 ->  35
Users in Cluster 1 -> 19
Users in Cluster 2 -> 1
Users in Cluster 3 -> 5
Users in Cluster 4 -> 8
Users in Cluster 5 -> 1
Users in Cluster 6 -> 12
Users in Cluster 7 -> 2
Users in Cluster 8 -> 1
Users in Cluster 9 -> 1
Users in Cluster 10 -> 1
Users in Cluster 11 -> 11
Users in Cluster 12 -> 1
Users in Cluster 13 -> 1
Users in Cluster 14 -> 1

As we can see, there are clusters which contain only 1, 2 or 5 users. We don’t want such small clusters, because we can’t recommend enough movies from them: a user alone in a cluster of size one will not get any movie recommendations at all, and even a user in a cluster of size 2 will not get enough. So, we have to fix such small clusters.

Fixing Small Clusters

There are several clusters which contain only a few users. We don’t want any user to be alone in a cluster; let’s say we want at least 6 users in each cluster. So we have to move the users of the small clusters into the large cluster whose movies are most relevant to each of them.

First of all, we’ll write a function to get a user’s favorite movies list.

def getMoviesOfUser(user_id, users_data):
    return list(users_data[users_data['userId'] == user_id]['movieId'])

Now, we’ll define a function for fixing the clusters.

def fixClusters(clusters_movies_dataframes, users_cluster_dataframe, users_data, smallest_cluster_size = 11):
    # clusters_movies_dataframes: a list containing the movies DataFrame of each cluster
    # users_cluster_dataframe: a DataFrame containing user IDs and their cluster numbers
    # smallest_cluster_size: the smallest cluster size we want to keep
    each_cluster_movies = clusters_movies_dataframes.copy()
    users_cluster = users_cluster_dataframe.copy()
    # Convert each DataFrame in each_cluster_movies to a list containing only movie IDs
    each_cluster_movies_list = [list(df['movieId']) for df in each_cluster_movies]
    # First we prepare a list which contains the users of each cluster -> [[Cluster 0 Users], [Cluster 1 Users], ..., [Cluster N Users]]
    usersInClusters = list()
    total_clusters = len(each_cluster_movies)
    for i in range(total_clusters):
        usersInClusters.append(list(users_cluster[users_cluster['Cluster'] == i]['userId']))
    uncategorizedUsers = list()
    i = 0
    # Now we remove the small clusters and put their users into another list named "uncategorizedUsers".
    # When we remove a cluster, we also have to shift down the cluster numbers of the users which come after the deleted cluster.
    # E.g. if we deleted cluster 4, there will be users whose clusters are 5, 6, 7, ..., N. We bring those users' cluster numbers back to 4, 5, 6, ..., N-1.
    for j in range(total_clusters):
        if len(usersInClusters[i]) < smallest_cluster_size:
            uncategorizedUsers.extend(usersInClusters[i])
            usersInClusters.pop(i)
            each_cluster_movies.pop(i)
            each_cluster_movies_list.pop(i)
            users_cluster.loc[users_cluster['Cluster'] > i, 'Cluster'] -= 1
            i -= 1
        i += 1
    # Re-assign each uncategorized user to the remaining cluster that covers the largest fraction of his/her movies
    for user in uncategorizedUsers:
        elemProbability = list()
        user_movies = getMoviesOfUser(user, users_data)
        if len(user_movies) == 0:
            print(user)
        user_missed_movies = list()
        for movies_list in each_cluster_movies_list:
            count = 0
            missed_movies = list()
            for movie in user_movies:
                if movie in movies_list:
                    count += 1
                else:
                    missed_movies.append(movie)
            elemProbability.append(count / len(user_movies))
            user_missed_movies.append(missed_movies)
        user_new_cluster = np.array(elemProbability).argmax()
        users_cluster.loc[users_cluster['userId'] == user, 'Cluster'] = user_new_cluster
        # Add the user's movies that were missing from the new cluster, each with Count 1
        if len(user_missed_movies[user_new_cluster]) > 0:
            each_cluster_movies[user_new_cluster] = each_cluster_movies[user_new_cluster].append([{'movieId': new_movie, 'Count': 1} for new_movie in user_missed_movies[user_new_cluster]], ignore_index = True)
    return each_cluster_movies, users_cluster

Now, run it.

movies_df_fixed, clusters_fixed = fixClusters(cluster_movies, users_cluster, users_fav_movies, smallest_cluster_size = 6)

To observe the effect of fixing the clusters, first take a look at the data we had before, and then at the data after fixing.

First, we’ll print the clusters which contain at most 5 users.

j = 0
for i in range(15):
    len_users = users_cluster[users_cluster['Cluster'] == i].shape[0]
    if len_users < 6:
        print('Users in Cluster ' + str(i) + ' -> ', len_users)
        j += 1
print('Total Cluster which we want to remove -> ', j)

Out:

Users in Cluster 2 ->  1
Users in Cluster 3 -> 5
Users in Cluster 5 -> 1
Users in Cluster 7 -> 2
Users in Cluster 8 -> 1
Users in Cluster 9 -> 1
Users in Cluster 10 -> 1
Users in Cluster 12 -> 1
Users in Cluster 13 -> 1
Users in Cluster 14 -> 1
Total Cluster which we want to remove -> 10

Now, look at the users’ cluster DataFrame.

print('Length of total clusters before fixing is -> ', len(cluster_movies))
print('Max value in users_cluster dataframe column Cluster is -> ', users_cluster['Cluster'].max())
print('And dataframe is following')
users_cluster.T

Out:

Length of total clusters before fixing is ->  15
Max value in users_cluster dataframe column Cluster is -> 14
And dataframe is following

So, we want the max value in the Cluster column to be 4 (counting from 0), as we’ll remove the 10 smallest clusters and have 5 clusters remaining.

Now, let’s see what happened after fixing the data.

We want all 10 of those small clusters removed, and the users_cluster DataFrame shouldn’t contain any user with an invalid cluster number.

print('Length of total clusters after fixing is -> ', len(movies_df_fixed))
print('Max value in users_cluster dataframe column Cluster is -> ', clusters_fixed['Cluster'].max())
print('And fixed dataframe is following')
clusters_fixed.T

Out:

Length of total clusters after fixing is ->  5
Max value in users_cluster dataframe column Cluster is -> 4
And fixed dataframe is following

Now let’s see what happened when the 10 clusters were deleted, and how the cluster numbers of the remaining users, who were already in large clusters, were adjusted.

Let’s take a look at the users of the 11th cluster. The 11th cluster already contained enough users (11 users), so we didn’t want to delete it. But now we only have 5 clusters, and the max value of the Cluster column is 4, so what actually happened to cluster 11? There were 7 clusters before cluster no. 11 which were small and removed, so the value 11 should now have been brought back to 4.

print('Users cluster dataFrame for cluster 11 before fixing:')
users_cluster[users_cluster['Cluster'] == 11].T

Out:

Users cluster dataFrame for cluster 11 before fixing:

Now let’s look at cluster 4 after fixing.

print('Users cluster dataFrame for cluster 4 after fixing which should be same as 11th cluster before fixing:')
clusters_fixed[clusters_fixed['Cluster'] == 4].T

Out:

Users cluster dataFrame for cluster 4 after fixing which should be same as 11th cluster before fixing:

Both DataFrames contain the same user IDs, so we didn’t disturb any cluster; and similarly, the same holds for the list of movies DataFrames of each cluster.

Now let’s take a look at the list of movies DataFrames.

print('Size of movies dataframe after fixing -> ', len(movies_df_fixed))

Out:

Size of movies dataframe after fixing ->  5

Now, let’s look at the sizes of the clusters.

for i in range(len(movies_df_fixed)):
    len_users = clusters_fixed[clusters_fixed['Cluster'] == i].shape[0]
    print('Users in Cluster ' + str(i) + ' -> ', len_users)

Out:

Users in Cluster 0 ->  45
Users in Cluster 1 -> 21
Users in Cluster 2 -> 8
Users in Cluster 3 -> 15
Users in Cluster 4 -> 11

Each cluster now contains enough users that we can make recommendations to its members. Let’s take a look at the size of each cluster’s movies list.

for i in range(len(movies_df_fixed)):
    print('Total movies in Cluster ' + str(i) + ' -> ', movies_df_fixed[i].shape[0])

Out:

Total movies in Cluster 0 ->  64
Total movies in Cluster 1 -> 39
Total movies in Cluster 2 -> 15
Total movies in Cluster 3 -> 50
Total movies in Cluster 4 -> 25

We have now finished training the k-means machine learning model, making cluster predictions for each user, and fixing some issues. Finally, we need to store this training so that we can use it later. For this, we will use the Pickle library to save and load trainings. We have already imported Pickle; now we will use it.

Let me first design an object to save and load trainings. We will design methods for directly saving/loading the particular files, and also general save/load methods.

class saveLoadFiles:
    def save(self, filename, data):
        try:
            # 'with' guarantees the file is closed, even if pickling fails
            with open('datasets/' + filename + '.pkl', 'wb') as file:
                pickle.dump(data, file)
        except:
            err = 'Error: {0}, {1}'.format(exc_info()[0], exc_info()[1])
            print(err)
            return [False, err]
        else:
            return [True]
    def load(self, filename):
        try:
            with open('datasets/' + filename + '.pkl', 'rb') as file:
                data = pickle.load(file)
        except:
            err = 'Error: {0}, {1}'.format(exc_info()[0], exc_info()[1])
            print(err)
            return [False, err]
        else:
            return data
    def loadClusterMoviesDataset(self):
        return self.load('clusters_movies_dataset')
    def saveClusterMoviesDataset(self, data):
        return self.save('clusters_movies_dataset', data)
    def loadUsersClusters(self):
        return self.load('users_clusters')
    def saveUsersClusters(self, data):
        return self.save('users_clusters', data)

In the above class, exc_info (imported from the sys library) is used for error handling and for composing error messages.

We will use the saveClusterMoviesDataset/loadClusterMoviesDataset methods to save/load the list of cluster movies DataFrames, and the saveUsersClusters/loadUsersClusters methods to save/load the users’ clusters DataFrame. Now, let’s try it. We will run it and print the responses in order to check whether any error occurs. If it returns True, it means our files have been saved successfully in the proper place.

saveLoadFile = saveLoadFiles()
print(saveLoadFile.saveClusterMoviesDataset(movies_df_fixed))
print(saveLoadFile.saveUsersClusters(clusters_fixed))

Out:

[True]
[True]

The response is True for both save methods, so our trained data has been saved and we can use it later. Let’s check whether we can load it.

load_movies_list, load_users_clusters = saveLoadFile.loadClusterMoviesDataset(), saveLoadFile.loadUsersClusters()
print('Type of Loading list of Movies dataframes of 5 Clusters: ', type(load_movies_list), ' and Length is: ', len(load_movies_list))
print('Type of Loading 100 Users clusters Data: ', type(load_users_clusters), ' and Shape is: ', load_users_clusters.shape)

Out:

Type of Loading list of Movies dataframes of 5 Clusters:  <class 'list'>  and Length is:  5
Type of Loading 100 Users clusters Data: <class 'pandas.core.frame.DataFrame'> and Shape is: (100, 2)

We have successfully saved and loaded our data using the pickle library.

We worked with a very small dataset here, but movie recommendation systems often work with very large datasets, like the one we had initially, where each cluster contains enough movies to make recommendations.

Now, we need to design functions for making recommendations to users.

Recommendations for Users

Here we’ll create an object which recommends to a user the most favorited movies in his cluster that he has not already added to his own favorites. Also, when a user adds another movie to his favorite list, we have to update the cluster movies dataset as well.

class userRequestedFor:
    def __init__(self, user_id, users_data):
        self.users_data = users_data.copy()
        self.user_id = user_id
        # Find User Cluster
        users_cluster = saveLoadFiles().loadUsersClusters()
        self.user_cluster = int(users_cluster[users_cluster['userId'] == self.user_id]['Cluster'])
        # Load User Cluster Movies Dataframe
        self.movies_list = saveLoadFiles().loadClusterMoviesDataset()
        self.cluster_movies = self.movies_list[self.user_cluster] # dataframe
        self.cluster_movies_list = list(self.cluster_movies['movieId']) # list
    def updatedFavouriteMoviesList(self, new_movie_Id):
        if new_movie_Id in self.cluster_movies_list:
            self.cluster_movies.loc[self.cluster_movies['movieId'] == new_movie_Id, 'Count'] += 1
        else:
            self.cluster_movies = self.cluster_movies.append([{'movieId': new_movie_Id, 'Count': 1}], ignore_index = True)
        self.cluster_movies.sort_values(by = ['Count'], ascending = False, inplace = True)
        self.movies_list[self.user_cluster] = self.cluster_movies
        saveLoadFiles().saveClusterMoviesDataset(self.movies_list)
    def recommendMostFavouriteMovies(self):
        try:
            user_movies = getMoviesOfUser(self.user_id, self.users_data)
            cluster_movies_list = self.cluster_movies_list.copy()
            for user_movie in user_movies:
                if user_movie in cluster_movies_list:
                    cluster_movies_list.remove(user_movie)
            return [True, cluster_movies_list]
        except KeyError:
            err = "User history does not exist"
            print(err)
            return [False, err]
        except:
            err = 'Error: {0}, {1}'.format(exc_info()[0], exc_info()[1])
            print(err)
            return [False, err]
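The recommendation path is demonstrated next; the update path would be invoked like this (movie ID 50 is a hypothetical example):

# Hypothetical call: user 12 adds movie 50 to his favourites. Inside the
# cluster's movies DataFrame, movie 50's Count is incremented (or the movie
# is appended with Count 1), and the updated dataset is saved back to disk.
userRequestedFor(12, users_fav_movies).updatedFavouriteMoviesList(50)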

Now let’s try it out, making recommendations and an update-favorite-list request. For this, first we’ll import data containing not only the IDs but also movie details like title, genre, etc.

movies_metadata = pd.read_csv(
    './Prepairing Data/From Data/movies_metadata.csv',
    usecols = ['id', 'genres', 'original_title'])
movies_metadata = movies_metadata.loc[
    movies_metadata['id'].isin(list(map(str, np.unique(users_fav_movies['movieId']))))].reset_index(drop=True)
print("Let's take a look at the movie metadata for all the movies we had in our dataset")
movies_metadata

Out:

Let's take a look at the movie metadata for all the movies we had in our dataset

Here is the list of movies which the user with ID 12 has added to his favorite movies.

user12Movies = getMoviesOfUser(12, users_fav_movies)
for movie in user12Movies:
    title = list(movies_metadata.loc[movies_metadata['id'] == str(movie)]['original_title'])
    if title != []:
        print('Movie title: ', title, ', Genres: [', end = '')
        genres = ast.literal_eval(movies_metadata.loc[movies_metadata['id'] == str(movie)]['genres'].values[0].split('[')[1].split(']')[0])
        for genre in genres:
            print(genre['name'], ', ', end = '')
        print(end = '\b\b]')
        print('')

Out:

Movie title:  ['Dancer in the Dark'] , Genres: [Drama , Crime , Music , ]
Movie title: ['The Dark'] , Genres: [Horror , Thriller , Mystery , ]
Movie title: ['Miami Vice'] , Genres: [Action , Adventure , Crime , Thriller , ]
Movie title: ['Tron'] , Genres: [Science Fiction , Action , Adventure , ]
Movie title: ['The Lord of the Rings'] , Genres: [Fantasy , Drama , Animation , Adventure , ]
Movie title: ['48 Hrs.'] , Genres: [Thriller , Action , Comedy , Crime , Drama , ]
Movie title: ['Edward Scissorhands'] , Genres: [Fantasy , Drama , Romance , ]
Movie title: ['Le Grand Bleu'] , Genres: [Adventure , Drama , Romance , ]
Movie title: ['Saw'] , Genres: [Horror , Mystery , Crime , ]
Movie title: ["Le fabuleux destin d'Amélie Poulain"] , Genres: [Comedy , Romance , ]

And finally, these are the top 10 recommended movies for that user.

user12Recommendations = userRequestedFor(12, users_fav_movies).recommendMostFavouriteMovies()[1]
for movie in user12Recommendations[:15]:
    title = list(movies_metadata.loc[movies_metadata['id'] == str(movie)]['original_title'])
    if title != []:
        print('Movie title: ', title, ', Genres: [', end = '')
        genres = ast.literal_eval(movies_metadata.loc[movies_metadata['id'] == str(movie)]['genres'].values[0].split('[')[1].split(']')[0])
        for genre in genres:
            print(genre['name'], ', ', end = '')
        print(']', end = '')
        print()

Out:

Movie title:  ['Trois couleurs : Rouge'] , Genres: [Drama , Mystery , Romance , ]
Movie title: ["Ocean's Eleven"] , Genres: [Thriller , Crime , ]
Movie title: ['Judgment Night'] , Genres: [Action , Thriller , Crime , ]
Movie title: ['Scarface'] , Genres: [Action , Crime , Drama , Thriller , ]
Movie title: ['Back to the Future Part II'] , Genres: [Adventure , Comedy , Family , Science Fiction , ]
Movie title: ["Ocean's Twelve"] , Genres: [Thriller , Crime , ]
Movie title: ['To Be or Not to Be'] , Genres: [Comedy , War , ]
Movie title: ['Back to the Future Part III'] , Genres: [Adventure , Comedy , Family , Science Fiction , ]
Movie title: ['A Clockwork Orange'] , Genres: [Science Fiction , Drama , ]
Movie title: ['Minority Report'] , Genres: [Action , Thriller , Science Fiction , Mystery , ]

And finally, we have successfully recommended movies to a user based on his/her interests, using the most favorited movies of similar users.

You’re Done

Thanks for reading this article. If you want the whole project with the deployment code, please visit my GitHub repository AI Movies Recommendation System Based on K-means Clustering Algorithm and download it; it is completely free for everyone.

Thank You

Source: https://medium.com/@asdkazmi/ai-movies-recommendation-system-with-clustering-based-k-means-algorithm-f04467e02fcd
