Welcome! My name is Anant Agrawal and I am an Information Systems and Statistics double major at Carnegie Mellon University. I'm really excited about working as a full-time Product Manager after graduating and just wanted to write this small tutorial on the practical uses of clustering algorithms for Product Management. I hope you find it useful.
As I will demonstrate below with the Yelp dataset, clustering algorithms can be really useful for defining the different clusters of user preferences for the content on your platform or for your different products. This means you can better understand your distinct user segments and what each of them actually wants from your product.
In fact, this analysis can largely be replicated with the Amazon dataset, or with data from any platform that delivers categorized content (such as books or movies) and collects user preferences.
Clustering algorithms do exactly what you would guess they do. Given information about a set of objects (users, products, etc.), they group these objects based on their similarity in the given attribute(s). At a high level, they work by defining cluster centers and assigning each point to the cluster of its closest center. They repeat this process until they have minimized the error, defined as the sum of the distances between each point and the center of its cluster. Not every clustering algorithm works this way, but it offers a good high-level picture of what these algorithms are trying to do. In this tutorial I am going to highlight two clustering algorithms that can be really powerful for product management.
K-means: This algorithm works as described above in that it minimizes the total distance between group centers and every point. Its main advantage is that it is really fast, since it is just computing distances. However, it can only create roughly circular clusters, as each cluster is effectively defined by a radius around its center. A minimal sketch of this assign-and-update loop follows right after these two descriptions.
Gaussian Mixture Models: This algorithm is a little more complicated in that it fits a multidimensional probability distribution to the data. It does this by assuming the points are Gaussian distributed. In practice this means it calculates, for every point, the probability of that point belonging to each cluster, using both the mean and the spread of the data. Under this model an object can have partial membership in more than one cluster, and clusters are not restricted to being circular.
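To make the "define centers, assign points, repeat" description above concrete, here is a minimal from-scratch sketch of a K-means loop on made-up toy data (this is just for intuition; every example below uses scikit-learn's implementations, and the toy blobs here are my own assumption, not part of the Yelp data):
import numpy as np
# Toy 2-D data: two well-separated blobs of 50 points each
rng = np.random.RandomState(0)
points = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(3, 0.5, (50, 2))])
# Start with k distinct data points as the initial cluster centers
k = 2
centers = points[rng.choice(len(points), k, replace=False)]
for _ in range(10):
    # Assignment step: put each point in the cluster of its closest center
    distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    # Update step: move each center to the mean of the points assigned to it
    centers = np.array([points[labels == i].mean(axis=0) for i in range(k)])
print(centers.round(2))
scikit-learn's KMeans does essentially this (with smarter initialization and a proper convergence check), which is why it is so fast.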
So, without further ado, let's dive into examples that help show us when one algorithm is better than another.
In my opinion, this is one of the coolest datasets I could find to show the true power of clustering algorithms. This example will analyze the Open Yelp Dataset, which has almost 7 million reviews by real people of restaurants and other businesses from all over the country. Specifically, we are going to start by using clustering to see whether people in Pittsburgh have a stronger preference for the restaurants that serve burgers or for the restaurants that serve Chinese food.
#As always, let's start by importing the libraries we need
import json
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from mpl_toolkits.axes_grid1 import make_axes_locatable
import helper
#And the tools from Sklearn to do our clustering
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_samples, silhouette_score
from sklearn.metrics import mean_squared_error
from scipy.sparse import csr_matrix
#To make things cleaner, let's also not display all the warnings
import warnings
warnings.filterwarnings('ignore')
The Open Yelp dataset consists of giant JSON files with information about the users, businesses, and ratings on Yelp. Let's start by loading in the data and seeing what information we have about a business and a review:
businesses = []
with open('business.json') as f:
    for line in f:
        businesses.append(json.loads(line))
reviews = []
with open('review.json') as f:
    for line in f:
        reviews.append(json.loads(line))
testBusiness = businesses[1]
testReview = reviews[1]
print(testBusiness)
print('')
print(testReview)
Since we want to know about the preferences users have for businesses in Pittsburgh, we should filter the businesses and reviews so we only keep the information regarding Pittsburgh businesses. It would be really interesting to see how this data compares to results for the other cities covered by the dataset.
#Get the businesses from Pittsburgh
pitt_business_ids = []
pitt_business_names = []
pitt_business_categories = []
for jsonObj in businesses:
    if jsonObj['city'] == 'Pittsburgh':
        pitt_business_ids.append(jsonObj['business_id'])
        pitt_business_names.append(jsonObj['name'])
        categories = None
        if jsonObj['categories'] is not None:
            categories = ""
            for category in jsonObj['categories']:
                categories += category
        pitt_business_categories.append(categories)
business = pd.DataFrame({'business_id': pitt_business_ids,
                         'name': pitt_business_names,
                         'category': pitt_business_categories})
print(len(business))
business.head()
#Get all the review information
user_ids = []
bus_ids = []
ratings = []
for jsonObj in reviews:
    user_ids.append(jsonObj['user_id'])
    bus_ids.append(jsonObj['business_id'])
    ratings.append(jsonObj['stars'])
ratings = pd.DataFrame({'user_id': user_ids,
                        'business_id': bus_ids,
                        'rating': ratings})
print(len(ratings))
ratings.head()
#Filter so we only keep the reviews on Pittsburgh businesses
ratings = ratings[ratings['business_id'].isin(pitt_business_ids)]
print(len(ratings))
ratings.head()
Now we have the data that we want, but we still have to restructure it before we can cluster properly. We want every column to represent a category ('Chinese', 'Indian', 'Burgers') and every row a user. Each cell holds the average rating that user gave to restaurants in that category; any category for which they have no reviews is just an NA.
# Build a table with one column per category, holding each user's average rating for that category
def create_category_table(categories, column_names):
    category_ratings = pd.DataFrame()
    for category in categories:
        cat_businesses = business[business['category'].str.contains(pat=category, na=False)]
        ratings_from_cat = ratings[ratings['business_id'].isin(cat_businesses['business_id'])]
        avg_votes_per_user = ratings_from_cat.loc[:, ['user_id', 'rating']].groupby(['user_id'])['rating'].mean().round(2)
        category_ratings = pd.concat([category_ratings, avg_votes_per_user], axis=1)
    category_ratings.columns = column_names
    return category_ratings
categories = ["Sandwiches",
"Pizza",
"Chinese",
"Food Stands",
"Steakhouses",
"Mexican",
"Fast Food",
"Seafood",
"Indian",
"Gluten-Free",
"Breakfast & Brunch",
"Delis",
"Burgers",
"Salad",
"Vegan",
"Comfort Food",
"Mediterranean",
"Latin American",
"German",
"Cafes",
"Vegetarian",
"Italian",
"Middle Eastern",
"Diners",
"Hot Dogs",
"Caribbean",
"French",
"Buffets",
"Thai"]
#Build the wide table of average ratings across many categories and export it for later use
category_ratings = create_category_table(categories, categories)
category_ratings.to_csv(path_or_buf="yelpcf.csv", index=False)
#For this first analysis we only need the Burgers and Chinese columns
BurgerChineseRatings = create_category_table(['Burgers', 'Chinese'], ['avg_burgers_ratings', 'avg_chinese_ratings'])
BurgerChineseRatings
And now we are going to create a scatterplot comparing each user's average rating for restaurants that serve burgers against their average rating for Chinese restaurants.
def draw_scatterplot(x_data, x_label, y_data, y_label):
    fig = plt.figure(figsize=(8,8))
    ax = fig.add_subplot(111)
    plt.xlim(0, 5)
    plt.ylim(0, 5)
    ax.set_xlabel(x_label)
    ax.set_ylabel(y_label)
    ax.scatter(x_data, y_data, s=30)
draw_scatterplot(BurgerChineseRatings['avg_burgers_ratings'],'avg_burgers_ratings', BurgerChineseRatings['avg_chinese_ratings'], 'avg_chinese_ratings')
An interesting thing to note here is that when drawing the scatterplot, NAs are automatically dropped. This means the plot only shows users who ate at both restaurants that serve burgers and restaurants that serve Chinese food. Even so, user preferences are hard to see. To make them easier to see, we will get rid of the users that have high ratings for both categories or low ratings for both categories. This way we can see the users who actually have a strong preference for one category over the other.
def highPreferenceUsers(category_ratings, category1Data, category2Data, score_limit_1, score_limit_2):
    biased_dataset = category_ratings[((category1Data < score_limit_1) & (category2Data > score_limit_2)) | ((category2Data < score_limit_1) & (category1Data > score_limit_2))]
    biased_dataset = pd.concat([biased_dataset[:300], category_ratings[:2]])
    biased_dataset = pd.DataFrame(biased_dataset.to_records())
    return biased_dataset
biased_dataset = highPreferenceUsers(BurgerChineseRatings, BurgerChineseRatings['avg_burgers_ratings'], BurgerChineseRatings['avg_chinese_ratings'], 3, 3)
draw_scatterplot(biased_dataset['avg_burgers_ratings'],'avg_burgers_ratings', biased_dataset['avg_chinese_ratings'], 'avg_chinese_ratings')
Alright, so it's still hard to see the different clusters here. In particular, it is hard to tell whether there are more users who strongly prefer burgers over Chinese food or more users who strongly prefer Chinese food over burgers. This is where clustering comes in.
def draw_clusters(biased_dataset, xlabel, ylabel, predictions, cmap='viridis'):
    fig = plt.figure(figsize=(8,8))
    ax = fig.add_subplot(111)
    plt.xlim(0, 5)
    plt.ylim(0, 5)
    ax.set_xlabel(xlabel)
    ax.set_ylabel(ylabel)
    clustered = pd.concat([biased_dataset.reset_index(), pd.DataFrame({'group':predictions})], axis=1)
    plt.scatter(clustered[xlabel], clustered[ylabel], c=clustered['group'], s=20, cmap=cmap)
#Create the dataset with the users who have strong preferences for a category
#Drop any NA rows first (we can't cluster with NAs, and this keeps the rows aligned with the predictions when plotting)
biased_dataset = biased_dataset.dropna(subset=['avg_burgers_ratings', 'avg_chinese_ratings']).reset_index(drop=True)
X = pd.DataFrame(biased_dataset[['avg_burgers_ratings','avg_chinese_ratings']].values)
X[0]
But before we start clustering, we should stop to think about which algorithm to use. Remember that the Gaussian Mixture Model assumes the data is Gaussian distributed. A good way to get some idea of whether that assumption holds is to see if the data for each set of ratings looks normally distributed. Below is the distribution for each variable:
plt.hist(X[0], bins='auto')
plt.show()
plt.hist(X[1], bins='auto')
plt.show()
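(Eyeballing histograms is a bit subjective. If you want a quick numerical sanity check, a normality test such as D'Agostino's K-squared test from scipy.stats can back it up; this is an extra check I'm adding here, and the rest of the tutorial only relies on the histograms.)
from scipy import stats
# Very small p-values suggest the ratings are not normally distributed
for col in [0, 1]:
    stat, p = stats.normaltest(X[col])
    print('column', col, 'p-value:', round(p, 6))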
Looking at these distributions, it seems pretty clear that they are not normal. So should we abandon the Gaussian Mixture Model? Let's see how the clusters compare between K-means and the Gaussian Mixture Model:
#GMM Model
gmm = GaussianMixture(n_components=2).fit(X)
predictionsGMM = gmm.predict(X)
# Plot
draw_clusters(biased_dataset, 'avg_burgers_ratings', 'avg_chinese_ratings', predictionsGMM)
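A side benefit of the Gaussian Mixture Model worth pointing out before we move on: unlike K-means, it gives soft assignments, so we can inspect how strongly each user belongs to each cluster (a quick look using the gmm object fit above):
# Each row gives the probability that the user belongs to each of the 2 clusters (rows sum to 1)
membership_probs = gmm.predict_proba(X)
print(membership_probs[:5].round(3))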
# Use K means to define 2 cluster groups
kmeans_1 = KMeans(n_clusters=2)
predictions = kmeans_1.fit_predict(X)
# Plot
draw_clusters(biased_dataset, 'avg_burgers_ratings', 'avg_chinese_ratings', predictions)
Wow, they look pretty similar. What if we tried to find 3 clusters instead?
#GMM Model for 3 clusters
gmm = GaussianMixture(n_components=3).fit(X)
predictionsGMM2 = gmm.predict(X)
# Plot
draw_clusters(biased_dataset, 'avg_burgers_ratings', 'avg_chinese_ratings', predictionsGMM2)
This is one example of the scatterplot the GMM model produced with this data:
from IPython.display import Image
Image("GMM.png")
# Use K means to define 3 clusters
kmeans_3 = KMeans(n_clusters=3)
predictions2 = kmeans_3.fit_predict(X)
# Plot
draw_clusters(biased_dataset, 'avg_burgers_ratings', 'avg_chinese_ratings', predictions2)
Now we can see a much bigger difference. The Gaussian Mixture Model seems way off, as it created a small cluster far above all the other clusters. Sometimes the Gaussian Mixture Model comes up with decent-looking clusters even when the Gaussian assumption does not hold, but as this case shows, it is better to stick with K-means here, since K-means does not rely on that assumption.
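If eyeballing the plots isn't convincing enough, one way to put a rough number on the comparison is to compute the silhouette score (explained in more detail later in this tutorial) for the two 3-cluster solutions we just fit; higher means better separated, more cohesive clusters:
print('K-means (3 clusters) silhouette:', round(silhouette_score(X, predictions2), 3))
print('GMM (3 components) silhouette:', round(silhouette_score(X, predictionsGMM2), 3))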
# Create an instance of KMeans to find three clusters (this is the clustering we will use going forward)
kmeans_2 = KMeans(n_clusters=3)
# Use fit_predict to cluster the dataset
prediction3 = kmeans_2.fit_predict(X)
#Add the predicted cluster to the dataset
def userTypeTable(X, x_label, y_label, prediction):
    user_preferences = pd.DataFrame(X)
    user_preferences.rename(columns={0:x_label,
                                     1:y_label},
                            inplace=True)
    user_preferences['cluster'] = prediction
    return user_preferences
chinese_burger_user_preferences = userTypeTable(X, 'avg_burgers_ratings', 'avg_chinese_ratings', prediction3)
#Plot the new scatterplot with the appropriate labels
def scatterplot3Cluster(user_preferences, x_label, y_label, legend_labels):
    cluster0 = user_preferences[user_preferences['cluster'] == 0]
    cluster1 = user_preferences[user_preferences['cluster'] == 1]
    cluster2 = user_preferences[user_preferences['cluster'] == 2]
    c0 = plt.scatter(cluster0[x_label], cluster0[y_label], marker='o')
    c1 = plt.scatter(cluster1[x_label], cluster1[y_label], marker='s')
    c2 = plt.scatter(cluster2[x_label], cluster2[y_label], marker='x')
    plt.legend((c0, c1, c2),
               legend_labels,
               scatterpoints=1,
               loc='lower left',
               ncol=1,
               fontsize=14)
    plt.xlabel(x_label)
    plt.ylabel(y_label)
    return plt.show()
scatterplot3Cluster(chinese_burger_user_preferences, 'avg_burgers_ratings', 'avg_chinese_ratings', ('Loves Burgers', 'Loves Chinese', 'Eh About Both'))
K-means numbers the clusters arbitrarily on each run, so if the legend doesn't line up I apologize: the bottom-right group should be 'Loves Burgers' and the top-left group should be 'Loves Chinese'. You may need to rerun the cell above until they are aligned.
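A more robust alternative to rerunning the cell is to derive each cluster's label from its center, since the centers tell us which group is which regardless of the arbitrary cluster numbers. Here is a sketch using the kmeans_2 model fit a few cells above (the 1-star gap threshold is an arbitrary choice of mine):
# Column 0 of X is avg_burgers_ratings, column 1 is avg_chinese_ratings
cluster_names = {}
for i, (burger_avg, chinese_avg) in enumerate(kmeans_2.cluster_centers_):
    if burger_avg - chinese_avg > 1:
        cluster_names[i] = 'Loves Burgers'
    elif chinese_avg - burger_avg > 1:
        cluster_names[i] = 'Loves Chinese'
    else:
        cluster_names[i] = 'Eh About Both'
print(cluster_names)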
So, this clustering is a lot more interesting. It essentially separates the users who like Chinese restaurants but not burger restaurants, the users who are OK with both, and the users who like burger restaurants but not Chinese restaurants. It seems there are many more users who prefer Chinese food over burgers than vice versa, and also many users that don't really have a strong preference either way. But what are the exact numbers of users with each preference?
def countByUserType(user_type_table, colname, clusterNames):
    prefTable = user_type_table.groupby('cluster').count()[[colname]]
    prefTable['User Type'] = clusterNames
    prefTable.set_index('User Type', drop=True, inplace=True)
    prefTable.columns = ['Count']
    return prefTable
byusertable = countByUserType(chinese_burger_user_preferences, 'avg_burgers_ratings', ['Loves Burgers', 'Loves Chinese', 'Eh about Both'])
byusertable
Wow, now that we have the numbers, we can see that there are 3 times as many users who prefer Chinese food over burgers as the other way around. In fact, it seems there are even more users who feel lukewarm about both categories than users who prefer burgers over Chinese food.
As we noted before, the scatterplot automatically drops any user who is missing a rating in one of the two categories. While this makes sense for plotting purposes, knowing how many users have only rated one category and not the other is itself pretty telling.
print(BurgerChineseRatings['avg_burgers_ratings'].isna().sum())
print(BurgerChineseRatings['avg_chinese_ratings'].isna().sum())
Here, for example, we can see that there are more users who have never rated a Chinese restaurant than users who have never rated a restaurant serving burgers. Since rating a restaurant is likely highly correlated with going to a restaurant, the number of NAs could indicate that, in Pittsburgh, restaurants with burgers are more popular than Chinese restaurants.
So far, then, we are able to say that more users go to restaurants with burgers than to Chinese restaurants, but, of the users that go to both, more prefer the Chinese restaurants than the burger ones.
Here is another analysis of Indian Restaurants versus Chinese Restaurants:
IndianChineseRatings = create_category_table(['Indian', 'Chinese'], ['avg_indian_ratings', 'avg_chinese_ratings'])
highPrefUsers = highPreferenceUsers(IndianChineseRatings, IndianChineseRatings['avg_indian_ratings'], IndianChineseRatings['avg_chinese_ratings'], 3, 3)
print(IndianChineseRatings['avg_indian_ratings'].isna().sum())
print(IndianChineseRatings['avg_chinese_ratings'].isna().sum())
draw_scatterplot(highPrefUsers['avg_indian_ratings'],'avg_indian_ratings', highPrefUsers['avg_chinese_ratings'], 'avg_chinese_ratings')
X1 = highPrefUsers[['avg_indian_ratings', 'avg_chinese_ratings']].values
# Drop any NA values (we can't cluster with NAs)
X1 = pd.DataFrame(X1).dropna()
# Create an instance of KMeans to find three clusters
kmeans_3 = KMeans(n_clusters=3)
# Use fit_predict to cluster the dataset
prediction4 = kmeans_3.fit_predict(X1)
indian_chinese_user_preferences = userTypeTable(X1, 'avg_indian_ratings', 'avg_chinese_ratings', prediction4)
scatterplot3Cluster(indian_chinese_user_preferences, 'avg_chinese_ratings', 'avg_indian_ratings', ('Loves Indian', 'Loves Chinese', 'Eh about Both'))
usertypetable = countByUserType(indian_chinese_user_preferences, 'avg_indian_ratings', ['Loves Indian', 'Loves Chinese', 'Eh about Both'])
usertypetable
It seems that many more people go to Chinese restaurants, since there are far fewer NAs, but of the users that have a strong preference for one category of restaurant, there are about the same number with either preference.
Alright, the more clusters we add, the better each cluster represents its group of users: more clusters means smaller clusters, and the users within each cluster are more similar to each other. But if we have too many clusters, the clustering becomes useless, as the clusters stop being meaningfully distinct from each other. That is why it is pivotal to select a number of clusters that captures very similar users without splitting groups of similar users into too many separate clusters.
One method designed to help pick the number of clusters is silhouette analysis. For each point, the silhouette score compares the average distance to the other points in its own cluster (a) with the average distance to the points in the nearest other cluster (b), giving (b - a) / max(a, b). It ranges from -1 to 1, where values near 1 mean points are well matched to their own cluster compared to the other clusters. Plotting the average silhouette score for different numbers of clusters, as done below, can be used to find the ideal number of clusters.
def clustering_errors(k, data):
    kmeans = KMeans(n_clusters=k).fit(data)
    predictions = kmeans.predict(data)
    silhouette_avg = silhouette_score(data, predictions)
    return silhouette_avg
df = chinese_burger_user_preferences[['avg_burgers_ratings','avg_chinese_ratings']]
X = biased_dataset[['avg_burgers_ratings','avg_chinese_ratings']].values
X = pd.DataFrame(X).dropna()
# Choose the range of k values to test.
# We added a stride of 5 to improve performance. We don't need to calculate the error for every k value
possible_k_values = range(2, len(X)+1, 5)
# Calculate error values for all k values we're interested in
errors_per_k = [clustering_errors(k, X) for k in possible_k_values]
# Plot each value of k vs. the silhouette score at that value
fig, ax = plt.subplots(figsize=(16, 6))
plt.plot(possible_k_values, errors_per_k)
# Ticks and grid
xticks = np.arange(min(possible_k_values), max(possible_k_values)+1, 5.0)
ax.set_xticks(xticks, minor=False)
ax.set_xticks(xticks, minor=True)
ax.xaxis.grid(True, which='both')
yticks = np.arange(round(min(errors_per_k), 2), max(errors_per_k), .05)
ax.set_yticks(yticks, minor=False)
ax.set_yticks(yticks, minor=True)
ax.yaxis.grid(True, which='both')
df = indian_chinese_user_preferences[['avg_indian_ratings','avg_chinese_ratings']]
X = highPrefUsers[['avg_indian_ratings', 'avg_chinese_ratings']].values
X = pd.DataFrame(X).dropna()
X
# Choose the range of k values to test.
# We added a stride of 5 to improve performance. We don't need to calculate the error for every k value
possible_k_values = range(2, len(X)+1, 5)
# Calculate error values for all k values we're interested in
errors_per_k = [clustering_errors(k, X) for k in possible_k_values]
# Plot each value of k vs. the silhouette score at that value
fig, ax = plt.subplots(figsize=(16, 6))
plt.plot(possible_k_values, errors_per_k)
# Ticks and grid
xticks = np.arange(min(possible_k_values), max(possible_k_values)+1, 5.0)
ax.set_xticks(xticks, minor=False)
ax.set_xticks(xticks, minor=True)
ax.xaxis.grid(True, which='both')
yticks = np.arange(round(min(errors_per_k), 2), max(errors_per_k), .05)
ax.set_yticks(yticks, minor=False)
ax.set_yticks(yticks, minor=True)
ax.yaxis.grid(True, which='both')
As we can see from this analysis, the optimal number of clusters for the Chinese vs. burger analysis is 62 and the optimal number for the Indian vs. Chinese analysis is 27. What these clusters mean past 3 clusters can be really hard to interpret in this context, but choosing the right number of clusters can be really important depending on what you are trying to do. When showing a heatmap of user ratings for each business (or movie), for example, it can make a big difference. An example is shown below:
# Merge the two tables then pivot so we have a Users x Businesses dataframe
business_ratings = pd.merge(ratings, business[['business_id', 'name']], on='business_id' )
user_business_ratings = pd.pivot_table(business_ratings, index='user_id', columns= 'name', values='rating')
print('dataset dimensions: ', user_business_ratings.shape, '\n\nSubset example:')
user_business_ratings
n_businesses = 30
n_users = 18
most_rated_businesses_users_selection = helper.sort_by_rating_density(user_business_ratings, n_businesses, n_users)
print('dataset dimensions: ', most_rated_businesses_users_selection.shape)
most_rated_businesses_users_selection.head()
helper.draw_movies_heatmap(most_rated_businesses_users_selection)
most_rated_businesses_1k = helper.get_most_rated_movies(user_business_ratings, 1000)
most_rated_businesses_1k
def get_most_rated_movies(user_movie_ratings, max_number_of_movies):
    # 1- Count
    user_movie_ratings = user_movie_ratings.append(user_movie_ratings.count(), ignore_index=True)
    # 2- Sort
    user_movie_ratings_sorted = user_movie_ratings.sort_values(len(user_movie_ratings)-1, axis=1, ascending=False)
    user_movie_ratings_sorted = user_movie_ratings_sorted.drop(user_movie_ratings_sorted.tail(1).index)
    # 3- Slice
    most_rated_movies = user_movie_ratings_sorted.iloc[:, :max_number_of_movies]
    return most_rated_movies
def get_users_who_rate_the_most(most_rated_movies, max_number_of_users):
    # Get the users who have rated the most items
    # 1- Count
    most_rated_movies['counts'] = pd.Series(most_rated_movies.count(axis=1))
    # 2- Sort
    most_rated_movies_users = most_rated_movies.sort_values('counts', ascending=False)
    # 3- Slice
    most_rated_movies_users_selection = most_rated_movies_users.iloc[:max_number_of_users, :]
    most_rated_movies_users_selection = most_rated_movies_users_selection.drop(['counts'], axis=1)
    return most_rated_movies_users_selection
def sort_by_rating_density(user_movie_ratings, n_movies, n_users):
    most_rated_movies = get_most_rated_movies(user_movie_ratings, n_movies)
    most_rated_movies = get_users_who_rate_the_most(most_rated_movies, n_users)
    return most_rated_movies
def draw_movie_clusters(clustered, max_users, max_movies):
    c = 1
    for cluster_id in clustered.group.unique():
        # To improve visibility, we're showing at most max_users users and max_movies businesses per cluster.
        # You can change these values to see more users & businesses per cluster
        d = clustered[clustered.group == cluster_id].drop(['index', 'group'], axis=1)
        n_users_in_cluster = d.shape[0]
        d = sort_by_rating_density(d, max_movies, max_users)
        d = d.reindex(d.mean().sort_values(ascending=False).index, axis=1)
        d = d.reindex(d.count(axis=1).sort_values(ascending=False).index)
        d = d.iloc[:max_users, :max_movies]
        n_users_in_plot = d.shape[0]
        # We're only showing clusters that have more than 9 users, otherwise they're less interesting
        if len(d) > 9:
            print('cluster # {}'.format(cluster_id))
            print('# of users in cluster: {}.'.format(n_users_in_cluster), '# of users in plot: {}'.format(n_users_in_plot))
            fig = plt.figure(figsize=(15,4))
            ax = plt.gca()
            ax.invert_yaxis()
            ax.xaxis.tick_top()
            labels = d.columns.str[:40]
            ax.set_yticks(np.arange(d.shape[0]), minor=False)
            ax.set_xticks(np.arange(d.shape[1]), minor=False)
            ax.set_xticklabels(labels, minor=False)
            ax.get_yaxis().set_visible(False)
            # Heatmap
            heatmap = plt.imshow(d, vmin=0, vmax=5, aspect='auto')
            ax.set_xlabel('businesses')
            ax.set_ylabel('User id')
            divider = make_axes_locatable(ax)
            cax = divider.append_axes("right", size="5%", pad=0.05)
            # Color bar
            cbar = fig.colorbar(heatmap, ticks=[5, 4, 3, 2, 1, 0], cax=cax)
            cbar.ax.set_yticklabels(['5 stars', '4 stars', '3 stars', '2 stars', '1 star', '0 stars'])
            plt.setp(ax.get_xticklabels(), rotation=90, fontsize=9)
            plt.tick_params(axis='both', which='both', bottom='off', top='off', left='off', labelbottom='off', labelleft='off')
            plt.show()
            # Uncomment the next three lines to only show the first 5 clusters
            # c = c + 1
            # if c > 6:
            #     break
sparse_ratings = csr_matrix(pd.SparseDataFrame(most_rated_businesses_1k).to_coo())
sparse_ratings
# 20 clusters
predictions = KMeans(n_clusters=20, algorithm='full').fit_predict(sparse_ratings)
max_users = 70
max_businesses = 50
clustered = pd.concat([most_rated_businesses_1k.reset_index(), pd.DataFrame({'group':predictions})], axis=1)
clustered
draw_movie_clusters(clustered, max_users, max_businesses)
# Pick a cluster ID from the clusters above
cluster_number = 12
# Let's filter to only see the region of the dataset with the most values
n_users = 75
n_businesses = 300
cluster = clustered[clustered.group == cluster_number].drop(['index', 'group'], axis=1)
cluster = sort_by_rating_density(cluster, n_businesses, n_users)
helper.draw_movies_heatmap(cluster, axis_labels=False)
cluster.fillna('').head()
# Fill in the name of one of the business columns from the cluster above, e.g. 'Pirata'
business_name = "Pirata"
cluster[business_name].mean()
# The average rating of 20 businesses as rated by the users in the cluster
cluster.mean().head(20)
user_id = 5
# Get all of this user's ratings
user_ratings = cluster.loc[user_id, :]
# Which businesses did they not rate? (We don't want to recommend businesses they've already rated)
user_unrated_businesses = user_ratings[user_ratings.isnull()]
# What is the cluster's average rating for each of these businesses?
avg_ratings = pd.concat([user_unrated_businesses, cluster.mean()], axis=1, join='inner').loc[:,0]
# Let's sort so the highest-rated businesses are presented first
avg_ratings.sort_values(ascending=False)[:20]
Another really short but interesting example we can look at is the Iris dataset. This dataset has sepal and petal measurements for three different species of iris flower. These measurements are roughly normally distributed, so the Gaussian Mixture Model fits the data well. We know from the dataset that there are 50 data points for each flower type, so we can check how well the Gaussian Mixture Model recovers the three groups, and we will see that it does better when it is given all four measurements rather than just the sepal data.
from sklearn import datasets
iris = datasets.load_iris()
iris
irisData = iris['data']
#Start with just the sepal measurements (the first two columns)
sepalData = np.asarray(irisData[:, :2])
plt.scatter(sepalData[:, 0], sepalData[:, 1], s=40, cmap='viridis');
plt.hist(sepalData[:, 0], bins='auto')
plt.show()
plt.hist(sepalData[:, 1], bins='auto')
plt.show()
from sklearn.mixture import GaussianMixture
gmm = GaussianMixture(n_components=3).fit(sepalData)
labels = gmm.predict(sepalData)
plt.scatter(sepalData[:, 0], sepalData[:, 1], c=labels, s=40, cmap='viridis');
sepalData = pd.DataFrame(sepalData)
sepalData.rename(columns={0:'sepal_length',
1:'sepal_width'},
inplace=True)
sepalData['cluster'] = labels
sepalData
sepalData.groupby('cluster').count()[['sepal_length']]
from sklearn.mixture import GaussianMixture
gmm = GaussianMixture(n_components=3).fit(irisData)
labels = gmm.predict(irisData)
irisData = pd.DataFrame(irisData)
irisData.rename(columns={0:'sepal_length',
1:'sepal_width',
2:'petal_length',
3:'petal_width'},
inplace=True)
irisData['cluster'] = labels
irisData.groupby('cluster').count()[['sepal_length']]
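As a final check on the claim that the GMM recovers the three flower types, we can compare the predicted clusters against the true species labels that ship with the dataset (cluster numbers are arbitrary, so look for each row being dominated by a single species rather than an exact label match):
# Rows are GMM clusters, columns are the true iris species (0, 1, 2)
print(pd.crosstab(irisData['cluster'], iris['target']))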