Code and visual answers to all of the above included below!
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.cluster import SpectralClustering
from sklearn.metrics import accuracy_score, confusion_matrix
%run kmeans
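The %run above pulls in our clustering code from a separate script, which defines the kmeans and kmeans_plus_plus functions (and presumably the leaf_samples helper used later). For orientation only, a minimal kmeans along these lines (a sketch, not the script's actual implementation) might be:
def kmeans(X, k, max_iter=100):
    # sketch with plain random initialization (ignores the empty-cluster edge case)
    centroids = X[np.random.choice(len(X), k, replace=False)]
    for _ in range(max_iter):
        # assign every point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        cluster_indexes = dists.argmin(axis=1)
        # move each centroid to the mean of its assigned points
        new_centroids = np.array([X[cluster_indexes == i].mean(axis=0) for i in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, cluster_indexes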
First, create two clusters of points in two dimensions, normally distributed about (0,0) and (4,4) respectively
group1 = np.random.normal(loc=[0,0], scale=[1,1], size=(100,2))
group2 = np.random.normal(loc=[4,4], scale=[1,1], size=(100,2))
X = np.vstack([group1,group2])
plt.scatter(X[:,0],X[:,1])
plt.show()
Running our kmeans algorithm with k=2 shows that it easily finds the center of each group
centroids, cluster_indexes = kmeans(X, k=2)
plt.scatter(X[:,0], X[:,1], alpha=0.5, c=cluster_indexes)
plt.scatter(centroids[:,0], centroids[:,1], marker='x', s=200, c='r', linewidths=8)
plt.show()
group1 = np.random.normal(loc=[0,0], scale=[0.1,0.1], size=(100,2))
group2 = np.random.normal(loc=[0,2], scale=[0.5,0.5], size=(100,2))
group3 = np.random.normal(loc=(3,3), scale=[1,1], size=(200,2))
X = np.vstack([group1,group2,group3])
plt.scatter(X[:,0],X[:,1])
plt.show()
The above plot was created from three distinct clusters, some with higher variance than others. Depending on the random initialization of the centroids, kmeans can produce very different results, as shown in the two runs below
centroids_run1, cluster_indexes_run1 = kmeans(X, k=3)
centroids_run2, cluster_indexes_run2 = kmeans(X, k=3)
plt.figure(figsize=(12,5))
ax1 = plt.subplot(121)
ax1.scatter(X[:,0], X[:,1], alpha=0.5, c=cluster_indexes_run1)
ax1.scatter(centroids_run1[:,0], centroids_run1[:,1], marker='x', s=200, c='r', linewidths=8)
ax2 = plt.subplot(122)
ax2.scatter(X[:,0], X[:,1], alpha=0.5, c=cluster_indexes_run2)
ax2.scatter(centroids_run2[:,0], centroids_run2[:,1], marker='x', s=200, c='r', linewidths=8)
plt.show()
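A standard fix is kmeans++ initialization: choose the first centroid uniformly at random, then choose each subsequent centroid with probability proportional to its squared distance from the nearest centroid picked so far, so the seeds start spread out. Our kmeans_plus_plus comes from the %run script above; purely as an illustration (a hypothetical sketch, not the script's code), the seeding step could look like:
def kmeans_pp_init(X, k):
    # pick the first centroid uniformly at random
    centroids = [X[np.random.randint(len(X))]]
    for _ in range(k - 1):
        # squared distance from every point to its nearest chosen centroid
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        # sample the next centroid with probability proportional to d2
        centroids.append(X[np.random.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)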
Now we rerun on the same data with kmeans++ initialization, keeping the initial centroids so we can plot where the seeding placed them relative to the final centroids
kmeans_output, initial_centroids = kmeans_plus_plus(X, k=3)
centroids, cluster_indexes = kmeans_output
plt.figure(figsize=(12,6))
plt.scatter(X[:,0], X[:,1], alpha=0.3, c=cluster_indexes, label="_nolegend_")
plt.scatter(centroids[:,0], centroids[:,1], marker='x', s=200, c='red', linewidths=8)
plt.scatter(initial_centroids[:,0], initial_centroids[:,1], marker='x', s=200, c='black', linewidths=8)
plt.legend(['Final Centroids','Initial Centroids'])
plt.show()
Next, let's use kmeans for image compression, starting with a greyscale photo
dog_original = Image.open('dog.png')
dog_original
h,w = dog_original.height, dog_original.width
Greyscale images have only one channel, so we flatten the image into a single-column array of pixel intensities
X = np.array(dog_original).reshape(-1,1)
We are going to find four clusters in our image, thereby reducing our greyscale representation from 256 possible intensity values (0-255) to only four!
output, initial_centroids = kmeans_plus_plus(X, k=4)
centroids, cluster_indexes = output
The centroids come back as floating-point values, but image pixels must be 8-bit integers
centroids = centroids.astype(np.uint8)
centroids
Map every element of X to the centroid value of the cluster it belongs to
centroid_map = {i:centroid for i,centroid in enumerate(centroids)}
new_X = np.array([centroid_map[cluster_index] for cluster_index in cluster_indexes])
new_X = new_X.reshape((h,w))
dog_new = Image.fromarray(new_X, 'L')
The result? Art!
dog_new
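As an aside, the dictionary-plus-list-comprehension mapping above works, but since cluster_indexes is just an integer array in 0..k-1, NumPy fancy indexing does the same mapping in one vectorized step:
# equivalent to the centroid_map comprehension above, but vectorized
new_X = centroids[cluster_indexes].reshape(h, w)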
The same trick works on color images: each pixel is an RGB triple, so we reshape to an N x 3 array and quantize Van Gogh's Starry Night down to five colors
vangogh = Image.open('starry_night.jpg')
vangogh
h,w = vangogh.height, vangogh.width
X = np.array(vangogh).reshape(-1,3)
output, initial_centroids = kmeans_plus_plus(X, k=5, tolerance=2)
centroids, cluster_indexes = output
centroids = centroids.astype(np.uint8)
centroid_map = {i:centroid for i,centroid in enumerate(centroids)}
new_X = np.array([centroid_map[cluster_index] for cluster_index in cluster_indexes])
new_X = new_X.reshape((h,w,3))
vangogh_new = Image.fromarray(new_X)
vangogh_new
And once more on a landscape photo, this time with eight colors
landscape = Image.open('landscape.jpg')
landscape
h,w = landscape.height, landscape.width
X = np.array(landscape).reshape(-1,3)
output, initial_centroids = kmeans_plus_plus(X, k=8, tolerance=5)
centroids, cluster_indexes = output
centroids = centroids.astype(np.uint8)
centroid_map = {i:centroid for i,centroid in enumerate(centroids)}
new_X = np.array([centroid_map[cluster_index] for cluster_index in cluster_indexes])
new_X = new_X.reshape((h,w,3))
landscape_new = Image.fromarray(new_X)
landscape_new
cancer_data = pd.read_csv('cancer_data.csv')
cancer_data['label'] = cancer_data['diagnosis'].map({'M':1, 'B':0})
Since this is an unsupervised problem, we don't have labels to train on. Instead, we train our random forest by creating a duplicate dataset in which each column has been independently shuffled. We concatenate the two datasets and task the random forest with predicting whether each row came from the original or the shuffled dataset
original = cancer_data.drop(columns=['id','diagnosis','Unnamed: 32','label'])
shuffled = original.copy()
for col in shuffled.columns:
    np.random.shuffle(shuffled[col].values)  # shuffle each column independently, in place
original['target'] = 1
shuffled['target'] = 2
combined_data = pd.concat([original, shuffled], axis=0)
X = combined_data.drop(columns=['target'])
y = combined_data['target']
rf = RandomForestClassifier(n_estimators=100, min_samples_leaf=3, max_features=0.8)
rf.fit(X, y)
leaf_samples is a function that returns, for every tree in the random forest, which rows of our original dataset land together in each leaf node
leafs = leaf_samples(rf, original.drop(columns=['target']))
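leaf_samples itself lives in the %run script, so it isn't shown here. A plausible minimal version (an assumption on our part, built on sklearn's rf.apply, which returns the leaf each sample lands in for every tree) might be:
def leaf_samples(rf, X):
    # group row indices by the leaf they fall into, separately for each tree
    leaf_ids = rf.apply(X)  # shape (n_samples, n_trees)
    leaves = []
    for t in range(leaf_ids.shape[1]):
        for _, idx in pd.Series(np.arange(len(X))).groupby(leaf_ids[:, t]):
            leaves.append(idx.values)
    return leaves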
Constructing the similarity matrix is then just a matter of incrementing entry (i, j) every time rows i and j appear together in a leaf node
similarity_matrix = np.zeros((len(original), len(original)))
for arr in leafs:
    for i in range(len(arr)):
        for j in range(len(arr)):
            similarity_matrix[arr[i], arr[j]] += 1
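The triple loop is easy to read but slow for large leaves; an equivalent vectorized construction replaces the inner loops with one block update per leaf:
similarity_matrix = np.zeros((len(original), len(original)))
for arr in leafs:
    # increment every (i, j) pair from this leaf at once
    similarity_matrix[np.ix_(arr, arr)] += 1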
cluster = SpectralClustering(n_clusters=2, affinity='precomputed')
cluster_preds = cluster.fit_predict(similarity_matrix)
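One caveat before scoring: the cluster ids that come back are arbitrary, so cluster 1 may or may not correspond to Malignant. A simple hedge for the two-cluster case is to flip the ids whenever they disagree with the labels more often than not:
# cluster labels are arbitrary; flip if they mostly disagree with the true labels
if accuracy_score(cancer_data['label'], cluster_preds) < 0.5:
    cluster_preds = 1 - cluster_preds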
Creating two clusters (corresponding, hopefully, to Malignant and Benign) via Spectral Clustering achieves 90% accuracy against the true labels
accuracy_score(cancer_data['label'], cluster_preds)
confusion_matrix(cancer_data['label'], cluster_preds)
We can also see how our regular kmeans does at creating two clusters corresponding to the true labels. Accuracy is lower than with spectral clustering, but still reaches 85%
original_array = np.array(original.drop(columns=['target']))
centroids, cluster_indexes = kmeans(original_array, k=2)
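kmeans cluster ids are arbitrary too, so we apply the same flip before scoring:
if accuracy_score(cancer_data['label'], cluster_indexes) < 0.5:
    cluster_indexes = 1 - cluster_indexes  # align arbitrary cluster ids with the labels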
accuracy_score(cancer_data['label'], cluster_indexes)