{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# AHC: Agglomerative Hierarchical Clustering\n", "\n", "In AHC objects are automatically inserted into a taxonomy based on pairwise distances between objects. Given a numerical representation of text paragraphs it is possible to generate taxonomies, assign paragraphs to correct nodes in the taxonomy, to compare two taxonomies, etc." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2020-05-26T11:25:22.992694Z", "start_time": "2020-05-26T11:25:22.704981Z" } }, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "\n", "file_url = 'http://dmlab.cs.put.poznan.pl/dokuwiki/lib/exe/fetch.php?media=zoo.tab'\n", "\n", "# we need to skip the first two rows because they contain metadata\n", "df = pd.read_csv(file_url, header = 0, delimiter = \"\\t\")[2:]\n", "\n", "original_headers = list(df.columns.values)\n", "\n", "# transform DataFrame to a NumPy matrix\n", "ndf = df[original_headers[:-1]].to_numpy()\n", "\n", "ndf" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The code below creates a hierarchical clustering model based on two parameters: \n", "\n", " * __linkage__ defines the way the distance between two clusters is computed\n", " * __metric__ defines the way the distance between two points is computed\n", " \n", "Possible values for __linkage__ are: *single*, *complete*, *average*, *weighted*, *centroid*, *ward*, for definitions of these methods [see the SciPy documentation](https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.linkage.html)\n", "\n", "Possible values for __metric__ include, among others: *euclidean*, *cityblock*, *minkowski*, *cosine*, *hamming*. The full list with definitions can be found [in the SciPy documentation](https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.pdist.html)\n", "\n", "Run the example below and then try to find the best combination of these parameters, i.e., the combination which produces the hierarchy which is the most similar to our biological understanding of the tree of life." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2020-05-26T11:25:25.044954Z", "start_time": "2020-05-26T11:25:23.878382Z" } }, "outputs": [], "source": [ "%matplotlib inline\n", "\n", "from scipy.cluster.hierarchy import dendrogram, linkage\n", "\n", "import matplotlib\n", "import matplotlib.pyplot as plt\n", "\n", "matplotlib.rcParams['figure.figsize'] = (20.0, 40.0)\n", "\n", "model = linkage(ndf[:,2:16], \n", " method = 'complete', \n", " metric='euclidean')\n", "\n", "\n", "plot = dendrogram(model, labels = ndf[:,0] + ':', orientation = 'right', leaf_font_size = 20)\n", "\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# K-Means Clustering\n", "\n", "The code below illustrates the use of k-means to cluster numerical points. When running the code experiment with varying the number of true clusters generated and the expected number of clusters. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "# K-Means Clustering\n", "\n", "The code below illustrates the use of k-means to cluster numerical points. When running the code, experiment with varying the true number of generated clusters and the expected number of clusters. See what happens when:\n", "\n", " * the expected number of clusters is smaller/greater than the true number of clusters\n", " * the true number of clusters is small vs. large" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2020-05-26T11:25:47.078764Z", "start_time": "2020-05-26T11:25:47.075046Z" } }, "outputs": [], "source": [ "import numpy as np\n", "import matplotlib.pyplot as plt\n", "from mpl_toolkits.mplot3d import Axes3D\n", "\n", "from sklearn.cluster import KMeans\n", "from sklearn.datasets import make_blobs\n", "\n", "true_number_of_clusters = 3\n", "predicted_number_of_clusters = 5\n", "\n", "# generate an artificial dataset\n", "X, y = make_blobs(n_samples=1000, n_features=3, centers=true_number_of_clusters)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2020-05-26T11:25:47.957399Z", "start_time": "2020-05-26T11:25:47.932637Z" } }, "outputs": [], "source": [ "%matplotlib notebook\n", "\n", "fig = plt.figure()\n", "ax = Axes3D(fig)\n", "ax.scatter(X[:, 0], X[:, 1], X[:, 2])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2020-05-26T11:25:52.458728Z", "start_time": "2020-05-26T11:25:52.397734Z" } }, "outputs": [], "source": [ "model = KMeans(n_clusters=predicted_number_of_clusters)\n", "model.fit(X)\n", "\n", "# assign points to clusters\n", "labels = model.predict(X)\n", "\n", "# print the coordinates of the centroids\n", "print(model.cluster_centers_)\n", "\n", "labels[:10]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2020-05-26T11:25:53.304686Z", "start_time": "2020-05-26T11:25:53.263791Z" } }, "outputs": [], "source": [ "fig = plt.figure()\n", "\n", "ax = Axes3D(fig)\n", "ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=labels)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, guessing the right number of clusters is important. Below we will experiment with different settings of the number of clusters and try to deduce the best value using [the Silhouette score](http://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html). Since this synthetic dataset also comes with ground-truth labels, an additional external check follows after the plot." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2020-05-26T14:13:32.305282Z", "start_time": "2020-05-26T14:13:25.637441Z" } }, "outputs": [], "source": [ "from sklearn.metrics import silhouette_score\n", "from tqdm.notebook import tqdm\n", "\n", "k_range = range(2, 20)\n", "\n", "scores = []\n", "\n", "for k in tqdm(k_range):\n", "    kmeans = KMeans(n_clusters=k)\n", "    labels = kmeans.fit_predict(X)\n", "    score = silhouette_score(X, labels, metric='euclidean')\n", "    scores.append((k, score))\n", "\n", "ax_x, ax_y = zip(*scores)\n", "\n", "fig = plt.figure()\n", "ax = plt.axes()\n", "plt.xticks(k_range)\n", "plt.grid()\n", "plt.plot(ax_x, ax_y)" ] },
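{ "cell_type": "markdown", "metadata": {}, "source": [ "Because `make_blobs` also returns the true cluster assignment `y`, the generated data allows a check that is impossible on real data: comparing each partition directly with the ground truth. The cell below is an optional sketch using the adjusted Rand index as an external validity measure, complementing the (purely internal) silhouette score above." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Optional sketch: compare k-means partitions with the known blob labels.\n", "# The adjusted Rand index is 1.0 for a perfect match and close to 0.0 for a\n", "# random labelling; it can only be computed here because make_blobs gave us y.\n", "from sklearn.metrics import adjusted_rand_score\n", "\n", "for k in range(2, 8):\n", "    kmeans = KMeans(n_clusters=k)\n", "    labels = kmeans.fit_predict(X)\n", "    print(f'k={k}  adjusted Rand index: {adjusted_rand_score(y, labels):.3f}')" ] },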
{ "cell_type": "markdown", "metadata": {}, "source": [ "# Clustering of text documents" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Word embeddings\n", "\n", "The main idea is to represent a word as a vector in an $n$-dimensional space (usually $n=300$), in such a way that the proximity of words in this latent vector space indicates the semantic proximity of words. This representation not only places synonyms close to each other, it also places semantically related words close to each other.\n", "\n", "In order to run this code you have to install spaCy and download the precomputed word embeddings:\n", "\n", "```bash\n", "bash$ pip install -U spacy\n", "bash$ python -m spacy download en_core_web_md\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2020-05-26T15:07:39.471788Z", "start_time": "2020-05-26T15:07:24.761109Z" } }, "outputs": [], "source": [ "import spacy\n", "\n", "nlp = spacy.load('en_core_web_md')\n", "\n", "print(f'Size of the vocabulary: {len(nlp.vocab)}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Words in spaCy have vector representations computed with GloVe (Global Vectors for Word Representation), an unsupervised method for computing word vectors, with pretrained models built on corpora such as Wikipedia, Twitter and Common Crawl." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2020-05-26T15:07:56.139257Z", "start_time": "2020-05-26T15:07:56.125185Z" } }, "outputs": [], "source": [ "king = nlp('king')\n", "print(king.vector)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2020-05-26T15:08:00.572361Z", "start_time": "2020-05-26T15:08:00.498154Z" } }, "outputs": [], "source": [ "words = ['king', 'queen', 'kingdom', 'emperor', 'spades', 'Lear', 'grass']\n", "\n", "for w in words:\n", "    print(\"{} {} \\t {} \".format(king, w, king.similarity(nlp(w))))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since each word is represented as a vector in a multidimensional space, the most natural distance measure in this high-dimensional space is the cosine distance between vectors."
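] }, { "cell_type": "markdown", "metadata": {}, "source": [ "SciPy exposes this quantity as a *distance*: `scipy.spatial.distance.cosine(u, v)` equals one minus the cosine similarity, and it is also what the `cosine` option of `pdist` and `linkage` mentioned earlier refers to. The quick check below is an optional sketch illustrating this relationship on two word vectors." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Optional sanity check: cosine distance (SciPy) = 1 - cosine similarity\n", "from scipy.spatial.distance import cosine as cosine_distance\n", "\n", "u = nlp('king').vector\n", "v = nlp('queen').vector\n", "\n", "similarity = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))\n", "print(f'cosine similarity:   {similarity:.4f}')\n", "print(f'1 - cosine distance: {1 - cosine_distance(u, v):.4f}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The helper below computes the same cosine similarity directly with NumPy; it is reused in the analogy example that follows."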
] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2020-05-26T15:08:04.855045Z", "start_time": "2020-05-26T15:08:04.852395Z" } }, "outputs": [], "source": [ "# cosine similarity\n", "cosine = lambda v1, v2: np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2020-05-26T15:03:38.843826Z", "start_time": "2020-05-26T15:03:29.294864Z" } }, "outputs": [], "source": [ "%%time\n", "\n", "vx = nlp('king').vector - nlp('man').vector + nlp('woman').vector\n", "\n", "results = list({w for w in nlp.vocab \n", " if w.has_vector \n", " and w.orth_.islower() \n", " and w.lower_ != 'king' \n", " and w.lower_ != 'man' \n", " and w.lower_ != 'woman'})\n", "\n", "results.sort(key=lambda w: cosine(w.vector, vx), reverse=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "start_time": "2020-05-26T15:05:57.838Z" } }, "outputs": [], "source": [ "for r in results[:10]:\n", " print(f'{r.text:>10}: {cosine(r.vector, vx):%f.3}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To see the clustering in action let us import a text document and try to find clusters the document" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2020-05-26T15:55:32.436805Z", "start_time": "2020-05-26T15:55:31.572105Z" } }, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "import requests\n", "\n", "respond = requests.get(\"https://en.wikipedia.org/wiki/PoznaƄ\")\n", "soup = BeautifulSoup(respond.text, \"html\")\n", "page = soup.find_all('p')\n", "\n", "raw_text = ''.join([paragraph.text for paragraph in page])\n", "\n", "print(raw_text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, we will use the traditional vector space model, where each phrase will be represented as the vector of words appearing in this phrase. We will use the TF-IDF weighting of words and some stop-word removal" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2020-05-26T15:08:16.916650Z", "start_time": "2020-05-26T15:08:16.651530Z" } }, "outputs": [], "source": [ "from sklearn.feature_extraction.text import TfidfVectorizer\n", "from sklearn.cluster import KMeans\n", "\n", "phrases = raw_text.split('.')\n", "vectorizer = TfidfVectorizer(stop_words='english')\n", "X = vectorizer.fit_transform(phrases)\n", "\n", "print('\\nVocabulary')\n", "print(vectorizer.get_feature_names())\n", "\n", "print('\\nList of stopwords')\n", "print(vectorizer.get_stop_words())\n", "\n", "print('\\nDocument in the vector space')\n", "print(X.todense())" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2020-05-26T15:08:25.190484Z", "start_time": "2020-05-26T15:08:25.133076Z" } }, "outputs": [], "source": [ "from tqdm.notebook import tqdm\n", "\n", "true_k = 5\n", "num_words = 10\n", "\n", "model = KMeans(n_clusters=true_k, max_iter=100, n_init=1)\n", "model.fit(X)\n", "\n", "centroids = model.cluster_centers_.argsort()[:, ::-1]\n", "terms = vectorizer.get_feature_names()\n", "\n", "for i in tqdm(range(true_k)):\n", "\n", " print (f'Cluster {i}: {[terms[j] for j in centroids[i, :num_words]]}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we will compare this result to clustering applied to word embeddings. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "Now we will compare this result with clustering applied to word embeddings. To shorten the list of words, we will apply additional filtering based on word frequency and remove the most common words, such as *the*, *of*, *we*, etc. To perform this filtering we need to install the `wordfreq` package:\n", "\n", "```bash\n", "bash$ pip install wordfreq\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2020-05-26T15:31:42.879972Z", "start_time": "2020-05-26T15:31:42.877025Z" } }, "outputs": [], "source": [ "from wordfreq import zipf_frequency\n", "\n", "print(f\"{zipf_frequency('the','en')} {zipf_frequency('city','en')} {zipf_frequency('Poznan','en')}\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2020-05-26T15:59:18.941778Z", "start_time": "2020-05-26T15:59:18.939247Z" } }, "outputs": [], "source": [ "# remove punctuation that would otherwise stick to the words\n", "raw_text = raw_text.replace(',', ' ').replace('(', ' ').replace(')', ' ')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2020-05-26T15:59:45.565140Z", "start_time": "2020-05-26T15:59:32.717920Z" } }, "outputs": [], "source": [ "# keep only words that are neither very frequent nor unknown and that have a vector\n", "words = {\n", "    word.lower(): nlp(word).vector\n", "    for word in set(raw_text.split())\n", "    if zipf_frequency(word, 'en') < 5\n", "    and zipf_frequency(word, 'en') > 0\n", "    and np.any(nlp(word).vector)\n", "}" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2020-05-26T15:59:59.481711Z", "start_time": "2020-05-26T15:59:59.476804Z" } }, "outputs": [], "source": [ "X = np.vstack(list(words.values()))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2020-05-26T16:00:13.442822Z", "start_time": "2020-05-26T16:00:13.324530Z" } }, "outputs": [], "source": [ "true_k = 10\n", "\n", "model = KMeans(n_clusters=true_k, max_iter=1000, n_init=1)\n", "model.fit(X)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2020-05-26T16:00:27.544105Z", "start_time": "2020-05-26T16:00:27.312409Z" } }, "outputs": [], "source": [ "# for every centroid, list the words whose vectors are most similar to it\n", "for i, c in tqdm(enumerate(model.cluster_centers_)):\n", "    cluster_words = sorted(words.items(), key=lambda x: cosine(x[1], c), reverse=True)\n", "    print(f\"Cluster {i}: {[k for k, v in cluster_words][:num_words]}\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.2" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": false, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "position": { "height": "547.667px", "left": "1411px", "right": "20px", "top": "115px", "width": "366.625px" }, "types_to_exclude": [ "module", "function",
"builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 2 }