{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "### Decision Trees in scikit-learn\n", "\n", "The goal of the laboratory is to familiarize students with building tree classifiers using Python and the scikit-learn library.\n", "\n", "The first step is to import the data into a [pandas DataFrame](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2020-05-02T18:19:34.420020Z", "start_time": "2020-05-02T18:19:33.456430Z" } }, "outputs": [], "source": [ "import pandas as pd\n", "\n", "df = pd.read_csv('http://dmlab.cs.put.poznan.pl/dokuwiki/lib/exe/fetch.php?media=dt_data.csv')\n", "\n", "df.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is important to first get familiar with the data by looking at a small subset of rows." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2020-05-02T18:19:36.949241Z", "start_time": "2020-05-02T18:19:36.940866Z" } }, "outputs": [], "source": [ "df.head(10)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2020-05-02T18:19:38.180834Z", "start_time": "2020-05-02T18:19:38.176861Z" } }, "outputs": [], "source": [ "df.columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have to perform some basic data preprocessing:\n", "\n", "* remove the _Unnamed: 3_ and _Card_Cust_ID_ columns\n", "* recode the label from 0/1 to no/yes" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2020-05-02T18:20:00.237582Z", "start_time": "2020-05-02T18:20:00.226953Z" } }, "outputs": [], "source": [ "# inplace=True modifies the existing DataFrame instead of returning a new object\n", "\n", "df.drop('Unnamed: 3', axis=1, inplace=True)\n", "df.drop('Card_Cust_ID', axis=1, inplace=True)\n", "\n", "# assign the result back: an in-place replace on df.<column> may not update df in recent pandas versions\n", "df['Spend_Drop_over50pct'] = df['Spend_Drop_over50pct'].replace({0: 'no', 1: 'yes'})\n", "\n", 
"df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Scikit-learn does not accept categorical variables, they should be encoded as numbers" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2020-05-02T18:20:09.975652Z", "start_time": "2020-05-02T18:20:09.767011Z" } }, "outputs": [], "source": [ "from sklearn.preprocessing import LabelEncoder, OneHotEncoder\n", "\n", "gender_ = LabelEncoder().fit(df.Gender)\n", "df.Gender = gender_.transform(df.Gender)\n", "\n", "df.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2020-05-02T18:20:15.811530Z", "start_time": "2020-05-02T18:20:15.804305Z" } }, "outputs": [], "source": [ "education_ = LabelEncoder().fit(df.Education_level)\n", "df.Education_level = education_.transform(df.Education_level)\n", "\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next we should examine the distribution of values of the target attribute" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2020-05-02T18:20:21.027526Z", "start_time": "2020-05-02T18:20:21.022842Z" } }, "outputs": [], "source": [ "df.Spend_Drop_over50pct.value_counts()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2020-05-02T18:20:22.453618Z", "start_time": "2020-05-02T18:20:22.447727Z" } }, "outputs": [], "source": [ "round( df.Spend_Drop_over50pct.value_counts()/len(df) * 100 , 2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before building the classifier we have to construct the _train set_ and the _test set_" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2020-05-02T18:20:27.951639Z", "start_time": "2020-05-02T18:20:27.936013Z" } }, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split\n", "\n", "y = df.pop('Spend_Drop_over50pct')\n", "X 
= df\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(\n", " X, y, test_size=0.7, random_state=42)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2020-05-02T18:20:30.389476Z", "start_time": "2020-05-02T18:20:30.385403Z" } }, "outputs": [], "source": [ "y_train.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is all we need to build a [Decision Tree classifier](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2020-05-02T18:20:35.861566Z", "start_time": "2020-05-02T18:20:35.841464Z" } }, "outputs": [], "source": [ "from sklearn.tree import DecisionTreeClassifier" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2020-05-02T18:20:36.867898Z", "start_time": "2020-05-02T18:20:36.862583Z" } }, "outputs": [], "source": [ "# build the model\n", "model = DecisionTreeClassifier(\n", " criterion='gini', \n", " max_depth=5, \n", " min_samples_leaf=5)\n", "\n", "# train the model\n", "model.fit(X_train, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After building the tree we can check its accuracy, precision, and recall" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2020-05-02T18:20:38.074103Z", "start_time": "2020-05-02T18:20:38.059418Z" } }, "outputs": [], "source": [ "from sklearn.metrics import classification_report, confusion_matrix, accuracy_score\n", "\n", "y_pred = model.predict(X_test)\n", "\n", "print(f'accuracy: {accuracy_score(y_test, y_pred)}\\n')\n", "print(f'confusion matrix\\n {confusion_matrix(y_test, y_pred)}')\n", "print(classification_report(y_test, y_pred))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is also possible to visualize the generated tree using 
the [Graphviz library](http://www.webgraphviz.com)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2020-05-02T18:24:07.551504Z", "start_time": "2020-05-02T18:24:07.499566Z" } }, "outputs": [], "source": [ "from sklearn.tree import export_graphviz\n", "import graphviz\n", "\n", "# write the tree to a DOT file, then read it back and render it inline\n", "export_graphviz(model, \n", " out_file=\"model.dot\", \n", " filled=True,\n", " feature_names=X_train.columns)\n", "\n", "with open(\"model.dot\") as f:\n", " dot_graph = f.read()\n", "graphviz.Source(dot_graph)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Guessing good values for the algorithm's parameters can be tedious and difficult. It is much easier to search the space of parameter values automatically, using either an exhaustive grid search or some heuristic." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2020-05-02T18:24:18.899129Z", "start_time": "2020-05-02T18:24:16.161920Z" } }, "outputs": [], "source": [ "from sklearn.model_selection import GridSearchCV\n", "\n", "params = { \n", " 'max_depth' : range(1,15),\n", " 'criterion' : ['gini','entropy'],\n", " 'min_samples_leaf' : range(2,20)\n", "}\n", "\n", "model_ = GridSearchCV(estimator=DecisionTreeClassifier(), \n", " cv=5,\n", " param_grid=params, \n", " n_jobs=7)\n", "# fit on the train set only, so that the test set stays unseen during the parameter search\n", "model_.fit(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2020-05-02T18:24:20.132244Z", "start_time": "2020-05-02T18:24:20.129897Z" } }, "outputs": [], "source": [ "print(model_.best_score_)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2020-05-02T18:24:21.261951Z", "start_time": "2020-05-02T18:24:21.258930Z" } }, "outputs": [], "source": [ "model_.best_estimator_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's use the best combination of parameters to score our classifier on the test set." ] }, { "cell_type": "code", 
"execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2020-05-02T18:24:23.962054Z", "start_time": "2020-05-02T18:24:23.947825Z" } }, "outputs": [], "source": [ "model = model_.best_estimator_\n", "y_pred = model.predict(X_test)\n", "\n", "print(f'accuracy: {accuracy_score(y_test, y_pred)}\\n')\n", "print(f'confusion matrix\\n {confusion_matrix(y_test, y_pred)}')\n", "print(classification_report(y_test, y_pred))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Unfortunately, the performance w.r.t. the minority class is still not satisfying. One possible solution is to present the classifier with some class weights." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2020-05-02T18:24:25.608261Z", "start_time": "2020-05-02T18:24:25.592859Z" } }, "outputs": [], "source": [ "# build the model\n", "model = DecisionTreeClassifier(\n", " criterion='gini', \n", " max_depth=2, \n", " min_samples_leaf=11,\n", " class_weight={'no': 1.0, 'yes': 5.0})\n", "\n", "# train the model\n", "model.fit(X_train, y_train)\n", "\n", "# apply the model\n", "y_pred = model.predict(X_test)\n", "\n", "print(f'accuracy: {accuracy_score(y_test, y_pred)}\\n')\n", "print(f'confusion matrix\\n {confusion_matrix(y_test, y_pred)}')\n", "print(classification_report(y_test, y_pred))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, let us verify how the accuracy of traditional decision trees compares with the accuracy of [Random Forests](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2020-05-02T18:24:36.856174Z", "start_time": "2020-05-02T18:24:36.800418Z" } }, "outputs": [], "source": [ "from sklearn.ensemble import RandomForestClassifier\n", "\n", "model = RandomForestClassifier(\n", " criterion='gini',\n", " 
n_estimators=10,\n", " max_features=0.5,\n", " n_jobs=-1,\n", " class_weight={'no': 1.0, 'yes': 5.0},\n", " random_state=42)\n", "\n", "# train the model\n", "model.fit(X_train, y_train)\n", "\n", "# apply the model\n", "y_pred = model.predict(X_test)\n", "\n", "print(f'accuracy: {accuracy_score(y_test, y_pred)}\\n')\n", "print(f'confusion matrix\\n {confusion_matrix(y_test, y_pred)}')\n", "print(classification_report(y_test, y_pred))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Bias vs variance tradeoff\n", "\n", "The [bias vs variance tradeoff](https://en.wikipedia.org/wiki/Bias–variance_tradeoff) is a common problem in machine learning tasks. In very simple terms:\n", "\n", "* high bias is the result of errors in the underlying model (its assumptions, class of models, etc.) and leads to underfitting\n", "* high variance is the result of excessive sensitivity to the training data (overfitting), typically caused by too many parameters\n", "\n", "This figure illustrates the main idea:\n", "\n", "![bias vs variance](http://www.bogotobogo.com/python/scikit-learn/images/Bias-Tradeoff/Low-High-Variances-Biases.png)\n", "\n", "In the following sections we will verify how accuracy changes with the size of the train set. First, we will recreate the dataset." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you **really want to understand** the bias vs variance problem, read [these lecture notes](https://sebastianraschka.com/pdf/lecture-notes/stat479fs18/08_eval-intro_notes.pdf)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2019-04-15T08:53:21.200759Z", "start_time": "2019-04-15T08:53:20.901596Z" } }, "outputs": [], "source": [ "df = pd.read_csv('http://dmlab.cs.put.poznan.pl/dokuwiki/lib/exe/fetch.php?media=dt_data.csv')\n", "\n", "df.drop('Unnamed: 3', axis=1, inplace=True)\n", "# assign the result back: an in-place replace on df.<column> may not update df in recent pandas versions\n", "df['Spend_Drop_over50pct'] = df['Spend_Drop_over50pct'].replace({0: 'A', 1: 'B'})\n", "\n", "gender_ = LabelEncoder().fit(df.Gender)\n", "df.Gender = gender_.transform(df.Gender)\n", "\n", "education_ = LabelEncoder().fit(df.Education_level)\n", "df.Education_level = education_.transform(df.Education_level)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The next cell shows how to compute accuracy scores on the train set and the validation set for different train set sizes. We use 5-fold cross-validation for each train set size, so the result contains one accuracy score per fold for each size." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2019-04-15T09:00:35.726199Z", "start_time": "2019-04-15T09:00:30.740250Z" } }, "outputs": [], "source": [ "from sklearn.model_selection import learning_curve\n", "\n", "train_sizes = range(1, 350)\n", "\n", "features = ['Gender', 'Education_level', 'Last_Month_spend', 'Last_3m_avg_spend']\n", "target = 'Spend_Drop_over50pct'\n", "\n", "train_sizes, train_scores, validation_scores = learning_curve(\n", " estimator=DecisionTreeClassifier(max_depth=2),\n", " X = df[features],\n", " y = df[target],\n", " train_sizes = train_sizes,\n", " cv = 5,\n", " scoring = 'accuracy',\n", " shuffle = True)\n", "\n", "print('Training scores:\\n', train_scores)\n", "print()\n", "print('Validation scores:\\n', validation_scores)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since for each combination of the train/validation set size we receive 5 results (one for each fold of the cross validation), we have to aggregate these scores in order to produce the final plot." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2019-04-15T09:00:36.843207Z", "start_time": "2019-04-15T09:00:36.831135Z" } }, "outputs": [], "source": [ "# rows are train set sizes, columns are folds: average over the folds\n", "train_scores_mean = train_scores.mean(axis = 1)\n", "validation_scores_mean = validation_scores.mean(axis = 1)\n", "\n", "print('Mean training scores\\n', pd.Series(train_scores_mean, index = train_sizes))\n", "print()\n", "print('Mean validation scores\\n', pd.Series(validation_scores_mean, index = train_sizes))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2019-04-15T09:00:37.800825Z", "start_time": "2019-04-15T09:00:37.559157Z" } }, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "%matplotlib inline\n", "\n", "# the 'seaborn' style was renamed to 'seaborn-v0_8' in matplotlib 3.6\n", "plt.style.use('seaborn-v0_8' if 'seaborn-v0_8' in plt.style.available else 'seaborn')\n", "\n", "# plot the error (1 - accuracy) rather than the accuracy itself\n", "plt.plot(train_sizes, 1-train_scores_mean, label = 'Training error')\n", "plt.plot(train_sizes, 1-validation_scores_mean, label = 'Validation error')\n", "\n", "plt.ylabel('Error', fontsize = 14)\n", "plt.xlabel('Training set size', fontsize = 14)\n", "plt.title('Learning curves for a decision tree model', fontsize = 18, y = 1.03)\n", "plt.legend()\n", "plt.ylim(0,1)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "from yellowbrick.model_selection import validation_curve\n", "\n", "from sklearn.tree import DecisionTreeClassifier\n", "\n", "# accuracy on the train and validation folds as a function of max_depth\n", "viz = validation_curve(\n", " DecisionTreeClassifier(), X, y, param_name=\"max_depth\",\n", " param_range=np.arange(1, 25), cv=10, scoring=\"accuracy\",\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from yellowbrick.model_selection import LearningCurve\n", "from sklearn.model_selection import StratifiedKFold\n", "\n", "# Set up the cross-validation splitter and the train set sizes\n", "cv = StratifiedKFold(n_splits=10)\n", "sizes = np.linspace(0.1, 1.0, 100)\n", "\n", "X_lc = 
OneHotEncoder().fit_transform(X)\n", "y_lc = LabelEncoder().fit_transform(y)\n", "\n", "# Instantiate the classification model and visualizer\n", "model = DecisionTreeClassifier()\n", "visualizer = LearningCurve(\n", " model, \n", " cv=cv, \n", " scoring='accuracy', \n", " train_sizes=sizes, \n", " n_jobs=4\n", ")\n", "\n", "visualizer.fit(X_lc, y_lc)\n", "visualizer.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Homework\n", "\n", "1. Read the description of the [Italian wine dataset](http://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.names).\n", "1. Download the dataset from http://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data and load it into a pandas DataFrame. **Careful: the label is the first column in the dataset!**\n", "1. Build a decision tree classifier and print out its accuracy. Instead of using a classical Decision Tree, you can experiment with [ExtraTreeClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.tree.ExtraTreeClassifier.html#sklearn.tree.ExtraTreeClassifier) or a Random Forest.\n", "1. Use grid search to explore the parameter space and improve the classifier's accuracy.\n", "1. Perform the _bias vs variance_ analysis.\n", "\n", "\n", "Send the resulting `*.ipynb` file to Mikolaj.Morzy@put.poznan.pl by Sunday, May 24th, 21:00." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.2" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 4 }