{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Naive Bayes classifier\n",
    "\n",
    "The goal of this exercise is to use the `scikit-learn` library to build a simple workflow that illustrates the [naive Bayes classifier](https://en.wikipedia.org/wiki/Naive_Bayes_classifier)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Setup\n",
    "\n",
    "## Required libraries\n",
    "We will use the following libraries in this exercise:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-04-21T13:22:49.809887Z",
     "start_time": "2020-04-21T13:22:49.792681Z"
    }
   },
   "outputs": [],
   "source": [
    "# Data manipulation\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "\n",
    "# Options for pandas\n",
    "pd.options.display.max_columns = 50\n",
    "pd.options.display.max_rows = 30\n",
    "\n",
    "# Visualizations\n",
    "# import plotly\n",
    "# import plotly.graph_objs as go\n",
    "# import plotly.offline as ply\n",
    "# plotly.offline.init_notebook_mode(connected=True)\n",
    "\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "# Autoreload extension\n",
    "if 'autoreload' not in get_ipython().extension_manager.loaded:\n",
    "    %load_ext autoreload\n",
    "    \n",
    "%autoreload 2\n",
    "\n",
    "# Machine learning\n",
    "from sklearn import datasets, metrics\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.naive_bayes import GaussianNB"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Importing the data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this exercise we use [Fisher's Iris data set](https://en.wikipedia.org/wiki/Iris_flower_data_set). If the input data were stored in a text file, it would be more convenient to load it directly into a `pandas` `DataFrame` and split it into features and labels, as the following cells do for a CSV copy of the data set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-04-21T13:15:44.898151Z",
     "start_time": "2020-04-21T13:15:44.662596Z"
    }
   },
   "outputs": [],
   "source": [
    "iris = pd.read_csv('https://gist.githubusercontent.com/curran/a08a1080b88344b0c8a7/raw/639388c2cbc2120a14dcf466e85730eb8be498bb/iris.csv', header=0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-04-21T13:15:45.309558Z",
     "start_time": "2020-04-21T13:15:45.290419Z"
    }
   },
   "outputs": [],
   "source": [
    "# pop() removes the given column from the DataFrame and returns it\n",
    "y = iris.pop('species')\n",
    "X = iris"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-04-21T13:15:58.060338Z",
     "start_time": "2020-04-21T13:15:58.033125Z"
    }
   },
   "outputs": [],
   "source": [
    "y.value_counts()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-04-21T13:15:58.742821Z",
     "start_time": "2020-04-21T13:15:58.710063Z"
    }
   },
   "outputs": [],
   "source": [
    "# DataFrame.describe() returns a simple statistical summary of the columns\n",
    "X.describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will, however, use the version of the `Iris` data set that ships with `scikit-learn`. The `load_iris()` function returns a dictionary-like `Bunch` object that contains, among other things, the data array, a separate array of flower labels, and the list of feature names."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-04-21T13:16:19.822636Z",
     "start_time": "2020-04-21T13:16:19.808887Z"
    }
   },
   "outputs": [],
   "source": [
    "iris = datasets.load_iris()\n",
    "\n",
    "# from here on, use the scikit-learn copy of the data instead of the CSV one\n",
    "X, y = iris['data'], iris['target']\n",
    "\n",
    "print(f\"First five flowers: \\n {iris['data'][:5]}\")\n",
    "print(f\"First five labels: \\n {iris['target'][:5]}\")\n",
    "print(f\"Feature names: \\n {iris['feature_names']}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-04-21T13:17:06.351887Z",
     "start_time": "2020-04-21T13:17:06.338424Z"
    }
   },
   "outputs": [],
   "source": [
    "# description of the data set\n",
    "print(iris.DESCR)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Building the classifier\n",
    "\n",
    "Before training the classifier we need to split the data set into a training part and a test part. We use a helper function provided by `scikit-learn` for this."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-04-21T13:19:06.420041Z",
     "start_time": "2020-04-21T13:19:06.407164Z"
    }
   },
   "outputs": [],
   "source": [
    "# split the data set into a training set (70%) and a test set (30%)\n",
    "X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7)\n",
    "\n",
    "print(f\"Training set: {X_train.shape}, test set: {X_test.shape}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-04-21T13:19:15.236427Z",
     "start_time": "2020-04-21T13:19:15.213049Z"
    }
   },
   "outputs": [],
   "source": [
    "# create the model and fit it to the training data\n",
    "model = GaussianNB()\n",
    "model.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-04-21T13:20:21.198626Z",
     "start_time": "2020-04-21T13:20:21.186088Z"
    }
   },
   "outputs": [],
   "source": [
    "# see how the model estimated the prior probabilities\n",
    "\n",
    "print(f\"Decision classes: {model.classes_}\")\n",
    "print(f\"Number of instances per class: {model.class_count_}\")\n",
    "print(f\"Prior probability of each class: {model.class_prior_}\")"
   ]
  },
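  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With the default `priors=None`, the prior probabilities printed above should simply be the relative class frequencies in the training labels. The next cell is a minimal sketch that checks this. It reloads the data so that it can run on its own; the names `X_demo`, `y_demo`, `demo_model` and the fixed `random_state=0` are illustrative assumptions, not part of the exercise."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from sklearn import datasets\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.naive_bayes import GaussianNB\n",
    "\n",
    "# illustrative, self-contained copy of the data (integer labels 0, 1, 2)\n",
    "X_demo, y_demo = datasets.load_iris(return_X_y=True)\n",
    "X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, train_size=0.7, random_state=0)\n",
    "\n",
    "demo_model = GaussianNB().fit(X_tr, y_tr)\n",
    "\n",
    "# relative frequency of each class in the training labels\n",
    "manual_priors = np.bincount(y_tr) / len(y_tr)\n",
    "\n",
    "print(f\"class_prior_:  {demo_model.class_prior_}\")\n",
    "print(f\"manual priors: {manual_priors}\")\n",
    "print(f\"Match: {np.allclose(demo_model.class_prior_, manual_priors)}\")"
   ]
  },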
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-04-21T13:22:09.586678Z",
     "start_time": "2020-04-21T13:22:09.571490Z"
    }
   },
   "outputs": [],
   "source": [
    "# make predictions and evaluate the model\n",
    "predicted = model.predict(X_test)\n",
    "expected = y_test\n",
    "\n",
    "print(f\"Model accuracy: {metrics.accuracy_score(expected, predicted)}\\n\")\n",
    "print(f\"Confusion matrix: \\n {metrics.confusion_matrix(expected, predicted)}\")"
   ]
  },
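  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The two metrics above are closely related: correct predictions lie on the diagonal of the confusion matrix, so accuracy is the sum of the diagonal divided by the number of test samples. The cell below is a small sketch of that relationship. It is self-contained, and the names `X_demo`, `y_demo`, `y_pred`, `cm` and the fixed `random_state=0` are illustrative assumptions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from sklearn import datasets, metrics\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.naive_bayes import GaussianNB\n",
    "\n",
    "# illustrative, self-contained copy of the data\n",
    "X_demo, y_demo = datasets.load_iris(return_X_y=True)\n",
    "X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, train_size=0.7, random_state=0)\n",
    "y_pred = GaussianNB().fit(X_tr, y_tr).predict(X_te)\n",
    "\n",
    "# rows are true classes, columns are predicted classes\n",
    "cm = metrics.confusion_matrix(y_te, y_pred)\n",
    "\n",
    "# correct predictions sit on the diagonal\n",
    "manual_accuracy = np.trace(cm) / cm.sum()\n",
    "\n",
    "print(f\"accuracy_score:       {metrics.accuracy_score(y_te, y_pred)}\")\n",
    "print(f\"trace(cm) / cm.sum(): {manual_accuracy}\")"
   ]
  },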
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-04-21T13:22:34.948869Z",
     "start_time": "2020-04-21T13:22:34.933887Z"
    }
   },
   "outputs": [],
   "source": [
    "print(metrics.classification_report(expected, predicted))"
   ]
  },
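  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The labels returned by `predict()` come from posterior class probabilities, which `GaussianNB` exposes through `predict_proba()`. The following cell is a hedged sketch showing that each row of posteriors sums to 1 and that the predicted label is the class with the highest posterior. It is self-contained; `X_demo`, `y_demo`, `demo_model`, `proba` and `random_state=0` are illustrative assumptions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from sklearn import datasets\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.naive_bayes import GaussianNB\n",
    "\n",
    "# illustrative, self-contained copy of the data\n",
    "X_demo, y_demo = datasets.load_iris(return_X_y=True)\n",
    "X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, train_size=0.7, random_state=0)\n",
    "demo_model = GaussianNB().fit(X_tr, y_tr)\n",
    "\n",
    "# one row per test sample, one column per class in demo_model.classes_\n",
    "proba = demo_model.predict_proba(X_te)\n",
    "\n",
    "# the class with the highest posterior, mapped back to its label\n",
    "labels_from_proba = demo_model.classes_[proba.argmax(axis=1)]\n",
    "\n",
    "print(f\"Posteriors for the first three test samples: \\n {proba[:3].round(3)}\")\n",
    "print(f\"Rows sum to 1: {np.allclose(proba.sum(axis=1), 1.0)}\")\n",
    "print(f\"Argmax matches predict(): {np.array_equal(labels_from_proba, demo_model.predict(X_te))}\")"
   ]
  },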
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-04-21T13:23:29.212704Z",
     "start_time": "2020-04-21T13:23:29.190090Z"
    }
   },
   "outputs": [],
   "source": [
    "# see how supplying explicit prior probabilities changes the classifier's behavior\n",
    "model = GaussianNB(priors=[0.5, 0.49, 0.01])\n",
    "model.fit(X_train, y_train)\n",
    "\n",
    "predicted = model.predict(X_test)\n",
    "expected = y_test\n",
    "\n",
    "print(f\"Model accuracy: {metrics.accuracy_score(expected, predicted)}\\n\")\n",
    "print(f\"Confusion matrix: \\n {metrics.confusion_matrix(expected, predicted)}\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": false
  },
  "varInspector": {
   "cols": {
    "lenName": 16,
    "lenType": 16,
    "lenVar": 40
   },
   "kernels_config": {
    "python": {
     "delete_cmd_postfix": "",
     "delete_cmd_prefix": "del ",
     "library": "var_list.py",
     "varRefreshCmd": "print(var_dic_list())"
    },
    "r": {
     "delete_cmd_postfix": ") ",
     "delete_cmd_prefix": "rm(",
     "library": "var_list.r",
     "varRefreshCmd": "cat(var_dic_list()) "
    }
   },
   "types_to_exclude": [
    "module",
    "function",
    "builtin_function_or_method",
    "instance",
    "_Feature"
   ],
   "window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}