{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-03-25T12:27:12.639313Z",
     "start_time": "2020-03-25T12:27:10.696370Z"
    }
   },
   "outputs": [],
   "source": [
    "from sklearn import datasets, preprocessing, feature_selection\n",
    "from itertools import compress\n",
    "\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Zbiór danych"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "W ćwiczeniu wykorzystamy biblioteczny zbiór danych o cenach mieszkań w Bostonie. Wyświetl opis zbioru i zapoznaj się z interpretacją poszczególnych atrybutów"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-03-25T12:27:14.237638Z",
     "start_time": "2020-03-25T12:27:14.206915Z"
    }
   },
   "outputs": [],
   "source": [
    "boston = datasets.load_boston()\n",
    "\n",
    "print(boston.DESCR)\n",
    "\n",
    "print('Data shape: ', boston.data.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Ocena atrybutów"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Usuwanie atrybutów o małej wariancji"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Na początku użyjemy prostego filtra do odrzucenia atrybutów, w których ponad 75% przykładów ma tę samą wartość. Posłużymy się w tym celu klasą [VarianceThreshold](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.VarianceThreshold.html#sklearn.feature_selection.VarianceThreshold). Obejrzyj wynik filtrowania i sprawdź, co się stanie, jeśli zaostrzysz kryteria filtrowania."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-03-25T12:27:20.740991Z",
     "start_time": "2020-03-25T12:27:20.700692Z"
    }
   },
   "outputs": [],
   "source": [
    "from sklearn.feature_selection import VarianceThreshold\n",
    "\n",
    "sel = VarianceThreshold(threshold=(.75 * (1 - .75)))\n",
    "boston_new = sel.fit_transform(boston.data)\n",
    "\n",
    "print('Data shape: ', boston_new.data.shape)\n",
    "\n",
    "feature_names = compress(boston.feature_names, sel.get_support())\n",
    "pd.DataFrame(boston_new, columns=feature_names)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Wybór atrybutów przez regresję liniową"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Następnie posłużymy się selektorem [SelectKBest](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html) do znalezienia 2 najlepszych atrybutów z punktu widzenia prostej regresji liniowej. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-03-25T12:27:33.180651Z",
     "start_time": "2020-03-25T12:27:32.911265Z"
    }
   },
   "outputs": [],
   "source": [
    "from sklearn.feature_selection import SelectKBest, f_regression\n",
    "\n",
    "boston_new = SelectKBest(f_regression, k=2).fit_transform(boston.data, boston.target)\n",
    "\n",
    "feature1 = boston_new[:, 0]\n",
    "feature2 = boston_new[:, 1]\n",
    "\n",
    "plt.plot(feature1, feature2, 'r.')\n",
    "plt.xlabel(\"Feature number 1\")\n",
    "plt.ylabel(\"Feature number 2\")\n",
    "plt.ylim([np.min(feature2), np.max(feature2)])\n",
    "plt.show()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Wybór atrybutów przez regresję liniową z regularyzacją"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Następny przykład pokazuje znalezienie dwóch najlepszych atrybutów z punktu widzenia prostej regresji liniowej, w której parametry podlegają regularyzacji L1. W tym celu wykorzystamy klasę [SelectFromModel](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectFromModel.html#sklearn.feature_selection.SelectFromModel). Zauważ, że regresja z regularyzacją jest realizowana przez klasę [LassoCV](http://scikit-learn.org/stable/modules/linear_model.html#lasso)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-03-25T13:48:10.640252Z",
     "start_time": "2020-03-25T13:48:10.590450Z"
    }
   },
   "outputs": [],
   "source": [
    "from sklearn.feature_selection import SelectFromModel\n",
    "from sklearn.linear_model import LassoCV\n",
    "\n",
    "clf = LassoCV()\n",
    "\n",
    "sfm = SelectFromModel(clf, threshold=0.25)\n",
    "sfm.fit(boston.data, boston.target)\n",
    "n_rows, n_features = sfm.transform(boston.data).shape\n",
    "\n",
    "while n_features > 2:\n",
    "    sfm.threshold += 0.1\n",
    "    boston_new = sfm.transform(boston.data)\n",
    "    n_rows, n_features = boston_new.shape\n",
    "\n",
    "feature1 = boston_new[:, 0]\n",
    "feature2 = boston_new[:, 1]\n",
    "\n",
    "plt.plot(feature1, feature2, 'r.')\n",
    "plt.xlabel(\"Feature number 1\")\n",
    "plt.ylabel(\"Feature number 2\")\n",
    "plt.ylim([np.min(feature2), np.max(feature2)])\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Rekurencyjny wybór atrybutów"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Ostatni przykład to wykorzystanie klasy [Recursive Feature Extraction](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html#sklearn.feature_selection.RFE) do rekurencyjnego wyboru atrybutów. Metoda wykorzystuje klasyfikator/regresor do wyboru, w każdym kroku, najlepszy możliwy atrybut."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-03-25T15:36:37.702052Z",
     "start_time": "2020-03-25T15:36:37.528057Z"
    }
   },
   "outputs": [],
   "source": [
    "from sklearn.feature_selection import RFE\n",
    "from sklearn.tree import DecisionTreeRegressor\n",
    "\n",
    "estimator = DecisionTreeRegressor()\n",
    "selector = RFE(estimator, 2, step=1)\n",
    "\n",
    "selector = selector.fit(boston.data, boston.target)\n",
    "\n",
    "boston_new = boston.data[:,selector.support_]\n",
    "\n",
    "feature1 = boston_new[:, 0]\n",
    "feature2 = boston_new[:, 1]\n",
    "\n",
    "plt.plot(feature1, feature2, 'r.')\n",
    "plt.xlabel(\"Feature number 1\")\n",
    "plt.ylabel(\"Feature number 2\")\n",
    "plt.ylim([np.min(feature2), np.max(feature2)])\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-03-25T15:39:46.863766Z",
     "start_time": "2020-03-25T15:39:46.859440Z"
    }
   },
   "outputs": [],
   "source": [
    "for (attr, rank, selected) in zip(boston.feature_names, selector.ranking_, selector.support_):\n",
    "    print(f'{attr:>10}: rank={rank:<2} selected={selected}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# zadanie samodzielne\n",
    "\n",
    "Wykorzystaj metodę [sklearn.datasets.make_classification()](http://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_classification.html) do zbudowania zbioru danych zawierającego 1000 przypadków opisanych przy użyciu 20 atrybutów, z których \n",
    "\n",
    "* 5 atrybutów jest faktycznie informacyjnych\n",
    "* 5 atrybutów jest nadmiarowych\n",
    "* 5 atrybutów jest zduplikowanych. \n",
    "\n",
    "Wykorzystaj którąś z przedstawionych metod wyboru atrybutów do ograniczenia zbioru atrybutów do 5 najważniejszych atrybutów. Sprawdź, czy uda Ci się odzyskać atrybuty informacyjne. Swoje rozwiązanie w postaci notatnika Jupyter prześlij na adres Mikolaj.Morzy@put.poznan.pl do **piątku, 10 kwietnia, do godziny 12:00**. Pamiętaj, że Twoja analiza ma być w pełni reprodukowalna, zatem nie może zakładać np. obecności plików na lokalnym dysku. Przyjmij, że w środowisku uruchomieniowym będą obecne pakiety: `pandas`,`numpy`,`sklearn`,`matplotlib`,`tqdm`, oraz standardowa biblioteka Pythona 3.7. Wszystkie dodatkowe pakiety muszą być jawnie instalowane (np. przez `!pip install abc`)."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": false,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {
    "height": "calc(100% - 180px)",
    "left": "10px",
    "top": "150px",
    "width": "357.865px"
   },
   "toc_section_display": true,
   "toc_window_display": true
  },
  "varInspector": {
   "cols": {
    "lenName": 16,
    "lenType": 16,
    "lenVar": 40
   },
   "kernels_config": {
    "python": {
     "delete_cmd_postfix": "",
     "delete_cmd_prefix": "del ",
     "library": "var_list.py",
     "varRefreshCmd": "print(var_dic_list())"
    },
    "r": {
     "delete_cmd_postfix": ") ",
     "delete_cmd_prefix": "rm(",
     "library": "var_list.r",
     "varRefreshCmd": "cat(var_dic_list()) "
    }
   },
   "types_to_exclude": [
    "module",
    "function",
    "builtin_function_or_method",
    "instance",
    "_Feature"
   ],
   "window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}