{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2020-03-25T12:27:12.639313Z", "start_time": "2020-03-25T12:27:10.696370Z" } }, "outputs": [], "source": [ "from sklearn import datasets, preprocessing, feature_selection\n", "from itertools import compress\n", "\n", "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Dataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this exercise we use the Boston housing prices dataset bundled with the library. Display the dataset description and familiarize yourself with the meaning of each attribute." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2020-03-25T12:27:14.237638Z", "start_time": "2020-03-25T12:27:14.206915Z" } }, "outputs": [], "source": [ "# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older release.\n", "boston = datasets.load_boston()\n", "\n", "print(boston.DESCR)\n", "\n", "print('Data shape: ', boston.data.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Attribute evaluation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Removing low-variance attributes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We start with a simple filter that discards attributes in which more than 75% of the examples share the same value, using the [VarianceThreshold](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.VarianceThreshold.html#sklearn.feature_selection.VarianceThreshold) class. The threshold `0.75 * (1 - 0.75) = 0.1875` is the variance of a Bernoulli variable, so the 75% interpretation is exact only for binary attributes; for other attributes the filter simply drops those whose variance falls below 0.1875. Inspect the filtering result and check what happens when you tighten the filtering criteria." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2020-03-25T12:27:20.740991Z", "start_time": "2020-03-25T12:27:20.700692Z" } }, "outputs": [], "source": [ "from sklearn.feature_selection import VarianceThreshold\n", "\n", "# Bernoulli variance p * (1 - p) for p = 0.75\n", "sel = VarianceThreshold(threshold=(.75 * (1 - .75)))\n", "boston_new = sel.fit_transform(boston.data)\n", "\n", "print('Data shape: ', boston_new.shape)\n", "\n", "# Keep only the names of the attributes that passed the filter\n", "feature_names = list(compress(boston.feature_names, sel.get_support()))\n", "pd.DataFrame(boston_new, columns=feature_names)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Attribute selection by linear regression" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we use the [SelectKBest](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html) selector to find the 2 best attributes from the point of view of simple linear regression." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2020-03-25T12:27:33.180651Z", "start_time": "2020-03-25T12:27:32.911265Z" } }, "outputs": [], "source": [ "from sklearn.feature_selection import SelectKBest, f_regression\n", "\n", "# f_regression scores each attribute with a univariate linear regression F-test\n", "boston_new = SelectKBest(f_regression, k=2).fit_transform(boston.data, boston.target)\n", "\n", "feature1 = boston_new[:, 0]\n", "feature2 = boston_new[:, 1]\n", "\n", "plt.plot(feature1, feature2, 'r.')\n", "plt.xlabel(\"Feature number 1\")\n", "plt.ylabel(\"Feature number 2\")\n", "plt.ylim([np.min(feature2), np.max(feature2)])\n", "plt.show()\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Attribute selection by regularized linear regression" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The next example finds the two best attributes from the point of view of a linear regression whose parameters are subject to L1 regularization. For this purpose we use the [SelectFromModel](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectFromModel.html#sklearn.feature_selection.SelectFromModel) class. Note that the regularized regression is implemented by the [LassoCV](http://scikit-learn.org/stable/modules/linear_model.html#lasso) class." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2020-03-25T13:48:10.640252Z", "start_time": "2020-03-25T13:48:10.590450Z" } }, "outputs": [], "source": [ "from sklearn.feature_selection import SelectFromModel\n", "from sklearn.linear_model import LassoCV\n", "\n", "clf = LassoCV()\n", "\n", "sfm = SelectFromModel(clf, threshold=0.25)\n", "sfm.fit(boston.data, boston.target)\n", "# Transform once up front so boston_new is defined even if the loop never runs\n", "boston_new = sfm.transform(boston.data)\n", "n_rows, n_features = boston_new.shape\n", "\n", "# Raise the coefficient threshold until at most two attributes remain\n", "while n_features > 2:\n", "    sfm.threshold += 0.1\n", "    boston_new = sfm.transform(boston.data)\n", "    n_rows, n_features = boston_new.shape\n", "\n", "feature1 = boston_new[:, 0]\n", "feature2 = boston_new[:, 1]\n", "\n", "plt.plot(feature1, feature2, 'r.')\n", "plt.xlabel(\"Feature number 1\")\n", "plt.ylabel(\"Feature number 2\")\n", "plt.ylim([np.min(feature2), np.max(feature2)])\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Recursive attribute selection" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The last example uses the [Recursive Feature Elimination](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html#sklearn.feature_selection.RFE) class to select attributes recursively. The method fits a classifier/regressor to rank the attributes and, at each step, removes the least important ones until the requested number of attributes remains." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2020-03-25T15:36:37.702052Z", "start_time": "2020-03-25T15:36:37.528057Z" } }, "outputs": [], "source": [ "from sklearn.feature_selection import RFE\n", "from sklearn.tree import DecisionTreeRegressor\n", "\n", "estimator = DecisionTreeRegressor()\n", "# n_features_to_select must be passed by keyword in recent scikit-learn releases\n", "selector = RFE(estimator, n_features_to_select=2, step=1)\n", "\n", "selector = selector.fit(boston.data, boston.target)\n", "\n", "# support_ is a boolean mask over the original attributes\n", "boston_new = boston.data[:, selector.support_]\n", "\n", "feature1 = boston_new[:, 0]\n", "feature2 = boston_new[:, 1]\n", "\n", "plt.plot(feature1, feature2, 'r.')\n", "plt.xlabel(\"Feature number 1\")\n", "plt.ylabel(\"Feature number 2\")\n", "plt.ylim([np.min(feature2), np.max(feature2)])\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2020-03-25T15:39:46.863766Z", "start_time": "2020-03-25T15:39:46.859440Z" } }, "outputs": [], "source": [ "# Selected attributes have rank 1; higher ranks were eliminated earlier\n", "for (attr, rank, selected) in zip(boston.feature_names, selector.ranking_, selector.support_):\n", "    print(f'{attr:>10}: rank={rank:<2} selected={selected}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Independent exercise\n", "\n", "Use the [sklearn.datasets.make_classification()](http://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_classification.html) function to build a dataset of 1000 samples described by 20 attributes, of which \n", "\n", "* 5 attributes are actually informative\n", "* 5 attributes are redundant\n", "* 5 attributes are duplicated. \n", "\n", "Use one of the attribute selection methods presented above to reduce the attribute set to the 5 most important attributes. Check whether you manage to recover the informative attributes. Send your solution as a Jupyter notebook to Mikolaj.Morzy@put.poznan.pl by **Friday, April 10, 12:00**. Remember that your analysis must be fully reproducible, so it must not assume, for example, that files are present on a local disk. Assume that the runtime environment provides the packages `pandas`,`numpy`,`sklearn`,`matplotlib`,`tqdm`, and the Python 3.7 standard library. Any additional packages must be installed explicitly (e.g. with `!pip install abc`)." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": false, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": { "height": "calc(100% - 180px)", "left": "10px", "top": "150px", "width": "357.865px" }, "toc_section_display": true, "toc_window_display": true }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 2 }