{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Data preprocessing in Python\n", "\n", "In this exercise we will get to know the simplest data preprocessing techniques (normalization, one-hot encoding, binarization) using the `scikit-learn` and `pandas` libraries." ] }, { "cell_type": "code", "execution_count": 93, "metadata": { "ExecuteTime": { "end_time": "2020-03-19T20:51:05.638850Z", "start_time": "2020-03-19T20:51:05.635878Z" } }, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import sklearn\n", "import matplotlib.pyplot as plt\n", "\n", "from sklearn import datasets, preprocessing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Fisher's irises" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `load_iris()` function creates an object containing the famous [Fisher's Iris data set](https://en.wikipedia.org/wiki/Iris_flower_data_set)." ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "ExecuteTime": { "end_time": "2020-03-19T20:12:47.463822Z", "start_time": "2020-03-19T20:12:47.456767Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Feature names: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']\n", "Decision classes: ['setosa' 'versicolor' 'virginica']\n", "data shape: (150, 4)\n", "target shape: (150,)\n" ] } ], "source": [ "iris = datasets.load_iris()\n", "\n", "print('Feature names: ', iris.feature_names)\n", "print('Decision classes: ', iris.target_names)\n", "\n", "print('data shape: ', iris.data.shape)\n", "print('target shape: ', iris.target.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this exercise we will use the [pandas](https://pandas.pydata.org) library to store the data and the intermediate results of computations. Pandas is a specialized library that works very well with [NumPy](http://www.numpy.org) and with [scikit-learn](http://scikit-learn.org), a leading tool for data mining and machine learning. If you have never worked with `pandas`, it is worth taking a look at a short introduction, e.g. [here](https://medium.com/@wbusaka/a-gentle-introduction-to-pandas-5ed17421a59d), [here](https://towardsdatascience.com/an-introduction-to-pandas-in-python-b06d2dd51aba) or [here](https://towardsdatascience.com/a-quick-introduction-to-the-pandas-python-library-f1b678f34673)." ] }, { "cell_type": "code", "execution_count": 54, "metadata": { "ExecuteTime": { "end_time": "2020-03-19T20:36:29.498789Z", "start_time": "2020-03-19T20:36:29.486093Z" } }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) \\\n", "0 5.1 3.5 1.4 0.2 \n", "1 4.9 3.0 1.4 0.2 \n", "2 4.7 3.2 1.3 0.2 \n", "3 4.6 3.1 1.5 0.2 \n", "4 5.0 3.6 1.4 0.2 \n", "5 5.4 3.9 1.7 0.4 \n", "6 4.6 3.4 1.4 0.3 \n", "7 5.0 3.4 1.5 0.2 \n", "8 4.4 2.9 1.4 0.2 \n", "9 4.9 3.1 1.5 0.1 \n", "\n", " target \n", "0 0 \n", "1 0 \n", "2 0 \n", "3 0 \n", "4 0 \n", "5 0 \n", "6 0 \n", "7 0 \n", "8 0 \n", "9 0 " ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# create a new DataFrame object\n", "df = pd.DataFrame(iris.data)\n", "\n", "# add a new column\n", "df['target'] = iris.target\n", "\n", "# rename the columns\n", "df.columns = iris.feature_names + ['target']\n", "\n", "df.head(n=10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Each column of a `DataFrame` is a `Series` and offers [a very rich API](https://pandas.pydata.org/docs/reference/series.html)." ] }, { "cell_type": "code", "execution_count": 55, "metadata": { "ExecuteTime": { "end_time": "2020-03-19T20:36:33.612402Z", "start_time": "2020-03-19T20:36:33.605527Z" } }, "outputs": [ { "data": { "text/plain": [ "count 150.000000\n", "mean 5.843333\n", "std 0.828066\n", "min 4.300000\n", "25% 5.100000\n", "50% 5.800000\n", "75% 6.400000\n", "max 7.900000\n", "Name: sepal length (cm), dtype: float64" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['sepal length (cm)'].describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will often use the `apply()` method, which lets us define *ad hoc* functions that are executed on each element of a series."
 ] }, { "cell_type": "code", "execution_count": 56, "metadata": { "ExecuteTime": { "end_time": "2020-03-19T20:36:34.924257Z", "start_time": "2020-03-19T20:36:34.918951Z" } }, "outputs": [ { "data": { "text/plain": [ "0 True\n", "1 False\n", "2 False\n", "3 False\n", "4 False\n", "Name: sepal length (cm), dtype: bool" ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['sepal length (cm)'].head().apply(lambda x: x > 5.0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For plotting we will use the [Matplotlib](https://matplotlib.org) library, the basic tool for creating simple visualizations in Python." ] }, { "cell_type": "code", "execution_count": 57, "metadata": { "ExecuteTime": { "end_time": "2020-03-19T20:36:35.951398Z", "start_time": "2020-03-19T20:36:35.792488Z" } }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "x = df['sepal length (cm)'][:]\n", "y = df['sepal width (cm)'][:]\n", "t = df['target']\n", "\n", "plt.scatter(x, y, c=t)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A similar plot can be obtained directly with the `plot()` method provided by a pandas `DataFrame`. In the example below `iloc` stands for *integer location*, and the expression selects all rows (`:`) and the second and third columns (`[1,2]`; columns are indexed from 0). The arguments `x=0` and `y=1` refer to the positions of the columns within that selection." ] }, { "cell_type": "code", "execution_count": 58, "metadata": { "ExecuteTime": { "end_time": "2020-03-19T20:36:37.321558Z", "start_time": "2020-03-19T20:36:37.141699Z" } }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 58, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "df.iloc[:,[1,2]].plot(kind='scatter', x=0, y=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Normalization" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The first operation is linear normalization, implemented by the [MinMaxScaler](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html#sklearn.preprocessing.MinMaxScaler) class. The class applies the following transformation to an attribute:\n", "\n", "$$v' = \\frac{v-min}{max-min} \\cdot (max'-min') + min'$$\n", "\n", "where $max,min$ are the original maximum and minimum values of the attribute, $max',min'$ are the maximum and minimum values on the new scale (set with the `feature_range` parameter), $v'$ is the new value of the attribute, and $v$ is its original value." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since we will only process the flower attributes and not the label (the last column), we first store the four columns with the flower features (and their names) in separate variables." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2020-03-19T20:40:21.508696Z", "start_time": "2020-03-19T20:40:21.502812Z" } }, "outputs": [], "source": [ "X = df.iloc[:, :-1]\n", "cols = df.columns[:-1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The code below shows how to normalize the whole data table." ] }, { "cell_type": "code", "execution_count": 78, "metadata": { "ExecuteTime": { "end_time": "2020-03-19T20:43:31.979315Z", "start_time": "2020-03-19T20:43:31.964111Z" } }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)\n", "0 0.222222 0.625000 0.067797 0.041667\n", "1 0.166667 0.416667 0.067797 0.041667\n", "2 0.111111 0.500000 0.050847 0.041667\n", "3 0.083333 0.458333 0.084746 0.041667\n", "4 0.194444 0.666667 0.067797 0.041667" ] }, "execution_count": 78, "metadata": {}, "output_type": "execute_result" } ], "source": [ "norm = preprocessing.MinMaxScaler(feature_range=(0,1)).fit(X)\n", "X_minmax = pd.DataFrame(norm.transform(X), columns=cols)\n", "\n", "X_minmax.head()" ] }, { "cell_type": "code", "execution_count": 72, "metadata": { "ExecuteTime": { "end_time": "2020-03-19T20:42:01.120369Z", "start_time": "2020-03-19T20:42:01.100167Z" } }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " sepal length (cm) sepal width (cm) petal length (cm) \\\n", "count 150.000000 150.000000 150.000000 \n", "mean 0.428704 0.440556 0.467458 \n", "std 0.230018 0.181611 0.299203 \n", "min 0.000000 0.000000 0.000000 \n", "25% 0.222222 0.333333 0.101695 \n", "50% 0.416667 0.416667 0.567797 \n", "75% 0.583333 0.541667 0.694915 \n", "max 1.000000 1.000000 1.000000 \n", "\n", " petal width (cm) \n", "count 150.000000 \n", "mean 0.458056 \n", "std 0.317599 \n", "min 0.000000 \n", "25% 0.083333 \n", "50% 0.500000 \n", "75% 0.708333 \n", "max 1.000000 " ] }, "execution_count": 72, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X_minmax.describe()" ] }, { "cell_type": "code", "execution_count": 74, "metadata": { "ExecuteTime": { "end_time": "2020-03-19T20:42:37.735209Z", "start_time": "2020-03-19T20:42:37.596036Z" } }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "x = X_minmax['sepal length (cm)'][:]\n", "y = X_minmax['sepal width (cm)'][:]\n", "t = df['target']\n", "\n", "plt.scatter(x, y, c=t)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Standardization" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another kind of normalization is the *standardization* of a variable, i.e. transforming it so that its mean equals 0 and its standard deviation equals 1. In scikit-learn this is done by the [StandardScaler](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler) class, which applies the following transformation:\n", "\n", "$$v' = \\frac{v-\\mu}{\\sigma}$$\n", "\n", "where $\\mu$ is the mean of the attribute and $\\sigma$ is its standard deviation. `StandardScaler` uses the population standard deviation, whereas `describe()` in pandas reports the sample standard deviation, which is why the summary below shows `std` of about 1.00335 instead of exactly 1." ] }, { "cell_type": "code", "execution_count": 76, "metadata": { "ExecuteTime": { "end_time": "2020-03-19T20:43:16.531540Z", "start_time": "2020-03-19T20:43:16.520682Z" } }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)\n", "0 -0.900681 1.019004 -1.340227 -1.315444\n", "1 -1.143017 -0.131979 -1.340227 -1.315444\n", "2 -1.385353 0.328414 -1.397064 -1.315444\n", "3 -1.506521 0.098217 -1.283389 -1.315444\n", "4 -1.021849 1.249201 -1.340227 -1.315444" ] }, "execution_count": 76, "metadata": {}, "output_type": "execute_result" } ], "source": [ "scale = preprocessing.StandardScaler().fit(X)\n", "X_scaled = pd.DataFrame(scale.transform(X), columns=cols)\n", "\n", "X_scaled.head()" ] }, { "cell_type": "code", "execution_count": 77, "metadata": { "ExecuteTime": { "end_time": "2020-03-19T20:43:22.359507Z", "start_time": "2020-03-19T20:43:22.339883Z" } }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " sepal length (cm) sepal width (cm) petal length (cm) \\\n", "count 1.500000e+02 1.500000e+02 1.500000e+02 \n", "mean -2.775558e-16 -9.695948e-16 -8.652338e-16 \n", "std 1.003350e+00 1.003350e+00 1.003350e+00 \n", "min -1.870024e+00 -2.433947e+00 -1.567576e+00 \n", "25% -9.006812e-01 -5.923730e-01 -1.226552e+00 \n", "50% -5.250608e-02 -1.319795e-01 3.364776e-01 \n", "75% 6.745011e-01 5.586108e-01 7.627583e-01 \n", "max 2.492019e+00 3.090775e+00 1.785832e+00 \n", "\n", " petal width (cm) \n", "count 1.500000e+02 \n", "mean -4.662937e-16 \n", "std 1.003350e+00 \n", "min -1.447076e+00 \n", "25% -1.183812e+00 \n", "50% 1.325097e-01 \n", "75% 7.906707e-01 \n", "max 1.712096e+00 " ] }, "execution_count": 77, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X_scaled.describe()" ] }, { "cell_type": "code", "execution_count": 79, "metadata": { "ExecuteTime": { "end_time": "2020-03-19T20:43:44.317559Z", "start_time": "2020-03-19T20:43:44.183999Z" } }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "x = X_scaled['sepal length (cm)'][:]\n", "y = X_scaled['sepal width (cm)'][:]\n", "t = df['target']\n", "\n", "plt.scatter(x, y, c=t)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Discretization" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Instead of specifying the interval boundaries of an attribute by hand, we can have them computed automatically by the [KBinsDiscretizer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.KBinsDiscretizer.html#sklearn.preprocessing.KBinsDiscretizer) class, which splits the range of the attribute into a given number _k_ of bins. With `strategy='kmeans'`, used below, the values in each bin share the same nearest center of a one-dimensional k-means clustering." ] }, { "cell_type": "code", "execution_count": 80, "metadata": { "ExecuteTime": { "end_time": "2020-03-19T20:43:57.151420Z", "start_time": "2020-03-19T20:43:56.989962Z" } }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "kbin = preprocessing.KBinsDiscretizer(n_bins=3, strategy='kmeans', encode='ordinal').fit(df[['sepal length (cm)']])\n", "\n", "df_kbinned = pd.DataFrame(kbin.transform(df[['sepal length (cm)']]))\n", "\n", "x = df['sepal length (cm)'][:]\n", "y = df_kbinned[:]\n", "t = df['target']\n", "\n", "plt.scatter(x, y, c=t)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Binarization" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sometimes an attribute has to be transformed into a binary flag that represents the result of a logical test on its values. This is done by the [Binarizer](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Binarizer.html) class, which maps values greater than `threshold` to 1 and all other values to 0." ] }, { "cell_type": "code", "execution_count": 89, "metadata": { "ExecuteTime": { "end_time": "2020-03-19T20:49:46.273689Z", "start_time": "2020-03-19T20:49:46.259197Z" } }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) \\\n", "0 5.1 3.5 1.4 0.2 \n", "1 4.9 3.0 1.4 0.2 \n", "2 4.7 3.2 1.3 0.2 \n", "3 4.6 3.1 1.5 0.2 \n", "4 5.0 3.6 1.4 0.2 \n", "\n", " target sepal length (cm) sepal width (cm) petal length (cm) \\\n", "0 0 1.0 1.0 0.0 \n", "1 0 1.0 0.0 0.0 \n", "2 0 1.0 1.0 0.0 \n", "3 0 1.0 1.0 0.0 \n", "4 0 1.0 1.0 0.0 \n", "\n", " petal width (cm) \n", "0 0.0 \n", "1 0.0 \n", "2 0.0 \n", "3 0.0 \n", "4 0.0 " ] }, "execution_count": 89, "metadata": {}, "output_type": "execute_result" } ], "source": [ "binarize = preprocessing.Binarizer(threshold=3).fit(X)\n", "\n", "X_binned = pd.DataFrame(binarize.transform(X), columns=cols)\n", "\n", "pd.concat([df,X_binned], axis=1).head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Displaying histograms" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To simply count the values in a column you can use:\n", "- `pandas.Series.value_counts()`\n", "- `collections.Counter`\n", "\n", "whereas if you only want to draw a histogram of the values in a column, `pandas.Series.hist()` is enough." ] }, { "cell_type": "code", "execution_count": 91, "metadata": { "ExecuteTime": { "end_time": "2020-03-19T20:50:24.543861Z", "start_time": "2020-03-19T20:50:24.538194Z" } }, "outputs": [ { "data": { "text/plain": [ "0.0 83\n", "1.0 67\n", "Name: sepal width (cm), dtype: int64" ] }, "execution_count": 91, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X_binned['sepal width (cm)'].value_counts()" ] }, { "cell_type": "code", "execution_count": 92, "metadata": { "ExecuteTime": { "end_time": "2020-03-19T20:50:52.909404Z", "start_time": "2020-03-19T20:50:52.905402Z" } }, "outputs": [ { "data": { "text/plain": [ "Counter({1.0: 67, 0.0: 83})" ] }, "execution_count": 92, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from collections import Counter\n", "\n", "Counter(X_binned['sepal width 
(cm)'].values)" ] }, { "cell_type": "code", "execution_count": 95, "metadata": { "ExecuteTime": { "end_time": "2020-03-19T20:51:36.703602Z", "start_time": "2020-03-19T20:51:36.504641Z" } }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 95, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXQAAAD8CAYAAABn919SAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDMuMC4zLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvnQurowAAE1BJREFUeJzt3XuMHWd5x/Hvg00KeMFOCJxaTto1IqRNY3HxEQpCpWcxIEMqHKkoCgqtU1msgJbS0qq4RRX0gpT8YSiNkFqLULuVYRMiqC3CLTXZoiISsElgcykkBAdiHBuwvbDgAilP/9gxGMfOmT2XOfa7349keWbOO/s+z6798/g9l4nMRJJ09nvCqAuQJA2GgS5JhTDQJakQBrokFcJAl6RCGOiSVAgDXZIKYaBLUiEMdEkqxNImJzv//PNzfHy8p3N/+MMfsmzZssEWdIaz58XBnsvXb7979+79bmY+o9u4RgN9fHycPXv29HTu9PQ0nU5nsAWd4ex5cbDn8vXbb0Q8VGecSy6SVAgDXZIKYaBLUiEMdEkqhIEuSYUw0CWpEAa6JBXCQJekQhjoklSIRt8p2o+Z/bNcs/mWxufdd+3ljc8pSb3wCl2SCmGgS1IhDHRJKkStQI+IP4uIeyLi7oj4UEQ8KSJWR8QdEfFARNwYEecMu1hJ0ul1DfSIWAX8CdDOzEuBJcBVwHXAezLz2cARYNMwC5UkPb66Sy5LgSdHxFLgKcAB4KXAzdXj24ErBl+eJKmuyMzugyLeArwLOAZ8GngLcHt1dU5EXAh8orqCP/ncSWASoNVqrZ2amuqp0EOHZzl4rKdT+7Jm1fLmJ63Mzc0xNjY2svlHwZ4Xh8XWc7/9TkxM7M3MdrdxXV+HHhHnAhuA1cBR4MPA+rqFZOZWYCtAu93OXu/acf2OnWyZaf5l8/uu7jQ+53GL7a4uYM+LxWLrual+6yy5vAz4RmZ+JzN/CnwEeDGwolqCAbgA2D+kGiVJNdQJ9G8Cl0XEUyIigHXAvcBtwGuqMRuBncMpUZJUR9dAz8w7mH/y80vATHXOVuBtwFsj4gHg6cANQ6xTktRFrUXpzHwH8I6TDj8IvHDgFUmSeuI7RSWpEAa6JBXCQJekQhjoklQIA12SCmGgS1IhDHRJKoSBLkmFMNAlqRAGuiQVwkCXpEIY6JJUCANdkgphoEtSIQx0SSpE10CPiIsj4q4Tfn0/Iv40Is6LiFsj4v7q93ObKFiSdGp17lj01cx8XmY+D1gL/Aj4KLAZ2J2ZFwG7q31J0ogsdMllHfD1zHwI2ABsr45vB64YZGGSpIVZaKBfBXyo2m5l5oFq+xGgNbCqJEkLFplZb2DEOcC3gd/KzIMRcTQzV5zw+JHMfMw6ekRMApMArVZr7dTUVE+FHjo8y8FjPZ3alzWrljc/aWVubo6xsbGRzT8K9rw4LLae++13YmJib2a2u42rdZPoyiuBL2XmwWr/YESszMwDEbESOHSqkzJzK7AVoN1uZ6fTWcCUv3D9jp1smVlIuYOx7+pO43MeNz09Ta/fr7OVPS8Oi63npvpdyJLLa/nFcgvALmBjtb0R2DmooiRJC1cr0CNiGfBy4CMnHL4WeHlE3A+8rNqXJI1IrTWMzPwh8PSTjn2P+Ve9S
JLOAL5TVJIKYaBLUiEMdEkqRPOvA5SkERnffMtI5t22flkj83iFLkmFMNAlqRAGuiQVwkCXpEIY6JJUCANdkgphoEtSIQx0SSqEgS5JhTDQJakQBrokFcJAl6RC1L1j0YqIuDki/ici7ouIF0XEeRFxa0TcX/3+mBtES5KaU/cK/b3AJzPzN4DnAvcBm4HdmXkRsLvalySNSNdAj4jlwEuAGwAy8yeZeRTYAGyvhm0HrhhWkZKk7upcoa8GvgP8a0TcGRHvr24a3crMA9WYR4DWsIqUJHUXmfn4AyLawO3AizPzjoh4L/B94M2ZueKEcUcy8zHr6BExCUwCtFqttVNTUz0VeujwLAeP9XRqX9asWt78pJW5uTnGxsZGNv8o2PPiMKqeZ/bPNj4nwOrlS/rqd2JiYm9mtruNqxPovwrcnpnj1f5vM79e/mygk5kHImIlMJ2ZFz/e12q327lnz56aLfyy63fsZMtM8zdY2nft5Y3Pedz09DSdTmdk84+CPS8Oo+p5lHcs6qffiKgV6F2XXDLzEeBbEXE8rNcB9wK7gI3VsY3Azh5rlSQNQN1L3jcDOyLiHOBB4A+Z/8fgpojYBDwEXDmcEiVJddQK9My8CzjV5f66wZYjSeqV7xSVpEIY6JJUCANdkgphoEtSIQx0SSqEgS5JhTDQJakQBrokFcJAl6RCGOiSVAgDXZIKYaBLUiEMdEkqhIEuSYUw0CWpEAa6JBWi1g0uImIf8APg/4BHM7MdEecBNwLjwD7gysw8MpwyJUndLOQKfSIzn3fCjUo3A7sz8yJgd7UvSRqRfpZcNgDbq+3twBX9lyNJ6lXdQE/g0xGxNyImq2OtzDxQbT8CtAZenSSptsjM7oMiVmXm/oh4JnAr8GZgV2auOGHMkcw89xTnTgKTAK1Wa+3U1FRPhR46PMvBYz2d2pc1q5Y3P2llbm6OsbGxkc0/Cva8OIyq55n9s43PCbB6+ZK++p2YmNh7wnL3adUK9F86IeKdwBzweqCTmQciYiUwnZkXP9657XY79+zZs6D5jrt+x062zNR6Dneg9l17eeNzHjc9PU2n0xnZ/KNgz4vDqHoe33xL43MCbFu/rK9+I6JWoHddcomIZRHx1OPbwCuAu4FdwMZq2EZgZ8/VSpL6VueStwV8NCKOj/9gZn4yIr4I3BQRm4CHgCuHV6YkqZuugZ6ZDwLPPcXx7wHrhlGUJGnhfKeoJBXCQJekQhjoklQIA12SCmGgS1IhDHRJKoSBLkmFMNAlqRAGuiQVwkCXpEIY6JJUCANdkgphoEtSIQx0SSqEgS5JhTDQJakQtQM9IpZExJ0R8bFqf3VE3BERD0TEjRFxzvDKlCR1s5Ar9LcA952wfx3wnsx8NnAE2DTIwiRJC1Mr0CPiAuBy4P3VfgAvBW6uhmwHrhhGgZKkeupeof8j8JfAz6r9pwNHM/PRav9hYNWAa5MkLUBk5uMPiPhd4FWZ+aaI6AB/AVwD3F4ttxARFwKfyMxLT3H+JDAJ0Gq11k5NTfVU6KHDsxw81tOpfVmzannzk1bm5uYYGxsb2fyjYM+Lw6h6ntk/2/icAKuXL+mr34mJib2Z2e42bmmNr/Vi4NUR8SrgScDTgPcCKyJiaXWVfgGw/1QnZ+ZWYCtAu93OTqdTr4OTXL9jJ1tm6pQ7WPuu7jQ+53HT09P0+v06W9nz4jCqnq/ZfEvjcwJsW7+skX67Lrlk5l9l5gWZOQ5cBXwmM68GbgNeUw3bCOwcWpWSpK76eR3624C3RsQDzK+p3zCYkiRJvVjQGkZmTgPT1faDwAsHX5IkqRe+U1SSCmGgS1IhDHRJKoSBLkmFMNAlqRAGuiQVwkCXpEIY6JJUCANdkgphoEtSIQx0SSqEgS5JhTDQJakQBrokFcJAl6RCGOiSVIiugR4RT4qIL0TElyPinoj42+r46oi4IyIeiIgbI+Kc4ZcrSTqdOlfoPwZempnPBZ4HrI+Iy4DrgPdk5rOBI8Cm4
ZUpSeqmzk2iMzPnqt0nVr8SeClwc3V8O3DFUCqUJNVSaw09IpZExF3AIeBW4OvA0cx8tBryMLBqOCVKkuqIzKw/OGIF8FHgb4Bt1XILEXEh8InMvPQU50wCkwCtVmvt1NRUT4UeOjzLwWM9ndqXNauWNz9pZW5ujrGxsZHNPwr2vDiMqueZ/bONzwmwevmSvvqdmJjYm5ntbuOWLuSLZubRiLgNeBGwIiKWVlfpFwD7T3POVmArQLvdzk6ns5Apf+76HTvZMrOgcgdi39Wdxuc8bnp6ml6/X2cre14cRtXzNZtvaXxOgG3rlzXSb51XuTyjujInIp4MvBy4D7gNeE01bCOwc1hFSpK6q3PJuxLYHhFLmP8H4KbM/FhE3AtMRcQ/AHcCNwyxTklSF10DPTO/Ajz/FMcfBF44jKIkSQvnO0UlqRAGuiQVwkCXpEIY6JJUCANdkgphoEtSIQx0SSqEgS5JhTDQJakQBrokFcJAl6RCGOiSVAgDXZIKYaBLUiEMdEkqhIEuSYWocwu6CyPitoi4NyLuiYi3VMfPi4hbI+L+6vdzh1+uJOl06lyhPwr8eWZeAlwG/FFEXAJsBnZn5kXA7mpfkjQiXQM9Mw9k5peq7R8wf4PoVcAGYHs1bDtwxbCKlCR1t6A19IgYZ/7+oncArcw8UD30CNAaaGWSpAWJzKw3MGIM+C/gXZn5kYg4mpkrTnj8SGY+Zh09IiaBSYBWq7V2amqqp0IPHZ7l4LGeTu3LmlXLm5+0Mjc3x9jY2MjmHwV7XhxG1fPM/tnG5wRYvXxJX/1OTEzszcx2t3G1Aj0ingh8DPhUZr67OvZVoJOZByJiJTCdmRc/3tdpt9u5Z8+eWg2c7PodO9kys7Snc/ux79rLG5/zuOnpaTqdzsjmHwV7XhxG1fP45lsanxNg2/plffUbEbUCvc6rXAK4AbjveJhXdgEbq+2NwM5eCpUkDUadS94XA78PzETEXdWxvwauBW6KiE3AQ8CVwylRklRH10DPzP8G4jQPrxtsOZKkXvlOUUkqhIEuSYUw0CWpEAa6JBXCQJekQhjoklQIA12SCmGgS1IhDHRJKoSBLkmFMNAlqRAGuiQVwkCXpEIY6JJUCANdkgphoEtSIercgu4DEXEoIu4+4dh5EXFrRNxf/f6Ym0NLkppV5wp9G7D+pGObgd2ZeRGwu9qXJI1Q10DPzM8Ch086vAHYXm1vB64YcF2SpAXqdQ29lZkHqu1HgNaA6pEk9Sgys/ugiHHgY5l5abV/NDNXnPD4kcw85Tp6REwCkwCtVmvt1NRUT4UeOjzLwWM9ndqXNauWNz9pZW5ujrGxsZHNPwr2vDiMqueZ/bONzwmwevmSvvqdmJjYm5ntbuOW9vj1D0bEysw8EBErgUOnG5iZW4GtAO12OzudTk8TXr9jJ1tmei23d/uu7jQ+53HT09P0+v06W9nz4jCqnq/ZfEvjcwJsW7+skX57XXLZBWystjcCOwdTjiSpV3Vetvgh4PPAxRHxcERsAq4FXh4R9wMvq/YlSSPUdQ0jM197mofWDbgWSVIffKeoJBXCQJekQhjoklQIA12SCmGgS1IhDHRJKoSBLkmFMNAlqRAGuiQVwkCXpEIY6JJUCANdkgphoEtSIQx0SSqEgS5JhTDQJakQfQV6RKyPiK9GxAMRsXlQRUmSFq7nQI+IJcD7gFcClwCvjYhLBlWYJGlh+rlCfyHwQGY+mJk/AaaADYMpS5K0UP0E+irgWyfsP1wdkySNQNebRPcrIiaByWp3LiK+2uOXOh/47mCqqi+ua3rGXzKSnkfMnheHRdXzxHV99/vrdQb1E+j7gQtP2L+gOvZLMnMrsLWPeQCIiD2Z2e7365xN7HlxsOfyNdVvP0suXwQuiojVEXEOcBWwazBlSZIWqucr9Mx8NCL+GPgUsAT4QGbeM7DKJEkL0tcaemZ+HPj4gGrppu9lm7OQPS8O9ly+RvqNzGxiHknSkPnWf0kqxBkX6
N0+TiAifiUibqwevyMixpuvcrBq9PzWiLg3Ir4SEbsjotZLmM5kdT82IiJ+LyIyIs7qV0TU6Tcirqx+zvdExAebrnHQavy5/rWIuC0i7qz+bL9qFHUOUkR8ICIORcTdp3k8IuKfqu/JVyLiBQMtIDPPmF/MP7n6deBZwDnAl4FLThrzJuCfq+2rgBtHXXcDPU8AT6m237gYeq7GPRX4LHA70B513UP+GV8E3AmcW+0/c9R1N9DzVuCN1fYlwL5R1z2Avl8CvAC4+zSPvwr4BBDAZcAdg5z/TLtCr/NxAhuA7dX2zcC6iIgGaxy0rj1n5m2Z+aNq93bmX/N/Nqv7sRF/D1wH/G+TxQ1BnX5fD7wvM48AZOahhmsctDo9J/C0ans58O0G6xuKzPwscPhxhmwA/i3n3Q6siIiVg5r/TAv0Oh8n8PMxmfkoMAs8vZHqhmOhH6Gwifl/4c9mXXuu/it6YWbe0mRhQ1LnZ/wc4DkR8bmIuD0i1jdW3XDU6fmdwOsi4mHmXy335mZKG6mhfmTK0N/6r8GJiNcBbeB3Rl3LMEXEE4B3A9eMuJQmLWV+2aXD/P/APhsRazLz6EirGq7XAtsyc0tEvAj494i4NDN/NurCzlZn2hV6nY8T+PmYiFjK/H/VvtdIdcNR6yMUIuJlwNuBV2fmjxuqbVi69fxU4FJgOiL2Mb/WuOssfmK0zs/4YWBXZv40M78BfI35gD9b1el5E3ATQGZ+HngS85/xUrJaf997daYFep2PE9gFbKy2XwN8JqtnG85SXXuOiOcD/8J8mJ/ta6vQpefMnM3M8zNzPDPHmX/e4NWZuWc05fatzp/r/2D+6pyIOJ/5JZgHmyxywOr0/E1gHUBE/Cbzgf6dRqts3i7gD6pXu1wGzGbmgYF99VE/K3yaZ4G/xvwz5G+vjv0d83+hYf6H/mHgAeALwLNGXXMDPf8ncBC4q/q1a9Q1D7vnk8ZOcxa/yqXmzziYX2a6F5gBrhp1zQ30fAnwOeZfAXMX8IpR1zyAnj8EHAB+yvz/ujYBbwDecMLP+X3V92Rm0H+ufaeoJBXiTFtykST1yECXpEIY6JJUCANdkgphoEtSIQx0SSqEgS5JhTDQJakQ/w+MaBjA6KjCLgAAAABJRU5ErkJggg==\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "X_binned['sepal width (cm)'].hist()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Uzupełnienie wartości brakujących" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Wartości brakujące mogą poważnie zaburzyć wynik analizy. Wiele algorytmów klasyfikacji i regresji nie akceptuje danych wejściowych, w których występują wartości puste. Klasa [SimpleImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html) pozwala zamienić wartości brakujące na wartość średnią, medianę lub wartość modalną wyznaczaną na podstawie całego atrybutu." ] }, { "cell_type": "code", "execution_count": 96, "metadata": { "ExecuteTime": { "end_time": "2020-03-19T20:51:46.149837Z", "start_time": "2020-03-19T20:51:46.143719Z" }, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[ 1. 2. nan]\n", " [nan 4. 5.]\n", " [ 6. nan 7.]]\n", "\n", "[[1. 2. 6. ]\n", " [3.5 4. 5. ]\n", " [6. 3. 7. ]]\n" ] } ], "source": [ "from sklearn.impute import SimpleImputer\n", "\n", "matrix = np.array([[ 1, 2, np.nan], [np.nan, 4, 5], [6, np.nan, 7]])\n", "\n", "# alternatywne strategie to 'mean', 'median' i 'most_frequent'\n", "imp = SimpleImputer(missing_values=np.nan, strategy='mean').fit(matrix)\n", "\n", "print(matrix)\n", "print()\n", "print(imp.transform(matrix))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Kodowanie zmiennej celu" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Bardzo często wykorzystywaną klasą jest [OneHotEncoder](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder) która zamienia atrybut kategoryczny na flagi binarne. Po transformacji tworzonych jest $k$ flag binarnych, gdzie $k$ to liczba unikanlnych wartości występujących w oryginalnym atrybucie. 
Wynikiem transformacji jest indeks bitmapowy oryginalnego atrybutu." ] }, { "cell_type": "code", "execution_count": 97, "metadata": { "ExecuteTime": { "end_time": "2020-03-19T20:51:56.710233Z", "start_time": "2020-03-19T20:51:56.705900Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n", " 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1\n", " 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2\n", " 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2\n", " 2 2]\n" ] } ], "source": [ "df_target = df['target'].values\n", "\n", "print(df_target)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "ExecuteTime": { "end_time": "2020-03-19T20:04:11.226818Z", "start_time": "2020-03-19T20:04:11.202137Z" } }, "outputs": [ { "data": { "text/plain": [ "matrix([[1., 0., 0.],\n", " [1., 0., 0.],\n", " [1., 0., 0.],\n", " [1., 0., 0.],\n", " [1., 0., 0.],\n", " [1., 0., 0.],\n", " [1., 0., 0.],\n", " [1., 0., 0.],\n", " [1., 0., 0.],\n", " [1., 0., 0.],\n", " [1., 0., 0.],\n", " [1., 0., 0.],\n", " [1., 0., 0.],\n", " [1., 0., 0.],\n", " [1., 0., 0.],\n", " [1., 0., 0.],\n", " [1., 0., 0.],\n", " [1., 0., 0.],\n", " [1., 0., 0.],\n", " [1., 0., 0.],\n", " [1., 0., 0.],\n", " [1., 0., 0.],\n", " [1., 0., 0.],\n", " [1., 0., 0.],\n", " [1., 0., 0.],\n", " [1., 0., 0.],\n", " [1., 0., 0.],\n", " [1., 0., 0.],\n", " [1., 0., 0.],\n", " [1., 0., 0.],\n", " [1., 0., 0.],\n", " [1., 0., 0.],\n", " [1., 0., 0.],\n", " [1., 0., 0.],\n", " [1., 0., 0.],\n", " [1., 0., 0.],\n", " [1., 0., 0.],\n", " [1., 0., 0.],\n", " [1., 0., 0.],\n", " [1., 0., 0.],\n", " [1., 0., 0.],\n", " [1., 0., 0.],\n", " [1., 0., 0.],\n", " [1., 0., 0.],\n", " [1., 0., 0.],\n", " [1., 0., 0.],\n", " [1., 0., 0.],\n", " [1., 0., 0.],\n", " [1., 0., 0.],\n", " [1., 0., 0.],\n", " [0., 1., 0.],\n", " [0., 1., 0.],\n", " [0., 1., 
0.],\n", " [0., 1., 0.],\n", " [0., 1., 0.],\n", " [0., 1., 0.],\n", " [0., 1., 0.],\n", " [0., 1., 0.],\n", " [0., 1., 0.],\n", " [0., 1., 0.],\n", " [0., 1., 0.],\n", " [0., 1., 0.],\n", " [0., 1., 0.],\n", " [0., 1., 0.],\n", " [0., 1., 0.],\n", " [0., 1., 0.],\n", " [0., 1., 0.],\n", " [0., 1., 0.],\n", " [0., 1., 0.],\n", " [0., 1., 0.],\n", " [0., 1., 0.],\n", " [0., 1., 0.],\n", " [0., 1., 0.],\n", " [0., 1., 0.],\n", " [0., 1., 0.],\n", " [0., 1., 0.],\n", " [0., 1., 0.],\n", " [0., 1., 0.],\n", " [0., 1., 0.],\n", " [0., 1., 0.],\n", " [0., 1., 0.],\n", " [0., 1., 0.],\n", " [0., 1., 0.],\n", " [0., 1., 0.],\n", " [0., 1., 0.],\n", " [0., 1., 0.],\n", " [0., 1., 0.],\n", " [0., 1., 0.],\n", " [0., 1., 0.],\n", " [0., 1., 0.],\n", " [0., 1., 0.],\n", " [0., 1., 0.],\n", " [0., 1., 0.],\n", " [0., 1., 0.],\n", " [0., 1., 0.],\n", " [0., 1., 0.],\n", " [0., 1., 0.],\n", " [0., 1., 0.],\n", " [0., 1., 0.],\n", " [0., 1., 0.],\n", " [0., 0., 1.],\n", " [0., 0., 1.],\n", " [0., 0., 1.],\n", " [0., 0., 1.],\n", " [0., 0., 1.],\n", " [0., 0., 1.],\n", " [0., 0., 1.],\n", " [0., 0., 1.],\n", " [0., 0., 1.],\n", " [0., 0., 1.],\n", " [0., 0., 1.],\n", " [0., 0., 1.],\n", " [0., 0., 1.],\n", " [0., 0., 1.],\n", " [0., 0., 1.],\n", " [0., 0., 1.],\n", " [0., 0., 1.],\n", " [0., 0., 1.],\n", " [0., 0., 1.],\n", " [0., 0., 1.],\n", " [0., 0., 1.],\n", " [0., 0., 1.],\n", " [0., 0., 1.],\n", " [0., 0., 1.],\n", " [0., 0., 1.],\n", " [0., 0., 1.],\n", " [0., 0., 1.],\n", " [0., 0., 1.],\n", " [0., 0., 1.],\n", " [0., 0., 1.],\n", " [0., 0., 1.],\n", " [0., 0., 1.],\n", " [0., 0., 1.],\n", " [0., 0., 1.],\n", " [0., 0., 1.],\n", " [0., 0., 1.],\n", " [0., 0., 1.],\n", " [0., 0., 1.],\n", " [0., 0., 1.],\n", " [0., 0., 1.],\n", " [0., 0., 1.],\n", " [0., 0., 1.],\n", " [0., 0., 1.],\n", " [0., 0., 1.],\n", " [0., 0., 1.],\n", " [0., 0., 1.],\n", " [0., 0., 1.],\n", " [0., 0., 1.],\n", " [0., 0., 1.],\n", " [0., 0., 1.]])" ] }, "execution_count": 20, "metadata": {}, 
"output_type": "execute_result" } ], "source": [ "one_hot = preprocessing.OneHotEncoder(categories='auto').fit(df_target.reshape(-1,1))\n", "\n", "one_hot.transform(df_target.reshape(-1,1)).todense()" ] }, { "cell_type": "code", "execution_count": 98, "metadata": { "ExecuteTime": { "end_time": "2020-03-19T20:52:13.753025Z", "start_time": "2020-03-19T20:52:13.748602Z" } }, "outputs": [ { "data": { "text/plain": [ "array([[0]])" ] }, "execution_count": 98, "metadata": {}, "output_type": "execute_result" } ], "source": [ "one_hot.inverse_transform(np.array([[1,0,0]]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## zadanie samodzielne\n", "\n", "Zapoznaj się z dokumentacją klasy [Normalizer](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Normalizer.html#sklearn.preprocessing.Normalizer) która dokonuje normalizacji pojedynczych instancji w zbiorze uczącym. Dokonaj normalizacji zbioru *Iris*, sprawdzając przy tym, jaki jest efekt zmiany wartości parametru *norm* używanego przy inicjalizacji klasy.\n", "\n", "*podpowiedź* : wykorzystaj metodę [DataFrame.sum()](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sum.html)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": false, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": true }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { 
"python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 2 }