### Decision Trees in scikit-learn

The goal of this laboratory is to familiarize students with building tree classifiers using Python and the scikit-learn library.

The first step is to import the data into a [pandas DataFrame](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html).

In [None]:
import pandas as pd

df = pd.read_csv('http://dmlab.cs.put.poznan.pl/dokuwiki/lib/exe/fetch.php?media=dt_data.csv')

df.describe()

It is important to first get familiar with the data by looking at a small subset of rows.

In [None]:
df.head(10)

In [None]:
df.columns

We have to perform some basic data preprocessing:

* remove the _Unnamed: 3_ column
* remove the _Card_Cust_ID_ column (an identifier, not a feature)
* recode the label from 0/1 to no/yes

In [None]:
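# inplace=True modifies the existing DataFrame instead of returning a new one
df.drop('Unnamed: 3', axis=1, inplace=True)
df.drop('Card_Cust_ID', axis=1, inplace=True)

# assign the recoded column back: replace(..., inplace=True) called on df.column
# may act on a temporary copy and leave df unchanged in recent pandas versions
df['Spend_Drop_over50pct'] = df['Spend_Drop_over50pct'].replace([0, 1], ['no', 'yes'])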

df.head()

Scikit-learn's decision trees do not accept categorical variables given as strings, so these have to be encoded as numbers.

In [None]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

gender_ = LabelEncoder().fit(df.Gender)
df.Gender = gender_.transform(df.Gender)

df.head()

In [None]:
education_ = LabelEncoder().fit(df.Education_level)
df.Education_level = education_.transform(df.Education_level)

df.head()
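
To see what the encoder actually does, here is a minimal sketch on a toy list. The category names are made up for illustration and are not taken from the dataset.

```python
from sklearn.preprocessing import LabelEncoder

# Toy values, made up for illustration only
levels = ['Graduate', 'Primary', 'Secondary', 'Primary']

encoder = LabelEncoder().fit(levels)

# classes_ holds the original labels in sorted order;
# the position of a label in classes_ is its numeric code
print(list(encoder.classes_))

codes = encoder.transform(levels)
print(list(codes))

# inverse_transform maps the codes back to the original labels
print(list(encoder.inverse_transform(codes)))
```

Note that the codes follow alphabetical order, not any natural order of the categories, while a tree splits on them as if they were ordinary numbers.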

Next, we should examine the distribution of values of the target attribute.

In [None]:
df.Spend_Drop_over50pct.value_counts()

In [None]:
round( df.Spend_Drop_over50pct.value_counts()/len(df) * 100 , 2)
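
The same percentages can probably be obtained more directly with `value_counts(normalize=True)`. The sketch below uses a made-up label series, not the lab data.

```python
import pandas as pd

# Made-up labels: 9 'no' and 1 'yes'
labels = pd.Series(['no'] * 9 + ['yes'] * 1)

# normalize=True returns fractions instead of counts
shares = (labels.value_counts(normalize=True) * 100).round(2)
print(shares)
```

The share of the majority class is also the accuracy of a trivial classifier that always predicts that class, which is a useful baseline when reading the accuracy scores later on.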

Before building the classifier we have to construct the _train set_ and the _test set_.

In [None]:
from sklearn.model_selection import train_test_split

y = df.pop('Spend_Drop_over50pct')
X = df

X_train, X_test, y_train, y_test = train_test_split(
 X, y, test_size=0.7, random_state=42)

In [None]:
y_train.head()
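
With an imbalanced target, a purely random split can change the class proportions between the two sets. One option, not used in the cell above, is the `stratify` argument of `train_test_split`. A minimal sketch on synthetic labels (the `*_demo` names and the data are assumptions made for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels (90 'no', 10 'yes'), made up for illustration
y_demo = np.array(['no'] * 90 + ['yes'] * 10)
X_demo = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.7, random_state=42, stratify=y_demo)

# With stratify, both parts keep the 10% share of the minority class
print((y_tr == 'yes').mean(), (y_te == 'yes').mean())
```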

This is all we need to build a [Decision Tree classifier](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier).

In [None]:
from sklearn.tree import DecisionTreeClassifier

In [None]:
# build the model
model = DecisionTreeClassifier(
 criterion='gini', 
 max_depth=5, 
 min_samples_leaf=5)

# train the model
model.fit(X_train, y_train)

After building the tree we can check its accuracy, precision, and recall on the test set.

In [None]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

y_pred = model.predict(X_test)

print(f'accuracy: {accuracy_score(y_test, y_pred)}\n')
print(f'confusion matrix\n {confusion_matrix(y_test, y_pred)}')
print(classification_report(y_test, y_pred))
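
To make the link between the confusion matrix and the reported metrics explicit, the sketch below computes precision, recall, and accuracy by hand for the 'yes' class. The labels are hand-made for illustration; they are not model output.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hand-made labels, for illustration only
y_true = ['no', 'no', 'no', 'no', 'no', 'no', 'yes', 'yes', 'yes', 'yes']
y_hat = ['no', 'no', 'no', 'no', 'no', 'yes', 'yes', 'no', 'no', 'no']

# Rows are true classes, columns are predicted classes, in the order of `labels`
cm = confusion_matrix(y_true, y_hat, labels=['no', 'yes'])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)       # share of predicted 'yes' that are truly 'yes'
recall = tp / (tp + fn)          # share of true 'yes' that were found
accuracy = (tp + tn) / cm.sum()  # share of all predictions that are correct

print(precision, recall, accuracy)

# The same values come out of the library helpers
print(precision_score(y_true, y_hat, pos_label='yes'),
      recall_score(y_true, y_hat, pos_label='yes'))
```

Here accuracy looks acceptable while recall for 'yes' is low, which is the typical picture for an imbalanced target.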

It is also possible to visualize the generated tree using the [Graphviz library](http://www.webgraphviz.com)

In [None]:
from sklearn.tree import export_graphviz
import graphviz

export_graphviz(model, 
 out_file = "model.dot", 
 filled=True,
 feature_names = X_train.columns)

with open("model.dot") as f:
 dot_graph = f.read()
graphviz.Source(dot_graph)
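
If Graphviz is not installed, a plain-text rendering of the tree may be enough. The sketch below uses `sklearn.tree.export_text` on a synthetic dataset; the data and the feature names `f0`..`f3` are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic data standing in for the lab dataset
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0,
    random_state=0)
names = ['f0', 'f1', 'f2', 'f3']  # hypothetical feature names

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_demo, y_demo)

# export_text returns the decision rules as an indented string
rules = export_text(tree, feature_names=names)
print(rules)
```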

Guessing good values for the algorithm's hyperparameters can be tedious and difficult. It is much easier to let the computer explore the space of parameter values, using either an exhaustive grid search or a heuristic such as a randomized search.

In [None]:
from sklearn.model_selection import GridSearchCV

params = { 
 'max_depth' : range(1,15),
 'criterion' : ['gini','entropy'],
 'min_samples_leaf' : range(2,20)
}

model_ = GridSearchCV(estimator = DecisionTreeClassifier(), 
 cv=5,
 param_grid=params, 
 n_jobs=7)

# fit on the train set only, so that the test set used below stays unseen
model_.fit(X_train, y_train)

In [None]:
print(model_.best_score_)

In [None]:
model_.best_estimator_
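
An exhaustive search grows quickly with the number of parameter values. The sketch below counts the candidates in the grid defined above and then shows one possible heuristic, `RandomizedSearchCV`, which samples only a fixed number of them. The synthetic data and the choice of `n_iter=20` are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import ParameterGrid, RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

params = {
    'max_depth': range(1, 15),
    'criterion': ['gini', 'entropy'],
    'min_samples_leaf': range(2, 20),
}

# 14 depths x 2 criteria x 18 leaf sizes; each candidate is fitted once per fold
n_candidates = len(ParameterGrid(params))
print(n_candidates)

# Synthetic data standing in for the lab dataset
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=2, n_redundant=0,
    random_state=0)

# Randomized search evaluates only n_iter randomly chosen candidates
search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_distributions=params,
    n_iter=20,
    cv=5,
    random_state=0)
search.fit(X_demo, y_demo)

print(search.best_params_, search.best_score_)
```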

Let's use the best combination of parameters to score our classifier on the test set.

In [None]:
model = model_.best_estimator_
y_pred = model.predict(X_test)

print(f'accuracy: {accuracy_score(y_test, y_pred)}\n')
print(f'confusion matrix\n {confusion_matrix(y_test, y_pred)}')
print(classification_report(y_test, y_pred))

Unfortunately, the performance on the minority class is still not satisfactory. One possible solution is to give the classifier class weights, so that errors on the minority class cost more.

In [None]:
# build the model
model = DecisionTreeClassifier(
 criterion='gini', 
 max_depth=2, 
 min_samples_leaf=11,
 class_weight={'no': 1.0, 'yes': 5.0})

# train the model
model.fit(X_train, y_train)

# apply the model
y_pred = model.predict(X_test)

print(f'accuracy: {accuracy_score(y_test, y_pred)}\n')
print(f'confusion matrix\n {confusion_matrix(y_test, y_pred)}')
print(classification_report(y_test, y_pred))
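
The weights above are chosen by hand. As a starting point, scikit-learn can also derive weights from class frequencies, either with `class_weight='balanced'` in the classifier or explicitly with `compute_class_weight`. A minimal sketch on made-up labels:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Made-up labels: 8 'no' and 2 'yes'
y_demo = np.array(['no'] * 8 + ['yes'] * 2)
classes = np.array(['no', 'yes'])

# 'balanced' weight of a class = n_samples / (n_classes * class_count)
weights = compute_class_weight(class_weight='balanced', classes=classes, y=y_demo)
print(dict(zip(classes, weights)))
```

For these labels the formula gives 10 / (2 * 8) = 0.625 for 'no' and 10 / (2 * 2) = 2.5 for 'yes'.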

Finally, let us verify how the accuracy of a single decision tree compares with the accuracy of [Random Forests](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier).

In [None]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(
 criterion='gini',
 n_estimators=10,
 max_features=0.5,
 n_jobs=-1,
 class_weight={'no': 1.0, 'yes': 5.0},
 random_state=42)

# train the model
model.fit(X_train, y_train)

# apply the model
y_pred = model.predict(X_test)

print(f'accuracy: {accuracy_score(y_test, y_pred)}\n')
print(f'confusion matrix\n {confusion_matrix(y_test, y_pred)}')
print(classification_report(y_test, y_pred))
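
A single train/test split gives only one number per model. A side-by-side comparison with cross-validation may be more informative. The sketch below runs on synthetic data, so the scores it prints say nothing about the lab dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the lab dataset
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=4, random_state=0)

candidates = {
    'decision tree': DecisionTreeClassifier(random_state=42),
    'random forest': RandomForestClassifier(
        n_estimators=10, max_features=0.5, random_state=42),
}

# Mean accuracy over 5 folds for each model
scores = {}
for name, clf in candidates.items():
    scores[name] = cross_val_score(
        clf, X_demo, y_demo, cv=5, scoring='accuracy').mean()
    print(f'{name}: {scores[name]:.3f}')
```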

### Bias vs variance tradeoff

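The [bias vs variance](https://en.wikipedia.org/wiki/Bias–variance_tradeoff) tradeoff is a common problem in machine learning tasks. In very simple terms:

* high bias is the result of erroneous assumptions in the underlying model (e.g. a class of models that is too restrictive), and it leads to underfitting
* high variance means that the model is overly sensitive to the particular training sample, typically because it is too flexible (e.g. it has too many parameters), and it leads to overfitting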

This figure illustrates the main idea:

![bias vs variance](http://www.bogotobogo.com/python/scikit-learn/images/Bias-Tradeoff/Low-High-Variances-Biases.png)
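
The two failure modes can be sketched with decision trees by comparing a very shallow tree with a fully grown one. The data below is synthetic (with 10% label noise added through `flip_y`), so the exact scores are only illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy data standing in for a real dataset
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=4, flip_y=0.1,
    random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0)

results = {}
for depth in (1, None):  # a decision stump vs. an unrestricted tree
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    results[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))
    print(f'max_depth={depth}: train={results[depth][0]:.3f}, '
          f'test={results[depth][1]:.3f}')
```

The stump typically scores similarly on both sets (its error comes mostly from bias), while the unrestricted tree memorizes the train set and shows a gap between train and test accuracy (a sign of variance).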

In the following sections we will verify how much accuracy changes depending on the size of the train set. First, we will recreate the dataset.

If you **really want to understand** the bias vs variance problem, read [these lecture notes](https://sebastianraschka.com/pdf/lecture-notes/stat479fs18/08_eval-intro_notes.pdf).

In [None]:
df = pd.read_csv('http://dmlab.cs.put.poznan.pl/dokuwiki/lib/exe/fetch.php?media=dt_data.csv')

df.drop('Unnamed: 3', axis=1, inplace=True)
# assign the result back instead of using inplace=True on a column
df['Spend_Drop_over50pct'] = df['Spend_Drop_over50pct'].replace([0, 1], ['A', 'B'])

gender_ = LabelEncoder().fit(df.Gender)
df.Gender = gender_.transform(df.Gender)

education_ = LabelEncoder().fit(df.Education_level)
df.Education_level = education_.transform(df.Education_level)

The next cell shows how to compute accuracy scores on the train set and the validation set for different train set sizes. We use 5-fold cross-validation for each train set size, so the result contains one accuracy score per fold for every size.

In [None]:
from sklearn.model_selection import learning_curve

train_sizes = range(1, 350)

features = ['Gender', 'Education_level', 'Last_Month_spend', 'Last_3m_avg_spend']
target = 'Spend_Drop_over50pct'

train_sizes, train_scores, validation_scores = learning_curve(
 estimator=DecisionTreeClassifier(max_depth=2),
 X = df[features],
 y = df[target],
 train_sizes = train_sizes,
 cv = 5,
 scoring = 'accuracy',
 shuffle = True)

print('Training scores:\n', train_scores)
print()
print('Validation scores:\n', validation_scores)

Since we receive 5 results for each train set size (one per cross-validation fold), we have to aggregate these scores in order to produce the final plot.

In [None]:
train_scores_mean = train_scores.mean(axis = 1)
validation_scores_mean = validation_scores.mean(axis = 1)

print('Mean training scores\n', pd.Series(train_scores_mean, index = train_sizes))
print()
print('\nMean validation scores\n',pd.Series(validation_scores_mean, index = train_sizes))
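
The aggregation relies on the shape of the arrays returned by `learning_curve`: one row per train set size and one column per fold. A minimal sketch on synthetic data that also computes the spread across folds (the data and the four sizes are assumptions made for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the lab dataset
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0,
    random_state=0)

sizes, train_scores, validation_scores = learning_curve(
    DecisionTreeClassifier(max_depth=2, random_state=0),
    X_demo, y_demo,
    train_sizes=[20, 40, 80, 160],
    cv=5, scoring='accuracy', shuffle=True, random_state=0)

# One row per train set size, one column per fold
print(train_scores.shape, validation_scores.shape)

# axis=1 aggregates over folds, leaving one value per train set size
val_mean = validation_scores.mean(axis=1)
val_std = validation_scores.std(axis=1)
print(val_mean, val_std)
```

The standard deviation can be drawn as a band around the mean curve to show how much the folds disagree.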

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

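# the 'seaborn' style was renamed to 'seaborn-v0_8' in matplotlib 3.6
plt.style.use('seaborn-v0_8' if 'seaborn-v0_8' in plt.style.available else 'seaborn')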

plt.plot(train_sizes, 1-train_scores_mean, label = 'Training error')
plt.plot(train_sizes, 1-validation_scores_mean, label = 'Validation error')

plt.ylabel('Error', fontsize = 14)
plt.xlabel('Training set size', fontsize = 14)
plt.title('Learning curves for a decision tree model', fontsize = 18, y = 1.03)
plt.legend()
plt.ylim(0,1)

In [None]:
import numpy as np
from yellowbrick.model_selection import validation_curve

from sklearn.tree import DecisionTreeClassifier

viz = validation_curve(
 DecisionTreeClassifier(), X, y, param_name="max_depth",
 param_range=np.arange(1, 25), cv=10, scoring="accuracy",
)
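
If yellowbrick is not available, the numbers behind such a plot can be computed with scikit-learn's own `validation_curve`. The sketch below runs on synthetic data, and the depth range is an assumption made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the lab dataset
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=0)

depths = np.arange(1, 11)
train_scores, validation_scores = validation_curve(
    DecisionTreeClassifier(random_state=0), X_demo, y_demo,
    param_name='max_depth', param_range=depths,
    cv=5, scoring='accuracy')

# One row per max_depth value, one column per fold
mean_validation = validation_scores.mean(axis=1)
best_depth = depths[mean_validation.argmax()]
print(best_depth, mean_validation.max())
```

Training accuracy usually keeps rising with depth, while validation accuracy tends to level off or drop. The depth where the validation curve peaks is a reasonable candidate for `max_depth`.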

In [None]:
from yellowbrick.model_selection import LearningCurve
from sklearn.model_selection import StratifiedKFold

# Create the learning curve visualizer
cv = StratifiedKFold(n_splits=10)
sizes = np.linspace(0.1, 1.0, 100)

X_lc = OneHotEncoder().fit_transform(X)
y_lc = LabelEncoder().fit_transform(y)

# Instantiate the classification model and visualizer
model = DecisionTreeClassifier()
visualizer = LearningCurve(
 model, 
 cv=cv, 
 scoring='accuracy', 
 train_sizes=sizes, 
 n_jobs=4
)

visualizer.fit(X_lc, y_lc)
visualizer.show()

### Homework

1. Read the description of the [Italian wine dataset](http://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.names). 
1. Download the dataset from http://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data and load it into a pandas DataFrame. **Careful: the label is the first column in the dataset!**
1. Build a decision tree classifier and print out its accuracy. Instead of using a classical Decision Tree, you can experiment with [ExtraTreeClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.tree.ExtraTreeClassifier.html#sklearn.tree.ExtraTreeClassifier) or a Random Forest.
1. Use grid search to explore the parameter space and improve the classifier's accuracy.
1. Perform the _bias vs variance_ analysis
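
As a hint for step 2, the sketch below shows one way to read a headerless CSV file whose first column is the label. The three rows are made up and are not real wine data; for the homework, pass the dataset URL to `pd.read_csv` instead of the `StringIO` object.

```python
import io
import pandas as pd

# Three made-up rows in a headerless, label-first layout (not real wine data)
raw = io.StringIO("1,14.2,1.7\n2,12.4,0.9\n3,13.0,2.4\n")

# header=None keeps the first row as data and names the columns 0, 1, 2, ...
data = pd.read_csv(raw, header=None)

y_demo = data.iloc[:, 0]   # the label is the first column
X_demo = data.iloc[:, 1:]  # the remaining columns are features

print(y_demo.tolist(), X_demo.shape)
```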


Send the resulting `*.ipynb` file to Mikolaj.Morzy@put.poznan.pl by Sunday, May 24th, 21:00.