# Scikit-learn 2

First classifier, how to train and test a model

## 1 Training and evaluating a classifier

This section explains how simple training and evaluation work in `scikit-learn`.

### Your first classifier

Once you have `X` and `y` and have some intuitions about your data, train your first classifier.

In [3]:
# review, make sure correct dataset is loaded
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target

X.shape, y.shape

((150, 4), (150,))

`scikit-learn` offers a range of classifiers, organized into different modules. One of the simplest classifiers is KNN ("k-nearest neighbors").

Classifier classes are imported from their modules, then instantiated. Simply printing the object shows the parameters of the classifier, including implicit defaults.

In [1]:
from sklearn.neighbors import KNeighborsClassifier
clf = KNeighborsClassifier(n_neighbors=3) # convention: call classifier instance `clf`, or describe estimator type
clf

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=3, p=2,
           weights='uniform')
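The printed representation depends on the installed `scikit-learn` version; recent versions may list only the parameters that differ from their defaults. As a minimal sketch of a version-independent alternative, the parameters can also be read as a dictionary with `get_params()`. The variable name `params` is chosen here for illustration only.

```python
from sklearn.neighbors import KNeighborsClassifier

clf = KNeighborsClassifier(n_neighbors=3)

# get_params() returns a dict of all constructor parameters,
# both the ones set explicitly and the implicit defaults
params = clf.get_params()

print(params["n_neighbors"], params["weights"])
```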

Then, call the `fit` method to actually learn the relationship between `X` and `y`:

In [4]:
clf.fit(X, y)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=3, p=2,
           weights='uniform')

**Question for you: In the case of KNN, what does "learning" mean?**

After a model is "fitted", you can predict the response for new observations. The input must be two-dimensional (for example, a list of lists), with one row per observation and the same number of features per row as `X`:

In [5]:
clf.predict([[0.2, 0.4, 0.5, 0.1], [0.3, 0.9, 0.2, 0.5]])

array([0, 0])
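The call above passes two observations at once. The following sketch shows one way to handle a single observation, assuming it arrives as a flat array: it is reshaped into one row before calling `predict`. The names `single` and `batch` are illustrative, and the feature values are taken from the iris data. Passing the flat array directly is rejected by recent `scikit-learn` versions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
clf = KNeighborsClassifier(n_neighbors=3).fit(iris.data, iris.target)

single = np.array([5.1, 3.5, 1.4, 0.2])  # one observation, shape (4,)
batch = single.reshape(1, -1)            # shape (1, 4): one row, four features

pred = clf.predict(batch)                # one prediction per row
print(batch.shape, pred.shape)
```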

**Question for you: In the case of KNN, what does "predicting" mean? And: What does this output mean?**

### Simple training and testing split

Since the ultimate goal is to _generalize_ well (this is **extremely** important), there must be a way to evaluate if our model performs well on unseen data. The first method we look at is to simply hold out part of the data (meaning: not show it to the classifier during training), and then evaluate the model on the held-out data.

Split up your observations and responses into a training and testing part each:

In [7]:
from sklearn.model_selection import train_test_split
# variable names are conventional, please also adopt them
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) # 20 percent go into the test set

In [8]:
for v in (X_train, X_test, y_train, y_test):
    print(v.shape)

(120, 4)
(30, 4)
(120,)
(30,)
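To make the idea of holding out data concrete, here is a rough sketch of a manual split using a shuffled index array. It is not how `train_test_split` is implemented and will not reproduce the same partition; the seed and the `*_manual` variable names are assumptions made for this illustration.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

rng = np.random.default_rng(42)        # arbitrary seed, for reproducibility only
indices = rng.permutation(len(X))      # shuffled row indices 0..149
n_test = int(round(0.2 * len(X)))      # 20 percent of the rows are held out

test_idx, train_idx = indices[:n_test], indices[n_test:]
X_train_manual, X_test_manual = X[train_idx], X[test_idx]
y_train_manual, y_test_manual = y[train_idx], y[test_idx]

print(X_train_manual.shape, X_test_manual.shape)
```

Shuffling before splitting matters here because the iris rows are ordered by class; cutting off the last 30 rows without shuffling would put only one class into the test set.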


Now, in order to evaluate properly, we need to train a classifier that has only seen the training part of the data, and is oblivious to the correct test set answers. Fit the classifier again, only on the training set `X_train` and `y_train`:

In [9]:
clf.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=3, p=2,
           weights='uniform')

Then, predict the answers for the test observations (`X_test`), using the `predict` method. Store the results in the variable `y_pred`:

In [10]:
y_pred = clf.predict(X_test)

**Question for you: What is the difference between `y_test` and `y_pred`?**

In [11]:
# compare `y_test` and `y_pred`:
print("Actual\tPredicted")
for t, p in zip(y_test[:10], y_pred[:10]):
    print("{}\t{}".format(t, p))

Actual	Predicted
1	1
0	0
2	2
1	1
1	1
0	0
1	1
2	2
1	1
1	1


### Evaluation by calculating accuracy

Now that we have the predictions for the test set examples and the true answers, we can automatically evaluate the performance of the classifier. There are several reasonable metrics; **accuracy** is by far the simplest. It calculates **how many predictions were made correctly, out of all predictions**.
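Written as a formula, with $n$ test examples, true answers $y_i$ and predictions $\hat{y}_i$ (this notation is introduced here for illustration):

$$
\mathrm{accuracy} = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\left[\hat{y}_i = y_i\right]
$$

The indicator $\mathbb{1}[\cdot]$ is 1 when the prediction matches the true answer and 0 otherwise. As a hypothetical example, 27 correct predictions out of 30 test examples would give $27 / 30 = 0.9$.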

Implement accuracy now, as a function that takes `y_pred` and `y_test` as inputs. 

In [14]:
import numpy as np

def accuracy(y_pred, y_test):
    return np.average(y_pred == y_test)

accuracy(y_pred, y_test)

1.0

Then compare the result with the `scikit-learn` implementation:

In [None]:
from sklearn import metrics # module dedicated to measuring performance
metrics.accuracy_score(y_test, y_pred)
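As a side note, classifiers in `scikit-learn` also provide a `score` method, which for classifiers returns the mean accuracy on the given data. The sketch below repeats the setup from this section and checks that both routes agree; the names `via_metrics` and `via_score` are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)
clf = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)

# Route 1: predict first, then compare with the true answers
via_metrics = accuracy_score(y_test, clf.predict(X_test))

# Route 2: let the classifier predict and score in one call
via_score = clf.score(X_test, y_test)

print(via_metrics == via_score)
```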

## 2 Outlook

This method of model fitting and evaluation has several problems, which we will discuss in later classes. To give you some food for thought:
- We have split the data randomly into training and testing examples. Therefore, it is possible that the test set contains only "easy" examples or only hard ones. Isn't this a bit unfair?
- Each classifier has hyperparameters that need to be set by the user. For instance, `n_neighbors` in our case. We have set it to `3`, an arbitrary decision. Can we do better than that?
- Does accuracy work for a regression problem? Why not?
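
As a small experiment related to the first point, the sketch below repeats the split with several different `random_state` values and records the test accuracy for each. The seed range and the `scores` dictionary are arbitrary choices for illustration, not a recommended evaluation procedure.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data, iris.target

scores = {}
for seed in range(5):  # five arbitrary seeds, each giving a different random split
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    clf = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
    scores[seed] = clf.score(X_test, y_test)  # accuracy on this particular test set

for seed, score in scores.items():
    print(seed, round(score, 3))
```

If the numbers differ from seed to seed, the measured accuracy depends on which examples happened to land in the test set.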