
Analyzing results from notebooks

The .ipynb format can store tables and charts in a single standalone file, which makes it a great choice for model evaluation reports. NotebookCollection lets you retrieve results from previously executed notebooks so you can compare them.

[1]:
import papermill as pm
import jupytext

from sklearn_evaluation import NotebookCollection

Let’s first generate a few notebooks. We have a train.py script that trains a single model; let’s convert it to a Jupyter notebook:

[2]:
nb = jupytext.read('train.py')
jupytext.write(nb, 'train.ipynb')
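The contents of train.py are not reproduced here. As a rough, hypothetical sketch (jupytext percent format, with a placeholder dataset and cell contents that only approximate the real script), the important pieces are a cell tagged parameters, which papermill overwrites at execution time, and output cells tagged with the names we will query later:

# %% tags=["parameters"]
# default values; papermill injects new ones for each experiment
model = 'sklearn.ensemble.RandomForestRegressor'
params = {'n_estimators': 50}

# %%
from importlib import import_module

import pandas as pd
from sklearn.datasets import load_diabetes  # placeholder dataset, not necessarily the one train.py uses
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

X, y = load_diabetes(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# build the estimator from its dotted path and the injected parameters
module_name, _, class_name = model.rpartition('.')
est = getattr(import_module(module_name), class_name)(**params)
est.fit(X_train, y_train)
y_pred = est.predict(X_test)

# %% tags=["model_params"]
est.get_params()

# %% tags=["metrics"]
pd.DataFrame({'mae': [mean_absolute_error(y_test, y_pred)],
              'mse': [mean_squared_error(y_test, y_pred)],
              'r2': [r2_score(y_test, y_pred)]})

The actual script also produces the model_name, feature_names, plot, and river outputs in similarly tagged cells; they are omitted from this sketch for brevity.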

We use papermill to execute the notebook with different parameters; we’ll train 4 models: two random forests, a linear regression, and a support vector regression:

[3]:
# models with their corresponding parameters
params = [{
    'model': 'sklearn.ensemble.RandomForestRegressor',
    'params': {
        'n_estimators': 50
    }
}, {
    'model': 'sklearn.ensemble.RandomForestRegressor',
    'params': {
        'n_estimators': 100
    }
}, {
    'model': 'sklearn.linear_model.LinearRegression',
    'params': {
        'normalize': True
    }
}, {
    'model': 'sklearn.svm.LinearSVR',
    'params': {}
}]

# ids to identify each experiment
ids = [
    'random_forest_1', 'random_forest_2', 'linear_regression',
    'support_vector_regression'
]

# output files
files = [f'{i}.ipynb' for i in ids]

# execute notebooks using papermill
for f, p in zip(files, params):
    pm.execute_notebook('train.ipynb', output_path=f, parameters=p)

To use NotebookCollection, we pass a list of paths and, optionally, ids for each notebook (paths are used as ids by default).

The only requirement is that the cells whose output we want to extract must have tags; each tag then becomes a key in the notebook collection. For instructions on adding tags, see this.
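Tags can be declared inline in jupytext scripts with the # %% tags=["..."] syntax; for an existing .ipynb file, one option is to set the cell metadata programmatically. A minimal sketch using nbformat (the file name and cell index are placeholders):

import nbformat

# read the notebook, tag one of its cells, and write it back
nb = nbformat.read('train.ipynb', as_version=4)
nb.cells[-1].metadata['tags'] = ['metrics']  # placeholder: tag the last cell
nbformat.write(nb, 'train.ipynb')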

Extracted tables add colors to certain cells to identify the best and worst metrics. By default, it assumes that metrics are errors (smaller is better). If you are using scores (larger is better), pass scores=True; if you have both, pass a list with the names of the score metrics:

[4]:
nbs = NotebookCollection(paths=files, ids=ids, scores=['r2'])
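If every metric in your notebooks were a score, you could pass scores=True instead of listing them; a minimal variation of the call above:

# treat all metrics as scores (larger is better)
nbs_all_scores = NotebookCollection(paths=files, ids=ids, scores=True)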

To get a list of tags available:

[5]:
list(nbs)
[5]:
['model_name', 'feature_names', 'model_params', 'plot', 'metrics', 'river']

model_params contains a dictionary with the model parameters; let’s get them (click on the tabs to switch between notebooks):

[6]:
# pro-tip: when typing the tag, press the "Tab" key for autocompletion!
nbs['model_params']
[6]:
random_forest_1:
{
    'bootstrap': True,
    'ccp_alpha': 0.0,
    'criterion': 'mse',
    'max_depth': None,
    'max_features': 'auto',
    'max_leaf_nodes': None,
    'max_samples': None,
    'min_impurity_decrease': 0.0,
    'min_impurity_split': None,
    'min_samples_leaf': 1,
    'min_samples_split': 2,
    'min_weight_fraction_leaf': 0.0,
    'n_estimators': 50,
    'n_jobs': None,
    'oob_score': False,
    'random_state': None,
    'verbose': 0,
    'warm_start': False,
}
random_forest_2:
{
    'bootstrap': True,
    'ccp_alpha': 0.0,
    'criterion': 'mse',
    'max_depth': None,
    'max_features': 'auto',
    'max_leaf_nodes': None,
    'max_samples': None,
    'min_impurity_decrease': 0.0,
    'min_impurity_split': None,
    'min_samples_leaf': 1,
    'min_samples_split': 2,
    'min_weight_fraction_leaf': 0.0,
    'n_estimators': 100,
    'n_jobs': None,
    'oob_score': False,
    'random_state': None,
    'verbose': 0,
    'warm_start': False,
}
linear_regression:
{
    'copy_X': True,
    'fit_intercept': True,
    'n_jobs': None,
    'normalize': True,
    'positive': False,
}
support_vector_regression:
{
    'C': 1.0,
    'dual': True,
    'epsilon': 0.0,
    'fit_intercept': True,
    'intercept_scaling': 1.0,
    'loss': 'epsilon_insensitive',
    'max_iter': 1000,
    'random_state': None,
    'tol': 0.0001,
    'verbose': 0,
}

plot has a y_true vs y_pred chart:

[7]:
nbs['plot']
[7]:
[y_true vs. y_pred plot for each notebook, shown in tabs]

For each notebook, metrics outputs a single-row data frame with mean absolute error (mae), mean squared error (mse), and R² (r2) as columns.

For single-row tables, a “Compare” tab shows all results at once:

[8]:
nbs['metrics']
[8]:
Compare:
     random_forest_1  random_forest_2  linear_regression  support_vector_regression
mae         2.174084         2.197695           3.148256                   4.006771
mse        10.177599        10.813955          20.724023                  29.562735
r2          0.865515         0.857107           0.726157                   0.609364

random_forest_1:
        mae        mse        r2
0  2.174084  10.177599  0.865515

random_forest_2:
        mae        mse        r2
0  2.197695  10.813955  0.857107

linear_regression:
        mae        mse        r2
0  3.148256  20.724023  0.726157

support_vector_regression:
        mae        mse        r2
0  4.006771  29.562735  0.609364

We can see that the first random forest performs best across all three metrics.

river contains a multi-row table with error metrics broken down by the CHAS indicator feature. Multi-row tables do not display the “Compare” tab:

[9]:
nbs['river']
[9]:
random_forest_1:
           mae        mse        r2
CHAS
0.0   2.196266  10.546888  0.865551
1.0   1.784667   3.694524  0.861210

random_forest_2:
           mae        mse        r2
CHAS
0.0   2.215335  11.158536  0.857754
1.0   1.888000   4.764643  0.821009

linear_regression:
           mae        mse        r2
CHAS
0.0   3.145562  21.137297  0.730547
1.0   3.195546  13.468775  0.494026

support_vector_regression:
           mae        mse        r2
CHAS
0.0   4.025660  30.258920  0.614267
1.0   3.675176  17.340808  0.348568

If we only compare two notebooks, the output is a bit different:

[10]:
# only compare two notebooks
nbs_two = NotebookCollection(paths=files[:2], ids=ids[:2], scores=['r2'])

Comparing single-row tables includes a diff column with the error difference between experiments. Error reductions are shown in green, increases in red:

[11]:
nbs_two['metrics']
[11]:
Compare:
     random_forest_1  random_forest_2      diff  diff_relative     ratio
mae         2.174084         2.197695  0.023611          1.07%  1.010860
mse        10.177599        10.813955  0.636356          5.88%  1.062525
r2          0.865515         0.857107 -0.008408         -0.98%  0.990286

random_forest_1:
        mae        mse        r2
0  2.174084  10.177599  0.865515

random_forest_2:
        mae        mse        r2
0  2.197695  10.813955  0.857107
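Reading off the displayed numbers, diff appears to be the random_forest_2 value minus the random_forest_1 value, diff_relative expresses that difference as a percentage of the random_forest_2 value, and ratio divides random_forest_2 by random_forest_1: for mae, 2.197695 - 2.174084 = 0.023611, 0.023611 / 2.197695 ≈ 1.07%, and 2.197695 / 2.174084 ≈ 1.010860.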

When comparing multi-row tables, the “Compare” tab appears, showing the difference between the tables:

[12]:
nbs_two['river']
[12]:
Compare (difference):
           mae       mse        r2
CHAS
0.0   0.019069  0.611648 -0.007797
1.0   0.103333  1.070119 -0.040201

random_forest_1:
           mae        mse        r2
CHAS
0.0   2.196266  10.546888  0.865551
1.0   1.784667   3.694524  0.861210

random_forest_2:
           mae        mse        r2
CHAS
0.0   2.215335  11.158536  0.857754
1.0   1.888000   4.764643  0.821009
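The Compare tab values match the element-wise differences between the two tables; for example, for mae with CHAS = 0.0, 2.215335 - 2.196266 = 0.019069.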

When displaying dictionaries, a “Compare” tab shows a diff view:

[13]:
nbs_two['model_params']
[13]:
Compare (diff view): the two parameter dictionaries are identical except for n_estimators, which changes from 50 (random_forest_1) to 100 (random_forest_2).
random_forest_1:
{
    'bootstrap': True,
    'ccp_alpha': 0.0,
    'criterion': 'mse',
    'max_depth': None,
    'max_features': 'auto',
    'max_leaf_nodes': None,
    'max_samples': None,
    'min_impurity_decrease': 0.0,
    'min_impurity_split': None,
    'min_samples_leaf': 1,
    'min_samples_split': 2,
    'min_weight_fraction_leaf': 0.0,
    'n_estimators': 50,
    'n_jobs': None,
    'oob_score': False,
    'random_state': None,
    'verbose': 0,
    'warm_start': False,
}
random_forest_2:
{
    'bootstrap': True,
    'ccp_alpha': 0.0,
    'criterion': 'mse',
    'max_depth': None,
    'max_features': 'auto',
    'max_leaf_nodes': None,
    'max_samples': None,
    'min_impurity_decrease': 0.0,
    'min_impurity_split': None,
    'min_samples_leaf': 1,
    'min_samples_split': 2,
    'min_weight_fraction_leaf': 0.0,
    'n_estimators': 100,
    'n_jobs': None,
    'oob_score': False,
    'random_state': None,
    'verbose': 0,
    'warm_start': False,
}

Lists (and sets) are compared based on whether each element is present in both notebooks:

[14]:
nbs_two['feature_names']
[14]:
Compare:
Both: AGE, B, CHAS, CRIM, DIS, INDUS, LSTAT, NOX, PTRATIO, RAD, RM, TAX, ZN
Only in random_forest_1: (none)
Only in random_forest_2: (none)

random_forest_1:
['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']

random_forest_2:
['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']
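This comparison is plain set logic. Assuming the per-notebook value is the list shown above, the same three groups could be computed by hand (a sketch, not part of the library API):

# reproduce the membership comparison with set operations
a = set(nbs_two['feature_names']['random_forest_1'])
b = set(nbs_two['feature_names']['random_forest_2'])

both = a & b    # features present in both notebooks
only_a = a - b  # features only in random_forest_1
only_b = b - a  # features only in random_forest_2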

Using the mapping interface

NotebookCollection has a dict-like interface; you can retrieve data from individual notebooks:

[15]:
nbs['model_params']['random_forest_1']
[15]:
{'bootstrap': True,
 'ccp_alpha': 0.0,
 'criterion': 'mse',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 50,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}
[16]:
nbs['plot']['random_forest_2']
[16]:
[image: y_true vs. y_pred plot for random_forest_2]
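Since both levels are indexed by key, you can also pull the same value from several notebooks in a loop. A small sketch, assuming (as the output above suggests) that nbs['model_params'][id] returns a plain dict:

# compare n_estimators across the two random forest runs
for i in ['random_forest_1', 'random_forest_2']:
    print(i, nbs['model_params'][i]['n_estimators'])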