
Analyzing results from notebooks

The .ipynb format can store tables and charts in a standalone file, which makes it a great choice for model evaluation reports. NotebookCollection allows you to retrieve results from previously executed notebooks so you can compare them.

[1]:
import papermill as pm
import jupytext

from sklearn_evaluation import NotebookCollection

Let’s first generate a few notebooks. We have a train.py script that trains a single model; let’s convert it to a Jupyter notebook:

[2]:
nb = jupytext.read('train.py')
jupytext.write(nb, 'train.ipynb')

We use papermill to execute the notebook with different parameters. We’ll train 4 models: two random forests, a linear regression, and a support vector regression:

[3]:
# models with their corresponding parameters
params = [{
    'model': 'sklearn.ensemble.RandomForestRegressor',
    'params': {
        'n_estimators': 50
    }
}, {
    'model': 'sklearn.ensemble.RandomForestRegressor',
    'params': {
        'n_estimators': 100
    }
}, {
    'model': 'sklearn.linear_model.LinearRegression',
    'params': {
        'normalize': True
    }
}, {
    'model': 'sklearn.svm.LinearSVR',
    'params': {}
}]

# ids to identify each experiment
ids = [
    'random_forest_1', 'random_forest_2', 'linear_regression',
    'support_vector_regression'
]

# output files
files = [f'{i}.ipynb' for i in ids]

# execute notebooks using papermill
for f, p in zip(files, params):
    pm.execute_notebook('train.ipynb', output_path=f, parameters=p)

To use NotebookCollection, we pass a list of paths and, optionally, ids for each notebook (paths are used as ids by default).

The only requirement is that cells whose output we want to extract must have tags; each tag then becomes a key in the notebook collection. For instructions on adding tags, see this.
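As a rough, hypothetical sketch (not the actual contents of train.py), a tagged cell in jupytext’s percent format could look like this; the tag names match the keys used later in this guide, and mae, mse, and r2 are assumed to be computed earlier in the script:

# hypothetical excerpt in jupytext's percent format; tags written after "# %%"
# become cell tags when the script is converted to .ipynb

# %% tags=["parameters"]
# papermill injects the values passed via `parameters` into this cell
model = 'sklearn.ensemble.RandomForestRegressor'
params = {}

# %% tags=["metrics"]
# the output of this cell is what nbs['metrics'] returns
# (assumes mae, mse and r2 were computed in earlier cells)
import pandas as pd
pd.DataFrame({'mae': [mae], 'mse': [mse], 'r2': [r2]})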

Extracted tables add colors to certain cells to identify the best and worst metrics. By default, it assumes that metrics are errors (smaller is better). If you are using scores (larger is better), pass scores=True; if you have both, pass a list with the names of the score columns:

[4]:
nbs = NotebookCollection(paths=files, ids=ids, scores=['r2'])

To get a list of tags available:

[5]:
list(nbs)
[5]:
['model_name', 'feature_names', 'model_params', 'plot', 'metrics', 'river']

model_params contains a dictionary with each model’s parameters; let’s get them (click on the tabs to switch):

[6]:
# pro-tip: when typing the tag, press the "Tab" key for autocompletion!
nbs['model_params']
[6]:
{
    'bootstrap': True,
    'ccp_alpha': 0.0,
    'criterion': 'squared_error',
    'max_depth': None,
    'max_features': 1.0,
    'max_leaf_nodes': None,
    'max_samples': None,
    'min_impurity_decrease': 0.0,
    'min_samples_leaf': 1,
    'min_samples_split': 2,
    'min_weight_fraction_leaf': 0.0,
    'n_estimators': 50,
    'n_jobs': None,
    'oob_score': False,
    'random_state': None,
    'verbose': 0,
    'warm_start': False,
}
{
    'bootstrap': True,
    'ccp_alpha': 0.0,
    'criterion': 'squared_error',
    'max_depth': None,
    'max_features': 1.0,
    'max_leaf_nodes': None,
    'max_samples': None,
    'min_impurity_decrease': 0.0,
    'min_samples_leaf': 1,
    'min_samples_split': 2,
    'min_weight_fraction_leaf': 0.0,
    'n_estimators': 100,
    'n_jobs': None,
    'oob_score': False,
    'random_state': None,
    'verbose': 0,
    'warm_start': False,
}
{
    'copy_X': True,
    'fit_intercept': True,
    'n_jobs': None,
    'normalize': True,
    'positive': False,
}
{
    'C': 1.0,
    'dual': True,
    'epsilon': 0.0,
    'fit_intercept': True,
    'intercept_scaling': 1.0,
    'loss': 'epsilon_insensitive',
    'max_iter': 1000,
    'random_state': None,
    'tol': 0.0001,
    'verbose': 0,
}

plot has a y_true vs y_pred chart:
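The cell in train.py that produces this chart is not shown here; a hypothetical version using matplotlib (assuming y_test and y_pred exist earlier in the script) might look like:

# hypothetical "plot"-tagged cell: y_true vs. y_pred scatter with matplotlib
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.scatter(y_test, y_pred, alpha=0.5)
ax.set_xlabel('y_true')
ax.set_ylabel('y_pred')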

[7]:
nbs['plot']
[7]:

In each notebook, metrics outputs a single-row data frame with mean absolute error (mae), mean squared error (mse), and r2 as columns.

For single-row tables, a “Compare” tab shows all results at once:

[8]:
nbs['metrics']
[8]:
     random_forest_1  random_forest_2  linear_regression  support_vector_regression
mae         2.194431         2.172539           3.148256                   4.017698
mse        10.420729        10.380061          20.724023                  30.603511
r2          0.862303         0.862840           0.726157                   0.595612

We can see that the second random forest performs best on all three metrics.

river contains a multi-row table with error metrics broken down by the CHAS indicator feature. Multi-row tables do not display the “Compare” tab:

[9]:
nbs['river']
[9]:
random_forest_1
          mae        mse        r2
CHAS
0.0  2.218620  10.790109  0.862451
1.0  1.769778   3.936070  0.852136

random_forest_2
          mae        mse        r2
CHAS
0.0  2.194139  10.724741  0.863284
1.0  1.793333   4.329014  0.837374

linear_regression
          mae        mse        r2
CHAS
0.0  3.145562  21.137297  0.730547
1.0  3.195546  13.468775  0.494026

support_vector_regression
          mae        mse        r2
CHAS
0.0  4.114987  31.880407  0.593597
1.0  2.309739   8.186887  0.692448
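The tables above could come from a "river"-tagged cell like the hypothetical sketch below; it assumes X_test (a DataFrame with a CHAS column), y_test, and y_pred exist earlier in train.py:

# hypothetical "river"-tagged cell: error metrics grouped by the CHAS indicator
import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

df = pd.DataFrame({'CHAS': X_test['CHAS'].values,
                   'y_true': pd.Series(y_test).values,
                   'y_pred': pd.Series(y_pred).values})

# one row per CHAS value, with the same columns as the metrics table
river = df.groupby('CHAS').apply(lambda g: pd.Series({
    'mae': mean_absolute_error(g['y_true'], g['y_pred']),
    'mse': mean_squared_error(g['y_true'], g['y_pred']),
    'r2': r2_score(g['y_true'], g['y_pred']),
}))
river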

If we only compare two notebooks, the output is a bit different:

[10]:
# only compare two notebooks
nbs_two = NotebookCollection(paths=files[:2], ids=ids[:2], scores=['r2'])

Comparing single-row tables includes a diff column with the error difference between experiments. Error reductions are shown in green, increases in red:

[11]:
nbs_two['metrics']
[11]:
     random_forest_1  random_forest_2       diff  diff_relative     ratio
mae         2.194431         2.172539  -0.021892         -1.01%  0.990024
mse        10.420729        10.380061  -0.040668         -0.39%  0.996097
r2          0.862303         0.862840   0.000537          0.06%  1.000623

When comparing multi-row tables, the “Compare” tab appears, showing the difference between the tables:

[12]:
nbs_two['river']
[12]:
Compare
           mae        mse        r2
CHAS
0.0  -0.024481  -0.065368  0.000833
1.0   0.023555   0.392944 -0.014762

random_forest_1
          mae        mse        r2
CHAS
0.0  2.218620  10.790109  0.862451
1.0  1.769778   3.936070  0.852136

random_forest_2
          mae        mse        r2
CHAS
0.0  2.194139  10.724741  0.863284
1.0  1.793333   4.329014  0.837374

When displaying dictionaries, a “Compare” tab shows a diff view:

[13]:
nbs_two['model_params']
[13]:
[diff view: the two parameter dictionaries are identical except for 'n_estimators' (50 in random_forest_1 vs. 100 in random_forest_2)]
{
    'bootstrap': True,
    'ccp_alpha': 0.0,
    'criterion': 'squared_error',
    'max_depth': None,
    'max_features': 1.0,
    'max_leaf_nodes': None,
    'max_samples': None,
    'min_impurity_decrease': 0.0,
    'min_samples_leaf': 1,
    'min_samples_split': 2,
    'min_weight_fraction_leaf': 0.0,
    'n_estimators': 50,
    'n_jobs': None,
    'oob_score': False,
    'random_state': None,
    'verbose': 0,
    'warm_start': False,
}
{
    'bootstrap': True,
    'ccp_alpha': 0.0,
    'criterion': 'squared_error',
    'max_depth': None,
    'max_features': 1.0,
    'max_leaf_nodes': None,
    'max_samples': None,
    'min_impurity_decrease': 0.0,
    'min_samples_leaf': 1,
    'min_samples_split': 2,
    'min_weight_fraction_leaf': 0.0,
    'n_estimators': 100,
    'n_jobs': None,
    'oob_score': False,
    'random_state': None,
    'verbose': 0,
    'warm_start': False,
}

Lists (and sets) are compared based on element existence:

[14]:
nbs_two['feature_names']
[14]:
Both: AGE, B, CHAS, CRIM, DIS, INDUS, LSTAT, NOX, PTRATIO, RAD, RM, TAX, ZN
Only in random_forest_1: (none)
Only in random_forest_2: (none)

['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']
['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']

Using the mapping interface

NotebookCollection has a dict-like interface; you can retrieve data from individual notebooks:

[15]:
nbs['model_params']['random_forest_1']
[15]:
{'bootstrap': True,
 'ccp_alpha': 0.0,
 'criterion': 'squared_error',
 'max_depth': None,
 'max_features': 1.0,
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 50,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}
[16]:
nbs['plot']['random_forest_2']
[16]:
[image: y_true vs. y_pred plot for random_forest_2]
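Since indexing by tag and then by id works as shown above, you can loop over the ids to pull out specific values. A minimal sketch, assuming the extracted model_params values behave like plain dicts (as their repr above suggests):

# compare n_estimators across the two random forests
for id_ in ['random_forest_1', 'random_forest_2']:
    print(id_, nbs['model_params'][id_]['n_estimators'])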