Analyzing results from notebooks

The .ipynb format can store tables and charts in a standalone file, which makes it a great choice for model evaluation reports. NotebookCollection lets you retrieve results from previously executed notebooks and compare them.

[1]:
import papermill as pm
import jupytext

from sklearn_evaluation import NotebookCollection

Let’s first generate a few notebooks. We have a train.py script that trains a single model; let’s convert it to a Jupyter notebook:

[2]:
nb = jupytext.read('train.py')
jupytext.write(nb, 'train.ipynb')
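
train.py itself isn’t reproduced in this document. As a rough sketch (not the actual script), a compatible file written in jupytext’s percent format would declare model and params in a cell tagged parameters so papermill can overwrite them, and tag the cells whose output we want to collect later. The dataset is a stand-in: the outputs below suggest the Boston housing data, whose loader is no longer available in recent scikit-learn versions, so the sketch uses fetch_california_housing instead:

# %% tags=["parameters"]
# default values; papermill injects new ones at execution time
model = 'sklearn.ensemble.RandomForestRegressor'
params = {}

# %%
import importlib

import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

X, y = fetch_california_housing(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# instantiate the model from its dotted path and fit it
module, _, name = model.rpartition('.')
reg = getattr(importlib.import_module(module), name)(**params)
reg.fit(X_train, y_train)
y_pred = reg.predict(X_test)

# %% tags=["model_name"]
type(reg).__name__

# %% tags=["model_params"]
reg.get_params()

# %% tags=["metrics"]
pd.DataFrame({'mae': [mean_absolute_error(y_test, y_pred)],
              'mse': [mean_squared_error(y_test, y_pred)],
              'r2': [r2_score(y_test, y_pred)]})

# %%
# cells tagged "feature_names", "plot", and "river" would follow the same pattern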

We use papermill to execute the notebook with different parameters. We’ll train four models: two random forests, a linear regression, and a support vector regression:

[3]:
# models with their corresponding parameters
params = [{
    'model': 'sklearn.ensemble.RandomForestRegressor',
    'params': {
        'n_estimators': 50
    }
}, {
    'model': 'sklearn.ensemble.RandomForestRegressor',
    'params': {
        'n_estimators': 100
    }
}, {
    'model': 'sklearn.linear_model.LinearRegression',
    'params': {
        'normalize': True
    }
}, {
    'model': 'sklearn.svm.LinearSVR',
    'params': {}
}]

# ids to identify each experiment
ids = [
    'random_forest_1', 'random_forest_2', 'linear_regression',
    'support_vector_regression'
]

# output files
files = [f'{i}.ipynb' for i in ids]

# execute notebooks using papermill
for f, p in zip(files, params):
    pm.execute_notebook('train.ipynb', output_path=f, parameters=p)
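
As a quick sanity check (not part of the original example), we can confirm that papermill wrote all the output notebooks:

from pathlib import Path

# every experiment should have produced an executed notebook on disk
assert all(Path(f).is_file() for f in files)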

To use NotebookCollection, we pass a list of paths and, optionally, an id for each notebook (paths are used as ids by default).

The only requirement is that the cells whose output we want to extract must have tags; each tag then becomes a key in the notebook collection. For instructions on adding tags, see this.
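
If you want to check which cells are tagged without opening the notebook in Jupyter, the tags live in each cell’s metadata. A small illustrative snippet using nbformat (operating on the train.ipynb file created above):

import nbformat

nb = nbformat.read('train.ipynb', as_version=4)

# print the tags of every tagged cell; these are the keys
# NotebookCollection will expose
for cell in nb.cells:
    tags = cell.metadata.get('tags', [])
    if tags:
        print(tags)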

Extracted tables add colors to certain cells to identify the best and worst metrics. By default, it assumes that metrics are errors (smaller is better). If you are using scores (larger is better), pass scores=True; if you have both, pass a list with the names of the scores:

[4]:
nbs = NotebookCollection(paths=files, ids=ids, scores=['r2'])
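
For reference, these are hypothetical variants of the same call, following the description of scores above:

# treat every metric as a score (larger is better)
nbs_scores = NotebookCollection(paths=files, ids=ids, scores=True)

# default: treat every metric as an error (smaller is better)
nbs_errors = NotebookCollection(paths=files, ids=ids)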

To get a list of tags available:

[5]:
list(nbs)
[5]:
['model_name', 'feature_names', 'model_params', 'plot', 'metrics', 'river']

model_params contains a dictionary with the model parameters; let’s get them (click on the tabs to switch):

[6]:
# pro-tip: when typing the tag, press the "Tab" key for autocompletion!
nbs['model_params']
[6]:
{
    'bootstrap': True,
    'ccp_alpha': 0.0,
    'criterion': 'squared_error',
    'max_depth': None,
    'max_features': 'auto',
    'max_leaf_nodes': None,
    'max_samples': None,
    'min_impurity_decrease': 0.0,
    'min_samples_leaf': 1,
    'min_samples_split': 2,
    'min_weight_fraction_leaf': 0.0,
    'n_estimators': 50,
    'n_jobs': None,
    'oob_score': False,
    'random_state': None,
    'verbose': 0,
    'warm_start': False,
}
{
    'bootstrap': True,
    'ccp_alpha': 0.0,
    'criterion': 'squared_error',
    'max_depth': None,
    'max_features': 'auto',
    'max_leaf_nodes': None,
    'max_samples': None,
    'min_impurity_decrease': 0.0,
    'min_samples_leaf': 1,
    'min_samples_split': 2,
    'min_weight_fraction_leaf': 0.0,
    'n_estimators': 100,
    'n_jobs': None,
    'oob_score': False,
    'random_state': None,
    'verbose': 0,
    'warm_start': False,
}
{
    'copy_X': True,
    'fit_intercept': True,
    'n_jobs': None,
    'normalize': True,
    'positive': False,
}
{
    'C': 1.0,
    'dual': True,
    'epsilon': 0.0,
    'fit_intercept': True,
    'intercept_scaling': 1.0,
    'loss': 'epsilon_insensitive',
    'max_iter': 1000,
    'random_state': None,
    'tol': 0.0001,
    'verbose': 0,
}

plot has a y_true vs y_pred chart:

[7]:
nbs['plot']
[7]:
[y_true vs y_pred plots, one tab per notebook]

On each notebook, metrics outputs a data frame with a single row, with mean absolute error (mae), mean squared error (mse), and R² (r2) as columns.

For single-row tables, a “Compare” tab shows all results at once:

[8]:
nbs['metrics']
[8]:
  random_forest_1 random_forest_2 linear_regression support_vector_regression
mae 2.157353 2.139443 3.148256 6.452945
mse 9.863015 10.252310 20.724023 64.773925
r2 0.869672 0.864528 0.726157 0.144091
mae mse r2
0 2.157353 9.863015 0.869672
mae mse r2
0 2.139443 10.25231 0.864528
mae mse r2
0 3.148256 20.724023 0.726157
mae mse r2
0 6.452945 64.773925 0.144091

We can see that both random forests outperform the linear regression and the support vector regression; between them, the first has the lower mse and higher r2, while the second has the lower mae.

river contains a multi-row table with the error metrics broken down by the CHAS indicator feature. Multi-row tables do not display the “Compare” tab:

[9]:
nbs['river']
[9]:
mae mse r2
CHAS
0.0 2.183544 10.202242 0.869944
1.0 1.697556 3.907691 0.853202
mae mse r2
CHAS
0.0 2.188373 10.689910 0.863728
1.0 1.280444 2.570003 0.903454
mae mse r2
CHAS
0.0 3.145562 21.137297 0.730547
1.0 3.195546 13.468775 0.494026
mae mse r2
CHAS
0.0 6.367750 64.249339 0.180966
1.0 7.948588 73.983317 -1.779290
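
The breakdown itself is produced inside train.py; here is a minimal sketch of how such a per-group table could be computed, assuming a test data frame that contains the CHAS indicator plus true and predicted values (the column names y_true and y_pred are hypothetical):

import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def metrics_by_group(df, group_col):
    """Compute mae/mse/r2 separately for each value of an indicator column."""
    rows = {}
    for value, chunk in df.groupby(group_col):
        rows[value] = {
            'mae': mean_absolute_error(chunk['y_true'], chunk['y_pred']),
            'mse': mean_squared_error(chunk['y_true'], chunk['y_pred']),
            'r2': r2_score(chunk['y_true'], chunk['y_pred']),
        }
    return pd.DataFrame(rows).T.rename_axis(group_col)


# usage: river = metrics_by_group(df_test, 'CHAS')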

If we only compare two notebooks, the output is a bit different:

[10]:
# only compare two notebooks
nbs_two = NotebookCollection(paths=files[:2], ids=ids[:2], scores=['r2'])

Comparing single-row tables adds diff columns (absolute difference, relative difference, and ratio) with the error difference between experiments. Error reductions are shown in green, increases in red:

[11]:
nbs_two['metrics']
[11]:
  random_forest_1 random_forest_2 diff diff_relative ratio
mae 2.157353 2.139443 -0.017910 -0.84% 0.991698
mse 9.863015 10.252310 0.389295 3.80% 1.039470
r2 0.869672 0.864528 -0.005144 -0.60% 0.994085
mae mse r2
0 2.157353 9.863015 0.869672
mae mse r2
0 2.139443 10.25231 0.864528

When comparing multi-row tables, the “Compare” tab appears, showing the difference between the tables:

[12]:
nbs_two['river']
[12]:
  mae mse r2
CHAS      
0.000000 0.004829 0.487668 -0.006216
1.000000 -0.417112 -1.337688 0.050252
mae mse r2
CHAS
0.0 2.183544 10.202242 0.869944
1.0 1.697556 3.907691 0.853202
mae mse r2
CHAS
0.0 2.188373 10.689910 0.863728
1.0 1.280444 2.570003 0.903454

When displaying dictionaries, a “Compare” tab shows a diff view:

[13]:
nbs_two['model_params']
[13]:
[Diff view: the two parameter dictionaries shown side by side; the only line that differs is 'n_estimators' (50 in random_forest_1, 100 in random_forest_2)]
{
    'bootstrap': True,
    'ccp_alpha': 0.0,
    'criterion': 'squared_error',
    'max_depth': None,
    'max_features': 'auto',
    'max_leaf_nodes': None,
    'max_samples': None,
    'min_impurity_decrease': 0.0,
    'min_samples_leaf': 1,
    'min_samples_split': 2,
    'min_weight_fraction_leaf': 0.0,
    'n_estimators': 50,
    'n_jobs': None,
    'oob_score': False,
    'random_state': None,
    'verbose': 0,
    'warm_start': False,
}
{
    'bootstrap': True,
    'ccp_alpha': 0.0,
    'criterion': 'squared_error',
    'max_depth': None,
    'max_features': 'auto',
    'max_leaf_nodes': None,
    'max_samples': None,
    'min_impurity_decrease': 0.0,
    'min_samples_leaf': 1,
    'min_samples_split': 2,
    'min_weight_fraction_leaf': 0.0,
    'n_estimators': 100,
    'n_jobs': None,
    'oob_score': False,
    'random_state': None,
    'verbose': 0,
    'warm_start': False,
}

Lists (and sets) are compared based on the existence of their elements:

[14]:
nbs_two['feature_names']
[14]:
Both Only in random_forest_1 Only in random_forest_2
AGE
B
CHAS
CRIM
DIS
INDUS
LSTAT
NOX
PTRATIO
RAD
RM
TAX
ZN
['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']
['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']
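
Conceptually this is a plain set comparison, and since both notebooks used the same features, the “Only in” columns are empty. Reproducing the comparison by hand from the two lists shown above:

features_1 = {'CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE',
              'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT'}
features_2 = set(features_1)  # identical in this example

print('Both:', sorted(features_1 & features_2))
print('Only in random_forest_1:', sorted(features_1 - features_2))
print('Only in random_forest_2:', sorted(features_2 - features_1))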

Using the mapping interface

NotebookCollection has a dict-like interface; you can retrieve data from individual notebooks:

[15]:
nbs['model_params']['random_forest_1']
[15]:
{'bootstrap': True,
 'ccp_alpha': 0.0,
 'criterion': 'squared_error',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 50,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}
[16]:
nbs['plot']['random_forest_2']
[16]:
[y_true vs y_pred plot for random_forest_2]
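
Because each tag behaves like a mapping from notebook id to the extracted value, you can also combine results with plain Python. For example, using the model_params dictionaries shown earlier (a hypothetical comparison, not part of the original example):

# compare the number of trees used by the two random forests
for i in ['random_forest_1', 'random_forest_2']:
    print(i, nbs['model_params'][i]['n_estimators'])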