Tutorial

This tutorial will walk through the steps involved in using Cobalt to analyze a model. To keep this self-contained, we’ll use a synthetic dataset generated by scikit-learn and train a basic random forest model.

See this Jupyter notebook.

from sklearn.datasets import make_classification

# generate a synthetic dataset; the first 3000 points will serve as the training set
X, y = make_classification(n_samples=5000, random_state=73902)
X_train = X[:3000, :]
y_train = y[:3000]

This dataset has 20 randomly created features and two classes.
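
If you'd like to confirm the shape and class balance before training, a quick check (optional, and not part of the Cobalt workflow) looks like this:

import numpy as np

print(X.shape)          # (5000, 20): 5000 samples, 20 features
print(np.bincount(y))   # class counts; roughly balanced between the two classes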

from sklearn.ensemble import RandomForestClassifier

rf_model = RandomForestClassifier()
rf_model.fit(X_train, y_train)
# predict on the full dataset so both training and held-out points can be analyzed
y_pred = rf_model.predict(X)
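
Before looking for specific failure groups, it can be useful to check the model's aggregate accuracy on the held-out points (indices 3000 and up). The exact value will depend on the forest's random state; this is just a sanity check, not part of the Cobalt API:

from sklearn.metrics import accuracy_score

# accuracy on the 2000 points the model was not trained on
print(accuracy_score(y[3000:], y_pred[3000:]))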

We’ll put this data into a Pandas DataFrame to load it into Cobalt.

import pandas as pd
df = pd.DataFrame(X, columns=[f"feat_{i}" for i in range(20)])
df["y_true"] = pd.Series(y, dtype="category")
df["y_pred"] = pd.Series(y_pred, dtype="category")

To analyze the data and model in Cobalt, we first create a CobaltDataset.

import cobalt

ds = cobalt.CobaltDataset(df)

We’ll then add our model:

ds.add_model(
    input_columns=[f"feat_{i}" for i in range(20)],
    target_column="y_true",
    prediction_column="y_pred",
    task="classification",
    name="rf",
)

# add columns to the dataset with pointwise performance metrics
ds.compute_model_performance_metrics()

In order to analyze the data, we will need some embeddings—vector representations of the data. In this case, we could use the raw features, since they are numeric and have a uniform scale. However, because we used a random forest model, we can extract a potentially more useful embedding from the model’s internal structure.

Here’s how the embedding vectors are created. The model consists of an ensemble of decision trees. For each (data point, tree) pair, we can record the id of the leaf of the decision tree that the data point falls in. This gives us a matrix of shape (n_data_points, n_trees). We treat each row of this matrix as an embedding vector for the data point. We compute distances between vectors using the Hamming distance: the distance between x and y is the fraction of entries where the two vectors differ. (That is, in numpy notation, `d(x, y) = (x != y).mean()`.)

# for each tree in the forest, get the id of each data point's leaf
rf_emb = rf_model.apply(X)

# use these leaf ids as representations, with the hamming metric for similarity
ds.add_embedding_array(rf_emb, metric="hamming", name="rf_emb")
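
To see concretely what these embedding vectors look like, you can inspect the matrix returned by apply() and compute the Hamming distance between two points by hand. This is purely for illustration; Cobalt handles the distance computations itself.

import numpy as np

# one row per data point, one column per tree in the forest
print(rf_emb.shape)  # (5000, n_trees); 100 trees with scikit-learn's defaults

# Hamming distance between the first two points: the fraction of trees
# where they land in different leaves
print((rf_emb[0] != rf_emb[1]).mean())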

Cobalt will use these embedding vectors and the specified similarity metric to construct a topological representation of the dataset, in the form of a collection of graphs. These will be used to analyze the model’s performance and highlight relevant groups of data points.

Finally, we’ll tell Cobalt about our dataset split, with a DatasetSplit object. This can be any arbitrary division of the data, but here we’ll use it to indicate which data was used to train the model. Then we can instantiate the Workspace object, which will serve as the home for our analysis.

split = cobalt.DatasetSplit(ds, {"train": range(3000), "test": range(3000, 5000)})
w = cobalt.Workspace(ds, split)

There are a number of algorithms that the Workspace can run to analyze the model and data. We’ll ask it to look for failure groups: regions of similar data points where the model has a high error rate. To do this, use the find_failure_groups() method.

w.find_failure_groups(
    run_name="rf_failures",
    failure_metric="error",
    config={"threshold": 0.3}
)

The run_name is an identifier for the resulting set of groups, and is used to give each group a unique name. Different values of failure_metric can be used to find groups that perform poorly according to different metrics (e.g. error rate, false positive rate, etc.). The "threshold" setting is a lower bound on the error rate of each returned group: here we require that at least 30% of the data points in each group are mispredicted.
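
If the search returns few groups, you can rerun it with a looser threshold, which accepts groups with a lower error rate. The run name and threshold value below are illustrative:

w.find_failure_groups(
    run_name="rf_failures_loose",  # illustrative name for a second run
    failure_metric="error",
    config={"threshold": 0.2},     # accept groups with at least 20% errors
)

The rest of this tutorial uses the original rf_failures run.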

The output is represented as a table that will look like the following:

Group Name      Size   Description             accuracy   error
rf_failures/1   11     feat_12 mean=-1.2 (↓)   0.454545   0.545455
rf_failures/2   12     feat_16 mean=0.68 (↑)   0.416667   0.583333
rf_failures/3   17     feat_16 mean=0.27 (↑)   0.352941   0.647059
rf_failures/4   29     feat_12 mean=0.91 (↑)   0.344828   0.655172
rf_failures/5   15     feat_9 mean=0.81 (↑)    0.333333   0.666667
rf_failures/6   12     feat_12 mean=0.88 (↑)   0.333333   0.666667
rf_failures/7   12     feat_12 mean=-1.5 (↓)   0.333333   0.666667

We can also explore the results graphically by running

w.ui

This will show an interactive view with a representation of the data as well as the discovered groups.