Tutorial
This tutorial will walk through the steps involved in using Cobalt to analyze a model. To keep this self-contained, we’ll use a synthetic dataset generated by scikit-learn and train a basic random forest model.
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=5000, random_state=73902)
X_train = X[:3000, :]
y_train = y[:3000]
This dataset has 20 randomly created features and two classes.
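To double-check these defaults, here is a quick sanity check with numpy (a minimal sketch, assuming the arrays created above):
import numpy as np
# one row per sample, one column per feature
print(X.shape)        # (5000, 20)
# the distinct class labels
print(np.unique(y))   # [0 1]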
from sklearn.ensemble import RandomForestClassifier
rf_model = RandomForestClassifier()
rf_model.fit(X_train, y_train)
y_pred = rf_model.predict(X)
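Before looking for failure groups, it can be useful to get a baseline sense of how the model performs on data it was not trained on. This is a small sketch using scikit-learn’s accuracy_score, separate from the Cobalt workflow itself:
from sklearn.metrics import accuracy_score
# the model saw only the first 3000 points during training
print("train accuracy:", accuracy_score(y[:3000], y_pred[:3000]))
print("test accuracy:", accuracy_score(y[3000:], y_pred[3000:]))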
We’ll put this data into a Pandas DataFrame to load it into Cobalt.
import pandas as pd
df = pd.DataFrame(X, columns=[f"feat_{i}" for i in range(20)])
df["y_true"] = pd.Series(y, dtype="category")
df["y_pred"] = pd.Series(y_pred, dtype="category")
To analyze the data and model in Cobalt, we first create a CobaltDataset.
import cobalt
ds = cobalt.CobaltDataset(df)
We’ll then add our model:
ds.add_model(
    input_columns=[f"feat_{i}" for i in range(20)],
    target_column="y_true",
    prediction_column="y_pred",
    task="classification",
    name="rf",
)
# add columns to the dataset with pointwise performance metrics
ds.compute_model_performance_metrics()
In order to analyze the data, we will need some embeddings—vector representations of the data. In this case, we could use the raw features, since they are numeric and have a uniform scale. However, because we used a random forest model, we can extract a potentially more useful embedding from the model’s internal structure.
Here’s how the embedding vectors are created. The model consists of an ensemble of decision trees. For each (data point, tree) pair, we can record the id of the leaf of the decision tree that the data point falls in. This gives us a matrix of shape (n_data_points, n_trees). We treat each row of this matrix as an embedding vector for the data point. We compute distances between vectors using the Hamming distance: the distance between x and y is the fraction of entries where the two vectors differ. (That is, in numpy notation, `d(x, y) = (x != y).mean()`.)
# for each tree in the forest, get the id of each data point's leaf
rf_emb = rf_model.apply(X)
# use these leaf ids as representations, with the hamming metric for similarity
ds.add_embedding_array(rf_emb, metric="hamming", name="rf_emb")
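As a quick illustration of the metric described above (not needed for the Cobalt workflow), the Hamming distance between the leaf-id vectors of the first two data points can be computed directly with numpy:
# leaf-id vectors for the first two data points, one entry per tree
x_vec, y_vec = rf_emb[0], rf_emb[1]
# fraction of trees that send the two points to different leaves
print((x_vec != y_vec).mean())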
Cobalt will use these embedding vectors and the specified similarity metric to construct a topological representation of the dataset, in the form of a collection of graphs. These will be used to analyze the model’s performance and highlight relevant groups of data points.
Finally, we’ll tell Cobalt about our dataset split, with a DatasetSplit object. This can be any arbitrary division of the data, but here we’ll use it to indicate which data was used to train the model. Then we can instantiate the Workspace object, which will serve as the home for our analysis.
split = cobalt.DatasetSplit(ds, {"train": range(3000), "test": range(3000, 5000)})
w = cobalt.Workspace(ds, split)
There are a number of algorithms that the Workspace can run to analyze the model and data. We’ll ask it to look for failure groups: regions of similar data points where the model has a high error rate. To do this, use the find_failure_groups() method.
w.find_failure_groups(
    run_name="rf_failures",
    failure_metric="error",
    config={"threshold": 0.3},
)
The run_name is an identifier for the resulting set of groups, and will be used to help give each group a unique name. Different values of failure_metric can be used to find groups that have low performance according to different metrics (e.g. error rate, false positive rate, etc.). Setting "threshold" puts a bound on the minimum error rate in each returned group: here we require that each group has at least 30% mispredicted data points.
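To make the threshold concrete: a group’s error rate is just the fraction of its points whose prediction disagrees with the label. If a group were represented by a list of row indices (group_indices below is a hypothetical placeholder, not something returned by this call), its error rate could be computed directly from the DataFrame:
# hypothetical row indices for one group
group_indices = [5, 17, 42, 108]
group = df.iloc[group_indices]
# fraction of mispredicted points in the group
error_rate = (group["y_pred"] != group["y_true"]).mean()
# groups returned by find_failure_groups satisfy error_rate >= 0.3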
The output is represented as a table that will look like the following:
Group Name      Size  Description              accuracy   error
rf_failures/1   11    feat_12 mean=-1.2 (↓)    0.454545   0.545455
rf_failures/2   12    feat_16 mean=0.68 (↑)    0.416667   0.583333
rf_failures/3   17    feat_16 mean=0.27 (↑)    0.352941   0.647059
rf_failures/4   29    feat_12 mean=0.91 (↑)    0.344828   0.655172
rf_failures/5   15    feat_9 mean=0.81 (↑)     0.333333   0.666667
rf_failures/6   12    feat_12 mean=0.88 (↑)    0.333333   0.666667
rf_failures/7   12    feat_12 mean=-1.5 (↓)    0.333333   0.666667
We can also explore the results graphically by running:
w.ui
This will show an interactive view with a representation of the data as well as the discovered groups.