In [None]:
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

import cobalt

# Cobalt Tutorial
In this notebook, we'll walk through the process of loading a dataset and model into Cobalt for analysis. 

## Dataset
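We'll use a synthetic dataset generated by scikit-learn for simplicity. With its default settings, `make_classification` produces 20 numeric features and a binary label; the points form multiple clusters, with each cluster assigned one of the two labels.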

In [None]:
X, y = make_classification(n_samples=5000, random_state=73902)
X_train = X[:3000, :]
y_train = y[:3000]

Let's put the data in a Pandas DataFrame; this will be part of the data loaded into Cobalt.

In [None]:
df = pd.DataFrame(X, columns=[f"feat_{i}" for i in range(20)])
df["y_true"] = pd.Series(y, dtype="category")

## Model
We'll create a simple random forest model for this dataset.

In [None]:
rf_model = RandomForestClassifier(random_state=3849)
rf_model.fit(X_train, y_train)
y_pred = rf_model.predict(X)
df["y_pred"] = pd.Series(y_pred, dtype="category")
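
Because the forest is fit on the first 3,000 rows only, comparing accuracy on the fitted rows with accuracy on the remaining rows is a quick sanity check before handing the predictions to Cobalt. The following is a minimal sketch that regenerates the same synthetic data so that it runs on its own; the names `X_test`, `y_test`, `train_acc`, and `test_acc` are illustrative and do not appear in the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Regenerate the tutorial's data and model so this cell is self-contained.
X, y = make_classification(n_samples=5000, random_state=73902)
X_train, y_train = X[:3000], y[:3000]
X_test, y_test = X[3000:], y[3000:]

rf_model = RandomForestClassifier(random_state=3849)
rf_model.fit(X_train, y_train)

# Accuracy on the rows used for fitting vs. the held-out rows.
train_acc = rf_model.score(X_train, y_train)
test_acc = rf_model.score(X_test, y_test)
print(f"train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")
```

A large gap between the two numbers would suggest that the errors Cobalt surfaces later are concentrated in the held-out rows.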

We can now create a `CobaltDataset` object from our dataframe. This will automatically extract some metadata about the columns, and will serve to organize additional information about our data and model.

In [None]:
ds = cobalt.CobaltDataset(df)

We'll now add our model to the dataset, telling Cobalt which columns are the model inputs, which columns contain the model's predictions and the target, and which task the model performs.

In [None]:
ds.add_model(
    input_columns=[f"feat_{i}" for i in range(20)],
    target_column="y_true",
    prediction_column="y_pred",
    task="classification",
    name="rf",
)
# add columns to the dataset to flag model mispredictions
ds.compute_model_performance_metrics()
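
To make "flagging mispredictions" concrete, here is a minimal NumPy sketch of a per-row error flag and the overall error rate derived from it. It is only an illustration on toy data: the arrays stand in for the `y_true` and `y_pred` columns, and the name `is_error` is an assumption, not necessarily the column name Cobalt creates.

```python
import numpy as np

# Toy labels and predictions standing in for df["y_true"] and df["y_pred"].
y_true = np.array([0, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 0, 0, 0, 0])

# True wherever the prediction disagrees with the label.
is_error = y_true != y_pred

# The mean of a boolean array is the fraction of True entries.
error_rate = is_error.mean()
print(is_error, error_rate)
```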

## Embeddings
To analyze the model, Cobalt will need an *embedding*: a vector representation of each data point. Since this dataset comes as a collection of numeric features with a uniform scale, we could use the raw features as the embedding. However, it is usually more helpful to use an embedding that reflects the model's internal state.

Here, we will use an embedding extracted from the random forest model. The model consists of an ensemble of decision trees. For each (data point, tree) pair, we can record the id of the leaf of the decision tree that the data point falls into. This gives us a matrix of shape `(n_data_points, n_trees)`. We treat each row of this matrix as an embedding vector for the data point. We compute distances between vectors using the Hamming distance: the distance between `x` and `y` is the fraction of entries where the two vectors differ. (That is, in numpy notation, `d(x, y) = (x != y).mean()`.)

In [None]:
rf_emb = rf_model.apply(X)
ds.add_embedding_array(rf_emb, metric="hamming", name="rf_emb")
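
The Hamming distance on leaf indices can be checked by hand on a tiny example. The sketch below uses a made-up `leaf_ids` matrix (3 data points, 4 trees) laid out like the output of `rf_model.apply(X)`; the values and the helper `hamming` are hypothetical and serve only to illustrate the formula.

```python
import numpy as np

# Hypothetical leaf indices for 3 data points across 4 trees,
# in the same (n_data_points, n_trees) layout as `rf_model.apply(X)`.
leaf_ids = np.array([
    [3, 7, 2, 5],
    [3, 7, 4, 5],
    [1, 6, 4, 0],
])

def hamming(x, y):
    """Fraction of trees in which two points land in different leaves."""
    return (x != y).mean()

# All pairwise distances at once via broadcasting; shape (3, 3).
pairwise = (leaf_ids[:, None, :] != leaf_ids[None, :, :]).mean(axis=2)
print(pairwise)
```

Points 0 and 1 share a leaf in three of the four trees, so they are close; points 0 and 2 never share a leaf, so they are at the maximum distance of 1.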

## Workspace

The `Workspace` object is the home for any Cobalt analysis. We can create one from the `CobaltDataset` alone, but here we will also provide a `DatasetSplit` object. This can be any division of the dataset into subsets; we'll use it to separate the train and test subsets by providing lists of indices for each subset.

In [None]:
split = cobalt.DatasetSplit(
    ds, {"train": np.arange(3000), "test": np.arange(3000, 5000)}
)
w = cobalt.Workspace(ds, split)
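
Since the split is given as plain index arrays, it is easy to check independently of Cobalt that the subsets do not overlap and that together they cover every row. This is a small sketch for this tutorial's split only; the names `split_indices`, `is_disjoint`, and `covers_all` are illustrative, and nothing here implies that Cobalt itself requires a split to be disjoint or exhaustive.

```python
import numpy as np

n_rows = 5000
# The same index arrays passed to the split above.
split_indices = {"train": np.arange(3000), "test": np.arange(3000, n_rows)}

all_idx = np.concatenate(list(split_indices.values()))

# No index appears in more than one subset.
is_disjoint = len(np.unique(all_idx)) == len(all_idx)
# Every row of the dataset belongs to some subset.
covers_all = np.array_equal(np.sort(all_idx), np.arange(n_rows))
print(is_disjoint, covers_all)
```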

Now we'll ask the `Workspace` to find "failure groups": groups of similar data points where the model has an elevated error rate. This will show a table summarizing the groups discovered.

The `run_name` is an identifier for the resulting set of groups, and will be used to help give each group a unique name. Different values of `failure_metric` can be used to find groups that have low performance according to different metrics (e.g. error rate, false positive rate, etc.). Setting `"threshold"` places a lower bound on the error rate of each returned group; here we require that at least 30% of the data points in each group are mispredicted.

In [None]:
w.find_failure_groups(
    run_name="rf_failures", failure_metric="error", config={"threshold": 0.3}
)
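
To illustrate what the threshold means, the sketch below computes a per-group error rate on toy data and keeps only the groups at or above 0.3. It does not reproduce how Cobalt discovers groups; the group assignments, the error flags, and the names `group_ids`, `error_rates`, and `kept` are all hypothetical.

```python
import numpy as np

# Hypothetical group assignment and error flags for 10 data points.
group_ids = np.array([0, 0, 0, 0, 1, 1, 1, 2, 2, 2])
is_error = np.array([1, 0, 0, 0, 1, 1, 0, 0, 0, 1], dtype=bool)

threshold = 0.3

# Error rate of each group: fraction of its points that are mispredicted.
error_rates = {
    int(g): float(is_error[group_ids == g].mean())
    for g in np.unique(group_ids)
}

# Keep only groups whose error rate is at least the threshold.
kept = [g for g, rate in error_rates.items() if rate >= threshold]
print(error_rates, kept)
```

Group 0 has one error in four points (25%) and is dropped, while groups 1 and 2 meet the 30% bound and are kept.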

We can now explore these groups interactively in the UI.

In [None]:
w.ui