In [None]:
# No need to run this cell if these packages are already installed
%pip install torch transformers datasets

In [None]:
import datasets
import numpy as np
import pandas as pd
import torch
import transformers

import cobalt

# Text Classification Tutorial
In this notebook, we will walk through the process of loading a dataset and model into Cobalt for analysis.

## Dataset

We'll use Banking77, a popular dataset from HuggingFace, for this demo.

- Banking77 is a text classification dataset.
- It is composed of online banking text queries and has 77 classes.

It can be accessed via this link: https://huggingface.co/datasets/mteb/banking77 .

Let's load the dataset from HuggingFace.

In [None]:
dataset = datasets.load_dataset("banking77")
train_set, test_set = dataset["train"], dataset["test"]

print("Train set size:", train_set.shape[0])
print("Test set size:", test_set.shape[0])

Convert integer labels to string labels for both train and test sets. This is an optional step but it makes understanding the labels much easier.

In [None]:
train_str_labels = [train_set.features["label"].int2str(i["label"]) for i in train_set]
train_set = train_set.add_column("true_label", train_str_labels).remove_columns("label")

test_str_labels = [test_set.features["label"].int2str(i["label"]) for i in test_set]
test_set = test_set.add_column("true_label", test_str_labels).remove_columns("label")
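
To make the conversion concrete, here is a minimal sketch of the same integer-to-name lookup in plain Python. The `class_names` list is a hypothetical three-class stand-in rather than the actual Banking77 label list, and `int_to_str` is an illustrative helper, not part of the `datasets` library.

```python
# Hypothetical stand-in for the dataset's ordered list of class names
class_names = ["lost_card", "exchange_rate", "refund"]


def int_to_str(label_ids, names):
    """Map integer class ids to their string names by position."""
    return [names[i] for i in label_ids]


labels = [2, 0, 0, 1]
str_labels = int_to_str(labels, class_names)
print(str_labels)
```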

Additionally, let's convert our datasets into pandas dataframes.

In [None]:
train_df = train_set.to_pandas()
test_df = test_set.to_pandas()

Here's an example of how the data looks.

In [None]:
train_df.head()

## Model

We'll use a pre-trained HuggingFace model for this dataset.

The model can be accessed via this link: https://huggingface.co/philschmid/BERT-Banking77 .

Let's load in the model and tokenizer from the model_id and create a text-classification pipeline.

In [None]:
model_id = "philschmid/BERT-Banking77"
tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
model = transformers.AutoModelForSequenceClassification.from_pretrained(model_id)
classifier_text = transformers.pipeline(
    "text-classification", tokenizer=tokenizer, model=model
)

We can now generate predictions using the text-classification pipeline and add them to the train and test dataframes.

In [None]:
test_predictions = classifier_text(test_set["text"])
test_df["predicted_label"] = [pred["label"] for pred in test_predictions]

train_predictions = classifier_text(train_set["text"])
train_df["predicted_label"] = [pred["label"] for pred in train_predictions]

# Run time: ~6 mins
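
Before loading the predictions into Cobalt, it can be worth checking that they line up with the true labels. The sketch below computes plain accuracy from two label sequences. The `accuracy` helper and the toy labels are illustrative assumptions, not part of Cobalt or of the notebook's data.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of positions where the predicted label equals the true label."""
    pairs = list(zip(true_labels, predicted_labels))
    if not pairs:
        raise ValueError("no labels to compare")
    return sum(t == p for t, p in pairs) / len(pairs)


# Toy labels (hypothetical), standing in for the dataframe columns
toy_true = ["refund", "lost_card", "refund", "exchange_rate"]
toy_pred = ["refund", "lost_card", "exchange_rate", "exchange_rate"]
toy_acc = accuracy(toy_true, toy_pred)
print(f"accuracy: {toy_acc:.2f}")
```

On the notebook's data, the same helper could be called as `accuracy(test_df["true_label"], test_df["predicted_label"])`. The result is only meaningful if the label strings returned by the pipeline use the same names as the `true_label` column.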

We can now create a CobaltDataset object from our dataframes.

But first, let's combine our dataframes.

In [None]:
combined_df = pd.concat([train_df, test_df], axis=0).reset_index(drop=True)

Load the combined dataframe into Cobalt.

In [None]:
ds = cobalt.CobaltDataset(combined_df)

We'll now add our model to the dataset, telling Cobalt which column is the model input, which columns contain the model output and the target, and what the model task is.

In [None]:
ds.add_model(
    input_columns="text",
    target_column="true_label",
    prediction_column="predicted_label",
    task="classification",
    name="HF_model",
)
# add columns to the dataset to flag model mispredictions
ds.compute_model_performance_metrics()

## Embeddings

To analyze the model, Cobalt will need an embedding: a vector representation of each data point. For text, a text-embedding model such as Sentence-BERT can be used to generate embeddings. It can also be helpful to use an embedding that reflects the model's internal state.

Here, we will use an embedding extracted from the pre-trained HuggingFace model that we loaded above. The model is a fine-tuned BERT model, and we will extract the hidden state of its last layer at the first token position (the [CLS] token) to form our embeddings. It is possible to choose other layers instead.

We will use this helper function to retrieve embeddings.

In [None]:
def get_hidden_state_embeddings_for_texts(
    texts, pipeline, task_index=0, hidden_state=-1
):
    # Retrieves hidden state embeddings from a specific hidden layer.
    # task_index is the token position to read; position 0 is the [CLS] token,
    # which BERT sequence classification uses as the sequence representation.
    all_embeddings = []

    for text in texts:
        # Tokenize the input text
        tokens = pipeline.tokenizer(text, return_tensors="pt", padding=True)

        # Get the hidden states from the model
        with torch.no_grad():
            outputs = pipeline.model(**tokens, output_hidden_states=True)

        # Extract the hidden state from the desired layer
        embedding = outputs.hidden_states[hidden_state]

        # To explicitly access the embedding corresponding to the task
        embedding = embedding[:, task_index, :]

        # Add embedding to embedding list
        all_embeddings.append(embedding.numpy())

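    # Each embedding has shape (1, hidden_size); stacking along axis 0 gives
    # (n_texts, hidden_size), and stays 2-D even for a single input text.
    return np.concatenate(all_embeddings, axis=0)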
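
The indexing step in the helper can be illustrated without loading a model. BERT tokenizers place the [CLS] token at position 0 of every sequence, so slicing position 0 from a hidden-state tensor of shape (batch, sequence_length, hidden_size) yields one vector per input. The sketch below uses a NumPy array as a stand-in for the hidden states; the sizes are arbitrary toy values.

```python
import numpy as np

# Stand-in for one layer's hidden states: (batch, sequence_length, hidden_size)
batch, seq_len, hidden_size = 2, 5, 8
hidden = np.arange(batch * seq_len * hidden_size, dtype=np.float32).reshape(
    batch, seq_len, hidden_size
)

task_index = 0  # position of the first token in each sequence
cls_vectors = hidden[:, task_index, :]  # one vector per sequence

print(hidden.shape)
print(cls_vectors.shape)
```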

Let's retrieve embeddings for the train and test datasets.

In [None]:
test_embeddings = get_hidden_state_embeddings_for_texts(
    test_set["text"], pipeline=classifier_text
)
train_embeddings = get_hidden_state_embeddings_for_texts(
    train_set["text"], pipeline=classifier_text
)

print(train_embeddings.shape, test_embeddings.shape)
# Run time: ~6 mins

Let's concatenate these embeddings and load them into Cobalt.

We compute distances between embedding vectors using cosine distance here.

In [None]:
combined_embeddings = np.concatenate([train_embeddings, test_embeddings])
ds.add_embedding_array(combined_embeddings, metric="cosine", name="last_layer_emb")
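
For reference, cosine distance is usually defined as one minus the cosine similarity of two vectors, so it depends on their direction but not their length. The sketch below follows that usual definition with NumPy. `cosine_distance` is an illustrative helper, and Cobalt's internal computation may differ in detail.

```python
import numpy as np


def cosine_distance(a, b):
    """1 - cos(angle between a and b); assumes both vectors are nonzero."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return 1.0 - float(similarity)


same_direction = cosine_distance([1, 0], [2, 0])
orthogonal = cosine_distance([1, 0], [0, 3])
opposite = cosine_distance([1, 0], [-1, 0])
print(same_direction, orthogonal, opposite)
```

Vectors pointing the same way have distance 0 regardless of magnitude, orthogonal vectors have distance 1, and opposite vectors have distance 2.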

## Workspace

The Workspace object is the home for any Cobalt analysis. 

We can create one from the CobaltDataset alone, but here we will also provide a DatasetSplit object. This can be any division of the dataset into subsets; we'll use it to separate the train and test subsets by providing lists of indices for each subset.

In [None]:
split = cobalt.DatasetSplit(
    ds,
    {
        "train": np.arange(train_df.shape[0]),
        "test": np.arange(train_df.shape[0], combined_df.shape[0]),
    },
)
w = cobalt.Workspace(ds, split)
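
These index ranges are only correct because the train rows were concatenated first and the index was reset, so train occupies positions 0 to n_train - 1 and test occupies the rest. A minimal sketch of that layout, using toy sizes in place of the real row counts, checks that the two ranges cover every row exactly once.

```python
import numpy as np

n_train, n_test = 4, 2  # toy sizes, standing in for the dataframe row counts
train_idx = np.arange(n_train)
test_idx = np.arange(n_train, n_train + n_test)

# Together the two ranges should cover every row exactly once
all_idx = np.concatenate([train_idx, test_idx])
covers_everything = np.array_equal(np.sort(all_idx), np.arange(n_train + n_test))
no_overlap = np.intersect1d(train_idx, test_idx).size == 0

print(train_idx, test_idx, covers_everything, no_overlap)
```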

Now we'll ask the Workspace to find "failure groups": groups of similar data points where the model has an elevated error rate. This will show a table summarizing the groups discovered.

The run_name is an identifier for the resulting set of groups, and will be used to help give each group a unique name. Different values of failure_metric can be used to find groups that have low performance according to different metrics (e.g. error rate, false positive rate, etc.).

Setting "threshold" puts a lower bound on the error rate in each returned group: here we require that at least 30% of the data points in each group are mispredicted.

In [None]:
w.find_failure_groups(
    run_name="HF_model_failures", failure_metric="error", config={"threshold": 0.3}
)
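
To illustrate what the threshold means, the sketch below keeps only those groups whose error rate is at least the threshold. This is not how Cobalt discovers groups; it only shows the filtering criterion described above. The `groups_above_threshold` helper and the toy groups are hypothetical.

```python
def groups_above_threshold(groups, threshold):
    """Return {group name: error rate} for groups with error rate >= threshold."""
    kept = {}
    for name, errors in groups.items():
        rate = sum(errors) / len(errors)
        if rate >= threshold:
            kept[name] = rate
    return kept


# Each list flags whether the model mispredicted a data point in that group
toy_groups = {
    "group_a": [True, True, False, False],
    "group_b": [False, False, False, True, False],
    "group_c": [True, False, False],
}
kept = groups_above_threshold(toy_groups, threshold=0.3)
print(sorted(kept))
```

In this toy example, the group with one error in five points falls below the 30% bound and is dropped, while the other two are kept.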

We can now explore these groups interactively in the UI.

In [None]:
w.ui
