Concept Overview
When should you use Cobalt?
You should use Cobalt to understand and surface actionable information about high-dimensional or unstructured data. This is useful both when you already have a model trained to make predictions and before you have a model at all.
If you’re in the data curation step, Cobalt will help you understand what’s in your dataset and let you examine clusters of similar data points, along with the clusters related to them, to build custom inclusion/exclusion criteria for your downstream modeling tasks. There are other use cases as well, and it’s always good to know what’s in your data early so you can limit unwelcome surprises.
If your model is not performing as well as you would like (e.g., low accuracy or a high error rate), Cobalt will help you locate semantically similar groups of data points that suggest why the model is failing.
What else might you try?
One approach you might try is to visualize your embeddings with a t-SNE, UMAP, or PCA plot. This gives you a 2D representation of the embedding. You can color the resulting scatter plot by error (or any other indicator function) and check whether there are regions of the plot that suffer from many errors.
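For example, here is a minimal sketch using scikit-learn and matplotlib, assuming embeddings, predictions, and labels are NumPy arrays aligned by row:

```python
# Project embeddings to 2D with PCA and color each point by whether the model got it wrong.
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

coords = PCA(n_components=2).fit_transform(embeddings)  # embeddings: (n_points, dim) array
errors = predictions != labels                          # boolean error indicator per point
plt.scatter(coords[:, 0], coords[:, 1], c=errors, cmap="coolwarm", s=5)
plt.title("Embedding projection colored by model error")
plt.show()
```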
If this gives you what you need, that’s great! But, given the limitations of these tools, it’s likely that you will want more. Specifically, PCA is a linear method that preserves the axes of greatest variance within the embedding, and there is no reason to expect this to preserve the geometric separability of the targets. UMAP and t-SNE are great visualization techniques, but if you’re looking at their outputs to find patterns, you should know that in order to create a reasonable 2D representation, they often split apart semantically similar data and bring together data that is not semantically similar, so the results may be misleading.
What does Cobalt offer?
Since what you want is to find error patterns, we engineered a solution that finds them without cutting up groups of similar data points. Specifically, we use concepts from topology, a branch of mathematics, to do this as faithfully as possible. The central technique our analysis is based on is building a map (technically speaking, a graph) of the embedding space that preserves the relationships between the points.
Right now, Cobalt provides tools to detect drifted groups of data points and groups with high error. We are also developing tools to detect spurious patterns and correlations.
What you need to know
What is an embedding?
An embedding is a representation of each data point as a fixed-dimension vector. An embedding has a similarity or distance metric associated with it (examples include cosine similarity and Euclidean distance). The idea is that if two data points are semantically similar to one another, then they should be nearby in the embedding space as well. There are various ways of computing embeddings. You can use a pretrained embedding model (e.g., sBERT) or extract hidden states from a model. In tabular settings, you can use the output of a random forest or the raw tabular features. If a model is available, extracting hidden states from it is the best bet. Otherwise, if you are in a pre-training step and there is no model available, publicly available models like sBERT or CLIP can give you a good understanding of what is in your dataset. Training an autoencoder to extract embeddings can also be a useful approach.
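As a rough sketch, here is one way to compute text embeddings with the publicly available sentence-transformers (sBERT) library; the model name is just an example:

```python
# Compute sentence embeddings with a pretrained model; `texts` is a list of strings.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # any pretrained sentence embedding model
embeddings = model.encode(texts)                 # shape: (n_points, embedding_dim)
```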
The Cobalt interface for embeddings allows you to create a cobalt.CobaltDataset from a pandas DataFrame and then call CobaltDataset.add_embedding_array(). This method accepts a numpy array and the name of a metric used to compute distances between rows of the embedding matrix.
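A minimal sketch of this workflow, assuming the embeddings computed above; the keyword argument names are illustrative, so check the API reference for exact signatures:

```python
import cobalt
import pandas as pd

# Build a dataset from a DataFrame of raw data and attach the embedding matrix.
df = pd.DataFrame({"text": texts, "label": labels})
ds = cobalt.CobaltDataset(df)
ds.add_embedding_array(embeddings, metric="cosine")  # metric name here is an assumption
```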
Metadata
To proceed with your analysis, you can add metadata about the data in the DataFrame. You can add a DatasetMetadata object to communicate to Cobalt which columns in the dataset contain timestamp data or media data (e.g., image paths).
When constructing the CobaltDataset object, you have the opportunity to pass in a DatasetMetadata object. If you do not, Cobalt auto-computes which columns in the DataFrame you passed in correspond to image path columns and text columns. In particular, Cobalt distinguishes between long text and short text: long text columns contain multiple space-separated words, and short text columns do not. We auto-compute TF-IDF-based keyword analysis on long text columns. Cobalt does not currently detect timestamp columns automatically, so if you want to color by a certain timestamp column, please add it to CobaltDataset.metadata.timestamp_columns after construction.
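For example, assuming timestamp_columns is a list of column names and your DataFrame has a (hypothetical) column called event_time:

```python
# Mark a column as containing timestamps so it can be used for coloring;
# the column name and list-style update here are assumptions.
ds.metadata.timestamp_columns.append("event_time")
```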
To add a model that you want to analyze, add a ModelMetadata object with the CobaltDataset.add_model() method to communicate which columns in the DataFrame are model predictions, model targets, and model input columns. This gives Cobalt the information it needs to organize your analysis tools in the UI and defines the relevant columns that the Workspace.find_failure_groups() method will consider when producing failure groups.
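A sketch of what this might look like; the parameter names are assumptions based on the column roles described above, so check the API reference for the exact signature:

```python
# Tell Cobalt which columns hold the model's inputs, targets, and predictions.
ds.add_model(
    input_columns=["text"],               # model input column(s)
    target_column="label",                # ground-truth column
    prediction_column="predicted_label",  # model output column
)
```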
Visualization
Once this has been appropriately labeled, you can run w = cobalt.Workspace(ds), where ds is your CobaltDataset, and then display w.ui. If you prefer a more programmatic interface, you can run w.find_failure_groups().
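Putting these steps together, a typical notebook session might look like this sketch:

```python
# Create a workspace over the dataset, then inspect it interactively or programmatically.
w = cobalt.Workspace(ds)
w.ui                               # displays the interactive interface in a notebook cell
groups = w.find_failure_groups()   # or retrieve failure groups directly
```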
When you display w.ui, a visual interface appears in the notebook with a network in the center. Each node in the graph represents a cluster of data points, and edges between nodes represent similarities between clusters. In sum, you can think of the network as a literal map of your dataset, based on the information contained in the embedding you provided. To see how your model makes decisions about your data, start by coloring the map by your model’s prediction. From there, we recommend looking at data points that lie on the border between two different colors to see what might be confusing the model.