Concept Overview
When should you use Cobalt?
You should use Cobalt to understand and surface actionable information about high-dimensional or unstructured data. This is useful both when you already have a model trained to make predictions and before you have a model at all.
If you’re in the data curation step, Cobalt will help you understand what’s in your dataset and let you examine clusters of similar data points, along with the clusters related to them, to build custom inclusion/exclusion criteria for your downstream modeling tasks. There are other use cases as well, and it’s always good to know what’s in your data early so you can limit unwelcome surprises.
If your model is not performing as well as you would like (e.g., low accuracy or a high error rate), Cobalt will help you locate semantically similar groups of data points that suggest why the model is failing.
What else might you try?
One approach you might try is to visualize your embeddings with a t-SNE, UMAP, or PCA plot. This gives you a 2D representation of the embedding. You can color the resulting scatter plot by error (or any other indicator function) and check whether there are regions of the plot that suffer from many errors.
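For example, here is a minimal sketch using scikit-learn and matplotlib, assuming embeddings, predictions, and labels are NumPy arrays aligned by row:

```python
# Project embeddings to 2D with PCA and color each point by whether the model got it wrong.
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

coords = PCA(n_components=2).fit_transform(embeddings)  # embeddings: (n_points, dim) array
errors = predictions != labels                          # boolean error indicator per point
plt.scatter(coords[:, 0], coords[:, 1], c=errors, cmap="coolwarm", s=5)
plt.title("Embedding projection colored by model error")
plt.show()
```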
If this gives you what you need, that’s great! But, given the limitations of these tools, it’s likely that you will want more. Specifically, PCA is a linear method that preserves the axes of greatest variance within the embedding, and there is no reason to expect this to preserve the geometric separability of the targets. UMAP and t-SNE are great visualization techniques, but if you’re looking at their outputs to find patterns, you should know that in order to create a reasonable 2D representation, they often split apart semantically similar data and bring together data that is not semantically similar, so the results may be misleading.
What does Cobalt offer?
Since what you want is to find error patterns, we engineered a solution that finds them without cutting up groups of similar data points. Specifically, we use concepts from topology, a branch of mathematics, to do this as faithfully as possible. The central technique our analysis is based on is building a map (technically speaking, a graph) of the embedding space that preserves the relationships between the points.
Right now, Cobalt provides tools to detect drifted groups of data points and groups with high error. We are also developing tools to detect spurious patterns and correlations.
What you need to know
What is an embedding?
An embedding is a representation of each data point as a fixed-dimension vector. An embedding has a similarity or distance metric associated with it (examples include cosine similarity and Euclidean distance). The idea is that if two data points are semantically similar to one another, then they should be nearby in the embedding space as well. There are various ways of computing embeddings. You can use a pretrained embedding model (e.g., sBERT) or extract hidden states from a model. In tabular settings, you can use the output of a random forest or the raw tabular features. If a model is available, extracting hidden states from it is the best bet. Otherwise, if you are in a pre-training step and there is no model available, publicly available models like sBERT or CLIP can give you a good understanding of what is in your dataset. Training an autoencoder to extract embeddings can also be a useful approach.
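As a rough sketch, here is one way to compute text embeddings with the publicly available sentence-transformers (sBERT) library; the model name is just an example:

```python
# Compute sentence embeddings with a pretrained model; `texts` is a list of strings.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # any pretrained sentence embedding model
embeddings = model.encode(texts)                 # shape: (n_points, embedding_dim)
```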
The Cobalt interface for embeddings allows you to create a cobalt.CobaltDataset from a pandas DataFrame and then call CobaltDataset.add_embedding_array(). This method accepts a numpy array and the name of a metric used to compute distances between rows of the embedding matrix.
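A minimal sketch of this workflow, assuming the embeddings computed above; the keyword argument names are illustrative, so check the API reference for exact signatures:

```python
import cobalt
import pandas as pd

# Build a dataset from a DataFrame of raw data and attach the embedding matrix.
df = pd.DataFrame({"text": texts, "label": labels})
ds = cobalt.CobaltDataset(df)
ds.add_embedding_array(embeddings, metric="cosine")  # metric name here is an assumption
```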
Metadata
To proceed with your analysis, you can add metadata about the data in the DataFrame. You can add a DatasetMetadata object to communicate to Cobalt which columns in the dataset contain timestamp data or media data (e.g., image paths).
When constructing the CobaltDataset object, you have the opportunity to pass in a DatasetMetadata object. If you do not, Cobalt auto-computes which columns in the DataFrame you passed in correspond to image path columns and text columns. In particular, Cobalt distinguishes between long text and short text: long text columns contain multiple space-separated words, and short text columns do not. We auto-compute TF-IDF-based keyword analysis on long text columns. Cobalt does not currently detect timestamp columns automatically, so if you want to color by a certain timestamp column, please add it to CobaltDataset.metadata.timestamp_columns after construction.
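For example, assuming timestamp_columns is a list of column names and your DataFrame has a (hypothetical) column called event_time:

```python
# Mark a column as containing timestamps so it can be used for coloring;
# the column name and list-style update here are assumptions.
ds.metadata.timestamp_columns.append("event_time")
```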
To add a model that you want to analyze, add a ModelMetadata object with the CobaltDataset.add_model() method to communicate which columns in the DataFrame are model predictions, model targets, and model input columns. This gives Cobalt the information it needs to organize your analysis tools in the UI and defines the relevant columns that the Workspace.find_failure_groups() method will consider when producing failure groups.
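A sketch of what this might look like; the parameter names are assumptions based on the column roles described above, so check the API reference for the exact signature:

```python
# Tell Cobalt which columns hold the model's inputs, targets, and predictions.
ds.add_model(
    input_columns=["text"],               # model input column(s)
    target_column="label",                # ground-truth column
    prediction_column="predicted_label",  # model output column
)
```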
Visualization
Once this has been appropriately labeled, you can run w = cobalt.Workspace(ds), where ds is your CobaltDataset, and then display w.ui. If you prefer a more programmatic interface, you can run w.find_failure_groups().
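Putting these steps together, a typical notebook session might look like this sketch:

```python
# Create a workspace over the dataset, then inspect it interactively or programmatically.
w = cobalt.Workspace(ds)
w.ui                               # displays the interactive interface in a notebook cell
groups = w.find_failure_groups()   # or retrieve failure groups directly
```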
When you display w.ui, a visual interface appears in the notebook with a network in the center. Each node in the graph represents a cluster of data points, and edges between nodes represent similarities between clusters. In sum, you can think of the network as a literal map of your dataset, based on the information contained in the embedding you provided. To see how your model makes decisions about your data, start by coloring the map by your model’s prediction. From there, we recommend looking at data points that lie on the border between two different colors to see what might be confusing the model.