Preparing Data for Cobalt

Cobalt provides a structure for organizing your data that streamlines the analysis of models and datasets. The Cobalt data schema contains the following information (among other things):

  • Tabular data, potentially containing input features to the model or metadata for each data point.

  • Metadata about the column types in the tabular dataset.

  • Embeddings, which are vector representations of each data point that can be used to map out similarities and relationships between data points. These are used to create TDA graphs.

  • Model tasks, outputs, and ground truth values, used to evaluate model performance on different subsets. More than one model can be represented.

  • A split, or division of the data into coarse subsets, such as a training and validation subset.

Most of this information is optional, but omitting it will limit the functionality Cobalt can provide.

The dataset and most of its metadata are encapsulated in a CobaltDataset object. To create one, you only need to provide a pandas DataFrame. You can then add embeddings, adjust column metadata, and specify models. The general process looks like this:

from cobalt import CobaltDataset
from cobalt.schema import TextDataType  # used below; adjust this import path if your version exports it elsewhere

# df is a DataFrame containing the data points (features, text, image paths, etc.)
ds = CobaltDataset(df)

# X is a numpy array of shape (len(df), D) containing embeddings for each data point
ds.add_embedding_array(X, name="array_embedding", metric="euclidean")

# "text" is a column in df containing text we want to embed
# will use a sentence_transformers model to produce embeddings
ds.add_text_column_embedding("text", embedding_name="text_embedding")

# make sure "text" is tagged as containing longform text
# (for which analysis like keyword extraction is suitable)
# autodetection works well, so this is usually not necessary
ds.metadata.data_types["text"].text_type = TextDataType.long_text

# assume ground truth sentiment classification labels in the "sentiment" column
# model predictions in the "pred_sentiment" column
# Cobalt will compute performance metrics for this model
ds.add_model(
    input_columns="text",
    target_column="sentiment",
    prediction_column="pred_sentiment",
    task="classification",
    name="sentiment_classifier",
)

Once the dataset is created, you can optionally create a DatasetSplit object as well. This defines a number of coarse divisions of your dataset that may be useful for later analysis, e.g. a train/test split. If there is a column in the data table that labels rows by their split membership, you can use DatasetSplit.from_dataset_column() to create it:

from cobalt import DatasetSplit

split = DatasetSplit.from_dataset_column(ds, "split")

Otherwise, you can pass a dictionary mapping split names to indices to the constructor:

import numpy as np

split = DatasetSplit(ds, {"train": np.arange(10000), "test": np.arange(10000, 12000)})

The dataset (and optionally, split) can then be used to create a Workspace object that will be used to build graphs and perform analyses.
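
As a minimal sketch, assuming (as in the examples above) that Workspace is exported from the top-level cobalt package and accepts the dataset plus an optional split:

from cobalt import Workspace

# build a workspace from the dataset alone...
w = Workspace(ds)

# ...or include the split, so later analyses can compare subsets
w = Workspace(ds, split)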

DataFrame Requirements

DataFrames should have a sequential integer index when used to construct a CobaltDataset. You can ensure this by calling df = df.reset_index(drop=True) before creating the CobaltDataset object; drop=True discards the old index rather than inserting it as a new column, and the reassignment is needed because reset_index() returns a new DataFrame. All column names in the table should also be strings (rather than integers or other data types). The constructor will raise an error if these conditions are not met.
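
For example, a quick way to normalize a DataFrame before constructing the dataset, using standard pandas calls:

from cobalt import CobaltDataset

# df is a DataFrame loaded elsewhere
df = df.reset_index(drop=True)             # replace the index with 0..n-1, dropping the old one
df.columns = [str(c) for c in df.columns]  # make sure every column name is a string

ds = CobaltDataset(df)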

Creating Embeddings

Embeddings are key to Cobalt's TDA analysis: they enable us to build sophisticated maps of data points based on realistic measures of similarity. The CobaltDataset class includes functionality for creating embeddings from raw data in a few ways.

For tabular data with numerical feature types, a subset of the columns (perhaps with some simple rescaling) can work well as an embedding. This can be done with CobaltDataset.add_column_embedding().
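
A sketch of what this might look like is below; the column names and the columns= and name= parameters are illustrative assumptions, not the exact signature:

# hypothetical numeric feature columns; parameter names are assumptions
ds.add_column_embedding(
    columns=["age", "income", "tenure"],
    name="tabular_embedding",
)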

For more complex tabular data, an approach based on random forests can often produce a very useful similarity metric. The random forest can be trained to predict a selected outcome column, or to distinguish between the provided data and synthetically generated data with a similar distribution. Use CobaltDataset.add_rf_embedding() to generate a random forest embedding from an existing embedding (e.g. one added with add_column_embedding()).
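
A hedged sketch, reusing the hypothetical "tabular_embedding" from the previous example; the parameter names here are assumptions rather than the exact signature:

# train a random forest to predict a (hypothetical) "outcome" column,
# then embed each data point using the trained forest
ds.add_rf_embedding(
    source_embedding="tabular_embedding",
    target_column="outcome",
    name="rf_embedding",
)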

For text data, CobaltDataset.add_text_column_embedding() creates embeddings locally with models from the sentence-transformers library.

Alternatively, you can create your own embeddings using any desired method. A NumPy array containing the embedding vectors can be added to a dataset by calling CobaltDataset.add_embedding_array(). Be sure to specify the appropriate distance metric to be used with the embedding vectors.

Available metrics include: