Preparing Data for Cobalt

Cobalt provides a structure for organizing your data that streamlines the analysis of models and datasets. The Cobalt data schema contains the following information (among other things):

  • Tabular data, potentially containing input features to the model or metadata for each data point.

  • Metadata about the column types in the tabular dataset.

  • Embeddings, which are vector representations of each data point that can be used to map out similarities and relationships between data points. These are used to create TDA graphs.

  • Model tasks, outputs, and ground truth values, used to evaluate model performance on different subsets. More than one model can be represented.

  • A split, or division of the data into coarse subsets, such as a training and validation subset.

Most of this information is optional, but omitting it will limit the functionality Cobalt can provide.

The dataset and most of its metadata are encapsulated in a CobaltDataset object. To create one, you only need to provide a pandas DataFrame. You can then add embeddings, adjust column metadata, and specify models. The general process looks like this:

from cobalt import CobaltDataset
from cobalt.schema import TextDataType  # used below; adjust this import path if your version exports it elsewhere

# df is a DataFrame containing the data points (features, text, image paths, etc.)
ds = CobaltDataset(df)

# X is a numpy array of shape (len(df), D) containing embeddings for each data point
ds.add_embedding_array(X, name="array_embedding", metric="euclidean")

# "text" is a column in df containing text we want to embed
# will use a sentence_transformers model to produce embeddings
ds.add_text_column_embedding("text", embedding_name="text_embedding")

# make sure "text" is tagged as containing longform text
# (for which analysis like keyword extraction is suitable)
# autodetection works well, so this is usually not necessary
ds.metadata.data_types["text"].text_type = TextDataType.long_text

# assume ground truth sentiment classification labels in the "sentiment" column
# model predictions in the "pred_sentiment" column
# Cobalt will compute performance metrics for this model
ds.add_model(
    input_columns="text",
    target_column="sentiment",
    prediction_column="pred_sentiment",
    task="classification",
    name="sentiment_classifier",
)

Once the dataset is created, you can optionally create a DatasetSplit object as well. This defines a number of coarse divisions of your dataset that may be useful for later analysis, e.g. a train/test split. If there is a column in the data table that labels rows by their split membership, you can use DatasetSplit.from_dataset_column() to create it:

from cobalt import DatasetSplit

split = DatasetSplit.from_dataset_column(ds, "split")

Otherwise, you can pass a dictionary mapping split names to indices to the constructor:

import numpy as np

split = DatasetSplit(ds, {"train": np.arange(10000), "test": np.arange(10000, 12000)})

The dataset (and optionally, split) can then be used to create a Workspace object that will be used to build graphs and perform analyses.
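
As a minimal sketch, assuming (as in the examples above) that Workspace is exported from the top-level cobalt package and accepts the dataset plus an optional split:

from cobalt import Workspace

# build a workspace from the dataset alone...
w = Workspace(ds)

# ...or include the split, so later analyses can compare subsets
w = Workspace(ds, split)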

DataFrame Requirements

DataFrames should have a sequential integer index when used to construct a CobaltDataset. You can ensure this by calling df = df.reset_index(drop=True) before creating the CobaltDataset object; drop=True discards the old index rather than inserting it as a new column, and the reassignment is needed because reset_index() returns a new DataFrame. All column names in the table should also be strings (rather than integers or other data types). The constructor will raise an error if these conditions are not met.
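
For example, a quick way to normalize a DataFrame before constructing the dataset, using standard pandas calls:

from cobalt import CobaltDataset

# df is a DataFrame loaded elsewhere
df = df.reset_index(drop=True)             # replace the index with 0..n-1, dropping the old one
df.columns = [str(c) for c in df.columns]  # make sure every column name is a string

ds = CobaltDataset(df)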

Creating Embeddings

Embeddings are key to Cobalt's TDA analysis: they enable us to build sophisticated maps of data points based on realistic measures of similarity. The CobaltDataset class includes functionality for creating embeddings from raw data in a few ways.

For tabular data with numerical feature types, a subset of the columns (perhaps with some simple rescaling) can work well as an embedding. This can be done with CobaltDataset.add_column_embedding().
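
A sketch of what this might look like is below; the column names and the columns= and name= parameters are illustrative assumptions, not the exact signature:

# hypothetical numeric feature columns; parameter names are assumptions
ds.add_column_embedding(
    columns=["age", "income", "tenure"],
    name="tabular_embedding",
)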

For more complex tabular data, an approach based on random forests can often produce a very useful similarity metric. The random forest can be trained to predict a selected outcome column, or to distinguish between the provided data and synthetically generated data with a similar distribution. Use CobaltDataset.add_rf_embedding() to generate a random forest embedding from an existing embedding (e.g. one added with add_column_embedding()).
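
A hedged sketch, reusing the hypothetical "tabular_embedding" from the previous example; the parameter names here are assumptions rather than the exact signature:

# train a random forest to predict a (hypothetical) "outcome" column,
# then embed each data point using the trained forest
ds.add_rf_embedding(
    source_embedding="tabular_embedding",
    target_column="outcome",
    name="rf_embedding",
)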

For text data, CobaltDataset.add_text_column_embedding() creates embeddings locally with models from the sentence-transformers library.

Alternatively, you can create your own embeddings using any desired method. A NumPy array containing the embedding vectors can be added to a dataset by calling CobaltDataset.add_embedding_array(). Be sure to specify the appropriate distance metric to be used with the embedding vectors.

Available metrics include: