Preparing Data for Cobalt
Cobalt provides a structure to organize your data in order to streamline the analysis of models and data. The Cobalt data schema contains the following information (among other things):
Tabular data, potentially containing input features to the model or metadata for each data point.
Embeddings, which are vector representations of each data point that can be used to map out similarities and relationships between data points.
Model tasks, outputs, and ground truth results, used to evaluate model performance on different subsets. More than one model can be represented.
A split, or division of the data into coarse subsets, such as a training and validation subset.
Most of this information is optional, but omitting it will limit the functionality Cobalt can provide.
Cobalt provides a convenience function, load_tabular_dataset(), which takes a Pandas DataFrame and some additional information to prepare your data for analysis. It accepts many options, which are described on its documentation page. Typical use looks like this:
dataset, split = load_tabular_dataset(
    df,
    embeddings="rf",
    metadata_df=metadata,
    outcome_col="y",
    prediction_col="y_hat",
    task="classification",
)
This creates a CobaltDataset object and a DatasetSplit object, automatically generating useful embeddings from the data in df by training a small random forest model. The model predictions are assumed to be in either df or metadata, under the column name "y_hat", and Cobalt will assume the model is a classifier, so that values in these columns correspond to class labels.
Alternatively, the CobaltDataset and DatasetSplit objects can be constructed directly. This is a little less streamlined, but more flexible, and will be the preferred method in the future. An example of this use case for an image dataset might look like this:
import numpy as np

# df contains tabular information about the dataset
# names of columns that are not helpful to view are in columns_to_hide
ds_meta = DatasetMetadata(timestamp_columns=["timestamp"], hidable_columns=columns_to_hide)
dataset = CobaltDataset(df, ds_meta)
# img_paths is a list containing paths to each image
dataset.add_media_column(img_paths, local_root_path="imgs/")
dataset.add_model(
    target_column="y",
    prediction_column="y_hat",
    task="classification",
    name="image classifier",
)
# X contains vector embeddings of images taken from the `fc2` layer of a CNN
dataset.add_embedding_array(X, metric="cosine", name="fc2")
# the first 10000 rows form the training subset, the next 10000 the test subset
split = DatasetSplit(dataset, {"train": np.arange(10000), "test": np.arange(10000, 20000)})
It is advisable to set up your dataframe to have a sequential integer index by calling df.reset_index() before setting up its schema; this may help avoid issues with data indexing and display in the data table. It is also helpful to ensure that all column names in the table are strings (rather than integers or other data types). The load_tabular_dataset() function will convert all column names to strings for this reason.
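As a minimal sketch of this preparation step, using only pandas, the index can be reset and the column names converted to strings as shown here. The DataFrame contents are made up for illustration, and passing drop=True is an assumption about what you want: it discards the old index, whereas omitting it keeps the old index values as a new column.

```python
import pandas as pd

# Hypothetical data: a non-sequential index and an integer column name.
df = pd.DataFrame({"y": [0, 1, 1], 0: [0.5, 1.5, 2.5]}, index=[10, 42, 7])

# reset_index returns a new DataFrame rather than modifying df in place.
# drop=True discards the old index instead of inserting it as a column.
df = df.reset_index(drop=True)

# Convert every column name to a string.
df.columns = df.columns.astype(str)
```

Note that reset_index() does not change df unless the result is assigned back, which is easy to overlook.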
Creating Embeddings
Cobalt includes basic functionality for creating embeddings for tabular data. This works by training a random forest on the dataset and leveraging the induced similarity metric. The forest can be trained to predict a selected outcome column, or to distinguish between the provided data and synthetically generated data with a similar distribution. The current implementation offers little customization, but the default behavior tends to work well. The load_tabular_dataset() function will automatically create these embeddings if you pass embeddings="rf", or you can use the get_tabular_embeddings() function with method="rf" if you want more control over exactly which columns are used to train the model.
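To make the idea of a forest-induced similarity concrete, here is a rough sketch using scikit-learn. It is not Cobalt's implementation, whose details are not described here. It illustrates one common construction: a forest is trained to separate the real rows from synthetic rows, and two real rows count as similar when they land in the same leaf in many trees. The data, the column-shuffling scheme for the synthetic rows, and all variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))  # stand-in for the real feature matrix

# Synthetic rows: shuffle each column independently. This keeps each
# feature's marginal distribution but breaks dependence between features.
X_synth = np.column_stack([rng.permutation(X[:, j]) for j in range(X.shape[1])])

# Label real rows 1 and synthetic rows 0, then train the forest.
X_all = np.vstack([X, X_synth])
y_all = np.concatenate([np.ones(len(X)), np.zeros(len(X_synth))])
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_all, y_all)

# leaves[i, t] is the index of the leaf that real row i reaches in tree t.
leaves = forest.apply(X)

# Dissimilarity between two rows: the fraction of trees in which they
# fall into different leaves (0 means identical leaf assignments).
dissimilarity = (leaves[:, None, :] != leaves[None, :, :]).mean(axis=2)
```

In the supervised variant, the forest would instead be fit on the real rows against the chosen outcome column, and the leaf-sharing comparison would stay the same.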
Alternatively, you can create your own embeddings using any desired method. A NumPy array containing the embedding vectors can be passed directly as the embeddings parameter to load_tabular_dataset(). Embeddings can also be added to a dataset by calling CobaltDataset.add_embedding_array(). Be sure to specify the appropriate distance metric to be used with the embedding vectors.
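As one hedged illustration of producing such an array, the sketch below builds a low-dimensional embedding from numeric features by projecting onto the top principal directions with NumPy. The feature matrix, the choice of 8 dimensions, and the name "pca8" are all placeholders; any method that yields one vector per data point would serve equally well. The final call is left commented out and mirrors the add_embedding_array() usage shown earlier.

```python
import numpy as np

rng = np.random.default_rng(0)
features = rng.normal(size=(100, 20))  # placeholder: one row per data point

# Center the features, then take the singular value decomposition.
centered = features - features.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)

# Rows of vt are principal directions; keep the first 8 and project.
embedding = centered @ vt[:8].T  # shape (100, 8)

# Assuming `dataset` is an existing CobaltDataset with 100 rows:
# dataset.add_embedding_array(embedding, metric="euclidean", name="pca8")
```

Euclidean distance is a natural pairing for a linear projection like this one, whereas embeddings taken from neural network layers are often compared with cosine dissimilarity.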
Available metrics include:
"euclidean": The standard Euclidean distance between vectors.
"manhattan": The L1 or taxicab distance (sum of absolute difference of vector coordinates).
"chebyshev": The L-infinity distance (largest difference between vector coordinates).
"cosine": The cosine dissimilarity, or 1 minus the dot product of normalized vectors.
"hamming": The Hamming distance, or number of coordinates where the two vectors are different.
"correlation": The Pearson correlation coefficient between two vectors.
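For reference, the first five definitions above can be written out directly in plain Python. This is only a sketch of the formulas as described in the list, with hypothetical function names. It says nothing about how Cobalt computes them internally, and the correlation metric is left out.

```python
import math

def euclidean(u, v):
    # Square root of the sum of squared coordinate differences.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def manhattan(u, v):
    # Sum of absolute coordinate differences (L1).
    return sum(abs(a - b) for a, b in zip(u, v))

def chebyshev(u, v):
    # Largest absolute coordinate difference (L-infinity).
    return max(abs(a - b) for a, b in zip(u, v))

def cosine(u, v):
    # 1 minus the dot product of the two vectors after normalization.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def hamming(u, v):
    # Number of coordinates at which the vectors differ.
    return sum(a != b for a, b in zip(u, v))

u = [1.0, 2.0, 3.0]
v = [1.0, 0.0, 7.0]
print(manhattan(u, v), chebyshev(u, v), hamming(u, v))  # 6.0 4.0 2
```

The example vectors differ in two coordinates, by 2 and by 4, which is why the Manhattan distance is 6, the Chebyshev distance is 4, and the Hamming distance is 2.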