The Cobalt Workspace
The Workspace object is the home for any analysis done on your data and models
in Cobalt. It provides methods to create and process TDA graphs, analyze
features in your data, and extract groups of interest.
TDA Graphs
Cobalt automatically creates TDA graphs based on a dataset when the UI is displayed. However, there are a number of options that can be customized in the graph creation process, so the Workspace provides methods to create new graphs with specific parameters.
The Workspace.new_graph() method creates a new graph and saves it in the
Workspace for later retrieval and exploration in the UI. Some important
parameters to configure are the subset of data on which to build the graph (by
default, the entire dataset), the embedding to use to build the graph, and the
distance metric to use for the embeddings. More advanced parameters can be
passed as keyword arguments; these will be used to construct a GraphSpec object.
On a reasonably powerful personal machine, creating graphs for datasets of up to a few hundred thousand data points should not be a heavy lift. Except for very high-dimensional data, most such graphs should be ready within a few minutes, and often much faster. Graphs on larger datasets take longer to build; for millions of data points, it may take up to an hour. If you want to build graphs from large datasets, it is worth experimenting first with smaller samples of the data to test the embedding and preprocessing steps before building a graph on the full dataset.
If no graph has been created when the UI is displayed, Cobalt will automatically
create a graph on the full dataset. To avoid this, you can set
Workspace.auto_graph to False.
Graphs are stored as MultiResolutionGraph objects, which store a collection of
levels representing the data at different resolution scales. Each level is a
DisjointPartitionGraph object, which has a collection of nodes and edges, where
each node of the graph corresponds with a set of data points.
The set of data point indices for each node is stored in
DisjointPartitionGraph.nodes, and the node index for each data point is stored
in DisjointPartitionGraph.node_membership. Note that the data point ids refer to
indices into the source dataset for the graph (which may not be the full
dataset). It is important to track this carefully when finding the data points
for a given node. The Cobalt UI does this automatically.
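The relationship between the two attributes, and the index translation the note above warns about, can be sketched with plain Python lists. This is a toy illustration only: the names source_indices, membership_from_nodes, and full_dataset_indices are hypothetical and not part of Cobalt, and the real attributes may use different container types.

```python
# Hypothetical toy data: a graph built on a 6-point subset of a larger dataset.
# source_indices[i] is the position in the full dataset of subset point i.
source_indices = [10, 11, 14, 20, 21, 25]

# Shaped like the description of DisjointPartitionGraph.nodes:
# one list of subset-relative data point indices per node.
nodes = [[0, 1], [2, 5], [3, 4]]


def membership_from_nodes(nodes, n_points):
    """Invert the node lists: entry i is the node that contains point i."""
    membership = [-1] * n_points
    for node_id, members in enumerate(nodes):
        for point in members:
            membership[point] = node_id
    return membership


def full_dataset_indices(node_id, nodes, source_indices):
    """Translate a node's subset-relative indices into full-dataset indices."""
    return [source_indices[i] for i in nodes[node_id]]


node_membership = membership_from_nodes(nodes, len(source_indices))
print(node_membership)                                 # [0, 0, 1, 2, 2, 1]
print(full_dataset_indices(1, nodes, source_indices))  # [14, 25]
```

Node 1 holds subset points 2 and 5, which are rows 14 and 25 of the full dataset; using the subset-relative indices directly against the full dataset would select the wrong rows.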
Edges are available either as a list of tuples (i, j) of node indices
(DisjointPartitionGraph.edge_list), or as a numpy array of shape (n_edges, 2)
(DisjointPartitionGraph.edge_mtx). Each edge has an associated weight, and edges
are sorted in order of decreasing weight. These weights are available as a numpy
array in DisjointPartitionGraph.edge_weights. When the number of edges shown in
the graph viewer is adjusted, this is done by removing lower weight edges until
the average degree in the graph is the specified value.
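Because edges are sorted by decreasing weight, trimming to a target average degree amounts to keeping a prefix of the edge array. The following is a minimal sketch of that idea on hypothetical toy arrays. The helper trim_to_average_degree is an illustration, not a Cobalt function, and the viewer's exact rounding behavior is an assumption.

```python
import numpy as np

# Hypothetical toy graph with 4 nodes; edges are already sorted by decreasing
# weight, mirroring the layout described for edge_mtx and edge_weights.
n_nodes = 4
edge_mtx = np.array([[0, 1], [1, 2], [2, 3], [0, 2], [1, 3]])
edge_weights = np.array([0.9, 0.8, 0.6, 0.3, 0.1])


def trim_to_average_degree(edge_mtx, edge_weights, n_nodes, target_degree):
    """Keep the heaviest edges so the average degree is at most target_degree."""
    # Each edge adds 2 to the total degree: avg_degree = 2 * n_edges / n_nodes.
    n_keep = min(len(edge_mtx), int(target_degree * n_nodes / 2))
    return edge_mtx[:n_keep], edge_weights[:n_keep]


kept_edges, kept_weights = trim_to_average_degree(
    edge_mtx, edge_weights, n_nodes, target_degree=1.5
)
print(kept_edges.tolist())  # [[0, 1], [1, 2], [2, 3]]
```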
The collection of all graphs created in a Workspace is available in
Workspace.graphs, which is a dictionary mapping graph names to
MultiResolutionGraph objects.
Saving and Retrieving Groups
Groups of data points can be saved via manual interaction in the UI. These
groups can also be created and retrieved via methods on the Workspace object.
Workspace.get_groups() will get a dictionary mapping group names to
CobaltDataSubset objects. Workspace.add_group() will add a group to the saved
groups, updating the list in the UI and allowing it to be selected for
interactive exploration.
The saved groups can also be exported to a Pandas DataFrame with
Workspace.export_groups_as_dataframe(). The resulting dataframe has one column
for each group, with the entries of that column being a boolean mask indicating
membership of each data point in the group. A DataFrame with this format can
also be imported to the Workspace by calling
Workspace.import_groups_from_dataframe().
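To make the DataFrame layout concrete, here is a small sketch that builds a frame of the described shape by hand with pandas (one boolean column per group, one row per data point) and reads the members of a group back out. The group names and indices are made up for illustration, and the sketch does not call Cobalt itself.

```python
import pandas as pd

# Hypothetical group definitions: group name -> indices of member data points.
n_points = 6
groups = {"blurry_images": {0, 2, 3}, "night_scenes": {3, 5}}

# One boolean column per group, one row per data point.
masks = pd.DataFrame(
    {name: [i in members for i in range(n_points)] for name, members in groups.items()}
)

# Recover the member indices of a group from its mask column.
recovered = masks.index[masks["night_scenes"].to_numpy()].tolist()
print(recovered)  # [3, 5]
```

A data point can belong to several groups at once (row 3 here), which is why the format uses one mask per group rather than a single label column.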
Group Algorithms
A number of the algorithms implemented in Cobalt produce collections of groups
of interest, based on TDA graphs and various other pieces of information. All
of these group algorithms return GroupResultsCollection objects, which hold the
returned groups together with some helpful metadata. The groups are also stored
in the Workspace object for later retrieval (and so that they can be displayed
in the UI). When shown as the output of a Jupyter cell, these objects display a
table summarizing the groups they contain, but much more information is
available internally.
Each algorithm accepts a run_name parameter (i.e. a name for the results of
this run of the algorithm), which is used as an identifier for the resulting
group collection when stored in the Workspace. The results of an algorithm run
can be replaced by running the algorithm again with the same run_name.
Failure Groups
The Workspace.find_failure_groups() method is used to understand the types of
data on which a model struggles to perform well. Conceptually, it looks for
regions of connected nodes in a graph where the model performs poorly according
to some performance metric. Different types of models have different
performance metric options. For classification models, the default performance
metric is the error rate, that is, the fraction of incorrect model predictions
in a group of data points.
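The conceptual description above can be illustrated with a deliberately simplified sketch: compute an error rate per node, keep the nodes at or above a threshold, and take connected components among them. This is not Cobalt's actual algorithm, which the text does not specify in detail. The data and the helper names error_rate and failure_regions are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy inputs: 5 nodes, each a list of data point indices,
# a per-point flag marking incorrect predictions, and an undirected edge list.
nodes = [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]
is_error = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
edge_list = [(0, 1), (1, 2), (2, 3), (3, 4)]


def error_rate(points, is_error):
    """Fraction of the given data points that the model got wrong."""
    return sum(is_error[p] for p in points) / len(points)


def failure_regions(nodes, edge_list, is_error, threshold):
    """Connected components of the nodes whose error rate is >= threshold."""
    bad = {i for i, pts in enumerate(nodes) if error_rate(pts, is_error) >= threshold}
    neighbors = defaultdict(set)
    for i, j in edge_list:
        if i in bad and j in bad:
            neighbors[i].add(j)
            neighbors[j].add(i)
    regions, seen = [], set()
    for start in sorted(bad):
        if start in seen:
            continue
        # Depth-first search over the high-error nodes reachable from start.
        stack, component = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            component.append(node)
            for nxt in neighbors[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        regions.append(sorted(component))
    return regions


print(failure_regions(nodes, edge_list, is_error, threshold=0.5))  # [[0, 1], [4]]
```

Nodes 0 and 1 are adjacent and both have high error rates, so they form one region; node 4 is also poor but is separated from them by well-performing nodes, so it forms its own region.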
A custom model performance metric can be set up by calling
Workspace.add_evaluation_metric_values(). This takes a name for the metric, an
array of values (one for each data point), an index of the model to which it
will apply, and a flag that indicates whether higher or lower values of this
metric are better. Then the name of this metric can be passed to
Workspace.find_failure_groups() under the failure_metric parameter.
The analysis can be performed on only a specified subset of the dataset, which
might help focus the analysis on only a test set, or give some quick initial
results by running the algorithm on a subsample of the full dataset. This is
done by passing a CobaltDataSubset object as the subset parameter. To run the
analysis on only the test dataset, for instance, one would run
workspace.find_failure_groups(subset=split["test"])
or to run the analysis on a random subsample of 5000 points, one would run
workspace.find_failure_groups(subset=dataset.sample(5000))
Some additional algorithm configuration can be done using the config parameter.
These settings affect lower-level aspects of the algorithm. Some of the
parameters that can be passed are:
"graph": the graph that will be used for the analysis. By default one is created based on the specified embedding, but a precomputed graph can be used. This must be a DisjointPartitionGraph object, i.e., a specific level of a multiresolution graph. It is generally good to also specify "n_edges" when providing a graph, as otherwise all edges, regardless of edge strength, will be used, typically leading to overly large groups.
"threshold": the minimum value of the model evaluation metric that must be attained in order to include a node of the graph in a group. Note that if the evaluation metric is one where higher values of the metric are better, this must be provided as a negative value.
"min_mean_points_per_node": if the graph is automatically created, the algorithm will select a coarseness level by targeting an average number of data points per node of the graph. By default, this is 5 data points per node, but it can be adjusted by setting this parameter.
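The sign convention for "threshold" is easy to get wrong, so here is one way to reason about it. The sketch assumes that a higher-is-better metric is negated internally, so that larger scores always mean worse performance; that is an interpretation of the note above, not a documented detail. The helper names are hypothetical.

```python
def to_failure_score(values, lower_values_are_better):
    """Orient a metric so that larger scores always mean worse performance."""
    # Assumption: a higher-is-better metric (e.g. accuracy) is simply negated.
    return [v if lower_values_are_better else -v for v in values]


def nodes_over_threshold(node_scores, threshold):
    """Indices of nodes whose oriented score reaches the threshold."""
    return [i for i, s in enumerate(node_scores) if s >= threshold]


# Per-node accuracy (higher is better), so the cutoff "accuracy <= 0.7"
# is expressed as the negative threshold -0.7.
node_accuracy = [0.95, 0.60, 0.70, 0.88]
scores = to_failure_score(node_accuracy, lower_values_are_better=False)
print(nodes_over_threshold(scores, threshold=-0.7))  # [1, 2]
```

Under this reading, a threshold of -0.7 on an accuracy-like metric selects nodes whose accuracy is 0.7 or lower.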
In some situations, the default parameter configuration may result in groups
that are hard to understand (e.g., too large, too small, too spread out in the
graph). Adjusting parameters can be helpful in this case. In addition to the
configuration parameters mentioned above, the min_size and min_failures
parameters can also be helpful in making the results more useful.
The collection of all failure group algorithm runs can be accessed via
Workspace.failure_groups.
Drifted Groups
Workspace.find_drifted_groups() is similar to Workspace.find_failure_groups().
However, instead of helping to understand the variation in a model performance
metric, it helps to understand the relative distribution of two groups of data.
A typical use case would be comparing the data used to train the model with new
data received while the model is in production.
The key parameters to this method are reference_group and comparison_group,
which specify the two groups that should be compared. The method will return a
collection of groups of similar data points where comparison_group is
overrepresented. Each such group would typically correspond with a type of data
point which has little or no representation in reference_group. These
parameters can be provided either as CobaltDataSubset objects, or as names of
saved groups or dataset splits. A typical usage would be
workspace.find_drifted_groups(reference_group="train", comparison_group="test")
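What "overrepresented" means can be illustrated with a simplified, node-level sketch: compare the share of comparison-group points in each node with their share overall. This is a toy illustration under that assumption and not Cobalt's actual drift algorithm. The data and the helper comparison_share_by_node are hypothetical.

```python
# Hypothetical toy data: node membership for 10 points and a flag saying
# whether each point belongs to the comparison group (True) or the
# reference group (False).
node_membership = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
in_comparison = [False, False, False, True, False, True, False, True, True, True]


def comparison_share_by_node(node_membership, in_comparison):
    """Fraction of each node's points that come from the comparison group."""
    totals, hits = {}, {}
    for node, flag in zip(node_membership, in_comparison):
        totals[node] = totals.get(node, 0) + 1
        hits[node] = hits.get(node, 0) + int(flag)
    return {node: hits[node] / totals[node] for node in totals}


shares = comparison_share_by_node(node_membership, in_comparison)
overall = sum(in_comparison) / len(in_comparison)

# Nodes where the comparison group exceeds its overall share of the data.
drifted = [node for node, share in shares.items() if share > overall]
print(overall, drifted)  # 0.5 [2]
```

Node 2 consists entirely of comparison-group points, so it represents a kind of data that the reference group does not contain.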
The collection of all drifted group algorithm runs can be accessed via
Workspace.drifted_groups.
Clustering
To obtain a set of natural clusters from the data in a graph, use
Workspace.find_clusters(). This will search through the nodes at varying levels
of coarseness of a multiresolution graph to find a partition of the data into
clusters whose quality is as high as possible given some constraints.
The subset parameter allows customization of the data to be clustered. As
usual, this can be specified as a CobaltDataSubset object or as the name of a
saved group or dataset split. It is also often helpful to set the min_n_groups
and max_n_groups parameters to guide the clustering algorithm toward the
desired level of coarseness.
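The effect of min_n_groups and max_n_groups can be pictured as a constrained selection: among candidate partitions, pick the best-quality one whose number of clusters falls in the allowed range. The sketch below is a heavy simplification, since it treats each resolution level as a single candidate, whereas the method is described as searching through nodes across levels. The candidate data, the quality scores, and the helper pick_partition are all hypothetical.

```python
# Hypothetical candidates: one partition per resolution level, each with
# a number of clusters and a quality score (higher is better).
candidates = [
    {"level": 0, "n_clusters": 40, "quality": 0.61},
    {"level": 1, "n_clusters": 12, "quality": 0.58},
    {"level": 2, "n_clusters": 6, "quality": 0.52},
    {"level": 3, "n_clusters": 2, "quality": 0.30},
]


def pick_partition(candidates, min_n_groups, max_n_groups):
    """Best-quality candidate whose cluster count lies in the allowed range."""
    allowed = [
        c for c in candidates if min_n_groups <= c["n_clusters"] <= max_n_groups
    ]
    if not allowed:
        return None
    return max(allowed, key=lambda c: c["quality"])


best = pick_partition(candidates, min_n_groups=3, max_n_groups=15)
print(best["level"])  # 1
```

Without the bounds, the finest level would win on quality alone; the bounds steer the choice toward a coarser, more interpretable partition.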
The collection of all clustering algorithm runs can be accessed via
Workspace.clustering_results.