The Cobalt Workspace

The Workspace object is the home for any analysis done on your data and models in Cobalt. It provides methods to create and process TDA graphs, analyze features in your data, and extract groups of interest.

TDA Graphs

Cobalt automatically creates TDA graphs based on a dataset when the UI is displayed. However, there are a number of options that can be customized in the graph creation process, so the Workspace provides methods to create new graphs with specific parameters.

The Workspace.new_graph() method creates a new graph and saves it in the Workspace for later retrieval and exploration in the UI. Some important parameters to configure are the subset of data on which to build the graph (by default, the entire dataset), the embedding to use to build the graph, and the distance metric to use for the embeddings. More advanced parameters can be passed as keyword arguments; these will be used to construct a GraphSpec object.

On a reasonably powerful personal machine, creating graphs for datasets of up to a few hundred thousand data points is usually quick. Except for very high-dimensional data, most such graphs should be ready within a few minutes, and often much faster. Larger datasets take longer; for millions of data points, building a graph may take up to an hour. If you want to build graphs from large datasets, it is worth experimenting first with smaller samples of data to test the embedding and preprocessing steps before building a graph on the full dataset.

If no graph has been created when the UI is displayed, Cobalt will automatically create a graph on the full dataset. To avoid this, you can set Workspace.auto_graph to False.

Graphs are stored as MultiResolutionGraph objects, which hold a collection of levels representing the data at different resolution scales. Each level is a DisjointPartitionGraph object, which has a collection of nodes and edges, where each node of the graph corresponds to a set of data points.

The set of data point indices for each node is stored in DisjointPartitionGraph.nodes, and the node index for each data point is stored in DisjointPartitionGraph.node_membership. Note that these data point indices refer to positions in the source dataset for the graph, which may be a subset of the full dataset rather than the full dataset itself. It is important to track this carefully when finding the data points for a given node. The Cobalt UI does this automatically.
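
The index bookkeeping described above can be sketched in plain Python. The lists below are made-up stand-ins for the contents of DisjointPartitionGraph.nodes and DisjointPartitionGraph.node_membership, and source_indices is a hypothetical name for the positions of the graph's source subset within the full dataset. Nothing here calls the Cobalt API; it only illustrates the translation step.

```python
# Hypothetical example: a graph built on a 6-point subset of a larger dataset.
# source_indices[k] is the position in the full dataset of subset point k.
source_indices = [10, 11, 42, 57, 58, 90]

# Stand-in for DisjointPartitionGraph.nodes: one list of subset-local
# data point indices per node.
nodes = [[0, 1], [2], [3, 4, 5]]

# Stand-in for DisjointPartitionGraph.node_membership: the node index of
# each subset-local data point, derived here from `nodes`.
node_membership = [None] * len(source_indices)
for node_id, members in enumerate(nodes):
    for point in members:
        node_membership[point] = node_id


def full_dataset_indices(node_id):
    """Translate a node's subset-local indices into full-dataset indices."""
    return [source_indices[point] for point in nodes[node_id]]


print(node_membership)          # [0, 0, 1, 2, 2, 2]
print(full_dataset_indices(2))  # [57, 58, 90]
```

Skipping the translation step would silently return the wrong rows whenever the graph was built on a subset.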

Edges are available either as a list of tuples (i, j) of node indices (DisjointPartitionGraph.edge_list), or as a numpy array of shape (n_edges, 2) (DisjointPartitionGraph.edge_mtx). Each edge has an associated weight, and edges are sorted in order of decreasing weight. These weights are available as a numpy array in DisjointPartitionGraph.edge_weights. When the number of edges shown in the graph viewer is adjusted, this is done by removing lower weight edges until the average degree in the graph is the specified value.
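
As a rough illustration of the edge pruning described above, the sketch below keeps the strongest edges until a target average degree is reached. It assumes the edge list is already sorted by decreasing weight and uses the identity average degree = 2 * n_edges / n_nodes. The function name and the rounding rule are assumptions, not Cobalt's implementation.

```python
def prune_edges(edge_list, n_nodes, target_avg_degree):
    """Keep the strongest edges so the average degree is at most the target.

    Assumes edge_list is sorted by decreasing weight. Each undirected edge
    adds 2 to the total degree, so average degree = 2 * n_edges / n_nodes.
    """
    n_keep = int(target_avg_degree * n_nodes / 2)
    return edge_list[:n_keep]


# Made-up graph with 4 nodes; edges and weights sorted by decreasing weight.
edge_list = [(0, 1), (1, 2), (2, 3), (0, 2), (1, 3), (0, 3)]
edge_weights = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]

kept = prune_edges(edge_list, n_nodes=4, target_avg_degree=2)
print(kept)               # [(0, 1), (1, 2), (2, 3), (0, 2)]
print(2 * len(kept) / 4)  # 2.0
```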

The collection of all graphs created in a Workspace is available in Workspace.graphs, which is a dictionary mapping graph names to MultiResolutionGraphs.

Saving and Retrieving Groups

Groups of data points can be saved via manual interaction in the UI. These groups can also be created and retrieved via methods on the Workspace object. Workspace.get_groups() will get a dictionary mapping group names to CobaltDataSubset objects. Workspace.add_group() will add a group to the saved groups, updating the list in the UI and allowing it to be selected for interactive exploration.

The saved groups can also be exported to a Pandas DataFrame with Workspace.export_groups_as_dataframe(). The resulting dataframe has one column for each group, with the entries of that column being a boolean mask indicating membership of each data point in the group. A DataFrame with this format can also be imported to the Workspace by calling Workspace.import_groups_from_dataframe().
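
The boolean-mask layout described above can be illustrated without any libraries. In this sketch the group names and memberships are invented, and a plain dictionary of lists stands in for the DataFrame columns. Whether Cobalt expects any particular index or dtype beyond this layout is not covered here.

```python
n_points = 6

# Hypothetical saved groups, given as lists of data point indices.
groups = {
    "high_error": [0, 3, 4],
    "night_images": [3, 5],
}


def membership_mask(members, n_points):
    """Return a list where entry i is True when data point i is in the group."""
    member_set = set(members)
    return [i in member_set for i in range(n_points)]


def indices_from_mask(mask):
    """Recover the data point indices from a boolean mask."""
    return [i for i, flag in enumerate(mask) if flag]


# One column per group, one entry per data point.
mask_columns = {name: membership_mask(m, n_points) for name, m in groups.items()}

print(mask_columns["high_error"])
# [True, False, False, True, True, False]
print(indices_from_mask(mask_columns["night_images"]))
# [3, 5]
```

A dictionary of equal-length lists like mask_columns can be passed to the pandas DataFrame constructor to obtain a frame with one boolean column per group.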

Group Algorithms

A number of the algorithms implemented in Cobalt produce collections of groups of interest, based on TDA graphs and various other pieces of information. All of these group algorithms return GroupResultsCollection objects, which hold the returned groups together with some helpful metadata. The groups are also stored in the Workspace object for later retrieval (and so that they can be displayed in the UI). When shown as the output of a Jupyter cell, these objects display a table summarizing the groups they contain, but much more information is available internally.

Each algorithm accepts a run_name parameter (i.e. a name for the results of this run of the algorithm), which is used as an identifier for the resulting group collection when stored in the Workspace. The results of an algorithm run can be replaced by running the algorithm again with the same run_name.

Failure Groups

The Workspace.find_failure_groups() method is used to understand the types of data on which a model struggles to perform well. Conceptually, it looks for regions of connected nodes in a graph where the model performs poorly according to some performance metric. Different types of models have different performance metric options. For classification models, the default performance metric is the error rate, that is, the proportion of incorrect model predictions in a group of data points.
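
The idea of "regions of connected nodes where the model performs poorly" can be sketched as follows. This is a conceptual illustration only, not Cobalt's algorithm: it computes an error rate per node, keeps nodes at or above a threshold, and groups the kept nodes into connected components. All names and data are made up.

```python
from collections import deque


def error_rate(points, is_error):
    """Proportion of incorrect predictions among the given data points."""
    return sum(is_error[p] for p in points) / len(points)


def failure_regions(nodes, edge_list, is_error, threshold):
    """Group adjacent nodes whose error rate is at least `threshold`."""
    bad = {i for i, pts in enumerate(nodes)
           if error_rate(pts, is_error) >= threshold}
    neighbors = {i: set() for i in bad}
    for a, b in edge_list:
        if a in bad and b in bad:
            neighbors[a].add(b)
            neighbors[b].add(a)

    regions, seen = [], set()
    for start in sorted(bad):
        if start in seen:
            continue
        seen.add(start)
        queue, region = deque([start]), []
        while queue:  # breadth-first search over high-error nodes
            node = queue.popleft()
            region.append(node)
            for nxt in sorted(neighbors[node]):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        regions.append(sorted(region))
    return regions


# Made-up data: 5 nodes over 10 data points, with per-point error flags.
nodes = [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]
edge_list = [(0, 1), (1, 2), (2, 3), (3, 4)]
is_error = [True, True, True, False, False, False, False, False, True, True]

print(failure_regions(nodes, edge_list, is_error, threshold=0.5))
# [[0, 1], [4]]
```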

A custom model performance metric can be set up by calling Workspace.add_evaluation_metric_values(). This takes a name for the metric, an array of values (one for each data point), an index of the model to which it will apply, and a flag that indicates whether higher or lower values of this metric are better. Then the name of this metric can be passed to Workspace.find_failure_groups() under the failure_metric parameter.
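
As a small illustration of what "an array of values (one for each data point)" might look like, the sketch below computes a per-point absolute error for a regression model. The data and variable names are invented, and the call to Workspace.add_evaluation_metric_values() itself is deliberately omitted because its exact argument names are not given here.

```python
# Made-up regression targets and predictions, one entry per data point.
y_true = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.5, 5.0, 4.0, 6.0]

# Per-point absolute error: exactly one value for each data point.
abs_error = [abs(t - p) for t, p in zip(y_true, y_pred)]

# For an error metric, lower values are better. This is the information
# that the flag described above conveys (hypothetical variable name).
lower_values_are_better = True

print(abs_error)  # [0.5, 0.0, 1.5, 1.0]
```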

The analysis can be performed on only a specified subset of the dataset, which might help focus the analysis on only a test set, or give some quick initial results by running the algorithm on a subsample of the full dataset. This is done by passing a CobaltDataSubset object as the subset parameter. To run the analysis on only the test dataset, for instance, one would run

workspace.find_failure_groups(subset=split["test"])

or to run the analysis on a random subsample of 5000 points, one would run

workspace.find_failure_groups(subset=dataset.sample(5000))

Additional algorithm configuration can be passed through the config parameter. These options affect lower-level aspects of the algorithm. Some of the options that can be passed are:

  • "graph": the graph that will be used for the analysis. By default one is created based on the specified embedding, but a precomputed graph can be used. This must be a DisjointPartitionGraph object, i.e., a specific level of a multiresolution graph. It is generally good to also specify "n_edges" when providing a graph, as otherwise all edges, regardless of edge strength, will be used, typically leading to overly large groups.

  • "threshold": the minimum value of the model evaluation metric that must be attained in order to include a node of the graph in a group. Note that if the evaluation metric is one where higher values of the metric are better, this must be provided as a negative value.

  • "min_mean_points_per_node": If the graph is automatically created, the algorithm will select a coarseness level by targeting an average number of data points per node of the graph. The default is 5 data points per node; set this parameter to change it.
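
One plausible reading of the "min_mean_points_per_node" option is sketched below: given the number of nodes at each level of a multiresolution graph, pick the finest level whose average number of data points per node meets the target. The function name, the finest-to-coarsest ordering of levels, and the selection rule are all assumptions made for illustration; Cobalt's actual rule may differ.

```python
def pick_level(node_counts, n_points, min_mean_points_per_node=5):
    """Return the index of the finest level that meets the target.

    node_counts[i] is the number of nodes at level i, ordered from finest
    (most nodes) to coarsest (fewest nodes). This ordering is an assumption.
    """
    for level, n_nodes in enumerate(node_counts):
        if n_points / n_nodes >= min_mean_points_per_node:
            return level
    return len(node_counts) - 1  # fall back to the coarsest level


# Made-up node counts for a 5-level graph built on 1000 data points.
node_counts = [800, 400, 150, 60, 20]

print(pick_level(node_counts, n_points=1000))  # 2
print(pick_level(node_counts, n_points=1000, min_mean_points_per_node=20))  # 4
```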

In some situations, the default parameter configuration may result in groups that are hard to understand (e.g., too large, too small, too spread out in the graph). Adjusting parameters can be helpful in this case. In addition to the configuration parameters mentioned above, the min_size and min_failures parameters can also be helpful in making the results more useful.

The collection of all failure group algorithm runs can be accessed via Workspace.failure_groups.

Drifted Groups

Workspace.find_drifted_groups() is similar to Workspace.find_failure_groups(). However, instead of helping to understand the variation in a model performance metric, it helps to understand the relative distribution of two groups of data. A typical use case would be comparing the data used to train the model with new data received while the model is in production.

The key parameters of this method are reference_group and comparison_group, which specify the two groups that should be compared. The method will return a collection of groups of similar data points where comparison_group is overrepresented. Each such group would typically correspond to a type of data point which has little or no representation in reference_group. These parameters can be provided either as CobaltDataSubset objects, or as names of saved groups or dataset splits. A typical usage would be

workspace.find_drifted_groups(reference_group="train", comparison_group="test")
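
To make "overrepresented" concrete, the sketch below flags graph nodes where the share of comparison-group points is well above the share in the data as a whole. It is a conceptual illustration with invented data and a hypothetical min_ratio parameter, not the method Cobalt uses.

```python
def overrepresented_nodes(nodes, in_comparison, min_ratio=1.5):
    """Return (node_id, share) for nodes rich in comparison-group points.

    A node is flagged when its share of comparison points is at least
    min_ratio times the overall share across all data points.
    """
    overall_share = sum(in_comparison) / len(in_comparison)
    flagged = []
    for node_id, points in enumerate(nodes):
        share = sum(in_comparison[p] for p in points) / len(points)
        if share >= min_ratio * overall_share:
            flagged.append((node_id, share))
    return flagged


# Made-up graph over 10 points; True marks points from the comparison group.
nodes = [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
in_comparison = [False, False, False, True, False,
                 True, False, True, True, True]

print(overrepresented_nodes(nodes, in_comparison))
# [(2, 1.0)]
```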

The collection of all drifted group algorithm runs can be accessed via Workspace.drifted_groups.

Clustering

To obtain a set of natural clusters from the data in a graph, use Workspace.find_clusters(). This will search through the nodes at varying levels of coarseness of a multiresolution graph to find a partition of the data into clusters whose quality is as high as possible given some constraints.

The subset parameter allows customization of the data to be clustered. As usual, this can be specified as a CobaltDataSubset object or as the name of a saved group or dataset split. It is also often helpful to set the min_n_groups and max_n_groups parameters to guide the clustering algorithm toward the desired level of coarseness.

The collection of all clustering algorithm runs can be accessed via Workspace.clustering_results.