Building TDA Graphs
At the core of Cobalt’s capabilities is its engine for building graphs
that represent data. The core objects are CobaltGraph and
HierarchicalCobaltGraph. A CobaltGraph describes a graph whose
nodes correspond to subsets of a dataset, and a HierarchicalCobaltGraph
stacks multiple CobaltGraph objects describing the same dataset at different
levels of resolution.
Graphs are built in several steps:

1. A nearest-neighbor graph is constructed using a provided embedding and distance metric.

2. This neighbor graph is symmetrized and distances are normalized to correct for variations in density. Distances are then converted to edge weights, so that points that are closer together have stronger connections. This produces the base graph for the dataset.

3. (Optional) The data points are partitioned and edges in the base graph are pruned according to any number of filter functions. A filter function specifies a value for each data point, and data points are grouped into bins based on these values. Edges in the graph that go between bins that are too far apart are then pruned.

4. A hierarchical clustering algorithm is applied to the base graph, producing a sequence of increasingly coarse partitions of the nodes. Each of these partitions produces a CobaltGraph by converting each subset into a node, and linking together nodes if there is an edge in the base graph between them. These edges are assigned weights based on the number and weight of edges in the base graph.
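To build intuition for the first two steps, here is a minimal NumPy sketch of constructing a small k-nearest-neighbor graph, symmetrizing it, and converting distances to edge weights with an exponential affinity. This is purely illustrative; Cobalt’s actual symmetrization and density normalization are more involved.

import numpy as np

def toy_base_graph(X, k=3):
    """Illustrative only: kNN graph -> symmetrize -> exponential affinity."""
    n = len(X)
    # Pairwise Euclidean distances between embedding vectors.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)  # exclude self-edges
    # Each point's k nearest neighbors (directed edges).
    nbrs = np.argsort(d, axis=1)[:, :k]
    edges = {}
    for i in range(n):
        for j in nbrs[i]:
            # Symmetrize: keep each edge once, regardless of direction.
            edges[(min(i, j), max(i, j))] = d[i, j]
    # Convert distances to weights: closer points -> stronger connections.
    return {e: np.exp(-dist) for e, dist in edges.items()}

weights = toy_base_graph(np.random.default_rng(0).normal(size=(20, 5)))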
Basic Graph Parameters
The most important parameter to configure when building a graph is the embedding. If the embedding is poor, no additional tweaking of parameters will produce a useful graph.
For text data, off-the-shelf embedding models often work
well; you can create text embeddings with
CobaltDataset.add_text_column_embedding().
For tabular data, you might use raw or scaled columns. Or, you could use
CobaltDataset.add_rf_embedding() to train a random forest model on the
data and use the leaves from the trees as embedding data.
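For example, a sketch of adding both kinds of embeddings to a dataset. The
column name is a placeholder, and the exact method signatures may accept
additional (or require other) arguments; consult the API reference.

ds = CobaltDataset(df)

# Embed a text column with an off-the-shelf embedding model.
# The column name here is a placeholder.
ds.add_text_column_embedding("text_column")

# For tabular data: train a random forest and use the trees' leaves
# as an embedding of the rows. Arguments omitted here (assumption);
# see the API reference for required options.
ds.add_rf_embedding()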
The next most important parameter is the distance metric. For text embedding
models, this is almost always the cosine dissimilarity. For tabular data, it can
be worth experimenting with different distance metrics, or even constructing
your own by combining different metrics on different columns with
CombinedMetric.
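Conceptually, a combined metric computes a distance on each group of columns
and aggregates the results. Here is a plain NumPy sketch of the idea (this is
not the CombinedMetric API itself, just an illustration):

import numpy as np

def combined_distance(x, y, num_idx, cat_idx, w_num=1.0, w_cat=1.0):
    """Toy combined metric: Euclidean on numeric columns plus
    Hamming on categorical columns, with per-part weights."""
    d_num = np.linalg.norm(x[num_idx] - y[num_idx])
    d_cat = np.mean(x[cat_idx] != y[cat_idx])
    return w_num * d_num + w_cat * d_cat

x = np.array([0.5, 1.2, 3.0, 1.0])
y = np.array([0.4, 1.0, 3.0, 2.0])
d = combined_distance(x, y, num_idx=[0, 1], cat_idx=[2, 3])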
Other parameters that are often worth experimenting with include:
NeighborParams.M: This is the total number of neighbors to find for each data point.

NeighborParams.K: This is the number of neighbors to keep for each data point in the symmetrized graph, conditional on their being reverse neighbors. This value must be at most M. Increasing K and M together will tend to increase the connectivity of the graph.

NeighborParams.min_nbrs: This is the minimum number of neighbors to keep for each data point in the normalized graph. Increasing this value may help connect disconnected portions of the graph. It must always be less than K.
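These can be set through the neighbor_params argument to
Workspace.new_graph(), as in the advanced examples below. The specific
values and names here are only illustrative:

g = w.new_graph(
    name="graph_with_more_connectivity",
    embedding="embedding_name",
    neighbor_params={"M": 30, "K": 15, "min_nbrs": 3},
)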
Parameter Grid Search
Workspace.new_graph() includes an option to perform a grid search of some
parameters (M, K, min_nbrs, affinity, and optionally the
embedding), to optimize an objective function that serves as a proxy for graph
quality. To use it, pass grid_search=True to new_graph(). By default,
embeddings are not included in optimization, but passing
embedding_search_mode="all" will test graphs built with all embeddings in
the dataset, and using embedding_search_mode="given_plus_generated" will
additionally test scaled and transformed versions of the embeddings.
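For example (the graph name and embedding name are placeholders, and the
exact combinations of arguments accepted may differ):

g = w.new_graph(
    name="graph_from_grid_search",
    embedding="embedding_name",  # starting embedding; placeholder name
    grid_search=True,
    embedding_search_mode="all",  # also evaluate every embedding in the dataset
)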
Filter Functions
Filter functions can help emphasize the variation in a particular feature within a graph. They ensure that data points with different values of the feature are well separated in the graph. This can reveal structure that is hard to find in other ways.
To configure a filter function, you need to provide at least two pieces of
information. The first is the values of the function, FilterSpec.f_vals, as a
NumPy array with one value for each data point in the input data. The second
is how these values are binned: they will be used to split the data into
FilterSpec.n_bins disjoint bins, in one of two ways. If
FilterSpec.bin_method is “rng”, the bins will have equal width, spanning the
range from the minimum to the maximum value of f_vals. If bin_method is
“uni”, bins will be chosen so that each bin has approximately the same number
of data points.
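To make the two binning methods concrete, here is how the bin edges could be
computed in plain NumPy (illustrative only, not Cobalt’s code):

import numpy as np

f_vals = np.random.default_rng(0).exponential(size=1000)
n_bins = 5

# "rng": equal-width bins spanning [min, max] of f_vals.
rng_edges = np.linspace(f_vals.min(), f_vals.max(), n_bins + 1)

# "uni": quantile bins, so each bin holds roughly the same number of points.
uni_edges = np.quantile(f_vals, np.linspace(0, 1, n_bins + 1))

# Assign each point to a bin index (0 .. n_bins - 1) using the inner edges.
bins = np.digitize(f_vals, rng_edges[1:-1])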
Edges in the base graph will then be removed based on the generated bins and
values. The behavior is determined by FilterSpec.pruning_threshold and
FilterSpec.pruning_method. If pruning_method is “bin”,
pruning_threshold determines how close two bins must be for edges to be
allowed between them. For instance, if pruning_threshold=1, then edges
between data points that lie in the same bin or in adjacent bins will be kept,
and all other edges will be removed. If pruning_method is “pct”, pruning is
done based on a quantile threshold: an edge between two data points is kept
only if their entries in f_vals differ by less than pruning_threshold in
quantile terms. For instance, if point i is at the 10th percentile of f_vals
and j is at the 20th percentile, and pruning_threshold=0.2, then the
edge will be kept; but if j were at the 40th percentile, the edge would be
removed.
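As a plain NumPy illustration of the “pct” rule (not Cobalt’s implementation):

import numpy as np

f_vals = np.random.default_rng(0).normal(size=1000)

# Each point's quantile (0 to 1) within f_vals, computed from ranks.
quantiles = np.argsort(np.argsort(f_vals)) / (len(f_vals) - 1)

def keep_edge(i, j, pruning_threshold=0.2):
    # An edge survives only if its endpoints' quantiles differ by
    # less than the threshold.
    return abs(quantiles[i] - quantiles[j]) < pruning_threshold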
Clustering is performed on this pruned graph, and the resulting clusters are split so that each node contains data points from only one bin. These clusters are then used to construct the output graphs, where nodes are linked based on the pruned graph.
To apply filter functions to graph construction, pass a list of filter
specifications to Workspace.new_graph():
# f_arr is a NumPy array with one filter value per data point
g = w.new_graph(
    name="graph_with_filters",
    embedding="embedding_name",
    filters=[
        {"f_vals": f_arr, "bin_method": "uni"},
    ],
)
Advanced Graph Parameters
Neighbor Graph and Base Graph
NeighborParams.backend: The algorithm to use to find the nearest neighbors for each point. The default “nndescent” is an efficient approximate algorithm. In certain situations (particularly if there are many points with identical distances) it may struggle to create a good graph. The “exact” algorithm simply computes all pairwise distances and selects the nearest neighbors from these. It is considerably less efficient but will produce the best possible results.

NeighborParams.seed: The seed to use for any randomness in the “nndescent” algorithm. This has a fixed default value for reproducibility.

NeighborParams.deduplicate: If set to True, the data points will be deduplicated before creating the nearest neighbor graph. This is particularly helpful when there are large numbers of duplicate data points, as these can interfere with the “nndescent” algorithm.

NeighborParams.affinity: The function used to convert distances to weights. The default “slpi” is selected for backwards compatibility and combines the square root of a negative logarithm with an inverse function. For new work we recommend “exponential” or “expinv”, which use exp(-d) (with 1/d for the tail in the case of “expinv”).

NeighborParams.strict_partition: Used to split the data points into separate subsets before building graphs independently on each subset. This can be helpful if your data already has a known stratification and you want to enforce this on the graphs, but still want to see all the data at once.
To adjust these parameters, pass neighbor_params={...} to
Workspace.new_graph(), e.g.:
g = w.new_graph(
name="graph_with_neighbor_params",
embedding="embedding_name",
neighbor_params={
"deduplicate": True,
"affinity": "expinv",
"seed": 9214,
},
)
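For intuition, the “exponential” and “expinv” affinities correspond roughly
to the following. This is a sketch based on the descriptions above, not
Cobalt’s exact implementation, and the crossover point for the 1/d tail is a
placeholder assumption:

import numpy as np

def exponential_affinity(d):
    # Weight decays exponentially with (normalized) distance.
    return np.exp(-d)

def expinv_affinity(d, cutoff=1.0):
    # Assumed form: exp(-d) near zero, switching to a 1/d tail for large
    # distances; the two pieces are matched at the cutoff for continuity.
    d = np.asarray(d, dtype=float)
    return np.where(d < cutoff, np.exp(-d), np.exp(-cutoff) * cutoff / d)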
Clustering
ClusteringParams.allow_multiple_merges_per_node: Each step of the hierarchical clustering algorithm merges nodes together when they are joined by sufficiently strong edges. By default, each node can participate in only one such merge in a single step of the clustering algorithm. Setting this to True removes this constraint.

ClusteringParams.filter_levels_per_component: After the clustering is completed, the levels are filtered to ensure that the number of nodes decreases at a certain rate. When the overall graph has many small components, this can mean that the larger components coarsen very quickly. Setting this parameter to True addresses the problem by having the filtering method target the rate of change of the average number of nodes per component.

ClusteringParams.num_threads: The number of threads to use in the clustering algorithm.

ClusteringParams.max_height: The maximum number of steps to use in the hierarchical clustering algorithm. In most cases, the default should be sufficiently high, but if you encounter graphs where the top level is not coarse enough, increasing this parameter may help.
To adjust these parameters, pass clustering_params={...} to
Workspace.new_graph(), e.g.:
g = w.new_graph(
name="graph_with_clustering_params",
embedding="embedding_name",
clustering_params={
"allow_multiple_merges_per_node": True,
"filter_levels_per_component": True,
"num_threads": 4,
},
)