Building TDA Graphs
At the core of Cobalt’s capabilities is its engine for building graphs
that represent data. The core objects are CobaltGraph and
HierarchicalCobaltGraph. A CobaltGraph describes a graph whose
nodes correspond to subsets of a dataset, and a HierarchicalCobaltGraph
stacks multiple CobaltGraph objects describing the same dataset at different
levels of resolution.
Graphs are built in several steps:

1. A nearest-neighbor graph is constructed using a provided embedding and distance metric.

2. This neighbor graph is symmetrized and distances are normalized to correct for variations in density. Distances are then converted to edge weights, so that points that are closer together have stronger connections. This produces the base graph for the dataset.

3. (Optional) The data points are partitioned and edges in the base graph are pruned according to any number of filter functions. A filter function specifies a value for each data point, and data points are grouped into bins based on these values. Edges in the graph that go between bins that are too far apart are then pruned.

4. A hierarchical clustering algorithm is applied to the base graph, producing a sequence of increasingly coarse partitions of the nodes. Each of these partitions produces a CobaltGraph by converting each subset into a node, and linking together nodes if there is an edge in the base graph between them. These edges are assigned weights based on the number and weight of edges in the base graph.
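To build intuition for the first two steps, here is a minimal NumPy sketch of constructing a small k-nearest-neighbor graph, symmetrizing it, and converting distances to edge weights with an exponential affinity. This is purely illustrative; Cobalt’s actual symmetrization and density normalization are more involved.

import numpy as np

def toy_base_graph(X, k=3):
    """Illustrative only: kNN graph -> symmetrize -> exponential affinity."""
    n = len(X)
    # Pairwise Euclidean distances between embedding vectors.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)  # exclude self-edges
    # Each point's k nearest neighbors (directed edges).
    nbrs = np.argsort(d, axis=1)[:, :k]
    edges = {}
    for i in range(n):
        for j in nbrs[i]:
            # Symmetrize: keep each edge once, regardless of direction.
            edges[(min(i, j), max(i, j))] = d[i, j]
    # Convert distances to weights: closer points -> stronger connections.
    return {e: np.exp(-dist) for e, dist in edges.items()}

weights = toy_base_graph(np.random.default_rng(0).normal(size=(20, 5)))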
Basic Graph Parameters
The most important parameter to configure when building a graph is the embedding. If the embedding is poor, no additional tweaking of parameters will produce a useful graph.
For text data, off-the-shelf embedding models often work
well; you can create text embeddings with
CobaltDataset.add_text_column_embedding().
For tabular data, you might use raw or scaled columns. Or, you could use
CobaltDataset.add_rf_embedding() to train a random forest model on the
data and use the leaves from the trees as embedding data.
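For example, a sketch of adding both kinds of embeddings to a dataset. The
column name is a placeholder, and the exact method signatures may accept
additional (or require other) arguments; consult the API reference.

ds = CobaltDataset(df)

# Embed a text column with an off-the-shelf embedding model.
# The column name here is a placeholder.
ds.add_text_column_embedding("text_column")

# For tabular data: train a random forest and use the trees' leaves
# as an embedding of the rows. Arguments omitted here (assumption);
# see the API reference for required options.
ds.add_rf_embedding()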
The next most important parameter is the distance metric. For text embedding
models, this is almost always the cosine dissimilarity. For tabular data, it can
be worth experimenting with different distance metrics, or even constructing
your own by combining different metrics on different columns with
CombinedMetric.
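Conceptually, a combined metric computes a distance on each group of columns
and aggregates the results. Here is a plain NumPy sketch of the idea (this is
not the CombinedMetric API itself, just an illustration):

import numpy as np

def combined_distance(x, y, num_idx, cat_idx, w_num=1.0, w_cat=1.0):
    """Toy combined metric: Euclidean on numeric columns plus
    Hamming on categorical columns, with per-part weights."""
    d_num = np.linalg.norm(x[num_idx] - y[num_idx])
    d_cat = np.mean(x[cat_idx] != y[cat_idx])
    return w_num * d_num + w_cat * d_cat

x = np.array([0.5, 1.2, 3.0, 1.0])
y = np.array([0.4, 1.0, 3.0, 2.0])
d = combined_distance(x, y, num_idx=[0, 1], cat_idx=[2, 3])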
Other parameters that are often worth experimenting with include:
NeighborParams.M: This is the total number of neighbors to find for each data point.

NeighborParams.K: This is the number of neighbors to keep for each data point in the symmetrized graph, conditional on their being reverse neighbors. This value must be at most M. Increasing K and M together will tend to increase the connectivity of the graph.

NeighborParams.min_nbrs: This is the minimum number of neighbors to keep for each data point in the normalized graph. Increasing this value may help connect disconnected portions of the graph. It must always be less than K.
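These can be set through the neighbor_params argument to
Workspace.new_graph(), as in the advanced examples below. The specific
values and names here are only illustrative:

g = w.new_graph(
    name="graph_with_more_connectivity",
    embedding="embedding_name",
    neighbor_params={"M": 30, "K": 15, "min_nbrs": 3},
)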
Parameter Grid Search
Workspace.new_graph() includes an option to perform a grid search of some
parameters (M, K, min_nbrs, affinity, and optionally the
embedding), to optimize an objective function that serves as a proxy for graph
quality. To use it, pass grid_search=True to new_graph(). By default,
embeddings are not included in optimization, but passing
embedding_search_mode="all" will test graphs built with all embeddings in
the dataset, and using embedding_search_mode="given_plus_generated" will
additionally test scaled and transformed versions of the embeddings.
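For example (the graph name and embedding name are placeholders, and the
exact combinations of arguments accepted may differ):

g = w.new_graph(
    name="graph_from_grid_search",
    embedding="embedding_name",  # starting embedding; placeholder name
    grid_search=True,
    embedding_search_mode="all",  # also evaluate every embedding in the dataset
)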
Filter Functions
Filter functions can help emphasize the variation in a particular feature within a graph. They ensure that data points with different values of the feature are well separated in the graph. This can reveal structure that is hard to find in other ways.
To configure a filter function, you need to provide at least two pieces of
information. The first is the values of the function, FilterSpec.f_vals, as a
NumPy array with one value for each data point in the input data. The second
is how these values are binned: they will be used to split the data into
FilterSpec.n_bins disjoint bins, in one of two ways. If
FilterSpec.bin_method is “rng”, the bins will have equal width, spanning the
range from the minimum to the maximum value of f_vals. If bin_method is
“uni”, bins will be chosen so that each bin has approximately the same number
of data points.
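To make the two binning methods concrete, here is how the bin edges could be
computed in plain NumPy (illustrative only, not Cobalt’s code):

import numpy as np

f_vals = np.random.default_rng(0).exponential(size=1000)
n_bins = 5

# "rng": equal-width bins spanning [min, max] of f_vals.
rng_edges = np.linspace(f_vals.min(), f_vals.max(), n_bins + 1)

# "uni": quantile bins, so each bin holds roughly the same number of points.
uni_edges = np.quantile(f_vals, np.linspace(0, 1, n_bins + 1))

# Assign each point to a bin index (0 .. n_bins - 1) using the inner edges.
bins = np.digitize(f_vals, rng_edges[1:-1])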
Edges in the base graph will then be removed based on the generated bins and
values. The behavior is determined by FilterSpec.pruning_threshold and
FilterSpec.pruning_method. If pruning_method is “bin”,
pruning_threshold determines how close two bins must be for edges to be
allowed between them. For instance, if pruning_threshold=1, then edges
between data points that lie in the same bin or in adjacent bins will be kept,
and all other edges will be removed. If pruning_method is “pct”, pruning is
done based on a quantile threshold: an edge between two data points is kept
only if their entries in f_vals differ by less than pruning_threshold in
quantile terms. For instance, if point i is at the 10th percentile of f_vals
and j is at the 20th percentile, and pruning_threshold=0.2, then the
edge will be kept; but if j were at the 40th percentile, the edge would be
removed.
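As a plain NumPy illustration of the “pct” rule (not Cobalt’s implementation):

import numpy as np

f_vals = np.random.default_rng(0).normal(size=1000)

# Each point's quantile (0 to 1) within f_vals, computed from ranks.
quantiles = np.argsort(np.argsort(f_vals)) / (len(f_vals) - 1)

def keep_edge(i, j, pruning_threshold=0.2):
    # An edge survives only if its endpoints' quantiles differ by
    # less than the threshold.
    return abs(quantiles[i] - quantiles[j]) < pruning_threshold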
Clustering is performed on this pruned graph, and the resulting clusters are split so that each node contains data points from only one bin. These clusters are then used to construct the output graphs, where nodes are linked based on the pruned graph.
To apply filter functions to graph construction, pass a list of filter
specifications to Workspace.new_graph():
# f_arr is a NumPy array with one filter value per data point
g = w.new_graph(
    name="graph_with_filters",
    embedding="embedding_name",
    filters=[
        {"f_vals": f_arr, "bin_method": "uni"},
    ],
)
Advanced Graph Parameters
Neighbor Graph and Base Graph
NeighborParams.backend: The algorithm to use to find the nearest neighbors for each point. The default “nndescent” is an efficient approximate algorithm. In certain situations (particularly if there are many points with identical distances) it may struggle to create a good graph. The “exact” algorithm simply computes all pairwise distances and selects the nearest neighbors from these. It is considerably less efficient but will produce the best possible results.

NeighborParams.seed: The seed to use for any randomness in the “nndescent” algorithm. This has a fixed default value for reproducibility.

NeighborParams.deduplicate: If set to True, the data points will be deduplicated before creating the nearest neighbor graph. This is particularly helpful when there are large numbers of duplicate data points, as these can interfere with the “nndescent” algorithm.

NeighborParams.affinity: The function used to convert distances to weights. The default “slpi” is selected for backwards compatibility and combines the square root of a negative logarithm with an inverse function. For new work we recommend “exponential” or “expinv”, which use exp(-d) (with 1/d for the tail in the case of “expinv”).

NeighborParams.strict_partition: Used to split the data points into separate subsets before building graphs independently on each subset. This can be helpful if your data already has a known stratification and you want to enforce this on the graphs, but still want to see all the data at once.
To adjust these parameters, pass neighbor_params={...} to
Workspace.new_graph(), e.g.:
g = w.new_graph(
name="graph_with_neighbor_params",
embedding="embedding_name",
neighbor_params={
"deduplicate": True,
"affinity": "expinv",
"seed": 9214,
},
)
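For intuition, the “exponential” and “expinv” affinities correspond roughly
to the following. This is a sketch based on the descriptions above, not
Cobalt’s exact implementation, and the crossover point for the 1/d tail is a
placeholder assumption:

import numpy as np

def exponential_affinity(d):
    # Weight decays exponentially with (normalized) distance.
    return np.exp(-d)

def expinv_affinity(d, cutoff=1.0):
    # Assumed form: exp(-d) near zero, switching to a 1/d tail for large
    # distances; the two pieces are matched at the cutoff for continuity.
    d = np.asarray(d, dtype=float)
    return np.where(d < cutoff, np.exp(-d), np.exp(-cutoff) * cutoff / d)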
Clustering
ClusteringParams.allow_multiple_merges_per_node: Each step of the hierarchical clustering algorithm merges nodes together when they are joined by sufficiently strong edges. By default, each node can participate in only one such merge in a single step of the clustering algorithm. Setting this to True removes this constraint.

ClusteringParams.filter_levels_per_component: After the clustering is completed, the levels are filtered to ensure that the number of nodes decreases at a certain rate. When the overall graph has many small components, this can mean that the larger components coarsen very quickly. Setting this parameter to True addresses the problem by having the filtering method target the rate of change of the average number of nodes per component.

ClusteringParams.num_threads: The number of threads to use in the clustering algorithm.

ClusteringParams.max_height: The maximum number of steps to use in the hierarchical clustering algorithm. In most cases, the default should be sufficiently high, but if you encounter graphs where the top level is not coarse enough, increasing this parameter may help.
To adjust these parameters, pass clustering_params={...} to
Workspace.new_graph(), e.g.:
g = w.new_graph(
name="graph_with_clustering_params",
embedding="embedding_name",
clustering_params={
"allow_multiple_merges_per_node": True,
"filter_levels_per_component": True,
"num_threads": 4,
},
)