Cobalt API
- class cobalt.Workspace(dataset: CobaltDataset, split: DatasetSplit | SplitDescriptor | None = None, auto_graph: bool = True, run_server: bool | None = None)
Bases: object
Encapsulates analysis done with a dataset and models.
- ui
A user interface that can be used to interact with the data, models, and other analysis.
- run_auto_group_analysis
Whether to automatically run a group analysis of the data and models when the UI is opened, if no analysis has yet been run.
Initialize a Workspace.
- Parameters:
dataset – The CobaltDataset to use for the analysis.
split – A division of the dataset into predetermined groups, e.g. test/train.
auto_graph – Whether to automatically run the graph creation.
run_server – Whether to run a web server to host images. If None (default), will run a server unless a Colab environment is detected.
The dataset split can be provided in a number of different ways.
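Example (a minimal sketch; the DataFrame contents and split indices are hypothetical placeholders):
>>> import pandas as pd
>>> import cobalt
>>> df = pd.DataFrame({"feature": [0.1, 0.5, 0.9, 0.2], "label": [0, 1, 1, 0]})
>>> ds = cobalt.CobaltDataset(df)
>>> workspace = cobalt.Workspace(ds, split={"train": [0, 1], "test": [2, 3]})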
- add_column(key: str, data, is_categorical: bool | Literal['auto'] = 'auto', dataset: str | None = None)
Add or replace a column in the dataset.
Unlike CobaltDataset.set_column(), this will also update any already-opened UI with the new data.
- Parameters:
key – Name of the column to add.
data – ArrayLike of values to store in the column. Must have length equal to the length of the dataset.
is_categorical – Whether the column values should be treated as categorical. If “auto” (the default), will autodetect.
dataset – The name of the dataset to add the column to. If None, will add the column to the primary dataset.
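Example (a sketch; the column name and data are hypothetical):
>>> import numpy as np
>>> scores = np.random.rand(len(workspace.dataset.df))
>>> workspace.add_column("random_score", scores)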
- add_dataset(dataset: CobaltDataset, name: str | None = None, split: DatasetSplit | SplitDescriptor | None = None) None
Add a dataset to this workspace.
- Parameters:
dataset – CobaltDataset to add
name – Optional name. If not provided, uses dataset.name
split – Optional split for this dataset. Can be a DatasetSplit or descriptor (dict, list of indices, etc). If not provided, creates default split.
Example
>>> orders = cobalt.CobaltDataset(orders_df)
>>> orders.name = "orders"
>>> workspace.add_dataset(orders, split={"train": train_indices, "test": test_indices})
- add_evaluation_metric_values(name: str, metric_values: ArrayLike, model: int | str | ModelMetadata = 0, lower_values_are_better: bool = True, dataset: str | CobaltDataset | None = None)
Add values for a custom evaluation metric.
- Parameters:
name – A name for this evaluation metric. This will be used to name a column in the dataset where these values will be stored, as well as to name the metric itself.
metric_values – An arraylike with one value for each data point in the dataset.
model – The name or index of the model in self.dataset that this metric evaluates.
lower_values_are_better – If True, Cobalt will interpret lower values of this metric as positive; otherwise, it will interpret higher values as positive.
dataset – The dataset the model belongs to. May be a dataset name or a CobaltDataset object.
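Example (a sketch; the "prediction" and "target" column names are hypothetical):
>>> preds = workspace.dataset.select_col("prediction")
>>> targets = workspace.dataset.select_col("target")
>>> workspace.add_evaluation_metric_values(
...     "abs_error", (preds - targets).abs(), model=0, lower_values_are_better=True
... )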
- add_graph(name: str, graph: HierarchicalDataGraph | HierarchicalCobaltGraph, subset: CobaltDataSubset | None = None, init_max_nodes: int = 500, init_max_degree: float = 15.0, params: dict | None = None, source_columns: List[str] | None = None, embedding: Embedding | None = None)
Add a graph to self.graphs.
- Parameters:
name (str) – A name for the graph.
graph – The graph to add (HierarchicalDataGraph or HierarchicalCobaltGraph).
subset – The subset of the self.dataset this graph is constructed from. If graph is a HierarchicalCobaltGraph and subset is None, uses graph.subset.
init_max_nodes – The maximum number of nodes to show in the initial view of this graph.
init_max_degree – The maximum average node degree for the initial view of this graph.
params – Optional dict of parameters used to construct the graph.
source_columns – Optional list of column names used to build the graph.
embedding – Optional Embedding object used to build the graph.
- add_group(name: str, group: CobaltDataSubset, compute_stats: bool = True, description: str | None = None)
Add a group to the collection of saved groups.
- Parameters:
name – The name to identify the group.
group – A CobaltDataSubset object to be saved as a group.
compute_stats – Whether to compute summary statistics for the group. For large datasets with many features this can be time-consuming, and setting it to False may speed things up.
description – An optional description to be displayed with the group.
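Example (a sketch; the filter condition refers to a hypothetical "abs_error" column):
>>> high_error = workspace.dataset.filter("abs_error > 0.5")
>>> workspace.add_group("high_error", high_error, description="Points with large absolute error")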
- static analyze(subset: CobaltDataSubset) Tuple[DataFrame, DataFrame]
Compute numerical and categorical statistics for the given subset.
- Returns:
A tuple (numerical_statistics, categorical_statistics) giving summary statistics for numerical and categorical features in the dataset.
- auto_analysis(ref: str | CobaltDataSubset, cmp: str | CobaltDataSubset, model: int | str | ModelMetadata = 0, embedding: int | str | Embedding = 0, failure_metric: str | Series | None = None, min_size: int = 3, min_failures: int = 3, config: Dict[str, Dict] | None = None, run_name: str | None = None, manual: bool = True, visible: bool = True)
Returns an analysis of errors and warnings with the data and model.
- Parameters:
ref – The subset of the data on which to do the reference analysis. Users should typically pass in the training dataset.
cmp – The subset of the data on which to do the comparison analysis. Users may pass in a test dataset, or a production dataset.
model – The index or name of the model object you want to consider.
embedding – The embedding to use to create the analysis graph. If none is provided, will use the default dataset embedding. (If one does not exist, will raise an error.)
failure_metric – The failure metric used to find error patterns.
min_size – The minimum size of a returned group.
min_failures – The minimum number of failures in a failure group, for a classification task.
config – A dictionary containing further configuration parameters that will be passed to the underlying algorithm.
run_name – A name under which to store the results. If one is not provided, it will be chosen automatically.
manual – Used internally to signal whether the clustering analysis was created by the user.
visible – Whether to show the results of this analysis in the UI.
- Returns:
a dictionary with keys “summaries” and “groups”
Under “summaries” is a tuple of two DataFrames. The first is a table summarizing the discovered error groups; the second is a table summarizing the discovered warning groups.
Under “groups” is a tuple of two lists of CobaltDataSubsets, the first listing the error groups, and the second listing the warning groups.
- property clustering_results: Dict[str, GroupResultsCollection]
Results from all previous runs of the clustering algorithm.
- property dataset: CobaltDataset
The dataset being analyzed in this workspace.
- property drifted_groups: Dict[str, GroupResultsCollection]
The collection of all drifted group analysis results.
- export_groups_as_dataframe() DataFrame
Exports saved groups as a DataFrame.
The columns of the resulting DataFrame are named after the saved groups, and the column for each group contains a boolean mask indicating which data points in the dataset belong to that group.
- property failure_groups: Dict[str, GroupResultsCollection]
The collection of all failure group analysis results.
- feature_compare(group_1: str | CobaltDataSubset, group_2: str | CobaltDataSubset | Literal['all', 'rest', 'neighbors'], numerical_features: List[str] | None = None, categorical_features: List[str] | None = None, numerical_test: Literal['t-test', 'perm'] = 't-test', categorical_test: Literal['G-test'] = 'G-test', include_nan: bool = False, neighbor_graph: str | HierarchicalCobaltGraph | None = None)
Compare the distributions of features between two subsets.
- find_clusters(method: Literal['modularity', 'global_modularity'] = 'modularity', subset: str | CobaltDataSubset | CobaltDataset | None = None, graph: str | HierarchicalCobaltGraph | None = None, embedding: int | str | Embedding = 0, min_group_size: int | float = 1, max_group_size: int | float = inf, max_n_groups: int = 10000, min_n_groups: int = 1, config: Dict[str, Any] | None = None, run_name: str | None = None, manual: bool = True, visible: bool = True, generate_group_descriptions: bool = True) GroupResultsCollection
Run an analysis to find natural clusters in the dataset.
Saves the results in self.clustering_results under run_name.
- Parameters:
method – Algorithm to use for finding clusters. Currently only “modularity” is supported.
subset – The subset of the data on which to perform the analysis. If none is provided, will use the entire dataset.
graph – A graph to use for the clustering. If none is provided, will create a new graph based on the specified embedding. Note that if a graph is provided, it must be built on the subset specified by the subset parameter.
embedding – The embedding to use to create a graph if none is provided. If none is provided, will use the default dataset embedding. (If one does not exist, will raise an error.)
min_group_size – The minimum size for a returned cluster. If a value between 0 and 1 is provided, it will be interpreted as a fraction of the size of the subset of data being clustered.
max_group_size – The maximum size for a returned cluster. If a value between 0 and 1 is provided, it will be interpreted as a fraction of the size of the subset of data being clustered.
max_n_groups – The maximum number of clusters to return.
min_n_groups – The minimum number of clusters to return.
config – A dictionary containing further configuration parameters that will be passed to the underlying algorithm.
run_name – A name under which to store the results. If one is not provided, it will be chosen automatically.
manual – Used internally to signal whether the clustering analysis was created by the user.
visible – Whether to show the results of this analysis in the UI.
generate_group_descriptions – Whether to generate statistical and textual descriptions of returned clusters. True by default, but consider setting to False for large datasets with many columns, or when a large number of clusters is desired, as this process can be very time consuming.
- Returns:
A GroupResultsCollection object containing the discovered clusters and the parameters used by the algorithm.
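Example (a sketch of a typical call; the parameter values and run name are illustrative):
>>> clusters = workspace.find_clusters(min_group_size=20, max_n_groups=25, run_name="baseline_clusters")
>>> results = workspace.clustering_results["baseline_clusters"]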
- find_drifted_groups(reference_group: str | CobaltDataSubset, comparison_group: str | CobaltDataSubset, embedding: int | str | Embedding = 0, relative_prevalence_threshold: float = 2, p_value_threshold: float = 0.05, min_size: int = 5, run_name: str | None = None, config: Dict[str, Any] | None = None, manual: bool = True, visible: bool = True, generate_group_descriptions: bool = True, model: int | str | ModelMetadata = 0, graph: str | HierarchicalCobaltGraph | None = None) GroupResultsCollection
Return groups in the comparison group that are underrepresented in the reference group.
- Parameters:
reference_group – The reference subset of the data, e.g. the training set.
comparison_group – The subset of the data that may have regions that are not well represented in the reference set. This may be a test dataset or production data.
embedding – The embedding to use for the analysis. If none is provided, will use the default dataset embedding. (If one does not exist, will raise an error.)
relative_prevalence_threshold –
How much more common points from comparison_group need to be in a group relative to the overall average for it to be considered drifted. This is computed by comparing the ratio of comparison points to reference points in a group, compared with the ratio in the overall dataset. If the overall balance of points is 1:1 from each group and relative_prevalence_threshold = 2, a drifted group will have at least a 2:1 balance in favor of data points from the comparison set. If the overall ratio of points is 1:2 comparison : reference, then a drifted group will need to have at least a 1:1 ratio.
Choose this value based on what amount of overrepresentation of the comparison group would be meaningful to you. Under the default parameter of 2, the interpretation is roughly that for any returned group, points from the comparison subset are at least twice as common as they would be in a random sample of data points.
p_value_threshold – Used in a significance test that the prevalence of points from the comparison group is at least as high as required based on the value of relative_prevalence_threshold.
min_size – The minimum number of data points required in a drifted region; smaller regions are dropped from the results.
run_name – A name under which to store the results. If one is not provided, it will be chosen automatically.
config – A dictionary containing further configuration parameters that will be passed to the underlying algorithm.
manual – Used internally to signal whether the failure group analysis was created by the user.
visible – Whether to show the results of this analysis in the UI.
generate_group_descriptions – Whether to generate statistical and textual descriptions of returned groups. True by default, but consider setting to False for large datasets with many columns, as this process can be very time consuming.
model – Index or name of the model whose error metric will be shown with the returned groups.
graph – The graph object to use when comparing groups. This graph must be built on the concatenation of reference_group with comparison_group.
- Returns:
A GroupResultsCollection object containing the discovered drifted groups and the parameters used by the algorithm.
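Example (a sketch; assumes the dataset split defines "train" and "prod" subsets):
>>> drift = workspace.find_drifted_groups(
...     reference_group="train",
...     comparison_group="prod",
...     relative_prevalence_threshold=2,
...     min_size=10,
... )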
- find_failure_groups(method: Literal['superlevel'] = 'superlevel', subset: str | CobaltDataSubset | CobaltDataset | None = None, model: int | str | ModelMetadata = 0, embedding: int | str | Embedding = 0, failure_metric: str | Series | None = None, min_size: int = 1, max_size: int | float = inf, min_failures: int = 3, config: Dict[str, Dict] | None = None, run_name: str | None = None, manual: bool = True, visible: bool = True, generate_group_descriptions: bool = True, graph: str | HierarchicalCobaltGraph | None = None) GroupResultsCollection
Run an analysis to find failure groups in the dataset.
Saves the results in self.failure_groups under run_name.
- Parameters:
method – Algorithm to use for finding failure groups. Currently only “superlevel” is supported.
subset – The subset of the data on which to perform the analysis. If none is provided, will use the entire dataset.
model – Index or name of the model for which failure groups should be found.
embedding – The embedding to use for the analysis. If none is provided, will use the default dataset embedding. (If one does not exist, will raise an error.)
failure_metric – The performance metric to use. If a string, will use the model performance metric with that name; otherwise, must be a Pandas Series, with length either equal to the length of the specified subset, or the whole dataset. If a Series is passed, it will be added to the dataset as a model evaluation metric.
min_size – The minimum size for a returned failure group. Smaller groups will be discarded.
max_size – The maximum size for a returned failure group. Larger groups will be split into smaller groups by applying a clustering algorithm.
min_failures – The minimum number of failures for a returned failure group; groups with fewer failures will be discarded. The default of 3 helps surface recognizable failure patterns. Only applies to classification tasks.
config – A dictionary containing further configuration parameters that will be passed to the underlying algorithm.
run_name – A name under which to store the results. If one is not provided, it will be chosen automatically.
manual – Used internally to signal whether the failure group analysis was created by the user.
visible – Whether to show the results of this analysis in the UI.
generate_group_descriptions – Whether to generate statistical and textual descriptions of returned groups. True by default, but consider setting to False for large datasets with many columns, as this process can be very time consuming.
graph – A graph object or name of a graph to use in finding the failure groups. If provided, this graph must be built on the subset of data provided in the subset argument.
- Returns:
A GroupResultsCollection object containing the discovered failure groups and the parameters used by the algorithm.
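Example (a sketch; assumes a "test" split subset and a pointwise performance metric named "error" for the model):
>>> failures = workspace.find_failure_groups(
...     subset="test",
...     model=0,
...     failure_metric="error",
...     min_size=5,
... )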
- static from_arrays(model_inputs: List | ndarray | DataFrame, model_predictions: ndarray, ground_truth: ndarray | None, task: str = 'classification', embedding: ndarray | None = None, embeddings: List[ndarray] | None = None, embedding_metric: str | None = None, embedding_metrics: List[str] | None = None, split: DatasetSplit | SplitDescriptor | None = None)
Returns a Workspace object constructed from user-defined arrays.
- Parameters:
model_inputs – the data evaluated by the model.
model_predictions – the model’s predictions corresponding to model_inputs.
ground_truth – ground truths for model_inputs.
task – The model task. Defaults to “classification”.
embedding – embedding array to include.
embeddings – list of embedding arrays to use.
embedding_metric – embedding metric corresponding to embedding.
embedding_metrics – list of metrics corresponding to embeddings.
split – an optional dataset split.
At most one of embedding or embeddings (and the corresponding embedding_metric or embedding_metrics) should be provided.
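Example (a sketch using randomly generated arrays in place of real model inputs and outputs):
>>> import numpy as np
>>> X = np.random.rand(1000, 20)
>>> preds = np.random.randint(0, 2, 1000)
>>> labels = np.random.randint(0, 2, 1000)
>>> emb = np.random.rand(1000, 8)
>>> workspace = cobalt.Workspace.from_arrays(X, preds, labels, task="classification", embedding=emb)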
- get_graph_level(graph: str | HierarchicalCobaltGraph, level: int, name: str | None = None) GroupCollection
Create a GroupCollection from a specified level of a graph.
This method is deprecated. Use CobaltGraph.get_group_collection() instead:
>>> hierarchical_graph = workspace.graphs["graph_name"]
>>> graph_level = hierarchical_graph.levels[level]
>>> group_collection = graph_level.get_group_collection(name)
- Parameters:
graph – Name of the graph to use, or the graph object itself.
level – The level of the graph to use for the groups. One group will be created for each node in the graph.
name – An optional name for the GroupCollection.
- get_graph_levels(graph: str | HierarchicalCobaltGraph, min_level: int, max_level: int, name_prefix: str | None = None) Dict[int, GroupCollection]
Create GroupCollections for a range of levels of a graph.
All levels between min_level and max_level will be used. The return value is a dict mapping levels to GroupCollections.
This method is experimental and its interface may be changed in the future.
- Parameters:
graph – Name of the graph to use, or the graph object itself.
min_level – The lowest level of the graph to use for the groups.
max_level – The highest level of the graph to use for the groups.
name_prefix – If provided, the GroupCollection for level i will be named “{name_prefix}_{i}”.
- get_group_neighbors(group: CobaltDataSubset | str, graph: HierarchicalCobaltGraph | str, size_ratio: float = 1.0) CobaltDataSubset
Find a set of data points that are neighbors of a group.
Returns a set of data points that is well connected to the given group in the graph, and which does not include any points from the original group.
This method is experimental and its functionality may change in the future.
- Parameters:
group – A CobaltDataSubset or name of a saved group to find the neighbors of.
graph – A HierarchicalCobaltGraph or name of a graph in which to find the neighbors.
size_ratio – Approximate relative size of the group of neighbors. The algorithm will attempt to return a group of neighbors that is approximately size_ratio times the size of the input group.
- get_groups() GroupCollection
Get a GroupCollection object with the currently saved groups.
- Returns:
GroupCollection read-only object with groups. A group consists of a subset of data points together with some metadata about the subset.
- get_linked_datasets(dataset_name: str | None = None) List[str]
Get names of datasets linked to the specified dataset.
- Parameters:
dataset_name – Dataset name, or None for primary dataset
- Returns:
List of dataset names linked to the specified dataset
- get_split(dataset: str | CobaltDataset | None = None) DatasetSplit
Get the split for a dataset.
- Parameters:
dataset – CobaltDataset object, name of a dataset, or None for primary dataset
- Returns:
DatasetSplit for the specified dataset
- property graphs: Dict[str, HierarchicalCobaltGraph]
The graphs that have been created and saved.
- import_groups_from_dataframe(df: DataFrame)
Imports groups from a DataFrame with one column for each group.
The name of each column will be used as the name for the group, and the entries in the column will be interpreted as boolean values indicating the membership of each data point in that group.
- link_datasets(left: str | CobaltDataset, right: str | CobaltDataset, left_column: str, right_column: str | None = None) None
Create a link between two datasets in the workspace.
There are two cases: if the values in both columns are scalar, rows will be linked together if they have the same value in both columns. If one column (say, the left) contains lists, then a row in the left dataset will be linked to all rows in the right dataset where the value of the right column is contained in the list in the left column.
- Parameters:
left – Name or CobaltDataset instance of the first dataset.
right – Name or CobaltDataset instance of the second dataset.
left_column – Column name in left dataset for linking.
right_column – Column name in right dataset for linking. If None, uses the same column name as left_column.
- Raises:
ValueError – If a link already exists between these datasets
Example
>>> # Using dataset names
>>> workspace.link_datasets(
...     "customers", "orders",
...     "customer_id", "customer_id"
... )
>>> # Using dataset objects with the same column name
>>> workspace.link_datasets(
...     customers, orders,
...     "customer_id"
... )
- static load(path: str) Workspace
Load a Workspace saved with Workspace.save().
Compatibility with Workspaces saved by previous versions of Cobalt is not guaranteed.
- new_graph(name: str | None = None, subset: str | CobaltDataSubset | CobaltDataset | None = None, embedding: int | str | Embedding = 0, metric: str | Metric | None = None, init_max_nodes: int = 500, init_max_degree: float = 15.0, **kwargs) HierarchicalCobaltGraph
Create a new graph from a specified subset.
The resulting graph will be returned and added to the Workspace.
- Parameters:
name – The name to give the graph in self.graphs. If None, a name will be generated automatically.
subset – The subset of the dataset to include in the graph. If a string, will try to use a subset with that name from the dataset split or the saved groups (in that order). Otherwise, should be a CobaltDataSubset.
embedding – The embedding to use to generate the graph. May be specified as an index into self.dataset.embeddings, the name of the embedding, or an Embedding object.
metric – The distance metric to use when constructing the graph. If none is provided, will use the metric specified by the embedding.
init_max_nodes – The maximum number of nodes to show in the initial view of this graph.
init_max_degree – The maximum average node degree for the initial view of this graph.
**kwargs –
Additional keyword parameters. These can include:
Parameters for GraphSpec (e.g., M, K, min_nbrs, affinity, L_coarseness, L_connectivity, filters).
grid_search (bool): If True, perform a grid search over graph parameters to select the best graph according to a scoring function. Default is False. This has a performance cost but can yield higher-quality graphs.
- Grid search options (only used when grid_search=True):
param_grid: List of parameter dicts to search over. If None, uses a default grid.
scorer: Scoring function key (“spectral_score” or “modularity_score”) or a callable. Default is “spectral_score”.
subsample_max_size: Max data size for scoring phase. Default 1000.
random_state: RNG seed for subsampling. Default 42.
reverse: If True, higher scores are better. Default True.
embedding_search_mode: One of “given” (default, use passed embedding only), “all” (search all embeddings in dataset), or “given_plus_generated” (passed embedding plus auto-generated scaled and random forest embeddings).
When grid_search=True, the returned graph’s params attribute will contain a “grid_search_details” key with the selected parameters, score, and other grid search metadata.
- Returns:
The created graph.
- Return type:
HierarchicalCobaltGraph
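Example (a sketch; the graph name, subset name, and grid-search settings are illustrative):
>>> graph = workspace.new_graph(
...     name="test_graph",
...     subset="test",
...     embedding=0,
...     grid_search=True,
... )
>>> details = graph.params["grid_search_details"]  # selected parameters, score, and other metadata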
- save(path: str) str
Save this workspace to a file.
The file can be loaded with Workspace.load(). It will include the dataset, embeddings, saved groups, autogroups, and graphs created in this Workspace. However, no UI state will be preserved.
This method is experimental and forwards compatibility is not guaranteed. Future versions of Cobalt may not be able to load Workspaces saved with this version.
- property saved_groups: GroupCollection
An object that represents the currently saved groups.
This does not include groups selected by algorithms like find_failure_groups(), only groups saved manually in the UI or with Workspace.add_group().
- view_table(subset: List[int] | CobaltDataSubset | None = None, display_columns: List[str] | None = None, max_rows: int | None = None)
Returns a visualization of the dataset table.
- class cobalt.UI(workspace: Workspace, dataset: CobaltDataset, table_image_size: Tuple[int, int] = (80, 80))
Bases: object
An interactive UI visualizing the data in a Workspace.
- Parameters:
workspace – the Workspace object that this UI will visualize
dataset – the CobaltDataset being analyzed
table_image_size – For datasets with images, the (height, width) size in pixels at which images will be shown in the data table.
- build()
Construct the UI.
This normally happens automatically when the UI object appears as an output in a notebook cell.
- get_current_graph() HierarchicalCobaltGraph
Return the currently shown graph.
- get_current_graph_source_data() CobaltDataSubset
Return the current dataset being displayed in the current graph.
- Returns:
A CobaltDataSubset of the data represented by the graph. Note that if sub-sampling is enabled, this may not be the entire dataset.
- get_filtered_data() CobaltDataSubset
Return the results of the current filters applied in the data table.
- Returns:
A CobaltDataSubset of the data displayed in the data table.
Note that if data is selected in the graph, or a group is selected, this will be the subset of the selected data satisfying the filter conditions.
- get_graph_and_clusters() Tuple[Graph, SubsetCollection]
Return the current graph and the datapoints that belong to each node.
- Returns:
A tuple (Graph, List[CobaltDataSubset]) containing the current graph as a networkx Graph and a list of the data points that each node represents.
Note that the graph has the same number of nodes as the number of elements in the list.
- get_graph_selection() CobaltDataSubset
Return the current subset selected in the graph.
- class cobalt.CobaltDataset(dataset: DataFrame, metadata: DatasetMetadata | None = None, models: List[ModelMetadata] | None = None, embeddings: List[Embedding] | None = None, name: str | None = None, arrays: Dict[str, ndarray] | None = None)
Bases: DatasetBase, SerializableMixin, JSONSerializableMixin
Foundational object for a Cobalt analysis.
Encapsulates all necessary information regarding the data, metadata, and model outputs associated with an analysis.
- name
Optional string for dataset name
- add_array(key: str, array: ndarray | csr_array)
Add a new array to the dataset.
Will raise an error if an array with the given name already exists.
- add_column_embedding(columns: str | List[str], metric: str | Metric = 'euclidean', name: str | None = None, scaling: Literal['standardize', 'robust'] | None = None)
Create an embedding from one or more columns of the dataset.
This creates a ColumnEmbedding that references the specified columns directly, without copying the data.
- Parameters:
columns – A column name (str) or list of column names to include in the embedding.
metric – The preferred distance metric to use with this embedding. Defaults to “euclidean”.
name – An optional name for the embedding. If not provided, a name will be generated from the column names.
scaling –
An optional method for scaling the values of the embedding. If provided, may be:
”standardize”: normalize columns to mean 0 and standard deviation 1
”robust”: normalize columns to median 0 and interquartile range 1.
Note that if this parameter is provided, an unscaled version of the embedding will also be created.
- Raises:
ValueError – If any column doesn’t exist or is not numerical.
- add_embedding_array(embedding: ndarray | Any, metric: str | Metric = 'euclidean', name: str | None = None)
Add an embedding to the dataset.
- Parameters:
embedding – An array or arraylike object containing the embedding values. Should be two-dimensional and have the same number of rows as the dataset.
metric – The preferred distance metric to use with this embedding. Defaults to “euclidean”; “cosine” is another useful option.
name – An optional name for the embedding.
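Example (a sketch; ds is an existing CobaltDataset and the array is random stand-in data):
>>> import numpy as np
>>> emb = np.random.rand(len(ds.df), 32)
>>> ds.add_embedding_array(emb, metric="cosine", name="random_embedding")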
- add_media_column(paths: List[str], local_root_path: str | None = None, column_name: str | None = None)
Add a media column to the dataset.
- Parameters:
paths – A list or other array-like object containing the paths to the media file for each data point in the dataset.
local_root_path – A root path for all the paths in paths
column_name – The name for the column in the dataset that should store the media file paths.
- add_model(input_columns: str | List[str] | None = None, target_column: str | List[str] | None = None, prediction_column: str | List[str] | None = None, task: str | ModelTask = 'custom', performance_columns: List[str | dict] | None = None, name: str | None = None)
Add a new model.
- Parameters:
input_columns – The column(s) in the dataset that the model takes as input.
target_column – The column(s) in the dataset with the target values for the model outputs.
prediction_column – The column(s) in the dataset with the model’s outputs.
task – The task the model performs. This determines which performance metrics are calculated automatically. The default is “custom”, which does not compute any performance metrics. Other options are “regression” and “classification”.
performance_columns – Columns of the dataset containing pointwise model performance metrics. This can be used to add extra custom performance metrics for the model.
name – An optional name for the model. If one is not provided, a unique id will be generated.
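Example (a sketch; the column names are hypothetical and assumed to exist in the dataset):
>>> ds.add_model(
...     input_columns=["feature_1", "feature_2"],
...     target_column="label",
...     prediction_column="prediction",
...     task="classification",
...     name="baseline_classifier",
... )
>>> ds.compute_model_performance_metrics()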
- add_rf_embedding(source_embedding: str | ColumnEmbedding | ArrayEmbedding, outcome_column: str | None = None, embedding_name: str | None = None, n_estimators: int = 50, max_depth: int = 7, max_samples: float = 0.25, random_state: int | None = None, store_model: bool = False)
Create a random forest embedding.
- Parameters:
source_embedding – The embedding to use as input features. Can be specified as the name of an existing embedding (str), or as a ColumnEmbedding or ArrayEmbedding object.
outcome_column – Optional target column name for supervised embedding.
embedding_name – Optional embedding name; autogenerated if omitted.
n_estimators – Number of trees in the forest.
max_depth – Maximum depth of each tree.
max_samples – Fraction of samples to use for each tree.
random_state – Random seed for reproducibility.
store_model – If True, store the trained RF model in the embedding for later use with embed(). Default False to save memory.
- Raises:
ValueError – If the specified outcome_column is missing from the dataset or the ArrayEmbedding data is non-numeric.
- add_scaled_embedding(source_embedding: str | ColumnEmbedding | ArrayEmbedding, scaling: Literal['standardize', 'robust'] = 'standardize', embedding_name: str | None = None, metric: str | Metric = 'euclidean')
Create a lazily-computed scaled embedding and add it to the dataset.
The scaled embedding does not store a copy of the scaled array. Instead, it references the source embedding and computes scaled values on demand.
- Parameters:
source_embedding – The embedding to scale. Can be specified as the name of an existing embedding (str), or as a ColumnEmbedding or ArrayEmbedding object.
scaling – ‘standardize’ (zero mean, unit variance) or ‘robust’ (median/IQR).
embedding_name – Optional name; autogenerated if omitted.
metric – Distance metric for the resulting embedding.
- Raises:
ValueError – If the source embedding type is unsupported or data is non-numeric.
- add_text_column_embedding(source_column: str, embedding_model: str = 'all-MiniLM-L6-v2', embedding_name: str | None = None, device: str | None = None)
Create text embeddings from a column of the dataset.
Embeddings are created locally using a sentence_transformers model.
- Parameters:
source_column – The column of the dataset containing the text to embed.
embedding_model – The name of the sentence_transformers model to use. The default is all-MiniLM-L6-v2, which is small and reasonably fast, even on a CPU.
embedding_name – The name to save the embedding with. If none is provided, a name will be constructed from the column name and the embedding model name.
device – The torch device to run the embedding model on. If none is provided, a device will be chosen automatically.
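Example (a sketch; "review_text" is a hypothetical text column in the dataset):
>>> ds.add_text_column_embedding("review_text", embedding_model="all-MiniLM-L6-v2")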
- property array_names: List[str]
Names of the arrays stored in this dataset.
- as_subset()
Returns all rows of this CobaltDataset as a CobaltDataSubset.
- compute_model_performance_metrics()
Compute the performance metrics for each model in dataset.
Adds columns to the dataset storing the computed metrics, and updates the ModelMetadata.error_column attributes correspondingly.
- create_rich_media_table(break_newlines: bool = True, highlight_terms: Dict[str, List[str]] | None = None, run_server: bool | None = False) DataFrame
Returns a media table with image columns rendered as HTML columns.
- property df: DataFrame
Returns a pd.DataFrame of the underlying data for this dataset.
- property embedding_names: List[str]
The names of embeddings in this dataset.
- filter(condition: str) CobaltDataSubset
Returns subset where condition evaluates to True in the DataFrame.
- Parameters:
condition – String predicate that is evaluated using the pd.eval function.
- Returns:
Selected Subset of type CobaltDataSubset
Example
>>> df = pd.DataFrame({'a': [1, 2, 3, 4]})
>>> ds = cobalt.CobaltDataset(df)
>>> subset = ds.filter('a > 2')
>>> len(subset)
2
- get_array(key: str) ndarray
Get an array from the dataset.
- get_embedding(index: int | str = 0) ndarray | csr_array
Return the embedding array with the given name or integer index.
- get_embedding_array(index: int | str = 0) ndarray | csr_array
Return the embedding array with the given name or integer index.
- get_image_columns() List[str]
Gets image columns.
- get_model_performance_data(metric: str, model_index: int | str) ndarray
Returns computed performance metric.
- get_summary_statistics(categorical_max_unique_count: int = 10) Tuple[DataFrame, DataFrame]
Returns summary statistics for each feature in the dataset.
- classmethod load(file_path: str) CobaltDataset
Load a saved dataset from a .json file.
- mask(m: ArrayLike) CobaltDataSubset
Return a CobaltDataSubset consisting of rows at indices where m is nonzero.
- property metadata: DatasetMetadata
A DatasetMetadata object containing the metadata for this dataset.
- property models: ModelMetadataCollection
The models associated with this dataset.
Each ModelMetadata object represents potential outcome, prediction, and error columns.
- overall_model_performance_score(metric: str, model_index: int | str) float
Computes the mean model performance score.
- overall_model_performance_scores(model_index: int | str) Dict[str, float]
Computes the performance score for each available metric.
- sample(max_samples: int, random_state: int | None = None) CobaltDataSubset
Return a CobaltDataSubset containing up to max_samples sampled rows.
Up to max_samples rows will be sampled without replacement and returned as a CobaltDataSubset. If fewer rows exist than max_samples, all rows are returned.
- Parameters:
max_samples – The maximum number of samples to pull.
random_state – An optional integer to be used as a seed for random sampling.
- Returns:
A CobaltDataSubset representing up to max_samples randomly sampled datapoints.
- save(file_path: str | PathLike) str
Write this dataset to a .json file.
Returns the path written to.
- select_col(col: str) Series
Return the values for column col of this dataset.
- set_column(key: str, data, is_categorical: bool | Literal['auto'] = 'auto')
Add or replace a column in the dataset.
- Parameters:
key – Name of the column to add.
data – ArrayLike of values to store in the column. Must have length equal to the length of the dataset.
is_categorical – Whether the column values should be treated as categorical. If “auto” (the default), will autodetect.
- set_column_text_type(column: str, input_type: TextDataType)
Set the type for a text column in the dataset.
Options include “long_text”, which means the data in the column will be subject to keyword analysis but will not be available for coloring, and “short_text”, which prevents keyword analysis but allows categorical coloring.
- subset(indices: ArrayLike) CobaltDataSubset
Returns a CobaltDataSubset consisting of rows indexed by indices.
- time_range(start_time: Timestamp, end_time: Timestamp) CobaltDataSubset
Return a CobaltDataSubset within a time range.
- Parameters:
start_time – A pd.Timestamp marking the start of the time window.
end_time – A pd.Timestamp marking the end of the time window.
- Returns:
A CobaltDataSubset consisting of datapoints within the range [start_time, end_time).
- to_dict() dict
Save all information in this dataset to a dict.
- class cobalt.CobaltDataSubset(source: CobaltDataset, indices: ndarray | List[int])
Bases: DatasetBase
Represents a subset of a CobaltDataset.
Should in general be constructed by calling the subset() method (or other similar methods) on a CobaltDataset or CobaltDataSubset.
In principle, this could have repeated data points, since there is no check for duplicates.
- source_dataset
The CobaltDataset of which this is a subset.
- indices
np.ndarray of integer row indices defining the subset.
- as_mask() ndarray[bool]
Returns mask of self on self.source_dataset.
- as_mask_on(base_subset: CobaltDataSubset) ndarray[bool]
Returns mask of self on another subset.
- Raises:
ValueError – if self is not a subset of base_subset.
- complement() CobaltDataSubset
Returns the complement of this set in its source dataset.
- concatenate(dataset: CobaltDataSubset) CobaltDataSubset
Add another data subset to this one. Does not check for overlaps.
- Returns:
A new CobaltDataSubset object containing points from self and the passed dataset.
- Raises:
ValueError – if self and dataset have different parent datasets.
- create_rich_media_table(break_newlines: bool = True, highlight_terms: Dict[str, List[str]] | None = None, run_server: bool | None = False) DataFrame
Returns a media table with image columns rendered as HTML columns.
- property df: DataFrame
Returns a pd.DataFrame of the data represented by this data subset.
- difference(dataset: CobaltDataSubset) CobaltDataSubset
Returns the subset of self that is not contained in dataset.
- Raises:
ValueError – if self and dataset have different parent datasets.
- property embedding_names: List[str]
Return the available embedding names.
- filter(condition: str) CobaltDataSubset
Returns subset where condition evaluates to True in the DataFrame.
- Parameters:
condition – String predicate that is evaluated using the pd.eval function.
- Returns:
Selected Subset of type CobaltDataSubset
Example
>>> df = pd.DataFrame({'a': [1, 2, 3, 4]})
>>> ds = cobalt.CobaltDataset(df)
>>> subset = ds.filter('a > 2')
>>> len(subset)
2
- get_embedding(index: int | str = 0) ndarray | csr_array
Return the embedding array with the given name or integer index.
- get_embedding_array(index: int | str = 0) ndarray | csr_array
Return the embedding array with the given name or integer index.
- get_image_columns() List[str]
Gets image columns.
- get_model_performance_data(metric: str, model_index: int | str) ndarray
Returns computed performance metric.
- get_model_performance_metrics()
Retrieve and aggregate performance metrics for each model in the subset.
This method iterates over each model and retrieves its overall performance scores.
- Returns:
A dictionary structured as {model_name: {metric_name: metric_value}}, where metric_value is the computed score for each metric.
- Return type:
dict
- get_summary_statistics(categorical_max_unique_count: int = 10) Tuple[DataFrame, DataFrame]
Returns summary statistics for each feature in the dataset.
- intersect(dataset: CobaltDataSubset) CobaltDataSubset
Returns the intersection of self with dataset.
- Raises:
ValueError – if self and dataset have different parent datasets.
- intersection_size(dataset: CobaltDataSubset) int
Returns the size of the intersection of self with dataset.
Somewhat more efficient than len(self.intersect(dataset)).
- Raises:
ValueError – if self and dataset have different parent datasets.
- mask(m: ArrayLike) CobaltDataSubset
Return a CobaltDataSubset consisting of rows at indices where m is nonzero.
- property metadata: DatasetMetadata
A DatasetMetadata object containing the metadata for this dataset.
- property models: ModelMetadataCollection
The models associated with this dataset.
Each ModelMetadata object represents potential outcome, prediction, and error columns.
- overall_model_performance_score(metric: str, model_index: int | str) float
Computes the mean model performance score.
- overall_model_performance_scores(model_index: int | str) Dict[str, float]
Computes the performance score for each available metric.
- sample(max_samples: int, random_state: int | None = None) CobaltDataSubset
Return a CobaltDataSubset containing up to max_samples sampled rows.
Up to max_samples rows will be sampled without replacement and returned as a CobaltDataSubset. If fewer rows exist than max_samples, all rows are returned.
- Parameters:
max_samples – An integer indicating the maximum number of samples to pull.
random_state – An optional integer to be used as a seed for random sampling.
- Returns:
A CobaltDataSubset representing up to max_samples randomly sampled datapoints.
- select_col(col: str) Series
Return the pd.Series for column col of this data subset.
- subset(indices: ArrayLike) CobaltDataSubset
Returns a subset obtained via indexing into self.df.
Tracks the dependency on self.source_dataset.
- to_dataset() CobaltDataset
Converts this subset to a standalone CobaltDataset.
- Returns:
This subset as a standalone CobaltDataset.
- Return type:
CobaltDataset
- class cobalt.ModelMetadata(outcome_columns: List[str], prediction_columns: List[str], task: ModelTask, input_columns: List[str] | None = None, error_columns: List[str] | None = None, evaluation_metrics: Sequence[EvaluationMetric | Dict] | None = None, name: str | None = None)
Bases: SerializableMixin
Information about a model and its relationship to a dataset.
Stores information about the model’s inputs and outputs (as names of columns in the dataset), as well as ground truth data. Provides access to model performance metrics.
- name
An optional name for the model.
- task
The task performed by the model. Can be “classification”, “regression”, or “custom” (the default). This determines which performance metrics are available by default.
- input_columns
A list of column(s) in the dataset containing the input data for the model.
- prediction_columns
A list of column(s) in the dataset containing the outputs produced by the model.
- outcome_columns
A list of column(s) in the dataset containing the target outputs for the model.
- add_metric_column(metric_name: str, column: str, lower_values_are_better: bool = True)
Add a column from the dataset as a performance metric for this model.
- Parameters:
metric_name – The name for the metric. If you want to compare different models using this metric, use the same name for the metric in each.
column – The name of the column in the dataset that contains the values of this metric for the model.
lower_values_are_better – Whether lower or higher values of the metric indicate better performance.
- get_confusion_matrix(dataset: DatasetBase, normalize_mode: bool | Literal['all', 'index', 'columns'] = 'index', selected_classes: List[str] | None = None) pd.DataFrame | None
Calculate the confusion matrix for the model if applicable.
- Parameters:
dataset – The dataset containing the outcomes and predictions.
normalize_mode – Specifies the normalization mode for the confusion matrix.
selected_classes – Specifies the classes to include in the matrix, with all others aggregated as “other”.
- Returns:
Confusion matrix as a DataFrame, or None if not applicable.
- Return type:
Optional[pd.DataFrame]
- Raises:
ValueError – If the model task is not classification.
- get_statistic_metrics(dataset: DatasetBase, selected_classes: List[str] | None = None)
Return a DataFrame containing recall, precision, F1 score, and accuracy for each class.
This method uses the model’s confusion matrix and can filter metrics to only selected classes. Metrics calculated include recall, precision, F1 score, and accuracy.
- Parameters:
dataset – The dataset to compute the confusion matrix.
selected_classes – List of classes to include in the metrics calculation. If None, metrics for all classes are calculated.
- Returns:
A DataFrame with recall, precision, F1 score, and accuracy for each class.
- Return type:
pd.DataFrame
- property outcome_column
Returns the first outcome column if len(outcome_columns) > 0, else None.
- property performance_metrics: Dict[str, EvaluationMetric]
The relevant performance metrics for this model.
The returned objects have a calculate() method, which computes pointwise performance metrics, and an overall_score() method, which computes the overall performance for a group. These methods accept CobaltDataSubset objects and return dictionaries mapping metric names to values.
- property prediction_column
Returns the first prediction column if len(prediction_columns) > 0, else None.
- class cobalt.DatasetMetadata(media_columns: List[MediaInformationColumn] | None = None, timestamp_columns: List[str] | None = None, hidable_columns: List[str] | None = None, default_columns: List[str] | None = None, other_metadata_columns: List[str] | None = None, default_topic_column: str | None = None)
Bases: SerializableMixin
Encapsulates various metadata about a CobaltDataset.
- media_columns
Optional list of MediaInformationColumns.
- timestamp_columns
Optional list of timestamp column name strings.
- hidable_columns
Optional list of hidable column name strings.
- default_columns
Optional list containing the names of columns to display by default in an interactive data table.
- other_metadata_columns
Optional list of column name strings.
- data_types
Dict mapping column names to DatasetColumnMetadata objects.
- property default_topic_column: str | None
Default column to use for topic analysis.
If len(self.long_text_columns) == 0, will always be None.
- property long_text_columns: List[str]
Columns containing large amounts of text data.
These are candidates for topic or keyword analysis.
- timestamp_column(index=0) str
Return the (string) name of the indexth timestamp column.
- class cobalt.MediaInformationColumn(column_name: str, file_type: str, host_directory: str, is_remote=False)
Bases: Column
Represents a column containing information about media files.
- column_name
Column Name in dataframe.
- Type:
str
- file_type
A string indicating the file type, e.g. its extension.
- Type:
str
- host_directory
Path or URL where the file is located.
- Type:
str
- is_remote
Whether the file is remote.
- autoname_media_visualization_column() dict
Autoname media column.
- class cobalt.Embedding(name=None)
Bases: ABC
Encapsulates metadata about a dataset embedding.
- property admissible_distance_metrics: Sequence[str | Metric]
Distance metrics that are reasonable to use with this embedding.
Other distance metrics may still be useful, but these are metrics that are known to make sense for the data.
- abstract property default_distance_metric: str | Metric
Default distance metric to use with this embedding.
- abstract property dimension: int
The dimension of the embedding.
- property distance_metrics: Sequence[str | Metric]
Suggested distance metrics for use with this embedding.
- abstractmethod get(dataset: DatasetBase) np.ndarray
Get the values of this embedding for a dataset.
- abstractmethod get_available_distance_metrics() Sequence[str | Metric]
Return the list of distance metrics that could be used.
- class cobalt.ArrayEmbedding(array_name: str, dimension: int, metric: str | Metric, name: str | None = None)
Bases: Embedding, DictConstructibleMixin, SerializableMixin
An embedding stored in an array associated with a Dataset.
- array_name
The name of the array in the dataset storing the embedding values
- property admissible_distance_metrics: List[str | Metric]
Distance metrics that are reasonable to use with this embedding.
Other distance metrics may still be useful, but these are metrics that are known to make sense for the data.
- property default_distance_metric: str | Metric
Default distance metric to use with this embedding.
- property dimension: int
The dimension of the embedding.
- property distance_metrics: Sequence[str | Metric]
Suggested distance metrics for use with this embedding.
- get(dataset: DatasetBase) np.ndarray
Return a np.ndarray of the embedding rows at specified indices.
- Parameters:
dataset – Data(sub)set for which to get the embedding values.
- Returns:
The np.ndarray containing the embedding values for the rows in the given dataset.
- get_available_distance_metrics() List[str]
Return the list of distance metrics that could be used.
- class cobalt.ColumnEmbedding(columns: List[str], metric: str | Metric, name=None)
Bases: Embedding, DictConstructibleMixin, SerializableMixin
Represents an embedding as a column range.
- columns
List of strings naming the columns to include in this embedding.
- property admissible_distance_metrics: List[str | Metric]
Distance metrics that are reasonable to use with this embedding.
Other distance metrics may still be useful, but these are metrics that are known to make sense for the data.
- property default_distance_metric: str | Metric
Default distance metric to use with this embedding.
- property dimension: int
The dimension of the embedding.
- property distance_metrics: Sequence[str | Metric]
Suggested distance metrics for use with this embedding.
- get(dataset: DatasetBase) np.ndarray
Return a np.ndarray of the embedding rows at specified indices.
Only columns specified in the columns attribute are included.
- Parameters:
dataset – Data(sub)set for which to get the embedding values.
- Returns:
The np.ndarray containing the embedding values for the rows in the given dataset.
- get_available_distance_metrics() Sequence[str | Metric]
Return the list of distance metrics that could be used.
- class cobalt.RandomForestEmbedding(source_embedding_name: str, dimension: int, outcome_column: str | None = None, n_estimators: int = 50, max_depth: int = 7, max_samples: float = 0.25, random_state: int | None = None, name: str | None = None, model: RandomForestClassifier | RandomForestRegressor | None = None)
Bases: Embedding, DictConstructibleMixin, SerializableMixin
An embedding computed using Random Forest leaf node assignments.
This embedding wraps a source embedding and applies a Random Forest model to generate leaf node indices as features. The RF can be trained in either supervised mode (with an outcome column) or unsupervised mode (using a synthetic classification task).
The trained model can optionally be stored to enable embedding new data.
- source_embedding_name
Name of the source embedding used as RF input.
- outcome_column
Name of the target column (None for unsupervised).
- n_estimators
Number of trees in the forest.
- max_depth
Maximum depth of each tree.
- supervised
Whether the RF was trained with supervision.
Initialize a RandomForestEmbedding.
- Parameters:
source_embedding_name – Name of the source embedding to use as input.
dimension – The dimension of the embedding (number of trees).
outcome_column – Target column name for supervised training. If None, uses unsupervised mode with synthetic labels.
n_estimators – Number of trees in the forest.
max_depth – Maximum depth of each tree.
max_samples – Fraction of samples to use for each tree.
random_state – Random seed for reproducibility.
name – Optional name for this embedding.
model – Optional pre-trained RandomForest model. If provided, can be used to embed new data via embed().
Note
The metric is always Hamming distance for RF embeddings since leaf node indices are discrete integer values.
- property admissible_distance_metrics: List[str]
Distance metrics that are reasonable for RF embeddings.
- property default_distance_metric: str
Default distance metric (hamming for RF leaf indices).
- property dimension: int
The dimension of the embedding (number of trees).
- property distance_metrics: Sequence[str | Metric]
Suggested distance metrics for use with this embedding.
- embed(X: ndarray) ndarray
Embed new data using the stored model.
- Parameters:
X – Input array of shape (n_samples, n_features).
- Returns:
Leaf node indices array of shape (n_samples, n_estimators).
- Raises:
ValueError – If no model is stored.
- get(dataset: DatasetBase) np.ndarray
Get the RF embedding values for a dataset.
First checks for a pre-computed embedding array in the dataset. If not found and a model is stored, computes embeddings on the fly.
- Parameters:
dataset – Data(sub)set for which to get the embedding values.
- Returns:
The RF embedding array of shape (n_samples, n_estimators).
- get_available_distance_metrics() List[str]
Return the list of distance metrics that could be used.
- has_model() bool
Check if a trained model is available.
- property model: RandomForestClassifier | RandomForestRegressor | None
The trained RandomForest model, if stored.
- property supervised: bool
Whether the RF was trained with supervision.
- with_model(model: RandomForestClassifier | RandomForestRegressor) RandomForestEmbedding
Return a copy of this embedding with the given model attached.
- Parameters:
model – A trained RandomForest model (Classifier or Regressor).
- Returns:
A new RandomForestEmbedding with the model stored.
- class cobalt.ScaledEmbedding(source_embedding_name: str, scaling: Literal['standardize', 'robust'], dimension: int, metric: str | Metric = 'euclidean', name: str | None = None)
Bases: Embedding, DictConstructibleMixin, SerializableMixin
An embedding that lazily computes scaled values from a source embedding.
This embedding wraps a continuous numeric embedding (ColumnEmbedding or ArrayEmbedding) and applies scaling (standardization or robust/IQR scaling) on demand. It does not store the scaled array directly, providing memory savings and avoiding redundant data during serialization.
Scaling parameters (mean/std or median/IQR) are computed from the full source dataset to ensure consistency when working with subsets.
Note
This class is intended for continuous numeric embeddings only. It should NOT be used with discrete embeddings like RandomForestEmbedding, which use Hamming distance on integer leaf indices.
- source_embedding_name
Name of the source embedding to scale.
- scaling
The scaling method (‘standardize’ or ‘robust’).
Initialize a ScaledEmbedding.
- Parameters:
source_embedding_name – Name of the source embedding to scale.
scaling – Scaling method - ‘standardize’ (zero mean, unit variance) or ‘robust’ (median centering, IQR scaling).
dimension – The dimension of the embedding.
metric – Distance metric for the scaled embedding.
name – Optional name for this embedding.
- property admissible_distance_metrics: List[str | Metric]
Distance metrics that are reasonable to use with this embedding.
- property default_distance_metric: str | Metric
Default distance metric to use with this embedding.
- property dimension: int
The dimension of the embedding.
- property distance_metrics: Sequence[str | Metric]
Suggested distance metrics for use with this embedding.
- get(dataset: DatasetBase) np.ndarray
Get the scaled embedding values for a dataset.
Scaling parameters (mean/std or median/IQR) are computed from the full source dataset to ensure consistency across subsets.
- Parameters:
dataset – Data(sub)set for which to get the embedding values.
- Returns:
The scaled embedding array.
- Raises:
TypeError – If the source embedding is a RandomForestEmbedding.
- get_available_distance_metrics() List[str | Metric]
Return the list of distance metrics that could be used.
- class cobalt.DatasetSplit(dataset: CobaltDataset, split: SplitDescriptor | None = None, train: CobaltDataSubset | List[int] | ndarray | None = None, test: CobaltDataSubset | List[int] | ndarray | None = None, prod: CobaltDataSubset | List[int] | ndarray | None = None)
Bases:
dict
The DatasetSplit object can contain any number of user-defined subsets of data.
This can be used to separate out training data from production data, or a baseline dataset from a comparison set, or labeled from unlabeled data, or any number of divisions. These subsets are stored as a dictionary of CobaltDataSubsets, each with a name. When an object that is not a CobaltDataSubset is added to the dictionary, it is automatically converted to a subset by calling dataset.subset(). This means that the split can be created or updated by simply adding lists of data point indices.
There are a few special subset names that will be given extra meaning by Cobalt: “train”, “test”, and “prod”. The “train” subset is meant to include data that was used to train the model under consideration, the “test” subset data that was originally used to evaluate that model, and “prod” data collected later, e.g. when the model is in production. If specified, these subsets will be used in automated failure mode and problem analyses.
Construct a DatasetSplit object.
- Parameters:
dataset – The CobaltDataset that this separates into subsets.
split –
A collection of subsets. Can be given as any of the following:
a sequence of integers indicating how many data points fall in each split
a sequence of subsets
a dict mapping subset names to subsets.
Subsets can be provided either as CobaltDataSubset objects or as arrays of indices into dataset. If none is provided, a single subset named “all” will be created, containing all data points.
There are three special names for subsets, “train”, “test”, and “prod”, which are used to inform the automatic model analysis. These can also be passed as keyword parameters for convenience, e.g.
DatasetSplit(dataset, train=np.arange(1000), prod=np.arange(1000,2000)).
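Example
A sketch assuming a CobaltDataset with at least 2000 data points; "ds" is a placeholder for that dataset.
>>> import numpy as np
>>> import cobalt
>>> # ds: an existing CobaltDataset (placeholder)
>>> split = cobalt.DatasetSplit(
...     ds, split={"train": np.arange(1000), "test": np.arange(1000, 2000)}
... )
>>> split.train  # CobaltDataSubset containing the first 1000 points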
- clear() None. Remove all items from D.
- property comparable_subset_pairs: List[Tuple[Tuple[str, CobaltDataSubset], Tuple[str, CobaltDataSubset]]]
Returns a list of pairs of disjoint subsets in this split, with names.
Each pair is returned in both orders.
- copy() a shallow copy of D
- classmethod from_dataset_column(dataset: CobaltDataset, column: str) DatasetSplit
Create a split from a column in the dataset.
The column’s value for each data point should be the name of the split subset containing that point.
- Parameters:
dataset – The dataset to create a split for
column – The name of the column in the dataset that contains the split information. The entries of this column should be strings, as they will be used as names for the split subsets.
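Example
A sketch assuming "ds" (a placeholder CobaltDataset) has a string column "data_source" whose entries name the subset each point belongs to.
>>> import cobalt
>>> split = cobalt.DatasetSplit.from_dataset_column(ds, "data_source")
>>> split.names  # one name per distinct value of "data_source"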
- classmethod fromkeys(iterable, value=None, /)
Create a new dictionary with keys from iterable and values set to value.
- get(key, default=None, /)
Return the value for key if key is in the dictionary, else default.
- property has_multiple_subsets: bool
Whether this split has multiple disjoint subsets that can be compared.
- items() a set-like object providing a view on D's items
- keys() a set-like object providing a view on D's keys
- property names: List[str]
Names of subsets in this split.
- pop(k[, d]) v, remove specified key and return the corresponding value.
If the key is not found, return the default if given; otherwise, raise a KeyError.
- popitem()
Remove and return a (key, value) pair as a 2-tuple.
Pairs are returned in LIFO (last-in, first-out) order. Raises KeyError if the dict is empty.
- property prod: CobaltDataSubset | None
The production subset, if it exists.
- setdefault(key, default=None, /)
Insert key with a value of default if key is not in the dictionary.
Return the value for key if key is in the dictionary, else default.
- property test: CobaltDataSubset | None
The testing subset, if it exists.
- property train: CobaltDataSubset | None
The training subset, if it exists.
- update([E, ]**F) None. Update D from mapping/iterable E and F.
If E is present and has a .keys() method, then does: for k in E.keys(): D[k] = E[k]. If E is present and lacks a .keys() method, then does: for k, v in E: D[k] = v. In either case, this is followed by: for k in F: D[k] = F[k].
- values() an object providing a view on D's values
- class cobalt.ProblemGroup(subset: ~cobalt.schema.dataset.CobaltDataSubset, name: str | None = None, metrics: ~typing.Dict[str, float] = <factory>, description: str | None = None, display_info: ~cobalt.schema.group.GroupDisplayInfo = <factory>, keywords: ~typing.Dict[str, ~cobalt.schema.group.GroupKeywords] = <factory>, auto_descriptions: ~typing.Dict[str, ~typing.List[~cobalt.schema.group.GroupAutoDescription]] = <factory>, comparison_stats: ~typing.Dict[str, ~cobalt.schema.group.GroupComparisonStats] = <factory>, feature_bounds: ~cobalt.schema.group.GroupFeatureBounds | None = None, group_type: ~cobalt.cobalt_types.GroupType = GroupType.any, other_fields: ~typing.Dict[str, ~typing.Any] = <factory>, problem_description: str = '', severity: float = 1.0, primary_metric: str | None = None, visible: bool = True, run_id: ~uuid.UUID | None = None)
Bases:
GroupMetadata
A group representing a problem with a model.
- description: str | None = None
A short description of the contents of the group.
- feature_bounds: GroupFeatureBounds | None = None
Upper and lower bounds for individual features on this group.
- get_autodescriptions(column: str, n_descriptions: int = 1, descriptions_per_prompt: int = 1, n_samples: int = 10, max_sample_length: int = 250, set_description: bool = True, score_descriptions: bool = False, seed: int = 582, description_model: str = 'gpt-4.1', scoring_model: str = 'gpt-4.1-mini') List[GroupAutoDescription]
Use an LLM to generate hypotheses for properties that distinguish this group from others.
This works by sampling a number of documents from the group and prompting the LLM to describe a feature present in the documents in the sample but not present in a sample of documents not in the group.
Models provided through the OpenAI API are currently supported. To use this functionality, you must first configure your API key, either by calling cobalt.setup_api_client() or setting the OPENAI_API_KEY environment variable.
- Parameters:
column – The column of the dataset containing the documents to describe.
n_descriptions – The number of descriptions to generate for the group. Each description will be generated with a fresh sample of documents, so generating multiple descriptions can increase the likelihood of finding useful hypotheses.
descriptions_per_prompt – The number of descriptions to generate for each sample. This must be a divisor of n_descriptions.
n_samples – The number of documents from the group to sample and use in the prompt for each description.
max_sample_length – The maximum number of characters to include from each sampled document. This puts an upper bound on the cost of each API call.
set_description – Whether to use the generated descriptions to set the group’s primary description. If score_descriptions is True, the description with the highest F1-score will be used; otherwise the first description returned will be used.
score_descriptions – Whether to evaluate the quality of the generated descriptions. Description scoring is done by selecting a set of samples from the group and a set of samples from the rest of the dataset, and prompting a model to evaluate whether the description accurately captures each sample. This is treated as a classifier distinguishing between documents in the group and documents not in the group, and the precision, recall, and F1-score are reported.
seed – Used to control the samples from each group. Does not affect the LLM sampling.
description_model – Which model to use to generate descriptions.
scoring_model – Which model to use to score descriptions.
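Example
A hedged sketch; "group" stands in for a ProblemGroup produced by a prior analysis, "text" is a hypothetical document column, and an OpenAI API key must already be configured.
>>> # group: a ProblemGroup from a previous analysis (placeholder)
>>> descriptions = group.get_autodescriptions(
...     column="text",
...     n_descriptions=3,
...     score_descriptions=True,
... )
>>> print(descriptions[0])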
- group_type: GroupType = 'Group'
Describes the semantic meaning of the group in context.
- name: str | None = None
The group’s name. Should be unique within a SubsetCollection.
- primary_metric: str | None = None
The main metric used to evaluate this group.
- problem_description: str = ''
A brief description of the problem.
- severity: float = 1.0
A score representing the degree of seriousness of the problem.
Used to sort a collection of groups. Typically corresponds to the value of a performance metric on the group, and in general is only comparable within the result set of a single algorithm run.
- subset: CobaltDataSubset
The data points included in this group.
- metrics: Dict[str, float]
Relevant numeric metrics for this group.
- display_info: GroupDisplayInfo
Information to be displayed in the group explorer in the UI.
- keywords: Dict[str, GroupKeywords]
Distinctive keywords found in text columns in the group.
- comparison_stats: Dict[str, GroupComparisonStats]
Results of statistical tests comparing this group with others.
- class cobalt.SubsetCollection(source_dataset: CobaltDataset, indices: Sequence[Sequence[int]], name: str | None = None)
Bases:
SerializableWithDatasetsMixin
A collection of subsets of a CobaltDataset.
- aggregate_col(col: str, method: Literal['mean', 'sum', 'mode'] | Callable[[Series], Any] | None = None) Sequence[float]
Aggregate the values of a column within each subset using the specified method.
- concatenate() CobaltDataSubset
Concatenate all subsets in the collection.
- get_array(key: str) Sequence[ndarray]
Retrieve the slice of an array for each subset.
- is_pairwise_disjoint()
Return True if there are no overlaps between subsets, False otherwise.
- select_col(col: str) Sequence[Series]
Retrieve the values of a column on each subset.
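Example
A minimal sketch; "ds" is a placeholder CobaltDataset with at least 200 rows and a numeric "score" column.
>>> import cobalt
>>> # ds: an existing CobaltDataset (placeholder)
>>> halves = cobalt.SubsetCollection(
...     ds, indices=[list(range(100)), list(range(100, 200))], name="halves"
... )
>>> halves.aggregate_col("score", method="mean")  # one mean value per subset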
- class cobalt.GroupMetadata(subset: 'CobaltDataSubset', name: 'Optional[str]' = None, metrics: 'Dict[str, float]' = <factory>, description: 'Optional[str]' = None, display_info: 'GroupDisplayInfo' = <factory>, keywords: 'Dict[str, GroupKeywords]' = <factory>, auto_descriptions: 'Dict[str, List[GroupAutoDescription]]' = <factory>, comparison_stats: 'Dict[str, GroupComparisonStats]' = <factory>, feature_bounds: 'Optional[GroupFeatureBounds]' = None, group_type: 'GroupType' = <GroupType.any: 'Group'>, other_fields: 'Dict[str, Any]' = <factory>)
Bases:
SerializableMixin, DictConstructibleMixin
- description: str | None = None
A short description of the contents of the group.
- feature_bounds: GroupFeatureBounds | None = None
Upper and lower bounds for individual features on this group.
- get_autodescriptions(column: str, n_descriptions: int = 1, descriptions_per_prompt: int = 1, n_samples: int = 10, max_sample_length: int = 250, set_description: bool = True, score_descriptions: bool = False, seed: int = 582, description_model: str = 'gpt-4.1', scoring_model: str = 'gpt-4.1-mini') List[GroupAutoDescription]
Use an LLM to generate hypotheses for properties that distinguish this group from others.
This works by sampling a number of documents from the group and prompting the LLM to describe a feature present in the documents in the sample but not present in a sample of documents not in the group.
Models provided through the OpenAI API are currently supported. To use this functionality, you must first configure your API key, either by calling cobalt.setup_api_client() or setting the OPENAI_API_KEY environment variable.
- Parameters:
column – The column of the dataset containing the documents to describe.
n_descriptions – The number of descriptions to generate for the group. Each description will be generated with a fresh sample of documents, so generating multiple descriptions can increase the likelihood of finding useful hypotheses.
descriptions_per_prompt – The number of descriptions to generate for each sample. This must be a divisor of n_descriptions.
n_samples – The number of documents from the group to sample and use in the prompt for each description.
max_sample_length – The maximum number of characters to include from each sampled document. This puts an upper bound on the cost of each API call.
set_description – Whether to use the generated descriptions to set the group’s primary description. If score_descriptions is True, the description with the highest F1-score will be used; otherwise the first description returned will be used.
score_descriptions – Whether to evaluate the quality of the generated descriptions. Description scoring is done by selecting a set of samples from the group and a set of samples from the rest of the dataset, and prompting a model to evaluate whether the description accurately captures each sample. This is treated as a classifier distinguishing between documents in the group and documents not in the group, and the precision, recall, and F1-score are reported.
seed – Used to control the samples from each group. Does not affect the LLM sampling.
description_model – Which model to use to generate descriptions.
scoring_model – Which model to use to score descriptions.
- group_type: GroupType = 'Group'
Describes the semantic meaning of the group in context.
- name: str | None = None
The group’s name. Should be unique within a SubsetCollection.
- subset: CobaltDataSubset
The data points included in this group.
- metrics: Dict[str, float]
Relevant numeric metrics for this group.
- display_info: GroupDisplayInfo
Information to be displayed in the group explorer in the UI.
- keywords: Dict[str, GroupKeywords]
Distinctive keywords found in text columns in the group.
- comparison_stats: Dict[str, GroupComparisonStats]
Results of statistical tests comparing this group with others.
- class cobalt.GroupCollection(source_dataset: CobaltDataset, indices: Sequence[Sequence[int]], name: str | None = None, group_type: GroupType = GroupType.any)
Bases:
SubsetCollection, SerializableMixin
A collection of groups from a source CobaltDataset.
A group consists of a subset of data points together with some metadata about the subset. This metadata can include things like:
A name for the group
Distinctive keywords for the group
Model performance metrics on the group
Distinctive features for the group
The schema for metadata is defined in the GroupMetadata class.
The groups in a collection are stored in a specific order, and can be accessed by indexing, e.g. collection[0] to get the first group. If a group has been assigned a name, it can also be accessed by name, e.g. collection["group name"]. This will return the CobaltDataSubset containing the data points in the group. To access the metadata for a group, index into collection.metadata in the same way.
It should not usually be necessary to manually instantiate GroupCollection objects, but they will be returned by various Cobalt methods and functions.
The GroupCollection interface is under development and changes may be made in the near future.
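Example
An illustrative sketch; "collection" is a placeholder for a GroupCollection returned by a Cobalt analysis.
>>> # collection: a GroupCollection returned by Cobalt (placeholder)
>>> first_subset = collection[0]             # CobaltDataSubset for the first group
>>> named_subset = collection["group name"]  # access by name, if the group was named
>>> first_meta = collection.metadata[0]      # GroupMetadata for the first group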
- aggregate_col(col: str, method: Literal['mean', 'sum', 'mode'] | Callable[[Series], Any] | None = None) Sequence[float]
Aggregate the values of a column within each subset using the specified method.
- compare_models(models: Sequence[ModelMetadata | str], metrics: List[str], select_best_model: bool = True, statistical_test: Literal['t-test', 'wilcoxon'] | None = None) DataFrame
Produce a dataframe comparing two or more models on each group.
Evaluates each specified metric for each model on each group, and puts these values in a column called “model_name_metric_name”. If select_best_model is True, will also include a column indicating the best model for each group with respect to each metric, as well as the change in performance compared to the next-best model. If statistical_test is specified, will also run a statistical test of whether the difference in performance between models is significant on each group. The resulting p-values are not currently adjusted for multiple comparisons.
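Example
A hedged sketch; "model_a" and "model_b" are hypothetical model names with an "accuracy" metric defined, and "collection" is a placeholder GroupCollection.
>>> comparison = collection.compare_models(
...     models=["model_a", "model_b"],
...     metrics=["accuracy"],
...     select_best_model=True,
...     statistical_test="t-test",
... )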
- compute_group_keywords(col: str | Sequence[str] | None = None, n_keywords: int = 10, set_descriptions: bool = True, set_names: bool = False, warn_if_no_data: bool = True, reference_class: Literal['collection', 'dataset'] = 'dataset', use_all_text_columns: bool = True, **kwargs)
Find distinctive keywords for each group and store them in the group metadata.
- Parameters:
col – The column or columns containing text from which to extract keywords. If none is provided, will either use all text columns or use the default text column, depending on the value of use_all_text_columns.
n_keywords – The number of keywords to find for each group.
set_descriptions – If True, will set each group’s description to a string constructed from the top keywords.
set_names – If True, will set each group’s name based on the discovered keywords, using the default parameters to set_names_from_keywords().
warn_if_no_data – If True, will issue a warning if there is no text data to extract keywords from.
reference_class – If “collection”, will look for keywords that distinguish groups in this collection from each other. If “dataset”, will look for keywords that distinguish each group from the rest of the dataset.
use_all_text_columns – Controls the behavior of the method when col is not specified: if True, keywords are extracted from all text columns; otherwise, only the default text column is used.
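Example
A sketch assuming the dataset has a text column named "review_text"; "collection" is a placeholder GroupCollection.
>>> collection.compute_group_keywords(
...     col="review_text",
...     n_keywords=5,
...     set_descriptions=True,
...     set_names=True,
...     reference_class="dataset",
... )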
- concatenate() CobaltDataSubset
Concatenate all subsets in the collection.
- evaluate_model(model: ModelMetadata | str, metrics: Sequence[str] | None = None) DataFrame
Produce a dataframe containing model performance metrics for each group.
- Parameters:
model – Name of the model to evaluate, or a ModelMetadata object to evaluate.
metrics – Names of the metrics to evaluate on the model. By default, will use all metrics defined for the model.
- classmethod from_groups(groups: Sequence[GroupMetadata])
Create a GroupCollection from a list of GroupMetadata objects.
- classmethod from_subset_collection(subsets: SubsetCollection, name: str | None = None)
Promote a SubsetCollection to a GroupCollection.
This allows adding metadata to each subset.
- get_array(key: str) Sequence[ndarray]
Retrieve the slice of an array for each subset.
- get_autodescriptions(column: str, n_descriptions: int = 1, descriptions_per_prompt: int = 1, n_samples: int = 10, max_sample_length: int = 250, set_descriptions: bool = True, score_descriptions: bool = False, description_model: str = 'gpt-4.1', scoring_model: str = 'gpt-4.1-mini', parallel: bool = True)
Use an LLM to describe properties that distinguish each group from the dataset.
This works by sampling a number of documents from each group and prompting the LLM to describe a feature present in the documents in the sample but not present in a sample of documents not in the group.
Models provided through the OpenAI API are currently supported. To use this functionality, you must first configure your API key, either by calling cobalt.setup_api_client() or setting the OPENAI_API_KEY environment variable.
- Parameters:
column – The column of the dataset containing the documents to describe.
n_descriptions – The number of descriptions to generate for the group. Each description will be generated with a fresh sample of documents, so generating multiple descriptions can increase the likelihood of finding useful hypotheses.
descriptions_per_prompt – The number of descriptions to generate for each sample. This must be a divisor of n_descriptions.
n_samples – The number of documents from the group to sample and use in the prompt for each description.
max_sample_length – The maximum number of characters to include from each sampled document. This puts an upper bound on the cost of each API call.
set_descriptions – Whether to use the generated descriptions to set each group’s primary description. If score_descriptions is True, the description with the highest F1-score will be used; otherwise the first description returned will be used.
score_descriptions – Whether to evaluate the quality of the generated descriptions. Description scoring is done by selecting a set of samples from the group and a set of samples from the rest of the dataset, and prompting a model to evaluate whether the description accurately captures each sample. This is treated as a classifier distinguishing between documents in the group and documents not in the group, and the precision, recall, and F1-score are reported.
seed – Used to control the samples from each group. Does not affect the LLM sampling.
description_model – Which model to use to generate descriptions.
scoring_model – Which model to use to score descriptions.
parallel – Whether to run each group’s descriptions in parallel. This is recommended to avoid waiting for sequential API calls.
- is_pairwise_disjoint()
Return True if there are no overlaps between subsets, False otherwise.
- property metadata: GroupMetadataIndexer
Get a group together with its metadata.
- select_col(col: str) Sequence[Series]
Retrieve the values of a column on each subset.
- set_names_from_keywords(col: str, n_keywords: int = 3, delimiter: str = ', ', min_match_rate: float = 0.0)
Set names for each group based on already-computed keywords.
Names groups with a string containing a number of the top keywords found for that group.
If two groups would end up with the same name, groups after the first will be named with a number to ensure names are unique.
- Parameters:
col – The column whose keywords should be used to create the group names.
n_keywords – The number of keywords to use to form each name.
delimiter – The character(s) that should separate keywords from each other in the group names.
min_match_rate – The minimum fraction of data points in the group that should contain a keyword in order for it to be used in the group name.
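Example
A sketch assuming keywords have already been computed for the hypothetical "review_text" column, e.g. via compute_group_keywords(); "collection" is a placeholder GroupCollection.
>>> collection.set_names_from_keywords("review_text", n_keywords=3, min_match_rate=0.2)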
- set_names_sequential(prefix: str | None = None, prefix_source: Literal['group_type', 'collection_name'] = 'group_type', sep: str = ' ')
Set names for each group sequentially with a prefix string.
- class cobalt.GroupResultsCollection(name: str, run_type: RunType, source_data: CobaltDataSubset, group_type: GroupType, algorithm: str, params: dict, groups=None, visible: bool = True, run_id: UUID | None = None)
Bases:
GroupCollection
Contains the results of a group analysis on a dataset.
- aggregate_col(col: str, method: Literal['mean', 'sum', 'mode'] | Callable[[Series], Any] | None = None) Sequence[float]
Aggregate the values of a column within each subset using the specified method.
- compare_models(models: Sequence[ModelMetadata | str], metrics: List[str], select_best_model: bool = True, statistical_test: Literal['t-test', 'wilcoxon'] | None = None) DataFrame
Produce a dataframe comparing two or more models on each group.
Evaluates each specified metric for each model on each group, and puts these values in a column called “model_name_metric_name”. If select_best_model is True, will also include a column indicating the best model for each group with respect to each metric, as well as the change in performance compared to the next-best model. If statistical_test is specified, will also run a statistical test of whether the difference in performance between models is significant on each group. The resulting p-values are not currently adjusted for multiple comparisons.
- compute_group_keywords(col: str | Sequence[str] | None = None, n_keywords: int = 10, set_descriptions: bool = True, set_names: bool = False, warn_if_no_data: bool = True, reference_class: Literal['collection', 'dataset'] = 'dataset', use_all_text_columns: bool = True, **kwargs)
Find distinctive keywords for each group and store them in the group metadata.
- Parameters:
col – The column or columns containing text from which to extract keywords. If none is provided, will either use all text columns or use the default text column, depending on the value of use_all_text_columns.
n_keywords – The number of keywords to find for each group.
set_descriptions – If True, will set each group’s description to a string constructed from the top keywords.
set_names – If True, will set each group’s name based on the discovered keywords, using the default parameters to set_names_from_keywords().
warn_if_no_data – If True, will issue a warning if there is no text data to extract keywords from.
reference_class – If “collection”, will look for keywords that distinguish groups in this collection from each other. If “dataset”, will look for keywords that distinguish each group from the rest of the dataset.
use_all_text_columns – Controls the behavior of the method when col is not specified: if True, keywords are extracted from all text columns; otherwise, only the default text column is used.
- concatenate() CobaltDataSubset
Concatenate all subsets in the collection.
- evaluate_model(model: ModelMetadata | str, metrics: Sequence[str] | None = None) DataFrame
Produce a dataframe containing model performance metrics for each group.
- Parameters:
model – Name of the model to evaluate, or a ModelMetadata object to evaluate.
metrics – Names of the metrics to evaluate on the model. By default, will use all metrics defined for the model.
- classmethod from_groups(groups: Sequence[GroupMetadata])
Create a GroupCollection from a list of GroupMetadata objects.
- classmethod from_subset_collection(subsets: SubsetCollection, name: str | None = None)
Promote a SubsetCollection to a GroupCollection.
This allows adding metadata to each subset.
- get_array(key: str) Sequence[ndarray]
Retrieve the slice of an array for each subset.
- get_autodescriptions(column: str, n_descriptions: int = 1, descriptions_per_prompt: int = 1, n_samples: int = 10, max_sample_length: int = 250, set_descriptions: bool = True, score_descriptions: bool = False, description_model: str = 'gpt-4.1', scoring_model: str = 'gpt-4.1-mini', parallel: bool = True)
Use an LLM to describe properties that distinguish each group from the dataset.
This works by sampling a number of documents from each group and prompting the LLM to describe a feature present in the documents in the sample but not present in a sample of documents not in the group.
Models provided through the OpenAI API are currently supported. To use this functionality, you must first configure your API key, either by calling cobalt.setup_api_client() or setting the OPENAI_API_KEY environment variable.
- Parameters:
column – The column of the dataset containing the documents to describe.
n_descriptions – The number of descriptions to generate for the group. Each description will be generated with a fresh sample of documents, so generating multiple descriptions can increase the likelihood of finding useful hypotheses.
descriptions_per_prompt – The number of descriptions to generate for each sample. This must be a divisor of n_descriptions.
n_samples – The number of documents from the group to sample and use in the prompt for each description.
max_sample_length – The maximum number of characters to include from each sampled document. This puts an upper bound on the cost of each API call.
set_descriptions – Whether to use the generated descriptions to set each group’s primary description. If score_descriptions is True, the description with the highest F1-score will be used; otherwise the first description returned will be used.
score_descriptions – Whether to evaluate the quality of the generated descriptions. Description scoring is done by selecting a set of samples from the group and a set of samples from the rest of the dataset, and prompting a model to evaluate whether the description accurately captures each sample. This is treated as a classifier distinguishing between documents in the group and documents not in the group, and the precision, recall, and F1-score are reported.
seed – Used to control the samples from each group. Does not affect the LLM sampling.
description_model – Which model to use to generate descriptions.
scoring_model – Which model to use to score descriptions.
parallel – Whether to run each group’s descriptions in parallel. This is recommended to avoid waiting for sequential API calls.
- property groups: List[Group]
The groups, with metadata (e.g. descriptions, model performance metrics) for each.
- is_pairwise_disjoint()
Return True if there are no overlaps between subsets, False otherwise.
- property metadata: GroupMetadataIndexer
Get a group together with its metadata.
- property raw_groups: List[CobaltDataSubset]
The groups as a list of CobaltDataSubset objects.
Omits the descriptive metadata.
- select_col(col: str) Sequence[Series]
Retrieve the values of a column on each subset.
- set_names_from_keywords(col: str, n_keywords: int = 3, delimiter: str = ', ', min_match_rate: float = 0.0)
Set names for each group based on already-computed keywords.
Names groups with a string containing a number of the top keywords found for that group.
If two groups would end up with the same name, groups after the first will be named with a number to ensure names are unique.
- Parameters:
col – The column whose keywords should be used to create the group names.
n_keywords – The number of keywords to use to form each name.
delimiter – The character(s) that should separate keywords from each other in the group names.
min_match_rate – The minimum fraction of data points in the group that should contain a keyword in order for it to be used in the group name.
- set_names_sequential(prefix: str | None = None, prefix_source: Literal['group_type', 'collection_name'] = 'group_type', sep: str = ' ')
Set names for each group sequentially with a prefix string.
- summary(model: ModelMetadata | None = None, production_subset: CobaltDataSubset | None = None) DataFrame
Create a tabular summary of the groups in this collection.
- Parameters:
model – A ModelMetadata object whose performance metrics will be computed for the groups.
production_subset – If provided, will calculate the fraction of data points in each group that fall in this subset.
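Example
A hedged sketch; "results" is a placeholder GroupResultsCollection and "split" a DatasetSplit with a "prod" subset.
>>> # results: a GroupResultsCollection from a previous analysis (placeholder)
>>> summary_df = results.summary(production_subset=split.prod)
>>> summary_df.head()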
- name: str
A name for the collection of results. May be referred to as a “run name”, since it corresponds to a particular run of an algorithm.
- source_data: CobaltDataSubset
The data(sub)set used for the analysis, as a CobaltDataSubset object.
- group_type: GroupType
What each group in the collection represents, e.g. a failure group or a cluster.
- algorithm: str
The algorithm used to produce the groups.
- params: Dict
Parameters passed to the group-finding algorithm.
- run_type: RunType
Whether the algorithm was run manually by the user or automatically by Cobalt.
- visible: bool
Whether the groups should be displayed in the UI.
- run_id: UUID
A unique ID for this collection of groups.
- class cobalt.HierarchicalCobaltGraph(name: str, graph: HierarchicalDataGraph, subset: CobaltDataSubset, params: Dict[str, Any] | None = None, embedding: Embedding | None = None, source_columns: List[str] | None = None)
Bases:
DictConstructibleMixin, SerializableMixin
A hierarchical collection of graphs built from a dataset.
Each graph in the collection is a CobaltGraph whose nodes correspond with subsets of the source data. These are hierarchically arranged, so that if i < j, each node in self.levels[j] is a union of nodes from self.levels[i].
- levels
List of CobaltGraph objects, one per resolution level.
- name
The name of the graph
- subset
The CobaltDataSubset this graph was built from
- params
Dictionary of parameters used to build this graph
- embedding
The Embedding object used to build the graph
- source_columns
List of column names used to build the graph
- base_graph
A very high-resolution CobaltGraph, where nodes are as small as possible. May be higher resolution than self.levels[0].
- property n_levels: int
Number of resolution levels in the hierarchical graph.
- property neighbor_graph: CobaltGraph
A normalized neighbor graph mapping relationships between points.
This is an alias for self.base_graph provided for backwards compatibility.
- property raw_graph: KNNGraph | None
A nearest-neighbor graph giving raw distances between points.
May not be available.
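Example
An illustrative sketch; "hgraph" is a placeholder for a HierarchicalCobaltGraph, e.g. one stored in a Workspace.
>>> # hgraph: an existing HierarchicalCobaltGraph (placeholder)
>>> hgraph.n_levels
>>> coarse = hgraph.levels[-1]  # coarsest CobaltGraph in the hierarchy
>>> fine = hgraph.base_graph    # highest-resolution graph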
- class cobalt.CobaltGraph(graph: DataGraph, subset: CobaltDataSubset)
Bases:
DictConstructibleMixin, SerializableMixin
A single-resolution graph based on a dataset.
Each node in the graph corresponds with a set of similar data points. Edges connect related groups of data points.
CobaltGraph objects are usually obtained by selecting a particular resolution level from a HierarchicalCobaltGraph.
- subset
The CobaltDataSubset this graph was built from.
- node_subsets
A SubsetCollection of the subsets for each node in the graph.
- property N: int
Total number of data points in the graph.
- property csr_graph: CSRGraph
The underlying sparse graph structure without information about data points.
- property edge_list: List[tuple]
List of edges as (source, target) tuples.
- property edge_mtx: ndarray
Edge matrix as (n_edges, 2) array where each row is [source, target].
- property edge_weights: ndarray
Array of edge weights.
- property edges: List[Dict[str, int]]
List of edges as dicts with ‘source’, ‘target’, and ‘weight’ keys.
- get_group_collection(name: str | None = None) GroupCollection
Convert the nodes of this graph to a GroupCollection.
This can be used to quickly analyze each node as an individual group.
- induced_subgraph(node_indices: ndarray) CobaltGraph
Create the subgraph induced by a collection of nodes.
Also creates the corresponding subset of self.subset.
- property n_edges: int
Number of edges in the graph.
- property node_membership: ndarray
Array giving the node ID for each data point in self.subset.
- property node_sets: List[ndarray]
List of node memberships. Each element is an array of indices into self.subset.
- property nodes: List[ndarray]
List of node memberships. Each element is an array of indices into self.subset.
Alias for self.node_sets.
- partition_modularity(partition_vec: ndarray) float
Compute the graph modularity score of a partition of the graph nodes.
The partition is specified as an integer array of length len(self.nodes), assigning each node a partition ID.
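Example
A hedged sketch; "graph" is a placeholder for a CobaltGraph, e.g. one level of a HierarchicalCobaltGraph, and the size threshold is illustrative only.
>>> import numpy as np
>>> # graph: an existing CobaltGraph (placeholder)
>>> node_groups = graph.get_group_collection(name="graph nodes")
>>> large_nodes = np.array(
...     [i for i, members in enumerate(graph.node_sets) if len(members) >= 10]
... )
>>> subgraph = graph.induced_subgraph(large_nodes)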
- class cobalt.GraphSpec(X: ~numpy.ndarray, metric: str | ~mapper.distances.Metric, filters: ~typing.Sequence[~cobalt.build_graph.FilterSpec] = (), neighbor_params: ~cobalt.build_graph.NeighborParams | None = None, clustering_params: ~cobalt.build_graph.ClusteringParams = <factory>, M: int | None = None, K: int | None = None, min_nbrs: int | None = None, affinity: ~typing.Literal['slpi', 'exponential', 'expinv', 'gaussian'] = 'slpi', L_coarseness: int = 20, L_connectivity: int = 20)
Bases:
object
A set of parameters for creating a graph.
- K: int | None = None
The number of mutual nearest neighbors to keep for each data point.
If not provided, this will be chosen automatically. It is preferred to specify this parameter as part of neighbor_params.
- L_coarseness: int = 20
The number of neighbors to keep for each data point when clustering data points into graph nodes.
- L_connectivity: int = 20
The number of neighbors to keep for each data point when connecting nodes in the graph.
- M: int | None = None
The number of nearest neighbors to compute for each data point.
If not provided, this will be chosen automatically. It is preferred to specify this parameter as part of neighbor_params.
- affinity: Literal['slpi', 'exponential', 'expinv', 'gaussian'] = 'slpi'
The function to convert normalized distances into weights.
It is preferred to specify this parameter as part of neighbor_params.
- filters: Sequence[FilterSpec] = ()
A (possibly empty) list of FilterSpec objects that describe filter functions to apply to the graph.
These may be provided as dicts that will be used to construct FilterSpec objects.
- min_nbrs: int | None = None
The minimum number of neighbors to keep for each data point.
If not provided, this will be chosen automatically. It is preferred to specify this parameter as part of neighbor_params.
- neighbor_params: NeighborParams | None = None
Parameters determining how the underlying neighbor graph is constructed from the embedding.
May be provided as a dict that will be used to construct a NeighborParams object.
- X: ndarray
The source data. Should be an array of shape (n_points, n_dims).
- metric: str | Metric
The distance metric to use to create the graph.
May be given as a name, or as a Metric object (e.g. a CombinedMetric or a CustomMetric).
- clustering_params: ClusteringParams
Parameters affecting the hierarchical clustering of data points that produces the multiresolution graph.
May be provided as a dict that will be used to construct a ClusteringParams object.
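Example
A minimal sketch using synthetic data; the parameter values are illustrative only.
>>> import numpy as np
>>> from cobalt import GraphSpec, NeighborParams
>>> X = np.random.default_rng(0).normal(size=(1000, 16)).astype(np.float32)
>>> spec = GraphSpec(
...     X=X,
...     metric="euclidean",
...     neighbor_params=NeighborParams(M=30, K=15),
... )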
- class cobalt.FilterSpec(f_vals: ndarray, n_bins: int = 10, bin_method: Literal['rng', 'uni'] = 'rng', pruning_method: Literal['bin', 'pct'] = 'bin', pruning_threshold: int | float = 1, smoothing_ratio: float = 0.0)
Bases:
object
A set of parameters for a filter on a graph.
Separates the dataset into n_bins bins, based on the values of f_vals for each data point. Data points within each bin are clustered to form nodes, and are linked together if they are in nearby bins.
- bin_method: Literal['rng', 'uni'] = 'rng'
Either “rng” or “uni”. If “rng”, the bins will have equal width; if “uni” they will have equal numbers of data points.
- n_bins: int = 10
The number of bins to separate the dataset into.
- pruning_method: Literal['bin', 'pct'] = 'bin'
Either “bin” or “pct”. If “bin”, will only allow edges between nodes from nearby bins. If “pct”, will only allow edges between nodes whose percentile difference for f_vals is within the given threshold.
- pruning_threshold: int | float = 1
The maximum distance two nodes can be apart while still being connected.
- f_vals: ndarray
An array of values, one for each data point.
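Example
A sketch; "scores" is a placeholder for an array-like with one value per data point.
>>> import numpy as np
>>> import cobalt
>>> # scores: one value per data point (placeholder)
>>> filt = cobalt.FilterSpec(
...     f_vals=np.asarray(scores),
...     n_bins=10,
...     bin_method="rng",
...     pruning_method="bin",
...     pruning_threshold=1,
... )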
- class cobalt.NeighborParams(M: int | None = None, deduplicate: bool = False, strict_partition: numpy.ndarray | None = None, backend: Literal['nndescent', 'exact'] = 'nndescent', seed: int | None = None, max_dist: float = inf, K: int | None = None, min_nbrs: int | None = None, normalize_method: Literal['none', 'kth_neighbor', 'neighborhood_weight'] = 'neighborhood_weight', normalize_target: Literal['log', 'sqrt'] | float = 'log', normalize_kth_neighbor_idx: int | None = None, affinity: Literal['slpi', 'exponential', 'expinv', 'gaussian'] = 'slpi')
Bases:
object
- K: int | None = None
The number of mutual nearest neighbors to keep for each data point.
- M: int | None = None
The number of nearest neighbors to compute for each data point.
- affinity: Literal['slpi', 'exponential', 'expinv', 'gaussian'] = 'slpi'
The function used to convert normalized distances to edge weights.
- backend: Literal['nndescent', 'exact'] = 'nndescent'
Method to use to compute nearest neighbors.
The default “nndescent” is an efficient approximate algorithm. In some situations “exact” may provide significantly higher-quality results, at the cost of substantially more computation on large datasets.
- deduplicate: bool = False
Whether to deduplicate the data points before computing nearest neighbors.
- max_dist: float = inf
The maximum raw distance between data points for which an edge will be included.
This is an exclusive bound: points at distance max_dist will not have an edge between them.
- min_nbrs: int | None = None
The minimum number of neighbors to keep for each data point.
- seed: int | None = None
Random seed to use for the “nndescent” backend.
Has a fixed default for reproducibility.
- strict_partition: ndarray | None = None
An array assigning a partition id to each data point.
The data will be split into these partitions before building the graph, and an independent graph will be built on each subset.
- class cobalt.ClusteringParams(allow_multiple_merges_per_node: bool = False, filter_levels_per_component: bool = False, num_threads: int = 1, max_height: int = 1000, max_cluster_growth_rate: float = 2.0, min_affinity_ratio: float = 0.8, min_n_clusters_ratio: float = 0.85)
Bases:
object
- allow_multiple_merges_per_node: bool = False
Whether to allow merging sets of more than two nodes together in a single clustering step.
The default setting is for backwards compatibility; we recommend setting this to True.
- filter_levels_per_component: bool = False
Whether to take into account the number of graph components when selecting the output levels.
After the initial clustering is done, levels are filtered out to ensure a certain rate of decrease in the number of nodes per level. When this setting is True, the filtering is done to ensure a certain rate of decrease in the number of nodes per component per level. This increases the quality of the levels for graphs with many small components.
The default setting is for backwards compatibility; we recommend setting this to True.
- max_height: int = 1000
Maximum number of steps to take while clustering the graph.
If the top level graph has too many nodes, you can try increasing this.
- num_threads: int = 1
Number of threads to use in the node merge step.
- class cobalt.CombinedMetric(metrics: List[str], block_bounds: Sequence[Sequence[int]], weights: Sequence[float] | None = None)
Bases:
Metric, DictConstructibleMixin
A linear combination of named metrics.
The distance between two vectors will be computed as a weighted sum of metrics applied to slices of the coordinates. For instance, a CombinedMetric might use the “euclidean” metric on coordinates 0 through 10, and the “cosine” metric on coordinates 10 through 20, adding these distances together to produce an aggregate distance.
A CombinedMetric expects vectors of a fixed dimension and will cause errors if used with data vectors of a different dimension.
A CombinedMetric will support sparse data if all its component metrics are implemented for sparse data.
- Parameters:
metrics – The names of the metric functions to use for each block.
block_bounds – A sequence of pairs of indices (or a 2-dimensional array) containing the start and end index of the coordinates used for each metric. For instance, if block_bounds[0] = [0, 10], metrics[0] will be applied to the slice 0:10 of each data vector. Note that this means blocks can overlap.
weights – A sequence of weights used to scale the distances from each metric. If none is provided, will use a weight of 1 for every block.
The effective distance between two vectors x and y is equal to:
sum( weights[i] * metrics[i]( x[block_bounds[i][0]:block_bounds[i][1]], y[block_bounds[i][0]:block_bounds[i][1]] ) )
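Example
A hedged sketch: euclidean distance on coordinates 0 through 10 and cosine distance on coordinates 10 through 20, with the cosine block weighted twice as heavily.
>>> import cobalt
>>> metric = cobalt.CombinedMetric(
...     metrics=["euclidean", "cosine"],
...     block_bounds=[[0, 10], [10, 20]],
...     weights=[1.0, 2.0],
... )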
- class cobalt.CustomMetric(dist_fn: Callable[[np.ndarray, np.ndarray], float], sparse_dist_fn: Callable[[np.ndarray, np.ndarray, np.ndarray, np.ndarray], float] | None = None, name: str | None = None)
Bases:
Metric, MsgpackSerializableMixin
A custom metric defined by a user-provided function.
An implementation of the metric for sparse arrays may optionally be provided.
- Parameters:
dist_fn – A Numba-compiled function with signature float32(float32[:], float32[:]).
sparse_dist_fn – A Numba-compiled function with signature float32(int32[:], float32[:], int32[:], float32[:], int). The int32 array parameters are indices for the sparse entries; the float32 array parameters are values for the sparse entries. The final parameter is the dimension of the vector.
name – An optional name that will be saved with graphs generated using this metric.
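Example
A minimal sketch of a Numba-compiled L1 (Manhattan) distance function; assumes numba is installed.
>>> import cobalt
>>> from numba import njit
>>> @njit("float32(float32[:], float32[:])")
... def manhattan(x, y):
...     # sum of absolute coordinate differences
...     total = 0.0
...     for i in range(x.shape[0]):
...         total += abs(x[i] - y[i])
...     return total
>>> metric = cobalt.CustomMetric(manhattan, name="manhattan")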
- class cobalt.settings
Bases:
object
Settings that affect global behavior.
- graph_decay_node_repulsion: bool = True
Whether to decay repulsive forces between nodes beyond a certain distance.
Note that to be applied, this setting must be changed before the graph is created.
- graph_layout_singletons_separately: bool = False
Whether to lay out singleton nodes in the graph separately from all other components.
Note that to be applied, this setting must be changed before the graph is created.
- graph_prevent_node_overlaps: bool = True
Whether to prevent nodes in the graph from overlapping.
This tends to produce more readable graphs, but the layout may be less responsive.
Note that to be applied, this setting must be changed before the graph is created.
- graph_use_rich_node_labels: bool = False
Default node hover label format for graphs.
Setting this to True will allow for the use of larger, more expressive node labels.
Note that to be applied, this setting must be changed before the graph is created.
- classmethod register_colormap(colormap: str | matplotlib.colors.Colormap, name: str | None = None, category: Literal['numerical', 'categorical'] = 'numerical', n_categories: int | None = None)
Register a colormap to be available in the Cobalt UI.
This function allows you to add matplotlib colormaps (either built-in or custom) to the Cobalt coloring options. You can pass either:
- A string name of a matplotlib built-in colormap (e.g., “rainbow”, “coolwarm”)
- A matplotlib colormap object (e.g., from LinearSegmentedColormap or ListedColormap)
- Parameters:
colormap – Either a string name of a matplotlib colormap, or a matplotlib colormap object (Colormap instance from matplotlib.colors).
name – The name to use for the colormap in the UI. Required if colormap is an object. If colormap is a string, this parameter is ignored and the string is used as the name.
category – Either “numerical” or “categorical” to specify which type of data the colormap is designed for. Defaults to “numerical”.
n_categories – For categorical colormaps, the number of distinct categories the colormap supports. If not provided, will attempt to infer from the colormap’s .N property (for colormap objects) or default to 10.
Examples
>>> from cobalt import settings
>>> from matplotlib.colors import LinearSegmentedColormap, ListedColormap
>>>
>>> # Register a built-in matplotlib colormap
>>> settings.register_colormap("rainbow", category="numerical")
>>> settings.register_colormap("coolwarm", category="numerical")
>>>
>>> # Register a custom gradient colormap
>>> custom_gradient = LinearSegmentedColormap.from_list(
...     "black-to-red",
...     colors=[(0, 0, 0), (1, 0, 0)],
...     N=256,
... )
>>> settings.register_colormap(
...     custom_gradient, name="black-to-red", category="numerical"
... )
>>>
>>> # Register a custom categorical colormap
>>> custom_categorical = ListedColormap(["#FF6B6B", "#4ECDC4", "#45B7D1"])
>>> settings.register_colormap(
...     custom_categorical,
...     name="custom-categorical",
...     category="categorical",
...     n_categories=3,
... )
>>> # Registered colormaps will be available in any Workspace UI created after this.
- table_max_base64_total_size: int = 20000000
The maximum amount of image data to base64 encode in the table data payload.
- cobalt.check_license()
Check the configured license key and print the result.
- cobalt.setup_api_client()
Set up the API client by updating or adding the API key to the JSON config file.
- cobalt.setup_license()
Prompts for a license key and sets it in the configuration file.
The license key will be saved in ~/.config/cobalt/cobalt.json.
- cobalt.register_license(force: bool = False)
Registers this installation of Cobalt for noncommercial or trial usage.
Requests your name and email address and configures a license key. If you have already registered Cobalt on a different computer, this will link your computer with the previous registration.